<a href="https://colab.research.google.com/github/khiempmk/Devc/blob/master/Week5_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)
======

## Data Set

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.
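
As a minimal sketch of the labeling rule described above (the helper name `rating_to_sentiment` is hypothetical and not part of the dataset or the competition code), the mapping from a 1-10 rating to a binary label could look like this:

```python
from typing import Optional


def rating_to_sentiment(rating: int) -> Optional[int]:
    """Map a 1-10 IMDB rating to a binary sentiment label (hypothetical helper)."""
    if rating < 5:
        return 0  # negative review
    if rating >= 7:
        return 1  # positive review
    return None  # ratings 5 and 6 fall outside both classes


# Show the label assigned to every possible rating
labels = {rating: rating_to_sentiment(rating) for rating in range(1, 11)}
print(labels)
```

Ratings of 5 and 6 match neither condition, which suggests that neutral reviews are simply absent from the labeled data.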

## File descriptions

labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review.

## Data fields

* id - Unique ID of each review
* sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
* review - Text of the review
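
To make the file layout concrete, here is a small sketch that parses a two-row, tab-delimited sample with the same three columns. The rows and ids are made up for illustration and do not come from the real file.

```python
import io

import pandas as pd

# Made-up sample in the same layout as labeledTrainData.tsv: header row, then id / sentiment / review
sample_tsv = (
    'id\tsentiment\treview\n'
    '"1_8"\t1\t"A fine film."\n'
    '"2_3"\t0\t"Not for me."\n'
)

sample = pd.read_csv(io.StringIO(sample_tsv), sep='\t')
print(sample.dtypes)
print(sample['sentiment'].value_counts())
```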

## Objective

The objective is to predict **sentiment** (positive or negative) from the **review** text, so X is the **review** column and y is the **sentiment** column.

## 1. Load Dataset

Let's first have a look at the data. You can download the file `labeledTrainData.tsv` from the [Kaggle website of the competition](https://www.kaggle.com/c/word2vec-nlp-tutorial/data), or from our [Google Drive](https://drive.google.com/file/d/1a1Lyn7ihikk3klAX26fgO3YsGdWHWoK5/view?usp=sharing).


In [97]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [98]:
# Read the tab-separated dataset from Google Drive
from google.colab import drive
drive.mount('/content/drive')
data_path = '/content/drive/My Drive/labeledTrainData.tsv'
# Pass the path (not an already-open handle) so that encoding="latin-1" is actually applied
data = pd.read_csv(data_path, sep='\t', encoding="latin-1")
data.head(10)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...
5,8196_8,1,I dont know why people think this is such a ba...
6,7166_2,0,"This movie could have been very good, but come..."
7,10633_1,0,I watched this video at a friend's house. I'm ...
8,319_1,0,"A friend of mine bought this film for £1, and ..."
9,8713_10,1,<br /><br />This movie is full of references. ...


## 2. Preprocessing

In [94]:
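# Replace HTML line breaks, escaped quotes, and punctuation with spaces
remove_tag = ['<br />', "\\\"", '(', ')', "\\", ".", ",", "--", "-"]
for word in remove_tag:
  # regex=False treats each token literally; otherwise '(' or '.' may be parsed as regex syntax
  data['review'] = data['review'].str.replace(word, ' ', regex=False)
data.head(10)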

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,The Classic War of the Worlds by Timothy Hi...
2,7759_3,0,The film starts with a manager Nicholas Bell ...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...
5,8196_8,1,I dont know why people think this is such a ba...
6,7166_2,0,This movie could have been very good but come...
7,10633_1,0,I watched this video at a friend's house I'm ...
8,319_1,0,A friend of mine bought this film for £1 and ...
9,8713_10,1,This movie is full of references Like Mad ...
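
To see what the replacement loop does to a single review, the same token list can be applied to a plain Python string. This is only a sketch: the sample sentence is made up, and `strip_tokens` is a hypothetical helper that mirrors the loop above without pandas.

```python
# Same tokens as in the preprocessing cell
remove_tag = ['<br />', '\\"', '(', ')', '\\', '.', ',', '--', '-']


def strip_tokens(text: str) -> str:
    """Replace every token in remove_tag with a single space (literal matching)."""
    for token in remove_tag:
        text = text.replace(token, ' ')
    return text


sample = '<br /><br />A so-so film (1999), really.'  # made-up review text
cleaned = strip_tokens(sample)
print(repr(cleaned))

# Collapse the runs of spaces left behind by the replacements
normalized = ' '.join(cleaned.split())
print(normalized)
```

Each removed token leaves a space behind, so the cleaned text contains runs of whitespace. That is harmless for `CountVectorizer`, whose default tokenizer only picks up word characters.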


In [99]:
X = data['review']
y = data['sentiment']

In [100]:
# stop words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
{"didn't", 'of', 'than', 'or', 'her', 'to', 'no', 'don', 'this', "wasn't", 's', 'just', 'they', "mightn't", 'had', 'is', 'him', 'himself', 'doing', "aren't", 'am', 'an', 'all', 'them', 'before', 'any', 'should', "couldn't", 'while', 'your', 'here', 'will', 'ain', 're', 'did', 'off', 'been', 'between', 'for', 'll', 't', 'further', 'she', 'are', 'same', 'against', 'there', 've', 'ourselves', "isn't", 'how', 'nor', 'itself', 'too', 'through', 'can', 'hers', 'was', 'shan', 'herself', 'it', "it's", 'not', "mustn't", 'above', "weren't", 'we', 'o', 'aren', 'where', 'more', 'm', 'mightn', 'their', 'myself', 'each', 'hadn', 'as', 'his', "she's", 'into', 'wasn', "that'll", 'until', 'yourself', 'do', 'about', 'below', 'didn', 'haven', 'after', 'during', 'both', 'the', "hadn't", 'd', 'with', 'weren', 'theirs', 'now', 'needn', 'why', 'on', 'under', "you've", 'have', 'who', "won't",
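
The effect of stop-word removal is easy to check on one sentence. The sketch below uses a small hand-picked subset of the words printed above instead of downloading the NLTK list, and the helper `remove_stop_words` is hypothetical.

```python
# A few entries taken from the NLTK English stop-word set printed above
stop_subset = {'this', 'is', 'not', 'the', 'of', 'it', 'was'}


def remove_stop_words(text: str) -> list:
    """Lowercase, split on whitespace, and drop tokens found in stop_subset."""
    return [token for token in text.lower().split() if token not in stop_subset]


kept = remove_stop_words("This is not the best of films")
print(kept)  # ['best', 'films']
```

Note that `not` is itself a stop word, so the negation disappears here. For sentiment analysis that may be worth keeping in mind when judging whether to filter stop words at all.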

## 3. Create Model and Train 


In [101]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.metrics import log_loss

In [102]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=102)
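
With `test_size=0.2`, one fifth of the 25,000 labeled reviews (5,000) is held out for evaluation. The sketch below shows the same call on ten made-up items. The `stratify` argument is an addition that the notebook does not use; it keeps the class ratio identical in both parts.

```python
from sklearn.model_selection import train_test_split

# Ten made-up reviews with alternating labels
X_demo = [f"review {i}" for i in range(10)]
y_demo = [i % 2 for i in range(10)]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=102, stratify=y_demo
)
print(len(X_tr), len(X_te))

# Fixing random_state makes the split reproducible across runs
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=102, stratify=y_demo
)
print(X_te == X_te2)
```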

In [103]:
# stop_words is documented as a list; newer scikit-learn releases may reject a set
module_count_vector = CountVectorizer(stop_words=list(stop_words))
text_clfs = Pipeline([('vect', module_count_vector),  # data preprocessing: token counts
                      ('tfidf', TfidfTransformer()),   # data preprocessing: TF-IDF weighting
                      ('clf', LogisticRegression()),   # model
                      ])
text_clfs.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words={'a', 'about', 'above', 'after',
                                             'again', 'against', 'ain', 'all',
                                             'am', 'an', 'and', 'any', 'are',...
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 LogisticRegression(C=1.0, class_weigh
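
The first two pipeline steps turn raw text into TF-IDF features. As a small sketch on a made-up three-document corpus, the two-step `CountVectorizer` + `TfidfTransformer` combination can be compared with `TfidfVectorizer`, which is imported at the top of the notebook but not used:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = ["good movie", "bad movie", "good good plot"]  # made-up corpus

# Two steps, as in the pipeline above: raw counts, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer combines both
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, same)
```

With default settings both routes give the same matrix, so the pipeline could likely be shortened to a single `TfidfVectorizer` step followed by the classifier.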

## 4. Evaluate Model

In [104]:
predictions = text_clfs.predict(X_test)
print("Accuracy score: %f" % accuracy_score(y_test, predictions))

Accuracy score: 0.886600


In [105]:
print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
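# Caution: predictions holds hard 0/1 labels and log_loss already averages over samples,
# so the value printed here is not a conventional log loss.
# log_loss(y_test, text_clfs.predict_proba(X_test)) would give the usual metric.
print('Log loss:', log_loss(y_test, predictions)/len(y_test))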

Confusion Matrix:
[[2156  347]
 [ 220 2277]]
              precision    recall  f1-score   support

           0       0.91      0.86      0.88      2503
           1       0.87      0.91      0.89      2497

    accuracy                           0.89      5000
   macro avg       0.89      0.89      0.89      5000
weighted avg       0.89      0.89      0.89      5000

Log loss: 0.0007833505470489056
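
The figures above can be cross-checked by hand from the confusion matrix. The sketch below recomputes accuracy, precision, and recall for the positive class, and also approximately reproduces the printed "Log loss" value. The clipping constant `eps = 1e-15` is an assumption about the scikit-learn version that produced the output.

```python
import math

# Confusion matrix copied from the output above: rows = true class, columns = predicted class
tn, fp = 2156, 347
fn, tp = 220, 2277
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
print(round(accuracy, 4), round(precision_pos, 2), round(recall_pos, 2))

# With hard 0/1 predictions, every correct prediction costs about 0 and every error
# costs about -log(eps) after clipping, so the mean log loss is error_rate * -log(eps).
eps = 1e-15  # assumed clipping constant
errors = fp + fn
mean_log_loss = (errors / total) * -math.log(eps)

# The notebook divides by len(y_test) once more, which gives the printed value
reported = mean_log_loss / total
print(reported)
```

The printed value is therefore just the error rate rescaled by a constant and divided by the test-set size a second time. It carries no information beyond accuracy; a log loss computed from `predict_proba` would additionally reflect how confident the model is.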


## 5. Export Model 

In [109]:
# Using pickle to export our trained model
import pickle
filename = '/content/drive/My Drive/finalized_model.sav'
# The context manager closes the file, so the model is fully written to Drive
with open(filename, 'wb') as model_file:
    pickle.dump(text_clfs, model_file)
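
To use the exported model later, it can be loaded back with `pickle.load`. The sketch below shows the round trip on a tiny made-up pipeline written to a temporary directory, since the real model file lives on Google Drive. The names `toy_clf` and `toy_model.sav` are placeholders.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up training set, only to have a fitted pipeline to save
toy_reviews = ["great film", "loved it", "terrible film", "hated it"]
toy_labels = [1, 1, 0, 0]

toy_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', LogisticRegression())])
toy_clf.fit(toy_reviews, toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.sav")
    with open(path, "wb") as model_file:
        pickle.dump(toy_clf, model_file)
    with open(path, "rb") as model_file:
        restored = pickle.load(model_file)

# The restored pipeline vectorizes and classifies raw text exactly like the original
same_predictions = list(restored.predict(toy_reviews)) == list(toy_clf.predict(toy_reviews))
print(same_predictions)
```

Because the whole pipeline is pickled, the restored object accepts raw review strings directly. Pickle files should only be loaded from trusted sources, and preferably with the same scikit-learn version that was used to save them.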