
# Movie Review Sentiment Classification with Logistic Regression



---
## 1. Prepare the data


In [101]:
from google.colab import drive
drive.mount('/ggdrive', force_remount= True)

Mounted at /ggdrive


In [0]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

In [0]:
movie = pd.read_csv('/ggdrive/My Drive/FTMLE - Tonga/Data/movie_review.csv', sep='\t')

In [104]:
movie.head()

Unnamed: 0,id,review,sentiment
0,5814_8,With all this stuff going down at the moment w...,1
1,2381_9,"\The Classic War of the Worlds\"" by Timothy Hi...",1
2,7759_3,The film starts with a manager (Nicholas Bell)...,0
3,3630_4,It must be assumed that those who praised this...,0
4,9495_8,Superbly trashy and wondrously unpretentious 8...,1


In [105]:
movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22500 entries, 0 to 22499
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         22500 non-null  object
 1   review     22500 non-null  object
 2   sentiment  22500 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 527.5+ KB


In [106]:
sentiment = movie[['review', 'sentiment']]
sentiment

Unnamed: 0,review,sentiment
0,With all this stuff going down at the moment w...,1
1,"\The Classic War of the Worlds\"" by Timothy Hi...",1
2,The film starts with a manager (Nicholas Bell)...,0
3,It must be assumed that those who praised this...,0
4,Superbly trashy and wondrously unpretentious 8...,1
...,...,...
22495,It seems like more consideration has gone into...,0
22496,I don't believe they made this film. Completel...,0
22497,"Guy is a loser. Can't get girls, needs to buil...",0
22498,This 30 minute documentary Buñuel made in the...,0


---
## 2. Clean our data



In [107]:
sentiment.nunique()

review       22425
sentiment        2
dtype: int64

In [108]:
# Reassign instead of using inplace=True: `sentiment` is a slice of `movie`,
# and modifying it in place triggers pandas' SettingWithCopyWarning.
sentiment = sentiment.drop_duplicates()


In [109]:
sentiment.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22425 entries, 0 to 22499
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     22425 non-null  object
 1   sentiment  22425 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 525.6+ KB


In [110]:
sentiment[sentiment['review'].isnull()]

Unnamed: 0,review,sentiment


In [111]:
sentiment[sentiment['sentiment'].isnull()]

Unnamed: 0,review,sentiment


---
#### Stopwords


Stop words are extremely common words that add little value to the analysis, so they are often excluded from the vocabulary entirely. Common examples are determiners such as *the*, *a*, *an*, and *another*, but the right list of stop words (the stop list) depends on the problem you are working on.

In [0]:
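import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # the stop word corpus must be downloaded once per environment
stop = stopwords.words('english')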
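
To make the effect concrete, here is a minimal sketch of stop-word filtering. It uses a tiny hand-written stop list (`MINI_STOP_WORDS`, an assumption for illustration, not the NLTK list loaded above), so it runs without downloading any corpus. The helper name `remove_stop_words` is also hypothetical.

```python
# Hypothetical mini stop list for illustration only; the notebook itself
# uses NLTK's much longer English list.
MINI_STOP_WORDS = {"the", "a", "an", "another", "is", "of", "and"}

def remove_stop_words(text, stop_words=MINI_STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in stop_words]

tokens = remove_stop_words("The plot is a mess and the acting is another disaster")
print(tokens)
```

Only the content-bearing words survive, which is the kind of vocabulary the classifier is meant to learn from.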

---
#### Preprocessor


Preprocess the text: remove HTML markup, expand contractions, convert to lowercase, strip non-word characters, and move any emoticons to the end of the text.

In [121]:
import re
import contractions

def preprocessor(text):
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)

    # Save emoticons so they can be appended after cleaning
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)+|\(|D|P)', text)

    # Expand contractions (e.g. "don't" -> "do not")
    text = contractions.fix(text)

    # Lowercase, replace runs of non-word characters with a space, then
    # append the emoticons with the nose character '-' removed for standardization
    text = re.sub(r'\W+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', '')

    return text

print(preprocessor('''We are \\almost read:))y! There is ano//ther^^ trick we ca\\n use to reduce o'''))

we are almost read y there is ano ther trick we ca n use to reduce o :))
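
The emoticon pattern is the least obvious part of `preprocessor`. The sketch below applies the same regular expression to a made-up sentence to show what it captures and how the nose character is dropped. The sample text is an assumption, not taken from the data set.

```python
import re

# Same pattern as in preprocessor(): eyes (: ; =), an optional nose (-),
# then a mouth: one or more ')' or a single '(', 'D' or 'P'.
EMOTICON_PATTERN = re.compile(r'(?::|;|=)(?:-)?(?:\)+|\(|D|P)')

sample = "Great movie :-) loved it ;D but the ending =( meh :P"  # made-up text
found = EMOTICON_PATTERN.findall(sample)

# Drop the nose so that ':-)' and ':)' end up as the same token.
normalized = [emoticon.replace('-', '') for emoticon in found]

print(found)
print(normalized)
```

Keeping emoticons as tokens lets them act as sentiment features even though all other punctuation is stripped.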


---
#### Stemmer


Use PorterStemmer to reduce every word in the text to its stem.

In [0]:
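from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Alternative tokenizer: tokenize, lemmatize, then stem.
# It is not used by the vectorizer below, and it needs the NLTK tokenizer
# and WordNet data to be downloaded first.
def tokenizer(text):
    words = word_tokenize(text)
    s_words = pd.Series(words)
    s_words = lemmatize(s_words)
    s_words = stemming_porter(s_words)
    return s_words.tolist()

def lemmatize(s_words):
    wordnet_lemmatizer = WordNetLemmatizer()
    return s_words.apply(lambda w: wordnet_lemmatizer.lemmatize(w))

# stemming_porter() was called but not defined in the original cell;
# this definition is an assumed reconstruction.
def stemming_porter(s_words):
    porter = PorterStemmer()
    return s_words.apply(lambda w: porter.stem(w))

# Tokenizer used by the vectorizer: split on whitespace and stem each token.
def tokenizer_porter(text):
    porter = PorterStemmer()
    return [porter.stem(word) for word in text.split()]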



---


#### Use a vectorizer to combine the preprocessing steps with TF-IDF vectorization

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words = stop, tokenizer = tokenizer_porter,
                        preprocessor = preprocessor)
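
As a rough illustration of what the vectorizer computes, the sketch below builds TF-IDF weights by hand on a toy corpus. It assumes scikit-learn's documented defaults (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalization) and skips the preprocessor, stop words, and stemming. The function name `tfidf_vectors` and the corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        # Term frequency (raw count) times inverse document frequency.
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["good movie", "bad movie", "good good plot"]  # toy corpus
vectors = tfidf_vectors(docs)
print(vectors[1])
```

In the second document, "bad" receives a larger weight than "movie" because it appears in fewer documents, which is the property that makes TF-IDF useful for separating reviews.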

---
#### Use train_test_split to split the data into training and test sets

In [0]:
from sklearn.model_selection import train_test_split

X = sentiment.review
y = sentiment.sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1234)



---


#### Create Pipeline




Use a Pipeline to apply the preprocessing and vectorization steps to the data and then pass the result to the model.

In [149]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

clf = Pipeline([('vect', tfidf),
                ('clf', LogisticRegression(random_state = 0, max_iter = 4000))])
clf.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=<function preprocessor at 0x7f04164a3048>,
                                 smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', '...
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_porter at 0x7f04164a3378>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
         

Predict on the test set and evaluate the model.

In [0]:
test_pred = clf.predict(X_test)

In [153]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print(accuracy_score(y_test, test_pred))
print(classification_report(y_test, test_pred))
print(confusion_matrix(y_test, test_pred))

0.8818283166109253
              precision    recall  f1-score   support

           0       0.89      0.86      0.88      2185
           1       0.87      0.90      0.89      2300

    accuracy                           0.88      4485
   macro avg       0.88      0.88      0.88      4485
weighted avg       0.88      0.88      0.88      4485

[[1884  301]
 [ 229 2071]]
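
The figures in the classification report can be recomputed from the confusion matrix alone. The sketch below does this with the numbers printed above, assuming scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix copied from the output above: rows are true labels,
# columns are predicted labels, in the order [negative (0), positive (1)].
cm = [[1884, 301],
      [229, 2071]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)  # of reviews predicted positive, the share that are positive
recall_pos = tp / (tp + fn)     # of truly positive reviews, the share that were found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(round(accuracy, 4))
print(round(precision_neg, 2), round(recall_neg, 2))
print(round(precision_pos, 2), round(recall_pos, 2))
```

The rounded values agree with the report: the model misses more negative reviews (recall 0.86) than positive ones (recall 0.90).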


#### Load the evaluation data

In [151]:
test_data = pd.read_csv('/ggdrive/My Drive/FTMLE - Tonga/Data/movie_review_evaluation.csv', sep = '\t')
test_data.head()

Unnamed: 0,id,review
0,10633_1,I watched this video at a friend's house. I'm ...
1,4489_1,`The Matrix' was an exciting summer blockbuste...
2,3304_10,This movie is one among the very few Indian mo...
3,3350_3,The script for this movie was probably found i...
4,1119_1,Even if this film was allegedly a joke in resp...


In [154]:
X_evaluate = test_data.review
X_evaluate.shape

(2500,)

In [158]:
evaluate_pred = clf.predict(X_evaluate)
evaluate_pred.shape, evaluate_pred

((2500,), array([0, 0, 1, ..., 1, 0, 1]))

In [0]:
df_evaluate_pred = pd.DataFrame(evaluate_pred)

In [190]:
df_evaluate_pred.columns = ['clf']
df_evaluate_pred

Unnamed: 0,clf
0,0
1,0
2,1
3,0
4,0
...,...
2495,1
2496,0
2497,1
2498,0


In [0]:
# import pickle
# import os

# pickle.dump(clf, open(os.path.join('NLP_movie_review.pkl'), 'wb'))

In [0]:
# Use cross_val_score to test model
# from sklearn.model_selection import cross_val_score

# scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
# print(scores)

---
#### Try a list of C values to find the highest accuracy score

In [0]:
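from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# The results shown below as `score` and `score2` appear to come from earlier
# runs of this cell with other C_list values; only `score3` is produced here.
C_list = np.arange(0.2682695795279725, 5.79474679231202, 5)  # two values: 0.268... and 5.268...
score3 = []
for c in C_list:
    # Use a separate name so the fitted `clf` pipeline is not overwritten
    model = Pipeline([('vect', tfidf),
                      ('clf', LogisticRegression(C=c, max_iter=4000))])
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    score3.append((c, accuracy_score(y_test, pred)))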

In [0]:
score

[(0.0001, 0.5018952062430323),
 (0.0013894954943731374, 0.7645484949832776),
 (0.019306977288832496, 0.822742474916388),
 (0.2682695795279725, 0.881159420289855),
 (3.727593720314938, 0.8987736900780379),
 (51.79474679231202, 0.8847268673355629),
 (719.6856730011514, 0.8735785953177257),
 (10000.0, 0.8693422519509476)]

In [0]:
score2

[(0.2682695795279725, 0.881159420289855),
 (10.268269579527972, 0.8958751393534002),
 (20.268269579527974, 0.8907469342251951),
 (30.268269579527974, 0.8876254180602007),
 (40.268269579527974, 0.8847268673355629),
 (50.268269579527974, 0.8845039018952062)]

In [0]:
score3

[(0.2682695795279725, 0.881159420289855),
 (5.268269579527972, 0.8972129319955406)]
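
The C values in `score` appear to be log-spaced between 1e-4 and 1e4, which is a common way to scan a regularization strength over several orders of magnitude. The sketch below rebuilds such a grid in plain Python and picks the best entry from the recorded results. The grid construction is an assumption about how the list was produced; the `(C, accuracy)` pairs are copied from the output above.

```python
# Eight log-spaced candidates from 1e-4 to 1e4 (assumed construction).
exponents = [-4 + 8 * k / 7 for k in range(8)]
c_grid = [10 ** e for e in exponents]

# (C, accuracy) pairs copied from the `score` output above.
score = [
    (0.0001, 0.5018952062430323),
    (0.0013894954943731374, 0.7645484949832776),
    (0.019306977288832496, 0.822742474916388),
    (0.2682695795279725, 0.881159420289855),
    (3.727593720314938, 0.8987736900780379),
    (51.79474679231202, 0.8847268673355629),
    (719.6856730011514, 0.8735785953177257),
    (10000.0, 0.8693422519509476),
]

# Pick the pair with the highest accuracy.
best_c, best_acc = max(score, key=lambda pair: pair[1])
print(best_c, best_acc)
```

Accuracy rises and then falls as C grows, so after a coarse log-spaced pass it is reasonable to search more finely around the peak, as `score2` and `score3` do.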

---
#### Apply the best C value to the model

In [160]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

clf2 = Pipeline([('vect', tfidf),
                    ('clf', LogisticRegression(C= 5.268269579527972, max_iter=4000))])
clf2.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=<function preprocessor at 0x7f04164a3048>,
                                 smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', '...
                                 tokenizer=<function tokenizer_porter at 0x7f04164a3378>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 LogisticRegression(C=5.268269579527972, class_weight=None,
                                    dual=False, fit_intercept=True,
       

Predict with the new model and check its scores.

In [161]:
test_pred = clf2.predict(X_test)
print(accuracy_score(y_test, test_pred))
print(classification_report(y_test, test_pred))
print(confusion_matrix(y_test, test_pred))

0.8898550724637682
              precision    recall  f1-score   support

           0       0.90      0.88      0.89      2185
           1       0.88      0.90      0.89      2300

    accuracy                           0.89      4485
   macro avg       0.89      0.89      0.89      4485
weighted avg       0.89      0.89      0.89      4485

[[1912  273]
 [ 221 2079]]


Use model clf2 to predict on the evaluation data.

In [162]:
evaluate_pred2 = clf2.predict(X_evaluate)
evaluate_pred2.shape, evaluate_pred2

((2500,), array([0, 0, 1, ..., 1, 0, 1]))

Create model clf3 with another C value (the top-scoring C in `score`).

In [166]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

clf3 = Pipeline([('vect', tfidf),
                    ('clf', LogisticRegression(C= 3.727593720314938, max_iter=4000))])
clf3.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=<function preprocessor at 0x7f04164a3048>,
                                 smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', '...
                                 tokenizer=<function tokenizer_porter at 0x7f04164a3378>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 LogisticRegression(C=3.727593720314938, class_weight=None,
                                    dual=False, fit_intercept=True,
       

In [168]:
test_pred = clf3.predict(X_test)
print(accuracy_score(y_test, test_pred))
print(classification_report(y_test, test_pred))
print(confusion_matrix(y_test, test_pred))

0.8878483835005574
              precision    recall  f1-score   support

           0       0.89      0.88      0.88      2185
           1       0.88      0.90      0.89      2300

    accuracy                           0.89      4485
   macro avg       0.89      0.89      0.89      4485
weighted avg       0.89      0.89      0.89      4485

[[1913  272]
 [ 231 2069]]


Use model clf3 to predict on the evaluation data.

In [167]:
evaluate_pred3 = clf3.predict(X_evaluate)
evaluate_pred3.shape, evaluate_pred3

((2500,), array([0, 0, 1, ..., 1, 0, 1]))

Retrain model clf2 on all of the labeled data. X includes X_test, so the scores in the next cell are not held-out scores.

In [171]:
clf2.fit(X, y)

Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=<function preprocessor at 0x7f04164a3048>,
                                 smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', '...
                                 tokenizer=<function tokenizer_porter at 0x7f04164a3378>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 LogisticRegression(C=5.268269579527972, class_weight=None,
                                    dual=False, fit_intercept=True,
       

In [172]:
test_pred = clf2.predict(X_test)
print(accuracy_score(y_test, test_pred))
print(classification_report(y_test, test_pred))
print(confusion_matrix(y_test, test_pred))

0.966778149386845
              precision    recall  f1-score   support

           0       0.97      0.96      0.97      2185
           1       0.97      0.97      0.97      2300

    accuracy                           0.97      4485
   macro avg       0.97      0.97      0.97      4485
weighted avg       0.97      0.97      0.97      4485

[[2106   79]
 [  70 2230]]
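
Because `X` contains `X_test`, the 0.97 above is measured on reviews the model has already seen. The toy sketch below shows why that inflates accuracy. It uses a hypothetical memorizing classifier (`Memorizer`) and made-up reviews, not the notebook's pipeline or data.

```python
from collections import Counter

class Memorizer:
    """Toy classifier: recalls labels of seen texts, otherwise predicts the majority class."""

    def fit(self, texts, labels):
        self.lookup = dict(zip(texts, labels))
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        return [self.lookup.get(text, self.default) for text in texts]

def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)

# Made-up data: five training reviews (majority positive) and two test reviews.
train_texts = ["great film", "awful plot", "loved it", "boring mess", "superb cast"]
train_labels = [1, 0, 1, 0, 1]
test_texts = ["great acting", "terrible pacing"]
test_labels = [1, 0]

# Trained on the training split only: the test reviews are unseen.
held_out = accuracy(test_labels,
                    Memorizer().fit(train_texts, train_labels).predict(test_texts))

# Trained on everything, then scored on the same test reviews.
leaky = accuracy(test_labels,
                 Memorizer().fit(train_texts + test_texts,
                                 train_labels + test_labels).predict(test_texts))

print(held_out, leaky)
```

Retraining on all labeled data is reasonable before predicting on the unlabeled evaluation set, but the generalization estimate should still come from a score computed before the test rows were added, or from cross-validation.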


Use the retrained model clf2 to predict on the evaluation data.

In [174]:
evaluate_pred4 = clf2.predict(X_evaluate)
evaluate_pred4.shape, evaluate_pred4

((2500,), array([0, 0, 1, ..., 1, 0, 1]))

Compare the predictions of the different models (each column sum is the number of reviews predicted positive).

In [191]:
df_evaluate_pred['clf2'] = pd.Series(evaluate_pred2)
df_evaluate_pred['clf3'] = pd.Series(evaluate_pred3)
df_evaluate_pred['clf2 re-train'] = pd.Series(evaluate_pred4)
df_evaluate_pred.sum()

clf              1282
clf2             1268
clf3             1276
clf2 re-train    1270
dtype: int64
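
Column sums only show how many reviews each model labels positive; two models can have similar totals and still disagree on individual reviews. A minimal sketch of a per-review comparison follows, using made-up prediction lists and a hypothetical helper `disagreement_count`.

```python
def disagreement_count(pred_a, pred_b):
    """Number of positions where two prediction sequences differ."""
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction sequences must have the same length")
    return sum(a != b for a, b in zip(pred_a, pred_b))

# Made-up predictions: both lists contain three positives, yet they disagree twice.
preds_a = [1, 0, 1, 0, 1]
preds_b = [1, 1, 0, 0, 1]

print(sum(preds_a), sum(preds_b), disagreement_count(preds_a, preds_b))
```

Applied to the columns of `df_evaluate_pred`, the same idea would show how many evaluation reviews actually change label between models.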

Try retraining model clf on all of the labeled data.

In [176]:
clf.fit(X, y)
pred5 = clf.predict(X_test)
print(accuracy_score(y_test, pred5))
print(classification_report(y_test, pred5))
print(confusion_matrix(y_test, pred5))

0.9237458193979933
              precision    recall  f1-score   support

           0       0.93      0.91      0.92      2185
           1       0.92      0.93      0.93      2300

    accuracy                           0.92      4485
   macro avg       0.92      0.92      0.92      4485
weighted avg       0.92      0.92      0.92      4485

[[1998  187]
 [ 155 2145]]


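These scores are lower than those of the retrained clf2. Both models are scored here on data they were trained on, so neither result is a held-out estimate.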

---
Add the predictions to the evaluation data and export the result.


In [0]:
test_data['sentiment'] = df_evaluate_pred['clf2 re-train']
test_data.to_csv('AnhTuan.csv')

In [198]:
test_data.sample(10)

Unnamed: 0,id,review,sentiment
1902,1167_4,I really tried to like this movie but in the e...,0
1816,8727_2,Let me start out by saying I can enjoy just ab...,0
2473,4665_10,I first saw this film about 11 years ago when ...,1
1439,3287_1,"If this movie should be renamed, it should be ...",0
1793,10439_8,When I saw this movie in the theater when it c...,1
1806,3747_2,I love all his work but this looks like nothin...,0
2454,7065_2,OK well i found this movie in my dads old pile...,0
2210,3949_8,"Like many people on this site, I saw this movi...",1
1275,2525_2,"This movie was, as Homer Simpson would have pu...",0
628,4089_1,Warning: This review contains a spoiler.<br />...,0


In [0]:
import pickle

# Use a context manager so the file is closed after the model is written
with open('NLP_movie_review_90.pkl', 'wb') as f:
    pickle.dump(clf2, f)
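
For completeness, here is a small sketch of the save and load round trip. It pickles a stand-in dictionary (`model_stub`, an assumption for illustration) into a temporary directory so that it runs anywhere; in the notebook the object would be the fitted `clf2`.

```python
import os
import pickle
import tempfile

# Stand-in object for illustration; in the notebook this would be the fitted clf2.
model_stub = {"name": "clf2", "C": 5.268269579527972}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_stub.pkl")  # hypothetical file name

    # Write the object, closing the file when done.
    with open(path, "wb") as f:
        pickle.dump(model_stub, f)

    # Read it back into a new object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stub)
```

Pickle stores functions by reference, so loading the real pipeline requires `preprocessor` and `tokenizer_porter` to be defined in the loading session. Only unpickle files from sources you trust.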


