![](https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fchrisbrantner%2Ffiles%2F2019%2F02%2Fimdb-freedive-1200x549.jpg)

Sentiment analysis is a Natural Language Processing (NLP) technique that extracts the emotions expressed in raw text. It is commonly applied to social media posts and customer reviews to determine automatically whether users are positive or negative, and why. The goal of this study is to show how sentiment analysis can be performed using Python. Here are the main libraries we will use:

- NLTK: the best-known Python module for NLP
- Gensim: a topic-modelling and vector-space-modelling toolkit
- Scikit-learn: the most widely used Python machine learning library

We will use movie review data here. Each review is a customer's written feedback on their experience of a movie. The data can be found here:
https://drive.google.com/open?id=1vc-zzz1VmSCDCqEIoJiZjFbn-qXjX18e

For each review, we want to predict whether it is a good review (the customer is happy) or a bad one (the customer is not satisfied). To simplify the problem, we split the reviews into two sentiment categories:

- bad reviews have sentiment value 0
- good reviews have sentiment value 1

The challenge is to predict this label using only the raw text of the review.
Let's get started!


# Phase 1: Read the dataset

In [34]:
# General-purpose libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Libraries for cleaning the data
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# A set makes the "word not in stop_words" check fast
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer(language='english')
# Translation table that deletes every punctuation character
translator = str.maketrans('', '', string.punctuation)

# Libraries for preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Libraries for training models
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Libraries for evaluating models
from sklearn.metrics import accuracy_score, classification_report

# Library for saving models
import pickle

In [35]:
# Import the movie dataset
# It contains movie reviews from IMDB

movie_df = pd.read_csv('movie_review.csv', sep="\t")
movie_df.head()

| Unnamed: 0 | id | review | sentiment |
|---|---|---|---|
| 0 | 5814_8 | With all this stuff going down at the moment w... | 1 |
| 1 | 2381_9 | \The Classic War of the Worlds\" by Timothy Hi... | 1 |
| 2 | 7759_3 | The film starts with a manager (Nicholas Bell)... | 0 |
| 3 | 3630_4 | It must be assumed that those who praised this... | 0 |
| 4 | 9495_8 | Superbly trashy and wondrously unpretentious 8... | 1 |


In [36]:
# check balance of the dataset

movie_df['sentiment'].value_counts()

1    11278
0    11222
Name: sentiment, dtype: int64

### The numbers of positive and negative reviews are nearly balanced.
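
Because the two classes are nearly the same size, accuracy is a meaningful metric for this dataset. As a quick sanity check, the sketch below computes the accuracy a trivial classifier would reach by always predicting the majority class. The counts are copied from the `value_counts()` output above; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 11278, 0: 11222}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

print(f"total reviews: {total}")
print(f"majority class: {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Any model worth keeping should score well above this baseline of roughly 50%.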

### For this project, we use the review as the `Input` and the sentiment as the `Label`.

In [37]:
movie_df['review'][0]

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

### We can see that the review contains a lot of punctuation, stop-words and HTML tags such as `<br />`, so we should clean it first.

----
# Phase 2: Clean data

> **The cleaning phase has 2 steps:** 
- Remove punctuation and stop-words
- Stem the remaining words
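
The first step can be sketched with the standard library alone. The tiny stop-word set below is a made-up stand-in for NLTK's much longer English list, and stemming is left out because it needs NLTK, so treat this as an illustration of the idea rather than the cleaning function used in this notebook.

```python
import string

# Hypothetical, very small stop-word set (NLTK's English list is much longer).
TOY_STOP_WORDS = {"the", "is", "a", "of", "and", "this", "it"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punct_and_stop_words(text):
    """Lowercase, drop punctuation, then drop stop-words."""
    text = text.lower().translate(PUNCT_TABLE)
    return [w for w in text.split() if w not in TOY_STOP_WORDS]


tokens = remove_punct_and_stop_words("This movie is great, and the cast is superb!")
print(tokens)
```

Only the content-bearing words survive, which is what the vectorizers later in the notebook work with.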

In [38]:
# define cleaning function

def clean_review(review):
    # Lowercase, then replace HTML tags such as <br /> with a space
    review = review.lower()
    review = re.sub(r"<[^>]+>", " ", review)
    # Delete punctuation and split into tokens
    review = review.translate(translator)
    words = word_tokenize(review)

    # Drop stop-words and stem what remains
    clean_words = []
    for word in words:
        if word not in stop_words:
            clean_words.append(stemmer.stem(word))
    return ' '.join(clean_words)

In [39]:
# test cleaning function

clean_review(movie_df['review'][1])

'classic war world timothi hine entertain film obvious goe great effort length faith recreat h g well classic book mr hine succeed watch film appreci fact standard predict hollywood fare come everi year eg spielberg version tom cruis slightest resembl book obvious everyon look differ thing movi envis amateur critic look critic everyth other rate movi import baseslik entertain peopl never agre critic enjoy effort mr hine put faith hg well classic novel found entertain made easi overlook critic perceiv shortcom'
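
Reviews in this dataset contain HTML line breaks such as `<br />`, so the shape of the regular expression used to strip them matters. The sketch below compares two patterns on a made-up sentence: a character class, which deletes individual characters, and a tag pattern, which deletes whole tags. The sample string and variable names are assumptions for illustration.

```python
import re

sample = "Great plot.<br /><br />Weak ending."

# A character class matches ONE character from the set: '<', '.', '*' or '>'.
# The text inside each tag ("br /") is left behind.
char_class_result = re.sub(r"[<.*>]", "", sample)

# A tag pattern matches '<', then anything that is not '>', then '>'.
# Replacing with a space keeps the neighbouring words apart.
tag_result = re.sub(r"<[^>]+>", " ", sample)

print(char_class_result)
print(tag_result)
```

With the character class, the leftover `br` fragments would end up as tokens in the cleaned review; the tag pattern removes them entirely.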

In [40]:
# apply the cleaning function to movie_df['review']

movie_df.loc[:, 'review'] = movie_df['review'].apply(clean_review)

In [41]:
# save the cleaned dataset for later use

movie_df.to_csv('imdb_clean.csv')

----
# Phase 3: Training model

> **The training phase has 3 steps:** 
- Split the dataset into train and test sets using train_test_split.
- Vectorize the input of the train and test sets using TF-IDF.
- Train models using LogisticRegression() and RandomForestClassifier() on the train set and predict on the test set.
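
To make the second step concrete, here is a minimal pure-Python sketch of the weighting that scikit-learn's `TfidfVectorizer` applies with its default settings, as we understand them: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. The toy corpus and the helper name are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned reviews (hypothetical).
docs = ["good movi good cast", "bad movi", "good stori"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df_t)) + 1 for t, df_t in df.items()}


def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    tf = Counter(tokens)
    raw = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}


vec = tfidf_vector(tokenized[1])
print(vec)
```

In the second toy review, "bad" appears in one document while "movi" appears in two, so "bad" receives the larger weight. This is why rare, sentiment-bearing words tend to dominate the feature vectors.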

In [42]:
# Split the dataset into train and test sets.

X = movie_df['review']
y = movie_df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [43]:
# Vectorize the Input of train and test sets using TF-IDF

tf_idf = TfidfVectorizer()
tf_idf_X_train = tf_idf.fit_transform(X_train)
tf_idf_X_test = tf_idf.transform(X_test)

In [54]:
# Save the fitted TF-IDF vectorizer

with open('tf_idf', 'wb') as f:
    pickle.dump(tf_idf, f)

In [44]:
# Train model using LogisticRegression()

lr_model = LogisticRegression()
lr_model.fit(tf_idf_X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [45]:
# Call predict on test set and print classification_report

lr_prediction = lr_model.predict(tf_idf_X_test)
print(classification_report(y_test, lr_prediction))

             precision    recall  f1-score   support

          0       0.90      0.87      0.89      2199
          1       0.88      0.91      0.89      2301

avg / total       0.89      0.89      0.89      4500
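
For readers unfamiliar with the columns of this report, the sketch below computes precision, recall and F1 for one class from confusion-matrix counts. The counts are invented for the example and are not taken from the model above.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true-positive,
    false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the positive class.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The `support` column is simply the number of true samples of each class in the test set.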



### This is a good result. Now let's try RandomForest.

In [46]:
# Do the same with RandomForest

rf_model = RandomForestClassifier(50)
rf_model.fit(tf_idf_X_train, y_train)
rf_prediction = rf_model.predict(tf_idf_X_test)
print(classification_report(y_test, rf_prediction))

             precision    recall  f1-score   support

          0       0.81      0.86      0.83      2199
          1       0.86      0.80      0.83      2301

avg / total       0.83      0.83      0.83      4500



### RandomForest gets a lower f1-score than LogisticRegression.

### Now we combine the predictions of the two models to reduce the risk of overfitting.

In [47]:
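# Soft voting: average the positive-class probabilities of both models and threshold at 0.5
avg_proba = (rf_model.predict_proba(tf_idf_X_test)[:, 1] + lr_model.predict_proba(tf_idf_X_test)[:, 1]) / 2
all_prediction = (avg_proba >= 0.5).astype(int)
print(classification_report(y_test, all_prediction))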

             precision    recall  f1-score   support

          0       0.89      0.87      0.88      2199
          1       0.88      0.89      0.89      2301

avg / total       0.88      0.88      0.88      4500
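
Combining two models by soft voting means averaging their positive-class probabilities and predicting 1 when the average reaches 0.5. Flooring the sum of the two probabilities gives the same labels whenever that sum stays below 2. The sketch below checks this on made-up probabilities; the lists and names are assumptions for illustration.

```python
import math

# Hypothetical positive-class probabilities from two models.
p_model_a = [0.9, 0.2, 0.55, 0.4]
p_model_b = [0.8, 0.1, 0.35, 0.7]

# Soft voting: average the probabilities, then threshold at 0.5.
soft_vote = [int((a + b) / 2 >= 0.5) for a, b in zip(p_model_a, p_model_b)]

# Equivalent formulation: floor of the summed probabilities.
floored_sum = [math.floor(a + b) for a, b in zip(p_model_a, p_model_b)]

print(soft_vote)
print(floored_sum)
```

The explicit threshold is the safer form: if both models returned a probability of exactly 1.0, the floored sum would be 2, which is not a valid label.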



----
## Phase 3.1: Training the model using doc2vec to vectorize the reviews

_Instead of using TF-IDF_

![](https://miro.medium.com/max/368/1*keqyBCQ5FL6A7DZLrXamvQ.png)

In [48]:
# Import doc2vec from gensim library

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [49]:
# Build the doc2vec model

tagged_data = [TaggedDocument(words=word_tokenize(_d), tags=[str(i)]) for i, _d in enumerate(movie_df['review'])]

max_epochs = 50
vec_size = 300
alpha = 0.025

model = Doc2Vec(vector_size=vec_size,  # dimensionality of the document vectors
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)  # dm=1 selects the distributed-memory (PV-DM) variant

model.build_vocab(tagged_data)

In [50]:
# Train the doc2vec model and save it for later use

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    # each call makes model.epochs passes over the corpus
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs)
    # decrease the learning rate
    model.alpha -= 0.0002
    # pin min_alpha to alpha so the rate stays constant within the next call
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

iteration 0
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
iteration 9
iteration 10
iteration 11
iteration 12
iteration 13
iteration 14
iteration 15
iteration 16
iteration 17
iteration 18
iteration 19
iteration 20
iteration 21
iteration 22
iteration 23
iteration 24
iteration 25
iteration 26
iteration 27
iteration 28
iteration 29
iteration 30
iteration 31
iteration 32
iteration 33
iteration 34
iteration 35
iteration 36
iteration 37
iteration 38
iteration 39
iteration 40
iteration 41
iteration 42
iteration 43
iteration 44
iteration 45
iteration 46
iteration 47
iteration 48
iteration 49
Model Saved
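
The training loop lowers the learning rate by a fixed step after each outer iteration. A quick sketch of the resulting schedule, using the constants from the training cell (the variable names here are ours):

```python
start_alpha = 0.025   # initial learning rate
step = 0.0002         # amount subtracted after each outer iteration
max_epochs = 50       # number of outer iterations

alpha = start_alpha
schedule = []
for epoch in range(max_epochs):
    schedule.append(alpha)  # value of alpha at the start of this iteration
    alpha -= step

print(f"first iteration alpha: {schedule[0]:.4f}")
print(f"last iteration alpha:  {schedule[-1]:.4f}")
print(f"alpha after the loop:  {alpha:.4f}")
```

The learning rate therefore ends at about 0.015, well above the `min_alpha` of 0.00025 passed to the constructor.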


In [51]:
# Load the doc2vec model

doc2vec_model = Doc2Vec.load('d2v.model')

In [52]:
# Vectorize the training and test inputs

d2v_X_train = []
d2v_X_test = []

for review in X_train:
    d2v_X_train.append(doc2vec_model.infer_vector(word_tokenize(review)))

for review in X_test:
    d2v_X_test.append(doc2vec_model.infer_vector(word_tokenize(review)))

In [53]:
# Train model using LogisticRegression(), call predict and print classification_report

d2v_lr_model = LogisticRegression()
d2v_lr_model.fit(d2v_X_train, y_train)
d2v_lr_prediction = d2v_lr_model.predict(d2v_X_test)
print(classification_report(y_test, d2v_lr_prediction))

             precision    recall  f1-score   support

          0       0.83      0.83      0.83      2199
          1       0.83      0.84      0.83      2301

avg / total       0.83      0.83      0.83      4500



In [55]:
# Combine the 3 models by majority vote (a sum of 2 or 3 gives label 1)

final_prediction = (rf_model.predict(tf_idf_X_test) + lr_model.predict(tf_idf_X_test) + d2v_lr_model.predict(d2v_X_test))//2
print(classification_report(y_test, final_prediction))

             precision    recall  f1-score   support

          0       0.88      0.88      0.88      2199
          1       0.89      0.88      0.89      2301

avg / total       0.88      0.88      0.88      4500
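
With three binary predictions, integer-dividing the sum by 2 is a compact majority vote: the result is 1 exactly when at least two models predict 1. The sketch below verifies this over all eight combinations; the function name is ours.

```python
from itertools import product


def majority_vote(a, b, c):
    """Majority label of three binary predictions (each 0 or 1)."""
    return (a + b + c) // 2


# Check the shortcut against the explicit "at least two of three" rule.
table = {}
for a, b, c in product([0, 1], repeat=3):
    table[(a, b, c)] = majority_vote(a, b, c)
    assert table[(a, b, c)] == int(a + b + c >= 2)

print(table)
```

The same expression works element-wise on NumPy arrays of predictions, which is how it is used in the cell above.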



----
# Phase 4: Training the models on all samples and predicting labels for the unlabeled dataset


In [56]:
# create X_train_all and y_train_all for TF-IDF

tf_idf_final = TfidfVectorizer()
X_train_all = tf_idf_final.fit_transform(X)
y_train_all = y

# create X_train_all and y_train_all for doc2vec
d2v_X_train_all = np.r_[d2v_X_train, d2v_X_test]
d2v_y_train_all = np.r_[y_train, y_test]

In [57]:
# Save tf-idf final model

pickle.dump(tf_idf_final, open('tf_idf_final', 'wb'))

In [58]:
# Train using TF-IDF

lr_model_final = LogisticRegression().fit(X_train_all, y_train_all)
rf_model_final = RandomForestClassifier(50).fit(X_train_all, y_train_all)

# Train using doc2vec

d2v_lr_model_final = LogisticRegression().fit(d2v_X_train_all, d2v_y_train_all)

In [59]:
# Save the 3 models

pickle.dump(lr_model_final, open('lr_final', 'wb'))
pickle.dump(rf_model_final, open('rf_final', 'wb'))
pickle.dump(d2v_lr_model_final, open('d2v_lr_final', 'wb'))

In [60]:
# Load the 3 models

rf_model = pickle.load(open('rf_final', 'rb'))
lr_model = pickle.load(open('lr_final', 'rb'))
d2v_lr_model = pickle.load(open('d2v_lr_final', 'rb'))
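
A slightly more defensive way to save and load objects is to wrap the file in a `with` block so the handle is always closed. The sketch below round-trips a plain dictionary through a temporary file; the dictionary stands in for a fitted model, and the file name is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model (any picklable object works the same way).
fake_model = {"name": "lr_final", "coef": [0.1, -0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lr_final_demo.pkl")

    # The context manager closes the file even if dump() raises.
    with open(path, "wb") as f:
        pickle.dump(fake_model, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == fake_model)
```

Keep in mind that pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.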

In [61]:
# Load the vectorizer models

tf_idf_final = pickle.load(open('tf_idf_final', 'rb'))
doc2vec_model = Doc2Vec.load('d2v.model')

In [62]:
# Load the unlabeled dataset to predict

no_label_df = pd.read_csv('movie_review_noLabel.csv', sep='\t')
no_label_df.head()

| Unnamed: 0 | id | review |
|---|---|---|
| 0 | 10633_1 | I watched this video at a friend's house. I'm ... |
| 1 | 4489_1 | `The Matrix' was an exciting summer blockbuste... |
| 2 | 3304_10 | This movie is one among the very few Indian mo... |
| 3 | 3350_3 | The script for this movie was probably found i... |
| 4 | 1119_1 | Even if this film was allegedly a joke in resp... |


In [63]:
# Clean review (again)

X_nolabel_train = no_label_df['review'].apply(clean_review)

In [64]:
# Vectorize Input data using TF-IDF

X_nolabel_train_final = tf_idf_final.transform(X_nolabel_train)

In [65]:
# Vectorize Input data using doc2vec

d2v_X_nolabel_train_final = []
for review in X_nolabel_train:
    d2v_X_nolabel_train_final.append(doc2vec_model.infer_vector(word_tokenize(review)))

In [66]:
# predict labels using results from 3 models 

no_label_all_predict = (rf_model.predict(X_nolabel_train_final) + lr_model.predict(X_nolabel_train_final) + d2v_lr_model.predict(d2v_X_nolabel_train_final))//2
no_label_all_predict

array([0, 0, 1, ..., 1, 0, 1], dtype=int64)

In [69]:
# add predicted labels back to no-label dataset 

no_label_df['sentiment'] = no_label_all_predict
no_label_df.head()

| Unnamed: 0 | id | review | sentiment |
|---|---|---|---|
| 0 | 10633_1 | I watched this video at a friend's house. I'm ... | 0 |
| 1 | 4489_1 | `The Matrix' was an exciting summer blockbuste... | 0 |
| 2 | 3304_10 | This movie is one among the very few Indian mo... | 1 |
| 3 | 3350_3 | The script for this movie was probably found i... | 0 |
| 4 | 1119_1 | Even if this film was allegedly a joke in resp... | 0 |


In [68]:
# save the labeled dataset for later use

no_label_df.to_csv('label_submit.csv')