# Sentiment Analysis

Sentiment analysis is the interpretation and classification of emotions (positive, negative and neutral) within text data using text analysis techniques. Sentiment analysis allows businesses to identify customer sentiment toward products, brands or services in online conversations and feedback.

In this project, we work with IMDB movie reviews and build several machine learning models that classify a given review as positive or negative.

1. We begin by cleaning the reviews (removing stopwords, punctuation and digits).
2. We then use two different approaches to create the feature space.
3. In the first approach, each feature records the presence or absence of a word. We train two classification models, a support vector machine and a random forest classifier. For each model we calculate the F1 score, the test set accuracy and the 10 most influential words.
4. In the second approach, we use TF-IDF vectorization and also include bigrams in the analysis. We then build a naive Bayes and a random forest classifier, and again calculate the F1 score, the test set accuracy and the 10 most influential words for each model.
5. We also use cross-validation with grid search to tune the random forest model from the second approach.

In [1]:
#importing libraries
import pandas as pd
import numpy as np
import nltk, re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import defaultdict
from tqdm import tqdm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB,MultinomialNB
from sklearn.metrics import accuracy_score,f1_score
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from collections import namedtuple
import random

In [2]:
stop_words = set(stopwords.words("english"))

In [3]:
#loading train and test data. train.csv contains reviews and train_labels.csv contains the corresponding labels. 
train=pd.read_table("train.csv",sep="\n")
y_train=pd.read_csv("train_labels.csv")
y_test=pd.read_csv("test_labels.csv")
train_reviews=list(train["review"])
test=pd.read_table("test.csv",sep="\n")
test_reviews=list(test["review"])
all_reviews=train_reviews+test_reviews

In [4]:
#finding size of train and test data
print(len(train_reviews))
print(len(test_reviews))
print(y_train.shape)
print(y_test.shape)
print((train_reviews[0]))
print((test_reviews[0]))

4000
1000
(4000, 1)
(1000, 1)
For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan "The Skipper" Hale jr. as a police Sgt.
Based on an actual story, John Boorman shows the struggle of an American doctor, whose husband and son were murdered and she was continually plagued with her loss. A holiday to Burma with her sister seemed like a good idea to get away from it all, but when her passport was stolen in Rangoon, she could not leave the country with her sister, and was forced to stay back until she could get I.D. papers from the American embassy. To fill in a day before she could fly out, she took a trip into the countryside with a tour guide. "I tried finding something in those stone statues, but nothing stirred in me. I was stone myself." <br /><br />Suddenly all hell broke loose and she was caug

In [5]:
#the purpose of the clean_review function is to remove stopwords, punctuation,
#special characters and extra spaces
def clean_review(review):
    '''Clean the text and remove stopwords'''
    
    # Convert the text to lower case
    review = review.lower()
    # Replace HTML line breaks and every non-letter character with a space
    review = re.sub(r"<br />", " ", review)
    review = re.sub(r"[^a-z]", " ", review)
    review = re.sub(r" +", " ", review) # Collapse runs of spaces
    # Remove stopwords
    tokenized = word_tokenize(review)
    review = [w for w in tokenized if w not in stop_words]
    
    # Return the cleaned review as a single space-separated string
    return " ".join(review)
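
To see what the regular expressions in clean_review do to a single review, here is a minimal sketch that skips NLTK. The names `clean_review_demo` and `DEMO_STOP_WORDS` and the sample sentence are made up for illustration, and plain whitespace splitting stands in for `word_tokenize`, so results can differ slightly from the function above.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the notebook itself uses NLTK's English stopword list.
DEMO_STOP_WORDS = {"a", "is", "the", "this", "of", "and"}

def clean_review_demo(review, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip HTML line breaks and non-letters, drop stopwords."""
    review = review.lower()
    review = re.sub(r"<br />", " ", review)   # remove HTML line breaks
    review = re.sub(r"[^a-z]", " ", review)   # keep letters only
    tokens = review.split()                   # split() also collapses repeated spaces
    return " ".join(w for w in tokens if w not in stop_words)

sample = "This is a GREAT movie!<br /><br />10/10, the best of 2005."
print(clean_review_demo(sample))  # great movie best
```

Digits and punctuation become spaces before tokenization, which is why "10/10" and "2005" leave no trace in the cleaned text.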

We now clean every review in train_reviews and test_reviews using the function defined above.

The helper below returns two lists, one each for the cleaned train and test reviews.

In [6]:
def clean_data(train_reviews,test_reviews) :
    '''Input - train and test reviews
    Output - cleaned train and test reviews'''
    
    new_train=[clean_review(x) for x in train_reviews]
    new_test=[clean_review(y) for y in test_reviews]
    
    return new_train, new_test

In [7]:
train_reviews,test_reviews=clean_data(train_reviews,test_reviews)

print(len(train_reviews))
print(len(test_reviews))
print(train_reviews[0])
print("\n")
print(test_reviews[0])

4000
1000
movie gets respect sure lot memorable quotes listed gem imagine movie joe piscopo actually funny maureen stapleton scene stealer moroni character absolute scream watch alan skipper hale jr police sgt


based actual story john boorman shows struggle american doctor whose husband son murdered continually plagued loss holiday burma sister seemed like good idea get away passport stolen rangoon could leave country sister forced stay back could get papers american embassy fill day could fly took trip countryside tour guide tried finding something stone statues nothing stirred stone suddenly hell broke loose caught political revolt looked like escaped safely boarded train saw tour guide get beaten shot split second decided jump moving train try rescue thought continually life danger woman demonstrated spontaneous selfless charity risking life save another patricia arquette beautiful look beautiful heart unforgettable story taught suffering one promise life always keeps


We create a bag of words using sklearn's CountVectorizer. Each feature records the presence or absence of a word in a particular review, so we set the binary parameter of CountVectorizer to True.
We first fit the vectorizer on the train reviews to create the feature space and then transform the test reviews with the same vocabulary.

We then return the fitted vectorizer object along with the train and test features.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
def create_bag_words(train_reviews,test_reviews) :
    '''Input - train and test reviews
    Output - the trained vectorizer and train and test feature matrix'''
    
    vectorizer = CountVectorizer(binary=True)
    train = vectorizer.fit_transform(train_reviews)
    train_features=train.toarray()
    test=vectorizer.transform(test_reviews)
    test_features=test.toarray()
    
    return vectorizer,train_features,test_features

In [9]:
vectorizer,X_train,X_test=create_bag_words(train_reviews,test_reviews)
print(X_train.shape)
print(X_test.shape)

(4000, 34444)
(1000, 34444)
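
To make the presence/absence encoding concrete, the following is a hand-rolled sketch in plain Python. It is not how CountVectorizer is implemented (for instance, it does not apply the library's tokenization rules), and the function names and toy documents are assumptions made for illustration. It shows the two points that matter here: the vocabulary comes from the training reviews only, and a repeated word still yields a 1.

```python
def fit_vocabulary(train_docs):
    """Map each distinct word in the training documents to a column index."""
    words = sorted({w for doc in train_docs for w in doc.split()})
    return {w: i for i, w in enumerate(words)}

def to_binary_features(docs, vocabulary):
    """One row per document: 1 if the word occurs at least once, else 0."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for w in doc.split():
            if w in vocabulary:          # words unseen during fitting are ignored
                row[vocabulary[w]] = 1
        rows.append(row)
    return rows

train_docs = ["good good movie", "bad movie"]   # toy data
test_docs = ["good plot"]

vocab = fit_vocabulary(train_docs)
print(vocab)                                   # {'bad': 0, 'good': 1, 'movie': 2}
print(to_binary_features(train_docs, vocab))   # [[0, 1, 1], [1, 0, 1]]
print(to_binary_features(test_docs, vocab))    # [[0, 1, 0]]
```

The test review contains "plot", which never appeared in training, so it contributes no column. This is why the train and test matrices above have the same number of columns.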


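We create a support vector machine classifier with a linear kernel using scikit-learn. The function returns the classification accuracy and F1 score on the test data, along with the 5 words with the largest coefficients and the 5 words with the smallest coefficients, that is, the most influential words for each of the two sentiment classes.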

In [12]:
from sklearn.svm import SVC
from sklearn.metrics import f1_score
def execute_svm_model(X_train,y_train,X_test,y_test) :
    '''Input - train and test features and labels
    Output - trained svm model, test_accuracy, f1_score and important words'''
    
    # Flatten the single-column label frames into 1-D arrays
    new_y_train=y_train.to_numpy().ravel()
    new_y_test=y_test.to_numpy().ravel()
    clf = SVC(kernel='linear')
    clf.fit(X_train, new_y_train)
    predictions=clf.predict(X_test)
    test_accuracy=clf.score(X_test,new_y_test)
    # One weight per word; coef_ has shape (1, n_features) for two classes
    weights=clf.coef_.ravel()
    # Uses the global vectorizer fitted in create_bag_words.
    # get_feature_names() was removed in scikit-learn 1.2; use it only on versions older than 1.0
    names=vectorizer.get_feature_names_out()
    ranked = [x for _,x in sorted(zip(weights,names),reverse=True)]
    # 5 words with the largest weights followed by 5 words with the smallest
    total=ranked[:5]+ranked[-5:]
    
    return [clf, test_accuracy, f1_score(new_y_test, predictions, average='binary'), total]

In [13]:
results_svm=execute_svm_model(X_train,y_train,X_test,y_test)
print(results_svm[1])
print(results_svm[2])
print(results_svm[3])

0.839
0.8398009950248755
['worst', 'waste', 'terrible', 'awful', 'poor', 'excellent', 'best', 'history', 'amazing', 'wonderful']
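
The first two numbers printed above are the test accuracy and the F1 score. As a reminder of how they differ, here is a small plain-Python sketch. The helper names and the toy labels are assumptions for illustration; the notebook itself relies on scikit-learn's `accuracy_score` and `f1_score`, where `average='binary'` scores the class labelled 1 by default.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_binary(y_true, y_pred, pos_label=1):
    """Harmonic mean of precision and recall for the class pos_label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    if tp == 0:
        return 0.0                      # no true positives: define F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 1, 0, 0]   # toy labels
y_pred = [1, 1, 1, 0, 1, 0]

print(accuracy(y_true, y_pred))   # 4 of 6 correct
print(f1_binary(y_true, y_pred))  # precision = recall = 3/4, so F1 = 0.75
```

Accuracy counts every correct prediction, while F1 only looks at how well the class labelled 1 is recovered, so the two can diverge when the classes are imbalanced.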


We create a random forest classifier using scikit-learn. The function returns the classification accuracy and F1 score on the test data, along with the 10 most important features (words). Random forest importances are unsigned, so these words are not split by sentiment.

In [16]:
from sklearn.ensemble import RandomForestClassifier
def execute_randomForest_model(X_train,y_train,X_test,y_test) :
    
    '''Input - train and test features and labels
    Output - trained random forest model, test_accuracy, f1_score and important words'''
    
    clf_2 = RandomForestClassifier(n_estimators=400, random_state=13)
    # Flatten the single-column label frames into 1-D arrays
    new_y_train=y_train.to_numpy().ravel()
    new_y_test=y_test.to_numpy().ravel()
    clf_2.fit(X_train, new_y_train)
    predictions=clf_2.predict(X_test)
    test_accuracy=accuracy_score(new_y_test, predictions)
    # Uses the global vectorizer fitted in create_bag_words.
    # get_feature_names() was removed in scikit-learn 1.2; use it only on versions older than 1.0
    names=vectorizer.get_feature_names_out()
    importances=clf_2.feature_importances_
    ranked = [x for _,x in sorted(zip(importances,names),reverse=True)]
    # 10 words with the highest feature importance
    total=ranked[:10]
    
    return [clf_2, test_accuracy, f1_score(new_y_test, predictions, average='binary'), total]

In [17]:
results_rf1=execute_randomForest_model(X_train,y_train,X_test,y_test)
print(results_rf1[1])
print(results_rf1[2])
print(results_rf1[3])

0.851
0.8490374873353597
['bad', 'worst', 'great', 'waste', 'terrible', 'awful', 'best', 'boring', 'nothing', 'minutes']


We now generate the bigrams of a given text review and return them as a list.

In [18]:
import nltk
def generate_bigrams(review) :
    '''Input - a review
    Output - all possible bigrams for that review'''

    
    bigrm = list(nltk.bigrams(review.split()))
    new_bigrm=[]
    for x in bigrm:
        new_bigrm.append(' '.join(x))
    
    return new_bigrm

In [19]:
print(train_reviews[0])
print(generate_bigrams(train_reviews[0]))

movie gets respect sure lot memorable quotes listed gem imagine movie joe piscopo actually funny maureen stapleton scene stealer moroni character absolute scream watch alan skipper hale jr police sgt
['movie gets', 'gets respect', 'respect sure', 'sure lot', 'lot memorable', 'memorable quotes', 'quotes listed', 'listed gem', 'gem imagine', 'imagine movie', 'movie joe', 'joe piscopo', 'piscopo actually', 'actually funny', 'funny maureen', 'maureen stapleton', 'stapleton scene', 'scene stealer', 'stealer moroni', 'moroni character', 'character absolute', 'absolute scream', 'scream watch', 'watch alan', 'alan skipper', 'skipper hale', 'hale jr', 'jr police', 'police sgt']
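
For reference, the same pairs can be produced without NLTK by zipping the token list against itself shifted by one position. This is only an equivalent sketch; the name `generate_bigrams_zip` is made up here and is not used elsewhere in the notebook.

```python
def generate_bigrams_zip(review):
    """Return adjacent word pairs of a whitespace-tokenized review."""
    tokens = review.split()
    # zip stops at the shorter sequence, so n tokens give n - 1 bigrams
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(generate_bigrams_zip("movie gets respect sure"))
# ['movie gets', 'gets respect', 'respect sure']
```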


We create a second feature space using sklearn's TfidfVectorizer, with the tf-idf values of terms as features. We now also include bigrams in addition to unigrams (words).

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
def create_bag_words_2(train_reviews,test_reviews) :
    '''Input - train and test reviews
    Output - the trained vectorizer and train and test feature matrix'''
    
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    vectorizer.fit(train_reviews)
    X_train2 = vectorizer.transform(train_reviews)
    X_test2 = vectorizer.transform(test_reviews)
    
    return vectorizer,X_train2,X_test2

In [21]:
vectorizer2,X_train2,X_test2=create_bag_words_2(train_reviews,test_reviews)
print(X_train2.shape)
print(X_test2.shape)

(4000, 399070)
(1000, 399070)
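
To make the tf-idf weighting concrete, here is a small worked sketch in plain Python. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 with raw term counts and L2-normalised rows, which is how TfidfVectorizer behaves with its default settings as far as the scikit-learn documentation describes them. The function name and toy documents are illustrative, and only unigrams are handled.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed idf and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in weights))
        rows.append([x / norm if norm else 0.0 for x in weights])
    return vocab, rows

docs = ["good movie", "bad movie", "good good plot"]   # toy data
vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(x, 3) for x in row])
```

In the second toy document, "bad" occurs in one document and "movie" in two, so "bad" receives the larger weight. Rare terms are emphasised in the same way in the real feature matrix.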


We create a naive Bayes classifier using scikit-learn. The function returns the classification accuracy and F1 score on the test data, along with the 5 terms with the highest log probability under each of the two classes.

In [22]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
def execute_naive_bayes(X_train,y_train,X_test,y_test) :
    '''Input - train and test features and labels
    Output - trained naive bayes model, test_accuracy, f1_score and important words'''
    
    # Flatten the single-column label frames into 1-D arrays
    new_y_train=y_train.to_numpy().ravel()
    new_y_test=y_test.to_numpy().ravel()

    clf_NB = MultinomialNB()
    # Fit and predict on the arguments rather than the global X_train2/X_test2
    clf_NB.fit(X_train, new_y_train)
    pred_NB=clf_NB.predict(X_test)
    test_accuracy=accuracy_score(new_y_test, pred_NB)
    # Per-class log probabilities of each term (row 0: first class, row 1: second class)
    class_zero=clf_NB.feature_log_prob_[0,:]
    class_one=clf_NB.feature_log_prob_[1,:]
    # Uses the global vectorizer2 fitted in create_bag_words_2.
    # get_feature_names() was removed in scikit-learn 1.2; use it only on versions older than 1.0
    new_names= vectorizer2.get_feature_names_out()
    top_class_one = [x for _,x in sorted(zip(class_one,new_names),reverse=True)]
    top_class_zero = [x for _,x in sorted(zip(class_zero,new_names),reverse=True)]
    # 5 most probable terms for the first class followed by 5 for the second
    total=top_class_zero[:5]+top_class_one[:5]
    
    return [clf_NB, test_accuracy, f1_score(new_y_test,pred_NB, average='binary'), total]

In [23]:
results_nb=execute_naive_bayes(X_train2,y_train,X_test2,y_test)
print(results_nb[1])
print(results_nb[2])
print(results_nb[3])

0.844
0.8517110266159696
['movie', 'film', 'one', 'great', 'like', 'movie', 'film', 'one', 'bad', 'like']
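
Note that 'movie', 'film' and 'one' appear for both classes in the list above. Ranking by per-class log probability alone favours terms that are frequent in every review. One possible alternative, sketched here with made-up log probabilities, is to rank terms by the difference in log probability between the two classes. The dictionaries and the function name are hypothetical and this is not what the notebook does.

```python
# Hypothetical per-class log probabilities, for illustration only
log_prob_class0 = {"movie": -3.0, "film": -3.2, "great": -4.0,
                   "excellent": -5.0, "bad": -6.5, "awful": -8.0}
log_prob_class1 = {"movie": -3.0, "film": -3.1, "great": -5.5,
                   "excellent": -7.0, "bad": -4.2, "awful": -5.0}

def most_discriminative(log_prob_a, log_prob_b, top_n=2):
    """Words whose log probability under class a most exceeds that under class b."""
    diff = {w: log_prob_a[w] - log_prob_b[w] for w in log_prob_a}
    return sorted(diff, key=diff.get, reverse=True)[:top_n]

print(most_discriminative(log_prob_class0, log_prob_class1))  # ['excellent', 'great']
print(most_discriminative(log_prob_class1, log_prob_class0))  # ['awful', 'bad']
```

With real model output, the two inputs would be the two rows of `feature_log_prob_` paired with the feature names. Words such as "movie" that are equally likely in both classes then drop out of the ranking.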


We create a random forest classifier on the TF-IDF features using scikit-learn. The function returns the classification accuracy and F1 score on the test data, along with the 10 most important features (words and bigrams).

In [24]:
from sklearn.metrics import f1_score
def execute_randomForest_model2(X_train,y_train,X_test,y_test) :
    '''Input - train and test features and labels
    Output - trained random forest model, test_accuracy, f1_score and important words'''
   
    results_rf2=[]
    new_ytrain=y_train.to_numpy()
    new_y_train=[x for z in new_ytrain for x in z ]
    new_ytest=y_test.to_numpy()
    new_y_test=[x for z in new_ytest for x in z ]
    
    clf_rf2 = RandomForestClassifier(n_estimators=400, random_state=13)
    clf_rf2.fit(X_train,new_y_train)
    
    pred2=clf_rf2.predict(X_test)
    tt=accuracy_score(pred2, new_y_test)
    
    # get_feature_names() was removed in scikit-learn 1.2; use it only on versions older than 1.0
    new_names=vectorizer2.get_feature_names_out()
    rf2_feat=clf_rf2.feature_importances_
    
    Z_rf2 = [x for _,x in sorted(zip(rf2_feat,new_names),reverse=True)]
    
    results_rf2.append(clf_rf2)
    results_rf2.append(tt)
    results_rf2.append(f1_score(new_y_test, pred2, average='binary'))
    total=Z_rf2[:10]
    results_rf2.append(total)
    
    return results_rf2

In [25]:
results_rf2=execute_randomForest_model2(X_train2,y_train,X_test2,y_test)
print(results_rf2[1])
print(results_rf2[2])
print(results_rf2[3])

0.837
0.8454976303317535
['bad', 'worst', 'waste', 'great', 'minutes', 'best', 'nothing', 'awful', 'terrible', 'even']


We now tune the max_depth parameter of the random forest model created above for best performance. We use GridSearchCV from the scikit-learn library with 3-fold cross-validation.

In [26]:
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GridSearchCV
def tune_random_forest(rf_model,cv,parameters,X_train,y_train) :
    '''
    Input - random forest model, number of cross validation folds, parameters and data
    Output - fitted GridSearchCV model
    '''
    
    # Flatten the single-column label frame into a 1-D array
    new_y_train=y_train.to_numpy().ravel()
    # Pass cv through; without it GridSearchCV falls back to its default number of folds
    clf_grid= GridSearchCV(rf_model, parameters, cv=cv)
    clf_grid.fit(X_train,new_y_train)
    
    return clf_grid

In [27]:
parameters={'max_depth' : list(range(4,12))}
tuned_rf=tune_random_forest(results_rf2[0],3,parameters,X_train2,y_train)

In [30]:
print(tuned_rf.best_score_)
print(tuned_rf.best_params_)

0.83625
{'max_depth': 11}
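
The idea behind a grid search with k-fold cross-validation can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's implementation: the folds here are contiguous and unstratified, whereas scikit-learn stratifies folds for classifiers, and `toy_score` is a hypothetical stand-in for training a forest with a given max_depth and scoring it on the held-out fold.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def grid_search(candidates, n_samples, k, fit_and_score):
    """Return (best_candidate, mean_scores) using k-fold cross-validation.

    fit_and_score(candidate, train_idx, val_idx) -> float is supplied by the caller.
    """
    folds = k_fold_indices(n_samples, k)
    mean_scores = {}
    for cand in candidates:
        scores = []
        for i, val_idx in enumerate(folds):
            # Train on every fold except the one held out for validation
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(fit_and_score(cand, train_idx, val_idx))
        mean_scores[cand] = sum(scores) / k
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

# Hypothetical scoring function: pretends deeper trees always score higher
def toy_score(max_depth, train_idx, val_idx):
    return 1.0 - 1.0 / max_depth

best, scores = grid_search(range(4, 12), n_samples=9, k=3, fit_and_score=toy_score)
print(best)  # 11
```

When the best value sits at the edge of the grid, as max_depth = 11 does in the output above, it can be worth extending the grid before settling on a value.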


We now summarize the results obtained from all 4 models in a single table.

In [28]:
def visualize_results() :
    '''
    output - results dataframe
    '''
    
    test_accuracies=[]
    f1_scores=[]
    imp_words=[]
    headers=['svm','rf1','nb','rf2']
    test_accuracies.append(results_svm[1])
    test_accuracies.append(results_rf1[1])
    test_accuracies.append(results_nb[1])
    test_accuracies.append(results_rf2[1]) 
    f1_scores.append(results_svm[2])
    f1_scores.append(results_rf1[2])
    f1_scores.append(results_nb[2])
    f1_scores.append(results_rf2[2]) 
    imp_words.append(results_svm[3])
    imp_words.append(results_rf1[3])
    imp_words.append(results_nb[3])
    imp_words.append(results_rf2[3]) 
                     
    table = [test_accuracies,f1_scores,imp_words]
    df = pd.DataFrame(table,columns=headers).transpose()              
    df.columns = ['test_accuracies', 'f1_scores','imp_words']
#     df.rows = ['svm','rf1','nb','rf2']
    
    return df

In [29]:
results_df=visualize_results()
results_df.head()

| | test_accuracies | f1_scores | imp_words |
|---|---|---|---|
| svm | 0.839 | 0.839801 | [worst, waste, terrible, awful, poor, excellen... |
| rf1 | 0.851 | 0.849037 | [bad, worst, great, waste, terrible, awful, be... |
| nb | 0.844 | 0.851711 | [movie, film, one, great, like, movie, film, o... |
| rf2 | 0.837 | 0.845498 | [bad, worst, waste, great, minutes, best, noth... |
