# Movies Review Data

The problem contains the dataset which includes the movies review data with review as one column and the sentiment(positive or negative) associated with it in another column.

The objective is to perform sentiment analysis on the reviews and build a model to do sentiment analysis.

In [58]:
# import all the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Read the positive and negative reviews dataset and merge them to create a new dataset with all the reviews. Mark all the negative reviews as 0 and positive reviews as 1

In [59]:
# Read the positive and negative reviews dataset and create a dataset with all the reviews

neg = pd.read_csv('rt-polarity-neg.csv', sep='\n', header=None, names=['review'])
pos = pd.read_csv('rt-polarity-pos.csv', sep='\n', header=None, names=['review'])
neg['sentiment_label']=0
pos['sentiment_label']=1
reviews_df=neg.append(pos)
reviews_df.reset_index(inplace=True)
reviews_df.head()

Unnamed: 0,index,review,sentiment_label
0,0,"simplistic , silly and tedious .",0
1,1,"it's so laddish and juvenile , only teenage bo...",0
2,2,exploitative and largely devoid of the depth o...,0
3,3,[garbus] discards the potential for pathologic...,0
4,4,a visually flashy but narratively opaque and e...,0


## Exploratory Analysis

To understand the data, carry out some exploratory analysis: by checking the datatypes of the variables, size of the dataset and if there are any null values in the dataset.

In [60]:
# check the dataset
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10662 entries, 0 to 10661
Data columns (total 3 columns):
index              10662 non-null int64
review             10662 non-null object
sentiment_label    10662 non-null int64
dtypes: int64(2), object(1)
memory usage: 250.0+ KB


In [61]:
# check if there are any missing values in the dataset
reviews_df.isnull().sum()

index              0
review             0
sentiment_label    0
dtype: int64

In [62]:
# remove the index column from the dataset
reviews_df.drop('index',axis=1,inplace=True)

In [63]:
reviews_df.head()

Unnamed: 0,review,sentiment_label
0,"simplistic , silly and tedious .",0
1,"it's so laddish and juvenile , only teenage bo...",0
2,exploitative and largely devoid of the depth o...,0
3,[garbus] discards the potential for pathologic...,0
4,a visually flashy but narratively opaque and e...,0


## Train Test Split

Next, split the available dataset into training and test data with 10% of the total data assigned as the test dataset and remianing 90% as the training dataset.

In [64]:
from sklearn.cross_validation import train_test_split

In [65]:
X = reviews_df['review']
y = reviews_df['sentiment_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=101)

In [66]:
# create train and test dataset
train_review_df = reviews_df.ix[X_train.index]
test_review_df = reviews_df.ix[X_test.index]

In [67]:
print 'Training dataset shape',train_review_df.shape
print 'Test dataset shape',test_review_df.shape

Training dataset shape (9595, 2)
Test dataset shape (1067, 2)


## Feature Extraction

Import Natural language toolkit to clean the reviews by removing punctuations or numbers etc. Though punctuations may help to express the sentiments in some cases but not taking them into consideration for now. 
Import the stopwords list to remove all the stopwords from the reviews.

In [11]:
# import the libraries 

import nltk # Import the stop word library from python Natural Language Toolkit
nltk.download()
from nltk.corpus import stopwords # Import the stop word list
import re # Import regular expression library to find and replace the words
import string 

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [68]:
# Function to convert uncleaned reviews to a string of reviews 
# 1. Remove non letters
# 2. Change everything to lowercase
# 3. Remove stopwords

def cleanup_reviews(review):
    letters = re.sub("[^a-zA-Z]", " ", review) # Remove anything in the sentence other than letters
    words = letters.lower().split()   # change everything to lowercase
    stops = set(stopwords.words("english")) # convert to a set for faster processing
    meaningful_words = [w.strip() for w in words if not w in stops]   # remove the stop words
    sentence = " ".join( meaningful_words )  # join back all the remaining words into sentence separated by a space
    return sentence.strip()

In [69]:
# apply the cleanup_review function to all the reviews in the training dataset
train_review_df['clean_review'] = train_review_df['review'].apply(cleanup_reviews)

In [17]:
# Save the cleaned training dataset as a pickle
train_review_df.to_pickle("cleaned_movie_reviews2.pkl")

In [70]:
# apply the cleanup_review function to all the reviews in the test dataset
test_review_df['clean_review'] = test_review_df['review'].apply(cleanup_reviews)

In [20]:
# Save the cleaned test dataset as a pickle
train_review_df.to_pickle("cleaned_movie_reviews2_test.pkl")

## Tf-idf 

Generate feature matrix using tf-idf vectorization based on term frequency and inverse document frequency instead of using the bag of words which simply counts the word frequency in a sentence.

In [71]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [72]:
vectorizer = TfidfVectorizer(min_df=3, max_df=0.95, max_features=900)
train_data_features = vectorizer.fit_transform(train_review_df['clean_review'])
test_data_features = vectorizer.fit_transform(test_review_df['clean_review'])

After the feature matrix has been created, apply different machine learning models to it and check the model accuracy

## Random forest Classifier

Train the model by fitting it in the training dataset and then making predictions on the unknown test dataset.

In [78]:
from sklearn.ensemble import RandomForestClassifier

In [79]:
rfc = RandomForestClassifier(n_estimators=200)

In [80]:
rfc.fit(train_data_features,train_review_df['sentiment_label'])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=200, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [81]:
predictions  = rfc.predict(test_data_features)

## Model validation

Check the model accuracy on the test dataset.

In [82]:
from sklearn.metrics import confusion_matrix,classification_report

In [83]:
rfc.score(test_data_features, test_review_df['sentiment_label'])

0.49203373945641987

In [84]:
print 'Confusion Matrix'+'\n\n', confusion_matrix(test_review_df['sentiment_label'],predictions)
print('\n')
print 'Classification_report'+'\n\n', classification_report(test_review_df['sentiment_label'],predictions)

Confusion Matrix

[[273 269]
 [273 252]]


Classification_report

             precision    recall  f1-score   support

          0       0.50      0.50      0.50       542
          1       0.48      0.48      0.48       525

avg / total       0.49      0.49      0.49      1067



The model accuracy obtained above is not high, hence trying to apply other models and check if they show any improvement.

## Naive Bayes Classifier

In [90]:
from sklearn.naive_bayes import MultinomialNB

In [91]:
mnb = MultinomialNB()

In [92]:
nb_model = mnb.fit(train_data_features,train_review_df['sentiment_label'])
predictions_nb = nb_model.predict(test_data_features)

## Model Validation

In [93]:
print 'Confusion Matrix'+'\n\n', confusion_matrix(test_review_df['sentiment_label'],predictions_nb)
print('\n')
print 'Classification_report'+'\n\n', classification_report(test_review_df['sentiment_label'],predictions_nb)

Confusion Matrix

[[265 277]
 [256 269]]


Classification_report

             precision    recall  f1-score   support

          0       0.51      0.49      0.50       542
          1       0.49      0.51      0.50       525

avg / total       0.50      0.50      0.50      1067



## KNN

In [94]:
from sklearn.neighbors import KNeighborsClassifier

In [95]:
knn =  KNeighborsClassifier()

In [96]:
knn.fit(train_data_features,train_review_df['sentiment_label'])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [97]:
predictions_knn = knn.predict(test_data_features)

## Model Validation

In [98]:
print 'Confusion Matrix'+'\n\n', confusion_matrix(test_review_df['sentiment_label'],predictions_knn)
print('\n')
print 'Classification_report'+'\n\n', classification_report(test_review_df['sentiment_label'],predictions_knn)

Confusion Matrix

[[354 188]
 [350 175]]


Classification_report

             precision    recall  f1-score   support

          0       0.50      0.65      0.57       542
          1       0.48      0.33      0.39       525

avg / total       0.49      0.50      0.48      1067



## Logistic Regression

In [73]:
from sklearn.linear_model import LogisticRegression

In [74]:
lr = LogisticRegression()

In [75]:
lr.fit(train_data_features,train_review_df['sentiment_label'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [76]:
predictions_lr = lr.predict(test_data_features)

In [77]:
print 'Confusion Matrix'+'\n\n', confusion_matrix(test_review_df['sentiment_label'],predictions_lr)
print('\n')
print 'Classification_report'+'\n\n', classification_report(test_review_df['sentiment_label'],predictions_lr)

Confusion Matrix

[[271 271]
 [270 255]]


Classification_report

             precision    recall  f1-score   support

          0       0.50      0.50      0.50       542
          1       0.48      0.49      0.49       525

avg / total       0.49      0.49      0.49      1067

