# Movies Review Data

The dataset contains movie reviews: one column holds the review text and another holds the associated sentiment (positive or negative).

The objective is to build a model that performs sentiment analysis on the reviews.

In [130]:
# import all the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Read the positive and negative review datasets and merge them into a single dataset containing all the reviews. Label negative reviews as 0 and positive reviews as 1.

In [131]:
# Read the positive and negative reviews dataset and create a dataset with all the reviews

neg = pd.read_csv('rt-polarity-neg.csv', sep='\n', header=None, names=['review'])
pos = pd.read_csv('rt-polarity-pos.csv', sep='\n', header=None, names=['review'])
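neg['sentiment_label'] = 0
pos['sentiment_label'] = 1
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
reviews_df = pd.concat([neg, pos])
# reset_index keeps the old index as a new 'index' column (dropped later)
reviews_df.reset_index(inplace=True)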
reviews_df.head()

| | index | review | sentiment_label |
|------|------|------|------|
| 0 | 0 | simplistic , silly and tedious . | 0 |
| 1 | 1 | it's so laddish and juvenile , only teenage bo... | 0 |
| 2 | 2 | exploitative and largely devoid of the depth o... | 0 |
| 3 | 3 | [garbus] discards the potential for pathologic... | 0 |
| 4 | 4 | a visually flashy but narratively opaque and e... | 0 |
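
Recent pandas releases may reject `'\n'` as a separator in `read_csv`. If that happens, the files can be read line by line instead. The following is a minimal sketch, not the notebook's original code: the helper names `load_reviews` and `load_all_reviews` are hypothetical, and UTF-8 encoding is an assumption about the files.

```python
import pandas as pd


def load_reviews(path, label):
    """Read one review per line from `path` and attach a constant sentiment label."""
    # Assumption: the file is UTF-8 encoded with one review per line
    with open(path, encoding="utf-8") as handle:
        # Strip surrounding whitespace and skip blank lines
        reviews = [line.strip() for line in handle if line.strip()]
    return pd.DataFrame({"review": reviews, "sentiment_label": label})


def load_all_reviews(neg_path, pos_path):
    """Combine negative (0) and positive (1) reviews into one frame."""
    frames = [load_reviews(neg_path, 0), load_reviews(pos_path, 1)]
    # ignore_index=True renumbers rows 0..n-1, so no extra 'index' column appears
    return pd.concat(frames, ignore_index=True)
```

Calling `load_all_reviews('rt-polarity-neg.csv', 'rt-polarity-pos.csv')` would return a frame with only the `review` and `sentiment_label` columns.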


## Exploratory Analysis

To understand the data, carry out some exploratory analysis: check the data types of the variables, the size of the dataset, and whether it contains any null values.

In [132]:
# check the dataset
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10662 entries, 0 to 10661
Data columns (total 3 columns):
index              10662 non-null int64
review             10662 non-null object
sentiment_label    10662 non-null int64
dtypes: int64(2), object(1)
memory usage: 250.0+ KB


In [133]:
# check if there are any missing values in the dataset
reviews_df.isnull().sum()

index              0
review             0
sentiment_label    0
dtype: int64

In [134]:
# remove the index column from the dataset
reviews_df.drop('index',axis=1,inplace=True)

In [135]:
reviews_df.head()

| | review | sentiment_label |
|------|------|------|
| 0 | simplistic , silly and tedious . | 0 |
| 1 | it's so laddish and juvenile , only teenage bo... | 0 |
| 2 | exploitative and largely devoid of the depth o... | 0 |
| 3 | [garbus] discards the potential for pathologic... | 0 |
| 4 | a visually flashy but narratively opaque and e... | 0 |


## Train Test Split

Next, split the available dataset into training and test data, with 10% of the total data assigned to the test dataset and the remaining 90% to the training dataset.

In [136]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

In [137]:
X = reviews_df['review']
y = reviews_df['sentiment_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=101)

In [138]:
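# create train and test dataset
# .ix has been removed from pandas; .loc selects by index label.
# .copy() avoids SettingWithCopyWarning when columns are added later.
train_review_df = reviews_df.loc[X_train.index].copy()
test_review_df = reviews_df.loc[X_test.index].copy()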

In [139]:
print('Training dataset shape {}'.format(train_review_df.shape))
print('Test dataset shape {}'.format(test_review_df.shape))

Training dataset shape (9595, 2)
Test dataset shape (1067, 2)
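
The row counts above can be checked by hand. This is a small worked example, assuming the test size is rounded up and the remainder goes to training, which is consistent with the shapes printed above.

```python
import math

n_samples = 10662  # rows in reviews_df, from the info() output
test_size = 0.10

# 10% of 10662 is 1066.2, which rounds up to a whole number of test rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 9595 1067
```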


## Feature Extraction

Clean the reviews by removing punctuation, numbers, and other non-letter characters. Punctuation can help express sentiment in some cases, but it is not taken into consideration for now.
The stop word list from the Natural Language Toolkit (NLTK) is used to remove all stop words from the reviews.

In [140]:
# import the libraries

import nltk  # Natural Language Toolkit
nltk.download()  # opens the interactive downloader; nltk.download('stopwords') fetches only the stop word corpus
from nltk.corpus import stopwords  # English stop word list
import re  # regular expressions, used to strip non-letter characters
import string

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [141]:
# Convert a raw review into a cleaned string:
# 1. Replace non-letters with spaces
# 2. Change everything to lowercase
# 3. Remove stop words

def cleanup_reviews(review):
    letters = re.sub("[^a-zA-Z]", " ", review)  # replace anything other than letters with a space
    words = letters.lower().split()  # lowercase, then split on whitespace
    stops = set(stopwords.words("english"))  # a set gives fast membership tests
    meaningful_words = [w for w in words if w not in stops]  # remove the stop words
    return " ".join(meaningful_words)  # join the remaining words, separated by single spaces
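
To see what the three cleaning steps do without downloading the NLTK data, here is a small sketch of the same logic. `DEMO_STOPS` and `cleanup_demo` are hypothetical names, and the tiny stop word set is only a stand-in for `stopwords.words("english")`, which is much longer.

```python
import re

# Hypothetical stand-in for the NLTK English stop word list
DEMO_STOPS = {"and", "it", "s", "so", "only", "the", "of", "a", "but"}


def cleanup_demo(review, stops=DEMO_STOPS):
    letters = re.sub("[^a-zA-Z]", " ", review)  # non-letters become spaces
    words = letters.lower().split()  # lowercase and tokenize on whitespace
    return " ".join(w for w in words if w not in stops)  # drop stop words


print(cleanup_demo("simplistic , silly and tedious ."))  # simplistic silly tedious
print(cleanup_demo("it's so laddish and juvenile"))      # laddish juvenile
```

Note that the apostrophe in "it's" is replaced by a space, so the word splits into "it" and "s" before the stop word filter runs.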

In [142]:
# apply the cleanup_reviews function to all the reviews in the training dataset
train_review_df['clean_review'] = train_review_df['review'].apply(cleanup_reviews)

In [143]:
train_review_df.reset_index(inplace=True)

In [118]:
# Save the cleaned training dataset as a pickle
train_review_df.to_pickle("cleaned_movie_reviews2.pkl")

In [144]:
# apply the cleanup_reviews function to all the reviews in the test dataset
test_review_df['clean_review'] = test_review_df['review'].apply(cleanup_reviews)

In [145]:
test_review_df.reset_index(inplace=True)

In [20]:
# Save the cleaned test dataset as a pickle
test_review_df.to_pickle("cleaned_movie_reviews2_test.pkl")

## Pipeline - tf-idf and classifier

Generate the feature matrix with tf-idf vectorization, which weights each term by its term frequency and inverse document frequency, instead of a plain bag of words, which only counts how often each word occurs in a sentence.

Use scikit-learn's Pipeline to combine the feature extraction and classification steps into one operation.
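
The difference between raw counts and tf-idf weights can be shown on a toy corpus. This is a hand-rolled sketch for illustration only: the function name `tfidf` is hypothetical, and the smoothed idf formula and L2 normalization are assumptions chosen for the example, so `TfidfVectorizer` may differ in its details.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf (an assumption): a term found in every document gets idf 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: tf * idf[term] for term, tf in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


docs = ["silly tedious film", "charming film", "tedious film"]
weights = tfidf(docs)
print(weights[0])
```

In the first document every word occurs once, so a bag of words would give all three the same count. With tf-idf, "silly" (one document) outweighs "tedious" (two documents), which outweighs "film" (all three documents).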

In [146]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

In [187]:
# tf-idf features followed by one classifier; swap the 'clf' step to try the others
pipeline = Pipeline([
    ('vect', TfidfVectorizer(min_df=3, max_df=0.95)),
    # ('clf', RandomForestClassifier(n_estimators=200)),
    # ('clf', MultinomialNB()),
    ('clf', LogisticRegression()),
    # ('clf', KNeighborsClassifier()),
])

## Cross Validation - Kfold

In [188]:
# 5 consecutive folds without shuffling, matching the old KFold(n=..., n_folds=5)
k_fold = KFold(n_splits=5)

In [191]:
confusion = np.array([[0, 0], [0, 0]])
scores = []
for train_indices, test_indices in k_fold.split(train_review_df):
    train_text = train_review_df.iloc[train_indices]['clean_review']
    train_y = train_review_df.iloc[train_indices]['sentiment_label']

    valid_text = train_review_df.iloc[test_indices]['clean_review']
    valid_y = train_review_df.iloc[test_indices]['sentiment_label']

    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(valid_text)

    # accumulate the validation results over all folds
    confusion += confusion_matrix(valid_y, predictions)
    scores.append(f1_score(valid_y, predictions))  # per-fold F1 for the positive class

print('Confusion matrix:')
print(confusion)

# Note: at this point the pipeline is fitted on the training folds of the last split only
predicted_results = pipeline.predict(test_review_df['clean_review'])
print('\n')
print(classification_report(test_review_df['sentiment_label'], predicted_results))

Confusion matrix:
[[3588 1201]
 [1211 3595]]


             precision    recall  f1-score   support

          0       0.75      0.76      0.75       542
          1       0.75      0.74      0.74       525

avg / total       0.75      0.75      0.75      1067
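
The pooled cross-validation confusion matrix can be turned into the same kind of metrics as the classification report. This is a short worked example using the numbers printed above; it follows the scikit-learn convention that rows are true labels and columns are predicted labels.

```python
# Pooled 5-fold confusion matrix: rows = true label, columns = predicted label
confusion = [[3588, 1201],
             [1211, 3595]]

(tn, fp), (fn, tp) = confusion
total = tn + fp + fn + tp  # 9595, the size of the training dataset

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)  # of the reviews predicted positive, the share that are positive
recall_pos = tp / (tp + fn)     # of the positive reviews, the share that were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(round(accuracy, 2), round(precision_pos, 2), round(recall_pos, 2), round(f1_pos, 2))
# 0.75 0.75 0.75 0.75
```

The cross-validation figures agree with the 0.75 reported on the held-out test dataset.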



## Results
| Features | Logistic Regression | Naive Bayes | Random Forest | KNN |
|------|------|------|------|------|
| Bag of words - 5000 features, uni-bigram | 0.52 | 0.50 | 0.52 | 0.42 |
| Bag of words - 5000 features | 0.50 | 0.50 | 0.51 | 0.36 |
| tf-idf | 0.49 | 0.50 | 0.49 | 0.48 |
| tf-idf + kfold=5 | 0.75 | 0.75 | 0.69 | 0.38 |

Overall ranking: Logistic Regression > Naive Bayes > Random Forest > KNN

## TO DO:

* Keep the stop words and try n-grams
* Draw word clouds of the most and least used words in positive and negative reviews
* Try other ML models such as SVM, other Naive Bayes variants, and boosting
* Data visualization
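
As a starting point for the first item, here is a minimal sketch of word n-gram extraction with the stop words kept. The function name `ngrams` is hypothetical. The motivation is that stop word lists often include negations such as "not", so removing them before building bigrams can lose phrases like "not good".

```python
def ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    words = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n words across the sentence
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams


print(ngrams("this is not good"))
# ['this', 'is', 'not', 'good', 'this is', 'is not', 'not good']
```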