# IMDB Reviews Dataset

The IMDB Reviews dataset contains movie reviews along with their associated binary sentiment polarity labels. The main dataset contains 50,000 reviews split evenly into a training set and a test set (25,000 each). Positive and negative labels are balanced.

Across the entire dataset, no more than 30 reviews are allowed for any one movie, because reviews of the same movie tend to be correlated. Further, the train and test sets contain disjoint sets of movies, so no significant performance gain can be obtained by memorizing movie-specific terms and their association with the observed labels.

For the train/test set a negative review has a score <= 4 out of 10 and a positive review has a score >= 7 out of 10. Reviews with a more neutral score were not included in the dataset.
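
The labelling rule above can be sketched as a small function. The name `score_to_label` and the use of `None` for excluded reviews are illustrative assumptions, not part of the dataset's own tooling; the 1/0 encoding follows the targets used later in this notebook.

```python
def score_to_label(score):
    """Map a 1-10 IMDB rating to a binary sentiment label.

    Returns 0 (negative) for scores <= 4, 1 (positive) for scores >= 7,
    and None for the neutral scores that the dataset leaves out.
    """
    if score <= 4:
        return 0
    if score >= 7:
        return 1
    return None

# Labels for every possible rating from 1 to 10
labels = [score_to_label(s) for s in range(1, 11)]
print(labels)
```

Ratings of 5 and 6 map to `None` here, which corresponds to the reviews that were not included.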

Link: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

The objective is to perform sentiment analysis on this dataset using various machine learning models.


## Loading the IMDB Dataset

Format the dataset into lists of strings and review a few examples of positive and negative reviews.


In [1]:
# Importing necessary libraries (general)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math

In [2]:
import tarfile
import wget
import os
import os.path

#Download the IMDB data to your working directory from the link below
URL = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

#Downloads the file to your current working directory
#May need to install the wget package: conda install -c conda-forge python-wget
if not os.path.exists('aclImdb_v1.tar.gz'):
    wget.download(URL)

#Unpack the downloaded aclImdb_v1.tar.gz archive into the aclImdb folder
with tarfile.open("aclImdb_v1.tar.gz") as tar:
    tar.extractall()
print("Dataset unpacked in aclImdb Folder")

Dataset unpacked in aclImdb Folder


In [3]:
#Read the original text files from the aclImdb folder and write their contents to new text files, one review per line.
#The end result is 4 text files: positive and negative reviews in separate train and test datasets
import shutil
import glob
import os
import os.path
from pathlib import Path

#Make a folder for the new text files
if not os.path.exists('IMDB_Data'):
    os.mkdir("IMDB_Data")

read_files = glob.glob(os.path.join('aclImdb/train/pos',"*.txt"))

with open('IMDB_Data/pos_train.txt','wb') as outfile:
    for f in read_files:
        with open(f,'rb') as infile:
            shutil.copyfileobj(infile, outfile)
        outfile.write(b"\n")
        
read_files = glob.glob(os.path.join('aclImdb/train/neg',"*.txt"))

with open('IMDB_Data/neg_train.txt','wb') as outfile:
    for f in read_files:
        with open(f,'rb') as infile:
            shutil.copyfileobj(infile, outfile)
        outfile.write(b"\n")
        
read_files = glob.glob(os.path.join('aclImdb/test/pos',"*.txt"))

with open('IMDB_Data/pos_test.txt','wb') as outfile:
    for f in read_files:
        with open(f,'rb') as infile:
            shutil.copyfileobj(infile, outfile)
        outfile.write(b"\n")
        
read_files = glob.glob(os.path.join('aclImdb/test/neg',"*.txt"))

with open('IMDB_Data/neg_test.txt','wb') as outfile:
    for f in read_files:
        with open(f,'rb') as infile:
            shutil.copyfileobj(infile, outfile)
        outfile.write(b"\n")
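
The four near-identical blocks above could be folded into one helper. This is a minimal sketch under the same folder layout; the function name `merge_reviews` is hypothetical, and sorting the file list is an added assumption to make the output order deterministic.

```python
import glob
import os
import shutil

def merge_reviews(src_dir, dst_path):
    """Concatenate every .txt file in src_dir into dst_path, one review per line."""
    files = sorted(glob.glob(os.path.join(src_dir, "*.txt")))
    with open(dst_path, "wb") as outfile:
        for path in files:
            with open(path, "rb") as infile:
                shutil.copyfileobj(infile, outfile)
            outfile.write(b"\n")  # newline separates consecutive reviews
    return len(files)

# Hypothetical usage mirroring the four blocks above:
# os.makedirs("IMDB_Data", exist_ok=True)
# for split in ("train", "test"):
#     for label in ("pos", "neg"):
#         merge_reviews(f"aclImdb/{split}/{label}", f"IMDB_Data/{label}_{split}.txt")
```

The return value is the number of files merged, which should be 12,500 for each of the four folders.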

In [2]:
#Turn the contents of the text files into lists of strings (one review per line)
def read_reviews(path):
    #The with-block closes the file once all lines have been read
    with open(path, 'r', encoding = "utf8") as f:
        return [line.strip() for line in f]

reviews_train_pos = read_reviews('IMDB_Data/pos_train.txt')
reviews_train_neg = read_reviews('IMDB_Data/neg_train.txt')
reviews_test_pos = read_reviews('IMDB_Data/pos_test.txt')
reviews_test_neg = read_reviews('IMDB_Data/neg_test.txt')

In [3]:
print(len(reviews_train_pos))
print(len(reviews_train_neg))
print(len(reviews_test_pos))
print(len(reviews_test_neg))

12500
12500
12500
12500


In [3]:
#Positive Review Example
reviews_train_pos[0]

'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'

In [4]:
#Another Positive Review Example
reviews_train_pos[2]

'Brilliant over-acting by Lesley Ann Warren. Best dramatic hobo lady I have ever seen, and love scenes in clothes warehouse are second to none. The corn on face is a classic, as good as anything in Blazing Saddles. The take on lawyers is also superb. After being accused of being a turncoat, selling out his boss, and being dishonest the lawyer of Pepto Bolt shrugs indifferently "I\'m a lawyer" he says. Three funny words. Jeffrey Tambor, a favorite from the later Larry Sanders show, is fantastic here too as a mad millionaire who wants to crush the ghetto. His character is more malevolent than usual. The hospital scene, and the scene where the homeless invade a demolition site, are all-time classics. Look for the legs scene and the two big diggers fighting (one bleeds). This movie gets better each time I see it (which is quite often).'

In [5]:
#Negative Review Example
reviews_train_neg[0]

"Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

In [6]:
#Another Negative Review Example
reviews_train_neg[2]

"This film lacked something I couldn't put my finger on at first: charisma on the part of the leading actress. This inevitably translated to lack of chemistry when she shared the screen with her leading man. Even the romantic scenes came across as being merely the actors at play. It could very well have been the director who miscalculated what he needed from the actors. I just don't know.<br /><br />But could it have been the screenplay? Just exactly who was the chef in love with? He seemed more enamored of his culinary skills and restaurant, and ultimately of himself and his youthful exploits, than of anybody or anything else. He never convinced me he was in love with the princess.<br /><br />I was disappointed in this movie. But, don't forget it was nominated for an Oscar, so judge for yourself."

## Preprocessing and Cleaning the Text Data

As seen above, the original reviews are quite messy and need to be cleaned to help the machine learning models. Cleaning includes converting the text to lowercase and removing punctuation and any other unnecessary characters.

In [4]:
import re

#Raw strings keep regex escapes such as \s and \[ from being read as Python string escapes
replace_no_space = re.compile(r"[.;:!'?_,\"()\[\]]")
replace_with_space = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

def preprocess_reviews(reviews):
    reviews = [replace_no_space.sub("", line.lower()) for line in reviews]
    reviews = [replace_with_space.sub(" ", line) for line in reviews]
    
    return reviews

reviews_train_pos_clean1 = preprocess_reviews(reviews_train_pos)
reviews_train_neg_clean1 = preprocess_reviews(reviews_train_neg)
reviews_test_pos_clean1 = preprocess_reviews(reviews_test_pos)
reviews_test_neg_clean1 = preprocess_reviews(reviews_test_neg)

In [8]:
#The same positive review example after being cleaned
reviews_train_pos_clean1[0]

'bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell highs satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled  at  high a classic line inspector im here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isnt'

In [9]:
#The same negative review example after cleaning
reviews_train_neg_clean1[0]

'story of a man who has unnatural feelings for a pig starts out with a opening scene that is a terrific example of absurd comedy a formal orchestra audience is turned into an insane violent mob by the crazy chantings of its singers unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting even those from the era should be turned off the cryptic dialogue would make shakespeare seem easy to a third grader on a technical level its better than you might think with some good cinematography by future great vilmos zsigmond future stars sally kirkland and frederic forrest can be seen briefly'
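
To see what each of the two substitutions contributes, the cleaning step can be traced on a short made-up sentence. This is a sketch that reuses the same two patterns; the helper name `clean` and the sample text are illustrative only.

```python
import re

# Same patterns as in the preprocessing cell, written as raw strings
REPLACE_NO_SPACE = re.compile(r"[.;:!'?_,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

def clean(text):
    """Lowercase, delete punctuation, then turn breaks, hyphens and slashes into spaces."""
    text = REPLACE_NO_SPACE.sub("", text.lower())
    return REPLACE_WITH_SPACE.sub(" ", text)

sample = "Well-made film!<br /><br />Don't miss it."
cleaned = clean(sample)
print(cleaned)
```

Note that the break pattern only matches a doubled `<br /><br />`. A single `<br />` is not removed by these patterns: only its slash is replaced by a space, so a stray `<br  >` remains in the text.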

### Possible Further Text Processing: Removing Stop Words and Normalization

Other cleaning steps that can change model performance include removing stop words and normalization (stemming or lemmatization).

Stop words are very common words such as 'in', 'of', 'a', 'at', or 'the' that usually provide little useful information to the text classifier.

Normalization (stemming or lemmatization) is a common next step in text preprocessing that converts the different forms of a word into a single form.
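
Stop-word removal can be illustrated without any library. The sketch below uses only the five example words listed above; the function name `remove_stop_words` is hypothetical, and a real list (such as the NLTK one used in this notebook) is much longer.

```python
# Illustrative stop-word list taken from the examples in the text
STOP_WORDS = {"in", "of", "a", "at", "the"}

def remove_stop_words(reviews, stop_words=STOP_WORDS):
    """Drop stop words from each whitespace-tokenized review."""
    return [" ".join(word for word in review.split() if word not in stop_words)
            for review in reviews]

sample = ["the pettiness of the whole situation", "a pity that it isnt"]
filtered = remove_stop_words(sample)
print(filtered)
```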

In [5]:
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import nltk

nltk.download('wordnet')


#Stemming
def stemming_text(text_data):
    stem = PorterStemmer()
    return [' '.join([stem.stem(word) for word in review.split()]) for review in text_data]


#Lemmatization
def lemmatize_text(text_data):
    lem = WordNetLemmatizer()
    return [' '.join([lem.lemmatize(word) for word in review.split()]) for review in text_data]

    
#Alternatively, stop words can be removed by passing the stop_words argument to any of scikit-learn’s ‘Vectorizer’ classes
#Removing stop words often (but not always) improves the model accuracy
#A custom, task-specific list of stop words is usually more effective than a general list, for example:
#stop_words=['in','of','at','a','the']
#Here the general English list from NLTK is used
nltk.download('stopwords')
stop_words = stopwords.words('english')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\patri\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\patri\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
reviews_train_pos_clean = lemmatize_text(reviews_train_pos_clean1)
reviews_train_neg_clean = lemmatize_text(reviews_train_neg_clean1)
reviews_test_pos_clean = lemmatize_text(reviews_test_pos_clean1)
reviews_test_neg_clean = lemmatize_text(reviews_test_neg_clean1)

In [12]:
reviews_train_pos_clean[0]

'bromwell high is a cartoon comedy it ran at the same time a some other program about school life such a teacher my 35 year in the teaching profession lead me to believe that bromwell high satire is much closer to reality than is teacher the scramble to survive financially the insightful student who can see right through their pathetic teacher pomp the pettiness of the whole situation all remind me of the school i knew and their student when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector im here to sack one of your teacher student welcome to bromwell high i expect that many adult of my age think that bromwell high is far fetched what a pity that it isnt'

In [13]:
reviews_train_neg_clean[0]

'story of a man who ha unnatural feeling for a pig start out with a opening scene that is a terrific example of absurd comedy a formal orchestra audience is turned into an insane violent mob by the crazy chanting of it singer unfortunately it stay absurd the whole time with no general narrative eventually making it just too off putting even those from the era should be turned off the cryptic dialogue would make shakespeare seem easy to a third grader on a technical level it better than you might think with some good cinematography by future great vilmos zsigmond future star sally kirkland and frederic forrest can be seen briefly'

### Prepare Data Sets for Feature Vectorization and Classification Models

The positive and negative reviews are combined into a single training set and a single test set. Because the combined lists hold all positive reviews first, followed by all negative reviews, the data must be shuffled together with its labels.

In [7]:
#Combine all the positive and negative reviews to end up with one training set and one testing set of 25,000 samples each
reviews_train = []
reviews_train_combined = reviews_train_pos_clean + reviews_train_neg_clean
print(len(reviews_train_combined))

reviews_test = []
reviews_test_combined = reviews_test_pos_clean + reviews_test_neg_clean
print(len(reviews_test_combined))

25000
25000


In [8]:
#Shuffle
import random

target = [1 if i < 12500 else 0 for i in range(25000)]

c = list(zip(reviews_train_combined, target))
random.shuffle(c)
reviews_train, y_train = zip(*c)

print(reviews_train[0])
print(y_train[0])

b = list(zip(reviews_test_combined, target))
random.shuffle(b)
reviews_test, y_test = zip(*b)

print(reviews_test[0])
print(y_test[0])

te of the storm country is possibly the best movie of all of mary pickford film at two hour it wa quite long for a 1922 silent film yet continues to hold your interest some 80 year after it wa filmed mary give one of her finest performance at time the role seems like a greatest hit performance with bit of mary the innocent mary the little devil mary the little mother mary the spitfire mary the romantic heroine etc characteristic that often were used throughout a single film in the past the movie is surprisingly frank about one supporting character illegitimate child for 1922 and at one point our little mary is thought the unwed mother in question if the academy award had been around in 1922 no doubt the best actress oscar for the year would have been mary
1
this movie wa a low point for both jason robards and sam peckinpah major plot point are taken directly from sergio leone masterpiece once upon a time in the west released two year earlier and also featuring robards a man find a wate
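
Because `random.shuffle` is unseeded above, the order (and therefore the printed examples) changes on every run. A reproducible variant might look like the following sketch; the helper name `shuffle_together` and the seed value are assumptions for illustration.

```python
import random

def shuffle_together(reviews, labels, seed=0):
    """Shuffle reviews and labels with one permutation so pairs stay aligned."""
    pairs = list(zip(reviews, labels))
    # A local generator leaves the global random state untouched
    random.Random(seed).shuffle(pairs)
    shuffled_reviews, shuffled_labels = zip(*pairs)
    return list(shuffled_reviews), list(shuffled_labels)

reviews = ["great", "superb", "awful", "boring"]
labels = [1, 1, 0, 0]
r, y = shuffle_together(reviews, labels, seed=42)
print(list(zip(r, y)))
```

Calling the helper again with the same seed returns the same order, which makes accuracy figures easier to compare across runs.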

## Feature Vectorization (Count Vectorizer and TF-IDF)

### Unigram Counts, Bigram Counts, or TF-IDF Features Can Be Used in the Models
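
As a rough illustration of what a count vectorizer computes, the sketch below counts word n-grams in plain Python. It is a simplification, not scikit-learn's implementation: `CountVectorizer` also applies its own tokenization and the stop-word list, and the function name `ngram_counts` is hypothetical.

```python
from collections import Counter

def ngram_counts(text, ngram_range=(1, 1)):
    """Count word n-grams for every n in the inclusive ngram_range."""
    tokens = text.split()
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

unigrams = ngram_counts("must see movie must see", (1, 1))
bigrams = ngram_counts("must see movie must see", (1, 2))
print(unigrams)
print(bigrams)
```

With `ngram_range=(1, 2)` every adjacent word pair becomes a feature in addition to the single words, which is why the bigram vocabulary is so much larger than the unigram one.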

In [9]:
#Feature Vectorization (Unigram)
from sklearn.feature_extraction.text import CountVectorizer

#stop_words parameter: If ‘english’, a built-in stop word list for English is used. There are several known issues with ‘english’ and you should consider an alternative (see Using stop words).
#ngram_range parameter: The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted
#max_features parameter: Max number of features
vectorizer = CountVectorizer(stop_words = stop_words, ngram_range = (1,1), max_features = None)

X_train_counts = vectorizer.fit_transform(reviews_train)
X_test_counts = vectorizer.transform(reviews_test)

print(X_train_counts.shape)
print(X_train_counts)
#print(vectorizer.get_feature_names())

(25000, 83709)
  (0, 73231)	1
  (0, 70744)	1
  (0, 16969)	1
  (0, 57561)	1
  (0, 8266)	2
  (0, 49626)	2
  (0, 46325)	9
  (0, 56285)	1
  (0, 27432)	3
  (0, 76757)	1
  (0, 35638)	1
  (0, 80001)	2
  (0, 59609)	1
  (0, 44171)	1
  (0, 397)	3
  (0, 67421)	1
  (0, 82964)	1
  (0, 16340)	1
  (0, 35080)	1
  (0, 38016)	1
  (0, 1348)	1
  (0, 82844)	2
  (0, 27471)	1
  (0, 30917)	1
  (0, 53067)	3
  :	:
  (24998, 78739)	1
  (24998, 64676)	2
  (24999, 49626)	2
  (24999, 80001)	2
  (24999, 57090)	1
  (24999, 49954)	1
  (24999, 31395)	1
  (24999, 72789)	1
  (24999, 38618)	1
  (24999, 64578)	1
  (24999, 82448)	1
  (24999, 51970)	1
  (24999, 41382)	1
  (24999, 46715)	1
  (24999, 63926)	1
  (24999, 33335)	1
  (24999, 26230)	1
  (24999, 36521)	1
  (24999, 48601)	1
  (24999, 3525)	1
  (24999, 80269)	1
  (24999, 12480)	1
  (24999, 51993)	1
  (24999, 72329)	1
  (24999, 19578)	1


In [17]:
#Feature Vectorization (Bigram)
vectorizer_bigram = CountVectorizer(stop_words = stop_words, ngram_range=(1,2), max_features = None)

X_train_counts_bigram = vectorizer_bigram.fit_transform(reviews_train)
X_test_counts_bigram = vectorizer_bigram.transform(reviews_test)

print(X_train_counts_bigram.shape)
print(X_train_counts_bigram)
#print(vectorizer_bigram.get_feature_names())

(25000, 1788926)
  (0, 1779801)	1
  (0, 1728851)	2
  (0, 1026100)	1
  (0, 85404)	1
  (0, 601879)	1
  (0, 972813)	1
  (0, 299608)	2
  (0, 1611947)	2
  (0, 1268660)	3
  (0, 809617)	3
  (0, 1574956)	1
  (0, 571570)	1
  (0, 627267)	1
  (0, 931363)	1
  (0, 768432)	1
  (0, 1674249)	1
  (0, 1146420)	1
  (0, 968664)	1
  (0, 1247843)	1
  (0, 1476007)	1
  (0, 1452565)	2
  (0, 1384059)	1
  (0, 1677473)	2
  (0, 1552026)	3
  (0, 461569)	2
  :	:
  (24999, 524869)	1
  (24999, 125362)	1
  (24999, 1579413)	1
  (24999, 1087997)	1
  (24999, 1037485)	1
  (24999, 528217)	1
  (24999, 1211676)	1
  (24999, 1276154)	1
  (24999, 701014)	1
  (24999, 667142)	1
  (24999, 125364)	1
  (24999, 1696612)	1
  (24999, 1211677)	1
  (24999, 910885)	1
  (24999, 121673)	1
  (24999, 690998)	1
  (24999, 42680)	1
  (24999, 171599)	1
  (24999, 966393)	1
  (24999, 1085926)	1
  (24999, 361923)	1
  (24999, 620623)	1
  (24999, 1638954)	1
  (24999, 1086124)	1
  (24999, 354342)	1


In [10]:
#Feature Vectorization (TF-IDF)
from sklearn.feature_extraction.text import TfidfTransformer 
tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

print(X_train_tfidf.shape)
print(X_train_tfidf)

(25000, 83709)
  (0, 82964)	0.05221177274727149
  (0, 82844)	0.07972538428518645
  (0, 82300)	0.03271402497608624
  (0, 80001)	0.04402710997524492
  (0, 78581)	0.05695791592330504
  (0, 78348)	0.13524158505920802
  (0, 76757)	0.04010159789938834
  (0, 74746)	0.029104583514764668
  (0, 74527)	0.061070153767845387
  (0, 74412)	0.04741473491377038
  (0, 73231)	0.11324688717515709
  (0, 72086)	0.07724353249348244
  (0, 71976)	0.06766505065739878
  (0, 70744)	0.09436254444676614
  (0, 69573)	0.13685638381842014
  (0, 67599)	0.06700810382774898
  (0, 67421)	0.08142218998376172
  (0, 65626)	0.04834264932275125
  (0, 63038)	0.07028142865923703
  (0, 62959)	0.04712401462116071
  (0, 59609)	0.04767561632456453
  (0, 59507)	0.06501747020756085
  (0, 57561)	0.07093659985164538
  (0, 57090)	0.046901116291481935
  (0, 56285)	0.1218237897686146
  :	:
  (24998, 1273)	0.09800989033177318
  (24998, 213)	0.1025315157384896
  (24999, 82448)	0.1706071270253806
  (24999, 80269)	0.18296777718529145
  (24999,
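
The TF-IDF weights printed above can be reproduced on a toy count matrix. The sketch assumes the transformer's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the function name `tfidf` and the toy counts are illustrative.

```python
import math

def tfidf(count_rows):
    """Smoothed TF-IDF with L2 row normalization for a dense count matrix."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of documents containing each term
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in count_rows:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result

counts = [[2, 1, 0],
          [1, 0, 1]]
weights = tfidf(counts)
print(weights)
```

In the second row both terms occur once, but the term found in only one document receives the larger weight, which is the intended effect of the IDF factor.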

# Build the Classifier Models



In [11]:
X_train = reviews_train
X_test = reviews_test


In [131]:
#from sklearn.model_selection import train_test_split

#target = [1 if i < 12500 else 0 for i in range(25000)]
#X_train, X_val, y_train, y_val = train_test_split(X_train_counts, target, train_size=0.8, test_size = 0.2)

#print(X_train.shape)
#print(X_val.shape)
#print(X_val.dtype)

## Build a Pipeline for Cross Validation

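Each classifier below is wrapped in a scikit-learn Pipeline (count vectorizer, TF-IDF transformer, classifier), and its hyperparameters are tuned with GridSearchCV using 5-fold cross-validation on the training set.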
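
The idea behind 5-fold cross-validation can be sketched with plain index arithmetic. This is a simplified illustration with contiguous folds; for classifiers, scikit-learn additionally stratifies the folds by class, which the hypothetical helper `k_fold_indices` below does not do.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, validation_indices) for contiguous k-fold splits."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

folds = list(k_fold_indices(25000, n_folds=5))
for train, validation in folds:
    print(len(train), len(validation))
```

With 25,000 training reviews, each candidate model is fitted five times on 20,000 reviews and scored on the remaining 5,000; the reported mean and standard deviation summarize those five scores.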

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

## Logistic Regression

In [21]:
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([ 
    ('vectorizer', CountVectorizer(stop_words = stop_words, ngram_range = (1,2), max_features = None)),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression())
])

#tol parameter: Tolerance for the stopping criteria
#penalty parameter: Norm used in the penalization ('none' disables regularization; newer scikit-learn releases expect None instead of the string 'none')
#max_iter parameter: Maximum number of iterations
parameters = {'clf__tol': [1e-1, 1e-2, 1e-3, 1e-4],
             'clf__penalty': ['l2','none'],
             'clf__max_iter': [5000 , 10000]}
n_folds = 5

LR_GridSearch = GridSearchCV(pipeline, param_grid = parameters, cv=n_folds)
LR_GridSearch.fit(X_train, y_train)

scores = LR_GridSearch.cv_results_['mean_test_score']
scores_std = LR_GridSearch.cv_results_['std_test_score']

print('scores:',scores)
print('scores_std',scores_std)

print("Best score: %0.3f" % LR_GridSearch.best_score_)

print("\n Best Parameter Values: ")
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, LR_GridSearch.best_params_[param_name]))

scores: [0.88024 0.88024 0.88024 0.88024 0.89796 0.89768 0.8974  0.89832 0.88024
 0.88024 0.88024 0.88024 0.89796 0.89768 0.8974  0.89832]
scores_std [0.00392408 0.00392408 0.00392408 0.00392408 0.00351773 0.0034862
 0.00379684 0.00340963 0.00392408 0.00392408 0.00392408 0.00392408
 0.00351773 0.0034862  0.00379684 0.00340963]
Best score: 0.898

 Best Parameter Values: 
clf__max_iter: 5000
clf__penalty: 'none'
clf__tol: 0.0001
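
The 16 printed scores correspond to the 4 x 2 x 2 parameter combinations. The sketch below pairs each score with its combination, assuming the grid is enumerated the way scikit-learn's `ParameterGrid` does it (keys sorted alphabetically, last key varying fastest); the scores are copied from the output above.

```python
from itertools import product

parameters = {'clf__tol': [1e-1, 1e-2, 1e-3, 1e-4],
              'clf__penalty': ['l2', 'none'],
              'clf__max_iter': [5000, 10000]}

# Mean cross-validation scores copied from the grid search output above
scores = [0.88024, 0.88024, 0.88024, 0.88024, 0.89796, 0.89768, 0.8974, 0.89832,
          0.88024, 0.88024, 0.88024, 0.88024, 0.89796, 0.89768, 0.8974, 0.89832]

# Assumed enumeration order: sorted keys, last key varying fastest
keys = sorted(parameters)
grid = [dict(zip(keys, values))
        for values in product(*(parameters[k] for k in keys))]

best_index = max(range(len(scores)), key=scores.__getitem__)
best = grid[best_index]
print(len(grid), best)
```

Read this way, the scores do not depend on `max_iter` at all, and `tol` only matters for the unregularized model, so the grid could be made much smaller without changing the outcome.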


In [22]:
pipeline = Pipeline([ 
    ('vectorizer', CountVectorizer(stop_words = stop_words, ngram_range = (1,2), max_features = None)),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(penalty = 'none', max_iter = 5000, tol = 0.0001))
])

pipeline.fit(X_train, y_train)
y_predLR = pipeline.predict(X_test)

print("Predicted: ", y_predLR)

print("Accuracy: ", metrics.accuracy_score(y_test, y_predLR)*100, "%")

Predicted:  [0 0 0 ... 1 0 1]
Accuracy:  88.604 %
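
Accuracy is simply the fraction of test reviews whose predicted label matches the true label. A minimal sketch with made-up labels (the function name `accuracy` is illustrative; the notebook itself uses `metrics.accuracy_score`):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred) * 100, "%")
```

On the 25,000 test reviews, an accuracy of 88.604 % corresponds to 22,151 correctly classified reviews.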


In [23]:
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([ 
    ('vectorizer', CountVectorizer(stop_words = stop_words, ngram_range = (1,1), max_features = None)),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression())
])

#Tol parameter:Tolerance for stopping criteria
#penalty parameter:Used to specify the norm used in the penalization
#max_iter parameter: Maximum number of iterations
parameters = {'clf__tol': [1e-1, 1e-2, 1e-3, 1e-4],
             'clf__penalty': ['l2','none'],
             'clf__max_iter': [5000 , 10000]}
n_folds = 5

LR_GridSearch = GridSearchCV(pipeline, param_grid = parameters, cv=n_folds)
LR_GridSearch.fit(X_train, y_train)

scores = LR_GridSearch.cv_results_['mean_test_score']
scores_std = LR_GridSearch.cv_results_['std_test_score']

print('scores:',scores)
print('scores_std',scores_std)

print("Best score: %0.3f" % LR_GridSearch.best_score_)

print("\n Best Parameter Values: ")
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, LR_GridSearch.best_params_[param_name]))

scores: [0.8888  0.88884 0.88884 0.88884 0.87696 0.87632 0.87592 0.87572 0.8888
 0.88884 0.88884 0.88884 0.87696 0.87632 0.87592 0.87572]
scores_std [0.00217624 0.00217035 0.00217035 0.00217035 0.00317591 0.00400719
 0.00438835 0.0042757  0.00217624 0.00217035 0.00217035 0.00217035
 0.00317591 0.00400719 0.00438835 0.0042757 ]
Best score: 0.889

 Best Parameter Values: 
clf__max_iter: 5000
clf__penalty: 'l2'
clf__tol: 0.01


In [15]:
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([ 
    ('vectorizer', CountVectorizer(stop_words = stop_words, ngram_range = (1,1), max_features = None)),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(penalty = 'l2', max_iter = 5000, tol = 0.01))
])

pipeline.fit(X_train, y_train)
y_predLR = pipeline.predict(X_test)

print("Predicted: ", y_predLR)

print("Accuracy: ", metrics.accuracy_score(y_test, y_predLR)*100, "%")

Predicted:  [0 0 1 ... 0 0 1]
Accuracy:  88.096 %


### Check Which Features (Words) Were the Most Valuable for Classification as Positive or Negative

In [27]:
clf_LR = LogisticRegression(penalty = 'l2', max_iter = 5000, tol = 0.01)
clf_LR.fit(X_train_counts, y_train)
pred = clf_LR.predict(X_test_counts)

#Note: newer scikit-learn releases provide get_feature_names_out() in place of get_feature_names()
feature_to_coef = {word: coef for word, coef in zip(vectorizer.get_feature_names(), clf_LR.coef_[0])}

print("Words most likely indicating a positive review: ")
for most_positive in sorted(feature_to_coef.items(), key=lambda x: x[1], reverse=True)[:20]:
    print (most_positive)
  
print("\n Words most likely indicating a negative review: ")
for most_negative in sorted(feature_to_coef.items(), key=lambda x: x[1])[:20]:
    print (most_negative)

Words most likely indicating a positive review: 
('refreshing', 1.7175049814884682)
('wonderfully', 1.422528422037771)
('flawless', 1.3984179262985494)
('funniest', 1.3903952602260683)
('carrey', 1.3737672265950387)
('superb', 1.3370703605562042)
('excellent', 1.3369896009500923)
('whoopi', 1.2920025417369037)
('erotic', 1.2612708532873178)
('perfect', 1.2482001813653578)
('appreciated', 1.2380654331689525)
('highly', 1.2351227424279925)
('vengeance', 1.2303834876128503)
('rare', 1.221420058662947)
('kurosawa', 1.211920500426305)
('squirrel', 1.2033936929458637)
('hooked', 1.1939963892826)
('surprisingly', 1.1890965903955188)
('kitty', 1.1757888998076977)
('favorite', 1.1753785731595447)

 Words most likely indicating a negative review: 
('disappointment', -2.1434870063824376)
('worst', -2.1100052264304145)
('waste', -2.1001727417208125)
('poorly', -1.8327431499331674)
('awful', -1.7110023549387137)
('fails', -1.5656426308464777)
('disappointing', -1.5547261389013962)
('avoid', -1.5018

In [26]:
clf_LR = LogisticRegression(penalty = 'l2', max_iter = 5000, tol = 0.01)
clf_LR.fit(X_train_counts_bigram, y_train)
pred = clf_LR.predict(X_test_counts_bigram)

#Note: newer scikit-learn releases provide get_feature_names_out() in place of get_feature_names()
feature_to_coef = {word: coef for word, coef in zip(vectorizer_bigram.get_feature_names(), clf_LR.coef_[0])}

print("Words most likely indicating a positive review: ")
for most_positive in sorted(feature_to_coef.items(), key=lambda x: x[1], reverse=True)[:20]:
    print (most_positive)
  
print("\n Words most likely indicating a negative review: ")
for most_negative in sorted(feature_to_coef.items(), key=lambda x: x[1])[:20]:
    print (most_negative)

Words most likely indicating a positive review: 
('excellent', 1.3904501045454358)
('perfect', 1.176889254407725)
('wonderful', 1.097221774949489)
('superb', 1.0786629665769676)
('amazing', 1.014681761894561)
('favorite', 0.9990117121247385)
('must see', 0.9826868317690972)
('funniest', 0.9800389023288956)
('10 10', 0.9637499218609806)
('wonderfully', 0.9188874929486968)
('rare', 0.8980345389760819)
('enjoyable', 0.8831662342029309)
('refreshing', 0.8777194815541556)
('brilliant', 0.8775231200083236)
('well worth', 0.8585792221178361)
('enjoyed', 0.8543388620257607)
('fantastic', 0.8445487505698687)
('highly', 0.8128121785445597)
('loved', 0.8056488353351996)
('hilarious', 0.7983441993322362)

 Words most likely indicating a negative review: 
('worst', -1.9978069635398459)
('awful', -1.6288989479826508)
('waste', -1.6010621540565426)
('boring', -1.4560925479867468)
('disappointment', -1.407918124782598)
('poorly', -1.3200912478926778)
('dull', -1.2420047097459963)
('disappointing', -1.
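
One way to read these coefficients: since the model here is fitted on raw counts, each additional occurrence of a word multiplies the odds of a positive prediction by `exp(coefficient)`. The sketch below applies this to a few coefficients copied (rounded) from the unigram output above; the function name `odds_multiplier` is illustrative.

```python
import math

def odds_multiplier(coef):
    """Factor by which one extra occurrence of a word multiplies the odds of 'positive'."""
    return math.exp(coef)

# Coefficients copied (rounded) from the unigram output above
coefs = {"refreshing": 1.7175, "excellent": 1.3370, "worst": -2.1100, "waste": -2.1002}
multipliers = {word: odds_multiplier(c) for word, c in coefs.items()}

for word, m in sorted(multipliers.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: x{m:.2f}")
```

Multipliers above 1 push the prediction towards positive and multipliers below 1 towards negative, holding the other word counts fixed.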

## Multinomial Naive Bayes


In [28]:
from sklearn.naive_bayes import MultinomialNB

pipelineMNB = Pipeline([ 
    ('vectorizer', CountVectorizer(stop_words = stop_words, ngram_range = (1,1), max_features = None)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())
])

#alpha parameter: Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing)
parameters = {'clf__alpha': [1, 0.5, 0.2, 0.1]}
n_folds = 5


MNB_GridSearch = GridSearchCV(pipelineMNB, param_grid = parameters, cv=n_folds)
MNB_GridSearch.fit(X_train, y_train)

scores = MNB_GridSearch.cv_results_['mean_test_score']
scores_std = MNB_GridSearch.cv_results_['std_test_score']

print('scores:',scores)
print('scores_std',scores_std)

print("Best score: %0.3f" % MNB_GridSearch.best_score_)

print("\n Best Parameter Values: ")
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, MNB_GridSearch.best_params_[param_name]))

scores: [0.86408 0.86496 0.86344 0.86096]
scores_std [0.00218577 0.00301038 0.00298369 0.00252159]
Best score: 0.865

 Best Parameter Values: 
clf__alpha: 0.5
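
The role of `alpha` can be shown on a toy vocabulary. With additive smoothing, the per-class word likelihood is estimated as `(count + alpha) / (total + alpha * vocabulary_size)`. The counts and the function name `smoothed_likelihoods` below are made up for illustration.

```python
def smoothed_likelihoods(word_counts, alpha=1.0):
    """Estimate P(word | class) with additive (Laplace/Lidstone) smoothing."""
    total = sum(word_counts.values())
    vocab_size = len(word_counts)
    return {word: (count + alpha) / (total + alpha * vocab_size)
            for word, count in word_counts.items()}

# Hypothetical word counts within the positive class
counts_pos = {"great": 3, "boring": 0, "movie": 5}
unsmoothed = smoothed_likelihoods(counts_pos, alpha=0.0)
smoothed = smoothed_likelihoods(counts_pos, alpha=0.5)
print(unsmoothed)
print(smoothed)
```

Without smoothing, a word never seen in a class gets probability zero and would veto that class for any review containing it; a positive `alpha` avoids this.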


In [29]:
pipelineMNB = Pipeline([ 
    ('vectorizer', CountVectorizer(stop_words = stop_words, ngram_range = (1,1), max_features = None)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB(alpha = 0.5))
])

pipelineMNB.fit(X_train, y_train)
y_predMNB = pipelineMNB.predict(X_test)

print("Predicted: ", y_predMNB)

print("Accuracy: ", metrics.accuracy_score(y_test, y_predMNB)*100, "%")

Predicted:  [0 0 1 ... 0 0 0]
Accuracy:  82.512 %


## SGDClassifier

In [30]:
from sklearn.linear_model import SGDClassifier

pipelineSGD = Pipeline([ 
    ('vectorizer', CountVectorizer(stop_words = stop_words, ngram_range = (1,1), max_features = None)),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier())
])

#loss parameter: The loss function to be used ('hinge' gives a linear SVM, 'log' gives logistic regression; newer scikit-learn releases name the latter 'log_loss')
#penalty parameter: The penalty (aka regularization term) to be used
#alpha parameter: Constant that multiplies the regularization term
#max_iter parameter: The maximum number of passes over the training data (aka epochs)
parameters = {'clf__loss': ['hinge', 'log'],
             'clf__penalty': ['l2'],
             'clf__alpha': [1e-1, 1e-2, 1e-3, 1e-4],
             'clf__max_iter': [60, 80, 100]}
n_folds = 5


SGD_GridSearch = GridSearchCV(pipelineSGD, param_grid = parameters, cv=n_folds)
SGD_GridSearch.fit(X_train, y_train)

scores = SGD_GridSearch.cv_results_['mean_test_score']
scores_std = SGD_GridSearch.cv_results_['std_test_score']

print('scores:',scores)
print('scores_std',scores_std)

print("Best score: %0.3f" % SGD_GridSearch.best_score_)

print("\n Best Parameter Values: ")
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, SGD_GridSearch.best_params_[param_name]))


scores: [0.53756 0.55516 0.59668 0.6054  0.6542  0.64244 0.67712 0.71472 0.64604
 0.8028  0.81044 0.81304 0.8504  0.851   0.85096 0.84344 0.84456 0.84464
 0.89012 0.89068 0.89032 0.8814  0.88096 0.8812 ]
scores_std [0.06226876 0.07351771 0.11496473 0.0693687  0.11526158 0.11890537
 0.12176335 0.07459358 0.09064978 0.01504074 0.00612849 0.00346733
 0.00498799 0.00528999 0.00548948 0.00321098 0.00358307 0.00265149
 0.00228945 0.00315113 0.00248065 0.0026803  0.00215184 0.0015748 ]
Best score: 0.891

 Best Parameter Values: 
clf__alpha: 0.0001
clf__loss: 'hinge'
clf__max_iter: 80
clf__penalty: 'l2'
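
The two losses compared in the grid differ in how they penalize a single example. The sketch below writes both for a label `y` in {-1, +1} and a raw decision score; the function names are illustrative and the example scores are made up.

```python
import math

def hinge_loss(y, score):
    """Hinge loss: zero once the example is on the correct side with margin >= 1."""
    return max(0.0, 1.0 - y * score)

def log_loss(y, score):
    """Logistic loss: always positive, shrinking as the margin grows."""
    return math.log(1.0 + math.exp(-y * score))

for y, score in [(1, 2.0), (1, 0.5), (-1, 2.0)]:
    print(y, score, hinge_loss(y, score), round(log_loss(y, score), 4))
```

Hinge loss ignores examples that are already classified with a comfortable margin, while the logistic loss keeps pulling on every example a little.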


In [31]:
pipelineSGD = Pipeline([ 
    ('vectorizer', CountVectorizer(stop_words = stop_words, ngram_range = (1,1), max_features = None)),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-4, max_iter=80))
])

pipelineSGD.fit(X_train, y_train)

y_predSGD = pipelineSGD.predict(X_test)
print("Predicted: ", y_predSGD)

print("Accuracy: ", metrics.accuracy_score(y_test, y_predSGD)*100, "%")

Predicted:  [0 0 0 ... 1 1 1]
Accuracy:  88.056 %


## Decision Tree Classifier

In [32]:
from sklearn.tree import DecisionTreeClassifier

pipeline_tree = Pipeline([ 
    ('vectorizer', CountVectorizer(stop_words = stop_words, ngram_range = (1,1), max_features = None)),
    ('tfidf', TfidfTransformer()),
    ('clf', DecisionTreeClassifier())
])

# criterion: the function used to measure the quality of a split.
# max_depth: the maximum depth of the tree.
# max_features: the number of features to consider when looking for the best split.
#   'sqrt' -> sqrt(n_features); 'log2' -> log2(n_features); None -> n_features.
parameters = {'clf__criterion': ['gini', 'entropy'],
             'clf__max_depth': [None, 5, 10, 20, 40],
             'clf__max_features': ['sqrt', 'log2', None]}
n_folds = 5


tree_GridSearch = GridSearchCV(pipeline_tree, param_grid = parameters, cv=n_folds)
tree_GridSearch.fit(X_train, y_train)

scores = tree_GridSearch.cv_results_['mean_test_score']
scores_std = tree_GridSearch.cv_results_['std_test_score']

print('scores:',scores)
print('scores_std',scores_std)

print("Best score: %0.3f" % tree_GridSearch.best_score_)

print("\n Best Parameter Values: ")
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, tree_GridSearch.best_params_[param_name]))

scores: [0.64208 0.58956 0.71524 0.54756 0.51076 0.68928 0.57996 0.50916 0.72696
 0.61656 0.51984 0.73632 0.62512 0.5372  0.7276  0.64596 0.59612 0.71444
 0.54736 0.51152 0.68672 0.56228 0.51352 0.71812 0.61328 0.52344 0.72784
 0.62428 0.54832 0.72232]
scores_std [0.00888648 0.00891058 0.00523129 0.01728602 0.00871725 0.00839771
 0.01613414 0.00632253 0.00868691 0.02759896 0.0123699  0.00573495
 0.01093881 0.00734956 0.00473793 0.01114587 0.01246281 0.00713683
 0.01431721 0.01056152 0.00827681 0.01015843 0.01425348 0.00952605
 0.01956552 0.00755608 0.00684503 0.01722491 0.0175009  0.00669758]
Best score: 0.736

 Best Parameter Values: 
clf__criterion: 'gini'
clf__max_depth: 20
clf__max_features: None


In [16]:
from sklearn.tree import DecisionTreeClassifier

pipeline_tree = Pipeline([ 
    ('vectorizer', CountVectorizer(stop_words = stop_words, ngram_range = (1,1), max_features = None)),
    ('tfidf', TfidfTransformer()),
    ('clf', DecisionTreeClassifier(criterion ='gini', max_depth = 20, max_features = None))
])


pipeline_tree.fit(X_train, y_train)

y_pred_tree = pipeline_tree.predict(X_test)
print("Predicted: ", y_pred_tree)

print("Accuracy: ", metrics.accuracy_score(y_test, y_pred_tree)*100, "%")

Predicted:  [0 0 1 ... 1 0 1]
Accuracy:  73.78 %
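
The summary at the end of this notebook reports 83.072 % recall and about 70.05 % precision for these same decision tree predictions. As a rough worked check, the sketch below backs out the confusion-matrix counts that those figures imply. It assumes the test set holds 12,500 positive and 12,500 negative reviews, as the dataset description states. The variable names are illustrative only.

```python
# Figures reported for the decision tree in this notebook, as fractions.
recall = 0.83072
precision = 0.700532955542063

n_test = 25000
n_pos = 12500  # assumption: balanced test set, per the dataset description

# recall = TP / (TP + FN), and TP + FN is the number of positive reviews
tp = round(recall * n_pos)
fn = n_pos - tp

# precision = TP / (TP + FP), so TP / precision is the number predicted positive
predicted_pos = round(tp / precision)
fp = predicted_pos - tp
tn = (n_test - n_pos) - fp

accuracy = (tp + tn) / n_test
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
print("Implied accuracy:", accuracy * 100, "%")
```

The implied accuracy matches the 73.78 % printed above. The counts also suggest the tree labels noticeably more than half of the test reviews as positive, which would explain why its recall is high while its precision is low.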


### Support Vector Machines

In [34]:
from sklearn.svm import LinearSVC

pipelineSVM = Pipeline([ 
    ('vectorizer', CountVectorizer(stop_words = stop_words, ngram_range = (1,1), max_features = None)),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC())
])

# penalty: the norm used in the penalization.
# tol: tolerance for the stopping criterion.
# C: regularization parameter; the strength of the regularization is inversely proportional to C.
# max_iter: the maximum number of iterations to run.
parameters = {'clf__penalty': ['l2'],
             'clf__tol': [1e-1, 1e-2, 1e-3, 1e-4],
              'clf__C': [1.0, 0.5],
             'clf__max_iter': [1000, 1500, 2000]}
n_folds = 5


SVM_GridSearch = GridSearchCV(pipelineSVM, param_grid = parameters, cv=n_folds)
SVM_GridSearch.fit(X_train, y_train)

scores = SVM_GridSearch.cv_results_['mean_test_score']
scores_std = SVM_GridSearch.cv_results_['std_test_score']

print('scores:',scores)
print('scores_std',scores_std)

print("Best score: %0.3f" % SVM_GridSearch.best_score_)

print("\n Best Parameter Values: ")
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, SVM_GridSearch.best_params_[param_name]))

scores: [0.89028 0.89032 0.89032 0.89032 0.89036 0.89028 0.89032 0.89032 0.89028
 0.89032 0.89032 0.89032 0.89372 0.8938  0.8938  0.89376 0.89372 0.89384
 0.8938  0.8938  0.89408 0.89384 0.8938  0.8938 ]
scores_std [0.00261258 0.00292192 0.00281311 0.00281311 0.00303947 0.00281879
 0.00281311 0.00281311 0.00286664 0.00281311 0.00281311 0.00281311
 0.00106283 0.0010198  0.0010198  0.00099116 0.00106283 0.00106883
 0.0010198  0.0010198  0.00108517 0.00106883 0.0010198  0.0010198 ]
Best score: 0.894

 Best Parameter Values: 
clf__C: 0.5
clf__max_iter: 2000
clf__penalty: 'l2'
clf__tol: 0.1
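
The SVM scores above differ only in the third or fourth decimal place, so it is worth checking which settings are really distinguishable. The sketch below keeps every combination whose mean score lies within one standard deviation of the best one. This is a rough rule of thumb, not a formal statistical test. It assumes the scores follow scikit-learn's `ParameterGrid` ordering (keys sorted alphabetically, last key varying fastest). The names `combos`, `threshold` and `near_best` are introduced here for illustration.

```python
from itertools import product

# Grid, mean CV scores and standard deviations copied from the SVM search above.
parameters = {'clf__penalty': ['l2'],
              'clf__tol': [1e-1, 1e-2, 1e-3, 1e-4],
              'clf__C': [1.0, 0.5],
              'clf__max_iter': [1000, 1500, 2000]}
scores = [0.89028, 0.89032, 0.89032, 0.89032, 0.89036, 0.89028,
          0.89032, 0.89032, 0.89028, 0.89032, 0.89032, 0.89032,
          0.89372, 0.8938, 0.8938, 0.89376, 0.89372, 0.89384,
          0.8938, 0.8938, 0.89408, 0.89384, 0.8938, 0.8938]
scores_std = [0.00261258, 0.00292192, 0.00281311, 0.00281311, 0.00303947, 0.00281879,
              0.00281311, 0.00281311, 0.00286664, 0.00281311, 0.00281311, 0.00281311,
              0.00106283, 0.0010198, 0.0010198, 0.00099116, 0.00106283, 0.00106883,
              0.0010198, 0.0010198, 0.00108517, 0.00106883, 0.0010198, 0.0010198]

# Assumed ordering: keys sorted alphabetically, last key varying fastest.
keys = sorted(parameters)
combos = [dict(zip(keys, values))
          for values in product(*(parameters[k] for k in keys))]

# Index of the best mean score, and the cut-off one standard deviation below it.
best = max(range(len(scores)), key=scores.__getitem__)
threshold = scores[best] - scores_std[best]

near_best = [combos[i] for i, s in enumerate(scores) if s >= threshold]
print(len(near_best), "of", len(scores), "settings are within one std of the best")
print("C values among them:", sorted({c['clf__C'] for c in near_best}))
```

By this measure only `clf__C` separates the candidates. The choices of `clf__tol` and `clf__max_iter` fall within the cross-validation noise.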


In [17]:
from sklearn.svm import LinearSVC

pipelineSVM = Pipeline([ 
    ('vectorizer', CountVectorizer(stop_words = stop_words, ngram_range = (1,1), max_features = None)),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(C = 0.5, max_iter = 2000, penalty = 'l2', tol = 0.1))
])


pipelineSVM.fit(X_train, y_train)

y_predSVM = pipelineSVM.predict(X_test)
print("Predicted: ", y_predSVM)

print("Accuracy: ", metrics.accuracy_score(y_test, y_predSVM)*100, "%")

Predicted:  [0 0 1 ... 0 0 1]
Accuracy:  87.532 %


### AdaBoost

In [35]:
from sklearn.ensemble import AdaBoostClassifier

pipelineADA = Pipeline([ 
    ('vectorizer', CountVectorizer(stop_words = stop_words, ngram_range = (1,1), max_features = None)),
    ('tfidf', TfidfTransformer()),
    ('clf', AdaBoostClassifier())
])

# base_estimator: the base estimator from which the boosted ensemble is built
#   (renamed to `estimator` in scikit-learn 1.2).
# n_estimators: the maximum number of estimators at which boosting is terminated.
#   In case of a perfect fit, the learning procedure is stopped early.
# learning_rate: shrinks the contribution of each classifier by learning_rate.
#   There is a trade-off between learning_rate and n_estimators.
parameters = {'clf__base_estimator': [None],
              'clf__n_estimators': [50, 100, 150, 200],
             'clf__learning_rate': [0.1, 0.5, 1]}
n_folds = 5

Ada_GridSearch = GridSearchCV(pipelineADA, param_grid = parameters, cv=n_folds)
Ada_GridSearch.fit(X_train, y_train)

scores = Ada_GridSearch.cv_results_['mean_test_score']
scores_std = Ada_GridSearch.cv_results_['std_test_score']

print('scores:',scores)
print('scores_std',scores_std)

print("Best score: %0.3f" % Ada_GridSearch.best_score_)

print("\n Best Parameter Values: ")
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, Ada_GridSearch.best_params_[param_name]))

scores: [0.72824 0.76388 0.78176 0.7922  0.7962  0.82464 0.83644 0.84184 0.79892
 0.82396 0.83032 0.83868]
scores_std [0.01028039 0.00997806 0.01042413 0.01077998 0.00930892 0.00906633
 0.00596577 0.00494514 0.00610652 0.00625127 0.00623872 0.00408039]
Best score: 0.842

 Best Parameter Values: 
clf__base_estimator: None
clf__learning_rate: 0.5
clf__n_estimators: 200


In [13]:
from sklearn.ensemble import AdaBoostClassifier
# Note: n_estimators = 500 lies outside the grid searched above, where the best value was 200.
pipelineADA = Pipeline([ 
    ('vectorizer', CountVectorizer(stop_words = stop_words, ngram_range = (1,1), max_features = None)),
    ('tfidf', TfidfTransformer()),
    ('clf', AdaBoostClassifier(base_estimator = None, n_estimators = 500, learning_rate = 0.5))
])

pipelineADA.fit(X_train, y_train)

y_pred_Ada = pipelineADA.predict(X_test)
print("Predicted: ", y_pred_Ada)

print("Accuracy: ", metrics.accuracy_score(y_test, y_pred_Ada)*100, "%")

Predicted:  [0 1 1 ... 1 1 1]
Accuracy:  86.064 %


### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

pipelineForest = Pipeline([ 
    ('vectorizer', CountVectorizer(stop_words = stop_words, ngram_range = (1,1), max_features = None)),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier())
])

# n_estimators: the number of trees in the forest.
# criterion: the function used to measure the quality of a split (gini or entropy).
# max_depth: the maximum depth of each tree.
# max_features: the number of features to consider when looking for the best split.
#   'sqrt' -> sqrt(n_features) (same as 'auto' in older scikit-learn versions);
#   'log2' -> log2(n_features); None -> n_features.
# bootstrap: whether bootstrap samples are used when building trees.
#   If False, the whole dataset is used to build each tree.
parameters = {'clf__n_estimators': [50, 70, 90],
             'clf__criterion': ['gini', 'entropy'],
             'clf__max_depth': [None, 10, 20],
             'clf__max_features': ['sqrt', 'log2', None],
             'clf__bootstrap': [True, False]}
n_folds = 5

forest_GridSearch = GridSearchCV(pipelineForest, param_grid = parameters, cv=n_folds)
forest_GridSearch.fit(X_train, y_train)

scores = forest_GridSearch.cv_results_['mean_test_score']
scores_std = forest_GridSearch.cv_results_['std_test_score']

print('scores:',scores)
print('scores_std',scores_std)

print("Best score: %0.3f" % forest_GridSearch.best_score_)

print("\n Best Parameter Values: ")
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, forest_GridSearch.best_params_[param_name]))

In [14]:
from sklearn.ensemble import RandomForestClassifier

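# Note: no output is recorded for the grid search above, and n_estimators = 120
# lies outside the searched grid (50, 70, 90).
pipelineForest = Pipeline([ 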
    ('vectorizer', CountVectorizer(stop_words = stop_words, ngram_range = (1,1), max_features = None)),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier(n_estimators = 120, criterion = 'entropy', max_depth = None, max_features = 'sqrt', bootstrap = False))
])

pipelineForest.fit(X_train, y_train)

y_pred_forest = pipelineForest.predict(X_test)
print("Predicted: ", y_pred_forest)

print("Accuracy: ", metrics.accuracy_score(y_test, y_pred_forest)*100, "%")

Predicted:  [0 1 1 ... 1 0 1]
Accuracy:  85.32 %


## Analyze Which Model Is the Best

In [18]:
# Report accuracy, recall and precision on the test set for each fitted model
predictions = {"Logistic Regression": y_predLR,
               "Decision Tree": y_pred_tree,
               "SVM": y_predSVM,
               "Adaboost": y_pred_Ada,
               "Random Forest": y_pred_forest}

for name, y_pred in predictions.items():
    print(f"{name} Accuracy: ", metrics.accuracy_score(y_test, y_pred)*100, "%")
    print(f"{name} Recall: ", metrics.recall_score(y_test, y_pred)*100, "%")
    print(f"{name} Precision: ", metrics.precision_score(y_test, y_pred)*100, "%", '\n')

Logistic Regression Accuracy:  88.096 %
Logistic Regression Recall:  88.064 %
Logistic Regression Precision:  88.12039705411463 % 

Decision Tree Accuracy:  73.78 %
Decision Tree Recall:  83.072 %
Decision Tree Precision:  70.0532955542063 % 

SVM Accuracy:  87.532 %
SVM Recall:  86.504 %
SVM Precision:  88.31985624438454 % 

Adaboost Accuracy:  86.064 %
Adaboost Recall:  87.792 %
Adaboost Precision:  84.85926384163317 % 

Random Forest Accuracy:  85.32 %
Random Forest Recall:  84.896 %
Random Forest Precision:  85.6220751976763 % 
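
The printed figures are easier to compare side by side. The sketch below copies them into a small table, adds the F1 score (the harmonic mean of precision and recall) and sorts by accuracy. The F1 column is an addition made here and was not computed in the notebook. The SGD classifier is left out because only its accuracy (88.056 %) was reported.

```python
# (accuracy, recall, precision) in percent, copied from the printed summary above.
reported = {
    "Logistic Regression": (88.096, 88.064, 88.12039705411463),
    "Decision Tree": (73.78, 83.072, 70.0532955542063),
    "SVM": (87.532, 86.504, 88.31985624438454),
    "Adaboost": (86.064, 87.792, 84.85926384163317),
    "Random Forest": (85.32, 84.896, 85.6220751976763),
}


def f1(recall, precision):
    """Harmonic mean of precision and recall, in the same units as the inputs."""
    return 2 * precision * recall / (precision + recall)


# One row per model: (name, accuracy, F1), sorted by accuracy, best first.
table = sorted(
    ((name, acc, f1(rec, prec)) for name, (acc, rec, prec) in reported.items()),
    key=lambda row: row[1],
    reverse=True,
)

print(f"{'Model':<20} {'Accuracy':>9} {'F1':>7}")
for name, acc, f1_score in table:
    print(f"{name:<20} {acc:>9.2f} {f1_score:>7.2f}")
```

On these figures logistic regression ranks first by accuracy, with the linear SVM close behind, while the single decision tree trails the other models by a wide margin.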

