In [2]:
import pandas as pd
import numpy as np
import os

#### Data Load/Acquisition

In [3]:
temp_df=pd.read_csv('IMDB Dataset.csv')

In [4]:
temp_df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
temp_df.shape

(50000, 2)

In [6]:
temp_df=temp_df.iloc[:10000].copy()  # copy the slice so later in-place edits act on an independent frame

In [7]:
temp_df.shape

(10000, 2)

In [8]:
temp_df['review'][4]

'Petter Mattei\'s "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler\'s play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.<br /><br />The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case wit

In [9]:
# class distribution
temp_df['sentiment'].value_counts()

sentiment
positive    5028
negative    4972
Name: count, dtype: int64

In [10]:
# checking null values
temp_df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [11]:
# checking duplicate reviews
temp_df.duplicated().sum()

17

In [12]:
#dropping duplicate reviews
temp_df.drop_duplicates(inplace=True)

In [13]:
# checking again for duplicate reviews
temp_df.duplicated().sum()

0

#### Text Preprocessing

In [14]:
# basic text preprocessing
# 1. removing html tags

import re

def remove_tags(text):
    cleaned_text=re.sub(r'<.*?>','',text)
    return cleaned_text
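
Replacing each tag with an empty string joins the words on either side of a `<br />`, which is why later outputs contain tokens such as `methe` and `wordit`. A minimal sketch of an alternative, assuming a space is an acceptable replacement: `remove_tags_keep_spacing` is a hypothetical name, not what the notebook ran, and using it would change every downstream output and score.

```python
import re

# Hypothetical variant of remove_tags: substitute a space instead of an empty string
TAG_RE = re.compile(r'<.*?>')

def remove_tags_keep_spacing(text):
    no_tags = TAG_RE.sub(' ', text)
    # collapse the runs of whitespace left behind by consecutive tags
    return re.sub(r'\s+', ' ', no_tags).strip()

sample = "happened with me.<br /><br />The first thing"
print(remove_tags_keep_spacing(sample))  # happened with me. The first thing
```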

In [15]:
temp_df['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [16]:
temp_df['review']=temp_df['review'].apply(remove_tags)

In [17]:
temp_df['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

In [18]:
# basic text preprocessing
# 2. Converting text into lower case

temp_df['review']=temp_df['review'].apply(lambda x:x.lower())

In [19]:
temp_df['review'][0]

"one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.the first thing that struck me about oz was its brutality and unflinching scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timid. this show pulls no punches with regards to drugs, sex or violence. its is hardcore, in the classic use of the word.it is called oz as that is the nickname given to the oswald maximum security state penitentary. it focuses mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many..aryans, muslims, gangstas, latinos, christians, italians, irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.i would say the main appeal of the show is due to the fact that it goes where other shows wo

In [20]:
# basic text preprocessing
# 3. removing stopwords

import nltk
from nltk.corpus import stopwords

In [21]:
stopword_list=stopwords.words('english')

In [22]:
temp_df['review']=temp_df['review'].apply(lambda x:[i for i in x.split() if i not in stopword_list]).apply(lambda x:" ".join(x))
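
`stopword_list` is a Python list, so every `i not in stopword_list` check scans it linearly. A small sketch of the same filter using a set, which gives constant-time lookups on average. The `STOPWORDS` subset below is illustrative only; in the notebook it would be built from `stopwords.words('english')`.

```python
# Illustrative subset only; the notebook's list comes from nltk's stopwords.words('english')
STOPWORDS = frozenset({"of", "the", "other", "has", "that", "after", "just"})

def remove_stopwords(text, stopwords=STOPWORDS):
    # keep the words that are not stop words, preserving their order
    return " ".join(word for word in text.split() if word not in stopwords)

print(remove_stopwords("one of the other reviewers has mentioned that"))
# one reviewers mentioned
```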

In [23]:
temp_df['review'][0]

"one reviewers mentioned watching 1 oz episode hooked. right, exactly happened me.the first thing struck oz brutality unflinching scenes violence, set right word go. trust me, show faint hearted timid. show pulls punches regards drugs, sex violence. hardcore, classic use word.it called oz nickname given oswald maximum security state penitentary. focuses mainly emerald city, experimental section prison cells glass fronts face inwards, privacy high agenda. em city home many..aryans, muslims, gangstas, latinos, christians, italians, irish more....so scuffles, death stares, dodgy dealings shady agreements never far away.i would say main appeal show due fact goes shows dare. forget pretty pictures painted mainstream audiences, forget charm, forget romance...oz mess around. first episode ever saw struck nasty surreal, say ready it, watched more, developed taste oz, got accustomed high levels graphic violence. violence, injustice (crooked guards who'll sold nickel, inmates who'll kill order g

In [24]:
# basic text preprocessing
# 4. removing punctuation

import string

In [25]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [26]:
exclude=string.punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans('','',exclude))
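
Deleting punctuation outright merges words that were separated only by punctuation (`many..aryans` becomes `manyaryans` in the output further down). One possible alternative, sketched here under the assumption that splitting on punctuation is preferable, maps each punctuation character to a space. The function name is hypothetical, and contractions such as `who'll` would then split into `who ll`.

```python
import string

# Map every punctuation character to a space rather than deleting it
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def punctuation_to_space(text):
    # translate, then split/join to collapse repeated spaces
    return " ".join(text.translate(PUNCT_TO_SPACE).split())

print(punctuation_to_space("home many..aryans, muslims"))  # home many aryans muslims
```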

In [27]:
temp_df['review']=temp_df['review'].apply(remove_punctuation)

In [28]:
temp_df['review'][0]

'one reviewers mentioned watching 1 oz episode hooked right exactly happened methe first thing struck oz brutality unflinching scenes violence set right word go trust me show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use wordit called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home manyaryans muslims gangstas latinos christians italians irish moreso scuffles death stares dodgy dealings shady agreements never far awayi would say main appeal show due fact goes shows dare forget pretty pictures painted mainstream audiences forget charm forget romanceoz mess around first episode ever saw struck nasty surreal say ready it watched more developed taste oz got accustomed high levels graphic violence violence injustice crooked guards wholl sold nickel inmates wholl kill order get away it well mannered middle class inmates t

In [29]:
# basic text preprocessing
# 5. Stemming

from nltk.stem.porter import PorterStemmer

ps=PorterStemmer()

def stem_words(text):
    return " ".join([ps.stem(i) for i in text.split()])

In [30]:
temp_df['review']=temp_df['review'].apply(stem_words)

In [31]:
temp_df['review'][0]

'one review mention watch 1 oz episod hook right exactli happen meth first thing struck oz brutal unflinch scene violenc set right word go trust me show faint heart timid show pull punch regard drug sex violenc hardcor classic use wordit call oz nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda em citi home manyaryan muslim gangsta latino christian italian irish moreso scuffl death stare dodgi deal shadi agreement never far awayi would say main appeal show due fact goe show dare forget pretti pictur paint mainstream audienc forget charm forget romanceoz mess around first episod ever saw struck nasti surreal say readi it watch more develop tast oz got accustom high level graphic violenc violenc injustic crook guard wholl sold nickel inmat wholl kill order get away it well manner middl class inmat turn prison bitch due lack street skill prison experi watch oz may becom comfort uncomfor

In [32]:
temp_X=temp_df.iloc[:,0:1]
temp_y=temp_df['sentiment']

In [33]:
temp_X.head()

Unnamed: 0,review
0,one review mention watch 1 oz episod hook righ...
1,wonder littl product film techniqu unassum old...
2,thought wonder way spend time hot summer weeke...
3,basic there famili littl boy jake think there ...
4,petter mattei love time money visual stun film...


In [34]:
temp_y.head()

0    positive
1    positive
2    positive
3    negative
4    positive
Name: sentiment, dtype: object

In [35]:
# temp_y holds text labels ('positive'/'negative'); convert them to integers for modeling

from sklearn.preprocessing import LabelEncoder

encoder=LabelEncoder()

temp_y=encoder.fit_transform(temp_y)

In [36]:
temp_y

array([1, 1, 1, ..., 0, 0, 1])
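
`LabelEncoder` sorts the distinct labels and numbers them in that order, so here `negative` maps to 0 and `positive` to 1. A plain-Python sketch of that mapping (`encode_labels` is a hypothetical helper written for illustration, not part of scikit-learn):

```python
def encode_labels(labels):
    # sorted distinct labels define the integer codes, mirroring LabelEncoder's ordering
    classes = sorted(set(labels))
    index = {label: code for code, label in enumerate(classes)}
    return classes, [index[label] for label in labels]

classes, codes = encode_labels(["positive", "positive", "negative", "positive"])
print(classes)  # ['negative', 'positive']
print(codes)    # [1, 1, 0, 1]
```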

In [37]:
# Splitting the data into train and test sets

from sklearn.model_selection import train_test_split
Xtrain, Xtest,ytrain, ytest=train_test_split(temp_X,temp_y, test_size=0.2, random_state=1)


In [38]:
Xtrain.shape

(7986, 1)

In [39]:
Xtest.shape

(1997, 1)

In [40]:
ytrain

array([1, 1, 0, ..., 0, 0, 1])

In [41]:
Xtrain

Unnamed: 0,review
6713,ive wait superhero movi like long time mysteri...
1178,movi excel act excel direct overal excel stori...
4707,movi make want throw everi time see it take fi...
6772,first saw movi elementari school back 1960 fas...
7461,show made person iq lower 80 joke show lame de...
...,...
2895,excel episod movi ala pulp fiction 7 day 7 sui...
7823,first off give idea tast movies2007 comedi enj...
905,well begin stori went movi tonight friend know...
5195,lot horror fan seem love scarecrow popular say...


#### Text Vectorization

##### Bag of Words

In [42]:
from sklearn.feature_extraction.text import CountVectorizer

In [43]:
cv=CountVectorizer()

In [44]:
X_train_bow = cv.fit_transform(Xtrain['review']).toarray()
X_test_bow = cv.transform(Xtest['review']).toarray()


#cv_train  = cv.fit_transform(X_train)
#cv_test = cv.transform(X_test)

In [45]:
X_train_bow.shape

(7986, 55213)
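
Calling `.toarray()` turns the vectorizer's sparse matrix into a dense array. A quick back-of-the-envelope check of what that costs for the training matrix, using the shape shown above and the `int64` dtype reported in the array output:

```python
n_rows, n_cols = 7986, 55213
bytes_per_value = 8  # int64

dense_bytes = n_rows * n_cols * bytes_per_value
dense_gib = dense_bytes / 2**30

print(dense_bytes)          # total bytes for the dense training matrix
print(round(dense_gib, 2))  # the same figure in GiB
```

That is roughly 3.3 GiB for the training matrix alone, almost all of it zeros. Estimators such as `RandomForestClassifier` accept the sparse matrix directly, whereas `GaussianNB` needs dense input.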

In [46]:
X_train_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [47]:
X_test_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

#### Data Modelling

##### Naive Bayes

In [48]:
from sklearn.naive_bayes import GaussianNB
gnb=GaussianNB()


gnb.fit(X_train_bow,ytrain)

In [49]:
y_pred = gnb.predict(X_test_bow)

In [50]:
from sklearn.metrics import accuracy_score,confusion_matrix

In [51]:
accuracy_score(ytest,y_pred)

0.6464697045568353
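
`GaussianNB` models each feature as a continuous Gaussian, which is a loose fit for sparse word counts. A multinomial model is the more common choice for count features. The sketch below uses a toy corpus invented for illustration (not the IMDB data) and shows `MultinomialNB` in a pipeline that keeps the matrix sparse. Whether it improves on the score above would need to be checked on the real split.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, already-stemmed documents invented for this example; 1 = positive, 0 = negative
train_docs = ["great wonder film", "love excel act", "bad bore plot", "terribl wast time"]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes and classifies in one object, with no .toarray() call
nb_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
nb_pipeline.fit(train_docs, train_labels)

preds = nb_pipeline.predict(["excel film", "bore wast"])
print(preds)
```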

In [52]:
confusion_matrix(ytest,y_pred)

array([[681, 271],
       [435, 610]], dtype=int64)
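
In scikit-learn's confusion matrix, rows are the true classes and columns the predicted classes, in sorted label order (0 = negative, 1 = positive). A short worked example that recovers the accuracy above from the four counts and adds per-class recall:

```python
# Confusion matrix copied from the output above: rows = true class, columns = predicted class
cm = [[681, 271],
      [435, 610]]

tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
recall_pos = tp / (tp + fn)     # share of positive reviews correctly identified
recall_neg = tn / (tn + fp)     # share of negative reviews correctly identified
precision_pos = tp / (tp + fp)  # share of positive predictions that were right

print(accuracy, recall_pos, recall_neg, precision_pos)
```

The model recovers about 0.58 of the positive reviews against about 0.72 of the negative ones, so most of its errors are positives predicted as negative.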

##### Random Forest on BOW

In [53]:
from sklearn.ensemble import RandomForestClassifier

rf= RandomForestClassifier()

In [54]:
rf.fit(X_train_bow,ytrain)

In [55]:
y_pred_rf=rf.predict(X_test_bow)

In [56]:
accuracy_score(ytest,y_pred_rf)

0.8417626439659489

In [57]:
confusion_matrix(ytest,y_pred_rf)

array([[802, 150],
       [166, 879]], dtype=int64)

In [58]:
# To try to improve accuracy, restrict the vocabulary to its most frequent words
# instead of using all 55213 terms.
# Here we keep only the 3000 most frequent words.
X_train_bow.shape

(7986, 55213)

In [59]:
cv_new=CountVectorizer(max_features=3000)

X_train_bow_new = cv_new.fit_transform(Xtrain['review']).toarray()
X_test_bow_new = cv_new.transform(Xtest['review']).toarray()

rf_new=RandomForestClassifier()

rf_new.fit(X_train_bow_new,ytrain)

y_pred_rf_new=rf_new.predict(X_test_bow_new)

In [60]:
accuracy_score(ytest,y_pred_rf_new)

0.8342513770655984

In [61]:
confusion_matrix(ytest,y_pred_rf_new)

array([[801, 151],
       [180, 865]], dtype=int64)

##### Random Forest on nGrams

In [62]:
# max_features is set because, without it, the model raised a MemoryError:
# adding bigrams and trigrams makes the vocabulary far larger

cv_ng=CountVectorizer(ngram_range=(1,2),max_features=10000)

X_train_bow_ng = cv_ng.fit_transform(Xtrain['review']).toarray()
X_test_bow_ng = cv_ng.transform(Xtest['review']).toarray()

rf_ng=RandomForestClassifier()

rf_ng.fit(X_train_bow_ng,ytrain)

y_pred_rf_ng=rf_ng.predict(X_test_bow_ng)

In [63]:
accuracy_score(ytest,y_pred_rf_ng)

0.8457686529794692
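
`ngram_range=(1,2)` asks the vectorizer to count single words as well as pairs of adjacent words. A plain-Python sketch of how such features are enumerated for one tokenized review. `word_ngrams` is a hypothetical helper written for illustration, not scikit-learn's internal function.

```python
def word_ngrams(tokens, min_n, max_n):
    # slide a window of each size n over the token list and join the words with spaces
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "movi excel act".split()
print(word_ngrams(tokens, 1, 2))
# ['movi', 'excel', 'act', 'movi excel', 'excel act']
```

Each extra n adds nearly as many candidate features per review as there are tokens, which is why the vocabulary grows so quickly without `max_features`.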

In [64]:
cv_ng=CountVectorizer(ngram_range=(1,3),max_features=10000)

X_train_bow_ng = cv_ng.fit_transform(Xtrain['review']).toarray()
X_test_bow_ng = cv_ng.transform(Xtest['review']).toarray()

rf_ng=RandomForestClassifier()

rf_ng.fit(X_train_bow_ng,ytrain)

y_pred_rf_ng=rf_ng.predict(X_test_bow_ng)

In [65]:
accuracy_score(ytest,y_pred_rf_ng)

0.8402603905858789

##### Random Forest on TF-IDF

In [66]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [67]:
tfidf=TfidfVectorizer()

In [68]:
X_train_tfidf = tfidf.fit_transform(Xtrain['review']).toarray()
X_test_tfidf = tfidf.transform(Xtest['review']).toarray()
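
With its default settings, `TfidfVectorizer` weights each raw term count by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. A plain-Python sketch of that calculation on a toy corpus, assuming whitespace tokenization (the real vectorizer's default token pattern also drops single-character tokens):

```python
import math
from collections import Counter

def tfidf_rows(docs):
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        # L2-normalize the row; fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return idf, rows

idf, rows = tfidf_rows(["good movi", "bad movi", "good act"])
print(idf["bad"] > idf["movi"])  # rarer terms receive the larger idf
```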

In [69]:
rf_tfidf=RandomForestClassifier()

rf_tfidf.fit(X_train_tfidf,ytrain)

y_pred_tfidf=rf_tfidf.predict(X_test_tfidf)

In [70]:
accuracy_score(ytest,y_pred_tfidf)

0.8392588883324987

##### Word2Vec

In [71]:
import gensim

In [72]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

In [73]:
story=[]

for doc in temp_df['review']:
    raw_sent=sent_tokenize(doc)
    for sent in raw_sent:
        story.append(simple_preprocess(sent))

In [79]:
type(story)

list

In [82]:
story[1][0:25]

['wonder',
 'littl',
 'product',
 'film',
 'techniqu',
 'unassum',
 'oldtimebbc',
 'fashion',
 'give',
 'comfort',
 'sometim',
 'discomfort',
 'sens',
 'realism',
 'entir',
 'piec',
 'actor',
 'extrem',
 'well',
 'chosen',
 'michael',
 'sheen',
 'ha',
 'got',
 'polari']

In [83]:
model = gensim.models.Word2Vec(window=10, min_count=2)


In [84]:
model

<gensim.models.word2vec.Word2Vec at 0x220fb0dd910>

In [85]:
model.build_vocab(story)

In [86]:
model.train(story, total_examples=model.corpus_count,epochs=model.epochs)

(5536204, 6021045)

In [87]:
len(model.wv.index_to_key)

24859

In [88]:
# Up to this point we have built the Word2Vec model for words. Next, build vectors for whole documents, i.e. reviews.

In [89]:
def document_vector(doc):
    # remove out of vocab words
    doc = [word for word in doc.split() if word in model.wv.index_to_key]
    return np.mean(model.wv[doc],axis=0)
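
`model.wv.index_to_key` is a list, so each `word in ...` test scans the whole vocabulary. In gensim 4 the dict `model.wv.key_to_index` supports the same membership test with a hash lookup. A review with no in-vocabulary words also leaves nothing to average. The sketch below shows both ideas with a plain dict standing in for `model.wv`. `toy_vectors` and `document_vector_safe` are hypothetical names, and returning a zero vector for an empty review is an assumption, not the notebook's behaviour.

```python
import numpy as np

# Hypothetical stand-in for model.wv: a dict from word to embedding vector
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "movi": np.array([0.0, 1.0]),
}

def document_vector_safe(doc, vectors, size):
    # dict membership is a hash lookup, unlike scanning a list of keys
    known = [vectors[word] for word in doc.split() if word in vectors]
    if not known:
        # assumption: represent a review with no known words as a zero vector
        return np.zeros(size)
    return np.mean(known, axis=0)

print(document_vector_safe("good movi unseen", toy_vectors, size=2))
```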

In [90]:
document_vector(temp_df['review'].values[0])

array([-0.0815183 ,  0.14416261,  0.05117353,  0.03797391, -0.07799187,
       -0.45903867,  0.18225962,  0.6597433 , -0.21553531, -0.22197503,
        0.06272431, -0.30053288,  0.1005296 ,  0.08143651, -0.07159852,
       -0.04859932,  0.23424986, -0.4784025 ,  0.08534189, -0.7878843 ,
        0.19749102,  0.11886481,  0.18074664, -0.263865  , -0.14758374,
        0.13599564, -0.27858227, -0.31101984, -0.37387475, -0.04637045,
        0.42154342,  0.1874511 ,  0.31900817, -0.25023562, -0.10483135,
        0.3275052 ,  0.09927113, -0.44399506, -0.47762126, -0.47214597,
       -0.01095425, -0.30543667, -0.17200471, -0.09535902,  0.28363758,
       -0.38837352, -0.23604013, -0.19594686,  0.29105866,  0.37905523,
        0.02924929, -0.07195818, -0.20020859,  0.23572233, -0.1610906 ,
        0.04431829,  0.3355745 , -0.2494797 , -0.24137768,  0.17650522,
        0.06844756,  0.17256229, -0.08213758,  0.06077481, -0.35259992,
        0.33102673, -0.06677912,  0.21279505, -0.47790873,  0.24

In [91]:
from tqdm import tqdm

In [92]:
W2V =[]

for doc in tqdm(temp_df['review'].values):
    W2V.append(document_vector(doc))


100%|██████████| 9983/9983 [01:24<00:00, 118.57it/s]


In [93]:
W=np.array(W2V)

In [94]:
W[1]

array([-0.00879618,  0.16706415,  0.0202668 ,  0.0855204 ,  0.12787083,
       -0.61515415,  0.17188329,  0.4050233 , -0.48949823, -0.38317066,
       -0.3526686 , -0.5007181 , -0.05572539,  0.22550002,  0.22345851,
        0.02014802, -0.10950403, -0.39202967,  0.15054236, -0.6717477 ,
        0.17805468,  0.47067466,  0.36940545, -0.23348634,  0.25974807,
        0.33098018, -0.29859778, -0.14436191, -0.32274726,  0.20759252,
        0.51604074, -0.36775738,  0.32897803, -0.6359153 , -0.30911443,
        0.8224128 ,  0.25645956, -0.5206217 , -0.41101438, -0.53797966,
        0.15332063, -0.44316438, -0.07606869,  0.06427266, -0.03067522,
       -0.19370286, -0.5115933 , -0.17613298,  0.16680253,  0.16106573,
        0.31394276, -0.7022678 , -0.19111091,  0.00305641, -0.11256406,
        0.3114143 ,  0.2653072 , -0.20560558, -0.00959505,  0.09668474,
        0.13367745, -0.03316668,  0.06281686, -0.04503524, -0.47992134,
        0.3416583 ,  0.3335183 ,  0.21273458, -0.7408891 ,  0.39

In [95]:
W.shape

(9983, 100)

In [96]:
from sklearn.preprocessing import LabelEncoder

encoder=LabelEncoder()

V=encoder.fit_transform(temp_y)

In [97]:
V

array([1, 1, 1, ..., 0, 0, 1], dtype=int64)

In [98]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest,ytrain, ytest=train_test_split(W,V, test_size=0.2, random_state=1)

In [99]:
rf_w2v=RandomForestClassifier()

In [100]:
rf_w2v.fit(Xtrain,ytrain)

y_pred_w2vf=rf_w2v.predict(Xtest)

In [101]:
accuracy_score(ytest, y_pred_w2vf)

0.8162243365047571

In [102]:
y_pred_w2vf[[0,1,13,26,32,56]]

array([1, 1, 1, 1, 1, 0], dtype=int64)

In [103]:
temp_df['review'][0]

'one review mention watch 1 oz episod hook right exactli happen meth first thing struck oz brutal unflinch scene violenc set right word go trust me show faint heart timid show pull punch regard drug sex violenc hardcor classic use wordit call oz nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda em citi home manyaryan muslim gangsta latino christian italian irish moreso scuffl death stare dodgi deal shadi agreement never far awayi would say main appeal show due fact goe show dare forget pretti pictur paint mainstream audienc forget charm forget romanceoz mess around first episod ever saw struck nasti surreal say readi it watch more develop tast oz got accustom high level graphic violenc violenc injustic crook guard wholl sold nickel inmat wholl kill order get away it well manner middl class inmat turn prison bitch due lack street skill prison experi watch oz may becom comfort uncomfor

In [104]:
temp_df['review'][56]

'hill eye ii would expect noth more cours go oscar nomin film pure entertain lose 90 minutesth plot basic group nation guard traine find battl notori mutat hillbilli last day train desert fight back throughout whole film includ lot violenc which basic whole film blood gut constantli fli around throughout whole thing also yet anoth graphic rape scene pointlessli thrown shock audienceid give hill eye ii 4 10 pure entertain onli although even found look watch film went on began drag due fact continu tri shock audienc graphic gore occasion jump scene make sure audienc stay awak hill eye ii decent entertain someth pass time bore noth else410'