<h1><center>IMDB Reviews Sentiment Analysis</center></h1>


Data Source : https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

The IMDB dataset contains 50K movie reviews for natural language processing and text analytics.
It is a binary sentiment classification dataset with substantially more data than previous benchmark datasets: 25,000 highly polar movie reviews for training and 25,000 for testing. The goal is to predict whether each review is positive or negative, using either classical classification or deep learning algorithms.

In [1]:
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve,auc
from nltk.stem.porter import PorterStemmer

import re
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os

In [2]:
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [3]:
data = pd.read_csv('IMDB.csv')

In [4]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
data['sentiment'].value_counts()

negative    25000
positive    25000
Name: sentiment, dtype: int64

* The data is perfectly balanced: 25,000 positive and 25,000 negative reviews.

In [6]:
data['sentiment'] = data['sentiment'].map({'positive':1,'negative':0})

In [7]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


# Text Processing

In [8]:
sent_0 = data['review'].values[0]
print(sent_0)
print('='*50)

sent_1000 = data['review'].values[1000]
print(sent_1000)
print('='*50)

sent_1500 = data['review'].values[1500]
print(sent_1500)
print('='*50)

sent_4900 = data['review'].values[4900]
print(sent_4900)
print('='*50)

One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fac

In [9]:
sent_0 = re.sub(r"http\S+","",sent_0)
sent_1000 = re.sub(r"http\S+","",sent_1000)
sent_1500 = re.sub(r"http\S+","",sent_1500)
sent_4900 = re.sub(r"http\S+","",sent_4900)

* Removed URLs (anything starting with http or https) from the reviews
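
The case of the character class matters in this step: `\S` matches any non-whitespace character, while `\s` matches whitespace. A minimal sketch of the difference, using a made-up sample sentence and URL (both are assumptions for illustration, not taken from the dataset):

```python
import re

# Hypothetical sample text; the URL is invented for illustration.
sample = "Full review at https://example.com/review?id=1 and it was great."

# r"http\S+" matches "http" plus every following non-whitespace character,
# so both http:// and https:// links are removed in full.
cleaned = re.sub(r"http\S+", "", sample)

# r"http\s" matches "http" followed by exactly one whitespace character,
# so an ordinary URL is left untouched.
unchanged = re.sub(r"http\s", "", sample)

print(cleaned)
print(unchanged == sample)
```

Removing the URL leaves a double space behind; the later step that collapses non-letter runs into a single space takes care of it.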

In [10]:
print(sent_0)

One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fac

### Removing all HTML tags with BeautifulSoup

In [11]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(sent_0,'lxml')
text = soup.get_text()
print(text)
print('='*50)

soup = BeautifulSoup(sent_1000,'lxml')
text = soup.get_text()
print(text)
print('='*50)

soup = BeautifulSoup(sent_1500,'lxml')
text = soup.get_text()
print(text)
print('='*50)

soup = BeautifulSoup(sent_4900,'lxml')
text = soup.get_text()
print(text)
print('='*50)


One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wou
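
In the output above, sentences that were separated by `<br />` tags end up glued together ("me.The first"), because the text chunks are joined with no separator. BeautifulSoup's `get_text` accepts a `separator` argument for this case. The sketch below shows the same idea with the standard library's `html.parser` instead of BeautifulSoup; `TextExtractor` and `strip_tags` are hypothetical helpers, not part of the notebook:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html, separator=" "):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text chunks with the separator, then collapse repeated whitespace.
    return " ".join(separator.join(parser.chunks).split())


sample = "happened with me.<br /><br />The first thing that struck me"
print(strip_tags(sample))                 # chunks separated by a space
print(strip_tags(sample, separator=""))   # chunks glued together
```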

In [12]:
def decontracted(phrase):
    # specific contractions
    phrase = re.sub(r"won\'t","will not",phrase)
    phrase = re.sub(r"can\'t","can not",phrase)

    # general contractions
    phrase = re.sub(r"n\'t"," not",phrase)
    phrase = re.sub(r"\'re"," are",phrase)
    phrase = re.sub(r"\'s"," is",phrase)
    phrase = re.sub(r"\'d"," would",phrase)
    phrase = re.sub(r"\'ll"," will",phrase)
    phrase = re.sub(r"\'t"," not",phrase)
    phrase = re.sub(r"\'ve"," have",phrase)
    phrase = re.sub(r"\'m"," am",phrase)
    return phrase

In [13]:
sent_1500 = decontracted(sent_1500)
print(sent_1500)


Oh dear god. This was horrible. There is bad, then there was this. This movie makes no sense at all. It runs all over the map and is not clear about what its saying at all. The music seemed like it was trying to be like Batman. The fact that 'Edison' is not a real city, takes away. Since I live in Vancouver, watching this movie and recognizing all these places made it unbearable. Why did not they make it a real city? The only writing that was decent was'Tilman' in which John Heard did a fantastic job. He was the only actor who played his role realistically and not over the top and campy. It was actually a shame to see John Heard play such a great bad guy with a lot of screen time, and the movie be a washout. Too bad. Hopefully someone important will see it, and at least give John Heard credit where credit is due, and hire him as lead bad guy again, which is where he should be. on the A List.


In [14]:
# removes words with numbers
sent_0 = re.sub(r"\S*\d\S*","",sent_0).strip()
print(sent_0)

One of the other reviewers has mentioned that after watching just  Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact

In [15]:
sent_1500 = re.sub('[^A-Za-z0-9]+',' ',sent_1500)
print(sent_1500)

Oh dear god This was horrible There is bad then there was this This movie makes no sense at all It runs all over the map and is not clear about what its saying at all The music seemed like it was trying to be like Batman The fact that Edison is not a real city takes away Since I live in Vancouver watching this movie and recognizing all these places made it unbearable Why did not they make it a real city The only writing that was decent was Tilman in which John Heard did a fantastic job He was the only actor who played his role realistically and not over the top and campy It was actually a shame to see John Heard play such a great bad guy with a lot of screen time and the movie be a washout Too bad Hopefully someone important will see it and at least give John Heard credit where credit is due and hire him as lead bad guy again which is where he should be on the A List 
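
The three cleaning steps shown so far (expanding contractions, dropping tokens that contain digits, and replacing non-alphanumeric runs with a space) can be chained on one short string to see their combined effect. This is a minimal sketch that assumes the general rule is meant to match `n't`; `expand_contractions`, `clean`, and the sample sentence are hypothetical and only cover a subset of the rules in `decontracted`:

```python
import re


def expand_contractions(text):
    # Specific cases first, then the general "n't" rule.
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    return text


def clean(text):
    text = expand_contractions(text)
    # Drop every whitespace-delimited token that contains a digit.
    text = re.sub(r"\S*\d\S*", "", text)
    # Replace each run of non-alphanumeric characters with one space.
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    return text.strip()


sample = "It isn't clear; I won't watch all 3 parts!"
print(clean(sample))
```

The specific rules have to run before the general one: otherwise "won't" would become "wo not" instead of "will not".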


In [16]:
stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [17]:
preprocessed_reviews = []
ps = PorterStemmer()  # create the stemmer once instead of once per review

for sentence in tqdm(data['review']):
    sentence = re.sub(r"http\S+","",sentence)                # remove URLs
    sentence = BeautifulSoup(sentence,'lxml').get_text()     # strip HTML tags
    sentence = decontracted(sentence)                        # expand contractions
    sentence = re.sub(r"\S*\d\S*","",sentence).strip()       # drop tokens containing digits
    sentence = re.sub('[^A-Za-z]+',' ',sentence)             # keep letters only
    sentence = ' '.join([ps.stem(word) for word in sentence.split()])  # stem each word
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower() not in stopwords)  # lowercase, drop stopwords
    preprocessed_reviews.append(sentence.strip())

100%|███████████████████████████████████████████████████████████████████████████| 50000/50000 [03:11<00:00, 261.71it/s]


In [18]:
preprocessed_reviews[1500]

'oh dear god thi wa horribl bad wa thi thi movi make no sens run map not clear say music seem like wa tri like batman fact edison not real citi take away sinc live vancouv watch thi movi recogn place made unbear whi not make real citi onli write wa decent wa tilman john heard fantast job wa onli actor play hi role realist not top campi wa actual shame see john heard play great bad guy lot screen time movi washout bad hope someon import see least give john heard credit credit due hire lead bad guy list'

In [19]:
data['review'] = preprocessed_reviews
data.to_csv('preprocessed_reviews.csv', index=False)  # index=False avoids writing an extra unnamed index column

In [20]:
data.head()

Unnamed: 0,review,sentiment
0,one review ha mention watch oz episod hook rig...,1
1,wonder littl product film techniqu veri unassu...,1
2,thought thi wa wonder way spend time hot summe...,1
3,basic famili littl boy jake think zombi hi clo...,0
4,petter mattei love time money visual stun film...,1


# Bag of Words

In [21]:
count_vect = CountVectorizer()
count_vect.fit(preprocessed_reviews)
# note: scikit-learn >= 1.0 provides get_feature_names_out() as the replacement for get_feature_names()
print("some feature names ",count_vect.get_feature_names()[:10])
print("="*50)
final_counts = count_vect.transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_counts))
print("the shape of our text BOW vectorizer ",final_counts.get_shape())
print("the number of unique words ",final_counts.get_shape()[1])

some feature names  ['aa', 'aaa', 'aaaaaaaaaaaahhhhhhhhhhhhhh', 'aaaaaaaargh', 'aaaaaaah', 'aaaaaaahhhhhhggg', 'aaaaagh', 'aaaaah', 'aaaaahhhh', 'aaaaargh']
the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of our text BOW vectorizer  (50000, 71371)
the number of unique words  71371
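
To make the shape `(50000, 71371)` concrete: each row is one review and each column is one vocabulary word, with the cell holding how often that word occurs in that review. The sketch below reproduces the idea in plain Python on two made-up stemmed reviews. It is an illustration only, not how `CountVectorizer` is implemented (which, among other things, returns a sparse matrix and by default ignores single-character tokens):

```python
from collections import Counter

# Hypothetical stemmed reviews, invented for illustration.
docs = ["great movi great act", "bad movi"]

# Vocabulary: every distinct token, sorted alphabetically.
vocab = sorted({tok for doc in docs for tok in doc.split()})


def bow_vector(doc):
    # One count per vocabulary word, in vocabulary order.
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocab]


matrix = [bow_vector(doc) for doc in docs]
print(vocab)
print(matrix)
```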


# Bi-Grams and n-Grams

In [22]:
count_vect = CountVectorizer(ngram_range = (1,2),min_df = 10,max_features = 5000)
final_bigram_counts = count_vect.fit_transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_bigram_counts))
print("the shape of our text BoW vectorizer ",final_bigram_counts.get_shape())
print("the number of unique words including both unigrams and bigrams ",final_bigram_counts.get_shape()[1])

the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of our text BoW vectorizer  (50000, 5000)
the number of unique words including both unigrams and bigrams  5000
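
With `ngram_range = (1,2)` the vocabulary contains single words and pairs of adjacent words; `min_df = 10` drops terms that appear in fewer than 10 reviews, and `max_features = 5000` keeps only the 5000 most frequent remaining terms, which is why the matrix has exactly 5000 columns. A minimal sketch of the n-gram extraction itself, with a hypothetical `ngrams` helper and a made-up token list:

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


tokens = "not good movi".split()
print(ngrams(tokens))
```

Bigrams such as "not good" keep negation attached to the word it modifies, which unigrams alone cannot express.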


# TF-IDF

In [23]:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2), min_df=10)
tf_idf_vect.fit(preprocessed_reviews)
# note: scikit-learn >= 1.0 provides get_feature_names_out() as the replacement for get_feature_names()
print("some sample features(unique words in the corpus)",tf_idf_vect.get_feature_names()[0:10])
print('='*50)

final_tf_idf = tf_idf_vect.transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_tf_idf))
print("the shape of our text TFIDF vectorizer ",final_tf_idf.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_tf_idf.get_shape()[1])

some sample features(unique words in the corpus) ['aa', 'aaa', 'aag', 'aaliyah', 'aam', 'aamir', 'aamir khan', 'aardman', 'aaron', 'aaron carter']
the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of our text TFIDF vectorizer  (50000, 90210)
the number of unique words including both unigrams and bigrams  90210
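
For reference, the weights can be reproduced by hand on a toy corpus. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer` (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row); the three documents and the `tfidf` helper are made up for illustration:

```python
import math
from collections import Counter

# Hypothetical stemmed reviews, invented for illustration.
docs = [d.split() for d in ["great movi", "bad movi", "great great act"]]
n_docs = len(docs)

# Document frequency: in how many documents each token appears.
df = Counter(tok for doc in docs for tok in set(doc))

# Smoothed idf, as in scikit-learn's default (smooth_idf=True).
idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in df}


def tfidf(doc):
    tf = Counter(doc)
    raw = {tok: tf[tok] * idf[tok] for tok in tf}
    # L2-normalise so every document vector has unit length.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {tok: v / norm for tok, v in raw.items()}


weights = tfidf(docs[0])
print(idf)
print(weights)
```

Tokens that occur in fewer documents ("bad", "act") receive a higher idf than tokens that occur in many ("great", "movi").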


# Word2Vec

In [24]:
# Word2Vec expects each review as a list of tokens
list_of_sentence = [sentence.split() for sentence in preprocessed_reviews]

In [25]:
w2v_model = Word2Vec(list_of_sentence,min_count=5,size = 50,workers = 4)  # gensim >= 4.0 renames size to vector_size
print(w2v_model.wv.most_similar('great'))
print('='*50)
print(w2v_model.wv.most_similar('worst'))

[('excel', 0.8256300687789917), ('fantast', 0.7941492199897766), ('good', 0.7600095272064209), ('fine', 0.7545821070671082), ('amaz', 0.7409884929656982), ('terrif', 0.7408449649810791), ('awesom', 0.7135226726531982), ('outstand', 0.7109720706939697), ('brilliant', 0.7056697607040405), ('superb', 0.6898691654205322)]
[('stupidest', 0.8620564937591553), ('scariest', 0.8152726292610168), ('best', 0.7948146462440491), ('dumbest', 0.7562673687934875), ('funniest', 0.7443196177482605), ('cheesiest', 0.7083956599235535), ('wors', 0.6735664010047913), ('greatest', 0.6651947498321533), ('coolest', 0.6622195839881897), ('weirdest', 0.6475479006767273)]
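
`most_similar` ranks words by the cosine similarity of their vectors. Note that 'best' and 'greatest' rank close to 'worst' in the output above: superlatives tend to appear in similar contexts, so a high similarity does not imply the same sentiment. A minimal sketch of the ranking, using made-up two-dimensional vectors (the notebook's real vectors have 50 dimensions) and hypothetical `cosine` and `most_similar` helpers:

```python
import math


def cosine(u, v):
    # Cosine of the angle between u and v: dot product over product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration.
vectors = {"great": [1.0, 0.0], "excel": [0.9, 0.1], "bore": [0.0, 1.0]}


def most_similar(word, topn=2):
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(most_similar("great"))
```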


In [26]:
w2v_words = list(w2v_model.wv.vocab)  # gensim >= 4.0 replaces wv.vocab with wv.key_to_index
print("number of words that occurred minimum 5 times ",len(w2v_words))
print("sample words ",w2v_words[:50])

number of words that occurred minimum 5 times  26565
sample words  ['one', 'review', 'ha', 'mention', 'watch', 'oz', 'episod', 'hook', 'right', 'thi', 'exactli', 'happen', 'first', 'thing', 'struck', 'wa', 'brutal', 'unflinch', 'scene', 'violenc', 'set', 'word', 'go', 'trust', 'not', 'show', 'faint', 'heart', 'timid', 'pull', 'no', 'punch', 'regard', 'drug', 'sex', 'hardcor', 'classic', 'use', 'call', 'nicknam', 'given', 'oswald', 'maximum', 'secur', 'state', 'focus', 'mainli', 'emerald', 'citi', 'experiment']


# Converting text into vectors using Avg W2V and TFIDF-W2V

* Average W2V

In [27]:
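sent_vectors = []
for sent in tqdm(list_of_sentence):
    sent_vec = np.zeros(50)  # 50 = dimensionality of the word vectors
    cnt_words = 0  # number of words in the review that have a vector
    for word in sent:
        try:
            vec = w2v_model.wv[word]
        except KeyError:
            continue  # word was dropped by min_count, so it has no vector
        sent_vec += vec
        cnt_words += 1
    if cnt_words != 0:  # avoid 0/0 for reviews with no known words
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
print(len(sent_vectors))
print(len(sent_vectors[0]))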

100%|██████████████████████████████████████████████████████████████████████████| 50000/50000 [00:18<00:00, 2632.99it/s]

50000
50
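
The averaging step can be checked on a toy vocabulary. Because `min_count=5` removes rare words, a review could in principle contain no word with a vector; dividing a NumPy zero vector by zero then yields NaN values rather than an error, so the sketch returns the zero vector in that case. The `average_vector` helper and the two-dimensional vectors are hypothetical:

```python
def average_vector(tokens, vectors, dim):
    """Mean of the vectors of all in-vocabulary tokens (zero vector if none)."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is None:
            continue  # skip out-of-vocabulary tokens
        total = [t + x for t, x in zip(total, vec)]
        count += 1
    if count == 0:
        return total  # nothing to average: keep the zero vector
    return [t / count for t in total]


# Toy vectors, invented for illustration.
vectors = {"good": [1.0, 3.0], "movi": [3.0, 1.0]}
print(average_vector(["good", "movi", "zzz"], vectors, 2))
print(average_vector(["zzz"], vectors, 2))
```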





* TFIDF Weighted W2V

In [28]:
model = TfidfVectorizer()
model.fit(preprocessed_reviews)
# dictionary maps each vocabulary word to its idf value
dictionary = dict(zip(model.get_feature_names(),list(model.idf_)))

In [29]:
tfidf_feat = model.get_feature_names()

tfidf_sent_vectors = []
for sent in tqdm(list_of_sentence): # for each review
    sent_vec = np.zeros(50) # accumulator with the same dimensionality as the word vectors
    weight_sum = 0 # sum of the tf-idf weights of the words with a valid vector
    for word in sent: # for each word in the review
        try:
            vec = w2v_model.wv[word]
            # dictionary[word] = idf value of word in the whole corpus
            # sent.count(word)/len(sent) = tf value of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
        except KeyError:
            continue # word has no vector or no idf value
        sent_vec += (vec * tf_idf)
        weight_sum += tf_idf
    if weight_sum != 0: # avoid 0/0 for reviews with no usable words
        sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)

100%|███████████████████████████████████████████████████████████████████████████| 50000/50000 [00:57<00:00, 868.48it/s]
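
The TFIDF-weighted vector is a weighted mean: each word vector is multiplied by its tf-idf weight, the products are summed, and the sum is divided by the total weight. A minimal sketch on a toy vocabulary; the `weighted_vector` helper, the vectors, and the idf values are made up for illustration:

```python
def weighted_vector(tokens, vectors, idf, dim):
    """Weighted mean of word vectors, with weight = idf * term frequency."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in vectors or tok not in idf:
            continue  # skip words without a vector or an idf value
        weight = idf[tok] * tokens.count(tok) / len(tokens)
        total = [t + weight * x for t, x in zip(total, vectors[tok])]
        weight_sum += weight
    if weight_sum == 0:
        return total  # no usable words: keep the zero vector
    return [t / weight_sum for t in total]


# Toy values, invented for illustration.
vectors = {"good": [1.0, 0.0], "movi": [0.0, 1.0]}
idf = {"good": 3.0, "movi": 1.0}
result = weighted_vector(["good", "movi"], vectors, idf, 2)
print(result)
```

Here "good" has three times the idf of "movi", so the result sits three times closer to the "good" vector than a plain average would.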
