# Problem statement description

*The IMDB dataset contains 50K movie reviews for natural language processing and text analytics.
It is a binary sentiment classification dataset containing substantially more data than previous benchmark datasets: 25,000 highly polar movie reviews for training and 25,000 for testing. The task is to predict whether each review is positive or negative, using either classical classification or deep learning algorithms.
For more information on the dataset, see the following link:
http://ai.stanford.edu/~amaas/data/sentiment/*

## ***Steps*** [ Plan of attack ]

1. Text Preprocessing
    - Removal of HTML tags (left over from inefficient web scraping of the data)
    - Removal of stopwords, which contribute nothing to the analysis (a, and, the, or, many, which, ...)
    - Removal of special characters, which aren't needed for our use case (differentiating between positive and negative sentiment)
    
    
2. Vectorization of Data (Technique : Bag of Words)

3. Apply appropriate ML algorithm

4. Hyperparameter Tuning

5. Building deployment ready pipelines
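
To make step 2 concrete, here is a minimal pure-Python sketch of the bag-of-words idea on two made-up sentences. It only illustrates the concept and is not how `CountVectorizer` is implemented; the names `docs`, `vocabulary`, and `to_vector` are illustrative and do not appear elsewhere in this notebook.

```python
from collections import Counter

# Two made-up, already-cleaned documents (illustrative only)
docs = ['good movie good plot', 'bad movie']

# Vocabulary: every distinct word, sorted so each word gets a fixed column
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc):
    # Count how often each vocabulary word occurs in the document
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc) for doc in docs]
print(vocabulary)  # ['bad', 'good', 'movie', 'plot']
print(vectors)     # [[0, 2, 1, 1], [1, 0, 1, 0]]
```

Each review becomes one row of counts, and word order is discarded.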

# Importing required libraries

In [1]:
import numpy as np
import pandas as pd

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv('../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')

In [3]:
df.head()

|   | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [4]:
# One review
df['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

# Text Cleaning 

1. Remove HTML tags
2. Remove special characters
3. Converting everything to lowercase
4. Removing stopwords
5. Stemming ( playing, played, plays ------> play )

In [5]:
df.shape

(50000, 2)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


Neither column has missing values.

## Label encoding the sentiment column

In [7]:
# Assign the result back; an in-place replace on a selected column is unreliable in recent pandas
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

In [8]:
df.head()

|   | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


## Removing HTML tags using RegEx

In [9]:
# Testing the RegEx
clean = re.compile('<.*?>')
re.sub(clean,'',df.iloc[2].review)

'I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.This was the most I\'d laughed at one of Woody\'s comedies in years (dare I say a decade?). While I\'ve never been impressed with Scarlet Johanson, in this she managed to tone down her "sexy" image and jumped right into a average, but spirited young woman.This may not be the crown jewel of his career, but it was wittier than "Devil Wears Prada" and more interesting than "Superman" a great comedy to go see with friends.'

In [10]:
# Function to clean HTML tags
def clean_html(text):
    clean = re.compile('<.*?>')
    return re.sub(clean,'',text)
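
The `?` in `<.*?>` makes the match non-greedy, which is what keeps the text between two tags. The short sketch below contrasts it with the greedy form on a made-up string (`sample` is illustrative, not taken from the dataset).

```python
import re

# A made-up review fragment with tags and an HTML entity (illustrative only)
sample = 'Great film.<br /><br />Loved the <b>cast</b> &amp; crew.'

# Non-greedy: each match stops at the first '>', so only the tags are removed
non_greedy = re.sub(r'<.*?>', '', sample)

# Greedy: a single match runs from the first '<' to the last '>'
greedy = re.sub(r'<.*>', '', sample)

print(non_greedy)  # Great film.Loved the cast &amp; crew.
print(greedy)      # Great film. &amp; crew.
```

Two side effects are visible here: words on either side of a removed tag are joined ("film.Loved"), and HTML entities such as `&amp;` are not tags, so they are left in place. In this notebook the later special-character step turns punctuation into spaces, which separates the joined words again.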

In [11]:
# Removing HTML tags from reviews column
df['review'] = df['review'].apply(clean_html)
df.head()

|   | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. The filming tec... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


## Converting all the reviews to lower case

In [12]:
def convert_lower(text):
    return text.lower()

In [13]:
df['review'] = df['review'].apply(convert_lower)
df.head()

|   | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | 1 |
| 1 | a wonderful little production. the filming tec... | 1 |
| 2 | i thought this was a wonderful way to spend ti... | 1 |
| 3 | basically there's a family where a little boy ... | 0 |
| 4 | petter mattei's "love in the time of money" is... | 1 |


## Removing Special characters

In [14]:
def remove_special(text):
    # Keep letters and digits; replace every other character with a space
    return ''.join(t if t.isalnum() else ' ' for t in text)
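
A quick way to see what this rule keeps and drops: `str.isalnum()` is Unicode-aware, so digits and accented letters survive, while apostrophes and punctuation turn into spaces. The sketch below re-implements the same rule under the hypothetical name `keep_alnum` and runs it on made-up strings.

```python
def keep_alnum(text):
    # Same rule as above: keep alphanumerics, replace everything else with a space
    return ''.join(ch if ch.isalnum() else ' ' for ch in text)

# Contractions are split at the apostrophe
print(keep_alnum("there's").split())         # ['there', 's']

# Accented letters and digits are kept; brackets and '!' are blanked out
print(keep_alnum('amélie (2001)!').split())  # ['amélie', '2001']
```

The leftover single letters from contractions (such as `s`) are part of NLTK's English stopword list, so the stopword step removes them afterwards.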

In [15]:
df['review'] = df['review'].apply(remove_special)
df.head()

|   | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | 1 |
| 1 | a wonderful little production   the filming tec... | 1 |
| 2 | i thought this was a wonderful way to spend ti... | 1 |
| 3 | basically there s a family where a little boy ... | 0 |
| 4 | petter mattei s  love in the time of money  is... | 1 |


## Removing the stopwords

In [16]:
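# Requires the NLTK stopword corpus: nltk.download('stopwords')
# Build the set once; calling stopwords.words() for every token is very slow
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    # Keep only the tokens that are not English stopwords
    return [word for word in text.split() if word not in STOP_WORDS]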
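
The filter only needs one membership test per token, so a `set` is a natural container for the stopwords. The sketch below shows the idea with a tiny hand-written stopword set; `toy_stopwords` and `remove_toy_stopwords` are hypothetical names, and the real list used in this notebook comes from NLTK.

```python
# A tiny, hand-written stopword set (the notebook uses NLTK's English list instead)
toy_stopwords = {'a', 'an', 'and', 'is', 'the', 'this'}

def remove_toy_stopwords(text):
    # Membership tests against a set take constant time on average
    return [word for word in text.split() if word not in toy_stopwords]

print(remove_toy_stopwords('this is a wonderful and witty film'))
# ['wonderful', 'witty', 'film']
```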

In [17]:
df['review'] = df['review'].apply(remove_stopwords)
df.head()

|   | review | sentiment |
|---|---|---|
| 0 | [one, reviewers, mentioned, watching, 1, oz, e... | 1 |
| 1 | [wonderful, little, production, filming, techn... | 1 |
| 2 | [thought, wonderful, way, spend, time, hot, su... | 1 |
| 3 | [basically, family, little, boy, jake, thinks,... | 0 |
| 4 | [petter, mattei, love, time, money, visually, ... | 1 |


## Stemming

In [18]:
ps = PorterStemmer()

In [35]:
def stem_words(text):
    # Use a local list: a module-level `y` would clash with the label array `y` defined later
    return [ps.stem(word) for word in text]

In [20]:
df['review'] = df['review'].apply(stem_words)

In [21]:
# Join back
def join_back(list_input):
    return " ".join(list_input)

In [22]:
df['review'] = df['review'].apply(join_back)
df.head()

|   | review | sentiment |
|---|---|---|
| 0 | one review mention watch 1 oz episod hook righ... | 1 |
| 1 | wonder littl product film techniqu unassum old... | 1 |
| 2 | thought wonder way spend time hot summer weeke... | 1 |
| 3 | basic famili littl boy jake think zombi closet... | 0 |
| 4 | petter mattei love time money visual stun film... | 1 |


# Vectorization [ Bag of Words ]

In [23]:
cv = CountVectorizer(max_features=5000)

In [24]:
X = cv.fit_transform(df['review']).toarray()

In [25]:
X.shape

(50000, 5000)
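
`.toarray()` turns the sparse count matrix into a dense NumPy array, so every one of the 50,000 x 5,000 cells is stored explicitly. A back-of-the-envelope estimate of the memory this needs, assuming `CountVectorizer`'s default `int64` dtype (8 bytes per value):

```python
rows, cols = 50_000, 5_000   # X.shape reported above
bytes_per_value = 8          # assumes the default dtype, int64

dense_bytes = rows * cols * bytes_per_value
print(dense_bytes)                      # 2000000000
print(round(dense_bytes / 1024**3, 2))  # 1.86 (GiB)
```

`GaussianNB` needs dense input, which is why the conversion is done here; `MultinomialNB` and `BernoulliNB` also accept the sparse matrix directly.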

In [26]:
y = df.iloc[:,-1].values
y.shape

(50000,)

# ML Algorithm

In [27]:
clf1 = GaussianNB()
clf2 = MultinomialNB()
clf3 = BernoulliNB()

In [28]:
print("GaussianNB accuracy : ", cross_val_score(clf1,X,y,cv=10,scoring='accuracy').mean()*100 , ' %')
print("MultinomialNB accuracy : ", cross_val_score(clf2,X,y,cv=10,scoring='accuracy').mean()*100 , ' %')
print("BernoulliNB accuracy : ", cross_val_score(clf3,X,y,cv=10,scoring='accuracy').mean()*100 , ' %')

GaussianNB accuracy :  71.85400000000001  %
MultinomialNB accuracy :  84.44800000000001  %
BernoulliNB accuracy :  84.73999999999998  %


***The cross-validated results show that Bernoulli Naive Bayes performs best of the three Naive Bayes variants, with a mean accuracy of 84.74%, narrowly ahead of Multinomial Naive Bayes (84.45%) and well ahead of Gaussian Naive Bayes (71.85%)***
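
The two strongest models see the same matrix `X` differently. scikit-learn's `BernoulliNB` binarizes its input by default (`binarize=0.0`), so it only uses whether a word occurs, while `MultinomialNB` uses the counts themselves. The pure-Python sketch below illustrates the two views on a made-up review; it does not reproduce either classifier.

```python
from collections import Counter

# A made-up, already-cleaned review (illustrative only)
review = 'good good good plot but bad ending'

# Multinomial-style features: how many times each word occurs
counts = Counter(review.split())

# Bernoulli-style features: only whether each word occurs at all
presence = {word: int(count > 0) for word, count in counts.items()}

print(counts['good'], presence['good'])  # 3 1
print(counts['bad'], presence['bad'])    # 1 1
```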

In [29]:
clf3.fit(X,y)

BernoulliNB()

In [30]:
trnf1 = FunctionTransformer(func = clean_html)
# Eg:
trnf1.transform("<p> Swades is an excellent, mindblowing movie played by Shah Rukh Khan </p>")

' Swades is an excellent, mindblowing movie played by Shah Rukh Khan '

In [31]:
trnf2 = FunctionTransformer(func = convert_lower)
# Eg: 
trnf2.transform(' Swades is an excellent, mindblowing movie played by Shah Rukh Khan ')

' swades is an excellent, mindblowing movie played by shah rukh khan '

In [32]:
trnf3 = FunctionTransformer(func = remove_special)
# Eg: 
trnf3.transform(' swades is an excellent, mindblowing movie played by shah rukh khan ')

' swades is an excellent  mindblowing movie played by shah rukh khan '

In [33]:
trnf4 = FunctionTransformer(func = remove_stopwords)
# Eg: 
trnf4.transform(' swades is an excellent  mindblowing movie played by shah rukh khan ')

['swades',
 'excellent',
 'mindblowing',
 'movie',
 'played',
 'shah',
 'rukh',
 'khan']

In [36]:
trnf5 = FunctionTransformer(func = stem_words)
# Eg: 
trnf5.transform(['swades',
 'excellent',
 'mindblowing',
 'movie',
 'played',
 'shah',
 'rukh',
 'khan'])

['swade', 'excel', 'mindblow', 'movi', 'play', 'shah', 'rukh', 'khan']

In [37]:
trnf6 = FunctionTransformer(func = join_back)
# Eg: 
trnf6.transform(['swade', 'excel', 'mindblow', 'movi', 'play', 'shah', 'rukh', 'khan'])

'swade excel mindblow movi play shah rukh khan'

# ***Building a readily deployable pipeline***

In [38]:
pipe = Pipeline([
    ('trnf1',trnf1),
    ('trnf2',trnf2),
    ('trnf3',trnf3),
    ('trnf4',trnf4),
    ('trnf5',trnf5),
    ('trnf6',trnf6)
])

In [39]:
review1 = 'Bhool Bhoolaiya, Phir Hera pheri, De Dana Dan and Bhaagam Bhaag are a few of the good comedy movies played by Akshay Kumar as a lead actor'
review2 = 'Golmaal 2 is the worst movie in the entire frenchise'

In [40]:
def sentiment_analyzer(text):
    # Preprocess the raw review, then vectorize it as a one-document batch
    cleaned = pipe.transform(text)
    prediction = clf3.predict(cv.transform([cleaned]))[0]

    if prediction == 0:
        return 'the review is negative'
    else:
        return 'the review is positive'

In [41]:
sentiment_analyzer(review1)

'the review is positive'

In [42]:
sentiment_analyzer(review2)

'the review is negative'
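
In the cells above, the preprocessing pipeline `pipe`, the vectorizer `cv`, and the classifier `clf3` are three separate objects that all have to be shipped together. One possible alternative, sketched here on made-up data, is to put cleaning, vectorization, and the classifier into a single scikit-learn `Pipeline`. The names `simple_clean`, `toy_reviews`, `toy_labels`, and `model` are hypothetical, and the sketch skips stopword removal and stemming to stay short.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

def simple_clean(text):
    # Strip HTML tags, lowercase, and blank out non-alphanumeric characters
    text = re.sub(r'<.*?>', ' ', text).lower()
    return ''.join(ch if ch.isalnum() else ' ' for ch in text)

# Made-up training data (illustrative only): 1 = positive, 0 = negative
toy_reviews = [
    'A wonderful, <b>excellent</b> film',
    'Excellent acting and a wonderful story',
    'The worst movie, truly terrible',
    'Terrible plot and the worst acting',
]
toy_labels = [1, 1, 0, 0]

# One object that goes from raw text to a prediction
model = Pipeline([
    ('bow', CountVectorizer(preprocessor=simple_clean)),
    ('nb', BernoulliNB()),
])
model.fit(toy_reviews, toy_labels)

predictions = model.predict(['an excellent and wonderful movie',
                             'worst film, terrible'])
print(predictions.tolist())  # [1, 0]
```

A single fitted pipeline like this can be saved and loaded as one artifact, and passing it to `cross_val_score` refits the vocabulary inside each fold instead of on the full dataset.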

## ***We have now built a pipeline that can be readily deployed to a testing environment: a raw review is passed through the function transformers, which preprocess it, and the Bernoulli Naive Bayes model then classifies the preprocessed review as either positive or negative***