#### We will perform Sentiment Analysis using Naive Bayes Classifier. We are going to implement different technique like Segmentation, Stopwords, Bag of words model.

* Let’s begin!

* We will import necessary libraries then read the dataset.

In [2]:
# importing necessary libraries
import pandas as pd
import numpy as np
import re
df = pd.read_csv("D:/NLP/INT344/archive/IMDB_Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [3]:
# Let’s check the size of dataset.
df.shape

(50000, 2)

In [4]:
# There are 50000 samples(rows). The data is huge. Let’s take a sample for now. This will make our job easy and quicker.
# Subset
df = df.sample(1000)
# resetting index
df.reset_index(drop=True, inplace=True)
# sample dataset size
df.shape

(1000, 2)

In [5]:
# Let’s update target variables as binary values 0 and 1
# positive:1 , negative:0
df['sentiment'].replace({'positive':1, 'negative':0}, inplace=True)
df.head()

Unnamed: 0,review,sentiment
0,"First off, this movie leaves you in a limbo mo...",0
1,"Undoubtedly, the least among the Spaghetti Wes...",0
2,This screened at Sundance last night to a rece...,0
3,I just finished this movie and my only comment...,1
4,I really wanted to like this movie. I absolute...,0


## Data Preprocessing
Data preprocessing is important stage in classification.

We will remove the noise from the dataset such as html tags, brackets, special characters. Let’s change everything to lower case (abcd).

In [6]:
# functions to remove noise
# remove html tags
def clean_html(text):
 clean = re.compile('<.*?>')
 return re.sub(clean, '', text)

In [7]:
# remove brackets
def remove_brackets(text):
 return re.sub('\[[^]]*\]', '', text)

In [8]:
# lower the cases
def lower_cases(text):
 return text.lower()

In [9]:
# remove special characters
def remove_char(text):
 pattern = r'[^a-zA-z0–9\s]'
 text = re.sub(pattern, '', text)
 return text


In [10]:
# remove noise(combine above functions)
def remove_noise(text):
 text = clean_html(text)
 text = remove_brackets(text)
 text = lower_cases(text) 
 text = remove_char(text) 
 return text

In [11]:
# call the function on predictors
df['review']=df['review'].apply(remove_noise)
df['review']

0      first off this movie leaves you in a limbo moo...
1      undoubtedly the least among the spaghetti west...
2      this screened at sundance last night to a rece...
3      i just finished this movie and my only comment...
4      i really wanted to like this movie i absolutel...
                             ...                        
995    in a nutshell skip this movie its that bad in ...
996    i found the movie at my local video store and ...
997    some major spoilers youve been warnedi saw thi...
998    an old intellectual talks about what he consid...
999    to be brutally honest i loved watching severed...
Name: review, Length: 1000, dtype: object

Now, we are going to use Stemming, a text normalization technique in Natural Language Processing (NLP). This technique will reduce the word size ( for example “calling” and “called” words will be reduced to “call”. So we will have 3 words as “call”)

Stemming is the process of reducing a word to its word stem that affixes to suffixes and prefixes or to the roots of words known as a lemma.

In [17]:
from nltk.stem.porter import PorterStemmer
def stem_words(text):
 ps = PorterStemmer()
 stem_list = [ps.stem(word) for word in text.split()] 
 text = ''.join(ps.stem(word) for word in text)

 return text
df['review'] = df['review'].apply(stem_words)

Now let’s remove Stopwords. These words do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence in the reviews. “the, is, at, which, and on” are few common stop words we see in English.

Stop words are the words in a stop list (or stoplist or negative dictionary) which are filtered out (i.e. stopped) before or after processing of natural language data (text) because they are insignificant.

In [18]:
# importing from nlptoolkit library
import nltk
from nltk.corpus import stopwords

In [16]:
# creating list of english stopwords
stopword_list = stopwords.words('english')

In [18]:
# removing the stopwords from review
def remove_stopwords(text):
    # list to add filtered words from review
    filtered_text = []
    # verify & append words from the text to filtered_text list
    for word in text.split():
        if word not in stopword_list:
            filtered_text.append(word)
    # add content from filtered_text list to new variable
    clean_review = filtered_text[:]
    # emptying the filtered_text list for new review
    filtered_text.clear()
    return clean_review
df['review']=df['review'].apply(remove_stopwords)
df['review']


0      [movies, one, redeemable, quality, besides, at...
1      [know, guts, beauty, guts, virgin, crap, films...
2      [film, two, characters, takes, closer, two, pe...
3      [gundam009, became, movie, trilogy, us, famili...
4      [bad, acting, bad, writing, poorly, written, f...
                             ...                        
995    [ok, gave, obviously, money, make, film, feel,...
996    [mistakenly, kept, awake, late, last, night, w...
997    [first, two, jim, thompson, adaptations, relea...
998    [confess, know, involved, forerunner, planet, ...
999    [despite, overall, pleasing, plot, expensive, ...
Name: review, Length: 1000, dtype: object

In [19]:
# join back all words as single paragraph
def join_back(text):
    return ' '.join(text)
df['review'] = df['review'].apply(join_back)

In [20]:
# check if changes are applied
df.head()

Unnamed: 0,review,sentiment
0,movies one redeemable quality besides ators ba...,0
1,know guts beauty guts virgin crap films hated ...,1
2,film two characters takes closer two people in...,1
3,gundam009 became movie trilogy us familiar lot...,1
4,bad acting bad writing poorly written film bad...,0


Now we are going to perform feature extraction on reviews column to into numerical feature vector. Feature extraction is most important sub-tasks in pattern classification. It yields better results than applying machine learning directly to the raw data.

Feature extracion refers to the process of transforming raw data into numerical features that can be processed while preserving the information in the original data set.

We will use Bag of Words Vectorization technique for conversion. This is a classic approach of converting input data from its raw format (i.e. text ) into vectors of real numbers which is the format that ML models support.

In [24]:
# Let’s begin converting out input text data.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=800)


In [25]:
# vectorizing words and storing in variable X(predictor)
X = cv.fit_transform(df['review']).toarray()
# predictor
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [26]:
# X size
X.shape

(1000, 800)

In [27]:
# target
y = df.iloc[:,-1].values

In [28]:
# y size
y.shape

(1000,)

We will get an array of 0 and 1 as input feature X. And columns have increased from 1 to 800, number of rows(samples) are 1000.
Now we will split the data into test and train sets

In [29]:
# train set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Finally, let’s fit the Naive Bayes Classifiers

In [30]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score

In [31]:
# Naive Bayes Classifiers
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

In [32]:
# fitting and predicting
gnb.fit(X_train, y_train)
y_pred_gnb = gnb.predict(X_test)

In [33]:
mnb.fit(X_train, y_train)
y_pred_mnb = mnb.predict(X_test)

In [34]:
bnb.fit(X_train, y_train)
y_pred_bnb = bnb.predict(X_test)

In [35]:
# accuracy scores
print("Gaussian", accuracy_score(y_test, y_pred_gnb))
print("Multinomial", accuracy_score(y_test, y_pred_mnb))
print("Bernoulli", accuracy_score(y_test, y_pred_bnb))

Gaussian 0.745
Multinomial 0.85
Bernoulli 0.855
