# Intro
In the Name of Allah

Sentiment analysis is a technique through which you can analyze a piece of text to determine the sentiment behind it. In this notebook, we're going to train a Naïve Bayes Classifier for the task of sentiment analysis on Imdb movie reviews dataset.

**Please pay attention to these notes:**

<br/>

- **Assignment Due:** 1401/09/20 23:59
- Write your code in the cells denoted by:
```
######## Your Code Here ########
```
- You can add more cells if necessary
- Finding any sort of copying will zero down your grade.
- When your solution is ready to submit, don't forget to set the name of this notebook like  "Name_StudentID.ipynb".
- If you have any questions about this assignment, feel free to drop us a line. You can also ask your questions on the telegram group.
- You must run this notebook on Google Colab platform.

<br/>



# Libraries

In [3]:
# importing the libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import numpy as np
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('wordnet')
from nltk.corpus import stopwords
import string
from nltk.stem import WordNetLemmatizer
import collections
from collections import Counter
from sklearn.model_selection import train_test_split as tts

[nltk_data] Downloading package stopwords to /home/mehxi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/mehxi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/mehxi/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /home/mehxi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [142]:
!wget https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv

--2022-12-10 18:02:18--  https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 66212309 (63M) [text/plain]
Saving to: ‘IMDB-Dataset.csv.2’

IMDB-Dataset.csv.2    9%[>                   ]   5.76M   589KB/s    eta 86s    ^C


# Load data

Load dataset and make it to pandas dataframe.

In [4]:
######## Your Code Here ########
df =  pd.read_csv('IMDB-Dataset.csv')
df.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


# Preprocess
The first step of NLP is text preprocessing. Data cleaning is a very crucial step in any machine learning model, but more so for NLP. Without the cleaning process, the dataset is often a cluster of words that the computer doesn’t understand. Raw data over a properly or improperly formed sentence is not always desirable as it contains lot of unwanted components like null/html/links/url/emoji/stopwords etc. So in this step, this unwanted components are removed for better performance and accuracy.

In this part I make class Preprocess to do preprocess on one column of data-frame.

In [5]:
######## Your Code Here ########

# remove html tage
def remove_tags(string):
    result = re.sub('<.*?>','',string)
    return result


# remove  links
def remove_links(string):
    result = re.sub('http[s]?:\/\/.*','',string)
    return result


class Preprocess:

    def __init__(self, data_frame,column_name):
        self.df = data_frame
        self.column_name = column_name

    def make_lower_case(self):
        self.df[self.column_name] = self.df[self.column_name].apply(lambda x: x.lower())

    def remove_tags(self):
        self.df[self.column_name] = self.df[self.column_name].apply(lambda cw : remove_tags(cw))

    def remove_link(self):
        self.df[self.column_name] = self.df[self.column_name].apply(lambda cw : remove_links(cw))

    def remove_number(self):
        self.df[self.column_name] = self.df[self.column_name].apply(lambda x: re.sub(r'\d+', '', x))

    def remove_punctuations(self):
        self.df[self.column_name] = self.df[self.column_name].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

    def remove_double_spaces(self):
        self.df[self.column_name] = self.df[self.column_name].apply(lambda x: re.sub(' +', ' ', x))

    def remove_emoji(self):
        self.df[self.column_name] = self.df[self.column_name].apply(lambda x: x.encode('ascii', 'ignore').decode('ascii'))

    def remove_stopwords(self):

        stop_words = stopwords.words('english')
        # add new stopwords to stop words.
        new_stopwords = ['<*>']
        stop_words.extend(new_stopwords)
        # remove stopwords not .
        stop_words.remove('not')

        self.df[self.column_name] = self.df[self.column_name].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

    def do_lemmatization(self):
        lemmatizer = WordNetLemmatizer()
        self.df[self.column_name] = self.df[self.column_name].apply(lambda x: lemmatizer.lemmatize(x))

    def do_all_preprocess(self):

        self.make_lower_case()
        self.remove_tags()
        self.remove_link()
        self.remove_number()
        self.remove_punctuations()
        self.remove_punctuations()
        self.remove_double_spaces()
        self.remove_emoji()
        self.remove_stopwords()
        self.do_lemmatization()
        return self.df

model_pre = Preprocess(df,"review")
new_df = model_pre.do_all_preprocess()
new_df.to_csv("out.csv")
new_df.head(10)

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching oz episode yo...,positive
1,wonderful little production filming technique ...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically theres family little boy jake thinks...,negative
4,petter matteis love time money visually stunni...,positive
5,probably alltime favorite movie story selfless...,positive
6,sure would like see resurrection dated seahunt...,positive
7,show amazing fresh innovative idea first aired...,negative
8,encouraged positive comments film looking forw...,negative
9,like original gut wrenching laughter like movi...,positive


<font size="5">Split the dataset</font>

Data splitting, or commonly known as train-test split, is the partitioning of data into subsets for model training and evaluation separately. Since the test set is not specified beforehand, we have to split the dataset into train and test set in an ideal proportion. 


In [184]:
######## Your Code Here ########
X_train, X_test, y_train, y_test = tts(df['review'], df['sentiment'],random_state=12, train_size = .80)


In [185]:
X_train

35235    disagree people saying lousy horror film good ...
36936    husbandandwife doctor team carole niles nelson...
46486    like cast pretty much however story sort unfol...
27160    movie awful bad cant bear expend anything word...
19490    purchased blood castle dvd ebay bucks not know...
                               ...                        
36482    strange thing see film scenes work rather weak...
40177    saw cheap dvd release title entity force since...
19709    one peculiar oftused romance movie plots one s...
38555    nothing positive say meandering nonsense huffi...
14155    low moments life bewildered depressed sitting ...
Name: review, Length: 40000, dtype: object

In [186]:
X_test

34622    hard tell noonan marshall trying ape abbott co...
1163     well startas one reviewers said know youre rea...
7637     wife kids opinion absolute abc classic havent ...
7045     surprise basic copycat comedy classic nutty pr...
43847    josef von sternberg directs magnificent silent...
                               ...                        
29299    yes fast times wannabe still decent entertainm...
29224    run dont walk rent movie rereleased excellent ...
16503    docudrama would expect richard attenborough ma...
40559    nepotism capitol world comes another junk flic...
24396    moviemakers even preview released script jumps...
Name: review, Length: 10000, dtype: object

# Training
Use Naive Beyes algorithm to train a Language Model

In this part I make class MehNaiveByes with algorithm of naive byes in fit and predict. with base of vector given Item.

In [198]:
######## Your Code Here ########
class MehNaiveBeyes:

    def __init__(self):
        self.p_pos = 0
        self.p_neg = 0
        self.list_of_p_pos = []
        self.list_of_p_neg = []

    def fit(self,vectors_feature,list_label,name_pos,name_neg):

        # At first, I find index of label pos and neg
        numbers_of_item_pos_label = [i for i in range(len(list_label)) if list_label[i] == name_pos]
        numbers_of_item_neg_label = [i for i in range(len(list_label)) if list_label[i] == name_neg]

        # Then I sum all vector of passive
        list_numbers_pos_item = np.zeros(vectors_feature.shape[1])
        for index in numbers_of_item_pos_label:
            item = vectors_feature[index].toarray()[0]

            list_numbers_pos_item = list_numbers_pos_item + item


        # And all vector of negative
        list_numbers_neg_item = np.zeros(vectors_feature.shape[1])
        for index in numbers_of_item_neg_label:
            item = vectors_feature[index].toarray()[0]
            list_numbers_neg_item = list_numbers_neg_item + item

        # Then Find Percentage of pas class and neg class
        self.p_pos = len(list_numbers_pos_item)/(len(list_numbers_pos_item)+len(list_numbers_neg_item))
        self.p_neg = len(list_numbers_neg_item)/(len(list_numbers_pos_item)+len(list_numbers_neg_item))


        # then for all element I find P(element | neg) P(element | pas)
        sum_list_numbers_pos_item = sum(list_numbers_pos_item)
        self.list_of_p_pos = np.array([(item+1)/(sum_list_numbers_pos_item+vectors_feature.shape[1]) for item in list_numbers_pos_item])

        sum_list_numbers_neg_item = sum(list_numbers_neg_item)
        self.list_of_p_neg = np.array([(item+1)/(sum_list_numbers_neg_item+vectors_feature.shape[1]) for item in list_numbers_neg_item])


    def predict(self,vectors_feature,name_pos,name_neg):


        label_predict = []
        # We have percentage of pas and neg and P(element | neg) P(element | pas) then we can find  Naive Beyes for each word
        # I use np.log because my score became vary small if use original formula
        for item in vectors_feature:
            item_arr = item.toarray()[0]

            pos_res = np.log(self.p_pos)
            neg_res = np.log(self.p_neg)

            pos_res = pos_res + np.log(self.list_of_p_pos.dot(item_arr))
            neg_res = neg_res + np.log(self.list_of_p_neg.dot(item_arr))

            # Each part have more probability I label with it .
            if pos_res >= neg_res:
                 label_predict.append(name_pos)
            else :
                 label_predict.append(name_neg)

        return label_predict




tfidfvectorizer = TfidfVectorizer(analyzer='word' , stop_words='english',)
tfidfvectorizer.fit(X_train)
tfidf_train = tfidfvectorizer.transform(X_train)
print("n_samples: %d, n_features: %d" % tfidf_train.shape)

n_samples: 40000, n_features: 185662


In [199]:
naive_beyes = MehNaiveBeyes()
naive_beyes.fit(tfidf_train, np.array(y_train),"positive","negative")

In [200]:
tfidf_test = tfidfvectorizer.transform(X_test)
y_pred = naive_beyes.predict(tfidf_test,"positive","negative")

# Test
Now you need to run inference on your test set

For Testing, I write class that have all function I need like accuracy_score, precision, recall, f1_measure and confusion_matrix_2D

In [201]:
class Me_Calcu_Accuracy:
    @staticmethod
    def accuracy_score(y_main, y_pred):
        y_main = list(y_main)
        y_pred = list(y_pred)
        # find all true Item model find
        upper = sum([1 for i in range(len(y_main)) if y_main[i]==y_pred[i]])
        # find accuracy
        return upper/len(y_main)
    @staticmethod
    def precision(y_main, y_pred,name):
        y_main = list(y_main)
        y_pred = list(y_pred)
        # find true possetive
        true_pos = sum([1 for i in range(len(y_main)) if y_main[i]==y_pred[i] and y_main[i] == name])
        # Find all item was pos in our prediction list
        down = sum([1 for i in range(len(y_main)) if  y_pred[i] == name])
        return true_pos/down

    @staticmethod
    def recall(y_main, y_pred,name):
        y_main = list(y_main)
        y_pred = list(y_pred)

        # find true possetive
        true_pos = sum([1 for i in range(len(y_main)) if y_main[i]==y_pred[i] and y_main[i] == name])
        # Find all item was pos in our main list
        pos =  sum([1 for i in range(len(y_main)) if  y_main[i] == name])

        return true_pos/pos

    @staticmethod
    def f1_measure(y_main, y_pred,name):
        precision = Me_Calcu_Accuracy.precision(y_main, y_pred,name)
        recall = Me_Calcu_Accuracy.recall(y_main, y_pred,name)
        return 2 * (precision * recall)/(precision+recall)

    @staticmethod
    def confustion_matrix_2D(y_main, y_pred):
        y_main = list(y_main)
        y_pred = list(y_pred)
        set_item = set(y_main)
        list_set = list(set_item)
        set_item = (list_set[1],list_set[0])
        if len(set_item) != 2:
            raise "this item for binary"

        confusion_matrix = []
        k = 0
        for item in set_item :
            list_make = []
            if k == 0:
                list_make.append(sum([1 for i in range(len(y_main)) if y_main[i]==y_pred[i] and y_main[i] == item]))
                list_make.append(sum([1 for i in range(len(y_main)) if y_main[i]!=y_pred[i] and y_pred[i] == item]))
                k +=1
            else:
                list_make.append(sum([1 for i in range(len(y_main)) if y_main[i]!=y_pred[i] and y_pred[i] == item]))
                list_make.append(sum([1 for i in range(len(y_main)) if y_main[i]==y_pred[i] and y_main[i] == item]))


            confusion_matrix.append(list_make)

        print(set_item[0],set_item[1])
        for item in confusion_matrix:
            print(item)


In [202]:
######## Your Code Here ########
y_pred = naive_beyes.predict(tfidf_test,"positive","negative")
score1 = Me_Calcu_Accuracy.accuracy_score(y_test,y_pred)

In [203]:
score1

0.7778

# Evaluation
After training is finished, we need some metrics to evaluate the trained model on the test set. Here, you need to write code for utilizing the metrics bellow without the sklearn libraries!

Precision

In [193]:
######## Your Code Here ########
precision = Me_Calcu_Accuracy.precision(y_test,y_pred,"positive")
print("precision : ",precision)

precision :  0.8925193465176269


Recall

In [194]:
######## Your Code Here ########
recall = Me_Calcu_Accuracy.recall(y_test,y_pred,"positive")
print("recall : ",recall)


recall :  0.6249247441300422


F-measure

In [195]:
######## Your Code Here ########
f_measure = Me_Calcu_Accuracy.f1_measure(y_test,y_pred,"positive")
print("f1_measure L ",f_measure)

f1_measure L  0.735127478753541


Confustion matrix

In [196]:
######## Your Code Here ########
Me_Calcu_Accuracy.confustion_matrix_2D(y_test,y_pred)


positive negative
[3114, 375]
[1869, 4642]
