# Naive Bayes from Scratch for Sentiment Analysis

In [1]:
import re
import string
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('IMDB_Dataset.csv')

In [3]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
df['sentiment'] = df['sentiment'].map( {'negative':0, 'positive':1})

## Preprocessing

In [5]:
# Removing HTML tags such as <br />
def remove_html(text):
    html = re.compile(r"<.*?>")
    return html.sub(r" ", text)

df['review'] = df['review'].map(lambda x: remove_html(x))

In [6]:
stop = set(stopwords.words("english"))

def remove_stopwords(text):
    text = [word.lower() for word in text.split() if word.lower() not in stop]

    return " ".join(text)

df['review'] = df['review'].map(lambda x: remove_stopwords(x))

In [7]:
# Removing punctuation
def remove_punct(text):
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

df['review'] = df['review'].map(lambda x: remove_punct(x))

In [8]:
# Converting into Lowercase
df['review'] = df['review'].str.lower()

In [9]:
def review_tokens(text):
    tokens = word_tokenize(text)
    return tokens

df['review'] = df['review'].map(lambda x: review_tokens(x))

## Feature Extraction

In [10]:
def count_reviews(result, reviews, sentiments):      # {(word, sentiment): frequency}
    '''
    Input:
        result: a dictionary that will be used to map each (word, sentiment) pair to its frequency
        reviews: a list of tokenized reviews
        sentiments: a list corresponding to the sentiment of each review (either 0 or 1)
    Output:
        result: a dictionary mapping each (word, sentiment) pair to its frequency
    '''
    for sentiment, review in zip(sentiments, reviews):
        for word in review:
            
            pair = (word,sentiment)

            if pair in result:
                result[pair] += 1

            else:
                result[pair] = 1

    return result

In [11]:
def lookup(result, word, label):
    '''
    Input:
        result: a dictionary with the frequency of each pair (or tuple)
        word: the word to look up
        label: the label corresponding to the word
    Output:
        n: the number of times the word with its corresponding label appears.
    '''
    # look the pair up in the dictionary passed in as `result`;
    # pairs that were never counted default to 0
    n = result.get((word, label), 0)

    return n
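
As a quick sanity check, the counting and lookup steps above can be tried on a tiny hand-made corpus. This is only a sketch: `count_pairs`, `toy_reviews` and `toy_labels` are hypothetical names made up for illustration, and `collections.Counter` stands in for the plain dictionary used in the notebook.

```python
from collections import Counter

def count_pairs(reviews, sentiments):
    """Count (word, sentiment) pairs over already-tokenized reviews."""
    counts = Counter()
    for review, sentiment in zip(reviews, sentiments):
        for word in review:
            counts[(word, sentiment)] += 1
    return counts

# Hypothetical toy data: three tokenized reviews with labels (1 = positive, 0 = negative)
toy_reviews = [["great", "movie"], ["bad", "movie"], ["great", "cast"]]
toy_labels = [1, 0, 1]

toy_freqs = count_pairs(toy_reviews, toy_labels)

print(toy_freqs[("great", 1)])       # 2
print(toy_freqs[("movie", 0)])       # 1
print(toy_freqs.get(("bad", 1), 0))  # 0, the pair was never seen
```

A pair that never occurs simply has frequency 0, which is why the smoothing step later on matters.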

In [12]:
freqs = {}
a = count_reviews(freqs, df['review'], df['sentiment'])

In [13]:
freqs

{('one', 1): 26293,
 ('reviewers', 1): 225,
 ('mentioned', 1): 489,
 ('watching', 1): 3769,
 ('1', 1): 762,
 ('oz', 1): 179,
 ('episode', 1): 1900,
 ('hooked', 1): 216,
 ('right', 1): 3239,
 ('exactly', 1): 915,
 ('happened', 1): 907,
 ('me', 1): 2664,
 ('first', 1): 8989,
 ('thing', 1): 3344,
 ('struck', 1): 162,
 ('brutality', 1): 92,
 ('unflinching', 1): 26,
 ('scenes', 1): 4828,
 ('violence', 1): 1014,
 ('set', 1): 2331,
 ('word', 1): 770,
 ('go', 1): 4607,
 ('trust', 1): 306,
 ('show', 1): 6517,
 ('faint', 1): 39,
 ('hearted', 1): 81,
 ('timid', 1): 32,
 ('pulls', 1): 243,
 ('punches', 1): 68,
 ('regards', 1): 85,
 ('drugs', 1): 362,
 ('sex', 1): 1258,
 ('hardcore', 1): 120,
 ('classic', 1): 2311,
 ('use', 1): 1698,
 ('called', 1): 1144,
 ('nickname', 1): 31,
 ('given', 1): 1646,
 ('oswald', 1): 23,
 ('maximum', 1): 47,
 ('security', 1): 152,
 ('state', 1): 519,
 ('penitentary', 1): 2,
 ('focuses', 1): 230,
 ('mainly', 1): 415,
 ('emerald', 1): 10,
 ('city', 1): 1446,
 ('experimen

**Splitting data for training and testing**

In [14]:
X = df['review']
y = df['sentiment']

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

## Naive Bayes

Naive Bayes is a simple probabilistic classifier that is commonly used for sentiment analysis. It is fast to train and fast at prediction time.

#### So how do you train a Naive Bayes classifier?
- The first part of training a Naive Bayes classifier is to identify the number of classes that you have. Here there are two: positive and negative.
- You will create a probability for each class. $P(D_{pos})$ is the probability that the document is positive, and $P(D_{neg})$ is the probability that the document is negative.

Use the formulas below and store the values in a dictionary:

$$P(D_{pos}) = \frac{D_{pos}}{D}\tag{1}$$

$$P(D_{neg}) = \frac{D_{neg}}{D}\tag{2}$$

Where $D$ is the total number of documents (reviews in this case), $D_{pos}$ is the total number of positive reviews and $D_{neg}$ is the total number of negative reviews.
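
Equations (1) and (2) can be sketched in a few lines. The label array below is invented purely for illustration, and the dictionary keys `"pos"` and `"neg"` are an assumed naming choice, not something the notebook defines.

```python
import numpy as np

# Hypothetical labels for five reviews (1 = positive, 0 = negative)
toy_labels = np.array([1, 0, 1, 1, 0])

D = len(toy_labels)                    # total number of documents
D_pos = int((toy_labels == 1).sum())   # number of positive documents
D_neg = D - D_pos                      # number of negative documents

# Class probabilities from equations (1) and (2), stored in a dictionary
priors = {"pos": D_pos / D, "neg": D_neg / D}
print(priors)  # {'pos': 0.6, 'neg': 0.4}
```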

#### Prior and Logprior

The prior probability represents the underlying probability in the target population that a review is positive versus negative. In other words, if we had no specific information and blindly picked a review out of the population, what is the probability that it will be positive versus negative? That is the "prior".

The prior is expressed here as the ratio of the class probabilities $\frac{P(D_{pos})}{P(D_{neg})}$.
We can take the log of this ratio to rescale it, and we'll call this the logprior:

$$\text{logprior} = \log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = \log \left( \frac{D_{pos}}{D_{neg}} \right)$$

Note that $\log(\frac{A}{B})$ is the same as $\log(A) - \log(B)$. So the logprior can also be calculated as the difference between two logs:

$$\text{logprior} = \log (P(D_{pos})) - \log (P(D_{neg})) = \log (D_{pos}) - \log (D_{neg})\tag{3}$$
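
A small numeric check of equation (3), under made-up counts: the log of the probability ratio and the difference of the log counts agree, and a perfectly balanced dataset gives a logprior of exactly zero.

```python
import math

# Hypothetical document counts, chosen only for illustration
D_pos, D_neg = 3, 2
D = D_pos + D_neg

# Log of the ratio of class probabilities
from_ratio = math.log((D_pos / D) / (D_neg / D))

# Difference of the log counts, as in equation (3)
from_counts = math.log(D_pos) - math.log(D_neg)

print(math.isclose(from_ratio, from_counts))  # True

# With equally many positive and negative documents the logprior is zero
balanced_logprior = math.log(25000) - math.log(25000)
print(balanced_logprior)  # 0.0
```

A logprior close to zero therefore means the classes are nearly balanced, so the prior barely shifts the decision either way.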

#### Positive and Negative Probability of a Word
To compute the positive probability and the negative probability for a specific word in the vocabulary, we'll use the following inputs:

- $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label 1.
- $N_{pos}$ and $N_{neg}$ are the total number of positive and negative words across all documents (all reviews), respectively.
- $V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.

We'll use these to compute the positive and negative probability for a specific word using this formula:

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

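Notice that we add the "+1" in the numerator for additive smoothing, so a word that never appears in one class still gets a nonzero probability. The matching "+V" in the denominator keeps the probabilities over the vocabulary summing to 1. This [wiki article](https://en.wikipedia.org/wiki/Additive_smoothing) explains more about additive smoothing.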
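
To see what equation (4) does in practice, here is a minimal sketch on a three-word toy vocabulary. The helper `smoothed_prob` and the counts in `toy_counts` are hypothetical, chosen so the arithmetic is easy to follow.

```python
def smoothed_prob(freq, n_class, vocab_size):
    """Additively smoothed word probability: (freq + 1) / (N_class + V)."""
    return (freq + 1) / (n_class + vocab_size)

# Hypothetical positive-class counts; "bad" never appears in a positive review
toy_counts = {"great": 3, "movie": 2, "bad": 0}

N_pos = sum(toy_counts.values())  # 5 words in the positive class
V = len(toy_counts)               # 3 unique words in the vocabulary

probs = {word: smoothed_prob(count, N_pos, V) for word, count in toy_counts.items()}

print(probs["bad"])         # 0.125, nonzero despite a count of 0
print(sum(probs.values()))  # 1.0
```

Without the "+1", the unseen word would have probability 0 and its log would be undefined.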

#### Log likelihood
To compute the loglikelihood of that very same word, we can implement the following equation:

$$\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}$$
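
The sign of the log likelihood in equation (6) tells us which class a word leans towards. The sketch below uses invented counts (`N_pos`, `N_neg`, `V` and the word frequencies are all assumptions) to show the three cases.

```python
import math

def log_likelihood(freq_pos, freq_neg, n_pos, n_neg, vocab_size):
    """Log ratio of the smoothed positive and negative word probabilities."""
    p_w_pos = (freq_pos + 1) / (n_pos + vocab_size)
    p_w_neg = (freq_neg + 1) / (n_neg + vocab_size)
    return math.log(p_w_pos / p_w_neg)

# Hypothetical class sizes and vocabulary size
N_pos, N_neg, V = 100, 100, 50

mostly_positive = log_likelihood(9, 1, N_pos, N_neg, V)
mostly_negative = log_likelihood(1, 9, N_pos, N_neg, V)
neutral = log_likelihood(4, 4, N_pos, N_neg, V)

print(mostly_positive > 0)  # True: the word is evidence for the positive class
print(mostly_negative < 0)  # True: the word is evidence for the negative class
print(neutral)              # 0.0: the word carries no evidence either way
```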

In [16]:
def train_naive_bayes(result, train_x, train_y):
    '''
    Input:
        result: dictionary from (word, label) to how often the word appears
        train_x: a list of reviews
        train_y: a list of labels corresponding to the reviews (0, 1)
    Output:
        logprior: the log prior (equation 3 above)
        loglikelihood: a dictionary mapping each word to its log likelihood (equation 6 above)
    '''
    loglikelihood = {}
    logprior = 0

    # calculate V, the number of unique words in the vocabulary
    vocab = set([pair[0] for pair in result.keys()])
    V = len(vocab)

    # calculate N_pos and N_neg, the total word counts in each class
    N_pos = N_neg = 0
    for pair in result.keys():
        # if the label is positive (greater than zero)
        if pair[1] > 0:
            # increment the number of positive words by the count for this (word, label) pair
            N_pos += result[pair]

        # else, the label is negative
        else:
            # increment the number of negative words by the count for this (word, label) pair
            N_neg += result[pair]

    # Calculate D, the number of documents
    D = len(train_y)

    # Calculate D_pos, the number of positive documents
    D_pos = int(np.sum(np.asarray(train_y) > 0))

    # Calculate D_neg, the number of negative documents
    D_neg = int(np.sum(np.asarray(train_y) <= 0))

    # Calculate logprior
    logprior = np.log(D_pos) - np.log(D_neg)

    # For each word in the vocabulary...
    for word in vocab:
        # get the positive and negative frequency of the word
        # (look up in the `result` argument rather than the global `freqs`)
        freq_pos = lookup(result, word, 1)
        freq_neg = lookup(result, word, 0)

        # calculate the probability that each word is positive, and negative
        p_w_pos = (freq_pos + 1)/(N_pos + V) 
        p_w_neg = (freq_neg + 1)/(N_neg + V)

        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos/p_w_neg)

    return logprior, loglikelihood

In [17]:
logprior, loglikelihood = train_naive_bayes(freqs, X_train, y_train)
print(logprior)
print(len(loglikelihood))

-0.0016888892903299535
167564


**For prediction of single review**

In [18]:
def naive_bayes_predict(review, logprior, loglikelihood):
    '''
    Input:
        review: a list of tokens (a preprocessed review)
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the logprior plus the sum of the log likelihoods of each word in the review (if found in the dictionary)

    '''
    word_l = review

    # initialize probability to zero
    p = 0

    # add the logprior
    p += logprior

    for word in word_l:

        # check if the word exists in the loglikelihood dictionary
        if word in loglikelihood:
            # add the log likelihood of that word to the probability
            p += loglikelihood[word]


    return p
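
The prediction rule above is just a sum: start from the logprior, add the log likelihood of every known word, and skip unknown words. Here is a compact sketch with made-up values; `score` and `toy_loglikelihood` are hypothetical names and the numbers are not taken from the trained model.

```python
def score(tokens, logprior, loglikelihood):
    """Logprior plus the log likelihoods of the tokens; unknown words add 0."""
    return logprior + sum(loglikelihood.get(word, 0.0) for word in tokens)

# Hypothetical per-word log likelihoods, for illustration only
toy_loglikelihood = {"great": 1.2, "bad": -1.5, "movie": 0.0}

positive_score = score(["great", "movie", "unseenword"], 0.0, toy_loglikelihood)
negative_score = score(["bad", "bad", "great"], 0.0, toy_loglikelihood)

print(positive_score)       # 1.2, "unseenword" contributes nothing
print(positive_score > 0)   # True  -> predicted positive
print(negative_score > 0)   # False -> predicted negative
```

A score above zero is read as positive sentiment, and a score at or below zero as negative.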

**Preprocessing on a single review**

In [19]:
def preprocess(review):
    '''
    Input:
        review: a string containing a review
    Output:
        review_clean: a list of words containing the processed review

    '''
    # a set makes the membership test in the loop below faster
    stopwords_english = set(stopwords.words('english'))

    review = review.lower()
    
    #tokenize
    review_tokens = word_tokenize(review)
    
    review_clean = []
    for word in review_tokens:
        if (word not in stopwords_english and  # remove stopwords
            word not in string.punctuation):  # remove punctuation
                review_clean.append(word)

    return review_clean

**Predicting single review**

In [20]:
my_review = "Wow this movie was so bad. Felt like I wasted my 2 hours. It started okay then as the time passed it got worse and worse. Casting was so bad and I don't think actors in this movie know how to act."

In [21]:
a = preprocess(my_review)
naive_bayes_predict(a, logprior, loglikelihood)

-13.226731731480768

**The score is below zero, so the review is correctly predicted as negative**

In [22]:
def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    """
    Input:
        test_x: A list of reviews
        test_y: the corresponding labels for the list of reviews
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy: (# of reviews classified correctly)/(total # of reviews)
    """
    accuracy = 0 

    y_hats = []
    for review in test_x:
        # if the prediction is > 0
        if naive_bayes_predict(review, logprior, loglikelihood) > 0:
            # the predicted class is 1
            y_hat_i = 1
        else:
            # otherwise the predicted class is 0
            y_hat_i = 0

        # append the predicted class to the list y_hats
        y_hats.append(y_hat_i)

    # error is the average of the absolute values of the differences between y_hats and test_y
    # (converted to arrays so this also works when test_y is a plain list)
    error = np.mean(np.absolute(np.asarray(y_hats) - np.asarray(test_y)))

    # Accuracy is 1 minus the error
    accuracy = 1-error

    return accuracy

In [23]:
test_naive_bayes(X_test, y_test, logprior, loglikelihood)

0.9154