# Sentiment Analysis on IMDB dataset using Naive Bayes

In this example, I'll try to implement the Naive Bayes algorithm to classify the sentiment of the IMDB dataset. The dataset contains 50k reviews, 25k for training and 25k for testing. The reviews are labeled as positive or negative.

The idea behind this is to use a bag-of-words approach to represent the reviews and then use the Naive Bayes algorithm to classify them.

Given a documend $d$ and a class $c$, the Naive Bayes algorithm calculates the probability of the class given the document as:

$$P(c|d) = \frac{P(d|c)P(c)}{P(d)}$$

The model shoud return:

$$\hat{c} = \arg\max_c P(c|d) = \arg\max_c P(d|c) \cdot P(c) $$

If we represent the document with a set of features $f_1, \dots, f_n$, the model can be simplified to:

$$\hat{c} = \arg\max_c P(c) \prod_{i=1}^{n} P(f_i|c)$$

By using a bag-of-words approach our features will simply be the words $w_i$ in the document. The model can be simplified to:

$$\hat{c} = \arg\max_c P(c) \prod_{i=1}^{n} P(w_i|c)$$

In [198]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer # For Bag-Of-Words
from sklearn.model_selection import train_test_split

TARGET = "sentiment"
dataset_filename = "IMDB Dataset.csv"
df = pd.read_csv(dataset_filename)

df_train, df_test = train_test_split(df, test_size=0.3) 
df_test = df_test.reset_index()

In [199]:
df.columns

Index(['review', 'sentiment'], dtype='object')

In [200]:
df.head

<bound method NDFrame.head of                                                   review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]>

In [202]:
sum(df_train[TARGET].value_counts())

35000

In [193]:
def class_probability(df, c):
    return df[TARGET].value_counts()[c] / sum(df[TARGET].value_counts())

def word_probability(X, features, w, X_sum, laplace_smoothing = False):
    word_index = np.where(features == w)[0]
    if len(word_index) == 0:
        if laplace_smoothing:
            return 1 / len(X)
        else:
            return 0
    word_index = word_index[0]
    
    if laplace_smoothing:
        return (X[word_index, 0] + 1) / (X_sum + len(X))
    else:
        return X[word_index, 0] / X_sum

In [203]:
vectorizers = [CountVectorizer(), CountVectorizer()]
X = {"positive" : vectorizers[0].fit_transform(df_train[df_train[TARGET] == "positive"]["review"]),
     "negative" : vectorizers[1].fit_transform(df_train[df_train[TARGET] == "negative"]["review"])}

features = {"positive":vectorizers[0].get_feature_names_out(),
            "negative":vectorizers[1].get_feature_names_out()}

X = {"positive": np.sum(X["positive"], axis=0).T,
     "negative": np.sum(X["negative"], axis=0).T}

X_sum = {"positive": np.sum(X["positive"], axis=0)[0,0],
         "negative": np.sum(X["negative"], axis=0)[0,0],}

For numerical stability, we can use the log of the probabilities:

$$\hat{c} = \arg\max_c \log P(c) + \sum_{i=1}^{n} \log P(w_i|c)$$

In [211]:
def classify_document(d, laplace_smoothing = False, tokenize=False, tokenizer=None):
    if tokenize:
        d = tokenizer(d)
    likelihoods = {}
    for c in ["positive", "negative"]:
        res = np.log(class_probability(df, c))
        
        for word in d:
            word_p = word_probability(X[c], features[c], word, X_sum[c], laplace_smoothing=True)
            res += np.log(word_p)
        
        likelihoods[c] = res
        
    if likelihoods["positive"] > likelihoods["negative"]:
        return "positive"
    else:
        return "negative"
    # return likelihoods

In [212]:
tokenizer = vectorizers[0].build_tokenizer()
document = "I really loved this movie"
print(tokenizer(document))
classify_document(tokenizer(document), laplace_smoothing=False)

['really', 'loved', 'this', 'movie']


'positive'

In [222]:
from tqdm import tqdm
correct = 0
wrong = 0
for i in tqdm(range(1000)):
    true_class = df_test["sentiment"][i]
    predicted_class = classify_document(df_test["review"][i], laplace_smoothing=True, tokenize=True, tokenizer=tokenizer)
    # print(f"True class: {true_class}")
    # print(f"Predicted class: {predicted_class}")
    if predicted_class == true_class:
        correct += 1
    else:
        wrong += 1

100%|██████████| 1000/1000 [04:52<00:00,  3.41it/s]


In [223]:
accuracy = correct / (correct + wrong)
print("Accuracy: ", accuracy)

Accuracy:  0.811


# Conclusion

This model, despite its semplicity, is able to achieve pretty good results. Of course, some limitations include the impossibility of understanding the context of a word (e.g. "not good" is considered positive). It can be used for sure as a baseline for more complex models.