# **Multinomial Naive-Bayes**

The Naive-Bayes algorithm is used to classify text, such as emails and movie reviews, into different classes.\
The dataset used here consists of movie reviews written by users and posted on IMDB.\
We classify each movie review as either negative or positive.\
The data is not included in the GitHub repository: there are 50 thousand movie reviews in total, which makes them impractical to post.


The Naive-Bayes algorithm works by treating a piece of text, for example an email, as a bag of words. We first remove common words such as "the" and "of", which are called stop words, and keep only the more informative words.
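
As a rough illustration of the bag-of-words idea, the sketch below turns a sentence into word counts after dropping stop words. The tiny `STOP_WORDS` set and the `bag_of_words` helper are made up for this example; the notebook itself relies on NLTK's English stop-word list.

```python
from collections import Counter

# Hypothetical, tiny stop-word list used only for this illustration
STOP_WORDS = {"the", "of", "a", "is", "to"}

def bag_of_words(text):
    """Lowercase the text, split on whitespace, drop stop words, count the rest."""
    words = text.lower().split()
    return Counter(w for w in words if w not in STOP_WORDS)

bag = bag_of_words("The plot of the movie is the best part of the movie")
print(bag)
```

Only the counts survive: the order in which the words appeared is discarded.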

Take the example of an email that just says "Hello Friend".\
We first calculate the prior probability of an email being spam ($P_{spam}$) and not spam ($P_{normal}$).\
We then calculate the probability of each word occurring in a spam email ($P(Hello|Spam)$, $P(Friend|Spam)$) and in a normal email ($P(Hello|Normal)$, $P(Friend|Normal)$). For each class, we multiply the word probabilities together with the prior probability of that class. For spam this gives a score proportional to the probability that the message is spam:
$$P(Spam|Hello,Friend) \propto P(Hello|Spam)*P(Friend|Spam)*P_{spam}$$
and the same for normal:
$$P(Normal|Hello,Friend) \propto P(Hello|Normal)*P(Friend|Normal)*P_{normal}$$
We then compare the two scores and predict the class with the larger one.
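
To make the "Hello Friend" comparison concrete, here is a small sketch with made-up numbers. All probabilities below and the `class_score` helper are assumptions chosen for illustration, not values estimated from any dataset.

```python
# Made-up priors and word likelihoods, for illustration only
p_spam, p_normal = 0.4, 0.6
p_word_given_spam = {"hello": 0.05, "friend": 0.02}
p_word_given_normal = {"hello": 0.10, "friend": 0.20}

def class_score(words, prior, likelihoods):
    """Prior multiplied by the likelihood of every word in the message."""
    score = prior
    for w in words:
        score *= likelihoods[w]
    return score

message = ["hello", "friend"]
score_spam = class_score(message, p_spam, p_word_given_spam)
score_normal = class_score(message, p_normal, p_word_given_normal)

# Predict the class with the larger score
prediction = "spam" if score_spam > score_normal else "normal"
print(score_spam, score_normal, prediction)
```

With these numbers the normal score is larger, so the message would be labelled normal.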

One problem arises when a word appears in the normal-email bag of words but not in the spam bag of words. The probability of that word occurring in spam is then 0, which makes the whole spam score 0. To prevent this, we add $\alpha$ to the count of every word. $\alpha$ is generally taken to be 1, but any positive number can be used.
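
A minimal sketch of this smoothing step is shown below, assuming a toy three-word vocabulary and made-up spam counts. The `smoothed_probs` helper is hypothetical; it adds $\alpha$ to every count and divides by the total of the adjusted counts, which is equivalent to adding 1 to each count and then summing, as the notebook does further down.

```python
from collections import Counter

def smoothed_probs(counts, vocab, alpha=1.0):
    """P(word | class) = (count + alpha) / (total count + alpha * vocabulary size)."""
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

# Toy data: "friend" never appears in spam
vocab = ["hello", "friend", "money"]
spam_counts = Counter({"money": 3, "hello": 1})

probs = smoothed_probs(spam_counts, vocab, alpha=1.0)
print(probs)
```

The unseen word "friend" now gets a small non-zero probability, and the probabilities still sum to 1 over the vocabulary.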

Probabilities can become very small when the bag of words is large, so we work with $\log(p)$ instead and add the log probabilities rather than multiplying the probabilities.
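
The sketch below shows why the logarithm helps, using a made-up review of 500 words that each have probability $10^{-4}$. The numbers are assumptions chosen only to trigger floating-point underflow.

```python
import math

# Made-up example: 500 words, each with probability 1e-4 under some class
word_probs = [1e-4] * 500

# Multiplying directly underflows to 0.0 in double precision
product = 1.0
for p in word_probs:
    product *= p

# Summing logs stays finite, so two classes can still be compared
log_sum = sum(math.log(p) for p in word_probs)

print(product)
print(log_sum)
```

Because the logarithm is monotonic, comparing the log scores of two classes gives the same prediction as comparing the original products.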

In [2]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.model_selection
from sklearn import datasets,preprocessing
import sklearn
import random
from collections import Counter

This code loads the requested fraction of positive and negative reviews from the training and testing datasets. Each review is kept with the given probability, so the returned size is approximate. For the training set it also returns a "vocab" set containing every word that remains after stop-word removal in our training data (both positive and negative).

In [3]:
import re
import os
import glob
import random
from nltk.corpus import stopwords
import nltk

REPLACE_NO_SPACE = re.compile(r"[._;:!`¦\'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")


def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    text = REPLACE_NO_SPACE.sub("", text)
    text = REPLACE_WITH_SPACE.sub(" ", text)
    text = re.sub(r'\d+', '', text)
    text = text.lower()
    words = text.split()
    return [w for w in words if w not in stop_words]

def load_training_set(percentage_positives, percentage_negatives):
    vocab = set()
    positive_instances = []
    negative_instances = []
    for filename in glob.glob(r'train/pos/*.txt'):  # Adjust path as needed
        if random.random() > percentage_positives:
            continue
        with open(os.path.join(os.getcwd(), filename), 'r',encoding='utf-8') as f:
            contents = f.read()
            contents = preprocess_text(contents)
            positive_instances.append(contents)
            vocab = vocab.union(set(contents))
    for filename in glob.glob(r'train/neg/*.txt'):  # Adjust path as needed
        if random.random() > percentage_negatives:
            continue
        with open(os.path.join(os.getcwd(), filename), 'r',encoding='utf-8') as f:
            contents = f.read()
            contents = preprocess_text(contents)
            negative_instances.append(contents)
            vocab = vocab.union(set(contents))
    return positive_instances, negative_instances, vocab

def load_test_set(percentage_positives, percentage_negatives):
    positive_instances = []
    negative_instances = []
    for filename in glob.glob(r'test/pos/*.txt'):  # Adjust path as needed
        if random.random() > percentage_positives:
            continue
        with open(os.path.join(os.getcwd(), filename), 'r', encoding= 'utf-8') as f:
            contents = f.read()
            contents = preprocess_text(contents)
            positive_instances.append(contents)
    for filename in glob.glob(r'test/neg/*.txt'):  # Adjust path as needed
        if random.random() > percentage_negatives:
            continue
        with open(os.path.join(os.getcwd(), filename), 'r', encoding= 'utf-8') as f:
            contents = f.read()
            contents = preprocess_text(contents)
            negative_instances.append(contents)
    return positive_instances, negative_instances

In [4]:
# Get half of the training data
pos_training,neg_training,vocab = load_training_set(0.5,0.5)


In [5]:
# get half of the testing data
pos_test,neg_test = load_test_set(0.5,0.5)

In [6]:
# make the vocab variable a list
vocab = list(vocab)

# Initialize two dictionaries that keep count of all the words
pos_vocab_count = {word: 0 for word in vocab}
neg_vocab_count = {word: 0 for word in vocab}


In [7]:
# Get the general probabilities of a movie review being negative or positive
prob_of_pos = len(pos_training)/(len(pos_training)+len(neg_training))
prob_of_neg = len(neg_training)/(len(pos_training)+len(neg_training))



In [8]:

# Initialize a counter
word_counts1 = Counter()

# go through all positive training data and count the words
for instance in pos_training:
    word_counts1.update(instance)
    


for keys in word_counts1.keys():
    pos_vocab_count[keys] = word_counts1[keys]





In [9]:
# Do the same for negative training data
word_counts2 = Counter()

for instance in neg_training:
    word_counts2.update(instance)


for keys in word_counts2.keys():
    neg_vocab_count[keys] = word_counts2[keys]


In [10]:
# Initialize lists for the efficiency of predicting positive and negative reviews
pos_efficiencies = []
neg_efficiencies = []


In [11]:
# Add 1 to all vocab counts, i.e. alpha is 1
for words in pos_vocab_count:
    pos_vocab_count[words] += 1
for words in neg_vocab_count:
    neg_vocab_count[words] += 1

In [12]:
# Total word count in positive and negative reviews (after smoothing)
sum_of_pos = sum(pos_vocab_count.values())
sum_of_neg = sum(neg_vocab_count.values())

In [13]:
# Initialize two dictionaries to store the probabilities of words in positive and negative reviews
prob_of_word_pos = {word:0 for word in vocab}
prob_of_word_neg = {word:0 for word in vocab}


# Calculate the probabilities of each word in positive and negative reviews
for keys in pos_vocab_count.keys():
    
    prob = (pos_vocab_count[keys])/(sum_of_pos)
    prob_of_word_pos[keys] = prob

for keys in neg_vocab_count.keys():
    
    prob = (neg_vocab_count[keys])/(sum_of_neg)
    prob_of_word_neg[keys] = prob







In [14]:
# Initialize an answer list
answer = []

# Evaluate on the negative test reviews
for i in range(len(neg_test)):
    # Initialize log probabilities as 0 for positive and negative reviews
    logprobpos = 0
    logprobneg = 0

    # Iterate over each word of the review and accumulate the log probabilities;
    # words that never appeared in the training vocabulary are skipped
    for word in neg_test[i]:
        if word in pos_vocab_count:
            logprobpos = np.log(prob_of_word_pos[word]) + logprobpos
            logprobneg = np.log(prob_of_word_neg[word]) + logprobneg

    # Add the log of the general (prior) probabilities
    actual_prob_pos = logprobpos + np.log(prob_of_pos)
    actual_prob_neg = logprobneg + np.log(prob_of_neg)

    # Check which probability is larger and predict accordingly, 1 for pos and 0 for neg
    if actual_prob_pos > actual_prob_neg:
        answer.append(1)
    elif actual_prob_pos < actual_prob_neg:
        answer.append(0)

    # If both probabilities are equal then assign randomly
    elif actual_prob_neg == actual_prob_pos:
        answer.append(random.randint(0,1))


# Calculate efficiency: the fraction of negative reviews predicted as negative
efficiency = answer.count(0)/len(answer)


print("Efficiency of the model in predicting negative reviews: ",efficiency)




Efficiency of the model in predicting negative reviews:  0.8763052208835341


In [15]:
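# Repeat the evaluation on the positive test reviews
answerpos = []
for i in range(len(pos_test)):
    logprobpos = 0
    logprobneg = 0

    # Accumulate log probabilities, skipping words outside the training vocabulary
    for word in pos_test[i]:
        if word in pos_vocab_count:
            logprobpos = np.log(prob_of_word_pos[word]) + logprobpos
            logprobneg = np.log(prob_of_word_neg[word]) + logprobneg

    # Add the log of the general (prior) probabilities
    actual_prob_pos = logprobpos + np.log(prob_of_pos)
    actual_prob_neg = logprobneg + np.log(prob_of_neg)

    # Predict 1 for pos and 0 for neg; break ties randomly
    if actual_prob_pos > actual_prob_neg:
        answerpos.append(1)
    elif actual_prob_pos < actual_prob_neg:
        answerpos.append(0)
    elif actual_prob_neg == actual_prob_pos:
        answerpos.append(random.randint(0,1))


# Calculate efficiency: the fraction of positive reviews predicted as positive
efficiencypos = answerpos.count(1)/len(answerpos)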


print("Efficiency of the model in predicting positive reviews: ",efficiencypos)

Efficiency of the model in predicting positive reviews:  0.7559916358372205


Now we make the confusion matrix of our model.\
For our data the confusion matrix would look like:

In [16]:
example_matrix = np.array([["True positives","False negatives"],["False positives","True negatives"]])
print(example_matrix)

[['True positives' 'False negatives']
 ['False positives' 'True negatives']]


In [17]:
confusuion_matrix = [[efficiencypos*len(answerpos), (1-efficiencypos)*len(answerpos)],[(1-efficiency)*len(answer) ,efficiency*len(answer)]]

In [18]:
confusuion_matrix = np.array(confusuion_matrix)
print(confusuion_matrix)

[[4700. 1517.]
 [ 770. 5455.]]


In [None]:
## precision is TP/(TP + FP) 
## recall is TP/(TP + FN)
## accuracy = (TP + TN)/(TP + FP + TN + FN)

accuracy = (confusuion_matrix[0][0] +confusuion_matrix[1][1])/(confusuion_matrix[0][0] + confusuion_matrix[0][1] + confusuion_matrix[1][0] + confusuion_matrix[1][1])

precision = confusuion_matrix[0][0]/(confusuion_matrix[0][0] + confusuion_matrix[1][0])

recall =  confusuion_matrix[0][0]/(confusuion_matrix[0][0] + confusuion_matrix[0][1])
print("Accuracy: ",accuracy)
print("Precision: ",precision)
print("Recall: ",recall)


Accuracy:  0.8161871081819643
Precision:  0.8592321755027422
Recall:  0.7559916358372205


Precision is important when false positives are more costly than false negatives.\
Recall is important when false negatives are more costly than false positives.\
Accuracy is appropriate when both are equally costly.\
For our data, precision is more important, because recommending a movie to a user on the basis of a review wrongly predicted as positive is the more costly mistake.

## The naivety of the Naive-Bayes algorithm

The main weakness of the Naive-Bayes algorithm is that the model treats every piece of text as just a bag of words and nothing more.\
For example, in our email example, the model would consider "Hello Friend" and "Friend Hello" to be the same thing. It cannot take sentence structure or grammar into account. Also, since we remove stop words such as "the", an email that says "the the the the" becomes an empty email, leaving the model with no words to score. We can handle this edge case by having a default classification.
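
Both points can be sketched in a few lines. The `predict_with_default` helper and the toy scoring functions below are hypothetical and only illustrate one way a default classification could be wired in; they are not part of the notebook's code.

```python
from collections import Counter

# Word order is lost: both messages produce the same bag of words
assert Counter(["hello", "friend"]) == Counter(["friend", "hello"])

def predict_with_default(words, score_pos, score_neg, default=1):
    """Return 1 (positive) or 0 (negative); use `default` when no words remain.

    `score_pos` and `score_neg` are hypothetical callables that return the
    log score of a word list under each class.
    """
    if not words:
        return default
    return 1 if score_pos(words) > score_neg(words) else 0

# Toy scorers, made up for illustration only
def toy_pos(ws):
    return -1.0 * len(ws)

def toy_neg(ws):
    return -2.0 * len(ws)

empty_prediction = predict_with_default([], toy_pos, toy_neg, default=0)
normal_prediction = predict_with_default(["great", "film"], toy_pos, toy_neg)
print(empty_prediction, normal_prediction)
```

Which class to use as the default is a design choice; the more common class in the training data is one reasonable option.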

## Summary

We have successfully built a Naive-Bayes classifier that labels a movie review as either positive or negative, and we have also examined the shortcomings of the model.