# Sentiment analysis using naive bayes
### Naive bayes
The naive bayes algorithm depends on a simple probability theorem called bayes theorem that states

The "naive" part in the name is a result of us treating each sentence in the data as independent from the other (no comment affects the other). This is not exactly true but is good enough for our purposes. Formally, we state this as (the data is iid "independent and identically distributed")

### Our data
To train our model, we will use a twitter sentiment dataset. This dataset consists of tweets and their correct sentiment ,positive or negative. It is made of around 100,000 tweets.

#### First we will read and observe the data

In [1]:
import numpy as np
import pandas as pd

pd_train = pd.read_csv("imbd_tr.csv",index_col=False,names=["row_index","SentimentText","Sentiment","Rating"],header=0)
print(pd_train.shape)
pd_train.head()

(25000, 4)


Unnamed: 0,row_index,SentimentText,Sentiment,Rating
0,0,Bromwell High is a cartoon comedy. It ran at t...,1,9
1,1,Homelessness (or Houselessness as George Carli...,1,8
2,2,Brilliant over-acting by Lesley Ann Warren. Be...,1,0
3,3,This is easily the most underrated film inn th...,1,7
4,4,This is not the typical Mel Brooks film. It wa...,1,8


In [2]:
pd_test = pd.read_csv("imbd_te.csv",index_col=False,names=["row_index","SentimentText","Sentiment","Rating"],header=0)
print(pd_test.shape)
pd_test.head()

(25000, 4)


Unnamed: 0,row_index,SentimentText,Sentiment,Rating
0,0,I went and saw this movie last night after bei...,1,0
1,1,Actor turned director Bill Paxton follows up h...,1,7
2,2,As a recreational golfer with some knowledge o...,1,9
3,3,"I saw this film in a sneak preview, and it is ...",1,8
4,4,Bill Paxton has taken the true story of the 19...,1,8


#### Now, we will preprocess the data by removing punctuation and converting the sentences in arrays of words. We will also shuffle the array to ensure that consecutive tweets are unrelated (iid)

In [3]:
punctuation = [".",",",":","!","?","(",")","/",";","*"]
def preprocess(data):
    preprocessed = []
    for idx,row in enumerate(data["SentimentText"]):
        row = row.lower()
        for punct in punctuation:
            row = row.replace(punct,"")
        preprocessed.append([row,data["Sentiment"][idx]])
    return preprocessed
preprocessed_data = preprocess(pd_train)    

train_data = [[row[0].split(),row[1]] for row in preprocessed_data]
np.random.shuffle(train_data)

classes = [0, 1] #For our data, we have only two classes (0 => negative and 1 => positive)

In [4]:
preprocessed_test_data = preprocess(pd_test)    

test_data = [[row[0].split(),row[1]] for row in preprocessed_test_data]
np.random.shuffle(test_data)

#### Training our model

In [5]:
from tqdm import tqdm

def train_mult_naive_bayes(training_data, classes):
    n_w_c = [0] * len(classes) # will contain the number of words in each class (ex: n_w_c[0] will be the number of words that were in documents that belong to class 0)
    log_prior = [0] * len(classes) # will contain the priors for each class (ex: log_prior[0] = log P(class=0))
    log_likelihood = [0] * len(classes) # will contain the log likelihood for each word given a class (ex: loglikelihood[0]["good"] is the log P("good"|class=0))
    documents_c = [[]] * len(classes) # will contain the document in each class (ex: documents_c[0] will contains all documents that are classified as class 0)
    n_docs = len(training_data) # numbers of documents in the training data
    Vocab = [] # a vocabulary of unique words
    dictionaries = [{},{}] # will contains the number of occurences of a word in each class (ex: dictionaries[0]["good"] is the number of occurences of the word "good" in documents that belonged to class 0)
    
    # loop over all the training data
    for review,parity in tqdm(training_data):
        
        # add this document to its correspoding class
        documents_c[parity].append([review,parity])
        
        # add the number of words to n_w_c in the correct class
        n_w_c[parity] += len(review)
        
        # loop over all words in the document and add the number of occurences to dictionaries. Also add unique words to Vocab
        for word in review:
            if word in dictionaries[parity]:
                dictionaries[parity][word] += 1
            else:
                dictionaries[parity][word] = 1
                
            if word not in Vocab:
                Vocab.append(word)
  
    
    Vocab_size = len(Vocab)
    n_w_c[0] += Vocab_size
    n_w_c[1] += Vocab_size
    
    
    # loop over all classes
    for parity in range(len(classes)):
        n_docs_in_class = len(documents_c[parity])
        
        log_prior[parity] = np.log((n_docs_in_class)/ n_docs)

        denom = n_w_c[parity]

        # loop over all unique words and calculate log_likelihoods for each word
        for word in tqdm(Vocab):
            if word in dictionaries[parity]:
                dictionaries[parity][word] = np.log((dictionaries[parity][word])/denom)
            else:
                dictionaries[parity][word] = np.log(1.0/denom)
        
        log_likelihood[parity] = dictionaries[parity]
        
    return (Vocab, log_prior, log_likelihood)

In [6]:
Vocab_,log_prior_,log_likelihood_ = train_mult_naive_bayes(train_data,classes)

100%|███████████████████████████████████| 25000/25000 [17:53<00:00, 23.30it/s]
100%|█████████████████████████████| 155010/155010 [00:00<00:00, 485074.23it/s]
100%|█████████████████████████████| 155010/155010 [00:00<00:00, 459071.15it/s]


### Testing the accuracy of our model

In [7]:
def test_mult_naive_bayes(review, classes, log_prior, log_likelihood, Vocab):
    log_posteriors = [0] * len(classes)
    
    # loop over all classes and calculate the log_posterior for each
    for c in classes:
        sum_log_likelihoods = 0
        
        for word in review:
            if word in Vocab:
                sum_log_likelihoods += log_likelihood[c][word]
        
        log_posteriors[c] = log_prior[c] + sum_log_likelihoods
    
    # Find the class with the maximum log_posterior
    max_log_posterior = max(log_posteriors)
    argmax_log_posterior = log_posteriors.index(max_log_posterior)
        
    return classes[argmax_log_posterior]

num_of_reviews_to_test = 2000
correct_count = 0.0

# loop over a number of test points = num_of_reviews_to_test and count the number of correctly classified ones
for review,parity in tqdm(test_data[:num_of_reviews_to_test]):
    pred = test_mult_naive_bayes(review,classes,log_prior_,log_likelihood_,Vocab_)
    if pred == parity:
        correct_count += 1

# print the accuracy of the model
accuracy = correct_count/num_of_reviews_to_test
print("Test accuracy: {} %".format(accuracy*100))

100%|█████████████████████████████████████| 2000/2000 [04:02<00:00,  7.84it/s]


Test accuracy: 81.25 %
