# **Naive Bayes Classifier**
***

**What is Naive Bayes algorithm?**

It is a classification method built on the Bayes Theorem and predicated on the idea of predictor independence. A Naive Bayes classifier, to put it simply, believes that the presence of one feature in a class has nothing to do with the presence of any other feature.

A fruit might be categorized as an apple, for instance, if it is red, rounded, and around 3 inches in diameter. Even if these characteristics depend on one another or on the presence of other characteristics, each of these traits separately increases the likelihood that this fruit is an apple, which is why it is called "Naive."

> Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.

Bayes theorem provides a way of calculating posterior probability P(c|x) - *(read as Probability of **c** given **x**)*,  from P(c), P(x) and P(x|c). Look at the equation below:
>
> $$\mathbf{P} \left({x \mid c} \right) = \frac{\mathbf{P} \left ({c \mid x} \right) \mathbf{P} \left({c} \right)}{\mathbf{P} \left( {x} \right)}$$

where,

* *x is set of features*
* *c is set of classes*
* P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
* P(c) is the prior probability of class **c**.
* P(x|c) is the observation density or likelihood which is the probability of predictor(the query  **x**) given class.
* P(x) is the prior probability of predictor **x**, and it is also called as Evidence.

**Why should we use Naive Bayes ?**

* As stated above, It is **_easy_** to build and is particularly useful for **_very large data sets_**.
* It is **extremely fast** for both training and prediction.
* It provide straightforward probabilistic prediction.
* It is often very easily interpretable.
* It has very few (if any) tunable parameters.
* It perform well in case of categorical input variables compared to numerical variable(s). For numerical variable, normal distribution is assumed (bell curve, which is a strong assumption).

**Disadvantages of Naïve Bayes Classifier:**

Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.

**Applications of Naïve Bayes Classifier:**

It is used for Credit Scoring.

It is used in medical data classification.

It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.

It is used in Text classification such as Spam filtering and Sentiment analysis.

**Types of Naïve Bayes Model:**

There are three different forms of naive bayes models, which are listed below:

**Gaussian:** The Gaussian model presupposes that features are distributed normally. This indicates that the model thinks that predictor values are samples from the Gaussian distribution if they take continuous values rather than discrete ones.

**Multinomial Naive Bayes:** When the data is multinomially distributed, the Multinomial Naive Bayes classifier is employed. It is mostly used for document classification issues, indicating which category a specific document belongs to.
Word frequency is used by the classifier as a predictor.

**Bernoulli classifier:** In contrast to the Multinomial classifier, the Bernoulli classifier uses independent Boolean values as predictor variables. such as determining whether a word is used or not in a document.

**Steps to implement:**

Data Pre-processing step

Fitting Naive Bayes to the Training set

Predicting the test result

Test accuracy of the result(Creation of Confusion matrix)

Visualizing the test set result.

In [109]:
import os
import numpy as np
from sys import path
import re
import random

### read file

In [169]:
TRAIN_DATA= "/content/train_data.csv"
TEST_DATA = "/content/test_data.csv"
DEV ="/content/sample_submission.csv"
data_train = pd.read_csv(TRAIN_DATA)
data_test  = pd.read_csv(TEST_DATA)

In [123]:
merged_df= pd.concat([data_train, data_test], axis=0)

## a.Divide the dataset as train, development and test. 

In [170]:
# Split dataset to k folds
def crossValidationSplit(data, k_folds):
    data_split = list()
    data_copy = list(data)
    size = int(len(data) / k_folds)
    for _ in range(k_folds):
        fold = list()
        while len(fold) < size:
            k = random.randrange(len(data_copy))
            fold.append(data_copy.pop(k))
        data_split.append(fold)
    return data_split

def splitDataToTrainAndDev(dataset, k_folds):
    folds = crossValidationSplit(dataset, k_folds)
    train_set, dev_set = [], []
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        dev_set = list()
        for row in fold:
                row_copy = list(row)
                dev_set.append(row_copy)
        break
    return train_set, dev_set

data_train = pd.read_csv(TRAIN_DATA)

dev = pd.read_csv(DEV)

data_test  = pd.read_csv(TEST_DATA)




print('Train dataset size: ' + str(len(data_train)))
print('Dev dataset size: ' + str(len(dev)))
print('Test dataset size: ' + str(len(data_test)))

Train dataset size: 60115
Dev dataset size: 5
Test dataset size: 15029


## b.Build a vocabulary as list. 

\[‘the’ ‘I’ ‘happy’ … \] 

You may omit rare words for example if the occurrence is less than five times 

A reverse index as the key value might be handy 

{“the”: 0, “I”:1, “happy”:2 , … }


In [139]:
def SegmentLineToWords(string):
    string=string.replace('<br />', '')
    return set([x.lower() for x in re.split(r'[\s|,|;|.|/|\[|\]|;|\!|?|\'|\\|\)|\(|\"|@|&|#|-|*|%|>|<|^|-]\s*',string.strip()) if x])

def buildVocabularyList(dataset):
    dict_list = {} #{'word':[merged_df, dev_count]}
    for row in dataset:
        words = set() #Words that appear multiple times in the same comment are counted only once
        words = words.union(SegmentLineToWords(str(row[0])))
        for word in words:
            if word not in dict_list:
                dict_list[word] = [0,0]
            if row[1] == -1:
                dict_list[word][0] += 1
            else:
                dict_list[word][1] += 1
    for word in list(dict_list.keys()):
        if dict_list[word][0] + dict_list[word][1]<5:
            del dict_list[word]
    return dict_list
train_dict = buildVocabularyList(data_train)
train_dict

{}

## c.Calculate the following probability

Probability of the occurrence

P\[“the”\] = num of documents containing ‘the’ / num of all documents

In [140]:
def getProbabilityOfOccurrence(word):
    if word not in train_dict:
        return 0
    else: 
        return (train_dict[word][0] + train_dict[word][1])/(len(data_train))
print("P[“the”] = " + str(getProbabilityOfOccurrence("the")))

P[“the”] = 0


Conditional probability based on the sentiment

P\[“the” | Positive\]  = # of positive documents containing “the” / num of all positive review documents


In [141]:
def getPosConditionalProbability(word):
    if word not in train_dict:
        return 0
    else:
        return train_dict[word][1]/(len(data_train))

print("P[“the” | data_train] = " + str(getPosConditionalProbability("the")))


P[“the” | data_train] = 0


## d.Calculate accuracy using dev dataset 

In [143]:
def predict(review, smoothing_flag):
    words = set()
    words = words.union(SegmentLineToWords(review))
    data_train_probability = 1
    
    for word in words:
        if smoothing_flag == 1:
            data_train_probability *= getPosConditionalProbabilityUsingSmoothing(word)
        
            
        else:
            #print(word)
            #print("getdatatrainConditionalProbability: " + str(getPosConditionalProbability(word)))
         
            data_train_probability *= getdatatrainConditionalProbability(word)
            
    #print("data_train_probability: " + str(data_train_probability))
    
   

def accuracy_metric(test_dataset, smoothing_flag):
    correct = 0
    for row in test_dataset:
        #print( predict(str(row[0]), smoothing_flag))
        #print(row[1])
        if row[1] == predict(str(row[0]), smoothing_flag):
            correct += 1
    return correct / float(len(test_dataset)) * 100.0

In [149]:
#train_dict
print('Accuracy: %.3f%%' )

Accuracy: %.3f%%


### Conduct five fold cross validation

In [174]:
def evaluate_algorithm(pos_dataset, neg_dataset, k_folds, smoothing_flag):
    data_train_folds = crossValidationSplit(pos_dataset, k_folds)
    scores = list()
    for i in range(0,len(data_train_folds)):
        data_train = list(data_train_folds)
       
        
        data_train.remove(data_train_folds[i])
       
        
        data_train = sum(data_train, [])
        
        
        dev = list()
        
       
        
        dev = dev
        accuracy = accuracy_metric
        scores.append(accuracy)
    print('Scores: %s' % scores)
    print('Mean Accuracy: %.3f%%' )
smoothing_flag = 0
evaluate_algorithm(dataset_pos, dataset_neg, 5, smoothing_flag)

Scores: [<function accuracy_metric at 0x7f9f6cfeab00>, <function accuracy_metric at 0x7f9f6cfeab00>, <function accuracy_metric at 0x7f9f6cfeab00>, <function accuracy_metric at 0x7f9f6cfeab00>, <function accuracy_metric at 0x7f9f6cfeab00>]
Mean Accuracy: %.3f%%


## e.Do following experiments

### Compare the effect of Smoothing

In [156]:
lambda_value = 1
def getPosConditionalProbabilityUsingSmoothing(word):
    if word not in train_dict:
        return lambda_value/(2*lambda_value+len(data_train))
    else:
        return (lambda_value + train_dict[word][1])/(2*lambda_value+len(data_train))


In [179]:
smoothing_flag = 1

print('Accuracy not using Smoothing: %.3f%%' )
print('Accuracy by using Smoothing: %.3f%%'  )

Accuracy not using Smoothing: %.3f%%
Accuracy by using Smoothing: %.3f%%


### As we can see,if only the training set and formula are used to count the data obtained, the performance is not good. So the in the follows i will use the dev and algorithm to get top 10 word

In [163]:
def getDevPredictsList():
    predicts_list = {}
    correct = 0
    for row in dev:
        if row[1] == predict(str(row[0]), 1):
            words = set()
            words = words.union(SegmentLineToWords(str(row[0])))
            for word in words:
                if word not in predicts_list:
                    predicts_list[word] = [0,0]
                if row[1] == -1:
                    predicts_list[word][0] += 1
                else:
                    predicts_list[word][1] += 1
    return predicts_list
predicts_list = getDevPredictsList()

In [166]:
def getTop10UsingDev(label):
    positive_list = []
    for word in list(predicts_list.keys()):
        #value = ((train_dict[word][1] / len(train_pos)) * (len(train_pos) / (len(train)))) / (train_dict[word][1]/len(train_pos)*(len(train_pos) / len(train)) + train_dict[word][0])/len(train_neg)*(len(train_neg) / len(train))
        #value = float(train_dict[word][1]) / float(len(train) * (train_dict[word][1] + train_dict[word][0]))
        value = (predicts_list[word][label] + lambda_value) / (predicts_list[word][1] + lambda_value + predicts_list[word][0] + lambda_value)
        positive_list.append([word,value])
    return positive_list
positive_list = np.array(getTop10UsingDev(1))


negative_list = np.array(getTop10UsingDev(0))


print("Top 10 words that predicts  using dev data :")
printTop10(positive_list)
print("")


Top 10 words that predicts  using dev data :



## f.Using the test dataset

In [168]:
print('The accuracy by using Smoothing of test dataset: %.3f%%' % accuracy_metric(data_test, smoothing_flag))

The accuracy by using Smoothing of test dataset: 0.000%
