# Bernoulli Naive Bayes

- It is based on a boolean value (true/false, or 1/0) for the presence or absence of each vocabulary word in a document.
- It builds on Bag of Words: a Vocab is created from all unique words across all documents.
- A document is represented by a feature vector of binary elements, each recording whether that word is present in the document.
- Mostly used for text classification of shorter documents.
- Word order is not preserved, so any sentiment that depends on the order of words in a sentence is lost.
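
As a minimal sketch of the representation described above, the snippet below builds a vocabulary from three made-up messages and encodes each one as a presence/absence vector. The toy documents and the helper name `to_binary_vector` are assumptions for illustration; they are not part of the dataset or of the notebook's code.

```python
# Toy corpus (made up for illustration)
docs = ["win cash now", "call me now", "win win win"]

# Vocabulary: every unique word across all documents, sorted
vocab = sorted(set(" ".join(docs).split()))

def to_binary_vector(doc, vocab):
    """Return 1 for each vocab word present in doc, else 0 (counts are ignored)."""
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

vectors = [to_binary_vector(d, vocab) for d in docs]

print(vocab)
for d, v in zip(docs, vectors):
    print("{} -> {}".format(d, v))
```

Note that "win win win" gets a single 1 in the `win` column: repeated words do not change a Bernoulli feature vector.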

# Model 1: Building from Scratch

## AIM

`To build a text classifier from a collection of SMS documents labelled with one of two classes, Spam or Ham, and to predict the class of a new document.`

## Data

The dataset is a collection of SMS messages, each stored as one document. Label, Document and Doc_Number are the fields.

## Imports

In [81]:
import pandas as pd
import numpy as np
import math
import random
import re
import string

from nltk.stem import WordNetLemmatizer
from sklearn.utils import shuffle

## Dataset

In [82]:
df = pd.read_excel('Data//data_5_3_Spam_Ham_Dataset_BinaryClass.xlsx')
df.head()

Unnamed: 0,Doc_Number,Document,Label
0,1,"Go until jurong point, crazy.. Available only ...",ham
1,2,Ok lar... Joking wif u oni...,ham
2,3,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,4,U dun say so early hor... U c already then say...,ham
4,5,"Nah I don't think he goes to usf, he lives aro...",ham


In [83]:
df.shape

(5572, 3)

## Labelling Target

In [84]:
dict_target, id_target = {}, 0

for key in df.Label.unique():
    dict_target[key] = id_target
    id_target += 1

print dict_target
    
df.Label = df.Label.map(dict_target)
df.head()

{u'ham': 0, u'spam': 1}


Unnamed: 0,Doc_Number,Document,Label
0,1,"Go until jurong point, crazy.. Available only ...",0
1,2,Ok lar... Joking wif u oni...,0
2,3,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,4,U dun say so early hor... U c already then say...,0
4,5,"Nah I don't think he goes to usf, he lives aro...",0


## Cleaning Document

In [85]:
def preprocessing(document_column):
    
    # Convert unicode to string
    document_column = document_column.apply(lambda x: x.encode('ascii', 'ignore') if type(x) == unicode else x)
    
    # Loading the stop words from a text file; stop letters are the single ASCII letters
    StopWords_txt = open('Data/Stop_Words.txt', 'r').read()
    stop_words = re.findall("\'(\w+)\'",StopWords_txt)
    stop_letters = re.findall(r'\w', string.letters)
    
    # Loading NLTK's lemmatizer -> Example: 'plays' is converted to 'play'
    lm = WordNetLemmatizer()
    
    processed_documents = []
    for doc in document_column:
        
        doc = str(doc)
        doc_words, final_doc_words = [], []        
            
        # Basic regex
        doc = re.sub('[\s\t\n\r\.\{\(\[\}\]\)\,\;\'\"\/\\\?\_\-\>\<\\:\-\+\=\@\#\$\%\&\*\!0-9]+', ' ', doc)
        doc = re.sub('\s+', ' ', doc)
        doc = ' '.join(re.findall('\w+', doc))
        doc = str(doc.lower())
        
        # Stop words, Stop letters
        for w in doc.split():
            if w not in stop_words and w not in stop_letters:
                doc_words.append(w)    
                
        # Lemmatize
        for w in doc_words:
            final_doc_words.append(str(lm.lemmatize(w)))
        
        # Appending to a final list
        processed_documents.append(' '.join(final_doc_words))
        
    return processed_documents
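
The cleaning steps above (lowercase, strip punctuation and digits, drop stop words and single letters) can be sketched in a few lines that run without NLTK or the stop-word file. This is a simplified illustration, not the notebook's function: the `STOP_WORDS` set is a hypothetical stand-in for `Data/Stop_Words.txt`, the name `clean_document` is made up, and lemmatization is left out.

```python
import re

# Hypothetical stop-word list for illustration; the notebook loads its own from a file
STOP_WORDS = {"in", "to", "the", "so", "then", "until", "only"}

def clean_document(text, stop_words=STOP_WORDS):
    """Lowercase, keep alphabetic tokens only, drop stop words and single letters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in stop_words and len(t) > 1]
    return " ".join(kept)

print(clean_document("Free entry in 2 a wkly comp to win FA Cup"))
print(clean_document("Ok lar... Joking wif u oni..."))
```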

In [86]:
df['Processed_Document'] = preprocessing(df.Document)
df.head()

Unnamed: 0,Doc_Number,Document,Label,Processed_Document
0,1,"Go until jurong point, crazy.. Available only ...",0,go jurong point crazy available bugis great wo...
1,2,Ok lar... Joking wif u oni...,0,ok lar joking wif oni
2,3,Free entry in 2 a wkly comp to win FA Cup fina...,1,free entry wkly comp win fa cup final tkts st ...
3,4,U dun say so early hor... U c already then say...,0,dun say early hor already say
4,5,"Nah I don't think he goes to usf, he lives aro...",0,nah think go usf life around though


## Shuffle

In [87]:
df = shuffle(df).reset_index(drop=True)
df.head()

Unnamed: 0,Doc_Number,Document,Label,Processed_Document
0,4773,Hi..i got the money da:),0,hi got money da
1,2981,"Xmas Offer! Latest Motorola, SonyEricsson & No...",1,xmas offer latest motorola sonyericsson nokia ...
2,4579,Had your contract mobile 11 Mnths? Latest Moto...,1,contract mobile mnths latest motorola nokia et...
3,1272,"Sorry chikku, my cell got some problem thts y ...",0,sorry chikku cell got problem thts nt able rep...
4,3747,"Aight, let me know when you're gonna be around...",0,aight let know gonna around usf


## Train - Test Split

In [88]:
def train_test_split(df, train_size):
    
    train_index = int(train_size*len(df))
    test_index = train_index + 1
    
    train_data = df[0:train_index+1]
    test_data = df[test_index:]
    print "-> train data = {} Rows;  test data = {} Rows".format(len(train_data), len(test_data))
    
    return train_data, test_data

In [89]:
train_data, test_data = train_test_split(df, 0.80)

-> train data = 4458 Rows;  test data = 1114 Rows


## Analysis

AIM: P(Class = ? |W1, W2, W3, W4, W5, ..., Wj) = ?

Solution: 

        Bayes Rule: 
        P(Y|X) = P(X|Y) * P(Y) / P(X)
        
        Naive Bayes form of Bayes' Theorem:
        P(Class = Y | X1,X2,...,Xj) ∝ P(X1|Y)*P(X2|Y)* ...*P(Xj|Y) * P(Y)        {The denominator P(X1,...,Xj) is the same for every class, so it can be ignored}
        
        Assumption:
        All features are conditionally independent given the class.
        
        
        Vocab = [V1, V2, V3, ..., Vn]
        
        Test Doc:
        Words = Presence or absence of each Vocab word = [V1 = 1 or 0, V2 = 1 or 0,..., Vn=1 or 0] = [1,0,0,1,0....,1]
        
        Score per class (proportional to the posterior):
        # P(Class=C1|W1,W2,...,Wj) ∝ P(W1,W2,...,Wj|C1) * P(C1) = P(W1|C1)*P(W2|C1)* ...*P(Wj|C1) * P(C1)
        # P(Class=C2|W1,W2,...,Wj) ∝ P(W1,W2,...,Wj|C2) * P(C2) = P(W1|C2)*P(W2|C2)* ...*P(Wj|C2) * P(C2)
        # ...
        # P(Class=Ck|W1,W2,...,Wj) ∝ P(W1,W2,...,Wj|Ck) * P(Ck) = P(W1|Ck)*P(W2|Ck)* ...*P(Wj|Ck) * P(Ck)
        
        Here P(Wi|C) is P(Wi=1|C) when the word is present in the test doc and 1 - P(Wi=1|C) when it is absent.
                
        Compare the scores: the class with the greatest score is the predicted class!
        
        
        Calculations:
        
        1. P(C1)      =  n(docs where Class = C1) / n(total docs)
        
        2. P(W1=1|C1) =  [n(docs where W1 is present and Class = C1) + alpha] / [n(total docs where Class = C1) + len(Vocab)]
                    
                      The count in the numerator is the number of C1 docs that contain W1
                       
                   alpha -> Smoothing parameter
                   - Without it, if W1 never appears in a doc of class C1, P(W1=1|C1) = 0 and the entire product becomes 0.
                   - With it, a word that never occurred with the class still gets a small, non-zero probability.
                   - alpha should be a small value, so that such unseen words get a low probability.
                   - len(Vocab) is added to the denominator to normalise (the textbook Bernoulli estimate adds 2 * alpha instead, one alpha for each outcome: present and absent).
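
To make the smoothing step concrete, here is a small worked example. All counts are made up for illustration. It compares the estimate as written in the Calculations block (denominator `n + len(Vocab)`) with the commonly used Bernoulli estimate (denominator `n + 2 * alpha`); the function names are hypothetical.

```python
# Made-up counts for one word in one class (illustration only)
docs_in_class = 600        # training docs labelled with the class
vocab_size = 6000          # number of words in the vocabulary
alpha = 0.0001             # smoothing parameter

def smoothed_notebook(count, n_docs, vocab_size, alpha):
    """Estimate as written in this notebook: (count + alpha) / (n_docs + len(Vocab))."""
    return (count + alpha) / float(n_docs + vocab_size)

def smoothed_standard(count, n_docs, alpha):
    """Common Bernoulli estimate: (count + alpha) / (n_docs + 2 * alpha)."""
    return (count + alpha) / (n_docs + 2.0 * alpha)

# A word that never occurs in the class
unsmoothed = 0 / float(docs_in_class)
p_unseen_notebook = smoothed_notebook(0, docs_in_class, vocab_size, alpha)
p_unseen_standard = smoothed_standard(0, docs_in_class, alpha)

# A word that occurs in every doc of the class
p_always_notebook = smoothed_notebook(docs_in_class, docs_in_class, vocab_size, alpha)
p_always_standard = smoothed_standard(docs_in_class, docs_in_class, alpha)

print("unseen word, unsmoothed = {}".format(unsmoothed))
print("unseen word, notebook   = {:.3e}".format(p_unseen_notebook))
print("unseen word, standard   = {:.3e}".format(p_unseen_standard))
print("always present, notebook = {:.4f}".format(p_always_notebook))
print("always present, standard = {:.4f}".format(p_always_standard))
```

Both smoothed estimates keep an unseen word away from exactly zero. They differ for frequent words: with the vocabulary size in the denominator, even a word present in every document of the class gets a probability well below 1.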

## Fitting and Transforming Data

1. Creating a Training Data Vocabulary (Fitting Vocab)

2. Transforming 'Training Data' on Fitted Training Data Vocab

3. Transforming 'Testing Data' on Fitted Training Data Vocab

#### 1. Creating a Training Data Vocabulary (Fitting Vocab)

In [90]:
# A simple approach: join all training documents, split on whitespace and keep the unique words
Vocab = sorted(list(set(' '.join(train_data.Processed_Document).split())))

print "Vocabulary (Train Data)= {} words".format(len(Vocab))

Vocabulary (Train Data)= 6375 words


In [91]:
print "Total Words in Dataset = {} words".format(len(sorted(list(set(' '.join(df.Processed_Document).split())))))

Total Words in Dataset = 7116 words


#### 2. Transforming 'Training Data' on Fitted Training Data Vocab: Building a Presence/Absence table

In [115]:
Word_Presence_Table = pd.DataFrame()

for word in Vocab:
    
    presence_per_doc = []

    for doc in train_data['Processed_Document']:
        
        if word in doc.split():
            presence_per_doc.append(1)
        else:
            presence_per_doc.append(0)
    
    Word_Presence_Table[word] = presence_per_doc

Word_Presence_Table['Class'] = train_data.Label
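
The loop above re-splits every document once per vocabulary word. One possible alternative, sketched below on toy data, is to walk each document once and look its words up in a word-to-column index. The names `column_of` and `presence_rows` are hypothetical, and the toy lists stand in for `train_data.Processed_Document` and `Vocab`.

```python
# Toy stand-ins for the processed training documents and Vocab (illustration only)
train_docs = ["free entry win", "ok lar", "win cash free free"]
vocab = sorted(set(" ".join(train_docs).split()))

# Map each vocabulary word to its column position once
column_of = {word: i for i, word in enumerate(vocab)}

def presence_rows(docs, column_of):
    """Build one 0/1 row per document, touching only the words the document contains."""
    rows = []
    for doc in docs:
        row = [0] * len(column_of)
        for word in set(doc.split()):
            col = column_of.get(word)      # None for words outside the vocabulary
            if col is not None:
                row[col] = 1
        rows.append(row)
    return rows

rows = presence_rows(train_docs, column_of)
for doc, row in zip(train_docs, rows):
    print("{!r:22} {}".format(doc, row))
```

The resulting list of rows could be wrapped as `pd.DataFrame(rows, columns=vocab)`. Because out-of-vocabulary words are skipped, the same helper also works for transforming test documents against the training vocabulary.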

In [117]:
Word_Presence_Table.head()

Unnamed: 0,aa,aah,aaniye,aaooooright,aathi,ab,abeg,abel,aberdeen,abi,...,zebra,zed,zero,zf,zhong,zindgi,zoe,zoom,zyada,Class
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [118]:
Word_Presence_Table.shape

(4458, 6376)

- Number of columns = 6376 = 6375 (Vocab words) + 1 (Class)

- Columns = Vocab learned from the training data = 6375 words in total. The same columns will be used to transform the test data as well.

- Represents all the words in the training data along with their presence or absence in each doc.
- Example: the word 'aa' doesn't appear in the first 5 documents.

#### 3. Transforming 'Testing Data' on Fitted Training Data Vocab: Building a Presence/Absence table

In [149]:
Test_Word_Presence_Table = pd.DataFrame()

for word in Vocab:
    
    presence_per_doc = []

    for doc in test_data['Processed_Document']:
        
        if word in doc.split():
            presence_per_doc.append(1)
        else:
            presence_per_doc.append(0)
    
    Test_Word_Presence_Table[word] = presence_per_doc
    
Test_Word_Presence_Table.head()

Unnamed: 0,aa,aah,aaniye,aaooooright,aathi,ab,abeg,abel,aberdeen,abi,...,zealand,zebra,zed,zero,zf,zhong,zindgi,zoe,zoom,zyada
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [162]:
Test_Word_Presence_Table.values

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

## Calculations

#### Prob of a class: P(Class = C_i)

In [123]:
train_data.Label.value_counts()

0    3850
1     608
Name: Label, dtype: int64

In [124]:
def prob_class(cls):
    
    N = len(train_data.Label)
    Prob = train_data.Label.value_counts()[cls].astype(float)/N
    return Prob

#### Prob of a word given a particular class: P(Wordj | Class = C_i)

In [133]:
def prob_word_given_class(word, cls, alpha):
    
    # Training docs of this class; N is needed by both branches below
    class_rows = Word_Presence_Table[Word_Presence_Table['Class'] == cls]
    N = len(class_rows)
    
    # For a word existing in our training data's Vocabulary
    if word in Vocab:
        Appearance = class_rows[word].sum()
        Prob = (Appearance + alpha)/(N + len(Vocab))
    
    # For a new word only present in our test data (Real world scenario)
    else:
        Prob = (0 + alpha)/(N + len(Vocab))
        
    return Prob
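
Since the function above filters the presence table on every call, one option is to count once and look the values up afterwards. The sketch below does this on a toy training set with the same estimate, `(count + alpha) / (N + len(Vocab))`. The data and the names `docs_per_class`, `docs_with_word` and `prob_word_given_class_cached` are assumptions for illustration.

```python
from collections import defaultdict

# Toy training set: (set of words in the doc, class label); illustration only
train = [({"free", "win"}, 1), ({"win", "cash"}, 1), ({"ok", "lar"}, 0)]
vocab = sorted(set().union(*[words for words, _ in train]))
alpha = 0.0001

# One pass: docs per class, and docs per class containing each word
docs_per_class = defaultdict(int)
docs_with_word = defaultdict(int)      # key: (word, class)
for words, cls in train:
    docs_per_class[cls] += 1
    for w in words:
        docs_with_word[(w, cls)] += 1

def prob_word_given_class_cached(word, cls):
    """Same estimate as the function above, read from the precomputed counts."""
    count = docs_with_word.get((word, cls), 0)   # 0 for unseen or out-of-vocabulary words
    return (count + alpha) / float(docs_per_class[cls] + len(vocab))

print(prob_word_given_class_cached("win", 1))
print(prob_word_given_class_cached("win", 0))
print(prob_word_given_class_cached("brandnew", 1))
```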

## Prediction

#### Predicting Class for individual test docs

In [168]:
def prob_class_with_test_feature_vector(test_doc_feature_vector):
    
    
    # Vocab (Train Data) = [V1, V2, V3, ..., Vn-1, Vn]
    #
    # test_doc_feature_vector = [V1=1|0, V2=1|0, V3=1|0, ..., Vn-1=1|0, Vn=1|0]
    #
    # Example: test_doc_feature_vector = [0,0,0, ..., 1,0]
    #          
    #  - It means V1 is absent, V2 is absent, V3 is absent, ..., Vn-1 is present, Vn is absent
    #  - Score(C1) = { [1 - P(V1|C1)] * [1 - P(V2|C1)] * [1 - P(V3|C1)] * ... * [P(Vn-1|C1)] * [1 - P(Vn|C1)] } x P(C1)
    #    which is proportional to P(Class=C1 | test doc)
    
    # Initialising "big_class" with a random choice; it is kept only if every class scores 0
    big = 0
    big_class = random.choice(df.Label.unique())

    # Smoothing parameter
    alpha = 0.0001

    for cls in df.Label.unique():
        
        # For each class, 
        likelihood = 1
        for word, word_vector in zip(Vocab, test_doc_feature_vector):

            Prob_word = prob_word_given_class(word, cls, alpha)

            if word_vector == 1:
                likelihood *= Prob_word
            else:
                likelihood *= (1.0 - Prob_word)

        likelihood *= prob_class(cls)

        if likelihood > big:
            big = likelihood
            big_class = cls

        
    return big_class
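
The function above multiplies thousands of probabilities together, which can underflow to zero for longer documents. A common remedy is to add logarithms instead, since the class with the largest log score is also the class with the largest product. The sketch below is a standalone illustration with made-up probabilities; `log_score` is a hypothetical helper, not the notebook's function.

```python
import math

def log_score(feature_vector, word_probs, class_prior):
    """Sum of logs instead of a product of probabilities.

    feature_vector: 0/1 per vocabulary word
    word_probs:     P(word=1 | class) per vocabulary word, each strictly between 0 and 1
    """
    total = math.log(class_prior)
    for present, p in zip(feature_vector, word_probs):
        total += math.log(p) if present == 1 else math.log(1.0 - p)
    return total

# Illustration of the underflow the log form avoids: 2000 factors of 0.5
product = 1.0
for _ in range(2000):
    product *= 0.5
log_sum = 2000 * math.log(0.5)

print(product)    # the running product has underflowed to 0.0
print(log_sum)    # the log of the same quantity is still representable

# Toy comparison of two classes for one document (made-up probabilities)
vector = [1, 0, 1]
scores = {
    "ham":  log_score(vector, [0.01, 0.30, 0.02], 0.86),
    "spam": log_score(vector, [0.40, 0.05, 0.50], 0.14),
}
predicted = max(scores, key=scores.get)
print(predicted)
```

Smoothing matters here as well: `math.log(0)` raises an error, so every word probability must stay strictly between 0 and 1.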

#### Likelihood: Prob of an unknown class given all words in the test data: P(Class=? | W1,W2,W3,W4,..., Wj)

In [None]:
Y_test_hat = []

# Test_Word_Presence_Table.values = 2D Array of 1 or 0 values for each Vocab Word, 6375 Columns
for test_doc_feature_vector in Test_Word_Presence_Table.values:

    predicted_class = prob_class_with_test_feature_vector(test_doc_feature_vector)
    Y_test_hat.append(predicted_class)

## Accuracy

In [None]:
score = []
for i in range(len(test_data)):

    if list(test_data.Label)[i] == Y_test_hat[i]:
        score.append(100)
    else:
        score.append(0)
        
accuracy_matrix = pd.DataFrame({'Y_test': test_data.Label, 
                                'Y_test_hat (predicted)': Y_test_hat,
                                'Score': score })

In [97]:
print "Accuracy Score of model = {}%".format(accuracy_matrix.Score.mean())

Accuracy Score of model = 97.8456014363%
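
For context on the accuracy figure, the classes are imbalanced (3850 ham vs 608 spam in the training data), so it helps to compare against always predicting the majority class. The sketch below shows a compact accuracy helper and that baseline. The helper name `accuracy_percent` and the toy labels are assumptions, and the baseline uses the training counts, so the test-set proportion may differ slightly.

```python
def accuracy_percent(y_true, y_pred):
    """Percentage of positions where the prediction equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * matches / len(y_true)

# Toy labels (illustration only): 3 of 4 predictions are correct
print(accuracy_percent([0, 1, 1, 0], [0, 1, 0, 0]))

# Baseline: always predict the majority class, using the training class counts
ham_count, spam_count = 3850, 608
baseline = 100.0 * max(ham_count, spam_count) / (ham_count + spam_count)
print(round(baseline, 2))
```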


----

# Model 2: Using Sklearn

## Imports

In [175]:
import pandas as pd
import numpy as np

from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

from nltk.tokenize import word_tokenize
from gensim.models import word2vec

In [179]:
df.head()

Unnamed: 0,Doc_Number,Document,Label,Processed_Document
0,4773,Hi..i got the money da:),0,hi got money da
1,2981,"Xmas Offer! Latest Motorola, SonyEricsson & No...",1,xmas offer latest motorola sonyericsson nokia ...
2,4579,Had your contract mobile 11 Mnths? Latest Moto...,1,contract mobile mnths latest motorola nokia et...
3,1272,"Sorry chikku, my cell got some problem thts y ...",0,sorry chikku cell got problem thts nt able rep...
4,3747,"Aight, let me know when you're gonna be around...",0,aight let know gonna around usf


#### X and Y

In [239]:
X = df.Processed_Document
Y = df.Label

#### Train_test_split

In [240]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size=0.80)

print len(X_train), len(Y_train)
print len(X_test), len(Y_test)

4457 4457
1115 1115


#### CountVectorizer

#### Creating a Train-data Vocab and vectorising train data

In [248]:
Vector = CountVectorizer(binary=True)         # binary=True: record only presence (1) or absence (0), not counts

Vector.fit(X_train)

print 'Vocab (Train Data) = {} words'.format(len(Vector.get_feature_names()))

Vocab (Train Data) = 6279 words


##### Transform our train data (Vectorising)

In [257]:
# Encode the training documents as a document-term matrix using the fitted Vocab
X_train_vector = Vector.transform(X_train)
print "Shape =", X_train_vector.toarray().shape

Shape = (4457L, 6279L)


##### Transform test-data (using fitted vocabulary) into a document-term matrix

In [258]:
# Using training data's Vocab (fitted on vector) to transform test sample
X_test_vector = Vector.transform(X_test)
print "Shape =", X_test_vector.toarray().shape

Shape = (1115L, 6279L)


#### CHECK

`Setting binary=True in CountVectorizer limits the values to 1 (present) or 0 (absent). We check this by printing the train vector and finding its max value.`

In [261]:
X_train_vector.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [266]:
X_train_vector.toarray().max()

1

Every element of the train vector is in { 0, 1 }.

- The train and test matrices have the same columns/features (6279), because the training data's Vocab was used to transform the test sample.

#### MODEL

In [251]:
model = BernoulliNB(alpha = 0.01)

In [252]:
model.fit(X_train_vector, Y_train)

BernoulliNB(alpha=0.01, binarize=0.0, class_prior=None, fit_prior=True)

#### Predict

In [253]:
Y_test_hat = model.predict(X_test_vector)

#### Accuracy Score

In [254]:
accuracy_score(Y_test, Y_test_hat)*100.0

98.744394618834093