# Multinomial Naive Bayes Classifier

- Based on Bag of Words, creating a Vocab of all possible unique words across all documents
- A document is represented by a feature vector with integer elements whose value is the frequency of that word in the document.
- Mostly used for text classification

# Model 1: Building from Scratch

## AIM

`To Build a text classifier based on a collection of few documents each having a small corpus of SMS data under three classes: Spam, Ham and to pedict the class of a new document.`

## Data

Dataset is a collection of documents each containing a huge set of SMS data. Label, Document and DocNumber are the fields

## Imports

In [2]:
import pandas as pd
import numpy as np
import math
import random
import re
import string

from nltk.stem import WordNetLemmatizer
from sklearn.utils import shuffle

## Dataset

In [3]:
df = pd.read_excel('Data//data_5_3_Spam_Ham_Dataset_BinaryClass.xlsx')
df.head()

Unnamed: 0,Doc_Number,Document,Label
0,1,"Go until jurong point, crazy.. Available only ...",ham
1,2,Ok lar... Joking wif u oni...,ham
2,3,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,4,U dun say so early hor... U c already then say...,ham
4,5,"Nah I don't think he goes to usf, he lives aro...",ham


#### Labelling Target

In [4]:
dict_target, id_target = {}, 0

for key in df.Label.unique():
    dict_target[key] = id_target
    id_target += 1

print dict_target
    
df.Label = df.Label.map(dict_target)
df.head()

{u'ham': 0, u'spam': 1}


Unnamed: 0,Doc_Number,Document,Label
0,1,"Go until jurong point, crazy.. Available only ...",0
1,2,Ok lar... Joking wif u oni...,0
2,3,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,4,U dun say so early hor... U c already then say...,0
4,5,"Nah I don't think he goes to usf, he lives aro...",0


#### Cleaning Document

In [5]:
def preprocessing(document_column):
    
    # Convert unicode to string
    document_column = document_column.apply(lambda x: x.encode('ascii', 'ignore') if type(x) == unicode else x)
    
    # Loading Stop_Words, Stop_Letters test file
    StopWords_txt = open('Data/Stop_Words.txt', 'r').read()
    stop_words = re.findall("\'(\w+)\'",StopWords_txt)
    stop_letters = re.findall(r'\w', string.letters)
    
    # Loading NLTK's leammatizer-> Example: 'plays' is converted to 'play'
    lm = WordNetLemmatizer()
    
    processed_documents = []
    for doc in document_column:
        
        doc = str(doc)
        doc_words, final_doc_words = [], []        
            
        # Basic regex
        doc = re.sub('[\s\t\n\r\.\{\(\[\}\]\)\,\;\'\"\/\\\?\_\-\>\<\\:\-\+\=\@\#\$\%\&\*\!0-9]+', ' ', doc)
        doc = re.sub('\s+', ' ', doc)
        doc = ' '.join(re.findall('\w+', doc))
        doc = str(doc.lower())
        
        # Stop words, Stop letters
        for w in doc.split():
            if w not in stop_words and w not in stop_letters:
                doc_words.append(w)    
                
        # Lematize
        for w in doc_words:
            final_doc_words.append(str(lm.lemmatize(w)))
        
        # Appending to a final list
        processed_documents.append(' '.join(final_doc_words))
        
    return processed_documents

In [144]:
df['Processed_Document'] = preprocessing(df.Document)
df.head()

Unnamed: 0,Doc_Number,Document,Label,Processed_Document
0,5496,"Good afternoon, my love ... How goes your day ...",0,good afternoon love go day sleep hope well boy...
1,5370,Hi mom we might be back later than &lt;#&gt;,0,hi mom might back later lt gt
2,3815,Pls i wont belive god.not only jesus.,0,pls wont belive god jesus
3,2234,Nothing just getting msgs by dis name wit diff...,0,nothing getting msg dis name wit different
4,4559,I am in hospital da. . I will return home in e...,0,hospital da return home evening


#### Shuffle

In [145]:
df = shuffle(df).reset_index(drop=True)
df.head()

Unnamed: 0,Doc_Number,Document,Label,Processed_Document
0,368,"Update_Now - Xmas Offer! Latest Motorola, Sony...",1,update xmas offer latest motorola sonyericsson...
1,2449,Tmr then Ã¼ brin lar... Aiya later i come n c ...,0,tmr brin lar aiya later come lar mayb neva set...
2,1893,Probably earlier than that if the station's wh...,0,probably earlier station think
3,1705,Just taste fish curry :-P,0,taste fish curry
4,2534,Yup ok...,0,yup ok


#### Train - Test Split

In [148]:
def train_test_split(df, train_size):
    
    train_index = int(train_size*len(df))
    test_index = train_index + 1
    
    train_data = df[0:train_index+1]
    test_data = df[test_index:]
    print "-> train data = {} Rows;  test data = {} Rows".format(len(train_data), len(test_data))
    
    return train_data, test_data

In [149]:
train_data, test_data = train_test_split(df, 0.80)

-> train data = 4458 Rows;  test data = 1114 Rows


#### Building a Vocabulary using _ONLY _  "Training Data"

In [156]:
# This is a very crude and generalised way
Vocab = sorted(list(set(' '.join(train_data.Processed_Document).split())))

print "Vocabulary (Train Data)= {} words".format(len(Vocab))

Vocabulary (Train Data)= 6305 words


In [157]:
print "Total Words in Dataset = {} words".format(len(sorted(list(set(' '.join(df.Processed_Document).split())))))

Total Words in Dataset = 7116 words


#### Building a Frequency table

In [158]:
Frequency_table = pd.DataFrame(index=Vocab)

for cls in train_data.Label.unique():
    
    col_name = cls
    
    perClass_all_doc_words = pd.Series(' '.join(train_data[train_data.Label == cls]['Processed_Document']).split())
    perClass_all_term_frequency = perClass_all_doc_words.value_counts()
    
    perClass_term_freq = []
    for w in Frequency_table.index:
        if w in perClass_all_term_frequency.index:            
            perClass_term_freq.append(perClass_all_term_frequency[w])
        else:
            perClass_term_freq.append(0)
            
    Frequency_table[col_name] = perClass_term_freq

Frequency_table.head(10)

Unnamed: 0,1,0
aa,0,1
aah,0,3
aaooooright,0,1
aathi,0,3
ab,1,0
abbey,0,1
abdomen,0,1
abeg,0,1
abel,0,1
aberdeen,1,0


- Represnts all the words in the index axis along with their frequencies per class 0 and 1

#### Prob of a class: P(Class = C_i)

In [159]:
def prob_class(cls):
    P = train_data.Label.value_counts()[cls].astype(float)/train_data.Label.value_counts().sum()
    return P

#### Prob of  a word given a particular class: P(Wordj |Class = C_i)

In [160]:
# P(Wordj | Class=C_i) = [ n(Wordj) when class=C_i + alpha ] / [n(Total words) where class=C_i + len(Vocab)]
# alpha -> Smoothing Fucntion Parameter
#        - That if number of W1 words under Class C1 is equal to 0, it might make the entire P = 0, as P = P(W1)xP(W2)...
#        - Also we are not ignoring the chances of this word even if it didn't occur in this class, by giving it a less Prob.
#        - Alpha should be small value, so that we are giving it a less Prob.
#        - Dividing by Vocab to normailize

def prob_word_given_class(word, cls, alpha):
    P = (Frequency_table[Frequency_table.index == word][cls].values[0].astype(float) + alpha) /(Frequency_table[cls].sum() + len(Vocab))
    return P

#### Likelihood: Prob of a unknown class given all words in the test data: P(Class=? | W1,W2,W3,W4...... Wj)

In [70]:
def prob_class_with_test_feature_vector(test_doc):
    
    global Vocab
    
    # Intilising "big_class" with a random choice
    big = 0
    big_class = random.choice(df.Label.unique())

    # Smoothing Function Parameter 
    alpha = 0.0001

    for cls in df.Label.unique():
        
        # Bayes Rule: P(X|Y) = P(Y|X) * P(X) / P(Y)
        # Applying & ignoring the denominator, as it will be a same value.
        # likelihhod P(Class=C1|W1,W2,...,Wj) = P(W1, W2..., Wj) * P(Class=C1) = P(W1|C1)*P(W2|C1)* ...*P(Wj|C1) * P(C1)
        # likelihhod P(Class=C2|W1,W2,...,Wj) = P(W1, W2..., Wj) * P(Class=C2) = P(W1|C2)*P(W2|C2)* ...*P(Wj|C2) * P(C2)
        # ...
        # likelihhod P(Class=Ck|W1,W2,...,Wj) = P(W1, W2..., Wj) * P(Class=Ck) = P(W1|Ck)*P(W2|Ck)* ...*P(Wj|Ck) * P(Ck)
        
        likelihood = 1
        for word in test_doc.split():  
            
            # Checking if there's a new word in new data which was not even present in the Vocab 
            # Example: For a newer word which was not in train or test data either...
            if word in Vocab:
                likelihood *= prob_word_given_class(word, cls, alpha)
            else:
                pass

        likelihood *= prob_class(cls)

        if likelihood > big:
            big = likelihood
            big_class = cls

    return big_class

In [71]:
Y_test_hat = []

for test_doc in test_data['Processed_Document'].astype(str):

    predicted_class = prob_class_with_test_feature_vector(test_doc)
    Y_test_hat.append(predicted_class)

#### Accuracy

In [94]:
score = []
for i in range(len(test_data)):

    if list(test_data.Label)[i] == Y_test_hat[i]:
        score.append(100)
    else:
        score.append(0)
        
accuracy_matrix = pd.DataFrame({'Y_test': test_data.Label, 
                                'Y_test_hat (predicted)': Y_test_hat,
                                'Score': score })
accuracy_matrix.head()

Unnamed: 0,Score,Y_test,Y_test_hat (predicted)
4458,100,0,0
4459,100,0,0
4460,100,0,0
4461,100,0,0
4462,100,0,0


In [97]:
print "Accuracy Score of model = {}%".format(accuracy_matrix.Score.mean())

Accuracy Score of model = 97.8456014363%


----

# Model 2: Using Sklearn

## Imports

In [257]:
import pandas as pd
import numpy as np

from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

from nltk.tokenize import word_tokenize
from gensim.models import word2vec

In [258]:
df.head()

Unnamed: 0,Doc_Number,Document,Label,Processed_Document
0,368,"Update_Now - Xmas Offer! Latest Motorola, Sony...",1,update xmas offer latest motorola sonyericsson...
1,2449,Tmr then Ã¼ brin lar... Aiya later i come n c ...,0,tmr brin lar aiya later come lar mayb neva set...
2,1893,Probably earlier than that if the station's wh...,0,probably earlier station think
3,1705,Just taste fish curry :-P,0,taste fish curry
4,2534,Yup ok...,0,yup ok


#### X and Y

In [259]:
X = df.Processed_Document
Y = df.Label

#### Train_test_split

In [260]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size=0.80)

print len(X_train), len(Y_train)
print len(X_test), len(Y_test)

4457 4457
1115 1115


#### CountVectorizer

#### Creating a Train-data Vocab and vectorising train data

In [277]:
# Create a Vector for our train data's vocabulary...
Vector = CountVectorizer()

# Vector to fit with our training data to create Vocabulary, Vocab Built!
Vector.fit(X_train)

# Printing length of our train data's Vocab..
print 'Vocab (Train Data) = {} words'.format(len(Vector.get_feature_names()))

Vocab (Train Data) = 6247 words


##### Transform our train data (Vectorising)

In [278]:
# Vector Standardized using transform()
X_train_vector = Vector.transform(X_train)

In [279]:
print "X_train = {}".format(X_train_vector.toarray())

X_train = [[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]


In [280]:
print 'Shape =', X_train_vector.toarray().shape

Shape = (4457L, 6247L)


##### Transform test-data (using fitted vocabulary) into a document-term matrix

In [281]:
# Using training data's Vocab (fitted on vector) to transform test sample
X_test_vector = Vector.transform(X_test)

print "Shape =", X_test_vector.toarray().shape

Shape = (1115L, 6247L)


- Columns/Features should be same as we have used training-data's Vocab to transform test sample

#### MODEL

In [282]:
model = MultinomialNB(alpha = 0.01)

In [283]:
model.fit(X_train_vector, Y_train)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

#### Predict

In [284]:
Y_test_hat = model.predict(X_test_vector)

#### Accuracy Score

In [285]:
accuracy_score(Y_test, Y_test_hat)*100.0

97.399103139013448