<center>
    <H1> NAIVE BAYES CLASSIFIER </H1>
    <br>
======================================================================================================================
<br>
Naive Bayes is a supervised machine learning algorithm for classification. Its basic form is easy to implement, yet it handles fairly complex classification tasks such as text categorization.
We’ll build a simple email classifier from scratch using the Multinomial Naive Bayes model, assigning each mail to one of five categories: Personal, Social, Forums, Updates or Promotions.

## STEP 1: IMPORTING LIBRARIES

In [1]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer  
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import re
import math 
from collections import Counter, defaultdict
from sklearn.model_selection import train_test_split  
from sklearn.metrics import confusion_matrix, accuracy_score

## STEP 2: LOADING DATASET

Load the required dataset

In [2]:
#mails_dataset = pd.read_csv('Dataset/trial_spam.csv', encoding = 'latin-1')
#mails_dataset = pd.read_csv('Dataset/spam.csv', encoding = 'latin-1')
mails_dataset = pd.read_csv('Dataset/emails.csv', encoding = 'latin-1')

mails_dataset.head()           #show first 5 rows

Unnamed: 0,sno,time,subject,sender,body,folder,label
0,1,"Tue, 17 Sep 2019 09:51:40 -0700","Megha, your profile is getting hits",=?UTF-8?B?TGlua2VkSW4=?= <linkedin@e.linkedin....,Please view this email in a browser: \r\nhttps...,Inbox,Social
1,2,"Sun, 29 Sep 2019 02:50:00 +0000 (UTC)","Building A Logistic Regression in Python, Step...","""Medium Daily Digest"" <noreply@medium.com>",Today's highlights\r\n\r\nBuilding A Logistic ...,Inbox,Updates
2,3,"Mon, 12 Aug 2019 01:43:21 +0000 (GMT)",Help us protect you: Security advice from Google,Google <no-reply@accounts.google.com>,Confirm your recovery phone\r\n\r\nmegha.mcs18...,Inbox,Personal
3,4,"Mon, 30 Sep 2019 08:02:38 +0000 (UTC)","Megha, start a conversation with your new conn...",Kashish Gupta via LinkedIn <invitations@linked...,Kashish Gupta has accepted your invitation. Le...,Inbox,Social
4,5,"Wed, 21 Aug 2019 16:21:31 +0000 (UTC)",Prepare for a coding interview in Python,"""Mari from DataCamp"" <team@datacamp.com>","Hi there,\r\n\r\nWe have 10 new courses coveri...",Inbox,Promotions


In [2]:
import re
print(re.findall(r'\<(.*?)\>', '"Mari from DataCamp" <team@datacamp.com>'))

['team@datacamp.com']


## STEP 3: FEATURE SELECTION

Select the features that matter for mail classification. Only sno, subject, body and label are needed; the remaining columns (time, sender, folder) are irrelevant for our classifier, so we drop them.

In [5]:
#keep only sno, subject, body and label; drop every other column
mails_dataset.drop(mails_dataset.columns.difference(['sno','subject','body','label']), axis = 1, inplace = True)

mails_dataset.head()

Unnamed: 0,sno,subject,body,label
0,1,"Megha, your profile is getting hits",Please view this email in a browser: \r\nhttps...,Social
1,2,"Building A Logistic Regression in Python, Step...",Today's highlights\r\n\r\nBuilding A Logistic ...,Updates
2,3,Help us protect you: Security advice from Google,Confirm your recovery phone\r\n\r\nmegha.mcs18...,Personal
3,4,"Megha, start a conversation with your new conn...",Kashish Gupta has accepted your invitation. Le...,Social
4,5,Prepare for a coding interview in Python,"Hi there,\r\n\r\nWe have 10 new courses coveri...",Promotions


In [6]:
mails_dataset['message'] = mails_dataset[['subject', 'body']].apply(lambda x: ' '.join(x.astype(str)), axis=1)

In [7]:
#subject and body are now merged into the message column, so drop the originals
mails_dataset.drop(['subject','body'], axis = 1, inplace = True)

mails_dataset.head()

Unnamed: 0,sno,label,message
0,1,Social,"Megha, your profile is getting hits Please vie..."
1,2,Updates,"Building A Logistic Regression in Python, Step..."
2,3,Personal,Help us protect you: Security advice from Goog...
3,4,Social,"Megha, start a conversation with your new conn..."
4,5,Promotions,Prepare for a coding interview in Python Hi th...


In [8]:
len(mails_dataset)

681

In [9]:
mails_dataset['label'].value_counts()  #count number of instances of each label

 Personal      238
 Social        151
 Forums        141
 Updates       119
 Promotions     32
Name: label, dtype: int64

In [10]:
total_mails = mails_dataset.shape[0]            #total number of instances in our dataset
total_mails

681

## STEP 4: DATA PREPROCESSING

We need to clean our data before further processing. Emails contain many elements, such as punctuation marks, stop words and digits, that are unlikely to help in detecting the class of the email.
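As a rough sketch of what the steps in this section do to a single message, the snippet below applies lowercasing, punctuation removal, tokenization and stopword removal to one string using only the standard library. The function name clean_message and the tiny STOP_WORDS set are illustrative assumptions; the notebook itself uses pandas and NLTK for these steps.

```python
import re

# Illustrative stand-in for NLTK's English stopword list (assumption, deliberately tiny)
STOP_WORDS = {"your", "is", "a", "in", "this", "the"}

def clean_message(text):
    """Lowercase, strip punctuation, split into words, keep alphabetic non-stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # drop anything that is not a word character or whitespace
    tokens = text.split()                  # whitespace split (simpler than NLTK's word_tokenize)
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]

print(clean_message("Megha, your profile is getting 25 hits!"))
# ['megha', 'profile', 'getting', 'hits']
```

The number 25 disappears because of the isalpha() check, not because of the punctuation pattern.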

###  A. Convert to lowercase

In [11]:
#convert the data into lower case
mails_dataset['message'] =  mails_dataset['message'].str.lower()
mails_dataset.head()

Unnamed: 0,sno,label,message
0,1,Social,"megha, your profile is getting hits please vie..."
1,2,Updates,"building a logistic regression in python, step..."
2,3,Personal,help us protect you: security advice from goog...
3,4,Social,"megha, start a conversation with your new conn..."
4,5,Promotions,prepare for a coding interview in python hi th...


### B. Convert categorical values to numbers



In [12]:
'''
    Personal : 0
    Social : 1
    Forums : 2
    Updates : 3
    Promotions : 4
    
'''
#mails_dataset['label'] = mails_dataset['label'].map({'ham': 0, 'spam': 1})
#mails_dataset['label'] = mails_dataset['label'].map({'Personal': 0, 'Social' : 1, 'Forums' : 2,'Updates' : 3, 'Promotions' : 4})
mails_dataset.head()

Unnamed: 0,sno,label,message
0,1,Social,"megha, your profile is getting hits please vie..."
1,2,Updates,"building a logistic regression in python, step..."
2,3,Personal,help us protect you: security advice from goog...
3,4,Social,"megha, start a conversation with your new conn..."
4,5,Promotions,prepare for a coding interview in python hi th...


In [13]:
mails_dataset['label']=mails_dataset['label'].str.strip()

In [14]:
dictionary = {'Personal': 0, 'Social' : 1, 'Forums' : 2,'Updates' : 3, 'Promotions' : 4}
mails_dataset=mails_dataset.replace({"label": dictionary})
mails_dataset.head()

Unnamed: 0,sno,label,message
0,1,1,"megha, your profile is getting hits please vie..."
1,2,3,"building a logistic regression in python, step..."
2,3,0,help us protect you: security advice from goog...
3,4,1,"megha, start a conversation with your new conn..."
4,5,4,prepare for a coding interview in python hi th...


### C. Remove digits and punctuations

In [15]:
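#remove numbers of three or more digits and digit groups joined by one character (e.g. 2019, 12.5, 10:30)
#shorter numbers such as 33 survive here; any token still containing digits is dropped later by the isalpha() filter during tokenization
mails_dataset['message'] = mails_dataset['message'].str.replace(r'\d+.\d+', '', regex=True)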
mails_dataset.head()

Unnamed: 0,sno,label,message
0,1,1,"megha, your profile is getting hits please vie..."
1,2,3,"building a logistic regression in python, step..."
2,3,0,help us protect you: security advice from goog...
3,4,1,"megha, start a conversation with your new conn..."
4,5,4,prepare for a coding interview in python hi th...


In [16]:
'''
     ^   :  Not these characters
     \w  :  Word characters
     \s  :  Space characters

    Replace any character that is not a word character or a space character with nothing/blank.
    
'''
mails_dataset['message'] = mails_dataset['message'].str.replace(r'[^\w\s]', '', regex=True)
mails_dataset.head()

Unnamed: 0,sno,label,message
0,1,1,megha your profile is getting hits please view...
1,2,3,building a logistic regression in python step ...
2,3,0,help us protect you security advice from googl...
3,4,1,megha start a conversation with your new conne...
4,5,4,prepare for a coding interview in python hi th...
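One detail of this pattern is worth checking: \w matches letters, digits and the underscore, so underscores are kept while other punctuation is removed. This is why fragments such as link_ri_ survive inside the cleaned LinkedIn URLs of the sample mail. A minimal check with Python's re module (the sample string is made up for illustration):

```python
import re

sample = "view_this: e-mail, now!"
cleaned = re.sub(r"[^\w\s]", "", sample)   # same pattern as in the cell above
print(cleaned)  # view_this email now
```

Tokens that still contain an underscore are later discarded by the isalpha() test in the tokenization step.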


In [17]:
sample_mail = mails_dataset.iloc[0]
sample_mail['message']

'megha your profile is getting hits please view this email in a browser \r\nhttpselinkedincompubsfformlink_ri_x0gzc2x3dyqpglljhjlyqgn9zfezbymbzf35enzgsrm98sgilihzbazbkwzdwezdzebuncrk6homprvxmtx3dyqpglljhjlyqgu2ukmcdioiucuedsd8nm5wgjocvkgokgzgni1qryl75on6zdv7zcwcdf_ei_evpx2zyzpfwmdastehmmwcjongvnq7xm_nijszrbxhbpzjyjg3d3d\r\n\r\nif you need assistance or have questions please contact linkedin customer service \r\nhttpselinkedincompubcc_ri_x0gzc2x3dyqpglljhjlyqgn9zfezbymbzf35enzgsrm98sgilihzbazbkwzdwezdzebuncrk6homprvxtpkx3dswtdcsy_ei_eolaggf4snmvxff7kuckuwmuru2mcbogxku09vjwvrf62zxhf1xwquo9ggtofxh0anawteojssxgjg3d3d \r\n\r\nthis is an occasional email to help you get the most out of linkedin unsubscribe \r\nhttpselinkedincompubcc_ri_x0gzc2x3dyqpglljhjlyqgn9zfezbymbzf35enzgsrm98sgilihzbazbkwzdwezdzebuncrk6homprvxtpkx3dswsuusy_ei_eolaggf4snmvxff7kuckuwmuru2mcbogxku09vj9mjfzbx7rmjiwphlqcq2lo9bykgkwjfw_qmhspmnpfx8wuujzy1pn_2zuq52as8etm7nsukdkxvmqo91cluos8gymlqczsiohieeg1emqgvotjxusdabmuafbmgo

### D. Convert all the slang words to corresponding formal words

Slang is the popular informal form of a word or group of words.

In [18]:
#create a dictionary of slang words and their corresponding terms

slang_list = {'u': 'you', 'r': 'are', 'd': "the", 'urs' : 'yours', 'wkly' : 'weekly', 'st' : 'such that', 
              'txt': 'text','comp': 'competition', 'prctc' : 'practice', 'dffrnc': 'difference', 'y': 'why', 
              'f9':'fine', 'tkts': 'tickets', 'csh': 'cash', 'phn': 'phone', 'im': 'i am', 'm': 'am', 
              'spcl': 'special', 'fone': 'phone', 'wks' : 'weeks', 'å': 'a', 'n': 'and', 'wat':'what'}


In [19]:
#replace slang with formal word

sample_mail = mails_dataset.iloc[0]
message = sample_mail['message']
print(message)

new_message = ' '.join(slang_list[i] if i in slang_list else i for i in message.split())
new_message

megha your profile is getting hits please view this email in a browser 
httpselinkedincompubsfformlink_ri_x0gzc2x3dyqpglljhjlyqgn9zfezbymbzf35enzgsrm98sgilihzbazbkwzdwezdzebuncrk6homprvxmtx3dyqpglljhjlyqgu2ukmcdioiucuedsd8nm5wgjocvkgokgzgni1qryl75on6zdv7zcwcdf_ei_evpx2zyzpfwmdastehmmwcjongvnq7xm_nijszrbxhbpzjyjg3d3d

if you need assistance or have questions please contact linkedin customer service 
httpselinkedincompubcc_ri_x0gzc2x3dyqpglljhjlyqgn9zfezbymbzf35enzgsrm98sgilihzbazbkwzdwezdzebuncrk6homprvxtpkx3dswtdcsy_ei_eolaggf4snmvxff7kuckuwmuru2mcbogxku09vjwvrf62zxhf1xwquo9ggtofxh0anawteojssxgjg3d3d 

this is an occasional email to help you get the most out of linkedin unsubscribe 
httpselinkedincompubcc_ri_x0gzc2x3dyqpglljhjlyqgn9zfezbymbzf35enzgsrm98sgilihzbazbkwzdwezdzebuncrk6homprvxtpkx3dswsuusy_ei_eolaggf4snmvxff7kuckuwmuru2mcbogxku09vj9mjfzbx7rmjiwphlqcq2lo9bykgkwjfw_qmhspmnpfx8wuujzy1pn_2zuq52as8etm7nsukdkxvmqo91cluos8gymlqczsiohieeg1emqgvotjxusdabmuafbmgoo1pv5ih8vaciga3

'megha your profile is getting hits please view this email in a browser httpselinkedincompubsfformlink_ri_x0gzc2x3dyqpglljhjlyqgn9zfezbymbzf35enzgsrm98sgilihzbazbkwzdwezdzebuncrk6homprvxmtx3dyqpglljhjlyqgu2ukmcdioiucuedsd8nm5wgjocvkgokgzgni1qryl75on6zdv7zcwcdf_ei_evpx2zyzpfwmdastehmmwcjongvnq7xm_nijszrbxhbpzjyjg3d3d if you need assistance or have questions please contact linkedin customer service httpselinkedincompubcc_ri_x0gzc2x3dyqpglljhjlyqgn9zfezbymbzf35enzgsrm98sgilihzbazbkwzdwezdzebuncrk6homprvxtpkx3dswtdcsy_ei_eolaggf4snmvxff7kuckuwmuru2mcbogxku09vjwvrf62zxhf1xwquo9ggtofxh0anawteojssxgjg3d3d this is an occasional email to help you get the most out of linkedin unsubscribe httpselinkedincompubcc_ri_x0gzc2x3dyqpglljhjlyqgn9zfezbymbzf35enzgsrm98sgilihzbazbkwzdwezdzebuncrk6homprvxtpkx3dswsuusy_ei_eolaggf4snmvxff7kuckuwmuru2mcbogxku09vj9mjfzbx7rmjiwphlqcq2lo9bykgkwjfw_qmhspmnpfx8wuujzy1pn_2zuq52as8etm7nsukdkxvmqo91cluos8gymlqczsiohieeg1emqgvotjxusdabmuafbmgoo1pv5ih8vaciga3ltbng9xojg3d

In [20]:
#applying to all rows

def convert_slangs(row):
    message = row['message']
    new_message = ' '.join(slang_list[i] if i in slang_list else i for i in message.split())
    return new_message

mails_dataset['message'] = mails_dataset.apply(convert_slangs, axis=1)
mails_dataset.head()

Unnamed: 0,sno,label,message
0,1,1,megha your profile is getting hits please view...
1,2,3,building a logistic regression in python step ...
2,3,0,help us protect you security advice from googl...
3,4,1,megha start a conversation with your new conne...
4,5,4,prepare for a coding interview in python hi th...
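The word-by-word lookup used above can be sketched on a short string as follows. The helper name expand_slang and the example sentences are hypothetical; the mapping is a small excerpt of slang_list. The second call illustrates a limitation of single-letter keys: a genuine one-letter word is expanded as well.

```python
# Small excerpt of the slang dictionary defined above
slang_excerpt = {'u': 'you', 'r': 'are', 'txt': 'text', 'd': 'the'}

def expand_slang(message, mapping):
    """Replace each whitespace-separated word that is a key of `mapping`."""
    return ' '.join(mapping.get(word, word) for word in message.split())

print(expand_slang('u r late txt me', slang_excerpt))    # you are late text me
print(expand_slang('grade d in maths', slang_excerpt))   # grade the in maths
```

Because message.split() splits on any whitespace, this step also collapses the \r\n line breaks of the original mail into single spaces.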


### E. Tokenization

In [21]:
#pick every message and convert it into tokens

def identify_tokens(row):
    message = row['message']
    tokens = word_tokenize(message)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words

mails_dataset['tokens'] = mails_dataset.apply(identify_tokens, axis=1)
mails_dataset.head()

Unnamed: 0,sno,label,message,tokens
0,1,1,megha your profile is getting hits please view...,"[megha, your, profile, is, getting, hits, plea..."
1,2,3,building a logistic regression in python step ...,"[building, a, logistic, regression, in, python..."
2,3,0,help us protect you security advice from googl...,"[help, us, protect, you, security, advice, fro..."
3,4,1,megha start a conversation with your new conne...,"[megha, start, a, conversation, with, your, ne..."
4,5,4,prepare for a coding interview in python hi th...,"[prepare, for, a, coding, interview, in, pytho..."


### F. Stemming / Lemmatization

Both processes reduce the inflected forms of a word to a common base or root. We use lemmatization because it looks words up in a dictionary (WordNet) and returns a valid base form, while stemming simply chops off suffixes by rule.

In [22]:
'''
stemming = PorterStemmer()
sample_mail = mails_dataset.iloc[0]
tokens = sample_mail['tokens']
stemmed_list = [stemming.stem(word) for word in tokens]
stemmed_list
'''

"\nstemming = PorterStemmer()\nsample_mail = mails_dataset.iloc[0]\ntokens = sample_mail['tokens']\nstemmed_list = [stemming.stem(word) for word in tokens]\nstemmed_list\n"

In [23]:
lemmatizer = WordNetLemmatizer() 

sample_mail = mails_dataset.iloc[0]
tokens = sample_mail['tokens']
lemmatize_list = [lemmatizer.lemmatize(word) for word in tokens]
lemmatize_list

['megha',
 'your',
 'profile',
 'is',
 'getting',
 'hit',
 'please',
 'view',
 'this',
 'email',
 'in',
 'a',
 'browser',
 'if',
 'you',
 'need',
 'assistance',
 'or',
 'have',
 'question',
 'please',
 'contact',
 'linkedin',
 'customer',
 'service',
 'this',
 'is',
 'an',
 'occasional',
 'email',
 'to',
 'help',
 'you',
 'get',
 'the',
 'most',
 'out',
 'of',
 'linkedin',
 'unsubscribe',
 'this',
 'email',
 'wa',
 'intended',
 'for',
 'megha',
 'learn',
 'why',
 'we',
 'include',
 'this',
 'httphelplinkedincomappanswersglobalidfteng',
 'copyright',
 'linkedin',
 'corporation',
 'all',
 'right',
 'reserved',
 'linkedin',
 'corp',
 'west',
 'maude',
 'avenue',
 'sunnyvale',
 'ca']

In [24]:
def lemmatize_tokens(row):
    tokens = row['tokens']
    lemmatized_list = [lemmatizer.lemmatize(word) for word in tokens]
    return (lemmatized_list)

mails_dataset['tokens'] = mails_dataset.apply(lemmatize_tokens, axis=1)
mails_dataset.head()

Unnamed: 0,sno,label,message,tokens
0,1,1,megha your profile is getting hits please view...,"[megha, your, profile, is, getting, hit, pleas..."
1,2,3,building a logistic regression in python step ...,"[building, a, logistic, regression, in, python..."
2,3,0,help us protect you security advice from googl...,"[help, u, protect, you, security, advice, from..."
3,4,1,megha start a conversation with your new conne...,"[megha, start, a, conversation, with, your, ne..."
4,5,4,prepare for a coding interview in python hi th...,"[prepare, for, a, coding, interview, in, pytho..."


### G. Remove stopwords

Stopwords are very common words (such as 'the', 'is' and 'in') that carry little information about the topic of a mail, so we remove them.

In [25]:
stop_words = set(stopwords.words('english'))
print(stop_words)

{'him', 'no', 'is', "couldn't", 'by', 'which', 'isn', 'all', 'shouldn', 'yourself', 'does', 'the', "you're", 'your', 'yours', 'himself', 'too', 'below', "you'd", 'on', 'll', 'she', 'couldn', 'me', 'myself', 'been', "shouldn't", 'up', 'it', 'down', 'are', 'just', 'ourselves', 'we', 'when', 'do', "that'll", 'doing', 'because', 'did', 'our', 'before', 'very', 'can', "you'll", 'wasn', "wouldn't", 's', "hasn't", 'being', 'his', 'than', 'until', 'some', 't', 'needn', 'theirs', 'these', 'herself', 'mightn', 'so', 'where', 'you', 'her', "it's", 'in', 'between', 'other', 'ours', 'this', 'for', 'not', 'then', 'such', 'wouldn', 'should', "should've", 'or', 'an', "shan't", 'y', "doesn't", 'had', 'most', 'after', 'more', 'hadn', 'm', 'here', 'there', 'am', 'd', 're', 'while', 'about', 'don', 'ain', 'over', 'each', 'was', 'but', 'further', 'hers', 'at', 'above', 'own', "isn't", 'from', 'hasn', 'any', 'were', 'itself', "she's", 'as', "won't", "didn't", 'having', 'now', 'why', 'how', "you've", 'with',

In [26]:
def remove_stopwords(row):
    tokens = row['tokens']
    filtered_list = [w for w in tokens if w not in stop_words]
    return filtered_list

mails_dataset['tokens'] = mails_dataset.apply(remove_stopwords , axis=1)
mails_dataset.head()

Unnamed: 0,sno,label,message,tokens
0,1,1,megha your profile is getting hits please view...,"[megha, profile, getting, hit, please, view, e..."
1,2,3,building a logistic regression in python step ...,"[building, logistic, regression, python, step,..."
2,3,0,help us protect you security advice from googl...,"[help, u, protect, security, advice, google, c..."
3,4,1,megha start a conversation with your new conne...,"[megha, start, conversation, new, connection, ..."
4,5,4,prepare for a coding interview in python hi th...,"[prepare, coding, interview, python, hi, new, ..."


## >>>> Now our data is cleaned and ready for training <<<<

## STEP 5: CREATING TEST AND TRAIN SETS

We will randomly split our dataset in an 80–20 ratio: 80% of the data is used as the training set and the remaining 20% as the test set.

In [27]:
#before splitting, let's create the vocabulary

all_tokens = []

for row in range(total_mails):
        #pick every token list from each row and append to the all tokens list
        tokens = mails_dataset.tokens.iloc[row]
        #append the list to the final list 
        all_tokens += tokens 
    
vocab = np.unique(all_tokens)      #vocabulary is the collection of all terms in the corpus

print("Total number of tokens in the corpus : ", len(all_tokens))
print("Total terms : ", len(vocab))

Total number of tokens in the corpus :  61583
Total terms :  7686


In [28]:
X = mails_dataset.drop('label',axis=1) 
y = mails_dataset['label']

#random_state = 0 gives the same split every time
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [29]:
test_size, train_size = X_test.shape[0], X_train.shape[0]
print("Number of instance in :\n Training set = ", train_size, "\n Test set = ", test_size)

Number of instance in :
 Training set =  544 
 Test set =  137
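With only 32 Promotions mails in the whole dataset, a purely random split can leave a small class under-represented in one of the two sets. A common remedy is a stratified split, which samples each class separately; scikit-learn's train_test_split supports this through its stratify parameter. The standard-library sketch below only illustrates the idea on toy labels; the function name stratified_split and the toy data are assumptions, not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)              # fixed seed for a reproducible split
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = ["Personal"] * 10 + ["Promotions"] * 5
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))   # 12 3
```

In the toy example the test set always receives two Personal and one Promotions mail, whichever seed is used.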


In [30]:
X_train.head()

Unnamed: 0,sno,message,tokens
200,263,megha start a conversation with your new conne...,"[megha, start, conversation, new, connection, ..."
380,514,advice from a 33yearold who wants you to stop ...,"[advice, want, stop, worrying, much, tim, denn..."
236,312,collection hello juniors we hope you all are d...,"[collection, hello, junior, hope, well, said, ..."
101,133,graphs for average comparisons forwarded messa...,"[graph, average, comparison, forwarded, messag..."
667,904,student volunteers for ducs alumni assocation ...,"[student, volunteer, ducs, alumnus, assocation..."


In [31]:
#y_train is a pandas Series; convert it into a DataFrame so that it can be concatenated with X_train

labels_train = pd.DataFrame(y_train, columns = ['label'])
labels_train.head()

Unnamed: 0,label
200,1
380,3
236,2
101,0
667,0


In [32]:
'''
    We have X_train, y_train, X_test, y_test.
    Using these lists and dataframes we will randomly create two non-overlapping datasets 
        1. training set
        2. testing set
'''

#creating training set
train_set = pd.concat([X_train, labels_train], axis = 1).reset_index(drop=True)
train_set.head()

Unnamed: 0,sno,message,tokens,label
0,263,megha start a conversation with your new conne...,"[megha, start, conversation, new, connection, ...",1
1,514,advice from a 33yearold who wants you to stop ...,"[advice, want, stop, worrying, much, tim, denn...",3
2,312,collection hello juniors we hope you all are d...,"[collection, hello, junior, hope, well, said, ...",2
3,133,graphs for average comparisons forwarded messa...,"[graph, average, comparison, forwarded, messag...",0
4,904,student volunteers for ducs alumni assocation ...,"[student, volunteer, ducs, alumnus, assocation...",0


In [33]:
# Our training set is ready, similarly creating test set

labels_test = pd.DataFrame(y_test, columns = ['label'])
test_set = pd.concat([X_test, labels_test], axis = 1).reset_index(drop=True)
test_set.head()

Unnamed: 0,sno,message,tokens,label
0,146,megha thanks for being an active member please...,"[megha, thanks, active, member, please, view, ...",1
1,805,shrikant and manish sent new messages messages...,"[shrikant, manish, sent, new, message, message...",1
2,337,shrikant sent you a new message conversation w...,"[shrikant, sent, new, message, conversation, s...",3
3,750,delivery status notification failure hello meg...,"[delivery, status, notification, failure, hell...",0
4,373,mcs_exploratory analysis pfa,"[analysis, pfa]",0


Our dataset is divided into two parts : train_set and test_set. Now we need to train our model

## STEP 6: TRAINING CLASSIFIER 

## <center>  Multinomial Naive Bayes Classifier </center>

![image.png](attachment:image.png)

### A . Compute prior probability of each class

The prior probability of class c is computed as:

   <center><b> P(c) = (examples with label c) / (total examples) </b></center>
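As a worked instance of this formula, the sketch below recomputes the priors from class counts. The counts are not printed anywhere in the notebook; they are back-calculated from the prior probabilities reported further down in this step and the training-set size of 544 (for example 0.3566 × 544 = 194), so treat them as derived values.

```python
train_size = 544
# Class counts implied by the printed priors (derived, not reported in the notebook)
class_counts = {0: 194, 1: 121, 2: 113, 3: 89, 4: 27}

priors = {c: n / train_size for c, n in class_counts.items()}
print(priors[0])               # 0.35661764705882354
print(sum(priors.values()))    # the priors sum to 1 (up to rounding)
```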
    

In [34]:
total_classes = train_set['label'].nunique()
total_classes      # total number of classes

5

In [35]:
# select all instances with class label = 1
y1 = train_set[train_set.label == 1]

#instances with class label = 1 
y1.head()

Unnamed: 0,sno,message,tokens,label
0,263,megha start a conversation with your new conne...,"[megha, start, conversation, new, connection, ...",1
5,778,megha start a conversation with your new conne...,"[megha, start, conversation, new, connection, ...",1
7,532,megha start a conversation with your new conne...,"[megha, start, conversation, new, connection, ...",1
11,914,megha please add me to your linkedin network h...,"[megha, please, add, linkedin, network, hi, me...",1
13,916,himanshu sent you a new message conversation w...,"[himanshu, sent, new, message, conversation, h...",1


In [36]:
#now calculate the prior probability of each label/class

prior_prob = {}          #maps class label -> prior probability
class_set = {}           #maps class label -> training rows of that class

#group the examples by class label, then compute the prior probability of each class

for ci in range(total_classes):
    class_set[ci] = train_set[train_set.label == ci]     
    count_label = len(class_set[ci])
    prior_prob[ci] = count_label/train_size       
    
prior_prob    

{0: 0.35661764705882354,
 1: 0.22242647058823528,
 2: 0.20772058823529413,
 3: 0.1636029411764706,
 4: 0.04963235294117647}

### B. Creating Bag of Words for each class

Create a BOW model and count number of times each term occurs in each class.

In [37]:
from collections import Counter
z = ['blue', 'red', 'blue', 'yellow', 'blue', 'red']
Counter(z)

Counter({'blue': 3, 'red': 2, 'yellow': 1})

In [38]:
#create bow for each class
bow_class = {}        #dict mapping each class to the list of all tokens in that class

#for every class ci
for ci in range(total_classes):
    tokens_class = []            #create empty list of tokens in class ci
    for row in range(len(class_set[ci])):
        #pick tokens from each row and append to the row tokens list
        tokens = class_set[ci].tokens.iloc[row]
        #append the list to the final list 
        tokens_class.extend(tokens) 
    
    bow_class[ci] = tokens_class      # bow of class ci

#bow_class[0]

In [39]:
# calculate the number of occurrences of each term in each class 
# Counter returns a dictionary where key is the term and value is the no of time it occurs 

freq_terms_class = {}

for ci in range(total_classes):
    freq_terms_class[ci] = dict(Counter(bow_class[ci]))

freq_terms_class[0]  

{'graph': 13,
 'average': 1,
 'comparison': 2,
 'forwarded': 51,
 'message': 66,
 'megha': 65,
 'surname': 1,
 'date': 88,
 'wed': 16,
 'aug': 26,
 'pm': 77,
 'subject': 60,
 'student': 102,
 'volunteer': 5,
 'ducs': 5,
 'alumnus': 8,
 'assocation': 2,
 'dear': 37,
 'need': 12,
 'two': 7,
 'every': 2,
 'batch': 2,
 'association': 2,
 'pls': 12,
 'provide': 5,
 'name': 20,
 'sapna': 14,
 'varshney': 13,
 'assistant': 13,
 'professor': 16,
 'adhoc': 6,
 'department': 61,
 'computer': 66,
 'science': 72,
 'university': 75,
 'delhi': 107,
 'fwd': 50,
 'india': 38,
 'fest': 2,
 'bring': 6,
 'world': 8,
 'vasudha': 23,
 'bhatnagar': 24,
 'vasudhabhatnagargmailcom': 15,
 'sun': 18,
 'sapnavarshgmailcom': 6,
 'ronnie': 1,
 'chakre': 9,
 'mca': 46,
 'elective': 4,
 'form': 31,
 'ratheepriyankagmailcom': 4,
 'fri': 16,
 'dec': 1,
 'image': 61,
 'google': 86,
 'fill': 10,
 'option': 8,
 'choose': 1,
 'algo': 16,
 'hw': 1,
 'list': 25,
 'correctness': 1,
 'insertion': 2,
 'sort': 2,
 'tn': 2,
 'pr

In [40]:
freq_terms_class[0]['vasudha']  #check frequency of term 'vasudha' in class 0

23

### C. Compute Conditional Probability of each term given class

The conditional probability of term t given class c is computed with add-one (Laplace) smoothing:

<center><b>P(t|c) = {(Frequency of term t in class c) + 1} / {(Sum of frequencies of all terms in class c) + |V|} </b></center>

where |V| = size of the vocabulary
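A small numeric check of this formula, using figures reported later in this step: the vocabulary has 7686 terms, the class 0 denominator is 17255 (so class 0 holds 17255 − 7686 = 9569 tokens), and 'sapna' occurs 14 times in class 0. The helper name smoothed_prob is an assumption for illustration.

```python
def smoothed_prob(term_freq, class_token_count, vocab_size):
    """Add-one (Laplace) estimate of P(t|c)."""
    return (term_freq + 1) / (class_token_count + vocab_size)

vocab_size = 7686                # len(vocab) in this notebook
class0_tokens = 17255 - 7686     # class 0 denominator minus |V| = 9569 tokens

print(smoothed_prob(14, class0_tokens, vocab_size))   # 0.0008693132425383947
print(smoothed_prob(0, class0_tokens, vocab_size))    # unseen term: 1/17255, small but non-zero
```

The +1 in the numerator is what keeps a term that never occurs in a class from forcing the whole class score to zero.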

In [41]:
#Python never implicitly copies objects: after dict2 = dict1, both names refer to the same object,
#so mutating dict2 changes dict1 as well. We therefore take an explicit copy of each frequency dict.

cond_prob_class = {}

for ci in range(total_classes):
    #cond_prob_class[ci] = freq_terms_class[ci]
    cond_prob_class[ci] = freq_terms_class[ci].copy()
    
cond_prob_class[0]   

{'graph': 13,
 'average': 1,
 'comparison': 2,
 'forwarded': 51,
 'message': 66,
 'megha': 65,
 'surname': 1,
 'date': 88,
 'wed': 16,
 'aug': 26,
 'pm': 77,
 'subject': 60,
 'student': 102,
 'volunteer': 5,
 'ducs': 5,
 'alumnus': 8,
 'assocation': 2,
 'dear': 37,
 'need': 12,
 'two': 7,
 'every': 2,
 'batch': 2,
 'association': 2,
 'pls': 12,
 'provide': 5,
 'name': 20,
 'sapna': 14,
 'varshney': 13,
 'assistant': 13,
 'professor': 16,
 'adhoc': 6,
 'department': 61,
 'computer': 66,
 'science': 72,
 'university': 75,
 'delhi': 107,
 'fwd': 50,
 'india': 38,
 'fest': 2,
 'bring': 6,
 'world': 8,
 'vasudha': 23,
 'bhatnagar': 24,
 'vasudhabhatnagargmailcom': 15,
 'sun': 18,
 'sapnavarshgmailcom': 6,
 'ronnie': 1,
 'chakre': 9,
 'mca': 46,
 'elective': 4,
 'form': 31,
 'ratheepriyankagmailcom': 4,
 'fri': 16,
 'dec': 1,
 'image': 61,
 'google': 86,
 'fill': 10,
 'option': 8,
 'choose': 1,
 'algo': 16,
 'hw': 1,
 'list': 25,
 'correctness': 1,
 'insertion': 2,
 'sort': 2,
 'tn': 2,
 'pr

In [42]:
len(vocab)

7686

In [43]:
#denominator remains same for all the terms in same class. 
#denom for class 0

denom = sum(freq_terms_class[0].values())+len(vocab)
denom

17255

In [44]:
# for each term in the class, calculate cond prob. this is to be done for all the classes.

#iterate through each class one by one
for ci in range(total_classes):
    # compute cond_prob for each term in this class
    denom = sum(freq_terms_class[ci].values())+len(vocab)
    
    # go to each term in this class and compute its cond_prob within this class
    for term in cond_prob_class[ci]:
        freq_term = freq_terms_class[ci][term]
        cond_prob_class[ci][term] = (freq_term+1)/denom

cond_prob_class[0]   #check cond prob for for each term, given class 0

{'graph': 0.0008113590263691683,
 'average': 0.00011590843233845262,
 'comparison': 0.00017386264850767892,
 'forwarded': 0.003013619240799768,
 'message': 0.003882932483338163,
 'megha': 0.0038249782671689364,
 'surname': 0.00011590843233845262,
 'date': 0.005157925239061141,
 'wed': 0.0009852216748768472,
 'aug': 0.0015647638365691105,
 'pm': 0.004520428861199652,
 'subject': 0.003535207186322805,
 'student': 0.00596928426543031,
 'volunteer': 0.00034772529701535785,
 'ducs': 0.00034772529701535785,
 'alumnus': 0.0005215879455230368,
 'assocation': 0.00017386264850767892,
 'dear': 0.0022022602144305997,
 'need': 0.000753404810199942,
 'two': 0.0004636337293538105,
 'every': 0.00017386264850767892,
 'batch': 0.00017386264850767892,
 'association': 0.00017386264850767892,
 'pls': 0.000753404810199942,
 'provide': 0.00034772529701535785,
 'name': 0.0012170385395537525,
 'sapna': 0.0008693132425383947,
 'varshney': 0.0008113590263691683,
 'assistant': 0.0008113590263691683,
 'professor':

In [45]:
cond_prob_class[0]['sapna']   

0.0008693132425383947

In [46]:
freq_terms_class[0]['sapna']   

14

Training part is done. We are ready with:

    - vocabulary
    - class prior probability of each class
    - cond prob for each term
    

## STEP 7: TESTING CLASSIFIER 

Now our model is ready. We will compare its predictions on the test set against the given labels.
For every test mail, compute a score for each class (using Bayes' theorem in log space) and assign the class with the maximum score.
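The class score is log P(c) plus the sum of log P(t|c) over the tokens of the mail. Logarithms are used because multiplying many small probabilities underflows to zero in floating point, which would make all classes indistinguishable. A minimal demonstration, assuming 100 hypothetical tokens that each have probability 1e-5 (math.prod requires Python 3.8 or later):

```python
import math

probs = [1e-5] * 100                         # 100 hypothetical per-token probabilities

product = math.prod(probs)                   # true value 1e-500 is below the float range
log_sum = sum(math.log(p) for p in probs)    # the same quantity in log space

print(product)   # 0.0
print(log_sum)   # about -1151.29
```

Since the logarithm is monotonic, the class with the largest log score is also the class with the largest posterior probability.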

In [47]:
test_set.head()

Unnamed: 0,sno,message,tokens,label
0,146,megha thanks for being an active member please...,"[megha, thanks, active, member, please, view, ...",1
1,805,shrikant and manish sent new messages messages...,"[shrikant, manish, sent, new, message, message...",1
2,337,shrikant sent you a new message conversation w...,"[shrikant, sent, new, message, conversation, s...",3
3,750,delivery status notification failure hello meg...,"[delivery, status, notification, failure, hell...",0
4,373,mcs_exploratory analysis pfa,"[analysis, pfa]",0


In [48]:
#given a list of tokens of a test doc, predict best suitable class label

def predictClass(token_list):
    class_score = []
    
    #compute the log posterior score of each class
    for ci in range(total_classes):
        #start from the log prior so that it is on the same scale as the log likelihoods
        score = math.log(prior_prob[ci])
        #smoothed probability of a term that never occurs in class ci
        unseen_prob = 1/(len(vocab)+sum(freq_terms_class[ci].values()))
        
        for token in token_list:
            if token in cond_prob_class[ci]:
                score += math.log(cond_prob_class[ci][token])
            else:
                score += math.log(unseen_prob)
                
        class_score.append(score)
        
    return np.argmax(class_score)    #return class label with maximum score

In [49]:
#predict for a test example

example = test_set.tokens.iloc[35]
cls = predictClass(example)
cls

2

In [50]:
test_set.label.iloc[35]   #actual label of the test example

2

In [51]:
'''
    Determine posterior probability of each test example against all classes and predict the 
    label against which the class probability is maximum

'''       
       
predictions = []                       #to store prediction of each test example

for test_case in range(test_size): 
    
    token_list = test_set.tokens.iloc[test_case]
    
    #predict the class label for each example and append to the predictions list
    predictions.append(predictClass(token_list))

#predictions

<b> Testing is over ! </b>
<br>We have predicted labels for each sample in the test_set.

## STEP 8: ACCURACY OF THE CLASSIFIER

Accuracy is the fraction of predictions our model got right out of all predictions.
Formally, accuracy has the following definition:
<br><br>
<center><b> Accuracy = (Number of correct predictions) / (Number of total predictions) </b></center>


In [52]:
#calculate accuracy
predict_labels = np.array(predictions)
actual_labels = np.array(test_set.label)

test_accuracy = np.sum(predict_labels == actual_labels)/float(test_size) 

print ("******* Test Set Examples ******* : ", test_size)
print ("******* Test Set Accuracy ******* : ", (test_accuracy*100) ,"%") 

******* Test Set Examples ******* :  137
******* Test Set Accuracy ******* :  78.1021897810219 %


In [53]:
conf_matrix = confusion_matrix(actual_labels, predict_labels)
print(conf_matrix)



[[35  0  2  6  1]
 [ 0 29  0  0  1]
 [ 1  0 27  0  0]
 [ 2 14  2 12  0]
 [ 0  0  0  1  4]]


In [54]:
#true_negative
TN = [0]*total_classes
#false_negative
FN = [0]*total_classes
#false_positive
FP = [0]*total_classes
#true_positive
TP = [0]*total_classes

for class_no in range(total_classes):
    for i in range(total_classes):
        for j in range(total_classes):
            if(i==j and i==class_no):
                TP[class_no] = conf_matrix[i][j]
            if(i!=class_no and j!=class_no):
                TN[class_no] += conf_matrix[i][j]
            if(i==class_no and j!=class_no):
                FN[class_no] += conf_matrix[i][j]
            if(j==class_no and i!=class_no):
                FP[class_no] += conf_matrix[i][j]
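The same four counts can also be read directly off the rows and columns of the confusion matrix: the diagonal entry is TP, the rest of row k is FN, the rest of column k is FP, and everything else is TN. The sketch below is a compact alternative to the triple loop above, not a replacement for it; it hard-codes the matrix printed in the previous cell so that it runs on its own.

```python
conf = [[35, 0, 2, 6, 1],
        [0, 29, 0, 0, 1],
        [1, 0, 27, 0, 0],
        [2, 14, 2, 12, 0],
        [0, 0, 0, 1, 4]]
n = len(conf)
total = sum(sum(row) for row in conf)

tp = [conf[k][k] for k in range(n)]
fn = [sum(conf[k]) - tp[k] for k in range(n)]                        # rest of row k
fp = [sum(conf[i][k] for i in range(n)) - tp[k] for k in range(n)]   # rest of column k
tn = [total - tp[k] - fn[k] - fp[k] for k in range(n)]

print(tp)  # [35, 29, 27, 12, 4]
print(fn)  # [9, 1, 1, 18, 1]
print(fp)  # [3, 14, 4, 7, 2]
print(tn)  # [90, 93, 105, 100, 130]
```

Row 3 stands out: 14 of the 30 mails labelled Updates were predicted as Social.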

In [55]:
correct=0
total=0
for class_no in range(total_classes):
    # Recall is the number of correctly classified positive examples divided by the total number of positive examples.
    # High recall indicates that most examples of the class are recognized (small number of FN)
    recall = (TP[class_no])/(TP[class_no] + FN[class_no])

    # Precision is the number of correctly classified positive examples divided by the total number of predicted positive examples.
    # High precision indicates that an example labeled as positive is indeed positive (small number of FP)
    precision = (TP[class_no])/(TP[class_no] + FP[class_no])

    fmeasure = (2*recall*precision)/(recall+precision)
    correct+=TP[class_no]+TN[class_no]
    total+=TN[class_no] + FN[class_no] + FP[class_no] + TP[class_no]
    print("------ CLASSIFICATION PERFORMANCE OF THE NAIVE BAYES MODEL ------ "\
      "\n class : ", class_no, \
      "\n Recall : ", (recall*100) ,"%" \
      "\n Precision : ", (precision*100) ,"%" \
      "\n F-measure : ", (fmeasure*100) ,"%" )


#TP+TN pooled over the five one-vs-rest problems gives the mean per-class (one-vs-rest) accuracy,
#not the overall test accuracy computed in In [52]
accuracy = correct/total
print("\n Mean one-vs-rest accuracy : ", (accuracy*100) ,"%" )
#overall accuracy for comparison: accuracy_score(actual_labels, predict_labels)

------ CLASSIFICATION PERFORMANCE OF THE NAIVE BAYES MODEL ------ 
 class :  0 
 Recall :  79.54545454545455 %
 Precision :  92.10526315789474 %
 F-measure :  85.36585365853658 %
------ CLASSIFICATION PERFORMANCE OF THE NAIVE BAYES MODEL ------ 
 class :  1 
 Recall :  96.66666666666667 %
 Precision :  67.44186046511628 %
 F-measure :  79.45205479452055 %
------ CLASSIFICATION PERFORMANCE OF THE NAIVE BAYES MODEL ------ 
 class :  2 
 Recall :  96.42857142857143 %
 Precision :  87.09677419354838 %
 F-measure :  91.52542372881356 %
------ CLASSIFICATION PERFORMANCE OF THE NAIVE BAYES MODEL ------ 
 class :  3 
 Recall :  40.0 %
 Precision :  63.1578947368421 %
 F-measure :  48.97959183673469 %
------ CLASSIFICATION PERFORMANCE OF THE NAIVE BAYES MODEL ------ 
 class :  4 
 Recall :  80.0 %
 Precision :  66.66666666666666 %
 F-measure :  72.72727272727272 %

 Mean one-vs-rest accuracy :  91.24087591240875 %
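The final figure of 91.24% is higher than the 78.10% test accuracy from Step 8 because the two numbers measure different things. The loop above adds TP + TN for every class, so each test mail is counted five times and is credited as a true negative for every class it correctly does not belong to. A short check, using TP from the diagonal of the confusion matrix and TN values derived from that same matrix (derived here, not printed in the notebook):

```python
tp = [35, 29, 27, 12, 4]         # diagonal of the confusion matrix
tn = [90, 93, 105, 100, 130]     # derived per-class true negatives
n_test, n_classes = 137, 5

overall = sum(tp) / n_test                                              # correct predictions / all predictions
one_vs_rest = sum(t + n for t, n in zip(tp, tn)) / (n_test * n_classes)

print(round(overall * 100, 2))       # 78.1
print(round(one_vs_rest * 100, 2))   # 91.24
```

For a multi-class classifier the first number is the one usually reported as accuracy.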
