The goal of this notebook is to implement a Naive Bayes classifier from scratch.  For this purpose, I need labeled text data, where each collection of words is assigned a label of either 0 or 1.  I will get this data from a Kaggle data set called the Naive Bayes training set.  I will use spam_ham.csv, which lists e-mails and classifies each one as "spam" or "ham".

In [2]:
import os
import pandas as pd
os.listdir("Naive Bayes Training Set")

['play_golf_train.csv',
 'Iris_Data.csv',
 'play_golf_test.csv',
 'ag_news.csv',
 'spam_ham.csv']

In [3]:
df = pd.read_csv("Naive Bayes Training Set/spam_ham.csv", encoding = "latin1")[['v1', 'v2']]

In [4]:
df['target'] = df['v1'].apply(lambda x: 1 if x == "spam" else 0)

Approach:

$\phi_{j|y=1} = p(x_j | y = 1) = $ of those samples with y = 1, what fraction contain word j?

$\phi_{j|y=0} = p(x_j | y = 0) = $ of those samples with y = 0, what fraction contain word j?

$\phi_{y=k} = p(y = k)$ = out of all samples, what fraction have the given value of y?

Under the naive assumption that the words are conditionally independent given the label:

$p(x|y=0) = \prod_{j=1}^n p(x_j | y = 0)$

$p(x|y=1) = \prod_{j=1}^n p(x_j | y = 1)$

$p(x, y = k) = p(x|y=k) p(y=k)$

$p(y = k|x) = \frac{p(x, y = k)}{p(x)} = \frac{p(x|y=k) p(y = k)}{p(x)}$

$p(x) = p(x, y = 1) + p(x, y = 0)$
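To make the formulas above concrete, here is a minimal sketch on a made-up five-document data set. The documents, labels, and the function names `phi_word`, `phi_class`, and `posterior_spam` are all hypothetical and chosen only for illustration. As in the rest of this notebook, the product runs only over the words that are present in the phrase.

```python
# Toy data (hypothetical): each document is the set of words it contains, plus a label.
docs = [
    ({"free", "win"}, 1),
    ({"free", "txt"}, 1),
    ({"hello", "win"}, 0),
    ({"hello", "there"}, 0),
    ({"hello", "txt"}, 0),
]

def phi_word(word, k):
    """phi_{j|y=k}: fraction of class-k documents that contain `word`."""
    in_class = [words for words, y in docs if y == k]
    return sum(word in words for words in in_class) / len(in_class)

def phi_class(k):
    """phi_{y=k}: fraction of all documents with label k."""
    return sum(y == k for _, y in docs) / len(docs)

def posterior_spam(words):
    """p(y = 1 | x), multiplying only over the words present in x."""
    joint = {}
    for k in (0, 1):
        likelihood = 1.0
        for w in words:
            likelihood *= phi_word(w, k)
        joint[k] = likelihood * phi_class(k)  # p(x, y = k)
    return joint[1] / (joint[0] + joint[1])   # divide by p(x)

print(posterior_spam({"win", "txt"}))
```

For the phrase {"win", "txt"} the joint probabilities are 0.25 · 0.4 for spam and (1/9) · 0.6 for ham, so the posterior comes out at roughly 0.6.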

In [5]:
import numpy as np
def isword(word):
    if len(word) == 0:
        return 0
    for letter in word:
        if letter.lower() not in "abcdefghijklmnopqrstuvwxyz":
            return 0
    return 1
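As a quick illustration of what this filter keeps and drops, the sketch below applies the same rule to the opening of one of the spam messages in the data set. The function is repeated here only so the snippet runs on its own; the names `sample`, `kept`, and `dropped` are illustrative.

```python
def isword(word):
    """Same rule as above: non-empty and made only of the letters a-z."""
    if len(word) == 0:
        return 0
    for letter in word:
        if letter.lower() not in "abcdefghijklmnopqrstuvwxyz":
            return 0
    return 1

sample = "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005."
kept = [w.lower() for w in sample.split() if isword(w)]
dropped = [w for w in sample.split() if not isword(w)]
print(kept)
print(dropped)
```

Tokens that contain digits or punctuation are rejected whole, so "2005." and "21st" are dropped rather than cleaned. A word followed directly by a period or comma would be lost in the same way.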

In [6]:
unique_words = np.unique([word.lower() for lst in df['v2'].apply(lambda x: x.split()) for word in lst if isword(word)])

In [7]:
unique_words[3000:3100]

array(['listed', 'listen', 'listener', 'listening', 'listn', 'lit',
       'literally', 'litres', 'little', 'live', 'lived', 'liverpool',
       'lives', 'living', 'lk', 'll', 'lmao', 'lnly', 'lo', 'load',
       'loads', 'loan', 'loans', 'lobby', 'local', 'location',
       'locations', 'lock', 'locks', 'lodge', 'lodging', 'log', 'logged',
       'logging', 'login', 'logo', 'logoff', 'logon', 'logos', 'loko',
       'lol', 'lololo', 'londn', 'london', 'loneliness', 'lonely', 'long',
       'longer', 'lonlines', 'loo', 'look', 'looked', 'lookin', 'looking',
       'looks', 'loooooool', 'looovvve', 'loose', 'loosing', 'loosu',
       'lor', 'lord', 'lose', 'losers', 'loses', 'losing', 'loss', 'lost',
       'lot', 'lotr', 'lots', 'lotta', 'lotto', 'lotz', 'lou', 'loud',
       'lounge', 'lousy', 'lov', 'lovable', 'love', 'loved', 'lovejen',
       'lovely', 'loveme', 'lover', 'loverboy', 'lovers', 'loves',
       'loving', 'lovingly', 'low', 'lower', 'loxahatchee', 'loyal',
       'loya

In [8]:
unique_words

array(['a', 'aa', 'aah', ..., 'zoom', 'zouk', 'zyada'], dtype='<U34')

In [9]:
df_Bayes = df.copy()

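Create a column in df_Bayes for each word.  Each column initially contains only 0's; later it will contain a 1 in every row whose e-mail text includes that word.  Caveat: a word that equals an existing column name, such as "target", would overwrite that column, so this assumes no such word occurs in the vocabulary.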

In [10]:
for word in unique_words:
    df_Bayes[word] = 0

Process df_Bayes row by row, setting a word's column to 1 whenever that word appears in the e-mail text for that row.

In [12]:
for i in df_Bayes.index:
    for word in df_Bayes.loc[i, 'v2'].split():
        if isword(word):
            df_Bayes.loc[i, word.lower()] = 1
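Adding thousands of columns one at a time and then filling them cell by cell with `.loc` can be slow. A possible alternative, sketched here on a hypothetical three-row frame named `toy`, builds all indicator columns at once and attaches them with a single `pd.concat`. The helper `tokens` is an assumption of this sketch; it approximates `isword` with `str.isascii` and `str.isalpha`.

```python
import pandas as pd

# Toy stand-in for df (hypothetical); only the text column 'v2' matters here.
toy = pd.DataFrame({"v2": ["Free entry to win", "see you in a bit", "Win a free cup"]})

def tokens(text):
    """Distinct lowercase tokens made only of ASCII letters (approximates isword)."""
    return {w.lower() for w in text.split() if w.isascii() and w.isalpha()}

token_sets = toy["v2"].apply(tokens)
vocab = sorted(set().union(*token_sets))

# Build every 0/1 indicator column in one pass, then attach them with one concat.
indicators = pd.DataFrame(
    [[int(word in row_tokens) for word in vocab] for row_tokens in token_sets],
    index=toy.index,
    columns=vocab,
)
toy_bayes = pd.concat([toy, indicators], axis=1)
print(toy_bayes)
```

The resulting frame has the same layout as df_Bayes: the original columns followed by one 0/1 column per vocabulary word.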

In [57]:
def p_x_given_k(df_Bayes, phrase, k):
    """Unsmoothed p(x | y = k): product, over the distinct words of the phrase,
    of the fraction of class-k rows that contain each word."""
    in_class = df_Bayes[df_Bayes['target'] == k]
    n_class = in_class.shape[0]
    prob = 1
    for word in np.unique([word.lower() for word in phrase.split()]):
        if isword(word):
            if word in df_Bayes.columns:
                n_with_word = in_class[in_class[word] == 1].shape[0]
                # Debug output: word, class-k rows containing it, class-k size.
                print(word, n_with_word, n_class)
                prob *= n_with_word / n_class
            else:
                # Word never seen in the data: the unsmoothed estimate is 0.
                prob *= 0
    return prob
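The same likelihood can also be computed without a wide indicator table by storing, for each class, how many documents contain each word. The sketch below uses only the standard library. The names `fit_counts` and `likelihood`, and the default tokenizer `str.isalpha` (which stands in for `isword`), are assumptions of this example and not part of the notebook.

```python
from collections import Counter

def fit_counts(texts, labels, is_token=str.isalpha):
    """Return (doc_freq, class_size): per-class document frequencies and class sizes."""
    doc_freq = {0: Counter(), 1: Counter()}
    class_size = Counter()
    for text, y in zip(texts, labels):
        class_size[y] += 1
        # A set, so each word counts once per document (presence, not frequency).
        doc_freq[y].update({w.lower() for w in text.split() if is_token(w)})
    return doc_freq, class_size

def likelihood(doc_freq, class_size, phrase, k, is_token=str.isalpha):
    """Unsmoothed p(x | y = k) over the distinct words of `phrase`."""
    prob = 1.0
    for w in {w.lower() for w in phrase.split() if is_token(w)}:
        # A Counter returns 0 for unseen words, matching the `prob *= 0` branch.
        prob *= doc_freq[k][w] / class_size[k]
    return prob

doc_freq, class_size = fit_counts(["free win", "free txt", "hello win"], [1, 1, 0])
print(likelihood(doc_freq, class_size, "free win", 1))
print(likelihood(doc_freq, class_size, "free win", 0))
```

The counts are computed once, so scoring a new phrase only needs dictionary lookups. The sketch does not guard against an empty class, which would cause a division by zero.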

Find the first five spam e-mails and their index values.

In [68]:
df_Bayes[df_Bayes['target'] == 1].iloc[0:5]

Unnamed: 0,v1,v2,target,a,aa,aah,aaniye,aaooooright,abdomen,abi,...,z,zac,zebra,zed,zhong,zindgi,zoe,zoom,zouk,zyada
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,spam,FreeMsg Hey there darling it's been 3 week's n...,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,spam,WINNER!! As a valued network customer you have...,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,spam,Had your mobile 11 months or more? U R entitle...,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11,spam,"SIX chances to win CASH! From 100 to 20,000 po...",1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Use the exact text of the first spam e-mail (row index 2) as the phrase to be tested.

In [66]:
phrase = df_Bayes.iloc[2]['v2']
phrase

"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"

This function finds the probability that the phrase is y = 1, i.e. spam.

In [70]:
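def prob_1_phrase(df_Bayes, phrase):
    # Class counts stand in for the priors p(y = k); the shared 1/N factor cancels in the ratio.
    # Note: this raises ZeroDivisionError if both unsmoothed likelihoods are 0.
    p1 = p_x_given_k(df_Bayes, phrase, 1) * df_Bayes[df_Bayes["target"] == 1].shape[0]
    p0 = p_x_given_k(df_Bayes, phrase, 0) * df_Bayes[df_Bayes["target"] == 0].shape[0]
    return p1 / (p1 + p0)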

In [71]:
prob_1_phrase(df_Bayes, phrase)

a 291 747
apply 15 747
comp 9 747
cup 5 747
entry 21 747
fa 2 747
final 16 747
free 151 747
in 60 747
may 7 747
receive 29 747
text 96 747
tkts 4 747
to 464 747
txt 129 747
win 58 747
wkly 14 747
a 870 4825
apply 2 4825
comp 1 4825
cup 3 4825
entry 0 4825
fa 0 4825
final 2 4825
free 47 4825
in 712 4825
may 34 4825
receive 4 4825
text 61 4825
tkts 0 4825
to 1210 4825
txt 12 4825
win 7 4825
wkly 0 4825


1.0
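The result is exactly 1.0 because a single zero count makes the whole ham product zero. The sketch below redoes the arithmetic from the counts printed above, copied by hand into a dictionary. The variable names are illustrative.

```python
# (spam count, ham count) for each word, copied from the printed output above.
counts = {
    "a": (291, 870), "apply": (15, 2), "comp": (9, 1), "cup": (5, 3),
    "entry": (21, 0), "fa": (2, 0), "final": (16, 2), "free": (151, 47),
    "in": (60, 712), "may": (7, 34), "receive": (29, 4), "text": (96, 61),
    "tkts": (4, 0), "to": (464, 1210), "txt": (129, 12), "win": (58, 7),
    "wkly": (14, 0),
}
N_SPAM, N_HAM = 747, 4825

# Words that never occur in a ham message.
zero_in_ham = sorted(w for w, (_, ham) in counts.items() if ham == 0)

# Unnormalized joints: class count times the product of per-word fractions.
p1, p0 = float(N_SPAM), float(N_HAM)
for spam, ham in counts.values():
    p1 *= spam / N_SPAM
    p0 *= ham / N_HAM

print(zero_in_ham)
print(p0, p1 / (p1 + p0))
```

Four of the words have a ham count of 0, so `p0` is 0 regardless of how common the other words are in ham, and the posterior collapses to 1.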

Now repeat the calculation with Laplace smoothing.  This means that even words with zero instances for the given target are assigned a small nonzero probability.
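Written in the notation of the approach section, the smoothed estimate adds one pseudo-sample that contains the word and one that does not. Here $N_k$ and $N_{j,k}$ are shorthand introduced only for this formula: $N_k$ is the number of samples with y = k, and $N_{j,k}$ is the number of those samples that contain word j.

$\phi_{j|y=k} = \frac{N_{j,k} + 1}{N_k + 2}$

For a word that never appears with y = k, this gives $\frac{1}{N_k + 2}$ instead of 0.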

In [None]:
def Laplace_p_x_given_k(df_Bayes, phrase, k):
    """Laplace-smoothed p(x | y = k): add 1 to each word count and 2 to the class size."""
    in_class = df_Bayes[df_Bayes['target'] == k]
    n_class = in_class.shape[0]
    prob = 1
    for word in np.unique([word.lower() for word in phrase.split()]):
        if isword(word):
            if word in df_Bayes.columns:
                n_with_word = in_class[in_class[word] == 1].shape[0]
                # Debug output: the raw (unsmoothed) counts.
                print(word, n_with_word, n_class)
                prob *= (n_with_word + 1) / (n_class + 2)
            else:
                # Unseen word: count 0, so the smoothed estimate is 1 / (n_class + 2).
                prob *= 1 / (n_class + 2)
    return prob

In [75]:
def Laplace_prob_1_phrase(df_Bayes, phrase):
    p1 = Laplace_p_x_given_k(df_Bayes, phrase, 1) * df_Bayes[df_Bayes["target"] == 1].shape[0]
    p0 = Laplace_p_x_given_k(df_Bayes, phrase, 0) * df_Bayes[df_Bayes["target"] == 0].shape[0]
    return (p1 / (p1 + p0))

This phrase gets a probability of almost 1.0 of being spam, because the words "entry", "fa", "tkts", and "wkly" appear ONLY in spam phrases.  In fact, "wkly" appears 14 times in spam phrases and "entry" appears 21 times, so we couldn't reduce this probability by admitting only words with a large incidence.  I wonder if most phrases will have at least one word that never appears in one group or the other; without smoothing, such a phrase will have a probability of either 0 (if it has a word that never appears in y = 1) or 1 (if it has a word that never appears in y = 0).  Perhaps we need a larger data set than ~5,000 ham and spam e-mails.

In [77]:
Laplace_prob_1_phrase(df_Bayes, phrase)

a 291 747
apply 15 747
comp 9 747
cup 5 747
entry 21 747
fa 2 747
final 16 747
free 151 747
in 60 747
may 7 747
receive 29 747
text 96 747
tkts 4 747
to 464 747
txt 129 747
win 58 747
wkly 14 747
a 870 4825
apply 2 4825
comp 1 4825
cup 3 4825
entry 0 4825
fa 0 4825
final 2 4825
free 47 4825
in 712 4825
may 34 4825
receive 4 4825
text 61 4825
tkts 0 4825
to 1210 4825
txt 12 4825
win 7 4825
wkly 0 4825


0.999999999999742
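When the posterior is this close to 1, the log-odds are easier to read and less exposed to rounding or underflow than a product of many small fractions. The sketch below applies the +1 / +2 smoothing to the counts printed above, working in log space. The names `log_joint` and `log_odds` are illustrative. The figure it produces need not match the value recorded above digit for digit, for instance if that output came from a different version of the smoothing cell.

```python
import math

# (spam count, ham count) for each word, copied from the printed output above.
counts = {
    "a": (291, 870), "apply": (15, 2), "comp": (9, 1), "cup": (5, 3),
    "entry": (21, 0), "fa": (2, 0), "final": (16, 2), "free": (151, 47),
    "in": (60, 712), "may": (7, 34), "receive": (29, 4), "text": (96, 61),
    "tkts": (4, 0), "to": (464, 1210), "txt": (129, 12), "win": (58, 7),
    "wkly": (14, 0),
}
N_SPAM, N_HAM = 747, 4825

def log_joint(class_index, n_class):
    """log of (class count * product of Laplace-smoothed word probabilities)."""
    total = math.log(n_class)
    for pair in counts.values():
        total += math.log((pair[class_index] + 1) / (n_class + 2))
    return total

# Index 0 of each pair is the spam count, index 1 the ham count.
log_odds = log_joint(0, N_SPAM) - log_joint(1, N_HAM)
posterior_spam = 1 / (1 + math.exp(-log_odds))  # logistic function of the log-odds

print(log_odds, posterior_spam)
```

A large positive `log_odds` means strong evidence for spam. Comparing log-odds across phrases stays meaningful even when every posterior prints as 1.0.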