# Naive Bayes

Bayes Formula is given as :
$$ P(E|F) = \frac{P(F|E)•P(E)}{P(F|E)•P(E) + P(F|E^c)•P(E^c)} $$

In [41]:
import pandas as pd
import numpy as np

In [42]:
emails = pd.read_csv('emails.csv')
emails

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
...,...,...
5723,Subject: re : research and development charges...,0
5724,"Subject: re : receipts from visit jim , than...",0
5725,Subject: re : enron case study update wow ! a...,0
5726,"Subject: re : interest david , please , call...",0


In [43]:
emails.text[2]

'Subject: unbelievable new homes made easy  im wanting to show you this  homeowner  you have been pre - approved for a $ 454 , 169 home loan at a 3 . 72 fixed rate .  this offer is being extended to you unconditionally and your credit is in no way a factor .  to take advantage of this limited time opportunity  all we ask is that you visit our website and complete  the 1 minute post approval form  look foward to hearing from you ,  dorcas pittman'

## Data Preprocessing

> <font color = 'red'> Turning the text strings into the list of words

In [44]:
def process_email(text):
    text = text.lower()
    return list(set(text.split()))

In [45]:
# we will now apply this function to the entire column
emails['words'] = emails.text.apply(process_email)

In [46]:
emails

Unnamed: 0,text,spam,words
0,Subject: naturally irresistible your corporate...,1,"[specially, through, nowadays, do, provide, pr..."
1,Subject: the stock trading gunslinger fanny i...,1,"[palfrey, but, optima, diffusion, deoxyribonuc..."
2,Subject: unbelievable new homes made easy im ...,1,"[easy, your, $, extended, been, our, 454, made..."
3,Subject: 4 color printing special request add...,1,"[/, 8090, fax, information, our, ca, graphix, ..."
4,"Subject: do not have money , get software cds ...",1,"[great, it, best, finish, do, are, here, compa..."
...,...,...,...
5723,Subject: re : research and development charges...,0,"[/, et, reviewing, 26, do, research, @, apodac..."
5724,"Subject: re : receipts from visit jim , than...",0,"[/, r, la, fax, lot, following, dossier, reimb..."
5725,Subject: re : enron case study update wow ! a...,0,"[/, fax, meetings, following, do, box, slot, t..."
5726,"Subject: re : interest david , please , call...",0,"[/, time, @, set, up, re, 3528, discipline, gr..."


> <font color = 'red'> Finding the priors -> finding the probability that email is spam (the prior)

In [47]:
# prior would be the probability of spam that we have so far from our initial knowledge

sum(emails['spam'])/len(emails)

0.2388268156424581

> <font color = 'red'> Finding the posterior -> Prob. that spam or ham emails contain certain word

In [48]:
model = {}

for index, email in emails.iterrows():
    for word in email['words']:
        if word not in model:
            model[word] = {'spam':1, 'ham':1}
        if word in model:
            model[word]['spam'] += 1
        else:
            model[word]['ham'] += 1

In [49]:
model['lottery']

{'spam': 9, 'ham': 1}

In [50]:
model['step']

{'spam': 101, 'ham': 1}

In [51]:
model['brother']

{'spam': 14, 'ham': 1}

In [52]:
model['reply']

{'spam': 308, 'ham': 1}

> <font color = 'red'> Implementing the Naive Bayes

In [66]:
def predict_naive_bayes(email):
    total = len(emails)
    num_spam = sum(emails['spam'])
    num_ham = total - num_spam
    email = email.lower()
    words = set(email.split())
    spams = [1.0]
    hams = [1.0]

    for word in words:
        if word in model:
            spams.append(model[word]['spam']/num_spam * total)     #finding the conditional probability
            hams.append(model[word]['ham']/num_ham * total)        #that email containing that word is spam

    prod_spams = np.longlong(np.prod(spams)* num_spam)         #product of sums being seen as prior for
    prod_hams = np.longlong(np.prod(hams)* num_ham)            #each words and long so that each time is long
                                                           #enough for python to handle

    return  prod_spams/(prod_spams + prod_hams)      # normalizing the probability


In [67]:
predict_naive_bayes('ready for tomorrow seeing you')

1.000000000000002

In [68]:
predict_naive_bayes('enter the lottery to win three million dollars')

1.0000000000000042

In [69]:
predict_naive_bayes('Grokking Machine Learning by Luis Serrano')

0.999999999863745