# Programming Assignment: Email Spam Naive Bayes

## Overview/Task

The goal of this programming assignment is to build a naive Bayes classifier from scratch that determines whether an email should be labeled spam or not spam based on its text.

## Review

Remember that a naive Bayes classifier models the following probability:

$$P(Y|X_1,X_2,...,X_n) \propto P(Y)*P(X_1|Y)*P(X_2|Y)*...*P(X_n|Y)$$

Here $Y$ is the binary class label in {0,1} and each $X_i$ is a feature of the input.

The classifier assigns each input to the class with the highest probability under the equation above.
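As a small illustration of this decision rule, the sketch below scores two classes and picks the larger one. The priors and per-feature values are made-up numbers chosen only for the example, and the per-feature terms are assumed to be class-conditional likelihoods $P(X_i|Y)$, the standard naive Bayes form.

```python
import math

# Hypothetical values for illustration only (not derived from the assignment data).
priors = {0: 0.5, 1: 0.5}
# likelihoods[y] holds P(X_i | Y=y) for three observed features.
likelihoods = {
    0: [0.20, 0.50, 0.10],
    1: [0.60, 0.40, 0.30],
}

def unnormalized_posterior(y):
    # P(Y=y) multiplied by the product of the per-feature likelihoods.
    return priors[y] * math.prod(likelihoods[y])

scores = {y: unnormalized_posterior(y) for y in priors}

# The decision is the class with the highest score.
decision = max(scores, key=scores.get)

# Dividing by the sum of the scores recovers normalized posterior probabilities.
total = sum(scores.values())
posteriors = {y: s / total for y, s in scores.items()}
print(decision, posteriors)
```

Only the ranking of the scores matters for the decision, which is why the proportionality sign in the equation is enough and the normalizing constant can be skipped.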

## Reminders

Please remember that the classifier must be written from scratch; do NOT use any libraries that implement the classifier for you, such as but not limited to sklearn.

You CAN, however, use sklearn to split the dataset into training and testing sets.

Feel free to look up any tasks you are not familiar with, e.g. the function call to read a CSV file.

## Task list/Recommended Order

To provide some guidance, here is a recommended order/checklist for solving this task:
<ol>
  <li>Compute the "prior": P(Y) for Y = 0 and Y = 1</li>
  <li>Compute the "likelihood": $P(X_i|Y)$ for each feature $X_i$</li>
  <li>Write code that uses the two items above to make a decision on whether or not an email is spam or ham (aka not spam)</li>
  <li>Write code to evaluate your model. Test the model on the training data to debug.</li>
  <li>Test the model on the testing data to evaluate it.</li>
</ol>
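The first two steps can be sketched on a tiny in-memory corpus. Everything here is an assumption made for illustration: the toy emails, the helper names `estimate_priors` and `estimate_likelihoods`, the whitespace tokenization, and the add-one (Laplace) smoothing, which the assignment does not require. The sketch estimates the class-conditional form, i.e. the fraction of emails in a class that contain a given word.

```python
from collections import Counter

# Toy corpus for illustration only: (text, label) with 1 = spam, 0 = ham.
emails = [
    ("win money now", 1),
    ("cheap money offer", 1),
    ("meeting schedule today", 0),
    ("project meeting notes", 0),
]

def estimate_priors(data):
    # P(Y=y) is the fraction of emails carrying label y.
    label_counts = Counter(label for _, label in data)
    return {y: n / len(data) for y, n in label_counts.items()}

def estimate_likelihoods(data, alpha=1.0):
    # P(word | Y=y): fraction of class-y emails containing the word,
    # with add-alpha smoothing so unseen (word, class) pairs are never 0.
    label_counts = Counter(label for _, label in data)
    doc_counts = {y: Counter() for y in label_counts}
    for text, label in data:
        doc_counts[label].update(set(text.split()))  # count each word once per email
    vocab = set().union(*doc_counts.values())
    return {
        y: {w: (doc_counts[y][w] + alpha) / (label_counts[y] + 2 * alpha) for w in vocab}
        for y in label_counts
    }

priors = estimate_priors(emails)
likelihoods = estimate_likelihoods(emails)
print(priors)
print(likelihoods[1]["money"], likelihoods[0]["money"])
```

Smoothing is one common way to keep a single unseen word from forcing a class score to zero; other choices (such as ignoring unseen words) are possible.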

In [1]:
#import cell
import numpy as np
import pandas as pd
import random
import csv

In [53]:
#syntax testing:
#data frame:
df = pd.read_csv("./TRAIN_balanced_ham_spam.csv")
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,4696,2784,ham,Subject: urgent\ned has requested that we comp...,0
1,4272,2528,ham,Subject: eastrans nominations change effective...,0
2,53,3439,ham,Subject: new enrononline functionality\nin ord...,0
3,3270,836,ham,Subject: june ' s update\nhere is the lastest ...,0
4,3324,1071,ham,Subject: re : natural gas nomination for 07 / ...,0


## Function template

In [116]:
def prior(df):
    
    # get subset of emails which are not spam
    ham_df = df[df['label']=='ham']
    
    # count the total number of ham emails, and the sample size
    total, _ = df.shape
    ham_total, _ = ham_df.shape
    
    # calculate priors
    ham_prior = ham_total / total
    spam_prior = 1 - ham_prior
    
    return ham_prior, spam_prior

def likelihood(df):
    # Return a tuple of dictionaries (ham_like_dict, spam_like_dict). For every word seen
    # in training, they hold the fraction of the emails containing that word that are
    # ham or spam, respectively (a per-word estimate of P(Y | word)).
    # For a given word, these two values are in the same ratio as the class-conditional
    # likelihoods P(word | Y) when the classes are balanced (both priors are 0.5 here).

    def count_emails_per_word(texts):
        # Count how many emails contain each word (a word counts once per email).
        counts = {}
        for text in texts:
            # Remove the first 9 characters ("Subject: "), strip punctuation from
            # each token, then drop the empty string.
            words = {w.strip("/.,:?!'\"") for w in text[9:].split()} - {''}
            for word in words:
                counts[word] = counts.get(word, 0) + 1
        return counts

    # raw count dictionaries, built from independent frames for ham and spam
    ham_cnt_dict = count_emails_per_word(df[df['label'] == 'ham']['text'].values)
    spam_cnt_dict = count_emails_per_word(df[df['label'] == 'spam']['text'].values)

    ham_like_dict = {}
    spam_like_dict = {}

    # Calculate the values from the number of spam and ham emails each word occurs in.
    # A word seen in only one class gets 1.0 for that class and 0.0 for the other.
    for word in set(ham_cnt_dict) | set(spam_cnt_dict):
        ham_cnt = ham_cnt_dict.get(word, 0)
        spam_cnt = spam_cnt_dict.get(word, 0)
        spamicity = spam_cnt / (ham_cnt + spam_cnt)
        spam_like_dict[word] = spamicity
        ham_like_dict[word] = 1 - spamicity

    return ham_like_dict, spam_like_dict


def predict(ham_prior, spam_prior, ham_like_dict, spam_like_dict, text):
    '''
    prediction function that uses prior and likelihood structure to compute proportional posterior for a single line of text
    '''
    #ham_spam_decision = 1 if classified as spam, 0 if classified as normal/ham
    ham_spam_decision = None

    '''YOUR CODE HERE'''
    
    
    
    
    #ham_posterior = posterior probability that the email is normal/ham
    ham_posterior = ham_prior
    #spam_posterior = posterior probability that the email is spam
    spam_posterior = spam_prior
    
    

    '''END'''
    return ham_spam_decision


def metrics(ham_prior, spam_prior, ham_dict, spam_dict, df):
    '''
    Calls the "predict" function and reports the accuracy, precision, and recall of your predictions
    '''
    
    '''YOUR CODE HERE'''


    '''END'''
    return acc, precision, recall
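One possible way to fill in `predict` is sketched below; it is not the only valid solution. It assumes both priors are greater than zero and that the dictionaries map each word to a value in [0, 1], as `likelihood` returns. It reuses the same tokenization, skips words not seen in training, and sums logarithms instead of multiplying to avoid numerical underflow. The names `predict_sketch` and `EPSILON`, and the toy dictionaries, are hypothetical and exist only for this example.

```python
import math

EPSILON = 1e-6  # hypothetical floor; keeps log() away from 0.0

def predict_sketch(ham_prior, spam_prior, ham_like_dict, spam_like_dict, text):
    # Same tokenization as in likelihood(): drop "Subject: ", strip punctuation.
    words = {w.strip("/.,:?!'\"") for w in text[9:].split()} - {''}

    # Start from the log priors and add one log term per known word.
    ham_log = math.log(ham_prior)
    spam_log = math.log(spam_prior)
    for word in words:
        if word in ham_like_dict and word in spam_like_dict:
            ham_log += math.log(max(ham_like_dict[word], EPSILON))
            spam_log += math.log(max(spam_like_dict[word], EPSILON))

    # 1 = spam, 0 = ham; ties go to ham.
    return 1 if spam_log > ham_log else 0

# Toy dictionaries for illustration only.
ham_like = {"meeting": 0.9, "money": 0.2, "free": 0.0}
spam_like = {"meeting": 0.1, "money": 0.8, "free": 1.0}

print(predict_sketch(0.5, 0.5, ham_like, spam_like, "Subject: free money"))
print(predict_sketch(0.5, 0.5, ham_like, spam_like, "Subject: meeting today"))
```

Clamping with `EPSILON` is a design choice: without it, a word seen in only one class during training would contribute a factor of 0.0 and eliminate the other class regardless of the remaining words.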

In [117]:
# TESTING AREA!

print(prior(df))
print(likelihood(df))


(0.5, 0.5)


## Generate answers with your functions

In [None]:
# load the training and testing data
train_df = pd.read_csv("./TRAIN_balanced_ham_spam.csv")
test_df = pd.read_csv("./TEST_balanced_ham_spam.csv")
df = train_df
df.info()

In [None]:
#compute the prior

ham_prior, spam_prior = prior(df)

print(ham_prior, spam_prior)

In [None]:
# compute likelihood

ham_like_dict, spam_like_dict = likelihood(df)

In [None]:
# Test your predict function with some example TEXT

some_text_example = "write your test case here"
print(predict(ham_prior, spam_prior, ham_like_dict, spam_like_dict, some_text_example))

In [None]:
# Predict on test_df and compute metrics 
    
df = test_df
acc, precision, recall = metrics(ham_prior, spam_prior, ham_like_dict, spam_like_dict, df)
print(acc, precision, recall)
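For reference, the three reported metrics can be computed from confusion-matrix counts as in the sketch below. The helper name `classification_metrics` and the toy label lists are hypothetical; spam (1) is assumed to be the positive class. In the notebook, the true labels could plausibly come from the `label_num` column and the predictions from calling `predict` on each email's text.

```python
def classification_metrics(y_true, y_pred):
    # Confusion-matrix counts with spam (1) as the positive class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Accuracy: fraction of all emails classified correctly.
    accuracy = (tp + tn) / len(pairs)
    # Precision: of the emails flagged as spam, the fraction that really are spam.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real spam emails, the fraction that were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
acc, precision, recall = classification_metrics(y_true, y_pred)
print(acc, precision, recall)
```

Returning 0.0 when a denominator is zero is one convention for the case where no email is predicted (or labeled) as spam; raising an error instead would also be reasonable.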