# Document Classification with Naive Bayes - Lab

## Introduction

In this lesson, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

In this lab you will:  

* Implement document classification using Naive Bayes

## Import the dataset

To start, import the dataset stored in the text file `'SMSSpamCollection'`.

In [50]:
# Your code here
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('SMSSpamCollection', sep = '\t', names = ['label', 'text'])
df.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Account for class imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and subset examples of the majority class (ham) to an equal number of examples.

In [4]:
# Your code here
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [6]:
minority = df[df['label'] == 'spam']
sample_majority = df[df['label'] == 'ham'].sample(n = len(minority))
df_b = pd.concat([minority, sample_majority])
df_b['label'].value_counts()

spam    747
ham     747
Name: label, dtype: int64

In [19]:
df_b

Unnamed: 0,label,text
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
5,spam,FreeMsg Hey there darling it's been 3 week's n...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...
11,spam,"SIX chances to win CASH! From 100 to 20,000 po..."
...,...,...
999,ham,Then ü wait 4 me at bus stop aft ur lect lar. ...
2182,ham,Ok.
3719,ham,Cool. Do you like swimming? I have a pool and ...
5122,ham,NOT ENUFCREDEIT TOCALL.SHALL ILEAVE UNI AT 6 +...


## Train-test split

Now implement a train-test split on the dataset: 

In [40]:
# Your code here
from sklearn.model_selection import train_test_split
X = df_b['text']
y = df_b['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 17)
train_df = pd.concat([X_train, y_train], axis = 1)
test_df = pd.concat([X_test, y_test], axis = 1)

In [41]:
train_df

Unnamed: 0,text,label
2646,"Sorry, I'll call later",ham
312,Think ur smart ? Win £200 this week in our wee...,spam
844,Urgent! call 09066350750 from your landline. Y...,spam
2130,Mine here like all fr china then so noisy.,ham
2118,Wish u many many returns of the day.. Happy bi...,ham
...,...,...
2987,Reply to win £100 weekly! What professional sp...,spam
3851,I to am looking forward to all the sex cuddlin...,ham
910,"January Male Sale! Hot Gay chat now cheaper, c...",spam
4596,Yo sorry was in the shower sup,ham


In [42]:
test_df

Unnamed: 0,text,label
5200,Call Germany for only 1 pence per minute! Call...,spam
3657,Oh really?? Did you make it on air? What's you...,ham
2693,Urgent Urgent! We have 800 FREE flights to Eur...,spam
1374,"Bears Pic Nick, and Tom, Pete and ... Dick. In...",spam
2413,I don't know u and u don't know me. Send CHAT ...,spam
...,...,...
5524,You are awarded a SiPix Digital Camera! call 0...,spam
418,FREE entry into our £250 weekly competition ju...,spam
1663,Hi if ur lookin 4 saucy daytime fun wiv busty ...,spam
4307,Awww dat is sweet! We can think of something t...,ham


## Create the word frequency dictionary for each class

Create a word frequency dictionary for each class: 

In [43]:
# Your code here
# nested dict:  {class_ham: {word1:_, word2:_, ...}, class_spam: {word1:_, word2:_, ...}}
class_word_freq = {}
classes = train_df['label'].unique()
for class_ in classes:
    temp_df = train_df[train_df['label'] == class_]
    bag = {}
    for row in temp_df.index:
        doc = temp_df['text'][row]
        for word in doc.split():
            bag[word] = bag.get(word, 0) + 1
    class_word_freq[class_] = bag

In [44]:
len(class_word_freq['ham'])

2914

## Count the total corpus words
Calculate V, the total number of words in the corpus: 

My solution:

In [45]:
# Your code here
lst_of_words = list(class_word_freq['spam'].keys())
for word in class_word_freq['ham'].keys():
    if word not in lst_of_words:
        lst_of_words.append(word)
V = len(lst_of_words)

In [46]:
V

5959

Better solution:

In [47]:
vocab = set()
for text in train_df['text']:
    for word in text.split():
        vocab.add(word)
        
V = len(vocab)

In [48]:
V

5959

## Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [49]:
# Your code here
def bag_it(doc):
    bag = {}
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1
    return bag    

## Implementing Naive Bayes

Now, implement a master function to build a naive Bayes classifier. Be sure to use the logarithmic probabilities to avoid underflow.

In [51]:
# Your code here
def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    bag = bag_it(doc)
    classes = []
    posteriors = []
    for class_ in class_word_freq.keys():
        p = np.log(p_classes[class_])
        for word in bag.keys():
            num = bag[word] + 1
            den = class_word_freq[class_].get(word, 0) + V
            p += np.log(num/den)
        classes.append(class_)
        posteriors.append(p)
    if return_posteriors:
        print(posteriors)
    return classes[np.argmax(posteriors)]

## Test your classifier

Finally, test your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [52]:
# Your code here
p_classes = dict(df_b['label'].value_counts(normalize=True))
p_classes

{'spam': 0.5, 'ham': 0.5}

In [53]:
y_hat = X_train.map(lambda x: classify_doc(x, class_word_freq, p_classes, V))
residuals = y_train == y_hat
residuals.value_counts(normalize = True)

False    0.761607
True     0.238393
dtype: float64

## Level up (Optional)

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!