# Document Classification with Naive Bayes - Lab

## Introduction

In this lab, you'll practice implementing the Naive Bayes algorithm on your own by building a spam classifier for SMS messages.

## Objectives

In this lab you will:  

* Implement document classification using Naive Bayes

## Import the dataset

To start, import the dataset stored in the tab-separated text file `'SMSSpamCollection'`. The file has no header row, so supply the column names yourself.

In [1]:
# Your code here
import pandas as pd
df = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'text'])
df.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Account for class imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and subset examples of the majority class (ham) to an equal number of examples.

In [10]:
# Your code here
df_spam = df[df.label=='spam']
df_spam

| | label | text |
|---|---|---|
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |
| 11 | spam | SIX chances to win CASH! From 100 to 20,000 po... |
| ... | ... | ... |
| 5537 | spam | Want explicit SEX in 30 secs? Ring 02073162414... |
| 5540 | spam | ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ... |
| 5547 | spam | Had your contract mobile 11 Mnths? Latest Moto... |
| 5566 | spam | REMINDER FROM O2: To get 2.50 pounds free call... |


In [13]:
df_ham = df[df.label=='ham'].sample(len(df_spam))
df_ham

| | label | text |
|---|---|---|
| 4529 | ham | HOW ARE U? I HAVE MISSED U! I HAVENT BEEN UP 2... |
| 1828 | ham | Hey gorgeous man. My work mobile number is. Ha... |
| 5040 | ham | Pls clarify back if an open return ticket that... |
| 2298 | ham | Draw va?i dont think so:) |
| 1865 | ham | You call him now ok i said call him |
| ... | ... | ... |
| 567 | ham | Oooh bed ridden ey? What are YOU thinking of? |
| 4164 | ham | I told that am coming on wednesday. |
| 3329 | ham | No we put party 7 days a week and study lightl... |
| 1763 | ham | Sometimes Heart Remembrs someone Very much... ... |


In [18]:
# Stack the two subsets row-wise (axis=0) so the result keeps the 'label' and 'text' columns
df2 = pd.concat([df_spam, df_ham], axis=0)

In [23]:
df2

| | label | text |
|---|---|---|
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |
| 11 | spam | SIX chances to win CASH! From 100 to 20,000 po... |
| ... | ... | ... |
| 567 | ham | Oooh bed ridden ey? What are YOU thinking of? |
| 4164 | ham | I told that am coming on wednesday. |
| 3329 | ham | No we put party 7 days a week and study lightl... |
| 1763 | ham | Sometimes Heart Remembrs someone Very much... ... |


In [44]:
p_classes = dict(df2['label'].value_counts(normalize=True))
p_classes

{'spam': 0.5, 'ham': 0.5}
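
The downsampling steps above can be sketched as one reusable function. This is a minimal illustration on toy data, not part of the lab's solution: the helper name `balance_classes`, the toy DataFrame, and the fixed `random_state` (added so the sample is reproducible) are all assumptions.

```python
import pandas as pd


def balance_classes(df, label_col="label", random_state=0):
    """Hypothetical helper: downsample every class to the size of the smallest one."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Stack the per-class samples row-wise
    return pd.concat(parts, axis=0)


# Toy data: 6 ham messages and 2 spam messages
toy = pd.DataFrame({
    "label": ["ham"] * 6 + ["spam"] * 2,
    "text": [f"message {i}" for i in range(8)],
})

balanced = balance_classes(toy)
counts = balanced["label"].value_counts()
print(counts)
```

Both classes end up with two rows each, so the class priors computed from `balanced` would be 0.5 and 0.5, as in the output above.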

## Train-test split

Now implement a train-test split on the dataset: 

In [36]:
# Your code here
from sklearn.model_selection import train_test_split
X = df2['text']
y = df2['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)

In [37]:
train_df

| | text | label |
|---|---|---|
| 2261 | SplashMobile: Choose from 1000s of gr8 tones e... | spam |
| 34 | Thanks for your subscription to Ringtone UK yo... | spam |
| 3463 | Bloomberg -Message center +447797706009 Why wa... | spam |
| 4152 | Ü comin to fetch us oredi... | ham |
| 3574 | You won't believe it but it's true. It's Incre... | spam |
| ... | ... | ... |
| 3508 | Two fundamentals of cool life: "Walk, like you... | ham |
| 431 | At home watching tv lor. | ham |
| 4349 | You give us back my id proof and &lt;#&gt; r... | ham |
| 4149 | Please call Amanda with regard to renewing or ... | spam |
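
A plain random split does not guarantee that the training and test sets keep the 50/50 class balance created earlier. One option is the `stratify` argument of `train_test_split`. The sketch below uses made-up toy lists (`toy_texts`, `toy_labels`) rather than the lab's data.

```python
from sklearn.model_selection import train_test_split

# Toy data: 10 spam and 10 ham messages
toy_texts = [f"message {i}" for i in range(20)]
toy_labels = ["spam"] * 10 + ["ham"] * 10

# stratify=toy_labels keeps the class proportions the same in both splits
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, stratify=toy_labels, random_state=0
)

print(len(toy_X_train), len(toy_X_test))
print(toy_y_test.count("spam"), toy_y_test.count("ham"))
```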


## Create the word frequency dictionary for each class

Create a word frequency dictionary for each class: 

In [38]:
# Your code here
class_word_freq = {} 
classes = train_df['label'].unique()
for class_ in classes:
    temp_df = train_df[train_df.label == class_]
    bag = {}
    for row in temp_df.index:
        doc = temp_df['text'][row]
        for word in doc.split():
            bag[word] = bag.get(word, 0) + 1
    class_word_freq[class_] = bag

In [39]:
classes

array(['spam', 'ham'], dtype=object)
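
The nested-loop approach above can also be written with `collections.Counter`, which handles the "get or default to zero" bookkeeping. The following is a small sketch on hypothetical toy data (`toy_docs`), not the lab's DataFrame.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: (label, text) pairs
toy_docs = [
    ("spam", "win cash now"),
    ("spam", "win a prize"),
    ("ham", "see you now"),
]

# One Counter of word frequencies per class
toy_class_word_freq = defaultdict(Counter)
for label, text in toy_docs:
    toy_class_word_freq[label].update(text.split())

print(dict(toy_class_word_freq["spam"]))
print(dict(toy_class_word_freq["ham"]))
```

A `Counter` returns 0 for words it has never seen, so `toy_class_word_freq["ham"]["win"]` is 0 rather than a `KeyError`.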

## Count the total corpus words

Calculate V, the number of unique words (the vocabulary size) in the training corpus:

In [40]:
# Your code here
vocabulary = set()
for text in train_df['text']:
    for word in text.split():
        vocabulary.add(word)
V = len(vocabulary)
V

5997
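
Note that the code above counts distinct words, not word occurrences. A toy corpus (made up for illustration) shows the difference:

```python
# Hypothetical toy corpus of three short documents
toy_corpus = ["win cash now", "win a prize now", "see you now"]

all_tokens = [word for doc in toy_corpus for word in doc.split()]
toy_vocabulary = set(all_tokens)

total_tokens = len(all_tokens)   # every occurrence counts
toy_V = len(toy_vocabulary)      # each distinct word counts once

print(total_tokens, toy_V)
```

Here there are 10 tokens but only 7 distinct words; it is the second number that plays the role of V in Laplace smoothing.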

## Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [41]:
# Your code here
def bag_it(doc):
    bag = {}
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1
    return bag
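
As a quick illustration of what `bag_it()` returns, and of why extra preprocessing can matter, the sketch below compares it with a normalized variant. `bag_it_normalized` is a hypothetical helper, not part of the lab; it lowercases each token and strips surrounding punctuation.

```python
import string


def bag_it(doc):
    """Same helper as in the lab: count whitespace-separated tokens."""
    bag = {}
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1
    return bag


def bag_it_normalized(doc):
    """Hypothetical variant: lowercase and strip surrounding punctuation first."""
    bag = {}
    for word in doc.split():
        token = word.lower().strip(string.punctuation)
        if token:  # skip tokens that were punctuation only
            bag[token] = bag.get(token, 0) + 1
    return bag


sample = "Free entry! Text WIN to win a FREE prize."
raw_bag = bag_it(sample)
clean_bag = bag_it_normalized(sample)

print(raw_bag)
print(clean_bag)
```

The raw bag treats `Free` and `FREE` (and `WIN` and `win`) as different words, while the normalized bag merges them.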

## Implementing Naive Bayes

Now, implement a master function to build a Naive Bayes classifier. Be sure to sum log probabilities rather than multiply raw probabilities, to avoid floating-point underflow.
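
For reference, one common way to write the decision rule for multinomial Naive Bayes with Laplace (add-one) smoothing is shown below. The symbols are assumptions introduced here for illustration: $n_{d,w}$ is the number of times word $w$ occurs in document $d$, $\operatorname{count}(w, c)$ is the number of times $w$ occurs in training documents of class $c$, $N_c$ is the total number of word occurrences in class $c$, and $V$ is the vocabulary size.

$$
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{w \in d} n_{d,w} \, \log \frac{\operatorname{count}(w, c) + 1}{N_c + V} \right]
$$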

In [42]:
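# Your code here
import numpy as np

def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    bag = bag_it(doc)
    classes = []
    posteriors = []
    for class_ in class_word_freq.keys():
        # Start from the log prior; summing logs avoids floating-point underflow
        p = np.log(p_classes[class_])
        # Total number of word occurrences seen for this class in training
        n_class = sum(class_word_freq[class_].values())
        for word in bag.keys():
            # Laplace-smoothed estimate of P(word | class)
            num = class_word_freq[class_].get(word, 0) + 1
            denom = n_class + V
            # Weight by how many times the word appears in the document
            p += bag[word] * np.log(num / denom)
        classes.append(class_)
        posteriors.append(p)
    if return_posteriors:
        print(posteriors)
    # Return the class with the highest log posterior
    return classes[np.argmax(posteriors)]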
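
To see why log probabilities matter, consider multiplying many small per-word probabilities together. The numbers below are made up for illustration: 400 words, each with probability 0.01.

```python
import math

# Hypothetical per-word probabilities for a long document
word_probs = [0.01] * 400

# Multiplying raw probabilities: 0.01 ** 400 = 1e-800,
# far below the smallest positive float, so the result collapses to 0.0
product = 1.0
for p in word_probs:
    product *= p

# Summing logs keeps the value well within floating-point range
log_sum = sum(math.log(p) for p in word_probs)

print(product)
print(log_sum)
```

Once every class's product has collapsed to 0.0, the classes can no longer be compared, whereas the log sums remain distinct and comparable.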

## Test your classifier

Finally, test your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [59]:
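# Your code here

# Classify the message text (X_train), not the labels (y_train).
# The output recorded below came from mapping over y_train, which gives chance-level
# accuracy; rerun this cell to get the accuracy on the message text.
y_train_hat = X_train.map(lambda x: classify_doc(x, class_word_freq, p_classes, V, return_posteriors=False))
# True where the prediction matches the label, so the share of True is the accuracy
residuals = y_train == y_train_hat
residuals.value_counts(normalize=True)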

False    0.502679
True     0.497321
Name: label, dtype: float64

In [61]:
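# Classify the message text (X_test), not the labels (y_test).
# The output recorded below came from mapping over y_test; rerun this cell for the test accuracy.
y_test_hat = X_test.map(lambda x: classify_doc(x, class_word_freq, p_classes, V, return_posteriors=False))
residuals = y_test == y_test_hat
residuals.value_counts(normalize=True)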

True     0.508021
False    0.491979
Name: label, dtype: float64
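
Overall accuracy hides which kind of mistake the classifier makes. A small sketch, using made-up labels and predictions (`toy_true`, `toy_pred`), shows how to tally accuracy alongside a simple confusion count:

```python
from collections import Counter

# Hypothetical true labels and predictions
toy_true = ["spam", "ham", "spam", "ham", "ham", "spam"]
toy_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]

# Fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

# Count each (true label, predicted label) pair
confusion = Counter(zip(toy_true, toy_pred))

print(accuracy)
for (true_label, pred_label), n in sorted(confusion.items()):
    print(f"true={true_label:4s} predicted={pred_label:4s} count={n}")
```

For a spam filter, a ham message predicted as spam is usually the costlier error, so it is worth looking at that cell separately.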

## Level up (Optional)

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.
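
One possible shape for such a class is sketched below. It is an illustrative starting point under stated assumptions, not a reference solution: the class name `NaiveBayesTextClassifier`, its method names, and the toy training data are all hypothetical, and tokenization is plain whitespace splitting as in the rest of the lab.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Minimal multinomial Naive Bayes for whitespace-tokenized text (sketch)."""

    def fit(self, docs, labels):
        # Word frequencies per class
        self.class_word_freq = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.class_word_freq[label].update(doc.split())
        # Log priors from the class proportions in the training labels
        class_counts = Counter(labels)
        n_docs = sum(class_counts.values())
        self.log_priors = {c: math.log(n / n_docs) for c, n in class_counts.items()}
        # Total word occurrences per class and overall vocabulary size
        self.class_totals = {c: sum(f.values()) for c, f in self.class_word_freq.items()}
        self.V = len({w for f in self.class_word_freq.values() for w in f})
        return self

    def log_posteriors(self, doc):
        bag = Counter(doc.split())
        scores = {}
        for c, freq in self.class_word_freq.items():
            score = self.log_priors[c]
            denom = self.class_totals[c] + self.V
            for word, n in bag.items():
                # Laplace-smoothed log likelihood, weighted by the in-document count
                score += n * math.log((freq.get(word, 0) + 1) / denom)
            scores[c] = score
        return scores

    def predict(self, docs):
        # Pick the class with the highest log posterior for each document
        return [max(self.log_posteriors(d).items(), key=lambda kv: kv[1])[0] for d in docs]


# Toy usage example with made-up messages
train_docs = ["win free cash now", "free prize win", "see you at lunch", "are you home now"]
train_labels = ["spam", "spam", "ham", "ham"]

clf = NaiveBayesTextClassifier().fit(train_docs, train_labels)
predictions = clf.predict(["free cash prize", "see you at home"])
print(predictions)
```

Keeping `fit` and `predict` separate mirrors the scikit-learn convention, which makes it easier to swap this class in for a library estimator later.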

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!