# Document Classification with Naive Bayes - Lab

## Introduction

In this lab, you'll practice implementing the naive Bayes algorithm on your own.

## Objectives

You will be able to:  

* Implement document classification using naive Bayes
* Understand the need for the Laplacian smoothing correction
* Explain how to code a bag of words representation

## Import the Dataset

To start, import the dataset stored in the text file `SMSSpamCollection`.

In [12]:
import pandas as pd

In [35]:
data = pd.read_csv('SMSSpamCollection', sep='\t', names=['class', 'text'])
data['class'].value_counts()

ham     4825
spam     747
Name: class, dtype: int64

## Account for Class Imbalance

To keep the classifier from being biased toward the majority class, subset the dataset so that the two classes are of equal size. Keep every instance of the minority class (spam) and select an equal number of instances from the majority class (ham).

In [71]:
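# Keep every spam message and an equal number of ham messages
spam = data[data['class'] == 'spam']
ham = data[data['class'] == 'ham']
ham = ham[:len(spam)]  # the first len(spam) ham messages (747 here)
ham = ham.reset_index(drop=True)
spam = spam.reset_index(drop=True)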

In [73]:
ham['text'][2].split()

['U',
 'dun',
 'say',
 'so',
 'early',
 'hor...',
 'U',
 'c',
 'already',
 'then',
 'say...']

## Train - Test Split

Now implement a train-test split on your dataset.

In [87]:
from sklearn.model_selection import train_test_split
df = pd.concat([spam, ham], axis=0)

X = df['text']
y = df['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)
train_df = pd.concat([X_train, y_train], axis=1) 
test_df = pd.concat([X_test, y_test], axis=1)

In [89]:
test_df.head()

| | text | class |
|---|---|---|
| 703 | Call Germany for only 1 pence per minute! Call... | spam |
| 502 | Tell them u have a headache and just want to u... | ham |
| 368 | Urgent Urgent! We have 800 FREE flights to Eur... | spam |
| 198 | Bears Pic Nick, and Tom, Pete and ... Dick. In... | spam |
| 336 | I don't know u and u don't know me. Send CHAT ... | spam |


## Create the Word Frequency Dictionary for Each Class

Using only the training documents, create a word frequency dictionary for each class.

In [90]:
# Build one word-frequency dictionary per class from the training data only,
# so that no test documents leak into the model
class_word_freq = {}  # nested: {class_1: {word_1: freq, ...}, ..., class_m: {...}}
for class_ in train_df['class'].unique():
    bag = {}
    for doc in train_df.loc[train_df['class'] == class_, 'text']:
        for word in doc.split():
            bag[word] = bag.get(word, 0) + 1
    class_word_freq[class_] = bag
    

In [97]:
p_classes = dict(df['class'].value_counts(normalize=True))
p_classes

{'spam': 0.5, 'ham': 0.5}

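## Count the Unique Corpus Words

Calculate V, the number of unique words (the vocabulary size) in the training corpus. Laplacian smoothing adds V to the denominator of each word probability.

In [126]:
# Collect every distinct word that appears in the training documents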
vocabulary = set()
for text in train_df.text:
    for word in text.split():
        vocabulary.add(word)
V = len(vocabulary)
V

5926
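To see why V matters, recall that Laplacian (add-one) smoothing estimates P(word | class) as (count of the word in the class + 1) / (total words in the class + V). The sketch below is only an illustration: `toy_counts` and `toy_V` are made-up values, not taken from the SMS data. It shows that an unseen word gets a probability of zero without smoothing, which has no logarithm, and a small positive probability with it.

```python
import math

# Hypothetical word counts for a single class (illustrative values only)
toy_counts = {"free": 3, "call": 2, "now": 1}
toy_V = 5  # assumed vocabulary size across all classes
total = sum(toy_counts.values())  # 6 word occurrences in this class

def unsmoothed(word):
    # Raw relative frequency: zero for any word never seen in this class
    return toy_counts.get(word, 0) / total

def smoothed(word):
    # Laplacian (add-one) smoothing: never zero
    return (toy_counts.get(word, 0) + 1) / (total + toy_V)

print(unsmoothed("prize"))          # 0.0, so math.log() would raise ValueError
print(smoothed("prize"))            # 1/11, about 0.0909
print(math.log(smoothed("prize")))  # a finite log probability
```

A single zero would otherwise wipe out the evidence from every other word in the document, which is why the correction is applied to all words, seen or unseen.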

## Create a Bag of Words Function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [93]:
def bag_it(doc):
    """Return a dictionary mapping each word in doc to its count."""
    bag = {}
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1
    return bag
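As a point of comparison, the same counts can be produced with `collections.Counter` from the standard library. The name `bag_it_counter` below is hypothetical and is not used elsewhere in this lab; this is a sketch of one alternative, not a required change.

```python
from collections import Counter

def bag_it_counter(doc):
    """Hypothetical alternative to bag_it() built on collections.Counter."""
    # Split on whitespace, exactly as bag_it() does, then count the tokens
    return Counter(doc.split())

bag = bag_it_counter("U dun say so early hor... U c already then say...")
print(bag["U"])       # 2
print(bag["say"])     # 1
print(bag["say..."])  # 1
```

Note that `say` and `say...` are counted as different words because splitting on whitespace keeps punctuation attached. This is one example of the preprocessing a production pipeline would need to handle.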

## Implementing Naive Bayes

Now, implement a master function to build a naive Bayes classifier. Be sure to sum log probabilities rather than multiply raw probabilities, to avoid numerical underflow.
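The underflow problem is easy to reproduce. The following sketch multiplies 100 made-up word probabilities (the value `1e-5` is an arbitrary assumption) and compares the result with the sum of their logarithms.

```python
import math

# 100 hypothetical per-word probabilities, each equal to 1e-5
probs = [1e-5] * 100

# Multiplying raw probabilities: the true value, 1e-500, is far too small
# for a 64-bit float, so the running product collapses to exactly 0.0
product = 1.0
for p in probs:
    product *= p

# Summing log probabilities keeps the same information in a safe range
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # roughly -1151.3
```

Because the logarithm is monotonic, the class with the largest sum of log probabilities is also the class with the largest product, so the predicted class does not change.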

In [101]:
import numpy as np

In [116]:
def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    # Bag-of-words counts for the document being classified
    bag = bag_it(doc)

    classes = []
    posteriors = []

    for class_ in class_word_freq.keys():
        # Start from the log prior, log P(class)
        p = np.log(p_classes[class_])
        # Total number of word occurrences observed for this class
        class_total = sum(class_word_freq[class_].values())

        for word, count in bag.items():
            # Laplace-smoothed P(word | class):
            # (count of word in class + 1) / (total words in class + V)
            num = class_word_freq[class_].get(word, 0) + 1
            denom = class_total + V
            # A word appearing `count` times contributes its log probability `count` times
            p += count * np.log(num / denom)
        classes.append(class_)
        posteriors.append(p)

    if return_posteriors:
        print(posteriors)

    # Predict the class with the largest (unnormalized) log posterior
    return classes[np.argmax(posteriors)]

## Test Out Your Classifier

Finally, test out your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [123]:
# Classify a single training document and print the log posterior of each class
classify_doc(train_df.iloc[5]['text'], class_word_freq, p_classes, V, return_posteriors=True)

In [125]:
# Fraction of training documents classified correctly (True) and incorrectly (False)
y_hat_train = X_train.map(lambda x: classify_doc(x, class_word_freq, p_classes, V))
correct = y_train == y_hat_train
correct.value_counts(normalize=True)

## Level-Up

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.
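One possible starting point is sketched below. The class name `NaiveBayesTextClassifier` and its methods are hypothetical choices for illustration, the toy documents are made up, and the sketch assumes `docs` and `labels` are equal-length lists of strings. It uses only the standard library, so it can be adapted to pandas inputs as needed.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with Laplacian smoothing (illustrative sketch)."""

    def fit(self, docs, labels):
        # Word counts per class: {class: Counter({word: freq, ...})}
        self.class_word_freq = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.class_word_freq[label].update(doc.split())
        # Log priors from the class proportions in the training labels
        class_counts = Counter(labels)
        n_docs = sum(class_counts.values())
        self.log_priors = {c: math.log(n / n_docs) for c, n in class_counts.items()}
        # Total word occurrences per class and overall vocabulary size V
        self.class_totals = {c: sum(bag.values()) for c, bag in self.class_word_freq.items()}
        self.V = len({w for bag in self.class_word_freq.values() for w in bag})
        return self

    def log_posteriors(self, doc):
        bag = Counter(doc.split())
        scores = {}
        for c, log_prior in self.log_priors.items():
            denom = self.class_totals[c] + self.V
            score = log_prior
            for word, count in bag.items():
                # Counter returns 0 for unseen words, so the +1 handles smoothing
                score += count * math.log((self.class_word_freq[c][word] + 1) / denom)
            scores[c] = score
        return scores

    def predict(self, doc):
        scores = self.log_posteriors(doc)
        return max(scores, key=scores.get)

# Toy usage with made-up documents
docs = ["win free prize now", "free cash win", "see you at lunch", "lunch at noon then"]
labels = ["spam", "spam", "ham", "ham"]
clf = NaiveBayesTextClassifier().fit(docs, labels)
print(clf.predict("free prize"))     # spam
print(clf.predict("lunch at noon"))  # ham
```

Keeping the fitted counts on the instance means the same object can be trained on any labeled text dataset and reused for prediction without passing dictionaries around.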

## Summary

Well done! In this lab, you practiced implementing naive Bayes for document classification!