# Document Classification with Naive Bayes - Lab

## Introduction

In this lab, you'll practice implementing the Naive Bayes algorithm for document classification on your own.

## Objectives

In this lab you will:  

* Implement document classification using Naive Bayes

## Import the dataset

To start, import the dataset stored in the text file `'SMSSpamCollection'`.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None)  # the file is tab-separated
df.columns = ['label', 'text']
df

|  | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


## Account for class imbalance

To keep the majority class from dominating the classifier, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and randomly sample an equal number of examples from the majority class (ham).

In [3]:
df.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [4]:
spam_df = df[df.label == 'spam']

In [5]:
ham_df = df[df.label == 'ham']

In [6]:
balanced_df = pd.concat([spam_df, ham_df.sample(len(spam_df), random_state=17)])
balanced_df.label.value_counts()

ham     747
spam    747
Name: label, dtype: int64

In [7]:
p_classes = dict(balanced_df['label'].value_counts(normalize=True))
p_classes

{'ham': 0.5, 'spam': 0.5}

## Train-test split

Now implement a train-test split on the dataset: 

In [8]:
from sklearn.model_selection import train_test_split
X = balanced_df.text
y = balanced_df.label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)

## Create the word frequency dictionary for each class

Create a word frequency dictionary for each class: 

In [9]:
word_frequency_dict = {}
for class_ in train_df.label.unique():
    freq_dict_for_class = {}
    for text in train_df[train_df.label == class_]['text']:
        for word in text.split():
            freq_dict_for_class[word] = freq_dict_for_class.get(word, 0) + 1
    word_frequency_dict[class_] = freq_dict_for_class
word_frequency_dict

{'ham': {'Lol': 8,
  'I': 184,
  'was': 25,
  'gonna': 6,
  'last': 8,
  'month.': 2,
  'cashed': 1,
  'some': 12,
  'in': 106,
  'but': 25,
  'left': 2,
  '&lt;#&gt;': 31,
  'just': 28,
  'case.': 1,
  'collecting': 2,
  'more': 8,
  'during': 2,
  'the': 128,
  'week': 4,
  'cause': 2,
  'they': 7,
  'announced': 1,
  'it': 39,
  'on': 40,
  'blog.': 1,
  "I'll": 16,
  'probably': 4,
  'be': 29,
  'by': 13,
  'tomorrow': 5,
  '(or': 1,
  'even': 6,
  'later': 12,
  'tonight': 6,
  'if': 23,
  "something's": 1,
  'going': 21,
  'on)': 1,
  'WHITE': 1,
  'FUDGE': 1,
  'OREOS': 1,
  'ARE': 1,
  'IN': 3,
  'STORES': 1,
  'Gudnite....tc...practice': 2,
  'Is': 7,
  'there': 13,
  'coming': 6,
  'friday': 2,
  'is': 66,
  'leave': 11,
  'for': 59,
  'pongal?do': 1,
  'you': 172,
  'get': 34,
  'any': 10,
  'news': 1,
  'from': 14,
  'your': 46,
  'work': 5,
  'place.': 2,
  'S....s...india': 1,
  'to': 168,
  'draw': 1,
  'series': 1,
  'after': 12,
  'many': 5,
  'years': 2,
  'south': 2,
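
The keys in the output above show that splitting on whitespace keeps case and punctuation, so `'IN'` and `'in'` are counted as different words and `'month.'` would not match `'month'`. Below is a minimal sketch of a normalizing tokenizer, assuming that lowercasing and dropping punctuation is acceptable for this task. `tokenize` is a hypothetical helper and is not part of the lab's code.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of word characters and apostrophes."""
    # \w matches letters (including non-ASCII ones such as 'ü'), digits, and underscore
    return re.findall(r"[\w']+", text.lower())

print(tokenize("Lol I'll be there by tomorrow. IN STORES"))
# ['lol', "i'll", 'be', 'there', 'by', 'tomorrow', 'in', 'stores']
```

Swapping `text.split()` for a helper like this in the frequency loop would merge such variants, at the cost of discarding capitalization, which can itself be a useful spam signal.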

## Count the unique words in the corpus

Calculate V, the set of unique words (the vocabulary) in the training corpus. Its size, `len(V)`, is what the classifier uses for smoothing:

In [10]:
V = set()
for text in train_df.text:
    for word in text.split():
        V.add(word)
len(V)

5955

## Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [11]:
def bag_it(text):
    bag = {}
    for word in text.split():
        bag[word] = bag.get(word, 0) + 1
    return bag
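
As an alternative, the same bag of words can be built with `collections.Counter` from the standard library. This is only a sketch of an equivalent helper. The name `bag_it_counter` is made up for illustration.

```python
from collections import Counter

def bag_it_counter(text):
    """Return word counts for a whitespace-split document."""
    return Counter(text.split())

bag = bag_it_counter("free entry free prize")
print(bag["free"], bag["missing"])  # 2 0 (words that never occur count as 0)
```

A `Counter` is a `dict` subclass, so it can be passed anywhere the plain dictionary from `bag_it()` is expected.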

## Implementing Naive Bayes

Now, implement a function that classifies a document with Naive Bayes. Be sure to sum log probabilities instead of multiplying raw probabilities, which avoids floating-point underflow.
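
To see why logs matter, here is a small illustration with made-up probabilities (100 words, each with probability 1e-5). The numbers are arbitrary and chosen only to trigger underflow.

```python
import math

probs = [1e-5] * 100  # hypothetical per-word probabilities

# Multiplying raw probabilities underflows to exactly 0.0
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps a finite, comparable score
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # a large negative number, roughly -1151.3
```

Once every class score has collapsed to 0.0, the classes can no longer be ranked, whereas log scores remain comparable.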

In [12]:
import numpy as np

def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    """
    Classify a document using Naive Bayes
    
    Parameters:
    doc: string to classify
    class_word_freq: dictionary of word frequencies for each class
    p_classes: dictionary of the prior probability of each class
    V: number of unique words in the corpus vocabulary
    return_posteriors: whether to print the list of log posteriors
    
    Returns:
    The predicted class label
    """
    
    classes = [] # class labels, kept in the same order as posteriors
    posteriors = [] # unnormalized log posterior for each class
    
    bag = bag_it(doc)
    
    for class_ in class_word_freq.keys():
        # get P(class)
        p = np.log(p_classes[class_]) # take log to avoid underflow
        # get the conditional log probabilities for P(word|class) for all the words
        # using log probabilities to avoid underflow
        for word in bag.keys():
            # NOTE: as written, the numerator uses the document's own word count and the
            # denominator omits the class's total word count; the usual add-one estimate
            # is (count(word, class) + 1) / (total words in class + V)
            numerator = bag[word] + 1
            denominator = class_word_freq[class_].get(word, 0) + V
            p += np.log(numerator / denominator)
        classes.append(class_)
        posteriors.append(p)
    if return_posteriors:
        print(posteriors)
    return classes[np.argmax(posteriors)]
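
For comparison, the textbook multinomial Naive Bayes decision rule with add-one (Laplace) smoothing can be written as follows. This is the standard form, stated here as a reference point. The symbols $n_d(w)$, $\operatorname{count}(w, c)$, and $N_c$ are notation introduced for this note and do not appear in the lab's code.

$$
\hat{c} = \arg\max_{c}\left[\log P(c) + \sum_{w \in d} n_d(w)\,\log\frac{\operatorname{count}(w, c) + 1}{N_c + V}\right]
$$

Here $n_d(w)$ is the number of times word $w$ occurs in the document $d$ (the value stored in the bag), $\operatorname{count}(w, c)$ is how often $w$ occurs in the training documents of class $c$, $N_c$ is the total number of words in those documents, and $V$ is the vocabulary size. In `classify_doc()` above, the fraction is instead `(bag[word] + 1) / (class_word_freq[class_].get(word, 0) + V)`, so a word that is more frequent in a class lowers that class's score.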
    

## Test your classifier

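Finally, test your classifier and measure its accuracy on both the training and test sets. Don't be perturbed if your results are sub-par; real-world use cases would require substantial additional preprocessing before applying the algorithm in practice. Keep in mind that with two balanced classes, random guessing scores about 0.5, so an accuracy far below that points to a systematic error in the class scores rather than to missing preprocessing.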

In [13]:
y_hat_train = []
for text in X_train:
    y_hat_train.append(classify_doc(text, word_frequency_dict, p_classes, len(V), False))

residuals = (y_hat_train == y_train)
accuracy = sum(residuals) / len(residuals)
print(accuracy)

0.23482142857142857


In [14]:
y_hat_test = []
for text in X_test:
    y_hat_test.append(classify_doc(text, word_frequency_dict, p_classes, len(V), False))

residuals = (y_hat_test == y_test)
accuracy = sum(residuals) / len(residuals)
print(accuracy)

0.2647058823529412
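
A single accuracy number hides which class is being misclassified. Below is a minimal sketch that counts (true label, predicted label) pairs. The labels are toy values for illustration only and are not results from this lab, and `confusion_counts` is a hypothetical helper.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

# Toy labels, not taken from the lab's data
y_true = ["spam", "ham", "spam", "ham"]
y_pred = ["spam", "spam", "ham", "ham"]

counts = confusion_counts(y_true, y_pred)
accuracy = sum(n for (t, p), n in counts.items() if t == p) / len(y_true)
print(counts[("ham", "spam")], accuracy)  # 1 0.5
```

The same helper should accept `y_test` and `y_hat_test` directly, since iterating over a pandas Series yields its values.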


## Level up (Optional)

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.
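
One possible shape for such a class is sketched below. It is not the lab's official solution. The class name, method names, and toy messages are all assumptions made for illustration, and the likelihood uses the standard add-one smoothed estimate, (count of the word in the class + 1) / (total words in the class + vocabulary size), which differs from the fraction used in `classify_doc()`.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative sketch)."""

    def fit(self, docs, labels):
        labels = list(labels)
        self.class_counts = Counter(labels)      # class -> number of documents
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        # Vocabulary: every word seen in any class during training
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        # Total number of word occurrences per class
        self.total_words = {c: sum(counts.values())
                            for c, counts in self.word_counts.items()}
        self.n_docs = len(labels)
        return self

    def log_posteriors(self, doc):
        """Return the unnormalized log posterior of each class for one document."""
        bag = Counter(doc.split())
        scores = {}
        for c in self.class_counts:
            score = math.log(self.class_counts[c] / self.n_docs)  # log prior
            denom = self.total_words[c] + len(self.vocab)
            for word, n in bag.items():
                # add-one smoothed log likelihood, weighted by the word's count in doc
                score += n * math.log((self.word_counts[c][word] + 1) / denom)
            scores[c] = score
        return scores

    def predict(self, doc):
        scores = self.log_posteriors(doc)
        return max(scores, key=scores.get)

# Toy training data, made up for this example
train_docs = ["win free prize now", "free entry win cash",
              "see you at lunch", "are you coming home"]
train_labels = ["spam", "spam", "ham", "ham"]

clf = NaiveBayesTextClassifier().fit(train_docs, train_labels)
print(clf.predict("free cash prize"))      # spam
print(clf.predict("you coming at lunch"))  # ham
```

Because `fit()` only iterates over its inputs, it should also accept `X_train` and `y_train` from the split above.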

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!