# Document Classification with Naive Bayes - Lab

## Introduction

In this lecture, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

You will be able to:  

* Implement document classification using Naive Bayes
* Understand the need for the Laplacian smoothing correction
* Explain how to code a bag of words representation

In [137]:
import pandas as pd
import numpy as np


## Import the Dataset

To start, import the dataset stored in the text file `SMSSpamCollection`.

In [138]:
#Your code here
df = pd.read_csv('SMSSpamCollection', sep="	", header=None)
df.columns = ['class', 'text']
df.head()

Unnamed: 0,class,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [139]:
len(df)

5572

## Account for Class Imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and subset examples of the majority class (ham) to an equal number of examples.

In [140]:
#Your code here
# Identify length of each class

num_spam = len(df.loc[df['class']=='spam'])
num_ham = len(df.loc[df['class']=='ham'])

print(f"Number of Spam: {num_spam}, Number of Ham: {num_ham}")

Number of Spam: 747, Number of Ham: 4825


In [141]:
# Separate the dataframes

df_spam = df.loc[df['class']=='spam']
df_ham = df.loc[df['class']=='ham']

# Select 747 random rows from the ham dataframe

df_ham_reduced = df_ham.sample(n=747, random_state=20)
df_ham_reduced.shape

(747, 2)

In [142]:
#concatenate the two dataframes

df_complete = pd.concat([df_spam, df_ham_reduced], ignore_index=True)
df_complete.head()
df_complete.shape

(1494, 2)

## Train - Test Split

Now implement a train test split on your dataset.

In [144]:
from sklearn.model_selection import train_test_split

X = df_complete['text']
y = df_complete['class']

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=15)

train_df = pd.concat([y_train, X_train], axis=1)
test_df = pd.concat([y_test, X_test], axis=1)

## Create the Word Frequency Dictionary for Each Class

Create a word frequency dictionary for each class.

In [145]:
train_df.shape

(1120, 2)

In [146]:
word_freq = {}
classes = ['spam', 'ham']

for class_ in classes:
    temp_df = train_df[train_df['class']==class_]
    bag = {}
    for row in temp_df.index:
        doc = temp_df['text'][row]
        for word in doc.split():
            bag[word] = bag.get(word, 0) + 1
    
    word_freq[class_] = bag


In [147]:
word_freq

{'spam': {'How': 4,
  'about': 6,
  'getting': 2,
  'in': 52,
  'touch': 1,
  'with': 82,
  'folks': 1,
  'waiting': 13,
  'for': 132,
  'company?': 1,
  'Just': 21,
  'txt': 46,
  'back': 15,
  'your': 138,
  'NAME': 4,
  'and': 81,
  'AGE': 6,
  'to': 463,
  'opt': 9,
  'in!': 1,
  'Enjoy': 3,
  'the': 132,
  'community': 1,
  '(150p/SMS)': 1,
  'YOUR': 4,
  'CHANCE': 3,
  'TO': 3,
  'BE': 2,
  'ON': 5,
  'A': 11,
  'REALITY': 2,
  'FANTASY': 2,
  'SHOW': 2,
  'call': 137,
  'now': 41,
  '=': 5,
  '08707509020': 4,
  '20p': 5,
  'per': 35,
  'min': 8,
  'NTT': 7,
  'Ltd,': 6,
  'PO': 24,
  'Box': 19,
  '1327': 5,
  'Croydon': 5,
  'CR9': 5,
  '5WB': 5,
  '0870': 5,
  'is': 121,
  'a': 279,
  'national': 6,
  'rate': 7,
  'call.': 1,
  'Mila,': 2,
  'age23,': 2,
  'blonde,': 2,
  'new': 31,
  'UK.': 4,
  'I': 21,
  'look': 2,
  'sex': 3,
  'UK': 7,
  'guys.': 3,
  'if': 14,
  'u': 34,
  'like': 11,
  'fun': 5,
  'me.': 4,
  'Text': 33,
  'MTALK': 2,
  '69866.18': 2,
  '.': 8,
  '30pp/

## Count the Total Corpus Words
Calculate V, the total number of words in the corpus.

In [148]:
#Your code here

vocabulary = []

for text in train_df['text']:
    for word in text.split():
        vocabulary.append(word)
    
V = len(set(vocabulary))
V
        

6124

## Create a Bag of Words Function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [149]:
#Your code here

def bag_it(doc):
    bag = {}
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1
    return bag

## Implementing Naive Bayes

Now, implement a master function to build a naive Bayes classifier. Be sure to use the logarithmic probabilities to avoid underflow.

In [150]:
prob_class = dict(train_df['class'].value_counts(normalize=True))
prob_class

{'spam': 0.5071428571428571, 'ham': 0.4928571428571429}

In [151]:
#Your code here

classes = []
posteriors = []

def email_classifier(doc, V, word_freq, prob_class):
    bag = bag_it(doc)
    for class_ in ['spam', 'ham']:
        p = np.log(prob_class[class_])
        for word in bag.keys():
            num = bag[word] + 1
            denom = word_freq[class_].get(word, 0) + V
            p += np.log(num/denom)
        classes.append(class_)
        posteriors.append(p)
    return classes[np.argmax(posteriors)]
    

In [152]:
test_df.reset_index(inplace=True)
test_df.head()

Unnamed: 0,index,class,text
0,1289,ham,"ER, ENJOYIN INDIANS AT THE MO..yeP. SaLL gOoD ..."
1,38,spam,We tried to contact you re your reply to our o...
2,1074,ham,Ok not a problem will get them a taxi. C ing ...
3,853,ham,ARE YOU IN TOWN? THIS IS V. IMPORTANT
4,696,spam,FREE for 1st week! No1 Nokia tone 4 ur mobile ...


In [153]:
np.log(prob_class['spam'])

-0.678962545567989

In [154]:
email_classifier(test_df.loc[6]['text'], V, word_freq, prob_class)

'ham'

## Test Out Your Classifier

Finally, test out your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [155]:
#Your Code here
y_hat_train = X_train.map(lambda x: email_classifier(x, V, word_freq, prob_class))
residuals = y_train == y_hat_train
residuals.value_counts(normalize=True)

True     0.505357
False    0.494643
dtype: float64

## Level-Up

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!