# Document Classification with Naive Bayes - Lab

## Introduction

In this lab, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

In this lab you will:  

* Implement document classification using Naive Bayes

## Import the dataset

To start, import the dataset stored in the text file `'SMSSpamCollection'`.

In [25]:
# Your code here
import pandas as pd
import numpy as np

df = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'text'])
df.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

## Account for class imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and subset examples of the majority class (ham) to an equal number of examples.

In [34]:
# Your code here
# Keep every spam message and draw an equally sized random sample of ham
minority = df[df['label'] == 'spam']
majority = df[df['label'] == 'ham']
adj_maj = majority.sample(n=len(minority), random_state=10)
df2 = pd.concat([minority, adj_maj])

In [35]:
# Prior probability of each class (0.5 each, since the classes are now balanced)
p_classes = dict(df2['label'].value_counts(normalize=True))

## Train-test split

Now implement a train-test split on the dataset: 

In [36]:
# Your code here
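from sklearn.model_selection import train_test_split

# Features are the message texts; targets are the spam/ham labels
X = pd.concat([minority.text, adj_maj.text])
y = pd.concat([minority.label, adj_maj.label])
# Hold out 30% of the balanced data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)
# Recombine texts and labels so each split can be handled as a single DataFrame
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)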

In [37]:
test_df

| | text | label |
|---|---|---|
| 3780 | Claim a 200 shopping spree, just call 08717895... | spam |
| 2827 | Ok lor... | ham |
| 4222 | Plz note: if anyone calling from a mobile Co. ... | ham |
| 4834 | New Mobiles from 2004, MUST GO! Txt: NOKIA to ... | spam |
| 4778 | Sorry completely forgot * will pop em round th... | ham |
| ... | ... | ... |
| 3017 | &lt;#&gt; is fast approaching. So, Wish u a v... | ham |
| 5004 | CDs 4u: Congratulations ur awarded £500 of CD ... | spam |
| 3807 | URGENT! We are trying to contact you. Last wee... | spam |
| 3506 | life alle mone,eppolum oru pole allalo | ham |


## Create the word frequency dictionary for each class

Create a word frequency dictionary for each class: 

In [38]:
# Your code here
class_word_freq = {}
classes = train_df['label'].unique()
for class_ in classes:
    # Keep only the training documents that belong to this class
    temp_df = train_df[train_df['label'] == class_]
    bag = {}
    for doc in temp_df['text']:
        # Count each whitespace-separated token
        for word in doc.split():
            bag[word] = bag.get(word, 0) + 1
    class_word_freq[class_] = bag

## Count the total corpus words
Calculate V, the number of unique words (the vocabulary size) in the training corpus:

In [39]:
# Your code here
vocabulary = set()
for text in train_df['text']:
    for word in text.split():
        vocabulary.add(word)
V = len(vocabulary)
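
`V` is needed for Laplace (add-one) smoothing in the classifier. As a sketch of the standard multinomial formulation, with $\text{count}(w, c)$ denoting how often word $w$ appears in training documents of class $c$ and $N_c$ the total number of words in class $c$ (both symbols are introduced here for illustration), the smoothed likelihood is:

$$
P(w \mid c) = \frac{\text{count}(w, c) + 1}{N_c + V}
$$

Adding 1 to every count keeps a word that never appeared in a class from driving that class's probability to zero.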

## Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [40]:
# Your code here
def bag_it(doc):
    bag = {}
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1
    return bag
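
For comparison, the same bag of words can be built with `collections.Counter` from the standard library. This is only an alternative sketch; `bag_it_counter` is a hypothetical name, not part of the lab.

```python
from collections import Counter

def bag_it_counter(doc):
    """Count whitespace-separated tokens in doc, like bag_it() above."""
    # Counter is a dict subclass, so .get(word, 0) lookups work unchanged
    return Counter(doc.split())

bag = bag_it_counter("free prize call now free")
print(bag["free"], bag["prize"], bag.get("missing", 0))  # prints: 2 1 0
```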

## Implementing Naive Bayes

Now implement a main function that classifies a document with Naive Bayes. Be sure to use log probabilities to avoid numerical underflow.
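
To see why logarithms matter, consider a minimal sketch that multiplies many small probabilities. The value `1e-4` and the count of 100 words are made-up numbers chosen only to trigger underflow.

```python
import math

word_prob = 1e-4   # hypothetical per-word likelihood
n_words = 100      # hypothetical document length

# Multiplying directly: 1e-400 is below the smallest positive float64,
# so the running product collapses to 0.0
product = 1.0
for _ in range(n_words):
    product *= word_prob

# Summing logs carries the same information as a finite number
log_sum = sum(math.log(word_prob) for _ in range(n_words))

print(product)   # prints: 0.0
print(log_sum)   # roughly -921.03
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest probability.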

In [41]:
# Your code here
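def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    """Return the most probable class for doc using Laplace-smoothed log probabilities."""
    bag = bag_it(doc)
    classes = []
    posteriors = []
    for class_ in class_word_freq.keys():
        # Start from the log prior, then add the log likelihood of each word
        p = np.log(p_classes[class_])
        # Total number of word tokens seen in this class during training
        class_total = sum(class_word_freq[class_].values())
        for word, count in bag.items():
            # Laplace smoothing: add 1 to the word's class count and V to the class total
            numerator = class_word_freq[class_].get(word, 0) + 1
            denominator = class_total + V
            # Each occurrence of the word contributes one factor to the likelihood
            p += count * np.log(numerator / denominator)
        classes.append(class_)
        posteriors.append(p)
    if return_posteriors:
        print(posteriors)
    return classes[np.argmax(posteriors)]

## Test your classifier

Finally, test your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [42]:
# Your code here
y_hat_train = X_train.map(lambda x: classify_doc(x, class_word_freq, p_classes, V))
# True marks a correct prediction, so the share of True values is the training accuracy
correct = y_train == y_hat_train
correct.value_counts(normalize=True)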

## Level up (Optional)

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.
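
One possible shape for such a class is sketched below. It is an illustration under assumptions, not a reference solution: the name `NaiveBayesTextClassifier` and its `fit`, `log_posteriors`, and `predict` methods are hypothetical, tokenization is a plain whitespace split, priors are estimated from the training labels, and the toy messages are invented only to show the calls.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative sketch)."""

    def fit(self, docs, labels):
        self.word_freq = defaultdict(Counter)   # class -> word -> count
        class_counts = Counter(labels)          # class -> number of documents
        for doc, label in zip(docs, labels):
            self.word_freq[label].update(doc.split())
        # Vocabulary size across all classes, used for smoothing
        self.V = len({w for bag in self.word_freq.values() for w in bag})
        # Total word tokens per class
        self.class_totals = {c: sum(bag.values()) for c, bag in self.word_freq.items()}
        n_docs = len(labels)
        self.log_priors = {c: math.log(n / n_docs) for c, n in class_counts.items()}
        return self

    def log_posteriors(self, doc):
        """Unnormalized log posterior of each class for one document."""
        scores = {}
        for c, bag in self.word_freq.items():
            score = self.log_priors[c]
            denominator = self.class_totals[c] + self.V
            for word, count in Counter(doc.split()).items():
                score += count * math.log((bag.get(word, 0) + 1) / denominator)
            scores[c] = score
        return scores

    def predict(self, docs):
        result = []
        for doc in docs:
            scores = self.log_posteriors(doc)
            # Pick the class with the highest log score
            result.append(max(scores, key=scores.get))
        return result

# Toy data, invented for this example
train_docs = ["win free prize now", "free cash call now",
              "see you at lunch", "call me after lunch"]
train_labels = ["spam", "spam", "ham", "ham"]

clf = NaiveBayesTextClassifier().fit(train_docs, train_labels)
predictions = clf.predict(["free prize", "lunch later"])
print(predictions)  # prints: ['spam', 'ham']
```

Since `fit` only iterates over its arguments, it should also accept pandas Series, for example `clf.fit(train_df['text'], train_df['label'])`.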

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!