# Document Classification with Naive Bayes - Lab

## Introduction

In this lab, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

You will be able to:  

* Implement document classification using Naive Bayes
* Understand the need for the Laplacian smoothing correction
* Explain how to code a bag of words representation

## Import the Dataset

To start, import the dataset stored in the text file `SMSSpamCollection`.

In [1]:
import numpy as np
import pandas as pd

In [16]:
df = pd.read_csv('SMSSpamCollection', header=None, sep='\t')
df.columns = ['label', 'text']
df.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [17]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

## Account for Class Imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and subset examples of the majority class (ham) to an equal number of examples.

In [18]:
# Keep every spam message and an equal number of ham messages
spam = df[df['label'] == 'spam']
num_spam = len(spam)
ham = df[df['label'] == 'ham'][:num_spam]
len(ham)

747

In [19]:
df2 = pd.concat([spam, ham])
df2['label'].value_counts()

ham     747
spam    747
Name: label, dtype: int64
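
Slicing with `[:num_spam]` keeps the first 747 ham messages in file order. If the file happens to be ordered in some way, a random sample may be more representative. Below is a minimal sketch on a toy DataFrame, assuming the same `label` and `text` column names; the names `toy`, `spam_rows`, `ham_rows`, and `balanced`, and the `random_state` value, are illustrative only.

```python
import pandas as pd

# Toy stand-in for the SMS data: 6 ham rows and 2 spam rows (illustrative only)
toy = pd.DataFrame({
    'label': ['ham'] * 6 + ['spam'] * 2,
    'text': ['msg %d' % i for i in range(8)],
})

spam_rows = toy[toy['label'] == 'spam']
# Draw as many ham rows as there are spam rows, at random but reproducibly
ham_rows = toy[toy['label'] == 'ham'].sample(n=len(spam_rows), random_state=17)

balanced = pd.concat([spam_rows, ham_rows])
print(balanced['label'].value_counts().to_dict())
```

Note that switching the lab to random sampling would change which ham messages are kept, so downstream numbers such as the vocabulary size would differ from the ones shown here.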

## Train - Test Split

Now implement a train-test split on your dataset.

In [20]:
from sklearn.model_selection import train_test_split

X = df2['text']
y = df2['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)

## Create the Word Frequency Dictionary for Each Class

For each class, build a dictionary that maps every word to the number of times it appears in that class's training documents.

In [21]:
class_word_freq = {}
classes = train_df.label.unique()

for class_ in classes:
    temp = train_df[train_df.label == class_]
    bag = {}
    for row in temp.index:
        doc = temp['text'][row]
        for word in doc.split():
            bag[word] = bag.get(word, 0) + 1
    class_word_freq[class_] = bag

## Count the Total Corpus Words
Calculate V, the number of unique words (the vocabulary size) in the training corpus.

In [24]:
vocab = set()
for text in train_df.text:
    for word in text.split():
        vocab.add(word)
V = len(vocab)
V

5926
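
V matters because it appears in the denominator of the Laplace-smoothed word likelihood. As a reminder, the standard multinomial Naive Bayes form is shown below; the symbols $\text{count}(w, c)$ and $N_c$ are notation introduced here, not names from the lab code.

```latex
P(w \mid c) = \frac{\text{count}(w, c) + 1}{N_c + V}
```

Here $\text{count}(w, c)$ is the number of times word $w$ appears in the training documents of class $c$, and $N_c$ is the total number of word tokens in class $c$. Adding 1 to every count keeps an unseen word from producing a zero probability, and adding $V$ to the denominator keeps the smoothed probabilities summing to 1 over the vocabulary.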

## Create a Bag of Words Function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [25]:
def bag_it(doc):
    bag = {}
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1
    return bag
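
For comparison, the standard library's `collections.Counter` builds the same mapping in one step. This is a small sketch; the name `bag_it_counter` is hypothetical and is used only to avoid shadowing `bag_it()`.

```python
from collections import Counter

def bag_it_counter(doc):
    # Counter tallies each whitespace-separated token, like the loop in bag_it()
    return Counter(doc.split())

bag = bag_it_counter("free entry free prize")
print(bag["free"])   # 2
print(bag["cash"])   # 0: a Counter returns 0 for a missing key instead of raising
```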

## Implementing Naive Bayes

Now, implement a master function to build a naive Bayes classifier. Be sure to use the logarithmic probabilities to avoid underflow.
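
To see why logs matter: multiplying many small likelihoods underflows to zero in floating point, while summing their logs stays finite. The numbers below are made up for illustration (the probability `1e-5` and the count of 100 words are arbitrary).

```python
import math

# 100 hypothetical per-word likelihoods, each 1e-5 (arbitrary illustrative values)
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p   # the true value, 1e-500, is smaller than any positive float

# The same quantity on the log scale is an ordinary negative number
log_sum = sum(math.log(p) for p in probs)

print(product)     # 0.0
print(log_sum)
```

Because the logarithm is monotonic, comparing summed log probabilities picks the same class that comparing the raw products would, without the underflow.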

In [26]:
def nb_classify(doc, class_word_freq, V, return_posteriors=False):
    bag = bag_it(doc)
    classes = []
    posteriors = []
    for class_ in class_word_freq.keys():
        # Equal priors: the two classes were balanced before the split
        p = np.log(0.5)
        # Total number of word tokens observed in this class
        n_class = sum(class_word_freq[class_].values())
        for word, count in bag.items():
            # Laplace-smoothed likelihood of the word given the class
            num = class_word_freq[class_].get(word, 0) + 1
            denom = n_class + V
            # A word that occurs `count` times contributes `count` factors
            p += count * np.log(num / denom)
        classes.append(class_)
        posteriors.append(p)
    if return_posteriors:
        print(posteriors)
    return classes[np.argmax(posteriors)]

## Test Out Your Classifier

Finally, test out your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [28]:
y_hat_train = X_train.map(lambda x: nb_classify(x, class_word_freq, V))
residuals = y_train == y_hat_train
residuals.value_counts(normalize=True)
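
One example of the additional preprocessing mentioned above is normalizing tokens before counting, so that `Free`, `free`, and `free!` land in the same bucket. This is a minimal sketch; the function name `tokenize` and the choice to keep only lowercase letters, digits, and apostrophes are assumptions made for illustration, not part of the lab.

```python
import re

def tokenize(doc):
    # Lowercase the text, then keep runs of letters, digits, and apostrophes
    return re.findall(r"[a-z0-9']+", doc.lower())

tokens = tokenize("FREE entry!! Text WIN to 80086, don't miss out...")
print(tokens)
```

Swapping `doc.split()` for a tokenizer along these lines, both in `bag_it()` and in the vocabulary count, would likely shrink V and pool counts across spelling variants of the same word.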

## Level-Up

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.

In [43]:
class NaiveBayesDocClassifier:
    """Perform Naive Bayes classification of texts.
       Takes a DataFrame containing 'label' and 'text' columns
       and provides methods for predicting labels for the texts."""

    # Define a function to make a bag of words from a string
    @staticmethod
    def bag_it(doc):
        bag = {}
        for word in doc.split():
            bag[word] = bag.get(word, 0) + 1
        return bag

    # Create a dictionary with classes (labels) as keys and
    # bags-of-words for those classes as values
    def get_class_word_freq(self, df):
        class_word_freq = {}
        classes = df.label.unique()

        for class_ in classes:
            bag = {}
            # Accumulate counts over every document in the class
            for doc in df.loc[df.label == class_, 'text']:
                for word, count in self.bag_it(doc).items():
                    bag[word] = bag.get(word, 0) + count
            class_word_freq[class_] = bag

        return class_word_freq

    # Define a function to find the number of unique words in a corpus
    def count_vocab(self, df):
        vocab = set()
        for text in df.text:
            vocab.update(text.split())
        return len(vocab)

    # Define a function to classify a single document (a string),
    # given the class word frequencies and the vocabulary size V
    def nbay_classify(self, doc, class_word_freq, V, return_posteriors=False):
        bag = self.bag_it(doc)
        classes = []
        posteriors = []
        for class_, word_freq in class_word_freq.items():
            # Equal priors, assuming two balanced classes
            p = np.log(0.5)
            # Total number of word tokens observed in this class
            n_class = sum(word_freq.values())
            for word, count in bag.items():
                # Laplace-smoothed likelihood of the word given the class
                num = word_freq.get(word, 0) + 1
                denom = n_class + V
                p += count * np.log(num / denom)
            classes.append(class_)
            posteriors.append(p)
        if return_posteriors:
            print(posteriors)
        return classes[np.argmax(posteriors)]

    # Define a function to run the whole naive Bayes algorithm
    def run_nb_algorithm(self, df):
        # Compute the class statistics once, not once per document
        class_word_freq = self.get_class_word_freq(df)
        V = self.count_vocab(df)
        y_hat = []
        for doc in df['text']:
            pred = self.nbay_classify(doc, class_word_freq, V)
            y_hat.append(pred)
        return y_hat

In [44]:
classy = NaiveBayesDocClassifier()

In [51]:
from collections import Counter
y_hat = classy.run_nb_algorithm(train_df)
Counter(y_hat)

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!