# Document Classification with Naive Bayes - Lab

## Introduction

In this lecture, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

You will be able to:  

* Implement document classification using naive Bayes
* Understand the need for the Laplacian smoothing correction
* Explain how to code a bag of words representation

## Import the Dataset

To start, import the dataset stored in the text file `SMSSpamCollection`.

In [1]:
import pandas as pd
df = pd.read_csv('SMSSpamCollection', sep='\t', names=['label','text'])
df.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
label    5572 non-null object
text     5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [3]:
df.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

## Account for Class Imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and subset examples of the majority class (ham) to an equal number of examples.

In [4]:
df_spam = df[df.label=='spam']

In [5]:
df_ham = df[df.label=='ham'].sample(n=len(df_spam))

In [6]:
df2 = pd.concat([df_spam, df_ham], axis=0)
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1494 entries, 2 to 1453
Data columns (total 2 columns):
label    1494 non-null object
text     1494 non-null object
dtypes: object(2)
memory usage: 35.0+ KB


## Train - Test Split

Now implement a train test split on your dataset.

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X = df2.text
y = df2.label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)
train_df = pd.concat([X_train, y_train], axis=1) 
test_df = pd.concat([X_test, y_test], axis=1)

## Create the word frequency dictionary for each class

Create a word frequency dictionary for each class.

In [9]:
#Make a nested dictionary of class_i:{word1:freq, word2:freq..., wordn:freq},... class_m:{}
class_word_freq = {}

classes = train_df.label.unique()
for class_ in classes:
    temp_df = train_df[train_df.label==class_]
    bag = {}
    for row in temp_df.index:
        doc = temp_df['text'][row]
        for word in doc.split():
            bag[word] = bag.get(word, 0) + 1
    class_word_freq[class_] = bag

## Count the Total Corpus Words
Calculate V, the total number of words in the corpus.

In [10]:
vocabulary = set()
for text in train_df.text:
    for word in text.split():
        vocabulary.add(word)
V = len(vocabulary)
V

6042

## Create a Bag of Words Function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [11]:
def bag_it(doc):
    bag = {}
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1
    return bag


## Implementing Naive Bayes

Now, implement a master function to build a naive Bayes classifier. Be sure to use the logarithmic probabilities to avoid underflow.

In [12]:
def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    bag = bag_it(doc)
    classes = []
    posteriors = []
    for class_ in class_word_freq.keys():
        p = np.log(p_classes[class_])
        for word in bag.keys():
            num = bag[word]+1
            denom = class_word_freq[class_].get(word, 0) + V
            p += np.log(num/denom)
        classes.append(class_)
        posteriors.append(p)
    if return_posteriors:
        print(posteriors)
    return classes[np.argmax(posteriors)]


## Test Out Your Classifier

Finally, test out your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [13]:
import numpy as np

In [14]:
p_classes = dict(df2.label.value_counts(normalize=True))
p_classes

{'ham': 0.5, 'spam': 0.5}

In [15]:
classify_doc(train_df.iloc[0]['text'], class_word_freq, p_classes, V, return_posteriors=True)

[-231.9198996805509, -231.90379598912506]


'spam'

In [16]:
y_hat_train = X_train.map(lambda x: classify_doc(x, class_word_freq, p_classes, V))
residuals = y_train == y_hat_train
residuals.value_counts(normalize=True)

False    0.783036
True     0.216964
dtype: float64

## Level-Up

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.

In [17]:
class Document(object):
    
    def __init__(self):
        self.name = name
    
    def get_label_counts(self):
        labels = {}
        for lbl in df.label.unique():
            labels[lbl] = len(df.loc[df.label==lbl])
        min_length = min(labels.values()) 
        label_names = list(labels.keys())
        return min_length, label_names
    
    def sub_sample(self, label_names):
        df2 = pd.DataFrame(columns=df.columns)
        for lbl in label_names:
            temp_df = df[df.label==lbl].sample(n=min_length)
            df2 = df2.append(temp_df)
        return df2
    

In [18]:
labels = {}
for lbl in df.label.unique():
    labels[lbl] = len(df.loc[df.label==lbl])
min_length = min(labels.values()) 
label_names = list(labels.keys())

In [19]:
df2 = pd.DataFrame(columns=df.columns)

temp_df = df[df.label==lbl].sample(n=min_length)
pd.concat([df2, temp_df], axis=0)

temp_df.head()

Unnamed: 0,label,text
5378,spam,Free entry to the gr8prizes wkly comp 4 a chan...
1430,spam,For sale - arsenal dartboard. Good condition b...
2014,spam,Great News! Call FREEFONE 08006344447 to claim...
876,spam,"Shop till u Drop, IS IT YOU, either 10K, 5K, £..."
1628,spam,You have been selected to stay in 1 of 250 top...


In [24]:
df2 = pd.DataFrame(columns=df.columns)
for lbl in label_names:
    temp_df = df[df.label==lbl].sample(n=min_length)
    df2 = df2.append(temp_df)


Unnamed: 0,label,text
3877,ham,did u get that message
3395,ham,Bull. Your plan was to go floating off to IKEA...
2365,ham,Ok then no need to tell me anything i am going...
2967,ham,"Are you being good, baby? :)"
5225,ham,Smile in Pleasure Smile in Pain Smile when tro...


In [25]:
df2.tail()

Unnamed: 0,label,text
5314,spam,Get the official ENGLAND poly ringtone or colo...
4735,spam,Buy Space Invaders 4 a chance 2 win orig Arcad...
837,spam,Do you want 750 anytime any network mins 150 t...
2313,spam,tddnewsletter@emc1.co.uk (More games from TheD...
5456,spam,For the most sparkling shopping breaks from 45...


In [23]:
df2.append(temp_df)

Unnamed: 0,label,text
463,spam,"UpgrdCentre Orange customer, you may now claim..."
856,spam,Talk sexy!! Make new friends or fall in love i...
4199,spam,Want to funk up ur fone with a weekly new tone...
1380,spam,No. 1 Nokia Tone 4 ur mob every week! Just txt...
455,spam,"Loan for any purpose £500 - £75,000. Homeowner..."
4592,spam,Well done ENGLAND! Get the official poly ringt...
1839,spam,Hack Chat. Get backdoor entry into 121 chat ro...
648,spam,PRIVATE! Your 2003 Account Statement for shows...
593,spam,PRIVATE! Your 2003 Account Statement for 07753...
4394,spam,RECPT 1/3. You have ordered a Ringtone. Your o...


## Summary

Well done! In this lab, you practiced implementing naive Bayes' for document classification!