# Document Classification with Naive Bayes - Lab

## Introduction

In this lesson, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

In this lab you will:  

* Implement document classification using Naive Bayes

## Import the dataset

To start, import the dataset stored in the text file `'SMSSpamCollection'`.

In [1]:
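# Your code here
import pandas as pd
import numpy as np

# The file is tab-separated and has no header row. read_csv treats the first
# line as a header by default, so the first message becomes the column labels
# and is lost once the columns are renamed below. Passing header=None and
# names=['target', 'feature'] to read_csv would keep it.
df = pd.read_csv('SMSSpamCollection', sep='\t')
df.columns = ['target', 'feature']
# Encode the label: 1 = spam, 0 = ham
df['target'] = (df['target'] == 'spam').astype(int)
df.head()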

|   | target | feature |
|---|--------|---------|
| 0 | 0 | Ok lar... Joking wif u oni... |
| 1 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | 0 | U dun say so early hor... U c already then say... |
| 3 | 0 | Nah I don't think he goes to usf, he lives aro... |
| 4 | 1 | FreeMsg Hey there darling it's been 3 week's n... |
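
As a minimal sketch of how the `header` argument changes what `read_csv` returns, the snippet below reads a two-line, in-memory stand-in for the file (the `raw` string is a hypothetical example, not the real dataset) with and without `header=None`.

```python
import io
import pandas as pd

# Hypothetical two-line stand-in for the real tab-separated file
raw = "ham\tSee you at lunch\nspam\tWIN a FREE prize now\n"

# Default behaviour: the first line is consumed as the header row
with_default = pd.read_csv(io.StringIO(raw), sep="\t")

# header=None keeps every line as data; names= supplies the column labels
with_names = pd.read_csv(
    io.StringIO(raw), sep="\t", header=None, names=["target", "feature"]
)

print(len(with_default), len(with_names))
```

With the default settings, one of the two messages ends up in the column labels rather than in the data.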


## Account for class imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and subset examples of the majority class (ham) to an equal number of examples.

In [29]:
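# Keep every spam message and draw an equally sized random sample of ham.
# No random_state is set, so the sampled rows differ from run to run.
minority = df[df['target'] == 1]
majority = df[df['target'] == 0].sample(n=len(minority))
dfUnderSample = pd.concat([majority, minority]).reset_index(drop=True)
dfUnderSample.head()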

|   | target | feature |
|---|--------|---------|
| 0 | 0 | Jay's getting really impatient and belligerent |
| 1 | 0 | Tell you what, if you make a little spreadshee... |
| 2 | 0 | I can't describe how lucky you are that I'm ac... |
| 3 | 0 | Hmm... Dunno leh, mayb a bag 4 goigng out dat ... |
| 4 | 0 | Huh but i got lesson at 4 lei n i was thinkin ... |
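
If a repeatable subset is wanted, one option is to pass `random_state` to `sample`. The sketch below uses a hypothetical eight-row toy frame (`toy`) rather than the SMS data, and checks that the classes come out the same size.

```python
import pandas as pd

# Hypothetical toy data: 6 ham (0) and 2 spam (1)
toy = pd.DataFrame({
    "target": [0, 0, 0, 0, 0, 0, 1, 1],
    "feature": [f"message {i}" for i in range(8)],
})

minority = toy[toy["target"] == 1]
# random_state fixes the draw so the same ham rows are picked on every run
majority = toy[toy["target"] == 0].sample(n=len(minority), random_state=1)
balanced = pd.concat([majority, minority]).reset_index(drop=True)

# Both classes should now have the same number of rows
print(balanced["target"].value_counts().to_dict())
```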


In [30]:
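# Your code here
# Alternative left disabled: oversampling with SMOTE instead of undersampling.
# SMOTE interpolates between numeric feature vectors, so it cannot run on raw
# text, and X_train / y_train do not exist until after the train-test split.
# from imblearn.over_sampling import SMOTE
# df['target'].value_counts()
# smote = SMOTE(random_state=1)
# X_resamp_train, y_resamp_train = smote.fit_resample(X_train, y_train)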


## Train-test split

Now implement a train-test split on the dataset: 

In [31]:
# Your code here
from sklearn.model_selection import train_test_split
X = dfUnderSample['feature']
y = dfUnderSample['target']
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 1)
dfTrain = pd.concat([X_train, y_train], axis = 1)
dfTest =  pd.concat([X_test, y_test], axis = 1)
dfTrain.head()

|   | feature | target |
|---|---------|--------|
| 262 | Call me da, i am waiting for your call. | 0 |
| 1351 | Want explicit SEX in 30 secs? Ring 02073162414... | 1 |
| 494 | Ya ok, then had dinner? | 0 |
| 1452 | Had your mobile 11mths ? Update for FREE to Or... | 1 |
| 148 | Ill call u 2mrw at ninish, with my address tha... | 0 |
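
One optional refinement, not used in the cell above, is to pass `stratify` so that both splits keep the same class balance. The sketch below assumes a hypothetical 20-row toy frame and an explicit `test_size=0.2`; neither comes from the lab.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 ham (0) and 10 spam (1)
toy = pd.DataFrame({
    "feature": [f"message {i}" for i in range(20)],
    "target": [0] * 10 + [1] * 10,
})

# stratify keeps the 50/50 class ratio in both the train and test portions
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["feature"], toy["target"],
    test_size=0.2, stratify=toy["target"], random_state=1,
)

print(y_tr.value_counts().to_dict(), y_te.value_counts().to_dict())
```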


## Create the word frequency dictionary for each class

Create a word frequency dictionary for each class: 

In [63]:
# Your code here
# Count how often each word appears in each class. Only the training set is
# used, so no test messages leak into the model.
wordsFilter = []  # wordsFilter[0] -> ham counts, wordsFilter[1] -> spam counts
for i in [0, 1]:
    target = {}
    for x in dfTrain.loc[dfTrain['target'] == i, 'feature']:
        for word in x.split():
            target[word] = target.get(word, 0) + 1
    wordsFilter.append(target)
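
The same per-class counting can be sketched with `collections.Counter` from the standard library. The labelled messages in `train` and the name `word_counts` are hypothetical and stand in for the training DataFrame.

```python
from collections import Counter

# Hypothetical labelled messages: (label, text) with 0 = ham, 1 = spam
train = [
    (0, "see you at lunch"),
    (0, "lunch at noon"),
    (1, "win a free prize"),
    (1, "free free entry"),
]

# One Counter per class; update() adds the counts of each token in a message
word_counts = {0: Counter(), 1: Counter()}
for label, text in train:
    word_counts[label].update(text.split())

print(word_counts[1]["free"], word_counts[0]["lunch"])  # prints: 3 2
```

A `Counter` returns 0 for a word it has never seen, which removes the need for `.get(word, 0)`.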

## Count the total corpus words
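Calculate V, the total number of words in the corpus. The cell below counts every whitespace-separated token in the balanced dataset, repeats included; Laplace smoothing is more commonly written with V as the number of distinct words.

In [33]:
# Your code here
totalWords = 0
for message in dfUnderSample.feature:
    totalWords += len(message.split())
totalWords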

28833
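
To make the two possible readings of "number of words" concrete, here is a small sketch on a hypothetical three-message corpus (`docs`). It contrasts the token count, which is what the cell above computes, with the vocabulary size.

```python
# Hypothetical three-message corpus
docs = ["free prize free entry", "see you at lunch", "free lunch"]

tokens = [word for doc in docs for word in doc.split()]

n_tokens = len(tokens)          # every occurrence counts: 4 + 4 + 2
vocab_size = len(set(tokens))   # each distinct word counts once

print(n_tokens, vocab_size)  # prints: 10 7
```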

## Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [94]:
def bag_it(string):
    words = string.split()
    dictOfWords = {}
    for x in words:
        dictOfWords[x] = dictOfWords.get(x, 0) + 1
    return dictOfWords


print(dfUnderSample.feature[0],
      '\n\n',
      bag_it(dfUnderSample.feature[0])
)

Jay's getting really impatient and belligerent 

 {"Jay's": 1, 'getting': 1, 'really': 1, 'impatient': 1, 'and': 1, 'belligerent': 1}
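
The output above keeps `"Jay's"` with its capital letter, and any trailing punctuation would stay attached to a word in the same way. One possible preprocessing step, shown here only as a sketch, is to lowercase the text and keep runs of letters, digits, and apostrophes. The function name `bag_it_normalized` and the regular expression are assumptions, not part of the lab.

```python
import re

def bag_it_normalized(string):
    """Bag of words after lowercasing and dropping punctuation (hypothetical variant of bag_it)."""
    # Keep runs of lowercase letters, digits, and apostrophes as tokens
    words = re.findall(r"[a-z0-9']+", string.lower())
    dict_of_words = {}
    for word in words:
        dict_of_words[word] = dict_of_words.get(word, 0) + 1
    return dict_of_words

print(bag_it_normalized("Free entry! FREE prize, free."))
# prints: {'free': 3, 'entry': 1, 'prize': 1}
```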


## Implementing Naive Bayes

Now, implement a master function to build a naive Bayes classifier. Be sure to use the logarithmic probabilities to avoid underflow.
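
A quick sketch of why logarithms matter: multiplying many small probabilities underflows to zero in floating point, while summing their logs stays finite. The 500 likelihoods of `1e-4` are made-up values chosen only to trigger the effect.

```python
import numpy as np

# 500 hypothetical word likelihoods of 1e-4 each
likelihoods = np.full(500, 1e-4)

# The true product is 1e-2000, far below the float64 range, so it becomes 0.0
product = np.prod(likelihoods)

# The sum of logs is 500 * log(1e-4), roughly -4605.17, and is easy to compare
log_sum = np.sum(np.log(likelihoods))

print(product, log_sum)
```

Because the logarithm is increasing, the class with the largest log-score is also the class with the largest probability.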

In [95]:
# Total number of word tokens per class in the training set
words_in_spam = 0
for x in dfTrain.loc[dfTrain['target'] == 1, 'feature']:
    words_in_spam += len(x.split())

words_in_normal = 0
for x in dfTrain.loc[dfTrain['target'] == 0, 'feature']:
    words_in_normal += len(x.split())
class_word_freq = [words_in_normal, words_in_spam]  # index 0 = ham, 1 = spam
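
For reference, one common way to write the Laplace-smoothed log-score for multinomial Naive Bayes is given below. The notation is an assumption introduced here: $n_d(w)$ is the count of word $w$ in document $d$, $N_c(w)$ is the count of $w$ in the training messages of class $c$, $N_c$ is the total number of word tokens in class $c$ (the per-class totals computed above), and $V$ is the smoothing constant.

```latex
\hat{c} = \arg\max_{c \in \{0, 1\}}
  \left[ \log P(c) + \sum_{w \in d} n_d(w) \, \log \frac{N_c(w) + 1}{N_c + V} \right]
```

The $+1$ in the numerator keeps a word that never appeared in class $c$ from driving the whole score to $-\infty$.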

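In [96]:
# Your code here
def classify_doc(doc, totalWords=totalWords, return_posteriors=False):
    """Score doc against each class (0 = ham, 1 = spam) using log-probabilities."""
    bagged_doc = bag_it(doc)
    n_docs = len(y_train)
    probs = []
    for i in [0, 1]:
        # Log prior: share of training documents that belong to class i
        p = np.log((y_train == i).sum() / n_docs)
        class_total = sum(wordsFilter[i].values())
        for word, count in bagged_doc.items():
            # Laplace-smoothed likelihood of the word given class i
            num = wordsFilter[i].get(word, 0) + 1
            denom = class_total + totalWords
            p += count * np.log(num / denom)
        probs.append(p)
    if return_posteriors:
        return probs
    return int(np.argmax(probs))

## Test your classifier

Finally, test your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [99]:
# Your code here
# Log-scores for one message, in the order [ham, spam]
classify_doc(dfUnderSample['feature'][100], return_posteriors=True)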
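
Measuring accuracy comes down to predicting a label for every test message and comparing with the true labels. The sketch below shows that pattern with a hypothetical keyword rule (`toy_classifier`) and four made-up messages; in the lab, the stand-in would be replaced by the classifier function and the test split.

```python
import numpy as np

def toy_classifier(doc):
    """Hypothetical stand-in for the lab's classifier: flags messages containing 'free'."""
    return int("free" in doc.lower().split())

# Hypothetical test messages and their true labels (1 = spam, 0 = ham)
X_test_toy = ["free prize now", "see you at lunch", "claim your reward", "free for lunch?"]
y_test_toy = np.array([1, 0, 1, 0])

# Predict every message, then take the share of predictions that match
y_hat = np.array([toy_classifier(doc) for doc in X_test_toy])
accuracy = (y_hat == y_test_toy).mean()

print(accuracy)  # prints: 0.5
```

Scoring on held-out messages rather than the training messages gives a fairer estimate, because the word counts were built from the training data.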

## Level up (Optional)

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.
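
One possible shape for such a class is sketched below. The class name `NaiveBayesText`, its method names, and the four-message toy dataset are all hypothetical, and the sketch uses the vocabulary size as the smoothing constant, which is an assumption rather than the lab's required design.

```python
import math
from collections import Counter


class NaiveBayesText:
    """Hypothetical multinomial Naive Bayes for whitespace-tokenized text."""

    def fit(self, docs, labels):
        self.classes_ = sorted(set(labels))
        self.word_counts_ = {c: Counter() for c in self.classes_}
        for doc, label in zip(docs, labels):
            self.word_counts_[label].update(doc.split())
        # Log prior: share of training documents in each class
        doc_counts = Counter(labels)
        n_docs = len(labels)
        self.log_prior_ = {c: math.log(doc_counts[c] / n_docs) for c in self.classes_}
        # Total tokens per class and vocabulary size for Laplace smoothing
        self.class_totals_ = {c: sum(self.word_counts_[c].values()) for c in self.classes_}
        self.vocab_size_ = len(set().union(*self.word_counts_.values()))
        return self

    def log_scores(self, doc):
        """Return a dict mapping each class to its unnormalized log-posterior."""
        bag = Counter(doc.split())
        scores = {}
        for c in self.classes_:
            denom = self.class_totals_[c] + self.vocab_size_
            scores[c] = self.log_prior_[c] + sum(
                n * math.log((self.word_counts_[c][w] + 1) / denom)
                for w, n in bag.items()
            )
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)


# Hypothetical toy training data: 1 = spam, 0 = ham
docs = ["win a free prize", "free entry now", "see you at lunch", "lunch at noon"]
labels = [1, 1, 0, 0]
model = NaiveBayesText().fit(docs, labels)

print(model.predict("free prize"), model.predict("lunch at noon"))  # prints: 1 0
```

Keeping the counts as attributes set in `fit` means the same object can be trained on any list of documents and labels without relying on notebook-level variables.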

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!