# Document Classification with Naive Bayes - Lab

## Introduction

In this lab, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

You will be able to:  

* Implement document classification using naive Bayes
* Understand the need for the Laplacian smoothing correction
* Explain how to code a bag of words representation

## Import the Dataset

To start, import the dataset stored in the text file `SMSSpamCollection`.

In [1]:
!ls

CONTRIBUTING.md  index.ipynb  LICENSE.md  README.md  SMSSpamCollection


In [6]:
import pandas as pd
df = pd.read_csv("SMSSpamCollection", delimiter="\t")  # no header row, so the first message becomes the column names

## Account for Class Imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and subset examples of the majority class (ham) to an equal number of examples.

In [7]:
df.head()

|   | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... |
|---|-----|------|
| 0 | ham | Ok lar... Joking wif u oni... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | U dun say so early hor... U c already then say... |
| 3 | ham | Nah I don't think he goes to usf, he lives aro... |
| 4 | spam | FreeMsg Hey there darling it's been 3 week's n... |


In [11]:
df_spam = df[df["ham"] == "spam"]

In [12]:
df_ham = df[df["ham"] == "ham"][:len(df_spam)]  # keep as many ham rows as there are spam rows

In [15]:
df2 = pd.concat([df_spam, df_ham])

In [16]:
df2.columns = ["ham", "text"]
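
Taking the first rows of the ham subset keeps whichever messages happen to come first in the file. As a hedged alternative, the sketch below draws a random subset of each larger class instead. The toy DataFrame and the `downsample_majority` helper are hypothetical, invented for illustration; only the column names `ham` and `text` come from this lab.

```python
import pandas as pd

def downsample_majority(df, label_col, random_state=0):
    """Return a copy of df in which every class has as many rows as the rarest class."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)

# Toy data (made up): 4 ham messages and 2 spam messages
toy = pd.DataFrame({
    "ham": ["ham", "ham", "ham", "ham", "spam", "spam"],
    "text": ["hi", "ok", "see you", "call me", "win cash", "free prize"],
})
balanced = downsample_majority(toy, "ham")
print(balanced["ham"].value_counts())
```

Fixing `random_state` makes the subset reproducible between runs, which the positional slice also guarantees but a bare `sample()` call would not.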

## Train - Test Split

Now implement a train-test split on your dataset.

In [13]:
from sklearn.model_selection import train_test_split

In [17]:
X_train, X_test, y_train, y_test = train_test_split(df2["text"], df2["ham"], test_size = 0.2)


## Create the Word Frequency Dictionary for Each Class

Count how often each word occurs in the training messages of each class, keeping one dictionary for ham and one for spam.

In [23]:
X_train_ham = X_train[y_train == "ham"]
X_train_spam = X_train[y_train == "spam"]

In [24]:
X_train_ham.head()

246              I asked you to call him now ok
720               Oh is it? Send me the address
222                      Sorry, I'll call later
111             Going for dinner.msg you after.
110    What is the plural of the noun research?
Name: text, dtype: object

In [25]:
from collections import Counter

In [26]:

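# Concatenate every ham message into one string, then split on single spaces to get tokens
all_words_ham = ""
for row in X_train_ham:
    all_words_ham += " " + row
all_words_ham = all_words_ham.split(" ")
all_ham = Counter(all_words_ham)  # word -> count for the ham class

# Same procedure for the spam class
all_words_spam = ""
for row in X_train_spam:
    all_words_spam += " " + row
all_words_spam = all_words_spam.split(" ")
all_spam = Counter(all_words_spam)  # word -> count for the spam class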


## Count the Total Corpus Words
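Calculate V, the number of unique words (the vocabulary size) in the corpus. Laplacian smoothing adds V to the denominator of every word probability.

In [32]:
all_count = len(all_words_spam) + len(all_words_ham)  # total number of tokens
all_words = Counter(all_words_spam + all_words_ham)   # corpus-wide word counts
V = len(all_words)                                    # vocabulary size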
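
Laplacian smoothing is usually written with $V$ as the number of distinct words in the vocabulary. As a reference sketch in standard textbook notation (the symbols $\mathrm{count}(w, c)$, the number of times word $w$ occurs in the training messages of class $c$, and $N_c$, the total number of words in class $c$, are introduced here and do not appear in the lab):

```latex
P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{N_c + V}
```

Adding 1 to every count gives a word that never appears in a class the small probability $1 / (N_c + V)$ instead of zero, and adding $V$ to the denominator keeps the probabilities over the vocabulary summing to 1.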

## Create a Bag of Words Function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [None]:
#ignoring
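
The cell above leaves `bag_it()` unimplemented. A minimal sketch of one possible version follows, assuming the same single-space tokenization used for the word frequency dictionaries. The signature and the choice to drop empty tokens are assumptions, not the lab's reference solution.

```python
from collections import Counter

def bag_it(doc):
    """Map each token in doc to the number of times it occurs."""
    # Split on single spaces, as in the word frequency cells, and drop empty tokens
    return Counter(token for token in doc.split(" ") if token)

bag = bag_it("free entry free prize")
print(bag)
```

Returning a `Counter` rather than a plain `dict` means that looking up a word absent from the document yields 0 instead of raising a `KeyError`.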

## Implementing Naive Bayes

Now, implement a master function to build a naive Bayes classifier. Be sure to use the logarithmic probabilities to avoid underflow.
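
To see why logarithms matter, the sketch below multiplies many small probabilities directly and compares the result with the sum of their logs. The probabilities and the message length are made-up values chosen only to trigger the effect.

```python
import math

# Hypothetical per-word probabilities for a 400-word message (values made up)
word_probs = [1e-4] * 400

product = 1.0
for p in word_probs:
    product *= p  # eventually drops below the smallest representable float

log_sum = sum(math.log(p) for p in word_probs)

print(product)  # the direct product has underflowed to zero
print(log_sum)  # a large negative number that can still be compared
```

Because the logarithm is monotonic, comparing summed log probabilities picks the same class as comparing the products would, without the loss of precision.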

In [35]:
sum(all_spam.values())

13908

In [38]:
import numpy as np

# Laplace-smoothed P(word | spam): (count + 1) / (spam tokens + vocabulary size)
spam_total = sum(all_spam.values())
vocab_size = len(all_words)
all_spam_probs = {}
for k in all_words:
    if k:  # skip the empty token produced by splitting on single spaces
        # Counter returns 0 for unseen words, so no special case is needed
        all_spam_probs[k] = (all_spam[k] + 1) / (spam_total + vocab_size)


In [39]:
# Laplace-smoothed P(word | ham): (count + 1) / (ham tokens + vocabulary size)
ham_total = sum(all_ham.values())
vocab_size = len(all_words)
all_ham_probs = {}
for k in all_words:
    if k:  # skip the empty token produced by splitting on single spaces
        # Counter returns 0 for unseen words, so no special case is needed
        all_ham_probs[k] = (all_ham[k] + 1) / (ham_total + vocab_size)
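
As a quick sanity check on the smoothing arithmetic, the sketch below applies the add-one formula to a three-word toy vocabulary. The counts and the `smoothed_probs` helper are hypothetical and exist only for this illustration.

```python
from collections import Counter

def smoothed_probs(class_counts, vocabulary):
    """Laplace-smoothed P(word | class) for every word in the vocabulary."""
    total = sum(class_counts.values())
    v = len(vocabulary)
    return {w: (class_counts[w] + 1) / (total + v) for w in vocabulary}

# Toy counts (made up): "meeting" never appears in the spam class
spam_counts = Counter({"free": 3, "win": 1})
vocabulary = {"free", "win", "meeting"}

probs = smoothed_probs(spam_counts, vocabulary)
print(probs["meeting"])     # 1 / (4 + 3): unseen, but not zero
print(sum(probs.values()))  # sums to 1 up to floating-point rounding
```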


## Test Out Your Classifier

Finally, test out your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [None]:
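def classify(words):
    # Running log probabilities for [ham, spam]; the class priors are omitted
    # because the balanced subset makes them roughly equal
    ham_spam = [0, 0]
    for word in words.split(" "):
        # Skip empty tokens and words never seen in training
        if word in all_ham_probs:
            ham_spam[0] += np.log(all_ham_probs[word])
            ham_spam[1] += np.log(all_spam_probs[word])
    if ham_spam[0] > ham_spam[1]:
        return "ham"
    else:
        return "spam"

# Classify every test message and report the fraction labeled correctly
y_hat = X_test.map(classify)
(y_hat == y_test).mean()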
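
Accuracy alone hides which kind of mistake the classifier makes. The sketch below, using made-up label lists and a hypothetical `evaluate` helper, computes accuracy together with a count of each (actual, predicted) pair.

```python
from collections import Counter

def evaluate(y_true, y_pred):
    """Return accuracy and a Counter of (actual, predicted) label pairs."""
    pairs = Counter(zip(y_true, y_pred))
    correct = sum(n for (actual, predicted), n in pairs.items() if actual == predicted)
    return correct / len(y_true), pairs

# Made-up labels for illustration
y_true = ["ham", "ham", "spam", "spam", "spam"]
y_pred = ["ham", "spam", "spam", "spam", "ham"]

accuracy, confusion = evaluate(y_true, y_pred)
print(accuracy)
print(confusion[("spam", "ham")])  # spam messages that were labeled ham
```

For a spam filter, the two error types have different costs, so it is worth inspecting them separately rather than relying on a single number.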

## Level-Up

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.
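
One possible shape for such a class is sketched below. It is an illustration under stated assumptions, not the lab's reference solution: the class name, the `fit`/`predict` method names, and the toy training data are all hypothetical, and tokenization is the same single-space split used earlier.

```python
import math
from collections import Counter

class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with Laplace smoothing over space-separated tokens."""

    def fit(self, documents, labels):
        self.word_counts = {}        # label -> Counter of token counts
        self.doc_counts = Counter()  # label -> number of training documents
        for doc, label in zip(documents, labels):
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(self._tokenize(doc))
        self.vocabulary = set()
        for counts in self.word_counts.values():
            self.vocabulary.update(counts)
        return self

    @staticmethod
    def _tokenize(doc):
        return [token for token in doc.split(" ") if token]

    def _log_score(self, tokens, label):
        counts = self.word_counts[label]
        denom = sum(counts.values()) + len(self.vocabulary)
        # Start from the log prior, then add one smoothed log likelihood per known token
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for token in tokens:
            if token in self.vocabulary:  # ignore words never seen in training
                score += math.log((counts[token] + 1) / denom)
        return score

    def predict(self, doc):
        tokens = self._tokenize(doc)
        return max(self.word_counts, key=lambda label: self._log_score(tokens, label))

# Toy training data, invented for illustration
docs = ["win free prize now", "free cash win", "see you at lunch", "call me later"]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayesTextClassifier().fit(docs, labels)
print(model.predict("free prize"))
print(model.predict("see you later"))
```

Keeping the counts on the instance means the same class can be fitted on any labeled collection of documents, and nothing in it is specific to the two labels used in this lab.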

## Summary

Well done! In this lab, you practiced implementing naive Bayes for document classification!