# Document Classification with Naive Bayes - Lab

## Introduction

In this lab, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

In this lab you will:  

* Implement document classification using Naive Bayes

## Import the dataset

To start, import the dataset stored in the text file `'SMSSpamCollection'`.

In [2]:
# Your code here
!ls -ltr



total 512
-rw-r--r-- 1 poari 197609   1849 Sep 26 18:08 CONTRIBUTING.md
-rw-r--r-- 1 poari 197609   1371 Sep 26 18:08 LICENSE.md
-rw-r--r-- 1 poari 197609   2355 Sep 26 18:08 README.md
-rw-r--r-- 1 poari 197609 483481 Sep 26 18:08 SMSSpamCollection
-rw-r--r-- 1 poari 197609  20662 Sep 30 07:00 index.ipynb


In [3]:

import pandas as pd
# the file is tab-separated and has no header row, so supply the column names
df = pd.read_csv('SMSSpamCollection', names=['label', 'text'], sep='\t')
df

|      | label | text |
|------|-------|------|
| 0    | ham   | Go until jurong point, crazy.. Available only ... |
| 1    | ham   | Ok lar... Joking wif u oni... |
| 2    | spam  | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3    | ham   | U dun say so early hor... U c already then say... |
| 4    | ham   | Nah I don't think he goes to usf, he lives aro... |
| ...  | ...   | ... |
| 5567 | spam  | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham   | Will ü b going to esplanade fr home? |
| 5569 | ham   | Pity, * was in mood for that. So...any other s... |
| 5570 | ham   | The guy did some bitching but I acted like i'd... |


## Account for class imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and randomly sample an equal number of instances from the majority class (ham).

In [4]:
# Your code here
import numpy as np
print(df['label'].value_counts())

# extract the ham rows in a separate df
df_ham=df[df['label']=='ham']

# extract the spam rows in a separate df
df_spam=df[df['label']=='spam']

# reset index of df_ham
df_ham.reset_index(inplace = True)

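# draw as many distinct row positions as there are spam rows (sampling without
# replacement, so no ham message is selected twice)
indices = np.random.choice(len(df_ham), size=len(df_spam), replace=False)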

# select from df_ham
df_ham_select=df_ham.iloc[list(indices)]

# put together the selected ham records and the spam records
df_balanced = pd.concat([df_ham_select,df_spam],axis=0,sort=True)
df_balanced

ham     4825
spam     747
Name: label, dtype: int64


|      | index  | label | text |
|------|--------|-------|------|
| 1192 | 1394.0 | ham   | Oh ok.. |
| 2219 | 2569.0 | ham   | Hey. For me there is no leave on friday. Wait ... |
| 2317 | 2683.0 | ham   | I got a call from a landline number. . . I am ... |
| 4190 | 4840.0 | ham   | That's one of the issues but california is oka... |
| 679  | 795.0  | ham   | There generally isn't one. It's an uncountable... |
| ...  | ...    | ...   | ... |
| 5537 |        | spam  | Want explicit SEX in 30 secs? Ring 02073162414... |
| 5540 |        | spam  | ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ... |
| 5547 |        | spam  | Had your contract mobile 11 Mnths? Latest Moto... |
| 5566 |        | spam  | REMINDER FROM O2: To get 2.50 pounds free call... |


In [5]:
df_balanced.reset_index(inplace=True)


In [6]:
df_balanced


|      | level_0 | index  | label | text |
|------|---------|--------|-------|------|
| 0    | 1192    | 1394.0 | ham   | Oh ok.. |
| 1    | 2219    | 2569.0 | ham   | Hey. For me there is no leave on friday. Wait ... |
| 2    | 2317    | 2683.0 | ham   | I got a call from a landline number. . . I am ... |
| 3    | 4190    | 4840.0 | ham   | That's one of the issues but california is oka... |
| 4    | 679     | 795.0  | ham   | There generally isn't one. It's an uncountable... |
| ...  | ...     | ...    | ...   | ... |
| 1489 | 5537    |        | spam  | Want explicit SEX in 30 secs? Ring 02073162414... |
| 1490 | 5540    |        | spam  | ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ... |
| 1491 | 5547    |        | spam  | Had your contract mobile 11 Mnths? Latest Moto... |
| 1492 | 5566    |        | spam  | REMINDER FROM O2: To get 2.50 pounds free call... |


In [7]:
print(df_balanced.columns)
df_balanced.drop(['level_0','index'],axis=1,inplace=True)
print(df_balanced.columns)

Index(['level_0', 'index', 'label', 'text'], dtype='object')
Index(['label', 'text'], dtype='object')
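
As an aside, the balancing steps above can probably be written more compactly with `DataFrame.sample`, which draws rows without replacement by default. The sketch below is only an illustration on a made-up dataframe: `toy_df` and `toy_balanced` are hypothetical names, and the rows are not taken from `SMSSpamCollection`.

```python
import pandas as pd

# Hypothetical toy stand-in for the SMS dataframe (6 ham rows, 2 spam rows)
toy_df = pd.DataFrame({
    'label': ['ham'] * 6 + ['spam'] * 2,
    'text': [f'message {i}' for i in range(8)],
})

# Size of the minority class
n_minority = toy_df['label'].value_counts().min()

# Draw n_minority rows from each class without replacement,
# then renumber the rows so no extra index columns are created
toy_balanced = pd.concat(
    [group.sample(n=n_minority, random_state=0)
     for _, group in toy_df.groupby('label')]
).reset_index(drop=True)

print(toy_balanced['label'].value_counts())
```

Passing `drop=True` to `reset_index` discards the old index instead of keeping it as a column, so there is nothing to drop afterwards.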


## Train-test split

Now implement a train-test split on the dataset: 

In [8]:
# Your code here
from sklearn.model_selection import train_test_split
# separate the features (text) from the class label
X = df_balanced.drop(columns=['label'])
y = df_balanced['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
train_df = pd.concat([X_train,y_train],axis=1)
test_df = pd.concat([X_test,y_test],axis=1)

In [9]:
print(train_df.shape,test_df.shape)

(1195, 2) (299, 2)


In [1]:
299/(299+1195)


0.2001338688085676

In [12]:
train_df.columns

Index(['text', 'label'], dtype='object')

In [16]:
list(train_df['label'].unique())

['ham', 'spam']

In [17]:
train_df.index

Int64Index([ 709, 1204, 1413, 1016,  734,  150,  724, 1249, 1272,  343,
            ...
             984,  978,  772,  502, 1109,  944,  429, 1100,  999, 1439],
           dtype='int64', length=1195)

In [18]:
train_df.head()

|      | text | label |
|------|------|-------|
| 709  | Do you like shaking your booty on the dance fl... | ham |
| 1204 | Freemsg: 1-month unlimited free calls! Activat... | spam |
| 1413 | Hi this is Amy, we will be sending you a free ... | spam |
| 1016 | New TEXTBUDDY Chat 2 horny guys in ur area 4 j... | spam |
| 734  | He is impossible to argue with and he always t... | ham |


In [22]:
train_df.iloc[1]

text     Freemsg: 1-month unlimited free calls! Activat...
label                                                 spam
Name: 1204, dtype: object

In [30]:
train_df.iloc[2]

text     Hi this is Amy, we will be sending you a free ...
label                                                 spam
Name: 1413, dtype: object

In [33]:
train_df['text'][1413]

'Hi this is Amy, we will be sending you a free phone number in a couple of days, which will give you an access to all the adult parties...'

## Create the word frequency dictionary for each class

Create a word frequency dictionary for each class: 

In [13]:
# Your code here

# Nested dictionary of the form class_i: {word1: freq, word2: freq, ..., wordn: freq},
# built from the training set only

# list of classes (ham/spam)
classes = list(train_df['label'].unique())
dico = {}
for class_ in classes:
    # temporary dataframe containing only the rows of class_
    temp_df = train_df[train_df['label'] == class_]
    # word -> count for this class
    bag = {}
    # go through each row, take the text, split it and add up each word's count
    # (iterate over the column directly: temp_df.index holds labels, not positions,
    # so it cannot be used with .iloc)
    for texto in temp_df['text']:
        for word in texto.split():
            bag[word] = bag.get(word, 0) + 1
    dico[class_] = bag

## Count the total corpus words

Calculate V, the number of unique words in the corpus (the vocabulary size):

In [None]:
# Your code here
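
One possible way to compute V is to collect every whitespace-separated token into a set and take its size. The sketch below runs on a made-up three-message corpus; `toy_corpus` and `vocabulary` are hypothetical names, and splitting on whitespace with no lowercasing or punctuation handling is an assumption, not something the lab prescribes.

```python
# Hypothetical toy corpus standing in for the SMS texts
toy_corpus = [
    "free entry to win",
    "are you free tonight",
    "win a free prize",
]

# A set keeps each distinct token only once
vocabulary = set()
for text in toy_corpus:
    vocabulary.update(text.split())

# V is the number of distinct tokens seen across all documents
V = len(vocabulary)
print(V)
```

On the real data, the same loop could run over the `text` column of the dataframe instead of `toy_corpus`.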

## Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [None]:
# Your code here
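
A minimal sketch of what `bag_it()` might look like, assuming that a "word" is simply a whitespace-separated token and that no lowercasing or punctuation stripping is wanted:

```python
def bag_it(doc):
    """Return a dict mapping each whitespace-separated token in doc to its count."""
    bag = {}
    for word in doc.split():
        # start unseen words at 0, then add one for this occurrence
        bag[word] = bag.get(word, 0) + 1
    return bag


print(bag_it("free free entry"))
```

The standard library's `collections.Counter(doc.split())` would produce the same counts; the explicit loop is shown only to make the counting step visible.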

## Implementing Naive Bayes

Now, implement a master function to build a Naive Bayes classifier. Multiplying many small word probabilities together can underflow to zero in floating point, so be sure to sum their logarithms instead.

In [None]:
# Your code here
def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    pass
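
The stub above could be filled in along the following lines. This is a sketch under several assumptions rather than the lab's reference solution: it uses multinomial Naive Bayes with add-one (Laplace) smoothing, so each word contributes `log((count_in_class + 1) / (total_words_in_class + V))`; it treats `class_word_freq` as a dict of per-class word counts and `p_classes` as a dict of class priors; and it assumes `return_posteriors=True` should return the unnormalized log posteriors as a dict. The `toy_*` values are made up for illustration.

```python
import math


def bag_it(doc):
    """Count whitespace-separated tokens in doc."""
    bag = {}
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1
    return bag


def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    bag = bag_it(doc)
    log_posteriors = {}
    for class_, word_freq in class_word_freq.items():
        # start from the log prior of the class
        log_p = math.log(p_classes[class_])
        total_words = sum(word_freq.values())
        for word, count in bag.items():
            # add-one smoothing keeps unseen words from producing log(0)
            smoothed = (word_freq.get(word, 0) + 1) / (total_words + V)
            # summing logs replaces multiplying probabilities
            log_p += count * math.log(smoothed)
        log_posteriors[class_] = log_p
    if return_posteriors:
        return log_posteriors
    # the predicted class is the one with the largest log posterior
    return max(log_posteriors, key=log_posteriors.get)


# Hypothetical toy inputs, not derived from the SMS dataset
toy_freq = {
    'ham': {'see': 2, 'you': 2, 'tonight': 1},
    'spam': {'free': 3, 'win': 2, 'prize': 1},
}
toy_priors = {'ham': 0.5, 'spam': 0.5}
toy_V = 6

print(classify_doc("win a free prize", toy_freq, toy_priors, toy_V))
```

Because the log posteriors are unnormalized, they are only meaningful relative to one another; they are not probabilities.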

## Test your classifier

Finally, test your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [None]:
# Your code here
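
One way to measure accuracy is to classify every test document and take the fraction of predictions that match the true labels. The sketch below is a self-contained toy run: the `train` and `test` lists are made up, `predict` is a hypothetical helper that inlines a smoothed log-posterior classifier so that the block runs on its own, and the result says nothing about performance on the SMS data.

```python
import math
from collections import Counter

# Hypothetical toy data standing in for train_df / test_df
train = [("win a free prize now", "spam"),
         ("free entry win cash", "spam"),
         ("see you at home tonight", "ham"),
         ("are you home now", "ham")]
test = [("free cash prize", "spam"),
        ("see you tonight", "ham"),
        ("win free entry now", "spam"),
        ("home now", "ham")]

# Training statistics: per-class word counts, class priors, vocabulary size
class_word_freq = {}
for text, label in train:
    class_word_freq.setdefault(label, Counter()).update(text.split())
label_counts = Counter(label for _, label in train)
p_classes = {label: n / len(train) for label, n in label_counts.items()}
V = len({word for text, _ in train for word in text.split()})


def predict(doc):
    """Pick the class with the highest add-one-smoothed log posterior."""
    scores = {}
    for class_, word_freq in class_word_freq.items():
        total = sum(word_freq.values())
        scores[class_] = math.log(p_classes[class_]) + sum(
            math.log((word_freq[word] + 1) / (total + V)) for word in doc.split()
        )
    return max(scores, key=scores.get)


# Accuracy = share of test documents whose prediction equals the true label
predictions = [predict(text) for text, _ in test]
accuracy = sum(pred == label for pred, (_, label) in zip(predictions, test)) / len(test)
print(accuracy)
```

Only training data feeds the counts, priors, and V here; the test set is used solely for scoring.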


## Level up (Optional)

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.
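
A possible shape for such a class is sketched below. The class name `NaiveBayesTextClassifier` and its `fit`/`predict`/`score` methods are hypothetical choices that loosely mirror the scikit-learn estimator convention; the tokenization (whitespace split) and add-one smoothing are assumptions as well.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, docs, labels):
        # per-class word counts, learned from the training documents
        self.class_word_freq = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.class_word_freq[label].update(doc.split())
        # class priors from label frequencies
        label_counts = Counter(labels)
        self.p_classes = {c: n / len(labels) for c, n in label_counts.items()}
        # vocabulary size across all classes
        self.V = len({w for freq in self.class_word_freq.values() for w in freq})
        return self

    def log_posteriors(self, doc):
        """Unnormalized log posterior of each class for one document."""
        scores = {}
        for class_, word_freq in self.class_word_freq.items():
            total = sum(word_freq.values())
            scores[class_] = math.log(self.p_classes[class_]) + sum(
                math.log((word_freq[w] + 1) / (total + self.V)) for w in doc.split()
            )
        return scores

    def predict(self, docs):
        return [max(s, key=s.get) for s in map(self.log_posteriors, docs)]

    def score(self, docs, labels):
        """Fraction of documents classified correctly."""
        preds = self.predict(docs)
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy usage with made-up messages
clf = NaiveBayesTextClassifier().fit(
    ["win a free prize", "see you tonight"], ["spam", "ham"]
)
print(clf.predict(["free prize", "see you"]))
```

Keeping the learned counts, priors, and V as attributes means the same object can be refit on any list of documents and labels.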

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!