# Document Classification with Naive Bayes - Lab

## Introduction

In this lab, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

In this lab you will:  

* Implement document classification using Naive Bayes

## Import the dataset

To start, import the dataset stored in the text file `'SMSSpamCollection'`.

In [1]:
# Your code here
!ls -ltr



total 580
-rw-r--r-- 1 poari 197609   1849 Sep 26 18:08 CONTRIBUTING.md
-rw-r--r-- 1 poari 197609   1371 Sep 26 18:08 LICENSE.md
-rw-r--r-- 1 poari 197609   2355 Sep 26 18:08 README.md
-rw-r--r-- 1 poari 197609 483481 Sep 26 18:08 SMSSpamCollection
-rw-r--r-- 1 poari 197609  91030 Oct 11 15:33 index.ipynb


In [2]:

import pandas as pd
df = pd.read_csv('SMSSpamCollection', names=['label', 'text'], sep='\t')
df

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


## Account for class imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and subset examples of the majority class (ham) to an equal number of examples.

In [3]:
# Your code here
import numpy as np
print(df['label'].value_counts())

# extract the ham rows into a separate df
df_ham = df[df['label'] == 'ham']

# extract the spam rows into a separate df
df_spam = df[df['label'] == 'spam']

# reset the index of df_ham (the old index is kept as a new 'index' column)
df_ham.reset_index(inplace=True)

# draw 747 random integer positions from 0 to 4824 (the upper bound 4825 is exclusive);
# np.random.randint samples with replacement, so the same ham row can be picked more than once
indices = np.random.randint(0, 4825, 747)

# select those rows from df_ham
df_ham_select = df_ham.iloc[list(indices)]

# put together the selected ham records and the spam records
df_balanced = pd.concat([df_ham_select, df_spam], axis=0, sort=True)
df_balanced

ham     4825
spam     747
Name: label, dtype: int64


Unnamed: 0,index,label,text
1443,1682.0,ham,Y lei?
1974,2291.0,ham,"HEY THERE BABE, HOW U DOIN? WOT U UP 2 2NITE L..."
9,16.0,ham,Oh k...i'm watching here:)
2866,3310.0,ham,Okie ü wan meet at bishan? Cos me at bishan no...
3863,4466.0,ham,CHEERS FOR CALLIN BABE.SOZI CULDNT TALKBUT I W...
...,...,...,...
5537,,spam,Want explicit SEX in 30 secs? Ring 02073162414...
5540,,spam,ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ...
5547,,spam,Had your contract mobile 11 Mnths? Latest Moto...
5566,,spam,REMINDER FROM O2: To get 2.50 pounds free call...


In [4]:
# simpler alternative: sample 747 ham rows without replacement
df[df['label']=='ham'].sample(n=747)


Unnamed: 0,label,text
5280,ham,"Vikky, come around &lt;TIME&gt; .."
4870,ham,1. Tension face 2. Smiling face 3. Waste face ...
1854,ham,I just made some payments so dont have that mu...
553,ham,"Sure, if I get an acknowledgement from you tha..."
37,ham,I see the letter B on my car
...,...,...
3361,ham,Please attend the phone:)
4661,ham,You call him and tell now infront of them. Cal...
1998,ham,"YEH I AM DEF UP4 SOMETHING SAT,JUST GOT PAYED2..."
1390,ham,"Haha... Where got so fast lose weight, thk muz..."


In [5]:
df_balanced.reset_index(inplace=True)


In [6]:
df_balanced


Unnamed: 0,level_0,index,label,text
0,1443,1682.0,ham,Y lei?
1,1974,2291.0,ham,"HEY THERE BABE, HOW U DOIN? WOT U UP 2 2NITE L..."
2,9,16.0,ham,Oh k...i'm watching here:)
3,2866,3310.0,ham,Okie ü wan meet at bishan? Cos me at bishan no...
4,3863,4466.0,ham,CHEERS FOR CALLIN BABE.SOZI CULDNT TALKBUT I W...
...,...,...,...,...
1489,5537,,spam,Want explicit SEX in 30 secs? Ring 02073162414...
1490,5540,,spam,ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ...
1491,5547,,spam,Had your contract mobile 11 Mnths? Latest Moto...
1492,5566,,spam,REMINDER FROM O2: To get 2.50 pounds free call...


In [7]:
print(df_balanced.columns)
df_balanced.drop(['level_0','index'],axis=1,inplace=True)
print(df_balanced.columns)

Index(['level_0', 'index', 'label', 'text'], dtype='object')
Index(['label', 'text'], dtype='object')
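
The balancing steps in this section can be condensed into one helper. The sketch below is a standard-library illustration on hypothetical toy rows, not the lab's reference solution; the name `balance_classes` and the `(text, label)` tuple layout are assumptions made for the example. It draws without replacement, so no message can appear twice in the balanced set, which is also the default behavior of pandas `DataFrame.sample`.

```python
import random


def balance_classes(rows, seed=0):
    """Downsample every class to the size of the smallest one, without replacement."""
    by_label = {}
    for text, label in rows:
        by_label.setdefault(label, []).append((text, label))

    n_min = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    balanced = []
    for group in by_label.values():
        # random.sample never picks the same row twice
        balanced.extend(rng.sample(group, n_min))
    return balanced


# Hypothetical toy data: 6 ham messages and 2 spam messages
rows = [(f"ham message {i}", "ham") for i in range(6)] + [
    ("win a prize", "spam"),
    ("free entry", "spam"),
]
balanced = balance_classes(rows)
```

Here `balanced` holds two ham rows and both spam rows, each appearing once.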


## Train-test split

Now implement a train-test split on the dataset: 

In [8]:
# Your code here
from sklearn.model_selection import train_test_split
# separate the features (text) from the class label
X = df_balanced.drop(columns=['label'])
y = df_balanced['label']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.2)
train_df = pd.concat([X_train,y_train],axis=1)
test_df = pd.concat([X_test,y_test],axis=1)

In [9]:
print(train_df.shape,test_df.shape)

(1195, 2) (299, 2)


In [10]:
299/(299+1195)


0.2001338688085676
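
To make the mechanics of an 80/20 split concrete, here is a minimal standard-library sketch of a shuffled holdout split. It is an illustration only and does not reproduce scikit-learn's `train_test_split` internals; the function name `split_train_test` and the toy rows are hypothetical. Fixing the seed makes the split reproducible between runs, which the cell above does not do.

```python
import random


def split_train_test(rows, test_size=0.2, seed=0):
    """Shuffle a copy of rows and hold out a test_size fraction of them."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy data: 10 labeled messages
rows = [(f"message {i}", "spam" if i % 2 else "ham") for i in range(10)]
train_rows, test_rows = split_train_test(rows, test_size=0.2, seed=0)
```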

In [11]:
train_df.columns

Index(['text', 'label'], dtype='object')

In [12]:
list(train_df['label'].unique())

['spam', 'ham']

In [13]:
train_df.index

Int64Index([1276, 1370,  319, 1246,   55,  323,  370,  803,  958,  941,
            ...
             236, 1374, 1291, 1130,  931,  132, 1272,  929, 1384, 1127],
           dtype='int64', length=1195)

In [14]:
train_df.head()

Unnamed: 0,text,label
1276,ringtoneking 84484,spam
1370,Sunshine Quiz! Win a super Sony DVD recorder i...,spam
319,"As in missionary hook up, doggy hook up, stand...",ham
1246,FREE for 1st week! No1 Nokia tone 4 ur mob eve...,spam
55,"Watching cartoon, listening music &amp; at eve...",ham


In [15]:
train_df.iloc[1]

text     Sunshine Quiz! Win a super Sony DVD recorder i...
label                                                 spam
Name: 1370, dtype: object

In [16]:
train_df.index

Int64Index([1276, 1370,  319, 1246,   55,  323,  370,  803,  958,  941,
            ...
             236, 1374, 1291, 1130,  931,  132, 1272,  929, 1384, 1127],
           dtype='int64', length=1195)

In [17]:
train_df.iloc[2]

text     As in missionary hook up, doggy hook up, stand...
label                                                  ham
Name: 319, dtype: object

In [18]:
train_df['text'][1413].split()

['Hi',
 'this',
 'is',
 'Amy,',
 'we',
 'will',
 'be',
 'sending',
 'you',
 'a',
 'free',
 'phone',
 'number',
 'in',
 'a',
 'couple',
 'of',
 'days,',
 'which',
 'will',
 'give',
 'you',
 'an',
 'access',
 'to',
 'all',
 'the',
 'adult',
 'parties...']

## Create the word frequency dictionary for each class

Create a word frequency dictionary for each class: 

In [19]:
# Your code here

# Build a nested dictionary of the form class_i: {word1: freq, word2: freq, ..., wordn: freq},
# based on the training set only

# list of classes (ham/spam)
classes = list(train_df['label'].unique())
dico = {}
for class_ in classes:
    # temporary dataframe containing only the rows of class_
    temp_df = train_df[train_df['label'] == class_]
    # initialize an empty dictionary for the bag of words
    bag = {}
    # go through each row, split its text, and add up each word's count
    for row in temp_df.index:
        texto = temp_df['text'][row]
        for word in texto.split():
            # bag.get(word, 0) returns the current count, or 0 if the word is new;
            # adding 1 records this occurrence
            bag[word] = bag.get(word, 0) + 1
    dico[class_] = bag
dico

{'spam': {'ringtoneking': 1,
  '84484': 1,
  'Sunshine': 7,
  'Quiz!': 1,
  'Win': 10,
  'a': 290,
  'super': 1,
  'Sony': 5,
  'DVD': 6,
  'recorder': 1,
  'if': 16,
  'you': 126,
  'canname': 1,
  'the': 135,
  'capital': 1,
  'of': 72,
  'Australia?': 1,
  'Text': 31,
  'MQUIZ': 1,
  'to': 501,
  '82277.': 5,
  'B': 5,
  'FREE': 82,
  'for': 139,
  '1st': 23,
  'week!': 11,
  'No1': 8,
  'Nokia': 38,
  'tone': 17,
  '4': 79,
  'ur': 89,
  'mob': 12,
  'every': 30,
  'week': 26,
  'just': 40,
  'txt': 59,
  'NOKIA': 15,
  '8007': 19,
  'Get': 29,
  'txting': 8,
  'and': 96,
  'tell': 11,
  'mates': 5,
  'www.getzed.co.uk': 9,
  'POBox': 9,
  '36504': 7,
  'W45WQ': 6,
  'norm150p/tone': 6,
  '16+': 17,
  'Double': 11,
  'mins': 13,
  'txts': 7,
  '6months': 2,
  'Bluetooth': 8,
  'on': 111,
  'Orange.': 4,
  'Available': 2,
  'Sony,': 2,
  'Motorola': 5,
  'phones.': 2,
  'Call': 115,
  'MobileUpd8': 15,
  '08000839402': 15,
  'or': 149,
  'call2optout/N9DX': 2,
  '09066362231': 3,
  

In [20]:
dico.keys()

dict_keys(['spam', 'ham'])

In [21]:
dico['spam']['We'],dico['spam']['urgent'],

(31, 4)
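
The nested loop used in this section to build the per-class dictionary can also be written with `collections.Counter`. This is a sketch on hypothetical toy documents; `class_word_counts` is an illustrative name, and the tokenization is the same whitespace `split()` used in the lab. A `Counter` returns 0 for words it has never seen, which is convenient later when smoothing.

```python
from collections import Counter, defaultdict


def class_word_counts(docs):
    """Map each label to a Counter of whitespace-separated tokens."""
    counts = defaultdict(Counter)
    for text, label in docs:
        counts[label].update(text.split())
    return counts


# Hypothetical toy training documents as (text, label) pairs
docs = [
    ("FREE entry to win", "spam"),
    ("win a FREE phone", "spam"),
    ("see you at home", "ham"),
]
word_freq = class_word_counts(docs)
```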

## Count the total corpus words

Calculate V, the number of unique words (the vocabulary size) in the training corpus:

In [22]:
# Your code here
# collect every distinct word in the training texts
corpus_train = set()
for row in train_df.index:
    for word in train_df['text'][row].split():
        # a set keeps only unique elements, so repeated words are counted once
        corpus_train.add(word)
V = len(corpus_train)
V

6135

In [23]:
tt=set()
tt.add(1)
tt.add(2)
tt.add(1)

In [24]:
tt


{1, 2}
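
Since V is built from a set, it counts distinct words rather than every token. The short sketch below, on hypothetical toy texts, shows the two quantities side by side so they are not confused when they appear in the smoothing denominator.

```python
# Hypothetical toy corpus of three short messages
texts = ["free entry free prize", "see you at home", "free at last"]

# Every token, with repeats
tokens = [word for text in texts for word in text.split()]
total_tokens = len(tokens)

# Distinct words only: this is what V measures
vocabulary = set(tokens)
V = len(vocabulary)
```

In this toy corpus there are 11 tokens but only 8 distinct words, because "free" and "at" repeat.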

## Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [25]:
# Your code here
def bag_it(text):
    # text: a string; returns a dictionary mapping each word to its count in text
    bag = {}
    for word in text.split():
        bag[word] = bag.get(word, 0) + 1
    return bag

In [26]:

for word in train_df['text'][3].split():
    print(word)

Okie
ü
wan
meet
at
bishan?
Cos
me
at
bishan
now.
I'm
not
driving
today.


In [27]:
bag_it(train_df['text'][3])

{'Okie': 1,
 'ü': 1,
 'wan': 1,
 'meet': 1,
 'at': 2,
 'bishan?': 1,
 'Cos': 1,
 'me': 1,
 'bishan': 1,
 'now.': 1,
 "I'm": 1,
 'not': 1,
 'driving': 1,
 'today.': 1}
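
In the output above, `'bishan?'` and `'bishan'` are counted as different words because `split()` keeps punctuation and case. One possible refinement, offered as a sketch rather than as part of the lab's solution, is to lowercase the text and extract word characters with a regular expression. The name `bag_it_normalized` and the token pattern are assumptions chosen for the example.

```python
import re
from collections import Counter

# Runs of word characters, optionally followed by an apostrophe part (e.g. "i'm")
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?")


def bag_it_normalized(text):
    """Bag of words after lowercasing and stripping punctuation."""
    return dict(Counter(TOKEN_PATTERN.findall(text.lower())))


bag = bag_it_normalized(
    "Okie ü wan meet at bishan? Cos me at bishan now. I'm not driving today."
)
```

With this tokenizer both occurrences of "bishan" fall under one key, at the cost of discarding punctuation that may itself be a spam signal.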

In [28]:
who

V	 X	 X_test	 X_train	 bag	 bag_it	 class_	 classes	 corpus_train	 
df	 df_balanced	 df_ham	 df_ham_select	 df_spam	 dico	 indices	 np	 pd	 
row	 temp_df	 test_df	 texto	 train_df	 train_test_split	 tt	 word	 y	 
y_test	 y_train	 


In [37]:
train_df['text'][1276]
type(train_df['text'][1276])
type(train_df['text'][1276].split())

list

## Implementing Naive Bayes

Now, implement a master function to build a naive Bayes classifier. Be sure to use the logarithmic probabilities to avoid underflow.
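
The reason for working with logarithms is that a document's likelihood is a product of many small per-word probabilities, and such a product can underflow to zero in floating point. Summing the logs carries the same ranking information without that problem. The values below are made up purely to demonstrate the effect.

```python
import math

# 100 hypothetical per-word probabilities, each equal to 1e-5
probs = [1e-5] * 100

# Multiplying them directly underflows: the true value 1e-500 is below
# the smallest positive float, so the result collapses to 0.0
product = 1.0
for p in probs:
    product *= p

# Summing the logs stays finite and can still be compared across classes
log_sum = sum(math.log(p) for p in probs)
```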

In [38]:
p_classes = dict(train_df['label'].value_counts(normalize=True))
p_classes2 = dict(train_df['label'].value_counts())

In [39]:
p_classes
p_classes2

{'spam': 605, 'ham': 590}

In [101]:
# Your code here
def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    # doc: the text to classify (a string)
    # class_word_freq: nested word-frequency dictionary per class ('dico' in this notebook)
    # p_classes: prior probability of each class in train_df (normalized value counts)
    # V: number of unique words in the training corpus
    # returns the predicted class label ('ham' or 'spam')
    bag = bag_it(doc)  # word -> number of occurrences in doc
    classes = []
    posteriors = []
    for class_ in class_word_freq.keys():  # loop over spam, ham
        p = np.log(p_classes[class_])  # start from the log prior of class_
        for word in bag.keys():
            # Work in log space: summing logs avoids underflow from multiplying many small numbers.
            # NOTE: this ratio is (count in doc + 1) / (count in class + V). Standard Laplace
            # smoothing is (count in class + 1) / (total words in class + V); as written, words
            # that are frequent in a class lower that class's score.
            num = bag[word] + 1
            denom = class_word_freq[class_].get(word, 0) + V
            p += np.log(num / denom)
        classes.append(class_)
        posteriors.append(p)
    if return_posteriors:
        print(posteriors)
    return classes[np.argmax(posteriors)]
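
The scoring loop above uses (count in document + 1) / (count in class + V) for each word. The usual multinomial Naive Bayes estimate with Laplace smoothing is instead (count of the word in the class + 1) / (total words in the class + V), with each word's log likelihood weighted by how often it occurs in the document. The following is a minimal standard-library sketch of that textbook form on hypothetical toy counts; `classify_doc_laplace` is an illustrative name, not the lab's reference solution.

```python
import math
from collections import Counter


def classify_doc_laplace(doc, class_word_freq, p_classes, V):
    """Return (predicted_label, log_scores) using Laplace-smoothed likelihoods."""
    doc_counts = Counter(doc.split())
    scores = {}
    for class_, word_freq in class_word_freq.items():
        total_words = sum(word_freq.values())  # all tokens seen in this class
        score = math.log(p_classes[class_])    # log prior
        for word, n_in_doc in doc_counts.items():
            likelihood = (word_freq.get(word, 0) + 1) / (total_words + V)
            score += n_in_doc * math.log(likelihood)
        scores[class_] = score
    return max(scores, key=scores.get), scores


# Hypothetical toy model: per-class word counts, equal priors, 5 distinct words
class_word_freq = {
    "spam": {"free": 3, "win": 2, "call": 1},
    "ham": {"call": 2, "home": 3, "later": 1},
}
p_classes = {"spam": 0.5, "ham": 0.5}
V = 5

spam_label, spam_scores = classify_doc_laplace("win free call", class_word_freq, p_classes, V)
ham_label, ham_scores = classify_doc_laplace("call home later", class_word_freq, p_classes, V)
```

With this form, a word that is frequent in a class raises that class's score, which is the behavior a spam filter needs.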

In [102]:
# 'class_word_freq' was called 'dico' here
dico.keys()

dict_keys(['spam', 'ham'])

In [103]:
dico['spam'].get('capital')

1

In [104]:
V

6135

In [105]:
train_df.iloc[1]


text     Sunshine Quiz! Win a super Sony DVD recorder i...
label                                                 spam
Name: 1370, dtype: object

In [106]:
train_df['text'][1370]

'Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B'

In [107]:
bag=bag_it(train_df['text'][1370])

In [108]:
bag['DVD']

1

In [109]:
classify_doc(train_df['text'][1370], dico, p_classes, V, return_posteriors=True)

[-161.44646237173353, -161.37257336971908]


'ham'

## Test your classifier

Finally, test your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [110]:
X_train

Unnamed: 0,text
1276,ringtoneking 84484
1370,Sunshine Quiz! Win a super Sony DVD recorder i...
319,"As in missionary hook up, doggy hook up, stand..."
1246,FREE for 1st week! No1 Nokia tone 4 ur mob eve...
55,"Watching cartoon, listening music &amp; at eve..."
...,...
132,I know but you need to get hotel now. I just g...
1272,You have been specially selected to receive a ...
929,"As a SIM subscriber, you are selected to recei..."
1384,Camera - You are awarded a SiPix Digital Camer...


In [111]:
type(X_train)

pandas.core.frame.DataFrame

In [112]:
type(X_train['text'])

pandas.core.series.Series

In [113]:
# Your code here
y_hat_train = X_train['text'].map(lambda x: classify_doc(x, dico, p_classes, V))
# True where the prediction matches the label, so the share of True is the training accuracy
residuals = y_train == y_hat_train
residuals.value_counts(normalize=True)

False    0.853556
True     0.146444
dtype: float64
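
Because `True` marks a correct prediction, the output above corresponds to a training accuracy of about 14.6%. On a balanced two-class set, chance is roughly 50%, so a score this far below chance usually points to an inverted scoring rule rather than a merely weak model. A per-class breakdown makes that easier to see. The helpers below are a standard-library sketch on made-up labels; the same functions could be applied to held-out test predictions, which this section does not evaluate.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))


# Hypothetical labels and predictions
y_true = ["spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

acc = accuracy(y_true, y_pred)
confusion = confusion_counts(y_true, y_pred)
```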

## Level up (Optional)

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.
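
One possible shape for such a class is sketched below using only the standard library. It is a starting point under stated assumptions, not a definitive design: the class name `NaiveBayesTextClassifier`, the `fit`/`predict` method names, and the pluggable `tokenizer` argument are all choices made for this illustration, and the likelihood uses the textbook Laplace-smoothed estimate.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over bag-of-words counts with Laplace smoothing."""

    def __init__(self, tokenizer=str.split):
        self.tokenizer = tokenizer  # any callable mapping a string to a list of tokens

    def fit(self, texts, labels):
        labels = list(labels)
        class_counts = Counter(labels)
        self.word_counts_ = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts_[label].update(self.tokenizer(text))
        # Log prior of each class from its share of the training documents
        self.log_priors_ = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
        # Total tokens per class and the overall vocabulary
        self.total_words_ = {c: sum(self.word_counts_[c].values()) for c in class_counts}
        self.vocabulary_ = {w for counts in self.word_counts_.values() for w in counts}
        return self

    def log_posteriors(self, text):
        V = len(self.vocabulary_)
        scores = {}
        for c, log_prior in self.log_priors_.items():
            score = log_prior
            for word, n in Counter(self.tokenizer(text)).items():
                likelihood = (self.word_counts_[c][word] + 1) / (self.total_words_[c] + V)
                score += n * math.log(likelihood)
            scores[c] = score
        return scores

    def predict(self, texts):
        return [max(s, key=s.get) for s in map(self.log_posteriors, texts)]


# Hypothetical toy training data
train_texts = [
    "win free prize now",
    "free entry win cash",
    "see you at home",
    "call me when you are home",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
predictions = model.predict(["free cash prize", "are you home"])
```

Keeping the tokenizer as a constructor argument lets the same class be reused with a different preprocessing step or on another dataset without touching the counting logic.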

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!