# Naive Bayes Classification Assignment

1. Dataset download
Download the Email Spam Classification Dataset from here: http://www.codeheroku.com/static/workshop/datasets/ham_spam.csv
2. Build a classifier using Scikit-learn to classify emails as Spam / Not spam with the highest accuracy.


In [1]:
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer 
from nltk.stem import LancasterStemmer
from nltk.stem import SnowballStemmer
import re
from sklearn.model_selection import train_test_split

Loading the ham_spam.csv file

In [2]:
df = pd.read_csv("ham_spam.csv", encoding = 'latin-1')
df

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


Removing the unnamed columns, because most of their values are null.

In [3]:
df = df.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'], axis=1)
df

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


This function takes a mail row and processes its content (the v2 column):
1. Consistent casing
2. Tokenization
3. Removing stop words
4. Lemmatization, with stemming as a fallback when lemmatization leaves the word unchanged

A list of unique words is returned.

In [4]:
def return_words(row):
    
    # consistent casing
    row = row.lower()
    
    # Strip punctuation, then tokenize
    row = re.sub(r'[^A-Za-z0-9\s]+', '', row)
    words = word_tokenize(row)
    
    # Removing common words - stop words
    stop_words = set(stopwords.words('english'))
    # update() adds each extra word; append() would have added the list itself as one element
    stop_words.update(["etc", "also"])
    clean_list = [word for word in words if word not in stop_words]
    
    # Lemmatize as adjective, then verb, then noun;
    # stem only if the word is still unchanged and longer than 3 characters
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    words = []
    for word in clean_list:
        w = lemmatizer.lemmatize(word, pos='a')
        if w == word:
            w = lemmatizer.lemmatize(w, pos='v')
        if w == word:
            w = lemmatizer.lemmatize(w, pos='n')
        if w == word and len(w) > 3:
            w = stemmer.stem(w)
        words.append(w)
        
    words = list(set(words))

    return words
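The first three steps can be tried without NLTK. This is only a minimal sketch: `simple_words` and the tiny `STOP_WORDS` set are made up for illustration and stand in for `word_tokenize` and the NLTK English stop-word list, and no lemmatization or stemming is done.

```python
import re

# Hypothetical mini stop-word list, standing in for NLTK's English list.
STOP_WORDS = {"a", "the", "to", "is", "in", "now"}

def simple_words(text):
    """Lower-case, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]+", "", text)
    tokens = text.split()
    # A set removes duplicates; sorting makes the result reproducible.
    return sorted({t for t in tokens if t not in STOP_WORDS})

print(simple_words("WIN a FREE prize!! Call now, call today."))
# ['call', 'free', 'prize', 'today', 'win']
```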

Using a lambda function to apply the above function to all the rows of the dataframe

In [5]:
df['words'] = df['v2'].apply(lambda x: return_words(x))

In [6]:
df

Unnamed: 0,v1,v2,words
0,ham,"Go until jurong point, crazy.. Available only ...","[cine, crazi, world, get, e, point, jurong, go..."
1,ham,Ok lar... Joking wif u oni...,"[wif, u, lar, ok, joke, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[entri, may, 21st, text, win, tkt, 2005, 2, fi..."
3,ham,U dun say so early hor... U c already then say...,"[alreadi, u, dun, hor, earli, c, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...","[live, dont, around, though, usf, go, nah, think]"
...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,"[per, try, claim, call, now1, pound, prize, 75..."
5568,ham,Will Ì_ b going to esplanade fr home?,"[home, b, go, fr, esplanad]"
5569,ham,"Pity, * was in mood for that. So...any other s...","[mood, soani, piti, suggestion]"
5570,ham,The guy did some bitching but I acted like i'd...,"[buy, id, els, bitch, someth, u, like, week, a..."


This function does the following:
    1. Creates a bag_of_words by combining the word lists from the words column and applying the set function to get the unique words.
    2. The words of each mail row are checked against the bag_of_words.
    3. A new dataframe is created with every word as a column and one row per mail id.
    4. If a mail contains a word, the corresponding element is set to 1.
    5. The output dataframe is written to a file, since this task is very time-consuming and we don't want to lose the result.

In [7]:
def mapping(df1):

    print("Processing ....")

    # Collect every word from every mail, then keep the unique ones
    bag_of_words = []
    for index, row in df1.iterrows():
        bag_of_words = bag_of_words + row['words']

    bag_of_words = list(set(bag_of_words))
    column_names = ["mail_id"] + bag_of_words

    rows = []
    for index, row in df1.iterrows():
        print("Processing mail id: ", index)
        # Start with every column at 0, then add the mail id
        df2_dict = dict.fromkeys(column_names, 0)
        df2_dict['mail_id'] = index
        # Populating the words columns
        for word in row['words']:
            df2_dict[word] = 1
        rows.append(df2_dict)

    # Build the dataframe once at the end; DataFrame.append was removed in pandas 2.0
    df2 = pd.DataFrame(rows, columns=column_names)
    df2.to_csv("test.csv", index=False, header=True)
    return df2
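Building the matrix one mail at a time is slow. One possible alternative is scikit-learn's `MultiLabelBinarizer`, which turns a column of word lists into a 0/1 matrix in a single call. A minimal sketch on a made-up toy `words` column (this is not the code used in this notebook, and the column order is alphabetical rather than the set order above):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the 'words' column (hypothetical data).
toy = pd.DataFrame({"words": [["free", "win"], ["ok", "lar"], ["win", "ok"]]})

mlb = MultiLabelBinarizer()
matrix = mlb.fit_transform(toy["words"])  # shape: (n_mails, n_unique_words)

# One column per unique word; the dataframe index plays the role of the mail id.
features = pd.DataFrame(matrix, columns=mlb.classes_, index=toy.index)
print(features)
```

For the full dataset, `MultiLabelBinarizer(sparse_output=True)` returns a sparse matrix instead, which the scikit-learn Naive Bayes models accept directly.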

In [15]:
mapping(df)

Processing ....
Processing mail id:  0
Processing mail id:  1
Processing mail id:  2
Processing mail id:  3
...

KeyboardInterrupt: 

Loading the mail-id vs words matrix file

In [16]:
df2 = pd.read_csv("output_mapping.csv")
df2.head(3)

Unnamed: 0,mail_id,oha,beehoon,tui,armand,milkdayno,spare,secur,mca,moan,...,haul,discus,wwwtxt43com,goggle,fail,afew,11,mad1,gon,sez
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Preparing the X and y dataframes for training the model.

In [17]:
X = df2
y = df["v1"]

In [18]:
y = y.str.replace("ham",'0').str.replace("spam",'1')
y

0       0
1       0
2       1
3       0
4       0
       ..
5567    1
5568    0
5569    0
5570    0
5571    0
Name: v1, Length: 5572, dtype: object
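Two details here may be worth a second look: `X = df2` still contains the `mail_id` column, which is an identifier rather than a word feature, and `y` holds the strings '0' and '1' (dtype object). A minimal sketch of one way to handle both, on made-up toy frames (`df_toy`, `df2_toy`, `X_toy` and `y_toy` are hypothetical names, not variables from this notebook):

```python
import pandas as pd

# Toy stand-ins for df and df2 (hypothetical data).
df_toy = pd.DataFrame({"v1": ["ham", "spam", "ham"]})
df2_toy = pd.DataFrame({"mail_id": [0, 1, 2], "free": [0, 1, 0], "ok": [1, 0, 1]})

# Keep only the word columns as features; mail_id is just an identifier.
X_toy = df2_toy.drop(columns=["mail_id"])

# Map the labels to integers in one step (any unmapped value would become NaN).
y_toy = df_toy["v1"].map({"ham": 0, "spam": 1})

print(X_toy.columns.tolist(), y_toy.tolist())
# ['free', 'ok'] [0, 1, 0]
```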

Splitting the X and y dataframes into training and test samples with an 80/20 split

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
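If the two classes are not equally frequent, passing `stratify` to `train_test_split` keeps the ham/spam ratio the same in the training and test parts. A minimal sketch on made-up labels (the 80/20 class mix below is an assumption for illustration, not the ratio in this dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 ham (0) and 20 spam (1).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The spam fraction is the same in both parts.
print(y_tr.mean(), y_te.mean())
```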

Why choose BernoulliNB?
1. Naive Bayes copes well with a large number of features. In our case there are 8061 features.
2. The dataset has 5572 rows, and Naive Bayes trains quickly on data of this size.
3. So Naive Bayes is a good fit here: it is fast to compute and usually gives good results on text data with this many features.
4. Since the values are not continuous, we choose MultinomialNB/BernoulliNB instead of GaussianNB.
5. We are only recording whether a word occurs in the mail, not the number of times it occurs.
So BernoulliNB, which models binary features, is a more natural choice than MultinomialNB, and the scores below agree (test accuracy of about 0.972 vs 0.951).
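For reference, the two class-conditional likelihoods in their standard textbook forms. The notation is chosen here: n is the number of words in the vocabulary, c is the class, and x_i is the entry for word i.

```latex
% Bernoulli NB: x_i \in \{0, 1\} marks whether word i occurs in the mail
P(\mathbf{x} \mid c) = \prod_{i=1}^{n} p_{ci}^{\,x_i} \, (1 - p_{ci})^{1 - x_i}

% Multinomial NB: x_i \ge 0 counts how often word i occurs in the mail
P(\mathbf{x} \mid c) \propto \prod_{i=1}^{n} \theta_{ci}^{\,x_i},
\qquad \sum_{i=1}^{n} \theta_{ci} = 1
```

The Bernoulli form has an explicit factor for every absent word (x_i = 0), which matches a 0/1 occurrence matrix. The multinomial form only uses the words that are present.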

In [20]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB().fit(X_train, y_train)

In [21]:
model.score(X_test, y_test)

0.9506726457399103

In [28]:
model.score(X_train, y_train)

0.9625308503477675

In [29]:
from sklearn.naive_bayes import BernoulliNB
model = BernoulliNB().fit(X_train, y_train)

In [30]:
model.score(X_test, y_test)

0.9721973094170404

In [31]:
model.score(X_train, y_train)

0.9829481714157505
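Accuracy alone can hide how many spam mails are missed when most mails are ham. A minimal sketch of looking at the confusion matrix, precision and recall as well, on made-up labels and predictions (not results from this notebook):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical true labels and predictions (0 = ham, 1 = spam).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)  # with string labels, pass pos_label='1'
rec = recall_score(y_true, y_pred)

print(cm)
print(acc, prec, rec)
# 0.9 1.0 0.5
```

Here the accuracy is 0.9 even though half of the spam mails are missed.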

Checking the top 20 words in ham mails and spam mails. The mail_id entry at the top of each output is just the sum of the mail ids, not a word count.

In [25]:
df2['v1'] = df['v1']
hams = df2[df2['v1'] == 'ham']
spams = df2[df2['v1'] == 'spam'] 

In [26]:
hams = hams.drop('v1', axis=1)
hams.sum().sort_values(ascending=False)[:20]

mail_id    13478260
u               724
get             554
go              464
im              420
come            296
ok              265
call            265
2               257
good            245
know            241
dont            238
ill             236
like            224
ltgt            214
time            206
want            203
say             203
love            201
day             199
dtype: int64

In [27]:
spams = spams.drop('v1', axis=1)
spams.sum().sort_values(ascending=False)[:20]

mail_id    2042546
call           326
free           166
txt            153
2              134
u              125
text           121
ur             114
4              110
claim          109
mobil          109
stop            94
get             91
repli           91
prize           86
send            78
new             71
win             62
urgent          62
cash            61
dtype: int64
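To keep `mail_id` out of the ranking, it can be dropped together with `v1` before summing. A minimal sketch on a made-up toy frame (`toy` and `top_spam` are hypothetical names):

```python
import pandas as pd

# Toy stand-in for df2 with the v1 label column attached (hypothetical data).
toy = pd.DataFrame({
    "mail_id": [0, 1, 2, 3],
    "free": [0, 1, 1, 0],
    "ok": [1, 0, 0, 1],
    "call": [1, 1, 0, 0],
    "v1": ["ham", "spam", "spam", "ham"],
})

# Drop the label and the identifier so that only word columns are summed.
spams = toy[toy["v1"] == "spam"].drop(columns=["v1", "mail_id"])
top_spam = spams.sum().sort_values(ascending=False)
print(top_spam)
```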

1. I wanted to use parallelization to run the mapping function which creates the mapping matrix dataframe.
2. But I could not do it successfully on Windows.
3. When I ran it from a text editor, the program ran, but I could only see one process ID.

The parallelization code used was:

    import os
    from multiprocessing import Process
    import workers  # separate module that defines mapping_parallelize

    def info(title):
        print(title)
        print('module name:', __name__)
        if hasattr(os, 'getppid'):  # only available on Unix
            print('parent process:', os.getppid())
        print('process id:', os.getpid())

    def f(name):
        info('function f')
        print('hello', name)

    if __name__ == '__main__':
        info('main line')
        p = Process(target=workers.mapping_parallelize, args=(df,))
        p.start()
        p.join()
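A single `Process` runs the whole job in one child process, so nothing is split up. One possible way to spread the work is to encode chunks of mails in a process pool. This is only a sketch under assumptions: `encode_chunk` and `parallel_encode` are hypothetical helpers, not the `workers.mapping_parallelize` function used above, and the chunk size is arbitrary.

```python
from concurrent.futures import ProcessPoolExecutor

def encode_chunk(args):
    """Turn one chunk of word lists into 0/1 rows over a fixed vocabulary."""
    word_lists, vocabulary = args
    index = {word: i for i, word in enumerate(vocabulary)}
    rows = []
    for words in word_lists:
        row = [0] * len(vocabulary)
        for word in words:
            row[index[word]] = 1
        rows.append(row)
    return rows

def parallel_encode(word_lists, vocabulary, workers=2, chunk_size=1000):
    """Encode the chunks in separate processes and join the rows in order."""
    chunks = [
        (word_lists[i:i + chunk_size], vocabulary)
        for i in range(0, len(word_lists), chunk_size)
    ]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(encode_chunk, chunks)
        return [row for chunk_rows in results for row in chunk_rows]
```

On Windows, child processes are started fresh and re-import the main module, so `parallel_encode` should be called from under an `if __name__ == '__main__':` guard in a script, with both functions defined in an importable module rather than in a notebook cell.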

