# Naive Bayes (the easy way)

We'll cheat by using sklearn.naive_bayes to train a spam classifier! Most of the code is just loading our training data into a pandas DataFrame that we can play with:

In [1]:
import os
import io
import numpy
import pandas as pd
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

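def readFiles(path):
    # Walk the directory tree and yield (file path, message body) for every file
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)

            inBody = False
            lines = []
            # latin1 can decode any byte sequence, so odd characters won't raise an error
            with io.open(filepath, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        # The first blank line ends the headers; the body follows
                        inBody = True
            message = '\n'.join(lines)
            yield filepath, message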


def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

data = DataFrame({'message': [], 'class': []})

data = pd.concat([data, dataFrameFromDirectory("emails/spam", "spam")])
data = pd.concat([data, dataFrameFromDirectory("emails/ham", "ham")])

#Alternative for pandas older than 2.0 (DataFrame.append was removed in pandas 2.0):
#data = data.append(dataFrameFromDirectory('emails/spam', 'spam'))
#data = data.append(dataFrameFromDirectory('emails/ham', 'ham'))


Let's have a look at that DataFrame:

In [2]:
data.head()

| | message | class |
|---|---|---|
| emails/spam/00249.5f45607c1bffe89f60ba1ec9f878039a | Dear Homeowner,\n\n \n\nInterest Rates are at ... | spam |
| emails/spam/00373.ebe8670ac56b04125c25100a36ab0510 | ATTENTION: This is a MUST for ALL Computer Use... | spam |
| emails/spam/00214.1367039e50dc6b7adb0f2aa8aba83216 | This is a multi-part message in MIME format.\n... | spam |
| emails/spam/00210.050ffd105bd4e006771ee63cabc59978 | IMPORTANT INFORMATION:\n\n\n\nThe new domain n... | spam |
| emails/spam/00033.9babb58d9298daa2963d4f514193d7d6 | This is the bottom line. If you can GIVE AWAY... | spam |


Now we will use a CountVectorizer to turn each message into a vector of word counts, and feed those counts into a MultinomialNB classifier. Call fit() and we've got a trained spam filter ready to go! It's just that easy.

In [3]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

MultinomialNB()
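
To make the word-count step more concrete, here is a minimal sketch on a made-up three-message corpus. The names `toy_messages`, `toy_vectorizer` and `toy_counts` are illustrative and not part of the notebook above; the sketch assumes CountVectorizer's default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for illustration
toy_messages = ["free money now", "meeting at noon", "free free offer"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)  # sparse matrix: one row per message

# vocabulary_ maps each learned word to its column index
vocab = toy_vectorizer.vocabulary_
dense = toy_counts.toarray()
free_col = vocab["free"]

print(sorted(vocab))       # the words the vectorizer learned
print(dense[:, free_col])  # how often "free" appears in each message

# Words never seen during fit_transform are silently ignored by transform
unseen = toy_vectorizer.transform(["free pizza"])
print(unseen.sum())        # only "free" is counted
```

The last step matters for the spam filter: any word in a new email that never appeared in the training emails contributes nothing to the prediction.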

Let's try it out:

In [4]:
examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham'], dtype='<U4')
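
Besides hard labels, MultinomialNB can also report how confident it is through `predict_proba`. The following is a small sketch on made-up data (the `toy_*` names and messages are hypothetical, not the email data set used above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data, only for illustration
toy_messages = [
    "free prize claim now",
    "win free cash",
    "lunch at noon tomorrow",
    "project meeting tomorrow",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

new_messages = ["free cash prize", "meeting at lunch"]
probabilities = toy_classifier.predict_proba(toy_vectorizer.transform(new_messages))

# Columns follow classifier.classes_, which is sorted alphabetically
for message, row in zip(new_messages, probabilities):
    scores = dict(zip(toy_classifier.classes_, row.round(3)))
    print(message, scores)
```

Looking at the probabilities rather than only the label can help when a message sits close to the decision boundary.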

## Activity

Our data set is small, so our spam classifier isn't actually very good. Try running some different test emails through it and see if you get the results you expect.

If you really want to challenge yourself, try applying train/test to this spam classifier - see how well it can predict some subset of the ham and spam emails.

In [28]:
#Let's try running some new test data
examples = ['Free bitcoin just send ssn!!!', 'I am a Nigerian prince and need help moving my family to America', 'Are you still available for dinner tonight?']
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham', 'ham'], dtype='<U4')

In [39]:
#We can see that the classifier is not that accurate, as it mislabelled the second test email (the "Nigerian prince" message) as "ham"
#Now let's split the data into train and test sets to measure the classifier's accuracy quantitatively
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2)
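
The split above is random, so the numbers that follow will vary from run to run. As a hedged sketch of one possible variation (not what was run here), `train_test_split` accepts `random_state` for reproducibility and `stratify` to keep the spam/ham ratio the same in both sets. The `toy` DataFrame below is made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 5 spam and 15 ham messages
toy = pd.DataFrame({
    "message": [f"message {i}" for i in range(20)],
    "class": ["spam"] * 5 + ["ham"] * 15,
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.2,
    stratify=toy["class"],  # keep the class proportions in both sets
    random_state=42,        # fixed seed so the split is repeatable
)

print(toy_train["class"].value_counts())
print(toy_test["class"].value_counts())
```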

In [40]:
train.head()

| | message | class |
|---|---|---|
| emails/spam/00343.37d895b3a54847548875136ad6b0192d | Are you tired of spending a fortune on printer... | spam |
| emails/ham/02166.13587bcaeaf8c5c3c8b234e4ab90bc40 | URL: http://www.newsisfree.com/click/-2,842319... | ham |
| emails/ham/00156.d4205c868d67d1c334455950b21d226d | http://www.guardian.co.uk/religion/Story/0,276... | ham |
| emails/ham/00414.c2dee68136358ceec7d235d03185822b | Of course we've had select() since BSD 4.2 and... | ham |
| emails/ham/01013.5e43d991c292968615961b1ccfc378b3 | Date: Fri, 13 Sep 2002 11:26:30 +10... | ham |


In [41]:
test.head()

| | message | class |
|---|---|---|
| emails/ham/00683.7d0233053c9421fe2bf29e23a4aa5b18 | \n\nIt was the best of times, it was the worst... | ham |
| emails/ham/02404.da865baa8492a392d7a5a035f0d3b7a0 | URL: http://diveintomark.org/archives/2002/10/... | ham |
| emails/ham/00768.ce54703f83eb3bf04a04f665dbf55e96 | \n\nYes, it is. You just want to be called a ... | ham |
| emails/ham/00803.a9faabf181ecae3ece9f7003a005aeea | >>>>> "B" == Bill Humphries <bill@whump.com> w... | ham |
| emails/ham/00085.badc533c7037554017afb30c94dfcb55 | On Mon, 2 Sep 2002, Russell Turpin wrote:\n\n\... | ham |


In [42]:
#Create a vectorizer to count the words used in the training emails (fit on the training set only)
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train['message'].values)

#Create classifier, train it with the training data
classifier = MultinomialNB()
targets = train['class'].values
classifier.fit(counts, targets)

MultinomialNB()
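
An alternative way to organize these two steps, shown here only as a sketch on made-up data, is scikit-learn's `make_pipeline`, which bundles the vectorizer and the classifier into one object. Fitting the pipeline fits the vectorizer on the training messages only, and `predict` applies the same vocabulary to new text. The names `spam_filter`, `train_messages` and `train_labels` are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, only for illustration
train_messages = [
    "free prize claim now",
    "win free cash",
    "lunch at noon tomorrow",
    "project meeting tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

# One object that vectorizes and then classifies
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(train_messages, train_labels)

# Raw strings go straight in; no separate transform() call is needed
toy_predictions = spam_filter.predict(["free cash prize", "meeting at lunch"])
print(toy_predictions)
```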

In [43]:
#Use the test data to calculate how accurate the classifier is
from sklearn.metrics import accuracy_score
testdata = test['message']
testdata_counts = vectorizer.transform(testdata)
predictions = classifier.predict(testdata_counts)
accuracy_score(test['class'].values, predictions)

0.965

In [45]:
#We can see that the classifier is very accurate on the held-out test data (about 96.5%). However, our manual
#examples showed that it is not correct all the time
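
A single accuracy number does not say which kind of mistake the classifier makes. As a sketch with made-up labels (not the results from the emails above), a confusion matrix and per-class recall separate "spam that slipped through" from "ham flagged as spam":

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 8 ham and 2 spam, with one spam email missed
y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 8 + ["ham", "spam"]

acc = accuracy_score(y_true, y_pred)

# Rows are the true classes, columns the predicted classes, in the order given by labels
matrix = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])

# Fraction of the actual spam emails that were caught
spam_recall = recall_score(y_true, y_pred, pos_label="spam")

print(acc)
print(matrix)
print(spam_recall)
```

In this toy case the accuracy is high even though half of the spam is missed, which is the kind of gap the same calls on `test['class'].values` and `predictions` could reveal.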