In [84]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np
import io

## Naive Bayes Classifier Demo

Building our own spam filter takes a few steps. We'll use a Naive Bayes classifier, one of the most commonly used classifiers for this task.

The first thing we need to do is find a dataset! I've included one that I found here with the notebook.

In [77]:
with io.open("english_big.txt", encoding='cp1252') as f:
    data = f.read().splitlines()

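sents = []
labels = []
for line in data:
    # Split on the last comma only, since the message text may itself contain commas
    splits = line.rsplit(',', 1)
    sents.append(splits[0])   # message text
    labels.append(splits[1])  # label, e.g. 'spam'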

df2 = pd.DataFrame({'v1': labels, 'v2': sents}).dropna()
print(df2.head())

     v1                                                 v2
0  spam  Urgent! call 09061749602 from Landline. Your c...
1  spam  +449071512431 URGENT! This is the 2nd attempt ...
2  spam  FREE for 1st week! No1 Nokia tone 4 ur mob eve...
3  spam  Urgent! call 09066612661 from landline. Your c...
4  spam  WINNER!! As a valued network customer you have...


## Transforming Data

Now that we have our dataset, we need to transform both columns, the sentences and the labels, into numbers that our classifier can interpret.

For the labels (spam vs. ham), it's pretty straightforward: we'll represent spam with a 0 and not-spam, or ham, with a 1.

In [78]:
def transform(labels):
    # Map 'spam' to 0 and everything else (ham) to 1.
    # The comparison is vectorized and keeps the original index,
    # so the result lines up with df2 when assigned back.
    return (labels != 'spam').astype(int)

df2['v1'] = transform(df2['v1'])
print(df2.head())

  v1                                                 v2
0  0  Urgent! call 09061749602 from Landline. Your c...
1  0  +449071512431 URGENT! This is the 2nd attempt ...
2  0  FREE for 1st week! No1 Nokia tone 4 ur mob eve...
3  0  Urgent! call 09066612661 from landline. Your c...
4  0  WINNER!! As a valued network customer you have...


Next, for the X values (the messages), we need to convert each sentence into values our classifier can interpret. To do this, we'll use an approach called **bag of words**.

### Bag of Words

Bag of words generates a frequency vector for each sentence, recording how many times each word occurs in it. As a quick example, consider the following sentences:

1. Jon Jon likes ice cream.
2. Jeff likes to scream.

Bag of words will convert these two sentences into the following:

1. {Jon: 2, likes: 1, ice: 1, cream: 1}
2. {Jeff: 1, likes: 1, to: 1, scream: 1}

This is then further converted into a vector over all possible words, here in the order (Jon, likes, ice, cream, Jeff, to, scream):

1. [2, 1, 1, 1, 0, 0, 0]
2. [0, 1, 0, 0, 1, 1, 1]
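
The conversion above can be sketched in a few lines of plain Python. This is only an illustration: the `tokenize` helper is a hypothetical, deliberately naive tokenizer, and the vocabulary is kept in order of first appearance to match the vectors shown. sklearn's vectorizers lowercase the text and order the vocabulary differently, so their output will not look identical.

```python
from collections import Counter

def tokenize(sentence):
    # Naive tokenizer for illustration: drop the final period, split on spaces.
    return sentence.rstrip('.').split()

sentences = ["Jon Jon likes ice cream.", "Jeff likes to scream."]

# Step 1: one word-count dictionary per sentence.
counts = [Counter(tokenize(s)) for s in sentences]

# Step 2: build the vocabulary in order of first appearance.
vocab = []
for s in sentences:
    for word in tokenize(s):
        if word not in vocab:
            vocab.append(word)

# Step 3: one fixed-length vector per sentence (a Counter returns 0 for missing words).
vectors = [[c[word] for word in vocab] for c in counts]

print(vocab)    # ['Jon', 'likes', 'ice', 'cream', 'Jeff', 'to', 'scream']
print(vectors)  # [[2, 1, 1, 1, 0, 0, 0], [0, 1, 0, 0, 1, 1, 1]]
```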

Finally, we need to ensure that these weights are normalized, since longer sentences will otherwise have higher weights. We can use a concept called tf-idf to do this. The term-frequency part normalizes the counts inside a sentence (so the first sentence above, which has five words, would get frequencies of 2/5, 1/5, etc.). Tf-idf also considers the inverse document frequency, which down-weights words that appear in many documents. For more information on tf-idf, take a look here: http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/.
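
To make the idea concrete, here is a minimal sketch of the textbook tf-idf formula applied to the two example sentences. It assumes tf = count / sentence length and idf = ln(N / df); the function names are made up for this illustration. sklearn's `TfidfVectorizer` uses a smoothed idf and L2 normalization by default, so its numbers will differ from these.

```python
import math

# The two example sentences, already tokenized.
docs = [["Jon", "Jon", "likes", "ice", "cream"],
        ["Jeff", "likes", "to", "scream"]]

def tf(word, doc):
    # Term frequency: share of the sentence taken up by this word.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Inverse document frequency: log of (number of docs / docs containing the word).
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

jon_weight = tfidf("Jon", docs[0], docs)      # (2/5) * ln(2/1)
likes_weight = tfidf("likes", docs[0], docs)  # (1/5) * ln(2/2)

print(round(jon_weight, 3))  # 0.277
print(likes_weight)          # 0.0
```

Note how "likes", which appears in both sentences, ends up with a weight of zero: it carries no information for telling the two apart.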

## Classification

After running the TfidfVectorizer on our dataset, we can split it into training and test sets. From there, we will use the Naive Bayes classifier provided by the sklearn library.

We'll specifically be using the MultinomialNB classifier, as our features are based on word counts. (Strictly speaking, tf-idf weights are fractional rather than integer counts, but MultinomialNB works well with them in practice.) If we were dealing with continuous features (like the fraction of words in all caps, for example), we could use a GaussianNB, which would model each feature's likelihood as a Gaussian distribution.

Since our features are count-based, we only need to consider the per-class word probabilities directly.
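
As a rough sketch of what a multinomial Naive Bayes classifier does with those probabilities, the snippet below trains on raw word counts with Laplace smoothing and picks the class with the highest log score. The toy messages, the `train` and `predict` names, and the choice to skip unseen words are all assumptions made for illustration; this is not sklearn's implementation.

```python
import math
from collections import Counter

def train(docs, labels, alpha=1.0):
    # Estimate log P(class) and log P(word | class) with Laplace smoothing.
    vocab = {w for d in docs for w in d}
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict(doc, log_prior, log_lik):
    # Score each class as log prior + sum of word log likelihoods.
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped (an assumption of this sketch).
        scores[c] = log_prior[c] + sum(log_lik[c][w] for w in doc if w in log_lik[c])
    return max(scores, key=scores.get)

# Hypothetical toy data, using the same encoding as above: 0 = spam, 1 = ham.
toy_docs = [["free", "prize", "call"],
            ["urgent", "free", "call"],
            ["see", "you", "tonight"],
            ["call", "me", "tonight"]]
toy_labels = [0, 0, 1, 1]

log_prior, log_lik = train(toy_docs, toy_labels)
spam_guess = predict(["free", "prize"], log_prior, log_lik)
ham_guess = predict(["see", "you", "tonight"], log_prior, log_lik)
print(spam_guess)  # 0
print(ham_guess)   # 1
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on long messages.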

In [87]:
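# Ignore common English stop words like "but", "so", etc.
vec = TfidfVectorizer(stop_words='english')
X_total = vec.fit_transform(df2['v2'])       # sparse matrix: messages x vocabulary
y_total = np.array(df2['v1']).astype('int')  # 0 = spam, 1 = ham
print(X_total.shape)
print(len(y_total))
# Hold out 30% of the data for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X_total, y_total, test_size=0.3, random_state=420)
model = MultinomialNB()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on the held-out test set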



(1324, 3294)
1324
0.982412060302


## A note about Accuracy

98%! This is a perfect example of why accuracy can be misleading. We've constructed a classifier that works well on our given dataset; however, we used only 1,324 data points. In practice, this is far too few, as roughly a thousand entries can't accurately represent all kinds of spam.

In the specific case of this type of spam, however, we've achieved pretty high accuracy with a relatively simple classifier!
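
Dataset size is one reason to be cautious about a single accuracy number; class balance is another. The sketch below uses made-up labels (not taken from this dataset, whose class balance isn't reported here) to show how a classifier that never predicts spam can still score high accuracy while catching no spam at all. The `confusion_counts` helper is a hypothetical name for this illustration.

```python
def confusion_counts(y_true, y_pred, positive=0):
    # Count true/false positives and negatives, treating `positive` (spam = 0) as the target class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

# Hypothetical labels: 2 spam (0) and 8 ham (1) messages.
y_true_toy = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
# A useless classifier that always answers "ham".
y_pred_toy = [1] * 10

tp, fp, fn, tn = confusion_counts(y_true_toy, y_pred_toy)
accuracy = (tp + tn) / len(y_true_toy)
spam_recall = tp / (tp + fn) if (tp + fn) else 0.0

print(accuracy)     # 0.8
print(spam_recall)  # 0.0
```

Reporting per-class recall or a confusion matrix alongside accuracy makes this kind of failure visible.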