## Spam Words Detection using Naive Bayes

**Outline**

* [Introduction and dataset](#data)
* [Feature extraction](#feats)
* [Model train and predict](#model)
* [Further examination for insights](#eda)
* [References](#ref)

In [45]:
import os, glob
import pandas as pd
import numpy as np

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer #leaving only the word stem
from nltk import pos_tag

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score, make_scorer, confusion_matrix

## <a id="data">Introduction and Dataset</a>

This project uses the [SMS Spam Collection dataset from UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).

The goal is to experiment with different techniques and train a Naive Bayes classifier that detects whether an SMS message is spam or ham (legitimate).

**1. Read in the data**

In [15]:
corpus_folder = './data/smsspamcollection'
col_features =  ['label', 'sms_message']

df = pd.read_table(corpus_folder+'/SMSSpamCollection', 
                  sep='\t',
                  header=None,
                  names=col_features)
print(df.shape, '\n', 
      df.label.value_counts())
df.head()

(5572, 2) 
 ham     4825
spam     747
Name: label, dtype: int64


,label,sms_message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


**1.1 Pre-process the label column into numerical values**

In [17]:
# convert label to a numerical variable
df['label_num'] = df.label.map({'ham':0, 'spam':1})
df.label_num.value_counts()

0    4825
1     747
Name: label_num, dtype: int64

## <a id="feats">Feature extraction</a>

**2. Train/test split**

In [23]:
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'],
                                                    df['label_num'],
                                                    random_state=1)

print ("Original set contains", df.shape[0], "observations")
print ("Training set contains", X_train.shape[0]/df.shape[0]*100, "% of observations")
print ("Testing set contains", X_test.shape[0]/df.shape[0]*100, "% of observations")

Original set contains 5572 observations
Training set contains 75.0 % of observations
Testing set contains 25.0 % of observations
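
The split above relies on the defaults of `train_test_split`, which hold out 25% of the rows and do not stratify. Because only about 13% of the messages are spam, one option is to pass `stratify=` so that both parts keep a similar class ratio. The sketch below illustrates this on made-up data; the `toy` frame and its contents are assumptions, and this is not the split the notebook actually ran.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with roughly the same imbalance as the SMS corpus
toy = pd.DataFrame({
    'sms_message': ['msg %d' % i for i in range(100)],
    'label_num': [1] * 13 + [0] * 87,   # 13% spam, 87% ham
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy['sms_message'],
    toy['label_num'],
    test_size=0.25,              # the default, written out explicitly
    stratify=toy['label_num'],   # keep the spam share similar in both parts
    random_state=1)

print("train/test sizes:", len(X_tr), len(X_te))
print("spam share in train:", y_tr.mean())
print("spam share in test:", y_te.mean())
```

Whether stratification changes the results on the real dataset is not something the notebook measures, so treat it as a variation to try.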


**2.1 Feature vectorization of training and testing data**

In [35]:
# instantiate the vectorizer
vectorizer = CountVectorizer(lowercase=True) #count word frequency

# learn training data vocabulary, then use it to create a document-term matrix
X_train_dtm = vectorizer.fit_transform(X_train)

# transform test data using fitted vocabulary into document-term matrix
X_test_dtm = vectorizer.transform(X_test)

# examine the document-term matrix, the feature count should be the same 
print(X_train_dtm.shape)
print(X_test_dtm.shape)

(4179, 7456)
(1393, 7456)
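
To see what the document-term matrix contains, it can help to run the same two steps on a handful of short strings. This is a minimal sketch on invented messages (`toy_train` and `toy_test` are assumptions, not rows from the dataset). It shows that `fit_transform` learns the vocabulary from the training text only, and that `transform` silently drops test tokens that were never seen during fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented messages for illustration only
toy_train = ['Free prize now', 'call me now', 'free free call']
toy_test = ['free lunch now']   # 'lunch' is not in the training vocabulary

toy_vec = CountVectorizer(lowercase=True)
toy_train_dtm = toy_vec.fit_transform(toy_train)   # learn vocabulary + count
toy_test_dtm = toy_vec.transform(toy_test)         # count with the fitted vocabulary

# Columns are ordered alphabetically by token
print(sorted(toy_vec.vocabulary_))
print(toy_train_dtm.toarray())
print(toy_test_dtm.toarray())
```

Both matrices have one column per training token, which is why the train and test shapes above agree in their second dimension.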


## <a id="model">Model train and predict</a>

**3. Train the model**

In [26]:
# instantiate a Multinomial Naive Bayes model
nb = MultinomialNB()

# using X_train_dtm (timing it with an IPython "magic command")
%time nb.fit(X_train_dtm, y_train)

CPU times: user 3 ms, sys: 722 µs, total: 3.73 ms
Wall time: 3.19 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
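
The `alpha=1.0` in the output above is the additive (Laplace) smoothing parameter. As a rough illustration of what fitting does, the sketch below trains the same model type on a tiny invented count matrix (`X_toy` and `y_toy` are assumptions) and checks that the fitted per-class token log-probabilities match the smoothed ratio (count + alpha) / (class total + alpha * number of tokens).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical document-term matrix: 4 messages, 3 tokens
X_toy = np.array([[2, 0, 1],
                  [1, 0, 0],
                  [0, 3, 1],
                  [0, 2, 0]])
y_toy = np.array([0, 0, 1, 1])   # 0 = ham, 1 = spam

alpha = 1.0
toy_nb = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Per-class token counts accumulated during fit
counts = toy_nb.feature_count_

# Smoothed estimate of P(token | class), computed by hand
class_totals = counts.sum(axis=1, keepdims=True)
smoothed = (counts + alpha) / (class_totals + alpha * counts.shape[1])

# Compare with what the fitted model stores
matches = np.allclose(np.log(smoothed), toy_nb.feature_log_prob_)
print(matches)
```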

In [28]:
X_train_dtm.shape

(4179, 7456)

In [29]:
X_test_dtm.shape

(1393, 7456)

**4. Make class predictions for X_test_dtm**

In [70]:
y_pred_class = nb.predict(X_test_dtm)
print("Accuracy: ", accuracy_score(y_test, y_pred_class))

y_pred_prob = nb.predict_proba(X_test_dtm)[:,1]
      
print("AUC: ", roc_auc_score(y_test, y_pred_prob))

Accuracy:  0.9885139985642498
AUC:  0.9866431000536962


**5. Model evaluation**

The trained model reaches about 98.9% accuracy on the test set, compared with a null accuracy of about 86.7% from always predicting the majority class (ham).

In [68]:
# calculate null accuracy: the accuracy obtained by always predicting the majority class
y_test.value_counts()
null_accuracy = y_test.value_counts()[0]/y_test.shape[0]
null_accuracy

0.8671931083991385

In [71]:
y_test.value_counts()

0    1208
1     185
Name: label_num, dtype: int64
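
The lookup `value_counts()[0]` works here because label 0 (ham) happens to be the majority class. A variant that does not depend on which label is the majority is sketched below. `y_demo` is a stand-in series rebuilt from the class counts printed above, not the actual `y_test` object.

```python
import pandas as pd

# Stand-in for y_test, rebuilt from the counts shown above (1208 ham, 185 spam)
y_demo = pd.Series([0] * 1208 + [1] * 185, name='label_num')

# Share of the most frequent class, whichever label that is
null_acc = y_demo.value_counts(normalize=True).max()
print(null_acc)
```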

In [76]:
# print the confusion matrix
#confusion_matrix(y_test, y_pred_class, labels=[])

print(pd.DataFrame(confusion_matrix(y_test, y_pred_class), 
                   index=['true:not spam', 'true:spam'], 
                   columns=['pred:not spam', 'pred:spam']))


               pred:not spam  pred:spam
true:not spam           1203          5
true:spam                 11        174
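
Accuracy alone hides how the errors are distributed between the two classes. Reading the matrix above with rows as true classes and columns as predicted classes (the scikit-learn convention), precision and recall for the spam class can be worked out directly from the four cells. The sketch below hard-codes the cell values printed above; the variable names are illustrative.

```python
# Cells of the confusion matrix printed above
tn, fp = 1203, 5    # true ham:  predicted ham, predicted spam
fn, tp = 11, 174    # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of messages flagged as spam, how many are spam
recall = tp / (tp + fn)      # of actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print("accuracy: ", round(accuracy, 4))
print("precision:", round(precision, 4))
print("recall:   ", round(recall, 4))
print("f1:       ", round(f1, 4))
```

The accuracy computed this way agrees with the `accuracy_score` value reported earlier, which is a quick check that the matrix is being read in the right orientation.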


In [77]:
# print message text for the false positives (ham incorrectly classified as spam)
X_test[y_pred_class > y_test]

574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: sms_message, dtype: object
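
The same comparison trick can list the false negatives (spam that was classified as ham) by flipping the inequality. The sketch below demonstrates both masks on a tiny invented example; `msgs`, `y_true` and `y_hat` are hypothetical stand-ins for `X_test`, `y_test` and `y_pred_class`.

```python
import numpy as np
import pandas as pd

# Invented stand-ins for X_test, y_test and y_pred_class
msgs = pd.Series(['win a prize', 'see you soon', 'call now to claim', 'ok lar'],
                 index=[10, 11, 12, 13])
y_true = pd.Series([1, 0, 1, 0], index=msgs.index)   # 1 = spam, 0 = ham
y_hat = np.array([1, 1, 0, 0])                       # model predictions

# predicted 1 but truly 0: ham flagged as spam
false_pos = msgs[y_hat > y_true]

# predicted 0 but truly 1: spam that slipped through
false_neg = msgs[y_hat < y_true]

print(false_pos)
print(false_neg)
```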

## <a id="eda">Further examination for insights</a>

**Calculate spamminess of each token, and spam/ham ratio**

In [84]:
nb.feature_count_.shape

(2, 7456)

In [80]:
# number of times each token appears across all HAM messages
ham_token_count = nb.feature_count_[0,:]
ham_token_count

array([0., 0., 0., ..., 1., 1., 1.])

In [88]:
# number of times each token appears across all SPAM messages
spam_token_count = nb.feature_count_[1,:]
spam_token_count

array([ 5., 23.,  2., ...,  0.,  0.,  0.])

In [89]:
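# get_feature_names_out() is the current scikit-learn API; releases before 1.0 used get_feature_names()
X_train_tokens = vectorizer.get_feature_names_out()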

# create a DataFrame of tokens with their separate ham and spam counts
tokens = pd.DataFrame({'token':X_train_tokens, 
                       'ham':ham_token_count, 
                       'spam':spam_token_count}).set_index('token')
tokens.head()

token,ham,spam
0,0.0,5.0
0,0.0,23.0
8704050406,0.0,2.0
121,0.0,1.0
1223585236,0.0,1.0


In [90]:
# examine 5 random DataFrame rows
# random_state=6 is a seed for reproducibility
tokens.sample(5, random_state=6)

token,ham,spam
very,64.0,2.0
nasty,1.0,1.0
villa,0.0,1.0
beloved,1.0,0.0
textoperator,0.0,2.0


In [92]:
nb.class_count_

array([3617.,  562.])

In [93]:
# add 1 to ham and spam counts to avoid dividing by 0
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1
tokens.sample(5, random_state=6)

token,ham,spam
very,65.0,3.0
nasty,2.0,2.0
villa,1.0,2.0
beloved,2.0,1.0
textoperator,1.0,3.0


In [94]:
# convert the ham and spam counts into frequencies
tokens['ham_freq'] = tokens.ham / nb.class_count_[0]
tokens['spam_freq'] = tokens.spam / nb.class_count_[1]
tokens.sample(5, random_state=6)

token,ham,spam,ham_freq,spam_freq
very,65.0,3.0,0.017971,0.005338
nasty,2.0,2.0,0.000553,0.003559
villa,1.0,2.0,0.000276,0.003559
beloved,2.0,1.0,0.000553,0.001779
textoperator,1.0,3.0,0.000276,0.005338


In [95]:
# calculate the ratio of spam-to-ham for each token
tokens['spam_ratio'] = tokens.spam_freq / tokens.ham_freq
tokens.sample(5, random_state=6)

token,ham,spam,ham_freq,spam_freq,spam_ratio
very,65.0,3.0,0.017971,0.005338,0.297044
nasty,2.0,2.0,0.000553,0.003559,6.435943
villa,1.0,2.0,0.000276,0.003559,12.871886
beloved,2.0,1.0,0.000553,0.001779,3.217972
textoperator,1.0,3.0,0.000276,0.005338,19.307829


In [97]:
# examine the DataFrame sorted by spam_ratio
tokens.sort_values('spam_ratio', ascending=False)[:5]

token,ham,spam,ham_freq,spam_freq,spam_ratio
claim,1.0,89.0,0.000276,0.158363,572.798932
prize,1.0,76.0,0.000276,0.135231,489.131673
150p,1.0,49.0,0.000276,0.087189,315.36121
tone,1.0,48.0,0.000276,0.085409,308.925267
guaranteed,1.0,43.0,0.000276,0.076512,276.745552
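
The steps above (add 1 to each count, divide by the number of messages in each class, then take the spam-to-ham ratio) can be collected into one helper. The function name `spam_ratio_table` and its signature are hypothetical. The example call reuses the raw counts for two tokens and the class counts shown earlier, so this is a sketch of the same arithmetic rather than a replacement for the cells above.

```python
import pandas as pd

def spam_ratio_table(ham_counts, spam_counts, tokens, n_ham, n_spam):
    """Build a token table with add-one counts, per-class frequencies and spam ratio."""
    table = pd.DataFrame({'ham': ham_counts, 'spam': spam_counts},
                         index=pd.Index(tokens, name='token')).astype(float)
    table += 1                                     # avoid dividing by zero
    table['ham_freq'] = table['ham'] / n_ham       # relative to number of ham messages
    table['spam_freq'] = table['spam'] / n_spam    # relative to number of spam messages
    table['spam_ratio'] = table['spam_freq'] / table['ham_freq']
    return table.sort_values('spam_ratio', ascending=False)

# Raw counts for 'very' (64 ham, 2 spam) and 'claim' (0 ham, 88 spam),
# with the class counts from nb.class_count_ (3617 ham, 562 spam messages)
demo = spam_ratio_table([64, 0], [2, 88], ['very', 'claim'],
                        n_ham=3617, n_spam=562)
print(demo)
```

Note that a ratio built on a smoothed ham count of 1 is dominated by the class sizes, so very large values mainly say that a token was never seen in ham during training.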


In [99]:
# look up the spam_ratio for a given token
tokens.loc['national', 'spam_ratio']

109.41103202846975

## <a id="ref">References</a>
* [Ritchie Ng's blog on Vectorization, Multinomial Naive Bayes Classifier and Evaluation](https://www.ritchieng.com/machine-learning-multinomial-naive-bayes-vectorization/)
* [Paul Graham's classic post, A Plan for Spam](http://www.paulgraham.com/spam.html)