# SMS Spam

Another signal that we can bring in to our detection is to indicate who closely the submitted message represents SMS Spam.  To do this we will train a Naive Bayes model using a collection of spam and non-spam SMS messages.  The collection of messages are from [http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/).

In [49]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

from sklearn.externals import joblib

> The SMS Spam Collection v.1 is a public set of SMS labeled messages that have been collected for mobile phone spam research. It has one collection composed by 5,574 English, real and non-enconded messages, tagged according being legitimate (ham) or spam.

In [2]:
sms = pd.read_table('../data/SMSSpamCollection.txt', encoding='latin-1', header=None)
sms.columns = ['label', 'message']

As an initial attempt, simply vectorize the corpus of messages

In [3]:
vectorizer = TfidfVectorizer("english")
features = vectorizer.fit_transform(sms['message'].values)

Naive Bayes performs very well with text classification, but we can tweak the performance by adjusting the alpha

In [32]:
nb = MultinomialNB()
params = {
    'alpha': np.arange(0.1, 1.1, 0.1)
}
clf = GridSearchCV(nb, params)

In [33]:
clf.fit(features, sms['label'])

GridSearchCV(cv=None, error_score='raise',
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'alpha': array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [34]:
print(clf.best_estimator_)

MultinomialNB(alpha=0.10000000000000001, class_prior=None, fit_prior=True)


In [36]:
x_train, x_test, y_train, y_test = train_test_split(features, sms['label'], test_size=0.3, random_state=1)

In [46]:
clf = MultinomialNB(alpha=0.1)
clf.fit(x_train, y_train)

MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

In [38]:
predictions = clf.predict(x_test)
accuracy_score(y_test, predictions)

0.9856459330143541

In [39]:
y_true, y_pred = y_test, clf.predict(x_test)
print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

        ham       0.99      0.99      0.99      1442
       spam       0.96      0.94      0.95       230

avg / total       0.99      0.99      0.99      1672



In [47]:
f = vectorizer.transform(["please call us now to collect your prize!!"])
print(clf.predict_proba(f)[0][1])
print(np.sum(clf.predict_proba(f)))

0.870011088531
1.0


In [48]:
f = vectorizer.transform(["hello my dear, pick you up at 8?"])
print(clf.predict_proba(f)[0][1])
print(np.sum(clf.predict_proba(f)))

0.00111580525493
1.0


In [50]:
joblib.dump(vectorizer, '../cache/prod_sms_vectorizer.pkl')
joblib.dump(clf, '../cache/prod_multinomialnb.pkl')

['../cache/prod_multinomialnb.pkl']