# Spam or ham -- naive Bayes classifier

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import zero_one_loss
import matplotlib.pyplot as plt

Let's import a data set of spam and non-spam text messages, stored as a tab-separated text file. It is hosted here: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection


In [2]:
# download spam or ham dataset from
# https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
# and unpack it
!wget -q https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip
!unzip -q sms+spam+collection.zip

data = pd.read_csv("SMSSpamCollection", sep="\t", header=None, encoding="latin-1", names=["class", "sms"])
data.head()

  class                                                sms
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   class   5572 non-null   object
 1   sms     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [4]:
# if not transformed to numpy array, shuffle_index below won't work
X=np.array(data["sms"])
y=np.array(data["class"]=="spam") # spam is class 1
shuffle_index = np.random.permutation(len(y))
X, y = X[shuffle_index], y[shuffle_index]

split=4500
# split into training set (the first `split` messages) and test set (the rest)
X_train,y_train,X_test,y_test = X[:split],y[:split],X[split:],y[split:]
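
As an aside, the shuffle-and-slice split above could also be done with scikit-learn's `train_test_split`. The following is only a sketch on hypothetical toy data (`X_toy`, `y_toy` and the 80/20 split are assumptions, not what this notebook uses); `stratify` keeps the spam ratio the same in both parts, which can help with skewed classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical toy data: 100 messages, 20 of them spam (True)
X_toy = np.array(["message {}".format(i) for i in range(100)])
y_toy = np.array([True] * 20 + [False] * 80)

# stratify=y_toy keeps the spam ratio (nearly) identical in both parts;
# random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)
print(y_tr.mean(), y_te.mean())
```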

In [5]:
# ratio of spam to all
sum(y_train==1)/len(y_train)

0.13444444444444445

We see that the data set is skewed: only about 13% of all text messages are spam.
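
To put the skew in perspective: a trivial classifier that labels every message as ham is wrong only on the spam messages. A minimal sketch with made-up labels (the arrays `y_toy` and `y_baseline` are hypothetical, not the SMS data):

```python
import numpy as np

# hypothetical labels: 13 spam (True) among 100 messages, mimicking the skew
y_toy = np.array([True] * 13 + [False] * 87)

# a baseline that always predicts "ham" (False)
y_baseline = np.zeros_like(y_toy)

# the zero-one error of this baseline equals the fraction of spam
baseline_error = np.mean(y_baseline != y_toy)
print("baseline error: {:.1%}".format(baseline_error))
```

Any classifier worth using on this data should therefore have an error well below the spam fraction.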

## Bag of words

Let's turn the messages into term count vectors. To see how this works, we first only look at three messages.

In [6]:
# CountVectorizer transforms text into bag of words
from sklearn.feature_extraction.text import CountVectorizer
cvec=CountVectorizer()
cvec.fit(X[:3])

In [7]:
X[0]

'Please protect yourself from e-threats. SIB never asks for sensitive information like Passwords,ATM/SMS PIN thru email. Never share your password with anybody.'

In [8]:
cvec.transform([X[0]]).toarray()

array([[0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 2, 0, 0, 1, 1, 1, 1, 1, 0, 1,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1]])

In [9]:
cvec.vocabulary_

{'please': 18,
 'protect': 19,
 'yourself': 32,
 'from': 7,
 'threats': 26,
 'sib': 23,
 'never': 12,
 'asks': 2,
 'for': 5,
 'sensitive': 21,
 'information': 8,
 'like': 9,
 'passwords': 16,
 'atm': 3,
 'sms': 24,
 'pin': 17,
 'thru': 27,
 'email': 4,
 'share': 22,
 'your': 31,
 'password': 15,
 'with': 30,
 'anybody': 1,
 'lol': 10,
 'ok': 13,
 'forgiven': 6,
 'mm': 11,
 'am': 0,
 'on': 14,
 'the': 25,
 'way': 29,
 'to': 28,
 'railway': 20}
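
The mapping from text to count vector is perhaps easier to follow on a tiny corpus. The sketch below uses three made-up messages (`toy_msgs` is hypothetical, not taken from the data set) and lists the terms in column order next to the count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# three made-up messages (hypothetical, not from the data set)
toy_msgs = ["Win a free prize", "free free entry", "see you at lunch"]

toy_cvec = CountVectorizer()
toy_counts = toy_cvec.fit_transform(toy_msgs).toarray()

# terms sorted by their column index in the count matrix
terms = sorted(toy_cvec.vocabulary_, key=toy_cvec.vocabulary_.get)
print(terms)
print(toy_counts)  # one row per message, one column per term
```

With the default settings, text is lowercased and tokens of a single character are dropped, so "a" does not appear in the vocabulary. This is presumably also why "e-threats" shows up as `'threats'` in the vocabulary above.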

In [10]:
# we can also try to strip stop words, i.e. very common words
cvec=CountVectorizer(stop_words="english")
cvec.fit(X)
len(cvec.vocabulary_)

8480

Let's have a look at ten of the stop words. (They are stored in a set, so the order is arbitrary.)

In [11]:
list(cvec.get_stop_words())[:10]

['made',
 'herein',
 'without',
 'forty',
 'call',
 'can',
 'our',
 'by',
 'sometimes',
 'under']
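
To see what stop word removal does to a single message, one can compare the tokens produced with and without the `stop_words` option. This is only an illustrative sketch; the sentence `msg` is taken from the message displayed earlier, and the variable names are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

msg = "Never share your password with anybody"

# build_analyzer returns the function that turns a string into tokens;
# no fitting is needed for that
plain_tokens = CountVectorizer().build_analyzer()(msg)
filtered_tokens = CountVectorizer(stop_words="english").build_analyzer()(msg)

print(plain_tokens)
print(filtered_tokens)  # common words such as "your" and "with" are gone
```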

## Naive Bayes

See also https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html for working with text data

and https://scikit-learn.org/stable/modules/naive_bayes.html for naive Bayes.

We fit a naive Bayes classifier based on a multinomial model.

In [12]:
from sklearn.naive_bayes import MultinomialNB
cvec=CountVectorizer(stop_words="english")
X_train_vec=cvec.fit_transform(X_train)
naive=MultinomialNB()
naive.fit(X_train_vec,y_train)
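
What `MultinomialNB` estimates during fitting can be reproduced by hand on a toy example. The sketch below assumes the default Laplace smoothing (`alpha=1.0`) and uses a hypothetical count matrix `X_toy` with three terms; it computes class priors and smoothed term probabilities with NumPy and compares them with scikit-learn's result.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# hypothetical term counts for four messages over a three-word vocabulary
X_toy = np.array([[2, 0, 1],
                  [1, 0, 0],
                  [0, 3, 1],
                  [0, 1, 2]])
y_toy = np.array([True, True, False, False])  # True = spam

alpha = 1.0  # Laplace smoothing, the MultinomialNB default
classes = np.array([False, True])

# log prior: relative class frequencies
log_prior = np.log(np.array([np.mean(y_toy == c) for c in classes]))

# smoothed log probability of each term given the class
term_counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in classes])
smoothed = term_counts + alpha
log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# joint log score of every message under every class; pick the larger one
scores = X_toy @ log_likelihood.T + log_prior
y_manual = classes[np.argmax(scores, axis=1)]

# compare with scikit-learn
nb_toy = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)
print(np.allclose(log_likelihood, nb_toy.feature_log_prob_))
print(np.array_equal(y_manual, nb_toy.predict(X_toy)))
```

The "naive" part is the sum of per-term log probabilities in `scores`: terms are treated as independent given the class.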

Let's compute train and test error:

In [13]:
y_pred=naive.predict(cvec.transform(X_test))
y_train_pred=naive.predict(cvec.transform(X_train))
# zero_one_loss expects the true labels first, then the predictions
na_test=zero_one_loss(y_test,y_pred)
na_train=zero_one_loss(y_train,y_train_pred)
print("errors train / test: {:.1%} / {:.1%}".format(na_train,na_test))

errors train / test: 0.6% / 0.6%


Not bad for such a simple algorithm! Since the classes are imbalanced, the overall error can hide poor performance on the rare spam class, so we should also compute the confusion matrix and, from it, the *true positive rate* and the *true negative rate*. (Or *precision* and *recall* if you prefer those.)

In [14]:
from sklearn.metrics import confusion_matrix
tn,fp,fn,tp=confusion_matrix(y_test,y_pred).ravel()
tpr=tp/(tp+fn)
tnr=tn/(tn+fp)
print("true positive rate: {:.1f}% (rate of correctly classified spam among all spam)".format(tpr*100))
print("true negative rate: {:.1f}% (rate of correctly classified ham among all ham)".format(tnr*100))

true positive rate: 96.5% (rate of correctly classified spam among all spam)
true negative rate: 99.9% (rate of correctly classified ham among all ham)
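
For readers who prefer precision and recall, here is a minimal sketch of how they relate to the confusion matrix. The labels `y_true_toy` and predictions `y_pred_toy` are made up for illustration and are not the results above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# hypothetical labels and predictions (1 = spam, 0 = ham)
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # same quantity as the true positive rate

print("precision: {:.1%}, recall: {:.1%}".format(precision, recall))

# the scikit-learn helpers compute the same values
print(precision_score(y_true_toy, y_pred_toy), recall_score(y_true_toy, y_pred_toy))
```

Recall is the true positive rate under another name, while precision answers a different question: how much of what lands in the spam folder is actually spam.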
