# Spam or ham -- naive Bayes classifier

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import zero_one_loss
import matplotlib.pyplot as plt

Let's import a data set with spam and non-spam ("ham") text messages. It is stored as a tab-separated text file without a header row, so we pass the column names ourselves. You can download it at https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection


In [3]:
# spam or ham dataset from
# https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
data = pd.read_csv("spam_or_ham/SMSSpamCollection.csv", sep="\t", header=None, encoding="latin-1", names=["class", "sms"])
data.head()

|   | class | sms |
|---|-------|-----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   class   5572 non-null   object
 1   sms     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [5]:
# if not transformed to numpy array, shuffle_index below won't work
X=np.array(data["sms"])
y=np.array(data["class"]=="spam")
shuffle_index = np.random.permutation(len(y))
X, y = X[shuffle_index], y[shuffle_index]

split = 4500
# the first 4500 shuffled messages form the training set, the remaining ones the test set
X_train, y_train, X_test, y_test = X[:split], y[:split], X[split:], y[split:]

In [6]:
# fraction of spam messages in the training set
sum(y_train==1)/len(y_train)

0.13333333333333333

We see that the data set is skewed: only about 13% of the training messages are spam.
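Because of this skew, a classifier that always answers "ham" is already right most of the time, which makes it a useful yardstick for the error rates computed later. Here is a minimal sketch on made-up labels; the array `y_toy` is hypothetical and only chosen to have a similar spam share, it is not the SMS data.

```python
import numpy as np

# Hypothetical labels: 2 spam messages (True) out of 15, i.e. about 13% spam.
y_toy = np.array([True] * 2 + [False] * 13)

spam_share = y_toy.mean()            # fraction of spam messages
always_ham = np.zeros_like(y_toy)    # trivial classifier: never predicts spam
baseline_error = np.mean(always_ham != y_toy)

print("spam share: {:.1%}, baseline error: {:.1%}".format(spam_share, baseline_error))
```

The trivial classifier is wrong exactly on the spam messages, so its error equals the spam share.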

## Bag of words

Let's turn the messages into term count vectors. To see how this works, we first look at only three messages.

In [7]:
# CountVectorizer transforms text into bag of words
from sklearn.feature_extraction.text import CountVectorizer
cvec=CountVectorizer()
cvec.fit(X[:3])

CountVectorizer()

In [8]:
X[0]

"He has lots of used ones babe, but the model doesn't help. Youi have to bring it over and he'll match it up"

In [9]:
cvec.transform([X[0]]).toarray()

array([[0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 2, 1, 0, 2, 0, 0, 1, 1, 1, 0,
        1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1]])

In [10]:
cvec.vocabulary_

{'he': 12,
 'has': 10,
 'lots': 19,
 'of': 23,
 'used': 34,
 'ones': 24,
 'babe': 3,
 'but': 6,
 'the': 28,
 'model': 22,
 'doesn': 8,
 'help': 13,
 'youi': 37,
 'have': 11,
 'to': 32,
 'bring': 5,
 'it': 15,
 'over': 25,
 'and': 2,
 'll': 18,
 'match': 20,
 'up': 33,
 'leaving': 16,
 'soon': 27,
 'be': 4,
 'there': 29,
 'little': 17,
 'after': 1,
 'what': 35,
 'you': 36,
 'thinked': 30,
 'about': 0,
 'me': 21,
 'first': 9,
 'time': 31,
 'saw': 26,
 'in': 14,
 'class': 7}
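The vocabulary and the count vector fit together as follows: entry i of the vector counts how often the word with index i occurs in the message. In the output above, 'he' has index 12 and 'it' has index 15, and these are exactly the two positions that hold a 2. The following sketch makes the mapping explicit on two made-up messages; `toy`, `toy_cvec`, `counts` and `words` are hypothetical names used only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, not taken from the SMS data set.
toy = ["free prize call now", "call me now now"]

toy_cvec = CountVectorizer()
counts = toy_cvec.fit_transform(toy).toarray()

# vocabulary_ maps each word to its column index; sort the words by that index
words = sorted(toy_cvec.vocabulary_, key=toy_cvec.vocabulary_.get)

# print each message as a word -> count dictionary, skipping zero counts
for row in counts:
    print({w: int(c) for w, c in zip(words, row) if c > 0})
```

Word order within a message is lost, which is why the representation is called a bag of words.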

In [11]:
# we can also try to strip stop words, i.e. very common words
cvec=CountVectorizer(stop_words="english")
cvec.fit(X)
len(cvec.vocabulary_)

8480

Let's have a look at ten of the stop words. They are stored as a set, so the order is arbitrary.

In [13]:
list(cvec.get_stop_words())[:10]

['mostly',
 'somewhere',
 'last',
 'yet',
 'another',
 'must',
 'therein',
 'neither',
 'amongst',
 'without']
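To see what stop word removal does to a single message, we can fit one vectorizer with and one without the `stop_words="english"` option and compare the vocabularies. This is a small sketch on a made-up message; `toy`, `with_stop` and `without_stop` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical message, not taken from the SMS data set.
toy = ["you have won a prize and the cash is waiting"]

with_stop = CountVectorizer().fit(toy)
without_stop = CountVectorizer(stop_words="english").fit(toy)

# words kept in both cases vs. words dropped as stop words
kept = sorted(without_stop.vocabulary_)
dropped = sorted(set(with_stop.vocabulary_) - set(without_stop.vocabulary_))
print("kept:   ", kept)
print("dropped:", dropped)
```

Very common words such as "and" or "the" say little about whether a message is spam, so dropping them shrinks the vocabulary without losing much information.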

## Naive Bayes

See also the scikit-learn tutorial on working with text data, https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

and the scikit-learn user guide on naive Bayes, https://scikit-learn.org/stable/modules/naive_bayes.html

We fit a naive Bayes classifier based on a multinomial model.

In [17]:
from sklearn.naive_bayes import MultinomialNB
cvec=CountVectorizer(stop_words="english")
X_train_vec=cvec.fit_transform(X_train)
naive=MultinomialNB()
naive.fit(X_train_vec,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
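What does the fit compute? Roughly: for each class, the share of messages in that class (the prior) and a smoothed share of each word among all words of that class; a message is then assigned to the class with the largest log prior plus count-weighted sum of log word probabilities. The sketch below spells this out with NumPy on a made-up count matrix and checks it against `MultinomialNB`. The arrays `X_toy` and `y_toy` are hypothetical, and the hand computation is an illustration of the model, not scikit-learn's actual implementation.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 4 messages, 3 vocabulary words.
X_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 3],
                  [0, 0, 2]])
y_toy = np.array([False, False, True, True])  # True = spam

alpha = 1.0  # additive smoothing, the MultinomialNB default
classes = np.array([False, True])

# log P(class): relative class frequencies
log_prior = np.log(np.array([(y_toy == c).mean() for c in classes]))

# log P(word | class): smoothed share of each word among all words of the class
word_counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in classes])
smoothed = word_counts + alpha
log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# class score: log prior + sum over words of count * log P(word | class)
scores = X_toy @ log_likelihood.T + log_prior
manual_pred = classes[scores.argmax(axis=1)]

# the library model should agree with the hand computation
sk_pred = MultinomialNB(alpha=alpha).fit(X_toy, y_toy).predict(X_toy)
print(manual_pred, sk_pred)
```

The smoothing term `alpha` keeps a word that never occurred in one class from forcing that class's probability to zero.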

Let's compute train and test error:

In [18]:
y_pred = naive.predict(cvec.transform(X_test))
y_train_pred = naive.predict(X_train_vec)  # training messages were already vectorized above
# zero_one_loss expects the true labels first, then the predictions
na_test = zero_one_loss(y_test, y_pred)
na_train = zero_one_loss(y_train, y_train_pred)
print("errors train / test: {:.1%} / {:.1%}".format(na_train, na_test))

errors train / test: 0.6% / 1.1%
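The error reported here is the zero-one loss, i.e. the fraction of messages that are classified incorrectly (one minus the accuracy). A quick check on made-up labels, where `y_true` and `y_hat` are hypothetical arrays:

```python
import numpy as np
from sklearn.metrics import zero_one_loss

# Hypothetical labels and predictions for five messages (True = spam).
y_true = np.array([True, False, False, True, False])
y_hat = np.array([True, False, True, True, False])

loss = zero_one_loss(y_true, y_hat)   # arguments are (y_true, y_pred)
by_hand = np.mean(y_true != y_hat)    # share of wrong predictions

print(loss, by_hand)
```

One of the five predictions is wrong, so both values come out at 0.2, up to floating-point rounding.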


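Not bad for such a simple algorithm! For comparison, always predicting ham would be wrong on every spam message, i.e. on roughly 13% of the messages.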