# Naive Bayes Example

This example uses a slightly modified version of the SMS Spam data set from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The original data set is a collection of 5574 SMS messages labeled as ham or spam, stored as a tab-delimited file with the label in the first column and the message content in the second. The copy used here was edited to remove some unwanted columns and add headings; it loads as 4837 rows.



In [2]:
import pandas as pd
df = pd.read_csv('../data/sms-spam.csv', header=0, usecols=[1,2], encoding='latin-1')
print('rows and columns:', df.shape)
print(df.head())

rows and columns: (4837, 2)
   spam                                               text
0     0  Go until jurong point, crazy.. Available only ...
1     0                      Ok lar... Joking wif u oni...
2     1  Free entry in 2 a wkly comp to win FA Cup fina...
3     0  U dun say so early hor... U c already then say...
4     0  Nah I don't think he goes to usf, he lives aro...



### Text preprocessing

Before applying a machine learning algorithm, the text will be preprocessed by removing stop words and creating a tf-idf representation of the data.
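To make the tf-idf idea concrete, here is a minimal pure-Python sketch on a made-up three-message corpus with a made-up stop-word list. It uses the smoothed idf formula, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, which is how the scikit-learn documentation describes the vectorizer's default weighting. The whitespace tokenizer and the data are assumptions for illustration only.

```python
import math
from collections import Counter

# Toy corpus and a tiny stop-word list (both made up for illustration).
docs = ["free entry to win a prize", "ok see you at home", "win a free prize now"]
stop_words = {"a", "to", "at", "you"}

# Tokenize on whitespace and drop stop words.
tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each word.
df_counts = Counter(w for doc in tokenized for w in set(doc))

def tfidf_row(doc):
    """Return one L2-normalized tf-idf row over the vocabulary."""
    counts = Counter(doc)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    row = [counts[w] * (math.log((1 + n_docs) / (1 + df_counts[w])) + 1) for w in vocab]
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

matrix = [tfidf_row(doc) for doc in tokenized]
for doc, row in zip(docs, matrix):
    print(doc, [round(v, 2) for v in row])
```

Words that appear in fewer messages (such as "entry") receive a larger weight than words shared across messages (such as "free"), and most entries in each row are zero.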


In [3]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

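# NLTK already returns a list; recent scikit-learn versions expect stop_words
# to be a list (or the string 'english'), not a set
stopwords = stopwords.words('english')
vectorizer = TfidfVectorizer(stop_words=stopwords)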

In [4]:
# set up X and y
X = df.text
y = df.spam

In [5]:
# take a peek at X
X.head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: text, dtype: object

In [6]:
# look at y
y[:10]

0    0
1    0
2    1
3    0
4    0
5    1
6    0
7    0
8    1
9    1
Name: spam, dtype: int64

### Train and test sets

Split the data into train and test sets, with 20% of the data going to the test set.
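The split sizes can be checked by hand. The sketch below assumes the test count is rounded up and the remaining rows go to training, which reproduces the shapes reported in this notebook (3869 training rows and 968 test rows out of 4837). The `split_indices` helper is hypothetical and only illustrates a seeded shuffle-then-cut split; it will not reproduce the exact rows that `train_test_split` selects.

```python
import math
import random

n_samples = 4837   # rows in the data frame loaded earlier
test_size = 0.2

# Assumption: the test count is rounded up, the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

def split_indices(n, test_fraction, seed):
    """Hypothetical helper: shuffle row indices with a fixed seed, then cut."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = math.ceil(test_fraction * n)
    return indices[cut:], indices[:cut]

train_idx, test_idx = split_indices(n_samples, test_size, seed=1234)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split repeatable, which is what `random_state=1234` does for the split used here.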

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1234)

X_train.shape

(3869,)

In [8]:
# apply tfidf vectorizer
X_train = vectorizer.fit_transform(X_train)  # fit and transform the train data
X_test = vectorizer.transform(X_test)        # transform only the test data


In [9]:
# take a peek at the data
# this is a very sparse matrix because each sms message contains only a few of the 7810 vocabulary words

print('train size:', X_train.shape)
print(X_train.toarray()[:5])

print('\ntest size:', X_test.shape)
print(X_test.toarray()[:5])

train size: (3869, 7810)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

test size: (968, 7810)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


### Train the naive Bayes classifier

For this data, let's try MultinomialNB. A Multinomial Naive Bayes classifier models discrete features such as word counts. It is designed for counts, but in practice it can also be used with fractional tf-idf representations.

We use the default settings. It is worth reading the documentation to see what they mean:

- alpha: additive (Laplace) smoothing (0 for no smoothing)
- fit_prior: if True, learn priors from data; if False, use a uniform prior
- class_prior: lets you specify class priors
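As a rough illustration of what `alpha` and `fit_prior` control, here is a minimal pure-Python sketch of a multinomial naive Bayes fit on made-up token lists. It is not scikit-learn's implementation; the function names and the toy data are assumptions. With `alpha=1.0`, a word never seen in a class still gets a small nonzero probability.

```python
import math
from collections import Counter

# Toy training data (made up): token lists with labels, 1 = spam, 0 = not spam.
train = [
    (["free", "prize", "win"], 1),
    (["win", "cash", "free"], 1),
    (["see", "you", "home"], 0),
    (["call", "you", "home", "free"], 0),
]
vocab = sorted({w for tokens, _ in train for w in tokens})

def fit_multinomial_nb(data, alpha=1.0, fit_prior=True):
    """Return log priors and smoothed per-class log likelihoods."""
    classes = sorted({label for _, label in data})
    log_prior, log_lik = {}, {}
    for c in classes:
        docs_c = [tokens for tokens, label in data if label == c]
        # fit_prior=True: use class frequencies; False: use a uniform prior.
        prior = len(docs_c) / len(data) if fit_prior else 1 / len(classes)
        log_prior[c] = math.log(prior)
        counts = Counter(w for tokens in docs_c for w in tokens)
        total = sum(counts.values())
        # Additive smoothing: (count + alpha) / (total + alpha * |vocab|)
        log_lik[c] = {
            w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
    return log_prior, log_lik

def predict(tokens, log_prior, log_lik):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)

log_prior, log_lik = fit_multinomial_nb(train, alpha=1.0)
print(predict(["free", "cash", "prize"], log_prior, log_lik))
```

Here "cash" never occurs in the not-spam messages, yet its smoothed probability in that class is 1/15 rather than 0, so a single unseen word cannot zero out a whole message's score.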


In [10]:
from sklearn.naive_bayes import MultinomialNB

naive_bayes = MultinomialNB()
naive_bayes.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [11]:
# priors
import math
prior_p = sum(y_train == 1)/len(y_train)
print('prior spam:', prior_p, 'log of prior:', math.log(prior_p))

# the model prior matches the prior calculated above
naive_bayes.class_log_prior_[1]

prior spam: 0.13388472473507365 log of prior: -2.01077611244103


-2.0107761124410306

In [12]:
# what else did it learn from the data?
# the log likelihood of words given the class

naive_bayes.feature_log_prob_

array([[-9.643029  , -9.67373923, -9.47714135, ..., -9.53897898,
        -9.68907421, -6.31041976],
       [-8.23356461, -7.60523447, -9.19154209, ..., -9.19154209,
        -8.98794189, -9.19154209]])


### Evaluate on the test data

In [13]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# make predictions on the test data
pred = naive_bayes.predict(X_test)

# print confusion matrix
print(confusion_matrix(y_test, pred))


[[848   0]
 [ 32  88]]


In [14]:
# confusion matrix has this form (rows = true labels, columns = predictions),
# with spam (1) as the positive class:
#     tn   fp
#     fn   tp


In [15]:
print('accuracy score: ', accuracy_score(y_test, pred))
      
print('\nprecision score (not spam): ', precision_score(y_test, pred, pos_label=0))
print('precision score (spam): ', precision_score(y_test, pred))

print('\nrecall score: (not spam)', recall_score(y_test, pred, pos_label=0))
print('recall score: (spam)', recall_score(y_test, pred))
      
print('\nf1 score: ', f1_score(y_test, pred))

accuracy score:  0.9669421487603306

precision score (not spam):  0.9636363636363636
precision score (spam):  1.0

recall score: (not spam) 1.0
recall score: (spam) 0.7333333333333333

f1 score:  0.846153846153846
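The scores for the spam class can be recomputed by hand from the confusion matrix printed earlier. This sketch reads the matrix with rows as true labels and columns as predictions, and treats spam (1) as the positive class; the variable names are chosen for illustration.

```python
# Counts taken from the confusion matrix above (rows: true, columns: predicted).
tn, fp = 848, 0
fn, tp = 32, 88

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all messages classified correctly
precision = tp / (tp + fp)                   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)                      # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# prints: 0.9669 1.0 0.7333 0.8462
```

Precision is perfect because no ham message was flagged as spam, while the 32 missed spam messages pull recall, and therefore f1, down.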


In [16]:
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98       848
           1       1.00      0.73      0.85       120

    accuracy                           0.97       968
   macro avg       0.98      0.87      0.91       968
weighted avg       0.97      0.97      0.96       968



How good is our accuracy?

In the data set, 4199 of the 4837 messages are not spam, and the test data distribution is similar. So if we guessed not spam every time, we would have about 87% accuracy. Naive Bayes reached about 96.7% accuracy on the test set, roughly 9 points above this simple baseline, so it did learn something.

In [17]:
print('not-spam size in test data:',y_test[y_test==0].shape[0])
print('test size: ', len(y_test))
baseline = y_test[y_test==0].shape[0] / y_test.shape[0] 
print(baseline)

not-spam size in test data: 848
test size:  968
0.8760330578512396


Examine some wrong classifications.

In [18]:
y_test[y_test != pred]

4179    1
677     1
2073    1
2466    1
4721    1
3144    1
1788    1
801     1
511     1
4062    1
4731    1
1150    1
2754    1
924     1
444     1
1583    1
3230    1
4757    1
165     1
4214    1
1266    1
851     1
827     1
2932    1
3003    1
4635    1
4363    1
1558    1
366     1
3528    1
4479    1
2584    1
Name: spam, dtype: int64

In [19]:
# hard-coded indices of spam messages to inspect
for i in [1536, 4692, 2915, 2464, 1101, 1268, 227]:
    print(df.loc[i])

spam                                                    1
text    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
Name: 1536, dtype: object
spam                                                    1
text    Santa Calling! Would your little ones like a c...
Name: 4692, dtype: object
spam                                                    1
text    You have 1 new voicemail. Please call 08719181...
Name: 2915, dtype: object
spam                                                    1
text    INTERFLORA - ÂIt's not too late to order Inte...
Name: 2464, dtype: object
spam                                                    1
text    CLAIRE here am havin borin time & am now alone...
Name: 1101, dtype: object
spam                                                    1
text    500 free text msgs. Just text ok to 80488 and ...
Name: 1268, dtype: object
spam                                                    1
text    Will u meet ur dream partner soon? Is ur caree...
Name: 227, dtype: object


#### Analysis

These messages were misclassified as not spam, but they really are spam. Many of them contain runs of capital letters and exclamation points. The default tf-idf preprocessing lowercases the text and drops punctuation, so this information was lost and the algorithm could not learn from it.
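The loss of this information is easy to see on a single string. The sketch below assumes the vectorizer's documented defaults (`lowercase=True` and the token pattern `(?u)\b\w\w+\b`) and imitates them with the standard `re` module. The `default_tokens` helper and the sample strings are made up for illustration.

```python
import re

# Assumption: this mirrors the vectorizer's documented defaults
# (lowercase=True, token_pattern r"(?u)\b\w\w+\b").
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokens(text):
    """Lowercase the text, then keep runs of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

shouting = default_tokens("FREE entry!!! Call NOW")
quiet = default_tokens("free entry, call now")
print(shouting)
print(quiet)
```

Both strings produce the same tokens, `['free', 'entry', 'call', 'now']`, so the model sees no difference between a shouted message and a quiet one.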

Will we get better performance if we process the data differently?


## Second Try

Let's preprocess the text differently to recognize punctuation and caps.

In [20]:
import re

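# assign the result back instead of using inplace=True on a column;
# raw strings avoid invalid-escape warnings for \d
df['text'] = df['text'].replace(r'\d\d+', ' num ', regex=True)
df['text'] = df['text'].replace(r'[!@#*][!@#*]+', ' punct ', regex=True)
df['text'] = df['text'].replace(r'[A-Z][A-Z]+', ' caps ', regex=True)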
    
# these are known problem messages 
for i in [1536, 4692, 2915, 2464, 1101, 1268, 227]:
    print(df.loc[i])

spam                                                    1
text     caps   num  &  caps   caps   caps   caps   ca...
Name: 1536, dtype: object
spam                                                    1
text    Santa Calling! Would your little ones like a c...
Name: 4692, dtype: object
spam                                               1
text    You have 1 new voicemail. Please call  num .
Name: 2915, dtype: object
spam                                                    1
text     caps  - ÂIt's not too late to order Interflo...
Name: 2464, dtype: object
spam                                                    1
text     caps  here am havin borin time & am now alone...
Name: 1101, dtype: object
spam                                                    1
text     num  free text msgs. Just text ok to  num  an...
Name: 1268, dtype: object
spam                                                    1
text    Will u meet ur dream partner soon? Is ur caree...
Name: 227, dtype: object
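The same three substitutions can be tried on a single string with `re.sub`, which makes their effect and their order easier to follow. The `mark_patterns` helper and the sample message are made up for illustration.

```python
import re

def mark_patterns(text):
    """Replace runs of digits, punctuation, and capitals with marker words."""
    text = re.sub(r"\d\d+", " num ", text)             # runs of 2+ digits
    text = re.sub(r"[!@#*][!@#*]+", " punct ", text)   # runs of 2+ of ! @ # *
    text = re.sub(r"[A-Z][A-Z]+", " caps ", text)      # runs of 2+ capital letters
    return text

sample = "CALL 09090900040 NOW!!!"   # made-up example message
marked = mark_patterns(sample).split()
print(marked)
# prints: ['caps', 'num', 'caps', 'punct']
```

The markers are lowercase words, so the capitals pass applied last does not touch them, and a message such as "Santa Calling!" is left unchanged because it has no run of two or more matching characters.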


In [21]:
# do the rest of the processing
X = df.text
y = df.spam

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1234)




In [22]:
# apply tfidf vectorizer
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)


In [23]:
# train the algorithm

naive_bayes.fit(X_train, y_train)


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [24]:
# evaluate

pred = naive_bayes.predict(X_test)
print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred))
print('recall score: ', recall_score(y_test, pred))
print('f1 score: ', f1_score(y_test, pred))
confusion_matrix(y_test, pred)


accuracy score:  0.981404958677686
precision score:  1.0
recall score:  0.85
f1 score:  0.9189189189189189


array([[848,   0],
       [ 18, 102]])

The new preprocessing moved 14 spam messages that had been misclassified as not-spam into spam: false negatives fell from 32 to 18. Recall rose from 0.73 to 0.85, which in turn raised the f1 score from 0.85 to 0.92.

## Third Try

The next code blocks compare the results using the Bernoulli classifier (BernoulliNB) instead of the Multinomial classifier.

In [25]:
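# binary=True makes the term-frequency part 0/1 instead of a count;
# the output is still idf-weighted, and BernoulliNB binarizes it (binarize=0.0)
vectorizer_b = TfidfVectorizer(stop_words=stopwords, binary=True)

# set up X and y
# note: unlike the earlier tries, the vectorizer is fit on all the data before the split
X = vectorizer_b.fit_transform(df.text)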
y = df.spam

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1234)

In [26]:
from sklearn.naive_bayes import BernoulliNB

naive_bayes2 = BernoulliNB()
naive_bayes2.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [27]:
# make predictions on the test data
pred = naive_bayes2.predict(X_test)

# print confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred)

array([[847,   1],
       [ 13, 107]])

In [28]:
print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred))
print('recall score: ', recall_score(y_test, pred))
print('f1 score: ', f1_score(y_test, pred))

accuracy score:  0.9855371900826446
precision score:  0.9907407407407407
recall score:  0.8916666666666667
f1 score:  0.9385964912280701


### Analysis

The Bernoulli classifier performed slightly better than the Multinomial classifier on this split (accuracy 0.986 vs. 0.981, f1 0.94 vs. 0.92). The Bernoulli classifier models the data differently: it uses just the presence or absence of words, rather than counts. If the data coming into the Bernoulli classifier is not binary, the classifier will binarize it. This seems to have worked for this data set, probably because the presence or absence of certain words is a strong predictor of spam or not-spam.
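A small pure-Python sketch shows the two ideas in this analysis: binarizing the incoming weights, and scoring a message on both the words it contains and the words it lacks. The per-word probabilities and the tf-idf row are made up, and class priors are left out to keep the example short, so this is an illustration of the Bernoulli model rather than scikit-learn's implementation.

```python
import math

# Made-up probabilities P(word present | class) for a three-word vocabulary.
vocab = ["free", "home", "win"]
p_present = {
    "spam": [0.8, 0.1, 0.7],
    "ham": [0.1, 0.6, 0.1],
}

def binarize(weights, threshold=0.0):
    """Any weight above the threshold counts as 'word present'."""
    return [1 if w > threshold else 0 for w in weights]

def bernoulli_log_likelihood(x, probs):
    """Present words add log(p); absent words add log(1 - p)."""
    return sum(math.log(p) if xi else math.log(1 - p) for xi, p in zip(x, probs))

tfidf_row = [0.62, 0.0, 0.78]   # hypothetical tf-idf weights for one message
x = binarize(tfidf_row)
scores = {c: bernoulli_log_likelihood(x, p) for c, p in p_present.items()}
best = max(scores, key=scores.get)
print(x, best)
# prints: [1, 0, 1] spam
```

The absence of "home" counts as evidence for spam here, which is something the Multinomial model does not use: it only scores the words that actually occur in a message.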