## Naive Bayes SMS spam classification example

In [1]:
import csv

In [2]:
smsdata = open('../dataset/smsspamcollection/SMSSpamCollection.txt', 'r')
csv_reader = csv.reader(smsdata, delimiter='\t')

In [3]:
# Collect the label (column 0) and the message text (column 1) from each row
smsdata_data = []
smsdata_labels = []

for line in csv_reader:
    smsdata_labels.append(line[0])
    smsdata_data.append(line[1])

smsdata.close()

In [4]:
smsdata_data[0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [5]:
for i in range(5):
    print(smsdata_data[i], smsdata_labels[i])

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... ham
Ok lar... Joking wif u oni... ham
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's spam
U dun say so early hor... U c already then say... ham
Nah I don't think he goes to usf, he lives around here though ham


In [6]:
from collections import Counter
c = Counter(smsdata_labels)
print(c)

Counter({'ham': 4825, 'spam': 747})
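
The counts above sum to 5,572. The standard release of this corpus is usually described as containing 5,574 messages, so it is possible that a few lines were merged during parsing. One plausible cause, offered here as an assumption rather than a confirmed diagnosis, is that `csv.reader` by default treats a double quote at the start of a field as opening a quoted field that can span several lines. The sketch below uses a made-up two-line sample (not taken from the dataset) to show the difference that `quoting=csv.QUOTE_NONE` makes:

```python
import csv
import io

# Hypothetical sample: the first message starts with a double quote and the
# matching quote only appears at the end of the second line.
sample = 'ham\t"Hi there\nspam\tWin cash"\n'

# Default dialect: the quote opens a quoted field that runs across the newline,
# so both lines collapse into a single record.
default_rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))

# QUOTE_NONE: quotes are ordinary characters, so each line stays one record.
raw_rows = list(csv.reader(io.StringIO(sample), delimiter='\t',
                           quoting=csv.QUOTE_NONE))

print(len(default_rows), len(raw_rows))  # 1 2
```

Passing `quoting=csv.QUOTE_NONE` when reading the real file would change the counts and every result that follows, so the outputs in this notebook correspond to the reader as configured in cell 2.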


In [7]:
# nltk processing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
import pandas as pd
from nltk import pos_tag
from nltk.stem import PorterStemmer

In [8]:
def preprocessing(text):
    
    # Replace every standard punctuation character with a space, then collapse the
    # resulting runs of whitespace into single spaces:
    text2 = " ".join("".join([" " if ch in string.punctuation else ch for ch in text]).split())
    
    # Tokenize the cleaned text into words, sentence by sentence, and collect them in a
    # single list for the steps that follow:
    tokens = [word for sent in nltk.sent_tokenize(text2) for word in nltk.word_tokenize(sent)]
    
    # Convert every word to lowercase so that case variants (upper, lower, and proper)
    # do not show up as duplicates in the corpus:
    tokens = [word.lower() for word in tokens]
    
    # Stop words carry little weight in understanding a sentence and mostly serve to
    # connect other words, so remove them:
    stopwds = stopwords.words('english')
    tokens = [token for token in tokens if token not in stopwds]
    
    # Keep only words of at least three characters; shorter words rarely carry
    # much meaning:
    tokens = [word for word in tokens if len(word) >= 3]
    
    # Stem each word with PorterStemmer, which strips common suffixes from the words:
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]
    
    # POS tagging is a prerequisite for lemmatization: the lemmatizer reduces a word to
    # its root form based on whether it is a noun, a verb, and so on:
    tagged_corpus = pos_tag(tokens)
    
    # The pos_tag function returns one of four part-of-speech tags for nouns and one of six
    # for verbs: NN (noun, common, singular), NNP (noun, proper, singular), NNPS (noun, proper,
    # plural), NNS (noun, common, plural), VB (verb, base form), VBD (verb, past tense), VBG (verb,
    # present participle), VBN (verb, past participle), VBP (verb, present tense, not third person
    # singular), VBZ (verb, present tense, third person singular):
    Noun_tags = ['NN', 'NNP', 'NNPS', 'NNS']
    Verb_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
    lemmatizer = WordNetLemmatizer()
    
    # The prat_lemmatize function bridges the mismatch between the tags returned by pos_tag
    # and the values accepted by the lemmatize function. A word with a noun or verb tag is
    # lemmatized with 'n' or 'v' respectively; any other word falls back to 'n':
    
    def prat_lemmatize(token, tag):
        if tag in Noun_tags:
            return lemmatizer.lemmatize(token, 'n')
        elif tag in Verb_tags:
            return lemmatizer.lemmatize(token, 'v')
        else:
            return lemmatizer.lemmatize(token, 'n')
    
    # After tokenization and the operations above, join the tokens back into a single
    # string:
    pre_proc_text = " ".join([prat_lemmatize(token, tag) for token, tag in tagged_corpus])
    return pre_proc_text

In [9]:
smsdata_data_2 = []
for i in smsdata_data:
    smsdata_data_2.append(preprocessing(i))
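
To make the first few steps of `preprocessing` concrete without requiring the NLTK corpora, here is a minimal sketch of the punctuation, lowercase, and length filters only. The helper name `basic_clean` is hypothetical, and it uses `str.split` in place of `nltk.word_tokenize`, so it is an approximation of those steps rather than a replacement for the function above:

```python
import string

def basic_clean(text):
    """Hypothetical helper covering only the punctuation, case, and length steps."""
    # Replace punctuation with spaces, then collapse repeated whitespace.
    no_punct = "".join(" " if ch in string.punctuation else ch for ch in text)
    text2 = " ".join(no_punct.split())
    # Whitespace tokenization stands in for nltk.word_tokenize here.
    tokens = [word.lower() for word in text2.split()]
    # Same length filter as in preprocessing: keep words of 3+ characters.
    return [word for word in tokens if len(word) >= 3]

cleaned = basic_clean("Ok lar... Joking wif u oni...")
print(cleaned)  # ['lar', 'joking', 'wif', 'oni']
```

The full function additionally removes stop words, stems, and lemmatizes, so its output for the same message can differ from this sketch.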

In [10]:
import numpy as np
trainset_size = int(round(len(smsdata_data_2)*0.70))
print('The training set size of this classifier is ' + str(trainset_size) + '\n')
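# Each record is already a single string, so ''.join(rec) leaves it unchanged.
X_train = np.array([''.join(rec) for rec in smsdata_data_2[0:trainset_size]])
y_train = np.array([rec for rec in smsdata_labels[0:trainset_size]])

# Note: starting the test slice at trainset_size + 1 leaves out the record at index
# trainset_size; slice from trainset_size to keep it. The outputs shown in this
# notebook come from the slices as written.
X_test = np.array([''.join(rec) for rec in smsdata_data_2[trainset_size + 1:len(smsdata_data_2)]])
y_test = np.array([rec for rec in smsdata_labels[trainset_size + 1:len(smsdata_labels)]])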

The training set size of this classifier is 3900
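
The test slice in the cell above starts at `trainset_size + 1`, which means one record belongs to neither set: 3,900 training rows plus the 1,671 test rows reported later add up to 5,571, one fewer than the 5,572 messages counted earlier. A small sketch with a stand-in list (the name `records` is illustrative) shows the effect:

```python
# Stand-in for smsdata_data_2: ten records identified by their index.
records = list(range(10))
trainset_size = int(round(len(records) * 0.70))

train = records[0:trainset_size]
test_as_written = records[trainset_size + 1:]   # slice used in the cell above
test_contiguous = records[trainset_size:]       # slice that keeps every record

print(train)            # [0, 1, 2, 3, 4, 5, 6]
print(test_as_written)  # [8, 9]
print(test_contiguous)  # [7, 8, 9]
```

With this many messages, dropping one record is unlikely to move the metrics noticeably, but the contiguous slice avoids the gap. Since the file order is used as is, shuffling before splitting may also be worth considering.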



The following code builds a vectorizer that converts each message into a vector of term
frequency-inverse document frequency (TF-IDF) weights. TF-IDF increases the weight of
terms that occur often within a message and at the same time penalizes general terms that
occur across many messages, such as the, him, at, and so on. In the following code, the
vocabulary is restricted to the 4,000 most frequent terms (unigrams and bigrams);
nonetheless, this parameter can also be tuned to check where better accuracy is obtained:

In [11]:
# building TFIDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), stop_words='english', max_features=4000,
                           strip_accents='unicode', norm='l2')
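
As a rough illustration of what the weights mean, the sketch below computes TF-IDF by hand for three made-up token lists. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization; the documents and the helper name `tfidf_vector` are illustrative only:

```python
import math
from collections import Counter

# Three tiny made-up documents, already tokenized.
docs = [["free", "prize", "call"], ["call", "later"], ["free", "call", "now"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    weights = {term: tf[term] * idf[term] for term in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first = tfidf_vector(docs[0])
for term in sorted(first, key=first.get, reverse=True):
    print(term, round(first[term], 3))
```

In the first document, "prize" (found in one document) ends up with the largest weight and "call" (found in all three) with the smallest, which is the penalizing effect described above.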

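The TF-IDF transformation is applied to both the train and test data as follows: the
vectorizer is fitted on the training data only and then reused to transform the test data.
The todense function converts the sparse result into a dense matrix so that its content can
be inspected: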

In [12]:
X_train_2 = vectorizer.fit_transform(X_train).todense()
X_test_2 = vectorizer.transform(X_test).todense()

The multinomial Naive Bayes classifier is suitable for classification with discrete features
(for example, word counts), and the multinomial distribution normally requires integer
feature counts. However, in practice, fractional counts such as TF-IDF also work well. If no
Laplace estimator is specified, alpha takes its default value of 1.0, which adds 1.0 to each
term count in the numerator and a matching total (1.0 per vocabulary term) to the denominator:

In [13]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_2, y_train)

ytrain_nb_predicted = clf.predict(X_train_2)
ytest_nb_predicted = clf.predict(X_test_2)

from sklearn.metrics import accuracy_score, classification_report
print('\nNaive Bayes - Train Confusion Matrix\n\n', pd.crosstab(y_train, ytrain_nb_predicted,
                                                               rownames=['Actual'], colnames=['Predicted']), sep='')
print('\nNaive Bayes - Train accuracy', round(accuracy_score(y_train, ytrain_nb_predicted), 3))
print('\nNaive Bayes - Train Classification Report\n', classification_report(y_train, ytrain_nb_predicted))


print('\nNaive Bayes - Test Confusion Matrix\n\n', pd.crosstab(y_test, ytest_nb_predicted,
                                                               rownames=['Actual'], colnames=['Predicted']), sep='')
print('\nNaive Bayes - Test accuracy', round(accuracy_score(y_test, ytest_nb_predicted), 3))
print('\nNaive Bayes - Test Classification Report\n', classification_report(y_test, ytest_nb_predicted))


Naive Bayes - Train Confusion Matrix

Predicted   ham  spam
Actual               
ham        3381     0
spam         77   442

Naive Bayes - Train accuracy 0.98

Naive Bayes - Train Classification Report
              precision    recall  f1-score   support

        ham       0.98      1.00      0.99      3381
       spam       1.00      0.85      0.92       519

avg / total       0.98      0.98      0.98      3900


Naive Bayes - Test Confusion Matrix

Predicted   ham  spam
Actual               
ham        1440     3
spam         54   174

Naive Bayes - Test accuracy 0.966

Naive Bayes - Test Classification Report
              precision    recall  f1-score   support

        ham       0.96      1.00      0.98      1443
       spam       0.98      0.76      0.86       228

avg / total       0.97      0.97      0.96      1671
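
The figures in the test classification report can be reproduced by hand from the test confusion matrix. The sketch below does that arithmetic and also computes the accuracy of a trivial classifier that always predicts ham, as a reference point; the variable names are illustrative:

```python
# Test confusion matrix from the output above (rows: actual, columns: predicted).
ham_as_ham, ham_as_spam = 1440, 3
spam_as_ham, spam_as_spam = 54, 174

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam
accuracy = (ham_as_ham + spam_as_spam) / total

# Precision: of the messages flagged as spam, how many really were spam.
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# Recall: of the real spam messages, how many were flagged.
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

# Accuracy of always predicting the majority class (ham).
majority_baseline = (ham_as_ham + ham_as_spam) / total

print(round(accuracy, 3), round(spam_precision, 2),
      round(spam_recall, 2), round(spam_f1, 2))  # 0.966 0.98 0.76 0.86
print(round(majority_baseline, 3))               # 0.864
```

Because ham dominates the test set, the always-ham baseline already reaches about 86 percent accuracy, so the spam recall is the more informative number here.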



The previous results show that Naive Bayes reaches 96.6 percent test accuracy, with a
recall of 76 percent for spam and almost 100 percent for ham. Because ham makes up about
86 percent of the test set (1,443 of 1,671 messages), accuracy alone overstates the
performance: the spam recall means that roughly one in four spam messages is still
classified as ham.
To check which features have the top 10 coefficients in the Naive Bayes model, the
following code is handy:

In [114]:
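# printing top features
# Note: newer scikit-learn releases replace get_feature_names() with
# get_feature_names_out(), and coef_ / intercept_ with feature_log_prob_ /
# class_log_prior_; the calls below match the release used for this notebook.
feature_names = vectorizer.get_feature_names()
coefs = clf.coef_
intercept = clf.intercept_
# Pair each coefficient with its feature name and sort in ascending order
coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))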

print('\nTop 10 features - both first & last\n')
n=10
top_n_coefs = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
for (coef_1, fn_1), (coef_2, fn_2) in top_n_coefs:
    print('\t%.4f\t%-15s\t\t%.4f\t%-15s' % (coef_1, fn_1, coef_2, fn_2))



Top 10 features - both first & last

	-8.7130	1hr            		-5.5795	free           
	-8.7130	1st love       		-5.7187	txt            
	-8.7130	2go            		-5.8721	text           
	-8.7130	2morrow        		-6.0066	claim          
	-8.7130	2mrw           		-6.0704	stop           
	-8.7130	2nd inning     		-6.0785	mobil          
	-8.7130	2nd sm         		-6.1074	repli          
	-8.7130	30ish          		-6.1514	prize          
	-8.7130	3rd            		-6.2015	servic         
	-8.7130	3rd natur      		-6.2208	tone           
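
The left-hand column repeats the same value, -8.7130, for every feature. This is what additive (Laplace) smoothing would produce if the coefficients are smoothed per-class log probabilities and those features carry no weight in the class the coefficients refer to: each such feature gets log(alpha / (total + alpha * vocabulary size)). The toy counts below are made up to illustrate the formula and are not taken from the model:

```python
import math

alpha = 1.0  # smoothing value, matching the default mentioned earlier

# Hypothetical per-class feature totals; two features never occur in this class.
spam_counts = {"free": 3.0, "prize": 2.0, "2morrow": 0.0, "3rd": 0.0}
total = sum(spam_counts.values())
vocab_size = len(spam_counts)

# Smoothed log probability of each feature given the class.
log_prob = {
    term: math.log((count + alpha) / (total + alpha * vocab_size))
    for term, count in spam_counts.items()
}

for term, value in sorted(log_prob.items(), key=lambda item: item[1]):
    print('%-10s %.4f' % (term, value))
```

Both zero-count features receive the same smoothed value, log(1/9), and it is the lowest in the table, mirroring the tied entries in the output above.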


In [119]:
# seq[start:end:step]
coefs_with_fns[:-11:-1]

[(-5.579497917408297, 'free'),
 (-5.718713033530435, 'txt'),
 (-5.872109068378337, 'text'),
 (-6.006611788084338, 'claim'),
 (-6.070368904423681, 'stop'),
 (-6.078481599936406, 'mobil'),
 (-6.107370536961926, 'repli'),
 (-6.151415272013495, 'prize'),
 (-6.201462737225315, 'servic'),
 (-6.2207812681156, 'tone')]

In [20]:
clf.intercept_

array([-2.01682795])
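
The intercept is consistent with the log prior of the spam class in the training set, which holds 519 spam messages out of 3,900 according to the train classification report. A quick check, assuming the intercept is the natural log of that class proportion:

```python
import math

# Class counts taken from the train classification report above.
n_spam_train = 519
n_train = 3900

log_prior_spam = math.log(n_spam_train / n_train)
print(round(log_prior_spam, 4))  # -2.0168
```

The result agrees with the value of `clf.intercept_` shown above to the printed precision.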