# **Comparative Analysis Of Multinomial Naïve Bayes And Support Vector Classifier For SMS Spam Detection**

[Download Dataset](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)

# **Import the Dataset**

In [2]:
import pandas as pd
import numpy as np
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer


# if you are NOT using google colab, remove this part starting from here
from google.colab import files

uploaded=files.upload()
# to here

# Keep only the message text (v2) and its label (v1)
sms = pd.read_csv('spam.csv', encoding='latin1')[['v2', 'v1']]
sms.columns = ['text', 'clas']
# Drop any row whose label is not 'spam' or 'ham'
sms = sms[sms['clas'].isin(['spam', 'ham'])]
sms = sms.reset_index(drop=True)
print(sms.head())
print(sms.tail())

Saving spam.csv to spam (1).csv
                                                text  clas
0  Go until jurong point, crazy.. Available only ...   ham
1                      Ok lar... Joking wif u oni...   ham
2  Free entry in 2 a wkly comp to win FA Cup fina...  spam
3  U dun say so early hor... U c already then say...   ham
4  Nah I don't think he goes to usf, he lives aro...   ham
                                                   text  clas
5567  This is the 2nd time we have tried 2 contact u...  spam
5568              Will Ì_ b going to esplanade fr home?   ham
5569  Pity, * was in mood for that. So...any other s...   ham
5570  The guy did some bitching but I acted like i'd...   ham
5571                         Rofl. Its true to its name   ham


Basic Stats


In [3]:
sms.groupby('clas').describe()

                                                     text
     count unique                                                top freq
clas
ham   4825   4516                             Sorry, I'll call later   30
spam   747    653  Please call our customer service representativ...    4


# **Preprocess The Data**

Strip HTML tags and punctuation from the text, lowercase it, and keep any emoticons

In [4]:
import re
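def preprocessor(txt):
    # Strip HTML tags
    txt = re.sub(r'<[^>]*>', '', txt)
    # Collect emoticons such as :-) ;( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', txt)
    # Lowercase, collapse non-word characters to spaces, then append the
    # emoticons with their '-' noses removed
    txt = re.sub(r'[\W]+', ' ', txt.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return txt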

In [5]:
sms['text'] = sms['text'].apply(preprocessor)
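
To see what the preprocessor does to a single message, the sketch below repeats the function and runs it on one string. The `sample` message is hypothetical and not taken from the dataset.

```python
import re

def preprocessor(txt):
    # Same steps as the notebook's preprocessor, written with raw strings
    txt = re.sub(r'<[^>]*>', '', txt)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', txt)
    txt = re.sub(r'[\W]+', ' ', txt.lower()) + \
        ' '.join(emoticons).replace('-', '')
    return txt

# Hypothetical message, not taken from the dataset
sample = 'Free entry!! <b>Win</b> now :-)'
cleaned = preprocessor(sample)
print(repr(cleaned))  # 'free entry win now :)'
```

The HTML tags and punctuation are gone, the text is lowercased, and the emoticon survives at the end without its hyphen.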

Download the stopword list from NLTK and remove the stopwords from the messages.

First, split the messages into individual words

In [6]:
def split_into_tokens(text):
    text = str(text)  # make sure the input is a str before tokenizing
    return TextBlob(text).words

Review the text before tokenization

In [7]:
sms.text.head()

0    go until jurong point crazy available only in ...
1                             ok lar joking wif u oni 
2    free entry in 2 a wkly comp to win fa cup fina...
3         u dun say so early hor u c already then say 
4    nah i don t think he goes to usf he lives arou...
Name: text, dtype: object

Tokenize

In [8]:
import nltk
nltk.download('punkt')
sms.text.head().apply(split_into_tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


0    [go, until, jurong, point, crazy, available, o...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, u, c, already, t...
4    [nah, i, don, t, think, he, goes, to, usf, he,...
Name: text, dtype: object

Detect part-of-speech tags

In [9]:
import nltk
nltk.download('averaged_perceptron_tagger')

TextBlob("hello world, how is it going?").tags  #list of (word,POS) pairs

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('hello', 'JJ'),
 ('world', 'NN'),
 ('how', 'WRB'),
 ('is', 'VBZ'),
 ('it', 'PRP'),
 ('going', 'VBG')]

Remove stopwords and normalize words into their base form

In [10]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [11]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
# Add single letters as stopwords; 'u' is not in this list, so it stays a token
stop = stop + list('abcdefghijklmnopqrstvwxyz')

In [12]:
import nltk
nltk.download('wordnet')
def split_into_lemmas(message):
    # Lowercase, tokenize, drop stopwords, and reduce each word to its lemma
    message = str(message).lower()
    words = TextBlob(message).words
    return [word.lemma for word in words if word not in stop]

sms.text.head().apply(split_into_lemmas)

[nltk_data] Downloading package wordnet to /root/nltk_data...


0    [go, jurong, point, crazy, available, bugis, g...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, 2, wkly, comp, win, fa, cup, fin...
3           [u, dun, say, early, hor, u, already, say]
4          [nah, think, go, usf, life, around, though]
Name: text, dtype: object
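
The stopword step on its own can be sketched without NLTK or TextBlob. The `stop_words` set here is a hypothetical mini list chosen for illustration, and the tokens are the visible part of message 4 from the tokenizer output above. No lemmatization is done in this sketch.

```python
# Hypothetical mini stop list; the notebook uses NLTK's English list plus single letters
stop_words = {'i', 'don', 't', 'he', 'to'}

# Visible tokens of message 4 from the tokenizer output
tokens = ['nah', 'i', 'don', 't', 'think', 'he', 'goes', 'to', 'usf']

# Keep only the tokens that are not stopwords
kept = [tok for tok in tokens if tok not in stop_words]
print(kept)
```

A set is used here because membership tests on a set do not slow down as the stop list grows, which may matter when the filter runs over every token of every message.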

Convert data to vectors

In [13]:
%%time
bow_transformer = CountVectorizer(analyzer=split_into_lemmas).fit(sms['text'])
print(len(bow_transformer.vocabulary_))

8028
CPU times: user 1.8 s, sys: 10.3 ms, total: 1.81 s
Wall time: 1.84 s


Get the bag-of-words counts as a sparse matrix

In [14]:
%%time
sms_bow = bow_transformer.transform(sms['text'])
print('sparse matrix shape:', sms_bow.shape)
print('number of non-zeros:', sms_bow.nnz)
# nnz / (rows * cols) is the share of non-zero cells, i.e. the density
print('density: %.2f%%' % (100.0 * sms_bow.nnz / (sms_bow.shape[0] * sms_bow.shape[1])))

sparse matrix shape: (5572, 8028)
number of non-zeros: 49523
density: 0.11%
CPU times: user 1.8 s, sys: 4.74 ms, total: 1.81 s
Wall time: 1.81 s
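
As a rough picture of what the vectorizer builds, here is a minimal pure-Python sketch of a bag-of-words model on three toy token lists. The token lists and the names `docs`, `vocabulary`, and `rows` are assumptions made for illustration; this is not how `CountVectorizer` is implemented internally.

```python
from collections import Counter

# Toy token lists (hypothetical), standing in for the output of split_into_lemmas
docs = [
    ['free', 'entry', 'win', 'free'],
    ['ok', 'lar', 'joking'],
    ['win', 'cup', 'final'],
]

# Vocabulary: every distinct token gets a column index (sorted for determinism)
vocabulary = {tok: idx for idx, tok in enumerate(sorted({t for d in docs for t in d}))}

# Sparse representation: one {column: count} dict per document
rows = []
for doc in docs:
    counts = Counter(doc)
    rows.append({vocabulary[tok]: n for tok, n in counts.items()})

# Share of non-zero cells in the documents-by-vocabulary matrix
nnz = sum(len(r) for r in rows)
density = nnz / (len(docs) * len(vocabulary))
print(len(vocabulary), nnz, round(100 * density, 2))
```

With only three short documents the matrix is fairly dense. On the full SMS data, where each message uses a handful of the 8028 vocabulary entries, almost every cell is zero, which is why a sparse matrix is used.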


# **Split The Data**

In [15]:
from sklearn.model_selection import train_test_split

X = sms_bow
y = sms['clas'].values

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)
print(X_test.shape)

(4457, 8028)
(1115, 8028)


# **Apply Multinomial Naive Bayes model**

In [16]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix

#Fit the model
%time text_class = MultinomialNB().fit(X_train, y_train)

#Make predictions
predictions = text_class.predict(X_test)

#Compute Accuracy and Print Confusion Matrix
print('accuracy', accuracy_score(y_test, predictions))
print('confusion matrix\n', confusion_matrix(y_test, predictions))
print('(row=expected, col=predicted)')

#Compute precision, recall and F1
print(classification_report(y_test, predictions))

#Apply model to new unseen data; the probability columns follow text_class.classes_, i.e. [ham, spam]
def predict_sms(new_sms):
    new_sample = bow_transformer.transform([new_sms])
    print(new_sms, np.around(text_class.predict_proba(new_sample), decimals=5),"\n")

predict_sms('Hi, I hope the professor awards me with good marks')
predict_sms('Free trial for a new game with zero upfront pay. Just click the button below for more.')

CPU times: user 11.8 ms, sys: 0 ns, total: 11.8 ms
Wall time: 15.7 ms
accuracy 0.9802690582959641
confusion matrix
 [[954  11]
 [ 11 139]]
(row=expected, col=predicted)
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       965
        spam       0.93      0.93      0.93       150

    accuracy                           0.98      1115
   macro avg       0.96      0.96      0.96      1115
weighted avg       0.98      0.98      0.98      1115

Hi, I hope the professor awards me with good marks [[0.99698 0.00302]] 

Free trial for a new game with zero upfront pay. Just click the button below for more. [[0.03631 0.96369]] 
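
To make the probabilities above less of a black box, the sketch below works through multinomial Naive Bayes by hand: a log prior per class plus Laplace-smoothed log likelihoods per token. The training data is a hypothetical toy set, and `predict_proba` here is a standalone helper written for this illustration, not the scikit-learn method. The smoothing value `alpha=1.0` matches the default of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

# Toy training set (hypothetical): token lists with labels
train = [
    (['free', 'win', 'prize'], 'spam'),
    (['win', 'cash', 'free'], 'spam'),
    (['see', 'you', 'later'], 'ham'),
    (['call', 'you', 'later'], 'ham'),
    (['free', 'later'], 'ham'),
]

vocab = sorted({t for toks, _ in train for t in toks})
class_docs = Counter(label for _, label in train)   # documents per class
token_counts = defaultdict(Counter)                 # token counts per class
for toks, label in train:
    token_counts[label].update(toks)

def predict_proba(tokens, alpha=1.0):
    """Return {label: P(label | tokens)} using Laplace smoothing alpha."""
    log_scores = {}
    for label in class_docs:
        total = sum(token_counts[label].values())
        score = math.log(class_docs[label] / len(train))  # log prior
        for tok in tokens:
            if tok not in vocab:
                continue  # tokens outside the vocabulary are ignored
            score += math.log((token_counts[label][tok] + alpha)
                              / (total + alpha * len(vocab)))
        log_scores[label] = score
    # Normalize in log space for numerical stability
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

probs = predict_proba(['free', 'prize'])
print(probs)
```

In this toy set, "free" appears in both classes but "prize" only in spam, so the spam probability comes out higher even though ham has the larger prior.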



# **Apply SVC**

In [17]:
from sklearn.svm import SVC
from sklearn.metrics.pairwise import cosine_similarity

#Fit the model with a callable kernel: cosine similarity between bag-of-words rows
sms_class_svc = SVC(kernel=cosine_similarity, C=0.85).fit(X_train, y_train)

#Make predictions
svc_predictions = sms_class_svc.predict(X_test)

#Print SVC results
print("SVC Results: \n")
print('accuracy', accuracy_score(y_test, svc_predictions))
print('confusion matrix\n', confusion_matrix(y_test, svc_predictions))
print('(row=expected, col=predicted)', "\n")

print(classification_report(y_test, svc_predictions))

#Apply model to new unseen data
def predict_sms_svc(new_sms):
    new_sample = bow_transformer.transform([new_sms])
    print(new_sms)
    print("SVC Prediction: ", sms_class_svc.predict(new_sample)[0], "\n")

predict_sms_svc('Hi, I hope the professor awards me with good marks')
predict_sms_svc('Free trial for a new game with zero upfront pay. Just click the button below for more.')

SVC Results: 

accuracy 0.9802690582959641
confusion matrix
 [[963   2]
 [ 20 130]]
(row=expected, col=predicted) 

              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       965
        spam       0.98      0.87      0.92       150

    accuracy                           0.98      1115
   macro avg       0.98      0.93      0.96      1115
weighted avg       0.98      0.98      0.98      1115

Hi, I hope the professor awards me with good marks
SVC Prediction:  ham 

Free trial for a new game with zero upfront pay. Just click the button below for more.
SVC Prediction:  spam 
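
Both models reach the same accuracy, so the comparison has to come from the confusion matrices. The sketch below recomputes the spam-class metrics directly from the two matrices printed above. The helper name `spam_metrics` is an assumption introduced for this illustration.

```python
def spam_metrics(matrix):
    """matrix = [[ham->ham, ham->spam], [spam->ham, spam->spam]] (row=expected, col=predicted)."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)   # share of predicted spam that really is spam
    recall = tp / (tp + fn)      # share of actual spam that was caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Confusion matrices copied from the outputs above
results = {
    'MultinomialNB': spam_metrics([[954, 11], [11, 139]]),
    'SVC': spam_metrics([[963, 2], [20, 130]]),
}
for name, m in results.items():
    print(name, {k: round(v, 3) for k, v in m.items()})
```

On this split, each model gets 1093 of 1115 messages right, but the errors differ: Multinomial Naive Bayes flags 11 ham messages as spam and misses 11 spam messages, while the SVC flags only 2 ham messages but misses 20 spam messages. Which trade-off is preferable depends on whether a lost legitimate message or a delivered spam message is considered the more costly mistake.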

