# Using Naive Bayes & SVM to predict spam
I've just completed the introduction to Naive Bayes and SVM from Udacity and would like to apply what I learned in a project. The data I'm using is from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset) and contains text messages labeled as either ham (legitimate) or spam.

Naive Bayes is a probabilistic classifier that treats every word as independent of the others given the class, so it doesn't account for interaction between words. An SVM, on the other hand, can account for interaction (with a suitable kernel), but is slower to train. On this dataset we expect the SVM to outperform Naive Bayes, as there is definitely interaction between words in a text message, and the longer training time for the SVM to be unimportant because the dataset is small.
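To make the independence assumption concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes scorer. The toy messages, the add-one smoothing and the helper names `log_score` and `predict` are my own assumptions for illustration; this is not how scikit-learn implements it internally.

```python
import math
from collections import Counter

# Hypothetical toy training messages, not taken from the Kaggle dataset.
train = [
    ("win a free prize now", "spam"),
    ("free entry call now", "spam"),
    ("are you free for lunch", "ham"),
    ("call me when you are home", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = set(word_counts["spam"]) | set(word_counts["ham"])


def log_score(text, label):
    """log P(label) + sum of log P(word | label), with add-one smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        # Each word contributes on its own, regardless of its neighbours.
        score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
    return score


def predict(text):
    return max(("spam", "ham"), key=lambda label: log_score(text, label))


print(predict("free prize now"))   # spam
print(predict("now prize free"))   # spam (word order makes no difference)
print(predict("are you home"))     # ham
```

Because the score is just a sum over individual words, shuffling the words in a message never changes the prediction. That is the sense in which Naive Bayes ignores interaction between words.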


# Import libraries

In [2]:
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn import feature_extraction, naive_bayes


# Exploring the data

In [16]:
spam_data = pd.read_csv('spam.csv', encoding='latin-1')
print(spam_data.shape)
spam_data.head(n=3)

(5572, 5)


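|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|----|----|------------|------------|------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |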


In [20]:
# let's drop the columns we don't need
# (drop returns a new DataFrame, so the result has to be assigned back)
spam_data = spam_data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
print(spam_data.groupby('v1').count())
print('-----------------------------------')
# next let's see which words are used most:
spam_words = ' '.join(spam_data[spam_data['v1'] == 'spam']['v2'])
ham_words = ' '.join(spam_data[spam_data['v1'] == 'ham']['v2'])

count_spam = Counter(spam_words.split())
count_ham = Counter(ham_words.split())
print('most common words in spam messages:')
print(count_spam.most_common(10))
print('----------------------------------')
print('most common words in ham messages:')
print(count_ham.most_common(10))


        v2
v1        
ham   4825
spam   747
-----------------------------------
most common words in spam messages:
[('to', 604), ('a', 358), ('your', 187), ('call', 185), ('or', 185), ('the', 178), ('2', 169), ('for', 169), ('you', 164), ('is', 143)]
----------------------------------
most common words in ham messages:
[('to', 1530), ('you', 1458), ('I', 1436), ('the', 1019), ('a', 969), ('and', 738), ('i', 736), ('in', 734), ('u', 645), ('is', 638)]


We can see that of the 5,572 messages, 747 (about 13%) are spam. Both top-ten lists are dominated by common stop words, but spam messages stand out for words like "your" and "call", while ham messages use "I" much more. Note that the counts are case-sensitive, so "I" and "i" are tallied separately. For the time being we'll stick to an analysis as-is, but obviously it would make sense to group words like "your" and "you" together.
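As a rough sketch of what such grouping could look like, the snippet below lowercases each word and then maps a few variants onto one form. The `GROUPS` table, the `normalize` helper and the sample messages are hypothetical and only there to show the idea. A real project would more likely use a stemmer or lemmatizer.

```python
from collections import Counter

# Hypothetical messages, used only to show the effect of normalizing words.
messages = ["I think U should call", "i will call you", "Call YOUR bank"]

# Hypothetical hand-written mapping of variants onto a single form.
GROUPS = {"your": "you", "u": "you"}


def normalize(word):
    """Lowercase the word, then map known variants onto their group."""
    word = word.lower()
    return GROUPS.get(word, word)


words = " ".join(messages).split()
raw = Counter(words)
grouped = Counter(normalize(w) for w in words)

print(raw["I"], raw["i"], raw["you"], raw["call"])      # 1 1 1 2
print(grouped["i"], grouped["you"], grouped["call"])    # 2 3 3
```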

# Change text into features
We can use the CountVectorizer to transform the text into numbers, following this tutorial:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In [21]:
# transform the text into features
count_vect = feature_extraction.text.CountVectorizer()
X = count_vect.fit_transform(spam_data["v2"])
print(X.shape)
y = spam_data['v1'].map({'spam':1, 'ham':0})
print(y.shape)

(5572, 8672)
(5572,)
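To see what the numbers in that matrix mean, here is a small standard-library approximation of the bag-of-words step. It assumes CountVectorizer's default behavior of lowercasing the text and keeping only tokens of two or more word characters. The mini-corpus and the `tokenize` helper are hypothetical, and the real class returns a sparse matrix rather than nested lists.

```python
import re
from collections import Counter

# Hypothetical mini-corpus, not taken from the Kaggle dataset.
docs = ["Free entry, call now!", "Ok, call me now", "I am free now"]

# Rough stand-in for CountVectorizer's defaults: lowercase the text and
# keep tokens made of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN.findall(text.lower())


# One column per distinct token, in alphabetical order.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one count per vocabulary entry.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[tok] for tok in vocabulary])

print(vocabulary)
# ['am', 'call', 'entry', 'free', 'me', 'now', 'ok']
for row in matrix:
    print(row)
# [0, 1, 1, 1, 0, 1, 0]
# [0, 1, 0, 0, 1, 1, 1]
# [1, 0, 0, 1, 0, 1, 0]
```

Each of the 8672 columns in `X` corresponds to one such vocabulary entry. Under these defaults, single-character tokens such as "I", "a", "u" and "2" do not become features at all, even though they rank high in the word counts above.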


# Naive Bayes
MultinomialNB works directly on the sparse count matrix produced by CountVectorizer, so let's use that:

In [22]:
# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
# fit Naive Bayes
clf = naive_bayes.MultinomialNB()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

0.9793365959760739


# SVM
Next, let's use an SVM to analyze the same dataset:

In [23]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
svc = SVC(kernel='rbf')
svc.fit(X_train,y_train)
prediction = svc.predict(X_test)
accuracy_score(y_test,prediction)

0.8629690048939641

# Result
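Naive Bayes reaches 98% accuracy and the SVM 86%, both without any optimization. This is surprising, as we expected the SVM to perform better. Roughly 87% of the messages are ham, so 86% is about what a classifier that always predicts ham would achieve, which suggests the untuned RBF SVM is learning very little from the raw word counts. It's possible that the SVM will perform better once both models are optimized.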
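As a starting point for that optimization, one option is to compare a few SVM settings side by side. The sketch below does this on a handful of hypothetical toy messages. The candidate kernels and the value of `C` are illustrative assumptions, not tuned choices, and whether any of them helps on the real dataset is untested here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy messages; 1 = spam, 0 = ham.
texts = [
    "free entry win a prize now",
    "call now to claim your free prize",
    "win cash call this number",
    "are you coming home for dinner",
    "ok see you at lunch",
    "i will call you later",
]
labels = [1, 1, 1, 0, 0, 0]

# Candidate settings to compare; the values are illustrative, not tuned.
candidates = {
    "rbf": SVC(kernel="rbf"),
    "linear": SVC(kernel="linear", C=1.0),
}

for name, svc in candidates.items():
    # Vectorize and classify in one step so new text can be passed in directly.
    model = make_pipeline(CountVectorizer(), svc)
    model.fit(texts, labels)
    print(name, model.predict(["free prize call now", "see you at home"]))
```

On the real data the same loop would fit on `X_train`, `y_train` and score on `X_test`, `y_test`. A cross-validated search such as scikit-learn's `GridSearchCV` would be the more systematic way to choose `C` and the kernel.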

# Conclusion
A dataset containing 5.5k text messages was used to train two ML algorithms to predict spam. Both algorithms (Naive Bayes and SVM) can be trained in a short period of time. Without tuning, Naive Bayes outperforms the SVM. Considering the roughly 13% spam rate, 98% accuracy from Naive Bayes is not bad!
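To put those accuracy figures in context, it helps to compare them with a trivial baseline. The short calculation below uses the class counts printed earlier and assumes a hypothetical classifier that labels every message as ham. The baseline is computed on the full dataset, so the value on the 33% test split may differ slightly.

```python
# Class counts copied from the groupby output earlier in this notebook.
n_ham, n_spam = 4825, 747
total = n_ham + n_spam

# Accuracy of a hypothetical classifier that labels every message as ham.
baseline = n_ham / total

# Scores reported above, rounded to three decimals.
scores = {"Naive Bayes": 0.979, "SVM (rbf)": 0.863}

print(f"always-ham baseline: {baseline:.3f}")
for name, score in scores.items():
    print(f"{name}: {score:.3f} ({score - baseline:+.3f} vs baseline)")
# always-ham baseline: 0.866
# Naive Bayes: 0.979 (+0.113 vs baseline)
# SVM (rbf): 0.863 (-0.003 vs baseline)
```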