## Classifying Text Messages
SVM and Naive Bayes


The SMS Spam Collection v.1 (text file: smsspamcollection) contains 4,827 legitimate (ham) SMS messages (86.6%) and 747 spam messages (13.4%).

The file contains one message per line. Each line has two tab-separated columns: the label (ham or spam) and the raw message text. Here are some examples:

ham   What you doing?how are you?
ham   Ok lar... Joking wif u oni...
ham   dun say so early hor... U c already then say...
ham   MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*
ham   Siva is in hostel aha:-.
ham   Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor.
spam   FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop
spam   Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B
spam   URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU

Note: messages are not chronologically sorted.
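
As a rough illustration of the line format described above, the sketch below splits tab-separated lines into (label, text) pairs using only the standard library. The helper name `parse_sms_lines` and the two in-memory sample lines are assumptions made for this example; the notebook itself loads the file with pandas.

```python
from collections import Counter

def parse_sms_lines(lines):
    """Split each 'label<TAB>text' line into a (label, text) pair."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the message survive.
        label, text = line.split("\t", 1)
        records.append((label, text))
    return records

# Two sample messages from the listing above, with a tab after the label.
sample_lines = [
    "ham\tOk lar... Joking wif u oni...\n",
    "spam\tSunshine Quiz! Win a super Sony DVD recorder if you canname "
    "the capital of Australia? Text MQUIZ to 82277. B\n",
]

records = parse_sms_lines(sample_lines)
label_counts = Counter(label for label, _ in records)
print(label_counts)
```

Splitting on the first tab only is a defensive choice: it keeps the message intact if the text itself happens to contain a tab.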

In [7]:
#import libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
import time

#try different methods for text vectorization
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# train_test_split used to live in sklearn.cross_validation, which was
# removed in scikit-learn 0.20; it is now imported from model_selection
from sklearn.model_selection import train_test_split

#algorithms
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

#metrics
from sklearn.metrics import confusion_matrix

In [9]:
# # reading in a file for practice sake
# file = open('SMSSpamCollection','r')
# datsy = file.readlines()
# datsy

In [5]:
sms_data = pd.read_table('SMSSpamCollection',
                   sep='\t', 
                   header=None, 
                   names=['label', 'text'])
sms_data.head()

  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [15]:
#description of data values
print(sms_data.describe())

       label                    text
count   5572                    5572
unique     2                    5169
top      ham  Sorry, I'll call later
freq    4825                      30


In [16]:
# counts of each label
print("Ham:",sum(sms_data.label == 'ham'))
print("Spam:",sum(sms_data.label == 'spam'))
print(sms_data.shape)

Ham: 4825
Spam: 747
(5572, 2)


In [20]:
#vectorizing the text
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sms_data.text.values)
tfvec = TfidfVectorizer()
tfidf = tfvec.fit_transform(sms_data.text.values)
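
To make the vectorization step more concrete, here is a simplified, pure-Python sketch of what a bag-of-words count matrix looks like. It is not the scikit-learn implementation: the helpers `tokenize` and `bag_of_words` are hypothetical names, and the tokenization only imitates `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters). The result is a dense list of lists, whereas `CountVectorizer` returns a sparse matrix.

```python
import re
from collections import Counter

# Tokens of two or more word characters, imitating CountVectorizer's
# default token_pattern; lowercasing imitates its default lowercase=True.
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and extract tokens of length >= 2."""
    return TOKEN_RE.findall(text.lower())

def bag_of_words(docs):
    """Return (sorted vocabulary, one row of token counts per document)."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, n in Counter(tokenize(doc)).items():
            row[index[tok]] = n
        rows.append(row)
    return vocab, rows

# Two ham messages from the examples at the top of this notebook.
docs = ["Ok lar... Joking wif u oni...", "What you doing?how are you?"]
vocab, rows = bag_of_words(docs)
print(vocab)
print(rows)
```

Each column corresponds to one vocabulary word and each row to one message. Single-character tokens such as "u" are dropped under this token pattern, which matters for SMS text where such abbreviations are common.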

In [28]:
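# split the count features into train (75%) and test (25%) sets
# note: the vectorizer above was fit on the full dataset, so its vocabulary
# also includes words that appear only in the test messages
X_train, X_test, y_train, y_test = train_test_split(counts, sms_data.label, test_size=0.25, random_state=10)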

In [29]:
def train_NB(X, y):
    NB = MultinomialNB()
    NB.fit(X,y)
    return NB 
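
For intuition about what the classifier fit above estimates, the following is a minimal sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, written from the textbook formulation rather than taken from scikit-learn. The function names, the toy training messages, and the dictionary-based model are all assumptions made for illustration; `alpha=1.0` matches the default smoothing value of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log-likelihoods."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    model = {}
    for label, n_docs in class_docs.items():
        total = sum(token_counts[label].values())
        denom = total + alpha * len(vocab)  # smoothed normalizer
        log_lik = {
            tok: math.log((token_counts[label][tok] + alpha) / denom)
            for tok in vocab
        }
        model[label] = (math.log(n_docs / len(docs)), log_lik)
    return model

def predict(model, doc):
    """Return the class with the highest log posterior score.

    Tokens that never appeared in training are ignored.
    """
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        scores[label] = log_prior + sum(
            log_lik[tok] for tok in doc if tok in log_lik
        )
    return max(scores, key=scores.get)

# Hypothetical toy data, already tokenized.
train_docs = [
    ["win", "prize", "now"],
    ["call", "me", "later"],
    ["free", "prize", "call"],
    ["see", "you", "later"],
]
train_labels = ["spam", "ham", "spam", "ham"]

model = fit_multinomial_nb(train_docs, train_labels)
print(predict(model, ["free", "prize"]))
print(predict(model, ["see", "you", "later"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that was never seen in one class from forcing that class's probability to zero.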

In [30]:
#using just the text
train_start = time.time()
NB = train_NB(X_train, y_train)
train_end = time.time()
print('Training complete', train_end-train_start)
 

accuracy = NB.score(X_test, y_test)
print('Accuracy:',accuracy)

Training complete 0.06606554985046387
Accuracy: 0.969131371141


In [38]:
# confusion matrix, called as confusion_matrix(y_true, y_pred)
# rows are true labels, columns are predicted labels, in sorted order (ham, spam)
predictions_NB = NB.predict(X_test)
confusion_matrix(y_test, predictions_NB)

array([[1190,   25],
       [  18,  160]])
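
The confusion matrix above can be unpacked into per-class metrics. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (ham first, then spam); the variable names are chosen for this example only.

```python
# Confusion matrix reported above: rows = true label, columns = predicted
# label, both in the order (ham, spam).
matrix = [[1190, 25],
          [18, 160]]

(ham_as_ham, ham_as_spam), (spam_as_ham, spam_as_spam) = matrix
total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam

# Fraction of all test messages classified correctly.
accuracy = (ham_as_ham + spam_as_spam) / total
# Of the messages flagged as spam, how many really were spam.
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# Of the true spam messages, how many were caught.
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)

print(f"accuracy:       {accuracy:.4f}")
print(f"spam precision: {spam_precision:.4f}")
print(f"spam recall:    {spam_recall:.4f}")
```

The accuracy recovered this way matches the score reported earlier (about 0.969). Spam precision comes out near 0.865 and spam recall near 0.899, so 25 legitimate messages were flagged as spam and 18 spam messages slipped through.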

In [35]:
print("pred ham",sum(predictions_NB == 'ham'))
print("pred spam",sum(predictions_NB == 'spam'))
print("true ham",sum(y_test == 'ham'))
print("true spam",sum(y_test == 'spam'))


pred ham 1208
pred spam 185
true ham 1215
true spam 178


In [18]:
#talk about ways to deal with unbalanced data

#establish baseline
print("Ham %: ",100*sum(sms_data.label=='ham') / sms_data.shape[0])
print("Spam %: ",100*sum(sms_data.label=='spam') / sms_data.shape[0])


Ham %:  86.5936826992
Spam %:  13.4063173008
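
One way to read these percentages is as a majority-class baseline: a classifier that always answers "ham" is right whenever the message is ham. The sketch below compares that baseline with the Naive Bayes accuracy reported earlier, using only numbers already printed in this notebook; the variable names are assumptions for illustration.

```python
# Label counts printed earlier for the full dataset.
ham_count, spam_count = 4825, 747
total = ham_count + spam_count

# Accuracy of always predicting the majority class ("ham") on all data.
baseline_accuracy = max(ham_count, spam_count) / total

# The same baseline on the test split, using the true ham/spam counts
# printed for y_test (1215 ham, 178 spam).
test_baseline_accuracy = 1215 / (1215 + 178)

# Naive Bayes test accuracy reported earlier.
nb_accuracy = 0.969131371141

print(f"baseline (full data):  {baseline_accuracy:.4f}")
print(f"baseline (test split): {test_baseline_accuracy:.4f}")
print(f"Naive Bayes:           {nb_accuracy:.4f}")
print(f"gain over test baseline: {nb_accuracy - test_baseline_accuracy:.4f}")
```

Because the classes are unbalanced, accuracy alone can look better than it is: a model that never flags spam would already score about 87% on the test split, so the gain over that figure is the more informative number.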
