<h3>SMS Spam Filtering using Machine Learning and NLTK/NLP</h3>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from nltk import word_tokenize, NaiveBayesClassifier, classify, WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.metrics import confusion_matrix, accuracy_score

In [2]:
#reading data from CSV file
df = pd.read_csv('C:\\Users\\Karthik\\Desktop\\spam.csv', encoding = "ISO-8859-1", skipinitialspace=True, usecols=['v1', 'v2'])
#renaming the columns and reordering them so that the text comes first
df.columns = ['label','text']
df = df[['text', 'label']]

In [3]:
df.head()

,text,label
0,"Go until jurong point, crazy.. Available only ...",ham
1,Ok lar... Joking wif u oni...,ham
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,U dun say so early hor... U c already then say...,ham
4,"Nah I don't think he goes to usf, he lives aro...",ham


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
text     5572 non-null object
label    5572 non-null object
dtypes: object(2)
memory usage: 43.6+ KB


<span>We observe that there are 5572 records in the dataset (indexed 0 to 5571), with no missing values in either column.</span>

In [5]:
#preprocessing the text: tokenize, lowercase, lemmatize and remove stopwords
#a set makes the stopword membership test fast
CommonWords = set(stopwords.words('english'))
wordLemmatizer = WordNetLemmatizer()
corpus = []
#iterating over the messages directly avoids hardcoding the number of rows
for text in df['text']:
    #tokenizing a sentence gives individual words; each word is lowercased and lemmatized
    wordtokens = [wordLemmatizer.lemmatize(word.lower()) for word in word_tokenize(text)]
    #filtering out the stopwords and joining the remaining tokens back into one string
    text = ' '.join([x for x in wordtokens if x not in CommonWords])
    corpus.append(text)

In [6]:
print(corpus[0:3])

['go jurong point , crazy.. available bugis n great world la e buffet ... cine got amore wat ...', 'ok lar ... joking wif u oni ...', "free entry 2 wkly comp win fa cup final tkts 21st may 2005 . text fa 87121 receive entry question ( std txt rate ) & c 's apply 08452810075over18 's"]


<span>I have not removed the '...' from the SMS texts as a part of data cleaning, as I think it is an important feature: people tend to type '...' quite often.</span>

In [7]:
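from sklearn.feature_extraction.text import CountVectorizer
#creating the bag of words model: a document-term count matrix limited to the 5000 most frequent tokens
cv = CountVectorizer(max_features=5000)
#fit_transform returns a sparse matrix; toarray() converts it to a dense array
X = cv.fit_transform(corpus).toarray()
#the labels ('ham' or 'spam') are in the second column
y = df.iloc[:, 1].values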
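
<span>One caveat about keeping '...' as a feature: the default token_pattern documented for CountVectorizer is the regular expression (?u)\b\w\w+\b, which matches only runs of two or more word characters. The sketch below applies that pattern with Python's re module (a stand-in for the vectorizer's tokenizer, not the vectorizer itself) to one of the cleaned messages. It suggests that '...' and single-character tokens such as 'u' would not reach the feature matrix with the default settings. The alternative pattern and the function name are hypothetical, chosen only for illustration.</span>

```python
import re
from collections import Counter

# Default token_pattern documented for scikit-learn's CountVectorizer.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical alternative: also keep runs of two or more dots as one token.
KEEP_ELLIPSIS_PATTERN = r"(?u)\b\w\w+\b|\.{2,}"


def tokenize(text, pattern):
    """Lowercase the text and return every regex match, in order."""
    return re.findall(pattern, text.lower())


# Second cleaned message from the corpus printed above.
message = "ok lar ... joking wif u oni ..."

default_tokens = tokenize(message, DEFAULT_PATTERN)
ellipsis_tokens = tokenize(message, KEEP_ELLIPSIS_PATTERN)

print(default_tokens)            # word tokens only, no '...' and no 'u'
print(Counter(ellipsis_tokens))  # '...' is now counted as a token
```

<span>If the '...' feature should be kept, one option is to pass a pattern like this one through the token_pattern argument of CountVectorizer and compare the resulting scores.</span>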

In [8]:
from sklearn.model_selection import train_test_split
#creating the training and testing set for our models
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)



<h3>Using Naive Bayes to fit the dataset</h3>

In [9]:
from sklearn.naive_bayes import GaussianNB
classifierNB = GaussianNB()
classifierNB.fit(X_train, y_train)
#making the prediction for the test results
y_pred_NB = classifierNB.predict(X_test)

In [10]:
#making the confusion matrix
cm_NB = confusion_matrix(y_test, y_pred_NB)
cm_NB

array([[1068,  128],
       [  21,  176]])

In [11]:
score = accuracy_score(y_test, y_pred_NB)
print(score*100)

89.303661163


<h3>Using kNN to fit the dataset</h3>

In [12]:
from sklearn.neighbors import KNeighborsClassifier
classifierKNN = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifierKNN.fit(X_train, y_train)
#making the prediction for the test results
y_pred_KNN = classifierKNN.predict(X_test)

In [13]:
#making the confusion matrix
cm_KNN = confusion_matrix(y_test, y_pred_KNN)
cm_KNN

array([[1196,    0],
       [ 129,   68]])

In [14]:
score = accuracy_score(y_test, y_pred_KNN)
print(score*100)

90.7394113424
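
<span>Accuracy alone can hide how a classifier treats the minority class. The sketch below reads the kNN confusion matrix printed above, assuming scikit-learn's convention (rows are true labels, columns are predicted labels, with labels sorted so that 'ham' comes before 'spam'), and derives the precision and recall for spam. The function name is my own.</span>

```python
def spam_metrics(cm):
    """Return accuracy, spam precision and spam recall for a 2x2 matrix.

    Assumes rows are true labels and columns are predicted labels,
    both ordered ['ham', 'spam'].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against division by zero when no message is predicted as spam.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# kNN confusion matrix copied from the output above.
cm_knn = [[1196, 0], [129, 68]]
accuracy, precision, recall = spam_metrics(cm_knn)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

<span>For kNN this gives a spam precision of 1.0 but a spam recall of about 0.35: no ham message is flagged as spam, yet only 68 of the 197 spam messages in the test set are caught.</span>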


<h3>Using SVM to fit the dataset</h3>

In [15]:
from sklearn.svm import SVC
classifierSVM = SVC(kernel = 'linear', random_state = 0)
classifierSVM.fit(X_train, y_train)
#making the prediction for the test results
y_pred_SVM = classifierSVM.predict(X_test)

In [16]:
#making the confusion matrix
cm_SVM = confusion_matrix(y_test, y_pred_SVM)
cm_SVM

array([[1193,    3],
       [  16,  181]])

In [17]:
score = accuracy_score(y_test, y_pred_SVM)
print(score*100)

98.6360373295


<h3>Using Random Forest to fit the dataset</h3>

In [18]:
from sklearn.ensemble import RandomForestClassifier
classifierRF = RandomForestClassifier()
classifierRF.fit(X_train, y_train)
y_pred_RF = classifierRF.predict(X_test)

In [19]:
cm_RF = confusion_matrix(y_test, y_pred_RF)
cm_RF

array([[1194,    2],
       [  43,  154]])

In [20]:
score = accuracy_score(y_test, y_pred_RF)
print(score*100)

96.7695620962


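<span>Thus, computing accuracy from the confusion matrices, we can see that the linear SVM model is best able to predict the type of SMS text (about 98.6%, that is (1193 + 181) / 1393), followed by Random Forest in second place (about 96.8%), kNN in third (about 90.7%) and Naive Bayes in fourth (about 89.3%).</span>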
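
<span>The accuracies can be recomputed directly from the four confusion matrices printed above. The sketch below calculates each one as correct predictions divided by all predictions and adds a majority-class baseline (always predicting 'ham') for context. The dictionary, function and variable names are my own.</span>

```python
# Confusion matrices copied from the outputs above
# (rows: true labels, columns: predicted labels, ordered ham, spam).
matrices = {
    "Naive Bayes": [[1068, 128], [21, 176]],
    "kNN": [[1196, 0], [129, 68]],
    "SVM": [[1193, 3], [16, 181]],
    "Random Forest": [[1194, 2], [43, 154]],
}


def accuracy(cm):
    """Share of test messages on the diagonal (correctly classified)."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total


scores = {name: accuracy(cm) for name, cm in matrices.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

# Baseline: label every test message 'ham'. The first row of any matrix
# holds the true ham messages, so its sum is the number of ham messages.
any_cm = matrices["SVM"]
baseline = sum(any_cm[0]) / sum(sum(row) for row in any_cm)

for name in ranking:
    print(f"{name}: {scores[name] * 100:.2f}%")
print(f"Always-ham baseline: {baseline * 100:.2f}%")
```

<span>The always-ham baseline already reaches about 85.9% on this test set, so the Naive Bayes and kNN scores are only a modest improvement over predicting 'ham' for every message.</span>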