## SPAM Classifier 

We will use the SMS Spam Collection dataset from the UCI Machine Learning Repository, which is freely available.

It can be found at https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection.

We will follow these steps:

1. Get the data
2. Preprocess the data
3. Create train and test datasets
4. Use a Naive Bayes classifier to identify spam messages
5. Compute evaluation metrics


In [1]:
#import statements

import os
import pandas as pd

#nltk
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

#sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

#plotting
import matplotlib.pyplot as plt
from yellowbrick.classifier import ConfusionMatrix


In [2]:
#get the current working directory
cwd = os.getcwd()

In [3]:
data = pd.read_csv(os.path.join(cwd, "smsspamcollection", "SMSSpamCollection"), sep="\t", names=['label', 'messages'])

In [4]:
data

Unnamed: 0,label,messages
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


Let's convert the ham and spam categories to 0 and 1 respectively; these will be our labels.

In [5]:
#map each category name to its integer label (ham -> 0, spam -> 1)
data['label'] = data['label'].map({'ham':0, 'spam':1})
data

Unnamed: 0,label,messages
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...
5568,0,Will ü b going to esplanade fr home?
5569,0,"Pity, * was in mood for that. So...any other s..."
5570,0,The guy did some bitching but I acted like i'd...
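The encoding step can be checked on a small frame before moving on. The sketch below is only an illustration: the rows are made up, the name `toy` is hypothetical, and it uses `Series.map` for the encoding. It also prints the share of each class, which is worth knowing because accuracy is harder to interpret when one class dominates.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from the dataset.
toy = pd.DataFrame({
    "label": ["ham", "ham", "spam", "ham"],
    "messages": ["see you at lunch", "ok thanks", "win a free prize now", "call me later"],
})

# Same encoding as in the notebook: ham -> 0, spam -> 1.
toy["label"] = toy["label"].map({"ham": 0, "spam": 1})

# Share of each class, indexed by the integer label.
class_share = toy["label"].value_counts(normalize=True).sort_index()
print(class_share)
```

Running the same `value_counts(normalize=True)` call on `data['label']` would show how the real dataset is split between ham and spam.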


## Using Porter Stemming to create corpus

In [6]:
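#now we will preprocess the messages
import re

#create porter stemmer object
ps = PorterStemmer()

#build the stopword set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words("english"))

corpus=[]


for sent in data['messages']:
    #replace every non-letter character (punctuation and digits) with a space
    sent = re.sub('[^a-zA-Z]', ' ', sent)
    sent = sent.lower()
    sent = sent.split()
    #drop stopwords and stem the remaining words
    sent = [ps.stem(word) for word in sent if word not in stop_words]
    sent = " ".join(sent)
    corpus.append(sent)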

In [7]:
data['corpus'] = corpus

In [8]:
#porter stemmed data
data_ps = data.drop('messages', axis = 1)

In [9]:
data_ps

Unnamed: 0,label,corpus
0,0,go jurong point crazi avail bugi n great world...
1,0,ok lar joke wif u oni
2,1,free entri wkli comp win fa cup final tkt st m...
3,0,u dun say earli hor u c alreadi say
4,0,nah think goe usf live around though
...,...,...
5567,1,nd time tri contact u u pound prize claim easi...
5568,0,b go esplanad fr home
5569,0,piti mood suggest
5570,0,guy bitch act like interest buy someth els nex...
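The preprocessing loop can also be written as a function that handles one message at a time, which makes it easy to try on a single string. This is a minimal sketch, not the notebook's own code: `preprocess` and `STOP_WORDS` are hypothetical names, and the tiny hand-written stopword list is an assumption made so the example runs without downloading NLTK data (the notebook uses `stopwords.words("english")`).

```python
import re
from nltk.stem.porter import PorterStemmer

# Tiny hand-written stopword list (an assumption of this sketch).
# The notebook itself uses stopwords.words("english").
STOP_WORDS = {"the", "a", "an", "is", "to", "and"}

ps = PorterStemmer()

def preprocess(message, stop_words=STOP_WORDS):
    """Hypothetical helper mirroring the stemming loop for a single message."""
    # Replace every non-letter character (punctuation and digits) with a space.
    letters_only = re.sub("[^a-zA-Z]", " ", message)
    tokens = letters_only.lower().split()
    # Drop stopwords, then stem what is left.
    return " ".join(ps.stem(tok) for tok in tokens if tok not in stop_words)

print(preprocess("Ok lar... Joking wif u oni..."))  # ok lar joke wif u oni
```

The printed result matches row 1 of the stemmed table above. With a helper like this, the corpus column could be built with `data['messages'].apply(preprocess)`.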


## Using Lemmatization to create corpus

In [10]:
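#create an empty list to store the corpus
corpus_lemma=[]

#create a lemmatizer object
lemma = WordNetLemmatizer()

#build the stopword set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words("english"))

for sent in data['messages']:
    #replace every non-letter character (punctuation and digits) with a space
    sent = re.sub('[^a-zA-Z]', ' ', sent)
    sent = sent.lower()
    sent = sent.split()
    #drop stopwords and lemmatize the remaining words
    sent = [lemma.lemmatize(word) for word in sent if word not in stop_words]
    sent = " ".join(sent)
    corpus_lemma.append(sent)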

In [11]:
data['corpus_lemma'] = corpus_lemma

In [12]:
data_lemma = data.drop(['messages','corpus'], axis=1)

In [13]:
data_lemma

Unnamed: 0,label,corpus_lemma
0,0,go jurong point crazy available bugis n great ...
1,0,ok lar joking wif u oni
2,1,free entry wkly comp win fa cup final tkts st ...
3,0,u dun say early hor u c already say
4,0,nah think go usf life around though
...,...,...
5567,1,nd time tried contact u u pound prize claim ea...
5568,0,b going esplanade fr home
5569,0,pity mood suggestion
5570,0,guy bitching acted like interested buying some...


## Vectorization using Bag of words

#### Using Porter Stemmed data

In [14]:

def model(vectorizer,corpus, label):
    #create features
    X = vectorizer.fit_transform(corpus).toarray() 
    #create labels
    y = label

    #now splitting the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

    #creating object for NaiveBayes
    nb_model = MultinomialNB()
    
    #predicting
    y_pred_test = nb_model.fit(X_train, y_train).predict(X_test)
    
    #accuracy score
    score = accuracy_score(y_test, y_pred_test)
    
    return score
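Note that `model()` fits the vectorizer on the whole corpus before splitting, so the vocabulary (and, for TF-IDF, the IDF weights) are learned from the test messages as well, and `.toarray()` turns the sparse matrix into a dense one. A possible variant, sketched below under the assumption that a scikit-learn pipeline is acceptable here, splits the raw text first and keeps the matrix sparse. The name `model_pipeline` and the toy corpus are hypothetical, and the scores reported in this notebook come from `model()`, not from this variant.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def model_pipeline(vectorizer, corpus, label):
    """Hypothetical variant of model(): split first, fit the vectorizer on the training part only."""
    X_train, X_test, y_train, y_test = train_test_split(
        corpus, label, test_size=0.2, random_state=4
    )
    # The pipeline passes the sparse matrix straight to MultinomialNB.
    pipe = make_pipeline(vectorizer, MultinomialNB())
    pipe.fit(X_train, y_train)
    return accuracy_score(y_test, pipe.predict(X_test))

# Made-up toy corpus, only to show the call; the score on it means nothing.
toy_corpus = ["free prize win", "win cash now", "free entry claim", "claim prize now", "free cash",
              "see you later", "ok thanks", "call me later", "lunch at home", "ok see you"]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

score = model_pipeline(CountVectorizer(), toy_corpus, toy_labels)
print(score)
```

On the real data the call would look like `model_pipeline(CountVectorizer(), data['corpus'], data['label'])`.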

In [15]:
#creating instance of CountVectorizer from sklearn
vectorizer = CountVectorizer()

In [16]:
score_ps_cv = model(vectorizer, data['corpus'], data['label'])
print("Accuracy score using Porter stemmer and Bag of Words vectorization is : " + str(score_ps_cv))

Accuracy score using Porter stemmer and Bag of Words vectorization is : 0.9721973094170404
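To make the Bag of Words representation concrete, here is a small sketch on two made-up documents (the documents and the name `toy_vectorizer` are assumptions for illustration). Each column is one vocabulary term and each cell is the number of times that term occurs in the document. With its default settings, `CountVectorizer` only keeps tokens of two or more characters, so single-letter tokens such as "u", which are common in this corpus, do not become features.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy documents for illustration.
docs = ["u win free prize", "ok u free"]

toy_vectorizer = CountVectorizer()
counts = toy_vectorizer.fit_transform(docs).toarray()

# Sorted vocabulary terms line up with the column order of the count matrix.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)   # ['free', 'ok', 'prize', 'win']
print(counts)  # [[1 0 1 1]
               #  [1 1 0 0]]
```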


#### Using Lemmatized data 

In [21]:
#creating instance of CountVectorizer from sklearn
vectorizer = CountVectorizer()

In [22]:
score_lemma_cv = model(vectorizer,data['corpus_lemma'], data['label'])
print("Accuracy score using Lemmatized data and Bag of Words vectorization is : " + str(score_lemma_cv))

Accuracy score using Lemmatized data and Bag of Words vectorization is : 0.9721973094170404


## Vectorization using TF-IDF

#### Using Porter Stemmed data


In [24]:
#creating instance of Tfidf Vectorizer from sklearn
vectorizer = TfidfVectorizer()

score_ps_tf = model(vectorizer, data['corpus'], data['label'])
print("Accuracy score using Porter stemmer and TFiDF vectorization is : " + str(score_ps_tf))

Accuracy score using Porter stemmer and TFiDF vectorization is : 0.9596412556053812
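The difference from raw counts can be seen on a tiny example. The sketch below uses two made-up documents and hypothetical variable names. With the default settings of `TfidfVectorizer`, a term that appears in fewer documents gets a larger weight than one that appears in all of them, and each row is scaled to unit Euclidean length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy documents: "free" is in both, "prize" and "ok" in one each.
docs = ["free prize", "free ok"]

toy_tfidf = TfidfVectorizer()
weights = toy_tfidf.fit_transform(docs).toarray()

# Sorted vocabulary terms line up with the column order of the matrix.
vocab = sorted(toy_tfidf.vocabulary_)  # ['free', 'ok', 'prize']

# Under the default norm="l2", every row has Euclidean length 1.
row_norms = np.linalg.norm(weights, axis=1)

# In the first document the rarer term outweighs the common one.
free_w = weights[0, vocab.index("free")]
prize_w = weights[0, vocab.index("prize")]

print(vocab)
print(weights.round(3))
print(prize_w > free_w)  # True
```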


#### Using Lemmatized data

In [25]:
#creating instance of Tfidf Vectorizer from sklearn
vectorizer = TfidfVectorizer()

score_lemma_tf = model(vectorizer,data['corpus_lemma'], data['label'])
print("Accuracy score using Lemmatized data and TFiDF vectorization is : " + str(score_lemma_tf))

Accuracy score using Lemmatized data and TFiDF vectorization is : 0.9605381165919282


In [29]:

accuracy_data = pd.DataFrame({'Porter Stemmer':[score_ps_cv, score_ps_tf],'Lemmatized':[score_lemma_cv, score_lemma_tf]},
                            index=['Count_vectorizer','TFiDF'])

In [30]:
accuracy_data

Unnamed: 0,Porter Stemmer,Lemmatized
Count_vectorizer,0.972197,0.972197
TFiDF,0.959641,0.960538


As the table above shows, CountVectorizer gave the same accuracy (about 0.972) for both the Porter-stemmed and the lemmatized data, and that accuracy is higher than with TF-IDF (about 0.960). This ordering may be reversed on other datasets; it depends on the corpus and on how it is preprocessed.
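Accuracy alone does not show which kind of mistake the classifier makes, so a confusion matrix is a natural next step for the evaluation. The notebook imports yellowbrick's `ConfusionMatrix`; the sketch below uses `sklearn.metrics` instead, on made-up labels and predictions, purely to show how the numbers are read. Applying it to the real model would assume that `model()` is changed to return `y_test` and `y_pred_test`, which it currently does not.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up labels and predictions (0 = ham, 1 = spam), for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)  # [[4 1]
           #  [1 2]]

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total = 6 / 8
precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 2 / 3
recall = recall_score(y_true, y_pred)        # tp / (tp + fn) = 2 / 3
print(accuracy, precision, recall)
```

For a spam filter, a false positive (ham flagged as spam) and a false negative (spam let through) usually have different costs, which is why precision and recall are worth reporting alongside accuracy.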

In [None]:
ConfusionMatrix(y_tesst, 