# A short NLP tutorial: the task is to build a model that predicts whether a given message (SMS) is spam or ham

#####  Steps to create the model
    1. Load the dataset (the dataset used in this tutorial is available at https://archive.ics.uci.edu/ml/datasets.php)
    
    2. Write a function that iterates over the whole dataset and performs the following tasks:
        2.1 remove all special characters from the message
        2.2 convert the message to lower case
        2.3 split the message into a list of words, then iterate over the words to remove stopwords and get each word's stem/lemma (more 
            on this later)
        2.4 rebuild the message from the words that remain after removing stopwords and stemming/lemmatizing
        2.5 append the clean message to message_set
        
    3. Create bag of words
    
    4. Create two models using MultinomialNB(), one model for the message set which was created using Stemming & one for Lemmatization
    
    5. Evaluate the models and check their accuracies

# 1. Import the necessary libraries

In [114]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
import re
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import confusion_matrix

In [115]:
# Let's load the dataset; it is available on the website "https://archive.ics.uci.edu/ml/datasets.php"

# The file has a label (spam/ham) and the message itself. Labels and messages are tab-separated, so we use "\t" as the
# separator. The dataset does not have any column names, so let's name them "Label" and "Message".

spamham = pd.read_csv("SMSSpamCollection.txt",sep = "\t", names = ["Label","Message"])
spamham.head()

Unnamed: 0,Label,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [116]:
# check the no. of rows and columns present in the dataset
spamham.shape

(5572, 2)

# 2 Removing noise from the dataset

### What is Noise?
Look at the messages in the dataset and you will find the same word in different cases ("Go" and "go"), very common words ("is", "the"), and different forms of the same word ("drive", "driving"). For this task these variations are noise, and they need to be cleaned up as explained below.

### StopWords
Words like "the" and "is" are among the most frequent words in any text. We remove them from the messages in the dataset before training the model because they add little to the model's predictive power.
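
As a minimal sketch of stopword removal, the snippet below filters a message against a small stopword set. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical and exist only for illustration; the notebook itself uses NLTK's much larger English list.

```python
# A tiny, hypothetical stopword set for illustration only;
# nltk's stopwords.words("english") is far more complete.
STOPWORDS = {"the", "is", "a", "to", "in", "and", "of"}

def remove_stopwords(message):
    # Lower-case first so that "The" and "the" are treated alike
    return [word for word in message.lower().split() if word not in STOPWORDS]

tokens = remove_stopwords("The offer is valid in the UK")
print(tokens)  # ['offer', 'valid', 'uk']
```

Using a set rather than a list keeps each membership check cheap, which matters when the check runs once per word across thousands of messages.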

### Stemming
Stemming is a rule-based technique that chops the suffix off a word to get its root form, which is called the ‘stem’. For example, "going" is converted to "go", but "crazy" is converted to "crazi" and "available" to "avail". Since "crazi" and "avail" are not valid words, stemming is fast but crude, and we need a better option.
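
To make the "rule-based suffix chopping" idea concrete, here is a toy stemmer. It is only a sketch: the `SUFFIX_RULES` table and `toy_stem` function are hypothetical, and the Porter algorithm used in this notebook applies many more rules and conditions.

```python
# Hypothetical suffix rules, tried in order: (suffix, replacement).
# This is NOT the Porter algorithm, just an illustration of the idea.
SUFFIX_RULES = [("ing", ""), ("ies", "i"), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem_len=2):
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        # Only strip the suffix if enough of the word is left over
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem + replacement
    return word

words = ["going", "goes", "played", "entries", "sing"]
print([toy_stem(w) for w in words])
# ['go', 'goe', 'play', 'entri', 'sing']
```

Even this toy version shows the weakness described above: "goe" and "entri" are not dictionary words.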

### Lemmatization: 
Lemmatization usually gives cleaner results than stemming. It looks each word up in a dictionary (WordNet, in NLTK's case), taking the word's part of speech into account, and returns its base form, the "lemma". Hence "goes" and "going" are both converted to "go" and "got" is converted to "get", while "crazy" stays a valid word. This method is slower than stemming, though.
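
The lookup idea can be sketched with a small table keyed by word and part of speech. The `LEMMAS` table and `toy_lemmatize` function are hypothetical; a real lemmatizer such as `WordNetLemmatizer` consults a full dictionary instead of four hand-written entries.

```python
# Hypothetical lookup table keyed by (word, part_of_speech).
LEMMAS = {
    ("going", "v"): "go",
    ("goes", "v"): "go",
    ("got", "v"): "get",
    ("driving", "v"): "drive",
}

def toy_lemmatize(word, pos="v"):
    # In this sketch, words missing from the table are returned unchanged
    return LEMMAS.get((word, pos), word)

print([toy_lemmatize(w) for w in ["goes", "got", "crazy"]])
# ['go', 'get', 'crazy']
print(toy_lemmatize("driving", pos="n"))
# driving
```

The second call shows why the part of speech matters: the same word can map to different lemmas, or to itself, depending on how it is used.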

In [117]:
# The following two objects are needed to convert words to their stem and lemma forms.
stemmer = PorterStemmer()
lemma = WordNetLemmatizer()

# 3. Function to preprocess the dataset

In [118]:
# Now let's clean the dataset by removing special characters.

# re.sub looks for every occurrence of the pattern in the string and replaces each one with the replacement
# string; it returns the original string unchanged if no match is found.

# The pattern uses the meta-sequence \w, which matches any alphanumeric character (and the underscore).
# Inside square brackets, a leading "^" negates the set, so "[^\w]" matches every character that is NOT alphanumeric.
# Replacing those matches with a space leaves only the alphanumeric characters in the messages.

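def preprocess(data, stem=True):
    # Build the stopword set once; calling stopwords.words() for every word is slow
    stop_words = set(stopwords.words("english"))
    message_set = []
    for message in data["Message"]:
        # Replace every non-alphanumeric character with a space
        temp = re.sub(r'[^\w]', ' ', message)
        temp = temp.lower()
        temp = temp.split()
        sentence = " "
        for word in temp:
            if word not in stop_words:
                if stem:
                    word = stemmer.stem(word)
                else:
                    # pos='v' treats every word as a verb when looking up its lemma
                    word = lemma.lemmatize(word, pos='v')
                sentence = sentence + word + ' '
        message_set.append(sentence)
    return message_set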
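
To see what the cleaning steps do to a single message, the sketch below applies the same regular expression, lower-casing and splitting in isolation. The `tokenize` helper is a hypothetical name used only for this illustration.

```python
import re

def tokenize(message):
    # Replace every character that is not a letter, digit or underscore with a space
    cleaned = re.sub(r"[^\w]", " ", message)
    # Lower-case, then split on whitespace (runs of spaces collapse)
    return cleaned.lower().split()

print(tokenize("Ok lar... Joking wif u oni..."))
# ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
print(tokenize("Nah I don't"))
# ['nah', 'i', 'don', 't']
```

The second example shows a side effect of this pattern: the apostrophe is also replaced, so "don't" is split into "don" and "t".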

In [119]:
# create clean message set by running the preprocessing function, this function would remove stopwords and change the words
# to their stem form
message_set = preprocess(spamham,stem = True)
message_set[:5]

[' go jurong point crazi avail bugi n great world la e buffet cine got amor wat ',
 ' ok lar joke wif u oni ',
 ' free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri question std txt rate c appli 08452810075over18 ',
 ' u dun say earli hor u c alreadi say ',
 ' nah think goe usf live around though ']

# 4 Bag of Words 

After the preprocessing steps (stopword removal followed by stemming or lemmatization), the next step is to convert the text into a tabular form. This is done with the bag-of-words representation, also called a bag-of-words model: each row of the table represents one document (a message in this case), each column represents a word in the vocabulary, and each cell holds the number of times that word occurs in that document.

A bag of words can be created using scikit-learn's CountVectorizer class.
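
The table described above can be built by hand for a few short messages. This is a sketch of the idea only, not of how CountVectorizer is implemented; the `bag_of_words` function and the three sample messages are made up for illustration.

```python
from collections import Counter

def bag_of_words(messages):
    # The vocabulary is every distinct word, sorted so the column order is stable
    vocabulary = sorted({word for m in messages for word in m.split()})
    rows = []
    for m in messages:
        counts = Counter(m.split())
        # One row per message, one column per vocabulary word
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocab, X_toy = bag_of_words(["free entri win", "ok lar joke", "win free free"])
print(vocab)
# ['entri', 'free', 'joke', 'lar', 'ok', 'win']
for row in X_toy:
    print(row)
# [1, 1, 0, 0, 0, 1]
# [0, 0, 1, 1, 1, 0]
# [0, 2, 0, 0, 0, 1]
```

Word order within a message is lost; only the counts remain, which is where the name "bag" comes from.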

In [120]:
# Bag of Words model
cv = CountVectorizer(max_features=4000) # max_features keeps only the 4000 most frequent words as features
X = cv.fit_transform(message_set).toarray()

In [121]:
# lets save the target variable to "y"
y = spamham['Label']

# 5 Train Test Split

In [122]:
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)

# 6 Model Building & Evaluation - Stemming

In [123]:
# Training model using Naive bayes classifier
model = MultinomialNB().fit(X_train, y_train)

# make predictions
y_pred=model.predict(X_test)

In [124]:
# lets check the accuracy of the model
accuracy_score(y_test,y_pred)

0.9880382775119617

In [125]:
# Confusion Matrix
confusion = confusion_matrix (y_test, y_pred)
confusion

array([[1443,    8],
       [  12,  209]], dtype=int64)
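
Accuracy alone can hide how the model treats the rarer spam class, so it helps to read precision and recall off the confusion matrix. The sketch below copies the four values from the output above. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels, sorted alphabetically (ham, then spam); the variable names are chosen for this illustration.

```python
# Values copied from the confusion matrix above.
# Row 0: true ham, row 1: true spam; column 0: predicted ham, column 1: predicted spam.
confusion = [[1443, 8], [12, 209]]
(tn, fp), (fn, tp) = confusion  # treating spam as the positive class

accuracy = (tp + tn) / (tp + tn + fp + fn)
spam_precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
spam_recall = tp / (tp + fn)     # of the real spam messages, how many were caught

print(round(accuracy, 4), round(spam_precision, 4), round(spam_recall, 4))
# 0.988 0.9631 0.9457
```

Read this way, the model wrongly flags 8 ham messages as spam and lets 12 spam messages through as ham.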

# 7 Model Building & Evaluation - Lemmatization

In [126]:
# create clean message set by running the preprocessing function, this function would remove stopwords and change the words
# to their lemma form
message_set = preprocess(spamham,stem = False)
message_set[:5]

[' go jurong point crazy available bugis n great world la e buffet cine get amore wat ',
 ' ok lar joke wif u oni ',
 ' free entry 2 wkly comp win fa cup final tkts 21st may 2005 text fa 87121 receive entry question std txt rate c apply 08452810075over18 ',
 ' u dun say early hor u c already say ',
 ' nah think go usf live around though ']

In [127]:
# Bag of Words model
cv = CountVectorizer(max_features=4000)
X = cv.fit_transform(message_set).toarray()

In [128]:
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)

In [129]:
# Training model using Naive bayes classifier
model = MultinomialNB().fit(X_train, y_train)

# make predictions
y_pred=model.predict(X_test)

In [130]:
# lets check the accuracy of the model
accuracy_score(y_test,y_pred)

0.9886363636363636

In [131]:
# Confusion Matrix
confusion = confusion_matrix (y_test, y_pred)
confusion

array([[1445,    6],
       [  13,  208]], dtype=int64)