# Problem Statement

The goal is to detect spam in mail using different machine learning techniques and to compare them to find the most accurate model. Finally, we build a spam detector and evaluate its performance.

The dataset we used is a shuffled sample of email subjects and bodies containing both spam and ham emails in varying proportions, which we converted into lemmas.

Spam is one of the main threats posed to email users, so effective spam filtering contributes significantly to the sustainability of cyberspace and of society. An email account is no less important than a bank account holding 1 crore, so protecting it from spam and fraud matters just as much.

# Work Flow

1. Data Collection -> we use a dataset originally compiled and posted on the UCI Machine Learning Repository; it consists of 2 columns
2. Data Preprocessing -> convert the labels to binary variables, 0 for 'ham' (i.e. not spam) and 1 for 'spam', for ease of computation
3. Data Analysis -> understand what the data contains and how the text can be turned into numerical features
4. Data Splitting -> training data & test data
5. Model Building -> train different machine learning techniques and compare them to find the most accurate model
6. Model Evaluation -> using the test data
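
The workflow above can be sketched end to end on a handful of messages. This is a minimal sketch, not the notebook's own code: the four toy messages are invented for illustration, and `make_pipeline` is used here only as a compact way to chain vectorizing and classification.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy messages, invented for illustration (not taken from the dataset)
train_messages = [
    "win free prize now",
    "free cash win",
    "see you at lunch",
    "call me later tonight",
]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# Turn text into word counts and fit the classifier in one object
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(train_messages, train_labels)

# Classify two unseen messages
predictions = spam_pipeline.predict(["win free cash", "see you later"])
print(predictions)
```

The rest of the notebook performs the same steps one at a time on the full dataset.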

In [1]:
# importing the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Data Collection

In [2]:
# loading the dataset 
data = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'sms_message'])
data.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
data.tail()

| | label | sms_message |
|---|---|---|
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |
| 5571 | ham | Rofl. Its true to its name |


Here, we see that the dataset consists of 2 columns. The first column takes two values: 'ham', which signifies that the message is not spam, and 'spam', which signifies that the message is spam.
The second column is the text content of the SMS message that is being classified.

# Data Preprocessing

In [4]:
# the column names were already set when loading the data
# convert label values to numerical values: 0 for ham and 1 for spam

data['label'] = data.label.map({'ham':0, 'spam':1})
data.head()

| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [5]:
# finding the dataset size
data.shape

(5572, 2)

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   label        5572 non-null   int64 
 1   sms_message  5572 non-null   object
dtypes: int64(1), object(1)
memory usage: 87.2+ KB


In [7]:
# checking for missing values in the dataset
data.isnull().sum()

label          0
sms_message    0
dtype: int64

In [8]:
# finding out the number of ham (0) and spam (1) messages
data.label.value_counts()

0    4825
1     747
Name: label, dtype: int64
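
The counts above show that the classes are imbalanced, which matters when reading accuracy figures later on. As a small sketch using only the two counts reported above, we can compute the share of spam and the accuracy that a trivial "always predict ham" rule would reach (the variable names are ours, chosen for illustration):

```python
# Class counts as reported by data.label.value_counts() above
ham_count, spam_count = 4825, 747
total = ham_count + spam_count

spam_share = spam_count / total
# Accuracy of a rule that labels every message as ham
majority_baseline = ham_count / total

print(f"spam share: {spam_share:.3f}")
print(f"accuracy of always predicting ham: {majority_baseline:.3f}")
```

Any model is only useful if it clearly beats this baseline, which is one reason precision, recall and F1 are reported alongside accuracy.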

# Data Analysis

Our dataset is a large collection of text data (5,572 rows). Most ML algorithms rely on numerical data as input, whereas email/SMS messages are usually text heavy.

**Using Bag of Words**

Here we introduce the Bag of Words (BoW) concept, a term for problems that involve a 'bag of words', i.e. a collection of text data that needs to be worked with. The basic idea of BoW is to take a piece of text and count the frequency of the words in it. BoW treats each word individually, and the order in which the words occur does not matter.

In [9]:
# given document set
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

In [10]:
# Converting all the strings in the document set to lower case

lower_case_documents = []

for sentence in documents:
    lower_case_documents.append(sentence.lower())

In [11]:
# Checking the result 
print(lower_case_documents)

['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello, call hello you tomorrow?']


In [12]:
# removing all the punctuation

sans_punctuation_documents = []
import string

for sentence in lower_case_documents:
    sans_punctuation_documents.append(sentence.translate(str.maketrans('', '', string.punctuation)))

In [13]:
# Checking the result
print(sans_punctuation_documents)

['hello how are you', 'win money win from home', 'call me now', 'hello call hello you tomorrow']


Next, we tokenize each sentence in the document set into individual words using a delimiter. The delimiter specifies which character marks the beginning and end of a word. Most commonly, a single space is used as the delimiter.

In [14]:
# tokenize using single space

preprocessed_documents = []

for sentence in sans_punctuation_documents:
    preprocessed_documents.append(sentence.split(" "))

In [15]:
# Observing the result
print(preprocessed_documents)

[['hello', 'how', 'are', 'you'], ['win', 'money', 'win', 'from', 'home'], ['call', 'me', 'now'], ['hello', 'call', 'hello', 'you', 'tomorrow']]
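
Splitting on a single space works for the cleaned sentences above, but it is sensitive to repeated spaces. A short sketch with a made-up string shows the difference between `split(" ")` and `split()` without arguments:

```python
text = "call  me   now"  # hypothetical input with repeated spaces

# A single-space delimiter yields empty strings between consecutive spaces
by_single_space = text.split(" ")

# With no argument, split() treats any run of whitespace as one delimiter
by_whitespace = text.split()

print(by_single_space)
print(by_whitespace)
```

If the raw messages may contain irregular spacing, `split()` without arguments is usually the safer choice.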


Now that we have our document set in the required format, we can proceed to counting the occurrence of each word in each document of the document set.

Counter counts the occurrence of each item in the list and returns a dictionary with the key as the item being counted and the corresponding value being the count of that item in the list.

In [16]:
# Create a dictionary per document: keys are the words, values are their frequency of occurrence

# List where we save each counter dictionary as an item
frequency_list = []

# import needed libraries
import pprint
from collections import Counter

for sentence in preprocessed_documents:
    count_freq = Counter(sentence)
    frequency_list.append(count_freq)
    
pprint.pprint(frequency_list)

[Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1}),
 Counter({'win': 2, 'money': 1, 'from': 1, 'home': 1}),
 Counter({'call': 1, 'me': 1, 'now': 1}),
 Counter({'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1})]
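
As a quick check of the BoW property that word order does not matter, two sentences with the same words in a different order produce the same counts. The two sentences here are hypothetical examples:

```python
from collections import Counter

# Two made-up messages containing the same words in a different order
first = "win money from home".split()
second = "home from money win".split()

# Counter compares by word counts only, so the order is irrelevant
same_bag = Counter(first) == Counter(second)
print(same_bag)
```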


**Using sklearn**

In [17]:
documents = ['Hello, how are you!',
                'Win money, win from home.',
                'Call me now.',
                'Hello, Call hello you tomorrow?']

In [18]:
# importing the scikit-learn CountVectorizer class
from sklearn.feature_extraction.text import CountVectorizer

# Create an instance. The documents are passed later, to fit/transform;
# the `input` parameter only accepts 'content', 'filename' or 'file'.
count_vector = CountVectorizer()

In [19]:
# Observe the instance created
print(count_vector)

CountVectorizer()


In [20]:
# Fit the document dataset to the CountVectorizer object previously created
count_vector.fit(documents)

# Get the list of words which have been categorized as features
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
list(count_vector.get_feature_names_out())

['are',
 'call',
 'from',
 'hello',
 'home',
 'how',
 'me',
 'money',
 'now',
 'tomorrow',
 'win',
 'you']

In [21]:
# Creating a matrix with each row representing one of the 4 documents and each column representing a word (feature name)
# each value in the matrix is the frequency of the word in that column occurring in the particular document in that row

doc_array = count_vector.transform(documents).toarray()
doc_array

array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]], dtype=int64)

In [22]:
# Convert the 'doc_array' into a dataframe. 
# Use the words (feature names) as columns names

frequency_matrix = pd.DataFrame(data=doc_array, columns=count_vector.get_feature_names_out())

# Observe the new dataframe
frequency_matrix

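| | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |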


# Data Splitting 

In [23]:
from sklearn.model_selection import train_test_split

x = data.sms_message
y = data.label

In [24]:
# Splitting into training data & test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

In [25]:
print(x.shape, x_train.shape, x_test.shape)

(5572,) (4457,) (1115,)


In [26]:
# Instantiate the CountVectorizer class
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(x_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(x_test)
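
To illustrate why the vectorizer is fitted on the training data only, here is a small sketch with a hypothetical mini corpus. Words that appear only in the test document are not part of the learned vocabulary, so `transform` ignores them and the test matrix keeps the same columns as the training matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus, for illustration only
train_docs = ["win money now", "call me now"]
test_docs = ["win a brand new phone now"]

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_matrix = vectorizer.transform(test_docs)        # reuses it unchanged

vocabulary = sorted(vectorizer.vocabulary_)
test_counts = test_matrix.toarray()[0]

print(vocabulary)
print(test_counts)  # only 'now' and 'win' are counted; unseen words are dropped
```

Fitting on the test data as well would leak information about the test set into the features.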

# Naive Bayes implementation

We will be using the multinomial Naive Bayes implementation, which is suitable for classification with discrete features (such as, in our case, word counts for text classification). It takes integer word counts as its input.

In [27]:
# importing the MultinomialNB classifier
from sklearn.naive_bayes import MultinomialNB

# Create the classifier
naive_bayes = MultinomialNB()

# fit the training data into the classifier
naive_bayes.fit(training_data, y_train)

MultinomialNB()
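
To make the role of word counts concrete, the following sketch reproduces by hand the per-class word probabilities that `MultinomialNB` learns, using Laplace smoothing with `alpha=1.0` (the scikit-learn default). The tiny count matrix is made up for illustration; it is not derived from the SMS data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word-count matrix: 4 documents, 3 vocabulary words
X_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 3],
                  [0, 0, 2]])
y_toy = np.array([0, 0, 1, 1])  # 0 = ham, 1 = spam

alpha = 1.0
clf = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Smoothed estimate: (count of word in class + alpha) /
#                    (total words in class + alpha * vocabulary size)
manual_rows = []
for c in (0, 1):
    word_counts = X_toy[y_toy == c].sum(axis=0)
    smoothed = (word_counts + alpha) / (word_counts.sum() + alpha * X_toy.shape[1])
    manual_rows.append(np.log(smoothed))
manual_log_prob = np.array(manual_rows)

matches = np.allclose(manual_log_prob, clf.feature_log_prob_)
print(matches)
```

The smoothing term keeps a word that never occurred in one class from forcing that class's probability to zero.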

In [29]:
# accuracy of training data
from sklearn.metrics import accuracy_score

pred_train_naive_bayes = naive_bayes.predict(training_data)
training_accuracy_NB = accuracy_score(y_train, pred_train_naive_bayes)
training_accuracy_NB

0.9923715503702042

In [30]:
# accuracy of test data

pred_test_naive_bayes = naive_bayes.predict(testing_data)
test_accuracy_NB = accuracy_score(y_test, pred_test_naive_bayes)
test_accuracy_NB

0.9901345291479821

In [31]:
# import accuracy_score, precision_score, recall_score, f1_score from scikit-learn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Computing the accuracy, precision, recall and F1 scores of the model using your test data 'y_test' 
print('Accuracy score: ', format(accuracy_score(y_test, pred_test_naive_bayes)))
print('Precision score: ', format(precision_score(y_test, pred_test_naive_bayes)))
print('Recall score: ', format(recall_score(y_test, pred_test_naive_bayes)))
print('F1 score: ', format(f1_score(y_test, pred_test_naive_bayes)))

Accuracy score:  0.9901345291479821
Precision score:  0.9788732394366197
Recall score:  0.9455782312925171
F1 score:  0.9619377162629758
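
For reference, precision, recall and F1 can all be derived from the confusion matrix. The sketch below uses toy labels invented for illustration (not the notebook's predictions) and checks the hand-computed values against scikit-learn:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy labels, invented for illustration: 1 = spam, 0 = ham
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the cells in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the messages flagged as spam, how many are spam
recall = tp / (tp + fn)     # of the real spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```

For a spam filter, low precision means legitimate messages are flagged as spam, while low recall means spam gets through.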


**Conclusion**
- One of the major advantages of Naive Bayes over other classification algorithms is its ability to handle an extremely large number of features.
- The other major advantage is its relative simplicity.
- Naive Bayes works well right out of the box, and tuning its parameters is rarely necessary, except in cases where the distribution of the data is known.
- It rarely overfits the data.
- Another important advantage is that its training and prediction times are very fast for the amount of data it can handle.

# Decision Tree implementation

We will be using the decision tree classifier implementation, which is suitable for classification with discrete features.

In [32]:
# importing the DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Create the classifier
model_tree = DecisionTreeClassifier()

# fit the training data into the classifier
model_tree.fit(training_data ,y_train)

pred_train_tree = model_tree.predict(training_data)
training_accuracy_tree = accuracy_score(y_train, pred_train_tree)
training_accuracy_tree

1.0

In [33]:
# optimizing model parameters
from sklearn.model_selection import GridSearchCV   

parameters = [{'criterion':['gini','entropy']}]
search = GridSearchCV(model_tree, parameters, scoring='accuracy', cv=5, verbose=True, n_jobs=-1).fit(training_data, y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    5.2s finished


In [34]:
# optimum parameter values
search.best_params_

{'criterion': 'gini'}

In [35]:
model_tree = DecisionTreeClassifier(criterion='gini').fit(training_data, y_train)
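
As a side note on the grid search, `GridSearchCV` refits the best configuration on the whole training set by default (`refit=True`), so the tuned model is also available as `best_estimator_`. The sketch below shows this on synthetic stand-in data from `make_classification`; the names `X_demo`, `y_demo` and `demo_search` are ours and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook uses training_data and y_train instead
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"criterion": ["gini", "entropy"]},
    scoring="accuracy",
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# Already refit on all of X_demo with the best parameters
best_tree = demo_search.best_estimator_

print(demo_search.best_params_, round(demo_search.best_score_, 3))
print(best_tree.get_params()["criterion"])
```

Using `best_estimator_` avoids copying the parameters by hand into a new classifier.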

In [36]:
# accuracy of training data

pred_train_tree = model_tree.predict(training_data)
training_accuracy_tree = accuracy_score(y_train, pred_train_tree)
training_accuracy_tree

1.0

In [37]:
# accuracy of test data

pred_test_tree = model_tree.predict(testing_data)
test_accuracy_tree = accuracy_score(y_test, pred_test_tree)
test_accuracy_tree

0.9713004484304932

In [38]:
# import accuracy_score, precision_score, recall_score, f1_score from scikit-learn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Computing the accuracy, precision, recall and F1 scores of the model using your test data 'y_test' 
print('Accuracy score: ', format(accuracy_score(y_test, pred_test_tree)))
print('Precision score: ', format(precision_score(y_test, pred_test_tree)))
print('Recall score: ', format(recall_score(y_test, pred_test_tree)))
print('F1 score: ', format(f1_score(y_test, pred_test_tree)))

Accuracy score:  0.9713004484304932
Precision score:  0.8859060402684564
Recall score:  0.8979591836734694
F1 score:  0.8918918918918919


**Conclusion**
- Decision trees are used for classification and regression. 
- A decision tree measures the degree of disorder in a set of samples with a quantity called entropy. The entropy varies from sample to sample.
- The entropy is zero for a homogeneous sample, and for a sample divided equally between two classes the entropy is 1.
- It chooses the split that has the lowest entropy compared to the parent node and the other candidate splits. The lower the entropy, the better the split.
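
The entropy values mentioned above can be checked with a few lines of code. This is a minimal sketch of Shannon entropy and information gain on made-up label lists; the helper functions `entropy` and `information_gain` are our own illustrations, not part of scikit-learn. Note that the grid search above selected the Gini criterion for this dataset, which follows the same "purer is better" idea with a different formula.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    weighted = sum(len(child) / len(parent) * entropy(child) for child in children)
    return entropy(parent) - weighted

pure = ["ham"] * 8                       # homogeneous sample
balanced = ["ham"] * 4 + ["spam"] * 4    # equally divided sample

print(entropy(pure))
print(entropy(balanced))

# A split that separates the classes perfectly vs. one that does not help
perfect_split = [["ham"] * 4, ["spam"] * 4]
useless_split = [["ham", "ham", "spam", "spam"], ["ham", "ham", "spam", "spam"]]
print(information_gain(balanced, perfect_split))
print(information_gain(balanced, useless_split))
```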

# Random Forest implementation

We will be using the random forest classifier implementation, which is suitable for classification with discrete features.

In [39]:
# importing the RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Create the classifier
model_rf = RandomForestClassifier(n_estimators=500, max_features='sqrt')

# fit the training data into the classifier
model_rf.fit(training_data, y_train)

pred_train_rf = model_rf.predict(training_data)
training_accuracy_rf = accuracy_score(y_train, pred_train_rf)
training_accuracy_rf

1.0

In [40]:
# optimizing model parameters
from sklearn.model_selection import GridSearchCV   

parameters = [{'criterion':['gini','entropy'], 'n_estimators':[100,200,300,400,500,600]}]
search = GridSearchCV(model_rf, parameters, scoring='accuracy', cv=5, verbose=True, n_jobs=-1).fit(training_data, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:  3.2min finished


In [41]:
# optimum parameter values
search.best_params_

{'criterion': 'gini', 'n_estimators': 600}

In [44]:
model_rf = RandomForestClassifier(criterion='gini', n_estimators=600, max_features='sqrt').fit(training_data, y_train)

In [45]:
# accuracy of training data

pred_train_rf = model_rf.predict(training_data)
training_accuracy_rf = accuracy_score(y_train, pred_train_rf)
training_accuracy_rf

1.0

In [46]:
# accuracy of test data

pred_test_rf = model_rf.predict(testing_data)
test_accuracy_rf = accuracy_score(y_test, pred_test_rf)
test_accuracy_rf

0.9856502242152466

In [48]:
# import accuracy_score, precision_score, recall_score, f1_score from scikit-learn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Computing the accuracy, precision, recall and F1 scores of the model using your test data 'y_test' 
print('Accuracy score: ', format(accuracy_score(y_test, pred_test_rf)))
print('Precision score: ', format(precision_score(y_test, pred_test_rf)))
print('Recall score: ', format(recall_score(y_test, pred_test_rf)))
print('F1 score: ', format(f1_score(y_test, pred_test_rf)))

Accuracy score:  0.9856502242152466
Precision score:  1.0
Recall score:  0.891156462585034
F1 score:  0.9424460431654677


**Conclusion**
- Random forest is essentially a bootstrapping algorithm built on decision tree (CART) models.
- In many scenarios, random forest gives considerably more accurate predictions than a single CART or regression model.
- These cases generally have a large number of predictive variables and a very large sample size.
- This is because it captures the variance of several input variables at the same time and lets a large number of observations take part in the prediction.
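
The bootstrapping idea behind random forest can be sketched by hand: train each tree on rows drawn with replacement, let each split consider a random subset of features, and combine the trees by voting. This is a simplified illustration on synthetic data, not how the notebook trains its model; scikit-learn's `RandomForestClassifier` averages predicted class probabilities, whereas this sketch uses a plain majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw row indices with replacement
    rows = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt": each split looks at a random subset of the features
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    trees.append(tree.fit(X_demo[rows], y_demo[rows]))

# Majority vote over the individual trees (25 is odd, so there are no ties)
votes = np.stack([tree.predict(X_demo) for tree in trees])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)

training_accuracy = (ensemble_pred == y_demo).mean()
print(training_accuracy)
```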

# Bagging implementation

In [52]:
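# Create the classifier: with max_features=None every split considers all features,
# so the random forest reduces to bagged decision trees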
model_bag = RandomForestClassifier(n_estimators=500, max_features=None)

# fit the training data into the classifier
model_bag.fit(training_data, y_train)

pred_train_bag = model_bag.predict(training_data)
training_accuracy_bag = accuracy_score(y_train, pred_train_bag)
training_accuracy_bag

1.0

In [51]:
# optimizing model parameters
from sklearn.model_selection import GridSearchCV   

parameters = [{'criterion':['gini','entropy'], 'n_estimators':[100,200,300,400,500,600]}]
search = GridSearchCV(model_bag, parameters, scoring='accuracy', cv=5, verbose=True, n_jobs=-1).fit(training_data, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  6.5min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed: 10.5min finished


In [53]:
# optimum parameter values
search.best_params_

{'criterion': 'entropy', 'n_estimators': 500}

In [54]:
model_bag = RandomForestClassifier(criterion='entropy', n_estimators=500, max_features=None).fit(training_data, y_train)

In [55]:
# accuracy of training data

pred_train_bag = model_bag.predict(training_data)
training_accuracy_bag = accuracy_score(y_train, pred_train_bag)
training_accuracy_bag

1.0

In [56]:
# accuracy of test data

pred_test_bag = model_bag.predict(testing_data)
test_accuracy_bag = accuracy_score(y_test, pred_test_bag)
test_accuracy_bag

0.9766816143497757

In [57]:
# import accuracy_score, precision_score, recall_score, f1_score from scikit-learn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Computing the accuracy, precision, recall and F1 scores of the model using your test data 'y_test' 
print('Accuracy score: ', format(accuracy_score(y_test, pred_test_bag)))
print('Precision score: ', format(precision_score(y_test, pred_test_bag)))
print('Recall score: ', format(recall_score(y_test, pred_test_bag)))
print('F1 score: ', format(f1_score(y_test, pred_test_bag)))

Accuracy score:  0.9766816143497757
Precision score:  0.9416058394160584
Recall score:  0.8775510204081632
F1 score:  0.9084507042253521
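
The bagging model above is built with `RandomForestClassifier(max_features=None)`. An alternative way to express the same idea, shown here only as a sketch on synthetic data, is scikit-learn's `BaggingClassifier`, whose default base estimator is a decision tree. The two are not guaranteed to give identical predictions, since their internals and random draws differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Synthetic stand-in data, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# BaggingClassifier uses a decision tree as its default base estimator
bagged_trees = BaggingClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Random forest in which every split may use all features
forest_all_features = RandomForestClassifier(
    n_estimators=50, max_features=None, random_state=0
).fit(X_demo, y_demo)

bagging_score = bagged_trees.score(X_demo, y_demo)
forest_score = forest_all_features.score(X_demo, y_demo)
print(bagging_score, forest_score)
```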


# Comparison

The models, in decreasing order of test accuracy:
- Naive Bayes - 0.99013
- Random Forest - 0.98565
- Bagging - 0.97668
- Decision Tree - 0.97130

The results show clearly that all the models are good at detecting spam. Naive Bayes is the most accurate method here, which is likely due to its ability to handle an extremely large number of features: in our case each word is treated as a feature, and there are thousands of different words. It also performs well in the presence of irrelevant features and is relatively unaffected by them.
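
To compare the models on more than accuracy, the test-set scores reported in the sections above can be collected into one table. The numbers below are copied from those outputs and rounded to four decimals; the DataFrame layout is our own choice for illustration.

```python
import pandas as pd

# Test-set scores copied from the outputs above, rounded to four decimals
scores = pd.DataFrame(
    {
        "accuracy":  [0.9901, 0.9713, 0.9857, 0.9767],
        "precision": [0.9789, 0.8859, 1.0000, 0.9416],
        "recall":    [0.9456, 0.8980, 0.8912, 0.8776],
        "f1":        [0.9619, 0.8919, 0.9424, 0.9085],
    },
    index=["Naive Bayes", "Decision Tree", "Random Forest", "Bagging"],
)

# Rank the models by test accuracy, best first
ranked = scores.sort_values("accuracy", ascending=False)
print(ranked)
```

The table shows that Random Forest has the highest precision while Naive Bayes has the highest recall and F1, so the choice of model can depend on whether missed spam or wrongly flagged messages is the bigger concern.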