# Problem Statement

The goal is to build a model that classifies SMS messages as spam or not spam (ham), based on labelled training data.

Identifying spam messages is a binary classification problem: each message is classified as either 'Spam' or 'Not Spam' and nothing else. It is also a supervised learning problem, because we feed the model a labelled dataset that it learns from in order to make future predictions.

# Work Flow

1. Data Collection -> we will be using a dataset consisting of 2 columns.
2. Data Preprocessing -> convert the labels to binary variables, 0 to represent 'ham' (i.e. not spam) and 1 to represent 'spam', for ease of computation
3. Data Splitting -> training data & test data
4. Model Building -> train several machine learning techniques and compare them to find the most accurate model
5. Model Evaluation -> using the test data
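
The five steps above can be sketched end to end on a handful of made-up messages. This is only an illustration under stated assumptions: the toy messages, the variable names, and the use of `make_pipeline` are not part of the notebook, and the resulting score says nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 1. Data collection: made-up (label, message) pairs standing in for the real dataset
toy_rows = [
    ("ham", "are we still meeting for lunch today"),
    ("spam", "win a free prize now call this number"),
    ("ham", "see you at home tonight"),
    ("spam", "free entry to win cash prize"),
    ("ham", "can you call me when you are home"),
    ("spam", "claim your free cash now"),
    ("ham", "lunch at noon works for me"),
    ("spam", "you win a free phone call now"),
]

# 2. Preprocessing: map labels to 0 (ham) and 1 (spam)
toy_y = [0 if label == "ham" else 1 for label, _ in toy_rows]
toy_x = [text for _, text in toy_rows]

# 3. Splitting: hold out a quarter of the messages for testing
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.25, random_state=1, stratify=toy_y
)

# 4. Model building: word counts fed into a multinomial Naive Bayes classifier
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(x_tr, y_tr)

# 5. Evaluation: accuracy on the held-out messages
toy_accuracy = toy_model.score(x_te, y_te)
print(toy_accuracy)
```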

In [1]:
# importing the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Data Collection

In [2]:
# loading the dataset 
data = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'sms_message'])
data.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
data.tail()

| | label | sms_message |
|---|---|---|
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |
| 5571 | ham | Rofl. Its true to its name |


Here, we see that the dataset consists of 2 columns. The first column takes two values: 'ham', which signifies that the message is not spam, and 'spam', which signifies that the message is spam.
The second column is the text content of the SMS message that is being classified.

# Data Preprocessing

In [4]:
# convert label values to numerical values: 0 for ham and 1 for spam

data['label'] = data.label.map({'ham':0, 'spam':1})
data.head()

| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [5]:
# finding the dataset size
data.shape

(5572, 2)

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   label        5572 non-null   int64 
 1   sms_message  5572 non-null   object
dtypes: int64(1), object(1)
memory usage: 87.2+ KB


In [7]:
# checking for missing values in the dataset
data.isnull().sum()

label          0
sms_message    0
dtype: int64

In [8]:
# finding out the number of ham and spam messages
data.label.value_counts()

0    4825
1     747
Name: label, dtype: int64
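
The counts above show that the classes are imbalanced, which matters when reading accuracy figures later on. As a quick worked calculation (the variable names are only for this illustration), a trivial classifier that labels every message as ham would already score about 86.6% accuracy on the full dataset:

```python
# Class counts copied from the value_counts() output above
n_ham = 4825
n_spam = 747
n_total = n_ham + n_spam

spam_fraction = n_spam / n_total
# Accuracy of a trivial classifier that labels every message as ham
majority_baseline_accuracy = n_ham / n_total

print(f"Spam fraction: {spam_fraction:.4f}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.4f}")
```

Any model worth keeping should beat this baseline, which is one reason precision and recall are worth reporting alongside accuracy.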

# Data Splitting 

In [9]:
from sklearn.model_selection import train_test_split

x = data.sms_message
y = data.label

In [10]:
# Splitting into training data & test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

In [11]:
print(x.shape, x_train.shape, x_test.shape)

(5572,) (4457,) (1115,)


In [13]:
# Instantiate the CountVectorizer class
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(x_train)

# Transform the testing data and return the matrix. Note that we only transform here: the vocabulary is learned from the training data alone
testing_data = count_vector.transform(x_test)
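
To make the fit/transform distinction concrete, here is a small sketch on made-up sentences (the toy strings and variable names are assumptions for illustration). It shows that `fit_transform` learns the vocabulary from the training text, while `transform` reuses it, so a word seen only in the test text is silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["free prize now", "call me now now"]
toy_test = ["free call later"]  # "later" never appears in the training text

toy_vectorizer = CountVectorizer()
toy_train_counts = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary
toy_test_counts = toy_vectorizer.transform(toy_test)        # reuses that vocabulary

# Column order of the count matrices is the alphabetically sorted vocabulary
vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)                  # ['call', 'free', 'me', 'now', 'prize']
print(toy_train_counts.toarray())  # [[0 1 0 1 1], [1 0 1 2 0]]
print(toy_test_counts.toarray())   # [[1 1 0 0 0]]
```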

# Naive Bayes implementation

We will be using the multinomial Naive Bayes implementation, which is suitable for classification with discrete features (such as, in our case, word counts for text classification). It takes integer word counts as its input.
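
As a rough illustration of what the multinomial model learns from word counts, the sketch below computes the smoothed per-class word probabilities by hand on a tiny made-up count matrix and checks them against scikit-learn's `feature_log_prob_`. The three-word vocabulary, the counts, and the variable names are assumptions for this example only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: rows are messages, columns are counts of ("free", "win", "home")
toy_counts = np.array([
    [2, 1, 0],  # spam
    [1, 2, 0],  # spam
    [0, 0, 3],  # ham
    [0, 1, 2],  # ham
])
toy_labels = np.array([1, 1, 0, 0])

alpha = 1.0  # Laplace smoothing, the MultinomialNB default
n_words = toy_counts.shape[1]

# P(word | class) = (count of word in class + alpha) / (total words in class + alpha * vocabulary size)
manual_log_prob = []
for cls in (0, 1):
    word_totals = toy_counts[toy_labels == cls].sum(axis=0)
    manual_log_prob.append(np.log((word_totals + alpha) / (word_totals.sum() + alpha * n_words)))
manual_log_prob = np.array(manual_log_prob)

toy_nb = MultinomialNB(alpha=alpha).fit(toy_counts, toy_labels)
matches_sklearn = np.allclose(manual_log_prob, toy_nb.feature_log_prob_)

# A new message containing "free" once and "win" once is classified as spam (1)
new_message_prediction = toy_nb.predict(np.array([[1, 1, 0]]))[0]
print(matches_sklearn, new_message_prediction)
```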

In [14]:
# importing the MultinomialNB classifier
from sklearn.naive_bayes import MultinomialNB

# Create the classifier
naive_bayes = MultinomialNB()

# fit the training data into the classifier
naive_bayes.fit(training_data, y_train)

MultinomialNB()

In [15]:
# accuracy of training data
from sklearn.metrics import accuracy_score

pred_train_naive_bayes = naive_bayes.predict(training_data)
training_accuracy_NB = accuracy_score(y_train, pred_train_naive_bayes)
training_accuracy_NB

0.9923715503702042

In [16]:
# accuracy of test data

pred_test_naive_bayes = naive_bayes.predict(testing_data)
test_accuracy_NB = accuracy_score(y_test, pred_test_naive_bayes)
test_accuracy_NB

0.9901345291479821

In [17]:
# import accuracy_score, precision_score, recall_score, f1_score from scikit-learn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Computing the accuracy, precision, recall and F1 scores of the model using your test data 'y_test' 
print('Accuracy score: ', format(accuracy_score(y_test, pred_test_naive_bayes)))
print('Precision score: ', format(precision_score(y_test, pred_test_naive_bayes)))
print('Recall score: ', format(recall_score(y_test, pred_test_naive_bayes)))
print('F1 score: ', format(f1_score(y_test, pred_test_naive_bayes)))

Accuracy score:  0.9901345291479821
Precision score:  0.9788732394366197
Recall score:  0.9455782312925171
F1 score:  0.9619377162629758
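
The four scores above are all functions of the confusion-matrix counts. The notebook does not print a confusion matrix, so the counts below are an inference: they are the counts consistent with the printed scores and the 1115-message test set, not values taken from the notebook's output.

```python
# Confusion counts inferred from the scores printed above (an assumption:
# the notebook itself never prints a confusion matrix)
tp, fp, fn, tn = 139, 3, 8, 965

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all messages classified correctly
precision = tp / (tp + fp)                  # of messages flagged as spam, share that are spam
recall = tp / (tp + fn)                     # of actual spam messages, share that were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

Read this way, the model would let roughly 8 spam messages through while wrongly flagging about 3 legitimate ones.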


**Conclusion**
- One of the major advantages of Naive Bayes over other classification algorithms is its ability to handle an extremely large number of features.
- Another major advantage is its relative simplicity.
- Naive Bayes works well right out of the box, and tuning its parameters is rarely necessary, except in cases where the distribution of the data is known.
- It rarely overfits the data: here the training accuracy (0.9924) and the test accuracy (0.9901) are nearly identical.
- Its training and prediction times are also very fast for the amount of data it can handle.

# Decision Tree implementation

We will be using scikit-learn's decision tree classifier, trained on the same word-count features as the Naive Bayes model.

In [18]:
# importing the Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier

# Create the classifier
model_tree = DecisionTreeClassifier()

# fit the training data into the classifier
model_tree.fit(training_data ,y_train)

pred_train_tree = model_tree.predict(training_data)
training_accuracy_tree = accuracy_score(y_train, pred_train_tree)
training_accuracy_tree

1.0

In [19]:
# optimizing model parameters
from sklearn.model_selection import GridSearchCV   

parameters = [{'criterion':['gini','entropy']}]
search = GridSearchCV(model_tree, parameters, scoring='accuracy', cv=5, verbose=True, n_jobs=-1).fit(training_data, y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    3.8s finished


In [20]:
# optimum parameter values
search.best_params_

{'criterion': 'gini'}

In [21]:
model_tree = DecisionTreeClassifier(criterion='gini').fit(training_data, y_train)
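
As an aside, `GridSearchCV` refits the best configuration on the full training data by default (`refit=True`), so the winning model and its cross-validated score can be read directly from the search object. The sketch below shows this on synthetic data; `make_classification`, the fixed `random_state`, and the variable names are assumptions for illustration and are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the vectorized SMS data
toy_features, toy_targets = make_classification(n_samples=200, n_features=10, random_state=0)

toy_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"criterion": ["gini", "entropy"]},
    scoring="accuracy",
    cv=5,
)
toy_search.fit(toy_features, toy_targets)

# Mean cross-validated accuracy of the winning setting
best_cv_accuracy = toy_search.best_score_
# With the default refit=True, this model is already fitted on all data passed to fit()
refit_tree = toy_search.best_estimator_

print(toy_search.best_params_, best_cv_accuracy)
```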

In [22]:
# accuracy of training data

pred_train_tree = model_tree.predict(training_data)
training_accuracy_tree = accuracy_score(y_train, pred_train_tree)
training_accuracy_tree

1.0

In [23]:
# accuracy of test data

pred_test_tree = model_tree.predict(testing_data)
test_accuracy_tree = accuracy_score(y_test, pred_test_tree)
test_accuracy_tree

0.9730941704035875

In [24]:
# import accuracy_score, precision_score, recall_score, f1_score from scikit-learn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Computing the accuracy, precision, recall and F1 scores of the model using your test data 'y_test' 
print('Accuracy score: ', format(accuracy_score(y_test, pred_test_tree)))
print('Precision score: ', format(precision_score(y_test, pred_test_tree)))
print('Recall score: ', format(recall_score(y_test, pred_test_tree)))
print('F1 score: ', format(f1_score(y_test, pred_test_tree)))

Accuracy score:  0.9730941704035875
Precision score:  0.8874172185430463
Recall score:  0.9115646258503401
F1 score:  0.8993288590604027


**Conclusion**
- Decision trees are used for both classification and regression.
- A decision tree measures the degree of disorder in a set of samples with a quantity called entropy. The entropy varies from sample to sample. (In this notebook the grid search selected the Gini criterion, a closely related impurity measure.)
- The entropy is zero for a homogeneous sample, and for a sample divided equally between two classes, the entropy is 1.
- The tree chooses the split with the lowest entropy compared to the parent node and the other candidate splits. The lower the entropy, the better the split.
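
The entropy values quoted above can be checked with a few lines of plain Python. The helper names `entropy` and `information_gain` are hypothetical, written only for this illustration; they are not scikit-learn functions.

```python
import math

def entropy(class_counts):
    """Shannon entropy (in bits) of a node with the given per-class sample counts."""
    total = sum(class_counts)
    return sum(-(c / total) * math.log2(c / total) for c in class_counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

pure_node = entropy([10, 0])   # homogeneous sample -> 0.0
even_node = entropy([5, 5])    # equally divided sample -> 1.0

# A split that separates the classes perfectly removes all the entropy
gain_perfect_split = information_gain([5, 5], [[5, 0], [0, 5]])
# A split that leaves both children evenly mixed removes none of it
gain_useless_split = information_gain([5, 5], [[3, 3], [2, 2]])

print(pure_node, even_node, gain_perfect_split, gain_useless_split)
```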

# Random Forest implementation

We will be using scikit-learn's random forest classifier, again trained on the same word-count features.

In [25]:
# importing the Random Forest classifier
from sklearn.ensemble import RandomForestClassifier

# Create the classifier
model_rf = RandomForestClassifier(n_estimators=500, max_features='sqrt')

# fit the training data into the classifier
model_rf.fit(training_data, y_train)

pred_train_rf = model_rf.predict(training_data)
training_accuracy_rf = accuracy_score(y_train, pred_train_rf)
training_accuracy_rf

1.0

In [26]:
# optimizing model parameters
from sklearn.model_selection import GridSearchCV   

parameters = [{'criterion':['gini','entropy'], 'n_estimators':[100,200,300,400,500,600]}]
search = GridSearchCV(model_rf, parameters, scoring='accuracy', cv=5, verbose=True, n_jobs=-1).fit(training_data, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:  3.2min finished


In [27]:
# optimum parameter values
search.best_params_

{'criterion': 'gini', 'n_estimators': 400}

In [28]:
model_rf = RandomForestClassifier(criterion='gini', n_estimators=400, max_features='sqrt').fit(training_data, y_train)

In [29]:
# accuracy of training data

pred_train_rf = model_rf.predict(training_data)
training_accuracy_rf = accuracy_score(y_train, pred_train_rf)
training_accuracy_rf

1.0

In [30]:
# accuracy of test data

pred_test_rf = model_rf.predict(testing_data)
test_accuracy_rf = accuracy_score(y_test, pred_test_rf)
test_accuracy_rf

0.9856502242152466

In [31]:
# import accuracy_score, precision_score, recall_score, f1_score from scikit-learn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Computing the accuracy, precision, recall and F1 scores of the model using your test data 'y_test' 
print('Accuracy score: ', format(accuracy_score(y_test, pred_test_rf)))
print('Precision score: ', format(precision_score(y_test, pred_test_rf)))
print('Recall score: ', format(recall_score(y_test, pred_test_rf)))
print('F1 score: ', format(f1_score(y_test, pred_test_rf)))

Accuracy score:  0.9856502242152466
Precision score:  1.0
Recall score:  0.891156462585034
F1 score:  0.9424460431654677


**Conclusion**
- Random forest combines bootstrapping with decision tree (CART) models.
- In many scenarios, random forest gives more accurate predictions than a single CART or regression model.
- These cases generally have a high number of predictive variables and a large sample size.
- This is because it captures the variance of several input variables at the same time and lets a large number of observations take part in the prediction.
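
The idea of bootstrapping plus trees can be sketched by hand. The example below is a simplified illustration on synthetic data, not how `RandomForestClassifier` is implemented: it uses a hard majority vote, whereas scikit-learn averages the trees' predicted class probabilities. The dataset, the number of trees, and the variable names are assumptions for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

toy_features, toy_targets = make_classification(n_samples=300, n_features=16, random_state=0)
rng = np.random.default_rng(0)

trees = []
for seed in range(25):
    # Bootstrap: draw as many rows as the dataset has, with replacement
    rows = rng.integers(0, len(toy_features), size=len(toy_features))
    # max_features="sqrt" makes each split consider a random subset of the columns
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(toy_features[rows], toy_targets[rows]))

# Majority vote across the trees (labels are 0/1, so the mean vote is the fraction voting 1)
votes = np.array([tree.predict(toy_features) for tree in trees])
forest_prediction = (votes.mean(axis=0) > 0.5).astype(int)
forest_training_accuracy = (forest_prediction == toy_targets).mean()

print(forest_training_accuracy)
```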

# Bagging implementation

In [32]:
# Create the classifier
model_bag = RandomForestClassifier(n_estimators=500, max_features=None)

# fit the training data into the classifier
model_bag.fit(training_data, y_train)

pred_train_bag = model_bag.predict(training_data)
training_accuracy_bag = accuracy_score(y_train, pred_train_bag)
training_accuracy_bag

1.0
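
The cell above implements bagging through `RandomForestClassifier` with `max_features=None`: when every feature is available at each split, the only randomness left is the bootstrap sampling, which is what bagging of decision trees amounts to. The sketch below places that setup next to scikit-learn's explicit `BaggingClassifier` on synthetic data. The dataset, estimator counts, and variable names are assumptions for illustration, and the two models are not expected to produce identical predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

toy_features, toy_targets = make_classification(n_samples=300, n_features=16, random_state=0)

# Random forest with every feature available at each split (as in the cell above)
rf_as_bagging = RandomForestClassifier(n_estimators=50, max_features=None, random_state=0)
# Explicit bagging of unpruned decision trees
explicit_bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

rf_as_bagging.fit(toy_features, toy_targets)
explicit_bagging.fit(toy_features, toy_targets)

rf_train_accuracy = rf_as_bagging.score(toy_features, toy_targets)
bagging_train_accuracy = explicit_bagging.score(toy_features, toy_targets)
print(rf_train_accuracy, bagging_train_accuracy)
```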

In [33]:
# optimizing model parameters
from sklearn.model_selection import GridSearchCV   

parameters = [{'criterion':['gini','entropy'], 'n_estimators':[100,200,300,400,500,600]}]
search = GridSearchCV(model_bag, parameters, scoring='accuracy', cv=5, verbose=True, n_jobs=-1).fit(training_data, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  5.9min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:  9.9min finished


In [34]:
# optimum parameter values
search.best_params_

{'criterion': 'entropy', 'n_estimators': 300}

In [35]:
model_bag = RandomForestClassifier(criterion='entropy', n_estimators=300, max_features=None).fit(training_data, y_train)

In [36]:
# accuracy of training data

pred_train_bag = model_bag.predict(training_data)
training_accuracy_bag = accuracy_score(y_train, pred_train_bag)
training_accuracy_bag

1.0

In [37]:
# accuracy of test data

pred_test_bag = model_bag.predict(testing_data)
test_accuracy_bag = accuracy_score(y_test, pred_test_bag)
test_accuracy_bag

0.9766816143497757

In [38]:
# import accuracy_score, precision_score, recall_score, f1_score from scikit-learn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Computing the accuracy, precision, recall and F1 scores of the model using your test data 'y_test' 
print('Accuracy score: ', format(accuracy_score(y_test, pred_test_bag)))
print('Precision score: ', format(precision_score(y_test, pred_test_bag)))
print('Recall score: ', format(recall_score(y_test, pred_test_bag)))
print('F1 score: ', format(f1_score(y_test, pred_test_bag)))

Accuracy score:  0.9766816143497757
Precision score:  0.9416058394160584
Recall score:  0.8775510204081632
F1 score:  0.9084507042253521


# Comparison

The models, in decreasing order of test accuracy:
- Naive Bayes - 0.99013
- Random Forest - 0.98565
- Bagging - 0.97668
- Decision Tree - 0.97309

The results clearly show that all the models are good at detecting spam messages. Naive Bayes is the most accurate method, likely because of its ability to handle an extremely large number of features. In our case, each word is treated as a feature and there are thousands of different words. It also performs well in the presence of irrelevant features and is relatively unaffected by them.
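
The ranking can be reproduced, and extended to the other metrics, by collecting the test-set scores printed in each section into one table. The values below are copied from those outputs and rounded to four decimals; the DataFrame layout and variable names are just one possible arrangement.

```python
import pandas as pd

# Test-set scores copied from the printed outputs above, rounded to four decimals
scores = pd.DataFrame(
    {
        "accuracy":  [0.9901, 0.9731, 0.9857, 0.9767],
        "precision": [0.9789, 0.8874, 1.0000, 0.9416],
        "recall":    [0.9456, 0.9116, 0.8912, 0.8776],
        "f1":        [0.9619, 0.8993, 0.9424, 0.9085],
    },
    index=["Naive Bayes", "Decision Tree", "Random Forest", "Bagging"],
)

# Models sorted from most to least accurate
ranked = scores.sort_values("accuracy", ascending=False)
# Best model for each metric
best_by_metric = scores.idxmax()

print(ranked)
print(best_by_metric)
```

Laid out this way, Naive Bayes leads on accuracy, recall and F1, while Random Forest has the highest precision (no false positives on this test set) at the cost of lower recall.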