# SPAM DETECTION MODEL - Text Mining
In this Jupyter notebook, I demonstrate a real-world example of text classification using machine learning. The goal of this project is to train a text classification model in Python that predicts whether a text message is spam or not. I use Python's scikit-learn library to train the model.

This notebook covers the following steps:

* Importing the libraries needed
* Importing the data set
* Text preprocessing
* Converting text to numbers
* Splitting the data into train and test sets
* Training the text classification model and predicting SMS messages as spam or ham
* Evaluating the model
* Saving and loading the model

In [11]:
# import libraries
import pandas as pd
import numpy as np
import os

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings(action='ignore')
from sklearn.model_selection import train_test_split,GridSearchCV,RandomizedSearchCV,cross_val_score
from sklearn.metrics import accuracy_score,confusion_matrix,precision_score,recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from IPython.display import display
from textblob import TextBlob,Word

In [12]:
# read in the data file with pandas read_csv
df=pd.read_csv('sms_spam_collection.csv')
df.head()

|   | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## 1. Text Preprocessing
Once the data has been imported, the next step is to preprocess the text. Text may contain numbers, special characters, and unwanted spaces, so we remove these from our text. The final preprocessing step is lemmatization, which reduces each word to its dictionary root form; for instance, 'cats' is converted into 'cat'. Lemmatization avoids creating separate features for words that differ in form but carry the same meaning.
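As a compact illustration of the cleaning described above, here is a minimal sketch that uses only Python's `re` module. The function name `clean_text` and the sample strings are assumptions for illustration, and lowercasing is my own addition here (the cells below do not lowercase explicitly); it is not the exact procedure used in this notebook.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase the text and strip digits, special characters and extra spaces."""
    text = text.lower()
    # replace anything that is not a lowercase letter or whitespace with a space
    text = re.sub(r"[^a-z\s]", " ", text)
    # collapse runs of whitespace into a single space and trim both ends
    return re.sub(r"\s+", " ", text).strip()

# shortened, made-up sample in the style of the data set
print(clean_text("Free entry in 2 a wkly comp to win FA Cup!!"))
```

Doing the cleaning with two regular expressions keeps the step easy to apply to a whole column, for example through `df.message.apply(clean_text)`.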

In [13]:
# check the number of text messages
print('Total number of text messages: {}\n'.format(df.shape[0]))

# check the number of duplicated messages
print('Number of duplicated messages: {}\n'.format(df.duplicated().sum()))

# check for missing values in the data
print('Number of missing values in each column:')
df.isnull().sum()

Total number of text messages: 5572

Number of duplicated messages: 403

Number of missing values in each column:


label      0
message    0
dtype: int64

In [14]:
# drop all the duplicated text messages from the data
df.drop_duplicates(inplace=True)

# remove all extra spaces from the text by trimming the whitespaces off the text
df.message = df.message.str.strip()
df.label = df.label.str.strip()

# reset the index; drop=True discards the old index instead of keeping it as a column
df = df.reset_index(drop=True)

# encode the target variables
df.label = df.label.map({'spam':1,'ham':0})

df.head()

|   | label | message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [15]:
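# remove stopwords (TextBlob's word tokenizer also drops most punctuation marks)
stop_word = stopwords.words('english')

# loop through all the text messages in the data and remove the stopwords
for i in range(len(df)):
    # get the text
    text = df.message[i]
    # create a textblob object
    blob = TextBlob(text)
    # convert the text into a list of words
    words = blob.words
    # loop through the words and remove the ones found in the stopword list
    # NOTE: removing from `words` while iterating over it skips the word that
    # follows each removal, and the comparison is case-sensitive, so some
    # stopwords (e.g. 'in' and 'I' in the output below) are left in the text
    for word in words:
        wrd = word.strip()
        if wrd in stop_word:
            words.remove(word)
    line = ' '.join(words)
    # write back with .loc to avoid chained assignment
    df.loc[i, 'message'] = line

df.head()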

|   | label | message |
|---|---|---|
| 0 | 0 | Go jurong point crazy Available in bugis n gre... |
| 1 | 0 | Ok lar Joking wif u oni |
| 2 | 1 | Free entry 2 wkly comp win FA Cup final tkts 2... |
| 3 | 0 | U dun say early hor U c already say |
| 4 | 0 | Nah I n't think goes usf lives around though |
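The loop in the cell above removes items from the list it is iterating over, which makes Python skip the element that follows each removal. This is why 'in' is still present in row 0 of the output. Below is a minimal sketch of the effect and of a filter-based alternative. The stopword set is a small hand-picked stand-in (an assumption for illustration; the notebook uses nltk's English list), and both function names are hypothetical.

```python
# tiny stand-in stopword set, used only for this illustration
STOP_WORDS = {"until", "only", "in", "a", "to", "the"}

def remove_in_loop(words):
    """Mirror the loop pattern from the cell above: mutate the list while iterating."""
    words = list(words)
    for word in words:
        if word in STOP_WORDS:
            words.remove(word)  # shifts the remaining items left by one
    return words

def remove_with_filter(words):
    """Build a new list instead, comparing in lowercase so 'In' matches 'in'."""
    return [w for w in words if w.lower() not in STOP_WORDS]

tokens = "Go until jurong point crazy Available only in bugis".split()
print(" ".join(remove_in_loop(tokens)))      # 'in' survives
print(" ".join(remove_with_filter(tokens)))  # 'in' is removed
```

Building a new list never changes the sequence being iterated, so every token is checked exactly once.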


In [16]:
numbers = ['0','1','2','3','4','5','6','7','8','9']

# remove stand-alone single-digit tokens from the text
for i in range(len(df)):
    # get the text message
    txt_msg = df.message[i]
    # split the text into a list of whitespace-separated tokens
    token_list = txt_msg.split()
    # loop through the list and drop tokens that are a single digit
    # (multi-digit tokens such as '11' do not match and are kept)
    for j in token_list:
        if j in numbers:
            token_list.remove(j)
    new_text = ' '.join(token_list)
    # write back with .loc to avoid chained assignment
    df.loc[i, 'message'] = new_text

# print the first ten rows of the data
df.head(10)

|   | label | message |
|---|---|---|
| 0 | 0 | Go jurong point crazy Available in bugis n gre... |
| 1 | 0 | Ok lar Joking wif u oni |
| 2 | 1 | Free entry wkly comp win FA Cup final tkts 21s... |
| 3 | 0 | U dun say early hor U c already say |
| 4 | 0 | Nah I n't think goes usf lives around though |
| 5 | 1 | FreeMsg Hey darling 's week 's and word back I... |
| 6 | 0 | Even brother not like speak They treat me like... |
| 7 | 0 | As per request 'Melle Melle Oru Minnaminungint... |
| 8 | 1 | WINNER As valued network customer have selecte... |
| 9 | 1 | Had mobile 11 months more U R entitled Update ... |
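The cell above compares each token with the ten single digits, so a token such as '2' is removed while '11' in row 9 is kept. If the intent is to drop every purely numeric token, a sketch along the following lines would do it. The function name `drop_numeric_tokens` is hypothetical, and this is an alternative I am suggesting, not the code that produced the output above.

```python
def drop_numeric_tokens(text: str) -> str:
    """Remove whitespace-separated tokens that consist only of digits."""
    return " ".join(tok for tok in text.split() if not tok.isdigit())

print(drop_numeric_tokens("Had mobile 11 months more"))
print(drop_numeric_tokens("Free entry 2 wkly comp"))
```

Tokens that mix digits and letters are left alone, because `str.isdigit()` is only true when every character is a digit.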


In [17]:
# we now lemmatize all the words
for i in range(len(df)):
    # get the text message
    txt = df.message[i]
    # create a textblob object
    blb = TextBlob(txt)
    # convert the text into a list of words
    wrds = blb.words
    wrd_container = []
    # iterate over the words and lemmatize each one of them
    for wrd in wrds:
        new_wrd = Word(wrd)
        lem_word = new_wrd.lemmatize()
        wrd_container.append(lem_word)
    # join the lemmatized words
    wrd_line = ' '.join(wrd_container)
    # write back with .loc to avoid chained assignment
    df.loc[i, 'message'] = wrd_line
    
df.head()

|   | label | message |
|---|---|---|
| 0 | 0 | Go jurong point crazy Available in bugis n gre... |
| 1 | 0 | Ok lar Joking wif u oni |
| 2 | 1 | Free entry wkly comp win FA Cup final tkts 21s... |
| 3 | 0 | U dun say early hor U c already say |
| 4 | 0 | Nah I n't think go usf life around though |


## 2. Converting Text to Numbers
Unlike humans, machines cannot interpret raw text; statistical techniques such as machine learning operate on numbers. We therefore need to convert our text into numeric features. To do this, we use the TfidfVectorizer, which weights each word by its term frequency-inverse document frequency (TF-IDF).
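To make the conversion concrete, here is a minimal pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, then L2-normalised. The helper names `fit_idf` and `transform_doc` and the toy messages are assumptions for illustration, and whitespace splitting stands in for the vectorizer's own tokenizer.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn a smoothed inverse document frequency for every term in docs."""
    n_docs = len(docs)
    # count in how many documents each term appears
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def transform_doc(doc, idf):
    """Turn one document into an L2-normalised {term: weight} mapping."""
    # terms that were not seen while fitting are ignored
    counts = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

train_docs = ["free entry win cup", "ok lar joking", "win free prize"]
idf = fit_idf(train_docs)
vector = transform_doc("free prize today", idf)
print(vector)
```

The inverse document frequencies are learned from the training messages only and then reused for new messages, which mirrors calling `fit_transform` on the training set and `transform` on the test set.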

In [18]:
df.dropna(inplace=True)

X = df.message
y = df.label

# split the data into train and test set
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.25)

# convert the text into numbers
vec = TfidfVectorizer()
training_x = vec.fit_transform(X_train)
testing_x = vec.transform(X_test)

# check the shape of the training X and the training y
print('Shape of training X: {}\n'.format(training_x.shape))
print('Shape of training y: {}\n'.format(y_train.shape))

Shape of training X: (3876, 7104)

Shape of training y: (3876,)
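The split above is unseeded, so the rows that land in each set change from run to run; `train_test_split` accepts a `random_state` argument for reproducible splits. The sketch below shows the idea with the standard library only. The function name `split_indices` and the seed value are assumptions, and the row count 5169 is my own arithmetic (5572 messages minus 403 duplicates).

```python
import math
import random

def split_indices(n_rows, test_size=0.25, seed=42):
    """Shuffle row indices with a fixed seed and cut them into train/test parts."""
    n_test = math.ceil(n_rows * test_size)  # the test share is rounded up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)    # same seed, same shuffle
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(5169)
print(len(train_idx), len(test_idx))
```

With a 25% test share this gives 3876 training rows, which agrees with the shape printed above, and 1293 test rows.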



## 3. Model Fitting, Evaluation, and Hyperparameter Tuning
This is the point where we train our text classification model, make predictions, and evaluate it. The model I train is Multinomial Naive Bayes (MultinomialNB), which I evaluate with the accuracy, precision, and recall scores. After scoring the model, we try to improve its performance by tuning its hyperparameters.

In [19]:
# create the model instance
naive_model = MultinomialNB()

# fit the model on the training data
naive_model.fit(training_x,y_train)

# make predictions 
predictions = naive_model.predict(testing_x)

# score the model
print('Accuracy score of the model: {}\n'.format(accuracy_score(y_test,predictions)))
# precision and recall for the spam class (label 1)
print('Precision score of the model: {}\n'.format(precision_score(y_test,predictions)))
print('Recall score of the model: {}\n'.format(recall_score(y_test,predictions)))

Accuracy score of the model: 0.9574632637277649

Precision score of the model: 0.9904761904761905

Recall score of the model: 0.6582278481012658



In [20]:
# hyperparameter grid to search over
params = {'alpha':np.linspace(0.1,2,20),
         'fit_prior':[True,False]}

# instantiate the model
bayes_model = MultinomialNB()

# define the grid search parameters
grid_search = GridSearchCV(estimator=bayes_model,
                          param_grid=params,
                          cv=5,
                          n_jobs=-1,
                          verbose=2)

# fit the model to start the grid search
grid_search.fit(training_x,y_train)

Fitting 5 folds for each of 40 candidates, totalling 200 fits


GridSearchCV(cv=5, estimator=MultinomialNB(), n_jobs=-1,
             param_grid={'alpha': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3,
       1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. ]),
                         'fit_prior': [True, False]},
             verbose=2)
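The `alpha` values in the grid control additive smoothing: each word's weight within a class is increased by `alpha` before it is turned into a probability, so words never seen in a class do not get a probability of zero. Here is a minimal sketch of that estimate. The function name `smoothed_word_probs`, the four-word vocabulary, and the counts are made up for illustration.

```python
def smoothed_word_probs(word_counts, vocabulary, alpha=1.0):
    """Estimate P(word | class) with additive smoothing controlled by alpha."""
    total = sum(word_counts.get(w, 0) for w in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {w: (word_counts.get(w, 0) + alpha) / denom for w in vocabulary}

vocabulary = ["free", "win", "ok", "lar"]
spam_counts = {"free": 3, "win": 2}  # made-up counts for the spam class

# the two ends of the grid above plus MultinomialNB's default alpha of 1.0
for alpha in (0.1, 1.0, 2.0):
    probs = smoothed_word_probs(spam_counts, vocabulary, alpha)
    print(alpha, round(probs["free"], 3), round(probs["ok"], 3))
```

A small `alpha` keeps the estimates close to the observed frequencies, while a large one pulls all words toward the same probability, which is why it is worth searching over.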

In [21]:
# make predictions 
grid_predictions = grid_search.predict(testing_x)

# score the model
print('Accuracy score of the model: {}\n'.format(accuracy_score(y_test,grid_predictions)))
# precision and recall for the spam class (label 1)
print('Precision score of the model: {}\n'.format(precision_score(y_test,grid_predictions)))
print('Recall score of the model: {}\n'.format(recall_score(y_test,grid_predictions)))
# print out the best parameters found by the grid search for fitting the model on the data
print('Best parameters found by the grid search:')
grid_search.best_params_

Accuracy score of the model: 0.9822119102861562

Precision score of the model: 0.9856115107913669

Recall score of the model: 0.8670886075949367

Best parameters found by the grid search:


{'alpha': 0.2, 'fit_prior': True}

As seen above, hyperparameter tuning improved the performance of the model. Accuracy increased from 0.957 to 0.982 and recall rose from 0.658 to 0.867, while precision stayed almost unchanged (0.990 before, 0.986 after). The precision score is the share of messages predicted as spam that really are spam, and the recall score, also called sensitivity or true positive rate (TPR), is the share of actual spam messages that the classifier detects. With this in hand, we now build a pipeline that fits the data using the best parameters found by the grid search, i.e. {'alpha': 0.2, 'fit_prior': True}.
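The three scores can be computed directly from the four confusion-matrix counts, as the sketch below shows. The counts are my own inference, not values printed by the notebook: they are the counts that reproduce the reported scores on a test set of 1293 messages (5169 rows minus the 3876 training rows). The function name `scores` is hypothetical.

```python
def scores(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted spam that really is spam
    recall = tp / (tp + fn)     # share of actual spam that was caught
    return accuracy, precision, recall

# counts inferred from the reported scores, not taken from the notebook output
before = scores(tp=104, fp=1, fn=54, tn=1134)
after = scores(tp=137, fp=2, fn=21, tn=1133)

print("before tuning:", before)
print("after tuning: ", after)
```

Read this way, tuning mainly reduces the number of spam messages that slip through as ham, at the cost of about one extra ham message flagged as spam.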

In [22]:
# make a pipeline and fit the model on the X_train and y_train
final_model = make_pipeline(TfidfVectorizer(),MultinomialNB(alpha=0.2,fit_prior=True))

# fit the model on the training data
final_model.fit(X_train,y_train)

# make predictions 
final_predictions = final_model.predict(X_test)

# score the model
print('Accuracy score of the model: {}\n'.format(accuracy_score(y_test,final_predictions)))
# precision and recall for the spam class (label 1)
print('Precision score of the model: {}\n'.format(precision_score(y_test,final_predictions)))
print('Recall score of the model: {}\n'.format(recall_score(y_test,final_predictions)))

Accuracy score of the model: 0.9822119102861562

Precision score of the model: 0.9856115107913669

Recall score of the model: 0.8670886075949367



## 4. Saving the final model on our computer's hard disk
The text classification model that detects whether an SMS is spam or not (ham) has been trained successfully and is ready to be deployed and used in applications. The next step is to save the model to a pickle file ('.pkl' file) on our computer's hard disk so that we can load it whenever we are ready to use it.

In [23]:
# import the library that allows us to save and load our model
# (older scikit-learn versions exposed it as sklearn.externals.joblib)
import joblib

In [24]:
# file name for the saved model (relative to the working directory)
file_dir = 'spam_detection_model.pkl'
# save the model
joblib.dump(final_model,file_dir)

['spam_detection_model.pkl']

The model has now been saved and can be loaded again using the syntax:

In [1]: classifier = joblib.load('spam_detection_model.pkl')

This brings me to the end of my machine learning project. I have trained a text classification model (a Multinomial Naive Bayes classifier) that can be used in applications to detect spam messages. Please do not hesitate to contact me for any sort of collaboration or discussion about this notebook. Thank you.

#### Project completed by Helenna Ariesty
[Email](mailto:helenna.ariesty@gmail.com) || [LinkedIn](https://www.linkedin.com/in/helenna-ariesty-24966793/)