<a href="https://colab.research.google.com/github/hatimnaitlho/ml-sklearn/blob/master/spam_detection_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project overview and scope

One of the basic and popular tasks in supervised learning is the classification of data (images or texts). 

In this notebook, we will use the scikit-learn library to build an SMS spam detector based on the text of each SMS.

The basic machine learning workflow for SMS spam detection covers the following tasks:
- Collect the data and preprocess it (vectorizing non-numerical data)
- Split the dataset into training and test sets
- Build classifiers and train them
- Assess the performance of each classifier (precision, recall)
- Identify the classifiers with the best performance (precision and recall)

Note: tuning the algorithms' hyperparameters is outside the scope of this notebook.

# The data collection

We will be using a [dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) from the UCI Machine Learning Repository. This collection of 5,574 labeled SMS messages was gathered for mobile phone spam research.

The collection consists of a single text file, where each line holds the correct class (spam or ham) followed by the raw message.

The direct data link is [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/).

For this project, we will use the scikit-learn library, which contains various classification, regression and clustering algorithms and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.



In [0]:
#Import libraries
import numpy as np 
import pandas as pd 



# Data collection

In [39]:
# Load the SMS-spam dataset
url_dataset = 'https://raw.githubusercontent.com/hatimnaitlho/ml-sklearn/master/datasets/smsspamcollection/SMSSpamCollection'
df = pd.read_table(url_dataset, sep='\t', names=['label', 'sms_message'])

# Print the first 5 rows
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [40]:
df.shape

(5572, 2)

# Data Preprocessing

Since scikit-learn estimators only work with numerical values, we have to convert our categorical values to numerical ones. This requires some data preprocessing.

For the labels, the solution is easy: we only have to map 'spam' to 1 and 'ham' to 0.

The sms_message column is plain text that contains all the information (features) needed for classification. We therefore have to build our feature vectors from this plain text.

For this, scikit-learn has powerful vectorizers such as CountVectorizer, TfidfVectorizer, HashingVectorizer.
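
As a rough illustration of how these three vectorizers differ, the sketch below runs each of them on two made-up messages. The sample texts and the value `n_features=16` are assumptions chosen for the example, not values used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfVectorizer)

# Two made-up messages, used only for illustration
sample_messages = ['free prize call now', 'call me later']

# CountVectorizer: raw token counts, with a vocabulary learned from the data
count_features = CountVectorizer().fit_transform(sample_messages)

# TfidfVectorizer: counts reweighted so that tokens shared by many messages weigh less
tfidf_features = TfidfVectorizer().fit_transform(sample_messages)

# HashingVectorizer: no vocabulary is stored; tokens are hashed into a fixed number of columns
hashed_features = HashingVectorizer(n_features=16).transform(sample_messages)

print(count_features.shape)
print(tfidf_features.shape)
print(hashed_features.shape)
```

The first two matrices have one column per distinct token, while the hashed matrix always has `n_features` columns regardless of the text.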


In [41]:
df.groupby('label').count()

| label | sms_message |
|---|---|
| ham | 4825 |
| spam | 747 |


In [42]:
df['label']=df['label'].map({'ham':0, 'spam':1})
df.head()

| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


# Bag of Words (BoW) Implementation

What we have in our dataset is a collection of text data (5,572 rows). Most ML algorithms expect numerical data as input, whereas email and SMS messages are mostly text.

Here we introduce the Bag of Words (BoW) concept, a term for problems that involve a 'bag of words', that is, a collection of text data that needs to be processed. The basic idea of BoW is to take a piece of text and count the frequency of the words in that text. It is important to note that BoW treats each word individually, and the order in which the words occur does not matter.

Using a process which we will go through now, we can convert a collection of documents to a matrix, with each document being a row and each word (token) being a column, and each (row, column) value being the frequency of occurrence of that word or token in that document.
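
To make the document-by-token matrix concrete, here is a minimal pure-Python sketch that builds one by hand. The two documents and the whitespace tokenization are simplifying assumptions for illustration; scikit-learn's tokenizer is more careful.

```python
from collections import Counter

# Two made-up documents, used only for illustration
documents = ['call me now', 'win a prize now now']

# Tokenize naively on whitespace (an assumption; real tokenizers do more)
tokenized = [doc.lower().split() for doc in documents]

# The vocabulary is the sorted set of all tokens; each token becomes one column
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Each row holds the frequency of every vocabulary token in one document
bow_matrix = [[Counter(tokens)[token] for token in vocabulary]
              for tokens in tokenized]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Word order is lost in the process: only the counts per token remain in each row.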

Let's use the CountVectorizer class in sklearn, which provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

### Implementing CountVectorizer method
Let's apply CountVectorizer to a few sample documents to understand how it works.

In [43]:
# Instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer(stop_words='english')

documents = ['By receiving spam messages',
             'Internet users are exposed to security issues',
             'and minors are exposed to inappropriate contents.',
             'Moreover, spam messages waste resources in terms of:',
             'storage, bandwidth, and productivity.',
             'What makes the problem worse is that spammers',
             'keep inventing new techniques to dodge spam filters.']

count_vector.fit(documents)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [44]:
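# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
count_vector.get_feature_names_out().tolist()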

['bandwidth',
 'contents',
 'dodge',
 'exposed',
 'filters',
 'inappropriate',
 'internet',
 'inventing',
 'issues',
 'makes',
 'messages',
 'minors',
 'new',
 'problem',
 'productivity',
 'receiving',
 'resources',
 'security',
 'spam',
 'spammers',
 'storage',
 'techniques',
 'terms',
 'users',
 'waste',
 'worse']

In [45]:
doc_array = count_vector.transform(documents).toarray()
doc_array

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 1, 0, 0],
       [0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
        1, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
        0, 0, 0, 0]])

In [46]:
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
frequency_matrix = pd.DataFrame(doc_array,
                                columns=count_vector.get_feature_names_out())
frequency_matrix

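| | bandwidth | contents | dodge | exposed | filters | inappropriate | internet | inventing | issues | makes | messages | minors | new | problem | productivity | receiving | resources | security | spam | spammers | storage | techniques | terms | users | waste | worse |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |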
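
A fitted CountVectorizer can also encode documents it has never seen. The short sketch below, using made-up sentences, suggests what happens to words that are missing from the learned vocabulary: they are simply dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Learn a tiny vocabulary from one made-up document
vectorizer = CountVectorizer()
vectorizer.fit(['spam filters block spam'])

# 'new' and 'tricks' were not seen during fit, so they get no column
encoded = vectorizer.transform(['new spam tricks']).toarray()

print(sorted(vectorizer.vocabulary_))
print(encoded)
```

Only the known token `spam` is counted in the encoded row, which is why the vocabulary should be learned from a representative training set.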


# Building Models

#### Let's separate our dataset into training and test sets

In [48]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], df['label'], test_size=0.25, random_state=42)
print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393
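
Because ham messages heavily outnumber spam messages, a purely random split can leave the two sets with slightly different class proportions. As an optional variant (not what this notebook does), `train_test_split` accepts a `stratify` argument that preserves the class ratio in both sets. The messages and labels below are made up for the sketch.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 80 ham (0) and 20 spam (1)
messages = ['message {}'.format(i) for i in range(100)]
labels = [0] * 80 + [1] * 20

# stratify=labels keeps the 80/20 class ratio in both the training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels)

print(Counter(y_tr))
print(Counter(y_te))
```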


#### Applying Bag of Words processing

This preprocessing step will transform the plain text (sms_message), and extract numerical features using the Bag of Words concept.

In [0]:
# Instantiate CountVectorizer
count_vector = CountVectorizer(stop_words='english')

# Learn the vocabulary from the training data and return the document-term matrix
training_data = count_vector.fit_transform(X_train)

# Transform the test data with the vocabulary learned above.
# Note that we do not fit CountVectorizer on the test data.
testing_data = count_vector.transform(X_test)
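
One way to make sure the vectorizer is only ever fitted on the training data is to bundle it with a classifier in a scikit-learn pipeline. The sketch below is an illustration, not the approach used in this notebook; the toy messages and labels are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up training and test messages (1 = spam, 0 = ham)
train_texts = ['win free prize now', 'free cash offer',
               'see you at lunch', 'call me tonight']
train_labels = [1, 1, 0, 0]
test_texts = ['free prize offer', 'lunch tonight']

# fit() learns the vocabulary from the training texts only;
# predict() reuses that vocabulary to transform the test texts
spam_pipeline = make_pipeline(CountVectorizer(), BernoulliNB())
spam_pipeline.fit(train_texts, train_labels)

predictions = spam_pipeline.predict(test_texts)
print(predictions)
```

With a pipeline, the fit and transform steps cannot be accidentally applied to the wrong split, because the vectorizer is never exposed separately.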

## Accuracy vs. Recall vs. Precision

Choosing the right performance metric is critical in machine learning projects, because it determines which model best fits the business need.

Accuracy is not pertinent in our case (spam detection classifier) because the classes are imbalanced: with 4,825 ham and 747 spam messages, a classifier that labels every message as ham would already reach about 87% accuracy while detecting no spam at all. Therefore, we will use precision and recall:
- Precision is the ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives. Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative.

- Recall is the ratio tp / (tp + fn), where tp is the number of true positives and fn the number of false negatives. Intuitively, recall is the ability of the classifier to find all the positive samples.
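
As a small worked example of the two formulas above, the sketch below counts tp, fp and fn by hand on made-up labels and checks the result against scikit-learn's `precision_score` and `recall_score`.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up ground truth and predictions (1 = spam, 0 = ham)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # spam correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # ham wrongly flagged as spam
fn = sum(t == 1 and p == 0 for t, p in pairs)  # spam that was missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```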

Precision answers the question "If the spam classifier labels an SMS or email as spam, what is the probability that it really is spam?" The spam classifier must therefore have high precision, which prevents ham messages from being labeled as spam and lost in the "junk folder".

Recall answers the question "Of all the spam in the SMS/email set, what proportion does the spam classifier detect?" Recall is less important than precision in this specific case: low recall means some spam messages go undetected, which is annoying but not as critical as losing non-spam messages.
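
To illustrate why accuracy alone can mislead on imbalanced data, the sketch below scores a deliberately useless "classifier" that labels every message as ham. The class counts mirror the test set of this notebook (1,207 ham and 186 spam); the always-ham predictor itself is a hypothetical baseline.

```python
from sklearn.metrics import accuracy_score, recall_score

# Same class balance as the notebook's test set: 1207 ham (0), 186 spam (1)
y_true = [0] * 1207 + [1] * 186

# Hypothetical baseline that labels every message as ham
y_pred_all_ham = [0] * len(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred_all_ham)
baseline_recall = recall_score(y_true, y_pred_all_ham)

print('accuracy: {:.3f}'.format(baseline_accuracy))
print('recall:   {:.3f}'.format(baseline_recall))
```

The baseline looks respectable on accuracy yet catches no spam at all, which is what recall exposes.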


In [58]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import RidgeClassifierCV
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import precision_score 
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

results = []
classifiers = [
    BernoulliNB(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    BaggingClassifier(),
    ExtraTreesClassifier(),
    GradientBoostingClassifier(),
    DecisionTreeClassifier(),
    LogisticRegression(),
    RidgeClassifier(),
    RidgeClassifierCV(),
    SGDClassifier(),
    OneVsRestClassifier(SVC()),
    KNeighborsClassifier()
]

for clf in classifiers:
    clf.fit(training_data, y_train)
    # Predict once and reuse the predictions for every metric
    y_pred = clf.predict(testing_data)
    results.append({
        'model': clf.__class__.__name__,
        'precision_score': precision_score(y_test, y_pred),
        'recall_score': recall_score(y_test, y_pred),
        'f1_score': f1_score(y_test, y_pred),
    })

# Store the scores under a new name so the dataset in `df` is not overwritten
results_df = pd.DataFrame(results)
results_df



| | model | precision_score | recall_score | f1_score |
|---|---|---|---|---|
| 0 | BernoulliNB | 0.993377 | 0.806452 | 0.890208 |
| 1 | RandomForestClassifier | 1.0 | 0.83871 | 0.912281 |
| 2 | AdaBoostClassifier | 0.95858 | 0.870968 | 0.912676 |
| 3 | BaggingClassifier | 0.956522 | 0.827957 | 0.887608 |
| 4 | ExtraTreesClassifier | 0.987654 | 0.860215 | 0.91954 |
| 5 | GradientBoostingClassifier | 0.992908 | 0.752688 | 0.856269 |
| 6 | DecisionTreeClassifier | 0.934524 | 0.844086 | 0.887006 |
| 7 | LogisticRegression | 1.0 | 0.860215 | 0.924855 |
| 8 | RidgeClassifier | 1.0 | 0.849462 | 0.918605 |
| 9 | RidgeClassifierCV | 1.0 | 0.849462 | 0.918605 |


## The Confusion matrix

![alt text](https://miro.medium.com/max/1166/0*2ICu3zRUHkFvzxx7.png)

Let's compare two classifiers, the KNeighborsClassifier and the OneVsRestClassifier. Both reach a precision of 1 on this test set, meaning that every message they label as spam really is spam. However, they have different recall scores:
- recall_score(KNeighborsClassifier) = 0.360215
- recall_score(OneVsRestClassifier) = 0.876344

This means that the OneVsRestClassifier filters out far more spam messages than the KNeighborsClassifier, as the confusion matrices will show.


 

In [84]:
y_test.value_counts()

0    1207
1     186
Name: label, dtype: int64

In [87]:
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

knclf= KNeighborsClassifier()
knclf.fit(training_data, y_train)

y_pred_kn= knclf.predict(testing_data)

# With labels=[1, 0], rows are true classes (spam, ham) and columns are predicted classes (spam, ham)
cm_kn = confusion_matrix(y_test, y_pred_kn, labels=[1, 0])
print("The confusion matrix of the KNeighborsClassifier: ")
display(cm_kn)
print('\n')
print('The number of non-spam messages labelled as spam is {}'.format(cm_kn[1, 0]))
print('The number of spam messages not detected by the classifier {}'.format(cm_kn[0, 1]))

The confusion matrix of the KNeighborsClassifier: 


array([[  67,  119],
       [   0, 1207]])



The number of non-spam messages labelled as spam is 0
The number of spam messages not detected by the classifier 119


In [88]:
from sklearn.metrics import confusion_matrix
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

clf_ovs= OneVsRestClassifier(SVC())
clf_ovs.fit(training_data, y_train)
y_pred_ovs= clf_ovs.predict(testing_data)
# With labels=[1, 0], rows are true classes (spam, ham) and columns are predicted classes (spam, ham)
cm_ovs = confusion_matrix(y_test, y_pred_ovs, labels=[1, 0])
print("The confusion matrix of the OneVsRestClassifier: ")
display(cm_ovs)
print('\n')
print('The number of non-spam messages labelled as spam is {}'.format(cm_ovs[1, 0]))
print('The number of spam messages not detected by the classifier  {}'.format(cm_ovs[0, 1]))


The confusion matrix of the OneVsRestClassifier: 


array([[ 163,   23],
       [   0, 1207]])



The number of non-spam messages labelled as spam is 0
The number of spam messages not detected by the classifier  23
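
The precision and recall values quoted earlier can be recovered directly from the two confusion matrices displayed above. The sketch below hard-codes those matrices (copied from the outputs, with `labels=[1, 0]`) and recomputes both metrics.

```python
import numpy as np

# Confusion matrices as displayed above, with labels=[1, 0]:
# rows are true classes (spam, ham), columns are predicted classes (spam, ham)
matrices = {
    'KNeighborsClassifier': np.array([[67, 119], [0, 1207]]),
    'OneVsRestClassifier': np.array([[163, 23], [0, 1207]]),
}

scores = {}
for name, cm in matrices.items():
    tp, fn = cm[0]  # true spam: detected, missed
    fp, tn = cm[1]  # true ham: wrongly flagged, correctly passed
    scores[name] = {
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
    }
    print('{}: precision={:.6f}, recall={:.6f}'.format(
        name, scores[name]['precision'], scores[name]['recall']))
```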


Note: We used the default hyperparameters for each classifier. The next step at this stage would be to choose the best-performing classifiers and then tune their hyperparameters, for example with a cross-validated grid search.
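
Although tuning is outside the scope of this notebook, a minimal sketch of what such a step could look like is given below, using scikit-learn's `GridSearchCV`. The toy messages, the choice of LogisticRegression, and the grid of `C` values are all assumptions made for illustration, not recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up messages (1 = spam, 0 = ham), only to keep the sketch self-contained
texts = ['win free prize now', 'free cash offer today',
         'claim your free prize', 'cheap loans call now',
         'see you at lunch', 'call me tonight',
         'are you coming home', 'meeting moved to monday']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression()),
])

# Hypothetical grid: the values of C are illustrative only
param_grid = {'classifier__C': [0.1, 1.0, 10.0]}

# scoring='precision' matches the metric this notebook prioritizes
search = GridSearchCV(pipeline, param_grid, scoring='precision', cv=2)
search.fit(texts, labels)

print(search.best_params_)
```

On the real dataset, the same pattern would be applied to `X_train` and `y_train`, with a larger number of folds than the `cv=2` used here for the tiny toy set.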