# Spam detection (Bayes)

This project detects SMS spam: a Naive Bayes classifier labels each SMS message as spam or not spam (ham).

We use the SMS Spam Collection Data Set from the UCI repository. It can be found [here](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).

The data set consists of a single text file in which each line contains the correct class (`spam` or `ham`), followed by a tab and the raw text message.

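Bayes' theorem gives the probability of an event (here, a message being spam) given observed evidence (here, the words the message contains). It combines the prior probability of the event with the likelihood of the evidence under each class. The classifier is called "naive" because it treats the words as conditionally independent given the class.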
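
As a rough illustration of the theorem for a single word, the sketch below computes P(spam | word) from a prior and two likelihoods. The function name `posterior_spam` and all three probabilities are made-up values for illustration, not figures measured on this data set.

```python
def posterior_spam(p_spam, p_word_given_spam, p_word_given_ham):
    """Return P(spam | word) using Bayes' theorem."""
    p_ham = 1.0 - p_spam
    # Law of total probability: chance of seeing the word in any message
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word

# Hypothetical numbers: 13% of messages are spam, and the word "free"
# appears in 20% of spam messages but only 1% of ham messages.
p = posterior_spam(p_spam=0.13, p_word_given_spam=0.20, p_word_given_ham=0.01)
print(round(p, 3))  # 0.749
```

Even with a low prior, a word that is far more common in spam than in ham pushes the posterior well above one half.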


In [0]:
#Import libraries
import numpy as np
import pandas as pd

In [2]:
#Load the SMS-Spam data from UCI
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip

--2019-08-17 14:06:05--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203415 (199K) [application/x-httpd-php]
Saving to: ‘smsspamcollection.zip’


2019-08-17 14:06:06 (1018 KB/s) - ‘smsspamcollection.zip’ saved [203415/203415]

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  


In [5]:
#Read data
df = pd.read_csv('SMSSpamCollection', sep="\t", names=['label', 'sms_message'])
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [6]:
#Convert the values in the 'label' column: 'ham'=0, 'spam'=1
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head()

| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [7]:
#Implement Bag of Words (BoW)
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
print(count_vector)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)
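
To make the bag-of-words idea concrete, here is a minimal standard-library sketch of what the default settings printed above amount to: lowercase the text, keep tokens of two or more word characters, and count them per document. The helpers `tokenize` and `bag_of_words` and the two sample messages are hypothetical; this is not how scikit-learn implements `CountVectorizer` internally.

```python
import re
from collections import Counter

# Same token pattern as in the CountVectorizer printout:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and return its tokens of length >= 2."""
    return TOKEN_PATTERN.findall(text.lower())

def bag_of_words(documents):
    """Return (sorted vocabulary, one row of token counts per document)."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix

docs = ["Call me now, call me!", "Win a FREE prize now"]
vocab, rows = bag_of_words(docs)
print(vocab)  # ['call', 'free', 'me', 'now', 'prize', 'win']
for row in rows:
    print(row)  # [2, 0, 2, 1, 0, 0] then [0, 1, 0, 1, 1, 1]
```

Note that the single-character token "a" is dropped by the pattern, and word order is discarded: only the counts remain.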


In [8]:
#Split dataset into training (75%) and test (25%); 25% is the default test size of train_test_split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393


In [0]:
#Instantiate the CountVectorizer class
count_vector = CountVectorizer()

#Fit the training data and then return the matrix
#  1. Learn a vocabulary dictionary for the training data 
#  2. Transform the data into a document-term matrix
training_data = count_vector.fit_transform(X_train)

#Transform the testing data into a document-term matrix using the vocabulary learned from the training data.
#  The CountVectorizer is not fitted on the testing data, so only step 2 applies here.
testing_data = count_vector.transform(X_test)
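
The difference between fitting and transforming can be sketched without scikit-learn: the vocabulary is learned from the training messages only, and test tokens outside that vocabulary are silently dropped. The functions `fit_vocabulary` and `transform` and the sample messages below are hypothetical stand-ins, assuming the same lowercase, two-or-more-character tokenization as the default settings.

```python
import re

def tokenize(text):
    """Lowercase the text and return tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(train_docs):
    """Map every training token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in train_docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count only tokens already in the vocabulary; unseen tokens are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

train = ["free prize now", "see you tomorrow"]
test = ["free lunch tomorrow"]

vocab = fit_vocabulary(train)       # learned from training data only
test_rows = transform(test, vocab)  # "lunch" was never seen, so it is dropped
print(sorted(vocab))  # ['free', 'now', 'prize', 'see', 'tomorrow', 'you']
print(test_rows)      # [[1, 0, 0, 0, 1, 0]]
```

Keeping the test data out of the fitting step means the test matrix has the same columns as the training matrix and no information from the test set leaks into the features.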

In [13]:
#Implement multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
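
For intuition about what fitting a multinomial Naive Bayes model involves, the sketch below estimates class priors and additively smoothed word likelihoods (with `alpha=1.0`, as in the printout above) and classifies by the highest log posterior. It is a simplified illustration on made-up, pre-tokenized messages; the function names are hypothetical and this is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocabulary = sorted({tok for doc in docs for tok in doc})
    class_doc_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior, log_likelihood = {}, {}
    for c in class_doc_counts:
        # Prior: fraction of training documents in class c
        log_prior[c] = math.log(class_doc_counts[c] / len(docs))
        # Additive smoothing: add alpha to every word count so that
        # no word gets zero probability under a class
        total = sum(token_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total)
            for tok in vocabulary
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; skip unseen tokens."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Hypothetical training messages, already tokenized; 1 = spam, 0 = ham
train_docs = [["free", "prize", "now"], ["free", "entry"],
              ["see", "you", "tomorrow"], ["call", "you", "now"]]
train_labels = [1, 1, 0, 0]

model = train_multinomial_nb(train_docs, train_labels)
print(predict(["free", "prize"], *model))  # 1
print(predict(["see", "you"], *model))     # 0
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.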

In [0]:
#Apply the trained classifier to the testing data
predictions = naive_bayes.predict(testing_data)

In [15]:
#Evaluation: Compute accuracy, precision, recall and F1 scores
#  Accuracy = Ratio of correct predictions to the total number of predictions
#  Precision = Proportion of messages classified as spam that actually were spam
#  Recall = Proportion of actual spam messages that were classified as spam
#  F1 = Harmonic mean of the precision and recall scores
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562
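
The four scores can also be computed by hand from confusion-matrix counts. The counts used below are not printed by the notebook; they are back-calculated here from the reported scores and the 1393-row test set, so treat them as an inference rather than an output of the run. The function name `classification_scores` is hypothetical.

```python
def classification_scores(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
    recall = tp / (tp + fn)      # of actual spam messages, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Inferred counts (assumption): 174 spam caught, 5 ham flagged as spam,
# 11 spam missed, 1203 ham correctly passed; total 1393 messages.
accuracy, precision, recall, f1 = classification_scores(tp=174, fp=5, fn=11, tn=1203)
print('Accuracy:  {:.4f}'.format(accuracy))
print('Precision: {:.4f}'.format(precision))
print('Recall:    {:.4f}'.format(recall))
print('F1:        {:.4f}'.format(f1))
```

With these counts, recall is lower than precision because the classifier misses more spam (11 messages) than it wrongly flags ham (5 messages).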


In [21]:
#Try the model on new messages
documents = ['Hello, how are you!',
             'Free entry in 2 a wkly comp',
             'Call me now.',
             'Hello, shall we meet tomorrow?']

naive_bayes.predict(count_vector.transform(documents))

array([0, 1, 0, 0])