## Spam SMS Detection using the Naive Bayes technique ##

In [2]:
import pandas as pd

# Read the tab-separated dataset file; it has no header row, so column names are supplied
df = pd.read_table('smsspamcollection/SMSSpamCollection', sep='\t', names=['label','sms_message'])

df.head()

|   | label | sms_message |
|---|-------|-------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Data Preprocessing ###

Convert the labels to binary values for easier computation: {'ham': 0, 'spam': 1}.

In [3]:
df['label'] = df.label.map({'ham':0,'spam':1})
df.head()

|   | label | sms_message |
|---|-------|-------------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


Now that our labels are ready, I will use the bag-of-words (BoW) model for classification. In the BoW model, each sms_message is converted to lowercase and split into individual tokens (words), so that during classification words such as hello and Hello are treated as the same term. Each message is then represented by the number of times each token occurs in it. For that, I will use CountVectorizer from sklearn.feature_extraction.text.

Before that, I will split the dataset into training and testing sets. Because no test_size is passed, train_test_split holds out its default 25% of the rows for testing.

In [5]:
# split the training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

# print the row counts of the full, training, and test sets
print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393


In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the CountVectorizer class
count_vector = CountVectorizer()

# Fit on the training data (learn the vocabulary) and return the document-term count matrix
training_data = count_vector.fit_transform(X_train)

# Transform the testing data with the vocabulary learned from X_train.
# No fitting is done here, so words that appear only in X_test are ignored
testing_data = count_vector.transform(X_test)
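
To make the bag-of-words step more concrete, here is a minimal sketch on a toy corpus. The three messages and the names `toy_messages`, `toy_vectorizer`, `toy_counts`, `vocabulary`, and `dense_counts` are made up for illustration and are not part of the dataset or the notebook above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy messages, made up for illustration (not taken from the dataset)
toy_messages = ["Hello hello world", "Win a FREE prize", "free prize inside"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# Learned vocabulary: lowercased tokens mapped to column indices
vocabulary = toy_vectorizer.vocabulary_
print(sorted(vocabulary))

# Dense view of the document-term matrix (one row per message,
# one column per vocabulary word in alphabetical order)
dense_counts = toy_counts.toarray()
print(dense_counts)
```

Note that "Hello" and "hello" end up in the same column, and that with the default settings single-character tokens such as "a" are dropped from the vocabulary.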

### Model Implementation ###

Now that our data set is ready for classification, we will use a Naive Bayes classifier. Specifically, the multinomial Naive Bayes algorithm will be used because it is suited to classification with discrete features, as in our case where word counts are the features. One thing to note is that the Naive Bayes classifier assumes that the features are independent of one another given the class, which is not necessarily true here: if we see the word "free", we are more likely to also see a word such as "money" or "cash". However, assuming independence keeps our model simple, and it still works well on the testing data.
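
As a rough illustration of what the multinomial model estimates, the sketch below recomputes the class priors and the smoothed per-class word likelihoods by hand and compares them with scikit-learn's fitted values. The count matrix `X_toy`, the labels `y_toy`, and the vector `x_new` are made-up toy data, and the sketch assumes the default Laplace smoothing (`alpha=1.0`).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: rows are messages, columns are words (made-up data)
X_toy = np.array([[2, 0, 1],
                  [1, 0, 0],
                  [0, 3, 1],
                  [0, 2, 0]])
y_toy = np.array([0, 0, 1, 1])  # 0 = ham, 1 = spam

alpha = 1.0  # Laplace smoothing, the MultinomialNB default
clf = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Manual estimates: log P(class) and smoothed log P(word | class)
classes = np.unique(y_toy)
manual_log_prior = np.log(np.array([(y_toy == c).mean() for c in classes]))
word_counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in classes])
smoothed = word_counts + alpha
manual_log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# Score a new count vector: log prior + sum over words of count * log likelihood.
# Summing per-word terms is exactly where the independence assumption enters.
x_new = np.array([0, 1, 1])
manual_scores = manual_log_prior + manual_log_likelihood @ x_new
manual_prediction = classes[np.argmax(manual_scores)]

print(manual_prediction, clf.predict([x_new])[0])
```

The hand-computed quantities should agree with `clf.class_log_prior_` and `clf.feature_log_prob_`.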

In [7]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB()

In [9]:
# making a prediction on the testing data
predictions = naive_bayes.predict(testing_data)
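
The same two steps (transform, then predict) apply to any new raw message. The sketch below shows this on a tiny made-up training set that stands in for X_train and y_train; the names `toy_train`, `toy_labels`, `toy_vectorizer`, `toy_model`, and `new_messages` are hypothetical and chosen only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set, standing in for X_train / y_train
toy_train = ["free cash prize, claim now", "win free money today",
             "are we still meeting for lunch", "see you at home tonight"]
toy_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_train), toy_labels)

# New text must go through transform (not fit_transform) so that the
# columns line up with the vocabulary learned during training
new_messages = ["claim your free prize", "lunch at home"]
new_predictions = toy_model.predict(toy_vectorizer.transform(new_messages))
print(new_predictions)
```

On this toy data the first message is labelled spam (1) and the second ham (0); words not seen during training, such as "your", are simply ignored.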

### Model Evaluation ###

Our model is ready. Now it is time to evaluate how well it performed on the testing set. We will use the following metrics to understand the model's performance:
1. accuracy_score: ratio of correct predictions to the total number of predictions.
2. precision_score: proportion of messages we classified as spam that were actually spam.
3. recall_score: proportion of messages that were actually spam that our model classified as spam.
4. f1_score: harmonic mean of precision and recall.

precision_score and recall_score might sound similar, but they are not:
precision_score is true positives / (true positives + false positives), whereas
recall_score, also called sensitivity, is true positives / (true positives + false negatives).
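
The formulas above can be checked by hand against a confusion matrix. The sketch below uses made-up labels (`toy_true` and `toy_pred` are illustrative only, not the model's actual predictions) and computes each metric directly from the four counts.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels for illustration: 1 = spam, 0 = ham
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

These hand-computed values should match what accuracy_score, precision_score, recall_score, and f1_score return for the same labels.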

In [11]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562


Bingo! Our classification model is ready. One advantage of using a Naive Bayes classifier for this type of problem is its ability to handle extremely large feature sets, as in our problem where each word is treated as a feature. Compared to neural networks and decision trees, it is typically very fast to train, and because the model is so simple it tends to be less prone to overfitting.

Therefore, we have successfully designed a model that can efficiently predict whether an SMS message is spam or not.