## 1-Importing the Dataset into a Pandas DataFrame
The read_table function of pandas is used to load the dataset into a pandas DataFrame. The
data is tab-separated, so we pass '\t' as the sep argument; the file has no header row
(header=None), so we name the columns 'label' and 'sms_message' (names=['label',
'sms_message']).

In [1]:
import pandas as pd
# Dataset from - https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collect
df = pd.read_table('smsspamcollection/SMSSpamCollection', 
                   sep='\t', 
                   header=None,
                   names=['label', 'sms_message'])
df.head()

|   | label | sms_message |
|---|-------|-------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
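
The loading step can be tried without the real file. The following is a minimal sketch that assumes two made-up rows in the same tab-separated, header-less layout; `raw` and `sample_df` are illustrative names, and `pd.read_csv` with `sep='\t'` is used here as a stand-in that behaves like `read_table` for this input.

```python
import io

import pandas as pd

# Two invented rows in the same layout as the dataset:
# label <TAB> message, with no header line.
raw = "ham\tSee you at lunch\nspam\tWin a free prize now\n"

sample_df = pd.read_csv(io.StringIO(raw),
                        sep='\t',
                        header=None,
                        names=['label', 'sms_message'])

print(sample_df.shape)          # (2, 2)
print(list(sample_df.columns))  # ['label', 'sms_message']
```

Without `header=None`, the first message would be consumed as the column names, so one sample would silently go missing.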


## 2-Converting Categories to Binary Variables
Here, we convert the output categories (the label column) to binary variables, 0 for ham and 1 for spam, to facilitate working with
scikit-learn.

In [2]:
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head()

|   | label | sms_message |
|---|-------|-------------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
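
One thing worth checking after a dictionary-based `map` is that every label was actually covered, because values missing from the dictionary become NaN rather than raising an error. The sketch below uses a small made-up frame (`toy`) with one deliberately misspelled label to show the effect; it is an illustration, not part of the original notebook.

```python
import pandas as pd

# Invented labels; 'Spam' (capital S) is a deliberate mismatch.
toy = pd.DataFrame({'label': ['ham', 'spam', 'ham', 'Spam']})

encoded = toy['label'].map({'ham': 0, 'spam': 1})

# Labels not found in the dictionary are mapped to NaN.
print(encoded.isna().sum())  # 1

# Class counts of the raw labels, useful for spotting such mismatches.
print(toy['label'].value_counts())
```

On the real data, the same `isna().sum()` check on `df['label']` should return 0 if the file only contains 'ham' and 'spam'.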


## 3-Splitting into Training and Test Sets
We split the dataset into training and test sets, holding out 25% of the messages for testing, as follows:

In [3]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df['sms_message'],
                                                    df['label'],
                                                    test_size=0.25,
                                                    random_state=1)

In [4]:
print('The x_train dataframe has {} samples.'.format(x_train.shape[0]))
print('The x_test dataframe has {} samples.'.format(x_test.shape[0]))

The x_train dataframe has 4179 samples.
The x_test dataframe has 1393 samples.
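
The sample counts above follow directly from `test_size=0.25` applied to 5572 rows. The sketch below reproduces them on stand-in data and additionally passes `stratify`, an option the notebook does not use, to keep the class proportions the same in both parts. The arrays `X` and `y` and the 1-in-8 label pattern are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 5572 dummy samples (4179 + 1393 as reported above).
# The 1-in-8 positive rate is invented for this example.
X = np.arange(5572)
y = (X % 8 == 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y,
                                          test_size=0.25,
                                          random_state=1,
                                          stratify=y)  # not used in the notebook

print(len(X_tr), len(X_te))      # 4179 1393
print(y_tr.mean(), y_te.mean())  # positive-class share in each part
```

With an imbalanced label, stratifying can make the test-set metrics less sensitive to the particular random split.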


## 4-Bag of Words (BoW)
BoW counts the frequency of words in a text dataset. After the text data is processed with BoW,
the new dataset consists of columns, one for each word present in the training messages, and each
cell holds the number of times the corresponding word (column) occurs in the corresponding
text message (row). To convert text data to such a dataset, we use scikit-learn's
CountVectorizer class, as follows:

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer(stop_words = 'english')
training_data = count_vector.fit_transform(x_train)
testing_data = count_vector.transform(x_test)

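Here we set the stop_words argument to 'english' so that a list of common English words (defined
in scikit-learn) is removed from the messages. The vocabulary is learned from the training set
only (fit_transform), and the test set is encoded with that same vocabulary (transform).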
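
To make the word-count matrix concrete, here is a small sketch on three made-up messages. The names `docs`, `vectorizer` and `vocab` are illustrative; the column order is read from the fitted `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three invented messages for illustration.
docs = ["Win the free prize", "The free entry", "Meet at the office"]

vectorizer = CountVectorizer(stop_words='english')
counts = vectorizer.fit_transform(docs)  # sparse matrix: rows = messages, columns = words

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(counts.toarray())

# transform() reuses the fitted vocabulary, so unseen words are ignored.
unseen = vectorizer.transform(["free pizza"])
print(unseen.toarray())
```

Common words such as "the" do not appear among the columns, and "pizza" contributes nothing to the last row because it was not seen during fitting. This is the reason the test set is only passed through `transform`.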

## 5-Naive Bayes Implementation
We use scikit-learn's MultinomialNB class for the Naive Bayes implementation, as follows:

In [6]:
from sklearn.naive_bayes import MultinomialNB

naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [7]:
predictions = naive_bayes.predict(testing_data)
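
The vectorizing and classification steps can also be chained so that raw text goes in and a label comes out. The following is a minimal sketch using `make_pipeline` on four made-up messages; the pipeline, the toy data and all variable names are assumptions for illustration and are not part of the notebook above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training messages: 1 = spam, 0 = ham.
toy_messages = ["win free prize now", "free cash prize claim",
                "see you at lunch", "are we meeting tomorrow"]
toy_labels = [1, 1, 0, 0]

# The pipeline applies CountVectorizer first, then MultinomialNB.
toy_model = make_pipeline(CountVectorizer(stop_words='english'),
                          MultinomialNB())
toy_model.fit(toy_messages, toy_labels)

new_messages = ["claim your free prize", "lunch meeting tomorrow"]
toy_predictions = toy_model.predict(new_messages)
toy_probabilities = toy_model.predict_proba(new_messages)

print(toy_predictions)             # [1 0]
print(toy_probabilities.round(3))  # one row per message, columns = classes 0 and 1
```

`predict_proba` exposes the class probabilities behind each prediction, which can help when a different decision threshold than the default is wanted.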

## 6-Performance Evaluation
Here, we first calculate the accuracy of the predictions, which is the ratio of correct predictions
to the total number of predictions.

In [8]:
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test, predictions)
print('prediction accuracy: {}'.format(acc))

prediction accuracy: 0.9877961234745154
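
As a sanity check on the definition, accuracy can be computed by hand as the fraction of positions where the prediction equals the true label. The sketch below uses two short made-up arrays and compares the result with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented labels and predictions (6 of 8 positions agree).
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 0, 1, 0, 0, 0, 1, 1])

manual_acc = (y_true == y_pred).mean()    # correct predictions / all predictions
sklearn_acc = accuracy_score(y_true, y_pred)

print(manual_acc)   # 0.75
print(sklearn_acc)  # 0.75
```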


In some problems the class distribution is skewed, as in our spam-ham problem here, where
only about 13 percent of the text messages are spam. In such cases accuracy alone is not a good
evaluation metric, because a classifier can misclassify most of the spam messages and still
reach a seemingly acceptable accuracy. Hence, we also calculate three other metrics, namely
precision, recall and F1, which come in handy when we should not rely on accuracy alone.
Precision is the ratio of true positives to all predicted positives, recall tells us what
proportion of all spam messages were predicted as spam, and F1 is the harmonic mean of the
precision and recall scores. We use scikit-learn to calculate these scores, as follows:

In [9]:
from sklearn.metrics import precision_score, recall_score, f1_score

print('precision score: ', format(precision_score(y_test, predictions)))
print('recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

precision score:  0.9615384615384616
recall score:  0.9459459459459459
F1 score:  0.9536784741144414
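
To see why accuracy can mislead on skewed data and how the three scores are built, the sketch below uses made-up labels (3 spam out of 20 messages). It first scores a baseline that labels everything as ham, then derives precision, recall and F1 (the harmonic mean of the two) from the confusion matrix counts. All arrays and names here are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Invented ground truth: 3 spam (1) and 17 ham (0).
y_true = np.array([1] * 3 + [0] * 17)

# Baseline that predicts ham for every message.
always_ham = np.zeros_like(y_true)
baseline_acc = accuracy_score(y_true, always_ham)
baseline_recall = recall_score(y_true, always_ham)
print(baseline_acc)     # 0.85, although no spam is caught
print(baseline_recall)  # 0.0

# A classifier that catches 2 of 3 spam and raises 1 false alarm.
y_pred = np.array([1, 1, 0] + [1] + [0] * 16)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # true positives / predicted positives
recall = tp / (tp + fn)     # true positives / actual positives
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```

The hand-computed values agree with the scikit-learn functions, and the baseline shows an accuracy of 0.85 with a recall of zero.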


This completes the classification problem. Based on the scores, we can see that the Naive Bayes algorithm can identify spam SMS messages with high accuracy, precision and recall.