# Naive Bayes for Spam Detection

We will use the SMS Spam dataset available on [data.world](https://data.world/lylepratt/sms-spam) for SMS spam detection.

We will preprocess our dataset and prepare for the machine learning algorithms:

1. get the text file and treat it: the resulting file is a list of tuples; the tuples are composed by (message, is_spam), and the length of the list is the number of messages.
2. extract the messages and the targets (is_spam)
3. build a bag of words using CountVectorizer
4. separate the dataset into training and test set

In [7]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score


def get_data_sms(filename):
    data_sms = []
    with open(filename, 'r', encoding='utf-8') as file:
        for line in file:
            is_spam = "ham" not in line
            if is_spam == True:
                line = line.replace("spam	","")
            else:
                line = line.replace("ham	", "")
            data_sms.append((line.rstrip(), is_spam))
    return data_sms

file = "SMSSpamCollection.txt"
data_sms = get_data_sms(file)

target = []
features = []

for i in range(len(data_sms)):
    features.append(data_sms[i][0])
    target.append(data_sms[i][1])

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(features)

X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=.4, random_state=42)

### Machine learning models

We will use Naive Bayes to train our dataset.

First we will use Multinomial Naive Bayes to classify the SMS messages.

From the [scikit-learn documentation](https://scikit-learn.org/stable/modules/naive_bayes.html):

> _MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice)._

In [12]:
print("Using Multinomial Naive Bayes\n")

mnb = MultinomialNB()

y_pred_mnb = mnb.fit(X_train, y_train).predict(X_test)

accuracy_mnb = round(accuracy_score(y_test, y_pred_mnb), 4)
precision_mnb = round(precision_score(y_test, y_pred_mnb), 4)
recall_mnb = round(recall_score(y_test, y_pred_mnb), 4)

print("Accuracy:", accuracy_mnb)
print("Precision:", precision_mnb)
print("Recall:", recall_mnb)

Using Multinomial Naive Bayes

Accuracy: 0.9803
Precision: 0.9094
Recall: 0.9461


We can also use Bernoulli Naive Bayes, since it's a bi-class problem.

From the [scikit-learn documentation](https://scikit-learn.org/stable/modules/naive_bayes.html):

> _BernoulliNB implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a binary-valued (Bernoulli, boolean) variable._

In [13]:
print("Using Bernoulli Naive Bayes\n")

bnb = BernoulliNB()

y_pred_bnb = bnb.fit(X_train, y_train).predict(X_test)

accuracy_bnb = round(accuracy_score(y_test, y_pred_bnb), 4)
precision_bnb = round(precision_score(y_test, y_pred_bnb), 4)
recall_bnb = round(recall_score(y_test, y_pred_bnb), 4)

print("Accuracy:", accuracy_bnb)
print("Precision:", precision_bnb)
print("Recall:", recall_bnb)

Using Bernoulli Naive Bayes

Accuracy: 0.978
Precision: 0.9806
Recall: 0.8519
