# Spam classifier: Naive Bayes

**Multinomial event model**

We apply a naive Bayes classifier with Laplace smoothing to train a spam filter on the [SpamAssassin Public Corpus](http://spamassassin.apache.org/publiccorpus/).

On this data, the multinomial event model gives better results than the multivariate Bernoulli event model.

In [5]:
import json

In [6]:
import numpy as np

from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm as progressbar

## Data loading

The data was cleaned and tokenized in [`spam-data-preparation.ipynb`](spam-data-preparation.ipynb).

In [7]:
def load_json_from_file(filename):
    with open(filename, "r", encoding="utf-8") as f:
        return json.load(f)

In [8]:
emails_tokenized_ham = load_json_from_file("emails-tokenized-ham.json")
emails_tokenized_spam = load_json_from_file("emails-tokenized-spam.json")

In [9]:
vocab = load_json_from_file("vocab.json")

## Data encoding

Let's represent each email as an $n$-dimensional vector $\left[x_1, x_2, \ldots, x_n\right]$, where $x_i$ is the index of the $i$-th word of the email in the dictionary $V$ and $n$ is the number of words in the email.

For example, _"Buy gold watches. Buy now."_ can be encoded as $\left[3953, 11890, 32213, 3953, 20330\right]$.

In [11]:
def email_to_vector_multinomial(email_words, vocab):
    # One entry per word position; each entry holds that word's index in vocab.
    email_vec = np.zeros(len(email_words), dtype="int32")

    for i, word in enumerate(email_words):
        if word in vocab:
            email_vec[i] = vocab[word]
        # Words missing from vocab are left at index 0.

    return email_vec
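
To make the encoding concrete, here is a minimal sketch on a made-up three-word vocabulary. The names `toy_vocab`, `encode`, and `toy_vec` are hypothetical and exist only for this illustration; the sketch assumes, as the indexing later in this notebook suggests, that the vocabulary maps words to 0-based indices. It also shows that a word missing from the vocabulary ends up with the same code as the word stored at index 0.

```python
import numpy as np

# Hypothetical three-word vocabulary; the real one is loaded from vocab.json.
toy_vocab = {"buy": 0, "gold": 1, "now": 2}


def encode(email_words, vocab):
    """Map each word position to its vocabulary index (0 if the word is unknown)."""
    vec = np.zeros(len(email_words), dtype="int32")
    for i, word in enumerate(email_words):
        if word in vocab:
            vec[i] = vocab[word]
    return vec


# "watch" is not in toy_vocab, so it is encoded as 0, just like "buy".
toy_vec = encode(["buy", "gold", "watch", "buy", "now"], toy_vocab)
print(toy_vec)  # [0 1 0 0 2]
```

If that collision matters for a given dataset, one option (not used in this notebook) would be to drop unknown words before encoding.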

Let's encode all emails:

In [12]:
X = [
    email_to_vector_multinomial(email, vocab)
    for email in emails_tokenized_ham + emails_tokenized_spam
]

In [13]:
y = np.array([0] * len(emails_tokenized_ham) + [1] * len(emails_tokenized_spam))

Let's look at two sample ham emails:

In [14]:
sample_emails = [emails_tokenized_ham[10], emails_tokenized_ham[70]]

In [15]:
for email in sample_emails:
    print(email)
    print()

['hello', 'seen', 'discuss', 'articl', 'approach', 'thank', 'httpaddress', 'hell', 'rule', 'tri', 'accomplish', 'someth', 'thoma', 'alva', 'edison', 'sf', 'net', 'email', 'sponsor', 'osdn', 'tire', 'old', 'cell', 'phone', 'get', 'new', 'free', 'httpaddress', 'spamassassin', 'devel', 'mail', 'list', 'emailaddress', 'httpaddress']

['fri', 'number', 'aug', 'number', 'tom', 'wrote', 'xvid', 'number', 'project', 'make', 'gpl', 'divx', 'codec', 'sigma', 'design', 'number', 'sorri', 'sigma', 'design', 'number', 'number', 'httpaddress']



In [16]:
for email in sample_emails:
    email_vec = email_to_vector_multinomial(email, vocab)
    
    print("Email vector:", email_vec)
    print("Dimensionality:", email_vec.shape)
    print()

Email vector: [12866 26186  7632  1574  1361 29410 13468 12862 25408 30214   173 27396
 29564   872  8567 26411 19758  8849 27722 21116 29770 20748  4549 22215
 11528 19855 10871 13468 27540  7314 17535 16947  8851 13468]
Dimensionality: (34,)

Email vector: [10946 20419  1845 20419 29891 33008 33256 20419 23298 17594 12021  7779
  5370 26756  7232 20419 27443 26756  7232 20419 20419 13468]
Dimensionality: (22,)



## Splitting the data into training and test sets

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In [19]:
print("# Train:", len(X_train))
print("# Test: ", len(X_test))

# Train: 4987
# Test:  555


## Naive Bayesian classifier training

Let's count the total number of words in the ham and spam emails of the training set.

In [20]:
def count_words(emails, y, c):
    """Total number of words in emails of class c, or in all emails if c == "all"."""
    k = 0

    for i in range(len(emails)):
        if c == "all" or y[i] == c:
            k = k + len(emails[i])
    return k

ham_total_words_train = count_words(X_train, y_train, 0)
spam_total_words_train = count_words(X_train, y_train, 1)

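Let's compute the class priors for ham and spam emails, where $m$ is the number of training emails.

$\log{P(ham)} = \log{\frac{\sum_{i=1}^{m} 1\{y^{(i)}=ham\}}{m}}$

$\log{P(spam)} = \log{\frac{\sum_{i=1}^{m} 1\{y^{(i)}=spam\}}{m}}$

In [21]:
# Note: these priors use each class's share of training words,
# not its share of training emails as in the formulas above.
total_words_train = count_words(X_train, y_train, "all")
ham_log_prior = np.log(ham_total_words_train / total_words_train)
spam_log_prior = np.log(spam_total_words_train / total_words_train)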
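
The prior formulas above count emails, whereas the cell above divides word totals. For comparison, here is a minimal sketch of the email-count version on made-up labels. The names `toy_y`, `toy_ham_log_prior`, and `toy_spam_log_prior` are hypothetical, and this is not the estimate used to produce the results in this notebook.

```python
import numpy as np

# Hypothetical labels: 0 = ham, 1 = spam (same convention as y_train).
toy_y = np.array([0, 0, 0, 1])

# The mean of a boolean mask is the fraction of emails in that class.
toy_ham_log_prior = np.log(np.mean(toy_y == 0))
toy_spam_log_prior = np.log(np.mean(toy_y == 1))

print(np.exp(toy_ham_log_prior), np.exp(toy_spam_log_prior))
```

With three ham emails out of four, the ham prior comes out at about 0.75 and the spam prior at about 0.25.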

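Let's compute the likelihood $\log{\phi_{word\,|\,class}}$ for each word in the dictionary. We apply Laplace smoothing so that a word never seen in a class during training does not get zero probability (and an undefined logarithm).

$\log{\phi_{word\,|\,class}} = \log{\frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} {1\{x_{j}^{(i)}=word \, \land \, y^{(i)}=class\}} + 1}{\sum_{i=1}^{m}{1\{y^{(i)}=class\}n_i} \,+\, |V|}}$

Here $x_{j}^{(i)}$ is the $j$-th word of the $i$-th email, $n_i$ is the length of that email, and $|V|$ is the size of the dictionary.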
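
As a worked instance of this formula with made-up numbers (not taken from the corpus): suppose $|V| = 5$, the spam training emails contain 20 words in total, and a given word occurs 3 times among them. Then

$\phi_{word\,|\,spam} = \frac{3 + 1}{20 + 5} = \frac{4}{25} = 0.16$

while a word that never occurs in spam gets

$\phi_{word\,|\,spam} = \frac{0 + 1}{20 + 5} = \frac{1}{25} = 0.04$

Summing the smoothed estimates over all $|V|$ words gives $\frac{20 + 5}{20 + 5} = 1$, so they still form a probability distribution.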

Let's create zero-initialized vectors $\log{\phi_{word \, | \, ham}}$ and $\log{\phi_{word \, | \, spam}}$ and fill them in for each word in the dictionary.

In [22]:
ham_log_phi = np.zeros(len(vocab), dtype="float64")
spam_log_phi = np.zeros(len(vocab), dtype="float64")

In [23]:
def count_number_words(c, word_index):
    """Count occurrences of word_index across all training emails of class c."""
    k = 0

    for i in range(len(X_train)):
        if y_train[i] == c:
            k = k + np.count_nonzero(X_train[i] == word_index)
    return k

ham_word_counts = np.zeros(len(vocab))
spam_word_counts = np.zeros(len(vocab))

In [24]:
for word_index in progressbar(range(len(vocab)), desc="Training"):
    ham_log_phi[word_index] = np.log((count_number_words(0, word_index) + 1) / (ham_total_words_train + len(vocab)))
    spam_log_phi[word_index] = np.log((count_number_words(1, word_index) + 1) / (spam_total_words_train + len(vocab)))
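
The loop above scans every training email once per vocabulary word. A possible alternative, sketched here on made-up data, counts all words of a class in a single pass with `np.bincount`. The names `fit_log_phi`, `toy_X`, `toy_y`, and `toy_vocab_size` are hypothetical; the sketch assumes emails are already encoded as arrays of 0-based word indices.

```python
import numpy as np

# Hypothetical toy data: 4-word vocabulary, emails encoded as index vectors.
toy_vocab_size = 4
toy_X = [np.array([0, 1, 1]), np.array([2, 3]), np.array([1, 1, 3])]
toy_y = np.array([0, 1, 0])  # 0 = ham, 1 = spam


def fit_log_phi(X, y, c, vocab_size):
    """Laplace-smoothed log-likelihoods for class c, counted in one pass."""
    # Gather every word index that appears in emails of class c.
    emails = [x for x, label in zip(X, y) if label == c]
    words = np.concatenate(emails) if emails else np.array([], dtype=int)
    # counts[k] = number of occurrences of word k in class c.
    counts = np.bincount(words, minlength=vocab_size)
    # counts.sum() is the total number of words in class c.
    return np.log((counts + 1) / (counts.sum() + vocab_size))


toy_ham_log_phi = fit_log_phi(toy_X, toy_y, 0, toy_vocab_size)
toy_spam_log_phi = fit_log_phi(toy_X, toy_y, 1, toy_vocab_size)
print(np.exp(toy_ham_log_phi))
```

In the toy ham class the word counts are 1, 4, 0, and 1 out of 6 words, so the smoothed probabilities are 2/10, 5/10, 1/10, and 2/10.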

## Prediction

$\log P(y=class\,|\,words) = \log \frac{P(words\,|\,y=class) P(y=class)}{P(words)} = \sum_{i=1}^{n} \log P(words_i\,|\,y=class) + \log P(y=class) - \log P(words)$

The term $\log P(words)$ is the same for both classes, so it can be ignored when comparing them.
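
Working with sums of logarithms instead of products of probabilities also avoids floating-point underflow on long emails. The sketch below illustrates this with made-up likelihoods; `toy_word_probs`, `direct_product`, and `log_score` are hypothetical names used only here.

```python
import numpy as np

# Hypothetical per-word likelihoods for a 500-word email.
toy_word_probs = np.full(500, 1e-4)

# Multiplying 500 small numbers underflows float64 to exactly 0.0.
direct_product = np.prod(toy_word_probs)

# Summing their logarithms stays finite and can be compared between classes.
log_score = np.sum(np.log(toy_word_probs))

print(direct_product, log_score)
```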

In [29]:
def predict(X):
    y_pred = np.zeros(len(X))

    for i in progressbar(range(len(y_pred)), desc="Predict"):
        email_vector = X[i]
        
        ham_posterior = ham_log_phi[email_vector].sum() + ham_log_prior
        spam_posterior = spam_log_phi[email_vector].sum() + spam_log_prior

        # The class with the larger unnormalized log-posterior wins; ties go to spam.
        y_pred[i] = 0 if ham_posterior > spam_posterior else 1
    
    return y_pred  
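
The two scores compared in `predict` are unnormalized log-posteriors. If an actual probability is wanted, they can be normalized against each other, which restores the $P(words)$ term that was dropped. The following is a minimal sketch under that assumption; `spam_probability` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def spam_probability(ham_score, spam_score):
    """Turn two unnormalized log-posterior scores into P(spam | words)."""
    scores = np.array([ham_score, spam_score], dtype="float64")
    # Subtract the maximum first so the largest term becomes exp(0) = 1
    # and the exponentials cannot all underflow to zero.
    shifted = scores - scores.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return probs[1]


# Hypothetical scores of the kind computed inside predict().
print(spam_probability(-4610.0, -4605.0))
```

A score gap of 5 in favor of spam corresponds to a spam probability of $1 / (1 + e^{-5})$, a little above 0.99.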

## Evaluation of prediction accuracy

In [30]:
pred_train = predict(X_train)
pred_test = predict(X_test)

In [31]:
accuracy_train = 1 - np.sum(pred_train != y_train) / len(y_train)
accuracy_test = 1 - np.sum(pred_test != y_test) / len(y_test)

In [32]:
print("Training accuracy:   {0:.3f}%".format(accuracy_train * 100))
print("Test accuracy:       {0:.3f}%".format(accuracy_test * 100))

Training accuracy:   97.714%
Test accuracy:       97.477%
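
Accuracy treats both kinds of error alike, while for a spam filter a ham email flagged as spam is often the costlier mistake. As a possible complement, precision and recall for the spam class can be computed as sketched below. The labels and predictions are made up for illustration, and the names `toy_true` and `toy_pred` are hypothetical; these are not results from this notebook.

```python
import numpy as np

# Hypothetical labels and predictions (0 = ham, 1 = spam).
toy_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
toy_pred = np.array([0, 0, 1, 1, 1, 0, 1, 0])

tp = np.sum((toy_pred == 1) & (toy_true == 1))  # spam correctly flagged
fp = np.sum((toy_pred == 1) & (toy_true == 0))  # ham wrongly flagged as spam
fn = np.sum((toy_pred == 0) & (toy_true == 1))  # spam that was missed

precision = tp / (tp + fp)  # share of flagged emails that really are spam
recall = tp / (tp + fn)     # share of spam emails that were flagged

print("Precision: {0:.3f}  Recall: {1:.3f}".format(precision, recall))
```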
