# CS280 Programming Assignment 1
Naive Bayes Spam Filter
<br>
Dataset: TREC06 Corpus<br>
<br>
__IMPORTANT NOTE BEFORE RUNNING:__
* Download and unzip __trec06p-cs280.zip__ into a folder named __'trec06p-cs280'__
* Move the folder __'trec06p-cs280/data'__ one level up so that it sits next to __'trec06p-cs280'__, e.g.
>$ mv trec06p-cs280/data data

> __EXPECTED DIRECTORY CONTENTS__<br>
$ ls <br>
PA1_RILI.ipynb<br>
trec06p-cs280<br>
data<br>


## 2.1 Classifier Construction and Evaluation


### 1. Parse the documents in the training set, form the vocabulary V, count their statistics and report the prior probabilities for spam and ham.

Get the labels and data paths for both the train and test data sets

In [None]:
labels_file_path = 'trec06p-cs280/labels'  # forward slashes work on Windows, Linux and macOS
data_root_path = 'data'
train_data_start_path = 'data/000/000'
train_data_end_path = 'data/070/299'
test_data_start_path = 'data/071/000'
test_data_end_path = 'data/126/021'

train_labels = []
train_data_paths = []
test_labels = []
test_data_paths = []

with open(labels_file_path) as labels_file:
    in_test_data_part = False
    for line in labels_file:
        line_contents = line.strip().split()
        if in_test_data_part:
            test_labels.append(line_contents[0])
            test_data_paths.append(line_contents[1][3:])
        else:
            train_labels.append(line_contents[0])
            train_data_paths.append(line_contents[1][3:])
            
        if train_data_end_path in line_contents[1]:
            in_test_data_part = True
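
The split logic above can be tried on in-memory lines without the corpus. The sketch below assumes each line of the labels file has the form `<label> ../data/<dir>/<file>` (inferred from the `[3:]` slice); the helper `split_labels` and the sample lines are hypothetical and not part of the assignment code.

```python
def split_labels(lines, last_train_path):
    """Split '<label> ../<path>' lines into a train part and a test part.

    Every line up to and including the one containing last_train_path goes
    to the training set; all later lines go to the test set.
    """
    train, test = [], []
    in_test_part = False
    for line in lines:
        label, path = line.strip().split()
        target = test if in_test_part else train
        target.append((label, path[3:]))  # drop the leading '../'
        if last_train_path in path:
            in_test_part = True
    return train, test


# Made-up sample lines, not taken from the real labels file.
sample = [
    "ham ../data/000/000",
    "spam ../data/070/299",
    "spam ../data/071/000",
]
train, test = split_labels(sample, "data/070/299")
print(train)
print(test)
```

Note that the boundary line itself still belongs to the training set, because the flag is only flipped after the line has been stored.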

Just a few sanity checks...

In [None]:
print(len(train_labels), len(train_data_paths))
print(train_labels[-1], train_data_paths[-1])
print(len(test_labels), len(test_data_paths))
print(test_labels[-1], test_data_paths[-1])

#### Compute the Priors
The prior probabilities P(w=H) and P(w=S) are the fractions of training documents labeled ham and spam, respectively:

In [None]:
prior_ham = train_labels.count('ham') / len(train_labels)
prior_spam = train_labels.count('spam') / len(train_labels)

print('P(w=H) = ', prior_ham)
print('P(w=S) = ', prior_spam)

#### Construct the vocabulary

Let us define a function that would give us the "words" (as defined in the specifications) given a line/string:

In [None]:
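import re

def get_words_from_line(line):
    # A "word" is a run of letters preceded by whitespace, optionally followed
    # by '-' or "'", and terminated by '.', ',' or whitespace.
    # (The '|' characters were dropped from the last class: inside [...] they
    # are literal pipes, not alternation.)
    pattern = re.compile(r'(?<=\s)[a-zA-Z]+[\-\']?[.,\s]')
    words = pattern.findall(line)
    for word_index, word in enumerate(words):
        words[word_index] = word.strip(" \t\n.,-'").lower()
    return words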
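
To make the behavior of the pattern concrete, here is a small standalone sketch. It restates the tokenizer so the example runs on its own; the sample strings are made up. Because of the `(?<=\s)` lookbehind, a word at the very start of a line is skipped unless the line begins with whitespace.

```python
import re

# Standalone copy of the tokenizer, restated so this example is self-contained.
WORD_PATTERN = re.compile(r"(?<=\s)[a-zA-Z]+[\-']?[.,\s]")

def get_words_from_line(line):
    return [w.strip(" \t\n.,-'").lower() for w in WORD_PATTERN.findall(line)]

# "Hello" is not preceded by whitespace, so it is not picked up.
first = get_words_from_line("Hello world, this is SPAM.\n")
print(first)   # ['world', 'this', 'is', 'spam']

# With a leading space, the first word is matched as well.
second = get_words_from_line(" Hello world\n")
print(second)  # ['hello', 'world']
```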

Then let us define some *stop words* -- words that are so commonly used that they become insignificant and meaningless.<br>
(Taken from https://towardsdatascience.com/multinomial-naive-bayes-classifier-for-text-analysis-python-8dd6825ece67)

In [None]:
stop_words = [
"a", "about", "above", "across", "after", "afterwards", 
"again", "all", "almost", "alone", "along", "already", "also",    
"although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "as", "at", "be", "became", "because", "become","becomes", "becoming", "been", "before", "behind", "being", "beside", "besides", "between", "beyond", "both", "but", "by","can", "cannot", "cant", "could", "couldnt", "de", "describe", "do", "done", "each", "eg", "either", "else", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "find","for","found", "four", "from", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "i", "ie", "if", "in", "indeed", "is", "it", "its", "itself", "keep", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mine", "more", "moreover", "most", "mostly", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next","no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part","perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "she", "should","since", "sincere","so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "take","than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they",
"this", "those", "though", "through", "throughout",
"thru", "thus", "to", "together", "too", "toward", "towards",
"under", "until", "up", "upon", "us",
"very", "was", "we", "well", "were", "what", "whatever", "when",
"whence", "whenever", "where", "whereafter", "whereas", "whereby",
"wherein", "whereupon", "wherever", "whether", "which", "while", 
"who", "whoever", "whom", "whose", "why", "will", "with",
"within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves"
]

Then, let us define a function that parses all the training data files to construct our vocabulary. <br>
Our vocabulary comes in the form of a dictionary with the words as keys and their corresponding indices as values. This will be useful later on when constructing the vocabulary vectors for the input documents to our spam filter.<br><br>
Note that only words no longer than a set maximum word length are included, in an attempt to reduce the strain on run-time memory. <br>
> Note that the average length of words in English is around 5 letters. (http://norvig.com/mayzner.html) So here we set the limit to a reasonable multiple of the average.

In [None]:
import os
import re
#from playsound import playsound

def get_vocab_with_indices():
    MAX_WORD_LEN = 10  # words longer than this are left out of the vocabulary

    vocab_indices = {}
    current_vocab_index = 0

    for data_path_index, data_path in enumerate(train_data_paths):
        with open(data_path, 'r', errors='ignore') as file:
            print('Parsing file %d of %d for vocabulary:'%(data_path_index, len(train_data_paths)))
            for line in file:
                words = get_words_from_line(line)
                for word in words:
                    if word not in stop_words and len(word) <= MAX_WORD_LEN:
                        if (word not in vocab_indices):
                            vocab_indices[word] = current_vocab_index
                            current_vocab_index += 1

    #playsound('Victory.mp3')

    return vocab_indices

Thus we can call the above function like so:
>Optional: This may take a while. Uncomment the playsound() line and its necessary imports to play a sound when the processing is done!

In [None]:
V = get_vocab_with_indices()


In [None]:
print('Number of vocabulary words: ', len(V))

### Constructing the vector of likelihoods per document class
Now that we have our vocabulary, let us define a method that returns, for each vocabulary word, the number of training documents of a specified class (i.e. "spam" or "ham") that contain it. Additionally, it returns the number of documents processed that match the specified label. <br>

This is done by first creating a word-presence matrix (__'all_documents'__): The rows correspond to each word in our vocabulary while the columns correspond to each training data document labeled with __'document_class'__. If a certain word is present in a certain document, then the corresponding element in the matrix will be marked with a '1'.

Afterwards, the desired word counts can be obtained by simply summing along the rows.
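
As a small illustration of the word-presence matrix, the sketch below uses a toy vocabulary and three toy documents (all made up, already tokenized) instead of reading files.

```python
import numpy as np

vocab_indices = {"offer": 0, "meeting": 1, "free": 2}        # toy vocabulary
docs = [["free", "offer", "free"], ["meeting"], ["offer"]]   # toy tokenized documents

# Rows are vocabulary words, columns are documents.
presence = np.zeros((len(vocab_indices), len(docs)), dtype=np.int8)
for col, words in enumerate(docs):
    for word in words:
        if word in vocab_indices:
            presence[vocab_indices[word], col] = 1  # a repeated word still counts once

# Summing each row across its columns gives, per word,
# the number of documents that contain it.
doc_counts = presence.sum(axis=1)
print(doc_counts)  # [2 1 1]
```

Note that "free" appears twice in the first document but contributes only 1 to its count, since the matrix records presence rather than frequency.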


In [None]:
import numpy as np

def get_word_counts(document_class, vocab_indices, train_data_paths, train_labels):
    print('vocab_indices length: ', len(vocab_indices))
    if document_class not in train_labels:
        return None
    class_doc_index = 0
    all_documents = np.zeros((len(vocab_indices),train_labels.count(document_class)), dtype=np.float64)
    print('all_documents shape: ', all_documents.shape)
    for doc_index, data_path in enumerate(train_data_paths):
        print('Processing document file %d of %d to vocabulary vector:'%(doc_index, len(train_data_paths)))
        if (train_labels[doc_index] != document_class):
            continue
        with open(data_path, 'r', errors='ignore') as file:
            for line in file:
                words = get_words_from_line(line)
                for word in words:
                    if word in vocab_indices:
                        if all_documents[vocab_indices[word],class_doc_index] == 0:
                            #print('\tFound: word "%s" (index %d) in %s doc_index %d path %s'%(word, vocab_indices[word], document_class, class_doc_index, data_path))
                            all_documents[vocab_indices[word],class_doc_index] = 1
                        else:
                            continue
        class_doc_index += 1
    word_counts = all_documents.sum(axis=1)
    #playsound('Victory.mp3')
    return word_counts, all_documents.shape[1]

Simply call the function for both 'ham' and 'spam' classes and store the results.<br>
>Optional: This may take a while. Uncomment the playsound() line and its necessary imports to play a sound when the processing is done!

In [None]:
word_counts_ham, num_ham_train_docs = get_word_counts('ham', V, train_data_paths, train_labels)
word_counts_spam, num_spam_train_docs = get_word_counts('spam', V, train_data_paths, train_labels)

Let us then define a function that returns a vector of likelihood probabilities, i.e. the probability that a specific word is present in a document given that the document has class __w__: *P(word_is_present | w)*.

In [None]:
def get_likelihoods_for(word_counts, total_num_class_docs):
    return word_counts/total_num_class_docs

In [None]:
likelihoods_ham = get_likelihoods_for(word_counts_ham, num_ham_train_docs)
likelihoods_spam = get_likelihoods_for(word_counts_spam, num_spam_train_docs)

How the likelihoods vectors are used: Let's say we want to know the likelihood probability for a certain word. We fetch first the corresponding index value in V using the word as a key, then simply use the result as an index for the likelihood vectors:

In [None]:
word = 'tears'
likelihoods_ham[V[word]], likelihoods_spam[V[word]]

### 2. Construct and train a Naive Bayesian Classifier from the count statistics.

Before we can implement our classifier, let us first define a function that converts an input document from a specified __data_path__ into a word-presence vector corresponding to our vocabulary. Similar to our word-presence matrix earlier, each row corresponds to our vocabulary words and a value of 1 means the corresponding vocabulary word is present in the document, and 0 otherwise.

In [None]:
def doc_to_vocab_vector(data_path, vocab_indices):
    vocab_vector = np.zeros((len(vocab_indices)), dtype=np.int8)
    print('Processing document file %s to vocabulary vector:'%(data_path))
    with open(data_path, 'r', errors='ignore') as file:
        for line in file:
            words = get_words_from_line(line)
            for word in words:
                if word in vocab_indices:
                    if vocab_vector[vocab_indices[word]] == 0:
                        vocab_vector[vocab_indices[word]] = 1
                    else:
                        continue
    return vocab_vector


As per Bayes' rule, we need to compute the posterior probability of each document class, i.e. P(ham|w1, ..., wn) and P(spam|w1, ..., wn), and compare the two. The classifier then outputs the class with the higher probability.<br>
<br>
The function __calculate_probability()__ does this computation given the document's word-presence vocabulary vector. It uses the function __calculate_sum_of_log_likelihoods()__ to get the sum of the logarithms of the likelihood probabilities for each word, then converts the result back to normal (non-logarithmic) space so the two classes can be compared.<br>
<br><br>
To compute the sum of the logarithms of each likelihood, the function __calculate_sum_of_log_likelihoods()__ makes use of the inverse of the document vocabulary vector (__doc_vocab_vector_inverse__) and the complement of the likelihood probabilities (__likelihoods_inverse__):<br>
* __doc_vocab_vector__: a single input document converted to a word-presence vocabulary vector
* __doc_vocab_vector_inverse__: Each element is the opposite (i.e. 0 if 1; 1 if 0) of the corresponding value in the word-presence vocabulary vector (__doc_vocab_vector__)
* __likelihoods__: contains the likelihood probabilities previously obtained, e.g. P(w1|__doc_class__), ..., P(wn|__doc_class__)
* __likelihoods_inverse__: Each element is the complement of the corresponding likelihood probability (i.e. 1 - likelihood), which is the probability that the word is absent from a document of that class

The probabilities to be multiplied (or, in this case, log-transformed and then summed) are then computed as __doc_likelihoods__:

__doc_likelihoods__ = __doc_vocab_vector__ $*$ __likelihoods__ + __doc_vocab_vector_inverse__ $*$ __likelihoods_inverse__

* if the word is present in the document (i.e. __doc_vocab_vector[i]__=1 and __doc_vocab_vector_inverse[i]__=0), then the value in __likelihoods[i]__ is used.
* if the word is NOT present in the document (i.e. __doc_vocab_vector[i]__=0 and __doc_vocab_vector_inverse[i]__=1), then the value in __likelihoods_inverse[i]__ is used.<br>
<br>
Also note that elements of __doc_likelihoods__ that are exactly zero are excluded when taking the sum of logarithms. The logarithm of 0 evaluates to -inf (NumPy also emits a runtime warning), which would drive the whole sum to -inf.
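
The selection performed by this formula can be checked on a toy case. The sketch below uses made-up likelihoods for a four-word vocabulary; the variable names mirror the ones described above, and the numbers are purely illustrative.

```python
import numpy as np

likelihoods = np.array([0.8, 0.1, 0.0, 0.5])        # toy P(word_i present | class)
doc_vocab_vector = np.array([1.0, 0.0, 1.0, 0.0])   # words 0 and 2 occur in the document

doc_vocab_vector_inverse = 1.0 - doc_vocab_vector
likelihoods_inverse = 1.0 - likelihoods

# Present words pick likelihoods[i]; absent words pick 1 - likelihoods[i].
doc_likelihoods = (doc_vocab_vector * likelihoods
                   + doc_vocab_vector_inverse * likelihoods_inverse)
print(doc_likelihoods)  # values: 0.8, 0.9, 0.0, 0.5

# Zero entries are skipped so that log(0) = -inf never enters the sum.
logsum = np.sum(np.log(doc_likelihoods[doc_likelihoods != 0]))
print(logsum)           # equals log(0.8 * 0.9 * 0.5)
```

Word 2 is present but has a likelihood of 0 for this class, so it is simply ignored in the sum instead of forcing the class probability to zero.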

In [None]:
from math import log, e

def predict_doc_class(doc_path, vocab_indices, likelihoods_ham, likelihoods_spam, prior_ham, prior_spam):
 

    def calculate_probability(doc_vocab_vector, doc_class, vocab_indices, likelihoods_ham, likelihoods_spam, prior_ham, prior_spam):
        doc_vocab_vector_inverse = np.bitwise_xor(doc_vocab_vector.astype(np.int8), np.ones(doc_vocab_vector.shape, dtype=np.int8))
        doc_vocab_vector_inverse = doc_vocab_vector_inverse.astype(float)

        def calculate_sum_of_log_likelihoods(doc_vocab_vector, doc_vocab_vector_inverse, likelihoods):
            
            likelihoods_inverse = np.ones(likelihoods.shape, dtype=np.float64)-likelihoods
            
            doc_likelihoods = doc_vocab_vector*likelihoods + doc_vocab_vector_inverse*likelihoods_inverse
            
            #exclude zero-valued doc_likelihood elements to avoid overflow!
            logsum_likelihoods = np.sum(np.log(doc_likelihoods[doc_likelihoods != 0]))
            return logsum_likelihoods

        logsum_likelihoods_spam = calculate_sum_of_log_likelihoods(doc_vocab_vector, doc_vocab_vector_inverse, likelihoods_spam)
        logsum_likelihoods_ham = calculate_sum_of_log_likelihoods(doc_vocab_vector, doc_vocab_vector_inverse, likelihoods_ham)
        logsum_likelihoods = 0
        logsum_likelihoods_other = 0
        prior = 0
        prior_other = 0
        if doc_class == 'ham':
            logsum_likelihoods = logsum_likelihoods_ham
            logsum_likelihoods_other = logsum_likelihoods_spam
            prior = prior_ham
            prior_other = prior_spam
        else:
            logsum_likelihoods = logsum_likelihoods_spam
            logsum_likelihoods_other = logsum_likelihoods_ham
            prior = prior_spam
            prior_other = prior_ham

        probability_log = logsum_likelihoods + log(prior)
        probability_log_other = logsum_likelihoods_other + log(prior_other)
        
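        # Algebraically equal to e**a / (e**a + e**b), but computed from the
        # difference b - a so that very negative log-probabilities do not underflow to 0/0.
        conditional_probability = 1.0 / (1.0 + np.exp(probability_log_other - probability_log))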

        return conditional_probability
    
    # Use the vocab_indices argument rather than the global V
    doc_vocab_vector = doc_to_vocab_vector(doc_path, vocab_indices)
    doc_vocab_vector = doc_vocab_vector.astype(float)
    predicted_class = 'ham'
    probability_ham = calculate_probability(doc_vocab_vector, 'ham', vocab_indices, likelihoods_ham, likelihoods_spam, prior_ham, prior_spam)
    probability_spam = calculate_probability(doc_vocab_vector, 'spam', vocab_indices, likelihoods_ham, likelihoods_spam, prior_ham, prior_spam)
    
    print('probability_ham = %12.10f ; probability_spam = %12.10f'%(probability_ham, probability_spam))
    
    if probability_spam > probability_ham:
        predicted_class = 'spam'
    return predicted_class
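
One caveat when leaving log space: with a large vocabulary the summed log-likelihoods can be very negative, and exponentiating them directly may underflow to 0. The sketch below, with made-up log values and a hypothetical helper `posterior_from_logs`, shows an algebraically equivalent form that only exponentiates the difference of the two log scores.

```python
import numpy as np

def posterior_from_logs(log_joint_this, log_joint_other):
    """P(this class | doc) from the two log joint scores (log-likelihood sum + log prior)."""
    # exp(a) / (exp(a) + exp(b)) == 1 / (1 + exp(b - a))
    return 1.0 / (1.0 + np.exp(log_joint_other - log_joint_this))

log_ham, log_spam = -1200.0, -1203.0   # toy values, far below what exp() can represent

naive_numerator = np.exp(log_ham)      # underflows to 0.0
print(naive_numerator)

p_ham = posterior_from_logs(log_ham, log_spam)
p_spam = posterior_from_logs(log_spam, log_ham)
print(p_ham, p_spam)                   # the two posteriors sum to 1
```

Since only the comparison matters for the prediction, comparing the two log scores directly would give the same decision without any exponentiation.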



### 3. Implement the code for classifying an unknown message and try it on the testset.

At this point, there's really nothing left to do but to call the function we declared previously, __predict_doc_class()__, for all the files in __test_data_paths__ with their corresponding labels __test_labels__. <br>
<br>
And this we did, enclosing it in a function __predict_doc_classes()__ which can be useful later on.
>Optional: This may take a while. Uncomment the playsound() line and its necessary imports to play a sound when the processing is done!

In [None]:
def predict_doc_classes(test_data_paths, test_labels, likelihoods_ham, likelihoods_spam, prior_ham, prior_spam):
    predicted_labels = []
    for data_path_index, data_path in enumerate(test_data_paths):
        print('\nProcessing data_path_index: %d'%(data_path_index))
        predicted_labels.append(predict_doc_class(data_path, V, likelihoods_ham, likelihoods_spam, prior_ham, prior_spam))
        print('Predicted: %s ; Correct: %s'%(predicted_labels[data_path_index], test_labels[data_path_index]))

    #playsound('Victory.mp3')
    return predicted_labels

In [None]:
predicted_labels = []
predicted_labels = predict_doc_classes(test_data_paths, test_labels,
                                       likelihoods_ham, likelihoods_spam,
                                       prior_ham, prior_spam)


### 4. Write a function that computes the precision and recall measures.
* Precision = TP/(TP + FP)<br>
* Recall = TP/(TP + FN)<br>
<br>
* TP: num of SPAM messages classified as SPAM
* TN: num of HAM messages classified as HAM
* FP: num of HAM messages misclassified as SPAM
* FN: num of SPAM messages misclassified as HAM

In [None]:
def calculate_scores(true_labels, predicted_labels):
    TP = 0
    FP = 0
    TN = 0
    FN = 0
    for index in range(len(true_labels)):
        TP += int(true_labels[index] == 'spam')*int(predicted_labels[index] == 'spam')
        TN += int(true_labels[index] == 'ham')*int(predicted_labels[index] == 'ham')
        FP += int(true_labels[index] == 'ham')*int(predicted_labels[index] == 'spam')
        FN += int(true_labels[index] == 'spam')*int(predicted_labels[index] == 'ham')
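    # Guard against division by zero when no message is predicted (or labeled) as spam
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0.0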
    return precision, recall    
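
As a quick check of the definitions, here is a standalone sketch on six made-up labels. The helper `precision_recall` is hypothetical and restates the counting logic so that the example runs on its own.

```python
def precision_recall(true_labels, predicted_labels, positive="spam"):
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: 4 spam and 2 ham messages.
true_labels      = ["spam", "spam", "spam", "spam", "ham",  "ham"]
predicted_labels = ["spam", "ham",  "ham",  "ham",  "spam", "ham"]

# TP = 1, FP = 1, FN = 3
toy_precision, toy_recall = precision_recall(true_labels, predicted_labels)
print(toy_precision, toy_recall)  # 0.5 0.25
```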

In [None]:
precision, recall = calculate_scores(test_labels, predicted_labels)
print('Test Results without smoothing:')
print('\tPrecision=', precision)
print('\tRecall=', recall)

## 2.2 Lambda Smoothing

### 1. Write another function that uses Lambda smoothing
To apply Lambda smoothing, we need to modify the formula for getting the likelihood probabilities:
* An extra __lmbda__ term in the numerator
* and an extra __lmbda__ * |V| in the denominator
<br>
<br>
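
Written out, the smoothed likelihood described by the two bullets above is the following, where the symbols are notation assumed here for illustration: $n_c(w_i)$ is the number of training documents of class $c$ that contain word $w_i$ (the __word_counts__ vector), $N_c$ is the number of training documents of class $c$, and $|V|$ is the vocabulary size.

$$P(w_i \mid c) = \frac{n_c(w_i) + \lambda}{N_c + \lambda\,|V|}$$

Setting $\lambda = 0$ recovers the unsmoothed estimate $n_c(w_i) / N_c$ used in section 2.1.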

To do this, we take the __predict_doc_classes()__ function above and internally modify the __get_likelihoods_for()__ function:

>Optional: This may take a while. Uncomment the playsound() line and its necessary imports to play a sound when the processing is done!

In [None]:
def predict_doc_classes_withsmoothing(test_data_paths, test_labels, vocab_indices,
                                      prior_ham, prior_spam,
                                      word_counts_ham, word_counts_spam,
                                      num_ham_train_docs, num_spam_train_docs,
                                      lmbda):
    predicted_labels = []
    
    def get_likelihoods_for(word_counts, total_num_class_docs, vocab_indices, lmbda=1.0):
        return (word_counts+lmbda)/(total_num_class_docs+lmbda*len(vocab_indices))
    
    likelihoods_ham = get_likelihoods_for(word_counts_ham, num_ham_train_docs, vocab_indices, lmbda)
    likelihoods_spam = get_likelihoods_for(word_counts_spam, num_spam_train_docs, vocab_indices, lmbda)
    
    for data_path_index, data_path in enumerate(test_data_paths):
        print('\nProcessing data_path_index: %d'%(data_path_index))
        predicted_labels.append(predict_doc_class(data_path, vocab_indices, likelihoods_ham, likelihoods_spam, prior_ham, prior_spam))
        print('Path: %s ; Predicted: %s ; Correct: %s'%(data_path, predicted_labels[data_path_index], test_labels[data_path_index]))

    #playsound('Victory.mp3')
    return predicted_labels, likelihoods_ham, likelihoods_spam
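
To see the effect of the smoothing term in isolation, the sketch below applies the same formula to made-up counts for a three-word vocabulary. The helper name `smoothed_likelihoods` and all numbers are illustrative only.

```python
import numpy as np

def smoothed_likelihoods(word_counts, num_class_docs, vocab_size, lmbda):
    # (count + lambda) / (N_class + lambda * |V|)
    return (word_counts + lmbda) / (num_class_docs + lmbda * vocab_size)

word_counts = np.array([0.0, 3.0, 10.0])  # toy document counts for 3 words
num_class_docs = 10

unsmoothed = smoothed_likelihoods(word_counts, num_class_docs, len(word_counts), 0.0)
smoothed = smoothed_likelihoods(word_counts, num_class_docs, len(word_counts), 1.0)
print(unsmoothed)  # 0/10, 3/10, 10/10
print(smoothed)    # 1/13, 4/13, 11/13
```

With lambda > 0, no likelihood is exactly 0 or exactly 1, so neither a present word nor an absent word can produce a zero factor in the product.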

### 2. Use Lambda smoothing for 5 different values of lambda (2.0, 1.0, 0.5, 0.1, 0.005) and print the precision and recall for these values. What value of lambda yields the best precision and recall?

First we define a new function that wraps the previously defined __calculate_scores()__ and adds informative prints.

In [None]:
def calculate_scores_withsmoothing(test_labels, predicted_labels, lmbda):
    precision, recall = calculate_scores(test_labels, predicted_labels)
    print('Test Results for Smoothing with lambda=', lmbda)
    print('Precision=', precision)
    print('Recall=', recall)
    
    return precision, recall

Then we call the functions previously defined for all the lambda values listed and gather their respective precision and recall scores.

In [None]:
lambdas = [2.0, 1.0, 0.5, 0.1, 0.005]
precision_scores = [0.0, 0.0, 0.0, 0.0, 0.0]
recall_scores = [0.0, 0.0, 0.0, 0.0, 0.0]
for index, lmbda in enumerate(lambdas):
    predicted_labelswithsmoothing = []
    predicted_labelswithsmoothing, likelihoods_h, likelihoods_s = predict_doc_classes_withsmoothing(test_data_paths, test_labels,
                                                                      V, prior_ham, prior_spam,
                                                                      word_counts_ham, word_counts_spam,
                                                                      num_ham_train_docs, num_spam_train_docs,
                                                                      lmbda)
    
    precision, recall = calculate_scores_withsmoothing(test_labels, predicted_labelswithsmoothing, lmbda)
    precision_scores[index] = precision
    recall_scores[index] = recall

In [None]:
for x, lmbda in enumerate(lambdas):
    print('For lambda=%5.3f: precision=%10.8f ; recall=%10.8f'%(lmbda, precision_scores[x], recall_scores[x]))

## 2.3 Improving your Classifier

### 1. Find a way to identify the most informative words for spam filtering. Using the best value of lambda found above, print the top 200 informative words for spam and ham messages.

Since we are going to be creating a new vocabulary, let us set the vocabulary dictionary __V__ to an empty dict to free up some memory.

In [None]:
V={}
word_counts_ham=[]
word_counts_spam=[]
likelihoods_ham=[]
likelihoods_spam=[]

According to the reference paper by Hovold, the frequency of occurrence of the words in the training data should be considered when building the vocabulary for the filter. More specifically:
* Exclude the words with fewer than 3 occurrences in the whole training data set
* Exclude the 100 to 200 most frequently occurring words in the whole training data set<br>


This is implemented in the function __get_vocab_with_freq()__. The training data is parsed again to list the words, similar to what was done in __get_vocab_with_indices()__, but this time the number of times each word is encountered is recorded in the dict __vocab_freq__. The entries are then sorted by descending word frequency (i.e. from highest to lowest).
As suggested by the referenced paper, (1) words whose frequencies are less than __MIN_WORDFREQ__ are not included in the vocabulary and (2) the top __NUM_MOST_FREQUENT_TO_REMOVE__ most frequently occurring words are not included in the vocabulary.<br>
<br>
Also, notice that we imposed a limit on the length of words to include in the vocabulary. Without this limit, the size of this Jupyter notebook soars to ~100MB, which causes the notebook to slow down and eventually crash.<br>
<br>
>Optional: This may take a while. Uncomment the playsound() line and its necessary imports to play a sound when the processing is done!

In [None]:
import os
import re
from operator import itemgetter
#from playsound import playsound

def get_vocab_with_freq(train_data_paths):
    
    MIN_WORDFREQ = 3
    NUM_MOST_FREQUENT_TO_REMOVE = 150
    MAX_VOCAB_SIZE = 200
    MAX_WORD_LEN = 10

    vocab_freq = {}
    current_vocab_index = 0

    for data_path_index, data_path in enumerate(train_data_paths):
        with open(data_path, 'r', errors='ignore') as file:
            print('Parsing file %d of %d for vocabulary:'%(data_path_index, len(train_data_paths)))
            for line in file:
                words = get_words_from_line(line)
                for word in words:
                    if word not in stop_words and (len(word) <= MAX_WORD_LEN):
                        if (word not in vocab_freq):
                            vocab_freq[word] = 1
                        else:
                            vocab_freq[word] += 1
    
    # Remove words with frequency < MIN_WORDFREQ
    vocab_freq = {word: freq for word, freq in vocab_freq.items() if freq >= MIN_WORDFREQ}
    
    # Then sort the dictionary by frequency, descending order
    vocab_freq = sorted(vocab_freq.items(), key = itemgetter(1))
    vocab_freq.reverse()
    
    # Finally, drop the NUM_MOST_FREQUENT_TO_REMOVE most frequent words
    # and keep at most MAX_VOCAB_SIZE of the remaining ones
    vocab_freq = vocab_freq[NUM_MOST_FREQUENT_TO_REMOVE:]
    vocab_freq = dict(vocab_freq[:MAX_VOCAB_SIZE])
    
    #playsound('Victory.mp3')

    return vocab_freq
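
The pruning steps can also be expressed compactly with `collections.Counter`. The following is a minimal sketch on made-up frequencies, not a replacement for the function above; the helper `prune_vocab` and its parameter names are hypothetical.

```python
from collections import Counter

def prune_vocab(word_freq, min_freq, num_top_to_remove, max_vocab_size):
    # 1. Keep only words seen at least min_freq times.
    kept = Counter({w: f for w, f in word_freq.items() if f >= min_freq})
    # 2. most_common() returns (word, freq) pairs sorted by descending frequency.
    ranked = kept.most_common()
    # 3. Skip the most frequent words, then cap the vocabulary size.
    selected = ranked[num_top_to_remove:num_top_to_remove + max_vocab_size]
    return dict(selected)

# Toy frequencies.
word_freq = Counter({"a": 6, "b": 5, "c": 4, "d": 3, "e": 2, "f": 1})
pruned = prune_vocab(word_freq, min_freq=3, num_top_to_remove=1, max_vocab_size=2)
print(pruned)  # {'b': 5, 'c': 4}
```

Here "e" and "f" are dropped as too rare, "a" is dropped as the single most frequent word, and the size cap keeps only the next two words.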

In [None]:
vocab_freq = get_vocab_with_freq(train_data_paths)


In [None]:
vocab_freq

At this point we already have our vocabulary in __vocab_freq__ along with each word's frequency. We don't need the frequencies anymore, so let us define a function __get_vocab_indices()__ to get their respective indices instead, similar to the output of __get_vocab_with_indices()__ previously declared.

In [None]:
def get_vocab_indices(vocab_freq):
    vocab_indices = {}
    current_vocab_index = 0
    for index, word in enumerate(vocab_freq):
        vocab_indices[word] = current_vocab_index
        current_vocab_index += 1
    return vocab_indices

In [None]:
V = get_vocab_indices(vocab_freq)

### 2. Evaluate the precision and recall of your classifier using this smaller vocabulary of 200 words.
<br>
Now that we have our brand new, smaller vocabulary, let us repeat the steps previously outlined to train, test, and evaluate our Naive Bayes Spam Classifier:

In [None]:
word_counts_ham, num_ham_train_docs = get_word_counts('ham', V, train_data_paths, train_labels)
word_counts_spam, num_spam_train_docs = get_word_counts('spam', V, train_data_paths, train_labels)

In [None]:
lmbda = 0.5

predicted_labelswithsmoothing = []
predicted_labelswithsmoothing, likelihood_ham, likelihood_spam = predict_doc_classes_withsmoothing(test_data_paths, test_labels,
                                            V, prior_ham, prior_spam,
                                            word_counts_ham, word_counts_spam,
                                            num_ham_train_docs, num_spam_train_docs,
                                            lmbda)

In [None]:
precision, recall = calculate_scores_withsmoothing(test_labels, predicted_labelswithsmoothing, lmbda)

# End