**Leo Qian**

Fall 2021

CS 251: Data Analysis and Visualization

Project 6: Supervised learning

In [1]:
import os
import random
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

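# Note: matplotlib >= 3.6 renames these styles to 'seaborn-v0_8-colorblind' and 'seaborn-v0_8-darkgrid'
plt.style.use(['seaborn-colorblind', 'seaborn-darkgrid'])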
plt.rcParams.update({'font.size': 20})

np.set_printoptions(suppress=True, precision=5)

# Automatically reload external modules
%load_ext autoreload
%autoreload 2

## Task 3: Preprocess full spam email dataset 

Before you build a Naive Bayes spam email classifier, run the full spam email dataset through your preprocessing code.

Download and extract the full **Enron** emails (*zip file should be ~29MB large*). You should see a base `enron` folder, with `spam` and `ham` subfolders when you extract the zip file (these are the 2 classes).

Run the test code below to check everything over.

### 3a) Preprocess dataset

In [3]:
import email_preprocessor as epp

#### Test `count_words` and `find_top_words`

In [38]:
word_freq, num_emails = epp.count_words()

In [39]:
print(f'You found {num_emails} emails in the dataset. You should have found 32625.')

You found 32625 emails in the dataset. You should have found 32625.


In [40]:
top_words, top_counts = epp.find_top_words(word_freq)
print(f"Your top 5 words are\n{top_words[:5]}\nand they should be\n['the', 'to', 'and', 'of', 'a']")
print(f"The associated counts are\n{top_counts[:5]}\nand they should be\n[277459, 203659, 148873, 139578, 111796]")

Your top 5 words are
['the', 'to', 'and', 'of', 'a']
and they should be
['the', 'to', 'and', 'of', 'a']
The associated counts are
[277459, 203659, 148873, 139578, 111796]
and they should be
[277459, 203659, 148873, 139578, 111796]


### 3b) Make train and test splits of the dataset

Here we divide the email features into an 80/20 train/test split (80% of the data is used to train the supervised learning model; the remaining 20% is withheld and used for testing / prediction).
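The split described above can be sketched as follows. This is only a minimal illustration on toy data: `shuffle_split` is a hypothetical helper, not the `epp.make_train_test_sets` implementation, and the `test_prop` and `seed` parameters are assumptions made for the example.

```python
import numpy as np

def shuffle_split(features, y, test_prop=0.2, seed=0):
    """Hypothetical helper: shuffle the samples, then split into train/test sets."""
    rng = np.random.default_rng(seed)
    inds = rng.permutation(features.shape[0])    # shuffled sample indices
    n_test = int(round(test_prop * inds.size))   # number of held-out samples
    inds_test, inds_train = inds[:n_test], inds[n_test:]
    return (features[inds_train], y[inds_train], inds_train,
            features[inds_test], y[inds_test], inds_test)

# Toy data: 10 samples with 3 features each
toy_x = np.arange(30).reshape(10, 3)
toy_y = np.arange(10) % 2
x_tr, y_tr, i_tr, x_te, y_te, i_te = shuffle_split(toy_x, toy_y)
print(x_tr.shape, x_te.shape)  # (8, 3) (2, 3)
```

Returning the shuffled indices alongside the data makes it possible to trace any train or test sample back to its original email later on.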

In [45]:
features, y = epp.make_feature_vectors(top_words, num_emails)

In [85]:
np.random.seed(0)
x_train, y_train, inds_train, x_test, y_test, inds_test = epp.make_train_test_sets(features, y)

In [86]:
print('Shapes for train/test splits:')
print(f'Train {x_train.shape}, classes {y_train.shape}')
print(f'Test {x_test.shape}, classes {y_test.shape}')
print('\nThey should be:\nTrain (26100, 200), classes (26100,)\nTest (6525, 200), classes (6525,)')

Shapes for train/test splits:
Train (26100, 200), classes (26100,)
Test (6525, 200), classes (6525,)

They should be:
Train (26100, 200), classes (26100,)
Test (6525, 200), classes (6525,)


### 3c) Save data in binary format

Running the whole pipeline from raw emails to the train/test feature split every time you want to work on your project adds a lot of overhead! In this step, you will export the data in memory to disk in a binary format. That way, you can quickly load all the data back into memory (directly in ndarray format) whenever you want to work with it again. No need to parse from text files!

- Use numpy's `save` function to make six files in `.npy` format (e.g. `email_train_x.npy`, `email_train_y.npy`, `email_train_inds.npy`, `email_test_x.npy`, `email_test_y.npy`, `email_test_inds.npy`).

In [79]:
np.save('data/email_train_x.npy', x_train)
np.save('data/email_train_y.npy', y_train)
np.save('data/email_train_inds.npy', inds_train)
np.save('data/email_test_x.npy', x_test)
np.save('data/email_test_y.npy', y_test)
np.save('data/email_test_inds.npy', inds_test)

## Task 4: Naive Bayes Classifier

After finishing your email preprocessing pipeline, implement the other supervised learning algorithm we will use to classify email, **Naive Bayes**.

### 4a) Implement Naive Bayes

In `naive_bayes_multinomial.py`, implement the following methods:
- Constructor
- `train(data, y)`: Train the Naive Bayes classifier so that it records the "statistics" of the training set: the log of the class priors (i.e. how likely is an email in the training set to be spam or ham?) and the log of the class likelihoods (the probability of a word appearing in each class, spam or ham).
- `predict(data)`: Combine the log class likelihoods and log priors to compute the log posterior distribution. The predicted class for a test sample is the class that yields the highest posterior probability.
- `accuracy(y, y_pred)`: The usual definition :)


#### Bayes rule ingredients: Priors and likelihood (`train`)

To compute class predictions (the probability that a test example belongs to either the spam or the ham class), we need to evaluate **Bayes Rule**. This means computing the priors and likelihoods based on the training data. We store the log of the priors and the log of the likelihoods.

**Prior:** $$P_c = \frac{N_c}{N}$$ where $P_c$ is the prior for class $c$ (spam or ham), $N_c$ is the number of training samples that belong to class $c$ and $N$ is the total number of training samples.

**Likelihood:** $$L_{c,w} = \frac{N_{c,w} + 1}{N_{c} + M}$$ where
- $L_{c,w}$ is the likelihood that word $w$ belongs to class $c$ (*i.e. what we are solving for*)
- $N_{c,w}$ is the total count of **word $w$** in emails that are only in class $c$ (*either spam or ham*)
- $N_{c}$ is the total number of **all words** that appear in emails of the class $c$ (*total number of words in all spam emails or total number of words in all ham emails*)
- $M$ is the number of features (*number of top words*).

Store $Log(P_c)$ and $Log(L_{c,w})$ in the `log_class_priors` and `log_class_likelihoods` fields.
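The prior and likelihood formulas above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the required `train` method: `fit_log_stats` is a hypothetical standalone function, and the toy count matrix is made up for the example.

```python
import numpy as np

def fit_log_stats(data, y, num_classes):
    """Hypothetical helper: return (log priors, log likelihoods) with +1 smoothing."""
    num_samps, num_feats = data.shape
    log_priors = np.zeros(num_classes)
    log_likelihoods = np.zeros((num_classes, num_feats))
    for c in range(num_classes):
        class_data = data[y == c]                                # samples in class c
        log_priors[c] = np.log(class_data.shape[0] / num_samps)  # Log(N_c / N)
        word_counts = class_data.sum(axis=0)                     # N_{c,w} for every word w
        total_words = word_counts.sum()                          # N_c: all words in class c
        log_likelihoods[c] = np.log((word_counts + 1) / (total_words + num_feats))
    return log_priors, log_likelihoods

# Toy data: 3 emails, 2 words; the first two emails are class 0, the last is class 1
toy_counts = np.array([[2, 0], [1, 1], [0, 3]])
toy_labels = np.array([0, 0, 1])
toy_log_priors, toy_log_like = fit_log_stats(toy_counts, toy_labels, num_classes=2)
print(np.exp(toy_log_priors))  # priors 2/3 and 1/3
print(np.exp(toy_log_like))    # rows: [4/6, 2/6] and [1/5, 4/5]
```

The `+ 1` in the numerator keeps a word that never appears in a class from getting a likelihood of zero, whose logarithm would be negative infinity.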

#### Bayes rule ingredients: Posterior (`predict`)

To make predictions, we now combine the prior and likelihood to get the posterior:

**Log Posterior:** $$Log(\text{Post}_{i, c}) = Log(P_c) + \sum_{j \in J_i}x_{i,j}Log(L_{c,j})$$

 where
- $\text{Post}_{i,c}$ is the posterior for class $c$ for test sample $i$ (*i.e. evidence that email $i$ is spam or ham*). We solve for its logarithm.
- $Log(P_c)$ is the logarithm of the prior for class $c$.
- $x_{i,j}$ is the number of times the jth word appears in the ith email.
- $Log(L_{c,j})$ is the log-likelihood of the jth word in class $c$.
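The log posterior above can be evaluated for every test sample and class with a single matrix product, as in the following sketch. `predict_classes` is a hypothetical standalone function (not the required `predict` method), and the toy priors and likelihoods are invented for illustration.

```python
import numpy as np

def predict_classes(data, log_priors, log_likelihoods):
    """Hypothetical helper: pick the class with the largest log posterior."""
    # (num_samps, num_feats) @ (num_feats, num_classes):
    # sums x_{i,j} * Log(L_{c,j}) over words j; words with a zero count add nothing
    log_post = log_priors + data @ log_likelihoods.T
    return np.argmax(log_post, axis=1)

toy_log_priors = np.log(np.array([0.5, 0.5]))
toy_log_like = np.log(np.array([[0.8, 0.2],
                                [0.2, 0.8]]))
toy_emails = np.array([[3, 0],
                       [0, 2],
                       [1, 4]])
toy_pred = predict_classes(toy_emails, toy_log_priors, toy_log_like)
print(toy_pred)  # [0 1 1]
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when an email contains many words.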

In [49]:
from naive_bayes_multinomial import NaiveBayes

#### Test `train`

In [52]:
num_test_classes = 4
np.random.seed(0)
data_test = np.random.random(size=(100, 6))
y_test = np.random.randint(low=0, high=num_test_classes, size=(100,))

nbc = NaiveBayes(num_classes=num_test_classes)
nbc.train(data_test, y_test)

print(f'Your log class priors are: {nbc.log_class_priors}\nand should be          [-1.42712 -1.34707 -1.38629 -1.38629].')
print(f'Your log class likelihoods shape is {nbc.log_class_likelihoods.shape} and should be (4, 6).')
print(f'Your log likelihoods are:\n{nbc.log_class_likelihoods}')


test_likelihoods = np.array(
[[-1.88939, -1.68757, -1.73893, -1.92213, -1.78302, -1.75022],
 [-1.79005, -1.7466,  -1.84883, -1.7786,  -1.85297, -1.73946],
 [-1.95787, -1.85664, -1.62705, -1.76927, -1.71753, -1.85681],
 [-1.67787, -1.70203, -1.83983, -2.09846, -1.78647, -1.70444]])
print(f'and should be\n{test_likelihoods}')

Your log class priors are: [-1.42712 -1.34707 -1.38629 -1.38629]
and should be          [-1.42712 -1.34707 -1.38629 -1.38629].
Your log class likelihoods shape is (4, 6) and should be (4, 6).
Your log likelihoods are:
[[-1.88939 -1.68757 -1.73893 -1.92213 -1.78302 -1.75022]
 [-1.79005 -1.7466  -1.84883 -1.7786  -1.85297 -1.73946]
 [-1.95787 -1.85664 -1.62705 -1.76927 -1.71753 -1.85681]
 [-1.67787 -1.70203 -1.83983 -2.09846 -1.78647 -1.70444]]
and should be
[[-1.88939 -1.68757 -1.73893 -1.92213 -1.78302 -1.75022]
 [-1.79005 -1.7466  -1.84883 -1.7786  -1.85297 -1.73946]
 [-1.95787 -1.85664 -1.62705 -1.76927 -1.71753 -1.85681]
 [-1.67787 -1.70203 -1.83983 -2.09846 -1.78647 -1.70444]]


#### Test `predict`

In [54]:
num_test_classes = 4
np.random.seed(0)
data_train = np.random.randint(low=0, high=num_test_classes, size=(100, 10))
data_test = np.random.randint(low=0, high=num_test_classes, size=(15, 10))
y_test = np.random.randint(low=0, high=num_test_classes, size=(100,))

nbc = NaiveBayes(num_classes=num_test_classes)
nbc.train(data_train, y_test)
test_y_pred = nbc.predict(data_test)

print(f'Your predicted classes are\n{test_y_pred}\nand should be\n[3 0 3 1 0 1 1 3 0 3 0 2 0 2 1]')

Your predicted classes are
[3 0 3 1 0 1 1 3 0 3 0 2 0 2 1]
and should be
[3 0 3 1 0 1 1 3 0 3 0 2 0 2 1]


### 4b) Spam filtering

Let's start classifying spam email using the Naive Bayes classifier.

- Use `np.load` to load in the train/test split that you created last week.
- Use your Naive Bayes classifier on the Enron email dataset!

**Question 7:** Print out the accuracy that you get on the test set with Naive Bayes. It should be roughly 89%.

In [55]:
import email_preprocessor as ep

In [80]:
x_train = np.load('data/email_train_x.npy')
y_train = np.load('data/email_train_y.npy')
inds_train = np.load('data/email_train_inds.npy')
x_test = np.load('data/email_test_x.npy')
y_test = np.load('data/email_test_y.npy')
inds_test = np.load('data/email_test_inds.npy')

In [81]:
# Write your Naive Bayes code here
naive_classifier = NaiveBayes(2)

naive_classifier.train(x_train, y_train.astype(int))

y_pred = naive_classifier.predict(x_test)

accur = naive_classifier.accuracy(y_test, y_pred)

print(f'accuracy: {accur:.2f}')

accuracy: 0.90


### 4c) Confusion matrix

To get a better sense of the errors that the Naive Bayes classifier makes, you will create a confusion matrix.

- Implement `confusion_matrix` in `naive_bayes_multinomial.py`.
- Print out a confusion matrix of the spam classification results.

**Debugging guidelines**:
1. The sum of all numbers in your 2x2 confusion matrix should equal the number of test samples (6525).
2. The sum of your spam row should equal the number of spam samples in the test set (3193).
3. The sum of your ham row should equal the number of ham samples in the test set (3332).
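A confusion matrix that satisfies the debugging guidelines above could be built along the following lines. This is a sketch only: `build_confusion_matrix` is a hypothetical standalone function, and it assumes rows index the true class and columns index the predicted class.

```python
import numpy as np

def build_confusion_matrix(y, y_pred, num_classes):
    """Hypothetical helper: rows are true classes, columns are predicted classes."""
    conf = np.zeros((num_classes, num_classes))
    for true_c, pred_c in zip(y.astype(int), y_pred.astype(int)):
        conf[true_c, pred_c] += 1
    return conf

toy_true = np.array([0, 0, 1, 1, 1])
toy_pred = np.array([0, 1, 1, 1, 0])
toy_conf = build_confusion_matrix(toy_true, toy_pred, num_classes=2)
print(toy_conf)              # [[1. 1.] [1. 2.]]
print(toy_conf.sum())        # 5.0, the number of samples
print(toy_conf.sum(axis=1))  # [2. 3.], the number of samples in each true class
```

With this row convention, the total sum and the row sums are exactly the quantities the three debugging checks compare against.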

In [82]:
confusion = naive_classifier.confusion_matrix(y_test,y_pred)

# Count how many test emails belong to each class (0 or 1)
count = {0: 0, 1: 0}
for label in y_test:
    count[int(label)] += 1

print(count)


print(confusion)
print(confusion.sum())
print(confusion.sum(axis = 1,keepdims=True))

{0: 3244, 1: 3281}
[[3119.  125.]
 [ 497. 2784.]]
6525.0
[[3244.]
 [3281.]]


**Question 8:** Interpret the confusion matrix, using the convention that positive detection means spam (*e.g. a false positive means classifying a ham email as spam*). What types of errors are made more frequently by the classifier? What does this mean (*i.e. X (spam/ham) is more likely to be classified as Y (spam/ham) than the other way around*)?

**Reminder:** Look back and make sure you are clear on which class indices correspond to spam/ham.

**Answer 8:** More ham emails are classified as spam than spam emails are classified as ham (the two off-diagonal entries of the confusion matrix are 497 and 125), so false positives are the more frequent error. This probably means that some emails in the ham folder contain many words that appear more often in the spam emails of the training set, so these ham emails are classified as spam.

## Task 5: Comparison with KNN


- Run a similar analysis to what you did with Naive Bayes above. When computing accuracy on the test set, you may want to reduce the size of the test set (e.g. to the first 500 emails in the test set).
- Copy-paste your `confusion_matrix` method into `knn.py` so that you can run the same analysis on a KNN classifier.

In [88]:
from knn import KNN

In [90]:
knn_classifier = KNN(2)
knn_classifier.train(x_train, y_train.astype(int))
y_pred = knn_classifier.predict(x_test[:500],2)

accur = knn_classifier.accuracy(y_test[:500], y_pred)

print(f'accuracy: {accur:.2f}')


accuracy: 0.90


In [92]:
print(knn_classifier.confusion_matrix(y_test[:500],y_pred))

[[221.   5.]
 [ 43. 231.]]


**Question 9:** What accuracy did you get on the test set (potentially reduced in size)?

**Question 10:** How does the confusion matrix compare to that obtained by Naive Bayes (*If you reduced the test set size, keep that in mind*)?

**Question 11:** Briefly describe at least one pro/con of KNN compared to Naive Bayes on this dataset.

**Question 12:** When potentially reducing the size of the test set here, why is it important that we shuffled our train and test set?

**Answer 9:** I got a very good accuracy of about 90% on the first 500 emails of the test set.

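**Answer 10:** The confusion matrix looks almost as good as the one for the Naive Bayes classifier, and proportionally the results are very similar: in both, most emails fall on the diagonal, and the lower-left error count (43 of 500 for KNN, 497 of 6525 for Naive Bayes) is several times larger than the upper-right one (5 of 500 versus 125 of 6525).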

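**Answer 11:** KNN may give better results, but at a great cost in speed: every prediction requires computing the distance to all training samples, whereas Naive Bayes only needs its precomputed log priors and log likelihoods.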

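**Answer 12:** We want the training and test sets to be kept separate, and both to be representative of the whole dataset, so that the measured performance is accurate. Because the data was shuffled, the first 500 test emails are a random mix of spam and ham rather than, for example, emails from only one class.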
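Because the two confusion matrices are based on different numbers of test emails (6525 for Naive Bayes, 500 for KNN), one way to compare them proportionally is to normalize each row by its sum. The sketch below copies the two matrices from the outputs above; `row_normalize` and `accuracy_from_confusion` are hypothetical helper names introduced only for this illustration.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows: true class, columns: predicted class)
nb_conf = np.array([[3119., 125.],
                    [497., 2784.]])
knn_conf = np.array([[221., 5.],
                     [43., 231.]])

def row_normalize(conf):
    """Divide each row by its sum so entries become per-class proportions."""
    return conf / conf.sum(axis=1, keepdims=True)

def accuracy_from_confusion(conf):
    """Fraction of samples on the diagonal (correct predictions)."""
    return np.trace(conf) / conf.sum()

for name, conf in [('Naive Bayes', nb_conf), ('KNN', knn_conf)]:
    print(name, f'accuracy: {accuracy_from_confusion(conf):.3f}')
    print(np.round(row_normalize(conf), 3))
```

After normalization each row sums to 1, so the off-diagonal entries can be read directly as per-class error rates regardless of the test set size.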

## Extensions

### 0. Classify your own datasets

- Find datasets that you find interesting and run classification on them using your KNN algorithm (and if applicable, Naive Bayes). Analyze the performance of your classifier.

In [98]:
heart_data = np.loadtxt("data/heart_data.csv",skiprows=1,delimiter=",")

In [103]:
x_train,y_train,inds_train,x_test,y_test,inds_test = epp.make_train_test_sets(heart_data[:,:5],heart_data[:,-1])

In [119]:
knn_classifier = KNN(2)
knn_classifier.train(x_train,y_train)
y_pred = knn_classifier.predict(x_test,50)
accur = knn_classifier.accuracy(y_test,y_pred)

print(f'knn classifier accuracy: {accur:.2f}')

knn classifier accuracy: 0.68


In [118]:
naive_classifier = NaiveBayes(2)
naive_classifier.train(x_train,y_train.astype(int))
y_pred = naive_classifier.predict(x_test)
accur = naive_classifier.accuracy(y_test.astype(int),y_pred.astype(int))

print(f'naive bayes classifier accuracy: {accur:.2f}')

naive bayes classifier accuracy: 0.59


The result above is understandable. The low accuracy is probably because, even though the data describe people with and without heart disease, the chosen features are not causal factors for the label (whether the person has heart disease).

As for why the KNN classifier works better than the Naive Bayes classifier, I believe it is because people with (or without) heart disease may have similar heart-related measurements, so we can predict a person's heart disease status by referring to their neighbors. Still, the accuracy is fairly low, which suggests that the features we chose are not sufficient to predict the label.

### 1. Better text preprocessing

- If you look at the top words extracted from the email dataset, many of them are common "stop words" (e.g. a, the, to, etc.) that do not carry much meaning when it comes to differentiating between spam vs. non-spam email. Improve your preprocessing pipeline by building your top words without stop words. Analyze performance differences.
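One possible way to drop stop words before selecting the top words is sketched below. It assumes the word frequencies are stored as a dictionary mapping each word to its count; `remove_stop_words` and the tiny stop word set are hypothetical, chosen only for illustration.

```python
def remove_stop_words(word_freq, stop_words):
    """Hypothetical helper: drop stop words from a word -> count dictionary."""
    return {word: count for word, count in word_freq.items()
            if word not in stop_words}

# Tiny hand-picked stop word set (an assumption; a real list would be much longer)
toy_stop_words = {'a', 'the', 'to', 'and', 'of'}
toy_word_freq = {'the': 10, 'money': 4, 'to': 7, 'meeting': 3}
filtered_freq = remove_stop_words(toy_word_freq, toy_stop_words)
print(filtered_freq)  # {'money': 4, 'meeting': 3}
```

The filtered dictionary could then be passed to the top-word selection step in place of the original one, so the selected features carry more class-specific information.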

### 2. Feature size

- Explore how the number of selected features for the email dataset influences accuracy and runtime performance.

### 3. Distance metrics
- Compare KNN performance with the $L^2$ and $L^1$ distance metrics
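The two metrics differ only in how per-feature differences are combined, as the following sketch shows. `distances_to_all` is a hypothetical helper and is not part of the `KNN` class used above.

```python
import numpy as np

def distances_to_all(train, pt, metric='l2'):
    """Hypothetical helper: distance from one point to every training sample."""
    diff = train - pt                              # broadcast pt over the rows
    if metric == 'l2':
        return np.sqrt(np.sum(diff ** 2, axis=1))  # Euclidean distance
    if metric == 'l1':
        return np.sum(np.abs(diff), axis=1)        # Manhattan distance
    raise ValueError(f'unknown metric: {metric}')

toy_train = np.array([[0., 0.], [3., 4.], [1., 1.]])
toy_pt = np.array([0., 0.])
print(distances_to_all(toy_train, toy_pt, 'l2'))  # 0, 5 and sqrt(2)
print(distances_to_all(toy_train, toy_pt, 'l1'))  # 0, 7 and 2
```

Swapping the metric can change which training samples count as the nearest neighbors, so accuracy should be measured separately for each choice.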

### 4. K-Fold Cross-Validation

- Research this technique and apply it to data and your KNN and/or Naive Bayes classifiers.
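In K-fold cross-validation, the data is partitioned into K folds; each fold serves once as the validation set while the remaining folds are used for training. The index bookkeeping could look roughly like this, where `k_fold_indices` is a hypothetical helper written for illustration:

```python
import numpy as np

def k_fold_indices(num_samps, k, seed=0):
    """Hypothetical helper: yield (train_inds, val_inds) for each of k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(num_samps), k)  # k nearly equal parts
    for i in range(k):
        val_inds = folds[i]
        train_inds = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_inds, val_inds

toy_folds = list(k_fold_indices(num_samps=10, k=5))
for train_inds, val_inds in toy_folds:
    print(len(train_inds), len(val_inds))  # 8 2 for every fold
```

For each fold, a classifier would be trained on `train_inds` and evaluated on `val_inds`; averaging the K accuracies gives a more stable estimate than a single train/test split.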

### 5. Email error analysis

- Dive deeper into the properties of the emails that were misclassified (FP and/or FN) by Naive Bayes or KNN. What is their word composition? How many words were skipped because they were not in the training set? What could plausibly account for the misclassifications?
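A starting point for this analysis is to recover which original emails were misclassified. The sketch below assumes that the test labels, the predictions, and the original email indices are parallel arrays; `error_indices` is a hypothetical helper, and which class index counts as "positive" (spam) is left as a parameter because it depends on how the labels were assigned.

```python
import numpy as np

def error_indices(y, y_pred, inds, positive_class):
    """Hypothetical helper: original email indices of false positives and false negatives."""
    y, y_pred = y.astype(int), y_pred.astype(int)
    fp_mask = (y_pred == positive_class) & (y != positive_class)  # predicted positive, actually negative
    fn_mask = (y_pred != positive_class) & (y == positive_class)  # predicted negative, actually positive
    return inds[fp_mask], inds[fn_mask]

toy_true = np.array([0, 1, 1, 0, 1])
toy_pred = np.array([0, 0, 1, 1, 1])
toy_inds = np.array([12, 7, 3, 44, 9])
fp_inds, fn_inds = error_indices(toy_true, toy_pred, toy_inds, positive_class=1)
print(fp_inds, fn_inds)  # [44] [7]
```

The returned indices can then be used to look up the raw emails and inspect their word composition.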