# Homework 4: Naive Bayes Classification and Logistic Regression
Due Tuesday Feb 21 5 PM

In this assignment, we use the UCI spam email database (https://archive.ics.uci.edu/ml/datasets/Spambase) and analyse it using Gaussian Naive Bayes and Logistic Regression.

## Part 1: Classification with Naive Bayes

Here, we use the Gaussian Naive Bayes algorithm (https://en.wikipedia.org/wiki/Naive_Bayes_classifier).

First, we split the data set into equal-sized training and test sets (stratified by label), and note that spam constitutes ~39% of the data. We then compute the mean and standard deviation of each of the 57 features, separately for the spam and not-spam training groups. Together with the class priors, these constitute our probabilistic model, which we then feed into the Naive Bayes classifier equation.
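
For reference, the decision rule implemented in the cells below can be written as follows. The symbols are introduced here for illustration and are not taken from the notebook: $c \in \{0, 1\}$ is the class (not-spam, spam), $P(c)$ is the class prior, and $\mu_{j,c}$, $\sigma_{j,c}$ are the training mean and standard deviation of feature $j$ within class $c$.

```latex
\hat{y}(x) = \arg\max_{c \in \{0,1\}}
  \left[ \log P(c) + \sum_{j=1}^{57} \log \mathcal{N}(x_j;\, \mu_{j,c}, \sigma_{j,c}) \right],
\qquad
\log \mathcal{N}(x_j;\, \mu_{j,c}, \sigma_{j,c})
  = \log \frac{1}{\sqrt{2\pi}\,\sigma_{j,c}} - \frac{(x_j - \mu_{j,c})^2}{2\sigma_{j,c}^2}
```

Working with sums of logs rather than products of densities avoids numerical underflow when 57 small probabilities are multiplied together.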

In [100]:
import pandas as pd
import numpy as np  
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix

In [5]:
# Load Data
spamdata = pd.read_csv("spambase.data", header = None)

spamdata.head()
spam = spamdata.values #to numpy matrix
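print(spam.shape)
#print(spam[0,:])
data = spam[:,:-1] #features: first 57 columns
label = spam[:,-1] #class label: last column (1 = spam, 0 = not spam)
print(data.shape)
print(label.shape)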

#Split data into training and test set
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size = 0.50, stratify = label)

(4601, 58)
(4601, 57)
(4601,)


In [28]:
# Probabilistic Model

#Prior Probability for each class
prior_spam = sum(label) / len(label)
prior_not = 1 - prior_spam
print("total set: \t prior_spam = {0:0.2f} \t prior_not = {1:0.2f}".format(prior_spam, prior_not))

#Verifying training / test sets
prior_train_spam = sum(y_train) / len(y_train)
prior_train_not = 1 - prior_train_spam
print("training set: \t prior_spam = {0:0.2f} \t prior_not = {1:0.2f}".format(prior_train_spam, prior_train_not))

prior_test_spam = sum(y_test) / len(y_test)
prior_test_not = 1 - prior_test_spam
print("test set: \t prior_spam = {0:0.2f} \t prior_not = {1:0.2f}".format(prior_test_spam, prior_test_not))

total set: 	 prior_spam = 0.39 	 prior_not = 0.61
training set: 	 prior_spam = 0.39 	 prior_not = 0.61
test set: 	 prior_spam = 0.39 	 prior_not = 0.61


In [80]:
#Mean and standard Deviation of training set
epsilon = 0.0001 #Add to standard deviation if 0

#print(X_train.shape)
#print(np.mean(X_train[:,0]))
X_train_means = X_train.mean(axis=0)
X_train_stds = X_train.std(axis=0)
print(np.count_nonzero(X_train_stds == 0))

#Concat data + labels, and sort by label
X_joined = np.concatenate((X_train, y_train.reshape(-1,1)), axis=1)
print(X_joined.shape)
X_joined_sorted = sorted(X_joined, key=lambda x: x[57])
split_i = 0
for i in range(len(X_joined_sorted)):
    if X_joined_sorted[i][57] == 1:
        print("found first 1-class @ i =")
        split_i = i
        break
print(split_i)
X_joined_sorted = np.array(X_joined_sorted)
X_train_not = X_joined_sorted[0:split_i,:-1] #rows before the first spam row
X_train_spam = X_joined_sorted[split_i:,:-1] #rows from the first spam row onward
print(X_train_not.shape)
print(X_train_spam.shape)

#Not spam mean, std
not_means = X_train_not.mean(axis=0)
not_stds = X_train_not.std(axis=0)
print(np.count_nonzero(not_stds == 0))
#Spam mean, std
spam_means = X_train_spam.mean(axis=0)
spam_stds = X_train_spam.std(axis=0)
print(np.count_nonzero(spam_stds == 0)) #no need for +epsilon since no 0-standard deviations found

0
(2300, 58)
found first 1-class @ i =
1394
(1394, 57)
(906, 57)
0
0
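
The sort-and-slice step above could also be written with boolean masks, which avoids searching for the split index. The following is a minimal sketch on made-up data, not the notebook's own code: `class_stats`, `X_demo`, and `y_demo` are hypothetical names, and replacing zero standard deviations with `epsilon` is an assumption about how the unused `epsilon` was meant to be applied.

```python
import numpy as np

def class_stats(X, y, epsilon=1e-4):
    """Return {class_label: (means, stds)} computed per feature within each class."""
    stats = {}
    for c in np.unique(y):
        X_c = X[y == c]                 # boolean mask selects the rows of class c
        means = X_c.mean(axis=0)
        stds = X_c.std(axis=0)
        # Assumption: guard against division by zero for constant features.
        stds = np.where(stds == 0, epsilon, stds)
        stats[int(c)] = (means, stds)
    return stats

# Tiny made-up data set: two features, two rows per class.
X_demo = np.array([[1.0, 5.0],
                   [3.0, 5.0],
                   [10.0, 0.0],
                   [14.0, 2.0]])
y_demo = np.array([0.0, 0.0, 1.0, 1.0])

stats = class_stats(X_demo, y_demo)
for c, (means, stds) in stats.items():
    print(c, means, stds)
```

On the notebook's data, the equivalent selection would be `X_train[y_train == 0]` and `X_train[y_train == 1]`.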


In [87]:
#Running Naive Bayes on test data
not_minus_mu_sq = (X_test - not_means)**2
spam_minus_mu_sq = (X_test - spam_means)**2

#X_minus_mu_sq = (X_test - X_train_means)**2
#X_train_stds_sq = X_train_stds**2
#print(X_minus_mu[0:2,25:33])
#print(X_test[0:2, 25:33])
#print(X_train_means[25:33])
#print(X_minus_mu_sq.shape)
#logN = np.log(1 / ( ((2*np.pi)**0.5)*X_train_stds ) ) - (X_minus_mu_sq/(2*X_train_stds_sq))
#print(logN.shape)

logN_not = np.log(1 / ( ((2*np.pi)**0.5)*not_stds ) ) - (not_minus_mu_sq/(2*not_stds**2))
logN_spam = np.log(1 / ( ((2*np.pi)**0.5)*spam_stds ) ) - (spam_minus_mu_sq/(2*spam_stds**2))

sum_not = np.sum(logN_not, axis=1) + np.log(prior_not)
sum_spam = np.sum(logN_spam, axis=1) + np.log(prior_spam)

sum_joined = np.concatenate( (sum_not.reshape(-1,1), sum_spam.reshape(-1,1)), axis = 1)

print(sum_joined.shape)
argmax_logN = np.argmax(sum_joined, axis = 1) #predicted classes

(2301, 2)
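
The scoring step above can be packaged as a single vectorized function. This is a sketch under stated assumptions rather than the notebook's implementation: `gaussian_nb_predict` and the demo arrays are hypothetical, and the per-class means and standard deviations are assumed to be stacked into arrays of shape `(n_classes, n_features)`.

```python
import numpy as np

def gaussian_nb_predict(X, means, stds, priors):
    """Return (predicted class indices, per-class log scores).

    means, stds: arrays of shape (n_classes, n_features)
    priors:      array of shape (n_classes,)
    """
    X = np.asarray(X, dtype=float)[:, np.newaxis, :]   # (n_samples, 1, n_features)
    # Log of the Gaussian density, broadcast to (n_samples, n_classes, n_features).
    log_pdf = -0.5 * np.log(2 * np.pi) - np.log(stds) - (X - means) ** 2 / (2 * stds ** 2)
    # Sum over features and add the log prior of each class.
    scores = log_pdf.sum(axis=2) + np.log(priors)
    return scores.argmax(axis=1), scores

# Made-up model: class 0 centred at (0, 0), class 1 centred at (4, 4), unit variances.
means_demo = np.array([[0.0, 0.0], [4.0, 4.0]])
stds_demo = np.ones((2, 2))
priors_demo = np.array([0.5, 0.5])
X_demo = np.array([[0.1, -0.2], [3.9, 4.2], [1.0, 1.0]])

pred, scores = gaussian_nb_predict(X_demo, means_demo, stds_demo, priors_demo)
print(pred)
print(scores.shape)
```

With the notebook's variables, the inputs would be `np.stack([not_means, spam_means])`, `np.stack([not_stds, spam_stds])`, and `np.array([prior_not, prior_spam])`.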


In [101]:
#Results
accuracy = np.sum(argmax_logN == y_test) / len(y_test)
print("accuracy = " + str(accuracy))

precision = np.sum(np.logical_and(argmax_logN,y_test)) / np.sum(argmax_logN)
print("precision = " + str(precision))

recall = np.sum(np.logical_and(argmax_logN,y_test)) / np.sum(y_test)
print("recall = " + str(recall))

#Confusion matrix
cm = confusion_matrix(y_test, argmax_logN)
print("confusion matrix")
print(cm)

accuracy = 0.834419817470665
precision = 0.7228813559322034
recall = 0.9404630650496141
confusion matrix
[[1067  327]
 [  54  853]]
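
One way to sanity-check a hand-written implementation is to compare it with scikit-learn's `GaussianNB`. The sketch below runs on made-up, well-separated data (`X_demo` and `y_demo` are hypothetical). On the notebook's data one would call `gnb.fit(X_train, y_train)` and compare `gnb.predict(X_test)` with `argmax_logN`. Small differences are possible because `GaussianNB` adds a variance-smoothing term and estimates the priors from the training labels.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up training data: two tight clusters, one per class.
X_demo = np.array([[0.0, 0.0],
                   [0.2, 0.1],
                   [0.1, -0.1],
                   [5.0, 5.0],
                   [5.1, 4.9],
                   [4.9, 5.2]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

gnb = GaussianNB()
gnb.fit(X_demo, y_demo)

# Predict one point near each cluster.
demo_pred = gnb.predict(np.array([[0.05, 0.0], [5.0, 5.05]]))
print(demo_pred)
print(gnb.class_prior_)
```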


## Part 1 Discussion

The recall was very high at ~94%, meaning almost all spam was caught. However, that came at the price of a much lower precision of ~72%: the Naive Bayes classifier was over-zealous in detecting spam and incorrectly flagged a number of legitimate emails as spam, leaving us with a decent, but not great, accuracy of ~83%.
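
As a cross-check, the three metrics can be recomputed directly from the confusion matrix printed in Part 1. This short sketch copies the matrix values from that output. The variable names are illustrative, and the layout (rows are true classes, columns are predicted classes) is the one `confusion_matrix` returns.

```python
import numpy as np

# Confusion matrix from the Part 1 output: rows = true class, columns = predicted class.
cm = np.array([[1067, 327],
               [54, 853]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # fraction of all emails classified correctly
precision = tp / (tp + fp)        # fraction of predicted spam that really is spam
recall = tp / (tp + fn)           # fraction of real spam that was caught

print("accuracy = {0:0.4f}".format(accuracy))
print("precision = {0:0.4f}".format(precision))
print("recall = {0:0.4f}".format(recall))
```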

This is considerably worse than the SVMs in HW3, which had accuracy, precision, and recall values all near 90%, and it leads us to conclude that the independence assumption behind Naive Bayes does not hold for this data. Logically, we would expect spam mail attributes to have some dependency (capital letters and dollar signs seem a likely pair!). Even with the incorrect assumption, Naive Bayes does OK with an overall accuracy in the 80s. Given the nature of the task, however, we value precision greatly to avoid mis-characterizing important emails, so having 327 of the 1394 legitimate test emails (~23%) incorrectly labeled as spam is unacceptable.
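
One way to probe the independence assumption is to look at feature correlations within a single class, since Naive Bayes assumes the features are independent given the class. The helper below is a hypothetical sketch on made-up data. On the notebook's data it would be applied to `X_train_spam` and `X_train_not` separately. Note that a constant column would produce NaN correlations, and that low correlation does not by itself prove independence.

```python
import numpy as np

def most_correlated_pairs(X, top_k=3):
    """Return the top_k feature pairs (i, j, correlation) with the largest |correlation|."""
    corr = np.corrcoef(X, rowvar=False)            # (n_features, n_features)
    iu, ju = np.triu_indices(corr.shape[0], k=1)   # each unordered pair once
    order = np.argsort(-np.abs(corr[iu, ju]))[:top_k]
    return [(int(iu[k]), int(ju[k]), float(corr[iu[k], ju[k]])) for k in order]

# Made-up data: column 1 is roughly twice column 0, column 2 is unrelated.
X_demo = np.array([[1.0, 2.0, 5.0],
                   [2.0, 4.1, 3.0],
                   [3.0, 6.0, 6.0],
                   [4.0, 7.9, 2.0]])

pairs = most_correlated_pairs(X_demo, top_k=1)
print(pairs)
```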

Another reason Naive Bayes may have struggled is the equal weighting given to all the attributes, as every per-attribute probability is simply multiplied together. As seen in homework 3, there were a number of high-weight attributes that played a significant role in the classification.

## Part 2: Logistic Regression

Here, we use the scikit-learn library to perform Logistic Regression on the dataset, with the "liblinear" solver setting.
(https://towardsdatascience.com/logistic-regression-using-python-sklearn-numpy-mnist-handwriting-recognition-matplotlib-a6b31e2b166a)
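
For intuition, a fitted logistic regression model classifies by passing a weighted sum of the features through a sigmoid and thresholding the result. The following is a minimal sketch with made-up weights: `sigmoid`, `predict_logistic`, and the demo values are hypothetical, and this is not how scikit-learn is implemented internally.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_logistic(X, weights, bias, threshold=0.5):
    """Return (class labels, probabilities of class 1) for a linear logistic model."""
    probs = sigmoid(X @ weights + bias)
    return (probs >= threshold).astype(int), probs

# Made-up model parameters and inputs.
w_demo = np.array([2.0, -1.0])
b_demo = -0.5
X_demo = np.array([[1.0, 0.0],    # score  1.5
                   [0.0, 1.0],    # score -1.5
                   [0.5, 0.0]])   # score  0.5

labels, probs = predict_logistic(X_demo, w_demo, b_demo)
print(labels)
print(probs)
```

After fitting, scikit-learn exposes the learned weights and bias as `coef_` and `intercept_` on the `LogisticRegression` object.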

In [102]:
from sklearn.linear_model import LogisticRegression

In [108]:
#https://stackoverflow.com/questions/52640386/how-do-i-solve-the-future-warning-min-groups-self-n-splits-warning-in
#https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-defintions

logisticRegr = LogisticRegression(solver='liblinear') #lbfgs, liblinear
logisticRegr.fit(X_train, y_train)
predictions = logisticRegr.predict(X_test)

#Results
accuracy = np.sum(predictions == y_test) / len(y_test)
print("accuracy = " + str(accuracy))

precision = np.sum(np.logical_and(predictions,y_test)) / np.sum(predictions)
print("precision = " + str(precision))

recall = np.sum(np.logical_and(predictions,y_test)) / np.sum(y_test)
print("recall = " + str(recall))

#Confusion matrix
cm = confusion_matrix(y_test, predictions)
print("confusion matrix")
print(cm)

accuracy = 0.9287266405910474
precision = 0.9169472502805837
recall = 0.9007717750826902
confusion matrix
[[1320   74]
 [  90  817]]


## Part 2 Discussion

[1] https://medium.com/@sangha_deb/naive-bayes-vs-logistic-regression-a319b07a5d4c

The logistic regression results were all in the 90s (~93%, ~92%, and ~90% for accuracy, precision, and recall), notably performing even better than homework 3's SVMs by about 1% in each category, and far better than the suboptimal results from Part 1's Naive Bayes.

Although Naive Bayes and Logistic Regression are closely related (Gaussian Naive Bayes reduces to a linear classifier when each feature's variance is shared across classes), the first is a generative model and the second is a discriminative model. Additionally, according to the source above [1], Logistic Regression performs better than Naive Bayes on larger datasets. In our case, Logistic Regression was far better even without a massive dataset, which suggests the dataset does not satisfy the Naive Bayes independence assumption.
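
The relationship between the two models can be made concrete through the log-odds of the Gaussian model from Part 1. The short derivation below uses symbols introduced here for illustration: $\pi_c$ is the prior of class $c$, and $\mu_{j,c}$, $\sigma_{j,c}$ are the per-class mean and standard deviation of feature $j$.

```latex
\log \frac{P(1 \mid x)}{P(0 \mid x)}
  = \log \frac{\pi_1}{\pi_0}
  + \sum_{j} \left[ \log \frac{\sigma_{j,0}}{\sigma_{j,1}}
  - \frac{(x_j - \mu_{j,1})^2}{2\sigma_{j,1}^2}
  + \frac{(x_j - \mu_{j,0})^2}{2\sigma_{j,0}^2} \right]

% Special case \sigma_{j,0} = \sigma_{j,1} = \sigma_j: the quadratic terms cancel.
\log \frac{P(1 \mid x)}{P(0 \mid x)}
  = \log \frac{\pi_1}{\pi_0}
  + \sum_{j} \left[ \frac{\mu_{j,1} - \mu_{j,0}}{\sigma_j^2}\, x_j
  - \frac{\mu_{j,1}^2 - \mu_{j,0}^2}{2\sigma_j^2} \right]
```

In the shared-variance case the log-odds are linear in $x$, which is the same functional form that Logistic Regression fits directly. With per-class variances, as computed in Part 1, the $x_j^2$ terms remain.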

### Misc function testing

In [96]:
#Numpy array broadcasting
a = np.array([[2, 2, 2], [1, 1, 1]])
print (a)
b = np.array([0, 1, 2])
print(b)
print(a-b)

#array squaring
print(a**2)

#division broadcasting
c = np.array([1, 2, 4])
print(a/c)

#Transpose
#https://stackoverflow.com/questions/36384760/transforming-a-row-vector-into-a-column-vector-in-numpy
    
#Sorting array by key
#https://stackoverflow.com/questions/10695139/sort-a-list-of-tuples-by-2nd-item-integer-value
    
#Checking array equality percentage
#https://stackoverflow.com/questions/25490641/check-how-many-elements-are-equal-in-two-numpy-arrays-python
    
#Precision Checking, logical AND on arrays
d = [0, 1, 0, 1, 0, 0, 0, 1, 1, 1]
e = [1, 0, 0, 0, 0, 1, 1, 1, 1, 1]
np_and = np.logical_and(d,e)
print(sum(np_and))

[[2 2 2]
 [1 1 1]]
[0 1 2]
[[ 2  1  0]
 [ 1  0 -1]]
[[4 4 4]
 [1 1 1]]
[[2.   1.   0.5 ]
 [1.   0.5  0.25]]
3
