#### In this homework we will use Gaussian Naïve Bayes and Logistic Regression to classify the Spambase data from the UCI ML repository

In [1]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
%matplotlib inline

##### Creating training and test set:

In [2]:
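# spambase.data has no header row, so read_csv as called here uses the first record
# as column names (4600 rows remain); pass header=None to keep all 4601 records.
spam_data = pd.read_csv("spambase.data")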
X = spam_data.iloc[:,:-1]
y = spam_data.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

#### Probabilistic model

In [3]:
S_prior = len([y for y in y_train if y == 1])/len(y_train)
NS_prior = len([y for y in y_train if y == 0])/len(y_train)
print(f"Spam prior for training: {S_prior}")
print(f"Not Spam prior for training: {NS_prior}")

Spam prior for training: 0.3917391304347826
Not Spam prior for training: 0.6082608695652174


In [4]:
def cal_mean(data):
    return float(sum(data) / len(data))

def stddev(data):
    mu = cal_mean(data)
    nume = cal_mean([(x - mu) ** 2 for x in data])
    return np.sqrt(nume) if np.sqrt(nume) > 0 else 0.0001

    
mean_std_spam = []
mean_std_notspam = []
for j in range(X_train.shape[1]):
    feature = np.asarray(X_train.iloc[:, j])
    spam_list = []
    notspam_list = []
    # Split this feature's values by class label.
    for row, label in enumerate(y_train):
        if label == 1:
            spam_list.append(feature[row])
        else:
            notspam_list.append(feature[row])
    # Store (mean, std) for this feature, once per class.
    mean_std_spam.append((cal_mean(spam_list), stddev(spam_list)))
    mean_std_notspam.append((cal_mean(notspam_list), stddev(notspam_list)))
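
The same per-class statistics can also be computed without explicit Python loops. The following is only a minimal sketch on a tiny made-up array: the names `class_stats`, `X_demo`, and `y_demo` are illustrative and not part of the homework code. It assumes the population standard deviation and the same `0.0001` floor that `stddev` uses.

```python
import numpy as np

def class_stats(X, y, label, min_std=1e-4):
    """Return per-feature (mean, std) arrays for the rows where y == label."""
    rows = X[y == label]
    mean = rows.mean(axis=0)
    # ddof=0 divides by n, matching stddev() in the cell above.
    std = rows.std(axis=0, ddof=0)
    # Replace zero standard deviations with a small floor to avoid dividing by zero.
    std = np.where(std > 0, std, min_std)
    return mean, std

# Toy data: 4 examples, 2 features; label 1 = spam, 0 = not spam.
X_demo = np.array([[1.0, 5.0], [3.0, 5.0], [0.0, 2.0], [0.0, 4.0]])
y_demo = np.array([1, 1, 0, 0])

mean_spam, std_spam = class_stats(X_demo, y_demo, 1)
mean_notspam, std_notspam = class_stats(X_demo, y_demo, 0)
```

On the real data this would be called with `np.asarray(X_train)` and `np.asarray(y_train)`.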

##### Naive Bayes

In [5]:
def cal_pdf(x,mu,sigma):
    # Unnormalized Gaussian kernel; the 1/(sqrt(2*pi)*sigma) factor is applied by the caller.
    return np.exp(-(x-mu)**2/(2*(sigma **2)))

predicted = np.zeros(len(y_test))
test_examples = np.asarray(X_test)
for i,ex in enumerate(test_examples):
    P_Spam = 0
    P_NOTSpam = 0
    
    for j,feature in enumerate(ex):
        # Fallback log-density, used when exp() underflows to 0 and log() would give -inf.
        N_S = -999999
        N_NS = -999999
        mean_S,std_S = mean_std_spam[j]
        mean_NS,std_NS = mean_std_notspam[j]
        
        pdf_S = cal_pdf(feature,mean_S,std_S)
        if pdf_S != 0:
            N_S = np.log(pdf_S/(np.sqrt(2*np.pi)*std_S))
            
        pdf_NS = cal_pdf(feature,mean_NS,std_NS)
        if pdf_NS != 0:
            N_NS = np.log(pdf_NS/(np.sqrt(2*np.pi)*std_NS)) 
            
        P_Spam += N_S
        P_NOTSpam += N_NS
        
    P_Spam += np.log(S_prior)
    P_NOTSpam += np.log(NS_prior)
    if P_Spam > P_NOTSpam:
        predicted[i] = 1
    else:
        predicted[i] = 0
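
The loop above exponentiates and then takes the log, which is why it needs the `-999999` fallback when `exp` underflows. One possible alternative is to compute the Gaussian log-density directly. The sketch below assumes that approach; `gaussian_log_pdf`, `predict_nb`, and the toy numbers are hypothetical and not the code used to produce the results in this notebook. Predictions may differ from the loop above for examples that hit the fallback.

```python
import numpy as np

def gaussian_log_pdf(x, mu, sigma):
    """Log of the normal density N(x; mu, sigma), computed without calling exp."""
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def predict_nb(X, stats_spam, stats_notspam, prior_spam, prior_notspam):
    """Return 1 (spam) or 0 (not spam) for each row of X."""
    mu_s, sd_s = stats_spam
    mu_n, sd_n = stats_notspam
    # Sum the per-feature log-likelihoods across columns, then add the log prior.
    score_s = gaussian_log_pdf(X, mu_s, sd_s).sum(axis=1) + np.log(prior_spam)
    score_n = gaussian_log_pdf(X, mu_n, sd_n).sum(axis=1) + np.log(prior_notspam)
    # Ties go to "not spam", as in the loop above.
    return (score_s > score_n).astype(int)

# Toy example with two features; all numbers are made up for illustration.
stats_spam = (np.array([2.0, 5.0]), np.array([1.0, 1.0]))
stats_notspam = (np.array([0.0, 1.0]), np.array([1.0, 1.0]))
X_demo = np.array([[2.0, 5.0], [0.0, 1.0], [100.0, 5.0]])
pred_demo = predict_nb(X_demo, stats_spam, stats_notspam, 0.4, 0.6)
```

The third toy row has a value far from both class means: its log-density stays finite here, whereas `np.exp` of the same exponent would underflow to zero.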



In [6]:
print(f"Accuracy: {accuracy_score(y_test,predicted)}")


Accuracy: 0.8147826086956522


In [7]:
print(classification_report(y_test,predicted))

             precision    recall  f1-score   support

          0       0.97      0.72      0.82      1389
          1       0.69      0.96      0.80       911

avg / total       0.86      0.81      0.82      2300



### Confusion matrix

In [8]:
print(confusion_matrix(y_test,predicted))

[[998 391]
 [ 35 876]]
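
As a sanity check, the scores in the classification report can be recomputed by hand from this matrix. The short sketch below assumes sklearn's convention that rows are true labels and columns are predicted labels; the variable names are only illustrative.

```python
# Confusion matrix from the cell above: rows are true labels, columns are predictions.
tn, fp = 998, 391   # true not-spam: predicted not-spam / predicted spam
fn, tp = 35, 876    # true spam:     predicted not-spam / predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_spam = tp / (tp + fp)   # share of predicted spam that really is spam
recall_spam = tp / (tp + fn)      # share of real spam that was caught
recall_notspam = tn / (tn + fp)   # share of real non-spam that was kept

print(f"accuracy={accuracy:.3f} precision={precision_spam:.3f} "
      f"recall={recall_spam:.3f} not-spam recall={recall_notspam:.3f}")
```

The 391 false positives are what pull the spam precision down, even though only 35 spam emails are missed.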


### Discussion

I got an accuracy of about `81%`, which I think is fairly low. The model has very high recall for the `spam` class (`0.96`) but low precision (`0.69`).

With `Naïve Bayes` I got a lower accuracy than I got with `SVM` on Homework 3. Judging by overall accuracy alone, it seems the attributes are indeed dependent at some level. `Naïve Bayes` has a good recall score (higher than `SVM`) for the `spam` class, which is a good attribute for a model dealing with email spam classification: it correctly flagged almost all of the actual spam emails. On the downside, it has a lower precision score, which is a concern because it means non-spam emails are being classified as spam (391 of the 1389 non-spam test emails). This can be a deal-breaker for anyone who wants to use the model to classify spam emails, because they would be losing important emails that are mistakenly classified as spam.

The reason for the very high recall and low precision on the `spam` class might be that the model computes the probability of each feature given a spam email independently. This comes at a cost. Since `Naïve Bayes` assumes that all the features are independent, it is agnostic to feature dependencies. It can therefore miss cases where an email has some spam-like features that do not actually indicate spam, because another feature negates them. As a result, the `Naïve Bayes` model classifies some non-spam emails as spam just because it saw some spam-like features in the text.
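
A related effect of the independence assumption is double counting: when several features carry the same information, Naïve Bayes adds their evidence once per feature. The sketch below illustrates this with made-up Gaussian parameters (not values fitted to Spambase); `log_odds_spam` and all the numbers are hypothetical.

```python
import math

def gaussian_log_pdf(x, mu, sigma):
    """Log of the normal density N(x; mu, sigma)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_odds_spam(features, spam_params, notspam_params, prior_spam=0.4):
    """Naive Bayes log-odds of spam vs. not spam; a positive value means 'spam'."""
    odds = math.log(prior_spam) - math.log(1 - prior_spam)
    for x, (mu_s, sd_s), (mu_n, sd_n) in zip(features, spam_params, notspam_params):
        odds += gaussian_log_pdf(x, mu_s, sd_s) - gaussian_log_pdf(x, mu_n, sd_n)
    return odds

# Hypothetical feature: spam mean 1.0, not-spam mean 0.0, both with std 1.0.
spam_p, notspam_p = (1.0, 1.0), (0.0, 1.0)
x = 0.8  # an observation only mildly on the spam side

single = log_odds_spam([x], [spam_p], [notspam_p])
# The same observation copied into three perfectly correlated features.
tripled = log_odds_spam([x] * 3, [spam_p] * 3, [notspam_p] * 3)

print(f"one feature: {single:.3f}, three copies: {tripled:.3f}")
```

With a single feature the prior outweighs the weak evidence and the log-odds stay negative. With three copies of the same feature the evidence is counted three times and the decision flips to spam, which is one way false positives can arise.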

### Part 2

In [17]:
clf = LogisticRegression(random_state=0, solver='lbfgs',multi_class='multinomial').fit(X_train, y_train)

1.) Describe what library you used and what parameter values you used in running logistic regression.  

I used the sklearn library for logistic regression. I used `multinomial logistic regression` (`multi_class='multinomial'`) as my classification model, with `lbfgs` as the solver and `random_state=0`. All other parameters were left at their defaults.


2.) Give the accuracy, precision, and recall of your learned model on the test set, as well as a confusion matrix

In [18]:
print(f"Accuracy: {accuracy_score(y_test,clf.predict(X_test))}")

Accuracy: 0.9173913043478261


In [12]:
print(classification_report(y_test,clf.predict(X_test)))

             precision    recall  f1-score   support

          0       0.92      0.94      0.93      1389
          1       0.91      0.88      0.89       911

avg / total       0.92      0.92      0.92      2300



### Confusion matrix

In [13]:
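# Score the logistic regression predictions; `predicted` still holds the
# Naive Bayes output from In [5] and would reprint that model's matrix.
print(confusion_matrix(y_test, clf.predict(X_test)))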


3.) Write a few sentences comparing the results of logistic regression to those you obtained from Naïve Bayes and from your SVM from Homework 3.

`Logistic regression` has a good balance between precision and recall for the `not spam` class (`0.92` and `0.94`), with an overall accuracy of `92%`. I think the `Logistic regression` model has better scores for a spam classifier than `Naïve Bayes`. I say this because, for an email classifier, we don't mind a few spam emails in our regular inbox, but we do mind if some of our important emails end up in the spam folder. So we are looking for a model with high recall on the `not spam` class, meaning few good emails are wrongly classified as spam, and we also want high precision on the `not spam` class, meaning few spam emails are delivered to our inbox.

`SVM` from Homework 3 seems to perform slightly better than `Logistic regression` in terms of overall accuracy. They both have the same recall, but `SVM` seems to have slightly higher precision for the `spam` class. `Logistic regression` performs better than `Naïve Bayes` in overall accuracy (`0.92` vs. `0.81`) and in precision for the `spam` class (`0.91` vs. `0.69`). `Naïve Bayes` seems to have the highest recall of all the models for the `spam` class (`0.96`). On the `not spam` class, `Logistic regression` outperforms `Naïve Bayes` with a much higher recall (`0.94` vs. `0.72`) and F1 score (`0.93` vs. `0.82`), although `Naïve Bayes` has the higher precision there (`0.97` vs. `0.92`).