# ISI SPRING 2019 RESEARCH PROJECT 
* By: Huy Nghiem
* Assignment for the spam classification project for the USC MINDS research group.
* TASK: Classify whether e-mails are spam or not and produce metrics for model performance. 
* In this module, we build a simple Logistic Regression model to classify spam and report the requested performance metrics.

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
%load_ext autoreload
%matplotlib inline

In [2]:
df = pd.read_csv("spambase.data", header=None)
feat = df.loc[:,:56].values #Must convert to np array for later use
label = df.loc[:,57].values

## Data Transformation
Based on the exploration, we see that this dataset has high variance and 
some variables are highly correlated with each other. 
We standardize the data to put all features on a comparable scale, which makes it better suited to the machine learning techniques used later. (`MinMaxScaler` is imported as an alternative, but only `StandardScaler` is used in this module.)
After scaling, each variable in the training set should have mean 0 and unit standard deviation.

In [3]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [4]:
def STDscale(train,test): 
    '''Perform standard scaling on the dataset. 
    Note: we must fit the scaler on the training set 
    and then use these statistics to scale the test set.
    '''
    scaler = StandardScaler().fit(train)
    df_train_scaled = scaler.transform(train)
    df_test_scaled = scaler.transform(test)
    return df_train_scaled, df_test_scaled
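
As a quick sanity check of the note in the docstring above, the sketch below applies the same fit-on-train, transform-both pattern to synthetic data. The array shapes, distribution parameters, and random seed are assumptions chosen for illustration, not values from the spambase data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the features: 200 training rows, 50 test rows, 3 columns
train = rng.normal(loc=5.0, scale=3.0, size=(200, 3))
test = rng.normal(loc=5.0, scale=3.0, size=(50, 3))

scaler = StandardScaler().fit(train)     # statistics come from the training set only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)     # the same statistics are reused on the test set

print(train_scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(train_scaled.std(axis=0).round(6))   # approximately 1 for every column
print(test_scaled.mean(axis=0).round(2))   # close to 0, but generally not exactly 0
```

The test set is only approximately centered because its own mean and standard deviation never enter the computation, which is the point of fitting on the training set alone.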

## LOGISTIC REGRESSION Part 1

Build a Logistic Regression (LR) model for this classification task.
LR is a good starting point since we have 57 features with ~4,600 observations, 
heuristically good enough for a linear technique.

We will perform k-fold cross-validation: fit the model on k-1 folds and test on the held-out fold. We shuffle the data before splitting to prevent over-concentration of similar observations in one fold and not in another.

We perform standard scaling on the training data and apply the statistics to the held-out data. The LR model is fit on each training set and performance metrics are reported for each fold.
We also use the `lbfgs` solver under the hood for fast convergence on small data.
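
The same procedure can also be expressed with a scikit-learn pipeline, which refits the scaler inside every training split automatically. This is a minimal sketch of an alternative formulation, not the loop used in this notebook; the synthetic dataset from `make_classification` and the `max_iter` value are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the spam data (assumption: 500 rows, 20 features)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# The pipeline refits the scaler on each training split, so the held-out
# fold never contributes to the scaling statistics.
# LogisticRegression applies L2 regularization by default.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(solver="lbfgs", max_iter=1000))

kf = KFold(n_splits=5, shuffle=True, random_state=123)
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
print(scores.round(3), round(scores.mean(), 3))
```

The explicit loop used in this notebook has the advantage of exposing the per-fold confusion matrix, which `cross_val_score` does not return.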

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

## Note on performance metrics
* False Positive: non-spam misclassified as spam (1)
* False Negative: spam misclassified as non-spam (0)
* FPR = $\frac{FP}{FP+TN}$ 

* FNR = $\frac{FN}{FN+TP}$
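
To make the two rates concrete, here is a small worked example on made-up labels (the arrays below are illustrative assumptions, not data from this project). It uses `confusion_matrix` with the label order fixed to `[0, 1]` so that `ravel()` returns the counts as TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = spam, 0 = non-spam (made-up values for illustration)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1, 0, 0])

# With labels=[0, 1], ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

fpr = fp / (fp + tn)   # share of non-spam flagged as spam
fnr = fn / (fn + tp)   # share of spam that slipped through

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")   # TN=4 FP=1 FN=2 TP=3
print(f"FPR={fpr:.2f} FNR={fnr:.2f}")       # FPR=0.20 FNR=0.40
```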

In [6]:
def getResult(test_label, test_pred, num_test):
    '''Tally counts for True Negative, False Positive,
        False Negative, True Positive, along with
        FPR, FNR, Accuracy Rate and Error Rate (in percent).
        num_test is the number of observations in the held-out fold.
    '''
    tn, fp, fn, tp = confusion_matrix(test_label, test_pred).ravel()
    fpr = round(fp/(fp+tn),2)
    fnr = round(fn/(fn+tp),2)
    acc = (test_pred == test_label).astype(int).sum()/num_test*100
    err = (test_pred != test_label).astype(int).sum()/num_test*100
    return tn, fp, fn, tp, fpr, fnr, acc, err 

In [7]:
#Specify hyperparameters
num_k = 5
col_names = ["TN","FP","FN","TP","FPR","FNR","ACC%","ERR%"]
result = np.zeros((1,len(col_names)))

We fit the LR model using the 5-fold cross-validation technique.
We employ L2-norm regularization to combat overfitting. 
Note that for each fold, we fit the scaler (calculate statistics) on the training set and 
use these statistics to scale both the training and validation sets.
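
The strength of the L2 penalty in scikit-learn's `LogisticRegression` is controlled by the parameter `C` (the inverse of the regularization strength; the notebook leaves it at its default). The sketch below, on synthetic data and with illustrative `C` values that are assumptions rather than tuned settings, shows how a smaller `C` shrinks the coefficient vector.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the spam data (assumption: 500 rows, 20 features)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)

norms = {}
for C in (0.01, 1.0, 100.0):
    # Smaller C means a stronger L2 penalty on the coefficients
    clf = LogisticRegression(C=C, solver="lbfgs", max_iter=2000).fit(X, y)
    norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C}: L2 norm of coefficients = {norms[C]:.3f}")
```

If overfitting were a concern, `C` could be chosen by cross-validation rather than fixed in advance.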

In [8]:
import warnings
kf = KFold(n_splits=num_k, shuffle=True, random_state=123)
kf.get_n_splits(feat)
i = 1
with warnings.catch_warnings(): 
    warnings.filterwarnings("ignore")
    for train_idx, test_idx in kf.split(feat):
        N, T  = train_idx.size, test_idx.size
        print("TESTING at FOLD %s with %s training obs., %s testing obs. " % (i, N, T)  )
        x_train, x_test = feat[train_idx], feat[test_idx]
        y_train, y_test = label[train_idx], label[test_idx]
        #Perform scaling on train_set
        x_train, x_test = STDscale(x_train,x_test)
        #Fit the model
        LR = LogisticRegression(penalty="l2", solver='lbfgs')
        LR.fit(x_train,y_train)
        y_pred = LR.predict(x_test)
        tn,fp,fn,tp,fpr,fnr,acc,err = getResult(y_test, y_pred, T)
        result = np.vstack((result, [tn,fp,fn,tp,fpr,fnr,acc,err]))
        print("Accuracy: %s %%" % round(acc,2))
        print("Error: %s %%" % round(err,2))
        i+=1

TESTING at FOLD 1 with 3680 training obs., 921 testing obs. 
Accuracy: 93.7 %
Error: 6.3 %
TESTING at FOLD 2 with 3681 training obs., 920 testing obs. 
Accuracy: 90.11 %
Error: 9.89 %
TESTING at FOLD 3 with 3681 training obs., 920 testing obs. 
Accuracy: 94.35 %
Error: 5.65 %
TESTING at FOLD 4 with 3681 training obs., 920 testing obs. 
Accuracy: 90.54 %
Error: 9.46 %
TESTING at FOLD 5 with 3681 training obs., 920 testing obs. 
Accuracy: 93.04 %
Error: 6.96 %


In [9]:
result = result[1:]
avg_result = np.sum(result,axis=0)/result.shape[0]
avg_result[:4] = 0
result = np.vstack((result,avg_result))
final_result = pd.DataFrame(result,columns=col_names)

In [10]:
final_result

|   | TN | FP | FN | TP | FPR | FNR | ACC% | ERR% |
|---|---|---|---|---|---|---|---|---|
| 0 | 535.0 | 28.0 | 30.0 | 328.0 | 0.05 | 0.08 | 93.702497 | 6.297503 |
| 1 | 490.0 | 40.0 | 51.0 | 339.0 | 0.08 | 0.13 | 90.108696 | 9.891304 |
| 2 | 558.0 | 17.0 | 35.0 | 310.0 | 0.03 | 0.1 | 94.347826 | 5.652174 |
| 3 | 511.0 | 30.0 | 57.0 | 322.0 | 0.06 | 0.15 | 90.543478 | 9.456522 |
| 4 | 553.0 | 26.0 | 38.0 | 303.0 | 0.04 | 0.11 | 93.043478 | 6.956522 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.052 | 0.114 | 92.349195 | 7.650805 |


In [11]:
del x_train, x_test,y_train,y_test

### Observations
In Part 1, using a simple 5-fold cross-validation, we observe an average accuracy of roughly 92 percent. This approach is a bit naive for the following reasons:
* Lack of a testing set: a separate test set should be set aside to provide an objective basis for evaluating our model.
* In the exploration phase, we noticed a high degree of correlation between variables. Logistic Regression is notorious for its susceptibility to this phenomenon. Performing PCA on this set may be a remedy.
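
The first point could be addressed by holding out a test set before any cross-validation. A minimal sketch, assuming a 20% hold-out and synthetic data in place of the spam features (both are illustrative assumptions, not choices made in this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a roughly 60/40 class balance (assumption)
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.6, 0.4], random_state=0)

# Hold out 20% as a final test set; stratify keeps the class ratio similar
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123)

print(X_dev.shape, X_test.shape)   # (800, 20) (200, 20)
print(round(y_dev.mean(), 3), round(y_test.mean(), 3))  # similar positive-class share
```

Cross-validation would then run on `X_dev` only, and `X_test` would be scored once at the end.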

## LOGISTIC REGRESSION Part 2
In this part, we enhance the former Logistic Regression model by applying Principal Component Analysis (PCA) before fitting the model. PCA is a general dimensionality-reduction technique that represents each sample with a smaller number of principal components that still capture the majority of the variance.
PCA reduces redundancy from correlated features and "explains" the data in fewer axes/components.
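
To illustrate how a variance threshold picks the number of components, the sketch below builds synthetic data in which 10 correlated features are driven by 3 latent factors (an assumption made purely for illustration) and asks PCA to keep enough components to explain at least 95% of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Assumption: 10 observed features driven by 3 latent factors plus small noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 10))

X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA for the smallest number of components
# whose cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.95).fit(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)

print("components kept:", pca.n_components_)
print("cumulative explained variance:", cum_var.round(3))
```

Because the features are strongly correlated, far fewer than 10 components are needed to pass the threshold.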

In [12]:
from sklearn.decomposition import PCA
def PCAtransform(train,test,var_ratio=0.95):
    '''
    Perform a PCA transformation on the train and test set,
    retaining at least var_ratio (default 95%) of the variance. 
    Note: PCA will select the MINIMUM number of principal components 
    needed to retain that share of the training-set variance. 
    '''
    pca = PCA(var_ratio)
    pca.fit(train)
    train_pca = pca.transform(train)
    test_pca = pca.transform(test)
    return train_pca, test_pca, pca.explained_variance_ratio_

In [13]:
#This time, we increase the number of folds to 10, double that of the last round.
num_k = 10
result = np.zeros((1,len(col_names)))

In [14]:
import warnings
kf = KFold(n_splits=num_k,shuffle=True,random_state=123)
kf.get_n_splits(feat)
i = 1
with warnings.catch_warnings(): 
    warnings.filterwarnings("ignore")
    for train_idx, test_idx in kf.split(feat):
        N, T  = train_idx.size, test_idx.size
        print("TESTING at FOLD %s with %s training obs., %s testing obs. " % (i, N, T)  )
        x_train, x_test = feat[train_idx], feat[test_idx]
        y_train, y_test = label[train_idx], label[test_idx]
        #Perform scaling on train_set
        x_train_scaled, x_test_scaled = STDscale(x_train,x_test)
        x_train_pca, x_test_pca, var_ratio = PCAtransform(x_train_scaled,x_test_scaled)
        #print("{}% of variance explained on the training set".format(np.sum(var_ratio)))
        #Fit the model
        LR = LogisticRegression(penalty="l2", solver='lbfgs')
        LR.fit(x_train_pca,y_train)
        y_pred = LR.predict(x_test_pca)
        tn,fp,fn,tp,fpr,fnr,acc,err = getResult(y_test, y_pred, T)
        result = np.vstack((result, [tn,fp,fn,tp,fpr,fnr,acc,err]))
        print("Accuracy: %s %%" % round(acc,2))
        print("Error: %s %%" % round(err,2))
        i+=1

TESTING at FOLD 1 with 4140 training obs., 461 testing obs. 
Accuracy: 93.93 %
Error: 6.07 %
TESTING at FOLD 2 with 4141 training obs., 460 testing obs. 
Accuracy: 94.13 %
Error: 5.87 %
TESTING at FOLD 3 with 4141 training obs., 460 testing obs. 
Accuracy: 91.09 %
Error: 8.91 %
TESTING at FOLD 4 with 4141 training obs., 460 testing obs. 
Accuracy: 90.65 %
Error: 9.35 %
TESTING at FOLD 5 with 4141 training obs., 460 testing obs. 
Accuracy: 94.78 %
Error: 5.22 %
TESTING at FOLD 6 with 4141 training obs., 460 testing obs. 
Accuracy: 93.7 %
Error: 6.3 %
TESTING at FOLD 7 with 4141 training obs., 460 testing obs. 
Accuracy: 90.87 %
Error: 9.13 %
TESTING at FOLD 8 with 4141 training obs., 460 testing obs. 
Accuracy: 90.65 %
Error: 9.35 %
TESTING at FOLD 9 with 4141 training obs., 460 testing obs. 
Accuracy: 93.04 %
Error: 6.96 %
TESTING at FOLD 10 with 4141 training obs., 460 testing obs. 
Accuracy: 92.83 %
Error: 7.17 %


In [15]:
result = result[1:]
avg_result = np.sum(result,axis=0)/result.shape[0]
avg_result[:4] = 0
result = np.vstack((result,avg_result))
final_result = pd.DataFrame(result,columns=col_names)

In [16]:
final_result

|   | TN | FP | FN | TP | FPR | FNR | ACC% | ERR% |
|---|---|---|---|---|---|---|---|---|
| 0 | 265.0 | 13.0 | 15.0 | 168.0 | 0.05 | 0.08 | 93.926247 | 6.073753 |
| 1 | 272.0 | 13.0 | 14.0 | 161.0 | 0.05 | 0.08 | 94.130435 | 5.869565 |
| 2 | 242.0 | 17.0 | 24.0 | 177.0 | 0.07 | 0.12 | 91.086957 | 8.913043 |
| 3 | 256.0 | 15.0 | 28.0 | 161.0 | 0.06 | 0.15 | 90.652174 | 9.347826 |
| 4 | 288.0 | 8.0 | 16.0 | 148.0 | 0.03 | 0.1 | 94.782609 | 5.217391 |
| 5 | 270.0 | 9.0 | 20.0 | 161.0 | 0.03 | 0.11 | 93.695652 | 6.304348 |
| 6 | 254.0 | 16.0 | 26.0 | 164.0 | 0.06 | 0.14 | 90.869565 | 9.130435 |
| 7 | 257.0 | 14.0 | 29.0 | 160.0 | 0.05 | 0.15 | 90.652174 | 9.347826 |
| 8 | 275.0 | 17.0 | 15.0 | 153.0 | 0.06 | 0.09 | 93.043478 | 6.956522 |
| 9 | 277.0 | 10.0 | 23.0 | 150.0 | 0.03 | 0.13 | 92.826087 | 7.173913 |


## OBSERVATIONS

The Accuracy and Error Rate are comparable between Parts 1 and 2.
A quick look over the respective K-Fold result tables reveals 
similar performance metrics across folds, with per-fold accuracy between roughly 90% and 95% in both parts. 

An average Accuracy of roughly 92% on the held-out folds appears decent.
The slight increase in average Accuracy is plausible, since with 10 folds the model is trained on more data in each fold (about 4,140 observations versus about 3,680). PCA was introduced at the same time, however, so the two effects cannot be separated here.

The results reveal that our LR model produces a higher __False Negative Rate__ than __False Positive Rate__. In practice, however, we might want to prioritize reducing FP, since a false positive sends a legitimate and possibly important email to the spam folder.
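
One way to trade one error rate against the other is to move the decision threshold applied to the predicted probability of class 1 (spam), instead of relying on the default of 0.5 used by `predict`. The sketch below is a hedged illustration on synthetic data; the thresholds and the helper `rates` are hypothetical and not part of this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (assumption: 2000 rows, 20 features)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_tr, y_tr)
proba_spam = clf.predict_proba(X_te)[:, 1]   # estimated P(class == 1)

def rates(threshold):
    """Return (FPR, FNR) when predicting class 1 for proba >= threshold."""
    pred = (proba_spam >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=[0, 1]).ravel()
    return fp / (fp + tn), fn / (fn + tp)

for t in (0.5, 0.7, 0.9):
    fpr, fnr = rates(t)
    print(f"threshold={t}: FPR={fpr:.3f}, FNR={fnr:.3f}")
```

Raising the threshold can only keep FPR the same or lower it, at the cost of an equal or higher FNR, so the choice depends on which error is more costly for the application.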

For further improvement, we can try other classification models, such as SVM 
and Random Forest, which deal with the relatively high dimensionality 
differently. These are implemented in the module Model Building 2.

__END OF MODULE__