# Assignment 2: Spam Classification with SVM

### CS 4501 Machine Learning - Department of Computer Science - University of Virginia

![Spam email](https://www.saleshandy.com/blog/wp-content/uploads/2017/01/wsi-imageoptim-11-Reasons-Why-Your-Email-Ends-Up-In-Spam.png)

*Many email services today provide spam filters that classify emails into spam and non-spam with high accuracy. In this part of the assignment, you will use SVMs to build your own spam filter. For reference, you may consult my [lecture 6](https://drive.google.com/open?id=1CeBhepjDKBaFBq2BZq-zNQs-MC8ll7aL4qAF8TJ24FM) and [lecture 6b](https://drive.google.com/open?id=13BidUAs_c2QdZkf92axt2S748sbnbI9Hgxg-fzb-OuU), or Chapter 5 of the textbook, if you need additional sample code to help with your assignment. For deliverables, you must write code in Python and submit **this** Jupyter Notebook file (.ipynb) to earn a total of 100 pts. You will gain points depending on how you perform in the following sections.*


---
## 1. PRE-PROCESSING THE DATA (20 pts)

**Data Acquisition:** Download the spam dataset from UC Irvine. You can find the dataset attached to the assignment in Collab. Note that the data comes as raw files, so you have to convert it into a readable format (e.g., CSV). Please be sure to read its documentation to learn more about the dataset.

**Data Splitting:** Put the data into the format needed for the classification task, then split it into 80% training and 20% testing sets (each should have approximately the same proportion of positive and negative examples).
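
One way to keep the class proportions the same in both sets is a stratified split. The sketch below is only an illustration on made-up data (`X_demo` and `y_demo` are hypothetical stand-ins, not the spam dataset); it uses the `stratify` argument of scikit-learn's `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the spam data: 1000 rows, 40% positive (spam) labels.
rng = np.random.RandomState(0)
X_demo = rng.rand(1000, 5)
y_demo = np.array([1] * 400 + [0] * 600)

# stratify=y_demo asks the splitter to preserve the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1
)

print("train size:", len(y_tr), "positive rate:", y_tr.mean())
print("test size: ", len(y_te), "positive rate:", y_te.mean())
```

Without `stratify`, the two positive rates can drift apart by chance, which matters more for small or imbalanced datasets.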

**Data Discovery:** Plot the correlations among the features. You may notice that some features are more correlated with the predicted value than others. This information will help you confirm the weights of your SVM model later on.

**Data Cleaning:** If your dataset has missing values, make sure you can fill them in with the `SimpleImputer` class (called `Imputer` in older versions of scikit-learn).

**Feature Scaling:** You can use the library class `StandardScaler` to normalize the values of each feature.
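
As a small worked example of what standardization does, the sketch below applies $z = (x - \mu)/\sigma$ by hand with NumPy on made-up numbers (the arrays `train` and `test` are hypothetical). The point it illustrates is that the mean and standard deviation are computed from the training data only and then reused for the test data, which is also how `fit_transform` followed by `transform` behaves.

```python
import numpy as np

# Two features on very different scales.
train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
test = np.array([[2.0, 800.0]])

# Statistics come from the training set only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # reuse the training statistics

print(train_scaled)
print(test_scaled)
```

After scaling, each training column has zero mean and unit variance, while the test row is simply expressed in the same units.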

In [1]:
# You might want to use the following packages
from sklearn.model_selection import train_test_split
from pandas.plotting import scatter_matrix # optional
from sklearn.impute import SimpleImputer  # replaces preprocessing.Imputer, removed in scikit-learn 0.22
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

import os
import pandas as pd
import numpy as np


#################
# DATA ACQUIRING#
#################

# def load_spam_data(path):
#     csv_path = os.path.join(path)
#     return pd.read_csv(csv_path)

# spam_data = load_spam_data("spambase.data.csv")
# spam_col = load_spam_data("spambase.columns.csv")
spam_data = pd.read_csv("spambase.data.csv", header = None)
spam_col = pd.read_csv("spambase.columns.csv", header = None)
spam_col.drop(spam_col.columns[1], axis = 1, inplace=True)  # Drops the extra column that just have null values

X_labels = spam_col[0].values.tolist()    # Grabs the features... Does not contain "label" of whether spam or not
spam_col.loc[len(spam_col)]=['label']     # Adds the "label"... necessary for the below step
spam_data = spam_data.rename(columns=spam_col[0]) # Combines the data dataset and the column of features dataset

#################
# DATA SPLITTING#
#################
X = spam_data[X_labels]
Y = spam_data[["label"]]
# 80-20 train-test split; stratify keeps the spam/non-spam ratio the same in both sets
X_train_pre, X_test_pre, Y_train, Y_test = train_test_split(X, Y, random_state=1, test_size=0.2, stratify=Y["label"])
print(len(X_train_pre), "X_train +", len(X_test_pre), "X_test")
print(len(Y_train), "Y_train +", len(Y_test), "Y_test")

# Your code goes here for this section.
X_train = [];
y_train = [];
X_test = [];
y_test = [];


In [None]:
spam_data.info()

In [None]:
spam_data['capital_run_length_average'].value_counts()

In [None]:
spam_data.describe()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
spam_data.hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
corr_matrix = spam_data.corr()
print (corr_matrix)

In [None]:
#################
# DATA DISCOVERY#
#################
corr_matrix["label"].sort_values(ascending=False)

In [None]:
spam_param_train = X_train_pre.select_dtypes(include=[np.number])
spam_param_test = X_test_pre.select_dtypes(include=[np.number])

spam_param_train.head()

In [None]:
sample_incomplete_rows = spam_data[spam_data.isnull().any(axis=1)]
sample_incomplete_rows
# We have no NaN Values

In [None]:
try:
    from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
except ImportError:
    from sklearn.preprocessing import Imputer as SimpleImputer

In [None]:
###################################
# DATA CLEANING & FEATURE SCALING #
###################################
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Permission granted to use std_scaler 
num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
    ])
X_train = num_pipeline.fit_transform(spam_param_train)
X_test = num_pipeline.transform(spam_param_test)
X_train

- - -
## 2. TRAINING LINEAR SVM FOR SPAM CLASSIFICATION (15 pts)

Train your linear SVM classifier on the training data, and then test the classifier on the test data. You may use the hinge **loss function** (`loss="hinge"`; note that `LinearSVC` itself defaults to `"squared_hinge"`) and the **default** value of the C hyperparameter (=1.0):

* Report (1) accuracy, (2) precision, (3) recall, and (4) F-score on the test data
* Create an ROC curve, using 100 evenly spaced thresholds, for this SVM. You may use library function calls to create the ROC curve.

**Implementation Notes:** For SVM, you do NOT need to add a column of 1's to the $\mathbf{x}$ matrix to have an intercept term
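
To make the four metrics and the "100 evenly spaced thresholds" requirement concrete, here is a minimal NumPy sketch. The helper names `binary_metrics` and `roc_points` and the toy arrays are hypothetical, introduced only for illustration; the scores stand in for the output of a classifier's `decision_function`.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from hard 0/1 predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

def roc_points(y_true, scores, n_thresholds=100):
    """FPR and TPR at evenly spaced thresholds spanning the score range."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    thresholds = np.linspace(scores.min(), scores.max(), n_thresholds)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [], []
    for t in thresholds:
        pred = scores >= t  # predict positive when the score reaches the threshold
        tpr.append(np.sum(pred & (y_true == 1)) / n_pos)
        fpr.append(np.sum(pred & (y_true == 0)) / n_neg)
    return np.array(fpr), np.array(tpr), thresholds

# Toy labels and decision scores (made up for illustration).
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores_demo = np.array([2.1, 0.4, -0.3, -1.5, 0.2, -0.8, 1.0, -2.0])
y_pred_demo = (scores_demo >= 0).astype(int)

acc, prec, rec, f1 = binary_metrics(y_true_demo, y_pred_demo)
fpr_demo, tpr_demo, _ = roc_points(y_true_demo, scores_demo)
print(acc, prec, rec, f1)
```

The ROC curve needs continuous scores rather than hard 0/1 predictions; with hard predictions there is only one operating point to plot.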



In [None]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# Training your svm here
svm_clf = LinearSVC(C=1.0, loss="hinge", random_state=42, tol=1)
svm_clf.fit(X_train, np.ravel(Y_train))

# Testing your svm here
from sklearn.metrics import roc_curve

prediction = svm_clf.predict(X_test)
# Use continuous decision scores for the ROC curve; hard 0/1 predictions
# would give only a single operating point between (0, 0) and (1, 1).
scores = svm_clf.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(np.ravel(Y_test), scores)

print("Accuracy: ", accuracy_score(Y_test, prediction))
print("Precision: ",precision_score(Y_test, prediction))
print("Recall: ", recall_score(Y_test, prediction))
print("F-1: ", f1_score(Y_test, prediction))


plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=1, label='Linear SVM')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')  # chance-level diagonal
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()


- - -
## 3. TUNING REGULARIZATION HYPER-PARAMETER C (15 pts)
Next, you will study the SVM tradeoff between margin width and margin violations by using different values of the C hyper-parameter. Your task is to run an experiment with different values of C on the spam dataset and report the same performance measures as in section 2. After running the experiment, you must justify why you selected a particular value of C.

Hint: you can use cross validation for each value of C and then pick the value which yields the best performance.
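
The hint can be sketched as a plain loop over candidate values. This is an illustration on synthetic data from `make_classification`, not the spam dataset, and the candidate list `candidate_Cs` is an arbitrary assumption; the scaler sits inside a pipeline so that it is refit on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (hypothetical, for illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

candidate_Cs = [0.01, 0.1, 1.0, 10.0]  # assumed grid; widen or refine as needed
mean_scores = {}
for C in candidate_Cs:
    model = make_pipeline(
        StandardScaler(),
        LinearSVC(C=C, loss="hinge", max_iter=10000, random_state=42),
    )
    # 5-fold cross-validated accuracy for this value of C
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    mean_scores[C] = scores.mean()

best_C = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best C:", best_C)
```

Reporting the whole `mean_scores` table, not just the winner, makes it easier to justify the choice of C.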

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal, uniform
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float32))
X_test_scaled = scaler.transform(X_test.astype(np.float32))

# param_distributions = {"gamma": reciprocal(0.001, 0.1), "C": uniform(1, 10)}
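# scipy's uniform(loc, scale) samples C from the interval [0.01, 100.01]
param_distributions = {"C": uniform(.01, 100)}
rnd_search_cv = RandomizedSearchCV(svm_clf, param_distributions, n_iter=10, verbose=2, cv=3)
# np.ravel turns the single-column label DataFrame into the 1-D array that fit() expects
rnd_search_cv.fit(X_train_scaled, np.ravel(Y_train))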

In [None]:
rnd_search_cv.best_estimator_

In [None]:
rnd_search_cv.best_score_

- - -
## 4. SELECTING THE FEATURES WITH LINEAR SVM (20 pts)

Once you have learned the best linear SVM in the previous sections, your next task is to find which features are best for classifying spam. First, you must obtain the weight vector $\mathbf{w}$ using the attribute `coef_` of your SVM classifier. Then, for the number of features $n = 2$ to $57$, you will run the following in a loop:

* Select a set of top $n$ features that have the highest weights
* Train a classifier $\text{SVM}_n$ on all training data, using only these $n$ features and the same hyperparameter C learned in section 3.
* Test $\text{SVM}_n$ on the test set (using the same $n$ features) to obtain accuracy.
* Plot accuracy on test data vs. $n$ number of features

Identify the top 5 features. Using the plot, discuss the effects of feature selection on performance in a short paragraph (e.g., how much does the performance improve each time one of the top 5 features is added? Were the top spam features surprising to you?)
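
The loop described above could look roughly like the sketch below. It runs on synthetic data rather than the spam dataset, `best_C = 1.0` is a placeholder for the value chosen in section 3, and ranking features by the absolute value of their weight is an assumption (the task statement says only "highest weights").

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (hypothetical, for illustration only).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1
)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

best_C = 1.0  # placeholder for the C selected in section 3

# Fit on all features once and rank them by |weight| (assumed ranking rule).
full_clf = LinearSVC(C=best_C, loss="hinge", max_iter=10000, random_state=42)
full_clf.fit(X_tr, y_tr)
ranking = np.argsort(-np.abs(full_clf.coef_[0]))

feature_counts = list(range(2, X_tr.shape[1] + 1))
accuracies = []
for n in feature_counts:
    top_n = ranking[:n]  # column indices of the n highest-ranked features
    clf_n = LinearSVC(C=best_C, loss="hinge", max_iter=10000, random_state=42)
    clf_n.fit(X_tr[:, top_n], y_tr)
    accuracies.append(clf_n.score(X_te[:, top_n], y_te))

print("top 5 feature indices:", ranking[:5])
```

Passing `feature_counts` and `accuracies` to `plt.plot` gives the requested accuracy-versus-$n$ curve; on the spam data the range would run from 2 to 57.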



In [None]:
# Your feature selection code goes here


# Your paragraph goes here for this section

- - -
## 5. KERNELIZING SVM WITH THE GAUSSIAN RBF (30 pts)

In this part of the assignment, you will use SVMs to do non-linear classification. In particular, you will use SVMs with Gaussian kernels on this dataset, which is not linearly separable.

$
    \mathbf{K}_{RBF}(\mathbf{x}^{(i)},\mathbf{x}^{(j)}) = \exp(-\gamma ||\mathbf{x}^{(i)} -\mathbf{x}^{(j)}||^2).
$
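
To see what the kernel formula computes, here is a minimal NumPy sketch that evaluates it for every pair of rows. The helper name `rbf_kernel_matrix` and the three toy points are hypothetical, chosen so the distances are easy to check by hand.

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2) for all row pairs."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dists)

points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
K = rbf_kernel_matrix(points, points, gamma=0.5)
print(K)
```

Each point has similarity 1 with itself, and the similarity decays towards 0 as the squared distance grows; a larger $\gamma$ makes that decay faster and the decision boundary more local.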

First, your task is to determine the best values of the regularization hyperparameter $C$ and the Gaussian kernel spread hyperparameter $\gamma$. You can train the SVM on the training set and report its performance using the metrics from section 2. By trying different values of $C$ and $\gamma$, you will be able to learn a good non-linear decision boundary that performs reasonably well on this dataset.

Next, you will compare the performance of this kernelized version of SVM with that of the linear SVM in Section 3. You will need to plot the performance (in terms of accuracy, precision, and recall, and the ROC curve) for both. How much better does your non-linear SVM classifier perform compared to the linear SVM?

**Implementation Note:** When selecting the best C and $\gamma$ parameters to use, you need to evaluate the error of each candidate pair using cross validation.
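
One possible way to run that search is scikit-learn's `GridSearchCV`, sketched below on synthetic data. The grid values are arbitrary assumptions, and the parameter names `svc__C` and `svc__gamma` follow from the step name that `make_pipeline` assigns to `SVC`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical, for illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1.0, 10.0],        # assumed candidate values
    "svc__gamma": [0.001, 0.01, 0.1],  # assumed candidate values
}

# Every (C, gamma) pair is scored by 3-fold cross validation on the training data.
grid = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(grid.best_score_)
```

The held-out test set is touched only once, after the best pair has been chosen, so the reported test performance stays an honest estimate.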

Finally, write a paragraph reporting on the final performance of your RBF-kernel SVM. Do you think the performance is adequate for deployment in practice? Justify your answer.




In [None]:
from sklearn.svm import SVC
# hyperparams = (gamma1, C1), (gamma1, C2), (gamma2, C1), (gamma2, C2), ...
# for gamma, C in hyperparams:
#    rbf_kernel_svm_clf = SVC(kernel="rbf", gamma=gamma, C=C)
#    rbf_kernel_svm_clf.fit(X_cv, y_cv)
#    # Your code to train and find the best value of C and gamma here

- - - 
### NEED HELP?

If you get stuck at any step in the process, you can:

 * Consult my [lecture 6](https://drive.google.com/open?id=1CeBhepjDKBaFBq2BZq-zNQs-MC8ll7aL4qAF8TJ24FM) and [lecture 6b](https://drive.google.com/open?id=13BidUAs_c2QdZkf92axt2S748sbnbI9Hgxg-fzb-OuU) and/or the textbook
 * Talk to the TAs; they are available to help you during [office hours](https://docs.google.com/document/d/15qB84xjaS-uRJmfKmmQuCz38bLMFaoqdbuRLoZEdOYI/edit#heading=h.72k1pvft525n)
 * Come talk to me or email me <nn4pj@virginia.edu> with a subject starting with "CS4501 Assignment 2:...".

Best of luck and have fun!