# Homework 3: SVMs and Feature Selection
### Due Tuesday Feb 12 5 PM

In this assignment, we analyse the UCI Spambase email dataset (https://archive.ics.uci.edu/ml/datasets/Spambase) with support vector machines (SVMs). We work in Python and use the scikit-learn package, which provides SVM implementations.

References:
https://stackabuse.com/implementing-svm-and-kernel-svm-with-pythons-scikit-learn/
https://stackoverflow.com/questions/13187778/convert-pandas-dataframe-to-numpy-array
https://stackoverflow.com/questions/6710684/remove-one-column-for-a-numpy-array

In [32]:
import pandas as pd  # needed for pd.read_csv in the data-loading cell
import numpy as np  
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix  

In [25]:
#Load Data
spamdata = pd.read_csv("spambase.data", header = None)

spamdata.head()
spam = spamdata.values #to numpy matrix
print(spam.shape)
print(spam[0,:])

#split data from labels
data = spam[:,:-1] 
label = spam[:,-1]

print(data.shape)
print(label.shape)
print(sum(label)) #1813 / 4601 spam labels

(4601, 58)
[  0.      0.64    0.64    0.      0.32    0.      0.      0.      0.
   0.      0.      0.64    0.      0.      0.      0.32    0.      1.29
   1.93    0.      0.96    0.      0.      0.      0.      0.      0.
   0.      0.      0.      0.      0.      0.      0.      0.      0.
   0.      0.      0.      0.      0.      0.      0.      0.      0.
   0.      0.      0.      0.      0.      0.      0.778   0.      0.
   3.756  61.    278.      1.   ]
(4601, 57)
(4601,)
1813.0
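
The label sum above means 1813 of the 4601 messages are spam, roughly 39.4%. The split in the next cell passes `stratify = label`, which should keep that ratio nearly identical in both halves. A minimal sketch of the effect, using synthetic labels with the same class counts; the `*_demo` names, the placeholder feature column, and `random_state=0` are assumptions made for illustration and are not part of the assignment code:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as reported above (1813 spam, 2788 non-spam)
labels_demo = np.array([1] * 1813 + [0] * 2788)
# Placeholder single-column features; only the labels matter for this check (assumption)
features_demo = np.arange(len(labels_demo)).reshape(-1, 1)

# Same call shape as the assignment, with a fixed seed so the sketch is repeatable
_, _, y_tr_demo, y_te_demo = train_test_split(
    features_demo, labels_demo, test_size=0.50, stratify=labels_demo, random_state=0
)

# The spam fraction in each half should match the overall fraction closely
print(labels_demo.mean(), y_tr_demo.mean(), y_te_demo.mean())
print(len(y_tr_demo), len(y_te_demo))
```

Without `stratify`, the spam fraction in each half would vary from run to run, which makes accuracy numbers harder to compare.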


In [31]:
# ******PREPROCESSING **********************************
#Split training/test data
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size = 0.50, stratify = label) 
print(X_train.shape)
print(sum(y_train)) #approx half of 1813 for equal split
print(X_train[0,:])
print(X_test[0,:])

#Scale data by training set
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled[0,:])
print(X_test_scaled[0,:])

(2300, 57)
906.0
[ 0.     0.     0.     0.     0.9    0.     0.     0.     0.     0.
  0.     0.     0.     0.     0.     0.     0.     0.     2.7    0.
  0.9    0.     0.     0.     0.     0.     0.     0.     0.     0.
  0.     0.     0.     0.     0.     0.     1.8    0.     0.     0.
  0.     0.     1.8    0.     0.9    0.     0.     0.     0.     0.
  0.     0.281  0.     0.     1.551 13.    76.   ]
[3.300e-01 3.300e-01 9.900e-01 0.000e+00 0.000e+00 6.600e-01 0.000e+00
 0.000e+00 0.000e+00 0.000e+00 0.000e+00 3.300e-01 0.000e+00 0.000e+00
 0.000e+00 0.000e+00 0.000e+00 0.000e+00 2.650e+00 0.000e+00 3.300e-01
 0.000e+00 0.000e+00 0.000e+00 1.990e+00 0.000e+00 0.000e+00 0.000e+00
 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 3.300e-01 0.000e+00
 0.000e+00 0.000e+00 3.300e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00
 0.000e+00 0.000e+00 5.100e-02 0.000e+00 0.000e+00 1.786e+00 2.800e+01
 1.340e+02]
[-0.34154352
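
The scaler above is fit on the training split only and then applied to both splits, so the test set is standardized with training-set statistics. A rough sketch of what that implies, on synthetic data standing in for the spam features (the `*_demo` names, the exponential distribution, and the seed are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic, skewed, non-negative features as a stand-in for the real data (assumption)
rng = np.random.default_rng(0)
X_train_demo = rng.exponential(scale=5.0, size=(200, 3))
X_test_demo = rng.exponential(scale=5.0, size=(200, 3))

# Fit on the training split only, then transform both splits
scaler_demo = StandardScaler().fit(X_train_demo)
train_scaled_demo = scaler_demo.transform(X_train_demo)
test_scaled_demo = scaler_demo.transform(X_test_demo)

# Training columns end up with mean 0 and standard deviation 1 (up to rounding)
print(train_scaled_demo.mean(axis=0))
print(train_scaled_demo.std(axis=0))
# Test columns are only approximately centered, since their own statistics were never used
print(test_scaled_demo.mean(axis=0))
```

Fitting the scaler on the full dataset instead would leak test-set information into preprocessing.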

In [None]:
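#*** TRAINING ******************************

# Train on the standardized features produced by the scaler above
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train_scaled, y_train)

# Predict on the test set, scaled with the training-set statistics
y_pred = svclassifier.predict(X_test_scaled)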

In [None]:
print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))
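
The cell above has no recorded output, so here is a small sketch of how to read what `confusion_matrix` returns. In scikit-learn, rows are the true classes and columns are the predicted classes, so with labels 0 (non-spam) and 1 (spam) the matrix is laid out as `[[TN, FP], [FN, TP]]`. The toy labels below are made up for illustration and are not results from the spam data:

```python
from sklearn.metrics import confusion_matrix

# Toy labels, hypothetical and unrelated to the real predictions
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = cm_demo.ravel()

# Precision and recall for the spam class, as also listed by classification_report
precision_spam = tp / (tp + fp)
recall_spam = tp / (tp + fn)
print(precision_spam, recall_spam)
```

For a spam filter, false positives (legitimate mail flagged as spam) are usually the costlier error, so the spam-class precision is worth checking alongside overall accuracy.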