## SVM Used to Train and Analyze Phishing Data ##

This lab revisits the phishing dataset, but this time we use a Support Vector Machine (SVM) instead of Logistic Regression. Because it follows the same steps as our previous SVM review of the spam dataset, the code below should look familiar.

# Brief Review #

A **hyperplane** is the separator between classes, as seen in this image:

![alt text](http://vbehzadan.com/AISec/SVM1.png)

In Data Science, a _space_ is considered a _set_.  Sets are called _spaces_ because mathematicians think of data in geometric forms.  Example: the room we are in has three dimensions, so any position in it is data in space, or data in a geometric form.

A **prefix margin** (more commonly just called the margin) is the space between the hyperplane and the nearest data being separated.  To optimize the prefix margin, the _goal is to maximize the gap between the nearest points of the two classes_, with the hyperplane sitting in the middle of that gap.  The margin is fit on training data, and the goal is to keep the hyperplane general enough to avoid overfitting or underfitting.  A wider margin tolerates wider variation in new data and so tends to correspond to fewer classification errors.
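
As a rough illustration of the margin idea, the sketch below fits a linear SVM to four made-up points and measures the width of the gap between the two classes. It relies on the standard linear-SVM result that this width equals 2 / ||w||, where w is the learned weight vector (`coef_` in scikit-learn). The toy points, the names `X_toy`, `y_toy`, and `toy_svm`, and the very large `C` are assumptions chosen for illustration; none of them come from the lab itself.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: class -1 sits on x = -1, class +1 sits on x = +1
X_toy = np.array([[-1.0, 0.0], [-1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_toy = np.array([-1, -1, 1, 1])

# A very large C approximates a hard margin (no points allowed inside the gap)
toy_svm = SVC(kernel='linear', C=1e6)
toy_svm.fit(X_toy, y_toy)

w = toy_svm.coef_[0]                     # learned weight vector
margin_width = 2.0 / np.linalg.norm(w)   # width of the gap between the classes
print('Margin width: %.3f' % margin_width)
```

Since the two classes sit two units apart along the first axis, the printed width should come out close to 2.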

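The training points nearest the hyperplane are known as **support vectors**.  If each one is a data point, why is it called a vector?  Because every data point is stored as a vector of feature values, one entry per feature.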
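
To make support vectors concrete, here is a minimal sketch on made-up data where each class has two points near the boundary and one far away. It uses scikit-learn's `support_` and `support_vectors_` attributes on a fitted `SVC`; the toy points and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two points per class near the boundary, one far away
X_toy = np.array([[-1.0, 0.0], [-1.0, 1.0], [-3.0, 0.0],
                  [1.0, 0.0], [1.0, 1.0], [3.0, 0.0]])
y_toy = np.array([-1, -1, -1, 1, 1, 1])

toy_svm = SVC(kernel='linear', C=1e3)
toy_svm.fit(X_toy, y_toy)

# support_ holds the row indices of the support vectors,
# support_vectors_ holds the points themselves (one feature vector per row)
print('Support vector indices:', toy_svm.support_)
print('Support vectors:')
print(toy_svm.support_vectors_)
```

The far-away points (rows 2 and 5) should not appear among the support vectors, which shows that only the points nearest the hyperplane decide where it goes.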

The prefix margin corresponds to the $μ$ (mu) symbol in the above equation.  It interacts with the bias to create the boundary (margin) for positive value identification/labeling.  Without the bias term the hyperplane is forced through the origin, so it may pass through or overlap data which could otherwise be positively identified as one value or the other (and so on, based on the number of features being evaluated).

The bias ($β$), like the weights, does not change once training is complete.  To update the bias, the model needs to be retrained, which is how accuracy is retained or improved as the data changes.  This happens frequently in the real world.

The trainer / designer does not usually identify $μ$.  The model may be able to identify this itself.
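
The sketch below shows the weights and bias as fixed numbers after training, and how they alone produce a prediction. It assumes that the bias $β$ from these notes corresponds to scikit-learn's `intercept_` and the weights to `coef_`; that mapping, the toy data, and the variable names are my assumptions rather than something the lab states.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data, not the phishing dataset
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y_toy = np.array([-1, -1, 1, 1])

toy_svm = SVC(kernel='linear', C=1.0)
toy_svm.fit(X_toy, y_toy)

w = toy_svm.coef_[0]        # weights, fixed once fit() returns
b = toy_svm.intercept_[0]   # bias, also fixed once fit() returns

# Score new points by hand and compare with scikit-learn's own scores
X_new = np.array([[0.5, 0.5], [3.5, 3.5]])
manual_scores = X_new @ w + b
print('Manual scores: ', manual_scores)
print('Library scores:', toy_svm.decision_function(X_new))
print('Predicted labels:', toy_svm.predict(X_new))
```

A negative score lands on one side of the hyperplane and a positive score on the other. Nothing in this calculation changes until `fit()` is called again on new data.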

In [1]:
# Setting up python coding environment
# Will return later to add plotting features

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split

# Load the CSV as an integer array: every column but the last is a feature, the last column is the label
data = np.genfromtxt(os.path.join('Data','phishing_dataset.csv'), delimiter=',', dtype=np.int32)
samples = data[:,:-1]
targets = data[:, -1]

# Once samples and targets are identified, it's time to split the data and train!
X_train, X_test, y_train, y_test = train_test_split(
         samples, targets, test_size=0.3, random_state=0)
# X = input features; y = target labels
# test_size is a hyperparameter -- chosen by the ML engineer
# Typically go with a 20/80 or 30/70 split for test/train
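
To see what the split does without needing the CSV, here is a small sketch on randomly generated stand-in data. The arrays `demo_samples` and `demo_targets` are hypothetical; only `train_test_split` and its `test_size` and `random_state` parameters are taken from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 4 features, binary labels
rng = np.random.RandomState(0)
demo_samples = rng.randint(-1, 2, size=(100, 4))
demo_targets = rng.choice([-1, 1], size=100)

# Same 30/70 test/train split as above, done twice with the same seed
split_a = train_test_split(demo_samples, demo_targets, test_size=0.3, random_state=0)
split_b = train_test_split(demo_samples, demo_targets, test_size=0.3, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print('Train shape:', X_tr.shape, 'Test shape:', X_te.shape)

# The same seed gives the same rows in the same order on both calls
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print('Splits identical:', same_split)
```

With 100 rows, 70 go to training and 30 to testing, and repeating the call with the same `random_state` reproduces the split exactly.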

In [2]:
from sklearn.svm import SVC

# SVC learns the weights (and bias) when fit() is called
# SVC stands for Support Vector Classification
# kernel indicates what f is in the equation applied to the data to find the hyperplane
# C is the regularization parameter
# random_state seeds a pseudo-random number generator used to shuffle data.  VERY IMPORTANT GOING FORWARD
#       Feeding it the same seed lets us reproduce the same pseudo-random numbers on every run.
#       A seed is the starting input for the generator.  It is typically not known to the user (pseudo random); a fixed seed replaces the 'time' element in most cases.
#       Note: in SVC the shuffle is only used for probability estimates, so random_state is ignored unless probability=True.
svm = SVC(kernel='linear', C=1.0, random_state=1)
svm.fit(X_train, y_train)

y_pred = svm.predict(X_test)
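
To get a feel for the regularization parameter `C`, the sketch below refits a linear SVM with a few different values and counts the support vectors each time. It runs on synthetic data from `make_classification` as a stand-in for the phishing features; the data, the chosen `C` values, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for the phishing features
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

support_counts = {}
for C in (0.01, 1.0, 10.0):
    demo_svm = SVC(kernel='linear', C=C)
    demo_svm.fit(X_demo, y_demo)
    # n_support_ holds the number of support vectors for each class
    support_counts[C] = int(demo_svm.n_support_.sum())
    print('C=%-5s support vectors: %d' % (C, support_counts[C]))
```

A smaller `C` tolerates more points inside the margin, which usually means more support vectors; a larger `C` penalizes those violations more heavily. The exact counts depend on the data, so treat them as a trend rather than a rule.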


In [3]:
from sklearn.metrics import accuracy_score

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
# Accuracy is measured at test time (on X_test), not training time!
# This is doing better than perceptron because: 

Misclassified samples: 257
Accuracy: 0.92
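
As a possible next step, the sketch below compares training accuracy with test accuracy and breaks the misclassified count down with a confusion matrix. It uses synthetic data in place of the phishing CSV, so its numbers will not match the output above; the data and variable names are assumptions, while `accuracy_score` and `confusion_matrix` are standard scikit-learn functions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for the phishing dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

demo_svm = SVC(kernel='linear', C=1.0).fit(X_tr, y_tr)
pred_te = demo_svm.predict(X_te)

# Accuracy on data the model has seen vs. data it has not
train_acc = accuracy_score(y_tr, demo_svm.predict(X_tr))
test_acc = accuracy_score(y_te, pred_te)
print('Training accuracy: %.2f' % train_acc)
print('Test accuracy:     %.2f' % test_acc)

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_te, pred_te)
print(cm)

# Off-diagonal entries are the misclassified samples
misclassified = int(cm.sum() - cm.trace())
print('Misclassified samples: %d' % misclassified)
```

A large gap between training and test accuracy would hint at overfitting, and the confusion matrix shows whether the errors lean toward one class.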
