### Support Vector Machine (Vincent and Tian)

Support vector machines (SVMs) use functions called kernels, which measure the similarity between two observations (conceptually, an inner product in a transformed feature space). The SVM algorithm then finds the decision boundary that maximizes the margin, that is, the distance between the boundary and the closest members of each class (the support vectors).
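The maximum-margin idea can be sketched on a tiny, made-up dataset. The points below and the very large `C` (used here to approximate a hard margin) are illustrative assumptions, not part of the HMDA analysis in this notebook.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (invented for illustration only).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# For a linear kernel the boundary is w.x + b = 0 and the margin width is 2 / ||w||.
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)
```

For this toy data the geometry gives a margin of 5/√2 ≈ 3.54 (the distance from the point (3, 3) to the line through (1, 0) and (0, 1)), which the fitted model should approximately reproduce.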

An SVM with a linear kernel behaves much like logistic regression: both learn a linear decision boundary. In practice, therefore, the benefit of SVMs typically comes from using non-linear kernels to model non-linear decision boundaries.
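As a rough illustration of this point, the sketch below compares logistic regression, a linear-kernel SVC, and an RBF-kernel SVC on scikit-learn's synthetic `make_circles` data, where no straight line separates the classes. The dataset, its parameters, and the default hyperparameters are assumptions chosen for the demo.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: a problem that is not linearly separable.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "logistic regression": LogisticRegression(),
    "SVC (linear kernel)": SVC(kernel="linear"),
    "SVC (RBF kernel)": SVC(kernel="rbf"),
}

# Fit each model and record its held-out accuracy.
scores = {}
for name, model in models.items():
    scores[name] = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

The two linear models should score similarly to each other and close to chance on this data, while the RBF kernel should separate the rings well.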

Strengths: SVMs can model non-linear decision boundaries, and there are many kernels to choose from. They are also fairly robust against overfitting, especially in high-dimensional spaces.

Weaknesses: SVMs are memory intensive, trickier to tune because so much depends on picking the right kernel, and do not scale well to larger datasets. Currently in the industry, random forests are usually preferred over SVMs.
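One common way to handle the kernel-selection problem is a cross-validated grid search. The sketch below is a minimal example on synthetic `make_moons` data; the parameter grid, the scaling step, and the dataset are assumptions for illustration and were not tuned for the HMDA data used in this notebook.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data with a curved boundary.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Scaling inside the pipeline keeps each CV fold's scaler fit on its training part only.
pipeline = make_pipeline(StandardScaler(), SVC())

# Hypothetical grid: make_pipeline names the SVC step "svc".
param_grid = {
    "svc__kernel": ["linear", "rbf", "poly"],
    "svc__C": [0.1, 1, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because every kernel and `C` combination is refit once per fold, the cost of this search grows quickly with dataset size, which is one reason SVM tuning becomes expensive on larger datasets.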

Reference: http://scikit-learn.org/stable/modules/svm.html#classification

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
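from sklearn.svm import LinearSVR, LinearSVC, SVC
# cross_val_score lives in model_selection; sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import cross_val_score
from sklearn import svm
import pickle as cPickle  # the standard pickle module already uses the C implementation on Python 3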



In [2]:
# LinearSVR on the min-max scaled data
df_minimax = pd.read_csv('./data/ny_hmda_2015_minmax.csv')

# Separate the features from the target column
x_minimax = np.array(df_minimax.drop(['action_taken'], axis=1))
y_minimax = np.array(df_minimax['action_taken'])
x_train, x_test, y_train, y_test = train_test_split(x_minimax, y_minimax, test_size=0.2)
# LinearSVR is a regressor, so score() returns R^2 rather than classification accuracy
print(LinearSVR().fit(x_train, y_train).score(x_test, y_test))

-0.47072697787


In [5]:
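# LinearSVC on the min-max scaled data (reuses the split from the previous cell)
linSVC = LinearSVC()
linSVC.fit(x_train, y_train)
print(linSVC.score(x_test, y_test))
# Persist the fitted model; the with-block closes the file handle
with open('models/linear_svc_model.p', 'wb') as f:
    cPickle.dump(linSVC, f)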

# print(LinearSVC().fit(x_train, y_train).score(x_test,y_test))

0.823913736322


In [None]:
# SVC with 10-fold cross-validation on the robust-scaled data
df_robust = pd.read_csv('./data/ny_hmda_2015_robust.csv')

x_robust = np.array(df_robust.drop(['action_taken'], axis=1))
y_robust = np.array(df_robust['action_taken'])
scores_robust = cross_val_score(svm.SVC(), x_robust, y_robust, scoring='accuracy', cv=10)
print(scores_robust)

In [None]:
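# SVC with 10-fold cross-validation on the min-max scaled data
# (file name aligned with the 'minmax' spelling used in the other cells)
df_minimax = pd.read_csv('./data/ny_hmda_2015_minmax.csv')

x_minimax = np.array(df_minimax.drop(['action_taken'], axis=1))
y_minimax = np.array(df_minimax['action_taken'])
scores_minimax = cross_val_score(svm.SVC(), x_minimax, y_minimax, scoring='accuracy', cv=10)
print(scores_minimax)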

In [None]:
# SVC with 10-fold cross-validation on the normalized data
df_normalize = pd.read_csv('./data/ny_hmda_2015_normalize.csv')

x_normalize = np.array(df_normalize.drop(['action_taken'], axis=1))
y_normalize = np.array(df_normalize['action_taken'])
scores_normalize = cross_val_score(svm.SVC(), x_normalize, y_normalize, scoring='accuracy', cv=10)
print(scores_normalize)

In [None]:
# SVC (default RBF kernel) on the min-max scaled data
df = pd.read_csv('./data/ny_hmda_2015_minmax.csv')
x = np.array(df.drop(['action_taken'], axis=1))
y = np.array(df['action_taken'])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

clf = svm.SVC()
clf.fit(x_train, y_train)

# Persist the fitted model; the with-block closes the file handle
with open('models/svc_model.p', 'wb') as f:
    cPickle.dump(clf, f)
accuracy = clf.score(x_test, y_test)
print(accuracy)