# Support Vector Machines
Classifing student success data by means of the [Support Vector Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) from the sklearn module.

## Import Data
Import the data into a pandas dataframe. Get dummy variables for each categorical predictor in the data set and return the design matirx. Create a normalized and standardized design matrix as well to compare model preformance. Convert response variable to three classes *0 , 1,* and *2*.

In [1]:
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.utils.extmath import cartesian
from sklearn import preprocessing
from sklearn import metrics, svm

df = pd.read_csv('student-por2.csv')
df = pd.get_dummies(df)#, drop_first=True)

def response_conv(arr):
    new = []
    for i in arr:
        if (i > 0 and i < 10):           # condition where student failed
            new.append(0)                 
                                          
        elif (i >= 10):                   # condition where student passed
            new.append(1)                 
    
        else:                             # condition where student received an incomplete
            new.append(2)
    return(new)                           # 1-dimensional response varibale returned
"""
def var_if_fac(data_frame, ind_var):
    index = data_frame.columns.get_loc(ind_var)
    mat = data_frame.as_matrix()
    return(vif(mat, index))

no_response = df.drop('G3',1)
arr1 = []
arr2 = list(no_response)

for i in list(no_response):
    arr1.append(var_if_fac(no_response,i))
vif_df = pd.DataFrame(list(zip(arr2,arr1)),columns = ['Ind_Var','VIF'])

drop_col_names = []

vifs = list(vif_df.VIF)
predictors = list(vif_df.Ind_Var)

for i in range(len(predictors)):
    if vifs[i] >= 10:
        drop_col_names.append(predictors[i])
        
df = df.drop(drop_col_names,1)  
"""
X = df.drop('G3',1)
y = response_conv(list(df.G3))
X_scale = preprocessing.scale(X)
X_norm = preprocessing.normalize(X)
#pd.DataFrame(X_norm).to_csv("norm.csv")

## Test/Training Sets and Optimal Penalty Parameter, Kernel, and Gamma
To train the model and later test, we must split each design matrix and response vector into training and test sets. The fucntion *opt* finds the optimal parameters for *C*, *Gamma*, and *Kernel* to be used in the model. Optimal is decided based on the parameters used in the model that returns the smallest mean squared error.

In [2]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X, y, test_size=0.33, random_state=42)
X2_train, X2_test, y2_train, y2_test = train_test_split(X_scale, y, test_size=0.33, random_state=42)
X3_train, X3_test, y3_train, y3_test = train_test_split(X_norm, y, test_size=0.33, random_state=42)
start_time = time.time()
combos = cartesian([['linear','rbf'],[0.1,1,10,100,1000],[0.1,0.01,0.001,0.0001]])
def opt(X,y):
    mse = []

    for k,c,g in combos:
        svc = svm.SVC(C=float(c),kernel=str(k),gamma=float(g),decision_function_shape='ovo')
        scores = cross_val_score(svc, X, y, cv=10, scoring='neg_mean_squared_error')
        mse.append(scores.mean())
    
    #MSE = [1 - x for x in cv_scores]
    opt_ = combos[mse.index(min(mse))]
    return(opt_)

k1,c1,g1 = opt(X1_train,y1_train)
k2,c2,g2 = opt(X2_train,y2_train)
k3,c3,g3 = opt(X3_train,y3_train)

print ("The optimal kernel, penalty parameter and gamma is %s, %r and %r respectively for Non-standardized design matrix." % (k1,float(c1),float(g1)))
print ("The optimal kernel, penalty parameter and gamma is %s, %r and %r respectively for standardized design matrix." % (k2,float(c2),float(g2)))
print ("The optimal kernel, penalty parameter and gamma is %s, %r and %r respectively for normalized design matrix." % (k3,float(c3),float(g3)))
print("Run time: %r minutes" % (round((int(time.time() - start_time)/60),2)))

The optimal kernel, penalty parameter and gamma is rbf, 1.0 and 0.001 respectively for Non-standardized design matrix.
The optimal kernel, penalty parameter and gamma is rbf, 10.0 and 0.001 respectively for standardized design matrix.
The optimal kernel, penalty parameter and gamma is rbf, 10.0 and 0.1 respectively for normalized design matrix.
Run time: 11.22 minutes


## Fit and Predict
After tuning model parameters to be optimal we fit each design matrix to its optimal model. Predictions are made and returned in a data frame for comparison.

In [3]:
SVM1 = svm.SVC(C=float(c1),kernel=str(k1),gamma=float(g1),decision_function_shape='ovo').fit(X1_train,y1_train)
SVM2 = svm.SVC(C=float(c2),kernel=str(k2),gamma=float(g2),decision_function_shape='ovo').fit(X2_train,y2_train)
SVM3 = svm.SVC(C=float(c3),kernel=str(k3),gamma=float(g3),decision_function_shape='ovo').fit(X3_train,y3_train)


svm1_pred = SVM1.predict(X1_test)
svm2_pred = SVM2.predict(X2_test)
svm3_pred = SVM3.predict(X3_test)

pred = pd.DataFrame(list(zip(y1_test, svm1_pred,svm2_pred,svm3_pred)), columns=['y_act','y_svm','y_svm_stand','y_svm_norm'])
pred.index.name = 'Obs'
pred

Unnamed: 0_level_0,y_act,y_svm,y_svm_stand,y_svm_norm
Obs,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,1,1,1,1
1,1,1,1,1
2,1,1,1,1
3,1,1,1,1
4,1,1,1,1
5,1,1,1,1
6,1,1,1,1
7,0,1,0,1
8,1,1,1,1
9,1,1,1,1


## Results
Accuracy, confusion matrix, and classification reports are returned for each design matirx.

In [4]:
cm_svm1 = pd.DataFrame(metrics.confusion_matrix(y1_test, svm1_pred), index = ['Fail(0)','Pass(1)','Inc(2)'],columns=['Fail(0)','Pass(1)','Inc(2)'])
cm_svm2 = pd.DataFrame(metrics.confusion_matrix(y2_test, svm2_pred), index = ['Fail(0)','Pass(1)','Inc(2)'],columns=['Fail(0)','Pass(1)','Inc(2)'])
cm_svm3 = pd.DataFrame(metrics.confusion_matrix(y3_test, svm3_pred), index = ['Fail(0)','Pass(1)','Inc(2)'],columns=['Fail(0)','Pass(1)','Inc(2)'])
print ("The accuracy of the Non-standardized SVM model is: ", SVM1.score(X1_test,y1_test))
print ("\n")
print ("The accuracy of the standardized SVM model is: ", SVM2.score(X2_test,y2_test))
print ("\n")
print ("The accuracy of the normalized SVM model is: ", SVM3.score(X3_test,y3_test))
print ("\n")
print("Non-standarized SVM Confusion Matrix: \n", cm_svm1)
print ("\n")
print("Standarized SVM Confusion Matrix: \n", cm_svm2)
print ("\n")
print("Normalized SVM Confusion Matrix: \n", cm_svm3)
print ("\n")
print("Classification report for Non-standardized design matrix:\n", metrics.classification_report(y1_test,svm1_pred))
print("\n")
print("Classification report for standardized design matrix:\n", metrics.classification_report(y2_test,svm2_pred))
print("\n")
print("Classification report for Normalized design matrix:\n", metrics.classification_report(y3_test,svm3_pred))

The accuracy of the Non-standardized SVM model is:  0.879069767442


The accuracy of the standardized SVM model is:  0.902325581395


The accuracy of the normalized SVM model is:  0.888372093023


Non-standarized SVM Confusion Matrix: 
          Fail(0)  Pass(1)  Inc(2)
Fail(0)        2       21       0
Pass(1)        0      187       0
Inc(2)         3        2       0


Standarized SVM Confusion Matrix: 
          Fail(0)  Pass(1)  Inc(2)
Fail(0)       10       13       0
Pass(1)        4      183       0
Inc(2)         3        1       1


Normalized SVM Confusion Matrix: 
          Fail(0)  Pass(1)  Inc(2)
Fail(0)        3       20       0
Pass(1)        0      187       0
Inc(2)         3        1       1


Classification report for Non-standardized design matrix:
              precision    recall  f1-score   support

          0       0.40      0.09      0.14        23
          1       0.89      1.00      0.94       187
          2       0.00      0.00      0.00         5

avg /

  'precision', 'predicted', average, warn_for)
