# SVM and SVR Homework

In [1]:
import numpy as np
import pandas as pd
from sklearn.svm import SVC, SVR
from sklearn.metrics import confusion_matrix

## Wage Dataset
- In this homework, we study **SVMs** and **SVRs** on the wage dataset
- The data contains 3000 samples with 11 columns, many of which are categorical
- First, we load the Wage.csv data into a dataframe

In [2]:
wage = pd.read_csv('Wage.csv', index_col=0)
wage.columns

Index(['year', 'age', 'maritl', 'race', 'education', 'region', 'jobclass',
       'health', 'health_ins', 'logwage', 'wage'],
      dtype='object')

In [70]:
wage.shape

(3000, 11)

## Q1 Part A:
- Apart from the numeric columns **year**, **age**, **logwage** and **wage**, all columns are categorical
- Determine which categorical columns are nominal and which are ordinal
- Use the **value_counts()** method to study the levels of these nominal categorical variables and their frequency counts
- For each nominal categorical column, use pd.get_dummies to dummify the column into a dataframe. Don't forget to remove one of the dummified columns (usually the one with the highest frequency)
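
The dummification step can be sketched on a toy column. This is only an illustration under assumed data: the `toy` frame, its `color` column, and the `baseline` variable are made up and are not part of the wage dataset.

```python
import pandas as pd

# Toy stand-in for a nominal column; the values are made up for illustration.
toy = pd.DataFrame({"color": ["red", "blue", "red", "green", "red", "blue"]})

# Frequency count of each category, most frequent first.
counts = toy["color"].value_counts()
print(counts)

# One indicator column per category.
dummies = pd.get_dummies(toy["color"])

# Drop the most frequent category so that it becomes the baseline level.
baseline = counts.idxmax()
dummies = dummies.drop(columns=baseline)
print(dummies.head())
```

Dropping one level avoids having indicator columns that always sum to one; which level to drop is a modelling choice, and the most frequent one is used here only because the exercise suggests it.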

In [3]:
wage.maritl.?

2. Married          2074
1. Never Married     648
4. Divorced          204
5. Separated          55
3. Widowed            19
Name: maritl, dtype: int64

In [4]:
maritl = pd.get_dummies(?)
maritl = maritl.drop(?)
maritl

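| | 1. Never Married | 2. Married | 3. Widowed | 4. Divorced |
|---|---|---|---|---|
| 231655 | 1 | 0 | 0 | 0 |
| 86582 | 1 | 0 | 0 | 0 |
| 161300 | 0 | 1 | 0 | 0 |
| 155159 | 0 | 1 | 0 | 0 |
| 11443 | 0 | 0 | 0 | 1 |
| 376662 | 0 | 1 | 0 | 0 |
| 450601 | 0 | 1 | 0 | 0 |
| 377954 | 1 | 0 | 0 | 0 |
| 228963 | 1 | 0 | 0 | 0 |
| 81404 | 0 | 1 | 0 | 0 |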


In [6]:
wage.race.value_counts()

1. White    2480
2. Black     293
3. Asian     190
4. Other      37
Name: race, dtype: int64

In [7]:
race = pd.get_dummies(?)
race = race.drop(?)
race

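| | 2. Black | 3. Asian | 4. Other |
|---|---|---|---|
| 231655 | 0 | 0 | 0 |
| 86582 | 0 | 0 | 0 |
| 161300 | 0 | 0 | 0 |
| 155159 | 0 | 1 | 0 |
| 11443 | 0 | 0 | 0 |
| 376662 | 0 | 0 | 0 |
| 450601 | 0 | 0 | 1 |
| 377954 | 0 | 1 | 0 |
| 228963 | 1 | 0 | 0 |
| 81404 | 0 | 0 | 0 |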


In [8]:
wage.education.value_counts()

2. HS Grad            971
4. College Grad       685
3. Some College       650
5. Advanced Degree    426
1. < HS Grad          268
Name: education, dtype: int64

In [9]:
education = pd.get_dummies(?)
education = education.drop(?)
education

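| | 1. < HS Grad | 3. Some College | 4. College Grad | 5. Advanced Degree |
|---|---|---|---|---|
| 231655 | 1 | 0 | 0 | 0 |
| 86582 | 0 | 0 | 1 | 0 |
| 161300 | 0 | 1 | 0 | 0 |
| 155159 | 0 | 0 | 1 | 0 |
| 11443 | 0 | 0 | 0 | 0 |
| 376662 | 0 | 0 | 1 | 0 |
| 450601 | 0 | 1 | 0 | 0 |
| 377954 | 0 | 1 | 0 | 0 |
| 228963 | 0 | 1 | 0 | 0 |
| 81404 | 0 | 0 | 0 | 0 |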


In [10]:
wage.region.value_counts()

2. Middle Atlantic    3000
Name: region, dtype: int64

In [11]:
wage.jobclass.value_counts()

1. Industrial     1544
2. Information    1456
Name: jobclass, dtype: int64

In [12]:
wage.health
health = pd.get_dummies(?)
health = health.drop(?)
health.columns = ['Bad Health']
health

| | Bad Health |
|---|---|
| 231655 | 1 |
| 86582 | 0 |
| 161300 | 1 |
| 155159 | 0 |
| 11443 | 1 |
| 376662 | 0 |
| 450601 | 0 |
| 377954 | 1 |
| 228963 | 0 |
| 81404 | 0 |


In [14]:
wage.health_ins
health_ins = pd.get_dummies(?)
health_ins = health_ins.drop(?)
health_ins.columns = []

## Q1 Part B:
- For a binary classification task, we set the target column to be **jobclass**
- For the features, we combine **year**, **age**, **logwage** and the dummified **race**, **education**, **health**, **health_ins** and **maritl** columns into a dataframe called **wageFeature**
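
Combining several frames column-wise can be sketched as follows. The frames `numeric`, `dummy_a` and `dummy_b` and their column names are hypothetical placeholders, not the homework's variables; the only point is that `pd.concat` with `axis=1` aligns rows on the shared index.

```python
import pandas as pd

# Hypothetical numeric columns and two already-dummified blocks sharing one index.
idx = [101, 102, 103]
numeric = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [10, 20, 30]}, index=idx)
dummy_a = pd.DataFrame({"a_yes": [1, 0, 1]}, index=idx)
dummy_b = pd.DataFrame({"b_low": [0, 0, 1], "b_high": [1, 0, 0]}, index=idx)

# axis=1 places the frames side by side, aligning rows on the index.
features = pd.concat([numeric, dummy_a, dummy_b], axis=1)
print(features)
```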

In [71]:
target = wage.jobclass
wageFeature  = pd.concat(?)

In [15]:
wageFeature.head()

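| | year | age | logwage | 2. Black | 3. Asian | 4. Other | 1. < HS Grad | 3. Some College | 4. College Grad | 5. Advanced Degree | Bad Health | No Insurance | 1. Never Married | 2. Married | 3. Widowed | 4. Divorced |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231655 | 2006 | 18 | 4.318063 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 86582 | 2004 | 24 | 4.255273 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 161300 | 2003 | 45 | 4.875061 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 155159 | 2003 | 43 | 5.041393 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 11443 | 2005 | 50 | 4.318063 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |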


## Q1 Part C:
- Choose the **linear** kernel and set $C=10000$. Fit a support vector classifier on (wageFeature, target). What is the training accuracy?
- How many support vectors are there?
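
As a rough sketch of the calls involved, the snippet below fits a linear-kernel classifier on synthetic data and reads off the training accuracy and the support-vector counts. The blobs from `make_blobs` are an assumption made for illustration and are unrelated to the wage data; well-separated clusters are chosen so the large-$C$ fit finishes quickly.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data; well-separated blobs keep the large-C fit fast.
X, y = make_blobs(n_samples=100, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)

clf = SVC(kernel="linear", C=1e4)
clf.fit(X, y)

# Training accuracy: score() is evaluated on the same data used for fitting.
print(clf.score(X, y))

# Number of support vectors per class, and their row positions in X.
print(clf.n_support_)
print(clf.support_)
```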

In [77]:
svm = SVC()
svm.set_params(?,?)
svm.fit(?,?)

svm.score(?,?)

0.5993333333333334

In [78]:
svm.n_support_

array([684, 685], dtype=int32)

In [19]:
from sklearn.model_selection import GridSearchCV

## Q1 Part D:
- Set the kernel to be 'rbf'
- Run a **GridSearchCV** with $C$ ranging over np.linspace(0.01, 1, 10) and $\gamma$ ranging over np.linspace(1e-3, 1, 10), and get the best model
- What are the best hyperparameters $C$, $\gamma$ and the best **CV** accuracy?
- To save time, we set $cv=3$ instead of running ten-fold cross-validation
- Access the **cv_results_** attribute (a Python dict) of the grid object and print out the 'mean_train_score' and 'mean_test_score' of the grid search (to do so, turn on the return_train_score keyword argument of **GridSearchCV**)
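
The grid-search mechanics can be sketched on synthetic data. The data set from `make_classification`, the reduced 3 x 3 grid, and the names `param_grid` and `search` are assumptions chosen so the example runs quickly; they are not the homework's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=4, random_state=0)

# A deliberately small grid (3 x 3) so the sketch runs quickly.
param_grid = {"C": np.linspace(0.01, 1, 3), "gamma": np.linspace(1e-3, 1, 3)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, return_train_score=True)
search.fit(X, y)

print(search.best_params_, search.best_score_)

# cv_results_ is a dict of arrays with one entry per parameter combination.
print(search.cv_results_["mean_train_score"])
print(search.cv_results_["mean_test_score"])
```

Comparing the two score arrays entry by entry shows where the training score keeps rising while the cross-validated score does not.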


In [None]:
svm.set_params(?)

In [80]:
paramDict = {?}
grid = GridSearchCV(?, ?, cv=?, return_train_score = True)
ans = grid? # What to add here?

In [83]:
ans.best_params_

{'C': 1.0, 'gamma': 0.112}

In [84]:
ans.best_score_

0.6213333333333333

In [81]:
grid.cv_results_['mean_test_score']

array([0.51466667, 0.51466667, 0.51466667, 0.51466667, 0.51466667,
       0.51466667, 0.51466667, 0.51466667, 0.51466667, 0.51466667,
       0.532     , 0.56466667, 0.55033333, 0.52566667, 0.51933333,
       0.51466667, 0.51466667, 0.51466667, 0.51466667, 0.51466667,
       0.53833333, 0.59      , 0.57866667, 0.57666667, 0.562     ,
       0.545     , 0.52966667, 0.525     , 0.521     , 0.51933333,
       0.54033333, 0.59466667, 0.591     , 0.586     , 0.57933333,
       0.57433333, 0.565     , 0.55166667, 0.541     , 0.53266667,
       0.54233333, 0.60166667, 0.598     , 0.586     , 0.58133333,
       0.56866667, 0.571     , 0.56366667, 0.563     , 0.55166667,
       0.54333333, 0.60766667, 0.60133333, 0.594     , 0.58233333,
       0.57566667, 0.57066667, 0.56966667, 0.57033333, 0.564     ,
       0.54633333, 0.61333333, 0.60266667, 0.593     , 0.58133333,
       0.577     , 0.57466667, 0.57033333, 0.56633333, 0.56366667,
       0.552     , 0.61633333, 0.60133333, 0.59266667, 0.585  

In [82]:
ans.cv_results_['mean_train_score']

array([0.51466667, 0.51466667, 0.51466667, 0.51466667, 0.51466667,
       0.51466667, 0.51466667, 0.51466667, 0.51466667, 0.51466667,
       0.53266784, 0.62116994, 0.61283985, 0.55767218, 0.52533459,
       0.51533359, 0.51466667, 0.51466667, 0.51466667, 0.51466667,
       0.53800101, 0.65116678, 0.68016745, 0.70166628, 0.68916911,
       0.64217086, 0.59467077, 0.56400301, 0.54166967, 0.53116859,
       0.54083467, 0.66716869, 0.70983429, 0.75283146, 0.78049746,
       0.79232888, 0.77983038, 0.74349996, 0.69233428, 0.65066961,
       0.54466851, 0.68083403, 0.73049837, 0.7764993 , 0.81216672,
       0.84166606, 0.85999823, 0.87283031, 0.87683056, 0.87583164,
       0.54700176, 0.68983336, 0.74766696, 0.7948343 , 0.83250014,
       0.86066706, 0.88666623, 0.90416532, 0.9154994 , 0.92449982,
       0.55116768, 0.69816645, 0.76066679, 0.81000122, 0.84850173,
       0.87766715, 0.89966673, 0.91649949, 0.92700024, 0.93616741,
       0.55166834, 0.7061677 , 0.77233405, 0.82133505, 0.85700

## SVR
- We train an **SVR** for a regression task using the same data set
- We select **logwage** as the continuous target and combine **year**, **age** and the dummified **race**, **education**, **health**, **health_ins**, **maritl** and **jobclass** columns into a new dataframe called wageFeature2
- Our main goal is to test the robustness of **SVR**

In [92]:
jobclass = pd.get_dummies(?)
jobclass = jobclass.drop(?)
logwage = wage.logwage
wageFeature2  = pd.concat(?)

In [87]:
from sklearn.linear_model import Ridge

## Q2 Part A:
- Compare a 3-fold **GridSearchCV** on **Ridge** regression and a 3-fold **GridSearchCV** on **SVR** (using rbf kernel)
- For the **Ridge** part, use the $\alpha$ range np.linspace(0.01,100,10)
- For **SVR**, let **C** range in np.linspace(1,100,20), and **gamma** range in np.linspace(1e-4, 1e-2, 10)
- For each of them, report the best hyperparameters, then fit the best model on the full data (wageFeature2, logwage) and report its $R^2$
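
A minimal sketch of this comparison on synthetic data follows. The data from `make_regression`, the coarser grids, and the names `ridge_search`, `svr_grid` and `svr_search` are illustrative assumptions, not the homework's variables or results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_regression(n_samples=120, n_features=3, noise=5.0, random_state=0)

# Coarser grids than in the exercise, to keep the run short.
ridge_search = GridSearchCV(Ridge(), {"alpha": np.linspace(0.01, 100, 5)}, cv=3)
ridge_search.fit(X, y)

svr_grid = {"C": np.linspace(1, 100, 4), "gamma": np.linspace(1e-4, 1e-2, 3)}
svr_search = GridSearchCV(SVR(kernel="rbf"), svr_grid, cv=3)
svr_search.fit(X, y)

for name, search in [("ridge", ridge_search), ("svr", svr_search)]:
    # With the default refit=True, best_estimator_ is already refit on all of X, y.
    best = search.best_estimator_
    print(name, search.best_params_, best.score(X, y))
```

Note that `best_score_` is a cross-validated $R^2$, while `best.score(X, y)` here is an $R^2$ on the training data, so the two numbers are not directly comparable.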

In [88]:
ridge = Ridge()
paramDict = {?}
grid = GridSearchCV(?, ?, cv=?, return_train_score = True)
ans_ridge = grid?  # What to add here?

In [90]:
ans_ridge.best_params_

{'alpha': 11.12}

In [94]:
ridge_best = ans_ridge.best_estimator_
ridge_best.fit(?,?)
ridge_best.score(?,?)

0.3748125927384242

In [95]:
svr = SVR()
svr.set_params(kernel=?)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [96]:
from sklearn.model_selection import GridSearchCV
paramDict = {?}
grid = GridSearchCV(?, ?, cv=?, return_train_score = True)
ans_svr  = grid? # What to add here?

In [97]:
ans_svr.best_params_

{'C': 16.63157894736842, 'gamma': 0.0012000000000000001}

In [98]:
ans_svr.best_score_

0.38709779738007666

In [99]:
svr_best = ans_svr.best_estimator_
svr_best.fit(?,?)
svr_best.score(?, ?)

## Q2 Part B:
- We would like to test the performance of the models when an outlier contaminates the original data
- Use the copy() method of the pandas Series to duplicate logwage into a new Series called logwage2. Then change the value of the last sample to 400 using logwage2.iloc[-1] = 400
- Repeat the procedure in Part A using (wageFeature2, logwage2) to do grid-search CV
- Use the best models suggested by **GridSearchCV** to fit on the altered data set (wageFeature2, logwage2)
- Measure the performance (using $R^2$) on the unaltered data set (wageFeature2, logwage). What do you find? Please summarize what you find in plain English
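
The contamination procedure can be sketched on synthetic data. Everything here is an assumption made for illustration: the one-feature data, the fixed hyperparameters (no grid search), and the names `y_bad`, `r2_ridge` and `r2_svr`. It shows only the mechanics of fitting on a contaminated target and scoring against the clean one.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = pd.DataFrame({"x": rng.uniform(0, 1, 200)})
y = pd.Series(2 * X["x"] + rng.normal(0, 0.1, 200))

# copy() is a deep copy by default, so editing y_bad leaves y untouched.
y_bad = y.copy()
y_bad.iloc[-1] = 400

# Fit on the contaminated target, then score against the clean target.
ridge = Ridge(alpha=1.0).fit(X, y_bad)
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y_bad)
r2_ridge = ridge.score(X, y)
r2_svr = svr.score(X, y)
print(r2_ridge, r2_svr)
```

An $R^2$ below zero means the model predicts the clean target worse than a constant equal to its mean would.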

In [104]:
logwage2 = logwage.copy()  # copy() makes a deep copy by default, so logwage itself is not modified
logwage2.iloc[-1] = 400

In [101]:
paramDict = {?}
grid = GridSearchCV(?,?, cv=3, return_train_score = True)
ans_ridge = grid.fit(?,?)

In [102]:
ans_ridge.best_score_

-1.7861383852102972

In [103]:
ridge_best = ans_ridge.best_estimator_
?

-1.10746617885601

In [105]:
paramDict = {?}
grid = GridSearchCV(?, ?, cv=?, return_train_score = True)
ans_svr  = ? # what to fill here?

In [106]:
svr_best = ans_svr.best_estimator_
?

0.4023013936768253

## Q2 Part C:
- With the best **SVR** model from Part B, fit on (wageFeature2, logwage) and report the row indexes of the support vectors
- Set the model's **epsilon** to $1$, fit on (wageFeature2, logwage) again, and report the row indexes of the support vectors
- Explain why the number of support vectors drops so drastically
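
Reading the support-vector indexes for two values of epsilon might look like the sketch below. The synthetic data, the fixed $C$, and the `n_sv` dictionary are assumptions for illustration and do not reproduce the homework's model.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.3, 200)

n_sv = {}
for eps in (0.1, 1.0):
    model = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)
    # support_ holds the row positions of the support vectors in X.
    n_sv[eps] = len(model.support_)
    print(eps, model.support_.shape)
    print(model.support_)
```

Because `support_` contains row positions rather than index labels, use `wageFeature2.index[svr_best.support_]` if the original row labels are wanted.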

In [112]:
svr_best.set_params(epsilon=0.1)
?

(1953,)


array([   0,    1,    2, ..., 2996, 2997, 2999], dtype=int32)

In [113]:
?

(35,)


array([  52,   76,  236,  269,  358,  369,  388,  499,  503,  545,  577,
        642,  947, 1112, 1229, 1271, 1281, 1293, 1325, 1419, 1476, 1829,
       2038, 2148, 2191, 2247, 2423, 2636, 2664, 2670, 2692, 2712, 2796,
       2880, 2925], dtype=int32)

In [None]:
## Summary (In Plain English)
?