## Support Vector Machine Classifiers with Base Features
The first thing I always try is a support vector machine. From the data exploration we saw the boundary was nonlinear, so a nonlinear kernel might do better here.

In [2]:
# load stuff we will need
import pandas as pd
import math

from kinecture import ml_helper

# When we make changes to modules, automatically reload the modules
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Load data and generate features
1. Load data, clean, make features.
2. Convert data to a form sklearn understands: X (observations), y (labels)

In [3]:
# load data, generate features
xls = pd.ExcelFile('data/allData-filtered.xlsx')
train_dataframe = xls.parse('filtered-noFormulas')
test09_dataframe = xls.parse('09test')
test10_dataframe = xls.parse('10test')
test11_dataframe = xls.parse('11test')

print("train instances:", len(train_dataframe))
print("test09 instances:", len(test09_dataframe))
print("test10 instances:", len(test10_dataframe))
print("test11 instances:", len(test11_dataframe))

train instances: 8906
test09 instances: 1108
test10 instances: 1068
test11 instances: 1149


In [5]:
# Generate features (determined from data_exploration)
train_features = ml_helper.gen_features(train_dataframe)
test09_features = ml_helper.gen_features(test09_dataframe)
test10_features = ml_helper.gen_features(test10_dataframe)
test11_features = ml_helper.gen_features(test11_dataframe)

test09_features['Truth'] = ml_helper.remap_labels({"SI-FIX": "SI"}, 
                                                 test09_features['Truth'])
test10_features['Truth'] = ml_helper.remap_labels({"SI-FIX": "SI"}, 
                                                 test10_features['Truth'])
test11_features['Truth'] = ml_helper.remap_labels({"SI-FIX": "SI"}, 
                                                 test11_features['Truth'])

In [6]:
train_features = ml_helper.clean_data(train_features)
test09_features = ml_helper.clean_data(test09_features)
test10_features = ml_helper.clean_data(test10_features)
test11_features = ml_helper.clean_data(test11_features)

In [7]:
# convert data into a form sklearn likes
train_sklearn = ml_helper.convert_features_for_sklearn(train_features)
test09_sklearn = ml_helper.convert_features_for_sklearn(test09_features)
test10_sklearn = ml_helper.convert_features_for_sklearn(test10_features)
test11_sklearn = ml_helper.convert_features_for_sklearn(test11_features)

testall_sklearn = test09_sklearn.append(test10_sklearn).append(test11_sklearn)

### Evaluate Linear Classifier Using Test Sets
I start by building a classifier with a simple linear decision boundary.

I adjusted the importance of each class to be inversely proportional to its frequency (`class_weight='auto'`). This pushes recall toward being roughly equal across classes, even though there are many more TA and SI instances than S instances. Alternatively, I could make only the S class more important by setting `class_weight={0: 3}`, which makes class 0 (S) three times as important as SI or TA.
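To make the weighting concrete, here is a minimal sketch of the inverse-frequency heuristic that newer scikit-learn releases expose as `class_weight='balanced'`: each class gets the weight `n_samples / (n_classes * class_count)`. The label counts are toy values chosen for illustration, not this notebook's data, and the older `'auto'` setting used here may compute slightly different weights.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: n_samples / (n_classes * count)} for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Toy labels (hypothetical counts): S is the rarest class
toy_labels = ["S"] * 2 + ["SI"] * 4 + ["TA"] * 6
weights = balanced_class_weights(toy_labels)
for label in sorted(weights):
    print(label, round(weights[label], 2))
```

In this toy case the rarest class receives the largest weight, so misclassifying one of its instances costs the classifier more than misclassifying an instance of a common class.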

Best results for all test sessions are as follows (using C=1):


             S   SI   TA
        S  669  232  147
       SI  300  624  195
       TA  139  397  622

             precision    recall  f1-score   support

          S       0.60      0.64      0.62      1048
         SI       0.50      0.56      0.53      1119
         TA       0.65      0.54      0.59      1158
      total       0.58      0.58      0.58      3325

We see an average recall of 58%, which is not good and suggests that better features are needed.
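As a sanity check, the per-class numbers in the report can be recomputed from the confusion matrix above. This sketch assumes rows are true classes and columns are predicted classes, which is consistent with the row sums matching the support column.

```python
labels = ["S", "SI", "TA"]
# Rows are true classes, columns are predicted classes (from the table above)
confusion = [
    [669, 232, 147],
    [300, 624, 195],
    [139, 397, 622],
]

recall = {}
precision = {}
for i, label in enumerate(labels):
    true_total = sum(confusion[i])                      # support for this class
    predicted_total = sum(row[i] for row in confusion)  # column sum
    recall[label] = confusion[i][i] / true_total
    precision[label] = confusion[i][i] / predicted_total

# Unweighted mean of the per-class recalls
macro_recall = sum(recall.values()) / len(labels)
for label in labels:
    print(f"{label:>2}  precision={precision[label]:.2f}  recall={recall[label]:.2f}")
print(f"macro-averaged recall: {macro_recall:.2f}")
```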

In [9]:
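from sklearn import svm
# note: sklearn.grid_search was removed in scikit-learn 0.20;
# newer versions provide GridSearchCV in sklearn.model_selection
from sklearn.grid_search import GridSearchCV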

# use grid search to find best hyperparameters
# Set the parameters by cross-validation
tuned_parameters = [{'C': [0.1, 0.75, 1, 5, 10]}]

# build LinearSVC classifier using training data
classifier = GridSearchCV(svm.LinearSVC(class_weight='auto'), tuned_parameters)
classifier.fit(train_sklearn.X, train_sklearn.y)

print("Best params for LinearSVC")
print(classifier.best_params_)
print()

train_predictions = classifier.predict(train_sklearn.X)
ml_helper.report_accuracy(train_predictions, train_sklearn.y, header="results for train")

testall_predictions = classifier.predict(testall_sklearn.X)
ml_helper.report_accuracy(testall_predictions, testall_sklearn.y, header="results for testall")



Best params for LinearSVC
{'C': 0.75}

results for train
             S   SI   TA
        S 3226  805  196
       SI  950 1918  365
       TA  376  466  604

             precision    recall  f1-score   support

          S       0.71      0.76      0.73      4227
         SI       0.60      0.59      0.60      3233
         TA       0.52      0.42      0.46      1446

avg / total       0.64      0.65      0.64      8906

results for testall
             S   SI   TA
        S  648  375  135
       SI  200  619  300
       TA  167  232  649

             precision    recall  f1-score   support

          S       0.64      0.56      0.60      1158
         SI       0.50      0.55      0.53      1119
         TA       0.60      0.62      0.61      1048

avg / total       0.58      0.58      0.58      3325



### Build a classifier with nonlinear kernels
Since we observed that, at least for pairs of variables, the features were not linearly separable, a more complex decision boundary may perform better.

Results on the combined test sets were as follows:

             S   SI   TA
        S  545  350  153
       SI  219  711  189
       TA  162  380  616

             precision    recall  f1-score   support

          S       0.59      0.52      0.55      1048
         SI       0.49      0.64      0.56      1119
         TA       0.64      0.53      0.58      1158

      total       0.58      0.56      0.56      3325
      
With the nonlinear classifier we see slightly lower average recall on the test sets (0.56 versus 0.58 for the linear classifier), suggesting some overfitting to the training data.

Overall, my suggestion would be to focus on improving the features, since increasing the model complexity does not seem to be helping.
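For intuition about the `gamma` values searched below, here is a small sketch of the RBF kernel, `exp(-gamma * ||x - y||^2)`, which is the default kernel of `svm.SVC`. The points are hypothetical and not taken from the data set. How large a "large" `gamma` is depends on the scale of the features, which this notebook does not show.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


origin = (0.0, 0.0)
nearby = (0.1, 0.0)   # hypothetical points, not taken from the data set
distant = (1.0, 0.0)

# Print the similarity to a nearby and a distant point for several gammas
for gamma in (0.1, 1, 10):
    print(gamma,
          round(rbf_kernel(origin, nearby, gamma), 3),
          round(rbf_kernel(origin, distant, gamma), 5))
```

As `gamma` grows, similarity to distant points falls toward zero, so each training point only influences its immediate neighbourhood and the decision boundary can become more irregular.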

In [10]:
from sklearn import svm

# use grid search to find best hyperparameters
# Set the parameters by cross-validation
tuned_parameters = [{'C': [0.1, 0.5, 1], 'gamma': [10, 20, 30]}]

classifier = GridSearchCV(svm.SVC(class_weight='auto'), tuned_parameters)
classifier.fit(train_sklearn.X, train_sklearn.y)

print("Best params for SVC with rbf kernel")
print(classifier.best_params_)
print()

train_predictions = classifier.predict(train_sklearn.X)
ml_helper.report_accuracy(train_predictions, train_sklearn.y, header="train")

testall_predictions = classifier.predict(testall_sklearn.X)
ml_helper.report_accuracy(testall_predictions, testall_sklearn.y, header="testall")



Best params for SVC with rbf kernel
{'C': 0.1, 'gamma': 10}

train
             S   SI   TA
        S 3058  884  285
       SI  893 1925  415
       TA  348  473  625

             precision    recall  f1-score   support

          S       0.71      0.72      0.72      4227
         SI       0.59      0.60      0.59      3233
         TA       0.47      0.43      0.45      1446

avg / total       0.63      0.63      0.63      8906

testall
             S   SI   TA
        S  617  411  130
       SI  188  723  208
       TA  164  354  530

             precision    recall  f1-score   support

          S       0.64      0.53      0.58      1158
         SI       0.49      0.65      0.55      1119
         TA       0.61      0.51      0.55      1048

avg / total       0.58      0.56      0.56      3325



## Next Steps
It looks like the features we have don't separate the data very well. The next steps are to improve the features, check whether more data would help, and, if so, see whether we can collect more.
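One way to check whether more data would help is a learning curve. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_classification` as a stand-in for the real features, the newer `sklearn.model_selection` module (scikit-learn 0.18 or later) rather than the `sklearn.grid_search` import used above, and `class_weight="balanced"` in place of `'auto'`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real features (assumption: 3 classes, 8 features)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

train_sizes, train_scores, valid_scores = learning_curve(
    LinearSVC(C=0.75, class_weight="balanced", max_iter=5000),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=3,
    scoring="recall_macro",
)

# Average over the cross-validation folds for each training-set size
for size, train, valid in zip(train_sizes,
                              train_scores.mean(axis=1),
                              valid_scores.mean(axis=1)):
    print(f"{int(size):4d} samples  train={train:.2f}  validation={valid:.2f}")
```

If the validation score is still rising at the largest training size, more data is likely to help. If it has flattened while staying close to the training score, better features probably matter more.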