The goal of this worksheet is to tune hyperparameters for an L2-regularized Logistic Regression model and an SVM model. This includes a grid search over the hyperparameters as well as a feature selection process. The Logistic Regression and Linear SVM models come from the scikit-learn package.
<br><br>The feature selection routine validates each model by fitting it on a training set and evaluating it on a held-out test set. After finding the optimal number of features for a model, we will cross-validate that model across the entire data set.
<br><br>Hence the procedure is as follows:
   * Read in the data and split it into training and test sets (75% / 25%).
   * Pick a grid of hyperparameter values for the L2-regularized Logistic Regression model
   and the Linear Support Vector Machine. If time allows, we can also try nonlinear kernels
   for the SVM.
   * Feed the classifier and the training data into the feature_selection routine. The routine
   returns the features that produced the lowest training error for the model. Training
   error is used here to reduce the bias of training the model (see ITSL).
   * After the feature selection routine has run, validate each model with cross
   validation. This means computing a cross-validation score for each model over the
   same grid search space.
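
The feature selection step in the list above can be sketched as a greedy backward elimination. This is only an illustration under stated assumptions: the real `FeatureSelection` class lives in `featureselection.py`, which is not shown here, so the name `backward_eliminate`, the `error_fn` callback, and the toy error function are all hypothetical stand-ins for fitting a classifier and measuring its error.

```python
def backward_eliminate(features, error_fn):
    """Greedy backward elimination (illustrative sketch).

    Start from all features, repeatedly drop the single feature whose
    removal gives the lowest error, and remember the best subset seen.
    `error_fn(subset)` is a hypothetical callback returning an error rate.
    """
    current = list(features)
    best_subset, best_error = list(current), error_fn(current)
    while len(current) > 1:
        # Try removing each remaining feature and keep the best candidate.
        candidates = [[f for f in current if f != drop] for drop in current]
        current = min(candidates, key=error_fn)
        error = error_fn(current)
        # Strict comparison: on ties the larger subset found earlier is kept.
        if error < best_error:
            best_subset, best_error = list(current), error
    return best_subset, best_error


# Toy stand-in for "fit the classifier and measure the error": features
# "a" and "b" are informative, and every feature adds a small penalty.
def toy_error(subset):
    useful = {"a": 0.2, "b": 0.1}
    error = 0.5 - sum(useful.get(f, 0.0) for f in subset)
    return error + 0.01 * len(subset)


selected, err = backward_eliminate(["a", "b", "c", "d"], toy_error)
print(sorted(selected), round(err, 2))
```

With a real model, `error_fn` would fit the classifier on the chosen columns and return its error; whether the actual routine searches greedily like this or in some other way is not visible from this worksheet.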

In [1]:
import copy
import itertools
from time import time

import numpy as np
from sklearn.model_selection import train_test_split  # replaces sklearn.cross_validation (removed in scikit-learn 0.20)
from sklearn.linear_model import LogisticRegression

from get_data import data
from featureselection import FeatureSelection

In [2]:
df = data()

## 11/29/2015

In [3]:
# Accumulates one result dict per (C, max_iter) combination.
grid_search = []

for c in np.linspace(.1, 5.0, 50):  # C = 0.1, 0.2, ..., 5.0
    for max_iter in [10, 50, 100]:
        print('\n\nL2 with c={} and max_iter={}\n'.format(c, max_iter))
        # LogisticRegression uses the L2 penalty by default.
        classifier = LogisticRegression(C=c, max_iter=max_iter)
        feature_selection = FeatureSelection(classifier, df)
        # feature_reduce() returns the selected features and their error.
        features, error = feature_selection.feature_reduce()
        grid_search.append({'classifier: ': 'Logistic L2', 'C': c, 'max_iter': max_iter,
                            'features': features, 'error': error})
        print(grid_search)



L2 with c=0.1 and max_iter=10

Calculating best 13 features
Training error: 0.206552006552
Calculating best 12 features
Training error: 0.206552006552
Calculating best 11 features
Training error: 0.206552006552
Calculating best 10 features
Training error: 0.206552006552
Calculating best 9 features
Training error: 0.206552006552
Calculating best 8 features
Training error: 0.206552006552
Calculating best 7 features
Training error: 0.206552006552
Calculating best 6 features
Training error: 0.206552006552
Calculating best 5 features
Training error: 0.206552006552
Calculating best 4 features
Training error: 0.206552006552
Calculating best 3 features
Training error: 0.206592956593
Calculating best 2 features
Training error: 0.209131859132
Calculating best 1 features
Training error: 0.222809172809
Total time taken: 17.3646306992

Best n: 13
Calculating best 13 features
Total time taken: 21.0951547623
Lowest Test error: 0.200589608156
[{'classifier: ': 'Logistic L2', 'C': 0.10000000000000001

KeyboardInterrupt: 

In [5]:
with open('log_reg_l2.txt', 'w') as f:
    for it in grid_search:
        f.write('{}\n'.format(it))
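
Because the run above ended in a `KeyboardInterrupt`, it may be safer to write each result to disk as soon as it is computed instead of saving the whole list afterwards. The following is a minimal sketch of that idea, not the worksheet's actual code: `run_grid`, `fake_evaluate`, the JSON-lines format, and the `.jsonl` file name are assumptions introduced for illustration, and `fake_evaluate` merely stands in for `FeatureSelection(...).feature_reduce()`.

```python
import itertools
import json
import os
import tempfile


def run_grid(c_values, max_iters, evaluate, path):
    """Evaluate every (C, max_iter) pair and append each result to `path`
    immediately, so an interrupted run keeps everything finished so far."""
    for c, max_iter in itertools.product(c_values, max_iters):
        result = {"classifier": "Logistic L2", "C": c, "max_iter": max_iter}
        result.update(evaluate(c, max_iter))
        # Open in append mode on every iteration so each line is flushed to disk.
        with open(path, "a") as f:
            f.write(json.dumps(result) + "\n")


# Hypothetical stand-in for the real feature selection call; values are made up.
def fake_evaluate(c, max_iter):
    return {"features": ["f1", "f2"], "error": round(1.0 / (c * max_iter), 4)}


path = os.path.join(tempfile.mkdtemp(), "log_reg_l2.jsonl")
run_grid([0.1, 0.2], [10, 50, 100], fake_evaluate, path)

# Read the file back: one JSON object per line.
with open(path) as f:
    saved = [json.loads(line) for line in f]
print(len(saved), saved[0]["C"], saved[0]["max_iter"])
```

Appending one line per combination also makes it easy to see where an interrupted search stopped and to resume from the next value of `C`.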

## 11/30/2015

In [5]:
# Same search restricted to C in [2.6, 5.0] (25 values, step 0.1).
grid_search = []

for c in np.linspace(2.6, 5.0, 25):
    for max_iter in [10, 50, 100]:
        print('\n\nL2 with c={} and max_iter={}\n'.format(c, max_iter))
        # LogisticRegression uses the L2 penalty by default.
        classifier = LogisticRegression(C=c, max_iter=max_iter)
        feature_selection = FeatureSelection(classifier, df)
        # feature_reduce() returns the selected features and their error.
        features, error = feature_selection.feature_reduce()
        grid_search.append({'classifier: ': 'Logistic L2', 'C': c, 'max_iter': max_iter,
                            'features': features, 'error': error})
        print(grid_search)



L2 with c=2.6 and max_iter=10

Calculating best 13 features
Training error: 0.200941850942
Calculating best 12 features
Training error: 0.198157248157
Calculating best 11 features
Training error: 0.166871416871
Calculating best 10 features
Training error: 0.165192465192
Calculating best 9 features
Training error: 0.16420966421
Calculating best 8 features
Training error: 0.164332514333
Calculating best 7 features
Training error: 0.164946764947
Calculating best 6 features
Training error: 0.166011466011
Calculating best 5 features
Training error: 0.166175266175
Calculating best 4 features
Training error: 0.168877968878
Calculating best 3 features
Training error: 0.170925470925
Calculating best 2 features
Training error: 0.177764127764
Calculating best 1 features
Training error: 0.21855036855
Total time taken: 23.5229580402

Best n: 9
Calculating best 9 features
Total time taken: 349.465702772
Lowest Test error: 0.169143839823
[{'classifier: ': 'Logistic L2', 'C': 2.6000000000000001, 'ma

In [6]:
with open('log_reg_l2_2.txt', 'w') as f:
    for it in grid_search:
        f.write('{}\n'.format(it))
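
Once the result files exist, the best configuration can be recovered by reading them back and taking the entry with the lowest error. The sketch below assumes each line is a plain Python dict literal, as the truncated output above suggests; the helper name `load_results`, the temporary file, and the record values are made up for illustration and are not results from this worksheet.

```python
import ast
import os
import tempfile

# Toy records in the same "one dict repr per line" layout; values are made up.
records = [
    {'classifier: ': 'Logistic L2', 'C': 0.5, 'max_iter': 10,
     'features': ['x1', 'x2'], 'error': 0.21},
    {'classifier: ': 'Logistic L2', 'C': 1.0, 'max_iter': 50,
     'features': ['x1'], 'error': 0.17},
]
path = os.path.join(tempfile.mkdtemp(), 'toy_results.txt')
with open(path, 'w') as f:
    for it in records:
        f.write('{}\n'.format(it))


def load_results(path):
    """Parse one dict literal per non-empty line."""
    with open(path) as f:
        return [ast.literal_eval(line) for line in f if line.strip()]


# The configuration with the lowest recorded error wins.
best = min(load_results(path), key=lambda r: r['error'])
print(best['C'], best['max_iter'], best['error'])
```

`ast.literal_eval` only accepts plain literals, so this approach would fail if a value were written in a form such as `np.float64(2.6)`; storing the results as JSON avoids that dependence on how the values are printed.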