# Step 3 - Build Model

### Domain and Data

Domain: We will use our machine learning pipelines to build models that combine different transformers and parameters.

Data: We use the same MADELON dataset as in Steps 1 and 2.

### Problem Statement

Our goal is to build more robust models than those in Step 1 or Step 2. We also want to find the parameters that yield better-performing models, in terms of both accuracy and salient feature selection. Finally, we want to compare the models' performance against each other.

In [1]:
# Import our wrapper functions from the project_5.py in our lib
from lib.project_5 import load_data_from_database, add_to_process_list, make_data_dict, validate_dictionary, general_model, general_transformer

In [2]:
# Load our data, from the database, into a DataFrame
madelon_df = load_data_from_database()

In [3]:
# Make sure our data was loaded correctly. Our DataFrame should have 2000 rows and 501 columns
madelon_df.shape

(2000, 501)

### Solution Statement

We will use SelectKBest as our transformer. The models we will use are LogisticRegression and KNeighborsClassifier. To select the optimal parameters for each of these models, we will run a GridSearchCV (cross-validated grid search) on each one.
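The transform-then-model flow described above can be sketched with plain scikit-learn calls. This is only an illustration under stated assumptions: synthetic data from make_classification stands in for MADELON, and the direct SelectKBest and LogisticRegression calls stand in for the project's general_transformer and general_model wrappers, whose internals are not shown in this notebook.

```python
# Sketch only: synthetic data stands in for MADELON, and plain scikit-learn
# calls stand in for the project's general_transformer / general_model wrappers.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic problem with a few informative columns among many noisy ones
X, y = make_classification(n_samples=400, n_features=50, n_informative=5,
                           n_redundant=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit the selector on the training split only, then apply it to both splits
selector = SelectKBest(score_func=f_classif, k=10)
X_train_k = selector.fit_transform(X_train, y_train)
X_test_k = selector.transform(X_test)

# Fit the model on the reduced training data and score both splits
model = LogisticRegression()
model.fit(X_train_k, y_train)
train_score = model.score(X_train_k, y_train)
test_score = model.score(X_test_k, y_test)

print('train accuracy: %.3f' % train_score)
print('test accuracy:  %.3f' % test_score)
```

Fitting the selector on the training split alone keeps the test split out of the feature-scoring step, so the test accuracy stays an honest estimate.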

##### Transform using SelectKBest, then run Logistic Regression:

In [4]:
# Create a data dictionary from our DataFrame
data_dictionary = make_data_dict(madelon_df)

In [5]:
# Use SelectKBest as our transformer
from sklearn.feature_selection import SelectKBest
selectkbest = general_transformer(SelectKBest(), data_dictionary)
selectkbest

{'X_test': array([[495, 523, 536, ..., 459, 543, 491],
        [490, 461, 611, ..., 426, 614, 436],
        [517, 706, 417, ..., 497, 368, 547],
        ..., 
        [531, 421, 542, ..., 430, 547, 597],
        [439, 577, 450, ..., 462, 426, 363],
        [469, 486, 551, ..., 434, 558, 417]]),
 'X_train': array([[551, 518, 559, ..., 441, 568, 417],
        [463, 567, 454, ..., 502, 433, 532],
        [561, 511, 509, ..., 475, 505, 410],
        ..., 
        [544, 645, 448, ..., 444, 423, 548],
        [472, 416, 534, ..., 465, 537, 384],
        [417, 432, 512, ..., 509, 506, 649]]),
 'processes': [SelectKBest(k=10, score_func=<function f_classif at 0x112bdf9b0>)],
 'y_test': index
 1616    1
 1878   -1
 446     1
 1421   -1
 89     -1
 805     1
 587     1
 738     1
 989    -1
 447    -1
 555    -1
 190    -1
 931    -1
 405    -1
 1887   -1
 71      1
 835     1
 1496    1
 813     1
 1373   -1
 1622    1
 1790   -1
 1235   -1
 1650   -1
 783    -1
 1699    1
 536    -1
 1053    1

In [6]:
# Run Logistic Regression
from sklearn.linear_model import LogisticRegression
selectkbest_scored = general_model(LogisticRegression(), selectkbest)
selectkbest_scored

{'X_test': array([[495, 523, 536, ..., 459, 543, 491],
        [490, 461, 611, ..., 426, 614, 436],
        [517, 706, 417, ..., 497, 368, 547],
        ..., 
        [531, 421, 542, ..., 430, 547, 597],
        [439, 577, 450, ..., 462, 426, 363],
        [469, 486, 551, ..., 434, 558, 417]]),
 'X_train': array([[551, 518, 559, ..., 441, 568, 417],
        [463, 567, 454, ..., 502, 433, 532],
        [561, 511, 509, ..., 475, 505, 410],
        ..., 
        [544, 645, 448, ..., 444, 423, 548],
        [472, 416, 534, ..., 465, 537, 384],
        [417, 432, 512, ..., 509, 506, 649]]),
 'coef_': array([[-0.00013693, -0.0014875 , -0.00717311,  0.00066741,  0.00400343,
          0.00299242, -0.00144017, -0.00638096,  0.00956805,  0.00035804]]),
 'processes': [SelectKBest(k=10, score_func=<function f_classif at 0x112bdf9b0>),
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
            intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          

In [15]:
# See how many salient features are remaining
selectkbest_scored['sal_features'].shape[1]

4
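The sal_features entry is produced by the project's wrapper, and its logic is not shown here. As a separate, hedged illustration, plain scikit-learn can report which columns SelectKBest itself retained through get_support. The data and the feat_NNN names below are synthetic stand-ins, not the MADELON columns.

```python
# Sketch only: synthetic data and made-up feature names, not the MADELON columns.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=40, n_informative=4,
                           n_redundant=0, random_state=1)
feature_names = np.array(['feat_%03d' % i for i in range(X.shape[1])])

# Score every column with the ANOVA F-test and keep the 10 highest-scoring ones
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Boolean mask over the original columns: True where the column was kept
mask = selector.get_support()
kept_names = feature_names[mask]
kept_scores = selector.scores_[mask]

# List the kept columns from highest to lowest F-score
for name, score in sorted(zip(kept_names, kept_scores), key=lambda p: -p[1]):
    print('%s  F=%.2f' % (name, score))
```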

In [20]:
madelon_df.head()

| index | feat_000 | feat_001 | feat_002 | feat_003 | feat_004 | feat_005 | feat_006 | feat_007 | feat_008 | feat_009 | ... | feat_491 | feat_492 | feat_493 | feat_494 | feat_495 | feat_496 | feat_497 | feat_498 | feat_499 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 485 | 477 | 537 | 479 | 452 | 471 | 491 | 476 | 475 | 473 | ... | 481 | 477 | 485 | 511 | 485 | 481 | 479 | 475 | 496 | -1 |
| 1 | 483 | 458 | 460 | 487 | 587 | 475 | 526 | 479 | 485 | 469 | ... | 478 | 487 | 338 | 513 | 486 | 483 | 492 | 510 | 517 | -1 |
| 2 | 487 | 542 | 499 | 468 | 448 | 471 | 442 | 478 | 480 | 477 | ... | 481 | 492 | 650 | 506 | 501 | 480 | 489 | 499 | 498 | -1 |
| 3 | 480 | 491 | 510 | 485 | 495 | 472 | 417 | 474 | 502 | 476 | ... | 480 | 474 | 572 | 454 | 469 | 475 | 482 | 494 | 461 | 1 |
| 4 | 484 | 502 | 528 | 489 | 466 | 481 | 402 | 478 | 487 | 468 | ... | 479 | 452 | 435 | 486 | 508 | 481 | 504 | 495 | 511 | 1 |


##### Transform using SelectKBest, then run KNeighborsClassifier:

In [7]:
# Create a data dictionary from our DataFrame
data_dictionary_2 = make_data_dict(madelon_df)

In [8]:
# Use SelectKBest as our transformer
selectkbest_2 = general_transformer(SelectKBest(), data_dictionary_2)
selectkbest_2

{'X_test': array([[406, 440, 315, ..., 391, 529, 543],
        [528, 445, 563, ..., 551, 487, 553],
        [464, 468, 363, ..., 463, 467, 571],
        ..., 
        [468, 300, 661, ..., 472, 492, 591],
        [521, 352, 620, ..., 535, 550, 553],
        [453, 534, 374, ..., 458, 428, 549]]),
 'X_train': array([[445, 495, 355, ..., 431, 502, 501],
        [438, 481, 494, ..., 440, 320, 555],
        [465, 324, 567, ..., 469, 560, 629],
        ..., 
        [520, 404, 593, ..., 524, 572, 512],
        [439, 494, 484, ..., 440, 581, 429],
        [430, 409, 491, ..., 443, 568, 521]]),
 'processes': [SelectKBest(k=10, score_func=<function f_classif at 0x112bdf9b0>)],
 'y_test': index
 1358    1
 1885    1
 734    -1
 870     1
 644    -1
 1581   -1
 85     -1
 978    -1
 244    -1
 1235   -1
 1468    1
 1674    1
 349     1
 1640    1
 1286    1
 447    -1
 1633   -1
 42      1
 92     -1
 1232    1
 1028    1
 1751    1
 1778   -1
 1652    1
 448    -1
 1952    1
 247    -1
 551     1

In [10]:
# Run KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
knn = general_model(KNeighborsClassifier(), selectkbest_2)
knn

{'X_test': array([[406, 440, 315, ..., 391, 529, 543],
        [528, 445, 563, ..., 551, 487, 553],
        [464, 468, 363, ..., 463, 467, 571],
        ..., 
        [468, 300, 661, ..., 472, 492, 591],
        [521, 352, 620, ..., 535, 550, 553],
        [453, 534, 374, ..., 458, 428, 549]]),
 'X_train': array([[445, 495, 355, ..., 431, 502, 501],
        [438, 481, 494, ..., 440, 320, 555],
        [465, 324, 567, ..., 469, 560, 629],
        ..., 
        [520, 404, 593, ..., 524, 572, 512],
        [439, 494, 484, ..., 440, 581, 429],
        [430, 409, 491, ..., 443, 568, 521]]),
 'processes': [SelectKBest(k=10, score_func=<function f_classif at 0x112bdf9b0>),
  KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
             metric_params=None, n_jobs=1, n_neighbors=5, p=2,
             weights='uniform')],
 'test_score': 0.84666666666666668,
 'train_score': 0.90357142857142858,
 'y_test': index
 1358    1
 1885    1
 734    -1
 870     1
 644    -1
 1581   

##### GridSearchCV for Logistic Regression:

In [12]:
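from sklearn.model_selection import GridSearchCV

# Base estimator whose regularization parameter C we want to tune
lr = LogisticRegression()

# Candidate values of C (inverse regularization strength), spanning several orders of magnitude
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'C': param_range}]

# 10-fold cross-validated grid search, scored by accuracy, using all available cores
gs = GridSearchCV(estimator=lr, param_grid=param_grid, scoring='accuracy', cv=10, n_jobs=-1)
gs = gs.fit(data_dictionary['X_train'], data_dictionary['y_train'])

# best_score_ is the mean cross-validated accuracy of the best parameter setting
print('Grid Search Best Score: %.4f' % gs.best_score_)
print('Grid Search Best Parameter for C: ')
print(gs.best_params_)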

Grid Search Best Score: 0.5421
Grid Search Best Parameter for C: 
{'C': 0.001}


In [13]:
gs.best_estimator_

LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
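To see how accuracy varies across the whole grid rather than only at the best value of C, a fitted GridSearchCV exposes cv_results_. The following is a minimal sketch on synthetic data (make_classification standing in for MADELON), and the grid values and fold count are illustrative choices rather than the ones used above.

```python
# Sketch only: synthetic data, illustrative grid and fold count.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

param_grid = [{'C': [0.001, 0.01, 0.1, 1.0, 10.0]}]
search = GridSearchCV(LogisticRegression(), param_grid=param_grid,
                      scoring='accuracy', cv=5)
search.fit(X, y)

# One entry per grid point: mean and standard deviation of the fold accuracies
means = search.cv_results_['mean_test_score']
stds = search.cv_results_['std_test_score']
for params, mean, std in zip(search.cv_results_['params'], means, stds):
    print('C=%-7g mean CV accuracy=%.3f (+/- %.3f)' % (params['C'], mean, std))

print('best C: %s' % search.best_params_['C'])
```

If the mean scores are nearly flat across C, the regularization strength is probably not what limits the model.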

##### GridSearchCV for KNeighborsClassifier:

In [14]:
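# Base estimator whose number of neighbors we want to tune
knc = KNeighborsClassifier()

# Odd values of n_neighbors avoid tied votes in a two-class problem
param_range = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
param_grid = [{'n_neighbors': param_range}]

# 10-fold cross-validated grid search, scored by accuracy, using all available cores
gs2 = GridSearchCV(estimator=knc, param_grid=param_grid, scoring='accuracy', cv=10, n_jobs=-1)
gs2 = gs2.fit(data_dictionary['X_train'], data_dictionary['y_train'])

# best_score_ is the mean cross-validated accuracy of the best parameter setting
print('Grid Search Best Score: %.4f' % gs2.best_score_)
print('Grid Search Best Parameter for n_neighbors: ')
print(gs2.best_params_)
print(gs2.best_estimator_)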

Grid Search Best Score: 0.7157
Grid Search Best Parameter for n_neighbors: 
{'n_neighbors': 17}
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=17, p=2,
           weights='uniform')


### Metric

For feature selection, our metric is the number of features retained (i.e. not removed).

For model performance, our metric is the accuracy score on the test set (test score).

### Benchmark

For relevant feature selection, we aim to retain fewer than 463 features (which is what we were left with in Step 2 when using Logistic Regression with Lasso).

For accuracy, we hope to do better than the 57% benchmark that was set in Step 1.

### Results

For relevant feature selection, we were left with just 4 features, far fewer than the 463 features from Step 2. Unfortunately, we know that the actual number of relevant features is higher than 4, and the test score for this combination of SelectKBest and LogisticRegression was only 62%, so it seems that we eliminated informative features as well.

For accuracy, both our Logistic Regression and our KNeighborsClassifier beat our benchmark of 57%, posting test scores of 62% and 84%, respectively. That is the good news.

Here is the bad news: both of our GridSearchCVs, which I expected to outperform all of our previous models, actually scored lower than their untuned counterparts. The GridSearchCV for Logistic Regression scored 54%, which is worse than both our benchmark of 57% and the SelectKBest/Logistic Regression combination, which scored 62%. The GridSearchCV for our KNeighborsClassifier also scored significantly lower than our standard SelectKBest/KNeighborsClassifier combination (71% versus 84%). One caveat is that the grid search figures come from the best_score_ attribute, which is a mean cross-validated accuracy computed on the training data, while the 62% and 84% figures are test scores, so the two sets of numbers are not strictly like-for-like. It is also quite possible that I made errors in implementing the GridSearchCV, which is the only other explanation I can come up with for these strange results.
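One way to make this comparison more even-handed is to run the feature selection inside the grid search and then report the cross-validated score and the test score from the same fitted object. The sketch below does this with scikit-learn's Pipeline. It is not the project's code: synthetic data from make_classification stands in for MADELON, the grid values are illustrative, and it rests on the assumption (the wrapper code is not shown here) that data_dictionary['X_train'] as passed to the grid searches still held all 500 columns rather than the 10 chosen by SelectKBest.

```python
# Sketch only: synthetic data, illustrative grid; not the project's wrappers.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=400, n_features=50, n_informative=5,
                           n_redundant=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Feature selection and the classifier are tuned together, so SelectKBest is
# refit on the training portion of every cross-validation fold
pipe = Pipeline([
    ('select', SelectKBest(score_func=f_classif)),
    ('knn', KNeighborsClassifier()),
])
param_grid = {
    'select__k': [5, 10, 20],
    'knn__n_neighbors': [1, 5, 9, 13, 17],
}
search = GridSearchCV(pipe, param_grid=param_grid, scoring='accuracy', cv=5)
search.fit(X_train, y_train)

# Mean cross-validated accuracy on the training data for the best setting
cv_score = search.best_score_
# Accuracy of the refit best pipeline on the held-out test split
test_score = search.score(X_test, y_test)

print('best parameters: %s' % search.best_params_)
print('mean CV accuracy: %.3f' % cv_score)
print('test accuracy:    %.3f' % test_score)
```

Reporting both numbers from one search object makes it clear which figure is being compared against the benchmark.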