# Step 3 - Build Model


### Domain and Data

This step uses Madelon, an artificial dataset, to build models that include a feature selection step.

### Problem Statement

Find the best approach to selecting features and build the best model for the current dataset.

### Solution Statement

A feature selection step is added to the pipeline, and models are fitted with different parameters using grid search to find the best combination of model and selected features.
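
The idea of searching over the feature selector and the model together can be sketched with plain scikit-learn. This is a minimal illustration, not the implementation in `lib/project_5.py` (which is not shown here): the data is a synthetic stand-in generated with `make_classification`, and the step names (`scaler`, `select`, `model`) and the reduced parameter grid are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for Madelon (assumption): a few informative features among noise.
X, y = make_classification(n_samples=400, n_features=40, n_informative=5,
                           n_redundant=0, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=10)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    # liblinear supports both the l1 and the l2 penalty.
    ("model", LogisticRegression(solver="liblinear")),
])

# Reduced, illustrative grid: k for the selector plus penalty and C for the model.
param_grid = {
    "select__k": [4, 8, 12],
    "model__penalty": ["l1", "l2"],
    "model__C": [0.1, 0.5, 1.0],
}

# Every combination is scored by 5-fold cross-validated accuracy on the training split.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["select__k"]
test_score = search.score(X_test, y_test)
print("best params: %s" % search.best_params_)
print("test accuracy: %.3f" % test_score)
```

Putting the selector inside the `Pipeline` means it is refitted on each training fold, so the cross-validated score reflects the selection step as well as the model.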


### Metric

The mean accuracy of the model is the metric for deciding whether the model performs well and whether the selected features are the important ones. In addition, the threshold on the absolute value of a coefficient for considering a feature important is set to 0.001.
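
As a small illustration of the coefficient threshold, the sketch below filters (feature, coefficient) pairs by absolute value. The pairs are copied from the logistic regression output further down in this notebook; the helper name `important_features` and the choice of a strict `>` comparison are assumptions made for the example.

```python
# (feature, coefficient) pairs copied from the logistic regression output
# shown later in this notebook.
coefficients = [
    ("feat_475", 0.453415), ("feat_241", 0.221779), ("feat_338", 0.216256),
    ("feat_064", 0.208713), ("feat_128", 0.205410), ("feat_105", 0.110269),
    ("feat_336", 0.024200), ("feat_442", 0.022083),
]

THRESHOLD = 0.001  # absolute-value cut-off stated in the Metric section


def important_features(pairs, threshold=THRESHOLD):
    """Return the names whose |coefficient| exceeds the threshold.

    Hypothetical helper; the strict comparison is an assumption.
    """
    return [name for name, coef in pairs if abs(coef) > threshold]


selected = important_features(coefficients)
print("%d of %d features pass the threshold" % (len(selected), len(coefficients)))
```

With these values all eight selected features clear the cut-off; the smallest coefficient, 0.022083, is still well above 0.001.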

### Benchmark

Using the above metric, the results of this step can be compared with those from the first step to see how much adding extra steps and tuning the model parameters has improved the score.

## Implementation

Implement the following code pipeline using the functions you write in `lib/project_5.py`.

<img src="assets/build_model.png" width="600px">

In [1]:
from lib.project_5 import pipeline

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC, LinearSVC

In [3]:
proj5_conn = {
    "url" : "joshuacook.me",
    "port" : "5432",
    "database" : "dsi",
    "table" : "madelon",
    "user" : "dsi_student",
    "password" : "correct horse battery staple"
}

feature_selection_params = range(4, 25)    
grid_search_params_lr = {
    'penalty' : ["l1", "l2"],
#    'C' : [0.01+x**2*0.05 for x in range(30)]    
    'C' : [0.1+x*0.05 for x in range(30)]    
}  
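
To get a feel for the size of this search, the short sketch below counts the candidate settings. It assumes that `pipeline` tries every `k` in `feature_selection_params` with every `(penalty, C)` pair; since `lib/project_5.py` is not shown, that is an assumption rather than a description of the actual function.

```python
feature_selection_params = range(4, 25)
grid_search_params_lr = {
    "penalty": ["l1", "l2"],
    "C": [0.1 + x * 0.05 for x in range(30)],
}

c_values = grid_search_params_lr["C"]

n_k = len(feature_selection_params)                            # 21 values of k: 4..24
n_lr = len(grid_search_params_lr["penalty"]) * len(c_values)   # 2 * 30 = 60
n_candidates = n_k * n_lr                                      # 1260 under the assumption above

print("C ranges from %.2f to %.2f" % (min(c_values), max(c_values)))
print("%d candidate settings" % n_candidates)
```

Each candidate is fitted once per cross-validation fold, so the number of model fits is a multiple of this count.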

In [4]:
step3_b_output = (pipeline(proj5_conn, StandardScaler(), transformer=SelectKBest,
                     model=LogisticRegression(n_jobs=-1), fs_params=feature_selection_params, 
                     gs_params=grid_search_params_lr, random_state=10))

Connected to the database and got the data successfully.
Data dictionary created.
Data is scaled.
Transformer is found.
Transformer parameters are found.
Grid searches are created.
Grid searches are done.


In [5]:
step3_b_output["scaler"]

StandardScaler(copy=True, with_mean=True, with_std=True)

In [6]:
step3_b_output["transformer"]

SelectKBest(k=8, score_func=<function f_classif at 0x115558758>)

In [7]:
step3_b_output["model"]

LogisticRegression(C=0.15, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=-1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [8]:
step3_b_output["train_score"], step3_b_output["test_score"]

(0.61799999999999999, 0.61599999999999999)

In [9]:
step3_b_output["best_k"]

8

In [10]:
features = pd.DataFrame(step3_b_output["features"], columns=["Feature", "Coefficient"]).sort_values(by="Coefficient", ascending=False)

In [11]:
features

| | Feature | Coefficient |
|---|---|---|
| 7 | feat_475 | 0.453415 |
| 3 | feat_241 | 0.221779 |
| 5 | feat_338 | 0.216256 |
| 0 | feat_064 | 0.208713 |
| 2 | feat_128 | 0.20541 |
| 1 | feat_105 | 0.110269 |
| 4 | feat_336 | 0.0242 |
| 6 | feat_442 | 0.022083 |


### K-Nearest Neighbors

In [12]:
feature_selection_params = range(5, 16)    
grid_search_params_knn = {
    "n_neighbors" : range(3,16)
#    "weights" : ["uniform", "distance"]
#    "metric" : ["minkowski", "manhattan", "euclidean"]
}    


In [13]:
step3_b_output_knn = (pipeline(proj5_conn, StandardScaler(), transformer=SelectKBest,
                     model=KNeighborsClassifier(n_jobs=-1), fs_params=feature_selection_params, 
                     gs_params=grid_search_params_knn, random_state=10))

Connected to the database and got the data successfully.
Data dictionary created.
Data is scaled.
Transformer is found.
Transformer parameters are found.
Grid searches are created.
Grid searches are done.


In [14]:
print step3_b_output_knn["model"]
print step3_b_output_knn["train_score"], step3_b_output_knn["test_score"]
print step3_b_output_knn["best_k"]
print step3_b_output_knn["features"]

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
           weights='uniform')
0.92 0.892
12
Index([u'feat_048', u'feat_064', u'feat_105', u'feat_128', u'feat_241',
       u'feat_336', u'feat_338', u'feat_378', u'feat_442', u'feat_472',
       u'feat_475', u'feat_493'],
      dtype='object')


This search differs from the one in the previous cell. Instead of choosing the model by the grid search's best score, which comes from the held-out folds of cross-validation, it chooses the model with the best score on `X_test` and `y_test`.
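
The two selection rules contrasted here can be sketched side by side with plain scikit-learn. This is only an illustration on synthetic data from `make_classification` (an assumption; it does not use the Madelon table or the `pipeline` helper), and both rules use `weights="distance"` so that only the selection rule differs.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), split once into train and test.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_redundant=0, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=10)

neighbor_grid = list(range(3, 16))

# Rule 1: pick n_neighbors by mean cross-validated accuracy on the training split.
cv_search = GridSearchCV(KNeighborsClassifier(weights="distance"),
                         {"n_neighbors": neighbor_grid}, cv=5)
cv_search.fit(X_train, y_train)
cv_choice = cv_search.best_params_["n_neighbors"]

# Rule 2: pick n_neighbors by accuracy on the held-out test split.
holdout_scores = {}
for n in neighbor_grid:
    model = KNeighborsClassifier(n_neighbors=n, weights="distance")
    model.fit(X_train, y_train)
    holdout_scores[n] = model.score(X_test, y_test)
holdout_choice = max(holdout_scores, key=holdout_scores.get)

print("cross-validation picks n_neighbors=%d" % cv_choice)
print("hold-out split picks n_neighbors=%d (accuracy %.3f)"
      % (holdout_choice, holdout_scores[holdout_choice]))
```

One caveat with the second rule: once the test split has been used to choose the model, the winning test score tends to be optimistic as an estimate of performance on new data.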

In [15]:
list_output_knn = []
results = []
for i in range(3, 16):
    # Run the pipeline with n_neighbors fixed to i; it reports best_k for fs_params.
    output = pipeline(proj5_conn, StandardScaler(), transformer=SelectKBest,
                      model=KNeighborsClassifier(n_neighbors=i, weights="distance", n_jobs=-1),
                      fs_params=feature_selection_params, verbose=False, random_state=10)
    list_output_knn.append(output)
    # Refit on the training split and score on the held-out test split.
    model = output["model"]
    model.fit(output["X_train"], output["y_train"])
    results.append((i, output["best_k"], model.score(output["X_test"], output["y_test"])))
results = pd.DataFrame(results, columns=["n_neighbors", "best_k", "test_score"])

In [16]:
results.sort_values(by="test_score", ascending=False)

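| | n_neighbors | best_k | test_score |
|---|---|---|---|
| 7 | 10 | 13 | 0.904 |
| 6 | 9 | 13 | 0.9 |
| 8 | 11 | 13 | 0.9 |
| 4 | 7 | 13 | 0.896 |
| 9 | 12 | 13 | 0.896 |
| 10 | 13 | 13 | 0.896 |
| 2 | 5 | 12 | 0.894 |
| 3 | 6 | 13 | 0.894 |
| 5 | 8 | 13 | 0.894 |
| 12 | 15 | 13 | 0.894 |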


In [17]:
best_output_knn = (pipeline(proj5_conn, StandardScaler(), transformer=SelectKBest(k=13),
                    model=KNeighborsClassifier(n_neighbors=10, weights="distance", n_jobs=-1), 
                    verbose=False, random_state=10))
best_output_knn["features"]

Index([u'feat_048', u'feat_064', u'feat_105', u'feat_128', u'feat_241',
       u'feat_336', u'feat_338', u'feat_378', u'feat_442', u'feat_453',
       u'feat_472', u'feat_475', u'feat_493'],
      dtype='object')