# Step 2 - Identify Salient Features Using the $\ell_1$ Penalty


### Domain and Data

This step uses Madelon, an artificial dataset, to build feature selection models.

### Problem Statement

Modify the benchmark model to eliminate some features using regularization.

### Solution Statement

Modify the benchmark step by choosing the $\ell_1$ penalty and lower values of C. Because C is the inverse of the regularization strength, lowering it strengthens the penalty, which eliminates some features by pushing their coefficients to zero.
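To illustrate why an $\ell_1$ penalty produces exact zeros rather than merely small values, the sketch below applies soft-thresholding, the shrinkage step associated with the $\ell_1$ norm, to a few made-up coefficients. It is not the algorithm liblinear runs internally; `soft_threshold`, `weights`, and the `lam` values are assumptions chosen for illustration, and a larger `lam` plays the role of a smaller C.

```python
def soft_threshold(w, lam):
    """Shrink w toward zero by lam; any w with |w| <= lam becomes exactly 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


# Hypothetical unpenalized coefficients, chosen only for illustration.
weights = [0.35, 0.06, -0.03, 0.009, -0.002]

# A stronger penalty (larger lam, i.e. smaller C) leaves fewer nonzero coefficients.
nonzero_counts = {}
for lam in (0.001, 0.01, 0.05):
    shrunk = [soft_threshold(w, lam) for w in weights]
    nonzero_counts[lam] = sum(1 for w in shrunk if w != 0.0)
    print(f"lam={lam}: {nonzero_counts[lam]} nonzero coefficients")
```

With these made-up numbers the count of surviving coefficients falls as `lam` grows, which mirrors the effect of lowering C in the runs below.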


### Metric

Mean accuracy is the metric for deciding whether the model performs well and whether the selected features are the important ones. A feature is considered important when the absolute value of its coefficient exceeds 0.001.
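The threshold rule can be sketched in plain Python as follows. The helper name `select_features` and the coefficient values are hypothetical; only the 0.001 cutoff comes from the text above.

```python
def select_features(coefficients, threshold=0.001):
    """Return the names of features whose |coefficient| exceeds the threshold."""
    return [name for name, coef in coefficients.items() if abs(coef) > threshold]


# Hypothetical fitted coefficients keyed by feature name.
coefs = {"feat_a": 0.25, "feat_b": -0.004, "feat_c": 0.0004, "feat_d": 0.0}

selected = select_features(coefs)
print(selected)
```

The sign of a coefficient is ignored, so a strongly negative coefficient counts as important just as a strongly positive one does.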

### Benchmark

Based on the nature of the data and prior experience, between 5 and 10 features should be enough.

## Implementation

Implement the following code pipeline using the functions you write in `lib/project_5.py`.

<img src="assets/identify_features.png" width="600px">
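The real `pipeline` function lives in `lib/project_5.py` and is not reproduced in this notebook. As a rough sketch of the shape such a function might take, the block below splits the data, scales it, fits the model, and returns a dictionary. The name `sketch_pipeline`, its signature, and the synthetic data are assumptions; the database step is omitted, and only the returned keys mirror the ones this notebook reads from the pipeline output.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def sketch_pipeline(X, y, feature_names, scaler, model, random_state=None):
    """Hypothetical stand-in for lib.project_5.pipeline (no database step)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state
    )
    # Fit the scaler on the training split only, then apply it to both splits.
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    model.fit(X_train, y_train)
    return {
        "scaler": scaler,
        "model": model,
        "train_score": model.score(X_train, y_train),
        "test_score": model.score(X_test, y_test),
        # One (name, coefficient) pair per input feature.
        "features": list(zip(feature_names, model.coef_[0])),
    }


# Synthetic data stands in for the Madelon table (an assumption for this sketch).
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=5, random_state=0
)
names = [f"feat_{i:03d}" for i in range(X.shape[1])]
out = sketch_pipeline(
    X, y, names, StandardScaler(),
    LogisticRegression(C=0.05, penalty="l1", solver="liblinear"),
    random_state=10,
)
print(sorted(out))
```

Fitting the scaler on the training split alone keeps information from the test split out of the model, which is one reasonable design choice for a function like this.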

In [1]:
from lib.project_5 import pipeline

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC, LinearSVC

In [3]:
proj5_conn = {
    "url" : "joshuacook.me",
    "port" : "5432",
    "database" : "dsi",
    "table" : "madelon",
    "user" : "dsi_student",
    "password" : "correct horse battery staple"
}


In [4]:
step2_b_output = pipeline(proj5_conn, StandardScaler(),
                          model=LogisticRegression(C=0.0225, penalty="l1", solver="liblinear"),
                          random_state=10)

Connected to the database and got the data successfully.
Data dictionary created.
Data is scaled.
Transformer is  not found.
No grid search.


In [5]:
step2_b_output["scaler"]

StandardScaler(copy=True, with_mean=True, with_std=True)

In [6]:
step2_b_output["model"]

LogisticRegression(C=0.0225, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [7]:
step2_b_output["train_score"], step2_b_output["test_score"]

(0.62666666666666671, 0.60999999999999999)

In [8]:
features = pd.DataFrame(step2_b_output["features"], columns=["Feature", "Coefficient"])

In [9]:
features["abs_coefs"] = abs(features["Coefficient"])

In [10]:
features.sort_values(by="abs_coefs", ascending=False).head()

|     | Feature  | Coefficient | abs_coefs |
|-----|----------|-------------|-----------|
| 475 | feat_475 | 0.347557    | 0.347557  |
| 48  | feat_048 | 0.063738    | 0.063738  |
| 424 | feat_424 | 0.029566    | 0.029566  |
| 317 | feat_317 | 0.026449    | 0.026449  |
| 205 | feat_205 | -0.008702   | 0.008702  |


In [11]:
high_features = features[features["abs_coefs"] > 0.001]

In [12]:
high_features.shape[0]

7

In [14]:
high_features[["Feature", "Coefficient"]]

|     | Feature  | Coefficient |
|-----|----------|-------------|
| 4   | feat_004 | 0.002461    |
| 48  | feat_048 | 0.063738    |
| 88  | feat_088 | -0.00218    |
| 205 | feat_205 | -0.008702   |
| 317 | feat_317 | 0.026449    |
| 424 | feat_424 | 0.029566    |
| 475 | feat_475 | 0.347557    |


In [15]:
outputs = []
num_features_one_hundredth = []
num_features_one_thousandth = []
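# Sweep C over powers of ten from 1e-3 to 1e3; liblinear is a solver that supports the l1 penalty.
for c in range(7):
    outputs.append(pipeline(proj5_conn, StandardScaler(),
                            model=LogisticRegression(C=10**(c-3), penalty="l1", solver="liblinear"),
                            verbose=False, random_state=10))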
    features = pd.DataFrame(outputs[c]["features"], columns=["Feature", "Coefficient"])
    features["abs_coefs"] = abs(features["Coefficient"])
    num_features_one_hundredth.append(features[features["abs_coefs"] > 0.01].shape[0])
    num_features_one_thousandth.append(features[features["abs_coefs"] > 0.001].shape[0])
scores = pd.DataFrame([(x["train_score"], x["test_score"]) for x in outputs], columns=["train_score", "test_score"])
c_vals = [10**(c-3) for c in range(7)]
scores["C_value"] = c_vals
scores["num_features_one_hundredth"] = num_features_one_hundredth
scores["num_features_one_thousandth"] = num_features_one_thousandth

In [16]:
scores

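|   | train_score | test_score | C_value | num_features_one_hundredth | num_features_one_thousandth |
|---|-------------|------------|---------|----------------------------|-----------------------------|
| 0 | 0.498667    | 0.504      | 0.001   | 0                          | 0                           |
| 1 | 0.616667    | 0.61       | 0.01    | 1                          | 1                           |
| 2 | 0.738667    | 0.576      | 0.1     | 215                        | 267                         |
| 3 | 0.779333    | 0.562      | 1.0     | 424                        | 465                         |
| 4 | 0.780667    | 0.566      | 10.0    | 460                        | 495                         |
| 5 | 0.783333    | 0.57       | 100.0   | 463                        | 498                         |
| 6 | 0.783333    | 0.57       | 1000.0  | 463                        | 497                         |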
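One way to read the sweep table above is to look for the C value with the best test score and to compare the gap between train and test scores. The sketch below does that in plain Python with the rows copied from the table; the variable names are illustrative and not part of the project code.

```python
# Rows copied from the sweep table above:
# (C_value, train_score, test_score, num_features_one_thousandth)
sweep = [
    (0.001, 0.498667, 0.504, 0),
    (0.01, 0.616667, 0.610, 1),
    (0.1, 0.738667, 0.576, 267),
    (1.0, 0.779333, 0.562, 465),
    (10.0, 0.780667, 0.566, 495),
    (100.0, 0.783333, 0.570, 498),
    (1000.0, 0.783333, 0.570, 497),
]

# The row with the highest test score.
best = max(sweep, key=lambda row: row[2])

# Train-minus-test gap per C value; a large gap suggests overfitting.
overfit_gap = {c: train - test for c, train, test, _ in sweep}

print(f"best C by test score: {best[0]} ({best[3]} feature(s) above 0.001)")
for c, gap in overfit_gap.items():
    print(f"C={c}: train-test gap = {gap:.3f}")
```

In these results the test score peaks at C = 0.01, where a single coefficient exceeds the threshold, while larger C values raise the train score and the feature count without improving the test score. That motivates the finer search between C = 0.01 and C = 0.1 in the next cell.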


In [17]:
outputs = []
num_features_one_hundredth = []
num_features_one_thousandth = []
# Fine sweep: 40 values of C from 0.01 to 0.0295 in steps of 0.0005.
for c in range(40):
    outputs.append(pipeline(proj5_conn, StandardScaler(),
                            model=LogisticRegression(C=c*0.0005+0.01, penalty="l1", solver="liblinear"),
                            verbose=False, random_state=10))
    features = pd.DataFrame(outputs[c]["features"], columns=["Feature", "Coefficient"])
    features["abs_coefs"] = abs(features["Coefficient"])
    num_features_one_hundredth.append(features[features["abs_coefs"] > 0.01].shape[0])
    num_features_one_thousandth.append(features[features["abs_coefs"] > 0.001].shape[0])
scores = pd.DataFrame([(x["train_score"], x["test_score"]) for x in outputs], columns=["train_score", "test_score"])
c_vals = [c*0.0005+0.01 for c in range(40)]
scores["C_value"] = c_vals
scores["num_features_one_hundredth"] = num_features_one_hundredth
scores["num_features_one_thousandth"] = num_features_one_thousandth

In [18]:
scores

Unnamed: 0,train_score,test_score,C_value,num_features_one_hundredth,num_features_one_thousandth
0,0.616667,0.61,0.01,1,1
1,0.616667,0.61,0.0105,1,1
2,0.616667,0.61,0.011,1,1
3,0.616667,0.61,0.0115,1,1
4,0.616667,0.61,0.012,1,1
5,0.616667,0.61,0.0125,1,1
6,0.616667,0.61,0.013,1,1
7,0.616667,0.61,0.0135,1,1
8,0.616667,0.61,0.014,1,1
9,0.616667,0.61,0.0145,1,1
