# STEP 2 - Identify Salient Features Using an $\ell_1$ Penalty

### Domain and Data

Domain: We will use our machine learning pipeline to programmatically select the relevant features from a large feature set.

Data: Our dataset is the same as in Step 1 (Benchmarking): the MADELON dataset.

### Problem Statement

The task is to identify only the relevant, informative features of our dataset. Of its 500 features, 5 are truly informative and 15 are linear combinations of those 5, which gives 20 salient features in total. The remaining 480 features are essentially noise (i.e. distractors) and should be eliminated.

### Solution Statement

To reduce the noise and identify the relevant features, we will use the pipeline we constructed, with a Logistic Regression model that applies an L1 penalty (Lasso regularization). The L1 penalty tends to drive the coefficients of uninformative features to exactly 0, which lets us discard those distractors.
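To build some intuition for why an L1 penalty produces exact zeros, here is a minimal sketch in plain Python. It uses the one-dimensional case, where minimizing $\frac{1}{2}(w - z)^2 + \lambda |w|$ has the closed-form "soft-thresholding" solution, and compares it with the ridge-style solution $z / (1 + \lambda)$ of $\frac{1}{2}(w - z)^2 + \frac{\lambda}{2} w^2$. The coefficient values and the penalty strength are made up for illustration, and the solver scikit-learn runs for penalized logistic loss is more involved than this.

```python
def soft_threshold(value, penalty):
    """1-D L1 solution: shrink toward zero, and return exactly 0.0
    whenever |value| <= penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


# Hypothetical unpenalized coefficients: two strong signals, three weak ones.
raw_coefs = [2.5, -1.8, 0.3, -0.2, 0.05]
penalty = 0.5  # illustrative value, not tuned

# L1-style shrinkage: weak coefficients are set to exactly zero.
lasso_like = [soft_threshold(c, penalty) for c in raw_coefs]

# L2-style shrinkage: every coefficient is scaled down, none reaches zero.
ridge_like = [c / (1.0 + penalty) for c in raw_coefs]

n_zero_lasso = sum(1 for c in lasso_like if c == 0.0)
n_zero_ridge = sum(1 for c in ridge_like if c == 0.0)
print(n_zero_lasso, n_zero_ridge)
```

In this toy case the three weak coefficients are zeroed by the L1-style update, while the L2-style update only rescales them. That is the behavior we rely on when we read off the surviving features from the fitted model.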

As in Step 1 - Benchmarking, our pipeline will run through the steps shown below. The only difference is in the final step, general_model, where our Logistic Regression uses penalty='l1' (Lasso regularization):

<img src="assets/identify_features.png" width="600px">
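For readers who do not have `lib/project_5.py` at hand, the same idea can be sketched with scikit-learn alone. This is not the implementation of our wrapper functions, only a rough stand-in under several assumptions: the data is synthetic (generated by `make_classification` with the same feature counts as MADELON), the 70/30 split is inferred from the 1400 training rows seen later in this notebook, `C=0.1` is an arbitrary choice, and `SelectFromModel` is one possible way to keep the features with non-zero coefficients.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for MADELON: 500 features, 5 informative, 15 redundant.
X, y = make_classification(
    n_samples=2000,
    n_features=500,
    n_informative=5,
    n_redundant=15,
    random_state=0,
)

# Assumed 70/30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Scale, then keep only features whose L1-penalized coefficient is non-zero.
l1_logreg = LogisticRegression(
    penalty='l1', solver='liblinear', C=0.1, random_state=0
)
selector = make_pipeline(StandardScaler(), SelectFromModel(l1_logreg))

X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

print(X_train_sel.shape, X_test_sel.shape)
```

How many columns survive depends on `C`: a smaller `C` means a stronger penalty and fewer retained features.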

In [1]:
# Import our wrapper functions from the project_5.py in our lib
from lib.project_5 import load_data_from_database, add_to_process_list, make_data_dict, validate_dictionary, general_model, general_transformer

In [2]:
# Load our data, from the database, into a DataFrame
madelon_df = load_data_from_database()

In [3]:
# Make sure our data was loaded correctly. Our DataFrame should have 2000 rows and 501 columns
madelon_df.shape

(2000, 501)

In [4]:
# Create a data dictionary from our DataFrame
data_dictionary = make_data_dict(madelon_df)

In [5]:
# Scale the features with StandardScaler using our general_transformer wrapper
from sklearn.preprocessing import StandardScaler
scaled = general_transformer(StandardScaler(), data_dictionary)
scaled

{'X_test': array([[-0.29001603,  0.30698367,  0.03295526, ...,  1.18186068,
         -1.3640618 ,  1.04047179],
        [ 0.33487423, -1.44866846,  1.589581  , ..., -0.47245656,
         -0.45146634,  0.45816473],
        [-2.47713195,  0.07939914,  0.77299045, ...,  0.67837282,
         -0.0488507 ,  1.23457414],
        ..., 
        [-2.94579965, -0.47330616,  0.64539818, ...,  0.89415333,
          0.11219556, -0.51234702],
        [-0.29001603, -0.1481854 ,  0.46676899, ..., -0.76016391,
         -0.31726113,  0.06996003],
        [ 1.58465475, -0.7008907 ,  0.59436127, ..., -1.11979809,
         -0.71987677,  0.03113956]]),
 'X_train': array([[ 1.42843219,  0.46954405, -0.42637692, ..., -0.18474922,
          0.48797016, -1.0558336 ],
        [ 2.05332245, -0.53833032, -2.9782224 , ...,  0.39066548,
          0.78322163, -1.24993595],
        [ 1.42843219, -1.44866846, -0.14567392, ...,  0.31873864,
         -0.77355885,  0.80754897],
        ..., 
        [-2.00846426, -0.798426

In [6]:
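# Fit a Logistic Regression with an L1 penalty on the scaled data.
# The liblinear solver is named explicitly: newer scikit-learn releases
# default to lbfgs, which does not support the L1 penalty.
from sklearn.linear_model import LogisticRegression
scored = general_model(LogisticRegression(penalty='l1', solver='liblinear'), scaled)
scored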

{'X_test': array([[-0.29001603,  0.30698367,  0.03295526, ...,  1.18186068,
         -1.3640618 ,  1.04047179],
        [ 0.33487423, -1.44866846,  1.589581  , ..., -0.47245656,
         -0.45146634,  0.45816473],
        [-2.47713195,  0.07939914,  0.77299045, ...,  0.67837282,
         -0.0488507 ,  1.23457414],
        ..., 
        [-2.94579965, -0.47330616,  0.64539818, ...,  0.89415333,
          0.11219556, -0.51234702],
        [-0.29001603, -0.1481854 ,  0.46676899, ..., -0.76016391,
         -0.31726113,  0.06996003],
        [ 1.58465475, -0.7008907 ,  0.59436127, ..., -1.11979809,
         -0.71987677,  0.03113956]]),
 'X_train': array([[ 1.42843219,  0.46954405, -0.42637692, ..., -0.18474922,
          0.48797016, -1.0558336 ],
        [ 2.05332245, -0.53833032, -2.9782224 , ...,  0.39066548,
          0.78322163, -1.24993595],
        [ 1.42843219, -1.44866846, -0.14567392, ...,  0.31873864,
         -0.77355885,  0.80754897],
        ..., 
        [-2.00846426, -0.798426

In [7]:
# Check how many feature columns were retained as salient
scored['sal_features'].shape

(1400, 252)

In [8]:
# Count the coefficients that the L1 penalty set to exactly zero
counter = 0
for coef in scored['coef_'].flat:
    if coef == 0.0:
        counter += 1
print(counter)

248
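The two numbers above are consistent with each other: 248 coefficients are exactly zero, and 500 - 248 = 252 matches the number of columns in `sal_features`. The same count can be taken without an explicit loop. The sketch below uses a small made-up coefficient array (not our fitted model) shaped like the `coef_` of a binary Logistic Regression, and also recovers the indices of the surviving features.

```python
import numpy as np

# Toy coefficient row, shaped (1, n_features) like coef_ for a binary model.
coef = np.array([[0.0, 1.2, 0.0, -0.7, 0.0, 0.3]])

# Number of coefficients that are exactly zero.
n_zero = int(np.sum(coef == 0))

# Column indices of the features with non-zero coefficients.
kept_idx = np.flatnonzero(coef.ravel() != 0)

print(n_zero, kept_idx.tolist())
```

Applied to the fitted model, the array would be `scored['coef_']`, and the kept indices identify which of the 500 original columns remain.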






### Metric

**TODO**: Write a statement about the metric you will be using. This is with regard to identifying features. This is the metric that will show you whether or not a feature is important. Provide a brief justification for choosing this metric.

### Benchmark

**TODO**: This may or may not directly connect to your metric. It would be good here to provide a statement about how many features you might be looking for.

## Implementation

Implement the following code pipeline using the functions you write in `lib/project_5.py`.