# STEP 2 - Identify Salient Features Using an $\ell_1$ Penalty

### Domain and Data

Domain: We will use our machine learning pipeline to programmatically select the relevant features from a large feature set.

Data: Our dataset is the same one used in Step 1 (Benchmarking): the MADELON dataset.

### Problem Statement

The task at hand is to identify only the relevant, informative features of our dataset. We have a total of 500 features. Of these, 5 are actually informative and 15 are linear combinations of those 5, which gives us a total of 20 salient features. The remaining 480 features are essentially noise (i.e. distractors) and should be eliminated.

### Solution Statement

To reduce the noise, and thus identify the relevant features, we will use our constructed pipeline with a Logistic Regression model and an L1 penalty (Lasso regularization). The L1 penalty should drive the coefficients of the noise features to exactly 0, thus eliminating these distractors.
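
To illustrate why an L1 penalty can produce coefficients that are exactly 0 (rather than merely small), here is a minimal sketch of the soft-thresholding operator, which appears in coordinate-descent and proximal-gradient Lasso solvers. This is an illustration only and not necessarily what the solver behind our `LogisticRegression` does internally; the function name `soft_threshold`, the `weights` list, and the value of `lam` are made up for this example.

```python
def soft_threshold(w, lam):
    """Shrink w toward zero by lam; return exactly 0.0 when |w| <= lam."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


# Hypothetical unpenalized weights: two large ones and three small "noise" ones.
weights = [2.0, -1.5, 0.3, -0.1, 0.05]
lam = 0.5  # hypothetical penalty strength

shrunk = [soft_threshold(w, lam) for w in weights]
n_zero = sum(1 for w in shrunk if w == 0.0)

print(shrunk)
print("zeroed out: %d of %d" % (n_zero, len(weights)))
```

Weights whose magnitude is below the threshold are set to exactly zero, while larger weights are only shrunk. An L2 (Ridge) penalty, by contrast, scales weights down without zeroing them.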

As in Step 1 (Benchmarking), our pipeline will run through the steps shown below. The only difference is in the final step, general_model, where our Logistic Regression will use penalty='l1' (Lasso regularization):

<img src="assets/identify_features.png" width="600px">

In [1]:
# Import our wrapper functions from the project_5.py in our lib
from lib.project_5 import load_data_from_database, add_to_process_list, make_data_dict, validate_dictionary, general_model, general_transformer

In [2]:
# Load our data, from the database, into a DataFrame
madelon_df = load_data_from_database()

In [3]:
# Make sure our data was loaded correctly. Our DataFrame should have 2000 rows and 501 columns
madelon_df.shape

(2000, 501)

In [4]:
# Create a data dictionary from our DataFrame
data_dictionary = make_data_dict(madelon_df)

In [5]:
# Transform our data using StandardScaler
from sklearn.preprocessing import StandardScaler
scaled = general_transformer(StandardScaler(), data_dictionary)
scaled

{'X_test': array([[ 0.79002585,  1.29603583,  1.67665391, ...,  0.53934519,
         -0.42435928, -0.26481217],
        [ 1.41226843,  0.47200491,  0.07523707, ...,  0.75829452,
         -0.50332383,  0.3560579 ],
        [-0.45445931,  0.07647007, -0.4155197 , ...,  0.46636208,
         -0.58228839,  0.62768856],
        ..., 
        [-0.29889867,  2.02118304,  0.17855429, ..., -0.48241836,
         -0.34539472,  1.13214549],
        [-0.14333802, -1.07717322, -0.23471457, ...,  1.26917629,
          0.44425082,  1.48138491],
        [-1.07670189, -0.35202601, -1.34537464, ...,  0.46636208,
         -2.29318707,  0.31725352]]),
 'X_train': array([[-0.29889867, -0.02241364, -0.82878856, ..., -1.06628325,
          0.36528627,  0.74410169],
        [-0.92114125, -2.56042887,  0.97926271, ..., -0.19048592,
         -0.16114409, -1.23492166],
        [-2.78786899,  1.29603583, -0.05390945, ..., -0.0445197 ,
          0.23367868, -0.65285597],
        ..., 
        [ 0.79002585, -0.714599

In [6]:
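# Run Logistic Regression, but this time with Lasso regularization (denoted by the passed
# parameter penalty='l1'). solver='liblinear' is stated explicitly because newer
# scikit-learn releases default to 'lbfgs', which does not support the L1 penalty.
from sklearn.linear_model import LogisticRegression
scored = general_model(LogisticRegression(penalty='l1', solver='liblinear'), scaled)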
scored

{'X_test': array([[ 0.79002585,  1.29603583,  1.67665391, ...,  0.53934519,
         -0.42435928, -0.26481217],
        [ 1.41226843,  0.47200491,  0.07523707, ...,  0.75829452,
         -0.50332383,  0.3560579 ],
        [-0.45445931,  0.07647007, -0.4155197 , ...,  0.46636208,
         -0.58228839,  0.62768856],
        ..., 
        [-0.29889867,  2.02118304,  0.17855429, ..., -0.48241836,
         -0.34539472,  1.13214549],
        [-0.14333802, -1.07717322, -0.23471457, ...,  1.26917629,
          0.44425082,  1.48138491],
        [-1.07670189, -0.35202601, -1.34537464, ...,  0.46636208,
         -2.29318707,  0.31725352]]),
 'X_train': array([[-0.29889867, -0.02241364, -0.82878856, ..., -1.06628325,
          0.36528627,  0.74410169],
        [-0.92114125, -2.56042887,  0.97926271, ..., -0.19048592,
         -0.16114409, -1.23492166],
        [-2.78786899,  1.29603583, -0.05390945, ..., -0.0445197 ,
          0.23367868, -0.65285597],
        ..., 
        [ 0.79002585, -0.714599

In [7]:
# See how many salient features are remaining after using our Lasso penalty
scored['sal_features'].shape[1]

463

In [8]:
# Verify that the number above (463) is actually the number of salient features remaining.
# We will do this by checking the value of each coefficient. If the value is 0, then that
# feature was eliminated. If we subtract the number of salient features remaining (463)
# from the total number of features (500), we get the number of features that were
# eliminated. Thus, the number we are looking for is 500 - 463 = 37.
counter = 0
for coef in scored['coef_'].flat:
    if coef == 0.0:
        counter += 1
print(counter)

37
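
The same check can be written without an explicit loop. The sketch below uses NumPy's `count_nonzero` on a small, made-up coefficient array that stands in for `scored['coef_']` (assumed here to have shape `(1, n_features)`, as scikit-learn returns for a binary Logistic Regression); the values are hypothetical and are not our actual coefficients.

```python
import numpy as np

# Hypothetical stand-in for scored['coef_']: one row, one column per feature.
coef = np.array([[0.0, 1.2, 0.0, -0.4, 0.0, 0.7]])

n_total = coef.size                      # total number of features
n_kept = int(np.count_nonzero(coef))     # features with a nonzero coefficient
n_eliminated = n_total - n_kept          # features the penalty zeroed out

print("kept=%d eliminated=%d" % (n_kept, n_eliminated))
```

Applied to our real coefficients, `n_kept` and `n_eliminated` should match the 463 and 37 reported above.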


### Metric

We will use the number of salient features remaining as our metric. A feature counts as salient if its coefficient is nonzero, since our Lasso penalty should drive the coefficients of irrelevant features to zero.

### Benchmark

Our scrum overlord, Joshua Cook, has instructed us that if we are left with 30 features or fewer, we should consider this a win.

### Results

As you can see, we fell well short of our benchmark of 30 salient features or fewer. After running our Logistic Regression with Lasso, we were still left with 463 supposedly salient features. Of course, we know there aren't 463 relevant features, since the MADELON dataset contains only 20. Also, though we weren't using test accuracy as our benchmark, it is important to note that running Lasso (as opposed to the default 'l2' Ridge penalty from our Step 1 Benchmark) actually *hurt* our accuracy, as we scored a less-than-stellar 49%.

Perhaps running our Logistic Regression again with a smaller C (the default is 1.0), which strengthens the regularization, will allow for better feature selection, along with an improved test score. We will explore other options in our Step 3 Jupyter Notebook, when we run our model through a GridSearch to exhaustively search for optimal parameters.
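
To make the role of C concrete, here is a sketch of the objective that L1-penalized Logistic Regression minimizes, following the form given in the scikit-learn user guide (scaling conventions may differ between releases). It assumes labels $y_i \in \{-1, +1\}$, with $w$ the coefficient vector, $c$ the intercept, and $x_i$ the $i$-th row of the scaled training data:

$$
\min_{w,\,c}\; \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + c\right)\right)\right)
$$

Because C multiplies the data-fit term, lowering C gives the $\lVert w \rVert_1$ penalty relatively more weight, which should push more coefficients to zero. With the default C = 1.0, the penalty was evidently too weak to eliminate more than 37 of our 500 features.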