**[Feature Engineering Home Page](https://www.kaggle.com/learn/feature-engineering)**

---


# Introduction

In this exercise you'll use some feature selection algorithms to improve your model. Some methods take a while to run, so you'll write functions and verify they work on small samples.

To start, just run the following cell. It takes a minute or so to run.

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
import lightgbm as lgb

import os

clicks = pd.read_parquet('./input/feature-engineering-data/baseline_data.pqt')
data_files = ['count_encodings.pqt',
              'catboost_encodings.pqt',
              'interactions.pqt',
              'past_6hr_events.pqt',
              'downloads.pqt',
              'time_deltas.pqt',
              'svd_encodings.pqt']
data_root = './input/feature-engineering-data'
for file in data_files:
    features = pd.read_parquet(os.path.join(data_root, file))
    clicks = clicks.join(features)

def get_data_splits(dataframe, valid_fraction=0.1):

    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train, valid, test

def train_model(train, valid, test=None, feature_cols=None):
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 'metric': 'auc', 'seed': 7}
    num_round = 1000
    print("Training model!")
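    # Early stopping via a callback; the `early_stopping_rounds` and `verbose_eval`
    # keyword arguments were removed from `lgb.train` in LightGBM 4.0
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                    callbacks=[lgb.early_stopping(stopping_rounds=20, verbose=False)])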
    
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    print(f"Validation AUC score: {valid_score}")
    
    if test is not None: 
        test_pred = bst.predict(test[feature_cols])
        test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
        return bst, valid_score, test_score
    else:
        return bst, valid_score


## Baseline Score

Let's look at the baseline score for all the features we've made so far.

In [2]:
train, valid, test = get_data_splits(clicks)
_, baseline_score, _ = train_model(train, valid, test)

Training model!
Validation AUC score: 0.9658334271834417
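
The baseline relies on `get_data_splits`, which sorts by `click_time` before slicing, so validation and test rows always come after the training rows in time. Here is a minimal sketch of that behavior on a made-up 20-row DataFrame (the `toy` data is hypothetical and only there to make the split sizes easy to check):

```python
import pandas as pd

def get_data_splits(dataframe, valid_fraction=0.1):
    """Same logic as the function defined earlier: sort by time, then slice."""
    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    return train, valid, test

# Hypothetical toy data: 20 clicks, one per minute, in shuffled order
toy = pd.DataFrame({
    'click_time': pd.date_range('2017-11-06 16:00', periods=20, freq='min'),
    'is_attributed': [0, 1] * 10,
}).sample(frac=1, random_state=0)

toy_train, toy_valid, toy_test = get_data_splits(toy)

# Split sizes: everything but the last 20%, then 10%, then 10%
print(len(toy_train), len(toy_valid), len(toy_test))

# Each split ends before the next one starts, so no future rows leak backwards
print(toy_train['click_time'].max() < toy_valid['click_time'].min())
print(toy_valid['click_time'].max() < toy_test['click_time'].min())
```

Splitting by time rather than at random mimics how the model would be used: trained on past clicks and scored on later ones.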


### 1) Which data to use for feature selection?

Since many feature selection methods require calculating statistics from the dataset, should you use all the data for feature selection?

Uncomment the following line after you've decided your answer.

In [3]:
# q_1.solution()

We're now using 91 features for predictions. With this many features, there is a good chance the model is overfitting the data, and removing some of them might reduce the overfitting. The model's performance might also decrease, but if the drop is small, we'd at least end up with a smaller and faster model.

In [4]:
%matplotlib inline

nas = clicks.isna().sum()

In [5]:
nas[nas > 0]

attributed_time    1843715
dtype: int64

In [6]:
clicks.attributed_time.count()

456846

In [7]:
clicks.shape

(2300561, 94)

In [8]:
len(list(clicks.columns))

94

In [9]:
clicks.is_attributed.sum()

456846

In [10]:
clicks.click_time.head(20)

0    2017-11-06 15:13:23
1    2017-11-06 15:41:07
2    2017-11-06 15:42:32
3    2017-11-06 15:56:17
4    2017-11-06 15:57:01
5    2017-11-06 16:00:00
6    2017-11-06 16:00:01
7    2017-11-06 16:00:01
8    2017-11-06 16:00:01
9    2017-11-06 16:00:01
10   2017-11-06 16:00:01
11   2017-11-06 16:00:01
12   2017-11-06 16:00:02
13   2017-11-06 16:00:02
14   2017-11-06 16:00:02
15   2017-11-06 16:00:02
16   2017-11-06 16:00:02
17   2017-11-06 16:00:02
18   2017-11-06 16:00:02
19   2017-11-06 16:00:02
Name: click_time, dtype: datetime64[ns]

In [11]:
import seaborn as sns

### 2) Univariate Feature Selection

Below, use `SelectKBest` with the `f_classif` scoring function to choose 40 features from the 91 features in the data. 

In [12]:
from sklearn.feature_selection import SelectKBest, f_classif
feature_cols = clicks.columns.drop(['click_time', 'attributed_time', 'is_attributed'])
train, valid, test = get_data_splits(clicks)

# Create the selector, keeping 40 features
selector = SelectKBest(f_classif, k=40)

# Use the selector to retrieve the best features
X_new = selector.fit_transform(train[feature_cols], train['is_attributed'])

# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(selector.inverse_transform(X_new), index=train.index, columns=feature_cols)

# Find the columns that were dropped
dropped_columns = selected_features.columns[(selected_features == 0).all()]

q_2.check()


<span style="color:#33cc33">Correct</span>
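
The cell above finds the dropped columns by looking for columns that are all zeros after `inverse_transform`. One caveat is that a kept column that happens to be all zeros in the training data would also be flagged as dropped. As an alternative sketch, `SelectKBest.get_support()` returns a boolean mask of the kept features directly. The data below is synthetic (from `make_classification`) and the names `X_toy`, `y_toy`, and `toy_selector` are placeholders, not part of the exercise:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical stand-in for train[feature_cols] and train['is_attributed']
X_arr, y_toy = make_classification(n_samples=500, n_features=10, n_informative=3,
                                   n_redundant=0, random_state=7)
X_toy = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(10)])

# Fit the selector, keeping the 4 highest-scoring features
toy_selector = SelectKBest(f_classif, k=4).fit(X_toy, y_toy)

# Boolean mask over the input columns: True where the feature was kept
mask = toy_selector.get_support()
kept = X_toy.columns[mask]
dropped = X_toy.columns[~mask]

print(list(kept))
print(list(dropped))
```

In the notebook, the same pattern would apply to `selector` and `feature_cols`, giving `feature_cols[~selector.get_support()]` as the dropped columns.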

In [13]:
(clicks == 0).all()

ip                  False
app                 False
device              False
os                  False
channel             False
                    ...  
os_channel_svd_0    False
os_channel_svd_1    False
os_channel_svd_2    False
os_channel_svd_3    False
os_channel_svd_4    False
Length: 94, dtype: bool

In [None]:
# Uncomment these lines if you need some guidance
# q_2.hint()
# q_2.solution()

In [14]:
_ = train_model(train.drop(dropped_columns, axis=1), 
                valid.drop(dropped_columns, axis=1),
                test.drop(dropped_columns, axis=1))

Training model!
Validation AUC score: 0.9625481759576047


### 3) The best value of K

With this method we can choose the best K features, but we still have to choose K ourselves. How would you find the "best" value of K? That is, you want K to be small so you're keeping only the best features, but not so small that it degrades the model's performance.

Uncomment the following line after you've decided your answer.

In [15]:
q_3.solution()


<span style="color:#33cc99">Solution:</span> To find the best value of K, you can fit multiple models with increasing values of K, then choose the smallest K whose validation score is above some threshold or meets some other criterion. A good way to do this is to loop over values of K and record the validation score for each iteration.
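
The loop described in the solution could look roughly like the sketch below. To keep it self-contained and quick, it uses synthetic data and a plain `LogisticRegression` as a stand-in for the notebook's `train_model`; the candidate values of K and the `tolerance` threshold are arbitrary assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the clicks features
X_arr, y_toy = make_classification(n_samples=1000, n_features=20, n_informative=5,
                                   random_state=7)
X_toy = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(20)])
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.25,
                                          random_state=7)

scores = {}
for k in [2, 5, 10, 15, 20]:
    # Fit the selector on the training split only
    sel = SelectKBest(f_classif, k=k).fit(X_tr, y_tr)
    cols = X_tr.columns[sel.get_support()]
    # Train on the selected columns and score on the validation split
    clf = LogisticRegression(max_iter=1000).fit(X_tr[cols], y_tr)
    scores[k] = roc_auc_score(y_va, clf.predict_proba(X_va[cols])[:, 1])

# Pick the smallest K whose score is within `tolerance` of the best score
best_score = max(scores.values())
tolerance = 0.005  # hypothetical threshold, tune for your problem
chosen_k = min(k for k, s in scores.items() if s >= best_score - tolerance)

print(scores)
print(chosen_k)
```

With the real dataset, each iteration would instead drop the unselected columns and call `train_model`, which is slow, so a coarse grid of K values is a reasonable starting point.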

### 4) Use L1 regularization for feature selection

Now try a more powerful approach using L1 regularization. Implement a function `select_features_l1` that returns a list of features to keep.

Use a `LogisticRegression` classifier model with an L1 penalty to select the features. For the model, set the random state to 7 and the regularization parameter to 0.1. Fit the model then use `SelectFromModel` to return a model with the selected features.

The checking code will run your function on a sample from the dataset to provide more immediate feedback.

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

def select_features_l1(X, y):
    """ Return selected features using logistic regression with an L1 penalty """
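    # The default 'lbfgs' solver in recent scikit-learn versions does not support
    # the L1 penalty, so set a solver that does
    logistic = LogisticRegression(C=0.1, penalty='l1', solver='liblinear',
                                  random_state=7).fit(X, y)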
    model = SelectFromModel(logistic, prefit=True)

    reducedX = model.transform(X)
    
    selected_features = pd.DataFrame(model.inverse_transform(reducedX), index=X.index, columns=X.columns)
    
    return selected_features.columns[(selected_features != 0).any()]

# Run this cell to check your work
q_4.check()




<span style="color:#33cc33">Correct</span>
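
To see why an L1 penalty works for feature selection, it can help to compare it with an L2 penalty on the same data: L1 tends to push some coefficients to exactly zero, while L2 only shrinks them. The sketch below uses synthetic data, and `C=0.05` is an arbitrary value chosen for illustration, not the setting from the exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: 4 informative features, 16 noise features
X_toy, y_toy = make_classification(n_samples=500, n_features=20, n_informative=4,
                                   n_redundant=0, random_state=7)
X_toy = StandardScaler().fit_transform(X_toy)

# Same regularization strength, different penalty
l1_model = LogisticRegression(C=0.05, penalty='l1', solver='liblinear',
                              random_state=7).fit(X_toy, y_toy)
l2_model = LogisticRegression(C=0.05, penalty='l2', solver='liblinear',
                              random_state=7).fit(X_toy, y_toy)

# Count coefficients that are exactly zero under each penalty
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print(f"Zero coefficients with L1: {n_zero_l1}")
print(f"Zero coefficients with L2: {n_zero_l2}")
```

`SelectFromModel` then keeps the features whose coefficients are not (close to) zero, which is what `select_features_l1` relies on.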

In [21]:
# Uncomment these if you're feeling stuck
#q_4.hint()
#q_4.solution()

In [None]:
selected = select_features_l1(train[feature_cols], train['is_attributed'])
dropped_columns = feature_cols.drop(selected)
_ = train_model(train.drop(dropped_columns, axis=1), 
                valid.drop(dropped_columns, axis=1),
                test.drop(dropped_columns, axis=1))



In [None]:
print(dropped_columns)
print(dropped_columns.shape)

### 5) Feature Selection with Trees

Since we're using a tree-based model, using another tree-based model for feature selection might produce better results. What would you do differently to select the features using a tree-based classifier?

Uncomment the following line after you've decided your answer.

In [None]:
#q_5.solution()
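
If you'd like something concrete to compare your answer against, here is one possible approach, sketched on synthetic data: fit a tree ensemble, then let `SelectFromModel` threshold its feature importances. The choice of `ExtraTreesClassifier`, the number of trees, and the default mean-importance threshold are all assumptions here, not necessarily what the official solution uses.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Hypothetical stand-in for train[feature_cols] and train['is_attributed']
X_arr, y_toy = make_classification(n_samples=500, n_features=12, n_informative=4,
                                   n_redundant=0, random_state=7)
X_toy = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(12)])

# Fit a tree ensemble; it exposes `feature_importances_` after fitting
forest = ExtraTreesClassifier(n_estimators=50, random_state=7).fit(X_toy, y_toy)

# For estimators without an L1 penalty, SelectFromModel's default threshold
# is the mean importance, so features at or above the mean are kept
tree_selector = SelectFromModel(forest, prefit=True)
tree_selected = X_toy.columns[tree_selector.get_support()]

print(list(tree_selected))
```

Compared with the L1 version, nothing else changes: the selector still returns a mask over the input columns, only the notion of "importance" comes from the trees instead of the coefficients.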

### 6) Top K features with L1 regularization

Here you've set the regularization parameter `C=0.1` which led to some number of features being dropped. However, by setting `C` you aren't able to choose a certain number of features to keep. What would you do to keep the top K important features using L1 regularization?

Uncomment the following line after you've decided your answer.

In [None]:
#q_6.solution()
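
One way to approach this, sketched under assumptions: since `C` in scikit-learn is the inverse of the regularization strength, you can sweep `C` from large to small and stop at the first value that leaves at most K non-zero coefficients. The data is synthetic, and `target_k` and the grid of `C` values are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the training features
X_toy, y_toy = make_classification(n_samples=500, n_features=20, n_informative=4,
                                   n_redundant=0, random_state=7)
X_toy = StandardScaler().fit_transform(X_toy)

target_k = 5  # hypothetical number of features to keep
chosen_c, kept_idx = None, None

# Smaller C means stronger regularization, so sweep from weak to strong
for c in np.logspace(1, -3, 20):
    model = LogisticRegression(C=c, penalty='l1', solver='liblinear',
                               random_state=7).fit(X_toy, y_toy)
    nonzero = np.flatnonzero(model.coef_[0])
    if len(nonzero) <= target_k:
        # First C at which no more than K features survive
        chosen_c, kept_idx = c, nonzero
        break

print(chosen_c)
print(kept_idx)
```

Note that this may land on fewer than K features if several coefficients drop to zero between two neighboring values of `C`; a finer grid narrows that gap at the cost of more model fits.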

Congratulations on finishing this course! To keep learning, check out the rest of [our courses](https://www.kaggle.com/learn/overview). I recommend the machine learning explainability and deep learning courses as great next skills to learn.

---





*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*