**[Feature Engineering Home Page](https://www.kaggle.com/learn/feature-engineering)**

---


# Introduction

In this exercise you'll apply more advanced encodings to the categorical variables to improve your classifier model. The encodings you will implement are:

- Count Encoding
- Target Encoding
- Leave-one-out Encoding
- CatBoost Encoding
- Feature embedding with SVD 

You'll refit the classifier after each encoding to check its performance on hold-out data. First, run the next cell to repeat the work you did in the last exercise.

In [88]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
import lightgbm as lgb

clicks = pd.read_parquet('baseline_data.pqt')

In [89]:
clicks.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,year,month,day,weekday,hour,minute,second,ip_labels,app_labels,device_labels,os_labels,channel_labels
0,87540,12,1,13,497,2017-11-07 09:30:38,,0,2017,11,7,1,9,30,38,15220,11,1,13,159
1,105560,25,1,17,259,2017-11-07 13:40:27,,0,2017,11,7,1,13,40,27,18448,24,1,17,67
2,101424,12,1,19,212,2017-11-07 18:05:24,,0,2017,11,7,1,18,5,24,17663,11,1,19,52
3,94584,13,1,13,477,2017-11-07 04:58:08,,0,2017,11,7,1,4,58,8,16496,12,1,13,146
4,68413,12,1,1,178,2017-11-09 09:00:09,,0,2017,11,9,3,9,0,9,11852,11,1,1,45


In [90]:
clicks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 20 columns):
ip                 100000 non-null int64
app                100000 non-null int64
device             100000 non-null int64
os                 100000 non-null int64
channel            100000 non-null int64
click_time         100000 non-null datetime64[ns]
attributed_time    227 non-null object
is_attributed      100000 non-null int64
year               100000 non-null int64
month              100000 non-null int64
day                100000 non-null int64
weekday            100000 non-null int64
hour               100000 non-null int64
minute             100000 non-null int64
second             100000 non-null int64
ip_labels          100000 non-null int64
app_labels         100000 non-null int64
device_labels      100000 non-null int64
os_labels          100000 non-null int64
channel_labels     100000 non-null int64
dtypes: datetime64[ns](1), int64(18), object(1)
memory usage: 1

Here I'll define a couple of functions to help test the new encodings.

In [91]:
def get_data_splits(dataframe, valid_fraction=0.1):
    """ Splits a dataframe into train, validation, and test sets. First, orders by 
        the column 'click_time'. Set the size of the validation and test sets with
        the valid_fraction keyword argument.
    """

    dataframe = dataframe.sort_values('click_time', ascending=True)
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train, valid, test

def train_model(train, valid, test=None, feature_cols=None):
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    num_round = 1000
    print("Training model!")
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], 
                    early_stopping_rounds=20, verbose_eval=False)
    
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    print(f"Validation AUC score: {valid_score}")
    
    if test is not None: 
        test_pred = bst.predict(test[feature_cols])
        test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
        return bst, valid_score, test_score
    else:
        return bst, valid_score

Run this cell to get a baseline score. If your encodings do better than this, you can keep them.

In [92]:
print("Baseline model")
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid)

Baseline model
Training model!
Validation AUC score: 0.9324771929824561


### 1) Categorical encodings and leakage

These encodings are all based on statistics calculated from the dataset like counts and means. Considering this, what data should you be using to calculate the encodings?

<span style="color:blue">You should calculate the encodings from the training set only. If you include data from the validation and test sets into the encodings, you'll overestimate the model's performance. You should in general be vigilant to avoid leakage, that is, including any information from the validation and test sets into the model.</span>
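
As a minimal sketch of that rule, assuming a toy frame with a hypothetical `channel` column (invented data, not the exercise dataset), the counts below are learned from the training rows only and then mapped onto the validation rows:

```python
import pandas as pd

# Toy data, invented for illustration only
toy_train = pd.DataFrame({'channel': ['a', 'a', 'b', 'a', 'c']})
toy_valid = pd.DataFrame({'channel': ['a', 'b', 'd']})

# Learn the statistic from the training rows only
channel_counts = toy_train['channel'].value_counts()

# Apply the learned mapping to both sets; values never seen in training get 0
toy_train['channel_count'] = toy_train['channel'].map(channel_counts)
toy_valid['channel_count'] = (
    toy_valid['channel'].map(channel_counts).fillna(0).astype(int)
)

print(toy_valid)
```

The validation value `'d'` never appears in the training rows, so it gets a count of 0 rather than a count that could only be known by looking at the validation data.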

### 2) Count encodings

Here, encode the categorical features `['ip', 'app', 'device', 'os', 'channel']` using the count of each value in the data set. Using `CountEncoder` from the `category_encoders` library, fit the encoding using the categorical feature columns defined in `cat_features`. Then apply the encodings to the train and validation sets, adding them as new columns with names suffixed `"_count"`.

In [67]:
import category_encoders as ce

cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

# Create the count encoder
count_enc = ce.CountEncoder(cols=cat_features)

# Learn encoding from the training set
count_enc.fit(train[cat_features])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_count` as a suffix to the new columns
train_encoded = count_enc.transform(train[cat_features]).add_suffix('_count')
valid_encoded = count_enc.transform(valid[cat_features]).add_suffix('_count')

In [72]:
train = train.join(train_encoded)
valid = valid.join(valid_encoded)

In [73]:
train.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,year,month,...,ip_labels,app_labels,device_labels,os_labels,channel_labels,ip_count,app_count,device_count,os_count,channel_count
54955,48646,12,1,19,178,2017-11-06 16:00:00,,0,2017,11,...,8546,11,1,19,45,6,10779,75626,19258,2430
28314,93836,12,1,30,328,2017-11-06 16:00:09,,0,2017,11,...,16327,11,1,30,86,8,10779,75626,509,813
31830,5314,8,1,13,145,2017-11-06 16:00:09,,0,2017,11,...,918,7,1,13,38,482,1438,75626,17046,1415
99357,73954,23,1,19,153,2017-11-06 16:00:11,,0,2017,11,...,12903,22,1,19,40,22,1152,75626,19258,2397
83228,91574,3,1,17,135,2017-11-06 16:00:11,,0,2017,11,...,15929,2,1,17,34,26,15248,75626,4193,1167


In [74]:
train.os.value_counts().head()

19    19258
13    17046
17     4193
18     3876
22     3176
Name: os, dtype: int64

In [75]:
# Train the model on the encoded datasets
# This can take around 30 seconds to complete
_ = train_model(train, valid)

Training model!
Validation AUC score: 0.848635588972431


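Compare this score with the baseline: in the run shown here, the validation AUC with count encodings (about 0.849) is lower than the baseline (about 0.932), so count encoding did not improve the score on this sample.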

### 3) Why is count encoding effective?
At first glance, it could be surprising that Count Encoding helps make accurate models. 
Why do you think count encoding is a good idea, or how does it improve the model score?

<span style="color:blue">Rare values tend to have similar counts (with values like 1 or 2), so you can classify rare values together at prediction time. Common values with large counts are unlikely to have the same exact count as other values. So, the common/important values get their own grouping.</span>
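
The grouping effect described above can be seen in a small sketch. The column below is invented for illustration: one common value and a few rare ones.

```python
import pandas as pd

# Toy column, invented for illustration: 'x' is common, the rest are rare
os_col = pd.Series(['x'] * 6 + ['y', 'z', 'w', 'w'])

# Count encoding by hand: replace each value with how often it occurs
counts = os_col.value_counts()
encoded = os_col.map(counts)

# Collect the original values that end up sharing each encoded count
groups = {}
for value, count in counts.items():
    groups.setdefault(int(count), []).append(value)

print(groups)
```

Here the rare values `'y'` and `'z'` both encode to 1 and so become indistinguishable to the model, while the common value `'x'` keeps a count of its own.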


### 4) Target encoding

Here you'll try some supervised encodings that use the labels (the targets) to transform categorical features. The first one is target encoding. Create the target encoder from the `category_encoders` library. Then, learn the encodings from the training dataset, apply the encodings to all the datasets and retrain the model.

In [77]:
# Create the target encoder
target_enc = ce.TargetEncoder(cols=cat_features)

# Learn encoding from the training set
target_enc.fit(train[cat_features], train.is_attributed)

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_target` as a suffix to the new columns
train_encoded = target_enc.transform(train[cat_features]).add_suffix('_target')
valid_encoded = target_enc.transform(valid[cat_features]).add_suffix('_target')

In [78]:
train = train.join(train_encoded)
valid = valid.join(valid_encoded)

In [80]:
_ = train_model(train, valid)

Training model!
Validation AUC score: 0.8808902255639098
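
To make the idea behind target encoding concrete, here is a hand-rolled sketch on invented data. It replaces each category with the mean of the target for that category, blended with the overall mean. The additive smoothing and the weight `m` are assumptions chosen for simplicity; `TargetEncoder` in `category_encoders` has its own smoothing scheme, which this sketch does not reproduce.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'app': ['a', 'a', 'a', 'a', 'b', 'c'],
    'is_attributed': [1, 0, 0, 1, 1, 0],
})

m = 2.0  # hypothetical smoothing weight, not a library default
global_mean = toy['is_attributed'].mean()

# Per-category target sum and row count
stats = toy.groupby('app')['is_attributed'].agg(['sum', 'count'])

# Blend each category mean with the global mean; categories with few rows
# are pulled more strongly toward the global mean
smoothed = (stats['sum'] + m * global_mean) / (stats['count'] + m)

toy['app_target'] = toy['app'].map(smoothed)
print(toy)
```

Category `'b'` has a raw mean of 1.0 from a single row, but the smoothed encoding pulls it toward the global mean of 0.5, which is the kind of protection you want for rarely seen levels.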


### 5) Try removing IP encoding

Try leaving `ip` out of the encoded features and retrain the model with target encoding again. You should find that the score increases and is above the baseline score! Why do you think the score is below baseline when we encode the IP address but above baseline when we don't?

<span style="color:blue">Target encoding attempts to measure the population mean of the target for each level in a categorical feature. When there is less data per level, the estimated mean has more variance and is likely to be further from the "true" mean. There is little data per IP address, so the estimates are likely much noisier than for the other features. The model will rely heavily on this feature since it appears extremely predictive on the training data. This causes it to make fewer splits on other features, and those features are fit on just the errors left over after accounting for IP address. So, the model will perform very poorly when seeing new IP addresses that weren't in the training data (which is likely most new data). Going forward, we'll leave out the IP feature when trying different encodings.</span>
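
A rough way to see how much noisier the per-IP estimates are is the standard error of a sample mean of binary outcomes, $\sqrt{p(1-p)/n}$. The sketch below assumes the overall rate is 227 / 100000 (taking each non-null `attributed_time` in the `clicks.info()` output as one positive label) and compares two counts that appear in the `train.head()` output: an `ip` seen 6 times and an `app` seen 10779 times.

```python
import math

# Assumption: each non-null attributed_time corresponds to is_attributed == 1
p = 227 / 100000

def standard_error(p, n):
    """Standard error of the mean of n independent Bernoulli(p) draws."""
    return math.sqrt(p * (1 - p) / n)

# An ip seen 6 times vs. an app seen 10779 times (counts from train.head())
for n in (6, 10779):
    print(n, standard_error(p, n))
```

With only 6 rows the standard error is several times larger than the rate being estimated, so the encoded value for that IP is mostly noise; with 10779 rows it is a small fraction of the rate.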

### 6) CatBoost Encoding

The CatBoost encoder is supposed to work well with the LightGBM model. Encode the categorical features with `CatBoostEncoder` and train the model on the encoded data again.

In [83]:
# Create the CatBoost encoder
cb_enc = ce.CatBoostEncoder(cols=cat_features)

# Learn encoding from the training set
cb_enc.fit(train[cat_features], train.is_attributed)

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_cb` as a suffix to the new columns
cb_train_encoded = cb_enc.transform(train[cat_features]).add_suffix('_cb')
cb_valid_encoded = cb_enc.transform(valid[cat_features]).add_suffix('_cb')

In [84]:
train = train.join(cb_train_encoded)
valid = valid.join(cb_valid_encoded)

In [85]:
_ = train_model(train, valid)

Training model!
Validation AUC score: 0.8828791979949875
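
CatBoost-style encoding is also target-based, but the idea is that the value for each row is computed only from rows that come before it, so a row never sees its own label. The sketch below illustrates that ordered statistic on invented data. The prior weight `a` is a hypothetical choice, and this is a simplified illustration of the idea rather than the code `CatBoostEncoder` runs.

```python
import pandas as pd

# Toy data, invented for illustration; row order is treated as time order
toy = pd.DataFrame({
    'app': ['a', 'b', 'a', 'a', 'b'],
    'is_attributed': [1, 0, 0, 1, 1],
})

a = 1.0                               # hypothetical prior weight
prior = toy['is_attributed'].mean()   # overall target mean

grouped = toy.groupby('app')['is_attributed']

# Target sum and row count over EARLIER rows of the same category only
prev_sum = grouped.cumsum() - toy['is_attributed']
prev_count = grouped.cumcount()

toy['app_cb'] = (prev_sum + prior * a) / (prev_count + a)
print(toy)
```

The first row of each category has no history, so it falls back to the prior; later rows move toward the running mean of the labels seen so far for that category.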


The CatBoost encodings give the highest validation score of the three encodings tried, so we'll keep those.

In [86]:
encoded = cb_enc.transform(clicks[cat_features])
for col in encoded:
    clicks.insert(len(clicks.columns), col + '_cb', encoded[col])

In [87]:
clicks.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,year,month,...,ip_labels,app_labels,device_labels,os_labels,channel_labels,ip_cb,app_cb,device_cb,os_cb,channel_cb
0,87540,12,1,13,497,2017-11-07 09:30:38,,0,2017,11,...,15220,11,1,13,159,0.000284,9.3e-05,0.001626,0.001232,1.14899e-05
1,105560,25,1,17,259,2017-11-07 13:40:27,,0,2017,11,...,18448,24,1,17,67,1.7e-05,4e-06,0.001626,0.001431,8.946127e-07
2,101424,12,1,19,212,2017-11-07 18:05:24,,0,2017,11,...,17663,11,1,19,52,0.000758,9.3e-05,0.001626,0.001714,4.308712e-06
3,94584,13,1,13,477,2017-11-07 04:58:08,,0,2017,11,...,16496,12,1,13,146,0.000758,1e-06,0.001626,0.001232,7.181187e-07
4,68413,12,1,1,178,2017-11-09 09:00:09,,0,2017,11,...,11852,11,1,1,45,0.000569,9.3e-05,0.001626,0.001047,9.358289e-07


# Keep Going

Now you are ready to start **[generating completely new features](https://www.kaggle.com/matleonard/feature-generation)** from the data itself.

*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*