**[Feature Engineering Home Page](https://www.kaggle.com/learn/feature-engineering)**

---


# Introduction

In this set of exercises, you'll create new features from the existing data. As before, you'll measure the score lift from each new feature against a baseline model. First, run the cells below to set up a baseline dataset and model.

In [5]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
import lightgbm as lgb

# Set up code checking
#from learntools.core import binder
#binder.bind(globals())
#from learntools.feature_engineering.ex3 import *

# Create features from timestamps
click_data = pd.read_csv('train_sample.csv', 
                         parse_dates=['click_time'])
click_times = click_data['click_time']
clicks = click_data.assign(day=click_times.dt.day.astype('uint8'),
                           hour=click_times.dt.hour.astype('uint8'),
                           minute=click_times.dt.minute.astype('uint8'),
                           second=click_times.dt.second.astype('uint8'))

# Label encoding for categorical features
cat_features = ['ip', 'app', 'device', 'os', 'channel']
for feature in cat_features:
    label_encoder = preprocessing.LabelEncoder()
    clicks[feature] = label_encoder.fit_transform(clicks[feature])
    
def get_data_splits(dataframe, valid_fraction=0.1):

    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train, valid, test

def train_model(train, valid, test=None, feature_cols=None):
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    num_round = 1000
    print("Training model. Hold on a minute to see the validation score")
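    # Callbacks replace early_stopping_rounds/verbose_eval, which LightGBM 4.0 removed
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                    callbacks=[lgb.early_stopping(20, verbose=False),
                               lgb.log_evaluation(0)])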
    
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    print(f"Validation AUC score: {valid_score}")
    
    if test is not None: 
        test_pred = bst.predict(test[feature_cols])
        test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
        return bst, valid_score, test_score
    else:
        return bst, valid_score

print("Baseline model score")
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid, test)

Baseline model score
Training model. Hold on a minute to see the validation score
Validation AUC score: 0.9324771929824561
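
The chronological split performed by `get_data_splits` can be checked on a small example. The sketch below uses a made-up ten-row frame (the `toy` data and the 0.2 fraction are assumptions for illustration) and repeats the same slicing logic, so that the latest clicks end up in the validation and test sets.

```python
import pandas as pd

# Hypothetical toy data: ten clicks, one per day, in shuffled order
toy = pd.DataFrame({
    'click_time': pd.date_range('2017-11-06', periods=10, freq='D'),
    'is_attributed': [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
}).sample(frac=1, random_state=0)

# Same logic as get_data_splits: sort by time, then slice from the end
ordered = toy.sort_values('click_time')
valid_rows = int(len(ordered) * 0.2)          # 2 rows each for valid and test
train = ordered[:-valid_rows * 2]
valid = ordered[-valid_rows * 2:-valid_rows]
test = ordered[-valid_rows:]

print(len(train), len(valid), len(test))
```

Every training click happens before every validation click, and every validation click before every test click, which mimics predicting on data that arrives after the model was trained.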


In [6]:
clicks.head()

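| | ip | app | device | os | channel | click_time | attributed_time | is_attributed | day | hour | minute | second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15220 | 11 | 1 | 13 | 159 | 2017-11-07 09:30:38 | NaN | 0 | 7 | 9 | 30 | 38 |
| 1 | 18448 | 24 | 1 | 17 | 67 | 2017-11-07 13:40:27 | NaN | 0 | 7 | 13 | 40 | 27 |
| 2 | 17663 | 11 | 1 | 19 | 52 | 2017-11-07 18:05:24 | NaN | 0 | 7 | 18 | 5 | 24 |
| 3 | 16496 | 12 | 1 | 13 | 146 | 2017-11-07 04:58:08 | NaN | 0 | 7 | 4 | 58 | 8 |
| 4 | 11852 | 11 | 1 | 1 | 45 | 2017-11-09 09:00:09 | NaN | 0 | 9 | 9 | 0 | 9 |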


### 1) Add interaction features

Here you'll add interaction features for each pair of categorical features (ip, app, device, os, channel). The easiest way to iterate through the pairs of features is with `itertools.combinations`. For each new column, join the values as strings with an underscore, so 13 and 47 would become `"13_47"`. As you add the new columns to the dataset, be sure to label encode the values.

In [2]:
import itertools

cat_features = ['ip', 'app', 'device', 'os', 'channel']
interactions = pd.DataFrame(index=clicks.index)

# Iterate through each pair of features, combine them into interaction features
for col1, col2 in itertools.combinations(cat_features, 2):
    new_col_name = '_'.join([col1, col2])
    print(new_col_name)

    # Convert to strings and combine
    new_values = clicks[col1].map(str) + "_" + clicks[col2].map(str)
    encoder = preprocessing.LabelEncoder()
    interactions[new_col_name] = encoder.fit_transform(new_values)

ip_app
ip_device
ip_os
ip_channel
app_device
app_os
app_channel
device_os
device_channel
os_channel


In [3]:
interactions.head()

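| | ip_app | ip_device | ip_os | ip_channel | app_device | app_os | app_channel | device_os | device_channel | os_channel |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15234 | 6356 | 15276 | 17939 | 34 | 183 | 66 | 55 | 91 | 479 |
| 1 | 24800 | 10281 | 24825 | 29300 | 199 | 954 | 248 | 59 | 149 | 1003 |
| 2 | 22485 | 9302 | 22504 | 26559 | 34 | 189 | 77 | 61 | 133 | 1248 |
| 3 | 19001 | 7903 | 19074 | 22383 | 55 | 272 | 102 | 55 | 77 | 466 |
| 4 | 5364 | 2269 | 5366 | 6266 | 34 | 175 | 74 | 24 | 125 | 1365 |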


In [4]:
clicks = clicks.join(interactions)
print("Score with interactions")
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid)

Score with interactions
Training model. Hold on a minute to see the validation score
Validation AUC score: 0.9143819548872181


# Generating numerical features

Adding interactions is a quick way to create more categorical features from the data. Creating new numerical features is also effective and will typically give you a lot of improvement in the model. It takes a bit of brainstorming and experimentation to find features that work well.

For these exercises I'm going to have you implement functions that operate on pandas Series. It can take multiple minutes to run these functions on the entire dataset, so instead I'll provide feedback by running your function on a smaller dataset.

### 2) Number of events in the past six hours

The first feature you'll be creating is the number of events from the same IP in the last six hours. It's likely that someone who is visiting often will download the app.

Implement a function `count_past_events` that takes a Series of click times (timestamps) and returns another Series with the number of events in the last six hours. **Tip:** The `rolling` method is useful for this.

In [35]:
pd.Series(clicks.index, index=clicks.click_time).sort_index().rolling('2S').count().head(10)

click_time
2017-11-06 16:00:00    1.0
2017-11-06 16:00:09    1.0
2017-11-06 16:00:09    2.0
2017-11-06 16:00:11    1.0
2017-11-06 16:00:11    2.0
2017-11-06 16:00:13    1.0
2017-11-06 16:00:19    1.0
2017-11-06 16:00:20    2.0
2017-11-06 16:00:20    3.0
2017-11-06 16:00:21    3.0
dtype: float64

In [7]:
def count_past_events(series, time_window='6H'):
    series = pd.Series(series.index, index=series)
    # Subtract 1 so the current event isn't counted
    past_events = series.rolling(time_window).count() - 1
    return past_events
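
As a quick check, here is a minimal sketch that runs the function on a made-up frame (the `toy` data is hypothetical) in two ways: once over all clicks, and once within each `ip` group. The per-IP loop is one possible way to restrict the count to a single IP, not necessarily the method used to build the course's feature. The lowercase `'6h'` alias is used because recent pandas versions emit a deprecation warning for `'6H'`.

```python
import pandas as pd

def count_past_events(series, time_window='6h'):
    # Move the timestamps into the index so rolling() can use a time window
    series = pd.Series(series.index, index=series)
    # Subtract 1 so the current event isn't counted
    return series.rolling(time_window).count() - 1

# Hypothetical toy data: two IPs, clicks already sorted by time
toy = pd.DataFrame({
    'ip': [1, 2, 1, 1, 2],
    'click_time': pd.to_datetime([
        '2017-11-06 10:00', '2017-11-06 10:30', '2017-11-06 11:00',
        '2017-11-06 12:30', '2017-11-06 17:30',
    ]),
})

# Count over all clicks, regardless of IP
overall = count_past_events(toy['click_time']).to_numpy()

# Count within each IP, then restore the original row order
parts = []
for ip, group in toy.groupby('ip'):
    counts = count_past_events(group['click_time'])
    parts.append(pd.Series(counts.to_numpy(), index=group.index))
per_ip = pd.concat(parts).sort_index()

print(overall.tolist())  # [0.0, 1.0, 2.0, 3.0, 1.0]
print(per_ip.tolist())   # [0.0, 0.0, 1.0, 2.0, 0.0]
```

The last click has only one earlier click inside its six-hour window overall, and none from the same IP.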

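This can take a while to calculate on the full data. The cell below sorts the clicks by time and applies the function to every click. Note that, as written, this counts events from all IPs within the window rather than only those from the same IP; a per-IP count would require applying the function within each `ip` group.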

In [41]:
clicks = clicks.sort_values(by=['click_time'], ascending=True)

past_events = count_past_events(clicks.click_time, time_window='6H')
past_events.head()

click_time
2017-11-06 16:00:00    0.0
2017-11-06 16:00:09    1.0
2017-11-06 16:00:09    2.0
2017-11-06 16:00:11    3.0
2017-11-06 16:00:11    4.0
dtype: float64

In [43]:
clicks['ip_past_6hr_counts'] = past_events.values

train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid, test)

Training model. Hold on a minute to see the validation score
Validation AUC score: 0.9316852130325814


### 3) Features from future information

In the last exercise you created a feature that looked at past events. You could also make features that use information from events in the future. Should you use future events or not? 


<span style="color:blue">In general, you shouldn't use information from the future. When you're using models like this in a real-world scenario you won't have data from the future. Your model's score will likely be higher when training and testing on historical data, but it will overestimate the performance on real data. I should note that using future data will improve the score on Kaggle competition test data, but avoid it when building machine learning products.</span>
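
To make the distinction concrete, here is a small sketch on made-up timestamps. `diff()` measures the time since the previous click, which is known at the moment a click happens, while `diff(-1)` measures the time until the next click, which is only available in hindsight. The variable names are illustrative.

```python
import pandas as pd

# Hypothetical click times for a single IP, sorted by time
times = pd.Series(pd.to_datetime([
    '2017-11-06 16:00:00',
    '2017-11-06 16:00:09',
    '2017-11-06 16:00:20',
]))

# Past-only feature: seconds since the previous click (NaN for the first click)
since_previous = times.diff().dt.total_seconds()

# Future-looking feature: seconds until the next click (NaN for the last click).
# This leaks information that would not exist at prediction time.
until_next = -times.diff(-1).dt.total_seconds()

print(since_previous.tolist())  # [nan, 9.0, 11.0]
print(until_next.tolist())      # [9.0, 11.0, nan]
```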

### 4) Time since last event

Implement a function `time_diff` that calculates the time since the last event in seconds from a Series of timestamps. This will be run like so:

```python
timedeltas = clicks.groupby('ip')['click_time'].transform(time_diff)
```

In [45]:
def time_diff(series):
    return series.diff().dt.total_seconds()

In [46]:
timedeltas = clicks.groupby('ip')['click_time'].transform(time_diff)
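
To see what the grouped transform returns, here is a minimal sketch on a made-up frame with two IPs (the `toy` data and the `seconds_since_last` column name are assumptions for illustration).

```python
import pandas as pd

def time_diff(series):
    """Seconds elapsed since the previous timestamp in the Series."""
    return series.diff().dt.total_seconds()

# Hypothetical toy data: clicks from two IPs, sorted by time
toy = pd.DataFrame({
    'ip': [1, 2, 1, 2],
    'click_time': pd.to_datetime([
        '2017-11-06 16:00:00', '2017-11-06 16:00:05',
        '2017-11-06 16:00:30', '2017-11-06 16:01:05',
    ]),
})

# transform applies time_diff within each IP and keeps the original row order
toy['seconds_since_last'] = toy.groupby('ip')['click_time'].transform(time_diff)
print(toy['seconds_since_last'].tolist())  # [nan, nan, 30.0, 60.0]
```

The first click from each IP has no earlier click to compare against, so its value is `NaN`. LightGBM handles missing values natively, but other models may need them filled.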

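The cell below adds the time differences computed by your function to the dataset and retrains the model.

In [55]:
# Note: despite the column name, this stores the seconds since the previous click from the same IP
clicks['past_events_6hr'] = timedeltas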

train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid, test)

Training model. Hold on a minute to see the validation score
Validation AUC score: 0.9294175438596491


In [56]:
clicks.head()

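| | ip | app | device | os | channel | click_time | attributed_time | is_attributed | day | hour | minute | second | ip_past_6hr_counts | past_events_6hr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 54955 | 8546 | 11 | 1 | 19 | 45 | 2017-11-06 16:00:00 | NaN | 0 | 6 | 16 | 0 | 0 | 0.0 | NaN |
| 28314 | 16327 | 11 | 1 | 30 | 86 | 2017-11-06 16:00:09 | NaN | 0 | 6 | 16 | 0 | 9 | 1.0 | NaN |
| 31830 | 918 | 7 | 1 | 13 | 38 | 2017-11-06 16:00:09 | NaN | 0 | 6 | 16 | 0 | 9 | 2.0 | NaN |
| 99357 | 12903 | 22 | 1 | 19 | 40 | 2017-11-06 16:00:11 | NaN | 0 | 6 | 16 | 0 | 11 | 3.0 | NaN |
| 83228 | 15929 | 2 | 1 | 17 | 34 | 2017-11-06 16:00:11 | NaN | 0 | 6 | 16 | 0 | 11 | 4.0 | NaN |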


### 5) Number of previous app downloads

It's likely that if a visitor downloaded an app previously, it'll affect the likelihood they'll download one again. Implement a function `previous_attributions` that returns a Series with the number of times an app has been downloaded (`'is_attributed' == 1`) before the current event.

In [58]:
def previous_attributions(series):
    # Subtract the raw values so the current event isn't counted.
    # min_periods=2 leaves the first event in each group as NaN.
    sums = series.expanding(min_periods=2).sum() - series
    return sums

In [64]:
app_count = clicks.groupby('ip')['is_attributed'].transform(previous_attributions)
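
A short sketch on a made-up attribution history shows what this returns. The `history` values are hypothetical, and the `cumsum` variant is an alternative added here for comparison (it yields 0 instead of `NaN` for the first event), not part of the exercise solution.

```python
import pandas as pd

def previous_attributions(series):
    # Running total of downloads minus the current row's own value
    return series.expanding(min_periods=2).sum() - series

# Hypothetical download flags for one IP, in time order
history = pd.Series([0, 1, 0, 1, 0])

with_nan = previous_attributions(history)
# Alternative sketch: same counts, but 0 rather than NaN for the first event
with_zero = history.cumsum() - history

print(with_nan.tolist())   # [nan, 0.0, 1.0, 1.0, 2.0]
print(with_zero.tolist())  # [0, 0, 1, 1, 2]
```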

Add the counts to the dataset as a new feature.

In [65]:
clicks['Number of previous app downloads'] = app_count

In [66]:
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid, test)

Training model. Hold on a minute to see the validation score
Validation AUC score: 0.9295378446115289


### 6) Tree-based vs Neural Network Models

So far we've been using LightGBM, a tree-based model. Would these features we've generated work well for neural networks as well as tree-based models?


<span style="color:blue">The features themselves will work for either model. However, numerical inputs to neural networks need to be standardized first. That is, the features need to be scaled such that they have 0 mean and a standard deviation of 1. This can be done using sklearn.preprocessing.StandardScaler.</span>
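
As a minimal sketch of that standardization step, the example below scales two made-up numerical columns (the arrays are hypothetical stand-ins for features such as the counts and time differences above). Fitting the scaler on the training rows only, then reusing it for validation rows, keeps validation statistics from leaking into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical features: rows are clicks, columns are features
train_X = np.array([[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]])
valid_X = np.array([[2.0, 70.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_X)  # learn mean and std from training data
valid_scaled = scaler.transform(valid_X)      # reuse the training mean and std

print(train_scaled.mean(axis=0))  # approximately [0, 0]
print(train_scaled.std(axis=0))   # approximately [1, 1]
```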

Now that you've generated a bunch of different features, you'll typically want to remove some of them to reduce the size of the model and potentially improve the performance. Next, I'll show you how to do feature selection using a few different methods such as L1 regularization and Boruta.

# Keep Going

You know how to generate a lot of features. In practice, you'll frequently want to pare them down for modeling. Learn to do that in the **[Feature Selection lesson](https://www.kaggle.com/matleonard/feature-selection)**.

*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*