**[Feature Engineering Home Page](https://www.kaggle.com/learn/feature-engineering)**

---


# Introduction

In this exercise, you will develop a baseline model for predicting if a customer will buy an app after clicking on an ad. With this baseline model, you'll be able to see how your feature engineering and selection efforts improve the model's performance.

In [41]:
import pandas as pd

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder

In [42]:
click_data = pd.read_csv('train_sample.csv', parse_dates=['click_time'])
click_data.head(10)

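| | ip | app | device | os | channel | click_time | attributed_time | is_attributed |
|---|---|---|---|---|---|---|---|---|
| 0 | 87540 | 12 | 1 | 13 | 497 | 2017-11-07 09:30:38 | NaN | 0 |
| 1 | 105560 | 25 | 1 | 17 | 259 | 2017-11-07 13:40:27 | NaN | 0 |
| 2 | 101424 | 12 | 1 | 19 | 212 | 2017-11-07 18:05:24 | NaN | 0 |
| 3 | 94584 | 13 | 1 | 13 | 477 | 2017-11-07 04:58:08 | NaN | 0 |
| 4 | 68413 | 12 | 1 | 1 | 178 | 2017-11-09 09:00:09 | NaN | 0 |
| 5 | 93663 | 3 | 1 | 17 | 115 | 2017-11-09 01:22:13 | NaN | 0 |
| 6 | 17059 | 1 | 1 | 17 | 135 | 2017-11-09 01:17:58 | NaN | 0 |
| 7 | 121505 | 9 | 1 | 25 | 442 | 2017-11-07 10:01:53 | NaN | 0 |
| 8 | 192967 | 2 | 2 | 22 | 364 | 2017-11-08 09:35:17 | NaN | 0 |
| 9 | 143636 | 3 | 1 | 19 | 135 | 2017-11-08 12:35:26 | NaN | 0 |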


In [43]:
click_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
ip                 100000 non-null int64
app                100000 non-null int64
device             100000 non-null int64
os                 100000 non-null int64
channel            100000 non-null int64
click_time         100000 non-null datetime64[ns]
attributed_time    227 non-null object
is_attributed      100000 non-null int64
dtypes: datetime64[ns](1), int64(6), object(1)
memory usage: 6.1+ MB


## Baseline Model

The first thing you need to do is construct a baseline model. All new features, processing, encodings, and feature selection should improve upon this baseline model. First you need to do a bit of feature engineering before training the model itself.

### 1) Features from timestamps
From the timestamps, create features for the day, hour, minute and second. Store these as new integer columns `day`, `hour`, `minute`, and `second` in a new DataFrame `clicks`.

In [44]:
clicks = click_data.assign(year=click_data.click_time.dt.year,
                           month=click_data.click_time.dt.month,
                           day=click_data.click_time.dt.day,
                           weekday=click_data.click_time.dt.weekday,
                           hour=click_data.click_time.dt.hour,
                           minute=click_data.click_time.dt.minute,
                           second=click_data.click_time.dt.second
                          )

In [45]:
clicks.head()

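| | ip | app | device | os | channel | click_time | attributed_time | is_attributed | year | month | day | weekday | hour | minute | second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 87540 | 12 | 1 | 13 | 497 | 2017-11-07 09:30:38 | NaN | 0 | 2017 | 11 | 7 | 1 | 9 | 30 | 38 |
| 1 | 105560 | 25 | 1 | 17 | 259 | 2017-11-07 13:40:27 | NaN | 0 | 2017 | 11 | 7 | 1 | 13 | 40 | 27 |
| 2 | 101424 | 12 | 1 | 19 | 212 | 2017-11-07 18:05:24 | NaN | 0 | 2017 | 11 | 7 | 1 | 18 | 5 | 24 |
| 3 | 94584 | 13 | 1 | 13 | 477 | 2017-11-07 04:58:08 | NaN | 0 | 2017 | 11 | 7 | 1 | 4 | 58 | 8 |
| 4 | 68413 | 12 | 1 | 1 | 178 | 2017-11-09 09:00:09 | NaN | 0 | 2017 | 11 | 9 | 3 | 9 | 0 | 9 |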


### 2) Label Encoding
For each of the categorical features `['ip', 'app', 'device', 'os', 'channel']`, use scikit-learn's `LabelEncoder` to create new features in the `clicks` DataFrame. The new column names should be the original column name with `'_labels'` appended, like `ip_labels`.

In [46]:
cat_features = ['ip', 'app', 'device', 'os', 'channel']

# Create new columns in clicks using preprocessing.LabelEncoder()
encoder = LabelEncoder()
encoded = clicks[cat_features].apply(encoder.fit_transform).add_suffix('_labels')

In [47]:
# clicks and encoded share the same index, so they can be joined directly
clicks = clicks.join(encoded)
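
To make the label-encoding step concrete, here is a minimal sketch on a toy DataFrame. The values in `toy` are made up for illustration and are not taken from `train_sample.csv`. It follows the same pattern as the cells above: each column is fitted separately, so each column gets its own mapping from its sorted unique values to the integers `0..n-1`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({'app': [12, 25, 12, 3], 'os': [13, 17, 19, 13]})

# apply() calls fit_transform once per column, refitting the encoder each time,
# so every column is mapped independently to 0..n_unique-1
toy_labels = toy.apply(LabelEncoder().fit_transform).add_suffix('_labels')

# Both frames share the same index, so join lines the rows up correctly
toy = toy.join(toy_labels)
print(toy)
```

Because the encoder is refitted for every column, the fitted mappings are not kept. If you needed to encode new data later, you would likely want one stored encoder per column instead.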

In [48]:
clicks.to_parquet('baseline_data.pqt')

### 3) One-hot Encoding

Now you have label encoded features, does it make sense to use one-hot encoding for the categorical variables ip, app, device, os, or channel?

<span style="color:blue">The <code>ip</code> column has 58,000 distinct values, so one-hot encoding it would create an extremely sparse matrix with 58,000 columns. That many columns will make your model run very slowly, so in general you want to avoid one-hot encoding features with many levels. LightGBM models work with label-encoded features, so you don't actually need to one-hot encode the categorical features.</span>
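
One way to check this kind of cardinality argument yourself is to count the distinct values in each categorical column. The sketch below uses a small made-up DataFrame rather than the real click data, and the `max_levels` cutoff is an arbitrary number chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for the click data (hypothetical values)
toy = pd.DataFrame({
    'ip': [87540, 105560, 101424, 94584, 68413, 93663],
    'device': [1, 1, 1, 1, 2, 1],
    'os': [13, 17, 19, 13, 1, 17],
})

# Distinct levels per column; one-hot encoding adds one column per level
levels = toy.nunique()
print(levels)

# max_levels is an arbitrary cutoff used for illustration only
max_levels = 4
one_hot_candidates = levels[levels <= max_levels].index.tolist()
print(one_hot_candidates)
```

On the real data you could run `click_data[cat_features].nunique()` in the same way to see which columns are too wide to one-hot encode.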

## Train, validation, and test sets
With our baseline features ready, we need to split our data into training and validation sets. We should also hold out a test set to measure the final accuracy of the model.

### 4) Train/test splits with time series data
This is time series data. Are there any special considerations when creating train/test splits for time series? If so, what and why?

<span style="color:blue">Since our model is meant to predict events in the future, we must also validate the model on events in the future. If the data is mixed up between the training and test sets, then future data will leak into the model and our validation results will overestimate the performance on new data.</span>
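
The leakage argument can be illustrated with a small sketch. It assumes a toy click log of ten made-up timestamps and compares a chronological split with a split that mixes time periods. The variable names here are hypothetical and chosen for this example only.

```python
import pandas as pd

# Toy click log (hypothetical): one click per day for 10 days, in time order
toy = pd.DataFrame({'click_time': pd.date_range('2017-11-01', periods=10, freq='D')})

# Chronological split: sort by time, hold out the last 20% for validation
srt = toy.sort_values('click_time')
n_valid = int(len(srt) * 0.2)
chrono_train, chrono_valid = srt[:-n_valid], srt[-n_valid:]

# Every training click happens before every validation click
chronological_ok = chrono_train.click_time.max() < chrono_valid.click_time.min()
print(chronological_ok)

# Mixed split: take every fifth row for validation, regardless of time
mixed_valid = toy.iloc[::5]
mixed_train = toy.drop(mixed_valid.index)

# The training rows now contain clicks later than some validation clicks
leaks_future = mixed_train.click_time.max() > mixed_valid.click_time.min()
print(leaks_future)
```

In the mixed split the model would train on clicks that happen after the ones it is validated on, which is the leak described above.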

### Create train/validation/test splits

Here we'll create training, validation, and test splits. First, the `clicks` DataFrame is sorted in order of increasing time. The first 80% of the rows are the train set, the next 10% are the validation set, and the last 10% are the test set.

In [53]:
feature_cols = ['day', 'hour', 'minute', 'second', 
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']

valid_fraction = 0.1
clicks_srt = clicks.sort_values('click_time', ascending=True)
valid_rows = int(len(clicks_srt) * valid_fraction)
train = clicks_srt[:-valid_rows * 2]
# valid size == test size, last two sections of the data
valid = clicks_srt[-valid_rows * 2:-valid_rows]
test = clicks_srt[-valid_rows:]

In [54]:
# Report each split's own outcome rate; overall, clicks.is_attributed.mean() is about 0.0023
for each in [train, valid, test]:
    print(f"Outcome fraction = {each.is_attributed.mean():.4f}")


### Train with LightGBM

Now we can create LightGBM dataset objects for each of the smaller datasets and train the baseline model.

In [55]:
import lightgbm as lgb

dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
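# Stop once the validation AUC has not improved for 10 rounds, and log every round.
# Callbacks are used because lgb.train no longer accepts early_stopping_rounds in LightGBM 4.0+.
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                callbacks=[lgb.early_stopping(stopping_rounds=10), lgb.log_evaluation(period=1)])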

[1]	valid_0's auc: 0.932413
Training until validation scores don't improve for 10 rounds.
[2]	valid_0's auc: 0.530091
[3]	valid_0's auc: 0.446663
[4]	valid_0's auc: 0.608956
[5]	valid_0's auc: 0.608624
[6]	valid_0's auc: 0.809223
[7]	valid_0's auc: 0.810089
[8]	valid_0's auc: 0.810314
[9]	valid_0's auc: 0.80959
[10]	valid_0's auc: 0.809999
[11]	valid_0's auc: 0.809321
Early stopping, best iteration is:
[1]	valid_0's auc: 0.932413


## Evaluate the model
Finally, with the model trained, we'll evaluate its performance on the test set.

In [52]:
from sklearn import metrics

ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['is_attributed'], ypred)
print(f"Test score: {score}")

Test score: 0.7811898797595189


This will be our baseline score for the model. When we transform features, add new ones, or perform feature selection, we should be improving on this score. However, since this is the test set, we only want to look at it at the end of all our manipulations. At the very end of this course you'll look at the test score again to see if you improved on the baseline model.
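
For intuition about what the AUC score measures, it can be read as the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one, with ties counted as half. The sketch below computes that directly on made-up labels and scores. `pairwise_auc` is a hypothetical helper written for this illustration, not part of scikit-learn, and it should agree with `metrics.roc_auc_score` on binary labels.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores (hypothetical, not output from the baseline model)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

This loop over all pairs is quadratic in the number of rows, so it is only meant for small examples. Use `metrics.roc_auc_score` on real data.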

# Keep Going
Now that you have a baseline model, you are ready to learn **[Categorical Encoding Techniques](https://www.kaggle.com/matleonard/categorical-encodings)** to improve it.
