# Building Predictive Models 
### for [Click-Through Rate Prediction](https://www.kaggle.com/c/avazu-ctr-prediction) by Avazu
*Phong Nguyen, July 2019*

To recap from the previous EDA, the task is to build a **binary classification model** that predicts the probability of an ad being clicked. In this notebook, I will go through feature engineering, model training, parameter tuning and making predictions.

As the dataset is huge, I will use a small sample of 100,000 events for faster processing, then use the tuned parameters to build a model on the full dataset later. I also go through the model building process iteratively rather than doing exhaustive feature engineering first.

## Data loading and formatting

In [1]:
import pandas as pd
import numpy as np

from scipy.sparse import hstack

from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction import FeatureHasher

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import log_loss

import xgboost as xgb
import lightgbm as lgb

import gzip
import pickle



In [2]:
# Save the sample to a file for faster loading later
train_size = 40428967
already_sample = False

if already_sample:
    sample = pd.read_csv('sample-100k.csv', index_col=0)
else:
    sample_size = 10**5
    skip = sorted(np.random.choice(np.arange(1, train_size + 1), train_size - sample_size, replace=False))
    sample = pd.read_csv('train', skiprows=skip, index_col=0)
    sample.to_csv('sample-100k.csv')
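
The `skip` list above holds roughly 40 million row numbers, which is slow to build and sort. A minimal sketch of an alternative, assuming the file has exactly one header line: pass a callable to `skiprows`, so pandas only checks each row number against a small set of rows to keep. The helper name `sample_csv_rows` and the demo data are hypothetical, not part of the original notebook.

```python
import io

import numpy as np
import pandas as pd


def sample_csv_rows(path_or_buffer, n_rows, n_keep, seed=0):
    """Read a uniform random sample of n_keep data rows from a CSV with one header line."""
    rng = np.random.default_rng(seed)
    # Data rows are numbered 1..n_rows in the file; row 0 is the header
    keep = set((rng.choice(n_rows, size=n_keep, replace=False) + 1).tolist())
    keep.add(0)  # always keep the header
    return pd.read_csv(path_or_buffer, skiprows=lambda i: i not in keep, index_col=0)


# Tiny in-memory stand-in for the real 'train' file
demo_csv = "id,click\n" + "\n".join(f"{i},{i % 2}" for i in range(50))
demo = sample_csv_rows(io.StringIO(demo_csv), n_rows=50, n_keep=10)
print(demo.shape)
```

For the real data this would be called as `sample_csv_rows('train', train_size, 10**5)`. The callable is evaluated once per row, so reading is still a full pass over the file.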

In [3]:
def format_data(sample):
    cat_attribs = sorted(set(sample.columns) - { 'hour', 'click' })
    for c in cat_attribs:
        sample[c] = sample[c].astype('category')

    sample['hour'] = pd.to_datetime(sample['hour'], format='%y%m%d%H')
    sample['click'] = sample['click'].astype(np.uint8)
    
    return sample

In [4]:
sample = format_data(sample)

In [5]:
sample.describe(include='all').T

| | count | unique | top | freq | first | last | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| click | 100000 | | | | | | 0.17038 | 0.375968 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| hour | 100000 | 240.0 | 2014-10-22 09:00:00 | 1086.0 | 2014-10-21 00:00:00 | 2014-10-30 23:00:00 | | | | | | | |
| C1 | 100000 | 7.0 | 1005 | 91932.0 | | | | | | | | | |
| banner_pos | 100000 | 7.0 | 0 | 72118.0 | | | | | | | | | |
| site_id | 100000 | 1465.0 | 85f751fd | 36013.0 | | | | | | | | | |
| site_domain | 100000 | 1319.0 | c4e18dd6 | 37326.0 | | | | | | | | | |
| site_category | 100000 | 20.0 | 50e219e0 | 40785.0 | | | | | | | | | |
| app_id | 100000 | 1272.0 | ecad2386 | 63987.0 | | | | | | | | | |
| app_domain | 100000 | 91.0 | 7801e8d9 | 67415.0 | | | | | | | | | |
| app_category | 100000 | 23.0 | 07d7df22 | 64785.0 | | | | | | | | | |


## Building a first simple model

As we need to submit the probability of the prediction, **Logistic Regression** is a good first choice. I will build a first simple model with one attribute. `banner_pos` looks like a reasonable choice as the position of an ad can play a role in it being clicked. 

The performance metric is **logloss** as chosen by the competition organiser.

In [6]:
labels = sample['click']

cat_attribs = ['banner_pos']
train_data = sample[cat_attribs]
onehot = OneHotEncoder(categories='auto', handle_unknown='ignore').fit(train_data)
train_data_prepared = onehot.transform(train_data)

logreg = LogisticRegression(solver='lbfgs', max_iter=500)
s = cross_val_score(logreg, train_data_prepared, labels, scoring='neg_log_loss', cv=5).mean()
print('Model with banner_pos:', -s)

Model with banner_pos: 0.45624931576686867
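
The `handle_unknown='ignore'` option matters here because the encoder is fitted on a 100k sample, so the test set may contain categories it has never seen. A small sketch on made-up values (the arrays below are illustrative, not from the dataset) shows that an unseen category is encoded as an all-zero row instead of raising an error:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training column with two categories: 0 and 1
train = np.array([[0], [1], [1], [0]])
encoder = OneHotEncoder(categories='auto', handle_unknown='ignore').fit(train)

# A category seen during fit gets a single 1 in its column
seen = encoder.transform(np.array([[1]])).toarray()
# A category never seen during fit (7 here) becomes all zeros
unseen = encoder.transform(np.array([[7]])).toarray()

print(seen)
print(unseen)
```

For logistic regression, an all-zero row means the prediction falls back to the intercept for that feature.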


OK, so what is the logloss for a naive model that always predicts the mean CTR?

In [7]:
m = labels.mean()
print('Model predicting mean:', log_loss(labels, [m] * len(labels)))

Model predicting mean: 0.45648823999656146
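
As a sanity check on this baseline: when every prediction is the same constant `p`, the logloss reduces to the binary entropy `-(p ln p + (1 - p) ln(1 - p))` evaluated at the click rate. The sketch below recomputes it from the sample CTR of 0.17038 reported by `describe` above; the function name `constant_logloss` is only illustrative.

```python
import numpy as np


def constant_logloss(p):
    """Logloss of always predicting p when the true positive rate is also p."""
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))


ctr = 0.17038  # mean of `click` in the 100k sample
baseline = constant_logloss(ctr)

# The same number computed sample by sample: 17,038 clicks out of 100,000 events
y = np.zeros(100_000)
y[:17_038] = 1
per_sample = -(y * np.log(ctr) + (1 - y) * np.log(1 - ctr))

print(baseline, per_sample.mean())
```

Both values come out at roughly 0.4565, in line with the output above. Any useful model has to beat this number.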


Well, the naive mean model is just a little bit worse than my first model with `banner_pos` (0.45649 vs 0.45625). Good start anyway. I want to make a Kaggle submission.

In [7]:
click_data_test = pd.read_csv('test', dtype={ 'id': np.uint64 })
click_data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4577464 entries, 0 to 4577463
Data columns (total 23 columns):
id                  uint64
hour                int64
C1                  int64
banner_pos          int64
site_id             object
site_domain         object
site_category       object
app_id              object
app_domain          object
app_category        object
device_id           object
device_ip           object
device_model        object
device_type         int64
device_conn_type    int64
C14                 int64
C15                 int64
C16                 int64
C17                 int64
C18                 int64
C19                 int64
C20                 int64
C21                 int64
dtypes: int64(13), object(9), uint64(1)
memory usage: 803.2+ MB


In [8]:
test_data = click_data_test[cat_attribs]
test_data_prepared = onehot.transform(test_data)
logreg.fit(train_data_prepared, labels)
preds = logreg.predict_proba(test_data_prepared)
preds


OK, submit the second column, which holds the predicted probability of the positive class (`click = 1`). Export to CSV, then gzip the file for faster uploading.
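
To make the column choice explicit: `predict_proba` returns one column per class, ordered as in `classes_`, so with labels 0 and 1 the click probability sits in column index 1. A minimal sketch on toy data (the arrays are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: larger x means more likely to be class 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression(solver='lbfgs').fit(X, y)
proba = clf.predict_proba(X)

# Look up the column for class 1 instead of hard-coding it
click_col = list(clf.classes_).index(1)
p_click = proba[:, click_col]

print(clf.classes_, click_col)
print(p_click)
```

Looking the column up through `classes_` is a small safeguard; hard-coding `[:, 1]` as in this notebook is equivalent when the labels are 0 and 1.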

In [22]:
def generate_submission_file(preds, data=click_data_test):
    df = pd.DataFrame({ 'id': data['id'].values, 'click': preds[:, 1] })
    df.set_index('id', inplace=True)
    
    with gzip.open('submission.gz', 'wt') as f:
        f.write(df.to_csv())
    
generate_submission_file(preds)

I got **0.44091** on the public leaderboard (PL) and a projected rank of 1387. At least I have a ranking now and a baseline to improve on. I save the model and keep track of its metadata.

In [11]:
all_models = [] # Store model name, information, logloss, projected rank, training size

In [12]:
def save_model(m, info, logloss, rank, size=len(sample)):
    filename = 'saved-models/model-' + str(len(all_models) + 1) + '.pkl'
    with open(filename, 'wb') as f:
        pickle.dump(m, f)
    all_models.append((filename, info, logloss, rank, size))
    
save_model(logreg, 'banner_pos only', 0.44091, 1387)

In [13]:
all_models

[('model-1.pkl', 'banner_pos only', 0.44091, 1387, 100000)]

## Improving the model with feature engineering
### Low-cardinality categorical features
Next, I add more categorical features to the model, starting with low-cardinality ones like `banner_pos`.

In [14]:
cat_attribs = ['banner_pos', 'site_category', 'app_category', 'device_type', 'device_conn_type', 'C1', 'C15', 'C16', 'C18']
train_data = sample[cat_attribs]
onehot = OneHotEncoder(categories='auto', handle_unknown='ignore').fit(train_data)
train_data_prepared = onehot.transform(train_data)

logreg = LogisticRegression(solver='lbfgs', max_iter=500)
s = cross_val_score(logreg, train_data_prepared, labels, scoring='neg_log_loss', cv=5).mean()
print('Model with low-cardinality attributes:', -s)

Model with low-cardinality attributes: 0.43247418615169614


Great! The logloss is lower than before. Make the second submission now.

In [15]:
test_data = click_data_test[cat_attribs]
test_data_prepared = onehot.transform(test_data)
logreg.fit(train_data_prepared, labels)
preds = logreg.predict_proba(test_data_prepared)
generate_submission_file(preds)

PL 0.41861, much lower than the cross-validation score from model training. It's ranked 1208. OK, save the model.

In [16]:
save_model(logreg, 'low-cardinality features', 0.41861, 1208)

### High-cardinality categorical features
One-hot encoding features with many categories creates a very large number of features, which can cause overfitting and suffers from the curse of dimensionality. Instead, I will apply feature hashing to control the number of features produced. I won't use `device_id` and `device_ip` as their cardinalities are too high and they act as ID fields. `C20` has missing values encoded as `-1`; it is fine to treat missing values as just another category.

In [17]:
cat_attribs = ['C1', 'banner_pos', 'site_category', 'app_category', 'device_type', 'device_conn_type', 'site_id', 
               'site_domain', 'app_id', 'app_domain', 'device_model', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20']

# FeatureHasher requires string input instead of number
for a in cat_attribs:
    sample[a] = sample[a].astype('str')
    
train_data = sample[cat_attribs]
hasher = FeatureHasher(n_features=500, input_type='string')
train_data_prepared = hasher.transform(train_data.values)

logreg = LogisticRegression(solver='lbfgs', max_iter=500)
s = cross_val_score(logreg, train_data_prepared, labels, scoring='neg_log_loss', cv=5).mean()
print('Model with all categorical attributes:', -s)

Model with all categorical attributes: 0.42200277489033605
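
One caveat with hashing the raw values: `FeatureHasher` only sees the strings in each row, not the column they came from, so the value `'1'` in `banner_pos` and the value `'1'` in `device_type` land in the same hashed feature. A possible variation, which is not what this notebook does, is to hash `column=value` tokens instead. The sketch below uses made-up data and a hypothetical helper `to_tokens` to show the difference:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher


def to_tokens(df, columns):
    """Turn each row into a list of 'column=value' strings (illustrative helper)."""
    values = df[columns].astype(str).values.tolist()
    return [[f"{c}={v}" for c, v in zip(columns, row)] for row in values]


demo = pd.DataFrame({'banner_pos': [0, 1, 1], 'device_type': [1, 1, 0]})
columns = ['banner_pos', 'device_type']
hasher = FeatureHasher(n_features=16, input_type='string')

# Raw values: rows (0, 1) and (1, 0) contain the same strings, so they hash identically
raw = hasher.transform(demo[columns].astype(str).values.tolist()).toarray()
# Prefixed tokens keep the column identity in the hashed string
prefixed = hasher.transform(to_tokens(demo, columns)).toarray()

print(np.array_equal(raw[0], raw[2]))
print(prefixed.shape)
```

With raw values, the first and third rows are indistinguishable to the model. Prefixed tokens avoid this by construction, although unrelated tokens can still collide by chance when `n_features` is small.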


Make another submission. The training score is better than using only low-cardinality attributes, but how about the final prediction?

In [18]:
# astype returns a new frame, so nothing is written to a slice of click_data_test
# and pandas raises no SettingWithCopyWarning
test_data = click_data_test[cat_attribs].astype('str')
    
test_data_prepared = hasher.transform(test_data.values)
logreg.fit(train_data_prepared, labels)
preds = logreg.predict_proba(test_data_prepared)
generate_submission_file(preds)


PL 0.41605, a tiny improvement, and again lower than the score in our model training (0.42200). It's ranked 1191. OK, save the model.

In [19]:
save_model(logreg, 'all categorical features with feature hashing', 0.41605, 1191)

### `hour` feature
This temporal feature is the only one left. As explored previously, we can derive two more features: hour of the day and day of the week.

In [32]:
sample['hour_of_day'] = sample['hour'].dt.hour
sample['day_of_week'] = sample['hour'].dt.weekday
num_attribs = ['hour_of_day', 'day_of_week']

# Feature scaling
scaler = StandardScaler()
num_data = scaler.fit_transform(sample[num_attribs])
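
To illustrate what these derived features look like, here is a small sketch that parses a few raw `hour` values in the `%y%m%d%H` format used by `format_data`. The three values are chosen from the date range shown in the `describe` output, and they are cast to strings before parsing as a precaution.

```python
import pandas as pd

# Raw values in YYMMDDHH form, as stored in the `hour` column
raw = pd.Series([14102100, 14102209, 14103023])
ts = pd.to_datetime(raw.astype(str), format='%y%m%d%H')

features = pd.DataFrame({
    'hour': ts,
    'hour_of_day': ts.dt.hour,     # 0..23
    'day_of_week': ts.dt.weekday,  # Monday=0 .. Sunday=6
})
print(features)
```

The first row, `14102100`, is midnight on 21 October 2014, a Tuesday, so `hour_of_day` is 0 and `day_of_week` is 1.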



In [23]:
cat_attribs = ['C1', 'banner_pos', 'site_category', 'app_category', 'device_type', 'device_conn_type', 'site_id', 
               'site_domain', 'app_id', 'app_domain', 'device_model', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20']

# FeatureHasher requires string input instead of number
for a in cat_attribs:
    sample[a] = sample[a].astype('str')
    
hasher = FeatureHasher(n_features=500, input_type='string')
cat_data = hasher.transform(sample[cat_attribs].values)

In [33]:
print('Num data:', num_data.shape)
print('Cat data:', cat_data.shape)

Num data: (100000, 2)
Cat data: (100000, 500)


In [34]:
train_data_prepared = hstack([num_data, cat_data])
logreg = LogisticRegression(solver='lbfgs', max_iter=500)
s = cross_val_score(logreg, train_data_prepared, labels, scoring='neg_log_loss', cv=5).mean()
print('Model with all categorical attributes:', -s)

Model with all categorical attributes: 0.42306352082587306


The model doesn't improve though (0.42306 vs 0.42200 without the time features)! So, I won't use those features. Given the limited time, I will stop feature engineering here and move on to model tuning.

## Tuning models
I will tune the current logistic regression classifier and also consider more complex ensemble methods such as random forest and gradient boosting. First, I rebuild the training data, which includes the categorical attributes as described earlier.

In [35]:
cat_attribs = ['C1', 'banner_pos', 'site_category', 'app_category', 'device_type', 'device_conn_type', 'site_id', 
               'site_domain', 'app_id', 'app_domain', 'device_model', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20']

# FeatureHasher requires string input instead of number
for a in cat_attribs:
    sample[a] = sample[a].astype('str')
    
train_data = sample[cat_attribs]
hasher = FeatureHasher(n_features=500, input_type='string')
train_data_prepared = hasher.transform(train_data.values)

### Logistic Regression
Consider different values of the regularisation strength `C`, the solver and the penalty.

In [37]:
logreg = LogisticRegression(solver='liblinear', max_iter=500)
param_grid = [
    { 'C': [0.01, 0.1, 1, 10], 'solver': ['lbfgs'], 'penalty': ['l2'] }, # lbfgs only supports l2
    { 'C': [0.01, 0.1, 1, 10], 'solver': ['liblinear'], 'penalty': ['l1', 'l2'] }
]
grid_search = GridSearchCV(logreg, param_grid, cv=5, scoring='neg_log_loss', verbose=3)
grid_search.fit(train_data_prepared, labels)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] C=0.01, penalty=l2, solver=lbfgs ................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  C=0.01, penalty=l2, solver=lbfgs, score=-0.42401528560770063, total=   0.9s
[CV] C=0.01, penalty=l2, solver=lbfgs ................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.9s remaining:    0.0s


[CV]  C=0.01, penalty=l2, solver=lbfgs, score=-0.4250853864017236, total=   0.8s
[CV] C=0.01, penalty=l2, solver=lbfgs ................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.7s remaining:    0.0s


[CV]  C=0.01, penalty=l2, solver=lbfgs, score=-0.4177317788456098, total=   0.8s
[CV] C=0.01, penalty=l2, solver=lbfgs ................................
[CV]  C=0.01, penalty=l2, solver=lbfgs, score=-0.41417544165541265, total=   0.7s
[CV] C=0.01, penalty=l2, solver=lbfgs ................................
[CV]  C=0.01, penalty=l2, solver=lbfgs, score=-0.4217239113306176, total=   0.7s
[CV] C=0.1, penalty=l2, solver=lbfgs .................................
[CV]  C=0.1, penalty=l2, solver=lbfgs, score=-0.4226891529771033, total=   1.5s
[CV] C=0.1, penalty=l2, solver=lbfgs .................................
[CV]  C=0.1, penalty=l2, solver=lbfgs, score=-0.42624054338678347, total=   1.5s
[CV] C=0.1, penalty=l2, solver=lbfgs .................................
[CV]  C=0.1, penalty=l2, solver=lbfgs, score=-0.41772792950904525, total=   1.5s
[CV] C=0.1, penalty=l2, solver=lbfgs .................................
[CV]  C=0.1, penalty=l2, solver=lbfgs, score=-0.412415578139203, total=   1.4s
[CV] C=0.

[CV]  C=10, penalty=l2, solver=liblinear, score=-0.4286537692111541, total=   1.0s
[CV] C=10, penalty=l2, solver=liblinear ..............................
[CV]  C=10, penalty=l2, solver=liblinear, score=-0.4192316099635951, total=   1.1s
[CV] C=10, penalty=l2, solver=liblinear ..............................
[CV]  C=10, penalty=l2, solver=liblinear, score=-0.4139797166239617, total=   0.9s
[CV] C=10, penalty=l2, solver=liblinear ..............................
[CV]  C=10, penalty=l2, solver=liblinear, score=-0.42460086113934115, total=   1.0s


[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:  5.0min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=500, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'C': [0.01, 0.1, 1, 10], 'solver': ['lbfgs'], 'penalty': ['l2']}, {'C': [0.01, 0.1, 1, 10], 'solver': ['liblinear'], 'penalty': ['l1', 'l2']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_log_loss', verbose=3)

In [39]:
print('Best model', grid_search.best_params_, -grid_search.best_score_)

Best model {'C': 0.1, 'penalty': 'l2', 'solver': 'lbfgs'} 0.42022383104243677
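
Besides `best_params_`, the fitted search object keeps the score of every candidate in `cv_results_`, which is easier to read than the verbose log. A minimal sketch on synthetic data from `make_classification` (the data and the reduced grid are assumptions for illustration, not the Avazu sample):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for train_data_prepared and labels
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

grid = GridSearchCV(
    LogisticRegression(solver='lbfgs', max_iter=500),
    {'C': [0.01, 0.1, 1, 10]},
    cv=5,
    scoring='neg_log_loss',
)
grid.fit(X, y)

# One row per candidate, best candidate first
results = pd.DataFrame(grid.cv_results_)[
    ['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
results['mean_logloss'] = -results['mean_test_score']  # flip the sign back to logloss
results = results.sort_values('rank_test_score')
print(results)
```

Comparing `std_test_score` with the gaps between candidates helps judge whether a difference in the fourth decimal place is meaningful.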


### Random Forest
A higher `n_estimators` generally gives a better model (100 or more is common), but it takes longer to run. So, I will set it to a small number for the grid search and tune the maximum depth and the minimum number of samples in a leaf node.

In [44]:
rf = RandomForestClassifier(n_estimators=20)
param_grid = { 
    'max_depth': [50, 60, 70],
    'min_samples_leaf': [5, 10, 15]
}
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='neg_log_loss', verbose=3)
grid_search.fit(train_data_prepared, labels)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV] max_depth=50, min_samples_leaf=5 ................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  max_depth=50, min_samples_leaf=5, score=-0.41672167762302814, total=   6.3s
[CV] max_depth=50, min_samples_leaf=5 ................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.8s remaining:    0.0s


[CV]  max_depth=50, min_samples_leaf=5, score=-0.4191187530501353, total=   6.3s
[CV] max_depth=50, min_samples_leaf=5 ................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   13.6s remaining:    0.0s


[CV]  max_depth=50, min_samples_leaf=5, score=-0.4195214198025711, total=   6.4s
[CV] max_depth=50, min_samples_leaf=5 ................................
[CV]  max_depth=50, min_samples_leaf=5, score=-0.41296529396967596, total=   6.3s
[CV] max_depth=50, min_samples_leaf=5 ................................
[CV]  max_depth=50, min_samples_leaf=5, score=-0.41853486496827946, total=   6.5s
[CV] max_depth=50, min_samples_leaf=10 ...............................
[CV]  max_depth=50, min_samples_leaf=10, score=-0.41596801122974675, total=   4.6s
[CV] max_depth=50, min_samples_leaf=10 ...............................
[CV]  max_depth=50, min_samples_leaf=10, score=-0.41908279522154085, total=   4.4s
[CV] max_depth=50, min_samples_leaf=10 ...............................
[CV]  max_depth=50, min_samples_leaf=10, score=-0.41462905498022684, total=   4.6s
[CV] max_depth=50, min_samples_leaf=10 ...............................
[CV]  max_depth=50, min_samples_leaf=10, score=-0.41168873558822106, total=   4.

[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:  4.2min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'max_depth': [50, 60, 70], 'min_samples_leaf': [5, 10, 15]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_log_loss', verbose=3)

In [45]:
print('Best model', grid_search.best_params_, -grid_search.best_score_)

Best model {'max_depth': 60, 'min_samples_leaf': 10} 0.4157813091811507


Great! Even with a small number of estimators, random forest still produces a better result than logistic regression.

### Gradient Boosting
Now try gradient boosting with the popular `xgboost` library.

In [48]:
gb = xgb.XGBClassifier(n_estimators=20)
param_grid = { 
    'max_depth': [5, 10, 20],
    'learning_rate': [0.1, 0.3, 1],
    'min_child_weight': [1, 3]
}
grid_search = GridSearchCV(gb, param_grid, cv=5, scoring='neg_log_loss', verbose=3)
grid_search.fit(train_data_prepared, labels)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV] learning_rate=0.1, max_depth=5, min_child_weight=1 ..............


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  learning_rate=0.1, max_depth=5, min_child_weight=1, score=-0.43911590573966613, total=   3.1s
[CV] learning_rate=0.1, max_depth=5, min_child_weight=1 ..............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.6s remaining:    0.0s


[CV]  learning_rate=0.1, max_depth=5, min_child_weight=1, score=-0.43745648687714056, total=   3.1s
[CV] learning_rate=0.1, max_depth=5, min_child_weight=1 ..............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    7.1s remaining:    0.0s


[CV]  learning_rate=0.1, max_depth=5, min_child_weight=1, score=-0.4357060287293047, total=   3.1s
[CV] learning_rate=0.1, max_depth=5, min_child_weight=1 ..............
[CV]  learning_rate=0.1, max_depth=5, min_child_weight=1, score=-0.4297017507965526, total=   3.1s
[CV] learning_rate=0.1, max_depth=5, min_child_weight=1 ..............
[CV]  learning_rate=0.1, max_depth=5, min_child_weight=1, score=-0.43756616486478517, total=   3.1s
[CV] learning_rate=0.1, max_depth=5, min_child_weight=3 ..............
[CV]  learning_rate=0.1, max_depth=5, min_child_weight=3, score=-0.4392189771498712, total=   3.1s
[CV] learning_rate=0.1, max_depth=5, min_child_weight=3 ..............
[CV]  learning_rate=0.1, max_depth=5, min_child_weight=3, score=-0.4376722045779574, total=   3.1s
[CV] learning_rate=0.1, max_depth=5, min_child_weight=3 ..............
[CV]  learning_rate=0.1, max_depth=5, min_child_weight=3, score=-0.4334653632801026, total=   3.1s
[CV] learning_rate=0.1, max_depth=5, min_child_wei

[CV]  learning_rate=0.3, max_depth=20, min_child_weight=1, score=-0.4228743030766487, total=   9.4s
[CV] learning_rate=0.3, max_depth=20, min_child_weight=1 .............
[CV]  learning_rate=0.3, max_depth=20, min_child_weight=1, score=-0.42599705736972865, total=   9.4s
[CV] learning_rate=0.3, max_depth=20, min_child_weight=1 .............
[CV]  learning_rate=0.3, max_depth=20, min_child_weight=1, score=-0.43324615997346116, total=   9.3s
[CV] learning_rate=0.3, max_depth=20, min_child_weight=1 .............
[CV]  learning_rate=0.3, max_depth=20, min_child_weight=1, score=-0.41772629707733433, total=   9.4s
[CV] learning_rate=0.3, max_depth=20, min_child_weight=1 .............
[CV]  learning_rate=0.3, max_depth=20, min_child_weight=1, score=-0.4278117293308992, total=   9.4s
[CV] learning_rate=0.3, max_depth=20, min_child_weight=3 .............
[CV]  learning_rate=0.3, max_depth=20, min_child_weight=3, score=-0.4203158865457552, total=   9.5s
[CV] learning_rate=0.3, max_depth=20, min_

[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:  9.9min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=20, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'max_depth': [5, 10, 20], 'learning_rate': [0.1, 0.3, 1], 'min_child_weight': [1, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_log_loss', verbose=3)

In [49]:
print('Best model', grid_search.best_params_, -grid_search.best_score_)

Best model {'learning_rate': 0.3, 'max_depth': 10, 'min_child_weight': 3} 0.4192162184315547


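It's a bit worse than random forest (0.4192 vs 0.4158), and the grid search took more than twice as long (9.9 vs 4.2 minutes, though it ran 90 fits rather than 45). Now, try a different, faster library.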

In [55]:
gb = lgb.LGBMClassifier(n_estimators=20, learning_rate=0.3)
param_grid = { 
    'num_leaves': [10, 30, 50],
    'max_depth': [10, 20, 30],
    'min_data_in_leaf': [20, 30, 40]
}
grid_search = GridSearchCV(gb, param_grid, cv=5, scoring='neg_log_loss', verbose=3)
grid_search.fit(train_data_prepared, labels)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
[CV] max_depth=10, min_data_in_leaf=20, num_leaves=10 ................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  max_depth=10, min_data_in_leaf=20, num_leaves=10, score=-0.42486350706433973, total=   0.3s
[CV] max_depth=10, min_data_in_leaf=20, num_leaves=10 ................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.4s remaining:    0.0s


[CV]  max_depth=10, min_data_in_leaf=20, num_leaves=10, score=-0.42443667842997734, total=   0.3s
[CV] max_depth=10, min_data_in_leaf=20, num_leaves=10 ................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.7s remaining:    0.0s


[CV]  max_depth=10, min_data_in_leaf=20, num_leaves=10, score=-0.41914215694065954, total=   0.3s
[CV] max_depth=10, min_data_in_leaf=20, num_leaves=10 ................
[CV]  max_depth=10, min_data_in_leaf=20, num_leaves=10, score=-0.4139262958925761, total=   0.3s
[CV] max_depth=10, min_data_in_leaf=20, num_leaves=10 ................
[CV]  max_depth=10, min_data_in_leaf=20, num_leaves=10, score=-0.42422937525080107, total=   0.3s
[CV] max_depth=10, min_data_in_leaf=20, num_leaves=30 ................
[CV]  max_depth=10, min_data_in_leaf=20, num_leaves=30, score=-0.41981452681289094, total=   0.4s
[CV] max_depth=10, min_data_in_leaf=20, num_leaves=30 ................
[CV]  max_depth=10, min_data_in_leaf=20, num_leaves=30, score=-0.42021604560280185, total=   0.4s
[CV] max_depth=10, min_data_in_leaf=20, num_leaves=30 ................
[CV]  max_depth=10, min_data_in_leaf=20, num_leaves=30, score=-0.41787945097337487, total=   0.4s
[CV] max_depth=10, min_data_in_leaf=20, num_leaves=30 ....

[CV]  max_depth=20, min_data_in_leaf=20, num_leaves=30, score=-0.4194514826423095, total=   0.4s
[CV] max_depth=20, min_data_in_leaf=20, num_leaves=30 ................
[CV]  max_depth=20, min_data_in_leaf=20, num_leaves=30, score=-0.41491341368852247, total=   0.4s
[CV] max_depth=20, min_data_in_leaf=20, num_leaves=30 ................
[CV]  max_depth=20, min_data_in_leaf=20, num_leaves=30, score=-0.40989116305397205, total=   0.6s
[CV] max_depth=20, min_data_in_leaf=20, num_leaves=30 ................
[CV]  max_depth=20, min_data_in_leaf=20, num_leaves=30, score=-0.41821582572960014, total=   0.6s
[CV] max_depth=20, min_data_in_leaf=20, num_leaves=50 ................
[CV]  max_depth=20, min_data_in_leaf=20, num_leaves=50, score=-0.41816178558763717, total=   0.7s
[CV] max_depth=20, min_data_in_leaf=20, num_leaves=50 ................
[CV]  max_depth=20, min_data_in_leaf=20, num_leaves=50, score=-0.4183746346956301, total=   0.6s
[CV] max_depth=20, min_data_in_leaf=20, num_leaves=50 .....

[CV]  max_depth=30, min_data_in_leaf=20, num_leaves=50, score=-0.4199884667723293, total=   0.7s
[CV] max_depth=30, min_data_in_leaf=20, num_leaves=50 ................
[CV]  max_depth=30, min_data_in_leaf=20, num_leaves=50, score=-0.4184561861629443, total=   0.7s
[CV] max_depth=30, min_data_in_leaf=20, num_leaves=50 ................
[CV]  max_depth=30, min_data_in_leaf=20, num_leaves=50, score=-0.41682490020605417, total=   0.7s
[CV] max_depth=30, min_data_in_leaf=20, num_leaves=50 ................
[CV]  max_depth=30, min_data_in_leaf=20, num_leaves=50, score=-0.4105933769747288, total=   0.7s
[CV] max_depth=30, min_data_in_leaf=20, num_leaves=50 ................
[CV]  max_depth=30, min_data_in_leaf=20, num_leaves=50, score=-0.41946855727324145, total=   0.7s
[CV] max_depth=30, min_data_in_leaf=30, num_leaves=10 ................
[CV]  max_depth=30, min_data_in_leaf=30, num_leaves=10, score=-0.42484776473540864, total=   0.4s
[CV] max_depth=30, min_data_in_leaf=30, num_leaves=10 ......

[Parallel(n_jobs=1)]: Done 135 out of 135 | elapsed:  1.4min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
        importance_type='split', learning_rate=0.3, max_depth=-1,
        min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
        n_estimators=20, n_jobs=-1, num_leaves=31, objective=None,
        random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
        subsample=1.0, subsample_for_bin=200000, subsample_freq=0),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'num_leaves': [10, 30, 50], 'max_depth': [10, 20, 30], 'min_data_in_leaf': [20, 30, 40]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_log_loss', verbose=3)

In [56]:
print('Best model', grid_search.best_params_, -grid_search.best_score_)

Best model {'max_depth': 30, 'min_data_in_leaf': 30, 'num_leaves': 30} 0.4157245527300352


Wow, the training is really fast. Each fit takes less than 1s, compared to 3-10s with `xgboost`. And the performance is better as well!

### All models together
Now, I will compare the three algorithms using their best tuned parameters, increasing the number of iterations for logistic regression and the number of estimators for random forest and gradient boosting, to see which one does best.

In [58]:
logreg = LogisticRegression(solver='lbfgs', penalty='l2', C=0.1, max_iter=10000)
s = cross_val_score(logreg, train_data_prepared, labels, scoring='neg_log_loss', cv=5).mean()
print('Best logistic regression:', -s)

rf = RandomForestClassifier(n_estimators=200, max_depth=60, min_samples_leaf=10)
s = cross_val_score(rf, train_data_prepared, labels, scoring='neg_log_loss', cv=5).mean()
print('Best random forest:', -s)

gb = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.3, max_depth=30, min_data_in_leaf=30, num_leaves=30)
s = cross_val_score(gb, train_data_prepared, labels, scoring='neg_log_loss', cv=5).mean()
print('Best gradient boosting:', -s)

Best logistic regression: 0.42022368635353213
Best random forest: 0.4151497809150225
Best gradient boosting: 0.42547294545007136


`LightGBM` is much faster than the random forest from `sklearn`. However, the LightGBM model with 200 estimators performs worse than the one with 20 estimators (0.42547 vs 0.41572). I guess that's the result of overfitting with so many trees at a learning rate of 0.3. It also means that we can't simply reuse the parameters tuned on this 100k sample for the full training data. Let's check a few different values first.

In [62]:
gb = lgb.LGBMClassifier(n_estimators=20, learning_rate=0.3, max_depth=30, min_data_in_leaf=30, num_leaves=30)
param_grid = { 
    'n_estimators': [10, 30, 50, 100, 200, 500]
}
grid_search = GridSearchCV(gb, param_grid, cv=5, scoring='neg_log_loss', verbose=3)
grid_search.fit(train_data_prepared, labels)

print('Best model', grid_search.best_params_, -grid_search.best_score_)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] n_estimators=10 .................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ....... n_estimators=10, score=-0.4202471927697791, total=   0.3s
[CV] n_estimators=10 .................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.3s remaining:    0.0s


[CV] ....... n_estimators=10, score=-0.4203122004653788, total=   0.4s
[CV] n_estimators=10 .................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.8s remaining:    0.0s


[CV] ...... n_estimators=10, score=-0.41626140018885055, total=   0.4s
[CV] n_estimators=10 .................................................
[CV] ....... n_estimators=10, score=-0.4103678291579343, total=   0.4s
[CV] n_estimators=10 .................................................
[CV] ....... n_estimators=10, score=-0.4194281983707684, total=   0.4s
[CV] n_estimators=30 .................................................
[CV] ....... n_estimators=30, score=-0.4170367700586952, total=   0.6s
[CV] n_estimators=30 .................................................
[CV] ........ n_estimators=30, score=-0.419090019426915, total=   0.7s
[CV] n_estimators=30 .................................................
[CV] ...... n_estimators=30, score=-0.41438973826331166, total=   0.7s
[CV] n_estimators=30 .................................................
[CV] ....... n_estimators=30, score=-0.4087543781550628, total=   0.5s
[CV] n_estimators=30 .................................................
[CV] .

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:  1.2min finished


Best model {'n_estimators': 30} 0.41545434743574794
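
`best_params_` only reports the winner. To see how the log loss moves across all candidates, the per-candidate scores stored in `cv_results_` can be put into a table. The sketch below is an illustration on synthetic data, with `GradientBoostingClassifier` from `sklearn` standing in for `LGBMClassifier`. The data, the candidate values and the `results` name are assumptions made for the example and are not part of this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for train_data_prepared / labels (assumption, not the Avazu data)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Stand-in estimator; the notebook itself uses lgb.LGBMClassifier
search = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.3, random_state=0),
    param_grid={'n_estimators': [10, 30, 50]},
    cv=3,
    scoring='neg_log_loss',
)
search.fit(X, y)

# One row per candidate: mean and spread of the log loss across the folds
results = pd.DataFrame({
    'n_estimators': [p['n_estimators'] for p in search.cv_results_['params']],
    'mean_logloss': -search.cv_results_['mean_test_score'],
    'std_logloss': search.cv_results_['std_test_score'],
}).sort_values('n_estimators')
print(results)
```

If the differences between candidates are smaller than the spread across folds, the choice of `n_estimators` is probably not well determined by this sample.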


So, our best model is gradient boosting with 30 estimators and the logloss is 0.41545. Let's make a submission.

In [63]:
gb = lgb.LGBMClassifier(n_estimators=30, learning_rate=0.3, max_depth=30, min_data_in_leaf=30, num_leaves=30)
gb.fit(train_data_prepared, labels)
preds = gb.predict_proba(test_data_prepared)
generate_submission_file(preds)

Oh, the submission score is worse than before: PL 0.41994, even after all the parameter tuning with ensemble models.

In [65]:
save_model(gb, 'all features, gradient boosting with LightGBM', 0.41994, None)

I will submit the best logistic regression and random forest as well.

In [67]:
logreg.fit(train_data_prepared, labels)
preds = logreg.predict_proba(test_data_prepared)
generate_submission_file(preds)

PL 0.41779. Again, worse than before tuning!

In [69]:
save_model(logreg, 'all features, logistic regression', 0.41779, None)

In [68]:
rf.fit(train_data_prepared, labels)
preds = rf.predict_proba(test_data_prepared)
generate_submission_file(preds)

PL 0.41339. That is the best score so far. It came from the slowest training though!

In [70]:
save_model(rf, 'all features, random forest', 0.41339, 1168)

I don't have much time to work on this. My explanation is that the tuned model overfits the training data, since the test data comes from a different day. So, we have two options left:
- Higher accuracy: random forest
- Faster speed: gradient boosting with `LightGBM`
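
Since the test data comes from a different day, random cross-validation folds that mix days may give an optimistic estimate. One way to check this is to hold out the most recent day for validation. The sketch below is a minimal illustration, assuming the `hour` column is encoded as `YYMMDDHH`; the `split_by_last_day` helper and the toy data are hypothetical and not part of this notebook.

```python
import pandas as pd

def split_by_last_day(df, hour_col='hour'):
    """Hold out the most recent day as validation data.

    Assumes `hour_col` is encoded as YYMMDDHH, so integer division
    by 100 drops the hour and leaves the day.
    """
    day = df[hour_col].astype(int) // 100
    last_day = day.max()
    return df[day < last_day], df[day == last_day]

# Toy events spread over three days (made-up values)
toy = pd.DataFrame({
    'hour': [14102100, 14102105, 14102200, 14102223, 14102301, 14102312],
    'click': [0, 1, 0, 0, 1, 0],
})
train_part, valid_part = split_by_last_day(toy)
print(len(train_part), len(valid_part))
```

Parameters tuned against the held-out day should transfer better to a test set from a later day than parameters tuned on shuffled folds, although this is not verified here.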

## Building model with larger training data

Now, I will use a 1M sample from my EDA.

In [72]:
sample_1m = pd.read_csv('sample.csv', index_col=0)

In [3]:
def transform_data(data):
    cat_attribs = ['C1', 'banner_pos', 'site_category', 'app_category', 'device_type', 'device_conn_type', 'site_id', 
               'site_domain', 'app_id', 'app_domain', 'device_model', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20']

    # FeatureHasher requires string input instead of number
    for a in cat_attribs:
        data[a] = data[a].astype('str')
    
    train_data = data[cat_attribs]
    hasher = FeatureHasher(n_features=500, input_type='string')
    train_data_prepared = hasher.transform(train_data.values)
    
    return train_data_prepared
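
Note that `transform_data` hashes the raw values, so the same value appearing in two different columns (for example `0` in both `banner_pos` and `device_conn_type`) lands in the same bucket. A possible variation, sketched below, prefixes each value with its column name before hashing. The `hash_with_column_prefix` helper and the toy data are hypothetical; this is not the method used for the scores reported in this notebook.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

def hash_with_column_prefix(data, columns, n_features=500):
    """Hash categorical columns, tagging each value with its column name."""
    # astype(str) works on a copy, so the caller's DataFrame is left unchanged
    tokens = [
        [f'{col}={val}' for col, val in zip(columns, row)]
        for row in data[columns].astype(str).values
    ]
    hasher = FeatureHasher(n_features=n_features, input_type='string')
    return hasher.transform(tokens)

# Toy data with made-up values
toy = pd.DataFrame({'C15': [320, 300], 'C16': [50, 250], 'banner_pos': [0, 1]})
hashed = hash_with_column_prefix(toy, ['C15', 'C16', 'banner_pos'])
print(hashed.shape)
```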

In [81]:
labels = sample_1m['click']

In [80]:
train_data_prepared = transform_data(sample_1m)
train_data_prepared.shape

(1000000, 500)

In [82]:
gb = lgb.LGBMClassifier(n_estimators=30, learning_rate=0.3, max_depth=30, min_data_in_leaf=30, num_leaves=30)
%time gb.fit(train_data_prepared, labels)
preds = gb.predict_proba(test_data_prepared)
generate_submission_file(preds)

5.56 s ± 624 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


With 10x more training data, we get a more accurate model: PL 0.40892, ranked 1150.

In [84]:
save_model(gb, 'all features, gradient boosting with LightGBM, 1m', 0.40892, 1150)

As random forest is too slow, I will only use `LightGBM`. For this assignment, I will try 10M records instead of the full 40M.

In [2]:
# Draw a random sample of 10M rows from the full training file
train_size = 40428967
sample_size = 10**7
skip = sorted(np.random.choice(np.arange(1, train_size + 1), train_size - sample_size, replace=False))
sample = pd.read_csv('train', skiprows=skip, index_col=0)
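
Building the `skip` list above means drawing and sorting about 30 million row indices in memory. A possible alternative is to read the file in chunks and keep a random fraction of each chunk. The sketch below is an illustration under assumptions: the `sample_csv_in_chunks` helper and the toy CSV are hypothetical, and the result is a fixed fraction of every chunk rather than an exact uniform sample of a fixed size.

```python
import io
import numpy as np
import pandas as pd

def sample_csv_in_chunks(source, frac, chunksize=1_000_000, seed=0):
    """Keep roughly `frac` of the rows of a CSV while reading it chunk by chunk."""
    rng = np.random.RandomState(seed)
    kept = []
    for chunk in pd.read_csv(source, index_col=0, chunksize=chunksize):
        # Sharing one RandomState keeps the run reproducible across chunks
        kept.append(chunk.sample(frac=frac, random_state=rng))
    return pd.concat(kept)

# Toy CSV standing in for the 'train' file (made-up values)
toy_csv = io.StringIO('id,click\n' + '\n'.join(f'{i},{i % 2}' for i in range(1000)))
toy_sample = sample_csv_in_chunks(toy_csv, frac=0.1, chunksize=100)
print(toy_sample.shape)
```

For the real file, a call such as `sample_csv_in_chunks('train', frac=10**7 / 40428967)` would give roughly 10M rows.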



In [27]:
labels = sample['click']
train_data_prepared = transform_data(sample)

In [20]:
gb = lgb.LGBMClassifier(n_estimators=30, learning_rate=0.3, max_depth=30, min_data_in_leaf=30, num_leaves=30)
%time gb.fit(train_data_prepared, labels)

CPU times: user 7min 30s, sys: 9.41 s, total: 7min 39s
Wall time: 1min 2s


LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
        importance_type='split', learning_rate=0.3, max_depth=30,
        min_child_samples=20, min_child_weight=0.001, min_data_in_leaf=30,
        min_split_gain=0.0, n_estimators=30, n_jobs=-1, num_leaves=30,
        objective=None, random_state=None, reg_alpha=0.0, reg_lambda=0.0,
        silent=False, subsample=1.0, subsample_for_bin=200000,
        subsample_freq=0)

In [23]:
preds = gb.predict_proba(test_data_prepared)
generate_submission_file(preds)

PL is just 0.40961, which is worse than the model trained with 1M records. That could be because the same parameters are not optimal for this larger training size. As the model trains in about 1 minute, I decided to spend a bit more time tuning the number of estimators. I had to reduce it to 30 because of overfitting with the smaller training size, but I think it can be larger now.

In [32]:
gb = lgb.LGBMClassifier(num_leaves=30, learning_rate=0.3, max_depth=30, min_data_in_leaf=30)
param_grid = { 
    'n_estimators': [30, 50, 100]
}
grid_search = GridSearchCV(gb, param_grid, cv=3, scoring='neg_log_loss', verbose=3)
grid_search.fit(train_data_prepared, labels)

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] n_estimators=30 .................................................
[CV] ....... n_estimators=30, score=-0.4210155709909052, total=  51.2s
[CV] n_estimators=30 .................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.1min remaining:    0.0s


[CV] ...... n_estimators=30, score=-0.41192013189540827, total=  53.8s
[CV] n_estimators=30 .................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  2.1min remaining:    0.0s


[CV] ...... n_estimators=30, score=-0.41250853313175706, total=  54.1s
[CV] n_estimators=50 .................................................
[CV] ...... n_estimators=50, score=-0.41668256776473844, total= 1.1min
[CV] n_estimators=50 .................................................
[CV] ...... n_estimators=50, score=-0.41074633448335335, total= 1.1min
[CV] n_estimators=50 .................................................
[CV] ....... n_estimators=50, score=-0.4109628234789631, total= 1.1min
[CV] n_estimators=100 ................................................
[CV] ..... n_estimators=100, score=-0.41429869457268587, total= 2.0min
[CV] n_estimators=100 ................................................
[CV] ...... n_estimators=100, score=-0.4097249608514304, total= 2.0min
[CV] n_estimators=100 ................................................
[CV] ...... n_estimators=100, score=-0.4088698766056718, total= 2.1min


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 14.1min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
        importance_type='split', learning_rate=0.3, max_depth=30,
        min_child_samples=20, min_child_weight=0.001, min_data_in_leaf=30,
        min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=30,
        objective=None, random_state=None, reg_alpha=0.0, reg_lambda=0.0,
        silent=True, subsample=1.0, subsample_for_bin=200000,
        subsample_freq=0),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_estimators': [30, 50, 100]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring='neg_log_loss',
       verbose=3)

In [33]:
print('Best model', grid_search.best_params_, -grid_search.best_score_)

Best model {'n_estimators': 100} 0.4109645110100144


That's right, more trees help this time. Now train the final model and submit predictions.

In [None]:
gb = lgb.LGBMClassifier(n_estimators=100, learning_rate=0.3, max_depth=30, min_data_in_leaf=30, num_leaves=30)
gb.fit(train_data_prepared, labels)

preds = gb.predict_proba(test_data_prepared)
generate_submission_file(preds)

My final model has PL 0.40603 and is ranked 1127.