# Building Predictive Models 
### for Click-Through Rate Prediction by Avazu
*Phong Nguyen, July 2019*

To recap from the previous EDA, we are tasked with building a **binary classification model** to predict the probability of an ad being clicked. In this notebook, I will go through feature engineering, model training, parameter tuning and making predictions.

As the dataset is huge, I will use a small sample of 100,000 events for faster processing, then use the tuned parameters to build a model on the full dataset later. I also go through the model building process iteratively rather than doing exhaustive feature engineering first.

## Data loading and formatting

In [1]:
import pandas as pd
import numpy as np

from scipy.sparse import hstack

from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction import FeatureHasher

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import log_loss

import xgboost as xgb
import lightgbm as lgb

import gzip
import pickle



In [2]:
# Save the sample to a file for faster loading later
train_size = 40428967
already_sample = False

if already_sample:
    sample = pd.read_csv('sample-100k.csv', index_col=0)
else:
    sample_size = 10**5
    skip = sorted(np.random.choice(np.arange(1, train_size + 1), train_size - sample_size, replace=False))
    sample = pd.read_csv('train', skiprows=skip, index_col=0)
    sample.to_csv('sample-100k.csv')
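
The cell above builds a sorted skip list of roughly 40 million row numbers and is not seeded, so each fresh run draws a different sample. Here is a minimal sketch of a seeded alternative that only stores the rows to keep. It relies on `pd.read_csv` accepting a callable for `skiprows`. The helper `sample_rows` and the toy CSV are hypothetical names used for illustration only, not part of the original notebook.

```python
import io

import numpy as np
import pandas as pd


def sample_rows(source, n_rows, n_sample, seed=42):
    """Read a reproducible random sample of data rows from a CSV with a header."""
    rng = np.random.default_rng(seed)
    # Row 0 is the header, so data rows are numbered 1..n_rows
    keep = set((rng.choice(n_rows, size=n_sample, replace=False) + 1).tolist())
    # Skip every data row that was not drawn; never skip the header
    return pd.read_csv(source, skiprows=lambda i: i > 0 and i not in keep, index_col=0)


# Toy stand-in for the real 'train' file: 1,000 rows with an id and a click column
demo_csv = "id,click\n" + "\n".join(f"{i},{i % 2}" for i in range(1000))
demo_sample = sample_rows(io.StringIO(demo_csv), n_rows=1000, n_sample=50)
demo_again = sample_rows(io.StringIO(demo_csv), n_rows=1000, n_sample=50)
print(demo_sample.shape)
```

With the real data the call would look like `sample_rows('train', train_size, 10**5)`. A callable `skiprows` is evaluated once per row, so this trades memory for some parsing speed.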

In [3]:
def format_data(sample):
    cat_attribs = sorted(set(sample.columns) - { 'hour', 'click' })
    for c in cat_attribs:
        sample[c] = sample[c].astype('category')

    sample['hour'] = pd.to_datetime(sample['hour'], format='%y%m%d%H')
    sample['click'] = sample['click'].astype(np.uint8)
    
    return sample

In [4]:
sample = format_data(sample)

In [5]:
sample.describe(include='all').T

| | count | unique | top | freq | first | last | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| click | 100000 | | | | | | 0.17038 | 0.375968 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| hour | 100000 | 240.0 | 2014-10-22 09:00:00 | 1086.0 | 2014-10-21 00:00:00 | 2014-10-30 23:00:00 | | | | | | | |
| C1 | 100000 | 7.0 | 1005 | 91932.0 | | | | | | | | | |
| banner_pos | 100000 | 7.0 | 0 | 72118.0 | | | | | | | | | |
| site_id | 100000 | 1465.0 | 85f751fd | 36013.0 | | | | | | | | | |
| site_domain | 100000 | 1319.0 | c4e18dd6 | 37326.0 | | | | | | | | | |
| site_category | 100000 | 20.0 | 50e219e0 | 40785.0 | | | | | | | | | |
| app_id | 100000 | 1272.0 | ecad2386 | 63987.0 | | | | | | | | | |
| app_domain | 100000 | 91.0 | 7801e8d9 | 67415.0 | | | | | | | | | |
| app_category | 100000 | 23.0 | 07d7df22 | 64785.0 | | | | | | | | | |


## Building a first simple model

As we need to submit the probability of the prediction, **Logistic Regression** is a good first choice. I will build a first simple model with one attribute. `banner_pos` looks like a reasonable choice as the position of an ad can play a role in it being clicked. 

The performance metric is **logloss** as chosen by the competition organiser.
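
To make the metric concrete, here is a small sketch that computes binary logloss by hand and compares it with `sklearn.metrics.log_loss`. The function `manual_log_loss` and the toy labels and probabilities are made up for illustration. The clipping constant `eps` is an assumption to avoid `log(0)`, not something taken from the competition rules.

```python
import numpy as np
from sklearn.metrics import log_loss


def manual_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    # Clip so that log() never sees exactly 0 or 1
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


y_demo = [0, 0, 1, 0, 1]            # toy click labels
p_demo = [0.1, 0.2, 0.7, 0.4, 0.9]  # toy predicted click probabilities

manual = manual_log_loss(y_demo, p_demo)
reference = log_loss(y_demo, p_demo)
print(manual, reference)
```

Confident wrong predictions are penalised heavily, which is why well-calibrated probabilities matter more here than hard class labels.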

In [6]:
labels = sample['click']

cat_attribs = ['banner_pos']
train_data = sample[cat_attribs]
onehot = OneHotEncoder(categories='auto', handle_unknown='ignore').fit(train_data)
train_data_prepared = onehot.transform(train_data)

logreg = LogisticRegression(solver='lbfgs', max_iter=500)
s = cross_val_score(logreg, train_data_prepared, labels, scoring='neg_log_loss', cv=5).mean()
print('Model with banner_pos:', -s)

Model with banner_pos: 0.45624931576686867
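
A note on `handle_unknown='ignore'`: the test set may contain categories that never occur in the 100,000-event sample. The sketch below, on made-up toy frames, shows how the encoder treats such a value: the unseen category becomes an all-zero row instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({'banner_pos': [0, 1, 0, 2]})
toy_test = pd.DataFrame({'banner_pos': [1, 7]})  # 7 never appears in toy_train

enc = OneHotEncoder(categories='auto', handle_unknown='ignore').fit(toy_train)
encoded = enc.transform(toy_test).toarray()

# One column per category seen during fit (0, 1, 2); the unseen 7 maps to all zeros
print(encoded)
```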


OK, so what is the logloss of a naive model that always predicts the mean CTR?

In [7]:
m = labels.mean()
print('Model predicting mean:', log_loss(labels, [m] * len(labels)))

Model predicting mean: 0.45648823999656146
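
The baseline value can also be read as the binary entropy of the click rate: predicting a constant probability equal to the CTR gives a logloss of -(p ln p + (1 - p) ln(1 - p)). Here is a small sketch of that identity, using the CTR of 0.17038 from the `describe()` output and synthetic labels with the same click rate (the synthetic labels are an assumption for illustration, not the real sample).

```python
import numpy as np
from sklearn.metrics import log_loss

ctr = 0.17038  # mean of `click` in the 100,000-event sample

# Binary entropy of the click rate
entropy = -(ctr * np.log(ctr) + (1 - ctr) * np.log(1 - ctr))

# Cross-check on synthetic labels with exactly the same click rate
y_demo = np.array([1] * 17038 + [0] * (100000 - 17038))
baseline = log_loss(y_demo, np.full(len(y_demo), ctr))

print(entropy, baseline)
```

Any model scoring close to this value has learned almost nothing beyond the overall click rate.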


Well, the naive mean model is only slightly worse than my first model with `banner_pos`, so `banner_pos` alone adds very little. Good start anyway. I want to make a Kaggle submission.

In [8]:
click_data_test = pd.read_csv('test', dtype={ 'id': np.uint64 })
click_data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4577464 entries, 0 to 4577463
Data columns (total 23 columns):
id                  uint64
hour                int64
C1                  int64
banner_pos          int64
site_id             object
site_domain         object
site_category       object
app_id              object
app_domain          object
app_category        object
device_id           object
device_ip           object
device_model        object
device_type         int64
device_conn_type    int64
C14                 int64
C15                 int64
C16                 int64
C17                 int64
C18                 int64
C19                 int64
C20                 int64
C21                 int64
dtypes: int64(13), object(9), uint64(1)
memory usage: 803.2+ MB


In [9]:
test_data = click_data_test[cat_attribs]
test_data_prepared = onehot.transform(test_data)
logreg.fit(train_data_prepared, labels)
preds = logreg.predict_proba(test_data_prepared)
preds

array([[0.83528146, 0.16471854],
       [0.83528146, 0.16471854],
       [0.83528146, 0.16471854],
       ...,
       [0.83528146, 0.16471854],
       [0.83528146, 0.16471854],
       [0.83528146, 0.16471854]])

OK, submit the second column. Export to csv then zip the file for fast uploading.

In [10]:
def generate_submission_file(preds, data=click_data_test):
    df = pd.DataFrame({ 'id': data['id'].values, 'click': preds[:, 1] })
    df.set_index('id', inplace=True)
    
    with gzip.open('submission.gz', 'wt') as f:
        f.write(df.to_csv())
    
generate_submission_file(preds)

I got **0.44091** on the public leaderboard and a projected rank of 1387. At least I have a ranking now and a baseline to improve on. Let's save the model and keep track of its metadata.

In [11]:
all_models = [] # Store model name, information, logloss, projected rank, training size

In [12]:
def save_model(m, info, logloss, rank, size=len(sample)):
    filename = 'model-' + str(len(all_models) + 1) + '.pkl'
    with open(filename, 'wb') as f:
        pickle.dump(m, f)
    all_models.append((filename, info, logloss, rank, size))
    
save_model(logreg, 'banner_pos only', 0.44091, 1387)

In [13]:
all_models

[('model-1.pkl', 'banner_pos only', 0.44091, 1387, 100000)]
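
For completeness, here is a minimal sketch of reading a pickled model back. `load_model` is a hypothetical helper, not part of the original notebook. The demo model and temporary path are made up so the example runs on its own.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression


def load_model(filename):
    """Hypothetical counterpart to save_model: read back a pickled model."""
    with open(filename, 'rb') as f:
        return pickle.load(f)


# Toy model standing in for the fitted logreg
demo_model = LogisticRegression(solver='lbfgs').fit([[0], [1], [0], [1]], [0, 1, 0, 1])

# Write it the same way save_model does, then load it again
demo_path = os.path.join(tempfile.mkdtemp(), 'model-demo.pkl')
with open(demo_path, 'wb') as f:
    pickle.dump(demo_model, f)

restored = load_model(demo_path)
print(restored.predict_proba([[1]]))
```

Pickled scikit-learn models should be loaded with the same library version that wrote them, and only from files you trust.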

## Improving the model with feature engineering
### Low-cardinality categorical features
Next, I add more categorical features to the model, starting with low-cardinality ones like `banner_pos`.

In [14]:
cat_attribs = ['banner_pos', 'site_category', 'app_category', 'device_type', 'device_conn_type', 'C1', 'C15', 'C16', 'C18']
train_data = sample[cat_attribs]
onehot = OneHotEncoder(categories='auto', handle_unknown='ignore').fit(train_data)
train_data_prepared = onehot.transform(train_data)

logreg = LogisticRegression(solver='lbfgs', max_iter=500)
s = cross_val_score(logreg, train_data_prepared, labels, scoring='neg_log_loss', cv=5).mean()
print('Model with low-cardinality attributes:', -s)

Model with low-cardinality attributes: 0.43247418615169614


Great! The score gets lower than before. Make the second submission now.

In [15]:
test_data = click_data_test[cat_attribs]
test_data_prepared = onehot.transform(test_data)
logreg.fit(train_data_prepared, labels)
preds = logreg.predict_proba(test_data_prepared)
generate_submission_file(preds)

The public leaderboard (PL) score is 0.41861, again much lower than the cross-validation score from training. It's ranked 1208. OK, save the model.

In [16]:
save_model(logreg, 'low-cardinality features', 0.41861, 1208)

### High-cardinality categorical features
One-hot encoding features with a large number of categories creates a large number of columns, which can cause overfitting and suffers from the curse of dimensionality. Instead, I will apply feature hashing to control the number of features produced. I won't use `device_id` and `device_ip` as their cardinalities are too high and they act as ID fields. `C20` has missing values encoded as `-1`. It is fine to treat missing values as just another category.

In [17]:
cat_attribs = ['C1', 'banner_pos', 'site_category', 'app_category', 'device_type', 'device_conn_type', 'site_id', 
               'site_domain', 'app_id', 'app_domain', 'device_model', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20']

# FeatureHasher requires string input instead of number
for a in cat_attribs:
    sample[a] = sample[a].astype('str')
    
train_data = sample[cat_attribs]
hasher = FeatureHasher(n_features=500, input_type='string')
train_data_prepared = hasher.transform(train_data.values)

logreg = LogisticRegression(solver='lbfgs', max_iter=500)
s = cross_val_score(logreg, train_data_prepared, labels, scoring='neg_log_loss', cv=5).mean()
print('Model with all categorical attributes:', -s)

Model with all categorical attributes: 0.42200277489033605
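
One caveat with the hashing cell above: with `input_type='string'`, each raw value is hashed on its own, so the same string coming from two different columns (for example `'0'` in `banner_pos` and `'0'` in `device_type`) lands in the same bucket. A possible refinement, sketched here on a made-up toy frame, is to prefix each value with its column name before hashing. This is a variation for illustration, not what the notebook's model uses.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Toy frame: rows 0 and 1 contain the same strings, but in different columns
toy = pd.DataFrame({'banner_pos': ['0', '1', '0'], 'device_type': ['1', '0', '1']})

hasher_demo = FeatureHasher(n_features=1024, input_type='string')

# Raw values only, as in the cell above: rows 0 and 1 become indistinguishable
plain = hasher_demo.transform(toy.values).toarray()

# Column-prefixed tokens such as 'banner_pos=0' keep the columns apart
tokens = [[f'{col}={val}' for col, val in row.items()] for row in toy.to_dict('records')]
prefixed = hasher_demo.transform(tokens).toarray()

print(np.array_equal(plain[0], plain[1]))        # identical without prefixes
print(np.array_equal(prefixed[0], prefixed[1]))  # differ, barring a hash collision
```

Whether this helps the logloss on the real data would need to be checked with the same cross-validation as above.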


Make another submission. The training score is better than using only low-cardinality attributes, but how about the final prediction?

In [18]:
# Copy the slice so the string conversion does not trigger SettingWithCopyWarning
test_data = click_data_test[cat_attribs].copy()
for a in cat_attribs:
    test_data[a] = test_data[a].astype('str')
    
test_data_prepared = hasher.transform(test_data.values)
logreg.fit(train_data_prepared, labels)
preds = logreg.predict_proba(test_data_prepared)
generate_submission_file(preds)


The public leaderboard score is 0.41605, a tiny improvement and again lower than the score in our model training. It's ranked 1191. OK, save the model.

In [19]:
save_model(logreg, 'all categorical features with feature hashing', 0.41605, 1191)

### `hour` feature
This temporal feature is the only one left. As explored previously, we can derive two more features: hour of the day and day of the week.

In [32]:
sample['hour_of_day'] = sample['hour'].dt.hour
sample['day_of_week'] = sample['hour'].dt.weekday
num_attribs = ['hour_of_day', 'day_of_week']

# Feature scaling
scaler = StandardScaler()
num_data = scaler.fit_transform(sample[num_attribs])
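
As a quick check of what the two derived features contain, the sketch below parses a few made-up raw `hour` values in the dataset's `%y%m%d%H` format. The values are cast to strings first as a precaution; the notebook itself passes the column directly.

```python
import pandas as pd

# Toy raw values in YYMMDDHH form, within the sample's date range
raw_hours = pd.Series([14102100, 14102109, 14103023])
parsed = pd.to_datetime(raw_hours.astype(str), format='%y%m%d%H')

hour_of_day = parsed.dt.hour     # 0..23
day_of_week = parsed.dt.weekday  # Monday=0 .. Sunday=6

print(pd.DataFrame({'hour': parsed, 'hour_of_day': hour_of_day, 'day_of_week': day_of_week}))
```

Since the sample spans only 2014-10-21 to 2014-10-30, each weekday occurs just once or twice, which may limit how much `day_of_week` can contribute.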



In [23]:
cat_attribs = ['C1', 'banner_pos', 'site_category', 'app_category', 'device_type', 'device_conn_type', 'site_id', 
               'site_domain', 'app_id', 'app_domain', 'device_model', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20']

# FeatureHasher requires string input instead of number
for a in cat_attribs:
    sample[a] = sample[a].astype('str')
    
hasher = FeatureHasher(n_features=500, input_type='string')
cat_data = hasher.transform(sample[cat_attribs].values)

In [33]:
print('Num data:', num_data.shape)
print('Cat data:', cat_data.shape)

Num data: (100000, 2)
Cat data: (100000, 500)


In [34]:
train_data_prepared = hstack([num_data, cat_data])
logreg = LogisticRegression(solver='lbfgs', max_iter=500)
s = cross_val_score(logreg, train_data_prepared, labels, scoring='neg_log_loss', cv=5).mean()
print('Model with categorical and hour attributes:', -s)

Model with categorical and hour attributes: 0.42306352082587306


The model doesn't improve though: the logloss of 0.42306 is slightly worse than the 0.42200 obtained without the hour features. So, I won't use those features. Given the time available, I will stop the feature engineering here and move on to model tuning.