<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Teacher-Model" data-toc-modified-id="Teacher-Model-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Teacher Model</a></span><ul class="toc-item"><li><span><a href="#Data-preprocessing" data-toc-modified-id="Data-preprocessing-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Data preprocessing</a></span><ul class="toc-item"><li><span><a href="#Simple-transformation-for-dense-features" data-toc-modified-id="Simple-transformation-for-dense-features-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Simple transformation for dense features</a></span></li><li><span><a href="#Set-hashing-space-for-each-sparse-field" data-toc-modified-id="Set-hashing-space-for-each-sparse-field-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Set hashing space for each sparse field</a></span></li><li><span><a href="#Generate-input-data-for-model" data-toc-modified-id="Generate-input-data-for-model-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span>Generate input data for model</a></span></li></ul></li><li><span><a href="#Model" data-toc-modified-id="Model-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Model</a></span><ul class="toc-item"><li><span><a href="#Define-Model" data-toc-modified-id="Define-Model-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Define Model</a></span></li><li><span><a href="#Train" data-toc-modified-id="Train-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Train</a></span></li></ul></li><li><span><a href="#Evaluation" data-toc-modified-id="Evaluation-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Evaluation</a></span></li><li><span><a href="#Soft-targets" data-toc-modified-id="Soft-targets-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Soft-targets</a></span></li></ul></li></ul></div>

# Teacher Model

In [1]:
import os
import pandas as pd
import numpy as np

from sklearn.metrics import log_loss, roc_auc_score
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

from deepctr.inputs import SparseFeat, DenseFeat, get_feature_names
from deepctr.models.dcn import DCN

from keras.callbacks import ModelCheckpoint

Using TensorFlow backend.


In [2]:
DATA_PREFIX = "../../../data/criteo/"

TRAIN_DATA = os.path.join(DATA_PREFIX, 'train.csv')
TEST_DATA = os.path.join(DATA_PREFIX, 'test.csv')
TEST_LABELS_DATA = os.path.join(DATA_PREFIX, 'test_labels.csv')

## Data preprocessing

In [3]:
data = pd.read_csv(TRAIN_DATA)

In [4]:
# Strip a leading underscore from column names, if present.
data.rename(columns=lambda col: col[1:] if col.startswith('_') else col, inplace=True)

In [5]:
data.columns

Index(['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10',
       'c11', 'c12', 'c13', 'c14', 'c15', 'c16', 'c17', 'c18', 'c19', 'c20',
       'c21', 'c22', 'c23', 'c24', 'c25', 'c26', 'c27', 'c28', 'c29', 'c30',
       'c31', 'c32', 'c33', 'c34', 'c35', 'c36', 'c37', 'c38', 'c39', 'id'],
      dtype='object')

In [5]:
dense_features = ['c{}'.format(i) for i in range(1, 14)]
sparse_features = ['c{}'.format(i) for i in range(14, 40)]

len(dense_features), len(sparse_features)

(13, 26)

In [6]:
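# Fill missing values: '-1' as a placeholder category for sparse fields, 0 for dense fields.
data[sparse_features] = data[sparse_features].fillna('-1')
data[dense_features] = data[dense_features].fillna(0)
target = ['c0']  # label column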

In [7]:
[(feat, data[feat].nunique()) for feat in sparse_features]

[('c14', 1445),
 ('c15', 556),
 ('c16', 1130758),
 ('c17', 360209),
 ('c18', 304),
 ('c19', 21),
 ('c20', 11845),
 ('c21', 631),
 ('c22', 3),
 ('c23', 49223),
 ('c24', 5194),
 ('c25', 985420),
 ('c26', 3157),
 ('c27', 26),
 ('c28', 11588),
 ('c29', 715441),
 ('c30', 10),
 ('c31', 4681),
 ('c32', 2029),
 ('c33', 4),
 ('c34', 870796),
 ('c35', 17),
 ('c36', 15),
 ('c37', 87605),
 ('c38', 84),
 ('c39', 58187)]

### Simple transformation for dense features

In [8]:
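# Rescale every dense feature to [0, 1]. Note that the scaler is fit on the
# full dataset, i.e. before the train/validation/test split.
mms = MinMaxScaler(feature_range=(0, 1))
data[dense_features] = mms.fit_transform(data[dense_features])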
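
As a rough illustration of what this scaling step does, the sketch below applies the min-max formula `(x - min) / (max - min)` to a single column in plain Python. The helper `min_max_scale` and the sample values are made up for illustration and are not part of scikit-learn or of this notebook's data.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] (illustrative helper, not the sklearn API)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant column: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Made-up sample of one dense column after missing values were filled with 0.
column = [0.0, 5.0, 10.0, 20.0]
scaled = min_max_scale(column)
print(scaled)
```

Because the minimum and maximum come from the data itself, a single extreme value in a column compresses all other values towards 0.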

### Set hashing space for each sparse field

In [36]:
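# Every sparse field is hashed into 1000 buckets and embedded in 4 dimensions;
# every dense field is a single scalar input.
fixlen_feature_columns = [SparseFeat(feat, vocabulary_size=1000, embedding_dim=4, use_hash=True, dtype='string')
                          for feat in sparse_features] + \
                         [DenseFeat(feat, 1)
                          for feat in dense_features]

# The linear and deep parts of the model share the same feature columns.
linear_feature_columns = fixlen_feature_columns
dnn_feature_columns = fixlen_feature_columns
feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)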
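
To make the effect of a 1000-bucket hashing space more concrete, here is a minimal sketch of the hashing trick in plain Python. It uses `zlib.crc32` as a stand-in hash, which is an assumption for illustration only; the hash function DeepCTR applies internally may well differ. The helper `hash_bucket` and the category strings are hypothetical.

```python
import zlib


def hash_bucket(value, num_buckets=1000):
    """Map a string to a bucket index in [0, num_buckets) (crc32 is a stand-in hash)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


# Hypothetical category strings; '-1' is the missing-value placeholder.
values = ["05db9164", "68fd1e64", "-1"]
buckets = {v: hash_bucket(v) for v in values}
print(buckets)

# Average number of distinct c16 values that share one bucket,
# given the 1130758 unique values counted for that field.
avg_per_bucket = 1130758 / 1000
print(avg_per_bucket)
```

With roughly 1131 distinct values per bucket on average for `c16`, high-cardinality fields share embeddings heavily, which trades accuracy for a small embedding table.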

### Generate input data for model

In [37]:
# Unshuffled 80/10/10 split into train, validation and test sets.
train, test = train_test_split(data, test_size=0.2, shuffle=False)
validation, test = train_test_split(test, test_size=0.5, shuffle=False)

len(train), len(validation), len(test)

(2931944, 366493, 366494)
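
The three sizes can be reproduced by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remainder goes to the other part; under that assumption the arithmetic matches the output printed above. The helper `split_sizes` is made up for this illustration.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a float test_size, rounding the second part up."""
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second


n_total = 2931944 + 366493 + 366494  # rows in the full dataset
n_train, n_holdout = split_sizes(n_total, 0.2)
n_validation, n_test = split_sizes(n_holdout, 0.5)
print(n_train, n_validation, n_test)
```

Since `shuffle=False`, the training set is the first 80% of the rows in file order, followed by the validation and test sets.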

In [38]:
def gen_model_input(df):
    return {name: df[name] for name in feature_names}


train_model_input = gen_model_input(train)
validation_model_input = gen_model_input(validation)
test_model_input = gen_model_input(test)

## Model

### Define Model

In [39]:
model = DCN(linear_feature_columns, dnn_feature_columns, cross_num=2,
            dnn_hidden_units=(128, 128), l2_reg_linear=3e-05, l2_reg_embedding=3e-05,
            l2_reg_cross=3e-05, l2_reg_dnn=0, init_std=0.0001, seed=1024, 
            dnn_dropout=0.2, dnn_use_bn=True, dnn_activation='relu', task='binary')

model.compile("adam", "binary_crossentropy",
              metrics=['binary_crossentropy'], )

### Train

In [25]:
checkpoint = ModelCheckpoint("best_model.hdf5", monitor='val_loss', verbose=1,
                             save_best_only=True, mode='auto', period=1)


history = model.fit(train_model_input, train[target].values, 
                    batch_size=256, epochs=5, verbose=1, use_multiprocessing=True,
                    callbacks=[checkpoint],
                    validation_data=(validation_model_input, validation[target].values))

Train on 2931944 samples, validate on 366493 samples
Epoch 1/5
Epoch 00001: loss improved from inf to 0.48632, saving model to best_model.hdf5
Epoch 2/5
Epoch 00002: loss improved from 0.48632 to 0.48089, saving model to best_model.hdf5
Epoch 3/5
Epoch 00003: loss improved from 0.48089 to 0.47840, saving model to best_model.hdf5
Epoch 4/5
Epoch 00004: loss improved from 0.47840 to 0.47711, saving model to best_model.hdf5
Epoch 5/5
Epoch 00005: loss improved from 0.47711 to 0.47627, saving model to best_model.hdf5


In [17]:
model.save_weights(os.path.join(DATA_PREFIX, 'DCN_w.h5'))

**Model size**

Approximately 5 MB
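
One simple way to check a figure like this is to read the size of the saved file from disk. The sketch below is a minimal example; the helper `file_size_mb` is hypothetical, and the notebook does not state which file the size refers to, so the demonstration uses a throwaway temporary file instead.

```python
import os
import tempfile


def file_size_mb(path):
    """Return the size of a file in megabytes (1 MB taken as 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


# Demonstration on a throwaway 2 MB file; in the notebook one would pass
# the path of a saved model file instead, e.g. "best_model.hdf5".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (2 * 1024 * 1024))
demo_path = tmp.name
demo_size = file_size_mb(demo_path)
os.remove(demo_path)
print(demo_size)
```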

## Evaluation

In [26]:
pred_ans = model.predict(test_model_input, batch_size=256)



In [27]:
print("test LogLoss", round(log_loss(test[target].values, pred_ans), 4))
print("test AUC", round(roc_auc_score(test[target].values, pred_ans), 4))

test LogLoss 0.4737
test AUC 0.7808
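
For readers less familiar with the first metric, the following sketch computes binary log loss by hand in plain Python. The labels and predictions are made up and are not taken from the Criteo data; `binary_log_loss` is an illustrative helper, not the scikit-learn function.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Made-up labels and predictions.
labels = [1, 0, 0, 1]
confident = binary_log_loss(labels, [0.9, 0.1, 0.2, 0.8])
uninformed = binary_log_loss(labels, [0.5, 0.5, 0.5, 0.5])
print(round(confident, 4), round(uninformed, 4))
```

A model that always predicts 0.5 scores ln 2 ≈ 0.6931, which gives a reference point for the test log loss reported above.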


## Soft-targets

In [28]:
preds_train = model.predict(train_model_input, batch_size=256)



In [31]:
probs = preds_train.ravel()  # flatten the (n, 1) prediction array to 1-D

In [32]:
pd.DataFrame({'id': train['id'], 'prob': probs}) \
    .to_csv(os.path.join(DATA_PREFIX, 'soft_targets.csv'), index=False)
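
The notebook does not show how these soft targets are consumed later. One common option in knowledge distillation is to train a student on a blend of the hard label and the teacher's probability, and the sketch below shows that blend for a single example. The function `distillation_loss` and the weight `alpha` are assumptions introduced for illustration, not something defined in this notebook.

```python
import math


def distillation_loss(y_true, teacher_prob, student_prob, alpha=0.5, eps=1e-15):
    """Blend cross-entropy against the hard label and against the teacher's soft target."""
    p = min(max(student_prob, eps), 1 - eps)  # clip to keep log() finite

    def cross_entropy(target):
        return -(target * math.log(p) + (1 - target) * math.log(1 - p))

    return alpha * cross_entropy(y_true) + (1 - alpha) * cross_entropy(teacher_prob)


# Made-up example: label 1, teacher predicts 0.7, student predicts 0.6.
loss = distillation_loss(y_true=1, teacher_prob=0.7, student_prob=0.6)
print(round(loss, 4))
```

With `alpha=1.0` this reduces to ordinary cross-entropy on the hard label, and with `alpha=0.0` the student is trained on the teacher's probabilities alone.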