<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Teacher-Model" data-toc-modified-id="Teacher-Model-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Teacher Model</a></span><ul class="toc-item"><li><span><a href="#Data-preprocessing" data-toc-modified-id="Data-preprocessing-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Data preprocessing</a></span><ul class="toc-item"><li><span><a href="#Data-generator" data-toc-modified-id="Data-generator-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Data generator</a></span></li><li><span><a href="#Set-hashing-space-for-each-sparse-field" data-toc-modified-id="Set-hashing-space-for-each-sparse-field-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Set hashing space for each sparse field</a></span></li></ul></li><li><span><a href="#Model" data-toc-modified-id="Model-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Model</a></span><ul class="toc-item"><li><span><a href="#Define-Model" data-toc-modified-id="Define-Model-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Define Model</a></span></li><li><span><a href="#Train" data-toc-modified-id="Train-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Train</a></span></li></ul></li><li><span><a href="#Evaluation" data-toc-modified-id="Evaluation-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Evaluation</a></span></li><li><span><a href="#Soft-targets" data-toc-modified-id="Soft-targets-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Soft-targets</a></span></li></ul></li></ul></div>

# Teacher Model

We train the teacher model on the full Criteo dataset.

The model is a Deep & Cross Network (DCN) [1].

Reference:
```
[1] Wang R, Fu B, Fu G, et al. Deep & cross network for ad click predictions[C]//Proceedings of the ADKDD'17. 
ACM, 2017: 12. (https://arxiv.org/abs/1708.05123)
```

In [37]:
import os
import pandas as pd
import numpy as np

from sklearn.metrics import log_loss, roc_auc_score
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

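from deepctr.inputs import SparseFeat, DenseFeat, get_feature_names
from deepctr.models.dcn import DCN

# ModelCheckpoint is used in the Train section below
from tensorflow.keras.callbacks import ModelCheckpoint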

from collections import defaultdict

In [197]:
DATA_PREFIX = "../../../data/criteo/"

FULL_TRAIN_DATA = os.path.join(DATA_PREFIX, 'train.txt')
FULL_TEST_DATA = os.path.join(DATA_PREFIX, 'test.txt')

TRAIN_DATA = os.path.join(DATA_PREFIX, 'train.csv')
TEST_DATA = os.path.join(DATA_PREFIX, 'test.csv')
TEST_LABELS_DATA = os.path.join(DATA_PREFIX, 'test_labels.csv')

## Data preprocessing

In [12]:
dense_features_indices = [i for i in range(1, 14)]
sparse_features_indices = [i for i in range(14, 40)]

dense_features = ['c{}'.format(i) for i in dense_features_indices]
sparse_features = ['c{}'.format(i) for i in sparse_features_indices]

len(dense_features_indices), len(sparse_features_indices)

(13, 26)

In [74]:
min_arr = [0] * 40
max_arr = [0] * 40
range_arr = [0] * 40

### Data generator

To avoid loading the whole file into memory, we read it sequentially and generate batches for training the model.

Dense features are min-max scaled. Because the file is streamed, the minimum and maximum of each feature are running values that are updated as rows are read.
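A minimal sketch of the running min-max scaling described above, on a toy list of values. The function name `running_min_max_scale` is hypothetical and used only for illustration. It mirrors the generator's behaviour of starting both bounds at 0, so early values are scaled with partial statistics and can differ from a global min-max scaling.

```python
def running_min_max_scale(values, lo=0.0, hi=0.0):
    """Scale each value with the min/max seen so far (hypothetical helper)."""
    scaled = []
    for v in values:
        lo = min(lo, v)
        hi = max(hi, v)
        span = hi - lo
        # Leave the value unchanged while the observed range is still zero
        scaled.append((v - lo) / span if span else v)
    return scaled


print(running_min_max_scale([2.0, 4.0, 1.0, 8.0]))
# prints [1.0, 1.0, 0.25, 1.0]; global min-max over [0, 8] would give [0.25, 0.5, 0.125, 1.0]
```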

In [75]:
def data_generator(filename, batch_size=1024, dense_feature_missing_value=0, sparse_feature_missing_value='-1'):
    """Stream a tab-separated Criteo file and yield (features, labels) batches.

    Makes a single pass over the file. Dense features are scaled with the
    running min/max stored in the global min_arr / max_arr / range_arr.
    """
    batch = defaultdict(list)
    labels = defaultdict(list)
    m = 0  # number of rows accumulated in the current batch

    def prepare_data_dict(batch, dense_features, sparse_features):
        # Dense features become numpy arrays, sparse features become string Series
        data_dict = {}
        for f_name in dense_features:
            data_dict[f_name] = np.array(batch[f_name])

        for f_name in sparse_features:
            data_dict[f_name] = pd.Series(batch[f_name])

        return data_dict

    with open(filename) as f:
        for line in f:
            # Column 0 is the label, 1-13 are dense, 14-39 are sparse
            features = line.rstrip('\n').split('\t')

            labels['label'].append(np.int32(features[0]))

            for i, f_name in zip(dense_features_indices, dense_features):
                val = features[i] if features[i] != '' else dense_feature_missing_value
                val = float(val)

                # Update the running statistics, then scale with them
                min_arr[i] = min(min_arr[i], val)
                max_arr[i] = max(max_arr[i], val)
                range_arr[i] = max_arr[i] - min_arr[i]

                if range_arr[i]:
                    val = (val - min_arr[i]) / range_arr[i]

                batch[f_name].append(val)

            for i, f_name in zip(sparse_features_indices, sparse_features):
                val = features[i] if features[i] != '' else sparse_feature_missing_value
                batch[f_name].append(val)

            m += 1
            if m % batch_size == 0:
                data_dict = prepare_data_dict(batch, dense_features, sparse_features)
                yield data_dict, pd.Series(labels['label'])

                m = 0
                batch = defaultdict(list)
                labels = defaultdict(list)

        # Yield the last partial batch, but never an empty one
        if m:
            data_dict = prepare_data_dict(batch, dense_features, sparse_features)
            yield data_dict, pd.Series(labels['label'])
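Keras `fit` draws `steps_per_epoch * epochs` batches from a generator, whereas `data_generator` makes a single pass over the file. One possible way to train for several epochs is to restart the generator whenever it is exhausted. The wrapper below is a sketch under that assumption; `repeat_generator` and `toy_batches` are hypothetical names and not part of the original notebook.

```python
from itertools import islice


def repeat_generator(make_generator):
    """Yield batches forever by re-creating a single-pass generator (hypothetical helper)."""
    while True:
        produced = False
        for batch in make_generator():
            produced = True
            yield batch
        if not produced:
            # Avoid spinning forever on an empty source
            raise ValueError("generator produced no batches")


def toy_batches():
    # Stand-in for a single pass over a data file
    yield "batch-0"
    yield "batch-1"


first_five = list(islice(repeat_generator(toy_batches), 5))
print(first_five)
# prints ['batch-0', 'batch-1', 'batch-0', 'batch-1', 'batch-0']
```

With the notebook's generator this could be used as `repeat_generator(lambda: data_generator(FULL_TRAIN_DATA, 1024))`. The running min/max arrays persist between passes, so later passes are scaled with the statistics of the whole file.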

For validation we use the same data that was used when training the small teacher model.

In [152]:
data = pd.read_csv(TRAIN_DATA)
# Strip the leading underscore from column names such as '_c1'
data.rename(columns={col: (col[1:] if col[0] == '_' else col) for col in data.columns}, inplace=True)

# Same missing-value conventions as in data_generator
data[sparse_features] = data[sparse_features].fillna('-1')
data[dense_features] = data[dense_features].fillna(0)
target = ['c0']

mms = MinMaxScaler(feature_range=(0, 1))
data[dense_features] = mms.fit_transform(data[dense_features])

# Keep row order: the last 20% is split in half into validation and test
_, test = train_test_split(data, test_size=0.2, shuffle=False)
validation, test = train_test_split(test, test_size=0.5, shuffle=False)

print(len(validation), len(test))


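def gen_model_input(df):
    # Iterate over the feature lists directly so this cell does not depend on
    # feature_names, which is only defined further down in the notebook
    return {name: (pd.Series(df[name]) if name in sparse_features else np.array(df[name]))
            for name in sparse_features + dense_features}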


all_model_input = gen_model_input(data)
validation_model_input = gen_model_input(validation)
test_model_input = gen_model_input(test)

366493 366494


### Set hashing space for each sparse field

For categorical features we use the hashing trick to reduce the size of the embedding vocabulary.
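As an illustration of the hashing trick, the sketch below maps arbitrary category strings into a fixed number of buckets. It uses MD5 only to get a deterministic hash; this is an assumption made for the example and not necessarily the hash function DeepCTR applies when `use_hash=True`. The name `hash_bucket` and the sample category values are hypothetical.

```python
import hashlib


def hash_bucket(value, num_buckets):
    """Map a category string to a bucket index in [0, num_buckets) (hypothetical helper)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


# Made-up category values; '-1' is the missing-value placeholder used in this notebook
for raw in ["68fd1e64", "80e26c9b", "-1"]:
    print(raw, "->", hash_bucket(raw, 1000))
```

Distinct categories can collide in the same bucket. That is the price paid for a vocabulary whose size is fixed in advance.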

In [110]:
sparse_features_dims = dict([
    ('c14', 1445),
    ('c15', 556),
    ('c16', 1130758),
    ('c17', 360209),
    ('c18', 304),
    ('c19', 21),
    ('c20', 11845),
    ('c21', 631),
    ('c22', 3),
    ('c23', 49223),
    ('c24', 5194),
    ('c25', 985420),
    ('c26', 3157),
    ('c27', 26),
    ('c28', 11588),
    ('c29', 715441),
    ('c30', 10),
    ('c31', 4681),
    ('c32', 2029),
    ('c33', 4),
    ('c34', 870796),
    ('c35', 17),
    ('c36', 15),
    ('c37', 87605),
    ('c38', 84),
    ('c39', 58187)])

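For each sparse feature we specify the vocabulary size and the embedding dimension. In the code below, the vocabulary size is capped at 50,000 hash buckets and the embedding dimension at 100.

The embedding dimension is computed with the formula from [1]: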

$$ \text{embedding_dim} = 6 \cdot (\text{vocab_size})^{1/4} $$
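A short worked example of the formula above, using three vocabulary sizes from `sparse_features_dims` and the same cap of 100 that the feature-column definition applies. The helper name `embedding_dim` is hypothetical.

```python
def embedding_dim(vocab_size, max_dim=100):
    """6 * vocab_size ** (1/4), truncated to an integer and capped at max_dim."""
    return min(int(6 * vocab_size ** 0.25), max_dim)


for name, vocab_size in [("c22", 3), ("c14", 1445), ("c16", 1130758)]:
    print(name, vocab_size, embedding_dim(vocab_size))
# c22 3 7
# c14 1445 36
# c16 1130758 100
```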

In [184]:
fixlen_feature_columns = [SparseFeat(feat, 
                                     vocabulary_size=min(vocab_size, 50000), 
                                     embedding_dim=min(int(6 * (vocab_size) ** (0.25)), 100), 
                                     use_hash=True, dtype='string') 
                          for feat, vocab_size in sparse_features_dims.items()] + \
                        [DenseFeat(feat, 1,) for feat in dense_features]

# fixlen_feature_columns = [SparseFeat(feat, 
#                                      vocabulary_size=min(vocab_size, 10000), 
#                                      embedding_dim=32, 
#                                      use_hash=True, dtype='string') 
#                           for feat, vocab_size in sparse_features_dims.items()] + \
#                         [DenseFeat(feat, 1,) for feat in dense_features]

linear_feature_columns = fixlen_feature_columns
dnn_feature_columns = fixlen_feature_columns
feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns, )

## Model

### Define Model

We use the hyperparameters described in [1].
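For orientation, the cross layer defined in [1] computes $x_{l+1} = x_0 x_l^\top w_l + b_l + x_l$, and `cross_num` sets how many such layers are stacked. The pure-Python sketch below applies that formula to a single example vector. It is an illustration of the formula only, with a hypothetical function name and toy weights, and DeepCTR's implementation may differ in details.

```python
def cross_layer(x0, xl, w, b):
    """One cross layer: x_{l+1} = x0 * (xl . w) + b + xl, for a single example."""
    # xl . w is a scalar, so x0 * (xl . w) rescales the original input x0
    s = sum(x_i * w_i for x_i, w_i in zip(xl, w))
    return [x0_i * s + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]


x0 = [1.0, 2.0]
w = [1.0, 1.0]   # toy weights
b = [0.0, 0.0]   # toy bias

x1 = cross_layer(x0, x0, w, b)   # first cross layer
x2 = cross_layer(x0, x1, w, b)   # second cross layer, as with cross_num=2
print(x1, x2)
# prints [4.0, 8.0] [16.0, 32.0]
```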

In [185]:
model = DCN(linear_feature_columns, dnn_feature_columns, cross_num=2,
            dnn_hidden_units=(128, 128), l2_reg_linear=0, l2_reg_embedding=0,
            l2_reg_cross=0, l2_reg_dnn=0, init_std=0.0001, seed=1024, 
            dnn_use_bn=True, dnn_activation='relu', task='binary')

model.compile("adam", "binary_crossentropy",
              metrics=['binary_crossentropy'], )

### Train

In [None]:
checkpoint = ModelCheckpoint("best_model_3.hdf5", monitor='loss', verbose=1,
                             save_best_only=True, mode='auto', period=1)

model.fit(data_generator(FULL_TRAIN_DATA, 1024),
          use_multiprocessing=True, steps_per_epoch=44766, epochs=5, verbose=1, 
          callbacks=[checkpoint])

In [187]:
model.save_weights(os.path.join(DATA_PREFIX, 'DCN_w.h5'))

**Model size**

168MB

## Evaluation

We evaluate the model on the held-out test split (`test_model_input`).

In [188]:
pred_ans = model.predict(test_model_input, batch_size=256)



In [189]:
print("test LogLoss", round(log_loss(test[target].values, pred_ans), 4))
print("test AUC", round(roc_auc_score(test[target].values, pred_ans), 4))

test LogLoss 0.4566
test AUC 0.802


## Soft-targets

We create a new target for training the student model.

As the new target we use the probability of class 1 predicted by the teacher model.
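The student's training is not part of this notebook. As an assumption about how such a target might be consumed, the sketch below evaluates binary cross-entropy against a soft target in place of a 0/1 label. The function name `soft_binary_crossentropy` is hypothetical.

```python
import math


def soft_binary_crossentropy(soft_target, prediction, eps=1e-7):
    """Binary cross-entropy where the target is a probability in [0, 1]."""
    # Clip the prediction to avoid log(0)
    p = min(max(prediction, eps), 1.0 - eps)
    return -(soft_target * math.log(p) + (1.0 - soft_target) * math.log(1.0 - p))


teacher_prob = 0.8  # toy value standing in for one entry of the soft-target file
matching = soft_binary_crossentropy(teacher_prob, 0.8)
off = soft_binary_crossentropy(teacher_prob, 0.5)
print(round(matching, 4), round(off, 4))
```

The loss is smallest when the student's prediction equals the teacher's probability, which is what pulls the student towards the teacher's outputs.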

In [190]:
preds = model.predict(all_model_input, batch_size=256)



In [191]:
probs = [p[0] for p in preds]

In [194]:
pd.DataFrame({'id': data['id'], 'prob': probs}) \
    .to_csv(os.path.join(DATA_PREFIX, 'soft_targets_full.csv'), index=False)