### Download dataset

Please download the dataset from: 

http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/

The dataset is about 4.3 GB.

Uncompress it and put 'train.txt' in the 'data' folder.

Since the test set has no labels, we use only the training set for experiments and split it into a training set and a validation set.
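
For orientation, here is a minimal sketch of how one line of `train.txt` maps onto the column names used later in this notebook (a label, 13 integer fields, and 26 string fields, separated by tabs). The helper `parse_line` and the sample line are made up for illustration and are not part of the original project.

```python
# Column names as defined later in this notebook.
COL_NAMES = ["label"] + ["int_%d" % d for d in range(13)] + ["str_%d" % d for d in range(26)]

def parse_line(line):
    """Hypothetical helper: split one tab-separated line; empty fields become None."""
    fields = line.rstrip("\n").split("\t")
    return {name: (value if value != "" else None)
            for name, value in zip(COL_NAMES, fields)}

# A made-up line: label 1, twelve integer fields, one missing integer, 26 strings.
fake = "\t".join(["1"] + ["3"] * 12 + [""] + ["ab12cd34"] * 26)
row = parse_line(fake)
print(row["label"], row["int_12"], row["str_25"])
```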

The [original project](https://github.com/xxxmin/ctr_Keras) preprocesses the data by:

1. Filling NaN with 0.
2. Removing categorical values with a frequency below 10.
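
The two steps above could look roughly like the sketch below. It is one possible reading, not the original project's code: rare values are mapped to a shared token rather than dropped, and the names `filter_rare`, `fill`, and `rare` are hypothetical.

```python
from collections import Counter

def filter_rare(values, min_count=10, fill="0", rare="rare"):
    """Fill missing entries, then map values seen fewer than min_count times to one token."""
    # Step 1: treat None and empty strings as missing and fill them.
    filled = [v if v not in (None, "") else fill for v in values]
    # Step 2: count occurrences and collapse the infrequent values.
    counts = Counter(filled)
    return [v if counts[v] >= min_count else rare for v in filled]

# Made-up sample: "a" is frequent, "b" and the missing entries are rare.
sample = ["a"] * 12 + ["b"] * 3 + ["", None]
cleaned = filter_rare(sample)
print(Counter(cleaned))
```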

But I found that even after removing categories occurring fewer than 10 times, there are still too many possible values:

```
number of unique values:
int_0 320
int_1 4893
int_2 1919
int_3 176
int_4 98731
int_5 4723
int_6 1727
int_7 410
int_8 3708
int_9 9
int_10 135
int_11 171
int_12 379
str_0 1442
str_1 553
str_2 175780
str_3 128508
str_4 304
str_5 18
str_6 11929
str_7 628
str_8 3
str_9 41223
str_10 5159
str_11 174834
str_12 3174
str_13 26
str_14 11253
str_15 165205
str_16 10
str_17 4604
str_18 2016
str_19 4
str_20 172321
str_21 17
str_22 15
str_23 56455
str_24 85
str_25 43355
```

So I use the hashing trick instead.
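
As a minimal sketch of the idea, the snippet below folds arbitrary strings into a fixed number of buckets with a stable hash. It is an illustration under assumptions: the function `hash_bucket` and the sample strings are made up, and the `col_process` cell in this notebook uses `pd.factorize` followed by a modulo rather than a hash function. The default of 127 buckets matches `max_cut - 1` for `max_cut=128`.

```python
import zlib

def hash_bucket(value, num_buckets=127):
    """Map a string to a stable bucket id in [0, num_buckets)."""
    # zlib.crc32 is deterministic across runs; the built-in hash() is salted
    # per process for strings, so it is not suitable for stored features.
    return zlib.crc32(value.encode("utf-8")) % num_buckets

# Made-up category strings; equal inputs always share a bucket.
ids = [hash_bucket(v) for v in ["cat_a", "cat_b", "cat_a"]]
print(ids)
```

Different strings can collide in the same bucket; that is the price paid for bounding the vocabulary size.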

In [190]:
import pandas as pd
import numpy as np
import tensorflow as tf
from collections import defaultdict
from tqdm import tqdm

In [194]:
input_fname = "data/train.txt"
output_train = "data/train.csv"
output_valid = "data/valid.csv"

total_lines = 45840618
valid_size = 0.1
col_names = ['label'] + ['int_%d' % d for d in range(13)] + ["str_%d" % d for d in range(26)]

In [193]:
# Count the unique values of each feature.
# This takes about 40 minutes to run.

train_ds = tf.data.experimental.make_csv_dataset(
    input_fname,
    batch_size=128,
    column_names=col_names,
    label_name="label",
    field_delim='\t',
    num_epochs=1
)

cnt = defaultdict(lambda: defaultdict(lambda: 0))
for batch, label in tqdm(train_ds, total=total_lines//128):
    for key, tensor in batch.items():
        for val in tensor.numpy():
            cnt[key][val] += 1

In [203]:
print("number of unique values which occur more than 10 times:")
for k, c in cnt.items():
    print(k, sum(1 for v, n in c.items() if n > 10))

number of unique values which occur more than 10 times:
int_0 320
int_1 4893
int_2 1919
int_3 176
int_4 98731
int_5 4723
int_6 1727
int_7 410
int_8 3708
int_9 9
int_10 135
int_11 171
int_12 379
str_0 1442
str_1 553
str_2 175780
str_3 128508
str_4 304
str_5 18
str_6 11929
str_7 628
str_8 3
str_9 41223
str_10 5159
str_11 174834
str_12 3174
str_13 26
str_14 11253
str_15 165205
str_16 10
str_17 4604
str_18 2016
str_19 4
str_20 172321
str_21 17
str_22 15
str_23 56455
str_24 85
str_25 43355


In [176]:
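def col_process(in_f, out_f, names, key, max_cut=128):
    # Read only the column `key` from the tab-separated input file.
    sh = pd.read_csv(in_f, delimiter='\t', names=names, usecols=(key,))
    nunique = sh[key].nunique()
    max_cut = min(max_cut - 1, nunique)

    if key.startswith("str"):
        # Hash trick for string features: integer-encode, then fold into max_cut buckets.
        # Assign the result back instead of using inplace=True on a column selection,
        # which may not modify `sh` in recent pandas versions.
        sh[key] = sh[key].fillna("no_value")
        sh[key] = pd.factorize(sh[key])[0]
        sh[key] = sh[key] % max_cut
    else:
        # Split numeric features into max_cut equal-width buckets.
        codes = pd.cut(sh[key], max_cut, labels=range(max_cut)).cat.codes
        # Missing values get code -1; move them to their own bucket, max_cut.
        sh[key] = codes.replace(-1, max_cut)

    # The index is written as the first CSV column; merge_cols reads the second.
    sh.to_csv(out_f)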
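
To make the numeric branch easier to follow, here is a pure-Python sketch of equal-width bucketing with a separate bucket for missing values. It is a simplification, not a reimplementation of `pd.cut`: it uses left-closed bins, so a value that sits exactly on a bin edge may land one bucket higher than it would with `pd.cut`. The function name `bucketize` is hypothetical.

```python
import math

def bucketize(values, num_buckets):
    """Equal-width bucket ids in [0, num_buckets); missing values get id num_buckets."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    lo, hi = min(present), max(present)
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / num_buckets or 1.0
    ids = []
    for v in values:
        if v is None or math.isnan(v):
            ids.append(num_buckets)  # dedicated bucket for missing values
        else:
            # Clamp so the maximum value falls into the last bucket.
            ids.append(min(int((v - lo) / width), num_buckets - 1))
    return ids

print(bucketize([0.0, 1.0, 5.0, 10.0, float("nan")], 4))  # [0, 0, 2, 3, 4]
```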

In [177]:
# this cell takes about 40 minutes to run.
col_fname = [input_fname + "_%s.csv" % key for key in col_names]

for key, out_f in zip(col_names, col_fname):
    print("processing key:", key)
    col_process(input_fname, out_f, col_names, key)

processing key: label
processing key: int_0
processing key: int_1
processing key: int_2
processing key: int_3
processing key: int_4
processing key: int_5
processing key: int_6
processing key: int_7
processing key: int_8
processing key: int_9
processing key: int_10
processing key: int_11
processing key: int_12
processing key: str_0
processing key: str_1
processing key: str_2
processing key: str_3
processing key: str_4
processing key: str_5
processing key: str_6
processing key: str_7
processing key: str_8
processing key: str_9
processing key: str_10
processing key: str_11
processing key: str_12
processing key: str_13
processing key: str_14
processing key: str_15
processing key: str_16
processing key: str_17
processing key: str_18
processing key: str_19
processing key: str_20
processing key: str_21
processing key: str_22
processing key: str_23
processing key: str_24
processing key: str_25


In [179]:
valid_lines = int(total_lines * valid_size)
train_lines = total_lines - valid_lines

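def merge_cols(col_fname, output_train, output_valid, lim=100):
    # Merge the per-column CSVs row by row; `lim` caps the rows read (-1 means no cap).
    from tqdm import tqdm

    col_f = [open(f) for f in col_fname]
    train_f = open(output_train, 'w')
    valid_f = open(output_valid, 'w')

    try:
        for idx, lines in tqdm(enumerate(zip(*col_f))):
            # Each line is "index,value"; keep the value.
            keys = [s.strip().split(",")[1] for s in lines]
            merged_s = ",".join(keys) + "\n"
            if idx == 0:
                # The header row goes to both files.
                train_f.write(merged_s)
                valid_f.write(merged_s)
            elif idx < train_lines:
                train_f.write(merged_s)
            else:
                valid_f.write(merged_s)

            if lim != -1 and idx > lim:
                break
    finally:
        # Close every file so buffered rows are flushed to disk.
        for f in col_f + [train_f, valid_f]:
            f.close()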
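
The merge-and-split logic can be checked on a tiny in-memory example. The sketch below is an illustration with made-up data; `merge_and_split` is a hypothetical stand-in that returns lists instead of writing files. It sends exactly `n_train` data rows to the training split, whereas the cell above compares `idx < train_lines` with the header counted in `idx`, so it writes one fewer training row (negligible at this scale).

```python
import io

def merge_and_split(col_files, n_train):
    """Zip per-column CSVs ("index,value" per line) into rows and split them."""
    train, valid = [], []
    for idx, lines in enumerate(zip(*col_files)):
        # Keep the value part of each "index,value" line.
        row = ",".join(s.strip().split(",")[1] for s in lines)
        if idx == 0:
            # The header row goes to both splits.
            train.append(row)
            valid.append(row)
        elif idx <= n_train:
            train.append(row)
        else:
            valid.append(row)
    return train, valid

# Two made-up column files in the layout that DataFrame.to_csv produces.
label = io.StringIO(",label\n0,0\n1,1\n2,0\n")
int_0 = io.StringIO(",int_0\n0,5\n1,7\n2,9\n")
train, valid = merge_and_split([label, int_0], n_train=2)
print(train)
print(valid)
```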

In [181]:
merge_cols(col_fname, output_train, output_valid, lim=-1)

44795960it [14:52, 41576.45it/s]



In [183]:
!rm data/train.txt_*.csv

In [204]:
!head data/train.csv

label,int_0,int_1,int_2,int_3,int_4,int_5,int_6,int_7,int_8,int_9,int_10,int_11,int_12,str_0,str_1,str_2,str_3,str_4,str_5,str_6,str_7,str_8,str_9,str_10,str_11,str_12,str_13,str_14,str_15,str_16,str_17,str_18,str_19,str_20,str_21,str_22,str_23,str_24,str_25
0,0,0,0,0,0,0,0,0,0,1,1,127,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,1,0,127,0,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,1,0,1,1,0,0,1,0,1
0,0,0,0,1,0,0,0,0,1,1,1,0,0,1,2,2,2,0,0,2,1,0,2,2,2,2,2,2,2,2,2,1,2,2,1,0,2,1,2
0,127,0,127,127,0,127,0,0,0,12,0,127,127,0,3,3,3,0,1,3,1,0,3,3,3,3,2,3,3,3,3,1,2,3,0,0,3,1,2
0,0,0,127,0,0,0,0,0,0,1,0,127,0,2,4,4,4,0,2,4,1,0,4,4,4,4,0,4,4,3,4,1,2,4,0,1,4,1,2
0,127,0,127,127,0,127,0,0,0,12,0,127,127,3,5,5,5,1,3,5,1,0,2,5,5,5,2,5,5,4,5,1,2,5,2,2,5,1,2
0,127,0,0,127,0,127,0,0,0,12,0,127,127,4,6,6,6,1,1,6,1,0,2,6,6,6,2,6,6,4,6,1,2,6,0,3,6,1,2
1,0,0,0,0,0,0,0,0,0,1,0,127,0,0,3,7,7,2,2,7,0,0,5,7,7,7,2,3,7,0,3,1,2,7,0,1,3,1,2
0,127,0,0,1,0,0,0,0,0,12,0,127,0,3,7,8,2,0,0,8,1,0,6,8

In [205]:
!head data/valid.csv

label,int_0,int_1,int_2,int_3,int_4,int_5,int_6,int_7,int_8,int_9,int_10,int_11,int_12,str_0,str_1,str_2,str_3,str_4,str_5,str_6,str_7,str_8,str_9,str_10,str_11,str_12,str_13,str_14,str_15,str_16,str_17,str_18,str_19,str_20,str_21,str_22,str_23,str_24,str_25
0,0,0,0,0,0,0,0,0,0,0,0,127,0,0,61,124,105,5,0,117,1,0,117,90,36,124,2,43,64,0,91,22,0,14,1,1,21,29,9
1,127,0,0,0,0,0,0,0,0,12,2,127,0,7,0,48,12,3,1,85,5,0,96,85,80,49,2,5,25,1,0,0,0,92,0,1,19,0,25
0,127,0,127,0,0,0,0,0,0,12,0,127,0,0,21,8,2,0,0,84,1,0,39,79,8,77,2,28,8,0,96,1,2,8,1,1,2,1,2
0,0,0,0,1,0,0,0,0,0,2,4,0,0,67,66,121,109,0,5,61,1,0,26,16,121,110,2,41,118,2,126,1,2,121,0,9,89,1,2
0,0,0,127,0,0,0,0,0,1,1,15,0,0,7,34,7,126,3,5,123,6,0,108,11,79,11,0,51,79,2,46,0,0,59,0,1,21,6,9
1,0,0,0,0,0,0,0,0,1,1,3,127,0,7,9,33,9,0,0,62,1,0,111,123,25,108,2,97,63,0,118,0,1,123,0,6,2,2,6
0,127,0,127,4,0,127,0,0,0,12,0,127,0,3,109,43,111,0,1,42,0,0,57,30,77,22,1,109,78,0,42,1,2,96,1,4,30,1,2
0,127,0,127,0,0,127,0,0,0,12,0,127,0,3,4,117,4,2

In [1]:
%load_ext autoreload
%autoreload 2
from tensorflow.keras import callbacks
import tensorflow as tf
print(tf.config.get_visible_devices("GPU"))

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


In [2]:
MAX_CUT = 128
TRAIN_FNAME = "data/train.csv"
VALID_FNAME = "data/valid.csv"
col_names = ['label'] + ['int_%d' % d for d in range(13)] + ["str_%d" % d for d in range(26)]

In [3]:
# def check_csv(fname, n=40):
#     with open(fname) as f:
#         for li in f:
#             if li.count(",") != n - 1:
#                 print(li)
#                 print(li.count(","))
#                 break

# check_csv(TRAIN_FNAME)
# check_csv(VALID_FNAME)

In [4]:
def read_csv(filename, batch_size=256):
    return tf.data.experimental.make_csv_dataset(
        filename,
        batch_size=batch_size,
        column_names=col_names,
        label_name="label",
        num_epochs=1
    )

train_ds = read_csv(TRAIN_FNAME)
valid_ds = read_csv(VALID_FNAME)

In [5]:
def train_model(model, epochs=1, checkpoint_fname=None):
    callback_list = []
    callback_list.append(callbacks.EarlyStopping(monitor="val_loss", patience=2))
    
    if checkpoint_fname:
        ck = callbacks.ModelCheckpoint(checkpoint_fname,
            save_weights_only=True, verbose=1, save_best_only=True)
        callback_list.append(ck)
    
    model.fit(train_ds, epochs=epochs, validation_data=valid_ds, callbacks=callback_list)
    
    return model

## LR Model

In [6]:
from models.lr import make_lr_model
lr_model = train_model(make_lr_model(col_names[1:], [MAX_CUT] * (len(col_names) - 1)))

 123040/Unknown - 1806s 15ms/step - loss: 0.4964 - acc: 0.7656

 133021/Unknown - 1953s 15ms/step - loss: 0.4967 - acc: 0.7653



## FNN Model

In [5]:
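from models.fnn import make_fnn_model
# cols and val_nums are not defined in the cells above; these definitions mirror
# the arguments passed to make_lr_model and are an assumption.
cols = col_names[1:]
val_nums = [MAX_CUT] * len(cols)
fnn_model = train_model(make_fnn_model(cols, val_nums))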

Train on 57600 samples, validate on 6400 samples
Epoch 1/2
Epoch 2/2


## NFM Model

In [17]:
from models.nfm import make_nfm_model
nfm_model = make_nfm_model(cols, val_nums, interact="multiply", merge="add")
nfm_model = train_model(nfm_model)

Train on 57600 samples, validate on 6400 samples
Epoch 1/2
Epoch 2/2


In [19]:
nfm_model = make_nfm_model(cols, val_nums, interact="dot", merge="concat")
nfm_model = train_model(nfm_model)

Train on 57600 samples, validate on 6400 samples
Epoch 1/2
Epoch 2/2


In [20]:
nfm_model = make_nfm_model(cols, val_nums, interact="multiply", merge="concat")
nfm_model = train_model(nfm_model)

Train on 57600 samples, validate on 6400 samples
Epoch 1/2
Epoch 2/2
