# Data transformations
Assumption: the full dataset will not fit into memory, so we need to iterate over it in chunks.
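
To make the chunking assumption concrete, here is a minimal sketch that reads a tab-separated stream a fixed number of rows at a time using only the standard library. The helper name `iter_chunks` and the three sample rows are illustrative, not part of the original notebook; in practice the stream would be an open handle on the real file.

```python
import csv
import io
from itertools import islice


def iter_chunks(lines, chunk_size, delimiter="\t"):
    """Yield lists of at most `chunk_size` rows (as dicts) from a delimited text stream."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    while True:
        # islice pulls only the next `chunk_size` rows, so memory use stays bounded.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Stand-in for an open file handle; three rows with illustrative values.
sample = io.StringIO("Click\tDepth\tPosition\n0\t3\t3\n0\t3\t1\n1\t2\t2\n")
chunk_lengths = [len(chunk) for chunk in iter_chunks(sample, chunk_size=2)]
```

Only one chunk is held in memory at a time, which is the same property the `tf.data` pipeline provides through batching.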

We have three types of variables:
* **simple** (numerical)
* **IDs** (discrete, with a huge number of categories)
* **text tokens** (sequences of discrete tokens). We may or may not apply TF-IDF or something similar; if we do, we need to precompute some statistics over the whole dataset first.
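
As a sketch of the statistics a TF-IDF-style weighting would need, the snippet below counts document frequencies in a single streaming pass and derives a plain `log(N / df)` weight. The function names are hypothetical, and the unsmoothed IDF formula is one common choice rather than something the notebook prescribes.

```python
import math
from collections import Counter


def document_frequencies(token_strings):
    """Count, for each token, the number of strings ("documents") it occurs in."""
    doc_freq = Counter()
    n_docs = 0
    for token_string in token_strings:
        n_docs += 1
        # A set, so a token repeated inside one string is counted once.
        doc_freq.update(set(token_string.split("|")))
    return doc_freq, n_docs


def inverse_document_frequency(doc_freq, n_docs):
    """Plain log(N / df) weighting; smoothed variants are equally valid choices."""
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}


# Three strings in the same '|'-separated format as the token columns.
keywords = ["4133|95|17", "4133", "121|4133|95"]
doc_freq, n_docs = document_frequencies(keywords)
idf = inverse_document_frequency(doc_freq, n_docs)
```

Because the function consumes any iterable, it can be fed one chunk at a time, so the counts never require the whole file in memory.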

## Overview

The basic variables need no transformation, since all of them are numerical: gender, age, depth, and position.

Let's create a dataset from these four alone. This will serve as a **baseline**.

We may want to separate the baseline data transformation from the baseline model. The data transformation will probably matter most, since it influences how we build the model.

In [129]:
import tensorflow as tf
import tensorflow.keras as keras
import pandas as pd
from collections import OrderedDict

In [130]:
tf.__version__

'2.2.0'

In [131]:
data = pd.read_csv("data/D100k.tsv", sep="\t")
data.head()

Unnamed: 0,Click,DisplayURL,AdId,AdvertiserId,Depth,Position,UserID,Gender,Age,AdKeyword_tokens,AdTitle_tokens,AdDescription_tokens,Query_tokens
0,0,4298118681424644510,7686695,385,3,3,490234,1,3,4133|95|17,4133|95|17|0|4732|95|146|4079,8|81|123|205|2|95|26|95|60|32|1|17|146|1|991|3...,4133
1,0,13677630321509009335,3517124,23778,3,1,490234,1,3,4133,145|65|3927|832|93,3683|4990|2793|11589|21|10741|26|16044|26|3168...,4133
2,0,11689327222955583742,21021375,27701,3,2,490234,1,3,4133,4133|95|1|339|125|21|83093,726|50|2218|1533|2275|4133|1299|509|95|2072|1|...,4133
3,0,4298118681424644510,7686695,385,1,1,16960371,2,2,4133|95|17,4133|95|17|0|4732|95|146|4079,8|81|123|205|2|95|26|95|60|32|1|17|146|1|991|3...,4133|4942
4,0,15132506310926074459,4424000,20940,1,1,3524325,1,3,121|4133|95,121|4133|95|8762|3957|4563|2233|192|28|138|3,62|1162|570|8|4133|95|1|81|102|1155|650|1255|1...,4133|4942


In [151]:
ds = tf.data.experimental.make_csv_dataset("data/D100k.tsv",
                                           field_delim="\t",
                                           batch_size=5,  # increase for real work
                                           label_name="Click")

#### Inspection of the batched dataset

In [152]:
ds

<PrefetchDataset shapes: (OrderedDict([(DisplayURL, (5,)), (AdId, (5,)), (AdvertiserId, (5,)), (Depth, (5,)), (Position, (5,)), (UserID, (5,)), (Gender, (5,)), (Age, (5,)), (AdKeyword_tokens, (5,)), (AdTitle_tokens, (5,)), (AdDescription_tokens, (5,)), (Query_tokens, (5,))]), (5,)), types: (OrderedDict([(DisplayURL, tf.float32), (AdId, tf.int32), (AdvertiserId, tf.int32), (Depth, tf.int32), (Position, tf.int32), (UserID, tf.int32), (Gender, tf.int32), (Age, tf.int32), (AdKeyword_tokens, tf.string), (AdTitle_tokens, tf.string), (AdDescription_tokens, tf.string), (Query_tokens, tf.string)]), tf.int32)>
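
The inferred types above show `DisplayURL` parsed as `tf.float32`, which cannot represent a 20-digit ID exactly, and some values in the data also exceed the signed 64-bit range. One possible workaround, sketched here on a tiny temporary file rather than the real one, is to pass explicit `select_columns` and `column_defaults` so the ID is read as a string. The reduced column list and the temporary file are assumptions made for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Tiny stand-in for data/D100k.tsv, restricted to a few of its columns.
rows = [
    "Click\tDisplayURL\tDepth\tPosition\tGender\tAge",
    "0\t4298118681424644510\t3\t3\t1\t3",
    "0\t13677630321509009335\t3\t1\t1\t3",
]
path = os.path.join(tempfile.mkdtemp(), "sample.tsv")
with open(path, "w") as f:
    f.write("\n".join(rows) + "\n")

columns = ["Click", "DisplayURL", "Depth", "Position", "Gender", "Age"]
sample_ds = tf.data.experimental.make_csv_dataset(
    path,
    batch_size=2,
    field_delim="\t",
    label_name="Click",
    select_columns=columns,
    # One dtype per selected column, in file order; DisplayURL stays a string.
    column_defaults=[tf.int32, tf.string, tf.int32, tf.int32, tf.int32, tf.int32],
    num_epochs=1,
    shuffle=False,
)
features, labels = next(iter(sample_ds))
```

Keeping the ID as a string preserves it exactly; it can later be hashed or looked up in a vocabulary before being fed to an embedding.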

In [153]:
x, y = next(iter(ds))

In [154]:
x, y

(OrderedDict([('DisplayURL',
               <tf.Tensor: shape=(5,), dtype=float32, numpy=
               array([1.0726002e+19, 6.4143073e+18, 1.5785113e+19, 8.9945571e+18,
                      7.9039147e+18], dtype=float32)>),
              ('AdId',
               <tf.Tensor: shape=(5,), dtype=int32, numpy=array([20036558, 21248429, 20908196, 20030150, 21162436], dtype=int32)>),
              ('AdvertiserId',
               <tf.Tensor: shape=(5,), dtype=int32, numpy=array([23800, 35668, 35088, 23799,  1325], dtype=int32)>),
              ('Depth',
               <tf.Tensor: shape=(5,), dtype=int32, numpy=array([2, 3, 2, 3, 3], dtype=int32)>),
              ('Position',
               <tf.Tensor: shape=(5,), dtype=int32, numpy=array([2, 2, 2, 2, 1], dtype=int32)>),
              ('UserID',
               <tf.Tensor: shape=(5,), dtype=int32, numpy=array([ 5085857,  2641584, 11280635,   872931,  5086026], dtype=int32)>),
              ('Gender',
               <tf.Tensor: shape=(5,), dty

In [155]:
for feature_batch, label_batch in ds.take(1):
    print("'Clicked': {}".format(label_batch))
    print("features:")
    for key, value in feature_batch.items():
        print("  {!r:20s}: {}".format(key, value))

'Clicked': [0 0 1 0 0]
features:
  'DisplayURL'        : [7.3915312e+18 7.9039147e+18 1.4340390e+19 1.0536834e+18 5.8512529e+18]
  'AdId'              : [20691800 21162514  9027213 20363531 20133613]
  'AdvertiserId'      : [34245  1325 23808  1340 28698]
  'Depth'             : [2 2 3 3 2]
  'Position'          : [2 1 1 1 2]
  'UserID'            : [5091069 1680036 6237055 7969572 3037378]
  'Gender'            : [1 2 1 1 0]
  'Age'               : [2 5 3 3 2]
  'AdKeyword_tokens'  : [b'818|31' b'1277' b'1545' b'2684' b'366|270']
  'AdTitle_tokens'    : [b'4567|455|172|10847|170|799|589|1088|955|539|3'
 b'48|935|203|36|210|1|37|271|209|158'
 b'615|1545|75|31|1|138|1270|615|131' b'169|1460|872|6|1302|0|248|14|188'
 b'2677|1625|27|0|177|324|408|99|0|169|287|329']
  'AdDescription_tokens': [b'799|203|773|39|5460|3|1088|31|720|16509|3|39|5015|288|163|3|6188|14|4023|3'
 b'271|209|158|742|381|3500|1446|1781|1|32|597|734|1|742|381|29|665|631|3'
 b'1545|31|40|615|1|272|18889|1|220|511|20|5270

## Preprocessing data
`Dataset.map` is all we need: it applies a transformation to every batch as the data streams through.

In [77]:
def string_to_token_list(sentence_string):
    """Transforms string of tokens separated by '|' to list of (int) tokens"""
    sentence_as_list = sentence_string.split("|")
    token_list = [int(token) for token in sentence_as_list]
    return token_list

In [78]:
text = data["AdTitle_tokens"][0]
text

'4133|95|17|0|4732|95|146|4079'

In [79]:
string_to_token_list(text)

[4133, 95, 17, 0, 4732, 95, 146, 4079]
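
`string_to_token_list` works on a Python string, but inside `Dataset.map` the token columns arrive as batches of `tf.string` tensors. A possible TensorFlow counterpart is sketched below; the function name `strings_to_token_ids` and the padding value `-1` are illustrative choices (`-1` because `0` already occurs as a real token in the data).

```python
import tensorflow as tf


def strings_to_token_ids(token_strings):
    """Split a batch of '|'-separated strings into a RaggedTensor of int32 token IDs."""
    pieces = tf.strings.split(token_strings, sep="|")
    # Convert only the flat string values; the row structure is kept as-is.
    return tf.ragged.map_flat_values(tf.strings.to_number, pieces, out_type=tf.int32)


batch = tf.constant(["4133|95|17", "4133"])
token_ids = strings_to_token_ids(batch)
# Pad to a rectangular tensor; -1 as the padding value is an arbitrary choice.
padded = token_ids.to_tensor(default_value=-1)
```

The ragged result keeps each row at its own length, while the padded version is what a dense layer or embedding lookup would typically expect.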

In [88]:
data.columns

Index(['Click', 'DisplayURL', 'AdId', 'AdvertiserId', 'Depth', 'Position',
       'UserID', 'Gender', 'Age', 'AdKeyword_tokens', 'AdTitle_tokens',
       'AdDescription_tokens', 'Query_tokens'],
      dtype='object')

In [104]:
def identity(x):
    return x

In [126]:
transforms = {
    "Depth": identity,
    "Position": identity,
    "Gender": identity,
    "Age": identity
}

def transform_x(x):
    """Apply each column's transform and stack the results into a (batch, n_features) tensor."""
    return tf.stack([
        transform(x[key]) for key, transform in transforms.items()
    ], axis=1)

In [156]:
train = ds.map(lambda x, y: (transform_x(x), y))

In [157]:
train

<MapDataset shapes: ((5, 4), (5,)), types: (tf.int32, tf.int32)>

## Model

In [172]:
N_SAMPLES = 100_000
batch_size = 1000
steps_per_epoch = N_SAMPLES // batch_size
# Use the same batch_size as in steps_per_epoch, so one epoch covers the whole file.
ds = tf.data.experimental.make_csv_dataset("data/D100k.tsv",
                                           field_delim="\t",
                                           batch_size=batch_size,
                                           label_name="Click")

train = ds.map(lambda x, y: (transform_x(x), y))

In [173]:
model = keras.models.Sequential()
model.add(keras.layers.Dense(units=128, activation="relu"))
model.add(keras.layers.Dense(units=64, activation="relu"))
model.add(keras.layers.Dense(units=1, activation="sigmoid"))
# A single sigmoid output with 0/1 labels calls for binary, not categorical, cross-entropy.
model.compile(loss=keras.losses.binary_crossentropy,
              optimizer=keras.optimizers.Adam(),
              metrics=["accuracy", keras.metrics.AUC()])

In [174]:
model.fit(train, epochs=10, steps_per_epoch=steps_per_epoch)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fb46541cda0>

TODO:
* How to do cross-validation in an automated fashion?
* Incorporate a test set; without one, evaluating the model is pointless. How to do it?
* Run it on a bigger dataset
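
One possible direction for the first two items, given that the data is streamed, is to assign each row to a fold by hashing a key. This is a minimal sketch under that assumption: the helpers `assign_fold` and `is_test_row` are hypothetical, and the row index is used as the key purely for illustration.

```python
import hashlib


def assign_fold(key, n_folds=5):
    """Map a row key to a fold index in [0, n_folds) deterministically."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


def is_test_row(key, n_folds=5, test_fold=0):
    """Rows in `test_fold` are held out; all other rows are used for training."""
    return assign_fold(key, n_folds) == test_fold


row_ids = range(1000)
test_ids = [i for i in row_ids if is_test_row(i)]
train_ids = [i for i in row_ids if not is_test_row(i)]
```

Because the assignment depends only on the key, it needs no state and gives the same split on every pass over the file. Rotating `test_fold` over `0..n_folds-1` would give a cross-validation loop, and hashing on `UserID` instead of the row index would keep all rows of one user in the same fold.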