# Data transformations
Assumption: the full dataset won't fit into memory, so we need to iterate over it in chunks.
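
As a rough illustration of what "iterating in chunks" means, here is a minimal standard-library sketch. The helper `iter_chunks` and the tiny in-memory sample are hypothetical stand-ins (the sample values are made up); the point is only that running statistics can be accumulated while holding one chunk at a time.

```python
import csv
import io
from itertools import islice


def iter_chunks(lines, chunk_size, delimiter="\t"):
    """Yield lists of at most `chunk_size` rows (as dicts) from an iterable of TSV lines."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Tiny in-memory stand-in for a large TSV file (values are made up for illustration).
sample = io.StringIO(
    "Click\tDepth\tAge\n"
    "0\t3\t3\n"
    "0\t1\t2\n"
    "1\t2\t6\n"
    "0\t3\t3\n"
    "0\t2\t4\n"
)

# Accumulate a running sum and count, so only one chunk is in memory at a time.
n_rows = 0
age_sum = 0
chunk_sizes = []
for chunk in iter_chunks(sample, chunk_size=2):
    chunk_sizes.append(len(chunk))
    n_rows += len(chunk)
    age_sum += sum(int(row["Age"]) for row in chunk)

mean_age = age_sum / n_rows
print(chunk_sizes, n_rows, mean_age)
```

The same accumulate-per-chunk pattern applies to any statistic that can be updated incrementally (counts, sums, minima, maxima).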

We have three types of variables:
* **simple** (numerical)
* **IDs** (discrete, with a huge number of categories)
* **text tokens** (sequences of discrete tokens). We may or may not apply TF-IDF or a similar weighting; if we do, some statistics have to be precomputed first.
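
The statistics needed for TF-IDF can be collected chunk by chunk. The sketch below is one possible way to do it, assuming each document is a `|`-separated token string as in the token columns; `update_document_frequency` is a hypothetical helper, the two chunks are made-up examples, and the smoothed IDF formula is just one common variant.

```python
import math
from collections import Counter


def update_document_frequency(df_counter, token_strings):
    """Add one chunk of '|'-separated token strings to the document-frequency counter."""
    for token_string in token_strings:
        # set(): each token counts at most once per document
        df_counter.update(set(token_string.split("|")))
    return len(token_strings)


# Two hypothetical chunks; the strings follow the AdKeyword_tokens format.
chunks = [
    ["4133|95|17", "4133"],
    ["4133", "121|4133|95"],
]

df = Counter()
n_docs = 0
for chunk in chunks:
    n_docs += update_document_frequency(df, chunk)

# Smoothed inverse document frequency; one common variant among several.
idf = {token: math.log((1 + n_docs) / (1 + count)) + 1 for token, count in df.items()}
print(df.most_common(2), n_docs)
```

Only the counter has to stay in memory between chunks, so its size depends on the vocabulary, not on the number of rows.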

## Overview

The basic variables need no transformation, since all four are numerical: gender, age, depth, position.

Let's create a dataset from these four alone. This will serve as a **baseline**.

We may want to separate the baseline data transformation from the baseline model. The data transformation is probably the more important of the two, since it determines how we build the model.

In [1]:
import tensorflow as tf
import tensorflow.keras as keras
import pandas as pd
from collections import OrderedDict

In [2]:
tf.__version__

'2.2.0'

In [3]:
data = pd.read_csv("data/D100k.tsv", sep="\t")
data.head()

Unnamed: 0,Click,DisplayURL,AdId,AdvertiserId,Depth,Position,UserID,Gender,Age,AdKeyword_tokens,AdTitle_tokens,AdDescription_tokens,Query_tokens
0,0,4298118681424644510,7686695,385,3,3,490234,1,3,4133|95|17,4133|95|17|0|4732|95|146|4079,8|81|123|205|2|95|26|95|60|32|1|17|146|1|991|3...,4133
1,0,13677630321509009335,3517124,23778,3,1,490234,1,3,4133,145|65|3927|832|93,3683|4990|2793|11589|21|10741|26|16044|26|3168...,4133
2,0,11689327222955583742,21021375,27701,3,2,490234,1,3,4133,4133|95|1|339|125|21|83093,726|50|2218|1533|2275|4133|1299|509|95|2072|1|...,4133
3,0,4298118681424644510,7686695,385,1,1,16960371,2,2,4133|95|17,4133|95|17|0|4732|95|146|4079,8|81|123|205|2|95|26|95|60|32|1|17|146|1|991|3...,4133|4942
4,0,15132506310926074459,4424000,20940,1,1,3524325,1,3,121|4133|95,121|4133|95|8762|3957|4563|2233|192|28|138|3,62|1162|570|8|4133|95|1|81|102|1155|650|1255|1...,4133|4942


In [4]:
ds = tf.data.experimental.make_csv_dataset("data/D100k.tsv",
                                           field_delim="\t",
                                           batch_size=5,  # increase for real work
                                           label_name="Click")

#### Inspection of batch dataset

In [5]:
ds

<PrefetchDataset shapes: (OrderedDict([(DisplayURL, (5,)), (AdId, (5,)), (AdvertiserId, (5,)), (Depth, (5,)), (Position, (5,)), (UserID, (5,)), (Gender, (5,)), (Age, (5,)), (AdKeyword_tokens, (5,)), (AdTitle_tokens, (5,)), (AdDescription_tokens, (5,)), (Query_tokens, (5,))]), (5,)), types: (OrderedDict([(DisplayURL, tf.float32), (AdId, tf.int32), (AdvertiserId, tf.int32), (Depth, tf.int32), (Position, tf.int32), (UserID, tf.int32), (Gender, tf.int32), (Age, tf.int32), (AdKeyword_tokens, tf.string), (AdTitle_tokens, tf.string), (AdDescription_tokens, tf.string), (Query_tokens, tf.string)]), tf.int32)>
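
Note that `DisplayURL` was inferred as `tf.float32` above, although its values are 19 to 20 digit integer IDs. A 32-bit float keeps only about 24 bits of mantissa, so distinct IDs can collapse to the same value. The sketch below demonstrates the loss with the standard library and shows one possible alternative, hashing the ID (kept as text) into a fixed number of buckets. `hash_bucket` and `NUM_BUCKETS` are hypothetical; in TensorFlow the inferred type could presumably be overridden through the `column_defaults` argument of `make_csv_dataset`.

```python
import struct
import zlib


def to_float32_and_back(value):
    """Round-trip an integer through a 32-bit float, as a float32 column would store it."""
    return int(struct.unpack("f", struct.pack("f", float(value)))[0])


def hash_bucket(raw_id, num_buckets):
    """Map an ID (kept as text) to a stable bucket index in [0, num_buckets)."""
    return zlib.crc32(str(raw_id).encode("utf-8")) % num_buckets


display_url = 4298118681424644510  # first DisplayURL value from data.head()
round_tripped = to_float32_and_back(display_url)
print(display_url == round_tripped)  # False: the low-order digits are lost

NUM_BUCKETS = 100_000  # hypothetical size, to be tuned
bucket = hash_bucket(display_url, NUM_BUCKETS)
print(bucket)
```

Hashing trades exactness for a bounded vocabulary: collisions are possible, but the bucket index is stable across chunks and needs no precomputed dictionary.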

In [6]:
x, y = next(iter(ds))

In [7]:
x, y

(OrderedDict([('DisplayURL',
               <tf.Tensor: shape=(5,), dtype=float32, numpy=
               array([2.4127719e+18, 1.1121834e+19, 1.5785113e+19, 1.0245114e+19,
                      4.6967827e+18], dtype=float32)>),
              ('AdId',
               <tf.Tensor: shape=(5,), dtype=int32, numpy=array([20067154, 21156726, 20908196, 21134385,  3176858], dtype=int32)>),
              ('AdvertiserId',
               <tf.Tensor: shape=(5,), dtype=int32, numpy=array([23781, 23807, 35088, 33296, 23790], dtype=int32)>),
              ('Depth',
               <tf.Tensor: shape=(5,), dtype=int32, numpy=array([2, 2, 3, 3, 2], dtype=int32)>),
              ('Position',
               <tf.Tensor: shape=(5,), dtype=int32, numpy=array([2, 1, 2, 1, 2], dtype=int32)>),
              ('UserID',
               <tf.Tensor: shape=(5,), dtype=int32, numpy=array([11191571,  7940281,  7997418,   466930,  1173013], dtype=int32)>),
              ('Gender',
               <tf.Tensor: shape=(5,), dty

In [8]:
for feature_batch, label_batch in ds.take(1):
    print("'Clicked': {}".format(label_batch))
    print("features:")
    for key, value in feature_batch.items():
        print("  {!r:20s}: {}".format(key, value))

'Clicked': [0 0 0 0 0]
features:
  'DisplayURL'        : [1.1309026e+19 1.2057879e+19 1.5785113e+19 7.9039147e+18 8.0783747e+18]
  'AdId'              : [22002402 20157587 20908196 21162527 10766997]
  'AdvertiserId'      : [38263 27961 35088  1325 30157]
  'Depth'             : [2 3 3 1 3]
  'Position'          : [2 3 2 1 3]
  'UserID'            : [4139724 5087306 7981563  766924 7971910]
  'Gender'            : [2 1 2 1 2]
  'Age'               : [3 3 6 2 6]
  'AdKeyword_tokens'  : [b'50|230|518' b'12731' b'12731' b'485|271|209|3942|48' b'50|230|518']
  'AdTitle_tokens'    : [b'45|31|571|1916|38' b'12731|1354|1|334|34|51'
 b'12731|190|513|12731|677|183' b'48|935|203|36|210|1|37|271|209|158'
 b'172|1307|170|50|18501|35|1073|3|373|102|26|1724|3571|3']
  'AdDescription_tokens': [b'643|34|571|1916|1567|6879|31134|2462|665|9130|39810|0|102289|13491|0|103237|799|2663|3'
 b'51|277|198|2|12731|421|128|1224|1|1354|2121|1|139|525|1|930|28|1435|3'
 b'12731|390|1354|1|4383|234|26|205|734|26|17|

## Preprocessing data
`Dataset.map` is all we need: it applies a transformation lazily to every batch as the data is streamed.

In [9]:
def string_to_token_list(sentence_string):
    """Transforms string of tokens separated by '|' to list of (int) tokens"""
    sentence_as_list = sentence_string.split("|")
    token_list = [int(token) for token in sentence_as_list]
    return token_list

In [10]:
text = data["AdTitle_tokens"][0]
text

'4133|95|17|0|4732|95|146|4079'

In [11]:
string_to_token_list(text)

[4133, 95, 17, 0, 4732, 95, 146, 4079]
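
Token lists have different lengths, so before they can be batched they usually have to be truncated or padded to a fixed length. The following is a minimal sketch of that step; `string_to_padded_tokens`, `max_len` and `pad_value` are hypothetical names. The padding value is `-1` because token `0` already occurs in the data; for an embedding layer one would more likely shift all tokens by one and pad with `0`.

```python
def string_to_padded_tokens(sentence_string, max_len, pad_value=-1):
    """Parse '|'-separated tokens, then truncate or right-pad to exactly `max_len` ints."""
    # An empty string would make int('') fail, so treat it as "no tokens".
    tokens = [int(t) for t in sentence_string.split("|")] if sentence_string else []
    tokens = tokens[:max_len]
    return tokens + [pad_value] * (max_len - len(tokens))


print(string_to_padded_tokens("4133|95|17", max_len=5))
# [4133, 95, 17, -1, -1]
print(string_to_padded_tokens("4133|95|17|0|4732|95|146|4079", max_len=5))
# [4133, 95, 17, 0, 4732]
print(string_to_padded_tokens("", max_len=3))
# [-1, -1, -1]
```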

In [12]:
data.columns

Index(['Click', 'DisplayURL', 'AdId', 'AdvertiserId', 'Depth', 'Position',
       'UserID', 'Gender', 'Age', 'AdKeyword_tokens', 'AdTitle_tokens',
       'AdDescription_tokens', 'Query_tokens'],
      dtype='object')

In [13]:
def identity(x):
    return x

In [14]:
transforms = {
    "Depth": identity,
    "Position": identity,
    "Gender": identity,
    "Age": identity
}

def transform_x(x):
    return tf.stack([
        transform(x[key]) for key, transform in transforms.items()
    ], 1)

In [15]:
train = ds.map(lambda x, y: (transform_x(x), y))

In [16]:
train

<MapDataset shapes: ((5, 4), (5,)), types: (tf.int32, tf.int32)>
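
The dictionary of per-column transforms is easy to extend beyond `identity`. As an illustration in plain Python (no TensorFlow), the sketch below swaps in a min-max scaling for some columns and stacks the transformed columns into one row per sample, mirroring what `tf.stack(..., 1)` does. The names `min_max`, `scaling_transforms` and `transform_batch` are hypothetical, and so are the value ranges, which would have to be computed from the data.

```python
def identity(x):
    return x


def min_max(low, high):
    """Return a function that rescales values from [low, high] to [0, 1]."""
    def scale(x):
        return (x - low) / (high - low)
    return scale


# Hypothetical ranges; the real minima and maxima must come from the data.
scaling_transforms = {
    "Depth": min_max(1, 3),
    "Position": min_max(1, 3),
    "Gender": identity,
    "Age": min_max(1, 6),
}


def transform_batch(batch):
    """Apply each column's transform, then stack the columns into one row per sample."""
    columns = [
        [transform(value) for value in batch[key]]
        for key, transform in scaling_transforms.items()
    ]
    return [list(row) for row in zip(*columns)]


batch = {"Depth": [2, 3], "Position": [2, 1], "Gender": [2, 1], "Age": [3, 6]}
rows = transform_batch(batch)
print(rows)
```

The output has one list per sample with four entries each, matching the `(batch_size, 4)` shape of the mapped dataset.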

## Model

In [27]:
N_SAMPLES = 100_000
batch_size = 10000
steps_per_epoch = N_SAMPLES // batch_size
ds = tf.data.experimental.make_csv_dataset("data/D100k.tsv",
                                           field_delim="\t",
                                           batch_size=batch_size,  # must match the value used for steps_per_epoch
                                           label_name="Click")

train = ds.map(lambda x, y: (transform_x(x), y))

In [28]:
model = keras.models.Sequential()
# model.add(keras.layers.Dense(units=128, activation="relu"))
model.add(keras.layers.Dense(units=1, activation="sigmoid"))
# A single sigmoid output predicts one probability, so the matching loss is binary cross-entropy.
model.compile(loss=keras.losses.binary_crossentropy,
              optimizer=keras.optimizers.Adam(),
              metrics=["accuracy", keras.metrics.AUC()])
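
A single sigmoid unit outputs one click probability per example, and the loss that pairs with such an output is binary cross-entropy. The plain-Python sketch below writes out the two formulas to show why the choice matters; it is an illustration of the math, not the Keras implementation. With only one "class", the normalised categorical formula always evaluates to zero, so it gives the optimizer nothing to learn from.

```python
import math


def binary_crossentropy(y_true, p, eps=1e-7):
    """Binary cross-entropy for one example; `p` is the predicted click probability."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def categorical_crossentropy(y_true, probs):
    """Categorical cross-entropy over class probabilities normalised to sum to 1."""
    total = sum(probs)
    return -sum(y * math.log(q / total) for y, q in zip(y_true, probs))


print(binary_crossentropy(1, 0.9))  # small loss: confident and correct
print(binary_crossentropy(1, 0.1))  # large loss: confident and wrong
print(categorical_crossentropy([1], [0.1]))  # a single class normalises to 1, so the loss is zero
```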

In [29]:
model.fit(train, epochs=100, steps_per_epoch=steps_per_epoch)

Epoch 1/100
Epoch 2/100
...
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7ff809186e10>

TODO:
* How can cross-validation be automated?
* Incorporate a test set; without one, the results are meaningless. How to do it?
* Run it on a bigger dataset.
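
One possible direction for the first two items, sketched under assumptions: when the data is streamed in chunks, rows can be assigned to folds deterministically by hashing a key, so no global shuffle is needed and the split is reproducible. `assign_fold`, `split_rows` and `N_FOLDS` are hypothetical names, and splitting by `UserID` (so that one user never appears on both sides) is a design choice, not a requirement.

```python
import zlib


def assign_fold(key, n_folds):
    """Deterministically map a key (for example a UserID) to a fold in [0, n_folds)."""
    return zlib.crc32(str(key).encode("utf-8")) % n_folds


def split_rows(rows, key_name, n_folds, test_fold):
    """Split rows into (train, test); all rows sharing a key land on the same side."""
    train, test = [], []
    for row in rows:
        target = test if assign_fold(row[key_name], n_folds) == test_fold else train
        target.append(row)
    return train, test


# Hypothetical rows; the UserID values are taken from data.head().
rows = [
    {"UserID": 490234, "Click": 0},
    {"UserID": 490234, "Click": 0},
    {"UserID": 16960371, "Click": 0},
    {"UserID": 3524325, "Click": 0},
]

N_FOLDS = 5  # assumption: 5-fold cross-validation
test_sizes = []
for fold in range(N_FOLDS):
    train, test = split_rows(rows, "UserID", N_FOLDS, test_fold=fold)
    test_sizes.append(len(test))

print(test_sizes)
```

Holding out a single fixed fold gives a test set; looping over all folds, as above, gives cross-validation with the same code.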