In this notebook we use Keras neural networks to solve a classification problem. The dataset is adapted: we try to predict the 'label', given a sample weight 'weight', using a set of features. The problem consists of a set of very sparse time series, and the features give context to these time series. The goal is to predict the label in the future, i.e. the validation set only has values of 't' that are unseen during training (the training data precedes the validation data in time). We couldn't find any literature on how to deal with this problem (i.e. thousands of time series with a lot of missing data, where contextual features are used to still be able to predict).

The dataset contains the following features:

- label: the label as a binary classification, either 0 or 1
- weight: the sample weight
- feature_0-8: categorical features
- t(_squared/_cubed): time component; the first day of the training data is t=0, and so on. Values do not overlap between the training and validation sets
- best_guess(_2): best guess for the probability of the positive class
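
The polynomial time terms look like plain powers of `t`: the validation rows printed at the end of this notebook show t=21 together with t_squared=441 and t_cubed=9261. A minimal sketch of that construction, assuming the day offsets are already available (the `days` list below is made up for illustration):

```python
# Sketch: deriving the polynomial time features from a day index.
# `days` is a hypothetical list of day offsets, not read from the dataset files.
days = [0, 1, 21, 27]

rows = [{"t": t, "t_squared": t ** 2, "t_cubed": t ** 3} for t in days]
for row in rows:
    print(row)
```

The three columns differ by orders of magnitude as `t` grows, so rescaling them before feeding them to a network may be worth trying; that is a suggestion, not something this notebook does.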

We try to interact time and the best_guess feature with the categorical features in the hope of gaining an improvement. The 'best_guess' is also the model that should be beaten.

In [72]:
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import clear_output

In [73]:
train_df = pd.read_csv("training_data_sample.csv").dropna()
validate_df = pd.read_csv("validation_data_sample.csv").dropna()
test_df = pd.read_csv("test_data_sample.csv").dropna()

train_df = train_df[train_df.weight>0]#.sample(frac=0.1)
validate_df = validate_df[validate_df.weight>0]#.sample(frac=0.1)
test_df = test_df[test_df.weight>0]#.sample(frac=0.1)

In [74]:
TARGET = ['label']
WEIGHT = ['weight']

FEATURES_CAT = [
    #'feature_0'
    #,'feature_1'
    #,'feature_2'
    'feature_3'
    ,'feature_4'
    ,'feature_5'
    ,'feature_6'
    ,'feature_7'
    ,'feature_8'
]

FEATURES_CONT = [
   't','t_squared','t_cubed','best_guess_2'
]

FEATURES_CROSSED = [
    #('t','feature_0')
    #, ('t','feature_1')
    #, ('t','feature_2')
     ('t','feature_3')
    , ('t','feature_4')
    , ('t','feature_5')
    , ('t','feature_6')
    , ('t','feature_7')
    
    #, ('feature_8','feature_0')
    #, ('feature_8','feature_1')
    #, ('feature_8','feature_2')
    , ('feature_8','feature_3')
    , ('feature_8','feature_4')
    , ('feature_8','feature_5')
    , ('feature_8','feature_6')
    , ('feature_8','feature_7')
    
    #, ('best_guess_2','feature_0')
    #, ('best_guess_2','feature_1')
    #, ('best_guess_2','feature_2')
    , ('best_guess_2','feature_3')
    , ('best_guess_2','feature_4')
    , ('best_guess_2','feature_5')
    , ('best_guess_2','feature_6')
    , ('best_guess_2','feature_7')
]

# Keras

In [75]:
def compute_metrics(pred, weight, label):
    """Weighted log loss, normalised by total weight and by sample count."""
    # Tensorflow reports logloss with a different denominator (number of samples
    # instead of the sum of the weights) - adding both for consistency
    weighted_loss = -np.sum(weight * (label * np.log(pred) + (1 - label) * np.log(1 - pred)))
    return pd.Series(index=['logloss', 'logloss_tf'],
                     data=[weighted_loss / np.sum(weight), weighted_loss / len(weight)])
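
To make the two denominators concrete, here is a small sketch on made-up numbers. The helper name `weighted_logloss` and the `eps` clipping are our own additions (the clipping guards against `log(0)` when a prediction is exactly 0 or 1); they are not part of `compute_metrics`.

```python
import numpy as np

def weighted_logloss(pred, weight, label, eps=1e-7):
    """Weighted binary log loss, returned under both normalisations."""
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    weight = np.asarray(weight, dtype=float)
    label = np.asarray(label, dtype=float)
    total = -np.sum(weight * (label * np.log(pred) + (1 - label) * np.log(1 - pred)))
    return total / weight.sum(), total / len(weight)

# Made-up toy data, only to show how the two denominators relate.
pred = [0.1, 0.8, 0.0]
weight = [12, 1, 7]
label = [0, 1, 0]

per_weight, per_sample = weighted_logloss(pred, weight, label)
print(per_weight, per_sample)
print(per_sample / per_weight, np.mean(weight))  # the ratio equals the mean weight
```

The two values always differ by a factor equal to the mean sample weight, so they rank models identically on the same data.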



In [76]:
feature_columns = []
for feature_name in FEATURES_CAT:
    
    train_df[feature_name] = train_df[feature_name].astype(str)
    validate_df[feature_name] = validate_df[feature_name].astype(str)
    test_df[feature_name] = test_df[feature_name].astype(str)

    vocabulary = pd.concat([train_df[feature_name],validate_df[feature_name],test_df[feature_name]]).unique()
    print("Feature: ", feature_name, ", voc size: ", len(vocabulary))
    col = tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary)
    if feature_name in ['feature_0','feature_1','feature_2']:
        feature_columns.append(tf.feature_column.embedding_column(col,dimension=64))
    else:
        feature_columns.append(tf.feature_column.indicator_column(col))

for feature_name in FEATURES_CONT:
    feature_columns.append(tf.feature_column.numeric_column(feature_name))
    
for f1,f2 in FEATURES_CROSSED:
    print("Feature: ", f1, " + ", f2)
    voc2 = pd.concat([train_df[f2],validate_df[f2],test_df[f2]]).unique()
    if f1 in FEATURES_CONT:
        num_col = tf.feature_column.numeric_column(f1)
        if f1=='t':
            col = tf.feature_column.bucketized_column(num_col,boundaries=list(np.arange(0.5,101.5,1.0)))
        if f1=='best_guess_2':
            col = tf.feature_column.bucketized_column(num_col,boundaries=list(np.arange(0,1.01,0.001)))
        feature_columns.append(
            tf.feature_column.indicator_column(
            tf.feature_column.crossed_column(
                [col
                , tf.feature_column.categorical_column_with_vocabulary_list(f2, voc2)]
                , hash_bucket_size=1000)))
    elif f1 in FEATURES_CAT:
        voc1 = pd.concat([train_df[f1],validate_df[f1],test_df[f1]]).unique()
        feature_columns.append(
            tf.feature_column.indicator_column(
            tf.feature_column.crossed_column(
                [tf.feature_column.categorical_column_with_vocabulary_list(f1, voc1)
                , tf.feature_column.categorical_column_with_vocabulary_list(f2, voc2)]
                , hash_bucket_size=1000)))
        
train_x = train_df[FEATURES_CAT+FEATURES_CONT]
train_weights = train_df[WEIGHT].values.flatten()
train_y = train_df[TARGET].values.flatten()

validate_x = validate_df[FEATURES_CAT+FEATURES_CONT]
validate_weights = validate_df[WEIGHT].values.flatten()
validate_y = validate_df[TARGET].values.flatten()

test_x = test_df[FEATURES_CAT+FEATURES_CONT]
test_weights = test_df[WEIGHT].values.flatten()
test_y = test_df[TARGET].values.flatten()

Feature:  feature_3 , voc size:  759
Feature:  feature_4 , voc size:  65
Feature:  feature_5 , voc size:  184
Feature:  feature_6 , voc size:  12
Feature:  feature_7 , voc size:  40
Feature:  feature_8 , voc size:  7
Feature:  t  +  feature_3
Feature:  t  +  feature_4
Feature:  t  +  feature_5
Feature:  t  +  feature_6
Feature:  t  +  feature_7
Feature:  feature_8  +  feature_3
Feature:  feature_8  +  feature_4
Feature:  feature_8  +  feature_5
Feature:  feature_8  +  feature_6
Feature:  feature_8  +  feature_7
Feature:  best_guess_2  +  feature_3
Feature:  best_guess_2  +  feature_4
Feature:  best_guess_2  +  feature_5
Feature:  best_guess_2  +  feature_6
Feature:  best_guess_2  +  feature_7
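
It can help to compare the number of possible crossed values with `hash_bucket_size`. The sketch below redoes the arithmetic for the `('t', 'feature_3')` cross, using the boundaries from the cell above and the printed vocabulary size. `np.digitize` is used as a stand-in for the bucketing step, and the count assumes every pair could occur, which is an upper bound for sparse data.

```python
import numpy as np

# Bucket boundaries copied from the feature-column cell above.
t_boundaries = np.arange(0.5, 101.5, 1.0)
n_t_buckets = len(t_boundaries) + 1      # k boundaries define k + 1 buckets

feature_3_vocab = 759                    # vocabulary size printed above
hash_bucket_size = 1000

# Which bucket a few example values of t fall into.
buckets = np.digitize([0, 21, 27], t_boundaries)
print("bucket ids:", buckets)

n_combinations = n_t_buckets * feature_3_vocab
print("t buckets:", n_t_buckets)
print("possible (t, feature_3) pairs:", n_combinations)
print("pairs per hash bucket on average:", n_combinations / hash_bucket_size)
```

With far more possible pairs than hash buckets, distinct pairs have to share a bucket, so the crossed column cannot tell them apart. A larger `hash_bucket_size` or an embedding on top of the cross might be worth testing.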


In [77]:
def get_data(features, labels, weights, training=True, batch_size=64):
    """An input function for training or evaluating"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels, weights))

    # Shuffle (with a 1000-element buffer) if you are in training mode.
    if training:
        dataset = dataset.shuffle(1000)
    
    return dataset.batch(batch_size)
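
A shuffle buffer of 1000 only mixes rows locally. The sketch below is a simplified pure-Python simulation of buffer-based shuffling, not TensorFlow's implementation, and the row count is made up. It shows that a row can never move forward by more than the buffer size, so if the CSV happens to be sorted by `t` (which we have not checked here), the batches stay roughly in time order.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Simplified simulation of shuffling through a fixed-size buffer."""
    rng = random.Random(seed)
    buffer, out = [], []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        out.append(buffer[idx])          # emit a random buffered element
        buffer[idx] = item               # and replace it with the new one
    rng.shuffle(buffer)                  # drain what is left at the end
    out.extend(buffer)
    return out

n_rows, buffer_size = 20_000, 1000       # made-up row count for illustration
order = buffered_shuffle(range(n_rows), buffer_size)

# A row emitted at position k can only come from the first k + buffer_size rows.
max_lookahead = max(row - pos for pos, row in enumerate(order))
print("largest forward jump:", max_lookahead)
```

A buffer as large as the dataset, or shuffling the DataFrame before building the dataset, would give a global shuffle at the cost of memory.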

In [78]:
batch_size = 128
train_ds = get_data(train_x, train_y, train_weights, training=True, batch_size=batch_size)
val_ds = get_data(validate_x, validate_y, validate_weights, training=False, batch_size=batch_size)
test_ds = get_data(test_x, test_y, test_weights, training=False, batch_size=batch_size)

In [79]:
print("Steps per epoch: ", train_x.shape[0]/batch_size)

Steps per epoch:  14806.1328125
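
The fractional value means the last batch of each epoch is only partly filled. A short sketch that recovers the row count from the printed number and rounds the step count up:

```python
import math

batch_size = 128
steps_reported = 14806.1328125            # value printed above
n_rows = round(steps_reported * batch_size)

full_batches, remainder = divmod(n_rows, batch_size)
steps_per_epoch = math.ceil(n_rows / batch_size)
print(n_rows, full_batches, remainder, steps_per_epoch)
```

So one epoch consists of 14806 full batches plus a final batch of 17 rows.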


In [80]:
def make_model(output_bias=None):
    
    if output_bias is not None:
        output_bias = tf.keras.initializers.Constant(output_bias)
    
    model = tf.keras.Sequential([
        tf.keras.layers.DenseFeatures(feature_columns),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(rate=0.2),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(rate=0.2),
        tf.keras.layers.Dense(1, activation='sigmoid', bias_initializer=output_bias)
    ])

    optimizer = tf.keras.optimizers.Adam(1e-4)
    

    model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                optimizer=optimizer,
                metrics=['accuracy'])
    
    return model
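
As a rough size check, the sketch below estimates the number of trainable parameters. It assumes each indicator column is as wide as its vocabulary (no out-of-vocabulary buckets), each crossed column is as wide as `hash_bucket_size`, and each numeric column contributes one input; `dense_params` is a helper introduced here for illustration.

```python
# Vocabulary sizes as printed by the feature-column cell.
vocab_sizes = {"feature_3": 759, "feature_4": 65, "feature_5": 184,
               "feature_6": 12, "feature_7": 40, "feature_8": 7}
n_continuous = 4            # t, t_squared, t_cubed, best_guess_2
n_crossed = 15              # active entries in FEATURES_CROSSED
hash_bucket_size = 1000

input_width = sum(vocab_sizes.values()) + n_continuous + n_crossed * hash_bucket_size

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layers = [(input_width, 64), (64, 64), (64, 1)]
total = sum(dense_params(n_in, n_out) for n_in, n_out in layers)
print("input width:", input_width)
print("trainable parameters:", total)
```

Under these assumptions almost all parameters sit in the first dense layer, and most of its inputs come from the hashed crosses.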

In [81]:
bias = np.log(np.sum(train_df[train_df.label==1]['weight'])/np.sum(train_df[train_df.label==0]['weight']))
print("Bias: ",bias)
model = make_model(bias)

Bias:  -3.4765001309572376
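
The bias is the weighted log-odds of the positive class, so passing it through a sigmoid gives the weighted share of positive labels. A quick check with the printed value, assuming the hidden layers contribute roughly nothing to the output before training:

```python
import math

bias = -3.4765001309572376               # value printed above
initial_prob = 1.0 / (1.0 + math.exp(-bias))
print(round(initial_prob, 4))

# Going the other way: the log-odds of that probability recover the bias.
recovered = math.log(initial_prob / (1.0 - initial_prob))
print(recovered)
```

The network therefore starts out predicting about 3% for every row, which matches the class balance instead of starting at 50%.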


In [82]:
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', 
    verbose=1,
    patience=0,
    mode='auto',
    restore_best_weights=True)

model.fit(train_ds, validation_data=val_ds, epochs=10, shuffle=True, use_multiprocessing=True, callbacks = [early_stopping])



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
Epoch 1/10
Epoch 2/10
Epoch 00002: early stopping


<tensorflow.python.keras.callbacks.History at 0x7f81bee00390>
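
With `patience=0`, training stops as soon as the validation loss fails to improve once, which is what happened after epoch 2. The sketch below illustrates a generic patience rule on made-up validation losses. It is modelled loosely on the "stop after `patience` epochs without improvement" idea and is not Keras's exact implementation, which may differ between versions.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which a simple patience rule stops, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0         # improvement: reset the counter
        else:
            wait += 1                    # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: one noisy epoch followed by further progress.
losses = [0.130, 0.132, 0.129, 0.128]
print(stopping_epoch(losses, patience=0))   # stops at the first bad epoch
print(stopping_epoch(losses, patience=2))   # tolerates the noisy epoch
```

If the validation loss is noisy, a small positive patience may let training continue past a single bad epoch.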

In [83]:
print("BASE:")
compute_metrics(
    validate_df.best_guess_2.astype(float)
    , validate_df.weight.astype(float)
    , validate_df.label.astype(float))

BASE:


logloss       0.122638
logloss_tf    1.517179
dtype: float64

In [84]:
print("VARIANT:")
compute_metrics(
    model.predict(val_ds).flatten()
    , validate_df.weight.astype(float)
    , validate_df.label.astype(float))

VARIANT:


logloss       0.131922
logloss_tf    1.632031
dtype: float64
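
Putting the two printed results side by side: the sketch below computes how far the network is from the baseline and checks that both normalisations tell the same story. The numbers are copied from the outputs above.

```python
base = {"logloss": 0.122638, "logloss_tf": 1.517179}      # BASE output above
variant = {"logloss": 0.131922, "logloss_tf": 1.632031}   # VARIANT output above

relative_gap = variant["logloss"] / base["logloss"] - 1.0
print(f"variant is {relative_gap:.1%} worse than the baseline")

# logloss_tf / logloss is sum(weight) / len(weight), i.e. the mean sample weight.
ratio_base = base["logloss_tf"] / base["logloss"]
ratio_variant = variant["logloss_tf"] / variant["logloss"]
print(round(ratio_base, 2), round(ratio_variant, 2))
```

So on the validation set the network is roughly 7.6% worse than 'best_guess_2', and the identical ratio of about 12.37 in both rows is the mean sample weight of the validation data.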

In [85]:
validate_df

Unnamed: 0.1,Unnamed: 0,best_guess,best_guess_2,t,t_squared,t_cubed,feature_0,feature_8,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_1,label,weight
0,0,0.033504,0.032436,21,441,9261,4896991224,0,1582895760,5084375,8558615,50300040,783437,4994604,1255224792,0,12
1,1,0.065101,0.060879,21,441,9261,43023456,0,33297160,10372125,3291775,29341690,2350311,17481114,24993414,0,1
2,2,0.065565,0.056306,21,441,9261,1717096860,0,60831350,18100375,8558615,31018358,783437,7491906,26844778,0,7
3,3,0.044676,0.045376,21,441,9261,1090183644,0,260614310,50640375,15142165,47785038,783437,24973020,566980225,1,1
5,5,0.032365,0.032885,21,441,9261,6965958492,0,916312230,14846375,11192035,2515002,2350311,4162170,3164906758,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1053079,1053079,0.031336,0.036243,27,729,19683,31201224912,4142250,6754841170,153751500,1975065,104791750,783437,832434,14160620395,0,7
1053080,1053080,0.024375,0.024976,27,729,19683,31077532476,4142250,6742674900,153751500,1975065,51138374,783437,832434,14126370161,0,4
1053081,1053081,0.026610,0.022022,27,729,19683,31210444224,4142250,6757402490,153751500,1975065,11736676,4700622,832434,14163397441,0,1
1053083,1053083,0.016623,0.031544,27,729,19683,31153591800,4142250,6741394240,153751500,1975065,41916700,783437,832434,281870169,0,40
