## Training the Inclusive classifier with tf.keras using data in TFRecord format

**tf.keras Inclusive classifier** This notebook trains a neural network for the particle classifier using the Inclusive Classifier model, which takes as input the list of reconstructed particles with their low-level features plus the high-level features. The data is prepared from Parquet using Apache Spark and written in TFRecord format, which TensorFlow then reads with tf.data and tf.io for training with tf.keras.

To run this notebook we used the following configuration:
* *Software stack*: TensorFlow 2.0.0-beta1
* *Platform*: CentOS 7, Python 3.6

In [1]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import Sequential, Input, Model
from tensorflow.keras.layers import Masking, Dense, Activation, GRU, Dropout, concatenate

# only needed for TensorFlow 1.x
# tf.enable_eager_execution()

tf.version.VERSION

'1.13.1'

## Configure distributed training using tf.distribute
This notebook shows an example of distributed training with tf.keras using 3 concurrent executions on a single machine.
The test machine has 24 physical cores, and we noted that a serial execution of the training leaves spare capacity. With distributed training we can "use all the CPU on the box".
- TensorFlow MultiWorkerMirroredStrategy is used to distribute the training.
- Configuration of the workers is done using the OS environment variable **TF_CONFIG**.
- **nodes_endpoints** configures the list of machines and ports that will be used. In this example we use 3 workers on the same machine; you can use this to distribute over multiple machines too.
- **worker_number** is unique for each worker; numbering starts from 0.
- Worker number 0 is the master (the chief worker).
- You need to run the 3 notebooks for the 3 configured workers at the same time (training will only start when all 3 workers are active).
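The `TF_CONFIG` value described above is a JSON document with a `cluster` section, identical on every worker, and a `task` section that differs only in its `index`. A minimal sketch of how it could be generated for N workers on one host, assuming consecutive ports starting at 12345 as in the commented-out cell that follows; the helper `make_tf_config` and its parameters are hypothetical, not part of TensorFlow:

```python
import json

def make_tf_config(worker_number, num_workers=3, host="localhost", base_port=12345):
    """Build the TF_CONFIG JSON string for one worker (hypothetical helper).

    Every worker gets the same 'cluster' section; only 'task.index' differs.
    Worker 0 acts as the chief.
    """
    if not 0 <= worker_number < num_workers:
        raise ValueError("worker_number must be in [0, num_workers)")
    # Assumption: workers listen on consecutive ports of the same host
    nodes_endpoints = [f"{host}:{base_port + i}" for i in range(num_workers)]
    return json.dumps({
        "cluster": {"worker": nodes_endpoints},
        "task": {"type": "worker", "index": worker_number},
    })

# Each of the 3 notebooks would set its own value before creating the strategy:
# os.environ["TF_CONFIG"] = make_tf_config(worker_number=0)
print(make_tf_config(1))
```

Generating the cluster list from one function keeps the three notebooks consistent: only `worker_number` changes between them.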

In [4]:
# Each worker will have a unique worker_number, numbering starts from 0
worker_number=0

# nodes_endpoints = ["localhost:12345", "localhost:12346", "localhost:12347"]
# number_workers = len(nodes_endpoints)

# import os
# import json
# os.environ['TF_CONFIG'] = json.dumps({
#     'cluster': {
#         'worker': nodes_endpoints
#     },
#     'task': {'type': 'worker', 'index': worker_number}
# })


In [5]:
# strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

W0820 10:58:59.233548 139785715705664 cross_device_ops.py:1164] Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:worker/replica:0/task:0/device:CPU:0


## Create the Keras model for the Inclusive classifier, hooking into tf.distribute

In [2]:
# Inclusive classifier model; for distributed training, build and compile it inside a `with strategy.scope():` block
## GRU branch
gru_input = Input(shape=(801,19), name='gru_input')
a = gru_input
a = Masking(mask_value=0.)(a)
a = GRU(units=50, activation='tanh')(a)
gruBranch = Dropout(0.2)(a)
    
hlf_input = Input(shape=(14,), name='hlf_input')
b = hlf_input
hlfBranch = Dropout(0.2)(b)

c = concatenate([gruBranch, hlfBranch])
c = Dense(25, activation='relu')(c)
output = Dense(3, activation='softmax')(c)
    
model = Model(inputs=[gru_input, hlf_input], outputs=output)
    
## Compile model
optimizer = 'Adam'
loss = 'categorical_crossentropy'
model.compile(loss=loss, optimizer=optimizer, metrics=["accuracy"] )

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


In [3]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
gru_input (InputLayer)          (None, 801, 19)      0                                            
__________________________________________________________________________________________________
masking (Masking)               (None, 801, 19)      0           gru_input[0][0]                  
__________________________________________________________________________________________________
gru (GRU)                       (None, 50)           10500       masking[0][0]                    
__________________________________________________________________________________________________
hlf_input (InputLayer)          (None, 14)           0                                            
__________________________________________________________________________________________________
dropout (D
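The parameter counts in the summary can be checked by hand. A GRU layer has three gates (update, reset, candidate), each with an input kernel, a recurrent kernel and a bias. With one bias vector per gate (Keras `reset_after=False`, the TF 1.x default) this gives 3 × (50 × (19 + 50) + 50) = 10500, matching the `gru` row above; with `reset_after=True` (the TF 2.x default) each gate has two bias vectors. A small sketch of that arithmetic, extended to the two Dense layers; the helper names are hypothetical:

```python
def gru_params(input_dim, units, reset_after=False):
    """Trainable parameters of a Keras GRU layer.

    3 gates, each with an input kernel (input_dim x units), a recurrent
    kernel (units x units) and a bias; reset_after=True adds a second bias per gate.
    """
    biases_per_gate = 2 if reset_after else 1
    return 3 * (units * (input_dim + units) + biases_per_gate * units)

def dense_params(input_dim, units):
    """Weights (input_dim x units) plus one bias per unit."""
    return input_dim * units + units

gru = gru_params(input_dim=19, units=50)            # GRU branch
hidden = dense_params(input_dim=50 + 14, units=25)  # GRU output concatenated with the 14 HLF
out = dense_params(input_dim=25, units=3)           # softmax over 3 classes
# Masking, Dropout and Input layers have no trainable parameters
print(gru, hidden, out, gru + hidden + out)         # 10500 1625 78 12203
```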

## Load test and training data in TFRecord format, using tf.data and tf.io

In [4]:
PATH = "/data/cern/"

# test dataset 
files_test_dataset = tf.data.Dataset.list_files(PATH + "testUndersampled.tfrecord/part-r*")

# training dataset 
files_train_dataset = tf.data.Dataset.list_files(PATH + "trainUndersampled.tfrecord/part-r*")

In [5]:
# tunable
num_parallel_reads=4

test_dataset = files_test_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE).interleave(
    tf.data.TFRecordDataset, 
    cycle_length=num_parallel_reads,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)

train_dataset = files_train_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE).interleave(
    tf.data.TFRecordDataset, cycle_length=num_parallel_reads,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
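`interleave` keeps `cycle_length` input files open (here `num_parallel_reads=4`) and takes records from them round-robin, `block_length` records per turn (1 by default), replacing a file with the next one from the list when it is exhausted. The pure-Python sketch below mimics that ordering for the sequential case; it is an illustration of the semantics under that assumption, not TensorFlow's implementation, and with `num_parallel_calls` the actual reads run in parallel:

```python
def interleave(sources, cycle_length, block_length=1):
    """Yield items from `sources` (an iterable of iterables) in interleaved order.

    Up to `cycle_length` sources are open at once; each turn yields up to
    `block_length` items from the current slot, and an exhausted slot is
    refilled with the next source.
    """
    inputs = iter(sources)
    slots = [None] * cycle_length  # currently open iterators, one per slot
    input_done = False
    slot = 0
    while True:
        if slots[slot] is None and not input_done:
            try:
                slots[slot] = iter(next(inputs))  # open the next source in this slot
            except StopIteration:
                input_done = True
        if input_done and all(s is None for s in slots):
            return
        current = slots[slot]
        if current is not None:
            for _ in range(block_length):
                try:
                    yield next(current)
                except StopIteration:
                    slots[slot] = None  # source exhausted: free the slot
                    break
        slot = (slot + 1) % cycle_length

# Hypothetical example: 3 "files" with a few records each, 2 read concurrently
files = [["a0", "a1"], ["b0"], ["c0", "c1", "c2"]]
print(list(interleave(files, cycle_length=2)))  # ['a0', 'b0', 'a1', 'c0', 'c1', 'c2']
```

Records from different files are mixed in the output, which is why a larger `cycle_length` gives a coarse form of shuffling across the `part-r*` files.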

In [6]:
# Function to decode TF records into the required features and labels
def decode(serialized_example):
    deser_features = tf.io.parse_single_example(
      serialized_example,
      # Defaults are not specified since all keys are required.
      features={
          'HLF_input': tf.io.FixedLenFeature((14,), tf.float32),
          'GRU_input': tf.io.FixedLenFeature((801, 19), tf.float32),
          'encoded_label': tf.io.FixedLenFeature((3,), tf.float32),
          })
    # ((GRU features, HLF features), one-hot label): same input order as the model
    return (deser_features['GRU_input'], deser_features['HLF_input']), deser_features['encoded_label']
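`tf.io.FixedLenFeature((801, 19), tf.float32)` expects each record to store exactly 801 × 19 = 15219 float values for `GRU_input`, which the parser reshapes in row-major order; likewise `HLF_input` and `encoded_label` must hold exactly 14 and 3 values, otherwise parsing fails. A stdlib sketch of that contract, useful for checking what the Spark job wrote; the helper `reshape_row_major` is hypothetical, and reading the 801 rows as (presumably zero-padded) particles with 19 features each is an interpretation based on the `Masking` layer:

```python
def reshape_row_major(values, rows, cols):
    """Reshape a flat list into `rows` lists of `cols` items (row-major / C order),
    as the parser does for a 2-D FixedLenFeature. Raises if the count is wrong."""
    if len(values) != rows * cols:
        raise ValueError(f"expected {rows * cols} values, got {len(values)}")
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]

# GRU_input: 801 rows (particles) x 19 features per row
flat = [float(i) for i in range(801 * 19)]
gru_input = reshape_row_major(flat, 801, 19)
print(len(gru_input), len(gru_input[0]))  # 801 19
print(gru_input[1][0])                    # 19.0: first feature of the second row
```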

In [7]:
# use for debugging
# for record in test_dataset.take(1):
#     print(record)

In [8]:
parsed_test_dataset=test_dataset.map(decode, num_parallel_calls=tf.data.experimental.AUTOTUNE)
parsed_train_dataset=train_dataset.map(decode, num_parallel_calls=tf.data.experimental.AUTOTUNE)

In [9]:
# use for debugging
# Show an example of the parsed data
# for record in parsed_test_dataset.take(1):
#    print(record)

In [10]:
# tunable
batch_size = 320
train=parsed_train_dataset.batch(batch_size)
train=train.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
train=train.repeat()
train

<DatasetV1Adapter shapes: (((?, 801, 19), (?, 14)), (?, 3)), types: ((tf.float32, tf.float32), tf.float32)>

In [12]:
import math
num_train_samples=8611   # samples used per epoch here; the full training dataset has 3426083 samples

steps_per_epoch=int(math.ceil(num_train_samples/(batch_size*1.0)))
steps_per_epoch

27
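`steps_per_epoch` is the number of batches needed to cover the samples once, ⌈N / batch_size⌉, with a possibly partial last batch. Because the dataset is repeated indefinitely, Keras needs this value to know where an epoch ends. A sketch using integer arithmetic instead of the float division, applied to the subset used here and to the full training set quoted in the comment; `num_steps` is a hypothetical helper:

```python
def num_steps(num_samples, batch_size):
    """Batches needed to see every sample once: ceil(num_samples / batch_size)."""
    return -(-num_samples // batch_size)  # ceiling division without floats

print(num_steps(8611, 320))     # subset used in this notebook -> 27
print(num_steps(3426083, 320))  # full training dataset -> 10707
```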

In [13]:
# tunable
test_batch_size = 256

test=parsed_test_dataset.batch(test_batch_size)
test=test.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
test=test.repeat()

In [14]:
num_test_samples=2123 # samples used for validation here; the full test dataset has 856090 samples

validation_steps=int(math.ceil(num_test_samples/(test_batch_size*1.0)))
validation_steps

9

## Train the tf.keras model

In [15]:
# train the Keras model

# tunable
num_epochs = 8

# callbacks = [ tf.keras.callbacks.TensorBoard(log_dir='./logs') ]
callbacks = []
    
%time history = model.fit(train, steps_per_epoch=steps_per_epoch, \
                          validation_data=test, validation_steps=validation_steps, \
                          epochs=num_epochs, callbacks=callbacks, verbose=1)


Instructions for updating:
Use tf.cast instead.
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
CPU times: user 10min 4s, sys: 1min 10s, total: 11min 14s
Wall time: 2min 47s
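The CPU time reported by `%time` (user + sys) is summed over all threads, so dividing it by the wall time gives the average number of cores kept busy. For the run above this is about 4 of the machine's 24 cores, consistent with the remark that a single training process leaves spare capacity. A small sketch of that arithmetic:

```python
def to_seconds(minutes, seconds):
    return 60 * minutes + seconds

cpu = to_seconds(10, 4) + to_seconds(1, 10)  # user + sys = 11min 14s
wall = to_seconds(2, 47)
print(round(cpu / wall, 2))                  # 4.04 cores busy on average
```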


In [16]:
worker_number=0
PATH="/data/cern/tensorflow_model/"
model.save(PATH + "mymodel"  + ".h5")
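If this cell runs in all 3 worker notebooks, every worker writes to the same `mymodel.h5`. One way to avoid clobbering the file, sketched here as an assumption rather than the notebook's method, is to let only worker 0 (the chief) write the shared path and send the other workers to a private scratch directory; `model_save_path` is a hypothetical helper:

```python
import os
import tempfile

def model_save_path(worker_number, base_path="/data/cern/tensorflow_model/", name="mymodel.h5"):
    """Return where this worker should save the model (hypothetical helper).

    Only the chief (worker 0) writes the shared file; other workers get a
    private temporary directory so they cannot overwrite it.
    """
    if worker_number == 0:
        return os.path.join(base_path, name)
    scratch_dir = tempfile.mkdtemp(prefix=f"worker{worker_number}_")
    return os.path.join(scratch_dir, name)

# model.save(model_save_path(worker_number))
print(model_save_path(0))  # /data/cern/tensorflow_model/mymodel.h5
```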

## Training history performance metrics

In [17]:
%matplotlib notebook
import matplotlib.pyplot as plt 
plt.style.use('seaborn-darkgrid')
# Graph with loss vs. epoch

plt.figure()
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(loc='upper right')
plt.title("Inclusive classifier loss")
plt.show()


In [18]:
# Graph with accuracy vs. epoch
%matplotlib notebook
plt.figure()
plt.plot(history.history['acc'], label='train')
plt.plot(history.history['val_acc'], label='validation')
plt.ylabel('Accuracy')
plt.xlabel('epoch')
plt.legend(loc='lower right')
plt.title("Inclusive classifier accuracy")
plt.show()

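The accuracy keys in `history.history` depend on the TensorFlow version: with `metrics=["accuracy"]`, TF 1.x records `acc`/`val_acc`, while TF 2.x records `accuracy`/`val_accuracy`. A small version-agnostic lookup, sketched as a hypothetical helper that works on any dict shaped like `history.history`:

```python
def accuracy_keys(history_dict):
    """Return (train_key, validation_key) for accuracy in a Keras history dict."""
    for train_key in ("accuracy", "acc"):
        if train_key in history_dict:
            return train_key, "val_" + train_key
    raise KeyError("no accuracy metric recorded; was metrics=['accuracy'] passed to compile()?")

# Usage with the notebook's history object:
# train_key, val_key = accuracy_keys(history.history)
# plt.plot(history.history[train_key], label='train')
# plt.plot(history.history[val_key], label='validation')
print(accuracy_keys({"loss": [], "acc": [], "val_loss": [], "val_acc": []}))  # ('acc', 'val_acc')
```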

## Model performance metrics
Load the model and plot the ROC curve, the AUC and the confusion matrix using the notebook  
**4.3a-Model_evaluate_ROC_and_CM.ipynb**
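As a rough, stdlib-only sketch of these two metrics (not the code of the 4.3a notebook), assuming one-hot labels and softmax outputs shaped like `encoded_label` and the model's 3-class output: the confusion matrix counts (true class, argmax of prediction) pairs, and the one-vs-rest AUC for a class equals the fraction of (positive, negative) pairs in which the positive sample gets the higher score, ties counting one half. All helper names are hypothetical:

```python
def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def confusion_matrix(labels_onehot, predictions, num_classes=3):
    """cm[t][p] = number of samples of true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for label, pred in zip(labels_onehot, predictions):
        cm[argmax(label)][argmax(pred)] += 1
    return cm

def auc_one_vs_rest(labels_onehot, predictions, cls):
    """ROC AUC for class `cls` vs the rest, via the pairwise (Mann-Whitney) definition.

    O(n_pos * n_neg): fine for a sketch, too slow for the full test set.
    """
    pos = [p[cls] for l, p in zip(labels_onehot, predictions) if argmax(l) == cls]
    neg = [p[cls] for l, p in zip(labels_onehot, predictions) if argmax(l) != cls]
    wins = sum(1.0 if s_pos > s_neg else 0.5 if s_pos == s_neg else 0.0
               for s_pos in pos for s_neg in neg)
    return wins / (len(pos) * len(neg))

# Toy inputs (not results from this notebook)
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5], [0.3, 0.6, 0.1]]
print(confusion_matrix(y_true, y_pred))        # [[1, 1, 0], [0, 1, 0], [0, 0, 1]]
print(auc_one_vs_rest(y_true, y_pred, cls=0))  # 1.0
```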