This notebook extends the TF v2.x code from the single-machine notebook
`23_ds_single_machine`. The modifications enable multi-machine training
using the [horovod](https://github.com/horovod/horovod) framework.

We will highlight the changes compared to the single-machine
implementation.

First, check that the data is available locally. If not, go to notebook
`1_data_and_preprocssing` and download the data from DBFS to local storage.

In [None]:
ls 06_LHC/

  

>     LICENSE
>     README.md
>     data
>     h5
>     models
>     scripts
>     utils
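
The same check can be done programmatically before training starts. The following is a minimal sketch, assuming the directory layout shown in the listing above; the helper name `missing_entries` and the tuple `EXPECTED_ENTRIES` are hypothetical and not part of the project code.

```python
import os

# Entries the notebook relies on under the project root (taken from the
# listing above). This tuple is an assumption made for illustration.
EXPECTED_ENTRIES = ("data", "h5", "models", "scripts", "utils")


def missing_entries(root, expected=EXPECTED_ENTRIES):
    """Return the expected entries that do not exist under `root`."""
    return [name for name in expected
            if not os.path.exists(os.path.join(root, name))]


if __name__ == "__main__":
    missing = missing_entries(os.path.join(os.getcwd(), "06_LHC"))
    if missing:
        print("Missing locally:", ", ".join(missing))
    else:
        print("All expected entries found.")
```

If anything is reported as missing, the data has to be downloaded again as described above.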

  

Get the imports.

In [None]:
import argparse
from argparse import Namespace
from datetime import datetime
import numpy as np
import tensorflow as tf
import socket
import os
import sys
from sklearn.cluster import KMeans
from tqdm import tqdm
import h5py

np.set_printoptions(edgeitems=1000)

from scipy.optimize import linear_sum_assignment

BASE_DIR = os.path.join(os.getcwd(), '06_LHC','scripts')  
#os.path.dirname(os.path.abspath(__file__))
sys.path.append(BASE_DIR)
sys.path.append(os.path.join(BASE_DIR, '..', 'models'))
sys.path.append(os.path.join(BASE_DIR, '..', 'utils'))
import provider
import gapnet_classify as MODEL

  

  

Get the input parameters.

In [None]:
parserdict = {'max_dim': 3, #help='Dimension of the encoding layer [Default: 3]')
              'n_clusters': 3, #help='Number of clusters [Default: 3]')
              'gpu': 0, #help='GPU to use [default: GPU 0]')
              'model': 'gapnet_classify', #help='Model name [default: gapnet_classify]')
              'log_dir': 'log', #help='Log dir [default: log]')
              'num_point': 100, #help='Point Number [default: 100]')
              'max_epoch': 10, #help='Epoch to run [default: 200]')
              'epochs_pretrain': 1, #help='Epochs for pretraining [default: 10]')
              'batch_size': 512, #help='Batch Size during training [default: 512]')
              'learning_rate': 0.001, #help='Initial learning rate [default: 0.01]')

              'momentum': 0.9, #help='Initial momentum [default: 0.9]')
              'optimizer': 'adam', #help='adam or momentum [default: adam]')
              'decay_step': 500000, #help='Decay step for lr decay [default: 500000]')
              'wd': 0.0, #help='Weight Decay [Default: 0.0]')
              'decay_rate': 0.5, #help='Decay rate for lr decay [default: 0.5]')
              'output_dir': 'train_results', #help='Directory that stores all training logs and trained models')
              'data_dir': os.path.join(os.getcwd(),'06_LHC', 'h5'), # '../h5', #help='directory with data [default: hdf5_data]')
              'nfeat': 8, #help='Number of features [default: 8]')
              'ncat': 20, #help='Number of categories [default: 20]')
             }

FLAGS = Namespace(**parserdict)
H5_DIR = FLAGS.data_dir

EPOCH_CNT = 0
MAX_PRETRAIN = FLAGS.epochs_pretrain
BATCH_SIZE = FLAGS.batch_size
NUM_POINT = FLAGS.num_point
NUM_FEAT = FLAGS.nfeat
NUM_CLASSES = FLAGS.ncat
MAX_EPOCH = FLAGS.max_epoch
BASE_LEARNING_RATE = FLAGS.learning_rate
GPU_INDEX = FLAGS.gpu
MOMENTUM = FLAGS.momentum
OPTIMIZER = FLAGS.optimizer
DECAY_STEP = FLAGS.decay_step
DECAY_RATE = FLAGS.decay_rate

# MODEL = importlib.import_module(FLAGS.model) # import network module
# BASE_DIR points at 06_LHC/scripts; the model definitions live in the sibling 'models' folder.
MODEL_FILE = os.path.join(BASE_DIR, '..', 'models', FLAGS.model + '.py')
LOG_DIR = os.path.join(os.getcwd(), '06_LHC', 'logs', FLAGS.log_dir)

if not os.path.exists(LOG_DIR): os.makedirs(LOG_DIR)
os.system('cp %s %s' % (MODEL_FILE, LOG_DIR))  # bkp of model def (MODEL_FILE already ends in .py)
os.system('cp train_kmeans.py %s' % (LOG_DIR))  # bkp of train procedure

BN_INIT_DECAY = 0.5
BN_DECAY_DECAY_RATE = 0.5
BN_DECAY_DECAY_STEP = float(DECAY_STEP)
BN_DECAY_CLIP = 0.99

LEARNING_RATE_CLIP = 1e-5
HOSTNAME = socket.gethostname()

TRAIN_FILES = provider.getDataFiles(os.path.join(H5_DIR, 'train_files_wztop.txt'))
TEST_FILES = provider.getDataFiles(os.path.join(H5_DIR, 'test_files_wztop.txt'))

  

  

Define the utils functions.

In [None]:
def get_learning_rate(batch):
    learning_rate = tf.compat.v1.train.exponential_decay(
        BASE_LEARNING_RATE,  # Base learning rate.
        batch * BATCH_SIZE,  # Current index into the dataset.
        DECAY_STEP,  # Decay step.
        DECAY_RATE,  # Decay rate.
        staircase=True)
    learning_rate = tf.maximum(learning_rate, LEARNING_RATE_CLIP)  # CLIP THE LEARNING RATE!
    return learning_rate


def get_bn_decay(batch):
    bn_momentum = tf.compat.v1.train.exponential_decay(
        BN_INIT_DECAY,
        batch * BATCH_SIZE,
        BN_DECAY_DECAY_STEP,
        BN_DECAY_DECAY_RATE,
        staircase=True)
    bn_decay = tf.minimum(BN_DECAY_CLIP, 1 - bn_momentum)
    return bn_decay
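
To make the two schedules concrete, here is a plain-Python sketch of what staircase exponential decay computes, using the constants from the parameter cell. It only reproduces the arithmetic `initial * decay_rate ** floor(samples_seen / decay_steps)` followed by the clip. The function names `staircase_decay`, `learning_rate_at` and `bn_decay_at` are illustrative and do not exist in the project.

```python
# Values copied from the parameter cell of this notebook.
BASE_LEARNING_RATE = 0.001
BATCH_SIZE = 512
DECAY_STEP = 500000
DECAY_RATE = 0.5
LEARNING_RATE_CLIP = 1e-5
BN_INIT_DECAY = 0.5
BN_DECAY_DECAY_RATE = 0.5
BN_DECAY_CLIP = 0.99


def staircase_decay(initial, samples_seen, decay_steps, decay_rate):
    """initial * decay_rate ** floor(samples_seen / decay_steps)."""
    return initial * decay_rate ** (samples_seen // decay_steps)


def learning_rate_at(batch):
    """Learning rate after `batch` training steps, clipped from below."""
    lr = staircase_decay(BASE_LEARNING_RATE, batch * BATCH_SIZE,
                         DECAY_STEP, DECAY_RATE)
    return max(lr, LEARNING_RATE_CLIP)


def bn_decay_at(batch):
    """Batch-norm decay after `batch` training steps, clipped from above."""
    bn_momentum = staircase_decay(BN_INIT_DECAY, batch * BATCH_SIZE,
                                  DECAY_STEP, BN_DECAY_DECAY_RATE)
    return min(BN_DECAY_CLIP, 1 - bn_momentum)
```

With these values the first halving of the learning rate happens at step 977, because 977 * 512 = 500224 is the first multiple of the batch size that reaches `DECAY_STEP`.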

  

  

Modification:

- Create a checkpoint directory for Horovod.
- The directory is user-chosen.

In [None]:
import os
import time
 
checkpoint_dir = '/dbfs/databricks/driver/06_LHC/logs/train/{}/'.format(time.time())
 
os.makedirs(checkpoint_dir)

  

  

Create the Horovod h5 loading function:

- Note that the rank and size are passed in as arguments.
- `rank` is the current device id.
- `size` is the total number of available GPUs.
- We split the data in the h5 file across the devices.

In [None]:
def load_h5_hvd(h5_filename, rank=0, size=1):
    # Each worker reads every `size`-th event, starting at offset `rank`.
    with h5py.File(h5_filename, 'r') as f:
        data = f['data'][rank::size]
        label = f['pid'][rank::size]
        seg = f['label'][rank::size]
    print("loaded {0} events".format(len(data)))

    return (data, label, seg)
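
The strided slice `[rank::size]` is what splits the data between the devices. The sketch below shows the same slicing on a plain Python list standing in for the events of one h5 file; the helper `shard` is hypothetical and only used for illustration.

```python
def shard(items, rank, size):
    """Return every `size`-th item starting at `rank`, as in `f['data'][rank::size]`."""
    return items[rank::size]


events = list(range(10))  # stand-in for the events in one h5 file
shards = [shard(events, rank, 2) for rank in range(2)]
print(shards)  # [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
```

The shards are disjoint and together cover every event. When the number of events is not divisible by `size`, the shard sizes differ by at most one.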

  

  

Main training function. Modifications are:

- Import the packages again inside the function. Otherwise single
  devices may cause problems.
- Initialise Horovod.
- Copy the files to the local directory of each GPU worker such that
  they are available for Horovod.
- Scale the learning rate by the number of available devices.
- Add a Horovod-specific distributed optimizer.
- Use hooks for checkpoint saving every 1000 steps.
- Switch from a normal TF training session to a monitored training
  session.
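
The learning-rate scaling can be illustrated without Horovod. Each worker processes its own batch of `BATCH_SIZE` events and the gradients are averaged across workers, so the effective batch grows with the number of workers. Multiplying the learning rate by that number is a common heuristic. The sketch below assumes this linear rule; `scale_for_workers` is a hypothetical helper, and `num_workers` plays the role of `hvd.size()`.

```python
def scale_for_workers(base_learning_rate, batch_size, num_workers):
    """Return (scaled learning rate, effective global batch size)."""
    return base_learning_rate * num_workers, batch_size * num_workers


lr, global_batch = scale_for_workers(0.001, 512, num_workers=2)
print(lr, global_batch)  # 0.002 1024
```

Whether linear scaling is the best choice for a given optimizer is a tuning question; the rule is shown here only because it matches the `learning_rate * hvd.size()` expression used in the training function.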

In [None]:
def train_hvd():
    # Do all the imports here again so that they are available on each Horovod worker.
    import horovod.tensorflow as hvd
    import tensorflow as tf
    import shutil
    import argparse, shlex
    from datetime import datetime
    import numpy as np
    import socket
    import os
    import sys
    from sklearn.cluster import KMeans
    from tqdm import tqdm
    np.set_printoptions(edgeitems=1000)
    from scipy.optimize import linear_sum_assignment
    BASE_DIR = os.path.join(os.getcwd(), '06_LHC','scripts')
    sys.path.append(BASE_DIR)
    sys.path.append(os.path.join(BASE_DIR, '..', 'models'))
    sys.path.append(os.path.join(BASE_DIR, '..', 'utils'))
    
    # HOROVOD: initialize Horovod.
    hvd.init()
    
    # HOROVOD: Copy files from local to each single GPU directory
    src = "/dbfs/FileStore/06_LHC"
    dst = os.path.join(os.getcwd(), '06_LHC')
    print("Copying data/files to local horovod folder...")
    shutil.copytree(src, dst)
    print("Done with copying!")
    
    import provider
    import gapnet_classify as MODEL
    
    
    with tf.Graph().as_default():
        with tf.device('/gpu:' + str(GPU_INDEX)):
            pointclouds_pl, labels_pl = MODEL.placeholder_inputs(BATCH_SIZE, NUM_POINT, NUM_FEAT)

            is_training_pl = tf.compat.v1.placeholder(tf.bool, shape=())

            # Note the global_step=batch parameter to minimize.
            # That tells the optimizer to helpfully increment the 'batch' parameter for you every time it trains.
            batch = tf.Variable(0)
            alpha = tf.compat.v1.placeholder(dtype=tf.float32, shape=())
            bn_decay = get_bn_decay(batch)
            tf.compat.v1.summary.scalar('bn_decay', bn_decay)
            print("--- Get model and loss")

            pred, max_pool = MODEL.get_model(pointclouds_pl, is_training=is_training_pl,
                                             bn_decay=bn_decay,
                                             num_class=NUM_CLASSES, weight_decay=FLAGS.wd,
                                             )

            class_loss = MODEL.get_focal_loss(pred, labels_pl, NUM_CLASSES)
            mu = tf.Variable(tf.zeros(shape=(FLAGS.n_clusters, FLAGS.max_dim)), name="mu",
                             trainable=True)  # k centroids
            kmeans_loss, stack_dist = MODEL.get_loss_kmeans(max_pool, mu, FLAGS.max_dim,
                                                            FLAGS.n_clusters, alpha)

            full_loss = kmeans_loss + class_loss

            print("--- Get training operator")
            # Get training operator
            learning_rate = get_learning_rate(batch)
            # HOROVOD: scale the learning rate by the number of Horovod processes (= hvd.size())
            tf.compat.v1.summary.scalar('learning_rate', learning_rate * hvd.size())
            if OPTIMIZER == 'momentum':
                optimizer = tf.compat.v1.train.MomentumOptimizer(learning_rate * hvd.size(), momentum=MOMENTUM)
            elif OPTIMIZER == 'adam':
                optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate * hvd.size())
            # HOROVOD: add Horovod Distributed Optimizer
            optimizer = hvd.DistributedOptimizer(optimizer)

            global_step = tf.compat.v1.train.get_or_create_global_step() 
            train_op_full = optimizer.minimize(full_loss, global_step=global_step) #batch)
            train_op = optimizer.minimize(class_loss, global_step=global_step) #batch)

            # Add ops to save and restore all the variables.
            saver = tf.compat.v1.train.Saver()
             
        # HOROVOD
        hooks = [
          # Horovod: BroadcastGlobalVariablesHook broadcasts initial variable states
          # from rank 0 to all other processes. This is necessary to ensure consistent
          # initialization of all workers when training is started with random weights
          # or restored from a checkpoint.
          hvd.BroadcastGlobalVariablesHook(0),
          
          #checkpoint_dir_mod = checkpoint_dir if hvd.rank() == 0 else None
          
          tf.compat.v1.train.CheckpointSaverHook(checkpoint_dir=checkpoint_dir,
                                                 checkpoint_basename='cluster.ckpt',
                                                 save_steps=1_000
                                                ),

          # this one basically prints every n steps the "step" and the "loss". Output is cleaner without
          # tf.compat.v1.train.LoggingTensorHook(tensors={'step': global_step, 'loss': full_loss}, every_n_iter=75),
        ]

        # Create a session
        config = tf.compat.v1.ConfigProto()
        config.gpu_options.allow_growth = True
        config.allow_soft_placement = True
        config.log_device_placement = False
        config.gpu_options.visible_device_list = str(hvd.local_rank())
        
        # global variable initializer must be defined before session definition
        init_global_step = tf.compat.v1.global_variables_initializer()
        
        # MonitoredTrainingSession
        # takes care of session initialization,
        # restoring from a checkpoint, saving to a checkpoint, and closing when done
        # or an error occurs.
        #checkpoint_dir_mod = checkpoint_dir if hvd.rank() == 0 else None
        sess = tf.compat.v1.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                                           hooks=hooks,
                                                           config=config)

        # get one batch_data from the training files in order to initialise the session
        train_idxs = np.arange(0, len(TRAIN_FILES))
        current_file = os.path.join(os.getcwd(), '06_LHC', 'h5', TRAIN_FILES[train_idxs[0]])
        current_data, current_label, current_cluster = load_h5_hvd(current_file, hvd.rank(), hvd.size())
        batch_data, batch_label = get_batch(current_data, current_label, 0, BATCH_SIZE)
        # 
        feed_dict = {pointclouds_pl: batch_data,
                     labels_pl: batch_label,
                     is_training_pl: False,
                     alpha: 2 * (EPOCH_CNT - MAX_PRETRAIN + 1),}
        #NOT SO CLEAR THAT init_global_step IS NECESSARY. 
        sess.run(init_global_step, feed_dict=feed_dict)

        # helps with merging: CHANGE THIS IF POSSIBLE
        sess.graph._unsafe_unfinalize()
        # Add summary writers
        merged = tf.compat.v1.summary.merge_all()
        train_writer = tf.compat.v1.summary.FileWriter(os.path.join(LOG_DIR, 'train'), sess.graph)
        test_writer = tf.compat.v1.summary.FileWriter(os.path.join(LOG_DIR, 'test'), sess.graph)
        
        # Init variables
        print("Total number of weights for the model: ", np.sum([np.prod(v.get_shape().as_list()) for v in tf.compat.v1.trainable_variables()]))
        ops = {'pointclouds_pl': pointclouds_pl,
          'labels_pl': labels_pl,
          'is_training_pl': is_training_pl,
          'max_pool': max_pool,
          'pred': pred,
          'alpha': alpha,
          'mu': mu,
          'stack_dist': stack_dist,
          'class_loss': class_loss,
          'kmeans_loss': kmeans_loss,
          'train_op': train_op,
          'train_op_full': train_op_full,
          'merged': merged,
          'step': batch,
          'learning_rate': learning_rate
        }

        for epoch in range(MAX_EPOCH):
            print('\n**** EPOCH %03d ****' % (epoch))
            sys.stdout.flush()
            
            is_full_training = epoch > MAX_PRETRAIN
            max_pool = train_one_epoch(sess, ops, train_writer, hvd.rank(), hvd.size(), is_full_training)
            if epoch == MAX_PRETRAIN:
                centers = KMeans(n_clusters=FLAGS.n_clusters).fit(np.squeeze(max_pool))
                centers = centers.cluster_centers_
                sess.run(tf.compat.v1.assign(mu, centers))

            eval_one_epoch(sess, ops, test_writer, hvd.rank(), hvd.size(), is_full_training)
            """if is_full_training:
                save_path = saver.save(sess, os.path.join(LOG_DIR, 'cluster.ckpt'))
            else:
                save_path = saver.save(sess, os.path.join(LOG_DIR, 'model.ckpt'))"""
            #print("Model saved in file: %s" % save_path)
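
The commented-out `checkpoint_dir_mod` lines in the function above hint at a variant in which only the worker with rank 0 writes checkpoints, so that the workers do not overwrite each other's files. The following is a minimal sketch of that selection logic. The helper name `checkpoint_dir_for_rank` is hypothetical, and the notebook as written passes the same directory on every worker.

```python
def checkpoint_dir_for_rank(checkpoint_dir, rank):
    """Only the worker with rank 0 gets a checkpoint directory; others get None."""
    return checkpoint_dir if rank == 0 else None


# Example path for illustration only.
for rank in range(2):
    print(rank, checkpoint_dir_for_rank("/tmp/ckpt", rank))
```

Inside `train_hvd` this would correspond to passing `checkpoint_dir_for_rank(checkpoint_dir, hvd.rank())` to `MonitoredTrainingSession` and adding the `CheckpointSaverHook` only when `hvd.rank() == 0`.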


  

  

Training utils.

In [None]:
def get_batch(data, label, start_idx, end_idx):
    batch_label = label[start_idx:end_idx]
    batch_data = data[start_idx:end_idx, :, :]
    return batch_data, batch_label


def cluster_acc(y_true, y_pred):
    """
    Calculate clustering accuracy. Require scikit-learn installed
    """
    y_true = y_true.astype(np.int64)
    D = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((D, D), dtype=np.int64)
    for i in range(y_pred.size):
        w[y_pred[i], y_true[i]] += 1
    ind = linear_sum_assignment(w.max() - w)
    ind = np.asarray(ind)
    ind = np.transpose(ind)
    return sum([w[i, j] for i, j in ind]) * 1.0 / y_pred.size
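
Cluster ids are arbitrary, so `cluster_acc` first finds the one-to-one mapping between predicted and true ids that maximises the number of matches. To illustrate the idea without SciPy, the sketch below swaps the Hungarian algorithm for a brute-force search over all permutations. This gives the same result when both labelings use the ids `0 .. n_clusters-1`, but is only practical for a handful of clusters. The function name `cluster_acc_bruteforce` is illustrative.

```python
from itertools import permutations


def cluster_acc_bruteforce(y_true, y_pred, n_clusters):
    """Best accuracy over all relabelings of the predicted cluster ids."""
    best = 0
    for perm in permutations(range(n_clusters)):
        # perm[p] is the true label that predicted cluster p is mapped to.
        hits = sum(1 for t, p in zip(y_true, y_pred) if perm[p] == t)
        best = max(best, hits)
    return best / len(y_pred)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]  # same grouping, cluster ids 0 and 1 swapped
print(cluster_acc_bruteforce(y_true, y_pred, 3))  # 1.0
```

The swapped ids do not reduce the score, because the best mapping sends predicted cluster 1 to true label 0 and vice versa.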

  

  

One-epoch training and evaluation functions:

- The applicable Horovod rank and size are fed into both functions.
- The rank and size are used to load the correct h5 data.
- Progress bars are removed, since progress bars from each device would
  overlap.

In [None]:
def train_one_epoch(sess, ops, train_writer, hvd_rank, hvd_size, is_full_training):
    """ ops: dict mapping from string to tf ops """
    is_training = True

    train_idxs = np.arange(0, len(TRAIN_FILES))

    acc = loss_sum = 0
    y_pool = []
    for fn in range(len(TRAIN_FILES)):
        # print('----' + str(fn) + '-----')
        current_file = os.path.join(os.getcwd(), '06_LHC', 'h5', TRAIN_FILES[train_idxs[fn]])
        current_data, current_label, current_cluster = load_h5_hvd(current_file, hvd_rank, hvd_size)

        current_label = np.squeeze(current_label)

        file_size = current_data.shape[0]
        num_batches = file_size // BATCH_SIZE
        # num_batches = 5
        print(str(datetime.now()))

        # initialise progress bar
        #process_desc = "TRAINING: Loss {:2.3e}"
        #progress_bar = tqdm(initial=0, leave=True, total=num_batches,
        #                    desc=process_desc.format(0),
        #                    position=0)
        for batch_idx in range(num_batches):
            start_idx = batch_idx * BATCH_SIZE
            end_idx = (batch_idx + 1) * BATCH_SIZE
            batch_data, batch_label = get_batch(current_data, current_label, start_idx, end_idx)
            cur_batch_size = end_idx - start_idx

            # print(batch_weight)
            feed_dict = {ops['pointclouds_pl']: batch_data,
                         ops['labels_pl']: batch_label,
                         ops['is_training_pl']: is_training,
                         ops['alpha']: 2 * (EPOCH_CNT - MAX_PRETRAIN + 1),}
            if is_full_training:
                summary, step, _, loss_val, dist, lr = sess.run([ops['merged'], ops['step'],
                                                                 ops['train_op_full'], ops['kmeans_loss'],
                                                                 ops['stack_dist'], ops['learning_rate']],
                                                                feed_dict=feed_dict)

                batch_cluster = np.array([np.where(r == 1)[0][0] for r in current_cluster[start_idx:end_idx]])
                cluster_assign = np.zeros((cur_batch_size), dtype=int)

                for i in range(cur_batch_size):
                    index_closest_cluster = np.argmin(dist[:, i])
                    cluster_assign[i] = index_closest_cluster

                acc += cluster_acc(batch_cluster, cluster_assign)
            else:
                summary, step, _, loss_val, max_pool, lr = sess.run([ops['merged'], ops['step'],
                                                                     ops['train_op'], ops['class_loss'],
                                                                     ops['max_pool'], ops['learning_rate']],
                                                                    feed_dict=feed_dict)

                if len(y_pool) == 0:
                    y_pool = np.squeeze(max_pool)
                else:
                    y_pool = np.concatenate((y_pool, np.squeeze(max_pool)), axis=0)

            loss_sum += np.mean(loss_val)

            #train_writer.add_summary(summary, step)
            if hvd_rank == 0:
                train_writer.add_summary(summary, step)
              
              

            # Update train bar
            #process_desc.format(loss_val)
            #progress_bar.update(1)
        #progress_bar.close()

    print('learning rate: %f' % (lr))
    print('train mean loss: %f' % (loss_sum / float(num_batches)))
    #if is_full_training:
    print('train clustering accuracy: %f' % (acc / float(num_batches)))
    return y_pool


def eval_one_epoch(sess, ops, test_writer, hvd_rank, hvd_size, is_full_training):
    """ ops: dict mapping from string to tf ops """
    global EPOCH_CNT
    is_training = False
    test_idxs = np.arange(0, len(TEST_FILES))
    # Test on all data: last batch might be smaller than BATCH_SIZE
    loss_sum = acc = 0
    acc_kmeans = 0

    for fn in range(len(TEST_FILES)):
        # print('----' + str(fn) + '-----')
        current_file = os.path.join(os.getcwd(), '06_LHC', 'h5', TEST_FILES[test_idxs[fn]])
        current_data, current_label, current_cluster = load_h5_hvd(current_file, hvd_rank, hvd_size)
        current_label = np.squeeze(current_label)

        file_size = current_data.shape[0]
        num_batches = file_size // BATCH_SIZE
        
        """process_desc = "VALIDATION: Loss {:2.3e}"
        progress_bar = tqdm(initial=0, leave=True, total=num_batches,
                        desc=process_desc.format(0),
                        position=0)"""
        for batch_idx in range(num_batches):
            start_idx = batch_idx * BATCH_SIZE
            end_idx = (batch_idx + 1) * BATCH_SIZE
            batch_data, batch_label = get_batch(current_data, current_label, start_idx, end_idx)
            cur_batch_size = end_idx - start_idx

            feed_dict = {ops['pointclouds_pl']: batch_data,
                         ops['is_training_pl']: is_training,
                         ops['labels_pl']: batch_label,
                         ops['alpha']: 2 * (EPOCH_CNT - MAX_PRETRAIN + 1),}

            if is_full_training:
                summary, step, loss_val, max_pool, dist, mu = sess.run([ops['merged'], ops['step'],
                                                                        ops['kmeans_loss'],
                                                                        ops['max_pool'], ops['stack_dist'],
                                                                        ops['mu']],
                                                                       feed_dict=feed_dict)
                
                batch_cluster = np.array([np.where(r == 1)[0][0] for r in current_cluster[start_idx:end_idx]])
                cluster_assign = np.zeros((cur_batch_size), dtype=int)
                for i in range(cur_batch_size):
                    index_closest_cluster = np.argmin(dist[:, i])
                    cluster_assign[i] = index_closest_cluster

                acc += cluster_acc(batch_cluster, cluster_assign)

            else:
                summary, step, loss_val = sess.run([ops['merged'], ops['step'],
                                                    ops['class_loss']],
                                                   feed_dict=feed_dict)

            #test_writer.add_summary(summary, step)
            if hvd_rank == 0:
                test_writer.add_summary(summary, step)

            loss_sum += np.mean(loss_val)
            
            """# Update train bar
            process_desc.format(loss_val)
            progress_bar.update(1)"""
        #progress_bar.close()

    total_loss = loss_sum * 1.0 / float(num_batches)
    print('test mean loss: %f' % (total_loss))
    #if is_full_training:
    print('testing clustering accuracy: %f' % (acc / float(num_batches)))

    EPOCH_CNT += 1
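
Both functions iterate over `file_size // BATCH_SIZE` batches per file, so a trailing partial batch is skipped in training and in evaluation. The sketch below reproduces only this index arithmetic; `batch_bounds` is a hypothetical helper, and the event count of 1200 is an arbitrary example.

```python
def batch_bounds(num_events, batch_size):
    """Yield (start_idx, end_idx) for each full batch; a trailing partial batch is skipped."""
    for batch_idx in range(num_events // batch_size):
        yield batch_idx * batch_size, (batch_idx + 1) * batch_size


print(list(batch_bounds(1200, 512)))  # [(0, 512), (512, 1024)]
```

In this example the last 176 events of the shard are never fed to the network. With Horovod the effect applies per worker, since each worker only holds its own slice of every file.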

  

  

Run the training:

- Initialise the Horovod runner with np=2 GPUs. The cluster does not
  allow more GPUs.
- Run the Horovod runner with the given training function.

In [None]:
from sparkdl import HorovodRunner
 
hr = HorovodRunner(np=2)
hr.run(train_hvd)

  

>     HorovodRunner will stream all training logs to notebook cell output. If there are too many logs, you
>     can adjust the log level in your train method. Or you can set driver_log_verbosity to
>     'log_callback_only' and use a HorovodRunner log  callback on the first worker to get concise
>     progress updates.
>     The global names read or written to by the pickled function are {'get_bn_decay', 'MOMENTUM', 'range', 'NUM_CLASSES', 'OPTIMIZER', 'EPOCH_CNT', 'get_learning_rate', 'load_h5_hvd', 'get_batch', 'MAX_EPOCH', 'len', 'NUM_FEAT', 'GPU_INDEX', 'BATCH_SIZE', 'LOG_DIR', 'FLAGS', 'checkpoint_dir', 'eval_one_epoch', 'NUM_POINT', 'MAX_PRETRAIN', 'TRAIN_FILES', 'print', 'str', 'train_one_epoch'}.
>     The pickled object size is 11055 bytes.
>
>     ### How to enable Horovod Timeline? ###
>     HorovodRunner has the ability to record the timeline of its activity with Horovod  Timeline. To
>     record a Horovod Timeline, set the `HOROVOD_TIMELINE` environment variable  to the location of the
>     timeline file to be created. You can then open the timeline file  using the chrome://tracing
>     facility of the Chrome browser.
>
>     Start training.
>     [1,1]<stderr>:2021-01-04 17:28:24.816450: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>     [1,0]<stderr>:2021-01-04 17:28:24.869795: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>     [1,0]<stdout>:Copying data/files to local horovod folder...
>     [1,1]<stdout>:Copying data/files to local horovod folder...
>     [1,0]<stdout>:Done with copying!
>     [1,1]<stdout>:Done with copying!
>     [1,0]<stdout>:--- Get model and loss
>     [1,1]<stdout>:--- Get model and loss
>     [1,1]<stderr>:2021-01-04 17:30:20.132866: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
>     [1,0]<stderr>:2021-01-04 17:30:20.176503: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
>     [1,1]<stderr>:2021-01-04 17:30:20.184581: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,1]<stderr>:2021-01-04 17:30:20.185468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
>     [1,1]<stderr>:pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
>     [1,1]<stderr>:coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
>     [1,1]<stderr>:2021-01-04 17:30:20.185502: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>     [1,0]<stderr>:2021-01-04 17:30:20.232379: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,0]<stderr>:2021-01-04 17:30:20.233308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
>     [1,0]<stderr>:pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
>     [1,0]<stderr>:coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
>     [1,0]<stderr>:2021-01-04 17:30:20.233358: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>     [1,1]<stderr>:2021-01-04 17:30:20.619010: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
>     [1,0]<stderr>:2021-01-04 17:30:20.637123: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
>     [1,1]<stderr>:2021-01-04 17:30:20.889096: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
>     [1,0]<stderr>:2021-01-04 17:30:20.931508: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
>     [1,1]<stderr>:2021-01-04 17:30:20.936127: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
>     [1,0]<stderr>:2021-01-04 17:30:20.982198: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
>     [1,1]<stderr>:2021-01-04 17:30:21.453513: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
>     [1,0]<stderr>:2021-01-04 17:30:21.540557: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
>     [1,1]<stderr>:2021-01-04 17:30:21.560198: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
>     [1,0]<stderr>:2021-01-04 17:30:21.627762: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
>     [1,0]<stderr>:2021-01-04 17:30:21.679130: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
>     [1,0]<stderr>:2021-01-04 17:30:21.679396: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,0]<stderr>:2021-01-04 17:30:21.680404: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,0]<stderr>:2021-01-04 17:30:21.681219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
>     [1,0]<stdout>:--- Get training operator
>     [1,1]<stderr>:2021-01-04 17:30:22.394999: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
>     [1,1]<stderr>:2021-01-04 17:30:22.395216: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,1]<stderr>:2021-01-04 17:30:22.396223: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,1]<stderr>:2021-01-04 17:30:22.397023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
>     [1,1]<stdout>:--- Get training operator
>     [1,0]<stderr>:INFO:tensorflow:Create CheckpointSaverHook.
>     [1,0]<stderr>:Create CheckpointSaverHook.
>     [1,0]<stderr>:INFO:tensorflow:Create CheckpointSaverHook.
>     [1,0]<stderr>:Create CheckpointSaverHook.
>     [1,0]<stderr>:WARNING:tensorflow:From /databricks/python/lib/python3.7/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
>     [1,0]<stderr>:Instructions for updating:
>     [1,0]<stderr>:Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
>     [1,0]<stderr>:From /databricks/python/lib/python3.7/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
>     [1,0]<stderr>:Instructions for updating:
>     [1,0]<stderr>:Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
>     [1,1]<stderr>:INFO:tensorflow:Create CheckpointSaverHook.
>     [1,1]<stderr>:Create CheckpointSaverHook.
>     [1,1]<stderr>:INFO:tensorflow:Create CheckpointSaverHook.
>     [1,1]<stderr>:Create CheckpointSaverHook.
>     [1,0]<stderr>:INFO:tensorflow:Graph was finalized.
>     [1,0]<stderr>:Graph was finalized.
>     [1,0]<stderr>:2021-01-04 17:30:26.378385: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
>     [1,0]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
>     [1,0]<stderr>:2021-01-04 17:30:26.403662: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2499995000 Hz
>     [1,0]<stderr>:2021-01-04 17:30:26.404019: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5587faad13f0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
>     [1,0]<stderr>:2021-01-04 17:30:26.404049: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
>     [1,0]<stderr>:2021-01-04 17:30:26.499397: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,0]<stderr>:2021-01-04 17:30:26.500320: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5587faabc500 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
>     [1,0]<stderr>:2021-01-04 17:30:26.500353: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
>     [1,0]<stderr>:2021-01-04 17:30:26.500691: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,0]<stderr>:2021-01-04 17:30:26.501549: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
>     [1,0]<stderr>:pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
>     [1,0]<stderr>:coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
>     [1,0]<stderr>:2021-01-04 17:30:26.501625: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>     [1,0]<stderr>:2021-01-04 17:30:26.501696: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
>     [1,0]<stderr>:2021-01-04 17:30:26.501769: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
>     [1,0]<stderr>:2021-01-04 17:30:26.501800: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
>     [1,0]<stderr>:2021-01-04 17:30:26.501829: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
>     [1,0]<stderr>:2021-01-04 17:30:26.501857: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
>     [1,0]<stderr>:2021-01-04 17:30:26.501885: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
>     [1,0]<stderr>:2021-01-04 17:30:26.502014: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,0]<stderr>:2021-01-04 17:30:26.502935: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,0]<stderr>:2021-01-04 17:30:26.503846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
>     [1,0]<stderr>:2021-01-04 17:30:26.503903: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>     [1,1]<stderr>:WARNING:tensorflow:From /databricks/python/lib/python3.7/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
>     [1,1]<stderr>:Instructions for updating:
>     [1,1]<stderr>:Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
>     [1,1]<stderr>:From /databricks/python/lib/python3.7/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
>     [1,1]<stderr>:Instructions for updating:
>     [1,1]<stderr>:Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
>     [1,1]<stderr>:INFO:tensorflow:Graph was finalized.
>     [1,1]<stderr>:Graph was finalized.
>     [1,1]<stderr>:2021-01-04 17:30:27.413689: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
>     [1,1]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
>     [1,1]<stderr>:2021-01-04 17:30:27.451878: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2499995000 Hz
>     [1,1]<stderr>:2021-01-04 17:30:27.452210: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5631a5a623f0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
>     [1,1]<stderr>:2021-01-04 17:30:27.452249: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
>     [1,0]<stderr>:2021-01-04 17:30:27.462163: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
>     [1,0]<stderr>:2021-01-04 17:30:27.462213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
>     [1,0]<stderr>:2021-01-04 17:30:27.462226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
>     [1,0]<stderr>:2021-01-04 17:30:27.463593: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,0]<stderr>:2021-01-04 17:30:27.464598: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,0]<stderr>:2021-01-04 17:30:27.465438: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13943 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
>     [1,1]<stderr>:2021-01-04 17:30:27.565670: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,1]<stderr>:2021-01-04 17:30:27.566566: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5631a575bc90 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
>     [1,1]<stderr>:2021-01-04 17:30:27.566604: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
>     [1,1]<stderr>:2021-01-04 17:30:27.568675: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,1]<stderr>:2021-01-04 17:30:27.569510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
>     [1,1]<stderr>:pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
>     [1,1]<stderr>:coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
>     [1,1]<stderr>:2021-01-04 17:30:27.569562: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>     [1,1]<stderr>:2021-01-04 17:30:27.569637: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
>     [1,1]<stderr>:2021-01-04 17:30:27.569686: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
>     [1,1]<stderr>:2021-01-04 17:30:27.569717: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
>     [1,1]<stderr>:2021-01-04 17:30:27.569744: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
>     [1,1]<stderr>:2021-01-04 17:30:27.569770: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
>     [1,1]<stderr>:2021-01-04 17:30:27.569798: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
>     [1,1]<stderr>:2021-01-04 17:30:27.569919: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,1]<stderr>:2021-01-04 17:30:27.570828: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,1]<stderr>:2021-01-04 17:30:27.571633: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
>     [1,1]<stderr>:2021-01-04 17:30:27.572656: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>     [1,0]<stderr>:2021-01-04 17:30:27.764091: W tensorflow/core/common_runtime/colocation_graph.cc:1139] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
>     [1,0]<stderr>:  /job:localhost/replica:0/task:0/device:CPU:0].
>     [1,0]<stderr>:See below for details of this colocation group:
>     [1,0]<stderr>:Colocation Debug Info:
>     [1,0]<stderr>:Colocation group had the following types and supported devices:
>     [1,0]<stderr>:Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
>     [1,0]<stderr>:ReadVariableOp: GPU CPU XLA_CPU XLA_GPU
>     [1,0]<stderr>:AssignVariableOp: CPU XLA_CPU XLA_GPU
>     [1,0]<stderr>:VarIsInitializedOp: GPU CPU XLA_CPU XLA_GPU
>     [1,0]<stderr>:Const: GPU CPU XLA_CPU XLA_GPU
>     [1,0]<stderr>:VarHandleOp: CPU XLA_CPU XLA_GPU
>     [1,0]<stderr>:
>     [1,0]<stderr>:Colocation members, user-requested devices, and framework assigned devices, if any:
>     [1,0]<stderr>:  Variable/Initializer/initial_value (Const)
>     [1,0]<stderr>:  Variable (VarHandleOp) /device:GPU:0
>     [1,0]<stderr>:  Variable/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /device:GPU:0
>     [1,0]<stderr>:  Variable/Assign (AssignVariableOp) /device:GPU:0
>     [1,0]<stderr>:  Variable/Read/ReadVariableOp (ReadVariableOp) /device:GPU:0
>     [1,0]<stderr>:  ReadVariableOp (ReadVariableOp)
>     [1,0]<stderr>:  ReadVariableOp_4 (ReadVariableOp)
>     [1,0]<stderr>:  save/AssignVariableOp (AssignVariableOp) /device:GPU:0
>     [1,0]<stderr>:  HorovodBroadcast_Variable_0/ReadVariableOp (ReadVariableOp)
>     [1,0]<stderr>:  AssignVariableOp (AssignVariableOp)
>     [1,0]<stderr>:  ReadVariableOp_5 (ReadVariableOp)
>     [1,0]<stderr>:  report_uninitialized_variables/VarIsInitializedOp (VarIsInitializedOp)
>     [1,0]<stderr>:  report_uninitialized_variables_1/VarIsInitializedOp (VarIsInitializedOp)
>     [1,0]<stderr>:  save_1/AssignVariableOp (AssignVariableOp) /device:GPU:0
>     [1,0]<stderr>:
>     [1,0]<stderr>:INFO:tensorflow:Running local_init_op.
>     [1,0]<stderr>:Running local_init_op.
>     [1,0]<stderr>:INFO:tensorflow:Done running local_init_op.
>     [1,0]<stderr>:Done running local_init_op.
>     [1,1]<stderr>:2021-01-04 17:30:28.981517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
>     [1,1]<stderr>:2021-01-04 17:30:28.981570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
>     [1,1]<stderr>:2021-01-04 17:30:28.981580: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
>     [1,1]<stderr>:2021-01-04 17:30:28.982650: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,1]<stderr>:2021-01-04 17:30:28.983624: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
>     [1,1]<stderr>:2021-01-04 17:30:28.984487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13943 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
>     [1,1]<stderr>:2021-01-04 17:30:29.296559: W tensorflow/core/common_runtime/colocation_graph.cc:1139] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
>     [1,1]<stderr>:  /job:localhost/replica:0/task:0/device:CPU:0].
>     [1,1]<stderr>:See below for details of this colocation group:
>     [1,1]<stderr>:Colocation Debug Info:
>     [1,1]<stderr>:Colocation group had the following types and supported devices:
>     [1,1]<stderr>:Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
>     [1,1]<stderr>:ReadVariableOp: GPU CPU XLA_CPU XLA_GPU
>     [1,1]<stderr>:AssignVariableOp: CPU XLA_CPU XLA_GPU
>     [1,1]<stderr>:VarIsInitializedOp: GPU CPU XLA_CPU XLA_GPU
>     [1,1]<stderr>:Const: GPU CPU XLA_CPU XLA_GPU
>     [1,1]<stderr>:VarHandleOp: CPU XLA_CPU XLA_GPU
>     [1,1]<stderr>:
>     [1,1]<stderr>:Colocation members, user-requested devices, and framework assigned devices, if any:
>     [1,1]<stderr>:  Variable/Initializer/initial_value (Const)
>     [1,1]<stderr>:  Variable (VarHandleOp) /device:GPU:0
>     [1,1]<stderr>:  Variable/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /device:GPU:0
>     [1,1]<stderr>:  Variable/Assign (AssignVariableOp) /device:GPU:0
>     [1,1]<stderr>:  Variable/Read/ReadVariableOp (ReadVariableOp) /device:GPU:0
>     [1,1]<stderr>:  ReadVariableOp (ReadVariableOp)
>     [1,1]<stderr>:  ReadVariableOp_4 (ReadVariableOp)
>     [1,1]<stderr>:  save/AssignVariableOp (AssignVariableOp) /device:GPU:0
>     [1,1]<stderr>:  HorovodBroadcast_Variable_0/ReadVariableOp (ReadVariableOp)
>     [1,1]<stderr>:  AssignVariableOp (AssignVariableOp)
>     [1,1]<stderr>:  ReadVariableOp_5 (ReadVariableOp)
>     [1,1]<stderr>:  report_uninitialized_variables/VarIsInitializedOp (VarIsInitializedOp)
>     [1,1]<stderr>:  report_uninitialized_variables_1/VarIsInitializedOp (VarIsInitializedOp)
>     [1,1]<stderr>:  save_1/AssignVariableOp (AssignVariableOp) /device:GPU:0
>     [1,1]<stderr>:
>     [1,1]<stderr>:INFO:tensorflow:Running local_init_op.
>     [1,1]<stderr>:Running local_init_op.
>     [1,1]<stderr>:INFO:tensorflow:Done running local_init_op.
>     [1,1]<stderr>:Done running local_init_op.
>     [1,0]<stderr>:INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
>     [1,0]<stderr>:Calling checkpoint listeners before saving checkpoint 0...
>     [1,0]<stderr>:INFO:tensorflow:Saving checkpoints for 0 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/model.ckpt.
>     [1,0]<stderr>:Saving checkpoints for 0 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/model.ckpt.
>     [1,1]<stderr>:INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
>     [1,1]<stderr>:Calling checkpoint listeners before saving checkpoint 0...
>     [1,1]<stderr>:INFO:tensorflow:Saving checkpoints for 0 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/model.ckpt.
>     [1,1]<stderr>:Saving checkpoints for 0 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/model.ckpt.
>     [1,0]<stderr>:INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
>     [1,0]<stderr>:Calling checkpoint listeners after saving checkpoint 0...
>     [1,1]<stderr>:INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
>     [1,1]<stderr>:Calling checkpoint listeners after saving checkpoint 0...
>     [1,1]<stderr>:INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
>     [1,1]<stderr>:Calling checkpoint listeners before saving checkpoint 0...
>     [1,1]<stderr>:INFO:tensorflow:Saving checkpoints for 0 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/cluster.ckpt.
>     [1,1]<stderr>:Saving checkpoints for 0 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/cluster.ckpt.
>
>     *** WARNING: skipped 5495 bytes of output ***
>
>     [1,0]<stderr>:WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,0]<stderr>:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,1]<stderr>:WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,1]<stderr>:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.38528
>     [1,0]<stderr>:global_step/sec: 2.38528
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.34714
>     [1,1]<stderr>:global_step/sec: 2.34714
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 3.0324
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 3.03245
>     [1,0]<stderr>:global_step/sec: 3.0324
>     [1,1]<stderr>:global_step/sec: 3.03245
>     [1,0]<stdout>:learning rate: 0.001000
>     [1,0]<stdout>:train mean loss: 3.825902
>     [1,0]<stdout>:train clustering accuracy: 0.000000
>     [1,1]<stdout>:learning rate: 0.001000
>     [1,1]<stdout>:train mean loss: 4.023047
>     [1,1]<stdout>:train clustering accuracy: 0.000000
>     [1,0]<stdout>:loaded 37833 events
>     [1,1]<stdout>:loaded 37833 events
>     [1,1]<stderr>:WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 295 vs previous value: 295. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,1]<stderr>:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 295 vs previous value: 295. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,0]<stderr>:WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 295 vs previous value: 295. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,0]<stderr>:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 295 vs previous value: 295. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,1]<stderr>:WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 295 vs previous value: 295. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,1]<stderr>:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 295 vs previous value: 295. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,0]<stderr>:WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 295 vs previous value: 295. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,0]<stderr>:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 295 vs previous value: 295. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,1]<stderr>:WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 295 vs previous value: 295. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,1]<stderr>:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 295 vs previous value: 295. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,0]<stderr>:WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 295 vs previous value: 295. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,0]<stderr>:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 295 vs previous value: 295. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,1]<stderr>:WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 295 vs previous value: 295. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,1]<stderr>:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 295 vs previous value: 295. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,0]<stderr>:WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 295 vs previous value: 295. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,0]<stderr>:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 295 vs previous value: 295. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
>     [1,1]<stdout>:test mean loss: 1.974160
>     [1,1]<stdout>:testing clustering accuracy: 0.000000
>     [1,1]<stdout>:
>     [1,1]<stdout>:**** EPOCH 001 ****
>     [1,0]<stdout>:test mean loss: 1.894498
>     [1,0]<stdout>:testing clustering accuracy: 0.000000
>     [1,0]<stdout>:
>     [1,0]<stdout>:**** EPOCH 001 ****
>     [1,1]<stdout>:loaded 151332 events
>     [1,1]<stdout>:2021-01-04 17:32:45.264640
>     [1,0]<stdout>:loaded 151332 events
>     [1,0]<stdout>:2021-01-04 17:32:45.838243
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.04904
>     [1,0]<stderr>:global_step/sec: 2.04904
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.04898
>     [1,1]<stderr>:global_step/sec: 2.04898
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.93649
>     [1,0]<stderr>:global_step/sec: 2.93649
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.93658
>     [1,1]<stderr>:global_step/sec: 2.93658
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.89172
>     [1,0]<stderr>:global_step/sec: 2.89172
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.89167
>     [1,1]<stderr>:global_step/sec: 2.89167
>     [1,0]<stdout>:learning rate: 0.001000
>     [1,0]<stdout>:train mean loss: 1.479257
>     [1,0]<stdout>:train clustering accuracy: 0.000000
>     [1,1]<stdout>:learning rate: 0.001000
>     [1,1]<stdout>:train mean loss: 1.611109
>     [1,1]<stdout>:train clustering accuracy: 0.000000
>     [1,0]<stdout>:loaded 37833 events
>     [1,1]<stdout>:loaded 37833 events
>     [1,1]<stdout>:test mean loss: 1.255668
>     [1,1]<stdout>:testing clustering accuracy: 0.000000
>     [1,1]<stdout>:
>     [1,1]<stdout>:**** EPOCH 002 ****
>     [1,1]<stdout>:loaded 151332 events
>     [1,1]<stdout>:2021-01-04 17:34:44.019857
>     [1,0]<stdout>:test mean loss: 2.434365
>     [1,0]<stdout>:testing clustering accuracy: 0.000000
>     [1,0]<stdout>:
>     [1,0]<stdout>:**** EPOCH 002 ****
>     [1,0]<stdout>:loaded 151332 events
>     [1,0]<stdout>:2021-01-04 17:34:44.976721
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 1.81088
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 1.8109
>     [1,0]<stderr>:global_step/sec: 1.81088
>     [1,1]<stderr>:global_step/sec: 1.8109
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.84508
>     [1,0]<stderr>:global_step/sec: 2.84508
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.84507
>     [1,1]<stderr>:global_step/sec: 2.84507
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.84237
>     [1,0]<stderr>:global_step/sec: 2.84237
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.84234
>     [1,1]<stderr>:global_step/sec: 2.84234
>     [1,0]<stdout>:learning rate: 0.001000
>     [1,0]<stdout>:train mean loss: 0.073204
>     [1,0]<stdout>:train clustering accuracy: 0.565135
>     [1,1]<stdout>:learning rate: 0.001000
>     [1,1]<stdout>:train mean loss: 0.029733
>     [1,1]<stdout>:train clustering accuracy: 0.590128
>     [1,0]<stdout>:loaded 37833 events
>     [1,1]<stdout>:loaded 37833 events
>     [1,1]<stdout>:test mean loss: 0.035144
>     [1,1]<stdout>:testing clustering accuracy: 0.583770
>     [1,1]<stdout>:
>     [1,1]<stdout>:**** EPOCH 003 ****
>     [1,1]<stdout>:loaded 151332 events
>     [1,1]<stdout>:2021-01-04 17:36:46.278189
>     [1,0]<stdout>:test mean loss: 0.073979
>     [1,0]<stdout>:testing clustering accuracy: 0.535852
>     [1,0]<stdout>:
>     [1,0]<stdout>:**** EPOCH 003 ****
>     [1,0]<stdout>:loaded 151332 events
>     [1,0]<stdout>:2021-01-04 17:36:47.495020
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 1.95912
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 1.95915
>     [1,1]<stderr>:global_step/sec: 1.95915
>     [1,0]<stderr>:global_step/sec: 1.95912
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.8459
>     [1,1]<stderr>:global_step/sec: 2.8459
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.84588
>     [1,0]<stderr>:global_step/sec: 2.84588
>     [1,1]<stderr>:INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 1000...
>     [1,1]<stderr>:Calling checkpoint listeners before saving checkpoint 1000...
>     [1,1]<stderr>:INFO:tensorflow:Saving checkpoints for 1000 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/cluster.ckpt.
>     [1,1]<stderr>:Saving checkpoints for 1000 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/cluster.ckpt.
>     [1,0]<stderr>:INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 1000...
>     [1,0]<stderr>:Calling checkpoint listeners before saving checkpoint 1000...
>     [1,0]<stderr>:INFO:tensorflow:Saving checkpoints for 1000 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/cluster.ckpt.
>     [1,0]<stderr>:Saving checkpoints for 1000 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/cluster.ckpt.
>     [1,0]<stderr>:INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 1000...
>     [1,0]<stderr>:Calling checkpoint listeners after saving checkpoint 1000...
>     [1,1]<stderr>:INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 1000...
>     [1,1]<stderr>:Calling checkpoint listeners after saving checkpoint 1000...
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.6452
>     [1,1]<stderr>:global_step/sec: 2.6452
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.64518
>     [1,0]<stderr>:global_step/sec: 2.64518
>     [1,0]<stdout>:learning rate: 0.001000
>     [1,0]<stdout>:train mean loss: 0.047254
>     [1,0]<stdout>:train clustering accuracy: 0.592644
>     [1,1]<stdout>:learning rate: 0.001000
>     [1,1]<stdout>:train mean loss: 0.028695
>     [1,1]<stdout>:train clustering accuracy: 0.583104
>     [1,0]<stdout>:loaded 37833 events
>     [1,1]<stdout>:loaded 37833 events
>     [1,1]<stdout>:test mean loss: 0.032069
>     [1,1]<stdout>:testing clustering accuracy: 0.577938
>     [1,1]<stdout>:
>     [1,1]<stdout>:**** EPOCH 004 ****
>     [1,1]<stdout>:loaded 151332 events
>     [1,1]<stdout>:2021-01-04 17:38:48.063939
>     [1,0]<stdout>:test mean loss: 0.046610
>     [1,0]<stdout>:testing clustering accuracy: 0.558246
>     [1,0]<stdout>:
>     [1,0]<stdout>:**** EPOCH 004 ****
>     [1,0]<stdout>:loaded 151332 events
>     [1,0]<stdout>:2021-01-04 17:38:49.158460
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 1.98098
>     [1,0]<stderr>:global_step/sec: 1.98098
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 1.98091
>     [1,1]<stderr>:global_step/sec: 1.98091
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.84389
>     [1,1]<stderr>:global_step/sec: 2.84389
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.84377
>     [1,0]<stderr>:global_step/sec: 2.84377
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.84688
>     [1,0]<stderr>:global_step/sec: 2.84688
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.84681
>     [1,1]<stderr>:global_step/sec: 2.84681
>     [1,0]<stdout>:learning rate: 0.001000
>     [1,0]<stdout>:train mean loss: 0.033857
>     [1,0]<stdout>:train clustering accuracy: 0.608349
>     [1,1]<stdout>:learning rate: 0.001000
>     [1,1]<stdout>:train mean loss: 0.026509
>     [1,1]<stdout>:train clustering accuracy: 0.584607
>     [1,0]<stdout>:loaded 37833 events
>     [1,1]<stdout>:loaded 37833 events
>     [1,0]<stderr>:INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 1475...
>     [1,0]<stderr>:Calling checkpoint listeners before saving checkpoint 1475...
>     [1,0]<stderr>:INFO:tensorflow:Saving checkpoints for 1475 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/model.ckpt.
>     [1,0]<stderr>:Saving checkpoints for 1475 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/model.ckpt.
>     [1,1]<stderr>:INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 1475...
>     [1,1]<stderr>:Calling checkpoint listeners before saving checkpoint 1475...
>     [1,1]<stderr>:INFO:tensorflow:Saving checkpoints for 1475 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/model.ckpt.
>     [1,1]<stderr>:Saving checkpoints for 1475 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/model.ckpt.
>     [1,0]<stderr>:INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 1475...
>     [1,0]<stderr>:Calling checkpoint listeners after saving checkpoint 1475...
>     [1,1]<stderr>:INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 1475...
>     [1,1]<stderr>:Calling checkpoint listeners after saving checkpoint 1475...
>     [1,1]<stdout>:test mean loss: 0.033016
>     [1,1]<stdout>:testing clustering accuracy: 0.577536
>     [1,1]<stdout>:
>     [1,1]<stdout>:**** EPOCH 005 ****
>     [1,1]<stdout>:loaded 151332 events
>     [1,1]<stdout>:2021-01-04 17:40:50.403813
>     [1,0]<stdout>:test mean loss: 0.033497[1,0]<stdout>:
>     [1,0]<stdout>:testing clustering accuracy: 0.544601
>     [1,0]<stdout>:
>     [1,0]<stdout>:**** EPOCH 005 ****
>     [1,0]<stdout>:loaded 151332 events
>     [1,0]<stdout>:2021-01-04 17:40:52.117234
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 1.837
>     [1,1]<stderr>:global_step/sec: 1.837
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 1.83696
>     [1,0]<stderr>:global_step/sec: 1.83696
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.84922
>     [1,0]<stderr>:global_step/sec: 2.84922
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.8491
>     [1,1]<stderr>:global_step/sec: 2.8491
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.86428
>     [1,0]<stderr>:global_step/sec: 2.86428
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.86434
>     [1,1]<stderr>:global_step/sec: 2.86434
>     [1,0]<stdout>:learning rate: 0.001000[1,0]<stdout>:
>     [1,0]<stdout>:train mean loss: 0.026064
>     [1,0]<stdout>:train clustering accuracy: 0.643836[1,0]<stdout>:
>     [1,1]<stdout>:learning rate: 0.001000
>     [1,1]<stdout>:train mean loss: 0.025038
>     [1,1]<stdout>:train clustering accuracy: 0.584362
>     [1,0]<stdout>:loaded 37833 events
>     [1,1]<stdout>:loaded 37833 events
>     [1,1]<stdout>:test mean loss: 0.027497
>     [1,1]<stdout>:testing clustering accuracy: 0.562982
>     [1,1]<stdout>:
>     [1,1]<stdout>:**** EPOCH 006 ****
>     [1,1]<stdout>:loaded 151332 events
>     [1,1]<stdout>:2021-01-04 17:42:49.541722
>     [1,0]<stdout>:test mean loss: 0.025077[1,0]<stdout>:
>     [1,0]<stdout>:testing clustering accuracy: 0.560493
>     [1,0]<stdout>:
>     [1,0]<stdout>:**** EPOCH 006 ****
>     [1,0]<stdout>:loaded 151332 events
>     [1,0]<stdout>:2021-01-04 17:42:50.501567
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 1.99568
>     [1,1]<stderr>:global_step/sec: 1.99568
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 1.99566
>     [1,0]<stderr>:global_step/sec: 1.99566
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.85026
>     [1,0]<stderr>:global_step/sec: 2.85026
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.85017
>     [1,1]<stderr>:global_step/sec: 2.85017
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.85591
>     [1,0]<stderr>:global_step/sec: 2.85591
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.85594
>     [1,1]<stderr>:global_step/sec: 2.85594
>     [1,0]<stderr>:INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 2000...
>     [1,0]<stderr>:Calling checkpoint listeners before saving checkpoint 2000...
>     [1,0]<stderr>:INFO:tensorflow:Saving checkpoints for 2000 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/cluster.ckpt.
>     [1,0]<stderr>:Saving checkpoints for 2000 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/cluster.ckpt.
>     [1,1]<stderr>:INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 2000...
>     [1,1]<stderr>:Calling checkpoint listeners before saving checkpoint 2000...
>     [1,1]<stderr>:INFO:tensorflow:Saving checkpoints for 2000 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/cluster.ckpt.
>     [1,1]<stderr>:Saving checkpoints for 2000 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/cluster.ckpt.
>     [1,1]<stderr>:INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 2000...
>     [1,1]<stderr>:Calling checkpoint listeners after saving checkpoint 2000...
>     [1,0]<stderr>:INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 2000...
>     [1,0]<stderr>:Calling checkpoint listeners after saving checkpoint 2000...
>     [1,0]<stdout>:learning rate: 0.001000[1,0]<stdout>:
>     [1,0]<stdout>:train mean loss: 0.021587
>     [1,0]<stdout>:train clustering accuracy: 0.646451
>     [1,1]<stdout>:learning rate: 0.001000
>     [1,1]<stdout>:train mean loss: 0.023550
>     [1,1]<stdout>:train clustering accuracy: 0.584812
>     [1,0]<stdout>:loaded 37833 events
>     [1,1]<stdout>:loaded 37833 events
>     [1,1]<stdout>:test mean loss: 0.025361
>     [1,1]<stdout>:testing clustering accuracy: 0.553805
>     [1,1]<stdout>:
>     [1,1]<stdout>:**** EPOCH 007 ****
>     [1,1]<stdout>:loaded 151332 events
>     [1,1]<stdout>:2021-01-04 17:44:50.785483
>     [1,0]<stdout>:test mean loss: 0.022237[1,0]<stdout>:
>     [1,0]<stdout>:testing clustering accuracy: 0.557256
>     [1,0]<stdout>:
>     [1,0]<stdout>:**** EPOCH 007 ****
>     [1,0]<stdout>:loaded 151332 events
>     [1,0]<stdout>:2021-01-04 17:44:51.680681
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 1.89396
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 1.89394
>     [1,1]<stderr>:global_step/sec: 1.89396
>     [1,0]<stderr>:global_step/sec: 1.89394
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.85501
>     [1,0]<stderr>:global_step/sec: 2.85501
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.85498
>     [1,1]<stderr>:global_step/sec: 2.85498
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.85526
>     [1,0]<stderr>:global_step/sec: 2.85526
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.85522
>     [1,1]<stderr>:global_step/sec: 2.85522
>     [1,0]<stdout>:learning rate: 0.001000[1,0]<stdout>:
>     [1,0]<stdout>:train mean loss: 0.019167
>     [1,0]<stdout>:train clustering accuracy: 0.631786
>     [1,1]<stdout>:learning rate: 0.001000
>     [1,1]<stdout>:train mean loss: 0.021949
>     [1,1]<stdout>:train clustering accuracy: 0.579774
>     [1,0]<stdout>:loaded 37833 events
>     [1,1]<stdout>:loaded 37833 events
>     [1,1]<stdout>:test mean loss: 0.022571
>     [1,1]<stdout>:testing clustering accuracy: 0.561590
>     [1,1]<stdout>:
>     [1,1]<stdout>:**** EPOCH 008 ****
>     [1,1]<stdout>:loaded 151332 events
>     [1,1]<stdout>:2021-01-04 17:46:49.376643
>     [1,0]<stdout>:test mean loss: 0.019591[1,0]<stdout>:
>     [1,0]<stdout>:testing clustering accuracy: 0.596399
>     [1,0]<stdout>:
>     [1,0]<stdout>:**** EPOCH 008 ****
>     [1,0]<stdout>:loaded 151332 events
>     [1,0]<stdout>:2021-01-04 17:46:50.271701
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 1.98749
>     [1,0]<stderr>:global_step/sec: 1.98749
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 1.98748
>     [1,1]<stderr>:global_step/sec: 1.98748
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.85509
>     [1,1]<stderr>:global_step/sec: 2.85509
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.855
>     [1,0]<stderr>:global_step/sec: 2.855
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.85604
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.85601
>     [1,1]<stderr>:global_step/sec: 2.85601
>     [1,0]<stderr>:global_step/sec: 2.85604
>     [1,0]<stdout>:learning rate: 0.001000[1,0]<stdout>:
>     [1,0]<stdout>:train mean loss: 0.017833
>     [1,0]<stdout>:train clustering accuracy: 0.612758
>     [1,1]<stdout>:learning rate: 0.001000
>     [1,1]<stdout>:train mean loss: 0.019644
>     [1,1]<stdout>:train clustering accuracy: 0.571921
>     [1,0]<stdout>:loaded 37833 events
>     [1,1]<stdout>:loaded 37833 events
>     [1,1]<stdout>:test mean loss: 0.019357
>     [1,1]<stdout>:testing clustering accuracy: 0.568440
>     [1,1]<stdout>:
>     [1,1]<stdout>:**** EPOCH 009 ****
>     [1,1]<stdout>:loaded 151332 events
>     [1,1]<stdout>:2021-01-04 17:48:47.868751
>     [1,0]<stdout>:test mean loss: 0.018919[1,0]<stdout>:
>     [1,0]<stdout>:testing clustering accuracy: 0.638806[1,0]<stdout>:
>     [1,0]<stdout>:
>     [1,0]<stdout>:**** EPOCH 009 ****
>     [1,0]<stdout>:loaded 151332 events
>     [1,0]<stdout>:2021-01-04 17:48:48.844865
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 1.98952
>     [1,0]<stderr>:global_step/sec: 1.98952
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 1.9895
>     [1,1]<stderr>:global_step/sec: 1.9895
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.85433
>     [1,0]<stderr>:global_step/sec: 2.85433
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.8543
>     [1,1]<stderr>:global_step/sec: 2.8543
>     [1,0]<stderr>:INFO:tensorflow:global_step/sec: 2.85624
>     [1,1]<stderr>:INFO:tensorflow:global_step/sec: 2.85631
>     [1,1]<stderr>:global_step/sec: 2.85631
>     [1,0]<stderr>:global_step/sec: 2.85624
>     [1,1]<stdout>:learning rate: 0.001000
>     [1,1]<stdout>:train mean loss: 0.017083
>     [1,1]<stdout>:train clustering accuracy: 0.566671
>     [1,0]<stdout>:learning rate: 0.001000[1,0]<stdout>:
>     [1,0]<stdout>:train mean loss: 0.017304
>     [1,0]<stdout>:train clustering accuracy: 0.599682
>     [1,0]<stdout>:loaded 37833 events
>     [1,1]<stdout>:loaded 37833 events
>     [1,0]<stderr>:INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 2950...
>     [1,0]<stderr>:Calling checkpoint listeners before saving checkpoint 2950...
>     [1,0]<stderr>:INFO:tensorflow:Saving checkpoints for 2950 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/model.ckpt.
>     [1,0]<stderr>:Saving checkpoints for 2950 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/model.ckpt.
>     [1,1]<stderr>:INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 2950...
>     [1,1]<stderr>:Calling checkpoint listeners before saving checkpoint 2950...
>     [1,1]<stderr>:INFO:tensorflow:Saving checkpoints for 2950 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/model.ckpt.
>     [1,1]<stderr>:Saving checkpoints for 2950 into /dbfs/databricks/driver/06_LHC/logs/train/1609781294.4809902/model.ckpt.
>     [1,0]<stderr>:WARNING:tensorflow:From /databricks/python/lib/python3.7/site-packages/tensorflow/python/training/saver.py:971: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
>     [1,0]<stderr>:Instructions for updating:
>     [1,0]<stderr>:Use standard file APIs to delete files with this prefix.
>     [1,0]<stderr>:From /databricks/python/lib/python3.7/site-packages/tensorflow/python/training/saver.py:971: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
>     [1,0]<stderr>:Instructions for updating:
>     [1,0]<stderr>:Use standard file APIs to delete files with this prefix.
>     [1,0]<stderr>:INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 2950...
>     [1,0]<stderr>:Calling checkpoint listeners after saving checkpoint 2950...
>     [1,1]<stderr>:WARNING:tensorflow:From /databricks/python/lib/python3.7/site-packages/tensorflow/python/training/saver.py:971: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
>     [1,1]<stderr>:Instructions for updating:
>     [1,1]<stderr>:Use standard file APIs to delete files with this prefix.
>     [1,1]<stderr>:From /databricks/python/lib/python3.7/site-packages/tensorflow/python/training/saver.py:971: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
>     [1,1]<stderr>:Instructions for updating:
>     [1,1]<stderr>:Use standard file APIs to delete files with this prefix.
>     [1,1]<stderr>:INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 2950...
>     [1,1]<stderr>:Calling checkpoint listeners after saving checkpoint 2950...
>     [1,1]<stdout>:test mean loss: 0.015524
>     [1,1]<stdout>:testing clustering accuracy: 0.553644
>     [1,0]<stdout>:test mean loss: 0.018224[1,0]<stdout>:
>     [1,0]<stdout>:testing clustering accuracy: 0.600920[1,0]<stdout>:
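
Because both workers write to the same stream, the per-epoch metrics in the log above are interleaved and hard to compare by eye. The following is a minimal sketch of how they could be collected per worker. It assumes that the second number in the `[1,0]` / `[1,1]` prefix is the worker rank and that each metric belongs to the most recent `**** EPOCH ... ****` header printed by the same rank. The helper `parse_log` and the variable names are hypothetical and not part of the notebook.

```python
import re
from collections import defaultdict

# Prefix added by the launcher to every line: "[job,rank]<stream>:" (assumed meaning).
TAG = re.compile(r"\[\d+,(\d+)\]<(stdout|stderr)>:")
EPOCH = re.compile(r"\*{4} EPOCH (\d+) \*{4}")
METRIC = re.compile(
    r"(train mean loss|train clustering accuracy|"
    r"test mean loss|testing clustering accuracy): (\d+\.\d+)"
)

def parse_log(lines):
    """Return {rank: {epoch: {metric name: value}}} from tagged log lines."""
    current_epoch = {}  # last epoch header seen for each rank
    results = defaultdict(lambda: defaultdict(dict))
    for line in lines:
        tag = TAG.search(line)
        if tag is None or tag.group(2) != "stdout":
            continue  # skip stderr (TensorFlow logging) and untagged lines
        rank = int(tag.group(1))
        payload = line[tag.end():]
        epoch = EPOCH.search(payload)
        if epoch:
            current_epoch[rank] = int(epoch.group(1))
            continue
        metric = METRIC.search(payload)
        if metric and rank in current_epoch:
            results[rank][current_epoch[rank]][metric.group(1)] = float(metric.group(2))
    return results

# A few lines copied from the output above (epoch 5 of both workers).
sample = """\
>     [1,1]<stdout>:**** EPOCH 005 ****
>     [1,0]<stdout>:**** EPOCH 005 ****
>     [1,0]<stdout>:train mean loss: 0.026064
>     [1,0]<stdout>:train clustering accuracy: 0.643836[1,0]<stdout>:
>     [1,1]<stdout>:train mean loss: 0.025038
>     [1,1]<stdout>:train clustering accuracy: 0.584362
>     [1,1]<stdout>:test mean loss: 0.027497
>     [1,1]<stdout>:testing clustering accuracy: 0.562982
>     [1,0]<stdout>:test mean loss: 0.025077[1,0]<stdout>:
>     [1,0]<stdout>:testing clustering accuracy: 0.560493
""".splitlines()

parsed = parse_log(sample)
for rank in sorted(parsed):
    for epoch in sorted(parsed[rank]):
        print(rank, epoch, parsed[rank][epoch])
```

Metrics that appear before the first epoch header of a rank are skipped, since an excerpt alone does not say which epoch they belong to. The stray trailing `[1,0]<stdout>:` tags on some lines are ignored because only the first tag on each line is used to determine the rank.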

  

Results:

- Executing the command with np=2 GPUs takes 3.39 hours.
- The plot below shows the validation accuracy vs. epoch.
- Note that we switch to the full loss after n=10 epochs.
- We observe an improvement in the clustering accuracy on the validation set after around 50 epochs.
- The highest clustering accuracy on the validation set is about 68%.
- The output of the algorithm is the stored model.

![Validation accuracy vs. epoch](https://raw.githubusercontent.com/Tarnekar/bitstarter/master/validation_accuracy.png)
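
To make the reported clustering accuracy easier to interpret, here is a minimal sketch of how such a number is commonly computed. It assumes the usual definition for unsupervised clustering: cluster ids are arbitrary, so the accuracy is taken under the one-to-one mapping from cluster ids to true labels that gives the most matches. The notebook's import of `linear_sum_assignment` suggests this definition, but the actual implementation is not shown in this section, so the function below is hypothetical. It enumerates all mappings with the standard library, which is only practical for a small number of clusters such as `n_clusters=3`.

```python
from itertools import permutations

def clustering_accuracy(labels, cluster_ids, n_clusters):
    """Best accuracy over all one-to-one maps from cluster id to label.

    Assumes labels and cluster ids are integers in range(n_clusters).
    """
    # counts[c][l] = number of events in cluster c whose true label is l
    counts = [[0] * n_clusters for _ in range(n_clusters)]
    for label, cluster in zip(labels, cluster_ids):
        counts[cluster][label] += 1
    # Try every assignment "cluster c -> label perm[c]" and keep the best one.
    best_matches = max(
        sum(counts[c][perm[c]] for c in range(n_clusters))
        for perm in permutations(range(n_clusters))
    )
    return best_matches / len(labels)

# Toy example: six events, three classes, one event placed in the wrong cluster.
labels = [0, 0, 1, 1, 2, 2]
cluster_ids = [2, 2, 0, 0, 1, 0]
accuracy = clustering_accuracy(labels, cluster_ids, n_clusters=3)
print(f"clustering accuracy: {accuracy:.6f}")
```

For larger numbers of clusters, `scipy.optimize.linear_sum_assignment` finds the same optimal mapping without enumerating every permutation.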