# Distributed training with TensorFlow

tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes.

tf.distribute.Strategy has been designed with these key goals in mind:

- Easy to use and support multiple user segments, including researchers, ML engineers, etc.
- Provide good performance out of the box.
- Easy switching between strategies.

tf.distribute.Strategy can be used with a high-level API like Keras, and can also be used to distribute custom training loops (and, in general, any computation using TensorFlow).

In TensorFlow 2.0, you can execute your programs eagerly, or in a graph using tf.function. tf.distribute.Strategy intends to support both these modes of execution. Although we discuss training most of the time in this guide, this API can also be used for distributing evaluation and prediction on different platforms.

You can use tf.distribute.Strategy with very few changes to your code, because we have changed the underlying components of TensorFlow to become strategy-aware. This includes variables, layers, models, optimizers, metrics, summaries, and checkpoints.

In [1]:
%tensorflow_version 2.x

TensorFlow 2.x selected.


In [0]:
import tensorflow as tf
import numpy as np
import os

# GPU Strategies

tf.distribute.Strategy intends to cover a number of use cases along different axes. Some of these combinations are currently supported and others will be added in the future. Some of these axes are:

- Synchronous vs asynchronous training: These are two common ways of distributing training with data parallelism. In sync training, all workers train over different slices of input data in sync, aggregating gradients at each step. In async training, all workers independently train over the input data and update variables asynchronously. Typically, sync training is supported via all-reduce and async training through a parameter server architecture.
- Hardware platform: You may want to scale your training onto multiple GPUs on one machine, or multiple machines in a network (with 0 or more GPUs each), or on Cloud TPUs.
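
The difference between the two training modes can be sketched in plain Python. This is only a toy illustration: the loss function, the two-worker setup, the learning rate, and every name below are assumptions, and the async version shows just one possible interleaving of updates.

```python
# Toy comparison of synchronous and asynchronous data-parallel updates.
# Illustrative only: nothing here is TensorFlow code.

def gradient(w, x):
    """Gradient of the toy loss 0.5 * (w - x) ** 2 with respect to w."""
    return w - x

def sync_step(w, worker_data, lr):
    """All workers read the same w; gradients are summed, then applied once."""
    total = sum(gradient(w, x) for x in worker_data)
    return w - lr * total

def async_step(w, worker_data, lr):
    """Each worker reads the current w and applies its own update in turn."""
    for x in worker_data:
        w = w - lr * gradient(w, x)  # later workers see earlier updates
    return w

worker_data = [1.0, 3.0]  # one example per worker
w_sync = sync_step(0.0, worker_data, lr=0.1)    # approximately 0.4
w_async = async_step(0.0, worker_data, lr=0.1)  # approximately 0.39
```

In the sync case both gradients are computed against the same value of the variable, so the result does not depend on worker timing. In the async case the result depends on the order in which updates happen to land.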

In order to support these use cases, there are six strategies available. 

## Mirrored Strategy

tf.distribute.MirroredStrategy supports synchronous distributed training on multiple GPUs on one machine. 

It creates one replica per GPU device. Each variable in the model is mirrored across all the replicas. 

Together, these variables form a single conceptual variable called MirroredVariable. These variables are kept in sync with each other by applying identical updates.

Efficient all-reduce algorithms are used to communicate the variable updates across the devices. All-reduce aggregates tensors across all the devices by adding them up, and makes them available on each device. 

It’s a fused algorithm that is very efficient and can reduce the overhead of synchronization significantly. 

There are many all-reduce algorithms and implementations available, depending on the type of communication available between devices. By default, it uses NVIDIA NCCL as the all-reduce implementation.
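
To make the idea concrete, here is a minimal plain-Python sketch of an all-reduce sum followed by an identical update on every copy of a mirrored variable. It is not how NCCL or TensorFlow implement it: each "device" is just a Python list in one process, and the helper names are hypothetical.

```python
# Toy all-reduce (sum) over per-device gradient lists.
# Illustrative only: real implementations exchange data between devices.

def all_reduce_sum(per_device_tensors):
    """Sum tensors element-wise across devices and give every device the result."""
    summed = [sum(values) for values in zip(*per_device_tensors)]
    return [list(summed) for _ in per_device_tensors]

def apply_update(mirrored_variable, reduced_grads, lr):
    """Apply the same update to every copy of a mirrored variable."""
    return [
        [v - lr * g for v, g in zip(copy, grads)]
        for copy, grads in zip(mirrored_variable, reduced_grads)
    ]

# Two devices, each holding an identical copy of a 3-element variable.
mirrored_variable = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
# Each device computed different gradients from its own slice of the batch.
local_grads = [[0.5, 0.0, 1.0], [1.5, 2.0, -1.0]]

reduced = all_reduce_sum(local_grads)
mirrored_variable = apply_update(mirrored_variable, reduced, lr=0.5)
```

Because every device receives the same summed gradients and starts from the same values, the copies remain identical after the update.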

In [0]:
mirrored_strategy = tf.distribute.MirroredStrategy()

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)


This will create a MirroredStrategy instance which will use all the GPUs that are visible to TensorFlow, and use NCCL as the cross device communication.

If you wish to use only some of the GPUs on your machine, you can do so like this:

In [0]:
mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0"])

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)


If you wish to override the cross device communication, you can do so using the cross_device_ops argument by supplying an instance of tf.distribute.CrossDeviceOps. 

Currently, tf.distribute.HierarchicalCopyAllReduce and tf.distribute.ReductionToOneDevice are two options other than tf.distribute.NcclAllReduce which is the default.

In [0]:
mirrored_strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)


## Central Storage Strategy

tf.distribute.experimental.CentralStorageStrategy does synchronous training as well. 

Variables are not mirrored; instead, they are placed on the CPU and operations are replicated across all local GPUs. If there is only one GPU, all variables and operations will be placed on that GPU.

In [0]:
central_storage_strategy = tf.distribute.experimental.CentralStorageStrategy()

INFO:tensorflow:ParameterServerStrategy (CentralStorageStrategy if you are using a single machine) with compute_devices = ('/device:GPU:0',), variable_device = '/device:GPU:0'


This will create a CentralStorageStrategy instance which will use all visible GPUs and the CPU. Updates to variables on replicas will be aggregated before being applied to the variables.
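
The aggregate-then-apply behavior can be sketched in plain Python. This is a toy model under stated assumptions: a single scalar variable, a made-up loss, and hypothetical helper names, with no TensorFlow involved.

```python
# Toy model of central storage: one copy of the variable, several replicas.
# Illustrative only; names and numbers are hypothetical.

def toy_grad(w, x):
    """Gradient of 0.5 * (w * x - 1) ** 2 with respect to w."""
    return (w * x - 1.0) * x

def central_storage_step(variable, replica_inputs, grad_fn, lr):
    """Replicas compute gradients against the single shared variable; the
    gradients are aggregated first, then applied to the variable once."""
    replica_grads = [grad_fn(variable, x) for x in replica_inputs]
    aggregated = sum(replica_grads)
    return variable - lr * aggregated

central_variable = 0.0       # would live on the CPU
replica_inputs = [1.0, 2.0]  # one input per GPU replica
central_variable = central_storage_step(
    central_variable, replica_inputs, toy_grad, lr=0.1)  # approximately 0.3
```

Unlike the mirrored case, there is only one copy of the variable to update, so nothing has to be kept in sync across devices.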

## MultiWorkerMirroredStrategy

tf.distribute.experimental.MultiWorkerMirroredStrategy is very similar to MirroredStrategy. It implements synchronous distributed training across multiple workers, each with potentially multiple GPUs. Similar to MirroredStrategy, it creates copies of all variables in the model on each device across all workers.

It uses CollectiveOps as the multi-worker all-reduce communication method used to keep variables in sync. A collective op is a single op in the TensorFlow graph which can automatically choose an all-reduce algorithm in the TensorFlow runtime according to hardware, network topology and tensor sizes.

It also implements additional performance optimizations. For example, it includes a static optimization that converts multiple all-reductions on small tensors into fewer all-reductions on larger tensors. In addition, we are designing it to have a plugin architecture, so that in the future you will be able to plug in algorithms that are better tuned for your hardware. Note that collective ops also implement other collective operations such as broadcast and all-gather.
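
The idea of converting several small all-reductions into one larger one can be sketched as follows. The packing scheme below is a simplification chosen for illustration, not the TensorFlow runtime's actual optimization, and the helper names are hypothetical.

```python
# Toy version of fusing several small all-reduces into one larger one.

def all_reduce_sum(per_worker_buffers):
    """Element-wise sum across workers (one simulated communication round)."""
    return [sum(values) for values in zip(*per_worker_buffers)]

def fused_all_reduce(per_worker_tensors):
    """Pack each worker's small tensors into one flat buffer, reduce once, unpack."""
    sizes = [len(t) for t in per_worker_tensors[0]]
    packed = [[x for tensor in tensors for x in tensor]
              for tensors in per_worker_tensors]
    reduced = all_reduce_sum(packed)  # a single all-reduce instead of len(sizes)
    unpacked, start = [], 0
    for size in sizes:
        unpacked.append(reduced[start:start + size])
        start += size
    return unpacked

# Two workers, each with three small gradient tensors of sizes 1, 2 and 1.
worker_a = [[1.0], [2.0, 3.0], [4.0]]
worker_b = [[10.0], [20.0, 30.0], [40.0]]

fused = fused_all_reduce([worker_a, worker_b])
separate = [all_reduce_sum([a, b]) for a, b in zip(worker_a, worker_b)]
```

Both paths produce the same sums, but the fused path needs one communication round instead of three, which matters when each round has a fixed overhead.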

In [0]:
multiworker_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

INFO:tensorflow:Using MirroredStrategy with devices ('/device:GPU:0',)
INFO:tensorflow:Single-worker MultiWorkerMirroredStrategy with local_devices = ('/device:GPU:0',), communication = CollectiveCommunication.AUTO


MultiWorkerMirroredStrategy currently allows you to choose between two different implementations of collective ops. CollectiveCommunication.RING implements ring-based collectives using gRPC as the communication layer. CollectiveCommunication.NCCL uses Nvidia's NCCL to implement collectives. 

CollectiveCommunication.AUTO defers the choice to the runtime. 

The best choice of collective implementation depends upon the number and kind of GPUs, and the network interconnect in the cluster. 

In [0]:
multiworker_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(tf.distribute.experimental.CollectiveCommunication.NCCL)

INFO:tensorflow:Using MirroredStrategy with devices ('/device:GPU:0',)
INFO:tensorflow:Single-worker MultiWorkerMirroredStrategy with local_devices = ('/device:GPU:0',), communication = CollectiveCommunication.NCCL


# TPUStrategy

tf.distribute.experimental.TPUStrategy lets you run your TensorFlow training on Tensor Processing Units (TPUs). 

TPUs are Google's specialized ASICs designed to dramatically accelerate machine learning workloads.

In terms of distributed training architecture, TPUStrategy is the same as MirroredStrategy: it implements synchronous distributed training. 

TPUs provide their own implementation of efficient all-reduce and other collective operations across multiple TPU cores, which are used in TPUStrategy.

In [0]:
cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver()

In [0]:
tf.config.experimental_connect_to_cluster(cluster_resolver)

In [0]:
tf.tpu.experimental.initialize_tpu_system(cluster_resolver)





INFO:tensorflow:Initializing the TPU system: 10.103.142.26:8470
INFO:tensorflow:Clearing out eager caches
INFO:tensorflow:Finished initializing TPU system.

<tensorflow.python.tpu.topology.Topology at 0x7f8b873b7e48>

The TPUClusterResolver instance helps locate the TPUs. In Colab, you don't need to specify any arguments to it.

If you want to use this for Cloud TPUs:

- You must specify the name of your TPU resource in the tpu argument.
- You must initialize the TPU system explicitly at the start of the program. This is required before TPUs can be used for computation. Initializing the TPU system also wipes out the TPU memory, so it's important to complete this step first in order to avoid losing state.


## ParameterServerStrategy

tf.distribute.experimental.ParameterServerStrategy supports parameter server training on multiple machines. 

In this setup, some machines are designated as workers and some as parameter servers. Each variable of the model is placed on one parameter server. Computation is replicated across all GPUs of all the workers.
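
Multi-machine strategies typically learn which machines are workers and which are parameter servers from the TF_CONFIG environment variable. Below is a minimal sketch of such a configuration; the host names and ports are hypothetical placeholders, and the variable is assumed to be set before the strategy is created.

```python
import json
import os

# Hypothetical cluster: two workers and one parameter server.
# Host names and ports are placeholders, not real machines.
tf_config = {
    "cluster": {
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
        "ps": ["ps0.example.com:2222"],
    },
    # Describes the role of *this* process within the cluster.
    "task": {"type": "worker", "index": 0},
}

# Each process in the cluster sets its own TF_CONFIG with the same "cluster"
# entry and a different "task" entry.
os.environ["TF_CONFIG"] = json.dumps(tf_config)
```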

In [0]:
ps_strategy = tf.distribute.experimental.ParameterServerStrategy()

ValueError: ignored

# One Device Strategy

tf.distribute.OneDeviceStrategy runs on a single device. This strategy will place any variables created in its scope on the specified device. 

Input distributed through this strategy will be prefetched to the specified device. Moreover, any functions called via strategy.run will also be placed on the specified device.

You can use this strategy to test your code before switching to other strategies which actually distribute to multiple devices/machines.

In [0]:
strategy = tf.distribute.OneDeviceStrategy(device="gpu:0")

# Using with tf.Keras

We've integrated tf.distribute.Strategy into tf.keras which is TensorFlow's implementation of the Keras API specification. tf.keras is a high-level API to build and train models. By integrating into tf.keras backend, we've made it seamless for you to distribute your training written in the Keras training framework.

Here's what you need to change in your code:

- Create an instance of the appropriate tf.distribute.Strategy
- Move the creation and compiling of Keras model inside strategy.scope.

We support all types of Keras models - sequential, functional and subclassed.

In [0]:
mirrored_stratgy = tf.distribute.MirroredStrategy()
with mirrored_stratgy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1, ))])
    model.compile(loss='mse', optimizer='sgd')

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)


In this example we used MirroredStrategy so we can run this on a machine with multiple GPUs. strategy.scope() indicates which parts of the code to run distributed. 

Creating a model inside this scope allows us to create mirrored variables instead of regular variables. 

Compiling under the scope allows us to know that the user intends to train this model using this strategy. 

Once this is set up, you can fit your model like you would normally. 

MirroredStrategy takes care of replicating the model's training on the available GPUs, aggregating gradients, and more.

In [0]:
dataset = tf.data.Dataset.from_tensor_slices(([1.], [1.])).repeat(100).batch(10)

In [0]:
model.fit(dataset, epochs=2)
model.evaluate(dataset)

Train for 10 steps
Epoch 1/2
Epoch 2/2


1.3552119731903076

In [0]:
import numpy as np

In [0]:
inputs, targets = np.ones((100, 1)), np.ones((100, 1))
history = model.fit(inputs, targets, epochs=2, batch_size=10)

Train on 100 samples
Epoch 1/2
Epoch 2/2


In both cases (dataset or numpy), each batch of the given input is divided equally among the multiple replicas. 

For instance, if using MirroredStrategy with 2 GPUs, each batch of size 10 will get divided among the 2 GPUs, with each receiving 5 input examples in each step. 
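
The split described above amounts to simple slicing. The following plain-Python sketch only illustrates the arithmetic; the helper name is hypothetical, and the actual splitting is handled internally by the strategy.

```python
# Toy version of splitting one global batch evenly among replicas.

def split_batch(batch, num_replicas):
    """Return one equally sized shard of the batch per replica."""
    if len(batch) % num_replicas != 0:
        raise ValueError("this sketch assumes the batch divides evenly")
    shard_size = len(batch) // num_replicas
    return [batch[i * shard_size:(i + 1) * shard_size]
            for i in range(num_replicas)]

global_batch = list(range(10))         # a batch of 10 examples
shards = split_batch(global_batch, 2)  # 2 GPUs, 5 examples each
```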

Each epoch will then train faster as you add more GPUs. 

Typically, you would want to increase your batch size as you add more accelerators so as to make effective use of the extra computing power. You will also need to re-tune your learning rate, depending on the model. 

You can use strategy.num_replicas_in_sync to get the number of replicas.

In [0]:
BATCHES_SIZE_PER_REPLICA = 5
global_batch_size = (BATCHES_SIZE_PER_REPLICA * mirrored_stratgy.num_replicas_in_sync)

In [0]:
dataset = tf.data.Dataset.from_tensor_slices(([1.], [1.])).repeat(100)
dataset = dataset.batch(global_batch_size)

In [0]:
LEARNING_RATES_BY_BATCH_SIZE = {5: 0.1, 10: 0.15}
learning_rate = LEARNING_RATES_BY_BATCH_SIZE[global_batch_size]

# What's currently supported

In the TF 2.0 release, MirroredStrategy, TPUStrategy, CentralStorageStrategy and MultiWorkerMirroredStrategy are supported in Keras. 

Except for MirroredStrategy, these are currently experimental and subject to change. Support for other strategies will be coming soon. The API and its usage will be exactly the same as above.

# Simple Example 

In [0]:
import tensorflow_datasets as tfds
import tensorflow as tf
tfds.disable_progress_bar()
import os

In [0]:
datasets, info = tfds.load(name="mnist", with_info=True, as_supervised=True)

Downloading and preparing dataset mnist (11.06 MiB) to /root/tensorflow_datasets/mnist/3.0.0...

local data directory. If you'd instead prefer to read directly from our public
GCS bucket (recommended if you're running on GCP), you can instead set
data_dir=gs://tfds-data/datasets.

Dataset mnist downloaded and prepared to /root/tensorflow_datasets/mnist/3.0.0. Subsequent calls will reuse this data.


In [0]:
mnist_train, mnist_test = datasets['train'], datasets['test']

Create a MirroredStrategy object. This will handle distribution, and provides a context manager (tf.distribute.MirroredStrategy.scope) to build your model inside.

In [0]:
strategy = tf.distribute.MirroredStrategy()

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)



In [0]:
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

Number of devices: 1


When training a model with multiple GPUs, you can use the extra computing power effectively by increasing the batch size. 

In general, use the largest batch size that fits the GPU memory, and tune the learning rate accordingly.

In [0]:
num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples

In [0]:
BUFFER_SIZE = 10000
BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

In [0]:
def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255

    return image, label

Apply this function to the training and test data, shuffle the training data, and batch it for training. 

Notice we are also keeping an in-memory cache of the training data to improve performance.
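
Shuffling in an input pipeline is commonly done through a fixed-size buffer, so the buffer size limits how thoroughly the data can be mixed. The following is a simplified plain-Python model of that idea, not tf.data's implementation; the function name and the seed parameter are assumptions, and the buffer size is assumed to be at least 1.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a shuffled order using a fixed-size buffer.

    Simplified model: fill a buffer, then repeatedly emit a random buffered
    element and refill its slot from the input stream.
    """
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    rng.shuffle(buffer)  # drain whatever is left at the end
    yield from buffer

data = list(range(20))
well_mixed = list(buffered_shuffle(data, buffer_size=20))
barely_mixed = list(buffered_shuffle(data, buffer_size=1))
```

With a buffer as large as the dataset, any element can appear at any position. With a buffer of size 1, the order is unchanged, which is why a buffer size comparable to the dataset size tends to give a better shuffle than a very small one.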

In [0]:
train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)

Create and compile the Keras model in the context of strategy.scope.

In [0]:
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10)
    ])

    # Compile inside the scope as well, so Keras knows which strategy to train with.
    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['acc'])

The callbacks used here are:

- TensorBoard: This callback writes a log for TensorBoard which allows you to visualize the graphs.
- Model Checkpoint: This callback saves the model after every epoch.
- Learning Rate Scheduler: Using this callback, you can schedule the learning rate to change after every epoch/batch.

For illustrative purposes, add a print callback to display the learning rate in the notebook.

In [0]:
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

In [0]:
def decay(epoch):
    if epoch < 3:
        return 1e-3
    elif 3 <= epoch <= 7:
        return 1e-4
    else:
        return 1e-5

In [0]:
class PrintLR(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        print('\nLearning rate for epoch {} is {}'.format(epoch + 1, self.model.optimizer.lr.numpy()))

In [0]:
callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir='./logs'),
    tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
                                       save_weights_only=True),
    tf.keras.callbacks.LearningRateScheduler(decay),
    PrintLR()
]


Now, train the model in the usual way, calling fit on the model and passing in the dataset created at the beginning of the tutorial. This step is the same whether you are distributing the training or not.

In [0]:
model.fit(train_dataset, epochs=12, callbacks=callbacks)

# Custom Training

This tutorial demonstrates how to use tf.distribute.Strategy with custom training loops. 

We will train a simple CNN model on the fashion MNIST dataset. The fashion MNIST dataset contains 60,000 training images of size 28 x 28 and 10,000 test images of size 28 x 28.

We are using custom training loops to train our model because they give us flexibility and greater control over training. Moreover, it is easier to debug the model and the training loop.

In [3]:
fashion_mnist = tf.keras.datasets.fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz


In [0]:
train_images = tf.expand_dims(train_images, axis=-1)
test_images = tf.expand_dims(test_images, axis=-1)

In [5]:
print(train_images.shape, test_images.shape)

(60000, 28, 28, 1) (10000, 28, 28, 1)


In [0]:
train_images = train_images / np.float32(255)
test_images = test_images / np.float32(255)

## Create a strategy to distribute the variables and the graph

How does the tf.distribute.MirroredStrategy strategy work?

- All the variables and the model graph are replicated on the replicas.
- Input is evenly distributed across the replicas.
- Each replica calculates the loss and gradients for the input it received.
- The gradients are synced across all the replicas by summing them.
- After the sync, the same update is made to the copies of the variables on each replica.
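
The steps listed above can be walked through in plain Python. This is a toy sketch under stated assumptions: a scalar model y = w * x with squared error, hypothetical helper names, and no TensorFlow involved.

```python
# Toy walk-through of one synchronous step with mirrored variables.

def replica_gradient(w, examples, global_batch_size):
    """Gradient of sum((w * x - y) ** 2) / global_batch_size over one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in examples) / global_batch_size

def mirrored_step(weights, global_batch, lr):
    """weights holds one copy of w per replica; returns the updated copies."""
    num_replicas = len(weights)
    shard = len(global_batch) // num_replicas
    # 1. Input is split evenly across the replicas.
    shards = [global_batch[i * shard:(i + 1) * shard]
              for i in range(num_replicas)]
    # 2. Each replica computes gradients for the input it received.
    grads = [replica_gradient(w, s, len(global_batch))
             for w, s in zip(weights, shards)]
    # 3. Gradients are synced by summing them across replicas.
    total = sum(grads)
    # 4. The same update is applied to every copy of the variable.
    return [w - lr * total for w in weights]

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2 * x
weights = mirrored_step([0.0, 0.0], batch, lr=0.01)  # two replicas
single = mirrored_step([0.0], batch, lr=0.01)        # one device
```

Both replicas end up with the same value, and that value matches what a single device would compute on the whole batch.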


In [7]:
strategy = tf.distribute.MirroredStrategy()

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)


In [8]:
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

Number of devices: 1


## Setup input pipeline

Export the graph and the variables to the platform-agnostic SavedModel format. 

After your model is saved, you can load it with or without the scope.

In [0]:
BUFFER_SIZE = len(train_images)
BATCHES_SIZE_PER_REPLICA = 64
GLOBAL_BATCH_SIZE = BATCHES_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

In [0]:
EPOCHS = 10

In [0]:
train_dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE)
test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).batch(GLOBAL_BATCH_SIZE)

In [0]:
train_dist_dataset = strategy.experimental_distribute_dataset(train_dataset)
test_dist_dataset  = strategy.experimental_distribute_dataset(test_dataset)

In [0]:
def create_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10)
    ])

    return model

In [0]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")

# Define the loss function

Normally, on a single machine with 1 GPU/CPU, loss is divided by the number of examples in the batch of input.

So, how should the loss be calculated when using a tf.distribute.Strategy?

- For example, let's say you have 4 GPUs and a batch size of 64. One batch of input is distributed across the replicas (4 GPUs), each replica getting an input of size 16.

- The model on each replica does a forward pass with its respective input and calculates the loss. Now, instead of dividing the loss by the number of examples in its respective input (BATCH_SIZE_PER_REPLICA = 16), the loss should be divided by the GLOBAL_BATCH_SIZE (64).


Why do this?

- This needs to be done because after the gradients are calculated on each replica, they are synced across the replicas by summing them.
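
A small numeric check makes this concrete. The per-example losses below are made-up numbers, and a global batch of 8 is used instead of 64 to keep the example short.

```python
# Why per-replica losses are divided by the global batch size.

GLOBAL_BATCH_SIZE = 8
NUM_REPLICAS = 4
per_example_losses = [1.0, 3.0, 2.0, 2.0, 4.0, 0.0, 5.0, 7.0]

shard = GLOBAL_BATCH_SIZE // NUM_REPLICAS
replica_losses = [per_example_losses[i * shard:(i + 1) * shard]
                  for i in range(NUM_REPLICAS)]

# What a single device would compute: the mean over the whole batch.
single_device_loss = sum(per_example_losses) / GLOBAL_BATCH_SIZE

# Each replica divides by the global batch size; the sum matches.
scaled_correctly = sum(sum(losses) / GLOBAL_BATCH_SIZE
                       for losses in replica_losses)

# Each replica divides by its own shard size; the sum is
# NUM_REPLICAS times too large.
scaled_per_replica = sum(sum(losses) / len(losses)
                         for losses in replica_losses)
```

Gradients are linear in the loss, so the same factor carries over to the summed gradients.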


How to do this in TensorFlow?

- If you're writing a custom training loop, as in this tutorial, you should sum the per example losses and divide the sum by the GLOBAL_BATCH_SIZE: scale_loss = tf.reduce_sum(loss) * (1. / GLOBAL_BATCH_SIZE) or you can use tf.nn.compute_average_loss which takes the per example loss, optional sample weights, and GLOBAL_BATCH_SIZE as arguments and returns the scaled loss.

- If you are using regularization losses in your model, then you need to scale the loss value by the number of replicas. You can do this by using the tf.nn.scale_regularization_loss function.

- Using tf.reduce_mean is not recommended. Doing so divides the loss by actual per replica batch size which may vary step to step.

- This reduction and scaling is done automatically in keras model.compile and model.fit.

- If using tf.keras.losses classes (as in the example below), the loss reduction needs to be explicitly specified to be one of NONE or SUM. AUTO and SUM_OVER_BATCH_SIZE are disallowed when used with tf.distribute.Strategy. AUTO is disallowed because the user should explicitly think about what reduction they want to make sure it is correct in the distributed case. SUM_OVER_BATCH_SIZE is disallowed because currently it would only divide by per replica batch size, and leave the dividing by number of replicas to the user, which might be easy to miss. So instead we ask the user to do the reduction themselves explicitly.
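
The arithmetic behind these rules can be checked with plain-Python stand-ins. The two helpers below are hypothetical and only mimic the scaling described above; they are not the TensorFlow functions themselves, and the loss values are made up.

```python
# Plain-Python stand-ins for the loss scaling rules.

def average_loss(per_example_loss, global_batch_size, sample_weight=None):
    """Sum the (optionally weighted) losses, divide by the global batch size."""
    if sample_weight is None:
        sample_weight = [1.0] * len(per_example_loss)
    weighted = [loss * w for loss, w in zip(per_example_loss, sample_weight)]
    return sum(weighted) / global_batch_size

def scaled_regularization_loss(regularization_losses, num_replicas):
    """Sum the regularization losses and divide by the number of replicas."""
    return sum(regularization_losses) / num_replicas

GLOBAL_BATCH_SIZE = 4
NUM_REPLICAS = 2

# A short final batch: replica 0 gets two examples, replica 1 only one.
replica_0 = [2.0, 4.0]
replica_1 = [6.0]

total = (average_loss(replica_0, GLOBAL_BATCH_SIZE)
         + average_loss(replica_1, GLOBAL_BATCH_SIZE))
mean_based = sum(replica_0) / len(replica_0) + sum(replica_1) / len(replica_1)

# Every replica holds the same weights, so each computes the same penalty;
# dividing by the number of replicas keeps the summed penalty unchanged.
penalty = 0.5
summed_penalty = sum(scaled_regularization_loss([penalty], NUM_REPLICAS)
                     for _ in range(NUM_REPLICAS))
```

With the mean-based version, the single example on replica 1 counts twice as much as each example on replica 0, which is the step-to-step variation the list above warns about.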

In [0]:
with strategy.scope():
    # Set reduction to `none` so we can do the reduction afterwards and divide by
    # global batch size.

    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
    def compute_loss(labels, predictions):
        per_example_loss = loss_object(labels, predictions)
        return tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)


# Define the metrics to track loss and accuracy

These metrics track the test loss and training and test accuracy. You can use .result() to get the accumulated statistics at any time.
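
The accumulate, read, and reset pattern of such metrics can be sketched with a small plain-Python class. It is a hypothetical stand-in that borrows the method names used in this guide; it is not a Keras class.

```python
class RunningMean:
    """Hypothetical stand-in for a streaming mean metric."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update_state(self, values):
        """Accumulate a batch of values."""
        self.total += sum(values)
        self.count += len(values)

    def result(self):
        """Mean of everything seen since the last reset."""
        return self.total / self.count if self.count else 0.0

    def reset_states(self):
        """Clear the accumulated statistics, for example at the end of an epoch."""
        self.total = 0.0
        self.count = 0

metric = RunningMean()
metric.update_state([1.0, 2.0])  # first batch
metric.update_state([6.0])       # second batch
epoch_mean = metric.result()     # mean over both batches
metric.reset_states()
after_reset = metric.result()
```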

In [0]:
with strategy.scope():
    test_loss = tf.keras.metrics.Mean(name="test_loss")
    train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")
    test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="test_accuracy")

# Training Loop

In [0]:
with strategy.scope():
    model = create_model()
    optimizer = tf.keras.optimizers.Adam()
    checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)

In [0]:
with strategy.scope():
  def train_step(inputs):
    images, labels = inputs

    with tf.GradientTape() as tape:
      predictions = model(images, training=True)
      loss = compute_loss(labels, predictions)

    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_accuracy.update_state(labels, predictions)
    return loss 

  def test_step(inputs):
    images, labels = inputs

    predictions = model(images, training=False)
    t_loss = loss_object(labels, predictions)

    test_loss.update_state(t_loss)
    test_accuracy.update_state(labels, predictions)



In [0]:
with strategy.scope():
  # `experimental_run_v2` replicates the provided computation and runs it
  # with the distributed input (later TensorFlow releases rename it to `run`).
  @tf.function
  def distributed_train_step(dataset_inputs):
    per_replica_losses = strategy.experimental_run_v2(train_step, args=(dataset_inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
                           axis=None)
 
  @tf.function
  def distributed_test_step(dataset_inputs):
    return strategy.experimental_run_v2(test_step, args=(dataset_inputs,))

  for epoch in range(EPOCHS):
    # TRAIN LOOP
    total_loss = 0.0
    num_batches = 0
    for x in train_dist_dataset:
      total_loss += distributed_train_step(x)
      num_batches += 1
    train_loss = total_loss / num_batches

    # TEST LOOP
    for x in test_dist_dataset:
      distributed_test_step(x)

    if epoch % 2 == 0:
      checkpoint.save(checkpoint_prefix)

    template = ("Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, "
                "Test Accuracy: {}")
    print(template.format(epoch+1, train_loss,
                          train_accuracy.result()*100, test_loss.result(),
                          test_accuracy.result()*100))

    test_loss.reset_states()
    train_accuracy.reset_states()
    test_accuracy.reset_states()


INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
Epoch 1, Loss: 0.5085545778274536, Accuracy: 81.89833068847656, Test Loss: 0.3894329369068146, Test Accuracy: 85.86000061035156
INFO:tensorflow:Reduce to /job:l