# Chapter 11 Training Deep Neural Nets

There are three problems with training deep neural nets that we will cover in this chapter: the vanishing/exploding gradient problem, slow training with a large network, and overfitting. 

## Vanishing/Exploding Gradients

When using the backpropagation algorithm, the gradient for each weight and bias is calculated backwards, beginning from the output layer. In particular, the gradient of the bias at a given layer can be written as the gradient at a later layer multiplied by the product of the weights and the activation-function derivatives of the intermediate layers. (The same reasoning applies to the gradient of a weight, since the weight gradient is a product of two terms, one of which is the gradient of the bias at its layer.) This means that the earlier a layer is, the more terms appear in this product. Furthermore, initializing the weights with a 0-centered normal distribution with a standard deviation of 1 means most weights have a magnitude smaller than 1, and the derivative of the sigmoid activation function reaches a maximum of only 1/4. Because each term is small, the product shrinks quickly, which is known as the *vanishing gradients* problem: the earlier a layer appears, the smaller its gradient is, and small gradients cause the network to learn slowly. It is also possible for the weights and derivatives to become large, in later layers or with a different activation function, in which case the gradients are referred to as *exploding gradients* and make learning very erratic. 

The sigmoid function also saturates: its derivative becomes very small once the input is very positive or very negative, which exacerbates the problem. We can mitigate both issues by changing the way we initialize the weights or by changing the activation function.  
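To make the argument above concrete, here is a minimal sketch that multiplies the per-layer terms `w * sigmoid'(z)` along a chain of one-neuron layers. The helper names (`sigmoid_derivative`, `backprop_factor`) and the weight values 0.8 and 8.0 are illustrative assumptions, not part of the notebook.

```python
import math

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the sigmoid; its maximum is 1/4 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_factor(weights, pre_activations):
    """Product of w * sigmoid'(z) along a chain of one-neuron layers."""
    factor = 1.0
    for w, z in zip(weights, pre_activations):
        factor *= w * sigmoid_derivative(z)
    return factor

# Best case for the sigmoid: every pre-activation sits at z = 0.
for depth in (1, 5, 10):
    small = backprop_factor([0.8] * depth, [0.0] * depth)  # |w| < 1: product shrinks
    large = backprop_factor([8.0] * depth, [0.0] * depth)  # large w: product grows
    print(f"depth={depth:2d}  vanishing={small:.3e}  exploding={large:.3e}")

# Saturation: far from zero the derivative is already tiny.
print(sigmoid_derivative(10.0))
```

Each extra layer multiplies the factor by another `w * sigmoid'(z)`, so the product changes geometrically with depth in either direction.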

### Xavier and He Initialization

When using the logistic activation function, we can use a strategy called *Xavier initialization*, which samples weights from either a 0-centered normal distribution whose standard deviation is a function of the number of inputs and outputs, or a uniform distribution whose limits are a function of the same parameters. *He initialization* defines these distributions with the same parameters but different functions, and is meant for the ReLU activation function.
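As a rough sketch of what "a function of the number of inputs and outputs" means, the commonly cited Xavier formulas are a standard deviation of sqrt(2 / (n_inputs + n_outputs)) for the normal case and a limit of sqrt(6 / (n_inputs + n_outputs)) for the uniform case. The He version below assumes the fan-average variant that the text implies (the Xavier values scaled by sqrt(2)); a widely used alternative relies on the number of inputs only. The function names are hypothetical.

```python
import math

def xavier_normal_stddev(n_inputs, n_outputs):
    """Standard deviation of the 0-centered normal used by Xavier initialization."""
    return math.sqrt(2.0 / (n_inputs + n_outputs))

def xavier_uniform_limit(n_inputs, n_outputs):
    """Weights are sampled uniformly from [-limit, +limit]."""
    return math.sqrt(6.0 / (n_inputs + n_outputs))

def he_uniform_limit(n_inputs, n_outputs):
    """Assumed fan-average He variant: the Xavier limit scaled by sqrt(2)."""
    return math.sqrt(2.0) * xavier_uniform_limit(n_inputs, n_outputs)

# Layer sizes taken from the first hidden layer used in this chapter.
n_inputs, n_hidden1 = 28 * 28, 300
print(xavier_normal_stddev(n_inputs, n_hidden1))
print(xavier_uniform_limit(n_inputs, n_hidden1))
print(he_uniform_limit(n_inputs, n_hidden1))
```

The two Xavier variants describe the same variance: a uniform distribution on [-r, r] has variance r²/3, which equals the normal variant's 2 / (n_inputs + n_outputs).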

In [41]:
import tensorflow as tf
import numpy as np
from tensorflow_graph_in_jupyter import show_graph

In [42]:
# Load data
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

In [8]:
tf.reset_default_graph()

n_inputs = 28 * 28  
n_hidden1 = 300

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

# Use Xavier initialization with a uniform distribution
xavier_initializer = tf.contrib.layers.variance_scaling_initializer(factor=1.0, 
                                                                    mode='FAN_AVG', 
                                                                    uniform=True)
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, 
                         kernel_initializer=xavier_initializer, name='hidden1')

### Nonsaturating Activation Functions

While the ReLU function doesn't saturate no matter how large its positive input is, its derivative is 0 for any negative input. A neuron whose inputs are always negative therefore receives a zero gradient, so its weights stop being updated and it is unlikely to recover with further training. One alternative is the leaky ReLU, which has a small slope (the hyperparameter alpha, which should be less than 1 and is usually 0.01) instead of a constant 0 for negative inputs. The leaky ReLU function is defined as max(alpha\*z, z). There is also randomized leaky ReLU, which picks alpha randomly during training and fixes it to an average value during testing, and can act as a form of regularization. Finally, parametric leaky ReLU treats alpha as a variable, like the weights or biases, that can be optimized by backpropagation, but this activation function is more prone to overfitting. 

We can also use the exponential linear unit (ELU), which is defined as z for positive z, and alpha\*(e^z - 1) for negative z. This function has been reported to perform better and train faster than the other activation functions above. While it is more costly to compute because of the exponential, the faster convergence rate more than compensates for this slowdown during training. At test time, however, there is no such compensation, so an ELU network will be slower than a ReLU network. 
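The two definitions above translate directly into plain Python. This is only an illustrative scalar sketch; the default values `alpha=0.01` for the leaky ReLU and `alpha=1.0` for the ELU are assumptions chosen for the example.

```python
import math

def leaky_relu(z, alpha=0.01):
    """max(alpha*z, z): slope 1 for positive z, small slope alpha for negative z."""
    return max(alpha * z, z)

def elu(z, alpha=1.0):
    """z for positive z, alpha*(e^z - 1) otherwise; approaches -alpha for very negative z."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

for z in (-5.0, -1.0, 0.0, 2.0):
    print(z, leaky_relu(z), elu(z))
```

Unlike the plain ReLU, both functions return a nonzero value with a nonzero slope for negative inputs, which is what keeps the gradient from being exactly 0 there.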

The scaled exponential linear unit (SELU) activation function makes the network self-normalizing: it performs a normalization similar to batch normalization (discussed below) inside the activation function itself. However, this only works with feedforward neural networks and doesn't allow for the use of techniques such as l1 or l2 regularization and dropout. When these conditions are met, it generally performs better than other activation functions. Remember to scale and normalize the initial inputs.

In [10]:
# Leaky ReLU and ELU functions
tf.reset_default_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')

hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.leaky_relu, name='hidden1')
hidden2 = tf.layers.dense(hidden1, n_hidden1, activation=tf.nn.elu, name='hidden2')

### Batch Normalization

While the above methods are good for dealing with gradient instability at the start of training, the problem might reappear as we update the weights and biases. Batch normalization can prevent this by normalizing the inputs to your activation function, leading to less extreme values for those inputs and thus a lower chance of saturation. More generally, it is meant to address the problem of each layer having to cope with a changing input distribution as the network's parameters are updated. It does this by calculating the mean and standard deviation of each input over the current mini-batch, using these values to center and normalize the inputs, and then scaling and shifting the result as gamma\*X_hat + beta, where X_hat is the normalized input and gamma (scaling) and beta (offset) are learned during training. This technique is so effective that it even allows the use of saturating activation functions such as the sigmoid function, makes networks less sensitive to weight/bias initialization, allows for higher learning rates, and acts like a form of regularization. There is, however, a computational cost.  
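The training-time computation described above can be sketched for a single feature in plain Python. The `epsilon` smoothing term and the function name are assumptions added for the example; at test time a real implementation would use running averages of the mean and variance rather than the statistics of the current batch.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, epsilon=1e-5):
    """Normalize one feature over a mini-batch, then scale by gamma and shift by beta."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    # Center and normalize; epsilon avoids a division by zero.
    x_hat = [(v - mean) / math.sqrt(variance + epsilon) for v in values]
    # Learned scaling (gamma) and offset (beta).
    return [gamma * xh + beta for xh in x_hat]

batch = [1.0, 2.0, 3.0, 4.0]
print(batch_norm(batch))                       # roughly zero mean, unit variance
print(batch_norm(batch, gamma=2.0, beta=0.5))  # mean near beta, std near gamma
```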

In [13]:
from functools import partial

n_hidden2 = 100
n_outputs = 10

tf.reset_default_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
# Boolean flag: True during training (use batch statistics), False at inference
training = tf.placeholder_with_default(False, shape=(), name='training')
he_initializer = tf.contrib.layers.variance_scaling_initializer()

batch_norm_layer = partial(tf.layers.batch_normalization, 
                          training=training, momentum=0.9)
dense_layer = partial(tf.layers.dense, 
                     kernel_initializer = he_initializer)

# Hidden layer
hidden1 = dense_layer(X, n_hidden1, name='hidden1')
# Batch normalization, inherits parameters from partial
bn1 = batch_norm_layer(hidden1)
# ELU activation function
bn1_act = tf.nn.elu(bn1)
hidden2 = dense_layer(bn1_act, n_hidden2, name='hidden2')
bn2_act = tf.nn.elu(batch_norm_layer(hidden2))
logits_before_bn = dense_layer(bn2_act, n_outputs, name='outputs')
logits = batch_norm_layer(logits_before_bn)

## Reusing Pretrained Layers

*Transfer learning* is the process of reusing layers from a pretrained model in your current network, generally the lower layers, since these are the ones that have learned to identify low-level details such as lines, edges, and so on. Let's look at the MNIST model we trained and saved in the last chapter.

### Reusing a TensorFlow Model

We can load this graph and inspect its operations either by listing their names or by using TensorBoard.

In [14]:
tf.reset_default_graph()

saver = tf.train.import_meta_graph('models/final_mnist_model.ckpt.meta')

for op in tf.get_default_graph().get_operations():
    print(op.name)

X
y
hidden1/kernel/Initializer/random_uniform/shape
hidden1/kernel/Initializer/random_uniform/min
hidden1/kernel/Initializer/random_uniform/max
hidden1/kernel/Initializer/random_uniform/RandomUniform
hidden1/kernel/Initializer/random_uniform/sub
hidden1/kernel/Initializer/random_uniform/mul
hidden1/kernel/Initializer/random_uniform
hidden1/kernel
hidden1/kernel/Assign
hidden1/kernel/read
hidden1/bias/Initializer/zeros
hidden1/bias
hidden1/bias/Assign
hidden1/bias/read
dnn/hidden1/MatMul
dnn/hidden1/BiasAdd
dnn/hidden1/Relu
hidden2/kernel/Initializer/random_uniform/shape
hidden2/kernel/Initializer/random_uniform/min
hidden2/kernel/Initializer/random_uniform/max
hidden2/kernel/Initializer/random_uniform/RandomUniform
hidden2/kernel/Initializer/random_uniform/sub
hidden2/kernel/Initializer/random_uniform/mul
hidden2/kernel/Initializer/random_uniform
hidden2/kernel
hidden2/kernel/Assign
hidden2/kernel/read
hidden2/bias/Initializer/zeros
hidden2/bias
hidden2/bias/Assign
hidden2/bias/read
dn

In [16]:
show_graph(tf.get_default_graph())

If we want to load tensors or operations, we can either fetch them individually by name or, if the model's author added them to a collection, retrieve them all at once.

In [29]:
# Tensor names include index
X = tf.get_default_graph().get_tensor_by_name('X:0')
y = tf.get_default_graph().get_tensor_by_name('y:0')

loss = tf.get_default_graph().get_operation_by_name('loss/loss')

# Create and use collection
for op in (X, y, loss):
    tf.add_to_collection('collection_name', op)
    
X, y, loss  = tf.get_collection('collection_name')

Let's say we want to reuse only the lower layers of our MNIST model and add new layers on top of them. There are (at least) two ways we can do this. One is to use import_meta_graph() to load the whole graph, grab the components we want while ignoring the rest, and build the new components into this same graph. We can then restore the saved variables within our session before training the new model. As an example, let's keep the two existing hidden layers of the MNIST model and add a third hidden layer and a new output layer.

In [47]:
tf.reset_default_graph()

n_hidden3 = 50
n_outputs = 10
learning_rate = 0.01  # assumed value; matches the later cells in this chapter

# Load entire graph
saver = tf.train.import_meta_graph('models/final_mnist_model.ckpt.meta')

X = tf.get_default_graph().get_tensor_by_name('X:0')
y = tf.get_default_graph().get_tensor_by_name('y:0')

# Grab last hidden layer from graph
hidden2 = tf.get_default_graph().get_tensor_by_name('dnn/hidden2/Relu:0')

# Build new hidden and output layer
new_hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu, name='new_hidden3')
new_logits = tf.layers.dense(new_hidden3, n_outputs, name='new_outputs')

# Define all components that depend on new layers
with tf.name_scope('new_loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=new_logits)
    loss = tf.reduce_mean(xentropy, name='loss')
    
with tf.name_scope('new_eval'):
    correct = tf.nn.in_top_k(new_logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
    
with tf.name_scope('new_train'):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)
    
init = tf.global_variables_initializer()
# Create new saver for saving model (old saver for restoration)
new_saver = tf.train.Saver()

show_graph(tf.get_default_graph())

We can see in TensorBoard that the old loss, eval, and training scopes are still in the graph. Now let's train it.

In [48]:
n_epochs = 40
batch_size = 50

def shuffle_batch(X, y, batch_size):
    random_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(random_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch

with tf.Session() as sess:
    init.run()
    saver.restore(sess, 'models/final_mnist_model.ckpt')
    
    for epoch in np.arange(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if epoch % 5 == 0:
            accuracy_valid = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
            print(f'Epoch {epoch} Validation accuracy: {accuracy_valid}')

INFO:tensorflow:Restoring parameters from ./final_mnist_model.ckpt
Epoch 0 Validation accuracy: 0.9621999859809875
Epoch 5 Validation accuracy: 0.9746000170707703
Epoch 10 Validation accuracy: 0.9771999716758728
Epoch 15 Validation accuracy: 0.9783999919891357
Epoch 20 Validation accuracy: 0.9797999858856201
Epoch 25 Validation accuracy: 0.9805999994277954
Epoch 30 Validation accuracy: 0.9800000190734863
Epoch 35 Validation accuracy: 0.9797999858856201


Another method is to build the entire graph explicitly, collect the variables we want to reuse with tf.get_collection, and restore only those variables from the checkpoint at the start of the session.

In [55]:
tf.reset_default_graph()

n_inputs = 28*28
n_hidden1 = 300  # reused layer
n_hidden2 = 100  # new layer
n_hidden3 = 50   # new layer
n_outputs = 10   # new layer
learning_rate = 0.01

relu_layer = partial(tf.layers.dense, activation=tf.nn.relu)

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

with tf.name_scope('dnn'):
    hidden1 = relu_layer(X, n_hidden1, name='hidden1')            # old layer
    hidden2 = relu_layer(hidden1, n_hidden2, name='hidden2')      # new layer
    hidden3 = relu_layer(hidden2, n_hidden3, name='hidden3')      # new layer
    logits = tf.layers.dense(hidden3, n_outputs, name='outputs')  # new layer
    
with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name='loss')
    
with tf.name_scope('eval'):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy') 
    
with tf.name_scope('train'):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

In [56]:
# Only grab the variables of the reused layer (hidden1)
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                              scope='hidden1')
restore_saver = tf.train.Saver(reuse_vars)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    init.run()
    # Restore only the hidden1 variables from the checkpoint
    restore_saver.restore(sess, 'models/final_mnist_model.ckpt')
    
    for epoch in np.arange(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if epoch % 5 == 0:
            validation_accuracy = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
            print(f'Epoch {epoch} Validation accuracy: {validation_accuracy}')

INFO:tensorflow:Restoring parameters from ./final_mnist_model.ckpt
Epoch 0 Validation accuracy: 0.9419999718666077
Epoch 5 Validation accuracy: 0.9702000021934509
Epoch 10 Validation accuracy: 0.975600004196167
Epoch 15 Validation accuracy: 0.9778000116348267
Epoch 20 Validation accuracy: 0.979200005531311
Epoch 25 Validation accuracy: 0.9779999852180481
Epoch 30 Validation accuracy: 0.9801999926567078
Epoch 35 Validation accuracy: 0.980400025844574


### Freezing the Lower Layers

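If we want to freeze the lower layers (as mentioned above, these layers detect lower-level features), we can pass a var_list to the optimizer's minimize() method inside the train name scope, containing only the variables of the layers we want to train. Variables left out of this list are never updated.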

In [53]:
tf.reset_default_graph()

n_inputs = 28*28
n_hidden1 = 300  # reused layer
n_hidden2 = 100  # new layer
n_hidden3 = 50   # new layer
n_outputs = 10   # new layer
learning_rate = 0.01

relu_layer = partial(tf.layers.dense, activation=tf.nn.relu)

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

with tf.name_scope('dnn'):
    hidden1 = relu_layer(X, n_hidden1, name='hidden1')            # old layer
    hidden2 = relu_layer(hidden1, n_hidden2, name='hidden2')      # new layer
    hidden3 = relu_layer(hidden2, n_hidden3, name='hidden3')      # new layer
    logits = tf.layers.dense(hidden3, n_outputs, name='outputs')  # new layer
    
with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name='loss')

with tf.name_scope('train'):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                  scope='hidden[34]|outputs')
    training_op = optimizer.minimize(loss, var_list=train_vars)
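The `scope` argument of `tf.get_collection` is treated as a regular expression and matched against the start of each variable name, which is why a pattern such as `'hidden[34]|outputs'` selects only the upper layers. The sketch below imitates that filtering with Python's `re` module; the variable names are assumed from what the dense layers above would create, and `filter_by_scope` is a hypothetical helper, not a TensorFlow function.

```python
import re

# Names assumed for the kernel and bias variables of the four layers above.
variable_names = [
    'hidden1/kernel:0', 'hidden1/bias:0',
    'hidden2/kernel:0', 'hidden2/bias:0',
    'hidden3/kernel:0', 'hidden3/bias:0',
    'outputs/kernel:0', 'outputs/bias:0',
]

def filter_by_scope(names, scope):
    """Keep the names whose beginning matches the scope regular expression."""
    return [name for name in names if re.match(scope, name)]

train_names = filter_by_scope(variable_names, 'hidden[34]|outputs')
print(train_names)
```

Everything not matched by the pattern (here `hidden1` and `hidden2`) stays out of `var_list` and is therefore frozen.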

### Caching the Frozen Layers

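Because the frozen layers never change, we can also compute the output of the topmost frozen layer once for the whole training set and feed this cached output to the upper layers, instead of running every batch through the frozen layers at each epoch.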

In [57]:
tf.reset_default_graph()

n_inputs = 28*28
n_hidden1 = 300  # reused layer
n_hidden2 = 100  # new layer
n_hidden3 = 50   # new layer
n_outputs = 10   # new layer
learning_rate = 0.01

relu_layer = partial(tf.layers.dense, activation=tf.nn.relu)

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

with tf.name_scope('dnn'):
    hidden1 = relu_layer(X, n_hidden1, name='hidden1')            # old layer
    hidden2 = relu_layer(hidden1, n_hidden2, name='hidden2')      # new layer
    hidden2_stop = tf.stop_gradient(hidden2)                      # block gradients to lower layers
    hidden3 = relu_layer(hidden2_stop, n_hidden3, name='hidden3') # new layer
    logits = tf.layers.dense(hidden3, n_outputs, name='outputs')  # new layer
    
with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name='loss')
    
with tf.name_scope('eval'):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy') 
    
with tf.name_scope('train'):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

In [63]:
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                              scope='hidden1')
restore_saver = tf.train.Saver(reuse_vars)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_batches = len(X_train) // batch_size

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, 'models/final_mnist_model.ckpt')
    
    # Get output of 2nd hidden layer
    h2_cache = sess.run(hidden2, feed_dict={X: X_train})
    # Run validation data through first two layers
    h2_cache_valid = sess.run(hidden2, feed_dict={X: X_valid})
    
    for epoch in np.arange(n_epochs):
        random_indices = np.random.permutation(len(X_train))
        # Random batches of output of 2nd hidden layer
        hidden2_batches = np.array_split(h2_cache[random_indices], n_batches)
        y_batches = np.array_split(y_train[random_indices], n_batches)
        for hidden2_batch, y_batch in zip(hidden2_batches, y_batches):
            # Use output of 2nd hidden layer as training data
            sess.run(training_op, feed_dict={hidden2: hidden2_batch, y: y_batch})
        
        accuracy_val = accuracy.eval(feed_dict={hidden2: h2_cache_valid, y: y_valid})
        print(f"{epoch} Validation accuracy: {accuracy_val}")

INFO:tensorflow:Restoring parameters from ./final_mnist_model.ckpt
0 Validation accuracy: 0.8399999737739563
1 Validation accuracy: 0.8838000297546387
2 Validation accuracy: 0.9007999897003174
3 Validation accuracy: 0.9103999733924866
4 Validation accuracy: 0.9182000160217285
5 Validation accuracy: 0.9218000173568726
6 Validation accuracy: 0.9247999787330627
7 Validation accuracy: 0.928600013256073
8 Validation accuracy: 0.9308000206947327
9 Validation accuracy: 0.9337999820709229
10 Validation accuracy: 0.9337999820709229
11 Validation accuracy: 0.9366000294685364
12 Validation accuracy: 0.9359999895095825
13 Validation accuracy: 0.9362000226974487
14 Validation accuracy: 0.9383999705314636
15 Validation accuracy: 0.9395999908447266
16 Validation accuracy: 0.9404000043869019
17 Validation accuracy: 0.9404000043869019
18 Validation accuracy: 0.9404000043869019
19 Validation accuracy: 0.9401999711990356
20 Validation accuracy: 0.9413999915122986
21 Validation accuracy: 0.941200017929077

## Faster Optimizers

We also have a choice of faster optimizers such as momentum optimization, Nesterov accelerated gradient, AdaGrad, RMSProp, and Adam optimization. Adam is usually a good default choice.
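To give a feel for why these optimizers can be faster, here is a toy comparison of plain gradient descent and momentum optimization on the loss f(x) = x². The update rule used (accumulate `learning_rate * gradient` into a momentum term, then subtract it) is one common formulation and an assumption here; TensorFlow's own implementation arranges the terms slightly differently. All names and hyperparameter values are illustrative.

```python
def gradient(x):
    """Gradient of the toy loss f(x) = x**2."""
    return 2.0 * x

def gradient_descent(x, learning_rate, n_steps):
    """Plain gradient descent: step against the current gradient only."""
    for _ in range(n_steps):
        x -= learning_rate * gradient(x)
    return x

def momentum_descent(x, learning_rate, n_steps, beta=0.9):
    """Momentum optimization: past gradients keep contributing through m."""
    m = 0.0  # momentum term (a scalar in this one-dimensional example)
    for _ in range(n_steps):
        m = beta * m + learning_rate * gradient(x)
        x -= m
    return x

x_plain = gradient_descent(5.0, learning_rate=0.001, n_steps=200)
x_momentum = momentum_descent(5.0, learning_rate=0.001, n_steps=200)
print(x_plain, x_momentum)
```

With the same small learning rate and number of steps, the momentum version ends up much closer to the minimum at 0.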

### Learning Rate Scheduling

A *learning rate schedule* determines how the learning rate changes while we train our model. We'd like to start with a high learning rate and reduce it as we get closer to the minimum. TensorFlow has [multiple options](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/python/training/learning_rate_decay.py) for learning rate decay; here we'll use exponential decay with momentum optimization. Keep in mind that AdaGrad, RMSProp, and Adam already adapt the learning rate during training through their own mechanisms. 

In [64]:
tf.reset_default_graph()

n_inputs = 28 * 28 
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope('train'):
    initial_learning_rate = 0.1
    decay_steps = 10000
    decay_rate = 1/10
    global_step = tf.Variable(0, trainable=False, name='global_step')
    # decayed_learning_rate = initial_learning_rate * decay_rate^(global_step / decay_steps)
    learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step, 
                                               decay_steps, decay_rate)
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    # global_step increments by one each time the variables are updated
    training_op = optimizer.minimize(loss, global_step=global_step)
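The decay formula in the comment above can be checked by hand with a small standalone function. This sketch mirrors the continuous (non-staircase) form of the formula and reuses the values from the cell; the function itself is a hypothetical stand-in, not TensorFlow's implementation.

```python
def exponential_decay(initial_learning_rate, global_step, decay_steps, decay_rate):
    """initial_learning_rate * decay_rate^(global_step / decay_steps)."""
    return initial_learning_rate * decay_rate ** (global_step / decay_steps)

# Same hyperparameters as the cell above.
for step in (0, 5000, 10000, 20000):
    print(step, exponential_decay(0.1, step, decay_steps=10000, decay_rate=1 / 10))
```

With these settings the learning rate is divided by 10 every 10,000 training steps.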

## Avoiding Overfitting through Regularization

### Early Stopping

One simple strategy to deal with overfitting is to evaluate the model on the validation set at set intervals (a number of steps or epochs) and stop training once we're confident that validation accuracy will keep dropping. A single dip is not enough evidence, since a temporary dip may give way to eventual improvement.
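One common way to decide when we are "confident" is a patience counter: stop once the validation score has failed to improve for a fixed number of consecutive checks. The sketch below applies that rule to a list of made-up validation accuracies; the function name, the `patience` parameter, and the scores are all assumptions for illustration.

```python
def early_stopping(validation_scores, patience=3):
    """Return (index, score) of the best check seen before patience ran out."""
    best_score = float('-inf')
    best_index = None
    checks_without_improvement = 0
    for index, score in enumerate(validation_scores):
        if score > best_score:
            # New best: remember it and reset the counter.
            best_score, best_index = score, index
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break  # stop training here
    return best_index, best_score

# Hypothetical accuracies recorded at successive checks.
scores = [0.90, 0.93, 0.95, 0.94, 0.96, 0.95, 0.95, 0.94, 0.99]
print(early_stopping(scores, patience=3))
```

The dip at the fourth check is tolerated because the score improves again before the patience runs out, while the final value is never reached because training stops after three checks without improvement.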

### $\ell_1$ and $\ell_2$ Regularization

To use l1 or l2 regularization in our network, we pass one of TensorFlow's regularizers to each layer through its kernel_regularizer argument and then add the resulting regularization losses to our loss function. 

In [65]:
tf.reset_default_graph()

scale=0.001

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

regularized_dense_layer = partial(tf.layers.dense, activation=tf.nn.relu,
                                 kernel_regularizer=tf.contrib.layers.l1_regularizer(scale))

with tf.name_scope('dnn'):
    hidden1 = regularized_dense_layer(X, n_hidden1, name='hidden1')
    hidden2 = regularized_dense_layer(hidden1, n_hidden2, name='hidden2')
    logits = regularized_dense_layer(hidden2, n_outputs, activation=None, 
                                    name='outputs')
    
with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    base_loss = tf.reduce_mean(xentropy, name='avg_xentropy')
    # Regularization losses are stored in the REGULARIZATION_LOSSES collection
    reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
    loss = tf.add_n([base_loss] + reg_losses, name='loss')
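What each regularizer adds to the loss can be sketched in plain Python. This assumes the l1 penalty is `scale * sum(|w|)` and the l2 penalty is `scale * sum(w**2) / 2`, which is my understanding of how the contrib regularizers behave; the weights and base loss below are made-up numbers.

```python
def l1_penalty(weights, scale):
    """scale times the sum of absolute weight values."""
    return scale * sum(abs(w) for w in weights)

def l2_penalty(weights, scale):
    """scale times half the sum of squared weight values (assumed convention)."""
    return scale * sum(w ** 2 for w in weights) / 2.0

# Hypothetical flattened layer weights and cross-entropy loss.
weights = [1.0, -2.0, 3.0]
base_loss = 0.25
scale = 0.001

print(base_loss + l1_penalty(weights, scale))
print(base_loss + l2_penalty(weights, scale))
```

The total loss is the base loss plus one such penalty per regularized layer, which is what the `tf.add_n` call above computes.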

### Dropout

Dropout temporarily removes neurons at every training step, with each neuron having a probability p (the dropout rate hyperparameter) of being removed. Because each neuron receives only a fraction (1-p) of its inputs during training, the weights adapt to this smaller total input, so after training we have to multiply the weights by the keep probability (1-p). Equivalently, we can divide each neuron's output by the keep probability during training, which is what tf.layers.dropout does. We can implement this in TensorFlow by adding a tf.layers.dropout layer after the input and after each hidden layer. We have to pass the layer we're trying to "dropout", the dropout rate, and whether we're currently training (a placeholder that defaults to False and that we set to True through the feed_dict when we run our training operation). 

In [None]:
tf.reset_default_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')
training = tf.placeholder_with_default(False, shape=(), name='training')

dropout_rate = 0.5
# Dropout input. training must be passed by keyword: the third positional
# argument of tf.layers.dropout is noise_shape, not training.
X_drop = tf.layers.dropout(X, dropout_rate, training=training)

with tf.name_scope('dnn'):
    hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu,
                              name='hidden1')
    hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
    hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.relu,
                              name='hidden2')
    hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training)
    logits = tf.layers.dense(hidden2_drop, n_outputs, name='outputs')
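The behaviour of a dropout layer can be sketched in plain Python. This version assumes "inverted" dropout, in which the kept values are divided by the keep probability during training so that nothing needs rescaling afterwards; the function and its `rng` parameter are hypothetical and only illustrate the idea.

```python
import random

def dropout(values, rate, training, rng=random):
    """Zero each value with probability `rate`; scale the survivors by 1 / (1 - rate)."""
    if not training:
        # At test time the layer passes its inputs through unchanged.
        return list(values)
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(42)  # seeded generator for reproducibility
activations = [1.0] * 10
print(dropout(activations, rate=0.5, training=True, rng=rng))
print(dropout(activations, rate=0.5, training=False))
```

Because the surviving activations are scaled up, the expected sum of the outputs stays the same whether or not dropout is active.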