# Training deep neural nets
Adapted from Chap. 11 of `Hands-on Machine Learning with Scikit-Learn and TensorFlow` by A. Geron.

### Main problems in training deep neural networks with millions of parametetrs
1. Vanishing (or exploding) gradients that make lower layers hard to train
2. Training is very slow
3. Severe overfitting possible

## Vanishing gradients problem
Backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient along the way. Gradients often get smaller and smaller as the algorithm progresses down to lower (closer to input) layers. Therefore, gradient descent (GD) update leaves the connection weights in the lower layers virtually unchanged and optimization never converges. Neural networks may also suffer from the _exploding gradients_ problem. This is especially the case with recurrent neural networks.

In 2010, Glorot and Bengio showed that with logistic activation function and a simple mean-zero-std-one initialization, the variances of the outputs of each layer are much larger than the variances of the inputs. Therefore, the outputs in the top layers saturate close to zero or one and the gradients are very small. Therefore, the backpropagation algorithm has basically no gradient to propagate to the lower layers.

As a solution, Glorot and Bengio proposed that the connection weights be initialized with zero-mean normal distribution with the following standard deviation $\sigma$ or uniform distribution with the range $-r$ and $+r$:

$$
    \sigma = \sqrt{\frac{2}{n_{inputs} + n_{outputs}}}, \\
    r = \sqrt{\frac{6}{n_{inputs} + n_{outputs}}}
$$
Here $n_{inputs}$ and $n_{outputs}$ are the number of input and output connections for each layer. By default, `tf.layers.dense()` uses this initialization with a uniform distribution. One can use the similar _He initialization_ as follows:
```python
    he_init = tf.contrib.layers.variance_scaling_initializer(mode='FAN_AVG')
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, kernel_initializer=he_init, name="hidden1")
```

### Nonsaturating activation functions
As mentioned above, using the logistic activation function can lead to the vanishing gradients problem. For example, ReLU works well because it does not saturate for large positive values and it also fast to compute. ReLU suffers, however, partially from the same problem as the logistic activation function that the output values can saturate to zero. To solve this problem, one can use a variant of the ReLU such as `LeakyReLU`$(z)=\max(\alpha z, z)$, where typically $\alpha=0.01$. The non-zero $alpha$ ensures that leaky ReLUs never completely die.

Another variant is the exponential linear unit (ELU) defined as

$$
\textrm{ELU}(z) = \left\{\begin{array}{lr}
        \alpha(e^z - 1), & \text{for } z < 0 \\
        z, & \text{for } z \geq 0
        \end{array}\right.
$$

Compared to ReLU, ELU has nonzero gradient for $z < 0$, is smooth everywhere, and has an average output closer to 0 at $x=0$. All these features mitigate the vanishing gradients problem. The only disadvantage with ELU is that it is slower to compute than ReLU.

### Batch normalization
He initialization and variants of ELU reduce the vanishing gradients problem at the beginning of training, but nothing guarantees it does not re-emerge during training. In 2015, Ioffe and Szegedy [proposed](https://arxiv.org/pdf/1502.03167v3.pdf) a technique called _Batch normalization_ (BN) to address the vanishing/exploding gradients problems, or more generally the problem that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change (Internal covariance shift problem). 

The technique consists of adding an operation in the model just before the activation function of each layer: zero-centering and normalizing the inputs, then scaling and shifting the result using two new parameters per layer (one for scaling, the other for shifting). This operationg lets the model learn the optimal scale and inputs for each layer.

The algorithm estimates the inputs' mean and standard deviation by evaluating the mean and standard deviation of the inputs over the current mini-batch (hence the name Batch Normalization). In total, four parameters are learned for each batch-normalized layer: $\gamma$ (scale of outputs), $\beta$ (offset of outputs), $\mu$ (mean of inputs), and $\sigma$ (standard deviation of inputs).

Ioffe and Szegedy showed that using batch normalization strongly reduced the vanishing gradients problem, reduced the sensitivity of training to the weight initialization, allowed for using much larger learning rates, and even acted as a regularization mitigating overfitting. Batch normalization adds complexity to the model and slows down training due to the extra computations required. Training can speed up once GD has found reasonably good values for the scales and offsets, though.

See the example below for batch normalization in TensorFlow.

### Gradient clipping
One technique to mitigate the exploding gradients problem is to [clip the gradients](http://proceedings.mlr.press/v28/pascanu13.pdf) during backpropagation so that they never exceed a given threshold. In TensorFlow, the optimizer's `minimize` function both computes the gradients and applies them to variables, so one must instead compute the gradients first, then create an operation to clip the gradients by value and finally apply the clipped gradients. The clip threshold is a hyperparameter that can be tuned. See the example below for gradient clipping.

### Full example from Chap. 10 with He initialization, ELU activation function, batch normalization, and gradient clipping

In [None]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from functools import partial
import os
from datetime import datetime

n_inputs = 28*28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

tf.reset_default_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X") # Number of training samples in batch not known and not required
y = tf.placeholder(tf.int64, shape=(None), name="y") # y is a 1D array with unknown length

# This will be set to True during training to tell batch normalization layers to use the whole training set's mean and stddev
training = tf.placeholder_with_default(False, shape=(), name='training')

he_init = tf.contrib.layers.variance_scaling_initializer(mode='FAN_AVG')

# BN uses exponential decay with momentum to compute the running averages
batch_norm_layer = partial(tf.layers.batch_normalization, training=training, momentum=0.9)

# Neural network with batch-normalized layers
with tf.name_scope("ann"):
    hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1", kernel_initializer=he_init)
    bn1 = batch_norm_layer(hidden1)
    bn1_act = tf.nn.elu(bn1)
    hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2", kernel_initializer=he_init)
    bn2 = batch_norm_layer(hidden2)
    bn2_act = tf.nn.elu(bn2)
    logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs", kernel_initializer=he_init) # Unscaled as softmax computed later internally
    logits = batch_norm_layer(logits_before_bn, training=training, momentum=0.9)
    

with tf.name_scope("loss"):
    # This op expects unscaled logits, since it performs a softmax on logits internally for efficiency.
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
    
learning_rate = 0.05
gradient_clip_threshold = 1.0

# Training with gradient clipping
with tf.name_scope("train"):
    global_step = tf.Variable(0, name='global_step', trainable=False)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    grads_and_vars = optimizer.compute_gradients(loss)
    capped_cvs = [(tf.clip_by_value(grad, -gradient_clip_threshold, gradient_clip_threshold), var) 
                  for grad, var in grads_and_vars]
    training_op = optimizer.apply_gradients(capped_cvs, global_step=global_step)

# Evaluate performance by checking if the correct label is in top 1:
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    with tf.name_scope("accuracy"):
        accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')

tf.summary.scalar('accuracy', accuracy)
summaries = tf.summary.merge_all()
   
saver = tf.train.Saver()

mnist = input_data.read_data_sets("/tmp/mnist/data")

n_epochs = 10
batch_size = 50

# Batch normalization creates operations that must be evaluated at each step to update the moving averages
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

root_logdir = 'mnist-logs'

def tb_logdir():   
    now = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
    return os.path.join(root_logdir, 'run-%s' % now)

logdir = tb_logdir()
print('Using %s for TensorBoard logs' % logdir)

SAVED_MODEL_PATH = os.path.join(logdir, 'model.ckpt')

'''
# This could be useful later
def variable_summaries(var):
    """Attach a lot of summaries to a Tensor (for TensorBoard visualization)."""
    with tf.name_scope('summaries'):
        mean = tf.reduce_mean(var)
        tf.summary.scalar('mean', mean)
        with tf.name_scope('stddev'):
            stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
            tf.summary.scalar('stddev', stddev)
        tf.summary.scalar('max', tf.reduce_max(var))
        tf.summary.scalar('min', tf.reduce_min(var))
        tf.summary.histogram('histogram', var)
'''
        
with tf.Session() as sess:
    file_writer = tf.summary.FileWriter(logdir, sess.graph)
    tf.global_variables_initializer().run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            summary, _, _ = sess.run([summaries, training_op, extra_update_ops], feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={training: True, X: X_batch, y: y_batch})
        acc_val = accuracy.eval(feed_dict={X: mnist.validation.images, y: mnist.validation.labels})
        print(epoch, "Train accuracy:", acc_train, "Val accuracy", acc_val)
        file_writer.add_summary(summary, epoch)
    save_path = saver.save(sess, SAVED_MODEL_PATH)
    file_writer.close()

### Transfer learning
Transfer learning refers to re-using a model that was trained for a different task, typically using only the lower, more generic layers. Below, the loading of TensorFlow models is illustrated with examples.

In [None]:
saved_model_meta = SAVED_MODEL_PATH + '.meta'
saver = tf.train.import_meta_graph(SAVED_MODEL_PATH + '.meta')

X = tf.get_default_graph().get_tensor_by_name('X:0')
y = tf.get_default_graph().get_tensor_by_name('y:0')

accuracy = tf.get_default_graph().get_tensor_by_name('eval/accuracy/accuracy:0')
# [n.name for n in tf.get_default_graph().as_graph_def().node]

# for op in tf.get_default_graph().get_operations():
#     print(op.name)

with tf.Session() as sess:
    saver.restore(sess, SAVED_MODEL_PATH)
    acc_test = accuracy.eval(feed_dict={X: mnist.validation.images, y: mnist.validation.labels})
    
print('Accuracy on test set:', acc_test)

### Exercise 8
a. Build a neural network with five hidden layers of 100 neurons each, He initialization, and the ELU activation function.

In [None]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from functools import partial
from datetime import datetime
import os
import numpy as np

N_INPUTS = 28*28
N_NEURONS = 100
N_LAYERS = 5
N_EPOCHS = 100
BATCH_SIZE = 1000
ROOT_LOGDIR = 'chap-11-exercise-8-logs'
LEARNING_RATE = 0.01
LABELS_INCLUDED = [0, 1, 2, 3, 4]
N_CLASSES = len(LABELS_INCLUDED)

def build_placeholders(n_inputs):
    X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
    y = tf.placeholder(tf.int64, shape=(None), name="y")
    return X, y

def build_model(X, 
                y, 
                n_neurons, 
                n_layers, 
                n_classes, 
                learning_rate, 
                kernel_reg=tf.contrib.layers.l2_regularizer(scale=0.0)):
    
    # kernel_reg = tf.contrib.layers.l2_regularizer(scale=0.05)
    
    def build_hidden_layers(X, n_layers, n_neurons, init):
        neuron_layer = partial(tf.layers.dense, 
                               activation=tf.nn.elu, 
                               kernel_initializer=init, 
                               kernel_regularizer=kernel_reg)
        top_layer = X
        for ind in range(1, n_layers + 1):
            layer_name = 'hidden-layer-' + str(ind)
            top_layer = neuron_layer(top_layer, n_neurons, name=layer_name)
        return top_layer      

    '''
    def build_hidden_layers(X, n_layers, n_neurons, init):
        neuron_layer = partial(tf.layers.dense, activation=tf.nn.elu, kernel_initializer=init)
        hidden1 = neuron_layer(X, n_neurons, name='hidden1')
        hidden2 = neuron_layer(hidden1, n_neurons, name='hidden2')
        hidden3 = neuron_layer(hidden2, n_neurons, name='hidden3')
        hidden4 = neuron_layer(hidden3, n_neurons, name='hidden4')
        hidden5 = neuron_layer(hidden4, n_neurons, name='hidden5')
        return hidden5
    '''

    with tf.name_scope("hidden"):
        he_init = tf.variance_scaling_initializer()
        top_hidden = build_hidden_layers(X, n_layers=n_layers, n_neurons=n_neurons, init=he_init)

    
    with tf.name_scope("logits"):
        logits = tf.layers.dense(top_hidden, n_classes, name="logits")


    with tf.name_scope("loss"):
        # This op expects unscaled logits, since it performs a softmax on logits internally for efficiency.
        xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
        reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
        base_loss = tf.reduce_mean(xentropy, name="base_loss")
        loss = tf.add_n([base_loss] + reg_losses, name="loss")

    with tf.name_scope("train"):
        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
        training_op = optimizer.minimize(loss)
    
    # Evaluate performance by checking if the correct label is in top 1:
    with tf.name_scope("eval"):
        correct = tf.nn.in_top_k(logits, y, 1)
        with tf.name_scope("accuracy"):
            accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
        
    tf.summary.scalar('accuracy', accuracy)
    tf.summary.scalar('loss', loss)
    summaries = tf.summary.merge_all()
    return { 'accuracy': accuracy, 'summaries': summaries, 'training_op': training_op }


def tb_logdir(root_logdir):   
    now = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
    return os.path.join(root_logdir, 'run-%s' % now)


def init_batch_iterator(mnist, X, y, labels_included, batch_size):
    
    def extract_data(all_images, all_labels):
        inds_used = [ind for ind, label in enumerate(all_labels) if label in labels_included]
        images = all_images[inds_used]
        labels = all_labels[inds_used]
        return images, labels
    
    train_images, train_labels = extract_data(mnist.train.images, mnist.train.labels)
    val_images, val_labels = extract_data(mnist.validation.images, mnist.validation.labels)
    
    n_train = len(train_images)
    assert n_train == len(train_labels)
    
    print('Using %d training and %d validation images' % (n_train, len(val_images)))
    
    def batch_iterator(train):
        if train:
            for i in range(n_train // batch_size):
                inds = range(i*batch_size, (i+1)*batch_size)
                yield { X: train_images[inds], y: train_labels[inds] }
        else:
            yield { X: val_images, y: val_labels }
        return None

    return batch_iterator

MNIST = input_data.read_data_sets("/tmp/mnist/data")

def train(n_neurons, 
          n_layers, 
          n_epochs=N_EPOCHS, 
          verbose=True, 
          kernel_reg_scale=0.0):

    tf.reset_default_graph()

    X, y = build_placeholders(n_inputs=N_INPUTS)

    model = build_model(X=X, 
                        y=y, 
                        n_neurons=n_neurons, 
                        n_layers=n_layers, 
                        n_classes=N_CLASSES,
                        learning_rate=LEARNING_RATE)

    accuracy, summaries, training_op = [model[key] for key in ['accuracy', 'summaries', 'training_op']]
    
    feed_batch_generator = init_batch_iterator(mnist=MNIST, 
                                               X=X, 
                                               y=y, 
                                               labels_included=LABELS_INCLUDED,
                                               batch_size=BATCH_SIZE)

    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

    logdir = tb_logdir(root_logdir=ROOT_LOGDIR)
    if verbose:
        print('Using %s for TensorBoard logs' % logdir)

    saved_model_path = os.path.join(logdir, 'model_final.ckpt')

    with tf.Session() as sess:
        train_writer = tf.summary.FileWriter(logdir + '/train', sess.graph)
        val_writer = tf.summary.FileWriter(logdir + '/val')
        init.run()
        
        for epoch in range(1, n_epochs + 1):
            for feed_dict in feed_batch_generator(train=True):
                summary, train_acc, _ = sess.run([summaries, accuracy, training_op], feed_dict=feed_dict)

            if epoch % 5 == 0:
                val_generator = feed_batch_generator(train=False)
                feed_dict = next(val_generator)
                val_summary, val_acc = sess.run([summaries, accuracy], feed_dict=feed_dict)
                train_writer.add_summary(summary, epoch)
                val_writer.add_summary(val_summary, epoch)
                assert next(val_generator, None) == None
                saver.save(sess, os.path.join(logdir, 'model.ckpt'))
                if verbose:
                    print('Epoch:', epoch, 'Training acc:', train_acc, 'Validation acc:', val_acc)

        saver.save(sess, os.path.join(logdir, 'model_final.ckpt'))
        train_writer.close()
        val_writer.close()
    return val_acc
        
'''  
regularization_grid = np.linspace(0, 0.05, 3)

for kernel_reg in regularization_grid:
    val_acc = train(n_neurons=N_NEURONS, 
                    n_layers=N_LAYERS, 
                    n_epochs=10, 
                    kernel_reg_scale=kernel_reg_scale)
    print('')
'''  
train(n_neurons=N_NEURONS, n_layers=N_LAYERS)

Let us do the same as above but with `scikit-learn` compatible class:

In [22]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.exceptions import NotFittedError
from datetime import datetime
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
import os

class MyClassifier(BaseEstimator, ClassifierMixin):
    
    def __init__(self, 
                 n_hidden_layers=5,
                 n_neurons=100, 
                 learning_rate=0.01, 
                 batch_size=100,
                 n_epochs=10,
                 activation=tf.nn.elu, 
                 initializer=tf.variance_scaling_initializer(), 
                 random_state=42,
                 dropout_rate=None,
                 batch_norm_momentum=None,
                 root_logdir='chap-11-ex-8'):

        self.n_hidden_layers = n_hidden_layers
        self.n_neurons = n_neurons
        self.learning_rate = learning_rate
        self.batch_size = batch_size
        self.activation = activation
        self.initializer = initializer
        self.n_epochs = n_epochs
        self.random_state = random_state
        self.batch_norm_momentum = batch_norm_momentum
        self.dropout_rate = dropout_rate
        self._session = None
        self.root_logdir = root_logdir
        
    def _build_hidden(self, inputs):
        for layer_index in range(self.n_hidden_layers):
            with tf.name_scope("hidden"):
                if self.dropout_rate is not None:
                    inputs = tf.layers.dropout(inputs, name='hidden-drop-%d' % layer_index)
                inputs = tf.layers.dense(inputs, 
                                         self.n_neurons, 
                                         name='hidden-%d' % (layer_index + 1), 
                                         kernel_initializer=self.initializer)
                if self.batch_norm_momentum is not None:
                    inputs = tf.layers.batch_normalization(inputs, 
                                                           training=self._training, 
                                                           momentum=self.batch_norm_momentum,
                                                           name='hidden-%d-bn' % (layer_index + 1))
                inputs = self.activation(inputs, name='hidden-%d-out' % (layer_index + 1))
        return inputs
    
    def _build_graph(self, n_inputs, n_outputs):
        if self.random_state is not None:
            tf.set_random_seed(self.random_state)
            np.random.seed(self.random_state)
        X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
        y = tf.placeholder(tf.int32, shape=(None), name='y')
        
        if self.batch_norm_momentum or self.dropout_rate:
            self._training = tf.placeholder_with_default(False, shape=(), name='training')
        else:
            self._training = None
            
        hidden_outputs = self._build_hidden(X)
        
        with tf.name_scope("logits"):
            logits = tf.layers.dense(hidden_outputs, n_outputs, kernel_initializer=self.initializer, name='logits')
            y_proba = tf.nn.softmax(logits, name='Y_proba')
        
        with tf.name_scope("loss"):
            xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
            loss = tf.reduce_mean(xentropy, name='loss')
        
        with tf.name_scope("train"):
            optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
            training_op = optimizer.minimize(loss, name='training_op')
        
        with tf.name_scope("eval"):
            correct = tf.nn.in_top_k(logits, y, 1)
            accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
            
        with tf.name_scope("summary"):
            loss_summary = tf.summary.scalar("loss", loss)
            acc_summary = tf.summary.scalar("accuracy", accuracy)
            
        summary = tf.summary.merge_all()
            
        init = tf.global_variables_initializer()
        saver = tf.train.Saver()
        self._X = X
        self._y = y
        self._y_proba = y_proba
        self._loss = loss
        self._accuracy = accuracy
        self._training_op = training_op
        self._init = init
        self._saver = saver
        self._summary = summary
        
    def close_session(self):
        if self._session:
            self._session.close()
    
    @property
    def _logdir(self): 
        now = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
        return os.path.join(self.root_logdir, 'run-%s' % now)


    def fit(self, X, y, n_epochs=20, X_valid=None, y_valid=None):
        # print('Shape of X:', X.shape)
        # print('Shape of y:', y.shape)
        self.close_session()
        n_inputs = X.shape[1]
        self.classes_ = np.unique(y)
        # print(self.classes_)
        n_outputs = len(self.classes_)
        self.class_to_index_ = {label: index
                            for index, label in enumerate(self.classes_)}
        y = np.array([self.class_to_index_[label]
                      for label in y], dtype=np.int32)
        self._graph = tf.Graph()
        with self._graph.as_default():
            self._build_graph(n_inputs, n_outputs)
            # extra ops for batch normalization
            extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

        self._session = tf.Session(graph=self._graph)
        best_loss = np.infty
        
        logdir = self._logdir
        
        with self._session.as_default() as sess:
            train_writer = tf.summary.FileWriter(logdir + '/train', sess.graph)
            val_writer = tf.summary.FileWriter(logdir + '/val')
            self._init.run()
            # self._saver.save()
            for epoch in range(n_epochs):
                rnd_idx = np.random.permutation(len(X))
                i_batch = 0
                train_loss_sum = 0
                for rnd_indices in np.array_split(rnd_idx, len(X) // self.batch_size):
                    X_batch, y_batch = X[rnd_indices], y[rnd_indices]
                    feed_dict = {self._X: X_batch, self._y: y_batch}
                    if self._training is not None:
                        feed_dict[self._training] = True
                    train_loss, summary, _ = sess.run([self._loss, self._summary, self._training_op], feed_dict=feed_dict)
                    train_loss_sum += train_loss
                    i_batch += 1
                    if extra_update_ops:
                        sess.run(extra_update_ops, feed_dict=feed_dict)
                
                train_loss = train_loss_sum / i_batch
                train_writer.add_summary(summary, epoch)
                if X_valid is not None and y_valid is not None:
                    loss_val, acc_val, summary = sess.run([self._loss, self._accuracy, self._summary],
                                                 feed_dict={self._X: X_valid,
                                                            self._y: y_valid})
                    if loss_val < best_loss:
                        best_loss = loss_val
                    val_writer.add_summary(summary, epoch)
                    print("{}\tTraining loss: {:.6f}\tValidation loss: {:.6f}\tBest loss: {:.6f}\tAccuracy: {:.2f}%".format(
                        epoch, train_loss, loss_val, best_loss, acc_val * 100))
            self._saver.save(sess, os.path.join(logdir, 'model_final.ckpt'))
            train_writer.close()
            val_writer.close()
        return self
    
    def predict_proba(self, X):
        if not self._session:
            raise NotFittedError("This %s instance is not fitted yet" % self.__class__.__name__)
        with self._session.as_default() as sess:
            return self._y_proba.eval(feed_dict={self._X: X})

    def predict(self, X):
        class_indices = np.argmax(self.predict_proba(X), axis=1)
        return np.array([[self.classes_[class_index]]
                         for class_index in class_indices], np.int32)

    def save(self, path):
        self._saver.save(self._session, path)

N_INPUTS = 28*28
N_OUTPUTS = 5

MNIST = input_data.read_data_sets("/tmp/mnist/data")
X_train = MNIST.train.images
y_train = MNIST.train.labels
X_valid = MNIST.validation.images
y_valid = MNIST.validation.labels

inds = [ind for ind, label in enumerate(MNIST.train.labels) if label < 5]
X_train1 = X_train[y_train < 5]
y_train1 = y_train[y_train < 5]
val_inds = [ind for ind, label in enumerate(MNIST.validation.labels) if label < 5]
X_valid1 = X_valid[y_valid < 5]
y_valid1 = y_valid[y_valid < 5]

MyClassifier(batch_norm_momentum=0.9).fit(X_train1, y_train1, X_valid=X_valid1, y_valid=y_valid1, 
                   n_epochs=50)
    

Extracting /tmp/mnist/data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist/data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist/data/t10k-labels-idx1-ubyte.gz
0	Training loss: 0.136214	Validation loss: 0.095103	Best loss: 0.095103	Accuracy: 96.76%
1	Training loss: 0.072335	Validation loss: 0.051853	Best loss: 0.051853	Accuracy: 98.63%
2	Training loss: 0.054466	Validation loss: 0.053185	Best loss: 0.051853	Accuracy: 98.44%
3	Training loss: 0.043528	Validation loss: 0.054371	Best loss: 0.051853	Accuracy: 98.55%
4	Training loss: 0.034794	Validation loss: 0.045121	Best loss: 0.045121	Accuracy: 98.59%
5	Training loss: 0.031141	Validation loss: 0.037661	Best loss: 0.037661	Accuracy: 98.71%
6	Training loss: 0.027737	Validation loss: 0.037167	Best loss: 0.037167	Accuracy: 98.79%
7	Training loss: 0.022337	Validation loss: 0.034394	Best loss: 0.034394	Accuracy: 98.94%
8	Training loss: 0.022311	Validation loss: 0.028968	Best loss: 0.028968	Ac

MyClassifier(activation=<function elu at 0x112e6d0d0>,
       batch_norm_momentum=0.9, batch_size=100, dropout_rate=None,
       initializer=<tensorflow.python.ops.init_ops.VarianceScaling object at 0x123d47eb8>,
       learning_rate=0.01, n_epochs=10, n_hidden_layers=5, n_neurons=100,
       random_state=42, root_logdir='chap-11-ex-8')

In [None]:
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_neurons": np.arange(10, 100, 10),
    "batch_size": [10, 20, 50],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_epochs": [10]
    # "activation": [tf.nn.relu, tf.nn.elu, leaky_relu(alpha=0.01), leaky_relu(alpha=0.1)],
    # you could also try exploring different numbers of hidden layers, different optimizers, etc.
    #"n_hidden_layers": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    #"optimizer_class": [tf.train.AdamOptimizer, partial(tf.train.MomentumOptimizer, momentum=0.95)],
}

rnd_search = RandomizedSearchCV(MyClassifier(random_state=42), param_distribs, n_iter=8,
                                random_state=42, verbose=2)
cv = rnd_search.fit(X_train, y_train, X_valid=X_valid, y_valid=y_valid)


### Example of hyperparameter optimization with [hyperopt](https://hyperopt.github.io/hyperopt/)

In [None]:
# Installation from source required to get the latest version
# !pip install --upgrade git+git://github.com/hyperopt/hyperopt.git

In [None]:
import time
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

space = {
            'n_neurons': hp.quniform('n_neurons', 10, 100, 10),
            'n_layers': hp.quniform('n_layers', 1, 9, 2),
            'n_epochs': hp.choice('n_epochs', [30, 50])
        }

def f_nn(params):
    print('Testing params:', params)
    val_acc = train(n_neurons=int(params['n_neurons']), 
                    n_layers=int(params['n_layers']),
                    n_epochs=int(params['n_epochs']),
                    verbose=False)
    return {
            'loss': -val_acc,
            'val_acc': val_acc,
            'params': params,
            'status': STATUS_OK
           }

trials = Trials()

best = fmin(f_nn,
    space=space,
    algo=tpe.suggest,
    max_evals=20,
    trials=trials)


In [None]:
import pandas as pd

# Flatten params to columns of their own
for trial in trials.results:
    params = dict(trial['params'])
    for key in params.keys():
        trial[key] = params[key]

pd.DataFrame.from_dict(trials.results).sort_values(by='loss', axis=0)

### Example of hyperparameter optimization with [Tune](https://ray.readthedocs.io/en/latest/tune-usage.html)

In [None]:
import ray
import ray.tune as tune
ray.init()

In [None]:
def train_func(params, reporter):
    print('Testing params:', params)
    val_acc = train(n_neurons=params['n_neurons'], 
                    n_layers=params['n_layers'],
                    n_epochs=params['n_epochs'],
                    verbose=False)
    reporter(val_acc=val_acc)
    return None

all_trials = tune.run_experiments({
    "my_experiment": {
        "run": train_func,
        "config": {"n_neurons": tune.grid_search([20, 50, 100]),
                    "n_layers": tune.grid_search([1, 5, 10]),
                    "n_epochs": tune.grid_search([50])
                  }
    }
})

In [None]:
import numpy as np
from skopt import gp_minimize

def f(x):
    return (np.sin(5 * x[0]) * (1 - np.tanh(x[0] ** 2)) *
            np.random.randn() * 0.1)

res = gp_minimize(f, [(-2.0, 2.0)])

### Some TensorFlow tips and tricks for easier organization of code

Toy example of minimizing $(X + z + 1.0)^2$, where input $X=1.0$. Expected output is $z=2.0$. The example demonstrates using decorators to set default graph, name scopes, and variable scopes.

In [None]:
import tensorflow as tf
from functools import wraps
from datetime import datetime
import os

tf.reset_default_graph()

def with_return_graph(graph):
    def inner_function(function):
        @wraps(function)
        def wrapper(*args, **kwargs):
            # print("Arguments passed to decorator:", graph)
            with graph.as_default():
                function(*args, **kwargs)
            return graph
        return wrapper
    return inner_function

def with_name_scope(scope_name):
    def inner_func(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            with tf.name_scope(scope_name):
                return func(*args, **kwargs)
        return wrapper
    return inner_func

def with_variable_scope(scope_name, **kwargs_scope):
    def inner_func(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            with tf.variable_scope(scope_name, **kwargs_scope):
                return func(*args, **kwargs)
        return wrapper
    return inner_func

# graph = tf.Graph() # Alternative way to fill graph with decorators
# @with_return_graph(graph)
def build_graph():
    
    @with_name_scope('input')
    def add_inputs():
        X = tf.placeholder(tf.float32, shape=(None), name="X")
        @with_variable_scope('variables', initializer=tf.random_normal_initializer(stddev=1.0))
        def add_variables():
            return tf.get_variable(name="z", shape=())
        z = add_variables()
        w = tf.add(X, z, name='w')
        return w
    
    @with_name_scope('prediction')
    def add_prediction(X):
        return tf.add(X, 1.0, name='y')
    
    @with_name_scope('loss')
    def add_loss(y):
        return tf.square(y, name='loss')
    
    @with_name_scope('summaries')
    def add_summaries(loss):
        loss_summary = tf.summary.scalar('loss', loss)
        return tf.summary.merge_all()
    
    # @with_variable_scope("parameters", reuse=True)
    # def get_parameters():
    #     return { 'learning_rate': tf.get_variable("learning_rate") }
    
    def get_parameters():
        graph = tf.get_default_graph()
        return { 'learning_rate': graph.get_tensor_by_name("parameters/learning_rate:0")}
    
    @with_name_scope('train')
    def add_optimizer(loss, learning_rate):
        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
        training_op = optimizer.minimize(loss, name='training_op')
        return training_op
    
    w = add_inputs() # X + z
    y = add_prediction(w) # X + z + 1.0
    loss = add_loss(y) # y^2
        
    summary = add_summaries(loss)
    params = get_parameters()
    training_op = add_optimizer(loss, learning_rate=params['learning_rate'])
    
    return tf.get_default_graph()

def tb_logdir(root_logdir):   
    now = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
    return os.path.join(root_logdir, 'run-%s' % now)

# graph = tf.Graph()

# @with_return_graph(graph)
# def fill_graph():
#     return build_graph()

N_EPOCHS = 50
LEARNING_RATE = 0.1
ROOT_LOGDIR = 'tf-logs/toy-example/'
X_VAL = 1.0

'''
@with_variable_scope("parameters", reuse=False)
def set_constants(**kwargs):
    for key in kwargs:
        tf.get_variable(key, shape=(), initializer=tf.constant_initializer(kwargs[key]))
'''

@with_name_scope("parameters")
def set_constants(**kwargs):
    for key in kwargs:
        # tf.get_variable(key, shape=(), initializer=tf.constant_initializer(kwargs[key]))
        tf.constant(kwargs[key], name=key)

with tf.Graph().as_default():
    set_constants(learning_rate=LEARNING_RATE)
    graph = build_graph()
    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

X = graph.get_tensor_by_name("input/X:0") # Placeholder for input
y = graph.get_tensor_by_name("prediction/y:0") # X + z + 1.0
z = graph.get_tensor_by_name("variables/z:0") # Variable to minimize
loss = graph.get_tensor_by_name("loss/loss:0") # Loss (X + z + 1.0)^2
training_op = graph.get_operation_by_name("train/training_op") # Operation updating z
learning_rate = graph.get_tensor_by_name("parameters/learning_rate:0") # Accessing parameters
summaries = graph.get_tensor_by_name("summaries/Merge/MergeSummary:0") # Accessing merged summary

logdir = tb_logdir(ROOT_LOGDIR)
saved_model_path = os.path.join(logdir, 'model_final.ckpt')

# print(graph.get_operations())

with tf.Session(graph=graph) as sess:
    file_writer = tf.summary.FileWriter(logdir, sess.graph)
    init.run()
    feed_dict = { X: X_VAL }
    z_val, y_val, loss_val = sess.run([z, y, loss], feed_dict=feed_dict)
    
    for epoch in range(1, N_EPOCHS+1):
        loss_val, summary_val, _ = sess.run([loss, summaries, training_op], feed_dict=feed_dict)
        if epoch % 5 == 0:
            print('Epoch:', epoch, 'loss:', loss_val)
            file_writer.add_summary(summary_val, epoch)
    
    z_val, y_val, loss_val = sess.run([z, y, loss], feed_dict=feed_dict)
    saver.save(sess, saved_model_path)
    file_writer.close()
    
print('Got value %.2f for z, expected -2.0' % (z_val))