# MC_Project - Building & Transferring a Deep Neural Net

### Key steps:
1. Build a DNN with five hidden layers of 100 neurons each, He initialization, and the ELU activation function.
2. Using Adam optimization and early stopping, try training it on MNIST but only on digits 0 to 4, as we will use transfer learning for digits 5 to 9 next. You will need a softmax output layer with five neurons, and as always make sure to save checkpoints at regular intervals and save the final model so you can reuse it later.
3. Tune the hyperparameters using cross-validation and see what precision you can achieve.
4. Now try adding Batch Normalization and compare the learning curves: is it converging faster than before? Does it produce a better model?
5. Is the model overfitting the training set? Try adding dropout to every layer and try again. Does it help?
6. Create a new DNN that reuses all the pretrained hidden layers of the previous model, freezes them, and replaces the softmax output layer with a fresh new one.
7. Train this new DNN on digits 5 to 9, using only 100 images per digit, and time how long it takes. Despite this small number of examples, can you achieve high precision?
8. Try caching the frozen layers, and train the model again: how much faster is it now?

In [163]:
#set up libs and data from first notebook!
import tensorflow as tf
import numpy as np
import itertools
import matplotlib.pyplot as plt
from tensorflow.examples.tutorials.mnist import input_data

def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

mnist = input_data.read_data_sets("/tmp/data/")

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


## 1. Build a DNN with five hidden layers of 100 neurons each, He initialization, and the ELU activation function.

In [190]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

#### First let's build a function to construct a DNN quickly, so we don't have to do it over and over again...

In [213]:
#set function params
def dnn(inputs, n_hidden_layers=5, n_neurons=100, name=None, 
        activation = tf.nn.elu, 
        initializer = tf.contrib.layers.variance_scaling_initializer()):
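    #if n_neurons is not a single integer, treat it as a sequence with one entry per layer
    if type(n_neurons) is not int:
        if not all([type(k)==int for k in n_neurons]):
            raise ValueError('All numbers must be integers! You idiot!')
        #bug fix: the argument is called n_hidden_layers (n_layers was undefined)
        if len(n_neurons) != n_hidden_layers:
            raise ValueError('len(n_neurons) must be equal to n_hidden_layers! Way to go, genius!')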
    #else just take n_neurons for each layer
    else:
        n_neurons = np.repeat(n_neurons,n_hidden_layers)
    with tf.variable_scope(name, 'dnn'):
        for layer, neurons in zip(range(n_hidden_layers), n_neurons):
            inputs = tf.layers.dense(inputs, neurons, 
                                     activation=activation, 
                                     kernel_initializer=initializer, 
                                     name="SneakyHL_{0}".format(layer+1))
    #return the last inputs iteration (to be fed into the output layer)
    return inputs
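
As a framework-free illustration of what the function above wires together, here is a minimal NumPy sketch of a forward pass through five ELU layers with He-initialized weights. It is not the TensorFlow implementation: `he_normal`, `elu`, and `forward` are hypothetical helper names, the input is random fake data, and the weights come from a plain normal distribution, whereas the graph output further down shows TensorFlow drawing from a truncated normal.

```python
import numpy as np

def he_normal(fan_in, fan_out, rng):
    # He initialization: zero-mean weights with variance 2 / fan_in
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def elu(z, alpha=1.0):
    # ELU: identity for z > 0, alpha * (exp(z) - 1) otherwise
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def forward(X, layer_sizes, rng):
    # Stack dense layers (zero biases) followed by ELU, mirroring the hidden layers above
    activations = X
    for n_out in layer_sizes:
        W = he_normal(activations.shape[1], n_out, rng)
        b = np.zeros(n_out)
        activations = elu(activations @ W + b)
    return activations

rng = np.random.default_rng(42)
X_demo = rng.random((8, 784))             # 8 fake "images" with pixel values in [0, 1)
hidden_out = forward(X_demo, [100] * 5, rng)
print(hidden_out.shape)                   # (8, 100)
```

Scaling the weight variance by 2 / fan_in is what keeps the activations from shrinking or blowing up as they pass through the stacked layers.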

In [215]:
#set up graph
reset_graph()

n_features = 28 * 28 # MNIST
n_classes = 5
he_init = tf.contrib.layers.variance_scaling_initializer()

X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

#run our function! Feed in the X variable as our first input
last_HL = dnn(X, n_neurons=150,name="MikesGloriousDNN")

logits = tf.layers.dense(last_HL, n_classes, kernel_initializer=he_init, name="logits")
Y_proba = tf.nn.softmax(logits, name="Y_proba")

In [216]:
#Check out the operations
tf.get_default_graph().get_operations()

[<tf.Operation 'X' type=Placeholder>,
 <tf.Operation 'y' type=Placeholder>,
 <tf.Operation 'MikesGloriousDNN/SneakyHL_1/kernel/Initializer/truncated_normal/shape' type=Const>,
 <tf.Operation 'MikesGloriousDNN/SneakyHL_1/kernel/Initializer/truncated_normal/mean' type=Const>,
 <tf.Operation 'MikesGloriousDNN/SneakyHL_1/kernel/Initializer/truncated_normal/stddev' type=Const>,
 <tf.Operation 'MikesGloriousDNN/SneakyHL_1/kernel/Initializer/truncated_normal/TruncatedNormal' type=TruncatedNormal>,
 <tf.Operation 'MikesGloriousDNN/SneakyHL_1/kernel/Initializer/truncated_normal/mul' type=Mul>,
 <tf.Operation 'MikesGloriousDNN/SneakyHL_1/kernel/Initializer/truncated_normal' type=Add>,
 <tf.Operation 'MikesGloriousDNN/SneakyHL_1/kernel' type=VariableV2>,
 <tf.Operation 'MikesGloriousDNN/SneakyHL_1/kernel/Assign' type=Assign>,
 <tf.Operation 'MikesGloriousDNN/SneakyHL_1/kernel/read' type=Identity>,
 <tf.Operation 'MikesGloriousDNN/SneakyHL_1/bias/Initializer/zeros' type=Const>,
 <tf.Operation 'Mik

In [137]:
#Check out the trainable variables
tf.trainable_variables()

[<tf.Variable 'MikesGloriousDNN/SneakyHL_1/kernel:0' shape=(784, 150) dtype=float32_ref>,
 <tf.Variable 'MikesGloriousDNN/SneakyHL_1/bias:0' shape=(150,) dtype=float32_ref>,
 <tf.Variable 'MikesGloriousDNN/SneakyHL_2/kernel:0' shape=(150, 150) dtype=float32_ref>,
 <tf.Variable 'MikesGloriousDNN/SneakyHL_2/bias:0' shape=(150,) dtype=float32_ref>,
 <tf.Variable 'MikesGloriousDNN/SneakyHL_3/kernel:0' shape=(150, 150) dtype=float32_ref>,
 <tf.Variable 'MikesGloriousDNN/SneakyHL_3/bias:0' shape=(150,) dtype=float32_ref>,
 <tf.Variable 'MikesGloriousDNN/SneakyHL_4/kernel:0' shape=(150, 150) dtype=float32_ref>,
 <tf.Variable 'MikesGloriousDNN/SneakyHL_4/bias:0' shape=(150,) dtype=float32_ref>,
 <tf.Variable 'MikesGloriousDNN/SneakyHL_5/kernel:0' shape=(150, 150) dtype=float32_ref>,
 <tf.Variable 'MikesGloriousDNN/SneakyHL_5/bias:0' shape=(150,) dtype=float32_ref>,
 <tf.Variable 'logits/kernel:0' shape=(150, 5) dtype=float32_ref>,
 <tf.Variable 'logits/bias:0' shape=(5,) dtype=float32_ref>]
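
The shapes listed above are enough to count the trainable parameters by hand. A small sketch in plain Python, with the shapes copied from that output (the list and the helper `n_params` are only illustrative):

```python
# (name, shape) pairs copied from the tf.trainable_variables() output above
variable_shapes = [
    ("SneakyHL_1/kernel", (784, 150)), ("SneakyHL_1/bias", (150,)),
    ("SneakyHL_2/kernel", (150, 150)), ("SneakyHL_2/bias", (150,)),
    ("SneakyHL_3/kernel", (150, 150)), ("SneakyHL_3/bias", (150,)),
    ("SneakyHL_4/kernel", (150, 150)), ("SneakyHL_4/bias", (150,)),
    ("SneakyHL_5/kernel", (150, 150)), ("SneakyHL_5/bias", (150,)),
    ("logits/kernel", (150, 5)), ("logits/bias", (5,)),
]

def n_params(shape):
    # Number of scalars in a tensor = product of its dimensions
    count = 1
    for dim in shape:
        count *= dim
    return count

total_params = sum(n_params(shape) for _, shape in variable_shapes)
print(total_params)   # 209105
```

More than half of the parameters sit in the first kernel (784 x 150), which is typical when the input is a flattened image.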

## 2. Using Adam optimization and early stopping, try training it on MNIST but only on digits 0 to 4, as we will use transfer learning for digits 5 to 9 next. You will need a softmax output layer with five neurons, and as always make sure to save checkpoints at regular intervals and save the final model so you can reuse it later.

#### Now we have to complete the graph by adding the training operation and the accuracy variables

In [217]:
learning_rate = 0.01

#set loss function
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
mean_loss = tf.reduce_mean(cross_entropy, name='loss')

#set training operation
adam_optimizer = tf.train.AdamOptimizer(learning_rate)
training_operation = adam_optimizer.minimize(mean_loss, name="training_operation")

#set accuracy variables
correct_top1 = tf.nn.in_top_k(logits, y, 1)
mean_accuracy = tf.reduce_mean(tf.cast(correct_top1,tf.float32),name="model_accuracy")

initialize_vars = tf.global_variables_initializer()
model_saver = tf.train.Saver()
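
To make the loss and accuracy nodes above less of a black box, here is a minimal NumPy sketch of the same two quantities on a toy batch. The function names are hypothetical, the logits are made up, and tie handling in `tf.nn.in_top_k` is ignored.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    # Subtract the row max before exponentiating for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class in every row
    return -log_probs[np.arange(len(labels)), labels]

def top1_correct(logits, labels):
    # True where the highest logit belongs to the target class
    return logits.argmax(axis=1) == labels

logits_demo = np.array([[2.0, 0.5, 0.1, 0.0, -1.0],
                        [0.0, 0.0, 3.0, 0.0, 0.0],
                        [1.0, 1.5, 0.0, 0.0, 0.0]])
labels_demo = np.array([0, 2, 0])

mean_loss_demo = sparse_softmax_cross_entropy(logits_demo, labels_demo).mean()
mean_accuracy_demo = top1_correct(logits_demo, labels_demo).astype(np.float32).mean()
print(mean_accuracy_demo)   # 2 of 3 rows are correct
```

The labels are plain class indices ("sparse"), not one-hot vectors, which is why `y` is an integer placeholder in the graph.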

### Great. Now let's extract the digits < 5 from the MNIST data set and run a model!

In [218]:
#extract MNIST observations 4 and below!
X_train1 = mnist.train.images[mnist.train.labels < 5]
y_train1 = mnist.train.labels[mnist.train.labels < 5]
X_valid1 = mnist.validation.images[mnist.validation.labels < 5]
y_valid1 = mnist.validation.labels[mnist.validation.labels < 5]
X_test1 = mnist.test.images[mnist.test.labels < 5]
y_test1 = mnist.test.labels[mnist.test.labels < 5]

print('Train size: ', X_train1.shape)
print('Validation size: ', X_valid1.shape)
print('Test size: ', X_test1.shape)

Train size:  (28038, 784)
Validation size:  (2558, 784)
Test size:  (5139, 784)


In [197]:
#set parameters for model run
n_epochs = 20
batch_size = 100
n_rows = len(X_train1)
val_dict = {X:X_valid1, y:y_valid1}

#set params for early stopping
max_epoch_no_progress = 5
no_progress_epoch_count = 0
best_loss = np.inf  #np.infty is an alias that NumPy 2.0 removed
best_accuracy = 0

#run dat model!
with tf.Session() as sess:
    #initialize
    initialize_vars.run()
    step=0
    
    #begin epoch and nested batch loops!
    for epoch in range(n_epochs):
        #create random permutation (no duplication!!!)
        random_index = np.random.permutation(n_rows)
        #split up random array into batch sizes
        for rand_idx in np.array_split(random_index, n_rows//batch_size):
            X_batch, y_batch = X_train1[rand_idx], y_train1[rand_idx]
            #run training operations with batches
            sess.run(training_operation, feed_dict={X:X_batch, y:y_batch})
            step+=1
        #extract loss val after each epoch
        validation_loss, validation_acc = sess.run([mean_loss, mean_accuracy], feed_dict=val_dict)
        #save model and implement early stopping
        if validation_loss < best_loss:
            the_best_model = model_saver.save(sess,"./tf_logs/MC_ProjBTModel_0to4")
            best_loss = validation_loss
            best_accuracy = validation_acc
            no_progress_epoch_count = 0
        else:
            no_progress_epoch_count += 1
            if no_progress_epoch_count > max_epoch_no_progress:
                print("\nModel stopped EARLY at {0} epochs and {1} steps".format(epoch,step))
                print("Best Validation Accuracy: {:.4f}%".format(best_accuracy*100))
                break
        #print the results of the training epoch (keep 6 digits)
        print("{0} Epochs Complete:\tValidation Loss: {1:.6f},\tBest Loss: {2: .6f},\tAccuracy: {3: .3f}%".format(epoch,validation_loss,best_loss,validation_acc*100))

0 Epochs Complete:	Validation Loss: 0.086732,	Best Loss:  0.086732,	Accuracy:  97.381%
1 Epochs Complete:	Validation Loss: 0.124854,	Best Loss:  0.086732,	Accuracy:  96.325%
2 Epochs Complete:	Validation Loss: 0.054358,	Best Loss:  0.054358,	Accuracy:  98.084%
3 Epochs Complete:	Validation Loss: 0.067607,	Best Loss:  0.054358,	Accuracy:  98.475%
4 Epochs Complete:	Validation Loss: 0.062015,	Best Loss:  0.054358,	Accuracy:  98.475%
5 Epochs Complete:	Validation Loss: 0.045546,	Best Loss:  0.045546,	Accuracy:  98.593%
6 Epochs Complete:	Validation Loss: 0.077384,	Best Loss:  0.045546,	Accuracy:  98.475%
7 Epochs Complete:	Validation Loss: 0.117975,	Best Loss:  0.045546,	Accuracy:  98.436%
8 Epochs Complete:	Validation Loss: 0.172980,	Best Loss:  0.045546,	Accuracy:  95.700%
9 Epochs Complete:	Validation Loss: 0.063484,	Best Loss:  0.045546,	Accuracy:  98.397%
10 Epochs Complete:	Validation Loss: 0.043767,	Best Loss:  0.043767,	Accuracy:  98.827%
11 Epochs Complete:	Validation Loss: 0.053
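
The early-stopping rule used in the loop can be replayed on the validation losses printed above, without any TensorFlow. This is only a sketch of the bookkeeping; `replay_early_stopping` is a hypothetical helper, and the list holds the eleven complete loss values from the log (the truncated twelfth line is left out).

```python
def replay_early_stopping(val_losses, patience):
    """Return (best_epoch, best_loss, stop_epoch); stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_progress = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_progress = 0
        else:
            epochs_without_progress += 1
            # Same rule as the loop above: stop once the counter exceeds the patience
            if epochs_without_progress > patience:
                return best_epoch, best_loss, epoch
    return best_epoch, best_loss, None

logged_losses = [0.086732, 0.124854, 0.054358, 0.067607, 0.062015, 0.045546,
                 0.077384, 0.117975, 0.172980, 0.063484, 0.043767]
print(replay_early_stopping(logged_losses, patience=5))   # (10, 0.043767, None)
```

Because the comparison is `>` rather than `>=`, a patience of 5 tolerates five stale epochs and stops on the sixth. In the log above the counter reaches 4 before epoch 10 sets a new best loss, so training continues.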

## 3. Tune the hyperparameters using cross-validation and see what precision you can achieve.


### Let's use Scikit-Learn's RandomizedSearchCV!!! Create a custom neural net classifier that is compatible with it

To speed things up, and to avoid duplicating our code a billion times, we will create a DNN classifier class that we can train and pass to Scikit-Learn's RandomizedSearchCV to perform hyperparameter tuning. Below is a breakdown of the class:
- the `__init__()` method (constructor) simply creates an instance variable for each hyperparameter
- the `fit()` method creates a graph, starts a session, and trains the model. It relies on other methods to do so:
    - the `_construct_graph()` method builds the graph. Once this is done, it saves all important operations as instance variables so you can access them easily in other methods
    - the `_DeepNeuralNet()` method builds the hidden layers. It also supports batch normalization and dropout
    - in the `fit()` method the user can pass X_valid and y_valid, which turns on early stopping. **NOTE: this implementation DOES NOT save the best model to disk, but keeps it in memory instead! It uses the `_get_model_params()` method to get all the graph's variables and their values, and the `_restore_model_params()` method to restore the variables from the best model. This actually speeds up training! Wooohoooo!**
    - Once fit is complete, the model is trained and the session remains open so that predictions can be made quickly... WITHOUT having to save a model to disk and restore it for every prediction.
    - `close_session()` will allow you to close the session if need be!
- `predict_proba()` uses the trained model to predict the class probabilities
- the `predict()` method calls `predict_proba()` and returns the class with the highest probability for each instance!
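
The in-memory checkpoint idea from the note above can be sketched without TensorFlow. `ToyModel` below is a hypothetical stand-in for the network, with one weight and a made-up validation loss. It only shows the pattern: snapshot a copy of the parameters whenever the loss improves, then restore the best snapshot at the end.

```python
import copy

class ToyModel:
    """Hypothetical stand-in for a network: one weight nudged on every epoch."""
    def __init__(self):
        self.params = {"weight": [0.0]}

    def train_one_epoch(self):
        self.params["weight"][0] += 1.0                  # mutate the parameters in place

    def validation_loss(self):
        return (self.params["weight"][0] - 3.0) ** 2     # lowest when weight == 3

    def get_params(self):
        return copy.deepcopy(self.params)                # a snapshot must be a copy, not a reference

    def restore_params(self, snapshot):
        self.params = copy.deepcopy(snapshot)

model = ToyModel()
best_loss, best_snapshot = float("inf"), None
for _ in range(6):
    model.train_one_epoch()
    loss = model.validation_loss()
    if loss < best_loss:
        best_loss, best_snapshot = loss, model.get_params()

model.restore_params(best_snapshot)
print(model.params, best_loss)   # {'weight': [3.0]} 0.0
```

Training ran past the best point (the weight ended at 6.0), yet the restored model holds the parameters from the best epoch and nothing was written to disk.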

A few notes:
- Use a _leading_underscore (e.g. self._variable) to declare 'private' variables; module-level names that follow this convention are skipped by "from module import *"... ***These are not actually private***, but it's good practice. The convention is known as a “weak internal use indicator”.
- Use a single trailing underscore to avoid conflicts with Python keywords. For example: Tkinter.Toplevel(master, class_='ClassName') # Avoid conflict with 'class' keyword. Scikit-Learn additionally uses a trailing underscore (e.g. self.classes_) for attributes that are learned during fit().
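
Both conventions are easy to check in plain Python. In the sketch below the module name `underscore_demo` and the function `make_label` are made up for the demonstration, and the module is built in memory rather than loaded from a file.

```python
import sys
import types

# Build a throwaway module with one public and one "internal" name
demo = types.ModuleType("underscore_demo")
exec("public_value = 1\n_internal_value = 2", demo.__dict__)
sys.modules["underscore_demo"] = demo

# A star import skips names that start with an underscore (when __all__ is not defined)
namespace = {}
exec("from underscore_demo import *", namespace)
star_imported = sorted(name for name in namespace if not name.startswith("__"))
print(star_imported)               # ['public_value']

# ...but the "private" name is still reachable explicitly
print(demo._internal_value)        # 2

# A trailing underscore sidesteps the reserved word 'class'
def make_label(text, class_="digit"):
    return "{}:{}".format(class_, text)

print(make_label("5"))             # digit:5
```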



In [298]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.exceptions import NotFittedError
import warnings
from datetime import datetime
from pytz import timezone
import numpy as np

class MikesGloriousDNNClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, n_hidden_layers=5, n_neurons=100, learning_rate=0.01, k_in_top_val = 1, layer_name="hidden",
                 batch_size = 50,
                 max_epochs_no_progress = 10,
                 tf_logs_path = "./DNN1_LOGS",
                 optimizer_class=tf.train.AdamOptimizer, 
                 activation_function=tf.nn.elu, 
                 initializer=tf.contrib.layers.variance_scaling_initializer(), 
                 tb_model_name = None,
                 batch_norm_momentum_decay=None, 
                 dropout_rate=None, 
                 random_state=None):
        """Initialize the Classifier by storing all the hyperparams"""
        self.n_hidden_layers = n_hidden_layers
        self.n_neurons = n_neurons
        self.learning_rate = learning_rate
        self.k_in_top_val = k_in_top_val
        self.layer_name = layer_name
        self.batch_size = batch_size
        self.max_epochs_no_progress = max_epochs_no_progress
        self.tf_logs_path = tf_logs_path
        self.optimizer_class = optimizer_class
        self.activation_function = activation_function
        self.initializer = initializer
        self.tb_model_name = tb_model_name #assign tb_model_name to activate tensor board model saving
        self.batch_norm_momentum_decay = batch_norm_momentum_decay
        self.dropout_rate = dropout_rate
        self.random_state = random_state
        self.dropout_on=False
        self.batch_norm_on = False
        self.early_stopping_on = False
        self._session = None
        
        #test inputs
        assert (isinstance(self.n_hidden_layers, int)), "n_hidden_layers parameter must be integer"
        assert (isinstance(self.learning_rate, float)), "learning_rate parameter must be a float"
        assert (isinstance(self.k_in_top_val, int)), "k_in_top_val parameter must be integer"
        
        if isinstance(self.n_neurons, list):
            #np.int was only an alias for the built-in int and is gone from recent NumPy releases
            if not all([isinstance(k, (int, np.integer)) for k in self.n_neurons]):
                raise ValueError('All numbers must be integers! You idiot!')
            if len(self.n_neurons) != self.n_hidden_layers:
                raise ValueError('n_neurons must be equal to n_layers! Way to go, genius!')
        else:
            assert (isinstance(self.n_neurons, int)), "n_neurons parameter must be integer"
    
    def _DeepNeuralNet(self, inputs):
        """Builds the hidden layers for the DNN, including support for batch normalization and dropout. Order of operations for a single layers is:
            1. Dropout function
            2. Normal dense layers
            3. Batch normalization function
            4. Activation function
        Output layer is created in the '_construct_graph' method"""
        #first transform n_neurons
        if not isinstance(self.n_neurons, list):
            self.n_neurons = np.repeat(self.n_neurons, self.n_hidden_layers)
                        
        for layer, neurons in zip(range(self.n_hidden_layers), self.n_neurons):            
            #dropout_rate defaults to None, in which case this block is skipped
            if self.dropout_rate:
                self.dropout_on = True
                #this first creates a dropout layer for X, and then a dropout layer after every hidden layer except the last
                inputs = tf.layers.dropout(inputs, self.dropout_rate, training=self._inTrainingMode)
            
            #create the normal hidden layers
            inputs = tf.layers.dense(inputs, neurons, 
                                     kernel_initializer=self.initializer, 
                                     name=self.layer_name+"_{}".format(layer+1))
            #batch_norm_momentum_decay defaults to None, in which case this block is skipped
            if self.batch_norm_momentum_decay:
                self.batch_norm_on = True
                inputs = tf.layers.batch_normalization(inputs, 
                                                       momentum=self.batch_norm_momentum_decay, 
                                                       training=self._inTrainingMode)
            #create activation layer
            inputs = self.activation_function(inputs, name=self.layer_name+"_{}_ACTIVATED".format(layer+1))
            #return final activated layer
        return inputs  
    
    def _construct_graph(self, n_features, n_classes):
        """Build graph that takes in X, y, and sets up the outputs"""
        #set random states
        if self.random_state is not None:
            tf.set_random_seed(self.random_state)
            np.random.seed(self.random_state)
        
        #set X and y placeholders
        X = tf.placeholder(tf.float32, shape=(None, n_features), name='X')
        y = tf.placeholder(tf.int64, shape=(None), name='y')
        #set self._inTrainingMode if dropout or batch_norm is utilized
        if self.batch_norm_momentum_decay or self.dropout_rate:
            self._inTrainingMode = tf.placeholder_with_default(False, shape=(), name='inTrainingMode')
        else:
            self._inTrainingMode = None
        #create layers
        activated_last_HL = self._DeepNeuralNet(X)
        
        #create output logits and probabilities
        output_logits = tf.layers.dense(activated_last_HL, n_classes, 
                                          kernel_initializer=self.initializer, 
                                          name='output_logits')
        y_proba = tf.nn.softmax(output_logits, name='y_proba') #this isnt used in training but is useful later!
        
        #create loss function and calculate mean loss
        cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=output_logits)
        mean_loss = tf.reduce_mean(cross_entropy, name='mean_loss')
        
        #create training operation and set accuracy variables
        model_optimizer = self.optimizer_class(learning_rate = self.learning_rate)
        training_op = model_optimizer.minimize(mean_loss, name='training_op')
        correct_topk = tf.nn.in_top_k(output_logits, y, self.k_in_top_val)
        mean_accuracy = tf.reduce_mean(tf.cast(correct_topk, tf.float32), name='mean_accuracy')
        
        #initialization and model saver variables
        initialize_vars = tf.global_variables_initializer()
        model_saver = tf.train.Saver()
        
        #track mean_loss for viewing in TB later!
        if self.tb_model_name:
            eastern = timezone('US/Eastern')
            now = datetime.now(tz=eastern).strftime("%Y-%m-%d %H.%M.%S")
            logdir = "{}/{}-{}".format(self.tf_logs_path,self.tb_model_name, now)
            mean_loss_summary = tf.summary.scalar('mean_loss_summary', mean_loss)
            file_writer = tf.summary.FileWriter(logdir, self._graph)
            #save values to instance
            self._mean_loss_summary = mean_loss_summary
            self._file_writer = file_writer
        
        #IMPORTANT: Save all important operations and variables for quick access later!
        self._X, self._y = X, y
        self._Y_proba = y_proba
        self._training_op = training_op
        self._correct_intopK = correct_topk
        self._mean_loss = mean_loss
        self._mean_accuracy = mean_accuracy
        self._init = initialize_vars
        self._model_saver = model_saver

    def close_session(self):
        #check if session initialized (self._session anything other than None or False)
        if self._session:
            self._session.close()
    
    def _get_model_params(self):
        """Get all variable values (used for early stopping, faster than saving to disk)"""
        #set class graph as default graph
        with self._graph.as_default():
            global_vars = tf.global_variables()
        #use dictionary comp. to get names and values of each variable
        return {gv_name.op.name: gv_value for gv_name, gv_value in zip(global_vars, self._session.run(global_vars))}
   
    def _restore_model_params(self, gv_names_vals):
        #extract gv names
        gv_names = list(gv_names_vals.keys())
        #extract assign operations for each gv        
        assign_ops = {gv_name:self._graph.get_operation_by_name(gv_name+"/Assign") for gv_name in gv_names}
        #extract second operation of each assign op for each gv
        init_values = {gv_name:assign_op.inputs[1] for gv_name, assign_op in assign_ops.items()}
        #create dict of assign_ops and gv_values
        feed_dict = {init_values[gv_name]:gv_names_vals[gv_name] for gv_name in gv_names}
        #restore model params by updating assignment operations
        self._session.run(assign_ops, feed_dict=feed_dict)
            
    def fit(self, X, y, X_valid=None, y_valid=None, n_epochs=100):
        """Fit the model to the training set. If X_valid and y_valid are provided, use EARLY STOPPING!!!"""
        #close existing session if open
        self.close_session()
        self.n_epochs = n_epochs
        
        #early stopping warning
        if X_valid is None or y_valid is None:
            warnings.warn("Early Stopping will not run without 'X_valid' and 'y_valid' inputs.")
        else:
            self.early_stopping_on = True
                
        # Translate the labels vector to a vector of sequential class indices from 0 to n_classes-1
        self.classes_unique, y = np.unique(y, return_inverse=True)
            
        #extract n_inputs and n_outputs from the training set!!!
        n_classes_ = len(self.classes_unique)
        n_features_ = X.shape[1]
        
        self._graph = tf.Graph()
        with self._graph.as_default():
            #BUILD THE GRAPH
            self._construct_graph(n_features_, n_classes_)
            #We need to explicitly run the extra update operations needed by batch normalization 
            #namely (sess.run([training_op, extra_update_ops],...).
            extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        
        train_text = "Training DNN: \nlayers={0}, neurons={1}, epochs={2},  batch_size={3}, early_stop:{4}, batch_norm:{5}, dropout:{6}\n"
        print(train_text.format(self.n_hidden_layers,self.n_neurons,self.n_epochs,self.batch_size,
                                self.early_stopping_on,self.batch_norm_on,self.dropout_on))
        #set params to run model and perform early stopping (if we need it!)
        no_progress_epoch_count = 0
        best_loss = np.inf  #np.infty is an alias that NumPy 2.0 removed
        best_accuracy = 0
        best_params = None
        
        self._session = tf.Session(graph=self._graph)
        with self._session.as_default() as sess:
            self._init.run()
            step=0
            for epoch in range(self.n_epochs):
                random_index = np.random.permutation(len(X))
                for rand_idx in np.array_split(random_index, len(X)//self.batch_size):
                    X_batch, y_batch = X[rand_idx], y[rand_idx]
                    batch_dict = {self._X:X_batch, self._y:y_batch}
                    #if inTrainingMode is activated in the graph, then it needs to be set to true to activate dropout and/or batch_normalization
                    if self._inTrainingMode is not None:
                        batch_dict[self._inTrainingMode] = True
                    sess.run(self._training_op, feed_dict=batch_dict)
                    if extra_update_ops:
                        sess.run(extra_update_ops, feed_dict=batch_dict)
                    #save model summary every 10 steps for TB
                    if self.tb_model_name:
                        if step % 10 == 0:
                            summary_str = self._mean_loss_summary.eval(feed_dict=batch_dict)
                            self._file_writer.add_summary(summary_str, step)
                    step += 1

                #if validation data provided, run early stopping
                if self.early_stopping_on:
                    val_dict = {self._X: X_valid, self._y:y_valid}
                    validation_loss, validation_acc = sess.run([self._mean_loss, self._mean_accuracy], feed_dict=val_dict)

                    #save model and implement early stopping
                    if validation_loss < best_loss:
                        best_params = self._get_model_params()
                        best_loss = validation_loss
                        best_accuracy = validation_acc
                        no_progress_epoch_count = 0
                    else:
                        no_progress_epoch_count += 1
                        if no_progress_epoch_count > self.max_epochs_no_progress:
                            print("\nModel stopped EARLY at {0} epochs and {1} steps".format(epoch+1,step))
                            print("Best Validation Accuracy: {:.4f}%".format(best_accuracy*100))
                            break                        
                    #print the results of the training epoch (keep 6 digits)
                    print("{0} Epochs:\tVal Loss: {1:.6f},\tBest_Loss: {2: .6f},\tAcc: {3: .3f}%\tNo_Prog: {4}".format(epoch+1,validation_loss,best_loss,validation_acc*100,no_progress_epoch_count))
                #no validation provided, calculate loss and acc on batches
                else:
                    train_dict = {self._X: X_batch, self._y:y_batch}
                    train_loss, train_acc = sess.run([self._mean_loss, self._mean_accuracy], feed_dict=train_dict)
                    print("{0} Epochs Complete:\tTraining Batch Loss: {1:.6f}, \tAccuracy: {2: .3f}%".format(epoch+1,train_loss,train_acc*100))
            #close filewriter if open
            if self.tb_model_name:
                self._file_writer.close()
            #restore best model if we used early stopping
            if best_params:
                self._restore_model_params(best_params)
            #Return self. This is again for compatibility reasons with common interface of scikit-learn.
        return self

    def predict_proba(self, X):
        #return error if no session initialized (ie model hasn't been fit yet)
        if not self._session:
            raise NotFittedError("This {} instance has not yet been fitted!!!".format(self.__class__.__name__))
        with self._session.as_default() as sess:
            return self._Y_proba.eval(feed_dict={self._X: X})
    
    def predict(self, X):
        class_indices = np.argmax(self.predict_proba(X), axis=1)
        return np.array([[self.classes_unique[class_index]] for class_index in class_indices], np.int32)

    def save(self, path):
        self._model_saver.save(self._session, path)
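
The label handling in `fit()` and `predict()` matters once the classes are not 0 to n_classes-1, as with digits 5 to 9 later on. Below is a minimal NumPy sketch of the round trip, using made-up labels and probabilities (`y_demo` and `proba_demo` are illustrative only).

```python
import numpy as np

# fit(): translate arbitrary labels into class indices 0..n_classes-1
y_demo = np.array([5, 7, 9, 5, 7])
classes_unique, y_indices = np.unique(y_demo, return_inverse=True)
print(classes_unique)   # [5 7 9]
print(y_indices)        # [0 1 2 0 1]

# predict(): pretend the network produced these class probabilities for three instances
proba_demo = np.array([[0.1, 0.2, 0.7],
                       [0.8, 0.1, 0.1],
                       [0.3, 0.6, 0.1]])
predicted_labels = classes_unique[np.argmax(proba_demo, axis=1)]
print(predicted_labels) # [9 5 7]
```

Note that `fit()` as written feeds `y_valid` to the graph unchanged. That is harmless for digits 0 to 4, but validation labels outside 0 to n_classes-1 would presumably need the same translation before they are fed in.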

##### Run the model without batch_norm or dropout!

In [223]:
reset_graph()
dnn_clf = MikesGloriousDNNClassifier(random_state=42, n_hidden_layers=5, n_neurons=50, max_epochs_no_progress=5)
dnn_clf.fit(X_train1, y_train1, n_epochs=1000, X_valid=X_valid1, y_valid=y_valid1)

Training DNN: 
layers=5, neurons=[50 50 50 50 50], epochs=1000,  batch_size=50, early_stop:True, batch_norm:False, dropout:False

1 Epochs:	Val Loss: 0.074042,	Best_Loss:  0.074042,	Acc:  97.694%	No_Prog: 0
2 Epochs:	Val Loss: 0.085416,	Best_Loss:  0.074042,	Acc:  98.163%	No_Prog: 1
3 Epochs:	Val Loss: 0.072412,	Best_Loss:  0.072412,	Acc:  97.850%	No_Prog: 0
4 Epochs:	Val Loss: 0.066085,	Best_Loss:  0.066085,	Acc:  98.749%	No_Prog: 0
5 Epochs:	Val Loss: 0.057669,	Best_Loss:  0.057669,	Acc:  98.241%	No_Prog: 0
6 Epochs:	Val Loss: 0.058243,	Best_Loss:  0.057669,	Acc:  98.593%	No_Prog: 1
7 Epochs:	Val Loss: 0.059500,	Best_Loss:  0.057669,	Acc:  98.749%	No_Prog: 2
8 Epochs:	Val Loss: 0.066636,	Best_Loss:  0.057669,	Acc:  98.475%	No_Prog: 3
9 Epochs:	Val Loss: 0.083246,	Best_Loss:  0.057669,	Acc:  97.850%	No_Prog: 4
10 Epochs:	Val Loss: 0.150970,	Best_Loss:  0.057669,	Acc:  97.068%	No_Prog: 5

Model stopped EARLY at 11 epochs and 6160 steps
Best Validation Accuracy: 98.2408%


MikesGloriousDNNClassifier(activation_function=<function elu at 0x000002297AEA8B70>,
              batch_norm_momentum_decay=None, batch_size=50,
              dropout_rate=None,
              initializer=<function variance_scaling_initializer.<locals>._initializer at 0x0000022902276D90>,
              k_in_top_val=1, layer_name='hidden', learning_rate=0.01,
              max_epochs_no_progress=5, n_hidden_layers=5,
              n_neurons=array([50, 50, 50, 50, 50]),
              optimizer_class=<class 'tensorflow.python.training.adam.AdamOptimizer'>,
              random_state=42, tb_model_name=None,
              tf_logs_path='./DNN1_LOGS')
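
The step count in the early-stopping message can be checked by hand. The training set for digits 0 to 4 has 28038 rows and this run used batch_size=50, so `np.array_split` yields 28038 // 50 batches per epoch, and the stop was triggered during the 11th epoch. A short sketch of that arithmetic (the variable names are illustrative):

```python
import numpy as np

n_rows, batch_size = 28038, 50            # training rows (digits 0 to 4) and the batch size of this run
batches = np.array_split(np.random.permutation(n_rows), n_rows // batch_size)

batches_per_epoch = len(batches)
batch_sizes = sorted({len(batch) for batch in batches})
print(batches_per_epoch, batch_sizes)     # 560 [50, 51]
print(batches_per_epoch * 11)             # 6160
```

`np.array_split` never drops rows: when the row count is not a multiple of the number of batches, the first few batches simply get one extra row each.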

##### Model is trained!!! Evaluate the accuracy using the test set!

In [67]:
from sklearn.metrics import accuracy_score
y_pred = dnn_clf.predict(X_test1)
accuracy_score(y_test1, y_pred)

0.99202179412337033
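
The accuracy score is simply the fraction of predictions that equal the true labels. When reproducing it by hand, keep in mind that `predict()` above returns a column of shape (n, 1). The sketch below uses made-up toy arrays to show the manual computation and the broadcasting pitfall.

```python
import numpy as np

y_true = np.array([0, 1, 2, 3])
y_pred_column = np.array([[0], [1], [2], [4]])     # shape (4, 1), like the output of predict() above

# Flatten first; comparing (4,) with (4, 1) directly would broadcast to a (4, 4) matrix
accuracy = np.mean(y_true == y_pred_column.ravel())
print(accuracy)                                     # 0.75

broadcast_shape = (y_true == y_pred_column).shape
print(broadcast_shape)                              # (4, 4)
```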

##### Wooohoo!!! Model is working and making predictions like a pro! Let's use RandomizedSearchCV to find better hyperparams! This will probably take a while... run it overnight!

In [336]:
from sklearn.model_selection import RandomizedSearchCV

#define leaky relu function... because the TF version used here has none built in (newer releases provide tf.nn.leaky_relu)
def leaky_relu(alpha=0.01):
    def parametrized_leaky_relu(z, name=None):
        return tf.maximum(alpha*z, z, name=name)
    return parametrized_leaky_relu

params = {
    "n_hidden_layers":[5,10,15],
    "n_neurons":[50,100,150],
    "learning_rate":[0.01,0.02],
    "batch_size": [100, 500],
    "activation_function":[tf.nn.relu, tf.nn.elu, leaky_relu(alpha=0.01),leaky_relu(alpha=0.02)]
}
rand_search = RandomizedSearchCV(MikesGloriousDNNClassifier(), params, n_iter=3, random_state=42, verbose=2, fit_params={"X_valid":X_valid1, "y_valid":y_valid1, "n_epochs":100})
rand_search.fit(X_train1, y_train1)
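
The `tf.maximum(alpha*z, z)` trick in `leaky_relu` above works because, for 0 < alpha < 1, `alpha*z` is the larger of the two values only when z is negative. A NumPy sketch of the same formula on a few made-up inputs (`leaky_relu_np` is a hypothetical name):

```python
import numpy as np

def leaky_relu_np(z, alpha=0.01):
    # For 0 < alpha < 1: returns z when z >= 0 and alpha * z otherwise
    return np.maximum(alpha * z, z)

z_demo = np.array([-2.0, -0.5, 0.0, 1.5])
leaky_out = leaky_relu_np(z_demo, alpha=0.01)
print(leaky_out)   # values: -0.02, -0.005, 0.0, 1.5
```

The outer function in the notebook exists only to bake `alpha` into a callable that takes `(z, name)`, which is the signature the classifier expects from an activation function.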



Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV] n_neurons=50, n_hidden_layers=5, learning_rate=0.02, batch_size=100, activation_function=<function leaky_relu.<locals>.parametrized_leaky_relu at 0x00000198ACE85620> 
Training DNN: 
layers=5, neurons=[50 50 50 50 50], epochs=100,  batch_size=100, early_stop:True, batch_norm:False, dropout:False

0 Epochs:	Val Loss: 0.065894,	Best_Loss:  0.065894,	Acc:  97.498%	No_Prog: 0
1 Epochs:	Val Loss: 0.076526,	Best_Loss:  0.065894,	Acc:  97.733%	No_Prog: 1
2 Epochs:	Val Loss: 0.072168,	Best_Loss:  0.065894,	Acc:  98.124%	No_Prog: 2
3 Epochs:	Val Loss: 0.051950,	Best_Loss:  0.051950,	Acc:  98.514%	No_Prog: 0
4 Epochs:	Val Loss: 0.060420,	Best_Loss:  0.051950,	Acc:  98.280%	No_Prog: 1
5 Epochs:	Val Loss: 0.072689,	Best_Loss:  0.051950,	Acc:  98.124%	No_Prog: 2
6 Epochs:	Val Loss: 0.082801,	Best_Loss:  0.051950,	Acc:  98.045%	No_Prog: 3
7 Epochs:	Val Loss: 0.070122,	Best_Loss:  0.051950,	Acc:  98.475%	No_Prog: 4
8 Epochs:	Val Loss: 0.0

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   11.5s remaining:    0.0s


Training DNN: 
layers=5, neurons=[50 50 50 50 50], epochs=100,  batch_size=100, early_stop:True, batch_norm:False, dropout:False

0 Epochs:	Val Loss: 0.082417,	Best_Loss:  0.082417,	Acc:  97.615%	No_Prog: 0
1 Epochs:	Val Loss: 0.075253,	Best_Loss:  0.075253,	Acc:  97.733%	No_Prog: 0
2 Epochs:	Val Loss: 0.068990,	Best_Loss:  0.068990,	Acc:  97.967%	No_Prog: 0
3 Epochs:	Val Loss: 0.058670,	Best_Loss:  0.058670,	Acc:  98.475%	No_Prog: 0
4 Epochs:	Val Loss: 0.076264,	Best_Loss:  0.058670,	Acc:  98.436%	No_Prog: 1
5 Epochs:	Val Loss: 0.097005,	Best_Loss:  0.058670,	Acc:  97.733%	No_Prog: 2
6 Epochs:	Val Loss: 0.082866,	Best_Loss:  0.058670,	Acc:  98.202%	No_Prog: 3
7 Epochs:	Val Loss: 0.066430,	Best_Loss:  0.058670,	Acc:  98.554%	No_Prog: 4
8 Epochs:	Val Loss: 0.086034,	Best_Loss:  0.058670,	Acc:  98.358%	No_Prog: 5
9 Epochs:	Val Loss: 0.127167,	Best_Loss:  0.058670,	Acc:  98.124%	No_Prog: 6
10 Epochs:	Val Loss: 0.097736,	Best_Loss:  0.058670,	Acc:  98.084%	No_Prog: 7
11 Epochs:	Val Loss: 0

2 Epochs:	Val Loss: 0.053580,	Best_Loss:  0.053580,	Acc:  98.593%	No_Prog: 0
3 Epochs:	Val Loss: 0.042977,	Best_Loss:  0.042977,	Acc:  98.397%	No_Prog: 0
4 Epochs:	Val Loss: 0.042300,	Best_Loss:  0.042300,	Acc:  98.514%	No_Prog: 0
5 Epochs:	Val Loss: 0.041023,	Best_Loss:  0.041023,	Acc:  98.632%	No_Prog: 0
6 Epochs:	Val Loss: 0.036701,	Best_Loss:  0.036701,	Acc:  98.827%	No_Prog: 0
7 Epochs:	Val Loss: 0.034645,	Best_Loss:  0.034645,	Acc:  98.788%	No_Prog: 0
8 Epochs:	Val Loss: 0.035368,	Best_Loss:  0.034645,	Acc:  98.866%	No_Prog: 1
9 Epochs:	Val Loss: 0.034106,	Best_Loss:  0.034106,	Acc:  98.788%	No_Prog: 0
10 Epochs:	Val Loss: 0.037430,	Best_Loss:  0.034106,	Acc:  98.554%	No_Prog: 1
11 Epochs:	Val Loss: 0.035471,	Best_Loss:  0.034106,	Acc:  98.827%	No_Prog: 2
12 Epochs:	Val Loss: 0.034830,	Best_Loss:  0.034106,	Acc:  98.984%	No_Prog: 3
13 Epochs:	Val Loss: 0.034424,	Best_Loss:  0.034106,	Acc:  99.023%	No_Prog: 4
14 Epochs:	Val Loss: 0.034895,	Best_Loss:  0.034106,	Acc:  99.023%	No_Pr

5 Epochs:	Val Loss: 0.073874,	Best_Loss:  0.055282,	Acc:  98.319%	No_Prog: 4
6 Epochs:	Val Loss: 0.113012,	Best_Loss:  0.055282,	Acc:  97.654%	No_Prog: 5
7 Epochs:	Val Loss: 0.111854,	Best_Loss:  0.055282,	Acc:  98.241%	No_Prog: 6
8 Epochs:	Val Loss: 0.072985,	Best_Loss:  0.055282,	Acc:  98.358%	No_Prog: 7
9 Epochs:	Val Loss: 0.070998,	Best_Loss:  0.055282,	Acc:  98.475%	No_Prog: 8
10 Epochs:	Val Loss: 0.096926,	Best_Loss:  0.055282,	Acc:  98.280%	No_Prog: 9
11 Epochs:	Val Loss: 0.102898,	Best_Loss:  0.055282,	Acc:  98.436%	No_Prog: 10
12 Epochs:	Val Loss: 0.109449,	Best_Loss:  0.055282,	Acc:  98.319%	No_Prog: 11
13 Epochs:	Val Loss: 0.195180,	Best_Loss:  0.055282,	Acc:  97.811%	No_Prog: 12
14 Epochs:	Val Loss: 0.109879,	Best_Loss:  0.055282,	Acc:  98.554%	No_Prog: 13
15 Epochs:	Val Loss: 0.096795,	Best_Loss:  0.055282,	Acc:  98.593%	No_Prog: 14
16 Epochs:	Val Loss: 0.135313,	Best_Loss:  0.055282,	Acc:  98.514%	No_Prog: 15
17 Epochs:	Val Loss: 0.158006,	Best_Loss:  0.055282,	Acc:  98.2

[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  2.2min finished


Training DNN: 
layers=5, neurons=[100 100 100 100 100], epochs=100,  batch_size=500, early_stop:True, batch_norm:False, dropout:False

0 Epochs:	Val Loss: 0.061344,	Best_Loss:  0.061344,	Acc:  98.045%	No_Prog: 0
1 Epochs:	Val Loss: 0.050911,	Best_Loss:  0.050911,	Acc:  98.397%	No_Prog: 0
2 Epochs:	Val Loss: 0.033912,	Best_Loss:  0.033912,	Acc:  98.827%	No_Prog: 0
3 Epochs:	Val Loss: 0.035496,	Best_Loss:  0.033912,	Acc:  98.788%	No_Prog: 1
4 Epochs:	Val Loss: 0.028169,	Best_Loss:  0.028169,	Acc:  98.905%	No_Prog: 0
5 Epochs:	Val Loss: 0.030837,	Best_Loss:  0.028169,	Acc:  99.140%	No_Prog: 1
6 Epochs:	Val Loss: 0.027609,	Best_Loss:  0.027609,	Acc:  98.905%	No_Prog: 0
7 Epochs:	Val Loss: 0.026978,	Best_Loss:  0.026978,	Acc:  99.101%	No_Prog: 0
8 Epochs:	Val Loss: 0.030228,	Best_Loss:  0.026978,	Acc:  98.944%	No_Prog: 1
9 Epochs:	Val Loss: 0.025018,	Best_Loss:  0.025018,	Acc:  99.101%	No_Prog: 0
10 Epochs:	Val Loss: 0.026011,	Best_Loss:  0.025018,	Acc:  99.179%	No_Prog: 1
11 Epochs:	Val Lo

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=MikesGloriousDNNClassifier(activation_function=<function elu at 0x000001989B4C40D0>,
              batch_norm_momentum_decay=None, batch_size=50,
              dropout_rate=None,
              initializer=<function variance_scaling_initializer.<locals>._initializer at 0x00000198A51552F0>,
...izer_class=<class 'tensorflow.python.training.adam.AdamOptimizer'>,
              random_state=None),
          fit_params={'X_valid': array([[ 0.,  0., ...,  0.,  0.],
       [ 0.,  0., ...,  0.,  0.],
       ...,
       [ 0.,  0., ...,  0.,  0.],
       [ 0.,  0., ...,  0.,  0.]], dtype=float32), 'y_valid': array([0, 4, ..., 1, 2], dtype=uint8), 'n_epochs': 100},
          iid=True, n_iter=3, n_jobs=1,
          param_distributions={'n_hidden_layers': [5, 10, 15], 'n_neurons': [50, 100, 150], 'learning_rate': [0.01, 0.02], 'batch_size': [100, 500], 'activation_function': [<function relu at 0x000001989B4C6268>, <function elu at 0
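
To make the "Fitting 3 folds for each of 3 candidates, totalling 9 fits" line concrete, here is a rough, standard-library-only sketch of what a randomized search does with a grid like `params`: count the possible combinations, then draw `n_iter` of them at random. This is not scikit-learn's implementation. The helper names are made up, the activation functions are stood in for by strings, and the sampled candidates will not match the run above.

```python
import itertools
import random

# Stand-in for the `params` grid above; activations are plain strings here.
param_grid = {
    "n_hidden_layers": [5, 10, 15],
    "n_neurons": [50, 100, 150],
    "learning_rate": [0.01, 0.02],
    "batch_size": [100, 500],
    "activation_function": ["relu", "elu", "leaky_relu_0.01", "leaky_relu_0.02"],
}

def all_combinations(grid):
    """Return every combination in the grid as a list of dicts."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in itertools.product(*(grid[name] for name in names))]

def sample_candidates(grid, n_iter, seed=42):
    """Draw n_iter distinct combinations at random (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(all_combinations(grid), n_iter)

n_folds = 3  # matches the "Fitting 3 folds" line in the output above
candidates = sample_candidates(param_grid, n_iter=3)
total_fits = len(candidates) * n_folds

print(len(all_combinations(param_grid)), "possible combinations")
print(total_fits, "cross-validation fits")
```

With only 3 of the possible combinations tried, a different seed or a larger `n_iter` can easily surface a different winner.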

In [337]:
#what are the best params
print(rand_search.best_params_)

#print accuracy
y_pred = rand_search.predict(X_test1)
accuracy_score(y_test1, y_pred)

{'n_neurons': 100, 'n_hidden_layers': 5, 'learning_rate': 0.01, 'batch_size': 500, 'activation_function': <function relu at 0x000001989B4C6268>}


0.99318933644677954

#### NO WAY!!! Leaky ReLU actually trained the best model... that's pretty unexpected. It just goes to show how important it is to test a bunch of different hyperparameters to find the best result.

Our increase in accuracy of roughly 1 percentage point means that our error rate went from roughly 2% to 1%... **which is a 50% reduction in errors!!!!**
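
The arithmetic behind that claim, as a tiny sketch. The function name is made up, and the inputs are the rounded figures from the sentence above rather than exact measurements.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction by which the error rate (1 - accuracy) shrinks."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before

# Rounded figures from the text: roughly 98% -> 99% accuracy.
reduction = relative_error_reduction(0.98, 0.99)
print("{:.0%} fewer errors".format(reduction))
```

This is why small accuracy gains near 99% matter more than they look: the error rate is the quantity that is actually shrinking.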

#### Let's save this model:

In [339]:
rand_search.best_estimator_.save("./tf_logs/my_best_DNN_0_to_4")

## 4. Now try adding Batch Normalization and compare the learning curves: is it converging faster than before? Does it produce a better model?

Great. Now let's train this awesome model, and then train it WITH batch normalization, and see if it converges faster. Use TensorBoard to look at training speeds!

First rerun the best model using the best_params from above:

In [299]:
reset_graph()
model1 = MikesGloriousDNNClassifier(activation_function=leaky_relu(alpha=0.01),
                                     n_neurons=100,
                                     n_hidden_layers=5,
                                     learning_rate=0.01,
                                     batch_size=500,
                                     tb_model_name='model1')
 
model1.fit(X_train1, y_train1, n_epochs=1000, X_valid=X_valid1, y_valid=y_valid1)

model1.save("./tf_logs/MODEL1_0_to_4")

Training DNN: 
layers=5, neurons=[100 100 100 100 100], epochs=1000,  batch_size=500, early_stop:True, batch_norm:False, dropout:False

1 Epochs:	Val Loss: 0.072347,	Best_Loss:  0.072347,	Acc:  97.772%	No_Prog: 0
2 Epochs:	Val Loss: 0.058847,	Best_Loss:  0.058847,	Acc:  98.358%	No_Prog: 0
3 Epochs:	Val Loss: 0.043963,	Best_Loss:  0.043963,	Acc:  98.710%	No_Prog: 0
4 Epochs:	Val Loss: 0.044730,	Best_Loss:  0.043963,	Acc:  98.593%	No_Prog: 1
5 Epochs:	Val Loss: 0.047267,	Best_Loss:  0.043963,	Acc:  98.632%	No_Prog: 2
6 Epochs:	Val Loss: 0.037500,	Best_Loss:  0.037500,	Acc:  99.023%	No_Prog: 0
7 Epochs:	Val Loss: 0.040106,	Best_Loss:  0.037500,	Acc:  98.827%	No_Prog: 1
8 Epochs:	Val Loss: 0.037894,	Best_Loss:  0.037500,	Acc:  99.140%	No_Prog: 2
9 Epochs:	Val Loss: 0.042804,	Best_Loss:  0.037500,	Acc:  98.632%	No_Prog: 3
10 Epochs:	Val Loss: 0.041971,	Best_Loss:  0.037500,	Acc:  98.866%	No_Prog: 4
11 Epochs:	Val Loss: 0.038344,	Best_Loss:  0.037500,	Acc:  98.905%	No_Prog: 5
12 Epochs:	Val 

#### OK, so we seem to have reached our optimal model at epoch 9. That's pretty good I guess... Let's check the accuracy and see if it's still as good as before... it should be.
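
For reference, the `Best_Loss` and `No_Prog` columns in the log follow a simple patience rule. Below is a minimal sketch of that bookkeeping. It assumes training stops once the no-progress counter exceeds the patience, which may differ slightly from what the classifier does internally; the function name and the loss values are made up for illustration.

```python
def early_stopping_trace(val_losses, max_epochs_no_progress):
    """Replay validation losses; return (best_epoch, stop_epoch), 1-indexed.

    stop_epoch is None if the patience is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = None
    no_progress = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, no_progress = loss, epoch, 0
        else:
            no_progress += 1
            # Assumption: stop once the counter exceeds the patience.
            if no_progress > max_epochs_no_progress:
                return best_epoch, epoch
    return best_epoch, None

# Made-up losses for illustration (not taken from the log above).
losses = [0.09, 0.07, 0.05, 0.06, 0.055, 0.052, 0.058]
print(early_stopping_trace(losses, max_epochs_no_progress=3))
```

The "optimal model" is the one saved at the best epoch, not the one from the last epoch before stopping.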

In [98]:
#print accuracy
y_pred = model1.predict(X_test1)
accuracy_score(y_test1, y_pred)

0.99241097489784003

### Dope, it's still performing well on the test set. Now let's add in some Batch Normalization and see if the mean_loss minimizes faster... let's hand it over to TB (assuming our TensorBoard code tweaks worked...)!

    (C:\Users\mciniello\AppData\Local\Continuum\anaconda3) C:\Users\mciniello\Desktop\Python\Updated projects>python -m tensorflow.tensorboard --logdir DNN1_LOGS\
    Starting TensorBoard b'54' at http://CA47496-MCINI05:6006
    (Press CTRL+C to quit)
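
Before running the batch-normalized model, here is a rough sketch of what a batch normalization layer does to one neuron during a training step. It assumes `batch_norm_momentum_decay` is the momentum of the moving averages (the classifier's code is not shown in this section), and it leaves out the learnable scale and shift (gamma and beta) for brevity. The function name and the input values are made up.

```python
import math

def batch_norm_train_step(batch, moving_mean, moving_var, momentum=0.99, eps=1e-3):
    """Normalize one feature over a batch and update the running statistics."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    # Exponential moving averages, used instead of batch statistics at test time.
    new_moving_mean = momentum * moving_mean + (1 - momentum) * mean
    new_moving_var = momentum * moving_var + (1 - momentum) * var
    return normalized, new_moving_mean, new_moving_var

activations = [2.0, 4.0, 6.0, 8.0]  # made-up pre-activation values for one neuron
normalized, moving_mean, moving_var = batch_norm_train_step(
    activations, moving_mean=0.0, moving_var=1.0)
print(normalized, moving_mean, moving_var)
```

With a momentum of 0.99 the running statistics move slowly, so they need many training steps before they reflect the data well.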

In [99]:
reset_graph()
dnn_clf = MikesGloriousDNNClassifier(activation_function=leaky_relu(alpha=0.01),
                                     n_neurons=100,
                                     n_hidden_layers=5,
                                     learning_rate=0.01,
                                     batch_size=500, 
                                     batch_norm_momentum_decay = 0.99, 
                                     tb_model_name='model1_bn')
 
dnn_clf.fit(X_train1, y_train1, n_epochs=1000, X_valid=X_valid1, y_valid=y_valid1)

Training DNN: 
layers=5, neurons=[100 100 100 100 100], epochs=1000,  batch_size=500, early_stop:True, batch_norm:True, dropout:False

1 Epochs:	Val Loss: 0.086293,	Best_Loss:  0.086293,	Acc:  97.342%	No_Prog: 0
2 Epochs:	Val Loss: 0.053649,	Best_Loss:  0.053649,	Acc:  98.358%	No_Prog: 0
3 Epochs:	Val Loss: 0.051087,	Best_Loss:  0.051087,	Acc:  98.280%	No_Prog: 0
4 Epochs:	Val Loss: 0.041546,	Best_Loss:  0.041546,	Acc:  98.593%	No_Prog: 0
5 Epochs:	Val Loss: 0.040599,	Best_Loss:  0.040599,	Acc:  98.514%	No_Prog: 0
6 Epochs:	Val Loss: 0.032538,	Best_Loss:  0.032538,	Acc:  98.710%	No_Prog: 0
7 Epochs:	Val Loss: 0.036113,	Best_Loss:  0.032538,	Acc:  98.554%	No_Prog: 1
8 Epochs:	Val Loss: 0.033160,	Best_Loss:  0.032538,	Acc:  98.905%	No_Prog: 2
9 Epochs:	Val Loss: 0.030912,	Best_Loss:  0.030912,	Acc:  99.023%	No_Prog: 0
10 Epochs:	Val Loss: 0.030655,	Best_Loss:  0.030655,	Acc:  99.218%	No_Prog: 0
11 Epochs:	Val Loss: 0.030828,	Best_Loss:  0.030655,	Acc:  99.062%	No_Prog: 1
12 Epochs:	Val L

MikesGloriousDNNClassifier(activation_function=<function leaky_relu.<locals>.parametrized_leaky_relu at 0x000002290C18CC80>,
              batch_norm_momentum_decay=0.99, batch_size=500,
              dropout_rate=None,
              initializer=<function variance_scaling_initializer.<locals>._initializer at 0x000002290BFA3950>,
              k_in_top_val=1, layer_name='hidden', learning_rate=0.01,
              max_epochs_no_progress=10, n_hidden_layers=5,
              n_neurons=array([100, 100, 100, 100, 100]),
              optimizer_class=<class 'tensorflow.python.training.adam.AdamOptimizer'>,
              random_state=None, tb_model_name='model1_bn',
              tf_logs_path='./DNN1_LOGS')

In [100]:
#print accuracy
y_pred = dnn_clf.predict(X_test1)
accuracy_score(y_test1, y_pred)

0.99338392683401444

#### Hmmmm, we aren't converging any faster, and test accuracy barely moves when we use batch normalization on the exact same model (0.9934 with it versus 0.9924 without)... interesting.

![](pictures/MC_Project - DNN_1.png)

### Sooo let's try optimizing the params WITH batch normalization... maybe that will help!

In [107]:
params = {
    "n_hidden_layers":[5,10],
    "n_neurons":[100,160],
    "learning_rate":[0.01,0.02],
    "batch_size": [100, 200],
    "batch_norm_momentum_decay":[0.98],
    "activation_function":[tf.nn.relu, tf.nn.elu, leaky_relu(alpha=0.01),leaky_relu(alpha=0.02)]
}
rand_search = RandomizedSearchCV(MikesGloriousDNNClassifier(), params, n_iter=2, random_state=42, verbose=2, fit_params={"X_valid":X_valid1, "y_valid":y_valid1, "n_epochs":100})
rand_search.fit(X_train1, y_train1)



Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] n_neurons=100, n_hidden_layers=5, learning_rate=0.02, batch_size=100, batch_norm_momentum_decay=0.98, activation_function=<function leaky_relu.<locals>.parametrized_leaky_relu at 0x000002290EF5F2F0> 
Training DNN: 
layers=5, neurons=[100 100 100 100 100], epochs=100,  batch_size=100, early_stop:True, batch_norm:True, dropout:False

1 Epochs:	Val Loss: 0.065194,	Best_Loss:  0.065194,	Acc:  97.850%	No_Prog: 0
2 Epochs:	Val Loss: 0.048834,	Best_Loss:  0.048834,	Acc:  98.475%	No_Prog: 0
3 Epochs:	Val Loss: 0.073980,	Best_Loss:  0.048834,	Acc:  97.420%	No_Prog: 1
4 Epochs:	Val Loss: 0.055525,	Best_Loss:  0.048834,	Acc:  98.280%	No_Prog: 2
5 Epochs:	Val Loss: 0.050570,	Best_Loss:  0.048834,	Acc:  98.554%	No_Prog: 3
6 Epochs:	Val Loss: 0.050438,	Best_Loss:  0.048834,	Acc:  98.749%	No_Prog: 4
7 Epochs:	Val Loss: 0.037595,	Best_Loss:  0.037595,	Acc:  98.749%	No_Prog: 0
8 Epochs:	Val Loss: 0.047837,	Best_Loss:  0.037595,	Acc:  98.71

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   28.4s remaining:    0.0s


Training DNN: 
layers=5, neurons=[100 100 100 100 100], epochs=100,  batch_size=100, early_stop:True, batch_norm:True, dropout:False

1 Epochs:	Val Loss: 0.056916,	Best_Loss:  0.056916,	Acc:  98.084%	No_Prog: 0
2 Epochs:	Val Loss: 0.056654,	Best_Loss:  0.056654,	Acc:  98.241%	No_Prog: 0
3 Epochs:	Val Loss: 0.064256,	Best_Loss:  0.056654,	Acc:  98.163%	No_Prog: 1
4 Epochs:	Val Loss: 0.052294,	Best_Loss:  0.052294,	Acc:  98.593%	No_Prog: 0
5 Epochs:	Val Loss: 0.043086,	Best_Loss:  0.043086,	Acc:  98.749%	No_Prog: 0
6 Epochs:	Val Loss: 0.049133,	Best_Loss:  0.043086,	Acc:  98.749%	No_Prog: 1
7 Epochs:	Val Loss: 0.057457,	Best_Loss:  0.043086,	Acc:  98.710%	No_Prog: 2
8 Epochs:	Val Loss: 0.041383,	Best_Loss:  0.041383,	Acc:  98.788%	No_Prog: 0
9 Epochs:	Val Loss: 0.074417,	Best_Loss:  0.041383,	Acc:  98.358%	No_Prog: 1
10 Epochs:	Val Loss: 0.056540,	Best_Loss:  0.041383,	Acc:  98.827%	No_Prog: 2
11 Epochs:	Val Loss: 0.074298,	Best_Loss:  0.041383,	Acc:  98.593%	No_Prog: 3
12 Epochs:	Val Lo

5 Epochs:	Val Loss: 0.070502,	Best_Loss:  0.036973,	Acc:  97.772%	No_Prog: 1
6 Epochs:	Val Loss: 0.039767,	Best_Loss:  0.036973,	Acc:  98.710%	No_Prog: 2
7 Epochs:	Val Loss: 0.049275,	Best_Loss:  0.036973,	Acc:  98.397%	No_Prog: 3
8 Epochs:	Val Loss: 0.054889,	Best_Loss:  0.036973,	Acc:  98.788%	No_Prog: 4
9 Epochs:	Val Loss: 0.040472,	Best_Loss:  0.036973,	Acc:  98.788%	No_Prog: 5
10 Epochs:	Val Loss: 0.037554,	Best_Loss:  0.036973,	Acc:  98.905%	No_Prog: 6
11 Epochs:	Val Loss: 0.034787,	Best_Loss:  0.034787,	Acc:  98.944%	No_Prog: 0
12 Epochs:	Val Loss: 0.039284,	Best_Loss:  0.034787,	Acc:  99.062%	No_Prog: 1
13 Epochs:	Val Loss: 0.034158,	Best_Loss:  0.034158,	Acc:  98.984%	No_Prog: 0
14 Epochs:	Val Loss: 0.031744,	Best_Loss:  0.031744,	Acc:  99.062%	No_Prog: 0
15 Epochs:	Val Loss: 0.033740,	Best_Loss:  0.031744,	Acc:  99.023%	No_Prog: 1
16 Epochs:	Val Loss: 0.035849,	Best_Loss:  0.031744,	Acc:  99.062%	No_Prog: 2
17 Epochs:	Val Loss: 0.035560,	Best_Loss:  0.031744,	Acc:  98.984%	No

[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  2.5min finished


Training DNN: 
layers=10, neurons=[100 100 100 100 100 100 100 100 100 100], epochs=100,  batch_size=200, early_stop:True, batch_norm:True, dropout:False

1 Epochs:	Val Loss: 0.059300,	Best_Loss:  0.059300,	Acc:  98.280%	No_Prog: 0
2 Epochs:	Val Loss: 0.038813,	Best_Loss:  0.038813,	Acc:  98.866%	No_Prog: 0
3 Epochs:	Val Loss: 0.041340,	Best_Loss:  0.038813,	Acc:  98.554%	No_Prog: 1
4 Epochs:	Val Loss: 0.034030,	Best_Loss:  0.034030,	Acc:  98.827%	No_Prog: 0
5 Epochs:	Val Loss: 0.033679,	Best_Loss:  0.033679,	Acc:  98.866%	No_Prog: 0
6 Epochs:	Val Loss: 0.048677,	Best_Loss:  0.033679,	Acc:  98.514%	No_Prog: 1
7 Epochs:	Val Loss: 0.032463,	Best_Loss:  0.032463,	Acc:  98.984%	No_Prog: 0
8 Epochs:	Val Loss: 0.030773,	Best_Loss:  0.030773,	Acc:  99.062%	No_Prog: 0
9 Epochs:	Val Loss: 0.025762,	Best_Loss:  0.025762,	Acc:  99.062%	No_Prog: 0
10 Epochs:	Val Loss: 0.028641,	Best_Loss:  0.025762,	Acc:  99.140%	No_Prog: 1
11 Epochs:	Val Loss: 0.040044,	Best_Loss:  0.025762,	Acc:  98.827%	No_Prog

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=MikesGloriousDNNClassifier(activation_function=<function elu at 0x000002297AEA8B70>,
              batch_norm_momentum_decay=None, batch_size=50,
              dropout_rate=None,
              initializer=<function variance_scaling_initializer.<locals>._initializer at 0x000002290EB6EF28>,
...er'>,
              random_state=None, tb_model_name=None,
              tf_logs_path='./DNN1_LOGS'),
          fit_params={'X_valid': array([[ 0.,  0., ...,  0.,  0.],
       [ 0.,  0., ...,  0.,  0.],
       ...,
       [ 0.,  0., ...,  0.,  0.],
       [ 0.,  0., ...,  0.,  0.]], dtype=float32), 'y_valid': array([0, 4, ..., 1, 2], dtype=uint8), 'n_epochs': 100},
          iid=True, n_iter=2, n_jobs=1,
          param_distributions={'n_hidden_layers': [5, 10], 'n_neurons': [100, 160], 'learning_rate': [0.01, 0.02], 'batch_size': [100, 200], 'batch_norm_momentum_decay': [0.98], 'activation_function': [<function relu at 0x00000229

In [108]:
#what are the best params
print(rand_search.best_params_)

#print accuracy
y_pred = rand_search.predict(X_test1)
accuracy_score(y_test1, y_pred)

{'n_neurons': 100, 'n_hidden_layers': 10, 'learning_rate': 0.01, 'batch_size': 200, 'batch_norm_momentum_decay': 0.98, 'activation_function': <function leaky_relu.<locals>.parametrized_leaky_relu at 0x000002290EF5F2F0>}


0.99182720373613542

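#### Meh. The tuned batch-normalized model scores 0.9918 on the test set, within about 0.2 percentage points of the earlier models and no better than them... not that great. Let's try retraining with dropout!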

## 5. Is the model overfitting the training set? Try adding dropout to every layer and try again. Does it help?

First let's go back to our original best model ('Model1') and see if it is overfitting the training data:

In [114]:
#print accuracy on training data
y_pred = model1.predict(X_train1)
print('train data: ',accuracy_score(y_train1, y_pred))

#print accuracy on test data
y_pred = model1.predict(X_test1)
print('test data: ',accuracy_score(y_test1, y_pred))

train data:  0.999536343534
test data:  0.992021794123


In [248]:
#there is a little overfitting. A bit of regularization should help, so let's add a dropout_rate of 50%
reset_graph()
model2 = MikesGloriousDNNClassifier(activation_function=leaky_relu(alpha=0.01),
                                     n_neurons=100,
                                     n_hidden_layers=5,
                                     learning_rate=0.01,
                                     batch_size=500, 
                                     batch_norm_momentum_decay = 0.99, 
                                     tb_model_name='model1_bn_dropout', 
                                     dropout_rate=0.5)
 
model2.fit(X_train1, y_train1, n_epochs=1000, X_valid=X_valid1, y_valid=y_valid1)

Training DNN: 
layers=5, neurons=[100 100 100 100 100], epochs=1000,  batch_size=500, early_stop:True, batch_norm:True, dropout:True

1 Epochs:	Val Loss: 0.553163,	Best_Loss:  0.553163,	Acc:  95.661%	No_Prog: 0
2 Epochs:	Val Loss: 0.140643,	Best_Loss:  0.140643,	Acc:  97.146%	No_Prog: 0
3 Epochs:	Val Loss: 0.090050,	Best_Loss:  0.090050,	Acc:  97.772%	No_Prog: 0
4 Epochs:	Val Loss: 0.073070,	Best_Loss:  0.073070,	Acc:  98.163%	No_Prog: 0
5 Epochs:	Val Loss: 0.069932,	Best_Loss:  0.069932,	Acc:  97.928%	No_Prog: 0
6 Epochs:	Val Loss: 0.059706,	Best_Loss:  0.059706,	Acc:  98.475%	No_Prog: 0
7 Epochs:	Val Loss: 0.055358,	Best_Loss:  0.055358,	Acc:  98.593%	No_Prog: 0
8 Epochs:	Val Loss: 0.053248,	Best_Loss:  0.053248,	Acc:  98.593%	No_Prog: 0
9 Epochs:	Val Loss: 0.050808,	Best_Loss:  0.050808,	Acc:  98.749%	No_Prog: 0
10 Epochs:	Val Loss: 0.050649,	Best_Loss:  0.050649,	Acc:  98.632%	No_Prog: 0
11 Epochs:	Val Loss: 0.046101,	Best_Loss:  0.046101,	Acc:  98.788%	No_Prog: 0
12 Epochs:	Val Lo

MikesGloriousDNNClassifier(activation_function=<function leaky_relu.<locals>.parametrized_leaky_relu at 0x000002291C43C9D8>,
              batch_norm_momentum_decay=0.99, batch_size=500,
              dropout_rate=0.5,
              initializer=<function variance_scaling_initializer.<locals>._initializer at 0x000002291B17C158>,
              k_in_top_val=1, layer_name='hidden', learning_rate=0.01,
              max_epochs_no_progress=10, n_hidden_layers=5,
              n_neurons=array([100, 100, 100, 100, 100]),
              optimizer_class=<class 'tensorflow.python.training.adam.AdamOptimizer'>,
              random_state=None, tb_model_name='model1_bn_dropout',
              tf_logs_path='./DNN1_LOGS')

In [249]:
#print accuracy on training data
y_pred = model2.predict(X_train1)
print('train data: ',accuracy_score(y_train1, y_pred))

#print accuracy on test data
y_pred = model2.predict(X_test1)
print('test data: ',accuracy_score(y_test1, y_pred))

train data:  0.993223482417
test data:  0.992216384511


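#### Nice! Dropout regularization helped a little: test accuracy went up by about 0.02 percentage points (0.99202 to 0.99222), and the gap between training and test accuracy shrank from about 0.75 to about 0.10 percentage points. HOWEVER, convergence did slow down a little. Let's save this model: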
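
As a reminder of what `dropout_rate=0.5` does, here is a minimal sketch of "inverted" dropout on a plain Python list. It assumes the usual behaviour of scaling the surviving activations by `1 / (1 - rate)` during training and doing nothing at inference time. The function name and the input values are made up, and this is not the classifier's own code.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout: zero each value with probability `rate` while training
    and scale the survivors by 1 / (1 - rate); do nothing at inference time."""
    if not training:
        return list(values)
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(42)
activations = [1.0] * 10000  # made-up activations
dropped = dropout(activations, rate=0.5, training=True, rng=rng)

print(sum(dropped) / len(dropped))   # average activation after dropout
print(dropout([1.0, 2.0], rate=0.5, training=False, rng=rng))
```

The rescaling keeps the expected activation the same in training and inference, which is why no extra correction is needed when predicting. The noise it injects is also a plausible reason for the slower convergence noted above.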

In [250]:
model2.save("./DNN1_LOGS/my_best_DNN_0_to_4")

In [243]:
#check the operations to make sure everything is in order!
model2._graph.get_operations()

[<tf.Operation 'X' type=Placeholder>,
 <tf.Operation 'y' type=Placeholder>,
 <tf.Operation 'inTrainingMode/input' type=Const>,
 <tf.Operation 'inTrainingMode' type=PlaceholderWithDefault>,
 <tf.Operation 'dropout/cond/Switch' type=Switch>,
 <tf.Operation 'dropout/cond/switch_t' type=Identity>,
 <tf.Operation 'dropout/cond/switch_f' type=Identity>,
 <tf.Operation 'dropout/cond/pred_id' type=Identity>,
 <tf.Operation 'dropout/cond/dropout/keep_prob' type=Const>,
 <tf.Operation 'dropout/cond/dropout/Shape/Switch' type=Switch>,
 <tf.Operation 'dropout/cond/dropout/Shape' type=Shape>,
 <tf.Operation 'dropout/cond/dropout/random_uniform/min' type=Const>,
 <tf.Operation 'dropout/cond/dropout/random_uniform/max' type=Const>,
 <tf.Operation 'dropout/cond/dropout/random_uniform/RandomUniform' type=RandomUniform>,
 <tf.Operation 'dropout/cond/dropout/random_uniform/sub' type=Sub>,
 <tf.Operation 'dropout/cond/dropout/random_uniform/mul' type=Mul>,
 <tf.Operation 'dropout/cond/dropout/random_unifo

In [244]:
with model2._graph.as_default():
    for var in tf.trainable_variables():
        print(var)

<tf.Variable 'hidden_1/kernel:0' shape=(784, 100) dtype=float32_ref>
<tf.Variable 'hidden_1/bias:0' shape=(100,) dtype=float32_ref>
<tf.Variable 'batch_normalization/beta:0' shape=(100,) dtype=float32_ref>
<tf.Variable 'batch_normalization/gamma:0' shape=(100,) dtype=float32_ref>
<tf.Variable 'hidden_2/kernel:0' shape=(100, 100) dtype=float32_ref>
<tf.Variable 'hidden_2/bias:0' shape=(100,) dtype=float32_ref>
<tf.Variable 'batch_normalization_1/beta:0' shape=(100,) dtype=float32_ref>
<tf.Variable 'batch_normalization_1/gamma:0' shape=(100,) dtype=float32_ref>
<tf.Variable 'hidden_3/kernel:0' shape=(100, 100) dtype=float32_ref>
<tf.Variable 'hidden_3/bias:0' shape=(100,) dtype=float32_ref>
<tf.Variable 'batch_normalization_2/beta:0' shape=(100,) dtype=float32_ref>
<tf.Variable 'batch_normalization_2/gamma:0' shape=(100,) dtype=float32_ref>
<tf.Variable 'hidden_4/kernel:0' shape=(100, 100) dtype=float32_ref>
<tf.Variable 'hidden_4/bias:0' shape=(100,) dtype=float32_ref>
<tf.Variable 'bat

## 6. Create a new DNN that reuses all the pretrained hidden layers of the previous model, freezes them, and replaces the softmax output layer with a fresh new one.

Let's first load the best model's graph and get a handle on ALL of the important operations we need. Note that instead of creating a new softmax output layer, we will just reuse the existing one (since it has the same number of outputs, 5). We will reinitialize its variables before training.

In [291]:
reset_graph()
restore_meta = tf.train.import_meta_graph("./DNN1_LOGS/my_best_DNN_0_to_4.meta")

X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")
mean_loss = tf.get_default_graph().get_tensor_by_name("mean_loss:0")
y_proba = tf.get_default_graph().get_tensor_by_name("y_proba:0")
logits = y_proba.op.inputs[0]
mean_accuracy = tf.get_default_graph().get_tensor_by_name("mean_accuracy:0")

In [292]:
mean_accuracy

<tf.Tensor 'mean_accuracy:0' shape=() dtype=float32>

In [293]:
# To freeze the lower layers, exclude their variables from the optimizer's list of trainable variables,
# keeping only the output layer's trainable variables:
learning_rate = 0.01
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, name="Adam2")
#pass in var list to optimizer
logit_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='output_logits')
training_op = optimizer.minimize(mean_loss, var_list=logit_vars)

#create accuracy calcs
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

init = tf.global_variables_initializer()
five_frozen_saver = tf.train.Saver()
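
To see why passing `var_list` freezes the hidden layers, here is a toy sketch with scalar "weights". It uses a plain gradient-descent update instead of Adam to keep things short, and every name and value in it is made up for illustration.

```python
def sgd_step(params, grads, learning_rate, var_list):
    """Apply one gradient step, but only to the variables named in var_list."""
    return {
        name: value - learning_rate * grads[name] if name in var_list else value
        for name, value in params.items()
    }

# Toy scalar "weights"; names loosely mirror the scopes in the graph above.
params = {"hidden_1/kernel": 0.5, "hidden_2/kernel": -0.3, "output_logits/kernel": 0.1}
grads = {"hidden_1/kernel": 1.0, "hidden_2/kernel": 1.0, "output_logits/kernel": 1.0}

trainable = [name for name in params if name.startswith("output_logits")]
updated = sgd_step(params, grads, learning_rate=0.01, var_list=trainable)
print(updated)
```

The hidden-layer entries come back unchanged even though they have non-zero gradients, which is all that "frozen" means here.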

## 7. Train this new DNN on digits 5 to 9, using only 100 images per digit, and time how long it takes. Despite this small number of examples, can you achieve high precision?

Create the training, validation, and test sets for digits 5 and up! We subtract 5 from each label so the classes still run from 0 to 4 (that is, 0 to n_outputs - 1)... but this time these represent the digits 5, 6, 7, 8, 9.

We will also only keep 100 instances per class (and only 30 per class in the validation set).

In [277]:
X_train2_full = mnist.train.images[mnist.train.labels >= 5]
y_train2_full = mnist.train.labels[mnist.train.labels >= 5] - 5
X_valid2_full = mnist.validation.images[mnist.validation.labels >= 5]
y_valid2_full = mnist.validation.labels[mnist.validation.labels >= 5] - 5
X_test2 = mnist.test.images[mnist.test.labels >= 5]
y_test2 = mnist.test.labels[mnist.test.labels >= 5] - 5
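
The same filter-and-shift idea on a tiny, made-up example, without NumPy. The helper name is hypothetical and the strings only stand in for images.

```python
def keep_and_shift(images, labels, first_digit=5):
    """Keep samples whose label is >= first_digit and renumber labels from 0."""
    kept = [(img, lab - first_digit)
            for img, lab in zip(images, labels) if lab >= first_digit]
    return [img for img, _ in kept], [lab for _, lab in kept]

# Made-up stand-ins for images (strings) and their digit labels.
images = ["a", "b", "c", "d", "e"]
labels = [3, 5, 9, 0, 7]
X_kept, y_kept = keep_and_shift(images, labels)
print(X_kept, y_kept)
```

Shifting the labels is what lets the existing five-output softmax layer be reused unchanged: output 0 now means digit 5, output 4 means digit 9.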

In [289]:
def sample_n_instances_per_class(X, y, n=100):
    Xs, ys = [], []
    #for each class, select the first n rows (the data is already randomized)
    for label in np.unique(y):
        idx = (y == label)
        Xc = X[idx][:n]
        yc = y[idx][:n]
        #collect this class's samples and labels
        Xs.append(Xc)
        ys.append(yc)
    #concatenate the per-class arrays into single arrays (stack the data!)
    return np.concatenate(Xs), np.concatenate(ys)

X_train2, y_train2 = sample_n_instances_per_class(X_train2_full, y_train2_full, n=100)
X_valid2, y_valid2 = sample_n_instances_per_class(X_valid2_full, y_valid2_full, n=30)
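
A quick way to convince yourself what this kind of sampling returns is to run the same logic on toy data and count the labels. The sketch below is a pure-Python analogue with made-up names and data, not the function above.

```python
from collections import Counter

def first_n_per_class(samples, labels, n):
    """Keep the first n samples of each class, grouped by class label."""
    kept_samples, kept_labels = [], []
    for label in sorted(set(labels)):
        matches = [s for s, l in zip(samples, labels) if l == label][:n]
        kept_samples.extend(matches)
        kept_labels.extend([label] * len(matches))
    return kept_samples, kept_labels

# Made-up data: sample ids 0..11 with labels cycling through 0, 1, 2.
samples = list(range(12))
labels = [i % 3 for i in samples]
X_small, y_small = first_n_per_class(samples, labels, n=2)
print(X_small, y_small, Counter(y_small))
```

Note that the result is grouped by class rather than shuffled, so the training loop has to shuffle the indices itself before cutting mini-batches.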

### Now let's train the model. This is the same training code as earlier, using early stopping, except for the initialization: we first initialize all the variables, then we restore the best model trained earlier (on digits 0 to 4), and finally the original code reinitializes the output layer variables (that last step is commented out in the cell below). The graph is rebuilt below for easier viewing:

In [304]:
#########################################################################################
################################# CONSTRUCTION PHASE ####################################
#########################################################################################

#restore graph
reset_graph()
restore_meta = tf.train.import_meta_graph("./tf_logs/MODEL1_0_to_4.meta") #import meta data for graph

#grab variables that we need to reference
X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")
y_proba = tf.get_default_graph().get_tensor_by_name("y_proba:0")
logits = y_proba.op.inputs[0]
mean_loss = tf.get_default_graph().get_tensor_by_name("mean_loss:0")

#the first input of y_proba (the softmax of the logits) is the logits tensor
#this could also be found by referring to the 'logits' layer
#order of calcs in graph:
    #1. logits
    #2. y_proba (though this isn't used in the calc)
    #3. cross_entropy (though this isn't used in the calc)
    #4. mean_loss and mean_accuracy

#freeze lower layers
learning_rate = 0.01
logit_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='output_logits')
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, name="Adam2")
training_op = optimizer.minimize(mean_loss, var_list=logit_vars)

#create accuracy calcs
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

init = tf.global_variables_initializer()
five_frozen_saver = tf.train.Saver()

#########################################################################################
################################## EVALUATION PHASE #####################################
#########################################################################################

import time

n_epochs = 1000
batch_size = 20

max_checks_without_progress = 10
checks_without_progress = 0
best_loss = np.inf

with tf.Session() as sess:
    init.run()
    restore_meta.restore(sess, "./tf_logs/MODEL1_0_to_4")
    
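    #the original code re-initializes the output layer here. init.run() does not cover this:
    #restore() runs after it and reloads every saved variable, presumably including the output
    #layer trained on digits 0 to 4. Left commented out, so training starts from those weights.
    #for var in logit_vars:
    #    var.initializer.run()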

    t0 = time.time()
        
    for epoch in range(n_epochs):
        rnd_idx = np.random.permutation(len(X_train2))
        for rnd_indices in np.array_split(rnd_idx, len(X_train2) // batch_size):
            X_batch, y_batch = X_train2[rnd_indices], y_train2[rnd_indices]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        loss_val, acc_val = sess.run([mean_loss, accuracy], feed_dict={X: X_valid2, y: y_valid2})
        if loss_val < best_loss:
            save_path = five_frozen_saver.save(sess, "./tf_logs/my_mnist_model_5_to_9_five_frozen")
            best_loss = loss_val
            checks_without_progress = 0
        else:
            checks_without_progress += 1
            if checks_without_progress > max_checks_without_progress:
                print("Early stopping!")
                break
        print("{}\tValidation loss: {:.6f}\tBest loss: {:.6f}\tAccuracy: {:.2f}%".format(
            epoch, loss_val, best_loss, acc_val * 100))

    t1 = time.time()
    print("Total training time: {:.1f}s".format(t1 - t0))

with tf.Session() as sess:
    five_frozen_saver.restore(sess, "./tf_logs/my_mnist_model_5_to_9_five_frozen")
    acc_test = accuracy.eval(feed_dict={X: X_test2, y: y_test2})
    print("Final test accuracy: {:.2f}%".format(acc_test * 100))

INFO:tensorflow:Restoring parameters from ./tf_logs/MODEL1_0_to_4
0	Validation loss: 1.159437	Best loss: 1.159437	Accuracy: 50.67%
1	Validation loss: 1.082819	Best loss: 1.082819	Accuracy: 58.67%
2	Validation loss: 1.010775	Best loss: 1.010775	Accuracy: 56.00%
3	Validation loss: 0.964012	Best loss: 0.964012	Accuracy: 64.00%
4	Validation loss: 0.938321	Best loss: 0.938321	Accuracy: 66.67%
5	Validation loss: 0.943860	Best loss: 0.938321	Accuracy: 66.67%
6	Validation loss: 0.906695	Best loss: 0.906695	Accuracy: 67.33%
7	Validation loss: 0.894095	Best loss: 0.894095	Accuracy: 66.00%
8	Validation loss: 0.933458	Best loss: 0.894095	Accuracy: 65.33%
9	Validation loss: 0.884227	Best loss: 0.884227	Accuracy: 70.00%
10	Validation loss: 0.922858	Best loss: 0.884227	Accuracy: 65.33%
11	Validation loss: 0.957360	Best loss: 0.884227	Accuracy: 59.33%
12	Validation loss: 0.942646	Best loss: 0.884227	Accuracy: 64.67%
13	Validation loss: 0.907987	Best loss: 0.884227	Accuracy: 64.00%
14	Validation loss: 
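
The inner loop above cuts each shuffled epoch into mini-batches with `np.array_split`. Here is a standard-library sketch of the same splitting rule, just to show how many batches come out of 500 training images with `batch_size = 20`. The helper is a made-up stand-in, not NumPy's implementation.

```python
import random

def split_into_batches(indices, n_batches):
    """Split indices into n_batches nearly equal chunks (like np.array_split)."""
    base, extra = divmod(len(indices), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)
        batches.append(indices[start:start + size])
        start += size
    return batches

n_samples, batch_size = 500, 20   # 5 classes x 100 images, as in the cell above
indices = list(range(n_samples))
random.Random(42).shuffle(indices)
batches = split_into_batches(indices, n_samples // batch_size)
print(len(batches), "batches of", len(batches[0]), "samples")
```

So each epoch is only 25 optimizer steps, which is part of why the validation loss jumps around from epoch to epoch.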

### This wasn't in the original code, but I think you could also do this:

#### Get rid of the 'logits' and 'y_proba' values, and just evaluate 'mean_accuracy' and 'mean_loss' by calling those tensors directly... isn't that easier? I think the result should be exactly the same as well.
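
The two approaches agree as long as the graph's `mean_accuracy` was built the same way as the hand-made one, that is, as the mean of a top-1 check on the logits (the estimator printout earlier shows `k_in_top_val=1`, which suggests it was). A minimal sketch of that computation on made-up logits, ignoring how ties are handled:

```python
def top1_accuracy(logits, targets):
    """Fraction of rows whose largest logit sits at the target index."""
    correct = [row.index(max(row)) == target
               for row, target in zip(logits, targets)]
    return sum(correct) / len(correct)

# Made-up logits for three samples and five classes.
logits = [
    [0.1, 2.0, 0.3, 0.0, -1.0],
    [1.5, 0.2, 0.1, 0.0, 0.3],
    [0.0, 0.1, 0.2, 3.0, 0.4],
]
targets = [1, 0, 4]
print(top1_accuracy(logits, targets))
```

Because both versions are deterministic functions of the same logits and labels, swapping one for the other should not change the reported numbers.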

In [313]:
#########################################################################################
################################# CONSTRUCTION PHASE ####################################
#########################################################################################

#restore graph
reset_graph()
restore_meta = tf.train.import_meta_graph("./tf_logs/MODEL1_0_to_4.meta") #import meta data for graph

#grab variables that we need to reference
X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")
mean_loss = tf.get_default_graph().get_tensor_by_name("mean_loss:0")
mean_accuracy = tf.get_default_graph().get_tensor_by_name("mean_accuracy:0")

#freeze lower layers
learning_rate = 0.01
logit_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='output_logits')
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, name="Adam2")
training_op = optimizer.minimize(mean_loss, var_list=logit_vars)

init = tf.global_variables_initializer()
five_frozen_saver = tf.train.Saver()

#########################################################################################
################################## EVALUATION PHASE #####################################
#########################################################################################

import time

n_epochs = 1000
batch_size = 20

max_checks_without_progress = 5
checks_without_progress = 0
best_loss = np.inf  # np.infty was removed in NumPy 2.0; np.inf works everywhere

with tf.Session() as sess:
    init.run()
    restore_meta.restore(sess, "./tf_logs/MODEL1_0_to_4")

    t0 = time.time()  
    for epoch in range(n_epochs):
        rnd_idx = np.random.permutation(len(X_train2))
        for rnd_indices in np.array_split(rnd_idx, len(X_train2) // batch_size):
            X_batch, y_batch = X_train2[rnd_indices], y_train2[rnd_indices]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        loss_val, acc_val = sess.run([mean_loss, mean_accuracy], feed_dict={X: X_valid2, y: y_valid2})
        if loss_val < best_loss:
            save_path = five_frozen_saver.save(sess, "./tf_logs/my_mnist_model_5_to_9_five_frozen")
            best_loss = loss_val
            checks_without_progress = 0
        else:
            checks_without_progress += 1
            if checks_without_progress > max_checks_without_progress:
                print("Early stopping!")
                break
        print("{}\tValidation loss: {:.6f}\tBest loss: {:.6f}\tAccuracy: {:.2f}%".format(
            epoch, loss_val, best_loss, acc_val * 100))

    t1 = time.time()
    print("Total training time: {:.1f}s".format(t1 - t0))

with tf.Session() as sess:
    five_frozen_saver.restore(sess, "./tf_logs/my_mnist_model_5_to_9_five_frozen")
    acc_test = mean_accuracy.eval(feed_dict={X: X_test2, y: y_test2})
    print("Final test accuracy: {:.2f}%".format(acc_test * 100))

INFO:tensorflow:Restoring parameters from ./tf_logs/MODEL1_0_to_4
0	Validation loss: 1.159437	Best loss: 1.159437	Accuracy: 50.67%
1	Validation loss: 1.082819	Best loss: 1.082819	Accuracy: 58.67%
2	Validation loss: 1.010775	Best loss: 1.010775	Accuracy: 56.00%
3	Validation loss: 0.964012	Best loss: 0.964012	Accuracy: 64.00%
4	Validation loss: 0.938321	Best loss: 0.938321	Accuracy: 66.67%
5	Validation loss: 0.943860	Best loss: 0.938321	Accuracy: 66.67%
6	Validation loss: 0.906695	Best loss: 0.906695	Accuracy: 67.33%
7	Validation loss: 0.894095	Best loss: 0.894095	Accuracy: 66.00%
8	Validation loss: 0.933458	Best loss: 0.894095	Accuracy: 65.33%
9	Validation loss: 0.884227	Best loss: 0.884227	Accuracy: 70.00%
10	Validation loss: 0.922858	Best loss: 0.884227	Accuracy: 65.33%
11	Validation loss: 0.957360	Best loss: 0.884227	Accuracy: 59.33%
12	Validation loss: 0.942646	Best loss: 0.884227	Accuracy: 64.67%
13	Validation loss: 0.907987	Best loss: 0.884227	Accuracy: 64.00%
14	Validation loss: 

#### Brutal accuracy... but that's what you expect when training only a single layer! However, the model did run super fast, which is pretty sweet!
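To put "training only a single layer" into numbers, here is a rough parameter count. It is a sketch that assumes the architecture from the task description (784 inputs, five fully connected hidden layers of 100 neurons, a 5-neuron output layer) and assumes the restored model has no extra variables such as Batch Normalization parameters. The names `layer_sizes` and `dense_params` are made up for this illustration.

```python
# Layer sizes assumed from the task description, not read from the checkpoint.
layer_sizes = [784, 100, 100, 100, 100, 100, 5]

def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# One entry per layer: hidden1 ... hidden5, then the output layer.
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

total_params = sum(per_layer)
trainable_params = per_layer[-1]  # only the output layer is passed in var_list
trainable_fraction = trainable_params / total_params

print("Total parameters:     {}".format(total_params))
print("Trainable parameters: {}".format(trainable_params))
print("Trainable fraction:   {:.2%}".format(trainable_fraction))
```

Under these assumptions only 505 of 119,405 parameters are updated, which is consistent with both the fast training and the limited accuracy seen above.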

## 8. Try caching the frozen layers, and train the model again: how much faster is it now?

Let's start by getting a handle on the output of the last frozen layer. Then train the model using roughly the same code as earlier. The difference is that we **compute the output of the top frozen layer at the beginning (both for the training set and the validation set), and we cache it.** This makes training roughly 1.5 to 3 times faster in this example (this may vary greatly, depending on your system):

In [314]:
import time
#########################################################################################
################################# CONSTRUCTION PHASE ####################################
#########################################################################################

#restore graph
reset_graph()
restore_meta = tf.train.import_meta_graph("./tf_logs/MODEL1_0_to_4.meta") #import meta data for graph

#grab variables that we need to reference
X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")
mean_loss = tf.get_default_graph().get_tensor_by_name("mean_loss:0")
mean_accuracy = tf.get_default_graph().get_tensor_by_name("mean_accuracy:0")

#freeze lower layers
learning_rate = 0.01
logit_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='output_logits')
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, name="Adam2")
training_op = optimizer.minimize(mean_loss, var_list=logit_vars)

init = tf.global_variables_initializer()
five_frozen_saver = tf.train.Saver()

#get frozen layer tensor
hidden5_out = tf.get_default_graph().get_tensor_by_name("hidden_5_ACTIVATED:0")

#########################################################################################
################################## EVALUATION PHASE #####################################
#########################################################################################

n_epochs = 1000
batch_size = 20

max_checks_without_progress = 5  # value missing in the original cell; assumed to match the previous cell
checks_without_progress = 0
best_loss = np.inf  # np.infty was removed in NumPy 2.0; np.inf works everywhere

with tf.Session() as sess:
    init.run()
    restore_meta.restore(sess, "./tf_logs/MODEL1_0_to_4")

    t0 = time.time()
    
    #cache hidden_5_ACTIVATED 
    # hidden5_out depends only on X, so there is no need to feed y here
    hidden5_train = hidden5_out.eval(feed_dict={X: X_train2})
    hidden5_valid = hidden5_out.eval(feed_dict={X: X_valid2})
        
    for epoch in range(n_epochs):
        rnd_idx = np.random.permutation(len(X_train2))
        #replace X with hidden5_out values
        for rnd_indices in np.array_split(rnd_idx, len(X_train2) // batch_size):
            h5_batch, y_batch = hidden5_train[rnd_indices], y_train2[rnd_indices]
            sess.run(training_op, feed_dict={hidden5_out: h5_batch, y: y_batch})
        loss_val, acc_val = sess.run([mean_loss, mean_accuracy], feed_dict={hidden5_out: hidden5_valid, y: y_valid2})
        if loss_val < best_loss:
            save_path = five_frozen_saver.save(sess, "./tf_logs/my_mnist_model_5_to_9_five_frozen")
            best_loss = loss_val
            checks_without_progress = 0
        else:
            checks_without_progress += 1
            if checks_without_progress > max_checks_without_progress:
                print("Early stopping!")
                break
        print("{}\tValidation loss: {:.6f}\tBest loss: {:.6f}\tAccuracy: {:.2f}%".format(
            epoch, loss_val, best_loss, acc_val * 100))

    t1 = time.time()
    print("Total training time: {:.1f}s".format(t1 - t0))

with tf.Session() as sess:
    five_frozen_saver.restore(sess, "./tf_logs/my_mnist_model_5_to_9_five_frozen")
    acc_test = mean_accuracy.eval(feed_dict={X: X_test2, y: y_test2})
    print("Final test accuracy: {:.2f}%".format(acc_test * 100))

INFO:tensorflow:Restoring parameters from ./tf_logs/MODEL1_0_to_4
0	Validation loss: 1.159436	Best loss: 1.159436	Accuracy: 50.67%
1	Validation loss: 1.082819	Best loss: 1.082819	Accuracy: 58.67%
2	Validation loss: 1.010775	Best loss: 1.010775	Accuracy: 56.00%
3	Validation loss: 0.964013	Best loss: 0.964013	Accuracy: 64.00%
4	Validation loss: 0.938321	Best loss: 0.938321	Accuracy: 66.67%
5	Validation loss: 0.943860	Best loss: 0.938321	Accuracy: 66.67%
6	Validation loss: 0.906695	Best loss: 0.906695	Accuracy: 67.33%
7	Validation loss: 0.894095	Best loss: 0.894095	Accuracy: 66.00%
8	Validation loss: 0.933458	Best loss: 0.894095	Accuracy: 65.33%
9	Validation loss: 0.884227	Best loss: 0.884227	Accuracy: 70.00%
10	Validation loss: 0.922858	Best loss: 0.884227	Accuracy: 65.33%
11	Validation loss: 0.957360	Best loss: 0.884227	Accuracy: 59.33%
12	Validation loss: 0.942646	Best loss: 0.884227	Accuracy: 64.67%
13	Validation loss: 0.907987	Best loss: 0.884227	Accuracy: 64.00%
14	Validation loss: 

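### OK... so no difference in training time or accuracy. Identical accuracy is expected, since caching only avoids recomputing the frozen layers and does not change what the output layer sees. As for training time, perhaps I am not running the model long enough (increase max epochs), or there is something wrong in the above code. We will have to dive into this again later, because right now we need to dive into recurrent neural nets!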
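One way to check the caching idea outside TensorFlow is a small NumPy experiment. This is a sketch under stated assumptions, not the notebook's model: the weights are random stand-ins for the five frozen layers, biases are omitted, and the 500 inputs are random values sized like the 100-images-per-digit training set. The names `frozen_forward`, `X_small`, and the timing variables are hypothetical.

```python
import time
import numpy as np

rng = np.random.default_rng(42)

# Random stand-ins for the five frozen hidden layers (784 -> 100 -> ... -> 100).
sizes = [784, 100, 100, 100, 100, 100]
frozen_weights = [rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out))
                  for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def elu(z):
    # ELU activation; the exp argument is clipped at 0 to avoid overflow.
    return np.where(z > 0, z, np.exp(np.minimum(z, 0)) - 1)

def frozen_forward(X):
    """Forward pass through the frozen layers only (biases omitted)."""
    h = X
    for W in frozen_weights:
        h = elu(h @ W)
    return h

X_small = rng.random((500, 784))  # assumed size: 100 images per digit, 5 digits
n_epochs = 50

# Without caching: the frozen layers are recomputed every epoch.
t0 = time.perf_counter()
for _ in range(n_epochs):
    h_uncached = frozen_forward(X_small)
uncached_time = time.perf_counter() - t0

# With caching: compute once, then reuse the stored activations every epoch.
t0 = time.perf_counter()
h_cached = frozen_forward(X_small)
for _ in range(n_epochs):
    h_epoch = h_cached
cached_time = time.perf_counter() - t0

same_activations = np.allclose(h_uncached, h_cached)
print("Same activations: {}".format(same_activations))
print("Uncached: {:.4f}s, cached: {:.4f}s".format(uncached_time, cached_time))
```

The cached and uncached activations match, so the output layer is trained on the same values either way. Any speedup comes only from skipping the frozen forward pass, and with a network and dataset this small that pass may be a minor share of each training step, which could be one reason the timings above look so similar.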