# Reusing Neural Networks with Tensorflow
<p class='lead'>
Author: Oliveira, Markos F. B. G.<br />
Date: 8/22/2017
</p>

# Description

This document is based on Chapter 11 content of Geron's book *Hands-On Machine Learning with Scikit-Learn & TensorFlow*. It covers only the basics of the related topics and it's not meant to be a tutorial. Please, refer to the referenced book and scikit-learn/tensorflow online documentation for specific information.

In this Notebook an example of reusing neural networks is presented. It's a simple case that demostrates how reusing the weights of a pre-trained deep neural network (DNN) can help modeling similar concepts in another (but related) supervised problem. This can provide a better generalization capabillity of the new model even in cases of lack of data. The information that would be on the data, it's encoded now in the weights of the old model, reused by the new model. Because most of ML problems deal with hierarchical structures (such as images, where high level structures such as faces can be constructed using low level structures such as small lines, and corners), it makes sense to reuse learned low level structures/feature detectors in further related problems (i.g. hair recognition system can be build from a DNN that learned to recognize faces).

A DNN makes it possible to encode concepts with different granularities due to its hierarchical structure. A general approach is to fetch the weights of some layers of the old DNN, frozen the lower layers and train only the high level layers. Note that even though we have a high complexity mapping (deep net), just some parameters are trainable. Thus, less data is necessary to construct such mapping. As the similarity of the related problems increases, more layers can be reused.

## Problem Description

In particular, in this Notebook it's shown how to reuse part of a deep net that learned hot to recognize the difference between '5's and '3's hadwritten digits, in another recognition problem: recognize the difference between '8's and '9's digits. For the second problem, less data is presented, to demonstrade the usefulness of such methodology in cases where the available data is not big enough to find a good representation. The digits used is from MNIST database.

### Limitations

- The presented example is simple, the database of the two training procedure is the same: MNIST. Thus, the number of features is the same in both examples. It's not clear how to reuse a network that was trained in a previous problem with different number of features.

- The training procedure is not optimized. Due to the fact that this is a small learning problem, any optimized techniques related to weight initialization, normalization, optimization, activation functions, regularization, etc. were used. Hence, plain stochastic gradient descent is used in this work.

- To gain a speed boost it's possible to cache the output of a frozen layers and run SGB using this as inputs. This will give you a huge speed boost as you will only need to go through the frozen layers once per training instance, instead of once per epoch. However, this was not implemented.

### Importing the necessary libraries/modules/functions

In [1]:
import tensorflow as tf
import numpy as np
from datetime import datetime
import time
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

### Loading and pre-processing the Datasets

It's necessary to create two datasets, one for each recognition problem. No validation set will be used.

In [2]:
from tensorflow.examples.tutorials.mnist import input_data

#Downloading and extracting the data
mnist = input_data.read_data_sets("/tmp/data/")

#Fetching the data (no validation set will be used in this project)
X_train_35 = mnist.train.images[(mnist.train.labels.astype("int") == 3) + (mnist.train.labels.astype("int") == 5)]
y_train_35 = mnist.train.labels.astype("int")[(mnist.train.labels.astype("int") == 3) + (mnist.train.labels.astype("int") == 5)]
y_train_35 = y_train_35 == 5 #transform to 1s and 0s vector
X_test_35 = mnist.test.images[(mnist.test.labels.astype("int") == 3) + (mnist.test.labels.astype("int") == 5)]
y_test_35 = mnist.test.labels.astype("int")[(mnist.test.labels.astype("int") == 3) + (mnist.test.labels.astype("int") == 5)]
y_test_35 = y_test_35 == 5 #transform to 1s and 0s vector
X_train_89 = mnist.train.images[(mnist.train.labels.astype("int") == 8) + (mnist.train.labels.astype("int") == 9)]
y_train_89 = mnist.train.labels.astype("int")[(mnist.train.labels.astype("int") == 8) + (mnist.train.labels.astype("int") == 9)]
y_train_89 = y_train_89 == 9 #transform to 1s and 0s vector
X_test_89 = mnist.test.images[(mnist.test.labels.astype("int") == 8) + (mnist.test.labels.astype("int") == 9)]
y_test_89 =  mnist.test.labels.astype("int")[(mnist.test.labels.astype("int") == 8) + (mnist.test.labels.astype("int") == 9)]
y_test_89 = y_test_89 == 9 #transform to 1s and 0s vector

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


### Training the First DNN (3's vs 5's)

To train the DNNs using tensorflow, it will be used SGD algorithm.

In [3]:
#Storing some important values
X_train = X_train_35
X_test = X_test_35
y_train = y_train_35
y_test = y_test_35
n_inputs = X_train.shape[1]
n_outputs = (np.unique(y_train)).size
m = X_train.shape[0]
print('Inputs shape of training data: ', X_train.shape)
print('Outputs shape of training data: ', y_train.shape)
print('# inputs: ', n_inputs) # 28*28 = MNIST
print('# outputs: ', n_outputs)
print('# instances: ', m)

Inputs shape of training data:  (10625, 784)
Outputs shape of training data:  (10625,)
# inputs:  784
# outputs:  2
# instances:  10625


In stochastic gradient descent it's necessary to provide stochastic small batches of examples so that the algorithm updates more often and eventually converges faster. The function returns a mini-batch when its called using two approaches.

- In the 1st approach random indices from dataset are selected to be inside the mini-batch using a randomized seed, which is the number of the training iteration (epoch*n_batches + batch_index). In this case, one epoch does not pass through all the different examples.
- In the 2nd approach a random permutation of the examples is passed and a slice of this permutation is used as mini-batch. In this case all the examples are used in one epoch, thus it's a better approach. A random permutation of the examples is necessary to be computed once per epoch oustide the iteration over the batches.

In [4]:
def fetch_batch(epoch, batch_index, batch_size, idxs, Xs, ys, approach=2):
    """Returns the mini-batch (X, y) for a training step.

    Parameters
    ----------
    epoch : int
        Number of epoch.
    batch_index : int
        Number of batch inside the epoch.
    batch_size : int
        Batch size.
    idxs : list of int
        Permutation of training set indices.
    approach: int, optional
        Approach used for chosing the mini-batches. 
    Returns
    -------
    X_batch : np.ndarray
        Mini-batch of inputs.
    y_batch : np.ndarray
        Mini-batch of outputs.
    """
    
    if approach == 1:
        # 1st approach: in this approach random indices from dataset are selected to be inside the mini-batch using a
        #randomized seed, which is the number of the training iteration. In this case, one epoch probably does
        #not have all the different examples.
        rnd.seed(epoch * n_batches + batch_index)
        indices = rnd.randint(m, size=batch_size)
    elif approach == 2:
        # 2nd approach: in this approach a random permutation of the examples is passed and a slice of this permutation
        #is used as mini-batch.
        indices = idxs[batch_index * batch_size: (batch_index + 1) * batch_size]
    
    X_batch = Xs[indices]
    y_batch = ys[indices]
    
    return X_batch, y_batch

It's good to have a function that manually creates the archictecure of a neural network. In this way, it's possible to create structures that are not fully connected. Below, it's implemented a function that creates a standard feedforward network. The function creates each layer at a time when it's called, and not the fully structure. The weights are initialized using common method found in literature to speed up convergence and avoid saturation in early stages of learning.

In [5]:
def neuron_layer(X, n_neurons, name, activation=None):
    """Manually creates the layers of the neural network.

    Parameters
    ----------
    X : np.ndarray
        Input values of the layer (m_batch, n_inputs).
    n_neurons : int
        Number of neurons in the layer.
    name : string
        Scope name of the layer.
    activation : string
        Type of activation function.
    Returns
    -------
    z : np.ndarray
        The output of the layer.
    """
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        stddev = 2 / np.sqrt(n_inputs) # good strategy to initialize the NN's weights.
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev, seed = 0) # using truncated distribution we avoid
        #values whose magnitude is more than 2 standard deviations (95%) from the mean, which is zero (these values are
        #dropped and re-picked). It helps with convergence speed.
        W = tf.Variable(init, name="weights")
        b = tf.Variable(tf.zeros([n_neurons]), name="biases")
        z = tf.matmul(X, W) + b #If m_match > 1 tf.matmul(X, W) is a matrix; b i summed to all of its columns.
        
        if activation=="relu":
            return tf.nn.relu(z)
        elif activation=="sigmoid":
            return tf.nn.sigmoid
        elif activation=="tanh":
            return tf.nn.tanh
        else:
            return z

Below, some of the training paramters can be set. The archictecture of the DNN though, is set in the construction phase of the TF graph.

In [6]:
#SGD parameters:
n_epochs = 100
learning_rate = 0.01
batch_size = 50
n_batches = int(np.ceil(m / batch_size))

#Monitoring:
prt = True # enable printing training statistics.
print_step = 10 # number of epochs to periodically print training statistics.

#Neural networks layers:
manual_layers = True #manually creates the layers using 'neural_layer' function.
tf_batch = fetch_batch # use implemented TF function to load mini-batches.
#for MNIST: tf_batch = mnist.train.next_batch(batch_size)
n_hidden1 = 200
n_hidden2 = 100
n_hidden3 = 100
#Obs.: the archictecture of the layer must be constructed inside the 'CONSTRUCTION PHASE' of the nn graph.

#Save/restore:
restore = False # restore old model
save_ckpt = False # save chackpoints
saver_step = 10 # number of epochs to periodically save model's checkpoint.
path_restore = '/tmp/finals/my_model_final.ckpt'

#Tensorboard:
tb = False # log training statistics for tensorboard.
root_logdir = "tf_logs"
tensorboard_step = 10 # number of epochs to periodically log statistics in tensorboard log files.

Below, we effectively construct and execute the TF graph of the first DNN. The archictecture chosen have three hidden layers with 200, 100 and 100 ReLU neurons. This implementation does not includes early-stopping, hence, with 100 epochs the net highly overfitts the training data, getting a perfect score after the 30/40 training epoch.

In [7]:
tf.reset_default_graph() #restoring the default graph.
tf.logging.set_verbosity(tf.logging.WARN) #supress TF logging messages when saving ckpt files.
start_time = time.time()
now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
logdir = "{}/run-{}/".format(root_logdir, now) #relative path of tensorboard logs for a particular run (current time
#is used so that each folder has different running stats, comparision between them can be made inside tensorboard).

# TF CONSTRUCTION PHASE

# 1- Creating variables, placeholders and constants.
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

# 2- Creating the operations.
with tf.name_scope("dnn"):
    if manual_layers:
        hidden1 = neuron_layer(X, n_hidden1, "hidden1", activation="relu")
        hidden2 = neuron_layer(hidden1, n_hidden2, "hidden2", activation="relu")
        hidden3 = neuron_layer(hidden2, n_hidden3, "hidden3", activation="relu")
        logits = neuron_layer(hidden3, n_outputs, "output") # the final layer returns the logits only.
        softmax = tf.nn.softmax(logits, name="softmax")
    else:
        hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1", activation=tf.nn.relu)
        hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
        hidden3 = tf.layers.dense(hidden2, n_hidden3, name="hidden3", activation=tf.nn.relu)
        logits = tf.layers.dense(hidden3, n_outputs, name="outputs")
        softmax = tf.nn.softmax(logits, name="softmax")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits) #this computes the cross entropy based 
    #directly on the logits of the output layer: it expects integer labels (from 0 to n_outputs-1). This returns the cross-entropy
    #scalar value for each instance.
    #Use 'softmax_cross_entropy_with_logits()' if the labels are in the form of one-hot vectors.
    loss = tf.reduce_mean(xentropy, name="loss") #computes the mean over the mini_batch.
    
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1) #returns boolean == True if the output y is in the first k highest probabilities.  
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32)) #computes the accuracy.

# 3- Node that initialize the variables.
init = tf.global_variables_initializer()

# 4- Creating the saver.
saver_save = tf.train.Saver(max_to_keep=None) #save all variable values.
if restore:
    saver_restore = tf.train.Saver() #restore all variable values.
    #saver_restore = tf.train.Saver({"weights": theta}) #restore only the old theta variable under the name of 'weights'.

# 5- Tensorboard definitions:
if tb:
    accuracy_summary = tf.summary.scalar('ACC', accuracy) # Creates a node in the graph that will evaluate the ACC value
    #and write it to a TensorBoard-compatible binary log string called a summary.
    summary_writer = tf.summary.FileWriter(logdir, tf.get_default_graph()) # Creates a FileWriter that you will
    #use to write summaries to logfiles in the log directory.

# TF EXECUTION PHASE:

with tf.Session() as sess:
    
    # To restore a model the construction phase must be identical than the one used to save it. 
    if restore:
        saver_restore.restore(sess, path_restore)
    else:
        sess.run(init) # Initializing the variables.
    
    for epoch in range(n_epochs+1): # for each epoch..
        
        idXs = np.random.permutation(range(m))
        
        for batch_index in range(n_batches-1): # for each mini-batch..
            X_batch, y_batch = tf_batch(epoch, batch_index, batch_size, idXs, X_train, y_train)
            #Training step:
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        
        if (prt) and (epoch % print_step == 0):
            acc_train = accuracy.eval(feed_dict={X: X_train, y: y_train})
            print("Epoch", epoch, "Training accuracy:", acc_train)
        if (tb) and (epoch % tensorboard_step == 0):
            summary_str = accuracy_summary.eval(feed_dict={X: X_train, y: y_train}) # This will output a summary that
            #you can then write to the events file using the file_writer. Here is the updated code:
            step = epoch * n_batches + batch_index
            summary_writer.add_summary(summary_str, step)
        if (save_ckpt) and (epoch % saver_step == 0):
            save_path = saver_save.save(sess, "./tmp/my_model-{}.ckpt".format(epoch))

    #Saving the model
    save_path = saver_save.save(sess, "./tmp/best_model-{}.ckpt".format(now)) 
    
# Flushing and closing FileWriter        
if tb:                            
    summary_writer.flush()
    summary_writer.close() 

best_model_path = "./tmp/best_model-{}.ckpt".format(now)
print("Best model saved as:", best_model_path)
print("Training time: %.6s seconds" % (time.time() - start_time))

Epoch 0 Training accuracy: 0.955671
Epoch 10 Training accuracy: 0.994071
Epoch 20 Training accuracy: 0.998965
Epoch 30 Training accuracy: 0.999718
Epoch 40 Training accuracy: 1.0
Epoch 50 Training accuracy: 1.0
Epoch 60 Training accuracy: 1.0
Epoch 70 Training accuracy: 1.0
Epoch 80 Training accuracy: 1.0
Epoch 90 Training accuracy: 1.0
Epoch 100 Training accuracy: 1.0
Best model saved as: ./tmp/best_model-20170821195934.ckpt
Training time: 42.365 seconds


### Reloading and Testing the Network

In [9]:
saver_restore = tf.train.Saver() 
with tf.Session() as sess:
    saver_restore.restore(sess, best_model_path)
    Z = logits.eval(feed_dict={X: X_test})
    y_prob = softmax.eval(feed_dict={X: X_test})
    y_pred = np.argmax(Z, axis=1) #or np.argmax(y_prob, axis=1)
    
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy in the test set: ', accuracy)

Accuracy in the test set:  0.992639327024


### Training the second DNN (8's vs 9's) reusing pre-trained weights

In the first fully-connected DNN we had (784\*200)+(200\*100)+(100\*100)+(100\*2)= 187k parameters. Note that freezing the first layer, we have only ~30k learnable parameters, a great reduction. Let's see our performance in the test set when reusing the weights between the input layer and the first hidden layer with 200 neurons and the weights between this layer and the second hidden layer with 100 layers. However, only the first batch of weights will be frozen, the second batch will be used as initial values only.

The main differences in the code resort inside the step-4 in the construction phase 'Creating the Saver' and in the optimizer definition under the 'train' scope.

In [10]:
#Storing some important values
X_train = X_train_89
X_test = X_test_89
y_train = y_train_89
y_test = y_test_89
n_inputs = X_train.shape[1]
n_outputs = (np.unique(y_train)).size
m = X_train.shape[0]
print('Inputs shape of training data: ', X_train.shape)
print('Outputs shape of training data: ', y_train.shape)
print('# inputs: ', n_inputs) # 28*28 = MNIST
print('# outputs: ', n_outputs)
print('# instances: ', m)

Inputs shape of training data:  (10843, 784)
Outputs shape of training data:  (10843,)
# inputs:  784
# outputs:  2
# instances:  10843


In [15]:
tf.reset_default_graph() #restoring the default graph.
tf.logging.set_verbosity(tf.logging.WARN) #supress TF logging messages when saving ckpt files.
start_time = time.time()
now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
logdir = "{}/run-{}/".format(root_logdir, now) #relative path of tensorboard logs for a particular run (current time
#is used so that each folder has different running stats, comparision between them can be made inside tensorboard).

# TF CONSTRUCTION PHASE

# 1- Creating variables, placeholders and constants.
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

# 2- Creating the operations.
with tf.name_scope("dnn"):
    if manual_layers:
        hidden1 = neuron_layer(X, n_hidden1, "hidden1", activation="relu")
        hidden2 = neuron_layer(hidden1, n_hidden2, "hidden2", activation="relu")
        hidden3 = neuron_layer(hidden2, n_hidden3, "hidden3", activation="relu")
        logits = neuron_layer(hidden3, n_outputs, "outputs") # the final layer returns the logits only.
        softmax = tf.nn.softmax(logits, name="softmax")
    else:
        hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1", activation=tf.nn.relu)
        hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
        hidden3 = tf.layers.dense(hidden2, n_hidden3, name="hidden3", activation=tf.nn.relu)
        logits = tf.layers.dense(hidden3, n_outputs, name="outputs")
        softmax = tf.nn.softmax(logits, name="softmax")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits) #this computes the cross entropy based 
    #directly on the logits of the output layer: it expects integer labels (from 0 to n_outputs-1). This returns the cross-entropy
    #scalar value for each instance.
    #Use 'softmax_cross_entropy_with_logits()' if the labels are in the form of one-hot vectors.
    loss = tf.reduce_mean(xentropy, name="loss") #computes the mean over the mini_batch.
    
with tf.name_scope("train"):
    #Defining the trainable variables:
    train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="(dnn/hidden[23]) || (dnn/outputs)")
    #Note that hidden1 weights are not allowed to change. Only weights between h1 and h2 ('hidden2'), between h2 and f3
    #('hidden3') ans between h3 and out ('outputs').
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss, var_list=train_vars)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1) #returns boolean == True if the output y is in the first k highest probabilities.  
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32)) #computes the accuracy.

# 3- Node that initialize the variables.
init = tf.global_variables_initializer()

# 4- Creating the saver.
reuse_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="dnn/hidden[12]") #scope accept a regular expression.
#In this case, we are loading the weights under 'hidden1' and 'hidden2'. Only 'hidden1' will be frozen, 'hidden2' will be used
#as initial values only.
reuse_vars_dict = dict([(var.op.name, var) for var in reuse_vars])
#print('Original vars:', reuse_vars_dict)
original_saver = tf.train.Saver(reuse_vars_dict) # saver to restore the original model
new_saver = tf.train.Saver() # saver to save the new model

# TF EXECUTION PHASE:

with tf.Session() as sess:
    
    #intead of sess.run(init)..
    sess.run(init) #initializes all variables
    original_saver.restore(sess, best_model_path) #restores some variables.
    
    for epoch in range(n_epochs+1): # for each epoch..
        
        idXs = np.random.permutation(range(m))
        
        for batch_index in range(n_batches-1): # for each mini-batch..
            X_batch, y_batch = tf_batch(epoch, batch_index, batch_size, idXs, X_train, y_train)
            #Training step:
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        
        if (prt) and (epoch % print_step == 0):
            acc_train = accuracy.eval(feed_dict={X: X_train, y: y_train})
            print("Epoch", epoch, "Training accuracy:", acc_train)
        if (save_ckpt) and (epoch % saver_step == 0):
            save_path = saver_save.save(sess, "./tmp/my_model-{}.ckpt".format(epoch))
        
        Z = logits.eval(feed_dict={X: X_test})
        y_prob = softmax.eval(feed_dict={X: X_test})
        y_pred = np.argmax(Z, axis=1) #or np.argmax(y_prob, axis=1)
    
    #Saving the model
    save_path = saver_save.save(sess, "./tmp/best_2_model-{}.ckpt".format(now)) 

print("Training time: %.6s seconds" % (time.time() - start_time))
print('Accuracy in the test set: ', accuracy_score(y_test, y_pred))

Epoch 0 Training accuracy: 0.970765
Epoch 10 Training accuracy: 0.994743
Epoch 20 Training accuracy: 0.99834
Epoch 30 Training accuracy: 0.999539
Epoch 40 Training accuracy: 0.999816
Epoch 50 Training accuracy: 0.999908
Epoch 60 Training accuracy: 0.999908
Epoch 70 Training accuracy: 1.0
Epoch 80 Training accuracy: 1.0
Epoch 90 Training accuracy: 1.0
Epoch 100 Training accuracy: 1.0
Training time: 57.358 seconds
Accuracy in the test set:  0.989914271306


### Results

The aim of this document is to present the use of TF to train a net using pre-trained weights. Some prelimineary results were observed. This results are not conclusive and should not be generalized.

- Training with all (~1000) examples:
    - We still got 100% accuracy, however, more epochs were necessary.
    - The training time was decreased from 55s to 46s (a 20% reduction). 
    - The test accuracies were approximatly the same: 0.99.
- Training with 100 examples:
    - Training were very fast with and without pre-training.
    - Even without pre-training, it was possible to overfit the training data; the accuracy in the test set was 0.92.
    - With pre-training, the test set accuracy were 0.9.
    - Because the training set is small, it becomes less representative of the true hand-written behavior, hence, overffiting is more severe.
- Training with 50 examples:
    - Without pre-training the net was not possible to learn to recognize 8's and 9's as different digits very well. The training and test sets accuracies become around 0.6. The accuracy in the training set was low probably because the fixed number of epochs of 100.
    - With pre-training it was indeed possible to solve the recognition task even with 100 epochs for training. It was possible to get a earfect score on the training set and 0.84 accuracy on the test set.
    
It was possible to see from the preliminary runs that the training time is not entirely proportional to the weights in the net. A reduction of 89% in the weights (from 187k to 20k) result in a training time 20% slower. It seems better to avoid pre-training if you have enough time available. However, to create deep structures with very little data, pre-training weights may be necessary.

A couple of general questions are still open (the answers for these questions depend much on the problem domain you are working with):

- What's the effect on training time, generalization ability and fitness, for different ratios of freezing layers? 
- What's the difference on training time, generalization ability and fitness when initializing weights as pre-trained values instead of random strategies?
- In problems without much data, is it preferable to create a smaller net or a big net with pre-trained weights?
- How similar the problems must be to pre-training be useful?
- In which problem conditions should we use pre-training weights?
    