# Important Information

This notebook can be used in two ways:
1. Open it in Jupyter, where you can read the tasks and insert your answers. Start the notebook from an Anaconda prompt after changing to the working directory that contains the *.ipynb file.
2. Execute the code in Google Colab, which provides Keras and Tensorflow. If you do not want to create a Google account, you have to set up a local environment with Keras and Tensorflow.

For submission, convert your notebook, including your solutions and all images, to a PDF file and upload it to OLAT.

Setup for this exercise sheet: download the data and select the Tensorflow version.
Execute the code only if you have set up your environment correctly or if you are inside a Colab environment.

In [None]:
! git clone https://gitlab+deploy-token-26:XBza882znMmexaQSpjad@git.informatik.uni-kiel.de/las/nndl.git

%tensorflow_version 2.x

# Exercise 1 (Learning in neural networks)

a) Explain the following terms related to neural networks as briefly and precisely as possible.

* Loss function
* Stochastic gradient descent
* Mini-batch 
* Regularization
* Dropout
* Batch normalization
* Learning with momentum
* Data augmentation
* Unsupervised pre-training / supervised fine-tuning
* Deep learning


In [None]:
Answer: Write your answer here.

b) Name the most important output activation functions f(z), i.e., activation function of the output neuron(s), together with a corresponding suitable loss function L (in both cases, give the mathematical equation). Indicate whether such a perceptron is used for a classification or a regression task.

In [None]:
Answer: Write your answer here.

# Exercise 2 (Multi-layer perceptron: Backpropagation, regression problem)
a) Consider the multi-layer perceptron in the following figure: 

![IMAGE: perceptron](images/MLP.png)

The activation function at all hidden nodes is ReLU and at the output node linear. 

Perform one iteration of plain backpropagation (without momentum, regularization etc.), based on a mini-batch composed of two input samples ${x}^{(\mu)}$ with corresponding target values $y^{(\mu)}$, learning rate $\eta$ and SSE loss:

${x}^{(1)}=(-1,1)^T$ with target $y^{(1)}=1$ and ${x}^{(2)}=(2,-1)^T$ with target $y^{(2)}=-1$  

The initial weights and biases are given as ($t$ is the iteration index):

${W}^1(t=0)=\begin{bmatrix} 1 & 2 \\ 0 & -1 \\ -1 & -3 \\ -2 & 2 \end{bmatrix}$; 
${W}^2(t=0)=\begin{bmatrix} 1 & 0 & -1 & 2 \end{bmatrix}$

${b}^1(t=0)=\begin{bmatrix} -2 \\ 2 \\ 0 \\-2 \end{bmatrix}$;
${b}^2(t=0)=-2$

For the forward pass, calculate the postsynaptic potential (PSP), the activations and the outputs, and insert them into the following table:

Input<br> ${x}=(x_1, x_2)^T = a^0$ | PSP<br> $z^1$ | Activation<br> $a^1$ |  Output<br>$\hat{y}=a^2$ 
:--------:|:--------:|:--------:|:--------:
$(-1, 1)^T$ |   | | |  
$(2, -1)^T$ |   | | |  
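
To check a hand calculation of the forward pass, a small NumPy sketch such as the following may help. It only encodes the weights, biases and inputs given above; the function name `forward` and the row-per-sample layout of `X` are choices made for this illustration and are not part of the notebook.

```python
import numpy as np

# Initial parameters from the exercise (t = 0).
W1 = np.array([[1, 2], [0, -1], [-1, -3], [-2, 2]], dtype=float)
b1 = np.array([-2, 2, 0, -2], dtype=float)
W2 = np.array([[1, 0, -1, 2]], dtype=float)
b2 = np.array([-2], dtype=float)

# Mini-batch with one input sample per row.
X = np.array([[-1, 1], [2, -1]], dtype=float)


def forward(X, W1, b1, W2, b2):
    """Return PSP z1, ReLU activation a1 and linear output y_hat per sample."""
    z1 = X @ W1.T + b1          # hidden-layer PSP, shape (batch, 4)
    a1 = np.maximum(z1, 0.0)    # ReLU activation
    y_hat = a1 @ W2.T + b2      # linear output neuron, shape (batch, 1)
    return z1, a1, y_hat.ravel()


z1, a1, y_hat = forward(X, W1, b1, W2, b2)
print("z1 =\n", z1)
print("a1 =\n", a1)
print("y_hat =", y_hat)
```

Each row of `z1` and `a1` corresponds to one row of the table; the code is meant for verification only and does not replace the written derivation.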
 
For the backward pass, calculate the updated weights and biases for the hidden and output layer and insert them into the following table:

Weights<br> $W^1(t=1)$ | Bias<br> $b^1(t=1)$ | Weights<br> $W^2(t=1)$ |  Bias<br> $b^2(t=1)$ 
:---:|:---:|:---:|:---:
 |   | | |  
 |   | | | 

b) The goal of this exercise is to train a multi-layer perceptron to solve a difficult nonlinear regression problem. The data has been generated using an exponential function with the following shape:

![IMAGE: perceptron](images/Eckerle4Dataset.png)

This graph shows the values of a dataset from the Statistical Reference Datasets (StRD) of the Information Technology Laboratory at the U.S. National Institute of Standards and Technology (NIST), which can be downloaded here:
http://www.itl.nist.gov/div898/strd/nls/data/eckerle4.shtml

This dataset is provided in the file Eckerle4.csv. The notebook code splits it into a training and a test corpus comprising 60% and 40% of the data samples, respectively, and normalizes the input and output values to the interval [0, 1]. Basic code to load the dataset, normalize it, divide it into the two corpora, and apply a multi-layer perceptron is provided in the Jupyter notebook.
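
The normalization to [0, 1] is a min-max scaling. The notebook uses `MinMaxScaler` from scikit-learn for this; the following NumPy sketch only illustrates the underlying formula and its inverse. The helper names `min_max_scale` and `min_max_inverse` and the sample values are made up for this example.

```python
import numpy as np


def min_max_scale(values):
    """Scale a 1-D array linearly to [0, 1]; also return (min, max) for inversion."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo), (lo, hi)


def min_max_inverse(scaled, lo, hi):
    """Map values from [0, 1] back to the original range."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo


# Made-up values for illustration, not taken from Eckerle4.csv.
raw = np.array([400.0, 425.0, 450.0, 500.0])
scaled, (lo, hi) = min_max_scale(raw)
restored = min_max_inverse(scaled, lo, hi)
print(scaled)
print(restored)
```

The inverse mapping is what `y_scaler.inverse_transform` does in the plotting code, so that predictions are shown in the original units.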

Choose a suitable network topology (number of hidden layers and hidden neurons, potentially include dropout, activation function of hidden layers) and use it for the multi-layer perceptron defined in the Jupyter notebook. Set further parameters (learning rate, loss function, optimizer, number of epochs, batch size; see the lines marked with *# FIX!!!* in the Jupyter notebook). Try to avoid underfitting and overfitting. Vary the network and parameter configuration to achieve the best possible network performance. Because of the random components in the experiment, perform at least 4 training and evaluation runs for each network configuration and report the mean and standard deviation of the training and evaluation results. Report on your results and conclusions.

(Source of exercise: http://gonzalopla.com/deep-learning-nonlinear-regression)
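
Reporting mean and standard deviation over several runs can be done with a few lines of NumPy. The sketch below assumes that the final loss of each run has been collected in a list; the helper name `summarize_runs` is hypothetical and the numbers are placeholders, not measured results. It uses the sample standard deviation (`ddof=1`), as the code for Exercise 3 does.

```python
import numpy as np


def summarize_runs(values):
    """Return mean and sample standard deviation (ddof=1) of per-run results."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std(ddof=1)


# Placeholder numbers for illustration only, not measured results.
example_test_losses = [0.010, 0.012, 0.008, 0.010]
mean, std = summarize_runs(example_test_losses)
print("test loss: %f +/- %f" % (mean, std))
```

With only 4 runs the standard deviation is a rough estimate, so small differences between configurations should be interpreted with care.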

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from os.path import join
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras import Model, Input, Sequential
from tensorflow.keras.optimizers import SGD, Adam, Adadelta, Adagrad, Nadam, RMSprop
from tensorflow.keras.utils import normalize
import pandas
from sklearn import preprocessing
from sklearn import model_selection
import sys

###--------
# load data
###--------

# Imports csv into pandas DataFrame object.
path_to_task = "nndl/Lab4"
Eckerle4_df = pandas.read_csv(join(path_to_task,"Eckerle4.csv"), header=0)
 
# Converts dataframes into numpy objects.
Eckerle4_dataset = Eckerle4_df.values.astype("float32")
# Slicing all rows, second column...
X = Eckerle4_dataset[:,1]
# Slicing all rows, first column...
y = Eckerle4_dataset[:,0]
 
# plot data
plt.plot(X,y, color='red')
plt.legend(labels=["data"], loc="upper right")
plt.title("data")
plt.show()

###-----------
# process data
###-----------

# Data Scaling from 0 to 1, X and y originally have very different scales.
X_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
y_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
X_scaled = ( X_scaler.fit_transform(X.reshape(-1,1)))
y_scaled = (y_scaler.fit_transform(y.reshape(-1,1)).reshape(-1) )
 
# Preparing test and train data: 60% training, 40% testing.
X_train, X_test, y_train, y_test = model_selection.train_test_split( X_scaled, y_scaled, test_size=0.40, random_state=3)


###-----------
# define model
###-----------

num_inputs = X_train.shape[1] # should be 1 in case of Eckerle4
num_hidden = ... # for each hidden layer: number of hidden units in form of a python list   # FIX!!!
num_outputs = 1 # predict single number in case of Eckerle4

activation = '...' # activation of hidden layers   # FIX!!!
dropout = ... # 0 if no dropout, else fraction of dropout units (e.g. 0.2)   # FIX!!!

# Sequential network structure.
model = Sequential()

if len(num_hidden) == 0:
  print("Error: Must at least have one hidden layer!")
  sys.exit()  

# add first hidden layer connecting to input layer
model.add(Dense(num_hidden[0], input_dim=num_inputs, activation=activation))

if dropout: 
  # dropout of fraction dropout of the neurons and activation layer.
  model.add(Dropout(dropout))
  # model.add(Activation("linear"))

# potentially further hidden layers
for i in range(1, len(num_hidden)):
  # add hidden layer with num_hidden[i] neurons
  model.add(Dense(num_hidden[i], activation=activation))
#  model.add(Activation("linear"))

  if dropout:
  # dropout of fraction dropout of the neurons and activation layer.
    model.add(Dropout(dropout))
  #  model.add(Activation("linear"))

# output layer
model.add(Dense(1))

# show how the model looks
model.summary()

# compile model
opt = ... # FIX!!!
model.compile(loss='...', optimizer=opt, metrics=["..."])# FIX!!!

# Training model with train data. Fixed random seed:
np.random.seed(3)
num_epochs = ...   # FIX !!!
batch_size = ... # FIX !!! 
history = model.fit(X_train, y_train, epochs=num_epochs, batch_size=batch_size, verbose=2)

###-----------
# plot results
###-----------

print("final (mse) training error: %f" % history.history['loss'][num_epochs-1])

plt.plot(history.history['loss'], color='red', label = 'training loss')
plt.legend(labels=["loss"], loc="upper right")
plt.title("training (mse) error")
plt.show()

# Plot in blue color the predicted data and in green color the
# actual data to verify visually the accuracy of the model.
predicted = model.predict(X_test)
plt.plot(y_scaler.inverse_transform(predicted.reshape(-1,1)), color="blue")
plt.plot(y_scaler.inverse_transform(y_test.reshape(-1,1)), color="green")
plt.legend(labels=["predicted", "target"], loc="upper right")
plt.title("evaluation on test corpus")
plt.show()
print("test error: %f" % model.evaluate(X_test, y_test)[0])

In [None]:
Answer: Write your answer here.

# Exercise 3 (Parameters of a multi-layer perceptron – digit recognition)

In the following exercises, we use Tensorflow and Keras to configure, train and apply a multi-layer perceptron to the problem of recognizing handwritten digits (the famous “MNIST” problem). The MNIST data are loaded using a Tensorflow Keras built-in function.

Perform experiments on this pattern recognition problem to investigate how the classification performance depends on a number of parameters, for example:

-	the learning rate and potentially learning schedule,
-	the number of hidden neurons (in a network with a single hidden layer),
-	the number of hidden layers as well as applying dropout and / or batch normalization,
-	the solver (including momentum),
-	the activation function at hidden layers,
-	regularization.
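
For the learning schedule, the notebook code uses `schedules.ExponentialDecay` with `decay_steps=100000`, `decay_rate=0.96` and `staircase=True`. The following plain-Python sketch reproduces the decay formula to show how slowly the rate drops with these settings. The function name `exponential_decay`, the initial learning rate 0.1 and the batch size 100 are assumptions made for this illustration; the 50,000 training samples follow from the validation split in the notebook.

```python
import math


def exponential_decay(initial_lr, step, decay_steps, decay_rate, staircase=True):
    """Learning rate after `step` optimizer updates under exponential decay."""
    exponent = step / decay_steps
    if staircase:
        # staircase: the rate only changes every `decay_steps` updates
        exponent = math.floor(exponent)
    return initial_lr * decay_rate ** exponent


# Assumed setting for illustration: 50,000 training samples, batch size 100.
steps_per_epoch = math.ceil(50000 / 100)
for epoch in (1, 10, 100, 200):
    step = epoch * steps_per_epoch
    lr = exponential_decay(0.1, step, decay_steps=100000, decay_rate=0.96)
    print("epoch %3d: learning rate %f" % (epoch, lr))
```

Under these assumptions the learning rate stays constant for the first 199 epochs, so a visible effect of the schedule would require a much smaller `decay_steps` value.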

The script in the Jupyter notebook can serve as a basis or starting point.

Report your findings and conclusions.

**Note: These experiments may require a lot of computation time!**

**Further investigations and experiments as well as code extensions and modifications are welcome!**

 

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import sys
from os.path import join
from tensorflow.keras.layers import Dense, Dropout, Activation, BatchNormalization
from tensorflow.keras import Model, Input, Sequential
from tensorflow.keras.optimizers import SGD, Adam, Adadelta, Adagrad, Nadam, RMSprop, schedules
from tensorflow.keras.utils import normalize
import tensorflow.keras.datasets as tfds
import tensorflow.keras.initializers as tfi
import tensorflow.keras.regularizers as tfr

###--------
# load data
###--------

(training_input, training_target), (test_input, test_target)  = tfds.mnist.load_data()

# Reserve 10,000 samples for validation
validation_input = training_input[-10000:]
validation_target = training_target[-10000:]
training_input = training_input[:-10000]
training_target = training_target[:-10000]

print("training input shape: %s, training target shape: %s"  % (training_input.shape, training_target.shape))
print("validation input shape: %s, validation target shape: %s"  % (validation_input.shape, validation_target.shape))
print("test input shape: %s, test target shape: %s"  % (test_input.shape, test_target.shape))
# range of input values: 0 ... 255
print("\n")

# plot some sample images
num_examples = ... # FIX!!!
for s in range(num_examples):
  print("Example image, true label: %d" % training_target[s])
  plt.imshow(training_input[s], vmin=0, vmax=255, cmap=plt.cm.gray)
  plt.show()

###-----------
# process data
###-----------

# Note: shuffling is performed in fit method

# scaling inputs from range 0 ... 255 to range [0,1] if desired
scale_inputs = True # scale inputs to range [0,1]
if scale_inputs:
  training_input = training_input / 255
  validation_input = validation_input / 255 
  test_input = test_input / 255

print("min. training data: %f" % np.min(training_input))
print("max. training data: %f" % np.max(training_input))
print("min. validation data: %f" % np.min(validation_input))
print("max. validation data: %f" % np.max(validation_input))
print("min. test data: %f" % np.min(test_input))
print("max. test data: %f" % np.max(test_input))

# histograms of input values
nBins = 100
fig, axes = plt.subplots(1, 3, figsize=(15,10))
axes[0].hist(training_input.flatten(), nBins)
axes[0].set_xlabel("training")
axes[0].set_ylabel("counts")
axes[0].set_ylim((0,1e6))

axes[1].hist(validation_input.flatten(), nBins)
axes[1].set_xlabel("validation")
axes[1].set_ylabel("counts")
axes[1].set_ylim((0,1e6))
axes[1].set_title('histograms of input values')

axes[2].hist(test_input.flatten(), nBins)
axes[2].set_xlabel("test")
axes[2].set_ylabel("counts")
axes[2].set_ylim((0,1e6))

plt.show()

# flatten inputs to vectors
training_input = training_input.reshape(training_input.shape[0], training_input.shape[1] * training_input.shape[2])
validation_input = validation_input.reshape(validation_input.shape[0], validation_input.shape[1] * validation_input.shape[2])
test_input = test_input.reshape(test_input.shape[0], test_input.shape[1] * test_input.shape[2])
print(training_input.shape)
print(validation_input.shape)
print(test_input.shape)

num_classes = ... # FIX!!!

###-----------
# define model
###-----------

histories = {}
opt_learning_rate = {}
final_training_loss = {}
final_training_accuracy = {}
final_validation_loss = {}
final_validation_accuracy = {}
final_test_loss = {}
final_test_accuracy = {}

configurations = [
        # FIX!!!
        # ... 
        {'learningRates': [...], # numpy array, e.g. [0.1, 0.2]
         'hiddenLayerSizes': [...], # as before
         'solver': '...',
         'activation':'...'}, # activation of hidden layers
         

]

learningRateSchedule = ... # FIX!!! True: apply (exponential) learning rate schedule; False: constant learning rate
dropout = ... # FIX!!! 0 if no dropout, else fraction of dropout units (e.g. 0.2)   # FIX!!!
batch_normalization = ... # FIX!!!
regularization_weight = ... # FIX!!! 0 for no regularization or e.g. 0.01 to apply regularization
regularizer = tfr.l1(regularization_weight) # or l2 or l1_l2; used for both weights and biases
momentum = ... # FIX!!! 0 or e.g. 0.9, 0.99; ONLY FOR STOCHASTIC GRADIENT DESCENT AND RMSPROP
nesterov = ... # FIX!!! ONLY FOR STOCHASTIC GRADIENT DESCENT

numRepetitions = ... # FIX!!! repetitions of experiment due to stochastic nature

num_inputs = training_input.shape[1] 
num_outputs = num_classes 

idx_config = 0

for config in configurations:
  print("=======")
  print("Now running tests for config", config)

  learningRates = config['learningRates']
  num_hidden = config['hiddenLayerSizes']
  solver = config['solver']
  activation = config['activation']

  # Sequential network structure.
  model = Sequential()

  if len(num_hidden) == 0:
    print("Error: Must at least have one hidden layer!")
    sys.exit()  

  # add first hidden layer connecting to input layer
  model.add(Dense(num_hidden[0], input_dim=num_inputs, activation=activation, kernel_regularizer=regularizer, bias_regularizer=regularizer))

#  if dropout: # dropout at input layer is generally not recommended
#    # dropout of fraction dropout of the neurons and activation layer.
#    model.add(Dropout(dropout))
#  #  model.add(Activation("linear"))

  if batch_normalization:
    model.add(BatchNormalization())

  # potentially further hidden layers
  for i in range(1, len(num_hidden)):
    # add hidden layer with num_hidden[i] neurons
    model.add(Dense(num_hidden[i], activation=activation, kernel_regularizer=regularizer, bias_regularizer=regularizer))
  #  model.add(Activation("linear"))

    if dropout:
    # dropout of fraction dropout of the neurons and activation layer.
      model.add(Dropout(dropout))
    #  model.add(Activation("linear"))

    if batch_normalization:
      model.add(BatchNormalization())  

  # output layer
  model.add(Dense(units=num_outputs, name = "output", kernel_regularizer=regularizer, bias_regularizer=regularizer))

  # no dropout after the output layer: it would randomly zero
  # class logits during training
    
  # print configuration
  print("\nModel configuration: ")
  print(model.get_config())
  print("\n")

  # show how the model looks
  model.summary()

  optLearningRate = 0
  optValidationAccuracy = 0

  histories_lr = [] # remember history for each learning rate

  for idx_lr in range(len(learningRates)):
  
    print("MODIFYING LEARNING RATE")
    learningRate = learningRates[idx_lr]
    if learningRateSchedule == True:
      lr_schedule = schedules.ExponentialDecay(initial_learning_rate = learningRate, decay_steps=100000, decay_rate=0.96, staircase=True) # or PiecewiseConstantDecay or PolynomialDecay or InverseTimeDecay 
      print("... applying exponential decay learning rate schedule with initial learning rate %f" % learningRate)
    else:
      lr_schedule = learningRate # constant learning rate
      print("... constant learning rate %f" % learningRate)

    train_loss = np.zeros(numRepetitions)
    train_acc = np.zeros(numRepetitions)
    val_loss = np.zeros(numRepetitions)
    val_acc = np.zeros(numRepetitions)
    test_loss = np.zeros(numRepetitions)
    test_acc = np.zeros(numRepetitions)

    histories_rep = [] # (temporarily) remember history of each repetition
    for idx_rep in range(numRepetitions):
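      print("\nIteration %d..." % idx_rep)  
      
      # start every repetition from freshly initialized weights;
      # clone_model re-creates the layers instead of reusing the trained ones
      model = tf.keras.models.clone_model(model)

      # compile model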
      if solver == 'SGD':        
        opt = SGD(learning_rate=lr_schedule, momentum=momentum, nesterov=nesterov) # SGD or Adam, Nadam, Adadelta, Adagrad, RMSProp, potentially setting more parameters
      elif solver == 'Adam':
        opt = Adam(learning_rate=lr_schedule) 
      elif solver == 'Nadam':
        opt = Nadam(learning_rate=lr_schedule) # Nadam doesn't support adaptive learning rate schedule!
      elif solver == 'Adadelta':
        opt = Adadelta(learning_rate=lr_schedule) 
      elif solver == 'Adagrad':
        opt = Adagrad(learning_rate=lr_schedule) 
      elif solver == 'RMSprop':
        opt = RMSprop(learning_rate=lr_schedule, momentum = momentum)
      model.compile(optimizer=opt,loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),metrics=['sparse_categorical_accuracy'])

      # Training model with train data (shuffled anew in every epoch).
      num_epochs = ... # FIX !!!
      batch_size = ... # FIX !!! 
      history = model.fit(training_input, training_target, epochs=num_epochs, batch_size=batch_size, shuffle=True, verbose=2)
      histories_rep.append(history) # remember all histories from all repetitions
      train_loss[idx_rep] = history.history['loss'][num_epochs-1] 
      train_acc[idx_rep] = history.history['sparse_categorical_accuracy'][num_epochs-1]
      # evaluate each corpus once; evaluate returns [loss, accuracy]
      val_loss[idx_rep], val_acc[idx_rep] = model.evaluate(validation_input, validation_target)
      test_loss[idx_rep], test_acc[idx_rep] = model.evaluate(test_input, test_target)

    # print results:
    print("training loss (in brackets: mean +/- std):")
    for i in range(numRepetitions):
        print("%f" % train_loss[i])
    print("(%f +/- %f)\n" % (np.mean(train_loss), np.std(train_loss, ddof=1)))

    print("training accuracy (in brackets: mean +/- std):")
    for i in range(numRepetitions):
        print("%f" % train_acc[i])
    print("(%f +/- %f)\n" % (np.mean(train_acc), np.std(train_acc, ddof=1)))

    print("validation loss (in brackets: mean +/- std):")
    for i in range(numRepetitions):
        print("%f" % val_loss[i])
    print("(%f +/- %f)\n" % (np.mean(val_loss), np.std(val_loss, ddof=1)))

    print("validation accuracy (in brackets: mean +/- std):")
    for i in range(numRepetitions):
        print("%f" % val_acc[i])
    print("(%f +/- %f)\n" % (np.mean(val_acc), np.std(val_acc, ddof=1)))

    print("test loss (in brackets: mean +/- std):")
    for i in range(numRepetitions):
        print("%f" % test_loss[i])
    print("(%f +/- %f)\n" % (np.mean(test_loss), np.std(test_loss, ddof=1)))

    print("test accuracy (in brackets: mean +/- std):")
    for i in range(numRepetitions):
        print("%f" % test_acc[i])
    print("(%f +/- %f)\n" % (np.mean(test_acc), np.std(test_acc, ddof=1)))

    # remember history of best repetition (based on maximal validation accuracy)
    idx_best_rep = np.argmax(val_acc)

    # plot training loss and accuracy for best repetition
    print("\nbest repetition: experiment %d" % idx_best_rep)
    plt.plot(histories_rep[idx_best_rep].history['loss'], color = 'blue', 
                  label = 'training loss')
    plt.plot(histories_rep[idx_best_rep].history['sparse_categorical_accuracy'], color = 'red', 
                  label = 'training accuracy')
    plt.xlabel('Epoch number')
    plt.ylim(0, 1)
    plt.legend()
    plt.show()

    # determine optimal learning rate (based on mean validation accuracy over repetitions)
    if np.mean(val_acc) > optValidationAccuracy:
        optValidationAccuracy = np.mean(val_acc)
        opt_learning_rate[idx_config] = learningRate  
        # remember history
        histories[idx_config] = histories_rep[idx_best_rep]
        # remember evaluation results
        final_training_loss[idx_config] = train_loss[idx_best_rep]
        final_training_accuracy[idx_config] = train_acc[idx_best_rep]
        final_validation_loss[idx_config] = val_loss[idx_best_rep]
        final_validation_accuracy[idx_config] = val_acc[idx_best_rep]
        final_test_loss[idx_config] = test_loss[idx_best_rep]
        final_test_accuracy[idx_config] = test_acc[idx_best_rep]   

  print("\n\noptimal learning rate for this configuration: %f\n\n" % opt_learning_rate[idx_config])

  # print evaluation results
  print("\nconfiguration %s:\n" % configurations[idx_config])
  print("optimal learning rate: %f" % opt_learning_rate[idx_config])
  print("final training loss: %f" % final_training_loss[idx_config])
  print("final training accuracy: %f" % final_training_accuracy[idx_config])
  print("final validation loss: %f" % final_validation_loss[idx_config])
  print("final validation accuracy: %f" % final_validation_accuracy[idx_config])
  print("final test loss: %f" % final_test_loss[idx_config])
  print("final test accuracy: %f" % final_test_accuracy[idx_config])

  # increment configuration index
  idx_config = idx_config + 1

###--------------------------------
# Summary: print evaluation results
###--------------------------------

print("\n\nSummary:\n\n")
for i in range(len(configurations)): 
  print("\nconfiguration %s:\n" % configurations[i])
  print("optimal learning rate: %f" % opt_learning_rate[i])
  print("final training loss: %f" % final_training_loss[i])
  print("final training accuracy: %f" % final_training_accuracy[i])
  print("final validation loss: %f" % final_validation_loss[i])
  print("final validation accuracy: %f" % final_validation_accuracy[i])
  print("final test loss: %f" % final_test_loss[i])
  print("final test accuracy: %f" % final_test_accuracy[i])

###--------------------
# Summary: plot results
###--------------------
 
# plot setup
num_rows = int(np.ceil(len(configurations)/2)) # np.int was removed from recent NumPy versions
fig, axes = plt.subplots(num_rows, 2, figsize=(15, 10))
fig.tight_layout() # improve spacing between subplots, doesn't work
plt.subplots_adjust(left=0.125, right=0.9, bottom=0.1, top=0.9, wspace=0.2, hspace=0.2) # doesn't work
legend = []
i = 0
axes_indices = {}

if (len(configurations) <= 2):
  for i in range(len(configurations)):
    axes_indices[i] = i
else:
  for i in range(num_rows):
    axes_indices[2*i] = (i, 0)
    axes_indices[2*i+1] = (i, 1)

for i in range(len(configurations)):
  # plot loss    
  axes[axes_indices[i]].set_title('configuration ' + str(i))
  if i >= len(configurations) - 2:  # only the last two plots get an x-axis label
    axes[axes_indices[i]].set_xlabel('Epoch number')
  axes[axes_indices[i]].set_ylim(0, 1)
  axes[axes_indices[i]].plot(histories[i].history['loss'], color = 'blue', 
              label = 'training loss')
  axes[axes_indices[i]].plot(histories[i].history['sparse_categorical_accuracy'], color = 'red', 
              label = 'training accuracy')
  axes[axes_indices[i]].legend()



# show the plot
plt.show()

In [None]:
Answer: Write your answer here.

# Exercise 4 (Vanishing gradient)

a) The Jupyter notebook implements a multi-layer perceptron for use on the MNIST digit classification problem. Apart from the training loss and accuracy, it also displays a histogram of the weights (between the input and the first hidden layer) after initialization and at the end of the training, and visualizes the weights (between the input layer and 16 hidden neurons of the first hidden layer). Using a sigmoid activation function, compare the results for one, five, and six hidden layers. Then change to a ReLU activation function and inspect the results for six hidden layers. Discuss your findings.
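
As background for the discussion, the following sketch compares the derivatives of the two activation functions. It is a strongly simplified view that ignores the weight matrices and looks only at the product of activation derivatives along one path through the hidden layers; the function names are chosen for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid; its maximum value 0.25 is reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_prime(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


# Best case for each activation (z = 0 for sigmoid, z > 0 for ReLU):
# product of the activation derivatives over n hidden layers.
for n_layers in (1, 5, 6):
    print(n_layers, sigmoid_prime(0.0) ** n_layers, relu_prime(1.0) ** n_layers)
```

Even in the best case the sigmoid factor shrinks by at least a factor of 4 per layer, whereas an active ReLU unit passes the gradient unchanged. The actual gradients in the experiment also depend on the weights and on how many units are active.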

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import sys
from os.path import join
from tensorflow.keras.layers import Dense, Dropout, Activation, BatchNormalization
from tensorflow.keras import Model, Input, Sequential
from tensorflow.keras.optimizers import SGD, Adam, Adadelta, Adagrad, Nadam, RMSprop, schedules
from tensorflow.keras.utils import normalize
import tensorflow.keras.datasets as tfds
import tensorflow.keras.initializers as tfi
import tensorflow.keras.regularizers as tfr

###--------
# load data
###--------

(training_input, training_target), (test_input, test_target)  = tfds.mnist.load_data()

# Reserve 10,000 samples for validation
validation_input = training_input[-10000:]
validation_target = training_target[-10000:]
training_input = training_input[:-10000]
training_target = training_target[:-10000]

print("training input shape: %s, training target shape: %s"  % (training_input.shape, training_target.shape))
print("validation input shape: %s, validation target shape: %s"  % (validation_input.shape, validation_target.shape))
print("test input shape: %s, test target shape: %s"  % (test_input.shape, test_target.shape))
# range of input values: 0 ... 255
print("\n")

###-----------
# process data
###-----------

# Note: shuffling is performed in fit method

# scaling inputs from range 0 ... 255 to range [0,1] if desired
scale_inputs = True # scale inputs to range [0,1]
if scale_inputs:
  training_input = training_input / 255
  validation_input = validation_input / 255 
  test_input = test_input / 255

# flatten inputs to vectors
training_input = training_input.reshape(training_input.shape[0], training_input.shape[1] * training_input.shape[2])
validation_input = validation_input.reshape(validation_input.shape[0], validation_input.shape[1] * validation_input.shape[2])
test_input = test_input.reshape(test_input.shape[0], test_input.shape[1] * test_input.shape[2])
print(training_input.shape)
print(validation_input.shape)
print(test_input.shape)

num_classes = 10 # 10 digits

###-----------
# define model
###-----------

num_inputs = training_input.shape[1] 
num_hidden = [...] # FIX!!!
num_outputs = num_classes 

initialLearningRate = 0.01 # FIX!!!
# select constant learning rate or (flexible) learning rate schedule,
# i.e. select one of the following two alternatives
lr_schedule = initialLearningRate # constant learning rate
# lr_schedule = schedules.ExponentialDecay(initial_learning_rate = initialLearningRate, decay_steps=100000, decay_rate=0.96, staircase=True) # or PiecewiseConstantDecay or PolynomialDecay or InverseTimeDecay 

solver = 'SGD'
activation = '...' # FIX!!! e.g. sigmoid or relu
dropout = 0 # 0 if no dropout, else fraction of dropout units (e.g. 0.2)   # FIX!!!
batch_normalization = False

weight_init = tfi.glorot_uniform() # FIX!!! default: glorot_uniform(); e.g. glorot_normal(), he_normal(), he_uniform(), lecun_normal(), lecun_uniform(), RandomNormal(), RandomUniform(), Zeros() etc.
bias_init = tfi.Zeros() # FIX!!! default: Zeros(); for some possible values see weight initializers

regularization_weight = 0.0 # 0 for no regularization or e.g. 0.01 to apply regularization
regularizer = tfr.l1(regularization_weight) # or l2 or l1_l2; used for both weights and biases

num_epochs = ... # FIX !!!
batch_size = ... # FIX !!! 

# Sequential network structure.
model = Sequential()

if len(num_hidden) == 0:
  print("Error: Must at least have one hidden layer!")
  sys.exit()  

# add first hidden layer connecting to input layer

model.add(Dense(num_hidden[0], input_dim=num_inputs, activation=activation, kernel_initializer=weight_init, bias_initializer = bias_init, kernel_regularizer=regularizer, bias_regularizer=regularizer))

# if dropout: # dropout at input layer is generally not recommended
#  # dropout of fraction dropout of the neurons and activation layer.
#  model.add(Dropout(dropout))
# #  model.add(Activation("linear"))

if batch_normalization:
  model.add(BatchNormalization())

# potentially further hidden layers
for i in range(1, len(num_hidden)):
  # add hidden layer with num_hidden[i] neurons
  model.add(Dense(num_hidden[i], activation=activation, kernel_initializer=weight_init, bias_initializer = bias_init, kernel_regularizer=regularizer, bias_regularizer=regularizer))
#  model.add(Activation("linear"))
  
  if dropout:
  # dropout of fraction dropout of the neurons and activation layer.
    model.add(Dropout(dropout))
  #  model.add(Activation("linear"))

  if batch_normalization:
    model.add(BatchNormalization())  

# output layer
model.add(Dense(units=num_outputs, name = "output", kernel_initializer=weight_init, bias_initializer = bias_init, kernel_regularizer=regularizer, bias_regularizer=regularizer))

# no dropout after the output layer: it would randomly zero
# class logits during training
  
# print configuration
print("\nModel configuration: ")
print(model.get_config())
print("\n")
print("... number of layers: %d" % len(model.layers))

# show how the model looks
model.summary()
      
# compile model
if solver == 'SGD':
  momentum = 0 # e.g. 0.0, 0.5, 0.9 or 0.99
  nesterov = False
  opt = SGD(learning_rate=lr_schedule, momentum=momentum, nesterov=nesterov) # SGD or Adam, Nadam, Adadelta, Adagrad, RMSProp, potentially setting more parameters
elif solver == 'Adam':
  opt = Adam(learning_rate=lr_schedule)
elif solver == 'Nadam':
  opt = Nadam(learning_rate=lr_schedule)
elif solver == 'Adadelta':
  opt = Adadelta(learning_rate=lr_schedule)
elif solver == 'Adagrad':
  opt = Adagrad(learning_rate=lr_schedule)
elif solver == 'RMSprop':
  opt = RMSprop(learning_rate=lr_schedule)
model.compile(optimizer=opt,loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),metrics=['sparse_categorical_accuracy'])

# histogram of weights (first layer) after initialization
weights = model.layers[0].get_weights()[0]
biases = model.layers[0].get_weights()[1]

nBins = 100
fig, axes = plt.subplots(1, 2, figsize=(15,10))
axes[0].hist(weights.flatten(), nBins)
axes[0].set_xlabel("weights")
axes[0].set_ylabel("counts")
axes[0].set_title("weight histogram after initialization")

axes[1].hist(biases.flatten(), nBins)
axes[1].set_xlabel("biases")
axes[1].set_ylabel("counts")
axes[1].set_title("bias histogram after initialization")
plt.show()

# visualize the weights between input layer and some 
# of the hidden neurons of the first hidden layer after initialization
# model.layers[0].get_weights()[0] is a (784 x numHiddenNeurons) array
# model.layers[0].get_weights()[0].T (transpose) is a (numHiddenNeurons x 784) array,
# the first entry of which contains the weights of all inputs connecting
# to the first hidden neuron; those weights will be displayed in (28 x 28) format
# until all plots (4 x 4, i.e. 16) are "filled" or no more hidden neurons are left
print("Visualization of the weights between input and some of the hidden neurons of the first hidden layer:")
fig, axes = plt.subplots(4, 4, figsize=(15,15))
# use global min / max to ensure all weights are shown on the same scale
weights = model.layers[0].get_weights()[0]
vmin, vmax = weights.min(), weights.max()
for coef, ax in zip(weights.T, axes.ravel()):
    ax.matshow(coef.reshape(28, 28), cmap=plt.cm.gray, vmin=.5 * vmin,
               vmax=.5 * vmax)
    ax.set_xticks(())
    ax.set_yticks(())

plt.show()

# Training
history = model.fit(training_input, training_target, epochs=num_epochs, batch_size=batch_size, shuffle=True, verbose=2)

# plot training loss and accuracy 
plt.plot(history.history['loss'], color = 'blue', label = 'training loss')
plt.plot(history.history['sparse_categorical_accuracy'], color = 'red', label = 'training accuracy')
plt.xlabel('Epoch number')
plt.ylim(0, 1)
plt.legend()
plt.show()

# model evaluation
# last entry of the training history = values after the final epoch
train_loss = history.history['loss'][-1]
train_acc = history.history['sparse_categorical_accuracy'][-1]
# evaluate each data set only once; evaluate() returns [loss, accuracy]
val_loss, val_acc = model.evaluate(validation_input, validation_target)
test_loss, test_acc = model.evaluate(test_input, test_target)

print("\n")
print("final training loss: %f" % train_loss)
print("final training accuracy: %f" % train_acc)
print("final validation loss: %f" % val_loss)
print("final validation accuracy: %f" % val_acc)
print("final test loss: %f" % test_loss)
print("final test accuracy: %f" % test_acc)
print("\n")

# histogram of weights (first layer) after training
weights = model.layers[0].get_weights()[0]
biases = model.layers[0].get_weights()[1]

nBins = 100
fig, axes = plt.subplots(1, 2, figsize=(15,10))
axes[0].hist(weights.flatten(), nBins)
axes[0].set_xlabel("weights")
axes[0].set_ylabel("counts")
axes[0].set_title("weight histogram after training")

axes[1].hist(biases.flatten(), nBins)
axes[1].set_xlabel("biases")
axes[1].set_ylabel("counts")
axes[1].set_title("bias histogram after training")
plt.show()

# visualize the weights between input layer and some 
# of the hidden neurons of the first hidden layer after training
# model.layers[0].get_weights()[0] is a (784 x numHiddenNeurons) array
# model.layers[0].get_weights()[0].T (transpose) is a (numHiddenNeurons x 784) array,
# the first entry of which contains the weights of all inputs connecting
# to the first hidden neuron; those weights will be displayed in (28 x 28) format
# until all plots (4 x 4, i.e. 16) are "filled" or no more hidden neurons are left
print("Visualization of the weights between input and some of the hidden neurons of the first hidden layer:")
fig, axes = plt.subplots(4, 4, figsize=(15,15))
# use global min / max to ensure all weights are shown on the same scale
weights = model.layers[0].get_weights()[0]
vmin, vmax = weights.min(), weights.max()
for coef, ax in zip(weights.T, axes.ravel()):
    ax.matshow(coef.reshape(28, 28), cmap=plt.cm.gray, vmin=.5 * vmin,
               vmax=.5 * vmax)
    ax.set_xticks(())
    ax.set_yticks(())

plt.show()


In [None]:
Answer: Write your answer here.

b) Give a theoretical justification for why the weights and biases of the neurons in the first hidden layers of a multi-layer perceptron with many hidden layers change only slowly when a sigmoid activation function and gradient descent are used. To this end, consider as an example a simplified network with three hidden layers and a single neuron per layer, and compute and analyse the derivative of the cost function C with respect to the bias of the first hidden neuron. What changes in your analysis when a ReLU activation function is used instead of the sigmoid?

In [None]:
Answer: Write your answer here.
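
The setting of part b) can also be explored numerically. The following is a minimal sketch, not a reference solution: it backpropagates through a chain of single-neuron layers (three hidden neurons plus one output neuron) and returns the derivative of the cost with respect to each bias. Several choices are assumptions made only for illustration and are not given in the exercise: the quadratic cost $C = \frac{1}{2}(a_L - y)^2$, all weights equal to 1, all biases equal to 0, the same activation on the output neuron, and the helper name `chain_bias_gradients`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_bias_gradients(weights, biases, x, y, activation="sigmoid"):
    """Return [dC/db_1, ..., dC/db_L] for a chain with one neuron per layer.

    Assumed cost (illustration only): C = 0.5 * (a_L - y)**2.
    """
    def act(z):
        return sigmoid(z) if activation == "sigmoid" else max(z, 0.0)

    def dact(z):
        if activation == "sigmoid":
            s = sigmoid(z)
            return s * (1.0 - s)          # never larger than 0.25
        return 1.0 if z > 0 else 0.0      # ReLU slope

    # forward pass: store the weighted input z of every layer
    a, zs = x, []
    for w, b in zip(weights, biases):
        z = w * a + b
        zs.append(z)
        a = act(z)

    # backward pass: delta_l = delta_(l+1) * w_(l+1) * f'(z_l), and dC/db_l = delta_l
    delta = (a - y) * dact(zs[-1])
    grads = [delta]
    for l in range(len(weights) - 2, -1, -1):
        delta = delta * weights[l + 1] * dact(zs[l])
        grads.append(delta)
    return grads[::-1]

# hypothetical example values: 3 hidden neurons + 1 output neuron
weights, biases = [1.0] * 4, [0.0] * 4
sig_grads = chain_bias_gradients(weights, biases, x=1.0, y=0.0, activation="sigmoid")
relu_grads = chain_bias_gradients(weights, biases, x=1.0, y=0.0, activation="relu")
print("sigmoid:", sig_grads)
print("ReLU:   ", relu_grads)
```

With the sigmoid, every layer that the error passes through on its way back contributes a factor $w \cdot \sigma'(z)$ with $\sigma'(z) \le 0.25$, so the printed magnitudes shrink towards the first layer. With ReLU and positive weighted inputs, that factor is just $w$.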

c) Starting from your analysis of the multi-layer perceptron with six hidden layers and sigmoid activation function in part a), try to find other model configurations that lead to successful training. You may, for example, modify the learning rate and batch size, change the weight and bias initialization, apply batch normalization and / or dropout, and add regularization.

In [None]:
Answer: Write your answer here.
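
One of the options named in part c), the weight initialization, can be examined without training. The sketch below is a rough illustration under stated assumptions, not a recommended configuration: it pushes random inputs through six sigmoid layers and reports the mean slope $\sigma'(z)$ per layer for two initializations. The layer width of 100, the uniform random inputs in $[0, 1]$, the zero biases and the helper name `mean_sigmoid_slope` are all hypothetical. The `"glorot"` branch follows the Glorot uniform rule, which is the default kernel initializer of a Keras `Dense` layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_sigmoid_slope(layer_sizes, init, n_samples=256, seed=0):
    """Forward random inputs through sigmoid layers; return mean sigma'(z) per layer."""
    rng = np.random.default_rng(seed)
    # stand-in for image data scaled to [0, 1] (assumption)
    a = rng.uniform(0.0, 1.0, size=(n_samples, layer_sizes[0]))
    slopes = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        if init == "normal":
            # standard normal weights, independent of the layer size
            w = rng.normal(0.0, 1.0, size=(fan_in, fan_out))
        elif init == "glorot":
            # Glorot uniform: limit = sqrt(6 / (fan_in + fan_out))
            limit = np.sqrt(6.0 / (fan_in + fan_out))
            w = rng.uniform(-limit, limit, size=(fan_in, fan_out))
        else:
            raise ValueError("unknown init: %s" % init)
        a = sigmoid(a @ w)                      # biases set to zero (assumption)
        slopes.append(float(np.mean(a * (1.0 - a))))
    return slopes

sizes = [784] + [100] * 6                       # 784 inputs, six hidden layers
normal_slopes = mean_sigmoid_slope(sizes, "normal")
glorot_slopes = mean_sigmoid_slope(sizes, "glorot")
print("normal:", np.round(normal_slopes, 3))
print("glorot:", np.round(glorot_slopes, 3))
```

A small mean slope indicates that many neurons are saturated, which shrinks the backpropagated error further. Comparing the two printed rows gives a first impression of how strongly the initialization alone affects this.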