# Image Classification Using CIFAR-10 dataset

The CIFAR-10 (Canadian Institute For Advanced Research) dataset contains 60,000 32x32 color images. Each image is labeled with one of the following 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. There are 50,000 training images and 10,000 test images.

# Table of Contents

This notebook has 5 parts.  You will practice TensorFlow on three different levels of abstraction.

1. Part I, Preparation: load the CIFAR-10 dataset.
2. Part II, Barebones TensorFlow: **Abstraction Level 1**, we will work directly with low-level TensorFlow operations.
3. Part III, Keras Model API: **Abstraction Level 2**, we will use `tf.keras.Model` to define arbitrary neural network architecture. 
4. Part IV, Keras Sequential + Functional API: **Abstraction Level 3**, we will use `tf.keras.Sequential` to define a linear feed-forward network very conveniently, and then explore the functional libraries for building unique and uncommon models that require more flexibility.
5. Part V, Tuning: Experiment with different architectures, activation functions, weight initializations, optimizers, hyperparameters, regularizations or other advanced features. Your goal is to get accuracy as high as possible on CIFAR-10 (without using convolutional layers).

We will discuss Keras in more detail later in the notebook.

Here is a table of comparison:

| API           | Flexibility | Convenience |
|---------------|-------------|-------------|
| Barebones     | High        | Low         |
| `tf.keras.Model`     | High        | Medium      |
| `tf.keras.Sequential` | Low         | High        |

# Part I: Preparation

First, we load the CIFAR-10 dataset. The downloading might take a couple minutes the first time you do it, but the files should stay cached after that. 

The `tf.keras.datasets` package in TensorFlow provides prebuilt utility functions for loading many common datasets. For the purposes of this assignment we will write our own code to preprocess the data and iterate through it in minibatches. The `tf.data` package in TensorFlow provides tools for automating this process, but working with this package adds extra complication and is beyond the scope of this notebook. However, using `tf.data` can be much more efficient than the simple approach used in this notebook, so you should consider using it for your project.
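For reference, here is a minimal sketch of what minibatching with `tf.data` might look like. It is not used anywhere else in this notebook. The arrays `X` and `y` are hypothetical random stand-ins with CIFAR-10-like shapes, not the real data.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in arrays with CIFAR-10-like shapes (not the real dataset).
X = np.random.randn(256, 32, 32, 3).astype(np.float32)
y = np.random.randint(0, 10, size=256).astype(np.int32)

batch_size = 64
dataset = (
    tf.data.Dataset.from_tensor_slices((X, y))  # one element per (image, label) pair
    .shuffle(buffer_size=len(X))                # buffer covers the whole array
    .batch(batch_size)                          # group elements into minibatches
)

# Iterating in eager mode yields (x_batch, y_batch) Tensors.
batch_shapes = [(xb.shape.as_list(), yb.shape.as_list()) for xb, yb in dataset]
print(len(batch_shapes), batch_shapes[0])
```

With 256 examples and a batch size of 64, the loop should produce four full minibatches.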

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

# Install TensorFlow
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass


In [0]:
import os
import tensorflow as tf
import numpy as np
import math
import timeit
import matplotlib.pyplot as plt

%matplotlib inline

In [0]:
print(tf.__version__)

2.0.0


In [0]:
def load_cifar10(num_training=49000, num_validation=1000, num_test=10000):
    """
    Fetch the CIFAR-10 dataset from the web and perform preprocessing to prepare
    it for the two-layer neural net classifier.
    """
    # Load the raw CIFAR-10 dataset and use appropriate data types and shapes
    cifar10 = tf.keras.datasets.cifar10.load_data()
    (X_train, y_train), (X_test, y_test) = cifar10
    X_train = np.asarray(X_train, dtype=np.float32)
    y_train = np.asarray(y_train, dtype=np.int32).flatten()
    X_test = np.asarray(X_test, dtype=np.float32)
    y_test = np.asarray(y_test, dtype=np.int32).flatten()

    # Subsample the data
    mask = range(num_training, num_training + num_validation)
    X_val = X_train[mask]
    y_val = y_train[mask]
    mask = range(num_training)
    X_train = X_train[mask]
    y_train = y_train[mask]
    mask = range(num_test)
    X_test = X_test[mask]
    y_test = y_test[mask]

    # Normalize the data: subtract the mean pixel and divide by std
    mean_pixel = X_train.mean(axis=(0, 1, 2), keepdims=True)
    std_pixel = X_train.std(axis=(0, 1, 2), keepdims=True)
    X_train = (X_train - mean_pixel) / std_pixel
    X_val = (X_val - mean_pixel) / std_pixel
    X_test = (X_test - mean_pixel) / std_pixel

    return X_train, y_train, X_val, y_val, X_test, y_test


# Invoke the above function to get our data.
NHW = (0, 1, 2)
X_train, y_train, X_val, y_val, X_test, y_test = load_cifar10()
print('Train data shape: ', X_train.shape)
print('Train labels shape: ', y_train.shape, y_train.dtype)
print('Validation data shape: ', X_val.shape)
print('Validation labels shape: ', y_val.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

Train data shape:  (49000, 32, 32, 3)
Train labels shape:  (49000,) int32
Validation data shape:  (1000, 32, 32, 3)
Validation labels shape:  (1000,)
Test data shape:  (10000, 32, 32, 3)
Test labels shape:  (10000,)


In [0]:
class Dataset(object):
    def __init__(self, X, y, batch_size, shuffle=False):
        """
        Construct a Dataset object to iterate over data X and labels y
        
        Inputs:
        - X: Numpy array of data, of any shape
        - y: Numpy array of labels, of any shape but with y.shape[0] == X.shape[0]
        - batch_size: Integer giving number of elements per minibatch
        - shuffle: (optional) Boolean, whether to shuffle the data on each epoch
        """
        assert X.shape[0] == y.shape[0], 'Got different numbers of data and labels'
        self.X, self.y = X, y
        self.batch_size, self.shuffle = batch_size, shuffle

    def __iter__(self):
        N, B = self.X.shape[0], self.batch_size
        idxs = np.arange(N)
        if self.shuffle:
            np.random.shuffle(idxs)
        # Index through idxs so that shuffling actually reorders the examples
        return iter((self.X[idxs[i:i+B]], self.y[idxs[i:i+B]]) for i in range(0, N, B))


train_dset = Dataset(X_train, y_train, batch_size=64, shuffle=True)
val_dset = Dataset(X_val, y_val, batch_size=64, shuffle=False)
test_dset = Dataset(X_test, y_test, batch_size=64)

In [0]:
# We can iterate through a dataset like this:
for t, (x, y) in enumerate(train_dset):
    print(t, x.shape, y.shape)
    if t > 5: break

0 (64, 32, 32, 3) (64,)
1 (64, 32, 32, 3) (64,)
2 (64, 32, 32, 3) (64,)
3 (64, 32, 32, 3) (64,)
4 (64, 32, 32, 3) (64,)
5 (64, 32, 32, 3) (64,)
6 (64, 32, 32, 3) (64,)


You can optionally **use a GPU by setting the flag to True below**. It's not necessary to use a GPU for this assignment; if you are working on Google Cloud then we recommend that you do not use a GPU, as it will be significantly more expensive.

In [0]:
# Set up some global variables
USE_GPU = True

if USE_GPU:
    device = '/device:GPU:0'
else:
    device = '/cpu:0'

# Constant to control how often we print when training models
print_every = 700

print('Using device: ', device)

Using device:  /device:GPU:0


# Part II: Barebones TensorFlow
TensorFlow comes with various high-level APIs which make it very convenient to define and train neural networks; we will cover some of these constructs in Part III and Part IV of this notebook. In this section, we will start by building a model with basic TensorFlow constructs to help you better understand what's going on under the hood of the higher-level APIs.

**"Barebones TensorFlow" is important to understanding the building blocks of TensorFlow, but much of it involves concepts from TensorFlow 1.x.** We will be working directly with low-level constructs such as `tf.Variable`.

Therefore, please read and understand the differences between legacy (1.x) TF and the new (2.0) TF.

### Historical background on TensorFlow 1.x

TensorFlow 1.x is primarily a framework for working with **static computational graphs**. Nodes in the computational graph are Tensors which will hold n-dimensional arrays when the graph is run; edges in the graph represent functions that will operate on Tensors when the graph is run to actually perform useful computation.

Before TensorFlow 2.0, a program had to be structured in two phases. There are plenty of tutorials online that explain this two-step process. The process generally looks like the following for TF 1.x:
1. **Build a computational graph that describes the computation that you want to perform**. This stage doesn't actually perform any computation; it just builds up a symbolic representation of your computation. This stage will typically define one or more `placeholder` objects that represent inputs to the computational graph.
2. **Run the computational graph many times.** Each time the graph is run (e.g. for one gradient descent step) you will specify which parts of the graph you want to compute, and pass a `feed_dict` dictionary that will give concrete values to any `placeholder`s in the graph.

### The new paradigm in TensorFlow 2.0
Now, with TensorFlow 2.0, we can simply adopt a functional form that is more Pythonic and similar in spirit to PyTorch and direct NumPy operation. Operations execute eagerly instead of following the 2-step paradigm with computation graphs, which makes it (among other things) easier to debug TF code. You can read more details at https://www.tensorflow.org/guide/eager.

The main difference between the TF 1.x and 2.0 approaches is that the 2.0 approach doesn't make use of `tf.Session`, `Session.run`, `placeholder`, or `feed_dict`. For more details on what's different between the two versions and how to convert between them, check out the official migration guide: https://www.tensorflow.org/alpha/guide/migration_guide
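As a small illustration of the eager style, the sketch below runs a matrix product immediately and reads the result back as a NumPy array, with no session or placeholder involved. The values are arbitrary and chosen only for the example.

```python
import tensorflow as tf

# In TF 2.x, operations run as soon as they are called.
a = tf.constant([[1.0, 2.0],
                 [3.0, 4.0]])
b = tf.matmul(a, a)          # computed immediately, no graph-building step

b_np = b.numpy()             # concrete values are available right away
print(b_np)
```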

In the rest of this notebook we'll focus on this new, simpler approach.

### TensorFlow warmup: Flatten Function

We can see this in action by defining a simple `flatten` function that will reshape image data for use in a fully-connected network.

In TensorFlow, data for convolutional feature maps is typically stored in a Tensor of shape N x H x W x C where:

- N is the number of datapoints (minibatch size)
- H is the height of the feature map
- W is the width of the feature map
- C is the number of channels in the feature map

This is the right way to represent the data when using convolutional neural networks (we will explore CNNs in a future assignment). When we use fully connected linear/affine layers to process the image, however, we want each datapoint to be represented by a single vector -- it's no longer useful to segregate the different channels, rows, and columns of the data. So, we use a "flatten" operation to collapse the `H x W x C` values per representation into a single long vector. 

Notice the `tf.reshape` call has the target shape as `(N, -1)`, meaning it will reshape/keep the first dimension to be N, and then infer as necessary what the second dimension is in the output, so we can collapse the remaining dimensions from the input properly.

**NOTE**: TensorFlow and PyTorch differ on the default Tensor layout; TensorFlow uses N x H x W x C but PyTorch uses N x C x H x W.
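The reshape and layout conventions described above can be checked on a tiny array. This sketch uses plain NumPy rather than TensorFlow, and the array sizes are arbitrary.

```python
import numpy as np

# A tiny NHWC batch: N=2 images, H=2, W=2, C=3 channels.
x_nhwc = np.arange(2 * 2 * 2 * 3).reshape(2, 2, 2, 3)

# Flatten: keep N and let -1 infer H * W * C = 12.
x_flat = x_nhwc.reshape(x_nhwc.shape[0], -1)

# Convert NHWC -> NCHW (the PyTorch layout) by permuting the axes.
x_nchw = x_nhwc.transpose(0, 3, 1, 2)

print(x_flat.shape)   # (2, 12)
print(x_nchw.shape)   # (2, 3, 2, 2)
```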

In [0]:
def flatten(x):
    """    
    Input:
    - TensorFlow Tensor of shape (N, D1, ..., DM)
    
    Output:
    - TensorFlow Tensor of shape (N, D1 * ... * DM)
    """
    N = tf.shape(x)[0]
    return tf.reshape(x, (N, -1))

In [0]:
def test_flatten():
    # Construct concrete values of the input data x using numpy
    x_np = np.arange(24).reshape((2, 3, 4))
    print('x_np:\n', x_np, '\n')
    # Compute a concrete output value.
    x_flat_np = flatten(x_np)
    print('x_flat_np:\n', x_flat_np, '\n')

# test_flatten()

### Barebones TensorFlow: Define a Two-Layer Network
We will now implement our first neural network with TensorFlow: a fully-connected ReLU network with one hidden layer (two weight matrices) and no biases on the CIFAR-10 dataset. For now we will use only low-level TensorFlow operators to define the network; later we will see how to use the higher-level abstractions provided by `tf.keras` to simplify the process.

We will define the forward pass of the network in the function `two_layer_fc`; this will accept TensorFlow Tensors for the inputs and weights of the network, and return a TensorFlow Tensor for the scores. 

After defining the network architecture in the `two_layer_fc` function, we will test the implementation by checking the shape of the output.

**It's important that you read and understand this implementation.**

In [0]:
def two_layer_fc(x, params):
    """
    A fully-connected neural network; the architecture is:
    fully-connected layer -> ReLU -> fully connected layer.
    Note that we only need to define the forward pass here; TensorFlow will take
    care of computing the gradients for us.
    
    The input to the network will be a minibatch of data, of shape
    (N, d1, ..., dM) where d1 * ... * dM = D. The hidden layer will have H units,
    and the output layer will produce scores for C classes.

    Inputs:
    - x: A TensorFlow Tensor of shape (N, d1, ..., dM) giving a minibatch of
      input data.
    - params: A list [w1, w2] of TensorFlow Tensors giving weights for the
      network, where w1 has shape (D, H) and w2 has shape (H, C).
    
    Returns:
    - scores: A TensorFlow Tensor of shape (N, C) giving classification scores
      for the input data x.
    """
    w1, w2 = params                   # Unpack the parameters
    x = flatten(x)                    # Flatten the input; now x has shape (N, D)
    h = tf.nn.relu(tf.matmul(x, w1))  # Hidden layer: h has shape (N, H)
    scores = tf.matmul(h, w2)         # Compute scores of shape (N, C)
    return scores

In [0]:
def two_layer_fc_test():
    hidden_layer_size = 42

    # Scoping our TF operations under a tf.device context manager 
    # lets us tell TensorFlow where we want these Tensors to be
    # multiplied and/or operated on, e.g. on a CPU or a GPU.
    with tf.device(device):        
        x = tf.zeros((64, 32, 32, 3))
        w1 = tf.zeros((32 * 32 * 3, hidden_layer_size))
        w2 = tf.zeros((hidden_layer_size, 10))

        # Call our two_layer_fc function for the forward pass of the network.
        scores = two_layer_fc(x, [w1, w2])

    print(scores.shape)

# two_layer_fc_test()

### Barebones TensorFlow: Training Step

We now define the `training_step` function, which performs a single training step. This will take three basic steps:

1. Compute the loss
2. Compute the gradient of the loss with respect to all network weights
3. Make a weight update step using (stochastic) gradient descent.


We need to use a few new TensorFlow functions to do all of this:
- For computing the cross-entropy loss we'll use `tf.nn.sparse_softmax_cross_entropy_with_logits`: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits

- For averaging the loss across a minibatch of data we'll use `tf.reduce_mean`:
https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/reduce_mean

- For computing gradients of the loss with respect to the weights we'll use `tf.GradientTape` (useful for Eager execution):  https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/GradientTape

- We'll mutate the weight values stored in a `tf.Variable` using its `assign_sub` method ("sub" is for subtraction): https://www.tensorflow.org/api_docs/python/tf/assign_sub
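To make the loss concrete, here is a NumPy sketch of what sparse softmax cross-entropy followed by a mean computes. The helper `sparse_softmax_cross_entropy` is a hypothetical name written for this illustration; it is not the TensorFlow implementation.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy from raw scores and integer class labels."""
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability of the correct class for each example.
    return -log_probs[np.arange(len(labels)), labels]

logits = np.zeros((4, 10), dtype=np.float32)   # identical scores for all 10 classes
labels = np.array([0, 3, 7, 9])

per_example = sparse_softmax_cross_entropy(logits, labels)
mean_loss = per_example.mean()                 # plays the role of tf.reduce_mean
print(mean_loss)
```

With identical scores for every class, each example's loss is log(10), about 2.30, which is also roughly the loss to expect from an untrained 10-class classifier.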


In [0]:
def training_step(model_fn, x, y, params, learning_rate):
    # Record the forward pass so the tape can differentiate through it
    with tf.GradientTape() as tape:
        scores = model_fn(x, params) # Forward pass of the model
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=scores)
        total_loss = tf.reduce_mean(loss)

    # The gradient computation and weight updates do not need to be recorded,
    # so they happen outside the tape context
    grad_params = tape.gradient(total_loss, params)

    # Make a vanilla gradient descent step on all of the model parameters
    # Manually update the weights using assign_sub()
    for w, grad_w in zip(params, grad_params):
        w.assign_sub(learning_rate * grad_w)

    return total_loss

In [0]:
def train_part2(model_fn, init_fn, learning_rate):
    """
    Train a model on CIFAR-10.
    
    Inputs:
    - model_fn: A Python function that performs the forward pass of the model
      using TensorFlow; it should have the following signature:
      scores = model_fn(x, params) where x is a TensorFlow Tensor giving a
      minibatch of image data, params is a list of TensorFlow Tensors holding
      the model weights, and scores is a TensorFlow Tensor of shape (N, C)
      giving scores for all elements of x.
    - init_fn: A Python function that initializes the parameters of the model.
      It should have the signature params = init_fn() where params is a list
      of TensorFlow Tensors holding the (randomly initialized) weights of the
      model.
    - learning_rate: Python float giving the learning rate to use for SGD.
    """
    
    
    params = init_fn()  # Initialize the model parameters            
        
    for t, (x_np, y_np) in enumerate(train_dset):
        # Run one training step on a batch of training data.
        loss = training_step(model_fn, x_np, y_np, params, learning_rate)
        
        # Periodically print the loss and check accuracy on the val set.
        if t % print_every == 0:
            print('Iteration %d, loss = %.4f' % (t, loss))
            check_accuracy(val_dset, x_np, model_fn, params)

In [0]:
def check_accuracy(dset, x, model_fn, params):
    """
    Check accuracy on a classification model, e.g. for validation.
    
    Inputs:
    - dset: A Dataset object against which to check accuracy
    - x: Not used by this function; the evaluation batches come from dset
    - model_fn: the Model we will be calling to make predictions on x
    - params: parameters for the model_fn to work with
      
    Returns: Nothing, but prints the accuracy of the model
    """
    num_correct, num_samples = 0, 0
    for x_batch, y_batch in dset:
        scores_np = model_fn(x_batch, params).numpy()
        y_pred = scores_np.argmax(axis=1)
        num_samples += x_batch.shape[0]
        num_correct += (y_pred == y_batch).sum()
    acc = float(num_correct) / num_samples
    print('Got %d / %d correct (%.2f%%)' % (num_correct, num_samples, 100 * acc))

### Barebones TensorFlow: Initialization
We'll use the following utility method to initialize the weight matrices for our models using Kaiming (He) normal initialization [1].

[1] He et al., *Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification*, ICCV 2015, https://arxiv.org/abs/1502.01852
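As a rough numerical check of the idea, the NumPy sketch below draws weights with standard deviation sqrt(2 / fan_in) and confirms that a ReLU layer then keeps the mean squared activation near 1 for unit-variance inputs. The layer sizes and the random seed are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 256     # arbitrary layer sizes for the illustration

# Kaiming/He normal: zero-mean Gaussian with std = sqrt(2 / fan_in).
w = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

# Feed unit-variance inputs through a ReLU layer with these weights.
x = rng.standard_normal((1000, fan_in))
h = np.maximum(x @ w, 0.0)

empirical_std = w.std()
target_std = np.sqrt(2.0 / fan_in)
mean_sq_activation = (h ** 2).mean()

print(empirical_std, target_std)   # the two values should nearly match
print(mean_sq_activation)          # should be close to 1
```

The factor of 2 compensates for ReLU zeroing out roughly half of the pre-activations, so the signal scale stays about the same from layer to layer.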

In [0]:
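def create_matrix_with_kaiming_normal(shape):
    if len(shape) == 2:
        # Fully-connected weights: (fan_in, fan_out)
        fan_in, fan_out = shape[0], shape[1]
    elif len(shape) == 4:
        # Convolutional weights: (height, width, in_channels, out_channels)
        fan_in, fan_out = np.prod(shape[:3]), shape[3]
    else:
        raise ValueError('Expected a shape of length 2 or 4, got %r' % (shape,))
    return tf.keras.backend.random_normal(shape) * np.sqrt(2.0 / fan_in)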

### Barebones TensorFlow: Train a Two-Layer Network
We are finally ready to use all of the pieces defined above to train a two-layer fully-connected network on CIFAR-10.

We just need to define a function to initialize the weights of the model, and call `train_part2`.

Defining the weights of the network introduces another important piece of TensorFlow API: `tf.Variable`. A TensorFlow Variable holds a Tensor whose value persists across calls; unlike constant Tensors created with `tf.zeros` or `tf.random.normal`, the value of a Variable can be mutated in place (for example with `assign_sub`), and these mutations persist across training steps. Learnable parameters of the network are usually stored in Variables.
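A minimal sketch of an in-place update on a Variable, assuming a made-up gradient tensor `grad` and learning rate:

```python
import tensorflow as tf

w = tf.Variable([[1.0, 2.0],
                 [3.0, 4.0]])              # mutable state that persists between calls
grad = tf.constant([[0.5, 0.5],
                    [0.5, 0.5]])           # made-up stand-in for a gradient
learning_rate = 0.1

# In-place update: w <- w - learning_rate * grad
w.assign_sub(learning_rate * grad)

w_after = w.numpy()
print(w_after)
```

Each entry of `w` decreases by 0.05, and the new values stay in `w` for any later computation that reads it.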

Without any hyperparameter tuning, you should achieve validation accuracies above 40% after one epoch of training.

In [0]:
def two_layer_fc_init():
    """
    Initialize the weights of a two-layer network, for use with the
    two_layer_fc function defined above. 
    You can use the `create_matrix_with_kaiming_normal` helper!
    
    Inputs: None
    
    Returns: A list of:
    - w1: TensorFlow tf.Variable giving the weights for the first layer
    - w2: TensorFlow tf.Variable giving the weights for the second layer
    """
    hidden_layer_size = 4000
    w1 = tf.Variable(create_matrix_with_kaiming_normal((3 * 32 * 32, hidden_layer_size)))
    w2 = tf.Variable(create_matrix_with_kaiming_normal((hidden_layer_size, 10)))
    return [w1, w2]

learning_rate = 1e-2
# train_part2(two_layer_fc, two_layer_fc_init, learning_rate)

# Part III: Keras Model Subclassing API

Implementing a neural network using the low-level TensorFlow API is a good way to understand how TensorFlow works, but it's a little inconvenient: we had to manually keep track of all Tensors holding learnable parameters. This was fine for a small network, but could quickly become unwieldy for a large complex model.

Fortunately TensorFlow 2.0 provides higher-level APIs such as `tf.keras` which make it easy to build models out of modular, object-oriented layers. Further, TensorFlow 2.0 uses eager execution that evaluates operations immediately, without explicitly constructing any computational graphs. This makes it easy to write and debug models, and reduces the boilerplate code.

In this part of the notebook we will define neural network models using the `tf.keras.Model` API. To implement your own model, you need to do the following:

1. Define a new class which subclasses `tf.keras.Model`. Give your class an intuitive name that describes it, like `TwoLayerFC` or `ThreeLayerConvNet`.
2. In the initializer `__init__()` for your new class, define all the layers you need as class attributes. The `tf.keras.layers` package provides many common neural-network layers, like `tf.keras.layers.Dense` for fully-connected layers and `tf.keras.layers.Conv2D` for convolutional layers. Under the hood, these layers will construct `Variable` Tensors for any learnable parameters. **Warning**: Don't forget to call `super(YourModelName, self).__init__()` as the first line in your initializer!
3. Implement the `call()` method for your class; this implements the forward pass of your model, and defines the *connectivity* of your network. Layers defined in `__init__()` implement `__call__()` so they can be used as function objects that transform input Tensors into output Tensors. Don't define any new layers in `call()`; any layers you want to use in the forward pass should be defined in `__init__()`.

After you define your `tf.keras.Model` subclass, you can instantiate it and use it like the model functions from Part II.

### Keras Model Subclassing API: Two-Layer Network

Here is a concrete example of using the `tf.keras.Model` API to define a two-layer network. There are a few new bits of API to be aware of here:

We use an `Initializer` object to set up the initial values of the learnable parameters of the layers; in particular `tf.initializers.VarianceScaling` gives behavior similar to the Kaiming initialization method we used in Part II. You can read more about it here: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/initializers/VarianceScaling

We construct `tf.keras.layers.Dense` objects to represent the two fully-connected layers of the model. In addition to multiplying their input by a weight matrix and adding a bias vector, these layers can also apply a nonlinearity for you. For the first layer we specify a ReLU activation function by passing `activation='relu'` to the constructor; the second layer uses a softmax activation function. Finally, we use `tf.keras.layers.Flatten` to flatten the input images before they reach the first fully-connected layer.

In [0]:
class TwoLayerFC(tf.keras.Model):
    def __init__(self, hidden_size, num_classes):
        super(TwoLayerFC, self).__init__()        
        initializer = tf.initializers.VarianceScaling(scale=2.0)
        self.fc1 = tf.keras.layers.Dense(hidden_size, activation='relu',
                                   kernel_initializer=initializer)
        self.fc2 = tf.keras.layers.Dense(num_classes, activation='softmax',
                                   kernel_initializer=initializer)
        self.flatten = tf.keras.layers.Flatten()
    
    def call(self, x, training=False):
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.fc2(x)
        return x


def test_TwoLayerFC():
    """ A small unit test to exercise the TwoLayerFC model above. """
    input_size, hidden_size, num_classes = 50, 42, 10
    x = tf.zeros((64, input_size))
    model = TwoLayerFC(hidden_size, num_classes)
    with tf.device(device):
        scores = model(x)
        print(scores.shape)
        
# test_TwoLayerFC()

### Keras Model Subclassing API: Eager Training

While Keras models have a built-in training loop (using `model.fit`), sometimes you need more customization. Here's an example of a training loop implemented with eager execution.

In particular, notice `tf.GradientTape`. Automatic differentiation is the mechanism that frameworks like TensorFlow use to implement backpropagation. During eager execution, `tf.GradientTape` records operations so that gradients can be computed later. By default, a `tf.GradientTape` can compute only one set of gradients; a second call to `gradient()` on the same tape raises a runtime error unless the tape was created with `persistent=True`.
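The single-use behavior can be seen in a small sketch. The scalar variable `x` and the functions differentiated here are arbitrary choices for the illustration.

```python
import tensorflow as tf

x = tf.Variable(3.0)

# Non-persistent tape: its resources are released after the first gradient() call.
with tf.GradientTape() as tape:
    y = x * x
dy_dx = tape.gradient(y, x)              # d(x^2)/dx = 2x = 6.0

second_call_failed = False
try:
    tape.gradient(y, x)                  # second call on the same tape
except RuntimeError:
    second_call_failed = True

# persistent=True allows several gradient() calls on one tape.
with tf.GradientTape(persistent=True) as ptape:
    y = x * x
    z = y * x                            # z = x^3
dy = ptape.gradient(y, x)                # 2x = 6.0
dz = ptape.gradient(z, x)                # 3x^2 = 27.0
del ptape                                # release the tape's resources explicitly

print(float(dy_dx), second_call_failed, float(dy), float(dz))
```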

TensorFlow 2.0 ships with easy-to-use built-in metrics under the `tf.keras.metrics` module. Each metric is an object, and we can use `update_state()` to add observations and `reset_states()` to clear all observations. We can get the current result of a metric by calling `result()` on the metric object.

In [0]:
def train_part34(model_init_fn, optimizer_init_fn, num_epochs=1, is_training=False):
    """
    Simple training loop for use with models defined using tf.keras. It trains
    a model for num_epochs epochs on the CIFAR-10 training set and periodically
    checks accuracy on the CIFAR-10 validation set.
    
    Inputs:
    - model_init_fn: A function that takes no parameters; when called it
      constructs the model we want to train: model = model_init_fn()
    - optimizer_init_fn: A function which takes no parameters; when called it
      constructs the Optimizer object we will use to optimize the model:
      optimizer = optimizer_init_fn()
    - num_epochs: The number of epochs to train for
    - is_training: Boolean passed as the `training` argument when the model is
      called on training batches
    
    Returns: Nothing, but prints progress during training
    """    
    with tf.device(device):

        # Compute the loss like we did in Part II
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
        
        model = model_init_fn()
        optimizer = optimizer_init_fn()
        
        train_loss = tf.keras.metrics.Mean(name='train_loss')
        train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
    
        val_loss = tf.keras.metrics.Mean(name='val_loss')
        val_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='val_accuracy')
        
        t = 0
        for epoch in range(num_epochs):
            
            # Reset the metrics - https://www.tensorflow.org/alpha/guide/migration_guide#new-style_metrics
            train_loss.reset_states()
            train_accuracy.reset_states()
            
            for x_np, y_np in train_dset:
                with tf.GradientTape() as tape:
                    
                    # Use the model function to build the forward pass.
                    scores = model(x_np, training=is_training)
                    loss = loss_fn(y_np, scores)
      
                    gradients = tape.gradient(loss, model.trainable_variables)
                    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
                    
                    # Update the metrics
                    train_loss.update_state(loss)
                    train_accuracy.update_state(y_np, scores)
                    
                    if t % print_every == 0:
                        val_loss.reset_states()
                        val_accuracy.reset_states()
                        for test_x, test_y in val_dset:
                            # During validation, training is set to False
                            prediction = model(test_x, training=False)
                            t_loss = loss_fn(test_y, prediction)

                            val_loss.update_state(t_loss)
                            val_accuracy.update_state(test_y, prediction)
                        
                        template = 'Iteration {}, Epoch {}, Loss: {}, Accuracy: {}, Val Loss: {}, Val Accuracy: {}'
                        print (template.format(t, epoch+1,
                                             train_loss.result(),
                                             train_accuracy.result()*100,
                                             val_loss.result(),
                                             val_accuracy.result()*100))
                    t += 1

                

### Keras Model Subclassing API: Train a Two-Layer Network
We can now use the tools defined above to train a two-layer network on CIFAR-10. We define the `model_init_fn` and `optimizer_init_fn` that construct the model and optimizer respectively when called. Here we want to train the model using stochastic gradient descent with no momentum, so we construct a `tf.keras.optimizers.SGD` object; you can [read about it here](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/optimizers/SGD).

Without any hyperparameter tuning, you should achieve validation accuracies above 40% after one epoch of training.

In [0]:
hidden_size, num_classes = 4000, 10
learning_rate = 1e-2

def model_init_fn():
    return TwoLayerFC(hidden_size, num_classes)

def optimizer_init_fn():
    return tf.keras.optimizers.SGD(learning_rate=learning_rate)

# train_part34(model_init_fn, optimizer_init_fn)

# Part IV: Keras Sequential API
In Part III we introduced the `tf.keras.Model` API, which allows you to define models with any number of learnable layers and with arbitrary connectivity between layers.

However for many models you don't need such flexibility - a lot of models can be expressed as a sequential stack of layers, with the output of each layer fed to the next layer as input. If your model fits this pattern, then there is an even easier way to define your model: using `tf.keras.Sequential`. You don't need to write any custom classes; you simply call the `tf.keras.Sequential` constructor with a list containing a sequence of layer objects.

One complication with `tf.keras.Sequential` is that you must define the shape of the input to the model by passing a value to the `input_shape` of the first layer in your model.

### Keras Sequential API: Two-Layer Network
In this subsection, we will rewrite the two-layer fully-connected network using `tf.keras.Sequential`, and train it using the training loop defined above.

Without any hyperparameter tuning, you should see validation accuracies above 40% after training for one epoch.

In [0]:
learning_rate = 1e-2

def model_init_fn():
    input_shape = (32, 32, 3)
    hidden_layer_size, num_classes = 4000, 10
    initializer = tf.initializers.VarianceScaling(scale=2.0)
    layers = [
        tf.keras.layers.Flatten(input_shape=input_shape),
        tf.keras.layers.Dense(hidden_layer_size, activation='relu',
                              kernel_initializer=initializer),
        tf.keras.layers.Dense(num_classes, activation='softmax', 
                              kernel_initializer=initializer),
    ]
    model = tf.keras.Sequential(layers)
    return model

def optimizer_init_fn():
    return tf.keras.optimizers.SGD(learning_rate=learning_rate) 

# train_part34(model_init_fn, optimizer_init_fn)

### Abstracting Away the Training Loop
In the previous examples, we used a customised training loop to train models (e.g. `train_part34`). Writing your own training loop is only required if you need more flexibility and control while training your model. Alternatively, you can use built-in APIs like `tf.keras.Model.fit()` and `tf.keras.Model.evaluate()` to train and evaluate a model. Also remember to configure your model for training by calling `tf.keras.Model.compile()`.

Without any hyperparameter tuning, you should see validation and test accuracies above 42% after training for one epoch.

In [0]:
# model = model_init_fn()
# model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
#               loss='sparse_categorical_crossentropy',
#               metrics=[tf.keras.metrics.sparse_categorical_accuracy])
# model.fit(X_train, y_train, batch_size=64, epochs=1, validation_data=(X_val, y_val))
# model.evaluate(X_test, y_test)

##  Part IV: Functional API
### Demonstration with a Two-Layer Network 

In the previous section, we saw how we can use `tf.keras.Sequential` to stack layers to quickly build simple models. But this comes at the cost of losing flexibility.

Often we will have to write complex models that have non-sequential data flows: a layer can have **multiple inputs and/or outputs**, such as stacking the output of 2 previous layers together to feed as input to a third! (Some examples are residual connections and dense blocks.)

In such cases, we can use Keras functional API to write models with complex topologies such as:

 1. Multi-input models
 2. Multi-output models
 3. Models with shared layers (the same layer called several times)
 4. Models with non-sequential data flows (e.g. residual connections)

Writing a model with Functional API requires us to create a `tf.keras.Model` instance and explicitly write input tensors and output tensors for this model. 

In [0]:
def two_layer_fc_functional(input_shape, hidden_size, num_classes):  
    initializer = tf.initializers.VarianceScaling(scale=2.0)
    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(hidden_size, activation='relu',
                                 kernel_initializer=initializer)(flattened_inputs)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer)(fc1_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model

def test_two_layer_fc_functional():
    """ A small unit test to exercise the TwoLayerFC model above. """
    input_size, hidden_size, num_classes = 50, 42, 10
    input_shape = (50,)
    
    x = tf.zeros((64, input_size))
    model = two_layer_fc_functional(input_shape, hidden_size, num_classes)
    
    with tf.device(device):
        scores = model(x)
        print(scores.shape)
        
# test_two_layer_fc_functional()

### Keras Functional API: Train a Two-Layer Network
You can now train this two-layer network constructed using the functional API.

Without any hyperparameter tuning, you should see validation accuracies above 40% after training for one epoch.

In [0]:
input_shape = (32, 32, 3)
hidden_size, num_classes = 4000, 10
learning_rate = 1e-2

def model_init_fn():
    return two_layer_fc_functional(input_shape, hidden_size, num_classes)

def optimizer_init_fn():
    return tf.keras.optimizers.SGD(learning_rate=learning_rate)

# train_part34(model_init_fn, optimizer_init_fn)

# Part V: Tuning
In this section, you are asked to experiment with different dense/fully connected architectures, activation functions, weight initializations, hyperparameters, optimizers, and regularization approaches to train models on the CIFAR-10 dataset. You can use the built-in train function, the `train_part34` function from above, or implement your own training loop.

Describe what you did at the end of the notebook.

### Things to experiment with:
- **Network architectures**: The network above has two layers of trainable parameters. Can you do better with a deeper network? Or maybe with a wider network? Try five different architectures and observe the performance on the validation data. Use the architectures in combination with other hyperparameters, as outlined below. Discuss your findings.
- **Activation functions**: In your networks, use five different activation functions, such as ReLU, leaky ReLU, parametric ReLU, ELU, MaxOut, or tanh to gain practical insights into their ability to improve accuracy. 

```==> Try 30 different models - combination of [Network Architecture * Activation Functions]```
- **Weight initialization**: Corresponding to your activation functions, use different weight initialization schemes. Discuss your findings. What happens if you use the zero_weight initialization? 

```==> On top 3 models try various weight initializations: uniform, lecun_uniform, normal, zero, glorot_normal, glorot_uniform, he_normal, he_uniform```

- **Batch normalization**: Try adding batch normalization. Do your networks train faster? Does the accuracy improve?


- **Optimizers**: Use different optimizers, including SGD, SGD with momentum, RMSprop and Adam. Use the optimizers with and without batch normalization to observe what optimizers benefit more from batch normalization, or different weight initializations schemes and what optimizers are more robust to initialization/normalization. 

```==> On the top two models of the previous top three compare performance with/without batch normalization - does it need to be deeper network? + combination with optimizers  [SGD, SGD with momentum, RMSprop and Adam] =10```
- **Regularization**: Compare L2 weight regularization, with dropout, batch normalization, and data augmentation. Discuss your findings.

```==> On the best working model from the previous tries, compare [L2 weight regularization][batch normalization][data augmentation = add noise to the images]```  
- **Model Ensemble**: Construct a model ensemble using some of your best hyperparameters as identified before, and compare the accuracy of the model ensemble with the accuracy of your best individual model (based on the validation dataset). 

```==> Pick the best one and show results```
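
One common way to build such an ensemble is to average the class probabilities predicted by several models and take the argmax. The sketch below is framework-free and uses made-up probabilities for three hypothetical models (`model_a`, `model_b`, `model_c`); with Keras models, these lists would instead come from the softmax outputs.

```python
def ensemble_predict(prob_sets):
    """Average per-model class probabilities and return predicted labels.

    prob_sets: list over models; each entry is a list over examples,
    and each example is a list of class probabilities.
    """
    num_models = len(prob_sets)
    num_examples = len(prob_sets[0])
    predictions = []
    for i in range(num_examples):
        num_classes = len(prob_sets[0][i])
        # Mean probability of each class across all models.
        mean_probs = [
            sum(model_probs[i][c] for model_probs in prob_sets) / num_models
            for c in range(num_classes)
        ]
        predictions.append(max(range(num_classes), key=mean_probs.__getitem__))
    return predictions


# Hypothetical softmax outputs of three models on two examples (3 classes).
model_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
model_b = [[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]]
model_c = [[0.5, 0.2, 0.3], [0.2, 0.2, 0.6]]

ensemble_labels = ensemble_predict([model_a, model_b, model_c])
print(ensemble_labels)  # [0, 2]
```

Note that `model_b` alone would label the first example as class 1, while the averaged prediction is class 0; this smoothing of individual mistakes is where an ensemble can gain accuracy.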

### NOTE: Batch Normalization / Dropout
When you are using Batch Normalization and Dropout, remember to pass `is_training=True` if you use the `train_part34()` function. BatchNorm and Dropout layers have different behaviors at training and inference time. `training` is a specific keyword argument reserved for this purpose in any `tf.keras.Model`'s `call()` function. Read more about this here:
- https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/BatchNormalization#methods
- https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/Dropout#methods
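
To make the training/inference difference concrete, here is a minimal framework-free sketch of inverted dropout, the scheme `tf.keras.layers.Dropout` documents: during training, units are zeroed with probability `rate` and the survivors are scaled by `1 / (1 - rate)`; at inference the input passes through unchanged. The function name `dropout` and its `rng` argument are illustrative and not part of any library.

```python
import random


def dropout(values, rate, training, rng=random):
    """Inverted dropout on a list of activations."""
    if not training:
        # Inference: identity, no units are dropped.
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    # Training: drop each unit with probability `rate`, rescale the rest.
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
eval_out = dropout(activations, rate=0.5, training=False)
print(train_out)  # each entry is either 0.0 or twice the input
print(eval_out)   # identical to the input
```

This is why forgetting the `training` flag matters: a model evaluated in training mode would keep dropping units at random and report noisier, usually lower, accuracy.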

### Tips for training
For each network architecture that you try, you should tune the learning rate and other hyperparameters. When doing this there are a couple important things to keep in mind: 

- If the parameters are working well, you should see improvement within a few hundred iterations
- Remember the coarse-to-fine approach for hyperparameter tuning: start by testing a large range of hyperparameters for just a few training iterations to find the combinations of parameters that are working at all.
- Once you have found some sets of parameters that seem to work, search more finely around these parameters. You may need to train for more epochs.
- You should use the validation set for hyperparameter search, and save your test set for evaluating your architecture on the best parameters as selected by the validation set.
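
The coarse-to-fine idea can be sketched as a small random search that samples a hyperparameter on a log scale and then narrows the range around the best value found. Everything here is an assumption for illustration: the helper names, the factor of 3 used for narrowing, and `fake_validation_accuracy`, which stands in for "train for a few iterations and return the validation accuracy".

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


def coarse_to_fine(evaluate, low, high, rounds=2, trials=10, seed=0):
    """Random search over one hyperparameter, narrowing the range each round."""
    rng = random.Random(seed)
    best_value, best_score = None, float("-inf")
    for _ in range(rounds):
        for _ in range(trials):
            value = sample_log_uniform(low, high, rng)
            score = evaluate(value)  # e.g. validation accuracy after a short run
            if score > best_score:
                best_value, best_score = value, score
        # Search more finely: shrink the range to a factor of 3 around the best.
        low, high = best_value / 3.0, best_value * 3.0
    return best_value, best_score


# Stand-in for a short training run; its score peaks at a learning rate of 1e-2.
def fake_validation_accuracy(learning_rate):
    return -abs(math.log10(learning_rate) + 2.0)


best_lr, best_score = coarse_to_fine(fake_validation_accuracy, 1e-5, 1e0)
print(best_lr, best_score)
```

Sampling on a log scale is the usual choice for learning rates and regularization strengths, because their useful values span several orders of magnitude.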



# Here starts my CODE



## Modified train_part34() function

In [0]:

# Load the TensorBoard notebook extension
%load_ext tensorboard
# Clear any logs from previous runs
!rm -rf ./logs/ 

# Needed by the optional noise augmentation in train_model
import cv2
import numpy as np

print_every = 700

def train_model(model, optimizer, num_epochs=10, noise=0, is_training=False):
    """
    Simple training loop for use with models defined using tf.keras. It trains
    a model for num_epochs epochs on the CIFAR-10 training set and periodically
    checks accuracy on the CIFAR-10 validation set.
    
    Inputs:
    - model: the model we want to train
    - optimizer: the Optimizer object we will use to optimize the model
    - num_epochs: the number of epochs to train for
    - noise: if nonzero, add random noise to each training batch (data augmentation)
    - is_training: value passed as `training` to the model during training steps
    Returns: the trained model; progress is printed during training
    """    
    with tf.device(device):

        # Compute the loss like we did in Part II
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
        
        model = model
        optimizer = optimizer
        
        train_loss = tf.keras.metrics.Mean(name='train_loss')
        train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
    
        val_loss = tf.keras.metrics.Mean(name='val_loss')
        val_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='val_accuracy')
        
        t = 0
        for epoch in range(num_epochs):
            
            # Reset the metrics - https://www.tensorflow.org/alpha/guide/migration_guide#new-style_metrics
            train_loss.reset_states()
            train_accuracy.reset_states()
            
            for x_np, y_np in train_dset:
                if noise != 0:
                    # Data augmentation: add random noise to the batch.
                    # Note that the noise buffer is uint8, so the samples
                    # are stored as non-negative integers.
                    im = np.zeros(x_np.shape, np.uint8)
                    im = cv2.randn(im, (0), (1))
                    x_np = x_np + im

                with tf.GradientTape() as tape:
                    # Use the model function to build the forward pass.
                    scores = model(x_np, training=is_training)
                    loss = loss_fn(y_np, scores)

                # Only the forward pass needs to be recorded on the tape.
                gradients = tape.gradient(loss, model.trainable_variables)
                optimizer.apply_gradients(zip(gradients, model.trainable_variables))

                # Update the metrics
                train_loss.update_state(loss)
                train_accuracy.update_state(y_np, scores)

                if t % print_every == 0:
                    val_loss.reset_states()
                    val_accuracy.reset_states()
                    for test_x, test_y in val_dset:
                        # Validation always runs in inference mode.
                        prediction = model(test_x, training=False)
                        t_loss = loss_fn(test_y, prediction)

                        val_loss.update_state(t_loss)
                        val_accuracy.update_state(test_y, prediction)

                    template = 'Iteration {}, Epoch {}, Loss: {}, Accuracy: {}, Val Loss: {}, Val Accuracy: {}'
                    print(template.format(t, epoch+1,
                                          train_loss.result(),
                                          train_accuracy.result()*100,
                                          val_loss.result(),
                                          val_accuracy.result()*100))
                t += 1
    return model

In [0]:
# Code inspired by tf2.0 documentation example on https://www.tensorflow.org/tensorboard/hyperparameter_tuning_with_hparams

import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

device = '/device:GPU:0'   

learning_rate = 1e-2
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)

## Process of finding the right hyperparameters and network architecture

### Two Layers



My first network architecture is a two-layer network, first tested with hidden sizes 2000 and 4000.
For both sizes the best performance came from the ReLU activation function (ELU was second), and 4000 performed better than 2000:

> **relu 2000**: Iteration 7000, Epoch 10, Loss: 0.8704050183296204, Accuracy: 72.1962661743164, Val Loss: 1.4491089582443237, Val Accuracy: 52.39999771118164

> **relu 4000**: Iteration 7000, Epoch 10, Loss: 0.7758519053459167, Accuracy: 75.37967681884766, Val Loss: 1.4674423933029175, Val Accuracy: 54.10000228881836

So I decided to try an even bigger network with 8000 hidden units:

 >**relu 8000**: Iteration 7000, Epoch 10, Loss: 0.6558446884155273, Accuracy: 80.91413116455078, Val Loss: 1.5722764730453491, Val Accuracy: 50.30000305175781

With the larger network I saw overfitting to the training data, and it did not generalize well: training accuracy rose to about 81% while validation accuracy dropped to 50.3%.
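
One simple way to quantify this is the gap between training and validation accuracy. The short sketch below uses the final-epoch numbers quoted above, rounded to two decimals; the variable names are mine.

```python
# Final-epoch (train accuracy, validation accuracy) in percent, copied from
# the three ReLU runs quoted above and rounded to two decimals.
results = {
    2000: (72.20, 52.40),
    4000: (75.38, 54.10),
    8000: (80.91, 50.30),
}

# Generalization gap: how much better the model does on data it has seen.
gaps = {units: round(train - val, 2) for units, (train, val) in results.items()}
best_units = max(results, key=lambda units: results[units][1])

for units in sorted(results):
    print(f"hidden size {units}: gap {gaps[units]:.2f} points")
print("best validation accuracy at hidden size", best_units)
```

The gap grows with the hidden size (roughly 20, 21 and 31 points), while validation accuracy peaks at 4000 units, which is the pattern one would expect from increasing overfitting.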




In [0]:
'''
Definition and testing of a Two Layer network 
with different sizes and activation functions

'''

def two_layer(hparams):  
    input_shape = (32, 32, 3)
    num_classes=10

    initializer = tf.initializers.VarianceScaling(scale=2.0)
    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(hparams[HP_NUM_UNITS], hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer)(flattened_inputs)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer)(fc1_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model




num_epochs = 10

def test_two_layer_net():
    for num_units in HP_NUM_UNITS.domain.values:
        for activation in HP_ACTIVATION.domain.values:
            # define hyper params and model
            hparams = {
                HP_NUM_UNITS: num_units,
                HP_ACTIVATION: activation
            }

            hp.hparams(hparams)
            model = two_layer(hparams)

            # Log current performance
            print('--- Starting trial for a new model ---')
            print({h.name: hparams[h] for h in hparams})

            train_model(model, optimizer, num_epochs=num_epochs, is_training=True)

def test_network():
    # two_layer() and test_two_layer_net() look these names up as globals,
    # so they must be rebound at module level rather than as local variables.
    global HP_ACTIVATION, HP_NUM_UNITS

    HP_ACTIVATION = hp.HParam('activation', hp.Discrete(['relu', 'elu', 'selu','tanh']))
    HP_NUM_UNITS = hp.HParam('units', hp.Discrete([4000, 2000]))
    test_two_layer_net()

    HP_ACTIVATION = hp.HParam('activation', hp.Discrete(['relu']))
    HP_NUM_UNITS = hp.HParam('units', hp.Discrete([8000]))
    test_two_layer_net()

# test_network()

### Three Layers

In the second model I tried all combinations of large and small hidden sizes with various activation functions. Again ReLU and ELU had the best results, so I experimented with different hidden sizes for them. Here are some interesting results; most configurations perform at around the same level on the validation data.


> **relu 2000 -> 4000** Iteration 7000, Epoch 10, Loss: 0.4351615905761719, Accuracy: 89.92406463623047, Val Loss: 1.4986521005630493, Val Accuracy: 54.79999923706055


> **elu 4000 -> 2000** 

Iteration 5600, Epoch 8, Loss: 0.7979512810707092, Accuracy: 74.05204010009766, Val Loss: 1.3630650043487549, Val Accuracy: 55.0

Iteration 6300, Epoch 9, Loss: 0.7180293798446655, Accuracy: 77.06828308105469, Val Loss: 1.4294097423553467, Val Accuracy: 54.400001525878906

Iteration 7000, Epoch 10, Loss: 0.6495599746704102, Accuracy: 79.65829467773438, Val Loss: 1.474124789237976, Val Accuracy: 53.70000076293945

> **relu 4000 -> 4000** Iteration 7000, Epoch 10, Loss: 0.3278921842575073, Accuracy: 93.93983459472656, Val Loss: 1.4880443811416626, Val Accuracy: 54.10000228881836

> **elu 600 -> 300**  Iteration 7000, Epoch 10, Loss: 0.9017217755317688, Accuracy: 69.3049087524414, Val Loss: 1.4206169843673706, Val Accuracy: 53.500003814697266

In [0]:
'''
Definition and testing of a Three layer network 
with different sizes and activation functions

'''

HP_ACTIVATION = hp.HParam('activation', hp.Discrete(['relu', 'elu']))
HP_NUM_UNITS_1 = hp.HParam('hidden_1', hp.Discrete([4000, 2000]))
HP_NUM_UNITS_2 = hp.HParam('hidden_2', hp.Discrete([4000, 2000]))

# HP_ACTIVATION = hp.HParam('activation', hp.Discrete(['elu']))
# HP_NUM_UNITS_1 = hp.HParam('hidden_1', hp.Discrete([2000]))
# HP_NUM_UNITS_2 = hp.HParam('hidden_2', hp.Discrete([1000]))
num_epochs = 10

def three_layer(hparams, h2_size):  
    input_shape = (32, 32, 3)
    num_classes=10

    initializer = tf.initializers.VarianceScaling(scale=2.0)
    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(hparams[HP_NUM_UNITS_1], hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer)(flattened_inputs)
    fc2_output = tf.keras.layers.Dense(h2_size, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer)(fc1_output)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer)(fc2_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model


def test_three_layer_net():
    for num_units_1 in HP_NUM_UNITS_1.domain.values:
        for num_units_2 in HP_NUM_UNITS_2.domain.values:
            for activation in HP_ACTIVATION.domain.values:
                # define hyper params and model
                hparams = {
                    HP_NUM_UNITS_1: num_units_1,
                    HP_ACTIVATION: activation
                }

                hp.hparams(hparams)
                model = three_layer(hparams, num_units_2)
                
                # Log current performance
                print('--- Starting trial for a new model ---')
                print({h.name: hparams[h] for h in hparams})
                print(f"**{activation} {num_units_1} -> {num_units_2}**")
                
                train_model(model, optimizer, num_epochs=num_epochs, is_training=True)

def test_network():
    # HP_ACTIVATION = hp.HParam('activation', hp.Discrete(['relu', 'elu', 'selu','tanh']))
    # HP_NUM_UNITS_1 = hp.HParam('hidden_1', hp.Discrete([4000, 2000]))
    # HP_NUM_UNITS_2 = hp.HParam('hidden_2', hp.Discrete([4000, 2000]))
    # test_three_layer_net()


    test_three_layer_net()
# test_network()

### Four Layers

Experiments with three hidden layers did not turn out quite as well, but were not far from the best results.

Here are examples of different architectures, to see whether some configuration would perform a lot better than others (again ReLU had the best results).


> **relu 4000 -> 2000 --> 1000** Iteration 7000, Epoch 10, Loss: 0.2371554970741272, Accuracy: 95.76518249511719, Val Loss: 1.6421884298324585, Val Accuracy: 53.10000228881836

> **relu 2000 -> 1000 --> 500** Iteration 7000, Epoch 10, Loss: 0.42495739459991455, Accuracy: 89.39836883544922, Val Loss: 1.5798757076263428, Val Accuracy: 52.10000228881836

> **relu 2000 -> 2500 --> 2000** Iteration 7000, Epoch 10, Loss: 0.2751586139202118, Accuracy: 94.88902282714844, Val Loss: 1.6353516578674316, Val Accuracy: 51.89999771118164

> **relu 1000 -> 2000 --> 4000** Iteration 7000, Epoch 10, Loss: 0.33775943517684937, Accuracy: 92.59637451171875, Val Loss: 1.6860527992248535, Val Accuracy: 51.20000076293945
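
To put these hidden sizes in perspective, the sketch below counts the trainable parameters (weights plus biases) of each fully connected configuration listed above. The helper `dense_param_count` is my own illustration, not a Keras function; for a real model, `model.summary()` reports the same totals.

```python
def dense_param_count(layer_sizes):
    """Trainable parameters (weights + biases) of a fully connected network.

    layer_sizes: [input_dim, hidden_1, ..., num_classes]
    """
    return sum(
        fan_in * fan_out + fan_out
        for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )


input_dim, num_classes = 32 * 32 * 3, 10  # flattened CIFAR-10 image, 10 labels
hidden_configs = [
    (4000, 2000, 1000),
    (2000, 1000, 500),
    (2000, 2500, 2000),
    (1000, 2000, 4000),
]

param_counts = {
    hidden: dense_param_count([input_dim, *hidden, num_classes])
    for hidden in hidden_configs
}
for hidden, count in param_counts.items():
    print(hidden, f"{count:,} parameters")
```

All of these networks have millions of parameters for a training set of at most 50,000 images, which is consistent with the very high training accuracies and the flat validation accuracy seen in the logs.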



In [0]:
'''
Definition and testing of a four layer network 
with different sizes and activation functions

'''

HP_ACTIVATION = hp.HParam('activation', hp.Discrete(['relu']))
HP_NUM_UNITS_1 = hp.HParam('hidden_1', hp.Discrete([1000]))
HP_NUM_UNITS_2 = hp.HParam('hidden_2', hp.Discrete([2000]))
HP_NUM_UNITS_3 = hp.HParam('hidden_3', hp.Discrete([4000]))

num_epochs = 10

def three_layer(hparams, h2_size, h3_size):  
    input_shape = (32, 32, 3)
    num_classes=10

    initializer = tf.initializers.VarianceScaling(scale=2.0)
    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(hparams[HP_NUM_UNITS_1], hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer)(flattened_inputs)
    fc2_output = tf.keras.layers.Dense(h2_size, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer)(fc1_output)
    fc3_output = tf.keras.layers.Dense(h3_size, hparams[HP_ACTIVATION],
                                kernel_initializer=initializer)(fc2_output)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer)(fc3_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model


def test_three_layer_net():
    for num_units_1 in HP_NUM_UNITS_1.domain.values:
        for num_units_2 in HP_NUM_UNITS_2.domain.values:
            for num_units_3 in HP_NUM_UNITS_3.domain.values:
                for activation in HP_ACTIVATION.domain.values:
                    # define hyper params and model
                    hparams = {
                        HP_NUM_UNITS_1: num_units_1,
                        HP_ACTIVATION: activation
                    }

                    hp.hparams(hparams)
                    model = three_layer(hparams, num_units_2, num_units_3)
                    
                    # Log current performance
                    print('--- Starting trial for a new model ---')
                    print({h.name: hparams[h] for h in hparams})
                    print(f"**{activation} {num_units_1} -> {num_units_2} --> {num_units_3}**")
                    
                    train_model(model, optimizer, num_epochs=num_epochs, is_training=True)

def test_network():
    # HP_ACTIVATION = hp.HParam('activation', hp.Discrete(['relu', 'elu', 'selu','tanh']))
    # HP_NUM_UNITS_1 = hp.HParam('hidden_1', hp.Discrete([4000, 2000]))
    # HP_NUM_UNITS_2 = hp.HParam('hidden_2', hp.Discrete([4000, 2000]))
    # test_three_layer_net()


    test_three_layer_net()
# test_network()

## Weight initialization:


 Corresponding to your activation functions, use different weight initialization schemes. Discuss your findings. 

Now I take the top 3 network architectures, which achieved the best results, and try them with various weight initializations and activation functions.

Those were:
 

**relu 4000**

Best results were achieved with:
> relu glorot_uniform 52.89999771118164

> relu glorot_normal 53.10000228881836

> relu lecun_uniform 53.60000228881836




**relu 2000 -> 4000**
> relu lecun_uniform 54.000003814697266

> relu lecun_normal 54.000003814697266

> relu glorot_uniform 54.20000076293945

**relu 4000 -> 2000 --> 1000**

> relu glorot_uniform 53.20000076293945

> relu lecun_uniform 53.70000076293945

> relu glorot_normal 53.89999771118164

*(Best results again with the ReLU activation function, and mostly with glorot_uniform/glorot_normal and lecun_uniform.)*
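
The schemes compared above differ mainly in how they scale the initial weights by a layer's fan-in and fan-out. The sketch below lists the scaling rules as I understand them from the Keras initializer documentation: uniform schemes draw from `[-limit, limit]`, normal schemes use the given standard deviation. The helper `init_scale` is hypothetical and only computes the scale; it does not sample weights.

```python
import math


def init_scale(name, fan_in, fan_out):
    """Return ('uniform', limit) or ('normal', stddev) for a named scheme."""
    schemes = {
        "glorot_uniform": ("uniform", math.sqrt(6.0 / (fan_in + fan_out))),
        "glorot_normal": ("normal", math.sqrt(2.0 / (fan_in + fan_out))),
        "he_uniform": ("uniform", math.sqrt(6.0 / fan_in)),
        "he_normal": ("normal", math.sqrt(2.0 / fan_in)),
        "lecun_uniform": ("uniform", math.sqrt(3.0 / fan_in)),
        "lecun_normal": ("normal", math.sqrt(1.0 / fan_in)),
    }
    return schemes[name]


# First layer of the two-layer network: 3072 inputs, 4000 hidden units.
fan_in, fan_out = 32 * 32 * 3, 4000
for name in ("glorot_uniform", "glorot_normal", "lecun_uniform", "he_normal"):
    kind, scale = init_scale(name, fan_in, fan_out)
    print(f"{name:15s} {kind:8s} scale={scale:.4f}")
```

A uniform distribution on `[-limit, limit]` has variance `limit**2 / 3`, so each uniform/normal pair targets the same weight variance. As far as I can tell, the `VarianceScaling(scale=2.0)` initializer used earlier in the notebook, with its default arguments, corresponds to the `he_normal` rule.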

### Results from the test with a two-layer network and zero weight initialization:
*(On the other networks the results were quite similar)* 

**elu zero**
Iteration 7000, Epoch 10, Loss: 2.3026206493377686, Accuracy: 9.59404182434082, Val Loss: 2.303201198577881, Val Accuracy: 7.800000190734863

**relu zero**
Iteration 7000, Epoch 10, Loss: 2.3026206493377686, Accuracy: 9.59404182434082, Val Loss: 2.303201198577881, Val Accuracy: 7.800000190734863

**selu zero**
Iteration 7000, Epoch 10, Loss: 2.3026206493377686, Accuracy: 9.59404182434082, Val Loss: 2.303201198577881, Val Accuracy: 7.800000190734863

**tanh zero**
Iteration 7000, Epoch 10, Loss: 2.3026206493377686, Accuracy: 9.59404182434082, Val Loss: 2.303201198577881, Val Accuracy: 7.800000190734863

**Q:** What happens if you use the zero_weight initialization?

**A:** 
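All the neurons were initialized with the same weight, so the forward pass is zero everywhere except for the bias (and Keras initializes `Dense` biases to zero by default, so at the start even that contributes nothing). Having the same initial weights for all neurons means they follow the same gradient and end up doing the same thing as one another, so there is no way for the network to improve. This matches the logs above: the loss stays at about 2.3026, which is ln(10), the loss of a uniform guess over 10 classes.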

*(My intuitive thought with what I know, how it should work)*
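
The intuition can be checked on a tiny example. The sketch below is a hand-written forward and backward pass for one example through a tanh hidden layer and a softmax output, with all weights set to zero. It is a simplified stand-in for the Keras models above (no hidden bias, pure Python, made-up input), not the code that produced the logs.

```python
import math


def forward_backward(x, label, w1, w2, b2):
    """One example through a tanh hidden layer and a softmax output.

    Returns (loss, grad_w1, grad_w2, grad_b2). The hidden bias is omitted
    to keep the sketch short.
    """
    hidden = [math.tanh(sum(xi * w1[i][j] for i, xi in enumerate(x)))
              for j in range(len(w1[0]))]
    logits = [sum(h * w2[j][k] for j, h in enumerate(hidden)) + b2[k]
              for k in range(len(b2))]
    exps = [math.exp(z - max(logits)) for z in logits]
    probs = [e / sum(exps) for e in exps]
    loss = -math.log(probs[label])

    # Softmax cross-entropy gradient with respect to the logits.
    d_logits = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    grad_b2 = d_logits
    grad_w2 = [[h * d for d in d_logits] for h in hidden]
    # Backpropagate through the output weights and the tanh nonlinearity.
    d_hidden = [sum(w2[j][k] * d_logits[k] for k in range(len(b2))) * (1.0 - h * h)
                for j, h in enumerate(hidden)]
    grad_w1 = [[xi * dh for dh in d_hidden] for xi in x]
    return loss, grad_w1, grad_w2, grad_b2


num_inputs, num_hidden, num_classes = 4, 3, 10
w1 = [[0.0] * num_hidden for _ in range(num_inputs)]
w2 = [[0.0] * num_classes for _ in range(num_hidden)]
b2 = [0.0] * num_classes

loss, grad_w1, grad_w2, grad_b2 = forward_backward(
    [0.5, -1.0, 2.0, 0.1], 3, w1, w2, b2)
print(loss)  # ln(10): the prediction is uniform over the 10 classes
```

In this sketch both weight gradients are exactly zero, because the hidden activations are zero and the output weights are zero; only the output bias receives a gradient. The same holds for any activation with `f(0) = 0`, such as ReLU, ELU and SELU, so the weights never move and the network can at best shift its output bias.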




In [0]:
def two_layer(hparams, initializer):  
    input_shape = (32, 32, 3)
    num_classes=10

    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    
    fc1_output = tf.keras.layers.Dense(4000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer)(flattened_inputs)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer)(fc1_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model

In [0]:
# define weight initialization functions and activations
# init_modes = [ 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']
# HP_ACTIVATION = hp.HParam('activation', hp.Discrete(['relu', 'elu', 'selu','tanh']))
HP_ACTIVATION = hp.HParam('activation', hp.Discrete(['relu']))

init_modes = [ 'glorot_normal', 'glorot_uniform', 'lecun_uniform']

'''
define models to be tested
'''

def two_layer(hparams, initializer):  
    input_shape = (32, 32, 3)
    num_classes=10

    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(4000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer)(flattened_inputs)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer)(fc1_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model

def three_layer(hparams, initializer):  
    input_shape = (32, 32, 3)
    num_classes=10

    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(2000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer)(flattened_inputs)
    fc2_output = tf.keras.layers.Dense(4000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer)(fc1_output)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer)(fc2_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model

def four_layer(hparams, initializer):  
    input_shape = (32, 32, 3)
    num_classes=10


    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(4000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer)(flattened_inputs)
    fc2_output = tf.keras.layers.Dense(2000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer)(fc1_output)
    fc3_output = tf.keras.layers.Dense(1000, hparams[HP_ACTIVATION],
                                kernel_initializer=initializer)(fc2_output)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer)(fc3_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model

'''
 Testing function
'''
def test_networks_weight_act(net_func):
    for init_mode in init_modes:
        for activation in HP_ACTIVATION.domain.values:
            # define hyper params and model
            hparams = {
                HP_ACTIVATION: activation
            }

            hp.hparams(hparams)
            model = net_func(hparams, init_mode)
            
            # Log current performance
            print('--- Starting trial for a new model ---')
            print({h.name: hparams[h] for h in hparams})
            print(f"**{activation} {init_mode}**")
            
            train_model(model, optimizer, num_epochs=num_epochs, is_training=True)

# test_networks_weight_act(two_layer)



```
--- Starting trial for a new model ---
{'activation': 'relu'}
**relu glorot_normal**
Iteration 0, Epoch 1, Loss: 2.7262372970581055, Accuracy: 14.0625, Val Loss: 2.5445079803466797, Val Accuracy: 10.899999618530273
Iteration 700, Epoch 1, Loss: 1.7010219097137451, Accuracy: 40.54475784301758, Val Loss: 1.5846952199935913, Val Accuracy: 46.39999771118164
Iteration 1400, Epoch 2, Loss: 1.4249463081359863, Accuracy: 50.666831970214844, Val Loss: 1.466536045074463, Val Accuracy: 48.0
Iteration 2100, Epoch 3, Loss: 1.2996604442596436, Accuracy: 55.53053665161133, Val Loss: 1.4632164239883423, Val Accuracy: 49.70000076293945
Iteration 2800, Epoch 4, Loss: 1.2044527530670166, Accuracy: 59.05815505981445, Val Loss: 1.39975905418396, Val Accuracy: 52.20000076293945
Iteration 3500, Epoch 5, Loss: 1.1273502111434937, Accuracy: 62.1174201965332, Val Loss: 1.3844925165176392, Val Accuracy: 52.70000076293945
Iteration 4200, Epoch 6, Loss: 1.0585988759994507, Accuracy: 64.7363510131836, Val Loss: 1.3464016914367676, Val Accuracy: 54.400001525878906
Iteration 4900, Epoch 7, Loss: 0.9957124590873718, Accuracy: 67.3463134765625, Val Loss: 1.3465337753295898, Val Accuracy: 55.099998474121094
Iteration 5600, Epoch 8, Loss: 0.9353195428848267, Accuracy: 69.71757507324219, Val Loss: 1.3473161458969116, Val Accuracy: 53.89999771118164
Iteration 6300, Epoch 9, Loss: 0.8739699125289917, Accuracy: 72.19111633300781, Val Loss: 1.3470885753631592, Val Accuracy: 53.60000228881836
Iteration 7000, Epoch 10, Loss: 0.8180127739906311, Accuracy: 74.15303802490234, Val Loss: 1.385282278060913, Val Accuracy: 53.79999923706055
--- Starting trial for a new model ---
{'activation': 'relu'}
**relu glorot_uniform**
Iteration 0, Epoch 1, Loss: 2.842233180999756, Accuracy: 4.6875, Val Loss: 2.4674298763275146, Val Accuracy: 13.899999618530273
Iteration 700, Epoch 1, Loss: 1.702019453048706, Accuracy: 40.81000518798828, Val Loss: 1.5660946369171143, Val Accuracy: 45.89999771118164
Iteration 1400, Epoch 2, Loss: 1.4256442785263062, Accuracy: 50.82185363769531, Val Loss: 1.4696632623672485, Val Accuracy: 48.89999771118164
Iteration 2100, Epoch 3, Loss: 1.3008179664611816, Accuracy: 55.54975891113281, Val Loss: 1.4750423431396484, Val Accuracy: 48.10000228881836
Iteration 2800, Epoch 4, Loss: 1.2048171758651733, Accuracy: 59.020877838134766, Val Loss: 1.407007098197937, Val Accuracy: 51.70000076293945
Iteration 3500, Epoch 5, Loss: 1.1279350519180298, Accuracy: 62.13887405395508, Val Loss: 1.389534831047058, Val Accuracy: 53.20000076293945
Iteration 4200, Epoch 6, Loss: 1.0603731870651245, Accuracy: 64.67317962646484, Val Loss: 1.366346001625061, Val Accuracy: 53.39999771118164
Iteration 4900, Epoch 7, Loss: 0.9969092011451721, Accuracy: 67.27458953857422, Val Loss: 1.3677680492401123, Val Accuracy: 53.20000076293945
Iteration 5600, Epoch 8, Loss: 0.9363572001457214, Accuracy: 69.60643768310547, Val Loss: 1.367872714996338, Val Accuracy: 52.70000076293945
Iteration 6300, Epoch 9, Loss: 0.8731353282928467, Accuracy: 72.0285415649414, Val Loss: 1.3650671243667603, Val Accuracy: 54.29999923706055
Iteration 7000, Epoch 10, Loss: 0.8127890825271606, Accuracy: 74.47429656982422, Val Loss: 1.4217565059661865, Val Accuracy: 53.10000228881836
--- Starting trial for a new model ---
{'activation': 'relu'}
**relu lecun_uniform**
Iteration 0, Epoch 1, Loss: 2.576890707015991, Accuracy: 4.6875, Val Loss: 2.5337603092193604, Val Accuracy: 13.500000953674316
Iteration 700, Epoch 1, Loss: 1.697668194770813, Accuracy: 40.49126052856445, Val Loss: 1.5742788314819336, Val Accuracy: 45.0
Iteration 1400, Epoch 2, Loss: 1.4540557861328125, Accuracy: 49.71210479736328, Val Loss: 1.4872617721557617, Val Accuracy: 48.89999771118164
Iteration 2100, Epoch 3, Loss: 1.3395832777023315, Accuracy: 54.11083221435547, Val Loss: 1.4869284629821777, Val Accuracy: 47.900001525878906
Iteration 2800, Epoch 4, Loss: 1.2488690614700317, Accuracy: 57.49565505981445, Val Loss: 1.41399347782135, Val Accuracy: 51.30000305175781
Iteration 3500, Epoch 5, Loss: 1.173937201499939, Accuracy: 60.343963623046875, Val Loss: 1.3967193365097046, Val Accuracy: 52.499996185302734
Iteration 4200, Epoch 6, Loss: 1.107445478439331, Accuracy: 62.89168167114258, Val Loss: 1.36811363697052, Val Accuracy: 52.70000076293945
Iteration 4900, Epoch 7, Loss: 1.046114206314087, Accuracy: 65.13832092285156, Val Loss: 1.3543466329574585, Val Accuracy: 53.89999771118164
Iteration 5600, Epoch 8, Loss: 0.987697184085846, Accuracy: 67.08290100097656, Val Loss: 1.3560082912445068, Val Accuracy: 52.79999923706055
Iteration 6300, Epoch 9, Loss: 0.923699676990509, Accuracy: 69.63511657714844, Val Loss: 1.3493584394454956, Val Accuracy: 54.5
Iteration 7000, Epoch 10, Loss: 0.8638867735862732, Accuracy: 71.88960266113281, Val Loss: 1.3931227922439575, Val Accuracy: 53.70000076293945
```
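
In these logs the validation accuracy usually peaks before the last epoch, so it can help to pull the best checkpoint out of the printed output. The sketch below parses lines in the format printed by `train_model` with a regular expression; the function name and the regex are my own, and the sample is three lines copied from the glorot_normal trial.

```python
import re

LOG_LINE = re.compile(
    r"Iteration (?P<iteration>\d+), Epoch (?P<epoch>\d+), "
    r"Loss: (?P<loss>[\d.]+), Accuracy: (?P<acc>[\d.]+), "
    r"Val Loss: (?P<val_loss>[\d.]+), Val Accuracy: (?P<val_acc>[\d.]+)"
)


def best_validation_epoch(log_text):
    """Return (epoch, val_accuracy) of the best logged checkpoint."""
    records = [
        (int(m["epoch"]), float(m["val_acc"]))
        for m in LOG_LINE.finditer(log_text)
    ]
    return max(records, key=lambda record: record[1])


# Three lines copied from the glorot_normal trial.
sample_log = """
Iteration 4200, Epoch 6, Loss: 1.0585988759994507, Accuracy: 64.7363510131836, Val Loss: 1.3464016914367676, Val Accuracy: 54.400001525878906
Iteration 4900, Epoch 7, Loss: 0.9957124590873718, Accuracy: 67.3463134765625, Val Loss: 1.3465337753295898, Val Accuracy: 55.099998474121094
Iteration 7000, Epoch 10, Loss: 0.8180127739906311, Accuracy: 74.15303802490234, Val Loss: 1.385282278060913, Val Accuracy: 53.79999923706055
"""

best_epoch, best_val_acc = best_validation_epoch(sample_log)
print(best_epoch, round(best_val_acc, 1))  # 7 55.1
```

Selecting the epoch this way is a form of early stopping on the validation set; the test set should still be kept aside for the final evaluation.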

Compare regularization 

## L2 Weight regularization



For weight regularization I tried applying regularization to different layers.

In [0]:
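# Import from tensorflow.keras so the regularizers match the tf.keras layers used below.
from tensorflow.keras import regularizers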
# https://keras.io/regularizers/

init_modes = [ 'glorot_normal']

def two_layer(hparams, initializer):  
    input_shape = (32, 32, 3)
    num_classes=10


    
    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(4000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer,
                                 kernel_regularizer=regularizers.l2(0.01))(flattened_inputs)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer,
                             kernel_regularizer=regularizers.l2(0.01))(fc1_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model

def three_layer(hparams, initializer):  
    input_shape = (32, 32, 3)
    num_classes=10

    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(2000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer,
                                 kernel_regularizer=regularizers.l2(0.3))(flattened_inputs)
    fc2_output = tf.keras.layers.Dense(4000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer,
                                 kernel_regularizer=regularizers.l2(0.3))(fc1_output)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer)(fc2_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model

def four_layer(hparams, initializer):  
    input_shape = (32, 32, 3)
    num_classes=10


    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(4000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer,
                                 kernel_regularizer=regularizers.l2(0.001)
                                 
                                 )(flattened_inputs)
    fc2_output = tf.keras.layers.Dense(2000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer,
                                 kernel_regularizer=regularizers.l2(0.001)
                                 
                                 )(fc1_output)
    fc3_output = tf.keras.layers.Dense(1000, hparams[HP_ACTIVATION],
                                kernel_initializer=initializer,
                                 kernel_regularizer=regularizers.l2(0.001)
                                )(fc2_output)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer)(fc3_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model

In [0]:
# test_networks_weight_act(three_layer)

## Batch normalization

In [0]:
# Here I tested batch normalization in varying amounts and at various stages of the network.
def three_layer_batch(hparams, initializer):  
    input_shape = (32, 32, 3)
    num_classes=10

    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(2000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer,
                                 kernel_regularizer=regularizers.l2(0.3))(flattened_inputs)
    bn1_output = tf.keras.layers.BatchNormalization()(fc1_output)
    fc2_output = tf.keras.layers.Dense(4000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer,
                                 kernel_regularizer=regularizers.l2(0.3))(bn1_output)
    bn2_output = tf.keras.layers.BatchNormalization()(fc2_output)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer)(bn2_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model

# test_networks_weight_act(three_layer_batch)

## Optimizers




I tried all of the optimizers mentioned. SGD with momentum achieved the best results on the network with batch normalization. In general, the choice of optimizer did not change the final validation accuracy much; the main difference was how quickly each one reached a given level of accuracy. Optimizers that use momentum usually had a slightly slower start but improved quickly afterwards.

In [0]:
optimizers = [ tf.keras.optimizers.SGD(),  tf.keras.optimizers.SGD(momentum=0.1), tf.keras.optimizers.RMSprop(), tf.keras.optimizers.Adam()]

def three_layer_weight_inicialization(initializer='glorot_uniform'):  
    input_shape = (32, 32, 3)
    num_classes=10

    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(2000, 'relu',
                                 kernel_initializer=initializer)(flattened_inputs)
    fc2_output = tf.keras.layers.Dense(4000, 'relu',
                                 kernel_initializer=initializer)(fc1_output)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer)(fc2_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model

def three_layer_batch( initializer='glorot_uniform'):  
    input_shape = (32, 32, 3)
    num_classes=10

    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(2000, 'relu',
                                 kernel_initializer=initializer,
                                 kernel_regularizer=regularizers.l2(0.3))(flattened_inputs)
    bn1_output = tf.keras.layers.BatchNormalization()(fc1_output)
    fc2_output = tf.keras.layers.Dense(4000, 'relu',
                                 kernel_initializer=initializer,
                                 kernel_regularizer=regularizers.l2(0.3))(bn1_output)
    bn2_output = tf.keras.layers.BatchNormalization()(fc2_output)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer)(bn2_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model

def test_networks(net_func):
    for optimizer in optimizers:
        model = net_func( initializer='glorot_uniform')
        
        # Log current performance
        print('--- Starting trial for a new model ---')
        # Print the optimizer's name and settings (learning rate, momentum, ...).
        print(optimizer.get_config())
     
        train_model(model, optimizer, is_training=True)
test_networks(three_layer_batch)

## Dropout


In [0]:

def three_layer(hparams, initializer):  
    input_shape = (32, 32, 3)
    num_classes=10

    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(2000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer, kernel_regularizer=regularizers.l2(0.2)
                                )(flattened_inputs)
    do1_output = tf.keras.layers.Dropout(0.1)(fc1_output)
    fc2_output = tf.keras.layers.Dense(4000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer, kernel_regularizer=regularizers.l2(0.2)
                                 )(do1_output)
    do2_output = tf.keras.layers.Dropout(0.1)(fc2_output)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer)(do2_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model

test_networks_weight_act(three_layer)

## Data Augmentation 


In [0]:
## Data augmentation
import cv2
import numpy as np
print_every =700

def train_model_with_da(model, optimizer, num_epochs=1, is_training=False):
    """
    Simple training loop for use with models defined using tf.keras. It trains
    a model for num_epochs epochs on the CIFAR-10 training set, adding noise to
    each batch as data augmentation, and periodically checks accuracy on the
    CIFAR-10 validation set.
    
    Inputs:
    - model: the model we want to train
    - optimizer:  the Optimizer object we will use to optimize the model
    - num_epochs: The number of epochs to train for
    - is_training: passed as the `training` flag to the model in the training step
    
    Returns: Nothing, but prints progress during training
    """    
    with tf.device(device):

        # Compute the loss like we did in Part II
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
        
        model = model
        optimizer = optimizer
        
        train_loss = tf.keras.metrics.Mean(name='train_loss')
        train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
    
        val_loss = tf.keras.metrics.Mean(name='val_loss')
        val_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='val_accuracy')
        
        t = 0
        for epoch in range(num_epochs):
            
            # Reset the metrics - https://www.tensorflow.org/alpha/guide/migration_guide#new-style_metrics
            train_loss.reset_states()
            train_accuracy.reset_states()
            
            for x_np, y_np in train_dset:
                with tf.GradientTape() as tape:
                    
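                    # Data augmentation: add Gaussian noise (mean 0, stddev 1) to the batch.
                    # Note: the noise buffer is uint8, so cv2.randn clips negative
                    # samples to 0 and the added noise is non-negative.
                    im = np.zeros(x_np.shape, np.uint8)
                    im = cv2.randn(im, 0, 1)
                    x_np = x_np + im

                    # Use the model function to build the forward pass.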
                    scores = model(x_np, training=is_training)
                    loss = loss_fn(y_np, scores)
      
                    gradients = tape.gradient(loss, model.trainable_variables)
                    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
                    
                    # Update the metrics
                    train_loss.update_state(loss)
                    train_accuracy.update_state(y_np, scores)
                    
                    if t % print_every == 0:
                        val_loss.reset_states()
                        val_accuracy.reset_states()
                        for test_x, test_y in val_dset:
                            # During validation at end of epoch, training set to False
                            prediction = model(test_x, training=False)
                            t_loss = loss_fn(test_y, prediction)

                            val_loss.update_state(t_loss)
                            val_accuracy.update_state(test_y, prediction)
                        
                        template = 'Iteration {}, Epoch {}, Loss: {}, Accuracy: {}, Val Loss: {}, Val Accuracy: {}'
                        print (template.format(t, epoch+1,
                                             train_loss.result(),
                                             train_accuracy.result()*100,
                                             val_loss.result(),
                                             val_accuracy.result()*100))
                    t += 1


def three_layer(hparams, initializer):  
    input_shape = (32, 32, 3)
    num_classes=10

    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(2000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer,
                                 kernel_regularizer=regularizers.l2(0.2))(flattened_inputs)
    bn1_output = tf.keras.layers.BatchNormalization()(fc1_output)
    fc2_output = tf.keras.layers.Dense(4000, hparams[HP_ACTIVATION],
                                 kernel_initializer=initializer,
                                 kernel_regularizer=regularizers.l2(0.2))(bn1_output)
    bn2_output = tf.keras.layers.BatchNormalization()(fc2_output)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer=initializer)(bn2_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model

def test_networks_weight_act(net_func):
    for init_mode in init_modes:
        for activation in HP_ACTIVATION.domain.values:
            # define hyper params and model
            hparams = {
                HP_ACTIVATION: activation
            }

            hp.hparams(hparams)
            model = net_func(hparams, init_mode)
            
            # Log current performance
            print('--- Starting trial for a new model ---')
            print({h.name: hparams[h] for h in hparams})
            print(f"**{activation} {init_mode}**")
            
            train_model_with_da(model, optimizer, num_epochs=num_epochs, is_training=True)

In [0]:
# test_networks_weight_act(three_layer)

## Testing and evaluating the best model

In [0]:
def get_best_model():
    input_shape = (32, 32, 3)
    num_classes=10

    inputs = tf.keras.Input(shape=input_shape)
    flattened_inputs = tf.keras.layers.Flatten()(inputs)
    fc1_output = tf.keras.layers.Dense(2000, 'relu',
                                 kernel_initializer='glorot_uniform',
                                 kernel_regularizer=regularizers.l2(0.2))(flattened_inputs)
    bn1_output = tf.keras.layers.BatchNormalization()(fc1_output)
    fc2_output = tf.keras.layers.Dense(4000, 'relu',
                                 kernel_initializer='glorot_uniform',
                                 kernel_regularizer=regularizers.l2(0.2))(bn1_output)
    bn2_output = tf.keras.layers.BatchNormalization()(fc2_output)
    scores = tf.keras.layers.Dense(num_classes, activation='softmax',
                             kernel_initializer='glorot_uniform')(bn2_output)

    # Instantiate the model given inputs and outputs.
    model = tf.keras.Model(inputs=inputs, outputs=scores)
    return model

## For some reason the model as proposed in the assignment could not be evaluated because of a call() error, and I could not figure out why, even though I spent another day on it.
## I think it is because of the batch normalization (which, in combination with the other hyperparameters, gives the best results).
## On the validation data the top accuracy was around 55.


# What I did: 
I approached this assignment by first searching for good hyperparameters over many options (with a predefined grid search), and then trying to improve the best candidates by hand: I manually varied the hyperparameters that looked promising, to find the boundaries where they work better or stop working. As a result, my code does not fully describe my search process.


## My first goal was to find a good performing neural network architecture + Activation functions
> *The network above has two layers of trainable parameters. Can you do better with a deeper network? Or maybe with a wider network? Try five different architectures and observe the performance on the validation data. Use the architectures in combinations with other hyperparameters, as outlines below. Discuss your findings.*

> In your networks, use five different activation functions, such as ReLU, leaky ReLU, parametric ReLU, ELU, MaxOut, or tanh to gain practical insights into their ability to improve accuracy. 

**Two Layer** 

My first network architecture is a two-layer network. 
I first tested it with hidden sizes of 2000 and 4000 and different activation functions.
For both sizes the best performance was achieved with the relu activation function (elu came second), and 4000 performed better than 2000:

> **relu 2000**: Iteration 7000, Epoch 10, Loss: 0.8704050183296204, Accuracy: 72.1962661743164, Val Loss: 1.4491089582443237, Val Accuracy: 52.39999771118164

> **relu 4000**: Iteration 7000, Epoch 10, Loss: 0.7758519053459167, Accuracy: 75.37967681884766, Val Loss: 1.4674423933029175, Val Accuracy: 54.10000228881836

Since the larger layer did better, I tried an even bigger network with a hidden size of 8000:

 >**relu 8000**: Iteration 7000, Epoch 10, Loss: 0.6558446884155273, Accuracy: 80.91413116455078, Val Loss: 1.5722764730453491, Val Accuracy: 50.30000305175781

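With the larger hidden layer I saw more overfitting to the training data (training accuracy rose to about 81% while validation accuracy dropped to about 50%), so the model did not generalize well.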

**Three Layer**

For the second model I tried combinations of large and small hidden sizes with various activation functions. Again relu and elu had the best results. Here are some interesting results; most configurations perform at around the same level on the validation data.


> **relu 2000 -> 4000** Iteration 7000, Epoch 10, Loss: 0.4351615905761719, Accuracy: 89.92406463623047, Val Loss: 1.4986521005630493, Val Accuracy: 54.79999923706055


> **elu 4000 -> 2000** Iteration 7000, Epoch 10, Loss: 0.6495599746704102, Accuracy: 79.65829467773438, Val Loss: 1.474124789237976, Val Accuracy: 53.70000076293945

> **relu 4000 -> 4000** Iteration 7000, Epoch 10, Loss: 0.3278921842575073, Accuracy: 93.93983459472656, Val Loss: 1.4880443811416626, Val Accuracy: 54.10000228881836

> **elu 600 -> 300**  Iteration 7000, Epoch 10, Loss: 0.9017217755317688, Accuracy: 69.3049087524414, Val Loss: 1.4206169843673706, Val Accuracy: 53.500003814697266

Again overfitting was an issue, but I hoped to fix it with regularization.

**Four Layer**

Experiments with the four-layer network did not go quite as well, although the results were not far from the best ones. 

I again tried different architectures to see whether some configuration would perform much better than the others (as before, relu had the best results).


> **relu 4000 -> 2000 --> 1000** Iteration 7000, Epoch 10, Loss: 0.2371554970741272, Accuracy: 95.76518249511719, Val Loss: 1.6421884298324585, Val Accuracy: 53.10000228881836

> **relu 2000 -> 1000 --> 500** Iteration 7000, Epoch 10, Loss: 0.42495739459991455, Accuracy: 89.39836883544922, Val Loss: 1.5798757076263428, Val Accuracy: 52.10000228881836

> **relu 2000 -> 2500 --> 2000** Iteration 7000, Epoch 10, Loss: 0.2751586139202118, Accuracy: 94.88902282714844, Val Loss: 1.6353516578674316, Val Accuracy: 51.89999771118164

> **relu 1000 -> 2000 --> 4000** Iteration 7000, Epoch 10, Loss: 0.33775943517684937, Accuracy: 92.59637451171875, Val Loss: 1.6860527992248535, Val Accuracy: 51.20000076293945


## Second - Weight initialization:
> *Corresponding to your activation functions, use different weight initialization schemes. Discuss your findings. What happens if you use the zero_weight initialization? *

Next I took the top 3 network architectures, which achieved the best results, and tried them with various weight initializations ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'] and 5 different activation functions.

The best network architectures and weight initialization results:
 

**relu 4000**

Best results were achieved with:
> relu glorot_uniform 52.89999771118164

> relu glorot_normal 53.10000228881836

> relu lecun_uniform 53.60000228881836




**relu 2000 -> 4000**
> relu lecun_uniform 54.000003814697266

> relu lecun_normal 54.000003814697266

> relu glorot_uniform 54.20000076293945

**relu 4000 -> 2000 --> 1000**

> relu glorot_uniform 53.20000076293945

> relu lecun_uniform 53.70000076293945

> relu glorot_normal 53.89999771118164

*(The best results were again with the relu activation function, and mostly with glorot_uniform/glorot_normal and lecun_uniform. This is a bit unexpected, since He initialization is expected to work better for layers with ReLU activations, while Glorot initialization works better with sigmoid activation functions.)* - [src](https://stats.stackexchange.com/questions/319323/whats-the-difference-between-variance-scaling-initializer-and-xavier-initialize)
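
To make the comparison between these initializers more concrete, here is a small sketch of the scale each scheme uses for the first hidden layer of the two-layer network (3072 inputs, 4000 units). The formulas are the standard ones for the Glorot, He, and LeCun variance-scaling initializers; the helper name `init_scales` is only illustrative and not part of the notebook.

```python
import math

def init_scales(fan_in, fan_out):
    """Standard deviation (normal variants) or sampling limit (uniform variants)."""
    return {
        "glorot_normal_std": math.sqrt(2.0 / (fan_in + fan_out)),
        "glorot_uniform_limit": math.sqrt(6.0 / (fan_in + fan_out)),
        "he_normal_std": math.sqrt(2.0 / fan_in),
        "he_uniform_limit": math.sqrt(6.0 / fan_in),
        "lecun_normal_std": math.sqrt(1.0 / fan_in),
        "lecun_uniform_limit": math.sqrt(3.0 / fan_in),
    }

fan_in = 32 * 32 * 3   # flattened CIFAR-10 image
fan_out = 4000         # hidden size of the two-layer network
scales = init_scales(fan_in, fan_out)
for name, value in scales.items():
    print(f"{name}: {value:.5f}")
```

For this layer shape the three schemes differ by less than a factor of two in scale, which may be part of the reason the validation accuracies above are so close to each other.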



**Result from test with two layer network and zero weight initialization:**
*(On the others the results were quite similar)* 

**elu zero**
Iteration 7000, Epoch 10, Loss: 2.3026206493377686, Accuracy: 9.59404182434082, Val Loss: 2.303201198577881, Val Accuracy: 7.800000190734863

**relu zero**
Iteration 7000, Epoch 10, Loss: 2.3026206493377686, Accuracy: 9.59404182434082, Val Loss: 2.303201198577881, Val Accuracy: 7.800000190734863

**selu zero**
Iteration 7000, Epoch 10, Loss: 2.3026206493377686, Accuracy: 9.59404182434082, Val Loss: 2.303201198577881, Val Accuracy: 7.800000190734863

**tanh zero**
Iteration 7000, Epoch 10, Loss: 2.3026206493377686, Accuracy: 9.59404182434082, Val Loss: 2.303201198577881, Val Accuracy: 7.800000190734863

**Q:** What happens if you use the zero_weight initialization?

**A:** 
All the neurons were initialized with the same weight (zero), so the output of the forward pass is zero apart from the bias. I assumed the bias was initialized to a non-zero value, but the Keras `Dense` default `bias_initializer` is also `'zeros'`, so unless it was changed the bias starts at zero as well. Because every neuron starts with the same weights, they all follow the same gradient and end up doing the same thing as one another. Therefore there is little to no room for improvement.

*(My intuitive thought with what I know, how it should work)*
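
The constant loss of about 2.3026 in the zero-initialization logs can be checked by hand. As a minimal sketch, assuming all output scores are equal (which is what all-zero weights and biases would produce), softmax gives each of the 10 classes probability 0.1 and the cross-entropy is ln(10). The helper functions below are illustrative and not part of the notebook.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross-entropy of one prediction against an integer class label."""
    return -math.log(probs[label])

num_classes = 10
probs = softmax([0.0] * num_classes)   # identical scores for every class
loss = cross_entropy(probs, label=3)   # the label does not matter here
print(probs[0], loss, math.log(num_classes))
```

A uniform prediction over 10 classes is also right about 10% of the time, which is in line with the roughly 8 to 10% accuracy reported in those runs.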



## Third was trying the model with batch normalization:
> *Try adding batch normalization. Do your networks train faster? Does the accuracy improve?*

Here I experimented with adding batch normalization between layers, a varying number of times and with varying strength.

The networks train much faster and also end up with better accuracy, but they seem to overfit too much.

Here are results of the few experiments: 
> **Single layer batch normalization** Iteration 7000, Epoch 10, Loss: 0.2675759196281433, Accuracy: 95.67757415771484, Val Loss: 1.4499861001968384, Val Accuracy: 53.500003814697266

> **Two Layer batch normalization**  Iteration 7000, Epoch 10, Loss: 0.058461688458919525, Accuracy: 99.73715209960938, Val Loss: 1.8086633682250977, Val Accuracy: 53.500003814697266
 
and the best-performing model on the validation data so far:

> **Combination of two layer nn model with 0.4 l2 weight regularization and two layer batch normalization** Iteration 7000, Epoch 10, Loss: 0.057503633201122284, Accuracy: 99.78096008300781, Val Loss: 1.65192449092865, Val Accuracy: 56.0
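
For reference, this is roughly what a batch normalization layer computes in training mode. It is a NumPy sketch under simplifying assumptions: it normalizes with the statistics of the current batch only and leaves out the moving averages that Keras keeps for inference. `batchnorm_train_forward` is a hypothetical helper, and `eps=1e-3` mirrors the default `epsilon` of `tf.keras.layers.BatchNormalization`.

```python
import numpy as np

def batchnorm_train_forward(x, gamma, beta, eps=1e-3):
    """Normalize each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
# A batch of 64 examples with 4 features that are far from zero mean / unit variance.
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
out = batchnorm_train_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))
```

Because the layer behaves differently in training and inference, the `training` flag passed to the model in the training loop matters for these networks.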

## Fourth - Optimizers:
> *Use different optimizers, including SGD, SGD with momentum, RMSprop and Adam. Use the optimizers with and without batch normalization to observe what optimizers benefit more from batch normalization, or different weight initializations schemes and what optimizers are more robust to initialization/normalization.* 

I tried all of the optimizers mentioned. SGD with momentum achieved the best results on the network with batch normalization. In general, the choice of optimizer did not change the final validation accuracy much; the main difference was how quickly each one reached a given level of accuracy. Optimizers that use momentum usually had a slightly slower start but improved quickly afterwards.
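
As a reminder of what the momentum term does, here is a minimal sketch of the SGD-with-momentum update rule as described in the Keras `SGD` documentation (`velocity = momentum * velocity - learning_rate * g`, then `w = w + velocity`). It is applied to a toy one-dimensional problem rather than to the network; `sgd_momentum_step` is a hypothetical helper, and the values `learning_rate=0.01` and `momentum=0.1` mirror the Keras default learning rate and the momentum used in this notebook.

```python
def sgd_momentum_step(w, velocity, grad, learning_rate=0.01, momentum=0.1):
    """One SGD-with-momentum update for a scalar parameter."""
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity

# Toy problem: minimize f(w) = w**2, whose gradient is 2 * w, starting at w = 1.0.
w, v = 1.0, 0.0
history = [w]
for _ in range(100):
    w, v = sgd_momentum_step(w, v, grad=2.0 * w)
    history.append(w)

print(history[1], history[-1])
```

The velocity starts at zero and needs a few steps to build up, which fits the slightly slower start observed with momentum.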

## Fifth - Regularization: 
> *Compare L2 weight regularization, with dropout, batch normalization, and data augmentation. Discuss your findings.*


**L2 Weight regularization**

For weight regularization I tried regularization in different layers and with different strengths.

Usually it made learning faster, but the model started to overfit too soon, so in the end the validation accuracy was not very good. 

However, a really good result was achieved with L2 (0.2) weight regularization on two layers of the three-layer network:
>  Iteration 7000, Epoch 10, Loss: 0.6876577138900757, Accuracy: 79.58528137207031, Val Loss: 1.4075878858566284, Val Accuracy: 54.20000076293945
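
One detail worth checking when combining `kernel_regularizer` with a hand-written `tf.GradientTape` loop: Keras collects the L2 penalties in `model.losses`, and they only influence training if they are added to the loss that is differentiated. The loop in `train_model_with_da` differentiates `loss_fn(y_np, scores)` alone, so the following is a hedged sketch of how the penalty could be included. `loss_with_regularization` is a hypothetical helper that is not part of the notebook, and the tiny model and random data are only there for demonstration.

```python
import tensorflow as tf

def loss_with_regularization(model, loss_fn, x, y, training=True):
    """Return (total_loss, data_loss, scores); total_loss includes model.losses."""
    scores = model(x, training=training)
    data_loss = loss_fn(y, scores)
    # model.losses holds one scalar per regularized layer (empty list if none).
    reg_loss = tf.add_n(model.losses) if model.losses else tf.constant(0.0)
    return data_loss + reg_loss, data_loss, scores

# Tiny demonstration model with an L2 penalty on its only Dense layer.
inputs = tf.keras.Input(shape=(8,))
outputs = tf.keras.layers.Dense(
    3, activation="softmax",
    kernel_regularizer=tf.keras.regularizers.L2(0.01))(inputs)
demo_model = tf.keras.Model(inputs=inputs, outputs=outputs)

x = tf.random.normal((4, 8))
y = tf.constant([0, 1, 2, 1])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

with tf.GradientTape() as tape:
    total_loss, data_loss, _ = loss_with_regularization(demo_model, loss_fn, x, y)
grads = tape.gradient(total_loss, demo_model.trainable_variables)

print(float(data_loss), float(total_loss))
```

With large coefficients such as 0.2 or 0.3 on layers with millions of weights, the penalty can dominate the data loss once it is included, so the coefficients would likely need to be re-tuned.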

**Dropout**

Dropout was tested in a similar way to the L2 weight regularization.  
Here the learning was a lot slower, but dropout actually helped a lot. These are, I would say, the best results or improvements so far.

> **0.2 dropout** Iteration 6300, Epoch 9, Loss: 1.065208077430725, Accuracy: 63.3399543762207, Val Loss: 1.271921992301941, Val Accuracy: 56.099998474121094

> **0.3 dropout** Iteration 7000, Epoch 10, Loss: 1.1485105752944946, Accuracy: 60.470211029052734, Val Loss: 1.28786301612854, Val Accuracy: 55.80000305175781
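
The slower learning with dropout is easier to understand from what the layer does to the activations. Below is a NumPy sketch of inverted dropout, which is the scheme `tf.keras.layers.Dropout` follows: during training a fraction `rate` of the units is set to zero and the rest are scaled by `1 / (1 - rate)`, while at inference the input passes through unchanged. `dropout_forward` is a hypothetical helper written for illustration.

```python
import numpy as np

def dropout_forward(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 200))
train_out = dropout_forward(activations, rate=0.2, training=True, rng=rng)
test_out = dropout_forward(activations, rate=0.2, training=False, rng=rng)

# Fraction of zeroed units, mean activation in training, mean at inference.
print((train_out == 0).mean(), train_out.mean(), test_out.mean())
```

The rescaling keeps the expected activation the same in both modes, so no extra correction is needed when evaluating on the validation set with `training=False`.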

**Data Augmentation** 

I implemented data augmentation by adding noise to the training images. With low noise this really helped keep the model from overfitting; with stronger noise it made training too hard.

> **Here is the result of low noise data augmentation** Iteration 7000, Epoch 10, Loss: 1.2334572076797485, Accuracy: 56.32301712036133, Val Loss: 1.369227409362793, Val Accuracy: 52.499996185302734
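
The training loop generates its noise with `cv2.randn` on a `uint8` buffer. As far as I can tell from the OpenCV documentation, generated values are clipped to the range of the output type, so negative samples become 0 and the noise is not zero-mean. A possible alternative, sketched here with NumPy only, is to draw zero-mean floating-point noise. `add_gaussian_noise` and the `stddev=0.05` value are assumptions for illustration, not what produced the result above.

```python
import numpy as np

def add_gaussian_noise(batch, stddev=0.05, rng=None):
    """Return a float32 copy of `batch` with zero-mean Gaussian noise added."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=0.0, scale=stddev, size=batch.shape).astype(np.float32)
    return batch.astype(np.float32) + noise

rng = np.random.default_rng(0)
# Stand-in for one minibatch of 64 preprocessed CIFAR-10 images.
batch = rng.random((64, 32, 32, 3), dtype=np.float32)
noisy = add_gaussian_noise(batch, stddev=0.05, rng=rng)

diff = noisy - batch
print(noisy.shape, noisy.dtype, float(diff.mean()), float(diff.std()))
```

The `stddev` parameter then controls the noise strength directly, which makes it easier to explore the boundary between helpful and harmful noise levels.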



