# CE-40959: Deep Learning
## HW2 - MLP / Optimization Algorithms /  Batch Normalization / Dropout (Numpy)
(18 points)

### Deadline: 23 Esfand

#### Name:
#### Student No.:

# 1. Imports and Data Loading

In this notebook, we use the modules you implemented in Part 1. We define a classification task and train several models on it.

The dataset we use is CIFAR-100. It contains 60000 RGB images (50000 for training and 10000 for testing, which we use here for validation), each with shape (32, 32, 3). Every image has a corresponding label, an integer from 0 to 99, indicating its object class. You can see the classes in the picture below.

![title](images/cifar100.gif)

We'll also ask you to compare the results of different models.

**Keep in mind that random guessing achieves an accuracy of 1 percent, since there are 100 equally sized classes.**
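
As a quick sanity check of this baseline, the sketch below simulates uniform random guessing over 100 classes. It uses synthetic labels rather than the real CIFAR-100 labels, and the sample count is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes = 100
num_samples = 100_000  # synthetic sample count, chosen only for this illustration

# Synthetic labels and uniformly random guesses (not the real CIFAR-100 labels).
labels = rng.integers(0, num_classes, size=num_samples)
guesses = rng.integers(0, num_classes, size=num_samples)

random_guess_accuracy = float(np.mean(labels == guesses))
print(f"simulated random-guess accuracy: {random_guess_accuracy:.4f}")
```

Any model whose validation accuracy stays near this value has effectively learned nothing.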

## 1.1. Imports
Run this cell to import the necessary libraries.

In [14]:
import numpy as np
import matplotlib.pyplot as plt

from model import *
from adam import *
from batch_norm import *
from dense import *
from dropout import *
from module import *
from optimizer import *
from relu import *
from sgd import *
from sigmoid import *
from softmax_crossentropy import *

# We download CIFAR-100 through the tensorflow.keras library, which is installed in Google Colab by default.
from tensorflow.keras.datasets import cifar100

## 1.2. Load the Data
We download the dataset using `cifar100.load_data()` and store the results in (`X_train`, `y_train`) and (`X_val`, `y_val`).

In [4]:
(X_train, y_train), (X_val, y_val) = cifar100.load_data()
print(X_train.shape, y_train.shape)
print(X_val.shape, y_val.shape)

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
(50000, 32, 32, 3) (50000, 1)
(10000, 32, 32, 3) (10000, 1)


Because we use an MLP for image classification, we need to flatten the images, that is, reshape each image into a one-dimensional vector.

In [5]:
X_train = X_train.reshape(X_train.shape[0], -1)  # flatten each image in X_train
X_val = X_val.reshape(X_val.shape[0], -1)  # flatten each image in X_val
X_train.shape, X_val.shape

((50000, 3072), (10000, 3072))

We also need the labels to be scalars. We use the `np.squeeze()` function to drop the trailing dimension of size 1.

In [6]:
y_train = np.squeeze(y_train)  # squeeze y_train
y_val = np.squeeze(y_val)  # squeeze y_val
y_train.shape, y_val.shape

((50000,), (10000,))

## 1.3. Normalization
Run the cell below.

In [7]:
print(X_train.max(), X_train.min())
print(X_val.max(), X_val.min())

255 0
255 0


As you can see, RGB values range from 0 to 255. We can divide every value by 255 to scale the data to the range [0, 1].

In [8]:
X_train = X_train / 255  # divide X_train by 255
X_val = X_val / 255  # divide X_val by 255
print(X_train.max(), X_train.min())
print(X_val.max(), X_val.min())

1.0 0.0
1.0 0.0
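
A side note on memory: dividing a `uint8` array by 255 produces a `float64` array, so the 50000 x 3072 training matrix takes roughly 1.2 GB. The sketch below shows the dtype behaviour on a tiny stand-in array. Casting to `float32` first is one way to halve the footprint, but whether your Part 1 modules behave well in `float32` is an assumption you would need to check.

```python
import numpy as np

# A tiny stand-in for the flattened image array (the real one is 50000 x 3072).
pixels = np.array([[0, 128, 255]], dtype=np.uint8)

scaled64 = pixels / 255                     # true division promotes uint8 to float64
scaled32 = pixels.astype(np.float32) / 255  # stays float32, half the memory

print(scaled64.dtype, scaled32.dtype)
```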


## 1.4. Plot Function
This function takes a training history (loss, acc, val_loss, val_acc) as input and plots it.

In [9]:
def plot_history(history):
    losses, accs, val_losses, val_accs = history
    x = list(range(len(losses)))
    
    # plot for losses
    plt.plot(x, losses, '-g', label='train')
    plt.plot(x, val_losses, '-r', label='validation')
    plt.xlabel('epoch')
    plt.ylabel('loss')
    plt.legend(loc='upper right')
    plt.show()
    
    # plot for accuracies
    plt.plot(x, accs, '-g', label='train')
    plt.plot(x, val_accs, '-r', label='validation')
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.legend(loc='upper left')
    plt.show()

# 2. Problem: Learning Rate (5 points)
In this problem, we examine the effect of the `learning_rate` of the `Adam` optimizer. We use the same model in every section of this problem; the only thing that differs is the `learning_rate`.

We use `beta1 = 0.9`, `beta2 = 0.999` and `epsilon = 1e-8` for `Adam`.
The model contains a `Dense` layer with 100 neurones, followed by a `BatchNormalization` module and a `ReLU` activation function. After that, we put another `Dense` layer with 100 neurones to output one score per class, followed by a `SoftmaxCrossentropy` module.
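
To make the shapes concrete, here is a plain NumPy sketch of the forward pass through this architecture. It does not use your Part 1 modules; the parameter names, the 0.01 weight scale, and the random stand-in data are all assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, in_features, hidden, num_classes = 8, 3072, 100, 100

# Stand-in data: random "pixels" in [0, 1] and random labels (not CIFAR-100).
x = rng.random((batch_size, in_features))
y = rng.integers(0, num_classes, size=batch_size)

# Hypothetical parameters; the 0.01 scale is an arbitrary choice for this sketch.
W1 = 0.01 * rng.standard_normal((in_features, hidden))
b1 = np.zeros(hidden)
gamma, beta = np.ones(hidden), np.zeros(hidden)
W2 = 0.01 * rng.standard_normal((hidden, num_classes))
b2 = np.zeros(num_classes)

# Dense -> BatchNormalization (training-mode batch statistics) -> ReLU -> Dense
z1 = x @ W1 + b1
z1_hat = (z1 - z1.mean(axis=0)) / np.sqrt(z1.var(axis=0) + 1e-8)
a1 = np.maximum(0.0, gamma * z1_hat + beta)
logits = a1 @ W2 + b2

# Softmax cross-entropy, shifted by the row maximum for numerical stability.
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(batch_size), y].mean()

print(logits.shape, loss)
```

With small initial weights the logits are close to zero, so the loss should start near ln(100) ≈ 4.6, the cross-entropy of a uniform prediction over 100 classes. A first-epoch loss far from this value often points to an initialization or implementation problem.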

We train each model for 10 epochs with `batch_size = 1024`. We also plot the history of training.
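
For intuition about what `learning_rate` controls, the following is a minimal sketch of one textbook Adam update with the hyperparameters listed above. The function `adam_step` is a hypothetical helper written for this illustration; your `Adam` class from Part 1 may be organized differently.

```python
import numpy as np

def adam_step(param, grad, m, v, t, learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One textbook Adam update; returns the new (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    param = param - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return param, m, v

grad = np.array([1e-3, -5.0, 200.0])  # gradients of very different magnitudes
first_steps = {}
for lr in (1e-3, 1e-2, 1e-1, 1.0):
    w, _, _ = adam_step(np.zeros(3), grad, np.zeros(3), np.zeros(3), t=1, learning_rate=lr)
    first_steps[lr] = w
    print(lr, w)
```

Because of the bias correction, the very first step moves each parameter by roughly `learning_rate` in the direction opposite to its gradient, regardless of the gradient's magnitude. This is worth keeping in mind when interpreting the runs with large learning rates.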

**Note that you should assign a unique name for every module you define in your model.**

You can use `model.add(Module)` to add a module to the model. The modules are added in order.
We also recommend reading the `model.py` file.

## 2.1. `learning_rate = 0.001`

In [17]:
model = Model(Adam(learning_rate=1e-3, beta1=.9, beta2=.999))
batch_size = 1024
number_neurones = 100
model.add(Dense('dense1', X_train.shape[1], number_neurones))
model.optimizer.v['dense1'] = {'W': np.zeros((X_train.shape[1], number_neurones)), 'b': np.zeros((number_neurones,))}
model.optimizer.m['dense1'] = {'W': np.zeros((X_train.shape[1], number_neurones)), 'b': np.zeros((number_neurones,))}
model.add(BatchNormalization('batch1', number_neurones))
model.optimizer.v['batch1'] = {'gamma': np.zeros((number_neurones,)), 'beta': np.zeros((number_neurones,))}
model.optimizer.m['batch1'] = {'gamma': np.zeros((number_neurones,)), 'beta': np.zeros((number_neurones,))}
model.add(ReLU('relu1'))
model.add(Dense('dense2', number_neurones, number_neurones))
model.optimizer.v['dense2'] = {'W': np.zeros((number_neurones, number_neurones)), 'b': np.zeros((number_neurones,))}
model.optimizer.m['dense2'] = {'W': np.zeros((number_neurones, number_neurones)), 'b': np.zeros((number_neurones,))}
model.add(SoftmaxCrossentropy('last'))
history = model.fit(X_train, y_train, X_val, y_val, batch_size=1024, epochs=10)
plot_history(history)

Epoch 1: 

KeyError: 'W'

## 2.2. `learning_rate = 0.01`

In [None]:
model = Model(Adam(learning_rate=1e-2, beta1=.9, beta2=.999))
# add the layers to the model
history = model.fit(X_train, y_train, X_val, y_val, batch_size=1024, epochs=10)
plot_history(history)

## 2.3. `learning_rate = 0.1`

In [None]:
model = Model(Adam(learning_rate=1e-1, beta1=.9, beta2=.999))
# add the layers to the model
history = model.fit(X_train, y_train, X_val, y_val, batch_size=1024, epochs=10)
plot_history(history)

## 2.4. `learning_rate = 1`

In [None]:
model = Model(Adam(learning_rate=1, beta1=.9, beta2=.999))
# add the layers to the model
history = model.fit(X_train, y_train, X_val, y_val, batch_size=1024, epochs=10)
plot_history(history)

## 2.5. Compare the Results
**Compare the results of the models above. Which learning rate worked best? Why? (Feel free to write in Persian)**

write your answer here

# 3. Problem: Optimizers & Batch Normalization (5 points)
In this problem, we try different optimizers with and without `BatchNormalization`. Apart from the optimizer and `BatchNormalization`, the models in the different sections are identical.

We use `learning_rate = 0.1`, `beta1 = 0.9`, `beta2 = 0.999`, `epsilon = 1e-8` for `Adam` and `learning_rate = 0.1`, `momentum = 0.9` for SGD.
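
For reference, here is a minimal sketch of one common formulation of SGD with momentum, using the hyperparameters above. The helper `sgd_momentum_step` is hypothetical, and momentum conventions vary (for example, some versions accumulate the raw gradient and apply the learning rate afterwards), so your `SGD` class from Part 1 may differ in detail.

```python
import numpy as np

def sgd_momentum_step(param, grad, velocity, learning_rate=0.1, momentum=0.9):
    """One SGD-with-momentum update; returns the new (param, velocity)."""
    velocity = momentum * velocity - learning_rate * grad
    return param + velocity, velocity

# Toy problem: minimize f(w) = 0.5 * w**2, whose gradient is simply w.
w, velocity = np.array([1.0]), np.zeros(1)
for _ in range(200):
    w, velocity = sgd_momentum_step(w, w, velocity)
print(w)
```

Unlike Adam, the step size here scales directly with the gradient magnitude, so the same `learning_rate = 0.1` can behave very differently under the two optimizers.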


The model contains a `Dense` layer with 100 neurones, followed by a `Sigmoid` activation function. After that, we put another `Dense` layer with 100 neurones to output one score per class, followed by a `SoftmaxCrossentropy` module.

In some models, we add a `BatchNormalization` after the first `Dense` module.

We train each model for 50 epochs with `batch_size = 1024`. We also plot the history of training.

## 3.1. SGD & No BatchNorm

In [None]:
model = Model(SGD(learning_rate=1e-1, momentum=.9))
# add the layers to the model
history = model.fit(X_train, y_train, X_val, y_val, batch_size=1024, epochs=50)
plot_history(history)

## 3.2. SGD & BatchNorm

In [None]:
model = Model(SGD(learning_rate=1e-1, momentum=.9))
# add the layers to the model. don't forget to add a BatchNormalization module right after the first Dense module.
history = model.fit(X_train, y_train, X_val, y_val, batch_size=1024, epochs=50)
plot_history(history)

## 3.3. Adam & BatchNorm

In [None]:
model = Model(Adam(learning_rate=1e-1, beta1=.9, beta2=.999))
# add the layers to the model. don't forget to add a BatchNormalization module right after the first Dense module.
history = model.fit(X_train, y_train, X_val, y_val, batch_size=1024, epochs=50)
plot_history(history)

## 3.4. Compare the Results
**Compare the results of the models above.**

write your answer here

# 4. Problem: Regularization (8 points)
Take another look at the results of 3.3. You should be able to see a significant gap between `acc` and `val_acc`. In this problem, we try to reduce the overfitting and make the model generalize better.
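
To put a number on that gap when comparing runs, a small helper like the one below may be convenient. It assumes `history` has the same layout that `plot_history` unpacks (losses, accs, val_losses, val_accs); the function name and the toy values are made up for illustration and are not results from any run.

```python
def generalization_gap(history):
    """Final-epoch training accuracy minus final-epoch validation accuracy."""
    _, accs, _, val_accs = history
    return accs[-1] - val_accs[-1]

# Made-up numbers purely to show the call; these are not results from any run.
toy_history = ([4.0, 3.0], [0.10, 0.30], [4.1, 3.5], [0.09, 0.20])
gap = generalization_gap(toy_history)
print(f"gap: {gap:.2f}")
```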

## 4.1. L2 Regularization

`L2 Regularization` penalizes large weights in a `Dense` module, which reduces the effective complexity of the model.

The `Dense` module you implemented takes an `l2_coef` argument as input. In this problem, we try different values for `l2_coef`.
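
As a reminder of what `l2_coef` does, the sketch below adds a penalty `l2_coef * sum(W**2)` to the loss and checks its gradient numerically. This is only one common convention, assumed here for illustration; some implementations use `l2_coef / 2` instead, so verify against your own `Dense` module. The helper names are hypothetical.

```python
import numpy as np

def l2_penalty(W, l2_coef):
    """Penalty term added to the data loss: l2_coef * sum of squared weights."""
    return l2_coef * np.sum(W ** 2)

def l2_penalty_grad(W, l2_coef):
    """Gradient of the penalty with respect to W, added to the data-loss gradient."""
    return 2.0 * l2_coef * W

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
l2_coef = 1e-2

# Central-difference check of one entry of the analytic gradient.
h = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += h
W_minus[0, 0] -= h
numeric = (l2_penalty(W_plus, l2_coef) - l2_penalty(W_minus, l2_coef)) / (2 * h)
analytic = l2_penalty_grad(W, l2_coef)[0, 0]
print(numeric, analytic)
```

A larger `l2_coef` pulls the weights more strongly toward zero at every update, which trades training accuracy for a smaller gap to validation accuracy.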

The optimizer and the model structure are the same as in 3.3.

We train each model for 50 epochs with `batch_size = 1024`. We also plot the history of training.

### 4.1.1. `l2_coef = 1e-2`

In [None]:
model = Model(Adam(learning_rate=1e-1, beta1=.9, beta2=.999))
# add the layers to the model. don't forget to set l2_coef for both Dense modules.
history = model.fit(X_train, y_train, X_val, y_val, batch_size=1024, epochs=50)
plot_history(history)

### 4.1.2. `l2_coef = 1e-3`

In [None]:
model = Model(Adam(learning_rate=1e-1, beta1=.9, beta2=.999))
# add the layers to the model. don't forget to set l2_coef for both Dense modules.
history = model.fit(X_train, y_train, X_val, y_val, batch_size=1024, epochs=50)
plot_history(history)

### 4.1.3. `l2_coef = 1e-4`

In [None]:
model = Model(Adam(learning_rate=1e-1, beta1=.9, beta2=.999))
# add the layers to the model. don't forget to set l2_coef for both Dense modules.
history = model.fit(X_train, y_train, X_val, y_val, batch_size=1024, epochs=50)
plot_history(history)

### 4.1.4. `l2_coef = 1e-5`

In [None]:
model = Model(Adam(learning_rate=1e-1, beta1=.9, beta2=.999))
# add the layers to the model. don't forget to set l2_coef for both Dense modules.
history = model.fit(X_train, y_train, X_val, y_val, batch_size=1024, epochs=50)
plot_history(history)

### 4.1.5. Compare the Results
**Compare the results of the models above.**

write your answer here

## 4.2. Dropout
A more common regularization method is `Dropout`. In this problem, we try different values of `keep_prob` for the `Dropout` module and compare the results.

The optimizer and the model structure are the same as in 3.3. The only difference is that we add a `Dropout` module right after the `Sigmoid` activation.
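
For reference, the sketch below shows inverted dropout, a widely used variant in which kept activations are rescaled by `1 / keep_prob` during training so that nothing needs to change at inference time. Whether your `Dropout` module from Part 1 uses this exact scaling is an assumption; the function `dropout_forward` is a hypothetical helper.

```python
import numpy as np

def dropout_forward(x, keep_prob, training, rng):
    """Inverted dropout: zero units with probability 1 - keep_prob, rescale the rest."""
    if not training:
        return x  # inference: identity, no rescaling needed
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
mean_by_keep_prob = {}
for keep_prob in (0.3, 0.5, 0.7):
    out = dropout_forward(activations, keep_prob, training=True, rng=rng)
    mean_by_keep_prob[keep_prob] = out.mean()
    print(keep_prob, out.mean())
```

The rescaling keeps the expected activation unchanged for every `keep_prob`, while a smaller `keep_prob` injects more noise and therefore regularizes more strongly.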

We train each model for 50 epochs with `batch_size = 1024`. We also plot the history of training.

### 4.2.1. `keep_prob = 0.3`

In [None]:
model = Model(Adam(learning_rate=1e-1, beta1=.9, beta2=.999))
# add the layers to the model.
history = model.fit(X_train, y_train, X_val, y_val, batch_size=1024, epochs=50)
plot_history(history)

### 4.2.2. `keep_prob = 0.5`

In [None]:
model = Model(Adam(learning_rate=1e-1, beta1=.9, beta2=.999))
# add the layers to the model.
history = model.fit(X_train, y_train, X_val, y_val, batch_size=1024, epochs=50)
plot_history(history)

### 4.2.3. `keep_prob = 0.7`

In [None]:
model = Model(Adam(learning_rate=1e-1, beta1=.9, beta2=.999))
# add the layers to the model.
history = model.fit(X_train, y_train, X_val, y_val, batch_size=1024, epochs=50)
plot_history(history)

### 4.2.4. Compare the Results
**Compare the results of the models above.**

write your answer here

## 4.3. Compare L2 Regularization and Dropout
**Which worked better, L2 Regularization or Dropout?**

write your answer here