# 1 Feedforward Neural Network

In [69]:
# Example code for reading the data and the initial weights and biases.
# Note: This is just an example of how to read these files, you can modify the code in your own implementation.

import numpy as np
import random

train_x, train_y = np.load('train_x.npy'), np.load('train_y.npy')
test_x, test_y = np.load('test_x.npy'), np.load('test_y.npy')

print('shape of data:')
print(train_x.shape)
print(train_y.shape)
print(test_x.shape)
print(test_y.shape)


checkpoint = np.load('weights.npy', allow_pickle=True).item()
init_weights = checkpoint['w']
init_biases = checkpoint['b']

print('shape of weights:')
for w in init_weights:
    print(w.shape)
    

print()

print('shape of biases:')
for b in init_biases:
    print(b.shape)

shape of data:
(4500, 784)
(4500,)
(500, 784)
(500,)
shape of weights:
(784, 2048)
(2048, 512)
(512, 5)

shape of biases:
(2048,)
(512,)
(5,)


## 1-1
<style>
    .red {
        color: red;
    }
    .blue {
        color: skyblue;
    }
</style>

Design an FNN model architecture and initialize it with the weights and biases stored in the file “<span class="blue">weights.npy</span>”. 

Run the <span class="red">backpropagation</span> algorithm and use the <span class="red">mini-batch SGD</span> (stochastic gradient descent) 
$$
    \mathbf{w}^{(\tau+1)}=\mathbf{w}^{(\tau)}-\eta \nabla J\left(\mathbf{w}^{(\tau)}\right)
$$
to optimize the parameters (<span class="blue">the weights and biases</span>),
where $\eta$ is the learning rate and $\nabla J$ is the gradient of the loss. 
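
As a minimal sketch of the update rule above, the following applies one SGD step to a list of parameter arrays. The helper name `sgd_step` and the toy objective are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def sgd_step(params, grads, lr=0.01):
    """Return new parameters after one step w <- w - lr * grad (hypothetical helper)."""
    return [p - lr * g for p, g in zip(params, grads)]

# Toy objective J(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = [np.array([1.0, -2.0])]
grad = [w[0].copy()]
w_new = sgd_step(w, grad, lr=0.01)
print(w_new[0])  # every entry shrinks by 1%
```

In the actual network the same step is applied to every weight matrix and bias vector, with the gradients coming from backpropagation on one mini-batch.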

<span class="red">You should implement the FNN training under the following settings:</span>

- number of layers: 3
- number of neurons in each layer (in order): 2048, 512, 5
- activation function for each layer (in order): relu, relu, softmax
- number of training epochs: 30
- learning rate: 0.01
- batch size: 200
- **important note**: For 1(a), <span class="red">DO NOT RESHUFFLE THE DATA.</span> We have already shuffled the data for you.

Reshuffling will make <span class="blue">your result differ from our ground-truth result</span>, and <span class="red">any difference will cost you points.</span>

On the same note, when splitting the samples into batches, split them in the given sample order.
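
The settings above fix how many parameter updates happen. The sketch below splits 4500 samples into ordered batches of 200 and counts the iterations. It assumes the last, smaller batch is kept rather than dropped, which the assignment does not state explicitly; `batch_slices` is a hypothetical helper.

```python
def batch_slices(num_samples, batch_size):
    """Yield (start, stop) index pairs that cover the data in the given order."""
    for start in range(0, num_samples, batch_size):
        yield start, min(start + batch_size, num_samples)

num_samples, batch_size, epochs = 4500, 200, 30
slices = list(batch_slices(num_samples, batch_size))
iters_per_epoch = len(slices)            # 22 full batches plus one partial batch
total_iters = iters_per_epoch * epochs
logged_points = total_iters // 25        # one curve point every 25 iterations
print(iters_per_epoch, slices[-1], total_iters, logged_points)
```

Under that assumption there are 23 iterations per epoch (the last batch holds samples 4400 to 4499), 690 iterations in total, and 27 recorded points on each learning curve.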

(a) **Plot** the <span class="blue">learning curves</span> of $J(\mathbf{w})$ and the classification <span class="blue">accuracy</span> <span class="blue">for every 25 iterations</span>, on both the training data and the test data. Also **show** the final loss and accuracy values.
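
One possible way to draw the requested curves is sketched below with matplotlib, assuming the loss and accuracy have been recorded once every 25 iterations into four lists. The function name `plot_learning_curves` and the placeholder numbers are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; omit this line inside a notebook
import matplotlib.pyplot as plt

def plot_learning_curves(train_loss, test_loss, train_acc, test_acc, every=25):
    """Plot loss and accuracy histories recorded once per `every` iterations."""
    iterations = [every * (k + 1) for k in range(len(train_loss))]
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
    ax_loss.plot(iterations, train_loss, label="train")
    ax_loss.plot(iterations, test_loss, label="test")
    ax_loss.set_xlabel("iteration")
    ax_loss.set_ylabel("loss J(w)")
    ax_loss.legend()
    ax_acc.plot(iterations, train_acc, label="train")
    ax_acc.plot(iterations, test_acc, label="test")
    ax_acc.set_xlabel("iteration")
    ax_acc.set_ylabel("accuracy")
    ax_acc.legend()
    fig.tight_layout()
    return fig

# Placeholder values that only demonstrate the call; they are not results.
fig = plot_learning_curves([1.6, 1.2, 0.9], [1.6, 1.3, 1.0],
                           [0.2, 0.5, 0.7], [0.2, 0.45, 0.65])
```

The returned figure can be shown inline in a notebook or written to disk with `fig.savefig(...)`.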

In [137]:
# create a class for the feedforward neural network
class FeedforwardNeuralNetwork:
    def __init__(self, init_weights, init_biases, lr = 0.01, epoch = 30, batch_size = 200):
        # copy as float64 so that training does not modify the loaded parameters in place
        self.weights = [np.array(w, dtype = np.float64) for w in init_weights]
        self.biases = [np.array(b, dtype = np.float64) for b in init_biases]
        self.num_layers = len(self.weights)
        self.num_neurons = [b.shape[0] for b in self.biases]
        self.lr = lr
        self.epoch = epoch
        self.batch_size = batch_size
        # learning curves, recorded every 25 iterations
        self.train_loss, self.train_acc = [], []
        self.test_loss, self.test_acc = [], []

    def softmax(self, z):
        # subtract the row-wise maximum for numerical stability
        e = np.exp(z - np.max(z, axis = 1, keepdims = True))
        return e / np.sum(e, axis = 1, keepdims = True)

    def forward_propagation(self, x):
        # return the network input followed by the output of every layer
        activations = [x]
        for i in range(self.num_layers):
            z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
            # relu for the hidden layers, softmax for the output layer
            if i != self.num_layers - 1:
                activations.append(np.maximum(z, 0))
            else:
                activations.append(self.softmax(z))
        return activations

    def backpropagation(self, x, y):
        # gradient of the mean cross-entropy loss over the batch
        # (the reduction, mean rather than sum, is an assumption)
        grad_w = [None] * self.num_layers
        grad_b = [None] * self.num_layers
        # forward
        activations = self.forward_propagation(x)
        # one-hot targets; assumes integer class labels 0..4
        onehot_y = np.eye(self.num_neurons[-1])[y.astype(int)]
        # backward: for softmax with cross-entropy, dJ/dz = (p - t) / batch size
        delta = (activations[-1] - onehot_y) / len(x)
        for i in range(self.num_layers - 1, -1, -1):
            grad_w[i] = np.dot(activations[i].T, delta)
            grad_b[i] = np.sum(delta, axis = 0)
            if i > 0:
                # propagate through the weights and the relu derivative
                delta = np.dot(delta, self.weights[i].T) * (activations[i] > 0)
        return grad_w, grad_b

    def evaluate(self, x, y):
        # mean cross-entropy loss and accuracy on (x, y)
        prob = self.forward_propagation(x)[-1]
        y = y.astype(int)
        loss = -np.mean(np.log(prob[np.arange(len(y)), y] + 1e-12))
        acc = np.mean(np.argmax(prob, axis = 1) == y)
        return loss, acc

    # train on the train data in the given order (no shuffling)
    def train_model(self, train_x, train_y, test_x, test_y):
        iteration = 0
        for i in range(self.epoch):
            for j in range(0, len(train_x), self.batch_size):
                batch_x = train_x[j:j+self.batch_size]
                batch_y = train_y[j:j+self.batch_size]
                # backward
                grad_w, grad_b = self.backpropagation(batch_x, batch_y)
                # update the weights and biases
                for k in range(self.num_layers):
                    self.weights[k] -= self.lr * grad_w[k]
                    self.biases[k] -= self.lr * grad_b[k]
                iteration += 1
                # record the learning curves every 25 iterations
                if iteration % 25 == 0:
                    loss, acc = self.evaluate(train_x, train_y)
                    self.train_loss.append(loss)
                    self.train_acc.append(acc)
                    loss_t, acc_t = self.evaluate(test_x, test_y)
                    self.test_loss.append(loss_t)
                    self.test_acc.append(acc_t)
                    print("Epoch: %d, Iteration: %d, Loss: %f, Accuracy: %f" % (i, iteration, loss, acc))
        return self.weights, self.biases

    # test the test data (no dropout or other training-only behaviour here)
    def test_model(self, test_x, test_y):
        loss, acc = self.evaluate(test_x, test_y)
        print("Test loss: %f, Test accuracy: %f" % (loss, acc))
        return loss, acc

In [138]:
FNN = FeedforwardNeuralNetwork(init_weights, init_biases, lr = 0.01, epoch = 30, batch_size = 200)
# train the model
FNN.train_model(train_x, train_y, test_x, test_y)
# final loss and accuracy on the train data
print("Train loss: %f, Train accuracy: %f" % FNN.evaluate(train_x, train_y))
# test the model
FNN.test_model(test_x, test_y)

(b) **Repeat 1(a)** with <span class="red">zero initialization</span> of the model weights, and **discuss the results.**
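
To see what zero initialization does to the gradients, the sketch below runs one backward pass on a tiny ReLU network with a softmax output. It assumes the biases are set to zero as well as the weights (the assignment only mentions the weights), and the layer sizes, data, and function name are toy choices, not the assignment's.

```python
import numpy as np

def gradients(x, y, weights, biases):
    """Backprop for a ReLU network with a softmax output and mean cross-entropy."""
    acts = [x]
    for i, (w, b) in enumerate(zip(weights, biases)):
        z = acts[-1] @ w + b
        if i < len(weights) - 1:
            acts.append(np.maximum(z, 0))
        else:
            e = np.exp(z - z.max(axis=1, keepdims=True))
            acts.append(e / e.sum(axis=1, keepdims=True))
    # softmax + cross-entropy: dJ/dz = (p - one_hot) / batch size
    delta = (acts[-1] - np.eye(weights[-1].shape[1])[y]) / len(x)
    grad_w, grad_b = [], []
    for i in range(len(weights) - 1, -1, -1):
        grad_w.insert(0, acts[i].T @ delta)
        grad_b.insert(0, delta.sum(axis=0))
        if i > 0:
            delta = (delta @ weights[i].T) * (acts[i] > 0)
    return grad_w, grad_b

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
y = np.array([0, 0, 0, 0, 1, 1])     # toy labels, unbalanced on purpose
sizes = [4, 3, 3, 2]
weights = [np.zeros((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
grad_w, grad_b = gradients(x, y, weights, biases)
print([float(np.abs(g).max()) for g in grad_w])
print([float(np.abs(g).max()) for g in grad_b])
```

In this toy setting every weight gradient is exactly zero and only the output bias receives a nonzero gradient, so the network can learn nothing beyond the class frequencies. Whether the same holds in the assignment depends on how the biases are initialized.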

## 1-2

Based on the model in 1, please <span class="blue">implement the dropout layers</span> and apply them <span class="blue">after the first two hidden layers</span>, i.e. the layers with 2048 and 512 neurons. 

The <span class="blue">dropout rate should be set as 0.2</span> for both layers. 

Note that the dropout operation <span class="blue">should only be applied in the training phase</span> and should be disabled in the test phase.
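
A minimal sketch of such a dropout layer follows. It uses the common "inverted dropout" convention, where surviving activations are scaled by 1/(1 - rate) during training so that nothing needs rescaling at test time. That scaling choice, the function name `dropout`, and the use of NumPy's generator are assumptions, since the assignment does not prescribe them.

```python
import numpy as np

def dropout(a, rate=0.2, training=True, rng=None):
    """Inverted dropout: zero each unit with probability `rate` while training."""
    if not training or rate == 0.0:
        return a, None                      # test phase: identity, no mask
    rng = np.random.default_rng() if rng is None else rng
    # keep-mask scaled so the expected activation is unchanged
    mask = (rng.random(a.shape) >= rate) / (1.0 - rate)
    return a * mask, mask

rng = np.random.default_rng(42)
a = np.ones((1000, 512))
dropped, mask = dropout(a, rate=0.2, training=True, rng=rng)
kept_fraction = float((dropped > 0).mean())   # close to 0.8
same, _ = dropout(a, training=False)          # unchanged in the test phase
print(kept_fraction)
```

During backpropagation the gradient flowing into a dropout layer would be multiplied by the same `mask` that was used in the forward pass.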

(a) **Train** the model by using the same settings in 1 and **repeat 1(a).**

(b) Based on the experimental results, how do the dropout layers affect the model performance, and why? Please **discuss.**

## 1-3

Based on the model in 1, please implement mini-batch SGD (stochastic gradient descent).

In this problem, we need to reshuffle the data in every batch. Note that the other settings remain the same. 

Please set the random seed to **42**, and use the **random** library that we have imported.
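
The reshuffling can be sketched with the standard `random` library as below. This version reshuffles the sample order once at the start of each epoch, which is one reading of the instruction above and should be treated as an assumption. `shuffled_batches` is a hypothetical helper and the data is a toy array.

```python
import random
import numpy as np

def shuffled_batches(x, y, batch_size, epochs, seed=42):
    """Yield mini-batches, reshuffling the sample order at the start of each epoch."""
    random.seed(seed)
    indices = list(range(len(x)))
    for _ in range(epochs):
        random.shuffle(indices)             # in-place shuffle of the index list
        for start in range(0, len(indices), batch_size):
            idx = indices[start:start + batch_size]
            yield x[idx], y[idx]            # the same indices keep x and y aligned

x = np.arange(10).reshape(10, 1)
y = np.arange(10)
batches = list(shuffled_batches(x, y, batch_size=4, epochs=2))
repeat = list(shuffled_batches(x, y, batch_size=4, epochs=2))
print(len(batches))
```

Shuffling an index list rather than the arrays themselves keeps inputs and labels paired, and fixing the seed makes the batch order reproducible across runs.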


(a) **Plot** the <span class="blue">learning curves</span> of $J(\mathbf{w})$ and the classification <span class="blue">accuracy for every 25 iterations.</span> Please **show** the final values of loss and accuracy.


(b) Based on the experimental results, how does the <span class="blue">process of reshuffling images</span> affect the model performance, and why? Please **discuss.**