# NNIA 18/19 Project 2:  Gradient Descent & Backpropagation

## Deadline: 4. January 2018, 23:59

## 1. Multinomial Logistic Regression and Cross Validation $~$ (12 points)

In this exercise, you will implement a [multinomial logistic regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression) model with tensorflow for Fashion-MNIST dataset. Cross Validation will be used to find the best **regularization parameter** $\lambda$ for the L2-regularization term. Fashion-MNIST dataset is similar to the sklearn Digit dataset you used in the Project 1. It contains 60,000 training images and 10,000 testing images. Each example is a 28×28 grayscale image, associated with a label from 10 classes.

![Neural Network](https://s3-eu-central-1.amazonaws.com/zalando-wp-zalando-research-production/2017/08/fashion-mnist-sprite.png)

Multinomial logistic regression is a probabilistic, linear classifier. It is parametrized by a weight matrix $W$ and a bias vector $b$. Classification is done by projecting an input vector onto a set of hyperplanes, each of which corresponds to a class. The distance from the input to a hyperplane reflects the probability that the input is a member of the corresponding class.

Mathematically, the probability that an input vector $\bf{x} \in \mathbb{R}^p$ is a member of a class $i$ can be written as:
$$P(Y=i|\textbf{x}, W, b) = softmax(W\textbf{x} + b)_i = \frac{e^{W_i\textbf{x} + b_i}}{\sum_j{e^{W_j\textbf{x} + b_j}}}$$
where $W \in \mathbb{R}^{c \times p}$, $b \in \mathbb{R}^c$ and $W_i \in \mathbb{R}^p$.

The model’s prediction $y_{pred}$ is the class whose probability is maximal, specifically:
$$y_{pred} = argmax_iP(Y=i|\textbf{x}, W, b)$$

We use cross-entropy loss with L2 regularization.

### 1.1 Dataset and Normalization

Load **Fashion-MNIST** dataset and normalized it.

In [1]:
import os
import struct
import numpy as np
import tensorflow as tf

fashion_mnist = tf.keras.datasets.fashion_mnist
(X_trainval, Y_trainval), (X_test, Y_test) = fashion_mnist.load_data()

In [2]:
X_trainval = np.reshape(X_trainval, (X_trainval.shape[0], X_trainval.shape[1] * X_trainval.shape[2]))
print('The X_trainval has the following shape:')
print('Rows: %d, columns: %d' % (X_trainval.shape[0], X_trainval.shape[1]))

The X_trainval has the following shape:
Rows: 60000, columns: 784


In [3]:
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1] * X_test.shape[2]))
print('The X_test has the following shape:')
print('Rows: %d, columns: %d' % (X_test.shape[0], X_test.shape[1]))

The X_test has the following shape:
Rows: 10000, columns: 784


Normalize the data. Subtract the mean and divide by the standard deviation.

In [4]:
def data_normalization(X_trainval, X_test):
    # TODO: Implement
    # subtract the mean
    mean = np.mean(X_trainval, axis=0)
    X_trainval = (X_trainval - mean)
    # standard deviation
    std = np.std(X_trainval, axis=0)
    # no zero std index
    idx = std != 0
    X_trainval_normalized = X_trainval[:, idx] / std[idx]

    # subtract the mean
    X_test = (X_test - mean)
    X_test_normalized = X_test[:, idx] / std[idx]
    return X_trainval_normalized, X_test_normalized

In [5]:
# The normalization should be done on X_train and X_test. 
# The normalized data should have the exactly same shape as the original data matrix.

X_trainval, X_test = data_normalization(X_trainval, X_test)
print('The X_trainval has the following shape:')
print('Rows: %d, columns: %d' % (X_trainval.shape[0], X_trainval.shape[1]))
print('The X_test has the following shape:')
print('Rows: %d, columns: %d' % (X_test.shape[0], X_test.shape[1]))

The X_trainval has the following shape:
Rows: 60000, columns: 784
The X_test has the following shape:
Rows: 10000, columns: 784


---
**Points:** $0.0$ of $1.0$
**Comments:** None

---

### 1.2 Define the Computation Graph

In [6]:
# Here the global configuration of this program is 
# defined, which you shouldn't change.

class global_config(object):
    lr = 0.0001  # learning rate
    img_h = 28  # image height
    img_w = 28  # image width
    num_class = 10  # number of classes
    num_epoch = 20  # number of training epochs
    batch_size = 16  # batch size
    K = 3  # K-fold cross validation
    num_train = None  # the number of training data
    lambd = None  # the factor for the L2-regularization

config = global_config()
config.num_train = X_trainval.shape[0]

In [7]:
def train_val_split(X_trainval, Y_trainval, i, K):
    """
    sklearn library is not allowed to use here.
    
    K is the total number of folds and i is the current fold.
    
    Think about how to deal with the case when the number of 
    training data can't be divided by K evenly.
    """
    # TODO: Implement
    lens = np.ceil(X_trainval.shape[0] / K).astype(int)
    begin_idx = i * lens
    end_idx = min(X_trainval.shape[0], begin_idx + lens - 1)  # substract 1 due to idx start from 0
    X_val = X_trainval[begin_idx:end_idx, :]
    Y_val = Y_trainval[begin_idx:end_idx]
    X_train = np.delete(X_trainval, np.linspace(begin_idx, end_idx, lens), axis=0)
    Y_train = np.delete(Y_trainval, np.linspace(begin_idx, end_idx, lens), axis=0)
    return X_train, X_val, Y_train, Y_val

---
**Points:** $0.0$ of $2.0$
**Comments:** None

---

In [8]:
def shuffle_train_data(X_train, Y_train):
    """called after each epoch"""
    perm = np.random.permutation(len(Y_train))
    Xtr_shuf = X_train[perm]
    Ytr_shuf = Y_train[perm]
    return Xtr_shuf, Ytr_shuf

In [9]:
"""
training
"""
class logistic_regression(object):
    
    def __init__(self, X, Y_gt, config, name):
        """
        :param X: the training batch, which has the shape [batch_size, n_features].
        :param Y_gt: the corresponding ground truth label vector.
        :param config: the hyper-parameters you need for the implementation.
        :param name: the name of this logistic regression model which is used to
                     avoid the naming confict with the help of tf.variable_scope and reuse.
       
        Define the computation graph within the variable_scope here. 
        First define two variables W and b with tf.get_variable.
        Then do the forward pass.
        Then compute the cross entropy loss with tensorflow, don't forget the L2-regularization.
        The Adam optimizer is already given. You shouldn't change it.
        Finally compute the accuracy for one batch
        """
        self.config = config
        with tf.variable_scope(name, reuse=tf.AUTO_REUSE):
            # TODO: Define two variables and the forward pass.
            self._w = tf.get_variable('w', [X.shape[1], config.num_class])
            self._b = tf.get_variable('b', [config.num_class])

            # forward pass
            self._logits = tf.matmul(X, self._w) + self._b
            # TODO: Compute the cross entropy loss with L2-regularization.
            self.Y_gt_onehot = tf.one_hot(Y_gt, self.config.num_class, 1, 0)
            self._loss = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(labels=self.Y_gt_onehot,
                                                                 logits=self._logits)) + self.config.lambd * tf.nn.l2_loss(self._w)
            # Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent
            # to update network weights iteratively.
            # It will be introduced in the lecture when talking about the optimization algorithms.
            self._train_step = tf.train.AdamOptimizer(config.lr).minimize(self._loss)

            # TODO: Compute the accuracy
            self._predict = tf.argmax(self._logits, 1)
            self._num_acc = tf.reduce_sum(tf.cast(tf.equal(self._predict, Y_gt), tf.float32))

            
    @property
    def train_op(self):
        return self._train_step
    
    @property
    def loss(self):
        return self._loss
    
    @property
    def num_acc(self):
        return self._num_acc

---
**Points:** $0.0$ of $2.0$
**Comments:** None

---

In [10]:
def testing(model, X_test, Y_test, config):
    """ 
    Go through the X_test and use sess.run() to compute the loss and accuracy.
    
    Return the total loss and the accuracy for X_test.
    
    Note that this function will be used for the validation data
    during training and the test data after training.
    """
    num_test = X_test.shape[0]
    total_cost = 0
    accs = 0
    # TODO: Implement
    total_cost = sess.run(model.loss, feed_dict={X: X_test, Y_gt: Y_test})
    accs = sess.run(model.num_acc, feed_dict={X: X_test, Y_gt: Y_test})
    return total_cost / len(Y_test), accs / len(Y_test)

---
**Points:** $0.0$ of $2.0$
**Comments:** None

---

In [11]:
def train(model, X_train, X_val, Y_train, Y_val, config):
    """
    Train the model with sess.run().
    
    You should shuffle the data after each epoch and
    evaluate training and validation loss after each epoch.
    
    Return the lists of the training/validation loss and accuracy.
    """
    cost_trains = []
    acc_trains = []
    cost_vals = []
    acc_vals = []
    
    for i in range(config.num_epoch):
        # TODO: Implement
        #with tf.Session() as sess:
        # Initialize the variables of the model
        # sess.run(tf.global_variables_initializer())???
        sess.run(model.train_op, feed_dict={X: X_train, Y_gt: Y_train})
        cost_train = sess.run(model.loss, feed_dict={X: X_train, Y_gt: Y_train}) / len(Y_train)  # average cost
        acc_train = sess.run(model.num_acc, feed_dict={X: X_train, Y_gt: Y_train}) / len(Y_train)  # average accuracy

        cost_trains.append(cost_train)
        acc_trains.append(acc_train)
        print("Epoch: %d :" % (i + 1))
        print("Train Loss: %f" % cost_train)
        print("Training acc: %f" % acc_train)

        cost_val, acc_val = testing(model, X_val, Y_val, config)
        cost_vals.append(cost_val)
        acc_vals.append(acc_val)
        print("Validation Loss: %f" % cost_val)
        print("Validation acc: %f" % acc_val)
        ## shuffle the data
        X_train, Y_train = shuffle_train_data(X_train, Y_train)
    return cost_trains, acc_trains, cost_vals, acc_vals

---
**Points:** $0.0$ of $2.0$
**Comments:** None

---

### 1.3 Cross Validation

Implement cross validation to find an optimal value of $\lambda$. The optimal hyper-parameters should be determined by the validation accuracy. The test set should only be used in the very end after all other processing, e.g. hyper-parameter choosing.

In [13]:
"""
Initialization
"""
# Use cross validation to choose the best lambda for the L2-regularization from the list below
lambda_list = [100, 1, 0.1]


X = tf.placeholder(tf.float32, [None, config.img_h * config.img_w])
Y_gt = tf.placeholder(tf.int64, [None, ])

for lambd in lambda_list:
    val_loss_list = []
    config.lambd = lambd
    print("lambda is %f" % lambd)
    
    for i in range(config.K):
        # Prepare the training and validation data
        X_train, X_val, Y_train, Y_val = train_val_split(X_trainval, Y_trainval, i, config.K)
        
        # For each lambda and K, we build a new model and train it from scratch
        model = logistic_regression(X, Y_gt, config, name=str(lambd)+'_'+str(config.K))
        
        with tf.Session() as sess:
            
            # Initialize the variables of the model
            sess.run(tf.global_variables_initializer())
            
            # Train the model
            cost_trains, acc_trains, cost_vals, acc_vals = train(model, X_train, X_val, Y_train, Y_val, config)
            
        val_loss_list.append(cost_vals[-1])
        
    print("The validation loss for lambda %f is %f" % (lambd, np.mean(val_loss_list)))
    

lambda is 100.000000


  app.launch_new_instance()


Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See @{tf.nn.softmax_cross_entropy_with_logits_v2}.

Epoch: 1 :
Train Loss: 2.877019
Training acc: 0.106475
Validation Loss: 2.912482
Validation acc: 0.104455
Epoch: 2 :
Train Loss: 2.836704
Training acc: 0.112150
Validation Loss: 2.872011
Validation acc: 0.110656
Epoch: 3 :
Train Loss: 2.796931
Training acc: 0.119550
Validation Loss: 2.832082
Validation acc: 0.116956
Epoch: 4 :
Train Loss: 2.757710
Training acc: 0.126200
Validation Loss: 2.792709
Validation acc: 0.122556
Epoch: 5 :
Train Loss: 2.719054
Training acc: 0.133200
Validation Loss: 2.753901
Validation acc: 0.128956
Epoch: 6 :
Train Loss: 2.680970
Training acc: 0.140400
Validation Loss: 2.715669
Validation acc: 0.135007
Epoch: 7 :
Train Loss: 2.643469
Training acc: 0.147350
Validation Loss: 2.678022
Validation acc: 0.142257
Epoch: 8 :
Train Loss: 2.606558
Training acc: 0.154725
Validation 

Epoch: 14 :
Train Loss: 2.860222
Training acc: 0.154700
Validation Loss: 2.856380
Validation acc: 0.156858
Epoch: 15 :
Train Loss: 2.827863
Training acc: 0.158800
Validation Loss: 2.824334
Validation acc: 0.161808
Epoch: 16 :
Train Loss: 2.795995
Training acc: 0.163325
Validation Loss: 2.792783
Validation acc: 0.167158
Epoch: 17 :
Train Loss: 2.764614
Training acc: 0.168150
Validation Loss: 2.761721
Validation acc: 0.171759
Epoch: 18 :
Train Loss: 2.733714
Training acc: 0.172800
Validation Loss: 2.731143
Validation acc: 0.176159
Epoch: 19 :
Train Loss: 2.703289
Training acc: 0.178300
Validation Loss: 2.701042
Validation acc: 0.181359
Epoch: 20 :
Train Loss: 2.673333
Training acc: 0.182150
Validation Loss: 2.671412
Validation acc: 0.185759
Epoch: 1 :
Train Loss: 3.633360
Training acc: 0.038075
Validation Loss: 3.650145
Validation acc: 0.037152
Epoch: 2 :
Train Loss: 3.587719
Training acc: 0.039750
Validation Loss: 3.604565
Validation acc: 0.039152
Epoch: 3 :
Train Loss: 3.542515
Trainin

Validation Loss: 2.037929
Validation acc: 0.309215
Epoch: 11 :
Train Loss: 2.020798
Training acc: 0.315125
Validation Loss: 2.013581
Validation acc: 0.314516
Epoch: 12 :
Train Loss: 1.996841
Training acc: 0.320700
Validation Loss: 1.989748
Validation acc: 0.322216
Epoch: 13 :
Train Loss: 1.973400
Training acc: 0.327325
Validation Loss: 1.966429
Validation acc: 0.328716
Epoch: 14 :
Train Loss: 1.950474
Training acc: 0.333050
Validation Loss: 1.943622
Validation acc: 0.334667
Epoch: 15 :
Train Loss: 1.928059
Training acc: 0.339050
Validation Loss: 1.921322
Validation acc: 0.341667
Epoch: 16 :
Train Loss: 1.906150
Training acc: 0.345200
Validation Loss: 1.899525
Validation acc: 0.347267
Epoch: 17 :
Train Loss: 1.884740
Training acc: 0.351825
Validation Loss: 1.878225
Validation acc: 0.353068
Epoch: 18 :
Train Loss: 1.863823
Training acc: 0.358075
Validation Loss: 1.857415
Validation acc: 0.360418
Epoch: 19 :
Train Loss: 1.843392
Training acc: 0.364375
Validation Loss: 1.837087
Validation 

### 1.4 Combine Train and Validation data.

Use the hyper-parameters you choose from the cross validation to re-train the model.

In [14]:
config.lambd =   1.000000 #TODO: Choose the best lambda
model = logistic_regression(X, Y_gt, config, name='trainval')
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    cost_trains, acc_trains, cost_tests, acc_tests = train(model, X_trainval, X_test, Y_trainval, Y_test, config)

print("The final test acc is %f" % acc_tests[-1])

Epoch: 1 :
Train Loss: 2.925089
Training acc: 0.088350
Validation Loss: 2.951783
Validation acc: 0.086800
Epoch: 2 :
Train Loss: 2.889350
Training acc: 0.091467
Validation Loss: 2.916056
Validation acc: 0.091000
Epoch: 3 :
Train Loss: 2.854046
Training acc: 0.095017
Validation Loss: 2.880759
Validation acc: 0.094800
Epoch: 4 :
Train Loss: 2.819185
Training acc: 0.098283
Validation Loss: 2.845899
Validation acc: 0.098100
Epoch: 5 :
Train Loss: 2.784773
Training acc: 0.102350
Validation Loss: 2.811486
Validation acc: 0.102300
Epoch: 6 :
Train Loss: 2.750820
Training acc: 0.106100
Validation Loss: 2.777527
Validation acc: 0.106300
Epoch: 7 :
Train Loss: 2.717330
Training acc: 0.110583
Validation Loss: 2.744026
Validation acc: 0.110200
Epoch: 8 :
Train Loss: 2.684308
Training acc: 0.114967
Validation Loss: 2.710989
Validation acc: 0.115300
Epoch: 9 :
Train Loss: 2.651758
Training acc: 0.119567
Validation Loss: 2.678420
Validation acc: 0.119300
Epoch: 10 :
Train Loss: 2.619683
Training acc:

---
**Points:** $0.0$ of $0.5$
**Comments:** None

---

### 1.5 Questions

1. What is the impact of k in k-fold cross validation?

2. What will happen to the training if you change the $\lambda$ for L2-regularization?

3. Why do we perform the gradient descent on a batch of the data rather than all of the data?

4. Why does the loss increase, when the learning rate is too large?

5. Do we apply L2-regularization for the bias $b$?

*Answer:* 
* 1. balance the bias and variance 
* 2. if we change the lambda for L2-regularization, the weights of leaset loss maybe change.
* 3. if your dataset is large, each gradient descent step is too expensive 
* 4. too large of a learning reate causes drastic parameters updatas which lead to divergent from mimima, in this case the loss increase
* 5. no

---
**Points:** $0.0$ of $2.5$
**Comments:** None

---

## 2. Getting to know Back-Propagation in details $~$ (18 points)

In the following exercise you would build a **feed-forward network** from scratch using **only** Numpy. For this, you also have to implement **Back-propagation** in python. Additionally, this network should have the option of **L2 regularization** enabled within it.

**Before you start**: In this exercise you will implement a single hidden layer feedforward neural network. In case you are unfamiliar with the terminology and notation used here, please consult chapter 6 of the Deep Learning Book before you proceed.

Generally speaking, a feedword neural network with a single hidden layer can be represented by the following function $$ f(x;\theta) = f^{(2)}(f^{(1)}(f^{(0)}(x)))$$ where $f^{(0)}(x)$ is the input layer, $f^{(1)}(.)$ is the so called hidden layer, and $f^{(2)}(.)$ is the ouput layer of the network. $\theta$ represents the parameters of the network whose values will be learned during the training phase.

The network that you will implement in this exercise has the following layers:
* $f^{(0)}(x) = \mathbf{X}$, with $\mathbf{X} \in \mathbb{R}^{b,p}$ where $b$ is the batch size and $p$ is the number of features.
* $f^{(1)}(.) = \sigma(\mathbf{X} \mathbf{W_1}+b_1)$, with $\mathbf{X} \in \mathbb{R}^{b, p}$, $\mathbf{W_1} \in \mathbb{R}^{p,u_1}$, $\textbf{b}_1 \in \mathbb{R}^{u_1}$ where $u_1$ is the number of **hidden units**. Additonally, $\sigma(x) = \frac{1}{1 + \exp{(-x})}$ is the **sigmoid** function.
* $f^{(2)}(.) = softmax(\mathbf{X} \mathbf{W_2}+b_2)$, with $\mathbf{X} \in \mathbb{R}^{b, u_1}$, $\mathbf{W_2} \in \mathbb{R}^{u_1,u_2}$, $\textbf{b}_2 \in \mathbb{R}^{u_2}$ where $u_2$ is the number of **output classes** in this particular layer.

Note that both, $\sigma(x)$ are applied **elementwise**. Further, the addition with the bias vector is also applied **elementwise** to each row of the matrix $\mathbf{X} \mathbf{W}$.

In [None]:
import numpy as np


class Fully_connected_Neural_Network(object):
    """ Fully-connected neural network with one hidden layer.

    Parameters
    ------------
    n_output : int
        Number of class labels.
        
    n_features : int
        Number of input features.
        
    n_hidden : int
        Number of hidden units.
        
    l2 : float
        regularization parameter
        0 means no regularization
        
    epochs : int
        One Epoch is when the entire dataset is passed forward and backward through the neural network only once.
        
    lr : float
        Learning rate.
        
    batchsize : int
        Total number of training examples present in a single batch.
        

    Attributes
    -----------
    w1 : array, shape = [n_features, n_hidden_units]
        Weight matrix for input layer -> hidden layer.
    w2 : array, shape = [n_hidden_units, n_output_units]
        Weight matrix for hidden layer -> output layer.
    b1 : array, shape = [n_hidden_units, ]
        Bias for input layer-> hidden layer.
    b2 : array, shape = [n_output_units, ]
        Bias for hidden layer -> output layer.

    """
    # Points: 2.0
    def __init__(self, n_output, n_features, n_hidden=30,
                 l2=0.0, epochs=50, lr=0.001, batchsize=1):
        self.n_output = n_output
        self.n_features = n_features
        self.n_hidden = n_hidden
        self.l2 = l2
        self.epochs = epochs
        self.lr = lr
        self.batchsize = batchsize
        #TODO Initialize weights and biases with np.random.uniform or np.random.normal and specify the shape
        self.w1 = np.random.uniform(0, 1, [self.n_features, self.n_hidden])
        self.w2 = np.random.uniform(0, 1, [self.n_hidden, self.n_output])
        self.b1 = np.random.uniform(0, 1, [self.n_hidden,])
        self.b2 = np.random.uniform(0, 1, [self.n_output,])
        
    # Points: 0.5
    def sigmoid(self, z):
        """Compute sigmoid function"""
        #TODO Implement
        return 1.0 / (1.0 + np.exp(-z))

    # Points: 0.5
    def sigmoid_gradient(self, z):
        """Compute gradient of the sigmoid function"""
        #TODO Implement
        sig = self.sigmoid(z)
        return sig * (1 - sig)
    
    # Points: 1.0
    def softmax(self, z):
        """Compute softmax function.
        Implement a stable version which 
        takes care of overflow and underflow.
        """        
        #TODO Implement
        exp_z = np.exp(z - np.max(z))
        return exp_z / np.sum(exp_z)
    
    def softmax_gradient(self,z):
        """
        Compute gradient of the softmax function
        """
        SM = z.reshape((-1, 1))
        return np.diag(z) - np.dot(SM, SM.T)
    
    # Points: 2.0
    def forward(self, X):
        """Compute feedforward step

        Parameters
        -----------
        X : array, shape = [n_samples, n_features]
            
        Returns
        ----------
        z2 : array,
            Input of the hidden layer.
        a2 : array,
            Output of the hidden layer.
        z3 : array,
            Input of the output layer.
        a3 : array,
            Output of the output layer.

        """
        # TODO Implement
        z2 = X
        a2 = self.sigmoid(np.matmul(z2, self.w1) + self.b1)
        z3 = a2
        a3 = self.softmax(np.matmul(z3, self.w2) + self.b2)
        
        return z2, a2, z3, a3
        
    # Points: 0.5
    def L2_regularization(self, lambd):
        """Implement L2-regularization loss"""
        #TODO Implement
        #I add in the loss function
        
    # Points: 2.0
    def loss(self, y_enc, output, epsilon=1e-12):
        """Implement the cross-entropy loss.

        Parameters
        ----------
        y_enc : array, one-hot encoded labels.
        
        output : array, output of the output layer
        
        epsilon: used to turn log(0) into log(epsilon)

        Returns
        ---------
        cost : float, total loss.

        """
        #TODO Implement
        pred = np.clip(output, epsilon, 1. - epsilon)
        N = pred.shape[0]
        cost = -np.sum(y_enc * np.log(pred + epsilon)) / N
        if self.l2:
            cost = cost + self.l2 * (np.sum(self.w1**2) + np.sum(self.w2**2))
        return cost
        
        
    # Points: 4.0
    def compute_gradient(self, X, a2, a3, z2, y_enc):
        """ Compute gradient using backpropagation.

        Parameters
        ------------
        X : array, Input.
        a2 : array, output of the hidden layer.
        a3 : array, output of the output layer.
        z2 : array, input of the hidden layer.
        y_enc : array, one-hot encoded labels.

        Returns
        ---------
        grad1 : array, Gradient of the weight matrix w1.
        grad2 : array, Gradient of the weight matrix w2.
        grad3 : array, Gradient of the bias vector b1.
        grad4 : array, Gradient of the bias vector b2.
        """
        #TODO Implement
        N = y_enc.shape[0]
        grad4 = a3[range(idx.shape[0]),idx] - 1
        grad3 = 1
        grad2 = 1
        grad1 = 1

        return grad1, grad2, grad3, grad4
        
    # Points: 1.0
    def inference(self, X):
        """Predict class labels

        Parameters
        -----------
        X : array, Input.

        Returns:
        ----------
        y_pred : array, Predicted labels.

        """
        # TODO Implement
        _, _, _, y_pred = self.forward(X)
        return y_pred
    
    
    def shuffle_train_data(self, X, Y):
        """called after each epoch"""
        perm = np.random.permutation(Y.shape[0])
        X_shuf = X[perm]
        Y_shuf = Y[perm]
        return X_shuf, Y_shuf
    
    def to_one_hot(self, Y, num_class):
        return np.eye(Y.shape[0], num_class)[Y]
    
    # Points: 2.0
    def train(self, X_train, Y_train, verbose=False):
        """ Fit the model.

        Parameters
        -----------
        X : array, Input.
        y : array, Ground truth class labels.
        verbose : bool, Print the training progress

        Returns:
        ----------
        self

        """
        #TODO Initialization
        self.cost_ = []
        

        for i in range(self.epochs):
        
            if verbose:
                print('\nEpoch: %d/%d' % (i+1, self.epochs))

            bs = 0
            nsamples = X_train.shape[0]
            # perform a single epoch
            while bs < nsamples:
                X = X_train[bs: bs+ self.batchsize]
                Y = Y_train[bs: bs+ self.batchsize]
                # feedforward and loss computation
                z2, a2, z3, a3 = self.forward(X)
                y_enc = self.to_one_hot(Y, self.n_output)
                # compute gradient via backpropagation and update the weights
                grad1, grad2, grad3, grad4 = self.compute_gradient(X, a2, a3, z2, y_enc)
                self.w1 = self.w1 - self.lr * grad1
                self.w2 = self.w2 - self.lr * grad2
                self.b1 = self.b1 - self.lr * grad3
                self.b2 = self.b2 - self.lr * grad4
                # cost for one iteration
                _, _, _, a3 = self.forward(X)
                cost = self.loss(y_enc, a3)
                self.cost_itera.append(cost)
                bs = bs + self.batchsize
            # cost for one epoch
            _, _, _, a3 = self.forward(X_train)
            y_enc = self.to_one_hot(Y_train, self.n_output)
            cost = self.loss(y_enc, a3)
            print('\nloss: %f' % (cost))
            self.cost_.append(cost)
            #shuffle the data
            X_train, Y_train = self.shuffle_train_data(X_train, Y_train)

        return self

---
**Points:** $0.0$ of $15.5$
**Comments:** None

---

In [None]:
nn = Fully_connected_Neural_Network(n_output=10, 
                                    n_features=X_trainval.shape[1], 
                                    n_hidden=50, 
                                    l2=0.1, 
                                    epochs=1000, 
                                    lr=0.001,
                                    batchsize=50)

In [None]:
nn.train(X_trainval, Y_trainval, verbose=True)

In [None]:
import matplotlib.pyplot as plt

# Plot the training error for every iteration
# in every epoch

x = range(nn.epochs)
nn.cost_itera.shape[0]
bs = 0
iteras = np.floor(X_trainval.shape[0] / nn.batchsize)
for i in range(x.shape[0]):
    y = nn.cost_itera[bs, bs+iteras]
    plt.plot(x,y)
    bs = bs + iteras
# TODO Implement
plt.plot(range(self.epochs), self.cost_)
plt.xlabel("iteration")
plt.ylabel("error")
plt.title('training error')
plt.tight_layout()
plt.show()

---
**Points:** $0.0$ of $1.0$
**Comments:** None

---

In [None]:
# Plot the training error in every epoch
# TODO Implement
plt.plot(x, nn.cost_)
plt.xlabel("epoch")
plt.ylabel("training error")
plt.title("training error")
plt.tight_layout()
plt.show()


---
**Points:** $0.0$ of $1.0$
**Comments:** None

---

In [None]:
# Compute Training Accuracy
def accuracy(nn, X, Y):
    Y_one_hot = nn.inference(X)
    Y_labels = np.argmax(Y_one_hot, axis=1)
    acc = np.sum(np.equal(Y_labels, Y).astype(np.float64))
    return acc

# TODO Implement
accuracy(nn, X_trainval, Y_trainval)
print('Training accuracy: %.2f%%' % (acc * 100))

# Compute Test Accuracy
# TODO Implement
accuracy(nn, X_test, Y_test)
print('Test accuracy: %.2f%%' % (acc * 100))

---
**Points:** $0.0$ of $0.5$
**Comments:** None

---

## Submission instructions
You should provide a single Jupyter notebook (.ipynb file) as the solution. Put the names and student ids of your team members below. **Make sure to submit only 1 solution to only 1 tutor.**

- Pengqiu Li 2575746
- Wentao Liu 2572849
- Fei Chen 2567445

## Points: 0.0 of 30.0 points