<div style="color: green; font-weight: bold">For the ReLULayer, you use different functions in forward and backward, such as the Heaviside function, which is a convenient way to compute the gradient of the ReLU function. Apart from that, your implementation follows the same method as the example.</div>
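
To illustrate the remark about the Heaviside function, here is a minimal sketch (the arrays `x` and `upstream` are made up for the example). It checks that `np.heaviside(x, 0)` selects the same entries as the comparison `x > 0`, so both routes give the same ReLU gradient.

```python
import numpy as np

# hypothetical pre-activations and upstream gradient, chosen for illustration
x = np.array([[-2.0, 0.0, 3.0],
              [ 1.5, -0.5, 0.0]])
upstream = np.ones_like(x)

# Heaviside step with value 0 at x == 0 (the second argument)
grad_heaviside = upstream * np.heaviside(x, 0)

# common alternative: boolean mask of the strictly positive entries
grad_mask = upstream * (x > 0)

same = np.array_equal(grad_heaviside, grad_mask)
print(same)
```

Both variants set the gradient to zero at `x == 0`; that is a convention, since ReLU is not differentiable there.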

<div style="color: green; font-weight: bold">For the OutputLayer, your functions are almost identical to the example and are correct.</div>
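
One way to confirm that a softmax plus cross-entropy gradient is correct is a finite-difference check. The sketch below is not part of the submission: the helper names `softmax`, `mean_cross_entropy` and `analytic_gradient` and the random test data are assumptions made for the example. It compares the formula "softmax minus one-hot, divided by the batch size" against a numerical derivative of the mean cross-entropy loss.

```python
import numpy as np

def softmax(z):
    # subtract the row-wise maximum for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_cross_entropy(z, labels):
    # mean negative log-probability of the true class
    p = softmax(z)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

def analytic_gradient(z, labels):
    grad = softmax(z)  # fresh array, safe to modify in place
    grad[np.arange(len(labels)), labels] -= 1.0
    return grad / len(labels)

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))        # hypothetical pre-activations
labels = np.array([0, 2, 1, 2])    # hypothetical true classes

# central finite differences, one entry of z at a time
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        z_plus = z.copy()
        z_plus[i, j] += eps
        z_minus = z.copy()
        z_minus[i, j] -= eps
        numeric[i, j] = (mean_cross_entropy(z_plus, labels)
                         - mean_cross_entropy(z_minus, labels)) / (2 * eps)

max_abs_diff = np.max(np.abs(numeric - analytic_gradient(z, labels)))
print(max_abs_diff)
```

A difference close to machine precision indicates that the analytic formula and the loss agree.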

<div style="color: green; font-weight: bold">For the LinearLayer, your functions are basically the same as the example, except for the shape of the bias. A different bias shape also gives the right answer as long as no transpose is applied to it in the calculation, so your solution is correct as well.</div>
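
The point about the bias shape comes down to NumPy broadcasting. The following sketch uses made-up sizes and random data (none of it is from the submission) to show that a bias of shape `(1, n_out)` and one of shape `(n_out,)` produce the same pre-activations, and that a bias gradient matching either shape is obtained by summing the upstream gradient over the batch axis only.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out = 4, 3, 2          # hypothetical sizes
X = rng.normal(size=(batch, n_in))
B = rng.normal(size=(n_in, n_out))
b_row = rng.normal(size=(1, n_out))   # bias stored as a row vector
b_flat = b_row.ravel()                # same values stored as a 1-D vector

# both shapes broadcast over the batch dimension in the same way
out_row = X @ B + b_row
out_flat = X @ B + b_flat
same_output = np.allclose(out_row, out_flat)

# bias gradient: sum the upstream gradient over the batch axis only
upstream = rng.normal(size=(batch, n_out))
grad_b = upstream.sum(axis=0, keepdims=True)   # shape (1, n_out), matches b_row
grad_scalar = np.sum(upstream)                 # no axis given: collapses to one number

print(same_output, grad_b.shape, np.ndim(grad_scalar))
```

Note that `np.sum` without an `axis` argument returns a single number, which would apply the same update to every bias entry.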

<div style="color: green; font-weight: bold">For the backward part of the MLP, your method is basically right, but the range of layers you iterate over is wrong. The parameters are contained in the layer before the last layer, so you should start from that layer, as in the example, instead of from the last layer.</div>
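
The traversal order can be made concrete with a toy list. The layer names below are placeholders, not objects from the submission. The output layer's `backward` takes the posteriors and the labels, so it is handled first and separately; the remaining layers are then visited in reverse, starting with the last linear layer.

```python
# hypothetical layer stack of a small MLP, in forward order
layers = ["Linear1", "ReLU1", "Linear2", "Output"]

# the output layer starts backpropagation
first = layers[-1]

# all other layers follow in reverse order, beginning with the last linear layer
rest = list(reversed(layers[:-1]))

print(first, rest)
```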

<div style="color: green; font-weight: bold">Your evaluation part is entirely correct, and you also compare the influence of different layer sizes. Very good!</div>

In [4]:
import numpy as np
from sklearn import datasets

####################################

class ReLULayer(object):
    def forward(self, input):
        # remember the input for later backpropagation
        self.input = input
        # return the ReLU of the input
        relu = np.maximum(0,input) # definition of Relu function element-wise
        return relu

    def backward(self, upstream_gradient):
        # compute the derivative of ReLU from upstream_gradient and the stored input
        downstream_gradient = upstream_gradient*np.heaviside(self.input,0) # derivative of Relu is heaviside of input
        return downstream_gradient

    def update(self, learning_rate):
        pass # ReLU is parameter-free

####################################

class OutputLayer(object):
    def __init__(self, n_classes):
        self.n_classes = n_classes

    def forward(self, input):
        # remember the input for later backpropagation
        self.input = input
        # return the softmax of the input
        exp_values = np.exp(input - np.max(input, axis=1, keepdims=True)) # subtract the row-wise max to prevent overflow
        softmax=exp_values/np.sum(exp_values, axis=1, keepdims=True)
        return softmax

    def backward(self, predicted_posteriors, true_labels):
        # return the loss derivative with respect to the stored inputs
        # (use cross-entropy loss and the chain rule for softmax,
        #  as derived in the lecture)
        
        # the gradient of the cross-entropy loss w.r.t. the softmax inputs is the softmax itself,
        # except for the entry of the true label, where it is softmax - 1
        downstream_gradient = predicted_posteriors.copy() # copy so that the caller's array is not modified in place
        downstream_gradient[range(len(true_labels)), true_labels] -= 1
        downstream_gradient = downstream_gradient/len(true_labels) # average over the mini-batch
        return downstream_gradient

    def update(self, learning_rate):
        pass # softmax is parameter-free

####################################

class LinearLayer(object):
    def __init__(self, n_inputs, n_outputs):
        self.n_inputs  = n_inputs
        self.n_outputs = n_outputs
        # randomly initialize weights and intercepts
        self.B = np.random.normal(size=(n_inputs, n_outputs)) # initialize random weight matrix with n_inputs rows and n_outputs columns
        self.b = np.random.normal(size=(1,n_outputs)) # initialize random bias vector --> has to have same dimension as output

    def forward(self, input):
        # remember the input for later backpropagation
        self.input = input
        # compute the scalar product of input and weights
        # (these are the preactivations for the subsequent non-linear layer)
        preactivations = np.dot(self.input, self.B) + self.b # Linear combination of input with weights as in weight matrix and bias vector added afterwards
        return preactivations

    def backward(self, upstream_gradient):
        # compute the derivative of the weights from
        # upstream_gradient and the stored input
        self.grad_b = np.sum(upstream_gradient, axis=0, keepdims=True) # derivative w.r.t. b is 1 in each entry --> chain rule gives the sum of upstream_gradient over the batch axis (shape (1, n_outputs), same as self.b)
        self.grad_B = np.dot(self.input.T, upstream_gradient) # layer is linear, so the derivative w.r.t. the weights is the input --> chain rule gives the product of input.T and upstream_gradient
        # compute the downstream gradient to be passed to the preceding layer
        downstream_gradient = np.dot(upstream_gradient, self.B.T) # derivative of Z_l w.r.t. Z_{l-1} is B since the layer is linear with weights B
        return downstream_gradient

    def update(self, learning_rate):
        # update the weights by batch gradient descent
        self.B = self.B - learning_rate * self.grad_B
        self.b = self.b - learning_rate * self.grad_b

####################################

class MLP(object):
    def __init__(self, n_features, layer_sizes):
        # construct a multi-layer perceptron
        # with ReLU activation in the hidden layers and softmax output
        # (i.e. it predicts the posterior probability of a classification problem)
        #
        # n_features: number of inputs
        # len(layer_sizes): number of layers
        # layer_sizes[k]: number of neurons in layer k
        # (specifically: layer_sizes[-1] is the number of classes)
        self.n_layers = len(layer_sizes)
        self.layers   = []

        # create interior layers (linear + ReLU)
        n_in = n_features
        for n_out in layer_sizes[:-1]:
            self.layers.append(LinearLayer(n_in, n_out))
            self.layers.append(ReLULayer())
            n_in = n_out

        # create last linear layer + output layer
        n_out = layer_sizes[-1]
        self.layers.append(LinearLayer(n_in, n_out))
        self.layers.append(OutputLayer(n_out))

    def forward(self, X):
        # X is a mini-batch of instances
        batch_size = X.shape[0]
        # flatten the other dimensions of X (in case instances are images)
        X = X.reshape(batch_size, -1)

        # compute the forward pass
        # (implicitly stores internal activations for later backpropagation)
        result = X
        for layer in self.layers:
            result = layer.forward(result)
        return result

    def backward(self, predicted_posteriors, true_classes):
        # perform backpropagation w.r.t. the prediction for the latest mini-batch X
        downstream_gradient = self.layers[-1].backward(predicted_posteriors, true_classes) # first step of backpropagation 
        for layer in reversed(self.layers[0:-1]): #iterate through remaining layers in reverse order (excluding the last layer)
            downstream_gradient = layer.backward(downstream_gradient) 

    def update(self, X, Y, learning_rate):
        posteriors = self.forward(X)
        self.backward(posteriors, Y)
        for layer in self.layers:
            layer.update(learning_rate)

    def train(self, x, y, n_epochs, batch_size, learning_rate):
        N = len(x)
        n_batches = N // batch_size
        for i in range(n_epochs):
            # print("Epoch", i)
            # reorder data for every epoch
            # (i.e. sample mini-batches without replacement)
            permutation = np.random.permutation(N)

            for batch in range(n_batches):
                # create mini-batch
                start = batch * batch_size
                x_batch = x[permutation[start:start+batch_size]]
                y_batch = y[permutation[start:start+batch_size]]

                # perform one forward and backward pass and update network parameters
                self.update(x_batch, y_batch, learning_rate)

##################################

if __name__=="__main__":

    # set training/test set size
    N = 2000

    # create training and test data
    X_train, Y_train = datasets.make_moons(N, noise=0.05)
    X_test,  Y_test  = datasets.make_moons(N, noise=0.05)
    n_features = 2
    n_classes  = 2

    # rescale features to [-1, 1] (using the min and max of the training set)
    offset  = X_train.min(axis=0)
    scaling = X_train.max(axis=0) - offset
    X_train = ((X_train - offset) / scaling - 0.5) * 2.0
    X_test  = ((X_test  - offset) / scaling - 0.5) * 2.0

    # set hyperparameters (play with these!) --> define 4 different networks with different layer sizes
    layer_sizes_1 = [2, 2, n_classes]
    layer_sizes_2 = [3, 3, n_classes]
    layer_sizes_3 = [5, 5, n_classes]
    layer_sizes_4 = [30, 30, n_classes]
    n_epochs = 100
    batch_size = 5
    learning_rate = 0.06

    # create network
    network_1 = MLP(n_features, layer_sizes_1)
    network_2 = MLP(n_features, layer_sizes_2)
    network_3 = MLP(n_features, layer_sizes_3)
    network_4 = MLP(n_features, layer_sizes_4)
    
    # train networks
    network_1.train(X_train, Y_train, n_epochs, batch_size, learning_rate)
    network_2.train(X_train, Y_train, n_epochs, batch_size, learning_rate)
    network_3.train(X_train, Y_train, n_epochs, batch_size, learning_rate)
    network_4.train(X_train, Y_train, n_epochs, batch_size, learning_rate)

    # test
    predicted_posteriors_1 = network_1.forward(X_test)
    predicted_posteriors_2 = network_2.forward(X_test)
    predicted_posteriors_3 = network_3.forward(X_test)
    predicted_posteriors_4 = network_4.forward(X_test)
    # determine class predictions from posteriors by winner-takes-all rule
    predicted_classes_1 = np.argmax(predicted_posteriors_1, axis=1) # to determine winner we return index with highest value in the predicted_posteriors
    predicted_classes_2 = np.argmax(predicted_posteriors_2, axis=1)
    predicted_classes_3 = np.argmax(predicted_posteriors_3, axis=1)
    predicted_classes_4 = np.argmax(predicted_posteriors_4, axis=1)
    # compute and output the error rate of predicted_classes
    error_rate_1 = (np.sum(predicted_classes_1 != Y_test))/(len(Y_test)) # (#wrong predictions)/(#total test instances)
    error_rate_2 = (np.sum(predicted_classes_2 != Y_test))/(len(Y_test))
    error_rate_3 = (np.sum(predicted_classes_3 != Y_test))/(len(Y_test))
    error_rate_4 = (np.sum(predicted_classes_4 != Y_test))/(len(Y_test))
    print("error rate network_1:", error_rate_1)
    print("error rate network_2:", error_rate_2)
    print("error rate network_3:", error_rate_3)
    print("error rate network_4:", error_rate_4)

error rate network_1: 0.115
error rate network_2: 0.099
error rate network_3: 0.0
error rate network_4: 0.0
