# Lab 2

### Objectives

1. Understand Backpropagation
2. Write a neural network with one or more hidden layers
3. Solve the XOR problem
4. Understand how to build general classifiers

The cell below reuses the relevant code from the previous lab, with the weights and biases resized for the two-layer network.

In [8]:
import numpy as np

W1 = np.random.randn(3, 2)
B1 = np.random.randn(3)
W2 = np.random.randn(1, 3)
B2 = np.random.randn(1)

def sigm(X, W, B):
    M = 1/(1+np.exp(-(X.dot(W.T)+B)))
    return M

def diff_W(X, Z, Y, B, W):

    dS = sigm(X, W, B)*(1-sigm(X, W, B)) # differentiating sigm function
    dW = (Y-Z)*dS

    return X.T.dot(dW) # dot product between X transpose and dW

def diff_B(X, Z, Y, B, W):

    dS = sigm(X, W, B)*(1-sigm(X, W, B))
    dB = (Y-Z)*dS

    return dB.sum(axis=0)

X = np.random.randint(2, size=[15, 2]) # produces an array size [15, 2] containing either 0 or 1
Z = np.array( [X[:,0] ^ X[:,1] ]).T

X_Test = np.random.randint(2, size=[15, 2])
Y_Test = np.array( [X_Test[:,0] ^ X_Test[:,1] ]).T # XOR labels for the test inputs, shape [15, 1] like Z

learning_rate = 0.1

<img src=".\i\lab2.png" width="400"> </br>
**Why have the dimensions of the weights and biases changed from lab 1?** </br>
The structure in the image above shows that the network for this lab has 2 inputs (each 0 or 1), 2 layers (the first layer has three sigmoids, the second layer has one sigmoid) and one output. </br> </br>

W1: 3 sets of weights for the 3 sigmoids in layer 1. 2 weights in each set corresponding to the two inputs. </br>
B1: 3 biases for the 3 sigmoids in layer 1.</br></br>

W2: Layer 2 has a single sigmoid, so there is one set of weights. This set contains 3 weights, one for each of the 3 outputs from the previous layer. </br>
B2: 1 bias for the 1 sigmoid in layer 2.

</br></br>
**Why do we need 3 sigmoids in layer 1?** </br>
We don't. 3 was an arbitrary choice. We need at least 2. You can try with 2 and should see similar results.
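The shapes described above can be checked directly. The following is a small sketch that re-declares `sigm` from the first cell and uses a hypothetical all-zero batch, `X_demo`, purely to trace array shapes through the two layers.

```python
import numpy as np

def sigm(X, W, B):
    # Same sigmoid layer as in the first cell
    return 1 / (1 + np.exp(-(X.dot(W.T) + B)))

# Hypothetical batch: 15 samples with 2 inputs each (the values do not matter here)
X_demo = np.zeros((15, 2))

W1 = np.random.randn(3, 2)  # 3 sigmoids, 2 weights each
B1 = np.random.randn(3)     # 1 bias per layer-1 sigmoid
W2 = np.random.randn(1, 3)  # 1 sigmoid, 3 weights (one per layer-1 output)
B2 = np.random.randn(1)     # 1 bias for the layer-2 sigmoid

H = sigm(X_demo, W1, B1)    # hidden-layer output
Y = sigm(H, W2, B2)         # network output

print(H.shape)  # (15, 3)
print(Y.shape)  # (15, 1)
```

Changing the first dimension of `W1` and `B1` (and the second dimension of `W2`) to 2 gives the two-sigmoid variant mentioned above.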

We add a forward function that reflects the network topology shown above.

In [9]:
def Forward(X, W1, B1, W2, B2):
    #first layer

    H = sigm(X, W1, B1)

    #second layer

    Y = sigm(H, W2, B2)

    # We return both the final output and the output from the hidden layer

    return Y, H

### Derivation of backpropagation functions

// insert derivation here
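A sketch of one derivation that reproduces the update functions in the next cell. It assumes a squared-error loss $E$, which is not stated in the lab but is consistent with the $(Z - Y)$ factor in the code. The symbols $A_1$, $A_2$, $\delta_1$, $\delta_2$ and $\eta$ (the learning rate) are introduced here for illustration.

```latex
\begin{aligned}
% Forward pass (each row is one sample)
A_1 &= X W_1^{\top} + B_1, & H &= \sigma(A_1), \\
A_2 &= H W_2^{\top} + B_2, & Y &= \sigma(A_2), \\[4pt]
% Assumed loss and the sigmoid derivative
E &= \tfrac{1}{2} \sum_{n} (Z_n - Y_n)^2, & \sigma'(a) &= \sigma(a)\,\bigl(1 - \sigma(a)\bigr), \\[4pt]
% Output layer
\delta_2 &= -\frac{\partial E}{\partial A_2} = (Z - Y) \odot Y \odot (1 - Y), \\
-\frac{\partial E}{\partial W_2^{\top}} &= H^{\top} \delta_2, &
-\frac{\partial E}{\partial B_2} &= \sum_{n} \delta_{2,n}, \\[4pt]
% Hidden layer (chain rule back through W_2 and the hidden sigmoid)
\delta_1 &= -\frac{\partial E}{\partial A_1} = (\delta_2 W_2) \odot H \odot (1 - H), \\
-\frac{\partial E}{\partial W_1^{\top}} &= X^{\top} \delta_1, &
-\frac{\partial E}{\partial B_1} &= \sum_{n} \delta_{1,n}, \\[4pt]
% Gradient-descent update
W &\leftarrow W + \eta \left( -\frac{\partial E}{\partial W} \right), &
B &\leftarrow B + \eta \left( -\frac{\partial E}{\partial B} \right).
\end{aligned}
```

Under this reading, `diff_W2`, `diff_B2`, `diff_W1` and `diff_B1` return the negative gradients $H^{\top}\delta_2$, $\sum_n \delta_{2,n}$, $X^{\top}\delta_1$ and $\sum_n \delta_{1,n}$. That is why the training loop adds them to the parameters, after transposing the weight terms to match the shapes of `W2` and `W1`.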

In [10]:
def diff_B2(Z, Y):
    dB = (Z-Y)*Y*(1-Y)
    return dB.sum(axis=0)

def diff_W2(H, Z, Y):
    dW = (Z-Y)*Y*(1-Y)
    return H.T.dot(dW)

def diff_W1(X, H, Z, Y, W2):
    dZ = (Z-Y).dot(W2)*Y*(1-Y)*H*(1-H)
    return X.T.dot(dZ)

def diff_B1(Z, Y, W2, H):
    return ((Z-Y).dot(W2)*Y*(1-Y)*H*(1-H)).sum(axis=0)

Unlike the previous lab, the update rules do not call the sigmoid function again. Instead, they take the layer outputs already computed in the forward pass (H for the hidden layer, Y for the output layer) and use the fact that the derivative of a sigmoid output s is s(1-s). The results are the same; which expression you use is simply a matter of readability and compactness of code.
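The claim that both forms give the same result can be checked numerically. The sketch below re-declares `sigm` and `diff_W2`, compares the recomputed and cached sigmoid derivatives, and then compares `diff_W2` against a finite-difference gradient. The `loss` helper assumes a half summed squared error, and the seeded generator `rng` is only there to make the check repeatable; neither is part of the lab code.

```python
import numpy as np

def sigm(X, W, B):
    return 1 / (1 + np.exp(-(X.dot(W.T) + B)))

def diff_W2(H, Z, Y):
    dW = (Z - Y) * Y * (1 - Y)
    return H.T.dot(dW)

def loss(H, W2, B2, Z):
    # Assumed loss: half the summed squared error
    return 0.5 * np.sum((Z - sigm(H, W2, B2)) ** 2)

rng = np.random.default_rng(0)
X = rng.integers(2, size=(15, 2))
Z = X[:, [0]] ^ X[:, [1]]            # XOR targets, shape (15, 1)
W1 = rng.standard_normal((3, 2))
B1 = rng.standard_normal(3)
W2 = rng.standard_normal((1, 3))
B2 = rng.standard_normal(1)

H = sigm(X, W1, B1)
Y = sigm(H, W2, B2)

# 1) Derivative recomputed through sigm vs. taken from the cached output H
dS_recomputed = sigm(X, W1, B1) * (1 - sigm(X, W1, B1))
dS_cached = H * (1 - H)
same_derivative = np.allclose(dS_recomputed, dS_cached)

# 2) Central finite differences on each entry of W2
eps = 1e-6
numeric = np.zeros_like(W2)
for j in range(W2.shape[1]):
    step = np.zeros_like(W2)
    step[0, j] = eps
    numeric[0, j] = (loss(H, W2 + step, B2, Z) - loss(H, W2 - step, B2, Z)) / (2 * eps)

# diff_W2 returns the negative gradient, transposed relative to W2
analytic = -diff_W2(H, Z, Y).T
gradients_match = np.allclose(numeric, analytic, atol=1e-6)

print(same_derivative, gradients_match)
```

The same finite-difference loop can be pointed at `W1`, `B1` or `B2` to check the other update functions.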

In [11]:
learning_rate = 0.1

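for epoch in range(500):
    Y, H = Forward(X, W1, B1, W2, B2)

    # Compute every gradient from the same forward pass before updating,
    # so diff_W1 and diff_B1 use the W2 that actually produced Y and H
    dW2 = diff_W2(H, Z, Y).T
    dB2 = diff_B2(Z, Y)
    dW1 = diff_W1(X, H, Z, Y, W2).T
    dB1 = diff_B1(Z, Y, W2, H)

    W2 = W2 + learning_rate * dW2
    B2 = B2 + learning_rate * dB2
    W1 = W1 + learning_rate * dW1
    B1 = B1 + learning_rate * dB1
    if not epoch % 50:
        # "Accuracy" here is 1 minus the mean squared error, not a classification rate
        Accuracy = 1 - np.mean((Z-Y)**2)
        print("Epoch: ", epoch, " Accuracy: ", Accuracy)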

Epoch:  0  Accuracy:  0.5038360735248066
Epoch:  50  Accuracy:  0.7594914978953156
Epoch:  100  Accuracy:  0.7699804088500772
Epoch:  150  Accuracy:  0.7840129363956222
Epoch:  200  Accuracy:  0.8045644446560197
Epoch:  250  Accuracy:  0.8334493491672781
Epoch:  300  Accuracy:  0.8681677922567888
Epoch:  350  Accuracy:  0.9016801000017931
Epoch:  400  Accuracy:  0.9285187679319801
Epoch:  450  Accuracy:  0.9476763961131898


### Softmax

Softmax turns the numeric outputs of the last linear layer of a multi-class classification neural network into probabilities that sum to 1.

The exponential of each output is taken, then normalized by dividing by the sum of all the exponentials. The softmax function is:
<img src=".\i\softmax.png" width="400"> </br>
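Written out, the standard definition is $\mathrm{softmax}(a)_i = e^{a_i} / \sum_j e^{a_j}$. The following is a minimal sketch for a single vector of scores. The helper name `softmax_vector` and the example scores are illustrative, and subtracting the maximum is a common numerical-stability step that leaves the result unchanged.

```python
import numpy as np

def softmax_vector(a):
    a = np.asarray(a, dtype=float)
    exp_a = np.exp(a - a.max())   # shifting by the max avoids overflow in exp
    return exp_a / exp_a.sum()

scores = [1.0, 2.0, 3.0]          # hypothetical outputs of a final linear layer
probs = softmax_vector(scores)

print(probs)        # approximately [0.090, 0.245, 0.665]
print(probs.sum())  # 1.0 up to floating-point rounding
```

The largest score receives the largest probability, but every class keeps a non-zero share.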

Softmax is an activation function. Because it is a function of several variables (every output depends on all of the inputs), it cannot be plotted as a single curve the way the sigmoid can.

**From lab notes:** This representation is ideal when your label z is a “one-hot vector”, where there is 1 one (the correct class) and N-1 zeros (the incorrect ones). For instance, imagine that you were trying to teach a network to distinguish between images of cats and dogs.
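To make the one-hot idea concrete, here is a small sketch for the cats-and-dogs case. The class order and the example labels are made up for illustration.

```python
import numpy as np

class_names = ["cat", "dog"]        # hypothetical class order: index 0 = cat, 1 = dog
labels = np.array([0, 1, 1, 0])     # hypothetical class index for four images

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(len(class_names))[labels]

print(one_hot)
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]]
```

Each row has a single 1 in the column of the correct class, so it has the same shape as one row of softmax output and can be compared with it directly.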

Softmax is usually used on the last layer of an image classification network.

Source: https://medium.com/data-science-bootcamp/understand-the-softmax-function-in-minutes-f3a59641e86d 

In [39]:
# testing softmax
import numpy as np

W1 = np.random.randn(1, 2)
B1 = np.random.randn(2)
X = np.random.randint(2, size=[15, 2])

def softmax(X, W, B): # my attempt
    exps = [np.exp(i) for i in X.dot(W.T)+B]
    sum_of_exps = sum(exps)
    softmax = [j/sum_of_exps for j in exps]
    return softmax

def softmax2(X, W, B): # correct solution
    A = X.dot(W.T) + B
    expA = np.exp(A)
    output = expA/expA.sum(axis=1, keepdims = True)
    return output

softmax2(X, W1, B1)

array([[0.92870733, 0.07129267],
       [0.92870733, 0.07129267],
       [0.92870733, 0.07129267],
       [0.92870733, 0.07129267],
       [0.92870733, 0.07129267],
       [0.92870733, 0.07129267],
       [0.92870733, 0.07129267],
       [0.92870733, 0.07129267],
       [0.92870733, 0.07129267],
       [0.92870733, 0.07129267],
       [0.92870733, 0.07129267],
       [0.92870733, 0.07129267],
       [0.92870733, 0.07129267],
       [0.92870733, 0.07129267],
       [0.92870733, 0.07129267]])

Why my attempt was wrong:
- `sum(exps)` adds the exponentials down each column (across the 15 samples) rather than along each row (across the classes). This gives 2 sums instead of 15, so each column of the result sums to 1 instead of each row.

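Note that in the returned array:
- Each row gives the probability of each class being the correct output.
- The sum of each row is 1.
- Every row is identical in this run because `W1` has shape (1, 2): `X.dot(W1.T)` produces a single score per sample, which is added equally to both classes, and softmax is unchanged when the same constant is added to every score. The probabilities are therefore set by `B1` alone. A `W1` with one row per class, shape (2, 2), would make the output depend on the input.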
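The difference between the two versions comes down to the axis along which the exponentials are summed. A small sketch with a hypothetical score matrix `A` (3 samples, 2 classes) makes this visible:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.5, 0.5],
              [3.0, 1.0]])   # hypothetical scores: 3 samples, 2 classes
expA = np.exp(A)

# Summing a list of rows (as Python's built-in sum does) adds down each column
per_column = expA / expA.sum(axis=0)

# Softmax needs one sum per row, kept as a column so it broadcasts correctly
per_row = expA / expA.sum(axis=1, keepdims=True)

print(per_column.sum(axis=0))  # each column sums to 1
print(per_row.sum(axis=1))     # each row sums to 1
```

Only `per_row` can be read as a probability distribution over classes for each sample.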

### Cross-entropy function

Cross-entropy is usually used as the loss function alongside softmax.

<img src=".\i\crossentropy.png" width="400"> </br>

**From lab notes:** Cross-entropy loss measures the performance of a classification model whose output is a probability value between 0 and 1. The error increases as the predicted probability deviates from the actual label. A perfect model would have a loss of 0.

**Example:** the expected output is [0, 1] but the output from the neural net is [0.05, 0.95].

<img src=".\i\crossentropy2.png" width="400"> </br>
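The example can be reproduced in a few lines. This sketch assumes the common definition $L = -\sum_i z_i \ln y_i$ with the natural logarithm; the helper name `cross_entropy` is illustrative.

```python
import numpy as np

def cross_entropy(z, y):
    # z: one-hot label, y: predicted probabilities (assumed to be strictly positive)
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    return -np.sum(z * np.log(y))

loss = cross_entropy([0, 1], [0.05, 0.95])
print(loss)  # about 0.0513, which is -ln(0.95)

# A less confident prediction for the same label gives a larger loss
worse = cross_entropy([0, 1], [0.5, 0.5])
print(worse)  # about 0.693, which is -ln(0.5)
```

Only the probability assigned to the correct class contributes, because every other term is multiplied by a zero in the one-hot label.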


### Updating XOR to use Softmax and Cross-entropy

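The following cell starts from the code above. The intended change is:
- Replace the sigmoid function with the softmax function.

As written, the cell defines `softmax`, but `Forward` and the update rules still use `sigm` (defined in the first cell) and the squared-error gradients, so the network is not yet trained with softmax and cross-entropy.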

In [None]:
import numpy as np

W1 = np.random.randn(3, 2)
B1 = np.random.randn(3)
W2 = np.random.randn(1, 3)
B2 = np.random.randn(1)

def softmax(X, W, B):
    A = X.dot(W.T) + B
    expA = np.exp(A)
    # normalize along each row so that the class probabilities of every sample sum to 1
    return expA/expA.sum(axis=1, keepdims = True)

def diff_W(X, Z, Y, B, W):

    dS = sigm(X, W, B)*(1-sigm(X, W, B)) # differentiating sigm function
    dW = (Y-Z)*dS

    return X.T.dot(dW) # dot product between X transpose and dW

def diff_B(X, Z, Y, B, W):

    dS = sigm(X, W, B)*(1-sigm(X, W, B))
    dB = (Y-Z)*dS

    return dB.sum(axis=0)

X = np.random.randint(2, size=[15, 2]) # produces an array size [15, 2] containing either 0 or 1
Z = np.array( [X[:,0] ^ X[:,1] ]).T

X_Test = np.random.randint(2, size=[15, 2])
Y_Test = np.array( [X_Test[:,0] ^ X_Test[:,1] ]).T # XOR labels for the test inputs, shape [15, 1] like Z

def Forward(X, W1, B1, W2, B2):
    #first layer

    H = sigm(X, W1, B1)

    #second layer

    Y = sigm(H, W2, B2)

    # We return both the final output and the output from the hidden layer

    return Y, H

def diff_B2(Z, Y):
    dB = (Z-Y)*Y*(1-Y)
    return dB.sum(axis=0)

def diff_W2(H, Z, Y):
    dW = (Z-Y)*Y*(1-Y)
    return H.T.dot(dW)

def diff_W1(X, H, Z, Y, W2):
    dZ = (Z-Y).dot(W2)*Y*(1-Y)*H*(1-H)
    return X.T.dot(dZ)

def diff_B1(Z, Y, W2, H):
    return ((Z-Y).dot(W2)*Y*(1-Y)*H*(1-H)).sum(axis=0)

learning_rate = 0.1

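for epoch in range(500):
    Y, H = Forward(X, W1, B1, W2, B2)

    # Compute every gradient from the same forward pass before updating,
    # so diff_W1 and diff_B1 use the W2 that actually produced Y and H
    dW2 = diff_W2(H, Z, Y).T
    dB2 = diff_B2(Z, Y)
    dW1 = diff_W1(X, H, Z, Y, W2).T
    dB1 = diff_B1(Z, Y, W2, H)

    W2 = W2 + learning_rate * dW2
    B2 = B2 + learning_rate * dB2
    W1 = W1 + learning_rate * dW1
    B1 = B1 + learning_rate * dB1
    if not epoch % 50:
        # "Accuracy" here is 1 minus the mean squared error, not a classification rate
        Accuracy = 1 - np.mean((Z-Y)**2)
        print("Epoch: ", epoch, " Accuracy: ", Accuracy)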
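One possible way to complete the switch to softmax and cross-entropy is sketched below. It is not the lab's reference solution and makes several assumptions: the output layer has 2 units (one per class), targets are one-hot, the gradient is averaged over the batch, and the learning rate of 0.5 and the fixed seed are arbitrary choices. It relies on the standard result that, for softmax followed by cross-entropy, the gradient of the loss with respect to the output scores is `Y - Z`.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed only so the sketch is repeatable

X = rng.integers(2, size=(15, 2))     # random XOR inputs, as in the cells above
labels = X[:, 0] ^ X[:, 1]            # class index per sample: 0 or 1
Z = np.eye(2)[labels]                 # one-hot targets, shape (15, 2)

W1 = rng.standard_normal((3, 2))
B1 = rng.standard_normal(3)
W2 = rng.standard_normal((2, 3))      # assumption: 2 output units, one per class
B2 = rng.standard_normal(2)

def sigm(X, W, B):
    return 1 / (1 + np.exp(-(X.dot(W.T) + B)))

def softmax(X, W, B):
    A = X.dot(W.T) + B
    expA = np.exp(A - A.max(axis=1, keepdims=True))   # shift for numerical stability
    return expA / expA.sum(axis=1, keepdims=True)

def Forward(X, W1, B1, W2, B2):
    H = sigm(X, W1, B1)       # hidden layer keeps the sigmoid
    Y = softmax(H, W2, B2)    # output layer uses softmax
    return Y, H

def cross_entropy(Z, Y):
    # Mean cross-entropy over the batch
    return -np.mean(np.sum(Z * np.log(Y), axis=1))

learning_rate = 0.5           # assumption: chosen for the batch-averaged gradient
losses = []

for epoch in range(2000):
    Y, H = Forward(X, W1, B1, W2, B2)
    losses.append(cross_entropy(Z, Y))

    # Negative gradients of the mean loss with respect to the pre-activations
    d2 = (Z - Y) / len(X)             # output scores, shape (15, 2)
    d1 = d2.dot(W2) * H * (1 - H)     # hidden pre-activations, shape (15, 3)

    # All gradients come from the same forward pass; then update
    W2 = W2 + learning_rate * d2.T.dot(H)
    B2 = B2 + learning_rate * d2.sum(axis=0)
    W1 = W1 + learning_rate * d1.T.dot(X)
    B1 = B1 + learning_rate * d1.sum(axis=0)

Y, _ = Forward(X, W1, B1, W2, B2)
accuracy = np.mean(Y.argmax(axis=1) == labels)   # fraction of samples classified correctly
print("first loss:", losses[0], "last loss:", losses[-1], "accuracy:", accuracy)
```

Whether the network reaches full accuracy depends on the random initialization and on which input combinations happen to appear in `X`, so the printed accuracy should be read as one run rather than a guaranteed result.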