### Exercise 3 c) (3 points)

This is intended to be a tutorial on PyTorch computation graphs. Your taks is to implement XOR. Each of the 3 subexercises is worth 1 point.

The notebook provides code for you to start with. All necessary functions and classes are already imported. If you are unsure about how to use them, you can consult the corresponding documentation page (for example [torch.matmul](https://pytorch.org/docs/stable/generated/torch.matmul.html)). You may use numpy to define the XOR tables.

In [31]:
# import necessary modules
import torch
import numpy as np # for data preparation
from torch import matmul, sigmoid
from torch.nn import MSELoss # Mean squared error

torch.manual_seed(8)

# number of epochs, learning rate and objective function should be the same for everyone
epochs = 1000
lr = 0.5
criterion = MSELoss()


## 1

Now, as everything is imported, load the data into PyTorch tensors and define the model parameters. Explain the your decisions in 1-2 sentences each:

a) How do you initialize the parameters (all zero, all ones, random distribution etc.)?

What happens if biases are initialized as torch.ones(1)?

b) Which of the variables need gradient tracking? What do you have to do to toggle it in PyTorch?


In [35]:
# data
X = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32, requires_grad=True)
Y = torch.tensor([0, 1, 1, 0], dtype=torch.float32, requires_grad=True).reshape(X.shape[0], 1)

# params
W_i = torch.rand(2, 2, requires_grad=True)
W_o = torch.rand(2, 1, requires_grad=True)
b_a = torch.zeros(1, 2, requires_grad=True)
b_o = torch.zeros(1, 1, requires_grad=True)
W_i


## What happens if biases are initialized as torch.ones(1)? - our pricdted output will be mean toward bias term if we have no input, so to avoid this, we 
## inlialize bias with 0



tensor([[0.5979, 0.8453],
        [0.9464, 0.2965]], requires_grad=True)

## 2

The next step is to implement a full forward pass of the model. This means you have to implement the computation graph built in a) and present each XOR input once to it.

In [36]:
# TODO: forward pass
def forward_pass(x, wi, wo, ba, bo):
    
    hl_activation = torch.matmul(x, wi) + ba
    hl_output = torch.sigmoid(hl_activation)
    
    output_activation = torch.matmul(hl_output, wo) + bo
    predicted_output = torch.sigmoid(output_activation)
    return (predicted_output,hl_activation)

## 3
Enhance your model by actually doing backpropagation! 

Gradient from W_o, W_i + biases

a) Employ the criterion defined in the first code cell to get a loss value and backpropagate by using the .backward() method. Then, access the gradient information in the tracked variables and perform the update. You can do so for a variable w by:

```python
w.data -= lr*w.grad.data
```

You also need to reset the gradient data after performing the update:

```python
w.grad.data.zero_()
```

You will have to loop over the data set several times to get a nice outcome. If your implementation is correct the values defined above (number of epochs, learning rate) will suffice.

In [42]:
for epoch in range(epochs):
    #TODO: Forward pass
    prediction, _ = forward_pass(X, W_i, W_o, b_a, b_o)
    #TODO: Backward pass
    loss_fun = criterion(prediction, Y)
    loss_fun.backward()
    W_i.data -= lr*W_i.grad.data
    W_o.data -= lr*W_o.grad.data
    b_a.data -= lr*b_a.grad.data
    b_o.data -= lr*b_o.grad.data
    W_i.grad.data.zero_()
    W_o.grad.data.zero_()
    
  

b) Now, test your model on all data points! Did it learn?

In [43]:
# TODO: Test

X_test = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)
Output, hidd_act = forward_pass(X_test, W_i, W_o, b_a, b_o)
test_prediction = (Output > 0.5) * 1.0
print (test_prediction, Output)

tensor([[0.],
        [1.],
        [1.],
        [0.]]) tensor([[0.0943],
        [0.8996],
        [0.8996],
        [0.1044]], grad_fn=<SigmoidBackward>)


c) Finally, print the final weight matrix W_o and the hidden activation for each item. Can you tell what the model does to separate the classes? Would it work without sigmoid activation? Explain your answer in 2-3 sentences.

In [41]:
# TODO: print W_o & hidden activation

print(W_i.data)
print(W_o.data)
print(b_a.data)
print(b_o.data)
print(hidd_act)
print(Output)

tensor([[2.9917, 0.9858],
        [3.0104, 0.6011]])
tensor([[ 2.6341],
        [-0.6243]])
tensor([[-0.3747,  0.0267]])
tensor([[-1.6616]])
tensor([[-0.3747,  0.0267],
        [ 2.6358,  0.6278],
        [ 2.6170,  1.0125],
        [ 5.6275,  1.6136]], grad_fn=<AddBackward0>)
tensor([[0.2881],
        [0.5961],
        [0.5830],
        [0.6088]], grad_fn=<SigmoidBackward>)
