#### This submission is for... (*put up to three people*)
- Erika Mustermann (87654321)
- Max Mustermann (12345678)
- Maxine Musterfrau (87651234)

# Exercise 1 - Simple Neural Network

In the first exercise, you will incrementally build a simple neural network from scratch with numpy. Our first challenge is solving the XOR task that you've seen in the lecture, before we move to a slightly more complex problem, namely the Iris dataset.

You can receive up to three points for your implementation of Exercise 1. After that, you can choose either Exercise 2A or Exercise 2B to receive another three points. In total, you can get up to six bonus points for the exam.

- **Exercise 2A**: Building a Transformer network with PyTorch applied in the NLP domain
- **Exercise 2B**: Building a GAN network with PyTorch applied in the image domain

**Important Notice**: Throughout the notebook, basic structures are provided such as functions and classes without bodies or with partial bodies, and variables that you need to assign to. **Don't change the names of functions, variables, and classes - and make sure that you are using them!** You're allowed to introduce helper variables and functions. Occasionally, we use **type annotations** that you should follow. They are not enforced by Python. Whenever you see an ellipsis `...` or a TODO comment, you're supposed to insert code.

## XOR Task

XOR (exclusive OR) is a logic function that gives 1 as an output when the number of true inputs is odd, otherwise it outputs a 0. Our goal is to model this function using neurons. We'll start with a single neuron.
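As a quick illustration of the definition above, the XOR truth table for two inputs can be written down as NumPy arrays. This is only a sketch; the names `xor_inputs` and `xor_targets` are hypothetical and not part of the exercise skeleton.

```python
import numpy as np

# All four combinations of two binary inputs (illustrative names).
xor_inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# XOR is 1 exactly when the number of true inputs is odd.
xor_targets = xor_inputs.sum(axis=1) % 2

print(xor_targets)  # [0 1 1 0]
```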

<center><img src="https://community.anaplan.com/t5/image/serverpage/image-id/29631i3AA6C01377A8550F/image-size/large?v=v2&px=999" width="250"/></center>

## A Single Neuron (Perceptron)

Let's start with importing some necessary dependencies that we will need throughout the notebook.

In [1]:
import numpy as np

In the first part of this exercise you'll build a perceptron, a single neuron, that takes two binary input values and returns a binary output value.

<center><img src="https://i.stack.imgur.com/eBSki.jpg" width="280" /></center>

A perceptron can be seen as a single neuron that maps an input $\textbf{x}$ to an output $o$ using weights $\textbf{w}$ and a bias $b$, where $\cdot$ denotes the dot product.

$o = \textbf{w}\cdot \textbf{x}+b$
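The formula above can be evaluated directly with NumPy. The following is a minimal sketch; the concrete values of `w`, `b`, and `x` are made up for illustration and do not come from the exercise.

```python
import numpy as np

# Hypothetical weights, bias, and input, chosen only for illustration.
w = np.array([0.5, -0.5])
b = 0.1
x = np.array([1.0, 0.0])

# o = w . x + b
o = np.dot(w, x) + b
print(o)
```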

#### Perceptron Update Rule

In the lecture we learned about the **Perceptron algorithm** / **Perceptron update rule** which we can apply to binary classification problems. Let's use it here to have a first baseline.

For classification problems, $o>0$ is interpreted as class 1, and $o<0$ is interpreted as class 0.

For updating the associated weights, we can use the following update rule:

$w_i = w_i + \Delta w_i$

where

$\Delta w_i = \eta(t-o)x_i$

- $t$ is the target
- $o$ is the output
- $\eta$ is the learning rate (a small constant)
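A single update step of this rule might look as follows. This is a hedged sketch, not the notebook's class: the function name `perceptron_step` is hypothetical, an activation of exactly 0 is assigned to class 0, and the bias is updated like a weight on a constant input of 1. All three are assumptions made for illustration.

```python
import numpy as np

def perceptron_step(w, b, x, t, eta=1.0):
    """One update for a single sample (illustrative helper, not part of the skeleton)."""
    # Thresholded output: class 1 if the activation is positive, else class 0 (assumption at 0).
    o = 1 if np.dot(w, x) + b > 0 else 0
    # w_i <- w_i + eta * (t - o) * x_i
    w = w + eta * (t - o) * x
    # Assumption: the bias behaves like a weight on a constant input of 1.
    b = b + eta * (t - o)
    return w, b

w, b = perceptron_step(np.zeros(2), 0.0, np.array([1.0, 0.0]), t=1)
print(w, b)
```

If target and output agree, `t - o` is zero and the weights stay unchanged, so updates only happen on misclassified samples.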

### Implementation of a Perceptron

In [None]:
class perceptron_implementation():
    def __init__(self):
        self.neuron_weights = None
        self.bias = None
        self.initialize_weights()
        
    def initialize_weights(self):
        # TODO: 
        # Initialize weights 
        # For perceptrons, it's possible to initialize the weights with 0
        ...
        # END TODO

    def forward_pass(self, x):
        # TODO
        # Implement forward propagation
        output = None

        # END TODO
        return output

    def perceptron_update_rule(self, target, prediction, learning_rate = 1):
        # TODO
        # Perform perceptron update rule that is defined above
        # use self.neuron_weights
        new_weights = None

        # END TODO
        self.neuron_weights = new_weights

    def train(self, input_data, targets):
        """
        input_data: Multi-dimensional array that contains all inputs
        """
        # TODO
        # Call the necessary functions to train a single neuron for the given task
        # Complete the rest of the code to correctly train the model
        ...
        # END TODO

    def inference(self, input_data):
        # TODO
        # Test the trained neuron
        output = None

        # END TODO
        return output


### Training

In [None]:
perceptron = perceptron_implementation()

# TODO

input_data = ...
targets = ...

# train the corresponding single neuron 
perceptron.train(input_data, targets)
# END TODO

### Inference

In [None]:
# TODO
# Test the trained model

predictions = perceptron.inference(input_data)
print(predictions)
# END TODO

### Evaluation

For evaluation, we will need to consider appropriate metrics. For classification tasks, **accuracy** is one of the most common metrics.

It is defined as:

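$\textrm{Accuracy}=\frac{1}{N}\sum_{i=1}^N\mathbb{1}(y_i=\hat{y}_i)$

where $y$ is an array of our target values, $\hat{y}$ is an array of our predictions, and $\mathbb{1}(\cdot)$ is the indicator function, which is 1 if its argument is true and 0 otherwise.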

For accuracy, if the outputs are probabilities, a threshold is needed to transform the predicted probabilities into binary `(0,1)` predictions. We will set this threshold to `0.5`. Our perceptron does not need this, since it already outputs binary values. However, we will reuse the `accuracy` function later on, so it should treat the predictions as probabilities.
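The thresholding step can be sketched on a toy example. The arrays `probs` and `labels` are made-up values, and treating a probability of exactly `0.5` as class 0 is an assumption, since the text does not specify the boundary case.

```python
import numpy as np

# Hypothetical probabilities and targets, for illustration only.
probs = np.array([0.9, 0.2, 0.6, 0.4])
labels = np.array([1, 0, 0, 0])

# Probabilities above the threshold become class 1, all others class 0.
binary = (probs > 0.5).astype(int)

# Fraction of positions where prediction and target agree.
acc = np.mean(binary == labels)
print(binary, acc)  # [1 0 1 0] 0.75
```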

In [None]:
def accuracy(predictions: np.ndarray, targets: np.ndarray, threshold=0.5) -> float:
    # TODO
    # Implement the accuracy metric
    ...
    # END TODO
    return accuracy_value

# TODO
# Call accuracy function and provide necessary inputs to calculate accuracy
accuracy_value = accuracy(...)
print(accuracy_value)
# END TODO

## Multiple Neurons

The perceptron algorithm can't be generalized to multiple neurons or even layers of neurons. That's why we will now use **backpropagation**, which requires a **loss function**.

For our XOR task, our goal is now to build a network akin to this, i.e., a network with three single-neuron hidden layers:

<img src="https://i.imgur.com/oErVmm2.png">

#### Backpropagation

<center><img src="https://i.imgur.com/LgBzpYD.png" width="400" /></center>

### Sigmoid Activation Function

For a binary classification problem, we can use the sigmoid activation function in the output layer, which outputs values between 0 and 1. So, for a positive case (class 1), we can interpret $p_1 = \sigma(o)$ as the probability of that class, while $p_0 = 1 - p_1$ can be seen as the probability of the negative case (class 0).
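For reference, the sigmoid is $\sigma(z) = \frac{1}{1+e^{-z}}$, and its derivative can be written as $\sigma(z)(1-\sigma(z))$. The following standalone sketch shows both on a few sample values; the function names are hypothetical and deliberately differ from the class in the exercise.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)), applied element-wise.
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # d sigma / dz = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 2.0])
p1 = sigmoid(z)   # probability of class 1
p0 = 1.0 - p1     # probability of class 0
print(p1, p0)
```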

In [None]:
class sigmoid_activation_function():

    def forward(self, input_data):
        output = None
        
        # TODO
        # implement Sigmoid function for the input_data

        # END TODO
        return output
    
    def backward(self, gradients):
        ... # calculate the gradients with help of the derivative
        return ...

### Loss Function (Binary Cross Entropy)

$L=-\frac{1}{N}\sum_{i=1}^N\left[y_i \log(p(y_i))+(1-y_i)\log(1-p(y_i))\right]$

where $N$ is the batch size.
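To get a feeling for the numbers, the loss can be evaluated on a tiny batch. This is a sketch under assumptions: the helper name `bce` and the clipping constant `eps` (used to avoid `log(0)`) are illustrative choices, not part of the exercise.

```python
import numpy as np

def bce(p, y, eps=1e-12):
    # Clip probabilities to avoid log(0); eps is an illustrative choice.
    p = np.clip(p, eps, 1.0 - eps)
    # Mean over the batch of the per-sample binary cross-entropy.
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0])
p = np.array([0.5, 0.5])
print(bce(p, y))  # log(2), roughly 0.693
```

A completely undecided prediction of `0.5` for every sample therefore yields a loss of $\log 2$, which is a useful baseline when watching the training loss.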

In [None]:
class binary_cross_entropy():

    def forward(self, output, target):
        loss = None
        
        # TODO
        # implement Binary Cross-Entropy loss function for output, target

        # END TODO
        return loss
    
    def backward(self):
        ... # calculate the gradients with help of the derivative
        return ...

### Initializing Weights

Xavier initialization is commonly used to initialize the weights of a network. It draws the weights from a uniform distribution bounded by $\pm\frac{\sqrt{6}}{\sqrt{n_i+n_{i+1}}}$, where $n_i$ is the number of incoming network connections and $n_{i+1}$ is the number of outgoing network connections.
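The bound can be sketched with NumPy's random generator. The helper name `xavier_uniform`, the optional `rng` argument, and the weight shape `(n_out, n_in)` are assumptions made for this illustration; the exercise may expect a different layout.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    # Illustrative helper, not the function from the skeleton.
    rng = np.random.default_rng() if rng is None else rng
    # Bound of the uniform distribution: sqrt(6) / sqrt(n_in + n_out)
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    # Assumption: one row of weights per outgoing neuron.
    return rng.uniform(-limit, limit, size=(n_out, n_in))

W = xavier_uniform(2, 3, np.random.default_rng(0))
print(W.shape)  # (3, 2)
```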

In [None]:
def xavier_initialization(n_incoming: np.ndarray, n_outgoing: np.ndarray) -> np.ndarray:
    """ Returns a numpy array of initialized weights """
    ...

### Implement Multiple Neurons

#### Feed-Forward Layer

A feed-forward layer applies a linear transformation to the input $x$ using a weight matrix $\textbf{W}$ and a bias vector $b$:

$z = x\textbf{W}^T+b$
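The shapes involved in this transformation are easiest to see on a small example. The sizes below (a batch of 4 samples, 2 input features, 3 neurons) and the constant weights are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical sizes: a batch of 4 samples, 2 input features, 3 neurons.
x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
W = np.ones((3, 2))   # one row of weights per neuron
b = np.zeros(3)       # one bias per neuron

# z = x W^T + b; the bias is broadcast over the batch dimension.
z = x @ W.T + b
print(z.shape)  # (4, 3)
```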

In [None]:
class multi_neuron_implementation():
    def __init__(self, number_of_neurons, loss_function, output_activation_function):
        self.neuron_weights = None
        self.number_of_neurons = number_of_neurons
        self.loss_function = loss_function
        self.output_activation_function = output_activation_function

    def forward_pass(self, x):
        # TODO
        # Implement forward propagation
        output = None

        # END TODO
        return output

    def backward_pass(self, x):
        # TODO
        # Perform backpropagation by calculating derivative
        output = None

        # END TODO
        return output

    def update_parameter(self, loss, derivative, learning_rate = 1):
        # TODO
        # Perform weight update
        # use self.neuron_weights
        new_weights = None

        # END TODO
        self.neuron_weights = new_weights

    def train(self, input_data, targets):
        # TODO
        # Call the necessary functions to train the model with multiple neurons for the given task
        # Complete the rest of the code to correctly train the model
        ...
        # END TODO

    def inference(self, input_data):
        # TODO
        # Test the trained model
        output = None

        # END TODO
        return output

### Training

In [None]:
multi_neuron = multi_neuron_implementation(...)

# TODO
# train the model with multiple neurons
multi_neuron.train(input_data, targets)
# END TODO

### Inference

In [None]:
# TODO
# Test the trained model
predictions = multi_neuron.inference(input_data)
print(predictions)
# END TODO

### Evaluation

In [None]:
# TODO
# Call accuracy function and provide necessary inputs to calculate accuracy
accuracy_value = accuracy(...)
print(accuracy_value)
# END TODO

## Multi-Layer Perceptron (MLP)

Let's generalize even further and build a network with an arbitrary (parametrized) number of hidden layers and hidden dimensions. For the XOR task specifically, we will consider a network with three hidden layers and a hidden dimension of three. We will also add an activation function to introduce nonlinearity in our hidden layers.

<img src="https://i.imgur.com/IUQ05Ol.png">

### Implementation

In [None]:
class MLP_implementation():
    def __init__(self,
        hidden_layers,
        hidden_layers_size,
        hidden_activation_func,
        output_activation_function,
        loss_function,
    ):
        self.hidden_layers = hidden_layers
        self.hidden_layers_size = hidden_layers_size
        self.hidden_activation_func = hidden_activation_func
        self.loss_function = loss_function
        self.output_activation_function = output_activation_function
        # TODO
        # Implement your MLP model 

        # END TODO

    def forward_pass(self, x):
        # TODO
        # Implement forward propagation
        output = None

        # END TODO
        return output

    def backward_pass(self, x):
        # TODO
        # Perform backpropagation by calculating derivative
        output = None

        # END TODO
        return output

    def update_parameter(self, loss, derivative, learning_rate = 1):
        # TODO
        # Perform weight update
        ...
        # END TODO

    def train(self, input_data, targets):
        # TODO
        # Call the necessary functions to train the model for the given task
        # Complete the rest of the code to correctly train the model
        ...
        # END TODO

    def inference(self, input_data):
        # TODO
        # Test the trained model
        output = None

        # END TODO
        return output

### Adding Nonlinearity

This time, you need to implement and apply a nonlinearity. For this, you should implement the Rectified Linear Unit (ReLU) and apply it in the hidden layers of the network.

The ReLU activation function checks whether its input is positive. If it is, the input is passed through unchanged. Otherwise, ReLU outputs zero.

ReLU is piecewise linear: it combines two simple linear functions, one for negative and one for positive inputs. This keeps training simple yet effective, since ReLU has no learnable parameters and is cheap to compute. The following equation and figure show how ReLU behaves.

$$ y = \max(0, x) $$

<center><figure><img src="https://machinelearningmastery.com/wp-content/uploads/2018/10/Line-Plot-of-Rectified-Linear-Activation-for-Negative-and-Positive-Inputs.png" width="450"/><figcaption>Graph of the ReLU activation function. <a href="https://machinelearningmastery.com/wp-content/uploads/2018/10/Line-Plot-of-Rectified-Linear-Activation-for-Negative-and-Positive-Inputs.png">Image source</a></figcaption></figure></center>
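The behavior of ReLU and its derivative can be sketched on a few sample values. The function names are hypothetical, and setting the derivative at exactly 0 to 0 is a common convention assumed here, not something the text prescribes.

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x): positive inputs pass through, the rest become 0.
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 where the input is positive, 0 elsewhere (value at 0 is a convention).
    return (x > 0).astype(float)

x = np.array([-1.5, 0.0, 2.0])
print(relu(x))             # [0. 0. 2.]
print(relu_derivative(x))  # [0. 0. 1.]
```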

In [None]:
# You can reuse your implementation from the previous task if needed.
# You only need to add the ReLU activation function to your implementation.

class relu_activation_function():

    def forward(self, input_data):
        output = None
        
        # TODO
        # implement ReLU function for the input_data

        # END TODO
        return output
    
    def backward(self, gradients):
        ... # calculate the gradients with help of the derivative
        return ...

### MLP Initialization

In [None]:
relu = relu_activation_function()
xor_mlp = MLP_implementation(...)

### Training

In [None]:
# Train the same model with ReLU activation function
# TODO

# END TODO

### Evaluation

In [None]:
# Test and evaluate your new model as in the previous task
# TODO

# END TODO

## Application

### Iris Dataset 🌷

Iris is a genus of hundreds of species of flowering plants with showy flowers. The Iris data set consists of 150 samples from three species of Iris which are hard to distinguish (Iris setosa, Iris virginica and Iris versicolor). There are four features from each sample: the length and the width of the sepals and petals, in centimeters. Based on these features, the goal is to predict which species of Iris the sample belongs to.

<center><img src="https://www.oreilly.com/library/view/python-artificial-intelligence/9781789539462/assets/462dc4fa-fd62-4539-8599-ac80a441382c.png" width="450"/></center>

###  Loading Dataset

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and test dataset

X_train, y_train = ...
X_test, y_test = ...

### Softmax

Previously, we only considered a **binary classification problem**. Iris, however, is a **multiclass classification problem** that requires us to distinguish between three classes. For this case, we can use a **Softmax activation function** in the output layer to transform our outputs (logits) to a probability distribution over our classes.

Softmax is defined as

$\texttt{softmax}(z)_i=\frac{e^{z_i}}{\sum_{j=1}^N e^{z_j}}$
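A standalone sketch of this formula is shown below. Subtracting the row-wise maximum before exponentiating is a common numerical-stability trick that leaves the result unchanged; it is an addition for illustration and not required by the definition. The function name and the sample logits are hypothetical.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise maximum for numerical stability (result is unchanged).
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    # Normalize so that each row sums to 1.
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
probs = softmax(logits)
print(probs.sum(axis=-1))
```

Equal logits give a uniform distribution, and the largest logit always receives the largest probability.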

In [None]:
class softmax_activation_function():

    def forward(self, input_data):
        output = None
        
        # TODO
        # implement Softmax function for the input_data

        # END TODO
        return output
    
    def backward(self, gradients):
        ... # calculate the gradients with help of the derivative
        return ...

### Loss Function (Cross-Entropy)

Related to the previous notes about sigmoid and softmax, we now also need to move from a binary cross entropy loss to a more general cross entropy loss for a multiclass classification problem.

Cross-Entropy loss is defined as:

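$L=-\frac{1}{N}\sum_{n=1}^{N}\sum_i y_{n,i} \log(y_{n,i}')$

where $N$ is the batch size, $y_{n,i}$ is the one-hot encoded target for class $i$ of sample $n$, and $y_{n,i}'$ is the corresponding predicted probability.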
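The loss can be evaluated on a toy batch as follows. This sketch assumes one-hot encoded targets and predicted probabilities as inputs; the helper names `one_hot` and `cross_entropy` and the constant `eps` (added to avoid `log(0)`) are illustrative choices.

```python
import numpy as np

def one_hot(labels, n_classes):
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(n_classes)[labels]

def cross_entropy(probs, targets_one_hot, eps=1e-12):
    # Sum over classes, then average over the batch; eps avoids log(0).
    return -np.mean(np.sum(targets_one_hot * np.log(probs + eps), axis=1))

labels = np.array([0, 2])
probs = np.array([[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]])
loss = cross_entropy(probs, one_hot(labels, 3))
print(loss)  # log(2), roughly 0.693
```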

In [None]:
class cross_entropy_loss():

    def forward(self, output, target):
        loss = None
        
        # TODO
        # implement Cross-Entropy loss function for output, target

        # END TODO
        return loss
    
    def backward(self):
        ... # calculate the gradients with help of the derivative
        return ...

### Architecture

We will again use an MLP for this task. Initialize a model with **4 hidden layers** and a **hidden layer size of 24**.

### Training

In [None]:
iris_mlp = MLP_implementation(...)
iris_mlp.train(X_train, y_train)

### Evaluation

Show the overall accuracy of our model on the test dataset. Use the existing `accuracy` function that you implemented earlier.

In [None]:
# TODO
# Test the trained model
predictions_MLP = iris_mlp.inference(X_test)
print(predictions_MLP)

# Call accuracy function and provide necessary inputs to calculate accuracy
accuracy_value_MLP = accuracy(...)
print(accuracy_value_MLP)

# END TODO

Print the confusion matrix using `sklearn.metrics.confusion_matrix`.

In [None]:
from sklearn.metrics import confusion_matrix

...

Now look at the confusion matrix. What can you conclude from it? (No code; write your answer as text.)

...
