# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>
# <font color="#003660">Week 3: NN from Scratch</font>
# <font color="#003660">Notebook 1: NN with Numpy</font>


<center><br><img width=256 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you will be able to...</b><br><br>
        ... describe the reasoning behind single-layer perceptrons;<br>
        ... interpret the functionality of artificial neural networks;<br>
        ... build your first artificial neural network.
    </font>
</div>
</center>
</p>

# 1. What are Artificial Neural Networks?

<p>In this lesson, we will start experimenting with so-called artificial neural networks (ANNs). Neural networks are a class of machine learning algorithms that can provide you with extremely powerful models and are, as a result, a great addition to your machine learning toolbox. ANNs are known as universal function approximators &mdash; i.e., they are capable of representing and approximating almost any function, regardless of its complexity. Self-driving cars, voice assistants, and medical imaging systems are all modern examples of what artificial neural networks can do. Even though artificial neural networks can be extremely powerful, it is important to keep in mind that they do not provide a one-size-fits-all solution out of the box.</p>

<table class="image">
<center>
<caption align="bottom">(Patterson &amp; Gibson, 2017, p.55)</caption>
<tr><td><img width=540 src='https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_2/images/nn_topology.png'></td></tr>
</center>
</table>

# 2. Perceptron: The Building Block of Neural Networks

<p><center><font color="#085986"><strong><i>The following section is based on the book "Python Machine Learning" by Sebastian Raschka and Vahid Mirjalili (2017).</i></strong></font></center></p>

<p>In order to get a better understanding of artificial neural networks, one must first understand their underlying building block &mdash; i.e., the perceptron. In its primitive form, the artificial perceptron &mdash; i.e., the Rosenblatt perceptron (Rosenblatt, 1958) &mdash; is a mathematical function that mimics how neurons work in the human brain by enabling the binary classification of linearly separable observations. As shown below, a single-layer perceptron consists of a set of input features ($x_{n}$) and their corresponding weights ($w_{n}$) as well as a threshold function (a.k.a. step function). Ultimately, the goal is to minimise the error by iteratively optimising the weight coefficients.</p>

<table class="image">
<center>
<caption align="bottom">(Raschka &amp; Mirjalili, 2017, p.24)</caption>
<tr><td><img width=540 src='https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_2/images/perceptron.png'></td></tr>
</center>
</table>

<p>In a more algorithmic way, the model illustrated above can be represented as follows:</p>

\begin{equation}
\LARGE
   \phi(z)=%
   \begin{cases}
     \;\;\;\;1 \;\;\;\; \text{if $z$} \geq \theta\\
     -1 \;\;\;\; \text{otherwise}
   \end{cases}
\end{equation}

\begin{align}
\LARGE
z = w_0x_0 + w_1x_1 + \ldots + w_mx_m = \sum_{j=0}^{m}w_{j}x_{j}
\end{align}

<p>where $z$ represents the weighted sum or dot product of all inputs and weights and $\theta$ the pre-defined threshold value for the threshold function. The model above would then output $1$ if $z$ $\geq$ $\theta$ &mdash; i.e., positive class &mdash; or $-1$ if $z$ $<$ $\theta$ &mdash; i.e., negative class.</p>
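<p>The two equations above can be sketched in a few lines of NumPy. This is only a minimal illustration: the helper names <code>net_input</code> and <code>threshold</code>, the default $\theta = 0$, and the weight and input values are assumptions chosen for this example.</p>

```python
import numpy as np

def net_input(x, w):
    """Weighted sum z = sum_j w_j * x_j (dot product of weights and inputs)."""
    return np.dot(w, x)

def threshold(z, theta=0.0):
    """Threshold function: 1 if z >= theta, otherwise -1."""
    return np.where(z >= theta, 1, -1)

# Hypothetical weights and inputs, chosen only for illustration
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])

z = net_input(x, w)      # 0.5*1.0 + (-1.0)*2.0 + 2.0*0.5 = -0.5
label = threshold(z)     # z < theta, so the negative class (-1) is returned
print(z, label)
```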

<table class="image">
<center>
<caption align="bottom">(Raschka &amp; Mirjalili, 2017, p.21)</caption>
<tr><td><img width=540 src='https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_2/images/threshold.png'></td></tr>
</center>
</table>

<p>Since our goal is to minimise the error, we can, in an iterative manner, simultaneously update the weight coefficients of our model by following Rosenblatt's perceptron learning rule:</p>

<ol>
    <li>Initialise all weight coefficients &mdash; e.g., with zeros or random values.</li>
    <li>For each training sample $x^{(i)}$:
        <ol>
            <li>Compute the output value $\hat{y}$ &mdash; i.e., the class label.</li>
            <li>Update the weights accordingly (as shown below).</li>
        </ol>
</ol>

\begin{align}
\LARGE
w_j := w_j + \Delta w_j
\end{align}

\begin{align}
\LARGE
\Delta w_j = \eta \; (y^{(i)} - \hat{y}^{(i)}) \; {x_{j}}^{(i)}
\end{align}

<p>where $\Delta w_j$ represents the update value for each weight $w_j$ in the weight vector, $\eta$ the so-called learning rate (a constant between $0.0$ and $1.0$), $y^{(i)}$ the true class label of a given training sample, and $\hat{y}^{(i)}$ the predicted class label of a given training sample. Note that the update is zero whenever the prediction is correct, since $y^{(i)} - \hat{y}^{(i)} = 0$ in that case.</p>

<p><center><i>(Raschka &amp; Mirjalili, 2017, pp.18-24)</i></center></p>
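<p>To make the learning rule more tangible, here is a single update step worked out in NumPy. It is a sketch under a few assumptions: the threshold is folded into the weights as a bias term ($x_0 = 1$, so the decision boundary sits at $z = 0$), the weights start at zero, and the sample values are made up for illustration.</p>

```python
import numpy as np

eta = 0.1                          # learning rate (assumed value)
w = np.zeros(3)                    # w[0] plays the role of the bias weight
x_i = np.array([1.0, 5.1, 1.4])    # hypothetical sample with x_0 = 1 prepended
y_i = -1                           # true class label of the sample

# Prediction with the threshold at 0: z = 0 here, so the model predicts 1
y_hat = 1 if np.dot(w, x_i) >= 0.0 else -1

# Perceptron learning rule: delta_w_j = eta * (y - y_hat) * x_j
delta_w = eta * (y_i - y_hat) * x_i    # 0.1 * (-2) * x_i
w = w + delta_w

# After the update, the same sample falls on the negative side of the boundary
y_hat_after = 1 if np.dot(w, x_i) >= 0.0 else -1
print(w, y_hat_after)
```

<p>Because the sample was misclassified, every weight is moved in the direction of the true label, scaled by the corresponding input value.</p>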

## 2.1 Single-Layer Perceptron: An Example

<p><center><font color="#085986"><strong><i>The following section is based on the book "Python Machine Learning" by Sebastian Raschka and Vahid Mirjalili (2017).</i></strong></font></center></p>

<table class="image">
<center>
<caption align="bottom"><a href="https://archive.ics.uci.edu/ml/datasets/iris">https://archive.ics.uci.edu/ml/datasets/iris</a></caption>
<tr><td><img width=540 src='https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_2/images/iris.png'></td></tr>
</center>
</table>

In [None]:
################################################
# Load Iris dataset                            #
# https://archive.ics.uci.edu/ml/datasets/iris #
################################################

# Import
import numpy as np
import pandas as pd

# Load dataset (UCI)
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
df = df.sample(n=50, random_state=42).reset_index(drop=True)

# Define features (X)
X = df.iloc[:, [0, 2]].values

# Define targets (y)
y = df.iloc[:, 4].values
y = np.where(y == 'Iris-setosa', -1, 1)

print('Training set:')
print('X: {}'.format(X.shape))
print('y: {}'.format(y.shape))

In [None]:
#########################################
# Define perceptron class               #
# (Raschka & Mirjalili, 2017, pp.24-33) #
#########################################

class Perceptron(object):

    def __init__(self, eta=0.1, n_iter=3, random_state=42):

        # Define learning rate
        self.eta = eta

        # Define number of epochs
        self.n_iter = n_iter

        # Define random state
        self.random_state = random_state

    def fit(self, X, y):

        # Initialise weights (random)
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])

        self.errors_ = []

        for _ in range(self.n_iter):

            # Print
            print('=========================================================')
            print(f'\t\t\tEpoch {_+1}')
            print('=========================================================')

            errors = 0

            for index, (xi, target) in enumerate(zip(X, y)):

                # Make prediction
                prediction = self.predict(xi)

                # Prediction != Target
                if prediction != target:

                    # Calculate update
                    update = # TODO

                    # Update weight coefficients
                    self.w_[1:] += # TODO

                    # Update bias coefficient
                    self.w_[0] += # TODO

                    # Print
                    print(f'Index: {index:2} | Pred.: {self.predict(xi):2} | Target: {target:2} | Update: {update:2}')

                    # Count errors
                    errors += 1

            self.errors_.append(errors)
            print()
            print()

        return self

    def net_input(self, X):
        return # TODO

    def predict(self, X):
        return np.where(self.net_input(X) >= 0.0, 1, -1)

In [None]:
######################
# Train model        #
# Epochs: 3          #
# Learning rate: 0.1 #
######################

ppn = Perceptron(n_iter=3, eta=0.1)
ppn.fit(X, y)

In [None]:
#########################
# Plot training history #
#########################

%matplotlib inline

import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions

plot_decision_regions(X, y, clf=ppn)

plt.title('Perceptron')
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.show()

plt.plot(range(1, len(ppn.errors_)+1), ppn.errors_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Misclassifications')
plt.show()

<p><center><i>(Raschka &amp; Mirjalili, 2017, pp.24-33)</i></center></p>

# 3. Building our First Artificial Neural Network

<p><center><font color="#085986"><strong><i>The following section is (partly) based on the book "Neural Network Projects with Python" by James Loy (2019).</i></strong></font></center></p>

<table class="image">
<center>
<caption align="bottom">(Loy, 2019, p.16)</caption>
<tr><td><img width=420 src='https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_2/images/mlp.png'></td></tr>
</center>
</table>

## 3.1 Multilayer Perceptron: A Simple Architecture

<p>The model illustrated above is known as a multilayer perceptron (MLP) &mdash; i.e., a type of feedforward artificial neural network consisting of an <b>input layer, one or multiple hidden layers, and an output layer</b>. This simple architecture, which we will be implementing in this section, generates an output that can be expressed as follows:</p>

\begin{align}
\LARGE
\hat{y} = \phi(W_{2} \; \phi(W_{1}x + b_{1})+b_{2})
\end{align}

<p>where $W_1$ and $W_2$ represent the weight matrices, $b_1$ and $b_2$ the bias vectors, and $\phi$ the activation function. Please note that MLPs, in contrast to single-layer perceptrons, make use of <b>non-linear activation functions</b>. In order to move from discrete to continuous outputs, the threshold function is therefore replaced with a more suitable alternative, such as the sigmoid function.</p>
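<p>As a rough illustration, the expression for $\hat{y}$ can be evaluated directly with NumPy. The layer sizes (2 inputs, 4 hidden units, 1 output), the random weights, the zero biases, and the input values below are assumptions made for this sketch, and the sigmoid is used as one possible choice for $\phi$.</p>

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, used here as the non-linear function phi."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)

# Hypothetical parameters: 2 inputs -> 4 hidden units -> 1 output
W1 = rng.normal(size=(4, 2))
b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4))
b2 = np.zeros(1)

x = np.array([5.1, 1.4])   # made-up input with two features

# y_hat = phi(W2 * phi(W1 * x + b1) + b2)
hidden = sigmoid(W1 @ x + b1)
y_hat = sigmoid(W2 @ hidden + b2)
print(hidden.shape, y_hat.shape)
```

<p>Unlike the threshold function, the sigmoid returns a continuous value between 0 and 1 rather than a hard class label.</p>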

<table class="image">
<center>
<caption align="bottom"><a href="https://towardsdatascience.com/complete-guide-of-activation-functions-34076e95d044">https://towardsdatascience.com/complete-guide-of-activation-functions-34076e95d044</a></caption>
<tr><td><img width=540 align='middle' src='https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_2/images/activ_functions.png'></td></tr>
</center>
</table>

<center><img width=100 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/resources/tip.png"></center>

<p>As is often the case with technical literature, it is important to keep in mind that different notations and styles are used to explain identical concepts. Even though the algorithm presented above may seem different from the one presented in section 2, its core idea remains the same.</p>

## 3.2 Training a Neural Network: An Iterative Process

<p>As explained in the previous section, the overall quality of a model is determined by the weights and biases obtained during the training process. The term training is used because the idea is to fine-tune these weights and biases over several iterations. <b>Keep in mind that the ultimate goal is to minimise the cost function</b>. Closely analogous to Rosenblatt's perceptron learning rule, each iteration of the training process consists of the following steps:</p>

<ul style="list-style-type:round">
    <li>feedforward &mdash; i.e., feeding the training examples from the input layer to the output layer ($\hat{y}$);</li>
    <li>backpropagation &mdash; i.e., updating the weights and biases based on the obtained error.</li>
</ul>

<table class="image">
<center>
<caption align="bottom">(Loy, 2019, p.18)</caption>
<tr><td><img width=800 src='https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_2/images/nn_sequence.png'></td></tr>
</center>
</table>

## 3.3 Implementation (PyTorch)

<p>Feeling a bit overwhelmed? No worries! Let's start by building our first model using the <b>PyTorch</b> framework provided by <a href="https://ai.facebook.com">Facebook AI</a>. Even though not necessarily the simplest alternative available, we believe that this highly popular framework provides the most suitable environment for an introductory course and will allow you to better understand the inner workings of neural networks.</p><br>

<center><a href="https://pytorch.org"><img width=384 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_2/images/pytorch-logo-dark.png"/></a></center>
    

<p>Based on the illustration found at the beginning of this section, our model is pretty straightforward and contains only one hidden layer.  However, since our training data only contains two features (a.k.a. independent variables), the input layer of our model must only contain 2 neurons instead of 3. The resulting architecture can be pictured as follows:</p><br>

<table class="image">
<center>
<tr><td><img width=520 align='middle' src='https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_2/images/ann_example.png'></td></tr>
</center>
</table>

<p>Using the example below, implement the architecture shown above. As can be seen, the <code>__init__(self)</code> function is used to define the layers of the model while the <code>forward(self, x)</code> function is used to define the feedforward network. Do not forget the non-linear activation functions!</p>

```python
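import torch
import torch.nn as nn

class MyPyTorchModel(nn.Module):
    
    # Define layers here...
    def __init__(self):
        
        # Required to initialize the nn.Module
        super(MyPyTorchModel, self).__init__()

        # nn.Linear(in_features, out_features); the sizes below are illustrative only
        self.layer1 = nn.Linear(8, 16)
        self.layer2 = nn.Linear(16, 16)
        self.layer3 = nn.Linear(16, 8)
        self.layer4 = nn.Linear(8, 1)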
        
    # Define feedforward here...
    def forward(self, x):
        
        x = self.layer1(x)
        x = torch.sigmoid(x)
        
        x = self.layer2(x)
        x = torch.sigmoid(x)
        
        x = self.layer3(x)
        x = torch.sigmoid(x)
        
        x = self.layer4(x)
        x = torch.sigmoid(x)
        
        return x
```

<center><img width=100 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/resources/tip.png"></center>


<p>Make sure to pass <code>nn.Module</code> to the model's class. By doing so, your class inherits all required <b>PyTorch</b> methods and functions, which is essential to the implementation!</p>

In [None]:
#################################
# Define architecture (PyTorch) #
#################################

# Import
import torch
import torch.nn as nn
import torch.nn.functional as F

# Set seed
torch.manual_seed(42)

# Define architecture
class FirstTorchModel(nn.Module):

    def __init__(self):

        super(FirstTorchModel, self).__init__()

        # Define hidden layer (4 units)
        self.hidden_layer = # TODO

        # Define output layer (1 unit)
        self.output_layer = # TODO

    # Feedforward
    def forward(self, x):

        # Hidden layer (Sigmoid activation)
        x = # TODO
        x = # TODO

        # Output layer
        x = # TODO
        x = # TODO

        return x

<p>Before we can go any further, we must first initialise our model and define a few more key ingredients, namely:</p>

<ul style="list-style-type:round">
    <li>a loss function; and</li>
    <li>an optimizer &mdash; i.e., the method used to update our weights and biases based on the backpropagated error.</li>
</ul>

<p>For the sake of this tutorial, we will stick to a common binary approach by using the <code><a href="https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html">torch.nn.BCELoss (Binary Cross Entropy)</a></code> loss function in combination with the <code><a href="https://pytorch.org/docs/stable/optim.html">torch.optim.SGD (Stochastic Gradient Descent)</a></code> optimizer. Run the code below in order to initialise the model and all required components!</p>

In [None]:
#################
# Compile model #
#################

# Initialise model
model = FirstTorchModel()

# Define loss function (a.k.a. criterion)
criterion = # TODO

# Define optimizer
optimizer = # TODO

# Print overview
print(model)

<center><img width=100 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/resources/question.png"></center>

<p><center><b>Based on the architecture above, how many parameters (i.e., weights and biases) will we be training?</b></center></p>

\begin{align}
params_{layer} = inputs \times outputs + biases
\end{align}

In [None]:
########################
# How many parameters? #
########################

# Hidden layer
params_hidden = # TODO

# Output layer
params_output = # TODO

# Total parameters
total_params = params_hidden + params_output

print(f'>>> The model contains {total_params} parameters in total!')

<p>As can be seen, our model contains a total of 17 trainable parameters (weights and biases). Here is the calculation:</p>

\begin{align}
params_{hidden} = 2 \times 4 + 4 \;\;\;\;\;\;\;\;\;\; params_{output} = 4 \times 1 + 1
\end{align}

\begin{align}
params_{total} = 12 + 5 = 17
\end{align}
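<p>The same numbers can be cross-checked by asking PyTorch how many elements each layer holds. The two stand-in layers and the helper <code>count_params</code> below are assumptions made for this sketch; they only mirror the layer sizes used in the calculation above.</p>

```python
import torch.nn as nn

# Stand-in layers with the sizes from the calculation above
hidden = nn.Linear(in_features=2, out_features=4)
output = nn.Linear(in_features=4, out_features=1)

def count_params(layer):
    """Sum the number of elements over all trainable tensors of a layer."""
    return sum(p.numel() for p in layer.parameters() if p.requires_grad)

params_hidden_check = count_params(hidden)   # weights (4 x 2) plus biases (4)
params_output_check = count_params(output)   # weights (1 x 4) plus biases (1)
print(params_hidden_check, params_output_check, params_hidden_check + params_output_check)
```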

<p>Akin to our <code>Perceptron</code> implementation, instantiating the model's class initialises the model's weights and bias units with random values. Let's take a look at these values!</p>

In [None]:
############################
# Initial weights & biases #
############################

# Hidden layer
print('>>> Hidden Layer:')
print(model.hidden_layer.weight)
print(model.hidden_layer.bias)

# Output layer
print('\n>>> Output Layer:')
print(model.output_layer.weight)
print(model.output_layer.bias)

# Save initial weights and biases (not relevant!)
# For tutorial only!
w1_initial = torch.transpose(model.hidden_layer.weight.detach().clone(), 0, 1)
b1_initial = model.hidden_layer.bias.detach().clone()
w2_initial = torch.transpose(model.output_layer.weight.detach().clone(), 0, 1)
b2_initial = model.output_layer.bias.detach().clone()

<p>As we can see, the 17 parameters contained within our model are initialised and ready to go! As a result, we are now ready to load and prepare our dataset before training our model.</p><br>

<center><img width=25% src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_2/images/ready.jpg"/></center>

In [None]:
################################################
# Reload Iris dataset                          #
# https://archive.ics.uci.edu/ml/datasets/iris #
################################################

# Import
import numpy as np
import pandas as pd

# Load dataset (UCI)
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)

# Shuffle
df = df.sample(frac=1., random_state=42).reset_index(drop=True)

# Define features (X)
X = df.iloc[:, [0, 2]].values

# Define targets (y)
y = df.iloc[:, 4].values
y = np.where(y == 'Iris-setosa', 0, 1).reshape((y.shape[0], 1))

print('Shapes:')
print('X: {}'.format(X.shape))
print('y: {}'.format(y.shape))

<p>Lastly, before we can fit (or train) our model, we must define the following hyperparameters:</p>

<ul style="list-style-type:round">
    <li><code>epochs</code> &mdash; i.e., the number of iterations;</li>
    <li><code>batch_size</code> &mdash; the number of samples per gradient update.</li>
</ul>

<p>For the sake of simplicity, we will train our model for 100 epochs and split our 150 samples into 6 batches &mdash; i.e., 25 samples per batch. By doing so, we compute the loss and update the weights after every 25 samples, which makes the training process more efficient.</p>

In [None]:
##########################
# Define hyperparameters #
##########################

NUM_EPOCHS = 100
BATCH_SIZE = 25

<center><img width=100 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/resources/tip.png"></center>

<p>The <b>PyTorch</b> framework can provide you with useful utilities, such as <a href="https://pytorch.org/docs/stable/data.html">torch.utils.data.TensorDataset</a> and <a href="https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader">torch.utils.data.DataLoader</a>, that will help you prepare and split your dataset into trainable batches!</p>

In [None]:
###################
# Prepare dataset #
###################

dataset = torch.utils.data.TensorDataset(torch.from_numpy(X).float(), torch.from_numpy(y).float())
train_loader = torch.utils.data.DataLoader(dataset, shuffle=False, batch_size=BATCH_SIZE)

<p>Finally, we can go ahead and train our model using the following steps:</p>

<ol>
    <li>Set the gradients to zero;</li>
    <li>Feed the data through the network;</li>
    <li>Compute the loss using the outputs from the model;</li>
    <li>Backpropagate the loss and compute the gradients; and</li>
    <li>Update the weights using the optimizer.</li>
</ol>
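<p>The five steps above map onto a handful of PyTorch calls. The following is a sketch of a single update on one batch, not the notebook's reference solution: the stand-in <code>nn.Sequential</code> model, the random data, and the learning rate of 0.1 are all assumptions made for illustration.</p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model, loss, and optimizer (assumed for this sketch)
model = nn.Sequential(nn.Linear(2, 4), nn.Sigmoid(), nn.Linear(4, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One made-up batch: 25 samples, 2 features, binary targets
data = torch.rand(25, 2)
target = torch.randint(0, 2, (25, 1)).float()

# Keep a copy of the first layer's weights to see the effect of the update
before = model[0].weight.detach().clone()

optimizer.zero_grad()                # 1. set the gradients to zero
output = model(data)                 # 2. feed the data through the network
loss = criterion(output, target)     # 3. compute the loss
loss.backward()                      # 4. backpropagate and compute the gradients
optimizer.step()                     # 5. update the weights

weights_changed = not torch.equal(before, model[0].weight)
print(loss.item(), weights_changed)
```

<p>Step 1 matters because PyTorch accumulates gradients across calls to <code>backward()</code>; without it, the gradients of previous batches would be added to the current ones.</p>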

In [None]:
###############
# Train model #
###############

torch_training_history = []

# Training loop
for epoch in range(NUM_EPOCHS):

    running_loss = torch.zeros(int(X.shape[0]/BATCH_SIZE))

    for batch_id, batch in enumerate(train_loader):

        data, target = batch

        # Clear gradients
        # TODO

        # Feedforward
        # TODO

        # Compute loss
        loss = # TODO

        # Backpropagate errors
        # TODO

        # Update weights
        # TODO

        # Append loss (epoch)
        running_loss[batch_id] = loss.item()

    print('==========================================')
    print(f'>>> Epoch {epoch+1}')
    print('==========================================')
    for batch_number, batch_loss in enumerate(running_loss, start=1):
        print(f'>>> Batch {batch_number}: {batch_loss:.4}')
    print(f'\n>>> Epoch:   {torch.mean(running_loss):.4}\n')

    # Append loss (training)
    torch_training_history.append(running_loss)

In [None]:
##################################
# Weights & biases post training #
##################################

# Hidden layer
print('>>> Hidden Layer:')
print(model.hidden_layer.weight)
print(model.hidden_layer.bias)

# Output layer
print('\n>>> Output Layer:')
print(model.output_layer.weight)
print(model.output_layer.bias)

## 3.4 Implementation (Own)

<p>For learning purposes, let's try to reproduce the results generated above without using all the tools provided by the <code>nn.Module</code>. High-level APIs, such as <b>Keras</b> or even <b>PyTorch</b>, often come at a cost &mdash; i.e., not knowing what's really happening behind the scenes! Below is a step-by-step guide on how to compute the feedforward pass based on the architecture presented in the previous section. <b>Keep in mind that this procedure must be repeated for each batch!</b></p>

### 1. Compute weighted sums for the hidden layer

\begin{align}
\LARGE
z_1 = W_1x + b_1
\end{align}

### 2. Activate hidden layer with sigmoid function

<p>With the weighted sums $z_1$ at hand, we still need to activate the hidden neurons with a non-linear activation function. As mentioned earlier, we will be using the sigmoid function ($\sigma$) throughout this example.</p>

\begin{align}
\LARGE
\sigma(x) = \frac{1}{1+e^{-x}}
\end{align}

<p>The output of our layer is then called $a_1$ and can be computed as follows:</p>

\begin{align}
\LARGE
a_1 = \sigma(z_1) = \sigma(W_1x + b_1)
\end{align}

### 3. Compute weighted sum for the output layer

\begin{align}
\LARGE
z_2 = W_2a_1 + b_2
\end{align}

### 4. Activate output layer with sigmoid function and generate predictions

\begin{align}
\LARGE
a_2 = \sigma(z_2) = \sigma(W_2a_1 + b_2)
\end{align}

<center><img width=100 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/resources/tip.png"></center>

<p>As can be seen above, only two functions are required to compute the aforementioned forward pass. <b>Keep in mind that it is essential to use the activated outputs from the hidden layer &mdash; i.e., $a_1$ &mdash; as inputs for the output layer</b>.</p>

<p>Let's define our <code>MyNeuralNetwork</code> class and compute the feedforward pass for our first batch using the step-by-step guide presented above. To do so, we will start by retrieving the six batches of 25 samples each from our <code>train_loader</code>.</p>

In [None]:
###############
# Get batches #
###############

# DataLoader
loader_iter = iter(train_loader)

# Batches
X_batch1, y_batch1 = next(loader_iter)
X_batch2, y_batch2 = next(loader_iter)
X_batch3, y_batch3 = next(loader_iter)
X_batch4, y_batch4 = next(loader_iter)
X_batch5, y_batch5 = next(loader_iter)
X_batch6, y_batch6 = next(loader_iter)

# Shapes
print(X_batch1.shape, y_batch1.shape)
print(X_batch2.shape, y_batch2.shape)
print(X_batch3.shape, y_batch3.shape)
print(X_batch4.shape, y_batch4.shape)
print(X_batch5.shape, y_batch5.shape)
print(X_batch6.shape, y_batch6.shape)

In [None]:
##########################
# Define MyNeuralNetwork #
##########################

# MyNeuralNetwork
class MyNeuralNetwork:

    def __init__(self, weights1, bias1, weights2, bias2):

        self.weights1 = weights1
        self.bias1 = bias1
        self.weights2 = weights2
        self.bias2 = bias2

    def _sigmoid_activation(self, x):
        return # TODO

    def _feedforward(self, x):

        # Step 1
        z_1 = # TODO

        # Step 2
        a_1 = # TODO

        # Step 3
        z_2 = # TODO

        # Step 4
        a_2 = # TODO

        return z_1, a_1, z_2, a_2

    def fit_one_batch(self, x):

        # Feedforward
        self.z_1, self.a_1, self.z_2, self.a_2 = self._feedforward(x)

# Initialise model
my_model = MyNeuralNetwork(
    weights1 = w1_initial,
    bias1 = b1_initial,
    weights2 = w2_initial,
    bias2 = b2_initial
)

# Fit model
my_model.fit_one_batch(X_batch1)

# Display predictions from output layer
print(f'>>> Predictions:\n{my_model.a_2}')

<p>Unsurprisingly, our model outputs 25 predictions &mdash; i.e., one for every sample in our batch. By comparing these outputs with the true labels &mdash; i.e., $y$ &mdash; we can compute the binary cross-entropy loss, or logarithmic loss, for every batch by using the following formula:</p>

\begin{align}
\LARGE
H(p, y) = -(y \; \log(p) + (1-y) \; \log(1-p))
\end{align}

<p>where $p$ represents the model's prediction and $y$ the true label.</p>
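<p>As a quick sanity check, the formula can be applied to a few made-up predictions and compared with <code>torch.nn.BCELoss</code>, which averages the per-sample values by default. The predictions and labels below are hypothetical and serve only to illustrate the computation.</p>

```python
import torch

# Hypothetical predictions and true labels for three samples
p = torch.tensor([[0.9], [0.2], [0.6]])
y = torch.tensor([[1.0], [0.0], [1.0]])

# H(p, y) applied to every sample individually
per_sample = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))

# Mean over the batch, compared with PyTorch's built-in loss
manual = per_sample.mean()
reference = torch.nn.BCELoss()(p, y)
print(per_sample.squeeze(), manual, reference)
```

<p>Confident and correct predictions (e.g., $p = 0.9$ for $y = 1$) contribute a small loss, whereas hesitant ones (e.g., $p = 0.6$ for $y = 1$) contribute a larger one.</p>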

In [None]:
########################################
# Compare predictions with true labels #
########################################

# Loop variables are named so that they do not overwrite the dataset's y
for pred, target in zip(my_model.a_2, y_batch1):
    print(f'Pred.: {pred[0]:.4} -> Target: {target[0]}')

<p>Since our predictions aren't perfect, let's add the log loss function to our <code>MyNeuralNetwork</code> class and compute the error after every feedforward pass.</p>

```python
    def _loss_computation(self, p, y):
        return torch.mean(-(y * torch.log(p) + (1-y) * torch.log(1-p)))
```

In [None]:
##########################
# Define MyNeuralNetwork #
##########################

# MyNeuralNetwork
class MyNeuralNetwork:

    def __init__(self, weights1, bias1, weights2, bias2):
        self.weights1 = weights1
        self.bias1 = bias1
        self.weights2 = weights2
        self.bias2 = bias2

    def _sigmoid_activation(self, x):
        return # TODO

    def _loss_computation(self, p, y):
        return # TODO

    def _feedforward(self, x):

        # Step 1
        z_1 = # TODO

        # Step 2
        a_1 = # TODO

        # Step 3
        z_2 = # TODO

        # Step 4
        a_2 = # TODO

        return z_1, a_1, z_2, a_2

    def fit_one_batch(self, x, y):

        # Feedforward
        self.z_1, self.a_1, self.z_2, self.a_2 = self._feedforward(x)

        # Loss
        self.loss = # TODO

# Initialise model
my_model = MyNeuralNetwork(
    weights1 = w1_initial,
    bias1 = b1_initial,
    weights2 = w2_initial,
    bias2 = b2_initial
)

# Fit model
my_model.fit_one_batch(X_batch1, y_batch1)

# Loss
print(f'>>> Batch 1: {my_model.loss:.4}')

<p>Because the formula above is applied to every sample, our computation returns one loss value per sample. The overall loss for a given batch is simply the mean of these values. Before moving on to backpropagation, let's validate the results of our manual computations by comparing them with the ones obtained using the <b>PyTorch</b> framework.</p>

In [None]:
###################
# Own vs PyTorch  #
# 1st batch only! #
###################

print(f'Our implementation:     {my_model.loss:.4}')
print(f'PyTorch implementation: {torch_training_history[0][0]:.4}')

<center><img width=40% src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_2/images/worked.jpg"/></center>

## 3.5 Backpropagation &amp; Gradient Descent

<p>Now that we have completed the forward pass for our first batch &mdash; i.e., we generated predictions and computed the model's error &mdash; we can proceed with backpropagation &mdash; i.e., <b>a method used to efficiently compute the gradient of the error function with respect to the weights in artificial neural networks</b>. Gradient descent, on the other hand, is <b>"an optimization algorithm used to minimise some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient"</b> (Machine Learning Glossary, 2017). Since the concepts of backpropagation and gradient descent can be considered rather complex, we will, in this tutorial, focus on the most essential ideas.</p>

<table class="image">
<center>
<caption align="bottom">(Machine Learning Glossary, 2017)</caption>
<tr><td><img width=550 src='https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_2/images/gradient_descent.png'></td></tr>
</center>
</table>

<p>From a visual standpoint, it is evident, when we look back at our model's architecture, that the output of a given layer is determined by the weights preceding this layer. Hence, the resulting error must also depend on these weights. This is one of the main assumptions that needs to be made when dealing with gradient descent and the reason why backpropagation is used to optimise neural networks. By backpropagating the error through a network &mdash; i.e., from output to input &mdash; we can fine-tune the model's weights and biases based on their effect on the cost function.</p>

## 3.6 Backpropagation (Step-by-Step)

<p><center><font color="red"><strong><i>Since the concepts of backpropagation and gradient descent are rather complex,<br>they will not be covered in this introductory lesson on artificial neural networks.</i></strong></font></center></p>

### 1. Compute the error term $\delta$ for the output layer

\begin{align}
\LARGE
\delta_{output} = H^{\prime}(p,y) \cdot \sigma^{\prime}(a_2)
\end{align}

\begin{align}
H^{\prime}(p,y) = -\left ( \frac{y}{p} - \frac{1-y}{1-p} \right ) \;\;\;\;\;\;\;\;\;\; \sigma^{\prime}(a_2) = a_2 \cdot (1-a_2)
\end{align}

where $H^{\prime}(p,y)$ represents the partial derivative of the cross-entropy loss function and $\sigma^{\prime}(a_2)$ the partial derivative of the sigmoid activation function at the output layer. Keep in mind that $a_2$ also represents the predictions made by the model. However, for the sake of simplicity, the variable $p$ is used to represent the prediction within the loss function.

### 2. Compute the error term $\delta$ for the hidden layer

\begin{align}
\LARGE
\delta_{hidden} = \delta_{output}W_2 \cdot \sigma^{\prime}(a_1)
\end{align}

where $\delta_{output}W_2$ represents the weighted $\delta_{output}$ and $\sigma^{\prime}(a_1)$ the partial derivative of the sigmoid activation function at the hidden layer.

### 3. Update weights &rarr; $W_2$ and $b_2$

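\begin{align}
\LARGE
W_2 := W_2 - \eta \; (a_1^{T} \cdot \delta_{output})
\end{align}

\begin{align}
\LARGE
b_2 := b_2 - \eta \; \sum \delta_{output}
\end{align}

where $a_1^{T}$ is the transposed hidden-layer activation and the sum runs over the samples in the batch.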

### 4. Update weights &rarr; $W_1$ and $b_1$

\begin{align}
\LARGE
W_1 := W_1 - \eta \; (x^{T} \cdot \delta_{hidden})
\end{align}

\begin{align}
\LARGE
b_1 := b_1 - \eta \; \sum \delta_{hidden}
\end{align}

where $x^{T}$ is the transposed input and the sum runs over the samples in the batch.
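The four steps can be sketched end to end as follows. This is a minimal NumPy version under stated assumptions: the products in steps 3 and 4 are read as matrix products with the transposed activations, the bias gradients are summed over the batch, the $1/n$ factor of the mean loss is folded into $\delta_{output}$, and the data, shapes, and function names are made up for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def mean_loss(p, y):
    # Mean binary cross-entropy over the batch
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))


def forward(x, W1, b1, W2, b2):
    a_1 = sigmoid(x @ W1 + b1)
    a_2 = sigmoid(a_1 @ W2 + b2)
    return a_1, a_2


def update_step(x, y, W1, b1, W2, b2, eta):
    a_1, a_2 = forward(x, W1, b1, W2, b2)
    n = y.shape[0]

    # Step 1: output error term (includes 1/n because the loss is a mean)
    loss_grad = -(y / a_2 - (1 - y) / (1 - a_2)) / n
    delta_output = loss_grad * a_2 * (1 - a_2)

    # Step 2: hidden error term
    delta_hidden = (delta_output @ W2.T) * a_1 * (1 - a_1)

    # Step 3: update the output layer
    W2_new = W2 - eta * (a_1.T @ delta_output)
    b2_new = b2 - eta * delta_output.sum(axis=0)

    # Step 4: update the hidden layer
    W1_new = W1 - eta * (x.T @ delta_hidden)
    b1_new = b1 - eta * delta_hidden.sum(axis=0)

    return W1_new, b1_new, W2_new, b2_new


# Toy data: 8 samples, 2 features, binary labels (illustrative assumption)
rng = np.random.default_rng(42)
x = rng.normal(size=(8, 2))
y = (x[:, :1] + x[:, 1:] > 0).astype(float)

W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

loss_before = mean_loss(forward(x, W1, b1, W2, b2)[1], y)
for _ in range(100):
    W1, b1, W2, b2 = update_step(x, y, W1, b1, W2, b2, eta=0.1)
loss_after = mean_loss(forward(x, W1, b1, W2, b2)[1], y)

print(f'loss before: {loss_before:.4f}, loss after: {loss_after:.4f}')
```

Returning new arrays instead of updating in place keeps the sketch free of side effects. The equivalent tensor operations in PyTorch are `torch.matmul` and `torch.sum`.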

<p>To perform the backpropagation steps presented above, we add the following helper functions, in addition to the computations themselves, to our <code>MyNeuralNetwork</code> class:</p>


```python
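    def _sigmoid_derivative(self, p):
        # Derivative of the sigmoid, expressed via its output p = sigmoid(z)
        return p * (1.0 - p)

    def _loss_derivative(self, p, y):
        # Derivative of the mean cross-entropy loss w.r.t. p (hence the 1/n factor)
        return (1 / y.shape[0]) * (-torch.divide(y, p) + torch.divide((1-y), (1-p)))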
```
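One way to gain confidence in these two helpers is to compare them with finite-difference approximations. The sketch below does so with standalone NumPy versions of the functions. The names, inputs, and labels are assumptions for illustration, and the $1/n$ factor matches the mean taken in the loss computation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_derivative(p):
    # Expressed via the sigmoid output p, as in the class method
    return p * (1.0 - p)


def mean_loss(p, y):
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))


def loss_derivative(p, y):
    # Gradient of the mean loss w.r.t. each prediction (note the 1/n factor)
    return (1 / y.shape[0]) * (-(y / p) + (1 - y) / (1 - p))


# Made-up pre-activations and labels (illustrative assumption)
z = np.array([[-1.5], [0.3], [2.0]])
y = np.array([[0.0], [1.0], [1.0]])
p = sigmoid(z)
eps = 1e-6

# Central difference of the sigmoid w.r.t. its input
numeric_sigmoid = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# Central difference of the mean loss w.r.t. each individual prediction
numeric_loss = np.zeros_like(p)
for i in range(p.shape[0]):
    shift = np.zeros_like(p)
    shift[i, 0] = eps
    numeric_loss[i, 0] = (mean_loss(p + shift, y)
                          - mean_loss(p - shift, y)) / (2 * eps)

print(np.max(np.abs(sigmoid_derivative(p) - numeric_sigmoid)))
print(np.max(np.abs(loss_derivative(p, y) - numeric_loss)))
```

Both differences should be close to zero. Dropping the $1/n$ factor would make the second check fail, since the loss averages over the batch.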

In [None]:
##########################
# Define MyNeuralNetwork #
##########################

# MyNeuralNetwork
class MyNeuralNetwork:

    def __init__(self, weights1, bias1, weights2, bias2, eta):

        self.weights1 = weights1
        self.bias1 = bias1
        self.weights2 = weights2
        self.bias2 = bias2
        self.eta = eta

    def _sigmoid_activation(self, x):
        return 1 / (1 + torch.exp(-x))

    def _loss_computation(self, p, y):
        return torch.mean(-(y * torch.log(p) + (1-y) * torch.log(1-p)))

    def _sigmoid_derivative(self, p):
        return p * (1.0 - p)

    def _loss_derivative(self, p, y):
        return (1 / y.shape[0]) * (-torch.divide(y, p) + torch.divide((1-y), (1-p)))

    def _feedforward(self, x):

        # Step 1
        z_1 = torch.matmul(x, self.weights1) + self.bias1

        # Step 2
        a_1 = self._sigmoid_activation(z_1)

        # Step 3
        z_2 = torch.matmul(a_1, self.weights2) + self.bias2

        # Step 4
        a_2 = self._sigmoid_activation(z_2)

        return z_1, a_1, z_2, a_2

    def _backpropagation(self, a_2, a_1, y):

        # Output layer
        output_delta = torch.mul(self._loss_derivative(a_2, y), self._sigmoid_derivative(a_2))

        # Hidden layer
        hidden_delta = torch.mul(torch.mul(output_delta, self.weights2.T), self._sigmoid_derivative(a_1))

        return output_delta, hidden_delta

    def fit(self, n_epochs, batches):

        self.training_history = []

        # Training loop
        for epoch in range(n_epochs):

            # One loss entry per batch (avoids relying on the globals X and BATCH_SIZE)
            running_loss = torch.zeros(len(batches))

            for batch_id, batch in enumerate(batches):

                data, target = batch

                # Feedforward
                z_1, a_1, z_2, a_2 = self._feedforward(data)

                # Loss
                loss = self._loss_computation(a_2, target)

                # Backpropagation
                output_delta, hidden_delta = self._backpropagation(a_2, a_1, target)

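                # Update weights (Output layer), following step 3
                # Assumes targets of shape (n, 1); the deltas already carry the 1/n factor
                self.weights2 -= self.eta * torch.matmul(a_1.T, output_delta)
                self.bias2 -= self.eta * torch.sum(output_delta, dim=0)

                # Update weights (Hidden layer), following step 4
                self.weights1 -= self.eta * torch.matmul(data.T, hidden_delta)
                self.bias1 -= self.eta * torch.sum(hidden_delta, dim=0)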

                # Append loss (epoch)
                running_loss[batch_id] = loss

            print('==========================================')
            print(f'>>> Epoch {epoch+1}')
            print('==========================================')
            # Report the loss of every batch in this epoch
            for batch_id, batch_loss in enumerate(running_loss):
                print(f'>>> Batch {batch_id+1}: {batch_loss:.4}')
            print(f'\n>>> Epoch:   {torch.mean(running_loss):.4}\n')

            # Append loss (training)
            self.training_history.append(running_loss)


# Initialise model
my_model = MyNeuralNetwork(
    weights1 = w1_initial,
    bias1 = b1_initial,
    weights2 = w2_initial,
    bias2 = b2_initial,
    eta=0.1
)

<p>Finally, we are now ready to train/fit our model.</p>

In [None]:
##################
# Initial values #
##################

print('>>> Hidden layer:')
print(my_model.weights1)
print(my_model.bias1)

print('\n>>> Output layer:')
print(my_model.weights2)
print(my_model.bias2)

In [None]:
#########
# Train #
#########

# Dataset
batches = [
    [X_batch1, y_batch1],
    [X_batch2, y_batch2],
    [X_batch3, y_batch3],
    [X_batch4, y_batch4],
    [X_batch5, y_batch5],
    [X_batch6, y_batch6]
]

# Fit
my_model.fit(n_epochs=NUM_EPOCHS, batches=batches)

<p>We are now done training our model. Once again, we can validate the results of our manual computations by comparing them with the ones obtained using the <b>PyTorch</b> framework. Please note that a small variation is expected due to discrepancies between both implementations (e.g., in the implementation of the loss function and the optimizer) as well as floating-point precision.</p>

In [None]:
##################
# Own vs PyTorch #
##################

print(f'Our implementation:     {torch.mean(my_model.training_history[-1])}')
print(f'PyTorch implementation: {torch.mean(torch_training_history[-1])}')

In [None]:
############
# Training #
# Overview #
############

import matplotlib.pyplot as plt

plt.plot([torch.mean(epoch) for epoch in torch_training_history], label='PyTorch implementation')
plt.plot([torch.mean(epoch) for epoch in my_model.training_history], label='Our implementation')

plt.title('Training Overview')
plt.xlabel('Epoch')
plt.ylabel('BCELoss')

plt.legend()
plt.grid()

plt.show()

<center><img width=60% src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_2/images/done.jpg"/></center>

<ul style="list-style-type:round">
<i>
    <li>Lane, H., Howard, C., &amp; Hapke, H.M. (2019). Natural Language Processing in Action. Shelter Island, NY: Manning Publications Co.</li>
    <li>Loy, J. (2019). Neural Network Projects with Python. Birmingham, UK: Packt Publishing Ltd.</li>
    <li>Machine Learning Glossary. (2017). Gradient Descent. Retrieved from https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html.</li>
    <li>Patterson, J., &amp; Gibson, A. (2017). Deep learning. Sebastopol, CA: O’Reilly Media, Inc.</li>
    <li>Raschka, S., &amp; Mirjalili, V. (2017). Python Machine Learning (2nd ed.). Birmingham, UK: Packt Publishing Ltd.</li>
</i>
</ul>