In [None]:
### Imports
from IPython.display import display
from ipywidgets import interactive
import ipywidgets as widgets
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Basic structure of artificial neural networks

The conceptual foundations of **[artificial neural networks](https://en.wikipedia.org/wiki/Neural_network_(machine_learning))** go back to the work of [Warren McCulloch](https://en.wikipedia.org/wiki/Warren_Sturgis_McCulloch) and [Walter Pitts](https://en.wikipedia.org/wiki/Walter_Pitts), who proposed linked networks for spatial pattern recognition in analogy to neurons as early as $1943$. In 1958, [Frank Rosenblatt et al.](https://en.wikipedia.org/wiki/Frank_Rosenblatt) achieved the first practical implementation of a neural network in the form of the **[perceptron](https://en.wikipedia.org/wiki/Perceptron)**. In 1969, criticism by [Marvin Minsky](https://en.wikipedia.org/wiki/Marvin_Minsky) of the inability to solve non-linearly separable problems (such as the **[XOR problem](https://en.wikipedia.org/wiki/XOR_gate)**) with simple perceptrons led to a temporary decline in research interest (the so-called AI winter). This changed in the 1980s, when various advances in AI research, such as the method of **[backpropagation](https://en.wikipedia.org/wiki/Backpropagation)**, showed that multilayer perceptrons are also capable of solving non-linearly separable problems.

We have talked in detail elsewhere about the different types of machine learning - **[unsupervised learning](https://en.wikipedia.org/wiki/Unsupervised_learning)**, **[supervised learning](https://en.wikipedia.org/wiki/Supervised_learning)** and **[reinforcement learning](https://de.wikipedia.org/wiki/Best%C3%A4rkendes_Lernen)**. We have presented various machine learning algorithms that are used for the different types of learning. Neural networks are characterized in particular by the fact that they can be used successfully in all three types of learning with appropriate preparation. This universal applicability also explains the increased use of artificial neural networks in a wide variety of areas.

In this workshop we will look at the strengths and weaknesses of **neural networks** as well as their possible applications and the underlying mathematical formulation.

## Structure of the neuron

Let's first take a look at the model for neural networks: the nerve cell.

<img src="./images_en/realistische-neuronenanatomie_1284-68077.avif" alt="drawing" width="80%"/>

As shown in the figure above, a nerve cell consists of **dendrites**, **soma** and **axon**. The **dendrites** absorb messenger substances, the so-called neurotransmitters, when these are released by excited neighboring nerve cells. The connections between the dendrites and the axon of the preceding cell are called **synapses**. The neuron has a membrane potential that initially suppresses the transmission of a nerve stimulus. Only when a certain **excitation threshold** is exceeded is an **action potential** triggered and a nerve stimulus transmitted to the next cell via the **axon**. 

When the neuron is stimulated, the so-called **[all-or-none law](https://en.wikipedia.org/wiki/All-or-none_law)** applies: an action potential is either triggered completely or not at all.

This can be expressed mathematically with the **Heaviside step function**, which is defined as

$$H(x) = \begin{cases} 0 & \text{for} \ x \lt 0, \\ 1 & \text{for} \ x \ge 0 \end{cases} $$

In [None]:
# Heaviside step function: 0 for x < 0, 1 for x >= 0
def heaviside(x):
    return np.where(x >= 0, 1.0, 0.0)


# Generate values for x
x = np.linspace(-5, 5, 1000)

# Calculate the values of the heaviside step function
y = heaviside(x)

# Plot
plt.figure(dpi=600, figsize=(6, 3))
plt.rcParams.update({'font.size': 6})
plt.tight_layout()
plt.plot(x, y, linewidth = 2.5)
plt.xlabel('x')
plt.ylabel('H(x)')
plt.title('Heaviside step function')
plt.ylim([-0.5,1.5])
plt.grid(True)
plt.show()

In biological terms, we therefore have a **variable stimulus**, a **threshold value** that must be exceeded and an **activation function** to trigger the stimulus.

## Simple perceptron

Next, let's look at the simplest structure of a neural network, the **perceptron** originally proposed by Rosenblatt. In the example shown in the following figure, it consists of two inputs, the neuron itself and one output.

<img src="./images_en/perceptron.png" alt="drawing" width="80%"/>

The structure of **artificial neural networks** follows the basic principle of a **biological nervous system**. In neural networks, neurons correspond to the **[artificial neurons](https://en.wikipedia.org/wiki/Artificial_neuron)** or nodes in the network. These neurons are the basic processing units. 

Artificial neural networks consist of individual neurons that are arranged in so-called layers. The first layer or *input layer* consists of the input values, followed by further layers of neurons, the so-called *hidden layers*, and finally an *output layer*.

In analogy to biological neurons, which are activated by stimulation above a certain threshold value to transmit stimuli, in neural networks the weighted sum of the inputs is passed on to connected neurons. In relation to a neuron, this results in:

$$ \text{Input} = \sum_{i=1}^N x_i w_i $$

The $x_i$ are the individual input values and the $w_i$ are the respective weights of the $N$ inputs.

Equivalent to the biological threshold at which a neuron is activated to transmit a signal, we can add a threshold value $b$, the so-called bias.

$$ \text{Input} = \sum_{i=1}^N x_i w_i + b$$

We still normally have to apply a suitable activation function to the output calculated in this way.

$$f_{\text{Active}}(\text{Input}) = f_{\text{Active}}\left(\sum_{i=1}^N x_i w_i + b\right) = \text{Output}$$

The **activation function** maps the weighted sum of the input values to a specific value range and is used to control the output of a neuron or a layer. It decides whether and to what extent a neuron is activated and what information is passed on to the next layers. In addition, suitable activation functions make it possible to describe non-linear relationships. We will come back to the exact form of different activation functions later and for the moment we will assume linearly transmitted values.

So mathematically speaking 

$$f_{\text{Active}}(\text{Input}) = \text{Input}$$ 

and thus the following applies

$$ \text{Input} = \sum_{i=1}^N x_i w_i + b = \text{Output}$$
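The formulas above can be sketched in a few lines of `Python`. This is only a minimal illustration: the names `neuron_output` and `step` as well as the weights and the bias are assumptions chosen for this example and are not part of the workshop material. With weights $1, 1$ and bias $-1.5$, a neuron with step activation behaves like a logical AND, while the default linear activation simply returns the weighted sum.

```python
import numpy as np

def step(z):
    """Heaviside step activation: 1 for z >= 0, otherwise 0."""
    return np.where(z >= 0, 1.0, 0.0)

def neuron_output(x, w, b, activation=lambda z: z):
    """Weighted sum of the inputs plus bias, passed through an activation."""
    z = np.dot(x, w) + b
    return activation(z)

w_and = np.array([1.0, 1.0])  # hypothetical weights
b_and = -1.5                  # hypothetical bias, acts as the threshold

# Step activation: only the input (1, 1) exceeds the threshold
for x_pair in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x_pair, neuron_output(np.array(x_pair), w_and, b_and, activation=step))

# Linear activation: the output equals the weighted sum plus bias
print(neuron_output(np.array([5, 7]), np.array([6, 4]), 3))
```

The bias shifts the point at which the step function switches, which is why it plays the role of the threshold.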

In [None]:
### Figure: Weights and bias
plt.rcParams.update({'font.size': 10})
# Values for weight and bias
weights = [3, 2, 1]  # Weight values for the first plot
bias_values = [0]  # Bias for the first plot (bias = 0)
weight_constant = 1  # Constant value for weight in the second plot
bias_values_2 = [1, 0, -1]  # Different bias values for the second plot

# Create two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6), dpi = 600)

# Plot for the weight values with bias = 0
for weight in weights:
    ax1.plot([-2, -1, 0, 1, 2], [weight * x + 0 for x in [-2, -1, 0, 1, 2]], label=f"Weight = {weight}")
ax1.set_title("Different weights with constant bias")
ax1.set_xlabel("x")
ax1.set_ylabel("y = weight * x + bias")
ax1.legend()
ax1.grid(True)
ax1.set_xticks(np.arange(-2,3,1))
ax1.set_yticks(np.arange(-6,7,1))
ax1.axhline(0, color='black', linewidth=1)  # horizontal axis line at y = 0
ax1.axvline(0, color='black', linewidth=1)  # vertical axis line at x = 0
ax1.annotate('', xy=(2, 0), xytext=(0, 0), arrowprops=dict(facecolor='black', edgecolor='black', arrowstyle='->', lw=1))
ax1.annotate('', xy=(0, 6), xytext=(0, 0), arrowprops=dict(facecolor='black', edgecolor='black', arrowstyle='->', lw=1))


# Plot for constant weight = 1 and different bias values
for bias in bias_values_2:
    ax2.plot([-2, -1, 0, 1, 2], [weight_constant * x + bias for x in [-2, -1, 0, 1, 2]], label=f"Bias = {bias}")
ax2.set_title("Different bias at constant weight")
ax2.set_xlabel("x")
ax2.set_ylabel("y = weight * x + bias")
ax2.legend()
ax2.grid(True)
ax2.set_xticks(np.arange(-2,3,1))
ax2.set_yticks(np.arange(-4,7,1))
ax2.axhline(0, color='black', linewidth=1)  # horizontal axis line at y = 0
ax2.axvline(0, color='black', linewidth=1)  # vertical axis line at x = 0
ax2.annotate('', xy=(2, 0), xytext=(0, 0), arrowprops=dict(facecolor='black', edgecolor='black', arrowstyle='->', lw=1))
ax2.annotate('', xy=(0, 5), xytext=(0, 0), arrowprops=dict(facecolor='black', edgecolor='black', arrowstyle='->', lw=1))

# Show plot
plt.tight_layout()
plt.show()

Let's look at a simple example in `Python` to understand some of the mathematical concepts involved in neural networks.

## Simple neural network in `Python`

<img src="./images_en/einfaches_netz1.png" alt="drawing" width="30%"/>

We first try to model the input layer of a network consisting of a neuron with two input values $x_1, x_2$, two weights $w_1, w_2$ and a bias $b$ in `Python`. During the first run of the network, the input values are first passed to the input layer and the weights and bias are initialized randomly.

### Forward propagation of the network in `Python`

We start by creating the input layer and model the calculation of the output of the first neuron. This corresponds to the first step of forward propagation.

$$\text{Input} = \sum_{i=1}^2 x_i w_i + b = x_1 w_1 + x_2 w_2 + b$$

### Feedforward propagation - input layer with one neuron

In [None]:
# Input values
inputs = [5, 7]

# Randomly initialized weights
W1 = [6, 4]

# Randomly initialized bias
b1 = 3

# Activation
A1 = inputs[0] * W1[0] + inputs[1] * W1[1] + b1
A1

### Exercise:

Calculate the activation of a neuron for the input values $10$ and $12$, the weights of the input layer $1$ and $2$ and a bias of $5$. How can we extend the model to three input values?

In [None]:
### Your code here ...

Let's generalize our model to two neurons in the hidden layer.

### Feedforward propagation - hidden layer with two neurons

In [None]:
### Figure: two neurons in the hidden layer
# Slides
image_paths = ["./images_en/w_b1.png", "./images_en/w_b2.png", "./images_en/w_b3.png", "./images_en/w_b4.png", "./images_en/w_b5.png", "./images_en/w_b6.png", "./images_en/w_b7.png", "./images_en/w_b8.png"]

# Selection of images
def show_image(index):
    img = mpimg.imread(image_paths[index])
    plt.figure(figsize=(2, 1), dpi = 600)
    plt.imshow(img)
    plt.axis('off')
    plt.show()

# Create slider widget
slider = widgets.IntSlider(min=0, max=len(image_paths) - 1, step=1, description='Picture')
widgets.interactive(show_image, index=slider)


In [None]:
# Input values
inputs = [5, 7]

# Randomly initialized weights for inputs in neuron 1
W1 = [6, 4]

# Randomly initialized bias for inputs in neuron 1
b1 = 3

# Randomly initialized weights for inputs in neuron 2
W2 = [5, 3]

# Randomly initialized bias for inputs in neuron 2
b2 = -5

A1 = inputs[0] * W1[0] + inputs[1] * W1[1] + b1
     
A2 = inputs[0] * W2[0] + inputs[1] * W2[1] + b2

print('Output of neuron 1:', A1)
print('Output of neuron 2:', A2)

This can be simplified by collecting all weights of a layer in a matrix and writing the inputs and biases as column vectors. We therefore calculate the following:

$$
A_1
=
W_1 \cdot \vec{X} + \vec{b_1}
=
\begin{pmatrix}
6 & 4 \\
5 & 3
\end{pmatrix}
\cdot
\begin{pmatrix}
5  \\
7 
\end{pmatrix}
+
\begin{pmatrix}
3  \\
-5 
\end{pmatrix}
=
\begin{pmatrix}
61 \\
41
\end{pmatrix}
$$
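As a quick plausibility check, the matrix form can be compared with the neuron-by-neuron calculation. The following sketch uses the same numbers as the equation above; the variable names `x_col`, `W_hidden` and `b_col` are chosen only for this illustration.

```python
import numpy as np

x_col = np.array([[5], [7]])      # inputs as a column vector, shape (2, 1)
W_hidden = np.array([[6, 4],      # row 0: weights of neuron 1
                     [5, 3]])     # row 1: weights of neuron 2
b_col = np.array([[3], [-5]])     # one bias per neuron, shape (2, 1)

# Matrix form: all neurons of the layer at once
out_matrix = W_hidden @ x_col + b_col

# Neuron by neuron: one weighted sum per row of the weight matrix
out_loop = np.array([[W_hidden[i] @ x_col[:, 0] + b_col[i, 0]]
                     for i in range(W_hidden.shape[0])])

print(out_matrix.ravel())
print(np.array_equal(out_matrix, out_loop))
```

Each row of the weight matrix holds the weights of one neuron, so adding a neuron to the layer means adding a row to the matrix and an entry to the bias vector.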

In [None]:
# Input values
X = np.array([[5],[7]])
# Weights of the input layer as a matrix
W1 = np.array([[6, 4],[5, 3]])
# Biases of the input layer as a column vector
b1 = np.array([[3],[-5]])

In [None]:
# Output hidden layer
W1 @ X + b1

### Vectors and matrices in `NumPy`

In this section we deal with the library `NumPy`. `NumPy` is specialized for numerical calculations in `Python` and covers many areas of mathematics. Let's see how we can use `NumPy` to do linear algebra.

### Vectors

The **scalar product** of two vectors $\vec{a}=\begin{pmatrix}
1 \\
2 \\
3
\end{pmatrix}$ and $\vec{b}=\begin{pmatrix}
4 \\
5 \\
6
\end{pmatrix}$ is given by :

$$\vec{a} \cdot \vec{b}=
\begin{pmatrix}
1 \\
2 \\
3
\end{pmatrix}
\cdot
\begin{pmatrix}
4 \\
5 \\
6
\end{pmatrix}
=
1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 4 + 10 + 18 = 32
$$

We can create arrays in `NumPy` with the `array()` function by passing the entries as (nested) lists. We can write the two vectors $\vec{a}$ and $\vec{b}$ from the above example in `NumPy` as follows:

In [None]:
a = np.array([1,2,3])

b = np.array([4,5,6])

In `NumPy` we can calculate the scalar product of two vectors with the function `dot()` or the symbol `@`:

In [None]:
# Vector or matrix multiplication
a @ b

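*Note*: Without further specification, a one-dimensional array in `NumPy` is neither a row vector nor a column vector in the mathematical sense. How it is interpreted depends on the rules of the respective operation: the `@` operator, for example, treats a one-dimensional array on the left as a row vector and on the right as a column vector.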
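The difference between one-dimensional arrays and explicit row or column vectors can be made visible with a small experiment. The sketch below uses `reshape` to add the missing dimension; the names `v`, `col` and `row` are chosen only for this illustration.

```python
import numpy as np

v = np.array([1, 2, 3])        # one-dimensional array, shape (3,)
print(v.shape, v.T.shape)      # transposing a 1-D array changes nothing

col = v.reshape(-1, 1)         # explicit column vector, shape (3, 1)
row = v.reshape(1, -1)         # explicit row vector, shape (1, 3)
print(col.shape, row.shape)

print((row @ col).shape)       # (1, 1): scalar product as a 1x1 matrix
print((col @ row).shape)       # (3, 3): dyadic (outer) product

# The dyadic product of explicit vectors matches np.outer on 1-D arrays
print(np.array_equal(col @ row, np.outer(v, v)))
```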

In [None]:
vector = np.array([1,2,3,4])

In [None]:
vector.shape

### Example: Dyadic product

With one-dimensional arrays (the result is the scalar product in either order):

In [None]:
a = np.array([1,2,3])
b = np.array([4,5,6])
print('Shape_a', a.shape)
print('Shape_b', b.shape)

In [None]:
a @ b

In [None]:
b @ a

With explicit column and row vectors (here `a @ b` is the dyadic product and `b @ a` the scalar product):

In [None]:
a = np.array([[1],[2],[3]])
b = np.array([[4,5,6]])
print('Shape_a', a.shape)
print('Shape_b', b.shape)

In [None]:
#a @ b

In [None]:
# b @ a

To define a vector uniquely in `NumPy` we can specify the missing dimension with additional square brackets.

#### Example: Column vector

$
\vec{a}
= 
\begin{pmatrix}
1 \\
2 \\
3 \\
4
\end{pmatrix}
$

In [None]:
vector_column = np.array([[1],[2],[3],[4]])

In [None]:
vector_column.shape

#### Example: Row vector

$
\vec{b}
= 
\begin{pmatrix}
1 & 2 & 3 & 4
\end{pmatrix}
=
\vec{a}^T
$

In [None]:
vector_row = np.array([[1, 2, 3, 4]])

In [None]:
vector_row.shape

We can transpose vectors in `NumPy` with the syntax `vector.T`:

In [None]:
vector_column.T.shape

### Exercise: 

Calculate the scalar product $\vec{a}^T \cdot \vec{b}$ for the vectors:

$
\vec{a}
=
\begin{pmatrix}
1 \\
2 \\
3
\end{pmatrix}$
,
$
\vec{b}
=
\begin{pmatrix}
4 \\
5 \\
6
\end{pmatrix}$

In [None]:
### Your code here ...

### Matrices

A matrix is made up of column and row vectors, whereby an $(m \times n)$ matrix consists of $m$ rows and $n$ columns. The example below therefore represents a $(3 \times 3)$ matrix:

$$A =\begin{pmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{pmatrix}$$

Note that matrices are a generalization of vectors: a column vector corresponds to the special case of an $(n \times 1)$ matrix. We can write the above matrix $A$ in `NumPy` by passing its rows as a list of lists.

In [None]:
a = np.array([[1],[2],[3]])

A = np.array([[1,2,3],
              [4,5,6],
              [7,8,9]])

### Multiplication of vectors with matrices

$\vec{a}=\begin{pmatrix}
1 \\
2 \\
3
\end{pmatrix}$ , $A =\begin{pmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{pmatrix}$

$A \cdot \vec{a} = \begin{pmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{pmatrix}
\cdot
\begin{pmatrix}
1 \\
2 \\
3
\end{pmatrix}=
\begin{pmatrix}
1 \cdot 1 + 2 \cdot 2 + 3 \cdot 3 \\
4 \cdot 1 + 5 \cdot 2 + 6 \cdot 3 \\
7 \cdot 1 + 8 \cdot 2 + 9 \cdot 3
\end{pmatrix}=
\begin{pmatrix}
14 \\
32 \\
50
\end{pmatrix}
$ 

In [None]:
A @ a

### Dimension of vectors and matrices

### Matrices

Matrices have dimensions corresponding to their number of rows $m$ and columns $n$:

#### Example: $4 \times 4$ matrix

$A =
\begin{pmatrix}
1 & 2 & 3 & 4 \\
5 & 6 & 7 & 8 \\
9 & 10 & 11 & 12 \\
13 & 14 & 15 & 16
\end{pmatrix}
$

In [None]:
A_matrix = np.array([[1,2,3,4],
                    [5,6,7,8],
                    [9,10,11,12],
                    [13,14,15,16]])


print('A_matrix:')
print(A_matrix)
print('')
print('Dimensions of A_matrix:',A_matrix.shape)

Similar to vectors, matrices can also be transposed by swapping rows and columns:

$A^T=
\begin{pmatrix}
1 & 5 & 9 & 13 \\
2 & 6 & 10 & 14 \\
3 & 7 & 11 & 15 \\
4 & 8 & 12 & 16
\end{pmatrix}
$

In [None]:
A_matrix.T

*Note*: In order for vectors or matrices to be multiplied together, the left vector or matrix must have the same number of columns as the right vector or matrix has rows. 
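The shape rule from the note can be tried out directly. In the following sketch the names `M` and `c` are chosen only for this illustration; an incompatible product makes `NumPy` raise a `ValueError`, which is caught here so that the example runs through.

```python
import numpy as np

M = np.arange(1, 10).reshape(3, 3)   # matrix, shape (3, 3)
c = np.array([[1], [2], [3]])        # column vector, shape (3, 1)

# Columns of M (3) match rows of c (3): the product is defined
print((M @ c).shape)

# Columns of c (1) do not match rows of M (3): the product is not defined
try:
    c @ M
except ValueError as err:
    print("Shape mismatch:", err)

# After transposing, (1, 3) @ (3, 3) is defined again
print((c.T @ M).shape)
```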

In [None]:
# Multiplication between vector and matrix is not commutative
print('Dimensions of a:', a.shape)
print('Dimensions of A:', A.shape)
#a @ A

In [None]:
# In general, for a product A @ B with shapes A = (m, n) and B = (o, p),
# the number of columns n of A must equal the number of rows o of B
print('Dimensions of a.T:', a.T.shape)
print('Dimensions of A :', A.shape)
a.T @ A

#### Exercise: 

Create the two matrices $A = \begin{pmatrix}1 & 2 \\ 4 & 4 \end{pmatrix}$ and $B=\begin{pmatrix}5 & 6 \\ 7 & 8 \end{pmatrix}$ in `NumPy` and calculate $A \cdot B$ and $B \cdot A$. Does $A \cdot B = B \cdot A$ apply?

In [None]:
### Your code here ...

## Simple neural network with matrices

Based on the previous theoretical considerations, we try to create a neural network with three input values, three neurons in the first hidden layer, four neurons in the second hidden layer and two neurons in the output layer.

### Feedforward Propagation - Neural network with multiple layers in matrix representation

### Initializing the weights and biases

In practical applications, the weights $W_i$ and biases $b_i$ of a neural network are initialized with random values at the beginning of the training phase. In `NumPy`, the function `random.randn(m, n)` creates an array of dimension $(m \times n)$ filled with standard normally distributed random numbers.

For this network we therefore create a column vector of input values $X$ of dimension $(3 \times 1)$, the matrix $W_1$ of the weights of the first hidden layer of dimension $(3 \times 3)$ with the corresponding biases $b_1$ of dimension $(3 \times 1)$, the matrix $W_2$ of the weights of the second hidden layer of dimension $(4 \times 3)$ with the associated biases $b_2$ of dimension $(4 \times 1)$, and the matrix $W_3$ of the weights of the output layer of dimension $(2 \times 4)$ with the associated biases $b_3$ of dimension $(2 \times 1)$.
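The dimensions listed above follow one pattern: a layer with $n_{\text{out}}$ neurons and $n_{\text{in}}$ inputs needs a weight matrix of shape $(n_{\text{out}} \times n_{\text{in}})$ and a bias vector of shape $(n_{\text{out}} \times 1)$. The sketch below builds all layers in a loop and prints the shapes. The names `layer_sizes`, `params` and `act` are illustrative assumptions, and `np.random.default_rng` with a fixed seed is used here only to make the run reproducible.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seed chosen only for reproducibility

layer_sizes = [3, 3, 4, 2]            # inputs, hidden 1, hidden 2, outputs

# One (weights, biases) pair per layer:
# weights have shape (n_out, n_in), biases have shape (n_out, 1)
params = [(rng.standard_normal((n_out, n_in)), rng.standard_normal((n_out, 1)))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

act = np.array([[5.0], [7.0], [1.0]])  # input column vector, shape (3, 1)
for k, (W_k, b_k) in enumerate(params, start=1):
    act = W_k @ act + b_k              # linear layer, no activation function yet
    print(f"layer {k}: W {W_k.shape}, b {b_k.shape}, output {act.shape}")
```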

<img src="./images_en/3_3_4_2_netz.png" alt="drawing" width="80%"/>

In [None]:
input_size = 3 # input layer
W1_size    = 3 # 3 neurons in the 1st hidden layer
W2_size    = 4 # 4 neurons in the 2nd hidden layer
W3_size    = 2 # 2 output classes

In [None]:
X = np.array([[5],[7],[1]]) # Input values

In [None]:
# Initialize random weights and biases
# Standard normal values are already centred on zero, so no shift is applied
W1 = np.random.randn(W1_size, input_size)
b1 = np.random.randn(W1_size, 1)  # one bias per neuron of the 1st hidden layer
W2 = np.random.randn(W2_size, W1_size)
b2 = np.random.randn(W2_size, 1)
W3 = np.random.randn(W3_size, W2_size)
b3 = np.random.randn(W3_size, 1)

### Output hidden layer $1$

In [None]:
W1

In [None]:
b1

In [None]:
# Output of the first hidden layer
A1 = W1 @ X + b1
A1

In [None]:
print('shape_inputs   :',X.shape, '\n')
print('shape_W1       :',W1.shape, '\n')
print('shape_b1       :',b1.shape, '\n')
print('shape_A1       :',A1.shape, '\n')

### Output hidden layer $2$

In [None]:
W2

In [None]:
b2

In [None]:
# Output of the second hidden layer

A2 = W2 @ A1 + b2
A2

In [None]:
print('shape_A1       :',A1.shape, '\n')
print('shape_W2       :',W2.shape, '\n')
print('shape_b2       :',b2.shape, '\n')
print('shape_A2       :',A2.shape, '\n')

### Output layer

In [None]:
W3

In [None]:
b3

In [None]:
# Output layer
A3 = W3 @ A2 + b3
A3

In [None]:
print('shape_A2       :',A2.shape, '\n')
print('shape_W3       :',W3.shape, '\n')
print('shape_b3       :',b3.shape, '\n')
print('shape_A3       :',A3.shape, '\n')

### Activation functions

Let's take a brief look at another important component of the neural network, the activation function. Activation functions transform the weighted sum computed by each neuron and make the response of the network non-linear.

An essential property of activation functions is the ability to map **non-linear relationships**. If we were to use only the linear weighted sum of the inputs, we would essentially only be composing linear functions with each other, resulting in linear functions again. For example, consider the two linear functions $f(u) = 3 u + 1$ and $g(x) = 4 x + 2$. The composition of these functions $f(g(x))$ can then be written as follows:

$$ f(g(x)) = 3 g(x) + 1 = 3 (4 x + 2) + 1 = 12 x + 7 $$

The result is again a linear function! On the other hand, a sufficiently large artificial neural network with non-linear activation functions is able to approximate any continuous function on a closed, bounded domain to arbitrary accuracy.
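The same argument carries over from scalar functions to whole layers: two linear layers without activation function can always be replaced by a single linear layer with weights $W_b W_a$ and bias $W_b \vec{b}_a + \vec{b}_b$. The following sketch checks this numerically for randomly chosen values; all names and layer sizes are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)   # seed chosen only for reproducibility

# Two linear layers without activation function: 3 -> 4 -> 2
W_a, b_a = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W_b, b_b = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
x_in = rng.standard_normal((3, 1))

two_layers = W_b @ (W_a @ x_in + b_a) + b_b

# Equivalent single linear layer: 3 -> 2
W_single = W_b @ W_a
b_single = W_b @ b_a + b_b
one_layer = W_single @ x_in + b_single

print(np.allclose(two_layers, one_layer))
```

Without a non-linear activation between the layers, additional depth therefore adds no expressive power.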

A frequently used activation function in this context is the **[sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function)** known from logistic regression:

$$A(z) = \frac{1}{1 + e^{-z}} $$

Where $A(z)$ is the activation of the neuron and $z = \sum_i w_i x_i +b$ is the weighted sum of the inputs. 

The sigmoid function is used both in hidden layers and as an output activation function in **binary classification networks**.

An activation function similar to the sigmoid function is the **[hyperbolic tangent](https://en.wikipedia.org/wiki/Hyperbolic_functions)**:

$$\tanh (z) = \frac{\sinh (z)}{\cosh (z)} = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

Both functions have in common that they restrict the range of values to which they map. The logistic function maps any value from $\mathbb{R}$ to the interval $(0, 1)$, and the hyperbolic tangent maps any value from $\mathbb{R}$ to the interval $(-1, 1)$. The bounds themselves are only approached asymptotically.
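A short numerical check illustrates the value ranges and the close relationship between the two functions, which are linked by the identity $\tanh(z) = 2 \, A(2z) - 1$ with the sigmoid function $A$. The helper name `sigmoid_fn` is chosen only for this sketch.

```python
import numpy as np

def sigmoid_fn(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-z))

z_vals = np.linspace(-5, 5, 11)
s_vals = sigmoid_fn(z_vals)
t_vals = np.tanh(z_vals)

# For these moderate inputs the outputs stay strictly inside the bounds
print(s_vals.min() > 0, s_vals.max() < 1)
print(t_vals.min() > -1, t_vals.max() < 1)

# tanh is a shifted and rescaled sigmoid
print(np.allclose(t_vals, 2 * sigmoid_fn(2 * z_vals) - 1))
```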

Evaluating the sigmoid function and the hyperbolic tangent is comparatively expensive, as the terms $e^{\pm z}$ have to be calculated and divided.

The **[rectifier function](https://en.wikipedia.org/wiki/Rectifier_(neural_networks))** (*rectified linear unit, ReLU*) serves as an alternative. It is cheaper to calculate than the sigmoid function or the hyperbolic tangent, and it also helps to mitigate the vanishing gradient problem that can occur with deep neural networks. The **ReLU function** is defined as follows:

$$ A(z) = \max(0,z) = \begin{cases}
z & \text{for} \ z \gt 0, \\ 0 & \text{else}
\end{cases} $$

Although the gradient of the ReLU function is zero for $z < 0$, ReLU activation functions in artificial neural networks show very good optimization performance and are now among the most commonly used activation functions in deep neural networks.

An extension of the ReLU function is the **[Leaky ReLU function](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)#Leaky_ReLU)**. The idea behind it is that negative values of $z$ also receive a small, non-zero gradient, so that neurons with negative inputs can still be updated during training; this counteracts the so-called **[vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem)**. The Leaky ReLU function is defined as follows:

$$ A(z) = \begin{cases}
z & \text{for} \ z \gt 0, \\ 0.01 \cdot z & \text{else}
\end{cases} $$
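A minimal sketch of the Leaky ReLU function and its slope, assuming the factor $0.01$ from the definition above as default. The function names `leaky_relu_fn` and `leaky_relu_grad` and the parameter name `alpha` are chosen for this illustration; at $z = 0$ the slope of the negative branch is used here by convention.

```python
import numpy as np

def leaky_relu_fn(z, alpha=0.01):
    """Leaky ReLU: z for z > 0, alpha * z otherwise."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    """Slope of the Leaky ReLU: 1 for z > 0, alpha otherwise."""
    return np.where(z > 0, 1.0, alpha)

z_vals = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu_fn(z_vals))     # negative inputs are scaled down, not cut off
print(leaky_relu_grad(z_vals))   # the slope never becomes exactly zero
```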

Another important activation function for the **output layer of classification networks** is the **[softmax function](https://en.wikipedia.org/wiki/Softmax_function)**. This is used to calculate a probability distribution over $K$ possible classes in **multi-class classification**. The softmax function is defined in *component notation* as follows:

$$ A(z_j) = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} $$

For three classes, for example, the **softmax function** reads:

$$ A(z_1) = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \ , \   A(z_2) = \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}} \ , \   A(z_3) = \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}$$

The individual components add up to $1$, so each component can be interpreted as the probability of belonging to one of the $K$ classes:

$$ A(z_1) +  A(z_2)+  A(z_3) = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} + \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}}+ \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}} =  1 $$
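The softmax function can be sketched in `NumPy` as follows. Subtracting the largest entry before exponentiating is a common numerical safeguard that is not part of the definition above: it leaves the result unchanged, because the common factor cancels in the fraction, but it avoids overflow for large inputs. The name `softmax_fn` and the example values are assumptions for this illustration.

```python
import numpy as np

def softmax_fn(z):
    """Softmax of a 1-D array, shifted by max(z) for numerical stability."""
    shifted = z - np.max(z)      # the common shift cancels in the fraction
    e = np.exp(shifted)
    return e / np.sum(e)

z_vals = np.array([2.0, 1.0, 0.1])   # hypothetical outputs for three classes
probs = softmax_fn(z_vals)
print(probs)
print(probs.sum())                   # the components add up to 1

# Large inputs would overflow in np.exp without the shift
print(softmax_fn(np.array([1000.0, 1001.0, 1002.0])))
```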

In contrast to the output activation of networks for classification, a **linear activation function** is used in the **output layer of regression networks** in order to obtain continuous values that are not restricted to a specific interval. The linear activation function can be written as follows:

$$A(z) = z$$

The most important **activation functions** are summarized again in the following illustration.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Generate values
x = np.linspace(-6, 6, 100)
sigmoid = 1 / (1 + np.exp(-x))
relu = np.maximum(0, x)
tanh = np.tanh(x)
linear = x
softmax = np.exp(x) / np.sum(np.exp(x), axis=0)
leaky_relu = np.where(x > 0, x, 0.01 * x)  # Leaky ReLU with alpha = 0.01

# Create plot
plt.figure(dpi=600, figsize=(6, 3))
plt.rcParams.update({'font.size': 6})
plt.tight_layout()

# Sigmoid activation function
plt.subplot(2, 3, 1)
plt.plot(x, sigmoid, label="Sigmoid", color="b")
plt.xlabel("Input")
plt.ylabel("Output")
plt.title("Sigmoid activation function")
plt.grid(True)
plt.legend()

# ReLU activation function
plt.subplot(2, 3, 2)
plt.plot(x, relu, label="ReLU", color="r")
plt.xlabel("Input")
plt.ylabel("Output")
plt.title("ReLU activation function")
plt.grid(True)
plt.legend()

# Tanh activation function
plt.subplot(2, 3, 3)
plt.plot(x, tanh, label="Tanh", color="g")
plt.xlabel("Input")
plt.ylabel("Output")
plt.title("Tanh activation function")
plt.grid(True)
plt.legend()

# Linear activation function
plt.subplot(2, 3, 4)
plt.plot(x, linear, label="Linear", color="m")
plt.xlabel("Input")
plt.ylabel("Output")
plt.title("Linear activation function")
plt.grid(True)
plt.legend()

# Softmax activation function
plt.subplot(2, 3, 5)
plt.plot(x, softmax, label="Softmax", color="c")
plt.xlabel("Input")
plt.ylabel("Output")
plt.title("Softmax activation function")
plt.grid(True)
plt.legend()

# Leaky ReLU activation function
plt.subplot(2, 3, 6)
plt.plot(x, leaky_relu, label="Leaky ReLU", color="y")
plt.xlabel("Input")
plt.ylabel("Output")
plt.title("Leaky ReLU activation function (Alpha=0.01)")
plt.grid(True)
plt.legend()

plt.tight_layout()
plt.show()

## Simple neural network with activation function

Let's apply activation functions to the network outlined above. We use **ReLU activation** in the first and second hidden layer and **Sigmoid activation** in the output layer.

In [None]:
def relu_func(x):
    return np.maximum(0, x)

In [None]:
def sigmoid(X):
    return 1 / (1 + np.exp(-X))

### Feedforward Propagation

### Initializing the weights and biases

We are trying to create a neural network from the theoretical considerations so far.

In [None]:
input_size = 3 # input layer
W1_size    = 3 # 3 neurons in the 1st hidden layer - ReLU activation
W2_size    = 4 # 4 neurons in the 2nd hidden layer - ReLU activation
W3_size    = 2 # 2 output classes - sigmoid activation

In [None]:
X= np.array([[5],[7],[1]]) # Input values

In [None]:
# Initialize random weights and biases
W1 = np.random.rand(W1_size, input_size) - 0.5
b1 = np.random.rand(W1_size, 1) - 0.5  # one bias per neuron of the 1st hidden layer
W2 = np.random.rand(W2_size, W1_size) - 0.5
b2 = np.random.rand(W2_size, 1) - 0.5
W3 = np.random.rand(W3_size, W2_size) - 0.5
b3 = np.random.rand(W3_size, 1) - 0.5

### Output hidden layer $1$

In [None]:
W1

In [None]:
b1

In [None]:
# Output of the first hidden layer with ReLU activation
A1 = relu_func(W1 @ X + b1)
A1

### Output hidden layer $2$

In [None]:
W2

In [None]:
b2

In [None]:
# Output of the second hidden layer with ReLU activation
A2 = relu_func(W2 @ A1 + b2)
A2

### Output layer

In [None]:
W3

In [None]:
b3

In [None]:
# Output layer with sigmoid activation
A3 = sigmoid(W3 @ A2 + b3)
A3

## Numerical differentiation

In preparation for the **backpropagation algorithm**, we will briefly deal with the numerical differentiation of functions.

As an example, let's look at the function $f(x) = x^5 + x^3 + x$ and its derivative:

In [None]:
def f(x):
    return x**5 + x**3 + x

In [None]:
### Figure: f(x) = x**5 + x**3 + x
plt.figure(dpi=600, figsize=(6, 3))
plt.rcParams.update({'font.size': 6})
plt.grid()
plt.tight_layout()
x = np.linspace(-2,2, 1000)
y = f(x)


_ = plt.plot(x,y)

### Difference quotient

To differentiate the function numerically, we can use the **difference quotient** $\frac{ f (x + \epsilon) - f (x)}{(x + \epsilon) - x }$ of the function at the points $x$ and $x + \epsilon$ as the simplest approximation, where $\epsilon$ denotes the step size.

$$f^{\prime}(x) = \frac{d f (x)}{d x}  \approx \frac{ f (x + \epsilon) - f (x)}{(x + \epsilon) - x } = \frac{ f (x + \epsilon) - f (x)}{\epsilon} $$

$$\text{e.g.: } f^{\prime}(x) = (x^2)^{\prime} = 2x  \approx \frac{ (x + \epsilon)^2 - x^2}{\epsilon}$$
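The quality of this approximation can be checked at a single point. The sketch below evaluates the difference quotient for $x^2$ at $x = 3$ and compares it with the exact value $2x = 6$. For comparison it also includes the central difference $\frac{f(x + \epsilon) - f(x - \epsilon)}{2 \epsilon}$, a different and usually more accurate formula that is not used elsewhere in this workshop. The function names are chosen for this illustration.

```python
def forward_diff(func, x, eps=1e-5):
    """Difference quotient as defined above (forward difference)."""
    return (func(x + eps) - func(x)) / eps

def central_diff(func, x, eps=1e-5):
    """Central difference, shown only for comparison."""
    return (func(x + eps) - func(x - eps)) / (2 * eps)

def square(x):
    return x ** 2

x0 = 3.0
exact = 2 * x0

err_forward = abs(forward_diff(square, x0) - exact)
err_central = abs(central_diff(square, x0) - exact)
print(err_forward)   # on the order of eps
print(err_central)   # limited mainly by floating-point rounding
```

For $x^2$ the forward difference equals $2x + \epsilon$, so its error is essentially the step size itself.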

We can write this in `Python` as follows:

In [None]:
def derivative(f, x, delta = 10**-5):
    return (f(x + delta) - f(x))/delta

In [None]:
### Figure: f(x), f'(x)
plt.figure(dpi=600, figsize=(6, 3))
plt.grid()
plt.xlim([-2.2,2.2])
plt.ylim([-2.2,7.2])
plt.yticks(np.arange(-2, 8, 1))
plt.xticks(np.arange(-2, 3, 1))
plt.rcParams.update({'font.size': 6})
plt.tight_layout()
plt.plot(x,y)
_ = plt.plot(x, derivative(f, x, 10**-5))

### Derivative of the ReLU activation function

The ReLU function is given by $f_{ReLU}(x) = \max(0, x)$. Let's try to numerically differentiate this important and perhaps somewhat unintuitive function with our `derivative()` function.

In [None]:
def relu_func(x):
    return np.maximum(0, x)

In [None]:
derivative(relu_func, x, delta = 1e-5)

In [None]:
# Plot
plt.figure(dpi=600, figsize=(6, 3))
plt.rcParams.update({'font.size': 6})
plt.tight_layout()
plt.plot(x, derivative(relu_func, x, delta = 1e-5), linewidth = 2.5)
plt.xlabel("x")
plt.ylabel("H(x)")
plt.title("Heaviside step function as a derivative of the ReLU function")
plt.ylim([-0.5,1.5])
plt.grid(True)
plt.show()

### Problem with numerical derivatives - non-differentiable points

In [None]:
### Figure: ReLU function and derivative of the ReLU function
# ReLU function and numerical derivative
def relu_func(x):
    return np.maximum(0, x)

def derivative(func, x, delta=1e-5):
    return (func(x + delta) - func(x)) / (delta)

# Function that updates the plot
def plot_reAct(num_points):
    x = np.linspace(-2, 2, num_points)  # Adjust the number of points

    plt.figure(dpi=600, figsize=(6, 3))
    plt.rcParams.update({'font.size': 6})
    plt.tight_layout()

    # ReLU function
    plt.subplot(2, 1, 1)
    plt.plot(x, relu_func(x), label='ReLU(x)', color='blue')
    plt.title('ReLU function')
    plt.xlabel('x')
    plt.ylabel('ReLU(x)')
    plt.xlim(-2, 2)
    plt.grid(True)
    plt.legend()

    # Numerical derivative of the ReLU function
    plt.subplot(2, 1, 2)
    plt.plot(x, derivative(relu_func, x, delta=1e-5), label="ReLU'(x)", color='red')
    plt.title('Numerical derivative of the ReLU function')
    plt.xlabel('x')
    plt.ylabel("ReLU'(x)")
    plt.xlim(-2, 2)
    plt.grid(True)
    plt.legend()
    plt.tight_layout()
    plt.show()

# Slider for the number of points displayed
interactive_plot = interactive(plot_reAct, num_points=(70, 1000, 1))  # Slider from 70 to 1000 points
interactive_plot

A simpler solution is not to compute the derivative numerically, but to define it directly using the Heaviside function; the value at the kink $z = 0$ is set to $0$ by convention here.
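Why the difference quotient is unreliable near the kink can be seen at a point just to the left of zero: the step of size `delta` reaches across the kink, so the estimate lands somewhere between the two true slopes $0$ and $1$. The sketch below compares the difference quotient with a directly defined slope; the names `relu`, `forward_diff` and `relu_grad` are chosen for this illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward_diff(func, x, delta=1e-5):
    """Difference quotient with step size delta."""
    return (func(x + delta) - func(x)) / delta

def relu_grad(z):
    """Directly defined slope: 1 for z > 0, otherwise 0."""
    return np.where(z > 0, 1.0, 0.0)

# Two points far from the kink and two points within half a step of it
pts = np.array([-1.0, -0.5e-5, 0.5e-5, 1.0])
print(forward_diff(relu, pts))   # roughly 0.5 at the point just left of zero
print(relu_grad(pts))            # exactly 0 or 1 everywhere
```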

In [None]:
def derivative_ReLU(Z):
    return np.where(Z > 0, 1, 0) 

In [None]:
x = np.linspace(-2,2, 1000)
plt.title('The Heaviside function as a derivative of the ReLU function')
plt.xlabel('x')
plt.ylabel("ReLU'(x)")
_ = plt.plot(x, derivative_ReLU(x), color = 'red')

### Exercise

Using the function `derivative(f, x, delta = 10**-5)` defined above, differentiate the function $f(x) = \sin(x) \cdot \cos(x)$ and plot the derivative in the range $[-5, 5]$.

In [None]:
x_werte = np.linspace(-5, 5)

In [None]:
### Your code here ...

In [None]:
#_ = plt.plot(x, derivative(f, x))

## Gradient Descent

Classical **[gradient descent](https://en.wikipedia.org/wiki/Gradient_descent)** aims to determine a minimum of a given function, in the ideal case the global minimum (a maximum can be found analogously by following the positive gradient); this is also referred to as an **[optimization problem](https://en.wikipedia.org/wiki/Mathematical_optimization)**. The procedure is to form the **[gradient](https://en.wikipedia.org/wiki/Gradient)** of the function with respect to all independent parameters. The gradient points in the direction of the steepest increase of the function. To arrive at a minimum, we choose a starting value $w_{\text{old}}$ as the starting point of the iterative descent. At each step we move in the direction of the negative gradient by subtracting the gradient times a learning rate $\alpha$ from $w_{\text{old}}$ to calculate $w_{\text{new}}$. In general, we can write:

$$w_{\text{new}} = w_{\text{old}} - \alpha \cdot \nabla f(w) $$

Here, $\nabla$ is the **[Nabla operator](https://en.wikipedia.org/wiki/Del)** also called **Del operator**. In the one-dimensional case, it simply corresponds to the derivative with respect to the independent variable.

For example, let's consider determining the minimum of the function $f(x)= x^2$ to understand the gradient descent method.

In [None]:
N = 50
x = np.linspace(-5, 5, N)
y = x**2
plt.figure(dpi=600, figsize=(6, 3))
plt.rcParams.update({'font.size': 6})
plt.tight_layout()
_ = plt.plot(x, y)

We determine the gradient of the function $f(x)=x^2$:

$$\frac{df}{dx} = 2 x$$

To determine the minimum, we subtract the gradient multiplied by a learning rate $\alpha$ from a randomly selected starting value and iterate until the gradient converges to $0$. We thus move in the opposite direction of the largest increase in the function with each iteration towards a minimum. In each step, we calculate the next $x$ value:

$$x_{\text{new}} = x_{\text{old}} - \alpha \cdot \frac{df}{dx}(x_{\text{old}}) = x_{\text{old}} - \alpha \cdot 2 x_{\text{old}}$$

Let's take a look at a code example:

In [None]:
x_old = 5
alpha = 0.1

In [None]:
for i in range(0, 30):
    x_new = x_old - alpha * (2 * x_old)
    x_old = x_new
x_new

As we can see, the value for $x_{\text{new}}$ converges towards the minimum of $f(x)$ at $x = 0$.
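
Instead of a fixed number of iterations, the loop can also stop as soon as the gradient is close enough to $0$. The following is a minimal sketch of this idea; the function name `gradient_descent` and the parameters `tol` and `max_steps` are assumptions made for this illustration and do not appear elsewhere in the notebook.

```python
def gradient_descent(df, x_start, alpha=0.1, tol=1e-8, max_steps=10_000):
    """Iterate x <- x - alpha * df(x) until |df(x)| < tol or max_steps is reached."""
    x = x_start
    for step in range(max_steps):
        grad = df(x)
        if abs(grad) < tol:
            # Gradient is (almost) zero: we are close to a stationary point
            return x, step
        x = x - alpha * grad
    return x, max_steps


# Derivative of f(x) = x**2 is 2 * x
x_min, steps = gradient_descent(lambda x: 2 * x, x_start=5.0)
print(x_min, steps)
```

The tolerance controls the trade-off between accuracy and the number of steps required.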

The following figure shows the first three steps in the gradient descent method with a learning rate of `alpha = 0.1` and starting from the starting point `x_old = 5`. Note that the step size decreases the closer we get to the minimum.

In [None]:
# Objective function and its derivative
def f(x):
    return x**2


def df(x):
    return 2 * x


# Starting point and learning rate for the gradient descent
x_start = 5.0
alpha = 0.1

# Number of steps
num_steps = 3
x_history = [x_start]

# Perform gradient descent and record every step
for _ in range(num_steps):
    x_start = x_start - alpha * df(x_start)
    x_history.append(x_start)

# X values for the function
x = np.linspace(-5, 5, 100)

# Create diagram
plt.figure(dpi=600, figsize=(6, 3))
plt.plot(x, f(x), label="f(x) = $x^2$", color="blue")
_ = plt.scatter(
    x_history, [f(x) for x in x_history], color="red", label="Gradient Descent Steps"
)

# Draw arrows connecting the steps of the gradient descent
for i in range(1, len(x_history)):
    dx = x_history[i] - x_history[i - 1]
    dy = f(x_history[i]) - f(x_history[i - 1])
    plt.quiver(
        x_history[i - 1],
        f(x_history[i - 1]),
        dx,
        dy,
        angles="xy",
        scale_units="xy",
        scale=1,
        color="green",
        width=0.0075,
        headaxislength=4,
        headlength=4,
    )

plt.xlabel("x")
plt.ylabel("f(x)")
plt.legend()
plt.title("Gradient Descent")
plt.grid(True)
plt.rcParams.update({'font.size': 6})
plt.tight_layout()
plt.show()

The convergence speed (the necessary number of steps) depends on the learning rate and the randomly selected starting point. The aim is to find a middle way between a learning rate that is **too low**, which leads to an unnecessarily **high number of steps**, and a learning rate that is **too high**, which leads to **oscillating solutions** as it repeatedly jumps over the minimum. Often, finding the best learning rate is an iterative process of experimentation. You can start with different learning rate settings (e.g. $0.1$, $0.01$, $0.001$) and monitor the performance on a validation dataset.
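
The influence of the learning rate can be tried out directly on $f(x) = x^2$. The sketch below uses four illustrative learning rates (the values and the helper name `gradient_descent_history` are chosen for this example only): a very small one, the value used above, one that jumps over the minimum in every step, and one that is so large that the iterates move away from the minimum.

```python
def gradient_descent_history(alpha, x_start=5.0, num_steps=30):
    """Return all iterates of gradient descent on f(x) = x**2."""
    history = [x_start]
    for _ in range(num_steps):
        x_old = history[-1]
        history.append(x_old - alpha * 2 * x_old)
    return history


histories = {
    alpha: gradient_descent_history(alpha) for alpha in (0.001, 0.1, 0.9, 1.1)
}

for alpha, history in histories.items():
    first_steps = ", ".join(f"{x:.3g}" for x in history[:4])
    print(f"alpha = {alpha}: first iterates {first_steps}; "
          f"after 30 steps x = {history[-1]:.3g}")
```

For this particular function each step multiplies $x$ by $(1 - 2\alpha)$, so the sign of $x$ alternates for $\alpha > 0.5$ and the iteration diverges for $\alpha > 1$.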

Let us now try to apply the gradient descent method to train a neural network and consider the following example.

# Backpropagation

Let's look at an example of the **backpropagation algorithm** step by step. For the sake of clarity, we will use a network with two inputs, a hidden layer with two neurons and **linear activation** and an output layer with **sigmoid activation**.

## Forwardpropagation

First, we carry out a forward propagation for the neural network shown in the figure as usual.

<img src="./images_en/backprop_forward.png" alt="drawing" width="80%"/>

### Initial parameters

We assume the following initial parameters for weights and biases:

In [None]:
X1 = 0.85

X2 = 0.5

w1 = 0.75

w2 = 0.55

w3 = 0.05

w4 = 0.05

w5 = 0.05

w6 = 0.015

w7 = 0.85

w8 = 0.95

b1 = 0.25

b2 = 0.15

b3 = 0.15

b4 = 0.25

We use linear activation for the input and hidden layers and the sigmoid activation function for the output layer.

In [None]:
def sigmoid(X):
    return 1 / (1 + np.exp(-X))

In [None]:
def linear_activation(x):
    return x

In [None]:
fig, ax = plt.subplots(figsize=(12, 6), dpi = 600)
x_werte = np.linspace(-20,20,1000)
ax.set_title('Sigmoid activation function', fontsize = 14)
_ = ax.plot(x_werte, sigmoid(x_werte))

The derivative of the sigmoid function is given by

$S(x)^{\prime} = S(x) (1 - S(x)) $

We can check this by differentiating the sigmoid function numerically with our function `derivative()` and plotting the result together with the above expression.

In [None]:
fig, ax = plt.subplots(figsize=(12, 6), dpi = 600)
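ax.set_title('Derivative of the sigmoid function', fontsize = 14)
# Analytical expression S(x) * (1 - S(x))
ax.plot(x_werte, sigmoid(x_werte)*(1 - sigmoid(x_werte)), label='S(x) (1 - S(x))')
# Numerical derivative for comparison
ax.plot(x_werte, derivative(sigmoid, x_werte, 10**-12), label='numerical derivative')
ax.legend()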
plt.show()
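
The agreement can also be expressed as a single number. The following sketch does not use the notebook's `derivative()` function but an explicit central difference quotient with an illustrative step width `h`, so that it runs on its own.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


x = np.linspace(-20, 20, 1000)
h = 1e-5  # illustrative step width

# Central difference quotient vs. analytical expression S(x) * (1 - S(x))
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid(x) * (1 - sigmoid(x))

max_abs_error = np.max(np.abs(numeric - analytic))
print(max_abs_error)
```

The largest deviation should be many orders of magnitude smaller than the maximum of the derivative, which is $0.25$ at $x = 0$.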

We calculate the forward pass through the network up to the activations $A_3$, $A_4$ in the output layer.

In [None]:
Z1 = w1 * X1 + w2 * X2 + b1
Z1

In [None]:
Z2 = w3 * X1 + w4 * X2 + b2
Z2

In [None]:
A1 = linear_activation(Z1)
A1

In [None]:
A2 = linear_activation(Z2)
A2

#### Activation in the output neuron $1$: $A_3$

In [None]:
Z3 = w5 * A1 + w6 * A2 + b3
Z3

In [None]:
A3 = sigmoid(Z3)
A3

#### Activation in the output neuron $2$: $A_4$

In [None]:
Z4 = w7 * A1 + w8 * A2 + b4
Z4

In [None]:
A4 = sigmoid(Z4)
A4
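
The same forward pass can be written compactly with matrices. This is a sketch under the assumption that the weights are arranged row by row per target neuron; the names `W_hidden`, `b_hidden`, `W_out` and `b_out` are introduced here for illustration only.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


x = np.array([0.85, 0.5])                    # X1, X2
W_hidden = np.array([[0.75, 0.55],           # w1, w2 -> Z1
                     [0.05, 0.05]])          # w3, w4 -> Z2
b_hidden = np.array([0.25, 0.15])            # b1, b2
W_out = np.array([[0.05, 0.015],             # w5, w6 -> Z3
                  [0.85, 0.95]])             # w7, w8 -> Z4
b_out = np.array([0.15, 0.25])               # b3, b4

z_hidden = W_hidden @ x + b_hidden           # Z1, Z2
a_hidden = z_hidden                          # linear activation: A1, A2
z_out = W_out @ a_hidden + b_out             # Z3, Z4
a_out = sigmoid(z_out)                       # A3, A4

print(z_hidden, z_out, a_out)
```

Each row of a weight matrix collects the weights that feed one neuron, so the matrix product reproduces the sums calculated step by step above.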

###  Loss function - Mean Squared Error (MSE) 

First, we need to introduce a measure for the error of the model. One way to evaluate the error of a model is given by the **[MSE (Mean Squared Error)](https://en.wikipedia.org/wiki/Mean_squared_error)**. In general, we speak of **[loss functions](https://en.wikipedia.org/wiki/Loss_function)**.

$$ MSE = E_{total} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat y_i)^2 $$

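Here $y_i$ is the expected output (*ground truth*) and $\hat y_i$ is the output of the model (*prediction*). Note that the worked example below uses the symbols the other way round: $\hat y_1 = 0.01$ and $\hat y_2 = 0.99$ denote the given target values, while the activations $A_3$ and $A_4$ are the predictions of the network. Because the difference is squared, this does not change the value of the error.

The index $i$ runs over all $n$ neurons in the output layer. In our example $n = 2$, so that $E_{total} = E_1 + E_2$ with: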

$E_1 = \frac{1}{2} (\hat y_1 - A_3 )^2 = \frac{1}{2} (0.01 - A_3 )^2$

In [None]:
# Error output layer - first neuron
E_1 = 1/2*(0.01 - A3)**2 
E_1

$E_2 = \frac{1}{2} (\hat y_2 - A_4 )^2 = \frac{1}{2} (0.99 - A_4)^2$

In [None]:
# Error output layer - second neuron
E_2 = 1/2*(0.99 - A4)**2 
E_2

$E_{total} = E_1 + E_2$

In [None]:
E_total = E_1 + E_2
E_total
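
That the sum $E_1 + E_2$ coincides with the general MSE formula for $n = 2$ can be checked with a small helper. The function name `mse` is an assumption for this sketch, and the values of $Z_3$ and $Z_4$ are recomputed by hand from the initial parameters above.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def mse(targets, outputs):
    """Mean squared error over all output neurons."""
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    return np.mean((targets - outputs) ** 2)


targets = np.array([0.01, 0.99])
# Z3 and Z4 as they result from the initial parameters of the example
outputs = sigmoid(np.array([0.2113875, 1.44475]))  # A3, A4

e_1 = 0.5 * (targets[0] - outputs[0]) ** 2
e_2 = 0.5 * (targets[1] - outputs[1]) ** 2
e_total = mse(targets, outputs)

print(e_1 + e_2, e_total)
```

With two output neurons the factor $\frac{1}{n}$ equals the factor $\frac{1}{2}$ in $E_1$ and $E_2$, so both values agree.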

## Backpropagation

The total error of the output is thus given by $E_{total}$.

Let us now turn to the optimization method for this error, the **[backpropagation algorithm](https://en.wikipedia.org/wiki/Backpropagation)**. To do this, we will use the basics we discussed earlier, such as the **gradient method** and **activation functions**, and try to systematically adjust the weights of the network in order to obtain better predictions.

We go backwards from the outputs of the neural network and adjust the weights and biases in the direction of the negative gradient, as in the gradient descent method. In order to calculate how the error (the loss function) depends on each weight, we have to apply the chain rule.

As a reminder, let's look at the application of the chain rule using an example:

### Chain rule

$f(g(x))^{\prime} = \frac{df}{dg}\frac{dg}{dx}$

e.g.:

$f = g^2, \quad g = \sin(x)$

$\frac{d}{dx}(\sin(x))^2 = 2 \cdot \sin(x) \cdot \cos(x)$
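
The result of the chain rule can be compared with a numerical derivative. This sketch uses an explicit central difference quotient with an illustrative step width `h`; the name `f_of_g` is chosen for this example.

```python
import numpy as np


def f_of_g(x):
    # Composition f(g(x)) with f = g**2 and g = sin(x)
    return np.sin(x) ** 2


x = np.linspace(-5, 5, 201)
h = 1e-5  # illustrative step width

numeric = (f_of_g(x + h) - f_of_g(x - h)) / (2 * h)
chain_rule = 2 * np.sin(x) * np.cos(x)   # df/dg * dg/dx

max_abs_error = np.max(np.abs(numeric - chain_rule))
print(max_abs_error)
```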

## Output layer

Starting from the output layer, let's calculate the adjustment of the weights ($w_5, w_6, w_7, w_8$) of the output layer: 

<img src="./images_en/backprop_w5_f2.png" alt="drawing" width="80%"/>

### Partial derivative of $E_{total}$ with respect to (w.r.t.) $w_5$ 

#### Total derivation: $\frac{\partial E_{total}}{\partial w_5} = \frac{\partial E_{total}}{\partial A_3} \frac{\partial A_3}{\partial Z_3} \frac{\partial Z_3}{\partial w_5}   $

#### 1.)

#### $\frac{\partial E_{total}}{\partial A_3} = \frac{\partial }{\partial A_3}(E_1 + E_2 ) = \frac{\partial }{\partial A_3}(\frac{1}{2} (\hat y_1 - A_3 )^2 + \frac{1}{2} (\hat y_2 - A_4 )^2 ) = \frac{\partial }{\partial A_3}(\frac{1}{2} ( 0.01 - A_3 )^2 + \frac{1}{2} (0.99 - A_4 )^2 )  \\ = -(0.01 - A_3)$

In [None]:
dE_total_wrt_A3 = -(0.01 - A3)
dE_total_wrt_A3

#### 2.)

####  $\frac{\partial A_3}{\partial Z_3} = \frac{\partial }{\partial Z_3} \frac{1}{1 + e^{-Z_3}} = A_3 (1 - A_3)$

In [None]:
dA3_wrt_Z3 = A3 * (1 - A3)
dA3_wrt_Z3

#### 3.)

#### $\frac{\partial Z_3}{\partial w_5} = \frac{\partial }{\partial w_5}(w_5 \cdot A_1 + w_6 \cdot A_2 + b_3) = A_1$

In [None]:
dZ3_wrt_w5 = A1
dZ3_wrt_w5

#### Total derivation: $\frac{\partial E_{total}}{\partial w_5} = \frac{\partial E_{total}}{\partial A_3} \frac{\partial A_3}{\partial Z_3} \frac{\partial Z_3}{\partial w_5}   $

In [None]:
dE_total_wrt_w5 = dE_total_wrt_A3 * dA3_wrt_Z3 * dZ3_wrt_w5
dE_total_wrt_w5

#### Adjust weight $w_5$

$w_{5 new} = w_5 - \alpha \cdot \frac{\partial E_{total}}{\partial w_5}$ with the learning rate $\alpha = 0.5$

In [None]:
w5_new = w5 - 0.5 * dE_total_wrt_w5
w5_new

In [None]:
# Change in w5
delta_w5 = w5_new - w5
delta_w5
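
A common way to validate such a hand-derived gradient is to compare it with a finite difference of the error itself. The sketch below treats $E_{total}$ as a function of $w_5$ only and keeps all other parameters at their initial values; the function name `forward` and the step width `h` are assumptions for this illustration.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def forward(w5):
    """Return A1, A3 and E_total for a given w5 (other parameters fixed)."""
    X1, X2 = 0.85, 0.5
    A1 = 0.75 * X1 + 0.55 * X2 + 0.25        # linear hidden activation
    A2 = 0.05 * X1 + 0.05 * X2 + 0.15
    A3 = sigmoid(w5 * A1 + 0.015 * A2 + 0.15)
    A4 = sigmoid(0.85 * A1 + 0.95 * A2 + 0.25)
    E_total = 0.5 * (0.01 - A3) ** 2 + 0.5 * (0.99 - A4) ** 2
    return A1, A3, E_total


w5 = 0.05
h = 1e-6  # illustrative step width

# Numerical gradient: central difference of E_total w.r.t. w5
numeric_grad = (forward(w5 + h)[2] - forward(w5 - h)[2]) / (2 * h)

# Analytical gradient from the chain rule
A1, A3, _ = forward(w5)
analytic_grad = -(0.01 - A3) * A3 * (1 - A3) * A1

print(numeric_grad, analytic_grad)
```

If both numbers agree to several digits, the three factors of the chain rule have most likely been combined correctly.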

### Partial derivative of $E_{total}$ w.r.t. $w_6$ 

<img src="./images_en/backprop_w6_f3.png" alt="drawing" width="80%"/>

#### Total derivation: $\frac{\partial E_{total}}{\partial w_6} = \frac{\partial E_{total}}{\partial A_3} \frac{\partial A_3}{\partial Z_3} \frac{\partial Z_3}{\partial w_6}   $

#### 1.)

#### $\frac{\partial E_{total}}{\partial A_3} = -(0.01 - A_3)$

In [None]:
dE_total_wrt_A3 = -(0.01 - A3)
dE_total_wrt_A3

#### 2.)

####  $\frac{\partial A_3}{\partial Z_3} = \frac{\partial }{\partial Z_3} \frac{1}{1 + e^{-Z_3}} = A_3 (1 - A_3)$

In [None]:
dA3_wrt_Z3 = A3 * (1 - A3)
dA3_wrt_Z3

#### 3.)

#### $\frac{\partial Z_3}{\partial w_6} = \frac{\partial }{\partial w_6}(w_5 \cdot A_1 + w_6 \cdot A_2 + b_3) = A_2$

In [None]:
dZ3_wrt_w6 = A2
dZ3_wrt_w6

#### Total derivation: $\frac{\partial E_{total}}{\partial w_6} = \frac{\partial E_{total}}{\partial A_3} \frac{\partial A_3}{\partial Z_3} \frac{\partial Z_3}{\partial w_6}   $

In [None]:
dE_total_wrt_w6 = dE_total_wrt_A3 * dA3_wrt_Z3 * dZ3_wrt_w6
dE_total_wrt_w6

#### Adjust weight $w_6$

In [None]:
w6_new = w6 - 0.5 * dE_total_wrt_w6
w6_new

In [None]:
# Change in w6
delta_w6 = w6_new - w6
delta_w6

### Partial derivative of $E_{total}$ w.r.t. $w_7$ 

<img src="./images_en/backprop_w7_f.png" alt="drawing" width="80%"/>

#### Total derivation: $\frac{\partial E_{total}}{\partial w_7} = \frac{\partial E_{total}}{\partial A_4} \frac{\partial A_4}{\partial Z_4} \frac{\partial Z_4}{\partial w_7}   $

#### 1.)

#### $\frac{\partial E_{total}}{\partial A_4}  =\frac{\partial }{\partial A_4}(\frac{1}{2} (\hat y_1 - A_3 )^2 + \frac{1}{2} (\hat y_2 - A_4 )^2 ) \\ = \frac{\partial }{\partial A_4}(\frac{1}{2} ( 0.01 - A_3 )^2 + \frac{1}{2} (0.99 - A_4 )^2 )  \\ = -(0.99 - A_4)$

In [None]:
dE_total_wrt_A4 = -(0.99 - A4)
dE_total_wrt_A4

#### 2.)

####  $\frac{\partial A_4}{\partial Z_4} = \frac{\partial }{\partial Z_4} \frac{1}{1 + e^{-Z_4}} = A_4 (1 - A_4)$

In [None]:
dA4_wrt_Z4 = A4 * (1 - A4)
dA4_wrt_Z4

#### 3.)

#### $\frac{\partial Z_4}{\partial w_7} = \frac{\partial }{\partial w_7}(w_7 \cdot A_1 + w_8 \cdot A_2 + b_4) = A_1$

In [None]:
dZ4_wrt_w7 = A1
dZ4_wrt_w7

In [None]:
dE_total_wrt_w7 = dE_total_wrt_A4 * dA4_wrt_Z4 * dZ4_wrt_w7
dE_total_wrt_w7

#### Adjust weight $w_7$

In [None]:
w7_new = w7 - 0.5 * dE_total_wrt_w7
w7_new

In [None]:
# Change in w7
delta_w7 = w7_new - w7
delta_w7

### Partial derivative of $E_{total}$ w.r.t. $w_8$

<img src="./images_en/backprop_w8_f.png" alt="drawing" width="80%"/>

#### Total derivation: $\frac{\partial E_{total}}{\partial w_8} = \frac{\partial E_{total}}{\partial A_4} \frac{\partial A_4}{\partial Z_4} \frac{\partial Z_4}{\partial w_8}   $

### Exercise:

Calculate the partial derivative of $E_{total}$ w.r.t. $w_8$.

#### 1.)

#### $\frac{\partial E_{total}}{\partial A_4} = -(0.99 - A_4)$

#### 2.)

####  $\frac{\partial A_4}{\partial Z_4} = \frac{\partial }{\partial Z_4} \frac{1}{1 + e^{-Z_4}} = A_4 (1 - A_4)$

#### 3.)

#### $\frac{\partial Z_4}{\partial w_8} = \frac{\partial }{\partial w_8}(w_7 \cdot A_1 + w_8 \cdot A_2 + b_4) = A_2$

In [None]:
### Your code here ...

In [None]:
### Solution

In [None]:
dE_total_wrt_A4 = -(0.99 - A4)
dE_total_wrt_A4

In [None]:
dA4_wrt_Z4 = A4 * (1 - A4)
dA4_wrt_Z4

In [None]:
dZ4_wrt_w8 = A2
dZ4_wrt_w8

In [None]:
dE_total_wrt_w8 = dE_total_wrt_A4 * dA4_wrt_Z4 * dZ4_wrt_w8
dE_total_wrt_w8

#### Adjust weight $w_8$

In [None]:
w8_new = w8 - 0.5 * dE_total_wrt_w8
w8_new

In [None]:
# Change in w8
delta_w8 = w8_new - w8
delta_w8
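
The four updates of the output layer follow the same pattern and can be summarized in one matrix expression. This is a sketch with names chosen for illustration (`W_out`, `delta_out`, `grad_W_out`); it assumes the same row-wise arrangement of the weights per output neuron as in the calculations above.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


x = np.array([0.85, 0.5])
W_hidden = np.array([[0.75, 0.55], [0.05, 0.05]])   # w1..w4
b_hidden = np.array([0.25, 0.15])                   # b1, b2
W_out = np.array([[0.05, 0.015], [0.85, 0.95]])     # w5..w8
b_out = np.array([0.15, 0.25])                      # b3, b4
targets = np.array([0.01, 0.99])
alpha = 0.5

# Forward pass (linear hidden layer, sigmoid output layer)
a_hidden = W_hidden @ x + b_hidden                  # A1, A2
a_out = sigmoid(W_out @ a_hidden + b_out)           # A3, A4

# dE/dZ3 and dE/dZ4: error derivative times sigmoid derivative
delta_out = (a_out - targets) * a_out * (1 - a_out)

# Rows: [dE/dw5, dE/dw6] and [dE/dw7, dE/dw8]
grad_W_out = np.outer(delta_out, a_hidden)
W_out_new = W_out - alpha * grad_W_out

print(grad_W_out)
print(W_out_new)
```

The outer product multiplies every entry of `delta_out` with every hidden activation, which is exactly the third factor $\frac{\partial Z}{\partial w}$ of the chain rule.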

## Hidden layer

In the hidden layer, there is an additional dependency: both $E_1$ and $E_2$ depend on $A_1$ and $A_2$, so both contribute to the derivative of $E_{total}$:

### Partial derivative of $E_{total}$ w.r.t. $w_1$ 

<img src="./images_en/backprop_w1_f.png" alt="drawing" width="80%"/>

$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial A_1} \frac{\partial A_1}{\partial Z_1} \frac{\partial Z_1}{\partial w_1}   $

$\frac{\partial E_{total}}{\partial A_1} = \frac{\partial E_1}{\partial A_1} +  \frac{\partial E_2}{\partial A_1} $

because

$E_1 = \frac{1}{2} (\hat y_1 - A_3 )^2 \\ = \frac{1}{2} (0.01 - A_3 )^2 \\ = \frac{1}{2} (0.01 - \text{sigmoid}(Z_3) )^2 \\ = \frac{1}{2} (0.01 - \text{sigmoid}(w_5 \cdot A_1 + w_6 \cdot A_2 + b_3) )^2 \\ = E_1 (A_1, A_2)$

and

$E_2 = \frac{1}{2} (\hat y_2 - A_4 )^2 \\ = \frac{1}{2} (0.99 - A_4 )^2 \\ = \frac{1}{2} (0.99 - \text{sigmoid}(Z_4) )^2 \\ = \frac{1}{2} (0.99 - \text{sigmoid}(w_7 \cdot A_1 + w_8 \cdot A_2 + b_4) )^2 \\ = E_2 (A_1, A_2)$

#### 1.)

$\frac{\partial E_1}{\partial A_1} = \frac{\partial E_1}{\partial Z_3} \frac{\partial Z_3}{\partial A_1}$

We have already calculated $\frac{\partial E_1}{\partial Z_3}$ since the following applies:

$\frac{\partial E_1}{\partial Z_3} = \frac{\partial E_1}{\partial A_3} \frac{\partial A_3}{\partial Z_3}$

In [None]:
dE1_wrt_Z3 = dE_total_wrt_A3 * dA3_wrt_Z3
dE1_wrt_Z3

$\frac{\partial Z_3}{\partial A_1}$ results in the following:

$\frac{\partial Z_3}{\partial A_1} = \frac{\partial }{\partial A_1} (w_5 A_1 + w_6 A_2 + b_3) = w_5$

In [None]:
dZ3_wrt_A1 = w5
dZ3_wrt_A1

Overall, this results in $\frac{\partial E_1}{\partial A_1} = \frac{\partial E_1}{\partial Z_3} \frac{\partial Z_3}{\partial A_1}$:

In [None]:
dE1_wrt_A1 = dE1_wrt_Z3 * dZ3_wrt_A1
dE1_wrt_A1

#### 2.)

$\frac{\partial E_2}{\partial A_1} = \frac{\partial E_2}{\partial Z_4} \frac{\partial Z_4}{\partial A_1} $

$\frac{\partial E_2}{\partial Z_4} = \frac{\partial E_2}{\partial A_4} \frac{\partial A_4}{\partial Z_4}$

In [None]:
dE2_wrt_Z4 = dE_total_wrt_A4 * dA4_wrt_Z4
dE2_wrt_Z4

In [None]:
dZ4_wrt_A1 = w7
dZ4_wrt_A1

In [None]:
dE2_wrt_A1 = dE2_wrt_Z4 * dZ4_wrt_A1
dE2_wrt_A1

In total, this results in $\frac{\partial E_{total}}{\partial A_1} = \frac{\partial E_1}{\partial A_1} + \frac{\partial E_2}{\partial A_1} $:

In [None]:
dE_total_wrt_A1 = dE1_wrt_A1 + dE2_wrt_A1
dE_total_wrt_A1
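
In matrix notation, passing the error back to the hidden activations corresponds to a multiplication with the transposed weight matrix of the output layer. The following sketch uses illustrative names (`W_out`, `delta_out`, `dE_wrt_a_hidden`) and compares the matrix form with the sum of the two paths calculated above.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


x = np.array([0.85, 0.5])
W_hidden = np.array([[0.75, 0.55], [0.05, 0.05]])   # w1..w4
b_hidden = np.array([0.25, 0.15])                   # b1, b2
W_out = np.array([[0.05, 0.015], [0.85, 0.95]])     # w5..w8
b_out = np.array([0.15, 0.25])                      # b3, b4
targets = np.array([0.01, 0.99])

a_hidden = W_hidden @ x + b_hidden                  # A1, A2 (linear)
a_out = sigmoid(W_out @ a_hidden + b_out)           # A3, A4

# dE1/dZ3 and dE2/dZ4
delta_out = (a_out - targets) * a_out * (1 - a_out)

# Matrix form: [dE_total/dA1, dE_total/dA2]
dE_wrt_a_hidden = W_out.T @ delta_out

# Scalar form for A1: path via Z3 (weight w5) plus path via Z4 (weight w7)
scalar_dE_wrt_A1 = delta_out[0] * W_out[0, 0] + delta_out[1] * W_out[1, 0]

print(dE_wrt_a_hidden, scalar_dE_wrt_A1)
```

The transposition is what collects, for each hidden neuron, the contributions of all output neurons it feeds.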

Let's focus on the initial equation again: $\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial A_1} \frac{\partial A_1}{\partial Z_1} \frac{\partial Z_1}{\partial w_1} $

We still need $\frac{\partial A_1}{\partial Z_1}$ and $\frac{\partial Z_1}{\partial w_1}$.

Since the hidden layer uses linear activation, $A_1 = Z_1$, and therefore

$\frac{\partial A_1}{\partial Z_1} = 1$

In [None]:
# Derivative of the linear activation A1 = Z1
dA1_wrt_Z1 = 1
dA1_wrt_Z1

$\frac{\partial Z_1}{\partial w_1} = \frac{\partial }{\partial w_1} (w_1 X_1 + w_2 X_2 + b_1) = X_1$

In [None]:
dZ1_wrt_w1 = X1

Overall, this results in $\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial A_1} \frac{\partial A_1}{\partial Z_1} \frac{\partial Z_1}{\partial w_1} $:

In [None]:
dE_total_wrt_w1 = dE_total_wrt_A1 * dA1_wrt_Z1 * dZ1_wrt_w1
dE_total_wrt_w1

#### Adjust weight $w_1$

We can now adjust the weight $w_1$ with $w_{1 new} = w_1 - \alpha \cdot \frac{\partial E_{total}}{\partial w_1}$:

In [None]:
w1_new = w1 - 0.5 * dE_total_wrt_w1
w1_new

In [None]:
# Change in w1
delta_w1 = w1_new - w1
delta_w1

### Partial derivative of $E_{total}$ w.r.t. $w_2$ 

<img src="./images_en/backprop_w2_f.png" alt="drawing" width="80%"/>

$\frac{\partial E_{total}}{\partial w_2} = \frac{\partial E_{total}}{\partial A_1} \frac{\partial A_1}{\partial Z_1} \frac{\partial Z_1}{\partial w_2}   $

$\frac{\partial E_{total}}{\partial A_1} = \frac{\partial E_1}{\partial A_1} +  \frac{\partial E_2}{\partial A_1} $

$\frac{\partial Z_1}{\partial w_2} = \frac{\partial }{\partial w_2} (w_1 X_1 + w_2 X_2 + b_1) = X_2$

In [None]:
dZ1_wrt_w2 = X2
dZ1_wrt_w2

In [None]:
dE_total_wrt_w2 = dE_total_wrt_A1 * dA1_wrt_Z1 * dZ1_wrt_w2
dE_total_wrt_w2

#### Adjust weight $w_2$

We can now adjust the weight $w_2$ with $w_{2 new} = w_2 - \alpha \cdot \frac{\partial E_{total}}{\partial w_2}$:

In [None]:
w2_new = w2 - 0.5 * dE_total_wrt_w2
w2_new

In [None]:
# Change in w2
delta_w2 = w2_new - w2
delta_w2

### Partial derivative of $E_{total}$ w.r.t. $w_3$ 

<img src="./images_en/backprop_w3_f.png" alt="drawing" width="80%"/>

$\frac{\partial E_{total}}{\partial w_3} = \frac{\partial E_{total}}{\partial A_2} \frac{\partial A_2}{\partial Z_2} \frac{\partial Z_2}{\partial w_3}   $

$\frac{\partial E_{total}}{\partial A_2} = \frac{\partial E_1}{\partial A_2} +  \frac{\partial E_2}{\partial A_2} $

#### 1.)

$\frac{\partial E_1}{\partial A_2} = \frac{\partial E_1}{\partial Z_3} \frac{\partial Z_3}{\partial A_2}$

We have already calculated $\frac{\partial E_1}{\partial Z_3}$ since the following applies:

$\frac{\partial E_1}{\partial Z_3} = \frac{\partial E_1}{\partial A_3} \frac{\partial A_3}{\partial Z_3}$

In [None]:
dE1_wrt_Z3 = dE_total_wrt_A3 * dA3_wrt_Z3
dE1_wrt_Z3

$\frac{\partial Z_3}{\partial A_2}$ results in the following:

$\frac{\partial Z_3}{\partial A_2} = \frac{\partial }{\partial A_2} (w_5 A_1 + w_6 A_2 + b_3) = w_6$

In [None]:
dZ3_wrt_A2 = w6
dZ3_wrt_A2

In total, this results in $\frac{\partial E_1}{\partial A_2} = \frac{\partial E_1}{\partial Z_3} \frac{\partial Z_3}{\partial A_2}$:

In [None]:
dE1_wrt_A2 = dE1_wrt_Z3 * dZ3_wrt_A2
dE1_wrt_A2

#### 2.)

$\frac{\partial E_2}{\partial A_2} = \frac{\partial E_2}{\partial Z_4} \frac{\partial Z_4}{\partial A_2} $

$\frac{\partial E_2}{\partial Z_4} = \frac{\partial E_2}{\partial A_4} \frac{\partial A_4}{\partial Z_4}$

In [None]:
dE2_wrt_Z4 = dE_total_wrt_A4 * dA4_wrt_Z4
dE2_wrt_Z4

In [None]:
dZ4_wrt_A2 = w8
dZ4_wrt_A2

In [None]:
dE2_wrt_A2 = dE2_wrt_Z4 * dZ4_wrt_A2
dE2_wrt_A2

In total, this results in $\frac{\partial E_{total}}{\partial A_2} = \frac{\partial E_1}{\partial A_2} + \frac{\partial E_2}{\partial A_2} $:

In [None]:
dE_total_wrt_A2 = dE1_wrt_A2 + dE2_wrt_A2
dE_total_wrt_A2

Let's focus on the initial equation again: $\frac{\partial E_{total}}{\partial w_3} = \frac{\partial E_{total}}{\partial A_2} \frac{\partial A_2}{\partial Z_2} \frac{\partial Z_2}{\partial w_3} $

We still need $\frac{\partial A_2}{\partial Z_2}$ and $\frac{\partial Z_2}{\partial w_3}$.

Since the hidden layer uses linear activation, $A_2 = Z_2$, and therefore

$\frac{\partial A_2}{\partial Z_2} = 1$

In [None]:
# Derivative of the linear activation A2 = Z2
dA2_wrt_Z2 = 1
dA2_wrt_Z2

$\frac{\partial Z_2}{\partial w_3} = \frac{\partial }{\partial w_3} (w_3 X_1 + w_4 X_2 + b_2) = X_1$

In [None]:
dZ2_wrt_w3 = X1
dZ2_wrt_w3

Overall, this results in $\frac{\partial E_{total}}{\partial w_3} = \frac{\partial E_{total}}{\partial A_2} \frac{\partial A_2}{\partial Z_2} \frac{\partial Z_2}{\partial w_3} $:

In [None]:
dE_total_wrt_w3 = dE_total_wrt_A2 * dA2_wrt_Z2 * dZ2_wrt_w3
dE_total_wrt_w3

#### Adjust weight $w_3$

We can now adjust the weight $w_3$ with $w_{3 new} = w_3 - \alpha \cdot \frac{\partial E_{total}}{\partial w_3}$:

In [None]:
w3_new = w3 - 0.5 * dE_total_wrt_w3
w3_new

In [None]:
# Change in w3
delta_w3 = w3_new - w3
delta_w3

### Partial derivative of $E_{total}$ w.r.t. $w_4$

### Exercise:

Calculate the partial derivative of $E_{total}$ w.r.t. $w_4$.

<img src="./images_en/backprop_w4_f.png" alt="drawing" width="80%"/>

$\frac{\partial E_{total}}{\partial w_4} = \frac{\partial E_{total}}{\partial A_2} \frac{\partial A_2}{\partial Z_2} \frac{\partial Z_2}{\partial w_4}   $

$\frac{\partial E_{total}}{\partial A_2} = \frac{\partial E_1}{\partial A_2} +  \frac{\partial E_2}{\partial A_2} $

$\frac{\partial Z_2}{\partial w_4} = \frac{\partial }{\partial w_4} (w_3 X_1 + w_4 X_2 + b_2) = X_2$

In [None]:
### Your code here ...

In [None]:
### Solution

In [None]:
dZ2_wrt_w4 = X2
dZ2_wrt_w4

In [None]:
dE_total_wrt_w4 = dE_total_wrt_A2 * dA2_wrt_Z2 * dZ2_wrt_w4
dE_total_wrt_w4

#### Adjust weight $w_4$

We can now adjust the weight $w_4$ with $w_{4 new} = w_4 - \alpha \cdot \frac{\partial E_{total}}{\partial w_4}$:

In [None]:
w4_new = w4 - 0.5 * dE_total_wrt_w4
w4_new

In [None]:
# Change in w4
delta_w4 = w4_new - w4
delta_w4
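
To close the example, the complete training step for all eight weights can be sketched in a few lines and checked numerically. The names used here (`loss`, `W_hidden`, `W_out`, `delta_hidden` and so on) are assumptions for this illustration. The sketch relies on the linear hidden activation stated at the beginning of the example, whose derivative is $1$, and, as in the calculations above, leaves the biases unchanged.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


x = np.array([0.85, 0.5])
b_hidden = np.array([0.25, 0.15])                   # b1, b2
b_out = np.array([0.15, 0.25])                      # b3, b4
targets = np.array([0.01, 0.99])
W_hidden = np.array([[0.75, 0.55], [0.05, 0.05]])   # w1..w4
W_out = np.array([[0.05, 0.015], [0.85, 0.95]])     # w5..w8
alpha = 0.5


def loss(W_h, W_o):
    """E_total for given weight matrices (biases and input fixed)."""
    a_h = W_h @ x + b_hidden                        # linear hidden layer
    a_o = sigmoid(W_o @ a_h + b_out)
    return 0.5 * np.sum((targets - a_o) ** 2)


# Forward pass
a_hidden = W_hidden @ x + b_hidden
a_out = sigmoid(W_out @ a_hidden + b_out)

# Backward pass
delta_out = (a_out - targets) * a_out * (1 - a_out)
grad_W_out = np.outer(delta_out, a_hidden)
delta_hidden = (W_out.T @ delta_out) * 1.0          # linear activation: dA/dZ = 1
grad_W_hidden = np.outer(delta_hidden, x)

# Finite-difference check of dE_total/dw1
h = 1e-6
W_plus, W_minus = W_hidden.copy(), W_hidden.copy()
W_plus[0, 0] += h
W_minus[0, 0] -= h
numeric_dw1 = (loss(W_plus, W_out) - loss(W_minus, W_out)) / (2 * h)

# One gradient descent step on all weights at once
loss_before = loss(W_hidden, W_out)
loss_after = loss(W_hidden - alpha * grad_W_hidden, W_out - alpha * grad_W_out)

print(numeric_dw1, grad_W_hidden[0, 0])
print(loss_before, loss_after)
```

If the analytical and numerical value for $w_1$ agree and the error after the step is smaller than before, the backward pass is consistent with the forward pass.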