In [None]:
%%capture
try:
    import aiim
    print('AIIM Library already installed.')
except ImportError:
    print('Installing and importing AIIM Library.')
    %pip install git+https://github.com/jmelsbach/AIIM@main
    import aiim

# Introduction to Neural Networks

In [None]:
%matplotlib inline

In [None]:
import ipywidgets as widgets
from ipywidgets import fixed
import matplotlib.pyplot as plt
import numpy as np
import torch


## An Artificial Neuron
In the following cell we will implement an Artificial Neuron, which is the simplest form of a neural network. As we saw in the lecture, the formula of an Artificial Neuron is as follows:
$$z = \sum_{i=1}^{n} w_i x_i + b$$

In [None]:
def linear(x, n_inputs, n_outputs=1):
    w = torch.randn(n_inputs, n_outputs) # weights
    b = torch.randn(n_outputs) # biases
    return x @ w + b


We can implement the Artificial Neuron without a for loop by using matrix multiplication. We initialize the weights of our Neuron randomly.
`torch.randn(n_inputs, n_outputs)` creates a tensor of shape `(n_inputs, n_outputs)` filled with values drawn from a standard normal distribution. The number of biases matches the number of outputs.
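
To see why the matrix product can replace the loop, the following sketch computes the weighted sum of a single neuron both ways. The names `z_loop` and `z_matmul` are illustrative and not part of the notebook.

```python
import torch

torch.manual_seed(0)  # fixed seed so that reruns give the same numbers

x = torch.tensor([1., 0.5, 1.5])  # three input values
w = torch.randn(3, 1)             # one weight per input, a single output
b = torch.randn(1)                # one bias for the single output

# Explicit loop: accumulate w_i * x_i over all inputs, then add the bias
z_loop = torch.zeros(1)
for i in range(len(x)):
    z_loop = z_loop + w[i] * x[i]
z_loop = z_loop + b

# Matrix multiplication performs the same sum in a single operation
z_matmul = x @ w + b

print(z_loop, z_matmul)
```

Both values should agree up to floating-point rounding.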

Let's create an example input `x` and send it through our Artificial Neuron.

In [None]:
x = torch.tensor([1., 0.5, 1.5])

In [None]:
linear(x, n_inputs = 3)

## Linear Layers
You surely noticed that our code allows us to define more than one output using the `n_outputs` parameter. We can combine any input size with any output size!
> 🖋️  Try it yourself. Send our input `x` through our function. Choose the arguments so that we get 3 outputs.

In [None]:
# your code here

You might have wondered why we called our function `linear` and not `neuron`. The function we created actually describes what is called a *Linear Layer* in *Deep Learning*. A *Neuron* is a special case of a *Linear Layer* where the number of outputs is `1`. A *Linear Layer* definitely deserves its own class, so let's define it.

In [None]:
class Linear:

    def __init__(self, in_features, out_features, bias=True):

        self.w = torch.randn(in_features, out_features, requires_grad=True)
        if bias:
            self.b = torch.randn(out_features, requires_grad=True)
        else:
            self.b = 0
    
    def __call__(self, x):
        return self.forward(x)
    

    def forward(self, x):
        return x @ self.w + self.b

Let's break down the code we just wrote:
`__init__` is the constructor of our class and allows us to set some parameters when we create an instance of the class.

Our `forward` function does the actual calculations of our *Linear Layer*, which are the same as in the `linear` function we wrote previously.

The `__call__` function seems a little weird and unnecessary as it simply calls the `forward` function. Just like the `__init__` function, it starts and ends with `__`. Those functions are called *dunder functions*, which stands for double underscore functions. There are several of those *dunder functions* in Python. The `__call__` function allows us to use our class object like a function, which is very convenient.
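
As a small illustration of `__call__`, here is a sketch with a toy class. `Doubler` is a hypothetical name invented for this example and has nothing to do with the layers we build in this notebook.

```python
class Doubler:
    """Toy class (hypothetical, for illustration only) that doubles its input."""

    def __call__(self, x):
        # Invoked whenever an instance is used with function-call syntax
        return self.forward(x)

    def forward(self, x):
        return 2 * x


double = Doubler()
result_direct = double.forward(21)  # calling the method explicitly
result_call = double(21)            # same computation, routed through __call__
print(result_direct, result_call)
```

Both calls run the same `forward` method; the second one simply reads like an ordinary function call.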

> 🖋️  Create a Linear Layer called `l1` with 3 inputs and 3 outputs and send `x` through it.

In [None]:
# Your Code

We can now use our object just like a function thanks to `__call__`!

In [None]:
# Your Code

## Stacking Linear Layers
We have created a reusable class for a *Linear Layer*. The next logical step is to stack these layers together.
> 🖋️  You can reuse the `l1` object from above. Create another layer called `l2` that has a single output and send `x` through both of them sequentially. How do we have to choose `in_features` of `l2`? Can we choose any value for it? Why?

In [None]:
# create l2

# send x through l1 and then l2

### Exercise 1

* Create a class called `NeuralNetwork`. In the constructor define a `3x5` Linear Layer and a second layer with one output.
* Define a `__call__` function that calls a `forward` function.
* Define a `forward` function that receives an input `x` and sends it through both *Linear Layers* and returns the output.

In [None]:
# define NeuralNetwork class
class NeuralNetwork:
    pass # replace with your code

In [None]:
# instantiate the model

In [None]:
# send x through the neural network 

## Combining Linear Functions
We successfully created our first neural network that consists of two layers. The truth is that this is not a real *Neural Network* yet. We only combined two linear layers, that is, two linear functions. In the following we will look at a visual demonstration that chaining two (or more) linear functions results in another linear function. So we gain nothing from the additional layer: we could have achieved the same result with a single linear function.

In [None]:
from functools import partial
from aiim.visualization import plot_function

Let's first define a simple linear function.

In [None]:
def linear_function(x, m, b):
    return m * x + b

We use `partial` from the `functools` module to fix the parameters `m` and `b` for two linear functions. You can ignore the `.__name__` part.

In [None]:
linear_function1 = partial(linear_function, m=2, b=0)
linear_function2 = partial(linear_function, m=5, b=3)

linear_function1.__name__ = 'linear_function1'
linear_function2.__name__ = 'linear_function2'


Let's plot `linear_function1` and `linear_function2` individually.

In [None]:
plot_function(linear_function1)

In [None]:
plot_function(linear_function2)

Nothing surprising here. Let us now write another function that chains the two linear functions and plot the new `combined_linear_function`.

In [None]:
def combined_linear_function(x):
    return linear_function2(linear_function1(x))

In [None]:
plot_function(combined_linear_function)

The resulting figure leaves no doubt that combining the two linear functions results in just another linear function. We can even calculate the slope `m` and the intercept `b`.

$l_2(l_1(x)) = (2x + 0) \cdot 5 + 3 = 10x + 3$

So the combination of the two linear functions resulted in yet another linear function with $m=10$ and $b=3$.
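
The same argument carries over from scalars to whole layers, since $(xW_1 + b_1)W_2 + b_2 = x(W_1W_2) + (b_1W_2 + b_2)$. The following sketch checks this numerically. The weights are random placeholders and the names `w_combined` and `b_combined` are made up for this example.

```python
import torch

torch.manual_seed(0)

# Two linear layers given by hypothetical weights and biases
# (float64 keeps the rounding error small)
w1, b1 = torch.randn(3, 5, dtype=torch.float64), torch.randn(5, dtype=torch.float64)
w2, b2 = torch.randn(5, 1, dtype=torch.float64), torch.randn(1, dtype=torch.float64)
x = torch.tensor([1., 0.5, 1.5], dtype=torch.float64)

# Apply the layers one after the other
stacked = (x @ w1 + b1) @ w2 + b2

# Collapse both layers into a single linear layer
w_combined = w1 @ w2       # shape (3, 1)
b_combined = b1 @ w2 + b2  # shape (1,)
single = x @ w_combined + b_combined

print(stacked, single)
```

The two results should match up to rounding, which shows that the stacked layers compute nothing a single linear layer could not.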

## Activation Functions or Nonlinearities

In [None]:
from aiim.visualization import interactive_plot

### Rectified Linear Unit

In [None]:
def ReLU(x):
    return np.maximum(0, x)

In [None]:
interactive_plot(ReLU)

### Sigmoid

In [None]:
def sigmoid(x):
    return 1/(1+np.exp(-x))

In [None]:
interactive_plot(sigmoid)

### tanh

Try to implement the tanh function yourself!

$\tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}}$

In [None]:
def tanh(x):
    pass # your code

In [None]:
# interactive_plot(tanh)

### Softmax Function

The softmax function, often used in the final layer of a neural network classifier, converts logits (numeric output scores from the model) into probabilities by taking the exponential of each output and then normalizing these values by dividing by the sum of all the exponentials. This ensures that the output values are in the range (0, 1) and sum up to 1, making them interpretable as probabilities. The softmax function is particularly useful for multi-class classification problems.

The formula for the softmax function for a vector $z$ of raw class scores from the final layer of a neural network is given by:

$$
\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}
$$


where:
- $z_i$ is the score for class $i$.
- The denominator is the sum of the exponentials of all the raw class scores.

This formula ensures that the scores are normalized and interpretable as probabilities.
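
To make these properties concrete, here is a short sketch that uses PyTorch's built-in `torch.softmax` on some example logits. The logit values are arbitrary and chosen only for illustration.

```python
import torch

z = torch.tensor([2.0, 1.0, 0.1])  # example logits for three classes
probs = torch.softmax(z, dim=0)    # PyTorch's built-in softmax

print(probs)        # every entry lies strictly between 0 and 1
print(probs.sum())  # the entries sum to 1 (up to floating-point rounding)

# Adding the same constant to every logit leaves the probabilities unchanged
shifted = torch.softmax(z + 10.0, dim=0)
print(torch.allclose(probs, shifted))
```

The largest logit receives the largest probability, and only the differences between the logits matter.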

In [None]:
from aiim.visualization import softmax_with_sliders

In [None]:
softmax_with_sliders()

Implement the softmax with the help of the formula above.

In [None]:
def softmax(z):
    pass # your code

In [None]:
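# compare with a tolerance, since floating-point sums are rarely exactly 1
assert torch.isclose(softmax(torch.randn(5)).sum(), torch.tensor(1.0))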

## A Neural Network with PyTorch

We have everything we need to create a forward pass with a Neural Network. In `Exercise 1` we implemented our own neural network but in practice it makes sense to use a framework like PyTorch.

In PyTorch everything revolves around the class `nn.Module`. Each Layer is an `nn.Module` and the Neural Network itself is also an `nn.Module`.

We need two things.
* An `__init__` function where you usually define the layers and that calls the constructor of the superclass with `super().__init__()` at the beginning.
* A `forward` function that passes the input through the layers of our network in the desired order.

In the `nn` module you will find an implementation of a Linear Layer called `nn.Linear`, which basically works just like your `Linear` class above.
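
The two ingredients can be sketched as follows. `TinyModel` and its layer sizes are hypothetical and chosen only to show the structure, not to solve the exercise below.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical one-layer model, used only to show the structure."""

    def __init__(self):
        super().__init__()            # must run before layers are assigned
        self.layer = nn.Linear(4, 2)  # 4 input features, 2 output features
        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(self.layer(x))


model = TinyModel()
out = model(torch.randn(8, 4))  # a batch of 8 examples with 4 features each
print(out.shape)
```

We call `model(...)` rather than `model.forward(...)`, just as with our own `__call__`. One difference to our `Linear` class is that `nn.Linear` stores its weight with shape `(out_features, in_features)`.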

In [None]:
import torch.nn as nn

In [None]:
nn.Linear??

### Exercise 2
Create a two-layer Neural Network that has 10 inputs and 5 outputs at the end. Use an activation function after each layer.

In [None]:
class TwoLayerNeuralNetwork(nn.Module):
    pass

## Loss Functions
In this section we want to have a look at two different loss functions, namely the `Cross Entropy Loss` and the `Mean Squared Error` Loss. Many more loss functions exist, and which one you use depends on the problem you want to solve. Sometimes you can use different loss functions for the same problem. For example, you can use both the `Cross Entropy Loss` and the `Mean Squared Error` Loss for single-label classification with two classes. However, each loss function expects the data in a different format, and you as the developer have to make sure it is in the right format.

### Mean Squared Error Loss
We discussed the MSE Loss in detail in the lecture. The MSE Loss is mostly used for regression problems but can also be used for classification, as stated above.

Imagine you want to predict whether a movie review is positive or negative. We would design our neural network so that it has one output that should be close to `1` if the review is positive, and close to `0` if it is negative.

In [None]:
from torch.nn import MSELoss

Let's create an example output and label.

In [None]:
logit = torch.randn(1)
label = torch.randint(0, 2, [1])  # upper bound is exclusive, so this draws 0 or 1
prediction = logit.sigmoid()

In [None]:
prediction, label

We now have dummy values for our prediction as well as a label. Let's put them into our Mean Squared Error Loss function.

In [None]:
MSELoss()(prediction, label)

### Exercise 3
What would be the shape of the `prediction` and `label` if we had a batch size of `4`? Create an example for the `prediction` and the `label` for a batch size of `4` and calculate the MSELoss for this example.

Yikes, if we run the code we get a `RuntimeError` telling us `Found dtype Long but expected Float`. What does that mean? Let's have a look at the datatype of our `label` and `prediction`.
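
One possible way to inspect the dtypes and resolve the mismatch is sketched below. It assumes that casting the label with `.float()` is acceptable for your use case; the variable `loss_value` is introduced only for this example.

```python
import torch
from torch.nn import MSELoss

logit = torch.randn(1)
prediction = logit.sigmoid()      # float32 value between 0 and 1
label = torch.randint(0, 2, [1])  # integer label, stored as int64 (Long)

print(prediction.dtype, label.dtype)

# Casting the label to float gives both tensors the same dtype
loss_value = MSELoss()(prediction, label.float())
print(loss_value)
```

The prediction is a `torch.float32` tensor while the label is a `torch.int64` tensor, so the cast brings both to the same floating-point type.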

### Cross Entropy Loss
The Cross Entropy Loss is used for single-label classification and is widely used to train neural networks. For example, Language Models like GPT predict the next word based on all previous words to generate language. GPT outputs one logit for each word in its vocabulary, and we apply the `softmax` to assign a probability to each word and find the most likely next word. During training the loss is calculated with the `CrossEntropyLoss`.

In [None]:
from torch.nn import CrossEntropyLoss

Carefully read the documentation. It states that
> The input is expected to contain the unnormalized logits for each class

Take a pause and try to remember how we defined logits in the lecture.

Logits are the raw output of the neural network. This means that we do not have to apply the `softmax` function ourselves, as this is already done for us inside the loss function!
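
A small sketch can confirm this. It compares `F.cross_entropy` on raw logits with a manual calculation based on `F.log_softmax`. The tensors are random placeholders and the names `loss_builtin` and `loss_manual` are made up for this example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 5)           # 4 examples, 5 classes
target = torch.tensor([0, 3, 1, 4])  # correct class index per example

# Loss computed directly from the raw logits
loss_builtin = F.cross_entropy(logits, target)

# The same value by hand: take the log-softmax, pick the log-probability
# of the correct class for each example, negate it and average
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(4), target].mean()

print(loss_builtin, loss_manual)
```

Both numbers should agree, so applying `softmax` to the logits before passing them to the loss would apply it twice.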

Let's create some example data and have a look at how the data is formatted.

In [None]:
loss = nn.CrossEntropyLoss()
logits = torch.randn(4, 5, requires_grad=True)
target = torch.empty(4, dtype=torch.long).random_(5)

In [None]:
logits, logits.shape

The logits are random positive and negative numbers and simulate the output of the last layer of a neural network. What does the shape `(4,5)` mean?

In [None]:
target, target.shape, target.dtype

The target is more interesting. It is a tensor of shape `[4]` with dtype `torch.int64` and values between `0` and `4`. These integers encode the correct class of our single-label classification problem.

Imagine you are classifying images of animals and have the following classes:
    ```
    {
        0: 'cat',
        1: 'dog',
        2: 'horse',
        3: 'duck',
        4: 'monkey'
    }
    ```
So your job as a developer is to make sure that your dataset returns a training example formatted as `(image_as_tensor, label_as_int)`.
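
What such a dataset could look like is sketched below. `ToyAnimalDataset` is a hypothetical class that uses random tensors in place of real images. It only illustrates the `(image_as_tensor, label_as_int)` format.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyAnimalDataset(Dataset):
    """Hypothetical dataset with random tensors standing in for real images."""

    def __init__(self, n_examples=8, n_classes=5):
        # fake RGB images of size 8x8
        self.images = torch.randn(n_examples, 3, 8, 8)
        # class indices stored as plain Python ints
        self.labels = [i % n_classes for i in range(n_examples)]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]  # (image_as_tensor, label_as_int)


dataset = ToyAnimalDataset()
images, labels = next(iter(DataLoader(dataset, batch_size=4)))
print(images.shape, labels, labels.dtype)
```

The `DataLoader` stacks the individual examples into a batch and turns the integer labels into a single `torch.int64` tensor, which is the target format that `CrossEntropyLoss` expects.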

We can now calculate the loss value for our example batch.

In [None]:
CrossEntropyLoss()(logits, target)
