# In this tutorial, we will learn how to incorporate activation functions to add nonlinearity to our neural network.

* To learn about the role of activation functions in neural networks, see the following reference videos: https://www.youtube.com/watch?v=3t9lZM7SS7k&list=PLqnslRFeH2UrcDBWF5mfPGpqQDSta6VK4&index=12 or https://www.youtube.com/watch?v=s-V7gKrsels

* To visualize and experiment with different network structures and activations, visit A Neural Network Playground: https://playground.tensorflow.org/

In [None]:
"""
Let us start with a problem similar to the one in the previous section
"""

import torch
import torch.nn as nn
import numpy as np

# Creating dataset in PyTorch tensors
X = torch.tensor([[1], [2], [3], [4]], dtype=torch.float32)  # shape (4, 1)
Y = torch.tensor([[2, 1], [4, 4], [6, 9], [8, 16]], dtype=torch.float32)  # shape (4, 2)
"""
The difference is, the first position of the output remains as f(x) = 2 * x,
yet for the second position of the output, we are aiming at f(x) = x^2
"""

# Building NN same as the previous section
class NN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(NN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)  # first hidden
        self.fc2 = nn.Linear(hidden_size, hidden_size)  # second hidden
        self.fc_out = nn.Linear(hidden_size, output_size) # output layer

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc_out(x)
        return x

# Instantiating NN with hidden dimension 3
model = NN(input_size=X.shape[1], hidden_size=3, output_size=Y.shape[1])

# def MSE loss
def loss(y, y_predicted):
    return ((y_predicted - y) ** 2).mean()

# Initial prediction
print(f'Prediction before training: model([5]) = {model.forward(torch.tensor([[5]], dtype=torch.float32))}')

# Main training loop settings
learning_rate = 0.01
n_iterations = 10000
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for epoch in range(n_iterations):
    # Forward pass
    y_pred = model.forward(X)

    # Calculate loss
    l = loss(Y, y_pred)

    # Zero gradients before the backward pass (calling zero_grad at the start of the loop is also fine)
    optimizer.zero_grad()

    # Backward pass to compute gradients
    l.backward()

    # Optimizer step
    optimizer.step()

    # Print training info
    if epoch % 1000 == 0:
        print(f'Epoch {epoch + 1}, loss = {l:.8f}')

# Prediction after training
print(f'Prediction after training: model([5]) = {model.forward(torch.tensor([[5]], dtype=torch.float32))}')

Prediction before training: model([5]) = tensor([[ 0.4682, -0.0427]], grad_fn=<AddmmBackward0>)
Epoch 1, loss = 58.43006897
Epoch 1001, loss = 0.49999991
Epoch 2001, loss = 0.49999961
Epoch 3001, loss = 0.49999961
Epoch 4001, loss = 0.49999961
Epoch 5001, loss = 0.49999961
Epoch 6001, loss = 0.49999961
Epoch 7001, loss = 0.49999961
Epoch 8001, loss = 0.49999961
Epoch 9001, loss = 0.49999961
Prediction after training: model([5]) = tensor([[10.0000, 20.0000]], grad_fn=<AddmmBackward0>)


In [None]:
"""
How about the intermediate values?
"""
for i in range(5):
    print(f'Prediction after training: model([{i+1}]) = {model.forward(torch.tensor([[i+1]], dtype=torch.float32))}')

Prediction after training: model([1]) = tensor([[2.0000e+00, 6.2585e-07]], grad_fn=<AddmmBackward0>)
Prediction after training: model([2]) = tensor([[4.0000, 5.0000]], grad_fn=<AddmmBackward0>)
Prediction after training: model([3]) = tensor([[ 6.0000, 10.0000]], grad_fn=<AddmmBackward0>)
Prediction after training: model([4]) = tensor([[ 8.0000, 15.0000]], grad_fn=<AddmmBackward0>)
Prediction after training: model([5]) = tensor([[10.0000, 20.0000]], grad_fn=<AddmmBackward0>)


Here, we find that the first output gets close to its target, as expected. However, the second struggles, and no matter how much we adjust the `learning_rate`, `n_iterations`, or optimizers, the model fails to match the desired quadratic relationship.
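
The second output's predictions (0, 5, 10, 15, 20) and the loss plateau near 0.5 are what the best straight-line fit to $x^2$ on these four points would produce. The following is a minimal pure-Python check that uses closed-form least squares rather than the SGD training above; `fit_line` is a helper name introduced here for illustration only.

```python
# Closed-form least-squares line fit.
# fit_line is a hypothetical helper for this check, not part of the tutorial's model.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
y_linear = [2.0 * x for x in xs]   # first output column: 2 * x
y_square = [x ** 2 for x in xs]    # second output column: x^2

slope_lin, icpt_lin = fit_line(xs, y_linear)
slope_sq, icpt_sq = fit_line(xs, y_square)

# Mean squared error over all 8 target entries (4 samples x 2 outputs),
# matching how the tutorial's loss averages over the whole (4, 2) tensor.
sq_errors = [(slope_lin * x + icpt_lin - y) ** 2 for x, y in zip(xs, y_linear)]
sq_errors += [(slope_sq * x + icpt_sq - y) ** 2 for x, y in zip(xs, y_square)]
mse = sum(sq_errors) / len(sq_errors)

print(slope_sq, icpt_sq)           # 5.0 -5.0
print(mse)                         # 0.5
print(slope_sq * 5.0 + icpt_sq)    # 20.0
```

The best line for the second column is $5x - 5$, which gives 20 at $x = 5$ and leaves a mean squared error of 0.5. The purely linear network appears to have converged to this fit and cannot do better.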

This is not surprising because our neural network is built using only linear layers. A composition of linear layers is itself a single linear (affine) map, so stacking more of them cannot model non-linear relationships like $f(x) = x^2$.
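
One way to see this limitation is to check that stacked linear layers without activations collapse into a single affine map $y = Wx + b$. The sketch below assumes the same layer sizes as the model above (1 → 3 → 3 → 2) but uses freshly initialized layers, not the trained model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only for reproducibility of the random weights

# Three linear layers with the same sizes as the tutorial's model
fc1 = nn.Linear(1, 3)
fc2 = nn.Linear(3, 3)
fc_out = nn.Linear(3, 2)

with torch.no_grad():
    # W3 (W2 (W1 x + b1) + b2) + b3  ==  W x + b
    W = fc_out.weight @ fc2.weight @ fc1.weight                           # shape (2, 1)
    b = fc_out.weight @ (fc2.weight @ fc1.bias + fc2.bias) + fc_out.bias  # shape (2,)

    x = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
    stacked = fc_out(fc2(fc1(x)))   # three layers applied in sequence
    collapsed = x @ W.T + b         # one equivalent affine map

print(torch.allclose(stacked, collapsed, atol=1e-5))
```

Because the two results agree up to floating-point error, extra linear layers add parameters but no extra expressive power.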

To solve this, we need to introduce nonlinearity into the model. A common way to do this is by applying a non-linear activation function after each linear layer, as shown below:

$$a_i^{(l)} = f\left(\sum_{j=1}^{n_{l-1}} w_{ij}^{(l)} a_j^{(l-1)} + b_i^{(l)}\right)$$

where:
- $a_i^{(l)}$ represents the activation/output of the $i$-th neuron in layer $l$,
- $w_{ij}^{(l)}$ is the weight connecting the $j$-th neuron in layer $(l-1)$ to the $i$-th neuron in layer $l$,
- $b_i^{(l)}$ is the bias term for the $i$-th neuron in layer $l$,
- $a_j^{(l-1)}$ is the activation/output of the $j$-th neuron in the previous layer $(l-1)$
- and $f(\cdot)$ is a non-linear activation function such as ReLU, Sigmoid, or Tanh, which introduces nonlinearity between layers.

By wrapping the linear combinations of weights, inputs, and biases with a non-linear activation function $f$, the neural network gains the ability to learn more complex, non-linear patterns, like the quadratic relationship in our example.
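
The formula above can be sketched for one layer in plain Python, with the loops mirroring the indices $i$ and $j$. The name `layer_forward` and the example weights are made up for illustration and do not come from the tutorial's model.

```python
def relu(z):
    # ReLU(z) = max(0, z)
    return max(0.0, z)

def layer_forward(a_prev, W, b, f=relu):
    """Compute a_i = f(sum_j W[i][j] * a_prev[j] + b[i]) for every neuron i."""
    return [
        f(sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i)
        for row, b_i in zip(W, b)
    ]

# Hypothetical values: 2 neurons in layer (l-1), 3 neurons in layer l
a_prev = [1.0, -2.0]
W = [[0.5, 1.0],
     [2.0, 0.5],
     [-1.0, 0.0]]
b = [0.0, 0.5, 1.0]

a = layer_forward(a_prev, W, b)
print(a)  # [0.0, 1.5, 0.0]
```

The first neuron's pre-activation is $-1.5$, which ReLU clips to 0. The second neuron's pre-activation of $1.5$ passes through unchanged.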

Many activation functions are used in neural networks. Some of the most common ones, such as Sigmoid, Tanh, ReLU, Leaky ReLU, and Softmax, are introduced in the reference video: https://www.youtube.com/watch?v=3t9lZM7SS7k&list=PLqnslRFeH2UrcDBWF5mfPGpqQDSta6VK4&index=12

Choosing the right activation function is an important part of neural network design and can vary from model to model. ReLU is likely the most widely used due to its simplicity and effectiveness in many deep learning tasks.
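
For reference, the activation functions named above can each be written in a few lines of plain Python. This is a minimal scalar sketch for building intuition, not the PyTorch implementations; the `negative_slope` value of 0.01 follows the default of PyTorch's `nn.LeakyReLU`.

```python
import math

def sigmoid(x):
    # Squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes any real number into (-1, 1)
    return math.tanh(x)

def relu(x):
    # Keeps positive values, zeroes out negative ones
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # Like ReLU, but lets a small fraction of negative values through
    return x if x > 0 else negative_slope * x

def softmax(xs):
    # Turns a list of scores into probabilities that sum to 1;
    # subtracting the max first avoids overflow in exp
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))      # 0.5
print(relu(-3.0))        # 0.0
print(leaky_relu(-2.0))  # -0.02
print(softmax([1.0, 2.0, 3.0]))
```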

In [None]:
"""
Now, let us incorporate the activation function into our model to see if it improves the prediction.
We will use ReLU here. To do so, we can either use the module `nn.ReLU()` or the function `torch.relu()`.
Of course, you can also define the ReLU function by yourself: ReLU(x) = max(0, x)
"""

# Creating dataset in PyTorch tensors
X = torch.tensor([[1], [2], [3], [4]], dtype=torch.float32)  # shape (4, 1)
Y = torch.tensor([[2, 1], [4, 4], [6, 9], [8, 16]], dtype=torch.float32)  # shape (4, 2)

# NN with activation function (ReLU)
class NN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(NN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)  # first hidden
        self.fc2 = nn.Linear(hidden_size, hidden_size)  # second hidden
        self.fc_out = nn.Linear(hidden_size, output_size) # output layer

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc_out(x)
        return x

# Instantiating NN, with hidden layer dimension 10
model = NN(input_size=X.shape[1], hidden_size=10, output_size=Y.shape[1])

# def MSE loss
def loss(y, y_predicted):
    return ((y_predicted - y) ** 2).mean()

# Initial prediction
print(f'Prediction before training: model([5]) = {model.forward(torch.tensor([[5]], dtype=torch.float32))}')

# Main training loop settings
learning_rate = 0.01
n_iterations = 10000
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for epoch in range(n_iterations):
    # Forward pass
    y_pred = model.forward(X)

    # Calculate loss
    l = loss(Y, y_pred)

    # Zero gradients before the backward pass (calling zero_grad at the start of the loop is also fine)
    optimizer.zero_grad()

    # Backward pass to compute gradients
    l.backward()

    # Optimizer step
    optimizer.step()

    # Print training info
    if epoch % 1000 == 0:
        print(f'Epoch {epoch + 1}, loss = {l:.8f}')

# Prediction after training
print(f'Prediction after training: model([5]) = {model.forward(torch.tensor([[5]], dtype=torch.float32))}')

## DO YOU EXPECT A [10, 25]?

Prediction before training: model([5]) = tensor([[0.2827, 1.0795]], grad_fn=<AddmmBackward0>)
Epoch 1, loss = 52.44388199
Epoch 1001, loss = 0.18804431
Epoch 2001, loss = 0.03756940
Epoch 3001, loss = 0.00843241
Epoch 4001, loss = 0.00213237
Epoch 5001, loss = 0.00063580
Epoch 6001, loss = 0.00018977
Epoch 7001, loss = 0.00005521
Epoch 8001, loss = 0.00001542
Epoch 9001, loss = 0.00000457
Prediction after training: model([5]) = tensor([[10.0462, 22.8948]], grad_fn=<AddmmBackward0>)
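
The cell above mentions three ways to apply ReLU: the module `nn.ReLU()`, the function `torch.relu()`, and a hand-written $\max(0, x)$. A small sketch to check that they agree on an arbitrary example tensor follows; the input values are chosen here for illustration.

```python
import torch
import torch.nn as nn

x = torch.tensor([[-2.0, -0.5, 0.0, 1.5]])

relu_module = nn.ReLU()               # module form, can be stored as a layer in __init__
out_module = relu_module(x)
out_function = torch.relu(x)          # functional form, called directly in forward()
out_manual = torch.clamp(x, min=0)    # hand-written ReLU(x) = max(0, x)

print(out_module)
print(torch.equal(out_module, out_function))
print(torch.equal(out_function, out_manual))
```

The module form is convenient when composing layers with `nn.Sequential`, while the functional form keeps `forward` compact. Which one to use is largely a matter of style.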


In [None]:
"""
You may find that the result is still not perfect.
However, if you check the predictions within the range of the training data, you’ll notice:
"""
for i in range(5):
    print(f'Prediction after training: model([{i+1}]) = {model.forward(torch.tensor([[i+1]], dtype=torch.float32))}')

Prediction after training: model([1]) = tensor([[2.0001, 1.0002]], grad_fn=<AddmmBackward0>)
Prediction after training: model([2]) = tensor([[4.0004, 4.0007]], grad_fn=<AddmmBackward0>)
Prediction after training: model([3]) = tensor([[6.0006, 9.0022]], grad_fn=<AddmmBackward0>)
Prediction after training: model([4]) = tensor([[ 8.0008, 16.0031]], grad_fn=<AddmmBackward0>)
Prediction after training: model([5]) = tensor([[10.0462, 22.8948]], grad_fn=<AddmmBackward0>)


In [None]:
"""
You’ll find that the model outputs very precise results **within** the range of
the data it was trained on. This leads us to two important conclusions:

1. **Compared to the version without an activation function, our model can now
learn non-linear relationships**. This is clear from how well the model predicts
both linear (2 * x) and non-linear (x^2) patterns with input [[1], [2], [3], [4]].

2. **However, DON’T expect a simple neural network to generalize well beyond the
training data range!** The model performs well within the range it was trained on,
but it has no information about how the target behaves outside that range. A ReLU
network is piecewise linear, so far enough outside the training data it simply
continues along a straight line, which is why the prediction at x = 5 falls short of 25.

--- The information we don’t have, we wouldn’t have.
Yet, the interest lies in the information that hides in complex patterns,
which human modelers may not initially notice. ---
"""


"""
EXERCISE:
1. How would you apply activation functions to our "from scratch" model?
2. Try some other activation functions from:
   https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity
   (search: activations)
"""
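
The poor extrapolation noted above can be illustrated with a tiny hand-built ReLU network. The weights below are picked by hand (not trained, and not the tutorial's model) so that the network passes exactly through $x^2$ at the four training inputs; `tiny_relu_net` is a name introduced here for illustration.

```python
def relu(z):
    return max(0.0, z)

def tiny_relu_net(x):
    # One hidden layer with three ReLU units; each unit adds a kink at x = 1, 2, 3.
    # Hand-picked weights, chosen so the output equals x^2 at x = 1, 2, 3, 4.
    h1 = relu(x - 1.0)
    h2 = relu(x - 2.0)
    h3 = relu(x - 3.0)
    return 1.0 + 3.0 * h1 + 2.0 * h2 + 2.0 * h3

for x in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]:
    print(x, tiny_relu_net(x), x ** 2)
# 1.0 1.0 1.0
# 2.0 4.0 4.0
# 3.0 9.0 9.0
# 4.0 16.0 16.0
# 5.0 23.0 25.0
# 6.0 30.0 36.0
```

Past the last kink the output grows with a constant slope of 7, so it gives 23 at $x = 5$ and falls further behind $x^2$ as $x$ increases. The trained model's prediction of about 22.9 at $x = 5$ is consistent with this kind of straight-line continuation.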

In [None]:
"""
Additional Information:

Deep neural network? Shallow neural network? Which network structure is the best?
https://www.youtube.com/watch?v=oJNHXPs0XDk
"""

Now, we are equipped with most of the fundamental knowledge for building a neural network using PyTorch. Advanced algorithms largely focus on the architecture of neural networks (e.g. see the Additional Information) and strategies for updating them in different tasks. In the next tutorial, we will apply what we have learned to a basic exercise: solving a simple image classification problem.