In this section, we will learn how to use `.backward()` to compute the gradients that gradient descent needs, and see how PyTorch's optimizers use those gradients to update the model's parameters.

The process of updating a model's parameters involves three main steps:
- **Forward pass**: Constructing the computational graph from the input to the loss.
- **Backward pass**: Computing the gradients using the Chain Rule, typically through the "vector-Jacobian product" method.
- **Model optimization**: Updating model parameters using a PyTorch optimizer.

Each of these steps can be done with help from PyTorch.
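
To make the "vector-Jacobian product" in the backward pass more concrete, here is a minimal sketch. The tensors `x` and `v` are illustrative values chosen for this sketch, not part of the example used later. When the output is not a scalar, `.backward()` takes a vector `v` and computes `vᵀJ` directly instead of building the full Jacobian `J`.

```python
import torch

# Illustrative values, chosen only for this sketch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                      # element-wise, so the Jacobian is diag(2 * x)

v = torch.tensor([1.0, 0.1, 0.01])
y.backward(v)                   # accumulates v^T J into x.grad

expected = v * 2 * x.detach()   # v^T diag(2x), worked out by hand
print(x.grad)
print(torch.allclose(x.grad, expected))
```

For a scalar loss, calling `loss.backward()` with no argument corresponds to using `v = 1`.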

*Reference video: https://www.youtube.com/watch?v=3Kb0QS6z7WA&list=PLqnslRFeH2UrcDBWF5mfPGpqQDSta6VK4&index=4 (for backward pass)*

*https://www.youtube.com/watch?v=E-I2DNVzQLg&list=PLqnslRFeH2UrcDBWF5mfPGpqQDSta6VK4&index=5 (for parameter updates)*

*https://www.youtube.com/watch?v=VVDHU_TWwUg&list=PLqnslRFeH2UrcDBWF5mfPGpqQDSta6VK4&index=7 (for PyTorch built-in optimizer)*

In [None]:
"""
Let us consider the following computational graph:

x --|
    | --> mul (*) --> ŷ --|
w --|                     |--> sub (-) --> s --> square (^2) --> loss
                      y --|

where `x` is the input, and the loss is calculated as:

loss = [(x * w) - y]^2, where (x * w) = ŷ and [(x * w) - y] = s.
The goal is simple:
Minimizing the loss <--> finding the value of `w` that makes `x * w` as close to the target `y` as possible.

Let us assume the initial conditions: x = 1, w = 1, and y = 2. Then the setup is:
"""

import torch

x = torch.tensor(1.0)
y = torch.tensor(2.0)

w = torch.tensor(1.0, requires_grad=True)  # requires_grad=True since we are aiming to update `w`

# forward pass
y_hat = w * x  # ŷ = w * x
loss = (y_hat - y)**2  # loss = (ŷ - y)^2

"""
Exercise: Do you think `y_hat` and `loss` now require gradients?
"""
print(loss)

# backward pass
loss.backward()
print(w.grad)  # output: tensor(-2.)

"""
Exercise: Try to check by hand if ∂(loss) / ∂(w) = -2.
"""

tensor(1., grad_fn=<PowBackward0>)
tensor(-2.)
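
One way to check this value by hand is to walk the graph backwards with the Chain Rule: ∂loss/∂s = 2s, ∂s/∂ŷ = 1, and ∂ŷ/∂w = x, so ∂loss/∂w = 2(x * w - y) * x = 2(1 - 2)(1) = -2. The sketch below repeats that calculation in plain Python and compares it with a finite-difference estimate. The helper names and the step size `eps` are choices made for this illustration.

```python
def loss_fn(w, x=1.0, y=2.0):
    # loss = [(x * w) - y]^2, the same graph as above
    return (w * x - y) ** 2

def analytic_grad(w, x=1.0, y=2.0):
    # Chain Rule: d(loss)/ds * ds/dŷ * dŷ/dw = 2s * 1 * x
    return 2.0 * (w * x - y) * x

def numeric_grad(w, eps=1e-6):
    # Central finite difference as an independent check
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

print(analytic_grad(1.0))   # -2.0
print(numeric_grad(1.0))    # approximately -2.0
```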


In [None]:
"""
Before we use PyTorch to update a model, let's manually explore how we would tackle a simple regression problem,
where we aim to find a function f(x) = 2 * x, but the weight "2" is unknown initially.

Steps of the regression problem include:
1. Starting from some known dataset.
2. Initializing the parameter for a linear function f(x) = w * x.
3. Using this f(x) = w * x to make predictions (forward pass).
4. Using the predictions to calculate the loss.
5. Using the loss to compute gradients with the Chain Rule (backward pass).
6. Iteratively updating w using the gradients (gradient descent).
"""

import numpy as np

# 1: Known dataset used for training the linear model
X = np.array([1, 2, 3, 4], dtype=np.float32)
Y = np.array([2, 4, 6, 8], dtype=np.float32)
# This suggests Y = 2 * X

# 2: Initialize the weight w (which we don't know initially)
w = 0.0  # Initial guess for w

# 3: Model prediction (define forward pass)
def forward(x):
    return w * x

# 4: define the loss (Mean Squared Error)
def loss(y, y_predicted):
    return ((y_predicted - y) ** 2).mean()

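# 5: Gradient calculation
# MSE: J = 1/N * sum((w * x - y)^2)
# True gradient: dJ/dw = 1/N * sum(2 * x * (w * x - y))
# Caution: np.dot below already sums over the batch and returns a scalar, so the
# trailing .mean() has no effect. As written, this function returns N times the
# true MSE gradient (here N = 4), which acts like a 4x larger learning rate.
# The averaged form would be (2 * x * (y_predicted - y)).mean().

def gradient(x, y, y_predicted):
    return np.dot(2 * x, y_predicted - y).mean()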

print(f'Prediction before training: f(5) = {forward(5):.3f}')

# 6: Main training loop
learning_rate = 0.01
n_iterations = 10

for epoch in range(n_iterations):
    # Prediction (forward pass)
    y_pred = forward(X)

    """
    Note: We are passing a vector of data (X) in each forward pass, which matches with Y.
    You can think of this as passing a `batch` of data. This is why the loss uses `.mean()`
    to average over the batch. (In `gradient`, `np.dot` has already summed over the batch,
    so the `.mean()` there has no effect.)
    """

    # Calculating loss
    l = loss(Y, y_pred)

    # Calculating gradients
    dw = gradient(X, Y, y_pred)

    # Update weights (Gradient "Descent")
    w -= learning_rate * dw

    # Print training information
    if epoch % 1 == 0:
        print(f'Epoch {epoch + 1}: w = {w:.3f}, loss = {l:.8f}')

print(f'Prediction after training: f(5) = {forward(5):.3f}')

Prediction before training: f(5) = 0.000
Epoch 1: w = 1.200, loss = 30.00000000
Epoch 2: w = 1.680, loss = 4.79999924
Epoch 3: w = 1.872, loss = 0.76800019
Epoch 4: w = 1.949, loss = 0.12288000
Epoch 5: w = 1.980, loss = 0.01966083
Epoch 6: w = 1.992, loss = 0.00314574
Epoch 7: w = 1.997, loss = 0.00050331
Epoch 8: w = 1.999, loss = 0.00008053
Epoch 9: w = 1.999, loss = 0.00001288
Epoch 10: w = 2.000, loss = 0.00000206
Prediction after training: f(5) = 9.999
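
It is worth comparing what the manual `gradient` function returns with the formula for the MSE gradient, dJ/dw = 1/N * sum(2 * x * (w * x - y)). The sketch below evaluates both at the initial guess w = 0. The variable names `summed` and `averaged` are introduced for this illustration only.

```python
import numpy as np

X = np.array([1, 2, 3, 4], dtype=np.float32)
Y = np.array([2, 4, 6, 8], dtype=np.float32)
w = 0.0
y_pred = w * X

# np.dot sums over the batch and returns a scalar, so .mean() changes nothing
summed = np.dot(2 * X, y_pred - Y).mean()

# Element-wise product first, then the mean over the batch
averaged = (2 * X * (y_pred - Y)).mean()

print(summed, averaged)     # -120.0 -30.0
print(summed / averaged)    # 4.0, the batch size N
```

With a learning rate of 0.01, the summed version moves w from 0 to 1.2 in the first step, which matches the output above. The averaged gradient would move it to 0.3.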


In [None]:
"""
Now, let's move on and see how the PyTorch version of this process will look.
We'll use PyTorch's built-in functionalities to handle gradient computation and parameter updates more efficiently.
"""

import torch

# Convert the dataset into PyTorch tensors for Autograd
X = torch.tensor(X, dtype=torch.float32)
Y = torch.tensor(Y, dtype=torch.float32)

# Create the weight parameter with requires_grad=True for grad tracking
w = torch.tensor(0.0, dtype=torch.float32, requires_grad=True)

# Forward pass function (remains the same; but we'll use PyTorch tensors for input)
def forward(x):
    return w * x

# Loss function (remains the same; PyTorch tensors input)
def loss(y, y_predicted):
    return ((y_predicted - y) ** 2).mean()

print(f'Prediction before training: f(5) = {forward(5):.3f}')

# Main training loop
learning_rate = 0.01
n_iterations = 20

for epoch in range(n_iterations):
    # Forward pass
    y_pred = forward(X)

    # Calculate loss
    l = loss(Y, y_pred)

    # Backward pass to compute gradients
    l.backward()  # PyTorch handles the gradient computation AUTOMATICALLY!
                  # We don't need to apply the Chain Rule ourselves!

    # Update weights
    with torch.no_grad():
        w -= learning_rate * w.grad
    """
    Note: We use `torch.no_grad()` to ensure that the weight update does not interfere with the computational graph.
    This prevents PyTorch from tracking gradients during the update step.
    """

    # Zero the gradients after updating
    w.grad.zero_()
    """
    IMPORTANT: Gradients accumulate with each call to `.backward()`, so we must reset (zero) them
    after each weight update and before the next backward pass.
    """

    # Print training info
    if epoch % 2 == 0:
        print(f'Epoch {epoch + 1}: w = {w:.3f}, loss = {l:.8f}')

print(f'Prediction after training: f(5) = {forward(5):.3f}')

"""
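Note: The result of f(5) does not match the manual NumPy model, and w converges more slowly here.
The cause is not floating-point precision: the manual `gradient` function uses `np.dot`, which sums
over the four samples instead of averaging them, so its gradient (and each update step) is 4 times
larger than the true MSE gradient. PyTorch's `.backward()` computes the gradient of the mean loss,
which is why the first update here gives w = 0.300 rather than 1.200.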
"""

Prediction before training: f(5) = 0.000
Epoch 1: w = 0.300, loss = 30.00000000
Epoch 3: w = 0.772, loss = 15.66018772
Epoch 5: w = 1.113, loss = 8.17471695
Epoch 7: w = 1.359, loss = 4.26725292
Epoch 9: w = 1.537, loss = 2.22753215
Epoch 11: w = 1.665, loss = 1.16278565
Epoch 13: w = 1.758, loss = 0.60698116
Epoch 15: w = 1.825, loss = 0.31684780
Epoch 17: w = 1.874, loss = 0.16539653
Epoch 19: w = 1.909, loss = 0.08633806
Prediction after training: f(5) = 9.612



Now we understand how PyTorch's autograd simplifies the training process. Up to this point, the key takeaway is that autograd frees us from manually tracking and calculating gradients.
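
The gradient accumulation mentioned in the training loop above can be observed directly. The following is a minimal sketch with an illustrative loss, `(3 * w)^2`, chosen only so that the numbers are easy to check by hand.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# Two backward passes without zeroing in between
for _ in range(2):
    loss = (3.0 * w) ** 2       # d(loss)/dw = 18 * w = 18 at w = 1
    loss.backward()

accumulated = w.grad.item()     # 18 + 18
print(accumulated)              # 36.0

w.grad.zero_()                  # reset in place before the next backward pass
print(w.grad.item())            # 0.0
```

Without the reset, every update would use the sum of all gradients computed so far rather than the gradient of the current loss.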

Next, we will go further by learning how PyTorch's optimizers can help with parameter updates. In the current simple model, we used a fixed learning rate. In more complex scenarios, optimization algorithms such as `Adam`, which adapt the step size for each parameter during training, are often preferred.

Putting aside the benefits of advanced algorithms, even basic methods like Stochastic Gradient Descent (SGD) become simpler to write with PyTorch's optimizers, because the update and the gradient reset each reduce to a single call.

In [None]:
# Let's do the same example but with a PyTorch optimizer

# Creating dataset in PyTorch tensors
X = torch.tensor([1, 2, 3, 4], dtype=torch.float32)
Y = torch.tensor([2, 4, 6, 8], dtype=torch.float32)

# Create the weight parameter with requires_grad=True for gradient tracking
w = torch.tensor(0.0, dtype=torch.float32, requires_grad=True)

# The same forward pass function
def forward(x):
    return w * x

# The same mean squared error (MSE) loss function
def loss(y, y_predicted):
    return ((y_predicted - y) ** 2).mean()

print(f'Prediction before training: f(5) = {forward(5):.3f}')

# Main training loop settings
learning_rate = 0.01
n_iterations = 100

"""
Now, let's set up a PyTorch optimizer before starting the training loop
to update our model using stochastic gradient descent (SGD).

PyTorch's optimizers are part of the `torch.optim` module.
You can explore many different optimizers in the official documentation:
https://pytorch.org/docs/stable/optim.html

Let's start with a simple one: `torch.optim.SGD`.
The optimizer is instantiated by passing the parameters we want to update (in a list)
and the learning rate (lr).

Note: The learning rate is a fundamental hyperparameter that you should tune
first if the model's performance is not satisfactory.
"""
optimizer = torch.optim.SGD([w], lr=learning_rate)

for epoch in range(n_iterations):
    # Forward pass
    y_pred = forward(X)

    # Calculate loss
    l = loss(Y, y_pred)

    # Backward pass to compute gradients
    l.backward()

    # Update weights. The SGD optimizer updates the weights using the `.grad` attribute on the parameters
    # which was populated by the `.backward()` call.
    optimizer.step()

    # With a PyTorch optimizer, zeroing the gradients is a single call:
    optimizer.zero_grad()

    # Print training info
    if epoch % 10 == 0:
        print(f'Epoch {epoch + 1}: w = {w:.3f}, loss = {l:.8f}')

print(f'Prediction after training: f(5) = {forward(5):.3f}')

## We see how clean and efficient our code becomes with the help of PyTorch optimizers.

Prediction before training: f(5) = 0.000
Epoch 1: w = 0.300, loss = 30.00000000
Epoch 11: w = 1.665, loss = 1.16278565
Epoch 21: w = 1.934, loss = 0.04506890
Epoch 31: w = 1.987, loss = 0.00174685
Epoch 41: w = 1.997, loss = 0.00006770
Epoch 51: w = 1.999, loss = 0.00000262
Epoch 61: w = 2.000, loss = 0.00000010
Epoch 71: w = 2.000, loss = 0.00000000
Epoch 81: w = 2.000, loss = 0.00000000
Epoch 91: w = 2.000, loss = 0.00000000
Prediction after training: f(5) = 10.000
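
To see what `optimizer.step()` does in this setting, the sketch below performs one manual update and one optimizer update from the same starting point and compares them. It assumes plain SGD (no momentum, no weight decay), where the update is `w -= lr * w.grad`. The names `mse`, `w_manual`, and `w_optim` are introduced for this illustration.

```python
import torch

X = torch.tensor([1, 2, 3, 4], dtype=torch.float32)
Y = torch.tensor([2, 4, 6, 8], dtype=torch.float32)
lr = 0.01

def mse(w):
    return ((w * X - Y) ** 2).mean()

# Manual update, as in the previous cell
w_manual = torch.tensor(0.0, requires_grad=True)
mse(w_manual).backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad

# Optimizer update (plain SGD: no momentum, no weight decay)
w_optim = torch.tensor(0.0, requires_grad=True)
optimizer = torch.optim.SGD([w_optim], lr=lr)
mse(w_optim).backward()
optimizer.step()
optimizer.zero_grad()

print(w_manual.item(), w_optim.item())
print(torch.allclose(w_manual, w_optim))
```

Both values should be close to 0.3, the same `w` reported after the first epoch above.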


### About the choice of optimizers:

Very nice YouTube reference: [ML 2021 (English version)] Lecture 6: What to do when optimization fails? (Learning rate):
https://www.youtube.com/watch?v=8yf-tU7zm7w
(describing how learning works in different optimization strategies)

When training machine learning models, choosing the right optimizer is important for performance. Different optimizers can lead to different convergence speeds and results. Here are a few commonly used optimizers in PyTorch, starting from the more basic to the more advanced:

1. **Stochastic Gradient Descent (SGD)**:
   - This is the most basic and foundational optimizer. It updates parameters based on the gradients, typically with a fixed learning rate.
   - While effective, SGD can struggle with more complex models due to its constant learning rate, but it can still perform well with careful tuning and the addition of momentum.
   - `torch.optim.SGD`

2. **Momentum-based SGD**:
   - A variant of SGD that introduces momentum, which helps the optimizer converge faster by building up speed in directions with consistent gradients.
   - Momentum helps overcome small local minima or flat regions in the loss function, which plain SGD may struggle with.
   - `torch.optim.SGD` with the `momentum` parameter.

3. **RMSprop**:
   - RMSprop divides the learning rate by the square root of an exponentially decaying average of squared gradients, which keeps the size of each update in a similar range even when gradients are large.
   - It is often used in recurrent neural networks (RNNs) and time-series models due to its ability to adapt to non-stationary data.
   - `torch.optim.RMSprop`

4. **AdaGrad**:
   - AdaGrad adapts the learning rate for each parameter by dividing it by the square root of that parameter's accumulated squared gradients, so rarely updated parameters keep larger steps. This makes it particularly suitable for sparse data.
   - However, it can cause the learning rate to become too small over time, which may slow down learning.
   - `torch.optim.Adagrad`

5. **Adam (Adaptive Moment Estimation)**:
   - Adam is one of the most commonly used optimizers in modern deep learning. It combines the benefits of AdaGrad (adaptive learning rates) and RMSprop (exponentially decaying average of squared gradients) while also incorporating momentum-like behavior.
   - The `weight_decay` parameter, which most `torch.optim` optimizers accept, is often set with Adam. In `torch.optim.Adam` it adds an L2 penalty on the weights, which helps prevent overfitting by penalizing large weights.
   - These make Adam particularly effective for larger models, deep neural networks, and noisy data, which is why it’s often the default choice in modern machine learning.
   - `torch.optim.Adam`
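
As a rough sketch of how these optimizers are swapped in, the loop below trains the same one-parameter model from this section with each of them. The learning rates, the momentum value, and the number of steps are illustrative assumptions, not tuned recommendations, and the dictionary `optimizer_factories` is a helper invented for this example.

```python
import torch

X = torch.tensor([1, 2, 3, 4], dtype=torch.float32)
Y = torch.tensor([2, 4, 6, 8], dtype=torch.float32)

# Hyperparameters below are illustrative, not tuned recommendations
optimizer_factories = {
    "SGD": lambda p: torch.optim.SGD(p, lr=0.01),
    "SGD + momentum": lambda p: torch.optim.SGD(p, lr=0.01, momentum=0.9),
    "RMSprop": lambda p: torch.optim.RMSprop(p, lr=0.01),
    "Adagrad": lambda p: torch.optim.Adagrad(p, lr=0.1),
    "Adam": lambda p: torch.optim.Adam(p, lr=0.1, weight_decay=0.0),
}

final_w = {}
for name, make_optimizer in optimizer_factories.items():
    w = torch.tensor(0.0, requires_grad=True)   # fresh parameter per optimizer
    optimizer = make_optimizer([w])
    for _ in range(200):
        loss = ((w * X - Y) ** 2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    final_w[name] = w.item()
    print(f"{name}: w = {final_w[name]:.3f}")
```

Only the line that constructs the optimizer changes between runs. The training loop itself stays the same, which is the main practical benefit of the `torch.optim` interface.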

### Which optimizer to choose?
- **SGD** is a good starting point for simple models and experimentation.
- **Adam** is the most widely used optimizer for modern deep learning models due to its adaptability and robust performance on complex problems.
- Regardless of the optimizer you choose, tuning hyperparameters like the learning rate is essential for achieving good results.

### P.S. If you're unsure what to use, **go with Adam**.

We have learned how to use PyTorch to update a model. However, the neural network itself has yet to appear. In the next tutorial, we will work through more complex examples and start building a neural network from scratch.