In our 2D example, the loss function can be thought of as a bowl-shaped, parabola-like surface that reaches its minimum at a certain pair of weights $w_1$, $w_2$. Visually, we have:

![Screenshot 2025-09-18 at 3.50.28 AM.png](attachment:b1fdda63-3044-4b7e-9f0a-0767d88326ea.png)

To find these weights, the core idea is simply to follow the slope of the curve. Although we don’t know the overall shape of the loss, we can calculate the slope at a point and then move in the downhill direction.

But what is the slope?

### Slope: the derivative of the loss function
In calculus, the slope of a function at a point is its derivative at that point. For our loss $C$ and a weight $w$, it is denoted $\frac{\partial C}{\partial w}$. The ultimate goal is to find the global minimum. At any minimum, local or global, the derivative is zero, so a derivative close to zero indicates that we are near the bottom of the curve.

The same principle extends to N dimensions. Although this is very difficult to visualize, the maths still works in the same way. Keep in mind that the minimum we reach is not always the global minimum.
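As a rough illustration of reading the slope at a point, the sketch below estimates a derivative numerically with a central finite difference. The one-dimensional loss `toy_loss` (with its minimum at $w = 3$) and the helper `slope` are hypothetical names chosen for this example, not part of the setup above.

```python
def toy_loss(w):
    # Hypothetical 1D loss with its minimum at w = 3.
    return (w - 3.0) ** 2


def slope(f, w, h=1e-5):
    # Central finite difference: approximates df/dw at the point w.
    return (f(w + h) - f(w - h)) / (2 * h)


left = slope(toy_loss, 0.0)    # negative: the loss falls as w increases
right = slope(toy_loss, 5.0)   # positive: the loss falls as w decreases
bottom = slope(toy_loss, 3.0)  # approximately zero at the minimum

print(left, right, bottom)
```

The sign of the slope tells us which way is downhill: we move $w$ against the sign of the slope.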

### Computing the gradient of a loss function

The question is how to compute the derivative with respect to the weights.
Since our loss function is $C = (f(x_i, \mathbf{W}) - y_i)^2$, where the classifier $f$ is $f = w_1 x + w_2$,
applying the chain rule gives:

$\frac{\partial C}{\partial w_1} = 2(w_1 x + w_2 - y)x$

$\frac{\partial C}{\partial w_2} = 2(w_1 x + w_2 - y)$

These are simply the partial derivatives of the loss with respect to our two weights.
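One way to sanity-check these two formulas is to compare them against a numerical estimate. The sketch below does that for one arbitrary point; the function names and the sample values of `w1`, `w2`, `x`, `y` are assumptions made for illustration.

```python
def loss(w1, w2, x, y):
    # Squared error of the linear model f = w1 * x + w2.
    return (w1 * x + w2 - y) ** 2


def analytic_grad(w1, w2, x, y):
    # The two partial derivatives written out by hand.
    err = w1 * x + w2 - y
    return 2 * err * x, 2 * err


def numeric_grad(w1, w2, x, y, h=1e-6):
    # Central finite differences, one weight at a time.
    d1 = (loss(w1 + h, w2, x, y) - loss(w1 - h, w2, x, y)) / (2 * h)
    d2 = (loss(w1, w2 + h, x, y) - loss(w1, w2 - h, x, y)) / (2 * h)
    return d1, d2


# Arbitrary sample point, chosen only for this check.
w1, w2, x, y = 0.5, -1.0, 2.0, 3.0
analytic = analytic_grad(w1, w2, x, y)
numeric = numeric_grad(w1, w2, x, y)

print("analytic:", analytic)
print("numeric: ", numeric)
```

If the two results agree to several decimal places, the hand-derived gradient is very likely correct.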
Now that we have our gradients, let’s adjust our weights to go downhill: 

$w_1^{t} = w_1^{t-1} - \lambda \frac{\partial C}{\partial w_1}$

$w_2^{t} = w_2^{t-1} - \lambda \frac{\partial C}{\partial w_2}$

where $\lambda$ is a small learning rate. The learning rate is often chosen between $10^{-6}$ and $10^{-3}$, although the best value depends on the problem. It defines how large a step we take in the direction opposite to the gradient.
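To get a feel for the role of the learning rate, the sketch below applies the update rule to a hypothetical loss $C(w) = w^2$, whose derivative is $2w$. The three learning rates are assumptions picked to make the effect visible on this toy problem; they are not recommended values.

```python
def descend(lr, w=1.0, steps=20):
    # Repeatedly apply w <- w - lr * dC/dw for C(w) = w**2, where dC/dw = 2 * w.
    for _ in range(steps):
        w = w - lr * 2 * w
    return w


# Compare a small, a moderate, and a too-large learning rate.
results = {lr: descend(lr) for lr in (0.01, 0.1, 1.1)}
for lr, w in results.items():
    print(f"lr={lr}: w after 20 steps = {w:.4f}")
```

A small rate moves slowly towards the minimum at $w = 0$, a moderate rate gets there faster, and a rate that is too large overshoots on every step and moves away from the minimum.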

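OK, we found the gradient! How do we change a parameter in general? For any parameter $w$, step $j$ of the update is:

$w^{j} = w^{j-1} - \lambda \frac{\partial C}{\partial w}$

where the gradient is evaluated at the current value $w^{j-1}$.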
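Putting the gradients and the update rule together, here is a minimal sketch that fits the two weights of $f = w_1 x + w_2$ by gradient descent. The data set (points on the line $y = 2x + 1$), the starting weights, the learning rate, and the choice to average the gradient over all examples are assumptions made for this example.

```python
# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]

w1, w2 = 0.0, 0.0   # starting weights (arbitrary choice)
lr = 0.05           # learning rate, picked by hand for this toy problem

for step in range(2000):
    # Average the per-example partial derivatives over the data set.
    g1 = sum(2 * (w1 * x + w2 - y) * x for x, y in zip(xs, ys)) / len(xs)
    g2 = sum(2 * (w1 * x + w2 - y) for x, y in zip(xs, ys)) / len(xs)
    # Move both weights against their gradients.
    w1 -= lr * g1
    w2 -= lr * g2

print(f"w1 = {w1:.4f}, w2 = {w2:.4f}")
```

The weights should approach $w_1 = 2$ and $w_2 = 1$, the slope and intercept used to generate the data.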

### Summing up the training scheme
To recap, the training algorithm, known as gradient descent, can be formulated like this for the N-dimensional case:

- Initialize the classifier $f(x_i, \mathbf{W})$ with random weights $\mathbf{W}$.
- Feed a training example $x_i$ (vector) with corresponding target vector $t_i$ into the classifier, and compute the output $y_i = f(x_i, \mathbf{W})$.
- Compute the loss between the prediction $y_i$ and the target vector $t_i$. The mean squared error loss is commonly used: $C = \frac{1}{K}\sum_{k=1}^{K} (y_{i,k} - t_{i,k})^2$, where $k$ runs over the $K$ components of the output.
- Compute the gradients of the loss with respect to the weights/parameters.
- Adjust the weights $\mathbf{W}$ using the rule $w_{i}^{t} = w_{i}^{t-1} - \lambda \frac{\partial C}{\partial w_i}$, where $\frac{\partial C}{\partial w_i}$ is the gradient of the loss with respect to the parameter $w_i$ and $\lambda$ is the learning rate.
- Repeat from the second step until the loss stops improving.
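The steps above can be sketched without any framework for a single linear layer $y = Wx + b$. This is a minimal illustration under stated assumptions: the function name `train_linear`, the uniform initialization range, the fixed seed, and the bias vector `b` are choices made for this example, and the gradients are written out by hand for the mean squared error.

```python
import random


def train_linear(x, t, lr=0.1, steps=10, seed=0):
    rng = random.Random(seed)
    n_out, n_in = len(t), len(x)

    # Step 1: random initial weights and biases (range is an assumption).
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]

    losses = []
    for _ in range(steps):
        # Step 2: forward pass, y = W x + b.
        y = [sum(W[k][j] * x[j] for j in range(n_in)) + b[k]
             for k in range(n_out)]

        # Step 3: mean squared error over the output components.
        err = [y[k] - t[k] for k in range(n_out)]
        losses.append(sum(e * e for e in err) / n_out)

        # Step 4: gradients. dC/dy_k = 2 * err_k / n_out,
        # dC/dW_kj = dC/dy_k * x_j, and dC/db_k = dC/dy_k.
        dy = [2 * e / n_out for e in err]

        # Step 5: move every parameter against its gradient.
        for k in range(n_out):
            for j in range(n_in):
                W[k][j] -= lr * dy[k] * x[j]
            b[k] -= lr * dy[k]

    return W, b, losses


W, b, losses = train_linear([0.8, 0.4, 0.4, 0.2], [1.0, 0.0])
for step, value in enumerate(losses):
    print(f"step {step}: loss = {value:.6f}")
```

For this particular input, $\lVert x \rVert^2 + 1 = 2$, so each step with `lr=0.1` scales the error by $1 - 0.1 \cdot 2 = 0.8$ and the loss by $0.64$, whatever the random starting weights are.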


The scheme above can be run in PyTorch on a single training example:

```python
import torch
import torch.nn as nn


def train():
    # Define a simple linear model
    # Input size = 4, Output size = 2
    # So it learns weights (W: 2x4) and biases (b: 2)
    model = nn.Linear(4, 2)

    # Define Mean Squared Error loss function
    # Compares predicted output vs. true target
    criterion = nn.MSELoss()

    # Define optimizer (Stochastic Gradient Descent) with learning rate = 0.1
    # This will update model weights based on computed gradients
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Define input vector (4 features)
    inputs = torch.tensor([0.8, 0.4, 0.4, 0.2])

    # Define expected output (target) vector (2 values)
    # Here, it's acting like a "one-hot" target [1, 0]
    labels = torch.tensor([1.0, 0.0])

    # Run for 10 training iterations, reusing the same single example each time
    for epoch in range(10):
        # Clear old gradients so they do not accumulate
        optimizer.zero_grad()

        # Forward pass: compute predicted output
        outputs = model(inputs)

        # Compute the loss (how far prediction is from target)
        loss = criterion(outputs, labels)

        # Print loss (before backprop + update)
        print("Raw Loss Tensor:", loss)

        # Backward pass: compute gradients w.r.t. model parameters
        loss.backward()

        # Update model parameters using optimizer and gradients
        optimizer.step()

        # Print epoch and the loss value computed before this update
        print(f"Epoch {epoch}, Loss = {loss.item()}")


if __name__ == "__main__":
    train()
```

Output (exact values depend on the random weight initialization):

```
Raw Loss Tensor: tensor(0.7837, grad_fn=<MseLossBackward0>)
Epoch 0, Loss = 0.7836969494819641
Raw Loss Tensor: tensor(0.5016, grad_fn=<MseLossBackward0>)
Epoch 1, Loss = 0.5015659928321838
Raw Loss Tensor: tensor(0.3210, grad_fn=<MseLossBackward0>)
Epoch 2, Loss = 0.3210022449493408
Raw Loss Tensor: tensor(0.2054, grad_fn=<MseLossBackward0>)
Epoch 3, Loss = 0.2054414004087448
Raw Loss Tensor: tensor(0.1315, grad_fn=<MseLossBackward0>)
Epoch 4, Loss = 0.13148249685764313
Raw Loss Tensor: tensor(0.0841, grad_fn=<MseLossBackward0>)
Epoch 5, Loss = 0.0841488167643547
Raw Loss Tensor: tensor(0.0539, grad_fn=<MseLossBackward0>)
Epoch 6, Loss = 0.05385523661971092
Raw Loss Tensor: tensor(0.0345, grad_fn=<MseLossBackward0>)
Epoch 7, Loss = 0.034467339515686035
Raw Loss Tensor: tensor(0.0221, grad_fn=<MseLossBackward0>)
Epoch 8, Loss = 0.022059103474020958
Raw Loss Tensor: tensor(0.0141, grad_fn=<MseLossBackward0>)
Epoch 9, Loss = 0.014117831364274025
```
