In [36]:
import torch

In [37]:
N = 10
D_in = 1
D_out = 1


In [38]:
torch.manual_seed(42)
X = torch.randn(N, D_in)
print(f"X[:3]: {X[:3]}")

W_true = torch.tensor([[2.0]])
b_true = torch.tensor(1.0)
print(f"X.shape: {X.shape}")
print(f"W_true: {W_true}")
print(f"b_true: {b_true}")

y_true = X @ W_true + b_true
print("---- y_true[:3]:")
print(y_true[:3])

X[:3]: tensor([[0.3367],
        [0.1288],
        [0.2345]])
X.shape: torch.Size([10, 1])
W_true: tensor([[2.]])
b_true: 1.0
---- y_true[:3]:
tensor([[1.6734],
        [1.2576],
        [1.4689]])


In [39]:
lr, epochs = 0.07, 40

torch.manual_seed(42)
W, b = (
    torch.randn(D_in, D_out, requires_grad=True),
    torch.randn(1, requires_grad=True),
)
print(f"W.shape: {W.shape}")
print(f"W[:3]: {W[:3]}")
print(f"b: {b}")

W.shape: torch.Size([1, 1])
W[:3]: tensor([[0.3367]], grad_fn=<SliceBackward0>)
b: tensor([0.1288], requires_grad=True)


In [40]:
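for i in range(epochs):
    # Forward pass: predictions for all N samples, then the MSE loss.
    y_hat = X @ W + b
    loss = torch.mean((y_hat - y_true) ** 2)
    # Backward pass: autograd fills W.grad and b.grad.
    loss.backward()
    # Gradient-descent step, performed without recording it in the graph.
    with torch.no_grad():
        W -= lr * W.grad
        b -= lr * b.grad

    # Reset gradients; otherwise the next backward() would accumulate into them.
    W.grad.zero_()
    b.grad.zero_()
    # print(f"Epoch {i + 1}/{epochs}, Loss: {loss.item():.4f}")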

print(f"Final Loss: {loss.item():.4f}")
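# Report the learned parameters next to the predictions and the targets.
# Note: loss and y_hat come from the last forward pass, i.e. before the final update.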
print(f"Final W: {W}")
print(f"Final b: {b}")
print(f"Final y_hats: {y_hat}")
print(f"True y_hats: {y_true}")


Final Loss: 0.0007
Final W: tensor([[1.9722]], requires_grad=True)
Final b: tensor([1.0117], requires_grad=True)
Final y_hats: tensor([[ 1.6757],
        [ 1.2663],
        [ 1.4744],
        [ 1.4663],
        [-1.1988],
        [ 0.6457],
        [ 5.3616],
        [-0.2439],
        [ 1.9219],
        [ 1.5392]], grad_fn=<AddBackward0>)
True y_hats: tensor([[ 1.6734],
        [ 1.2576],
        [ 1.4689],
        [ 1.4607],
        [-1.2457],
        [ 0.6273],
        [ 5.4164],
        [-0.2760],
        [ 1.9233],
        [ 1.5347]])


## What this notebook is doing (linear regression in math)

Note: `D_in` is the number of input features and `D_out` is the number of outputs per sample.

We are fitting a linear model to data with one input feature and one output. In linear algebra notation:

- Let $N = 10$ be the number of samples.
- Let $D_{\text{in}} = 1$ be the input dimension and $D_{\text{out}} = 1$ the output dimension.
- Inputs: $X \in \mathbb{R}^{N \times D_{\text{in}}}$. In this notebook, $X \in \mathbb{R}^{10 \times 1}$.
- True parameters: $W_{\text{true}} \in \mathbb{R}^{D_{\text{in}} \times D_{\text{out}}} = \mathbb{R}^{1 \times 1}$ and $b_{\text{true}} \in \mathbb{R}^{D_{\text{out}}} = \mathbb{R}^{1}$ (a scalar). The targets are:
  $$
   y_{\text{true}} = X \, W_{\text{true}} + b_{\text{true}} \in \mathbb{R}^{N \times D_{\text{out}}} = \mathbb{R}^{10 \times 1}.
  $$

### Model

We learn parameters $W \in \mathbb{R}^{D_{\text{in}} \times D_{\text{out}}} = \mathbb{R}^{1 \times 1}$ and $b \in \mathbb{R}^{D_{\text{out}}} = \mathbb{R}^{1}$. The model prediction for all $N$ samples is:

$$
 \hat{y} = X \, W + b, \quad \hat{y} \in \mathbb{R}^{N \times D_{\text{out}}} = \mathbb{R}^{10 \times 1}.
$$

Here $b$ (shape $1$) is broadcast to each row of $XW$ (shape $10\times1$).
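The broadcasting step can be checked directly. The following is a minimal sketch, assuming the same shapes as the notebook; the seed and the name `y_hat_explicit` are arbitrary choices for illustration only.

```python
import torch

torch.manual_seed(0)        # arbitrary seed, only to make this sketch repeatable
X = torch.randn(10, 1)      # N = 10 samples, D_in = 1
W = torch.randn(1, 1)       # D_in x D_out
b = torch.randn(1)          # shape (1,)

XW = X @ W                  # shape (10, 1)
y_hat = XW + b              # b is broadcast across the 10 rows

# Broadcasting gives the same result as explicitly expanding b to XW's shape.
y_hat_explicit = XW + b.expand_as(XW)

print(XW.shape, b.shape, y_hat.shape)
print(torch.equal(y_hat, y_hat_explicit))
```

No copy of `b` is stored per row; broadcasting only makes the addition behave as if there were one.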

### Loss (mean squared error)

We minimize the mean squared error (MSE):

$$
 \mathcal{L}(W,b) = \frac{1}{N} \sum_{i=1}^{N} \big(\hat{y}_i - y_{\text{true},i}\big)^2
 = \frac{1}{N} \left\| \hat{y} - y_{\text{true}} \right\|_2^2.
$$

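In code, `torch.mean` averages over all $N \cdot D_{\text{out}}$ entries of the squared residuals. With $D_{\text{out}}=1$, this is exactly the average of the squared residuals across the $N$ samples.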
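As a quick sanity check, the different ways of writing the MSE should agree numerically. This sketch uses random stand-in tensors for $\hat{y}$ and $y_{\text{true}}$ (not the notebook's data) and compares the notebook's `torch.mean` expression with the sum form, the norm form, and PyTorch's built-in `F.mse_loss`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)              # arbitrary seed for this sketch
y_hat = torch.randn(10, 1)        # stand-in predictions
y_true = torch.randn(10, 1)       # stand-in targets
N = y_hat.shape[0]

residual = y_hat - y_true

loss_mean = torch.mean(residual ** 2)                       # form used in the notebook
loss_sum = (residual ** 2).sum() / N                        # (1/N) * sum of squares
loss_norm = torch.linalg.vector_norm(residual) ** 2 / N     # (1/N) * ||r||_2^2
loss_builtin = F.mse_loss(y_hat, y_true)                    # default reduction="mean"

print(loss_mean.item(), loss_sum.item(), loss_norm.item(), loss_builtin.item())
```

The four values agree only because $D_{\text{out}} = 1$; with more outputs, `torch.mean` and `F.mse_loss` would divide by $N \cdot D_{\text{out}}$ rather than $N$.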

### Gradients

For linear regression with MSE, the analytical gradients are:

$$
 \nabla_W \mathcal{L} = \frac{2}{N} \, X^{\top} \big( XW + b - y_{\text{true}} \big), \quad
 \nabla_b \mathcal{L} = \frac{2}{N} \, \mathbf{1}^{\top} \big( XW + b - y_{\text{true}} \big).
$$

Here $\mathbf{1} \in \mathbb{R}^{N}$ is a vector of ones, so $\mathbf{1}^{\top}$ sums the residuals over samples. Dimensions check: $X^{\top} \in \mathbb{R}^{1\times10}$ and the residuals $R = XW + b - y_{\text{true}} \in \mathbb{R}^{10\times1}$, hence $X^{\top}R \in \mathbb{R}^{1\times1}$, matching $W$; summing the residuals gives a quantity in $\mathbb{R}^{1}$, matching $b$.

In the notebook, PyTorch's autograd computes these gradients via `loss.backward()`.
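The analytical gradients can be compared against what autograd produces. The sketch below assumes the same model and loss as the notebook, with freshly sampled data; `grad_W` and `grad_b` are names introduced here for the hand-computed values.

```python
import torch

torch.manual_seed(0)                      # arbitrary seed for this sketch
N, D_in, D_out = 10, 1, 1
X = torch.randn(N, D_in)
y_true = X @ torch.tensor([[2.0]]) + 1.0  # same true parameters as the notebook

W = torch.randn(D_in, D_out, requires_grad=True)
b = torch.randn(1, requires_grad=True)

# Autograd gradients.
loss = torch.mean((X @ W + b - y_true) ** 2)
loss.backward()

# Hand-computed gradients from the formulas above.
with torch.no_grad():
    R = X @ W + b - y_true                # residuals, shape (10, 1)
    grad_W = (2.0 / N) * (X.T @ R)        # shape (1, 1), matches W
    grad_b = (2.0 / N) * R.sum(dim=0)     # shape (1,), matches b

print(torch.allclose(W.grad, grad_W), torch.allclose(b.grad, grad_b))
```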

### Parameter updates (gradient descent)

With learning rate $\eta$ (`lr` in code), we perform gradient descent:

$$
 W \leftarrow W - \eta \, \nabla_W \mathcal{L}, \qquad
 b \leftarrow b - \eta \, \nabla_b \mathcal{L}.
$$

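This happens inside a `with torch.no_grad():` block so that the updates are not tracked as part of the computational graph; without it, the in-place update of a leaf tensor that requires gradients raises a `RuntimeError`. After each step, gradients are reset with `W.grad.zero_()` and `b.grad.zero_()`, because `backward()` accumulates into `.grad` rather than overwriting it.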
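For comparison, the manual update rule is what plain stochastic gradient descent without momentum does, so `torch.optim.SGD` should land on the same parameters. This is a sketch rather than the notebook's own code: `make_problem` is a hypothetical helper that rebuilds the notebook's data and initial parameters from the same seeds.

```python
import torch

def make_problem():
    """Hypothetical helper: recreate the notebook's data and initial parameters."""
    torch.manual_seed(42)
    X = torch.randn(10, 1)
    y = X @ torch.tensor([[2.0]]) + 1.0
    torch.manual_seed(42)
    W = torch.randn(1, 1, requires_grad=True)
    b = torch.randn(1, requires_grad=True)
    return X, y, W, b

lr, epochs = 0.07, 40

# Manual gradient descent, as in the notebook.
X, y, W, b = make_problem()
for _ in range(epochs):
    loss = torch.mean((X @ W + b - y) ** 2)
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad
        b -= lr * b.grad
    W.grad.zero_()
    b.grad.zero_()

# The same loop using torch.optim.SGD (no momentum, no weight decay).
X2, y2, W2, b2 = make_problem()
optimizer = torch.optim.SGD([W2, b2], lr=lr)
for _ in range(epochs):
    optimizer.zero_grad()
    loss2 = torch.mean((X2 @ W2 + b2 - y2) ** 2)
    loss2.backward()
    optimizer.step()

print(W.item(), W2.item())
print(b.item(), b2.item())
```

Writing the update by hand keeps the arithmetic visible, which suits a walkthrough like this one; the optimizer version is the more common form in larger training scripts.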

### Summary of shapes in this notebook

- $X \in \mathbb{R}^{10 \times 1}$
- $W, W_{\text{true}} \in \mathbb{R}^{1 \times 1}$
- $b, b_{\text{true}} \in \mathbb{R}^{1}$ (scalar); in code, `b` has shape `(1,)` and `b_true` is a 0-dimensional tensor
- $y_{\text{true}}, \hat{y} \in \mathbb{R}^{10 \times 1}$

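Over the epochs, $W$ and $b$ move towards $W_{\text{true}}$ and $b_{\text{true}}$, driving the MSE loss close to zero: after 40 epochs with `lr = 0.07`, the printed values are $W \approx 1.9722$ and $b \approx 1.0117$ (true values $2$ and $1$), with a final loss of about $0.0007$.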
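One way to watch this convergence is to record the loss at every epoch. The sketch below repeats the notebook's setup with the same seeds and hyperparameters; the `losses` list is an addition for illustration and is not part of the original loop.

```python
import torch

torch.manual_seed(42)
X = torch.randn(10, 1)
y_true = X @ torch.tensor([[2.0]]) + 1.0

torch.manual_seed(42)
W = torch.randn(1, 1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

lr, epochs = 0.07, 40
losses = []  # illustrative: loss measured before each update

for _ in range(epochs):
    loss = torch.mean((X @ W + b - y_true) ** 2)
    losses.append(loss.item())
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad
        b -= lr * b.grad
    W.grad.zero_()
    b.grad.zero_()

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
print(f"|W - 2| = {abs(W.item() - 2.0):.4f}, |b - 1| = {abs(b.item() - 1.0):.4f}")
```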
