Let's build a real linear model:

\begin{equation*}
y = wx + b
\end{equation*}

For trained parameters $\hat w$ and $\hat b$, the model is 

\begin{equation*}
\hat y(x)=\hat w x + \hat b
\end{equation*}

The Mean Squared Error (MSE) loss between $y$ and $\hat y$ is the squared error averaged over all $N$ elements of $y$:

\begin{equation*}
L = \frac{1}{N}\lVert y - \hat y\rVert^2
\end{equation*}

Assume each input $x$ is a $3$-vector, i.e. $x\in \mathbb{R}^3$, and each output $y$ is a $2$-vector, i.e. $y\in\mathbb{R}^2$.
There are $10$ samples, i.e. $10$ $3$-vectors as inputs and $10$ $2$-vectors as outputs.

In [36]:
import torch
x = torch.randn(10, 3)
y = torch.randn(10, 2)
print(x)
print(y)

tensor([[ 0.1766,  0.1254,  0.2699],
        [-1.5509,  1.8803,  0.3243],
        [-1.7099, -0.1290, -0.9486],
        [-0.8850,  0.3447, -0.8812],
        [ 1.2934, -1.2040,  0.2890],
        [ 0.1278, -1.0115, -0.1379],
        [-0.7354,  0.2185, -0.6483],
        [ 0.0492, -2.1378, -0.3556],
        [ 0.3415,  1.1722,  1.2577],
        [ 0.0516,  0.0733, -0.6183]])
tensor([[-0.4190,  1.5812],
        [-0.2929, -0.5735],
        [ 1.6416,  0.4896],
        [ 0.3663, -0.4154],
        [-0.9183, -1.6822],
        [-1.6864, -0.3989],
        [-0.6359, -0.1090],
        [-1.9653, -1.1053],
        [-0.0496, -0.3079],
        [ 0.3461, -0.5509]])


This is a linear model, so we build a fully connected (linear) layer.

In [37]:
import torch.nn as nn
linear = nn.Linear(3, 2)
print('w: ', linear.weight)
print('b: ', linear.bias)

w:  Parameter containing:
tensor([[-0.3412,  0.3215, -0.3616],
        [ 0.1967,  0.3356,  0.4529]], requires_grad=True)
b:  Parameter containing:
tensor([-0.4701, -0.2400], requires_grad=True)


Notice that the weight $w$ is a $2\times3$ matrix, while the bias $b$ is a $2$-vector. Because the samples are stored as the rows of $x$, the layer actually computes $y = xw^T + b^T$, with the row vector $b^T$ added to every row.
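
The shape claim and the affine map can be checked directly. The following is a minimal sketch that builds its own layer and random input (the seed and the name `manual` are illustrative assumptions), so its numbers differ from the ones printed above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # illustrative seed, not taken from the cells above

x = torch.randn(10, 3)      # 10 samples stored as rows
linear = nn.Linear(3, 2)    # weight has shape (2, 3), bias has shape (2,)

# Explicit affine map: multiply each row of x by w^T, then add b,
# which is broadcast across the 10 rows.
manual = x @ linear.weight.T + linear.bias

print(linear.weight.shape, linear.bias.shape)
print(torch.allclose(linear(x), manual, atol=1e-6))
```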

The loss function and optimizer are

In [38]:
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(linear.parameters(), lr=0.01)

SGD stands for [Stochastic Gradient Descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent). We can simply understand it as a technique that improves the parameter $w$ of $y=wx+b$ by backpropagating the error of $y$ to $w$:

\begin{equation*}
w_{n+1} = w_n -\eta \nabla_w E_n(w_n)
\end{equation*}

where $E_n(w)$ is the loss at step $n$ as a function of $w$, $\nabla_w E_n$ is its gradient with respect to $w$ (stored in the `.grad` attribute of each parameter tensor), and $\eta$ is the learning rate (here $0.01$). The bias $b$ is updated in the same way.
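
To make the update concrete, here is a small sketch of one gradient descent step on a single scalar parameter. The toy loss $E(w) = (w-3)^2$, the starting value, and the learning rate `eta = 0.1` are assumptions chosen so the arithmetic is easy to follow. They are not part of the model in this section.

```python
import torch

# Hypothetical one-parameter problem: E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3).
w = torch.tensor(1.0, requires_grad=True)
eta = 0.1  # learning rate chosen for illustration

E = (w - 3.0) ** 2
E.backward()                     # fills w.grad with dE/dw = 2 * (1 - 3) = -4

with torch.no_grad():
    w_next = w - eta * w.grad    # 1 - 0.1 * (-4) = 1.4

print(w.grad.item(), w_next.item())
```

The step moves $w$ from $1$ towards the minimizer at $3$, because the update subtracts the learning rate times the gradient.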

Then we apply the linear model to the input data $x$ (forward pass):

In [39]:
pred = linear(x)
print(pred)

tensor([[-0.5876, -0.0410],
        [ 0.5464,  0.2327],
        [ 0.4149, -1.0494],
        [ 0.2613, -0.6976],
        [-1.4030, -0.2587],
        [-0.7891, -0.6167],
        [ 0.0855, -0.6050],
        [-1.0457, -1.1088],
        [-0.6645,  0.7902],
        [-0.2406, -0.4854]], grad_fn=<AddmmBackward>)


The loss is computed as

In [40]:
loss = criterion(pred, y)
print('loss: ', loss.item())

loss:  0.7318679094314575
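
As a sanity check on what `nn.MSELoss` computes with its default `reduction='mean'`, the sketch below compares it with the mean of the squared element-wise differences. It uses fresh random tensors (the seed and the names `manual` and `by_count` are illustrative), so the value differs from the loss printed above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # illustrative seed
pred = torch.randn(10, 2)
y = torch.randn(10, 2)

criterion = nn.MSELoss()  # default reduction='mean'

# Mean over all 10 * 2 = 20 elements of the squared differences.
manual = ((pred - y) ** 2).mean()
by_count = ((pred - y) ** 2).sum() / pred.numel()

print(criterion(pred, y).item(), manual.item(), by_count.item())
```

Note that the average runs over every element, not only over the 10 samples.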


Backpropagating the loss computes its gradient with respect to each parameter (backward pass):

In [41]:
loss.backward()
print('dL/dw: ', linear.weight.grad)
print('dL/db: ', linear.bias.grad)

dL/dw:  tensor([[-0.0379, -0.1217,  0.0014],
        [ 0.3901,  0.1111,  0.3637]])
dL/db:  tensor([ 0.0191, -0.0767])
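
For a linear layer with a mean-reduced squared error, these gradients also have a closed form, which gives a way to check what `backward()` stores. The sketch below is an illustration on its own random data (seed and the names `residual`, `grad_w`, `grad_b` are assumptions): with $N$ the number of elements in the prediction, $\partial L/\partial \hat y = 2(\hat y - y)/N$, $\partial L/\partial w = (\partial L/\partial \hat y)^T x$, and $\partial L/\partial b$ is the column sum of $\partial L/\partial \hat y$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # illustrative seed
x = torch.randn(10, 3)
y = torch.randn(10, 2)
linear = nn.Linear(3, 2)

pred = linear(x)
loss = nn.MSELoss()(pred, y)
loss.backward()  # autograd fills linear.weight.grad and linear.bias.grad

with torch.no_grad():
    residual = 2.0 * (pred - y) / pred.numel()  # dL/dpred, shape (10, 2)
    grad_w = residual.T @ x                     # dL/dw, shape (2, 3)
    grad_b = residual.sum(dim=0)                # dL/db, shape (2,)

print(torch.allclose(linear.weight.grad, grad_w, atol=1e-6))
print(torch.allclose(linear.bias.grad, grad_b, atol=1e-6))
```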


With these gradients ready, we can now update the parameters:

In [42]:
optimizer.step()
print(linear.weight)
print(linear.bias)

Parameter containing:
tensor([[-0.3409,  0.3227, -0.3616],
        [ 0.1928,  0.3344,  0.4493]], requires_grad=True)
Parameter containing:
tensor([-0.4703, -0.2393], requires_grad=True)


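The same update can also be applied manually. Because the previous cell already called `optimizer.step()`, running this cell takes a second step using the same gradients: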

In [43]:
with torch.no_grad():  # update in place without recording the operation in autograd
    linear.weight.sub_(0.01 * linear.weight.grad)
    linear.bias.sub_(0.01 * linear.bias.grad)
print(linear.weight)
print(linear.bias)

Parameter containing:
tensor([[-0.3405,  0.3239, -0.3616],
        [ 0.1889,  0.3333,  0.4457]], requires_grad=True)
Parameter containing:
tensor([-0.4705, -0.2385], requires_grad=True)
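
To see that plain SGD (no momentum or weight decay) and the manual update agree, one can apply each to its own copy of the same layer. This is a minimal sketch under those assumptions. The names `model_a`, `model_b`, and `same`, and the seed, are illustrative.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)  # illustrative seed
x = torch.randn(10, 3)
y = torch.randn(10, 2)

model_a = nn.Linear(3, 2)
model_b = copy.deepcopy(model_a)  # identical starting parameters
lr = 0.01

# Compute gradients for both copies from the same data.
for model in (model_a, model_b):
    nn.MSELoss()(model(x), y).backward()

# model_a: let the optimizer apply the step.
torch.optim.SGD(model_a.parameters(), lr=lr).step()

# model_b: apply p <- p - lr * grad by hand.
with torch.no_grad():
    for p in model_b.parameters():
        p.sub_(lr * p.grad)

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)
```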


The loss after the two updates above (`optimizer.step()` and the manual update) is:

In [44]:
pred = linear(x)
loss = criterion(pred, y)
print('loss is: ', loss.item())

loss is:  0.7255462408065796
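
The forward pass, loss, backward pass, and update can be repeated in a loop to keep reducing the loss. The sketch below is one way to write it, on its own random data. The seed, the 100 steps, and the `losses` list are illustrative choices. It calls `optimizer.zero_grad()` at the start of each iteration because `backward()` accumulates into `.grad` rather than overwriting it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # illustrative seed
x = torch.randn(10, 3)
y = torch.randn(10, 2)

linear = nn.Linear(3, 2)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(linear.parameters(), lr=0.01)

losses = []
for step in range(100):        # number of steps chosen for illustration
    optimizer.zero_grad()      # clear gradients left by earlier backward() calls
    loss = criterion(linear(x), y)  # forward pass and loss
    loss.backward()            # backward pass
    optimizer.step()           # parameter update
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Since the targets here are random noise, the loss decreases but does not approach zero.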
