## Linear Regression with PyTorch

Phai Phongthiengtham

### 0. Math behind linear regression

Let the true data-generating process be $y = wx$, where the true value of $w$ is $w_{true} = 2$.

For observation $i$, let $\hat{y_i} = wx_i$ denote the predicted value of $y_i$, and let $N$ denote the number of observations. The mean squared error loss function is then:

$
\begin{equation}
L = \frac{1}{N}\sum_{i=1}^N(\hat{y_i} - y_i)^2 = \frac{1}{N}\sum_{i=1}^N(wx_i - y_i)^2 \nonumber
\end{equation}
$

$
\begin{equation}
\frac{\partial L}{\partial w} = \frac{1}{N}\sum_{i=1}^N\Bigl(2x_i(\hat{y_i} - y_i)\Bigr) \nonumber
\end{equation}
$

In vector form, with $\odot$ denoting the element-wise product:

$
\begin{equation}
\frac{\partial L}{\partial w} = \mathrm{mean}\Bigl(2\vec{x} \odot (\vec{\hat{y}} - \vec{y}) \Bigr) = \frac{2}{N}\,\vec{x} \cdot (\vec{\hat{y}} - \vec{y})
\end{equation}
$
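
As a quick sanity check on equation (1), the analytic gradient can be compared with a central finite difference of the loss. The sketch below is only illustrative: the toy data, the evaluation point `w0 = 1.0`, and the helper names `mse_loss` and `analytic_grad` are assumptions made for this example.

```python
import numpy as np

# Toy data following y = 2x (assumed here for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

def mse_loss(w):
    """Mean squared error of the prediction w * x."""
    return np.mean((w * x - y) ** 2)

def analytic_grad(w):
    """Equation (1): mean of 2 * x_i * (y_hat_i - y_i)."""
    return np.mean(2.0 * x * (w * x - y))

w0 = 1.0    # hypothetical evaluation point
eps = 1e-6  # step size for the central finite difference
numeric_grad = (mse_loss(w0 + eps) - mse_loss(w0 - eps)) / (2 * eps)

print(analytic_grad(w0))  # -15.0
print(numeric_grad)       # approximately -15.0
```

Because the loss is quadratic in $w$, the central difference agrees with the analytic value up to floating-point rounding.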

### 1. Using numpy

We start by implementing the whole training process in numpy.

Given the current value of the parameter(s), each training iteration has the following steps:
1. Make prediction: also known as forward propagation
2. Compute loss function
3. Compute gradient: also known as backward propagation
4. Adjust the value of parameter(s)

$
\begin{equation}
w_{updated} = w_{old} - \alpha\frac{\partial L}{\partial w_{old}} \nonumber
\end{equation}
$

where $\alpha$ is the learning rate and the gradient $\frac{\partial L}{\partial w}$ is given by equation (1).

In [1]:
import numpy as np

X = np.array([1, 2, 3, 4], dtype=np.float32)
Y = np.array([2, 4, 6, 8], dtype=np.float32) # y = w*x where true w = 2 

def forward(x): # 1. make prediction
    return w * x

def loss(y, y_predicted): # 2. calculate loss
    return ((y_predicted-y)**2).mean()

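def gradient(x, y, y_predicted): # 3. calculate gradient
    # NOTE: np.dot already sums over the observations, so the trailing .mean() is a no-op.
    # This returns N times the mean gradient of equation (1), so the effective step size is N * learning_rate.
    return np.dot(2*x, y_predicted-y).mean()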

w = 1.0 # initialize w
learning_rate = 0.02 # how fast the parameters update (hyperparameter)
n_iters = 20 # number of iterations

for epoch in range(n_iters):
    # forward pass
    y_pred = forward(X)

    # loss
    l = loss(Y, y_pred)

    # gradient
    dw = gradient(X, Y, y_pred)

    # update weights
    w -= learning_rate * dw # new w = old w - (learning_rate * dw)

    print(f'epoch {epoch+1}: w = {w:.3f}, loss = {l:.8f}')

epoch 1: w = 2.200, loss = 7.50000000
epoch 2: w = 1.960, loss = 0.30000019
epoch 3: w = 2.008, loss = 0.01200006
epoch 4: w = 1.998, loss = 0.00048001
epoch 5: w = 2.000, loss = 0.00001920
epoch 6: w = 2.000, loss = 0.00000077
epoch 7: w = 2.000, loss = 0.00000003
epoch 8: w = 2.000, loss = 0.00000000
epoch 9: w = 2.000, loss = 0.00000000
epoch 10: w = 2.000, loss = 0.00000000
epoch 11: w = 2.000, loss = 0.00000000
epoch 12: w = 2.000, loss = 0.00000000
epoch 13: w = 2.000, loss = 0.00000000
epoch 14: w = 2.000, loss = 0.00000000
epoch 15: w = 2.000, loss = 0.00000000
epoch 16: w = 2.000, loss = 0.00000000
epoch 17: w = 2.000, loss = 0.00000000
epoch 18: w = 2.000, loss = 0.00000000
epoch 19: w = 2.000, loss = 0.00000000
epoch 20: w = 2.000, loss = 0.00000000
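
The way $w$ jumps back and forth around 2 in the output above (2.200, 1.960, 2.008, ...) can be explained in closed form. Since `np.dot` sums over the four observations, `gradient` returns $2(w-2)\sum x_i^2 = 60(w-2)$, so each update multiplies the error $w - 2$ by $1 - 0.02 \cdot 60 = -0.2$. The sketch below replays that recurrence in plain Python; the function name `replay` is hypothetical and only serves this illustration.

```python
def replay(w, factor, n_steps):
    """Apply w <- 2 + factor * (w - 2) repeatedly and collect each w."""
    history = []
    for _ in range(n_steps):
        w = 2.0 + factor * (w - 2.0)
        history.append(round(w, 3))
    return history

sum_x_sq = 1**2 + 2**2 + 3**2 + 4**2   # 30
factor = 1 - 0.02 * 2 * sum_x_sq       # about -0.2: the sign flips, so w overshoots and oscillates
print(replay(1.0, factor, 4))          # [2.2, 1.96, 2.008, 1.998]
```

With the mean gradient of equation (1), the factor would instead be $1 - 0.02 \cdot 15 = 0.7$, which gives slower, monotone convergence.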


### 2. Using torch to compute gradient

Next, we use torch to compute the gradient automatically. This is needed because deriving the gradient formula by hand is impractical for most deep learning models.

Reminders:
- We need to tell torch to track operations on a tensor by setting ```requires_grad=True``` when initializing it.
- For a tensor with ```requires_grad=True```, wrap the code in ```with torch.no_grad():``` whenever an operation (such as the weight update) should **NOT** be tracked for gradient computation.
- After each update, reset the gradient by calling ```.grad.zero_()```, because ```backward()``` adds to the existing gradient instead of overwriting it.
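
The three reminders above can be seen in a small experiment. This is a minimal sketch, assuming the same toy data as before; the helper name `mse` and the variable names are hypothetical.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

def mse(w):
    """Mean squared error of the prediction w * x."""
    return ((w * x - y) ** 2).mean()

mse(w).backward()
first = w.grad.item()        # -15.0

mse(w).backward()            # no reset in between: the new gradient is added to the old one
accumulated = w.grad.item()  # -30.0

w.grad.zero_()               # reset in place
mse(w).backward()
after_reset = w.grad.item()  # -15.0

with torch.no_grad():
    w_scaled = w * 2         # computed without being tracked by autograd

print(first, accumulated, after_reset, w_scaled.requires_grad)
# -15.0 -30.0 -15.0 False
```

Skipping the reset would make every later update use the sum of all previous gradients, which is why the training loop clears it at the end of each epoch.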

In [2]:
import torch

X = torch.tensor([1, 2, 3, 4], dtype=torch.float32) # [UPDATED] instead of numpy array, now we use torch tensor
Y = torch.tensor([2, 4, 6, 8], dtype=torch.float32) # [UPDATED] instead of numpy array, now we use torch tensor

def forward(x): # 1. make prediction
    return w * x

def loss(y, y_predicted): # 2. calculate loss
    return ((y_predicted-y)**2).mean()

# [UPDATED] no longer need this, we will use built-in torch function to compute gradient  
#def gradient(x, y, y_predicted): # 3. calculate gradient
    #return np.dot(2*x, y_predicted-y).mean()

# [UPDATED] instead of numpy array, now we use torch tensor - REMEMBER "requires_grad=True" is required!
w = torch.tensor(1.0, dtype=torch.float32, requires_grad=True) # initialize w

learning_rate = 0.05 # how fast the parameters update (hyperparameter)
n_iters = 20 # number of iterations

for epoch in range(n_iters):
    # forward pass
    y_pred = forward(X)

    # loss
    l = loss(Y, y_pred)

    # gradient
    #dw = gradient(X, Y, y_pred)
    l.backward() # [UPDATED] use torch to compute gradient with respect to w 
    
    # update weights
    with torch.no_grad(): 
        # [UPDATED] REMEMBER "torch.no_grad()" is required! The update itself must not be tracked by autograd 
        w -= learning_rate * w.grad 

    # empty gradient vector
    w.grad.zero_() # [UPDATED] REMEMBER ".grad.zero_()" is required! Otherwise the next backward() adds to this epoch's gradient.

    print(f'epoch {epoch+1}: w = {w:.3f}, loss = {l:.8f}')

epoch 1: w = 1.750, loss = 7.50000000
epoch 2: w = 1.938, loss = 0.46875000
epoch 3: w = 1.984, loss = 0.02929688
epoch 4: w = 1.996, loss = 0.00183105
epoch 5: w = 1.999, loss = 0.00011444
epoch 6: w = 2.000, loss = 0.00000715
epoch 7: w = 2.000, loss = 0.00000045
epoch 8: w = 2.000, loss = 0.00000003
epoch 9: w = 2.000, loss = 0.00000000
epoch 10: w = 2.000, loss = 0.00000000
epoch 11: w = 2.000, loss = 0.00000000
epoch 12: w = 2.000, loss = 0.00000000
epoch 13: w = 2.000, loss = 0.00000000
epoch 14: w = 2.000, loss = 0.00000000
epoch 15: w = 2.000, loss = 0.00000000
epoch 16: w = 2.000, loss = 0.00000000
epoch 17: w = 2.000, loss = 0.00000000
epoch 18: w = 2.000, loss = 0.00000000
epoch 19: w = 2.000, loss = 0.00000000
epoch 20: w = 2.000, loss = 0.00000000
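
The numbers printed above can be reproduced by hand, which is a useful check on the update rule. Substituting $y_i = 2x_i$ into the gradient, and using $\frac{1}{N}\sum_{i=1}^N x_i^2 = 7.5$ for $x = (1, 2, 3, 4)$:

$
\begin{equation}
\frac{\partial L}{\partial w} = \frac{2}{N}\sum_{i=1}^N x_i^2 (w - 2) = 15(w - 2) \nonumber
\end{equation}
$

With $\alpha = 0.05$, each update therefore shrinks the error $w - 2$ by a factor of $1 - 15\alpha = 0.25$:

$
\begin{equation}
w_{updated} - 2 = (1 - 15\alpha)(w_{old} - 2) = 0.25\,(w_{old} - 2) \nonumber
\end{equation}
$

Starting from $w = 1$, this gives $w = 1.75,\ 1.9375,\ 1.984375, \dots$ Since $L = 7.5(w - 2)^2$ for this data, the loss falls by a factor of $0.25^2 = 0.0625$ per epoch ($7.5 \to 0.46875 \to 0.0293$), in line with the printed values.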


### 3. Using torch to compute gradient, loss and update weights

In [3]:
import torch

X = torch.tensor([1, 2, 3, 4], dtype=torch.float32)
Y = torch.tensor([2, 4, 6, 8], dtype=torch.float32)

def forward(x):
    return w * x

# [UPDATED] no longer need this, we will use built-in torch function to compute loss 
#def loss(y, y_predicted):
    #return ((y_predicted-y)**2).mean()
loss = torch.nn.MSELoss()

w = torch.tensor(1.0, dtype=torch.float32, requires_grad=True) # initialize w

# [UPDATED] we will use torch optimizer module for updating weights. lr is learning rate 
optimizer = torch.optim.SGD([w], lr=0.05)

n_iters = 20 # number of iterations

for epoch in range(n_iters):
    # forward pass
    y_pred = forward(X)

    # loss (MSELoss takes the prediction first, then the target)
    l = loss(y_pred, Y)

    # gradient
    l.backward()

    # [UPDATED] no longer need this, we will use built-in torch function to update weights
    # update weights
    #with torch.no_grad(): 
        #w -= learning_rate * w.grad
    optimizer.step() 

    # empty gradient vector
    optimizer.zero_grad() # [UPDATED] still need this, but we do this on the optimizer instead of w.

    print(f'epoch {epoch+1}: w = {w:.3f}, loss = {l:.8f}')

epoch 1: w = 1.750, loss = 7.50000000
epoch 2: w = 1.938, loss = 0.46875000
epoch 3: w = 1.984, loss = 0.02929688
epoch 4: w = 1.996, loss = 0.00183105
epoch 5: w = 1.999, loss = 0.00011444
epoch 6: w = 2.000, loss = 0.00000715
epoch 7: w = 2.000, loss = 0.00000045
epoch 8: w = 2.000, loss = 0.00000003
epoch 9: w = 2.000, loss = 0.00000000
epoch 10: w = 2.000, loss = 0.00000000
epoch 11: w = 2.000, loss = 0.00000000
epoch 12: w = 2.000, loss = 0.00000000
epoch 13: w = 2.000, loss = 0.00000000
epoch 14: w = 2.000, loss = 0.00000000
epoch 15: w = 2.000, loss = 0.00000000
epoch 16: w = 2.000, loss = 0.00000000
epoch 17: w = 2.000, loss = 0.00000000
epoch 18: w = 2.000, loss = 0.00000000
epoch 19: w = 2.000, loss = 0.00000000
epoch 20: w = 2.000, loss = 0.00000000
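
The output is identical to the previous section because the built-in pieces compute the same quantities as the hand-written ones. The sketch below checks this for a single step, under the assumption of the same toy data and starting value; the names `builtin`, `manual` and `expected_w` are hypothetical.

```python
import torch

X = torch.tensor([1.0, 2.0, 3.0, 4.0])
Y = torch.tensor([2.0, 4.0, 6.0, 8.0])
w = torch.tensor(1.0, requires_grad=True)

y_pred = w * X
builtin = torch.nn.MSELoss()(y_pred, Y)  # call signature is (input, target)
manual = ((y_pred - Y) ** 2).mean()      # hand-written mean squared error
print(builtin.item(), manual.item())     # 7.5 7.5

optimizer = torch.optim.SGD([w], lr=0.05)
builtin.backward()
expected_w = w.item() - 0.05 * w.grad.item()  # manual update, computed before the step
optimizer.step()
print(w.item(), expected_w)              # both approximately 1.75
```

Plain `torch.optim.SGD` without momentum or weight decay applies exactly the rule $w_{updated} = w_{old} - \alpha \frac{\partial L}{\partial w_{old}}$ used earlier.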


### 4. Using torch to do everything (compute gradient and loss, update weights, and make prediction)

In [4]:
import torch

X = torch.tensor([[1], [2], [3], [4]], dtype=torch.float32) # [UPDATED] torch.nn.Linear expects shape (n_samples, n_features)
Y = torch.tensor([[2], [4], [6], [8]], dtype=torch.float32) # [UPDATED] one row per sample, one column per output

# [UPDATED] no longer need this, we will use built-in torch function.
#def forward(x):
    #return w * x

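# [UPDATED] We will use built-in Linear model from torch
n_samples, n_features = X.shape # 4 samples, 1 feature

input_size = n_features
output_size = 1 # one predicted value per sample

# Linear fits y = w*x + b: it adds a bias term b and initializes w and b randomly
model = torch.nn.Linear(input_size, output_size)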
loss = torch.nn.MSELoss()

# [UPDATED] no longer need this, this is already in torch Linear model
#w = torch.tensor(1.0, dtype=torch.float32, requires_grad=True) # initialize w

optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

n_iters = 20 # number of iterations

for epoch in range(n_iters):
    # forward pass
    y_pred = model(X) # [UPDATED] Now we call the model directly!

    # loss (MSELoss takes the prediction first, then the target)
    l = loss(y_pred, Y)

    # gradient
    l.backward()

    # update weights
    optimizer.step() 

    # empty gradient vector
    optimizer.zero_grad()

    [w, b] = model.parameters() # [UPDATED] Now we have to unpack parameters
    print(f'epoch {epoch+1}: w = {w[0][0].item():.3f}, loss = {l:.8f}')

epoch 1: w = 1.825, loss = 18.50027084
epoch 2: w = 2.018, loss = 0.50810522
epoch 3: w = 2.049, loss = 0.01831267
epoch 4: w = 2.054, loss = 0.00484970
epoch 5: w = 2.054, loss = 0.00435401
epoch 6: w = 2.053, loss = 0.00421506
epoch 7: w = 2.052, loss = 0.00408956
epoch 8: w = 2.052, loss = 0.00396803
epoch 9: w = 2.051, loss = 0.00385013
epoch 10: w = 2.050, loss = 0.00373571
epoch 11: w = 2.049, loss = 0.00362471
epoch 12: w = 2.049, loss = 0.00351699
epoch 13: w = 2.048, loss = 0.00341248
epoch 14: w = 2.047, loss = 0.00331110
epoch 15: w = 2.046, loss = 0.00321271
epoch 16: w = 2.046, loss = 0.00311724
epoch 17: w = 2.045, loss = 0.00302460
epoch 18: w = 2.044, loss = 0.00293474
epoch 19: w = 2.044, loss = 0.00284752
epoch 20: w = 2.043, loss = 0.00276291
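
Unlike the earlier sections, $w$ has not reached 2 after 20 epochs here. `torch.nn.Linear` starts from randomly initialized parameters and also fits a bias term, so the run above is not reproducible without a seed and it optimizes two parameters instead of one. As a rough check, the sketch below removes the bias and fixes the starting weight at 1.0, which should bring back the earlier trajectory. The `bias=False` setting, the manual initialization, and the names `loss_fn` and `history` are choices made for this illustration only.

```python
import torch

X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])  # shape (n_samples, n_features)
Y = torch.tensor([[2.0], [4.0], [6.0], [8.0]])

model = torch.nn.Linear(1, 1, bias=False)        # no intercept (assumption for this sketch)
with torch.no_grad():
    model.weight.fill_(1.0)                      # start from w = 1 instead of a random value

loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

history = []
for epoch in range(3):
    loss = loss_fn(model(X), Y)   # forward pass and loss
    loss.backward()               # gradient
    optimizer.step()              # update weights
    optimizer.zero_grad()         # empty gradient
    history.append(model.weight.item())

print(history)  # approximately [1.75, 1.9375, 1.984375]
```

Calling `torch.manual_seed(...)` before building the model is another way to make the original two-parameter run repeatable.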
