# Delta Linear Regression

* See this [Notion Page](https://www.notion.so/delta-linear-regression-1ae807742b3549f4af1d16e1f47c4203) for details
* Related to: [Sketch-RNN](https://github.com/jiahaoliu1891/Sketch-Composer)

## A Simple Analysis of Linear Regression

Linear Regression on data $X, Y$:

$$
\hat{y} = w x + b
$$

The loss $L$ for a single data point depends on $x$, $y$, and the parameters $w, b$ through $\hat{y}$:

$$
L = \frac12(y-\hat{y})^2
$$

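The gradient $\frac{\partial L}{\partial w}$ is proportional to the value of $x$:

$$
\frac{\partial L}{\partial w} = -(y-\hat{y})\frac{\partial \hat{y}}{\partial w} = (\hat{y}-y)x
$$

When we train with SGD, the update to $w$ is determined jointly by the learning rate $\alpha$ and the gradient:

$$
w := w - \alpha  \frac{\partial L}{\partial w} = w - \alpha(\hat{y}-y)x
$$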

Suppose the true underlying law of the world is:

$$
y = 2x + 1
$$

We collect data points according to this law: $(1, 3), (2, 5), (3, 7), \dots,  (100, 201)$

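Suppose that currently $w = 1.9, b = 1.0$, and consider two data points, $(1, 3)$ and $(100, 201)$.

For the first data point $(1, 3)$, the prediction is $\hat{y} = 2.9$, so the prediction error is $(\hat{y}-y) = -0.1$, and $x = 1$. The update to $w$ is therefore: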

$$
w := w - \alpha\frac{\partial L}{\partial w} = w + \alpha \times 0.1
$$

Since the ground truth is $w_{gt} = 2$, a **reasonable** learning rate here is $\alpha=1.0$, which moves $w$ from $1.9$ to $2$ in a single step.

For the second data point $(100, 201)$, however, the prediction is $\hat{y} = 191$, so the prediction error is $(\hat{y}-y) = -10$, and $x = 100$. The update to $w$ is therefore:

$$
w := w - \alpha\frac{\partial L}{\partial w} = w + \alpha \times 1000
$$

Since the ground truth is still $w_{gt} = 2$, a reasonable learning rate here is $\alpha= 10^{-4}$.
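The two cases can be checked numerically. The following is a minimal sketch in plain Python; the helper names `grad_w` and `one_step_lr` are illustrative assumptions, not part of the original notebook. It evaluates the gradient of $L$ with respect to $w$ at each point and the learning rate that would land $w$ exactly on $w_{gt}$ in one step.

```python
def grad_w(w, b, x, y):
    """Gradient of L = 0.5 * (y - y_hat)**2 with respect to w, where y_hat = w*x + b."""
    y_hat = w * x + b
    return (y_hat - y) * x


def one_step_lr(w, w_gt, grad):
    """Learning rate alpha for which w - alpha * grad equals w_gt (hypothetical helper)."""
    return (w - w_gt) / grad


w, b, w_gt = 1.9, 1.0, 2.0
for x, y in [(1.0, 3.0), (100.0, 201.0)]:
    g = grad_w(w, b, x, y)
    alpha = one_step_lr(w, w_gt, g)
    print(f"x={x:g}: grad={g:.4g}, one-step alpha={alpha:.4g}")
```

Only $w$ is considered here, matching the analysis above; a real SGD step would update $b$ as well.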

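From this example we can see that when $x$ grows by a factor of $10^2$, the gradient $\frac{\partial L}{\partial w}$ grows by a factor of $10^4$ (here the error $\hat{y}-y$ itself also scales with $x$), so a reasonable learning rate $\alpha$ has to shrink by a factor of $10^4$. (This is why learning rates need frequent tuning in ML, and why we normalize data during preprocessing.)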
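To illustrate the remark about normalization, here is a rough sketch in plain Python. It is an assumption-laden toy, not the notebook's code: it uses full-batch gradient descent on mean squared error (the hypothetical helper `fit`), inputs shifted to around 1000, and standardizes only $x$. With the same learning rate, the raw inputs diverge while the standardized inputs converge, and the parameters can be mapped back to the original scale.

```python
import statistics


def fit(xs, ys, lr, epochs):
    """Full-batch gradient descent on mean squared error for y_hat = w*x + b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errs, xs)) / n
        grad_b = sum(errs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Data from y = 2x + 1 with inputs around 1000 (illustrative choice)
xs = [1000.0 + i for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]

# Raw inputs: the same learning rate makes the parameters blow up within a few steps
w_raw, b_raw = fit(xs, ys, lr=0.1, epochs=5)

# Standardized inputs: zero mean and unit variance keep the gradient well scaled
mu = statistics.mean(xs)
sigma = statistics.pstdev(xs)
zs = [(x - mu) / sigma for x in xs]
w_z, b_z = fit(zs, ys, lr=0.1, epochs=200)

# Map the parameters learned on z = (x - mu) / sigma back to the original x scale
w_back = w_z / sigma
b_back = b_z - w_z * mu / sigma

print(f"raw inputs:          w={w_raw:.3g}, b={b_raw:.3g}")
print(f"standardized inputs: w={w_back:.4f}, b={b_back:.4f}")
```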

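What exactly is $\frac{\partial L}{\partial w}$, and why does it change so drastically with $x$? To answer this we need to think in parameter space, that is, about how different values of $w$ affect the prediction. If we perturb $w$ by $dw$ at a data point $x$, the prediction changes by:

$$
dy=(w+dw)x - wx=x \times dw 
$$

So the same change $dw$ produces a different $dy$ at different data points $x$.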
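As a quick illustration (a sketch with a hypothetical helper `prediction_shift`, not code from the notebook), the same perturbation $dw = 0.1$ moves the prediction by $0.1$ at $x = 1$ but by $10$ at $x = 100$:

```python
def prediction_shift(w, b, x, dw):
    """Change in the prediction y_hat = w*x + b when w is perturbed by dw."""
    return ((w + dw) * x + b) - (w * x + b)


w, b, dw = 1.9, 1.0, 0.1
for x in (1.0, 100.0):
    dy = prediction_shift(w, b, x, dw)
    print(f"x={x:g}: dy={dy:.4g}")
```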


In [3]:
import numpy as np
import matplotlib.pyplot as plt
import torch
from torch.autograd import Variable


class linearRegression(torch.nn.Module):
    def __init__(self, inputSize, outputSize):
        super(linearRegression, self).__init__()
        self.linear = torch.nn.Linear(inputSize, outputSize)

    def forward(self, x):
        out = self.linear(x)
        return out

    
def trainLR(model, X, y):
    lr = 0.01
    epochs = 10

    loss_func = torch.nn.MSELoss() 
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    REGULARIZE = False
    l1_lambda = 0.1

    batch_size = 10
    N = X.shape[0]

    for epoch in range(epochs):
        # Walk through the data in consecutive mini-batches X[s:e]
        s, e = 0, batch_size
        while s < e and e <= N:
            # torch.from_numpy already returns tensors; the deprecated Variable wrapper is not needed
            inputs = torch.from_numpy(X[s:e])
            labels = torch.from_numpy(y[s:e])

            # Clear the gradient buffers so gradients from the previous step do not accumulate
            optimizer.zero_grad()

            # get output from the model, given the inputs
            outputs = model(inputs)

            # get loss for the predicted output
            loss = loss_func(outputs, labels)

            if REGULARIZE:
                for W in model.parameters():
                    loss +=  l1_lambda * W.norm(1).sum()

            # get gradients w.r.t to parameters
            loss.backward()
            # update parameters
            optimizer.step()

            s += batch_size
            e = min(e + batch_size, N)

        weight = model.linear.weight.detach().numpy()[0][0]
        bias = model.linear.bias.detach().numpy()[0]
        grad = model.linear.weight.grad.detach().numpy()[0][0]
        print(f'epoch {epoch}, loss {loss.item():.2f}, weight:{weight:.2f}, bias:{bias:.2f}, grad:{grad:.2f}')
        

In [4]:
x_values = [i for i in range(10)]
X = np.array(x_values, dtype=np.float32)
# np.random.shuffle(X)
X = X.reshape(-1, 1)
y = X * 2 + 1
n_feat = X.shape[-1]        # number of input features (1)
n_out = y.shape[-1]         # number of output targets (1)


# create model
model = linearRegression(n_feat, n_out)

trainLR(model, X, y)

print('=== Look at the parameters: the model has already captured the law y = 2 * X + b ===')


epoch 0, loss 64.36, weight:1.63, bias:-0.60, grad:-85.07
epoch 1, loss 11.66, weight:1.99, bias:-0.53, grad:-35.28
epoch 2, loss 2.54, weight:2.13, bias:-0.50, grad:-14.59
epoch 3, loss 0.96, weight:2.19, bias:-0.48, grad:-5.99
epoch 4, loss 0.69, weight:2.22, bias:-0.47, grad:-2.41
epoch 5, loss 0.63, weight:2.22, bias:-0.46, grad:-0.93
epoch 6, loss 0.62, weight:2.23, bias:-0.45, grad:-0.31
epoch 7, loss 0.61, weight:2.23, bias:-0.44, grad:-0.05
epoch 8, loss 0.60, weight:2.23, bias:-0.43, grad:0.05
epoch 9, loss 0.59, weight:2.23, bias:-0.42, grad:0.10
=== Look at the parameters: the model has already captured the law y = 2 * X + b ===


In [5]:
x_values = [i for i in range(10)]
# NOTE: we shift x_values by 1000; nothing else changes
print('=== Now we shift the data distribution without changing the data-generating function y = 2 * X + b ===')

X = np.array(x_values, dtype=np.float32) + 1000
# np.random.shuffle(X)
X = X.reshape(-1, 1)
y = X * 2 + 1
trainLR(model, X, y)

=== Now we shift the data distribution without changing the data-generating function y = 2 * X + b ===
epoch 0, loss 51306.35, weight:-4548.36, bias:-4.95, grad:455058.53
epoch 1, loss 20892759359488.00, weight:91824400.00, bias:91411.87, grad:-9182895104.00
epoch 2, loss 8507862646056288256000.00, weight:-1852978626560.00, bias:-1844662528.00, grad:185307051851776.00
epoch 3, loss 3464536289305600348642728738816.00, weight:37392350351196160.00, bias:37224531886080.00, grad:-3739420354368503808.00
epoch 4, loss inf, weight:-754562284598486630400.00, bias:-751175761297145856.00, grad:75459968980814170947584.00
epoch 5, loss inf, weight:15226758826865125551505408.00, bias:15158418287786847109120.00, grad:-1522751274529541806087995392.00
epoch 6, loss inf, weight:-307269728814764986676455407616.00, bias:-305890753972303131140161536.00, grad:30728495825777758196748376866816.00
epoch 7, loss inf, weight:6200577975552102010971159724556288.00, bias:6172749243047174851522478473216.00, grad:-6

## A simple solution

A simple solution is to predict relative values (differences between data points) rather than absolute values.
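The reason this helps can be seen in a small plain-Python sketch (the helpers `deltas` and `slope_through_origin` are illustrative assumptions, not the notebook's training code). For $y = wx + b$, the difference between two samples satisfies $\Delta y = w \, \Delta x$: the bias $b$ and any constant shift in $x$ cancel, so the differences look the same whether the inputs sit near 0 or near 1000.

```python
def deltas(values):
    """Differences between neighbouring elements: values[i] - values[i + 1]."""
    return [a - b for a, b in zip(values, values[1:])]


def slope_through_origin(dxs, dys):
    """Least-squares slope w for dy ~ w * dx, with no intercept term."""
    return sum(dx * dy for dx, dy in zip(dxs, dys)) / sum(dx * dx for dx in dxs)


xs = [3.0, 0.0, 4.0, 1.0, 2.0]  # arbitrary, unsorted sample
for shift in (0.0, 1000.0):
    shifted = [x + shift for x in xs]
    ys = [10.0 * x + 2.0 for x in shifted]  # same law as in the cell below: y = 10x + 2
    dxs, dys = deltas(shifted), deltas(ys)
    print(f"shift={shift:g}: delta_x={dxs}, slope={slope_through_origin(dxs, dys):.4g}")
```

One consequence of this design is that the intercept $b$ cannot be recovered from differences alone, since it cancels in $\Delta y$; only the slope $w$ is identified.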


In [20]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.autograd import Variable


class DelatlinearRegression(nn.Module):
    def __init__(self, inputSize, outputSize):
        super(DelatlinearRegression, self).__init__()
        self.linear = nn.Linear(inputSize, outputSize)

    def forward(self, x):
        out = self.linear(x)
        return out


def train(model, X, y):
    print('----- TRAIN ----')
    lr = 0.00001
    epochs = 5

    # Keep the model on the CPU: the batches built with torch.from_numpy below are CPU
    # tensors, and the .numpy() calls used for logging also require CPU tensors.

    loss_func = torch.nn.MSELoss() 
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    REGULARIZE = False
    l1_lambda = 0.1

    batch_size = 10
    N = X.shape[0]

    for epoch in range(epochs):
        # Walk through consecutive windows and train on pairwise differences
        s, e = 0, batch_size - 1
        while s < e and e <= N:
            # Differences between neighbouring samples: any constant offset in X cancels out
            delta_X = torch.from_numpy(X[s:e] - X[s+1: e+1])
            delta_y = torch.from_numpy(y[s:e] - y[s+1: e+1])

            # Clear the gradient buffers so gradients from the previous step do not accumulate
            optimizer.zero_grad()

            # get output from the model, given the inputs
            outputs = model(delta_X)

            # get loss for the predicted output
            loss = loss_func(outputs, delta_y)

            if REGULARIZE:
                for W in model.parameters():
                    loss +=  l1_lambda * W.norm(1).sum()

            # get gradients w.r.t to parameters
            loss.backward()
            # update parameters
            optimizer.step()

            s += batch_size
            e = min(e + batch_size, N - 1)

        weight = model.linear.weight.detach().numpy()[0][0]
        bias = model.linear.bias.detach().numpy()[0]
        grad = model.linear.weight.grad.detach().numpy()[0][0]
        print(f'epoch {epoch}, loss {loss.item():.2f}, weight:{weight:.2f}, bias:{bias:.2f}, grad:{grad:.2f}')
    

    # with torch.no_grad(): # we don't need gradients in the testing phase
    #     if torch.cuda.is_available():
    #         pred = model(Variable(torch.from_numpy(X).cuda())).cpu().data.numpy()
    #     else:
    #         pred = model(Variable(torch.from_numpy(X))).data.numpy()

    # plt.plot(X, y, 'go', label='True data', alpha=0.5)
    # plt.plot(X, pred, '--', label='Predictions', alpha=0.5)
    # plt.legend(loc='best')
    # plt.show()

def main():
    # create dummy data for training
    x_values = [i for i in range(500)]
    X = np.array(x_values, dtype=np.float32)
    np.random.shuffle(X)
    X = X.reshape(-1, 1)
    y = X * 10 + 2
    n_feat = X.shape[-1]        # number of input features (1)
    n_out = y.shape[-1]         # number of output targets (1)

    # create model
    model = DelatlinearRegression(n_feat, n_out)

    train(model, X, y)
    # NOTE: the model has already captured the law (weight is close to 10)

    
    x_values = [i for i in range(500)]
    # NOTE: We shift x_values by 1000, nothing else changes
    X = np.array(x_values, dtype=np.float32) + 1000
    np.random.shuffle(X)
    X = X.reshape(-1, 1)
    y = X * 10 + 2
    train(model, X, y)
    # NOTE: training does not explode after the shift



main()

----- TRAIN ----
epoch 0, loss 0.28, weight:10.00, bias:0.54, grad:5.53
epoch 1, loss 0.28, weight:10.00, bias:0.54, grad:5.50
epoch 2, loss 0.28, weight:10.00, bias:0.54, grad:5.52
epoch 3, loss 0.28, weight:10.00, bias:0.54, grad:5.49
epoch 4, loss 0.28, weight:10.00, bias:0.54, grad:5.53
----- TRAIN ----
epoch 0, loss 0.29, weight:10.00, bias:0.54, grad:-19.56
epoch 1, loss 0.29, weight:10.00, bias:0.54, grad:-19.50
epoch 2, loss 0.29, weight:10.00, bias:0.54, grad:-19.51
epoch 3, loss 0.29, weight:10.00, bias:0.53, grad:-19.49
epoch 4, loss 0.29, weight:10.00, bias:0.53, grad:-19.44
