# Linear Regression

When the model

$$ \hat{Y_i} = a + bx_i $$

is fitted by least squares, the estimates minimize the squared $L_2$ distance

$$ L(a, b) = \sum_{i=1}^{n}(Y_i - \hat{Y_i})^2 = \sum_{i=1}^{n}(Y_i - a - bx_i)^2 $$

Setting the partial derivatives of $L$ with respect to $a$ and $b$ to zero gives

$$ \frac{\partial{L}}{\partial{a}} = -2 \sum_{i=1}^{n}(Y_i - a - bx_i) = 0 \Leftrightarrow \sum_{i=1}^{n}(Y_i - a - bx_i) = 0 $$

$$ \frac{\partial{L}}{\partial{b}} = -2 \sum_{i=1}^{n}x_i(Y_i - a - bx_i) = 0 \Leftrightarrow \sum_{i=1}^{n}x_i(Y_i - a - bx_i) = 0 $$
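
One way to fill in the elimination step, as a sketch: the first equation expresses $a$ in terms of $b$, and substituting that into the second leaves a single equation in $b$. Here $\bar{x}$ and $\bar{Y}$ denote the sample means of the $x_i$ and $Y_i$.

$$
\begin{array}{rcl}
    \sum_{i=1}^{n}(Y_i - a - bx_i) = 0 & \Rightarrow & a = \bar{Y} - b\bar{x} \\
    \sum_{i=1}^{n}x_i(Y_i - \bar{Y} - b(x_i - \bar{x})) = 0 & \Rightarrow & b\sum_{i=1}^{n}x_i(x_i - \bar{x}) = \sum_{i=1}^{n}x_i(Y_i - \bar{Y})
\end{array}
$$

Because $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$ and $\sum_{i=1}^{n}(Y_i - \bar{Y}) = 0$, the leading factor $x_i$ in each sum can be replaced by $x_i - \bar{x}$ without changing its value, which yields the centered sums of squares and cross products.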

Solving gives

$$
\begin{array}{rcl}
    \hat{a} & = & \bar{Y} - \frac{S_{XY}}{S_{XX}}\cdot\bar{x} \\
    \hat{b} & = & \frac{S_{XY}}{S_{XX}}
\end{array}
$$

where

$$
\begin{array}{rcl}
    S_{XY} & = & \sum_{i=1}^{n}(Y_i - \bar{Y})(x_i - \bar{x}) \\
    S_{XX} & = & \sum_{i=1}^{n}(x_i - \bar{x})^2
\end{array}
$$
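
The closed-form estimates can be sketched directly in plain Python. The function name `fit_line` and the sample data are illustrative assumptions, not part of the original notes; the data are chosen to lie exactly on $y = 1 + 2x$ so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (a_hat, b_hat) for the least-squares line y = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Centered cross products and squares, as in the definitions above
    s_xy = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b_hat = s_xy / s_xx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


# Hypothetical data lying exactly on y = 1 + 2x
a_hat, b_hat = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a_hat, b_hat)  # 1.0 2.0
```

Note that `fit_line` divides by $S_{XX}$, so it fails when all $x_i$ are equal.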


## Correlation and Regression Lines #1

> For a particular scatter plot, the line of regression of $y$ on $x$ is:
$$ 3x + 4y + 8 = 0$$
And the line of regression of $x$ on $y$ is:
$$ 4x + 3y + 7 = 0$$
Find the Pearson Product moment coefficient, , correct to a scale of  decimal places.

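Rewriting the two lines in slope form, $y = -\frac{3}{4}x - 2$ and $x = -\frac{3}{4}y - \frac{7}{4}$, gives the two regression slopes

$$ \frac{S_{XY}}{S_{XX}} = \frac{S_{XY}}{S_{YY}} = -\frac{3}{4} $$

The product of the two slopes is $r_{XY}^2 = \frac{S_{XY}^2}{S_{XX}S_{YY}} = \frac{9}{16}$, and $r_{XY}$ takes the sign of $S_{XY}$, which is negative here. The Pearson correlation coefficient is therefore

$$ r_{XY} = \frac{S_{XY}}{\sqrt{S_{XX}S_{YY}}} = -\frac{3}{4} $$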
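
The same computation can be sketched in Python: the magnitude of the correlation is the geometric mean of the two regression slopes, and its sign is the slopes' common sign. The helper name `r_from_slopes` is an illustrative assumption.

```python
from math import sqrt


def r_from_slopes(b_yx, b_xy):
    """Pearson r from the slope of y on x and the slope of x on y."""
    if b_yx * b_xy < 0:
        raise ValueError("the two regression slopes must share a sign")
    magnitude = sqrt(b_yx * b_xy)
    return -magnitude if b_yx < 0 else magnitude


# 3x + 4y + 8 = 0  ->  y = -(3/4)x - 2
# 4x + 3y + 7 = 0  ->  x = -(3/4)y - 7/4
r = r_from_slopes(-3 / 4, -3 / 4)
print(r)  # -0.75
```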

## Correlation and Regression Lines #2

> There are $2$ series of data involving index numbers: $P$ for price index and $S$ for the commodity stock.

> The mean and standard deviation of $P$ are $100$ and $8$, respectively.

> The mean and standard deviation of $S$ are $103$ and $4$, respectively. 

> The $R^2$ correlation coefficient between the two series is $0.4$.

> With this data, obtain the slope of the regression line of $P$ on $S$, correct to a scale of $2$ decimal places.

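From the definition of the Pearson correlation coefficient,

$$ R^2 = \frac{S_{SP}^2}{S_{SS}S_{PP}} = 0.4 $$

From the least-squares estimate, taking the positive square root (that is, assuming $S_{SP} > 0$, since $R^2$ alone does not determine the sign):

$$
\begin{array}{rcl}
    \hat{b} & = & \frac{S_{SP}}{S_{SS}}     \\
            & = & \sqrt{\frac{S_{SP}^2}{S_{SS}S_{PP}}\frac{S_{PP}}{S_{SS}}} \\
            & = & \sqrt{R^2\frac{n\sigma^2_P}{n\sigma^2_S}} \\
            & = & R\frac{\sigma_P}{\sigma_S} \\
            & = & 2\sqrt{0.4} \approx 1.26
\end{array}
$$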

In [1]:
from math import sqrt
print(2 * sqrt(0.4))

1.2649110640673518
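
As a sanity check on the identity $\hat{b} = R\,\sigma_P / \sigma_S$, the sketch below computes the slope both ways on a small sample. The sample values are hypothetical and are not the data behind the problem; they only illustrate that the two expressions agree when population standard deviations (dividing by $n$) are used.

```python
from math import sqrt

# Hypothetical sample, not taken from the problem
s = [1.0, 2.0, 3.0, 4.0, 5.0]
p = [2.0, 1.0, 4.0, 3.0, 6.0]

n = len(s)
s_bar, p_bar = sum(s) / n, sum(p) / n
s_sp = sum((si - s_bar) * (pi - p_bar) for si, pi in zip(s, p))
s_ss = sum((si - s_bar) ** 2 for si in s)
s_pp = sum((pi - p_bar) ** 2 for pi in p)

# Slope of P on S, directly from the least-squares formula
slope_direct = s_sp / s_ss

# Slope of P on S via the correlation and the standard deviations
r = s_sp / sqrt(s_ss * s_pp)
sigma_s, sigma_p = sqrt(s_ss / n), sqrt(s_pp / n)
slope_from_r = r * sigma_p / sigma_s

print(slope_direct, slope_from_r)
```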


## Correlation and Regression Lines #3

> The two regression lines of a bivariate distribution are: $4x - 5y + 33 = 0$ and $20x - 9y - 107 = 0$.

> Calculate the arithmetic means of $x$ and $y$ respectively. (Both of these are integer values.)

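A least-squares regression line always passes through the point of sample means $(\bar{x}, \bar{y})$, so the means are the intersection of the two lines.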

In [2]:
def rl3():
    from sympy.abc import x, y
    from sympy import solve
    print(solve([4 * x - 5 * y + 33, 20 * x - 9 * y - 107]))

rl3()

{x: 13, y: 17}
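
For comparison, a dependency-free sketch of the same intersection using Cramer's rule with exact rational arithmetic. The helper name `intersect` and the `(a, b, c)` tuple encoding of a line $ax + by + c = 0$ are illustrative assumptions.

```python
from fractions import Fraction


def intersect(l1, l2):
    """Intersection of a1*x + b1*y + c1 = 0 and a2*x + b2*y + c2 = 0."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("lines are parallel or identical")
    x = Fraction(b1 * c2 - b2 * c1, det)
    y = Fraction(a2 * c1 - a1 * c2, det)
    return x, y


# 4x - 5y + 33 = 0 and 20x - 9y - 107 = 0
x_mean, y_mean = intersect((4, -5, 33), (20, -9, -107))
print(x_mean, y_mean)  # 13 17
```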
