# Relationships between elementary modeling building blocks

### 1. Derivation of simple statistics

Sample variance

$$
s_{y}^{2}=\frac{1}{n-1} \sum_{i=1}^{n}\left(y_{i}-\overline{y}\right)^{2}
$$

Sample standard deviation

$$
s_{y}={\sqrt{\frac{1}{n-1} \sum_{i=1}^{n}\left(y_{i}-\overline{y}\right)^{2}}}
$$

Sample covariance

$$
s_{x y}=\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)\left(y_{i}-\overline{y}\right)
$$

Pearson correlation coefficient

$$
r_{x y}=\frac{s_{x y}}{s_{x} s_{y}}
$$

$$
r_{x y} = \frac{\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)\left(y_{i}-\overline{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)^{2} \sum_{i=1}^{n}\left(y_{i}-\overline{y}\right)^{2}}}
$$

Interactive visualization of correlation: https://rpsychologist.com/d3/correlation/

Z-score

$$
z_{i}=\frac{x_{i}-\overline{x}}{s_{x}}
$$

Pearson correlation coefficient as the average cross-product of z-scores (averaged with the n - 1 divisor)

$$
r_{x y}=\frac{1}{n-1} \sum_{i=1}^{n} z_{x, i} \cdot z_{y, i}
$$
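The definitions above can be checked numerically. The following is a minimal sketch on synthetic data (the seed, sample size and data-generating line are arbitrary assumptions, not part of the original notebook); it computes the correlation both from the covariance and from the z-scores and compares the results with NumPy's built-ins.

```python
import numpy as np

# Synthetic data for illustration only (seed and coefficients are arbitrary)
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# Sample variance and standard deviation (n - 1 divisor)
s2_y = np.sum((y - y.mean()) ** 2) / (n - 1)
s_y = np.sqrt(s2_y)
s_x = np.sqrt(np.sum((x - x.mean()) ** 2) / (n - 1))

# Sample covariance
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Pearson correlation from covariance and standard deviations
r_cov = s_xy / (s_x * s_y)

# Pearson correlation as the averaged cross-product of z-scores
z_x = (x - x.mean()) / s_x
z_y = (y - y.mean()) / s_y
r_z = np.sum(z_x * z_y) / (n - 1)

# Reference value from NumPy
r_np = np.corrcoef(x, y)[0, 1]
print(r_cov, r_z, r_np)
```

All three values should agree up to floating-point error.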

### 2. Derivation of normal equations for multiple linear regression

$$
y_{i} = \beta_{1}+\beta_{2} x_{2 i}+\beta_{3} x_{3 i}+\cdots+\beta_{k} x_{k i}+\varepsilon_{i} \quad(i=1, \cdots, n)
$$

Multiple Regression - matrix notation

$$
y=X \beta+\varepsilon
$$

If b is an estimate of β, then


$$
y=X b+e
$$

Residuals

$$
e=y-X b
$$

Sum of squares of the residuals

$$
\begin{aligned} S(b) &=\sum e_{i}^{2}=e^{\prime} e=(y-X b)^{\prime}(y-X b) \\ &=y^{\prime} y-y^{\prime} X b-b^{\prime} X^{\prime} y+b^{\prime} X^{\prime} X b \end{aligned}
$$

The least squares estimator is obtained by minimizing S(b). Differentiating with respect to b gives

$$
\frac{\partial S}{\partial b}=-2 X^{\prime} y+2 X^{\prime} X b
$$

Setting these derivatives equal to zero gives the normal equations

$$
X^{\prime} X b=X^{\prime} y
$$

Solving the normal equations for b (assuming X'X is invertible) gives the least squares estimator

$$
b=\left(X^{\prime} X\right)^{-1} X^{\prime} y
$$
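As a sanity check of the derivation, here is a minimal NumPy sketch on synthetic data (the design matrix, the "true" coefficients and the noise level are hypothetical choices for illustration). It solves the normal equations directly and compares the result with `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic design matrix with an intercept column (all values are illustrative)
rng = np.random.default_rng(1)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])  # hypothetical true coefficients
y = X @ beta + rng.normal(scale=0.1, size=n)

# Solve the normal equations X'X b = X'y (avoids forming the inverse explicitly)
b = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least squares routine
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals and the gradient of S(b); the gradient should vanish at the minimum
e = y - X @ b
grad = -2 * X.T @ y + 2 * X.T @ X @ b
print(b, b_lstsq, grad)
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical design choice; the estimator is the same whenever X'X is invertible.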

### 3. Derivation of relationship between regression and correlation coefficient for univariate case

For simple regression (univariate x), the slope b can be written as:

$$
b=\frac{\sum\left(x_{i}-\overline{x}\right)\left(y_{i}-\overline{y}\right)}{\sum\left(x_{i}-\overline{x}\right)^{2}}
$$

$$
b=\frac{s_{x y}}{s_{x}^{2}}
$$

$$
b=r_{x y} \frac{s_{y}}{s_{x}}
$$

$$
\frac{\sum\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum\left(x_{i}-\bar{x}\right)^{2}}=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}} \cdot \frac{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}}{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}}
$$

If x and y have the same spread (equal standard deviations), i.e.

$$
\sum\left(x_{i}-\overline{x}\right)^{2} = \sum\left(y_{i}-\overline{y}\right)^{2}
$$

then

$$
b=r_{x y}
$$
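The identity between the slope and the correlation coefficient can also be verified without any regression library. This is a minimal sketch on synthetic data (location, scale and slope are arbitrary assumptions): it checks the general relationship on the raw data, then the special case after standardizing both variables so that their standard deviations are equal.

```python
import numpy as np

# Synthetic data with deliberately different scales for x and y
rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=200)
y = 1.5 * x + rng.normal(scale=4.0, size=200)

s_x, s_y = x.std(ddof=1), y.std(ddof=1)
r = np.corrcoef(x, y)[0, 1]

# Slope of the simple regression of y on x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(b, r * s_y / s_x)  # general case: b = r * s_y / s_x

# After standardizing, both standard deviations equal 1, so the slope equals r
x_z = (x - x.mean()) / s_x
y_z = (y - y.mean()) / s_y
b_z = np.sum(x_z * y_z) / np.sum(x_z ** 2)
print(b_z, r)
```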

#### 3.1 Check empirically that regression coef. == correlation coef. if x std. dev. == y std. dev.

In [1]:
import pandas as pd
import pandas.util.testing as tm
tm.N, tm.K = 100, 2
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
np.random.seed(5)


In [2]:
df = tm.makeTimeDataFrame(freq='M')

In [3]:
df.corr()

|   | A | B | C | D |
|---|---|---|---|---|
| A | 1.0 | 0.042426 | 0.04021 | 0.093468 |
| B | 0.042426 | 1.0 | 0.016829 | 0.255671 |
| C | 0.04021 | 0.016829 | 1.0 | -0.363042 |
| D | 0.093468 | 0.255671 | -0.363042 | 1.0 |


In [4]:
y_scaled = StandardScaler().fit_transform(df[['B']])
x_scaled = StandardScaler().fit_transform(df[['A']])

In [5]:
lin_reg = LinearRegression().fit(x_scaled, y_scaled)

In [6]:
lin_reg.coef_

array([[0.04242638]])

### 4. Relationship between regression coefficient and coefficient of determination

For simple regression (univariate x), the slope b can be written as:

$$
b=\frac{\sum\left(x_{i}-\overline{x}\right)\left(y_{i}-\overline{y}\right)}{\sum\left(x_{i}-\overline{x}\right)^{2}}
$$

$$
y_{i}=a+b x_{i}+e_{i}
$$

Because the fitted line passes through the point of means (x̄, ȳ), the difference from the mean (y_i - ȳ) can be decomposed into two components: one corresponding to the difference of the explanatory variable from its mean (x_i - x̄), and an unexplained component given by the residual.

$$
y_{i}-\overline{y}=b\left(x_{i}-\overline{x}\right)+e_{i}
$$

IMPORTANT NOTE: here SST is the total sum of squares, SSE the explained sum of squares, and SSR the sum of squared residuals. Some sources swap the last two, using SSE for the sum of squared errors and SSR for the explained (regression) sum of squares.

$$
\begin{array}{c}{\sum\left(y_{i}-\overline{y}\right)^{2}=b^{2} \sum\left(x_{i}-\overline{x}\right)^{2}+\sum e_{i}^{2}} \\ {S S T=S S E+S S R}\end{array}
$$
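The decomposition has no cross term. A short derivation, using only the definition of b given above, suggests why. Squaring both sides of the decomposition of the difference from the mean and summing over i gives

$$
\sum\left(y_{i}-\overline{y}\right)^{2}=b^{2} \sum\left(x_{i}-\overline{x}\right)^{2}+2 b \sum\left(x_{i}-\overline{x}\right) e_{i}+\sum e_{i}^{2}
$$

Substituting the residual written in deviation form and using the definition of b, the cross term vanishes:

$$
\sum\left(x_{i}-\overline{x}\right) e_{i}=\sum\left(x_{i}-\overline{x}\right)\left(y_{i}-\overline{y}\right)-b \sum\left(x_{i}-\overline{x}\right)^{2}=0
$$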

$$
R^{2}=\frac{S S E}{S S T}=\frac{b^{2} \sum\left(x_{i}-\overline{x}\right)^{2}}{\sum\left(y_{i}-\overline{y}\right)^{2}}
$$

The coefficient of determination is just the square of the correlation between x and y (this holds only for simple regression). Substituting the expression for b into the formula above:

$$
R^{2}=\frac{\left(\sum\left(x_{i}-\overline{x}\right)\left(y_{i}-\overline{y}\right)\right)^{2}}{\sum\left(x_{i}-\overline{x}\right)^{2} \sum\left(y_{i}-\overline{y}\right)^{2}}
$$

$$
r_{x y} = \frac{\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)\left(y_{i}-\overline{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)^{2} \sum_{i=1}^{n}\left(y_{i}-\overline{y}\right)^{2}}}
$$

$$
r_{x y}^{2} = R^{2}, \qquad \left|r_{x y}\right| = \sqrt{R^{2}}
$$

$$
R^{2}=1-\frac{\sum e_{i}^{2}}{\sum\left(y_{i}-\overline{y}\right)^{2}}
$$

$$
R^{2}=1-\frac{S S R}{S S T}
$$
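The sum-of-squares decomposition and both forms of R² can be checked with plain NumPy. The sketch below uses synthetic data (the intercept, slope and seed are arbitrary assumptions) and follows the naming convention of this section: SSE is the explained sum of squares and SSR the sum of squared residuals.

```python
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 2.0 + 0.8 * x + rng.normal(size=100)

# Simple regression coefficients and residuals
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
e = y - (a + b * x)

sst = np.sum((y - y.mean()) ** 2)           # total sum of squares
sse = b ** 2 * np.sum((x - x.mean()) ** 2)  # explained sum of squares
ssr = np.sum(e ** 2)                        # sum of squared residuals

r2_explained = sse / sst
r2_residual = 1 - ssr / sst
r = np.corrcoef(x, y)[0, 1]
print(sst, sse + ssr)
print(r2_explained, r2_residual, r ** 2)
```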

#### 4.1 Check empirically that squared correlation coef. == coef. of determination

In [7]:
import pandas as pd
import pandas.util.testing as tm
import numpy as np
tm.N, tm.K = 100, 2
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import pandas_bokeh
pandas_bokeh.output_notebook()
np.random.seed(5)

In [8]:
df = tm.makeTimeDataFrame(freq='M')

In [9]:
df.corr() ** 2

|   | A | B | C | D |
|---|---|---|---|---|
| A | 1.0 | 0.0018 | 0.001617 | 0.008736 |
| B | 0.0018 | 1.0 | 0.000283 | 0.065368 |
| C | 0.001617 | 0.000283 | 1.0 | 0.131799 |
| D | 0.008736 | 0.065368 | 0.131799 | 1.0 |


In [16]:
x = df[['A']]
y = df['B']

In [17]:
lin_reg = LinearRegression().fit(x, y)

In [18]:
r2_score(y,lin_reg.predict(x))

0.0017999979073189953

In [19]:
r = (pd.DataFrame(data = np.linspace(-1, 1, 10), columns = ['r'])
                .assign(r2 = lambda x: x['r']**2)
                .plot_bokeh(x = 'r', y = 'r2', title = 'Relationship between r and r2'))

### 5. Relationship between coefficient of determination and mean squared error

The coefficient of determination is one minus the mean squared error divided by the variance of y (the population variance, with divisor n rather than n - 1)

$$
R^{2}(y, \hat{y})=1-\frac{\sum_{i=0}^{n_{\text { samples }}-1}\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum_{i=0}^{n_{\text { samples }}-1}\left(y_{i}-\overline{y}\right)^{2}}
$$

$$
\operatorname{MSE}(y, \hat{y})=\frac{1}{n_{\text { samples }}} \sum_{i=0}^{n_{\text { samples }}-1}\left(y_{i}-\hat{y}_{i}\right)^{2}
$$
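The link between the two formulas becomes explicit after dividing the numerator and the denominator of the fraction in R² by the number of samples. Here σ_y² is notation introduced for this step only; it denotes the variance of y with divisor n_samples, which differs from the sample variance defined in section 1 by the factor (n - 1)/n.

$$
R^{2}(y, \hat{y})=1-\frac{\frac{1}{n_{\text { samples }}} \sum_{i=0}^{n_{\text { samples }}-1}\left(y_{i}-\hat{y}_{i}\right)^{2}}{\frac{1}{n_{\text { samples }}} \sum_{i=0}^{n_{\text { samples }}-1}\left(y_{i}-\overline{y}\right)^{2}}=1-\frac{\operatorname{MSE}(y, \hat{y})}{\sigma_{y}^{2}}
$$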

#### 5.1 Check empirically that coef. of determination == 1 - mean squared error / variance

In [28]:
lin_reg = LinearRegression().fit(x, y)

In [29]:
r2_score(y,lin_reg.predict(x))

0.07930073140334426

In [30]:
1-(mean_squared_error(y,lin_reg.predict(x)))/np.var(y)

0.07930073140334415
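The check above relies on `np.var` using the divisor n by default (`ddof=0`). The following minimal sketch, on synthetic data with an arbitrary seed and using `np.polyfit` instead of scikit-learn (both are assumptions for illustration), shows that the identity holds with the population variance but not with the sample variance.

```python
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(4)
x = rng.normal(size=30)
y = 1.0 + 0.3 * x + rng.normal(size=30)

# Fit a straight line; np.polyfit returns the highest-degree coefficient first
b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x

mse = np.mean((y - y_hat) ** 2)
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

r2_pop = 1 - mse / np.var(y)             # ddof=0: matches R^2
r2_sample = 1 - mse / np.var(y, ddof=1)  # ddof=1: slightly different
print(r2, r2_pop, r2_sample)
```

With a small sample the gap between the two variants is visible; it shrinks as n grows.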