# Linear Regression 

Throughout this notebook we use the scikit-learn diabetes dataset to work through the maths behind linear regression.

In [106]:
from sklearn import datasets
import pandas as pd
import statsmodels.api as sm
import numpy as np

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

X = pd.DataFrame(diabetes_X)
X.columns = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

y = pd.DataFrame(diabetes_y)
y.columns = ['disease']

### OLS Linear Regression Equation

The classical simple linear regression equation and its residual sum of squares (RSS) are given by:

$$
\hat{y} = \hat{\beta_0} + \hat{\beta_1}x \\
RSS = (y_1 - \hat{\beta_0} - \hat{\beta_1}x_1)^2+(y_2 - \hat{\beta_0} - \hat{\beta_1}x_2)^2 + \dots + (y_n - \hat{\beta_0} - \hat{\beta_1}x_n)^2 \\
RSS = \sum_{i=1}^n(y_i - \hat{\beta_0} - \hat{\beta_1}x_i)^2 \\
$$
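
As a quick illustration of the RSS definition, here is a minimal sketch on a small made-up dataset. The arrays `x_toy` and `y_toy` and the helper `rss_for_line` are hypothetical and not part of the diabetes data. It shows that RSS depends on the choice of $\hat{\beta_0}$ and $\hat{\beta_1}$, which is why OLS looks for the pair that minimises it.

```python
import numpy as np

# Hypothetical toy data, roughly following y = 1 + 2x
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def rss_for_line(b0, b1, x, y):
    """Residual sum of squares of the line y_hat = b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)
    return float(np.sum(residuals ** 2))

rss_good = rss_for_line(1.0, 2.0, x_toy, y_toy)  # close to the trend in the data
rss_bad = rss_for_line(0.0, 1.0, x_toy, y_toy)   # a poorly chosen line

print(f"RSS for b0=1, b1=2: {rss_good:.3f}")
print(f"RSS for b0=0, b1=1: {rss_bad:.3f}")
```

The line that follows the trend in the data gives a much smaller RSS than the poorly chosen one.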

### Computing $\beta_0$ and $\beta_1$

To solve for $\hat{\beta_0}$ and $\hat{\beta_1}$, we set the partial derivatives of RSS to zero. This gives the normal equations:

$$
\frac{\partial{RSS}}{\partial{\hat{\beta_0}}} = -2\sum (y_i - \hat{\beta_0} - \hat{\beta_1}x_i) = 0 \\
\frac{\partial{RSS}}{\partial{\hat{\beta_1}}} = -2\sum x_i(y_i - \hat{\beta_0} - \hat{\beta_1}x_i) = 0 \\
$$

Solving the first normal equation for $\beta_0$:

$$
\frac{\partial{RSS}}{\partial{\hat{\beta_0}}} = -2\sum (y_i - \hat{\beta_0} - \hat{\beta_1}x_i) = 0 \\
\sum{y_i} - n\hat{\beta_0}-\hat{\beta_1}\sum{x_i} = 0 \\
\underline{\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}} \;\;\;\;\;
\text{where} \;
\bar{x} = \frac{\sum{x_i}}{n}, \;\;\;
\bar{y} = \frac{\sum{y_i}}{n}
$$

Solving the second normal equation for $\beta_1$:

$$
\frac{\partial{RSS}}{\partial{\hat{\beta_1}}} = -2\sum{x_i}(y_i - \hat{\beta_0} - \hat{\beta_1}x_i) = 0 \\
\sum{x_i}(y_i - \bar{y} + \hat{\beta_1}\bar{x} - \hat{\beta_1}x_i) = 0 \;\;\;\;
\text{using} \; \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x} \\
\sum{x_i}(y_i-\bar{y}) = \hat{\beta_1}\sum{(x_i-\bar{x})x_i} \\
\hat{\beta_1} = \frac{\sum{(y_i-\bar{y})x_i}}{\sum{(x_i-\bar{x})x_i}} \\
\underline{\hat{\beta_1}= \frac{\sum{(y_i-\bar{y})(x_i-\bar{x})}}{\sum{(x_i-\bar{x})^2}}} \;\;\;\;\;
\text{since} \;
\sum{(y_i-\bar{y})}\bar{x} = 0 \;\; \text{and} \;\;
\sum{(x_i-\bar{x})}\bar{x} = 0
$$
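
One way to sanity-check the derivation is to plug the closed-form estimates back into the two normal equations: both sums should come out as zero, up to floating-point rounding. The sketch below does this on hypothetical toy data (`x_toy` and `y_toy` are made up for illustration) and cross-checks the result against `np.polyfit`.

```python
import numpy as np

# Hypothetical toy data (not from the diabetes dataset)
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_toy = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.6])

x_bar, y_bar = x_toy.mean(), y_toy.mean()

# Closed-form OLS estimates from the derivation
beta1_hat = np.sum((y_toy - y_bar) * (x_toy - x_bar)) / np.sum((x_toy - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

residuals = y_toy - beta0_hat - beta1_hat * x_toy

# Both normal equations should hold up to floating-point rounding
print(f"sum of residuals:       {np.sum(residuals):.2e}")
print(f"sum of x_i * residuals: {np.sum(x_toy * residuals):.2e}")

# Cross-check against NumPy's least-squares polynomial fit of degree 1,
# which returns the coefficients highest degree first: [slope, intercept]
slope, intercept = np.polyfit(x_toy, y_toy, 1)
print(np.isclose(beta1_hat, slope), np.isclose(beta0_hat, intercept))
```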


In [117]:
# For this example we use only the age feature

X_age = X.age

X_age = sm.add_constant(X_age, prepend=False)
model = sm.OLS(y, X_age)
results = model.fit()
rss = results.ssr
print(f"RSS is {rss}")
# Index by label: integer keys on a label-indexed Series are deprecated in recent pandas
print(f"Coeff for age is {results.params['age']}")
print(f"Intercept is {results.params['const']}")

RSS is 2528481.7816048963
Coeff for age is 304.1830745282948
Intercept is 152.13348416289608


In [159]:
# Let's verify that our formulas for the coefficient and intercept give the same values

y_mean = np.mean(y.disease)
x_mean = 0 # The features in this dataset are mean-centered, so the mean of age is 0

coeff = np.sum((y.disease - y_mean) * (X_age.age - x_mean)) / np.sum((X_age.age - x_mean)**2)
print(f"Calculated coeff is {coeff}")

intercept = y_mean - coeff * x_mean
print(f"Calculated intercept is {intercept}")

print()
print("The values calculated using the formulas match the values calculated by statsmodels, up to floating-point rounding. Our maths holds up!")

Calculated coeff is 304.1830745282948
Calculated intercept is 152.13348416289594

The values calculated using the formulas match the values calculated by statsmodels, up to floating-point rounding. Our maths holds up!


### Standard Error of Coefficients

Because we only observe a sample of $n$ observations, we never know the true $\beta_0$ and $\beta_1$; we only have the estimates $\hat{\beta_0}$ and $\hat{\beta_1}$. The standard error measures how far these estimates are likely to be from the true values, and is given by:
$$
SE(\hat{\beta_0})^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n(x_i-\bar{x})^2}\right] \\
SE(\hat{\beta_1})^2 = \frac{\sigma^2}{\sum_{i=1}^n(x_i-\bar{x})^2} \;\;\;
$$
where $\sigma^2$ is the variance of $\epsilon$, the random error term in $y = \beta_0 + \beta_1x + \epsilon$, which again is something we don't know.

$\sigma$ is estimated by the Residual Standard Error (RSE):
$$
\hat{\sigma} = RSE = \sqrt{\frac{RSS}{(n-2)}}
$$

In [158]:
print(f"Standard error for coeff is {results.bse['age']}")
print(f"Standard error for intercept is {results.bse['const']}")

# Let's verify this as well using the formulas above

n = len(y) # 442 observations
sigma = np.sqrt(rss/(n-2)) # Residual Standard Error

se_coeff = np.sqrt((sigma**2)/np.sum((X_age.age - x_mean)**2))
print(f"Computed standard error for coeff is {se_coeff}")

se_intercept = np.sqrt(sigma**2*(1/n)) # The second term with x_mean drops out since x_mean is zero
print(f"Computed standard error for intercept is {se_intercept}")

print()
print("HOORAY! Values are same.") 

Standard error for coeff is 75.8059991270286
Standard error for intercept is 3.605723675064678
Computed standard error for coeff is 75.80599912702861
Computed standard error for intercept is 3.6057236750646777

HOORAY! Values are same.
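
The check above drops the $\bar{x}$ term because the age feature has zero mean. As a minimal sketch of the general case, the block below uses hypothetical data whose $x$ values are not centered (all names and numbers are made up for illustration) and keeps the full intercept formula. It cross-checks against the matrix form $\hat{\sigma}^2(X^TX)^{-1}$, a standard result that this notebook does not derive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data whose x values are NOT mean-centered
n = 50
x_toy = rng.uniform(10, 20, size=n)
y_toy = 3.0 + 0.5 * x_toy + rng.normal(0, 1, size=n)

x_bar, y_bar = x_toy.mean(), y_toy.mean()
sxx = np.sum((x_toy - x_bar) ** 2)

beta1_hat = np.sum((y_toy - y_bar) * (x_toy - x_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar

rss_toy = np.sum((y_toy - beta0_hat - beta1_hat * x_toy) ** 2)
sigma2_hat = rss_toy / (n - 2)  # RSE squared

# Standard errors from the formulas, keeping the x_bar term
se_beta0 = np.sqrt(sigma2_hat * (1 / n + x_bar ** 2 / sxx))
se_beta1 = np.sqrt(sigma2_hat / sxx)

# Cross-check with the matrix form sigma^2 * (X^T X)^{-1}
design = np.column_stack([np.ones(n), x_toy])
cov = sigma2_hat * np.linalg.inv(design.T @ design)
se_matrix = np.sqrt(np.diag(cov))

print(f"SE(beta0): formula {se_beta0:.6f}, matrix form {se_matrix[0]:.6f}")
print(f"SE(beta1): formula {se_beta1:.6f}, matrix form {se_matrix[1]:.6f}")
```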


### Testing Coefficients
To decide whether a particular feature matters, we set up a null hypothesis and check whether it can be rejected.

The null hypothesis $H_0$ says that there is no relationship between a feature $X$ and the output $Y$, i.e. $\beta_1 = 0$.
If there is a relationship between the feature and the output, then $\hat{\beta_1}$ will be sufficiently far from 0 relative to its standard error. The $t$-statistic measures how many standard errors $\hat{\beta_1}$ is away from zero; when it is large enough in absolute value, the $p$-value falls below the conventional 0.05 threshold.

Based on the above, the $t$-statistic is given by:
$$
t = \frac{\hat{\beta_1} - 0}{SE({\hat{\beta_1}})}
$$

Using this, we can compute the $p$-value. If the $p$-value is small enough, the observed relationship between the feature and the output is unlikely to be due to chance, and the null hypothesis can be rejected.

In [164]:
# t-statistics from the model (indexed by label rather than by position)
print(f"t-statistic of coeff from the model is {results.tvalues['age']}")
print(f"t-statistic of intercept from the model is {results.tvalues['const']}")

t_coeff = coeff / se_coeff
t_intercept = intercept / se_intercept

print(f"Calculated t-statistic of coeff is {t_coeff}")
print(f"Calculated t-statistic of intercept is {t_intercept}")

print()
print("Holds up again!") 

t-statistic of coeff from the model is 4.012651743018033
t-statistic of intercept from the model is 42.192219335877745
Calculated t-statistic of coeff is 4.012651743018033
Calculated t-statistic of intercept is 42.19221933587771

Holds up again!
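
The $p$-value itself is not computed above. A minimal sketch, assuming SciPy is available (it is not imported elsewhere in this notebook): the two-sided $p$-value is the probability, under $H_0$, of a $t$-statistic at least as extreme as the observed one, using a $t$ distribution with $n - 2$ degrees of freedom. The $t$-statistic is copied from the output above.

```python
from scipy import stats

# Value copied from the output above
t_age = 4.012651743018033   # t-statistic of the age coefficient
n_obs, n_params = 442, 2    # observations; estimated parameters (slope + intercept)
dof = n_obs - n_params      # residual degrees of freedom

# Two-sided p-value: probability of |T| >= |t_age| under H0
p_age = 2 * stats.t.sf(abs(t_age), dof)

print(f"p-value for age: {p_age:.2e}")
print("Reject H0 at the 5% level" if p_age < 0.05 else "Cannot reject H0 at the 5% level")
```

In statsmodels the same quantity is exposed on the fitted results as `results.pvalues`.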


### Model Accuracy

The RSE and RSS defined above measure how well the model fits: the higher the values, the worse the fit. However, their absolute values are hard to interpret on their own because they are on the scale of $Y$.

$$
RSE = \sqrt{\frac{RSS}{(n-p-1)}} \;\; \text{where } p = \text{number of features}
$$



To address this, we use $R^2$, the proportion of the variance in the target that is explained by the model. It is given by:

$$
R^2 = \frac{TSS - RSS}{TSS} \\
\underline{R^2 = 1 - \frac{RSS}{TSS}} \;\;\;\; \text{where} \; TSS = \sum(y_i - \bar{y})^2
$$

TSS is the total variability in the target variable before any regression is performed.

Another commonly used measure is the adjusted R-squared, which takes into account the number of predictors used in the model. Adding a predictor never decreases $R^2$, even when the predictor is pure noise, so $R^2$ alone cannot tell a useful predictor from a useless one. Adjusted R-squared penalizes each extra predictor, so it rises only when the new predictor reduces RSS enough to offset the lost degree of freedom. Its formula is given by:

$$
Adj. R^2 = 1 - \frac{RSS/(n-p-1)}{TSS/(n-1)}
$$
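
A minimal sketch of how $R^2$ and adjusted $R^2$ respond to an irrelevant predictor, on hypothetical synthetic data (the helper `r2_and_adjusted` and the variables `x1`, `noise` and `y_toy` are made up for illustration). The model is fitted with `np.linalg.lstsq` rather than statsmodels to keep the block self-contained.

```python
import numpy as np

rng = np.random.default_rng(42)

def r2_and_adjusted(design, y):
    """Fit OLS by least squares; return (R^2, adjusted R^2).

    `design` must NOT include the intercept column; it is added here.
    """
    n, p = design.shape
    X_full = np.column_stack([np.ones(n), design])
    beta, *_ = np.linalg.lstsq(X_full, y, rcond=None)
    rss = np.sum((y - X_full @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))
    return r2, adj_r2

# Hypothetical data: y depends on x1 only; `noise` is unrelated to y
n = 100
x1 = rng.normal(size=n)
noise = rng.normal(size=n)
y_toy = 2.0 + 1.5 * x1 + rng.normal(size=n)

r2_base, adj_base = r2_and_adjusted(x1.reshape(-1, 1), y_toy)
r2_more, adj_more = r2_and_adjusted(np.column_stack([x1, noise]), y_toy)

print(f"x1 only:    R^2 = {r2_base:.4f}, adjusted R^2 = {adj_base:.4f}")
print(f"x1 + noise: R^2 = {r2_more:.4f}, adjusted R^2 = {adj_more:.4f}")
```

$R^2$ can only stay the same or go up when the noise column is added. Whether adjusted $R^2$ goes up or down depends on the random draw, but it is always the smaller of the two numbers.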

### Testing Model

Just as we test individual coefficients against a null hypothesis, we use the $F$-statistic to test whether the predictors as a whole have any relationship with the response. Here the null hypothesis says that there is no relationship between any of the predictors and the response, i.e. all coefficients are zero. If that is the case, the $F$-statistic will be close to 1 and its $p$-value will be greater than 0.05, so we cannot reject the null hypothesis. An $F$-statistic well above 1 with a small $p$-value lets us reject it. The $F$-statistic is calculated as follows:

$$
F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)}
$$

In [168]:
X_OLS = sm.add_constant(X, prepend=False)
model = sm.OLS(y, X_OLS)
results = model.fit()
print(f"RSS of the model is {results.ssr}")
print(f"TSS of the model is {results.centered_tss}")
print()
print(f"R-squared of the model is {results.rsquared}")
print(f"Adjusted R-squared of the model is {results.rsquared_adj}")
print(f"F-statistic of the model is {results.fvalue}")

RSS of the model is 1263983.156255485
TSS of the model is 2621009.124434389

R-squared of the model is 0.5177494254132934
Adjusted R-squared of the model is 0.506560316954205
F-statistic of the model is 46.27262550062717


Let's verify!

$$
R^2 = 1 - \frac{RSS}{TSS} \;\; = 1 - \frac{1263983.15}{2621009.12} \;\; = 0.518 \\
Adj. R^2 = 1 - \frac{RSS/(n-p-1)}{TSS/(n-1)} \;\; = 1 - \frac{1263983.15/(442-10-1)}{2621009.12/(442-1)} \;\;
= 1 - \frac{2932.67}{5943.33} \;\; = 1 - 0.493\;\;  = 0.507
$$

The R-squared and adjusted R-squared values match our formulas. Let's check the F-statistic now:

$$
F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)} \;\; = \frac{(2621009.12 - 1263983.15)/10}{1263983.15/(442-10-1)} \;\;
= \frac{135702.59}{2932.67} \;\; = 46.27
$$

WOHOOO!
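
The same hand calculation can be repeated in a few lines of Python. This is a minimal sketch that copies RSS and TSS from the statsmodels output above; the variable names are chosen here for illustration.

```python
# Values copied from the statsmodels output above
rss_full = 1263983.156255485
tss_full = 2621009.124434389
n, p = 442, 10  # observations, predictors

r2 = 1 - rss_full / tss_full
adj_r2 = 1 - (rss_full / (n - p - 1)) / (tss_full / (n - 1))
f_stat = ((tss_full - rss_full) / p) / (rss_full / (n - p - 1))

print(f"R-squared:          {r2:.6f}")
print(f"Adjusted R-squared: {adj_r2:.6f}")
print(f"F-statistic:        {f_stat:.4f}")
```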