# Linear Models

## Linear Regression

$$\hat{\beta} = \frac{Cov(x, y)}{Var(x)} = \frac{\sum_{i=1}^n (x_i - \bar{x}) (y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = (X^T X)^{-1} X^T y$$

$\textbf{Derivations}$:
$$\mathcal{L} = (Y - X \beta)^T (Y - X \beta) = \sum_{i=1}^n (y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j)^2$$
$$\frac{\partial \mathcal{L}}{\partial \beta} = 2 * X^T (Y - X \beta) = 0$$

$$\beta = (X^T X)^{-1} X^T y$$

$\textbf{Assumptions}$:

- Linearity between DV (y) and IV (X)
  - scatterplot
- Normality of residuals
  - Shapir-Wilk test
  - Kolmogorov–Smirnov test
  - QQ plot: compare quantles of data and normality
  - $\underline{Alternative}$: Use MAE when assume Laplace Distribution
  $$pdf: f(x) = \frac{1}{2 b} exp \{ -\frac{|x - \mu|}{2 b} \}$$
  - Why if normality is not met: Maximum Likelihood (MLE) is not equivalent to Least Square (OLS)
- Homoscedasticity (equal variance) of residuals
  - Breusch Pagan Test
    - obtain squared residual $\hat{u}^2$ from OLS
    - regress $\hat{u}^2$ on all independent variables ($x_1, x_2, ..., x_k$)
    - get $R_{\hat{u}^2}^2$
    - compute F statistic and p-value
  - Scatterplot of residual vs predictor
  - How to deal with heteroscedasticity:
    - obtain residual $\hat{u}$ from OLS
    - regress $ln(\hat{u}^2)$ on $x_1, x_2, ..., x_k$
    - exponentiate the fitted values $\hat{h} = exp(\hat{g})$
    - run WLS with weights $1 / \hat{h}$
- No multicollinearity of independent variables
  - How to deal with multicollinearity:
    - Regularization
    - PCA
    - VIF
- How to compute VIF
  - regress k-th variables on other independent variables
  $$VIF = \frac{1}{1 - R_k^2}$$

$\textbf{Goodness of Fit}$:

$$SST = \sum_i (y_i - \bar{y})^2, SSE = \sum_i (y_i - \hat{y_i})^2, SSR = \sum_i (\hat{y_i} - \bar{y})^2 $$
$$SST = SSE + SSR$$ 
$$R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST} = \frac{[Cov(x, y)]^2}{Var(x) Var(y)}$$
$$Adjusted R^2 = 1 - \frac{(1 - R^2) * (N - 1)}{N - k - 1} < R^2$$

$\textbf{Linear Regression Questions}$:
- Regress y on x, regress x on y, what's relationship between $\hat{\beta}_{y|x}$ and $\hat{\beta}_{x|y}$
  - $\hat{\beta}_{y|x} = \frac{Cov(x, y)}{Var(x)}, \hat{\beta}_{x|y} = \frac{Cov(x, y)}{Var(y)}, \hat{\beta}_{y|x} * \hat{\beta}_{x|y} = \frac{[Cov(x, y)]^2}{Var(x) Var(y)} = R^2$
- Duplicate data, how will coefficient, $R^2$, standard error/variance ($Var(\hat{\beta})$) change
  - $\textbf{coefficient}$: same
    - $MSE = \frac{1}{2n}\sum_{i=1}^{2n} (y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j)^2 = \frac{2}{2n}\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j)^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j)^2$
    - same loss function, get same coefficient
  - $R^2$: same
    - $R^2 = \frac{[Cov(x, y)]^2}{Var(x) Var(y)}$
    - numerator and denominator cancel
  - $\textbf{standard error/variance}$: smaller
- Univariate regression significant, multivariate regression not significant
  - Exist multicollinearity
- How does multicollinearity affect standard error/variance, t-statistic, p-value
  - $\textbf{standard error/variance}$: larger
    - multicollinearity: X is not full ranked, $(X^T X)^{-1}$ unstable
  - $\textbf{t-statistic}$: smaller
    - $t_{\hat{\beta}} = \frac{\hat{\beta} - \beta_0}{s.e.(\hat{\beta})}$
  - $\textbf{p-value}$: larger

In [1]:
import numpy as np
from statsmodels.regression.linear_model import OLS

n = 300
x1 = np.random.normal(size=n)
y1 = 2 * x1 + np.random.normal(size=n) 

In [2]:
model1 = OLS(endog=y1, exog=x1).fit()
print(model1.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.815
Model:                            OLS   Adj. R-squared:                  0.815
Method:                 Least Squares   F-statistic:                     1320.
Date:                Thu, 20 Aug 2020   Prob (F-statistic):          1.06e-111
Time:                        23:37:50   Log-Likelihood:                -403.88
No. Observations:                 300   AIC:                             809.8
Df Residuals:                     299   BIC:                             813.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             1.9794      0.054     36.337      0.0

In [3]:
# Duplicate data
x1_dup = np.concatenate((x1, x1), axis=0)
y1_dup = np.concatenate((y1, y1), axis=0)

model1_dup = OLS(endog=y1_dup, exog=x1_dup).fit()
print(model1_dup.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.815
Model:                            OLS   Adj. R-squared:                  0.815
Method:                 Least Squares   F-statistic:                     2645.
Date:                Thu, 20 Aug 2020   Prob (F-statistic):          6.66e-222
Time:                        23:37:51   Log-Likelihood:                -807.75
No. Observations:                 600   AIC:                             1618.
Df Residuals:                     599   BIC:                             1622.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             1.9794      0.038     51.431      0.0

In [4]:
# Multicollinearity
x2 = np.random.normal(loc=x1, scale=0.01, size=n)
y2 = x1 - x2 + np.random.normal(size=n)
x1x2 = np.resize(np.array((x1, x2)), (n, 2))

model2 = OLS(endog=y2, exog=x1x2).fit()
print(model2.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.009
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     1.410
Date:                Thu, 20 Aug 2020   Prob (F-statistic):              0.246
Time:                        23:37:51   Log-Likelihood:                -436.07
No. Observations:                 300   AIC:                             876.1
Df Residuals:                     298   BIC:                             883.5
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.0962      0.059     -1.631      0.1

## Lasso Regression (L1)

$$\mathcal{L} = \sum_{i=1}^n (y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j)^2 + \lambda \sum_{j=1}^p |\beta_j|$$
$$ \hat{\beta}_j^{lasso} = sgn(\beta_j^{OLS}) max(0, |\beta_j^{OLS}| - \gamma), \gamma = \frac{n \lambda}{||x||^2}$$
https://stats.stackexchange.com/questions/17781/derivation-of-closed-form-lasso-solution

Usage:
- variable selection
- parameter shrinkage
- penalize $\beta_j$ to exactly zero

## Ridge Regression (L2)

$$\mathcal{L} = \sum_{i=1}^n (y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j)^2 + \lambda \sum_{j=1}^p \beta_j^2$$
$$ \hat{\beta}_j^{ridge} = (X^T X + \lambda I_p)^{-1} X^T y$$

Usage:
- parameter shrinkage
- include all independent variables
- penalize $\beta_j$ close to zero

## Elastic Net (L1 + L2)

$$\mathcal{L} = \sum_{i=1}^n (y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j)^2 + \alpha \lambda \sum_{j=1}^p \beta_j^2 + (1 - \alpha) \lambda \sum_{j=1}^p |\beta_j|$$

$\textbf{Regularization Questions}$:
- Need standardization before regularization
  - Why: Transform features to same scale
- Why does regularization reduce overfitting
  - Large $\lambda$ penalizes parameters/weights (close) to zero
  - Reduce Model complexity
- Why do not regularize bias term ($\beta_0$)
  - Var(X + $\beta_0$) = Var(X)
  - Bias term does not increase variance

## Logistic Regression

Each observation: 
$$f(y_i) = p^{y_i} (1-p)^{1-{y_i}}$$

Likelihood:
$$L(p) = \prod_i f(y_i) = p^{\sum_i y_i} (1-p)^{\sum_i (1 - y_i)}$$

Log-likelihood:
$$\mathcal{l} = log(L(p)) = \sum_i y_i log(p) + \sum_i (1 - y_i) log(1-p)$$

Loss function (Cross Entropy):
$$\mathcal{L} = -\sum_i [y_i log(p_i) + (1 - y_i) log(1 - p_i)]$$

$\textbf{Assumptions}$:
- linearity of indepdent variables and log odds
$$p = \frac{1}{1 + exp(-y)}, y = ln(\frac{p}{1-p}) = X \beta$$
- every observation is independent Bernoulli(p)
- no multicollinearity

$\textbf{Pseudo R-Squared}$: https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/#:~:text=A%20pseudo%20R%2Dsquared%20only,model%20better%20predicts%20the%20outcome.
$$R^2 = 1 - \frac{\sum_i (y_i - p_i)^2}{\sum_i (y_i - \bar{y})^2}$$