# Introduction

### Mathematical Optimization

\begin{equation*}
\begin{aligned}
\text{minimize} & & f_0(x) \\
\text{subject to} & & f_i(x) \leq b_i, & i = 1, 2, \dots, m
\end{aligned}
\end{equation*}

### Notations
+ $f_0: \mathbf{R}^n \rightarrow \mathbf{R}$: objective function
+ $f_i: \mathbf{R}^n \rightarrow \mathbf{R}$, $i = 1, \dots, m$: constraint functions, with bounds $b_1, \dots, b_m \in \mathbf{R}$
+ $x = (x_1, \dots, x_n)$: optimization variable

### Convex Optimization
A convex optimization problem is an optimization problem in which the objective function and the constraint functions are convex, meaning that they satisfy $f_i(\alpha x + \beta y) \leq \alpha f_i(x) + \beta f_i(y)$ for all $x, y \in \mathbf{R}^n$ and all $\alpha, \beta \in \mathbf{R}$ with $\alpha + \beta = 1$, $\alpha \geq 0$, $\beta \geq 0$.
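The inequality can be spot-checked numerically. The sketch below assumes the squared Euclidean norm as the example function (it is not taken from the text above) and samples random points and weights. A check like this can only find counterexamples; it does not prove convexity.

```python
import numpy as np

def f(x):
    # Example function chosen for illustration: f(x) = ||x||_2^2, which is convex.
    return float(np.dot(x, x))

rng = np.random.default_rng(0)
violations = 0
for _ in range(1000):
    x = rng.normal(size=3)
    y = rng.normal(size=3)
    alpha = rng.uniform(0.0, 1.0)
    beta = 1.0 - alpha  # alpha + beta = 1 with both non-negative
    lhs = f(alpha * x + beta * y)
    rhs = alpha * f(x) + beta * f(y)
    # A small tolerance absorbs floating-point rounding.
    if lhs > rhs + 1e-12:
        violations += 1

print("violations:", violations)
```

For this particular function the gap between the two sides equals $\alpha \beta \lVert x - y \rVert_2^2$, which is never negative, so no violations should be reported.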

### Ordinary Least Squares



\begin{equation*}
\begin{aligned}
\text{minimize} & & f_0(w) = \lVert Xw - y \rVert_2^2
\end{aligned}
\end{equation*}

where $$X = \begin{bmatrix}
    1       & X_{11} & \ldots & X_{n1} \\
    \vdots  & \vdots &        & \vdots \\
    1       & X_{1k} & \ldots & X_{nk} 
    \end{bmatrix}$$
    
+ $n$: number of explanatory variables; $k$: number of observations ($k \geq n + 1$, since $X$ has $n + 1$ columns including the intercept column).

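+ If $X$ has full column rank, the minimizer has the closed form $w^* = (X^T X)^{-1} X^T y$, where $y$ is the vector of observed responses that $Xw$ approximates.

+ Computation time is proportional to $n^2 k$, and less if $X$ has exploitable structure (e.g. sparsity).

+ Compared with the regularized estimators below (ridge, lasso), OLS has lower bias but higher variance.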

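+ Details: [最小二乘线性回归从入门到放弃 :-)](https://www.bilibili.com/video/av13759873) (video in Chinese; the title translates roughly as "Least-Squares Linear Regression: From Getting Started to Giving Up")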


```python
import numpy as np

def f(x1, x2):
    """Ground-truth linear model: intercept 2, slopes 3 and -1."""
    return 2 + 3 * x1 - x2

x1 = [1, 3, 0, 1, -1]
x2 = [2, 4, 1, 0, 1]
y = np.array([f(a, b) for a, b in zip(x1, x2)])

# Design matrix: a column of ones for the intercept, then one column per variable.
X = np.zeros((5, 3))
X[:, 0] = 1
X[:, 1] = x1
X[:, 2] = x2
print('X', X, sep='\n')

# Closed-form OLS solution w* = (X^T X)^{-1} X^T y.
w_opt = np.linalg.inv(X.T @ X) @ X.T @ y
print('w_opt', w_opt, sep='\n')
```

```
X
[[ 1.  1.  2.]
 [ 1.  3.  4.]
 [ 1.  0.  1.]
 [ 1.  1.  0.]
 [ 1. -1.  1.]]
w_opt
[ 2.  3. -1.]
```
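Forming $(X^T X)^{-1}$ explicitly works for this small example, but it can lose accuracy when $X^T X$ is close to singular. As an alternative sketch, the same toy data as in the example above can be fitted with `np.linalg.lstsq`, which solves the least-squares problem without inverting $X^T X$. The variable names here are chosen for illustration.

```python
import numpy as np

# Same toy data as the example above: y = 2 + 3*x1 - x2, with no noise.
x1 = np.array([1, 3, 0, 1, -1], dtype=float)
x2 = np.array([2, 4, 1, 0, 1], dtype=float)
y = 2 + 3 * x1 - x2

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x1), x1, x2])

# lstsq returns the solution, the residual sum of squares,
# the rank of X, and the singular values of X.
w_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print("w_lstsq", w_lstsq)
print("rank", rank)
```

Because the data are noise-free and $X$ has full column rank, the result should agree with the closed-form solution up to rounding.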


### Linear Optimization

\begin{equation*}
\begin{aligned}
\text{minimize} & & c^T x \\
\text{subject to} & & a_i^T x \leq b_i, & i = 1, 2, \dots, m
\end{aligned}
\end{equation*}

where the vectors $c, a_1, \dots, a_m \in \mathbf{R}^n$ and the scalars $b_1, \dots, b_m \in \mathbf{R}$ are the problem parameters.
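To make the form concrete, here is a minimal sketch that solves a tiny two-variable linear program by brute force. The numbers in `c`, `A`, and `b` are made up for illustration. The approach assumes a bounded, non-empty feasible region, in which case an optimum lies at a vertex where two constraints are tight. Practical solvers use simplex or interior-point methods instead; enumeration only scales to toy problems.

```python
import itertools
import numpy as np

# Hypothetical problem: minimize -x1 - 2*x2
# subject to x1 + x2 <= 4, x1 <= 3, x2 <= 2, x1 >= 0, x2 >= 0.
c = np.array([-1.0, -2.0])
A = np.array([
    [1.0, 1.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, 0.0],   # -x1 <= 0, i.e. x1 >= 0
    [0.0, -1.0],   # -x2 <= 0, i.e. x2 >= 0
])
b = np.array([4.0, 3.0, 2.0, 0.0, 0.0])

best_x, best_value = None, np.inf
# In two dimensions each vertex is the intersection of two constraint lines.
for i, j in itertools.combinations(range(len(b)), 2):
    A_pair = A[[i, j]]
    if abs(np.linalg.det(A_pair)) < 1e-12:
        continue  # parallel constraints do not meet in a single point
    vertex = np.linalg.solve(A_pair, b[[i, j]])
    # Keep the intersection only if it satisfies every constraint.
    if np.all(A @ vertex <= b + 1e-9):
        value = c @ vertex
        if value < best_value:
            best_x, best_value = vertex, value

print("x_opt", best_x)
print("optimal value", best_value)
```

The feasible vertices here are $(0,0)$, $(3,0)$, $(3,1)$, $(2,2)$ and $(0,2)$; the objective is smallest at $(2,2)$, where it equals $-6$.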

### Ridge Regression
Ridge regression addresses the case where $X^T X$ is singular or nearly singular by adding a small **positive** constant $\lambda$ to the diagonal entries of $X^T X$ before inverting it. [Regularization Part 1: Ridge Regression](https://www.youtube.com/watch?v=Q81RR3yKn30)


+ Higher bias, lower variance than OLS
+ Makes the prediction $y$ less sensitive to $x$ by shrinking the slope
+ Works with smaller samples: OLS needs at least as many observations as variables, whereas ridge regression has no such requirement. The shrinkage also improves the model's ability to generalize.

The ridge estimator is $$\beta^{ridge} = (X^T X + \lambda I_p)^{-1} X^T y$$ where $p$ is the number of columns of $X$ and $I_p$ is the $p \times p$ identity matrix.

For comparison, the OLS estimator is $$\beta^{ols} = (X^T X)^{-1} X^T y$$

The ridge estimator $\beta^{ridge}$ can be seen as a solution to $$\underset{\beta \in \mathbf{R}^p}{\text{minimize}} \lVert X \beta  - y \rVert_2^2  + \lambda \lVert \beta \rVert_2^2$$ 
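The closed form translates directly into NumPy. The sketch below uses made-up data in which one column duplicates another, so $X^T X$ is singular and the OLS formula cannot be applied, while the ridge system still has a unique solution. The function name `ridge_estimator` and the value of `lam` are illustrative choices.

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Solve (X^T X + lam * I_p) beta = X^T y; lam is assumed positive."""
    p = X.shape[1]
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical data: the third column duplicates the second,
# so X^T X has rank 2 and is not invertible.
X = np.array([
    [1.0, 1.0, 1.0],
    [1.0, 2.0, 2.0],
    [1.0, 3.0, 3.0],
    [1.0, 4.0, 4.0],
])
y = np.array([3.0, 5.0, 7.0, 9.0])

gram_rank = np.linalg.matrix_rank(X.T @ X)
print("rank of X^T X:", gram_rank)

# As in the formula above, every coefficient is penalized, including the intercept.
beta_ridge = ridge_estimator(X, y, lam=0.1)
print("beta_ridge", beta_ridge)
```

Since the two duplicated columns enter the problem symmetrically, ridge assigns them equal coefficients rather than failing.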

### Lasso Regression
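Similar to ridge regression, except that its penalty is $\lambda \lVert \beta \rVert_1 = \lambda \sum_j \lvert \beta_j \rvert$, the sum of the absolute values of the coefficients, instead of the squared norm. [Regularization Part 2: Lasso Regression](https://www.youtube.com/watch?v=NGf0voTMlcs)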

+ Ridge regression shrinks the coefficients of uninformative variables but does not remove them, whereas lasso regression can remove them by setting their coefficients exactly to zero


The lasso estimator $\beta^{lasso}$ can be seen as a solution to $$\underset{\beta \in \mathbf{R}^p}{\text{minimize}} \lVert X \beta  - y \rVert_2^2  + \lambda \lVert \beta \rVert_1$$ 
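Unlike ridge, the lasso has no closed-form solution in general, so it is solved iteratively. The sketch below uses proximal gradient descent (often called ISTA), which is one of several possible algorithms and is not taken from the linked video. The function names, the synthetic data, and the value of `lam` are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding: shrink each entry of v toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    """Minimize ||X beta - y||_2^2 + lam * ||beta||_1 by proximal gradient descent."""
    # The gradient of the squared-error term is 2 X^T (X beta - y). Its Lipschitz
    # constant is 2 * sigma_max(X)^2, so the reciprocal is a safe step size.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - y)
        # Gradient step on the smooth term, then the proximal step for the L1 penalty.
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
# Hypothetical ground truth: only the first two variables influence y.
y = X @ np.array([3.0, -2.0, 0.0, 0.0]) + 0.01 * rng.normal(size=50)

beta_lasso = lasso_ista(X, y, lam=5.0)
print("beta_lasso", beta_lasso)
```

The soft-thresholding step is what produces exact zeros: any coefficient whose update falls within `step * lam` of zero is set to zero outright. With data like this, the coefficients of the two irrelevant variables are typically driven to zero, while the other two are shrunk slightly toward zero relative to their true values.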


### Elastic Net Regression
Elastic net regression combines the ridge and lasso penalties. [Regularization Part 3: Elastic Net Regression](https://www.youtube.com/watch?v=1dKRdX9bfIo)
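
Written in the same form as the ridge and lasso objectives above, a common formulation of the elastic net gives each penalty its own weight. The symbols $\lambda_1$ and $\lambda_2$ are notation assumed here, not taken from the linked video:

$$\underset{\beta \in \mathbf{R}^p}{\text{minimize}} \lVert X \beta  - y \rVert_2^2  + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2$$

Setting $\lambda_1 = 0$ recovers ridge regression, and setting $\lambda_2 = 0$ recovers the lasso.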