# Ridge and LASSO Regression  


## 1. Introduction  
<p> Linear regression is probably the first type of regression you learn in statistics. Even people with little background in statistics find it approachable, because its mathematical notation resembles what they learned in high school math or calculus classes. For example, an equation describing the relationship between two variables has the form $Y = a + bX$. Looking at this equation, most people can easily draw the graph of a linear function with $a$ as the y-intercept and $b$ as the slope.

<p> Linear regression is that simple to interpret and understand: a one-unit increase in $X$ changes $Y$ by $b$. You might be curious why I bring up linear regression. It is because Ridge regression and LASSO are popular machine learning techniques that are both rooted in the linear regression model.

### 1.1 Linear Regression  
    
<p> Let's take a look at the notation of linear regression first.
$$\text{Y} =  \text{X} \beta + \epsilon$$ 
where $$ \text{Y} = 
\begin{pmatrix} 
y_1\\ y_2\\ \vdots\\ y_n\\ \end{pmatrix}
\text{, } \space\space 
\text{X} = 
\begin{pmatrix} 1 & x_{11} & \cdots & x_{1p}\\ 
                1 & x_{21} & \cdots & x_{2p}\\ 
                \vdots & \vdots & \ddots & \vdots \\ 
                1 & x_{n1} & \cdots & x_{np}\\ \end{pmatrix}
\text{, } \space\space
\beta = 
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \\ \end{pmatrix}
\text{, } \space\space
\epsilon = 
\begin{pmatrix} \epsilon_1\\ \epsilon_2\\ \vdots \\ \epsilon_n\\
\end{pmatrix}$$

<p> $\epsilon$ is called the error term. We want to find the $\beta$ that makes the sum of squared errors, $\sum_{i=1}^{n} \epsilon_i^2$, as small as possible; this quantity is called the RSS (Residual Sum of Squares). Note first that a linear model tries to express the relationship between Y and X in $\textbf{'linear'}$ form. We can never know the $\textbf{'true'}$ relationship g(X), sometimes called the universal function, and we do not even know whether variables left out of the explanatory variables play an important role in it. So we are simply assuming a linear relationship between Y and X. Under that assumption, the relationship is considered best explained, within the given dataset, when the sum of squared errors is at its lowest. Since X and Y are observed data, they are fixed, and $\beta$ is the only quantity we can change. We need to find the estimate of $\beta$ (written $\hat \beta$, where the hat denotes an estimate): $$\hat \beta = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 $$ This $\hat \beta$ is called the $\textbf{OLS (ordinary least squares) estimator}$ and, provided $\text{X}^\top \text{X}$ is invertible, it has the closed form $$\hat \beta = (\text{X}^\top \text{X})^{-1}\text{X}^\top \text{Y}$$
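
<p> The closed-form OLS estimator above can be checked numerically. The following is a minimal sketch in Python with NumPy; the toy data, the seed, and the names <code>features</code> and <code>true_beta</code> are assumptions made up for illustration, not part of any real dataset.

```python
import numpy as np

# Hypothetical toy data: n = 50 observations, p = 2 predictors, fixed seed.
rng = np.random.default_rng(0)
n, p = 50, 2
features = rng.normal(size=(n, p))
true_beta = np.array([1.0, 2.0, -0.5])        # assumed (beta_0, beta_1, beta_2)
X = np.column_stack([np.ones(n), features])   # prepend the column of ones
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X^T X) beta = X^T y instead of forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# RSS at the OLS estimate and at the coefficients that generated the data.
rss_hat = float(np.sum((y - X @ beta_hat) ** 2))
rss_true = float(np.sum((y - X @ true_beta) ** 2))

print("OLS estimate:", np.round(beta_hat, 3))
print("RSS at estimate:", round(rss_hat, 4), "| RSS at true beta:", round(rss_true, 4))
```

<p> Solving the linear system is generally preferred to computing $(\text{X}^\top \text{X})^{-1}$ explicitly, since it is numerically more stable. Note that the RSS at $\hat \beta$ is never larger than the RSS at the coefficients used to generate the data, because $\hat \beta$ is by definition the minimizer on this sample.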
    
### 1.2 Regularization

<p> As mentioned above, the $\beta$ with the smallest RSS is considered the optimal choice. But is the model with the smallest RSS really the best option for prediction? This is where the bias-variance tradeoff comes in: decreasing bias typically increases variance, and vice versa. Put simply, bias means predictions are systematically off target, and variance means predictions are highly volatile, changing a lot from one training sample to another.
<p><img src="https://miro.medium.com/max/1062/1*v63L_h5WXGOb4o6oh_daAA.jpeg" alt="bias variance">
<p> Source: Scott Fortmann-Roe, Understanding Bias-Variance Trade-off
    
<p> In linear regression, increasing the number of variables increases variance (more variables are used to explain the response variable) and decreases bias. Both variance and bias contribute to prediction error, and an increase in one is not exactly offset by the decrease in the other. There is therefore some point where the total error is lowest, and that point is our optimal model. This is illustrated in the figure below:
<p><img src="https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1543418451/tradeoff_sevifm.png" alt="trade off">  
    
<p> Source: researchgate.net
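
<p> The total error in the figure is commonly formalized by the standard decomposition of expected squared prediction error at a point $x_0$. As a sketch, assume $Y = g(X) + \epsilon$ with $E[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2$, and let $\hat g$ denote the fitted model (notation introduced here only for illustration):
$$E\big[(Y - \hat g(x_0))^2\big] = \underbrace{\big(E[\hat g(x_0)] - g(x_0)\big)^2}_{\text{bias}^2} + \underbrace{\text{Var}\big(\hat g(x_0)\big)}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}$$
<p> The last term does not depend on the model, so the only way to lower the total error is to trade squared bias against variance.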

<p> The OLS estimator is unbiased (under the usual linear model assumptions), but it can have high variance when the fitted model is applied to new data. To decrease variance, can't we just reduce model complexity by removing some variables through stepwise regression (keeping only the variables that are significant based on p-values)? The answer is NO. Judging the significance of variables by p-values can be a dangerous choice. For instance, when a few variables that truly affect the response are correlated with each other, their p-values sometimes suggest that none of them is significant in the model, and sometimes that only one of them is.
    
<p> Therefore, people developed an idea called regularization. The idea is pretty simple: keep all variables in the model, but penalize models with large coefficients $\beta$. In this way the size of the coefficients is constrained, and variance can decrease substantially at the cost of a slight increase in bias. Overall, the total error decreases until the penalty reaches a certain level, beyond which the added bias outweighs the reduction in variance.
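
<p> The variance-for-bias trade described above can be made concrete for the coefficient estimates themselves. The sketch below uses standard fixed-design results for an L2 (squared-coefficient) penalty of strength <code>lam</code>: OLS has covariance $\sigma^2 (\text{X}^\top \text{X})^{-1}$ and no bias, while the penalized estimator has covariance $\sigma^2 W \text{X}^\top \text{X} W$ and bias $-\lambda W \beta$, with $W = (\text{X}^\top \text{X} + \lambda I)^{-1}$. The two correlated predictors, <code>true_beta</code>, <code>sigma2</code>, and <code>lam</code> are all assumed values chosen for illustration.

```python
import numpy as np

# Hypothetical setup: two strongly correlated predictors, fixed design.
rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)     # nearly a copy of x1
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)                 # centered, so no intercept column
true_beta = np.array([1.0, 1.0])       # assumed "true" coefficients
sigma2 = 1.0                           # assumed noise variance
lam = 5.0                              # assumed penalty strength

gram = X.T @ X
identity = np.eye(X.shape[1])

# OLS: unbiased, covariance sigma^2 (X^T X)^{-1}.
var_ols = sigma2 * np.linalg.inv(gram)

# L2-penalized estimator: W = (X^T X + lam I)^{-1}, E[beta_hat] = W X^T X beta.
W = np.linalg.inv(gram + lam * identity)
var_penalized = sigma2 * W @ gram @ W
bias_penalized = W @ gram @ true_beta - true_beta   # equals -lam * W @ true_beta

total_var_ols = float(np.trace(var_ols))
total_var_penalized = float(np.trace(var_penalized))
squared_bias_penalized = float(bias_penalized @ bias_penalized)

print(f"OLS:       variance = {total_var_ols:.3f}, squared bias = 0.000")
print(f"Penalized: variance = {total_var_penalized:.3f}, "
      f"squared bias = {squared_bias_penalized:.3f}")
```

<p> For any positive <code>lam</code> the total variance of the penalized estimator is smaller than that of OLS, while its squared bias is no longer zero. Whether the sum of the two ends up lower depends on the data and on <code>lam</code>, which is why the penalty has to be tuned.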

### 1.3 Ridge and LASSO

<p> Ridge regression and LASSO use regularization to decrease the total error. Let's take a look at the objective functions they minimize.  
    
+ Ridge
$$\hat \beta_\text{ridge} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$  
+ LASSO
$$\hat \beta_\text{lasso} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$ 

<p> They look very similar to each other. The only difference is the last term: the penalty in Ridge is the squared L2 norm of the coefficients, and the penalty in LASSO is their L1 norm. $\lambda$ is the tuning parameter that controls the strength of the penalty described above. The impact of $\lambda$ is easy to understand with the plot below:

<p><img src="https://miro.medium.com/max/1400/1*Jd03Hyt2bpEv1r7UijLlpg.png" alt="ridgelasso">
<p>Source: towardsdatascience.com

<p> In the plot, $\hat \beta$ is the OLS estimator, and each contour around it connects coefficient values $\beta$ that have the same RSS. The shaded diamond and circle near the origin are the constraint regions. The point where a contour first touches this region gives the coefficients of the LASSO or Ridge estimate. The constraint regions are $|\beta_1| + |\beta_2| \leq t$ for LASSO and $\beta^2_1 + \beta^2_2 \leq c$ for Ridge, where $t$ and $c$ are upper limits on the size of the coefficients. They are inversely related to the penalty: a larger $\lambda$ corresponds to a smaller $t$ or $c$.

+ If $\lambda \rightarrow 0$:  
    the penalty term vanishes from the objective, hence $\hat \beta = \hat \beta_\text{OLS}$.
+ If $\lambda \rightarrow \infty$:  
    even small coefficients make the objective function arbitrarily large, hence $\hat \beta \rightarrow 0$.
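
<p> Both limits can be checked numerically for Ridge, whose minimizer has the well-known closed form $(\text{X}^\top \text{X} + \lambda I)^{-1} \text{X}^\top \text{Y}$ when X and Y are centered and no intercept is penalized. The sketch below is only an illustration: the helper <code>ridge_coefficients</code>, the toy data, and the grid of $\lambda$ values are all assumptions.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form Ridge solution (X^T X + lam * I)^{-1} X^T y.

    Assumes X and y are centered, so no intercept needs to be penalized.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical toy data with a fixed seed.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(size=60)
X = X - X.mean(axis=0)
y = y - y.mean()

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print("OLS:", np.round(beta_ols, 4))

# lam = 0 reproduces OLS; a very large lam pushes every coefficient toward 0.
for lam in [0.0, 1.0, 100.0, 1e6]:
    beta = ridge_coefficients(X, y, lam)
    print(f"lambda = {lam:>9}: {np.round(beta, 4)}")
```

<p> Note that Ridge coefficients shrink smoothly toward zero as $\lambda$ grows but generally do not become exactly zero at any finite $\lambda$.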
    
<p> LASSO stands for Least Absolute Shrinkage and Selection Operator. As the last two words of its name suggest, the greatest benefit of LASSO over Ridge is that it can select which variables remain in the regression model. In other words, it can remove unimportant variables with only a slight increase in total error compared to removing variables from a linear regression model. LASSO can also rank variables from most to least important through what is called the 'solution path'. The solution path starts with all coefficients at 0; then, as $\lambda$ decreases (that is, as the constraint region grows), variables take on non-zero coefficients one by one. The first variable to get a non-zero coefficient is regarded as the most important and the last as the least. Finally, when $\lambda$ reaches 0, all variables take their OLS estimates.

<p> <img src="https://online.stat.psu.edu/onlinecourses/sites/stat508/files/lesson04/ordinary_lasso.png">
<p> Source: online.stat.psu.edu  

<p> We have now briefly looked at linear, Ridge and LASSO regression. In the next chapter, I will describe why I decided to do a project on Ridge and LASSO, and show some code that I wrote to answer the question I had.
