In [1]:
%run Latex_macros.ipynb
%run beautify_plots.py

<IPython.core.display.Latex object>

In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

# Linear Regression: Loss function

Fitting an estimator/predictor/model involves solving for the $\Theta$ that minimizes the Loss function.

For a Regression task, our goal is to make the discrepancy (error) between $\y$ and $\hat{\y}$ "small".
- The discrepancy between $\y^\ip$ and $\hat{\y}^\ip$ is referred to as the *residual*, usually denoted by $\epsilon$

$$
\mathbf{\epsilon}^\ip =   \y^\ip - \hat{\y}^\ip 
$$

So 
$$
\begin{array}{lll}
\y & = & \hat{\y} + \epsilon \\
& = & \X \Theta + \epsilon
\end{array}
$$
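
The decomposition above can be checked numerically. The following is a minimal sketch using NumPy; the design matrix `X`, the parameter vector `Theta`, and the targets `y` are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical data: m = 3 examples, a constant feature plus one real feature.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
Theta = np.array([1.0, 2.0])    # assumed parameter values, for illustration only
y = np.array([1.5, 2.5, 5.5])   # made-up targets

y_hat = X @ Theta               # predictions, one per example
epsilon = y - y_hat             # residuals: target minus prediction

# The targets are recovered exactly as prediction plus residual.
assert np.allclose(y, X @ Theta + epsilon)
```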

We define the per-example loss to be the residual *squared*

$$\loss^\ip_\Theta  =   ( \y^\ip - \hat{\y}^\ip   )^2 $$

so that the average loss is
$$
\begin{array}{lll}
\loss_\Theta  & = & { 1\over{m} } \sum_{i=1}^m \loss^\ip_\Theta \\
& = & { 1\over{m} } \sum_{i=1}^m ( \y^\ip  - \hat{\y}^\ip  )^2  \\
\end{array}
$$

This expression on the right is called the *Mean Squared Error (MSE)*.

$$
\text{MSE}(\y, \hat{\y}) = { 1\over{m} } \sum_{i=1}^m (  \y^\ip  - \hat{\y}^\ip )^2
$$

- You will sometimes see *Root Mean Squared Error (RMSE)*, which is the square root of the MSE and is therefore in the same units as $\y$
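
As a small illustration of these two definitions, here is a sketch in NumPy. The function names `mse` and `rmse` and the sample arrays are assumptions made for this example, not part of the notebook's own code.

```python
import numpy as np

def mse(y, y_hat):
    """Mean Squared Error: the average of the squared residuals."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    residuals = y - y_hat
    return np.mean(residuals ** 2)

def rmse(y, y_hat):
    """Root Mean Squared Error: square root of the MSE (same units as y)."""
    return np.sqrt(mse(y, y_hat))

# Made-up targets and predictions.
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

# Residuals are 1, 0, -1, -2, so the squared residuals are 1, 0, 1, 4.
mse_value = mse(y, y_hat)    # (1 + 0 + 1 + 4) / 4 = 1.5
rmse_value = rmse(y, y_hat)  # sqrt(1.5)
```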

Notice that the Performance Metric and the Loss Function are identical in this case.

This will not always be true.

# $R^2$ versus RMSE: Absolute versus relative error

One often sees the term $R^2$ in the context of Linear Regression.

Whereas RMSE is an absolute error (in the same units as $\y$), $R^2$ is a relative measure (a unitless fraction, often quoted as a percent).

The relationship is:
$$ 
\begin{array}{lll}
R^2 & = & 1 - \left( \frac{\sum_{i=1}^m { (\y^\ip - \hat{\y}^\ip)^2} }{ \sum_{i=1}^m { (\y^\ip -  \bar{\y})^2} }   \right) \\
& = & 1 - \left( \frac{m \cdot \text{MSE}(\y, \hat{\y})}{\sum_{i=1}^m { (\y^\ip -  \bar{\y})^2}} \right) \\
& = & 1 - \left(  \frac{m \cdot \text{RMSE}(\y, \hat{\y})^2} { \sum_{i=1}^m { (\y^\ip -  \bar{\y})^2}} \right)
\end{array}
$$

In addition to changing the units of error, the $R^2$ metric has an interesting interpretation.

Consider a naive "baseline" model for prediction 
- predict $\bar{\y}$ for every value of $\x$
    - where $\bar{\y}$ is the average (over the training examples) of the target

The loss for the naive model is 
$$
\loss_\text{naive} = \text{MSE}(\y, \bar{\y})
$$

Then
$$
\begin{array}{lll}
R^2 & = & 1 - \left( \frac{m \cdot \text{MSE}(\y, \hat{\y})}{m \cdot \text{MSE}(\y, \bar{\y})}  \right) \\
& = & 1 - \frac{\loss}{\loss_\text{naive}}
\end{array}
$$

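Thus, $R^2$ is the *fractional reduction in loss* (often quoted as a percent) achieved by our model compared to the naive model that always predicts $\bar{\y}$.
- $R^2 = 1$ means the model's loss is zero; $R^2 = 0$ means the model does no better than the naive model.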
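
The equivalence between the two forms of $R^2$ can be verified on a toy example. This is a sketch in NumPy; the function name `r_squared` and the sample arrays are assumptions for illustration, and $\bar{\y}$ is computed over the same examples used to evaluate the loss.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 as 1 minus (residual sum of squares / total sum of squares)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)      # m * MSE(y, y_hat)
    ss_tot = np.sum((y - y.mean()) ** 2)   # m * MSE(y, y_bar)
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions.
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

# Loss of our model versus loss of the naive model that always predicts the mean.
loss_model = np.mean((y - y_hat) ** 2)     # 1.5
loss_naive = np.mean((y - y.mean()) ** 2)  # 5.0

r2 = r_squared(y, y_hat)
r2_from_losses = 1.0 - loss_model / loss_naive

# Both routes give the same number: a 70% reduction in loss.
assert np.isclose(r2, r2_from_losses)
```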


We now know our Loss function for the Linear Regression model.

Fitting the Linear Regression model solves for the
$\Theta^*$ that minimizes average loss

$$
\Theta^* = \argmin{\Theta} \loss_\Theta
$$

that is, the parameter values that minimize the MSE.
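
To make the minimization concrete, here is a sketch that solves for $\Theta^*$ with NumPy's least-squares solver, `np.linalg.lstsq`, on synthetic data. The data-generating values (`theta_true`, the noise scale, the sample size) are assumptions chosen for illustration; this is one way to obtain the minimizer, not necessarily the solver used elsewhere in these notebooks.

```python
import numpy as np

def mse(y, y_hat):
    """Average squared residual."""
    return np.mean((y - y_hat) ** 2)

# Synthetic data (assumed for illustration): y = 1 + 2x + noise.
rng = np.random.default_rng(0)
m = 200
x = rng.uniform(-1.0, 1.0, size=m)
X = np.column_stack([np.ones(m), x])   # constant feature for the intercept
theta_true = np.array([1.0, 2.0])
y = X @ theta_true + rng.normal(scale=0.1, size=m)

# Least squares returns the Theta that minimizes the sum of squared residuals,
# which is the same Theta that minimizes the MSE.
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# Moving away from theta_star in any direction increases the loss.
loss_star = mse(y, X @ theta_star)
loss_perturbed = mse(y, X @ (theta_star + np.array([0.05, -0.05])))
print(theta_star, loss_star, loss_perturbed)
```

Because the MSE is a convex quadratic in $\Theta$, any perturbation of `theta_star` yields a loss at least as large, which is what the comparison of `loss_star` and `loss_perturbed` illustrates.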

In [3]:
print("Done")

Done
