Sources: 
<BR>
Data100 Textbook: https://www.textbook.ds100.org/ch/10/modeling_intro.html
<BR>
Foundations of Statistical Inference slides by Sandrine Dudoit: http://www.ds100.org/sp19/assets/lectures/lec12/inference.pdf

In [1]:
import numpy as np
import pandas as pd

## Models:
- **model**: an idealized representation of a system; it allows us to make predictions based on the data used to create the model. A model will almost always make some incorrect predictions; *any model we create is an approximation of a real-world process*
- **population parameter**: a single value that represents some attribute of the large number of values within the population; it is representative of the population

## Loss: 
- **loss function**: a function that takes as input a value of $\Theta$ and the points in our dataset; it outputs a single number that is used to select the best value of $\Theta$ we can
- measures how well a given value of $\Theta$ fits the data
    - $\Theta$ denotes the population parameter
    - by convention, the loss function outputs lower values for preferable values of $\Theta$ and larger values for values of $\Theta$ that fit the data poorly

## Loss Functions:

## Mean Squared Error (MSE):
1. Select a value of $\Theta$
2. For each value in the dataset, take the squared difference between the value and $\Theta$
3. Compute the final loss by taking the average of the squared differences

In Python that looks like:

In [2]:
def mse_loss(theta, values):
    # Average of the squared differences between each value and theta
    return np.mean((values - theta) ** 2)

MSE penalizes values that are far from the center of the data.

#### Minimizing the MSE function: 
Since our loss function is a function of $\Theta$, we can take its derivative with respect to $\Theta$, set it equal to zero, and solve for the minimizing value.
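
As a sketch of that calculation, write the data points as $y_1, \dots, y_n$ and the minimizing value as $\hat{\Theta}$ (this notation is an assumption made here; the notes above do not name the individual points):

$$
\begin{aligned}
L(\Theta) &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \Theta)^2 \\
\frac{\partial L}{\partial \Theta} &= -\frac{2}{n} \sum_{i=1}^{n} (y_i - \Theta) \\
0 = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{\Theta}) \quad &\Longrightarrow \quad \hat{\Theta} = \frac{1}{n} \sum_{i=1}^{n} y_i
\end{aligned}
$$

The second derivative is $2 > 0$, so this stationary point is a minimum.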

**Observation**: For the constant model, the value of $\Theta$ that minimizes the MSE is the mean of the data values. There is always exactly one value of $\Theta$ that gives the least MSE, no matter what the data is.
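
A quick numerical check of this observation, as a minimal sketch: the data values and the grid of candidate $\Theta$ values below are made up for illustration, and a grid search is used only because it is easy to read, not because it is how the minimizer is normally found.

```python
import numpy as np

def mse_loss(theta, values):
    # Average of the squared differences between each value and theta
    return np.mean((values - theta) ** 2)

# Hypothetical data, chosen only for illustration
values = np.array([12.0, 13.0, 15.0, 16.0, 17.0])

# Evaluate the loss on a grid of candidate thetas (step size 0.01)
thetas = np.linspace(10, 20, 1001)
losses = np.array([mse_loss(t, values) for t in thetas])

# The grid point with the smallest loss should sit at the mean of the data
best_theta = thetas[np.argmin(losses)]
print(best_theta, np.mean(values))
```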

## Mean Absolute Error (MAE):
1. Select a value of $\Theta$
2. For each value in the dataset, take the absolute difference between the value and $\Theta$
3. Compute the final loss by taking the average of the absolute differences

In Python it looks like:

In [3]:
def mae_loss(theta, values):
    return np.mean(np.abs(theta - values))

#### Minimizing the MAE function:
Since our loss function is a function of $\Theta$, we can again take its derivative with respect to $\Theta$ and look for the minimizing value. The absolute value function is not differentiable at zero, but away from the data points the derivative is proportional to the number of points smaller than $\Theta$ minus the number of points larger than $\Theta$. For the derivative to equal zero, we need to pick a value of $\Theta$ that has the same number of smaller and larger points.

**Observation**: The value of $\Theta$ that minimizes the MAE is the median of the data points when we have an odd number of points. When we have an even number of points, the loss is minimized when $\Theta$ is any value between the two central points.
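
Both cases can be checked numerically. The sketch below uses two small made-up datasets (one with an odd and one with an even number of points) and a grid of candidate $\Theta$ values; all of these are illustrative assumptions.

```python
import numpy as np

def mae_loss(theta, values):
    # Average of the absolute differences between each value and theta
    return np.mean(np.abs(values - theta))

# Hypothetical data: odd count (median 15) and even count (central points 13 and 15)
odd_values = np.array([12.0, 13.0, 15.0, 16.0, 30.0])
even_values = np.array([12.0, 13.0, 15.0, 16.0])

# Odd case: the grid point with the smallest loss should be the median
thetas = np.linspace(10, 20, 1001)
odd_losses = np.array([mae_loss(t, odd_values) for t in thetas])
best_odd = thetas[np.argmin(odd_losses)]
print(best_odd, np.median(odd_values))

# Even case: the loss is the same everywhere between the two central points
flat_losses = [mae_loss(t, even_values) for t in (13.0, 14.0, 15.0)]
outside_loss = mae_loss(12.5, even_values)
print(flat_losses, outside_loss)
```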

#### MAE vs MSE:
- Since the MSE has squared terms, it is more sensitive to outliers; this makes sense because its minimizer, the mean, is more sensitive to outliers than the median is.
- MAE is less sensitive to outliers, just as its minimizer, the median, is less sensitive to outliers than the mean.
- MAE can have multiple $\Theta$ values that minimize the loss, but the MSE will always have one $\Theta$ value. 
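
The difference in outlier sensitivity can be seen by adding one extreme point to a small dataset and comparing how far each minimizer moves. The data and the outlier value of 100 below are made up for illustration.

```python
import numpy as np

# Hypothetical data, then the same data with one extreme point appended
values = np.array([12.0, 13.0, 15.0, 16.0, 17.0])
with_outlier = np.append(values, 100.0)

# How far the MSE minimizer (mean) and the MAE minimizer (median) move
mean_shift = np.mean(with_outlier) - np.mean(values)
median_shift = np.median(with_outlier) - np.median(values)

print("shift in mean:  ", mean_shift)
print("shift in median:", median_shift)
```

With these numbers the mean moves by more than 14 while the median moves by 0.5.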

## Huber Loss: MAE + MSE
**Huber Loss**: combines the MSE and the MAE into a loss that is differentiable and robust to outliers; it behaves like the MSE for $\Theta$ that are close to the minimizing value and switches to the MAE for $\Theta$ that are far from the minimizing value.

It has an additional parameter **$\alpha$** that is set to the point where the Huber loss transitions from MSE to MAE. 
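
One common way to write this down, with the data points denoted $y_1, \dots, y_n$ (notation assumed here for illustration), is:

$$
L_\alpha(\Theta) = \frac{1}{n} \sum_{i=1}^{n}
\begin{cases}
\frac{1}{2} (y_i - \Theta)^2 & \text{if } |y_i - \Theta| \le \alpha \\
\alpha \left( |y_i - \Theta| - \frac{1}{2} \alpha \right) & \text{otherwise}
\end{cases}
$$

At $|y_i - \Theta| = \alpha$ both branches equal $\frac{1}{2}\alpha^2$, so the two pieces join without a jump.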

In Python this (roughly) looks like:

In [6]:
def huber_loss(theta, alpha, values):
    total_sum = 0
    for point in values:
        if np.abs(point - theta) <= alpha:
            # Within alpha of theta: squared (MSE-like) penalty
            total_sum += ((point - theta) ** 2)/2
        else:
            # Farther than alpha from theta: linear (MAE-like) penalty
            total_sum += alpha*(np.abs(point - theta) - alpha/2)
    return total_sum / len(values)

Because differentiating the Huber loss is tedious and setting the derivative to zero does not give a simple formula for the minimizer, **gradient descent** is used to find the minimizing value of $\Theta$.
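
A minimal sketch of that idea follows. The function names `huber_loss_grad` and `minimize_huber`, the starting point, the learning rate, the number of steps, and the data are all assumptions chosen for illustration; they are not taken from the sources above.

```python
import numpy as np

def huber_loss_grad(theta, alpha, values):
    # Derivative of the average Huber loss with respect to theta.
    # Within alpha of a point the per-point derivative is (theta - point);
    # farther away it is alpha * sign(theta - point). Clipping covers both cases.
    return np.mean(np.clip(theta - values, -alpha, alpha))

def minimize_huber(values, alpha=1.0, learning_rate=0.5, n_steps=500):
    # Hypothetical settings: start at the mean and take a fixed number of steps
    theta = np.mean(values)
    for _ in range(n_steps):
        # Move theta a small step against the gradient
        theta = theta - learning_rate * huber_loss_grad(theta, alpha, values)
    return theta

# Hypothetical data with one outlier
values = np.array([12.0, 13.0, 15.0, 16.0, 17.0, 100.0])
theta_hat = minimize_huber(values)
print(theta_hat, np.mean(values), np.median(values))
```

For this data the result should settle near 15.5, far from the mean of about 28.8 that the outlier pulls upward.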