# Models

- Models explain and predict (through quantifying relationships)
- Models are approximations (i.e., they are not perfect representations)

### Predictions Don't Have To Be Accurate to be Useful
Our models are rarely precise, because the relationships between the quantities we measure are rarely perfect. We can still use them to make a reasonable guess.

* They have to generalize well to be useful
* Real life data comes installed with lots of unexpected variation
* Nothing in life is 100% certain, not even relationships 🙄

# Linear Regression

* Linear regression is a method to determine the coefficients of linear relationships

## Simple linear regression

Simple linear regression is an approach for predicting a **continuous response** using a **single feature**. It takes the following form:

$y = \beta_0 + \beta_1x$

- $y$ is the response
- $x$ is the feature
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for x

$\beta_0$ and $\beta_1$ are called the **model coefficients**:

- We must "learn" the values of these coefficients to create our model.
- And once we've learned these coefficients, we can use the model to predict **something**.

#### Estimating ("learning") model coefficients
- Coefficients are estimated during the model fitting process using the least squares criterion.
- We find the line (mathematically) which minimizes the sum of squared residuals (or "sum of squared errors").

_Residuals: The difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual._
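The least squares fit described above can be sketched in plain Python. This is a minimal illustration, not the course's implementation: the function name `fit_simple_linear_regression` and the five data points are made up for the example. It uses the standard closed-form solution for a single feature, then computes one residual per data point.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (beta_0, beta_1) minimizing the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form least squares solution for a single feature
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    beta_1 = s_xy / s_xx
    beta_0 = y_mean - beta_1 * x_mean
    return beta_0, beta_1


# Made-up illustrative data (not from the lecture)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta_0, beta_1 = fit_simple_linear_regression(xs, ys)

# Predicted values and residuals (observed minus predicted), one per point
y_hat = [beta_0 + beta_1 * x for x in xs]
residuals = [y - yh for y, yh in zip(ys, y_hat)]

print("intercept:", round(beta_0, 4))
print("slope:", round(beta_1, 4))
print("residuals:", [round(e, 4) for e in residuals])
```

Because the model includes an intercept, the residuals of the fitted line sum to (numerically) zero.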

## Bias Variance

We want to **minimize the predictive error of our models**. (I.e. we need an objective function.) How do we quantify the error in our model?

#### Sum of Squared Errors (SSE)

$$ SSE = \sum_{i=1}^{n}(y_i - f(x_i))^2 = \sum_{i=1}^{n}(y_i - \hat y_i)^2 $$

$x_i$ -- a given x value

$y_i$ -- actual y value

$f(x_i)$ -- the model's predicted y value

$\hat y_i$ -- predicted y value, shorthand for $f(x_i)$
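As a small worked example of the SSE formula, the sketch below scores two candidate lines against the same data. The data and the two sets of coefficients are invented for illustration; the helper names `sse` and `f` are assumptions of this example.

```python
def f(x, beta_0, beta_1):
    """A linear model's prediction for a single x value."""
    return beta_0 + beta_1 * x


def sse(ys, y_hats):
    """Sum of squared errors between actual and predicted y values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))


# Made-up illustrative data
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 8.0]

# Two candidate models for the same data
candidate_a = [f(x, 1.0, 2.0) for x in xs]  # y = 1 + 2x
candidate_b = [f(x, 0.0, 3.0) for x in xs]  # y = 3x

sse_a = sse(ys, candidate_a)
sse_b = sse(ys, candidate_b)

print("SSE for y = 1 + 2x:", sse_a)
print("SSE for y = 3x:", sse_b)
```

The candidate with the lower SSE is the better fit to this data under the least squares criterion.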

#### SSE can be decomposed<sup>1</sup> into error due to Bias and Variance

$$SSE \sim E\left[(y_i - \hat{f}(x_i))^2\right] = Var(\hat{f}(x_i)) + [Bias(\hat{f}(x_i))]^2 + Var(\epsilon)$$

<sup>1</sup>See the derivation of this result [here](https://theclevermachine.wordpress.com/tag/bias-variance-decomposition/)

#### Bias and Variance

**Bias**
* Bias is the error that comes from your model's assumptions about the shape of the data: the model consistently gets it wrong in the same way, even when it is refit on new sample data.

**Variance**
* Imagine building your model many times, on different slices of data. Variance measures how much your predictions for a given $x_i$ differ from one fit to the next.
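The "build your model many times" idea can be simulated directly. The sketch below is an illustration under stated assumptions, not the lecture's code: the true relationship `true_f`, the noise level, and the choice of models (a constant model versus a fitted line) are all invented for the example. It refits each model on many freshly drawn datasets and measures the bias and variance of the prediction at one point `x0`.

```python
import random


def true_f(x):
    # Assumed "true" relationship for this simulation (hypothetical)
    return 0.2 * x


def fit_constant(xs, ys):
    """Very simple model: always predict the mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Simple linear regression fitted by least squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    beta_1 = s_xy / s_xx
    beta_0 = y_mean - beta_1 * x_mean
    return lambda x: beta_0 + beta_1 * x


def bias_and_variance(fit, x0, n_datasets=2000, n_points=20,
                      noise_sd=0.5, seed=0):
    """Refit on many simulated datasets; summarize predictions at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_datasets):
        xs = [rng.uniform(0.0, 1.0) for _ in range(n_points)]
        ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
        preds.append(fit(xs, ys)(x0))
    mean_pred = sum(preds) / len(preds)
    bias = mean_pred - true_f(x0)  # systematic error at x0
    variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
    return bias, variance


bias_const, var_const = bias_and_variance(fit_constant, x0=1.0)
bias_line, var_line = bias_and_variance(fit_line, x0=1.0)

print("constant model: bias^2 =", bias_const ** 2, "variance =", var_const)
print("linear model:   bias^2 =", bias_line ** 2, "variance =", var_line)
```

In this setup the constant model is consistently off at `x0` (higher bias) but barely changes between datasets (lower variance), while the fitted line is right on average but moves more from one dataset to the next.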

#### The tradeoff:

![](https://camo.githubusercontent.com/be96d619bff8883343cf541ed1405a8f7f5991cc/68747470733a2f2f75706c6f61642e77696b696d656469612e6f72672f6d6174682f632f622f632f63626336353331306430396136656661363330643863316633336364666138382e706e67)
![](https://camo.githubusercontent.com/34d8f46b4220c71b359f55db15ed9124474b397d/687474703a2f2f73636f74742e666f72746d616e6e2d726f652e636f6d2f646f63732f646f63732f4269617356617269616e63652f6269617376617269616e63652e706e67)

#### Example variance and bias code 

In [None]:
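#from w3d2-bias-variance-lecture
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Assumes `data` (a list of (x, y) pairs) is already loaded, as in that lecture.

def monomials(x, degree):
    # Assumed stand-in for the lecture's helper: columns x, x^2, ..., x^degree
    return np.column_stack([x ** k for k in range(1, degree + 1)])

domain = np.array([x[0] for x in data]) # The x values we're "observing"
Y = np.array([x[1] for x in data]) # The values we are trying to predict
order = np.argsort(domain) # Sort order, so the fitted curve plots left to right

for i in range(1, 10):
    X = monomials(domain, i)
    # Create linear regression object and fit it to X and Y
    regr = linear_model.LinearRegression()
    regr.fit(X, Y)

    yhat = regr.predict(X)
    mse = np.mean((yhat - Y) ** 2) # Mean squared error on the training data
    var = np.var(yhat) # Spread of the fitted values
    # Rough in-sample proxy for bias: spread of Y around the mean prediction,
    # minus the spread of the predictions, minus 0.01 (presumably the noise variance)
    bias = np.mean((np.mean(yhat) - Y) ** 2) - var - 0.01

    # The coefficients (one per monomial, so print the whole array)
    print('Coefficients:', np.round(regr.coef_, 4))
    # R^2 score: 1 is perfect prediction
    print('R^2 score: %.2f' % regr.score(X, Y))

    # The mean squared error
    print("Mean squared error: %.2f" % mse)

    print("Bias: {bias}".format(bias=bias))
    print("Variance: {var}".format(var=var))

    # Plot outputs
    plt.scatter(domain, Y,  color='black')
    plt.plot(domain[order], yhat[order], color='blue', linewidth=3)

    plt.title("n = " + str(i))

    plt.show()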

## Regularization (theory)

#### Regularization

* Regularization helps avoid overfitting by limiting model complexity
* Mathematically this works by penalizing models with greater complexity
* Regularized models will often fit alternate datasets better than a model that's been overfit on the training data
  * What if minimizing the error gives us a model that is great on our training data, but not so great on out-of-sample data?
  * When might this happen? When we have **too complex a model**, for example when there are few samples compared to the number of features (predictors)
  * With a large number of features, we can end up with a model that fits our training set well but fails to generalize to our test data
* To remedy this, we can make a **tradeoff**: we can increase the bias of our model in order to reduce the variance
  * It can make sense to do this when the increase in bias is small compared to the decrease in variance
* We have two choices:
  * Remove selected features manually - problem is this may remove some valuable info
  * **Regularization** - keep all features, but reduce their "pull" in the model

#### Model Complexity to Prediction Error Chart

<img src="http://i.imgur.com/C9EmsUV.png" width=500>

#### Penalty

* This "pull" reduction is accomplished by constraining the features' coefficients, which is what **regularization** does
* That is, we impose a cost or a **penalty** on having a high $\beta$
* This means that we have moved away from the minimum error (increased bias) in order to reduce the variance of our model
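
#### Cost Function
* We minimize a cost function
* In a linear model, the cost function is the sum of squared errors
* Cost: the sum of squares, which measures fit to the training data (minimizing it alone lowers bias)
* Penalty: the size of the coefficients, which regularization restrains (this lowers variance)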

#### Cost Penalty Chart

<img src="http://i.imgur.com/CkRe7Ru.png" width=400>

#### Penalty Formula

Previously, we minimized the sum of squared errors. Now we minimize that plus the size of the coefficients times some weight (called either lambda or alpha). Theta here is the combined size of all our coefficients.

$$ \min \sum_{i=1}^{n} ( y_{i} - \hat{y}_i )^2 + \lambda \theta $$

$\lambda$ may also be called $\alpha$.
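To see how the weight $\lambda$ changes the solution, here is a minimal sketch for a single feature with no intercept. It assumes the size of the coefficient is measured as $\beta^2$ (the L2 choice), which is one option and not stated by the formula above. The function names and data are invented for the example. Under that assumption the minimizer has the closed form $\beta = \sum x_i y_i / (\sum x_i^2 + \lambda)$.

```python
def penalized_cost(xs, ys, beta, lam):
    """Sum of squared errors plus lam times the squared coefficient."""
    sse = sum((y - beta * x) ** 2 for x, y in zip(xs, ys))
    return sse + lam * beta ** 2


def penalized_slope(xs, ys, lam):
    """Slope that minimizes penalized_cost (closed form, no intercept)."""
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_xx = sum(x * x for x in xs)
    return s_xy / (s_xx + lam)


# Made-up, roughly centered illustrative data
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

lambdas = [0.0, 1.0, 10.0, 100.0]
betas = [penalized_slope(xs, ys, lam) for lam in lambdas]

for lam, beta in zip(lambdas, betas):
    cost = penalized_cost(xs, ys, beta, lam)
    print("lambda =", lam, "beta =", round(beta, 4), "cost =", round(cost, 4))
```

With $\lambda = 0$ this is ordinary least squares; as $\lambda$ grows, the fitted coefficient shrinks toward zero.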

## Lasso (L1) and Ridge (L2) - most common penalties

#### L1 vs L2

<img src="http://i.imgur.com/c8BZW3i.png" width=700>
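A practical difference between the two penalties shows up even with one feature. The sketch below is a simplified illustration (single feature, no intercept, invented data and function names), using the standard closed-form minimizers: Ridge adds $\lambda \beta^2$ to the sum of squared errors, Lasso adds $\lambda |\beta|$.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of sum((y - beta*x)^2) + lam * beta^2."""
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_xx = sum(x * x for x in xs)
    return s_xy / (s_xx + lam)


def lasso_slope(xs, ys, lam):
    """Minimizer of sum((y - beta*x)^2) + lam * |beta| (soft thresholding)."""
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_xx = sum(x * x for x in xs)
    # Shrink |s_xy| by lam / 2 and clip at zero
    shrunk = max(abs(s_xy) - lam / 2.0, 0.0)
    sign = 1.0 if s_xy >= 0 else -1.0
    return sign * shrunk / s_xx


# Made-up, roughly centered illustrative data
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

for lam in [0.0, 10.0, 50.0]:
    print("lambda =", lam,
          "ridge =", round(ridge_slope(xs, ys, lam), 4),
          "lasso =", round(lasso_slope(xs, ys, lam), 4))
```

Ridge shrinks the coefficient smoothly but never makes it exactly zero, while Lasso sets it to exactly zero once the penalty is large enough. That is why Lasso can drop features from a model.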

## Loss Functions 

In [None]:
# Regression Metrics Loss Functions

# Cross Validation

## Gradient Descent

# Logistic regression