# Feature Engineering & Bias/Variance Tradeoff

<img src=https://i.redd.it/ngdmak09ha131.jpg width=400>


## Lesson objectives:

By the end of this lesson students will be able to:
- Understand how to create new variables using other variables in our dataset
- Understand and apply feature scaling techniques
- Understand what Bias and Variance are in terms of data modeling 
- Understand what model underfitting and overfitting are.

## Let's take a look at the Multiple Linear Regression below.

In [None]:
import pandas as pd
import numpy as np
from statsmodels.formula.api import ols
import seaborn as sns
from functions import * 
%matplotlib inline

baseball = pd.read_csv('data/baseball_height_weight.csv')
baseball.head()

In [None]:
# Let's run our OLS model for weight_lb on height_in and age

lr = ols(formula='weight_lb~age+height_in', data=baseball).fit()
lr.summary()

### Turn and Talk!

With your group:

1. Interpret the coefficient values from our model
2. Given this interpretation, which feature is more important for predicting the target variable?

# What is Feature Scaling?
Scaling data is the process of **increasing or decreasing the magnitude according to a fixed ratio.** You change the size but not the shape of the data.

In [None]:
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(baseball['height_in'].values.reshape(-1, 1))

ax1 = plt.subplot(1, 2, 1)
baseball['height_in'].hist(ax = ax1)
ax1.set_title('Original Data')

ax2 = plt.subplot(1, 2, 2)
ax2.hist(scaled_data)
ax2.set_title('Scaled Data')
plt.show()

^^ Notice that the scale of the x-axis changes, but NOT the shape of the distribution.

## Why do we need to use feature scaling?

- It lets us compare the magnitudes of coefficients, which makes the coefficients easier to interpret.
- It helps handle disparities in units.
- Some models use Euclidean distance in their computations.
- Some models require features to be on equivalent scales.
- In machine learning, it can improve model performance and keeps feature values from varying widely in magnitude.
- Some algorithms are sensitive to the scale of the data.

## When do we use feature scaling?
- In the preprocessing phase, before fitting the model.

## How do we perform feature scaling?

The common types of feature scaling are:

- Min/Max Scaling
- Standardization

# [MinMax Scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html?highlight=minmax#sklearn.preprocessing.MinMaxScaler)

\begin{align}
X_{norm} & = \frac{X - X_{min}}{X_{max}-X_{min}} \\
\end{align}
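
To make the formula concrete, here is a minimal sketch that applies it by hand and checks the result against `MinMaxScaler`. The heights are made-up values for illustration, not rows from the baseball data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical heights in inches (illustrative values only)
heights = np.array([68.0, 70.0, 72.0, 74.0, 78.0])

# Apply the min-max formula directly: (X - X_min) / (X_max - X_min)
manual_scaled = (heights - heights.min()) / (heights.max() - heights.min())

# MinMaxScaler expects a 2D array, so reshape to a single column
sklearn_scaled = MinMaxScaler().fit_transform(heights.reshape(-1, 1)).ravel()

print(manual_scaled)                               # smallest value maps to 0, largest to 1
print(np.allclose(manual_scaled, sklearn_scaled))  # both approaches agree
```

The shortest height becomes 0, the tallest becomes 1, and everything else keeps its relative position in between.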

In [None]:
# import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler object
minmax = MinMaxScaler()


# Select the features to scale
X = baseball[['height_in', 'age']]

# Scale our X variables
X_scaled = minmax.fit_transform(X)

baseball['scaled_height'] = X_scaled[:, 0]
baseball['scaled_age'] = X_scaled[:, 1]


# Using Ordinary Least Squares (OLS, from statsmodels), fit a model on the scaled features.
lr_scaled = ols(formula='weight_lb~scaled_age+scaled_height', data=baseball).fit()


# Output the summary of the model. We interpret the coefficients below.
lr_scaled.summary()

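Let's interpret this. After min-max scaling, a 1-unit increase means moving from the minimum to the maximum observed value of a feature:

- A 1-unit increase in scaled height is associated with an estimated +79.56 lbs of weight
- A 1-unit increase in scaled age is associated with an estimated +26.6 lbs of weight

Because both features now share the same 0-to-1 range, their coefficients are easier to compare with each other.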

# Now try applying Standardization to the same dataset

The most common method of scaling is standardization. In this method we center the data on its mean, then divide by the standard deviation so that the scaled variable has a standard deviation of one:

$X_{std} = \cfrac{X-\bar{X}}{s_X}$


With standardization we now interpret the coefficients in units of standard deviation of each variable!
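
As a quick illustration of the formula, the sketch below standardizes an array by hand. It deliberately uses made-up ages rather than the baseball data, so the exercise that follows is still yours to do. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical ages in years (illustrative values only)
ages = np.array([22.0, 25.0, 28.0, 31.0, 34.0])

# Center on the mean, then divide by the standard deviation.
# np.std defaults to ddof=0 (population standard deviation).
ages_std = (ages - ages.mean()) / ages.std()

print(ages_std)         # values are now in units of standard deviations
print(ages_std.mean())  # approximately 0
print(ages_std.std())   # approximately 1
```

A standardized value of 1 means "one standard deviation above the mean", which is why coefficients fitted on standardized features are read per standard deviation.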

### Your Turn

With your group:

1. Apply standardization to age and height variables using [`StandardScaler` in sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html?highlight=standard%20scaler#sklearn.preprocessing.StandardScaler).
2. Re-interpret the coefficient values
3. Given your interpretation of the coefficients, which feature is more important for predicting the target?

In [None]:
# your code here


Your Interpretation here:

# How do we implement scaling with sklearn and a test-train split?

In [None]:
# Import the tools we need
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


# Read in CSV and set X and y. 
baseball = pd.read_csv('data/baseball_height_weight.csv')
X = baseball[['height_in','age']]
y = baseball['weight_lb']

# Run the train-test split function.
X_train, X_test, y_train, y_test = train_test_split(X, y,               # Pass in our X and y
                                                    random_state=42,    # Arbitrarily chosen random_state, for reproducibility
                                                    test_size=.2        # Split test size to be 20% of full data.
                                                   )
# Instantiate the StandardScaler object.
ss = StandardScaler()

# Fit and Transform the training data.
X_train_transformed = ss.fit_transform(X_train)

# ONLY TRANSFORM the test data.
X_test_transformed = ss.transform(X_test)

# Instantiate the Linear Regression Object.
model = LinearRegression()

# Fit the model to the transformed X_train, and the y_train.
model.fit(X_train_transformed, y_train)

# Score the model based on the transformed X_test and the y_test.
model.score(X_test_transformed, y_test)
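
The same fit-on-train, transform-on-test workflow can also be bundled into a scikit-learn `Pipeline`, which makes it harder to accidentally fit the scaler on the test set. The sketch below is one possible way to do this, not the only one. It uses synthetic data (the column meanings and coefficients are made up) so that it runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the baseball data (assumed values, for illustration)
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.normal(74, 2, size=200),   # pretend height in inches
    rng.normal(28, 4, size=200),   # pretend age in years
])
y = 5 * X[:, 0] + 1 * X[:, 1] - 200 + rng.normal(0, 5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.2
)

# fit() learns the scaler's mean and standard deviation from X_train only,
# then fits the regression on the scaled training features
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)

# score() transforms X_test with the already-fitted scaler before predicting
r2_test = pipe.score(X_test, y_test)
print(r2_test)
```

Because the scaler lives inside the pipeline, there is no separate `fit_transform` call that could be pointed at the wrong data.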

# Check out these visualizations to better wrap your head around the effects of scaling. 

In [None]:
from functions import * 
make_plot(0) # Unscaled Data
make_plot(1) # Standard Scaled
make_plot(2) # MinMax Scaled

### Your Turn!

In your group:

1. Create a linear regression model to predict gross revenue. Use the variables `cast_total_facebook_likes` and `budget` as your features.
2. Using a test-train split, run your model on your raw training data.
3. Run your model again using standard scaling of your training features.
4. Run your model again using min-max scaling of your training features.
5. What changed in the model summary across these three models? Was R-squared impacted? Coefficients? P-values for features?

In [None]:
# your code here

## Creating new variables


Sometimes we want to create a new feature in our dataset. For example, we might want to create a ratio from two of our columns. At other times we might want to bin, or categorize, a continuous variable.

Let's do each of these!

First let's create a ratio of years played/age of each player to see the proportion of their life they have played baseball.

In [None]:
baseball['proportion_age_played'] = baseball.years_played/baseball.age

In [None]:
baseball.head()

We might also want to test a hypothesis that older players weigh more than younger players. In order to do this we would need to bin our age variable.
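
One way to bin a continuous variable is `pd.cut`. The sketch below uses a small made-up table, and the cut points (30 and 35) and labels are assumptions chosen only for illustration, not a recommendation for the baseball data.

```python
import pandas as pd

# Hypothetical players (illustrative values only)
players = pd.DataFrame({
    'age': [22, 27, 30, 33, 38],
    'years_played': [1, 5, 8, 10, 16],
})

# Ratio feature: share of each player's life spent playing
players['proportion_age_played'] = players['years_played'] / players['age']

# Bin age into labeled categories; by default each bin includes its right edge
players['age_group'] = pd.cut(
    players['age'],
    bins=[0, 30, 35, 100],            # assumed cut points
    labels=['young', 'mid', 'old'],   # assumed labels
)

print(players)
```

Note that a player aged exactly 30 lands in the first bin here, because `pd.cut` closes bins on the right unless told otherwise. Looking at the histogram of the real ages is a good way to pick sensible cut points.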

In [None]:
baseball.age.hist()

### Turn and talk

1. How should we bin this variable? What would we consider to be young vs. old?
2. Create your bins!
3. Run another regression model using your binned age variable and your new variable about the proportion of time played as features. Remember to take out your original age predictor!
4. Interpret the coefficients of your features. Examine the R-squared value. Is this model a better fit than our previous model?

In [None]:
# your code here

## Bias Variance Tradeoff

<img src='https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQpAMlAGFzVmJ0es77DsYx9qE0dhObJbsNL3A&usqp=CAU' width =300>


![which model is better](img/which_model_is_better.png)

https://towardsdatascience.com/cultural-overfitting-and-underfitting-or-why-the-netflix-culture-wont-work-in-your-company-af2a62e41288


### What makes a model good?

- We don’t ultimately care about how well your model fits your data.

- What we really care about is how well your model describes the process that generated your data.

- Why? Because the data set you have is but one sample from a universe of possible data sets, and you want a model that would work for any data set from that universe.

## Defining Error: prediction error and irreducible error



### Regression fit statistics are often called “error”
 - Sum of Squared Errors (SSE)

 $ \operatorname{SSE} = \sum_{i=1}^{n} (Y_{i} - \hat{Y}_{i})^{2} $

 - Mean Squared Error (MSE)

 $ \operatorname{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_{i} - \hat{Y}_{i})^{2} $

 - Root Mean Squared Error (RMSE)

 $ \operatorname{RMSE} = \sqrt{\operatorname{MSE}} $

 All are calculated using residuals    
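
The three statistics can be computed directly from the residuals. The sketch below uses made-up observed and predicted weights, purely to show the arithmetic.

```python
import numpy as np

# Hypothetical observed and predicted weights in pounds (illustrative only)
y_true = np.array([180.0, 200.0, 210.0, 190.0])
y_pred = np.array([185.0, 195.0, 205.0, 200.0])

residuals = y_true - y_pred      # Y_i - Y_hat_i for each observation

sse = np.sum(residuals ** 2)     # Sum of Squared Errors
mse = sse / len(y_true)          # Mean Squared Error
rmse = np.sqrt(mse)              # Root Mean Squared Error, back in pounds

print(sse, mse, rmse)
```

RMSE is often the easiest of the three to communicate because it is in the same units as the target.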

![residuals](img/residuals.png)


## This error can be broken up into parts:

![defining error](img/defining_error.png)

There will always be some random, irreducible error inherent in the data.  Real data always has noise.

The goal of modeling is to reduce the prediction error, which is the difference between our model and the real-world process that generated our data.

### Prediction error is a combination of bias and variance


$ Total\ Error = Prediction\ Error + Irreducible\ Error$

Our prediction error can be further broken down into error due to bias and error due to variance.

$ Total\ Error = Model\ Bias^2 + Model\ Variance + Irreducible\ Error$



**Model Bias** is the expected prediction error of the expected trained model.

> In other words, if you were to train multiple models on different samples, what would be the average difference between the prediction and the real value?

**Model Variance** is the expected variation in predictions, relative to your expected trained model.

> In other words, what would be the average squared difference between any one model's prediction and the average of all the predictions?



### Thought Experiment

1. Imagine you've collected 23 different training sets for the same problem.
2. Now imagine training one model on each of your 23 training sets.
3. Bias vs. variance refers to the accuracy vs. consistency of the models trained by your algorithm.

![target_bias_variance](img/target.png)

http://scott.fortmann-roe.com/docs/BiasVariance.html
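
The thought experiment can be simulated directly. The sketch below is one possible setup, and every detail of it is an assumption made for illustration: the true process is a sine curve with Gaussian noise, all 23 training sets share the same 30 x-locations and differ only in their noisy y-values, and we compare a straight line (degree 1) against a degree-9 polynomial at a single query point.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    """Assumed data-generating process (unknown in practice)."""
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)   # shared x-locations for every training set
x0 = 0.25                         # query point where we compare predictions
n_sets = 23                       # number of training sets in the experiment
noise_sd = 0.3                    # assumed noise level

results = {}
for degree in (1, 9):
    preds = []
    for _ in range(n_sets):
        # Each training set is a fresh noisy sample from the same process
        y_train = true_f(x_train) + rng.normal(0, noise_sd, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds.append(np.polyval(coefs, x0))
    preds = np.array(preds)

    bias_sq = (preds.mean() - true_f(x0)) ** 2   # accuracy of the average model
    variance = preds.var()                       # consistency across the 23 models
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

Under these assumptions the straight line should show the larger squared bias (it cannot bend to follow the sine curve), while the degree-9 polynomial should show the larger variance (its prediction moves more from one training set to the next).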



### Explore Bias Variance Tradeoff

**High bias** algorithms tend to be less complex, with simple or rigid underlying structure.

+ They train models that are consistent, but inaccurate on average.
+ These include linear or parametric algorithms such as regression and naive Bayes.
+ For linear regression, some assumptions about our feature set could lead to high bias:
    - We did not include the correct predictors
    - We did not take interactions into account
    - We missed a non-linear relationship (for example, a polynomial term)
      
High bias models are **underfit**

On the other hand, **high variance** algorithms tend to be more complex, with flexible underlying structure.

+ They train models that are accurate on average, but inconsistent.
+ These include non-linear or non-parametric algorithms such as decision trees and nearest neighbors.
+ For linear regression, perhaps we included an unreasonably large number of predictors, for example through polynomial terms.
+ High variance models are modeling the noise in our data

High variance models are **overfit**



While we build our models, we have to keep this relationship in mind.  If we build complex models, we risk overfitting our models.  Their predictions will vary greatly when introduced to new data.  If our models are too simple, the predictions as a whole will be inaccurate.   

The goal is to build a model with enough complexity to be accurate, but not so much that it becomes erratic.

![optimal](img/optimal_bias_variance.png)
http://scott.fortmann-roe.com/docs/BiasVariance.html



![which_model](img/which_model_is_better_2.png)

### Use of Test-Train Split and Cross Validation

It is hard to know if your model is too simple or too complex by just using it on the training data. This is why we withhold some of our data as a test set!

Using our model on the test data allows us to evaluate whether it has the right balance of variance vs. bias.

**How do we know if our model is overfitting or underfitting?**


If our model is not performing well on the training data, we are probably underfitting it.


To know if our model is overfitting the data, we need to test it on unseen data and measure its performance there.

If the model performs much worse on the unseen data than on the training data, it is probably overfitting.
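
This train-versus-unseen comparison can be sketched with polynomial fits of increasing complexity. Everything in the setup is assumed for illustration: a sine-shaped process with Gaussian noise, a small training sample of 15 points, and a held-out sample of 200 points from the same process.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Draw n noisy observations from an assumed sine-shaped process."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return np.mean((y_true - y_pred) ** 2)

x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(200)     # unseen data from the same process

train_mse, test_mse = {}, {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)   # fit on training data only
    train_mse[degree] = mse(y_train, np.polyval(coefs, x_train))
    test_mse[degree] = mse(y_test, np.polyval(coefs, x_test))
    print(f"degree {degree}: train MSE = {train_mse[degree]:.3f}, "
          f"test MSE = {test_mse[degree]:.3f}")
```

Training error can only go down as the polynomial degree increases, so it cannot tell us when to stop adding complexity. A model whose test error is much larger than its training error is the overfitting signature described above, and a model whose training error is already high is the underfitting one.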

<img src='https://developers.google.com/machine-learning/crash-course/images/WorkflowWithTestSet.svg' width=500/>

### Should you ever fit on your test set?  


![no](https://media.giphy.com/media/d10dMmzqCYqQ0/giphy.gif)


**Never fit on test data.** If you are seeing surprisingly good results on your evaluation metrics, it might be a sign that you are accidentally training on the test set. 

