# Lesson 5 Review and Lesson 6 Prework

## Lesson 5 Review and Coding Tips

1. Coding: Pandas' Merge, Concat and Append

## Lesson 6 Topics
1. Regularization (Lasso, Ridge, Elastic Net)
2. Bias-Variance Trade Off
3. Error Metrics: MSE, RMSE, R-Squared

In [2]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

%matplotlib inline

## Lesson 5 Review Topics and Coding Tips

### Merging

Pandas' merge function works much like a join in a relational (SQL) database. Look at the parameters and you'll see why.

pd.merge(left_df, right_df, how='inner', on='key_column')

Here `how` is one of 'inner', 'outer', 'left', or 'right' (the join type), and `on` names the column (or list of columns) to join on. There are more parameters, but basic usage of merge needs only these four. Here's a quick example.

In [3]:
df1 = pd.DataFrame({'Names': ['Reid', 'Edward', 'Matt', 'MartyMcFly'],
                    'is_GA': ['Yes', 'Yes', 'Yes', 'No'],
                    'Movies': [0, 0, 0, 1]})

df2 = pd.DataFrame({'Names': ['Reid', 'Edward', 'Matt', 'MartyMcFly'],
                    'Num_of_Glasses': [2, 2, 0, 1],
                    'Steps_to_get_to_class': [3000, 8000, 5000, 2000]})

In [21]:
df1

Unnamed: 0,Movies,Names,is_GA
0,0,Reid,Yes
1,0,Edward,Yes
2,0,Matt,Yes
3,1,MartyMcFly,No


In [4]:
df2

Unnamed: 0,Names,Num_of_Glasses,Steps_to_get_to_class
0,Reid,2,3000
1,Edward,2,8000
2,Matt,0,5000
3,MartyMcFly,1,2000


In [7]:
big_df = pd.merge(df1, df2, how='inner', on='Names')
big_df

Unnamed: 0,Movies,Names,is_GA,Num_of_Glasses,Steps_to_get_to_class
0,0,Reid,Yes,2,3000
1,0,Edward,Yes,2,8000
2,0,Matt,Yes,0,5000
3,1,MartyMcFly,No,1,2000
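
In the example above both frames hold the same four names, so every value of `how` would return the same rows. As a minimal sketch of where the join types differ, the frames `left` and `right` below are hypothetical and share only two names ('Doc' is a made-up entry for illustration).

```python
import pandas as pd

# Hypothetical frames: 'Matt' appears only on the left, 'Doc' only on the right.
left = pd.DataFrame({'Names': ['Reid', 'Edward', 'Matt'],
                     'is_GA': ['Yes', 'Yes', 'Yes']})
right = pd.DataFrame({'Names': ['Reid', 'Edward', 'Doc'],
                      'Num_of_Glasses': [2, 2, 1]})

inner = pd.merge(left, right, how='inner', on='Names')     # names found in both frames
outer = pd.merge(left, right, how='outer', on='Names')     # names found in either frame
left_join = pd.merge(left, right, how='left', on='Names')  # every name from the left frame

# Row counts differ by join type; unmatched cells are filled with NaN.
print(len(inner), len(outer), len(left_join))
```

With `how='left'`, the row for 'Matt' is kept and its `Num_of_Glasses` value is missing (NaN), because there is no matching row on the right.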


### Concat
Concat is another way to join tables together, this time more similar to SQL's UNION ALL (duplicate rows are kept). It is more flexible, though, because you can choose which axis to concat on: axis=0 stacks the frames as rows and axis=1 places them side by side as columns.

In [8]:
df1 = pd.DataFrame({'Animals': ['Bear', 'Cat', 'Dog', 'Bird'],
                    'Pet': ['No', 'Yes', 'Yes', 'No'],
                    'Size': ['Big', 'Medium', 'Medium', 'Small']})

df2 = pd.DataFrame({'Animals': ['Turtle', 'Cheetah', 'Eagle', 'Beaver'],
                    'Pet': ['Yes', 'No', 'No', 'No'],
                    'Size': ['Small', 'Big', 'Medium', 'Medium']})

df3 = pd.DataFrame({'Animals': ['Rat', 'Hampster', 'Porcupine', 'Pig'],
                    'Pet': ['Yes', 'Yes', 'No', 'Yes'],
                    'Size': ['Small', 'Small', 'Medium', 'Medium']})

In [16]:
results = pd.concat([df1, df2, df3], axis=0)
results

Unnamed: 0,Animals,Pet,Size
0,Bear,No,Big
1,Cat,Yes,Medium
2,Dog,Yes,Medium
3,Bird,No,Small
0,Turtle,Yes,Small
1,Cheetah,No,Big
2,Eagle,No,Medium
3,Beaver,No,Medium
0,Rat,Yes,Small
1,Hampster,Yes,Small
2,Porcupine,No,Medium
3,Pig,Yes,Medium


In [9]:
results = pd.concat([df1, df2, df3], axis=1)
results

Unnamed: 0,Animals,Pet,Size,Animals.1,Pet.1,Size.1,Animals.2,Pet.2,Size.2
0,Bear,No,Big,Turtle,Yes,Small,Rat,Yes,Small
1,Cat,Yes,Medium,Cheetah,No,Big,Hampster,Yes,Small
2,Dog,Yes,Medium,Eagle,No,Medium,Porcupine,No,Medium
3,Bird,No,Small,Beaver,No,Medium,Pig,Yes,Medium


### Append
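Append is a shortcut for concat on axis=0. We will use the same dataframes as in the concat example. Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions use `pd.concat` instead.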

In [10]:
# Method calls can be chained together. They run left to right, so keep them in the order you want them applied.
results = df1.append(df2).append(df3)
results

Unnamed: 0,Animals,Pet,Size
0,Bear,No,Big
1,Cat,Yes,Medium
2,Dog,Yes,Medium
3,Bird,No,Small
0,Turtle,Yes,Small
1,Cheetah,No,Big
2,Eagle,No,Medium
3,Beaver,No,Medium
0,Rat,Yes,Small
1,Hampster,Yes,Small
2,Porcupine,No,Medium
3,Pig,Yes,Medium
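
On pandas versions that no longer have `append`, the same row stacking can be sketched with `pd.concat`. The frames `a` and `b` below are small hypothetical stand-ins for the animal tables. Passing `ignore_index=True` also replaces the repeated 0, 1, ... labels seen in the outputs above with one fresh running index.

```python
import pandas as pd

# Hypothetical two-row stand-ins for the animal frames.
a = pd.DataFrame({'Animals': ['Bear', 'Cat'], 'Pet': ['No', 'Yes']})
b = pd.DataFrame({'Animals': ['Turtle', 'Cheetah'], 'Pet': ['Yes', 'No']})

# Plain stacking keeps each frame's original index labels.
stacked = pd.concat([a, b], axis=0)

# ignore_index=True discards the old labels and renumbers the rows.
renumbered = pd.concat([a, b], axis=0, ignore_index=True)

print(list(stacked.index))     # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```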


## Lesson 6

### Bias-Variance Trade Off

Every data scientist will tell you that the bias-variance trade off is important to understand. First off, what is bias and what is variance?

1. Bias - Error from a model that is too simple to capture the real pattern in the data. In effect: I've accepted that my model will not perfectly model the dataset.
2. Variance - Error from a model that is too sensitive to the particular "sample/training" set it saw. In effect: I will try to account for every data point in my current training set.

What does it mean?

High bias means that you are **underfitting**. You keep only the most important feature(s) and leave out other features that could improve your model's explanation of the data. This is a problem because you want to get the best explanation.

High variance means that you are **overfitting**. You accommodate every single data point, noise included, so the model picks up features you don't want, which confuse or hurt its predictions on new data.

The trade off is to find the best balance point between bias and variance, but how?

Too much variance:

- Regularized regression models, such as Lasso and Ridge (explained later), lower variance at the cost of some added bias.
- Dimensionality Reduction (unsupervised learning) and Feature Selection simplify the model, which also lowers variance and adds bias.
- The ensemble method Bagging averages many 'strong' (low bias, high variance) learners to reduce variance.

Too much bias:

- Increase the number of features.
- The ensemble method Boosting combines many weak, high bias models in sequence to get a lower bias.

It is also worth mentioning that every algorithm has its own bias and variance profile. k-Nearest Neighbors is high bias when k is large and high variance when k is small. Decision Trees can be high variance depending on how deep you let the tree grow (for example, how many leaf nodes you allow in your parameters). This is where research and analysis come into play. As a scientist, you'll need to go back and forth to find the best trade off point. There is no single "right" answer.

Some of these methods we haven't learned yet, but don't fret, we will eventually get there. Just know for now that bias-variance is important and there are ways to balance it out.
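
To make underfitting and overfitting a bit more concrete, here is a minimal sketch, not a recipe. It assumes a made-up dataset (a noisy sine curve) and uses polynomial degree as a stand-in for model complexity; the helper names `make_data` and `mse` are hypothetical. A low degree plays the high bias model and a high degree plays the high variance model.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    # Assumed toy data: a sine curve plus Gaussian noise.
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 3, 12):
    # Least squares polynomial fit of the given degree on the training set.
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse[degree] = mse(y_train, model(x_train))
    test_mse[degree] = mse(y_test, model(x_test))
    print(degree, round(train_mse[degree], 3), round(test_mse[degree], 3))
```

The training error can only go down as the degree goes up, since each larger polynomial contains the smaller ones. The number to watch is the test error: the straight line (degree 1) is too simple for a sine curve, and a very flexible fit tends to chase the noise in the 30 training points.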

### Error Metrics: Loss Function, MSE, RMSE, R-Squared

1. Loss Function - The loss function is what the regression process minimizes. Think of the term "loss function" like this: the greater the value, the more information about your target variable is "lost" by your model. A common loss function is the least squares loss function, where a larger sum of squared errors (loss) means a worse fit between your predictions and your target variable.
2. Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) - https://www.vernier.com/til/1014/
3. R-Squared - http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit
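
As a small worked example of the three metrics, the sketch below computes them by hand with NumPy and checks them against scikit-learn's `mean_squared_error` and `r2_score`. The `y_true` and `y_pred` values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 10.0])

rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
mse = rss / len(y_true)                      # mean squared error
rmse = np.sqrt(mse)                          # root mean squared error, in the units of y
tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2 = 1 - rss / tss                           # share of the variance explained

print(mse, rmse, r2)
print(mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred))
```

Here the squared errors are 0.25, 0, 0.25 and 1, so the MSE is 1.5 / 4 = 0.375, and R-squared is 1 - 1.5 / 20 = 0.925.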

### Regularization with Lasso, Ridge, Elastic Net

The concept of regularization is adding a "penalty" on the size of the coefficients to the sum of squared errors that standard regression minimizes.

In other words, there are additional components to the loss function, so the minimization becomes a balance between these components. 

The most common types of regularization are the **"Lasso"**, **"Ridge"**, and the **Elastic Net**.
    
##### First off, let's hurt some of your brains.

Least Squares Loss function formula:
### $$ \text{minimize}\; RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p\beta_j x_{ij}\right)\right)^2 $$

This is the Lasso formula:
### $$ \text{minimize}\; RSS + Lasso = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p\beta_j x_{ij}\right)\right)^2 + \lambda_1\sum_{j=1}^p |\beta_j|$$

This is the Ridge formula:
### $$ \text{minimize}\; RSS+Ridge = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p\beta_j x_{ij}\right)\right)^2 + \lambda_2\sum_{j=1}^p \beta_j^2$$

And finally, the ElasticNet formula:
### $$ \text{minimize}\; RSS + Ridge + Lasso = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p\beta_j x_{ij}\right)\right)^2 + \lambda_1\sum_{j=1}^p |\beta_j| + \lambda_2\sum_{j=1}^p \beta_j^2$$

Whoa, that looks confusing! But:
### $$ \text{minimize}\; RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 $$

That is just adding up all your (true_y - predicted_y)^2 values. Divide that sum by n and you have the MSE; take the square root of the MSE and you have the RMSE!

What we are interested in is the lambda ($\lambda$)
### $$ Lasso = \lambda_1\sum_{j=1}^p |\beta_j|$$
### $$ Ridge = \lambda_2\sum_{j=1}^p \beta_j^2$$
### $$ Ridge + Lasso =  \lambda_1\sum_{j=1}^p |\beta_j| + \lambda_2\sum_{j=1}^p \beta_j^2$$

The $\lambda$ is the penalty strength you set to regularize (increase bias, decrease variance). When $\lambda = 0$, you are left with just the least squares loss function; the larger $\lambda$ is, the harder the coefficients are pushed toward zero. The penalty multiplies $\lambda$ by the total size of the coefficients, measured either by their absolute values (Lasso) or by their squares (Ridge). Now let's talk about what each one is actually doing.

- Lasso is high in bias. With a large enough $\lambda_1$, Lasso shrinks the coefficients of weak predictors to exactly zero, which removes them from your model, so it works with only the strongest predictors. That makes it a form of feature selection.
- Ridge handles multicollinearity really well. Ridge doesn't remove any predictors; instead it shrinks all coefficients toward zero (never exactly to zero) and spreads the weight across correlated predictors.
- Elastic Net is a combination of both, and personally, I prefer using this, unless I have definite reasoning behind what I am looking for. Elastic Net lets you tune the Lasso and Ridge penalties together. If tuning (for example, by cross-validation) finds that only Lasso is needed, the best setting ends up at $\lambda_2 = 0$, and vice versa.
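
The behavior described in the list above can be sketched with scikit-learn's `Lasso`, `Ridge` and `ElasticNet`. This is only an illustration under some assumptions: the data is synthetic (only the first two of five predictors matter), the `alpha` values are picked by hand, and scikit-learn calls the penalty strength `alpha` and scales its loss a little differently from the formulas above. For `ElasticNet`, `l1_ratio` sets the mix between the Lasso and Ridge penalties.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler

# Synthetic data: y depends on the first two columns only; the other three are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

# Put every predictor on the same scale before penalizing coefficients.
X_scaled = StandardScaler().fit_transform(X)

# alpha values are hand-picked assumptions, not tuned.
lasso = Lasso(alpha=0.5).fit(X_scaled, y)
ridge = Ridge(alpha=10.0).fit(X_scaled, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X_scaled, y)

for name, model in [('Lasso', lasso), ('Ridge', ridge), ('ElasticNet', enet)]:
    print(name, np.round(model.coef_, 2))
```

Compare the printed coefficients: Lasso sets weak coefficients to exactly zero, while Ridge keeps all five and only shrinks them. In practice you would pick `alpha` by cross-validation rather than by hand.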

##### Standardization
Before doing any regularization, all predictors must be on the same scale. The penalty acts on the size of each coefficient, and that size depends on the units of its predictor, so unscaled predictors are penalized unevenly and your results will be skewed. Standardization takes each column in a series/dataframe, subtracts its mean, and divides by its standard deviation, so every column ends up with mean 0 and standard deviation 1. Below is how to import standardization and apply it to your dataframe.

In [None]:
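from sklearn.preprocessing import StandardScaler

# df is assumed to be a DataFrame that holds only numeric predictor columns
ss = StandardScaler()
# fit_transform returns a NumPy array, so wrap it to keep the column names
scaled_df = pd.DataFrame(ss.fit_transform(df), columns=df.columns, index=df.index)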
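
Since `df` above is a placeholder, here is a small self-contained sketch. The frame `toy` is hypothetical and simply reuses the numeric columns from the merge example earlier. It checks that each standardized column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical frame with two predictors on very different scales.
toy = pd.DataFrame({'Num_of_Glasses': [2, 2, 0, 1],
                    'Steps_to_get_to_class': [3000, 8000, 5000, 2000]})

scaled = StandardScaler().fit_transform(toy)  # NumPy array, one column per predictor

print(np.round(scaled.mean(axis=0), 6))  # column means, approximately 0
print(np.round(scaled.std(axis=0), 6))   # column standard deviations, approximately 1
```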

Advanced Statistics/Machine Learning Free Course:
https://onlinecourses.science.psu.edu/stat857/node/141