# Linear Regression Prework

## Coding Tips

1. Coding: Pandas' Merge, Concat and Append

## Prework Topics
1. Linear Regression
2. Regularization (Lasso, Ridge, Elastic Net)
3. Bias-Variance Trade Off
4. Error Metrics: MSE, RMSE, R-Squared

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline



### Merging

Pandas' merge function works much like joins in relational (SQL) databases. Look at the parameters and you'll see why.

pd.merge(left_df, right_df, how=['inner', 'outer', 'left', 'right'], on=Col)

There are a few more parameters, but basic usage of merge needs only these 4: the two DataFrames, how (one of the join types listed above), and on (the column to join on). Here's a quick example.

In [2]:
df1 = pd.DataFrame({'Names': ['Sri', 'Edward', 'Matt', 'Ray'],
                    'is_GA': ['Yes', 'Yes', 'Yes', 'No'],
                    'Movies': [0, 0, 0, 1]})

df2 = pd.DataFrame({'Names': ['Sri', 'Edward', 'Matt', 'Ray'],
                    'Num_of_Glasses': [2, 2, 0, 1],
                    'Steps_to_get_to_class': [3000, 8000, 5000, 2000]})

In [3]:
df1

Unnamed: 0,Movies,Names,is_GA
0,0,Sri,Yes
1,0,Edward,Yes
2,0,Matt,Yes
3,1,Ray,No


In [4]:
df2

Unnamed: 0,Names,Num_of_Glasses,Steps_to_get_to_class
0,Sri,2,3000
1,Edward,2,8000
2,Matt,0,5000
3,Ray,1,2000


In [5]:
big_df = pd.merge(df1, df2, how='inner', on='Names')
big_df

Unnamed: 0,Movies,Names,is_GA,Num_of_Glasses,Steps_to_get_to_class
0,0,Sri,Yes,2,3000
1,0,Edward,Yes,2,8000
2,0,Matt,Yes,0,5000
3,1,Ray,No,1,2000
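In the example above every name appears in both frames, so all four join types would return the same rows. The sketch below uses two small, made-up frames (the names `left` and `right` and their contents are assumptions for illustration) whose keys only partly overlap, so the effect of `how` becomes visible.

```python
import pandas as pd

# Hypothetical frames: 'Sri' is only in left, 'Ray' is only in right.
left = pd.DataFrame({'Names': ['Sri', 'Edward', 'Matt'],
                     'is_GA': ['Yes', 'Yes', 'Yes']})
right = pd.DataFrame({'Names': ['Edward', 'Matt', 'Ray'],
                      'Num_of_Glasses': [2, 0, 1]})

# Merge once per join type and record how many rows come back.
row_counts = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(left, right, how=how, on='Names')
    row_counts[how] = len(merged)
    print(how, '->', sorted(merged['Names']))
```

An inner join keeps only names found in both frames, an outer join keeps every name from either frame (filling the gaps with NaN), and left/right keep every name from the corresponding side.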


### Concat
Concat is another way to join tables together, this time more like SQL's UNION ALL (rows are stacked and duplicates are kept). It is more flexible, though, because you can choose which axis to concat on (axis=0 stacks rows and axis=1 places the frames side by side as columns).

In [6]:
df1 = pd.DataFrame({'Animals': ['Bear', 'Cat', 'Dog', 'Bird'],
                    'Pet': ['No', 'Yes', 'Yes', 'No'],
                    'Size': ['Big', 'Medium', 'Medium', 'Small']})

df2 = pd.DataFrame({'Animals': ['Turtle', 'Cheetah', 'Eagle', 'Beaver'],
                    'Pet': ['Yes', 'No', 'No', 'No'],
                    'Size': ['Small', 'Big', 'Medium', 'Medium']})

df3 = pd.DataFrame({'Animals': ['Rat', 'Hampster', 'Porcupine', 'Pig'],
                    'Pet': ['Yes', 'Yes', 'No', 'Yes'],
                    'Size': ['Small', 'Small', 'Medium', 'Medium']})

In [7]:
results = pd.concat([df1, df2, df3], axis=0)
results

Unnamed: 0,Animals,Pet,Size
0,Bear,No,Big
1,Cat,Yes,Medium
2,Dog,Yes,Medium
3,Bird,No,Small
0,Turtle,Yes,Small
1,Cheetah,No,Big
2,Eagle,No,Medium
3,Beaver,No,Medium
0,Rat,Yes,Small
1,Hampster,Yes,Small


In [8]:
results = pd.concat([df1, df2, df3], axis=1)
results

Unnamed: 0,Animals,Pet,Size,Animals.1,Pet.1,Size.1,Animals.2,Pet.2,Size.2
0,Bear,No,Big,Turtle,Yes,Small,Rat,Yes,Small
1,Cat,Yes,Medium,Cheetah,No,Big,Hampster,Yes,Small
2,Dog,Yes,Medium,Eagle,No,Medium,Porcupine,No,Medium
3,Bird,No,Small,Beaver,No,Medium,Pig,Yes,Medium
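Notice that the axis=0 result keeps each frame's original index, so the labels 0 to 3 repeat. A minimal sketch, using two small made-up frames (`a` and `b` are assumptions for illustration), of how `ignore_index=True` renumbers the rows:

```python
import pandas as pd

# Hypothetical two-row frames, each with the default index 0, 1.
a = pd.DataFrame({'Animals': ['Bear', 'Cat'], 'Pet': ['No', 'Yes']})
b = pd.DataFrame({'Animals': ['Turtle', 'Cheetah'], 'Pet': ['Yes', 'No']})

# Default behaviour: the index labels of each frame are kept, so they repeat.
stacked = pd.concat([a, b], axis=0)
print(list(stacked.index))

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([a, b], axis=0, ignore_index=True)
print(list(renumbered.index))
```

Repeated labels matter because a lookup such as `stacked.loc[0]` returns one row from each original frame rather than a single row.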


### Append
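Append is a shortcut for concat on axis=0. We'll use the same DataFrames as in the concat example. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current install use pd.concat([df1, df2, df3]) instead.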

In [9]:
# Python lets you chain method calls. They run left to right, so df2 is appended first and then df3.
results = df1.append(df2).append(df3)
results

Unnamed: 0,Animals,Pet,Size
0,Bear,No,Big
1,Cat,Yes,Medium
2,Dog,Yes,Medium
3,Bird,No,Small
0,Turtle,Yes,Small
1,Cheetah,No,Big
2,Eagle,No,Medium
3,Beaver,No,Medium
0,Rat,Yes,Small
1,Hampster,Yes,Small


## Prework Topics

### Linear Regression

Penn State offers a great, free online statistics course:
https://onlinecourses.science.psu.edu/stat501/node/250

### What are the differences between a Correlation and a Linear Regression? Aren't they doing the same thing?

First things first, they are not the same. They are similar, but achieve different goals. I'll break them down into similarities and differences.

#### Similarities
- In simple linear regression, the standardised regression coefficient is the same as Pearson's correlation coefficient.
- The square of Pearson's correlation coefficient is the same as the R^2 in simple linear regression.
- Neither simple linear regression nor correlation answers questions of causality directly. This point is important, because I've met people who think that simple regression can magically allow an inference that X causes Y.

#### Differences
- Correlation quantifies the relationship between two variables. Correlation does not fit a line through the data.
- Correlation typically refers to Pearson Correlation, but there are other correlations, such as Spearman's Correlation.
- With correlation, it does not matter which feature you call X and which you call Y; you get the same value if you swap them. With linear regression, you have to decide which is your independent/predictor variable (X) and which is your dependent/target variable (Y). You are looking for the best fit line that predicts Y from X, and that line will generally be different if you swap them around.
- Linear Regression fits a best fit line to predict Y from X. Given new data/samples, the predictor X lets you predict the outcome Y based on the data you have already seen.
- Although neither correlation nor linear regression answers causality directly, with linear regression you do have to think about causality to decide whether X is a sensible predictor for Y.

In short, if you are interested in seeing the magnitude of the relationship between X and Y, use correlation. If you are interested in predicting and explaining your results in particular values, use linear regression.
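The similarities and differences listed above can be checked numerically. The sketch below uses synthetic data (the seed, sample size and true slope of 2.0 are assumptions chosen only for illustration) and plain NumPy.

```python
import numpy as np

# Synthetic data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Pearson correlation is symmetric in x and y.
r = np.corrcoef(x, y)[0, 1]

# Simple linear regression in both directions (np.polyfit returns slope first).
slope_yx, intercept_yx = np.polyfit(x, y, 1)   # predict y from x
slope_xy, intercept_xy = np.polyfit(y, x, 1)   # predict x from y

# Standardised slope: rescale the slope by the ratio of standard deviations.
std_slope = slope_yx * x.std() / y.std()

# R^2 of the y-on-x regression.
y_hat = slope_yx * x + intercept_yx
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print('Pearson r:          ', r)
print('standardised slope: ', std_slope)
print('r squared vs R^2:   ', r ** 2, r2)
print('slope y~x, slope x~y:', slope_yx, slope_xy)
```

The standardised slope matches r and r squared matches R^2, while the two regression slopes differ from each other: swapping X and Y changes the fitted line but not the correlation.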

### What are the differences between Simple Linear Regression/Ordinary Least Squares Regression (OLS)/Linear Least Squares? What about Multiple Linear Regression?

Ordinary Least Squares (OLS) regression and Linear Least Squares are the same thing: a method for fitting a line by minimizing the sum of the squared errors (residuals). It looks something like this: https://training-course-material.com/images/a/ac/Images-mod2-linear12.gif

How is it different from Simple Linear Regression? Not by much in practice. Simple Linear Regression (SLR) and Multiple Linear Regression (MLR) are the models, and OLS is the usual way of fitting them to a dataset. The only difference between the two models is that SLR has 1 target and 1 predictor, whereas MLR has 1 target and 2 or more predictors.
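To make the relationship concrete, here is a minimal sketch in which the same least-squares routine fits both a simple and a multiple linear regression. The helper name `ols`, the synthetic data and the true coefficients (1, 2, -3) are assumptions for illustration; `np.linalg.lstsq` does the actual minimization.

```python
import numpy as np

# Synthetic data with two predictors and a known relationship.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

def ols(X, y):
    """Least-squares coefficients for y ~ X, intercept first (hypothetical helper)."""
    X_design = np.column_stack([np.ones(len(X)), X])  # prepend a column of 1s
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta

def rss(X, y, beta):
    """Sum of squared residuals for the fitted coefficients."""
    X_design = np.column_stack([np.ones(len(X)), X])
    return np.sum((y - X_design @ beta) ** 2)

X_slr = x1.reshape(-1, 1)             # 1 predictor  -> simple linear regression
X_mlr = np.column_stack([x1, x2])     # 2 predictors -> multiple linear regression

beta_slr = ols(X_slr, y)
beta_mlr = ols(X_mlr, y)

print('SLR coefficients:', beta_slr, 'RSS:', rss(X_slr, y, beta_slr))
print('MLR coefficients:', beta_mlr, 'RSS:', rss(X_mlr, y, beta_mlr))
```

The fitting procedure is identical in both cases; only the number of predictor columns changes.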

### What is statistical interaction?

When we multiply 2 features together and add the product as a new feature, that is called a statistical interaction. We do this to see whether the 2 features interact, that is, whether the effect of 1 independent variable on the target varies depending on the level of another independent variable.

This website provides some further insight on statistical interaction (the bottom of the page has graphs): 

https://courses.washington.edu/smartpsy/interactions.htm

This blog provides a simple and clear example of statistical interaction:

http://www.theanalysisfactor.com/interpreting-interactions-in-regression/
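As a rough illustration of the idea, the sketch below builds an interaction term by multiplying two features and fits a regression that includes it. The data is synthetic, and the variable names and true coefficients are assumptions chosen so the effect is easy to see.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: the effect of x1 on y depends on the binary feature x2.
rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = rng.integers(0, 2, size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + 3.0 * x1 * x2 + rng.normal(scale=0.1, size=n)

# The interaction term is simply the product of the two features.
X = np.column_stack([x1, x2, x1 * x2])
model = LinearRegression().fit(X, y)
b1, b2, b12 = model.coef_

# The slope of x1 differs depending on the level of x2.
slope_when_x2_is_0 = b1
slope_when_x2_is_1 = b1 + b12
print('slope of x1 when x2 = 0:', slope_when_x2_is_0)
print('slope of x1 when x2 = 1:', slope_when_x2_is_1)
```

A coefficient on the product term that is clearly different from zero is the sign that the effect of one feature changes with the level of the other.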

### Below is a basic Linear Regression Example

In [10]:
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

In [11]:
titanic = pd.read_csv('review_datasets/titanic.csv')
titanic.head()

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [12]:
titanic.isnull().sum()

survived      0
pclass        0
name          0
sex           0
age         177
sibsp         0
parch         0
ticket        0
fare          0
cabin       687
embarked      2
dtype: int64

In [13]:
titanic.dropna(subset=['age'], inplace=True)

In [14]:
X = titanic[['fare', 'survived', 'pclass']]
y = titanic.age.values

#### Using statsmodels

In [15]:
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.693
Model:                            OLS   Adj. R-squared:                  0.692
Method:                 Least Squares   F-statistic:                     535.1
Date:                Mon, 25 Sep 2017   Prob (F-statistic):          7.93e-182
Time:                        19:54:49   Log-Likelihood:                -3089.2
No. Observations:                 714   AIC:                             6184.
Df Residuals:                     711   BIC:                             6198.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
fare           0.1322      0.013     10.303      0.0

#### Using sklearn

In [16]:
lr = LinearRegression()
model = lr.fit(X, y)

In [17]:
# predict expects a 2D input: one row per sample, columns ordered as fare, survived, pclass
model.predict([[50, 0, 1]])



array([ 43.18520993])

<span style='color:red'>My model predicts that if you paid a $50 fare, did not survive and were in the luxurious pclass 1, you would be 43.19 years old.</span>
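One thing worth noticing about the two fits above: sm.OLS does not add an intercept on its own (the summary's Df Model of 3 with 711 residual degrees of freedom suggests none was included), while sklearn's LinearRegression fits one by default, so the two models are not identical. In statsmodels the usual way to add one is sm.add_constant(X). The sketch below shows the effect of the intercept using sklearn only, on synthetic data; the data and true coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data whose true relationship has a large intercept (30).
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 30.0 + 1.5 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(size=200)

# fit_intercept=False forces the fitted plane through the origin,
# which is comparable to calling sm.OLS without a constant column.
no_intercept = LinearRegression(fit_intercept=False).fit(X_demo, y_demo)
with_intercept = LinearRegression(fit_intercept=True).fit(X_demo, y_demo)

print('no intercept  :', no_intercept.coef_, no_intercept.intercept_)
print('with intercept:', with_intercept.coef_, with_intercept.intercept_)
print('R^2 without / with:',
      no_intercept.score(X_demo, y_demo),
      with_intercept.score(X_demo, y_demo))
```

Without an intercept the coefficients have to absorb the baseline level of the target, so they can look very different from the ones fitted with an intercept.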

### Bias-Variance Trade Off

Every Data Scientist will tell you that the Bias-Variance trade off is important to understand. First off, what is bias and what is variance?

1. Bias - Error from a model that is too simple. I've accepted that my model will not perfectly model the dataset.
2. Variance - Error from a model that is too sensitive to its training data. I will try to account for all the data points in my current "sample/training" set.

What does it mean?

High bias means that you are **underfitting**. You are keeping only the most important feature(s) and leaving out other features that could improve your model's explanation of the data. This is a problem because you want to get the best explanation.

High variance means that you are **overfitting**. You are accommodating every single data point, so you end up including features (and noise) that confuse or hurt your model when it sees new data.

The trade off is to find the most optimized point between bias and variance, but how?

Too much variance:
- Regularized regression models, such as Lasso and Ridge (explained later), will increase bias and reduce variance.
- Dimensionality Reduction (unsupervised learning) and Feature Selection will increase bias and reduce variance.
- The ensemble method Bagging combines many 'strong' (low bias, high variance) learners in a way that reduces variance.

Too much bias:
- Increase the number of features.
- The ensemble method Boosting combines many weak, high bias models to get a lower bias.

It is also worth mentioning that every algorithm has its own bias and variance tendencies. k-Nearest Neighbors is high bias when k is large and high variance when k is small. Decision Trees can be high variance depending on how many leaf nodes you allow through your parameters. This is where research and analysis come into play. As a scientist, you'll need to go back and forth to find the best trade off point. There is no "right" answer.

Some of these methods we haven't learned yet, but don't fret, we will eventually get there. Just know for now that bias-variance is important and there are ways to balance it out.
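For a feel of underfitting versus overfitting, here is a small sketch that fits polynomials of increasing degree to noisy synthetic data and compares training and test error. The data-generating function, the seed and the degrees tried are all assumptions for illustration, and the exact numbers will change with them.

```python
import numpy as np

rng = np.random.default_rng(4)

def make_data(n):
    """Noisy samples from a curved function (hypothetical example data)."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)
    return x, y

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

x_train, y_train = make_data(30)    # small training sample
x_test, y_test = make_data(200)     # fresh data the model has not seen

train_mse, test_mse = {}, {}
for degree in [1, 3, 9]:
    # Higher degree = more flexible model = lower bias, higher variance.
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = mse(y_train, np.polyval(coefs, x_train))
    test_mse[degree] = mse(y_test, np.polyval(coefs, x_test))
    print(degree, train_mse[degree], test_mse[degree])
```

Training error can only go down as the degree increases. The test error is the one to watch: a straight line (degree 1) is usually too simple for this curve, while a very flexible polynomial tends to chase the noise in the 30 training points.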

### Error Metrics: Loss Function, MSE, RMSE, R-Squared

1. Loss Function - The loss function is what is being optimized by the process of regression. Think of it this way: the greater the value of the loss function, the more information about your target variable is "lost" by your model. A common loss function is the least squares loss function, where a larger sum of squared errors (loss) means a worse fit between your predictors and your target variable.
2. Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) - https://www.vernier.com/til/1014/
3. R-Squared - http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit
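The three metrics are simple enough to compute by hand. A minimal sketch with a tiny made-up set of true and predicted values (the numbers are assumptions for illustration), cross-checked against sklearn.metrics:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and model predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

rss = np.sum((y_true - y_pred) ** 2)           # sum of squared errors
mse = rss / len(y_true)                        # mean squared error
rmse = np.sqrt(mse)                            # back in the units of y
tss = np.sum((y_true - y_true.mean()) ** 2)    # total variation in y
r2 = 1 - rss / tss                             # share of variation explained

print('MSE :', mse, mean_squared_error(y_true, y_pred))
print('RMSE:', rmse)
print('R^2 :', r2, r2_score(y_true, y_pred))
```

RMSE is often easier to interpret than MSE because it is in the same units as the target, and R-Squared compares the model's squared error with that of always predicting the mean.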

### Regularization with Lasso, Ridge, Elastic Net

The concept of regularization is adding an additional "penalty" on the size of coefficients to the minimization of sum of squared errors in standard regression.

In other words, there are additional components to the loss function, so the minimization becomes a balance between these components. 

The most common types of regularization are the **"Lasso"**, **"Ridge"**, and the **Elastic Net**.
    
##### First off, let's hurt some of your brains.

Least Squares Loss function formula:
### $$ \text{minimize}\; RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p\beta_j x_{ij}\right)\right)^2 $$

This is the Lasso formula:
### $$ \text{minimize}\; RSS + Lasso = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p\beta_j x_{ij}\right)\right)^2 + \lambda_1\sum_{j=1}^p |\beta_j|$$

This is the Ridge formula:
### $$ \text{minimize}\; RSS+Ridge = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p\beta_j x_{ij}\right)\right)^2 + \lambda_2\sum_{j=1}^p \beta_j^2$$

And finally, the ElasticNet formula:
### $$ \text{minimize}\; RSS + Ridge + Lasso = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p\beta_j x_{ij}\right)\right)^2 + \lambda_1\sum_{j=1}^p |\beta_j| + \lambda_2\sum_{j=1}^p \beta_j^2$$

Whoa, that looks confusing! But:
### $$ \text{minimize}\; RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 $$

That is just adding up all your (true_y - predicted_y)^2 values, which is the same sum you compute for the MSE (divide it by n) and the RMSE (then take the square root)!

What we are interested in is the lambda ($\lambda$)
### $$ Lasso = \lambda_1\sum_{j=1}^p |\beta_j|$$
### $$ Ridge = \lambda_2\sum_{j=1}^p \beta_j^2$$
### $$ Ridge + Lasso =  \lambda_1\sum_{j=1}^p |\beta_j| + \lambda_2\sum_{j=1}^p \beta_j^2$$

The $\lambda$ is the strength of the penalty you are setting to regularize (increase bias, decrease variance). When $\lambda = 0$, the penalty disappears and you are left with just the least squares loss function. The penalty multiplies $\lambda$ by the sum over all coefficients of either their absolute values (Lasso) or their squares (Ridge), so larger coefficients cost more. Now let's talk about what each one is actually doing.

- Lasso is high in bias. Lasso can shrink the coefficients of weak predictors all the way to zero, removing them from your model entirely so that it works with only the strongest predictors.
- Ridge handles multicollinearity really well. Ridge doesn't remove any predictors, but rather shrinks all the coefficients toward zero, so that correlated predictors share the explanation instead of one of them taking an unstable, oversized coefficient.
- Elastic Net is a combination of both, and personally, I prefer using this, unless I have definite reasoning behind what I am looking for. Elastic Net lets you tune the Lasso and Ridge penalties together. If the tuning (for example by cross-validation) finds that only Lasso is needed, it ends up with $\lambda_2 = 0$ and Elastic Net behaves like Lasso, and vice versa.
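The behaviour described in the bullets above can be seen on a toy problem. The sketch below is only an illustration under assumptions: the data is synthetic (10 standardized predictors, of which only the first two actually matter), and the penalty strengths are arbitrary. Note that sklearn does not take $\lambda_1$ and $\lambda_2$ directly; it uses alpha for the overall strength and, for ElasticNet, l1_ratio for the Lasso/Ridge mix, and it also scales the loss slightly differently from the formulas above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of ten predictors influence y.
rng = np.random.default_rng(5)
n, p = 200, 10
X_demo = StandardScaler().fit_transform(rng.normal(size=(n, p)))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(size=n)

# Arbitrary penalty strengths, chosen only to make the effect visible.
models = {
    'ols': LinearRegression(),
    'lasso': Lasso(alpha=0.5),
    'ridge': Ridge(alpha=10.0),
    'enet': ElasticNet(alpha=0.5, l1_ratio=0.5),
}

coefs = {name: m.fit(X_demo, y_demo).coef_ for name, m in models.items()}
for name, c in coefs.items():
    print(name.ljust(6), np.round(c, 2), 'zeros:', int(np.sum(c == 0)))
```

Lasso sets the coefficients of weak predictors to exactly zero, Ridge keeps every predictor but with smaller coefficients than plain least squares, and Elastic Net sits in between. In practice the penalty strengths would be tuned, for example with cross-validation.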

##### Standardization
Before doing any regularization, all predictors must be on the same scale, or else the penalty will hit predictors unevenly simply because of their units. Standardization takes each column of your series/dataframe, subtracts its mean so the new mean is 0, and divides by its standard deviation so the new standard deviation is 1. Below is how to import standardization and apply it to your dataframe.

In [21]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
scaled_df = ss.fit_transform(X)
scaled_df

array([[-0.51897787, -0.82702011,  0.91123237],
       [ 0.69189675,  1.20916044, -1.47636364],
       [-0.50621356,  1.20916044,  0.91123237],
       ..., 
       [-0.08877362,  1.20916044, -1.47636364],
       [-0.08877362,  1.20916044, -1.47636364],
       [-0.50952283, -0.82702011,  0.91123237]])
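To see what StandardScaler is doing under the hood, the sketch below reproduces it by hand on a tiny made-up dataframe (the column names echo the Titanic data, but the values are assumptions for illustration). One detail worth knowing: StandardScaler divides by the population standard deviation (ddof=0), whereas pandas' std() defaults to ddof=1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values on very different scales.
demo = pd.DataFrame({'fare': [7.25, 71.28, 7.93, 53.10],
                     'pclass': [3, 1, 3, 1]})

# StandardScaler returns a NumPy array, not a DataFrame.
scaled = StandardScaler().fit_transform(demo)

# The same transformation by hand: subtract the mean, divide by the std.
manual = (demo - demo.mean()) / demo.std(ddof=0)

# Wrap the array back into a DataFrame to keep the column names.
scaled_demo = pd.DataFrame(scaled, columns=demo.columns, index=demo.index)
print(scaled_demo)
print('column means:', scaled.mean(axis=0))
print('column stds :', scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so a penalty on coefficient size treats all predictors on equal terms.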

Advanced Statistics/Machine Learning Free Course:
https://onlinecourses.science.psu.edu/stat857/node/141