# Bias and Variance, [Reference 1](https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/), [Reference 2](https://medium.com/analytics-vidhya/bias-variance-trade-off-in-datascience-and-calculating-with-python-766158812c46)
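* In general, a model's bias and variance pull in opposite directions: lowering one tends to raise the other.
* Low bias often signals possible overfitting, and overfitting usually comes with high variance, meaning the model is more sensitive to the input data.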
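
For squared-error loss, the trade-off above is commonly written as the decomposition below. This is a sketch in standard textbook notation, not something taken from the notebook itself: $f$ (the true function), $\hat{f}$ (the fitted model), and $\sigma^2$ (the noise variance) are symbols assumed here for illustration, and the expectation is taken over training sets and noise.

$$ E\big[(y - \hat{f}(x))^2\big] = \big(E[\hat{f}(x)] - f(x)\big)^2 + E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big] + \sigma^2 $$

The first term is the squared bias, the second is the variance, and the third is irreducible noise.
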
## How to Calculate the Model's Bias
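* Definition of bias: bias is the difference between the mean of the model's predictions (estimates) and the actual value.
* Model bias formula (squared bias averaged over the test set): $$ {1 \over n}\sum\limits_{i=1}^n \{{\hat{y}_i-y_i}\}^2 $$
* n is the number of rows in the test set, $\hat{y}_i$ is the mean of the model's predictions for the i-th test row (averaged over models trained on different training subsets), and $y_i$ is the ground-truth value.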
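
The bias formula above can be sketched in NumPy as follows. This is a minimal illustration rather than the notebook's own code: the function name `squared_bias`, the `(L, n)` array layout, and the toy numbers are all assumptions made for the example.

```python
import numpy as np

def squared_bias(all_preds, y_true):
    """Average squared bias over the test set.

    all_preds: shape (L, n); row l holds the predictions of the l-th model
               (each trained on a different training subset) for the n test rows.
    y_true:    shape (n,); ground-truth values.
    """
    all_preds = np.asarray(all_preds, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    mean_pred = all_preds.mean(axis=0)  # mean prediction per test row
    return float(np.mean((mean_pred - y_true) ** 2))

# Toy numbers, made up for illustration: 3 models, 2 test rows.
preds = [[1.0, 4.0],
         [2.0, 5.0],
         [3.0, 6.0]]
truth = [1.0, 5.0]
print(squared_bias(preds, truth))  # mean predictions are [2, 5], so this prints 0.5
```
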
## How to Calculate the Model's Variance
* Variance does not depend on the actual (label) values y. It measures the stability of the model: train one model on each of several training subsets, have every model predict the same test data, and take the variance of those predictions.
* Variance definition: variance is the amount by which the estimate of the target function would change if different training data were used.
* Model variance formula: $$ {1 \over N}\sum\limits_{n=1}^N {1 \over L}\sum\limits_{l=1}^L \{{ y^l(x_n) -  \bar{y}(x_n) }\}^2 $$
* N is the number of rows in the test set, L is the number of training subsets the training data is split into, $y^l(x_n)$ is the prediction for test row $x_n$ from the model trained on the l-th subset, and $\bar{y}(x_n)$ is the mean of those L predictions.
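
The variance formula above can be sketched the same way. Again this is only an illustration under assumptions: the function name `prediction_variance`, the `(L, N)` array layout, and the toy numbers are hypothetical. Note that the ground-truth labels never appear in the computation.

```python
import numpy as np

def prediction_variance(all_preds):
    """Average variance of the predictions across models.

    all_preds: shape (L, N); row l holds the predictions y^l(x_n) of the model
               trained on the l-th training subset for the N test rows.
    """
    all_preds = np.asarray(all_preds, dtype=float)
    mean_pred = all_preds.mean(axis=0)  # y_bar(x_n): mean over the L models
    # Averaging over all L * N entries equals (1/N) * sum_n (1/L) * sum_l (...)
    return float(np.mean((all_preds - mean_pred) ** 2))

# Toy numbers, made up for illustration: L = 3 models, N = 2 test rows.
preds = [[1.0, 4.0],
         [2.0, 5.0],
         [3.0, 6.0]]
print(prediction_variance(preds))  # squared deviations sum to 4 over 6 entries, i.e. 2/3
```
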

In [2]:
from mlxtend.evaluate import bias_variance_decomp
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.neighbors import KNeighborsRegressor
import warnings
warnings.filterwarnings('ignore')
# boston_housing_data loads the Boston housing dataset as NumPy arrays,
# so NumPy functions can be applied to it directly
from mlxtend.data import boston_housing_data

In [3]:

X, y = boston_housing_data()
print('Dimensions: %s x %s' % (X.shape[0], X.shape[1]))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=123, shuffle=True)

Dimensions: 506 x 13


## Demonstration
* Show the relationship between variance, bias, and tree depth.
* As the tree gets deeper, the bias falls while the variance tends to rise (in the run below the variance dips at depth 2, then is higher at depths 3 and 4).

In [12]:
for d in range(1,5):
    decision_tree = DecisionTreeRegressor(random_state=123, max_depth=d)
    # estimate the average MSE, bias, and variance for a tree of depth d
    mse_decision_tree, bias_decision_tree, var_decision_tree = bias_variance_decomp(
        decision_tree, X_train, y_train, X_test, y_test, loss='mse', random_seed=123)

    # random_seed : Used to initialize a pseudo-random
    # number generator for the bias-variance decomposition
    print('Original Bias from un-pruned data at {}'.format(d), np.round(bias_decision_tree, 2))
    print('Original Variance from un-pruned data at {}'.format(d), np.round(var_decision_tree, 2))

Original Bias from un-pruned data at 1 41.41
Original Variance from un-pruned data at 1 11.49
Original Bias from un-pruned data at 2 29.04
Original Variance from un-pruned data at 2 7.41
Original Bias from un-pruned data at 3 20.7
Original Variance from un-pruned data at 3 12.62
Original Bias from un-pruned data at 4 18.6
Original Variance from un-pruned data at 4 12.43
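
To make the estimation procedure more concrete, here is a NumPy-only sketch of a bootstrap-based decomposition. It assumes, without claiming to reproduce it, that `bias_variance_decomp` works along these lines: refit the model on bootstrap resamples of the training set, then compare the predictions on a fixed test set. The helper names `bootstrap_bias_variance` and `poly_model`, the synthetic sine data, and the use of polynomial degree as a stand-in for tree depth are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_bias_variance(fit_predict, X_train, y_train, X_test, y_test,
                            num_rounds=200, seed=0):
    """Estimate average MSE, squared bias, and variance via bootstrap resampling."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    all_preds = np.empty((num_rounds, len(X_test)))
    for r in range(num_rounds):
        idx = rng.integers(0, n, size=n)  # draw n training rows with replacement
        all_preds[r] = fit_predict(X_train[idx], y_train[idx], X_test)
    mean_pred = all_preds.mean(axis=0)            # mean prediction per test row
    mse = np.mean((all_preds - y_test) ** 2)      # average squared error
    bias = np.mean((mean_pred - y_test) ** 2)     # squared bias vs. observed labels
    var = np.mean((all_preds - mean_pred) ** 2)   # spread across the refitted models
    return float(mse), float(bias), float(var)

def poly_model(degree):
    """Return a fit-and-predict function for a polynomial of the given degree."""
    def fit_predict(x_tr, y_tr, x_te):
        coeffs = np.polyfit(x_tr, y_tr, degree)
        return np.polyval(coeffs, x_te)
    return fit_predict

# Synthetic 1-D data, made up for illustration: noisy sine curve.
data_rng = np.random.default_rng(123)
x = data_rng.uniform(-3, 3, size=200)
y = np.sin(x) + data_rng.normal(scale=0.3, size=200)
x_train, y_train, x_test, y_test = x[:130], y[:130], x[130:], y[130:]

results = {}
for degree in (1, 3, 5):
    results[degree] = bootstrap_bias_variance(
        poly_model(degree), x_train, y_train, x_test, y_test)
    mse, bias, var = results[degree]
    print('degree', degree, 'mse', round(mse, 3), 'bias', round(bias, 3), 'var', round(var, 3))
```

With these definitions the estimated MSE equals bias plus variance up to floating-point error, because the bias is measured against the observed labels and therefore absorbs the label noise.
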


## The simple model
* Linear regression is a very simple model, so its bias is high and its variance is low.

In [4]:
lr = LinearRegression()

# estimate the average MSE, bias, and variance of the linear regression model
mse_lr, bias_lr, var_lr = bias_variance_decomp(
    lr, X_train, y_train, X_test, y_test, loss='mse', random_seed=123)

# random_seed : Used to initialize a pseudo-random
# number generator for the bias-variance decomposition
print('Original Bias from un-pruned data ', np.round(bias_lr, 2))
print('Original Variance from un-pruned data ', np.round(var_lr, 2))

Original Bias from un-pruned data  30.07
Original Variance from un-pruned data  1.26


# Conclusion
* Bias and variance trade off against each other, as the chart below shows.
<img src="https://i.imgur.com/JNyTZwN.png" />
