# Bias-variance decomposition

The bias-variance decomposition splits a model's expected prediction error into a bias term and a variance term. It provides a framework for understanding the trade-offs involved in model selection and helps in making informed decisions to improve model performance.

In [None]:
import pandas as pd
import numpy as np

from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

In [None]:
RANDOM_STATE = 42

In [None]:
data = fetch_california_housing(as_frame=True)

X = data.data
y = data.target

In [None]:
pd.concat([X,y], axis=1)

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=RANDOM_STATE)

In [None]:
scaler = StandardScaler()
scaler.fit(Xtrain)

Xtrain = pd.DataFrame(data=scaler.transform(Xtrain), columns=X.columns)
Xtest = pd.DataFrame(data=scaler.transform(Xtest), columns=X.columns)

Bias-variance decomposition can be done with the `bias_variance_decomp` function from the [`mlxtend`](https://rasbt.github.io/mlxtend/) library.

In [None]:
%%capture
!pip install mlxtend --upgrade

## Arguments
* `estimator` - A classifier or regressor object or class implementing both a fit and predict method similar to the scikit-learn API.
* `X_train, y_train` - training features and targets
* `X_test, y_test` - test features and targets
* `loss` - Loss function for performing the bias-variance decomposition. Currently allowed values are `'0-1_loss'` (classification) and `'mse'` (regression).
* `num_rounds=200` - Number of bootstrap rounds (sampling from the training data) for performing the bias-variance decomposition. Each bootstrap sample has the same size as the original training set.

## Returns

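* `avg_expected_loss` - expected loss on the test data, averaged over the bootstrap rounds and the test points
* `avg_bias` - bias, averaged over the test points (for `'mse'`, the squared difference between the mean prediction and the observed target)
* `avg_var` - variance of the predictions across the bootstrap rounds, averaged over the test points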
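To make the arguments and return values above more concrete, here is a minimal NumPy sketch of a bootstrap-based decomposition for the `'mse'` loss. It is an illustration of the idea, not the `mlxtend` source: the helper names `bias_variance_mse_sketch` and `least_squares_fit_predict`, the synthetic data, and the `seed` parameter are all assumptions made for this example.

```python
import numpy as np

def bias_variance_mse_sketch(fit_predict, X_train, y_train, X_test, y_test,
                             num_rounds=200, seed=0):
    """Hypothetical helper: bootstrap estimate of MSE loss, bias and variance."""
    rng = np.random.RandomState(seed)
    n = X_train.shape[0]
    all_pred = np.zeros((num_rounds, y_test.shape[0]))
    for i in range(num_rounds):
        # Bootstrap sample of the same size as the training set (with replacement).
        idx = rng.choice(n, size=n, replace=True)
        all_pred[i] = fit_predict(X_train[idx], y_train[idx], X_test)
    # "Main" prediction: the average prediction per test point over all rounds.
    main_pred = all_pred.mean(axis=0)
    loss = np.mean((all_pred - y_test) ** 2)      # average MSE over rounds
    bias = np.mean((main_pred - y_test) ** 2)     # squared bias w.r.t. observed targets
    var = np.mean((all_pred - main_pred) ** 2)    # spread of predictions across rounds
    return loss, bias, var

def least_squares_fit_predict(X_tr, y_tr, X_te):
    """Hypothetical estimator: ordinary least squares with an intercept."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
    A_te = np.column_stack([np.ones(len(X_te)), X_te])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return A_te @ coef

# Synthetic data, used only to exercise the sketch.
demo_rng = np.random.RandomState(42)
X_demo = demo_rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + demo_rng.normal(scale=0.5, size=300)

loss, bias, var = bias_variance_mse_sketch(
    least_squares_fit_predict,
    X_demo[:200], y_demo[:200], X_demo[200:], y_demo[200:],
    num_rounds=50,
)
print("Loss:", loss)
print("Bias:", bias)
print("Variance:", var)
```

With these definitions the loss equals bias plus variance up to floating-point error, because the main prediction is the mean of the per-round predictions.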

In [None]:
X_train = Xtrain.values
y_train = ytrain.values
X_test = Xtest.values
y_test = ytest.values

In [None]:
from mlxtend.evaluate import bias_variance_decomp

# Pass the seed itself: np.random.seed() returns None, which leaves the bootstrap unseeded.
avg_mse, avg_bias, avg_var = bias_variance_decomp(LinearRegression(), X_train, y_train,
                                                  X_test, y_test, loss='mse',
                                                  random_seed=RANDOM_STATE)

In [None]:
print('Loss:', avg_mse)
print('Bias:', avg_bias)
print('Variance:', avg_var)

Loss: 0.5304055638008108
Bias: 0.5283978620960189
Variance: 0.0020077017047919693


In [None]:
# Pass the seed itself: np.random.seed() returns None, which leaves the bootstrap unseeded.
avg_mse, avg_bias, avg_var = bias_variance_decomp(DecisionTreeRegressor(), X_train, y_train,
                                                  X_test, y_test, loss='mse',
                                                  random_seed=RANDOM_STATE)

In [None]:
print('Loss:', avg_mse)
print('Bias:', avg_bias)
print('Variance:', avg_var)

Loss: 0.5666498104545756
Bias: 0.25233310871426057
Variance: 0.31431670174031506


As we can see, the decision tree fits the target variable better on average than linear regression (its *bias* is lower), but it overfits more (its *variance* is higher). In both cases the loss is the sum of the bias and the variance, and the total error is slightly higher for `DecisionTreeRegressor` than for `LinearRegression`.
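As a quick sanity check on the numbers printed above, the snippet below verifies that each loss equals bias plus variance and computes how much of the loss comes from variance. The values are copied from the outputs in this notebook; the `results` dictionary is just a container introduced for this example.

```python
# (loss, bias, variance) copied from the printed outputs above.
results = {
    "LinearRegression": (0.5304055638008108, 0.5283978620960189, 0.0020077017047919693),
    "DecisionTreeRegressor": (0.5666498104545756, 0.25233310871426057, 0.31431670174031506),
}

for name, (loss, bias, var) in results.items():
    gap = loss - (bias + var)       # should be zero up to rounding
    variance_share = var / loss     # fraction of the loss explained by variance
    print(f"{name}: loss - (bias + variance) = {gap:.2e}, "
          f"variance share = {variance_share:.1%}")
```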

Hyperparameter tuning can help reduce overfitting and the total error.
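One way to explore this is to sweep a complexity hyperparameter such as `max_depth` and watch how bias and variance move. The sketch below is only an illustration under stated assumptions: it uses synthetic one-dimensional data rather than the California housing set, and it re-implements the bootstrap loop by hand with scikit-learn instead of calling `bias_variance_decomp`. The helper `decompose` and the chosen depths are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a noisy sine curve, split into train and test parts.
rng = np.random.RandomState(42)
X_syn = rng.uniform(-3, 3, size=(400, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(scale=0.5, size=400)
X_tr, y_tr = X_syn[:300], y_syn[:300]
X_te, y_te = X_syn[300:], y_syn[300:]

def decompose(max_depth, num_rounds=30):
    """Hypothetical helper: bootstrap MSE decomposition for one tree depth."""
    preds = np.zeros((num_rounds, len(y_te)))
    for i in range(num_rounds):
        # Resample the training set with replacement and refit the tree.
        idx = rng.choice(len(y_tr), size=len(y_tr), replace=True)
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        preds[i] = tree.fit(X_tr[idx], y_tr[idx]).predict(X_te)
    main = preds.mean(axis=0)
    loss = np.mean((preds - y_te) ** 2)
    bias = np.mean((main - y_te) ** 2)
    var = np.mean((preds - main) ** 2)
    return loss, bias, var

# max_depth=None lets the tree grow until its leaves are pure.
sweep = {depth: decompose(depth) for depth in (2, 4, 8, None)}
for depth, (loss, bias, var) in sweep.items():
    print(f"max_depth={depth}: loss={loss:.3f} bias={bias:.3f} variance={var:.3f}")
```

Shallow trees usually show lower variance and higher bias than unrestricted ones, so the depth with the lowest loss is a reasonable starting point for tuning. The same loop could be run on the notebook's data by passing `DecisionTreeRegressor(max_depth=...)` to `bias_variance_decomp`.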