# Bias and Variance in Machine Learning

In machine learning, **bias** and **variance** are two sources of error that affect the performance of a model.

- **Bias** refers to the error due to overly simplistic assumptions in the model. A high-bias model tends to miss important patterns in the data, leading to **underfitting**. Underfitting means the model performs poorly on both the training and test data.
  
- **Variance** refers to the model's sensitivity to small fluctuations in the training data. A high-variance model captures noise as well as the true patterns, leading to **overfitting**. Overfitting means the model performs well on the training data but poorly on unseen test data.

The goal in machine learning is to find a balance between bias and variance:
- **High bias** causes underfitting, where the model is too simple.
- **High variance** causes overfitting, where the model is too complex and fails to generalize.

There is usually a tradeoff between bias and variance: reducing one typically increases the other. A well-performing model balances the two so that it generalizes well to new data.

To summarize:
- **Underfitting** → High bias, regardless of variance.
- **Overfitting** → Low bias + High variance.

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

The diagram on the left uses a bullseye (target) to illustrate how different combinations of bias and variance affect the accuracy of a model’s predictions. Here's how the figure explains bias and variance:
1. **Low Bias, Low Variance (Top-Left):**
   - **Meaning:** The model’s predictions are accurate and consistent, hitting close to the target center with minimal spread.
   - **Interpretation:** This is the ideal case where the model generalizes well, with low error in both training and test datasets.

2. **Low Bias, High Variance (Top-Right):**
   - **Meaning:** The predictions are centered around the target (low bias) but have a wide spread (high variance), indicating inconsistency.
   - **Interpretation:** The model fits the training data well (low bias) but overfits, performing poorly on new data (high variance).

3. **High Bias, Low Variance (Bottom-Left):**
   - **Meaning:** The predictions are consistently far from the target center (high bias) but don’t vary much (low variance).
   - **Interpretation:** The model is too simplistic and underfits both the training and test data, resulting in high errors for both (even though it’s consistent).

4. **High Bias, High Variance (Bottom-Right):**
   - **Meaning:** The predictions are far from the target center and are also scattered widely (high bias, high variance).
   - **Interpretation:** This is the worst scenario, where the model both underfits (high bias) and overfits (high variance) at the same time, leading to very poor performance.
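
The four bullseye cases can also be reproduced numerically. The following is a minimal sketch under assumed settings: the target sits at the origin, and the `offset` and `spread` values for each case are made-up numbers chosen only for illustration. It draws 2-D "shots" and reports how far their average lands from the center (bias) and how scattered they are (variance).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical settings: (distance of the aim point from the target, shot spread).
cases = {
    "low bias, low variance": (0.0, 0.2),
    "low bias, high variance": (0.0, 1.0),
    "high bias, low variance": (2.0, 0.2),
    "high bias, high variance": (2.0, 1.0),
}

def simulate(offset, spread, n_shots=5000):
    """Draw shots around the point (offset, 0) and measure bias and variance."""
    shots = rng.normal(loc=[offset, 0.0], scale=spread, size=(n_shots, 2))
    bias = np.linalg.norm(shots.mean(axis=0))  # distance of the average shot from the center
    variance = shots.var(axis=0).sum()         # total scatter around the average shot
    return bias, variance

results = {name: simulate(offset, spread) for name, (offset, spread) in cases.items()}
for name, (bias, variance) in results.items():
    print(f"{name:26s} bias={bias:.2f} variance={variance:.2f}")
```

In this analogy each shot stands for the prediction of a model trained on a different sample of the data, so the spread reflects how much the model changes from one training set to another.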

The graph on the right illustrates the **bias-variance tradeoff** in relation to **model complexity** and how it impacts error. It demonstrates how both **bias** and **variance** contribute to the overall error as you increase the complexity of a model, and highlights the concept of **underfitting** and **overfitting**. Here’s a detailed breakdown:

### Axes:
- **x-axis (Model Complexity)**: Represents the complexity of the model, ranging from simple models (on the left) to more complex models (on the right). As complexity increases, the model becomes more flexible and capable of fitting data in greater detail.
  
- **y-axis (Error)**: Represents the error (which can be training error, test error, or generalization error). This can be thought of as how far off the model's predictions are from the actual values.

### Lines on the Graph:
1. **Bias (train)** (orange line): 
   - Bias decreases as model complexity increases. This is because simpler models (low complexity) cannot capture the underlying patterns in the data and tend to underfit, leading to **high bias**.
   - As the model becomes more complex, it fits the training data better, reducing bias, but increasing the risk of overfitting (fitting the noise in the data too closely).

2. **Variance** (blue line):
   - Variance increases as model complexity increases. This is because more complex models are highly sensitive to fluctuations in the training data, and their performance can vary a lot across different datasets. This results in **high variance**.
   - While simpler models have low variance, they do not capture the true patterns in the data as effectively, which leads to underfitting.

3. **Generalization Error (Test)** (pink line):
   - This line represents the **total error** or **generalization error**, which is the error a model makes on unseen test data. For squared-error loss, it combines the squared bias, the variance, and an irreducible noise term that no model can remove.
   - At the beginning, when the model is too simple, the generalization error is high due to high bias (underfitting).
   - As the model becomes more complex, it starts to fit the training data better, reducing bias and improving generalization error.
   - Beyond a certain point, however, as the model becomes too complex, the generalization error starts to increase again due to high variance (overfitting).
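
For squared-error loss, the way bias and variance add up to the test error can be written out explicitly. The following is the standard textbook decomposition, stated under the assumption that the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$. The symbols $f$, $\hat{f}$, and $\sigma^2$ are notation introduced here and do not appear in the figure; the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The noise term does not depend on the model, which is why the curve in the graph is driven by the bias and variance terms only.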

### Key Concepts:
- **Underfitting**: On the left-hand side of the graph, where the model is too simple (low complexity), both the bias and generalization error are high. The model is unable to capture the underlying patterns of the data, leading to poor performance on both training and test sets.
  
- **Overfitting**: On the right-hand side of the graph, where the model is very complex (high complexity), the model fits the training data too closely, including noise, which causes poor generalization to test data. Here, the **variance** is high, leading to an increase in the generalization error, even though bias is low.

- **Optimal Complexity**: In the middle of the graph, there is a **sweet spot** (shaded region) where the **bias and variance are balanced**, and the model has **minimal generalization error**. This is the ideal level of complexity, where the model is complex enough to capture the underlying patterns but not so complex that it overfits the data.

### Summary:
- **Bias** decreases as model complexity increases (simpler models underfit).
- **Variance** increases as model complexity increases (complex models overfit).
- The **generalization error** is the combination of bias and variance and has a U-shaped curve. The optimal model complexity is at the lowest point of this curve, where the model achieves the best balance between bias and variance and performs best on unseen data.
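
The trend summarized above can be checked empirically on synthetic data. The sketch below is an illustration under stated assumptions, not part of the original notebook: the true function is a sine curve, the noise level is 0.3, and polynomial degree stands in for model complexity. It refits each model on many noisy training sets and estimates squared bias and variance at fixed evaluation points.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Assumed ground-truth function for the simulation.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)       # fixed design: only the noise changes between repeats
x_test = np.linspace(0.05, 0.95, 50)  # fixed points where predictions are compared
noise_sd, n_repeats = 0.3, 200

def bias_variance(degree):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    preds = np.empty((n_repeats, x_test.size))
    for i in range(n_repeats):
        # Each repeat draws fresh noise, giving a new training set.
        y_train = true_fn(x_train) + rng.normal(0, noise_sd, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[i] = np.polyval(coefs, x_test)
    bias_sq = np.mean((preds.mean(axis=0) - true_fn(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

estimates = {d: bias_variance(d) for d in (1, 3, 9)}
for d, (b, v) in estimates.items():
    print(f"degree={d}: bias^2={b:.4f} variance={v:.4f} sum={b + v:.4f}")
```

With these settings one would expect the squared bias to shrink and the variance to grow as the degree increases, which is the pattern the graph describes.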

<hr>

### Techniques to Balance Bias and Variance
There are different techniques to balance bias and variance to achieve an optimal prediction error.

1. **Reducing High Bias**
   - **Choosing a more complex model:** As the diagram above shows, a more complex model may reduce the bias error of the model's predictions.
   - **Adding more features:** Adding more features increases the complexity of the model, so it can capture hidden patterns better, which decreases the bias error.
   - **Reducing regularization:** Regularization prevents overfitting by decreasing variance, but it can increase bias. Lowering the regularization strength, or removing regularization altogether, can reduce bias error.
2. **Reducing High Variance**
   - **Applying regularization techniques:** Regularization adds a penalty for model complexity, which effectively makes the model less complex. A less complex model shows less variance.
   - **Simplifying the model:** A less complex model has lower variance. You can reduce variance by using a simpler algorithm.
   - **Adding more data:** Training on more data can help the model perform better and show less variance.
   - **Cross-validation:** Cross-validation helps identify overfitting by comparing performance on the training and validation splits of the dataset.

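
The cross-validation check in the last bullet can be sketched as follows. This is a minimal illustration, assuming synthetic data (a noisy sine curve) in place of a real dataset; the helper name `train_and_validation_mse` is hypothetical. It uses scikit-learn's `cross_validate` with `return_train_score=True` so that training and validation errors can be compared side by side.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=60)

def train_and_validation_mse(degree):
    """Return mean training and validation MSE from 5-fold cross-validation."""
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_validate(
        model, X, y, cv=5,
        scoring="neg_mean_squared_error",
        return_train_score=True,
    )
    # scikit-learn reports negated MSE, so flip the sign back.
    return -scores["train_score"].mean(), -scores["test_score"].mean()

for degree in (1, 4, 15):
    train_mse, val_mse = train_and_validation_mse(degree)
    print(f"degree={degree:2d} train MSE={train_mse:.3f} validation MSE={val_mse:.3f}")
```

A training error that is high and close to the validation error points to high bias, while a low training error with a much higher validation error points to high variance.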
### Bias and Variance Examples Using Python
Let's implement some practical examples in Python. There are four examples: the first three show different levels of bias and variance, and the fourth tunes the model toward a balance of the two.

### Example of High Bias
Below is an implementation example in Python that illustrates how bias and variance can be analyzed using the Boston Housing dataset.

The **Boston Housing Dataset** is widely used in machine learning and statistical analysis for predicting housing prices in the Boston area based on various features of the homes and neighborhoods. Here’s a breakdown of the columns visible in the table:

- **crim**: Per capita crime rate by town.
- **zn**: Proportion of residential land zoned for lots over 25,000 sq. ft.
- **indus**: Proportion of non-retail business acres per town.
- **chas**: Charles River dummy variable (1 if tract bounds river; 0 otherwise).
- **nox**: Nitric oxides concentration (parts per 10 million).
- **rm**: Average number of rooms per dwelling.
- **age**: Proportion of owner-occupied units built before 1940.
- **dis**: Weighted distances to five Boston employment centers.
- **rad**: Index of accessibility to radial highways.
- **tax**: Full-value property tax rate per $10,000.
- **ptratio**: Pupil-teacher ratio by town.
- **b**: 1000(Bk - 0.63)^2 where Bk is the proportion of African Americans by town.
- **lstat**: Percentage of lower status of the population.
- **medv**: Median value of owner-occupied homes in $1000s.

This dataset is part of the UCI Machine Learning Repository and is often used for regression tasks.

In [5]:
import numpy as np
import pandas as pd


df = pd.read_csv("../datasets/BostonHousing.csv")  # forward slashes avoid the invalid "\d" and "\B" escape sequences
df


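| Unnamed: 0 | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 | 22.0 |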


In [12]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Features are all columns except the last; the target is the last column (medv).
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr = LinearRegression()
lr.fit(X_train, y_train)

# Error on the data the model was fitted to.
train_preds = lr.predict(X_train)
train_mse = mean_squared_error(y_train, train_preds)
print("Training MSE:", train_mse)

# Error on the held-out data.
test_preds = lr.predict(X_test)
test_mse = mean_squared_error(y_test, test_preds)
print("Testing MSE:", test_mse)

Training MSE: 21.641412753226312
Testing MSE: 24.291119474973613


The output shows the training and testing mean squared errors (MSE) of the linear regression model. The training MSE is 21.64 and the testing MSE is 24.29. Both errors are relatively high and close to each other, indicating that the model has a high level of bias and moderate variance.

### Example of Low Bias and High Variance
Let's try a polynomial regression model.

In [7]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

pr = LinearRegression()
pr.fit(X_train_poly, y_train)

train_preds = pr.predict(X_train_poly)
train_mse = mean_squared_error(y_train, train_preds)
print("Training MSE:", train_mse)

test_preds = pr.predict(X_test_poly)
test_mse = mean_squared_error(y_test, test_preds)
print("Testing MSE:", test_mse)

Training MSE: 5.31446956670908
Testing MSE: 14.183558207567042


The output shows the training and testing MSE of the polynomial regression model with degree=2.

The training MSE is 5.31 and the testing MSE is 14.18. Compared to the linear regression model, the lower training error indicates lower bias, and the much larger gap between training and testing error indicates higher variance.

### Example of Low Variance
To reduce variance, we can use regularization techniques such as ridge regression or lasso regression. In the following example, we will be using ridge regression:

In [11]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1)
ridge.fit(X_train_poly, y_train)

train_preds = ridge.predict(X_train_poly)
train_mse = mean_squared_error(y_train, train_preds)
print("Training MSE:", train_mse)

test_preds = ridge.predict(X_test_poly)
test_mse = mean_squared_error(y_test, test_preds)

print("Testing MSE:", test_mse)

Training MSE: 5.66620937860839
Testing MSE: 14.142093755326755


The output shows the training and testing MSE of the ridge regression model with alpha=1. The training MSE is 5.67 and the testing MSE is 14.14. Compared to the polynomial regression model (5.31 and 14.18), the training error is slightly higher and the testing error is slightly lower, indicating that the model has slightly lower variance but slightly higher bias.
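
To see how `alpha` moves a ridge model along the bias-variance tradeoff, it helps to sweep several values. The sketch below is an illustration on synthetic data, which is an assumption made so the block runs without the Boston CSV. The `alpha` values are arbitrary, and the `StandardScaler` step is an addition that the notebook cell above does not use.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption; the notebook itself uses the Boston CSV).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * X[:, 1] + 2 * X[:, 2] + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

mse_by_alpha = {}
for alpha in (0.01, 1, 100, 10000):
    # Same degree-2 features for every alpha; only the penalty strength changes.
    model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    mse_by_alpha[alpha] = (
        mean_squared_error(y_train, model.predict(X_train)),
        mean_squared_error(y_test, model.predict(X_test)),
    )

for alpha, (train_mse, test_mse) in mse_by_alpha.items():
    print(f"alpha={alpha:>8}: train MSE={train_mse:.3f} test MSE={test_mse:.3f}")
```

The training error can only stay the same or rise as `alpha` grows, because a stronger penalty pulls the coefficients away from the least-squares fit. Whether the test error improves depends on how much variance the penalty removes in exchange.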

### Example of Optimal Bias and Variance
We can further tune the hyperparameter alpha to find the optimal balance between bias and variance. Let's see an example:

In [13]:
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha': np.logspace(-3, 3, 7)}
ridge_cv = GridSearchCV(Ridge(), param_grid, cv=5)
ridge_cv.fit(X_train_poly, y_train)

train_preds = ridge_cv.predict(X_train_poly)
train_mse = mean_squared_error(y_train, train_preds)
print("Training MSE:", train_mse)

test_preds = ridge_cv.predict(X_test_poly)
test_mse = mean_squared_error(y_test, test_preds)
print("Testing MSE:", test_mse)

Training MSE: 8.326082686584716
Testing MSE: 12.873907256619141




The output shows the training and testing MSE of the ridge regression model with the alpha value selected by the grid search.

The training MSE is 8.33 and the testing MSE is 12.87, the lowest testing error of the four models, indicating that the model has a good balance between bias and variance.
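
It can be useful to inspect which alpha the grid search picked and what it optimized. The sketch below is a self-contained variant on synthetic data, which is an assumption made so the block runs on its own. It wraps the polynomial expansion, a `StandardScaler`, and `Ridge` in a `Pipeline`, and sets `scoring="neg_mean_squared_error"` explicitly. By default `GridSearchCV` uses the estimator's own `score` method, which is R² for `Ridge`, so the notebook cell above selects alpha by R² rather than by MSE.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * X[:, 1] + 2 * X[:, 2] + rng.normal(0, 0.5, size=200)

# Keeping the preprocessing inside the pipeline means it is refitted on each CV fold.
pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])
param_grid = {"ridge__alpha": np.logspace(-3, 3, 7)}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print("Best alpha:", search.best_params_["ridge__alpha"])
print("Cross-validated MSE at best alpha:", -search.best_score_)
```

Scaling the expanded features is a common precaution with ridge regression, since the penalty treats all coefficients alike and unscaled polynomial terms can differ in magnitude by orders of magnitude.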

*References*
https://www.tutorialspoint.com/machine_learning/machine_learning_bias_and_variance.htm
