# 🎯 Bias-Variance Trade-Off in Machine Learning

- Understanding the **bias-variance trade-off** is crucial for building models that generalize well to unseen data.

---

## 🔍 What is Bias?
- **Bias** refers to errors due to overly simplistic assumptions in the learning algorithm.
- High bias can cause the model to **miss relevant patterns** in the data (📉 underfitting).
- Common in models like **linear regression** or shallow decision trees when applied to complex problems.

### 🔴 Signs of High Bias:
- Poor performance on training data.
- Poor performance on test data.
- Model is **too simple** to capture the complexity of the data.

---

## 🔄 What is Variance?
- **Variance** refers to the model’s sensitivity to fluctuations in the training data.
- High variance can cause the model to **learn noise** in the data (📈 overfitting).
- Common in models like **deep decision trees**, **high-degree polynomial regression**, or complex neural networks without regularization.

### 🔴 Signs of High Variance:
- Excellent performance on training data.
- Poor performance on test/validation data.
- Model is **too complex** and over-adapts to training data.
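
As a rough illustration of these signs, the sketch below fits an unconstrained decision tree to synthetic noisy data and compares training and test scores. The data, the random seed, and all variable names are assumptions made for illustration; they are not part of the original material.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a sine curve plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# No depth limit: the tree keeps splitting until it reproduces the training targets.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = tree.score(X_train, y_train)
test_r2 = tree.score(X_test, y_test)
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

A near-perfect training score paired with a clearly lower test score is the pattern described in the list above.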

---

## ⚖️ Why a Balanced Model is Essential
- A balanced model achieves the **right trade-off**:
  - Low enough bias to capture data patterns
  - Low enough variance to generalize to new data
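- 🏆 Goal: **Minimize total error (bias² + variance + irreducible error)**. Only the bias and variance terms depend on the model; the irreducible error comes from noise in the data and sets a floor on what any model can achieve.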
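
For squared-error loss, the total error in the goal above corresponds to the standard decomposition below. The notation is an assumption introduced here: the data follow $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, $\hat{f}$ is the model fitted on a random training set, and expectations are taken over training sets and noise.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
$$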

---

## ✅ How to Avoid Overfitting (High Variance)

1. 🔽 **Simplify the Model**: Avoid unnecessarily deep or complex models.
2. 🛠️ **Regularization**: Use L1 (Lasso) or L2 (Ridge) regularization to penalize overly complex models.
3. 🔁 **Cross-Validation**: Use techniques like K-Fold CV to estimate how well your model performs on unseen data and to catch overfitting before deployment.
4. ⏹️ **Early Stopping**: Stop training once performance on validation data begins to degrade.
5. 📊 **Use More Training Data**: More examples help the model generalize better.
6. 📦 **Data Augmentation** (for images/text): Increase training set variety artificially.
7. 🧹 **Feature Selection**: Remove irrelevant or noisy features to prevent overfitting.
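
Items 2 and 3 can be combined in a few lines. The following is a minimal sketch, assuming synthetic data with few samples and many mostly irrelevant features; the data, the `alpha` value, and the dictionary names are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption): 60 samples, 40 features, only 3 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 40))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=1.0, size=60)

# 5-fold cross-validation with shuffling for a stable error estimate.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "no regularization": LinearRegression(),
    "ridge (alpha=10)": Ridge(alpha=10.0),  # L2 penalty shrinks the coefficients
}

cv_mse = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()  # scikit-learn returns negated MSE
    print(f"{name}: mean CV MSE = {cv_mse[name]:.2f}")
```

In a setup like this, the penalized model would be expected to show the lower cross-validated error, because the unpenalized fit has almost as many parameters as training samples.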

---

## 🧠 How to Avoid Underfitting (High Bias)

1. ➕ **Use a more complex model**: Try moving from linear models to tree-based models or neural networks.
2. 🧮 **Add more features**: Include relevant predictors that can help the model learn better.
3. 🛠️ **Reduce regularization** if it's too strong.
4. 🔄 **Train longer**: Allow more training time or more epochs, especially if early stopping is triggering too soon.
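
Items 1 and 2 can be illustrated together: adding a squared feature gives a linear model enough capacity to fit a curved relationship. This is a sketch on synthetic quadratic data; the data, the degrees compared, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption): a quadratic relationship plus mild noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for degree in (1, 2):
    # degree=1 is a plain straight line; degree=2 adds the x^2 feature.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    scores[degree] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(
        f"degree {degree}: train R^2 = {scores[degree][0]:.3f}, "
        f"test R^2 = {scores[degree][1]:.3f}"
    )
```

The straight-line model scores poorly on both splits (high bias), while the model with the extra feature fits both well.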

---

## 📌 Summary Table

| Issue         | Cause                     | Effect              | Fixes                                |
|---------------|---------------------------|---------------------|--------------------------------------|
| High Bias     | Model too simple          | Underfitting        | Use more complex model, add features |
| High Variance | Model too complex         | Overfitting         | Regularization, early stopping, CV   |
| Balanced      | Optimal bias and variance | Good generalization | Careful tuning and model validation  |

- ✨ **Pro Tip**: Always visualize training vs. validation loss curves to detect overfitting or underfitting patterns early!
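
The training vs. validation comparison recommended in the Pro Tip can be sketched with scikit-learn's `validation_curve`, here sweeping the depth of a decision tree. The synthetic data, the depth range, and the variable names are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption): a sine curve plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

# Evaluate trees of increasing depth with 5-fold cross-validation.
depths = np.arange(1, 13)
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=0),
    X,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE with one column per fold; average and flip the sign.
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)
best_depth = depths[np.argmin(val_mse)]
print(f"Depth with lowest validation MSE: {best_depth}")

plt.figure(figsize=(8, 5))
plt.plot(depths, train_mse, "o-", label="Training MSE")
plt.plot(depths, val_mse, "s--", label="Validation MSE")
plt.xlabel("max_depth")
plt.ylabel("Mean squared error")
plt.title("Training vs. Validation Error")
plt.legend()
plt.grid()
plt.tight_layout()
plt.show()
```

On the left of such a plot both errors are high (underfitting); on the right the training error keeps falling while the validation error rises again (overfitting).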

---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


In [None]:
# Dataset

iris = load_iris()
X = iris.data
Y = iris.target

In [None]:
# Standardization

scaler = StandardScaler()
x_scaled = scaler.fit_transform(X)

In [None]:
# Principal Component Analysis

pca = PCA(n_components=2)
x_pca = pca.fit_transform(x_scaled)

In [None]:
# DataFrame

df = pd.DataFrame(data=x_pca, columns=['PC 1','PC 2'])
df['Target'] = Y
print(df.head())

In [None]:
# Explained variance ratio of each component and its running total

variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(variance)

print(f"Explained Variance Ratio: {variance}\nCumulative Variance: {cumulative_variance}")

In [None]:
# Scree Plot
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(variance) + 1), variance, 'o-', color='blue', label='Individual Explained Variance')
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, 's--', color='orange', label='Cumulative Explained Variance')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.xticks(range(1, len(variance) + 1))
plt.legend(loc='center right')
plt.grid()
plt.tight_layout()
plt.show()

In [None]:
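# Load data

df = pd.read_csv('customer_data.csv')

# Impute missing 'fea_2' values with the column mean, then drop duplicate rows and the ID column
df.fillna({'fea_2': df['fea_2'].mean()}, inplace=True)
df.drop_duplicates(inplace=True)
df = df.drop(['id'], axis=1)

# Separate the features from the target label
X = df.drop(['label'], axis=1)
Y = df['label']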


In [None]:
# Standardization

scaler = StandardScaler()
x_scaled = scaler.fit_transform(X)

In [None]:
# Principal Component Analysis

pca = PCA(n_components=3)
x_pca = pca.fit_transform(x_scaled)

In [None]:
# DataFrame

df = pd.DataFrame(data=x_pca, columns=['PC 1','PC 2', 'PC 3'])
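# Assign by position: Y keeps the original row index (with gaps after drop_duplicates),
# so assigning the Series directly would misalign labels with the new 0..n-1 index
df['Target'] = Y.to_numpy()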
print(df.head())

In [None]:
# Explained variance ratio of each component and its running total

variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(variance)

print(f"Explained Variance Ratio: {variance}\nCumulative Variance: {cumulative_variance}")

In [None]:
# Scree Plot
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(variance) + 1), variance, 'o-', color='blue', label='Individual Explained Variance')
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, 's--', color='orange', label='Cumulative Explained Variance')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.xticks(range(1, len(variance) + 1))
plt.legend(loc='center right')
plt.grid()
plt.tight_layout()
plt.show()