
# Introduction to Statistical Learning: Unsupervised Learning, Train/Test Splits, and the Bias-Variance Tradeoff

## Q1: What is Unsupervised Learning?
Unsupervised learning is a type of machine learning where the algorithm is trained on data that does not have labeled outputs. The goal is to uncover hidden patterns, relationships, or structures in the data.

Key Characteristics of Unsupervised Learning:
- No labels or target variable.
- Goal: To find underlying structure or distribution in the data.
- Applications: Grouping similar items, reducing the dimensionality of the data, detecting anomalies.

### Types of Unsupervised Learning
1. Clustering
   - Grouping similar data points into clusters based on some notion of similarity.
   - **Algorithms**: K-means clustering, Hierarchical clustering.

2. Dimensionality Reduction
   - Reducing the number of input variables while preserving the most important information.
   - **Algorithms**: Principal Component Analysis (PCA), t-SNE.
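
As a rough illustration of dimensionality reduction, the sketch below applies PCA to a small synthetic dataset. The data, the variable names (`X_demo`, `X_reduced`), and the choice of two components are assumptions made for this example only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 6 samples, 3 features. The third feature is almost
# the sum of the first two, so two components should capture nearly all variance.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
third = base[:, 0] + base[:, 1] + 0.01 * rng.normal(size=6)
X_demo = np.column_stack([base, third])

# Project the 3-dimensional data onto its top 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_demo)

print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute gives the fraction of total variance retained by each component, which is one common way to judge how much information the reduction keeps.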

### Example: Clustering in Python using K-Means

```python
from sklearn.cluster import KMeans
import numpy as np

# Sample dataset
X = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [6, 7], [8, 9]])

# K-means clustering
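# n_init is set explicitly so the result does not depend on the scikit-learn version's default
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)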
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

print("Cluster Labels:", labels)
print("Cluster Centroids:", centroids)
```

## Q2: How to Select Training and Testing Data
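A model should be evaluated on data it did not see during training. The dataset is therefore split into a training set, used to fit the model, and a testing set, used to estimate how well the model generalizes to unseen data.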

### Common Techniques
1. **Holdout Method**: A simple random split of the data into training (70-80%) and testing (20-30%).

```python
from sklearn.model_selection import train_test_split
import pandas as pd

# Example dataset
data = {'Feature1': [1, 2, 3, 4, 5, 6],
        'Feature2': [5, 6, 7, 8, 9, 10],
        'Target': [0, 1, 0, 1, 0, 1]}
df = pd.DataFrame(data)

# Split data
X = df[['Feature1', 'Feature2']]
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_train, y_train, X_test, y_test)
```

2. **K-Fold Cross-Validation**: Splits data into K subsets and repeatedly trains and tests across folds.

```python
from sklearn.model_selection import KFold
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7]])
y = np.array([0, 1, 0, 1, 0, 1])

kf = KFold(n_splits=3)

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("TRAIN:", train_index, "TEST:", test_index)
```
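
The loop above only prints the fold indices. As a sketch of the train-and-test step on each fold, scikit-learn's `cross_val_score` can run that loop and return one score per fold. The iris data and the logistic regression model are choices made here for illustration, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)

# shuffle=True matters here because the iris rows are ordered by class
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)

# One accuracy value per fold: fit on 4 folds, score on the held-out fold
scores = cross_val_score(model, X_iris, y_iris, cv=kf)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Averaging the per-fold scores usually gives a more stable estimate of performance than a single holdout split, at the cost of fitting the model K times.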

3. **Stratified Split**: Ensures class proportions are preserved in both training and testing sets.

```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
```
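
One way to check that stratification worked is to compare class proportions in the two sets. The sketch below repeats the split on iris and counts labels with `np.bincount`; the names `y_tr`, `y_te`, `train_props`, and `test_props` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# Same stratified split as above; only the label arrays are needed here
_, _, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

# Fraction of samples belonging to each class in each set
train_props = np.bincount(y_tr) / len(y_tr)
test_props = np.bincount(y_te) / len(y_te)

print("Train class proportions:", train_props)
print("Test class proportions:", test_props)
```

Because iris has three equally sized classes, both sets should show roughly one third of the samples in each class.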

## Q3: Bias, Variance, and Bias-Variance Tradeoff
Bias refers to the error introduced by approximating a complex real-world problem with a simplified model. Variance refers to the model's sensitivity to small changes in the training data.

### Bias-Variance Tradeoff
The bias-variance tradeoff is the balance between underfitting (high bias) and overfitting (high variance).

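- High bias: The model is too simple to capture the underlying pattern (underfitting), so error is high on both training and test data.
- High variance: The model is complex enough to fit noise in the training data (overfitting), so training error is low but test error is high.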

The goal is to minimize total expected error, which for squared-error loss is the sum of squared bias, variance, and irreducible error (noise that no model can remove).
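
For squared-error loss the decomposition can be written out explicitly. The notation is introduced here and is not part of the text above: assume the data follow $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and let $\hat{f}$ be the model fitted on a random training set. Under these assumptions, the expected error at a point $x$ is:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
$$

The expectations are taken over training sets and noise. Making the model more flexible tends to shrink the first term and grow the second, which is where the tradeoff comes from.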

### Example of Bias-Variance Tradeoff

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

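np.random.seed(42)  # fixed seed for reproducibility
# 100 points spread uniformly over [-3, 3)
X = np.random.rand(100, 1) * 6 - 3
# Cubic ground truth plus Gaussian noise (standard deviation 3)
y = 0.5 * X**3 - X + 2 + np.random.randn(100, 1) * 3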

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def plot_model(degree):
    poly_features = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly_train = poly_features.fit_transform(X_train)
    X_poly_test = poly_features.transform(X_test)
    
    model = LinearRegression()
    model.fit(X_poly_train, y_train)
    
    y_train_predict = model.predict(X_poly_train)
    y_test_predict = model.predict(X_poly_test)
    
    train_mse = mean_squared_error(y_train, y_train_predict)
    test_mse = mean_squared_error(y_test, y_test_predict)
    
    X_plot = np.linspace(-3, 3, 100).reshape(100, 1)
    X_plot_poly = poly_features.transform(X_plot)
    y_plot = model.predict(X_plot_poly)
    
    plt.scatter(X_train, y_train)
    plt.plot(X_plot, y_plot, color='r')
    plt.title(f"Degree {degree} Polynomial\nTrain MSE: {train_mse:.2f}, Test MSE: {test_mse:.2f}")
    plt.show()

plot_model(1)  # High bias (underfitting)
plot_model(3)  # Balanced model
plot_model(15)  # High variance (overfitting)
```
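
The plots show the effect on a single training set. To see bias and variance as separate quantities, one option is to refit each model on many freshly drawn training sets and compare the predictions at fixed points. The sketch below does this under the assumption that the true function is the noiseless cubic used to generate the data; the helper names (`true_f`, `estimate_bias_variance`) and the settings (200 repeats, 80 samples per set) are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def true_f(x):
    # Noiseless version of the cubic used to generate the data above
    return 0.5 * x**3 - x + 2

def estimate_bias_variance(degree, n_repeats=200, n_samples=80, noise_sd=3.0, seed=0):
    rng = np.random.default_rng(seed)
    x_eval = np.linspace(-2.5, 2.5, 50).reshape(-1, 1)  # fixed evaluation points
    preds = np.empty((n_repeats, len(x_eval)))
    for i in range(n_repeats):
        # Draw a new training set from the same data-generating process
        x = rng.uniform(-3, 3, size=(n_samples, 1))
        y = true_f(x).ravel() + rng.normal(0, noise_sd, size=n_samples)
        model = make_pipeline(
            PolynomialFeatures(degree, include_bias=False), LinearRegression()
        )
        model.fit(x, y)
        preds[i] = model.predict(x_eval)
    # Squared bias: gap between the average prediction and the true function
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_f(x_eval).ravel()) ** 2)
    # Variance: spread of predictions across training sets
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {d: estimate_bias_variance(d) for d in (1, 3, 15)}
for d, (b, v) in results.items():
    print(f"degree={d:2d}  bias^2={b:.3f}  variance={v:.3f}")
```

Under these assumptions the degree-1 model should show the largest squared bias and the degree-15 model the largest variance, matching the labels on the three `plot_model` calls.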

