# Multi-Layer Perceptron (MLP) in sklearn

## Overview

**Multi-Layer Perceptron (MLP)** is a feedforward artificial neural network with one or more hidden layers. It's sklearn's implementation of neural networks for classification and regression.

### Core Concept

*"Learn complex non-linear patterns through layers of interconnected neurons"*

### Key Ideas

1. **Layers**: Input → Hidden(s) → Output
2. **Non-linearity**: Activation functions enable complex patterns
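3. **Backpropagation**: Computes the gradient of the loss with respect to every weight; a gradient-based optimizer then updates the weights
4. **Universal Approximation**: With enough hidden units, can approximate any continuous function on a bounded input domain to arbitrary accuracy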

## Mathematical Foundation

### Single Neuron (Perceptron)

\[
y = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right) = \sigma(w^T x + b)
\]

where:
- \(x\) = input features
- \(w\) = weights
- \(b\) = bias
- \(\sigma\) = activation function
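
As a concrete check of the formula above, the following minimal sketch evaluates a single neuron with NumPy. The input, weight, and bias values are made up for illustration, and the logistic function stands in for \(\sigma\).

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values only (not taken from any dataset in this notebook)
x_in = np.array([1.0, 2.0, 3.0])     # input features
w_in = np.array([0.5, -0.25, 0.0])   # weights
b_in = 0.0                           # bias

z = w_in @ x_in + b_in   # weighted sum: 0.5*1 - 0.25*2 + 0*3 + 0 = 0.0
y_out = sigmoid(z)       # sigmoid(0) = 0.5
print(f"z = {z:.2f}, y = {y_out:.2f}")   # z = 0.00, y = 0.50
```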

### Multi-Layer Network

**Forward Pass** (2 hidden layers):

\[
h^{(1)} = \sigma_1(W^{(1)} x + b^{(1)})
\]
\[
h^{(2)} = \sigma_2(W^{(2)} h^{(1)} + b^{(2)})
\]
\[
\hat{y} = \sigma_3(W^{(3)} h^{(2)} + b^{(3)})
\]
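
The three equations above can be sketched directly in NumPy. This is an illustration under assumed choices, not sklearn's internal code: the layer sizes (4 → 10 → 5 → 3), the random weights, ReLU for \(\sigma_1\) and \(\sigma_2\), and softmax for \(\sigma_3\) are all picked for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical layer sizes: 4 inputs -> 10 -> 5 -> 3 outputs
sizes = [4, 10, 5, 3]

# Each W has shape (units_out, units_in) so that W @ h matches the equations
weights = [rng.normal(scale=0.5, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

x_vec = rng.normal(size=4)                       # one input sample
h1 = relu(weights[0] @ x_vec + biases[0])        # first hidden layer
h2 = relu(weights[1] @ h1 + biases[1])           # second hidden layer
y_hat = softmax(weights[2] @ h2 + biases[2])     # class probabilities

print(h1.shape, h2.shape, y_hat.shape)
print(f"Probabilities sum to {y_hat.sum():.6f}")
```

Note that scikit-learn stores each weight matrix transposed relative to these equations: `coefs_[i]` has shape (n_in, n_out) and is applied as `X @ coefs_[i]`.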

### Activation Functions

**ReLU (Rectified Linear Unit)**:
\[
\text{ReLU}(x) = \max(0, x)
\]

**Logistic (Sigmoid)**:
\[
\sigma(x) = \frac{1}{1 + e^{-x}}
\]

**Tanh (Hyperbolic Tangent)**:
\[
\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
\]

**Identity (Linear)**:
\[
f(x) = x
\]

### Loss Functions

**Classification (Cross-Entropy)**:
\[
L = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik} \log(\hat{y}_{ik})
\]

**Regression (Mean Squared Error)**:
\[
L = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
\]
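
Both losses are easy to verify by hand on a tiny example. The sketch below uses made-up one-hot labels, predicted probabilities, and regression targets; it evaluates the two formulas above exactly as written, without any regularization term.

```python
import numpy as np

# Cross-entropy: 3 samples, 3 classes (one-hot labels, illustrative probabilities)
Y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]], dtype=float)
P_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.2, 0.6]])

# Only the probability assigned to the true class contributes to each term
cross_entropy = -np.mean(np.sum(Y_true * np.log(P_pred), axis=1))
print(f"Cross-entropy: {cross_entropy:.4f}")   # -(ln 0.7 + ln 0.8 + ln 0.6) / 3 = 0.3635

# Mean squared error: 3 samples (illustrative targets and predictions)
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])

mse_demo = np.mean((y_true - y_pred) ** 2)
print(f"MSE: {mse_demo:.4f}")                  # (0.25 + 0 + 4) / 3 = 1.4167
```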

## Topics Covered

1. MLP architecture and components
2. MLPClassifier for classification
3. MLPRegressor for regression
4. Activation functions comparison
5. Hidden layer sizes and network depth
6. Regularization (alpha parameter)
7. Learning rate and optimization
8. Early stopping and validation
9. Feature scaling importance
10. Hyperparameter tuning
11. Learning curves and convergence

## Setup and Import

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time
import warnings
warnings.filterwarnings('ignore')  # NOTE: this also hides sklearn's ConvergenceWarning

# MLP models
from sklearn.neural_network import MLPClassifier, MLPRegressor

# Other models for comparison
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC

# Utilities
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV,
    learning_curve, validation_curve
)
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    mean_squared_error, r2_score, mean_absolute_error
)
from sklearn.datasets import (
    make_classification, make_regression, make_moons, make_circles,
    load_breast_cancer, load_diabetes, load_digits
)

np.random.seed(42)
sns.set_style('whitegrid')
print("✓ Libraries imported successfully")

## 1. MLP Architecture and Components

### 1.1 Understanding Network Structure

In [None]:
print("Multi-Layer Perceptron Architecture")
print("="*70)
print("\nStructure: Input Layer → Hidden Layer(s) → Output Layer\n")

# Example architecture
print("Example: Binary Classification with 4 features")
print("-" * 70)

n_features = 4
hidden_layer_sizes = (10, 5)
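n_classes = 2  # one output neuron per class; sklearn's MLPClassifier uses a single logistic output for binary problems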

print(f"\nInput Layer:    {n_features} neurons (features)")
print(f"Hidden Layer 1: {hidden_layer_sizes[0]} neurons")
print(f"Hidden Layer 2: {hidden_layer_sizes[1]} neurons")
print(f"Output Layer:   {n_classes} neurons (classes)")

# Calculate parameters
weights_1 = n_features * hidden_layer_sizes[0]
bias_1 = hidden_layer_sizes[0]
weights_2 = hidden_layer_sizes[0] * hidden_layer_sizes[1]
bias_2 = hidden_layer_sizes[1]
weights_3 = hidden_layer_sizes[1] * n_classes
bias_3 = n_classes

total_params = weights_1 + bias_1 + weights_2 + bias_2 + weights_3 + bias_3

print(f"\nParameter Count:")
print(f"  Layer 1: W{n_features}x{hidden_layer_sizes[0]} + b{hidden_layer_sizes[0]} = {weights_1 + bias_1}")
print(f"  Layer 2: W{hidden_layer_sizes[0]}x{hidden_layer_sizes[1]} + b{hidden_layer_sizes[1]} = {weights_2 + bias_2}")
print(f"  Layer 3: W{hidden_layer_sizes[1]}x{n_classes} + b{n_classes} = {weights_3 + bias_3}")
print(f"  Total Parameters: {total_params}")

print("\n" + "="*70)
print("\n💡 Key Concepts:")
print("   - More layers = deeper network (learns hierarchical features)")
print("   - More neurons per layer = wider network (more capacity)")
print("   - More parameters = more capacity but risk of overfitting")
print("   - Each connection has a learnable weight")
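
The hand count above can be compared with what scikit-learn actually builds. The sketch below fits a small `MLPClassifier` on synthetic data (the dataset and the `_demo` names are assumptions made for this example) and reads the layer shapes from `coefs_` and `intercepts_`. For a binary problem sklearn uses one logistic output neuron rather than one neuron per class, so the fitted total is 111 instead of 117.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic binary problem with 4 features (illustrative data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

clf_demo = MLPClassifier(hidden_layer_sizes=(10, 5), max_iter=50, random_state=0)
clf_demo.fit(X_demo, y_demo)  # may emit a ConvergenceWarning; only the shapes matter here

# coefs_[i] has shape (n_in, n_out); intercepts_[i] has shape (n_out,)
for i, (W, b) in enumerate(zip(clf_demo.coefs_, clf_demo.intercepts_), start=1):
    print(f"Layer {i}: W{W.shape} + b{b.shape} = {W.size + b.size}")

n_params_demo = (sum(W.size for W in clf_demo.coefs_)
                 + sum(b.size for b in clf_demo.intercepts_))
print(f"Total parameters: {n_params_demo}")   # 50 + 55 + 6 = 111
```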

### 1.2 Activation Functions Visualization

In [None]:
print("Activation Functions")
print("="*70)

# Define activation functions
x = np.linspace(-5, 5, 100)

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def identity(x):
    return x

# Plot
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

activations = [
    (relu, 'ReLU', 'max(0, x)', 'Most popular, mitigates vanishing gradients'),
    (sigmoid, 'Logistic', '1 / (1 + e^-x)', 'Output layer for binary classification'),
    (tanh, 'Tanh', '(e^x - e^-x) / (e^x + e^-x)', 'Zero-centered, range (-1, 1)'),
    (identity, 'Identity', 'x', 'Used in regression output layer')
]

for idx, (func, name, formula, desc) in enumerate(activations):
    y = func(x)
    axes[idx].plot(x, y, linewidth=3)
    axes[idx].axhline(y=0, color='k', linestyle='--', alpha=0.3)
    axes[idx].axvline(x=0, color='k', linestyle='--', alpha=0.3)
    axes[idx].grid(alpha=0.3)
    axes[idx].set_xlabel('Input')
    axes[idx].set_ylabel('Output')
    axes[idx].set_title(f'{name}\n{formula}\n{desc}', fontsize=10)
    
    # Add range annotation
    if name == 'ReLU':
        axes[idx].text(0.05, 0.95, 'Range: [0, ∞)', transform=axes[idx].transAxes,
                      verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    elif name == 'Logistic':
        axes[idx].text(0.05, 0.95, 'Range: (0, 1)', transform=axes[idx].transAxes,
                      verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    elif name == 'Tanh':
        axes[idx].text(0.05, 0.95, 'Range: (-1, 1)', transform=axes[idx].transAxes,
                      verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    else:
        axes[idx].text(0.05, 0.95, 'Range: (-∞, ∞)', transform=axes[idx].transAxes,
                      verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

print("\n💡 Choosing Activation Function:")
print("   Hidden Layers: ReLU (default, works well); set via the `activation` parameter")
print("   Output layer (chosen automatically by sklearn, not configurable):")
print("     Binary Classification: logistic")
print("     Multi-class: softmax")
print("     Regression: identity")
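
In scikit-learn the `activation` parameter only affects the hidden layers; the output activation is picked from the task and exposed as the fitted attribute `out_activation_`. The sketch below checks this on small synthetic datasets (the data, the tiny network, and the short `max_iter` are assumptions chosen to keep it fast).

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.neural_network import MLPClassifier, MLPRegressor

# Small synthetic problems (illustrative data)
X_bin, y_bin = make_classification(n_samples=150, n_features=4, random_state=0)
X_multi, y_multi = make_classification(
    n_samples=150, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0
)
X_reg, y_reg = make_regression(n_samples=150, n_features=4, random_state=0)

# Hidden layers use tanh here to show it does not change the output activation
common = dict(hidden_layer_sizes=(8,), activation='tanh', max_iter=20, random_state=0)

# Short training may emit ConvergenceWarnings; only the attribute matters here
out_binary = MLPClassifier(**common).fit(X_bin, y_bin).out_activation_
out_multi = MLPClassifier(**common).fit(X_multi, y_multi).out_activation_
out_regr = MLPRegressor(**common).fit(X_reg, y_reg).out_activation_

print(f"Binary classification: {out_binary}")   # logistic
print(f"Multi-class:           {out_multi}")    # softmax
print(f"Regression:            {out_regr}")     # identity
```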

## 2. MLPClassifier - Binary Classification

### 2.1 Simple Example

In [None]:
# Generate non-linear dataset (moons)
X_moons, y_moons = make_moons(n_samples=500, noise=0.2, random_state=42)

print("MLPClassifier - Binary Classification")
print("="*70)
print(f"Dataset: Half-moons (non-linear)")
print(f"Samples: {X_moons.shape[0]}")
print(f"Features: {X_moons.shape[1]}\n")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_moons, y_moons, test_size=0.3, random_state=42
)

# CRITICAL: Scale features for MLP!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train MLP
mlp = MLPClassifier(
    hidden_layer_sizes=(10, 5),  # 2 hidden layers: 10 and 5 neurons
    activation='relu',           # ReLU activation
    solver='adam',               # Adam optimizer
    max_iter=1000,               # Maximum iterations
    random_state=42
)

print("Training MLP with architecture (2 → 10 → 5 → 1)...")  # binary task: one logistic output neuron
start = time()
mlp.fit(X_train_scaled, y_train)
train_time = time() - start

# Predictions
y_pred = mlp.predict(X_test_scaled)
y_pred_proba = mlp.predict_proba(X_test_scaled)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)

print(f"\nTraining time: {train_time:.3f}s")
print(f"Iterations: {mlp.n_iter_}")
print(f"Accuracy: {accuracy:.4f}")
print(f"\nNetwork Structure:")
print(f"  Input: {X_train.shape[1]} features")
print(f"  Hidden layers: {mlp.hidden_layer_sizes}")
print(f"  Output: {mlp.n_outputs_} neuron ({len(mlp.classes_)} classes, logistic output)")
print(f"  Total layers: {mlp.n_layers_}")

# Show loss curve
plt.figure(figsize=(10, 6))
plt.plot(mlp.loss_curve_, linewidth=2)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Training Loss Curve')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Loss curve shows convergence - loss decreases over iterations")

In [None]:
# Plot decision boundary
def plot_decision_boundary(model, X, y, title):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.figure(figsize=(10, 7))
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='k', s=50)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title)
    plt.tight_layout()
    plt.show()

plot_decision_boundary(mlp, X_test_scaled, y_test, 
                      'MLP Decision Boundary (Half-Moons Dataset)')

print("\n💡 MLP learned complex non-linear decision boundary!")

### 2.2 Real-World Classification: Breast Cancer

In [None]:
# Load dataset
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target

print("MLPClassifier - Breast Cancer Detection")
print("="*70)
print(f"Samples: {X_cancer.shape[0]}")
print(f"Features: {X_cancer.shape[1]}")
print(f"Classes: {cancer.target_names}\n")

# Split and scale
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)

scaler_c = StandardScaler()
X_train_c_scaled = scaler_c.fit_transform(X_train_c)
X_test_c_scaled = scaler_c.transform(X_test_c)

# Train MLP
mlp_cancer = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    alpha=0.0001,  # L2 regularization
    max_iter=500,
    random_state=42
)

print("Training MLP...")
mlp_cancer.fit(X_train_c_scaled, y_train_c)

# Evaluate
y_pred_c = mlp_cancer.predict(X_test_c_scaled)
accuracy_c = accuracy_score(y_test_c, y_pred_c)

print(f"\nAccuracy: {accuracy_c:.4f}")
print(f"Iterations: {mlp_cancer.n_iter_}")
print(f"\nClassification Report:")
print(classification_report(y_test_c, y_pred_c, target_names=cancer.target_names))

# Confusion matrix
cm = confusion_matrix(y_test_c, y_pred_c)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
           xticklabels=cancer.target_names,
           yticklabels=cancer.target_names)
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('MLP Confusion Matrix')
plt.tight_layout()
plt.show()

## 3. MLPRegressor - Regression

### 3.1 Diabetes Progression Prediction

In [None]:
# Load diabetes dataset
diabetes = load_diabetes()
X_diab = diabetes.data
y_diab = diabetes.target

print("MLPRegressor - Diabetes Progression")
print("="*70)
print(f"Samples: {X_diab.shape[0]}")
print(f"Features: {X_diab.shape[1]}\n")

# Split and scale
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
    X_diab, y_diab, test_size=0.2, random_state=42
)

scaler_d = StandardScaler()
X_train_d_scaled = scaler_d.fit_transform(X_train_d)
X_test_d_scaled = scaler_d.transform(X_test_d)

# Train MLPRegressor
mlp_reg = MLPRegressor(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    alpha=0.001,
    max_iter=1000,
    random_state=42
)

print("Training MLPRegressor...")
mlp_reg.fit(X_train_d_scaled, y_train_d)

# Predictions
y_pred_d = mlp_reg.predict(X_test_d_scaled)

# Evaluate
mse = mean_squared_error(y_test_d, y_pred_d)
mae = mean_absolute_error(y_test_d, y_pred_d)
r2 = r2_score(y_test_d, y_pred_d)

print(f"\nResults:")
print(f"  R² Score: {r2:.4f}")
print(f"  RMSE: {np.sqrt(mse):.2f}")
print(f"  MAE: {mae:.2f}")
print(f"  Iterations: {mlp_reg.n_iter_}")

# Plot predictions vs actual
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Predicted vs Actual
axes[0].scatter(y_test_d, y_pred_d, alpha=0.6)
axes[0].plot([y_test_d.min(), y_test_d.max()], 
            [y_test_d.min(), y_test_d.max()], 'r--', linewidth=2)
axes[0].set_xlabel('Actual')
axes[0].set_ylabel('Predicted')
axes[0].set_title(f'Predicted vs Actual (R²={r2:.3f})')
axes[0].grid(alpha=0.3)

# Loss curve
axes[1].plot(mlp_reg.loss_curve_, linewidth=2)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Loss')
axes[1].set_title('Training Loss Curve')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 MLPRegressor can capture non-linear relationships in regression tasks")
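
Whether the MLP's extra flexibility pays off on a dataset this small is worth checking against a linear baseline. The sketch below compares `Ridge` and a smaller `MLPRegressor` with 3-fold cross-validation; the pipelines, the reduced network size, and the fold count are choices made for this example, not settings from the cell above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_db, y_db = load_diabetes(return_X_y=True)

# Scaling lives inside each pipeline, so it is re-fit on every training fold
candidates = {
    'Ridge': make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    'MLPRegressor': make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(50,), alpha=0.001,
                     max_iter=500, random_state=42),
    ),
}

baseline_scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_db, y_db, cv=3, scoring='r2')
    baseline_scores[name] = fold_scores.mean()
    print(f"{name:14} mean R² = {fold_scores.mean():.3f} (+/- {fold_scores.std():.3f})")
```

If the two scores are close, the simpler model is usually the better choice for this dataset.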

## 4. Effect of Hidden Layer Sizes

### 4.1 Comparing Different Architectures

In [None]:
print("Effect of Hidden Layer Sizes")
print("="*70)

# Test different architectures
architectures = [
    (10,),           # Single layer, 10 neurons
    (50,),           # Single layer, 50 neurons
    (100,),          # Single layer, 100 neurons
    (50, 25),        # Two layers
    (100, 50),       # Two layers (wider)
    (100, 50, 25),   # Three layers
]

results = []

for arch in architectures:
    mlp_test = MLPClassifier(
        hidden_layer_sizes=arch,
        activation='relu',
        solver='adam',
        max_iter=500,
        random_state=42
    )
    
    # Train
    start = time()
    mlp_test.fit(X_train_c_scaled, y_train_c)
    train_time = time() - start
    
    # Evaluate
    train_acc = mlp_test.score(X_train_c_scaled, y_train_c)
    test_acc = mlp_test.score(X_test_c_scaled, y_test_c)
    
    # Count parameters
    n_params = sum(w.size for w in mlp_test.coefs_) + sum(b.size for b in mlp_test.intercepts_)
    
    results.append({
        'Architecture': str(arch),
        'Layers': len(arch),
        'Parameters': n_params,
        'Train Acc': train_acc,
        'Test Acc': test_acc,
        'Train Time (s)': train_time
    })
    
    print(f"{str(arch):20} - Test: {test_acc:.4f}, Train: {train_acc:.4f}, "
          f"Params: {n_params:6d}, Time: {train_time:.2f}s")

results_df = pd.DataFrame(results)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
x = np.arange(len(results_df))
width = 0.35
axes[0].bar(x - width/2, results_df['Train Acc'], width, label='Train', alpha=0.8)
axes[0].bar(x + width/2, results_df['Test Acc'], width, label='Test', alpha=0.8)
axes[0].set_ylabel('Accuracy')
axes[0].set_xlabel('Architecture')
axes[0].set_title('Accuracy by Architecture')
axes[0].set_xticks(x)
axes[0].set_xticklabels(results_df['Architecture'], rotation=45, ha='right')
axes[0].legend()
axes[0].grid(alpha=0.3, axis='y')

# Parameters vs Test Accuracy
axes[1].scatter(results_df['Parameters'], results_df['Test Acc'], s=100, alpha=0.7)
for idx, row in results_df.iterrows():
    axes[1].annotate(row['Architecture'], 
                    (row['Parameters'], row['Test Acc']),
                    fontsize=8, ha='right')
axes[1].set_xlabel('Number of Parameters')
axes[1].set_ylabel('Test Accuracy')
axes[1].set_title('Model Complexity vs Accuracy')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("   - Deeper/wider networks can learn more complex patterns")
print("   - Too large = overfitting (high train, low test accuracy)")
print("   - Too small = underfitting (low train and test accuracy)")
print("   - Start with (100,) or (100, 50) as reasonable defaults")

## 5. Activation Function Comparison

### 5.1 Testing Different Activations

In [None]:
print("Activation Function Comparison")
print("="*70)

activations = ['relu', 'logistic', 'tanh', 'identity']
activation_results = []

for activation in activations:
    mlp_act = MLPClassifier(
        hidden_layer_sizes=(100, 50),
        activation=activation,
        solver='adam',
        max_iter=500,
        random_state=42
    )
    
    # Train
    start = time()
    mlp_act.fit(X_train_c_scaled, y_train_c)
    train_time = time() - start
    
    # Evaluate
    test_acc = mlp_act.score(X_test_c_scaled, y_test_c)
    n_iter = mlp_act.n_iter_
    
    activation_results.append({
        'Activation': activation,
        'Test Accuracy': test_acc,
        'Iterations': n_iter,
        'Train Time (s)': train_time
    })
    
    print(f"{activation:10} - Test Acc: {test_acc:.4f}, Iterations: {n_iter:3d}, Time: {train_time:.2f}s")

act_df = pd.DataFrame(activation_results)
act_df = act_df.sort_values('Test Accuracy', ascending=False)

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.barh(act_df['Activation'], act_df['Test Accuracy'], alpha=0.7)
ax.set_xlabel('Test Accuracy')
ax.set_title('Activation Function Performance')
ax.set_xlim([0.9, 1.0])
ax.grid(alpha=0.3, axis='x')

# Color the best one
bars[0].set_color('green')

plt.tight_layout()
plt.show()

print("\n" + act_df.to_string(index=False))

print("\n💡 ReLU is usually a strong default for hidden layers (check the table above for this dataset):")
print("   - Less prone to vanishing gradients than logistic or tanh")
print("   - Computationally efficient")
print("   - Works well in practice")

## 6. Regularization (Alpha Parameter)

### 6.1 Effect of L2 Regularization

In [None]:
print("Regularization (Alpha Parameter)")
print("="*70)
print("Alpha = L2 regularization penalty (prevents overfitting)")
print("  - Small alpha: Less regularization (may overfit)")
print("  - Large alpha: More regularization (may underfit)\n")

# Test different alpha values
alpha_values = [0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0]
alpha_results = []

for alpha in alpha_values:
    mlp_alpha = MLPClassifier(
        hidden_layer_sizes=(100, 50),
        activation='relu',
        solver='adam',
        alpha=alpha,
        max_iter=500,
        random_state=42
    )
    
    mlp_alpha.fit(X_train_c_scaled, y_train_c)
    
    train_acc = mlp_alpha.score(X_train_c_scaled, y_train_c)
    test_acc = mlp_alpha.score(X_test_c_scaled, y_test_c)
    
    alpha_results.append({
        'Alpha': alpha,
        'Train Acc': train_acc,
        'Test Acc': test_acc
    })
    
    print(f"Alpha={alpha:7.5f} - Train: {train_acc:.4f}, Test: {test_acc:.4f}")

alpha_df = pd.DataFrame(alpha_results)

# Plot
plt.figure(figsize=(10, 6))
plt.semilogx(alpha_df['Alpha'], alpha_df['Train Acc'], 'o-', label='Train Accuracy', linewidth=2)
plt.semilogx(alpha_df['Alpha'], alpha_df['Test Acc'], 's-', label='Test Accuracy', linewidth=2)
plt.xlabel('Alpha (Regularization Strength)')
plt.ylabel('Accuracy')
plt.title('Effect of Regularization')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

best_alpha = alpha_df.loc[alpha_df['Test Acc'].idxmax(), 'Alpha']
print(f"\nBest alpha on this test split: {best_alpha} (choose alpha by cross-validation in practice)")

print("\n💡 Regularization helps prevent overfitting")
print("   Default alpha=0.0001 works well for most cases")
print("   Increase if overfitting (high train, low test accuracy)")
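
The mechanism behind these accuracy curves is that a larger `alpha` penalizes large weights more heavily, so the fitted weights stay smaller. The sketch below makes that visible by comparing the sum of squared weights for a weak and a strong penalty. The synthetic dataset, the small network, and the helper name `squared_weight_norm` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic, standardized data (illustrative)
X_l2, y_l2 = make_classification(n_samples=300, n_features=10, random_state=0)
X_l2 = StandardScaler().fit_transform(X_l2)

def squared_weight_norm(alpha):
    """Fit a small MLP and return the sum of squared weights (biases excluded)."""
    clf = MLPClassifier(hidden_layer_sizes=(20,), alpha=alpha,
                        max_iter=300, random_state=0)
    clf.fit(X_l2, y_l2)  # may emit a ConvergenceWarning
    return float(sum((W ** 2).sum() for W in clf.coefs_))

norm_weak = squared_weight_norm(1e-5)    # almost no penalty
norm_strong = squared_weight_norm(10.0)  # heavy penalty

print(f"alpha=1e-5 -> sum of squared weights: {norm_weak:.2f}")
print(f"alpha=10   -> sum of squared weights: {norm_strong:.2f}")
```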

## 7. Learning Rate and Optimization

### 7.1 Comparing Solvers

In [None]:
print("Solver Comparison")
print("="*70)
print("Solvers = Optimization algorithms for weight updates\n")

# Test different solvers
solvers = ['sgd', 'adam', 'lbfgs']
solver_results = []

for solver in solvers:
    # learning_rate_init=0.001 is the default step size for 'sgd' and 'adam';
    # 'lbfgs' ignores it, does not support partial_fit, and needs more memory
    mlp_solver = MLPClassifier(
        hidden_layer_sizes=(100, 50),
        activation='relu',
        solver=solver,
        learning_rate_init=0.001,  # Initial learning rate
        max_iter=500,
        random_state=42
    )
    
    # Train
    start = time()
    mlp_solver.fit(X_train_c_scaled, y_train_c)
    train_time = time() - start
    
    # Evaluate
    test_acc = mlp_solver.score(X_test_c_scaled, y_test_c)
    n_iter = mlp_solver.n_iter_
    
    solver_results.append({
        'Solver': solver,
        'Test Accuracy': test_acc,
        'Iterations': n_iter,
        'Train Time (s)': train_time
    })
    
    print(f"{solver:8} - Test Acc: {test_acc:.4f}, Iterations: {n_iter:3d}, Time: {train_time:.2f}s")

solver_df = pd.DataFrame(solver_results)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy
axes[0].barh(solver_df['Solver'], solver_df['Test Accuracy'], alpha=0.7)
axes[0].set_xlabel('Test Accuracy')
axes[0].set_title('Solver Performance')
axes[0].set_xlim([0.9, 1.0])
axes[0].grid(alpha=0.3, axis='x')

# Training time
axes[1].barh(solver_df['Solver'], solver_df['Train Time (s)'], alpha=0.7, color='orange')
axes[1].set_xlabel('Training Time (seconds)')
axes[1].set_title('Training Time Comparison')
axes[1].grid(alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n💡 Solver Selection:")
print("   adam: Best default choice (adaptive learning rate)")
print("   sgd: Good for large datasets, needs tuning")
print("   lbfgs: Good for small datasets, memory intensive")
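
The comment about `partial_fit` in the cell above points at one reason `sgd` (and `adam`) suit larger datasets: the model can be updated one chunk at a time instead of seeing all data in a single `fit` call. The sketch below simulates this on synthetic data; the chunk size, number of passes, learning rate, and `_stream` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a stream of chunks; scaled up front for
# simplicity (a real stream would need an incrementally fitted scaler)
X_stream, y_stream = make_classification(n_samples=600, n_features=10, random_state=0)
X_stream = StandardScaler().fit_transform(X_stream)

clf_stream = MLPClassifier(hidden_layer_sizes=(20,), solver='sgd',
                           learning_rate_init=0.01, random_state=0)

all_classes = np.unique(y_stream)  # must be supplied on the first partial_fit call
chunk_size = 100

for epoch in range(20):                               # 20 passes over the data
    for start in range(0, len(X_stream), chunk_size):
        chunk = slice(start, start + chunk_size)
        clf_stream.partial_fit(X_stream[chunk], y_stream[chunk], classes=all_classes)

print(f"Training accuracy after incremental updates: "
      f"{clf_stream.score(X_stream, y_stream):.3f}")
```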

## 8. Early Stopping

### 8.1 Preventing Overfitting with Validation

In [None]:
print("Early Stopping")
print("="*70)
print("Stop training when validation score stops improving")
print("Prevents overfitting and saves training time\n")

# Train with early stopping
mlp_early = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    max_iter=1000,
    early_stopping=True,        # Enable early stopping
    validation_fraction=0.1,    # Use 10% for validation
    n_iter_no_change=10,        # Stop if no improvement for 10 iterations
    random_state=42
)

print("Training with early stopping enabled...")
mlp_early.fit(X_train_c_scaled, y_train_c)

print(f"\nStopped at iteration: {mlp_early.n_iter_}")
print(f"Best validation score: {mlp_early.best_validation_score_:.4f}")
print(f"Test accuracy: {mlp_early.score(X_test_c_scaled, y_test_c):.4f}")

# Plot validation curve
plt.figure(figsize=(10, 6))
plt.plot(mlp_early.validation_scores_, linewidth=2, label='Validation Score')
plt.axvline(x=mlp_early.n_iter_, color='red', linestyle='--', 
           label=f'Stopped at iteration {mlp_early.n_iter_}')
plt.xlabel('Iteration')
plt.ylabel('Validation Accuracy')
plt.title('Early Stopping - Validation Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Early stopping:")
print("   - Chooses the number of epochs using a held-out validation split")
print("   - Prevents overfitting")
print("   - Saves computation time")
print("   - Recommended for most applications")

## 9. Feature Scaling - CRITICAL!

### 9.1 Impact of Scaling

In [None]:
print("Feature Scaling Impact on MLP")
print("="*70)
print("MLP is VERY sensitive to feature scales!\n")

# Train without scaling
print("Training WITHOUT scaling...")
mlp_unscaled = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    max_iter=500,
    random_state=42
)

start = time()
mlp_unscaled.fit(X_train_c, y_train_c)  # No scaling!
time_unscaled = time() - start
acc_unscaled = mlp_unscaled.score(X_test_c, y_test_c)
iter_unscaled = mlp_unscaled.n_iter_

# Train with scaling
print("Training WITH scaling...")
mlp_scaled = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    max_iter=500,
    random_state=42
)

start = time()
mlp_scaled.fit(X_train_c_scaled, y_train_c)  # Scaled!
time_scaled = time() - start
acc_scaled = mlp_scaled.score(X_test_c_scaled, y_test_c)
iter_scaled = mlp_scaled.n_iter_

# Compare
print("\n" + "="*70)
print("\nComparison:")
print(f"{'Metric':20} {'Without Scaling':>20} {'With Scaling':>20}")
print("-" * 70)
print(f"{'Test Accuracy':20} {acc_unscaled:>20.4f} {acc_scaled:>20.4f}")
print(f"{'Iterations':20} {iter_unscaled:>20d} {iter_scaled:>20d}")
print(f"{'Training Time (s)':20} {time_unscaled:>20.2f} {time_scaled:>20.2f}")

# Visualize loss curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(mlp_unscaled.loss_curve_, linewidth=2)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Loss')
axes[0].set_title(f'Without Scaling\nFinal Accuracy: {acc_unscaled:.4f}')
axes[0].grid(alpha=0.3)

axes[1].plot(mlp_scaled.loss_curve_, linewidth=2, color='green')
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Loss')
axes[1].set_title(f'With Scaling\nFinal Accuracy: {acc_scaled:.4f}')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 ALWAYS SCALE FEATURES FOR MLP!")
print("   Without scaling:")
print("   - Poor convergence")
print("   - Longer training time")
print("   - Worse performance")
print("   \n   Use StandardScaler before training!")
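
One way to see why the unscaled run struggles is to look at the feature scales themselves. The sketch below prints the spread of the raw breast cancer features and confirms what `StandardScaler` does to them. For brevity it fits the scaler on the full dataset, which is fine for inspection but not for model evaluation.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X_raw = load_breast_cancer().data

# Raw features differ in spread by several orders of magnitude
raw_std = X_raw.std(axis=0)
print(f"Raw feature std: min = {raw_std.min():.4f}, max = {raw_std.max():.1f}")
print(f"Ratio of largest to smallest std: {raw_std.max() / raw_std.min():.0f}")

# After standardization every feature has mean 0 and unit variance
X_std = StandardScaler().fit_transform(X_raw)
print(f"Scaled means within [{X_std.mean(axis=0).min():.1e}, {X_std.mean(axis=0).max():.1e}]")
print(f"Scaled stds  within [{X_std.std(axis=0).min():.3f}, {X_std.std(axis=0).max():.3f}]")
```

With raw inputs, a single learning rate has to serve weights attached to features of very different magnitudes, which is what slows or destabilizes gradient-based training.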

## 10. Hyperparameter Tuning

### 10.1 Grid Search

In [None]:
print("Hyperparameter Tuning for MLP")
print("="*70)

# Define parameter grid
param_grid = {
    'hidden_layer_sizes': [(50,), (100,), (100, 50)],
    'activation': ['relu', 'tanh'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.001, 0.01]  # the 'learning_rate' schedule only applies to solver='sgd'
}

print("Parameter Grid:")
for param, values in param_grid.items():
    print(f"  {param}: {values}")

print(f"\nTotal combinations: {np.prod([len(v) for v in param_grid.values()])}")
print("\nPerforming Grid Search with 3-fold CV...\n")

# Grid search
grid_search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=42),
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

start = time()
grid_search.fit(X_train_c_scaled, y_train_c)
grid_time = time() - start

print(f"\nGrid Search completed in {grid_time:.2f}s")
print(f"\nBest Parameters:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")

print(f"\nBest CV Score: {grid_search.best_score_:.4f}")

# Evaluate on test set
best_mlp = grid_search.best_estimator_
test_acc_best = best_mlp.score(X_test_c_scaled, y_test_c)

print(f"Test Set Accuracy: {test_acc_best:.4f}")

# Show top 5 configurations
results_df = pd.DataFrame(grid_search.cv_results_)
top5 = results_df.nlargest(5, 'mean_test_score')[[
    'param_hidden_layer_sizes', 'param_activation', 'param_alpha', 
    'param_learning_rate_init', 'mean_test_score', 'std_test_score'
]]

print("\nTop 5 Configurations:")
print(top5.to_string(index=False))
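
One caveat with the grid search above: the scaler was fit on the whole training set before cross-validation, so every validation fold has already influenced the scaling statistics. The effect is usually small, but it can be avoided by putting the scaler and the MLP in a `Pipeline` and searching over that. The sketch below uses a reduced grid to stay fast; the step names `'scaler'` and `'mlp'` and the grid values are choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

# The scaler is re-fit on the training part of every CV fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('mlp', MLPClassifier(max_iter=300, random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
pipe_grid = {
    'mlp__hidden_layer_sizes': [(50,), (100,)],
    'mlp__alpha': [0.0001, 0.01],
}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=3, scoring='accuracy')
pipe_search.fit(X_tr, y_tr)          # raw features in; scaling happens inside

pipe_test_acc = pipe_search.score(X_te, y_te)
print(f"Best parameters: {pipe_search.best_params_}")
print(f"Best CV score:   {pipe_search.best_score_:.4f}")
print(f"Test accuracy:   {pipe_test_acc:.4f}")
```

The fitted search object also applies the scaler at prediction time, so the raw test features can be passed in directly.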

## Summary and Quick Reference

### Quick Reference Code

```python
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.preprocessing import StandardScaler

# CRITICAL: Always scale features!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ===== CLASSIFICATION =====

mlp_clf = MLPClassifier(
    hidden_layer_sizes=(100, 50),  # Tuple: (layer1, layer2, ...)
    activation='relu',             # 'relu', 'tanh', 'logistic', 'identity'
    solver='adam',                 # 'adam', 'sgd', 'lbfgs'
    alpha=0.0001,                  # L2 regularization
    batch_size='auto',             # Mini-batch size; 'auto' = min(200, n_samples)
    learning_rate='constant',      # 'constant', 'invscaling', 'adaptive' (only used when solver='sgd')
    learning_rate_init=0.001,      # Initial learning rate
    max_iter=500,                  # Maximum epochs
    early_stopping=True,           # Enable validation-based stopping
    validation_fraction=0.1,       # Validation set size
    n_iter_no_change=10,           # Patience for early stopping
    random_state=42
)

# ===== REGRESSION =====

mlp_reg = MLPRegressor(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    alpha=0.001,
    max_iter=1000,
    early_stopping=True,
    random_state=42
)

# Train (mlp_reg follows the same fit/predict API)
mlp_clf.fit(X_train_scaled, y_train)

# Predict
predictions = mlp_clf.predict(X_test_scaled)
probabilities = mlp_clf.predict_proba(X_test_scaled)  # Classification only

# Monitoring
print(f"Iterations: {mlp_clf.n_iter_}")
print(f"Loss curve: {mlp_clf.loss_curve_}")
```

### Key Hyperparameters

**hidden_layer_sizes**:
- Tuple defining network architecture
- (100,) = 1 hidden layer with 100 neurons
- (100, 50) = 2 layers: 100 then 50 neurons
- (100, 50, 25) = 3 layers
- Start with (100,) or (100, 50)

**activation**:
- 'relu': Default, works best for most tasks
- 'tanh': Can work better than ReLU sometimes
- 'logistic': Sigmoid, rarely used in hidden layers
- 'identity': Linear, only for special cases

**solver**:
- 'adam': Best default (adaptive learning)
- 'sgd': Stochastic gradient descent, needs tuning
- 'lbfgs': Good for small datasets

**alpha**:
- L2 regularization strength
- Default: 0.0001
- Increase if overfitting: try 0.001, 0.01
- Decrease if underfitting: try 0.00001

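**max_iter**:
- Maximum number of epochs for 'sgd' and 'adam' (iterations for 'lbfgs'); default: 200
- 200-1000 is usually sufficient
- With early_stopping=True, training can stop before max_iter, so a generous value is safe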

### Good Starting Configuration

Mostly sklearn's defaults, with `max_iter` raised from 200 and early stopping enabled:

```python
MLPClassifier(
    hidden_layer_sizes=(100,),
    activation='relu',
    solver='adam',
    alpha=0.0001,
    max_iter=500,
    early_stopping=True,
    random_state=42
)
```

### Best Practices

1. **Always scale features**: Use StandardScaler (CRITICAL!)
2. **Start simple**: Begin with (100,) hidden layer
3. **Use early stopping**: Prevents overfitting automatically
4. **Monitor loss curve**: Check convergence
5. **Use adam solver**: Best default optimizer
6. **Set random_state**: For reproducibility
7. **Cross-validate**: Don't rely on single train/test split
8. **Increase max_iter**: If not converging

### Common Issues and Solutions

| Issue | Solution |
|-------|----------|
| Poor performance | Scale features with StandardScaler |
| Convergence warning | Increase max_iter or enable early_stopping |
| Overfitting | Increase alpha, reduce hidden_layer_sizes |
| Underfitting | Increase hidden_layer_sizes, decrease alpha |
| Slow training | Reduce hidden_layer_sizes, use smaller dataset |
| Unstable training | Decrease learning_rate_init |

### When to Use MLP

✓ **Good for:**
- Complex non-linear patterns
- Image data (after flattening or CNN features)
- Medium-sized datasets (sklearn's MLP runs on CPU only, with no GPU support)
- High-dimensional data
- When interpretability not required

✗ **Avoid when:**
- Small datasets (<1000 samples)
- Need interpretability
- Simple linear patterns
- Limited computational resources
- Quick prototyping needed

### MLP vs Other Models

| Aspect | MLP | Logistic Reg | Random Forest | SVM |
|--------|-----|--------------|---------------|-----|
| Non-linear | Excellent | Poor | Excellent | Good |
| Training Speed | Slow | Fast | Medium | Slow |
| Prediction Speed | Fast | Very Fast | Medium | Medium |
| Interpretability | Poor | Excellent | Good | Poor |
| Hyperparameters | Many | Few | Few | Few |
| Feature Scaling | Required | Optional | Not Required | Required |

### Computational Complexity

**Training**: O(n × d × h × t) for a single hidden layer
- n = samples
- d = features  
- h = hidden neurons
- t = iterations

**Prediction**: O(d × h) per sample
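
To make the per-sample prediction cost concrete for more than one hidden layer, the sketch below counts the multiply-adds in one forward pass as the sum of products of consecutive layer sizes. The helper name `forward_multiply_adds` and the layer sizes are hypothetical; bias additions and activation evaluations are ignored.

```python
def forward_multiply_adds(layer_sizes):
    """Count weight multiplications for one sample passing through the network."""
    return sum(n_in * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Hypothetical network: 30 features -> 100 -> 50 -> 1 output
sizes_demo = (30, 100, 50, 1)
per_sample = forward_multiply_adds(sizes_demo)
print(f"Multiply-adds per prediction: {per_sample}")   # 30*100 + 100*50 + 50*1 = 8050

# With a single hidden layer this reduces to d*h plus the small output term
single_layer = forward_multiply_adds((30, 100, 1))
print(f"Single hidden layer: {single_layer}")          # 30*100 + 100*1 = 3100
```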

### Further Reading

- **Book**: "Deep Learning" - Goodfellow, Bengio, Courville
- **sklearn Docs**: https://scikit-learn.org/stable/modules/neural_networks_supervised.html
- **Paper**: "Understanding the difficulty of training deep feedforward neural networks" - Glorot & Bengio
- **For production**: Consider TensorFlow/PyTorch for larger networks