<div align="center"><img src="https://github.com/hyeonsangjeon/Hyperparameters-Optimization/blob/master/pic/hyperparameteroptimization.png?raw=true" alt="Hyperparameter Optimization" width="400"/></div>

# Hyperparameter Optimization Tutorial

## 📋 Overview

Hyperparameter tuning is **essential but time-consuming**. Simple models may take hours, while complex neural networks can require days or weeks.

This tutorial provides hands-on practice with **5 major optimization techniques**, comparing their strengths, weaknesses, and practical applications.

---

## 🎯 Optimization Techniques Covered

| Technique | Characteristics | Recommended For |
|-----------|----------------|-----------------|
| **1. Grid Search** | Exhaustive search of all combinations | Small parameter spaces |
| **2. Random Search** | Random sampling | Quick prototyping |
| **3. Optuna** | TPE + Early stopping | **Balanced choice (Recommended)** |
| **4. Bayesian Optimization** | Probabilistic model-based | Maximum performance needs |
| **5. TPE (Hyperopt)** | Bayesian optimization variant | High-dimensional problems |

---

## 📚 Learning Outcomes

After completing this tutorial, you will:

✅ **Understand** how each technique works  
✅ **Choose** the right optimization method for different scenarios  
✅ **Reduce** modeling time effectively

---

## ⚠️ Important Note

- **HyperBand**: Not run in this tutorial because the `scikit-hyperband` library is incompatible with modern scikit-learn → **Replaced with Optuna**
- **Optuna**: Covers HyperBand's core concepts (early stopping, resource allocation) via TPE + Pruning
- Optuna is actively maintained and provides more modern and powerful features

---

## 🔬 Experimental Setup

| Component | Details |
|-----------|---------|
| **Dataset** | Sklearn Diabetes (442 samples) |
| **Model** | LightGBM Regressor |
| | 🌳 High-performance Gradient Boosting algorithm |
| | Sequentially trains decision trees to correct previous errors |
| | Fast training speed with high accuracy |
| | Ideal for optimization learning with many tunable hyperparameters |
| **Metric** | MSE (Mean Squared Error) |
| **Validation** | 2-Fold Cross-Validation |
| **Iterations** | 50 trials (same for all algorithms) |

> **Note**: This is an educational demo. For production, use larger datasets and 5-fold or more cross-validation.

In [None]:
#!pip install git+https://github.com/darenr/scikit-optimize

## Preparation Step
- Import standard libraries


In [None]:
#!pip install lightgbm
import numpy as np
import pandas as pd

from lightgbm.sklearn import LGBMRegressor
from sklearn.metrics import mean_squared_error

%matplotlib inline

import warnings                                  # `do not disturb` mode
warnings.filterwarnings('ignore')

# Suppress LightGBM logs
import os
os.environ['LIGHTGBM_VERBOSE'] = '-1'

## 2️⃣ Dataset Preparation

### 📊 Sklearn Diabetes Dataset

We use the **Sklearn Diabetes regression dataset** to compare hyperparameter optimization techniques.

| Item | Details |
|------|---------|
| **Samples** | 442 patients |
| **Features** | 10 (age, sex, BMI, blood pressure, etc.) |
| **Target** | Disease progression after 1 year (continuous) |
| **Problem Type** | Regression |

### 💡 Why This Dataset?

✅ **Fast experimentation**: Small size ideal for algorithm comparison  
✅ **Clear impact**: Regression problem shows hyperparameter effects intuitively  
✅ **Focus on optimization**: Concentrate on techniques rather than data complexity

> **Note**: This tutorial focuses on **comparing optimization algorithms**. Concentrate on how each technique works rather than the dataset itself!

In [None]:
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
n = diabetes.data.shape[0]

data = diabetes.data
targets = diabetes.target

## 3️⃣ Data Splitting & Experiment Setup

### 📊 Data Split Strategy

| Item | Setting | Description |
|------|---------|-------------|
| **Train/Test** | 80% / 20% | 353 training, 89 test samples |
| **Cross-Validation** | 2-Fold KFold | Simplified for fast experimentation |
| **Evaluation Metric** | MSE | Lower is better (prediction error) |
| **Random Seed** | 42 | Ensures reproducible results |
| **Iterations** | 50 trials | Same for all algorithms |
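
The sample counts in the table can be reproduced by hand. The sketch below assumes scikit-learn's usual rounding behavior (the test share is rounded up, and `KFold` gives the first folds one extra sample when the split is uneven); the variable names are illustrative only.

```python
import math

n_samples = 442
test_size = 0.20
num_folds = 2

# train_test_split rounds the test share up and gives the rest to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# KFold gives the first (n_train % num_folds) folds one extra sample
fold_sizes = [n_train // num_folds + (1 if i < n_train % num_folds else 0)
              for i in range(num_folds)]

print(n_train, n_test)   # 353 89
print(fold_sizes)        # [177, 176]
```

With 2 folds, each cross-validation round therefore trains on roughly 176 samples and validates on the rest, which is one reason the scores in this demo are noisy.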

### ⚠️ Experimental Constraints

This tutorial is an **educational demo**. Differences from production:

| Aspect | Tutorial | Production Recommendation |
|--------|----------|---------------------------|
| Data Size | 442 (small) | Thousands to tens of thousands |
| Cross-Validation | 2-Fold | 5-Fold or more |
| Iterations | 50 trials | 100-500 trials |
| Stability Check | Single seed | Multiple seed testing |

### 💡 Learning Focus

✅ **Algorithm comparison** - Understand how each technique works  
✅ **Relative performance** - Which technique is more efficient  
❌ **Absolute performance** - Not critical due to small dataset

> **Next Step**: Now we'll split the data and prepare to run each optimization technique.

In [None]:
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Experimental parameter settings
random_state = 42  # Random seed for reproducible results
n_iter = 50        # Number of iterations to apply to all optimization algorithms

# Train/test data split (80% training, 20% testing)
train_data, test_data, train_targets, test_targets = train_test_split(
    data, 
    targets, 
    test_size=0.20, 
    shuffle=True,
    random_state=random_state
)

# Cross-validation setup (2-fold KFold)
num_folds = 2
kf = KFold(
    n_splits=num_folds, 
    shuffle=True, 
    random_state=random_state
)

In [None]:
# Check data split results
print('=' * 50)
print('Data Split Results')
print('=' * 50)
print(f'Training Data   (train_data)    : {train_data.shape}')
print(f'Test Data       (test_data)     : {test_data.shape}')
print(f'Training Target (train_targets) : {train_targets.shape}')
print(f'Test Target     (test_targets)  : {test_targets.shape}')
print('=' * 50)

## 4️⃣ Creating Baseline Model

### 🎯 Model Selection: LGBMRegressor

We'll solve this problem using `LGBMRegressor`. Gradient Boosting models have many tunable hyperparameters, making them ideal for demonstration.

| Category | Description |
|------|------|
| **Algorithm** | LightGBM (Light Gradient Boosting Machine) |
| **Model Type** | Regressor (regression prediction) |
| **Feature** | Sequential learning of multiple trees |

### 🔄 How It Works

```
Tree 1 → Find errors → Tree 2 compensates → Tree 3 refines → ... → Final prediction
```

**Analogy**: Like solving exam problems
- Student 1 solves and makes mistakes → Student 2 corrects them
- Student 2 makes mistakes → Student 3 corrects them
- Combine all students' answers → Get final correct answer
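
The "each tree corrects the previous errors" idea can be sketched in a few lines of plain Python. This is a toy illustration with one-split stumps on one feature, not LightGBM's actual algorithm; `fit_stump` and `boost` are hypothetical helper names. It does show the roles of `n_estimators` (number of rounds) and `learning_rate` (how much of each correction is applied).

```python
def fit_stump(x, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:          # keep both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees=20, learning_rate=0.5):
    """Return the training MSE after each added stump."""
    pred = [sum(y) / len(y)] * len(y)              # start from the mean
    mse_history = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]   # errors left so far
        stump = fit_stump(x, residuals)                    # the next "student"
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
        mse_history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return mse_history


x = list(range(10))
y = [xi ** 2 for xi in x]
history = boost(x, y)
print(f"MSE after 1 stump: {history[0]:.1f}, after 20 stumps: {history[-1]:.1f}")
```

The training error shrinks with every round, which is also why too many trees or too large a learning rate can overfit, and why these parameters are worth tuning.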

In [None]:
# Create baseline model (using default parameters)
model = LGBMRegressor(random_state=random_state, verbose=-1)

### 📏 Measuring Baseline Performance

#### 💡 Quick Summary
**Before hyperparameter optimization**, we measure results from training **just once** with default settings.  
→ This score becomes our **comparison baseline**.

#### 🎯 Why We Need a Baseline

```
Baseline (default settings) vs Optimized results → Compare improvement
```

| Category | Role |
|------|------|
| 📊 **Comparison Standard** | MSE 3500 → 2800 = Verify 20% improvement |
| 💰 **ROI Assessment** | Is the performance gain worth the optimization time invested? |
| ⏱️ **Practicality Evaluation** | 5 seconds vs 30 minutes: is the improvement worth it? |
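
The 20% figure in the table is a relative reduction in MSE. A small helper (the name `relative_improvement` is made up for this sketch) makes the arithmetic explicit, using the illustrative numbers from the table rather than measured results:

```python
def relative_improvement(baseline_mse, tuned_mse):
    """Fraction by which the tuned MSE is lower than the baseline MSE."""
    return (baseline_mse - tuned_mse) / baseline_mse

gain = relative_improvement(3500, 2800)
print(f"{gain:.0%}")  # 20%
```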

#### 🔬 Measurement Method

| Item | Setting |
|------|------|
| **Evaluation Method** | 2-Fold Cross Validation |
| **Evaluation Metric** | MSE (lower is better) |
| **Parameters** | LightGBM default values |


In [None]:
%%time

# Baseline model performance evaluation
baseline_scores = cross_val_score(
    model, 
    train_data, 
    train_targets, 
    cv=kf, 
    scoring="neg_mean_squared_error", 
    n_jobs=-1
)

# Calculate MSE (convert negative to positive)
baseline_mse = -baseline_scores.mean()

# Print results
print('=' * 50)
print('Baseline Model Performance')
print('=' * 50)
print(f'MSE (Mean): {baseline_mse:.2f}')
print(f'MSE (Std Dev): {baseline_scores.std():.2f}')
print(f'Individual Fold Results: {[-score for score in baseline_scores]}')
print('=' * 50)

## 5️⃣ Setting Hyperparameter Search Space

### 🎯 Optimization Target: 3 Key Parameters

| Parameter | Search Range | Role |
|---------|---------|------|
| **n_estimators** | 100 ~ 2000 | Number of trees (more accurate but slower) |
| **max_depth** | 2 ~ 20 | Tree depth (deeper = learns complex patterns) |
| **learning_rate** | 0.00001 ~ 1.0 | Learning rate (controls each tree's contribution) |
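
Because `learning_rate` spans five orders of magnitude, the code in this tutorial samples it on a log scale (`np.logspace`, `log=True`). The sketch below, using only the standard library, shows why: with plain uniform sampling only about 1% of draws are expected to fall below 0.01, while log-uniform sampling is expected to put about 60% there. The sample size and seed are arbitrary choices for the illustration.

```python
import random

rng = random.Random(42)
n = 10_000
low, high = 1e-5, 1.0

# Uniform on the raw scale vs uniform on the exponent (log-uniform)
uniform = [rng.uniform(low, high) for _ in range(n)]
log_uniform = [10 ** rng.uniform(-5, 0) for _ in range(n)]

frac_uniform = sum(v < 1e-2 for v in uniform) / n
frac_log = sum(v < 1e-2 for v in log_uniform) / n
print(f"below 0.01: uniform {frac_uniform:.1%}, log-uniform {frac_log:.1%}")
```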

### Why Only 3 Parameters?

LightGBM has dozens of parameters, but in this tutorial:

| Reason | Explanation |
|------|------|
| 🎯 **Clear Comparison** | Focus on algorithm characteristics with core parameters |
| ⚡ **Fast Experiments** | Reduce execution time for better learning efficiency |
| 📊 **Easy Understanding** | Visualizations are easier to interpret |

> **Important**: These 3 parameters alone have a major impact on model performance

# 1. Grid Search

## Concept

Grid Search is the most traditional optimization method that **exhaustively searches all hyperparameter combinations**.

- **How It Works**: Try all combinations of parameter values specified by the user, and select the combination with the best cross-validation results
- **Implementation**: Uses `sklearn.model_selection.GridSearchCV`

## Pros and Cons

### ✅ Advantages
- **Simple and Clear**: Easiest method to understand
- **Complete Search**: Guaranteed to find the best combination among the grid points you specify
- **Reproducible**: Always same results on the same grid

### ❌ Disadvantages

| Problem | Description | Example |
|-----|------|-----|
| **Slow Speed** | The number of combinations grows exponentially with the number of parameters | Adding 1 parameter with 10 candidate values increases computation time 10x |
| **Discrete Value Limitation** | May miss optimal values in between continuous values | If optimal is 550 but searching 100, 200, 300..., won't find it |
| **Prior Knowledge Required** | Need to know appropriate search range in advance for efficiency | Wrong range wastes time |

## Practical Example: Time Calculation

**Full Grid (Unrealistic)**:
- n_estimators: 20 values (100~2000)
- max_depth: 19 values (2~20)
- learning_rate: 5 values (0.0001~0.1)
- **Total Combinations**: 20 × 19 × 5 = **1,900**
- **Expected Time**: 15~30 minutes (excessive for 442 sample dataset)

**Reduced Grid (This Tutorial)**:
- n_estimators: 5 values (800~1200)
- max_depth: 8 values (5~12)
- learning_rate: 3 values (0.001~0.1)
- **Total Combinations**: 5 × 8 × 3 = **120** ✅
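
The combination counts above are simply the product of the number of values per parameter. A quick check, which also multiplies by the 2 CV folds to get the number of model fits (the final refit on the full training set that `GridSearchCV` performs by default is not counted here):

```python
from math import prod

full_grid = {"n_estimators": 20, "max_depth": 19, "learning_rate": 5}
reduced_grid = {"n_estimators": 5, "max_depth": 8, "learning_rate": 3}
num_folds = 2

full_total = prod(full_grid.values())
reduced_total = prod(reduced_grid.values())

print(full_total, reduced_total)                          # 1900 120
print(full_total * num_folds, reduced_total * num_folds)  # 3800 240
```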

### 💡 Practical Tips
1. **Go from coarse to fine**: Start with a wide, coarse grid, then narrow it around the best region
2. **Important parameters first**: Allocate more values to parameters with greater impact
3. **Consider time constraints**: If the time budget is tight, consider Random Search or Bayesian methods first

In [None]:
%%time
from sklearn.model_selection import GridSearchCV

# Define Grid Search parameter grid (120 combinations)
param_grid = {
    'max_depth': np.linspace(5, 12, 8, dtype=int),        # 8 values
    'n_estimators': np.linspace(800, 1200, 5, dtype=int), # 5 values
    'learning_rate': np.logspace(-3, -1, 3),              # 3 values
    'random_state': [random_state]
}

# Create GridSearchCV object
gs = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    cv=kf,
    verbose=False
)

# Execute Grid Search
gs.fit(train_data, train_targets)

# Test set evaluation
gs_test_score = mean_squared_error(test_targets, gs.predict(test_data))

# Print results (same format as baseline)
print('=' * 50)
print('Grid Search Optimization Results')
print('=' * 50)
print(f'Optimal MSE (CV): {-gs.best_score_:.2f}')
print(f'Test MSE: {gs_test_score:.2f}')
print(f'Optimal Parameters:')
for param, value in gs.best_params_.items():
    if param != 'random_state':
        print(f'  - {param}: {value}')
print('=' * 50)

## 📊 Grid Search Results Analysis

### 🔍 Visualizing Parameter Search Process

The graph below shows the CV score and each parameter value across the 120 combinations tried.

### 💡 Key Findings

| Parameter | Impact | Characteristics |
|---------|--------|------|
| 📈 **learning_rate** | 🔥 Very High | Lower values improve performance (most important) |
| 📊 **n_estimators** | ⚡ Medium | Improves up to a certain level as tree count increases |
| ⚠️ **max_depth** | 💤 Low | Relatively less impact → 8 values is excessive |

### ❌ Inefficiency of Grid Search

```
Problems:
1. Searches all parameter combinations equally
2. Uses 8 values even for low-impact max_depth → 8x time increase
3. Fixed grid approach → Cannot adapt the search based on earlier results
```

> **Conclusion**: Grid Search is thorough but **inefficient for the time invested**

In [None]:
# Convert Grid Search results to DataFrame
gs_results_df = pd.DataFrame({
    'score': -gs.cv_results_['mean_test_score'],
    'learning_rate': gs.cv_results_['param_learning_rate'].data,
    'max_depth': gs.cv_results_['param_max_depth'].data,
    'n_estimators': gs.cv_results_['param_n_estimators'].data
})

# Visualize parameter exploration process
gs_results_df.plot(subplots=True, figsize=(10, 10), title='Grid Search: Parameter Evolution')

# 2. Random Search

## Concept

Random Search is a method that finds optimal values by **randomly sampling in the hyperparameter space**.

- **How It Works**: Randomly extract parameter values from specified distributions and evaluate
- **Implementation**: Uses `sklearn.model_selection.RandomizedSearchCV`
- **Key Paper**: [Random Search for Hyper-Parameter Optimization (Bergstra & Bengio, 2012)](https://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf)

<img src="https://raw.githubusercontent.com/nslatysheva/data_science_blogging/master/expanding_ML_toolkit/expanding_toolkit.jpg" style="height:500px;width:50%;"/>

## Pros and Cons

### ✅ Advantages

| Advantage | Description | Compared to Grid Search |
|-----|------|------------------|
| **Efficient Exploration** | Every trial draws a fresh value for each parameter, so important parameters get many distinct values | Only a few fixed values per parameter |
| **Fast Speed** | Reaches good results with fewer iterations | Slow by trying all combinations |
| **Continuous Value Support** | Freely explores real-valued parameters | Only searches discrete values |
| **Scalability** | Cost is set by `n_iter`, regardless of how many parameters are added | Exponential increase when adding parameters |
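
The "efficient exploration" row can be checked by counting distinct values. The sketch below rebuilds the 120-point grid from the Grid Search section and compares it with 50 random trials. It assumes the learning rate is drawn from a continuous log-uniform range, which is a simplification of the 100-value `np.logspace` list used in this tutorial's code.

```python
import itertools
import random

# Reduced grid from this tutorial: 5 x 8 x 3 = 120 combinations
n_estimators = [800, 900, 1000, 1100, 1200]
max_depth = list(range(5, 13))
learning_rate = [1e-3, 1e-2, 1e-1]
grid = list(itertools.product(n_estimators, max_depth, learning_rate))

# 50 random trials, each drawing every parameter afresh
rng = random.Random(42)
trials = [(rng.randint(100, 1999), rng.randint(2, 19), 10 ** rng.uniform(-5, 0))
          for _ in range(50)]

grid_lr = len({lr for _, _, lr in grid})
random_lr = len({lr for _, _, lr in trials})
print(f"grid:   {len(grid)} fits, {grid_lr} distinct learning rates")
print(f"random: {len(trials)} fits, {random_lr} distinct learning rates")
```

Grid Search spends 120 evaluations on only 3 learning rates, while Random Search tries a new learning rate in almost every one of its 50 evaluations.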

### ❌ Disadvantages

1. **No Guarantee**: May not find optimal value within specified range
2. **Independent Search**: Randomly selects each time without utilizing previous results
3. **Depends on Luck**: Same settings may have different results depending on random_state

## Practical Example

In this tutorial:
- **Wide Search Range**: learning_rate (10⁻⁵ ~ 1.0), n_estimators (100~2000), max_depth (2~20)
- **Few Iterations**: Only 50 random samplings performed
- **vs Grid Search**: 120 exhaustive search combinations vs 50 random samplings

### 💡 Practical Tips
1. **Start with wide range**: First identify approximate area with Random Search
2. **Sufficient iterations**: Minimum 50~100 recommended
3. **Next step**: Based on Random Search results, narrow range for Grid Search or Bayesian methods

## Executing Random Search

Uses `RandomizedSearchCV` to perform 50 random samplings in a wide parameter space.


In [None]:
%%time
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Define Random Search parameter distributions
param_grid_rand = {
    'learning_rate': np.logspace(-5, 0, 100),  # 100 log-spaced candidates in 10^-5 ~ 1.0 (discrete list)
    'max_depth': randint(2, 20),               # 2 ~ 19 (uniform distribution)
    'n_estimators': randint(100, 2000),        # 100 ~ 1999 (uniform distribution)
    'random_state': [random_state]
}

# Create RandomizedSearchCV object
rs = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid_rand,
    n_iter=n_iter,                             # 50 random samplings
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    cv=kf,
    verbose=False,
    random_state=random_state
)

# Execute Random Search
rs.fit(train_data, train_targets)

# Test set evaluation
rs_test_score = mean_squared_error(test_targets, rs.predict(test_data))

# Print results (same format as baseline)
print('=' * 50)
print('Random Search Optimization Results')
print('=' * 50)
print(f'Optimal MSE (CV): {-rs.best_score_:.2f}')
print(f'Test MSE: {rs_test_score:.2f}')
print(f'Optimal Parameters:')
for param, value in rs.best_params_.items():
    if param != 'random_state':
        print(f'  - {param}: {value}')
print('=' * 50)

## Random Search Results Analysis

### 📊 Comparison with Grid Search

**Performance:**
- ✅ Better results than Grid Search (with less time)
- ✅ 50 samplings vs 120 exhaustive search → Superior efficiency per time

### 🎲 Characteristics of Random Search

**Advantages:**
- Explores by **changing all parameters simultaneously** each time
- No time wasted on less important parameters
- Freely explores continuous value ranges

**Limitations:**
- Each sampling is **completely independent** → Doesn't utilize previous results
- Results may vary depending on luck

### 💡 Parameter Change Visualization

The graph below shows Random Search **irregularly exploring the entire space**, unlike Grid Search.

In [None]:
# Convert Random Search results to DataFrame
rs_results_df = pd.DataFrame({
    'score': -rs.cv_results_['mean_test_score'],
    'learning_rate': rs.cv_results_['param_learning_rate'].data,
    'max_depth': rs.cv_results_['param_max_depth'].data,
    'n_estimators': rs.cv_results_['param_n_estimators'].data
})

# Visualize parameter exploration process
rs_results_df.plot(subplots=True, figsize=(10, 10), title='Random Search: Parameter Evolution')

# 3. HyperBand ⚠️ (Replaced with Optuna Due to Compatibility)

### Research Paper [HyperBand](https://arxiv.org/pdf/1603.06560.pdf)

## Concept

HyperBand is an algorithm that maximizes optimization speed through **Early Stopping** and **Adaptive Resource Allocation**.

**How It Works:**
1. Randomly generate many hyperparameter combinations
2. Start training each combination quickly with **few resources**
3. **Terminate early** combinations with low performance
4. Invest more resources only in promising combinations
5. Finally, only a few best combinations train to completion

**Analogy:** Similar to a hiring process that quickly screens dozens of candidates with short interviews, then conducts 2nd and 3rd round interviews only with the promising few
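
The steps above correspond to successive halving, the building block of HyperBand. Below is a minimal sketch in plain Python, not the `scikit-hyperband` implementation. `toy_loss` is a hypothetical stand-in for "train with this budget and return the validation loss", and `eta=3` (keep the best third each round) is an assumed setting.

```python
import random

def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Keep the best 1/eta of the configs each round and give survivors eta times more budget."""
    survivors = list(configs)
    budget = min_budget
    rounds = []                                   # (budget, number of configs evaluated)
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))  # lower loss first
        rounds.append((budget, len(ranked)))
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0], rounds

def toy_loss(learning_rate, budget):
    """Hypothetical loss: best near learning_rate = 0.1, improving with more budget."""
    return abs(learning_rate - 0.1) + 1.0 / budget

rng = random.Random(42)
configs = [10 ** rng.uniform(-5, 0) for _ in range(27)]
best, rounds = successive_halving(configs, toy_loss)
print(rounds)   # [(1, 27), (3, 9), (9, 3)]
print(f"best learning_rate: {best:.4f}")
```

Most of the 27 candidates only ever receive the smallest budget. Full HyperBand runs several such brackets with different trade-offs between the number of candidates and the starting budget.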

<img src="https://github.com/hyeonsangjeon/Hyperparameters-Optimization/blob/master/pic/Hyperband.png?raw=true" />

## ⚠️ Compatibility Issue

**The scikit-hyperband library is no longer usable:**
- Incompatible with scikit-learn 0.24+ (`iid` parameter removed)
- Library maintenance discontinued
- Cannot run in latest environments

## Solution: Replace with Optuna

HyperBand's core concepts are available in a more modern form through **Optuna**.

| Feature | HyperBand | Optuna |
|-----|----------|--------|
| **Core Technique** | Successive Halving | TPE (Bayesian) + Pruning (early stopping) |
| **Strength** | Fast resource allocation optimization | Combines intelligent search + early stopping |
| **Compatibility** | ❌ Not compatible with scikit-learn 0.24+ | ✅ Fully compatible with latest libraries |
| **Maintenance** | ❌ Discontinued | ✅ Active development |

### 💡 Practical Selection Guide

**When HyperBand concept is needed:**
- ✅ **Use Optuna** - Pruning feature implements HyperBand's early stopping concept
- ✅ Better performance: Synergy of Bayesian optimization + early stopping

**Code below preserved for educational purposes only** (not executable)

In [None]:
# HyperBand installation code (commented due to compatibility issues)
# !git clone https://github.com/thuijskens/scikit-hyperband.git 2>/dev/null 1>/dev/null

In [None]:
# HyperBand file copy (commented due to compatibility issues)
# !cp -r scikit-hyperband/* .

In [None]:
#!python setup.py install 2>/dev/null 1>/dev/null

In [None]:
# HyperBand execution code (commented due to compatibility issues)
# Cannot run in scikit-learn 0.24+ due to removed iid parameter
# Use Optuna section below instead

"""
%%time
from hyperband import HyperbandSearchCV

from scipy.stats import randint
from sklearn.preprocessing import LabelBinarizer


param_hyper_band={'learning_rate': np.logspace(-5, 0, 100),
                 'max_depth':  randint(2,20),
                 'n_estimators': randint(100,2000),                  
                 #'num_leaves' : randint(2,20),
                 'random_state': [random_state]
                 }


hb = HyperbandSearchCV(model, param_hyper_band, max_iter = n_iter, scoring='neg_mean_squared_error', resource_param='n_estimators', random_state=random_state)


#%time search.fit(new_training_data, y)
hb.fit(train_data, train_targets)



hb_test_score=mean_squared_error(test_targets, hb.predict(test_data))

print('===========================')
print("Best MSE = {:.3f} , when params {}".format(-hb.best_score_, hb.best_params_))
print('===========================')
"""

# 3-1 Optuna (Modern Alternative to HyperBand)

## 🔍 Core Concept

A modern hyperparameter optimization framework combining **Bayesian Optimization (TPE) + Early Stopping (Pruning)**

```
TPE: Learns from previous trial results to predict next parameters to try
Pruning: Automatically terminates low-performing trials during training → Saves time
```

**Analogy**: A system that analyzes past interview results to intelligently select next candidates, and immediately eliminates candidates certain to fail during the interview process
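
To make the pruning idea concrete, here is a simplified median rule in plain Python: a trial is stopped at a step if its intermediate loss is worse than the median of completed trials at the same step. This is a sketch of the concept only, not Optuna's implementation; the function name, the `n_startup_trials=2` warm-up, and the loss curves are all made up for illustration.

```python
from statistics import median

def run_with_median_pruning(curves, n_startup_trials=2):
    """For each trial's per-step losses, return the step where it is pruned (None = completed)."""
    finished = []       # full curves of trials that ran to completion
    pruned_at = []
    for curve in curves:
        stop = None
        for step, loss in enumerate(curve):
            # Only prune once enough completed trials exist to compare against
            if len(finished) >= n_startup_trials and loss > median(c[step] for c in finished):
                stop = step
                break
        if stop is None:
            finished.append(curve)
        pruned_at.append(stop)
    return pruned_at

curves = [
    [10, 8, 6, 5],          # trial 0: warm-up, always completes
    [9, 7, 5, 4],           # trial 1: warm-up, always completes
    [12, 11, 10, 9],        # trial 2: bad from the start
    [9, 7.2, 7.1, 7.0],     # trial 3: starts fine, then stalls
    [8, 6, 4, 3],           # trial 4: better than the median throughout
]
print(run_with_median_pruning(curves))  # [None, None, 0, 2, None]
```

Trials 2 and 3 are cut off early, so their remaining training steps are never paid for.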

## ⚖️ HyperBand vs Optuna

| Category | HyperBand | Optuna |
|------|-----------|--------|
| **Search Method** | Random + Early Stopping | Bayesian (intelligent) + Early Stopping |
| **Learning Capability** | ❌ Doesn't use previous results | ✅ Learns from previous results |
| **Compatibility** | ❌ Only supports old versions | ✅ Fully compatible with latest libraries |
| **Maintenance** | ❌ Discontinued | ✅ Active development (2024) |
| **Visualization** | None | ✅ Rich built-in tools |

## 📊 Pros & Cons

### ✅ Advantages

- 🧠 **Intelligent Search**: Finds optimal values more efficiently than Random Search
- ⏱️ **Time Savings**: Pruning terminates unnecessary computations early
- 🔧 **Flexibility**: Compatible with various ML frameworks (sklearn, PyTorch, etc.)

### ❌ Disadvantages

- ⏰ **Initial Overhead**: TPE model construction requires some time
- 🎛️ **Configuration Needed**: Additional setup for pruning strategy, sampler selection, etc.

## 💡 This Tutorial Setup

```
Number of trials: 50 (same as Random Search)
Method: TPESampler (Bayesian optimization)
Pruning: not enabled in the code below (the objective reports no intermediate values)
Search space: 
  - learning_rate: 10⁻⁵ ~ 1.0
  - n_estimators: 100 ~ 2000
  - max_depth: 2 ~ 20
```

**Installation:**
```bash
pip install optuna
```

In [None]:
# Import and configure Optuna library
# Installation: pip install optuna

import optuna
from optuna.samplers import TPESampler

# Set log output level (show WARNING and above only)
optuna.logging.set_verbosity(optuna.logging.WARNING)

# Check version
print(f"Optuna version: {optuna.__version__}")

In [None]:
%%time

# 1. Define Optuna objective function
def optuna_objective(trial):
    """
    Objective function to evaluate in Optuna trial
    Returns: MSE (minimization target)
    """
    # Suggest hyperparameters (TPE selects intelligently)
    params = {
        'learning_rate': trial.suggest_float('learning_rate', 1e-5, 1.0, log=True),  # Log scale
        'max_depth': trial.suggest_int('max_depth', 2, 20),                          # Integer range
        'n_estimators': trial.suggest_int('n_estimators', 100, 2000),                # Integer range
        'random_state': random_state,
        'verbose': -1
    }
    
    # Create model and evaluate with cross-validation
    model_optuna = LGBMRegressor(**params)
    score = -cross_val_score(
        model_optuna, 
        train_data, 
        train_targets, 
        cv=kf, 
        scoring="neg_mean_squared_error", 
        n_jobs=-1
    ).mean()
    
    return score

# 2. Create Optuna Study
study = optuna.create_study(
    direction='minimize',                      # Minimize MSE
    sampler=TPESampler(seed=random_state)      # Use TPE algorithm
)

# 3. Execute optimization (50 trials)
study.optimize(optuna_objective, n_trials=n_iter, show_progress_bar=True)

# 4. Train final model with optimal parameters and test
best_params_optuna = study.best_params.copy()
best_params_optuna['random_state'] = random_state
best_params_optuna['verbose'] = -1

optuna_model = LGBMRegressor(**best_params_optuna)
optuna_model.fit(train_data, train_targets)
optuna_test_score = mean_squared_error(test_targets, optuna_model.predict(test_data))

# 5. Print results (same format as baseline)
print('=' * 50)
print('Optuna Optimization Results')
print('=' * 50)
print(f'Optimal MSE (CV): {study.best_value:.2f}')
print(f'Test MSE: {optuna_test_score:.2f}')
print(f'Optimal Parameters:')
for param, value in study.best_params.items():
    print(f'  - {param}: {value}')
print('=' * 50)

In [None]:
# Convert Optuna results to DataFrame
optuna_results_df = pd.DataFrame({
    'score': [trial.value for trial in study.trials],
    'learning_rate': [trial.params['learning_rate'] for trial in study.trials],
    'max_depth': [trial.params['max_depth'] for trial in study.trials],
    'n_estimators': [trial.params['n_estimators'] for trial in study.trials]
})

# Visualize parameter exploration process (same format as other algorithms)
optuna_results_df.plot(subplots=True, figsize=(10, 10), title='Optuna: Parameter Evolution')

# 4. Bayesian Optimization

## Concept

Bayesian Optimization is an intelligent optimization method that **learns from previous results using probabilistic models**.

**Core Idea:**
- Unlike Random/Grid Search, **tracks and utilizes past evaluation results**
- Estimates relationship between hyperparameters and performance scores as a **probabilistic model**: $P(\text{Score} | \text{Hyperparameters})$
- **Strategically explores** the most promising areas

![](https://github.com/hyeonsangjeon/Hyperparameters-Optimization/blob/master/pic/BayesianOpt.gif?raw=true)

<img src="https://github.com/hyeonsangjeon/Hyperparameters-Optimization/blob/master/pic/bayesopt2.png?raw=true" style="height:320px;"  />

## Core Components

### 1. Surrogate Model

**Definition:** A model that probabilistically estimates the objective function form based on points investigated so far $(x_1, f(x_1)), ..., (x_t, f(x_t))$

- Denoted as $p(y | x)$ in papers
- **Much cheaper computational cost** than actual objective function
- Predicts performance in entire parameter space

### 2. Acquisition Function

**Definition:** A function that recommends the next input candidate $x_{t+1}$ most useful for finding the optimal input value $x^*$

- Typically uses **Expected Improvement (EI)**
- Balances Exploration and Exploitation
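
For reference, Expected Improvement has a closed form under the common assumption that the surrogate's prediction at $x$ is Gaussian with mean $\mu(x)$ and standard deviation $\sigma(x)$. The symbols $\mu$, $\sigma$, $f_{\min}$ (best value observed so far), $\Phi$ and $\phi$ (standard normal CDF and PDF) are notation introduced here for a minimization problem such as MSE:

$$
\mathrm{EI}(x) = \mathbb{E}\left[\max\left(f_{\min} - f(x),\, 0\right)\right] = \left(f_{\min} - \mu(x)\right)\Phi(z) + \sigma(x)\,\phi(z), \qquad z = \frac{f_{\min} - \mu(x)}{\sigma(x)}
$$

The first term rewards points predicted to be better than the current best (exploitation); the second rewards points where the model is uncertain (exploration).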

## Algorithm Process

1. Generate a random initial point $x_0$
2. Calculate $f(x_0)$
3. Build conditional probability model $P(f | x)$ from past results (Surrogate Model)
4. Select $x_i$ with highest probability of improving $f(x_i)$ according to $P(f | x)$ (Acquisition Function)
5. Calculate actual value of $f(x_i)$
6. Repeat steps 3~5 until maximum iterations reached
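
Step 4 can be illustrated without any optimization library. The sketch below scores three candidate points with the standard closed form of Expected Improvement for a Gaussian prediction (minimization). The surrogate predictions and the best-so-far value are made-up numbers, and `expected_improvement` is a helper written for this example, not a scikit-optimize function.

```python
import math

def expected_improvement(mu, sigma, f_best):
    """EI for minimization, given a Gaussian prediction N(mu, sigma^2) and the best value so far."""
    if sigma <= 0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (f_best - mu) * cdf + sigma * pdf

f_best = 3000.0                      # best CV MSE observed so far (made-up number)
candidates = {                       # hypothetical surrogate predictions: (mean, std)
    "A": (2950.0, 10.0),             # slightly better, very certain
    "B": (3100.0, 300.0),            # worse on average, very uncertain
    "C": (3050.0, 5.0),              # worse and certain
}
scores = {name: expected_improvement(mu, sigma, f_best)
          for name, (mu, sigma) in candidates.items()}
next_point = max(scores, key=scores.get)
print(next_point)  # B
```

Candidate B wins even though its predicted mean is worse than the current best, because its large uncertainty leaves room for a big improvement. This is the exploration side of the exploration/exploitation balance.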

## Pros and Cons

### ✅ Advantages

| Advantage | Description |
|-----|------|
| **Intelligent Search** | Learns from previous results to focus on promising areas |
| **Efficiency** | Can reach optimal with few iterations |
| **Continuous Space Optimization** | Effective in continuous parameter spaces |

### ❌ Disadvantages

| Disadvantage | Description |
|-----|------|
| **Initial Cost** | Time required to build Surrogate model |
| **High-Dimension Limitation** | Efficiency tends to decrease as the number of parameters grows (roughly beyond 10~20) |
| **Implementation Complexity** | More complex to understand and implement than Random Search |

## Practical Example

In this tutorial:
- Uses **scikit-optimize (skopt)** library's `BayesSearchCV`
- **50 iterations** (same as Random Search)
- **Wide Search Range**: learning_rate (10⁻⁵ ~ 1.0), n_estimators (100~2000), max_depth (2~20)

### 💡 Practical Tips

1. **Initial Exploration**: First 10~20 iterations randomly explore to initialize model
2. **Appropriate Iterations**: 50~100 iterations sufficient in most cases
3. **High-Dimension Caution**: Consider TPE or Random Search if parameters exceed 10
4. **Early Termination**: Can stop with callback function when target performance reached

### 📚 References

- [Bayesian Optimization for Robots (Kaggle)](https://www.kaggle.com/artgor/bayesian-optimization-for-robots)
- [scikit-optimize Documentation](https://scikit-optimize.github.io/)

In [None]:
#! pip install scikit-optimize
#https://towardsdatascience.com/hyperparameter-optimization-with-scikit-learn-scikit-opt-and-keras-f13367f3e796
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer


In [None]:
%%time

# 1. Define search space
search_space = {
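    # 100 log-spaced candidates in 10^-5 ~ 1.0. skopt treats a list like this as Categorical;
    # Real(1e-5, 1.0, prior='log-uniform') would search the range continuously instead.
    'learning_rate': np.logspace(-5, 0, 100),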
    "max_depth": Integer(2, 20),                # 2 ~ 20 (integer range)
    'n_estimators': Integer(100, 2000),         # 100 ~ 2000 (integer range)
    'random_state': [random_state]
}

# 2. Define callback function (early termination)
def on_step(optim_result):
    """
    Callback function called at each Bayesian Optimization iteration
    Terminates optimization early when target performance reached
    """
    # BayesSearchCV minimizes the negated score; with neg_mean_squared_error
    # scoring, optim_result.fun is therefore the best (lowest) MSE so far.
    score = optim_result.fun
    print(f"Best MSE so far: {score:.3f}")
    
    # Terminate early if MSE better than 2800
    if score < 2800:
        print('Target performance reached! Stopping optimization.')
        return True

# 3. Create and execute BayesSearchCV object
bayes_search = BayesSearchCV(
    model, 
    search_space, 
    n_iter=n_iter,                          # 50 iterations
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    cv=kf,                                  # 2-fold CV (same as other algorithms)
    random_state=random_state
)

# 4. Execute optimization (early termination possible with callback)
bayes_search.fit(train_data, train_targets, callback=on_step)

# 5. Test set evaluation
bayes_test_score = mean_squared_error(test_targets, bayes_search.predict(test_data))

# 6. Print results (same format as baseline)
print('=' * 50)
print('Bayesian Optimization Results')
print('=' * 50)
print(f'Optimal MSE (CV): {-bayes_search.best_score_:.2f}')
print(f'Test MSE: {bayes_test_score:.2f}')
print(f'Optimal Parameters:')
for param, value in bayes_search.best_params_.items():
    if param != 'random_state':
        print(f'  - {param}: {value}')
print('=' * 50)

In [None]:
# Convert Bayesian Optimization results to DataFrame
bayes_results_df = pd.DataFrame({
    'score': -bayes_search.cv_results_['mean_test_score'],
    'learning_rate': bayes_search.cv_results_['param_learning_rate'].data,
    'max_depth': bayes_search.cv_results_['param_max_depth'].data,
    'n_estimators': bayes_search.cv_results_['param_n_estimators'].data
})

# Visualize parameter exploration process (same format as other algorithms)
bayes_results_df.plot(subplots=True, figsize=(10, 10), title='Bayesian Optimization: Parameter Evolution')

# 5. Hyperopt (TPE Implementation)

## Overview

**Hyperopt** is a Python library that implements the TPE (Tree-structured Parzen Estimator) algorithm.

- **GitHub**: [hyperopt/hyperopt](https://github.com/hyperopt/hyperopt)
- **Feature**: Automatic hyperparameter search based on Bayesian optimization
- **Advantage**: Works independently of sklearn, allows flexible search space definition

## Main Components

| Function/Class | Role | Description |
|-----------|-----|------|
| **fmin** | Main optimization function | Searches optimal parameters that minimize objective function |
| **tpe.suggest** | TPE algorithm | Bayesian optimization strategy (recommended) |
| **hp** | Search space definition | Specifies parameter distributions (uniform, loguniform, etc.) |
| **Trials** | Execution log | Stores results of all trials |

**Installation:**
```bash
pip install hyperopt
```

In [None]:
#!pip install hyperopt

## Importing Hyperopt Library

Import the necessary functions:
- **fmin**: Runs the hyperparameter optimization
- **tpe**: TPE algorithm (Bayesian optimization)
- **hp**: Defines the parameter search space (quniform, loguniform, etc.)
- **anneal**: Simulated-annealing search algorithm (imported here but not used below)
- **Trials**: Logs and tracks the results of each run

In [None]:
from hyperopt import fmin, tpe, hp, anneal, Trials
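
To make the `hp` distributions concrete, here is a small standard-library sketch of what `hp.quniform` and `hp.loguniform` draw, following the formulas in the Hyperopt documentation. The helper names `sample_quniform` and `sample_loguniform` are hypothetical stand-ins for illustration, not part of Hyperopt.

```python
import math
import random

rng = random.Random(42)


def sample_quniform(low, high, q):
    """Mimic hp.quniform: round(uniform(low, high) / q) * q, returned as a float."""
    return float(round(rng.uniform(low, high) / q) * q)


def sample_loguniform(low, high):
    """Mimic hp.loguniform: exp(uniform(low, high)); the bounds are natural-log values."""
    return math.exp(rng.uniform(low, high))


depths = [sample_quniform(2, 20, 1) for _ in range(1000)]
rates = [sample_loguniform(-5, 0) for _ in range(1000)]

print("max_depth range:", min(depths), "to", max(depths))
print("learning_rate range:", min(rates), "to", max(rates))
```

Two details matter in practice: `quniform` returns floats, so integer parameters need an explicit `int(...)` cast, and the `loguniform` bounds are exponents of *e*, so `(-5, 0)` covers roughly 0.0067 to 1.0.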

## Implementing Objective Function

Unlike scikit-learn's search classes, Hyperopt requires you to **define the objective function yourself**. The function below takes a parameter dictionary, trains the model with cross-validation, and returns the mean MSE.

In [None]:
def gb_mse_cv(params, random_state=random_state, cv=kf, X=train_data, y=train_targets):
    """
    Hyperopt objective function: Trains LGBMRegressor with given parameters and returns MSE
    
    Args:
        params (dict): Hyperparameters to optimize (n_estimators, max_depth, learning_rate)
        
    Returns:
        float: Cross-validation MSE (minimization target)
    """
    # 1. Convert parameters to integers and organize
    model_params = {
        'n_estimators': int(params['n_estimators']), 
        'max_depth': int(params['max_depth']), 
        'learning_rate': params['learning_rate'],
        'random_state': random_state,
        'verbose': -1
    }
    
    # 2. Create model
    model = LGBMRegressor(**model_params)
    
    # 3. Perform cross-validation and calculate MSE
    score = -cross_val_score(
        model, X, y, 
        cv=cv, 
        scoring="neg_mean_squared_error", 
        n_jobs=-1
    ).mean()

    return score

## Tree-structured Parzen Estimator (TPE)

### Research Paper
[Algorithms for Hyper-Parameter Optimization (Bergstra et al., NIPS 2011)](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf)

<img src="https://github.com/hyeonsangjeon/Hyperparameters-Optimization/blob/master/pic/TPE.gif?raw=true" />

## Concept

TPE is **Hyperopt's core algorithm**, a method that implements Bayesian optimization.

**Core Idea:**
- Learns from previous trial results to **intelligently predict parameters to try next**
- Probabilistically distinguishes between parameter regions with good and bad performance
- Focuses on promising regions while also exploring new regions (Exploration & Exploitation)

## Algorithm Process

1. **Initialize**: Sample random starting points $x$ from the search space
2. **Evaluate**: Calculate $F(x)$ for each point (here, the cross-validated MSE)
3. **Build Model**: Split the past trials at a loss quantile $y^*$ and fit two densities: $l(x) = p(x \mid F < y^*)$ for the good trials and $g(x) = p(x \mid F \ge y^*)$ for the rest
4. **Select Next Candidate**: Draw candidates from $l(x)$ and select the $x_i$ that maximizes $l(x_i)/g(x_i)$, which is equivalent to maximizing expected improvement
5. **Actual Evaluation**: Calculate the actual value of $F(x_i)$
6. **Repeat**: Repeat steps 3-5 until the maximum number of iterations is reached
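
The loop above can be sketched in one dimension using only the standard library. This is a simplified illustration, not Hyperopt's implementation: the fixed bandwidth, the `gamma` quantile, the candidate count, and the names `kde` and `tpe_minimize` are all assumptions made for this example.

```python
import math
import random


def kde(points, bandwidth):
    """Return a Gaussian kernel density estimate built from `points`."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

    return density


def tpe_minimize(objective, low, high, n_startup=10, n_iter=40,
                 gamma=0.25, n_candidates=24, seed=0):
    """Toy 1-D TPE loop (needs n_startup >= 2). Returns (best_x, best_loss, history)."""
    rng = random.Random(seed)
    bandwidth = (high - low) / 10.0  # fixed bandwidth: a simplification
    # Steps 1-2: evaluate random starting points.
    history = []
    for _ in range(n_startup):
        x = rng.uniform(low, high)
        history.append((x, objective(x)))

    for _ in range(n_iter):
        # Step 3: split trials into "good" (lowest gamma fraction) and "bad".
        ranked = sorted(history, key=lambda t: t[1])
        n_good = max(1, int(gamma * len(ranked)))
        good = [x for x, _ in ranked[:n_good]]
        bad = [x for x, _ in ranked[n_good:]]
        l_density, g_density = kde(good, bandwidth), kde(bad, bandwidth)

        # Step 4: draw candidates from l(x), keep the one with the largest l(x)/g(x).
        candidates = []
        for _ in range(n_candidates):
            c = rng.gauss(rng.choice(good), bandwidth)
            candidates.append(min(max(c, low), high))  # clip to the search bounds
        x_next = max(candidates, key=lambda c: l_density(c) / (g_density(c) + 1e-12))

        # Step 5: evaluate the real objective and record the trial.
        history.append((x_next, objective(x_next)))

    best_x, best_loss = min(history, key=lambda t: t[1])
    return best_x, best_loss, history


best_x, best_loss, history = tpe_minimize(lambda x: (x - 2.0) ** 2, -10.0, 10.0)
print(f"best x = {best_x:.3f}, best loss = {best_loss:.4f}, trials = {len(history)}")
```

Sampling candidates from `l(x)` rather than from the whole range is what concentrates the search on promising regions, while the ratio `l(x)/g(x)` penalizes candidates that also resemble poor trials.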

## Comparison with Bayesian Optimization

| Feature | Bayesian Optimization (skopt) | TPE (Hyperopt) |
|-----|------------------------------|----------------|
| **Surrogate Model** | Gaussian Process (GP) | Tree-structured Parzen Estimator |
| **Computational Complexity** | O(n³) in the number of trials n - slows down as trials accumulate | Roughly O(n) in the number of trials - stays fast |
| **Parameter Independence** | Considers correlations | Assumes conditional independence |
| **Application Scenario** | Parameters < 10 (rule of thumb) | Parameters > 10 (rule of thumb) |

### 💡 Practical Tips

1. **High-Dimensional Problems**: TPE is often more efficient than GP-based Bayesian Optimization when there are more than about 10 parameters
2. **Fast Feedback**: TPE builds its model quickly, which helps when you want fast iteration
3. **Flexible Search Space**: You can define conditional parameters (e.g., different parameters depending on the number of layers)
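
As an illustration of a conditional (tree-structured) search space, the sketch below samples a configuration in which the per-layer parameters only exist for layers that are actually present. It uses only the standard library; the function name `sample_network_config` and the parameter names are hypothetical. In Hyperopt, a space like this is typically expressed with `hp.choice` over nested dictionaries.

```python
import random


def sample_network_config(rng):
    """Draw one configuration from a conditional search space."""
    n_layers = rng.choice([1, 2, 3])
    config = {"n_layers": n_layers}
    # Width parameters are only defined for the layers that exist.
    for i in range(n_layers):
        config[f"units_layer_{i}"] = rng.choice([32, 64, 128])
    return config


rng = random.Random(0)
configs = [sample_network_config(rng) for _ in range(5)]
for cfg in configs:
    print(cfg)
```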

### 📚 References

- [Detailed TPE Algorithm Explanation (Towards Data Science)](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f)
- [Hyperopt Official Documentation](https://github.com/hyperopt/hyperopt)

## Executing TPE

Runs TPE optimization with Hyperopt's `fmin` function, using the same 50-iteration budget as Random Search so that the comparison is fair.

In [None]:
%%time

# 1. Define search space
space = {
    'n_estimators': hp.quniform('n_estimators', 100, 2000, 1),  # 100 ~ 2000 (integer)
    'max_depth': hp.quniform('max_depth', 2, 20, 1),            # 2 ~ 20 (integer)
    'learning_rate': hp.loguniform('learning_rate', -5, 0)      # e^-5 (~0.0067) ~ 1.0 (natural-log scale)
}

# 2. Create Trials object (for logging)
trials = Trials()

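# 3. Set random seed (reproducibility)
np.random.seed(random_state)

# 4. Execute TPE optimization
best = fmin(
    fn=gb_mse_cv,           # Objective function to minimize
    space=space,            # Search space
    algo=tpe.suggest,       # TPE algorithm
    max_evals=n_iter,       # Maximum iterations (50)
    trials=trials,          # Result logging
    # Seed Hyperopt's own sampler; hyperopt >= 0.2.7 expects a numpy Generator
    # (older releases expect np.random.RandomState(random_state) instead)
    rstate=np.random.default_rng(random_state)
)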

# 5. Train final model with optimal parameters
best_params_tpe = {
    'n_estimators': int(best['n_estimators']),
    'max_depth': int(best['max_depth']),
    'learning_rate': best['learning_rate'],
    'random_state': random_state,
    'verbose': -1
}

tpe_model = LGBMRegressor(**best_params_tpe)
tpe_model.fit(train_data, train_targets)

# 6. Test set evaluation
tpe_test_score = mean_squared_error(test_targets, tpe_model.predict(test_data))

# 7. Print results (same format as baseline)
print('=' * 50)
print('TPE (Hyperopt) Optimization Results')
print('=' * 50)
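print(f'Optimal MSE (CV): {trials.best_trial["result"]["loss"]:.2f}')  # best loss already logged in Trials
print(f'Test MSE: {tpe_test_score:.2f}')
print('Optimal Parameters:')
for param, value in best_params_tpe.items():
    if param not in ('random_state', 'verbose'):
        # n_estimators and max_depth were cast to int above; learning_rate stays float
        print(f'  - {param}: {value:.6f}' if isinstance(value, float) else f'  - {param}: {value}')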
print('=' * 50)

### TPE Results Analysis

TPE tends to find **better hyperparameter combinations in fewer iterations** than Random Search.

- **Bayesian Approach**: Uses the results of previous trials to choose which parameters to try next.
- **Efficient Exploration**: With the same 50 iterations, it is more likely than Random Search to converge to a high-quality solution.
- **Balanced Performance**: Each iteration costs slightly more computation, but the improvement is usually large enough that TPE is widely used in practice.

The graph below shows the score and parameter values of each TPE trial. The **score tends to decrease (improve)** as iterations progress, although individual trials can still score worse while TPE explores.

In [None]:
# Convert TPE (Hyperopt) search results to DataFrame
tpe_results = np.array([
    [x['result']['loss'],
     x['misc']['vals']['learning_rate'][0],
     x['misc']['vals']['max_depth'][0],
     x['misc']['vals']['n_estimators'][0]]
    for x in trials.trials
])

tpe_results_df = pd.DataFrame(
    tpe_results,
    columns=['score', 'learning_rate', 'max_depth', 'n_estimators']
)

# Visualize parameter exploration process (same format as other algorithms)
tpe_results_df.plot(subplots=True, figsize=(10, 10), title='TPE (Hyperopt): Parameter Evolution')

## Results

The graph below compares, for all approaches, **how the best score found so far (cumulative minimum) changes over iterations**.

- The y-axis shows the **best (minimum) MSE** each algorithm has found up to that iteration.
- The faster and lower a curve descends, **the better the hyperparameters found with fewer attempts**.
- The **Baseline** curve is the fixed MSE without any tuning; the distance each curve falls below it is **the gain from tuning**.
- The graph lets you compare **how much each method improves on the Baseline**, as well as the **convergence speed and final performance** of Grid/Random/Optuna/Bayesian/TPE.
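
The "best score so far" curve is simply a running minimum over the per-trial scores. A minimal standard-library sketch, equivalent in spirit to pandas' `cummin()`; the MSE values are made up for illustration and `best_so_far` is a hypothetical helper name:

```python
from itertools import accumulate


def best_so_far(scores):
    """Return the running minimum of a sequence of per-trial scores."""
    return list(accumulate(scores, min))


trial_mse = [3400.0, 3150.0, 3300.0, 2980.0, 3050.0]  # made-up per-trial MSE values
print(best_so_far(trial_mse))  # [3400.0, 3150.0, 3150.0, 2980.0, 2980.0]
```

Because the curve can never rise, a flat segment means the trials in that stretch did not beat the best result found earlier.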

In [None]:
# Compare best cumulative score changes over iterations for all algorithms
scores_df = pd.DataFrame(index=range(n_iter))
scores_df['Baseline'] = baseline_mse
scores_df['Grid Search'] = gs_results_df['score'].cummin()
scores_df['Random Search'] = rs_results_df['score'].cummin()
# scores_df['Hyperband'] = hb_results_df['score'].cummin()  # Activate if needed
scores_df['Optuna'] = optuna_results_df['score'].cummin()
scores_df['Bayesian Optimization'] = bayes_results_df['score'].cummin()
scores_df['TPE'] = tpe_results_df['score'].cummin()

# Visualize convergence curves
ax = scores_df.plot(figsize=(10, 6))
ax.set_xlabel('Number of iterations')
ax.set_ylabel('Best cumulative score (MSE)')
ax.set_title('Hyperparameter Optimization: Best Score over Iterations')
ax.legend(loc='best')

### Results Analysis

#### 📊 Algorithm Performance Comparison

**Improvement vs Baseline:**
- All optimization algorithms achieve **significantly lower MSE** than Baseline → Hyperparameter tuning effect is clear

**Algorithm Characteristics:**

| Algorithm | Convergence Speed | Final Performance | Computational Cost | Recommended Situation |
|---------|----------|----------|----------|----------|
| **Grid Search** | Slow | Good | High | When parameter space is small and precise search needed |
| **Random Search** | Fast | Good | Low | Fast prototyping, initial experiments |
| **Optuna** | Very Fast | Very Good | Medium | **Balanced choice** - recommended in most cases |
| **Bayesian Opt** | Medium | Very Good | High | Parameters < 10, highest performance needed |
| **TPE** | Fast | Very Good | Medium | Parameters > 10, high-dimensional problems |

#### 💡 Practical Application Guide

**Situation-Specific Recommendations:**
1. **⚡ Fast Prototyping** (Time Constraints)
   - Priority 1: **Random Search** - Simple and fast
   - Priority 2: **Optuna** - Better results with slightly more time

2. **🎯 Highest Performance Needed** (Time Available)
   - Parameters < 10: **Bayesian Optimization**
   - Parameters ≥ 10: **TPE (Hyperopt)** or **Optuna**

3. **⚖️ Balanced Choice** (Practical Recommendation)
   - **Optuna (TPE + Pruning)** - Optimal balance of speed and performance

4. **🚀 Limited Time/Resources**
   - **Utilize Optuna's Pruning feature** - Early termination of unnecessary trials

#### ⚠️ Important Notes

**Cautions When Interpreting Graphs:**
- This experiment used a **small dataset (442 samples), 2-fold CV, and 50 iterations**
- Results may vary greatly depending on the `random_state` value
- In real projects, use more folds (5 or more) and more iterations

**Toy Dataset Characteristics:**
- **Overfitting occurs easily** on small datasets (442 samples)
- If CV MSE is high but test MSE is low → the test set may be coincidentally easy, or the data may simply be too small
- In real projects, **validate on larger datasets**

**Practical Evaluation Criteria:**
- This tutorial: Comparison based on iteration count (educational purpose)
- Real projects: Comparison based on **actual wall-clock time** is more accurate
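
One way to compare by time rather than by iteration count is to record the elapsed wall-clock time alongside the best score so far. A minimal sketch using only the standard library; the helper name `track_best_over_time` and the toy objective are assumptions for illustration:

```python
import time


def track_best_over_time(objective, candidates):
    """Evaluate candidates in order, recording (elapsed_seconds, best_loss_so_far)."""
    start = time.perf_counter()
    best = float("inf")
    trace = []
    for params in candidates:
        best = min(best, objective(params))
        trace.append((time.perf_counter() - start, best))
    return trace


# Toy objective standing in for an expensive cross-validation run.
trace = track_best_over_time(lambda x: (x - 2.0) ** 2, [5.0, 1.0, 3.0, 2.0])
for elapsed, best in trace:
    print(f"{elapsed:.6f}s  best loss so far: {best:.2f}")
```

Plotting the second column against the first for each algorithm gives a convergence curve in seconds, which accounts for methods whose individual iterations cost more.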

## Test Data Performance Comparison

Prints the **test set MSE** for every algorithm to check how well each tuned model generalizes.

### 📋 Comparison Items
- **Cross-Validation MSE**: Best MSE found during the search, measured on held-out CV folds (see the graph above)
- **Test MSE**: Prediction performance on unseen test data (output below)

### 💡 Interpretation Guide
- **CV MSE < Test MSE**: Normal case - the hyperparameters were selected on the CV folds, so the CV score is optimistic
- **CV MSE ≈ Test MSE**: Ideal - excellent generalization performance
- **CV MSE > Test MSE**: Toy dataset characteristic - the test set may have been coincidentally easy

In [None]:
# Test set performance comparison (including baseline)
print('=' * 50)
print('Test Set MSE Comparison')
print('=' * 50)
print(f"Baseline (no tuning)    : {baseline_mse:.3f}")
print('-' * 50)
print(f"Grid Search             : {gs_test_score:.3f}")
print(f"Random Search           : {rs_test_score:.3f}")
print(f"Optuna                  : {optuna_test_score:.3f}")
print(f"Bayesian Optimization   : {bayes_test_score:.3f}")
print(f"TPE (Hyperopt)          : {tpe_test_score:.3f}")
print('=' * 50)

# Calculate improvement rate vs baseline
print('\nImprovement Rate vs Baseline:')
print('=' * 50)
for name, score in [
    ('Grid Search', gs_test_score),
    ('Random Search', rs_test_score),
    ('Optuna', optuna_test_score),
    ('Bayesian Optimization', bayes_test_score),
    ('TPE (Hyperopt)', tpe_test_score)
]:
    improvement = ((baseline_mse - score) / baseline_mse) * 100
    print(f"{name:25s}: {improvement:+.2f}%")
print('=' * 50)

## Final Conclusion

### 🏆 Algorithm Characteristics Summary

| Algorithm | Recommended Situation | Advantages | Disadvantages |
|---------|----------|-----|------|
| **Optuna** | **Most Cases** | Optimal balance of speed and performance | Initial setup required |
| **Random Search** | Quick experiments | Simple and fast | No guarantee of best performance |
| **TPE (Hyperopt)** | High-dimensional parameters | Effective for many parameters | Learning curve exists |
| **Bayesian Opt** | Highest performance needed | High final performance | Long computation time |
| **Grid Search** | Small search space | Guaranteed complete search | High time consumption |

### 💡 Practical Selection Guide

**When Time Constrained:**
- Start with Random Search → Find good range → Precise search with Optuna

**When Highest Performance Needed:**
- Parameters < 10: Bayesian Optimization
- Parameters ≥ 10: TPE or Optuna

**In Most Cases:**
- Use **Optuna** - Modern, actively maintained, and powerful

### ⚠️ Important Notice

This experiment was conducted on a **toy dataset for educational purposes** (442 samples, 2-fold CV).

**In actual projects, the following are recommended:**
- More data (at least thousands of samples)
- More folds (5-fold or more)
- Validation with multiple `random_state` values
- Evaluation based on actual time spent
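
Validating with multiple seeds can be as simple as repeating the same search with different random states and reporting the spread. The sketch below does this for a toy random search using only the standard library; the `random_search` helper and the quadratic objective are made up for illustration and stand in for a real tuning run.

```python
import random
import statistics


def random_search(objective, low, high, n_iter, seed):
    """Return the best (lowest) loss found by uniform random search."""
    rng = random.Random(seed)
    return min(objective(rng.uniform(low, high)) for _ in range(n_iter))


def toy_objective(x):
    """Made-up objective with its minimum at x = 2."""
    return (x - 2.0) ** 2


results = [random_search(toy_objective, -10.0, 10.0, n_iter=50, seed=s) for s in range(5)]
print(f"best loss over 5 seeds: mean={statistics.mean(results):.4f}, "
      f"std={statistics.stdev(results):.4f}")
```

If the standard deviation across seeds is comparable to the gap between two algorithms, that gap should not be read as a real difference.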

## Tutorial Summary

### 🎯 5 Optimization Algorithms Compared

| No. | Algorithm | Core Feature | Practical Rating |
|-----|---------|----------|-----------|
| 1 | **Grid Search** | Exhaustive search of all combinations | ⭐⭐ (Small space only) |
| 2 | **Random Search** | Random sampling | ⭐⭐⭐⭐ (Quick start) |
| 3 | **Optuna** | TPE + Pruning combined | ⭐⭐⭐⭐⭐ (Best balance) |
| 4 | **Bayesian Opt** | Gaussian Process based | ⭐⭐⭐⭐ (High performance) |
| 5 | **TPE (Hyperopt)** | Bayesian optimization variant | ⭐⭐⭐⭐ (High dimensions) |

### 💡 Situation-Based Selection Guide

**Project Start Phase:**
1. Identify good parameter range with Random Search (1 hour)
2. Narrow range for precise search with Optuna (2-3 hours)
3. Final fine-tuning if needed (additional 1 hour)
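
The first two steps can be sketched as a coarse-to-fine search: run a wide random search, then take the range spanned by the best few trials as the bounds for the second, more precise search. The example below uses only the standard library; the helper `narrow_range`, the `top_k` value, and the toy loss are illustrative assumptions.

```python
import math
import random


def narrow_range(trials, top_k=5):
    """Return (low, high) spanning the parameter values of the top_k lowest-loss trials."""
    best = sorted(trials, key=lambda t: t[1])[:top_k]
    values = [value for value, _ in best]
    return min(values), max(values)


def toy_loss(learning_rate):
    """Made-up loss with its minimum at learning_rate = 0.01."""
    return (math.log10(learning_rate) + 2.0) ** 2


# Step 1: wide random search over learning_rate on a log scale (1e-5 to 1).
rng = random.Random(0)
trials = []
for _ in range(50):
    lr = 10 ** rng.uniform(-5, 0)
    trials.append((lr, toy_loss(lr)))

# Step 2: use the narrowed bounds as the search space for the precise search.
low, high = narrow_range(trials)
print(f"narrowed learning_rate range: {low:.5f} to {high:.5f}")
```

The narrowed bounds would then be passed to the second-stage optimizer as its search space for that parameter.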

**Time-Based Recommendations:**
- ⚡ **Within 30 minutes**: Random Search (50 iterations)
- 🎯 **1-2 hours**: Optuna (100 iterations + Pruning)
- 🏆 **3+ hours**: Bayesian Opt or TPE (200+ iterations)

**By Parameter Count:**
- Parameters < 5: Grid Search also possible
- Parameters 5-10: Bayesian Optimization or Optuna
- Parameters > 10: **TPE** or **Optuna** (essential)

### 🚀 Key Lessons

**There is no "perfect" algorithm.**
- Random Search can be better than Bayesian depending on situation
- What matters is **performance improvement per time spent**
- In most cases **Optuna is most practical**

**Practical Tips:**
1. Measure baseline first (confirm improvement effect)
2. Start with few iterations → Gradually expand
3. Monitor exploration process with visualization
4. Evaluate based on actual time spent

---

We hope this tutorial helps you **drastically reduce hyperparameter optimization time** in your ML/DL projects! 🎉