
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# Demo: Hyperparameter Tuning with SparkML

In this demo, you will learn how to use **Optuna**, a powerful hyperparameter optimization (HPO) framework, to tune machine learning models in Databricks utilizing **Spark MLlib**.

We will demonstrate how to use **Optuna** with regression models from SparkML, including a **Random Forest Regressor**, covering:
- **Defining search spaces** for HPO.
- **Creating objective functions** tailored to the framework.
- **Optimizing hyperparameters** using two execution strategies:
  - **Single-node multithreading** for local tuning.
  - **Distributed Spark execution** for large-scale training.

Additionally, we will track and log the results using **MLflow**, enabling efficient management and monitoring of the tuning process.

### **Distributed Machine Learning in Databricks**
Distributing the workload for hyperparameter tuning with Spark can be broken down into two key components:

1. **Model Training Level:**  
   - Utilize **PySpark DataFrames** for distributed data processing.
   - Leverage **Spark ML algorithms**, which are inherently scalable.

2. **Optimization Level:**  
   - Use **driver-based orchestration frameworks** (e.g., Optuna) to manage hyperparameter tuning while leveraging **Spark MLlib** for distributed model training.
   - While model training runs in parallel across the Spark cluster, **hyperparameter tuning is orchestrated on a single machine** (the driver node). Optuna can execute multiple trials **in parallel using threads**, but this still occurs within the driver and is not distributed across the cluster.


### **🚨 A Warning Concerning Hyperopt on Databricks**
The open-source version of Hyperopt is no longer maintained and will be removed in DBR ML versions 17.0+. *This notebook is currently running on a version that supports Hyperopt.* Databricks recommends using Optuna for single-node optimization or Ray Tune for a similar experience.

---

## **Learning Objectives**

By the end of this demo, you will be able to:

- **Understand the Hyperparameter Tuning Approach**
   - **Optuna** for single-node orchestration of hyperparameter tuning with parallel execution across threads.
   - **Spark MLlib** for distributed model training, combined with driver-based hyperparameter orchestration (e.g., via Optuna).

- **Perform Hyperparameter Tuning using Optuna**
   - Define an **objective function** tailored to your model.
   - Configure **a search space** for hyperparameter optimization.
   - Optimize hyperparameters using **single-node execution**.

- **Understand Hyperopt's Usage with Spark MLlib (Optional)**
   - Learn how Hyperopt can be used for **sequential Bayesian optimization** with Spark MLlib.
   - Identify the **limitations and trade-offs** of using Hyperopt in distributed environments.

## REQUIRED - SELECT CLASSIC COMPUTE
Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:
1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

2. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

   - Click **More** in the drop-down.

   - In the **Attach to an existing compute resource** window, use the first drop-down to select your unique cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

2. Find the triangle icon to the right of your compute cluster name and click it.

3. Wait a few minutes for the cluster to start.

4. Once the cluster is running, complete the steps above to select your cluster.

## Requirements

Please review the following requirements before starting the lesson:

* To run this notebook, you need a classic cluster running one of the following Databricks runtime(s): **16.4.x-cpu-ml-scala2.12**. **Do NOT use serverless compute to run this notebook**.

In [0]:
%pip install -U optuna optuna-integration mlflow
%pip install --upgrade ray[tune]
dbutils.library.restartPython()

Before starting the demo, run the provided classroom setup script.

In [0]:
%run ../Includes/Classroom-Setup-02.1a

**Other Conventions:**

Throughout this demo, we'll refer to the object `DA`. This object, provided by Databricks Academy, contains variables such as your username, catalog name, schema name, working directory, and dataset locations. Run the code block below to view these details:

In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"Dataset Location:  {DA.paths.datasets.wine_quality}")

## Load Data and Perform Train-Test Split

In this step, we will load the dataset from the **Delta table** `wine_quality_features`, which is stored in **Unity Catalog** under:  
`{DA.catalog_name}.{DA.schema_name}.wine_quality_features`

### **Instructions:**
1. **Load the dataset** from the Delta table using `spark.read.table()`.
2. **Split the dataset** into **training (80%) and testing (20%)** sets to evaluate the model's performance.
   - Since we are using **PySpark DataFrames**, we will use `.randomSplit()` for the split.
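
As a rough mental model of what `.randomSplit([0.8, 0.2], seed=42)` does, the pure-Python sketch below draws one uniform random number per row and assigns the row to a split by cumulative weight. This is an illustration only, not Spark's implementation; `random_split` and the toy rows are hypothetical. It shows why the resulting split sizes are approximately, not exactly, 80/20.

```python
import random
from itertools import accumulate

def random_split(rows, weights, seed):
    """Assign each row to one split using a per-row uniform draw (illustrative only)."""
    total = sum(weights)
    # Cumulative upper bounds of each split on [0, 1], e.g. [0.8, 1.0]
    bounds = list(accumulate(w / total for w in weights))
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        # The first split whose upper bound exceeds the draw receives the row
        index = next((i for i, upper in enumerate(bounds) if draw < upper), len(bounds) - 1)
        splits[index].append(row)
    return splits

rows = list(range(1000))  # stand-in for DataFrame rows
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)
print(len(train_rows), len(test_rows))  # close to, but rarely exactly, 800 / 200
```

Because the assignment is made row by row, the same seed reproduces the same split only when the input rows arrive in the same order.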

In [0]:
df = spark.read.format("delta").table(f"{DA.catalog_name}.{DA.schema_name}.wine_quality_features")
# Split the dataset into training and test sets
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

In [0]:
train_df.explain()

## Part 1: HPO with Optuna and Distributed Training of a Spark Model
In this part, we will use **Optuna for hyperparameter optimization** while training Spark ML models in a **distributed** fashion.

### How This Works:
- **Optuna runs on a single machine** to manage hyperparameter tuning, where it suggests configurations for each trial and records their performance.
- **Model training can be distributed**, ensuring the ability to scale out and speed up for large datasets and complex models.
- **Each Optuna trial runs a new model training job** on the Spark cluster, allowing it to evaluate different hyperparameter configurations efficiently.

This approach allows us to leverage **distributed computing for training** while keeping **hyperparameter optimization lightweight and efficient** on a single node across multiple threads.
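
The division of labor described above can be sketched with the standard library alone. The example below is a hypothetical stand-in, not Optuna or Spark code: `fit_on_cluster` simulates a blocking Spark training job with `time.sleep`, and a thread pool on the "driver" launches several trials at once. It also evaluates a fixed list of candidates rather than sampling adaptively as Optuna does.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fit_on_cluster(params):
    """Hypothetical stand-in for a Spark ML fit: the driver thread just waits for the result."""
    time.sleep(0.1)  # simulates time spent waiting on the executors
    # Toy "RMSE" with its minimum at max_depth=6, num_trees=4
    return abs(params["max_depth"] - 6) + 0.1 * abs(params["num_trees"] - 4)

candidates = [
    {"num_trees": n, "max_depth": d}
    for n in (2, 3, 4, 5)
    for d in (3, 6, 9)
]

start = time.perf_counter()
# Threads on the driver only orchestrate; the heavy work happens elsewhere
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(fit_on_cluster, candidates))
elapsed = time.perf_counter() - start

best_score, best_params = min(zip(scores, candidates), key=lambda pair: pair[0])
print(f"best params: {best_params}, score: {best_score:.2f}, wall time: {elapsed:.2f}s")
```

Because each thread spends its time waiting on the simulated remote job, four threads finish the twelve trials in roughly a quarter of the sequential time. This is why thread-based orchestration on the driver is usually sufficient when the training itself runs on the cluster.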

### Define the Objective Function for Optuna
The first step is to define the **objective function** for Optuna. This is the function whose return value Optuna will minimize by searching over hyperparameters such as the number of trees (`numTrees`) and the maximum tree depth (`maxDepth`). In our case, the objective function returns the `Root Mean Squared Error` (RMSE) on the test set, a natural metric for regression models such as the random forest regressor. Each call trains a model on the Spark cluster, so the training inside this function is distributed across Spark workers.

**Instructions:**
- **Initialize the TPESampler configuration**. In this example we will use the Tree-structured Parzen Estimator (TPE), a Bayesian optimization algorithm, along with a Gaussian prior that helps stabilize the Parzen estimator.
- **Initialize hyperparameters** using Optuna's `trial.suggest_int()` function. This function samples integers between `low` and `high` for the hyperparameter `<hyperparameter_name>` when calling `trial.suggest_int('<hyperparameter_name>', low, high)`. 
- **Train the model** using **Spark's distributed cluster** by running the `RandomForestRegressor` model on Spark workers.
- **Evaluate the model** using the **RMSE** metric (`rmse`), and return it as the value to *minimize* during the optimization. Note, we will tell Optuna to minimize the returned **RMSE** value when we create an Optuna study later. This happens outside the definition of the objective function. 
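
To see what `log=True` changes in a call such as `trial.suggest_int("num_trees", 2, 5, log=True)`, the sketch below samples integers uniformly in log space rather than linear space. It illustrates the general idea only; `sample_log_int` is a hypothetical helper, and Optuna's internal sampling may differ in detail.

```python
import math
import random
from collections import Counter

def sample_log_int(rng, low, high):
    """Draw an integer in [low, high] by sampling uniformly in log space and rounding."""
    log_low, log_high = math.log(low - 0.5), math.log(high + 0.5)
    value = math.exp(rng.uniform(log_low, log_high))
    return min(max(round(value), low), high)  # clip to guard against edge rounding

rng = random.Random(123)
counts = Counter(sample_log_int(rng, 2, 5) for _ in range(10_000))
for value in sorted(counts):
    print(value, counts[value])  # frequencies fall as the value grows
```

Smaller values are drawn more often than larger ones. The effect is modest for the narrow range of 2 to 5 used in this demo and becomes more useful for ranges that span orders of magnitude.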

Refer to the documentation for:
* [optuna.samplers](https://optuna.readthedocs.io/en/stable/reference/samplers/index.html) for the choice of samplers
* [optuna.trial.Trial](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.trial.Trial.html) for a full list of functions supported to define a hyperparameter search space.


In [0]:
import optuna

optuna_sampler = optuna.samplers.TPESampler(
  consider_prior=True, #Enhance the stability of Parzen estimator by imposing a Gaussian prior when True
  n_startup_trials=3, #The random sampling is used instead of the TPE algorithm until the given number of trials finish in the same study.
  seed=123 # Seed for random number generator.
)

In [0]:
from pyspark.ml.regression import RandomForestRegressor, LinearRegression, GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator

class ObjectiveOptuna:
    """
    Objective function class for Optuna hyperparameter tuning with SparkML models.
    
    Instead of loading the dataset in each trial execution, this class receives 
    the training and test datasets during initialization, improving efficiency.
    """

    def __init__(self, train_df, test_df, label_column="label"):
        """
        Initializes the objective function with training and test datasets.

        Args:
            train_df (DataFrame): Spark DataFrame containing features and label for training.
            test_df (DataFrame): Spark DataFrame containing features and label for evaluation.
            label_column (str): Name of the label column in the dataset. Default is "label".
        """
        self.train_df = train_df
        self.test_df = test_df
        self.label_column = label_column
    
    def objective_sparkmodel_distributed_Optuna(self, trial):
        """
        Optuna objective function for tuning regression models using SparkML. Possible models are: Linear Regression, Random Forest, and Gradient-Boosted Trees.

        Args:
            trial (optuna.trial.Trial): An Optuna trial object to suggest hyperparameters.

        Returns:
            float: Root Mean Squared Error (RMSE) to minimize.
        """

        # Select model type
        model_name = trial.suggest_categorical("model", ["LinearRegression", "RandomForest", "GBTRegressor"])

        if model_name == "LinearRegression":
            # Hyperparameter tuning for Linear Regression
            model = LinearRegression(
                featuresCol="features",
                labelCol=self.label_column,
                regParam=trial.suggest_float("reg_param", 0.0, 1.0),
                elasticNetParam=trial.suggest_float("elastic_net_param", 0.0, 1.0)
            )

        elif model_name == "RandomForest":
            # Hyperparameter tuning for Random Forest
            model = RandomForestRegressor(
                featuresCol="features",
                labelCol=self.label_column,
                numTrees=trial.suggest_int("num_trees", 2, 5, log=True),
                maxDepth=trial.suggest_int("max_depth", 3, 10),
                minInstancesPerNode=trial.suggest_int("min_instances_per_node", 1, 10)
            )

        elif model_name == "GBTRegressor":
            # Hyperparameter tuning for Gradient-Boosted Trees
            model = GBTRegressor(
                featuresCol="features",
                labelCol=self.label_column,
                maxDepth=trial.suggest_int("max_depth", 3, 10),
                maxIter=trial.suggest_int("n_estimators", 2, 5, log=True),
                stepSize=trial.suggest_float("learning_rate", 0.01, 0.5)
            )

        # Train the model
        trained_model = model.fit(self.train_df)

        # Generate predictions
        predictions = trained_model.transform(self.test_df)

        # Evaluate performance using RMSE
        rmse = RegressionEvaluator(
            labelCol=self.label_column,
            predictionCol="prediction",
            metricName="rmse"
        ).evaluate(predictions)

        return rmse
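
For reference, RMSE is the square root of the mean squared difference between labels and predictions. The plain-Python sketch below computes it on a handful of made-up values (not taken from the wine dataset), which can be handy for sanity-checking what `RegressionEvaluator(metricName="rmse")` reports.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

labels = [5.0, 5.0, 9.0]       # made-up quality scores
predictions = [5.0, 6.0, 7.0]  # made-up model outputs
print(f"{rmse(labels, predictions):.4f}")  # residuals 0, -1, 2 -> sqrt(5/3)
```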

### Optimize The Spark ML model on Single-Machine Optuna and Log Results with MLflow
In this step, we will use `MLflow` to track the optimization process through the out-of-the-box logging that Optuna provides with `MLflowCallback()`. Once the logging parameters are configured, there are two additional steps to take care of before moving on to the run.

1. Initialize a study with `optuna.create_study()`. A *study* corresponds to one optimization task, that is, a set of trials. A *trial* is a single evaluation of the *objective function*.
1. Tell Optuna how we want to optimize with `study.optimize()`.

Each trial will be logged to MLflow, including the hyperparameters tested and the corresponding `RMSE` values. Optuna will handle the optimization, while training continues to be distributed across Spark workers.

**Instructions:**
- **Set up MLflow** to track the experiments using `MLflowCallback()`.
- **Choose the storage location.** In this demonstration, the study uses Optuna's default in-memory storage on the driver node. A persistent `storage` URL is only needed for multi-process or multi-node optimization, which is discussed after the run.
- **Set up an Optuna study** with `optuna.create_study()`.
- **Optimize hyperparameters** using Optuna's `study.optimize()` method.
- **Log results to MLflow**, including the best hyperparameters and RMSE.
- **End the MLflow run** to ensure that all information is saved.

*Note on parallelization: The value of `n_jobs` in `study.optimize()` is the number of parallel jobs. If this argument is set to -1 (as we have done below), the number of parallel jobs is set to the number of CPU cores (4 cores by default in this demonstration).*

In [0]:
import os
import mlflow
import optuna
from optuna.integration.mlflow import MLflowCallback

# Set up MLflow experiment tracking
experiment_name_spark = os.path.join(
    os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()),
    "02a - Model Tuning with Optuna_spark"
)
mlflow.set_experiment(experiment_name_spark)
experiment_id_spark = mlflow.get_experiment_by_name(experiment_name_spark).experiment_id

def optuna_hpo_fn(n_trials: int, experiment_id: str, optuna_sampler) -> optuna.study.Study:
    """
    Runs hyperparameter optimization using Optuna with MLflow logging.

    Args:
        n_trials (int): Number of trials for optimization.
        experiment_id (str): MLflow experiment ID for logging.
        optuna_sampler (optuna.samplers.BaseSampler): Optuna sampler for search strategy.

    Returns:
        optuna.study.Study: The Optuna study object with optimization results.
    """

    # MLflow callback to log results
    mlflow_callback_spark = MLflowCallback(
        tracking_uri=mlflow.get_tracking_uri(),
        metric_name="RMSE",
        create_experiment=False,
        mlflow_kwargs={"experiment_id": experiment_id}
    )

    # Define the objective function
    objective_function = ObjectiveOptuna(train_df, test_df, label_column="quality").objective_sparkmodel_distributed_Optuna

    # Create or load an Optuna study
    study = optuna.create_study(
        study_name="sparkmodel_optuna_distributed_hpo",
        sampler=optuna_sampler,
        load_if_exists=True,
        direction="minimize"
    )

    # Run optimization
    study.optimize(
        objective_function,
        n_trials=n_trials,
        n_jobs=-1,  # Parallel execution
        callbacks=[mlflow_callback_spark]
    )

    # Extract best trial results
    best_trial = study.best_trial
    best_rmse = best_trial.value  # RMSE metric

    # Display results
    print(f"Best Trial Number: {best_trial.number}")
    print(f"Best Hyperparameters: {best_trial.params}")
    print(f"Best RMSE: {best_rmse:.4f}")

    # Log the best results manually in MLflow
    with mlflow.start_run(run_name="best_trial_results"):
        mlflow.log_params(best_trial.params)
        mlflow.log_metric("Best RMSE", best_rmse)

    return study  # Return study for further analysis

#### Execute the Single Node Study

In [0]:
# Disable MLflow autologging to prevent unwanted logging of model artifacts
mlflow.autolog(log_models=False, disable=True)

# Invoke Optuna training function on the driver node
single_node_study = optuna_hpo_fn(
    n_trials=10,
    experiment_id=experiment_id_spark,
    optuna_sampler=optuna_sampler
)

### Explanation: Distributing Hyperparameter Tuning and Model Training

The previous cells implemented **driver-orchestrated hyperparameter tuning with distributed model training** using **Optuna**, **MLflow**, and **Spark MLlib**. 

#### **Key Characteristics of This Setup**
| Aspect | Current Implementation |
|--------|------------------------|
| **Hyperparameter Tuning** | Runs on a **single machine** (Optuna executes locally, even if multiple trials run in parallel). |
| **Parallel Execution** | Trials are parallelized across **threads on a single machine** (the driver), while each trial's model training is **distributed across the Spark cluster**. |
| **Database Storage** | Uses Optuna's **default in-memory storage** for trials, which rules out multi-process and multi-machine execution. |
| **Experiment Logging** | MLflow logs hyperparameters and RMSE for each trial. |

#### **How to Fully Distribute Hyperparameter Tuning**
While this implementation already distributes model training, Optuna's default execution with `n_jobs` utilizes multithreading on a single node, which, due to Python's Global Interpreter Lock, allows for concurrency but not true parallelism in CPU-bound tasks. To achieve true parallelization, Optuna can be configured to use multiprocessing, either on a single node or across multiple nodes, by setting up an appropriate backend such as a relational database. To fully distribute the hyperparameter search across multiple machines:

**Use a centralized database**:
   - In `create_study()`, pass a `storage` URL. For multi-process parallelization on a single node, a SQLite file such as `storage="sqlite:////local_disk0/optuna_distributed_model.db"` is enough. For multi-process parallelization across multiple nodes, use a client/server relational database such as PostgreSQL or MySQL, for example:
     ```
     storage="mysql://root@localhost/example"
     ```
   - This allows multiple workers to share and execute trials.
   - **Requirement:** Launch a MySQL instance (for example on AWS RDS, Azure Database for MySQL, GCP Cloud SQL, or an on-premises server). See the [Optuna documentation](https://optuna.readthedocs.io/en/latest/faq.html#how-can-i-parallelize-optimization).

## Part 2: Approaches for HPO with SparkML on Databricks

Hyperparameter tuning in a **Databricks environment** can be challenging due to **SparkContext limitations** and **process forking issues** in managed clusters. Below are **two approaches** for performing hyperparameter tuning with SparkML, along with the pitfalls each one avoids or runs into.

---

### **Challenges with Hyperparameter Tuning in Spark and Python**
1. **Serialization Issues**:  
   - Passing **Spark objects** (e.g., Spark DataFrame, SparkSession, SparkContext) into a **distributed function** (Hyperopt or a Spark UDF) can cause failures due to **pickling restrictions**.

2. **Single SparkContext Per Notebook**:  
   - Databricks runs a **single Spark driver** (the notebook environment) with one **Spark session**.  
   - Workers **cannot** create new Spark sessions (`SparkSession.getOrCreate()`) without **proper master settings**.

3. **Ray's Process Forking Issue**:  
   - Even in **local mode**, Ray spawns separate processes **per trial**.  
   - These processes **do not inherit** the Spark master URL or Spark session.  
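
The serialization problem is not specific to Spark. As an analogy only, the sketch below uses a hypothetical `FakeSession` class that holds a thread lock, a resource that cannot be pickled, much as a live `SparkSession` cannot be shipped to workers. Pickling an objective that references such an object fails in a similar way.

```python
import pickle
import threading

class FakeSession:
    """Hypothetical stand-in for a driver-only handle such as SparkSession."""
    def __init__(self):
        self._lock = threading.Lock()  # OS-level resource; not picklable

class Objective:
    """Objective that keeps a reference to the driver-only handle."""
    def __init__(self, session):
        self.session = session

try:
    pickle.dumps(Objective(FakeSession()))
    outcome = "pickled"
except TypeError as error:
    outcome = f"failed: {error}"

print(outcome)
```

The usual remedy is the one used in Part 1: keep objects like these on the driver and let only the driver call into Spark.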

---

## **Recommended Approaches**

### **Option 1: Use Spark's Built-in Hyperparameter Tuning Tools**
#### Best for: **Native Spark ML hyperparameter tuning**
- **How it Works**:  
  - Leverage **Spark ML's** `CrossValidator` or `TrainValidationSplit` to perform distributed hyperparameter tuning.  
  - Spark handles **parallelism natively**.

- **Pros**:
  - Fully **compatible** with Databricks.  
  - Runs in **distributed mode**, leveraging Spark Executors.  
  - Avoids **SparkContext serialization issues**.  

- **Cons**:
  - Limited to **grid search or random search** (without custom logic).  
  - No advanced Bayesian Optimization (unless implemented manually).  

---

### Option 2: Using Hyperopt with Spark MLlib (Optional)

Hyperopt can be used for hyperparameter tuning with **Spark MLlib models**, but it is subject to important limitations. This approach may be helpful in certain cases but is **not recommended** for scalable or parallel execution.

#### What You Can Do
- Use Hyperopt's `fmin()` function with the default `Trials` object (not `SparkTrials`).
- Each hyperparameter trial runs **sequentially** on the **driver node**, which can then initiate **distributed training** using Spark MLlib.
- This setup allows for **Bayesian Optimization**, offering a more efficient search strategy compared to random or grid search.

#### What You Cannot Do
- You **cannot** use `SparkTrials` with Spark MLlib. `SparkTrials` is intended for distributing trials across a Spark cluster for **single-node machine learning libraries** like scikit-learn.
- Using `SparkTrials` with Spark MLlib will lead to SparkContext-related errors due to the incompatible execution model. For example:
```
PySparkRuntimeError: [CONTEXT_ONLY_VALID_ON_DRIVER]
Could not serialize object: SparkContext can only be used on the driver, not in code that runs on workers.
```
**References**
- [Sample Notebook: Hyperopt with Spark MLlib](https://assets.docs.databricks.com/_extras/notebooks/source/hyperopt-spark-ml.html)
- [Databricks Documentation: Hyperopt for Distributed ML](https://learn.microsoft.com/en-us/azure/databricks/machine-learning/automl-hyperparam-tuning/hyperopt-distributed-ml)
- [Hyperopt Documentation: SparkTrials with scikit-learn](https://hyperopt.github.io/hyperopt/scaleout/spark/)

#### Option 1: Use Spark ML's Built-in Hyperparameter Tuning Tools

In [0]:
import os
import time
import mlflow
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

label_column = "quality"

# MLflow Experiment Setup
experiment_name_spark_cv = os.path.join(
    os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()),
    "02c - Model Tuning with spark cv"
)
mlflow.set_experiment(experiment_name_spark_cv)
experiment_id_spark_cv = mlflow.get_experiment_by_name(experiment_name_spark_cv).experiment_id

# Ensure feature vectorization
if "features" not in train_df.columns:
    feature_cols = [col for col in train_df.columns if col != label_column]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    train_df_transformed = assembler.transform(train_df).select("features", label_column).na.drop()
else:
    feature_cols = [col for col in train_df.columns if col != label_column]
    train_df_transformed = train_df

# Define RandomForestRegressor and hyperparameter grid
rf = RandomForestRegressor(featuresCol="features", labelCol=label_column, seed=42)

param_grid = (
    ParamGridBuilder()
    .addGrid(rf.numTrees, [5, 10, 20])    # Number of trees
    .addGrid(rf.maxDepth, [2, 5, 10])    # Max tree depth
    .build()
)

# Set up CrossValidator
evaluator = RegressionEvaluator(labelCol=label_column, predictionCol="prediction", metricName="rmse")

cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=3,  # 3-fold cross-validation
    parallelism=4  # Parallel execution
)

with mlflow.start_run(run_name="spark_cv_rf", experiment_id=experiment_id_spark_cv):
    try:
        # Start training
        start_time = time.time()
        cv_model = cv.fit(train_df_transformed)
        training_duration = time.time() - start_time
        mlflow.log_metric("training_duration_s", training_duration)

        # Retrieve best model and evaluate
        best_model = cv_model.bestModel
        train_predictions = best_model.transform(train_df_transformed)
        train_rmse = evaluator.evaluate(train_predictions)
        mlflow.log_metric("train_rmse", train_rmse)
        print(f"Best Model RMSE on the training set: {train_rmse:.4f}")

        # Evaluate on test set
        test_predictions = best_model.transform(test_df)
        test_rmse = evaluator.evaluate(test_predictions)
        mlflow.log_metric("test_rmse", test_rmse)
        print(f"Test RMSE: {test_rmse:.4f}")

        # Log best hyperparameters
        best_num_trees = best_model.getNumTrees
        best_max_depth = best_model.getOrDefault("maxDepth")
        mlflow.log_param("best_numTrees", best_num_trees)
        mlflow.log_param("best_maxDepth", best_max_depth)

        print(f"Best hyperparameters → numTrees={best_num_trees}, maxDepth={best_max_depth}")

        # Log feature importances
        if hasattr(best_model, "featureImportances"):
            importances = best_model.featureImportances
            feat_imp_map = {col: val for col, val in zip(feature_cols, importances.toArray())}
            mlflow.log_text(str(feat_imp_map), "feature_importances.txt")
            print("Feature Importances:", feat_imp_map)

        # Log all hyperparameter results
        avg_metrics = cv_model.avgMetrics
        print("\nHyperparameter Combinations and Avg RMSE:")
        print("-------------------------------------------------------")
        print(f"{'numTrees':<12}{'maxDepth':<12}{'avg_rmse':<10}")
        print("-------------------------------------------------------")

        # avgMetrics is ordered like the param grid, one entry per combination
        for param_map, avg_rmse in zip(param_grid, avg_metrics):
            num_trees_val = param_map[rf.numTrees]
            max_depth_val = param_map[rf.maxDepth]
            print(f"{num_trees_val:<12}{max_depth_val:<12}{avg_rmse:<10.4f}")

    except Exception as e:
        print(f"Error during cross-validation: {e}")

# End MLflow Run
mlflow.end_run()
print("Cross-validation complete. Check MLflow UI for details.")
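
To get a feel for the cost of the grid in the cell above, the sketch below enumerates the same combinations with `itertools.product` and counts model fits, reusing that cell's `numTrees` and `maxDepth` values and `numFolds=3`. The count assumes `CrossValidator` fits every combination once per fold and then refits the best combination on the full training set.

```python
from itertools import product

num_trees_values = [5, 10, 20]
max_depth_values = [2, 5, 10]
num_folds = 3

# Every pairing of the two hyperparameters, mirroring ParamGridBuilder
grid = [
    {"numTrees": n, "maxDepth": d}
    for n, d in product(num_trees_values, max_depth_values)
]

fits_during_cv = len(grid) * num_folds
total_fits = fits_during_cv + 1  # plus one refit of the best combination on all training data

print(f"{len(grid)} combinations x {num_folds} folds = {fits_during_cv} fits, "
      f"{total_fits} including the final refit")
```

Adding a third hyperparameter with three values would triple these numbers, which is the main reason grid search scales poorly compared with the adaptive sampling used in Part 1.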

# Conclusion

In this demo, we explored how to optimize machine learning models in **Databricks** using **Optuna** and **Hyperopt** with **Spark ML**. We demonstrated how these frameworks handle hyperparameter tuning at both the **model training level** and **optimization level**, leveraging distributed computing for scalability.

We compared multiple strategies for hyperparameter tuning:
- **Single-machine tuning** using Optuna for efficient local execution.
- **Optional use of Hyperopt** with `Trials` (not `SparkTrials`) for tuning **Spark MLlib models** sequentially on the driver.
- **End-to-end SparkML tuning** using `CrossValidator` for native Spark-based optimization.

### **Key Takeaways**
- **Parallelization strategies** significantly impact model training efficiency and resource utilization.
- **Databricks provides multiple options** for hyperparameter tuning, allowing flexibility in balancing **scalability vs. compute cost**.
- **MLflow enables seamless experiment tracking**, making it easier to compare results across different tuning frameworks.

By leveraging these frameworks effectively, you can enhance model performance, streamline experimentation, and scale machine learning workflows efficiently within Databricks.

### Next Steps
In the next demonstration, we will see how to use Ray Tune for hyperparameter optimization leveraging a single node and our understanding of Optuna from this demonstration.

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>