
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# Demo - Model Deployment with Spark

In this demo, you will learn how to deploy machine learning models using Apache Spark. We will explore different deployment strategies, compare single-node and distributed Spark ML models, and perform inference with Spark UDFs to scale predictions across large datasets.

**Learning Objectives:**

_By the end of this demo, you will be able to:_

1. **Understand deployment methods** for machine learning models using Spark, including single-node and distributed model deployments.
2. **Compare single-node vs. distributed model deployments** using Spark DataFrames and Spark MLlib.
3. **Perform parallelized inference** using `spark_udf` on distributed data to efficiently handle large datasets.

## REQUIRED - SELECT CLASSIC COMPUTE
Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:
1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

2. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

   - Click **More** in the drop-down.

   - In the **Attach to an existing compute resource** window, use the first drop-down to select your unique cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

2. Find the triangle icon to the right of your compute cluster name and click it.

3. Wait a few minutes for the cluster to start.

4. Once the cluster is running, complete the steps above to select your cluster.

## Requirements

Please review the following requirements before starting the lesson:

* To run this notebook, you need a classic cluster running the following Databricks runtime: **16.4.x-cpu-ml-scala2.12**. **Do NOT use serverless compute to run this notebook**.

## Classroom Setup

Install required libraries.

In [0]:
%pip install -U optuna mlflow==2.9.2 delta-spark joblibspark pyspark==3.5.3 databricks-feature-engineering==0.12.1

dbutils.library.restartPython()

Before starting the demo, run the provided classroom setup script.

In [0]:
%run "../Includes/Classroom-Setup-Demo"

**Other Conventions:**

Throughout this demo, we'll refer to the object `DA`. This object, provided by Databricks Academy, contains variables such as your username, catalog name, schema name, working directory, and dataset locations. Run the code block below to view these details:

In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"Dataset Location:  {DA.paths.datasets.wine_quality}")

## Pre-Steps: Data Preparation

Before diving into model deployment, we first need to prepare our dataset for both training and inference. In this demo, we will use the **Wine Quality** dataset. The data will be loaded from a Delta table, and we will perform necessary manipulations such as assembling features and splitting the dataset into training and test sets.

**Instructions:**

1. **Load the Wine Quality dataset** from Delta Lake.
2. **Assemble features** into a vector using `VectorAssembler` for Spark ML training.
3. **Split the dataset** into training (80%) and testing (20%) sets for both Spark ML and single-node Scikit-learn training.
4. **Convert the Spark DataFrame** into a Pandas DataFrame for single-node model training.

In [0]:
from pyspark.ml.feature import VectorAssembler

# Load the large Wine Quality dataset from the new Delta table
data_path = f"{DA.paths.working_dir}/v01/large_wine_quality_delta"
df = spark.read.format("delta").load(data_path)
# Define feature columns
feature_columns = [
    "fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar", 
    "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide", "density", 
    "pH", "sulphates", "alcohol"
]
# Assemble features into a vector
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
df_with_features = assembler.transform(df)

# Split the data into training and test sets (for SparkML Model Training)
train_df, test_df = df_with_features.randomSplit([0.8, 0.2], seed=42)
# Convert Spark DataFrame to Pandas DataFrame for single-node model training
train_pandas_df = train_df.select(feature_columns + ["quality"]).toPandas()
test_pandas_df = test_df.select(feature_columns + ["quality"]).toPandas()

# Split features and target variable
X_train = train_pandas_df.drop(columns=["quality"])
y_train = train_pandas_df["quality"]
X_test = test_pandas_df.drop(columns=["quality"])
y_test = test_pandas_df["quality"]

# Display the loaded DataFrame to verify that the data was read correctly
display(df)
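
The cell above splits the data in Spark and then converts each split to Pandas. Note that `randomSplit` samples row by row, so the 80/20 proportions are approximate rather than exact. When the data already fits in memory as a Pandas DataFrame, an exact split could instead be made with Scikit-learn's `train_test_split`. The following is a minimal sketch under that assumption; `toy_df` and its values are synthetic stand-ins, not the Wine Quality data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "alcohol": rng.uniform(8.0, 14.0, size=1000),
    "pH": rng.uniform(2.8, 3.8, size=1000),
    "quality": rng.integers(3, 9, size=1000),
})

# Separate the features from the target column
X_toy = toy_df.drop(columns=["quality"])
y_toy = toy_df["quality"]

# Exact 80/20 split with a fixed seed for reproducibility
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(f"Training rows: {len(X_train_toy)}, test rows: {len(X_test_toy)}")
```

With 1,000 rows and `test_size=0.2`, this yields exactly 800 training rows and 200 test rows.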

## Part 1: Methods to Deploy Machine Learning Models in Spark

Spark provides multiple approaches for deploying machine learning models, making it a powerful and flexible framework for both data processing and model serving. Depending on the size of your dataset and the computational requirements, you can choose from two main approaches for model deployment:

- **Single-node model deployment:** The model is trained and executed on a single machine. Even though the model is single-node, Spark can still distribute the data for processing. This approach works well for smaller datasets and simpler applications.

- **Distributed model deployment:** This approach fully leverages Spark's distributed architecture to handle large datasets and complex computations. Models are deployed across multiple nodes in a cluster, making it suitable for large-scale machine learning tasks.

For more information on Spark MLlib and its capabilities for distributed machine learning, refer to the [Apache Spark MLlib Documentation](https://spark.apache.org/docs/latest/ml-guide.html).

### Why Use Spark for Model Deployment?
Apache Spark provides a robust platform for deploying machine learning models at scale. Spark’s ability to process distributed data and its integration with Spark MLlib, Scikit-learn, and other ML libraries make it a versatile tool for both small and large datasets. By the end of this demo, you’ll have a solid understanding of how to choose between single-node and distributed model deployment and how Spark UDFs can scale inference for a single-node model.

**Key Benefits:**
- **Scalability:** Spark is designed to handle massive datasets, scaling seamlessly across clusters. This makes it ideal for training and deploying models on large data.
- **Versatility:** Spark integrates with various machine learning libraries, including Scikit-learn, Spark MLlib, XGBoost, and more. This allows you to choose the best tools for training your models while benefiting from Spark’s distributed architecture.
- **Efficiency:** With its in-memory processing and parallelized operations, Spark ensures efficient data processing and model serving. This is useful for both batch and real-time inference scenarios.

By deploying machine learning models on Spark, you can achieve high-performance, scalable model serving that meets both real-time and batch processing needs.

**For further exploration:**  
- [Spark MLlib Documentation](https://spark.apache.org/docs/latest/ml-guide.html): Learn more about Spark’s native machine learning library.

## Part 2: Comparing Single-node and Distributed Model Deployments in Spark
In this section, you will explore and compare the two deployment methods introduced in **Part 1**, using the same model type and tree depth for both. We will specifically focus on comparing the training and inference time for deploying the **Decision Tree model** in both single-node and distributed configurations.

- **Single-node model deployment**: We will use the **DecisionTreeRegressor** model from `Scikit-learn`, which runs on a single machine to perform training and inference.
- **Distributed model deployment**: We will use the **DecisionTreeRegressor** model from Spark MLlib, which performs distributed training and inference across multiple nodes in a Spark cluster.

Both methods will be applied to the same dataset. However, the deployment approaches will differ significantly in scalability and computational performance. The single-node model executes training and inference on a single machine, while the distributed model leverages Spark’s parallel processing capabilities across a cluster.

The objective of this comparison is to evaluate the execution efficiency of both configurations in terms of training and inference time.

### Single-node Model Deployment with Spark DataFrames
In this section, we will perform single-node model deployment using the **DecisionTreeRegressor** from Scikit-learn. While the model training and inference are executed on a single machine, Spark’s distributed data capabilities will still be leveraged to handle and distribute the data efficiently.

**Instructions:**

1. **Train a DecisionTreeRegressor model** using Scikit-learn. The training data was collected from the cluster into a Pandas DataFrame during data preparation, so training occurs on a single machine (the driver).
2. **Perform inference on the test data** using the trained model.
3. **Log the model with MLflow** for future reference and tracking.
4. **Evaluate the model’s performance** by comparing the time taken for training and inference.

In [0]:
from sklearn.tree import DecisionTreeRegressor
import mlflow
import time

# Set the active experiment
mlflow.set_experiment(f"/Users/{DA.username}/Model_Deployment_with_Spark")

# Train a Decision Tree model using Scikit-learn
model = DecisionTreeRegressor(max_depth=5, random_state=42)

# Measure training time
start_time = time.time()
model.fit(X_train, y_train)
training_time_single_node = time.time() - start_time
print(f"Training time (single-node): {training_time_single_node} seconds")

# Log model with MLflow
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, "decision_tree_single_node")
    run_id_single_node = run.info.run_id  # Capture the run ID
    print(f"Single-node model logged in run: {run_id_single_node}")
    
    # Measure inference time
    start_time = time.time()
    y_pred = model.predict(X_test)
    inference_time_single_node = time.time() - start_time
    print(f"Inference time (single-node): {inference_time_single_node} seconds")

### Distributed Model Deployment with Spark MLlib

In this section, we will demonstrate the deployment of a **DecisionTreeRegressor** model using Spark MLlib. Spark MLlib fully leverages the distributed nature of Spark, making it scalable and ideal for training and inference on large datasets.

We will train a **DecisionTreeRegressor** model, perform distributed inference, and log the trained model using MLflow for future tracking and deployments.

**Steps:**
1. **Train a DecisionTreeRegressor model** using Spark MLlib on a distributed Spark DataFrame.
2. **Log the trained model** using MLflow for version control and future inference.
3. **(Optional) Set an alias** for the registered model version to easily identify and retrieve the best-performing model. Assigning an alias requires the model to be registered first; the cell below only logs it.
4. **Perform distributed inference** on the test data.

**Walkthrough:**
- **Initialize the DecisionTreeRegressor model**: We will use Spark's `DecisionTreeRegressor` to initialize and train the model on distributed data, allowing for scalable training.
- **Create a pipeline**: By using Spark's `Pipeline`, we can chain multiple steps, including feature vector assembly and model training, into a single unified workflow.
- **Train the model**: The training will be executed across a distributed Spark DataFrame, which allows for efficient and scalable processing of large datasets.
- **Log the model with MLflow**: The trained model will be logged into MLflow for tracking, reproducibility, and potential future deployments.
- **Set model alias (optional)**: Once the model is registered, a "champion" alias can be assigned to the latest model version in MLflow, making it easier to reference the best-performing version for future operations.

In [0]:
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml import Pipeline
import mlflow
import time

# Drop the existing 'features' column so the pipeline's VectorAssembler can recreate it
# (the assembler fails if its output column already exists)
if 'features' in train_df.columns:
    train_df = train_df.drop('features')
if 'features' in test_df.columns:
    test_df = test_df.drop('features')

# Define and train the DecisionTreeRegressor model in distributed mode
decision_tree = DecisionTreeRegressor(featuresCol="features", labelCol="quality", maxDepth=5)

# Create a pipeline for training
pipeline = Pipeline(stages=[assembler, decision_tree])

# Measure training time for the distributed model
start_time = time.time()
dt_model = pipeline.fit(train_df)
training_time_distributed = time.time() - start_time
print(f"Training time (distributed): {training_time_distributed} seconds")

# Log the model using MLflow
with mlflow.start_run() as run:
    mlflow.spark.log_model(dt_model, artifact_path="model-artifacts")
    run_id_distributed = run.info.run_id  # Capture the run ID
    print(f"Distributed model logged in run: {run_id_distributed}")

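    # Measure inference time. transform() is lazy, so write to the no-op sink
    # to force Spark to compute every prediction before the timer stops
    start_time = time.time()
    predictions = dt_model.transform(test_df)
    predictions.write.format("noop").mode("overwrite").save()
    inference_time_distributed = time.time() - start_time
    print(f"Inference time (distributed): {inference_time_distributed} seconds")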
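
Spark transformations such as `transform()` are lazy: they return a new DataFrame that describes the computation, and the work runs only when an action (a write, `count()`, `display()`, and so on) is triggered. This matters when timing distributed inference, because a timer that stops before any action mostly measures query planning. The sketch below is a plain-Python analogy using a generator, not Spark code; `slow_predict` is a hypothetical stand-in for a model.

```python
import time

def slow_predict(rows):
    """Hypothetical stand-in for a model: yields one prediction per row, lazily."""
    for row in rows:
        time.sleep(0.001)  # Simulate per-row inference cost
        yield row * 2

rows = range(50)

# Building the generator is like calling transform(): no work happens yet
start = time.perf_counter()
lazy_predictions = slow_predict(rows)
build_time = time.perf_counter() - start

# Consuming the generator is like triggering an action: the work runs now
start = time.perf_counter()
materialized = list(lazy_predictions)
run_time = time.perf_counter() - start

print(f"Time to define the computation: {build_time:.6f} seconds")
print(f"Time to execute the computation: {run_time:.6f} seconds")
```

The first measurement is close to zero regardless of how expensive the predictions are, which is why a fair inference timing needs to include the step that materializes the results.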

### Comparison: Single-Node vs. Distributed Model

In this section, we will compare the performance of the **Decision Tree Regressor** model deployed in both single-node and distributed configurations. Rather than focusing solely on error metrics, we will assess the performance based on the following key factors:

1. **Training Time**: Measure the time taken to train the model in both single-node and distributed configurations. This will highlight the efficiency of each approach in handling data processing and training.
2. **Inference Time**: Measure the time taken to perform inference on the test data using each model. This will help compare the scalability of the deployment methods when applying the trained model to make predictions.
3. **Resource Utilization**: Optionally, monitor and compare resource utilization (like memory and CPU usage) for each approach to understand the resource efficiency of each deployment.

By focusing on training and inference times, this comparison emphasizes the practical differences in scalability and efficiency between single-node and distributed deployments using Spark. This comparison will help you choose the best approach based on your dataset size and computational needs.

In [0]:
# Create a comparison table for training and inference times
comparison_data = {
    "Model": ["Single-Node DecisionTreeRegressor", "Distributed DecisionTreeRegressor"],
    "Training Time (seconds)": [training_time_single_node, training_time_distributed],
    "Inference Time (seconds)": [inference_time_single_node, inference_time_distributed]
}

import pandas as pd
comparison_df = pd.DataFrame(comparison_data)

# Display the comparison table
print("Time Comparison between Single-Node and Distributed Decision Tree Models:")
display(comparison_df)
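
The timing cells above repeat the same start/stop pattern around each step. As one possible refinement, a small context manager can collect the measurements in a dictionary. This is a minimal sketch: the `timed` helper and the workloads are hypothetical stand-ins, and it uses `time.perf_counter()`, which is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the elapsed wall-clock time of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}

# Stand-in workloads; in the notebook these would be the fit/predict calls
with timed("workload_a", timings):
    sum(i * i for i in range(100_000))

with timed("workload_b", timings):
    sum(i * i for i in range(200_000))

# Express the comparison as a ratio instead of two raw numbers
ratio = timings["workload_b"] / timings["workload_a"]
print(f"workload_b took {ratio:.2f}x the time of workload_a")
```

Because single runs are noisy, repeating each measurement a few times and comparing medians generally gives a more trustworthy picture than a single pair of numbers.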

## Part 3: Performing Parallelized Inference Using `spark_udf`

**Why Use UDFs for Parallelized Inference?**

User Defined Functions (UDFs) in Spark allow you to apply custom functions to data in parallel. By defining a UDF, you can distribute the inference workload across a Spark cluster. This method ensures that even if your model is single-node, the inference can scale across your distributed data.

**Instructions:**

1. **Define a Spark UDF** that applies your trained Scikit-learn model to distributed data.
2. **Use the UDF** to perform inference on the distributed Spark DataFrame.
3. **View the predictions** and save them to a Delta table.


**For further learning:**  
You can explore more about UDFs and parallelized operations in the [Spark UDF Documentation](https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html).
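
Conceptually, `mlflow.pyfunc.spark_udf` wraps the model in a pandas UDF: Spark hands each executor batches of rows as Pandas data, the model's `predict` runs on each batch, and the results are stitched back together into a column. The sketch below imitates that flow locally, without Spark, by splitting a synthetic DataFrame into chunks that stand in for partitions. The column names, data, and `predict_batch` helper are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical values, for illustration only)
rng = np.random.default_rng(42)
features = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
target = 2.0 * features["f1"] - features["f2"] + rng.normal(scale=0.1, size=1000)

# A single-node model, trained once
model = DecisionTreeRegressor(max_depth=5, random_state=42).fit(features, target)

def predict_batch(batch: pd.DataFrame) -> pd.Series:
    """Score one batch of rows, as a pandas UDF would on an executor."""
    return pd.Series(model.predict(batch), index=batch.index)

# Split the row positions into chunks that stand in for Spark partitions
n_partitions = 4
partitions = np.array_split(np.arange(len(features)), n_partitions)

# Score each "partition" independently, then reassemble the results
partition_predictions = pd.concat(
    [predict_batch(features.iloc[idx]) for idx in partitions]
)

# Scoring in batches gives the same answer as scoring everything at once
full_predictions = model.predict(features)
print(np.allclose(partition_predictions.to_numpy(), full_predictions))
```

Since each batch is scored independently, the batches can be processed in parallel on different machines, which is what allows a single-node model to serve predictions over a distributed DataFrame.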

In [0]:
from pyspark.sql.functions import struct
import mlflow.pyfunc

# Build the URI of the single-node model logged earlier, using the captured run ID
model_uri = f"runs:/{run_id_single_node}/decision_tree_single_node"

# Wrap the logged model in a Spark UDF so that executors can score their own partitions
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri,
    result_type="double",
    env_manager="local"  # Use the cluster's current Python environment
)

# Apply the UDF to distributed data for parallelized inference.
# The Scikit-learn model expects the raw feature columns (not the assembled
# 'features' vector), so they are passed together as a struct
predictions = df_with_features.withColumn(
    "predicted_quality",
    predict_udf(struct(*feature_columns))
)

In [0]:
#Display predictions
display(predictions.select(*feature_columns, "quality", "predicted_quality"))

In [0]:
# Saving predictions to a Delta table:
predictions.write.format("delta").mode("overwrite").saveAsTable(f"{DA.catalog_name}.{DA.schema_name}.single_node_udf_predictions")

### Distributed Inference Using Spark ML

**Why Use Distributed Inference with Spark ML?**

For larger datasets and machine learning models, Spark MLlib provides an efficient way to perform inference in a distributed manner. Spark MLlib handles both data distribution and computation, making it highly scalable and ideal for large-scale workloads. 

Distributed inference allows us to apply the trained model to data across multiple nodes, improving throughput and enabling batch scoring of datasets that are too large for a single machine.

**Steps:**
1. **Prepare Features**: The pipeline's `VectorAssembler` stage combines the feature columns into a single feature vector, so the input DataFrame must not already contain a `features` column.
2. **Reuse the Distributed Model**: Use the **Decision Tree Regressor** pipeline (`dt_model`) trained with Spark ML in Part 2.
3. **Perform Distributed Inference**: Apply the trained model to new data using Spark ML's `transform()` method.
4. **Save the Predictions**: Store the inference results (predictions) in a Delta table for further analysis or reporting.

**For further exploration:** see the [Spark MLlib Guide](https://spark.apache.org/docs/latest/ml-guide.html).

In [0]:
if "features" in test_df.columns:
    test_df = test_df.drop("features")

# Perform inference on the test data using the distributed Decision Tree pipeline model
predictions_dt = dt_model.transform(test_df)

# Show predictions
display(predictions_dt.select("features", "quality", "prediction"))

# Save predictions to a Delta table in Unity Catalog
predictions_dt.write.format("delta").mode("overwrite").saveAsTable(f"{DA.catalog_name}.{DA.schema_name}.distributed_predictions_table")

## Conclusion

In this demo, we explored two key deployment strategies using Spark: single-node deployment with Scikit-learn and distributed deployment with Spark MLlib. We highlighted the benefits and challenges of each approach in terms of scalability and processing power. Additionally, we demonstrated parallelized inference using Spark UDFs, showcasing how to leverage Spark’s distributed architecture for efficient large-scale predictions. This comprehensive approach to model deployment ensures that machine learning models can efficiently scale and serve predictions across large datasets.

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>