
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# Lab - Model Development with Spark
In this lab, you will engage in an end-to-end workflow to develop a machine learning model using **Apache Spark** and **Delta Lake**. You'll start by loading data from a Delta table, performing essential data manipulations, and then build a regression model using **Spark ML**. Additionally, you will log and track your model using **MLflow** to demonstrate how to manage machine learning models in production environments.

**Lab Outline:**

_By the end of this lab, you will:_

- **Task 1: Prepare Data for Model Training Using Spark DataFrames**
  - **Task 1.1:** Read a Delta Table into a Spark DataFrame.
  - **Task 1.2:** Perform Basic Data Manipulations.
  - **Task 1.3:** Write Data to a Delta Table.

- **Task 2: Build and Evaluate a Machine Learning Model Using Spark ML**
  - **Task 2.1:** Perform a Train-Test Split Using Spark ML.
  - **Task 2.2:** Assemble a Feature Vector Using Spark ML.
  - **Task 2.3:** Additional Feature Engineering (Independent Exploration) 
    - You will have the freedom to experiment with different transformations and see how they impact the model’s performance.
  - **Task 2.4:** Fit a Linear Regression Model Using Spark ML.
  - **Task 2.5:** Create and Fit a Pipeline.

- **Task 3: Evaluate and Log the Model with MLflow**

  - **Task 3.1:** Compute Predictions
    - Use the linear regression model (LRM) you built previously to make predictions.
  - **Task 3.2:** Evaluate Model Performance
    - Use metrics such as RMSE and R² to evaluate your LRM.

---

**Instructions**

Follow the lab steps closely to ensure you build a complete and well-performing model. Each task will guide you through important concepts, and you will have opportunities to explore Spark ML and Delta Lake features through independent exploration.

If you feel confident after completing the main tasks, try some **bonus challenges** at the end of each section for a deeper understanding of the material!

## REQUIRED - SELECT CLASSIC COMPUTE
Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:
1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

2. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

   - Click **More** in the drop-down.

   - In the **Attach to an existing compute resource** window, use the first drop-down to select your unique cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

2. Find the triangle icon to the right of your compute cluster name and click it.

3. Wait a few minutes for the cluster to start.

4. Once the cluster is running, complete the steps above to select your cluster.

## Requirements

Please review the following requirements before starting the lesson:

* To run this notebook, you need a classic cluster running one of the following Databricks runtime(s): **16.4.x-cpu-ml-scala2.12**. **Do NOT use serverless compute to run this notebook**.

## Classroom Setup

Before starting the Lab, run the provided classroom setup script.

In [0]:
%run "../Includes/Classroom-Setup-lab"

**Other Conventions:**

Throughout this Lab, we'll refer to the object `DA`. This object, provided by Databricks Academy, contains variables such as your username, catalog name, schema name, working directory, and dataset locations. Run the code block below to view these details:

In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"Dataset Location:  {DA.paths.datasets.california_housing}")

## Task 1: Data Preparation
In this task, you will load a dataset from a Delta table, apply basic transformations, and save the cleaned data back to a Delta table. This task is fundamental for ensuring the data is properly prepared for building a machine learning model.

### Task 1.1: Read a Delta Table into a Spark DataFrame

In this step, you will read data from a Delta table into a **Spark DataFrame**. Delta tables provide ACID transactions, scalable metadata handling, and unification of streaming and batch data processing.

**Instructions:**
1. Use the `spark.read.format("delta")` method to read the Delta table.
2. Display the schema of the DataFrame to understand the structure of the dataset.
3. Preview the data using `display()` to ensure it’s loaded correctly.

In [0]:
# Load data into a Spark DataFrame
housing_df = spark.read.format("delta").load(f"{DA.paths.working_dir}/v01/large_california_housing_delta")

# Display schema
housing_df.printSchema()

# Display data
display(housing_df)

### Task 1.2: Perform Basic Data Manipulations

Now that the data is loaded, the next step is to clean and filter the data for model training. You will select relevant columns and filter the rows based on certain conditions.

**Instructions:**
1. **Select** the columns that are most relevant for your machine learning task (e.g., features like `MedInc`, `HouseAge`, etc.).
2. **Filter** the rows to remove any invalid or irrelevant data.
3. Optionally, explore additional transformations using **PySpark** functions like `withColumn()` to modify or add new columns.

In [0]:
from pyspark.sql.functions import col

## Select and filter relevant columns
df_filtered = housing_df.select(<FILL_IN>)


## Example: Apply basic filtering (modify as you explore)
df_filtered = <FILL_IN>

## Display filtered data
<FILL_IN>

In [0]:
%skip
from pyspark.sql.functions import col

## Select and filter relevant columns
df_filtered = housing_df.select("MedInc", "HouseAge", "AveRooms", "AveBedrms", 
                        "Population", "AveOccup", "Latitude", "Longitude", "label")


## Example: Apply basic filtering (modify as you explore)
df_filtered = df_filtered.filter(col("label") > 0)

## Display filtered data
display(df_filtered)

**Exploration Prompt:**
- **Try This**: Experiment with additional transformations using `withColumn()` to create new columns or modify existing ones. For example, you could add a new column that normalizes the `MedInc` column.
- **Documentation**: [Use this PySpark DataFrames Guide](https://spark.apache.org/docs/latest/api/python/getting_started/index.html) to explore various DataFrame operations.
- **Bonus Task**: Try adding filters based on different thresholds or applying statistical transformations.
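
The lab does not say which normalization to use for `MedInc`, so the sketch below assumes min-max scaling as one possible reading. It is plain Python on made-up values, meant only to show the arithmetic; in the notebook the same expression would be built with `withColumn()` and column functions. The name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


med_inc = [2.0, 4.0, 6.0]  # made-up sample values, not taken from the dataset
print(min_max_normalize(med_inc))  # [0.0, 0.5, 1.0]
```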

### Task 1.3: Write Data to a Delta Table
After filtering and manipulating the data, you will now save the transformed DataFrame back to a Delta table. This ensures that the data is versioned and can be used in later steps of the machine learning pipeline.

**Instructions:**
1. Define the name of the output Delta table (`catalog.schema.table`).
2. Save the DataFrame using a **Delta Lake write mode**, such as `overwrite` or `append`.
3. Verify the saved data by reading it back from the Delta table and displaying it.

In [0]:
## Define the output Delta table by its catalog.schema.table name rather than a DBFS path
output_delta_table = f"{DA.catalog_name}.{DA.schema_name}.lab_delta_table"

## Save the filtered DataFrame to a Delta table (overwrite mode)
df_filtered.write.<FILL_IN>

## Verify by reading the data back from the Delta table
df_output =  <FILL_IN>
## Display the data read from the Delta table to verify
<FILL_IN>

In [0]:
%skip
## Define the output Delta table by its catalog.schema.table name rather than a DBFS path
output_delta_table = f"{DA.catalog_name}.{DA.schema_name}.lab_delta_table"

## Save the filtered DataFrame to a Delta table (overwrite mode)
df_filtered.write.format("delta").mode("overwrite").saveAsTable(output_delta_table)

## Verify by reading the data back from the Delta table
df_output = spark.read.format("delta").table(output_delta_table)

## Display the data read from the Delta table to verify
display(df_output)

**Exploration Prompt:**
- **Explore Write Modes**: Experiment with different write modes such as `append`, `ignore`, and `error`. Learn more from the [Delta Lake Quickstart Guide](https://docs.delta.io/latest/quick-start.html).
- **Bonus Task**: Try saving the data using a different write mode (e.g., `append`) and observe the behavior when you attempt to overwrite or append the same data.
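
To make the difference between the write modes concrete, here is a toy in-memory model of their semantics. It is not Delta Lake code: the `save` function, the `TableExistsError` class, and the dictionary standing in for a catalog are all hypothetical, and the sketch ignores schemas, transactions, and versioning.

```python
class TableExistsError(Exception):
    """Raised by the 'error' mode when the target table already exists."""


def save(catalog, name, rows, mode="error"):
    """Apply write-mode semantics to a dict that maps table names to row lists."""
    exists = name in catalog
    if mode == "overwrite":
        catalog[name] = list(rows)                  # replace any existing rows
    elif mode == "append":
        catalog.setdefault(name, []).extend(rows)   # add to existing rows
    elif mode == "ignore":
        if not exists:                              # silently skip if present
            catalog[name] = list(rows)
    elif mode in ("error", "errorifexists"):
        if exists:
            raise TableExistsError(name)
        catalog[name] = list(rows)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return catalog


catalog = {}
save(catalog, "lab_delta_table", [1, 2], mode="overwrite")
save(catalog, "lab_delta_table", [3], mode="append")
save(catalog, "lab_delta_table", [99], mode="ignore")  # table exists: no effect
print(catalog["lab_delta_table"])  # [1, 2, 3]
```

Calling `save` again with `mode="error"` raises `TableExistsError` in this toy model. With a real table, expect a comparable failure when you use `error` on an existing table.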

## Task 2: Model Development

In this task, you will split the dataset, create a feature vector, and train a machine learning model using **Spark ML**. You will also have the opportunity to experiment with feature engineering techniques to enhance the model's performance.


### Task 2.1: Perform a Train-Test Split Using Spark ML

Splitting the dataset into training and test sets is essential for evaluating model performance. In this task, you will perform a reproducible train-test split.

**Instructions**:
1. **Split the data** into training and test sets using an 80/20 split.
2. Use the `randomSplit()` method with a random seed to ensure reproducibility.


In [0]:
## Perform a reproducible train-test split
train_df, test_df = <FILL_IN>

## Display the number of records in each set
<FILL_IN>

In [0]:
%skip
## Perform a reproducible train-test split
train_df, test_df = df_filtered.randomSplit([0.8, 0.2], seed=42)

## Display the number of records in each set
print(f"Training Data Count: {train_df.count()}")
print(f"Test Data Count: {test_df.count()}")

**Exploration Prompt:**
- Try experimenting with different train-test split ratios (e.g., 70/30, 90/10) and observe how the size of each set changes.
- Learn more about `randomSplit()` by visiting the [Spark Documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.randomSplit.html).
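
The role of the seed can be illustrated without Spark. The sketch below is a conceptual stand-in, not Spark's implementation: each row gets one draw from a seeded generator and lands in the split whose cumulative weight covers that draw. The function name `random_split` is hypothetical. Two things carry over to the lab: split sizes are only approximately 80/20, and the same seed gives the same split. In Spark that reproducibility also assumes the input DataFrame's contents and partitioning have not changed.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split using a seeded uniform draw."""
    total = sum(weights)
    bounds, running = [], 0.0
    for w in weights:
        running += w / total          # normalize weights so they sum to 1
        bounds.append(running)
    bounds[-1] = 1.0                  # guard against floating-point drift

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()           # uniform in [0.0, 1.0)
        for i, bound in enumerate(bounds):
            if draw < bound:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))
train_a, test_a = random_split(rows, [0.8, 0.2], seed=42)
train_b, test_b = random_split(rows, [0.8, 0.2], seed=42)

print(len(train_a), len(test_a))                 # close to 800 / 200, not exact
print(train_a == train_b and test_a == test_b)   # True: same seed, same split
```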

### Task 2.2: Assemble a Feature Vector Using Spark ML

Now, you will use **VectorAssembler** to combine multiple features into a single vector. You will also experiment with **StandardScaler** to scale the feature vectors, which is important for algorithms like linear regression.

**Instructions:**
1. **Define feature columns** and combine them into a single vector using `VectorAssembler`.
2. **Scale the features** using `StandardScaler` so that, with its default settings, each feature has unit standard deviation.

In [0]:
from pyspark.ml.feature import VectorAssembler, StandardScaler

## Define feature columns (input columns)
feature_columns = <FILL_IN>

## Assemble the feature vector
assembler = VectorAssembler(inputCols=<FILL_IN>, outputCol=<FILL_IN>)
train_df = assembler.transform(<FILL_IN>)
test_df = assembler.transform(<FILL_IN>)

## Initialize StandardScaler to scale the feature vectors
scaler = StandardScaler(inputCol=<FILL_IN>, outputCol=<FILL_IN>)

## Scale the feature vectors
scaler_model = <FILL_IN>
train_df = <FILL_IN>
test_df = <FILL_IN>

In [0]:
%skip
from pyspark.ml.feature import VectorAssembler, StandardScaler

## Define feature columns (input columns)
feature_columns = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", 
                   "Population", "AveOccup", "Latitude", "Longitude"]

## Assemble the feature vector
assembler = VectorAssembler(inputCols=feature_columns, outputCol="assembled_features")
train_df = assembler.transform(train_df)
test_df = assembler.transform(test_df)

## Initialize StandardScaler to scale the feature vectors
scaler = StandardScaler(inputCol="assembled_features", outputCol="scaled_features")

## Scale the feature vectors
scaler_model = scaler.fit(train_df)
train_df = scaler_model.transform(train_df)
test_df = scaler_model.transform(test_df)

**Exploration Prompt:**
- **Experiment with Scaling**: Try running the model without scaling the features and compare the results. How does scaling affect the performance?
- **Documentation**: Refer to the [VectorAssembler Documentation](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html) for more configuration options.
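
To see what the scaling step computes, here is a plain-Python sketch for a single feature. It assumes `StandardScaler`'s default settings (`withStd=True`, `withMean=False`), under which values are divided by the sample standard deviation and not centered. The helper names and the numbers are made up for illustration. As in the lab's solution, the statistic is fitted on the training values only and then reused for the test values.

```python
import math


def fit_std(column):
    """Return the sample standard deviation (n - 1 denominator) of a column."""
    n = len(column)
    mean = sum(column) / n
    return math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))


def scale(column, std):
    """Divide by std without centering; a zero-variance column maps to 0.0."""
    return [x / std if std != 0 else 0.0 for x in column]


train_values = [2.0, 4.0, 6.0]   # made-up training values
test_values = [8.0]              # made-up test value

std = fit_std(train_values)      # learned from training data only
print(std)                       # 2.0
print(scale(train_values, std))  # [1.0, 2.0, 3.0]
print(scale(test_values, std))   # [4.0]
```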

### Task 2.3: Additional Feature Engineering (Independent Exploration)
In this section, you can experiment with additional feature transformations to enhance your model. Refer to the [Spark ML Feature Transformers documentation](https://spark.apache.org/docs/latest/ml-features.html) for more options.

**Explore the following transformations:**

- **Add another transformer:** For instance, you can use [**`PolynomialExpansion`**](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.PolynomialExpansion.html) to generate polynomial features from your existing data.
- **Feature selection:** Try using [**`VarianceThresholdSelector`**](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VarianceThresholdSelector.html) to drop features whose sample variance does not exceed a threshold. It looks only at each feature's variance, not at the label.

Feel free to experiment with these transformations or explore other options from the documentation.

In [0]:
## Perform Polynomial Expansion and Feature Selection
## Task: Add polynomial features to the dataset using PolynomialExpansion (degree=2),
## and then perform feature selection using VarianceThresholdSelector to filter features with low variance.
## Implement the code below to transform both train_df and test_df.

In [0]:
%skip
from pyspark.ml.feature import PolynomialExpansion, VarianceThresholdSelector

## Step 1: Polynomial Expansion
## Add polynomial features of degree 2
poly_expander = PolynomialExpansion(degree=2, inputCol="scaled_features", outputCol="poly_features")
train_df = poly_expander.transform(train_df)
test_df = poly_expander.transform(test_df)

## Step 2: Feature Selection using VarianceThresholdSelector
## Fit the selector on the training data
selector = VarianceThresholdSelector(varianceThreshold=0.5, featuresCol="poly_features", outputCol="selected_features")
selector_model = selector.fit(train_df)
train_df = selector_model.transform(train_df)
test_df = selector_model.transform(test_df)

**Exploration Prompt:**
- **Try Different Transformers**: Explore other feature transformers from the [Spark ML Feature Transformers documentation](https://spark.apache.org/docs/latest/ml-features.html). For example, experiment with `PCA` for dimensionality reduction.
- **Bonus Task**: Investigate the effect of polynomial expansion with higher degrees (e.g., 3 or 4).
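
The two transformations in this task can be sketched in plain Python to show how they change the feature count. This is an illustration under stated assumptions, not Spark's code: `poly_expand` lists every monomial of degree 1 up to `degree` (its output order may differ from `PolynomialExpansion`), and `select_by_variance` keeps the indices of columns whose sample variance is greater than the threshold. Both function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import prod


def poly_expand(vector, degree=2):
    """Return all monomials of the inputs with total degree 1..degree."""
    terms = []
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(vector)), d):
            terms.append(prod(vector[i] for i in idx))
    return terms


def sample_variance(column):
    """Sample variance with an n - 1 denominator."""
    n = len(column)
    mean = sum(column) / n
    return sum((x - mean) ** 2 for x in column) / (n - 1)


def select_by_variance(rows, threshold):
    """Return indices of columns whose sample variance exceeds the threshold."""
    columns = list(zip(*rows))
    return [i for i, c in enumerate(columns) if sample_variance(c) > threshold]


print(poly_expand([2.0, 3.0]))        # [2.0, 3.0, 4.0, 6.0, 9.0]
print(len(poly_expand([1.0] * 8)))    # 44 features from the lab's 8 inputs

rows = [[1.0, 0.0, 5.0],
        [1.0, 1.0, 9.0],
        [1.0, 2.0, 1.0]]              # column variances: 0.0, 1.0, 16.0
print(select_by_variance(rows, 0.5))  # [1, 2]
```

Degree 2 already turns 8 inputs into 44 features, which is worth keeping in mind before trying degree 3 or 4.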

### Task 2.4: Fit a Linear Regression Model Using Spark ML

In this task, you will train a **Linear Regression** model using the features you’ve engineered. After training, you will generate predictions on the test set.

**Instructions:**
1. **Initialize the `LinearRegression` model** using the feature and label columns.
2. **Train the model** on the training dataset and generate predictions on the test set.

In [0]:
from pyspark.ml.regression import LinearRegression

## Initialize the Linear Regression model
lr = <FILL_IN>

## Train the model on the training dataset
lr_model = <FILL_IN>

## Make predictions on the test data
predictions = <FILL_IN>

In [0]:
%skip
from pyspark.ml.regression import LinearRegression

## Initialize the Linear Regression model
lr = LinearRegression(featuresCol="selected_features", labelCol="label")

## Train the model on the training dataset
lr_model = lr.fit(train_df)

## Make predictions on the test data
predictions = lr_model.transform(test_df)

**Exploration Prompt:**
- **Experiment with Other Models**: Try replacing the `LinearRegression` model with another regression model such as `DecisionTreeRegressor` or `GBTRegressor`. How do the results differ?
- **Documentation**: Learn more about different regression models in the [Spark ML Documentation](https://spark.apache.org/docs/latest/ml-classification-regression.html).

### Task 2.5: Create and Fit a Pipeline

Now, you will streamline the feature engineering and model training steps by creating a **Pipeline**. Pipelines allow you to combine multiple stages of transformation and modeling into a single, reusable workflow.

**Instructions:**
1. **Define the pipeline stages** by combining the feature transformers and the regression model.
2. **Fit the pipeline** on the training data and log the model using MLflow.

In [0]:
from pyspark.ml import Pipeline
import mlflow
import mlflow.spark

## Define the stages for the pipeline
stages = <FILL_IN>
pipeline = <FILL_IN>

## Remove any existing columns in the training data to avoid conflicts.
## You can drop columns that were created earlier to ensure the pipeline generates these columns again without errors.
columns_to_drop = <FILL_IN>
train_df = <FILL_IN>
## Train the pipeline model on the training dataset.
## This will apply all transformations and train the regression model in a single step.
## After training the model, log it to MLflow for tracking and version control.
with mlflow.start_run() as run:
    ## Fit the pipeline on the training data
    pipeline_model = <FILL_IN>
    
    ## Log the trained pipeline model to MLflow for tracking and versioning
    mlflow.spark.log_model(<FILL_IN>)
    
    ## Log the run details such as run ID and experiment ID
    run_id = run.info.run_id
    experiment_id = run.info.experiment_id
    mlflow_run = f"https://{spark.conf.get('spark.databricks.workspaceUrl')}/#mlflow/experiments/{experiment_id}/runs/{run_id}"
    
    ## Print the MLflow run information for easy reference
    <FILL_IN>

In [0]:
%skip
from pyspark.ml import Pipeline
import mlflow
import mlflow.spark

## Define the stages for the pipeline
stages = [assembler, scaler, poly_expander, selector, lr]
pipeline = Pipeline(stages=stages)

## Remove any existing columns in the training data to avoid conflicts.
## You can drop columns that were created earlier to ensure the pipeline generates these columns again without errors.
columns_to_drop = ["assembled_features", "scaled_features", "poly_features", "selected_features"]
train_df = train_df.drop(*[c for c in columns_to_drop if c in train_df.columns])

## Train the pipeline model on the training dataset.
## This will apply all transformations and train the regression model in a single step.
## After training the model, log it to MLflow for tracking and version control.
with mlflow.start_run() as run:
    ## Fit the pipeline on the training data
    pipeline_model = pipeline.fit(train_df)
    
    ## Log the trained pipeline model to MLflow for tracking and versioning
    mlflow.spark.log_model(pipeline_model, "pipeline_model")
    
    ## Log the run details such as run ID and experiment ID
    run_id = run.info.run_id
    experiment_id = run.info.experiment_id
    mlflow_run = f"https://{spark.conf.get('spark.databricks.workspaceUrl')}/#mlflow/experiments/{experiment_id}/runs/{run_id}"
    
    ## Print the MLflow run information for easy reference
    print(f"MLflow Run ID: {run_id}")
    print(f"MLflow Run: {mlflow_run}")

**Exploration Prompt:**
- **Experiment with Different Stages**: Try adding or removing stages from the pipeline (e.g., without polynomial expansion) and observe the impact on the model's performance.
- **Bonus Task**: Add a cross-validation stage using `CrossValidator` to tune model hyperparameters within the pipeline.
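
The idea behind a pipeline can be sketched in a few lines of plain Python. The classes below are hypothetical stand-ins and not the Spark ML API. Unlike Spark, where `Pipeline.fit()` returns a separate `PipelineModel`, `MiniPipeline.fit()` returns itself for brevity. What the sketch does share with Spark is the flow: during `fit`, each stage is fitted on the output of the previous stage, and during `transform`, the fitted stages are replayed in the same order on new data.

```python
class Scale:
    """Estimator-like stage: learns a divisor from the data it is fitted on."""

    def fit(self, values):
        self.divisor = max(abs(v) for v in values) or 1.0
        return self

    def transform(self, values):
        return [v / self.divisor for v in values]


class Square:
    """Transformer-like stage: nothing to learn, so fit is a no-op."""

    def fit(self, values):
        return self

    def transform(self, values):
        return [v * v for v in values]


class MiniPipeline:
    """Chains stages so fitting and transforming each take a single call."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        for stage in self.stages:
            # Fit on the current data, then pass the transformed data onward.
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


model = MiniPipeline([Scale(), Square()]).fit([1.0, 2.0, 4.0])  # "training" data
print(model.transform([2.0, 8.0]))  # [0.25, 4.0], using the divisor learned above
```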

## Task 3: Model Evaluation

In this task, you will evaluate the performance of your trained model using common regression metrics such as **RMSE** (Root Mean Squared Error) and **R²** (Coefficient of Determination). Additionally, you will log the evaluation metrics and predictions to **MLflow**.

### Task 3.1: Compute Predictions

In this step, you'll use the trained pipeline model to compute predictions on the test data. The pipeline applies every feature transformation to the test rows before the regression model scores them.

**Instructions:**
1. **Remove conflicting columns** from the test dataset to avoid errors.
2. **Apply the pipeline model** to the test dataset to generate predictions.
3. **Start an MLflow run** for the predictions and print its run ID and link. Logging the predictions themselves as an artifact is optional.


In [0]:
## Remove existing columns in the test data to avoid conflicts
test_df = <FILL_IN>

## Apply the trained pipeline model to the test dataset to generate predictions
pipeline_predictions = pipeline_model.transform(test_df)

## Start a new MLflow run to log the predictions
with mlflow.start_run(run_name="Compute Predictions") as run:
    
    # Log the predictions as an artifact in MLflow (optional: log predictions file)
    # You can store predictions as a DataFrame or save them to a file for further analysis
    # Optionally, the predictions can be logged as artifacts if saved to a file
    
    # Log the MLflow run ID and link to the run for easy reference
    run_id = <FILL_IN>
    experiment_id = <FILL_IN>
    mlflow_run = f"https://{spark.conf.get('spark.databricks.workspaceUrl')}/#mlflow/experiments/{experiment_id}/runs/{run_id}"
    
    # Print the MLflow run details for reference
    print(f"MLflow Run ID: <FILL_IN>")
    print(f"MLflow Run: <FILL_IN>")

In [0]:
%skip
## Remove existing columns in the test data to avoid conflicts
test_df = test_df.drop(*[c for c in columns_to_drop if c in test_df.columns])

## Apply the trained pipeline model to the test dataset to generate predictions
pipeline_predictions = pipeline_model.transform(test_df)

## Start a new MLflow run to log the predictions
with mlflow.start_run(run_name="Compute Predictions") as run:
    
    ## Log the predictions as an artifact in MLflow (optional: log predictions file)
    ## You can store predictions as a DataFrame or save them to a file for further analysis
    ## Optionally, the predictions can be logged as artifacts if saved to a file
    
    ## Log the MLflow run ID and link to the run for easy reference
    run_id = run.info.run_id
    experiment_id = run.info.experiment_id
    mlflow_run = f"https://{spark.conf.get('spark.databricks.workspaceUrl')}/#mlflow/experiments/{experiment_id}/runs/{run_id}"
    
    ## Print the MLflow run details for reference
    print(f"MLflow Run ID: {run_id}")
    print(f"MLflow Run: {mlflow_run}")

### Task 3.2: Evaluate Model Performance

Now, you will evaluate your model using two common regression metrics:
- **Root Mean Squared Error (RMSE)**: Measures the average magnitude of prediction errors (lower is better).
- **R² (Coefficient of Determination)**: Measures how much variance in the target variable is explained by the model (closer to 1 is better).
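
Both metrics can be written out in plain Python to show what they compute. The helper names and the numbers below are made up for illustration. The formulas are the standard ones: RMSE is the square root of the mean squared error, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import math


def rmse(labels, predictions):
    """Square root of the mean squared prediction error."""
    n = len(labels)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, predictions)) / n)


def r2(labels, predictions):
    """1 - SS_res / SS_tot, the share of label variance the model explains."""
    mean = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean) ** 2 for y in labels)
    return 1 - ss_res / ss_tot


labels = [1.0, 2.0, 3.0, 4.0]

perfect = [1.0, 2.0, 3.0, 4.0]       # every prediction matches its label
print(rmse(labels, perfect), r2(labels, perfect))   # 0.0 1.0

mean_only = [2.5, 2.5, 2.5, 2.5]     # always predict the label mean
print(r2(labels, mean_only))         # 0.0
```

A model that always predicts the label mean scores an R² of 0, so any useful model should land above that baseline.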

**Instructions:**
1. **Initialize the evaluators** for both RMSE and R².
2. **Evaluate the model** on the test data using the `RegressionEvaluator`.
3. **Log the metrics** to MLflow for tracking.


In [0]:
from pyspark.ml.evaluation import RegressionEvaluator
import mlflow

## Initialize the evaluators for RMSE and R²
evaluator_rmse = RegressionEvaluator(predictionCol=<FILL_IN>, labelCol=<FILL_IN>, metricName="rmse")
evaluator_r2 = RegressionEvaluator(predictionCol=<FILL_IN>, labelCol=<FILL_IN>, metricName="r2")

## Evaluate RMSE and R² on the pipeline predictions
## Use the trained pipeline model's predictions to evaluate its performance using both RMSE and R²
rmse = evaluator_rmse.evaluate(<FILL_IN>)
r2 = evaluator_r2.evaluate(<FILL_IN>)

## Start a new MLflow run to log the evaluation metrics
## This run will log the evaluation metrics to track the performance of the model in MLflow
with mlflow.start_run(run_name="Evaluate Housing Model") as run:
    
    ## Log RMSE and R² metrics to MLflow for tracking and future reference
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    
    ## Retrieve the run ID and experiment ID, and generate a link to the MLflow run for easy access
    run_id = <FILL_IN>
    experiment_id = <FILL_IN>
    mlflow_run = f"https://{spark.conf.get('spark.databricks.workspaceUrl')}/#mlflow/experiments/{experiment_id}/runs/{run_id}"
    
    ## Print the MLflow run ID and the link to the logged run for easy reference
    print(f"MLflow Run ID: {<FILL_IN>}")
    print(f"MLflow Run: {<FILL_IN>}")

## Print the evaluation metrics (RMSE and R²) to see how well the model performed
print(f"Root Mean Squared Error (RMSE): <FILL_IN>")
print(f"R² (Coefficient of Determination): <FILL_IN>")

In [0]:
%skip
from pyspark.ml.evaluation import RegressionEvaluator
import mlflow

## Initialize the evaluators for RMSE and R²
evaluator_rmse = RegressionEvaluator(predictionCol="prediction", labelCol="label", metricName="rmse")
evaluator_r2 = RegressionEvaluator(predictionCol="prediction", labelCol="label", metricName="r2")

## Evaluate RMSE and R² on the pipeline predictions
## Use the trained pipeline model's predictions to evaluate its performance using both RMSE and R²
rmse = evaluator_rmse.evaluate(pipeline_predictions)
r2 = evaluator_r2.evaluate(pipeline_predictions)

## Start a new MLflow run to log the evaluation metrics
## This run will log the evaluation metrics to track the performance of the model in MLflow
with mlflow.start_run(run_name="Evaluate Housing Model") as run:
    
    ## Log RMSE and R² metrics to MLflow for tracking and future reference
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    
    ## Retrieve the run ID and experiment ID, and generate a link to the MLflow run for easy access
    run_id = run.info.run_id
    experiment_id = run.info.experiment_id
    mlflow_run = f"https://{spark.conf.get('spark.databricks.workspaceUrl')}/#mlflow/experiments/{experiment_id}/runs/{run_id}"
    
    ## Print the MLflow run ID and the link to the logged run for easy reference
    print(f"MLflow Run ID: {run_id}")
    print(f"MLflow Run: {mlflow_run}")

## Print the evaluation metrics (RMSE and R²) to see how well the model performed
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R² (Coefficient of Determination): {r2}")

## Conclusion

In this lab, you successfully explored the end-to-end machine learning workflow using Apache Spark, Delta Lake, and MLflow. You started by preparing the dataset, followed by performing feature engineering to enhance model performance. After that, you trained a machine learning model using Spark ML and evaluated it with metrics like RMSE and R². Lastly, you logged the model and its evaluation metrics to MLflow for tracking and future reference, showcasing how models can be managed effectively in a production environment.

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>