
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>


In [0]:
https://docs.databricks.com/aws/en/dlt/expectations#manage-data-quality-with-pipeline-expectations

# 2.6 - Performing Integration Tests

Integration tests for data engineering ensure that the different components of a data pipeline, such as data ingestion, transformation, storage, and retrieval, work together seamlessly in a real-world environment. These tests validate the flow of data across systems, checking for issues such as data inconsistencies, format mismatches, and processing errors that surface only when components interact.

There are multiple ways to implement integration tests within Databricks:

1. **Delta Live Tables (DLT)**: With DLT, you can use expectations to check the pipeline’s results.
    - [Manage data quality with pipeline expectations](https://docs.databricks.com/en/delta-live-tables/expectations.html#manage-data-quality-with-pipeline-expectations)

2. **Workflow Tasks**: You can also perform integration tests as a Databricks Workflow with tasks, similar to what is typically done for non-DLT code.

In this demonstration, we will briefly introduce how to perform simple integration tests with Delta Live Tables and discuss how to implement them with Workflows. Prior knowledge of DLT and Workflows is assumed.

## Objectives

- Learn how to perform integration testing in DLT pipelines using expectations.
- Understand how to perform integration tests on data from DLT pipelines using Workflow tasks.

## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

  - In the drop-down, select **More**.

  - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.

## A. Classroom Setup

Run the following cell to configure your working environment for this course. 

**NOTE:** The `DA` object is only used in Databricks Academy courses and is not available outside of these courses. It will dynamically reference the information needed to run the course.

##### The notebook "2.1 - Modularizing PySpark Code - Required" sets up the catalogs for this course. If you have not run this notebook, the catalogs will not be available.

In [0]:
%run ../Includes/Classroom-Setup-2.6

## B. Option 1 - Delta Live Tables (DLT) Pipeline with Integration Tests

In this section, we will create a DLT pipeline using the modularized functions from the `src.helpers` file, which we unit tested in the previous notebook. In the DLT pipeline, we will use these functions to create tables and then implement some simple integration tests for the output tables in our ETL pipeline for this project.

- With DLT, you can use expectations to check the pipeline’s results.
  - [Manage data quality with pipeline expectations](https://docs.databricks.com/en/delta-live-tables/expectations.html#manage-data-quality-with-pipeline-expectations)

  - [Expectation recommendations and advanced patterns](https://docs.databricks.com/en/delta-live-tables/expectation-patterns.html#expectation-recommendations-and-advanced-patterns)

  - [Applying software development & DevOps best practices to Delta Live Table pipelines](https://www.databricks.com/blog/applying-software-development-devops-best-practices-delta-live-table-pipelines)
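
As a rough sketch of what an expectation-based integration test can look like, the function below registers one check dataset that fails the pipeline update if any constraint is violated. The table name `silver_health`, its columns, and the expectation names are hypothetical and are not taken from this course's notebooks; `dlt.table`, `dlt.expect_all_or_fail`, and `dlt.read` are part of the DLT Python API. The `dlt` module is passed in as an argument so that the sketch can also be loaded outside a pipeline, where `dlt` is not available.

```python
# Hypothetical constraints: expectation name -> SQL boolean expression.
SILVER_EXPECTATIONS = {
    "id_is_not_null": "id IS NOT NULL",
    "valid_age": "age BETWEEN 0 AND 120",
}


def register_silver_checks(dlt, source_table="silver_health"):
    """Register a DLT dataset whose expectations act as integration tests.

    `dlt` is the module imported inside a DLT pipeline notebook.
    `source_table` is a hypothetical table name used for illustration.
    """

    @dlt.table(
        name=f"{source_table}_integration_check",
        comment="Fails the update if any expectation is violated.",
    )
    @dlt.expect_all_or_fail(SILVER_EXPECTATIONS)
    def check():
        # Read the upstream table produced earlier in the same pipeline.
        return dlt.read(source_table)
```

In a pipeline notebook, this would be used by running `import dlt` and then calling `register_silver_checks(dlt)`. Swapping `expect_all_or_fail` for `expect_all` would record violations in the pipeline's data quality metrics without stopping the update.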


1. We will create the DLT pipeline for this project using the Databricks Academy **`DAPipelineConfig`** class, which was specifically designed for this course with the Databricks SDK. This avoids manually creating the DLT pipeline for this demo. Typically, during development you would build the DLT pipeline manually with the UI.

    **NOTE:** The Databricks SDK is outside the scope of this course. However, if you're interested in seeing the code that uses the SDK to automate building DLT pipelines in Databricks Academy, check out the **[../Includes/Classroom-Setup-Common]($../Includes/Classroom-Setup-Common)** notebook in **Cell 6**.

    [Databricks SDK for Python](https://docs.databricks.com/en/dev-tools/sdk-python.html)

    [Databricks SDK Documentation](https://databricks-sdk-py.readthedocs.io/en/latest/)


![Full DLT Pipeline](../Includes/images/04_dlt_pipeline.png)

In [0]:
# Create the DLT pipeline for this project using the custom Databricks Academy class DAPipelineConfig that was created using the Databricks SDK. 

pipeline = DAPipelineConfig(pipeline_name=f"sdk_health_etl_{DA.catalog_dev}", 
                            catalog=f"{DA.catalog_dev}",
                            schema="default", 
                            pipeline_notebooks=[
                                "/src/dlt_pipelines/ingest-bronze-silver_dlt", 
                                "/src/dlt_pipelines/gold_tables_dlt",
                                "/tests/integration_test/integration_tests_dlt"
                              ],
                            config_variables={
                                'target':'development', 
                                'raw_data_path': f'/Volumes/{DA.catalog_dev}/default/health'
                              }
                          )

pipeline.create_dlt_pipeline()

pipeline.start_dlt_pipeline()

2. While the DLT pipeline is running, examine it through the UI by completing the following steps:

   a. In the far left navigation pane, right-click on **Pipelines** and select *Open in a New Tab*.

   b. Find your DLT pipeline named **sdk_health_etl_your_catalog_1_dev** and select it.

   c. Click **Settings** at the top right.

    - c1. In the **General** section notice that this DLT pipeline is using **Serverless** compute.

    - c2. Scroll down to the **Advanced** section. You'll notice that the pipeline contains two **Configuration** variables:

      - **target** = *'development'*
        - This `target` variable will be modified dynamically for each deployment to **development**, **stage**, and **production**.

      - **raw_data_path** = *'/Volumes/your_catalog_1_dev/default/health'*
        - This `raw_data_path` variable will be modified dynamically for each deployment to **development data**, **stage data**, and **production data**.

    - c3. Click **Cancel** at the bottom right.

   d. At the top of the pipeline page, select the kebab menu (three vertical dots) and select **View settings YAML**. Notice that the UI provides the necessary YAML files for future deployment. We will talk more about this later.

   e. In the **Pipeline details** section on the far right, you should see three notebooks being used for the **Source code**. Right-click each notebook and select *Open Link in New Tab* to examine them:

    - **Notebook 1: [..../src/dlt_pipelines/ingest-bronze-silver_dlt]($../../src/dlt_pipelines/ingest-bronze-silver_dlt)** - Obtains the DLT configuration variables that set up the target and raw data, and creates the bronze and silver tables based on those variable values.
  
    - **Notebook 2: [..../src/dlt_pipelines/gold_tables_dlt]($../../src/dlt_pipelines/gold_tables_dlt)** - Creates the gold table.
  
    - **Notebook 3: [..../tests/integration_test/integration_tests_dlt]($../../tests/integration_test/integration_tests_dlt)** - Performs simple integration tests on the bronze, silver and gold tables based on the target environment.

   f. Here is a diagram of the entire DLT pipeline for **development, stage, and production**. The values of the **target** and **raw_data_path** configuration variables determine the ingest data source and the integration tests that run (dev catalog, stage catalog, prod catalog), but the ETL pipeline itself remains the same.

  ![Explain DLT Pipeline]( ../Includes/images/04_dlt_explain_integrations.png)
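
The role of the two configuration variables can be sketched as follows. Inside a pipeline notebook, configuration values are typically read with `spark.conf.get`; here the getter is passed in as a function so that the sketch also runs outside Databricks. The expectation names and row counts per environment are invented for illustration and are not the checks used in this course.

```python
# Hypothetical per-environment checks: the ETL code stays the same,
# only the expected results differ with the `target` variable.
# The row counts are invented numbers, not values from this course.
EXPECTED_BY_TARGET = {
    "development": {"sample_row_count": "total_rows = 7500"},
    "stage": {"subset_row_count": "total_rows = 35000"},
    "production": {"has_rows": "total_rows > 0"},
}


def load_pipeline_settings(conf_get):
    """Read the pipeline configuration and pick the matching expectations.

    `conf_get` is any callable that maps a key to its value; in a DLT
    notebook this would be `spark.conf.get`.
    """
    target = conf_get("target")
    raw_data_path = conf_get("raw_data_path")
    if target not in EXPECTED_BY_TARGET:
        raise ValueError(f"Unknown target environment: {target!r}")
    return {
        "target": target,
        "raw_data_path": raw_data_path,
        "expectations": EXPECTED_BY_TARGET[target],
    }


# Example with a plain dict standing in for the Spark configuration.
example_conf = {
    "target": "development",
    "raw_data_path": "/Volumes/your_catalog_1_dev/default/health",
}
settings = load_pipeline_settings(example_conf.get)
print(settings["expectations"])
```

Raising an error for an unknown `target` is a deliberate choice in this sketch: a misspelled environment name then stops the update early instead of silently running without checks.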

## C. Option 2 - Integration Testing with Notebooks and Databricks Workflows
You can also perform integration testing using notebooks and add them as tasks in jobs for your pipeline. 

**NOTE:** We will only review how to implement integration tests with Workflows, in case that is the method you prefer. The final deployment for this course uses the DLT integration tests with expectations.

#### Steps to take:
1. Create a setup notebook to handle any dynamic setup required using job parameters for your target environment and data locations.

2. Create additional notebooks or files to store the integration tests you want to run as tasks.

3. Organize the new notebooks or files within your **tests** folder.

4. Create a Workflow. Within the Workflow:

   - a. Create the necessary tables or views using DLT or code.

   - b. Add tasks to set up your integration tests (e.g., setting up any dynamic job parameters that need to be set).

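   - c. Perform validation by adding your test notebooks as tasks and configuring the job so that all of them must succeed (for example, by setting **Run if dependencies** to **All succeeded** on downstream tasks).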

**NOTE:** One major drawback of this approach is that you will need to write more code for setup and validation tasks, as well as manage the job parameters that dynamically modify the code based on the target environment.
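
A validation notebook used as a Workflow task can be as simple as a function that raises an exception when a check fails, since an uncaught exception fails the task. The sketch below shows one possible shape; the table names and thresholds are hypothetical. In a real notebook, the target environment would typically come from a job parameter read with `dbutils.widgets.get`, and the counts from something like `spark.table(...).count()`.

```python
def validate_row_counts(actual_counts, minimum_counts):
    """Raise AssertionError listing every table that is missing or too small."""
    failures = []
    for table, minimum in minimum_counts.items():
        actual = actual_counts.get(table)
        if actual is None:
            failures.append(f"{table}: table missing")
        elif actual < minimum:
            failures.append(f"{table}: {actual} rows, expected at least {minimum}")
    if failures:
        raise AssertionError("Integration checks failed: " + "; ".join(failures))


# Hypothetical values standing in for counts collected from the target catalog.
actual = {"bronze": 7500, "silver": 7500, "gold": 12}
minimums = {"bronze": 1, "silver": 1, "gold": 1}
validate_row_counts(actual, minimums)  # no exception: every check passed
```

Collecting all failures before raising means a single task run reports every broken table at once, which is usually more helpful than stopping at the first failed check.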

## Summary
Integration testing can be performed in a variety of ways within Databricks. In this demonstration, we focused on how to perform simple integration tests using DLT expectations. We also discussed how to implement them with Workflow tasks.

Depending on your specific situation, you can choose the approach that best fits your needs.


&copy; 2025 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the 
<a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use">Terms of Use</a> | 
<a href="https://help.databricks.com/">Support</a>