# Cookbook 3: Validate data with GX Core and GX Cloud

**Note:** ***The GX Cloud UI screenshots contained in this cookbook are current as of*** `2025-01-06`. ***As GX Cloud continues to evolve, it is possible that you will see a difference between the latest UI and the screenshots displayed here.***

This cookbook showcases a data validation workflow characteristic of vetting existing data in an organization's data stores. It could be representative of two groups within an organization enforcing a publisher-subscriber data contract, or representative of users ensuring that data meets the quality requirements for its intended use, such as analytics or modeling.

[Cookbook 1](Cookbook_1_Validate_data_during_ingestion_happy_path.ipynb) and [Cookbook 2](Cookbook_2_Validate_data_during_ingestion_take_action_on_failures.ipynb) explored GX Core workflows that were run within a data pipeline, orchestrated by Airflow. This cookbook introduces [GX Cloud](https://greatexpectations.io/gx-cloud) as an additional tool to store and visualize data validation results and features a hybrid workflow using GX Core, GX Cloud, and Airflow.

This cookbook builds on [Cookbook 1: Validate data during ingestion (happy path)](Cookbook_1_Validate_data_during_ingestion_happy_path.ipynb) and [Cookbook 2: Validate data during ingestion (take action on failures)](Cookbook_2_Validate_data_during_ingestion_take_action_on_failures.ipynb) and focuses on how data validation failures can be programmatically handled in the pipeline based on GX Validation Results and how failures can be shared using GX Cloud. This cookbook assumes basic familiarity with GX Core workflows; for a step-by-step explanation of the GX data validation workflow, refer to [Cookbook 1](Cookbook_1_Validate_data_during_ingestion_happy_path.ipynb) and [Cookbook 2](Cookbook_2_Validate_data_during_ingestion_take_action_on_failures.ipynb).

## Imports

The GX Core content of this cookbook uses the `great_expectations` library.

The `tutorial_code` module contains helper functions used within this notebook and the associated Airflow pipeline.

The `airflow_dags` submodule is included so that you can inspect the code used in the related Airflow DAG directly from this notebook.

In [None]:
import inspect
import os

import great_expectations as gx
import great_expectations.expectations as gxe
import IPython
import pandas as pd

import tutorial_code as tutorial
import airflow_dags.cookbook3_validate_postgres_table_data as dag

## The GX data quality platform

The Great Expectations data quality platform consists of:
* [GX Cloud](https://greatexpectations.io/gx-cloud), a fully managed SaaS solution, with web portal, and
* [GX Core](https://github.com/great-expectations/great_expectations), the open source Python framework.

GX Cloud and GX Core can be used separately for a cloud-only or programmatic-only approach ([Cookbook 1](Cookbook_1_Validate_data_during_ingestion_happy_path.ipynb) and [Cookbook 2](Cookbook_2_Validate_data_during_ingestion_take_action_on_failures.ipynb) are examples of a Core-only workflow). However, using GX Core and GX Cloud *together* provides a solution in which GX Cloud serves as a single source of truth for data quality definition and application, and GX Core enables flexible integration of data validation into existing data stacks. Together, GX Cloud and GX Core enable you to achieve data quality definition, monitoring, and management using UI-based workflows, programmatic workflows, or hybrid workflows.

The diagram below depicts different ways you might opt to use the platform (but is not exhaustive):

In [None]:
IPython.display.Image(
    "img/diagrams/gx_cloud_core_architecture.png",
    alt="Example modes of working together with GX Cloud and GX Core",
    width=900,
)

## Cookbook workflow

In this cookbook, you will use GX Core, GX Cloud, and Airflow to define data quality for sample data, run data validation, and explore the results of data validation. The key steps are:
1. Define your Data Asset, Expectations, and Checkpoint programmatically with GX Core
2. Store the GX workflow configuration in your GX Cloud organization
3. Trigger data validation from an Airflow pipeline
4. Explore data validation results in GX Cloud

The diagram below depicts, in more detail, the underlying interactions of GX Core, GX Cloud, Airflow, and the sample data Postgres database. As you work through this cookbook, you'll implement each of these interactions.

In [None]:
IPython.display.Image(
    "img/diagrams/cookbook3_workflow.png",
    alt="GX Cloud and GX Core interactions in Cookbook3",
    width=900,
)

## Verify GX Cloud connectivity

Before working through the tutorial, check that your GX Cloud organization credentials are available in this notebook environment, and log in to GX Cloud.

### Check that GX Cloud credentials are defined

Valid GX Cloud organization credentials need to be provided for GX Core to persist workflow configuration and validation results to GX Cloud. Run the code below to check that your credentials are available in this notebook environment.

In [None]:
if tutorial.cloud.gx_cloud_credentials_exist():
    print(
        "Found stored credentials in the GX_CLOUD_ORGANIZATION_ID and GX_CLOUD_ACCESS_TOKEN environment variables."
    )

```{warning} GX Cloud credential error
If `tutorial.cloud.gx_cloud_credentials_exist()` raises a `ValueError` indicating that `GX_CLOUD_ORGANIZATION_ID` or `GX_CLOUD_ACCESS_TOKEN` is undefined, ensure that you provided your GX Cloud organization ID and access token when starting Docker Compose.
```
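
For reference, a credential check of this kind can be sketched with the standard library alone. The helper names below are hypothetical and are not the implementation in `tutorial_code`; only the two environment variable names come from this cookbook.

```python
import os
from typing import List, Mapping, Optional

# Environment variable names used throughout this cookbook.
REQUIRED_VARIABLES = ("GX_CLOUD_ORGANIZATION_ID", "GX_CLOUD_ACCESS_TOKEN")


def find_missing_gx_cloud_credentials(
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


def require_gx_cloud_credentials(env: Optional[Mapping[str, str]] = None) -> bool:
    """Raise ValueError if a credential is missing, otherwise return True."""
    missing = find_missing_gx_cloud_credentials(env)
    if missing:
        raise ValueError(f"Undefined environment variable(s): {', '.join(missing)}")
    return True
```

Passing a plain dictionary as `env` makes the check easy to try without touching your real environment.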

### Log in to GX Cloud

In a separate browser window or tab, log in to [GX Cloud](https://hubs.ly/Q02TyCZS0).

## Connect to source data

In this tutorial, you will validate customer profile information that is hosted in a publicly available Postgres database, provided by GX. The customer profile data extends the sample customer data used in [Cookbook 1](Cookbook_1_Validate_data_during_ingestion_happy_path.ipynb). Data for each customer includes their age (in years) and annual income (in USD).

In [None]:
pd.read_sql_query(
    "select count(*) from customer_profile",
    con=tutorial.db.get_cloud_postgres_engine(),
).iloc[0]["count"]

In [None]:
pd.read_sql_query(
    "select * from customer_profile limit 5",
    con=tutorial.db.get_cloud_postgres_engine(),
)

## Profile source data and determine data quality checks

The scenario explored in this cookbook assumes that the data has been vetted for schema adherence and completeness. Notably, all rows contain required fields and data is non-null and in a valid format.

The Expectations that you create will assess the distribution of the customer profile dataset - representative of data testing performed before using data for analysis or machine learning purposes.

In [None]:
tutorial.cookbook3.visualize_customer_age_distribution()

In [None]:
tutorial.cookbook3.visualize_customer_income_distribution()

You will use the following Expectations in this cookbook to validate distribution of the sample customer profile data:
* The minimum customer age is between 20 and 25 years
* The maximum customer age is 90 years or younger
* The median customer annual income is between 45k and 50k, with a standard deviation of 10k.
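
To make these checks concrete, here is a minimal plain-Python sketch of the same kind of range tests. The sample values are made up for illustration and are not rows from `customer_profile`. The thresholds are copied from the Expectations code later in this cookbook, and the use of the sample standard deviation is an assumption of the sketch. The `is_between` helper is hypothetical, not a GX function.

```python
import statistics

# Made-up sample values for illustration only; these are NOT rows from the
# customer_profile table.
ages = [22, 31, 45, 58, 67, 84]
annual_incomes_usd = [38_000, 42_000, 47_000, 49_000, 55_000, 61_000]


def is_between(value, min_value=None, max_value=None):
    """Return True if value lies within the inclusive [min_value, max_value] range.

    A bound of None means that side of the range is unbounded.
    """
    if min_value is not None and value < min_value:
        return False
    if max_value is not None and value > max_value:
        return False
    return True


checks = {
    "min age": is_between(min(ages), min_value=20, max_value=25),
    "max age": is_between(max(ages), max_value=90),
    "median income": is_between(
        statistics.median(annual_incomes_usd), min_value=45_000, max_value=50_000
    ),
    # Sample standard deviation (an assumption of this sketch).
    "income stdev": is_between(
        statistics.stdev(annual_incomes_usd), min_value=10_000, max_value=10_000
    ),
}

for name, passed in checks.items():
    print(f"{name}: {'pass' if passed else 'fail'}")
```

Setting the minimum and maximum to the same value, as in the standard deviation check, requires an exact match, so that check is very likely to fail on real data.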

## GX validation workflow

The GX data validation workflow was introduced in [Cookbook 1](Cookbook_1_Validate_data_during_ingestion_happy_path.ipynb) and [Cookbook 2](Cookbook_2_Validate_data_during_ingestion_take_action_on_failures.ipynb); refer to these cookbooks for walkthroughs of the following GX components:
* Data definition: Data Source, Data Asset, Batch Definition, Batch
* Data quality definition: Expectation, Expectation Suite
* Data Validation: Validation Definition, Checkpoint, Validation Result

This cookbook will provide additional detail on the Data Context and discuss the choice of Data Context when introducing GX Cloud into your data validation workflow.

### Ephemeral and Cloud Data Contexts

All GX Core workflows start with the creation of a Data Context. A Data Context is the Python object that serves as an entrypoint for the GX Core Python library, and it also manages the settings and metadata for your GX workflow.

* An **Ephemeral Data Context** stores the configuration of your GX workflow in memory. Workflow configurations do not persist beyond the current notebook or Python session.

  ```
  context = gx.get_context(mode="ephemeral")
  ```

* A **Cloud Data Context** stores the configuration of your GX workflow in GX Cloud. Configurations stored in GX Cloud are accessible by others in your organization and can be used across sessions and mediums - in Python notebooks, Python scripts, and orchestrators that support Python. When creating a Cloud Data Context, you need to provide credentials for the specific GX Cloud organization that you want to use.

  ```
  context = gx.get_context(
      mode="cloud",
      cloud_organization_id="<my-gx-cloud-org-id>",
      cloud_access_token="<my-gx-cloud-access-token>"
  )
  ```

For additional detail on Data Contexts, see [Create a Data Context](https://docs.greatexpectations.io/docs/core/set_up_a_gx_environment/create_a_data_context) in the GX Core documentation.

The `gx.get_context()` method, when called with no arguments, will auto-discover your GX Cloud organization id and access token credentials if they are available as the `GX_CLOUD_ORGANIZATION_ID` and `GX_CLOUD_ACCESS_TOKEN` environment variables, respectively.

In [None]:
context = gx.get_context()

if (os.getenv("GX_CLOUD_ORGANIZATION_ID", None) is not None) and (
    os.getenv("GX_CLOUD_ACCESS_TOKEN", None) is not None
):
    assert isinstance(context, gx.data_context.CloudDataContext)
    print("GX Cloud credentials found, created CloudDataContext.")

### Define validation workflow and persist configuration in GX Cloud

```{admonition} Reminder: Adding GX components to the Data Context
GX components must have unique names. Once a component is created with the Data Context, adding another component with the same name will cause an error. To enable repeated execution of cookbook cells that add GX workflow components, you will see the following pattern:

    try:
        Add a new component(s) to the context
    except:
        Get component(s) from the context by name, or delete and recreate the component(s)
```
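
The pattern in the reminder above can be sketched in plain Python. `InMemoryRegistry` is a hypothetical stand-in for a name-keyed component store, not a GX class, and the `ValueError` it raises is an assumption of this sketch rather than the exception GX raises.

```python
class InMemoryRegistry:
    """Hypothetical stand-in for a store that enforces unique component names."""

    def __init__(self):
        self._components = {}

    def add(self, name, component):
        if name in self._components:
            raise ValueError(f"A component named {name!r} already exists.")
        self._components[name] = component
        return component

    def get(self, name):
        return self._components[name]

    def delete(self, name):
        del self._components[name]


def add_or_get(registry, name, component):
    """Add a component, or fetch the existing one if the name is taken."""
    try:
        return registry.add(name, component)
    except ValueError:
        return registry.get(name)


def add_or_replace(registry, name, component):
    """Add a component, deleting and recreating it if the name is taken."""
    try:
        return registry.add(name, component)
    except ValueError:
        registry.delete(name)
        return registry.add(name, component)


registry = InMemoryRegistry()
first = add_or_get(registry, "customer profiles", {"version": 1})
again = add_or_get(registry, "customer profiles", {"version": 2})  # returns the first
replaced = add_or_replace(registry, "customer profiles", {"version": 3})
```

Use the get variant when the existing definition is still valid, and the delete-and-recreate variant when the definition may have changed between runs.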

#### Create the GX Data Asset

Create the Cloud Data Context and the initial components that define a Data Asset for the sample customer profile data.

In [None]:
# Create the Cloud Data Context.
context = gx.get_context()

# Create the Data Source, Data Asset, and Batch Definition.
try:
    data_source = context.data_sources.add_postgres(
        "GX tutorial", connection_string=tutorial.db.get_gx_postgres_connection_string()
    )
    data_asset = data_source.add_table_asset(
        name="customer profiles", table_name="customer_profile"
    )
    batch_definition = data_asset.add_batch_definition_whole_table(
        "customer profiles batch definition"
    )

except Exception:
    # The components already exist, so retrieve them by name instead.
    data_source = context.data_sources.get("GX tutorial")
    data_asset = data_source.get_asset(name="customer profiles")
    batch_definition = data_asset.get_batch_definition(
        "customer profiles batch definition"
    )

#### Examine the Data Asset in GX Cloud

Since the Cloud Data Context was used to create the Data Source and Data Asset, you will now see these components in your GX Cloud organization. View the Data Asset in the [GX Cloud UI](https://hubs.ly/Q02TyCZS0).

In [None]:
IPython.display.Image(
    "img/screencaptures/cookbook3_new_data_asset.png",
    alt="Data Asset created in GX Cloud using a Cloud Data Context",
    width=900,
)

You will see that the newly created Data Asset does not contain any Expectations or Validation Results yet.

In [None]:
IPython.display.Image(
    "img/screencaptures/cookbook3_new_asset_no_expectations_validations.png",
    alt="A Data Asset newly created with GX Core does not yet have Expectations or Validation Results in GX Cloud",
    width=700,
)

#### Add Expectations and a Checkpoint to the workflow

Continue to build your GX validation workflow, adding the Expectation Suite, Expectations, Validation Definition, and Checkpoint.

In [None]:
EXPECTATION_SUITE_NAME = "Customer profile expectations"
VALIDATION_DEFINTION_NAME = "Customer profile validation definition"
CHECKPOINT_NAME = "Customer profile checkpoint"


def create_gx_validation_workflow_components(
    expectation_suite_name: str, validation_definition_name: str, checkpoint_name: str
) -> gx.Checkpoint:
    """Create the Expectation Suite, Validation Definition, and Checkpoint for the Cookbook 3 workflow.

    Returns:
        GX Checkpoint object
    """

    # Create the Expectation Suite.
    expectation_suite = context.suites.add(
        gx.ExpectationSuite(name=expectation_suite_name)
    )

    # Add Expectations to Expectation Suite.
    expectations = [
        gxe.ExpectColumnMinToBeBetween(column="age", min_value=20, max_value=25),
        gxe.ExpectColumnMaxToBeBetween(column="age", max_value=90),
        gxe.ExpectColumnMedianToBeBetween(
            column="annual_income_usd", min_value=45_000, max_value=50_000
        ),
        gxe.ExpectColumnStdevToBeBetween(
            column="annual_income_usd", min_value=10_000, max_value=10_000
        ),
    ]

    for expectation in expectations:
        expectation_suite.add_expectation(expectation)

    expectation_suite.save()

    # Create the Validation Definition.
    validation_definition = context.validation_definitions.add(
        gx.ValidationDefinition(
            name=validation_definition_name,
            data=batch_definition,
            suite=expectation_suite,
        )
    )

    # Create the Checkpoint.
    checkpoint = context.checkpoints.add(
        gx.Checkpoint(
            name=checkpoint_name,
            validation_definitions=[validation_definition],
        )
    )

    return checkpoint


# Create (or recreate: delete & create) the cookbook workflow components.
try:
    checkpoint = create_gx_validation_workflow_components(
        expectation_suite_name=EXPECTATION_SUITE_NAME,
        validation_definition_name=VALIDATION_DEFINTION_NAME,
        checkpoint_name=CHECKPOINT_NAME,
    )

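except Exception:
    # The components already exist: delete them (Checkpoint first, then the
    # components it depends on) before recreating them.
    context.checkpoints.delete(name=CHECKPOINT_NAME)
    context.validation_definitions.delete(name=VALIDATION_DEFINTION_NAME)
    context.suites.delete(name=EXPECTATION_SUITE_NAME)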

    checkpoint = create_gx_validation_workflow_components(
        expectation_suite_name=EXPECTATION_SUITE_NAME,
        validation_definition_name=VALIDATION_DEFINTION_NAME,
        checkpoint_name=CHECKPOINT_NAME,
    )

#### Examine Expectations in GX Cloud

Examine the newly added Expectations in the [GX Cloud UI](https://hubs.ly/Q02TyCZS0). You will see the GX Core-created Expectations under the **Cloud API** section.

In [None]:
IPython.display.Image(
    "img/screencaptures/cookbook3_expectations_added.png",
    alt="GX Cloud display of Expectations added using GX Core",
    width=800,
)

### Validate the sample data

The GX workflow configuration is now persisted in your GX Cloud organization, accessible via the Cloud Data Context. Run the Checkpoint to validate the sample customer profile data, and then explore the Validation Results in GX Cloud.

#### Run the Checkpoint

Run the Checkpoint to validate the sample data.

In [None]:
checkpoint_result = checkpoint.run()

#### View results in GX Cloud

The Validation Result object can be extracted from the returned Checkpoint Result object. When produced using a Cloud Data Context, the Validation Result object provides a `result_url` field that contains a direct link to your Validation Results in GX Cloud.

In [None]:
validation_result = checkpoint_result.run_results[
    list(checkpoint_result.run_results.keys())[0]
]

print(
    f"Click this link to view your Validation Results in GX Cloud:\n\n{validation_result.result_url}"
)

In [None]:
IPython.display.Image(
    "img/screencaptures/cookbook3_linked_validation_results.png",
    alt="Validation Results of Expectations against sample customer profile data",
    width=800,
)

The Validation Results for the `Customer profile expectations` suite inform you that three out of four Expectations passed. The Expectation that the standard deviation of customer annual income is 10k failed - the results indicate that the observed standard deviation is slightly lower, about $9.6k.

For the purposes of this tutorial, it is important that an Expectation failed, rather than why it failed, so that you can experience the exploration of both passing and failing results in GX Cloud.
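
As a rough illustration of how a mixed outcome like this could be summarized in code, the sketch below tallies passes and failures from a simplified, hand-written list. The key names are an assumption of this sketch, not the GX Validation Result schema, and the only observed value included is the approximate figure quoted above.

```python
# Simplified, hand-written stand-in for the outcome described above. The key
# names are an assumption of this sketch (not the GX result schema).
expectation_outcomes = [
    {"expectation": "minimum age between 20 and 25", "success": True},
    {"expectation": "maximum age at most the configured limit", "success": True},
    {"expectation": "median annual income between 45k and 50k", "success": True},
    {
        "expectation": "annual income standard deviation equal to 10k",
        "success": False,
        "observed_value": 9_600,  # approximate, quoted as "about $9.6k"
    },
]


def summarize(outcomes):
    """Return (number of passed outcomes, list of failed outcomes)."""
    failed = [outcome for outcome in outcomes if not outcome["success"]]
    return len(outcomes) - len(failed), failed


passed_count, failed_outcomes = summarize(expectation_outcomes)
print(f"{passed_count} of {len(expectation_outcomes)} Expectations passed")
for outcome in failed_outcomes:
    print(
        f"FAILED: {outcome['expectation']} "
        f"(observed: {outcome.get('observed_value')})"
    )
```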

## Integrate GX Cloud validation in the Airflow DAG

You have run data validation from this notebook. Next, you will run data validation within an Airflow DAG.

### Inspect DAG code

Examine the DAG code below that defines the `cookbook3_validate_postgres_table_data` pipeline. The key actions of the code are:
* Fetch and run the GX Cloud Checkpoint.

   ```
   context = gx.get_context()
 
   checkpoint = context.checkpoints.get("Customer profile checkpoint")
   
   checkpoint_result = checkpoint.run()
   ```

   * Note that the code assumes that the GX Cloud credentials have been made available in the Airflow environment so that `gx.get_context()` returns a Cloud Data Context.
   * This code snippet, customized for your desired Checkpoint, can be retrieved from GX Cloud using the validation code snippet feature. See the next section of this cookbook for more detail.

* Extract and log the result of validation and the GX Cloud results URL.

   ```
   validation_result = checkpoint_result.run_results[
       list(checkpoint_result.run_results.keys())[0]
   ]

   if validation_result["success"]:
       log.info(f"Validation succeeded: {validation_result.result_url}")
   else:
       log.warning(f"Validation failed: {validation_result.result_url}")
   ```
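
The DAG logic above reads only the first entry in `run_results`, which is sufficient for a Checkpoint with a single Validation Definition. If a Checkpoint had several, one possible variation is to log every entry, sketched below with stand-in objects instead of a real Checkpoint Result. The function name `log_run_results` and the placeholder URLs are hypothetical, and the sketch assumes each result exposes `success` and `result_url` attributes.

```python
import logging
from types import SimpleNamespace

log = logging.getLogger("validation_sketch")


def log_run_results(run_results) -> bool:
    """Log one line per Validation Result; return True only if all succeeded.

    `run_results` is assumed to be a mapping whose values expose `success`
    and `result_url` attributes.
    """
    all_succeeded = True
    for result in run_results.values():
        if result.success:
            log.info("Validation succeeded: %s", result.result_url)
        else:
            log.warning("Validation failed: %s", result.result_url)
            all_succeeded = False
    return all_succeeded


# Stand-in objects with placeholder URLs, used instead of a real Checkpoint Result.
fake_run_results = {
    "run-a": SimpleNamespace(success=True, result_url="https://example.invalid/a"),
    "run-b": SimpleNamespace(success=False, result_url="https://example.invalid/b"),
}
overall_success = log_run_results(fake_run_results)
```

Returning a single boolean gives the calling task one value to branch on while the log keeps a link for every result.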

In [None]:
%pycat inspect.getsource(dag)

### GX Cloud validation code snippet

GX Cloud will generate a validation code snippet, which provides the code needed to run a GX Cloud Checkpoint using GX Core. The validation code snippet can be copy-pasted within an Airflow DAG to trigger a Checkpoint run. 

1. Navigate to the Validations tab of your Data Asset.
2. Click the **Use code snippet** button `</>` directly to the right of the **Validate** button.
3. Click **Generate Snippet**.

This displays the Validation Expectations dialog box, which contains a GX Core 1.0.x code snippet that has been populated with the name of your Checkpoint. For instance, for the Checkpoint created by this cookbook, you'll see the following snippet:
```
import great_expectations as gx

context = gx.get_context()

checkpoint = context.checkpoints.get("Customer profile checkpoint")

checkpoint.run()
```

In [None]:
IPython.display.Image(
    "img/screencaptures/cookbook3_validation_code_snippet.png",
    alt="Generate the validation code snippet in GX Cloud",
    width=800,
)

### View the Airflow pipeline

To view the `cookbook3_validate_postgres_table_data` pipeline in the Airflow UI, log in to the locally running Airflow instance.

1. Open [http://localhost:8080/](http://localhost:8080/) in a browser window.
2. Log in with these credentials:
   * Username: `admin`
   * Password: `gx`

You will see the pipeline under **DAGs** on login.

In [None]:
IPython.display.Video("img/screencaptures/log_in_to_airflow.mp4")

### Trigger the pipeline

You can trigger the DAG from this notebook, using the provided convenience function in the cell below, or you can trigger the DAG manually in the Airflow UI.

In [None]:
tutorial.airflow.trigger_airflow_dag_and_wait_for_run(
    "cookbook3_validate_postgres_table_data"
)

To trigger the `cookbook3_validate_postgres_table_data` DAG from the Airflow UI, click the **Trigger DAG** button (with a play icon) under *Actions*. This will queue the DAG and it will execute shortly. The successful run is indicated by the run count inside the green circle under **Runs**. The triggering of a similar DAG is shown in the clip below.

In [None]:
IPython.display.Video("img/screencaptures/trigger_airflow_dag.mp4")

The `cookbook3_validate_postgres_table_data` DAG can be rerun multiple times; you can experiment with running it from this notebook or from the Airflow UI. 
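
A third way to start a run is over HTTP. The sketch below builds, but does not send, such a request. It assumes the local instance is Airflow 2.x with the stable REST API and basic authentication enabled, which this cookbook does not confirm; the tutorial's own helper may work differently. The function name `build_trigger_request` is hypothetical, and the URL and credentials are the local ones listed above.

```python
import base64
import json
import urllib.request


def build_trigger_request(
    dag_id,
    base_url="http://localhost:8080",
    username="admin",
    password="gx",
):
    """Build (but do not send) a POST request that asks Airflow to start a DAG run.

    Assumes the Airflow 2.x stable REST API with basic authentication enabled.
    """
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}/api/v1/dags/{dag_id}/dagRuns",
        data=json.dumps({"conf": {}}).encode(),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_trigger_request("cookbook3_validate_postgres_table_data")
print(request.get_method(), request.full_url)

# To actually send the request, uncomment the next line:
# urllib.request.urlopen(request)
```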

### View pipeline results

Once the pipeline has been run, Validation Results are available in GX Cloud. You can either go directly to the GX Cloud UI, or access the link from the pipeline logs. To access the pipeline run logs in the Airflow UI:

1. On the DAGs screen, click on the run(s) of interest under Runs.
2. Click the name of the individual run you want to examine. This will load the DAG execution details.
3. Click the Graph tab, and then the `cookbook3_validate_postgres_table_data` task box on the visual rendering.
4. Click the Logs tab to load the DAG logs. The link to the GX Cloud results will be in the log output.

In [None]:
IPython.display.Video(
    "img/screencaptures/cookbook3_view_pipeline_results.mp4", width=1000
)

## Review and take action on validation results in GX Cloud

### Review validation results

When you integrate data validation into your pipeline using GX, GX Cloud provides a central UI to review and share validation results; result output is not limited to pipeline log messages.

Data validation results are shown in GX Cloud on the **Validations** tab of a Data Asset. You can access these results using a direct link (as shown in this cookbook), or by navigating within the GX Cloud UI.

In addition to the results of individual runs, the Validations tab provides a historical view of your data validation results over multiple runs. This consolidated view contributes to an improved understanding and monitoring of your Data Asset health and quality over time, rather than relying on point-in-time assessments of data quality.

In [None]:
IPython.display.Image(
    "img/screencaptures/cookbook3_gx_cloud_validations_tab_over_time.png",
    alt="GX Cloud Validations tab showing validation results over time",
    width=800,
)

### Take action on validation results

GX Cloud enables you to take action on results generated by validation in your data pipeline. The key capabilities are Alerting, Sharing, and in-app triggering of Validation.

In [None]:
IPython.display.Image(
    "img/screencaptures/cookbook3_gx_cloud_validations_tab_actions.png",
    alt="GX Cloud Validations tab actions: Alert, Share, Validate",
    width=400,
)

Alerting is enabled by default on newly created Data Assets. If any Validations fail, then you will receive an email that notifies you of the failure and provides a direct link to the failing validation run.

In [None]:
IPython.display.Image(
    "img/screencaptures/cookbook3_validation_failure_email_alert.png",
    alt="GX Cloud email alert for data validation failure",
    width=800,
)

Results can easily be shared with others in your organization. Once individuals have been [added to your GX Cloud organization](https://docs.greatexpectations.io/docs/cloud/users/manage_users#invite-a-user), then you can provide a Share link that takes them directly to the validation run of interest.

Validation can be triggered manually from the GX Cloud UI, enabling data developers and other stakeholders to revalidate data without needing to modify the existing data pipeline operation.

## Summary

This cookbook has walked you through the process of defining a validation workflow with GX Core, persisting the workflow configuration in GX Cloud, integrating data validation in an Airflow pipeline, and then accessing and taking action on validation results in GX Cloud.

[Cookbook 1](Cookbook_1_Validate_data_during_ingestion_happy_path.ipynb), [Cookbook 2](Cookbook_2_Validate_data_during_ingestion_take_action_on_failures.ipynb), and Cookbook 3 (this cookbook) have demonstrated how you can integrate data validation in a Python-enabled orchestrator using GX. While the cookbook examples have used Airflow DAGs, the same principles will apply when using GX in other orchestrators, such as Dagster, Prefect, or any other orchestrator that supports Python code.