# Cookbook 1: Validate data during ingestion (happy path)

This cookbook showcases a sample GX data validation workflow characteristic of data ingestion at the start of the data pipeline. Data is loaded into a Pandas dataframe, cleaned, validated, and then ingested into a Postgres database table.

This cookbook explores the validation workflow first in a notebook setting, then embedded within an Airflow pipeline. Airflow pipelines are also referred to as directed acyclic graphs, or DAGs.

This cookbook features a "happy path" scenario in which data passes validation and generates a successful pipeline run.

## Imports and constant definition

This tutorial features the `great_expectations` library.

The `tutorial_code` module contains helper functions used within this notebook and the associated Airflow pipeline.

The `airflow_dags` submodule is included so that you can inspect the code used in the related Airflow DAG directly from this notebook.

In [None]:
import pathlib
import inspect

import great_expectations as gx
import great_expectations.expectations as gxe
import pandas as pd

import tutorial_code as tutorial
import airflow_dags.cookbook1_ingest_customer_data as dag

## Load raw customer data

In this tutorial, you will clean and validate a dataset containing synthesized customer data. The data is loaded from a CSV file into a Pandas DataFrame.

In [None]:
DATA_DIR = pathlib.Path("/cookbooks/data/raw")

df_customers_raw = pd.read_csv(DATA_DIR / "customers.csv", encoding="unicode_escape")

In [None]:
print(f"Loaded {df_customers_raw.shape[0]} customer rows into dataframe.\n")

display(df_customers_raw.head())

## Examine destination table

The customer data will be loaded into a Postgres table, `customers`. Examine the schema of the destination table and compare it with the initial schema and contents of the raw customer data.

In [None]:
tutorial.db.get_table_schema(table_name="customers")

Prior to running the Airflow pipeline, the Postgres `customers` table contains no data.

In [None]:
tutorial.db.get_table_row_count(table_name="customers")

## Clean customer data

To clean the customer data, you will use a pre-prepared function, `clean_customer_data`. The cleaning code is displayed below, and then invoked to clean the raw customer data.

In [None]:
%pycat inspect.getsource(tutorial.cookbook1.clean_customer_data)

In [None]:
df_customers = tutorial.cookbook1.clean_customer_data(df_customers_raw)

In [None]:
# Report the row count of the cleaned dataframe, not the raw one.
print(f"Loaded {df_customers.shape[0]} cleaned customer rows.\n")

display(df_customers.head())

## GX validation workflow

You will validate the cleaned customer data using GX prior to loading it into a Postgres database table. First, this section walks through an example of a simple GX data validation workflow. Then, you'll apply that knowledge to validate the customer data.

This tutorial contains concise explanations of GX components and workflows. For more detail, visit the [Introduction to GX Core](https://docs.greatexpectations.io/docs/core/introduction/) in the GX docs.

### Validate data with a single Expectation

All GX workflows start with the creation of a **Data Context**. A Data Context is the Python object that serves as an entrypoint for the GX Core Python library, and it also manages the settings and metadata for your GX workflow.

In [None]:
context = gx.get_context()

Next, you create a **Data Source**, **Data Asset**, and **Batch Definition**. You then use the Batch Definition to generate a **Batch** of data to validate.

```{admonition} Adding GX components to the Data Context
GX components must have unique names. Once a component has been added to the Data Context, adding another component with the same name causes an error. To enable repeated execution of cookbook cells that add GX workflow components, you will see the following pattern:

    try:
        Add a new component(s) to the context
    except:
        Get component(s) from the context by name
```
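
The add-or-get pattern described above can be sketched in plain Python. This is a minimal illustration only: `NameRegistry` and `add_or_get` are hypothetical names invented for this example, not part of GX, and the registry simply mimics a store that rejects duplicate names.

```python
# Hypothetical stand-in for a store that rejects duplicate names.
# It only illustrates the add-or-get pattern; it is not GX code.
class NameRegistry:
    def __init__(self):
        self._items = {}

    def add(self, name, value):
        if name in self._items:
            raise ValueError(f"A component named {name!r} already exists.")
        self._items[name] = value
        return value

    def get(self, name):
        return self._items[name]


def add_or_get(registry, name, value):
    """Add a component, or fetch the existing one if the name is taken."""
    try:
        return registry.add(name, value)
    except ValueError:
        return registry.get(name)


registry = NameRegistry()
first = add_or_get(registry, "pandas", {"kind": "data source"})
# Simulate re-running the same notebook cell: the add fails, so the
# existing component is returned instead.
second = add_or_get(registry, "pandas", {"kind": "data source"})
```

Because the second call falls back to `get`, re-running the cell leaves you with the same component rather than an error.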

In [None]:
# Create Data Source, Data Asset, and Batch Definition.
try:
    data_source = context.data_sources.add_pandas("pandas")
    data_asset = data_source.add_dataframe_asset(name="customer data")
    batch_definition = data_asset.add_batch_definition_whole_dataframe(
        "batch definition"
    )

except Exception:
    data_source = context.data_sources.get("pandas")
    data_asset = data_source.get_asset(name="customer data")
    batch_definition = data_asset.get_batch_definition("batch definition")

# Get the Batch from the Batch Definition.
batch = batch_definition.get_batch(batch_parameters={"dataframe": df_customers})

An **Expectation** is a simple, declarative, verifiable assertion about your data. You can validate a Batch of data using an Expectation. Available Expectations can be easily found and instantiated using the `gxe` alias defined in the cookbook imports.

First, create an Expectation that expects the columns in the customer data to match the provided ordered list of column names.

In [None]:
expectation = gxe.ExpectTableColumnsToMatchOrderedList(
    column_list=["customer_id", "name", "dob", "city", "state", "zip", "country"]
)
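
Conceptually, the Expectation above asserts that the column names match the list in both content and order. The sketch below shows that idea in plain Python; `columns_match_ordered_list` is a hypothetical helper written for this illustration and is not how GX implements the check.

```python
def columns_match_ordered_list(actual_columns, expected_columns):
    """Return True only if the names and their order both match."""
    return list(actual_columns) == list(expected_columns)


expected = ["customer_id", "name", "dob", "city", "state", "zip", "country"]

# Identical names in the identical order.
same_order = columns_match_ordered_list(expected, expected)

# Same names, but the first two columns are swapped.
swapped = columns_match_ordered_list(
    ["name", "customer_id", "dob", "city", "state", "zip", "country"],
    expected,
)
```

Here `same_order` is `True` and `swapped` is `False`, which is why a reordered dataframe would not satisfy an ordered-list check.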

Next, validate your Batch using the Expectation.

In [None]:
validation_result = batch.validate(expectation)

GX returns an `ExpectationValidationResult` object that provides metadata about the result of the validation and that can be accessed like a dictionary. The `ExpectationValidationResult` provides a variety of fields, most critically, the `success` field that indicates whether or not the Expectation passed.

In [None]:
print(f"Results type: {type(validation_result)}\n")

display(validation_result)

### Validate data with an Expectation Suite

Batches of data can also be validated with an **Expectation Suite**, which is a collection of Expectations.

First, add a new Expectation Suite to the Data Context. 

In [None]:
# Create Expectation Suite.
try:
    expectation_suite = context.suites.add(
        gx.ExpectationSuite(name="customer expectations")
    )
except Exception:
    # The suite already exists: delete it, then recreate it empty.
    context.suites.delete(name="customer expectations")
    expectation_suite = context.suites.add(
        gx.ExpectationSuite(name="customer expectations")
    )

Next, add Expectations to the Expectation Suite. Below, you will see Expectations that describe the required format of the customer data added to the Expectation Suite.

In [None]:
expectations = [
    gxe.ExpectTableColumnsToMatchOrderedList(
        column_list=["customer_id", "name", "dob", "city", "state", "zip", "country"]
    ),
    gxe.ExpectColumnValuesToBeOfType(column="customer_id", type_="int"),
    *[
        gxe.ExpectColumnValuesToBeOfType(column=x, type_="str")
        for x in ["name", "city", "state", "zip"]
    ],
    gxe.ExpectColumnValuesToMatchRegex(column="dob", regex=r"^\d{4}-\d{2}-\d{2}$"),
    gxe.ExpectColumnValuesToBeInSet(
        column="country", value_set=["AU", "CA", "DE", "FR", "GB", "IT", "NL", "US"]
    ),
]

for expectation in expectations:
    expectation_suite.add_expectation(expectation)
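
To make the regex and value-set rules above concrete, here is a rough plain-Python equivalent applied to single rows. It is a sketch under stated assumptions: `check_row` and the sample rows are hypothetical, and the code does not reproduce how GX evaluates these Expectations internally.

```python
import re

# Same pattern and value set as in the Expectations above.
DOB_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")
ALLOWED_COUNTRIES = {"AU", "CA", "DE", "FR", "GB", "IT", "NL", "US"}


def check_row(row):
    """Apply the date-format and country checks to one row (a dict)."""
    return {
        "dob_format_ok": bool(DOB_PATTERN.match(row["dob"])),
        "country_ok": row["country"] in ALLOWED_COUNTRIES,
    }


# Hypothetical sample rows, invented for illustration.
good = check_row({"dob": "1990-04-12", "country": "US"})
bad = check_row({"dob": "12/04/1990", "country": "BR"})
shape_only = check_row({"dob": "1990-13-45", "country": "US"})
```

Note that the pattern checks only the shape of the string, so a value such as `1990-13-45` still passes the format check even though it is not a real calendar date.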

Lastly, validate the Batch using the Expectation Suite.

In [None]:
# Validate Batch using Expectation Suite.
validation_result = batch.validate(expectation_suite)

When validating a Batch using an Expectation Suite, GX returns an `ExpectationSuiteValidationResult` object. 

In [None]:
type(validation_result)

Like the `ExpectationValidationResult` object, the `ExpectationSuiteValidationResult` object provides metadata about the result of the validation, and it also contains results for each of the individual Expectations that were run during the validation.

* The `success` field indicates whether or not the validation passed. All individual Expectations in the Expectation Suite must pass for `success` to be `True`.
* The `results` field contains individual results for each Expectation.
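
The relationship between the two fields can be sketched with plain dictionaries. This is an illustration only: the `expectation` key and the sample entries are hypothetical simplifications, and real GX result objects carry more detail.

```python
def suite_success(results):
    """A suite passes only when every individual Expectation passes."""
    return all(r["success"] for r in results)


def failed_expectations(results):
    """List the labels of the Expectations that did not pass."""
    return [r["expectation"] for r in results if not r["success"]]


# Hypothetical, simplified per-Expectation results.
all_pass = [
    {"expectation": "column order", "success": True},
    {"expectation": "dob format", "success": True},
]
one_fail = [
    {"expectation": "column order", "success": True},
    {"expectation": "country in set", "success": False},
]

overall_all_pass = suite_success(all_pass)
overall_one_fail = suite_success(one_fail)
failures = failed_expectations(one_fail)
```

A single failing entry is enough to make the overall outcome `False`, and the per-Expectation entries tell you which check to investigate.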

In [None]:
print(f"Validation passed: {validation_result['success']}\n")

display(validation_result["results"])

## Integrate GX validation in the Airflow DAG

You will use the `success` metadata of the GX validation result object to control the actions of the `cookbook1_validate_and_ingest_to_postgres` Airflow pipeline.

### Inspect DAG code

Examine the DAG code below that defines the `cookbook1_validate_and_ingest_to_postgres` pipeline. The DAG code checks the results of the GX validation before data is written to Postgres. If validation succeeds, the data is written to Postgres, but if validation fails, the pipeline will raise an error and halt.

```
# Halt pipeline with error if validation fails.
if not validation_result["success"]:
    raise Exception("GX data validation failed.")
```
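
The gating logic can also be sketched as a small standalone function. This is a minimal sketch, not the DAG's actual code: `validate_then_load`, `ValidationFailedError`, and the stand-in load callables are hypothetical names introduced for this example.

```python
class ValidationFailedError(Exception):
    """Hypothetical error type used to halt the pipeline."""


def validate_then_load(validation_result, load_fn):
    """Run load_fn only if the validation result reports success."""
    if not validation_result["success"]:
        raise ValidationFailedError("GX data validation failed.")
    return load_fn()


loaded = []

# Validation passed: the load step runs.
validate_then_load({"success": True}, lambda: loaded.append("rows written"))

# Validation failed: the load step is never reached.
try:
    validate_then_load({"success": False}, lambda: loaded.append("should not run"))
except ValidationFailedError as err:
    halted_message = str(err)
```

Raising before the load call means that no write is attempted when validation fails, which is the behavior the DAG relies on to keep invalid data out of the table.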

In [None]:
%pycat inspect.getsource(dag)

### View the Airflow pipeline

To view the `cookbook1_validate_and_ingest_to_postgres` pipeline in the Airflow UI, log into the locally running Airflow instance.

1. Open [http://localhost:8080/](http://localhost:8080/) in a browser window.
2. Log in with these credentials:
   * Username: `admin`
   * Password: `gx`

You will see the pipeline under **DAGs** on login.

![Log in to tutorial Airflow UI](static/images/cookbook1_log_in_to_airflow_ui.gif)

### Trigger the Airflow pipeline

You can trigger the DAG from this notebook, using the provided convenience function in the cell below, or you can trigger the DAG manually in the Airflow UI.

In [None]:
dag_run_id, dag_run_state = tutorial.airflow.trigger_airflow_dag(
    "cookbook1_validate_and_ingest_to_postgres"
)
print(f"DAG run {dag_run_id} is {dag_run_state}.")

To trigger the `cookbook1_validate_and_ingest_to_postgres` DAG from the Airflow UI, click the **Trigger DAG** button (with a play icon) under Actions. This will queue the DAG and it will execute shortly. The successful run is indicated by the run count inside the green circle under Runs.

![Trigger the Airflow DAG](static/images/cookbook1_trigger_dag.gif)

The `cookbook1_validate_and_ingest_to_postgres` DAG can be rerun multiple times; you can experiment with running it from this notebook or from the Airflow UI. The pipeline uses insert-ignore behavior when writing to the Postgres `customers` table: it does not attempt to insert a row with the same primary key as an existing row, so repeated runs do not create duplicates.
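
The effect of insert-ignore on repeated runs can be demonstrated with an in-memory SQLite database from the Python standard library. This is a stand-in for illustration only: the two-column table and sample rows are hypothetical, SQLite's `INSERT OR IGNORE` is used in place of Postgres (where the comparable construct is `INSERT ... ON CONFLICT DO NOTHING`), and the DAG's actual SQL may differ.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)"
)

# Hypothetical sample rows, invented for illustration.
rows = [(1, "Ada"), (2, "Grace")]


def insert_ignore(connection, records):
    """Insert rows, skipping any whose primary key already exists."""
    with connection:  # commits on success, rolls back on error
        connection.executemany(
            "INSERT OR IGNORE INTO customers (customer_id, name) VALUES (?, ?)",
            records,
        )


insert_ignore(conn, rows)
insert_ignore(conn, rows)  # simulated second pipeline run

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

After both calls the table still holds two rows, since the second run finds every primary key already present and skips the inserts.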

### View pipeline results

Once the pipeline has been run, the `customers` table is populated with the cleaned customer data. You can view the updated table count and a sampling of rows.

In [None]:
tutorial.db.get_table_row_count(table_name="customers")

In [None]:
pd.read_sql_query(
    "select * from customers limit 10", con=tutorial.db.get_local_postgres_engine()
)

It can also be helpful to view the pipeline logs to investigate the details of a successful (or unsuccessful) run. To examine these logs in the Airflow UI:

1. On the DAGs screen, click on the run(s) of interest under Runs.
2. Click the name of the individual run you want to examine. This will load the DAG execution details.
3. Click the Graph tab, and then the `cookbook1_validate_and_ingest_to_postgres` task box on the visual rendering.
4. Click the Logs tab to load the DAG logs.

You can see in the screen capture below that the logs reflect the row insertion print statement that was included in the DAG code.

![Check logs for successful pipeline run](static/images/cookbook1_check_pipeline_logs.gif)

## Summary

This cookbook has walked you through the process of validating data using GX and integrating the data validation workflow in an Airflow pipeline.

Future cookbooks will explore additional scenarios in which pipeline validation fails, the pipeline is halted, and invalid data is automatically handled in the pipeline execution.