# Cookbook 2: Validate data during ingestion (take action on failures)

This cookbook showcases a sample GX data validation workflow characteristic of data ingestion at the start of the data pipeline. Data is loaded into a Pandas dataframe, cleaned, validated, and then ingested into a Postgres database table. This cookbook explores the validation workflow first in a notebook setting, then embedded within an Airflow pipeline.

This cookbook features a scenario in which a subset of data fails validation and must be handled in the pipeline.

This cookbook builds on [Cookbook 1: Validate data during ingestion (happy path)](Cookbook_1_Validate_data_during_ingestion_happy_path.ipynb) and focuses on how data validation failures can be programmatically handled in the pipeline based on GX Validation Results. This cookbook assumes basic familiarity with GX Core workflows; for a step-by-step explanation of the GX data validation workflow, refer to [Cookbook 1](Cookbook_1_Validate_data_during_ingestion_happy_path.ipynb). 

## Imports

This tutorial features the `great_expectations` library.

The `tutorial_code` module contains helper functions used within this notebook and the associated Airflow pipeline.

The `airflow_dags` submodule is included so that you can inspect the code used in the related Airflow DAG directly from this notebook.

In [None]:
import pathlib
import inspect

import great_expectations as gx
import great_expectations.expectations as gxe
import pandas as pd

import tutorial_code as tutorial

## Load raw data

In this tutorial, you will clean and validate a dataset containing synthesized product data. The data is loaded from a CSV file into a Pandas DataFrame.

In [None]:
DATA_DIR = pathlib.Path("/cookbooks/data/raw")

df_products_raw = pd.read_csv(DATA_DIR / "products.csv", encoding="unicode_escape")

In [None]:
print(f"Loaded {df_products_raw.shape[0]} product rows into dataframe.\n")

display(df_products_raw.head())

## Examine destination tables

The product data will be normalized and loaded into multiple Postgres tables:
* `products`
* `product_category`
* `product_subcategory`

Examine the schema of the destination tables and compare to the initial schema and contents of the raw product data.

In [None]:
tutorial.db.get_table_schema(table_name="products")

In [None]:
tutorial.db.get_table_schema(table_name="product_category")

In [None]:
tutorial.db.get_table_schema(table_name="product_subcategory")

## Clean product data

You will use a prepared function, `clean_product_data`, to clean the product data and split it into three normalized dataframes. The cleaning code is displayed below and then invoked on the raw product data.

In [None]:
%pycat inspect.getsource(tutorial.cookbook2.clean_product_data)

In [None]:
df_products, df_product_categories, df_product_subcategories = (
    tutorial.cookbook2.clean_product_data(df_products_raw)
)

In [None]:
print(f"Loaded {df_products.shape[0]} cleaned product rows.\n")

df_products.head()

In [None]:
print(f"Loaded {df_product_categories.shape[0]} cleaned product category rows.\n")

df_product_categories.head()

In [None]:
print(f"Loaded {df_product_subcategories.shape[0]} cleaned product subcategory rows.\n")

df_product_subcategories.head()

## GX data validation workflow

You will validate the cleaned product data using GX prior to loading it into a Postgres database table.

The GX data validation workflow was introduced in [Cookbook 1](Cookbook_1_Validate_data_during_ingestion_happy_path.ipynb), which provided a walkthrough of the following GX components:
* Data Context
* Data Source
* Data Asset
* Batch Definition
* Batch
* Expectation
* Expectation Suite
* Validation Result

This cookbook will extend the GX validation workflow to include the Validation Definition and Checkpoint components, and will further explore the validation metadata returned in the Validation Result.

This tutorial contains concise explanations of GX components and workflows. For more detail, visit the [Introduction to GX Core](https://docs.greatexpectations.io/docs/core/introduction/) in the GX docs.

### Set up the GX validation workflow

This validation will create the following Expectations:
* Expect that the product dataset contains the expected columns, in the specified order
* Expect that all product unit prices are at least $1 USD
* Expect that all products have a higher unit price than unit cost
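To make these three Expectations concrete, here is a minimal plain-Python sketch of the logic they describe. It does not use GX. The sample rows and the helper names `columns_match`, `price_below_minimum`, and `price_not_above_cost` are hypothetical, and only three of the product columns are shown for brevity.

```python
# Hypothetical sample rows; the real product data has more columns.
rows = [
    {"product_id": 1, "unit_cost_usd": 2.50, "unit_price_usd": 5.00},
    {"product_id": 2, "unit_cost_usd": 0.20, "unit_price_usd": 0.75},
    {"product_id": 3, "unit_cost_usd": 9.00, "unit_price_usd": 8.00},
]
expected_columns = ["product_id", "unit_cost_usd", "unit_price_usd"]


def columns_match(rows, expected):
    """True when every row has exactly the expected columns, in order."""
    return all(list(row.keys()) == expected for row in rows)


def price_below_minimum(rows, minimum=1.0):
    """IDs of rows whose unit price is below the minimum (the bound itself passes)."""
    return [r["product_id"] for r in rows if r["unit_price_usd"] < minimum]


def price_not_above_cost(rows):
    """IDs of rows whose unit price is not strictly greater than unit cost."""
    return [
        r["product_id"]
        for r in rows
        if not r["unit_price_usd"] > r["unit_cost_usd"]
    ]


print(columns_match(rows, expected_columns))
print(price_below_minimum(rows))
print(price_not_above_cost(rows))
```

In this sample, product 2 is priced under $1 and product 3 is priced below its cost, so each of the two row-level checks flags one row.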

```{admonition} Reminder: Adding GX components to the Data Context
GX component names must be unique within a Data Context. Once a component has been added to the Data Context, adding another component with the same name causes an error. To allow cookbook cells that add GX workflow components to be run repeatedly, this cookbook uses the following pattern:

    try:
        Add new component(s) to the context
    except:
        Get component(s) from the context by name, or delete and recreate the component(s)
```
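The pattern in the reminder above can be sketched in runnable form. The `Registry` class below is a hypothetical stand-in for a name-keyed store and is not part of GX; the helpers `add_or_get` and `add_or_replace` are likewise illustrative names for the two recovery strategies.

```python
class Registry:
    """Hypothetical name-keyed store that rejects duplicate names (not a GX class)."""

    def __init__(self):
        self._items = {}

    def add(self, name, value):
        if name in self._items:
            raise ValueError(f"'{name}' already exists")
        self._items[name] = value
        return value

    def get(self, name):
        return self._items[name]

    def delete(self, name):
        del self._items[name]


def add_or_get(registry, name, value):
    """Add the component, or fetch the existing one if the name is taken."""
    try:
        return registry.add(name, value)
    except ValueError:
        return registry.get(name)


def add_or_replace(registry, name, value):
    """Add the component, deleting and recreating it if the name is taken."""
    try:
        return registry.add(name, value)
    except ValueError:
        registry.delete(name)
        return registry.add(name, value)


registry = Registry()
first = add_or_get(registry, "suite", {"version": 1})
second = add_or_get(registry, "suite", {"version": 2})      # name taken: original returned
third = add_or_replace(registry, "suite", {"version": 3})   # name taken: replaced
print(first, second, third)
```

Fetching the existing component suits definitions that do not change between runs, while deleting and recreating suits components whose contents may have been edited since the cell last ran.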

In [None]:
# Create the Data Context.
context = gx.get_context()

# Create the Data Source, Data Asset, and Batch Definition.
try:
    data_source = context.data_sources.add_pandas("pandas")
    data_asset = data_source.add_dataframe_asset(name="product data")
    batch_definition = data_asset.add_batch_definition_whole_dataframe(
        "batch definition"
    )

except Exception:
    # The components already exist, so retrieve them by name.
    data_source = context.data_sources.get("pandas")
    data_asset = data_source.get_asset(name="product data")
    batch_definition = data_asset.get_batch_definition("batch definition")

# Get the Batch from the Batch Definition.
batch = batch_definition.get_batch(batch_parameters={"dataframe": df_products})

# Create the Expectation Suite.
try:
    expectation_suite = context.suites.add(
        gx.ExpectationSuite(name="product expectations")
    )
except Exception:
    # The suite already exists, so delete it before recreating it.
    context.suites.delete(name="product expectations")
    expectation_suite = context.suites.add(
        gx.ExpectationSuite(name="product expectations")
    )

expectations = [
    gxe.ExpectTableColumnsToMatchOrderedList(
        column_list=[
            "product_id",
            "name",
            "brand",
            "color",
            "unit_cost_usd",
            "unit_price_usd",
            "product_category_id",
            "product_subcategory_id",
        ]
    ),
    gxe.ExpectColumnValuesToBeBetween(column="unit_price_usd", min_value=1.0),
    gxe.ExpectColumnPairValuesAToBeGreaterThanB(
        column_A="unit_price_usd", column_B="unit_cost_usd"
    ),
]

for expectation in expectations:
    expectation_suite.add_expectation(expectation)

validation_result = batch.validate(expectation_suite)

In [None]:
validation_result["success"]

### Extend the validation workflow

A **Validation Definition** pairs a Batch Definition with an Expectation Suite. It defines what data you want to validate using which Expectations.

In [None]:
# Create the Validation Definition.
try:
    validation_definition = context.validation_definitions.add(
        gx.ValidationDefinition(
            name="product validation definition",
            data=batch_definition,
            suite=expectation_suite,
        )
    )
except Exception:
    context.validation_definitions.delete(name="product validation definition")
    validation_definition = context.validation_definitions.add(
        gx.ValidationDefinition(
            name="product validation definition",
            data=batch_definition,
            suite=expectation_suite,
        )
    )

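A **Checkpoint** executes data validation based on the specifications of the Validation Definition. Checkpoints also enable Actions to be tied to validation outcomes, and they accept a `result_format` setting that controls how much detail is returned in the Validation Results.

The Checkpoint below uses the `COMPLETE` result format and sets `unexpected_index_column_names` to `["product_id"]`, so each row that fails an Expectation is reported together with its `product_id`. For more detail, see [Choose a result format](https://docs.greatexpectations.io/docs/core/trigger_actions_based_on_results/choose_a_result_format/) in the GX docs.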

In [None]:
# Create Checkpoint.
# Define the result format once so both branches below share it.
result_format = {
    "result_format": "COMPLETE",
    # "include_unexpected_rows": True,
    # "exclude_unexpected_values": True,
    "unexpected_index_column_names": ["product_id"],
}

try:
    checkpoint = context.checkpoints.add(
        gx.Checkpoint(
            name="checkpoint",
            validation_definitions=[validation_definition],
            result_format=result_format,
        )
    )
except Exception:
    context.checkpoints.delete(name="checkpoint")
    checkpoint = context.checkpoints.add(
        gx.Checkpoint(
            name="checkpoint",
            validation_definitions=[validation_definition],
            result_format=result_format,
        )
    )

Next, run the Checkpoint. When validating dataframe Data Sources, the dataframe must be supplied to the Checkpoint at runtime.

In [None]:
checkpoint_result = checkpoint.run(batch_parameters={"dataframe": df_products})

## Examine Validation Result

In [None]:
# Extract the first Validation Result object from the Checkpoint results.
validation_result = next(iter(checkpoint_result.run_results.values()))

In [None]:
validation_result["success"]

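The `statistics` entry of the Validation Result summarizes how many Expectations were evaluated and how many of them succeeded:

```
"statistics": {
    "evaluated_expectations": 3,
    "successful_expectations": 2,
    "unsuccessful_expectations": 1,
    "success_percent": 66.66666666666666
},
```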

In [None]:
expectations_run = validation_result["statistics"]["evaluated_expectations"]
expectations_failed = validation_result["statistics"]["unsuccessful_expectations"]

print(
    f"{expectations_run} Expectations were run, {expectations_failed} Expectations failed."
)

In [None]:
# Collect the results of every Expectation that did not succeed.
failed_expectations = []
for result in validation_result["results"]:
    if not result["success"]:
        failed_expectations.append(result)
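The filtering step can be tried without running GX. The sketch below uses a hand-built dictionary, `sample_result`, as a hypothetical stand-in for a Validation Result; it contains only the keys accessed in this cookbook, and the helper name `failed_expectation_types` is illustrative.

```python
# Hypothetical stand-in for a Validation Result, limited to the keys
# that this cookbook reads.
sample_result = {
    "success": False,
    "statistics": {
        "evaluated_expectations": 3,
        "successful_expectations": 2,
        "unsuccessful_expectations": 1,
    },
    "results": [
        {
            "success": True,
            "expectation_config": {
                "type": "expect_table_columns_to_match_ordered_list"
            },
        },
        {
            "success": False,
            "expectation_config": {"type": "expect_column_values_to_be_between"},
        },
        {
            "success": True,
            "expectation_config": {
                "type": "expect_column_pair_values_a_to_be_greater_than_b"
            },
        },
    ],
}


def failed_expectation_types(result):
    """Return the type of every Expectation that did not succeed."""
    return [
        r["expectation_config"]["type"]
        for r in result["results"]
        if not r["success"]
    ]


failed_types = failed_expectation_types(sample_result)
print(failed_types)
```

The number of failed types should agree with `unsuccessful_expectations` in the statistics, which is a quick consistency check on the filter.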

## Pull out bad rows

In [None]:
failed_expectation = [
    x
    for x in validation_result["results"]
    if x["expectation_config"]["type"] == "expect_column_values_to_be_between"
][0]
failed_expectation

In [None]:
# Each entry in the unexpected index list identifies a failing row by product_id.
bad_product_ids = [
    x["product_id"] for x in failed_expectation["result"]["unexpected_index_list"]
]
bad_product_ids
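When more than one Expectation fails, the failing IDs can be gathered across all of them. The following is a minimal sketch that assumes each `unexpected_index_list` entry is a dictionary containing the ID column, as in the cell above. The helper name `collect_unexpected_ids` and the sample data are hypothetical.

```python
def collect_unexpected_ids(results, id_column):
    """Gather unique ID values from every failed result, in first-seen order."""
    seen = {}
    for result in results:
        if result["success"]:
            continue
        # Table-level checks may not report row indexes, so default to empty.
        for entry in result["result"].get("unexpected_index_list", []):
            seen.setdefault(entry[id_column], None)
    return list(seen)


# Hypothetical results: one passing check, one failed table-level check
# without row indexes, and two failed row-level checks sharing product 40.
sample_results = [
    {"success": True, "result": {}},
    {"success": False, "result": {}},
    {
        "success": False,
        "result": {
            "unexpected_index_list": [
                {"product_id": 12, "unit_price_usd": 0.5},
                {"product_id": 40, "unit_price_usd": 0.9},
            ]
        },
    },
    {
        "success": False,
        "result": {
            "unexpected_index_list": [
                {"product_id": 40, "unit_price_usd": 0.9},
                {"product_id": 7, "unit_price_usd": 3.0},
            ]
        },
    },
]

all_bad_ids = collect_unexpected_ids(sample_results, "product_id")
print(all_bad_ids)
```

Using a dictionary to track seen IDs removes duplicates while keeping the order in which failures were first reported.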

In [None]:
# Pull out bad rows from original product dataset.
df_products[df_products["product_id"].isin(bad_product_ids)]

In [None]:
# Drop the bad rows.
df_products_validated = df_products.drop(
    df_products[df_products["product_id"].isin(bad_product_ids)].index
).reset_index(drop=True)

df_products_validated
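Dropping rows discards the record of what failed. One possible alternative, sketched here with hypothetical data, is to split the dataframe into a passing set and a quarantined set so that failing rows can be reviewed later. This is an illustration only and is not necessarily what the associated Airflow pipeline does; `df`, `bad_ids`, `df_valid`, and `df_quarantined` are illustrative names.

```python
import pandas as pd

# Hypothetical data standing in for df_products and bad_product_ids.
df = pd.DataFrame(
    {
        "product_id": [1, 2, 3, 4],
        "unit_price_usd": [5.00, 0.75, 8.00, 0.10],
    }
)
bad_ids = [2, 4]

# Boolean mask marking the rows that failed validation.
is_bad = df["product_id"].isin(bad_ids)

# Rows that passed continue through the pipeline.
df_valid = df[~is_bad].reset_index(drop=True)

# Rows that failed are kept aside for inspection instead of being discarded.
df_quarantined = df[is_bad].reset_index(drop=True)

print(df_valid["product_id"].tolist())
print(df_quarantined["product_id"].tolist())
```

Because both dataframes come from the same mask, every input row lands in exactly one of the two outputs.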

In [None]:
(
    products_validation_result,
    product_category_validation_result,
    product_subcategory_validation_result,
) = tutorial.cookbook2.validate_product_data(
    df_products, df_product_categories, df_product_subcategories
)