# Validate data during ingestion (happy path)

This cookbook showcases a sample data validation workflow typical of data ingestion at the start of a data pipeline. Data is loaded into a pandas DataFrame, explored, cleaned, and then validated before ingestion into a relational database table.

This cookbook explores the validation workflow first in a notebook setting, then embedded within an Airflow pipeline.

## Library import and constant definition

In [None]:
import pathlib

import great_expectations as gx
import great_expectations.expectations as gxe
import pandas as pd

import tutorial_code as tutorial

In [None]:
DATA_DIR = pathlib.Path("/cookbooks/data/raw")

## Load and explore sample data

In this tutorial, you will explore and clean the customers dataset.

In [None]:
df_customers_raw = pd.read_csv(DATA_DIR / "customers.csv", encoding="unicode_escape")

In [None]:
df_customers_raw.head()

Inspect the column types of the raw dataframe so they can be compared against the definition of the target Postgres table.

In [None]:
df_customers_raw.dtypes

In [None]:
df_customers = tutorial.cookbook1.clean_customer_data(df_customers_raw)

print(df_customers.dtypes)
df_customers.head()

## GX validation workflow

Validate data interactively with a single Expectation

In [None]:
context = gx.get_context()

# Create Data Source, Data Asset, Batch Definition, and Batch.
data_source = context.data_sources.add_pandas("pandas")
data_asset = data_source.add_dataframe_asset(name="customer data")

batch_definition = data_asset.add_batch_definition_whole_dataframe("batch definition")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df_customers})

# Create Expectation (uses the gxe alias imported above).
expectation = gxe.ExpectTableColumnsToMatchOrderedList(
    column_list=["customer_id", "name", "dob", "city", "state", "zip", "country"]
)

# Validate Batch using Expectation.
validation_result = batch.validate(expectation)
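
To make the ordered-list check above concrete, here is a minimal plain-Python sketch of the idea: the column names must match position by position, so a swap counts as a mismatch even when the set of names is identical. The helper `ordered_column_mismatches` and the `swapped` list are hypothetical names invented for illustration. This is not how GX implements the Expectation, and the shape of the returned tuples is an assumption.

```python
from itertools import zip_longest

EXPECTED_COLUMNS = ["customer_id", "name", "dob", "city", "state", "zip", "country"]


def ordered_column_mismatches(actual, expected):
    """Return (position, expected, found) for every position where the lists differ.

    Hypothetical helper for illustration; zip_longest pads the shorter list
    with None so missing or extra columns are reported as well.
    """
    return [
        (position, exp, found)
        for position, (exp, found) in enumerate(zip_longest(expected, actual))
        if exp != found
    ]


# Same names, same order: no mismatches.
same_order = list(EXPECTED_COLUMNS)
mismatches_same = ordered_column_mismatches(same_order, EXPECTED_COLUMNS)

# Same names, but "city" and "state" are swapped: two mismatched positions.
swapped = ["customer_id", "name", "dob", "state", "city", "zip", "country"]
mismatches_swapped = ordered_column_mismatches(swapped, EXPECTED_COLUMNS)

print(mismatches_same)
print(mismatches_swapped)
```

An ordered comparison suits ingestion into a relational table when the load step depends on column position. If only the presence of columns matters, a set-based comparison would be the looser alternative.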

In [None]:
type(validation_result)

In [None]:
gx.core.expectation_validation_result.ExpectationValidationResult

Validate data interactively with an Expectation Suite

In [None]:
# look at validation result

In [None]:
# Create Expectation Suite.
EXPECTATION_SUITE_NAME = "customer expectations"

try:
    expectation_suite = context.suites.add(gx.ExpectationSuite(name=EXPECTATION_SUITE_NAME))
except Exception:
    # A suite with this name already exists (for example, the cell was re-run):
    # delete it, then create a fresh one.
    context.suites.delete(name=EXPECTATION_SUITE_NAME)
    expectation_suite = context.suites.add(gx.ExpectationSuite(name=EXPECTATION_SUITE_NAME))


expectations = [
    gxe.ExpectTableColumnsToMatchOrderedList(column_list=["customer_id", "name", "dob", "city", "state", "zip", "country"]),
    gxe.ExpectColumnValuesToBeOfType(column="customer_id", type_="int"),
    *[gxe.ExpectColumnValuesToBeOfType(column=x, type_="str") for x in ["name", "city", "state", "zip"]],
    gxe.ExpectColumnValuesToMatchRegex(column="dob", regex=r"^\d{4}-\d{2}-\d{2}$"),
    gxe.ExpectColumnValuesToBeInSet(column="country", value_set=["AU", "CA", "DE", "FR", "GB", "IT", "NL", "US"])
]

for expectation in expectations:
    expectation_suite.add_expectation(expectation)

# Validate Batch using Expectation Suite.
validation_result = batch.validate(expectation_suite)

validation_result["success"]
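
As a rough illustration of what the `dob` regex and the `country` value set in the suite above accept and reject, the following stdlib-only sketch applies the same pattern and the same set to a few rows. The sample rows and the `check_row` helper are hypothetical and invented for this example. The sketch mimics the checks but is not GX's implementation.

```python
import re

# Same pattern and value set as in the Expectation Suite above.
DOB_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")
ALLOWED_COUNTRIES = {"AU", "CA", "DE", "FR", "GB", "IT", "NL", "US"}

# Hypothetical sample rows, invented for illustration only.
sample_rows = [
    {"dob": "1985-07-14", "country": "US"},  # passes both checks
    {"dob": "14/07/1985", "country": "US"},  # wrong date layout
    {"dob": "1990-13-45", "country": "BR"},  # right layout, impossible date, unlisted country
]


def check_row(row):
    """Apply the two checks to one row and report each outcome separately."""
    return {
        "dob_matches_regex": DOB_PATTERN.search(row["dob"]) is not None,
        "country_in_set": row["country"] in ALLOWED_COUNTRIES,
    }


results = [check_row(row) for row in sample_rows]
for row, result in zip(sample_rows, results):
    print(row, result)
```

Note that the third row's `dob` matches the pattern even though month 13 and day 45 are not a real date: the regex constrains the layout of the string, not its validity as a calendar date.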

In [None]:
type(validation_result)

In [None]:
%pycat airflow_dags/cookbook1_ingest_customer_data.py
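
The DAG source is not reproduced in this section. As a hedged sketch of the general pattern only (not the cookbook's DAG), a pipeline task can gate the load step on the validation outcome, such as the boolean returned by `validation_result["success"]`. The names `ValidationFailedError` and `ingest_if_valid` are hypothetical, and the load step is a stand-in callable.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class ValidationFailedError(Exception):
    """Hypothetical error raised when data does not pass validation."""


def ingest_if_valid(validation_success: bool, load: Callable[[], T]) -> T:
    """Run the load step only if validation succeeded; otherwise raise."""
    if not validation_success:
        raise ValidationFailedError("Validation failed; skipping ingestion.")
    return load()


# Happy path: validation passed, so the stand-in load step runs.
loaded = ingest_if_valid(True, lambda: "rows loaded")
print(loaded)
```

In an orchestrator such as Airflow, an uncaught exception raised inside a task marks that task as failed, so raising on failed validation keeps invalid data from reaching the downstream load.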

## Trigger the DAG