# Validate data during ingestion (happy path)

This cookbook showcases a sample data validation workflow typical of data ingestion at the start of a data pipeline. Data is loaded into a pandas DataFrame, explored, cleaned, and then validated before ingestion into a relational database table.

This cookbook explores the validation workflow first in a notebook setting, then embedded within an Airflow pipeline.

## Library import and constant definition

In [1]:
import pathlib

import great_expectations as gx
import great_expectations.expectations as gxe
import pandas as pd

import tutorial_code as tutorial

In [2]:
DATA_DIR = pathlib.Path("/cookbooks/data/raw")

## Load and explore sample data

In this tutorial, you will explore and clean the customers dataset.

In [3]:
df_customers_raw = pd.read_csv(DATA_DIR / "customers.csv", encoding="unicode_escape")

In [4]:
df_customers_raw.head()

,CustomerKey,Gender,Name,City,State Code,State,Zip Code,Country,Continent,Birthday
0,301,Female,Lilly Harding,WANDEARAH EAST,SA,South Australia,5523,Australia,Australia,7/3/1939
1,325,Female,Madison Hull,MOUNT BUDD,WA,Western Australia,6522,Australia,Australia,9/27/1979
2,554,Female,Claire Ferres,WINJALLOK,VIC,Victoria,3380,Australia,Australia,5/26/1947
3,786,Male,Jai Poltpalingada,MIDDLE RIVER,SA,South Australia,5223,Australia,Australia,9/17/1957
4,1042,Male,Aidan Pankhurst,TAWONGA SOUTH,VIC,Victoria,3698,Australia,Australia,11/19/1965


Check the column types of the raw data against the definition of the Postgres table.

In [5]:
df_customers_raw.dtypes

CustomerKey     int64
Gender         object
Name           object
City           object
State Code     object
State          object
Zip Code       object
Country        object
Continent      object
Birthday       object
dtype: object

In [6]:
df_customers = tutorial.cookbook1.clean_customer_data(df_customers_raw)

print(df_customers.dtypes)
df_customers.head()

customer_id             int64
name                   object
dob            datetime64[ns]
city                   object
state                  object
zip                    object
country                object
dtype: object


,customer_id,name,dob,city,state,zip,country
0,301,Lilly Harding,1939-07-03,Wandearah East,SA,5523,AU
1,325,Madison Hull,1979-09-27,Mount Budd,WA,6522,AU
2,554,Claire Ferres,1947-05-26,Winjallok,VIC,3380,AU
3,786,Jai Poltpalingada,1957-09-17,Middle River,SA,5223,AU
4,1042,Aidan Pankhurst,1965-11-19,Tawonga South,VIC,3698,AU
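
The body of `tutorial.cookbook1.clean_customer_data` is not shown in this cookbook. The following is a minimal sketch of a cleaning step that would produce the same columns and formats as the output above. The function name `clean_customer_data_sketch` and the `COUNTRY_CODES` mapping are hypothetical, and only the `Australia` to `AU` entry is taken from the sample rows.

```python
import pandas as pd

# Assumption: only "Australia" -> "AU" is visible in the sample output;
# a real mapping would need an entry for every country in the dataset.
COUNTRY_CODES = {"Australia": "AU"}


def clean_customer_data_sketch(df_raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for tutorial.cookbook1.clean_customer_data."""
    return pd.DataFrame(
        {
            "customer_id": df_raw["CustomerKey"],
            "name": df_raw["Name"],
            # Birthday arrives as month/day/year text, e.g. "7/3/1939".
            "dob": pd.to_datetime(df_raw["Birthday"], format="%m/%d/%Y"),
            # City names arrive upper-cased; convert them to title case.
            "city": df_raw["City"].str.title(),
            "state": df_raw["State Code"],
            "zip": df_raw["Zip Code"].astype(str),
            "country": df_raw["Country"].map(COUNTRY_CODES),
        }
    )


# One raw row copied from the sample output above.
sample_raw = pd.DataFrame(
    {
        "CustomerKey": [301],
        "Gender": ["Female"],
        "Name": ["Lilly Harding"],
        "City": ["WANDEARAH EAST"],
        "State Code": ["SA"],
        "State": ["South Australia"],
        "Zip Code": ["5523"],
        "Country": ["Australia"],
        "Continent": ["Australia"],
        "Birthday": ["7/3/1939"],
    }
)
sample_clean = clean_customer_data_sketch(sample_raw)
```

Countries missing from the mapping would come out as `NaN` in this sketch, which the `country` value-set check later in the cookbook would be expected to catch.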


## GX validation workflow

### Validate data interactively with a single Expectation

In [7]:
context = gx.get_context()

# Create Data Source, Data Asset, Batch Definition, and Batch.
data_source = context.data_sources.add_pandas("pandas")
data_asset = data_source.add_dataframe_asset(name="customer data")

batch_definition = data_asset.add_batch_definition_whole_dataframe("batch definition")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df_customers})

# Create Expectation.
expectation = gxe.ExpectTableColumnsToMatchOrderedList(
    column_list=["customer_id", "name", "dob", "city", "state", "zip", "country"]
)

# Validate Batch using Expectation.
validation_result = batch.validate(expectation)

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

In [8]:
type(validation_result)

great_expectations.core.expectation_validation_result.ExpectationValidationResult

In [9]:
gx.core.expectation_validation_result.ExpectationValidationResult

great_expectations.core.expectation_validation_result.ExpectationValidationResult

### Validate data interactively with an Expectation Suite

In [10]:
# look at validation result

In [11]:
# Create Expectation Suite.
EXPECTATION_SUITE_NAME = "customer expectations"

try:
    expectation_suite = context.suites.add(gx.ExpectationSuite(name=EXPECTATION_SUITE_NAME))
except Exception:
    # Assumes the add failed because a suite with this name already exists:
    # delete the old suite, then create it again.
    context.suites.delete(name=EXPECTATION_SUITE_NAME)
    expectation_suite = context.suites.add(gx.ExpectationSuite(name=EXPECTATION_SUITE_NAME))


expectations = [
    gxe.ExpectTableColumnsToMatchOrderedList(column_list=["customer_id", "name", "dob", "city", "state", "zip", "country"]),
    gxe.ExpectColumnValuesToBeOfType(column="customer_id", type_="int"),
    *[gxe.ExpectColumnValuesToBeOfType(column=x, type_="str") for x in ["name", "city", "state", "zip"]],
    gxe.ExpectColumnValuesToMatchRegex(column="dob", regex=r"^\d{4}-\d{2}-\d{2}$"),
    gxe.ExpectColumnValuesToBeInSet(column="country", value_set=["AU", "CA", "DE", "FR", "GB", "IT", "NL", "US"])
]

for expectation in expectations:
    expectation_suite.add_expectation(expectation)

# Validate Batch using Expectation Suite.
validation_result = batch.validate(expectation_suite)

validation_result["success"]

Calculating Metrics:   0%|          | 0/48 [00:00<?, ?it/s]

True
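
To make the intent of the suite concrete, here is a rough plain-Python sketch of the same checks applied to a single row represented as a dict. It is an illustration only, not how GX computes its metrics; `check_row` and the sample rows are hypothetical.

```python
import re

EXPECTED_COLUMNS = ["customer_id", "name", "dob", "city", "state", "zip", "country"]
ALLOWED_COUNTRIES = {"AU", "CA", "DE", "FR", "GB", "IT", "NL", "US"}
DOB_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_row(row: dict) -> list[str]:
    """Return a list of human-readable problems found in one row."""
    # Column names must match the expected list, in order.
    if list(row.keys()) != EXPECTED_COLUMNS:
        return ["columns do not match the expected ordered list"]

    problems = []
    if not isinstance(row["customer_id"], int):
        problems.append("customer_id is not an int")
    for column in ("name", "city", "state", "zip"):
        if not isinstance(row[column], str):
            problems.append(f"{column} is not a str")
    # The date of birth must render as YYYY-MM-DD.
    if not DOB_PATTERN.match(str(row["dob"])):
        problems.append("dob does not match YYYY-MM-DD")
    if row["country"] not in ALLOWED_COUNTRIES:
        problems.append("country is not in the allowed set")
    return problems


good_row = {
    "customer_id": 301, "name": "Lilly Harding", "dob": "1939-07-03",
    "city": "Wandearah East", "state": "SA", "zip": "5523", "country": "AU",
}
# Same row with an unparsed birthday and an unmapped country.
bad_row = {**good_row, "dob": "7/3/1939", "country": "Australia"}

good_problems = check_row(good_row)
bad_problems = check_row(bad_row)
```

The clean row yields no problems, while the second row fails both the date format check and the country check.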

In [12]:
type(validation_result)

great_expectations.core.expectation_validation_result.ExpectationSuiteValidationResult
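
When a suite does not pass, it helps to know which Expectations failed. The helper below is a small sketch that assumes the suite result exposes a `results` list whose items carry `success` and `expectation_config.type`, as in GX 1.x; verify these attribute names against your installed version. The `SimpleNamespace` objects are hypothetical stand-ins so the snippet runs without GX.

```python
from types import SimpleNamespace


def failed_expectation_types(suite_result) -> list[str]:
    """Collect the type names of the Expectations that did not pass."""
    return [
        result.expectation_config.type
        for result in suite_result.results
        if not result.success
    ]


# Hypothetical stand-in shaped like a suite validation result.
fake_suite_result = SimpleNamespace(
    success=False,
    results=[
        SimpleNamespace(
            success=True,
            expectation_config=SimpleNamespace(
                type="expect_table_columns_to_match_ordered_list"
            ),
        ),
        SimpleNamespace(
            success=False,
            expectation_config=SimpleNamespace(
                type="expect_column_values_to_be_in_set"
            ),
        ),
    ],
)

failed = failed_expectation_types(fake_suite_result)
```

In the notebook, the same helper could be called as `failed_expectation_types(validation_result)`, which should return an empty list for the passing run above.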

In [13]:
%pycat airflow_dags/cookbook1_ingest_customer_data.py

import datetime
import os
import pathlib

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

import tutorial_code as tutorial


def cookbook1_validate_and_ingest_to_postgres():

    DATA_DIR = pathlib.Path(os.getenv("AIRFLOW_HOME")) /

## Trigger the DAG