# Integrate Data Validation Into Your Pipeline



In [None]:
# Prep environment and logging

import json
import os
import logging
import great_expectations as ge
import great_expectations.jupyter_ux
import pandas as pd
import uuid # used to generate run_id
from datetime import datetime

import tzlocal

great_expectations.jupyter_ux.setup_notebook_logging()



## How is data validation integrated into a pipeline?

To continue the example from the previous notebook: 
you created expectations for the data asset "orders". By doing this, 
you defined what a valid orders file should look like.

Once your pipeline is deployed, it will process new orders files as they arrive.

Just before calling the method that does the computation on a new file, call Great Expectations' 
validate method to make sure that the file meets your expectations about 
what a valid orders file should look like.
If the file does not pass validation, you can decide what to do, e.g., stop the pipeline, since its output on invalid input cannot be guaranteed.
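
One way to act on a failed validation is to wrap the check in a small gate that raises before the computation runs. The sketch below is only an illustration: `run_step`, `InvalidBatchError`, and the `validate_fn` / `compute_fn` parameters are hypothetical names, not part of Great Expectations. It assumes the validation result is a dict with a boolean `"success"` key, as used later in this notebook.

```python
class InvalidBatchError(Exception):
    """Raised when a batch fails validation (hypothetical helper, not a GE class)."""


def run_step(batch, validate_fn, compute_fn):
    """Validate `batch` and run `compute_fn` only if validation succeeds.

    `validate_fn` is assumed to return a dict with a boolean "success" key.
    """
    result = validate_fn(batch)
    if not result["success"]:
        # Stop here: output computed from invalid input cannot be trusted.
        raise InvalidBatchError("batch failed validation; computation skipped")
    return compute_fn(batch)
```

In a pipeline, `validate_fn` could be a thin wrapper around the validate call shown later in this notebook. Raising an exception is only one choice; logging the failure and skipping the file would fit the same shape.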


To run validation you need two things: 
* Something to validate - in our case, a file loaded into a Pandas DataFrame (or a Spark DataFrame, if your pipeline is built on Spark)
* Expectations to validate against - if you pass the name of the data asset for which you created expectations ("orders" in our example) to the validate method, Great Expectations will fetch the file that contains your expectations. 



### Create a DataContext object

Just like in the previous notebook where you created expectations, we need to create a `DataContext` object that represents Great Expectations in your data pipeline.


In [None]:
# context = ge.data_context.DataContext('../../', expectation_explorer=True)
context = ge.data_context.DataContext('../../', expectation_explorer=False)

### Choose the data asset name

In [None]:
context.list_expectations_configs() # list the expectations configs available in this project

In [None]:
data_asset_name = "orders" # TODO: replace with your value!

### Load a file for validation

Set `file_path_to_validate` below to the full path of the file you want to validate.

In [None]:
file_path_to_validate = None # TODO: replace None with your file path


#### Uncomment the following if you are using Pandas

In [None]:
# df = pd.read_csv(file_path_to_validate)
# df.head()

#### Uncomment the following if you are using Spark

In [None]:
# from pyspark.sql import SparkSession
# from great_expectations.dataset import SparkDFDataset
# spark = SparkSession.builder.getOrCreate()
# df = SparkDFDataset(spark.read.csv(file_path_to_validate))
# df.spark_df.show()

### Pipeline run id

Since your pipeline will validate batches before every run, you should pass a run id to `validate`.

In this notebook we will just use a generated UUID as the run id.
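
Outside a notebook, a run id that sorts by time can be easier to work with than a bare UUID. The following is a minimal sketch under that assumption: `make_run_id` and its format are hypothetical, and Great Expectations does not require any particular run id format here.

```python
import uuid
from datetime import datetime, timezone


def make_run_id(now=None):
    """Build a sortable run id: UTC timestamp plus a short random suffix."""
    now = now or datetime.now(timezone.utc)
    # The random suffix keeps ids distinct if two runs start in the same second.
    return "{}-{}".format(now.strftime("%Y%m%dT%H%M%SZ"), uuid.uuid4().hex[:8])
```

If your scheduler or orchestrator already assigns an id to each run, passing that id through instead may make results easier to trace.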

In [None]:
# Generate a run-id that GE will use to key shared parameters
run_id = str(uuid.uuid1())


### Validate the file

This is the "workhorse" method of Great Expectations. Call it in your pipeline code after loading the file and just before passing it to your computation.



In [None]:
validation_result = ge.validate(df,
          data_context=context, # Great Expectations context for your project
          data_asset_name=data_asset_name, # data asset name that corresponds to a collection of expectations
          run_id=run_id
          )

if validation_result["success"]:
    print("This file meets all expectations for a valid batch of {0:s}".format(data_asset_name))
else:
    print("This file is not a valid batch of {0:s}".format(data_asset_name))


### What a validation result looks like

The result object will have an element for each expectation defined for this data asset.
If the batch did not pass an expectation, the element will have additional data (percentage of non-conforming records, examples of non-conforming values, etc.) 
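
To pick out only the expectations that failed, you can walk the per-expectation elements. This is a sketch under assumptions: it presumes the result is a dict with a `"results"` list whose elements carry a boolean `"success"` key and an `"expectation_config"` dict, which may differ between Great Expectations versions. `summarize_failures` is a hypothetical helper, not a library function.

```python
def summarize_failures(validation_result):
    """Return (expectation_type, kwargs) for every expectation that failed.

    Assumes each element of validation_result["results"] has a boolean
    "success" key and an "expectation_config" dict.
    """
    failures = []
    for element in validation_result.get("results", []):
        if element.get("success"):
            continue
        config = element.get("expectation_config", {})
        failures.append((config.get("expectation_type"), config.get("kwargs", {})))
    return failures
```

Calling `summarize_failures(validation_result)` on a passing batch would return an empty list, which makes it convenient for building a short notification message.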


In [None]:
print(json.dumps(validation_result, indent=4))

### Finishing touches - notifications and saving validation results and validated batches

#### Notifications
You want to be notified when the pipeline validates a batch, especially when the validation fails.

Great Expectations provides a Slack integration. To enable it, uncomment the `result_callback` section in your project's great_expectations.yml file and enter your Slack webhook URL (see https://api.slack.com/incoming-webhooks).

#### Saving validation results

To enable the storing of validation results, uncomment the `result_store` section in your great_expectations.yml file. There are 2 options - local (see `filesystem` key) and on S3 (see `s3` key).

This option will ensure that results of all validations are stored (both for batches that passed validation and those that did not meet our expectations).
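
If you would rather persist results yourself than rely on `result_store`, a hand-rolled version might look like the sketch below. It is not how Great Expectations stores results internally: `save_validation_result`, the directory layout, and the file naming are all assumptions made for illustration.

```python
import json
from pathlib import Path


def save_validation_result(validation_result, base_dir, data_asset_name, run_id):
    """Write one validation result as JSON to base_dir/<run_id>/<data_asset_name>.json."""
    target_dir = Path(base_dir) / run_id
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "{}.json".format(data_asset_name)
    with target.open("w") as f:
        json.dump(validation_result, f, indent=4)
    # Return the path so the caller can log it or attach it to a notification.
    return target
```

Keying the directory on the run id keeps the results of one pipeline run together, for passing and failing batches alike.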

#### Saving failed batches

When a batch fails validation (it does not pass all the expectations of the data asset), it is useful to save the batch along with the validation results for future review. You can enable this option in your project's great_expectations.yml file. There are 2 options - local (see `filesystem` key) and on S3 (see `s3` key).

