# Integrate Data Validation Into Your Pipeline



In [1]:
# Prep environment and logging

import json
import great_expectations as ge
import great_expectations.jupyter_ux
from great_expectations.datasource.types import BatchKwargs
from datetime import datetime

great_expectations.jupyter_ux.setup_notebook_logging()



2019-10-08T18:51:11-0400 - INFO - Great Expectations logging enabled at INFO level by JupyterUX module.


## Integrate data validation into your pipeline

[**Watch a short tutorial video**](https://docs.greatexpectations.io/en/latest/getting_started/pipeline_integration.html?utm_source=notebook&utm_medium=integrate_validation#video)


[**Read more in the tutorial**](https://docs.greatexpectations.io/en/latest/getting_started/pipeline_integration.html?utm_source=notebook&utm_medium=integrate_validation)

**Reach out for help on** [**Great Expectations Slack**](https://greatexpectations.io/slack)




### Get a DataContext object


In [2]:
context = ge.data_context.DataContext()

### Get a pipeline run id

[Read more in the tutorial](https://docs.greatexpectations.io/en/latest/getting_started/pipeline_integration.html?utm_source=notebook&utm_medium=integrate_validation#set-a-run-id)


In [3]:
# Generate a run id - a pipeline run id, a timestamp or any other string that is meaningful to you 
# and will help you refer to validation results. We recommend they be chronologically sortable.
run_id = datetime.utcnow().isoformat().replace(":", "") + "Z"
run_id

'2019-10-08T225116.415513Z'
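
One caveat on sortability: `isoformat()` drops the fractional part when the microsecond field is exactly zero, so ids built that way can differ in width and then sort out of chronological order. A minimal sketch of a fixed-width alternative follows; `make_run_id` is a hypothetical helper, not part of Great Expectations.

```python
from datetime import datetime, timezone


def make_run_id(moment):
    """Format a datetime as a fixed-width run id (hypothetical helper)."""
    # %f always emits six digits, so every id has the same length and
    # plain string comparison matches chronological order.
    return moment.strftime("%Y-%m-%dT%H%M%S.%fZ")


# Timestamps in increasing chronological order, including two that fall
# exactly on a whole second (microsecond == 0).
moments = [
    datetime(2019, 10, 8, 22, 51, 16, 0),
    datetime(2019, 10, 8, 22, 51, 16, 415513),
    datetime(2019, 10, 8, 22, 51, 17, 0),
]
run_ids = [make_run_id(m) for m in moments]

# An id for the current moment, built from an explicit UTC clock reading.
current_run_id = make_run_id(datetime.now(timezone.utc))
```

Because every id has the same length, sorting the strings gives the same order as sorting the timestamps.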

### Choose data asset name and expectation suite name

[Read more in the tutorial](https://docs.greatexpectations.io/en/latest/getting_started/pipeline_integration.html?utm_source=notebook&utm_medium=integrate_validation#choose-data-asset-and-expectation-suite)


In [None]:
great_expectations.jupyter_ux.list_available_data_asset_names(context)

In [4]:
data_asset_name = 'demo__dir/default/npidata_pfile' # TODO: replace with your value!
expectation_suite_name = "warning" # TODO: replace with your value!

### Obtain the batch to validate

Learn about `get_batch` in [this tutorial](https://docs.greatexpectations.io/en/latest/getting_started/pipeline_integration.html?utm_source=notebook&utm_medium=integrate_validation#obtain-a-batch-to-validate)



##### If your pipeline processes Pandas DataFrames:

```
import pandas as pd
df = pd.read_csv(file_path_to_validate)
batch = context.get_batch(data_asset_name, expectation_suite_name, BatchKwargs(df=df))
batch.head()
```

##### If your pipeline processes Spark DataFrames:
```
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset
spark = SparkSession.builder.getOrCreate()
df = SparkDFDataset(spark.read.csv(file_path_to_validate))
batch = context.get_batch(data_asset_name, expectation_suite_name, BatchKwargs(df=df))
batch.head()
```

##### If your pipeline processes SQL queries:

* A. To validate an existing table:

```
data_asset_name = 'USE THE TABLE NAME'
batch = context.get_batch(data_asset_name,
                        expectation_suite_name=expectation_suite_name,
                        batch_kwargs=BatchKwargs(table=data_asset_name))
batch.head()
```

* B. To validate a query result set:

```
data_asset_name = 'USE THE NAME YOU SPECIFIED WHEN YOU CREATED THE EXPECTATION SUITE FOR THIS QUERY'
batch = context.get_batch(data_asset_name,
                        expectation_suite_name=expectation_suite_name,
                        batch_kwargs=BatchKwargs(query='SQL FOR YOUR QUERY'))
batch.head()
```
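
The variants above differ only in the keyword handed to `BatchKwargs` (`df`, `table`, `query`, or `path`). As a small sketch, a pipeline that supports several source types could centralize that choice in one place. `build_batch_kwargs` is a hypothetical helper that returns a plain dict, which you would then unpack as `BatchKwargs(**kwargs)`.

```python
# Keywords that appear in the examples in this notebook; each names one
# kind of data source.
SUPPORTED_SOURCES = ("df", "table", "query", "path")


def build_batch_kwargs(**source):
    """Return a dict describing exactly one data source (hypothetical helper)."""
    if len(source) != 1:
        raise ValueError("pass exactly one of: " + ", ".join(SUPPORTED_SOURCES))
    (kind, value), = source.items()
    if kind not in SUPPORTED_SOURCES:
        raise ValueError("unknown source kind: {0}".format(kind))
    return {kind: value}


table_kwargs = build_batch_kwargs(table="USE THE TABLE NAME")
query_kwargs = build_batch_kwargs(query="SQL FOR YOUR QUERY")
```

Rejecting zero or multiple sources early keeps a misconfigured pipeline from validating the wrong data.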


In [8]:
batch = context.get_batch(data_asset_name=data_asset_name, 
                        expectation_suite_name=expectation_suite_name,
                        batch_kwargs=BatchKwargs(path='/data/demo/npidata_pfile/npidata_pfile_20050523-20190908_0.csv'))
batch.head()


2019-10-08T18:53:35-0400 - INFO - 	3 expectation(s) included in expectation_suite.


Unnamed: 0,NPI,Entity Type Code,Replacement NPI,Employer Identification Number (EIN),Provider Organization Name (Legal Business Name),Provider Last Name (Legal Name),Provider First Name,Provider Middle Name,Provider Name Prefix Text,Provider Name Suffix Text,...,Healthcare Provider Taxonomy Group_6,Healthcare Provider Taxonomy Group_7,Healthcare Provider Taxonomy Group_8,Healthcare Provider Taxonomy Group_9,Healthcare Provider Taxonomy Group_10,Healthcare Provider Taxonomy Group_11,Healthcare Provider Taxonomy Group_12,Healthcare Provider Taxonomy Group_13,Healthcare Provider Taxonomy Group_14,Healthcare Provider Taxonomy Group_15
0,1679576722,1.0,,,,WIEBE,DAVID,A,,,...,,,,,,,,,,
1,1588667638,1.0,,,,PILCHER,WILLIAM,C,DR.,,...,,,,,,,,,,
2,1497758544,2.0,,<UNAVAIL>,"CUMBERLAND COUNTY HOSPITAL SYSTEM, INC",,,,,,...,,,,,,,,,,
3,1306849450,1.0,,,,SMITSON,HAROLD,LEROY,DR.,II,...,,,,,,,,,,
4,1215930367,1.0,,,,GRESSOT,LAURENT,,DR.,,...,,,,,,,,,,


### Validate the batch

`validate` is the "workhorse" method of Great Expectations. Call it in your pipeline code after loading the file and just before passing the data to your computation.

[Read more in the tutorial](https://docs.greatexpectations.io/en/latest/getting_started/pipeline_integration.html?utm_source=notebook&utm_medium=integrate_validation#validate)



In [9]:
validation_result = batch.validate(run_id=run_id)

if validation_result["success"]:
    print("This file meets all expectations from a valid batch of {0:s}".format(data_asset_name))
else:
    print("This file is not a valid batch of {0:s}".format(data_asset_name))


2019-10-08T18:53:57-0400 - INFO - 	3 expectation(s) included in expectation_suite.
This file is not a valid batch of demo__dir/default/npidata_pfile
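
In a real pipeline you will usually want to stop rather than print a message when validation fails. The sketch below shows one way to do that, assuming the result dictionary has the `success`, `results`, and `expectation_config` keys visible in this notebook's output. `ValidationFailed` and `require_valid` are hypothetical names, and the sample result is made up for illustration.

```python
class ValidationFailed(Exception):
    """Raised when a batch does not meet its expectation suite."""


def require_valid(validation_result, data_asset_name):
    """Raise ValidationFailed unless the result reports overall success."""
    if validation_result["success"]:
        return
    # Collect the types of the expectations that did not pass.
    failed = [
        r["expectation_config"]["expectation_type"]
        for r in validation_result.get("results", [])
        if not r["success"]
    ]
    raise ValidationFailed(
        "{0} failed {1} expectation(s): {2}".format(
            data_asset_name, len(failed), ", ".join(failed)
        )
    )


# Stand-in result with the same keys as the output of batch.validate();
# the values are invented for this example.
sample_result = {
    "success": False,
    "results": [
        {"success": True,
         "expectation_config": {"expectation_type": "expect_column_to_exist"}},
        {"success": False,
         "expectation_config": {
             "expectation_type": "expect_column_values_to_not_be_null"}},
    ],
}

try:
    require_valid(sample_result, "demo__dir/default/npidata_pfile")
    outcome = "passed"
except ValidationFailed as error:
    outcome = str(error)
```

In pipeline code you would call `require_valid(validation_result, data_asset_name)` right after `batch.validate(...)` and let the exception halt the run.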


### Review the validation results

[Read more in the tutorial](https://docs.greatexpectations.io/en/latest/getting_started/pipeline_integration.html?utm_source=notebook&utm_medium=integrate_validation#review-validation-results)


In [10]:
print(json.dumps(validation_result, indent=4))

{
    "results": [
        {
            "success": true,
            "expectation_config": {
                "expectation_type": "expect_column_to_exist",
                "kwargs": {
                    "column": "Provider Other Organization Name Type Code"
                }
            },
            "exception_info": {
                "raised_exception": false,
                "exception_message": null,
                "exception_traceback": null
            }
        },
        {
            "success": true,
            "result": {
                "observed_value": [
                    3.0,
                    4.0,
                    5.0
                ],
                "element_count": 9999,
                "missing_count": 9172,
                "missing_percent": 0.9172917291729173
            },
            "expectation_config": {
                "expectation_type": "expect_column_distinct_values_to_be_in_set",
                "kwargs": {
                    "column": "Provi

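The full JSON dump gets long quickly. As a rough sketch, a short summary can be computed from the same dictionary, assuming each entry in `results` carries the `success`, `expectation_config`, and `exception_info` keys shown above. `summarize` is a hypothetical helper, and the sample data is invented.

```python
def summarize(validation_result):
    """Count passing, failing, and erroring expectations in a result dict."""
    summary = {"passed": 0, "failed": 0, "errored": 0, "failed_columns": []}
    for result in validation_result["results"]:
        if result.get("exception_info", {}).get("raised_exception"):
            # The expectation could not be evaluated at all.
            summary["errored"] += 1
        elif result["success"]:
            summary["passed"] += 1
        else:
            summary["failed"] += 1
            column = result["expectation_config"]["kwargs"].get("column")
            if column is not None:
                summary["failed_columns"].append(column)
    return summary


# Invented sample with the same shape as the notebook's validation output.
sample_result = {
    "results": [
        {"success": True,
         "expectation_config": {"expectation_type": "expect_column_to_exist",
                                "kwargs": {"column": "NPI"}},
         "exception_info": {"raised_exception": False}},
        {"success": False,
         "expectation_config": {
             "expectation_type": "expect_column_values_to_not_be_null",
             "kwargs": {"column": "Entity Type Code"}},
         "exception_info": {"raised_exception": False}},
    ],
}

summary = summarize(sample_result)
```
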
### Validation Operators

The `validate` method evaluates one batch of data against one expectation suite and returns a dictionary of validation results. This is sufficient when you explore your data and get to know Great Expectations.
When deploying Great Expectations in a real data pipeline, you will typically discover additional needs:

* validating a group of batches that are logically related
* validating a batch against several expectation suites
* doing something with the validation results (e.g., saving them for later review or sending notifications when a validation fails).

Validation Operators provide a convenient abstraction that bundles the validation of multiple expectation suites with the actions to take after validation.

[Read more about Validation Operators](https://docs.greatexpectations.io/en/latest/features/validation_operators_and_actions.html?utm_source=notebook&utm_medium=integrate_validation)
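
To make the idea concrete, here is a conceptual sketch in plain Python of what such an abstraction does: validate several batches, then run a list of actions on each result. It does not use or mirror the Great Expectations implementation. `run_action_list`, `no_missing_values`, `store_result`, and `notify_on_failure` are hypothetical names, and the batches are toy lists.

```python
def run_action_list(batches, validate, actions):
    """Validate each named batch, then pass every result to every action."""
    results = {}
    for name, batch in batches.items():
        result = validate(batch)
        results[name] = result
        for action in actions:
            action(name, result)
    return results


def no_missing_values(batch):
    """Stand-in validator: succeed only if the batch has no None entries."""
    return {"success": all(value is not None for value in batch)}


stored = {}   # plays the role of a validation result store
alerts = []   # plays the role of a notification channel


def store_result(name, result):
    stored[name] = result


def notify_on_failure(name, result):
    if not result["success"]:
        alerts.append("validation failed for " + name)


results = run_action_list(
    batches={"monday": [1, 2, 3], "tuesday": [4, None, 6]},
    validate=no_missing_values,
    actions=[store_result, notify_on_failure],
)
```

Keeping the actions in a list means new behavior (for example, another notification channel) can be added without touching the validation loop.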




In [11]:
# This is an example of invoking a validation operator that is configured by default in the great_expectations.yml file

results = context.run_validation_operator(
    assets_to_validate=[batch],
    run_identifier=run_id,
    validation_operator_name="action_list_operator",
)

results

2019-10-08T18:56:16-0400 - INFO - 	3 expectation(s) included in expectation_suite.


{{'expectation_suite_identifier': {'data_asset_name': demo__dir/default/npidata_pfile,
  'run_id': '2019-10-08T225116.415513Z'}: {'validation_result': {'results': [{'success': True,
     'expectation_config': {'expectation_type': 'expect_column_to_exist',
      'kwargs': {'column': 'Provider Other Organization Name Type Code'}},
     'exception_info': {'raised_exception': False,
      'exception_message': None,
      'exception_traceback': None}},
    {'success': True,
     'result': {'observed_value': [3.0, 4.0, 5.0],
      'element_count': 9999,
      'missing_count': 9172,
      'missing_percent': 0.9172917291729173},
     'expectation_config': {'expectation_type': 'expect_column_distinct_values_to_be_in_set',
      'kwargs': {'column': 'Provider Other Organization Name Type Code',
       'value_set': [3.0, 4.0, 5.0]}},
     'exception_info': {'raised_exception': False,
      'exception_message': None,
      'exception_traceback': None}},
    {'success': False,
     'result': {'obse