# GE validation demo




In [1]:
%matplotlib inline

from great_expectations.core.expectation_configuration import ExpectationConfiguration
from great_expectations.data_context.types.resource_identifiers import (
    ExpectationSuiteIdentifier,
)
from great_expectations.exceptions import DataContextError
import great_expectations as ge
from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource, check_if_datasource_name_exists
from ruamel.yaml import YAML


# 0. Get the project context


In [2]:
context = ge.get_context()

The project context is stored in **great_expectations.yml**. Loading the context reads this file. Saving a datasource adds a new section to it; expectation suites and checkpoints are written to the stores that the file configures.

# 1. Create a datasource
Unlike tdda, GE is a heavyweight framework: the source data must first be declared to the framework as a datasource before it can be validated.

In [3]:
# you can name it as you want
datasource_name = "ge_demo"

In [4]:

my_datasource_yaml = f"""
name: {datasource_name}
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: ../../../data
    default_regex:
      group_names:
        - data_asset_name
      pattern: (.*)
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name
"""
print(my_datasource_yaml)


name: ge_demo
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: ../../../data
    default_regex:
      group_names:
        - data_asset_name
      pattern: (.*)
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name



## 1.1 Test if your data source configuration is valid

In [5]:
context.test_yaml_config(yaml_config=my_datasource_yaml)

Attempting to instantiate class from config...
	Instantiating as a Datasource, since class_name is Datasource
	Successfully instantiated Datasource


ExecutionEngine class name: PandasExecutionEngine
Data Connectors:
	default_inferred_data_connector_name : InferredAssetFilesystemDataConnector

	Available data_asset_names (3 of 7):
		adult.csv (1 of 1): ['adult.csv']
		adult_cleaned.csv (1 of 1): ['adult_cleaned.csv']
		adult_eval.csv (1 of 1): ['adult_eval.csv']

	Unmatched data_references (0 of 0):[]

	default_runtime_data_connector_name:RuntimeDataConnector

	Available data_asset_names (0 of 0):
		Note : RuntimeDataConnector will not have data_asset_names until they are passed in through RuntimeBatchRequest

	Unmatched data_references (0 of 0): []



<great_expectations.datasource.new_datasource.Datasource at 0x7fe4a6d93100>

## 1.2 Save the datasource

In [7]:
sanitize_yaml_and_save_datasource(context, my_datasource_yaml, overwrite_existing=False)



## 1.3 List existing datasources

In [8]:
context.list_datasources()

[{'execution_engine': {'module_name': 'great_expectations.execution_engine',
   'class_name': 'PandasExecutionEngine'},
  'module_name': 'great_expectations.datasource',
  'name': 'ge_demo',
  'class_name': 'Datasource',
  'data_connectors': {'default_inferred_data_connector_name': {'module_name': 'great_expectations.datasource.data_connector',
    'base_directory': '../../../data',
    'class_name': 'InferredAssetFilesystemDataConnector',
    'default_regex': {'group_names': ['data_asset_name'], 'pattern': '(.*)'}},
   'default_runtime_data_connector_name': {'module_name': 'great_expectations.datasource.data_connector',
    'batch_identifiers': ['default_identifier_name'],
    'class_name': 'RuntimeDataConnector'}}}]

# 2. Create custom validation rules

In GE, a set of validation rules is called an **expectation suite**. We need to create an empty expectation suite, then add rules to it.

## 2.1 Create an expectation suite

The code below takes an **expectation suite name** as input. If a suite with that name already exists in the project, the code returns it. Otherwise, it creates an empty one.

In [9]:
expectation_suite_name = "ge_demo.rules"

try:
    suite = context.get_expectation_suite(expectation_suite_name=expectation_suite_name)
    print(
        f'Loaded ExpectationSuite "{suite.expectation_suite_name}" containing {len(suite.expectations)} expectations.'
    )
except DataContextError:
    suite = context.create_expectation_suite(
        expectation_suite_name=expectation_suite_name
    )
    print(f'Created ExpectationSuite "{suite.expectation_suite_name}".')

Loaded ExpectationSuite "ge_demo.rules" containing 6 expectations.


## 2.2 Add validation rules to the expectation suite

All supported expectations: https://greatexpectations.io/expectations/

# Validation rules

Table-level validation rules:
1. The table must have 32537 rows and 15 columns
2. The table can't have duplicate rows

Column-level validation rules:
1. Column age must be a number
2. Column age can't contain null values
3. Column age must have values between 0 and 120
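
Before expressing these rules as GE expectations, it can help to see what each one checks. The sketch below restates them in plain Python on a tiny made-up sample. It is not GE code: `check_rules`, its parameters, and the three sample rows are hypothetical, and the sample has two columns instead of fifteen.

```python
import re

INTEGER = re.compile(r"^[-+]?\d+$")

# Hypothetical sample; the real data lives in the CSV files of the datasource.
rows = [
    {"age": "39", "workclass": "State-gov"},
    {"age": "50", "workclass": "Private"},
    {"age": "39", "workclass": "State-gov"},  # exact duplicate of the first row
]


def check_rules(rows, expected_rows, expected_columns):
    """Return a dict mapping each rule name to True (passed) or False (failed)."""
    ages = [row.get("age") for row in rows]
    present = [a for a in ages if a not in (None, "")]
    numeric = [a for a in present if INTEGER.match(a)]
    # Rows are turned into hashable tuples so that duplicates can be counted.
    as_tuples = [tuple(sorted(row.items())) for row in rows]
    return {
        "row_count": len(rows) == expected_rows,
        "column_count": all(len(row) == expected_columns for row in rows),
        "no_duplicate_rows": len(set(as_tuples)) == len(as_tuples),
        "age_is_number": len(numeric) == len(present),
        "age_not_null": len(present) == len(ages),
        "age_in_range": all(0 <= int(a) <= 120 for a in numeric),
    }


report = check_rules(rows, expected_rows=3, expected_columns=2)
for rule, passed in report.items():
    print(f"{rule}: {'ok' if passed else 'FAILED'}")
```

On this sample only `no_duplicate_rows` fails. GE expresses the same checks declaratively, which lets it store the rules, run them against any batch, and render the results in Data Docs.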



In [10]:
# define a validation rule to validate row number
table_row_rule = ExpectationConfiguration(
    **{
      "expectation_type": "expect_table_row_count_to_be_between",
      "kwargs": {
        "max_value": 32537,
        "min_value": 32537
      },
      "meta": {}
    }
)

# add the rule to the suite
suite.add_expectation(expectation_configuration=table_row_rule)

{"expectation_type": "expect_table_row_count_to_be_between", "meta": {}, "kwargs": {"max_value": 32537, "min_value": 32537}}

In [11]:
# define a validation rule to validate column number, name and order
table_column_rule = ExpectationConfiguration(
    **{
      "expectation_type": "expect_table_columns_to_match_ordered_list",
      "kwargs": {
        "column_list": [
          "age",
          "workclass",
          "fnlwgt",
          "education",
          "education-num",
          "marital-status",
          "occupation",
          "relationship",
          "race",
          "sex",
          "capital-gain",
          "capital-loss",
          "hours-per-week",
          "native-country",
          "income"
        ]
      },
      "meta": {}
    }
)

# add the rule to the suite
suite.add_expectation(expectation_configuration=table_column_rule)

{"expectation_type": "expect_table_columns_to_match_ordered_list", "meta": {}, "kwargs": {"column_list": ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]}}

In [12]:
# define a validation rule to detect duplicate rows
detect_duplication_rule = ExpectationConfiguration(
    **{
      "expectation_type": "expect_compound_columns_to_be_unique",
      "kwargs": {
        "column_list": [
          "age",
          "workclass",
          "fnlwgt",
          "education",
          "education-num",
          "marital-status",
          "occupation",
          "relationship",
          "race",
          "sex",
          "capital-gain",
          "capital-loss",
          "hours-per-week",
          "native-country",
          "income"
        ]
      },
      "meta": {}
    }
)

# add the rule to the suite
suite.add_expectation(expectation_configuration=detect_duplication_rule)

{"expectation_type": "expect_compound_columns_to_be_unique", "meta": {}, "kwargs": {"column_list": ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]}}

In [17]:
# define a validation rule to check that every age value is an integer
age_number_rule = ExpectationConfiguration(
    **{
      "expectation_type": "expect_column_values_to_match_regex",
      "kwargs": {
        "column": "age",
        # raw string, so that \d reaches the regex engine unchanged
        "regex": r"^[-+]?\d+$"
      },
      "meta": {}
    }
)

# add the rule to the suite
suite.add_expectation(expectation_configuration=age_number_rule)

{"expectation_type": "expect_column_values_to_match_regex", "kwargs": {"column": "age", "regex": "^[-+]?\\d+$"}, "meta": {}}
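
To see which values the pattern `^[-+]?\d+$` accepts, it can be tried directly with Python's `re` module. The sample values below are made up for illustration and do not come from the data set.

```python
import re

# Same pattern as in the rule above, written as a raw string.
AGE_PATTERN = re.compile(r"^[-+]?\d+$")

# Hypothetical sample values.
samples = ["37", "+5", "-12", "3.7", "", "forty", " 41"]

matches = {value: bool(AGE_PATTERN.match(value)) for value in samples}
for value, ok in matches.items():
    print(f"{value!r:10} -> {ok}")
```

The pattern accepts signed integers such as `-12`, so on its own it does not keep ages non-negative; that is the job of the range rule defined further down.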

In [18]:
# define a validation rule to forbid null values in the age column
age_no_null_rule = ExpectationConfiguration(
    **{
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {
        "column": "age",
        # "mostly": 0.99
      },
      "meta": {}
    }
)

# add the rule to the suite
suite.add_expectation(expectation_configuration=age_no_null_rule)

{"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "age"}, "meta": {}}
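
The commented-out `mostly` argument relaxes a column rule: the expectation succeeds when at least that fraction of the values passes, instead of all of them. The sketch below shows the arithmetic in plain Python. `passes_with_mostly` and the sample column are hypothetical, and treating an empty column as passing is an assumption of this sketch.

```python
def passes_with_mostly(values, predicate, mostly=1.0):
    """Return True when at least `mostly` of the values satisfy `predicate`."""
    if not values:
        return True  # assumption: an empty column has nothing to violate
    passing = sum(1 for value in values if predicate(value))
    return passing / len(values) >= mostly


def is_not_null(value):
    return value is not None


# Hypothetical column with one null out of ten values (90% non-null).
ages = [39, 50, None, 38, 53, 28, 37, 49, 52, 31]

strict = passes_with_mostly(ages, is_not_null)           # no tolerance
near_strict = passes_with_mostly(ages, is_not_null, 0.99)
relaxed = passes_with_mostly(ages, is_not_null, 0.9)

print(strict, near_strict, relaxed)
```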

In [19]:
# define a validation rule to keep age values between 0 and 120
age_max_rule = ExpectationConfiguration(
    **{
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {
        "column": "age",
        "max_value": 120,
        "min_value": 0
      },
      "meta": {}
    }
)

# add the rule to the suite
suite.add_expectation(expectation_configuration=age_max_rule)

{"expectation_type": "expect_column_values_to_be_between", "kwargs": {"column": "age", "max_value": 120, "min_value": 0}, "meta": {}}

## 2.3 Save the expectation suite to the project context



In [13]:
context.save_expectation_suite(expectation_suite=suite, expectation_suite_name=expectation_suite_name)

# build an identifier from the name of the expectation suite
suite_identifier = ExpectationSuiteIdentifier(expectation_suite_name=expectation_suite_name)

# build the Data Docs web page for the newly created expectation suite
# from the configuration saved in the expectations folder
context.build_data_docs(resource_identifiers=[suite_identifier])

# open the web page in a browser
context.open_data_docs(resource_identifier=suite_identifier)

# 3. Create a checkpoint

We have defined a datasource and validation rules. Now we need to associate the two. For this purpose, GE introduces a new concept: the checkpoint. A checkpoint applies a list of expectation suites to a list of datasets and then, based on the results, executes a list of actions (e.g. save the results to Data Docs, send an alert to Slack).
For more information, see the [official doc](https://docs.greatexpectations.io/docs/reference/checkpoints_and_actions).

In this tutorial, we use the **SimpleCheckpoint class** to create a checkpoint. It provides a basic set of actions (store the validation result, store evaluation parameters, update Data Docs, and optionally send a Slack notification), so we don't need to declare an action_list in the checkpoint configuration.
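
The relationship between data, rules, and actions can be pictured with a toy model. The class below is a hypothetical stand-in written for this explanation only; it shares no code or API with GE's `Checkpoint`, and names such as `MiniCheckpoint` are invented.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MiniCheckpoint:
    """Toy model: pairs data batches with rule sets, then runs actions on the result."""
    name: str
    # Each entry looks like {"batch": rows, "suite": {rule_name: rule_function}}.
    validations: List[dict]
    actions: List[Callable[[dict], None]] = field(default_factory=list)

    def run(self) -> dict:
        results = []
        for validation in self.validations:
            batch, suite = validation["batch"], validation["suite"]
            results.append({rule_name: rule(batch) for rule_name, rule in suite.items()})
        summary = {
            "checkpoint": self.name,
            "results": results,
            "success": all(all(outcome.values()) for outcome in results),
        }
        # Actions run after validation, whatever the outcome.
        for action in self.actions:
            action(summary)
        return summary


# Usage with a made-up batch and two made-up rules.
stored = []
suite = {
    "has_rows": lambda rows: len(rows) > 0,
    "age_not_null": lambda rows: all(row["age"] is not None for row in rows),
}
checkpoint = MiniCheckpoint(
    name="demo",
    validations=[{"batch": [{"age": 39}, {"age": None}], "suite": suite}],
    actions=[stored.append],  # stands in for "store the validation result"
)
summary = checkpoint.run()
print(summary["success"], summary["results"])
```

In GE the actions are configured rather than passed as functions, and `SimpleCheckpoint` supplies a default list of them.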

## 3.1 Specify your checkpoint config

In [14]:
yaml = YAML()

# Use yaml to configure a checkpoint
ck0_name = "pengfei_demo_checkpoint"
print(f"datasource_name: {datasource_name}")
dataset_name = "adult_with_duplicates.csv"
print(f"expectation_suite_name: {expectation_suite_name}")

datasource_name: ge_demo
expectation_suite_name: ge_demo.rules


In [15]:

checkpoint_config = f"""
name: {ck0_name}
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-my-run-name-template"
validations:
  - batch_request:
      datasource_name: {datasource_name}
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: {dataset_name}
      data_connector_query:
        index: -1
    expectation_suite_name: {expectation_suite_name}
"""

# preview the checkpoint config
print(checkpoint_config)


name: pengfei_demo_checkpoint
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-my-run-name-template"
validations:
  - batch_request:
      datasource_name: ge_demo
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: adult_with_duplicates.csv
      data_connector_query:
        index: -1
    expectation_suite_name: ge_demo.rules
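
Two entries in this configuration deserve a closer look. `run_name_template` may contain `strftime` directives that are filled in with the run time, and `index: -1` in `data_connector_query` selects the last batch of the data asset, much like negative indexing on a Python list. The snippet below illustrates both with the standard library only; the timestamp and the batch names are made up.

```python
from datetime import datetime

# Template copied from the checkpoint configuration above.
run_name_template = "%Y%m%d-%H%M%S-my-run-name-template"

# A fixed, made-up run time keeps the example reproducible.
run_time = datetime(2022, 3, 14, 9, 26, 53)
run_name = run_time.strftime(run_name_template)
print(run_name)  # 20220314-092653-my-run-name-template

# Hypothetical batches of one data asset, oldest first.
batches = ["batch_0", "batch_1", "batch_2"]
selected = batches[-1]
print(selected)  # batch_2
```

In this notebook each CSV file is its own data asset with a single batch, so `index: -1` simply selects that file.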



In [16]:
from pprint import pprint

# if you don't know the data asset names in your project, use the command below to list the available ones
print(f"available data asset names: \n{context.get_available_data_asset_names()}")

# you can also list the available expectation suites
print(f"available expectation suite names: \n{context.list_expectation_suite_names()}")

available data asset names: 
{'ge_demo': {'default_inferred_data_connector_name': ['adult.csv', 'adult_train.csv', 'adult_with_header.csv', 'adult_eval.csv', 'adult_serving.csv', 'adult_with_duplicates.csv', 'adult_cleaned.csv'], 'default_runtime_data_connector_name': []}}
available expectation suite names: 
['ge_demo.rules']


## 3.2 Validate your checkpoint configuration

To test your checkpoint configuration, use the command below. If it is valid, you should see "Successfully instantiated SimpleCheckpoint".

In [17]:
my_checkpoint = context.test_yaml_config(yaml_config=checkpoint_config)

Attempting to instantiate class from config...
	Instantiating as a SimpleCheckpoint, since class_name is SimpleCheckpoint
	Successfully instantiated SimpleCheckpoint


Checkpoint class name: SimpleCheckpoint


## 3.3 Save your checkpoint
If your checkpoint configuration is valid, you can save it to your project context.

In [18]:
context.add_checkpoint(**yaml.load(checkpoint_config))

{
  "action_list": [
    {
      "name": "store_validation_result",
      "action": {
        "class_name": "StoreValidationResultAction"
      }
    },
    {
      "name": "store_evaluation_params",
      "action": {
        "class_name": "StoreEvaluationParametersAction"
      }
    },
    {
      "name": "update_data_docs",
      "action": {
        "class_name": "UpdateDataDocsAction",
        "site_names": []
      }
    }
  ],
  "batch_request": {},
  "class_name": "Checkpoint",
  "config_version": 1.0,
  "evaluation_parameters": {},
  "module_name": "great_expectations.checkpoint",
  "name": "pengfei_demo_checkpoint",
  "profilers": [],
  "run_name_template": "%Y%m%d-%H%M%S-my-run-name-template",
  "runtime_configuration": {},
  "validations": [
    {
      "batch_request": {
        "datasource_name": "ge_demo",
        "data_connector_name": "default_inferred_data_connector_name",
        "data_asset_name": "adult_with_duplicates.csv",
        "data_connector_query": {
       

## 3.4 Run checkpoint and view the output

To run the checkpoint, use the command below, then review its output in Data Docs.

In [20]:
context.run_checkpoint(checkpoint_name=ck0_name)
context.open_data_docs()

Calculating Metrics:   0%|          | 0/16 [00:00<?, ?it/s]

The validation reports the following duplicated row in the source data:

```csv
37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,>50K
```
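
A duplicate like this one can also be cross-checked outside GE. The sketch below counts identical rows with `collections.Counter`. The first record is the row reported above; the second record is made up so that the sample contains at least one non-duplicated row.

```python
import csv
import io
from collections import Counter

# Row reported as duplicated by the checkpoint run above.
duplicated = (
    "37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,"
    "Husband,Black,Male,0,0,80,United-States,>50K"
)
# Hypothetical record, invented for this example.
other = (
    "25,Private,100000,Bachelors,13,Never-married,Sales,"
    "Not-in-family,White,Female,0,0,40,United-States,<=50K"
)

raw = "\n".join([duplicated, other, duplicated])

# Tuples are hashable, so identical rows collapse into one Counter key.
rows = [tuple(record) for record in csv.reader(io.StringIO(raw))]
counts = Counter(rows)
duplicates = [row for row, n in counts.items() if n > 1]

for row in duplicates:
    print(f"{counts[row]} copies: {','.join(row)}")
```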

## 3.5 Create a checkpoint for the cleaned data



In [17]:
ck1_name = "pengfei_demo_checkpoint1"
print(f"datasource_name: {datasource_name}")
dataset_name_1 = "adult_cleaned.csv"
print(f"expectation_suite_name: {expectation_suite_name}")

datasource_name: ge_demo
expectation_suite_name: ge_demo.rules


In [18]:
checkpoint_config_1 = f"""
name: {ck1_name}
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-my-run-name-template"
validations:
  - batch_request:
      datasource_name: {datasource_name}
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: {dataset_name_1}
      data_connector_query:
        index: -1
    expectation_suite_name: {expectation_suite_name}
"""

# preview the checkpoint config
print(checkpoint_config_1)


name: pengfei_demo_checkpoint1
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-my-run-name-template"
validations:
  - batch_request:
      datasource_name: ge_demo
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: adult_cleaned.csv
      data_connector_query:
        index: -1
    expectation_suite_name: ge_demo.rules



In [19]:
context.add_checkpoint(**yaml.load(checkpoint_config_1))

{
  "action_list": [
    {
      "name": "store_validation_result",
      "action": {
        "class_name": "StoreValidationResultAction"
      }
    },
    {
      "name": "store_evaluation_params",
      "action": {
        "class_name": "StoreEvaluationParametersAction"
      }
    },
    {
      "name": "update_data_docs",
      "action": {
        "class_name": "UpdateDataDocsAction",
        "site_names": []
      }
    }
  ],
  "batch_request": {},
  "class_name": "Checkpoint",
  "config_version": 1.0,
  "evaluation_parameters": {},
  "module_name": "great_expectations.checkpoint",
  "name": "pengfei_demo_checkpoint1",
  "profilers": [],
  "run_name_template": "%Y%m%d-%H%M%S-my-run-name-template",
  "runtime_configuration": {},
  "validations": [
    {
      "batch_request": {
        "datasource_name": "ge_demo",
        "data_connector_name": "default_inferred_data_connector_name",
        "data_asset_name": "adult_cleaned.csv",
        "data_connector_query": {
          "ind

In [20]:
context.run_checkpoint(checkpoint_name=ck1_name)
context.open_data_docs()

Calculating Metrics:   0%|          | 0/16 [00:00<?, ?it/s]