# Tutorial 2

In Tutorial 1, we used the CLI and the generated notebooks to create a data source, expectation suites, and checkpoints. In this tutorial, we write a notebook ourselves and perform all of these actions in a single notebook.




In [16]:
%matplotlib inline

from great_expectations.core.expectation_configuration import ExpectationConfiguration
from great_expectations.data_context.types.resource_identifiers import (
    ExpectationSuiteIdentifier,
)
from great_expectations.exceptions import DataContextError
import great_expectations as ge
from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource, check_if_datasource_name_exists
from ruamel.yaml import YAML


# 0. Get the project context

To get the project context, we call the get_context() method. This works only if the notebook is located inside the Great Expectations project root directory, because get_context() simply calls DataContext() without arguments:

```python
def get_context():
    from great_expectations.data_context.data_context import DataContext

    return DataContext()
```
If you want to work from anywhere outside the project, you need to create the context with the DataContext class and pass the project root directory explicitly. The code below is an example.

```python
ge_project_root_dir="../great_expectations_validation/great_expectations"
data_context = ge.data_context.DataContext(context_root_dir=ge_project_root_dir)
```
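
If you would rather not hard-code the relative path, one option is to search upward from the notebook's directory for the project folder. The sketch below uses only the standard library. `find_ge_root` is a hypothetical helper, not part of Great Expectations, and it assumes the project folder is named `great_expectations` and contains `great_expectations.yml`, as in this tutorial.

```python
from pathlib import Path
from typing import Optional


def find_ge_root(start: Path, config_name: str = "great_expectations.yml") -> Optional[Path]:
    """Walk upward from `start` and return the first great_expectations/
    folder that contains the config file, or None if there is none."""
    for directory in [start, *start.parents]:
        candidate = directory / "great_expectations"
        if (candidate / config_name).is_file():
            return candidate
    return None


# Hypothetical usage inside a notebook:
# root = find_ge_root(Path.cwd())
# data_context = ge.data_context.DataContext(context_root_dir=str(root))
```

The helper returns `None` when nothing is found, so the caller can decide whether to fall back to an explicit path.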

In [4]:
context = ge.get_context()

The context of the project is stored in **great_expectations.yml**. When you load the context, you read its configuration from this file. When you save a datasource, a new section is added to this YAML file; expectation suites and checkpoints are written to their own files in the expectations and checkpoints folders.

# 1. Create a new pandas Datasource
Here we create a file-based datasource that reads the files under base_directory with the pandas execution engine.

In [6]:
# You can choose any name you like
datasource_name = "pengfei_test"

my_datasource_yaml = f"""
name: {datasource_name}
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: ../../data
    default_regex:
      group_names:
        - data_asset_name
      pattern: (.*)
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name
"""
print(my_datasource_yaml)


name: pengfei_test
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: ../../data
    default_regex:
      group_names:
        - data_asset_name
      pattern: (.*)
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name
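
The `default_regex` section decides how file names under `base_directory` turn into data asset names: the value captured by the regex group is stored under the matching entry of `group_names`. The sketch below imitates that idea with the standard `re` module. It is not the library's implementation, the helper name `infer_asset_names` is made up for illustration, and the two file names are the ones found in this tutorial's data directory.

```python
import re
from typing import Dict, List


def infer_asset_names(
    file_names: List[str], pattern: str, group_names: List[str]
) -> Dict[str, List[str]]:
    """Group file names by the value captured for 'data_asset_name'."""
    regex = re.compile(pattern)
    assets: Dict[str, List[str]] = {}
    for file_name in file_names:
        match = regex.fullmatch(file_name)
        if match is None:
            continue  # a file that does not match is left out
        captured = dict(zip(group_names, match.groups()))
        assets.setdefault(captured["data_asset_name"], []).append(file_name)
    return assets


files = ["adult.csv", "adult_with_duplicates.csv"]

# With (.*) the whole file name becomes the asset name
whole_name = infer_asset_names(files, r"(.*)", ["data_asset_name"])
print(whole_name)
# {'adult.csv': ['adult.csv'], 'adult_with_duplicates.csv': ['adult_with_duplicates.csv']}

# A narrower pattern drops the extension from the asset name
without_extension = infer_asset_names(files, r"(.*)\.csv", ["data_asset_name"])
print(without_extension)
# {'adult': ['adult.csv'], 'adult_with_duplicates': ['adult_with_duplicates.csv']}
```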



## 1.1 Test if your data source configuration is valid

In [7]:
context.test_yaml_config(yaml_config=my_datasource_yaml)

Attempting to instantiate class from config...
	Instantiating as a Datasource, since class_name is Datasource
	Successfully instantiated Datasource


ExecutionEngine class name: PandasExecutionEngine
Data Connectors:
	default_inferred_data_connector_name : InferredAssetFilesystemDataConnector

	Available data_asset_names (2 of 2):
		adult.csv (1 of 1): ['adult.csv']
		adult_with_duplicates.csv (1 of 1): ['adult_with_duplicates.csv']

	Unmatched data_references (0 of 0):[]

	default_runtime_data_connector_name:RuntimeDataConnector

	Available data_asset_names (0 of 0):
		Note : RuntimeDataConnector will not have data_asset_names until they are passed in through RuntimeBatchRequest

	Unmatched data_references (0 of 0): []



<great_expectations.datasource.new_datasource.Datasource at 0x7fb5a02e2f10>

## 1.2 Save the datasource

In [8]:
sanitize_yaml_and_save_datasource(context, my_datasource_yaml, overwrite_existing=False)


## 1.3 List existing datasources

In [9]:
context.list_datasources()

[{'execution_engine': {'class_name': 'PandasExecutionEngine',
   'module_name': 'great_expectations.execution_engine'},
  'class_name': 'Datasource',
  'module_name': 'great_expectations.datasource',
  'data_connectors': {'default_inferred_data_connector_name': {'base_directory': '../../data',
    'module_name': 'great_expectations.datasource.data_connector',
    'default_regex': {'group_names': ['data_asset_name'], 'pattern': '(.*)'},
    'class_name': 'InferredAssetFilesystemDataConnector'},
   'default_runtime_data_connector_name': {'batch_identifiers': ['default_identifier_name'],
    'module_name': 'great_expectations.datasource.data_connector',
    'class_name': 'RuntimeDataConnector'}},
  'name': 'census_income_validation'},
 {'execution_engine': {'class_name': 'PandasExecutionEngine',
   'module_name': 'great_expectations.execution_engine'},
  'class_name': 'Datasource',
  'module_name': 'great_expectations.datasource',
  'data_connectors': {'default_inferred_data_connector_nam
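
As the output above shows, `context.list_datasources()` returns a list of dicts, each with a `name` key. Before saving a datasource, it can help to check whether the name is already taken. The following is a minimal sketch that works on a hand-written, trimmed-down stand-in for that list; `datasource_name_taken` is a hypothetical helper, not a Great Expectations function.

```python
from typing import Any, Dict, List


def datasource_name_taken(datasources: List[Dict[str, Any]], name: str) -> bool:
    """Return True if any listed datasource already uses `name`."""
    return any(ds.get("name") == name for ds in datasources)


# Trimmed-down stand-in for the result of context.list_datasources()
listed = [
    {"name": "census_income_validation", "class_name": "Datasource"},
    {"name": "pengfei_test", "class_name": "Datasource"},
]

print(datasource_name_taken(listed, "pengfei_test"))   # True
print(datasource_name_taken(listed, "another_name"))   # False
```

In the notebook, the same function could be called as `datasource_name_taken(context.list_datasources(), datasource_name)`.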

# 2. Create or edit expectation suites

The code below takes an **expectation suite name** as input. If the suite exists in the project, it returns the existing suite; otherwise, it creates a new one.

In [12]:
expectation_suite_name = "pengfei.test1"


try:
    suite = context.get_expectation_suite(expectation_suite_name=expectation_suite_name)
    print(
        f'Loaded ExpectationSuite "{suite.expectation_suite_name}" containing {len(suite.expectations)} expectations.'
    )
except DataContextError:
    suite = context.create_expectation_suite(
        expectation_suite_name=expectation_suite_name
    )
    print(f'Created ExpectationSuite "{suite.expectation_suite_name}".')

Created ExpectationSuite "pengfei.test1".


## 2.1 Add expectations to the expectation suite

After getting the expectation suite instance, we need to add expectations (validation rules) to it.
Below is an example: we add a rule that verifies that the column age contains no null values.

For now, if the expectation suite already contains expectations (validation rules), we can't remove them through the suite object. We can only add new rules.



In [13]:
# Define a validation rule; it is built from a dict of key/value pairs
expectation_configuration = ExpectationConfiguration(
    **{
        "meta": {},
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "age"},
    }
)

# add the rule to the suite
suite.add_expectation(expectation_configuration=expectation_configuration)

{"kwargs": {"column": "age"}, "meta": {}, "expectation_type": "expect_column_values_to_not_be_null"}
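
To make the meaning of the rule concrete, here is a plain-Python sketch of what a "values must not be null" check does on a handful of rows. It is only an illustration of the idea, not how Great Expectations evaluates `expect_column_values_to_not_be_null`; the function name, the sample rows and the keys of the returned dict are all assumptions chosen for this example.

```python
from typing import Any, Dict, List


def check_column_not_null(rows: List[Dict[str, Any]], column: str) -> Dict[str, Any]:
    """Flag every row whose value for `column` is missing or None."""
    unexpected = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {
        "success": not unexpected,
        "unexpected_count": len(unexpected),
        "unexpected_rows": unexpected,
    }


# Made-up sample rows; the second one violates the rule
rows = [{"age": 39}, {"age": None}, {"age": 50}]
result = check_column_not_null(rows, "age")
print(result)
# {'success': False, 'unexpected_count': 1, 'unexpected_rows': [1]}
```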

## 2.2 Save the expectation suite to the project context

After you save the expectation suite, a new directory is created in the folder "expectations". The directory is named after the first part of the expectation suite name (i.e. pengfei). The JSON file in that directory (i.e. test1.json) is the file that actually contains all the expectations (validation rules).

```text
├── great_expectations
│   ├── checkpoints
│   ├── expectations
│   │   ├── census_income_expectation_suite
│   │   └── pengfei
│   │       └── test1.json
│   ├── plugins
│   └── uncommitted
└── README.md
```
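
The mapping from suite name to file location described above can be sketched in a few lines. This is an illustration under the assumption that each dot in the suite name becomes a directory level and that the store folder is named `expectations`, as in this tutorial; `suite_file_path` is a hypothetical helper, not a Great Expectations function.

```python
from pathlib import PurePosixPath


def suite_file_path(suite_name: str, store_dir: str = "expectations") -> PurePosixPath:
    """Turn 'a.b' into '<store_dir>/a/b.json'."""
    parts = suite_name.split(".")
    return PurePosixPath(store_dir, *parts[:-1], parts[-1] + ".json")


print(suite_file_path("pengfei.test1"))
# expectations/pengfei/test1.json
```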

In [14]:
context.save_expectation_suite(expectation_suite=suite, expectation_suite_name=expectation_suite_name)

# Build an identifier from the name of the expectation suite
suite_identifier = ExpectationSuiteIdentifier(expectation_suite_name=expectation_suite_name)

# Use the config stored in the expectations folder to generate a web page
# that describes the newly created expectation suite
context.build_data_docs(resource_identifiers=[suite_identifier])

# open the web page in a browser
context.open_data_docs(resource_identifier=suite_identifier)

You should see the following page in your browser:

![Expectation_ui](../images/expectations_validation_rule.png)

# 3. Create a checkpoint

We have defined a data source and validation rules. Now we need to associate the data source and the validation rules with each other. For this purpose, we introduce a new concept: the checkpoint. A checkpoint applies a list of expectation suites to a list of datasets and then, based on the results, executes a list of actions (e.g. save the result to Data Docs, send an alert to Slack, etc.).
For more information, see the [official doc](https://docs.greatexpectations.io/docs/reference/checkpoints_and_actions).

In this tutorial, we will use the **SimpleCheckpoint class** to create a checkpoint. It provides a basic set of actions (e.g. store Validation Result, store evaluation parameters, update Data Docs, and optionally, send a Slack notification). So we don't need to declare an action_list in the checkpoint configuration.

## 3.1 Specify your checkpoint config

In [18]:

yaml = YAML()

# Use yaml to configure a checkpoint
my_checkpoint_name = "pengfei_test_checkpoint"

checkpoint_config = f"""
name: {my_checkpoint_name}
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-my-run-name-template"
validations:
  - batch_request:
      datasource_name: pengfei_test
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: adult_with_duplicates.csv
      data_connector_query:
        index: -1
    expectation_suite_name: pengfei.test1
"""

# preview the checkpoint config
print(checkpoint_config)


name: pengfei_test_checkpoint
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-my-run-name-template"
validations:
  - batch_request:
      datasource_name: pengfei_test
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: adult_with_duplicates.csv
      data_connector_query:
        index: -1
    expectation_suite_name: pengfei.test1



The main fields of the above YAML config are:

- class_name: the checkpoint class name
- run_name_template: the name given to the result of each checkpoint run. Include a timestamp in it to make the runs easy to track
- validations: defines the association between the data source and the validation rules
  - batch_request: selects the data to validate
    - datasource_name: the name of your datasource
    - data_connector_name: default_inferred_data_connector_name
    - data_asset_name: the target file name in your data source
    - data_connector_query:
      - index: -1
  - expectation_suite_name: the name of your expectation suite
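
The `run_name_template` value uses `strftime`-style placeholders. To preview what a run name could look like, the template can be rendered with the standard `datetime` module. This is a sketch of the formatting only, with a fixed, made-up timestamp; `render_run_name` is a hypothetical helper.

```python
from datetime import datetime

run_name_template = "%Y%m%d-%H%M%S-my-run-name-template"


def render_run_name(template: str, run_time: datetime) -> str:
    """Fill the strftime placeholders of the template with `run_time`."""
    return run_time.strftime(template)


# Fixed example timestamp so that the result is reproducible
run_name = render_run_name(run_name_template, datetime(2022, 1, 2, 3, 4, 5))
print(run_name)
# 20220102-030405-my-run-name-template
```

Putting the date and time first keeps the run names sortable in chronological order.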


If you are not sure about the names, you can use the commands below to list the available data assets and expectation suites.

In [19]:
from pprint import pprint

# If you don't know the data asset names in your project, list the available ones
pprint(context.get_available_data_asset_names())

# You can also list the available expectation suites
pprint(context.list_expectation_suite_names())

{'census_income_validation': {'default_inferred_data_connector_name': ['adult.csv',
                                                                       'adult_with_duplicates.csv'],
                              'default_runtime_data_connector_name': []},
 'pengfei_test': {'default_inferred_data_connector_name': ['adult.csv',
                                                           'adult_with_duplicates.csv'],
                  'default_runtime_data_connector_name': []}}
['census_income_expectation_suite.test1', 'pengfei.test1']
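
With these two listings, the names used in a checkpoint config can be cross-checked before the config is tested. The sketch below does that in plain Python on values copied from the output above; `find_unknown_names` is a hypothetical helper written for this tutorial, not a Great Expectations function.

```python
from typing import Dict, List


def find_unknown_names(
    available_assets: Dict[str, Dict[str, List[str]]],
    suite_names: List[str],
    datasource_name: str,
    data_connector_name: str,
    data_asset_name: str,
    expectation_suite_name: str,
) -> List[str]:
    """Return one message per name that is missing from the listings."""
    problems = []
    connectors = available_assets.get(datasource_name)
    if connectors is None:
        problems.append(f"unknown datasource: {datasource_name}")
    elif data_connector_name not in connectors:
        problems.append(f"unknown data connector: {data_connector_name}")
    elif data_asset_name not in connectors[data_connector_name]:
        problems.append(f"unknown data asset: {data_asset_name}")
    if expectation_suite_name not in suite_names:
        problems.append(f"unknown expectation suite: {expectation_suite_name}")
    return problems


# Values copied from the output above
connector_assets = {
    "default_inferred_data_connector_name": ["adult.csv", "adult_with_duplicates.csv"],
    "default_runtime_data_connector_name": [],
}
available_assets = {
    "census_income_validation": connector_assets,
    "pengfei_test": connector_assets,
}
suite_names = ["census_income_expectation_suite.test1", "pengfei.test1"]

problems = find_unknown_names(
    available_assets,
    suite_names,
    datasource_name="pengfei_test",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="adult_with_duplicates.csv",
    expectation_suite_name="pengfei.test1",
)
print(problems)
# []
```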


## 3.2 Validate your checkpoint configuration

To test your checkpoint configuration, use the command below. If it is valid, you should see "Successfully instantiated SimpleCheckpoint".

In [20]:
my_checkpoint = context.test_yaml_config(yaml_config=checkpoint_config)

Attempting to instantiate class from config...
	Instantiating as a SimpleCheckpoint, since class_name is SimpleCheckpoint
{
  "name": "pengfei_test_checkpoint",
  "config_version": 1.0,
  "template_name": null,
  "module_name": "great_expectations.checkpoint",
  "class_name": "SimpleCheckpoint",
  "run_name_template": "%Y%m%d-%H%M%S-my-run-name-template",
  "expectation_suite_name": null,
  "batch_request": null,
  "action_list": [
    {
      "name": "store_validation_result",
      "action": {
        "class_name": "StoreValidationResultAction"
      }
    },
    {
      "name": "store_evaluation_params",
      "action": {
        "class_name": "StoreEvaluationParametersAction"
      }
    },
    {
      "name": "update_data_docs",
      "action": {
        "class_name": "UpdateDataDocsAction",
        "site_names": []
      }
    }
  ],
  "evaluation_parameters": {},
  "runtime_configuration": {},
  "validations": [
    {
      "batch_request": {
        "datasource_name": "pengfei

## 3.3 Save your checkpoint
If your checkpoint config is valid, you can save it to your project context

In [21]:
context.add_checkpoint(**yaml.load(checkpoint_config))

{
  "name": "pengfei_test_checkpoint",
  "config_version": 1.0,
  "template_name": null,
  "module_name": "great_expectations.checkpoint",
  "class_name": "Checkpoint",
  "run_name_template": "%Y%m%d-%H%M%S-my-run-name-template",
  "expectation_suite_name": null,
  "batch_request": null,
  "action_list": [
    {
      "name": "store_validation_result",
      "action": {
        "class_name": "StoreValidationResultAction"
      }
    },
    {
      "name": "store_evaluation_params",
      "action": {
        "class_name": "StoreEvaluationParametersAction"
      }
    },
    {
      "name": "update_data_docs",
      "action": {
        "class_name": "UpdateDataDocsAction",
        "site_names": []
      }
    }
  ],
  "evaluation_parameters": {},
  "runtime_configuration": {},
  "validations": [
    {
      "batch_request": {
        "datasource_name": "pengfei_test",
        "data_connector_name": "default_inferred_data_connector_name",
        "data_asset_name": "adult_with_duplicates

<great_expectations.checkpoint.checkpoint.SimpleCheckpoint at 0x7fb576288820>

## 3.4 Run checkpoint and view the output

To run the checkpoint, use the command below, then review its output in Data Docs.

In [22]:
context.run_checkpoint(checkpoint_name=my_checkpoint_name)
context.open_data_docs()

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

You should see the following web page:
![Expectation_checkpoint](../images/expectations_checkpoint.png)
