# Information

The following notebook is a completely manual setup of a _great_expectations_ pipeline. Usually many of the steps (especially setup) are handled by the _great_expectations_ client. This notebook is for educational purposes only, to give a deeper understanding of what is going on inside the library.

# Setup

In [12]:
import json
import great_expectations as ge
import pandas as pd
from ruamel import yaml
from great_expectations.core.batch import BatchRequest, RuntimeBatchRequest


## Create datasource config and load 

In [28]:
# Load DataContext to memory
context = ge.get_context()

In [16]:
datasource_yaml="""
name: transactions_use_case
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: ../data
    default_regex:
      group_names: 
        - data_asset_name
      pattern: (.*)
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name
"""

In [17]:
context.test_yaml_config(datasource_yaml)

Attempting to instantiate class from config...
	Instantiating as a Datasource, since class_name is Datasource
	Successfully instantiated Datasource


ExecutionEngine class name: PandasExecutionEngine
Data Connectors:
	default_inferred_data_connector_name : InferredAssetFilesystemDataConnector

	Available data_asset_names (3 of 3):
		articles.csv (1 of 1): ['articles.csv']
		transactions.csv (1 of 1): ['transactions.csv']
		transactions_faulty.csv (1 of 1): ['transactions_faulty.csv']

	Unmatched data_references (0 of 0):[]

	default_runtime_data_connector_name:RuntimeDataConnector

	Available data_asset_names (0 of 0):
		Note : RuntimeDataConnector will not have data_asset_names until they are passed in through RuntimeBatchRequest

	Unmatched data_references (0 of 0): []



<great_expectations.datasource.new_datasource.Datasource at 0x7f6086ff5b50>
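The `default_regex` block in the config decides how file names become data asset names. The following is a minimal sketch of that idea in plain Python, not the library's actual implementation: `asset_names_from_files` is a hypothetical helper, and the second pattern `(.*)\.csv` is an assumption added only to show how a narrower capture group would change the resulting names.

```python
import re


def asset_names_from_files(filenames, pattern, group_names):
    """Map each file name to the regex groups listed in group_names (hypothetical helper)."""
    compiled = re.compile(pattern)
    result = {}
    for filename in filenames:
        match = compiled.match(filename)
        if match is None:
            # A file that does not match would show up as an unmatched data reference
            continue
        result[filename] = dict(zip(group_names, match.groups()))
    return result


files = ["articles.csv", "transactions.csv", "transactions_faulty.csv"]

# Pattern from the config above: the whole file name is captured as the asset name
whole_name = asset_names_from_files(files, r"(.*)", ["data_asset_name"])

# Assumed alternative pattern: capture only the part before the extension
without_ext = asset_names_from_files(files, r"(.*)\.csv", ["data_asset_name"])

print(whole_name["articles.csv"])
print(without_ext["articles.csv"])
```

With the pattern `(.*)` the full file name, extension included, ends up as the asset name, which is consistent with the `articles.csv` entries listed in the output above.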

After successfully setting up the config and testing it, we have to save it to the appropriate place in our _great_expectations.yml_ file, which was created when great_expectations was initialised.

In [19]:
context.add_datasource(**yaml.load(datasource_yaml))

The default 'Loader' for 'load(stream)' without further arguments can be unsafe.
Use 'load(stream, Loader=ruamel.yaml.Loader)' explicitly if that is OK.
Alternatively include the following in your code:


In most other cases you should consider using 'safe_load(stream)'
  context.add_datasource(**yaml.load(datasource_yaml))


<great_expectations.datasource.new_datasource.Datasource at 0x7f6086f4e430>

## Testing the new datasource

In [38]:
batch_request = RuntimeBatchRequest(
    datasource_name="transactions_use_case",  # must match the name defined in datasource_yaml
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="transactions",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"path": "./data/transactions.csv"},  # Add your path here.
    batch_identifiers={"default_identifier_name": "default_identifier"},
)

In [39]:
context.create_expectation_suite(
    expectation_suite_name="test_suite", overwrite_existing=True
)

{
  "data_asset_type": null,
  "expectation_suite_name": "test_suite",
  "ge_cloud_id": null,
  "expectations": [],
  "meta": {
    "great_expectations_version": "0.14.6"
  }
}

In [40]:
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test_suite"
)
print(validator.head())

         date article_id  value
0  05.07.2021      200XL  49.99
1  20.06.2021       400S  19.99
2  15.04.2021      200XL  49.99
3  02.04.2021       500M  79.99
4  02.04.2021       100L  29.99


# Creating an expectation suite

To create a testing pipeline with a set of expectations, an expectation suite has to be created. A suite is a JSON file containing all expectations that will be run on a given dataset.

There are two options for setting up expectations:
* Use the profiler and a "healthy" dataset to auto-generate expectations based on the input table's shape and simple metrics (mean, min, max)
* Create an expectations JSON manually by using already implemented expectations and/or creating custom expectations to test against certain input data

This notebook takes the latter approach. To use the profiler, run _great_expectations suite new_ and select the profiler option. This starts a new helper notebook in which expectations based on the input data are created.
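To make the "suite is a JSON" idea concrete, here is a small sketch that builds such a document with plain dictionaries. The field names mirror what great_expectations 0.14.6 prints in this notebook; `build_suite` is a hypothetical helper for illustration and not part of the library, and the real suite files may carry additional fields.

```python
import json


def build_suite(name, expectations, ge_version="0.14.6"):
    """Assemble a suite-shaped dictionary (hypothetical helper, not a library call)."""
    return {
        "data_asset_type": None,
        "expectation_suite_name": name,
        "ge_cloud_id": None,
        "expectations": expectations,
        "meta": {"great_expectations_version": ge_version},
    }


# One expectation entry: a type, its keyword arguments, and optional metadata
in_set = {
    "expectation_type": "expect_column_values_to_be_in_set",
    "kwargs": {
        "column": "article_id",
        "value_set": ["100L", "200XL", "300M", "400S", "500M"],
    },
    "meta": {},
}

suite_dict = build_suite("transaction_validation_suite", [in_set])
suite_json = json.dumps(suite_dict, indent=2)
print(suite_json)
```

In the notebook itself the suite is created and stored through the data context, so this sketch only illustrates the shape of the stored document.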

In [62]:
from great_expectations.core.expectation_configuration import ExpectationConfiguration
from great_expectations.data_context.types.resource_identifiers import ExpectationSuiteIdentifier

### 1 Create expectation suite

In [63]:
context.create_expectation_suite(
    expectation_suite_name="transaction_validation_suite", overwrite_existing=True
)

{
  "data_asset_type": null,
  "expectation_suite_name": "transaction_validation_suite",
  "ge_cloud_id": null,
  "expectations": [],
  "meta": {
    "great_expectations_version": "0.14.6"
  }
}

### 2 Set suite to the new expectation suite

In [64]:
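from great_expectations.exceptions import DataContextError

# Name of the suite created in step 1
expectation_suite_name = "transaction_validation_suite"

try: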
    suite = context.get_expectation_suite(expectation_suite_name=expectation_suite_name)
    print(
        f'Loaded ExpectationSuite "{suite.expectation_suite_name}" containing {len(suite.expectations)} expectations.'
    )
except DataContextError:
    suite = context.create_expectation_suite(
        expectation_suite_name=expectation_suite_name
    )
    print(f'Created ExpectationSuite "{suite.expectation_suite_name}".')

Loaded ExpectationSuite "transaction_validation_suite" containing 0 expectations.


### 3 Create expectations

In [75]:
# Create an Expectation
articles = pd.read_csv("./data/articles.csv")
print(articles.head())

# Collect all known article ids; they form the allowed value set
article_id = articles["article_id"].values.tolist()
print(article_id)
expectation_configuration = ExpectationConfiguration(
   expectation_type="expect_column_values_to_be_in_set",
   kwargs={
      "column":"article_id",
      "value_set": article_id
   },
   meta={
       "notes": {
         "content": """The input value set is derived from the articles.csv. This way the validation data will always be compared 
         to the currently available articles. Optionally another suite for checking articles may be beneficial"""
      }
   }
)
# Add the Expectation to the suite
suite.add_expectation(expectation_configuration=expectation_configuration)

  article_id article_name  article_color_id article_size_id
0       100L      T-Shirt               100               L
1      200XL        Pants               200              XL
2       300M        Jeans               300              XM
3       400S          Hat               400               S
4       500M     Sneakers               500               M
['100L', '200XL', '300M', '400S', '500M']


{"expectation_type": "expect_column_values_to_be_in_set", "kwargs": {"column": "article_id", "value_set": ["100L", "200XL", "300M", "400S", "500M"]}, "meta": {"notes": {"content": "The input value set is derived from the articles.csv. This way the validation data will always be compared \n         to the currently available articles. Optionally another suite for checking articles may be beneficial"}}}
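To clarify what this expectation checks, the following plain Python sketch mimics the idea behind `expect_column_values_to_be_in_set`. It is a simplified illustration, not the library's implementation: `values_in_set` and its result keys are hypothetical, and the value `"999XX"` is an invented faulty id used only to show a failing case.

```python
def values_in_set(values, value_set):
    """Report which values are not part of value_set (hypothetical helper)."""
    allowed = set(value_set)
    unexpected = [value for value in values if value not in allowed]
    return {
        "success": not unexpected,
        "unexpected_count": len(unexpected),
        "unexpected_list": unexpected,
    }


# Allowed ids, as derived from articles.csv in the cell above
article_ids = ["100L", "200XL", "300M", "400S", "500M"]

# The first five article_id values of transactions.csv, as printed by validator.head()
clean = ["200XL", "400S", "200XL", "500M", "100L"]

# Hypothetical faulty column: one id that does not exist in articles.csv
faulty = clean + ["999XX"]

clean_result = values_in_set(clean, article_ids)
faulty_result = values_in_set(faulty, article_ids)
print(clean_result)
print(faulty_result)
```

Deriving the value set from articles.csv, as the notebook does, keeps the check in sync with the currently available articles without hard-coding ids.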

### 4 Save created expectations to expectation suite JSON

In [66]:
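# Print the stored suite; it is still empty because the in-memory suite has not been saved yet
print(context.get_expectation_suite(expectation_suite_name=expectation_suite_name))
# Persist the in-memory suite, including the expectation added in step 3
context.save_expectation_suite(expectation_suite=suite, expectation_suite_name=expectation_suite_name)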

suite_identifier = ExpectationSuiteIdentifier(expectation_suite_name=expectation_suite_name)
context.build_data_docs(resource_identifiers=[suite_identifier])
context.open_data_docs(resource_identifier=suite_identifier)

{
  "data_asset_type": null,
  "expectation_suite_name": "transaction_validation_suite",
  "ge_cloud_id": null,
  "expectations": [],
  "meta": {
    "great_expectations_version": "0.14.6"
  }
}


# Calling single great_expectations functions with Python

### Import the required data for check

In [7]:
transactions_df = ge.read_csv("./data/transactions.csv")
transactions_faulty_df = ge.read_csv("./data/transactions_faulty.csv")

# Create a list of all article ids
articles = pd.read_csv("./data/articles.csv")
print(articles.head())

article_id = articles["article_id"].values.tolist()
print(article_id)


  article_id article_name  article_color_id article_size_id
0       100L      T-Shirt               100               L
1      200XL        Pants               200              XL
2       300M        Jeans               300              XM
3       400S          Hat               400               S
4       500M     Sneakers               500               M
['100L', '200XL', '300M', '400S', '500M']


In [8]:
print(transactions_faulty_df.expect_column_values_to_be_in_set('article_id', article_id))

NameError: name 'df' is not defined