# How to Build a `RuleBasedProfiler`
* This Notebook will demonstrate the steps we need to take to generate a simple `RuleBasedProfiler` by initializing the components in memory.

* We will start from a new Great Expectations Data Context (i.e., the `great_expectations` folder created by running `great_expectations init`), begin by adding the Datasource, and then progressively add more components.


In [1]:
import great_expectations as ge
from ruamel import yaml

from great_expectations.core.batch import BatchRequest
from great_expectations.core import ExpectationSuite

from great_expectations.rule_based_profiler.rule.rule import Rule
from great_expectations.rule_based_profiler.rule_based_profiler import RuleBasedProfiler

from great_expectations.rule_based_profiler.domain_builder import (
    DomainBuilder,
    SimpleColumnSuffixDomainBuilder,
)
from great_expectations.rule_based_profiler.parameter_builder import (
    MetricMultiBatchParameterBuilder,
)
from great_expectations.rule_based_profiler.expectation_configuration_builder import (
    DefaultExpectationConfigurationBuilder,
)



In [2]:
data_context: ge.DataContext = ge.get_context()



## Set-up: Adding `taxi_data` `Datasource`
* Add `taxi_data` as a new `Datasource`
* We are using an `InferredAssetFilesystemDataConnector` to connect to data in the `test_sets/taxi_yellow_tripdata_samples` folder and get one `DataAsset` (`yellow_tripdata_sample_2018`) that has 12 Batches (1 Batch/month).

In [3]:
data_path: str = "../../../../test_sets/taxi_yellow_tripdata_samples"

datasource_config = {
    "name": "taxi_multi_batch_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetFilesystemDataConnector",
            "base_directory": data_path,
            "default_regex": {
                "group_names": ["data_asset_name", "month"],
                "pattern": "(yellow_tripdata_sample_2018)-(\\d.*)\\.csv",
            },
        },
        "default_inferred_data_connector_name_all_years": {
            "class_name": "InferredAssetFilesystemDataConnector",
            "base_directory": data_path,
            "default_regex": {
                "group_names": ["data_asset_name", "year", "month"],
                "pattern": "(yellow_tripdata_sample)_(\\d.*)-(\\d.*)\\.csv",
            },
        },
    },
}

data_context.test_yaml_config(yaml.dump(datasource_config))
data_context.add_datasource(**datasource_config)

Attempting to instantiate class from config...
	Instantiating as a Datasource, since class_name is Datasource
	Successfully instantiated Datasource


ExecutionEngine class name: PandasExecutionEngine
Data Connectors:
	default_inferred_data_connector_name : InferredAssetFilesystemDataConnector

	Available data_asset_names (1 of 1):
		yellow_tripdata_sample_2018 (3 of 12): ['yellow_tripdata_sample_2018-01.csv', 'yellow_tripdata_sample_2018-02.csv', 'yellow_tripdata_sample_2018-03.csv']

	Unmatched data_references (3 of 29):['.DS_Store', 'first_3_files', 'random_subsamples']

	default_inferred_data_connector_name_all_years : InferredAssetFilesystemDataConnector

	Available data_asset_names (1 of 1):
		yellow_tripdata_sample (3 of 36): ['yellow_tripdata_sample_2018-01.csv', 'yellow_tripdata_sample_2018-02.csv', 'yellow_tripdata_sample_2018-03.csv']

	Unmatched data_references (3 of 5):['.DS_Store', 'first_3_files', 'random_subsamples']



<great_expectations.datasource.new_datasource.Datasource at 0x7f8258f6b0d0>
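The way the `default_regex` in the configuration above carves a filename into `data_asset_name` and `month` can be checked with Python's standard `re` module. This is a minimal sketch of the matching idea only, not the data connector's actual implementation; `parse_data_reference` is a hypothetical helper name, and the use of `fullmatch` is an assumption made for the illustration.

```python
import re
from typing import Dict, Optional

# Same pattern and group names as the first data connector in the config above.
PATTERN = re.compile(r"(yellow_tripdata_sample_2018)-(\d.*)\.csv")
GROUP_NAMES = ["data_asset_name", "month"]


def parse_data_reference(filename: str) -> Optional[Dict[str, str]]:
    """Return the named groups for a filename, or None if it does not match."""
    match = PATTERN.fullmatch(filename)
    if match is None:
        return None
    return dict(zip(GROUP_NAMES, match.groups()))


# A matching file yields a data asset name plus a month identifier.
print(parse_data_reference("yellow_tripdata_sample_2018-01.csv"))
# A non-matching entry corresponds to an "unmatched data reference".
print(parse_data_reference(".DS_Store"))
```

Files that share the first captured group end up under the same data asset, and the remaining groups distinguish the individual Batches.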

# BatchRequests

* In this example, we will use two `BatchRequests` against our `Datasource`.
   * `single_batch_batch_request`: returns the most recent (December) Batch from the 2018 `taxi_data` dataset.
   * `multi_batch_batch_request`: returns all 12 Batches from the 2018 `taxi_data` dataset.

In [4]:
single_batch_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_datasource",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="yellow_tripdata_sample_2018",
    data_connector_query={"index": -1},
)

In [5]:
multi_batch_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_datasource",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="yellow_tripdata_sample_2018",
)
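The `data_connector_query={"index": -1}` in the first request reads much like ordinary Python list indexing over an ordered list of Batches. The sketch below illustrates that intuition with plain lists. It assumes the 12 monthly files follow the naming pattern shown in the Datasource output and are ordered by filename, which is an assumption made for illustration rather than a statement about the library's internals.

```python
# Hypothetical stand-in for the 12 monthly Batches, ordered by filename.
batch_files = [
    f"yellow_tripdata_sample_2018-{month:02d}.csv" for month in range(1, 13)
]

# {"index": -1} is analogous to taking the last element: the December file.
single_batch = batch_files[-1]

# Omitting the query is analogous to keeping the whole list.
multi_batch = batch_files

print(single_batch)
print(len(multi_batch))
```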

# Example 1:  `RuleBasedProfiler` with just a `DomainBuilder` and `ExpectationConfigurationBuilder`

## Build a `DomainBuilder`

In the process of building a `RuleBasedProfiler`, one of the first components we want to build and test is a `DomainBuilder`, which returns the Domains (tables, columns, sets of columns, etc.) that our resulting `Expectations` will be run on. In our example, the `DomainBuilder` will output a list of columns whose names follow a certain pattern, namely those that end with the suffix `_amount`. To this end we will use a `SimpleColumnSuffixDomainBuilder`, which selects columns based on their suffix; it will output a list of 4 columns: `fare_amount`, `tip_amount`, `tolls_amount` and `total_amount`.

The `RuleBasedProfiler` also contains additional `DomainBuilders` that allow you to do more sophisticated filtering on your data, depending on the column name, cardinality or type. 

These include:

 * `SimpleColumnSuffixDomainBuilder`: which allows you to choose columns based on their suffix. In our example, we will be using this `DomainBuilder` so that the `RuleBasedProfiler` outputs the columns that end in `_amount`.
 * `CategoricalColumnDomainBuilder`: which allows you to choose columns based on their cardinality (number of unique values).
 * `SimpleSemanticTypeColumnDomainBuilder`: which allows you to choose columns based on their semantic types (such as numeric, or text).
 * `MapMetricColumnDomainBuilder`: which allows you to choose columns based on Map Metrics, which give a yes/no answer for individual values or rows. 

In addition, there are `DomainBuilders` that do not perform any additional filtering, but are required by the `Expectations` that are being built by the `RuleBasedProfiler`.
 * `ColumnDomainBuilder`: Outputs column Domains, which are required by column `Expectations` such as `expect_column_median_to_be_between`. `ColumnDomainBuilder` can also be used to specifically include or exclude columns if you already know which ones you want.
 * `TableDomainBuilder`: Outputs the table Domain, which is required by `Expectations` that act on tables, such as `expect_table_row_count_to_equal` or `expect_table_columns_to_match_set`.


#### `SimpleColumnSuffixDomainBuilder`

In [6]:
domain_builder: DomainBuilder = SimpleColumnSuffixDomainBuilder(
    data_context=data_context,
    batch_request=single_batch_batch_request,
    column_name_suffixes=["_amount"],
)
domains: list = domain_builder.get_domains()

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

In [7]:
# assert that the domains we get are the ones we expect
assert len(domains) == 4
assert domains == [
    {"domain_type": "column", "domain_kwargs": {"column": "fare_amount"}},
    {"domain_type": "column", "domain_kwargs": {"column": "tip_amount"}},
    {"domain_type": "column", "domain_kwargs": {"column": "tolls_amount"}},
    {"domain_type": "column", "domain_kwargs": {"column": "total_amount"}},
]
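The suffix-based selection can be pictured with plain Python string matching. The following is a rough sketch of the idea, not the `SimpleColumnSuffixDomainBuilder` implementation; `columns_with_suffix` is a hypothetical helper, and the column list is only a handful of names that appear in this notebook rather than the full table.

```python
from typing import Dict, List

# A few column names that appear in this notebook (not the full 18-column table).
column_names = [
    "vendor_id",
    "pickup_datetime",
    "fare_amount",
    "tip_amount",
    "tolls_amount",
    "total_amount",
]


def columns_with_suffix(names: List[str], suffixes: List[str]) -> List[Dict]:
    """Build one column Domain dictionary per name ending in any of the suffixes."""
    return [
        {"domain_type": "column", "domain_kwargs": {"column": name}}
        for name in names
        if name.endswith(tuple(suffixes))
    ]


sketch_domains = columns_with_suffix(column_names, ["_amount"])
print([d["domain_kwargs"]["column"] for d in sketch_domains])
```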

Next, we will continue building a `RuleBasedProfiler` using our `SimpleColumnSuffixDomainBuilder`.

## Build `Rule`
* The first `Rule` that we build will output `expect_column_values_to_not_be_null`, because this `Expectation` needs no information other than the Domain. We will add `ParameterBuilders` in a subsequent example.

In [8]:
default_expectation_configuration_builder = DefaultExpectationConfigurationBuilder(
    expectation_type="expect_column_values_to_not_be_null",
    column="$domain.domain_kwargs.column", # Get the column from domain_kwargs that are retrieved from the DomainBuilder
)

In [9]:
simple_rule: Rule = Rule(
    name="rule_with_no_parameters",
    domain_builder=domain_builder,
    expectation_configuration_builders=[default_expectation_configuration_builder],
)

## Create `RuleBasedProfiler` and add `Rule`
* We create a simple `RuleBasedProfiler` and add the `Rule` that we built in the previous step. When we run the Profiler, the output is an `ExpectationSuite` with 4 `Expectations`, as expected.

In [10]:
from great_expectations.core import ExpectationSuite
from great_expectations.rule_based_profiler.rule_based_profiler import RuleBasedProfiler

In [11]:
my_rbp: RuleBasedProfiler = RuleBasedProfiler(
    name="my_simple_rbp", data_context=data_context, config_version=1.0
)

In [12]:
my_rbp.add_rule(rule=simple_rule)


In [13]:
res: ExpectationSuite = my_rbp.run()

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

In [14]:
assert len(res.expectations) == 4

In [15]:
res.expectations

[{"kwargs": {"column": "fare_amount"}, "expectation_type": "expect_column_values_to_not_be_null", "meta": {}},
 {"kwargs": {"column": "tip_amount"}, "expectation_type": "expect_column_values_to_not_be_null", "meta": {}},
 {"kwargs": {"column": "tolls_amount"}, "expectation_type": "expect_column_values_to_not_be_null", "meta": {}},
 {"kwargs": {"column": "total_amount"}, "expectation_type": "expect_column_values_to_not_be_null", "meta": {}}]

As expected, our simple `RuleBasedProfiler` outputs an `ExpectationSuite` with 4 `Expectations`, one for each of our 4 columns.
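Conceptually, a `Rule` without parameters pairs every Domain with the expectation template. The sketch below mimics that pairing with plain dictionaries to show why 4 Domains produce 4 `Expectations`. It is an illustration under simplifying assumptions, not the Profiler's code; `build_expectations` is a hypothetical helper.

```python
from typing import Dict, List

# The four column Domains returned by the DomainBuilder above.
column_domains = [
    {"domain_type": "column", "domain_kwargs": {"column": name}}
    for name in ["fare_amount", "tip_amount", "tolls_amount", "total_amount"]
]


def build_expectations(domains: List[Dict], expectation_type: str) -> List[Dict]:
    """Emit one expectation configuration per Domain, filling in its column."""
    return [
        {
            "expectation_type": expectation_type,
            "kwargs": {"column": domain["domain_kwargs"]["column"]},
            "meta": {},
        }
        for domain in domains
    ]


suite_sketch = build_expectations(
    column_domains, "expect_column_values_to_not_be_null"
)
print(len(suite_sketch))
```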

# Example 2: `RuleBasedProfiler` with `DomainBuilder`, `ParameterBuilder`, and `ExpectationConfigurationBuilder`

## Build a `DomainBuilder`
* Using the same `SimpleColumnSuffixDomainBuilder` from our previous example.

In [16]:
domain_builder: DomainBuilder = SimpleColumnSuffixDomainBuilder(
    data_context=data_context,
    batch_request=single_batch_batch_request,
    column_name_suffixes=["_amount"],
)
domains: list = domain_builder.get_domains()

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

In [17]:
domains

[{
   "domain_type": "column",
   "domain_kwargs": {
     "column": "fare_amount"
   }
 },
 {
   "domain_type": "column",
   "domain_kwargs": {
     "column": "tip_amount"
   }
 },
 {
   "domain_type": "column",
   "domain_kwargs": {
     "column": "tolls_amount"
   }
 },
 {
   "domain_type": "column",
   "domain_kwargs": {
     "column": "total_amount"
   }
 }]

## Build a `ParameterBuilder`

`ParameterBuilders` help calculate "reasonable" parameters for Expectations based on data that is specified by a `BatchRequest`.

The largest categories include: 
- `metric_multi_batch_parameter_builder`: Which is able to calculate a numeric Metric (like `column.min`) across multiple Batches (or just one Batch).
- `value_set_multi_batch_parameter_builder`: Which is able to build a value set across multiple Batches (or just one Batch). 

In some cases, there is a better way to build a value set using regex or dates. 
- `regex_pattern_string_parameter_builder`: Which contains a set of default regex patterns and builds a value set of the best-matching patterns. Users are also able to pass in new patterns as a parameter. 
- `simple_date_format_string_parameter_builder`: Which contains a set of default datetime-format patterns and builds a value set of the best-matching patterns. Users are also able to pass in new patterns as a parameter. 

Across multiple Batches, we can build more sophisticated parameters by using sampling methods.
- `numeric_metric_range_multi_batch_parameter_builder`: Which is able to provide range estimations across Batches using sampling methods. For instance, if we expect a table's `row_count` to change between Batches, we could calculate the min / max values of `row_count` by using the `NumericMetricRangeMultiBatchParameterBuilder`. These parameters could then be used by `ExpectTableRowCountToBeBetween`.


In our example we will be using a `MetricMultiBatchParameterBuilder` to compute the `column.min` Metric for each of the 4 columns returned by our `DomainBuilder`. The column for each Domain is passed in as `metric_domain_kwargs` and is accessible using the fully qualified parameter `$domain.domain_kwargs`.

In [18]:
numeric_range_parameter_builder: MetricMultiBatchParameterBuilder = (
    MetricMultiBatchParameterBuilder(
        data_context=data_context,
        batch_request=single_batch_batch_request,
        metric_name="column.min",
        metric_domain_kwargs="$domain.domain_kwargs",  # domain kwarg values are accessible using fully qualified parameters
        name="my_column_min",
    )
)

## Build an `ExpectationConfigurationBuilder`

The `ExpectationConfigurationBuilder` is built for `expect_column_values_to_be_greater_than`, which will use the `column.min` values calculated by the `ParameterBuilder`. These are accessible using the fully qualified parameter `$parameter.my_column_min.value[-1]`. The `[-1]` indicates that we use the min value from the latest Batch (the only `Batch` in this case, since our `BatchRequest` returns a single `Batch`).

In [19]:
config_builder: DefaultExpectationConfigurationBuilder = (
    DefaultExpectationConfigurationBuilder(
        expectation_type="expect_column_values_to_be_greater_than",
        value="$parameter.my_column_min.value[-1]", # the parameter is accessible using a fully qualified parameter
        column="$domain.domain_kwargs.column", # domain kwarg values are accessible using fully qualified parameters
        name="my_column_min",
    )
)
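To make the fully qualified parameter syntax more concrete, here is a toy resolver that walks a dotted path such as `$domain.domain_kwargs.column` through nested dictionaries and applies an optional trailing index such as `[-1]`. It is a simplified sketch and not how the library resolves parameters; `resolve` and the `scope` dictionary are hypothetical names introduced for illustration.

```python
import re
from typing import Any, Dict


def resolve(reference: str, scope: Dict[str, Any]) -> Any:
    """Resolve a reference such as "$parameter.my_column_min.value[-1]"."""
    # Split the dotted path from an optional trailing [index].
    match = re.fullmatch(r"\$([\w.]+?)(?:\[(-?\d+)\])?", reference)
    if match is None:
        raise ValueError(f"cannot parse reference: {reference}")
    path, index = match.groups()
    value: Any = scope
    for key in path.split("."):
        value = value[key]  # descend one level per dotted segment
    return value if index is None else value[int(index)]


# Toy scope for one Domain; -80.3 is the single-Batch minimum of total_amount.
scope = {
    "domain": {"domain_kwargs": {"column": "total_amount"}},
    "parameter": {"my_column_min": {"value": [-80.3]}},
}

print(resolve("$domain.domain_kwargs.column", scope))
print(resolve("$parameter.my_column_min.value[-1]", scope))
```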

## Build a `Rule`, `RuleBasedProfiler`, and run 

Now we build a rule with our `ParameterBuilder`, `DomainBuilder` and `ExpectationConfigurationBuilder`.

In [20]:
simple_rule: Rule = Rule(
    name="rule_with_parameters",
    domain_builder=domain_builder,
    parameter_builders=[numeric_range_parameter_builder],
    expectation_configuration_builders=[config_builder],
)

In [21]:
my_rbp = RuleBasedProfiler(
    name="my_rbp", data_context=data_context, config_version=1.0
)


Add the `Rule` to our `RuleBasedProfiler` and run. 

In [22]:
my_rbp.add_rule(rule=simple_rule)

In [23]:
res: ExpectationSuite = my_rbp.run()

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

In [24]:
assert len(res.expectations) == 4

In [25]:
res.expectations

[{"kwargs": {"name": "my_column_min", "value": -80.0, "column": "fare_amount"}, "expectation_type": "expect_column_values_to_be_greater_than", "meta": {}},
 {"kwargs": {"name": "my_column_min", "value": 0.0, "column": "tip_amount"}, "expectation_type": "expect_column_values_to_be_greater_than", "meta": {}},
 {"kwargs": {"name": "my_column_min", "value": 0.0, "column": "tolls_amount"}, "expectation_type": "expect_column_values_to_be_greater_than", "meta": {}},
 {"kwargs": {"name": "my_column_min", "value": -80.3, "column": "total_amount"}, "expectation_type": "expect_column_values_to_be_greater_than", "meta": {}}]

The resulting `ExpectationSuite` now contains values (`-80.0`, `0.0`, etc.) that were calculated from the Batch of data defined by the `BatchRequest`.


# Appendix

* Here we have additional example configurations of `DomainBuilders` and `ParameterBuilders` that were not included in the previous 2 Examples.

## `DomainBuilders`

#### `ColumnDomainBuilder`
This `DomainBuilder` outputs column Domains, which are required by column `Expectations` such as `expect_column_median_to_be_between`.

In [26]:
from great_expectations.rule_based_profiler.domain_builder import ColumnDomainBuilder

In [27]:
domain_builder: DomainBuilder = ColumnDomainBuilder(
    data_context=data_context,
    batch_request=single_batch_batch_request,
)
domains: list = domain_builder.get_domains()
assert len(domains) == 18 # all columns in yellow_tripdata_sample_2018

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

`ColumnDomainBuilder` can also be used to specifically include or exclude columns if you already know which ones you want.

In [28]:
domain_builder: DomainBuilder = ColumnDomainBuilder(
    data_context=data_context,
    batch_request=single_batch_batch_request,
    include_column_names=["vendor_id"]
)
domains: list = domain_builder.get_domains()
domains

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

[{
   "domain_type": "column",
   "domain_kwargs": {
     "column": "vendor_id"
   }
 }]

In [29]:
domain_builder: DomainBuilder = ColumnDomainBuilder(
    data_context=data_context,
    batch_request=single_batch_batch_request,
    exclude_column_names=["vendor_id"]
)
domains: list = domain_builder.get_domains()
assert len(domains) == 17 # all columns in yellow_tripdata_sample_2018 with vendor_id excluded

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

#### `TableDomainBuilder`
This `DomainBuilder` outputs table `Domains`, which are required by `Expectations` that act on tables, such as `expect_table_row_count_to_equal` or `expect_table_columns_to_match_set`.

In [30]:
from great_expectations.rule_based_profiler.domain_builder import TableDomainBuilder

In [31]:
domain_builder: DomainBuilder = TableDomainBuilder(
    data_context=data_context,
    batch_request=single_batch_batch_request,
)
domains: list = domain_builder.get_domains()
domains

[{
   "domain_type": "table"
 }]

#### `MapMetricColumnDomainBuilder`

This `DomainBuilder` allows you to choose columns based on Map Metrics, which give a yes/no answer for individual values or rows. In this example, we use the Map Metric `column_values.nonnull` to filter out a column from `taxi_data` whose values are all `None`.

In [32]:
from great_expectations.rule_based_profiler.domain_builder import MapMetricColumnDomainBuilder

In [33]:
domain_builder: DomainBuilder = MapMetricColumnDomainBuilder(
    data_context=data_context,
    batch_request=single_batch_batch_request,
    map_metric_name="column_values.nonnull"
)
domains: list = domain_builder.get_domains()
len(domains) == 17 # filtered 1 column that was all None

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/39 [00:00<?, ?it/s]

True

#### `CategoricalColumnDomainBuilder`

This `DomainBuilder` allows you to choose columns based on their cardinality (number of unique values). The `CategoricalColumnDomainBuilder` accepts various `limit_mode` values for cardinality, and in this example we are only interested in columns that have "very_few" (10 or fewer) unique values. For a full list of valid modes, along with the associated values, please refer to the `CardinalityLimitMode` enum in:

https://github.com/great-expectations/great_expectations/blob/develop/great_expectations/rule_based_profiler/helpers/cardinality_checker.py

In [34]:
from great_expectations.rule_based_profiler.domain_builder import CategoricalColumnDomainBuilder

In [35]:
domain_builder: DomainBuilder = CategoricalColumnDomainBuilder(
    batch_request=single_batch_batch_request,
    data_context=data_context,
    limit_mode="very_few", # VERY_FEW = 10 or less
)

In [36]:
domains: list = domain_builder.get_domains()
assert len(domains) == 9

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/21 [00:00<?, ?it/s]
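The cardinality check itself is easy to picture: count the distinct values in each column and keep the columns at or below the limit. The sketch below does this for made-up data. The limit of 10 mirrors the `VERY_FEW = 10 or less` comment above, while `trip_id`, `payment_type`, and `low_cardinality_columns` are hypothetical names used only for illustration.

```python
from typing import Dict, List

VERY_FEW_LIMIT = 10  # mirrors the "VERY_FEW = 10 or less" comment above

# Hypothetical toy columns, made up for illustration.
toy_columns: Dict[str, List[int]] = {
    "vendor_id": [1, 2, 4, 1, 2, 1],
    "trip_id": list(range(100)),  # 100 distinct values: too many
    "payment_type": [1, 2, 1, 1, 3, 4],
}


def low_cardinality_columns(columns: Dict[str, List[int]], limit: int) -> List[str]:
    """Return the names of columns with at most `limit` distinct values."""
    return [name for name, values in columns.items() if len(set(values)) <= limit]


print(low_cardinality_columns(toy_columns, VERY_FEW_LIMIT))
```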

#### `SimpleSemanticTypeColumnDomainBuilder`


This `DomainBuilder` allows you to choose columns based on their semantic types (such as `numeric` or `text`).

Semantic types are defined as an `Enum` object called `SemanticDomainTypes`, which can be found here :
    https://github.com/great-expectations/great_expectations/blob/develop/great_expectations/rule_based_profiler/types/domain.py

In [37]:
from great_expectations.rule_based_profiler.domain_builder import SimpleSemanticTypeColumnDomainBuilder

In [38]:
domain_builder: DomainBuilder = SimpleSemanticTypeColumnDomainBuilder(
    batch_request=single_batch_batch_request,
    data_context=data_context,
    semantic_types=['numeric']
)

In [39]:
domains: list = domain_builder.get_domains()
assert len(domains) == 15

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

## `ParameterBuilders`

`ParameterBuilders` work under the hood by populating a `ParameterContainer`, which can be shared by multiple `ParameterBuilders`. Building parameters requires a Domain and a `metric_name`; inside a `Rule`, the `domain_kwargs` produced by the `DomainBuilder` are accessible using the fully qualified parameter `$domain.domain_kwargs`.

For the sake of simplicity, we will define a `Domain` object directly using the `Domain()` constructor, and pass in a column name within `domain_kwargs`. 

In [40]:
from great_expectations.rule_based_profiler.types.domain import Domain
from great_expectations.execution_engine.execution_engine import MetricDomainTypes
from great_expectations.rule_based_profiler.types import ParameterContainer

In [41]:
domain: Domain = Domain(domain_type=MetricDomainTypes.COLUMN, domain_kwargs = {'column': 'total_amount'})

#### `MetricMultiBatchParameterBuilder`

The `MetricMultiBatchParameterBuilder` computes a Metric on data from one or more Batches. It takes `metric_name`, `metric_domain_kwargs`, and (optionally) `metric_value_kwargs` as arguments.

In [42]:
from great_expectations.rule_based_profiler.parameter_builder import MetricMultiBatchParameterBuilder

In [43]:
numeric_range_parameter_builder: MetricMultiBatchParameterBuilder = (
    MetricMultiBatchParameterBuilder(
        data_context=data_context,
        batch_request=multi_batch_batch_request, # we are passing in our multi_batch_batch_request here
        metric_name="column.min",
        metric_domain_kwargs=domain.domain_kwargs,
        name="my_column_min",
    )
)

In [44]:
parameter_container: ParameterContainer = ParameterContainer(parameter_nodes=None)

In [45]:
numeric_range_parameter_builder.build_parameters(domain=domain, parameter_container=parameter_container)
# we check the parameter container
print(parameter_container.parameter_nodes)


Calculating Metrics:   0%|          | 0/48 [00:00<?, ?it/s]

{'parameter': {'parameter': {'my_column_min': {'value': [-19.8, -57.3, -6.8, -63.06, -11.8, -6.8, -30.6, -16.8, -4.3, -100.8, -12.8, -80.3], 'details': {'metric_configuration': {'metric_name': 'column.min', 'domain_kwargs': {'column': 'total_amount'}, 'metric_value_kwargs': None, 'metric_dependencies': None}, 'num_batches': 12}}}}}


`my_column_min[value]` now contains a list of 12 values, which are the minimum values of the `total_amount` column for each of the 12 Batches in the 2018 `taxi_data` dataset. If we were to use the latest (December) value in an `ExpectationConfigurationBuilder`, it would be accessible through the fully qualified parameter `$parameter.my_column_min.value[-1]`.
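The shape of this result (one metric value per Batch, with the most recent Batch last) can be reproduced with a few lines of plain Python. The sketch below uses three made-up Batches rather than the real taxi data, and `metric_per_batch` is a hypothetical helper, so treat it as an illustration of the data layout only.

```python
from typing import Callable, Dict, List

# Hypothetical toy Batches of total_amount values (three "months", made up).
toy_batches: List[List[float]] = [
    [12.5, -3.0, 40.1],
    [7.8, 22.0, 15.3],
    [-1.5, 9.9, 30.0],
]


def metric_per_batch(
    batches: List[List[float]], metric: Callable[[List[float]], float]
) -> Dict:
    """Apply a metric to every Batch and keep the results in Batch order."""
    values = [metric(batch) for batch in batches]
    return {"value": values, "details": {"num_batches": len(batches)}}


my_column_min = metric_per_batch(toy_batches, min)
latest_min = my_column_min["value"][-1]  # same idea as ...value[-1]

print(my_column_min)
print(latest_min)
```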

#### `ValueSetMultiBatchParameterBuilder`

The `ValueSetMultiBatchParameterBuilder` is able to build a value set across multiple Batches (or just one Batch). 

In [46]:
from great_expectations.rule_based_profiler.parameter_builder import ValueSetMultiBatchParameterBuilder

In [47]:
domain: Domain = Domain(domain_type=MetricDomainTypes.COLUMN, domain_kwargs = {'column': 'vendor_id'})

In [48]:
# instantiating a new parameter container, since it can contain the results of more than one ParameterBuilder.
parameter_container: ParameterContainer = ParameterContainer(parameter_nodes=None)

In [49]:
value_set_parameter_builder: ValueSetMultiBatchParameterBuilder = (
    ValueSetMultiBatchParameterBuilder(
        data_context=data_context,
        batch_request= multi_batch_batch_request,
        metric_domain_kwargs=domain.domain_kwargs,
        name="my_value_set",
    )
)

In [50]:
value_set_parameter_builder.build_parameters(
    parameter_container=parameter_container,
    domain=domain,
    parameters = {domain.id: parameter_container}
)

Calculating Metrics:   0%|          | 0/48 [00:00<?, ?it/s]

In [51]:
print(parameter_container.parameter_nodes)

{'parameter': {'parameter': {'my_value_set': {'value': [1, 2, 4], 'details': {'metric_configuration': {'metric_name': 'column.distinct_values', 'domain_kwargs': {'column': 'vendor_id'}, 'metric_value_kwargs': None, 'metric_dependencies': None}, 'num_batches': 12}}}}}


`my_value_set[value]` now contains a list of 3 values, which are all of the unique `vendor_id` values across the 12 Batches in the 2018 `taxi_data` dataset.
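Building a value set across Batches amounts to taking the union of the distinct values seen in each Batch. A minimal sketch with made-up Batches of `vendor_id` values (the data below is hypothetical and much smaller than the real Batches):

```python
from typing import List

# Hypothetical toy Batches of vendor_id values.
vendor_id_batches: List[List[int]] = [
    [1, 2, 1],
    [2, 2, 4],
    [1, 4],
]

# Union of the distinct values in every Batch, sorted for stable output.
value_set = sorted(set().union(*vendor_id_batches))

print(value_set)
```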

#### `RegexPatternStringParameterBuilder`

The `RegexPatternStringParameterBuilder` contains a set of default regex patterns and builds a value set of the best-matching patterns. Users are also able to pass in new patterns as a parameter.

In [52]:
from great_expectations.rule_based_profiler.parameter_builder import RegexPatternStringParameterBuilder

In [53]:
domain: Domain = Domain(domain_type=MetricDomainTypes.COLUMN, domain_kwargs = {'column': 'vendor_id'})

* `vendor_id` is a single integer. Let's see if our default patterns can match it. 

In [54]:
parameter_container: ParameterContainer = ParameterContainer(parameter_nodes=None)

In [55]:
regex_parameter_builder: RegexPatternStringParameterBuilder = (
    RegexPatternStringParameterBuilder(
        data_context=data_context,
        batch_request=single_batch_batch_request,
        metric_domain_kwargs=domain.domain_kwargs,
        name="my_regex_set",
    )
)

In [56]:
regex_parameter_builder.build_parameters(
    parameter_container=parameter_container,
    domain=domain,
    parameters = {domain.id: parameter_container}
)

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/25 [00:00<?, ?it/s]



In [57]:
print(parameter_container.parameter_nodes)

{'parameter': {'parameter': {'my_regex_set': {'value': [], 'details': {'evaluated_regexes': {'^\\s+/': 0.0, '\\s+/$': 0.0, '\\b[0-9a-fA-F]{8}\\b-[0-9a-fA-F]{4}-[0-5][0-9a-fA-F]{3}-[089ab][0-9a-fA-F]{3}-\\b[0-9a-fA-F]{12}\\b ': 0.0, '/https?:\\/\\/(www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{2,256}\\.[a-z]{2,6}\\b([-a-zA-Z0-9@:%_\\+.~#()?&//=]*)/': 0.0, '/\\d+/': 0.0, '/[A-Za-z0-9\\.,;:!?()\\"\'%\\-]+/': 0.0, '/<\\/?(?:p|a|b|img)(?: \\/)?>/': 0.0, '/-?\\d+/': 0.0, '/-?\\d+(\\.\\d*)?/': 0.0, '/(?:[A-Fa-f0-9]){0,4}(?: ?:? ?(?:[A-Fa-f0-9]){0,4}){0,7}/': 0.0, '/(?:25[0-5]|2[0-4]\\d|[01]\\d{2}|\\d{1,2})(?:.(?:25[0-5]|2[0-4]\\d|[01]\\d{2}|\\d{1,2})){3}/': 0.0}, 'threshold': 1.0}}}}}


* It looks like `my_regex_set[value]` is an empty list, which means that none of the `evaluated_regexes` matched our Domain. Let's try again, this time passing in candidate regexes of our own: `^\\d{1}$` and `^\\d{2}$`, which match 1-digit and 2-digit integers respectively, anchored at the beginning and end of the string.

In [58]:
regex_parameter_builder: RegexPatternStringParameterBuilder = (
    RegexPatternStringParameterBuilder(
        data_context=data_context,
        batch_request=single_batch_batch_request,
        metric_domain_kwargs=domain.domain_kwargs,
        candidate_regexes=["^\\d{1}$", "^\\d{2}$"], # currently we don't support a single-candidate list (bugfix needed)
        name="my_regex_set",
    )
)

In [59]:
regex_parameter_builder.build_parameters(
    parameter_container=parameter_container,
    domain=domain,
    parameters = {domain.id: parameter_container}
)

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/7 [00:00<?, ?it/s]

In [60]:
print(parameter_container.parameter_nodes)

{'parameter': {'parameter': {'my_regex_set': {'value': ['^\\d{1}$'], 'details': {'evaluated_regexes': {'^\\d{1}$': 1.0, '^\\d{2}$': 0.0}, 'threshold': 1.0}}}}}


* Now `my_regex_set[value]` contains `^\\d{1}$`.
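The `evaluated_regexes` and `threshold` entries in the output suggest a simple scoring scheme: compute, for each candidate regex, the fraction of values it matches, and keep the candidates whose fraction reaches the threshold. The sketch below implements that scheme with the standard `re` module on made-up values. It is an approximation for illustration, not the `RegexPatternStringParameterBuilder` implementation, and `evaluate_regexes` is a hypothetical helper.

```python
import re
from typing import Dict, List, Tuple


def evaluate_regexes(
    values: List[int], candidate_regexes: List[str], threshold: float = 1.0
) -> Tuple[List[str], Dict[str, float]]:
    """Score each regex by the fraction of values it matches."""
    ratios: Dict[str, float] = {}
    for pattern in candidate_regexes:
        compiled = re.compile(pattern)
        matched = sum(1 for value in values if compiled.search(str(value)))
        ratios[pattern] = matched / len(values)
    # Keep only the patterns whose match ratio reaches the threshold.
    best = [pattern for pattern, ratio in ratios.items() if ratio >= threshold]
    return best, ratios


# Hypothetical sample of vendor_id values.
vendor_ids = [1, 2, 4, 1, 2]
best_regexes, regex_ratios = evaluate_regexes(vendor_ids, [r"^\d{1}$", r"^\d{2}$"])

print(best_regexes)
print(regex_ratios)
```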

#### `SimpleDateFormatStringParameterBuilder`

The `SimpleDateFormatStringParameterBuilder` contains a set of default Datetime format patterns and builds a value set of the best-matching patterns. Users are also able to pass in new patterns as a parameter.

In [61]:
from great_expectations.rule_based_profiler.parameter_builder import SimpleDateFormatStringParameterBuilder

In [62]:
domain: Domain = Domain(domain_type=MetricDomainTypes.COLUMN, domain_kwargs = {'column': 'pickup_datetime'})

In [63]:
parameter_container: ParameterContainer = ParameterContainer(parameter_nodes=None)

In [64]:
simple_date_format_string_parameter_builder: SimpleDateFormatStringParameterBuilder = (
    SimpleDateFormatStringParameterBuilder(
        data_context=data_context,
        batch_request=single_batch_batch_request,
        metric_domain_kwargs=domain.domain_kwargs,
        name="my_value_set",
    )
)

In [65]:
simple_date_format_string_parameter_builder.build_parameters(
    parameter_container=parameter_container,
    domain=domain,
    parameters = {domain.id: parameter_container}
)

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/121 [00:00<?, ?it/s]

In [66]:
print(parameter_container.parameter_nodes)

{'parameter': {'parameter': {'my_value_set': {'value': '%Y-%m-%d %H:%M:%S', 'details': {'success_ratio': 1.0, 'candidate_strings': ['%H:%M:%S', '%H:%M:%S,%f', '%H:%M:%S.%f', '%Y %b %d %H:%M:%S.%f', '%Y %b %d %H:%M:%S.%f %Z', '%Y %b %d %H:%M:%S.%f*%Z', '%Y%m%d %H:%M:%S.%f', '%Y-%m-%d', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d %H:%M:%S %z', '%Y-%m-%d %H:%M:%S%z', '%Y-%m-%d %H:%M:%S,%f', '%Y-%m-%d %H:%M:%S,%f%z', '%Y-%m-%d %H:%M:%S.%f', '%Y-%m-%d %H:%M:%S.%f%z', "%Y-%m-%d'T'%H:%M:%S", "%Y-%m-%d'T'%H:%M:%S%z", "%Y-%m-%d'T'%H:%M:%S'%z'", "%Y-%m-%d'T'%H:%M:%S.%f", "%Y-%m-%d'T'%H:%M:%S.%f'%z'", '%Y-%m-%d*%H:%M:%S', '%Y-%m-%d*%H:%M:%S:%f', '%Y-%m-%dT%z', '%Y/%m/%d', '%Y/%m/%d*%H:%M:%S', '%b %d %H:%M:%S', '%b %d %H:%M:%S %Y', '%b %d %H:%M:%S %z', '%b %d %H:%M:%S %z %Y', '%b %d %Y %H:%M:%S', '%b %d, %Y %H:%M:%S %p', '%d %b %Y %H:%M:%S', '%d %b %Y %H:%M:%S*%f', '%d-%b-%Y %H:%M:%S', '%d-%b-%Y %H:%M:%S.%f', '%d-%m-%Y', '%d/%b %H:%M:%S,%f', '%d/%b/%Y %H:%M:%S', '%d/%b/%Y:%H:%M:%S', '%d/%b/%Y:%H:%M:%S %z', '%d

In [67]:
parameter_container.parameter_nodes["parameter"]["parameter"]["my_value_set"]["value"]

'%Y-%m-%d %H:%M:%S'

The result contains our matching `datetime` pattern, which is `'%Y-%m-%d %H:%M:%S'`.
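The `success_ratio` in the output can be understood as the share of values that parse under a given format string. The sketch below computes that ratio with `datetime.strptime` for three of the candidate formats listed above. The sample timestamps are made up, `success_ratio` is a hypothetical helper, and the real builder may evaluate candidates differently.

```python
from datetime import datetime
from typing import Dict, List


def success_ratio(values: List[str], fmt: str) -> float:
    """Fraction of values that parse successfully with the given format."""
    successes = 0
    for value in values:
        try:
            datetime.strptime(value, fmt)
            successes += 1
        except ValueError:
            pass  # value does not fit this format
    return successes / len(values)


# Hypothetical sample of pickup_datetime strings.
pickup_times = ["2018-12-01 00:28:22", "2018-12-01 00:52:29", "2018-12-31 23:59:59"]
candidate_formats = ["%H:%M:%S", "%Y-%m-%d", "%Y-%m-%d %H:%M:%S"]

format_ratios: Dict[str, float] = {
    fmt: success_ratio(pickup_times, fmt) for fmt in candidate_formats
}
best_format = max(format_ratios, key=format_ratios.get)

print(format_ratios)
print(best_format)
```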

#### `NumericMetricRangeMultiBatchParameterBuilder`

The `NumericMetricRangeMultiBatchParameterBuilder` is able to provide range estimations across Batches using sampling methods. For instance, if we expect a table's row_count to change between Batches, we could calculate the min / max values of row_count by using the `NumericMetricRangeMultiBatchParameterBuilder`. These parameters could then be used by `Expectations` that take in ranges, like `ExpectTableRowCountToBeBetween`, or `ExpectColumnValuesToBeBetween`.

In this example, we take a single Metric, `column.mean`, and calculate it for a single column, `total_amount`. The parameter we build is the column-mean range: the min and max values of the `total_amount` mean, estimated from random samples across the 12 Batches of the 2018 `taxi_data` dataset.

We also pass in specifications for the estimator, namely `bootstrap` sampling with a false-positive rate of 0.01.

In [68]:
from great_expectations.rule_based_profiler.parameter_builder import NumericMetricRangeMultiBatchParameterBuilder

In [69]:
domain: Domain = Domain(domain_type=MetricDomainTypes.COLUMN, domain_kwargs = {'column': 'total_amount'})

In [70]:
numeric_metric_range_parameter_builder: NumericMetricRangeMultiBatchParameterBuilder = NumericMetricRangeMultiBatchParameterBuilder(
    name="column_mean_range",
    metric_name="column.mean",
    estimator="bootstrap",
    metric_domain_kwargs=domain.domain_kwargs,
    false_positive_rate=1.0e-2,
    round_decimals=0,
    data_context=data_context,
    batch_request=multi_batch_batch_request,
)

In [71]:
parameter_container: ParameterContainer = ParameterContainer(parameter_nodes=None)

In [72]:
numeric_metric_range_parameter_builder.build_parameters(
    parameter_container=parameter_container,
    domain=domain,
    parameters = {domain.id: parameter_container}
)

Calculating Metrics:   0%|          | 0/48 [00:00<?, ?it/s]

In [73]:
print(parameter_container.parameter_nodes)

{'parameter': {'parameter': {'column_mean_range': {'value': {'value_range': [16.0, 43.0]}, 'details': {'metric_configuration': {'metric_name': 'column.mean', 'domain_kwargs': {'column': 'total_amount'}, 'metric_value_kwargs': None, 'metric_dependencies': None}, 'num_batches': 12}}}}}


As we see, the mean value range for the `total_amount` column is `16.0` to `43.0`.
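For intuition about what a bootstrap range estimate involves, the sketch below resamples a list of per-Batch means with replacement, takes a lower and an upper quantile of every resample, and averages them. This is a simplified illustration using only the standard library. The 12 input means are made up, splitting the false-positive rate evenly between both tails is an assumption, and `quantile` and `bootstrap_range` are hypothetical helpers, so the library's estimator may well differ in its details.

```python
import random
from typing import List, Tuple


def quantile(sorted_values: List[float], q: float) -> float:
    """Linearly interpolated quantile of an already sorted list."""
    position = q * (len(sorted_values) - 1)
    lower_index = int(position)
    upper_index = min(lower_index + 1, len(sorted_values) - 1)
    fraction = position - lower_index
    lower_value = sorted_values[lower_index]
    return lower_value + fraction * (sorted_values[upper_index] - lower_value)


def bootstrap_range(
    values: List[float],
    false_positive_rate: float,
    n_resamples: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Average the tail quantiles over many resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    lower_q = false_positive_rate / 2  # assumption: two-sided, even split
    upper_q = 1 - false_positive_rate / 2
    lowers, uppers = [], []
    for _ in range(n_resamples):
        resample = sorted(rng.choices(values, k=len(values)))
        lowers.append(quantile(resample, lower_q))
        uppers.append(quantile(resample, upper_q))
    return sum(lowers) / n_resamples, sum(uppers) / n_resamples


# Hypothetical per-Batch column means (12 made-up values, one per month).
batch_means = [16.2, 16.5, 16.9, 17.1, 17.4, 17.0, 17.3, 17.2, 17.6, 17.8, 17.5, 17.9]
lower, upper = bootstrap_range(batch_means, false_positive_rate=1.0e-2)

print(round(lower, 2), round(upper, 2))
```

The estimated bounds always fall within the smallest and largest observed values, since every resample is drawn from the observed data.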