# How to write multi-batch `BatchRequest`
* A `BatchRequest` facilitates the return of a `Batch` of data from a configured `Datasource`. To learn more about `Batches`, please refer to the [related documentation](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_get_a_batch_of_data_from_a_configured_datasource#1-construct-a-batchrequest).
* A `BatchRequest` can return 0 or more Batches of data, depending on the underlying data and how it is configured. This guide will help you configure `BatchRequests` to return multiple Batches, which can be used by:
   1. Self-Initializing Expectations to estimate parameters.
   2. DataAssistants to estimate Expectations and parameters.
   
* Note: Multi-batch `BatchRequests` are not supported by `RuntimeDataConnector`.


## FileSystem Example

### Example Directory

Imagine we have a directory of 12 CSV files, each corresponding to 1 month of taxi ride data:

```
yellow_tripdata_sample_2020-01.csv
yellow_tripdata_sample_2020-02.csv
yellow_tripdata_sample_2020-03.csv
yellow_tripdata_sample_2020-04.csv
yellow_tripdata_sample_2020-05.csv
yellow_tripdata_sample_2020-06.csv
yellow_tripdata_sample_2020-07.csv
yellow_tripdata_sample_2020-08.csv
yellow_tripdata_sample_2020-09.csv
yellow_tripdata_sample_2020-10.csv
yellow_tripdata_sample_2020-11.csv
yellow_tripdata_sample_2020-12.csv
```


In [1]:
import great_expectations as ge

from ruamel import yaml

from great_expectations.core.batch import BatchRequest, RuntimeBatchRequest

from great_expectations.rule_based_profiler.rule.rule import Rule
from great_expectations.rule_based_profiler.rule_based_profiler import RuleBasedProfiler, RuleBasedProfilerResult

from great_expectations.rule_based_profiler.domain_builder import (
    DomainBuilder,
    ColumnDomainBuilder,
)
from great_expectations.rule_based_profiler.parameter_builder import (
    MetricMultiBatchParameterBuilder,
)
from great_expectations.rule_based_profiler.expectation_configuration_builder import (
    DefaultExpectationConfigurationBuilder,
)

* Load `DataContext`

In [2]:
data_context: ge.DataContext = ge.get_context()

### `ConfiguredDataConnector` Example

* Add a `Datasource` named `taxi_multi_batch_configured_datasource` with two `ConfiguredAssetFilesystemDataConnectors`. The key difference between them is the `pattern` each asset uses to match file names. Depending on the `pattern`, an asset contains either a single Batch (corresponding to 1 CSV file) or 12 Batches (corresponding to the 12 CSV files for 2020).

* The first DataConnector is called `configured_data_connector_single_batch_asset`. Each of its assets uses a `pattern` that matches exactly one file name.
    * For the directory above, we define 2 Data Assets (`yellow_trip_data_jan` and `yellow_trip_data_feb`), with 1 Batch each.
    * This can be seen in the output of `test_yaml_config()`, which shows the 2 Data Assets, with 1 Batch each.

    * Here is the output:

    ```
    Available data_asset_names (2 of 2):
		yellow_trip_data_feb (1 of 1): ['yellow_tripdata_sample_2020-02.csv']
		yellow_trip_data_jan (1 of 1): ['yellow_tripdata_sample_2020-01.csv']
    ```

* The second DataConnector is called `configured_data_connector_multi_batch_asset`.
    * It defines a single Data Asset, `yellow_tripdata_all_months`, whose `pattern` (`yellow_tripdata_sample_(.*)\\.csv`) captures the part of the file name after `yellow_tripdata_sample_` (for example, `2020-01`) as the `month` group.
    * For the files in our directory, this gives a single Data Asset with 12 Batches, one for each month.
    * This can be seen in the output of `test_yaml_config()`, which shows 3 of the 12 Batches corresponding to `yellow_tripdata_all_months`.
    * Here is the output:

    ```
    Available data_asset_names (1 of 1):
		yellow_tripdata_all_months (3 of 12): ['yellow_tripdata_sample_2020-01.csv', 'yellow_tripdata_sample_2020-02.csv', 'yellow_tripdata_sample_2020-03.csv']
    ```
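
The way a `pattern` and its `group_names` turn file names into Batch identifiers can be sketched with Python's standard `re` module. This is only an illustration of the idea, not the code Great Expectations runs internally. The helper name `match_batch_identifiers` is hypothetical, and the pattern `yellow_tripdata_sample_(.*)\.csv` with the group name `month` is taken from the `datasource_config` used in this guide.

```python
import re
from typing import Dict, List, Optional


def match_batch_identifiers(
    file_name: str, pattern: str, group_names: List[str]
) -> Optional[Dict[str, str]]:
    """Return {group_name: captured_text} if the file name matches, else None.

    Hypothetical helper for illustration; not part of Great Expectations.
    """
    match = re.fullmatch(pattern, file_name)
    if match is None:
        return None
    # Pair each capture group with its name, in order.
    return dict(zip(group_names, match.groups()))


# The 12 monthly files from the example directory.
file_names = [f"yellow_tripdata_sample_2020-{month:02d}.csv" for month in range(1, 13)]

# One asset, one capture group: every matching file becomes one Batch.
batch_identifiers = [
    match_batch_identifiers(name, r"yellow_tripdata_sample_(.*)\.csv", ["month"])
    for name in file_names
]

print(len(batch_identifiers))
print(batch_identifiers[0])
```

A pattern that matches only one file name yields a single Batch for its asset, while a pattern with a capture group that varies across files yields one Batch per file.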

In [3]:
data_path: str = "../../../../test_sets/taxi_yellow_tripdata_samples/samples_2020"

datasource_config = {
    "name": "taxi_multi_batch_configured_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "configured_data_connector_single_batch_asset": {
            "class_name": "ConfiguredAssetFilesystemDataConnector",
            "base_directory": data_path,
            "assets": {
                "yellow_trip_data_jan":
                {
                    "group_names": ["full_name"],
                    "pattern": "yellow_tripdata_sample_2020-01\\.csv",
                },
                "yellow_trip_data_feb":
                {
                    "group_names": ["full_name"],
                    "pattern": "yellow_tripdata_sample_2020-02\\.csv",
                }
            },
        },
        
        "configured_data_connector_multi_batch_asset": {
            "class_name": "ConfiguredAssetFilesystemDataConnector",
            "base_directory": data_path,
            "assets": {
                "yellow_tripdata_all_months":{
                    "pattern": "yellow_tripdata_sample_(.*)\\.csv",
                    "group_names": ["month"],

                }
            },
        },
        
    },
}

data_context.test_yaml_config(yaml.dump(datasource_config))

Attempting to instantiate class from config...
	Instantiating as a Datasource, since class_name is Datasource
	Successfully instantiated Datasource


ExecutionEngine class name: PandasExecutionEngine
Data Connectors:
	configured_data_connector_multi_batch_asset : ConfiguredAssetFilesystemDataConnector

	Available data_asset_names (1 of 1):
		yellow_tripdata_all_months (3 of 12): ['yellow_tripdata_sample_2020-01.csv', 'yellow_tripdata_sample_2020-02.csv', 'yellow_tripdata_sample_2020-03.csv']

	Unmatched data_references (0 of 0):[]

	configured_data_connector_single_batch_asset : ConfiguredAssetFilesystemDataConnector

	Available data_asset_names (2 of 2):
		yellow_trip_data_feb (1 of 1): ['yellow_tripdata_sample_2020-02.csv']
		yellow_trip_data_jan (1 of 1): ['yellow_tripdata_sample_2020-01.csv']

	Unmatched data_references (3 of 22):['yellow_tripdata_sample_2020-01.csv', 'yellow_tripdata_sample_2020-03.csv', 'yellow_tripdata_sample_2020-04.csv']



<great_expectations.datasource.new_datasource.Datasource at 0x7f97e88fc2e0>

In [4]:
# add_datasource only if it doesn't already exist in our configuration
try:
    data_context.get_datasource(datasource_config["name"])
except ValueError:
    data_context.add_datasource(**datasource_config)

## BatchRequest

* Depending on which `DataConnector` you send a `BatchRequest` to, you will retrieve a different number of `Batches`.

* Single Batch returned by `configured_data_connector_single_batch_asset`

In [5]:
single_batch_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_configured_datasource",
    data_connector_name="configured_data_connector_single_batch_asset",
    data_asset_name="yellow_trip_data_jan",
)

In [6]:
batch_list = data_context.get_batch_list(batch_request=single_batch_batch_request)

In [7]:
batch_list

[<great_expectations.core.batch.Batch at 0x7f97e8958310>]

* Multiple Batches returned by `configured_data_connector_multi_batch_asset`

In [13]:
multi_batch_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_configured_datasource",
    data_connector_name="configured_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_all_months",
)

In [14]:
multi_batch_batch_list = data_context.get_batch_list(batch_request=multi_batch_batch_request)

In [15]:
multi_batch_batch_list

[<great_expectations.core.batch.Batch at 0x7f97eca781f0>,
 <great_expectations.core.batch.Batch at 0x7f97eca7adc0>,
 <great_expectations.core.batch.Batch at 0x7f97edb0bfd0>,
 <great_expectations.core.batch.Batch at 0x7f97ecc315e0>,
 <great_expectations.core.batch.Batch at 0x7f97ecc329d0>,
 <great_expectations.core.batch.Batch at 0x7f97ece8bdc0>,
 <great_expectations.core.batch.Batch at 0x7f97ecd9bfd0>,
 <great_expectations.core.batch.Batch at 0x7f97ed16e5e0>,
 <great_expectations.core.batch.Batch at 0x7f97ed306be0>,
 <great_expectations.core.batch.Batch at 0x7f97eded7340>,
 <great_expectations.core.batch.Batch at 0x7f97edfeea60>,
 <great_expectations.core.batch.Batch at 0x7f97ee880040>]

* You can also get a single Batch from a multi-batch DataConnector by passing in `data_connector_query`. Index `-1` will return the most recent Batch (month = `2020-12`), and index `0` will return the first Batch (month = `2020-01`).
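
The index semantics can be pictured with an ordinary Python list. The sketch below is not Great Expectations code: it uses plain strings as stand-ins for Batches, the helper `select_batches` is hypothetical, and it assumes that the Batches are ordered by month and that `index` is applied to that ordered list in the same way as a Python list index.

```python
from typing import List, Union

# Hypothetical stand-ins for the 12 Batches: only the `month` identifier is kept.
months: List[str] = [f"2020-{month:02d}" for month in range(1, 13)]


def select_batches(batch_ids: List[str], index: Union[int, slice]) -> List[str]:
    """Select by integer or slice, always returning a list of Batches."""
    if isinstance(index, int):
        return [batch_ids[index]]
    return batch_ids[index]


first = select_batches(months, 0)                      # January 2020
most_recent = select_batches(months, -1)               # December 2020
last_quarter = select_batches(months, slice(-3, None)) # October to December 2020

print(first, most_recent, last_quarter)
```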

In [18]:
multi_batch_batch_list[0].to_dict()  # first Batch: 'batch_identifiers': {'month': '2020-01'}

{'data': '<great_expectations.execution_engine.pandas_batch_data.PandasBatchData object at 0x7f97e88d0ac0>',
 'batch_request': {'datasource_name': 'taxi_multi_batch_configured_datasource',
  'data_connector_name': 'configured_data_connector_multi_batch_asset',
  'data_asset_name': 'yellow_tripdata_all_months',
  'limit': None,
  'data_connector_query': None,
  'batch_spec_passthrough': None},
 'batch_definition': {'datasource_name': 'taxi_multi_batch_configured_datasource',
  'data_connector_name': 'configured_data_connector_multi_batch_asset',
  'data_asset_name': 'yellow_tripdata_all_months',
  'batch_identifiers': {'month': '2020-01'}},
 'batch_spec': {'path': '/Users/work/Development/great_expectations/tests/test_fixtures/rule_based_profiler/example_notebooks/great_expectations/../../../../test_sets/taxi_yellow_tripdata_samples/samples_2020/yellow_tripdata_sample_2020-01.csv'},
 'batch_markers': {'ge_load_time': '20220721T215645.992194Z',
  'pandas_data_fingerprint': 'f8a392d5db80c

# Using auto-initializing `Expectations` to generate parameters

* We will generate a `Validator` using our `multi_batch_batch_list`

In [19]:
multi_batch_batch_list = data_context.get_batch_list(batch_request=multi_batch_batch_request)

In [20]:
example_suite = data_context.create_expectation_suite(expectation_suite_name="example_suite", overwrite_existing=True)

In [21]:
validator = data_context.get_validator_using_batch_list(batch_list=multi_batch_batch_list, expectation_suite=example_suite)

* When you run methods on the Validator, they will typically run on the most recent Batch (index `-1`), even if the Validator has access to a longer Batch list. For example, notice that the `pickup_datetime` and `dropoff_datetime` values below all fall in December, indicating that `head()` ran on the most recent Batch.

In [22]:
validator.head()

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,rate_code_id,store_and_fwd_flag,pickup_location_id,dropoff_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,2.0,2020-12-15 12:20:27,2020-12-15 12:40:49,4.0,5.76,1.0,N,209,237,1.0,21.0,0.0,0.5,2.5,0.0,0.3,26.8,2.5
1,2.0,2020-12-28 12:51:25,2020-12-28 13:15:12,1.0,11.64,1.0,N,161,220,1.0,33.5,0.0,0.5,5.0,2.8,0.3,44.6,2.5
2,2.0,2020-12-27 10:43:42,2020-12-27 10:51:05,1.0,1.22,1.0,N,163,48,1.0,7.0,0.0,0.5,2.06,0.0,0.3,12.36,2.5
3,2.0,2020-12-08 13:42:52,2020-12-08 13:54:45,1.0,1.84,1.0,N,137,229,2.0,9.0,0.0,0.5,0.0,0.0,0.3,12.3,2.5
4,2.0,2020-12-19 11:56:43,2020-12-19 12:08:43,1.0,1.55,1.0,N,24,74,1.0,9.5,0.0,0.5,2.58,0.0,0.3,12.88,0.0


### Typical Workflow
* A `batch_list` becomes really useful when you are calculating parameters for auto-initializing Expectations, as they use a `RuleBasedProfiler` under the hood to calculate parameters.

* Here is an example running `expect_column_median_to_be_between()` by "guessing" at the `min_value` and `max_value`. 

In [23]:
validator.expect_column_median_to_be_between(column="trip_distance", min_value=0, max_value=1)

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

{
  "meta": {},
  "success": false,
  "result": {
    "observed_value": 1.61
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

* The observed value for the most recent Batch (December 2020) is `1.61`, which falls outside the guessed range of `0` to `1`, so the Expectation fails.

* Now we run the same Expectation again, but this time with `auto=True`. This means the `median` values are calculated across the `batch_list` associated with the `Validator` (i.e., the 12 Batches for 2020), which gives a `min_value` of `1.6` and a `max_value` of `1.92`.

In [24]:
validator.expect_column_median_to_be_between(column="trip_distance", auto=True)




Generating Expectations:   0%|          | 0/1 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

{
  "meta": {},
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_median_to_be_between",
    "kwargs": {
      "column": "trip_distance",
      "min_value": 1.6,
      "max_value": 1.92,
      "strict_min": false,
      "strict_max": false
    },
    "meta": {
      "auto_generated_at": "20220721T220025.571580Z",
      "great_expectations_version": "0.15.13+79.g5c6db848e.dirty"
    }
  },
  "result": {
    "observed_value": 1.61
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

* With `auto=True`, the Expectation is also automatically run against the most recent Batch (which has an observed value of `1.61`), and the Expectation passes.
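
The idea behind `auto=True` can be illustrated in plain Python: compute the metric (here, the median) on every Batch, derive a range from those per-Batch values, then check the most recent Batch against that range. The sketch below is a deliberate simplification. It takes the plain minimum and maximum of the per-Batch medians, whereas the `RuleBasedProfiler` may use a different estimator, and the trip distances are made-up numbers, not the taxi data. The function name `estimate_median_range` is hypothetical.

```python
from statistics import median
from typing import Dict, List, Tuple

# Hypothetical trip distances for three Batches (not the real taxi data).
batches: Dict[str, List[float]] = {
    "2020-10": [1.2, 1.7, 2.4],
    "2020-11": [0.9, 1.6, 3.1],
    "2020-12": [1.1, 1.9, 2.2],
}


def estimate_median_range(batches: Dict[str, List[float]]) -> Tuple[float, float]:
    """Return (min, max) of the per-Batch medians (simplified estimator)."""
    per_batch_medians = [median(values) for values in batches.values()]
    return min(per_batch_medians), max(per_batch_medians)


min_value, max_value = estimate_median_range(batches)

# Validate the most recent Batch against the estimated range.
latest_median = median(batches["2020-12"])
success = min_value <= latest_median <= max_value

print(min_value, max_value, success)
```

Because the range is derived from all Batches, including the most recent one, the most recent Batch always falls inside it under this simplified estimator.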

* You can now save the `ExpectationSuite`.

In [25]:
validator.save_expectation_suite()

### Running the `ExpectationSuite` against a single `Batch`

* Now the `ExpectationSuite` can be used to validate single Batches using a `Checkpoint`.

In [26]:
single_batch_batch_request_from_multi: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_configured_datasource",
    data_connector_name="configured_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_all_months",
    data_connector_query={
        "index": 0  # this one will correspond to Jan 2020
    }
)


In [27]:
checkpoint_config = {
    "name": "my_checkpoint",
    "config_version": 1,
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": single_batch_batch_request_from_multi,
            "expectation_suite_name": "example_suite",
        }
    ],
}
data_context.add_checkpoint(**checkpoint_config)

{
  "action_list": [
    {
      "name": "store_validation_result",
      "action": {
        "class_name": "StoreValidationResultAction"
      }
    },
    {
      "name": "store_evaluation_params",
      "action": {
        "class_name": "StoreEvaluationParametersAction"
      }
    },
    {
      "name": "update_data_docs",
      "action": {
        "class_name": "UpdateDataDocsAction",
        "site_names": []
      }
    }
  ],
  "batch_request": {},
  "class_name": "Checkpoint",
  "config_version": 1.0,
  "evaluation_parameters": {},
  "module_name": "great_expectations.checkpoint",
  "name": "my_checkpoint",
  "profilers": [],
  "runtime_configuration": {},
  "validations": [
    {
      "batch_request": {
        "datasource_name": "taxi_multi_batch_configured_datasource",
        "data_connector_name": "configured_data_connector_multi_batch_asset",
        "data_asset_name": "yellow_tripdata_all_months",
        "data_connector_query": {
          "index": 0
        }
      },

In [28]:
results = data_context.run_checkpoint(
    checkpoint_name="my_checkpoint"
)

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

In [29]:
results.success

True