# How to write multi-batch `BatchRequest` - `Sql` Example
* A `BatchRequest` facilitates the return of a `Batch` of data from a configured `Datasource`. To find out more about `Batches`, please refer to the [related documentation](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_get_a_batch_of_data_from_a_configured_datasource#1-construct-a-batchrequest).
* A `BatchRequest` can return 0 or more Batches of data depending on the underlying data, and how it is configured. This guide will help you configure `BatchRequests` to return multiple batches, which can be used by
   1. Self-Initializing Expectations to estimate parameters
   2. DataAssistants to profile your data and create an Expectation Suite with self-initialized parameters.
   
* Note: Multi-batch `BatchRequests` are not supported in `RuntimeDataConnector`.

In [None]:
import great_expectations as ge
from ruamel import yaml
from great_expectations.core.batch import BatchRequest
import sqlite3
import pprint

* Load `DataContext`

In [None]:
data_context: ge.DataContext = ge.get_context()

## Sql Example

### Example Database

Imagine we have a database of 12 tables, each corresponding to 1 month of Taxi rider data. 


In [None]:
data_path: str = "../../../test_sets/taxi_yellow_tripdata_samples/sqlite/yellow_tripdata_2020.db"

In [None]:
# connect to sqlite DB, and print the existing tables
con = sqlite3.connect(data_path)
cur = con.cursor()
# SQL string literals take single quotes; double quotes denote identifiers
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
pprint.pprint(cur.fetchall())

### `SimpleSqlalchemyDatasource` Example

In our example, we add a `SimpleSqlalchemyDatasource` named `taxi_multi_batch_sql_datasource` with 2 `tables`, namely `yellow_tripdata_sample_2020_01`, `yellow_tripdata_sample_2020_02`. The configuration for `yellow_tripdata_sample_2020_02` is mostly used for our `Checkpoint` at the end, so the following doc will focus more on `yellow_tripdata_sample_2020_01`.

The configuration for `yellow_tripdata_sample_2020_01` also contains 2 `partitioners`, whose names become the names of the resulting `ConfiguredAssetSqlDataConnectors`.

**Note**: This example only uses `tables`, but `introspection` could also be used. For more information, please refer to the document [How to configure a DataConnector for splitting and sampling tables in SQL](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/advanced/how_to_configure_a_dataconnector_for_splitting_and_sampling_tables_in_sql)

The partitioner `whole_table` is built into GE. It takes the whole table and returns it as a single Batch.

In the output of `test_yaml_config()`, it appears as follows, with one Batch for each of our two tables:

```bash 
Data Connectors:
    whole_table : ConfiguredAssetSqlDataConnector
    Available data_asset_names (1 of 1):
        yellow_tripdata_sample_2020_01 (1 of 1): [{}]
        yellow_tripdata_sample_2020_02 (1 of 1): [{}]
```


The partitioner `by_vendor_id` is configured by us, and uses a `splitter_method` to split the table into multiple Batches. The splitter we use is `split_on_divided_integer`, which groups rows into the same Batch when the value of `column_name`, divided by the given `divisor` using integer division, is the same. The column name and divisor are passed in the `splitter_kwargs` parameter. With a `divisor` of `1`, each distinct `vendor_id` gets its own Batch.
    
Here is the output, which shows the data asset `yellow_tripdata_sample_2020_01` with 3 batches, each associated with a different `vendor_id`. These become our `batch_identifiers` that distinguish one `Batch` from another.

```bash
Data Connectors:
	by_vendor_id : ConfiguredAssetSqlDataConnector
	Available data_asset_names (1 of 1):
		yellow_tripdata_sample_2020_01 (3 of 3): [{'vendor_id': 0}, {'vendor_id': 1}, {'vendor_id': 2}]
```
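To make the splitting rule concrete, here is a minimal sketch that reproduces the grouping with plain `sqlite3` on an in-memory table. It is not how Great Expectations implements the splitter; the function name `split_rows_on_divided_integer`, the table `trips`, and the row values are all made up for illustration.

```python
import sqlite3
from collections import defaultdict


def split_rows_on_divided_integer(rows, column_index, divisor):
    """Group rows by the integer quotient of one column (illustrative only)."""
    batches = defaultdict(list)
    for row in rows:
        # Rows sharing the same quotient end up in the same batch.
        batches[row[column_index] // divisor].append(row)
    return dict(batches)


# Hypothetical in-memory table standing in for the taxi data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (vendor_id INTEGER, trip_distance REAL)")
con.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(0, 1.2), (1, 0.8), (1, 3.4), (2, 1.6), (2, 5.0)],
)
rows = con.execute("SELECT vendor_id, trip_distance FROM trips").fetchall()
con.close()

# A divisor of 1 yields one batch per distinct vendor_id.
batches = split_rows_on_divided_integer(rows, column_index=0, divisor=1)
print(sorted(batches))
```

A larger divisor merges neighbouring values: with `divisor=2`, the rows with `vendor_id` 0 and 1 would share one batch and the rows with `vendor_id` 2 would form another.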

In [None]:
data_path: str = "../../../test_sets/taxi_yellow_tripdata_samples/sqlite/yellow_tripdata_2020.db"

datasource_config = {
    "name": "taxi_multi_batch_sql_datasource",
    "module_name": "great_expectations.datasource",
    "class_name": "SimpleSqlalchemyDatasource",
    "connection_string": "sqlite:///" + data_path,
    "tables":{
        "yellow_tripdata_sample_2020_01": {
            "partitioners":{
                "whole_table": {},
                "by_vendor_id":{
                    "splitter_method": "split_on_divided_integer",
                    "splitter_kwargs": {
                        "column_name": "vendor_id",
                        "divisor": 1
                        }
                    },
                },
            },
        "yellow_tripdata_sample_2020_02": {
            "partitioners":{
                "whole_table": {},
                },
            }
    },
}
data_context.test_yaml_config(yaml.dump(datasource_config))


In [None]:
# add_datasource only if it doesn't already exist in our configuration
try:
    data_context.get_datasource(datasource_config["name"])
except ValueError:
    data_context.add_datasource(**datasource_config)

## BatchRequest

Depending on which `DataConnector` (i.e. which `partitioner`) you send a `BatchRequest` to, you will retrieve a different number of `Batches`.

Single Batch returned by `whole_table`

In [None]:
single_batch_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_sql_datasource",
    data_connector_name="whole_table",
    data_asset_name="yellow_tripdata_sample_2020_01",
)

In [None]:
batch_list = data_context.get_batch_list(batch_request=single_batch_batch_request)

In [None]:
batch_list

Multi Batch returned by `by_vendor_id`

In [None]:
multi_batch_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_sql_datasource",
    data_connector_name="by_vendor_id",
    data_asset_name="yellow_tripdata_sample_2020_01",
)

In [None]:
multi_batch_batch_list = data_context.get_batch_list(batch_request=multi_batch_batch_request)

In [None]:
multi_batch_batch_list

You can also get a single Batch from a multi-batch DataConnector by passing in `data_connector_query`. 

In [None]:
single_batch_batch_request_from_multi: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_sql_datasource",
    data_connector_name="by_vendor_id",
    data_asset_name="yellow_tripdata_sample_2020_01",
    data_connector_query={ 
        "batch_filter_parameters": {"vendor_id": 2}
    }
)

In [None]:
batch_list = data_context.get_batch_list(batch_request=single_batch_batch_request_from_multi)

In [None]:
batch_list[0].to_dict() # 'batch_identifiers': {'vendor_id': '2'}},
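Conceptually, `batch_filter_parameters` selects the Batches whose `batch_identifiers` match every key-value pair you pass. The sketch below mimics that matching on plain dictionaries; it is an assumption-laden illustration, not the library's code, and `filter_batch_identifiers` is a hypothetical helper.

```python
def filter_batch_identifiers(available, batch_filter_parameters):
    """Keep identifier dicts that match every filter key (illustrative only)."""
    return [
        identifiers
        for identifiers in available
        if all(
            identifiers.get(key) == value
            for key, value in batch_filter_parameters.items()
        )
    ]


# The identifiers reported for the by_vendor_id partitioner earlier in this guide.
available = [{"vendor_id": 0}, {"vendor_id": 1}, {"vendor_id": 2}]

selected = filter_batch_identifiers(available, {"vendor_id": 2})
print(selected)
```

An empty filter matches everything, which mirrors the multi-batch request made without a `data_connector_query`.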

## Using auto-initializing `Expectations` to generate parameters

We will generate a `Validator` using our `multi_batch_batch_list`

In [None]:
multi_batch_batch_list = data_context.get_batch_list(batch_request=multi_batch_batch_request)

In [None]:
example_suite = data_context.create_expectation_suite(expectation_suite_name="example_sql_suite", overwrite_existing=True)

In [None]:
validator = data_context.get_validator_using_batch_list(batch_list=multi_batch_batch_list, expectation_suite=example_suite)

When you run methods on the `Validator`, it will typically run them on the most recent Batch (index `-1`), even if the `Validator` has access to a longer Batch list. For example, notice that the rows below are all associated with `vendor_id` `2`.

In [None]:
validator.head()

### Typical Workflow
A `batch_list` becomes really useful when you are calculating parameters for auto-initializing Expectations, as these use a `RuleBasedProfiler` under the hood to calculate parameters.

Here is an example running `expect_column_median_to_be_between()` by "guessing" at the `min_value` and `max_value`. 

In [None]:
validator.expect_column_median_to_be_between(column="trip_distance", min_value=0, max_value=1)

The observed value for our `yellow_tripdata_sample_2020_01` table where `vendor_id` = `2` is going to be `1.6`, which means the Expectation fails.

Now we run the same Expectation again, but this time with `auto=True`. This means the `median` values are going to be calculated across the `batch_list` associated with the `Validator` (i.e. 3 Batches for `yellow_tripdata_sample_2020_01`), which gives a min value of `1.5` and a max value of `5.23`.

In [None]:
validator.expect_column_median_to_be_between(column="trip_distance", auto=True)

The `auto=True` will also automatically run the Expectation against the most recent Batch (which has an observed value of `1.61`) and the Expectation will pass. 
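The idea behind `auto=True` can be sketched as follows: compute the median of each Batch, take the smallest and largest of those medians as the bounds, then check the most recent Batch against them. This is a simplified illustration under that assumption; the `RuleBasedProfiler` may estimate the range differently, and the `trip_distance` values below are made up rather than taken from the taxi data.

```python
from statistics import median

# Hypothetical trip_distance values for three batches, keyed by vendor_id.
trip_distance_by_batch = {
    0: [1.0, 1.5, 2.0],
    1: [4.0, 5.0, 6.5],
    2: [1.2, 1.6, 2.4],
}

# One median per batch, in batch order.
per_batch_medians = [median(values) for values in trip_distance_by_batch.values()]

# The estimated range spans the medians observed across all batches.
min_value = min(per_batch_medians)
max_value = max(per_batch_medians)

# Validation then runs against the most recent batch (index -1).
latest_median = per_batch_medians[-1]
passes = min_value <= latest_median <= max_value
print(min_value, max_value, passes)
```

Because the most recent Batch contributes to the range it is later checked against, its own median always falls inside the estimated bounds in this simplified model.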

You can now save the `ExpectationSuite`.

In [None]:
validator.save_expectation_suite()

### Running the `ExpectationSuite` against a single `Batch`

Now the ExpectationSuite can be used to validate single batches using a Checkpoint. In our example, let's validate a different table, `yellow_tripdata_sample_2020_02`, using the `ExpectationSuite` we built from `yellow_tripdata_sample_2020_01`.

In [None]:
single_batch_batch_request_from_multi: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_sql_datasource",
    data_connector_name="whole_table",
    data_asset_name="yellow_tripdata_sample_2020_02",

)


In [None]:
checkpoint_config = {
    "name": "my_checkpoint",
    "config_version": 1,
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": single_batch_batch_request_from_multi,
            "expectation_suite_name": "example_sql_suite",            
        }
    ],
}
data_context.add_checkpoint(**checkpoint_config)

In [None]:
results = data_context.run_checkpoint(
    checkpoint_name="my_checkpoint"
)

In [None]:
results.success