# How to write a multi-batch `BatchRequest` - Configured `Sql` Example
* A `BatchRequest` facilitates the return of a `batch` of data from a configured `Datasource`. To find more about `Batches`, please refer to the [related documentation](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_get_a_batch_of_data_from_a_configured_datasource#1-construct-a-batchrequest). 
* A `BatchRequest` can return 0 or more `Batches` of data, depending on the underlying data and how it is configured. This guide will help you configure `BatchRequests` to return multiple Batches, which can be used by:
   1. Self-Initializing Expectations to estimate parameters.
   2. DataAssistants to profile your data and create an Expectation Suite with self-initialized parameters.
   
* Note: Multi-batch `BatchRequests` are not supported in `RuntimeDataConnector`.

In [5]:
import great_expectations as ge
from ruamel import yaml
from great_expectations.core.batch import BatchRequest
import sqlite3
import pprint


* Load `DataContext`

In [6]:
data_context: ge.DataContext = ge.get_context()

## Sql Example

### Example Database

Imagine we have a database with 1 table, `yellow_tripdata_sample_2020`, which contains the `taxi_trip` data for all 12 months of 2020.


In [7]:
# connect to the PostgreSQL database and print the existing tables
CONNECTION_STRING = "postgresql+psycopg2://postgres:@localhost/test_ci"
from sqlalchemy import create_engine
from sqlalchemy import inspect
engine = create_engine(CONNECTION_STRING)
insp = inspect(engine)
print(insp.get_table_names())

['yellow_tripdata_sample_2020']


## Example Configuration

In our example, we add a `Datasource` named `taxi_multi_batch_sql_datasource` with 1 table. We also have a `ConfiguredAssetSqlDataConnector` named `configured_data_connector_multi_batch_asset`.

The DataConnector contains 2 `assets`, both associated with the `table_name` `yellow_tripdata_sample_2020`.

The asset `yellow_tripdata_sample_2020_full` contains no parameters other than the `table_name` and the optional `schema_name`, which means the whole table will be loaded as one Batch in the asset.

The asset `yellow_tripdata_sample_2020_by_year_and_month` contains `table_name` and `schema_name`, as well as a splitter configuration. The splitter we use is `split_on_year_and_month`, which creates Batches according to the `pickup_datetime` column, which is of type timestamp in the database schema.
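To make the effect of the splitter more concrete, here is a minimal pure-Python sketch of grouping rows by the year and month of a timestamp column. It only illustrates the idea and is not how Great Expectations implements `split_on_year_and_month`, which works against the database. The function name `group_rows_by_year_and_month` and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows standing in for the table contents.
rows = [
    {"pickup_datetime": datetime(2020, 1, 3, 8, 15), "trip_distance": 1.2},
    {"pickup_datetime": datetime(2020, 1, 20, 17, 40), "trip_distance": 3.4},
    {"pickup_datetime": datetime(2020, 2, 11, 9, 5), "trip_distance": 0.9},
    {"pickup_datetime": datetime(2020, 12, 15, 12, 20), "trip_distance": 5.76},
]


def group_rows_by_year_and_month(rows, column_name):
    """Group rows into one bucket per (year, month) of `column_name`."""
    buckets = defaultdict(list)
    for row in rows:
        timestamp = row[column_name]
        buckets[(timestamp.year, timestamp.month)].append(row)
    return dict(buckets)


batches = group_rows_by_year_and_month(rows, "pickup_datetime")
for (year, month), batch_rows in sorted(batches.items()):
    # Print identifiers in the same shape the DataConnector reports them.
    print({"pickup_datetime": {"year": year, "month": month}}, len(batch_rows))
```

Each bucket plays the role of one Batch, identified by its year and month.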

**Note**: This example only uses `splitters` but sampling can also be used. For more information, please refer to the document [How to configure a DataConnector for splitting and sampling tables in SQL](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/advanced/how_to_configure_a_dataconnector_for_splitting_and_sampling_tables_in_sql)

In [32]:
datasource_config = {
    "name": "taxi_multi_batch_sql_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "SqlAlchemyExecutionEngine",
        "connection_string": CONNECTION_STRING,
    },
    "data_connectors": {
        "configured_data_connector_multi_batch_asset": {
            "class_name": "ConfiguredAssetSqlDataConnector",
            "assets":{
                "yellow_tripdata_sample_2020_full":
                {
                    "table_name": "yellow_tripdata_sample_2020",
                    "schema_name": "public",
                },
    
                "yellow_tripdata_sample_2020_by_year_and_month":{
                    "table_name": "yellow_tripdata_sample_2020",
                    "schema_name": "public",
                    "splitter_method": "split_on_year_and_month",
                    "splitter_kwargs": {
                        "column_name": "pickup_datetime",
                        },
                    },
                },
            },
            
        },
}

data_context.test_yaml_config(yaml.dump(datasource_config))

Attempting to instantiate class from config...
	Instantiating as a Datasource, since class_name is Datasource
	Successfully instantiated Datasource


ExecutionEngine class name: SqlAlchemyExecutionEngine
Data Connectors:
	configured_data_connector_multi_batch_asset : ConfiguredAssetSqlDataConnector

	Available data_asset_names (2 of 2):
		yellow_tripdata_sample_2020_by_year_and_month (3 of 12): [{'pickup_datetime': {'year': 2020, 'month': 1}}, {'pickup_datetime': {'year': 2020, 'month': 10}}, {'pickup_datetime': {'year': 2020, 'month': 11}}]
		yellow_tripdata_sample_2020_full (1 of 1): [{}]

	Unmatched data_references (0 of 0):[]



<great_expectations.datasource.new_datasource.Datasource at 0x7fa3710fd0a0>

We can see that the configuration was successful because the output shows 2 data assets:
- `yellow_tripdata_sample_2020_full` associated with 1 batch. 
- `yellow_tripdata_sample_2020_by_year_and_month` with 12 batches, each associated with a different month in our `pickup_datetime` column. 

In [15]:
# add_datasource only if it doesn't already exist in our configuration
try:
    data_context.get_datasource(datasource_config["name"])
except ValueError:
    data_context.add_datasource(**datasource_config)

## BatchRequest

Depending on how the assets are configured, a `BatchRequest` will return a different number of `Batches`.

Single Batch returned by `yellow_tripdata_sample_2020_full`

In [16]:
single_batch_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_sql_datasource",
    data_connector_name="configured_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_sample_2020_full",
)

In [17]:
batch_list = data_context.get_batch_list(batch_request=single_batch_batch_request)

In [18]:
batch_list

[<great_expectations.core.batch.Batch at 0x7fa3865e2c10>]

Multi Batch returned by `yellow_tripdata_sample_2020_by_year_and_month`

In [19]:
multi_batch_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_sql_datasource",
    data_connector_name="configured_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_sample_2020_by_year_and_month",
)

In [20]:
multi_batch_batch_list = data_context.get_batch_list(batch_request=multi_batch_batch_request)

In [21]:
multi_batch_batch_list # 12 batches

[<great_expectations.core.batch.Batch at 0x7fa3865ea280>,
 <great_expectations.core.batch.Batch at 0x7fa3720b3670>,
 <great_expectations.core.batch.Batch at 0x7fa36db1b3d0>,
 <great_expectations.core.batch.Batch at 0x7fa371845550>,
 <great_expectations.core.batch.Batch at 0x7fa37106a3d0>,
 <great_expectations.core.batch.Batch at 0x7fa36de41040>,
 <great_expectations.core.batch.Batch at 0x7fa372099c70>,
 <great_expectations.core.batch.Batch at 0x7fa371845730>,
 <great_expectations.core.batch.Batch at 0x7fa37107c3a0>,
 <great_expectations.core.batch.Batch at 0x7fa370ffa250>,
 <great_expectations.core.batch.Batch at 0x7fa371635ca0>,
 <great_expectations.core.batch.Batch at 0x7fa37108a880>]

You can also get a single Batch from a multi-batch DataConnector by passing in `data_connector_query`. 

In [22]:
single_batch_batch_request_from_multi: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_sql_datasource",
    data_connector_name="configured_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_sample_2020_by_year_and_month",
    data_connector_query={ 
        "batch_filter_parameters": {"pickup_datetime": {"year": 2020, "month": 1}}
    }
)
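As a simplified mental model, `batch_filter_parameters` can be pictured as an equality filter over the batch identifiers of the asset. The sketch below uses plain dictionaries shaped like the identifiers in the `test_yaml_config` output; the helper `matches` is hypothetical, and the library's actual filtering logic may differ.

```python
# Batch identifiers shaped like the ones reported for the asset.
available_identifiers = [
    {"pickup_datetime": {"year": 2020, "month": month}} for month in range(1, 13)
]

# Same filter as in the BatchRequest above.
batch_filter_parameters = {"pickup_datetime": {"year": 2020, "month": 1}}


def matches(identifiers, filter_parameters):
    """Return True when every filter key is present with an equal value."""
    return all(
        identifiers.get(key) == value for key, value in filter_parameters.items()
    )


selected = [
    identifiers
    for identifiers in available_identifiers
    if matches(identifiers, batch_filter_parameters)
]
print(selected)
```

Only the January 2020 identifiers survive the filter, which is why the request yields a single Batch.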

In [23]:
batch_list = data_context.get_batch_list(batch_request=single_batch_batch_request_from_multi)

In [24]:
batch_list[0].to_dict()  # 'batch_identifiers': {'pickup_datetime': {'year': 2020, 'month': 1}}

{'data': '<great_expectations.execution_engine.sqlalchemy_batch_data.SqlAlchemyBatchData object at 0x7fa3710cdbb0>',
 'batch_request': {'datasource_name': 'taxi_multi_batch_sql_datasource',
  'data_connector_name': 'configured_data_connector_multi_batch_asset',
  'data_asset_name': 'yellow_tripdata_sample_2020_by_year_and_month',
  'limit': None,
  'batch_spec_passthrough': None,
  'data_connector_query': {'batch_filter_parameters': {'pickup_datetime': {'year': 2020,
     'month': 1}}}},
 'batch_definition': {'datasource_name': 'taxi_multi_batch_sql_datasource',
  'data_connector_name': 'configured_data_connector_multi_batch_asset',
  'data_asset_name': 'yellow_tripdata_sample_2020_by_year_and_month',
  'batch_identifiers': {'pickup_datetime': {'year': 2020, 'month': 1}}},
 'batch_spec': {'data_asset_name': 'yellow_tripdata_sample_2020_by_year_and_month',
  'table_name': 'yellow_tripdata_sample_2020',
  'batch_identifiers': {'pickup_datetime': {'year': 2020, 'month': 1}},
  'schema_nam

# Using auto-initializing `Expectations` to generate parameters

We will generate a `Validator` using our `multi_batch_batch_list`

In [25]:
multi_batch_batch_list = data_context.get_batch_list(batch_request=multi_batch_batch_request)

In [26]:
example_suite = data_context.create_expectation_suite(expectation_suite_name="example_sql_suite", overwrite_existing=True)

In [27]:
validator = data_context.get_validator_using_batch_list(batch_list=multi_batch_batch_list, expectation_suite=example_suite)

When you run methods on the Validator, they will typically run on the most recent Batch (index `-1`), even if the Validator has access to a longer Batch list. Note that in this example the last Batch in the list is the one for month `9` (September 2020), not December. This is because the month values are ordered lexicographically, meaning that `1`, `10`, `11`, and `12` appear **before** `2` and `3`.

**NOTE**: This will be fixed in time for the `DataAssistant` release.

For simplicity, let's get a `Validator` with the December `Batch`, which is at index `3` (after `1`, `10`, and `11`). Notice that we also wrap the Batch in a `list` using square brackets.
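The ordering can be reproduced with plain Python. This small sketch sorts the month numbers by their string form, which is an assumption about how the lexicographic order arises; it does not use any Great Expectations API.

```python
# Month numbers 1..12, as they appear in the batch identifiers.
months = list(range(1, 13))

# Sorting by the string form gives the lexicographic order.
lexicographic = sorted(months, key=str)

print(lexicographic)      # [1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9]
print(lexicographic[3])   # 12 (December)
print(lexicographic[-1])  # 9 (September)
```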

In [28]:
validator = data_context.get_validator_using_batch_list(batch_list=[multi_batch_batch_list[3]], expectation_suite=example_suite)

In [29]:
validator.head()

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,rate_code_id,store_and_fwd_flag,pickup_location_id,dropoff_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,2.0,2020-12-15 12:20:27,2020-12-15 12:40:49,4.0,5.76,1.0,N,209,237,1.0,21.0,0.0,0.5,2.5,0.0,0.3,26.8,2.5
1,2.0,2020-12-28 12:51:25,2020-12-28 13:15:12,1.0,11.64,1.0,N,161,220,1.0,33.5,0.0,0.5,5.0,2.8,0.3,44.6,2.5
2,2.0,2020-12-27 10:43:42,2020-12-27 10:51:05,1.0,1.22,1.0,N,163,48,1.0,7.0,0.0,0.5,2.06,0.0,0.3,12.36,2.5
3,2.0,2020-12-08 13:42:52,2020-12-08 13:54:45,1.0,1.84,1.0,N,137,229,2.0,9.0,0.0,0.5,0.0,0.0,0.3,12.3,2.5
4,2.0,2020-12-19 11:56:43,2020-12-19 12:08:43,1.0,1.55,1.0,N,24,74,1.0,9.5,0.0,0.5,2.58,0.0,0.3,12.88,0.0


### Typical Workflow
A `batch_list` becomes really useful when you are calculating parameters for auto-initializing Expectations, as they use a `RuleBasedProfiler` under the hood to calculate parameters.

Here is an example running `expect_column_median_to_be_between()` by "guessing" at the `min_value` and `max_value`. 

In [None]:
validator.expect_column_median_to_be_between(column="trip_distance", min_value=0, max_value=1)

The observed value for our `yellow_tripdata_sample_2020_01` table where `vendor_id` = `2` is going to be `1.6`, which means the Expectation fails.

Now we run the same Expectation again, but this time with `auto=True`. This means the `median` values are going to be calculated across the `batch_list` associated with the `Validator` (i.e. 3 Batches for `yellow_tripdata_sample_2020_01`), which gives a min value of `1.5` and a max value of `5.23`.

In [None]:
validator.expect_column_median_to_be_between(column="trip_distance", auto=True)

Setting `auto=True` also runs the Expectation against the most recent Batch (which has an observed value of `1.61`), and the Expectation will pass.
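As a rough mental model, the range estimation can be pictured as computing the median for each Batch and taking the smallest and largest of those medians as the bounds. The sketch below does exactly that with the standard library. The batch names and trip distances are made up, and the estimator actually used by the `RuleBasedProfiler` may differ.

```python
from statistics import median

# Hypothetical trip distances for three Batches; the values are made up.
trip_distance_by_batch = {
    "batch_a": [0.8, 1.5, 2.9],
    "batch_b": [1.1, 1.61, 4.0],
    "batch_c": [2.0, 5.23, 7.5],
}

# One median per Batch, then the range spanned by those medians.
medians = {name: median(values) for name, values in trip_distance_by_batch.items()}
min_value = min(medians.values())
max_value = max(medians.values())
print(min_value, max_value)

# A Batch whose median lies inside the estimated range passes the check.
observed = medians["batch_b"]
print(min_value <= observed <= max_value)
```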

You can now save the `ExpectationSuite`.

In [None]:
validator.save_expectation_suite()

### Running the `ExpectationSuite` against a single `Batch`

Now the `ExpectationSuite` can be used to validate single Batches using a Checkpoint. In our example, let's validate a different Batch, February 2020 from the `yellow_tripdata_sample_2020_by_year_and_month` asset, using the `ExpectationSuite` we saved above.

In [None]:
single_batch_batch_request_from_multi: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_sql_datasource",
    data_connector_name="configured_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_sample_2020_by_year_and_month",
    data_connector_query={
        "batch_filter_parameters": {"pickup_datetime": {"year": 2020, "month": 2}}
    },
)


In [None]:
checkpoint_config = {
    "name": "my_checkpoint",
    "config_version": 1,
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": single_batch_batch_request_from_multi,
            "expectation_suite_name": "example_sql_suite",            
        }
    ],
}
data_context.add_checkpoint(**checkpoint_config)

In [None]:
results = data_context.run_checkpoint(
    checkpoint_name="my_checkpoint"
)

In [None]:
results.success

# Appendix


## Other Parameters for `ConfiguredAssetSqlDataConnector`

The `ConfiguredAssetSqlDataConnector` takes the following required parameters:
* `name`: The name of this DataConnector.
* `datasource_name`: The name of the Datasource that contains it.
* `execution_engine`: An ExecutionEngine.
* `assets`: The dictionary containing the asset configurations.

The `assets` dictionary can contain the following keys and values:
* `table_name`: a string that defines the `table_name` associated with the asset. If `table_name` is omitted, it defaults to the asset name.
* `schema_name`: an optional string that defines the `schema` for the asset.
* `include_schema_name`: a `bool` that determines whether the `data_asset_name` should include the `schema` as a prefix.
* `splitter_method`: a string that names the method used to split the target table into multiple `Batches`.
* `splitter_kwargs`: a dict containing the arguments to pass to `splitter_method`.
* `sampling_method`: a string that names the method used to downsample within a target `Batch`.
* `sampling_kwargs`: a dict containing the keyword arguments to pass to `sampling_method`.
* `batch_spec_passthrough`: a dict with keys that will be added directly to `batch_spec`.
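As an illustration of the keys listed above, here is a single asset entry written as a plain dictionary, together with a small check for misspelled keys. The values are taken from the example configuration in this guide, except `include_schema_name`, which is set to `True` purely for illustration. The helper `undocumented_keys` is hypothetical and not part of Great Expectations.

```python
# Keys listed above for an entry in the `assets` dictionary.
DOCUMENTED_ASSET_KEYS = {
    "table_name",
    "schema_name",
    "include_schema_name",
    "splitter_method",
    "splitter_kwargs",
    "sampling_method",
    "sampling_kwargs",
    "batch_spec_passthrough",
}

asset_config = {
    "table_name": "yellow_tripdata_sample_2020",
    "schema_name": "public",
    "include_schema_name": True,  # illustrative value
    "splitter_method": "split_on_year_and_month",
    "splitter_kwargs": {"column_name": "pickup_datetime"},
}


def undocumented_keys(config):
    """Return the keys of `config` that are not in the documented set."""
    return sorted(set(config) - DOCUMENTED_ASSET_KEYS)


print(undocumented_keys(asset_config))
print(undocumented_keys({**asset_config, "splitter_kwarg": {}}))
```

A check like this can catch a typo such as `splitter_kwarg` before the configuration is passed to `test_yaml_config`.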


For more information on `splitters` and `samplers`, please consider the following documentation: [How to configure a DataConnector for splitting and sampling tables in SQL](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/advanced/how_to_configure_a_dataconnector_for_splitting_and_sampling_tables_in_sql).

**Note**: although 



# Loading Data into PostgreSQL Database

* The following code can be used to build the postgres database used in this notebook. It is included (and commented out) for reference.
* In order to load the data into a local `postgresql` database, please feel free to use the `docker-compose.yml` file available at `great_expectations/assets/docker/postgresql/`. 

### To spin up the `postgresql` database
* Have [Docker Desktop](https://www.docker.com/products/docker-desktop/) running locally.
* Navigate to `great_expectations/assets/docker/postgresql/`
* Type `docker-compose up`
* Then uncomment and run the following snippet

In [13]:
# from tests.test_utils import load_data_into_test_database
# from typing import List
# import sqlalchemy as sa
# import pandas as pd
# CONNECTION_STRING = "postgresql+psycopg2://postgres:@localhost/test_ci"

# data_paths: List[str] = [
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-01.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-02.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-03.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-04.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-05.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-06.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-07.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-08.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-09.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-10.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-11.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-12.csv",
# ]

    
# engine = sa.create_engine(CONNECTION_STRING)
# connection = engine.connect()
# table_name = "yellow_tripdata_sample_2020"
# res = connection.execute(f"DROP TABLE IF EXISTS {table_name}")

# for data_path in data_paths:
#     # This utility is not for general use. It is only to support testing.
#     load_data_into_test_database(
#         table_name="yellow_tripdata_sample_2020",
#         csv_path=data_path,
#         connection_string=CONNECTION_STRING,
#         load_full_dataset=True,
#         drop_existing_table=False,
#         convert_colnames_to_datetime=["pickup_datetime", "dropoff_datetime"]
#     )

Adding to existing table yellow_tripdata_sample_2020 and adding data from ['../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-01.csv']
Adding to existing table yellow_tripdata_sample_2020 and adding data from ['../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-02.csv']
Adding to existing table yellow_tripdata_sample_2020 and adding data from ['../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-03.csv']
Adding to existing table yellow_tripdata_sample_2020 and adding data from ['../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-04.csv']
Adding to existing table yellow_tripdata_sample_2020 and adding data from ['../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-05.csv']
Adding to existing table yellow_tripdata_sample_2020 and adding data from ['../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-06.csv']
Adding to existing table yellow_tr