# How to write multi-batch `BatchRequest` - Inferred `Sql` Example
* A `BatchRequest` facilitates the return of a `Batch` of data from a configured `Datasource`. To learn more about `Batches`, please refer to the [related documentation](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_get_a_batch_of_data_from_a_configured_datasource#1-construct-a-batchrequest).
* A `BatchRequest` can return 0 or more Batches of data, depending on the underlying data and how it is configured. This guide will help you configure `BatchRequests` to return multiple Batches, which can be used by:
   1. Self-Initializing Expectations to estimate parameters.
   2. DataAssistants to profile your data and create an Expectation Suite with self-initialized parameters.
   
* Note: Multi-batch `BatchRequests` are not supported in `RuntimeDataConnector`.

In [1]:
import great_expectations as ge
from ruamel import yaml
from great_expectations.core.batch import BatchRequest
import sqlite3
import pprint


* Load `DataContext`

In [2]:
data_context: ge.DataContext = ge.get_context()

## Sql Example

### Example Database

Imagine we have a database with one table, `yellow_tripdata_sample_2020`, which contains all 12 months of `taxi_trip` data for 2020.


In [3]:
# connect to the PostgreSQL DB, and print the existing tables
CONNECTION_STRING = "postgresql+psycopg2://postgres:@localhost/test_ci"
from sqlalchemy import create_engine
from sqlalchemy import inspect
engine = create_engine(CONNECTION_STRING)
insp = inspect(engine)
print(insp.get_table_names())

['yellow_tripdata_sample_2020']


## Example Configuration

In our example, we add a Datasource named `taxi_multi_batch_sql_datasource` with 1 table. We also have an `InferredAssetSqlDataConnector` named `inferred_data_connector_multi_batch_asset`.

The DataConnector configuration also includes a `splitter_method` to split the table values into multiple Batches. The splitter we use is `split_on_year_and_month`, which creates Batches according to the `pickup_datetime` column, which is of type `timestamp` in the database schema.
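To build intuition for what a year-and-month splitter does, here is a minimal pure-Python sketch that groups rows by the year and month of a timestamp column. It is not how Great Expectations implements the splitter; the `rows` data and the `split_by_year_and_month` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows standing in for the taxi table.
rows = [
    {"pickup_datetime": datetime(2020, 1, 17, 16, 57, 19), "trip_distance": 1.4},
    {"pickup_datetime": datetime(2020, 1, 28, 11, 25, 16), "trip_distance": 1.8},
    {"pickup_datetime": datetime(2020, 2, 3, 9, 0, 0), "trip_distance": 2.5},
    {"pickup_datetime": datetime(2020, 12, 24, 18, 30, 0), "trip_distance": 0.9},
]


def split_by_year_and_month(rows, column_name):
    """Group rows into batches keyed by the (year, month) of `column_name`."""
    batches = defaultdict(list)
    for row in rows:
        value = row[column_name]
        batches[(value.year, value.month)].append(row)
    return dict(batches)


batches = split_by_year_and_month(rows, "pickup_datetime")
for (year, month), batch_rows in sorted(batches.items()):
    # Each key plays the role of a batch identifier.
    print({"year": year, "month": month}, len(batch_rows))
```

Each distinct year and month pair becomes one Batch, which is why a table covering all of 2020 yields 12 Batches.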


Our configuration also includes `schema_name` that is defined as part of `introspection_directives`. For other options for the DataConnector configuration, including other `introspection_directives`, please refer to the **Appendix** below. 



In [4]:
datasource_config = {
    "name": "taxi_multi_batch_sql_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "SqlAlchemyExecutionEngine",
        "connection_string": CONNECTION_STRING,
    },
    "data_connectors": {
        "inferred_data_connector_multi_batch_asset": {
            "class_name": "InferredAssetSqlDataConnector",
            "splitter_method": "split_on_year_and_month",
            "splitter_kwargs": {
                "column_name": "pickup_datetime",
            },
            "introspection_directives": {
                "schema_name": "public",
            },
        },
    },
}

data_context.test_yaml_config(yaml.dump(datasource_config))


Attempting to instantiate class from config...
	Instantiating as a Datasource, since class_name is Datasource
	Successfully instantiated Datasource


ExecutionEngine class name: SqlAlchemyExecutionEngine
Data Connectors:
	inferred_data_connector_multi_batch_asset : InferredAssetSqlDataConnector

	Available data_asset_names (1 of 1):
		yellow_tripdata_sample_2020 (3 of 12): [{'pickup_datetime': {'year': 2020, 'month': 2}}, {'pickup_datetime': {'year': 2020, 'month': 7}}, {'pickup_datetime': {'year': 2020, 'month': 12}}]

	Unmatched data_references (0 of 0):[]



<great_expectations.datasource.new_datasource.Datasource at 0x7f7d61cd5e20>

We can see that the configuration succeeded because the output shows one data asset:

* `yellow_tripdata_sample_2020` with 12 batches, each associated with a different month in our pickup_datetime column.

In [5]:
# add_datasource only if it doesn't already exist in our configuration
try:
    data_context.get_datasource(datasource_config["name"])
except ValueError:
    data_context.add_datasource(**datasource_config)

## BatchRequest

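The number of `Batches` you retrieve depends on the `DataConnector` you send a `BatchRequest` to and on how it splits the data.

The request below does not pass a `data_connector_query`, so it returns every Batch that `inferred_data_connector_multi_batch_asset` produces: 12 here, one per month, despite the `single_batch` variable name.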

In [6]:
single_batch_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_sql_datasource",
    data_connector_name="inferred_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_sample_2020",
)

In [7]:
batch_list = data_context.get_batch_list(batch_request=single_batch_batch_request)

In [8]:
batch_list

[<great_expectations.core.batch.Batch at 0x7f7d61cd5f40>,
 <great_expectations.core.batch.Batch at 0x7f7d61c7f1c0>,
 <great_expectations.core.batch.Batch at 0x7f7d78e097f0>,
 <great_expectations.core.batch.Batch at 0x7f7d61d65a00>,
 <great_expectations.core.batch.Batch at 0x7f7d61d65ca0>,
 <great_expectations.core.batch.Batch at 0x7f7d61d65f40>,
 <great_expectations.core.batch.Batch at 0x7f7d61d658b0>,
 <great_expectations.core.batch.Batch at 0x7f7d61d904c0>,
 <great_expectations.core.batch.Batch at 0x7f7d61d90760>,
 <great_expectations.core.batch.Batch at 0x7f7d61d90a00>,
 <great_expectations.core.batch.Batch at 0x7f7d61d90ca0>,
 <great_expectations.core.batch.Batch at 0x7f7d61d90f40>]

The equivalent multi-batch request also returns all 12 Batches:

In [9]:
multi_batch_batch_request: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_sql_datasource",
    data_connector_name="inferred_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_sample_2020",
)

In [10]:
multi_batch_batch_list = data_context.get_batch_list(batch_request=multi_batch_batch_request)

In [11]:
multi_batch_batch_list # 12 batches

[<great_expectations.core.batch.Batch at 0x7f7d61d1c5e0>,
 <great_expectations.core.batch.Batch at 0x7f7d78da9f70>,
 <great_expectations.core.batch.Batch at 0x7f7d61c70520>,
 <great_expectations.core.batch.Batch at 0x7f7d61d65af0>,
 <great_expectations.core.batch.Batch at 0x7f7d61d65190>,
 <great_expectations.core.batch.Batch at 0x7f7d61da95b0>,
 <great_expectations.core.batch.Batch at 0x7f7d61da9850>,
 <great_expectations.core.batch.Batch at 0x7f7d61d65490>,
 <great_expectations.core.batch.Batch at 0x7f7d61d90df0>,
 <great_expectations.core.batch.Batch at 0x7f7d78e5ff40>,
 <great_expectations.core.batch.Batch at 0x7f7d61d65940>,
 <great_expectations.core.batch.Batch at 0x7f7d61da9550>]

You can also get a single Batch from a multi-batch DataConnector by passing in `data_connector_query`. 
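Conceptually, `batch_filter_parameters` keeps only the Batches whose identifiers match the values you pass. The sketch below illustrates that idea in plain Python; the `filter_batches` helper is hypothetical and is not part of the Great Expectations API, and the identifiers are shaped like the ones printed by `test_yaml_config` above.

```python
# Hypothetical batch identifiers, one per month of 2020.
batch_identifiers = [
    {"pickup_datetime": {"year": 2020, "month": month}} for month in range(1, 13)
]


def filter_batches(identifiers, batch_filter_parameters):
    """Keep identifiers whose entries equal every requested filter parameter."""
    return [
        identifier
        for identifier in identifiers
        if all(
            identifier.get(key) == value
            for key, value in batch_filter_parameters.items()
        )
    ]


selected = filter_batches(
    batch_identifiers, {"pickup_datetime": {"year": 2020, "month": 1}}
)
print(selected)
```

With an empty filter, all 12 identifiers pass through, which mirrors a `BatchRequest` that has no `data_connector_query`.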

In [12]:
single_batch_batch_request_from_multi: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_sql_datasource",
    data_connector_name="inferred_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_sample_2020",
    data_connector_query={ 
        "batch_filter_parameters": {"pickup_datetime": {"year": 2020, "month": 1}}
    }
)

In [13]:
batch_list = data_context.get_batch_list(batch_request=single_batch_batch_request_from_multi)

In [14]:
batch_list[0].to_dict()  # 'batch_identifiers': {'pickup_datetime': {'year': 2020, 'month': 1}}

{'data': '<great_expectations.execution_engine.sqlalchemy_batch_data.SqlAlchemyBatchData object at 0x7f7d61db24c0>',
 'batch_request': {'datasource_name': 'taxi_multi_batch_sql_datasource',
  'data_connector_name': 'inferred_data_connector_multi_batch_asset',
  'data_asset_name': 'yellow_tripdata_sample_2020',
  'data_connector_query': {'batch_filter_parameters': {'pickup_datetime': {'year': 2020,
     'month': 1}}},
  'limit': None,
  'batch_spec_passthrough': None},
 'batch_definition': {'datasource_name': 'taxi_multi_batch_sql_datasource',
  'data_connector_name': 'inferred_data_connector_multi_batch_asset',
  'data_asset_name': 'yellow_tripdata_sample_2020',
  'batch_identifiers': {'pickup_datetime': {'year': 2020, 'month': 1}}},
 'batch_spec': {'data_asset_name': 'yellow_tripdata_sample_2020',
  'table_name': 'yellow_tripdata_sample_2020',
  'batch_identifiers': {'pickup_datetime': {'year': 2020, 'month': 1}},
  'type': 'table',
  'data_asset_name_prefix': '',
  'data_asset_name_s

# Using auto-initializing `Expectations` to generate parameters

We will generate a `Validator` using our `multi_batch_batch_list`

In [15]:
multi_batch_batch_list = data_context.get_batch_list(batch_request=multi_batch_batch_request)

In [16]:
example_suite = data_context.create_expectation_suite(expectation_suite_name="example_sql_suite", overwrite_existing=True)

In [17]:
validator = data_context.get_validator_using_batch_list(batch_list=multi_batch_batch_list, expectation_suite=example_suite)

When you run methods on the Validator, they will typically run against the most recent Batch (index `-1`), even if the Validator has access to a longer Batch list. When the Batches are ordered lexicographically by month, `1`, `10`, `11`, and `12` appear **before** `2` and `3`, so the last Batch is month `9` (September 2020) rather than December.

For simplicity, let's get a `validator` with the December `Batch`, which under that ordering is at index `3` (after `1`, `10`, `11`). Notice that we also wrap the Batch in a `list` using square brackets.
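The difference between numeric and lexicographic ordering of month values can be checked with a short, standalone snippet. It only demonstrates the string-sorting behaviour described above and makes no claim about how Great Expectations orders Batches internally.

```python
months = list(range(1, 13))

numeric_order = sorted(months)
lexicographic_order = sorted(str(month) for month in months)

print(numeric_order)
print(lexicographic_order)

# Under lexicographic ordering, "12" sits at index 3 and "9" is last.
print(lexicographic_order[3], lexicographic_order[-1])
```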

In [18]:
validator = data_context.get_validator_using_batch_list(batch_list=[multi_batch_batch_list[3]], expectation_suite=example_suite)

In [19]:
validator.head()

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,rate_code_id,store_and_fwd_flag,pickup_location_id,dropoff_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1.0,2020-01-17 16:57:19,2020-01-17 17:11:18,1.0,1.4,1.0,N,163,68,1.0,10.0,3.5,0.5,3.55,0.0,0.3,17.85,2.5
1,1.0,2020-01-28 11:25:16,2020-01-28 11:42:32,1.0,1.8,1.0,N,234,162,1.0,11.5,2.5,0.5,3.7,0.0,0.3,18.5,2.5
2,2.0,2020-01-17 05:51:40,2020-01-17 06:10:33,1.0,9.21,1.0,N,162,138,1.0,27.0,0.5,0.5,7.38,6.12,0.3,44.3,2.5
3,1.0,2020-01-01 10:17:55,2020-01-01 10:32:23,1.0,3.4,1.0,N,79,246,1.0,13.5,2.5,0.5,3.36,0.0,0.3,20.16,2.5
4,2.0,2020-01-21 12:57:08,2020-01-21 13:02:57,1.0,0.8,1.0,N,161,237,1.0,5.5,0.0,0.5,1.76,0.0,0.3,10.56,2.5


### Typical Workflow
A `batch_list` becomes really useful when you are calculating parameters for auto-initializing Expectations, as they use a `RuleBasedProfiler` under the hood to calculate parameters.

Here is an example running `expect_column_median_to_be_between()` by "guessing" at the `min_value` and `max_value`. 

In [20]:
validator.expect_column_median_to_be_between(column="trip_distance", min_value=0, max_value=1)

Calculating Metrics:   0%|          | 0/9 [00:00<?, ?it/s]

{
  "success": false,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "observed_value": 1.6
  }
}

The observed value for this Batch of the `yellow_tripdata_sample_2020` table is `1.6`, which falls outside the guessed range of `0` to `1`, so the Expectation fails.

Now we run the same Expectation again, but this time with `auto=True`. This means the `median` values are going to be calculated across the `batch_list` associated with the `Validator`. Because this `Validator` was built from a single Batch, the estimated `min_value` and `max_value` in the output below are both `1.6`.

In [21]:
validator.expect_column_median_to_be_between(column="trip_distance", auto=True)




Generating Expectations:   0%|          | 0/1 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/9 [00:00<?, ?it/s]

{
  "success": true,
  "expectation_config": {
    "kwargs": {
      "column": "trip_distance",
      "min_value": 1.6,
      "max_value": 1.6,
      "strict_min": false,
      "strict_max": false
    },
    "meta": {
      "auto_generated_at": "20220908T014809.543396Z",
      "great_expectations_version": "0.15.21+28.g75ae687f9.dirty"
    },
    "expectation_type": "expect_column_median_to_be_between"
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "observed_value": 1.6
  }
}

With `auto=True`, the Expectation is also automatically run against the most recent Batch (which has an observed value of `1.6`), and the Expectation passes.
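As a rough intuition for how a range can be estimated from several Batches, the sketch below computes one median per Batch and takes the smallest and largest as the bounds. This is a simplification under stated assumptions: the `RuleBasedProfiler` may use different estimators, and the trip distances here are hypothetical values, not results from the taxi data.

```python
from statistics import median

# Hypothetical trip distances for three monthly Batches.
batches = {
    "2020-01": [0.8, 1.4, 1.8, 3.4, 9.21],
    "2020-02": [0.5, 1.1, 1.6, 2.0, 4.2],
    "2020-03": [1.0, 1.2, 2.3, 2.9, 5.0],
}

# One median per Batch, then the spread of those medians.
medians = {name: median(values) for name, values in batches.items()}
min_value = min(medians.values())
max_value = max(medians.values())
print(medians, min_value, max_value)

# With a single Batch, the lower and upper bounds collapse to the same value.
single_batch_medians = [median(batches["2020-01"])]
print(min(single_batch_medians), max(single_batch_medians))
```

The second case shows why a Validator that holds only one Batch ends up with identical `min_value` and `max_value`.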

You can now save the `ExpectationSuite`.

In [22]:
validator.save_expectation_suite()

### Running the `ExpectationSuite` against a single `Batch`

Now the `ExpectationSuite` can be used to validate single Batches using a Checkpoint. In our example, let's validate a different Batch of `yellow_tripdata_sample_2020`, the one for February 2020, using the `ExpectationSuite` we built above.

In [23]:
single_batch_batch_request_from_multi: BatchRequest = BatchRequest(
    datasource_name="taxi_multi_batch_sql_datasource",
    data_connector_name="inferred_data_connector_multi_batch_asset",
    data_asset_name="yellow_tripdata_sample_2020",
    data_connector_query={ 
        "batch_filter_parameters": {"pickup_datetime": {"year": 2020, "month": 2}}
    },
)


In [24]:
checkpoint_config = {
    "name": "my_checkpoint",
    "config_version": 1,
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": single_batch_batch_request_from_multi,
            "expectation_suite_name": "example_sql_suite",            
        }
    ],
}
data_context.add_checkpoint(**checkpoint_config)

{
  "action_list": [
    {
      "name": "store_validation_result",
      "action": {
        "class_name": "StoreValidationResultAction"
      }
    },
    {
      "name": "store_evaluation_params",
      "action": {
        "class_name": "StoreEvaluationParametersAction"
      }
    },
    {
      "name": "update_data_docs",
      "action": {
        "class_name": "UpdateDataDocsAction",
        "site_names": []
      }
    }
  ],
  "batch_request": {},
  "class_name": "Checkpoint",
  "config_version": 1.0,
  "evaluation_parameters": {},
  "module_name": "great_expectations.checkpoint",
  "name": "my_checkpoint",
  "profilers": [],
  "runtime_configuration": {},
  "validations": [
    {
      "batch_request": {
        "datasource_name": "taxi_multi_batch_sql_datasource",
        "data_connector_name": "inferred_data_connector_multi_batch_asset",
        "data_asset_name": "yellow_tripdata_sample_2020",
        "data_connector_query": {
          "batch_filter_parameters": {
        

In [25]:
results = data_context.run_checkpoint(
    checkpoint_name="my_checkpoint"
)

Calculating Metrics:   0%|          | 0/9 [00:00<?, ?it/s]

In [26]:
results.success

False

# Appendix

## Other Parameters for `InferredAssetSqlDataConnector`

The signature of the `InferredAssetSqlDataConnector` also contains the following required parameters:
- `name`: string describing the name of this DataConnector.
- `datasource_name`: the name of the Datasource that contains it.
- `execution_engine`: an ExecutionEngine.

And the following optional parameters:
- `data_asset_name_prefix`: string describing an optional prefix to prepend to inferred data_asset_names.
- `data_asset_name_suffix`: string describing an optional suffix to append to inferred data_asset_names.
- `include_schema_name`: bool which answers the question : "Should the data_asset_name include the schema as a prefix?"
- `splitter_method`: string that names method to split the target table into multiple `Batches`.
- `splitter_kwargs`: dict containing keyword arguments to pass to `splitter_method`.
- `sampling_method`: string that names method to sample `Batches`.
- `sampling_kwargs`: dict containing keyword arguments to pass to `sampling_method`.
- `excluded_tables`: a list of tables to ignore when inferring data_asset_names.
- `included_tables`: if not `None`, only include tables in this list when inferring data_asset_names.
- `skip_inapplicable_tables`:  If `True`, tables that can't be successfully queried using sampling and splitter methods are excluded from inferred data_asset_names. If `False`, the class will throw an error during initialization if any such tables are encountered.
- `batch_spec_passthrough`: dictionary with keys that will be added directly to batch_spec.
- `introspection_directives`: Arguments passed to the introspection method to guide introspection

Valid keys for `introspection_directives` include: 
- `schema_name`: string describing schema to introspect (default is `None`). We used this parameter in our example above. 
- `ignore_information_schemas_and_system_tables`: bool (`default=True`) which determines whether to ignore information schemas and system tables when introspecting the database.
- `system_tables`: optional list of strings that define `system_tables` for your db. 
- `information_schemas`: optional list of strings that define `information_schemas` for your db. 
- `include_views`: bool (`default=True`) which determines whether to include `views` when introspecting the database. 
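As an illustration of how a few of these options might be combined, here is a sketch of a DataConnector configuration fragment. The parameter names come from the lists above; the specific values (such as the `taxi__` prefix) are hypothetical and should be adapted to your own database.

```python
# Sketch of a DataConnector configuration using some optional parameters.
# Values marked "hypothetical" are illustrative, not taken from the example above.
data_connector_config = {
    "class_name": "InferredAssetSqlDataConnector",
    "include_schema_name": True,
    "data_asset_name_prefix": "taxi__",  # hypothetical prefix
    "skip_inapplicable_tables": True,
    "splitter_method": "split_on_year_and_month",
    "splitter_kwargs": {"column_name": "pickup_datetime"},
    "introspection_directives": {
        "schema_name": "public",
        "include_views": False,  # only introspect tables, not views
    },
}

print(sorted(data_connector_config))
```

A fragment like this would sit under a key in the `data_connectors` section of the Datasource configuration, in the same position as `inferred_data_connector_multi_batch_asset` in the example.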

## Loading Data into Postgresql Database

* The following code can be used to build the postgres database used in this notebook. It is included (and commented out) for reference.
* In order to load the data into a local `postgresql` database, please feel free to use the `docker-compose.yml` file available at `great_expectations/assets/docker/postgresql/`. 

### To spin up the `postgresql` database
* Have [Docker Desktop](https://www.docker.com/products/docker-desktop/) running locally.
* Navigate to `great_expectations/assets/docker/postgresql/`
* Type `docker-compose up`
* Then uncomment and run the following snippet

In [None]:
# from tests.test_utils import load_data_into_test_database
# from typing import List
# import sqlalchemy as sa
# import pandas as pd
# CONNECTION_STRING = "postgresql+psycopg2://postgres:@localhost/test_ci"

# data_paths: List[str] = [
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-01.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-02.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-03.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-04.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-05.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-06.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-07.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-08.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-09.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-10.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-11.csv",
#      "../../../test_sets/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2020-12.csv",
# ]

    
# engine = sa.create_engine(CONNECTION_STRING)
# connection = engine.connect()
# table_name = "yellow_tripdata_sample_2020"
# res = connection.execute(f"DROP TABLE IF EXISTS {table_name}")

# for data_path in data_paths:
#     # This utility is not for general use. It is only to support testing.
#     load_data_into_test_database(
#         table_name="yellow_tripdata_sample_2020",
#         csv_path=data_path,
#         connection_string=CONNECTION_STRING,
#         load_full_dataset=True,
#         drop_existing_table=False,
#         convert_colnames_to_datetime=["pickup_datetime", "dropoff_datetime"]
#     )