# Grater Expectations Tutorial

![Grater Expectations](https://raw.githubusercontent.com/jschra/grater_expectations/main/docs/images/grater_expectations_background_small.png)

### Introduction
Welcome to the Grater Expectations tutorial! This notebook walks you through a full example of using Grater Expectations. As already mentioned in the README, you need the following to run all the components of the tutorial:

- AWS account with [programmatic access keys](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html)
- [Docker Engine](https://docs.docker.com/engine/): to create new images to run on AWS Lambda and push them to ECR
- [AWS CLI](https://aws.amazon.com/cli/): to login to AWS, create an ECR repository and push docker images to ECR
- Python 3.8: It is recommended to use conda ([Miniconda](https://docs.conda.io/en/latest/miniconda.html)) for easy environment creation and management
- [Terraform](https://www.terraform.io/): to spin up S3 buckets for GE artifacts and the Data Docs website and a Lambda function for testing
- IDE (e.g. VS Code, optional): for easier development (not necessarily for notebooks, but definitely for Python files)

If you have these installed, then you are ready to continue with the tutorial!

<hr>

Great Expectations is a package for validating your data that offers a batteries-included set of logic to get up and running fast. Fully figuring out how Great Expectations works and applying it to your project, however, can be somewhat involved. This is what Grater Expectations and this tutorial help you with!

Grater Expectations makes a few choices for you and offers scripts, configurations and notebooks to get you started. The choices that were made are:

- Great Expectations output will be stored on S3
- The rendered Data Docs site will be stored on S3
- You will write your own data loading logic to read data into memory as a pandas DataFrame
- You will write your own set of expectations to test the quality of this data
- The validation logic will be deployed as a Docker container on AWS Lambda

To set you up for the above, you already entered configurations in `testing_config.yml` for the tutorial, which will be used throughout the code to generate AWS services and access them.

<hr>

To get an idea of how Grater Expectations can help you to develop and deploy logic that you can use to test data, a simplified workflow is shown below. 

![Workflow](https://raw.githubusercontent.com/jschra/grater_expectations/main/docs/images/high_level_workflow.png)

The overall idea is to **configure** data testing by using Great Expectations, so that a preset selection of tests can be run over new data at **runtime**.

To **configure** Great Expectations, an example dataset that is representative of the future data to be tested is loaded. Using this dataset, multiple tests (*expectations*) are defined and bundled into a set of tests (*expectation suite*). To be able to call this expectation suite at runtime, it is connected to a checkpoint. The logic that calls this checkpoint to test new data is developed in a Python script and built into a Docker image. This image is deployed as an AWS Lambda function.

At **runtime**, an event containing information on which new data to load and test can then be sent to the deployed Lambda to invoke it. The Lambda will then load and validate the new data, using the checkpoint and expectation suite previously developed. The results will then be published on a so-called Data Docs website, which end users can then inspect.

In the rest of this tutorial, each of the steps to set up services, configure Great Expectations and validate new data will be detailed.

### Table of contents
This notebook will guide you through each of the steps to get you testing your data as soon as possible. The steps are:

1. [Setting up a virtual environment](#setting-up-a-virtual-environment)
2. [Provisioning S3 buckets](#provisioning-s3-buckets)
3. [Imports and configurations](#imports-and-configurations)
4. [Initialization of objects](#initialization-of-objects)
5. [Uploading tutorial data to S3](#uploading-tutorial-data-to-s3)
6. [Loading data](#loading-data)
7. [Generating a batch request, expectation suite and validator](#generating-a-batch-request-expectation-suite-and-validator)
8. [Creating Expectations](#creating-expectations)
9. [Finalizing expectation suite and creating checkpoint](#finalizing-expectation-suite-and-creating-checkpoint)
10. [Instantiating the Data Docs website](#instantiating-the-data-docs-website)
11. [Developing logic for the Lambda function](#developing-logic-for-the-lambda-function)
12. [Creating a Docker image and deployment on AWS ECR](#creating-a-docker-image-and-deployment-on-aws-ecr)
13. [Deploying the AWS Lambda function](#deploying-the-aws-lambda-function)
14. [Validating new batches of data](#validating-new-batches-of-data)
15. [Wrap up of tutorial and clean-up of AWS services](#wrap-up-of-tutorial-and-clean-up-of-aws-services)
16. [End of tutorial](#end-of-tutorial)

<br>
<hr>


#### Setting up a virtual environment
In order to run the logic contained within this notebook, make sure that it was started from a virtual environment that contains all required Python dependencies. The easiest way to ensure this is to first create a virtual environment for the project and then install `grater_expectations` within it.

To create a new virtual environment, e.g. for Python 3.8, and install the package and its dependencies, run the following:

<br>
<hr>

**Option 1: Pip**

```bash
# Create a virtual environment
python -m venv env

# Activate the virtual environment
env/Scripts/Activate # Windows
source env/bin/activate # macOS / Linux

# Install into the virtual environment
pip install grater_expectations
```

<br>


**Option 2: Anaconda**

```bash
# Create a conda environment
conda create --name grater_expectations python=3.8

# Activate the conda environment
conda activate grater_expectations

# Install into the virtual environment
pip install grater_expectations
```
<br>
<hr>


#### Provisioning S3 buckets
For testing, Grater Expectations is configured to interact with S3 buckets on AWS. To be able to run all the code in this notebook, the **store_bucket**, **site_bucket** and **data_bucket** as configured in `testing_config.yml` must be provisioned.

To do so, use the auto-generated Terraform files found in the *terraform/buckets* directory of this project. Open a (Git Bash) terminal, set your AWS programmatic access credentials as environment variables and run the following commands:

<br>

```bash
# Set credentials
export AWS_ACCESS_KEY_ID=<enter_aws_access_key_here>
export AWS_SECRET_ACCESS_KEY=<enter_aws_secret_access_key_here>
export AWS_SESSION_TOKEN=<enter_session_token_here> # If using AWS SSO

# Go to correct Terraform directory
cd terraform/buckets

# Initialize Terraform
terraform init

# Generate deployment plan, enter yes at the prompt if the plan is correct
terraform apply
```
<br>

Terraform will then provide you with a plan, which it will deploy if you enter 'yes' at the prompt as shown below.

![Terraform bucket prompt](https://raw.githubusercontent.com/jschra/grater_expectations/main/docs/images/tutorial_tf_bucket_prompt.png)

After entering yes, Terraform will deploy the buckets and will show you the following output when finished.

![Terraform bucket apply](https://raw.githubusercontent.com/jschra/grater_expectations/main/docs/images/tutorial_tf_bucket_apply.png)

<br>

You can check if the buckets were successfully created by logging in to the AWS console and following [this link](https://s3.console.aws.amazon.com/s3/buckets).
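
The same check can also be scripted. The sketch below is not part of Grater Expectations. It assumes a boto3 S3 client and relies on its `head_bucket` call, which raises an error when a bucket does not exist or cannot be accessed. The function name `find_missing_buckets` and the bucket names in the usage comment are hypothetical.

```python
from typing import Iterable, List


def find_missing_buckets(s3_client, bucket_names: Iterable[str]) -> List[str]:
    """Return the bucket names that the given S3 client cannot reach."""
    missing = []
    for name in bucket_names:
        try:
            # head_bucket returns normally when the bucket exists and is accessible
            s3_client.head_bucket(Bucket=name)
        except Exception:
            # boto3 raises botocore.exceptions.ClientError here (e.g. 404 or 403);
            # the broad except only keeps this sketch free of extra imports
            missing.append(name)
    return missing


# Usage (requires boto3 and valid AWS credentials; bucket names are placeholders):
# import boto3
# print(find_missing_buckets(boto3.client("s3"), ["my-store-bucket", "my-site-bucket"]))
```

An empty list means all buckets are reachable with the credentials in use.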

<br>
<hr>

#### Imports and configurations
In the cell below, packages are imported and configurations are loaded from `project_config.yml`, which was automatically created by `initialize_project.py`. This config file is based on the parameters set in `testing_config.yml` at the root of this repository.

Note specifically the import of `load_csv_from_s3` as `load_data`. For this tutorial, a function was already developed to load csv files from S3 and turn these into pandas DataFrames. You can inspect the code for this function in `supporting_functions.py` or by calling `help(load_data)`.

In [None]:
# Imports of Python and Grater Expectations packages and logic
import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest
import boto3
from supporting_functions import (TestingConfiguration,
                                  get_file_keys_from_s3,
                                  print_ge_site_link,
                                  generate_link_in_notebook,
                                  invoke_lambda_function)
from supporting_functions import load_csv_from_s3 as load_data
import json
import os

# -- Load parameters from configuration file
test_config = TestingConfiguration("project_config.yml")
test_config.load_config()

#### Initialization of objects
Next, objects required to interact with AWS from Python and a Great Expectations DataContext are initialized.

By default, the `DataContext` class loads the GE configuration file `great_expectations.yml` from the *great_expectations* subdirectory of this project. Based on this file, the `DataContext` object stores the parameters you need to interact with GE and automatically interacts with the S3 buckets for storing output, pulling checkpoints and generating Data Docs sites. For more information on data contexts, check out [the documentation](https://docs.greatexpectations.io/docs/terms/data_context/).

**NOTE**: If generating the DataContext object produces *invalid store configuration* warnings, you probably forgot to provision the S3 buckets for the tutorial. Ensure these are properly provisioned before continuing with the rest of the tutorial.

In [None]:
# -- 1. Initialize GE and S3 objects
s3_client = boto3.client("s3")
bucket = boto3.resource("s3").Bucket(test_config.data_bucket)
context = ge.data_context.DataContext()

#### Uploading tutorial data to S3

To emulate a *normal* setting, you will now upload the tutorial data found in this repository to the S3 data bucket you provisioned using Terraform. To do so, the `PATH_TUTORIAL_DATA` constant is set to the local path of the tutorial data. Next, the initialized bucket object (which points to the data bucket) is used to upload the data to your bucket.
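
To make the relation between local file paths and S3 object keys explicit, here is a minimal sketch. It is not the notebook's upload code. The helper name `plan_uploads` and the example directory and file names are assumptions. It only shows that a local path follows the conventions of your operating system, while an S3 key always uses forward slashes.

```python
import os
import posixpath
from typing import List, Tuple


def plan_uploads(local_dir: str, filenames: List[str], key_prefix: str) -> List[Tuple[str, str]]:
    """Pair each local file path with the S3 key it would be uploaded to."""
    plan = []
    for filename in sorted(filenames):
        local_path = os.path.join(local_dir, filename)  # OS-specific separator
        s3_key = posixpath.join(key_prefix, filename)   # S3 keys always use "/"
        plan.append((local_path, s3_key))
    return plan


# Hypothetical directory and file names
for local_path, s3_key in plan_uploads("data", ["b.csv", "a.csv"], "data/"):
    print(f"{local_path} -> {s3_key}")
```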

The data being uploaded contains information about taxi trips in New York City. More information about it can be found [here](https://registry.opendata.aws/nyc-tlc-trip-records-pds/).

After running the code cell below, you can open the AWS console and check the S3 bucket to see if the data now resides on it. The cell automatically provides a link to the bucket that you can click.

In [None]:
# -- Path constant
PATH_TUTORIAL_DATA = test_config.prefix_data

# -- Upload logic
for dataset in os.listdir(PATH_TUTORIAL_DATA):
    # -- Set path to file
    path_file = os.path.abspath(PATH_TUTORIAL_DATA+dataset)

    # -- Upload to S3
    bucket.meta.client.upload_file(
        Filename=path_file, 
        Bucket=test_config.data_bucket, 
        Key=PATH_TUTORIAL_DATA+dataset
    )

# -- Provide link to S3 bucket
url = f"https://s3.console.aws.amazon.com/s3/buckets/{bucket.name}?region={test_config.region}&tab=objects"
generate_link_in_notebook(url)

#### Loading data

The next step is to load an example dataset for which expectations can be created. As previously mentioned, the function `load_csv_from_s3` was already developed for this tutorial and was imported as `load_data`.

Having uploaded the tutorial data to S3 in the previous cell, we can now download it and transform it into a pandas DataFrame with the cell below (which is obviously redundant, but implemented for the sake of this tutorial). To know which file to load, we first get a list of objects residing on S3 using `get_file_keys_from_s3`, sort this list and then pick the oldest object from it.

The code of the `load_csv_from_s3` function is shown below.

<br>

```python
def load_csv_from_s3(
    s3_bucket_client: boto3.resource("s3").Bucket, prefix: str
) -> pd.DataFrame:
    """Function to load a csv from S3 into a pandas DataFrame

    Parameters
    ----------
    s3_bucket_client : boto3.resource
        Instantiated s3 bucket client using boto3. Note that it should already
        be pointing to the bucket from which you want to load objects
    prefix : str
        Prefix to csv object on S3

    Returns
    -------
    pd.DataFrame
        The loaded csv object as pandas DataFrame
    """
    s3_object = s3_bucket_client.Object(prefix).get()
    df = pd.read_csv(s3_object["Body"])

    return df
```

<br>

Pay special attention to what inputs this function needs and how it knows what file to load, since this will be used in the Lambda function as well. As you can see in the function source code, it will need an initialized bucket client and a prefix to an object in order to load data.

Apart from the dataset, Great Expectations needs some additional parameters for operations down the line. These are an identifier for the batch being run and a name for the data asset. These can be the same, as long as they can be used to identify which dataset is being evaluated.
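
Picking the first element of the sorted key list only yields the oldest file when the object keys sort chronologically, for example because they contain a zero-padded year and month. The sketch below illustrates this with made-up keys. The file names and the helper name `split_oldest` are assumptions and may differ from the tutorial data.

```python
from typing import List, Tuple


def split_oldest(keys: List[str]) -> Tuple[str, List[str]]:
    """Return the lexicographically first key and the remaining keys, both sorted."""
    if not keys:
        raise ValueError("no object keys found under the given prefix")
    ordered = sorted(keys)
    return ordered[0], ordered[1:]


# Hypothetical keys: zero-padded dates make lexicographic order equal to date order
example_keys = [
    "data/trips_2019-03.csv",
    "data/trips_2019-01.csv",
    "data/trips_2019-02.csv",
]
oldest, remaining = split_oldest(example_keys)
print(oldest)     # data/trips_2019-01.csv
print(remaining)  # ['data/trips_2019-02.csv', 'data/trips_2019-03.csv']
```

Keys without zero padding (such as `2019-2` and `2019-10`) would not sort in date order, so in that case the date should be parsed explicitly.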

In [None]:
# -- Get object keys from S3, sort, pick oldest
list_objects = get_file_keys_from_s3(s3_client, test_config.data_bucket, PATH_TUTORIAL_DATA)
list_objects.sort()
asset_name = list_objects[0]

# -- Load dataset
df_batch = load_data(bucket, asset_name)
batch_identifier = "tutorial_batch_dataset"

# -- Show top rows of the dataset to get an idea of its contents
df_batch.head(10)

#### Generating a batch request, expectation suite and validator
Now that the data has been loaded, this batch will be passed to a RuntimeBatchRequest below in order to start building an expectation suite. An expectation suite is Great Expectations jargon for a collection of expectations (or tests) that you want to run your data against. 

Great Expectations requires data to be passed as a request when you want to use the context object to generate things such as expectation suites. The RuntimeBatchRequest is used here, because we are loading data (with our own logic) at runtime and want to pass that to subsequent objects.

More information on runtime batch requests can be found [here](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_configure_a_runtimedataconnector/).
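
The names passed to a RuntimeBatchRequest have to match a datasource and data connector declared in the DataContext configuration, and the keys in `batch_identifiers` have to be ones that the connector declares. The dictionary below sketches what such a declaration typically looks like in Great Expectations. It is an assumption about this project's `great_expectations.yml`, not a copy of it. The helper `check_request_matches` is hypothetical and only illustrates the matching rule.

```python
# Assumed datasource declaration (the tutorial's actual YAML may differ)
datasource_config = {
    "name": "runtime_data",
    "class_name": "Datasource",
    "execution_engine": {"class_name": "PandasExecutionEngine"},
    "data_connectors": {
        "runtime_data_connector": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["batch_identifier"],
        }
    },
}


def check_request_matches(config: dict, datasource_name: str,
                          data_connector_name: str, batch_identifiers: dict) -> list:
    """Return a list of mismatches between a batch request and the datasource config."""
    problems = []
    if config["name"] != datasource_name:
        problems.append(f"unknown datasource: {datasource_name}")
    connector = config["data_connectors"].get(data_connector_name)
    if connector is None:
        problems.append(f"unknown data connector: {data_connector_name}")
        return problems
    # Every identifier key in the request must be declared by the connector
    unexpected = set(batch_identifiers) - set(connector["batch_identifiers"])
    if unexpected:
        problems.append(f"undeclared batch identifiers: {sorted(unexpected)}")
    return problems


print(check_request_matches(
    datasource_config, "runtime_data", "runtime_data_connector",
    {"batch_identifier": "tutorial_batch_dataset"},
))  # []
```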

Using the RuntimeBatchRequest, two things are done next:
1. Generating an expectation suite: this will serve as a collection of tests you will run for future datasets
2. Generating a validator: this object will use the batch dataset loaded previously to start running expectations and storing these in the suite

As soon as the suite and validator are initiated, you can start writing expectations in the next cells.

In [None]:
# -- 3. Generate batch request at runtime using loaded file
batch_request = RuntimeBatchRequest(
    datasource_name="runtime_data",
    data_connector_name="runtime_data_connector",
    data_asset_name=asset_name,
    runtime_parameters={"batch_data": df_batch},
    batch_identifiers={"batch_identifier": batch_identifier},
)

# -- 4. Generate expectation suite, start validator
suite = context.create_expectation_suite(
    test_config.expectations_suite_name,
    overwrite_existing=True,  
)

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=test_config.expectations_suite_name,
)

#### Creating Expectations
Using the validator object, expectations can be formulated below. Since Great Expectations comes with many expectations out of the box, [this page](https://greatexpectations.io/expectations) is generally a good place to start browsing through these. 

Predefined expectations can be used by calling them using the validator object and passing the required arguments. For example, to run an expectation on the number of rows in a dataframe, the following snippet can be used:

<br>

```python
# Get number of rows of current batch
row_count = df_batch.shape[0]

# Make expectation where the maximum deviation from the batch number of rows is 1%
max_delta = 0.01
validator.expect_table_row_count_to_be_between(
    min_value=row_count * (1-max_delta), max_value=row_count * (1+max_delta))
```

<br>

Expectations can be run both on the level of a table, e.g. evaluating the number of rows or columns, and on the level of a column, e.g. evaluating the minimum and maximum within a column. The expected values for such tests are set when generating the expectations suite using the `validator` object in the cells below and will be used for future validations.
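
As an illustration of column-level aggregate expectations: bounds on every value in a column can also be expressed as a lower bound on the column minimum plus an upper bound on the column maximum, which is the alternative the note in the column-level cell refers to. The sketch below assumes a validator that exposes `expect_column_min_to_be_between` and `expect_column_max_to_be_between`, both of which are standard Great Expectations expectations. The helper name `expect_column_within_bounds` is hypothetical.

```python
def expect_column_within_bounds(validator, column: str, lower, upper) -> None:
    """Bound a column through its aggregate minimum and maximum."""
    # The smallest value in the column must not drop below the lower bound ...
    validator.expect_column_min_to_be_between(column=column, min_value=lower)
    # ... and the largest value must not exceed the upper bound
    validator.expect_column_max_to_be_between(column=column, max_value=upper)


# Usage with the notebook's validator (bounds taken from the tutorial's own example):
# expect_column_within_bounds(validator, "VendorID", 1, 2)
```

One trade-off to keep in mind is that aggregate expectations report the observed minimum or maximum, not the individual rows that fall outside the bounds.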

Alternatively, expectations can also be set using dynamic evaluation parameters, which is just a fancy term for test values that you determine at runtime. This can be useful if, for example, you want to compare your current dataset with last month's data and base the values in your expectations on that data. An example of how to configure these dynamic evaluation parameters is shown in the last code cell of this section. More information about them can be found [here](https://docs.greatexpectations.io/docs/reference/evaluation_parameters/).
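
To make the idea more concrete, the sketch below derives parameter values from an earlier batch. It is only an illustration: the helper name and the margin of 2 are assumptions, while the parameter names are the ones used in this notebook's `passenger_count` expectation. A dictionary like this is the kind of value that can typically be supplied through the `evaluation_parameters` argument when a checkpoint is run.

```python
from typing import Dict, Iterable


def passenger_count_parameters(previous_counts: Iterable[int], margin: int = 2) -> Dict[str, int]:
    """Build evaluation parameters from the passenger counts of an earlier batch."""
    previous_max = max(previous_counts)
    return {
        "min_max_passenger_count": previous_max,
        "max_max_passenger_count": previous_max + margin,
    }


# Made-up passenger counts standing in for last month's data
params = passenger_count_parameters([1, 2, 6, 3])
print(params)  # {'min_max_passenger_count': 6, 'max_max_passenger_count': 8}
```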

Apart from existing expectations, you can also develop expectations yourself. If you want to do so, more information can be found [here](https://docs.greatexpectations.io/docs/guides/expectations/creating_custom_expectations/overview).

In [None]:
# -- Table level expectations

# -- Constant parameters
ROW_COUNT_DELTA = .05

# -- 1. Expect future datasets to contain the same columns as current batch DataFrame
expected_columns = list(df_batch.columns)  # plain list instead of a pandas Index
validator.expect_table_columns_to_match_set(
    column_set=expected_columns, exact_match=True
)

# -- 2. Expect row count to be in between range, based on set delta and number of rows of batch dataset
row_count = df_batch.shape[0]
validator.expect_table_row_count_to_be_between(
    min_value=row_count * (1-ROW_COUNT_DELTA), 
    max_value=row_count * (1+ROW_COUNT_DELTA),
)

In [None]:
# -- Column level expectations

# -- 1. Values are never null
for column in expected_columns:
    validator.expect_column_values_to_not_be_null(column)

# -- 2. Date columns are parseable
date_columns = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
for column in date_columns:
    validator.expect_column_values_to_be_dateutil_parseable(column)

# -- 3. Check dtypes, assuming dtypes of the batch dataset are correct (this is
#       something you might rather want to hard-code for real products)
dict_dtypes = {}
for column, dtype in zip(df_batch.dtypes.index, df_batch.dtypes):
    dict_dtypes[column] = str(dtype)

for column, dtype in dict_dtypes.items():
    validator.expect_column_values_to_be_of_type(column, dtype)

# -- 4. Expect values of specific columns to be between lower- and upper bounds
dict_bounds = {"VendorID":[1,2],
              "payment_type":[1,4]
              }

for column, (lower_bound, upper_bound) in dict_bounds.items():
    # NOTE: for large datasets, this expectation is really slow. Using separate tests
    # for what the minimum should be and the maximum should be is a lot faster, while
    # testing the same bounds
    validator.expect_column_values_to_be_between(column, lower_bound, upper_bound)

In [None]:
# -- Expectation using dynamic evaluation parameters. Instead of passing values directly 
#    here, a dictionary is passed with the name of the dynamic parameters along with a mock
#    value for the current validation. At runtime, this value will need to be provided 
#    when running the validations

test_column = "passenger_count"
max_passenger_count = df_batch[test_column].max()

validator.expect_column_max_to_be_between(
    column=test_column,
    min_value={
        "$PARAMETER": "min_max_passenger_count",
        "$PARAMETER.min_max_passenger_count": max_passenger_count
        },
    max_value={
        "$PARAMETER": "max_max_passenger_count",
        "$PARAMETER.max_max_passenger_count": max_passenger_count+2
        },
)

#### Finalizing expectation suite and creating checkpoint
After running all the expectations in the cells above, the cell below can be executed to save this set of expectations as a suite and couple it with a checkpoint. 

A checkpoint is an object that validation logic can call to validate new batches of data. It couples an expectation suite and its parameters to a name. In addition, actions can be coupled to this checkpoint, such as automatically updating the Data Docs website or sending a message on Slack.
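
One of the parameters coupled to the checkpoint in this tutorial is a `run_name_template`. Great Expectations fills such a template using `strftime` date formatting, so that each validation run gets a timestamped name. The sketch below mimics that behaviour with the standard library only. The template string and the helper name `render_run_name` are examples, not necessarily what is configured in `project_config.yml`.

```python
from datetime import datetime


def render_run_name(template: str, moment: datetime) -> str:
    """Fill a strftime-style template with the given point in time."""
    return moment.strftime(template)


example_template = "%Y%m%d-%H%M%S-tutorial-run"  # hypothetical template
print(render_run_name(example_template, datetime(2022, 5, 17, 9, 30, 0)))
# 20220517-093000-tutorial-run
```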

This checkpoint can be used in other scripts (such as `lambda_function.py`) by passing a new batch of data (as a RuntimeBatchRequest) along with the checkpoint name to the `run_checkpoint` method of an initialized DataContext (which is the same as the `context` object initialized in this notebook). Such a call would look like:

<br>

```python
results = context.run_checkpoint(
    checkpoint_name="CHECKPOINT_NAME",
    validations=[{"batch_request": batch_request}],
    )
```

<br>

When the code below is run, Great Expectations automatically saves the expectation suite and checkpoint to S3 via the validator and context objects.

In [None]:
# -- 6. Save suite
validator.save_expectation_suite(discard_failed_expectations=False)

# -- 7.1 Create Simple checkpoint with automatic data docs updates
checkpoint_config = {
    "name": test_config.checkpoint_name,
    "config_version": 3,
    "class_name": "SimpleCheckpoint",
    "expectation_suite_name": test_config.expectations_suite_name,
    "run_name_template": test_config.run_name_template,
}

# -- 7.2 Add to context object
context.add_checkpoint(**checkpoint_config)

#### Instantiating the Data Docs website
Next, the command below can be run to build the Data Docs website on S3, which provides an interactive user interface in which you can browse through expectation suites and check validation results. More on Data Docs can be found [here](https://docs.greatexpectations.io/docs/reference/data_docs/).

If you use a SimpleCheckpoint, as is the case in this tutorial, the website will automatically be updated each time validations are run. If not, you have to update the Data Docs website yourself, either manually or programmatically, by calling `context.build_data_docs()`.

After initializing the Data Docs website, you should be able to see the expectation suite we just generated and inspect its expectations. In the following steps, we will start running validations, which will then also appear on the website.

In [None]:
ge_site_output = context.build_data_docs()
print_ge_site_link(ge_site_output)

#### Developing logic for the Lambda function

As previously stated, Grater Expectations implements data testing by deploying testing logic on an AWS Lambda function that can be called for new data. Normally, you would have to configure this Lambda yourself so that it can load data at runtime and run expectations. For this tutorial, the Lambda function code has already been completed with data loading logic. To run, it expects the event it receives at runtime to contain the prefix of the data file on S3, so that it knows what to load and run expectations over. The expected JSON is structured as `{"object_prefix":<prefix_to_dataset_on_s3>}`.
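
As a minimal sketch of how such an event could be checked before any data is loaded, consider the snippet below. It is not taken from `lambda_function.py`. The helper name `get_object_prefix` and the example key are assumptions.

```python
import json
from typing import Any, Dict


def get_object_prefix(event: Dict[str, Any]) -> str:
    """Extract and validate the S3 object prefix from a Lambda event."""
    prefix = event.get("object_prefix")
    if not isinstance(prefix, str) or not prefix:
        raise ValueError('event must look like {"object_prefix": "<prefix_to_dataset_on_s3>"}')
    return prefix


# The invoking side sends JSON; Lambda hands the handler the decoded dictionary
raw_payload = json.dumps({"object_prefix": "data/example.csv"})  # hypothetical key
print(get_object_prefix(json.loads(raw_payload)))  # data/example.csv
```

Failing early with a clear message makes a malformed invocation easier to diagnose than an S3 error further down the line.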


You can find all the code for the Lambda function in `lambda_function.py`. To get a better understanding of the function and its steps, it is worthwhile to open this file and walk through the steps and the code. Most of the classes and functions that it uses are stored in `supporting_functions.py`, so it is also worthwhile to have a look there.

In essence, the function is rather straightforward in its steps:
1. Initialize objects, load configuration files
2. Load data from S3 to pandas using the object_prefix passed in the event at runtime
3. Convert the data to a RuntimeBatchRequest
4. Set dynamic evaluation parameters
5. Run validations
6. Evaluate results

As the Lambda function does not require any tweaks, if you understand its contents and functioning, you can proceed to the next step.
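
The steps above can be sketched as a small orchestration function. This is a simplified illustration and not the code in `lambda_function.py`: loading and validation are passed in as plain callables, and the shape of the returned dictionary is an assumption.

```python
from typing import Any, Callable, Dict


def run_validation(
    event: Dict[str, Any],
    load_batch: Callable[[str], Any],
    validate: Callable[[Any, str], bool],
) -> Dict[str, Any]:
    """Load the dataset named in the event, validate it and report the outcome."""
    prefix = event["object_prefix"]    # step 2: which object to load
    batch = load_batch(prefix)         # step 2: S3 to pandas in the real function
    success = validate(batch, prefix)  # steps 3-5: batch request, parameters, checkpoint
    return {"object_prefix": prefix, "success": success}  # step 6: evaluate results


# Stand-ins for the real loading and validation logic
result = run_validation(
    {"object_prefix": "data/example.csv"},
    load_batch=lambda prefix: [1, 2, 3],
    validate=lambda batch, prefix: len(batch) > 0,
)
print(result)  # {'object_prefix': 'data/example.csv', 'success': True}
```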

<br>
<hr>

#### Creating a Docker image and deployment on AWS ECR
Because there are size constraints when it comes to using Python packages on AWS Lambda (max 250MB of loaded packages through layers), Grater Expectations uses Docker images instead (for which the size constraint is 10GB).

To help you set this up, `initialize_project.py` automatically generated all the boilerplate code you need to create a Docker image and push it to ECR. This logic can be found in:
- **Dockerfile**: this file contains the required steps to build a new Docker image to deploy on AWS
- **build_image_store_on_ecr.sh**: bash script containing all steps to create a new Docker image using the Dockerfile and load it to ECR, provided you have the Docker Engine and AWS CLI installed and your user credentials (AWS) can be accessed

Since the Lambda function is ready to be Dockerized and deployed at this stage, you can do so by calling this `build_image_store_on_ecr.sh` bash script in the terminal from the directory of the tutorial, as shown below. **Make sure that Docker Engine is running before you run the bash script!**

<br>

```bash
# Set AWS credentials in the current terminal
export AWS_ACCESS_KEY_ID=<enter_aws_access_key_here>
export AWS_SECRET_ACCESS_KEY=<enter_aws_secret_access_key_here>

# Go to the directory of the tutorial (if the terminal is not already opened there)
cd tutorial

# Run bash script from tutorial directory
sh build_image_store_on_ecr.sh
```
**NOTE**:
1. If a prompt appears after creating the ECR repository, you can close it by pressing q.
2. For Windows users, `build_image_store_on_ecr.sh` will not work when called from CMD or PowerShell. Instead, use Git Bash (which is automatically installed on your machine when you install Git) to call the bash script. Before doing so, make sure that you export your credentials in the terminal, so it can interact with AWS.

<br>

When `build_image_store_on_ecr.sh` is called, the script will build a new Docker image for Python 3.8 using a publicly available base image from AWS, install all dependencies within it based on `requirements.txt` and copy the required code and configuration files onto the image (`supporting_functions.py`, `lambda_function.py`, `project_config.yml` and `great_expectations/great_expectations.yml`). Next, it will create a new repository on AWS ECR (if needed) and upload the Docker image to it. The output in the terminal should look as follows:

![Bash output of deployment](https://raw.githubusercontent.com/jschra/grater_expectations/main/docs/images/bash_output_deployment.png)

<br>
<hr>

#### Deploying the AWS Lambda function

Now that the Lambda logic is available as a Docker image on ECR, the next step is to deploy a Lambda function that uses this Docker image.

This can be done by using the Terraform configurations that were automatically generated when you ran `initialize_project.py` for the tutorial. In the *terraform/lambda* subdirectory of the tutorial, you will find the configurations you need to spin up the Lambda.

As with the S3 buckets, open up a terminal in this directory (or open the directory in the terminal you already have opened) and run the `terraform init` and `terraform apply` commands, typing *yes* when prompted by Terraform. For completeness, see the snippet below.

<br>

```bash
# Set credentials (if not already set in the current terminal)
export AWS_ACCESS_KEY_ID=<enter_aws_access_key_here>
export AWS_SECRET_ACCESS_KEY=<enter_aws_secret_access_key_here>

# Go to correct Terraform directory
cd terraform/lambda

# Initialize Terraform
terraform init

# Generate deployment plan, enter yes at the prompt if the plan is correct
terraform apply
```
<br>

If successful, you will see the following outputs.

![Terraform Lambda output](https://raw.githubusercontent.com/jschra/grater_expectations/main/docs/images/tutorial_tf_lambda.png)

<br>

You can click the link below to check the Lambda function in the AWS console.

In [None]:
url = f"https://{test_config.region}.console.aws.amazon.com/lambda/home?region={test_config.region}#/functions/grater_expectations_validation_tutorial?tab=code"
generate_link_in_notebook(url)

#### Validating new batches of data

Now that the validation Lambda is ready to be used, the next step is to have it validate new batches of taxi trip data.  

To do so, we re-use the list of prefixes to datasets in `list_objects` that was previously generated. As you might recall, we used the first dataset in this list as batch dataset to configure expectations for. Now we can use the other datasets to run the expectation suite over using the Lambda function.

To do so, we take the remaining prefixes in the list and generate payloads from them to serve to the Lambda. As previously stated, the Lambda function expects to receive `{"object_prefix":<prefix_to_dataset_on_s3>}` in its event to know which dataset to load. When invoking the Lambda from Python, it expects to receive this JSON encoded as bytes. 

Therefore, the prefixes in `list_objects_lambda` are put into JSON objects and then encoded. After doing so, `invoke_lambda_function` is called sequentially for each of the datasets, and the responses of the calls to the Lambda function are stored in `responses`.
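
If you want to inspect these responses in code, the sketch below may help. It assumes that `invoke_lambda_function` returns the raw response of boto3's Lambda `invoke` call, which has not been verified here. Such a response contains a `StatusCode`, a file-like `Payload` and, only when the function raised an error, a `FunctionError` key. The helper name `summarize_invocation` and the payload contents are hypothetical.

```python
import io
import json
from typing import Any, Dict


def summarize_invocation(response: Dict[str, Any]) -> Dict[str, Any]:
    """Condense a Lambda invoke response into status, error flag and decoded payload."""
    raw = response["Payload"].read()  # file-like object holding the function output
    body = json.loads(raw) if raw else None
    return {
        "status_code": response.get("StatusCode"),
        "failed": "FunctionError" in response,  # key only present when the function errored
        "body": body,
    }


# Fake response mimicking the shape boto3 returns, so the sketch runs without AWS
fake_response = {"StatusCode": 200, "Payload": io.BytesIO(b'{"success": true}')}
print(summarize_invocation(fake_response))
# {'status_code': 200, 'failed': False, 'body': {'success': True}}
```

Note that the payload stream can only be read once, so store the summary if you need it again.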

After calling the Lambda, you can check the Data Docs website for the results of running the expectation suite on the other datasets.

In [None]:
# -- Initialize client
lambda_client = boto3.client('lambda', region_name=test_config.region)

# -- Get object keys from S3, sort, drop oldest dataset as this was used when setting up the expectation suite
list_objects_lambda = list_objects[1:]

# -- Generate payloads for invoking the Lambda
list_payloads = []
for prefix in list_objects_lambda:
    payload = {"object_prefix":prefix}
    payload_bytes = json.dumps(payload).encode("utf-8")
    list_payloads.append(payload_bytes)

# -- Invoke Lambda sequentially, store responses
responses = []
for payload in list_payloads:
    response = invoke_lambda_function(
        lambda_client=lambda_client, 
        payload=payload,
        lambda_function="grater_expectations_validation_tutorial")
    responses.append(response)

# -- Print GE website link for easy access
print_ge_site_link(ge_site_output)

#### Wrap up of tutorial and clean-up of AWS services

If you are able to see the validation runs on the Data Docs website, you have successfully completed the tutorial! You are now equipped and ready to start using Grater Expectations on your own projects. To do so, you can generate a new configuration for a project in `testing_config.yml`, ensure that you fill in parameters for each of the configurations and then call `python initialize_project.py -p <project_name>`. A new directory for your project will then be bootstrapped.

Before doing so, you should **clean up the services** that were provisioned for running this tutorial, to prevent unnecessary costs from accruing on your AWS bill.

To do so, open up your terminal again, browse to the *terraform/buckets* directory and, thereafter, the *terraform/lambda* directory, and call `terraform destroy` in each of them. An example for the Lambda configurations is shown below.

<br>

![Terraform Lambda destroy](https://raw.githubusercontent.com/jschra/grater_expectations/main/docs/images/tutorial_tf_lambda_destroy.png)

<br>

![Terraform Lambda destroy yes](https://raw.githubusercontent.com/jschra/grater_expectations/main/docs/images/tutorial_tf_lambda_destroy_yes.png)

<br>

After deprovisioning the S3 buckets and the Lambda function, the last step is to remove the Docker image from ECR. This can be done by running the code below.

In [None]:
# -- Initialize ECR object to remove repo and image
ecr = boto3.client("ecr", region_name=test_config.region)
ecr.delete_repository(
    registryId=test_config.account_id,
    repositoryName=test_config.docker_image_name,
    force=True
)

#### End of tutorial
And that concludes this tutorial! If you were running this notebook through a Jupyter server, you can stop it from running by calling the command below.

In [None]:
os.system("jupyter notebook stop")