# Expectation Suite
### Introduction
Great Expectations is a package that offers a batteries-included set of logic to get data validation up and running fast. Fully figuring out how Great Expectations works and applying it to your project, however, can be somewhat involved. This is what Grater Expectations helps you with!

This bootstrapped project makes a few choices for you and offers scripts, configurations and notebooks to get you started. The choices that were made are:

- Great Expectations output will be stored on S3
- The rendered Data Docs site will be stored on S3
- You will write your own data loading logic to read data into memory as a pandas DataFrame
- You will write your own set of expectations to test the quality of this data
- The validation logic will be deployed as a Docker container on AWS Lambda

To set you up for the above, the `testing_config.yml` configuration file had you enter parameters that will be used throughout this project. Assuming these were properly set, you can now continue to set up your expectation suite!

### Description
This notebook can be used to generate a so-called expectation suite that can be used to run validations against your data. An expectation suite is Great Expectations jargon for a collection of expectations (or tests) that you want to run your data against. 

This notebook helps you to set up such a set of expectations, by loading a batch of data and writing expectations that can be run on it. After doing so, this set of expectations will be saved as an expectation suite for later usage.
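
To make the term concrete, the idea of a suite can be sketched in plain Python as a named collection of checks that is run against one batch of data. This is a conceptual illustration only and not how Great Expectations is implemented; the check names and the example rows are made up.

```python
# Conceptual sketch only: this is NOT how Great Expectations is implemented
from typing import Callable, Dict, List

Row = Dict[str, object]
Expectation = Callable[[List[Row]], bool]

# A "suite" is modelled here as a named collection of checks (hypothetical names)
suite: Dict[str, Expectation] = {
    "table_is_not_empty": lambda rows: len(rows) > 0,
    "id_is_never_missing": lambda rows: all(r.get("id") is not None for r in rows),
}


def run_suite(suite: Dict[str, Expectation], rows: List[Row]) -> Dict[str, bool]:
    """Run every expectation in the suite against one batch of rows."""
    return {name: check(rows) for name, check in suite.items()}


# One made-up batch in which the second row has a missing id
batch = [{"id": 1, "value": 3.2}, {"id": None, "value": 1.5}]
results = run_suite(suite, batch)
print(results)
```

Great Expectations adds a lot on top of this idea, such as storing the suite, rendering results and reusing the same checks for every new batch.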

In the last step, this expectation suite will be connected to a checkpoint, which is an object that can be called by data validation logic to run new data batches against. Hence, after setting up an expectation suite and checkpoint using this notebook, the next step is to finalize the AWS Lambda function found in `lambda_function.py` and deploy that on AWS as a Docker image. To assist with these steps, Terraform configurations, a Dockerfile and a bash script (`build_image_store_on_ecr.sh`) were automatically generated.

### Dependencies

#### Virtual environment
To run the logic contained within this notebook, make sure that it was started from a virtual environment that contains all required Python dependencies. The easiest way to ensure this is to first create a virtual environment for the project and then install `grater_expectations` within it.

To create a new virtual environment (e.g. for Python 3.8) and install the package and its dependencies, run the following:

<br>

**Option 1: Pip**

```bash
# Create a virtual environment
python -m venv env

# Activate the virtual environment
env\Scripts\activate # Windows
source env/bin/activate # macOS / Linux

# Install into the virtual environment
pip install grater_expectations
```

<br>

**Option 2: Anaconda**

```bash
# Create a conda environment
conda create --name grater_expectations python=3.8

# Activate the conda environment
conda activate grater_expectations

# Install into the virtual environment
conda install grater_expectations
```
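
To verify that the notebook kernel really runs in the environment you just prepared, a quick check along the following lines may help. It is a minimal sketch that uses only the standard library; the list of package names is an assumption based on the imports used in this notebook.

```python
import importlib.util
import sys
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level packages from `names` that cannot be found here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list of packages, based on the imports used in this notebook
required = ["great_expectations", "boto3", "pandas"]

# The interpreter path shows which environment the kernel is using
print("Interpreter:", sys.executable)
print("Missing packages:", missing_packages(required))
```

If the interpreter path does not point into your environment, or packages are reported as missing, restart Jupyter from the activated environment.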

<br>
<hr>

<br>

#### S3 buckets
For testing, Great Expectations is configured to interact with S3 buckets on AWS. To be able to run all the code in this notebook, the **store_bucket** and **site_bucket** as configured in `testing_config.yml` must be provisioned.

To do so, auto-generated Terraform files can be found in the *terraform/buckets* directory of this project. To use these configurations to generate S3 buckets, open a terminal in this directory and run the following commands:

<br>

```bash
# Initialize Terraform
terraform init

# Generate deployment plan, enter yes at the prompt if the plan is correct
terraform apply
```
<br>

**NOTE**: the code for provisioning an S3 bucket to store data in is commented out by default, assuming that a storage location with data already exists. If, however, you do want to provision an S3 bucket to store data in, uncomment the configurations found in `terraform/buckets/data_bucket.tf` and run the Terraform commands shown above.
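
After provisioning, you may want to confirm from Python that the buckets are visible to your credentials. The helper below is a hypothetical sketch: it assumes a boto3 S3 client is passed in and that your credentials are allowed to list the buckets of the account.

```python
from typing import Iterable, List


def missing_buckets(s3_client, bucket_names: Iterable[str]) -> List[str]:
    """Return the names from `bucket_names` that the account does not list.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    # list_buckets returns a dict with a "Buckets" list of {"Name": ...} entries
    response = s3_client.list_buckets()
    existing = {bucket["Name"] for bucket in response["Buckets"]}
    return [name for name in bucket_names if name not in existing]
```

For example, `missing_buckets(boto3.client("s3"), ["my-store-bucket", "my-site-bucket"])`, with the placeholder names replaced by the values from `testing_config.yml`, returns an empty list when both buckets were found.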

#### Imports and configurations
In the cell below, packages are imported and configurations are loaded from the `project_config.yml` file. This config file was automatically generated by `initialize_project.py` based on the parameters set in `testing_config.yml` at the root of this repository.

In [None]:
# Imports
import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest
import boto3
from supporting_functions import (TestingConfiguration, 
                                  checkpoint_without_datadocs_update, 
                                  print_ge_site_link)
import os

# Load parameters from configuration file
test_config = TestingConfiguration("project_config.yml")
test_config.load_config()

#### Initialization of objects
Next, objects required to interact with AWS from Python and a Great Expectations DataContext are initialized.

By default, the `DataContext` class loads the GE configuration file from `great_expectations.yml` in the great_expectations subdirectory of this project. With that configuration loaded, the `DataContext` object stores the parameters you need to interact with GE and automatically interacts with the S3 buckets for storing output, pulling checkpoints and generating Data Docs sites. For more information on data contexts, check out [the documentation](https://docs.greatexpectations.io/docs/terms/data_context/).
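
Later in this notebook, a datasource named `runtime_data` is referenced when the batch request is built. A small, hypothetical helper such as the one below can confirm that the loaded configuration actually contains it. The sketch assumes the legacy `DataContext` API used in this notebook, in which `list_datasources()` returns a list of dictionaries that each carry a `name` key.

```python
def has_datasource(context, name: str = "runtime_data") -> bool:
    """Check whether a datasource with this name is configured in the context."""
    # Each entry is assumed to be a dict describing one configured datasource
    return any(ds.get("name") == name for ds in context.list_datasources())
```

Once `context` has been initialized in the cell below, `has_datasource(context)` should return `True` for a correctly generated project.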

In [None]:
# -- 1. Initialize GE and S3 objects
s3_client = boto3.client("s3")
bucket = boto3.resource("s3").Bucket(test_config.data_bucket)
context = ge.data_context.DataContext()

#### Loading data
Next, logic must be written to load data. The easiest way to do so is to write a new function for this that you store in `supporting_functions.py`, so that it can be used both in this notebook and in the Lambda function.

If you have your data locally as a CSV file, such a function could be as simple as:

<br>

```python
import pandas as pd

def load_data(path: str) -> pd.DataFrame:
    # Read the local CSV file into a pandas DataFrame
    df_batch = pd.read_csv(path)

    return df_batch
```

<br>

However, note that if you transfer this logic to a Lambda function, data will most probably have to be downloaded from another location, for example an S3 bucket. Loading logic for such a case could look like:

<br>

```python
import pandas as pd
from boto3.resources.base import ServiceResource


def load_csv_from_s3(
    s3_bucket_client: ServiceResource, file_key: str
) -> pd.DataFrame:
    """Function that loads a csv from S3 into a pandas DataFrame

    Parameters
    ----------
    s3_bucket_client : boto3 S3 Bucket resource
        Instantiated S3 bucket resource, e.g. boto3.resource("s3").Bucket(name).
        Note that it should already be pointing to the bucket from which you
        want to load objects
    file_key : str
        Key of the CSV object within the bucket to be loaded

    Returns
    -------
    pd.DataFrame
        The loaded data as pandas DataFrame
    """
    # Fetch the object and read its streaming body directly into pandas
    s3_object = s3_bucket_client.Object(file_key).get()
    df_batch = pd.read_csv(s3_object["Body"])

    return df_batch
```

<br>

Pay special attention to what inputs this function needs and how it knows what file to load, since this will need to be used in the Lambda as well. If you are able to pass the path to the file during runtime when using a Lambda (e.g. in the event sent to it), you can simply use that as a parameter of the function.
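
As an illustration of passing the file location through the event, the sketch below extracts the bucket name and object key from an S3 event notification. It assumes the Lambda is triggered by such a notification, which is only one possible setup; the function name and the example event contents are hypothetical.

```python
from typing import Tuple
from urllib.parse import unquote_plus


def s3_location_from_event(event: dict) -> Tuple[str, str]:
    """Extract (bucket name, object key) from the first record of an S3 event."""
    record = event["Records"][0]["s3"]
    bucket_name = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications
    file_key = unquote_plus(record["object"]["key"])
    return bucket_name, file_key


# Trimmed-down, made-up example of the relevant part of such an event
example_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-data-bucket"},
                "object": {"key": "incoming/batch+2023-01-01.csv"},
            }
        }
    ]
}
print(s3_location_from_event(example_event))
```

The returned key can then be passed to a loading function such as `load_csv_from_s3`.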

Apart from the dataset, Great Expectations needs some additional parameters for operations down the line. These are an identifier for the batch being run and a name for the data asset. These can be the same, as long as they can be used to identify which dataset is being evaluated.
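
One possible way to generate both values is to derive them from the key or path of the file being validated. The sketch below is an assumption about what suits your data, not a requirement of Great Expectations; the function name and the naming scheme are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Optional, Tuple


def make_identifiers(
    file_key: str, now: Optional[datetime] = None
) -> Tuple[str, str]:
    """Derive (asset_name, batch_identifier) from a file key.

    asset_name: the file name without its extension
    batch_identifier: the asset name plus a UTC timestamp, so that repeated
    runs on the same file stay distinguishable
    """
    now = now or datetime.now(timezone.utc)
    asset_name = PurePosixPath(file_key).stem
    timestamp = now.strftime("%Y%m%dT%H%M%S")
    batch_identifier = f"{asset_name}_{timestamp}"
    return asset_name, batch_identifier


print(make_identifiers("incoming/sales_2023_01.csv"))
```

Whatever scheme you choose, use the same one in the Lambda function so that validation results in Data Docs can be traced back to the files they belong to.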

In [None]:
# -- 2. Load data, generate asset name and batch identifier
#
# To generate a testing suite, a batch of data can be used to develop
# validations. Accordingly, in the lines below custom logic is required to load
# a test dataset as a pandas DataFrame into the df_batch variable
#
# To make the RuntimeBatchRequest complete, which is used by GE for the
# validations, an asset_name and batch_identifier are required
df_batch = load_data()  # Needs to be defined!
batch_identifier = "BATCH IDENTIFIER OR LOGIC TO GENERATE IT GOES HERE"
asset_name = "ASSET NAME OR LOGIC TO GENERATE IT GOES HERE"

#### Generate batch request, generate suite and start validator
When you have successfully written logic to load data, the loaded batch will be passed to a RuntimeBatchRequest below in order to start an expectation suite. Great Expectations requires data to be passed as a request when you want to use the context to generate things such as expectation suites and checkpoints. The RuntimeBatchRequest is used here because we are loading data (with our own logic) at runtime and want to pass that to subsequent objects.

More information on runtime batch requests can be found [here](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_configure_a_runtimedataconnector/).

Using the RuntimeBatchRequest, two things are done next:
1. Generate an expectation suite: this will serve as the collection of tests you run against future datasets
2. Generate a validator: this object runs expectations against the batch loaded previously and stores them in the suite

As soon as the suite and validator are initiated, you can start writing expectations in the next cells.


In [None]:
# -- 3. Generate batch request at runtime using loaded batch
batch_request = RuntimeBatchRequest(
    datasource_name="runtime_data",
    data_connector_name="runtime_data_connector",
    data_asset_name=asset_name,
    runtime_parameters={"batch_data": df_batch},
    batch_identifiers={"batch_identifier": batch_identifier,},
)

# -- 4. Generate expectation suite, start validator
suite = context.create_expectation_suite(
    test_config.expectations_suite_name,
    overwrite_existing=True,  
)

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=test_config.expectations_suite_name,
)

#### Expectations
Using the validator object, expectations can be formulated below. Since Great Expectations comes with many expectations out of the box, [this page](https://greatexpectations.io/expectations) is generally a good place to start browsing through these. 

Predefined expectations can be used by calling them using the validator object and passing the required arguments. For example, to run an expectation on the number of rows in a dataframe, the following snippet can be used:

<br>

```python
# Get number of rows of current batch
row_count = df_batch.shape[0]

# Make expectation where the maximum deviation from the batch number of rows is 1%
max_delta = 0.01
validator.expect_table_row_count_to_be_between(
    min_value=row_count * (1-max_delta), max_value=row_count * (1+max_delta))
```

<br>

If you want to develop custom expectations, more information can be found [here](https://docs.greatexpectations.io/docs/guides/expectations/creating_custom_expectations/overview).
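
Column-level expectations are added in the same way as the row count example. The helper below is a sketch of how a few common built-in expectations could be grouped; the helper itself, the 95% threshold and the column names in the usage note are hypothetical and should be adapted to your dataset.

```python
from typing import Iterable


def add_basic_column_expectations(
    validator, required_columns: Iterable[str], key_column: str
) -> None:
    """Register a few common built-in expectations on the validator."""
    for column in required_columns:
        # The column must be present in the batch
        validator.expect_column_to_exist(column=column)
        # At least 95% of the values in the column must be non-null
        validator.expect_column_values_to_not_be_null(column=column, mostly=0.95)
    # The key column must not contain duplicate values
    validator.expect_column_values_to_be_unique(column=key_column)
```

With the `validator` created earlier, this could be called as `add_basic_column_expectations(validator, ["id", "amount"], key_column="id")`.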

In [None]:
# -- 5. Set expectations
## PUT YOUR EXPECTATIONS BELOW

#### Finalize suite and create checkpoint
After running all the expectations that you want to apply to the data, the cell below can be executed to save the set of expectations as a suite and couple it with a checkpoint. This checkpoint can be used in other scripts (such as `lambda_function.py`) by passing a new batch of data (as a `RuntimeBatchRequest`) along with the checkpoint name to the `run_checkpoint` method of an initialized DataContext (which is the same as the `context` object initialized in this notebook). Such a call would look like:

<br>

```python
results = context.run_checkpoint(
    checkpoint_name="CHECKPOINT_NAME",
    validations=[{"batch_request": batch_request}],
    )
```
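
In a script such as the Lambda function, you will usually want to act on the outcome of the run. The sketch below assumes that the returned result exposes an overall `success` flag through key access, as in the legacy checkpoint API; the function name and error message are hypothetical.

```python
def raise_on_failed_validation(results) -> None:
    """Raise an error when a checkpoint run reports failed expectations."""
    # `success` is expected to be True only if all expectations passed
    if not results["success"]:
        raise ValueError("Data validation failed; inspect Data Docs for details")
```

Raising an error is only one option. Logging the failure or sending a notification may fit your pipeline better.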

<br>

**NOTE**: you should choose what kind of checkpoint you want to use below: one with or without automatic Data Docs updates

When the code below is run, Great Expectations automatically saves the expectation suite and checkpoint to S3 via the validator and context objects.
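
The checkpoint configuration in the next cell includes a `run_name_template`. Assuming that the template in your configuration contains `strftime` directives, the sketch below shows how such a template expands into a run name for a given run time. The template value and timestamp used here are made up for illustration.

```python
from datetime import datetime, timezone

# Hypothetical template; the actual value comes from your project configuration
run_name_template = "%Y%m%d-%H%M%S-validation"

# Fixed example timestamp so that the outcome is reproducible
run_time = datetime(2023, 1, 2, 3, 4, 5, tzinfo=timezone.utc)

# Date directives are filled in, the remaining text is kept as-is
run_name = run_time.strftime(run_name_template)
print(run_name)
```

A template like this keeps validation runs sortable by time in the Data Docs overview.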

In [None]:
# -- 6. Save suite
validator.save_expectation_suite(discard_failed_expectations=False)

# -- 7. Create checkpoint
#       Two options are provided here for creating a checkpoint: 
#       1. Use a SimpleCheckpoint, which contains an action to automatically update the
#          Data Docs website whenever a validation is run
#       2. Use checkpoint_without_datadocs_update, which generates the same
#          SimpleCheckpoint but without an automatic Data Docs update action. This is
#          useful if you run many validations (1000+) in parallel, since rendering the
#          Data Docs website becomes very slow in that case          

# -- 7.1 Create Simple checkpoint with automatic data docs updates (DEFAULT)
checkpoint_config = {
    "name": test_config.checkpoint_name,
    "config_version": 3,
    "class_name": "SimpleCheckpoint",
    "expectation_suite_name": test_config.expectations_suite_name,
    "run_name_template": test_config.run_name_template,
}

# -- 7.2 Create checkpoint without automatic data docs update
# checkpoint_config = checkpoint_without_datadocs_update(test_config)
context.add_checkpoint(**checkpoint_config)

#### Instantiate Data Docs website
Next, the command below can be run to build the Data Docs website on S3, which provides an interactive user interface in which you can browse through expectation suites and check validation results. More on Data Docs can be found [here](https://docs.greatexpectations.io/docs/reference/data_docs/).

In general, if you generate the website once, new validations should automatically be loaded onto it. If the website starts lagging behind, just rerun the command below.

In [None]:
ge_site_output = context.build_data_docs()
print_ge_site_link(ge_site_output)

#### Next steps

After you have run all of the steps above, instantiated a Data Docs website and can see your expectation suite there, you are ready to proceed with developing the logic for the Lambda function in `lambda_function.py`, building a Docker image from it, uploading that image to ECR and deploying it as a Lambda function.

Please refer to the README for more information.

#### Stop Jupyter server
After running all commands above and generating an expectation suite, checkpoint and website, run the command below to stop the Jupyter server.

In [None]:
os.system("jupyter notebook stop")