**Introduction**

- This notebook follows the documentation of the "Great Expectations" package using Pandas Dataframes.
- The original document is posted at https://docs.greatexpectations.io/docs/core/connect_to_data/dataframes
- The required Python packages are installed on this server. (Make sure to use the "Conda Python 3.12" kernel)
- Datafiles are in `/data/public/tutorials/ge_tutorials`

**Please note:** The code in this notebook is taken from the code snippets in the tutorial. You will need to edit it before it runs.

In [19]:
import pandas as pd
import great_expectations as gx
print(gx.__version__)

1.2.1


# Connect to dataframe data
Reference: https://docs.greatexpectations.io/docs/core/connect_to_data/dataframes/

A dataframe is a set of data that resides in-memory and is represented in your code by a variable to which it is assigned. To connect to this in-memory data you will define a Data Source based on the type of dataframe you are connecting to, a Data Asset that connects to the dataframe in question, and a Batch Definition that will return all of the records in the dataframe as a single Batch of data.

## Create a Data Source
Because the dataframes reside in memory you do not need to specify the location of the data when you create your Data Source. Instead, the type of Data Source you create depends on the type of dataframe containing your data. Great Expectations has methods for connecting to both pandas and Spark dataframes.

### Prerequisites
[A preconfigured Data Context](https://docs.greatexpectations.io/docs/core/set_up_a_gx_environment/create_a_data_context). These examples assume the variable `context` contains your Data Context.



In [45]:
# context = gx.get_context() # EphemeralDataContext
context = gx.get_context(mode="file") # Let's save the context to the file system
print(type(context).__name__)

FileDataContext


### Define the Data Source parameters.

A dataframe Data Source requires the following information:

- name: A name by which to reference the Data Source. This should be unique among all Data Sources on the Data Context.

Update `data_source_name` in the following code with a descriptive name for your Data Source:

In [4]:
data_source_name = "my_data_source"

### Create the Data Source.

To read a pandas dataframe you will need to create a pandas Data Source. Likewise, to read a Spark dataframe you will need to create a Spark Data Source.

In [10]:
data_source = context.data_sources.add_pandas(name=data_source_name)

To list the Data Sources in your Data Context, use:
```
context.data_sources.all()
```
To delete a Data Source, use:
```
context.data_sources.delete(name=data_source_name)
```


## Create a Data Asset

A dataframe Data Asset is used to group your Validation Results. For instance, if you have a data pipeline with three stages and you wanted the Validation Results for each stage to be grouped together, you would create a Data Asset with a unique name representing each stage.


### Optional. Retrieve your Data Source.

If you do not already have a variable referencing your pandas or Spark Data Source, you can retrieve a previously created one with:

In [15]:
data_source_name = "my_data_source"
data_source = context.data_sources.get(data_source_name)
data_source

PandasDatasource(type='pandas', name='my_data_source', id=UUID('68fe3c5e-7dcb-48fa-aa5d-67eb92f6514a'), assets=[])

### Define the Data Asset's parameters.

A dataframe Data Asset requires the following information:

- name: A name by which the Data Asset can be referenced. This should be unique among Data Assets on the Data Source.

Update the `data_asset_name` parameter in the following code with a descriptive name for your Data Asset.

### Add a Data Asset to the Data Source.

Execute the following code to add a Data Asset to your Data Source:

In [17]:
data_asset_name = "my_dataframe_data_asset"
data_asset = data_source.add_dataframe_asset(name=data_asset_name)

## Create a Batch Definition

Typically, a Batch Definition is used to describe how the data within a Data Asset should be retrieved. With dataframes, all of the data in a given dataframe will always be retrieved as a Batch.

This means that Batch Definitions for dataframe Data Assets don't work to subdivide the data returned for validation. Instead, they serve as an additional layer of organization and allow you to further group your Validation Results. For example, if you have already used your dataframe Data Assets to group your Validation Results by pipeline stage, you could use two Batch Definitions to further group those results by having all automated validations use one Batch Definition and all manually executed validations use the other.

In [18]:
batch_definition_name = "my_batch_definition"
batch_definition = data_asset.add_batch_definition_whole_dataframe(
    batch_definition_name
)

## Provide a dataframe through Batch Parameters
Because dataframes exist in memory and cease to exist when a Python session ends, the dataframe itself is not saved as part of a Data Asset or Batch Definition. Instead, a dataframe created in the current Python session is passed in at runtime as a Batch Parameter dictionary.

### Define the Batch Parameter dictionary.

A dataframe can be added to a Batch Parameter dictionary by defining it as the value of the dictionary key `dataframe`:

In [None]:
batch_parameters = {"dataframe": dataframe}



In [23]:
csv_path = "/data/public/tutorials/ge_tutorials/data/yellow_tripdata_sample_2019-01.csv"
dataframe = pd.read_csv(csv_path)
print(f"Number of records: {dataframe.shape[0]:,}")
batch_parameters = {"dataframe": dataframe}

Number of records: 10,000


In [24]:
batch_definition = (
    context.data_sources.get(data_source_name)
    .get_asset(data_asset_name)
    .get_batch_definition(batch_definition_name)
)

In [25]:
# Create an Expectation to test
expectation = gx.expectations.ExpectColumnValuesToBeBetween(
    column="passenger_count", max_value=6, min_value=1
)

In [26]:
# Get the dataframe as a Batch
batch = batch_definition.get_batch(batch_parameters=batch_parameters)

In [27]:
# Test the Expectation
validation_results = batch.validate(expectation)
print(validation_results)

Calculating Metrics: 100%|████████████████████████████████████████████████████| 10/10 [00:00<00:00, 303.74it/s]

{
  "success": true,
  "expectation_config": {
    "type": "expect_column_values_to_be_between",
    "kwargs": {
      "batch_id": "my_data_source-my_dataframe_data_asset",
      "column": "passenger_count",
      "min_value": 1.0,
      "max_value": 6.0
    },
    "meta": {}
  },
  "result": {
    "element_count": 10000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_counts": [],
    "partial_unexpected_index_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}





# Define Expectations

## Create an Expectation
An Expectation is a verifiable assertion about your data. Expectations make implicit assumptions about your data explicit, and they provide a flexible, declarative language for describing expected behavior. They can help you better understand your data and help you improve data quality.



### Choose an Expectation to create.

GX comes with many built-in Expectations to cover your data quality needs. You can find a catalog of these Expectations in the Expectation Gallery. When browsing the Expectation Gallery you can filter the available Expectations by the data quality issue they address and by the Data Sources they support. There is also a search bar that will let you filter Expectations by matching text in their name or description.

In your code, you will find the classes for Expectations in the expectations module:

In [46]:
from great_expectations import expectations as gxe

### Determine the Expectation's required parameters

To determine the parameters your Expectation uses to evaluate data, reference the Expectation's entry in the Expectation Gallery. Under the Args section you will find a list of parameters that are necessary for the Expectation to be evaluated, along with a description of the value that should be provided.

Parameters that indicate a column, list of columns, or a table must be provided when the Expectation is created. The value in these parameters is used to differentiate instances of the same Expectation class. All other parameters can be set when the Expectation is created or be assigned a dictionary lookup that will allow them to be set at runtime.

### Optional. Determine the Expectation's other parameters

In addition to the parameters that are required for an Expectation to evaluate data, all Expectations also support some standard parameters that determine how strictly Expectations are evaluated and permit the addition of metadata. In the Expectation Gallery these are found under each Expectation's Other Parameters section.

These parameters are:

| Parameter | Purpose |
|-----------|---------|
| `meta` | A dictionary of user-supplied metadata to store with an Expectation. This dictionary can be used to add notes about the purpose and intended use of an Expectation. |
| `mostly` | A special argument that allows for fuzzy validation of ColumnMapExpectations and MultiColumnMapExpectations based on a percentage of successfully validated rows. If the percentage is high enough, the Expectation will return a success value of true. |
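
To make the effect of `mostly` concrete, here is a minimal plain-Python sketch of the idea. It is not GX's implementation: the helper `passes_with_mostly` and the sample values are hypothetical, and the sketch assumes that missing values are excluded before the passing fraction is computed.

```python
def passes_with_mostly(values, predicate, mostly=1.0):
    """Return True when at least `mostly` of the non-missing values satisfy `predicate`."""
    non_missing = [v for v in values if v is not None]
    if not non_missing:
        return True  # assumption for this sketch: nothing to evaluate counts as success
    passing = sum(1 for v in non_missing if predicate(v))
    return passing / len(non_missing) >= mostly


def in_range(value):
    """Row-level check: passenger count between 1 and 6, inclusive."""
    return 1 <= value <= 6


# Hypothetical sample: nine values inside 1..6 and one value (9) outside.
passenger_counts = [1, 2, 1, 6, 9, 1, 3, 2, 1, 1]

strict_result = passes_with_mostly(passenger_counts, in_range)        # every row must pass
lenient_result = passes_with_mostly(passenger_counts, in_range, 0.9)  # 90% of rows is enough

print(strict_result, lenient_result)
```

With the default of 1.0 a single out-of-range row fails the check, while `mostly=0.9` tolerates it because nine of the ten rows pass.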
                                                                                 

                                                                                                                               
### Create the Expectation.

Using the Expectation class you picked and the parameters you determined when referencing the Expectation Gallery, you can create your Expectation.

**Preset Parameters**
In this example the `ExpectColumnMaxToBeBetween` Expectation is created and all of its parameters are defined in advance while leaving `strict_min` and `strict_max` as their default values:

In [47]:
preset_expectation = gx.expectations.ExpectColumnMaxToBeBetween(
    column="passenger_count", min_value=1, max_value=6
)
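
The role of `strict_min` and `strict_max` can be illustrated with a small plain-Python sketch. This is an approximation of the documented behaviour, not GX code: the function `column_max_is_between` and the sample values are hypothetical, and it assumes the strict flags switch the comparison from inclusive to exclusive.

```python
def column_max_is_between(values, min_value=None, max_value=None,
                          strict_min=False, strict_max=False):
    """Check whether the column maximum lies within the given bounds."""
    observed = max(v for v in values if v is not None)

    above_min = True
    if min_value is not None:
        # strict_min=True requires the maximum to be strictly greater than min_value
        above_min = observed > min_value if strict_min else observed >= min_value

    below_max = True
    if max_value is not None:
        # strict_max=True requires the maximum to be strictly less than max_value
        below_max = observed < max_value if strict_max else observed <= max_value

    return {"success": above_min and below_max, "observed_value": observed}


# Hypothetical sample whose maximum sits exactly on the upper bound.
passenger_counts = [1, 1, 2, 6, 3]

default_result = column_max_is_between(passenger_counts, min_value=1, max_value=6)
strict_result = column_max_is_between(
    passenger_counts, min_value=1, max_value=6, strict_max=True
)

print(default_result)
print(strict_result)
```

With the default inclusive bounds a maximum of 6 satisfies `max_value=6`; with `strict_max=True` the same data fails.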

**Runtime Parameters** 

Runtime parameters are provided by passing a dictionary to the `expectation_parameters` argument of a Checkpoint's `run()` method.

To indicate which key in the expectation_parameters dictionary corresponds to a given parameter in an Expectation you define a lookup as the value of the parameter when the Expectation is created. This is done by passing in a dictionary with the key `$PARAMETER` when the Expectation is created. The value associated with the `$PARAMETER` key is the lookup used to find the parameter in the runtime dictionary.

In this example, `ExpectColumnMaxToBeBetween` is created for both the passenger_count and the fare fields, and the values for min_value and max_value in each Expectation will be passed in at runtime. To differentiate between the parameters for each Expectation a more specific key is set for finding each parameter in the runtime `expectation_parameters` dictionary:

In [48]:
passenger_expectation = gx.expectations.ExpectColumnMaxToBeBetween(
    column="passenger_count",
    min_value={"$PARAMETER": "expect_passenger_max_to_be_above"},
    max_value={"$PARAMETER": "expect_passenger_max_to_be_below"},
)
fare_expectation = gx.expectations.ExpectColumnMaxToBeBetween(
    column="fare",
    min_value={"$PARAMETER": "expect_fare_max_to_be_above"},
    max_value={"$PARAMETER": "expect_fare_max_to_be_below"},
)

The runtime expectation_parameters dictionary for the above example would look like:

In [32]:
runtime_expectation_parameters = {
    "expect_passenger_max_to_be_above": 4,
    "expect_passenger_max_to_be_below": 6,
    "expect_fare_max_to_be_above": 10.00,
    "expect_fare_max_to_be_below": 500.00,
}
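
Conceptually, each `$PARAMETER` lookup is replaced by the matching entry of the runtime dictionary. The sketch below mimics that substitution in plain Python so the mapping is easy to follow. The function `resolve_parameters` is hypothetical and is not part of the GX API; GX performs its own resolution when the Checkpoint runs.

```python
def resolve_parameters(expectation_kwargs, expectation_parameters):
    """Replace every {"$PARAMETER": key} lookup with the runtime value stored under key."""
    resolved = {}
    for name, value in expectation_kwargs.items():
        if isinstance(value, dict) and "$PARAMETER" in value:
            lookup_key = value["$PARAMETER"]
            resolved[name] = expectation_parameters[lookup_key]  # KeyError if not supplied
        else:
            resolved[name] = value  # preset values pass through unchanged
    return resolved


# Same lookups as the fare Expectation above, written as a plain dictionary.
fare_kwargs = {
    "column": "fare",
    "min_value": {"$PARAMETER": "expect_fare_max_to_be_above"},
    "max_value": {"$PARAMETER": "expect_fare_max_to_be_below"},
}

runtime_values = {
    "expect_passenger_max_to_be_above": 4,
    "expect_passenger_max_to_be_below": 6,
    "expect_fare_max_to_be_above": 10.00,
    "expect_fare_max_to_be_below": 500.00,
}

resolved = resolve_parameters(fare_kwargs, runtime_values)
print(resolved)
```

Because each Expectation uses its own lookup keys, one runtime dictionary can carry the values for both the passenger and the fare Expectations without collisions.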

## Retrieve a Batch of sample data
Ref: https://docs.greatexpectations.io/docs/core/define_expectations/retrieve_a_batch_of_test_data

Expectations can be individually validated against a Batch of data. This allows you to test newly created Expectations, or to create and validate Expectations to further your understanding of new data. But first, you must retrieve a Batch of data to validate your Expectations against.

GX provides two methods of retrieving sample data for testing or data exploration. The first is to request a Batch of data from any Batch Definition you have previously configured. The second is to use the built-in `pandas_default` Data Source to read in a Batch of data from a datafile such as a .csv or .parquet file without first defining a corresponding Data Source, Data Asset, and Batch Definition.

**Batch Definitions** both organize a Data Asset's records into Batches and provide a method for retrieving those records. Any Batch Definition can be used to retrieve a Batch of records for use in testing Expectations or data exploration.



In [49]:
import great_expectations as gx

context = gx.get_context(mode="file") # set_up_context_for_example(context)



In [None]:
# Retrieve the Batch Definition:
data_source_name = "my_data_source"
data_asset_name = "my_data_asset"
batch_definition_name = "my_batch_definition"
batch_definition = (
    context.data_sources.get(data_source_name)
    .get_asset(data_asset_name)
    .get_batch_definition(batch_definition_name)
)

# Retrieve the first valid Batch of data:
batch = batch_definition.get_batch()

# Or use a Batch Parameter dictionary to specify a Batch to retrieve
# These are sample Batch Parameter dictionaries:
yearly_batch_parameters = {"year": "2019"}
monthly_batch_parameters = {"year": "2019", "month": "01"}
daily_batch_parameters = {"year": "2019", "month": "01", "day": "01"}

# This code retrieves the Batch from a monthly Batch Definition:

batch = batch_definition.get_batch(batch_parameters={"year": "2019", "month": "01"})

print(batch.head())

The `pandas_default` Data Source is built into every Data Context and can be found at `.data_sources.pandas_default` on your Data Context.

The pandas_default Data Source provides methods to read the contents of a single datafile in any format supported by pandas. These `.read_*(...)` methods do not create a Data Asset or Batch Definition for the datafile. Instead, they simply return a Batch of data.

Because the pandas_default Data Source's `.read_*(...)` methods only return a Batch and do not save configurations for reading files to the Data Context, they are less versatile than a fully configured Data Source, Data Asset, and Batch Definition. Therefore, the pandas_default Data Source is only intended to facilitate testing Expectations and engaging in data exploration. The pandas_default Data Source's `.read_*(...)` methods are less suited for use in production and automated workflows.

In [51]:
import great_expectations as gx

context = gx.get_context()
#set_up_context_for_example(context)

# Provide the path to a data file:
file_path = "/data/public/tutorials/ge_tutorials/data/yellow_tripdata_sample_2019-01.csv"

# Use the `pandas_default` Data Source to read the file:
sample_batch = context.data_sources.pandas_default.read_csv(file_path)

# Verify that data was read into `sample_batch`:
display(sample_batch.head())

Calculating Metrics: 100%|██████████████████████████████████████████████████████| 1/1 [00:00<00:00, 239.07it/s]


   vendor_id      pickup_datetime     dropoff_datetime  passenger_count  \
0          1  2019-01-15 03:36:12  2019-01-15 03:42:19                1   
1          1  2019-01-25 18:20:32  2019-01-25 18:26:55                1   
2          1  2019-01-05 06:47:31  2019-01-05 06:52:19                1   
3          1  2019-01-09 15:08:02  2019-01-09 15:20:17                1   
4          1  2019-01-25 18:49:51  2019-01-25 18:56:44                1   

   trip_distance  rate_code_id store_and_fwd_flag  pickup_location_id  \
0            1.0             1                  N                 230   
1            0.8             1                  N                 112   
2            1.1             1                  N                 107   
3            2.5             1                  N                 143   
4            0.8             1                  N                 246   

   dropoff_location_id  payment_type  fare_amount  extra  mta_tax  tip_amount  \
0                   48       

## Test an Expectation
Ref: https://docs.greatexpectations.io/docs/core/define_expectations/test_an_expectation

Data can be validated against individual Expectations. This workflow is generally used when engaging in exploration of new data, or when building out a set of Expectations to comprehensively describe the state that your data should conform to.


In [41]:
# import great_expectations as gx

context = gx.get_context()

# Use the `pandas_default` Data Source to retrieve a Batch of sample Data from a data file:
file_path = "/data/public/tutorials/ge_tutorials/data/yellow_tripdata_sample_2019-01.csv"
batch = context.data_sources.pandas_default.read_csv(file_path)

# Define the Expectation to test:
expectation = gx.expectations.ExpectColumnMaxToBeBetween(
    column="passenger_count", min_value=1, max_value=6
)

# Test the Expectation:
validation_results = batch.validate(expectation)

# Evaluate the Validation Results:
print(validation_results)

# If needed, adjust the Expectation's preset parameters and test again:
expectation.min_value = 1
expectation.max_value = 6

# Test the modified expectation and review the new Validation Results:
new_validation_results = batch.validate(expectation)
print(new_validation_results)

Calculating Metrics: 100%|██████████████████████████████████████████████████████| 4/4 [00:00<00:00, 402.41it/s]


{
  "success": true,
  "expectation_config": {
    "type": "expect_column_max_to_be_between",
    "kwargs": {
      "batch_id": "default_pandas_datasource-#ephemeral_pandas_asset",
      "column": "passenger_count",
      "min_value": 1.0,
      "max_value": 6.0
    },
    "meta": {}
  },
  "result": {
    "observed_value": 6
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


Calculating Metrics: 100%|██████████████████████████████████████████████████████| 4/4 [00:00<00:00, 430.19it/s]

{
  "success": true,
  "expectation_config": {
    "type": "expect_column_max_to_be_between",
    "kwargs": {
      "batch_id": "default_pandas_datasource-#ephemeral_pandas_asset",
      "column": "passenger_count",
      "min_value": 1.0,
      "max_value": 6.0
    },
    "meta": {}
  },
  "result": {
    "observed_value": 6
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}





# Run Validations

## Create a Validation Definition
Ref: https://docs.greatexpectations.io/docs/core/run_validations/create_a_validation_definition

A Validation Definition is a fixed reference that links a Batch of data to an Expectation Suite. It can be run by itself to validate the referenced data against the associated Expectations for testing or data exploration. Multiple Validation Definitions can also be provided to a Checkpoint which, when run, executes Actions based on the Validation Results for each provided Validation Definition.



In [None]:
import great_expectations as gx

context = gx.get_context()

# Retrieve an Expectation Suite
expectation_suite_name = "my_expectation_suite"
expectation_suite = context.suites.get(name=expectation_suite_name)

# Retrieve a Batch Definition
data_source_name = "my_data_source"
data_asset_name = "my_data_asset"
batch_definition_name = "my_batch_definition"
batch_definition = (
    context.data_sources.get(data_source_name)
    .get_asset(data_asset_name)
    .get_batch_definition(batch_definition_name)
)

# Create a Validation Definition
definition_name = "my_validation_definition"
validation_definition = gx.ValidationDefinition(
    data=batch_definition, suite=expectation_suite, name=definition_name
)

# Add the Validation Definition to the Data Context
validation_definition = context.validation_definitions.add(validation_definition)

## Run a Validation Definition

Ref: https://docs.greatexpectations.io/docs/core/run_validations/run_a_validation_definition


In [None]:
import great_expectations as gx

context = gx.get_context()

# Retrieve the Validation Definition
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)

# Run the Validation Definition
validation_results = validation_definition.run()

# Review the Validation Results
print(validation_results)

# Trigger actions based on results

## Create a Checkpoint with Actions
Ref: https://docs.greatexpectations.io/docs/core/trigger_actions_based_on_results/create_a_checkpoint_with_actions

A Checkpoint executes one or more Validation Definitions and then performs a set of Actions based on the Validation Results each Validation Definition returns.
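
The order of operations described above (validate first, then act on the results) can be pictured with a toy model in plain Python. Everything in this sketch is hypothetical: `run_checkpoint`, `notify_on_failure`, and the two stand-in validations are illustrative only and do not use or reproduce the GX API.

```python
def run_checkpoint(validation_definitions, actions):
    """Run every validation, then hand the collected results to every action."""
    results = {name: run() for name, run in validation_definitions.items()}
    for action in actions:
        action(results)
    return results


notifications = []  # stands in for a message channel such as Slack


def notify_on_failure(results):
    """Toy action: record a message only when at least one validation failed."""
    failed = [name for name, succeeded in results.items() if not succeeded]
    if failed:
        notifications.append(f"Failed validations: {', '.join(failed)}")


# Stand-in validations that simply return a success flag.
results = run_checkpoint(
    {"passenger_count_check": lambda: True, "fare_check": lambda: False},
    [notify_on_failure],
)

print(results)
print(notifications)
```

The point of the model is only the sequencing: actions see the results of all validations, and an action can decide for itself whether the results warrant doing anything.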



In [None]:
import great_expectations as gx

context = gx.get_context()
# set_up_context_for_example(context)  # helper from the docs snippets; not defined in this notebook

# Create a list of one or more Validation Definitions for the Checkpoint to run
validation_definitions = [
    context.validation_definitions.get("my_validation_definition")
]

# Create a list of Actions for the Checkpoint to perform
action_list = [
    # This Action sends a Slack Notification if an Expectation fails.
    gx.checkpoint.SlackNotificationAction(
        name="send_slack_notification_on_failed_expectations",
        slack_token="${validation_notification_slack_webhook}",
        slack_channel="${validation_notification_slack_channel}",
        notify_on="failure",
        show_failed_expectations=True,
    ),
    # This Action updates the Data Docs static website with the Validation
    #   Results after the Checkpoint is run.
    gx.checkpoint.UpdateDataDocsAction(
        name="update_all_data_docs",
    ),
]

# Create the Checkpoint
checkpoint_name = "my_checkpoint"
checkpoint = gx.Checkpoint(
    name=checkpoint_name,
    validation_definitions=validation_definitions,
    actions=action_list,
    result_format={"result_format": "COMPLETE"},
)

# Save the Checkpoint to the Data Context
context.checkpoints.add(checkpoint)

# Retrieve the Checkpoint later
checkpoint_name = "my_checkpoint"
checkpoint = context.checkpoints.get(checkpoint_name)

## Choose result format

Ref: https://docs.greatexpectations.io/docs/core/trigger_actions_based_on_results/choose_a_result_format/

When you validate data with GX Core you can set the level of detail returned in your Validation Results by specifying a value for the optional result_format parameter. These settings will be applied to the results returned by each validated Expectation.

Typical use cases for customizing Result Format settings include summarizing values that cause Expectations to fail during data exploration, retrieving failed rows to facilitate cleaning data, or excluding excess Validation Result data in published Data Docs.

### Create a dictionary and set the verbosity of returned Validation Results.

The verbosity of your Validation Results can be set as the value of the key "result_format" in your Result Format dictionary. In order from least verbosity to greatest detail, the valid values for the "result_format" key are:

- "BOOLEAN_ONLY"
- "BASIC"
- "SUMMARY"
- "COMPLETE"

The default verbosity level of Validation Results generated by Expectations is "SUMMARY".

### Optional. Specify configurations for additional settings available to the base result_format.

Once you have defined the base configuration in your result_format key, you can further tailor the format of your Validation Results by defining additional key/value pairs in your Result Format dictionary.

Reference the table at https://docs.greatexpectations.io/docs/core/trigger_actions_based_on_results/choose_a_result_format/?result_format_string=basic for valid keys and how they influence the format of generated Validation Results.
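
As a rough sketch of how such a dictionary can be assembled, the hypothetical helper below checks the verbosity level against the four values listed earlier and merges in any additional settings. The helper `build_result_format` is not part of GX. The extra keys `partial_unexpected_count` and `unexpected_index_column_names` are assumed to be among the settings described in the referenced table, so confirm them there before relying on them.

```python
VERBOSITY_LEVELS = ("BOOLEAN_ONLY", "BASIC", "SUMMARY", "COMPLETE")


def build_result_format(verbosity="SUMMARY", **extra_settings):
    """Build a Result Format dictionary, rejecting unknown verbosity levels."""
    if verbosity not in VERBOSITY_LEVELS:
        raise ValueError(
            f"Unknown result_format {verbosity!r}; expected one of {VERBOSITY_LEVELS}"
        )
    return {"result_format": verbosity, **extra_settings}


# Data exploration: summary output, limited to five sample unexpected values.
exploration_format = build_result_format("SUMMARY", partial_unexpected_count=5)

# Data cleaning: full output, identifying failed rows by a column from the sample data.
cleaning_format = build_result_format(
    "COMPLETE", unexpected_index_column_names=["vendor_id"]
)

print(exploration_format)
print(cleaning_format)
```

The resulting dictionaries have the same shape as the `result_format={"result_format": "COMPLETE"}` argument passed to the Checkpoint earlier in this notebook.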

## Validation Results reference table

Ref: https://docs.greatexpectations.io/docs/core/trigger_actions_based_on_results/choose_a_result_format/#validation-results-reference-tables


| Field within result | Value |
|---------------------|-------|
| element_count | The total number of values in the column. |
| missing_count | The number of missing values in the column. |
| missing_percent | The total percent of rows missing values for the column. |
| unexpected_count | The total count of unexpected values in a column. |
| unexpected_percent | The overall percent of unexpected values in a column. |
| unexpected_percent_nonmissing | The percent of unexpected values in a column, excluding rows that have no value for that column. |
| observed_value | The aggregate statistic computed for the column. This only applies to Expectations that pertain to the aggregate value of a column, rather than the individual values in each row for the column. |
| partial_unexpected_list | A partial list of values that violate the Expectation. (Up to 20 values by default.) |
| partial_unexpected_index_list | A partial list of the indices of the unexpected values in the column, as defined by the columns in unexpected_index_column_names. (Up to 20 indices by default.) |
| partial_unexpected_counts | A partial list of values and counts, showing the number of times each of the unexpected values occurs. (Up to 20 unexpected value/count pairs by default.) |
| unexpected_index_list | A list of the indices of the unexpected values in the column, as defined by the columns in unexpected_index_column_names. |
| unexpected_index_query | A query that can be used to retrieve all unexpected values (SQL and Spark), or the full list of unexpected indices (Pandas). |
| unexpected_list | A list of up to 200 values that violate the Expectation. |
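
To show how several of these fields relate to each other, the following plain-Python sketch derives them for a small hypothetical column. It is an interpretation of the field descriptions, not GX's own computation. The helper `summarize_column` is hypothetical, and the sketch uses the two unambiguous percentage fields that appear in the Validation Results printed earlier in this notebook (`unexpected_percent_total` and `unexpected_percent_nonmissing`).

```python
def summarize_column(values, is_expected, partial_limit=20):
    """Derive count and percent fields for one column from a row-level check."""
    element_count = len(values)
    non_missing = [v for v in values if v is not None]
    missing_count = element_count - len(non_missing)
    unexpected = [v for v in non_missing if not is_expected(v)]
    unexpected_count = len(unexpected)

    def percent(part, whole):
        return 100.0 * part / whole if whole else None

    return {
        "element_count": element_count,
        "missing_count": missing_count,
        "missing_percent": percent(missing_count, element_count),
        "unexpected_count": unexpected_count,
        # share of all rows, missing ones included
        "unexpected_percent_total": percent(unexpected_count, element_count),
        # share of the rows that actually hold a value
        "unexpected_percent_nonmissing": percent(unexpected_count, len(non_missing)),
        "partial_unexpected_list": unexpected[:partial_limit],
    }


# Hypothetical column: two missing values, two values outside 1..6.
values = [1, 2, None, 7, 1, 0, 3, None, 2, 1]
summary = summarize_column(values, lambda v: 1 <= v <= 6)

for field, value in summary.items():
    print(f"{field}: {value}")
```

The two percentages differ whenever values are missing: the same two unexpected values are 20% of all ten rows but 25% of the eight rows that hold a value.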

## Run a Checkpoint
Ref: https://docs.greatexpectations.io/docs/core/trigger_actions_based_on_results/run_a_checkpoint

Running a Checkpoint will cause it to validate all of its Validation Definitions. It will then execute its Actions based on the results returned from those Validation Definitions. Finally, the Validation Results will be returned by the Checkpoint.

At runtime, a Checkpoint can take in a batch_parameters dictionary that selects the Batch to validate from each Validation Definition. A Checkpoint will also accept an expectation_parameters dictionary that provides values for the parameters of any Expectations that have been configured to accept parameters at runtime.



In [None]:
import great_expectations as gx

context = gx.get_context()
# set_up_context_for_example(context)  # helper from the docs snippets; not defined in this notebook

checkpoint = context.checkpoints.get("my_checkpoint")

batch_parameters = {"month": "01", "year": "2019"}

expectation_parameters = {
    "expect_fare_max_to_be_above": 5.00,
    "expect_fare_max_to_be_below": 1000.00,
}

validation_results = checkpoint.run(
    batch_parameters=batch_parameters, expectation_parameters=expectation_parameters
)