# Great Expectations

This lesson uses Great Expectations (GX) Core 1.x. The version at the time of writing is 1.3.9; the validation output recorded later in this notebook reports 1.4.1.

# Data Context 

Documentation: https://docs.greatexpectations.io/docs/core/set_up_a_gx_environment/create_a_data_context

> A Data Context defines the storage location for metadata, such as your configurations for Data Sources, Expectation Suites, Checkpoints, and Data Docs. It also contains your Validation Results and the metrics associated with them, and it provides access to those objects in Python, along with other helper functions for the GX Python API.

> All scripts that utilize GX Core should start with the creation of a Data Context.



In [1]:
import pandas as pd
import great_expectations as gx
from great_expectations import expectations as gxe

context = gx.get_context()

# Connecting to Data File: Data Source

Data Sources tell GX where your data is located and how to connect to it. With Filesystem data this is done by directing GX to the folder or online location that contains the data files. GX supports accessing Filesystem data from Amazon S3, Azure Blob Storage, Google Cloud Storage, and local or networked filesystems.



Documentation: https://docs.greatexpectations.io/docs/core/connect_to_data/filesystem_data/

In [2]:
source_folder = "data/"
data_source_name = "resale_flat"

In [3]:
data_source = context.data_sources.add_pandas_filesystem(
    name=data_source_name, 
    base_directory=source_folder
)

## Data Assets

A Data Asset is a collection of related records within a Data Source. These records may be located within multiple files, but each Data Asset is only capable of reading a single specific file format which is determined when it is created. However, a Data Source may contain multiple Data Assets covering different file formats and groups of records.

GX provides two types of Data Assets for Filesystem Data Sources: File Data Assets and Directory Data Assets.



Documentation: https://docs.greatexpectations.io/docs/core/connect_to_data/filesystem_data/?data_asset=file#create-a-data-asset

In [4]:
# An existing Data Source can also be retrieved from the context by name
data_source = context.data_sources.get(data_source_name)

In [5]:
asset_name = "resale_csv_files"

In [6]:
file_csv_asset = data_source.add_csv_asset(name=asset_name)

## Create a Batch Definition

A Batch Definition allows you to request all the records from a Data Asset or a subset based on the contents of a date and time field.



In [7]:
# The Data Asset can be retrieved by name as well
file_data_asset = context.data_sources.get(data_source_name).get_asset(asset_name)

In [8]:
batch_definition_name = "resale_flat_201701_202306.csv"
batch_definition_path = "resale_flat_201701_202306.csv"

batch_definition = file_data_asset.add_batch_definition_path(
    name=batch_definition_name, path=batch_definition_path
)

In [9]:
batch = batch_definition.get_batch()

In [10]:
print(batch.head(4))

     month        town flat_type block        street_name storey_range  \
0  2017-01  ANG MO KIO    2 ROOM   406  ANG MO KIO AVE 10     10 TO 12   
1  2017-01  ANG MO KIO    3 ROOM   108   ANG MO KIO AVE 4     01 TO 03   
2  2017-01  ANG MO KIO    3 ROOM   602   ANG MO KIO AVE 5     01 TO 03   
3  2017-01  ANG MO KIO    3 ROOM   465  ANG MO KIO AVE 10     04 TO 06   

   floor_area_sqm      flat_model  lease_commence_date     remaining_lease  \
0            44.0        Improved                 1979  61 years 04 months   
1            67.0  New Generation                 1978  60 years 07 months   
2            67.0  New Generation                 1980  62 years 05 months   
3            68.0  New Generation                 1980   62 years 01 month   

   resale_price  
0      232000.0  
1      250000.0  
2      262000.0  
3      265000.0  


# Create Expectations

Documentation: https://docs.greatexpectations.io/docs/core/define_expectations/organize_expectation_suites

In [11]:
# Expect the latest lease commencement year in the file to fall between 1 and 2020
preset_expectation = gxe.ExpectColumnMaxToBeBetween(
    column="lease_commence_date", min_value=1, max_value=2020
)

In [12]:
validation_results = batch.validate(preset_expectation)

In [13]:
print(validation_results)

{
  "success": true,
  "expectation_config": {
    "type": "expect_column_max_to_be_between",
    "kwargs": {
      "batch_id": "resale_flat-resale_csv_files",
      "column": "lease_commence_date",
      "min_value": 1.0,
      "max_value": 2020.0
    },
    "meta": {}
  },
  "result": {
    "observed_value": 2019
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


You can group related Expectations into an Expectation Suite.

> An Expectation Suite contains a group of Expectations that describe the same set of data. Combining all the Expectations that you apply to a given set of data into an Expectation Suite allows you to evaluate them as a group, rather than individually. All of the Expectations that you use to validate your data in production workflows should be grouped into Expectation Suites.



In [14]:
suite_name = "sctp_expectation_suite"
suite = gx.ExpectationSuite(name=suite_name)

In [15]:
suite = context.suites.add(suite)

In [16]:
suite.add_expectation(preset_expectation)

ExpectColumnMaxToBeBetween(id='61ec4e75-b86d-4f85-9661-d519994c4daa', meta=None, notes=None, result_format=<ResultFormat.BASIC: 'BASIC'>, description=None, catch_exceptions=False, rendered_content=None, windows=None, batch_id=None, column='lease_commence_date', row_condition=None, condition_parser=None, min_value=1.0, max_value=2020.0, strict_min=False, strict_max=False)

Now, to run the suite, we go through the validation flow: a Validation Definition links a Batch Definition to an Expectation Suite.

See: https://docs.greatexpectations.io/docs/core/run_validations/create_a_validation_definition

In [17]:
definition_name = "sctp_validation_definition"
validation_definition = gx.ValidationDefinition(
    data=batch_definition, suite=suite, name=definition_name
)

In [18]:
validation_results = validation_definition.run()

In [19]:
print(validation_results)

{
  "success": true,
  "results": [
    {
      "success": true,
      "expectation_config": {
        "type": "expect_column_max_to_be_between",
        "kwargs": {
          "batch_id": "resale_flat-resale_csv_files",
          "column": "lease_commence_date",
          "min_value": 1.0,
          "max_value": 2020.0
        },
        "meta": {},
        "id": "61ec4e75-b86d-4f85-9661-d519994c4daa"
      },
      "result": {
        "observed_value": 2019
      },
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_traceback": null,
        "exception_message": null
      }
    }
  ],
  "suite_name": "sctp_expectation_suite",
  "suite_parameters": {},
  "statistics": {
    "evaluated_expectations": 1,
    "successful_expectations": 1,
    "unsuccessful_expectations": 0,
    "success_percent": 100.0
  },
  "meta": {
    "great_expectations_version": "1.4.1",
    "batch_spec": {
      "path": "data/resale_flat_201701_202306.csv",
      "re

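The suite-level result is a nested structure, so it helps to reduce it to the few fields you act on. The sketch below uses plain Python on a dictionary copied from the printed output (trimmed to the fields the function reads). `summarize_suite_result` is a hypothetical helper, not part of GX; with a live result object, a dictionary of this shape can likely be obtained from `validation_results.to_json_dict()`.

```python
from typing import Any


def summarize_suite_result(result: dict[str, Any]) -> dict[str, Any]:
    """Reduce a suite validation result to pass/fail plus the failed checks."""
    failed = [
        "{}({})".format(
            item["expectation_config"]["type"],
            item["expectation_config"]["kwargs"].get("column", ""),
        )
        for item in result["results"]
        if not item["success"]
    ]
    return {
        "suite": result["suite_name"],
        "passed": result["success"],
        "success_percent": result["statistics"]["success_percent"],
        "failed": failed,
    }


# Trimmed copy of the result printed by this lesson's validation run.
printed_result = {
    "success": True,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "type": "expect_column_max_to_be_between",
                "kwargs": {
                    "column": "lease_commence_date",
                    "min_value": 1.0,
                    "max_value": 2020.0,
                },
            },
            "result": {"observed_value": 2019},
        }
    ],
    "suite_name": "sctp_expectation_suite",
    "statistics": {
        "evaluated_expectations": 1,
        "successful_expectations": 1,
        "unsuccessful_expectations": 0,
        "success_percent": 100.0,
    },
}

summary = summarize_suite_result(printed_result)
print(summary)
```
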
# Checkpoint 

A Checkpoint executes one or more Validation Definitions and then performs a set of Actions based on the Validation Results each Validation Definition returns.



# Bonus:

When the data is already loaded in memory, it is often simpler to pass a pandas DataFrame directly to GX instead of pointing GX at files.

This example uses the other resale file, `data/resale_flat_202307.csv`.

In [20]:
import pandas as pd
import great_expectations as gx
from great_expectations import expectations as gxe

In [21]:
resale_data = pd.read_csv("data/resale_flat_202307.csv")

In [22]:
# https://docs.greatexpectations.io/docs/core/connect_to_data/dataframes/

context = gx.get_context()

data_source_name = "resale_dataframe"
data_source = context.data_sources.add_pandas(name=data_source_name)

# create asset
data_asset_name = "resale_202307_asset"
data_asset = data_source.add_dataframe_asset(name=data_asset_name)

# create batch
batch_definition_name = "resale_flat_202307_dataframe"
batch_definition = data_asset.add_batch_definition_whole_dataframe(
    batch_definition_name
)

In [23]:
# DataFrame assets hold no data themselves; pass the DataFrame in at runtime
batch_parameters = {"dataframe": resale_data}

new_batch = batch_definition.get_batch(batch_parameters=batch_parameters)

In [24]:
print(new_batch.head(10))

     month        town flat_type block        street_name storey_range  \
0  2023-07  ANG MO KIO    2 ROOM   406  ANG MO KIO AVE 10     04 TO 06   
1  2023-07  ANG MO KIO    3 ROOM  308B   ANG MO KIO AVE 1     19 TO 21   
2  2023-07  ANG MO KIO    3 ROOM   462  ANG MO KIO AVE 10     01 TO 03   
3  2023-07  ANG MO KIO    3 ROOM   462  ANG MO KIO AVE 10     10 TO 12   
4  2023-07  ANG MO KIO    3 ROOM   540  ANG MO KIO AVE 10     01 TO 03   
5  2023-07  ANG MO KIO    3 ROOM   466  ANG MO KIO AVE 10     10 TO 12   
6  2023-07  ANG MO KIO    3 ROOM   560  ANG MO KIO AVE 10     07 TO 09   
7  2023-07  ANG MO KIO    3 ROOM   313   ANG MO KIO AVE 3     01 TO 03   
8  2023-07  ANG MO KIO    3 ROOM   328   ANG MO KIO AVE 3     07 TO 09   
9  2023-07  ANG MO KIO    3 ROOM   610   ANG MO KIO AVE 4     07 TO 09   

   floor_area_sqm      flat_model  lease_commence_date     remaining_lease  \
0            44.0        Improved                 1979  54 years 11 months   
1            70.0         Mod

In [25]:
# Expect a column named "month" to exist as the first column (index 0)
new_expectation = gxe.ExpectColumnToExist(column="month", column_index=0)

In [26]:
validation_results = new_batch.validate(new_expectation)

In [27]:
print(validation_results)

{
  "success": true,
  "expectation_config": {
    "type": "expect_column_to_exist",
    "kwargs": {
      "batch_id": "resale_dataframe-resale_202307_asset",
      "column": "month",
      "column_index": 0
    },
    "meta": {}
  },
  "result": {},
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


Lastly, see the Expectations Gallery for more examples: https://greatexpectations.io/expectations/