# Short Introduction to Great Expectations Concepts on Hopsworks

The purpose of this notebook is to give a brief introduction to the Great Expectations concepts and classes that are relevant for integration with the Hopsworks MLOps platform. Hopsworks works out of the box with Great Expectations classes, so there is no need to learn new abstractions or syntax. Define your Expectation Suite, register it with your Feature Group and try inserting data. Hopsworks takes care of the rest!

In [1]:
!pip install -U hopsworks --quiet


In [2]:
import great_expectations as ge
import pandas as pd
from pprint import pprint

## Load example data and create a Feature Group

In [3]:
# Transactions data used in the fraud detection tutorial
trans_df = pd.read_csv("https://repo.hops.works/master/hopsworks-tutorials/data/card_fraud_data/transactions.csv", parse_dates=["datetime"])
trans_df.head(3)

| | tid | datetime | cc_num | category | amount | latitude | longitude | city | country | fraud_label |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11df919988c134d97bbff2678eb68e22 | 2022-01-01 00:00:24 | 4473593503484549 | Health/Beauty | 62.95 | 42.30865 | -83.48216 | Canton | US | 0 |
| 1 | dd0b2d6d4266ccd3bf05bc2ea91cf180 | 2022-01-01 00:00:56 | 4272465718946864 | Grocery | 85.45 | 33.52253 | -117.70755 | Laguna Niguel | US | 0 |
| 2 | e627f5d9a9739833bd52d2da51761fc3 | 2022-01-01 00:02:32 | 4104216579248948 | Domestic Transport | 21.63 | 37.60876 | -77.37331 | Mechanicsville | US | 0 |


In [9]:
# Log in to Hopsworks

import hopsworks
project = hopsworks.login()
fs = project.get_feature_store()

Copy your Api Key (first register/login): https://c.app.hopsworks.ai/account/api/generated

Paste it here: <API key redacted>
Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/155
Connected. Call `.close()` to terminate connection gracefully.




In [10]:
trans_fg = fs.get_or_create_feature_group(
    name="mini_transactions_fraud_batch_fg",
    version=1,
    description="Transaction data",
    primary_key=['cc_num'],
    event_time="datetime"  # event_time takes a single feature name
)

# Insert a single row to persist the FG in the backend
trans_fg.insert(trans_df.head(1))

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/155/fs/97/fg/588


Uploading Dataframe: 0.00% |          | Rows 0/1 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/155/jobs/named/mini_transactions_fraud_batch_fg_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7f50b3d9cc50>, None)

## Great Expectations Classes and Concepts

### Expectations

The central concept is the Expectation. Each type of expectation specifies one or more metrics whose values will be computed on a DataFrame. The configuration of a given expectation type specifies an acceptable range against which the observed value is compared. Expectation entities are not specific to your data or to a particular DataFrame; they merely define operations to be performed and success criteria.

During development or prototyping, Great Expectations offers a DataFrame wrapper to explore your data. It exposes the core supported expectations as methods, which enables auto-completion. As different expectations require different kwargs for their configuration, this helps in gaining familiarity with new expectations.

In Hopsworks, expectations will usually be evaluated on a single Feature or a pair of Features. The expectation type standardises the metrics that are computed, while the configuration allows users to adapt the success criteria to their particular conditions. In a production setup, each Feature will typically be validated by multiple expectations. Note that Hopsworks populates the meta field of each expectation with an id to enable smoother integration.

In [11]:
ge_df = ge.from_pandas(trans_df)
result = ge_df.expect_column_mean_to_be_between(column="amount", min_value=10, max_value=100)
print(f"Observed mean : {result['result']['observed_value']}, success : {result['success']}")

Observed mean : 421.1216635540464, success : False
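
The split between expectation type, configuration and data can also be illustrated without Great Expectations. The following is a minimal plain-Python sketch, not the library's implementation: `evaluate_mean_between` is a hypothetical helper that computes the metric (the mean) on a list of values and compares it to the configured range. The returned dictionary is only loosely modelled on the result shape used in this notebook, and the toy values are not taken from the transactions data.

```python
from statistics import mean


def evaluate_mean_between(values, min_value=None, max_value=None):
    """Hypothetical helper: check that the mean of `values` lies in [min_value, max_value].

    Treating a bound of None as "no bound" is an assumption of this sketch.
    """
    observed = mean(values)  # the metric computed on the data
    above_min = min_value is None or observed >= min_value
    below_max = max_value is None or observed <= max_value
    return {
        "success": above_min and below_max,
        "result": {"observed_value": observed, "element_count": len(values)},
    }


# Toy amounts for illustration only
amounts = [20.0, 40.0, 60.0]
# The configuration is independent of the data it is applied to
config = {"min_value": 10, "max_value": 100}

outcome = evaluate_mean_between(amounts, **config)
print(outcome["success"], outcome["result"]["observed_value"])
```

The same `config` can be reused on any other list of values, which mirrors how an expectation is defined once and evaluated against many DataFrames.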


### Validation Result

An expectation generates a validation result entity on evaluation. This result is specific to a DataFrame and acts as a log of the evaluation. The validation result contains several pieces of information:
- The success or failure of the evaluation, i.e. whether the computed metrics satisfy the success criteria specified in the expectation.
- A meta field for extra information.
- An expectation_config field which contains the entity that was evaluated against the DataFrame. It is crucial for reproducibility, although reproducing a result also requires the same data.
- An exception_info field, in case the evaluation raised an exception.
- The result field itself, with information about the computed metrics. The type of information depends on the expectation_type.

When the expectation has a Hopsworks id, the result can be linked to the expectation on upload. This populates a history of data validation, which simplifies validation monitoring.

In [12]:
pprint(result)

{
  "expectation_config": {
    "kwargs": {
      "column": "amount",
      "min_value": 10,
      "max_value": 100,
      "result_format": "BASIC"
    },
    "expectation_type": "expect_column_mean_to_be_between",
    "meta": {}
  },
  "meta": {},
  "success": false,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "observed_value": 421.1216635540464,
    "element_count": 106020,
    "missing_count": null,
    "missing_percent": null
  }
}
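
For monitoring or logging, it can be handy to condense a validation result into one line. The sketch below works on a plain dictionary with the fields listed above; in the notebook, `result` is a Great Expectations object, so this assumes it has first been converted to a dictionary (for example via its `to_json_dict()` method). `summarise_result` is a hypothetical helper, not part of Great Expectations or Hopsworks.

```python
def summarise_result(validation_result):
    """Hypothetical helper: build a one-line summary from a validation result dict."""
    config = validation_result["expectation_config"]
    # Table-level expectations have no "column" kwarg, hence the fallback
    column = config["kwargs"].get("column", "<table>")
    if validation_result["exception_info"]["raised_exception"]:
        status = "ERROR"  # the evaluation itself failed to run
    elif validation_result["success"]:
        status = "PASS"
    else:
        status = "FAIL"
    observed = validation_result["result"].get("observed_value")
    return f"{status} {config['expectation_type']}({column}) observed={observed}"


# Dictionary copied from the output printed above
example = {
    "expectation_config": {
        "kwargs": {"column": "amount", "min_value": 10, "max_value": 100,
                   "result_format": "BASIC"},
        "expectation_type": "expect_column_mean_to_be_between",
        "meta": {},
    },
    "meta": {},
    "success": False,
    "exception_info": {"raised_exception": False, "exception_traceback": None,
                       "exception_message": None},
    "result": {"observed_value": 421.1216635540464, "element_count": 106020,
               "missing_count": None, "missing_percent": None},
}

summary = summarise_result(example)
print(summary)
```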


### Expectation Suite

The Expectation Suite is another central concept of Great Expectations relevant to working with Hopsworks. The suite is simply a collection of Expectations to be evaluated against the same DataFrame. It provides persistence for the expectation types and configurations to be run.

Expectation suites are the core abstraction used to integrate with Hopsworks. Each version of a Feature Group has a single expectation suite used to validate data before it is inserted into the Feature Store. All expectations connected to a given Feature should be included in the suite, as well as all expectations performed on pairs of Features. The expectation suite is attached to a Feature Group in the backend, enabling convenient storage and retrieval. In addition, Hopsworks is configured by default to use the suite to automatically run validation on insertion.
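
Since one suite covers every Feature of a Feature Group, it can help to check which Features are actually covered. The following plain-Python sketch groups expectations by the Feature they target. It operates on a simplified dictionary representation of a suite (a name plus a list of expectation configurations); that shape, the helper `expectations_by_feature` and the three example expectations are assumptions made for illustration, not the suite used in this notebook.

```python
from collections import defaultdict


def expectations_by_feature(suite):
    """Hypothetical helper: map each feature name to the expectation types targeting it."""
    grouped = defaultdict(list)
    for expectation in suite["expectations"]:
        kwargs = expectation["kwargs"]
        # Single-column expectations use "column"; column-pair expectations
        # use "column_A" and "column_B".
        for key in ("column", "column_A", "column_B"):
            if key in kwargs:
                grouped[kwargs[key]].append(expectation["expectation_type"])
    return dict(grouped)


# Simplified, illustrative suite
suite = {
    "expectation_suite_name": "expectation_suite_101",
    "expectations": [
        {"expectation_type": "expect_column_mean_to_be_between",
         "kwargs": {"column": "amount", "min_value": 10, "max_value": 100},
         "meta": {}},
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "amount"},
         "meta": {}},
        {"expectation_type": "expect_column_pair_values_A_to_be_greater_than_B",
         "kwargs": {"column_A": "latitude", "column_B": "longitude"},
         "meta": {}},
    ],
}

coverage = expectations_by_feature(suite)
for feature, expectation_types in coverage.items():
    print(feature, expectation_types)
```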

In [13]:
new_expectation_suite = ge.core.ExpectationSuite(
    expectation_suite_name="expectation_suite_101"
    )

new_expectation_suite.add_expectation(result['expectation_config'])

# Equivalent to:
# new_expectation_suite.add_expectation(
#     ge.core.ExpectationConfiguration(
#         expectation_type="expect_column_mean_to_be_between",
#         kwargs={
#             "column":"amount",
#             "min_value":10,
#             "max_value":100
#         }
#     )
# )

{"kwargs": {"column": "amount", "min_value": 10, "max_value": 100, "result_format": "BASIC"}, "expectation_type": "expect_column_mean_to_be_between", "meta": {}}

In [14]:
# Run the validation manually with Great Expectations
ge_df = ge.from_pandas(trans_df, expectation_suite=new_expectation_suite)
report = ge_df.validate()

# Or set up automatic validation on insert with Hopsworks in a single line of code
trans_fg.save_expectation_suite(
    expectation_suite=new_expectation_suite,
    validation_ingestion_policy="ALWAYS")

Attached expectation suite to featuregroup, edit it at https://c.app.hopsworks.ai:443/p/155/fs/97/fg/588
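
The `validation_ingestion_policy` argument controls what happens to the data when validation fails. In the Hopsworks documentation, `"ALWAYS"` is described as inserting the data regardless of the validation outcome and `"STRICT"` as inserting it only when validation succeeds; treat these semantics as an assumption to check against your Hopsworks version. The decision can be sketched in plain Python, where `should_insert` is a hypothetical function for illustration and not part of the Hopsworks API.

```python
def should_insert(validation_success, policy="ALWAYS"):
    """Hypothetical sketch of the ingestion decision for a given policy."""
    policy = policy.upper()
    if policy == "ALWAYS":
        # Data is inserted whether or not validation succeeded
        return True
    if policy == "STRICT":
        # Data is inserted only if the whole validation succeeded
        return bool(validation_success)
    raise ValueError(f"Unknown validation ingestion policy: {policy}")


for policy in ("ALWAYS", "STRICT"):
    print(policy, should_insert(validation_success=False, policy=policy))
```

With the suite defined in this notebook, validation fails on the full dataset, so under the semantics assumed here only the `"ALWAYS"` policy would let the insertion go through.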


### Validation Report

When a DataFrame is validated using an expectation suite, Great Expectations generates a validation report. This report collects the results of all expectations run during the validation as well as related metadata (overall success, timestamps, version, etc.).

By default, running validation with Hopsworks uploads the validation report to the backend. The Hopsworks UI provides a simple way to consult a summary of these reports or download the full report for more thorough investigation.
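
A validation report also carries a `statistics` block that summarises its results: the number of evaluated, successful and unsuccessful expectations and a success percentage. The sketch below shows how such a block could be derived from a list of results. `compute_statistics` is a hypothetical helper written for this illustration, not the Great Expectations implementation, and returning `None` for an empty list is an assumption of the sketch.

```python
def compute_statistics(results):
    """Hypothetical helper: aggregate per-expectation results into summary statistics."""
    evaluated = len(results)
    successful = sum(1 for r in results if r["success"])
    # Assumption: no percentage is reported when nothing was evaluated
    success_percent = 100.0 * successful / evaluated if evaluated else None
    return {
        "evaluated_expectations": evaluated,
        "successful_expectations": successful,
        "unsuccessful_expectations": evaluated - successful,
        "success_percent": success_percent,
    }


# One failed expectation, as in the validation run of this notebook
stats = compute_statistics([{"success": False}])
print(stats)
```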

In [7]:
pprint(report)

{
  "results": [
    {
      "expectation_config": {
        "kwargs": {
          "column": "amount",
          "min_value": 10,
          "max_value": 100,
          "result_format": "BASIC"
        },
        "expectation_type": "expect_column_mean_to_be_between",
        "meta": {}
      },
      "meta": {},
      "success": false,
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      },
      "result": {
        "observed_value": 421.1216635540464,
        "element_count": 106020,
        "missing_count": null,
        "missing_percent": null
      }
    }
  ],
  "statistics": {
    "evaluated_expectations": 1,
    "successful_expectations": 0,
    "unsuccessful_expectations": 1,
    "success_percent": 0.0
  },
  "meta": {
    "great_expectations_version": "0.14.3",
    "expectation_suite_name": "expectation_suite_101",
    "run_id": {
      "run_name": null,
      "run_time": "2022-08-16T07:27:45.

In [15]:
# Automatic validation on insertion; check out the report in the Hopsworks UI
trans_fg.insert(trans_df)

Validation Report saved successfully, explore a summary at https://c.app.hopsworks.ai:443/p/155/fs/97/fg/588


Uploading Dataframe: 0.00% |          | Rows 0/106020 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/155/jobs/named/mini_transactions_fraud_batch_fg_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7f50b813b090>, {
   "results": [
     {
       "expectation_config": {
         "kwargs": {
           "column": "amount",
           "min_value": 10,
           "max_value": 100,
           "result_format": "BASIC"
         },
         "expectation_type": "expect_column_mean_to_be_between",
         "meta": {
           "expectationId": 173
         }
       },
       "meta": {},
       "success": false,
       "exception_info": {
         "raised_exception": false,
         "exception_message": null,
         "exception_traceback": null
       },
       "result": {
         "observed_value": 421.1216635540464,
         "element_count": 106020,
         "missing_count": null,
         "missing_percent": null
       }
     }
   ],
   "statistics": {
     "evaluated_expectations": 1,
     "successful_expectations": 0,
     "unsuccessful_expectations": 1,
     "success_percent": 0.0
   },
   "meta": {
     "great_expectations_version": "0.14.3",
     "expectation