# Great Expectations: demo part 1

Please set yourself up by:
- Cloning the git repository from GitHub: https://github.com/heineken-advanced-analytics/great-expectation-workshop-pydata.git.
- Running ```pip install great-expectations```, or ```pip install great-expectations==0.11.1``` if you run into import errors.
- Opening this workshop notebook: ```workshop-pydata-part1.ipynb```.

In this part we cover:
- Why we need GE: a possible data validation pipeline without GE.
- How we can implement a data validation pipeline with GE through its API.
- Which expectations are built in and how to use them.
- How custom expectations can be implemented.

In [4]:
from typing import Dict, List
import pandas as pd
import great_expectations as ge
from great_expectations.dataset import MetaPandasDataset, PandasDataset

## Why we need Great Expectations

A possible validation pipeline without GE.

Our validation goal is to construct a pipeline that checks this dataset, and potentially other datasets coming in over time, against a given data schema.

In [5]:
# We start by loading the file that we want to check.
filepath = "data/weather_brasil_201301.csv"
weather_data_df = pd.read_csv(filepath)

In [6]:
# Get familiar with the data.
weather_data_df.head()

Unnamed: 0,id,elevation,lat,lon,code,city,timestamp,precip,air_pres,solar_rad,temp,rel_humid,wind_speed
0,178,237.0,-6.835777,-38.311583,A333,São Gonçalo,2013-01-01 00:00:00,,983.1,,30.1,44.0,2.9
1,178,237.0,-6.835777,-38.311583,A333,São Gonçalo,2013-01-01 01:00:00,,983.4,,30.0,43.0,1.5
2,178,237.0,-6.835777,-38.311583,A333,São Gonçalo,2013-01-01 02:00:00,,983.5,,30.5,40.0,1.9
3,178,237.0,-6.835777,-38.311583,A333,São Gonçalo,2013-01-01 03:00:00,,983.9,,29.8,45.0,2.5
4,178,237.0,-6.835777,-38.311583,A333,São Gonçalo,2013-01-01 04:00:00,,983.8,,28.7,54.0,3.0


In [8]:
# Expectations on each column: its name and its data type.
_schema = [
    {"col_name": "id", "dtype": "int"},
    {"col_name": "elevation", "dtype": "float"},
    {"col_name": "lat", "dtype": "float"},
    {"col_name": "lon", "dtype": "float"},
    {"col_name": "code", "dtype": "str"},
    {"col_name": "city", "dtype": "str"},
    {"col_name": "timestamp", "dtype": "datetime"},
    {"col_name": "precip", "dtype": "float"},
    {"col_name": "air_pres", "dtype": "float"},
    {"col_name": "solar_rad", "dtype": "float"},
    {"col_name": "temp", "dtype": "float"},
    {"col_name": "rel_humid", "dtype": "float"},
    {"col_name": "wind_speed", "dtype": "float"},
]

In [11]:
# We could validate this with the function below.
def has_ordered_required_columns(df: pd.DataFrame, schema: List[Dict]) -> bool:
    """
    Check if the DataFrame has exactly the required columns, in the expected order.

    Parameters
    ----------
    df
        DataFrame to check.
    schema
        List of dictionaries describing the expected DataFrame schema.

    Returns
    -------
        Boolean indicating whether the actual columns correspond to the expected columns.
    """
    required_columns = [schema_item["col_name"] for schema_item in schema]
    return df.columns.tolist() == required_columns

In [19]:
# The result on our dataframe.
validation_result = has_ordered_required_columns(weather_data_df, _schema)

In [14]:
# Check function to see if pipeline would continue or break given the validation result.
def continue_pipeline(validation_result: bool):
    if validation_result:
        print("Pipeline continues.")
    else:
        print("Pipeline breaks.")    

In [20]:
continue_pipeline(validation_result)

Pipeline continues.


In [21]:
# Now suppose a column is missing.
weather_data_df_no_windspeed = weather_data_df.drop(
    columns=["wind_speed"]
)

In [22]:
# The validation will fail and the pipeline will break.
validation_result = has_ordered_required_columns(
    weather_data_df_no_windspeed, _schema
)

In [23]:
continue_pipeline(validation_result)

Pipeline breaks.
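
The schema also lists a dtype per column, which the check above ignores. Covering that by hand means writing yet another function. Below is a minimal sketch of what that might look like; the helper `has_expected_dtypes`, the mapping `_DTYPE_CHECKS`, and the toy DataFrame are assumptions made for illustration and are not part of the workshop repository.

```python
from typing import Dict, List

import pandas as pd
from pandas.api import types as ptypes

# Hypothetical mapping from the schema's dtype labels to pandas type checks.
_DTYPE_CHECKS = {
    "int": ptypes.is_integer_dtype,
    "float": ptypes.is_float_dtype,
    "str": ptypes.is_string_dtype,
    "datetime": ptypes.is_datetime64_any_dtype,
}


def has_expected_dtypes(df: pd.DataFrame, schema: List[Dict]) -> bool:
    """Return True if every schema column exists in df with the expected dtype."""
    for schema_item in schema:
        col_name = schema_item["col_name"]
        if col_name not in df.columns:
            return False
        if not _DTYPE_CHECKS[schema_item["dtype"]](df[col_name]):
            return False
    return True


# Toy data, made up for this sketch.
toy_schema = [
    {"col_name": "id", "dtype": "int"},
    {"col_name": "temp", "dtype": "float"},
    {"col_name": "city", "dtype": "str"},
]
toy_df = pd.DataFrame(
    {
        "id": [178, 178],
        "temp": [30.1, 30.0],
        "city": ["São Gonçalo", "São Gonçalo"],
    }
)

dtypes_ok = has_expected_dtypes(toy_df, toy_schema)
# Storing temperatures as text should make the check fail.
dtypes_ok_after_cast = has_expected_dtypes(toy_df.astype({"temp": "str"}), toy_schema)
```

Every additional rule (value ranges, nulls, uniqueness) would need a similar function, plus its own tests and documentation.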


## The downsides

A validation pipeline like this, however, implies the following:
- We need to write code for every check we need on a data file.
- That code needs to be tested and maintained.
- We need to document the code and tests.

Great Expectations helps us remedy such pain points.

In [31]:
# GE can work with pandas by extending the pandas DataFrame class.
weather_data_batch = ge.read_csv(filepath)
print(type(weather_data_batch))

<class 'great_expectations.dataset.pandas_dataset.PandasDataset'>


In [25]:
weather_data_batch.head()

Unnamed: 0,id,elevation,lat,lon,code,city,timestamp,precip,air_pres,solar_rad,temp,rel_humid,wind_speed
0,178,237.0,-6.835777,-38.311583,A333,São Gonçalo,2013-01-01 00:00:00,,983.1,,30.1,44.0,2.9
1,178,237.0,-6.835777,-38.311583,A333,São Gonçalo,2013-01-01 01:00:00,,983.4,,30.0,43.0,1.5
2,178,237.0,-6.835777,-38.311583,A333,São Gonçalo,2013-01-01 02:00:00,,983.5,,30.5,40.0,1.9
3,178,237.0,-6.835777,-38.311583,A333,São Gonçalo,2013-01-01 03:00:00,,983.9,,29.8,45.0,2.5
4,178,237.0,-6.835777,-38.311583,A333,São Gonçalo,2013-01-01 04:00:00,,983.8,,28.7,54.0,3.0


In [27]:
# Take the schema again to extract the columns in the expected order.
required_columns = [schema_item["col_name"] for schema_item in _schema]

Now we can simply attach some built-in expectations of Great Expectations to our dataframe.

In [33]:
weather_data_batch.expect_table_columns_to_match_ordered_list(
    required_columns
);

In [34]:
# And simply validate the batch against the expectations.
validation_result = weather_data_batch.validate()
print(validation_result)

{
  "statistics": {
    "evaluated_expectations": 1,
    "successful_expectations": 1,
    "unsuccessful_expectations": 0,
    "success_percent": 100.0
  },
  "meta": {
    "great_expectations.__version__": "0.11.1",
    "expectation_suite_name": "default",
    "run_id": {
      "run_time": "2020-06-08T21:41:32.246354+00:00",
      "run_name": null
    },
    "batch_kwargs": {
      "ge_batch_id": "36eee32e-a9d0-11ea-b50a-24ee9ae663af"
    },
    "batch_markers": {},
    "batch_parameters": {},
    "validation_time": "20200608T214132.246354Z"
  },
  "results": [
    {
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      },
      "expectation_config": {
        "kwargs": {
          "column_list": [
            "id",
            "elevation",
            "lat",
            "lon",
            "code",
            "city",
            "timestamp",
            "precip",
            "air_pres

In [42]:
continue_pipeline(validation_result["success"])

Pipeline continues.
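
Besides the `success` flag, the printed result carries a `statistics` block that can be useful in a log message. A small sketch of how one might summarize it, using a hand-written dictionary that mirrors the fields shown in the output above; the function `summarize_validation` is hypothetical, and the dictionary is a stand-in, not an actual GE result object.

```python
from typing import Any, Dict


def summarize_validation(result: Dict[str, Any]) -> str:
    """Build a one-line summary from the 'success' and 'statistics' fields."""
    stats = result["statistics"]
    status = "passed" if result["success"] else "failed"
    return (
        f"Validation {status}: "
        f"{stats['successful_expectations']}/{stats['evaluated_expectations']} "
        f"expectations met ({stats['success_percent']:.1f}%)."
    )


# Hand-written stand-in that copies the structure of the printed result.
example_result = {
    "success": True,
    "statistics": {
        "evaluated_expectations": 1,
        "successful_expectations": 1,
        "unsuccessful_expectations": 0,
        "success_percent": 100.0,
    },
}

summary = summarize_validation(example_result)
print(summary)
```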


In [43]:
weather_data_batch.expect_column_values_to_be_between(
    column="temp", min_value=-30, max_value=60
);
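
Conceptually, this expectation is a row-wise range check on one column. The sketch below shows a rough pandas-only equivalent on made-up values, assuming that missing values are skipped and that both bounds are inclusive; it illustrates the idea and is not GE's actual implementation.

```python
import pandas as pd

# Made-up temperatures: one missing value and one out-of-range value.
temps = pd.Series([30.1, 28.7, None, 75.0], name="temp")

# Skip missing values, then test the remaining rows against the bounds.
non_null = temps.dropna()
in_range = non_null.between(-30, 60)  # bounds are inclusive by default

unexpected_values = non_null[~in_range].tolist()
success = bool(in_range.all())
```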

## Try it yourself!

- Access the documentation: ```shift-tab``` on ```weather_data_batch.expect_```.
- Or have a look at the glossary: https://docs.greatexpectations.io/en/0.11.2/reference/glossary_of_expectations.html#expectation-glossary (share link through chat).
- Please share your expectation through the chat!

We'll be back in a few minutes.

In [44]:
# <expectation comes here>

## Only those built-in expectations?

If we want, we can specify custom logic and add it as an expectation to our batch. 

We outdid ourselves and created an expectation that checks whether each value of the first column (or of `index`, if given) always appears with the same combination of values in the other columns.

In [48]:
class CustomPandasDataset(PandasDataset):
    
    # Setting _data_asset_type is not required,
    # but it helps GE interpret this as a custom dataset.
    _data_asset_type = "CustomPandasDataset"

    @MetaPandasDataset.multicolumn_map_expectation
    def expect_columns_combination_to_be_unique(
        self, 
        column_list,
        *,
        index=None
    ):
        # The decorator passes the selected columns in as a DataFrame (column_list).
        # A pandas Series with one boolean per row is the required output
        # for this multicolumn map expectation format.
        # Start with every row passing, aligned with the input's index.
        result = pd.Series(True, index=column_list.index)

        group_key = index if index else column_list.columns[0]

        for _, group in column_list.groupby(group_key):
            # A group is consistent when every column holds a single distinct value.
            if not (group.nunique(axis=0) == 1).all():
                # Flag all rows of an inconsistent group as failing.
                result.loc[group.index] = False

        return result

In [49]:
weather_data_batch = ge.read_csv(filepath, dataset_class=CustomPandasDataset)

In [50]:
weather_data_batch.expect_table_columns_to_match_ordered_list(
    required_columns
);

In [51]:
weather_data_batch.expect_column_values_to_be_between(
    column="temp", min_value=-30, max_value=60
);

In [52]:
# We check if a weather station always has the same lat/lon.
weather_data_batch.expect_columns_combination_to_be_unique(
    ["id", "lat", "lon"]
);

In [53]:
validation_result = weather_data_batch.validate()
# Now we see the custom expectation in this validation result.
print(validation_result)

{
  "statistics": {
    "evaluated_expectations": 3,
    "successful_expectations": 3,
    "unsuccessful_expectations": 0,
    "success_percent": 100.0
  },
  "meta": {
    "great_expectations.__version__": "0.11.1",
    "expectation_suite_name": "default",
    "run_id": {
      "run_time": "2020-06-08T22:18:44.433165+00:00",
      "run_name": null
    },
    "batch_kwargs": {
      "ge_batch_id": "99d494b6-a9d5-11ea-9098-24ee9ae663af"
    },
    "batch_markers": {},
    "batch_parameters": {},
    "validation_time": "20200608T221844.433165Z"
  },
  "results": [
    {
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      },
      "expectation_config": {
        "kwargs": {
          "column_list": [
            "id",
            "elevation",
            "lat",
            "lon",
            "code",
            "city",
            "timestamp",
            "precip",
            "air_pres

In [54]:
continue_pipeline(validation_result["success"])

Pipeline continues.


In [61]:
## Let's double check if it works by introducing an error here.
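
One way to do that is to corrupt a single coordinate and see whether the check flags the affected station. The sketch below does this with pandas only, on a made-up table of two stations; the helper `rows_with_consistent_groups` is hypothetical and only mirrors the per-station consistency logic the custom expectation is meant to perform.

```python
import pandas as pd


def rows_with_consistent_groups(df: pd.DataFrame, key: str) -> pd.Series:
    """Flag rows whose group (by key) holds one distinct value in every column."""
    flags = pd.Series(True, index=df.index)
    for _, group in df.groupby(key):
        if not (group.nunique(axis=0) == 1).all():
            flags.loc[group.index] = False
    return flags


# Made-up stations: each id has a fixed lat/lon.
stations = pd.DataFrame(
    {
        "id": [178, 178, 179, 179],
        "lat": [-6.835777, -6.835777, -3.10, -3.10],
        "lon": [-38.311583, -38.311583, -60.02, -60.02],
    }
)
clean_flags = rows_with_consistent_groups(stations, "id")

# Introduce an error: station 178 suddenly reports a different latitude.
corrupted = stations.copy()
corrupted.loc[1, "lat"] = 0.0
corrupted_flags = rows_with_consistent_groups(corrupted, "id")
```

Both rows of station 178 are flagged, because the check cannot tell which of the two latitudes is the wrong one.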

## So far the basic flow...


We have seen how to:
- use Great Expectations' built-in expectations,
- extend its functionality with custom expectations,
- run validations against those expectations, and
- inspect the validation output.

Unfortunately, with this approach it is not easy to validate incoming batches over time: you would need to add these expectations to every new batch.
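
With the API used in this notebook, re-attaching expectations could at best be wrapped in a helper that is called for each new batch. A minimal sketch, assuming a hypothetical function `attach_expectations`; the class `RecordingBatch` is a made-up stand-in that only records which expectation methods were called, so the block runs without a data file.

```python
from typing import Any, List


def attach_expectations(batch: Any, required_columns: List[str]) -> Any:
    """Attach the expectations from this notebook to a freshly loaded batch."""
    batch.expect_table_columns_to_match_ordered_list(required_columns)
    batch.expect_column_values_to_be_between(
        column="temp", min_value=-30, max_value=60
    )
    return batch


class RecordingBatch:
    """Hypothetical stand-in for a GE batch; it records the expectations called."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def __getattr__(self, name: str):
        # Only reached for attributes that are not found normally.
        if not name.startswith("expect_"):
            raise AttributeError(name)

        def record(*args: Any, **kwargs: Any) -> None:
            self.calls.append(name)

        return record


# Every new batch needs the same expectations attached again.
january = attach_expectations(RecordingBatch(), ["id", "temp"])
february = attach_expectations(RecordingBatch(), ["id", "temp"])
```

With a real batch, the call would look like `attach_expectations(ge.read_csv(filepath), required_columns).validate()`.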

We also have not demonstrated those **neat automated docs** and **fancy validation reports**.

Please stay tuned: in part 2 of this workshop, Joost will demonstrate that through the CLI!