# Using Great Expectations for dbt Model Development

[dbt (data build tool)](https://www.getdbt.com) is an incredible tool that makes building SQL transformation pipelines a dream. In dbt, each `.sql` file is a **model**. As these models are developed, you can encode assumptions about input and output datasets as **expectations**.

This has the following benefits:

1. These are machine-verifiable and can be used to monitor data flowing through your pipelines.
2. These eliminate poisonous implicit assumptions that cause data engineers rework and wasted time, such as "How do we define visits?"
3. These **will eventually** be easy to edit.
4. These **will eventually** be easy to reason about visually.

In [1]:
import json
import os

import great_expectations as ge
import pandas as pd
import sqlalchemy


## Initialize a DataContext

A Great Expectations `DataContext` represents the collection of data asset specifications in this project.

You'll need:
- the directory where you ran `great_expectations init` (where the `.great_expectations.yml` file is).
- the name of the dbt profile you would like Great Expectations to use in the datasources section of your Great Expectations configuration.

The `DBTDataContext` uses a few dbt configuration files:
- **dbt profile** (stored in `~/.dbt/profiles.yml`). Great Expectations will execute against the default `target` specified there.
- **dbt project** (stored in the `dbt_project.yml` file in your project directory).

In [2]:
context = ge.data_context.DataContext('../../')

## Get a Dataset

Using the data context, provide the name of the datasource configured in your project config ("dbt" in this case) and the name of the dbt model to connect to.

In [3]:
df_base_scheduleappointment = context.get_data_asset("dbt", "base/base_schedule_appointment")


In [4]:
df_base_scheduleappointment.get_expectations_config()

	0 failing expectations
	0 result_format kwargs
	0 include_configs kwargs
	0 catch_exceptions kwargs
If you wish to change this behavior, please set discard_failed_expectations, discard_result_format_kwargs, discard_include_configs_kwargs, and discard_catch_exceptions_kwargs appropirately.


{'data_asset_name': 'base/base_schedule_appointment',
 'meta': {'great_expectations.__version__': '0.5.1__develop__sch_internal'},
 'expectations': [],
 'data_asset_type': 'Dataset'}

## Declare Expectations

In [5]:
df_base_scheduleappointment.expect_column_values_to_be_in_set('active', ['t', 'f'])

{'success': True,
 'result': {'element_count': 4464508,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'unexpected_percent_nonmissing': 0.0,
  'partial_unexpected_list': []}}

In [6]:
df_base_scheduleappointment.save_expectations_config()

	0 failing expectations
	1 result_format kwargs
	0 include_configs kwargs
	0 catch_exceptions kwargs
If you wish to change this behavior, please set discard_failed_expectations, discard_result_format_kwargs, discard_include_configs_kwargs, and discard_catch_exceptions_kwargs appropirately.


#### Since we extract the appointment's date from the start_date column, can we assume that a non-empty value is always present?

In [None]:
df_base_scheduleappointment.expect_column_values_to_not_be_null('start_date')

#### No! Good to know. We will add a where clause to every_visit_per_day's SQL to filter the empties out. Also, let's record the fact that we expect close to half of the scheduleappointment records to not have a value. If that percentage is exceeded, validation will fail.

In [None]:
df_base_scheduleappointment.expect_column_values_to_not_be_null('start_date', mostly=0.5)
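
To make the effect of `mostly` concrete, here is a minimal pure-Python sketch of the idea: the check passes when the fraction of non-null values is at least `mostly`. The function `not_null_with_mostly` and the sample data are hypothetical and only illustrate the threshold logic; this is not the Great Expectations implementation.

```python
def not_null_with_mostly(values, mostly=1.0):
    """Return a small result dict for a 'values are not null' check.

    The check succeeds when the fraction of non-null values is >= mostly.
    """
    element_count = len(values)
    missing_count = sum(1 for v in values if v is None)
    non_null_fraction = (
        (element_count - missing_count) / element_count if element_count else 1.0
    )
    return {
        "success": non_null_fraction >= mostly,
        "element_count": element_count,
        "missing_count": missing_count,
        "non_null_fraction": non_null_fraction,
    }


# Hypothetical sample: 6 of 10 start_date values are present.
start_dates = ["2018-03-01"] * 6 + [None] * 4

strict = not_null_with_mostly(start_dates)                # requires 100% non-null
tolerant = not_null_with_mostly(start_dates, mostly=0.5)  # requires at least 50%
print(strict["success"], tolerant["success"])
```

With this sample the strict check fails and the tolerant one passes, which mirrors the two cells above: the same column, with and without a tolerance for missing values.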

#### Can we assume that the non-empty start_date values will be in a reasonable range (the last few years)?

In [None]:
df_base_scheduleappointment.expect_column_values_to_be_between('start_date', min_value='2010-01-01')

#### No! A small percentage of values are earlier than would make sense. Just like in the "null" case earlier, let's add a where clause to every_visit_per_day's SQL and add an expectation on the input dataset to say that we do expect some nonsensical dates, but only a very small percentage (up to 1%).

In [None]:
df_base_scheduleappointment.expect_column_values_to_be_between('start_date', min_value='2010-01-01', mostly=0.99)

#### We assume that records in the scheduleappointment table represent one-day appointments, not hospitalizations or any other multi-day treatments. Actually, can we assume that? Let's check...

#### In this case the easiest way to add this expectation is to add a computed column to the input model's SQL and add an expectation on the range of that column's values. This pattern is worth noticing: it is often easier to apply an expectation from the standard library to a computed column than to write a custom expectation.

In [None]:
df_base_scheduleappointment.expect_column_values_to_be_between('scheduled_duration_sec', min_value=10, max_value=86400)
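
The computed-column pattern can be sketched without a warehouse. The example below uses an in-memory SQLite database purely as a stand-in; the `end_date` column, the sample rows, and the exact SQL for deriving `scheduled_duration_sec` are assumptions for illustration, since the model's real SQL is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scheduleappointment (id INTEGER, start_date TEXT, end_date TEXT)"
)
conn.executemany(
    "INSERT INTO scheduleappointment VALUES (?, ?, ?)",
    [
        (1, "2018-03-01 09:00:00", "2018-03-01 09:30:00"),  # 30 minutes
        (2, "2018-03-02 14:00:00", "2018-03-02 15:00:00"),  # 1 hour
        (3, "2018-03-03 08:00:00", "2018-03-06 08:00:00"),  # 3 days
    ],
)

# The computed column: duration in seconds, derived in the model's SELECT.
rows = conn.execute(
    """
    SELECT id,
           CAST(strftime('%s', end_date) AS INTEGER)
             - CAST(strftime('%s', start_date) AS INTEGER) AS scheduled_duration_sec
    FROM scheduleappointment
    ORDER BY id
    """
).fetchall()

# A plain range check on the computed column, mirroring min_value/max_value.
out_of_range = [row_id for row_id, sec in rows if not (10 <= sec <= 86400)]
print(rows)
print(out_of_range)
```

Once the duration exists as an ordinary column, a stock range expectation is enough, and no custom expectation is needed to express "appointments last at most one day".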

#### Our assumption is *almost* correct: there is a very small number of exceptions. In this case let's not even modify every_visit_per_day's SQL. Let's create an expectation on the input dataset that the rate of exceptions is under 1%. If it ever changes, validation will fail and we will be notified.

In [None]:
df_base_scheduleappointment.expect_column_values_to_be_between('scheduled_duration_sec', min_value=10, max_value=86400, mostly=0.99)

#### We could keep adding expectations to the input dataset, but for the purposes of this demo let's stop here and save the expectations.

In [None]:
df_base_scheduleappointment.save_expectations_config()

### Previously we encoded our assumptions about the input dataset as expectations, which protects us from risk coming from upstream. Now let's be nice to the downstream models that consume the output of the every_visit_per_day model. We will encode our assumptions about our model's result set. This advertises to downstream consumers what they can expect from us: a data contract of sorts.

In [None]:
# Load the result set of the every_visit_per_day model into GE.
# Note: `dbt_tools`, `engine`, and `every_visit_per_day_dataset_config` are not
# defined in this notebook; this cell assumes they were created elsewhere.
import random

query_str = dbt_tools.get_model_compiled_sql('schedule_appointments')

df_every_visit_per_day = ge.dataset.SqlAlchemyDataset(engine=engine, table_name="tmp{0:d}".format(random.randint(1, 100000)), custom_sql=query_str)
df_every_visit_per_day._initialize_expectations(config=every_visit_per_day_dataset_config)

In [None]:
df_every_visit_per_day = context.get_data_asset("dbt", "schedule_appointments")

#### Since we filtered out empty start_date values, we can confidently advertise this fact:

In [None]:
df_every_visit_per_day.expect_column_values_to_not_be_null('start_date')

#### "success: True" means that the result set conforms to our expectations - good!

#### The same logic applies to the range of start_date values, since we filtered out the unreasonably old ones in our SQL.

In [None]:
df_every_visit_per_day.expect_column_values_to_be_between('start_date', min_value='2010-01-01')

#### We carry over the expectation of no multi-day appointments (with no more than 1% exceptions) from the input dataset, since we are not filtering them out in our SQL.

In [None]:
df_every_visit_per_day.expect_column_values_to_be_between('scheduled_duration_sec', min_value=10, max_value=86400, mostly=0.99)

#### Are there any duplicates in the input dataset and, if so, should we dedup them in every_visit_per_day's SQL?

#### This requires us to hypothesize about what might be viewed as a unique key in scheduleappointment. Let's say that the combination of start_date, office id, patient id and provider id is a good candidate. Let's add this as a column to our model: `concat(sa.start_date, '__', office_id, '__', user_id_patient, '__', user_id_to_see) as appointment_key`.

#### Yet again, this is an example of adding a computed column so that we can reason on it using expectations. Check if we can assume this value to be unique in our result set: 

In [None]:
df_every_visit_per_day.expect_column_values_to_be_unique('appointment_key')

#### It is mostly unique, with fewer than 2% exceptions. It is possible that we will have to deal with deduplication before deployment, but for now let's just encode this assumption as an expectation so that we don't forget it and so that other stakeholders can see it:

In [None]:
df_every_visit_per_day.expect_column_values_to_be_unique('appointment_key', mostly=0.98)
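
The composite-key check can be sketched in plain Python. The helpers `build_key` and `unique_fraction` and the sample rows are hypothetical. The sketch assumes that every copy of a duplicated key counts as an exception, which may differ from how Great Expectations counts them.

```python
from collections import Counter


def build_key(row):
    """Join the candidate key columns with '__', like the concat() in the model."""
    return "__".join(
        str(row[col])
        for col in ("start_date", "office_id", "user_id_patient", "user_id_to_see")
    )


def unique_fraction(keys):
    """Fraction of keys that occur exactly once (every copy of a duplicate counts)."""
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys) if keys else 1.0


# Hypothetical rows: the last two describe the same appointment.
rows = [
    {"start_date": "2018-03-01", "office_id": 1, "user_id_patient": 10, "user_id_to_see": 7},
    {"start_date": "2018-03-01", "office_id": 1, "user_id_patient": 11, "user_id_to_see": 7},
    {"start_date": "2018-03-02", "office_id": 2, "user_id_patient": 12, "user_id_to_see": 8},
    {"start_date": "2018-03-02", "office_id": 2, "user_id_patient": 12, "user_id_to_see": 8},
]

keys = [build_key(r) for r in rows]
fraction = unique_fraction(keys)
print(keys[0])
print(fraction, fraction >= 0.98)
```

In this tiny sample only half of the keys are unique, so a 0.98 threshold would fail. On the real result set the exception rate is under 2%, which is why `mostly=0.98` is a workable contract for now.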

#### Here is another example of encoding an assumption that would not be visible in the SQL source itself: scheduleappointment has an `active` column that takes True and False values. We are not sure what it means. Should "inactive" appointments be filtered out? Let's defer this decision and encode the assumption that the value of active is not important as an expectation on our output dataset:

In [None]:
df_every_visit_per_day.expect_column_values_to_be_in_set('active', ['t'])

#### Let's stop here for now and save the expectations on the output set:

In [None]:
df_every_visit_per_day.save_expectations_config()

### The expectation collections for the two datasets are saved as JSON files in the `great_expectations/data_asset_configurations` folder of the current project. Let's commit them.
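
Because the saved configurations are plain JSON, they are easy to inspect before committing. The sketch below writes and re-reads a config shaped like the `get_expectations_config()` output shown earlier. The file name, the temporary directory, and the layout of the single expectation entry (`expectation_type` plus `kwargs`) are assumptions for illustration; check the files Great Expectations actually wrote in your project.

```python
import json
import os
import tempfile

# Shape modelled on the get_expectations_config() output shown earlier.
# The single expectation entry is an assumed layout, for illustration only.
config = {
    "data_asset_name": "base/base_schedule_appointment",
    "data_asset_type": "Dataset",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "start_date", "mostly": 0.5},
        }
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the real one is chosen by Great Expectations.
    path = os.path.join(tmp_dir, "base_schedule_appointment.json")
    with open(path, "w") as f:
        # Sorted keys and indentation keep the file stable, so commit diffs stay small.
        json.dump(config, f, indent=2, sort_keys=True)
    with open(path) as f:
        loaded = json.load(f)

# One line per expectation: type, column, and tolerance (if any).
summary = [
    (e["expectation_type"], e["kwargs"].get("column"), e["kwargs"].get("mostly"))
    for e in loaded["expectations"]
]
print(summary)
```

A short summary like this makes code review of an expectations change much easier than reading raw JSON diffs.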