**Set the kernel to "Workshop Environment" from the Jupyter Kernels.**

# Data Quality Testing with Great Expectations

## Introduction
Data quality is the foundation of reliable analytics. Poor data leads to flawed insights and decisions. This workshop explores data quality testing using the powerful Great Expectations library.

## Why Data Quality Matters
- **Trust**: Quality data builds confidence in results
- **Efficiency**: Early detection prevents downstream issues 
- **Consistency**: Ensures reliable model performance
- **Governance**: Meets regulatory requirements

## Our Approach
Using the "Bike Sharing" dataset from UCI, we'll learn how to:
- Define expectations about your data
- Validate these expectations systematically
- Document and report quality issues
- Integrate quality checks into pipelines
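
To make the idea of an expectation concrete before the library comes in, here is a minimal plain-Python sketch. It is not how Great Expectations is implemented. The names `CheckResult` and `expect_values_between` and the sample readings are hypothetical and only illustrate the pattern: a declarative rule, a validation step, and a result that reports what failed.

```python
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    """Outcome of one hand-rolled data quality check (hypothetical type)."""

    success: bool
    unexpected_values: list = field(default_factory=list)


def expect_values_between(values, min_value, max_value):
    """Report every value that falls outside [min_value, max_value]."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return CheckResult(success=not unexpected, unexpected_values=unexpected)


# Made-up temperature readings; -99.0 mimics a sensor placeholder value.
sample_temps = [7.98, 7.04, 10.80, -99.0, 20.20]
temp_check = expect_values_between(sample_temps, min_value=-40, max_value=50)
print(temp_check.success, temp_check.unexpected_values)
```

The check fails and names the offending value, which is the kind of feedback the library provides for many rule types at once.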

## Task 1: Explore the Dataset
Let's begin by exploring the Bike Sharing dataset. Download the data, load it into a dataframe, and perform initial exploratory analysis to understand its structure and contents.

In [1]:
# Import necessary libraries for data processing and quality testing
# - great_expectations: Our primary tool for data quality validation
# - sqlite3: To connect with our SQLite database
# - pandas: For data manipulation and analysis
import great_expectations as gx
import sqlite3
import pandas as pd

from utils import database
from utils.checker import check

metric column.standard_deviation.aggregate_fn is being registered with different metric_provider; overwriting metric_provider


In [2]:
# Initialize the database with our bike sharing dataset
# This sets up a SQLite database with the necessary tables and imports the data
database.init()

# Create a connection to our database for querying
conn = sqlite3.connect("database.db")

2025-04-10 11:13:03.772 | INFO     | utils.database:reset_database:54 - Database reset completed
2025-04-10 11:13:03.773 | INFO     | utils.database:init:142 - Initializing database to step 0: 0_spring_2011
2025-04-10 11:13:03.776 | INFO     | utils.database:apply_migration:88 - Migration 0_create_table.sql from 0_spring_2011 - Successfully applied
2025-04-10 11:13:03.792 | INFO     | utils.database:apply_migration:88 - Migration 1_bike_rental_2011_spring.sql from 0_spring_2011 - Successfully applied


In [3]:
# Set up Great Expectations context
# This creates the environment where we define and validate expectations
context = gx.get_context()

# Add our SQLite database as a data source for Great Expectations
# This allows us to test data directly from the database
data_source = context.data_sources.add_sqlite(
    "sample", connection_string="sqlite:///database.db"
)

In [4]:
# Define the data asset we want to validate
# An asset in Great Expectations represents a table or query result that we want to test
asset_name = "bike_rental"
database_table_name = "bike_rental"
table_data_asset = data_source.add_table_asset(
    table_name=database_table_name, name=asset_name
)

# Create a batch definition that specifies which data we want to validate
# Here we're selecting the entire table for our first season (spring 2011)
full_table_batch_definition = table_data_asset.add_batch_definition_whole_table(
    name="0_spring_2011",
)

In [5]:
# Load the data into a batch and display the first few rows
# This gives us our first look at the structure and content of the dataset
full_table_batch = full_table_batch_definition.get_batch()

full_table_batch.head().data.loc[
    :, ["season", "weekday", "temp", "casual", "registered", "total"]
]

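|   | season | weekday | temp | casual | registered | total |
|---|--------|---------|------|--------|------------|-------|
| 0 | Spring | Monday | 7.98 | 2 | 11 | 13 |
| 1 | Spring | Monday | 7.98 | 1 | 6 | 7 |
| 2 | Spring | Monday | 7.98 | 1 | 5 | 6 |
| 3 | Spring | Monday | 7.98 | 0 | 1 | 1 |
| 4 | Spring | Monday | 7.04 | 1 | 1 | 2 |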


In [6]:
# Query the database directly to spot-check individual records
# Here we select the rows where the 'casual' rider count equals 2
# This shows how such rows are spread across weekdays and temperatures
query = """
SELECT season, weekday, temp, casual, registered, total
FROM bike_rental
WHERE casual = 2
"""

pd.read_sql_query(query, conn)

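|   | season | weekday | temp | casual | registered | total |
|---|--------|---------|------|--------|------------|-------|
| 0 | Spring | Monday | 7.98 | 2 | 11 | 13 |
| 1 | Spring | Monday | 7.04 | 2 | 30 | 32 |
| 2 | Spring | Tuesday | 10.80 | 2 | 58 | 60 |
| 3 | Spring | Wednesday | 7.04 | 2 | 24 | 26 |
| 4 | Spring | Thursday | 4.22 | 2 | 106 | 108 |
| ... | ... | ... | ... | ... | ... | ... |
| 81 | Spring | Monday | 17.38 | 2 | 27 | 29 |
| 82 | Spring | Tuesday | 20.20 | 2 | 8 | 10 |
| 83 | Spring | Thursday | 20.20 | 2 | 14 | 16 |
| 84 | Spring | Friday | 18.32 | 2 | 11 | 13 |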
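
Filtering on a single value only shows matching rows. To see how a variable is actually distributed, a grouped count is often more useful. The sketch below uses an in-memory SQLite table with invented rows as a stand-in for the workshop database; `demo_conn` and the sample data are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the workshop table; rows are invented for illustration.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE bike_rental (weekday TEXT, casual INTEGER)")
demo_conn.executemany(
    "INSERT INTO bike_rental VALUES (?, ?)",
    [("Monday", 2), ("Monday", 1), ("Tuesday", 2), ("Tuesday", 0), ("Friday", 2)],
)

# Count how often each 'casual' value occurs, most frequent first.
distribution = demo_conn.execute(
    """
    SELECT casual, COUNT(*) AS n_rows
    FROM bike_rental
    GROUP BY casual
    ORDER BY n_rows DESC, casual
    """
).fetchall()
print(distribution)
```

Against the workshop database, the same `SELECT` statement could be passed to `pd.read_sql_query(query, conn)` in place of the filter query.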


# Task 2: Set Expectations for the Spring Data

Now that you've explored the dataset, it's time to define your first expectations - the rules that your data should follow to be considered high quality.

## Basic Expectations Examples

Great Expectations provides various expectation types to validate different aspects of your data. You can browse all available expectations in the gallery: https://greatexpectations.io/expectations/

1. **Column Existence**
   ```python
   # Check that specific columns exist in your dataset
   gx.expectations.ExpectColumnToExist(column="temp")
   ```

2. **Set Membership**
   ```python
   # Confirm categorical variables contain only allowed values
   gx.expectations.ExpectColumnValuesToBeInSet(
       column="weathersit",
       value_set=[1, 2, 3, 4],  # 1:Clear, 2:Cloudy, 3:Light Rain, 4:Heavy Rain
   )
   ```
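
For intuition, the two checks above can be approximated by hand with plain SQL. This is a rough sketch of what such rules verify, not the library's actual implementation. The in-memory table, the `demo_conn` name, and the sample rows (including the deliberately invalid code `7`) are assumptions made for illustration.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE bike_rental (temp REAL, weathersit INTEGER)")
demo_conn.executemany(
    "INSERT INTO bike_rental VALUES (?, ?)",
    [(7.98, 1), (7.04, 2), (10.80, 7)],  # 7 is deliberately invalid
)

# Column existence: list the table's columns and look for the one we need.
# PRAGMA table_info returns one row per column; index 1 holds the column name.
columns = [row[1] for row in demo_conn.execute("PRAGMA table_info(bike_rental)")]
temp_exists = "temp" in columns

# Set membership: any row outside the allowed set counts as unexpected.
allowed = (1, 2, 3, 4)
placeholders = ", ".join("?" for _ in allowed)
unexpected_rows = demo_conn.execute(
    f"SELECT weathersit FROM bike_rental WHERE weathersit NOT IN ({placeholders})",
    allowed,
).fetchall()

print(temp_exists, unexpected_rows)
```

Writing these queries by hand works for one or two rules. A testing framework earns its keep once the rules need consistent result reporting and reuse across datasets.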


In [7]:
# Task 1: Check if the 'season' column exists in the dataset
# This is a fundamental check to ensure that our data has the expected structure

### SOLUTION_START ###
expectation = gx.expectations.ExpectColumnToExist(
    column="season",
)
### SOLUTION_END ###

result = full_table_batch.validate(expectation, result_format="COMPLETE")
check(task=1, result=result)

2025-04-10 11:13:03.884 | SUCCESS  | utils.checker:check:24 - Great job! The result is as expected.


In [8]:
result

{
  "success": true,
  "expectation_config": {
    "type": "expect_column_to_exist",
    "kwargs": {
      "batch_id": "sample-bike_rental",
      "column": "season"
    },
    "meta": {}
  },
  "result": {},
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
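
The fields of this result can also be handled programmatically, for example to print a one-line status per check. The sketch below works on an abbreviated copy of the JSON printed above and assumes the result is available as a plain dictionary; the helper `summarize` is hypothetical and not part of Great Expectations.

```python
import json

# Abbreviated copy of the JSON printed above.
result_json = """
{
  "success": true,
  "expectation_config": {
    "type": "expect_column_to_exist",
    "kwargs": {"batch_id": "sample-bike_rental", "column": "season"}
  },
  "result": {},
  "exception_info": {"raised_exception": false}
}
"""


def summarize(result_dict):
    """Turn one validation result into a one-line status message."""
    config = result_dict["expectation_config"]
    if result_dict["exception_info"]["raised_exception"]:
        status = "ERROR"  # the check itself crashed
    elif result_dict["success"]:
        status = "PASS"
    else:
        status = "FAIL"  # the check ran, but the data broke the rule
    return f"{status}: {config['type']} on column {config['kwargs'].get('column')}"


summary = summarize(json.loads(result_json))
print(summary)
```

Separating `ERROR` from `FAIL` matters in practice: a failed expectation points to a data problem, while a raised exception usually points to a problem with the check or the connection.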

In [9]:
# TODO: make them set one expectation for another datatype themselves

In [10]:
# TODO: add expectations of maximum bike rentals according to the max today + 50 or so
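
One possible way to approach this TODO is to derive the upper bound from the data seen so far and add some headroom. The sketch below only computes that bound. The in-memory table, the sample values, and the `HEADROOM` constant are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the workshop table with invented hourly totals.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE bike_rental (total INTEGER)")
demo_conn.executemany("INSERT INTO bike_rental VALUES (?)", [(13,), (60,), (108,)])

HEADROOM = 50  # assumed tolerance on top of the highest value seen so far

# Highest rental count currently in the table.
(current_max,) = demo_conn.execute("SELECT MAX(total) FROM bike_rental").fetchone()
upper_bound = current_max + HEADROOM
print(upper_bound)
```

The resulting bound could then serve as the `max_value` of a range expectation such as `gx.expectations.ExpectColumnValuesToBeBetween`. A fixed headroom is a deliberately naive choice and may well fail once the season changes.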

In [11]:
# TODO: set expectation for correlation of rising temperatures to rising bike rentals
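
As a starting point for this TODO, the relationship between temperature and rentals can be quantified with the Pearson correlation coefficient and compared against a minimum threshold. The sketch below is plain Python rather than a Great Expectations expectation. The sample values and the `MIN_CORRELATION` threshold are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


# Invented sample: warmer hours come with more rentals.
temps = [4.22, 7.04, 7.98, 10.80, 17.38, 20.20]
totals = [5, 12, 13, 40, 55, 70]

MIN_CORRELATION = 0.5  # assumed threshold for "rentals rise with temperature"
r = pearson(temps, totals)
correlation_ok = r >= MIN_CORRELATION
print(correlation_ok)
```

A check like this is sensitive to the time window: within a single cold week the correlation can be weak even when the seasonal trend is strong.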

In [12]:
# TODO: check the expectations

But you didn't come to this workshop just to see a fancier way of doing what a basic unit test already does, so let's move on to more complex checks.

# Task 3: Adjust and Set New Expectations for the Summer Data

In [13]:
# TODO: load the summer dataset

In [14]:
# TODO: run the expectations for that new dataset and look at the output

## Excursion: Data Docs

In [15]:
# TODO: explain data docs and how it works

In [16]:
# TODO: generate the data docs

In [17]:
# TODO: look at the data docs

## Back To Business

In [18]:
# TODO: refine the expectations

In [19]:
# TODO: add more complex expectations (give them a list of suggestions again)

In [20]:
# TODO: add a fun expectation, that expects bike rentals to rise, because they have risen before

In [21]:
# TODO: check the expectations

# Task 4: Adjust for Autumn 

In [22]:
# TODO: load new dataset

In [23]:
# TODO: check the expectations

In [24]:
# TODO: fix what needs fixing

In [25]:
# TODO: Maybe add something even more complex?

In [26]:
# TODO: Recheck the Expectations

# Task 5: Check with Winter and set final expectations
You can check out all kinds of expectations here: https://greatexpectations.io/expectations/

In [27]:
# TODO: load new dataset

In [28]:
# TODO: check the expectations

In [29]:
# TODO: fix what needs fixing

In [30]:
# TODO: Maybe add something even more complex?

In [31]:
# TODO: Recheck the Expectations

# Task 6: Verify your data and see if something shifts the next year

In [32]:
# TODO: load new dataset

In [33]:
# TODO: check the expectations

Discuss these expectations: Did they do a good job? What changed between the years? Do you now have confidence in the data foundation for your AI model? Also discuss the pros and cons of using a testing framework.

# Task 7: Think about AI Implementation

Could you now implement AI to design a flexible pricing model? How would you do it? What is the advantage over doing this by hand?

In [34]:
# TODO: Make this last part better and more to the point ^^