# Data Quality Testing - Workshop SDS25

## Goal
The goal of this Workshop is to gain hands-on experience in Data Quality assessment using the Great Expectations framework. The knowledge gained here is transferable to other testing frameworks as well. A Data Quality assessment gives you confidence in the data foundation used for AI.

## DataSet
The DataSet we will be using during this Workshop is the "Bike Sharing" Dataset (https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset). It contains aggregated information about bike rentals in Washington, D.C., over two years, either per hour or per day.

The DataSet has already been loaded into the database in this container (database.db). For each step of the exercise, the required data is loaded by a corresponding code cell, so you don't have to worry about it.

The Notebook will lead you through the following scenarios:  
- Making basic expectations based on Spring Data of the first year
- Checking and refining expectations based on added Summer Data of the first year
- Same for Autumn, using increasingly complex Expectations
- Same for Winter
- Finally, using the data from year 2 to verify the expectations, identify data shifts, etc.

In the end, the goal is to have a solid data foundation to implement a flexible pricing system.



# Task 1: Inspect the DataSet
Familiarize yourself with the "Bike Sharing" Dataset (https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset). Explore the dataset using the cells below.

In [1]:
from great_expectations import ExpectationSuite

import database
import great_expectations as gx
import sqlite3
import pandas as pd
from checker import check

metric column.standard_deviation.aggregate_fn is being registered with different metric_provider; overwriting metric_provider


In [2]:
# Initialise the dataset
database.init()

conn = sqlite3.connect('database.db')

2025-04-08 15:46:21.622 | INFO     | database:reset_database:54 - Database reset completed
2025-04-08 15:46:21.622 | INFO     | database:init:142 - Initializing database to step 0: 0_spring_2011
2025-04-08 15:46:21.624 | INFO     | database:apply_migration:88 - Migration 0_create_table.sql from 0_spring_2011 - Successfully applied
2025-04-08 15:46:21.639 | INFO     | database:apply_migration:88 - Migration 1_bike_rental_2011_spring.sql from 0_spring_2011 - Successfully applied


In [3]:
# Create the Great Expectations context and register the SQLite database as a data source
context = gx.get_context()

data_source = context.data_sources.add_sqlite("sample", connection_string="sqlite:///database.db")

In [4]:
# Register the table as a data asset and define a batch that covers the whole table
asset_name = "bike_rental"
database_table_name = "bike_rental"
table_data_asset = data_source.add_table_asset(
    table_name=database_table_name, name=asset_name
)

full_table_batch_definition = table_data_asset.add_batch_definition_whole_table(
    name="0_spring_2011",
)


In [5]:
# TODO: fix the database: the columns yr, weathersit, atemp and hum are None (see the output of the next cell) => the integer and double values were probably not transferred correctly?

In [6]:
full_table_batch = full_table_batch_definition.get_batch()
full_table_batch.head()



     id      dteday  season    yr  mnth  hour  holiday weekday  workingday  \
0  1808  2011-03-21  Spring  None     3     0        0  Monday           1   
1  1809  2011-03-21  Spring  None     3     1        0  Monday           1   
2  1810  2011-03-21  Spring  None     3     2        0  Monday           1   
3  1811  2011-03-21  Spring  None     3     3        0  Monday           1   
4  1812  2011-03-21  Spring  None     3     5        0  Monday           1   

  weathersit  ...  atemp   hum windspeed  casual  registered  total  \
0       None  ...   None  None   26.0027       2          11     13   
1       None  ...   None  None   26.0027       1           6      7   
2       None  ...   None  None   22.0028       1           5      6   
3       None  ...   None  None   22.0028       0           1      1   
4       None  ...   None  None   19.9995       1           1      2   

   felt_temp  humidity  year weather  
0      3.998      66.0  2011     Bad  
1      3.998      71.0  20

In [7]:
# Read the data from the database using SQL
query = """
SELECT * FROM bike_rental
WHERE casual = 2
"""

pd.read_sql_query(query, conn)

Unnamed: 0,id,dteday,season,yr,mnth,hour,holiday,weekday,workingday,weathersit,...,atemp,hum,windspeed,casual,registered,total,felt_temp,humidity,year,weather
0,1808,2011-03-21,Spring,,3,0,0,Monday,1,,...,,,26.0027,2,11,13,3.9980,66.0,2011,Bad
1,1813,2011-03-21,Spring,,3,6,0,Monday,1,,...,,,16.9979,2,30,32,3.9980,76.0,2011,Bad
2,1837,2011-03-22,Spring,,3,6,0,Tuesday,1,,...,,,16.9979,2,58,60,11.0006,87.0,2011,Good
3,1877,2011-03-23,Spring,,3,23,0,Wednesday,1,,...,,,0.0000,2,24,26,5.9978,90.0,2011,Bad
4,1885,2011-03-24,Spring,,3,7,0,Thursday,1,,...,,,15.0013,2,106,108,1.0016,100.0,2011,Bad
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81,3824,2011-06-13,Spring,,6,5,0,Monday,1,,...,,,26.0027,2,27,29,18.0032,64.0,2011,Good
82,3845,2011-06-14,Spring,,6,2,0,Tuesday,1,,...,,,19.0012,2,8,10,24.9992,49.0,2011,Good
83,3914,2011-06-16,Spring,,6,23,0,Thursday,1,,...,,,16.9979,2,14,16,20.9996,83.0,2011,Bad
84,3917,2011-06-17,Spring,,6,2,0,Friday,1,,...,,,7.0015,2,11,13,18.9998,94.0,2011,Good
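
The `None` values visible in the previews above can be counted per column before any expectation is written. The following is a minimal sketch in plain Python (standard-library `sqlite3` only, not a Great Expectations API). The helper name `null_counts` and the tiny in-memory table are assumptions made for illustration; the rows are made up and are not the workshop data.

```python
import sqlite3


def null_counts(conn: sqlite3.Connection, table: str) -> dict:
    """Return the number of NULL values for every column of `table`."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    counts = {}
    for col in columns:
        (n,) = conn.execute(
            f'SELECT COUNT(*) FROM {table} WHERE "{col}" IS NULL'
        ).fetchone()
        counts[col] = n
    return counts


# Toy in-memory table with made-up rows (hypothetical, for illustration only).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE bike_rental (id INTEGER, yr INTEGER, year INTEGER, total INTEGER)"
)
demo.executemany(
    "INSERT INTO bike_rental VALUES (?, ?, ?, ?)",
    [(1, None, 2011, 13), (2, None, 2011, 7), (3, None, 2011, 6)],
)

demo_null_counts = null_counts(demo, "bike_rental")
print(demo_null_counts)
```

To run this against the workshop database, pass the existing `conn` object instead of `demo`. Columns whose NULL count equals the row count are completely empty.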


# Task 2: Set Expectations for the Spring Data
Now that you have explored the dataset, it is time to set up your first expectations (exciting!).

# TODO: add some basic expectations as markdown, for example

In [8]:
# Task 1: Check if the season is Spring

# Set up a new expectation; we will use it to check that the season of every row is Spring
season_expectation = gx.expectations.ExpectColumnValuesToBeInSet(
    column="season",
    value_set=["Spring"],
)

result = full_table_batch.validate(season_expectation, result_format="COMPLETE")
check(task=1, result=result)


2025-04-08 15:46:21.849 | SUCCESS  | checker:check:21 - Great job! The result is as expected.


In [9]:
# TODO: make them set one expectation for another datatype themselves

In [10]:
# TODO: add expectations of maximum bike rentals according to the max today + 50 or so
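
One possible way to derive such a bound is to read the largest value of `total` seen so far and add a safety margin. This is a plain-Python sketch using `sqlite3`, not a Great Expectations call. The helper names `derive_max_bounds` and `max_is_within`, the margin of 50, and the toy rows are assumptions for illustration only.

```python
import sqlite3


def derive_max_bounds(conn, margin=50):
    """Return (min_value, max_value) bounds for the column maximum of `total`.

    The upper bound is the largest count observed so far plus a safety margin.
    """
    (observed_max,) = conn.execute("SELECT MAX(total) FROM bike_rental").fetchone()
    return 0, observed_max + margin


def max_is_within(conn, bounds):
    """Check whether the current maximum of `total` lies inside `bounds`."""
    low, high = bounds
    (current_max,) = conn.execute("SELECT MAX(total) FROM bike_rental").fetchone()
    return low <= current_max <= high


# Toy in-memory table with made-up rows (hypothetical, not the workshop data).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE bike_rental (id INTEGER, total INTEGER)")
demo.executemany(
    "INSERT INTO bike_rental VALUES (?, ?)", [(1, 13), (2, 7), (3, 108)]
)

spring_bounds = derive_max_bounds(demo)          # bounds learned from "spring"
ok_before = max_is_within(demo, spring_bounds)   # same data, so the check passes

demo.execute("INSERT INTO bike_rental VALUES (4, 400)")  # a later, busier hour
ok_after = max_is_within(demo, spring_bounds)    # the old bound is now exceeded
```

The derived pair could then be passed as `min_value` and `max_value` to an expectation on the column maximum, for example `gx.expectations.ExpectColumnMaxToBeBetween(column="total", ...)`, assuming that expectation is available in the installed version.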

In [11]:
# TODO: set expectation for correlation of rising temperatures to rising bike rentals
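
A correlation check can be prototyped outside the framework first. The sketch below computes the Pearson correlation between `felt_temp` and `total` in plain Python and compares it with a minimum value. The function name `pearson`, the threshold `MIN_CORRELATION = 0.3`, and the toy rows are assumptions for illustration; a suitable threshold has to be chosen from the real data.

```python
import math
import sqlite3


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy in-memory table with made-up rows (hypothetical, not the workshop data).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE bike_rental (felt_temp REAL, total INTEGER)")
demo.executemany(
    "INSERT INTO bike_rental VALUES (?, ?)",
    [(2.0, 10), (5.0, 20), (10.0, 35), (15.0, 60), (20.0, 80)],
)

# Skip rows where either value is missing, otherwise the arithmetic would fail.
rows = demo.execute(
    "SELECT felt_temp, total FROM bike_rental "
    "WHERE felt_temp IS NOT NULL AND total IS NOT NULL"
).fetchall()
temps = [r[0] for r in rows]
totals = [r[1] for r in rows]

MIN_CORRELATION = 0.3  # assumed threshold, to be tuned on real data
r = pearson(temps, totals)
expectation_met = r >= MIN_CORRELATION
```

A check like this only says that warmer hours tend to have more rentals in the sample; it does not show that temperature causes the increase.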

In [12]:
# TODO: check the expectations

But you didn't come to this Workshop just to see a fancy way of doing exactly what a basic unit test does, so let's get into more complex stuff.

# Task 3: Adjust and Set New Expectations for the Summer Data

In [13]:
# TODO: load the summer dataset

In [14]:
# TODO: run the expectations for that new dataset and look at the output

## Excursion: Data Docs

In [15]:
# TODO: explain data docs and how it works

In [16]:
# TODO: generate the data docs

In [17]:
# TODO: look at the data docs

## Back To Business

In [18]:
# TODO: refine the expectations

In [19]:
# TODO: add more complex expectations (give them a list of suggestions again)

In [20]:
# TODO: add a fun expectation, that expects bike rentals to rise, because they have risen before
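
The "rentals keep rising" idea can be tried in plain Python before it is turned into an expectation. This sketch sums `total` per month and checks that the monthly sums never decrease. The helper names `monthly_totals` and `is_non_decreasing` and the toy rows are assumptions for illustration. The rule is deliberately naive and may stop holding once data from later seasons is added.

```python
import sqlite3


def monthly_totals(conn):
    """Return a list of (mnth, summed total) pairs ordered by month."""
    return conn.execute(
        "SELECT mnth, SUM(total) FROM bike_rental GROUP BY mnth ORDER BY mnth"
    ).fetchall()


def is_non_decreasing(values):
    """True if every value is at least as large as the one before it."""
    return all(prev <= cur for prev, cur in zip(values, values[1:]))


# Toy in-memory table with made-up rows (hypothetical, not the workshop data).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE bike_rental (mnth INTEGER, total INTEGER)")
demo.executemany(
    "INSERT INTO bike_rental VALUES (?, ?)",
    [(3, 13), (3, 7), (4, 20), (4, 25), (5, 90)],
)

per_month = monthly_totals(demo)
rentals_keep_rising = is_non_decreasing([total for _, total in per_month])
```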

In [21]:
# TODO: check the expectations

# Task 4: Adjust for Autumn 

In [22]:
# TODO: load new dataset

In [23]:
# TODO: check the expectations

In [24]:
# TODO: fix what needs fixing

In [25]:
# TODO: Maybe add something even more complex?

In [26]:
# TODO: Recheck the Expectations

# Task 5: Check with Winter and set final expectations
You can check out all kinds of expectations here: https://greatexpectations.io/expectations/

In [27]:
# TODO: load new dataset

In [28]:
# TODO: check the expectations

In [29]:
# TODO: fix what needs fixing

In [30]:
# TODO: Maybe add something even more complex?

In [31]:
# TODO: Recheck the Expectations

# Task 6: Verify your data and see if something shifts the next year

In [32]:
# TODO: load new dataset

In [33]:
# TODO: check the expectations
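
A simple way to look for a shift between the two years is to compare a summary statistic per year. The sketch below compares the mean of `total` per `year` and flags a shift when the relative change exceeds a threshold. The helper names, the threshold of 0.25, and the toy rows are assumptions for illustration; this is plain Python, not a Great Expectations feature.

```python
import sqlite3


def mean_total_by_year(conn):
    """Return {year: mean of total} for all years in the table."""
    rows = conn.execute(
        "SELECT year, AVG(total) FROM bike_rental GROUP BY year ORDER BY year"
    ).fetchall()
    return dict(rows)


def relative_shift(baseline, current):
    """Relative change of `current` with respect to `baseline`."""
    return (current - baseline) / baseline


# Toy in-memory table with made-up rows (hypothetical, not the workshop data).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE bike_rental (year INTEGER, total INTEGER)")
demo.executemany(
    "INSERT INTO bike_rental VALUES (?, ?)",
    [(2011, 10), (2011, 20), (2011, 30), (2012, 30), (2012, 40), (2012, 50)],
)

SHIFT_THRESHOLD = 0.25  # assumed tolerance: 25 % change in the mean
means = mean_total_by_year(demo)
shift = relative_shift(means[2011], means[2012])
data_shift_detected = abs(shift) > SHIFT_THRESHOLD
```

If a shift is flagged, bounds that were learned on the first year (for example a maximum for `total`) are likely to need revisiting rather than the new data being "wrong".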

Discuss these expectations: did you do a good job? What changed? Do you now have confidence in the data foundation for your AI model? Discuss the pros and cons of using a Testing Framework.

# Task 7: Think about AI Implementation

Could you now implement AI to design a flexible pricing model? How would you do it? What is the advantage over doing this by hand?

In [34]:
# TODO: Make this last part better and more to the point ^^