**Before you start, select the "Workshop Environment" kernel from the Jupyter kernel list.**

# Data Quality Testing with Great Expectations

## Introduction
Data quality is the foundation of reliable analytics: poor data leads to flawed insights and decisions. This workshop explores data quality testing with the Great Expectations library.

## Why Data Quality Matters
- **Trust**: Quality data builds confidence in results
- **Efficiency**: Early detection prevents downstream issues 
- **Consistency**: Ensures reliable model performance
- **Governance**: Meets regulatory requirements

## Our Approach
Using the "Bike Sharing" dataset from UCI, we'll learn how to:
- Define expectations about your data
- Validate these expectations systematically
- Document and report quality issues
- Integrate quality checks into pipelines

# Task 1: Explore the Dataset
Let's begin by exploring the Bike Sharing dataset. Download the data, load it into a dataframe, and perform initial exploratory analysis to understand its structure and contents.

In [None]:
# Import necessary libraries for data processing and quality testing
# - great_expectations: Our primary tool for data quality validation
# - sqlite3: To connect with our SQLite database
# - pandas: For data manipulation and analysis
import great_expectations as gx
import sqlite3
import pandas as pd

from utils import database
from utils.checker import check

In [None]:
# Initialize the database with our bike sharing dataset
# This sets up a SQLite database with the necessary tables and imports the data
database.init()

# Create a connection to our database for querying
conn = sqlite3.connect("database.db")

In [None]:
# Set up Great Expectations context
# This creates the environment where we define and validate expectations
context = gx.get_context()

# Add our SQLite database as a data source for Great Expectations
# This allows us to test data directly from the database
data_source = context.data_sources.add_sqlite(
    "sample", connection_string="sqlite:///database.db"
)

In [None]:
# Define the data asset we want to validate
# An asset in Great Expectations represents a table or query result that we want to test
asset_name = "bike_rental"
database_table_name = "bike_rental"
table_data_asset = data_source.add_table_asset(
    table_name=database_table_name, name=asset_name
)

# Create a batch definition that specifies which data we want to validate
# Here we're selecting the entire table for our first season (spring 2011)
full_table_batch_definition = table_data_asset.add_batch_definition_whole_table(
    name="0_spring_2011",
)

In [None]:
# Load the data into a batch and display the first few rows
# This gives us our first look at the structure and content of the dataset
full_table_batch = full_table_batch_definition.get_batch()

full_table_batch.head().data.loc[
    :, ["season", "weekday", "temp", "casual", "registered", "total"]
]

In [None]:
# Query the database directly to inspect individual records
# Here we select the rows where the 'casual' rider count equals 2
# Looking at concrete rows helps us sanity-check the values in each column
query = """
SELECT season, weekday, temp, casual, registered, total
FROM bike_rental
WHERE casual = 2
"""

pd.read_sql_query(query, conn)

# Task 2: Set Expectations for the Spring Data

Now that you've explored the dataset, it's time to define your first expectations - the rules that your data should follow to be considered high quality.

## Basic Expectations Examples

Great Expectations provides various expectation types to validate different aspects of your data. Browse all available expectations in the gallery: https://greatexpectations.io/expectations/

1. **Column Existence**
   ```python
   # Check that specific columns exist in your dataset
   expect_column_to_exist(column="temp")
   ```

2. **Set Membership**
   ```python
   # Confirm categorical variables contain only allowed values
   expect_column_values_to_be_in_set(
       column="weathersit", 
       value_set=[1, 2, 3, 4]  # 1:Clear, 2:Cloudy, 3:Light Rain, 4:Heavy Rain
   )
   ```


In [None]:
# Task 1: Check if the 'season' column exists in the dataset
# This is a fundamental check to ensure that our data has the expected structure

### SOLUTION_START ###
expectation = gx.expectations.ExpectColumnToExist(
    column="season",
)
### SOLUTION_END ###

result = full_table_batch.validate(expectation, result_format="COMPLETE")
check(task=1, result=result)

In [None]:
# Task 2: Check that the 'season' column contains only 'Spring'
# This is a more specific check: it validates the column's values, not just its existence

### SOLUTION_START ###
expectation = gx.expectations.ExpectColumnValuesToBeInSet(
    column="season",
    value_set=["Spring"],
)
### SOLUTION_END ###

result = full_table_batch.validate(expectation, result_format="COMPLETE")
print("The expectation result is:", result["success"])
check(task=2, result=result)

The expectation failed because the 'season' column contains values other than 'Spring'. Let's investigate further by querying the distinct values in the 'season' column.

In [None]:
# Investigate why the expectation failed
query = """
SELECT DISTINCT season
FROM bike_rental
"""

pd.read_sql_query(query, conn)
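
Beyond listing the distinct values, it can help to see how many rows each value accounts for. The following is a minimal sketch using only the standard library and an in-memory SQLite table. The table contents are made-up stand-ins, not the workshop data, and `demo_conn` is a hypothetical name chosen to avoid clashing with the notebook's `conn`.

```python
import sqlite3

# In-memory stand-in for the workshop database; the rows are illustrative only.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE bike_rental (season TEXT, total INTEGER)")
demo_conn.executemany(
    "INSERT INTO bike_rental (season, total) VALUES (?, ?)",
    [("Spring", 120), ("Spring", 95), ("Summer", 310)],
)

# Count rows per season so unexpected values appear together with their frequency.
season_counts = demo_conn.execute(
    """
    SELECT season, COUNT(*) AS n_rows
    FROM bike_rental
    GROUP BY season
    ORDER BY season
    """
).fetchall()

print(season_counts)  # [('Spring', 2), ('Summer', 1)]
```

The same `GROUP BY` query could be passed to `pd.read_sql_query` with the notebook's own connection to get the counts as a dataframe.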

In [None]:
# Task 3: Fix the data quality issue
# Here we need to remove records that do not meet our expectation
# This is a critical step to ensure that our dataset is clean and reliable
# We will delete records where the season is not 'Spring'

### SOLUTION_START ###
query = """
DELETE FROM bike_rental
WHERE season != 'Spring'
"""
### SOLUTION_END ###

# Execute the DELETE query to remove records that do not meet the expectation
with conn:
    conn.execute(query)

# Re-run the validation after fixing the data
full_table_batch = full_table_batch_definition.get_batch()
result = full_table_batch.validate(expectation, result_format="COMPLETE")
print("The expectation result is:", result["success"])

check(task=3, result=result)
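
Because a `DELETE` is destructive, it can be worth previewing how many rows it will remove and confirming the count afterwards. The sketch below illustrates this with the standard library and an in-memory table. The rows and the `demo_conn` name are assumptions for illustration, not the workshop data. It also shows one subtlety of the `!=` comparison: rows whose `season` is `NULL` are not matched and therefore survive the delete.

```python
import sqlite3

# In-memory stand-in for the workshop database; the rows are illustrative only.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE bike_rental (season TEXT, total INTEGER)")
demo_conn.executemany(
    "INSERT INTO bike_rental (season, total) VALUES (?, ?)",
    [("Spring", 120), ("Summer", 310), (None, 42)],
)
demo_conn.commit()

# Preview how many rows the DELETE would touch before running it.
to_delete = demo_conn.execute(
    "SELECT COUNT(*) FROM bike_rental WHERE season != 'Spring'"
).fetchone()[0]

# `with demo_conn:` commits on success and rolls back if an exception is raised.
with demo_conn:
    cursor = demo_conn.execute("DELETE FROM bike_rental WHERE season != 'Spring'")
    deleted = cursor.rowcount

remaining = demo_conn.execute(
    "SELECT season FROM bike_rental ORDER BY rowid"
).fetchall()

print(to_delete, deleted)  # 1 1
print(remaining)           # [('Spring',), (None,)]
```

If rows with a missing season should be removed as well, the condition could be extended to `WHERE season != 'Spring' OR season IS NULL`.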

In [None]:
# TODO: add expectations of maximum bike rentals according to the max today + 50 or so

In [None]:
# TODO: set expectation for correlation of rising temperatures to rising bike rentals

In [None]:
# TODO: check the expectations

But you didn't come to this workshop just to see a fancier way of doing what a basic unit test already does, so let's move on to more complex checks.

# Task 3: Adjust and Set New Expectations for the Summer Data

In [None]:
# TODO: load the summer dataset

In [None]:
# TODO: run the expectations for that new dataset and look at the output

## Excursion: Data Docs

In [None]:
# TODO: explain data docs and how it works

In [None]:
# TODO: generate the data docs

In [None]:
# TODO: look at the data docs

## Back To Business

In [None]:
# TODO: refine the expectations

In [None]:
# TODO: add more complex expectations (give them a list of suggestions again)

In [None]:
# TODO: add a fun expectation, that expects bike rentals to rise, because they have risen before

In [None]:
# TODO: check the expectations

# Task 4: Adjust for Autumn 

In [None]:
# TODO: load new dataset

In [None]:
# TODO: check the expectations

In [None]:
# TODO: fix what needs fixing

In [None]:
# TODO: Maybe add something even more complex?

In [None]:
# TODO: Recheck the Expectations

# Task 5: Check the Winter Data and Set Final Expectations
You can check out all kinds of expectations here: https://greatexpectations.io/expectations/

In [None]:
# TODO: load new dataset

In [None]:
# TODO: check the expectations

In [None]:
# TODO: fix what needs fixing

In [None]:
# TODO: Maybe add something even more complex?

In [None]:
# TODO: Recheck the Expectations

# Task 6: Verify Your Data and Check for Shifts in the Following Year

In [None]:
# TODO: load new dataset

In [None]:
# TODO: check the expectations

Discuss these expectations: Did you do a good job? What changed? Do you now have confidence in the data foundation for your AI model? Also discuss the pros and cons of using a testing framework.

# Task 7: Think about AI Implementation

Could you now use AI to design a flexible pricing model? How would you approach it? What is the advantage over doing this by hand?

In [None]:
# TODO: Make this last part better and more to the point ^^