**IMPORTANT: Set the kernel to "Workshop Environment" from the Jupyter Kernels menu before running.**

# Data Quality Testing with Great Expectations

## Introduction
Data quality is the cornerstone of reliable analytics and trustworthy AI systems. Poor-quality data leads to flawed insights, inaccurate models, and bad decisions. This workshop provides a hands-on introduction to data quality testing with the open-source library Great Expectations.

## Why Data Quality Matters
- **Trust**: High-quality data builds confidence in analytical results and model predictions.
- **Efficiency**: Detecting and fixing data issues early prevents costly downstream problems in ETL pipelines and model training.
- **Consistency**: Ensures data reliability over time, leading to more stable and predictable model performance.
- **Governance**: Helps meet data governance standards and regulatory requirements.

## Our Approach in this Workshop
We will use the "Bike Sharing" dataset from the UCI Machine Learning Repository. Throughout this workshop, you'll learn how to:
- Define explicit expectations (rules) about your data.
- Systematically validate data against these expectations.
- Document data quality results and identify issues using Data Docs.
- Understand how to integrate quality checks into data pipelines (conceptually).

## Goals of this Workshop
- Learn to design meaningful data quality checks applicable to real-world scenarios, such as building the data foundation for a flexible (AI) pricing system.
- Gain experience using Great Expectations to define, validate, and visualize expectations and their results over time.

## Dataset Information
We'll be working with a modified version of the UCI Bike Sharing dataset (https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset). For simplicity and clarity in this workshop, we've made slight alterations to the columns and denormalized some measures (like temperature) to make them more intuitive.

In [1]:
# Import necessary libraries
# - great_expectations (gx): The core library for defining and validating data expectations.
# - sqlite3: Used to connect to and interact with our SQLite database containing the dataset.
# - pandas (pd): Essential for data manipulation, analysis, and reading SQL query results.
# - uuid: Used to generate unique names for Great Expectations objects like Suites and Validation Definitions.
import great_expectations as gx
import sqlite3
import pandas as pd
import uuid

# Import utility functions specific to this workshop environment
# - database: Contains functions to initialize and manage the workshop database.
# - checker: Provides helper functions to check task completion.
# - server: Includes functions to serve Great Expectations Data Docs locally.
from utils import database
from utils.checker import check_solution
from utils.server import serve_docs, stop_server

metric column.standard_deviation.aggregate_fn is being registered with different metric_provider; overwriting metric_provider


## Task 1: Initialize Environment and Explore Initial Data

First, we need to set up our environment. This involves initializing a local SQLite database with the first part of our dataset (Rental Bike Data from Spring 2011) and establishing a connection to it. We'll then configure Great Expectations to use this database as a data source and take a first look at the data structure.

In [2]:
# Initialize the database using the provided utility function.
# This function typically resets the database, creates the necessary table schema,
# and loads the initial dataset (Spring 2011 bike rentals).
database.init()

# Create a standard Python DB-API connection object to our SQLite database.
# We'll use this connection for direct SQL queries with pandas later.
conn = sqlite3.connect("database.db")

2025-06-24 11:23:16.641 | INFO     | utils.database:reset_database:54 - Database reset completed
2025-06-24 11:23:16.643 | INFO     | utils.database:init:142 - Initializing database to step 0: 0_spring_2011
2025-06-24 11:23:16.645 | INFO     | utils.database:apply_migration:88 - Migration 0_create_table.sql from 0_spring_2011 - Successfully applied
2025-06-24 11:23:16.663 | INFO     | utils.database:apply_migration:88 - Migration 1_bike_rental_2011_spring.sql from 0_spring_2011 - Successfully applied


In [3]:
# Obtain a Great Expectations Data Context.
# The Data Context is the main entry point for the Great Expectations API,
# managing configurations for Data Sources, Expectation Suites, and Validation Results.
gx_context = gx.get_context()

# Add our SQLite database as a Data Source within the Great Expectations context.
# This tells Great Expectations how to connect to our data.
# We give it a name ('sqlite_datasource') and provide the connection string.
sqlite_data_source = gx_context.data_sources.add_sqlite(
    name="sqlite_datasource", connection_string="sqlite:///database.db"
)

In [4]:
# Define a Data Asset representing the table we want to validate.
# An Asset is a logical representation of data (like a table or a query result)
# within a Data Source that we want to apply expectations to.
asset_name = "bike_rental_asset"
database_table_name = "bike_rental" # The actual table name in the SQLite DB
table_data_asset = sqlite_data_source.add_table_asset(
    name=asset_name,
    table_name=database_table_name
)

# Create a Batch Definition for the initial data load (Spring 2011).
# A Batch Definition specifies *how* to fetch a batch of data from the asset.
# Here, 'add_batch_definition_whole_table' means we want to validate the entire table
# as a single batch for this initial dataset.
spring_2011_batch_definition = table_data_asset.add_batch_definition_whole_table(
    name="spring_2011_data", # A descriptive name for this specific batch definition
)

In [5]:
# Fetch the actual data Batch based on the Batch Definition.
# A Batch is the specific slice of data (in this case, the whole Spring 2011 table)
# that we will run validations against.
spring_2011_batch = spring_2011_batch_definition.get_batch()

# Display the first few rows of the loaded Batch using its built-in head() method.
# This provides a quick preview of the data's structure and content.
# We select specific columns for a cleaner view.
spring_2011_batch.head().data.loc[
    :, ["season", "weekday", "temp", "casual", "registered", "total"]
]


   season weekday  temp  casual  registered  total
0  Spring  Monday  7.98       2          11     13
1  Spring  Monday  7.98       1           6      7
2  Spring  Monday  7.98       1           5      6
3  Spring  Monday  7.98       0           1      1
4  Spring  Monday  7.04       1           1      2


In [6]:
# Perform direct SQL querying for initial exploration or investigation.
# Here, we query the database directly (using pandas and our sqlite connection)
# to examine records where the 'casual' rider count is exactly 2.
# This type of ad-hoc analysis helps understand data distributions and identify potential anomalies
# or patterns before defining formal expectations.
query = """
SELECT season, weekday, temp, casual, registered, total
FROM bike_rental
WHERE casual = 2
"""

# Execute the query and display the results using pandas
pd.read_sql_query(query, conn)

    season    weekday   temp  casual  registered  total
0   Spring     Monday   7.98       2          11     13
1   Spring     Monday   7.04       2          30     32
2   Spring    Tuesday  10.80       2          58     60
3   Spring  Wednesday   7.04       2          24     26
4   Spring   Thursday   4.22       2         106    108
..     ...        ...    ...     ...         ...    ...
81  Spring     Monday  17.38       2          27     29
82  Spring    Tuesday  20.20       2           8     10
83  Spring   Thursday  20.20       2          14     16
84  Spring     Friday  18.32       2          11     13


## Task 2: Define and Validate Initial Expectations (Spring Data)

Now that we've loaded and briefly explored the initial Spring 2011 data, let's define our first **Expectations**. Expectations are assertions or rules about what we expect from our data for it to be considered high quality. We'll start with some basic structural and content checks.

### Basic Expectation Examples

Great Expectations offers a wide range of built-in expectation types. You can explore the full list in the [Expectation Gallery](https://greatexpectations.io/expectations/). Here are a couple of common examples:

1.  **Column Existence**: Ensure specific columns are present.
    ```python
    # Checks if a column named 'temp' exists
    gx.expectations.ExpectColumnToExist(column="temp")
    ```

2.  **Set Membership**: Verify that values in a categorical column belong to an allowed set.
    ```python
    # Checks if 'weathersit' values are only 1, 2, 3, or 4
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="weathersit", 
        value_set=[1, 2, 3, 4]  # 1:Clear, 2:Cloudy, 3:Light Rain, 4:Heavy Rain
    )
    ```
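
To make the idea of an expectation concrete, here is a minimal plain-Python sketch of what a set-membership check computes. It is not how Great Expectations implements the check: the function name `check_values_in_set` and the sample values are illustrative assumptions, and null handling is left out.

```python
from typing import Any, Iterable


def check_values_in_set(values: Iterable[Any], value_set: Iterable[Any]) -> dict:
    """Hypothetical helper: report which values fall outside an allowed set."""
    allowed = set(value_set)
    values = list(values)
    unexpected = [v for v in values if v not in allowed]
    percent = 100.0 * len(unexpected) / len(values) if values else 0.0
    return {
        "success": not unexpected,
        "element_count": len(values),
        "unexpected_count": len(unexpected),
        "unexpected_percent": percent,
        "unexpected_list": unexpected,
    }


# Illustrative sample, not taken from the workshop database.
sample_weathersit = [1, 2, 2, 3, 7]
report = check_values_in_set(sample_weathersit, value_set=[1, 2, 3, 4])
print(report["success"], report["unexpected_list"])
```

The single out-of-set value makes the check fail and shows up in the list of unexpected values, which is the kind of detail a validation result exposes for investigation.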

We will now define and validate expectations against our `spring_2011_batch`.

### Task 2.1: Check if the 'season' column exists.

In [7]:
# This is a fundamental structural check. 
# If critical columns are missing, downstream processing will likely fail.

# Instantiate the Expectation object.
### SOLUTION_START ###
expectation = gx.expectations.ExpectColumnToExist(
    column="season",
)
### SOLUTION_END ###

# Validate the expectation against our Spring 2011 batch.
# The `validate` method runs the expectation logic on the batch data.
# `result_format="COMPLETE"` provides detailed results, including observed values.
result = spring_2011_batch.validate(expectation, result_format="COMPLETE")


In [8]:
# Use the workshop's checker utility to verify the task result.
check_solution(task=1, result=result)

2025-06-24 11:23:18.163 | SUCCESS  | utils.checker:check_solution:214 - Great job! The result of the expectation is correct. Continue with the next task.


### Task 2.2: Check if the 'season' column *only* contains 'Spring'.

In [9]:
# Since we loaded only Spring data, we expect this to be true.
# This checks for data consistency within this specific batch.

### SOLUTION_START ###
expectation = gx.expectations.ExpectColumnValuesToBeInSet(
    column="season",
    value_set=["Spring"], # The set of allowed values
)
### SOLUTION_END ###

# Validate this expectation against the same Spring 2011 batch.
result = spring_2011_batch.validate(expectation, result_format="COMPLETE")

# Print the overall success status from the result object.
# The expected result of this expectation is False, because the dataset contains an error.
# This task therefore succeeds if the expectation fails.
print("The expectation result is: ", result["success"])


The expectation result is:  False


In [10]:
# Use the workshop's checker utility.
check_solution(task=2, result=result)

2025-06-24 11:23:18.426 | SUCCESS  | utils.checker:check_solution:214 - Great job! The result of the expectation is correct. Continue with the next task.


**Investigation:** The previous expectation failed! This indicates that our assumption was incorrect, and the 'season' column in the initial data contains values other than 'Spring'. To understand why, let's query the distinct values present in the 'season' column directly from the database.

In [11]:
# Investigate why the ExpectColumnValuesToBeInSet expectation failed.
# Query the database for all unique values in the 'season' column.
query = """
SELECT DISTINCT season
FROM bike_rental
"""

# Execute the query using pandas and the database connection.
pd.read_sql_query(query, conn)

   season
0  Spring
1  Sprung


**Data Cleaning:** The investigation revealed a data quality issue: a typo ('Sprung' instead of 'Spring'). In a real-world scenario, we might update the source or apply a transformation. For this workshop, we'll directly remove the offending rows from our database table to ensure the data conforms to our expectation.
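
As an alternative to deleting rows, the transformation route mentioned above could be sketched as follows. The example runs against a throwaway in-memory SQLite table with a made-up three-row sample, so it neither touches the workshop database nor replaces the task below. That 'Sprung' should be mapped to 'Spring' is an assumption made for illustration.

```python
import sqlite3

# Throwaway in-memory database; the rows are illustrative, not workshop data.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE bike_rental (season TEXT, total INTEGER)")
demo_conn.executemany(
    "INSERT INTO bike_rental (season, total) VALUES (?, ?)",
    [("Spring", 13), ("Sprung", 7), ("Spring", 6)],
)

# Repair the typo in place instead of dropping the affected rows.
with demo_conn:  # commits on success, rolls back on error
    cursor = demo_conn.execute(
        "UPDATE bike_rental SET season = ? WHERE season = ?",
        ("Spring", "Sprung"),
    )
    fixed_rows = cursor.rowcount

remaining = demo_conn.execute(
    "SELECT season, COUNT(*) FROM bike_rental GROUP BY season"
).fetchall()
print(fixed_rows, remaining)
```

Updating keeps the rental counts of the affected rows, whereas deleting discards them. Which option is appropriate depends on whether the rest of the row can be trusted.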

### Task 2.3: Fix the data quality issue found.

In [12]:
# Define a SQL query to delete rows where 'season' is not 'Spring'.
# This is a direct manipulation for workshop purposes; in practice, data cleaning
# might involve more complex logic or updating source systems.

### SOLUTION_START ###
query = """
DELETE FROM bike_rental
WHERE season != 'Spring'
"""
### SOLUTION_END ###

# Execute the DELETE query using the database connection.
# Using 'with conn:' ensures the transaction is committed.
with conn:
    conn.execute(query)

# IMPORTANT: Re-fetch the batch after modifying the underlying data.
# Great Expectations batches often represent a snapshot; fetching it again ensures
# we are validating the *cleaned* data.
spring_2011_batch = spring_2011_batch_definition.get_batch()

# Re-run the previous ExpectColumnValuesToBeInSet expectation to confirm the fix.
# We expect this to succeed now.
result = spring_2011_batch.validate(expectation, result_format="COMPLETE")
print("The expectation result is: ", result["success"])


The expectation result is:  True


In [13]:
# Use the workshop's checker utility.
check_solution(task=3, result=result)

2025-06-24 11:23:18.721 | SUCCESS  | utils.checker:check_solution:214 - Great job! The result of the expectation is correct. Continue with the next task.


Unit tests can check for specific values, but Great Expectations excels at defining broader data characteristics and constraints. Let's explore an expectation that checks the range of a numeric column.

### Task 2.4: Add an expectation for the range of hourly bike rentals.

In [14]:
# First, let's find the actual minimum and maximum values of the 'total' column
# in our current (cleaned Spring 2011) dataset to inform our expectation.
# The 'total' column represents the total number of rentals in a given hour.

# Write the SQL query to find the min and max of the 'total' column.
### SOLUTION_START ###
query = """
SELECT min(total), max(total) 
FROM bike_rental
"""
### SOLUTION_END ###

# Execute the query and print the result.
print(pd.read_sql_query(query, conn))

   min(total)  max(total)
0           1         638


### Task 2.5: Define range expectation based on query.

In [15]:
# Now, define an Expectation that the maximum value of the 'total' column
# should fall within a certain range.
# Based on the query result (max=638), let's set the range to be
# 90% to 110% of the observed maximum. This allows for some variation
# but flags significant deviations.
observed_max = 638

### SOLUTION_START ###
expectation = gx.expectations.ExpectColumnMaxToBeBetween(
    column="total",
    min_value=0.9 * observed_max, # Lower bound (90%)
    max_value=1.1 * observed_max, # Upper bound (110%)
)
### SOLUTION_END ###

# Validate the expectation against the Spring 2011 batch.
result = spring_2011_batch.validate(expectation, result_format="COMPLETE")


In [16]:
# Use the workshop's checker utility.
check_solution(task=4, result=result)

2025-06-24 11:23:18.928 | SUCCESS  | utils.checker:check_solution:214 - Great job! The result of the expectation is correct. Continue with the next task.
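
For reference, the bounds used in Task 2.5 work out as shown below. This is only a worked version of the 90%/110% arithmetic; the helper name `tolerance_bounds` and the candidate maxima are illustrative assumptions.

```python
def tolerance_bounds(observed: float, tolerance: float = 0.10) -> tuple[float, float]:
    """Hypothetical helper: symmetric relative bounds around an observed value."""
    return (1 - tolerance) * observed, (1 + tolerance) * observed


low, high = tolerance_bounds(638)
print(round(low, 1), round(high, 1))  # 574.2 701.8

# A later maximum inside the band passes; one outside would be flagged.
for candidate_max in (638, 600, 900):
    print(candidate_max, low <= candidate_max <= high)
```

A band this narrow suits a fixed dataset. For data that keeps growing, the tolerance would probably need to be revisited as new seasons arrive.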


## Task 3: Advanced Expectations and Data Docs Introduction

Let's explore more sophisticated expectations, like validating data formats using regular expressions. We'll also introduce **Expectation Suites** and **Data Docs**, core Great Expectations concepts for organizing expectations and visualizing validation results.

### Task 3.1: Validate the date format using a regular expression.

In [17]:
# The 'dteday' column should contain dates in 'YYYY-MM-DD' format.
# We can enforce this using `ExpectColumnValuesToMatchRegex`.

### SOLUTION_START ###
# Define the regex pattern for YYYY-MM-DD
# \d{4}: exactly four digits (year)
# -: literal hyphen
# \d{2}: exactly two digits (month/day)
# ^: matches the beginning of the string
# $: matches the end of the string
iso_date_regex = r"^\d{4}-\d{2}-\d{2}$" 

expectation = gx.expectations.ExpectColumnValuesToMatchRegex(
    column="dteday",
    regex=iso_date_regex,
)
# Note: A more precise regex could validate month/day ranges, e.g.: 
# regex=r"^(?:(?:19|20)\d\d)-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$"
### SOLUTION_END ###

# Validate the expectation against the Spring 2011 batch.
result = spring_2011_batch.validate(expectation, result_format="COMPLETE")

# Print the list of values that did *not* match the expected format.
# The dataset contains some malformed dates, so this list is not empty.
print("These are the unexpected values:", ", ".join(result["result"]["unexpected_list"]))


These are the unexpected values: 2011/06/20, 2011.06.20, 20110620


In [18]:
# Use the workshop's checker utility.
check_solution(task=5, result=result)

2025-06-24 11:23:19.091 | SUCCESS  | utils.checker:check_solution:214 - Great job! The result of the expectation is correct. Continue with the next task.
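
The difference between the simple and the stricter pattern from Task 3.1 can be checked in plain Python with the standard `re` module. This is a sketch outside Great Expectations: the sample list combines the unexpected values reported above with two made-up dates, and the helper `is_real_date` is an illustrative assumption.

```python
import re
from datetime import date

simple_regex = re.compile(r"^\d{4}-\d{2}-\d{2}$")
strict_regex = re.compile(
    r"^(?:(?:19|20)\d\d)-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$"
)


def is_real_date(text: str) -> bool:
    """Hypothetical helper: True only for valid calendar dates in YYYY-MM-DD form."""
    if not simple_regex.match(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


samples = [
    "2011-06-20",  # well-formed
    "2011/06/20", "2011.06.20", "20110620",  # reported as unexpected above
    "2011-13-45",  # made-up: right shape, impossible month and day
    "2011-02-31",  # made-up: plausible ranges, but not a real date
]
for text in samples:
    print(
        text,
        bool(simple_regex.match(text)),
        bool(strict_regex.match(text)),
        is_real_date(text),
    )
```

The stricter pattern rejects impossible months and days but still accepts `2011-02-31`. Full calendar validity needs actual date parsing, which a regular expression alone does not provide.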


## Excursion: Expectation Suites and Data Docs

So far, we've defined and validated expectations one by one. For real projects, you'll want to group related expectations together and visualize the results clearly.

-   **Expectation Suite**: A collection of expectations defined for a specific data asset (like our bike rental table). Think of it as a test suite for your data.
-   **Data Docs**: Automatically generated HTML documentation displaying expectation definitions and validation results. They provide a clear, shareable report on data quality.

Let's create our first Expectation Suite and see how Data Docs work.

### Cheatsheet for Key Great Expectations Concepts
-   ***Data Context***: The main entry point for managing GX objects (Data Sources, Suites, etc.).
-   ***Data Source***: Configuration for connecting to a data system (like our SQLite DB).
-   ***Data Asset***: A logical representation of data (e.g., a table) within a Data Source.
-   ***Batch Definition***: Specifies how to fetch a slice of data (a Batch) from an Asset.
-   ***Batch***: A specific slice of data retrieved based on a Batch Definition, ready for validation.
-   ***Expectation***: A verifiable assertion or rule about your data.
-   ***Expectation Suite***: A collection of Expectations, typically applied together to a Data Asset.
-   ***Validation Definition***: Links an Expectation Suite to a Batch Definition, defining *what* to validate against *which* data.
-   ***Validation Result***: The outcome of running a Validation Definition on a Batch.
-   ***Checkpoint***: A bundle of one or more Validation Definitions (and, through them, their Batch Definitions and Expectation Suites) that can be re-run, also against different Batches. Its most powerful feature is the ability to run Actions (see below). Checkpoints are essential for pipeline workflows.
-   ***Actions***: Steps carried out after a Checkpoint has run. They are mostly used for sending notifications or updating the Data Docs, but can be customized as desired.
-   ***Data Docs***: Human-readable HTML reports generated from Expectation Suites and Validation Results.

### Conceptual Workflow
![](./img/gx_workflow_steps_and_components.png)

In [19]:
# Create an Expectation Suite.

# Define a unique name for our first Expectation Suite.
# Using UUID ensures the name is unique, especially if running the notebook multiple times.
suite_name = "bike_rental_suite_" + str(uuid.uuid4())

# Create an empty Expectation Suite object.
initial_expectation_suite = gx.ExpectationSuite(name=suite_name)

# Add some basic expectations to the suite.
# These check that the 'total' and 'dteday' columns contain no null values.
initial_expectation_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="total")
)
initial_expectation_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="dteday")
)

# Add and save the suite to the Great Expectations context.
# This makes the suite persistent within the context's configured store.
initial_expectation_suite = gx_context.suites.add(initial_expectation_suite)

In [20]:
# Create a Validation Definition.
# A Validation Definition links an Expectation Suite to a Batch Definition.
# It specifies *which suite* to run against *which data batch*.

# Define a unique name for the Validation Definition.
validation_name = "spring_data_validation_" + str(uuid.uuid4())

# Create the ValidationDefinition object.
spring_validation_definition = gx.ValidationDefinition(
    name=validation_name, # Unique identifier for this validation setup
    data=spring_2011_batch_definition, # Use the batch definition for Spring 2011 data
    suite=initial_expectation_suite, # Use the suite we just created
)

# Add and save the Validation Definition to the context.
spring_validation_definition = gx_context.validation_definitions.add(spring_validation_definition)

### Task 3.2: Run the Validation Definition.

In [21]:
# Calling `.run()` on the Validation Definition executes all expectations in the associated suite
# against a batch of data specified by the associated batch definition.
validation_results = spring_validation_definition.run()


In [22]:
# Use the workshop's checker utility.
check_solution(task=6, result=validation_results)

2025-06-24 11:23:19.386 | SUCCESS  | utils.checker:check_solution:214 - Great job! The result of the expectation is correct. Continue with the next task.


In [23]:
# Print the validation results.
# The output is a JSON-like object containing overall success status, results for each expectation,
# observed values, and metadata about the validation run.
print(validation_results)

# Hint: In Jupyter Lab/VS Code, you might need to right-click the output and choose
# 'Open With -> Text Editor' or similar to view the full JSON structure easily.

{
  "success": true,
  "results": [
    {
      "success": true,
      "expectation_config": {
        "type": "expect_column_values_to_not_be_null",
        "kwargs": {
          "batch_id": "sqlite_datasource-bike_rental_asset",
          "column": "total"
        },
        "meta": {},
        "id": "bb330585-6e0c-41fd-bc96-8eb85bafa893"
      },
      "result": {
        "element_count": 2207,
        "unexpected_count": 0,
        "unexpected_percent": 0.0,
        "partial_unexpected_list": [],
        "partial_unexpected_counts": []
      },
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_traceback": null,
        "exception_message": null
      }
    },
    {
      "success": true,
      "expectation_config": {
        "type": "expect_column_values_to_not_be_null",
        "kwargs": {
          "batch_id": "sqlite_datasource-bike_rental_asset",
          "column": "dteday"
        },
        "meta": {},
        "id": "62cc7baf-2
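
When the printed JSON gets long, it can help to reduce it to a short summary. The sketch below works on a plain dictionary shaped like the output above. The dictionary literal, including its failing entry, is made up for illustration, and `summarize_failures` is a hypothetical helper rather than part of Great Expectations.

```python
def summarize_failures(results: dict) -> list[str]:
    """Hypothetical helper: one line per failed expectation in a result dict."""
    lines = []
    for entry in results["results"]:
        if entry["success"]:
            continue
        config = entry["expectation_config"]
        column = config["kwargs"].get("column", "<table>")
        unexpected = entry["result"].get("unexpected_count")
        lines.append(f"{config['type']} on {column}: {unexpected} unexpected")
    return lines


# Made-up example mirroring the structure printed above.
example_results = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "total"},
            },
            "result": {"element_count": 5, "unexpected_count": 0},
        },
        {
            "success": False,
            "expectation_config": {
                "type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "dteday"},
            },
            "result": {"element_count": 5, "unexpected_count": 2},
        },
    ],
}

for line in summarize_failures(example_results):
    print(line)
```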

In [24]:
# Build and View Data Docs.
# Great Expectations can generate HTML documentation from your suites and validation results.

# This command builds the Data Docs based on the suites and results stored in the context.
# It returns a dictionary keyed by site name; 'local_site' holds the location of the locally built docs.
data_doc_site_path = gx_context.build_data_docs()['local_site']

In [25]:
# Use the workshop utility to start a simple local HTTP server
# to view the generated Data Docs in your browser.
server_instance, server_thread = serve_docs(data_doc_site_path, port=8003, open_browser=False)
# Note: 'open_browser=False' prevents automatic opening; open the link printed in the output below manually.

Serving directory '/private/var/folders/6b/st166m7n5b91t44xhm_1s5sm0000gn/T/tmpystfn99w' containing 'index.html'
Access documentation at: http://localhost:8003/
Server running on port 8003. Run stop_server(server, thread) to stop.


**Explore Data Docs:** Open the link provided in the output above (e.g., `http://localhost:8003/`).

Navigate through the Data Docs site:
-   Find the **Validation Results** section.
-   Click on the latest validation run (associated with `bike_rental_suite_...`).
-   Examine the results for each expectation (expect_column_values_to_not_be_null). See how the observed values compare to the expected criteria.
-   Explore the **Expectation Suites** section to see the definition of the suite itself.

Take a few minutes to familiarize yourself with the structure and information presented in Data Docs.

In [26]:
# Stop the local Data Docs server once you are finished exploring.
# It's good practice to release the port.
if 'server_instance' in locals() and server_instance:
    stop_server(server_instance, server_thread)
    # Clean up variables to prevent accidental reuse
    del server_instance
    del server_thread
else:
    print("Server was not started or already stopped.")

Shutting down server on port 8003...
Server on port 8003 stopped and port freed.


## Task 4: Evolving the Expectation Suite by Loading Summer Data

Data often changes over time: new categories appear and value ranges shift. Expectation Suites should evolve to reflect these changes. Summer data is about to be added to the table, so we need to update our `initial_expectation_suite` to allow 'Summer' in the 'season' column in addition to 'Spring'.

In [27]:
# Let's move to the next step of the workshop and add the Summer data.
database.set_step(1)

2025-06-24 11:23:20.404 | INFO     | utils.database:set_step:221 - Processing directory: 0_spring_2011
2025-06-24 11:23:20.406 | DEBUG    | utils.database:set_step:228 - Skipping 0_create_table.sql (already applied)
2025-06-24 11:23:20.407 | DEBUG    | utils.database:set_step:228 - Skipping 1_bike_rental_2011_spring.sql (already applied)
2025-06-24 11:23:20.408 | INFO     | utils.database:set_step:221 - Processing directory: 1_summer_2011
2025-06-24 11:23:20.447 | INFO     | utils.database:apply_migration:88 - Migration 0_bike_rental_2011_summer.sql from 1_summer_2011 - Successfully applied


True
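
Before widening an allowed value set, it can be useful to list which values in the table fall outside the current one. A minimal sketch, using an in-memory SQLite table with made-up rows instead of the workshop database; `allowed_seasons` is an illustrative name for the set accepted so far.

```python
import sqlite3

# Throwaway in-memory database; the rows are illustrative, not workshop data.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE bike_rental (season TEXT)")
demo_conn.executemany(
    "INSERT INTO bike_rental (season) VALUES (?)",
    [("Spring",), ("Spring",), ("Summer",), ("Summer",)],
)

allowed_seasons = {"Spring"}
observed_seasons = {
    row[0] for row in demo_conn.execute("SELECT DISTINCT season FROM bike_rental")
}

# Values present in the data but not yet covered by the allowed set.
new_values = sorted(observed_seasons - allowed_seasons)
print(new_values)
```

Reviewing this difference by hand helps separate legitimate new categories from typos before they are added to `value_set`.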

### Task 4.1: Add an expectation allowing 'Spring' or 'Summer' in the 'season' column.

In [28]:
# We create a new expectation instance reflecting this updated requirement.

### SOLUTION_START ###
updated_season_expectation = gx.expectations.ExpectColumnValuesToBeInSet(
    column="season",
    value_set=["Spring", "Summer"], # Now allowing both seasons
)
### SOLUTION_END ###

# Add this new/updated expectation to our existing suite.
# Note: The suite did not yet contain an expectation on 'season', so this simply adds a new one.
# How `add_expectation` treats an already existing expectation for the same column and type
# can depend on the GX version, so manage suites explicitly if you need precise control over replacement.
initial_expectation_suite.add_expectation(updated_season_expectation)

# We can immediately re-run our existing validation definition.
# Its batch definition covers the whole table, so the *updated* suite now runs against
# the current table contents, including the Summer data loaded in the previous step.
# The new expectation ('Spring' or 'Summer') should pass, since both seasons are allowed.
validation_results = spring_validation_definition.run()


In [29]:
# Use the workshop's checker utility.
check_solution(task=7, result=validation_results)

2025-06-24 11:23:20.608 | SUCCESS  | utils.checker:check_solution:214 - Great job! The result of the expectation is correct. Continue with the next task.


### Task 4.2 (Optional but good practice): Create a new Validation Definition for clarity.

In [30]:
# While we could reuse `spring_validation_definition`, creating a new one makes it clear
# that this validation run uses the *updated* suite.
validation_name_updated = "spring_data_validation_updated_" + str(uuid.uuid4())

### SOLUTION_START ###
spring_validation_definition_updated = gx.ValidationDefinition(
    name=validation_name_updated, # New unique name
    data=spring_2011_batch_definition, # Still using the Spring data batch definition
    suite=initial_expectation_suite, # Pointing to the *updated* suite object
)
### SOLUTION_END ###

# Add the new validation definition to the context.
spring_validation_definition_updated = gx_context.validation_definitions.add(
    spring_validation_definition_updated
)

In [31]:
# Run the new Validation Definition.
# This explicitly runs the updated suite against the whole table, which now holds Spring and Summer data.
### SOLUTION_START ###
results = spring_validation_definition_updated.run()
print(results)
### SOLUTION_END ###

# Hint: Check the output JSON. You should see the 'expect_column_values_to_be_in_set' 
# expectation now lists ["Spring", "Summer"] in its configuration, and it should succeed.


{
  "success": true,
  "results": [
    {
      "success": true,
      "expectation_config": {
        "type": "expect_column_values_to_not_be_null",
        "kwargs": {
          "batch_id": "sqlite_datasource-bike_rental_asset",
          "column": "total"
        },
        "meta": {},
        "id": "bb330585-6e0c-41fd-bc96-8eb85bafa893"
      },
      "result": {
        "element_count": 4443,
        "unexpected_count": 0,
        "unexpected_percent": 0.0,
        "partial_unexpected_list": [],
        "partial_unexpected_counts": []
      },
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_traceback": null,
        "exception_message": null
      }
    },
    {
      "success": true,
      "expectation_config": {
        "type": "expect_column_values_to_not_be_null",
        "kwargs": {
          "batch_id": "sqlite_datasource-bike_rental_asset",
          "column": "dteday"
        },
        "meta": {},
        "id": "62cc7baf-2

In [32]:
# Rebuild and view Data Docs to see the history.

# Build the data docs again. This incorporates the latest validation run.
data_doc_site_path = gx_context.build_data_docs()['local_site']

In [33]:
# Serve the updated Data Docs on a different port (e.g., 8005).
server_instance, server_thread = serve_docs(data_doc_site_path, port=8005, open_browser=False)
# Explore the Data Docs again. Notice the new validation run listed.
# You can compare runs and see how the suite definition has changed over time.

Serving directory '/private/var/folders/6b/st166m7n5b91t44xhm_1s5sm0000gn/T/tmpystfn99w' containing 'index.html'
Access documentation at: http://localhost:8005/
Server running on port 8005. Run stop_server(server, thread) to stop.


In [34]:
# Stop the Data Docs server when finished.
if 'server_instance' in locals() and server_instance:
    stop_server(server_instance, server_thread)
    del server_instance
    del server_thread
else:
    print("Server was not started or already stopped.")

Shutting down server on port 8005...
Server on port 8005 stopped and port freed.


## Task 5: Validating Seasonal Expectations & Monthly Batches

Now, let's load data for the next season (Fall 2011) into our database. We'll then explore defining expectations on specific subsets (batches) of data, such as focusing only on a particular month.

In [35]:
# Load Fall 2011 data into the database.
# The `database.set_step(2)` utility function handles adding the data for this season.
database.set_step(2)

2025-06-24 11:23:21.811 | INFO     | utils.database:set_step:221 - Processing directory: 0_spring_2011
2025-06-24 11:23:21.812 | DEBUG    | utils.database:set_step:228 - Skipping 0_create_table.sql (already applied)
2025-06-24 11:23:21.813 | DEBUG    | utils.database:set_step:228 - Skipping 1_bike_rental_2011_spring.sql (already applied)
2025-06-24 11:23:21.814 | INFO     | utils.database:set_step:221 - Processing directory: 1_summer_2011
2025-06-24 11:23:21.816 | DEBUG    | utils.database:set_step:228 - Skipping 0_bike_rental_2011_summer.sql (already applied)
2025-06-24 11:23:21.817 | INFO     | utils.database:set_step:221 - Processing directory: 2_fall

True

In [36]:
# Define a new Batch Definition and Batch for the updated data.
# Since the underlying table 'bike_rental' now contains more data (Spring, Summer, Fall),
# we define a new batch definition to represent this current state.
fall_2011_batch_definition = table_data_asset.add_batch_definition_whole_table(
    name="data_through_fall_2011", # Represents data up to and including Fall 2011
)

# Get the corresponding Batch object containing Spring, Summer, and Fall data.
fall_2011_batch = fall_2011_batch_definition.get_batch()

### Task 5.1: Analyze monthly trends in bike rentals.

In [37]:
# Before setting month-specific expectations, let's see how the average
# total rentals ('total') varies by month ('mnth') in the data loaded so far.
# This helps understand seasonal patterns.

### SOLUTION_START ###
# SQL query to calculate the average 'total' rentals, grouped by month.
query = """
SELECT mnth, avg(total)
FROM bike_rental
GROUP BY mnth
ORDER BY mnth -- Optional: order by month for easier reading
"""
### SOLUTION_END ###

# Execute the query and display the monthly averages.
print(pd.read_sql_query(query, conn))

   mnth  avg(total)
0     3   87.842308
1     4  131.947149
2     5  182.555108
3     6  199.204420
4     7  189.974462
5     8  186.991792
6     9  177.576438
7    10  166.232840
8    11  142.095967
9    12  135.277083


**Introducing Monthly Batch Definitions:**
Great Expectations allows defining batches based on data splitting criteria, such as time columns. This is useful for validating data incrementally (e.g., daily, monthly) or applying different rules to different periods. We'll create a *monthly* batch definition based on the 'dteday' column.

In [38]:
# Create a dedicated Expectation Suite for October data.
# Sometimes, we might want specific checks only for certain periods.
# Let's create a new suite focused on validating October (month=10) data.

# Define a unique name for the October suite.
october_suite_name = "october_rules_suite_" + str(uuid.uuid4())

# Create the expectation: Check if the month ('mnth' column) is indeed 10.
# This acts as a sanity check for the batching mechanism.
expect_month_is_october = gx.expectations.ExpectColumnValuesToBeInSet(
    column='mnth',
    value_set=[10], # Only allow month 10
)

# Suggestion: a more meaningful October check would bound the mean of 'total'.
# The previous query showed an average of about 166 for month 10.
# expect_oct_mean = gx.expectations.ExpectColumnMeanToBeBetween(
#     column='total',
#     min_value=166 * 0.8, # e.g., 80% of observed mean
#     max_value=166 * 1.2, # e.g., 120% of observed mean
# )

# Create the new suite.
expectation_suite_october = gx.ExpectationSuite(name=october_suite_name)

# Add the month check expectation to this suite.
expectation_suite_october.add_expectation(expect_month_is_october)
# expectation_suite_october.add_expectation(expect_oct_mean) # Add the mean check if desired

# Add and save the October-specific suite to the context.
expectation_suite_october = gx_context.suites.add(expectation_suite_october)
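
The ±20% band in the commented-out mean check can be derived from the data rather than typed by hand. The sketch below does this against a small in-memory SQLite table. The table contents, the helper name `mean_bounds`, and the `tolerance` default are illustrative assumptions, not part of the workshop utilities or data.

```python
import sqlite3

# Illustrative stand-in for the workshop table (hypothetical values).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE bike_rental (mnth INTEGER, total INTEGER)")
demo_conn.executemany(
    "INSERT INTO bike_rental VALUES (?, ?)",
    [(10, 150), (10, 170), (10, 190), (11, 120), (11, 140)],
)


def mean_bounds(conn, month, tolerance=0.2):
    """Return (min_value, max_value) as a band around the observed monthly mean."""
    (observed_mean,) = conn.execute(
        "SELECT AVG(total) FROM bike_rental WHERE mnth = ?", (month,)
    ).fetchone()
    return observed_mean * (1 - tolerance), observed_mean * (1 + tolerance)


min_value, max_value = mean_bounds(demo_conn, month=10)
print(f"October band: {min_value:.1f} to {max_value:.1f}")
```

The two values could then be passed as `min_value` and `max_value` to `ExpectColumnMeanToBeBetween`, as in the commented-out code. How wide the band should be is a judgment call that depends on how much month-to-month variation is acceptable.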

### Task 5.2: Create a Monthly Batch Definition.

In [39]:
# We'll use a different data asset name here ('rentals_monthly_asset')
# to keep this batching strategy separate, although it points to the same table.
monthly_table_asset = sqlite_data_source.add_table_asset(
    name="rentals_monthly_asset", table_name=database_table_name
)

# Create a batch definition that splits the data by year and month based on the 'dteday' column.
### SOLUTION_START ###
batch_definition_monthly = monthly_table_asset.add_batch_definition_monthly(
    name="monthly_batches", # Name for this batching configuration
    column="dteday",        # The date/timestamp column to use for splitting
)
### SOLUTION_END ###

In [40]:
# Use the workshop's checker utility.
check_solution(task=8, result=batch_definition_monthly)

2025-06-24 11:23:22.207 | SUCCESS  | utils.checker:check_solution:214 - Great job! The result of the expectation is correct. Continue with the next task.


In [41]:
# List the available batch identifiers (year, month combinations) found in the data.
# This shows how GX has identified the distinct months present.
batch_definition_monthly.get_batch_identifiers_list()

[{'year': None, 'month': None},
 {'year': 2011, 'month': 3},
 {'year': 2011, 'month': 4},
 {'year': 2011, 'month': 5},
 {'year': 2011, 'month': 6},
 {'year': 2011, 'month': 7},
 {'year': 2011, 'month': 8},
 {'year': 2011, 'month': 9},
 {'year': 2011, 'month': 10},
 {'year': 2011, 'month': 11},
 {'year': 2011, 'month': 12}]
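
The first entry, `{'year': None, 'month': None}`, is what one would expect if some rows have no usable `dteday` value, although the notebook does not confirm this. The sketch below reproduces the effect on a throwaway in-memory SQLite table with invented rows, using `strftime` to derive year and month. It illustrates the idea of a monthly partition only; it is not how Great Expectations implements partitioning internally.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE bike_rental (dteday TEXT, total INTEGER)")
demo_conn.executemany(
    "INSERT INTO bike_rental VALUES (?, ?)",
    [
        ("2011-10-01", 120),
        ("2011-10-02", 130),
        ("2011-11-01", 90),
        (None, 75),  # a row with a missing date (hypothetical)
    ],
)

# Derive the distinct (year, month) pairs present in the date column.
rows = demo_conn.execute(
    """
    SELECT DISTINCT
        CAST(strftime('%Y', dteday) AS INTEGER) AS year,
        CAST(strftime('%m', dteday) AS INTEGER) AS month
    FROM bike_rental
    ORDER BY year, month
    """
).fetchall()

identifiers = [{"year": year, "month": month} for year, month in rows]
print(identifiers)
```

Counting the rows where `dteday IS NULL` in the real table would show whether this explanation applies to the workshop data.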

In [42]:
# Create a Validation Definition for the October data.
# This definition links our *monthly batch definition* to the *October-specific suite*.
october_validation_definition = gx.ValidationDefinition(
    name="validate_october_data",
    data=batch_definition_monthly,     # Use the monthly batch definition
    suite=expectation_suite_october, # Use the suite with October rules
)

# In a persistent setup, you would typically add this definition to the context.
# We skip that step here because the definition is run immediately in the next cell.
# october_validation_definition = gx_context.validation_definitions.add(october_validation_definition)

In [43]:
# Run validation specifically for October 2011.
# Because the batch definition partitions the data by month,
# we pass `batch_parameters` to select the specific batch to validate.
validation_result_october = october_validation_definition.run(
    batch_parameters={"year": 2011, "month": 10} # Select only Oct 2011 data
)

# Print the success status of the October validation.
validation_result_october["success"]

Calculating Metrics:   0%|          | 0/13 [00:00<?, ?it/s]

True

## Task 6: Adding Custom Expectations

Great Expectations has a rich library of built-in expectations ([Expectation Gallery](https://greatexpectations.io/expectations/)), but sometimes you need checks specific to your business logic that aren't covered. You can create custom expectations, often by leveraging SQL queries.

First, let's load the Winter 2011 data and then define a custom expectation.

In [44]:
# Load Winter 2011 data.
# `database.set_step(3)` adds the winter data to the existing table.
database.set_step(3)

# Verify that all four seasons are now present in the 'season' column.
query = """
SELECT DISTINCT season
FROM bike_rental
ORDER BY season -- Optional ordering
"""
# Execute the query
pd.read_sql_query(query, conn)

2025-06-24 11:23:22.556 | INFO     | utils.database:set_step:221 - Processing directory: 0_spring_2011
2025-06-24 11:23:22.564 | DEBUG    | utils.database:set_step:228 - Skipping 0_create_table.sql (already applied)
2025-06-24 11:23:22.565 | DEBUG    | utils.database:set_step:228 - Skipping 1_bike_rental_2011_spring.sql (already applied)
2025-06-24 11:23:22.571 | INFO     | utils.database:set_step:221 - Processing directory: 1_summer_2011
2025-06-24 11:23:22.573 | DEBUG    | utils.database:set_step:228 - Skipping 0_bike_rental_2011_summer.sql (already applied)
2025-06-24 11:23:22.574 | INFO     | utils.database:set_step:221 - Processing directory: 2_fall

   season
0    Fall
1  Spring
2  Summer
3  Winter


In [45]:
# Review existing expectations in our main suite.
# Let's programmatically list the expectations currently in `initial_expectation_suite`.
print("Current expectations in the suite:")
for expectation_config in initial_expectation_suite.expectations:
    # Each column expectation exposes `expectation_type` and `column` as attributes.
    print(f"- Expectation Type: '{expectation_config.expectation_type}', Column: '{expectation_config.column}'")

Current expectations in the suite:
- Expectation Type: 'expect_column_values_to_not_be_null', Column: 'total'
- Expectation Type: 'expect_column_values_to_not_be_null', Column: 'dteday'
- Expectation Type: 'expect_column_values_to_be_in_set', Column: 'season'


### Task 6.1: Define a custom expectation using SQL.

In [46]:
# We want to ensure data integrity by checking if the sum of 'casual' and 'registered'
# always equals the 'total' column.
# We can use `UnexpectedRowsExpectation`, which expects a SQL query that returns *rows that violate* the condition.
# The expectation passes if the query returns zero rows.

### SOLUTION_START ###
# The SQL query selects rows where the sum does NOT match the total.
# Note the use of '{batch}' - Great Expectations replaces this placeholder
# with the appropriate table/query representing the current batch during validation.
sum_check_query ="""
SELECT *
FROM {batch}
WHERE casual + registered != total
"""
### SOLUTION_END ###

# Instantiate the custom expectation.
expect_casual_registered_sum_total = gx.expectations.UnexpectedRowsExpectation(
    unexpected_rows_query=sum_check_query,
    description="Check that the sum of casual and registered riders equals the total count", # Optional: description for Data Docs
)

# Get a batch representing the data up to Winter 2011
# We need a batch definition that covers all the data currently loaded.
winter_2011_batch_definition = table_data_asset.add_batch_definition_whole_table(
    name="data_through_winter_2011"
)
winter_2011_batch = winter_2011_batch_definition.get_batch()

# Validate this single custom expectation against the full 2011 data.
result = winter_2011_batch.validate(
    expect_casual_registered_sum_total,
    result_format="COMPLETE",
)

# You can inspect the 'result' object to see if any unexpected rows were found.
print(result)

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

{
  "success": true,
  "expectation_config": {
    "type": "unexpected_rows_expectation",
    "kwargs": {
      "batch_id": "sqlite_datasource-bike_rental_asset",
      "unexpected_rows_query": "\nSELECT *\nFROM {batch}\nWHERE casual + registered != total"
    },
    "meta": {},
    "description": "Check that the sum of casual and registered riders equals the total count"
  },
  "result": {
    "observed_value": 0,
    "details": {
      "unexpected_rows": []
    }
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
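
To see what a failing run would look like, the same idea can be reproduced without Great Expectations. The sketch below runs the violation query against an in-memory SQLite table with one deliberately inconsistent row. The `{batch}` substitution is mimicked with `str.format`; how Great Expectations performs it internally is not shown in this notebook, and the sample rows are invented.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE bike_rental (casual INTEGER, registered INTEGER, total INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO bike_rental VALUES (?, ?, ?)",
    [(3, 13, 16), (8, 32, 40), (5, 27, 30)],  # last row: 5 + 27 != 30
)

sum_check_query = """
SELECT *
FROM {batch}
WHERE casual + registered != total
"""

# Stand-in for the placeholder substitution: point {batch} at the table.
unexpected_rows = demo_conn.execute(
    sum_check_query.format(batch="bike_rental")
).fetchall()

# The check passes only when the violation query returns no rows.
success = len(unexpected_rows) == 0
print(success, unexpected_rows)
```

One caveat of this query shape: in SQL, a comparison involving NULL is neither true nor false, so a row where any of the three columns is NULL would not be returned and would not count as a violation. Separate not-null checks on those columns close that gap.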


In [47]:
# Use the workshop's checker utility.
check_solution(task=9, result=result)

2025-06-24 11:23:22.886 | SUCCESS  | utils.checker:check_solution:214 - Great job! The result of the expectation is correct. Continue with the next task.


In [48]:
# Add the custom expectation to our main suite.
# This ensures the check is included in future validation runs of this suite.
initial_expectation_suite.add_expectation(expect_casual_registered_sum_total)

UnexpectedRowsExpectation(id='017b966d-7902-4515-8881-c85f1a49aa72', meta=None, notes=None, result_format=<ResultFormat.BASIC: 'BASIC'>, description='Check that the sum of casual and registered riders equals the total count', catch_exceptions=False, rendered_content=None, windows=None, batch_id=None, unexpected_rows_query='\nSELECT *\nFROM {batch}\nWHERE casual + registered != total')

## Task 7: Validating Future Data (Year 2012)

A key use case for data quality testing is detecting **data drift** or unexpected changes when new data arrives. Let's load the data for the entire year 2012 and run our established Expectation Suite against it. Will our assumptions from 2011 still hold?

In [49]:
# Load the full dataset for the year 2012.
# `database.set_step(4)` adds all data from 2012 (all seasons) to the table.
database.set_step(4)

2025-06-24 11:23:23.050 | INFO     | utils.database:set_step:221 - Processing directory: 0_spring_2011
2025-06-24 11:23:23.052 | DEBUG    | utils.database:set_step:228 - Skipping 0_create_table.sql (already applied)
2025-06-24 11:23:23.052 | DEBUG    | utils.database:set_step:228 - Skipping 1_bike_rental_2011_spring.sql (already applied)
2025-06-24 11:23:23.053 | INFO     | utils.database:set_step:221 - Processing directory: 1_summer_2011
2025-06-24 11:23:23.054 | DEBUG    | utils.database:set_step:228 - Skipping 0_bike_rental_2011_summer.sql (already applied)
2025-06-24 11:23:23.054 | INFO     | utils.database:set_step:221 - Processing directory: 2_fall

True

In [50]:
# Validate the full 2011+2012 dataset against our main suite.

# Create a batch definition representing the entire table, now containing 2011 and 2012 data.
# We use the original 'table_data_asset' defined earlier.
full_dataset_batch_definition = table_data_asset.add_batch_definition_whole_table(
    name="full_dataset_2011_2012",
)

# Create a Validation Definition linking this full dataset batch to our main suite.
full_data_validation_definition = gx.ValidationDefinition(
    name="validate_full_dataset",
    data=full_dataset_batch_definition,
    suite=initial_expectation_suite, # Use the suite containing all our general rules
)

# Run the validation.
validation_results_full = full_data_validation_definition.run(result_format="COMPLETE")

# Print the results.
# Pay close attention to the 'success' field and individual expectation results.
print(validation_results_full)

Calculating Metrics:   0%|          | 0/31 [00:00<?, ?it/s]

{
  "success": false,
  "results": [
    {
      "success": true,
      "expectation_config": {
        "type": "expect_column_values_to_not_be_null",
        "kwargs": {
          "batch_id": "sqlite_datasource-bike_rental_asset",
          "column": "total"
        },
        "meta": {},
        "id": "bb330585-6e0c-41fd-bc96-8eb85bafa893"
      },
      "result": {
        "element_count": 17578,
        "unexpected_count": 0,
        "unexpected_percent": 0.0,
        "partial_unexpected_list": [],
        "partial_unexpected_counts": [],
        "unexpected_list": [],
        "unexpected_index_query": "SELECT total \nFROM bike_rental \nWHERE total IS NULL;"
      },
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_traceback": null,
        "exception_message": null
      }
    },
    {
      "success": true,
      "expectation_config": {
        "type": "expect_column_values_to_not_be_null",
        "kwargs": {
          "batch_id":

## Task 8: Data Quality for AI/ML Models

While the previous checks ensure general data integrity, AI/ML models often have specific requirements. Features need to be within expected ranges, categorical values must be consistent, and relationships between columns (like `weekday` and `workingday`) must hold true for the model to learn meaningful patterns.

In this section, we'll:
1.  Train a simple bike rental prediction model on the *current* dataset (potentially containing errors).
2.  Define a new Expectation Suite (`ml_suite`) with checks specifically relevant to the model's features (e.g., valid ranges for `hour`, `temp`, `humidity`, consistency between `weekday` and `workingday`).
3.  Validate the full dataset against this `ml_suite` to identify issues.
4.  Clean the data based on the validation failures.
5.  Retrain the model on the *cleaned* data and observe the potential improvement in performance.

This demonstrates how Great Expectations can be a crucial step in preparing data for reliable model training.

In [51]:
# Import the model training utility
from utils.model import train_and_evaluate_model

### Train Initial Model
First, let's train a baseline model on the data as it currently exists in the database (including 2011 and 2012).

In [52]:
df = pd.read_sql_query("SELECT * FROM bike_rental", conn)
print("Training initial model on potentially unvalidated data...")
train_and_evaluate_model(df)

Training initial model on potentially unvalidated data...
Training the model...
Model Evaluation:
Mean Absolute Error: 26.58 bike rentals
R² Score: 0.9396

Top 10 Most Important Features:
hour: 0.6008
temp: 0.1402
year: 0.0779
workingday: 0.0617
humidity: 0.0303
mnth: 0.0171
weather_Bad: 0.0161
season_Winter: 0.0132
windspeed: 0.0111
season_Fall: 0.0066


### Task 8.1: Define AI/ML-Specific Expectations
Now, let's create a list of expectations tailored for the features used by our model. These include checking valid ranges for numerical features and consistency between related categorical/binary features.

In [53]:
expectations = [
    ### SOLUTION_START ###
    # Expect hour to be between 0 and 23 (inclusive).
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="hour",
        min_value=0,
        max_value=23,
    ),
    # Expect temperature to be within a plausible range (e.g., -40 to 60 Celsius).
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="temp",
        min_value=-40, # Adjusted for C
        max_value=60,  # Adjusted for C
    ),
    # Expect humidity to be between 0 and 100 (inclusive).
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="humidity",
        min_value=0,
        max_value=100,
    ),
    # Custom Check: Ensure 'workingday' is not 1 (True) on weekends.
    # If workingday is 1, weekday should NOT be Saturday or Sunday.
    # This query finds rows that VIOLATE this condition.
    gx.expectations.UnexpectedRowsExpectation(
        unexpected_rows_query="""
        SELECT *
        FROM {batch}
        WHERE (workingday = 1 AND weekday IN ('Saturday', 'Sunday'))
        """,
        description="Working day flag should be 0 on weekends."
    ),
    # Custom Check: Ensure 'workingday' is 1 on weekdays, unless it's a holiday.
    # If workingday is 0, it should either be a weekend OR a holiday.
    # This query finds rows where workingday is 0, it's NOT a weekend, AND it's NOT a holiday (VIOLATION).
    gx.expectations.UnexpectedRowsExpectation(
        unexpected_rows_query="""
        SELECT *
        FROM {batch}
        WHERE (workingday = 0 AND weekday NOT IN ('Saturday', 'Sunday') AND holiday = 0)
        """,
        description="Non-working day flag should only be set on weekends or holidays."
    ),
    ### SOLUTION_END ###
]


# Create the ML-specific Expectation Suite
ml_suite = gx.ExpectationSuite(
    name="bike_rental_ml_suite_" + str(uuid.uuid4()),
    expectations=expectations,
)
# Add the suite to the context
ml_suite = gx_context.suites.add(ml_suite)
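
The two `workingday` consistency rules can be checked for coverage by enumerating every combination of the three flags. The sketch below restates the two SQL predicates as plain Python functions (the function names are made up for illustration) and lists the combinations that neither rule flags.

```python
from itertools import product


def violates_rule_1(workingday, is_weekend, holiday):
    # Rule 1: workingday = 1 on a Saturday or Sunday is a violation.
    return workingday == 1 and is_weekend


def violates_rule_2(workingday, is_weekend, holiday):
    # Rule 2: workingday = 0 on a weekday that is not a holiday is a violation.
    return workingday == 0 and not is_weekend and holiday == 0


unflagged = []
for workingday, is_weekend, holiday in product([0, 1], [False, True], [0, 1]):
    flagged = violates_rule_1(workingday, is_weekend, holiday) or violates_rule_2(
        workingday, is_weekend, holiday
    )
    if not flagged:
        unflagged.append((workingday, is_weekend, holiday))

for combo in unflagged:
    print("passes both rules:", combo)
```

One combination that passes both rules is `(1, False, 1)`: a weekday holiday marked as a working day. Whether that should count as an error depends on how `workingday` is defined for this dataset, which the notebook does not state. If it should, a third violation query along the same lines would be needed.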

### Validate Data with ML Suite
Let's create a validation definition linking our full dataset batch to this new ML-focused suite and run the validation.
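
The printed result of such a run is a long nested structure. The sketch below condenses it into a list of failed expectations, assuming a plain dictionary with the same keys as the output shown in this notebook (`success`, `results`, `expectation_config`, `result`). The sample dictionary is hand-written and the helper name `summarize_failures` is hypothetical; converting a real result object into a dictionary is not covered here.

```python
def summarize_failures(validation_result):
    """Return one small dict per failed expectation."""
    failures = []
    for item in validation_result["results"]:
        if item["success"]:
            continue
        config = item["expectation_config"]
        failures.append(
            {
                "type": config["type"],
                # Query-based expectations have no 'column' kwarg, hence .get().
                "column": config["kwargs"].get("column"),
                "unexpected_count": item["result"].get("unexpected_count"),
            }
        )
    return failures


# Hand-written sample in the shape of the notebook output (values illustrative).
sample_result = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "total"},
            },
            "result": {"unexpected_count": 0},
        },
        {
            "success": False,
            "expectation_config": {
                "type": "expect_column_values_to_be_between",
                "kwargs": {"column": "hour", "min_value": 0, "max_value": 23},
            },
            "result": {"unexpected_count": 62},
        },
        {
            "success": False,
            "expectation_config": {
                "type": "unexpected_rows_expectation",
                "kwargs": {},
            },
            "result": {"observed_value": 3},
        },
    ],
}

failures = summarize_failures(sample_result)
for failure in failures:
    print(failure)
```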

In [54]:
# Create the validation definition using the full dataset batch and the ML suite
validation_definition = gx.ValidationDefinition(
    name="ml_validation_definition_" + str(uuid.uuid4()),
    data=full_dataset_batch_definition,
    suite=ml_suite
)
# Add the validation definition to the context
validation_definition = gx_context.validation_definitions.add(validation_definition)

# Run the validation definition and analyze the results.
ml_validation_results = validation_definition.run(result_format="COMPLETE")
print(ml_validation_results)

Calculating Metrics:   0%|          | 0/35 [00:00<?, ?it/s]

{
  "success": false,
  "results": [
    {
      "success": false,
      "expectation_config": {
        "type": "expect_column_values_to_be_between",
        "kwargs": {
          "batch_id": "sqlite_datasource-bike_rental_asset",
          "column": "hour",
          "min_value": 0.0,
          "max_value": 23.0
        },
        "meta": {},
        "id": "4562eeee-c266-40f2-8e09-71ca6784cf89"
      },
      "result": {
        "element_count": 17578,
        "unexpected_count": 62,
        "unexpected_percent": 0.35271361929684836,
        "partial_unexpected_list": [
          25,
          -2,
          24,
          -1,
          26,
          -3,
          27,
          -4,
          25,
          -5,
          30,
          -2,
          28,
          -6,
          29,
          -7,
          26,
          -8,
          31,
          -9
        ],
        "missing_count": 0,
        "missing_percent": 0.0,
        "unexpected_percent_total": 0.35271361929684836,
        "unexp

### Task 8.2: Clean Data Based on Validation Results
The validation results likely highlighted rows violating our ML-specific expectations (e.g., incorrect `workingday` flags, potentially invalid `hour`, `temp`, or `humidity` values if errors were present). Based on these findings, we define a SQL query to remove the offending rows. *Note: In a real-world scenario, you might investigate these rows further or apply transformations instead of outright deletion.*
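
Before deleting anything, it can help to count how many rows each criterion would remove. The sketch below does this on an in-memory SQLite table with invented rows, using the same predicates as the cleaning step. The criterion names in the dictionary are labels chosen for this example.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    """
    CREATE TABLE bike_rental (
        hour INTEGER, temp REAL, humidity REAL,
        workingday INTEGER, weekday TEXT, holiday INTEGER
    )
    """
)
demo_conn.executemany(
    "INSERT INTO bike_rental VALUES (?, ?, ?, ?, ?, ?)",
    [
        (8, 21.0, 55.0, 1, "Monday", 0),       # clean row
        (25, 18.0, 60.0, 1, "Tuesday", 0),     # invalid hour
        (10, 19.0, 130.0, 1, "Wednesday", 0),  # invalid humidity
        (12, 22.0, 40.0, 1, "Saturday", 0),    # working day on a weekend
        (-2, 20.0, 50.0, 0, "Thursday", 0),    # invalid hour and wrong flag
    ],
)

# Fixed, trusted predicates (not user input), so string formatting is acceptable here.
criteria = {
    "invalid_hour": "hour < 0 OR hour > 23",
    "invalid_temp": "temp < -40 OR temp > 60",
    "invalid_humidity": "humidity < 0 OR humidity > 100",
    "workingday_on_weekend": "workingday = 1 AND weekday IN ('Saturday', 'Sunday')",
    "nonworking_plain_weekday": (
        "workingday = 0 AND weekday NOT IN ('Saturday', 'Sunday') AND holiday = 0"
    ),
}

counts = {
    name: demo_conn.execute(
        f"SELECT COUNT(*) FROM bike_rental WHERE {predicate}"
    ).fetchone()[0]
    for name, predicate in criteria.items()
}

for name, count in counts.items():
    print(f"{name}: {count}")
```

A single row can match more than one criterion, so the per-criterion counts may add up to more than the number of rows a combined `DELETE` removes.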

In [55]:
# Based on the validation results, we define a query to remove rows failing the checks.

### SOLUTION_START ###
query = """
DELETE FROM bike_rental
WHERE
    -- Criterion 1: Invalid hour (outside 0-23)
    (hour < 0 OR hour > 23)

    -- Criterion 2: Unrealistic temperature (adjust bounds as needed)
    OR (temp < -40 OR temp > 60)

    -- Criterion 3: Invalid humidity (outside 0-100)
    OR (humidity < 0 OR humidity > 100)

    -- Criterion 4: Incorrect workingday flag for weekends (workingday=1 on Sat/Sun)
    OR (workingday = 1 AND weekday IN ('Saturday', 'Sunday'))

    -- Criterion 5: Incorrect workingday flag for weekdays (workingday=0 on non-holiday weekday)
    OR (workingday = 0 AND weekday NOT IN ('Saturday', 'Sunday') AND holiday = 0)

-- Optional: In a pipeline, you might only target newly added data or specific batches.
"""
### SOLUTION_END ###

# Execute the DELETE query using the database connection.
# Using 'with conn:' commits the transaction on success and rolls it back if an error occurs.
print("Cleaning data based on ML validation results...")
with conn:
    cursor = conn.execute(query)
    print(f"{cursor.rowcount} rows removed.")

Cleaning data based on ML validation results...
199 rows removed.
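
The `with conn:` pattern relies on `conn` being a `sqlite3.Connection`, which this section does not show explicitly but the SQLite data source suggests. Under that assumption, the sketch below demonstrates both outcomes on a throwaway in-memory database: a commit when the block succeeds and a rollback when it raises.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE bike_rental (hour INTEGER)")
demo_conn.executemany("INSERT INTO bike_rental VALUES (?)", [(5,), (25,), (-1,)])
demo_conn.commit()

# Successful block: the DELETE is committed when the block exits.
with demo_conn:
    cursor = demo_conn.execute("DELETE FROM bike_rental WHERE hour > 23")
    removed = cursor.rowcount

# Failing block: the exception triggers a rollback, so this DELETE is undone.
try:
    with demo_conn:
        demo_conn.execute("DELETE FROM bike_rental")
        raise RuntimeError("simulated failure after the DELETE")
except RuntimeError:
    pass

remaining = demo_conn.execute("SELECT COUNT(*) FROM bike_rental").fetchone()[0]
print(f"removed={removed}, remaining={remaining}")
```

Note that the context manager only manages the transaction; it does not close the connection.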


### Retrain Model on Cleaned Data
Finally, we fetch the cleaned data and retrain our model. Compare the evaluation metrics (Mean Absolute Error and R² score) with the initial model's performance to see the impact of the data quality improvements.

In [56]:
# Fetch the cleaned data
df_cleaned = pd.read_sql_query("SELECT * FROM bike_rental", conn)

# Retrain and evaluate the model on the cleaned dataset
print("\nTraining model on cleaned data...")
train_and_evaluate_model(df_cleaned)

print("\nCompare the MAE (lower is better) between the initial and retrained models.")


Training model on cleaned data...
Training the model...
Model Evaluation:
Mean Absolute Error: 26.00 bike rentals
R² Score: 0.9451

Top 10 Most Important Features:
hour: 0.6067
temp: 0.1393
year: 0.0790
workingday: 0.0619
humidity: 0.0283
mnth: 0.0173
weather_Bad: 0.0147
season_Winter: 0.0126
windspeed: 0.0103
season_Fall: 0.0067

Compare the MAE (lower is better) between the initial and retrained models.
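
To put the two runs side by side, the comparison can be worked out from the numbers printed in this notebook: the MAE and R² of both runs, the 17578 rows validated, and the 199 rows removed. The variable names below are chosen for this example.

```python
# Metrics copied from the two training runs reported in this notebook.
initial = {"mae": 26.58, "r2": 0.9396}
cleaned = {"mae": 26.00, "r2": 0.9451}

rows_before = 17578  # element_count reported by the validation run
rows_removed = 199   # rowcount reported by the DELETE

mae_reduction_pct = (initial["mae"] - cleaned["mae"]) / initial["mae"] * 100
r2_gain = cleaned["r2"] - initial["r2"]
removed_pct = rows_removed / rows_before * 100

print(f"MAE reduction: {mae_reduction_pct:.2f}%")
print(f"R² gain: {r2_gain:.4f}")
print(f"Rows removed: {removed_pct:.2f}% of the table")
```

Removing roughly 1% of the rows coincides with an MAE reduction of about 2%. A single pair of runs cannot show whether that difference exceeds run-to-run variation, so repeating the training several times would give a firmer comparison.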
