# Running Data Quality tests for tables in OpenMetadata

In this notebook we will join two data sources and load the result into our `Tutorial Postgres.raw.public.taxi_yellow` table.

We will be using the following two assets from the [NYC Yellow Taxi Ride Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page):
- [Yellow Taxi Ride for September 2025 (parquet)](https://python-sdk-examples.s3.eu-west-3.amazonaws.com/data-quality/yellow_tripdata_2025-09.parquet)
- [Taxi Zones Lookup (csv)](https://python-sdk-examples.s3.eu-west-3.amazonaws.com/data-quality/taxi_zone_lookup.csv)

## Purpose
We want to showcase how we can leverage OpenMetadata's data quality mechanisms directly from code. For that, we're simulating a very simple ETL that builds the data for which we have set up data quality tests in the [given instructions](/lab/tree/README.md).

## Description of the ETL
The Yellow Taxi Ride dataset contains two columns, Pickup Location ID (`PULocationID`) and Dropoff Location ID (`DOLocationID`), which identify the zone where each ride starts and ends. A ride can start or end in any of those zones, but we only want the rides that never leave the yellow area. The Taxi Zones Lookup dataset maps each zone ID to its service zone (e.g. `Yellow Zone`).

Our ETL will join the two data sources and keep only the rides whose pickup and dropoff location IDs both belong to a yellow zone. Since we only want a subset, we will load just the first 10,000 rows into our table.

Once we've loaded the results to the destination table, we will use the [`openmetadata-ingestion`](https://pypi.org/project/openmetadata-ingestion/) library to run the Data Quality tests we have defined in [OpenMetadata](http://localhost:8585/table/Tutorial%20Postgres.raw.public.taxi_yellow/profiler/data-quality).

## Dependencies
For our ETL we will use PyArrow to load the Parquet file, pandas DataFrames to work with the taxi rides and taxi zones, and [`openmetadata-ingestion`](https://pypi.org/project/openmetadata-ingestion/) to run the data quality tests. Since we're using Postgres as the database for our fake Data Warehouse, we also need to install the dependencies for the OpenMetadata [Postgres Connector](https://docs.open-metadata.org/latest/connectors/database/postgres). We will also need SQLAlchemy, which is installed by default with `openmetadata-ingestion`.

We can install all these dependencies by specifying the right extras. A full list can be found in the project's [`setup.py`](https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/setup.py); check it out if your installation differs from the example below.

## Requirements
If you haven't already, please follow the [setup](/lab/tree/README.md#setup) steps in the README.

For this example you will need:

- An OpenMetadata instance running (achieved by following the setup instructions above)
- A bot JWT token. You can get one by using the token of the [Ingestion Bot](http://localhost:8585/bots/ingestion-bot) in your OpenMetadata instance
- [`openmetadata-ingestion`](https://pypi.org/project/openmetadata-ingestion/) version 1.11.0.0 or above (installed in this Notebook)

In [1]:
!pip install "openmetadata-ingestion[pandas,pyarrow,postgres]>=1.11.0.0"

Obtaining file:///opt/openmetadata/ingestion
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting pyarrow~=16.0 (from openmetadata-ingestion==1.10.0.0.dev0)
  Downloading pyarrow-16.1.0-cp311-cp311-manylinux_2_28_aarch64.whl.metadata (3.0 kB)
Downloading pyarrow-16.1.0-cp311-cp311-manylinux_2_28_aarch64.whl (38.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38.1/38.1 MB 22.0 MB/s eta 0:00:00
Building wheels for collected packages: openmetadata-ingestion
  Building editable for openmetadata-ingestion (pyproject.toml) ... done
  Created wheel for openmetadata-ingestion: filename=openmetadata_ingestion-1.10.0.0.dev0-0.editable-py3-none-any.whl size=14132 sha256=5c3a6a7cd0b44a262ae2892074f397b31b22c32350458d
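
The output above comes from an editable install that reports version `1.10.0.0.dev0`, so it can be worth confirming which version is actually available before relying on the SDK. The following is a minimal sketch using only the standard library. `release_tuple` and `meets_minimum` are hypothetical helper names, and the comparison only looks at the leading numeric components of each version (a stricter check would use `packaging.version`).

```python
from importlib import metadata


def release_tuple(version: str) -> tuple:
    """Return the leading numeric components of a dotted version string."""
    parts = []
    for chunk in version.split("."):
        if not chunk.isdigit():
            break  # stop at suffixes such as "dev0"
        parts.append(int(chunk))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions by their numeric release components only."""
    return release_tuple(installed) >= release_tuple(minimum)


try:
    installed = metadata.version("openmetadata-ingestion")
    print(installed, meets_minimum(installed, "1.11.0.0"))
except metadata.PackageNotFoundError:
    print("openmetadata-ingestion is not installed in this environment")
```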

## Initial SDK setup
In this step we make sure our Python code is ready to work against OpenMetadata.

You will be prompted for the JWT token mentioned in the [requirements](#requirements) section.

In [2]:
from getpass import getpass

from metadata.sdk import configure

jwt_token = getpass("Please introduce a JWT token for authentication with OM")

configure(
    host="http://openmetadata_server:8585/api",
    jwt_token=jwt_token,
)

Please introduce a JWT token for authentication with OM ········


<metadata.sdk.client.OpenMetadata at 0xffff63bcda90>
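
When the notebook runs unattended, an interactive prompt is not an option. One possible approach, sketched below, reads the token from an environment variable and falls back to `getpass` only when it is missing. The variable name `OM_JWT_TOKEN` and the helper `resolve_jwt_token` are assumptions made for this example, not part of the SDK.

```python
import os
from getpass import getpass


def resolve_jwt_token(env=None, prompt=getpass) -> str:
    """Return the token from OM_JWT_TOKEN if set, otherwise prompt for it."""
    env = os.environ if env is None else env
    token = env.get("OM_JWT_TOKEN", "").strip()
    if token:
        return token
    return prompt("Please enter a JWT token for authentication with OM: ").strip()


# In the notebook this would feed the same call used above:
# configure(host="http://openmetadata_server:8585/api", jwt_token=resolve_jwt_token())
```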

## Implementation of the ETL

In [5]:
import pandas as pd

taxi_rides = pd.read_parquet("https://python-sdk-resources.s3.eu-west-3.amazonaws.com/data-quality/yellow_tripdata_2025-09.parquet")
taxi_zones = pd.read_csv("https://python-sdk-resources.s3.eu-west-3.amazonaws.com/data-quality/taxi_zone_lookup.csv")

In [6]:
taxi_rides.head()

|  | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | Airport_fee | cbd_congestion_fee |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 2025-09-01 00:19:20 | 2025-09-01 00:45:17 | 1.0 | 9.92 | 1.0 | N | 138 | 114 | 1 | 42.9 | 6.0 | 0.5 | 10.73 | 0.0 | 1.0 | 66.13 | 2.5 | 1.75 | 0.75 |
| 1 | 2 | 2025-09-01 00:15:20 | 2025-09-01 00:26:08 | 2.0 | 6.82 | 1.0 | N | 93 | 157 | 1 | 26.8 | 1.0 | 0.5 | 5.86 | 0.0 | 1.0 | 35.16 | 0.0 | 0.0 | 0.0 |
| 2 | 2 | 2025-09-01 00:06:07 | 2025-09-01 00:22:23 | 1.0 | 3.95 | 1.0 | N | 68 | 13 | 1 | 19.8 | 1.0 | 0.5 | 5.11 | 0.0 | 1.0 | 30.66 | 2.5 | 0.0 | 0.75 |
| 3 | 2 | 2025-09-01 00:49:47 | 2025-09-01 01:04:49 | 1.0 | 3.14 | 1.0 | N | 234 | 87 | 1 | 17.7 | 1.0 | 0.5 | 3.52 | 0.0 | 1.0 | 26.97 | 2.5 | 0.0 | 0.75 |
| 4 | 2 | 2025-09-01 00:05:00 | 2025-09-01 00:15:32 | 6.0 | 2.81 | 1.0 | N | 230 | 151 | 1 | 14.9 | 1.0 | 0.5 | 4.13 | 0.0 | 1.0 | 24.78 | 2.5 | 0.0 | 0.75 |


In [7]:
taxi_zones.head()

|  | LocationID | Borough | Zone | service_zone |
| --- | --- | --- | --- | --- |
| 0 | 1 | EWR | Newark Airport | EWR |
| 1 | 2 | Queens | Jamaica Bay | Boro Zone |
| 2 | 3 | Bronx | Allerton/Pelham Gardens | Boro Zone |
| 3 | 4 | Manhattan | Alphabet City | Yellow Zone |
| 4 | 5 | Staten Island | Arden Heights | Boro Zone |


In [8]:
# Check existing values.
taxi_zones["service_zone"].unique()

array(['EWR', 'Boro Zone', 'Yellow Zone', 'Airports', nan], dtype=object)
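
Note the `nan` entry: some locations have no service zone. A quick way to see how often each value appears, and to confirm that a missing value never compares equal to `"Yellow Zone"`, is sketched below on a toy Series. The values are illustrative and do not come from the real lookup file.

```python
import pandas as pd

# Toy stand-in for taxi_zones["service_zone"]; not the real data
service_zone = pd.Series(
    ["EWR", "Boro Zone", "Yellow Zone", "Yellow Zone", "Airports", None],
    name="service_zone",
)

# dropna=False keeps the missing values in the tally
counts = service_zone.value_counts(dropna=False)
print(counts)

# Comparing a missing value with a string yields False, so such rows
# are dropped by an equality filter instead of raising an error
is_yellow = service_zone == "Yellow Zone"
print(is_yellow.tolist())
```

This is why the equality filter used later in the notebook needs no special handling for unknown zones.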

In [9]:
# Join tables based on `PULocationID` and `DOLocationID`
pickup_zones = taxi_rides.merge(taxi_zones[["LocationID", "service_zone"]], left_on="PULocationID", right_on="LocationID", how="left")["service_zone"]
dropoff_zones = taxi_rides.merge(taxi_zones[["LocationID", "service_zone"]], left_on="DOLocationID", right_on="LocationID", how="left")["service_zone"]
taxi_rides_with_pickup_and_dropoff_zone = taxi_rides.assign(PUZone=pickup_zones, DOZone=dropoff_zones)
taxi_rides_with_pickup_and_dropoff_zone.head()

|  | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | ... | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | Airport_fee | cbd_congestion_fee | PUZone | DOZone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 2025-09-01 00:19:20 | 2025-09-01 00:45:17 | 1.0 | 9.92 | 1.0 | N | 138 | 114 | 1 | ... | 0.5 | 10.73 | 0.0 | 1.0 | 66.13 | 2.5 | 1.75 | 0.75 | Airports | Yellow Zone |
| 1 | 2 | 2025-09-01 00:15:20 | 2025-09-01 00:26:08 | 2.0 | 6.82 | 1.0 | N | 93 | 157 | 1 | ... | 0.5 | 5.86 | 0.0 | 1.0 | 35.16 | 0.0 | 0.0 | 0.0 | Boro Zone | Boro Zone |
| 2 | 2 | 2025-09-01 00:06:07 | 2025-09-01 00:22:23 | 1.0 | 3.95 | 1.0 | N | 68 | 13 | 1 | ... | 0.5 | 5.11 | 0.0 | 1.0 | 30.66 | 2.5 | 0.0 | 0.75 | Yellow Zone | Yellow Zone |
| 3 | 2 | 2025-09-01 00:49:47 | 2025-09-01 01:04:49 | 1.0 | 3.14 | 1.0 | N | 234 | 87 | 1 | ... | 0.5 | 3.52 | 0.0 | 1.0 | 26.97 | 2.5 | 0.0 | 0.75 | Yellow Zone | Yellow Zone |
| 4 | 2 | 2025-09-01 00:05:00 | 2025-09-01 00:15:32 | 6.0 | 2.81 | 1.0 | N | 230 | 151 | 1 | ... | 0.5 | 4.13 | 0.0 | 1.0 | 24.78 | 2.5 | 0.0 | 0.75 | Yellow Zone | Yellow Zone |
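
The merge-then-assign approach above relies on the merged result keeping the same row order and index as `taxi_rides`, which holds only if each `LocationID` appears once in the lookup. An alternative that avoids that dependency is to map the IDs through a lookup Series. The sketch below uses small made-up frames, and the variable names `zone_by_id` and `rides_with_zones` are introduced only for this example.

```python
import pandas as pd

# Toy stand-ins for the real frames; the values are made up
taxi_zones = pd.DataFrame(
    {"LocationID": [1, 4, 138], "service_zone": ["EWR", "Yellow Zone", "Airports"]}
)
taxi_rides = pd.DataFrame(
    {"PULocationID": [138, 4, 4], "DOLocationID": [4, 4, 999]}
)

# Build a LocationID -> service_zone lookup and make sure it is unambiguous
zone_by_id = taxi_zones.set_index("LocationID")["service_zone"]
assert zone_by_id.index.is_unique

# map() keeps the index of taxi_rides, so the new columns align by construction;
# IDs missing from the lookup (999 here) become NaN
rides_with_zones = taxi_rides.assign(
    PUZone=taxi_rides["PULocationID"].map(zone_by_id),
    DOZone=taxi_rides["DOLocationID"].map(zone_by_id),
)
print(rides_with_zones)
```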


In [10]:
# Filter out rows where either pick up or drop off zones are not `Yellow Zone`
yellow_only_rides = taxi_rides_with_pickup_and_dropoff_zone.loc[(taxi_rides_with_pickup_and_dropoff_zone.PUZone == "Yellow Zone") & (taxi_rides_with_pickup_and_dropoff_zone.DOZone == "Yellow Zone")]
yellow_only_rides.head()

|  | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | ... | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | Airport_fee | cbd_congestion_fee | PUZone | DOZone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 2 | 2025-09-01 00:06:07 | 2025-09-01 00:22:23 | 1.0 | 3.95 | 1.0 | N | 68 | 13 | 1 | ... | 0.5 | 5.11 | 0.0 | 1.0 | 30.66 | 2.5 | 0.0 | 0.75 | Yellow Zone | Yellow Zone |
| 3 | 2 | 2025-09-01 00:49:47 | 2025-09-01 01:04:49 | 1.0 | 3.14 | 1.0 | N | 234 | 87 | 1 | ... | 0.5 | 3.52 | 0.0 | 1.0 | 26.97 | 2.5 | 0.0 | 0.75 | Yellow Zone | Yellow Zone |
| 4 | 2 | 2025-09-01 00:05:00 | 2025-09-01 00:15:32 | 6.0 | 2.81 | 1.0 | N | 230 | 151 | 1 | ... | 0.5 | 4.13 | 0.0 | 1.0 | 24.78 | 2.5 | 0.0 | 0.75 | Yellow Zone | Yellow Zone |
| 5 | 1 | 2025-09-01 00:16:53 | 2025-09-01 00:29:36 | 2.0 | 2.0 | 1.0 | N | 79 | 164 | 1 | ... | 0.5 | 4.0 | 0.0 | 1.0 | 23.95 | 2.5 | 0.0 | 0.75 | Yellow Zone | Yellow Zone |
| 6 | 1 | 2025-09-01 00:33:01 | 2025-09-01 00:43:13 | 2.0 | 3.1 | 1.0 | N | 164 | 236 | 1 | ... | 0.5 | 4.1 | 0.0 | 1.0 | 24.75 | 2.5 | 0.0 | 0.75 | Yellow Zone | Yellow Zone |
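
Before loading anything, the filtered frame can be checked locally against the same expectations that the OpenMetadata tests encode (both zone columns are `Yellow Zone`, and the row count matches). The sketch below is only an illustration on toy data. `check_yellow_only` is a hypothetical helper and does not replace the tests run later through the SDK.

```python
import pandas as pd


def check_yellow_only(df: pd.DataFrame, expected_rows: int) -> dict:
    """Return a pass/fail flag for each local expectation."""
    return {
        "puzone_is_yellow": bool((df["PUZone"] == "Yellow Zone").all()),
        "dozone_is_yellow": bool((df["DOZone"] == "Yellow Zone").all()),
        "row_count_matches": len(df) == expected_rows,
    }


# Toy frame with one ride that leaves the yellow area
rides = pd.DataFrame(
    {
        "PUZone": ["Yellow Zone", "Yellow Zone", "Airports"],
        "DOZone": ["Yellow Zone", "Yellow Zone", "Yellow Zone"],
    }
)
yellow_only = rides.loc[(rides.PUZone == "Yellow Zone") & (rides.DOZone == "Yellow Zone")]

print(check_yellow_only(rides, expected_rows=2))
print(check_yellow_only(yellow_only, expected_rows=2))
```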


In [11]:
# Write the DataFrame to the database.
# Credentials for a user with write access are set up in `docker-compose.yml`.
from sqlalchemy import MetaData, Table, create_engine, delete, insert

def insert_taxi_yellow_table(table, conn, keys, data_iter):
    # Lowercase the DataFrame column names to match the destination table
    keys = [key.lower() for key in keys]
    # Reflect the existing destination table by name
    taxi_yellow_table = Table(table.name, MetaData(), autoload_with=conn)

    # Clean existing data
    conn.execute(delete(taxi_yellow_table))

    # Prepare insert statement
    data = [dict(zip(keys, row)) for row in data_iter]

    stmt = insert(taxi_yellow_table).values(data)

    result = conn.execute(stmt)
    return result.rowcount

engine = create_engine("postgresql://user:pass@dwh:5432/raw")

with engine.connect() as connection:
    yellow_only_rides.head(10_000).to_sql(
        name="taxi_yellow",
        con=connection,
        index=False,
        if_exists="append",
        method=insert_taxi_yellow_table,
    )
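
pandas calls a custom `method` with the signature `(table, conn, keys, data_iter)`, where `keys` holds the column names and `data_iter` yields one tuple per row. The record-building step of the function above can be looked at in isolation. The sketch below uses a hypothetical helper name, `rows_to_records`, and made-up rows.

```python
def rows_to_records(keys, data_iter):
    """Pair lowercased column names with each row's values."""
    lowered = [key.lower() for key in keys]
    return [dict(zip(lowered, row)) for row in data_iter]


# Made-up column names and rows, shaped like what pandas passes to `method`
keys = ["VendorID", "PUZone", "DOZone"]
rows = iter([(2, "Yellow Zone", "Yellow Zone"), (1, "Yellow Zone", "Yellow Zone")])

records = rows_to_records(keys, rows)
print(records[0])
```

The lowercasing matters because PostgreSQL folds unquoted identifiers to lowercase. This assumes the destination table was created with unquoted column names.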

## Run Data Quality tests

In [12]:
from metadata.sdk.data_quality import TestRunner

runner = TestRunner.for_table("Tutorial Postgres.raw.public.taxi_yellow")
results = runner.run()

for result in results:
    test_case = result.testCase
    test_result = result.testCaseResult

    print(f"\nTest: {test_case.name.root}")
    print(f"Status: {test_result.testCaseStatus}")
    print(f"Result: {test_result.result}")

[2025-11-07 10:32:57] INFO     {metadata.OMetaAPI:server_mixin:74} - OpenMetadata client running with Server version [1.10.4] and Client version [1.10.0.0]
[2025-11-07 10:32:57] INFO     {metadata.TestSuite:test_suite:102} - Retrieving table entity for FQN: Tutorial Postgres.raw.public.taxi_yellow
[2025-11-07 10:32:58] INFO     {metadata.TestSuite:test_suite:245} - Using existing test suite for table taxi_yellow
[2025-11-07 10:32:58] INFO     {metadata.TestSuite:core:33} - Executing test case dozone_column_value_is_yellow_zone for entity Tutorial Postgres.raw.public.taxi_yellow
[2025-11-07 10:32:58] INFO     {metadata.TestSuite:core:33} - Executing test case puzone_column_value_is_yellow_zone for entity Tutorial Postgres.raw.public.taxi_yellow
[2025-11-07 10:32:58] INFO     {metadata.TestSuite:core:33} - Executing test case taxi_yellow_table_row_count_is_10000 for entity Tutorial Postgres.raw.public.taxi_yellow
[2025-11-07 10:32:59] INFO     {metadata.Utils:logger:205} - Workflow O


Test: dozone_column_value_is_yellow_zone
Status: TestCaseStatus.Success
Result: Found 10000 value(s) matching regex pattern vs 10000 value(s) in the column.

Test: puzone_column_value_is_yellow_zone
Status: TestCaseStatus.Success
Result: Found 10000 value(s) matching regex pattern vs 10000 value(s) in the column.

Test: taxi_yellow_table_row_count_is_10000
Status: TestCaseStatus.Success
Result: Found rowCount=10000 rows vs. the expected 10000.0
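
In a real pipeline, printing the results is usually not enough: a failed test should stop the run. A minimal sketch of that idea follows. It only relies on the attribute names used in the loop above (`testCase.name.root` and `testCaseResult.testCaseStatus`). The `FakeStatus` enum, the `make_result` helper and the test names are stand-ins invented so the example runs without an OpenMetadata server.

```python
from enum import Enum
from types import SimpleNamespace


class FakeStatus(Enum):
    """Stand-in for the SDK's status enum; used only in this sketch."""
    Success = "Success"
    Failed = "Failed"


def make_result(name, status):
    """Build an object shaped like the results iterated over above."""
    return SimpleNamespace(
        testCase=SimpleNamespace(name=SimpleNamespace(root=name)),
        testCaseResult=SimpleNamespace(testCaseStatus=status),
    )


def failing_tests(results):
    """Return the names of test cases whose status is not Success."""
    return [
        r.testCase.name.root
        for r in results
        if r.testCaseResult.testCaseStatus.name != "Success"
    ]


sample = [
    make_result("example_passing_check", FakeStatus.Success),
    make_result("example_failing_check", FakeStatus.Failed),
]
print(failing_tests(sample))
```

With the real `results` list, one option is to raise an exception whenever `failing_tests(results)` is non-empty, so the surrounding job is marked as failed.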
