### Normalize Data

In [2]:
import dlt

data = [
    {
        "vendor_name": "VTS",
        "record_hash": "b00361a396177a9cb410ff61f20015ad",
        "time": {
            "pickup": "2009-06-14 23:23:00",
            "dropoff": "2009-06-14 23:48:00"
        },
        "coordinates": {
            "start": {"lon": -73.787442, "lat": 40.641525},
            "end": {"lon": -73.980072, "lat": 40.742963}
        },
        "passengers": [
            {"name": "John", "rating": 4.9},
            {"name": "Jack", "rating": 3.9}
        ]
    }
]

In [3]:
pipeline = dlt.pipeline(
    pipeline_name="ny_taxi_data", 
    destination="duckdb", 
    dataset_name="taxi_rides"
)

load_info = pipeline.run(data, table_name="rides", write_disposition="replace")
print(load_info)

Pipeline ny_taxi_data load step completed in 0.31 seconds
1 load package(s) were loaded to destination duckdb and into dataset taxi_rides
The duckdb destination used duckdb:////home/myothet/repos/data-engineering/dlt/workshop/ny_taxi_data.duckdb location to store data
Load package 1739691561.910805 is LOADED and contains no failed jobs


In [4]:
print(pipeline.last_trace)

Run started at 2025-02-16 07:39:21.664427+00:00 and COMPLETED in 2.08 seconds with 4 steps.
Step extract COMPLETED in 0.13 seconds.

Load package 1739691561.910805 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.12 seconds.
Normalized data for the following tables:
- rides: 1 row(s)
- rides__passengers: 2 row(s)

Load package 1739691561.910805 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 1.60 seconds.
Pipeline ny_taxi_data load step completed in 0.31 seconds
1 load package(s) were loaded to destination duckdb and into dataset taxi_rides
The duckdb destination used duckdb:////home/myothet/repos/data-engineering/dlt/workshop/ny_taxi_data.duckdb location to store data
Load package 1739691561.910805 is LOADED and contains no failed jobs

Step run COMPLETED in 2.07 seconds.
Pipeline ny_taxi_data load step completed in 0.31 seconds
1 load package(s) were loaded to destinat

In [5]:
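# What dlt did during the normalize step:
# - inferred the schema from the data
# - flattened nested JSON objects into columns (nested keys joined with "__")
# - converted data types (e.g. timestamp strings to timestamps)
# - split lists into child tables (e.g. rides__passengers)
# dlt also supports schema evolution.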

pipeline.dataset(dataset_type="default").rides.df().columns

Index(['vendor_name', 'record_hash', 'time__pickup', 'time__dropoff',
       'coordinates__start__lon', 'coordinates__start__lat',
       'coordinates__end__lon', 'coordinates__end__lat', '_dlt_load_id',
       '_dlt_id'],
      dtype='object')
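
The column names above follow a simple pattern: each nested key is joined to its parent with a double underscore, and the `passengers` list does not appear because it went to a child table. A minimal sketch of that naming rule in plain Python follows. It is an illustration only, not dlt's actual implementation; the `flatten` helper and its `sep` parameter are hypothetical names, and the `_dlt_*` bookkeeping columns are left out.

```python
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "", sep: str = "__") -> dict[str, Any]:
    """Flatten nested dicts into a single level, joining keys with `sep`.

    Hypothetical helper that mimics the column naming seen in the output above.
    """
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the accumulated prefix.
            flat.update(flatten(value, name, sep))
        elif isinstance(value, list):
            # Lists are skipped here; they belong in a child table instead.
            continue
        else:
            flat[name] = value
    return flat


ride = {
    "vendor_name": "VTS",
    "time": {"pickup": "2009-06-14 23:23:00", "dropoff": "2009-06-14 23:48:00"},
    "coordinates": {
        "start": {"lon": -73.787442, "lat": 40.641525},
        "end": {"lon": -73.980072, "lat": 40.742963},
    },
    "passengers": [{"name": "John", "rating": 4.9}],
}

for column in flatten(ride):
    print(column)
```

Running this prints `vendor_name`, `time__pickup`, `time__dropoff`, and the four `coordinates__...` names, in the same order as the corresponding columns in the `rides` table.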

In [6]:
# The pickup/dropoff strings were parsed into timezone-aware timestamps (note the +00:00 offset)
pipeline.dataset(dataset_type="default").rides.df()

|   | vendor_name | record_hash | time__pickup | time__dropoff | coordinates__start__lon | coordinates__start__lat | coordinates__end__lon | coordinates__end__lat | _dlt_load_id | _dlt_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | VTS | b00361a396177a9cb410ff61f20015ad | 2009-06-14 23:23:00+00:00 | 2009-06-14 23:48:00+00:00 | -73.787442 | 40.641525 | -73.980072 | 40.742963 | 1739691561.910805 | 3wJYq/zKmS19Jw |
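
The `+00:00` suffix in the result suggests that the naive timestamp strings from the input were interpreted as UTC. The sketch below reproduces that conversion with the standard library only, under that assumption; `parse_utc` is a hypothetical helper, not part of dlt.

```python
from datetime import datetime, timezone


def parse_utc(text: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string and mark it as UTC.

    Assumption: values without an offset are treated as UTC, which is
    what the +00:00 suffix in the loaded table suggests.
    """
    naive = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=timezone.utc)


pickup = parse_utc("2009-06-14 23:23:00")
dropoff = parse_utc("2009-06-14 23:48:00")

print(pickup.isoformat(sep=" "))  # 2009-06-14 23:23:00+00:00
print(dropoff - pickup)           # 0:25:00
```

Once the values are real timestamps rather than strings, arithmetic such as the trip duration works directly.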


In [7]:
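# The passengers list was split into the child table rides__passengers;
# _dlt_parent_id links each passenger to the _dlt_id of its parent row in rides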
pipeline.dataset(dataset_type="default").rides__passengers.df()

|   | name | rating | _dlt_parent_id | _dlt_list_idx | _dlt_id |
|---|---|---|---|---|---|
| 0 | John | 4.9 | 3wJYq/zKmS19Jw | 0 | fmNgOO2rwAtZzQ |
| 1 | Jack | 3.9 | 3wJYq/zKmS19Jw | 1 | MMh2o0NW848qbA |
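
Both passenger rows carry the `_dlt_id` of their parent ride in `_dlt_parent_id`, and `_dlt_list_idx` records each item's position in the original list. The sketch below shows the idea in plain Python: split a list field into child rows, then join them back to the parent. It is a simplified illustration, not dlt's implementation; `split_children` is a hypothetical helper, and the random hex IDs only stand in for however dlt actually generates its row IDs.

```python
import uuid
from typing import Any

Row = dict[str, Any]


def split_children(rows: list[Row], list_key: str) -> tuple[list[Row], list[Row]]:
    """Move the list stored under `list_key` into separate child rows."""
    parents: list[Row] = []
    children: list[Row] = []
    for row in rows:
        parent_id = uuid.uuid4().hex[:14]  # placeholder ID scheme (assumption)
        parent = {k: v for k, v in row.items() if k != list_key}
        parent["_dlt_id"] = parent_id
        parents.append(parent)
        for idx, item in enumerate(row.get(list_key, [])):
            children.append({
                **item,
                "_dlt_parent_id": parent_id,  # link back to the parent row
                "_dlt_list_idx": idx,         # position in the original list
                "_dlt_id": uuid.uuid4().hex[:14],
            })
    return parents, children


rides = [{
    "vendor_name": "VTS",
    "passengers": [
        {"name": "John", "rating": 4.9},
        {"name": "Jack", "rating": 3.9},
    ],
}]

parents, children = split_children(rides, "passengers")

# Join each child back to its parent through _dlt_parent_id.
parents_by_id = {p["_dlt_id"]: p for p in parents}
joined = [
    {
        "vendor_name": parents_by_id[c["_dlt_parent_id"]]["vendor_name"],
        "name": c["name"],
        "rating": c["rating"],
    }
    for c in children
]
print(joined)
```

The same join can be expressed against the loaded tables by matching `rides__passengers._dlt_parent_id` to `rides._dlt_id`.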
