dlt for normalization

- Automatically detects schema – No need to define column types manually.
- Flattens nested JSON – Converts complex structures into table-ready formats.
- Handles data type conversion – Converts dates, numbers, and booleans correctly.
- Splits lists into child tables – Each nested list becomes its own table (for example `rides__passengers`), with every row linked back to its parent row.
- Supports schema evolution – Adapts to changes in data structure over time.
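
To make the flattening and child-table steps concrete, here is a minimal pure-Python sketch of the idea. It is not dlt's implementation: `normalize_record`, the `_id` / `_parent_id` columns, and the row-id scheme are hypothetical names chosen for illustration. The double-underscore naming mirrors the `rides__passengers` table name that dlt produces. Type conversion and schema evolution are left out.

```python
# Illustrative sketch only; this is NOT how dlt is implemented internally.
# `normalize_record`, `_id`, and `_parent_id` are hypothetical names.

def normalize_record(record, table_name, tables, parent_id=None):
    """Flatten one nested record into `tables` (table name -> list of flat rows)."""
    rows = tables.setdefault(table_name, [])
    row_id = f"{table_name}:{len(rows)}"  # hypothetical surrogate key
    row = {"_id": row_id}
    if parent_id is not None:
        row["_parent_id"] = parent_id  # link back to the parent row
    rows.append(row)

    def flatten(prefix, value):
        if isinstance(value, dict):
            # Nested dicts become columns joined with a double underscore
            for key, inner in value.items():
                flatten(f"{prefix}__{key}" if prefix else key, inner)
        elif isinstance(value, list):
            # Lists become rows in a child table named <parent>__<field>
            child_table = f"{table_name}__{prefix}"
            for item in value:
                if not isinstance(item, dict):
                    item = {"value": item}  # wrap scalars so they fit in a row
                normalize_record(item, child_table, tables, parent_id=row_id)
        else:
            row[prefix] = value

    flatten("", record)
    return tables


sample = {
    "vendor_name": "VTS",
    "time": {"pickup": "2009-06-14 23:23:00"},
    "passengers": [
        {"name": "John", "rating": 4.9},
        {"name": "Jack", "rating": 3.9},
    ],
}

tables = normalize_record(sample, "rides", {})
for name, rows in tables.items():
    print(name, rows)
```

In this sketch the parent table keeps only flat columns such as `time__pickup`, while each passenger lands in `rides__passengers` with a pointer to the ride it came from.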

In [None]:
data = [
    {
        "vendor_name": "VTS",
        "record_hash": "b00361a396177a9cb410ff61f20015ad",
        "time": {
            "pickup": "2009-06-14 23:23:00",
            "dropoff": "2009-06-14 23:48:00"
        },
        "coordinates": {
            "start": {"lon": -73.787442, "lat": 40.641525},
            "end": {"lon": -73.980072, "lat": 40.742963}
        },
        "passengers": [
            {"name": "John", "rating": 4.9},
            {"name": "Jack", "rating": 3.9}
        ]
    }
]

In [None]:
import dlt

# Define a dlt pipeline
pipeline = dlt.pipeline(
    pipeline_name="ny_taxi_data",
    destination="duckdb",
    dataset_name="taxi_rides",
)

In [None]:
# Run the pipeline: load `data` into the "rides" table of a local DuckDB database, replacing any existing rows
info = pipeline.run(data, table_name="rides", write_disposition="replace")

In [None]:
# Print the load info and the trace of the last run
print(info)
print(pipeline.last_trace)

In [None]:
# View both tables in the dataset (schema); print each one, since a notebook cell only displays its last expression
dataset = pipeline.dataset(dataset_type="default")
print(dataset.rides.df())  # main table
print(dataset.rides__passengers.df())  # child table built from the "passengers" list