# Hands-on ELT with dbt and DuckDB

This notebook walks through a complete **extract-load-transform (ELT)** workflow using [dbt](https://www.getdbt.com/) on a demo dataset. It is designed for teaching data engineering concepts with a focus on how analytics engineers can build reliable transformations on top of an existing warehouse.

## Learning goals

By the end of this lab you will be able to:

* Explain the difference between ETL and ELT and why dbt aligns with modern ELT practices.
* Create a minimal dbt project that targets DuckDB as the analytical warehouse.
* Seed raw data, build staged models, construct a mart, and add data quality tests.
* Use dbt to run transformations and validate results from a Jupyter notebook.

## Prerequisites

* Python 3.9+ with `pip`.
* Jupyter environment with access to install packages.
* Basic familiarity with SQL and the command line.

> **Tip:** Every code cell in this notebook is safe to re-run. If you re-run the notebook, existing directories are reused and the generated files are overwritten with the same content.

## 1. Install dependencies

We will use the [`dbt-duckdb`](https://github.com/duckdb/dbt-duckdb) adapter so that dbt can run SQL transformations locally without needing an external database server.

The installation may take a minute the first time it runs.

In [None]:
%pip install --quiet dbt-duckdb duckdb pandas
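
To confirm that the packages landed in the same environment as the notebook kernel, a quick check along these lines can help. It is a minimal sketch that uses only the standard library; the helper name `installed_version` is ours and not part of dbt.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report on the three distributions installed by the cell above.
for name in ["dbt-duckdb", "duckdb", "pandas"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If any line reports `not installed`, the install most likely went to a different Python environment than the one running the kernel.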

## 2. Set up the project structure

We will create a dedicated folder called `dbt_elt_demo` inside the current working directory.

* `DBT_PROFILES_DIR` is set so dbt can discover the profile configuration that tells it how to connect to DuckDB.
* `project_root` will be reused throughout the notebook.

In [None]:
import os
from pathlib import Path

project_root = Path.cwd() / "dbt_elt_demo"
project_root.mkdir(exist_ok=True)
os.environ["DBT_PROFILES_DIR"] = str(project_root / "profiles")
print("Project root:", project_root)
print("DBT_PROFILES_DIR:", os.environ["DBT_PROFILES_DIR"])

Create the sub-directories that dbt expects for models, seeds, and profiles.

In [None]:
models_staging = project_root / "models" / "staging"
models_marts = project_root / "models" / "marts"
seeds_dir = project_root / "seeds"
profiles_dir = Path(os.environ["DBT_PROFILES_DIR"])

for path in [models_staging, models_marts, seeds_dir, profiles_dir]:
    path.mkdir(parents=True, exist_ok=True)

project_root
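
To double-check the layout, the directory tree can be printed with a small helper. This is an illustrative sketch; `render_tree` is a hypothetical name and the function is not part of dbt.

```python
from pathlib import Path


def render_tree(root, indent=""):
    """Return one line per entry under root, directories first, sorted by name."""
    lines = []
    entries = sorted(Path(root).iterdir(), key=lambda p: (p.is_file(), p.name))
    for entry in entries:
        suffix = "/" if entry.is_dir() else ""
        lines.append(f"{indent}{entry.name}{suffix}")
        if entry.is_dir():
            lines.extend(render_tree(entry, indent + "    "))
    return lines


# Assumes the project folder created above; nothing is printed if it is absent.
demo_root = Path.cwd() / "dbt_elt_demo"
if demo_root.exists():
    print("\n".join(render_tree(demo_root)))
```

At this point the output should show `models/` (with `staging/` and `marts/` nested under it), `profiles/`, and `seeds/`.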

## 3. Create a demo dataset

We will simulate data from an online subscription business.

* `customers` contains customer attributes.
* `orders` records transactions.

In a real ELT process, data would already be loaded into the warehouse before dbt runs. Here we treat the CSV files as our "loaded" layer by storing them in dbt's `seeds` directory.

In [None]:
import pandas as pd

customers = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "customer_name": ["Avery Analytics", "Bobby Business", "Casey Commerce", "Dakota Data"],
        "customer_email": [
            "avery.analytics@example.com",
            "bobby.business@example.com",
            "casey.commerce@example.com",
            "dakota.data@example.com",
        ],
        "customer_segment": ["SMB", "Enterprise", "Enterprise", "SMB"],
    }
)

orders = pd.DataFrame(
    {
        "order_id": [101, 102, 103, 104, 105, 106],
        "customer_id": [1, 2, 3, 2, 4, 1],
        "order_date": pd.to_datetime([
            "2024-01-15",
            "2024-01-17",
            "2024-01-18",
            "2024-02-03",
            "2024-02-10",
            "2024-03-05",
        ]),
        "status": ["paid", "paid", "pending", "paid", "paid", "refunded"],
        "amount_usd": [1200.0, 2400.0, 800.0, 2600.0, 900.0, 1200.0],
    }
)

customers.to_csv(seeds_dir / "customers.csv", index=False)
orders.to_csv(seeds_dir / "orders.csv", index=False)

customers.head()

Preview the `orders` dataset to understand what transformations we might want to perform.

In [None]:
orders.head()

## 4. Configure dbt

dbt needs two YAML files:

1. `dbt_project.yml` – defines project-level settings like paths and materializations.
2. `profiles.yml` – stored in the profiles directory and describes how to connect to the warehouse (DuckDB in this case).

In [None]:
dbt_project_yaml = """
name: 'dbt_elt_demo'
version: '1.0.0'
config-version: 2

profile: 'dbt_elt_demo'

model-paths: ['models']
seed-paths: ['seeds']

models:
  dbt_elt_demo:
    staging:
      +materialized: view
    marts:
      +materialized: table
"""

(project_root / "dbt_project.yml").write_text(dbt_project_yaml.strip() + "\n")
print((project_root / "dbt_project.yml").read_text())

In [None]:
profiles_yaml = f"""
dbt_elt_demo:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: '{project_root / "warehouse.duckdb"}'
      schema: main
      threads: 4
"""

(profiles_dir / "profiles.yml").write_text(profiles_yaml.strip() + "\n")
print((profiles_dir / "profiles.yml").read_text())
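
dbt links the two files by name: the `profile:` value in `dbt_project.yml` has to match a top-level key in `profiles.yml`, otherwise dbt cannot find the connection settings. The check below is a rough sketch that uses regular expressions instead of a YAML parser, so it only handles simple files like the ones written above. The function names and the inline sample strings are hypothetical.

```python
import re


def project_profile_name(project_yaml):
    """Extract the value of the top-level `profile:` key, without quotes."""
    match = re.search(r"^profile:\s*['\"]?([^'\"\s#]+)", project_yaml, flags=re.MULTILINE)
    return match.group(1) if match else None


def top_level_keys(profiles_yaml):
    """List keys that start at column 0, i.e. the profile names."""
    return re.findall(r"^([A-Za-z_][\w-]*):", profiles_yaml, flags=re.MULTILINE)


# Small inline copies of the two files, so the sketch runs on its own.
example_project = "name: 'dbt_elt_demo'\nprofile: 'dbt_elt_demo'\n"
example_profiles = "dbt_elt_demo:\n  target: dev\n  outputs:\n    dev:\n      type: duckdb\n"

name = project_profile_name(example_project)
print(name, name in top_level_keys(example_profiles))
```

In the notebook, the same functions could be applied to the contents of the two files read back with `read_text()`.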

## 5. Build the staging layer

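Staging models clean and standardize raw data. They are typically thin wrappers over the raw tables that cast types, rename columns, or filter rows. In this demo the seeds are already clean, so the staging models simply select the columns we need.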

We will also declare tests in `schema.yml` to check for basic data quality.

In [None]:
stg_customers_sql = """
select
    customer_id,
    customer_name,
    customer_email,
    customer_segment
from {{ ref('customers') }}
"""

stg_orders_sql = """
select
    order_id,
    customer_id,
    order_date,
    status,
    amount_usd
from {{ ref('orders') }}
"""

(models_staging / "stg_customers.sql").write_text(stg_customers_sql.strip() + "\n")
(models_staging / "stg_orders.sql").write_text(stg_orders_sql.strip() + "\n")

print((models_staging / "stg_customers.sql").read_text())

In [None]:
print((models_staging / 'stg_orders.sql').read_text())

In [None]:
staging_schema_yaml = """
version: 2

seeds:
  - name: customers
    columns:
      - name: customer_id
        tests: [unique, not_null]
  - name: orders
    columns:
      - name: order_id
        tests: [unique, not_null]

models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests: [unique, not_null]
  - name: stg_orders
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: customer_id
        tests: [not_null]
"""

(models_staging / "schema.yml").write_text(staging_schema_yaml.strip() + "\n")
print((models_staging / "schema.yml").read_text())
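
Conceptually, each generic test is a query that selects the rows violating a rule, and the test passes when that query returns nothing. The plain-Python sketch below mimics that idea for `not_null` and `unique`. It is not how dbt implements them (dbt compiles the tests to SQL), and the helper names and sample rows are made up for illustration.

```python
from collections import Counter


def not_null_failures(rows, column):
    """Rows whose value in `column` is missing."""
    return [row for row in rows if row.get(column) is None]


def unique_failures(rows, column):
    """Non-null values of `column` that appear more than once."""
    counts = Counter(row[column] for row in rows if row.get(column) is not None)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical rows with one duplicate and one missing key.
sample_orders = [
    {"order_id": 101, "customer_id": 1},
    {"order_id": 101, "customer_id": 2},
    {"order_id": None, "customer_id": 3},
]

print("not_null failures:", not_null_failures(sample_orders, "order_id"))
print("unique failures:", unique_failures(sample_orders, "order_id"))
```

Both helpers return a non-empty list here, which corresponds to a failing test. On the demo seeds both would return empty lists.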

## 6. Build the mart layer

Marts expose business-friendly data models. Here we create a simple fact table that joins orders to their customers and adds an `order_month` column for monthly reporting. The grain stays at one row per order.

This is the type of dataset analysts would use to power dashboards and downstream metrics.

In [None]:
marts_sql = """
with orders as (
    select * from {{ ref('stg_orders') }}
),
customers as (
    select * from {{ ref('stg_customers') }}
)

select
    o.order_id,
    o.order_date,
    o.status,
    o.customer_id,
    c.customer_name,
    c.customer_segment,
    o.amount_usd,
    date_trunc('month', o.order_date) as order_month
from orders o
join customers c on o.customer_id = c.customer_id
"""

(models_marts / "fct_customer_orders.sql").write_text(marts_sql.strip() + "\n")
print((models_marts / "fct_customer_orders.sql").read_text())
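
Two details of this model are easy to miss: the inner `join` silently drops any order whose `customer_id` has no match, and `date_trunc('month', ...)` maps every date to the first day of its month. The pure-Python sketch below reproduces both behaviours on a few hypothetical rows (order 999 and customer 42 do not exist in the demo data). It illustrates the logic only and is not a replacement for the SQL.

```python
from datetime import date


def month_start(d):
    """Python analogue of date_trunc('month', d) for a date value."""
    return d.replace(day=1)


def join_orders_to_customers(orders, customers):
    """Inner join on customer_id; unmatched orders are dropped."""
    by_id = {c["customer_id"]: c for c in customers}
    joined = []
    for o in orders:
        c = by_id.get(o["customer_id"])
        if c is None:
            continue  # same effect as an inner join with no matching row
        joined.append({
            **o,
            "customer_segment": c["customer_segment"],
            "order_month": month_start(o["order_date"]),
        })
    return joined


orphan_demo_customers = [{"customer_id": 1, "customer_segment": "SMB"}]
orphan_demo_orders = [
    {"order_id": 101, "customer_id": 1, "order_date": date(2024, 1, 15)},
    {"order_id": 999, "customer_id": 42, "order_date": date(2024, 1, 20)},  # orphan
]

for row in join_orders_to_customers(orphan_demo_orders, orphan_demo_customers):
    print(row["order_id"], row["order_month"])
```

Only order 101 survives the join. A `left join` in the SQL would instead keep the orphan row with null customer columns.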

## 7. Run dbt commands

We are ready to run the project end-to-end:

1. `dbt debug` confirms configuration.
2. `dbt seed` loads the CSVs into DuckDB.
3. `dbt run` builds models in dependency order.
4. `dbt test` executes our data quality tests.

In [None]:
!cd "{project_root}" && dbt debug

In [None]:
!cd "{project_root}" && dbt seed

In [None]:
!cd "{project_root}" && dbt run

In [None]:
!cd "{project_root}" && dbt test
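
The four shell cells above can also be chained from Python so that a failing step stops the sequence. The sketch below is one way to do that with `subprocess`; `run_in_order` and `DBT_STEPS` are hypothetical names, and the demonstration uses the Python interpreter as a stand-in so that it runs even where dbt is not installed.

```python
import subprocess
import sys


def run_in_order(commands, cwd=None):
    """Run each command in turn; stop at the first non-zero exit code."""
    results = []
    for command in commands:
        completed = subprocess.run(command, cwd=cwd)
        results.append((command, completed.returncode))
        if completed.returncode != 0:
            break
    return results


# The same four steps as above, expressed as argument lists.
DBT_STEPS = [["dbt", "debug"], ["dbt", "seed"], ["dbt", "run"], ["dbt", "test"]]

# Harmless demonstration with the Python interpreter standing in for dbt:
# the second command fails, so the third is never started.
demo = [
    [sys.executable, "-c", "raise SystemExit(0)"],
    [sys.executable, "-c", "raise SystemExit(3)"],
    [sys.executable, "-c", "raise SystemExit(0)"],
]
print([code for _, code in run_in_order(demo)])
```

In the notebook, `run_in_order(DBT_STEPS, cwd=project_root)` would run the real commands, assuming `dbt` is on the `PATH` of the kernel's environment.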

## 8. Explore the results

The transformations were materialized inside the DuckDB database specified in our profile. Use DuckDB directly from Python to inspect the final fact table.

In [None]:
import duckdb

con = duckdb.connect(str(project_root / "warehouse.duckdb"))
con.execute(
    """
    select
        order_month,
        customer_segment,
        count(*) as orders,
        sum(amount_usd) as revenue
    from fct_customer_orders
    group by 1, 2
    order by 1, 2
    """
).df()

For a more granular view, we can display the fact table itself.

In [None]:
con.execute('select * from fct_customer_orders order by order_date').df()
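
Because the demo data is tiny, the monthly aggregate can be cross-checked by hand. The sketch below recomputes order counts and revenue per month and segment in plain Python from the values defined in section 3. It is a sanity check that assumes the seeds were not modified, not something dbt produces.

```python
from collections import defaultdict

# customer_id -> segment, copied from the customers seed in section 3.
segments = {1: "SMB", 2: "Enterprise", 3: "Enterprise", 4: "SMB"}

# (customer_id, order_date, amount_usd), copied from the orders seed.
order_rows = [
    (1, "2024-01-15", 1200.0),
    (2, "2024-01-17", 2400.0),
    (3, "2024-01-18", 800.0),
    (2, "2024-02-03", 2600.0),
    (4, "2024-02-10", 900.0),
    (1, "2024-03-05", 1200.0),
]

totals = defaultdict(lambda: [0, 0.0])  # (month, segment) -> [orders, revenue]
for customer_id, order_date, amount in order_rows:
    key = (order_date[:7], segments[customer_id])  # "YYYY-MM" stands in for order_month
    totals[key][0] += 1
    totals[key][1] += amount

for (month, segment), (n_orders, revenue) in sorted(totals.items()):
    print(month, segment, n_orders, revenue)
```

The five groups printed here should match the rows returned by the DuckDB query. Note that the totals include the pending and refunded orders, because neither the mart nor the query filters on `status`.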

## 9. Clean up (optional)

If you want to start fresh, remove the `dbt_elt_demo` folder and rerun the notebook. Keeping it can be useful for exploring the compiled SQL under `target/` and the generated DuckDB file.
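
Removing the folder can be done from Python as well. The helper below is a cautious sketch: `remove_demo_project` is a hypothetical name, and it refuses to delete anything that is not a directory called `dbt_elt_demo`. Closing the DuckDB connection first with `con.close()` is advisable, since an open database file may block deletion on some platforms.

```python
import shutil
from pathlib import Path


def remove_demo_project(path, expected_name="dbt_elt_demo"):
    """Delete the folder only if it exists and has the expected name."""
    path = Path(path)
    if path.name != expected_name or not path.is_dir():
        return False
    shutil.rmtree(path)
    return True


# Uncomment to delete the demo project created by this notebook.
# remove_demo_project(Path.cwd() / "dbt_elt_demo")
```

The function returns `True` when the folder was removed and `False` when nothing was deleted.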

## Next steps

* Add slowly changing dimensions or incremental models.
* Schedule dbt runs with a tool such as dbt Cloud or an orchestration platform.
* Generate documentation with `dbt docs generate` and explore the lineage graph.