# Week 2 – Kestra ETL: Data preparation and backfill notes


This notebook documents how the Postgres + Kestra environment was started
and how the taxi data backfill was executed using Kestra scheduled flows.

Main files used:
- docker-compose.yaml
- kestra-flow.yaml (05_postgres_taxi_scheduled)


## Start infrastructure

In [None]:

!docker compose up -d


This starts four services: pgdatabase, pgadmin, kestra, and the Postgres instance that Kestra uses internally.

## Verify services

In [None]:

!docker ps



## Open UIs

Kestra UI:
http://localhost:8080

PgAdmin:
http://localhost:8085



## Flow used for ingestion

The scheduled flow used is:

zoomcamp.05_postgres_taxi_scheduled

The flow dynamically builds file names using:

{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy-MM')}}.csv

and loads data into:

public.yellow_tripdata  
public.green_tripdata
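
As a rough illustration of how the template resolves, the Python sketch below mimics the rendering outside Kestra. The function name build_file_name is hypothetical and not part of the flow; the sketch only assumes that the date('yyyy-MM') filter yields a four-digit year and a two-digit month.

```python
from datetime import date


def build_file_name(taxi: str, trigger_date: date) -> str:
    """Mimic the flow's file name template for one taxi type and trigger date."""
    # %Y-%m is the strftime counterpart of the yyyy-MM pattern.
    return f"{taxi}_tripdata_{trigger_date:%Y-%m}.csv"


print(build_file_name("yellow", date(2021, 3, 1)))  # yellow_tripdata_2021-03.csv
```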



## Backfill strategy

The flow is scheduled monthly using Schedule triggers.

To ingest historical data, Kestra backfill was used instead of modifying the flow.

Two triggers exist, one per taxi type:
- yellow_schedule
- green_schedule



## Backfill 2021 data

Backfill was executed for the period 2021-01-01 → 2021-07-31,
for both taxi types:
- yellow
- green



## Backfill steps in Kestra UI

1. Open flow:
   zoomcamp → 05_postgres_taxi_scheduled

2. Click the trigger:
   - yellow_schedule

3. Click Backfill and use:

   start: 2021-01-01T00:00:00  
   end:   2021-07-31T23:59:59  
   input:
     taxi = yellow

4. Repeat the same for:
   - green_schedule
   with taxi = green

Optional label:
backfill=true



## Why backfill works

The flow uses trigger.date to compute the file name.

Each execution created by a backfill receives the date it would have been
scheduled for as trigger.date, so the file name resolves to the correct
month automatically.

No code change is required to ingest a new year.
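
To make this concrete, the sketch below lists the executions a backfill over the 2021 window would be expected to create. It assumes the schedule fires once per month on the 1st (the cron expression is not shown in these notes), and monthly_trigger_dates is a hypothetical helper, not a Kestra API.

```python
from datetime import date


def monthly_trigger_dates(start: date, end: date) -> list[date]:
    """Return every first-of-month date d with start <= d <= end."""
    dates = []
    year, month = start.year, start.month
    while True:
        current = date(year, month, 1)
        if current > end:
            break
        if current >= start:
            dates.append(current)
        # Advance one month, rolling over to January after December.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return dates


# Assumed window: the backfill period used in these notes.
for d in monthly_trigger_dates(date(2021, 1, 1), date(2021, 7, 31)):
    print(d, "->", f"yellow_tripdata_{d:%Y-%m}.csv")
```

Under that assumption the window yields seven dates, January through July, so one file per month is requested for each taxi type.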


## Verify data in Postgres

In [None]:

-- Yellow taxi
SELECT
  MIN(tpep_pickup_datetime),
  MAX(tpep_pickup_datetime),
  COUNT(*)
FROM public.yellow_tripdata;


In [None]:

-- Green taxi
SELECT
  MIN(lpep_pickup_datetime),
  MAX(lpep_pickup_datetime),
  COUNT(*)
FROM public.green_tripdata;


## Example validation queries used in the quiz

In [None]:

-- Yellow March 2021
SELECT COUNT(*)
FROM public.yellow_tripdata
WHERE tpep_pickup_datetime >= '2021-03-01'
  AND tpep_pickup_datetime <  '2021-04-01';


In [None]:

-- Green 2020
SELECT COUNT(*)
FROM public.green_tripdata
WHERE lpep_pickup_datetime >= '2020-01-01'
  AND lpep_pickup_datetime <  '2021-01-01';
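

Both queries filter on a half-open range (>= period start, < next period start), which avoids having to know the last day or last second of the period. A small Python sketch for computing such bounds for a calendar month follows; month_bounds is a hypothetical helper, not something provided by the flow or the quiz.

```python
from datetime import date


def month_bounds(year: int, month: int) -> tuple[str, str]:
    """Return (inclusive start, exclusive end) ISO dates for a calendar month."""
    start = date(year, month, 1)
    # The exclusive end is the first day of the following month.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start.isoformat(), end.isoformat()


print(month_bounds(2021, 3))  # ('2021-03-01', '2021-04-01')
```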



## Important note about deduplication

The flow loads data with a MERGE statement keyed on a generated unique_row_id,
so a row whose unique_row_id is already in the target table is not inserted again.

This means the tables contain de-duplicated data.

Quiz questions that refer to raw CSV row counts may therefore show
small differences between the table counts and the official dataset numbers.
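
The effect of the MERGE can be sketched in Python. This is only an illustration of the idea: the columns hashed into unique_row_id, the use of MD5, and the sample rows are all assumptions, since the flow definition is not reproduced in these notes.

```python
import hashlib


def unique_row_id(row: dict, key_columns: list[str]) -> str:
    """Hash the chosen columns into one identifier (column choice is assumed)."""
    joined = "|".join(str(row.get(col, "")) for col in key_columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def merge_rows(target: dict, staging: list[dict], key_columns: list[str]) -> int:
    """Insert staging rows whose id is not yet in target; return how many were inserted."""
    inserted = 0
    for row in staging:
        row_id = unique_row_id(row, key_columns)
        if row_id not in target:  # plays the role of an insert-when-not-matched branch
            target[row_id] = row
            inserted += 1
    return inserted


# Made-up sample rows; the first two are exact duplicates.
keys = ["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime"]
staging = [
    {"VendorID": 1, "tpep_pickup_datetime": "2021-03-01 00:10:00", "tpep_dropoff_datetime": "2021-03-01 00:25:00"},
    {"VendorID": 1, "tpep_pickup_datetime": "2021-03-01 00:10:00", "tpep_dropoff_datetime": "2021-03-01 00:25:00"},
    {"VendorID": 2, "tpep_pickup_datetime": "2021-03-01 00:12:00", "tpep_dropoff_datetime": "2021-03-01 00:30:00"},
]
table = {}
print(merge_rows(table, staging, keys))  # 2: the exact duplicate is skipped
print(merge_rows(table, staging, keys))  # 0: loading the same rows again adds nothing
```

Three CSV rows become two table rows here, which is the kind of gap that can show up between raw file counts and table counts.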



## Files in this repository

- docker-compose.yaml
- kestra-flow.yaml
- week2-kestra-backfill-notes.ipynb
