Data Engineering Zoomcamp: Module 3 Homework
Data Warehousing with BigQuery (2026)
Overview
This project demonstrates loading NYC Yellow Taxi 2024 data into Google Cloud Storage (GCS) and BigQuery, creating external tables, materialized tables, and optimized partitioned & clustered tables, and answering analytical questions related to storage and query performance.
The work follows best practices:
No credentials or project identifiers committed
GCP authentication via SDK
Data stored in GCS, queried via BigQuery
Project Structure
module3_homework/
├── ingestion/
│   └── load_yellow_taxi_data.py
├── ingest/          # Python virtual environment (ignored by git)
├── README.md
├── .gitignore
└── example.env
Configuration & Authentication
Environment Variables
Sensitive values are not hard-coded.
Create a .env file locally:
GCP_PROJECT_ID=your-gcp-project-id
GCS_BUCKET=your-gcs-bucket-name
Example template (committed as example.env):
GCP_PROJECT_ID=your-gcp-project-id
GCS_BUCKET=your-gcs-bucket-name
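The two variables above could be read in Python roughly as follows. This is a minimal sketch, not the code in ingestion/load_yellow_taxi_data.py: the helper names parse_env_text and load_settings are hypothetical, and the choice to let the process environment override the .env file is an assumption.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(env_path: str = ".env") -> dict[str, str]:
    """Return the required settings; environment variables win over the .env file."""
    path = Path(env_path)
    file_values = parse_env_text(path.read_text()) if path.exists() else {}
    settings = {}
    for name in ("GCP_PROJECT_ID", "GCS_BUCKET"):
        value = os.environ.get(name) or file_values.get(name)
        if not value:
            raise RuntimeError(f"Missing required setting: {name}")
        settings[name] = value
    return settings
```

Failing fast on a missing value keeps a misconfigured run from creating or writing to an unintended bucket.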
Authentication
Authentication uses the Google Cloud SDK (gcloud CLI):
gcloud auth application-default login
This stores Application Default Credentials locally, so the Google Cloud Python clients authenticate automatically without service account JSON keys.
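Before running the ingestion script, it can help to confirm that a credentials file is actually in place. The sketch below uses only the standard library and checks the locations the Google client libraries are known to consult: an explicit GOOGLE_APPLICATION_CREDENTIALS path, then the gcloud configuration directory. The function name find_adc_file is hypothetical, and the check only tests for the file's presence, not its validity.

```python
import os
from pathlib import Path
from typing import Optional


def find_adc_file(environ: Optional[dict] = None, home: Optional[Path] = None) -> Optional[Path]:
    """Return the credentials file the client libraries would likely pick up, or None."""
    environ = dict(os.environ) if environ is None else environ
    home = Path.home() if home is None else home

    # An explicit GOOGLE_APPLICATION_CREDENTIALS path takes precedence.
    explicit = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        path = Path(explicit)
        return path if path.is_file() else None

    # Otherwise look for the file written by `gcloud auth application-default login`.
    if environ.get("CLOUDSDK_CONFIG"):
        config_dir = Path(environ["CLOUDSDK_CONFIG"])
    elif os.name == "nt":
        config_dir = Path(environ.get("APPDATA", "")) / "gcloud"
    else:
        config_dir = home / ".config" / "gcloud"
    candidate = config_dir / "application_default_credentials.json"
    return candidate if candidate.is_file() else None
```

If find_adc_file() returns None, running the gcloud command above again is the likely fix.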
Data Ingestion (GCS)
Script
ingestion/load_yellow_taxi_data.py
What it does:
Downloads Yellow Taxi parquet files (January to June 2024)
Creates a GCS bucket if it does not exist
Uploads files to GCS with retry & verification logic
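The three steps above can be sketched as follows. This is an illustration under stated assumptions rather than the actual script: BASE_URL is assumed to be the public TLC download location, the function names are hypothetical, the retry policy (three attempts with a growing delay) is a guess, and verification is reduced to checking that the object exists after upload. The GCS calls come from the google-cloud-storage client library, which is imported lazily so the pure helpers run without it.

```python
import time
import urllib.request

# Assumption: public TLC download location; the real script may use another URL.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def monthly_files(year: int = 2024, months=range(1, 7)) -> list[str]:
    """File names for the requested months, e.g. yellow_tripdata_2024-01.parquet."""
    return [f"yellow_tripdata_{year}-{month:02d}.parquet" for month in months]


def with_retries(action, attempts: int = 3, delay_seconds: float = 2.0):
    """Call action() until it succeeds; re-raise after the last failed attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds * attempt)


def download(file_name: str) -> str:
    """Download one file into the working directory and return its local path."""
    with_retries(lambda: urllib.request.urlretrieve(f"{BASE_URL}/{file_name}", file_name))
    return file_name


def upload_and_verify(project_id: str, bucket_name: str, local_path: str, blob_name: str) -> None:
    # Imported here so the helpers above work without the library installed.
    from google.cloud import storage

    client = storage.Client(project=project_id)
    bucket = client.lookup_bucket(bucket_name)  # returns None if the bucket is missing
    if bucket is None:
        bucket = client.create_bucket(bucket_name)
    blob = bucket.blob(blob_name)

    def attempt_upload():
        blob.upload_from_filename(local_path)
        if not blob.exists():
            raise RuntimeError(f"{blob_name} not found in bucket after upload")

    with_retries(attempt_upload)
```

A driver loop would call download and then upload_and_verify for each name returned by monthly_files().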
Run:
python ingestion/load_yellow_taxi_data.py
Result:
gs:///yellow_tripdata_2024-01.parquet
...
gs:///yellow_tripdata_2024-06.parquet