
mzwili/module3_homework

Data Engineering Zoomcamp – Module 3 Homework

Data Warehousing with BigQuery (2026)

📌 Overview

This project loads NYC Yellow Taxi data for 2024 into Google Cloud Storage (GCS) and BigQuery. It then creates external tables, materialized tables, and optimized partitioned and clustered tables, and uses them to answer analytical questions about storage and query performance.

The work follows best practices:

No credentials or project identifiers committed

GCP authentication via SDK

Data stored in GCS, queried via BigQuery
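As an illustration of the table types named in the overview, here is a minimal sketch that builds BigQuery DDL strings for an external table over the parquet files in GCS and for a partitioned and clustered copy of it. The function names, the dataset and table names, and the choice of `tpep_dropoff_datetime` and `VendorID` as partition and cluster columns are assumptions for the example, not taken from this repository.

```python
def external_table_ddl(project: str, dataset: str, table: str, bucket: str) -> str:
    """Build DDL for an external table that reads the parquet files in GCS."""
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{project}.{dataset}.{table}`\n"
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        # A single wildcard matches all monthly files for 2024.
        f"  uris = ['gs://{bucket}/yellow_tripdata_2024-*.parquet']\n"
        ");"
    )


def partitioned_clustered_ddl(
    project: str,
    dataset: str,
    table: str,
    source_table: str,
    partition_column: str,
    cluster_column: str,
) -> str:
    """Build DDL for a table partitioned by date and clustered by one column."""
    return (
        f"CREATE OR REPLACE TABLE `{project}.{dataset}.{table}`\n"
        f"PARTITION BY DATE({partition_column})\n"
        f"CLUSTER BY {cluster_column} AS\n"
        f"SELECT * FROM `{project}.{dataset}.{source_table}`;"
    )


if __name__ == "__main__":
    # Hypothetical identifiers; replace with your own project, dataset, and bucket.
    print(external_table_ddl("your-gcp-project-id", "taxi", "yellow_external", "your-gcs-bucket-name"))
    print(
        partitioned_clustered_ddl(
            "your-gcp-project-id",
            "taxi",
            "yellow_partitioned",
            "yellow_external",
            partition_column="tpep_dropoff_datetime",  # assumed column choice
            cluster_column="VendorID",  # assumed column choice
        )
    )
```

The printed statements can be pasted into the BigQuery console. Queries that filter on the partition column should then scan fewer bytes than the same query against the unpartitioned table.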

🗂 Project Structure

module3_homework/
├── ingestion/
│   └── load_yellow_taxi_data.py
├── ingest/            # Python virtual environment (ignored by git)
├── README.md
├── .gitignore
└── example.env

πŸ” Configuration & Authentication Environment Variables

Sensitive values are not hard-coded.

Create a .env file locally:

GCP_PROJECT_ID=your-gcp-project-id
GCS_BUCKET=your-gcs-bucket-name

Example template (committed):

example.env

GCP_PROJECT_ID=your-gcp-project-id
GCS_BUCKET=your-gcs-bucket-name
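For illustration, a script could read these two values roughly as follows. This is a standard-library sketch; the helper names `parse_env_text` and `load_settings` are hypothetical, and the actual ingestion script may use a package such as python-dotenv instead.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(env_path: str = ".env") -> dict[str, str]:
    """Read the required settings; real environment variables win over the file."""
    path = Path(env_path)
    file_values = parse_env_text(path.read_text()) if path.exists() else {}
    settings = {}
    for key in ("GCP_PROJECT_ID", "GCS_BUCKET"):
        value = os.environ.get(key, file_values.get(key))
        if not value:
            raise RuntimeError(f"Missing required setting: {key}")
        settings[key] = value
    return settings
```

Failing fast on a missing value keeps a misconfigured run from reaching the download or upload steps.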

Authentication

Authentication uses Application Default Credentials, set up through the Google Cloud SDK:

gcloud auth application-default login

This lets the Python client libraries authenticate automatically, with no service-account JSON key file to download or commit.
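To make the mechanism concrete: Google's client libraries resolve credentials through `google.auth.default()`, which checks the `GOOGLE_APPLICATION_CREDENTIALS` environment variable and then the file that the login command writes. The sketch below only lists those locations so a script can check them before starting. It assumes the usual Linux/macOS path (Windows uses a different directory), and the function names are hypothetical.

```python
import os
from pathlib import Path


def adc_candidate_paths(env: dict, home: Path) -> list[Path]:
    """Return the credential files this sketch checks, in priority order."""
    candidates = []
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        candidates.append(Path(explicit))
    # Assumed default location on Linux/macOS after
    # `gcloud auth application-default login`.
    candidates.append(
        home / ".config" / "gcloud" / "application_default_credentials.json"
    )
    return candidates


def has_local_adc() -> bool:
    """True if any candidate credential file exists on this machine."""
    return any(p.is_file() for p in adc_candidate_paths(dict(os.environ), Path.home()))
```

A script could call `has_local_adc()` at startup and print a reminder to run the login command when it returns `False`.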

📦 Data Ingestion (GCS)

Script

ingestion/load_yellow_taxi_data.py

What it does:

Downloads Yellow Taxi parquet files (Jan–Jun 2024)

Creates a GCS bucket if it does not exist

Uploads files to GCS with retry & verification logic
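The upload step with retry and verification might look roughly like the sketch below. It is not the script's actual code: `file_names` and `upload_with_retry` are hypothetical names, and the upload and verify actions are passed in as callables so the example runs without GCS access. With google-cloud-storage they would typically wrap `blob.upload_from_filename(...)` and `blob.exists()`.

```python
import time
from typing import Callable


def file_names(year: int = 2024) -> list[str]:
    """Monthly parquet file names for January through June."""
    return [f"yellow_tripdata_{year}-{month:02d}.parquet" for month in range(1, 7)]


def upload_with_retry(
    upload: Callable[[str], None],
    verify: Callable[[str], bool],
    name: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Upload one file, confirm it arrived, and return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            upload(name)
            if verify(name):
                return attempt
            print(f"{name}: verification failed on attempt {attempt}")
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            print(f"{name}: attempt {attempt} failed: {exc}")
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"{name}: giving up after {max_attempts} attempts")
```

Verifying after each upload catches the case where the call returns without error but the object is not visible in the bucket.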

Run:

python ingestion/load_yellow_taxi_data.py

Result:

gs:///yellow_tripdata_2024-01.parquet
...
gs:///yellow_tripdata_2024-06.parquet
