DE Zoomcamp 2026 Final Project
A batch processing pipeline analyzing AI coding agent contributions on GitHub. Ingests GitHub Archive data via Airflow, stores in BigQuery (partitioned/clustered), transforms with dbt, and visualizes in Looker Studio.
+---------------------+     +---------------------+     +---------------------+
|   GitHub Archive    |     |   Apache Airflow    |     |     GCS Bucket      |
|    gharchive.org    |---->|   Docker Compose    |---->|     Raw JSON.gz     |
|    24 files/day     |     |   Celery Executor   |     |     90-day TTL      |
+---------------------+     +---------------------+     +----------+----------+
                                                                    |
                                                                    v
                                                         +----------+----------+
                                                         |      BigQuery       |
                                                         |   Partitioned DAY   |
                                                         |  Clustered 3 fields |
                                                         +----------+----------+
                                                                    |
                                                                    v
                                                         +----------+----------+
                                                         |      dbt 1.9+       |
                                                         |   1 Staging View    |
                                                         |   3 Mart Tables     |
                                                         +----------+----------+
                                                                    |
                                                                    v
                                                         +----------+----------+
                                                         |    Looker Studio    |
                                                         | AI Agent Dashboard  |
                                                         |      7+ Charts      |
                                                         +---------------------+
| Component | Technology | Purpose |
|---|---|---|
| Orchestration | Apache Airflow 2.8.0 (Docker Compose) | DAG scheduling, data pipeline |
| Storage | GCS + BigQuery | Raw files + partitioned warehouse |
| Transformation | dbt 1.9+ | 1 staging + 3 mart models |
| Visualization | Looker Studio | 7+ interactive charts |
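The warehouse table is pre-created by Terraform as a day-partitioned, clustered table. For orientation, a minimal sketch of an equivalent definition using the google-cloud-bigquery Python client is shown below; the schema, partition column, and the three clustering fields are illustrative assumptions, not the project's exact choices.

```python
# Sketch only: an equivalent of the Terraform-provisioned warehouse table,
# expressed with the google-cloud-bigquery client. The schema, partition
# field, and clustering fields below are illustrative assumptions.
from google.cloud import bigquery

client = bigquery.Client(project="YOUR_PROJECT_ID")

schema = [
    bigquery.SchemaField("event_id", "STRING"),
    bigquery.SchemaField("event_type", "STRING"),
    bigquery.SchemaField("actor_login", "STRING"),
    bigquery.SchemaField("repo_name", "STRING"),
    bigquery.SchemaField("event_date", "DATE"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
]

table = bigquery.Table("YOUR_PROJECT_ID.github_activity.github_events", schema=schema)

# Daily partitioning on the event date column (assumed partition field)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_date"
)
# Clustering on three frequently filtered columns (hypothetical choice)
table.clustering_fields = ["event_type", "repo_name", "actor_login"]

client.create_table(table, exists_ok=True)
```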
github-ai-contributions/
+-- airflow/
|   +-- dags/                  # Airflow DAG files
|       +-- github_activity_pipeline.py
+-- dbt/
|   +-- models/                # dbt transformation models
|       +-- staging/
|       +-- marts/
+-- terraform/                 # Infrastructure as Code
|   +-- main.tf
|   +-- variables.tf
|   +-- outputs.tf
+-- tests/                     # Pytest test suites
+-- scripts/                   # Validation scripts
+-- Makefile                   # Build automation
+-- docker-compose.yml         # Airflow stack config
+-- README.md
+------------------------+     +------------------------+     +------------------------+
| download_github_archive|     |     upload_to_gcs      |     | validate_data_quality  |
|                        |---->|                        |---->|                        |
| Download 24 hourly     |     | Upload raw JSON.gz     |     | Check JSON structure   |
| files from GH Archive  |     | to GCS bucket          |     | Verify required fields |
+------------------------+     +------------------------+     +-----------+------------+
                                                                           |
                                                                           v
+------------------------+     +------------------------+     +------------------------+
|   cleanup_temp_files   |     |    load_to_bigquery    |     |     transform_data     |
|                        |<----|                        |<----|                        |
| Remove local temp      |     | Load to partitioned    |     | Flatten nested JSON    |
| files, free disk       |     | BigQuery table         |     | Convert to BQ schema   |
+------------------------+     +------------------------+     +------------------------+
| Task | Description |
|---|---|
| download_github_archive | Downloads GitHub Archive JSON for the specified date (24 hourly files) |
| upload_to_gcs | Uploads raw data to GCS |
| validate_data_quality | Validates JSON structure and required fields |
| transform_data | Transforms data to a BigQuery-compatible schema |
| load_to_bigquery | Loads data into the partitioned BigQuery table |
| cleanup_temp_files | Removes temporary files |
Note: DAG runs in TEST_MODE (50k records per file) for faster execution. See airflow/dags/github_activity_pipeline.py for production configuration.
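A trimmed sketch of how these six tasks chain together is shown below. Task IDs match the table above, but the bodies are placeholders; see airflow/dags/github_activity_pipeline.py for the real operators and callables.

```python
# Illustrative skeleton of the task chain only; the real DAG lives in
# airflow/dags/github_activity_pipeline.py with full task implementations.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="github_activity_batch_pipeline",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    task_ids = [
        "download_github_archive",
        "upload_to_gcs",
        "validate_data_quality",
        "transform_data",
        "load_to_bigquery",
        "cleanup_temp_files",
    ]
    # Placeholder callables stand in for the real download/upload/load logic.
    tasks = [
        PythonOperator(task_id=task_id, python_callable=lambda: None)
        for task_id in task_ids
    ]

    # Linear dependency chain:
    # download -> upload -> validate -> transform -> load -> cleanup
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream
```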
- Docker + Docker Compose
- Terraform >= 1.0
- gcloud CLI + GCP project with billing enabled
- Python 3.12+ (for dbt)
Create a GCP service account with BigQuery and Storage permissions:
# Create service account
gcloud iam service-accounts create github-activity-pipeline
# Grant BigQuery admin role
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
--member="serviceAccount:github-activity-pipeline@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/bigquery.admin"
# Grant Storage admin role
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
--member="serviceAccount:github-activity-pipeline@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/storage.admin"
# Download credentials
gcloud iam service-accounts keys create keys/gcp-creds.json \
--iam-account=github-activity-pipeline@YOUR_PROJECT_ID.iam.gserviceaccount.com
Copy the example file and configure:
cp .env.example .env
Edit .env with your values:
GOOGLE_CLOUD_PROJECT=YOUR_PROJECT_ID
GOOGLE_APPLICATION_CREDENTIALS=/absolute/path/to/keys/gcp-creds.json
AIRFLOW_UID=50000
AIRFLOW__CORE__FERNET_KEY=$(python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
- Initialize Terraform:
make terraform-init
- Review planned changes:
make terraform-plan PROJECT_ID=YOUR_PROJECT_ID
This shows what resources will be created (GCS bucket, BigQuery dataset/table).
- Apply changes:
make terraform-apply PROJECT_ID=YOUR_PROJECT_ID
- Start the Airflow stack:
make airflow-up
- Wait for Airflow to initialize (~2 minutes). Check status:
docker compose ps
All services should show "running" status.
- Access the Airflow UI:
  - URL: http://localhost:8080
  - Username: admin
  - Password: admin
- If DAGs don't appear, restart the scheduler:
docker compose restart airflow-scheduler
- Set the Airflow Variable for GCP project:
  - Go to Admin -> Variables
  - Add variable: project_id=YOUR_PROJECT_ID
- Authenticate with GCP:
gcloud auth application-default login
- Create virtual environment and install dependencies:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
- Configure dbt profile (if not auto-detected):
mkdir -p ~/.dbt
cat > ~/.dbt/profiles.yml << EOF
github_activity:
  target: dev
  outputs:
    dev:
      type: bigquery
      project: YOUR_PROJECT_ID
      dataset: github_activity
      method: oauth
EOF
- Load environment variables for dbt:
set -a && source .env && set +a
This exports the .env variables to your shell so dbt can read them. The project's dbt/profiles.yml and dbt/models/staging/sources.yml use {{ env_var('GOOGLE_CLOUD_PROJECT') }}, which requires the variable to be exported.
Note: Airflow runs in Docker with the official apache/airflow:2.8.0 image.
The requirements.txt is for local dbt transformations and testing only.
- Trigger the Airflow DAG:
  - Open http://localhost:8080
  - Toggle the github_activity_batch_pipeline DAG to Active
  - Click Play -> Trigger DAG w/ config -> set execution_date (e.g., 2026-03-28)
  - Each run downloads 24 hours of data (~5 min)
- Run dbt transformations:
make dbt-build # Runs: deps + run + test
- Verify data:
bq query --use_legacy_sql=false \
  "SELECT event_date, COUNT(*) as events
   FROM github_activity.github_events
   GROUP BY event_date
   ORDER BY event_date"
After Airflow loads data, run dbt to create analytics models:
make dbt-build # deps + run + test
Models created:
| Model | Type | Description |
|---|---|---|
| stg_github_events | View | Cleaned events with quality flags |
| daily_stats | Table | Daily aggregated metrics |
| repo_health | Table | Repository health scores |
| ai_agent_stats | Table | AI agent activity analytics |
Verify: bq query "SELECT * FROM github_activity.ai_agent_stats LIMIT 10"
- Go to https://lookerstudio.google.com/
- Create -> Report -> BigQuery connector
- Select: YOUR_PROJECT_ID.github_activity.ai_agent_stats
Dashboard Visualizations (7+ charts):
| Visualization | Type | Data Source | Metrics |
|---|---|---|---|
| Key Metrics | Score Cards (4) | ai_agent_stats | Total AI Events, Unique Agents, Repos Touched, AI Ratio |
| AI Agent Activity | Donut | ai_agent_stats | actor_type, SUM(total_events) |
| Agent Timeline | Line | ai_agent_stats | stats_date, SUM(total_events) by actor_type |
| Top AI-Touched Repos | Bar | stg_github_events | repo_name, COUNT(*) WHERE actor_type='AI' |
| AI Adoption Over Time | Line | ai_ratio_timeline | event_date, ai_percentage |
| AI vs Human Ratio | Gauge | ai_agent_stats | ai_events/total_events |
| Event Type Breakdown | Stacked Bar | ai_agent_stats | actor_type, push/pr/comment/star events |
| Activity Heatmap | Grid | stg_github_events | HOUR(created_at), actor_type, COUNT(*) |
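As a rough illustration of how the AI vs Human Ratio gauge can be fed, the query below computes the AI share of events from ai_agent_stats via the BigQuery Python client. The actor_type and total_events columns follow the table above; treat the exact schema as an assumption.

```python
# Sketch: AI share of events from the ai_agent_stats mart.
# Column names (actor_type, total_events) follow the dashboard table above
# and are assumptions about the model's exact schema.
from google.cloud import bigquery

client = bigquery.Client(project="YOUR_PROJECT_ID")

sql = """
SELECT
  SAFE_DIVIDE(
    SUM(IF(actor_type = 'AI', total_events, 0)),
    SUM(total_events)
  ) AS ai_ratio
FROM `github_activity.ai_agent_stats`
"""

row = next(iter(client.query(sql).result()), None)
if row and row.ai_ratio is not None:
    print(f"AI share of events: {row.ai_ratio:.1%}")
else:
    print("No data in ai_agent_stats yet")
```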
make test # All tests
make test-airflow # DAG structure tests
make test-terraform # Terraform validation
make test-dbt # dbt model tests
This section helps reviewers verify the project meets DE Zoomcamp criteria.
| Criterion | Location | How to Verify |
|---|---|---|
| Problem description | README.md, CLAUDE.md | Read project overview |
| Cloud + IaC | terraform/ | Run make terraform-plan |
| Data ingestion | airflow/dags/ | Check DAG structure with make test-airflow |
| Data warehouse | BigQuery console | Verify table is partitioned & clustered |
| Transformations | dbt/models/ | Run make dbt-test |
| Dashboard | Looker Studio link | Open dashboard URL |
| Reproducibility | All files | Run make validate && make test |
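For the data warehouse criterion, partitioning and clustering can also be checked from Python instead of the BigQuery console (a sketch, assuming the github_activity.github_events table created earlier):

```python
# Sketch: confirm the raw events table is partitioned and clustered
# without opening the BigQuery console.
from google.cloud import bigquery

client = bigquery.Client(project="YOUR_PROJECT_ID")
table = client.get_table("YOUR_PROJECT_ID.github_activity.github_events")

print("Partitioning:", table.time_partitioning)       # expect DAY partitioning
print("Clustering fields:", table.clustering_fields)  # expect three fields
```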
# Run all validations
make validate
# Run all tests
make test
# Full deployment test
make deploy
| Issue | Solution |
|---|---|
| DAG not appearing | docker compose restart airflow-scheduler |
| Auth errors | Check GOOGLE_APPLICATION_CREDENTIALS path in .env |
| BigQuery schema errors | Table pre-created by Terraform; use autodetect=False |
| dbt env var error | Run set -a && source .env && set +a before dbt commands |
| dbt won't run | Ensure Python 3.12-3.13 (not 3.14+) |
See TROUBLESHOOTING.md for detailed solutions.
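On the BigQuery schema errors row: because Terraform pre-creates the table, the load step should not let BigQuery re-infer the schema. A minimal sketch of such a load follows; the bucket path and job options are illustrative, not copied from the DAG.

```python
# Sketch: load newline-delimited JSON from GCS into the pre-created,
# partitioned table without schema autodetection. Paths and options are
# illustrative assumptions, not the DAG's exact configuration.
from google.cloud import bigquery

client = bigquery.Client(project="YOUR_PROJECT_ID")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=False,                 # rely on the Terraform-defined schema
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    ignore_unknown_values=True,       # raw GH Archive events carry extra fields
)

load_job = client.load_table_from_uri(
    "gs://YOUR_BUCKET/raw/2026-03-28/*.json.gz",   # hypothetical GCS layout
    "YOUR_PROJECT_ID.github_activity.github_events",
    job_config=job_config,
)
load_job.result()  # block until the load finishes
```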
MIT License
- Data source: GH Archive (gharchive.org)
- Course: Data Engineering Zoomcamp 2026
