A production-grade, cloud-native data engineering platform built on AWS, simulating Uber's ride analytics pipeline. Showcases 18+ technologies across batch processing, real-time streaming, data warehousing, and ML feature engineering.
The platform follows a layered architecture with clear separation between ingestion, processing, warehousing, and orchestration:
```text
Data Sources → Ingestion (API Gateway + Lambda)
        ↓
Real-Time: Kinesis → Lambda → Firehose → S3 + DynamoDB
Batch:     S3 Bronze → Glue ETL → S3 Silver → S3 Gold
        ↓
Warehouse: Redshift (Star Schema) + Snowflake + Athena
        ↓
Orchestration: Airflow + Step Functions + EventBridge
Monitoring: CloudWatch + SNS Alerts + Great Expectations
```
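The real-time leg of the diagram (Kinesis → Lambda → Firehose/DynamoDB) can be sketched as a minimal handler. This is an illustrative sketch, not the project's actual `lambda_stream_processor` code: the event envelope follows the standard Kinesis-to-Lambda shape, but the payload fields (`fare`, `distance_km`, `surge_multiplier`) and enrichment logic are assumptions, and the boto3 writes to Firehose/DynamoDB are omitted.

```python
import base64
import json

def enrich(body: dict) -> dict:
    """Illustrative enrichment: derive revenue-per-km and a surge flag.
    Field names are assumptions, not the project's actual schema."""
    enriched = dict(body)
    if body.get("distance_km"):
        enriched["revenue_per_km"] = round(body["fare"] / body["distance_km"], 2)
    enriched["is_surge"] = body.get("surge_multiplier", 1.0) > 1.0
    return enriched

def handler(event, context=None):
    """Decode base64-encoded JSON records from the Kinesis event envelope
    and enrich each one. A real deployment would then put the results to
    Firehose (for S3) and DynamoDB (for live metrics) via boto3."""
    results = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        results.append(enrich(payload))
    return results
```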
| Layer | Purpose | Format | Partitioning |
|---|---|---|---|
| 🥉 Bronze | Raw, unprocessed events | JSON/CSV | year/month/day |
| 🥈 Silver | Cleaned, validated, enriched | Parquet (Snappy) | event_date/city |
| 🥇 Gold | Business-ready aggregations & star schema | Parquet | ride_date |
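The partitioning columns above translate directly into Hive-style S3 key prefixes. A small sketch of the convention — the bucket layout and file names here are illustrative assumptions; the real names come from the Terraform `s3` module:

```python
from datetime import date

def bronze_key(ingest_date: date, batch_id: str) -> str:
    # Bronze: raw JSON, partitioned by ingestion date (year/month/day)
    return (f"bronze/year={ingest_date.year}/month={ingest_date.month:02d}/"
            f"day={ingest_date.day:02d}/{batch_id}.json")

def silver_key(event_date: date, city: str, part: int) -> str:
    # Silver: cleaned Snappy Parquet, partitioned by event_date and city
    return (f"silver/event_date={event_date.isoformat()}/city={city}/"
            f"part-{part:05d}.snappy.parquet")

def gold_key(ride_date: date, table: str) -> str:
    # Gold: business aggregates, partitioned by ride_date
    return f"gold/{table}/ride_date={ride_date.isoformat()}/part-00000.parquet"
```

Hive-style `key=value` prefixes let Glue crawlers, Athena, and Redshift Spectrum discover the partitions automatically.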
| Category | Technologies |
|---|---|
| Storage | Amazon S3 (Data Lake: Bronze/Silver/Gold), DynamoDB |
| Compute | AWS Lambda, EC2, EMR (Spark), AWS Glue (PySpark) |
| Streaming | Amazon Kinesis Data Streams, Kinesis Firehose, Apache Kafka (MSK) |
| Data Warehouse | Amazon Redshift, Snowflake |
| Transformation | dbt, AWS Glue ETL, PySpark |
| Orchestration | AWS Step Functions, Apache Airflow (MWAA), Amazon EventBridge |
| API | Amazon API Gateway |
| Analytics | Amazon Athena, Redshift Spectrum |
| Data Quality | Great Expectations |
| Monitoring | Amazon CloudWatch, Amazon SNS |
| Infrastructure | Terraform (IaC) |
| CI/CD | GitHub Actions |
| Containers | Docker, Amazon ECR |
| Data Catalog | AWS Glue Data Catalog |
```text
uber-data-engineering-platform/
│
├── terraform/                            # 🔧 Infrastructure as Code
│   ├── main.tf                           # Root configuration (S3 backend)
│   ├── variables.tf / outputs.tf         # Input/output definitions
│   └── modules/
│       ├── vpc/                          # VPC, subnets, NAT, security groups
│       ├── s3/                           # Bronze/Silver/Gold + lifecycle policies
│       ├── kinesis/                      # Data Streams + Firehose delivery
│       ├── redshift/                     # Cluster + Spectrum + IAM
│       ├── glue/                         # Crawlers, ETL jobs, catalog DB
│       ├── lambda/                       # 3 functions + API Gateway
│       ├── emr/                          # Spark cluster + auto-scaling
│       ├── dynamodb/                     # Live ride metrics table
│       ├── step_functions/               # ETL orchestration state machine
│       └── monitoring/                   # CloudWatch dashboard + SNS alarms
│
├── data_sources/                         # 📊 Data Generation
│   ├── ride_event_generator.py           # Multi-mode: batch / stream / local
│   ├── historical_data_generator.py      # Partitioned Parquet output
│   └── schemas/ride_event_schema.json    # JSON schema definition
│
├── ingestion/                            # 📥 Data Ingestion Layer
│   ├── lambda_kinesis_producer/          # REST API → Kinesis producer
│   ├── lambda_s3_trigger/                # S3 event → validate → Bronze
│   └── api_gateway/api_spec.yaml         # OpenAPI 3.0 specification
│
├── streaming/                            # ⚡ Real-Time Pipeline
│   ├── lambda_stream_processor/          # Enrich + Firehose + DynamoDB
│   ├── kinesis_analytics/                # SQL tumbling/sliding windows
│   └── kafka_producer/                   # MSK/Kafka alternative
│
├── batch_processing/                     # ⚙️ Batch ETL
│   ├── glue_jobs/
│   │   ├── bronze_to_silver.py           # PySpark: clean, dedupe, validate
│   │   └── silver_to_gold.py             # PySpark: star schema + KPIs
│   ├── glue_crawlers/                    # Crawler configurations
│   └── emr_jobs/
│       ├── heavy_aggregation.py          # Geospatial + driver scoring + ML
│       └── bootstrap.sh                  # EMR dependency installer
│
├── data_warehouse/                       # 🏢 Data Warehouse
│   ├── redshift/
│   │   ├── ddl/schema.sql                # Star schema (6 tables)
│   │   ├── copy_commands.sql             # S3 → Redshift COPY
│   │   ├── analytical_queries.sql        # 6 complex BI queries
│   │   └── spectrum_setup.sql            # Federated S3 queries
│   ├── snowflake/
│   │   ├── setup.sql                     # Warehouse + external stages
│   │   └── pipes.sql                     # Snowpipe + CDC streams
│   └── athena/queries.sql                # Serverless analytics
│
├── dbt_models/                           # 🔄 dbt Transformations
│   ├── dbt_project.yml
│   └── models/
│       ├── staging/stg_rides.sql         # Standardize raw data
│       ├── intermediate/int_ride_metrics.sql  # Derived metrics
│       ├── marts/fct_daily_rides.sql     # Daily aggregated facts
│       ├── marts/dim_drivers.sql         # Driver performance dimension
│       └── schema.yml                    # Tests + column documentation
│
├── orchestration/                        # 🎯 Pipeline Orchestration
│   ├── step_functions/etl_pipeline.json  # AWS state machine
│   ├── airflow/dags/
│   │   ├── uber_etl_dag.py               # Daily batch ETL pipeline
│   │   └── streaming_monitor_dag.py      # 15-min health checks
│   └── eventbridge/rules.json            # Scheduled triggers
│
├── data_quality/                         # ✅ Data Quality
│   └── great_expectations/
│       ├── expectations/                 # 19 validation rules
│       ├── checkpoints/                  # Daily validation runs
│       └── great_expectations.yml        # S3-based stores
│
├── docker/                               # 🐳 Local Development
│   ├── docker-compose.yml                # LocalStack + Kafka + Airflow + Spark
│   ├── localstack/init.sh                # AWS resource bootstrap
│   └── Dockerfile.etl                    # PySpark container
│
├── .github/workflows/                    # 🚀 CI/CD
│   ├── ci.yml                            # Lint + validate + scan
│   └── deploy.yml                        # Terraform + Lambda + Glue deploy
│
├── docs/images/                          # 📸 Architecture diagrams
├── requirements.txt
├── .gitignore
└── README.md
```
- AWS Account with appropriate IAM permissions
- Terraform >= 1.5.0
- Python >= 3.12
- Docker & Docker Compose
- AWS CLI configured with credentials
```bash
git clone https://github.com/yourusername/uber-data-engineering-platform.git
cd uber-data-engineering-platform
```

```bash
# Start all local services (LocalStack, Kafka, Airflow, Spark, PostgreSQL)
cd docker
docker compose up -d

# Verify LocalStack resources
awslocal s3 ls
awslocal kinesis list-streams
awslocal dynamodb list-tables

# Access Airflow UI at http://localhost:8080 (admin/admin)
```

```bash
pip install -r requirements.txt

# Preview sample events
cd data_sources
python ride_event_generator.py --mode local --count 5

# Generate 100K batch records
python ride_event_generator.py --mode batch --count 100000 --output ../sample_data/rides.json

# Generate historical Parquet dataset
python historical_data_generator.py --events-per-year 100000 --years 2022 2023 2024
```

```bash
# Initialize and deploy infrastructure
cd terraform
terraform init
terraform plan
terraform apply

# Deploy Lambda functions
for func in lambda_kinesis_producer lambda_s3_trigger; do
  cd ../ingestion/$func
  zip -r /tmp/$func.zip handler.py
  aws lambda update-function-code --function-name uber-de-dev-$func --zip-file fileb:///tmp/$func.zip
  cd ../../
done

# Upload Glue scripts
aws s3 sync batch_processing/glue_jobs/ s3://uber-de-dev-scripts/glue/
```

```bash
# Start streaming events to Kinesis
python data_sources/ride_event_generator.py --mode stream --stream-name uber-de-dev-ride-events --eps 50 --duration 300

# Trigger batch ETL via Step Functions
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:ACCOUNT_ID:stateMachine:uber-de-dev-etl-pipeline

# Run dbt models
cd dbt_models
dbt run --target prod
dbt test
```

```text
                  ┌───────────────┐
                  │   dim_time    │
                  └───────┬───────┘
                          │
┌──────────────┐  ┌───────┴───────┐  ┌───────────────────┐
│  dim_riders  ├──┤   fact_rides  ├──┤ dim_vehicle_types │
└──────────────┘  └───────┬───────┘  └───────────────────┘
                          │
┌──────────────┐  ┌───────┴───────┐
│ dim_drivers  ├──┤ dim_locations │
└──────────────┘  └───────────────┘
```
| Table | Records | Description |
|---|---|---|
| `fact_rides` | Millions | Individual completed ride records with all metrics |
| `dim_drivers` | Thousands | Driver profiles with performance tiers (PLATINUM→BRONZE) |
| `dim_riders` | Thousands | Rider profiles with preferences |
| `dim_time` | ~4,000 | Calendar dimension (2020–2030) |
| `dim_vehicle_types` | 6 | UberX, UberXL, UberBlack, Pool, Comfort, Green |
| `dim_locations` | Dynamic | Geographic grid zones |
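As a rough illustration of where the ~4,000-row figure for `dim_time` comes from, a calendar dimension spanning 2020–2030 can be generated one row per day. The column choices below are assumptions for the sketch, not the project's actual DDL:

```python
from datetime import date, timedelta

def build_dim_time(start: date, end: date) -> list[dict]:
    """Generate one row per calendar day with common date attributes."""
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),     # surrogate key, e.g. 20200101
            "year": d.year,
            "quarter": (d.month - 1) // 3 + 1,
            "month": d.month,
            "day_of_week": d.isoweekday(),             # 1 = Monday ... 7 = Sunday
            "is_weekend": d.isoweekday() >= 6,
        })
        d += timedelta(days=1)
    return rows

dim_time = build_dim_time(date(2020, 1, 1), date(2030, 12, 31))
# 11 years × 365 days + 3 leap days (2020, 2024, 2028) = 4018 rows ≈ "~4,000"
```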
| Analysis | Description | Tool |
|---|---|---|
| Revenue Trends | Monthly revenue by city with MoM growth | Redshift |
| Peak Hour Surge | Hourly demand with P50/P95 fare distributions | Redshift |
| Driver Scoring | Composite performance with 4-tier classification | EMR Spark |
| Weather Impact | Demand & pricing shifts by weather conditions | Redshift |
| Rider Retention | Weekly cohort analysis with retention curves | Redshift |
| Geospatial Heatmaps | Pickup/dropoff hotspot grid analysis | EMR Spark |
| Route Popularity | Top origin-destination pairs by city | EMR Spark |
| Real-Time Metrics | Live rides/min, surge hotspots per city | Kinesis Analytics |
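The Real-Time Metrics row relies on Kinesis Analytics tumbling windows. A tumbling window assigns each event to exactly one fixed-width, non-overlapping bucket; a minimal Python sketch of the same rides-per-minute aggregation (the event shape with `ts` epoch seconds and `city` is an assumption):

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds=60):
    """Count events per (window_start, city) bucket. Each event falls into
    exactly one window: floor-divide its timestamp by the window width."""
    counts = defaultdict(int)
    for ev in events:
        window_start = (ev["ts"] // window_seconds) * window_seconds
        counts[(window_start, ev["city"])] += 1
    return dict(counts)
```

A sliding window would differ only in that each event contributes to every window overlapping its timestamp.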
| Stage | Tool | Checks |
|---|---|---|
| Lint | flake8, black, isort | Code quality + formatting |
| Validate | Terraform validate | Infrastructure configs |
| Syntax | py_compile | All Python files |
| Schema | JSON/YAML validators | All config files |
| Security | Trivy | Vulnerability scanning |
| Deploy | Terraform apply | Infrastructure provisioning |
| Release | AWS CLI | Lambda + Glue script deployment |
- CloudWatch Dashboard: Lambda metrics, Kinesis throughput & lag, Glue job status
- Alarms: Consumer lag > 5min, Lambda error rate > 5%, Glue job failures
- SNS Notifications: Email/Slack alerts on any pipeline failure
- Airflow Health DAG: Automated Kinesis + Lambda + DynamoDB checks every 15 minutes
- Great Expectations: 19 data quality rules validated daily on Silver layer
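The lag and error-rate alarms above reduce to simple threshold checks over CloudWatch metrics. A sketch of that evaluation logic — the thresholds mirror the bullets, while the function shape and alarm names are illustrative, and the SNS publish step is omitted:

```python
LAG_THRESHOLD_SECONDS = 5 * 60   # alarm when consumer lag > 5 minutes
ERROR_RATE_THRESHOLD = 0.05      # alarm when Lambda error rate > 5%

def evaluate_alarms(consumer_lag_seconds: float,
                    invocations: int,
                    errors: int) -> list[str]:
    """Return the names of alarms that should fire for one evaluation period."""
    alarms = []
    if consumer_lag_seconds > LAG_THRESHOLD_SECONDS:
        alarms.append("kinesis-consumer-lag")
    if invocations > 0 and errors / invocations > ERROR_RATE_THRESHOLD:
        alarms.append("lambda-error-rate")
    return alarms
```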
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License.
⚠️ Note: Replace `ACCOUNT_ID` placeholders with your actual AWS account ID before deploying.
Built with ❤️ for Data Engineering


