# Uber End-to-End Data Engineering Platform

A production-grade, cloud-native data engineering platform built on AWS that simulates Uber's ride analytics pipeline, showcasing 18+ technologies across batch processing, real-time streaming, data warehousing, and ML feature engineering.


πŸ—οΈ Architecture

System Architecture

The platform follows a layered architecture with clear separation between ingestion, processing, warehousing, and orchestration:

```text
Data Sources → Ingestion (API Gateway + Lambda)
    ↓
Real-Time: Kinesis → Lambda → Firehose → S3 + DynamoDB
Batch: S3 Bronze → Glue ETL → S3 Silver → S3 Gold
    ↓
Warehouse: Redshift (Star Schema) + Snowflake + Athena
    ↓
Orchestration: Airflow + Step Functions + EventBridge
Monitoring: CloudWatch + SNS Alerts + Great Expectations
```

## 🔄 Data Pipeline

*(End-to-end pipeline diagram)*

### Medallion Architecture

| Layer | Purpose | Format | Partitioning |
|-------|---------|--------|--------------|
| 🥉 Bronze | Raw, unprocessed events | JSON/CSV | `year/month/day` |
| 🥈 Silver | Cleaned, validated, enriched | Parquet (Snappy) | `event_date/city` |
| 🥇 Gold | Business-ready aggregations & star schema | Parquet | `ride_date` |
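The partitioning columns above translate directly into Hive-style S3 key prefixes. A minimal sketch of how the three layers' object keys might be laid out (prefix and file names here are illustrative, not the Terraform-managed bucket layout):

```python
from datetime import date

def bronze_key(ingest_date: date, filename: str) -> str:
    """Raw events land partitioned by ingestion date (year/month/day)."""
    return (f"bronze/rides/year={ingest_date.year}/month={ingest_date.month:02d}/"
            f"day={ingest_date.day:02d}/{filename}")

def silver_key(event_date: date, city: str, filename: str) -> str:
    """Cleaned Parquet, partitioned by event_date and city."""
    return f"silver/rides/event_date={event_date.isoformat()}/city={city}/{filename}"

def gold_key(ride_date: date, filename: str) -> str:
    """Business-ready aggregates, partitioned by ride_date."""
    return f"gold/rides/ride_date={ride_date.isoformat()}/{filename}"

print(bronze_key(date(2024, 3, 7), "events.json"))
# bronze/rides/year=2024/month=03/day=07/events.json
```

Hive-style `key=value` prefixes let Glue crawlers and Athena discover the partitions automatically.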

πŸ› οΈ Technology Stack

Technology Stack

| Category | Technologies |
|----------|--------------|
| Storage | Amazon S3 (Data Lake – Bronze/Silver/Gold), DynamoDB |
| Compute | AWS Lambda, EC2, EMR (Spark), AWS Glue (PySpark) |
| Streaming | Amazon Kinesis Data Streams, Kinesis Firehose, Apache Kafka (MSK) |
| Data Warehouse | Amazon Redshift, Snowflake |
| Transformation | dbt, AWS Glue ETL, PySpark |
| Orchestration | AWS Step Functions, Apache Airflow (MWAA), Amazon EventBridge |
| API | Amazon API Gateway |
| Analytics | Amazon Athena, Redshift Spectrum |
| Data Quality | Great Expectations |
| Monitoring | Amazon CloudWatch, Amazon SNS |
| Infrastructure | Terraform (IaC) |
| CI/CD | GitHub Actions |
| Containers | Docker, Amazon ECR |
| Data Catalog | AWS Glue Data Catalog |

πŸ“ Project Structure

```text
uber-data-engineering-platform/
│
├── terraform/                          # 🔧 Infrastructure as Code
│   ├── main.tf                         # Root configuration (S3 backend)
│   ├── variables.tf / outputs.tf       # Input/output definitions
│   └── modules/
│       ├── vpc/                        # VPC, subnets, NAT, security groups
│       ├── s3/                         # Bronze/Silver/Gold + lifecycle policies
│       ├── kinesis/                    # Data Streams + Firehose delivery
│       ├── redshift/                   # Cluster + Spectrum + IAM
│       ├── glue/                       # Crawlers, ETL jobs, catalog DB
│       ├── lambda/                     # 3 functions + API Gateway
│       ├── emr/                        # Spark cluster + auto-scaling
│       ├── dynamodb/                   # Live ride metrics table
│       ├── step_functions/             # ETL orchestration state machine
│       └── monitoring/                 # CloudWatch dashboard + SNS alarms
│
├── data_sources/                       # 📊 Data Generation
│   ├── ride_event_generator.py         # Multi-mode: batch / stream / local
│   ├── historical_data_generator.py    # Partitioned Parquet output
│   └── schemas/ride_event_schema.json  # JSON schema definition
│
├── ingestion/                          # 📥 Data Ingestion Layer
│   ├── lambda_kinesis_producer/        # REST API → Kinesis producer
│   ├── lambda_s3_trigger/              # S3 event → validate → Bronze
│   └── api_gateway/api_spec.yaml       # OpenAPI 3.0 specification
│
├── streaming/                          # ⚡ Real-Time Pipeline
│   ├── lambda_stream_processor/        # Enrich + Firehose + DynamoDB
│   ├── kinesis_analytics/              # SQL tumbling/sliding windows
│   └── kafka_producer/                 # MSK/Kafka alternative
│
├── batch_processing/                   # ⚙️ Batch ETL
│   ├── glue_jobs/
│   │   ├── bronze_to_silver.py         # PySpark: clean, dedupe, validate
│   │   └── silver_to_gold.py           # PySpark: star schema + KPIs
│   ├── glue_crawlers/                  # Crawler configurations
│   └── emr_jobs/
│       ├── heavy_aggregation.py        # Geospatial + driver scoring + ML
│       └── bootstrap.sh                # EMR dependency installer
│
├── data_warehouse/                     # 🏢 Data Warehouse
│   ├── redshift/
│   │   ├── ddl/schema.sql              # Star schema (6 tables)
│   │   ├── copy_commands.sql           # S3 → Redshift COPY
│   │   ├── analytical_queries.sql      # 6 complex BI queries
│   │   └── spectrum_setup.sql          # Federated S3 queries
│   ├── snowflake/
│   │   ├── setup.sql                   # Warehouse + external stages
│   │   └── pipes.sql                   # Snowpipe + CDC streams
│   └── athena/queries.sql              # Serverless analytics
│
├── dbt_models/                         # 🔀 dbt Transformations
│   ├── dbt_project.yml
│   └── models/
│       ├── staging/stg_rides.sql       # Standardize raw data
│       ├── intermediate/int_ride_metrics.sql  # Derived metrics
│       ├── marts/fct_daily_rides.sql   # Daily aggregated facts
│       ├── marts/dim_drivers.sql       # Driver performance dimension
│       └── schema.yml                  # Tests + column documentation
│
├── orchestration/                      # 🎯 Pipeline Orchestration
│   ├── step_functions/etl_pipeline.json  # AWS state machine
│   ├── airflow/dags/
│   │   ├── uber_etl_dag.py             # Daily batch ETL pipeline
│   │   └── streaming_monitor_dag.py    # 15-min health checks
│   └── eventbridge/rules.json          # Scheduled triggers
│
├── data_quality/                       # ✅ Data Quality
│   └── great_expectations/
│       ├── expectations/               # 19 validation rules
│       ├── checkpoints/                # Daily validation runs
│       └── great_expectations.yml      # S3-based stores
│
├── docker/                             # 🐳 Local Development
│   ├── docker-compose.yml              # LocalStack + Kafka + Airflow + Spark
│   ├── localstack/init.sh              # AWS resource bootstrap
│   └── Dockerfile.etl                  # PySpark container
│
├── .github/workflows/                  # 🚀 CI/CD
│   ├── ci.yml                          # Lint + validate + scan
│   └── deploy.yml                      # Terraform + Lambda + Glue deploy
│
├── docs/images/                        # 📸 Architecture diagrams
├── requirements.txt
├── .gitignore
└── README.md
```

## 🚀 Getting Started

### Prerequisites

- AWS account with appropriate IAM permissions
- Terraform >= 1.5.0
- Python >= 3.12
- Docker & Docker Compose
- AWS CLI configured with credentials

### 1. Clone the Repository

```bash
git clone https://github.com/yourusername/uber-data-engineering-platform.git
cd uber-data-engineering-platform
```

### 2. Local Development (Docker)

```bash
# Start all local services (LocalStack, Kafka, Airflow, Spark, PostgreSQL)
cd docker
docker compose up -d

# Verify LocalStack resources
awslocal s3 ls
awslocal kinesis list-streams
awslocal dynamodb list-tables

# Access Airflow UI at http://localhost:8080 (admin/admin)
```

### 3. Generate Sample Data

```bash
pip install -r requirements.txt

# Preview sample events
cd data_sources
python ride_event_generator.py --mode local --count 5

# Generate 100K batch records
python ride_event_generator.py --mode batch --count 100000 --output ../sample_data/rides.json

# Generate historical Parquet dataset
python historical_data_generator.py --events-per-year 100000 --years 2022 2023 2024
```
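For a feel of what the generator emits, here is a minimal sketch of a synthetic ride event. The field names and value ranges are illustrative assumptions; the authoritative definition lives in `schemas/ride_event_schema.json`:

```python
import json
import random
import uuid
from datetime import datetime, timezone

CITIES = ["new_york", "chicago", "san_francisco"]
VEHICLE_TYPES = ["UberX", "UberXL", "UberBlack", "Pool", "Comfort", "Green"]

def generate_ride_event() -> dict:
    """Produce one synthetic ride event (illustrative fields only)."""
    return {
        "ride_id": str(uuid.uuid4()),
        "event_time": datetime.now(timezone.utc).isoformat(),
        "city": random.choice(CITIES),
        "vehicle_type": random.choice(VEHICLE_TYPES),
        "distance_km": round(random.uniform(1.0, 30.0), 2),
        "fare_usd": round(random.uniform(5.0, 80.0), 2),
        "surge_multiplier": random.choice([1.0, 1.0, 1.2, 1.5, 2.0]),
    }

if __name__ == "__main__":
    for _ in range(3):
        print(json.dumps(generate_ride_event()))
```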

### 4. Deploy to AWS

```bash
# Initialize and deploy infrastructure
cd terraform
terraform init
terraform plan
terraform apply

# Deploy Lambda functions
for func in lambda_kinesis_producer lambda_s3_trigger; do
    cd ../ingestion/$func
    zip -r /tmp/$func.zip handler.py
    aws lambda update-function-code --function-name uber-de-dev-$func --zip-file fileb:///tmp/$func.zip
    cd ../../
done

# Upload Glue scripts
aws s3 sync batch_processing/glue_jobs/ s3://uber-de-dev-scripts/glue/
```

### 5. Run the Pipeline

```bash
# Start streaming events to Kinesis
python data_sources/ride_event_generator.py --mode stream --stream-name uber-de-dev-ride-events --eps 50 --duration 300

# Trigger batch ETL via Step Functions
aws stepfunctions start-execution \
    --state-machine-arn arn:aws:states:us-east-1:ACCOUNT_ID:stateMachine:uber-de-dev-etl-pipeline

# Run dbt models
cd dbt_models
dbt run --target prod
dbt test
```
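At 50 events/second a stream producer benefits from batching: Kinesis caps a single `PutRecords` call at 500 records. A sketch of the batching step (`to_kinesis_batches` is a hypothetical helper, not the project's producer; it only shapes the entries and never touches AWS):

```python
import json

def to_kinesis_batches(events, partition_key_field="city", batch_size=500):
    """Shape events into PutRecords-style batches (Kinesis caps one
    PutRecords call at 500 records). Each batch is a list of entry
    dicts suitable for passing as Records=... to a Kinesis client."""
    entries = [
        {"Data": json.dumps(e).encode("utf-8"),
         "PartitionKey": str(e[partition_key_field])}
        for e in events
    ]
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]

events = [{"ride_id": f"r{i}", "city": "chicago"} for i in range(1200)]
batches = to_kinesis_batches(events)
print([len(b) for b in batches])  # [500, 500, 200]
```

Partitioning by `city` keeps each city's events ordered within a shard, at the cost of potential hot shards for busy cities.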

## 📊 Star Schema

```text
                    ┌──────────────┐
                    │  dim_time    │
                    └──────┬───────┘
                           │
┌──────────────┐   ┌───────▼───────┐   ┌───────────────────┐
│  dim_riders  │───│  fact_rides   │───│ dim_vehicle_types │
└──────────────┘   └───────┬───────┘   └───────────────────┘
                           │
┌──────────────┐   ┌───────▼───────┐
│ dim_drivers  │───│ dim_locations │
└──────────────┘   └───────────────┘
```

| Table | Records | Description |
|-------|---------|-------------|
| fact_rides | Millions | Individual completed ride records with all metrics |
| dim_drivers | Thousands | Driver profiles with performance tiers (PLATINUM→BRONZE) |
| dim_riders | Thousands | Rider profiles with preferences |
| dim_time | ~4,000 | Calendar dimension (2020–2030) |
| dim_vehicle_types | 6 | UberX, UberXL, UberBlack, Pool, Comfort, Green |
| dim_locations | Dynamic | Geographic grid zones |
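The fact table joins to each dimension on a surrogate key. As a self-contained illustration of the star-schema join pattern (SQLite standing in for Redshift, two tiny tables instead of the six above):

```python
import sqlite3

# In-memory toy star schema: one fact table, one dimension
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_vehicle_types (vehicle_type_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_rides (ride_id TEXT, vehicle_type_id INTEGER, fare_usd REAL);
INSERT INTO dim_vehicle_types VALUES (1, 'UberX'), (2, 'UberBlack');
INSERT INTO fact_rides VALUES ('r1', 1, 12.5), ('r2', 1, 20.0), ('r3', 2, 55.0);
""")

# Typical BI query: aggregate facts, label with the dimension attribute
rows = cur.execute("""
    SELECT d.name, COUNT(*) AS rides, ROUND(SUM(f.fare_usd), 2) AS revenue
    FROM fact_rides f JOIN dim_vehicle_types d USING (vehicle_type_id)
    GROUP BY d.name ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('UberBlack', 1, 55.0), ('UberX', 2, 32.5)]
```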

## 📈 Key Analytics & Insights

| Analysis | Description | Tool |
|----------|-------------|------|
| Revenue Trends | Monthly revenue by city with MoM growth | Redshift |
| Peak Hour Surge | Hourly demand with P50/P95 fare distributions | Redshift |
| Driver Scoring | Composite performance with 4-tier classification | EMR Spark |
| Weather Impact | Demand & pricing shifts by weather conditions | Redshift |
| Rider Retention | Weekly cohort analysis with retention curves | Redshift |
| Geospatial Heatmaps | Pickup/dropoff hotspot grid analysis | EMR Spark |
| Route Popularity | Top origin-destination pairs by city | EMR Spark |
| Real-Time Metrics | Live rides/min, surge hotspots per city | Kinesis Analytics |
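The Peak Hour Surge row reports P50/P95 fare distributions; the same percentiles can be sanity-checked locally with Python's `statistics` module (the fares below are made up, not project data):

```python
import statistics

# Ten illustrative fares (USD), sorted for readability
fares = [8.5, 9.0, 11.0, 12.0, 13.5, 15.5, 18.0, 22.5, 40.0, 60.0]

# quantiles(n=100) returns 99 cut points; index 49 is P50, index 94 is P95
cuts = statistics.quantiles(fares, n=100, method="inclusive")
p50, p95 = cuts[49], cuts[94]
print(p50, p95)  # 14.5 51.0
```

Comparing P50 against P95 per hour surfaces surge windows where the tail fares pull far away from the median.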

## 🔄 CI/CD Pipeline

| Stage | Tool | Checks |
|-------|------|--------|
| Lint | flake8, black, isort | Code quality + formatting |
| Validate | Terraform validate | Infrastructure configs |
| Syntax | py_compile | All Python files |
| Schema | JSON/YAML validators | All config files |
| Security | Trivy | Vulnerability scanning |
| Deploy | Terraform apply | Infrastructure provisioning |
| Release | AWS CLI | Lambda + Glue script deployment |

## 📊 Monitoring & Alerting

- **CloudWatch Dashboard:** Lambda metrics, Kinesis throughput & lag, Glue job status
- **Alarms:** consumer lag > 5 min, Lambda error rate > 5%, Glue job failures
- **SNS Notifications:** email/Slack alerts on any pipeline failure
- **Airflow Health DAG:** automated Kinesis + Lambda + DynamoDB checks every 15 minutes
- **Great Expectations:** 19 data quality rules validated daily on the Silver layer
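The actual suite lives under `data_quality/great_expectations/`. As a rough idea of the kind of rule a Silver-layer check enforces, here is a plain-Python sketch (the rules and thresholds are illustrative, not the project's 19 expectations):

```python
def validate_silver_record(rec: dict) -> list[str]:
    """Return a list of failed checks; an empty list means the record passes.
    Illustrative rules only -- the real suite is defined as Great
    Expectations under data_quality/great_expectations/expectations/."""
    failures = []
    if rec.get("fare_usd") is None or rec["fare_usd"] <= 0:
        failures.append("fare_usd must be positive")
    if rec.get("distance_km") is not None and not (0 < rec["distance_km"] < 500):
        failures.append("distance_km out of range")
    if rec.get("city") not in {"new_york", "chicago", "san_francisco"}:
        failures.append("city not in allowed set")
    return failures
```

Running such checks before the Silver write (rather than after) keeps bad records out of every downstream aggregate.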

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

πŸ“ License

This project is licensed under the MIT License.


> ⚠️ **Note:** Replace `ACCOUNT_ID` placeholders with your actual AWS account ID before deploying.


Built with ❤️ for Data Engineering
