Apache Beam Code Practice


🎯 Educational Mission

A comprehensive, vendor-independent Apache Beam learning environment designed for developers, data engineers, and analysts who want to master modern data pipeline engineering through hands-on practice.

Eleven progressive labs (Labs 0-10) with 120+ exercises covering Apache Beam fundamentals, pipeline development, streaming processing, and production deployment. Completely free and open source. Built for learners, by learners.

🎓 Why This Repository?

This educational resource fills the gap between theoretical knowledge and practical skills in Apache Beam and modern data pipeline engineering:

  • Learn by Doing: Progressive hands-on labs build real skills
  • Vendor Independent: Master concepts that apply across all runners
  • Unified Model: Learn the Beam model for batch and streaming
  • Production Patterns: Learn deployment, monitoring, and operations
  • Multi-Language Experience: Work with Python, Java, and SQL
  • Community Driven: Built and improved by the data engineering community

🎓 Learning Approach

Progressive Complexity

Our labs are designed to build knowledge progressively:

  • Beginner (Labs 0-2): Foundation and basic pipeline concepts
  • Intermediate (Labs 3-6): Advanced transforms and I/O
  • Advanced (Labs 7-10): Streaming, state, and production deployment

Hands-On Learning

Each lab includes:

  • Clear Learning Objectives: Know what you'll achieve
  • Step-by-Step Instructions: Guided exercises
  • Real-World Scenarios: Practical pipeline use cases
  • Solution Notebooks: Reference implementations
  • Conceptual Guides: Deep-dive explanations

πŸ—οΈ Architecture

```
┌────────────────────────────────────────────────────────────┐
│                 Apache Beam Code Practice                  │
│             Data Pipeline Learning Environment             │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ┌──────────────────────────────────────────────────────┐  │
│  │ Apache Beam Unified Model                            │  │
│  │ - PCollection abstraction                            │  │
│  │ - Transform functions                                │  │
│  │ - Pipeline I/O connectors                            │  │
│  │ - Windowing and triggers                             │  │
│  └──────────────────────────────────────────────────────┘  │
│                             ↓                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │ Pipeline Development                                 │  │
│  │ - Batch processing patterns                          │  │
│  │ - Streaming processing patterns                      │  │
│  │ - State management                                   │  │
│  │ - Windowing strategies                               │  │
│  └──────────────────────────────────────────────────────┘  │
│                             ↓                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │ Execution Runners                                    │  │
│  │ - DirectRunner (local)                               │  │
│  │ - Dataflow (cloud)                                   │  │
│  │ - Spark (cluster)                                    │  │
│  │ - Flink (streaming)                                  │  │
│  └──────────────────────────────────────────────────────┘  │
│                             ↓                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │ Data Sources & Sinks                                 │  │
│  │ - File systems (GCS, S3, HDFS)                       │  │
│  │ - Pub/Sub, Kafka (streaming)                         │  │
│  │ - Databases (BigQuery, Spanner)                      │  │
│  │ - Custom I/O connectors                              │  │
│  └──────────────────────────────────────────────────────┘  │
│                             ↓                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │ Production Operations                                │  │
│  │ - Pipeline deployment                                │  │
│  │ - Monitoring and logging                             │  │
│  │ - Testing and debugging                              │  │
│  │ - Performance optimization                           │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                            │
└────────────────────────────────────────────────────────────┘
```

πŸ› οΈ Core Stack

Apache Beam

  • Apache Beam: Unified model for batch and streaming
  • Python SDK: Pipeline development with Python
  • Java SDK: Pipeline development with Java
  • SQL Support: Beam SQL for declarative pipelines

Execution Runners

  • DirectRunner: Local execution for development
  • Dataflow: Managed cloud execution
  • Spark: Cluster execution
  • Flink: Streaming execution

Data Connectors

  • File Systems: GCS, S3, HDFS, local files
  • Streaming: Pub/Sub, Kafka
  • Databases: BigQuery, Spanner, JDBC
  • Custom I/O: Extensible connector framework

🎓 Lab Structure

Lab Difficulty & Time Estimates

| Level | Labs | Time per Lab | What It Covers |
| --- | --- | --- | --- |
| Beginner | Labs 0-2 | 30-60 min | Basic setup, pipeline concepts, simple transforms |
| Intermediate | Labs 3-6 | 45-75 min | Advanced transforms, I/O, windowing, state |
| Advanced | Labs 7-10 | 60-120 min | Streaming, production, deployment, monitoring |

Lab 0: Environment Setup

  • Install Apache Beam and dependencies
  • Test pipeline execution locally
  • Validate runner configurations
  • Explore different SDKs

Lab 1: Introduction to Apache Beam

  • Understand Apache Beam fundamentals
  • Learn the Beam model and concepts
  • Explore PCollection and transforms
  • Build your first pipeline

Lab 2: Pipeline Fundamentals

  • Create and execute basic pipelines
  • Understand pipeline I/O
  • Practice common transforms
  • Work with schema and data types

Lab 3: Core Transforms

  • ParDo, Map, Filter, GroupByKey
  • Combine, Flatten, CoGroupByKey
  • Custom transforms and DoFns
  • Pipeline optimization basics

Lab 4: I/O Connectors

  • File system I/O (GCS, S3, local)
  • Database I/O (BigQuery, JDBC)
  • Streaming I/O (Pub/Sub, Kafka)
  • Custom I/O connectors

Lab 5: Windowing and Triggers

  • Fixed and sliding windows
  • Session windows
  • Trigger strategies
  • Late data handling

Lab 6: State and Timers

  • Stateful processing
  • Timers and user state
  • State persistence
  • Checkpointing

Lab 7: Streaming Pipelines

  • Streaming fundamentals
  • Watermarks and event time
  • Streaming I/O patterns
  • Streaming aggregation

Lab 8: Pipeline Testing

  • Unit testing transforms
  • Integration testing pipelines
  • Testing streaming pipelines
  • Test data and fixtures

Lab 9: Production Deployment

  • Dataflow deployment
  • Spark deployment
  • Flink deployment
  • Pipeline templates

Lab 10: Monitoring and Operations

  • Pipeline monitoring
  • Logging and debugging
  • Performance optimization
  • Error handling

💾 Sample Data

The environment includes comprehensive sample datasets for hands-on learning:

Sample Datasets

  • Sample Sales Data: Transaction records for pipeline processing
  • Sample Streaming Data: Event data for streaming pipelines
  • Sample Log Data: Server logs for ETL patterns
  • Sample User Data: User behavior data for analytics

Loading Sample Data

```bash
# Generate and load sample data
python3 scripts/generate_sample_data.py
python3 scripts/load_sample_data.py
```

🚀 Quick Start

🎓 New to Apache Beam?

Follow our recommended learning path:

  1. Start with Fundamentals: Read Apache Beam Fundamentals wiki page
  2. Set Up Environment: Follow Getting Started Guide
  3. Begin Lab 0: Load sample data with Lab 0
  4. Progress Through Labs: Follow the Learning Path

📋 Setup Options

Option 1: Python Environment (Recommended)

```bash
cd beam-code-practice
pip install -r requirements.txt
python3 scripts/setup.py
```

Option 2: Docker Environment

```bash
cd beam-code-practice
docker-compose up -d
```

📋 Requirements

  • Python 3.8+ (for Python SDK)
  • Java 11+ (for Java SDK)
  • pip (Python package manager)
  • 4GB RAM minimum (8GB recommended)
  • 2GB disk space minimum

🔧 Configuration

Python Environment Setup

```bash
# Install the Apache Beam Python SDK
pip install apache-beam

# Install with GCP extras (needed for the Dataflow runner and GCP I/O)
pip install 'apache-beam[gcp]'
```

The Spark and Flink runners are driven through Beam's portability framework (`--runner=SparkRunner`, `--runner=FlinkRunner`) rather than a separate pip extra.

Beam Configuration

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Pipeline options: DirectRunner executes locally; the project, region,
# and temp_location settings only matter when targeting Dataflow.
pipeline_options = PipelineOptions(
    runner='DirectRunner',
    project='your-project',
    region='us-central1',
    temp_location='gs://your-bucket/temp',
)

with beam.Pipeline(options=pipeline_options) as pipeline:
    pipeline | beam.Create([1, 2, 3]) | beam.Map(print)
```

📚 Documentation

📖 Wiki

Comprehensive wiki documentation is available with detailed guides:

🎓 Educational Resources

Wiki Guides (Comprehensive learning materials):

Core Documentation

Lab Materials

💡 Jupyter Notebooks

Interactive Jupyter notebooks for hands-on learning:

🤖 Automation Scripts

🔗 Related Practice Repositories

Continue your learning journey with these related repositories:

AI/ML Practice

Data Engineering Practice

Programming Practice

Resource Hub

🆘 Vendor Independence

This environment uses only Apache 2.0 licensed tools:

  • Apache Beam (Apache 2.0)
  • Python packages (various open source licenses)
  • Jupyter (BSD)
  • Pandas (BSD)

No proprietary cloud services or consoles required.

🤝 Contributing

This is a practice environment for learning. Feel free to extend labs, add examples, or improve the setup process.

Disclaimer: This is an independent educational resource for learning Apache Beam and modern data pipeline engineering. It is not affiliated with, endorsed by, or sponsored by Apache Beam or any vendor.

👥 Community and Learning

This repository is an open educational resource built for the data engineering community. We believe in learning together and sharing knowledge.

🤝 Learning Together

  • 📖 Comprehensive Wiki: Detailed guides and tutorials for all skill levels
  • 💬 GitHub Discussions: Ask questions and share insights with fellow learners
  • 🐛 Issue Tracking: Report bugs and suggest improvements
  • 🔄 Pull Requests: Contribute labs, fixes, and enhancements
  • ⭐ Star the Repo: Show your support and help others discover this resource

🎓 Contributing to Learning

We welcome contributions that improve the educational value:

  • New Labs: Suggest new lab topics and exercises
  • Better Explanations: Improve clarity of existing content
  • Additional Examples: Add more practical examples
  • Translation: Help translate content for global learners
  • Bug Fixes: Report and fix issues in labs or documentation

See CONTRIBUTING.md for detailed contribution guidelines.

📚 Additional Learning Resources

📄 License

Apache License 2.0
