**End-to-End ETL Scenario**

Company: Large e-commerce platform
Goal: Build a daily analytics pipeline for user activity + orders

You ingest raw logs, clean them, join datasets, aggregate metrics, and produce a final fact table.

**🔹 PIPELINE OVERVIEW (High-Level)**

Source Systems

Application event logs (strings)

Orders table (structured)

Users table (dimension)

Target Output
Daily user-level metrics table:

(user_id, user_name, total_events, total_orders, total_revenue)

**Question 1: Log Ingestion + Parsing (ETL Stage: Extract)**

Concepts Tested

File Streaming Pattern

String Parsing

Frequency Map (setup)

Problem Statement

You receive raw application logs as strings.
Each log represents a single user event.

Input Format
logs = [
    "2025-01-01|user=1|event=login",
    "2025-01-01|user=2|event=view",
    "2025-01-01|user=1|event=purchase",
    "2025-01-01|user=1|event=logout",
    "2025-01-01|user=3|event=view"
]

Task

Parse the logs and extract structured records as:

(user_id, event_name)

Expected Output
[
    (1, "login"),
    (2, "view"),
    (1, "purchase"),
    (1, "logout"),
    (3, "view")
]

Constraints

Logs arrive as strings

Do not use regex

Assume logs fit in memory (for now)

Maintain input order
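One possible approach, sketched under the stated constraints (no regex, input order preserved), is to split each line on `|` and then on `=`. The function name `parse_logs` is an assumption for illustration. It is written as a generator so the same code could later read from a file handle line by line.

```python
def parse_logs(lines):
    """Yield (user_id, event_name) for each 'date|user=<id>|event=<name>' line."""
    for line in lines:
        # Three pipe-delimited fields: date, "user=<id>", "event=<name>"
        _date, user_field, event_field = line.strip().split("|")
        # partition("=") splits on the first "=" and keeps the value in slot 2
        user_id = int(user_field.partition("=")[2])
        event_name = event_field.partition("=")[2]
        yield user_id, event_name


logs = [
    "2025-01-01|user=1|event=login",
    "2025-01-01|user=2|event=view",
    "2025-01-01|user=1|event=purchase",
    "2025-01-01|user=1|event=logout",
    "2025-01-01|user=3|event=view",
]

# Materialize the generator because the logs are assumed to fit in memory
events = list(parse_logs(logs))
print(events)
```

This sketch assumes every line is well formed; a malformed line raises `ValueError`, which a production pipeline would likely catch and route to a reject file.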

**Question 2: Event Aggregation per User**

ETL Stage: Transform
Difficulty: Easy → Medium
Real-world: Daily user activity metrics

Concepts Tested

Group By List

Frequency Map

Aggregation Map

Problem Statement

From the parsed event records produced in Stage 1, compute event-level metrics per user.

Input Format (from previous stage)
events = [
    (1, "login"),
    (2, "view"),
    (1, "purchase"),
    (1, "logout"),
    (3, "view")
]

Task

Build two outputs:

1️⃣ Grouped events per user
{
    1: ["login", "purchase", "logout"],
    2: ["view"],
    3: ["view"]
}

2️⃣ Total event count per user
{
    1: 3,
    2: 1,
    3: 1
}

Constraints

One pass preferred

Preserve order of events per user

Do not use Counter

Use dictionary-based aggregation only
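A minimal sketch of one way to build both outputs in a single pass with plain dictionaries. The function name `aggregate_events` is hypothetical.

```python
def aggregate_events(events):
    """Return (grouped, counts) built in one pass over (user_id, event_name) pairs."""
    grouped = {}  # user_id -> list of event names, in arrival order
    counts = {}   # user_id -> number of events
    for user_id, event_name in events:
        # setdefault creates the list on first sight of a user, then appends
        grouped.setdefault(user_id, []).append(event_name)
        # get(..., 0) plays the role of Counter without importing it
        counts[user_id] = counts.get(user_id, 0) + 1
    return grouped, counts


events = [
    (1, "login"),
    (2, "view"),
    (1, "purchase"),
    (1, "logout"),
    (3, "view"),
]

grouped, counts = aggregate_events(events)
print(grouped)
print(counts)
```

Because each list is appended to in arrival order, the per-user event order is preserved without any sorting.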

**Question 3: Orders Aggregation (Fact Table Prep)**

ETL Stage: Transform
Difficulty: Easy
Real-world: Daily revenue aggregation

Concepts Tested

Aggregation Map

Group By Key

One-pass ETL logic

Problem Statement

You receive an orders dataset from a transactional system.

Each record represents a completed order.

Input Format
orders = [
    (101, 1, 50.0),
    (102, 1, 30.0),
    (103, 2, 20.0),
    (104, 1, 40.0),
    (105, 4, 100.0)
]


Where each tuple is:

(order_id, user_id, order_amount)

Task

Compute per-user order metrics:

{
    1: {
        "total_orders": 3,
        "total_revenue": 120.0
    },
    2: {
        "total_orders": 1,
        "total_revenue": 20.0
    },
    4: {
        "total_orders": 1,
        "total_revenue": 100.0
    }
}

Constraints

One pass

Do not sort

Use dictionary aggregation

Some users may not appear in the events dataset
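The aggregation could be sketched as follows, assuming the tuple layout given above. The function name `aggregate_orders` is an illustrative choice.

```python
def aggregate_orders(orders):
    """One-pass aggregation of (order_id, user_id, order_amount) tuples per user."""
    metrics = {}
    for _order_id, user_id, amount in orders:
        # Create the per-user accumulator on first sight, then update in place
        row = metrics.setdefault(user_id, {"total_orders": 0, "total_revenue": 0.0})
        row["total_orders"] += 1
        row["total_revenue"] += amount
    return metrics


orders = [
    (101, 1, 50.0),
    (102, 1, 30.0),
    (103, 2, 20.0),
    (104, 1, 40.0),
    (105, 4, 100.0),
]

order_metrics = aggregate_orders(orders)
print(order_metrics)
```

The sketch keeps the float amounts from the input. For real currency values, `decimal.Decimal` is usually a safer accumulator than binary floats.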

**Question 4: User Dimension Join (Left + Inner Join Logic)**

ETL Stage: Transform → Enrich
Difficulty: Easy → Medium
Real-world: Dimension enrichment before fact load

Concepts Tested

Hash Join (Inner)

Left Join

Semi Join (implicitly)

Dictionary lookups

Problem Statement

You are given a users dimension table.

Input Format
Users Dimension
users = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie"),
    (4, "Diana"),
    (5, "Eve")
]

From previous stages

Event counts per user

event_counts = {
    1: 3,
    2: 1,
    3: 1
}


Order metrics per user

order_metrics = {
    1: {"total_orders": 3, "total_revenue": 120.0},
    2: {"total_orders": 1, "total_revenue": 20.0},
    4: {"total_orders": 1, "total_revenue": 100.0}
}

Task

Produce a left-joined enriched dataset where:

Every user appears exactly once

Missing metrics are filled with 0

Join is done using hash maps

Output Format
[
    (1, "Alice", 3, 3, 120.0),
    (2, "Bob", 1, 1, 20.0),
    (3, "Charlie", 1, 0, 0.0),
    (4, "Diana", 0, 1, 100.0),
    (5, "Eve", 0, 0, 0.0)
]


Where each record is:

(user_id, user_name, total_events, total_orders, total_revenue)

Constraints

Preserve users table order

No nested loops

Use dictionary lookups only

This is a LEFT JOIN on users
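A sketch of the left join using only dictionary lookups: the users list drives the loop, so every user appears exactly once and in the original order. The function name `enrich_users` is an assumption.

```python
def enrich_users(users, event_counts, order_metrics):
    """LEFT JOIN users to both metric maps, filling missing metrics with zeros."""
    no_orders = {"total_orders": 0, "total_revenue": 0.0}
    enriched = []
    for user_id, user_name in users:
        # dict.get is the hash-join probe; the default is the "NULL -> 0" fill
        orders = order_metrics.get(user_id, no_orders)
        enriched.append((
            user_id,
            user_name,
            event_counts.get(user_id, 0),
            orders["total_orders"],
            orders["total_revenue"],
        ))
    return enriched


users = [(1, "Alice"), (2, "Bob"), (3, "Charlie"), (4, "Diana"), (5, "Eve")]
event_counts = {1: 3, 2: 1, 3: 1}
order_metrics = {
    1: {"total_orders": 3, "total_revenue": 120.0},
    2: {"total_orders": 1, "total_revenue": 20.0},
    4: {"total_orders": 1, "total_revenue": 100.0},
}

enriched_users = enrich_users(users, event_counts, order_metrics)
print(enriched_users)
```

An inner join would instead skip any user missing from a metric map (`if user_id not in order_metrics: continue`); the lookup cost stays O(1) per user either way.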

**Question 5: Data Quality Checks + Anti Join (Pre-Load Validation)**

ETL Stage: Transform → Validate → Load-ready
Difficulty: Easy (conceptual but critical)
Real-world: Preventing bad facts from landing in warehouse

Concepts Tested

Anti Join

Semi Join (existence logic)

Dedupe with Dictionary

Production ETL thinking

Problem Statement

Before loading the final fact table, you must perform data quality checks.

You are given:

Enriched dataset (from previous stage)
enriched_users = [
    (1, "Alice", 3, 3, 120.0),
    (2, "Bob", 1, 1, 20.0),
    (3, "Charlie", 1, 0, 0.0),
    (4, "Diana", 0, 1, 100.0),
    (5, "Eve", 0, 0, 0.0)
]

Valid users list from identity system
active_users = {1, 2, 3, 4}

Tasks
1️⃣ Anti Join: Invalid Users

Identify users that should NOT be loaded because they are no longer active.

Output

[
    (5, "Eve", 0, 0, 0.0)
]

2️⃣ Final Load Dataset

Return only valid users for loading.

[
    (1, "Alice", 3, 3, 120.0),
    (2, "Bob", 1, 1, 20.0),
    (3, "Charlie", 1, 0, 0.0),
    (4, "Diana", 0, 1, 100.0)
]

Constraints

Use set-based or dict-based lookups

Preserve original order

No filtering via list scans inside loops

Think WHERE EXISTS / NOT EXISTS
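Both tasks can be sketched in one pass with set membership tests, which mirror `WHERE EXISTS` (semi join) and `WHERE NOT EXISTS` (anti join). The function name `split_by_active` is hypothetical.

```python
def split_by_active(enriched_users, active_users):
    """Return (valid_rows, invalid_rows), each in the original input order."""
    valid, invalid = [], []
    for row in enriched_users:
        user_id = row[0]
        # Set lookup is O(1): semi join keeps the row, anti join rejects it
        if user_id in active_users:
            valid.append(row)
        else:
            invalid.append(row)
    return valid, invalid


enriched_users = [
    (1, "Alice", 3, 3, 120.0),
    (2, "Bob", 1, 1, 20.0),
    (3, "Charlie", 1, 0, 0.0),
    (4, "Diana", 0, 1, 100.0),
    (5, "Eve", 0, 0, 0.0),
]
active_users = {1, 2, 3, 4}

valid_rows, invalid_rows = split_by_active(enriched_users, active_users)
print(invalid_rows)
print(valid_rows)
```

If `active_users` arrived as a list, converting it to a set once before the loop would avoid a list scan per row.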

**Question 6: Windowed Deduplication + Late Events (Streaming ETL)**

ETL Stage: Stream → Stateful Transform
Difficulty: Medium → Hard
Real-world: Kafka / Kinesis / Spark Structured Streaming

Concepts Tested

Dedup Within Window

Sliding Window

Stateful Aggregation

Late Event Handling

Dictionary + deque logic

Scenario

User events now arrive as a stream, not a batch.

Each event contains:

(event_id, user_id, event_type, event_time)


Events may arrive out of order and may be duplicated.

Input Format (arrival order)
events = [
    ("e1", 1, "login",  100),
    ("e2", 1, "view",   102),
    ("e1", 1, "login",  103),   # duplicate (same event_id)
    ("e3", 2, "view",   104),
    ("e4", 1, "purchase", 95),  # late event
    ("e5", 1, "logout", 108)
]

Business Rules
1️⃣ Deduplication Rule

Events are uniquely identified by event_id

Deduplicate within a 10-second window

Older duplicates outside the window can be ignored

2️⃣ Late Event Rule

Accept late events up to 5 seconds

Drop events older than (current_event_time - 5)

3️⃣ Aggregation Rule

For each user, compute:

total_valid_events

Expected Output

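Final aggregated result after processing the stream. The duplicate e1 is discarded, and e4 (time 95) is dropped because it is more than 5 seconds behind the latest event time seen (104), so user 1 ends with three valid events:

{
    1: 3,
    2: 1
}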

Constraints

You cannot store all events forever

Must simulate state cleanup

Assume single partition (no distributed sync)

Python only (no Spark code yet)

Use:

dict for state

deque for sliding window cleanup
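A minimal single-partition sketch of the rules above. It rests on two assumptions that the problem statement leaves open: "current_event_time" is read as the highest event_time seen so far, and the dedup window is measured against that same stream time. Under that reading, e4 (time 95) falls behind the cutoff of 104 - 5 = 99 and is dropped. The names `process_stream`, `DEDUP_WINDOW`, and `MAX_LATENESS` are illustrative.

```python
from collections import deque

DEDUP_WINDOW = 10  # seconds an event_id is remembered for deduplication
MAX_LATENESS = 5   # seconds an event may trail the newest event_time seen


def process_stream(events, dedup_window=DEDUP_WINDOW, max_lateness=MAX_LATENESS):
    """Count valid events per user with bounded dedup state."""
    counts = {}          # user_id -> total_valid_events
    seen = {}            # event_id -> stream time when it was first accepted
    order = deque()      # accepted event_ids, oldest first, for cheap eviction
    current_time = None  # highest event_time observed so far (assumption)

    for event_id, user_id, _event_type, event_time in events:
        if current_time is None or event_time > current_time:
            current_time = event_time

        # State cleanup: forget ids that have left the dedup window
        while order and seen[order[0]] < current_time - dedup_window:
            del seen[order.popleft()]

        # Late event rule: drop anything too far behind the stream time
        if event_time < current_time - max_lateness:
            continue
        # Dedup rule: same event_id already accepted inside the window
        if event_id in seen:
            continue

        seen[event_id] = current_time
        order.append(event_id)
        counts[user_id] = counts.get(user_id, 0) + 1

    return counts


events = [
    ("e1", 1, "login", 100),
    ("e2", 1, "view", 102),
    ("e1", 1, "login", 103),    # duplicate (same event_id)
    ("e3", 2, "view", 104),
    ("e4", 1, "purchase", 95),  # late event
    ("e5", 1, "logout", 108),
]

result = process_stream(events)
print(result)
```

Because `seen` records the stream time at acceptance, which never decreases, the deque stays ordered and eviction only ever needs to look at its left end.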

**Question 7: Streaming Join + Windowed Aggregation + Data Skew**

ETL Stage: Stream → Join → Window → Aggregate
Difficulty: Hard
Real-world: MANG production analytics pipeline

Scenario

You are building a real-time metrics pipeline for a large e-commerce platform.

Two streaming sources
1️⃣ User Events Stream
(event_id, user_id, event_type, event_time)

2️⃣ Orders Stream
(order_id, user_id, order_amount, order_time)

Input Streams (arrival order)
events = [
    ("e1", 1, "view", 100),
    ("e2", 1, "purchase", 105),
    ("e3", 2, "view", 106),
    ("e4", 1, "view", 107),
    ("e5", 999, "view", 108),   # hot key (skewed user)
    ("e6", 999, "purchase", 109),
]

orders = [
    ("o1", 1, 50.0, 106),
    ("o2", 999, 20.0, 110),
    ("o3", 1, 30.0, 111)
]

Business Requirements
1️⃣ Streaming Join (Time-bounded)

Join events ↔ orders on user_id

Only join if:

|event_time - order_time| ≤ 5 seconds
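The time-bounded join could be sketched as a symmetric hash join with per-user buffers that are trimmed as time advances. The problem does not say how the two streams interleave, so this sketch assumes they are merged by timestamp to simulate a single arrival order. The names `interval_join` and `JOIN_WINDOW` are hypothetical.

```python
from collections import defaultdict, deque

JOIN_WINDOW = 5  # seconds


def interval_join(events, orders, window=JOIN_WINDOW):
    """Return (user_id, event_id, order_id) for pairs within `window` seconds."""
    # Assumption: simulate one interleaved stream by merging both inputs on time
    merged = sorted(
        [("event", e[0], e[1], e[3]) for e in events]
        + [("order", o[0], o[1], o[3]) for o in orders],
        key=lambda r: r[3],
    )
    # One buffer per side: user_id -> deque of (timestamp, record_id)
    buffers = {"event": defaultdict(deque), "order": defaultdict(deque)}
    matches = []
    for side, rec_id, user_id, ts in merged:
        other_side = "order" if side == "event" else "event"
        own, other = buffers[side][user_id], buffers[other_side][user_id]
        # Evict entries that can no longer match anything at or after ts
        for buf in (own, other):
            while buf and buf[0][0] < ts - window:
                buf.popleft()
        # Everything left in the opposite buffer is within the time bound
        for _other_ts, other_id in other:
            pair = (rec_id, other_id) if side == "event" else (other_id, rec_id)
            matches.append((user_id, *pair))
        own.append((ts, rec_id))
    return matches


events = [
    ("e1", 1, "view", 100),
    ("e2", 1, "purchase", 105),
    ("e3", 2, "view", 106),
    ("e4", 1, "view", 107),
    ("e5", 999, "view", 108),
    ("e6", 999, "purchase", 109),
]
orders = [
    ("o1", 1, 50.0, 106),
    ("o2", 999, 20.0, 110),
    ("o3", 1, 30.0, 111),
]

matches = interval_join(events, orders)
print(matches)
```

Each buffer holds at most the records from the last 5 seconds for a user who is still receiving traffic. Buffers for users who go silent are not purged here; a real system would expire them on a timer or watermark.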

2️⃣ Windowed Aggregation

Compute per-user metrics in a 10-second tumbling window:

(user_id,
 total_events,
 total_orders,
 total_revenue)

3️⃣ Late Event Handling

Allowed lateness: 3 seconds

Drop anything older than the watermark
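A sketch of a 10-second tumbling window with a watermark, assuming the watermark is the highest timestamp seen minus the allowed lateness, and that window boundaries fall on multiples of 10. It windows the raw events and orders directly; how the join output should feed the aggregation is not specified, so that wiring is left out. The names `tumbling_aggregate`, `WINDOW_SIZE`, and `ALLOWED_LATENESS` are illustrative.

```python
WINDOW_SIZE = 10      # seconds per tumbling window
ALLOWED_LATENESS = 3  # seconds behind the newest timestamp seen


def tumbling_aggregate(records, size=WINDOW_SIZE, lateness=ALLOWED_LATENESS):
    """records: (kind, user_id, amount, ts) tuples in arrival order.

    Returns (window_start, user_id, total_events, total_orders, total_revenue).
    """
    open_windows = {}  # window_start -> {user_id: [events, orders, revenue]}
    emitted = []
    max_ts = None
    for kind, user_id, amount, ts in records:
        max_ts = ts if max_ts is None else max(max_ts, ts)
        watermark = max_ts - lateness
        if ts < watermark:
            continue  # too late: drop instead of reopening old state
        start = (ts // size) * size
        row = open_windows.setdefault(start, {}).setdefault(user_id, [0, 0, 0.0])
        if kind == "event":
            row[0] += 1
        else:
            row[1] += 1
            row[2] += amount
        # Emit and free every window that ends at or before the watermark
        for closed in sorted(w for w in open_windows if w + size <= watermark):
            for uid, (n_ev, n_ord, rev) in open_windows.pop(closed).items():
                emitted.append((closed, uid, n_ev, n_ord, rev))
    # End of input: flush whatever is still open
    for start in sorted(open_windows):
        for uid, (n_ev, n_ord, rev) in open_windows[start].items():
            emitted.append((start, uid, n_ev, n_ord, rev))
    return emitted


events = [
    ("e1", 1, "view", 100), ("e2", 1, "purchase", 105), ("e3", 2, "view", 106),
    ("e4", 1, "view", 107), ("e5", 999, "view", 108), ("e6", 999, "purchase", 109),
]
orders = [("o1", 1, 50.0, 106), ("o2", 999, 20.0, 110), ("o3", 1, 30.0, 111)]

# Assumption: interleave both streams by timestamp to simulate arrival order
records = sorted(
    [("event", u, 0.0, ts) for _id, u, _type, ts in events]
    + [("order", u, amt, ts) for _id, u, amt, ts in orders],
    key=lambda r: r[3],
)
rows = tumbling_aggregate(records)
print(rows)
```

Closed windows are popped from `open_windows` as soon as the watermark passes them, so the state held at any moment is limited to the windows that can still receive data.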

4️⃣ Skew Constraint (VERY IMPORTANT)

user_id = 999 represents a hot key

You must design the solution so:

State does NOT explode

One user does NOT bottleneck the pipeline
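One common way to meet the skew constraint is key salting with two-stage aggregation: a hot key is spread across several sub-keys for the first (parallel) stage, and the small partial results are merged afterwards. The sketch below simulates this in a single process. It assumes the hot keys are known in advance, and the names `HOT_KEYS`, `NUM_SALTS`, `salted_key`, `partial_counts`, and `merge_partials` are hypothetical.

```python
import zlib
from collections import defaultdict

HOT_KEYS = {999}  # assumption: hot users are known ahead of time
NUM_SALTS = 4     # number of sub-keys each hot user is spread across


def salted_key(user_id, record_id):
    """Map a record to (user_id, salt); only hot keys get a non-trivial salt."""
    if user_id not in HOT_KEYS:
        return (user_id, 0)
    # crc32 is stable across runs, unlike the built-in hash() for strings
    return (user_id, zlib.crc32(record_id.encode()) % NUM_SALTS)


def partial_counts(events):
    """Stage 1: count per salted key (in a cluster, one worker per key)."""
    partials = defaultdict(int)
    for event_id, user_id, _event_type, _event_time in events:
        partials[salted_key(user_id, event_id)] += 1
    return partials


def merge_partials(partials):
    """Stage 2: collapse the salts to get the final per-user totals."""
    totals = defaultdict(int)
    for (user_id, _salt), count in partials.items():
        totals[user_id] += count
    return dict(totals)


events = [
    ("e1", 1, "view", 100),
    ("e2", 1, "purchase", 105),
    ("e3", 2, "view", 106),
    ("e4", 1, "view", 107),
    ("e5", 999, "view", 108),
    ("e6", 999, "purchase", 109),
]

partials = partial_counts(events)
totals = merge_partials(partials)
print(totals)
```

State stays bounded because each hot user holds at most `NUM_SALTS` running counters rather than a buffer of raw records, and the merge stage only sees those few partial rows.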