Problem 1: Deduplicate Events by Event ID (Keep First Seen)

You are processing a batch of raw events ingested from an application log.
Each event carries an event_id that is meant to be unique, but retries and upstream issues can cause the same event to appear more than once.

Your task is to deduplicate events using a dictionary, keeping only the first occurrence of each event_id while preserving input order.

🔹 Input Format
events = [
    {"event_id": "e1", "user": "u1", "timestamp": "2025-01-01T10:00"},
    {"event_id": "e2", "user": "u2", "timestamp": "2025-01-01T10:01"},
    {"event_id": "e1", "user": "u1", "timestamp": "2025-01-01T10:02"},
    {"event_id": "e3", "user": "u3", "timestamp": "2025-01-01T10:03"},
    {"event_id": "e2", "user": "u2", "timestamp": "2025-01-01T10:04"},
]


event_id → string (deduplication key)

Input order matters

🔹 Output Format

Return a list of events, deduplicated by event_id, keeping only the first occurrence:

[
    {"event_id": "e1", "user": "u1", "timestamp": "2025-01-01T10:00"},
    {"event_id": "e2", "user": "u2", "timestamp": "2025-01-01T10:01"},
    {"event_id": "e3", "user": "u3", "timestamp": "2025-01-01T10:03"},
]

🔹 Constraints

1 ≤ len(events) ≤ 10^5

Assume all dictionaries contain the event_id key

Do not sort the input

Use Python

Use a dictionary-based approach (no set-only shortcuts)
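
One possible approach to Problem 1, offered as a minimal sketch rather than a reference solution: store each event under its event_id the first time it is seen, and rely on the fact that Python dictionaries (3.7+) preserve insertion order. The function name dedupe_keep_first is an assumption made for illustration.

```python
def dedupe_keep_first(events):
    """Return events deduplicated by event_id, keeping the first occurrence."""
    first_seen = {}  # event_id -> first event dict seen with that id
    for event in events:
        event_id = event["event_id"]
        # Only store the event the first time its id appears.
        if event_id not in first_seen:
            first_seen[event_id] = event
    # dict preserves insertion order, so this is first-seen input order.
    return list(first_seen.values())


sample = [
    {"event_id": "e1", "user": "u1", "timestamp": "2025-01-01T10:00"},
    {"event_id": "e2", "user": "u2", "timestamp": "2025-01-01T10:01"},
    {"event_id": "e1", "user": "u1", "timestamp": "2025-01-01T10:02"},
]
print([e["timestamp"] for e in dedupe_keep_first(sample)])
# ['2025-01-01T10:00', '2025-01-01T10:01']
```

This runs in a single pass with O(n) time and O(k) extra memory, where k is the number of distinct event_ids.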

Problem 2: Deduplicate Records by Composite Key (Keep Latest)

You are processing user activity records coming from multiple upstream systems.
Duplicates can exist for the same (user_id, date), and when they do, you must keep the record with the latest timestamp.

This is a very common daily snapshot / upsert pattern in data pipelines.

🔹 Input Format
records = [
    ("u1", "2025-01-01", "2025-01-01T09:00", 5),
    ("u2", "2025-01-01", "2025-01-01T10:00", 3),
    ("u1", "2025-01-01", "2025-01-01T11:00", 7),
    ("u1", "2025-01-02", "2025-01-02T08:00", 2),
    ("u2", "2025-01-01", "2025-01-01T09:30", 4),
]


Each tuple represents:

(user_id, date, timestamp, value)

🔹 Deduplication Rule

Deduplicate by (user_id, date)

If multiple records exist:

Keep the one with the latest timestamp

Order of output does not matter

🔹 Output Format

Return a list of deduplicated records:

[
    ("u1", "2025-01-01", "2025-01-01T11:00", 7),
    ("u2", "2025-01-01", "2025-01-01T10:00", 3),
    ("u1", "2025-01-02", "2025-01-02T08:00", 2),
]

🔹 Constraints

1 ≤ len(records) ≤ 10^5

Timestamps are ISO-8601 strings (lexicographically comparable)

Python only

Use a dictionary keyed by composite key
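
A minimal sketch of one way to solve Problem 2: key a dictionary by (user_id, date) and overwrite the stored record only when a later timestamp arrives. The function name dedupe_keep_latest is hypothetical, and the tie-breaking choice (on equal timestamps, keep the record seen first) is an assumption, since the problem does not say what to do with ties.

```python
def dedupe_keep_latest(records):
    """Keep, per (user_id, date), the record with the latest timestamp."""
    latest = {}  # (user_id, date) -> record tuple with the latest timestamp
    for record in records:
        user_id, date, timestamp, _value = record
        key = (user_id, date)
        current = latest.get(key)
        # ISO-8601 strings in the same format compare correctly as strings.
        # Assumption: on equal timestamps the earlier-seen record wins.
        if current is None or timestamp > current[2]:
            latest[key] = record
    return list(latest.values())


sample = [
    ("u1", "2025-01-01", "2025-01-01T09:00", 5),
    ("u1", "2025-01-01", "2025-01-01T11:00", 7),
]
print(dedupe_keep_latest(sample))
# [('u1', '2025-01-01', '2025-01-01T11:00', 7)]
```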

Problem 3: Deduplicate Metrics Events (Sum on Duplicate Keys)

You are ingesting raw metric events from an application.
Each event represents a metric emitted for a (service, metric_name) pair.

Due to retries, duplicate keys can appear, and instead of dropping them, you must aggregate (sum) their values.

This is a dedupe + aggregation pattern using a dictionary.

🔹 Input Format
events = [
    ("auth", "requests", 10),
    ("payments", "requests", 5),
    ("auth", "requests", 7),
    ("auth", "errors", 2),
    ("payments", "requests", 3),
]


Each tuple is:

(service, metric_name, count)

🔹 Deduplication Rule

Deduplicate by (service, metric_name)

If duplicates exist:

Sum the count

Output order does not matter

🔹 Output Format

Return a dictionary:

{
    ("auth", "requests"): 17,
    ("payments", "requests"): 8,
    ("auth", "errors"): 2
}

🔹 Constraints

1 ≤ len(events) ≤ 10^5

Counts are positive integers

Use Python

Must use a dictionary-based aggregation
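
A short sketch of the aggregation in Problem 3, assuming a plain dict with dict.get for the default value (collections.defaultdict(int) would work equally well). The function name sum_by_key is made up for this example.

```python
def sum_by_key(events):
    """Sum counts per (service, metric_name) key."""
    totals = {}  # (service, metric_name) -> running sum of counts
    for service, metric_name, count in events:
        key = (service, metric_name)
        # Start from 0 the first time a key is seen, then accumulate.
        totals[key] = totals.get(key, 0) + count
    return totals


sample = [("auth", "requests", 10), ("auth", "requests", 7), ("auth", "errors", 2)]
print(sum_by_key(sample))
# {('auth', 'requests'): 17, ('auth', 'errors'): 2}
```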

Problem 4: Deduplicate Records While Preserving First-Seen Order

You are processing a stream of record IDs coming from a Kafka topic snapshot.
Duplicates may exist, but the first time an ID appears is the one that must be kept, and relative order must be preserved.

This tests whether you understand how to use a dictionary as a seen-set while preserving order.

🔹 Input Format
ids = ["a", "b", "a", "c", "b", "d", "e", "d"]

🔹 Deduplication Rule

Keep only the first occurrence

Preserve original order

Use a dictionary (not a set-only solution)

🔹 Output Format
["a", "b", "c", "d", "e"]

🔹 Constraints

1 ≤ len(ids) ≤ 10^6

IDs are hashable

Python only

Single pass expected
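
For Problem 4, a minimal sketch that uses a dictionary as the seen-set and appends each ID to the result the first time it appears. The name dedupe_ids is an assumption for illustration.

```python
def dedupe_ids(ids):
    """Return ids with duplicates removed, preserving first-seen order."""
    seen = {}    # id -> True, used purely for O(1) membership checks
    result = []
    for item in ids:
        if item not in seen:
            seen[item] = True
            result.append(item)
    return result


print(dedupe_ids(["a", "b", "a", "c", "b", "d", "e", "d"]))
# ['a', 'b', 'c', 'd', 'e']
```

An equivalent one-liner is list(dict.fromkeys(ids)), which relies on the same insertion-order guarantee of Python dictionaries (3.7+).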

Medium Problem 5: Deduplicate Events Within a Time Window (Session-Style)

You are processing application events ordered by timestamp.
Each event has a user_id and a timestamp.

Due to retries, duplicate events from the same user may arrive within a short time window.
You must drop duplicates that occur within N seconds, keeping only the first occurrence per window.

This is a stateful deduplication problem, very common in streaming and ETL systems.

🔹 Input Format
events = [
    ("u1", 100),
    ("u2", 101),
    ("u1", 102),
    ("u1", 108),
    ("u2", 109),
    ("u1", 115),
]


Each tuple is:

(user_id, timestamp)   # timestamp is in seconds, sorted ascending

🔹 Deduplication Rule

Deduplicate per user_id

If the same user appears again within window = 5 seconds:

Drop the event

If it appears after the window, keep it

Keep output in arrival order

🔹 Output Format
[
    ("u1", 100),
    ("u2", 101),
    ("u1", 108),
    ("u2", 109),
    ("u1", 115),
]


Explanation:

("u1", 102) is dropped → within 5 seconds of 100

("u1", 108) is kept → 8 seconds after 100

🔹 Constraints

1 ≤ len(events) ≤ 10^6

Events are time-ordered

Use dictionary for state

Single pass expected

Python only
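
A sketch of one reading of Problem 5. Two details are not pinned down by the problem statement and are treated here as assumptions: the window is measured from the last kept event for that user (not the last seen event), and a gap of exactly `window` seconds counts as "within" the window and is dropped. The function name dedupe_within_window is hypothetical.

```python
def dedupe_within_window(events, window=5):
    """Drop a user's event if it falls within `window` seconds of that
    user's last kept event. Input must be sorted by timestamp."""
    last_kept = {}  # user_id -> timestamp of the last event kept for that user
    kept = []
    for user_id, ts in events:
        previous = last_kept.get(user_id)
        # Assumption: a gap of exactly `window` seconds is still a duplicate.
        if previous is None or ts - previous > window:
            last_kept[user_id] = ts
            kept.append((user_id, ts))
    return kept


sample = [("u1", 100), ("u2", 101), ("u1", 102), ("u1", 108), ("u2", 109), ("u1", 115)]
print(dedupe_within_window(sample))
# [('u1', 100), ('u2', 101), ('u1', 108), ('u2', 109), ('u1', 115)]
```

With the sample input both readings of the window (last kept versus last seen) give the same output, so the example alone does not settle which one is intended.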

Hard Problem 6: Stateful Deduplication with Watermark & Memory Bounds

You are building a streaming deduplication operator for a high-throughput event pipeline.

Each event has:

event_id (string)

event_time (integer, seconds since epoch)

Events may arrive late (out of order), but lateness is bounded.

Your task is to emit only the first occurrence of each event_id and drop duplicates, while ensuring bounded memory usage.

🔹 Input Format
events = [
    ("e1", 100),
    ("e2", 101),
    ("e1", 99),    # late duplicate
    ("e3", 105),
    ("e1", 111),   # duplicate after long time
    ("e4", 112),
]


Each tuple is:

(event_id, event_time)


Events are processed in arrival order, NOT sorted by event_time.

🔹 Deduplication Rules

Emit an event only if its event_id has not been seen before

Maintain a watermark:

watermark = max_event_time_seen - allowed_lateness

You must evict state for event_ids whose first-seen event_time is strictly less than the watermark

If a duplicate arrives after eviction, it is treated as new and emitted again

🔹 Parameters
allowed_lateness = 10  # seconds

🔹 Output Format

Return a list of emitted events, in processing order:

[
    ("e1", 100),
    ("e2", 101),
    ("e3", 105),
    ("e1", 111),
    ("e4", 112),
]

🔹 Constraints

1 ≤ len(events) ≤ 10^6

You must process in one pass

Use dictionary-based state

Memory must remain bounded

Python only

No external libraries
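
A possible sketch of the watermark-based operator, under stated assumptions rather than as a definitive implementation. The dictionary holds the dedup state; a min-heap from the standard-library heapq module is added alongside it so that eviction does not require scanning the whole dictionary (this helper structure is a design choice of the sketch, not something the problem prescribes). It is also assumed that the watermark is advanced and eviction runs before the duplicate check for each arriving event, which is the ordering that reproduces the expected output above. The function name dedupe_with_watermark is hypothetical.

```python
import heapq


def dedupe_with_watermark(events, allowed_lateness=10):
    """Emit the first occurrence of each event_id, evicting state whose
    first-seen event_time is strictly below the watermark."""
    first_seen = {}   # event_id -> event_time of the occurrence that was emitted
    expiry_heap = []  # min-heap of (event_time, event_id) mirroring first_seen
    max_event_time = None
    emitted = []

    for event_id, event_time in events:
        # Advance the watermark using the largest event_time seen so far.
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - allowed_lateness

        # Evict every entry whose first-seen time is strictly below the watermark.
        while expiry_heap and expiry_heap[0][0] < watermark:
            old_time, old_id = heapq.heappop(expiry_heap)
            # Defensive check: only delete if the dict still holds this entry.
            if first_seen.get(old_id) == old_time:
                del first_seen[old_id]

        # Anything not in state (never seen, or already evicted) is emitted.
        if event_id not in first_seen:
            first_seen[event_id] = event_time
            heapq.heappush(expiry_heap, (event_time, event_id))
            emitted.append((event_id, event_time))

    return emitted


sample = [("e1", 100), ("e2", 101), ("e1", 99), ("e3", 105), ("e1", 111), ("e4", 112)]
print(dedupe_with_watermark(sample))
# [('e1', 100), ('e2', 101), ('e3', 105), ('e1', 111), ('e4', 112)]
```

In this sketch the state only retains IDs whose first-seen time is at or above the current watermark (plus, briefly, any very late event just inserted), so memory is bounded by the number of distinct IDs inside the lateness window. Each event costs O(log k) for the heap operations, where k is the current state size.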