**Question 1 — Frequency Map (Easy)**

Problem Statement
You are given a list of event names representing user actions in an application.
Your task is to build a frequency map that counts how many times each event occurs.

This simulates counting event types in application logs before aggregation.

Input Format

events = [
    "login",
    "view",
    "login",
    "purchase",
    "view",
    "login"
]


Output Format
Return a dictionary where:

key = event name

value = number of occurrences

{
    "login": 3,
    "view": 2,
    "purchase": 1
}


Constraints

1 <= len(events) <= 10^5

Event names are non-empty strings

Case-sensitive ("Login" ≠ "login")

Do not use collections.Counter for this question

In [0]:
events = [
    "login",
    "view",
    "login",
    "purchase",
    "view",
    "login"
]

# Frequency map: event name -> count
freq = {}

for e in events:
    # .get supplies 0 the first time an event name is seen
    freq[e] = freq.get(e, 0) + 1
freq

**Question 2 — Group By List (Easy)**

Problem Statement
You are given a list of user activity records.
Each record contains a user_id and an activity_name.

Your task is to group activities by user, producing a dictionary where each user maps to the list of activities they performed in input order.

This simulates grouping raw clickstream logs before aggregation.

Input Format

activities = [
    (1, "login"),
    (2, "view"),
    (1, "purchase"),
    (1, "logout"),
    (2, "purchase")
]


Output Format
Return a dictionary:

{
    1: ["login", "purchase", "logout"],
    2: ["view", "purchase"]
}


Constraints

1 <= len(activities) <= 10^5

user_id is an integer

Maintain original order of activities per user

Do not sort

Use basic dictionary logic (no pandas)

In [0]:
activities = [
    (1, "login"),
    (2, "view"),
    (1, "purchase"),
    (1, "logout"),
    (2, "purchase")
]

from collections import defaultdict
agg = defaultdict(list)

for user_id, activity in activities:
    agg[user_id].append(activity)
dict(agg)  # plain dict, matching the output format

**Question 3 — Aggregation Map (Easy)**

Problem Statement
You are given a list of transaction records.
Each record contains a store_id and an amount.

Your task is to compute the total sales per store using an aggregation map.

This mirrors a basic ETL aggregation step before reporting.

Input Format

transactions = [
    (1, 100.0),
    (2, 50.0),
    (1, 25.0),
    (2, 75.0),
    (3, 40.0)
]


Output Format
Return a dictionary where:

key = store_id

value = total sales amount

{
    1: 125.0,
    2: 125.0,
    3: 40.0
}


Constraints

1 <= len(transactions) <= 10^5

store_id is an integer

amount is a positive float

Do not use pandas

One-pass solution expected

In [0]:
transactions = [
    (1, 100.0),
    (2, 50.0),
    (1, 25.0),
    (2, 75.0),
    (3, 40.0)
]

# Aggregation map: store_id -> running total
agg = {}

for store_id, amount in transactions:
    agg[store_id] = agg.get(store_id, 0) + amount
agg

**Question 4 — Inner Join (Easy)**

Problem Statement
You are given two datasets:

users: basic user information

orders: purchase records

Your task is to perform an inner join on user_id, returning only users who have at least one order.

This simulates a hash join between dimension and fact tables.

Input Format

users = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie")
]

orders = [
    (1, "2025-01-01"),
    (1, "2025-01-05"),
    (3, "2025-01-03")
]


Output Format
Return a list of tuples:

[
    (1, "Alice", "2025-01-01"),
    (1, "Alice", "2025-01-05"),
    (3, "Charlie", "2025-01-03")
]


Constraints

Preserve order of orders

Use a hash map (dictionary) for the join

Time complexity should be O(n + m)

No nested loops over full datasets

In [0]:
users = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie")
]

orders = [
    (1, "2025-01-01"),
    (1, "2025-01-05"),
    (3, "2025-01-03")
]

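# Build side of the hash join: user_id -> name
user_map = {user_id: name for user_id, name in users}

res = []
for user_id, order_date in orders:
    # Inner join: skip orders whose user_id has no match in users
    if user_id in user_map:
        res.append((user_id, user_map[user_id], order_date))
res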


**Question 5 — Left Join (Easy)**

Problem Statement
You are given two datasets:

users: master user table

logins: login activity table

Your task is to perform a left join on user_id.

For every user, attach their most recent login date.
If a user has never logged in, the login date should be None.

This mirrors a common dimension enrichment step in ETL pipelines.

Input Format

users = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie"),
    (4, "Diana")
]

logins = [
    (1, "2025-01-01"),
    (1, "2025-01-05"),
    (3, "2025-01-03")
]


Output Format
Return a list of tuples:

[
    (1, "Alice", "2025-01-05"),
    (2, "Bob", None),
    (3, "Charlie", "2025-01-03"),
    (4, "Diana", None)
]


Constraints

Preserve order of users

Use a dictionary to pre-aggregate login dates

Dates are ISO strings (YYYY-MM-DD)

No sorting required

Expected time complexity: O(n + m)

In [0]:
users = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie"),
    (4, "Diana")
]

logins = [
    (1, "2025-01-01"),
    (1, "2025-01-05"),
    (3, "2025-01-03")
]

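# Pre-aggregate: user_id -> most recent login date.
# ISO dates (YYYY-MM-DD) sort chronologically as plain strings, so max() works.
lookup = {}
for i, d in logins:
    lookup[i] = max(d, lookup.get(i, d))

res = []

# Left join: every user is kept; users without logins get None
for i, n in users:
    res.append(
        (i, n, lookup.get(i))
    )
res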

**Question 6 — Semi Join (Easy)**

Problem Statement
You are given two datasets:

users: list of all users

orders: order records, one (user_id, order_id) tuple per order

Your task is to perform a semi join:

Return only users who have at least one order

Do not include order details

This is commonly used for existence filtering in SQL (WHERE EXISTS).

Input Format

users = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie"),
    (4, "Diana")
]

orders = [
    (1, "o101"),
    (1, "o102"),
    (3, "o201")
]


Output Format

[
    (1, "Alice"),
    (3, "Charlie")
]


Constraints

Preserve order of users

Use a hash-based lookup

Do not duplicate users

Time complexity: O(n + m)

In [0]:
users = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie"),
    (4, "Diana")
]

orders = [
    (1, "o101"),
    (1, "o102"),
    (3, "o201")
]

# Set of user_ids with at least one order; the set also collapses repeat orders
index = {user_id for user_id, _ in orders}
semi = [(i, n) for i, n in users if i in index]
semi

**Question 7 — Anti Join (Easy)**

Problem Statement
You are given two datasets:

users: list of all users

orders: order records, one (user_id, order_id) tuple per order

Your task is to perform an anti join:

Return only users who have never placed an order

This is the logical inverse of a semi join and maps directly to
WHERE NOT EXISTS in SQL.

Input Format

users = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie"),
    (4, "Diana")
]

orders = [
    (1, "o101"),
    (1, "o102"),
    (3, "o201")
]


Output Format

[
    (2, "Bob"),
    (4, "Diana")
]


Constraints

Preserve order of users

Use hash lookup

No nested loops

Time complexity: O(n + m)

In [0]:
users = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie"),
    (4, "Diana")
]

orders = [
    (1, "o101"),
    (1, "o102"),
    (3, "o201")
]

# A set gives O(1) membership tests; a list here would make the join O(n * m)
index = {user_id for user_id, _ in orders}
anti = [(i, n) for i, n in users if i not in index]
anti

**Question 1 — User Activity Metrics with Joins & Aggregations (Medium)**

You are building a daily analytics job for a product team.

You are given three datasets:

Input
users = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "IN"},
    {"user_id": 3, "country": "US"},
    {"user_id": 4, "country": "CA"},
]

events = [
    {"user_id": 1, "event": "login"},
    {"user_id": 1, "event": "purchase"},
    {"user_id": 2, "event": "login"},
    {"user_id": 2, "event": "login"},
    {"user_id": 5, "event": "login"},   # user not in users table
]

banned_users = {2}

Requirements

Inner Join users ↔ events on user_id

Anti Join: exclude any user present in banned_users

Group By List: group remaining records by country

Aggregation Map:

total_events per country

per-event frequency per country (frequency map)

Output Format

Return a dictionary in the following structure:

{
  "US": {
      "total_events": 2,
      "event_counts": {
          "login": 1,
          "purchase": 1
      }
  },
  "IN": {
      "total_events": 0,
      "event_counts": {}
  },
  "CA": {
      "total_events": 0,
      "event_counts": {}
  }
}


Every country in users must appear in the output, even if it has no valid events after the joins (Left Join behavior at country level). IN has zero counts because its only user is banned.

Constraints

Assume inputs fit in memory

Do not use pandas, SQL, or external libraries

Use only core Python data structures

Time complexity should be better than O(n²)
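
One possible approach, sketched under stated assumptions: the function name `country_event_metrics` and its local variable names are illustrative and not part of the problem. The sketch walks through the four requirements as separate stages and seeds the grouping with every country in `users`, so a country whose only user is banned (`IN` in the sample) is reported with zero counts, following the zero-count rule stated above.

```python
def country_event_metrics(users, events, banned_users):
    """Per-country event totals and per-event frequencies (illustrative sketch)."""
    # Build side of the hash join: user_id -> country.
    country_of = {u["user_id"]: u["country"] for u in users}

    # Inner join + anti join: keep events of known, non-banned users.
    joined = [
        (country_of[ev["user_id"]], ev["event"])
        for ev in events
        if ev["user_id"] in country_of and ev["user_id"] not in banned_users
    ]

    # Group by list, seeded with every country (left-join behaviour).
    grouped = {country: [] for country in country_of.values()}
    for country, event in joined:
        grouped[country].append(event)

    # Aggregation map: total plus a frequency map per country.
    report = {}
    for country, names in grouped.items():
        counts = {}
        for name in names:
            counts[name] = counts.get(name, 0) + 1
        report[country] = {"total_events": len(names), "event_counts": counts}
    return report


users = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "IN"},
    {"user_id": 3, "country": "US"},
    {"user_id": 4, "country": "CA"},
]
events = [
    {"user_id": 1, "event": "login"},
    {"user_id": 1, "event": "purchase"},
    {"user_id": 2, "event": "login"},
    {"user_id": 2, "event": "login"},
    {"user_id": 5, "event": "login"},   # user not in users table
]
banned_users = {2}

result = country_event_metrics(users, events, banned_users)
```

Each stage touches every record once, so the whole job is linear in `len(users) + len(events)`. The stages could be fused into a single loop; they are kept apart here only to mirror the requirement list.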

**Question 2 — Order Analytics with Join Semantics (Medium+)**

You are implementing an analytics step in an ETL pipeline for an e-commerce system.

Input
customers = [
    {"customer_id": 1, "tier": "gold"},
    {"customer_id": 2, "tier": "silver"},
    {"customer_id": 3, "tier": "gold"},
    {"customer_id": 4, "tier": "bronze"},
]

orders = [
    {"order_id": 101, "customer_id": 1, "amount": 120},
    {"order_id": 102, "customer_id": 1, "amount": 80},
    {"order_id": 103, "customer_id": 2, "amount": 50},
    {"order_id": 104, "customer_id": 5, "amount": 200},  # no matching customer
]

fraud_customers = {2}

Requirements

Inner Join customers ↔ orders on customer_id

Anti Join: exclude customers present in fraud_customers

Group By List: group records by tier

Aggregation Map per tier:

total_orders

total_revenue

order_count_by_amount_bucket, where buckets are:

"low" : amount < 100

"high" : amount >= 100

Left Join behavior at tier level:

All tiers from customers must appear in output, even if they have zero valid orders

Output Format
{
  "gold": {
      "total_orders": 2,
      "total_revenue": 200,
      "order_count_by_amount_bucket": {
          "low": 1,
          "high": 1
      }
  },
  "silver": {
      "total_orders": 0,
      "total_revenue": 0,
      "order_count_by_amount_bucket": {}
  },
  "bronze": {
      "total_orders": 0,
      "total_revenue": 0,
      "order_count_by_amount_bucket": {}
  }
}

Constraints

Python only (no pandas / SQL)

Prefer dictionaries / defaultdict

Must be linear or near-linear time

Clean, production-quality logic
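
A minimal single-pass sketch of the requirements above. The function name `tier_order_metrics` and the local names are hypothetical; the join, anti join, grouping, and aggregation are fused into one loop over `orders`, which is one reasonable design among several.

```python
def tier_order_metrics(customers, orders, fraud_customers):
    """Per-tier order count, revenue, and amount-bucket counts (illustrative sketch)."""
    # Build side of the hash join: customer_id -> tier.
    tier_of = {c["customer_id"]: c["tier"] for c in customers}

    # Left-join behaviour: every tier starts with zeroed metrics.
    report = {
        tier: {
            "total_orders": 0,
            "total_revenue": 0,
            "order_count_by_amount_bucket": {},
        }
        for tier in tier_of.values()
    }

    for order in orders:
        cid = order["customer_id"]
        # Inner join drops unknown customers; anti join drops fraud customers.
        if cid not in tier_of or cid in fraud_customers:
            continue
        stats = report[tier_of[cid]]
        stats["total_orders"] += 1
        stats["total_revenue"] += order["amount"]
        bucket = "low" if order["amount"] < 100 else "high"
        buckets = stats["order_count_by_amount_bucket"]
        buckets[bucket] = buckets.get(bucket, 0) + 1
    return report


customers = [
    {"customer_id": 1, "tier": "gold"},
    {"customer_id": 2, "tier": "silver"},
    {"customer_id": 3, "tier": "gold"},
    {"customer_id": 4, "tier": "bronze"},
]
orders = [
    {"order_id": 101, "customer_id": 1, "amount": 120},
    {"order_id": 102, "customer_id": 1, "amount": 80},
    {"order_id": 103, "customer_id": 2, "amount": 50},
    {"order_id": 104, "customer_id": 5, "amount": 200},  # no matching customer
]
fraud_customers = {2}

result = tier_order_metrics(customers, orders, fraud_customers)
```

Seeding `report` before the loop is what gives the left-join behaviour: `silver` and `bronze` keep their zeroed entries because no surviving order ever updates them.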

**Question 3 — Multi-Join User Engagement Rollup (Medium+)**

You are building a daily engagement rollup for analytics.

Input
users = [
    {"user_id": 1, "region": "NA"},
    {"user_id": 2, "region": "EU"},
    {"user_id": 3, "region": "NA"},
    {"user_id": 4, "region": "APAC"},
]

sessions = [
    {"session_id": "s1", "user_id": 1},
    {"session_id": "s2", "user_id": 1},
    {"session_id": "s3", "user_id": 2},
    {"session_id": "s4", "user_id": 5},   # unknown user
]

events = [
    {"session_id": "s1", "event": "view"},
    {"session_id": "s1", "event": "click"},
    {"session_id": "s2", "event": "view"},
    {"session_id": "s3", "event": "view"},
    {"session_id": "s3", "event": "purchase"},
]

blocked_users = {2}

Requirements

Inner Join users ↔ sessions on user_id

Inner Join result ↔ events on session_id

Anti Join: exclude users in blocked_users

Group By List: group final records by region

Aggregation Map per region:

total_events

unique_sessions

event_frequency (frequency map of events)

Left Join behavior at region level:

All regions from users must appear, even if they have zero valid events

Output Format
{
  "NA": {
      "total_events": 3,
      "unique_sessions": 2,
      "event_frequency": {
          "view": 2,
          "click": 1
      }
  },
  "EU": {
      "total_events": 0,
      "unique_sessions": 0,
      "event_frequency": {}
  },
  "APAC": {
      "total_events": 0,
      "unique_sessions": 0,
      "event_frequency": {}
  }
}

Constraints

Core Python only

No nested O(n²) joins

Prefer hash maps

Assume data fits in memory
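
The two chained joins can be sketched with two hash indexes that are probed once per event. The names `region_engagement`, `region_of`, and `user_of_session` are illustrative. The sketch also assumes that `unique_sessions` counts only sessions with at least one surviving event, which matches the sample output.

```python
def region_engagement(users, sessions, events, blocked_users):
    """Per-region event totals, distinct sessions, and event frequencies (sketch)."""
    # Two build-side indexes, one per join key.
    region_of = {u["user_id"]: u["region"] for u in users}
    user_of_session = {s["session_id"]: s["user_id"] for s in sessions}

    # Left-join behaviour: every region starts empty.
    # unique_sessions is collected as a set and turned into a count at the end.
    report = {
        region: {"total_events": 0, "unique_sessions": set(), "event_frequency": {}}
        for region in region_of.values()
    }

    for ev in events:
        sid = ev["session_id"]
        uid = user_of_session.get(sid)  # None when the session is unknown
        # Inner joins drop unknown sessions/users; anti join drops blocked users.
        if uid not in region_of or uid in blocked_users:
            continue
        stats = report[region_of[uid]]
        stats["total_events"] += 1
        stats["unique_sessions"].add(sid)
        freq = stats["event_frequency"]
        freq[ev["event"]] = freq.get(ev["event"], 0) + 1

    for stats in report.values():
        stats["unique_sessions"] = len(stats["unique_sessions"])
    return report


users = [
    {"user_id": 1, "region": "NA"},
    {"user_id": 2, "region": "EU"},
    {"user_id": 3, "region": "NA"},
    {"user_id": 4, "region": "APAC"},
]
sessions = [
    {"session_id": "s1", "user_id": 1},
    {"session_id": "s2", "user_id": 1},
    {"session_id": "s3", "user_id": 2},
    {"session_id": "s4", "user_id": 5},   # unknown user
]
events = [
    {"session_id": "s1", "event": "view"},
    {"session_id": "s1", "event": "click"},
    {"session_id": "s2", "event": "view"},
    {"session_id": "s3", "event": "view"},
    {"session_id": "s3", "event": "purchase"},
]
blocked_users = {2}

result = region_engagement(users, sessions, events, blocked_users)
```

Probing the indexes from the `events` side avoids materializing the intermediate users-sessions join, so memory stays proportional to the index sizes plus the report.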

**Question 4 — Marketplace Seller Performance Rollup (Medium-Hard)**

You are building a daily seller performance aggregation for a marketplace analytics pipeline.

Input
sellers = [
    {"seller_id": 1, "category": "electronics"},
    {"seller_id": 2, "category": "fashion"},
    {"seller_id": 3, "category": "electronics"},
    {"seller_id": 4, "category": "home"},
]

products = [
    {"product_id": "p1", "seller_id": 1},
    {"product_id": "p2", "seller_id": 1},
    {"product_id": "p3", "seller_id": 2},
    {"product_id": "p4", "seller_id": 5},   # seller missing
]

orders = [
    {"order_id": 101, "product_id": "p1", "amount": 300},
    {"order_id": 102, "product_id": "p1", "amount": 150},
    {"order_id": 103, "product_id": "p3", "amount": 80},
    {"order_id": 104, "product_id": "p9", "amount": 500},  # product missing
]

suspended_sellers = {2}

Requirements

Inner Join sellers ↔ products on seller_id

Inner Join result ↔ orders on product_id

Anti Join: exclude sellers in suspended_sellers

Group By List: group final records by category

Aggregation Map per category:

total_orders

total_revenue

seller_order_count
(frequency map: seller_id → number of orders)

Left Join behavior at category level:

Every category from sellers must appear in output, even if all sellers were filtered out or had no valid orders

Output Format
{
  "electronics": {
      "total_orders": 2,
      "total_revenue": 450,
      "seller_order_count": {
          1: 2
      }
  },
  "fashion": {
      "total_orders": 0,
      "total_revenue": 0,
      "seller_order_count": {}
  },
  "home": {
      "total_orders": 0,
      "total_revenue": 0,
      "seller_order_count": {}
  }
}

Constraints

Python only (no pandas / SQL)

Hash-based joins only

Avoid nested loops over large datasets

Clean, readable, production-grade logic
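
A hedged sketch of one way to meet these requirements. Instead of chaining two lookups per order, it flattens the seller join and the anti join into a single index, `product_index`, while scanning `products`. That index name, like `category_seller_rollup` itself, is an assumption made for illustration.

```python
def category_seller_rollup(sellers, products, orders, suspended_sellers):
    """Per-category order count, revenue, and per-seller order counts (sketch)."""
    category_of = {s["seller_id"]: s["category"] for s in sellers}

    # Flattened index: product_id -> (seller_id, category).
    # Products of unknown or suspended sellers never enter the index,
    # which applies the first inner join and the anti join up front.
    product_index = {}
    for p in products:
        sid = p["seller_id"]
        if sid in category_of and sid not in suspended_sellers:
            product_index[p["product_id"]] = (sid, category_of[sid])

    # Left-join behaviour: every category starts with zeroed metrics.
    report = {
        category: {"total_orders": 0, "total_revenue": 0, "seller_order_count": {}}
        for category in category_of.values()
    }

    for order in orders:
        match = product_index.get(order["product_id"])
        if match is None:  # second inner join: unknown or filtered product
            continue
        sid, category = match
        stats = report[category]
        stats["total_orders"] += 1
        stats["total_revenue"] += order["amount"]
        per_seller = stats["seller_order_count"]
        per_seller[sid] = per_seller.get(sid, 0) + 1
    return report


sellers = [
    {"seller_id": 1, "category": "electronics"},
    {"seller_id": 2, "category": "fashion"},
    {"seller_id": 3, "category": "electronics"},
    {"seller_id": 4, "category": "home"},
]
products = [
    {"product_id": "p1", "seller_id": 1},
    {"product_id": "p2", "seller_id": 1},
    {"product_id": "p3", "seller_id": 2},
    {"product_id": "p4", "seller_id": 5},   # seller missing
]
orders = [
    {"order_id": 101, "product_id": "p1", "amount": 300},
    {"order_id": 102, "product_id": "p1", "amount": 150},
    {"order_id": 103, "product_id": "p3", "amount": 80},
    {"order_id": 104, "product_id": "p9", "amount": 500},  # product missing
]
suspended_sellers = {2}

result = category_seller_rollup(sellers, products, orders, suspended_sellers)
```

Filtering while building the index means the loop over `orders`, typically the largest input, does a single dictionary lookup per record.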

**Question 5 — Subscription Billing Rollup with Mixed Join Semantics (Medium-Hard)**

You are building a monthly billing aggregation job for a SaaS platform.

Input
accounts = [
    {"account_id": 1, "plan": "pro"},
    {"account_id": 2, "plan": "basic"},
    {"account_id": 3, "plan": "pro"},
    {"account_id": 4, "plan": "enterprise"},
]

subscriptions = [
    {"sub_id": "s1", "account_id": 1},
    {"sub_id": "s2", "account_id": 1},
    {"sub_id": "s3", "account_id": 2},
    {"sub_id": "s4", "account_id": 5},   # account missing
]

charges = [
    {"sub_id": "s1", "amount": 100},
    {"sub_id": "s1", "amount": 50},
    {"sub_id": "s2", "amount": 200},
    {"sub_id": "s3", "amount": 30},
    {"sub_id": "s9", "amount": 999},     # subscription missing
]

cancelled_accounts = {2}

Requirements

Inner Join accounts ↔ subscriptions on account_id

Inner Join result ↔ charges on sub_id

Anti Join: exclude accounts in cancelled_accounts

Group By List: group final records by plan

Aggregation Map per plan:

total_charges

total_revenue

account_charge_count
(frequency map: account_id → number of charges)

Left Join behavior at plan level:

All plans from accounts must appear, even if they have zero valid charges

Output Format
{
  "pro": {
      "total_charges": 3,
      "total_revenue": 350,
      "account_charge_count": {
          1: 3
      }
  },
  "basic": {
      "total_charges": 0,
      "total_revenue": 0,
      "account_charge_count": {}
  },
  "enterprise": {
      "total_charges": 0,
      "total_revenue": 0,
      "account_charge_count": {}
  }
}

Constraints

Core Python only

Hash joins only (no nested scans)

Must be O(n) with respect to input sizes

Clean data-engineering-grade code
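
The sketch below is one possible solution that uses `collections.defaultdict` accumulators and assembles the final report at the end. The function name `plan_billing_rollup` and all local names are hypothetical. Plans are emitted in first-appearance order from `accounts`, which is an assumption the problem does not state.

```python
from collections import defaultdict


def plan_billing_rollup(accounts, subscriptions, charges, cancelled_accounts):
    """Per-plan charge count, revenue, and per-account charge counts (sketch)."""
    # Build-side indexes for the two hash joins.
    plan_of = {a["account_id"]: a["plan"] for a in accounts}
    account_of_sub = {s["sub_id"]: s["account_id"] for s in subscriptions}

    # Accumulators keyed by plan.
    total_charges = defaultdict(int)
    total_revenue = defaultdict(int)
    per_account = defaultdict(lambda: defaultdict(int))

    for charge in charges:
        account_id = account_of_sub.get(charge["sub_id"])  # None if sub is unknown
        # Inner joins drop unknown subs/accounts; anti join drops cancelled ones.
        if account_id not in plan_of or account_id in cancelled_accounts:
            continue
        plan = plan_of[account_id]
        total_charges[plan] += 1
        total_revenue[plan] += charge["amount"]
        per_account[plan][account_id] += 1

    # Left join at plan level: iterate over every plan from accounts and use
    # .get so that reading a missing plan does not insert a default entry.
    return {
        plan: {
            "total_charges": total_charges.get(plan, 0),
            "total_revenue": total_revenue.get(plan, 0),
            "account_charge_count": dict(per_account.get(plan, {})),
        }
        for plan in plan_of.values()
    }


accounts = [
    {"account_id": 1, "plan": "pro"},
    {"account_id": 2, "plan": "basic"},
    {"account_id": 3, "plan": "pro"},
    {"account_id": 4, "plan": "enterprise"},
]
subscriptions = [
    {"sub_id": "s1", "account_id": 1},
    {"sub_id": "s2", "account_id": 1},
    {"sub_id": "s3", "account_id": 2},
    {"sub_id": "s4", "account_id": 5},   # account missing
]
charges = [
    {"sub_id": "s1", "amount": 100},
    {"sub_id": "s1", "amount": 50},
    {"sub_id": "s2", "amount": 200},
    {"sub_id": "s3", "amount": 30},
    {"sub_id": "s9", "amount": 999},     # subscription missing
]
cancelled_accounts = {2}

result = plan_billing_rollup(accounts, subscriptions, charges, cancelled_accounts)
```

Converting the nested `defaultdict` to a plain `dict` at the end keeps the returned structure free of auto-creating behaviour, so later reads of a missing key raise `KeyError` instead of silently adding entries.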

**Question 6 — Event Revenue Rollup with Dedup + Mixed Joins (Hard)**

You are implementing a daily revenue rollup from raw event logs.

Input
users = [
    {"user_id": 1, "segment": "A"},
    {"user_id": 2, "segment": "B"},
    {"user_id": 3, "segment": "A"},
]

events = [
    {"event_id": "e1", "user_id": 1, "type": "purchase", "amount": 100},
    {"event_id": "e2", "user_id": 1, "type": "purchase", "amount": 50},
    {"event_id": "e2", "user_id": 1, "type": "purchase", "amount": 50},  # duplicate
    {"event_id": "e3", "user_id": 2, "type": "click", "amount": 0},
    {"event_id": "e4", "user_id": 4, "type": "purchase", "amount": 200}, # user missing
]

blacklisted_users = {2}

Requirements

Deduplicate events by event_id (keep first occurrence)

Inner Join events ↔ users on user_id

Anti Join: exclude users in blacklisted_users

Group By List: group remaining records by segment

Aggregation Map per segment:

total_events

total_revenue (sum of amount for all events)

event_type_frequency (frequency map of event types)

Left Join behavior at segment level:

All segments from users must appear in output, even if zero events remain

Output Format
{
  "A": {
      "total_events": 2,
      "total_revenue": 150,
      "event_type_frequency": {
          "purchase": 2
      }
  },
  "B": {
      "total_events": 0,
      "total_revenue": 0,
      "event_type_frequency": {}
  }
}

Constraints

Python only

No nested joins

Dedup must be O(n)

Use dictionaries / sets

Production-quality logic
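
A minimal sketch of the dedup-then-join flow, assuming that an `event_id` counts as seen even when the event is later dropped by the join. The function name `segment_revenue_rollup` and the `seen` set are illustrative choices, not part of the problem.

```python
def segment_revenue_rollup(users, events, blacklisted_users):
    """Per-segment event count, revenue, and event-type frequencies (sketch)."""
    # Build side of the hash join: user_id -> segment.
    segment_of = {u["user_id"]: u["segment"] for u in users}

    # Left-join behaviour: every segment starts with zeroed metrics.
    report = {
        segment: {"total_events": 0, "total_revenue": 0, "event_type_frequency": {}}
        for segment in segment_of.values()
    }

    seen = set()  # event_ids already processed; O(1) membership checks
    for ev in events:
        # Dedup: keep only the first occurrence of each event_id.
        if ev["event_id"] in seen:
            continue
        seen.add(ev["event_id"])

        uid = ev["user_id"]
        # Inner join drops unknown users; anti join drops blacklisted ones.
        if uid not in segment_of or uid in blacklisted_users:
            continue
        stats = report[segment_of[uid]]
        stats["total_events"] += 1
        stats["total_revenue"] += ev["amount"]
        freq = stats["event_type_frequency"]
        freq[ev["type"]] = freq.get(ev["type"], 0) + 1
    return report


users = [
    {"user_id": 1, "segment": "A"},
    {"user_id": 2, "segment": "B"},
    {"user_id": 3, "segment": "A"},
]
events = [
    {"event_id": "e1", "user_id": 1, "type": "purchase", "amount": 100},
    {"event_id": "e2", "user_id": 1, "type": "purchase", "amount": 50},
    {"event_id": "e2", "user_id": 1, "type": "purchase", "amount": 50},  # duplicate
    {"event_id": "e3", "user_id": 2, "type": "click", "amount": 0},
    {"event_id": "e4", "user_id": 4, "type": "purchase", "amount": 200}, # user missing
]
blacklisted_users = {2}

result = segment_revenue_rollup(users, events, blacklisted_users)
```

Because dedup runs before the join, the duplicate `e2` is dropped exactly once no matter which user it belongs to, and the set lookup keeps the dedup step linear in the number of events.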