**1. Problem Statement**

You are given two datasets represented as Python lists:

users: a list of (user_id, user_name)

logins: a list of (user_id, login_date)

Your task is to compute three separate results using dictionary-based joins:

Left Join
For every user, attach their most recent login date if it exists, otherwise None.

Semi Join
Return only users who have logged in at least once.

Anti Join
Return only users who have never logged in.

Input Format
users = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie"),
    (4, "Diana")
]

logins = [
    (1, "2025-01-01"),
    (1, "2025-01-05"),
    (2, "2025-01-03")
]

Output Format

1️⃣ Left Join

[
    (1, "Alice", "2025-01-05"),
    (2, "Bob", "2025-01-03"),
    (3, "Charlie", None),
    (4, "Diana", None)
]


2️⃣ Semi Join

[
    (1, "Alice"),
    (2, "Bob")
]


3️⃣ Anti Join

[
    (3, "Charlie"),
    (4, "Diana")
]

Constraints

Assume user_id is unique in users

logins may contain multiple records per user_id

Use Python dictionaries only (no pandas, no SQL)

Time complexity target: O(n + m)

In [0]:
users = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie"),
    (4, "Diana")
]

logins = [
    (1, "2025-01-01"),
    (1, "2025-01-05"),
    (2, "2025-01-03")
]

In [0]:
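# Left Join: attach each user's most recent login date (None if they never logged in)

# ISO dates (YYYY-MM-DD) compare correctly as strings, so max() picks the latest date
latest_login = {}
for user_id, login_date in logins:
    latest_login[user_id] = max(login_date, latest_login.get(user_id, login_date))

result = []
for user_id, user_name in users:
    result.append((user_id, user_name, latest_login.get(user_id)))
result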

In [0]:
# Semi / Anti Join

# Set of user_ids that appear in logins, giving O(1) membership checks
right_key = {r[0] for r in logins}
semi = [(u, n) for u, n in users if u in right_key]
anti = [(u, n) for u, n in users if u not in right_key]
semi, anti

**2. Problem Statement**

You are given two datasets:

orders: (order_id, customer_id, amount)

customers: (customer_id, customer_name)

Your task is to compute the following using dictionary-based joins:

1️⃣ Left Join
Return all customers, along with the total order amount they have spent.
If a customer has no orders, total amount should be 0.

2️⃣ Semi Join
Return only customers who have placed at least one order.

3️⃣ Anti Join
Return only customers who have never placed any order.

Input Format
customers = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie"),
    (4, "Diana")
]

orders = [
    (101, 1, 120.0),
    (102, 1, 80.0),
    (103, 2, 50.0)
]

Output Format

Left Join

[
    (1, "Alice", 200.0),
    (2, "Bob", 50.0),
    (3, "Charlie", 0),
    (4, "Diana", 0)
]


Semi Join

[
    (1, "Alice"),
    (2, "Bob")
]


Anti Join

[
    (3, "Charlie"),
    (4, "Diana")
]

Constraints

customer_id is unique in customers

orders may have multiple rows per customer_id

Use only:

dict

defaultdict

No sorting required

Target time complexity: O(n + m)

In [0]:
customers = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie"),
    (4, "Diana")
]

orders = [
    (101, 1, 120.0),
    (102, 1, 80.0),
    (103, 2, 50.0)
]

In [0]:
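# Left Join: total order amount per customer (0 if the customer has no orders)

order_amt = {}

for _, cust_id, amt in orders:  # cust_id avoids shadowing the built-in id()
    order_amt[cust_id] = order_amt.get(cust_id, 0) + amt

result = []
for customer_id, customer_name in customers:
    result.append(
        (customer_id, customer_name, order_amt.get(customer_id, 0))
    )
result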

In [0]:
# Semi/Anti Join

# customer_ids that appear in at least one order
right_key = {r[1] for r in orders}
semi = [(i, n) for i, n in customers if i in right_key]
anti = [(i, n) for i, n in customers if i not in right_key]
semi, anti
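The constraints for this problem allow `defaultdict`, which the cells above do not use. As a minimal sketch of that variant (the names `totals`, `left`, `semi`, and `anti` are illustrative, not part of the problem statement), the aggregation step can drop the explicit `.get(..., 0)` default:

```python
from collections import defaultdict

customers = [(1, "Alice"), (2, "Bob"), (3, "Charlie"), (4, "Diana")]
orders = [(101, 1, 120.0), (102, 1, 80.0), (103, 2, 50.0)]

# defaultdict(float) starts every missing key at 0.0, so += works directly
totals = defaultdict(float)
for _order_id, customer_id, amount in orders:
    totals[customer_id] += amount

# Use .get() for lookups: indexing totals[cid] would insert a key for every
# customer and break the membership tests used for the semi and anti joins
left = [(cid, name, totals.get(cid, 0)) for cid, name in customers]
semi = [(cid, name) for cid, name in customers if cid in totals]
anti = [(cid, name) for cid, name in customers if cid not in totals]
```

Reading with `.get()` rather than indexing keeps `totals` limited to customers who actually placed an order, so the same dictionary can serve all three joins.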

**3. Problem Statement**

You are given two datasets:

employees: (emp_id, emp_name)

access_logs: (emp_id, system_name)

Each log row records that the employee accessed the named internal system at least once.

Your task is to compute:

1️⃣ Left Join
Return all employees, along with a boolean has_access indicating whether they appear in access_logs.

2️⃣ Semi Join
Return only employees who have accessed at least one system.

3️⃣ Anti Join
Return only employees who have never accessed any system.

Input Format
employees = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie"),
    (4, "Diana")
]

access_logs = [
    (1, "jira"),
    (1, "github"),
    (2, "confluence")
]

Output Format

Left Join

[
    (1, "Alice", True),
    (2, "Bob", True),
    (3, "Charlie", False),
    (4, "Diana", False)
]


Semi Join

[
    (1, "Alice"),
    (2, "Bob")
]


Anti Join

[
    (3, "Charlie"),
    (4, "Diana")
]

Constraints

emp_id is unique in employees

access_logs may contain multiple rows per emp_id

Use set / dict only

Target complexity: O(n + m)

No sorting required

In [0]:
employees = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie"),
    (4, "Diana")
]

access_logs = [
    (1, "jira"),
    (1, "github"),
    (2, "confluence")
]

In [0]:
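# Left Join: flag whether each employee appears in access_logs
access = {}

for emp_id, _ in access_logs:
    access[emp_id] = True

result = []
for emp_id, name in employees:  # emp_id avoids shadowing the built-in id()
    result.append(
        (emp_id, name, access.get(emp_id, False))
    )
result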

In [0]:
# Semi/Anti Join
right_key = {a[0] for a in access_logs}
semi = [(emp_id, name) for emp_id, name in employees if emp_id in right_key]
anti = [(emp_id, name) for emp_id, name in employees if emp_id not in right_key]
semi, anti

**4. Problem Statement**

You are given two datasets representing a subscription platform:

users: (user_id, country)

subscriptions: (user_id, plan, start_date, is_active)

Each user may have multiple subscriptions over time, but at most one active subscription.

Your task is to compute the following:

1️⃣ Left Join

Return all users, along with their active plan.
If a user has no active subscription, return "FREE".

2️⃣ Semi Join

Return only users who currently have an active subscription.

3️⃣ Anti Join

Return only users who have never had any subscription at all
(not even inactive ones).

Input Format
users = [
    (1, "US"),
    (2, "IN"),
    (3, "UK"),
    (4, "US"),
    (5, "DE")
]

subscriptions = [
    (1, "PRO", "2024-01-01", False),
    (1, "PREMIUM", "2025-01-01", True),
    (2, "BASIC", "2025-02-01", True),
    (3, "BASIC", "2024-06-01", False)
]

Output Format

Left Join

[
    (1, "US", "PREMIUM"),
    (2, "IN", "BASIC"),
    (3, "UK", "FREE"),
    (4, "US", "FREE"),
    (5, "DE", "FREE")
]


Semi Join

[
    (1, "US"),
    (2, "IN")
]


Anti Join

[
    (4, "US"),
    (5, "DE")
]

Constraints

user_id is unique in users

subscriptions may have multiple rows per user_id

At most one active subscription per user

Use dict / set / defaultdict only

No sorting

Time complexity target: O(n + m)

In [0]:
users = [
    (1, "US"),
    (2, "IN"),
    (3, "UK"),
    (4, "US"),
    (5, "DE")
]

subscriptions = [
    (1, "PRO", "2024-01-01", False),
    (1, "PREMIUM", "2025-01-01", True),
    (2, "BASIC", "2025-02-01", True),
    (3, "BASIC", "2024-06-01", False)
]

In [0]:
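# Left Join: active plan per user, defaulting to "FREE"
lookup = {}
for user_id, plan, _start_date, is_active in subscriptions:
    if is_active:
        lookup[user_id] = plan  # at most one active subscription per user
result = []
for user_id, country in users:
    result.append(
        (user_id, country, lookup.get(user_id, 'FREE'))
    )
result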

In [0]:
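# Semi/Anti Join

# right_key: users with an active subscription (semi join)
# anti_key: users with any subscription row, active or not (anti join)
right_key = {d[0] for d in subscriptions if d[3]}
anti_key = {d[0] for d in subscriptions}
semi = [(user_id, country) for user_id, country in users if user_id in right_key]
anti = [(user_id, country) for user_id, country in users if user_id not in anti_key]
semi, anti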

**5. Problem Statement**

You are given two datasets from a data warehouse ingestion pipeline:

customers: (customer_id, region)

events: (event_id, customer_id, event_type, event_date)

Each customer may generate multiple events across time.

Your task is to compute the following:

1️⃣ Left Join

Return all customers, along with:

event_count → total number of events generated by that customer
If a customer has no events, event_count = 0.

2️⃣ Semi Join

Return only customers who have generated at least one event of type "purchase".

3️⃣ Anti Join

Return only customers who have generated NO events at all.

Input Format
customers = [
    (1, "US"),
    (2, "EU"),
    (3, "US"),
    (4, "IN"),
    (5, "EU")
]

events = [
    (101, 1, "click", "2025-01-01"),
    (102, 1, "purchase", "2025-01-02"),
    (103, 2, "click", "2025-01-03"),
    (104, 2, "click", "2025-01-04"),
    (105, 3, "purchase", "2025-01-05")
]

Output Format

Left Join

[
    (1, "US", 2),
    (2, "EU", 2),
    (3, "US", 1),
    (4, "IN", 0),
    (5, "EU", 0)
]


Semi Join

[
    (1, "US"),
    (3, "US")
]


Anti Join

[
    (4, "IN"),
    (5, "EU")
]

Constraints

customer_id is unique in customers

events may contain many rows per customer

Use dictionary-based aggregation + join

No sorting

Target time complexity: O(n + m)

In [0]:
customers = [
    (1, "US"),
    (2, "EU"),
    (3, "US"),
    (4, "IN"),
    (5, "EU")
]

events = [
    (101, 1, "click", "2025-01-01"),
    (102, 1, "purchase", "2025-01-02"),
    (103, 2, "click", "2025-01-03"),
    (104, 2, "click", "2025-01-04"),
    (105, 3, "purchase", "2025-01-05")
]

In [0]:
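# Left Join: number of events per customer (0 if the customer has none)

event_freq = {}
for _, cust_id, _, _ in events:
    event_freq[cust_id] = event_freq.get(cust_id, 0) + 1

res = []
for cust_id, region in customers:
    res.append(
        (cust_id, region, event_freq.get(cust_id, 0))
    )
res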

In [0]:
# Semi Join: customers with at least one "purchase" event
semi_key = {d[1] for d in events if d[2] == 'purchase'}
semi = [(cust_id, region) for cust_id, region in customers if cust_id in semi_key]
semi

In [0]:
# Anti Join: customers with no events at all
anti_key = {d[1] for d in events}
anti = [(cust_id, region) for cust_id, region in customers if cust_id not in anti_key]
anti
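The three results for this problem can also be produced from a single pass over `events`. The following is a minimal sketch under that assumption; `event_count`, `purchasers`, `left`, `semi`, and `anti` are illustrative names:

```python
customers = [(1, "US"), (2, "EU"), (3, "US"), (4, "IN"), (5, "EU")]
events = [
    (101, 1, "click", "2025-01-01"),
    (102, 1, "purchase", "2025-01-02"),
    (103, 2, "click", "2025-01-03"),
    (104, 2, "click", "2025-01-04"),
    (105, 3, "purchase", "2025-01-05"),
]

event_count = {}    # customer_id -> number of events of any type
purchasers = set()  # customer_ids with at least one "purchase" event

# One pass over events builds both lookup structures
for _, customer_id, event_type, _ in events:
    event_count[customer_id] = event_count.get(customer_id, 0) + 1
    if event_type == "purchase":
        purchasers.add(customer_id)

left = [(cid, region, event_count.get(cid, 0)) for cid, region in customers]
semi = [(cid, region) for cid, region in customers if cid in purchasers]
anti = [(cid, region) for cid, region in customers if cid not in event_count]
```

The semi join and the anti join use different key sets here: customer 2 has events but no purchase, so it appears in neither result.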

**6. Problem Statement**

You are given two datasets from a user activity tracking system:

users: (user_id, signup_date)

activity: (user_id, activity_date, activity_type)

Each user may have multiple activities over time.

Your task is to compute the following:

1️⃣ Left Join

Return all users, along with:

last_activity_date → the most recent activity date
If a user has no activity, return None.

2️⃣ Semi Join

Return only users who have been active in the last 30 days
(assume reference date = "2025-02-01").

3️⃣ Anti Join

Return only users who have never had any activity.

Input Format
users = [
    (1, "2024-01-01"),
    (2, "2024-06-01"),
    (3, "2024-09-01"),
    (4, "2025-01-15")
]

activity = [
    (1, "2025-01-01", "login"),
    (1, "2025-01-20", "purchase"),
    (2, "2024-12-15", "login"),
    (3, "2024-10-01", "login")
]

Output Format

Left Join

[
    (1, "2024-01-01", "2025-01-20"),
    (2, "2024-06-01", "2024-12-15"),
    (3, "2024-09-01", "2024-10-01"),
    (4, "2025-01-15", None)
]


Semi Join

[
    (1, "2024-01-01")
]


Anti Join

[
    (4, "2025-01-15")
]

Constraints

Dates are ISO strings (YYYY-MM-DD)

No sorting allowed

Use dictionaries / sets only

One pass over activity

Target complexity: O(n + m)
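A minimal sketch of one possible approach follows. It makes several assumptions that the problem statement does not settle: the 30-day window is inclusive of its start date, no activity is dated after the reference date, and `datetime` is used only to derive the cutoff string. The names `REFERENCE_DATE`, `cutoff`, and `last_activity` are illustrative.

```python
from datetime import date, timedelta

users = [
    (1, "2024-01-01"),
    (2, "2024-06-01"),
    (3, "2024-09-01"),
    (4, "2025-01-15"),
]
activity = [
    (1, "2025-01-01", "login"),
    (1, "2025-01-20", "purchase"),
    (2, "2024-12-15", "login"),
    (3, "2024-10-01", "login"),
]

REFERENCE_DATE = "2025-02-01"
# Assumption: "last 30 days" means activity_date >= reference date minus 30 days
cutoff = (date.fromisoformat(REFERENCE_DATE) - timedelta(days=30)).isoformat()

# One pass over activity: keep the latest date per user.
# ISO date strings compare correctly as plain strings, so no parsing is needed.
last_activity = {}
for user_id, activity_date, _ in activity:
    prev = last_activity.get(user_id)
    if prev is None or activity_date > prev:
        last_activity[user_id] = activity_date

left = [(uid, signup, last_activity.get(uid)) for uid, signup in users]
# A user was active in the window exactly when their latest activity falls in it
# (given the assumption that nothing is dated after the reference date)
semi = [(uid, signup) for uid, signup in users
        if last_activity.get(uid, "") >= cutoff]
anti = [(uid, signup) for uid, signup in users if uid not in last_activity]
```

Because only the latest date per user is stored, the same dictionary answers all three questions without a second pass over `activity`.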

**7. Problem Statement**

You are given two datasets from a payments + risk system:

accounts: (account_id, country)

transactions: (txn_id, account_id, amount, status, txn_date)

Rules:

An account may have many transactions

A transaction is successful if status == "SUCCESS"

Reference date = "2025-02-01"

Your task is to compute:

1️⃣ Left Join

Return all accounts, along with:

successful_txn_count_last_30_days
If an account has no qualifying transactions, return 0.

2️⃣ Semi Join

Return only accounts that have at least ONE successful transaction
in the last 30 days.

3️⃣ Anti Join

Return only accounts that have transactions, but NONE successful
in the last 30 days.

⚠️ Important:

Accounts with no transactions at all should NOT appear in the Anti Join.

Input Format
accounts = [
    (1, "US"),
    (2, "EU"),
    (3, "IN"),
    (4, "US"),
    (5, "EU")
]

transactions = [
    (101, 1, 100.0, "SUCCESS", "2025-01-15"),
    (102, 1, 50.0, "FAILED",  "2025-01-20"),
    (103, 2, 75.0, "FAILED",  "2025-01-10"),
    (104, 2, 60.0, "FAILED",  "2025-01-25"),
    (105, 3, 40.0, "SUCCESS", "2024-12-15"),
    (106, 3, 90.0, "FAILED",  "2025-01-18")
]

Output Format

Left Join

[
    (1, "US", 1),
    (2, "EU", 0),
    (3, "IN", 0),
    (4, "US", 0),
    (5, "EU", 0)
]


Semi Join

[
    (1, "US")
]


Anti Join

[
    (2, "EU"),
    (3, "IN")
]

Constraints

Dates are ISO strings

One pass over transactions

Use dictionaries / sets only

No sorting

Time complexity target: O(n + m)

Must carefully distinguish:

no transactions

transactions but no qualifying success
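A minimal sketch of one possible approach follows. It assumes the 30-day window runs from the reference date minus 30 days through the reference date, both inclusive, and uses `datetime` only to derive the cutoff string. The names `REFERENCE_DATE`, `cutoff`, `has_txn`, and `recent_success` are illustrative.

```python
from datetime import date, timedelta

accounts = [(1, "US"), (2, "EU"), (3, "IN"), (4, "US"), (5, "EU")]
transactions = [
    (101, 1, 100.0, "SUCCESS", "2025-01-15"),
    (102, 1, 50.0, "FAILED", "2025-01-20"),
    (103, 2, 75.0, "FAILED", "2025-01-10"),
    (104, 2, 60.0, "FAILED", "2025-01-25"),
    (105, 3, 40.0, "SUCCESS", "2024-12-15"),
    (106, 3, 90.0, "FAILED", "2025-01-18"),
]

REFERENCE_DATE = "2025-02-01"
# Assumption: the window is [reference - 30 days, reference], inclusive
cutoff = (date.fromisoformat(REFERENCE_DATE) - timedelta(days=30)).isoformat()

has_txn = set()      # accounts with any transaction at all
recent_success = {}  # account_id -> count of successful txns inside the window

# One pass over transactions fills both structures
for _, account_id, _, status, txn_date in transactions:
    has_txn.add(account_id)
    if status == "SUCCESS" and cutoff <= txn_date <= REFERENCE_DATE:
        recent_success[account_id] = recent_success.get(account_id, 0) + 1

left = [(aid, country, recent_success.get(aid, 0)) for aid, country in accounts]
semi = [(aid, country) for aid, country in accounts if aid in recent_success]
# Anti join: must have transactions, but none that qualify
anti = [(aid, country) for aid, country in accounts
        if aid in has_txn and aid not in recent_success]
```

Keeping `has_txn` separate from `recent_success` is what distinguishes accounts 4 and 5 (no transactions) from accounts 2 and 3 (transactions, but no qualifying success).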