# Setup (path, imports, helpers)

In [1]:
from __future__ import annotations

import os
import sys
import gc
import time
from dataclasses import dataclass
from typing import Callable, Any

from memory_profiler import memory_usage

NOTEBOOK_DIR = os.getcwd()
REPO_ROOT = os.path.abspath(os.path.join(NOTEBOOK_DIR, ".."))

if REPO_ROOT not in sys.path:
    sys.path.insert(0, REPO_ROOT)

DATASET_PATH = os.path.join(REPO_ROOT, "data", "farmers-protest-tweets-2021-2-4.json")
assert os.path.exists(DATASET_PATH), f"Dataset not found at: {DATASET_PATH}"

# Import challenge functions
from src.q1_time import q1_time
from src.q1_memory import q1_memory

from src.q2_time import q2_time
from src.q2_memory import q2_memory

from src.q3_time import q3_time
from src.q3_memory import q3_memory


@dataclass
class BenchResult:
    name: str
    seconds: float
    peak_mib: float
    result_preview: Any


def bench(fn: Callable[[str], Any], path: str, name: str, repeats: int = 1) -> BenchResult:
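    """
    Measure wall-clock time and peak RSS delta (MiB) using memory_profiler.

    The call is repeated `repeats` times. We keep the fastest run (min time),
    together with the peak memory delta and result preview from that same run.
    Note: the timed region also includes memory_profiler's sampling overhead.
    """
    if repeats < 1:
        raise ValueError("repeats must be >= 1")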
    best_time = None
    best_peak = None
    best_preview = None

    for _ in range(repeats):
        gc.collect()
        t0 = time.perf_counter()

        mem_trace, out = memory_usage((fn, (path,)), retval=True, interval=0.05, timeout=None)
        t1 = time.perf_counter()

        elapsed = t1 - t0
        peak = max(mem_trace) - min(mem_trace)  # delta in MiB during this call

        if best_time is None or elapsed < best_time:
            best_time = elapsed
            best_peak = peak
            # keep preview small and stable
            best_preview = out[:3] if isinstance(out, list) else out

    return BenchResult(name=name, seconds=float(best_time), peak_mib=float(best_peak), result_preview=best_preview)


# Data Engineer Challenge - Benchmark Notebook

## Dataset
- Format: NDJSON (one JSON object per line).
- Path used in this notebook: `data/farmers-protest-tweets-2021-2-4.json`.
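
As a minimal sketch of what "NDJSON" means in practice, the helper below streams one record at a time instead of loading the whole file. The name `iter_records` is hypothetical and is not taken from the `src/` modules; it only assumes UTF-8 text with one JSON object per line.

```python
import json
from typing import Any, Dict, Iterator


def iter_records(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one decoded JSON object per non-empty line of an NDJSON file.

    Hypothetical helper for illustration; the challenge functions may read
    the file differently.
    """
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

Because this is a generator, only the current line is decoded at any time, which is what keeps the streaming approaches described in this notebook at a low memory footprint.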

## Field assumptions (based on the dataset structure)
- **Tweet text** for Q2: we use the canonical `content` field.
- **Mentions** for Q3: we treat `mentionedUsers` as the canonical structured signal for mentions (when present).

These assumptions reduce ambiguity and make results reproducible across environments.

## Benchmark methodology
We report:
- **Wall-clock time** (seconds) using `time.perf_counter()`.
- **Peak memory delta (MiB)** during the function call using `memory_profiler.memory_usage()`:
  - We take `max(trace) - min(trace)` as an approximation of incremental RSS during execution.

To reduce noise:
- We run each function **3 times** (`repeats=3`) and keep the **best time**.
- We call `gc.collect()` before each run.

Limitations:
- RSS-based memory measurements vary by OS and Python allocator behavior. We use them for **relative comparisons** within the same environment.
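
One possible cross-check for the RSS-based numbers is the standard-library `tracemalloc` module. The sketch below is not the method used by `bench`: it tracks Python-level allocations only (not process RSS), and tracing slows the measured call, so its timings are not comparable with the table. The names `measure_once` and `best_of` are hypothetical.

```python
import gc
import time
import tracemalloc
from typing import Any, Callable, Tuple


def measure_once(fn: Callable[[], Any]) -> Tuple[float, float, Any]:
    """Return (seconds, peak MiB of Python allocations, result) for one call."""
    gc.collect()
    tracemalloc.start()
    t0 = time.perf_counter()
    try:
        out = fn()
        elapsed = time.perf_counter() - t0
        # get_traced_memory() returns (current, peak) in bytes since start().
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return elapsed, peak_bytes / (1024 * 1024), out


def best_of(fn: Callable[[], Any], repeats: int = 3) -> Tuple[float, float, Any]:
    """Repeat the measurement and keep the run with the smallest elapsed time."""
    if repeats < 1:
        raise ValueError("repeats must be >= 1")
    runs = [measure_once(fn) for _ in range(repeats)]
    return min(runs, key=lambda run: run[0])


if __name__ == "__main__":
    seconds, peak_mib, _ = best_of(lambda: [0] * 1_000_000)
    print(f"best time: {seconds:.4f} s, peak Python allocations: {peak_mib:.2f} MiB")
```

If the two measurement styles disagree on the ranking of two implementations, that is a hint that the RSS delta is dominated by allocator or OS behavior rather than by the code under test.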

## Approach overview (per question)

### Q1 - Top dates and most active user per date
**Inputs used:**
- `record["date"]` (ISO datetime)
- `record["user"]["username"]`

**Outputs:**
- Top 10 dates by tweet count.
- For each selected date, the most active user (with deterministic tie-breaking).

**q1_time**
- Single pass; builds a full `date -> {user -> count}` map for all dates.
- Pros: minimal post-processing, straightforward selection.
- Cons: higher memory (stores per-user counts for every date).

**q1_memory**
- Two passes:
  1) Count tweets per date only.
  2) For the top-10 dates, count users and select most active.
- Pros: lower memory peak by restricting per-user counting to the top-10 dates.
- Cons: reads the dataset twice (more I/O).
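
The two strategies above can be sketched as follows. This is an illustration under stated assumptions, not the code in `src/q1_time.py` or `src/q1_memory.py`: the function names are hypothetical, `date` is assumed to start with `YYYY-MM-DD`, and ties are assumed to break by earlier date and then by username in ascending order.

```python
import json
from collections import Counter, defaultdict
from datetime import date
from typing import Callable, Iterable, List, Tuple


def _day_and_user(line: str) -> Tuple[date, str]:
    rec = json.loads(line)
    # Assumption: the ISO datetime starts with YYYY-MM-DD.
    return date.fromisoformat(rec["date"][:10]), rec["user"]["username"]


def _top_user(counts: Counter) -> str:
    # Highest count first; ties broken by username (assumed rule).
    return min(counts, key=lambda user: (-counts[user], user))


def top_dates_single_pass(lines: Iterable[str], n: int = 10) -> List[Tuple[date, str]]:
    """One pass: keep per-user counts for every date."""
    per_day = defaultdict(Counter)
    for line in lines:
        day, user = _day_and_user(line)
        per_day[day][user] += 1
    ranked = sorted(per_day, key=lambda d: (-sum(per_day[d].values()), d))[:n]
    return [(d, _top_user(per_day[d])) for d in ranked]


def top_dates_two_pass(read_lines: Callable[[], Iterable[str]], n: int = 10) -> List[Tuple[date, str]]:
    """Two passes: count dates first, then users for the top-n dates only."""
    day_counts = Counter(_day_and_user(line)[0] for line in read_lines())
    ranked = sorted(day_counts, key=lambda d: (-day_counts[d], d))[:n]
    per_day = {d: Counter() for d in ranked}
    for line in read_lines():  # second read of the same data
        day, user = _day_and_user(line)
        if day in per_day:
            per_day[day][user] += 1
    return [(d, _top_user(per_day[d])) for d in ranked]


if __name__ == "__main__":
    sample = [
        json.dumps({"date": "2021-02-12T10:00:00+00:00", "user": {"username": "a"}}),
        json.dumps({"date": "2021-02-12T11:00:00+00:00", "user": {"username": "a"}}),
        json.dumps({"date": "2021-02-12T12:00:00+00:00", "user": {"username": "b"}}),
        json.dumps({"date": "2021-02-13T09:00:00+00:00", "user": {"username": "c"}}),
    ]
    print(top_dates_single_pass(sample))
    print(top_dates_two_pass(lambda: iter(sample)))
```

The two-pass variant takes a callable so that each pass gets a fresh iterator; with a real file that means opening it twice, which is the extra I/O listed under "Cons".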

### Q2 - Top emojis
**Input used:**
- `record["content"]`

**Outputs:**
- Top 10 emojis by total usage across all tweets, ordered deterministically (count desc, emoji asc).

Both implementations are streaming and use a dictionary counter.
Differences are mostly constant factors (temporary allocations, helper overhead), not asymptotic complexity.
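
A rough sketch of a streaming dictionary counter with the deterministic ordering described above. The emoji detection here is an assumption: `EMOJI_RE` is a simplified code-point range that ignores multi-code-point sequences (skin tones, ZWJ sequences, flags), and the actual implementations may well rely on a dedicated emoji library instead. `top_emojis` is a hypothetical name.

```python
import json
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumption: a coarse approximation of "emoji" by code-point ranges.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def top_emojis(lines: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Count emoji occurrences in `content`, one NDJSON line at a time."""
    counts: Counter = Counter()
    for line in lines:
        content = json.loads(line).get("content") or ""
        counts.update(EMOJI_RE.findall(content))
    # Deterministic order: count descending, then emoji ascending.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

Sorting on the `(-count, emoji)` key is what makes the output stable across runs when two emojis have the same count.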

### Q3 - Top mentioned users
**Inputs used:**
- Primary: `record["mentionedUsers"]` (structured list of mentioned usernames)
- Secondary (`q3_time` only): parse `record["content"]` with a mention regex if `mentionedUsers` is missing/empty.

**Outputs:**
- Top 10 mentioned usernames across the dataset (count desc, username asc).

**q3_time**
- Uses `mentionedUsers` as primary (fast, structured).
- Falls back to regex parsing of `content` only when structured mentions are missing/empty (higher recall).

**q3_memory**
- Uses only `mentionedUsers` as the canonical signal.
- Rationale: minimizes temporary allocations and avoids ambiguous false positives from raw-text parsing (emails/URLs/text artifacts).
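
The difference between the two variants can be sketched with a single function and a flag. Everything here is illustrative: `count_mentions`, `use_text_fallback`, and `MENTION_RE` are hypothetical names, the regex is one plausible mention pattern rather than the one used in `q3_time`, and each `mentionedUsers` entry is assumed to be an object with a `username` field.

```python
import json
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumption: "@" not preceded by a word character (skips e-mail addresses),
# followed by up to 15 word characters.
MENTION_RE = re.compile(r"(?<!\w)@(\w{1,15})")


def count_mentions(
    lines: Iterable[str], n: int = 10, use_text_fallback: bool = False
) -> List[Tuple[str, int]]:
    """Count mentioned usernames; optionally fall back to parsing `content`."""
    counts: Counter = Counter()
    for line in lines:
        rec = json.loads(line)
        mentioned = rec.get("mentionedUsers") or []
        if mentioned:
            # Assumption: entries look like {"username": "..."}.
            counts.update(m["username"] for m in mentioned)
        elif use_text_fallback:
            counts.update(MENTION_RE.findall(rec.get("content") or ""))
    # Deterministic order: count descending, then username ascending.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

With `use_text_fallback=False` the function touches only the structured field, which mirrors the rationale given for `q3_memory`; with `True` it trades some extra work per record for higher recall.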


# Execute functions (sanity outputs)

In [2]:
print("Q1 time (preview):", q1_time(DATASET_PATH)[:3])
print("Q1 memory (preview):", q1_memory(DATASET_PATH)[:3])

print("Q2 time (preview):", q2_time(DATASET_PATH)[:3])
print("Q2 memory (preview):", q2_memory(DATASET_PATH)[:3])

print("Q3 time (preview):", q3_time(DATASET_PATH)[:3])
print("Q3 memory (preview):", q3_memory(DATASET_PATH)[:3])

Q1 time (preview): [(datetime.date(2021, 2, 12), 'RanbirS00614606'), (datetime.date(2021, 2, 13), 'MaanDee08215437'), (datetime.date(2021, 2, 17), 'RaaJVinderkaur')]
Q1 memory (preview): [(datetime.date(2021, 2, 12), 'RanbirS00614606'), (datetime.date(2021, 2, 13), 'MaanDee08215437'), (datetime.date(2021, 2, 17), 'RaaJVinderkaur')]
Q2 time (preview): [('🙏', 5049), ('😂', 3072), ('🚜', 2972)]
Q2 memory (preview): [('🙏', 5049), ('😂', 3072), ('🚜', 2972)]
Q3 time (preview): [('narendramodi', 2265), ('Kisanektamorcha', 1840), ('RakeshTikaitBKU', 1644)]
Q3 memory (preview): [('narendramodi', 2265), ('Kisanektamorcha', 1840), ('RakeshTikaitBKU', 1644)]


# Benchmark (time and memory)

In [3]:
benchmarks = [
    (q1_time,   "Q1 time"),
    (q1_memory, "Q1 memory"),
    (q2_time,   "Q2 time"),
    (q2_memory, "Q2 memory"),
    (q3_time,   "Q3 time"),
    (q3_memory, "Q3 memory"),
]

results = []
for fn, name in benchmarks:
    r = bench(fn, DATASET_PATH, name=name, repeats=3)
    results.append(r)

results


[BenchResult(name='Q1 time', seconds=3.2976390999974683, peak_mib=1.95703125, result_preview=[(datetime.date(2021, 2, 12), 'RanbirS00614606'), (datetime.date(2021, 2, 13), 'MaanDee08215437'), (datetime.date(2021, 2, 17), 'RaaJVinderkaur')]),
 BenchResult(name='Q1 memory', seconds=5.888545800000429, peak_mib=1.9375, result_preview=[(datetime.date(2021, 2, 12), 'RanbirS00614606'), (datetime.date(2021, 2, 13), 'MaanDee08215437'), (datetime.date(2021, 2, 17), 'RaaJVinderkaur')]),
 BenchResult(name='Q2 time', seconds=9.809530400001677, peak_mib=1.828125, result_preview=[('🙏', 5049), ('😂', 3072), ('🚜', 2972)]),
 BenchResult(name='Q2 memory', seconds=10.024615400005132, peak_mib=0.01953125, result_preview=[('🙏', 5049), ('😂', 3072), ('🚜', 2972)]),
 BenchResult(name='Q3 time', seconds=3.186534300009953, peak_mib=0.265625, result_preview=[('narendramodi', 2265), ('Kisanektamorcha', 1840), ('RakeshTikaitBKU', 1644)]),
 BenchResult(name='Q3 memory', seconds=3.126013100001728, pea

# Results Table

In [4]:
import pandas as pd

df = pd.DataFrame([{
    "task": r.name,
    "time_seconds": round(r.seconds, 4),
    "peak_mem_delta_mib": round(r.peak_mib, 3),
    "result_preview": r.result_preview
} for r in results]).sort_values("task")

df


| | task | time_seconds | peak_mem_delta_mib | result_preview |
|---|---|---|---|---|
| 1 | Q1 memory | 5.8885 | 1.938 | [(2021-02-12, RanbirS00614606), (2021-02-13, M... |
| 0 | Q1 time | 3.2976 | 1.957 | [(2021-02-12, RanbirS00614606), (2021-02-13, M... |
| 3 | Q2 memory | 10.0246 | 0.02 | [(🙏, 5049), (😂, 3072), (🚜, 2972)] |
| 2 | Q2 time | 9.8095 | 1.828 | [(🙏, 5049), (😂, 3072), (🚜, 2972)] |
| 5 | Q3 memory | 3.126 | 1.109 | [(narendramodi, 2265), (Kisanektamorcha, 1840)... |
| 4 | Q3 time | 3.1865 | 0.266 | [(narendramodi, 2265), (Kisanektamorcha, 1840)... |


## Results and Discussion

### Q1
The time-optimized implementation (`q1_time`) is about 1.8x faster than the memory-oriented version (3.30 s vs 5.89 s), while the peak memory delta is nearly identical for this dataset (1.957 MiB vs 1.938 MiB).

This outcome is plausible given the dataset characteristics:
- The number of distinct dates and the per-date user distributions do not cause the full `date → user → count` structure to grow excessively.
- As a result, the two-pass strategy in `q1_memory` does not yield a memory advantage here, while it incurs the cost of reading and parsing the dataset a second time.

This highlights an important point: **memory-oriented designs provide worst-case guarantees, but their benefits depend on data distribution**.

### Q2
Both implementations have nearly identical runtime (9.81 s vs 10.02 s), which is dominated by emoji parsing.
However, the memory-oriented version shows a much lower peak memory delta (0.02 MiB vs 1.83 MiB).

This is consistent with:
- Both approaches sharing the same asymptotic complexity.
- The memory-focused implementation reducing temporary allocations and memory churn, which lowers the peak RSS delta.

Because memory is sampled every 0.05 s, short-lived peaks can be missed, so deltas this small are best read as indicative rather than exact.

### Q3
Runtime differences between the two versions are negligible (3.19 s vs 3.13 s).
The time-oriented implementation includes a fallback to parse mentions from raw text, but in practice this path is rarely triggered because the dataset provides structured `mentionedUsers` metadata.

The memory-oriented version avoids text parsing entirely and relies only on the canonical structured signal. In this run, however, it recorded a higher peak memory delta than `q3_time` (1.109 MiB vs 0.266 MiB), so the expected memory advantage is not visible here. At this scale the difference is likely within the noise of RSS-based measurement.

### Overall
Across all questions:
- Differences between *time* and *memory* implementations are driven by **engineering tradeoffs and constant factors**, not asymptotic complexity.
- Dataset characteristics strongly influence whether a memory-oriented strategy yields visible gains.
- Explicitly separating these approaches makes the tradeoffs transparent and reproducible.
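
To make the comparison concrete, the snippet below derives runtime ratios and memory-delta differences from the figures in the results table. The numbers are copied from that table; the `compare` helper and the dictionary layout are hypothetical and exist only for this illustration.

```python
from typing import Dict, Tuple

# (time_seconds, peak_mem_delta_mib) as reported in the results table.
reported: Dict[str, Dict[str, Tuple[float, float]]] = {
    "Q1": {"time": (3.2976, 1.957), "memory": (5.8885, 1.938)},
    "Q2": {"time": (9.8095, 1.828), "memory": (10.0246, 0.020)},
    "Q3": {"time": (3.1865, 0.266), "memory": (3.1260, 1.109)},
}


def compare(table: Dict[str, Dict[str, Tuple[float, float]]]) -> Dict[str, Dict[str, float]]:
    """Per question: runtime of the memory variant relative to the time variant,
    and the difference in peak memory delta (memory variant minus time variant)."""
    rows = {}
    for question, variants in table.items():
        time_sec, time_mib = variants["time"]
        mem_sec, mem_mib = variants["memory"]
        rows[question] = {
            "runtime_ratio": mem_sec / time_sec,
            "mem_delta_diff_mib": mem_mib - time_mib,
        }
    return rows


if __name__ == "__main__":
    for question, row in compare(reported).items():
        print(
            f"{question}: runtime x{row['runtime_ratio']:.2f}, "
            f"memory delta {row['mem_delta_diff_mib']:+.3f} MiB"
        )
```

A ratio above 1 means the memory-oriented variant was slower, and a negative memory difference means it used less peak memory in this run.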
