Performance: backend model-build is O(n²); related per-cycle hot-path costs

`build_model()` runs every 0.5s on the single controller thread while a download is active. Several avoidable costs compound on large seedbox trees, stalling command processing, scan ingestion, and SSE diff generation for all clients.

### 1. BFS uses list.pop(0) → O(n²) model build

**Severity:** medium · **Effort:** quick · **Dimension:** performance · **Verdict:** confirmed

**Location:** `src/python/controller/model_builder.py:242-243, 429-440`

> **Corrected/extra locations:** src/python/controller/model_builder.py:243 and 432-440 (get_children copy.copy at src/python/model/file.py:262)

**Impact**
Both per-file tree traversals pop from the front of a Python list, which is O(n) each time (shifts the whole list). For a directory with N descendant files/dirs the BFS becomes O(N^2). _check_root_downloaded also calls get_children() (which does copy.copy of the child list) on every node. With a large seedbox directory (tens of thousands of nested files), build_model() — run every 0.5s on the single controller thread while a download is active (active-files invalidate the cache every cycle, line 54) — can spike CPU and stall command processing, scan ingestion, and SSE diff generation for all clients.

**Recommended fix**
Use collections.deque and popleft() for both frontiers (O(1) pop), or convert the BFS to an explicit index/stack-based traversal. Avoid get_children()'s defensive copy.copy in the hot check loop (iterate the underlying list, or expose an iterator). This is a small, contained change with a large asymptotic win on big trees.

---

### 2. SSE model serialization recomputes full subtree JSON per event per client

**Severity:** medium · **Effort:** moderate · **Dimension:** performance · **Verdict:** confirmed

**Location:** `src/python/web/serialize/serialize_model.py:69-118`

> **Corrected/extra locations:** src/python/web/serialize/serialize_model.py:69-118; src/python/web/handler/stream_model.py:43-64; serialization driven per-client from src/python/web/web_app.py:160-168

**Impact**
On each controller cycle, every UPDATED diff (a top-level dir whose size/eta/speed changed during download) is re-serialized into a JSON tree of the *entire* directory subtree, separately for each connected client, on each client's web thread. A single actively-downloading multi-thousand-file directory produces a large JSON blob every 0.5s per client. With several open browser tabs the redundant serialization and payload size scale linearly with clients and tree size.

**Recommended fix**
Serialize each diff's JSON once per cycle and reuse the string across all client queues (e.g. enqueue the pre-serialized SSE frame rather than the ModelFile), or only emit changed fields rather than the full subtree for UPDATED events. Caching the serialized frame per (event, cycle) removes the per-client recomputation.

---

### 3. sync_persist re-filters all persist sets per changed file

**Severity:** low · **Effort:** moderate · **Dimension:** performance · **Verdict:** unverified

**Location:** `src/python/controller/model_updater.py:159-194, 298-340, 438-481`

**Impact**
During bursts (e.g. a directory of many small files all completing in the same cycle, each going through _handle_downloaded_file -> sync_persist_to_all_builders), the full persist set is re-filtered and re-pushed to every pair builder once per file. This is O(files_changed * pairs * total_persist_keys) within a single cycle, and each set_*_files on the builder then deep-compares to decide cache invalidation (compounding finding #2).

**Recommended fix**
Batch the sync: mark persist as dirty during diff processing and call sync_persist_to_all_builders once at the end of update() instead of after each individual mutation. Filtering once per cycle instead of once per changed file removes the quadratic burst behavior.

---

### 4. Per-diff/completion pair lookups are linear scans

**Severity:** low · **Effort:** quick · **Dimension:** performance · **Verdict:** unverified

**Location:** `src/python/controller/command_pipeline.py:473-490`

**Impact**
For each model diff in a cycle, the owning pair is found by a linear scan over all configured pairs. This is O(diffs * pairs). Pair count is typically tiny (single digits) so blast radius is small, but it is on the per-cycle hot path and trivially avoidable.

**Recommended fix**
Build a dict {pair_id: PairContext} once (and on pair config changes) and look up by key. Keep the None/default-pair fallback. Quick, removes the repeated scans.

---

### 5. SSE poll loop fixed 250ms busy-wait per connection

**Severity:** low · **Effort:** moderate · **Dimension:** performance · **Verdict:** unverified

**Location:** `src/python/web/web_app.py:64, 156-176`

**Impact**
Each open SSE connection is a daemon thread that wakes 4x/second regardless of activity, draining empty StreamQueues (status/log/model) via non-blocking get. It is a busy-poll rather than an event-driven wait. CPU cost is small per connection but scales with concurrent clients, and it caps event delivery latency at ~250ms. Queues are already thread-safe queue.Queue objects that support blocking get with timeout.

**Recommended fix**
Replace the fixed sleep with a blocking wait on the underlying queue(s) (queue.get(timeout=...) on a merged/condition-signaled source), so threads sleep until data arrives and wake immediately on new events. At minimum, make StreamQueue expose a blocking get_next_event(timeout) and have the loop wait on it instead of poll+sleep.

---

### Acceptance criteria

- [ ] Both BFS frontiers use `deque.popleft()` (O(1)); the `get_children()` defensive copy is avoided in the hot check loop
- [ ] Each diff is serialized once per cycle and the frame reused across client queues (or only changed fields emitted)
- [ ] Persist sync is batched once per `update()` cycle rather than per changed file
- [ ] Pair lookup uses a `{pair_id: PairContext}` dict
- [ ] SSE stream loop waits on the queue (blocking get with timeout) instead of a fixed 250ms poll

---
_From the codebase audit — see `REVIEW.md`. Generated by the `codebase-audit` workflow; findings adversarially verified against the code._

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance: backend model-build is O(n²); related per-cycle hot-path costs #520

1. BFS uses list.pop(0) → O(n²) model build

2. SSE model serialization recomputes full subtree JSON per event per client

3. sync_persist re-filters all persist sets per changed file

4. Per-diff/completion pair lookups are linear scans

5. SSE poll loop fixed 250ms busy-wait per connection

Acceptance criteria

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Performance: backend model-build is O(n²); related per-cycle hot-path costs #520

Description

1. BFS uses list.pop(0) → O(n²) model build

2. SSE model serialization recomputes full subtree JSON per event per client

3. sync_persist re-filters all persist sets per changed file

4. Per-diff/completion pair lookups are linear scans

5. SSE poll loop fixed 250ms busy-wait per connection

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions