Size / Priority
- Size: M — diagnostic + fix per offender.
- Category: C.4 Test-Infrastructure.
- Risk: medium — flaky tests are notoriously hard.
Affected files
tests/multi-node/tcp-*, tests/multi-node/udp-*, tests/multi-node/websocket-*, tests/unit/persistence/filesystem-* — tests that fail under parallel execution load.
Background
Some tests pass reliably in isolation but fail under parallel execution. Root causes:
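- Port collisions: TCP/UDP tests pick a port number first and bind later; under load, two tests can pick the same port before either has bound it.
- Filesystem races: concurrent tests write to the same temp dir.
- WebSocket connect/disconnect races: tests depend on handshake state that is not yet settled (e.g., sending before the connection is open, or asserting before a close completes).
- Resource exhaustion: too many sockets are open at once under stress.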
These tests typically have a flake rate of 1-5%. CI sometimes retries failed runs, and flake rates rise over time.
Target
Per-offender investigation + fix:
- TCP/UDP: use a unique port per test (e.g., bind to port 0 so the OS allocates an ephemeral port, then read the assigned port from the bound socket).
- Filesystem: each test gets its own UUID-named temp dir.
- WebSocket: serialise these via test-runner serialisation hints (it.serial if available, or move to a dedicated workspace).
- Resource limits: per-test cleanup + reduced concurrency for resource-heavy tests.
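The port and temp-dir fixes above can be sketched as follows. This is a minimal illustration assuming the tests run on Node.js; the helper names (listenTcpOnEphemeralPort, bindUdpOnEphemeralPort, makeIsolatedTempDir) are hypothetical and not part of the existing testkit.

```typescript
import * as net from "node:net";
import * as dgram from "node:dgram";
import * as fs from "node:fs/promises";
import * as os from "node:os";
import * as path from "node:path";
import { randomUUID } from "node:crypto";

// Hypothetical helper: listen on an OS-assigned TCP port.
// Binding to port 0 makes the kernel choose and reserve a free port in one
// step, so there is no gap between "find a free port" and "bind it".
function listenTcpOnEphemeralPort(server: net.Server): Promise<number> {
  return new Promise((resolve, reject) => {
    server.once("error", reject);
    server.listen(0, "127.0.0.1", () => {
      server.off("error", reject);
      const address = server.address();
      if (address === null || typeof address === "string") {
        reject(new Error("expected an AddressInfo for a TCP listener"));
        return;
      }
      resolve(address.port);
    });
  });
}

// Hypothetical helper: the same idea for a UDP socket.
function bindUdpOnEphemeralPort(socket: dgram.Socket): Promise<number> {
  return new Promise((resolve, reject) => {
    socket.once("error", reject);
    socket.bind(0, "127.0.0.1", () => {
      socket.off("error", reject);
      resolve(socket.address().port);
    });
  });
}

// Hypothetical helper: create a UUID-named directory under the OS temp dir.
// mkdir without `recursive` fails if the directory already exists, so a
// name clash surfaces as an error instead of silently sharing a directory.
async function makeIsolatedTempDir(prefix = "test-"): Promise<string> {
  const dir = path.join(os.tmpdir(), `${prefix}${randomUUID()}`);
  await fs.mkdir(dir);
  return dir;
}
```

A test would call these in its setup, pass the returned port or directory to the code under test, and close the server or remove the directory in its teardown.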
Alternative: a testkit-isolation mode that ensures problematic tests run sequentially while everything else runs parallel.
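One possible shape for such an isolation mode is a shared queue that problematic tests opt into, sketched below. The SerialQueue class is hypothetical, and it only serialises work inside a single process; if the runner spreads tests across worker processes, the same effect would have to come from runner configuration instead.

```typescript
// Hypothetical in-process queue: tasks passed to runExclusive execute one at
// a time, in call order. Code that does not use the queue is unaffected and
// keeps running concurrently.
class SerialQueue {
  private tail: Promise<void> = Promise.resolve();

  runExclusive<T>(task: () => Promise<T>): Promise<T> {
    // Start this task only after every previously queued task has settled.
    const result = this.tail.then(task);
    // Swallow the outcome on the internal chain so that one failing task
    // does not prevent later tasks from running. Callers still see the
    // rejection through `result`.
    this.tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }
}
```

A flaky WebSocket test would wrap its body in a call to runExclusive on a module-level queue, while unrelated tests skip the queue entirely.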
Integration / risk
- Each fix is small but the diagnosis is hard.
- May require test-runner config changes.
Test plan
- Catalog offenders: enumerate failing tests with their failure modes.
- Per-offender fix.
- Stress test: run the suite N times under parallel load; verify the flake rate is below 0.1%.
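The stress step could be driven by a small script like the sketch below. It assumes the suite can be started as a command that exits non-zero on failure; the function names are hypothetical, and the actual command and its arguments are left to the caller.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper: run a command `runs` times and count non-zero exits.
function countFailures(command: string, args: string[], runs: number): number {
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    const result = spawnSync(command, args, { stdio: "ignore" });
    // status is null when the process was killed by a signal; count that
    // as a failure too.
    if (result.status !== 0) failures++;
  }
  return failures;
}

// Observed flake rate as a fraction of runs.
function observedFlakeRate(failures: number, runs: number): number {
  return failures / runs;
}

// "Rule of three": if n independent runs all pass, an approximate 95% upper
// bound on the true failure rate is 3 / n. This returns how many clean runs
// are needed before that bound drops to the target rate.
function cleanRunsNeeded(targetRate: number): number {
  return Math.ceil(3 / targetRate);
}
```

Treating each suite run as one trial, a 0.1% target implies a few thousand clean runs before the bound is met, so N has to be chosen accordingly; a handful of green runs does not demonstrate it.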
Acceptance criteria