Detector: Test Coherence

Jacob Centner edited this page Apr 10, 2026 · 1 revision

Uses an LLM to compare test functions against their implementation counterparts and flag stale or meaningless tests.

| Property | Value |
| --- | --- |
| Name | test-coherence |
| Tier | LLM_ASSISTED |
| Languages | Python |
| External tool | None |
| LLM required | Yes (minimum capability tier: basic) |
| Confidence | 0.60 (basic), 0.75 (enhanced) |

What it detects

Tests that no longer meaningfully validate the code they claim to test:

  • Test assertions that don't match the current function behavior
  • Tests that exercise an old API signature
  • Tests that are trivially always-passing
  • Tests that duplicate other tests without adding coverage

How it works

  1. Finds test files matching test_*.py or *_test.py
  2. Pairs test files with implementation files via:
    • Naming convention: test_config.py → config.py
    • Import analysis: follows imports in the test file
  3. Extracts matched (test function, implementation function) pairs via AST
  4. Sends each pair to the LLM for coherence assessment
  5. LLM judges whether the test meaningfully validates the current implementation

Limits

| Parameter | Value |
| --- | --- |
| Max pairs per file | 5 (prevents runaway LLM costs) |
| Min function lines | 3 (skips trivial functions) |
| Truncation (basic) | 1500 chars |
| Truncation (enhanced) | 3000 chars |

Capability tiers

| Tier | Behavior |
| --- | --- |
| basic | Binary coherent/stale judgment, smaller context |
| standard+ | Nuanced assessment with severity levels, larger context |

Example finding

[TEST-COHERENCE] tests/test_config.py:test_load_config → src/sentinel/config.py:load_config
  Test checks for 'model' field default of 'llama2' but the implementation
  default was changed to 'qwen3.5:4b'. Test is stale.
  Severity: MEDIUM, Confidence: 0.60
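
For readers who want to produce or post-process findings in this shape, a minimal sketch of a record that renders to the format above might look like this. The `Finding` class and its field names are hypothetical and are not the tool's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical container mirroring the fields visible in the example finding."""
    test_file: str
    test_function: str
    impl_file: str
    impl_function: str
    message: str
    severity: str
    confidence: float

    def render(self) -> str:
        header = (
            f"[TEST-COHERENCE] {self.test_file}:{self.test_function}"
            f" → {self.impl_file}:{self.impl_function}"
        )
        # Indent every message line by two spaces, as in the example.
        body = "\n".join(f"  {line}" for line in self.message.splitlines())
        footer = f"  Severity: {self.severity}, Confidence: {self.confidence:.2f}"
        return "\n".join([header, body, footer])
```

Formatting the confidence with two decimals keeps values such as 0.6 aligned with the `0.60` shown in the example.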

Configuration

[sentinel]
model_capability = "basic"     # minimum for this detector
skip_llm = false              # must be false
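
A pre-flight check for these two settings could be sketched as below. It assumes the `[sentinel]` table has already been parsed into a plain dict, and both the function name and the strict handling of missing keys are illustrative choices rather than documented behavior.

```python
def can_run_test_coherence(sentinel_settings: dict) -> bool:
    """Return True only if the LLM is explicitly enabled and a capability tier is set."""
    # skip_llm must be present and false; a missing key is treated as "not enabled"
    # here, which is an assumption since the page does not state the default.
    llm_enabled = sentinel_settings.get("skip_llm") is False
    has_tier = bool(sentinel_settings.get("model_capability"))
    return llm_enabled and has_tier
```

With the configuration shown above, `can_run_test_coherence({"model_capability": "basic", "skip_llm": False})` evaluates to `True`.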

Known limitations

  • Python-only
  • Requires a healthy LLM provider
  • Quality depends on model capability
  • Pairs tests only via naming convention and imports, so indirect test coverage is missed
  • Not yet validated at scale against real-world repos
  • Max 5 pairs per file may miss issues in files with many functions
