
Validate language auto-detection accuracy before defaulting to it #61

@ar7casper

Description
Context

detect_language() exists in core/parser_adapter.py and supports a dominance heuristic (count source files by extension, return the language with the most files, after filtering common non-source dirs like node_modules, __pycache__, vendor, etc.).

It was historically callable through the Python API (parse_repository(language="auto")), but the Go CLI required --language explicitly, gating the heuristic from the default user path. #40 proposes removing that gate and making --language optional in openant init.

The algorithm itself is unchanged from current master — #40 moves the config to a shared config/languages.json (eliminating Go↔Python drift), adds tests for the algorithm and init flow, and drops the .git requirement, but the dominance heuristic is byte-for-byte identical.

The concern

A wrong auto-detect at openant init cascades through every subsequent command. The detected language is written to ~/.openant/projects/<name>/project.json and read by core/parser_adapter.py to dispatch to parsers/<lang>/test_pipeline.py. The user might never notice the wrong parser ran until output looks weird.

Reliability of the dominance heuristic on real-world repos isn't quantified today. Several edge cases the algorithm doesn't handle well by construction:

  • Polyglot repos with auxiliary languages: a Python service with a TypeScript frontend (web/), a Go backend with Python build/migration scripts (scripts/), a Rust project with vendored C bindings.
  • File count ≠ code volume: 100 small .ts declaration files vs. 50 large .py files; the algorithm picks .ts.
  • Near-ties: 50/50 splits resolve based on rglob walk order, which is non-deterministic across platforms.
  • No user signal: detection runs silently, no count display, no near-tie warning.
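The near-tie problem is easy to reproduce in miniature: with equal counts, the winner depends entirely on the order in which files were walked, because `Counter.most_common` breaks ties by insertion order.

```python
from collections import Counter

# Two walk orders over the same 50/50 split yield different winners.
walk_a = [".py"] * 50 + [".ts"] * 50
walk_b = [".ts"] * 50 + [".py"] * 50

winner_a = Counter(walk_a).most_common(1)[0][0]
winner_b = Counter(walk_b).most_common(1)[0][0]
assert winner_a == ".py" and winner_b == ".ts"  # same repo, different answers
```

Since `rglob` order is not guaranteed across platforms or filesystems, this is exactly the non-determinism described above.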

What's tested today (in #40)

  • 11 unit tests on detect_language() with synthetic file trees (Python, JS, TS, Go, mixed root, skip_dirs, empty, non-git directories).
  • 8 integration tests on openant init verifying project.json["language"] matches expected for synthetic fixtures.

What's not tested

  • End-to-end correctness: init (auto-detected) → parse → verify the per-language parser actually ran and produced expected dataset shape. The cascade between "right language string" and "right parser output" is unverified.
  • Polyglot real-world fixtures: tests cover 6-TS-vs-4-Py and 7-Py-vs-3-JS at the same root, but real polyglot shapes (frontend/backend split, build-tooling, vendored deps) aren't represented.
  • Calibration corpus: no quantified accuracy on known OSS repos.

Proposed validation work (3 pieces)

Piece 1 — End-to-end fixture test

For each of the 7 supported languages, build a small but realistic fixture and run the full init → parse flow with auto-detect, asserting the per-language parser ran and produced the expected dataset shape (correct unit_type values, language-specific call graph fields, expected output paths).

This is the test that connects "right language string in project.json" to "right parser output." Catches dispatch regressions and parser-side incompatibilities.
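A sketch of the shape assertion such a test might end with (the field names and unit_type values are assumptions for illustration, not the project's confirmed schema):

```python
# Hypothetical shape check run against a parser's dataset output.
# Field names ("unit_type", "calls") and the per-language value sets are
# illustrative -- the real schema lives in parsers/<lang>/test_pipeline.py.
EXPECTED_UNIT_TYPES = {
    "python": {"function", "method", "class"},  # assumed values
    "go": {"func", "method"},                   # assumed values
}

def assert_dataset_shape(dataset: list[dict], language: str) -> None:
    for unit in dataset:
        assert unit["unit_type"] in EXPECTED_UNIT_TYPES[language]
        assert "calls" in unit  # language-specific call-graph field (assumed)

# Usage against a fake Python-parser output:
assert_dataset_shape([{"unit_type": "function", "calls": []}], "python")
```

The point is that the assertion is per-language: a Go parser accidentally run on a Python fixture should fail it, not just produce a different-looking file.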

Piece 2 — Polyglot regression fixtures

4-6 fixtures matching real shapes:

  • Python service with TypeScript frontend in web/
  • Go monorepo with Python tooling in scripts/
  • TypeScript NestJS backend with migrations/*.py
  • Ruby on Rails app with embedded JS (app/javascript/)
  • C project with Python bindings
  • (etc.)

Pinned expected outputs. Catches the dominance-heuristic edge cases where the file-count-majority language isn't what a human would scan.
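One way such fixtures could be laid out, sketched with hypothetical names and a pinned human-expected answer per fixture:

```python
import tempfile
from pathlib import Path

# Hypothetical polyglot fixtures: file layout -> the language a human would
# pick, pinned regardless of raw file counts. Names are illustrative.
POLYGLOT_FIXTURES = {
    "py_service_ts_frontend": (
        ["main.py", "api.py", "web/app.ts", "web/ui.ts", "web/util.ts"],
        "python",
    ),
    "go_monorepo_py_tooling": (
        ["cmd/srv/main.go", "pkg/db.go", "scripts/migrate.py"],
        "go",
    ),
}

def build_fixture(name: str) -> tuple[Path, str]:
    """Materialize one fixture tree on disk; return (root, expected language)."""
    files, expected = POLYGLOT_FIXTURES[name]
    root = Path(tempfile.mkdtemp(prefix=name))
    for rel in files:
        p = root / rel
        p.parent.mkdir(parents=True, exist_ok=True)
        p.touch()
    return root, expected
```

Note the first fixture deliberately has more .ts files than .py files: it pins the case where file-count majority and the human answer disagree.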

Piece 3 — OSS calibration corpus

10-15 well-known OSS repos (Django, Express, Kubernetes, Rails, Flask, Next.js, Buf, etc.). Run auto-detect against each, compare against the language a human would obviously pick. Gives a quantified accuracy number.

If accuracy is high enough → comfortable defaulting to auto. If not → keep explicit -l required (or expose -l auto as opt-in but warn loudly).
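Scoring the corpus could look roughly like this (repo labels are an illustrative subset; the detection results would come from running auto-detect against each checkout):

```python
# Hypothetical calibration harness: compare auto-detect output against
# human-labeled ground truth and report accuracy plus disagreements.
GROUND_TRUTH = {  # repo -> the language a human would obviously pick
    "django": "python", "express": "javascript", "kubernetes": "go",
    "rails": "ruby", "flask": "python",
}

def score(detections: dict[str, str]) -> tuple[float, list[str]]:
    misses = [repo for repo, lang in GROUND_TRUTH.items()
              if detections.get(repo) != lang]
    accuracy = 1 - len(misses) / len(GROUND_TRUTH)
    return accuracy, misses

# e.g. one wrong detection out of five -> 80% accuracy, misses=["rails"]
acc, misses = score({"django": "python", "express": "javascript",
                     "kubernetes": "go", "rails": "javascript",
                     "flask": "python"})
```

The misses list matters as much as the number: it tells us which repo shapes (here, a Rails app with heavy embedded JS) need Piece 2 fixtures.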

UX improvements worth considering regardless

  • Print the counts ("Detected: python (47 files), javascript (12 files)") so the user can verify before proceeding.
  • Warn on near-ties (e.g., when the runner-up is within 20% of the leader) and require -l explicitly in those cases.
  • Sanity check with --dry-run that shows what would be detected without writing project.json.

Decision needed

Before flipping --language from required to optional (#40), we want some confidence the dominance heuristic actually picks the right thing on real repos. Options:

  1. Block #40's CLI default change until Pieces 1 + 2 + 3 are done.
  2. Merge #40's other improvements (shared config, non-git path, tests) but keep -l required; expose -l auto as opt-in.
  3. Merge #40 as-is, accept the risk, and address validation in this issue as a follow-up.

Why this is a separate issue

#40 is a UX improvement that touches a real reliability question. Splitting the discussion lets #40's parser/config/non-git work merge on its own merits while we figure out the right validation bar for default-on auto-detection.
