
Validate language auto-detection accuracy before defaulting to it #61

@ar7casper

Description
Context

detect_language() exists in core/parser_adapter.py and supports a dominance heuristic (count source files by extension, return the language with the most files, after filtering common non-source dirs like node_modules, __pycache__, vendor, etc.).

It was historically callable through the Python API (parse_repository(language="auto")), but the Go CLI required --language explicitly, gating the heuristic from the default user path. #40 proposes removing that gate and making --language optional in openant init.

The algorithm itself is unchanged from current master — #40 moves the config to a shared config/languages.json (eliminating Go↔Python drift), adds tests for the algorithm and init flow, and drops the .git requirement, but the dominance heuristic is byte-for-byte identical.

The concern

A wrong auto-detect at openant init cascades through every subsequent command. The detected language is written to ~/.openant/projects/<name>/project.json and read by core/parser_adapter.py to dispatch to parsers/<lang>/test_pipeline.py. The user might never notice the wrong parser ran until output looks weird.

Reliability of the dominance heuristic on real-world repos isn't quantified today. Several edge cases the algorithm doesn't handle well by construction:

  • Polyglot repos with auxiliary languages: a Python service with a TypeScript frontend (web/), a Go backend with Python build/migration scripts (scripts/), a Rust project with vendored C bindings.
  • File count ≠ code volume: 100 small .ts declaration files vs. 50 large .py files; the algorithm picks .ts.
  • Near-ties: 50/50 splits resolve based on rglob walk order, which is non-deterministic across platforms.
  • No user signal: detection runs silently, no count display, no near-tie warning.
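The near-tie problem is easy to reproduce in miniature: with equal counts, the winner depends entirely on the order in which files were walked, because `Counter.most_common` breaks ties by insertion order.

```python
from collections import Counter

# Two walk orders over the same 50/50 split yield different winners.
walk_a = [".py"] * 50 + [".ts"] * 50
walk_b = [".ts"] * 50 + [".py"] * 50

winner_a = Counter(walk_a).most_common(1)[0][0]
winner_b = Counter(walk_b).most_common(1)[0][0]
assert winner_a == ".py" and winner_b == ".ts"  # same repo, different answers
```

Since `rglob` order is not guaranteed across platforms or filesystems, this is exactly the non-determinism described above.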

What's tested today (in #40)

  • 11 unit tests on detect_language() with synthetic file trees (Python, JS, TS, Go, mixed root, skip_dirs, empty, non-git directories).
  • 8 integration tests on openant init verifying project.json["language"] matches expected for synthetic fixtures.

What's not tested

  • End-to-end correctness: init (auto-detected) → parse → verify the per-language parser actually ran and produced expected dataset shape. The cascade between "right language string" and "right parser output" is unverified.
  • Polyglot real-world fixtures: tests cover 6-TS-vs-4-Py and 7-Py-vs-3-JS at the same root, but real polyglot shapes (frontend/backend split, build-tooling, vendored deps) aren't represented.
  • Calibration corpus: no quantified accuracy on known OSS repos.

Proposed validation work (3 pieces)

Piece 1 — End-to-end fixture test

For each of the 7 supported languages, build a small but realistic fixture and run the full init → parse flow with auto-detect, asserting the per-language parser ran and produced the expected dataset shape (correct unit_type values, language-specific call graph fields, expected output paths).

This is the test that connects "right language string in project.json" to "right parser output." Catches dispatch regressions and parser-side incompatibilities.
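A sketch of the shape assertion such a test might end with (the field names and unit_type values are assumptions for illustration, not the project's confirmed schema):

```python
# Hypothetical shape check run against a parser's dataset output.
# Field names ("unit_type", "calls") and the per-language value sets are
# illustrative -- the real schema lives in parsers/<lang>/test_pipeline.py.
EXPECTED_UNIT_TYPES = {
    "python": {"function", "method", "class"},  # assumed values
    "go": {"func", "method"},                   # assumed values
}

def assert_dataset_shape(dataset: list[dict], language: str) -> None:
    for unit in dataset:
        assert unit["unit_type"] in EXPECTED_UNIT_TYPES[language]
        assert "calls" in unit  # language-specific call-graph field (assumed)

# Usage against a fake Python-parser output:
assert_dataset_shape([{"unit_type": "function", "calls": []}], "python")
```

The point is that the assertion is per-language: a Go parser accidentally run on a Python fixture should fail it, not just produce a different-looking file.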

Piece 2 — Polyglot regression fixtures

4-6 fixtures matching real shapes:

  • Python service with TypeScript frontend in web/
  • Go monorepo with Python tooling in scripts/
  • TypeScript NestJS backend with migrations/*.py
  • Ruby on Rails app with embedded JS (app/javascript/)
  • C project with Python bindings
  • (etc.)

Pinned expected outputs. Catches the dominance-heuristic edge cases where the file-count-majority language isn't what a human would scan.
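One way such fixtures could be laid out, sketched with hypothetical names and a pinned human-expected answer per fixture:

```python
import tempfile
from pathlib import Path

# Hypothetical polyglot fixtures: file layout -> the language a human would
# pick, pinned regardless of raw file counts. Names are illustrative.
POLYGLOT_FIXTURES = {
    "py_service_ts_frontend": (
        ["main.py", "api.py", "web/app.ts", "web/ui.ts", "web/util.ts"],
        "python",
    ),
    "go_monorepo_py_tooling": (
        ["cmd/srv/main.go", "pkg/db.go", "scripts/migrate.py"],
        "go",
    ),
}

def build_fixture(name: str) -> tuple[Path, str]:
    """Materialize one fixture tree on disk; return (root, expected language)."""
    files, expected = POLYGLOT_FIXTURES[name]
    root = Path(tempfile.mkdtemp(prefix=name))
    for rel in files:
        p = root / rel
        p.parent.mkdir(parents=True, exist_ok=True)
        p.touch()
    return root, expected
```

Note the first fixture deliberately has more .ts files than .py files: it pins the case where file-count majority and the human answer disagree.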

Piece 3 — OSS calibration corpus

10-15 well-known OSS repos (Django, Express, Kubernetes, Rails, Flask, Next.js, Buf, etc.). Run auto-detect against each, compare against the language a human would obviously pick. Gives a quantified accuracy number.

If accuracy is high enough → comfortable defaulting to auto. If not → keep explicit -l required (or expose -l auto as opt-in but warn loudly).
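Scoring the corpus could look roughly like this (repo labels are an illustrative subset; the detection results would come from running auto-detect against each checkout):

```python
# Hypothetical calibration harness: compare auto-detect output against
# human-labeled ground truth and report accuracy plus disagreements.
GROUND_TRUTH = {  # repo -> the language a human would obviously pick
    "django": "python", "express": "javascript", "kubernetes": "go",
    "rails": "ruby", "flask": "python",
}

def score(detections: dict[str, str]) -> tuple[float, list[str]]:
    misses = [repo for repo, lang in GROUND_TRUTH.items()
              if detections.get(repo) != lang]
    accuracy = 1 - len(misses) / len(GROUND_TRUTH)
    return accuracy, misses

# e.g. one wrong detection out of five -> 80% accuracy, misses=["rails"]
acc, misses = score({"django": "python", "express": "javascript",
                     "kubernetes": "go", "rails": "javascript",
                     "flask": "python"})
```

The misses list matters as much as the number: it tells us which repo shapes (here, a Rails app with heavy embedded JS) need Piece 2 fixtures.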

UX improvements worth considering regardless

  • Print the counts ("Detected: python (47 files), javascript (12 files)") so the user can verify before proceeding.
  • Warn on near-ties (e.g., when the runner-up is within 20% of the leader) and require -l explicitly in those cases.
  • Sanity check with --dry-run that shows what would be detected without writing project.json.

Decision needed

Before flipping --language from required to optional (#40), we want some confidence the dominance heuristic actually picks the right thing on real repos. Options:

  1. Block #40's CLI default change until Pieces 1 + 2 + 3 are done.
  2. Merge #40's other improvements (shared config, non-git path, tests) but keep -l required; expose -l auto as opt-in.
  3. Merge #40 as-is, accept the risk, and address validation in this issue as a follow-up.

Why this is a separate issue

#40 is a UX improvement that touches a real reliability question. Splitting the discussion lets #40's parser/config/non-git work merge on its own merits while we figure out the right validation bar for default-on auto-detection.
