Context
detect_language() exists in core/parser_adapter.py and implements a dominance heuristic: it counts source files by extension, filters out common non-source dirs (node_modules, __pycache__, vendor, etc.), and returns the language with the most files.
It was historically callable through the Python API (parse_repository(language="auto")), but the Go CLI required --language explicitly, gating the heuristic from the default user path. #40 proposes removing that gate and making --language optional in openant init.
The algorithm itself is unchanged from current master — #40 moves the config to a shared config/languages.json (eliminating Go↔Python drift), adds tests for the algorithm and init flow, and drops the .git requirement, but the dominance heuristic is byte-for-byte identical.
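For readers who haven't opened core/parser_adapter.py, the dominance heuristic described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: EXT_TO_LANG, SKIP_DIRS, count_languages and detect_language_sketch are hypothetical names, and the real extension map and skip list live in the project's config.

```python
from collections import Counter
from pathlib import Path
from typing import Optional

# Hypothetical extension map and skip list (assumptions, not the project's real config).
EXT_TO_LANG = {".py": "python", ".js": "javascript", ".ts": "typescript", ".go": "go"}
SKIP_DIRS = {"node_modules", "__pycache__", "vendor", ".git"}


def count_languages(root: Path) -> Counter:
    """Count source files per language, ignoring files under skipped directories."""
    counts: Counter = Counter()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Only the directory components relative to the root are checked.
        parent_parts = path.relative_to(root).parts[:-1]
        if any(part in SKIP_DIRS for part in parent_parts):
            continue
        lang = EXT_TO_LANG.get(path.suffix.lower())
        if lang is not None:
            counts[lang] += 1
    return counts


def detect_language_sketch(root: Path) -> Optional[str]:
    """Return the language with the most files, or None if no source files exist."""
    counts = count_languages(root)
    if not counts:
        return None
    # On a tie, max() returns the first language counted, i.e. walk order decides.
    return max(counts.items(), key=lambda kv: kv[1])[0]
```

Note that the sketch has no notion of code volume or directory role, which is where the concerns below come from.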
The concern
A wrong auto-detect at openant init cascades through every subsequent command. The detected language is written to ~/.openant/projects/<name>/project.json and read by core/parser_adapter.py to dispatch to parsers/<lang>/test_pipeline.py. The user might never notice the wrong parser ran until output looks weird.
The reliability of the dominance heuristic on real-world repos isn't quantified today. By construction, the algorithm handles several edge cases poorly:
Polyglot repos with auxiliary languages: a Python service with a TypeScript frontend (web/), a Go backend with Python build/migration scripts (scripts/), a Rust project with vendored C bindings.
File count ≠ code volume: 100 small .ts declaration files vs. 50 large .py files; the algorithm picks .ts.
Near-ties: 50/50 splits resolve based on rglob walk order, which is non-deterministic across platforms.
No user signal: detection runs silently, no count display, no near-tie warning.
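Two of the edge cases above (file count vs. code volume, and near-ties) can be reproduced without touching the filesystem. The sketch below is illustrative only: the file names and sizes are made up, EXT_TO_LANG is a hypothetical map, and byte weighting and the alphabetical tiebreak are shown purely for contrast, not as what detect_language() does or as a proposal.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical extension map, for illustration only.
EXT_TO_LANG = {".ts": "typescript", ".py": "python", ".go": "go"}


def lang_of(name: str) -> Optional[str]:
    for ext, lang in EXT_TO_LANG.items():
        if name.endswith(ext):
            return lang
    return None


def tally(files: List[Tuple[str, int]], by_bytes: bool = False) -> Dict[str, int]:
    """Sum file counts (default) or byte sizes per language, in input order."""
    totals: Dict[str, int] = {}
    for name, size in files:
        lang = lang_of(name)
        if lang is not None:
            totals[lang] = totals.get(lang, 0) + (size if by_bytes else 1)
    return totals


def first_seen_leader(totals: Dict[str, int]) -> str:
    # max() returns the first maximal item, so ties follow insertion (walk) order.
    return max(totals.items(), key=lambda kv: kv[1])[0]


def stable_leader(totals: Dict[str, int]) -> str:
    # Break ties by language name so the result does not depend on walk order.
    return min(totals.items(), key=lambda kv: (-kv[1], kv[0]))[0]


# Made-up repo: 100 small .ts declaration files vs. 50 large .py files.
files = [(f"types/d{i}.d.ts", 200) for i in range(100)]
files += [(f"svc/m{i}.py", 8000) for i in range(50)]

by_count = first_seen_leader(tally(files))                 # file-count majority
by_volume = first_seen_leader(tally(files, by_bytes=True))  # byte-volume majority
```

With these inputs, counting files favors TypeScript while weighting by bytes favors Python. For an exact tie, first_seen_leader returns whichever language the walk happened to reach first, whereas stable_leader gives the same answer regardless of order.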
What's tested today (in #40)
11 unit tests on detect_language() with synthetic file trees (Python, JS, TS, Go, mixed root, skip_dirs, empty, non-git directories).
8 integration tests on openant init verifying project.json["language"] matches expected for synthetic fixtures.
What's not tested
End-to-end correctness: init (auto-detected) → parse → verify the per-language parser actually ran and produced expected dataset shape. The cascade between "right language string" and "right parser output" is unverified.
Polyglot real-world fixtures: tests cover 6-TS-vs-4-Py and 7-Py-vs-3-JS at the same root, but real polyglot shapes (frontend/backend split, build-tooling, vendored deps) aren't represented.
Calibration corpus: no quantified accuracy on known OSS repos.
Proposed validation work (3 pieces)
Piece 1 — End-to-end fixture test
For each of the 7 supported languages, build a small but realistic fixture and run the full init → parse flow with auto-detect, asserting the per-language parser ran and produced expected dataset shape (correct unit_type values, language-specific call graph fields, expected output paths).
This is the test that connects "right language string in project.json" to "right parser output." Catches dispatch regressions and parser-side incompatibilities.
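The "expected dataset shape" assertion could look something like the sketch below. It assumes the parser output can be loaded as a list of dicts that each carry a unit_type key; that layout, the helper name check_dataset_shape, and the allowed values and required fields passed in are all assumptions for illustration, not the real dataset schema.

```python
from typing import Any, Dict, List, Set


def check_dataset_shape(
    units: List[Dict[str, Any]],
    allowed_unit_types: Set[str],
    required_fields: Set[str],
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the shape is OK."""
    problems: List[str] = []
    for i, unit in enumerate(units):
        unit_type = unit.get("unit_type")
        if unit_type not in allowed_unit_types:
            problems.append(f"unit {i}: unexpected unit_type {unit_type!r}")
        # Fields expected for this language (e.g. call graph fields) must be present.
        missing = sorted(required_fields - set(unit))
        if missing:
            problems.append(f"unit {i}: missing fields {missing}")
    return problems
```

An end-to-end test would run init and parse on a fixture, load the resulting dataset, and assert that this helper returns an empty list for the per-language expectations.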
Piece 2 — Polyglot regression fixtures
4-6 fixtures matching real shapes:
Python service with TypeScript frontend in web/
Go monorepo with Python tooling in scripts/
TypeScript NestJS backend with migrations/*.py
Ruby on Rails app with embedded JS (app/javascript/)
C project with Python bindings
(etc.)
Each fixture has a pinned expected output. This catches the dominance-heuristic edge cases where the language with the most files isn't the one a human would choose to scan.
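One possible shape for these regression fixtures is a table of file layouts with a pinned expected language, materialized into a temp directory per run. Everything here is a hypothetical sketch: POLYGLOT_FIXTURES, materialize and run_fixtures are invented names, the file lists are stand-ins rather than real repos, and the detector is passed in as a callable so the harness makes no assumption about the signature of detect_language().

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical fixture table: (relative file paths, language a human would pick).
POLYGLOT_FIXTURES: Dict[str, Tuple[List[str], str]] = {
    "python_service_ts_frontend": (
        ["app/main.py", "app/models.py", "web/src/a.ts", "web/src/b.ts", "web/src/c.ts"],
        "python",
    ),
    "go_monorepo_py_scripts": (
        ["cmd/server/main.go", "pkg/api/api.go", "pkg/db/db.go", "scripts/migrate.py"],
        "go",
    ),
}


def materialize(root: Path, rel_paths: List[str]) -> None:
    """Create empty files at the given relative paths under root."""
    for rel in rel_paths:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("")


def run_fixtures(
    detect: Callable[[Path], Optional[str]],
) -> Dict[str, Tuple[Optional[str], str]]:
    """Return {fixture name: (detected, expected)} for every fixture that mismatches."""
    mismatches: Dict[str, Tuple[Optional[str], str]] = {}
    for name, (rel_paths, expected) in POLYGLOT_FIXTURES.items():
        with tempfile.TemporaryDirectory() as tmp:
            root = Path(tmp)
            materialize(root, rel_paths)
            detected = detect(root)
        if detected != expected:
            mismatches[name] = (detected, expected)
    return mismatches
```

A regression test would then assert that run_fixtures(...) returns an empty dict, or pin the known mismatches explicitly until the heuristic is changed.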
Piece 3 — OSS calibration corpus
10-15 well-known OSS repos (Django, Express, Kubernetes, Rails, Flask, Next.js, Buf, etc.). Run auto-detect against each, compare against the language a human would obviously pick. Gives a quantified accuracy number.
If accuracy is high enough → comfortable defaulting to auto. If not → keep explicit -l required (or expose -l auto as opt-in but warn loudly).
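The accuracy number and the resulting decision could be computed with something as small as the sketch below. The function names are hypothetical, and the threshold is deliberately a required argument because this issue does not yet say what "high enough" means.

```python
from typing import Dict, Optional, Tuple


def calibration_accuracy(results: Dict[str, Tuple[Optional[str], str]]) -> float:
    """results maps repo name -> (detected, expected); returns the fraction correct."""
    if not results:
        raise ValueError("calibration corpus is empty")
    correct = sum(1 for detected, expected in results.values() if detected == expected)
    return correct / len(results)


def recommend_default(accuracy: float, threshold: float) -> str:
    """Return 'auto' if accuracy meets the chosen bar, else 'explicit'."""
    # The threshold is a team decision; no value is implied by this sketch.
    return "auto" if accuracy >= threshold else "explicit"
```

Recording the per-repo (detected, expected) pairs alongside the aggregate number would also make it easy to see which repo shapes the heuristic gets wrong.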
UX improvements worth considering regardless
Print the counts ("Detected: python (47 files), javascript (12 files)") so the user can verify before proceeding.
Warn on near-ties (e.g., when the runner-up is within 20% of the leader) and require -l explicitly in those cases.
Sanity check with --dry-run that shows what would be detected without writing project.json.
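The count display and near-tie warning above are cheap to prototype. The sketch below is one interpretation, with hypothetical function names: it reads "within 20% of the leader" as the runner-up having at least 80% of the leader's file count, which is an assumption about the intended rule.

```python
from typing import Dict, List, Tuple


def ranked(counts: Dict[str, int]) -> List[Tuple[str, int]]:
    """Sort languages by descending count, then by name for a stable order."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def format_counts(counts: Dict[str, int]) -> str:
    """Render counts in the 'Detected: python (47 files), ...' style."""
    parts = [f"{lang} ({n} file{'s' if n != 1 else ''})" for lang, n in ranked(counts)]
    return "Detected: " + ", ".join(parts)


def is_near_tie(counts: Dict[str, int], margin: float = 0.20) -> bool:
    """True when the runner-up has at least (1 - margin) of the leader's count."""
    order = ranked(counts)
    if len(order) < 2:
        return False
    leader, runner_up = order[0][1], order[1][1]
    return runner_up >= leader * (1 - margin)
```

In a near-tie, init could print the counts and exit asking for -l explicitly instead of writing project.json.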
Decision needed
Before flipping --language from required to optional (#40), we want some confidence the dominance heuristic actually picks the right thing on real repos. Options:
[…] -l required; expose -l auto as opt-in.
Why this is a separate issue
#40 is a UX improvement that touches a real reliability question. Splitting the discussion lets #40's parser/config/non-git work merge on its own merits while we figure out the right validation bar for default-on auto-detection.