Skip to content

fix: Windows-compatible path handling in JS/Go parsers#46

Open
joshbouncesecurity wants to merge 5 commits intoknostic:masterfrom
joshbouncesecurity:fix/issue16-08-windows-paths
Open

fix: Windows-compatible path handling in JS/Go parsers#46
joshbouncesecurity wants to merge 5 commits intoknostic:masterfrom
joshbouncesecurity:fix/issue16-08-windows-paths

Conversation

@joshbouncesecurity
Copy link
Copy Markdown
Contributor

Summary

Three Windows-only failures preventing OpenAnt from running correctly on Windows:

  1. ts-morph backslash interpretationpath.relative() produces backslashes on Windows; ts-morph treats them as escape characters when matching paths it has already added, so the JS parser silently emits 0 files. Fix: convert separators to forward slashes via a toPosixPath() helper before handing paths to ts-morph, and use POSIX-form relativePath when building functionId values.
  2. \r\n-tainted file lists — splitting a newline-delimited list on Windows leaves a trailing \r on every path, which addSourceFileAtPath then fails to resolve. Fix: split the --files-from content on /\r?\n/ and .trim() each line.
  3. cp1252 console crashes — the pipeline status output used ✓ ✗ →, which crash on cp1252-encoded stdouts (the Windows default). Fix: detect at import time whether sys.stdout.encoding can encode those glyphs and fall back to ASCII (OK / FAIL / ->) only when it cannot. UTF-8 terminals keep the prettier Unicode output — this is more nuanced than an unconditional ASCII downgrade.

Stale Windows xfail/skip markers in tests/test_js_parser.py and tests/test_go_cli.py are removed since the underlying bugs are fixed.

These are the parser path/encoding pieces of issue #16 item 8. The companion venv-binary path fix (Scripts\python.exe vs bin/python in apps/openant-cli/internal/python/runtime.go) is intentionally not included here — it's a separate concern and worth its own PR.

Addresses item 8 from #16 (does not close the issue).

Test plan

New libs/openant-core/tests/test_windows_path_handling.py covers all three fixes, runnable on any platform:

  • JS analyzer accepts a backslash-only file list and produces non-empty functionId values that contain only forward slashes.
  • JS analyzer accepts a CRLF-terminated file list (with a trailing blank line) and finds all functions.
  • Both pipeline modules pick the ASCII fallback under a cp1252 stdout, and the Unicode form under a UTF-8 stdout.
  • Existing parser tests pass: pytest tests/99 passed, 12 skipped (the 12 skips are test_go_cli.py cases that need the compiled Go binary; pre-existing).
  • Manual verification: run openant parse <ts-repo> on Windows and confirm a non-zero file count is reported.

Three Windows-only failures preventing OpenAnt from running correctly on
Windows. Ports the path/encoding fixes from #3 with a
more nuanced Unicode strategy.

1. ts-morph backslash interpretation. path.relative() and path.resolve()
   produce backslash-separated paths on Windows; ts-morph treats those
   backslashes as escape characters when matching paths it has already
   added, so the JS analyzer silently emitted 0 functions. Normalise to
   forward slashes via a toPosixPath() helper before handing paths to
   ts-morph and before storing them in functionId values.

2. CRLF-tainted file lists. The --files-from consumer split on \n only,
   leaving a trailing \r on every path when the list was written on
   Windows. Switch to a /\r?\n/ split with per-line trim().

3. cp1252 console crashes. The "Pipeline Test" status output used the
   Unicode glyphs U+2713 / U+2717 / U+2192, which crash on cp1252-encoded
   stdouts (the Windows default). Detect at import time whether stdout
   can encode those symbols and fall back to ASCII (OK / FAIL / ->) only
   when it cannot, so UTF-8 terminals keep the prettier output.

Also remove the now-stale Windows xfail/skip markers from tests/
test_js_parser.py and tests/test_go_cli.py — the underlying bugs are
fixed, so those guards no longer apply.

Adds tests/test_windows_path_handling.py covering all three fixes:
  - ts-morph accepts a backslash-only file list and produces non-empty
    POSIX-form functionId values
  - the analyzer accepts a CRLF-terminated file list with a trailing
    blank line
  - both pipeline modules pick the ASCII fallback under a cp1252 stdout
    and the Unicode form under a UTF-8 stdout

Refs knostic#16 (item 8).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@joshbouncesecurity
Copy link
Copy Markdown
Contributor Author

joshbouncesecurity commented May 4, 2026

Superseded by newer review

Previous review (click to expand)

🤖 Automated Claude Code Review

Findings

F1 -- Critical bug: 14 print calls in both test_pipeline.py files emit literal {SYM_FAIL} / {SYM_OK} / {SYM_ARROW} strings instead of the variable values

This is the most significant defect in the PR. A number of print(...) calls were converted from print(f"... {SYM_FAIL}...") to print("{SYM_FAIL} ...") -- that is, plain string literals, not f-strings. The {SYM_FAIL} text is emitted verbatim at runtime rather than expanding to the variable.

Affected sites in libs/openant-core/parsers/go/test_pipeline.py (7 locations):

  • Line 510: print("{SYM_FAIL} SARIF output not found")
  • Line 583: print("{SYM_FAIL} CodeQL not found. Please install CodeQL CLI.")
  • Line 594: print("{SYM_FAIL} CodeQL analysis timed out")
  • Line 897: marker = "{SYM_ARROW}" if cls == "exploitable" else " "
  • Line 987: print("{SYM_OK} All stages completed successfully")
  • Line 989: print("{SYM_FAIL} Some stages failed")
  • Line 993: status = "{SYM_OK}" if stage_result.get('success') else "{SYM_FAIL}"

The same 7 patterns appear in libs/openant-core/parsers/javascript/test_pipeline.py (lines 718, 793, 804, 1028, 1126, 1128, 1132).

All 14 must be f-strings: print(f"{SYM_FAIL} SARIF output not found"), marker = SYM_ARROW if ..., status = SYM_OK if ....

The status variable (last entry in each file) is used on the very next line as an f-string argument print(f" {status} {stage_name}: ...") -- but because status holds the literal string "{SYM_OK}" the output reads " {SYM_OK} stage_name: ..." on every platform. This means the final summary table is always garbled, defeating the intent of the entire change.

These regressions are not caught by the new tests because the tests only inspect pipeline_module.SYM_OK as an attribute and never exercise the code paths that contain the broken print statements.


F2 -- extractSingleFunction in typescript_analyzer.js bypasses toPosixPath

The PR adds toPosixPath normalization to analyzeFiles (batch mode) and to extractFunctionsFromFile / buildCallGraphForFile. However, the standalone extractSingleFunction function (line 529) still calls project.addSourceFileAtPath(filePath) with the raw, potentially backslash-containing filePath. On Windows, when the single-function CLI mode is used with a native Windows path, ts-morph will silently fail to load the file -- the same bug this PR claims to fix for batch mode.

A secondary issue: lines ~740 also call requiredProject.addSourceFileAtPath(resolvedPath) inside the require-resolution logic, where resolvedPath is built with path.resolve on a backslash dirname. That path is also unnormalised.

Fix: add const normalised = toPosixPath(filePath); at the top of extractSingleFunction and pass normalised to addSourceFileAtPath; also apply toPosixPath to resolvedPath before passing it to the nested addSourceFileAtPath call.


F3 -- this.repoPath is not normalised, creating risk of mixed-separator paths in path.relative()

this.repoPath is stored raw from args[0] (lines 905-906, 945) without normalisation. On Windows, if the caller passes repoPath with backslashes, path.relative(backslashRepoPath, posixFilePath) may behave unexpectedly because ts-morph returns paths from sourceFile.getFilePath() in POSIX form internally (per the PR description). On some Node versions this causes path.relative to fall back to an absolute path rather than a relative one, defeating toPosixPath(path.relative(...)).

Additionally, path.join(this.repoPath, filePath) on line 152 can produce a mixed-separator path (e.g. C:\project/src/module.js) when repoPath has backslashes and filePath has forward slashes. toPosixPath is applied to the combined result so the final string passed to ts-morph is correct, but Node APIs called on the intermediate value (e.g. fs.existsSync) would see the mixed form.

Recommended fix: normalise at construction time:

constructor(repoPath) {
  this.repoPath = toPosixPath(path.resolve(repoPath));
  ...
}

F4 -- test_typescript_analyzer_accepts_backslash_paths will fail (not pass) on Linux/macOS

The test constructs a backslash path by replacing every / in an absolute POSIX path:

abs_path = str(src / "module.js")           # /tmp/pytest-xxx/repo/src/module.js
backslash_path = abs_path.replace("/", "\\") # \tmp\pytest-xxx\repo\src\module.js

On Linux/macOS, Node's path.isAbsolute('\tmp\...') returns false (a leading backslash is not an absolute indicator on POSIX). analyzeFiles then takes the else branch and calls path.join(repoPath, '\tmp\pytest-xxx\repo\src\module.js'). After toPosixPath that becomes <repoPath>//tmp/pytest-xxx/repo/src/module.js -- an invalid doubly-rooted path. addSourceFileAtPath will fail to find the file; the analyzer exits with code 1 ("No files to analyze"), causing the assert result.returncode == 0 to fail.

If the Node/npm skip guard fires (no node_modules on the Linux CI agent), this failure is hidden, giving a false sense that the test passes everywhere.

The PR description explicitly claims these tests are "runnable on any platform." That claim is incorrect for this test as written.

Options:

  1. Gate the test with @pytest.mark.skipif(sys.platform != "win32", reason="backslash path simulation only meaningful on Windows").
  2. Instead of replacing all slashes, write only the relative portion of the path (e.g. "src\\module.js") with backslashes and let the analyzer join it to the repo root, which would be portable.

F5 -- pipeline_module fixture leaks loaded modules into sys.modules after teardown

After yield mod, the fixture's finally block only calls sys.path.remove(...). It does not pop the freshly-loaded module (openant_test_js_pipeline_cp1252, openant_test_go_pipeline_cp1252) from sys.modules. These modules hold a reference to the fake cp1252 stdout via their module-level _UNICODE_OK / SYM_* globals. If any later test imports from those modules it will get the stale cp1252-patched version.

The same issue affects test_pipeline_uses_unicode_when_stdout_supports_it: openant_test_js_pipeline_utf8 is added to sys.modules but not removed in the finally block.

Fix: add sys.modules.pop(mod_name, None) to both finally blocks.


F6 -- Unicode detection fails silently when stdout is piped or redirected on UTF-8 systems

When Python stdout is redirected (piped to a file, log aggregator, or subprocess), sys.stdout.encoding is commonly None or "ascii" on POSIX even when the system locale is UTF-8. In that case _stdout_supports_unicode() returns False and all symbols degrade to "OK"/"FAIL"/"->" in the captured output, making logs harder to read.

Additionally, the or "" fallback for a None encoding means "".encode("") raises LookupError (which is caught), but the intent is obscured. A more explicit guard before the probe would be clearer:

if not encoding:
    return False

If degrading to ASCII for piped/redirected output is the intended behaviour, the docstring should say so explicitly.


F7 -- test_pipeline_uses_unicode_when_stdout_supports_it tests only the JS pipeline; Go pipeline Unicode path is uncovered

The UTF-8 branch test (lines 862-884) loads only parsers/javascript/test_pipeline.py under a UTF-8 fake stdout. There is no corresponding test that reloads the Go test_pipeline.py under a UTF-8 stdout and asserts _UNICODE_OK is True and SYM_OK == "u2713". Given that F1 documents broken print calls in the Go pipeline too, the Go pipeline's Unicode path is completely uncovered by the new test suite.


F8 -- test_go_cli.py xfail removal silently loses UnicodeEncodeError diagnostic on non-Windows

The removed code included:

if "UnicodeEncodeError" in result.stderr:
    if sys.platform == "win32":
        pytest.skip(...)
    else:
        pytest.fail("UnicodeEncodeError from JS parser on non-Windows (unexpected regression)")

After the PR, if a UnicodeEncodeError appears on Linux/macOS in a future regression, the test falls through to assert result.returncode == 0 and produces a generic assertion failure with no indication of the root cause.

The Windows-specific pytest.skip was correct to remove (the bug is fixed). But the non-Windows pytest.fail was a valuable regression guard that should be retained:

if result.returncode != 0 and "UnicodeEncodeError" in result.stderr:
    pytest.fail("UnicodeEncodeError from JS parser (unexpected regression)")

F9 -- toPosixPath docstring does not document UNC path behavior

When a Windows UNC path \\server\share\file.ts is passed through toPosixPath, the result is //server/share/file.ts (the double leading backslashes become double leading forward slashes). This is the correct POSIX-style UNC representation that TypeScript understands on Windows, so UNC paths work correctly at runtime. However, the JSDoc comment says only "Convert a filesystem path to forward slashes" and does not mention UNC paths. A reader auditing this function could reasonably question whether //server/share/... is a valid result. A one-line note in the docstring (e.g. "UNC paths (\\server\share\...) are correctly converted to //server/share/...") would prevent ambiguity.


Positive notes

  • The toPosixPath helper is well-placed and well-commented. The three batch-mode call sites it is applied to (addSourceFileAtPath in analyzeFiles, extractFunctionsFromFile, buildCallGraphForFile) cover the main end-to-end data flow correctly.
  • The CRLF fix -- .split(/\r?\n/).map(line => line.trim()).filter(line => line.length > 0) -- is correct, portable, and handles pure CRLF, pure LF, mixed files, and trailing blank lines in one pass.
  • The _stdout_supports_unicode design of probing by actually encoding the target glyphs is more robust than enumerating code pages. Catching both UnicodeEncodeError and LookupError covers the two known failure paths.
  • test_typescript_analyzer_strips_crlf_from_file_list is cleanly portable: it writes raw CRLF bytes with write_bytes and is genuinely testable on any OS. The trailing-blank-line tolerance check is a good edge-case addition.
  • The removal of the sys import from test_js_parser.py (no longer needed after removing the xfail decorators) is correct housekeeping.
  • The PR body is well-scoped: it explicitly defers the venv binary path fix to a separate PR and clearly identifies which item of issue Fork contributions: bug fixes, Windows compatibility, pipeline resilience, and new features #16 is addressed without overclaiming closure.

joshbouncesecurity and others added 2 commits May 4, 2026 22:40
…h gaps, unicode clarity

F1: fix 7 broken non-f-string print calls in go/test_pipeline.py (SYM_FAIL,
SYM_OK, SYM_ARROW literals emitted verbatim) and 7 matching regressions
in javascript/test_pipeline.py.  status/marker variables now use the
SYM_* variables directly instead of wrapping them in string literals.

F2: apply toPosixPath to extractSingleFunction's filePath and to the nested
requiredProject.addSourceFileAtPath resolvedPath so single-function CLI
mode normalises Windows paths the same way batch mode does.

F3: normalise this.repoPath in the TypeScriptAnalyzer constructor via
toPosixPath(path.resolve(repoPath)) so all later path.relative /
path.join calls operate on a consistent forward-slash base.

F6: replace the `or ""` encoding fallback in _stdout_supports_unicode()
with an explicit `if not encoding: return False` guard in both pipeline
files, and add a docstring noting the piped/redirected-stdout behaviour.

F9: add UNC path behaviour note to toPosixPath JSDoc.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
F4: gate test_typescript_analyzer_accepts_backslash_paths to Windows only.
Replacing all forward slashes in a POSIX absolute path produces a
non-absolute path on Linux/macOS, making the test fail on those platforms.

F5: pop freshly-loaded cp1252 and utf-8 pipeline modules from sys.modules
in finally blocks (pipeline_module fixture and unicode test) to prevent
stale module-level state leaking into later tests.

F7: parametrize test_pipeline_uses_unicode_when_stdout_supports_it to
cover both the JS and Go pipeline modules, closing the coverage gap for
the Go Unicode code path.

F8: restore UnicodeEncodeError regression guard in test_go_cli.py
test_parse_js_repo.  The Windows-specific pytest.skip was correct to
remove (bug is fixed), but the non-Windows pytest.fail provides a
diagnostic on future regressions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@joshbouncesecurity
Copy link
Copy Markdown
Contributor Author

🤖 Automatically written by Claude

Thanks for the thorough review. All 9 findings addressed across two commits:

  • F1: fixed (commit 923b16b) — Fixed 14 broken non-f-string print calls across both go/test_pipeline.py and javascript/test_pipeline.py. Plain string literals "{SYM_FAIL}" etc. were converted to f-strings or direct variable references. The status/marker variables now reference SYM_OK/SYM_FAIL/SYM_ARROW directly.
  • F2: fixed (commit 923b16b) — extractSingleFunction now normalises filePath via toPosixPath(path.resolve(filePath)) before the project.addSourceFileAtPath() call. The nested requiredProject.addSourceFileAtPath(resolvedPath) also uses a toPosixPath-normalised resolvedPath, and path.dirname is called on the already-normalised path.
  • F3: fixed (commit 923b16b) — TypeScriptAnalyzer constructor now stores this.repoPath = toPosixPath(path.resolve(repoPath)) so all downstream path.relative / path.join calls operate on a consistent forward-slash base.
  • F4: fixed (commit e3d9abd) — test_typescript_analyzer_accepts_backslash_paths is now gated with @pytest.mark.skipif(sys.platform != "win32", ...). The docstring explains why this test is unsound on POSIX (replacing all / in an absolute path produces a non-absolute result).
  • F5: fixed (commit e3d9abd) — Added sys.modules.pop(mod_name, None) in the finally block of the pipeline_module fixture, and sys.modules.pop("openant_test_js_pipeline_utf8", None) (now sys.modules.pop(mod_name, None) in the parametrized version) in the unicode test's finally block.
  • F6: fixed (commit 923b16b) — Replaced or "" fallback with if not encoding: return False in both pipeline files. Added docstring noting the piped/redirected-stdout ASCII degradation behaviour.
  • F7: fixed (commit e3d9abd) — test_pipeline_uses_unicode_when_stdout_supports_it is now parametrized over both javascript/test_pipeline.py and go/test_pipeline.py, covering the Go Unicode path.
  • F8: fixed (commit e3d9abd) — Restored if result.returncode != 0 and "UnicodeEncodeError" in result.stderr: pytest.fail(...) in test_go_cli.py::test_parse_js_repo. The Windows-specific pytest.skip was already removed by the original PR; the non-Windows pytest.fail regression guard is back.
  • F9: fixed (commit 923b16b) — Added UNC paths (\\server\share\...) are correctly converted to //server/share/... note to the toPosixPath JSDoc.

All 16 tests pass (12 skipped — Go binary not built, as expected in this environment).

@joshbouncesecurity
Copy link
Copy Markdown
Contributor Author

joshbouncesecurity commented May 4, 2026

Superseded by newer review

Previous review (click to expand)

🤖 Automated Claude Code Review

Findings

This is a follow-up review. F1 through F9 from the previous review are addressed; new findings continue from F10.

Status of previous findings:

  • F1 (f-string regressions): Fixed. All 14 broken print calls in both go/test_pipeline.py and javascript/test_pipeline.py are now correct f-strings or bare variable references.
  • F2 (extractSingleFunction bypasses toPosixPath): Fixed. normalisedFilePath = toPosixPath(path.resolve(filePath)) is applied before addSourceFileAtPath.
  • F3 (this.repoPath not normalised): Fixed. Constructor now does toPosixPath(path.resolve(repoPath)).
  • F4 (test_typescript_analyzer_accepts_backslash_paths fails on Linux/macOS): Fixed. A @pytest.mark.skipif(sys.platform != "win32", ...) guard is added.
  • F5 (pipeline_module fixture leaks modules into sys.modules): Partially addressed — see F10 below.
  • F6 (Unicode detection with or ""): Fixed. The or "" fallback is removed and replaced with an explicit if not encoding: return False.
  • F7 (Go pipeline Unicode path uncovered): Fixed. test_pipeline_uses_unicode_when_stdout_supports_it is now parametrised over both JS and Go.
  • F8 (UnicodeEncodeError regression guard removed): Fixed. The non-Windows pytest.fail guard is restored.
  • F9 (toPosixPath UNC docstring): Fixed. UNC note added to JSDoc.

F10 — sys.modules.pop(mod_name, None) in the fixture is a no-op; the stated F5 fix is ineffective

_load_pipeline_module uses importlib.util.spec_from_file_location + module_from_spec + exec_module without inserting the module into sys.modules. As a result:

  • The sys.modules.pop(mod_name, None) before the _load_pipeline_module call (line 195) is a no-op on the first run — there is nothing to evict.
  • The sys.modules.pop(mod_name, None) in the finally block (line 202) is also a no-op — the module was never placed there, so no stale state is actually removed.

The comment "Remove the freshly loaded module so its stale cp1252-patched module-level state doesn't leak into later tests" is therefore misleading; the cleanup it describes does not happen.

To make the fix effective, the module must first be registered: add sys.modules[name] = mod inside _load_pipeline_module after exec_module, and correspondingly the finally pop will then correctly remove it. Alternatively, if the approach of not registering is kept intentionally (to avoid polluting sys.modules entirely), then remove the misleading pop calls and update the comment to clarify that the module is intentionally not cached.

Note: the practical consequence is small because the module-level _UNICODE_OK and SYM_* variables are bound at import time, and any later test that calls _load_pipeline_module with the same name will re-execute the module body (since it bypasses the sys.modules cache entirely). But the stated intent of F5's fix — preventing stale state via sys.modules eviction — is not actually achieved, and the comment actively misleads future maintainers about why the pop is there.


F11 — Double @pytest.mark.skipif stacking on test_typescript_analyzer_accepts_backslash_paths is fragile; the platform guard fires first and masks the node/npm skip

The test now has two independent @pytest.mark.skipif decorators (lines 32-40):

@pytest.mark.skipif(
    sys.platform != "win32",
    reason="...",
)
@pytest.mark.skipif(
    not shutil.which("node") or not JS_NODE_MODULES.exists(),
    reason="Node.js or JS parser npm dependencies not available",
)
def test_typescript_analyzer_accepts_backslash_paths(tmp_path):

When running on Linux/macOS, the first decorator skips the test with the platform reason. That is the intended behaviour. However, pytest evaluates decorators from bottom to top (inner to outer at runtime), and multiple skipif decorators are independent — the first True condition encountered wins. The ordering is fine at runtime: the platform guard fires and the test is correctly skipped on non-Windows.

The concern is discoverability in CI output: on a non-Windows CI agent that also lacks Node, the skip message will say "backslash path simulation only meaningful on Windows" rather than "Node.js or JS parser npm dependencies not available", hiding the missing-dependency condition. More importantly, on Windows without Node installed, the platform guard is False so the second guard must catch it — and it does. That path is fine.

This is a minor readability/diagnostic issue rather than a correctness bug, but it is actionable: combining the two conditions into one skipif with a descriptive reason, or reversing the decorator order so that the more-informative node/npm guard appears as the outermost skip (and thus fires first on non-Windows too), would give cleaner skip messages:

@pytest.mark.skipif(
    sys.platform != "win32"
    or not shutil.which("node")
    or not JS_NODE_MODULES.exists(),
    reason="Windows only (backslash path test) and requires Node.js + JS parser npm deps",
)

F12 — fs.existsSync(filePath) in extractSingleFunction uses the raw un-normalised path for the error path, while normalisedFilePath is used for everything else — minor inconsistency in error message

Lines 523-527:

if (!fs.existsSync(filePath)) {
    console.error(`File not found: ${filePath}`);
    process.exit(1);
}
const normalisedFilePath = toPosixPath(path.resolve(filePath));

fs.existsSync and fs.statSync accept backslash paths correctly on Windows (they go through the OS layer). So this is not a functional bug — if the file exists at a backslash path, existsSync returns true and execution proceeds to normalisation.

However, there is a subtle inconsistency: the error message reports filePath (raw, potentially backslash-form), while all subsequent log messages and the process use normalisedFilePath. Moving the normalisation and path.resolve to before the existence check would make the guard consistent with the rest of the function, and would also make the "File not found" error message display the same resolved path that ts-morph will use:

const normalisedFilePath = toPosixPath(path.resolve(filePath));
if (!fs.existsSync(normalisedFilePath)) {
    console.error(`File not found: ${normalisedFilePath}`);
    process.exit(1);
}

This also catches a (unlikely) edge case: if path.resolve expands a relative path differently than the caller expects, the normalised path would be shown in the error, making debugging easier.


F13 — test_pipeline_uses_unicode_when_stdout_supports_it does not isolate the sys.path insertion from the fixture's insertion — potential double-insertion

The parametrised test_pipeline_uses_unicode_when_stdout_supports_it function (line 245) does sys.path.insert(0, str(parsers_root.parent)) directly. The pipeline_module fixture (lines 184-205) also does sys.path.insert(0, str(parsers_root.parent)). These two functions operate on separate code paths and pytest may run them in the same process without intervening cleanup of sys.path.

Since sys.path.remove only removes the first occurrence, a double insert followed by a single remove leaves one extra copy in sys.path. Over a test run with both the cp1252 fixture (parametrised 2x) and the utf8 parametrised test (2x), sys.path could accumulate stale entries pointing to libs/openant-core. In practice this is benign — having libs/openant-core on the path multiple times causes no imports to fail — but it is sloppy state management and could interact badly with other test modules that manipulate sys.path.

A clean fix: check before inserting (if str(parsers_root.parent) not in sys.path: sys.path.insert(0, ...)) or use a module-level fixture that inserts once and removes once after the entire test session.


F14 — test_pipeline_uses_unicode_when_stdout_supports_it asserts Unicode glyphs by literal value; a test running under a non-UTF-8 source encoding would give a confusing error

Lines 250-253:

assert mod.SYM_OK == "✓"
assert mod.SYM_FAIL == "✗"
assert mod.SYM_ARROW == "→"

The test file does not declare a # -*- coding: utf-8 -*- header. Python 3 defaults to UTF-8 for source files, so this is safe on CPython 3.x. However, the literals "✓", "✗", "→" embedded in the assertion will produce confusing error messages if the terminal or log output is in a non-UTF-8 encoding (e.g., cp1252), because the AssertionError display of the expected value will crash. This is a narrow concern, but it is the same class of problem this PR was designed to fix in the pipelines.

An alternative assertion form — assert mod.SYM_OK == "✓" — is immune to the display encoding of the failing test's output, which is more appropriate for a test that is explicitly about encoding robustness. This is particularly important because this test is designed to run on Windows (unlike the backslash test), and the console encoding on the CI Windows agent may well be cp1252.


Positive notes

  • F1 through F8 are all addressed cleanly and correctly. The f-string fixes are systematic (all 14 sites), the constructor normalisation is the right place to apply toPosixPath, and the extractSingleFunction normalisation correctly handles both the primary addSourceFileAtPath call and the nested requiredProject.addSourceFileAtPath path.
  • The _stdout_supports_unicode fix (F6) is clean: the if not encoding: return False guard is explicit and readable, and the updated docstring correctly explains the piped/redirected behaviour.
  • Parametrising test_pipeline_uses_unicode_when_stdout_supports_it over both JS and Go (F7) is exactly the right approach — it closes the coverage gap with minimal duplication.
  • The UnicodeEncodeError regression guard restoration in test_go_cli.py (F8) is appropriately scoped: only fires on non-zero exit code with the specific error, so it won't mask unrelated failures.
  • The F4 fix (Windows-only skip) is correct and the updated docstring accurately explains why the test cannot work on POSIX.

…istency, sys.path dedup, assertion safety

F10: register module in sys.modules inside _load_pipeline_module so that
pre/post sys.modules.pop() calls in the fixture and unicode test are
meaningful and prevent stale module-level state from leaking.

F11: combine double @pytest.mark.skipif decorators on
test_typescript_analyzer_accepts_backslash_paths into a single condition
so the skip reason is always accurate and neither guard can mask the other.

F12: move toPosixPath/path.resolve above the fs.existsSync check in
extractSingleFunction so the existence check and error message both use
the normalised path, consistent with the rest of the function.

F13: guard sys.path.insert with already_on_path check in both the
pipeline_module fixture and test_pipeline_uses_unicode_when_stdout_supports_it
to prevent duplicate path entries across parametrized test runs; only
remove from sys.path if we added it.

F14: replace literal Unicode glyph strings in assertions (✓ ✗ →) with
\u-escape sequences so pytest assertion error messages remain safe on
cp1252 consoles.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@joshbouncesecurity
Copy link
Copy Markdown
Contributor Author

🤖 Automatically written by Claude

All 5 new findings from iteration 2 addressed in commit 22e6b56:

  • F10: fixed (commit 22e6b56) — Added sys.modules[name] = mod in _load_pipeline_module before calling exec_module. The pre-pop before load and post-pop in finally are now both meaningful; stale module-level state can no longer leak.
  • F11: fixed (commit 22e6b56) — Combined the two @pytest.mark.skipif decorators on test_typescript_analyzer_accepts_backslash_paths into a single condition: sys.platform != "win32" or not shutil.which("node") or not JS_NODE_MODULES.exists(). Single reason string covers both guards.
  • F12: fixed (commit 22e6b56) — Moved the normalisedFilePath = toPosixPath(path.resolve(filePath)) assignment above the fs.existsSync check in extractSingleFunction. The existence check and error message now both reference the normalised path, matching the rest of the function.
  • F13: fixed (commit 22e6b56) — Added already_on_path = parsers_parent in sys.path guard in both the pipeline_module fixture and test_pipeline_uses_unicode_when_stdout_supports_it. sys.path.insert and the corresponding sys.path.remove in finally are now conditional, preventing duplicate entries across parametrized runs.
  • F14: fixed (commit 22e6b56) — Replaced literal UTF-8 glyph bytes (//) in assertions with // escape sequences in the Python source. Assertion error messages on a cp1252 terminal will no longer crash when printing the comparison operands.

@joshbouncesecurity
Copy link
Copy Markdown
Contributor Author

🤖 Automated Claude Code Review

Findings

This is a follow-up review. F1 through F14 from the previous two reviews are addressed; new findings continue from F15.

Status of previous findings (F10–F14):

  • F10 (sys.modules.pop was a no-op): Fixed. sys.modules[name] = mod is now registered before exec_module, making the pre/post pop calls meaningful and effective.
  • F11 (double @pytest.mark.skipif could mask guards): Fixed. The two decorators are collapsed into one combined condition with a single reason string.
  • F12 (fs.existsSync inconsistency in extractSingleFunction): Fixed. toPosixPath(path.resolve(...)) is now performed before the existence check, making both the guard and error message consistent with the rest of the function.
  • F13 (sys.path double-insertion across parametrized tests): Partially addressed — see F15 below.
  • F14 (literal Unicode glyphs in assertions unsafe on cp1252): Fixed. , , escape sequences replace the literal glyphs in the assertion RHS, preventing source-file encoding issues and keeping the .py file pure ASCII.

F15 — sys.path still accumulates one extra entry per test call due to the pipeline module's own sys.path.insert

The F13 fix guards against the test harness inserting a duplicate parsers_parent entry, but does not account for the fact that each pipeline module (test_pipeline.py) also inserts the identical path at module level when exec_module runs:

# Line 71 of both parsers/javascript/test_pipeline.py and parsers/go/test_pipeline.py:
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))

os.path.dirname(dirname(dirname(abspath(__file__)))) for a file in libs/openant-core/parsers/javascript/ resolves to libs/openant-core/ — the same directory as str(parsers_root.parent) in the test harness.

Trace for the first parametrized test call (e.g., cp1252 + javascript):

  1. already_on_path = False → test inserts parsers_parent (count = 1)
  2. _load_pipeline_module runs exec_module → pipeline body inserts parsers_parent again (count = 2)
  3. finally: if not already_on_path: sys.path.remove(parsers_parent) → removes one (count = 1)

On the second call (cp1252 + go):

  1. already_on_path = True (count = 1 from previous run's pipeline insertion) → test skips insert
  2. exec_module → pipeline inserts again (count = 2)
  3. finally: already_on_path was True → test does NOT remove (count = 2)

Each subsequent test call adds one more copy. After 4 parametrized tests, parsers_parent appears 4 times in sys.path. The F13 fix reduces accumulation compared to the pre-fix state (where the test always inserted and often failed to remove) but does not eliminate it.

A correct fix requires either (a) snapshotting sys.path before exec_module and restoring it after, or (b) ensuring the pipeline module does not insert a path that is already present (guarding its own sys.path.insert the same way the test now does). Option (a) is simpler and entirely within the test harness:

path_snapshot = list(sys.path)
try:
    mod = _load_pipeline_module(mod_name, path)
    ...
finally:
    sys.path[:] = path_snapshot
    sys.modules.pop(mod_name, None)

F16 — _load_pipeline_module does not roll back sys.modules on exec_module failure, creating a partially-initialized module hazard

The F10 fix registers the module in sys.modules before calling exec_module, following CPython convention for supporting circular imports. However, CPython's own importlib machinery wraps this with a rollback on failure:

sys.modules[name] = module
try:
    spec.loader.exec_module(module)
except BaseException:
    try:
        del sys.modules[name]
    except KeyError:
        pass
    raise

The current _load_pipeline_module omits the rollback:

sys.modules[name] = mod
spec.loader.exec_module(mod)   # if this raises, sys.modules[name] is a partial module
return mod

If exec_module raises (e.g., because the pipeline module's from utilities.context_enhancer import ContextEnhancer fails when sys.path is not yet set up, or for any other reason), sys.modules[name] is left pointing to a partially initialized module object. Any subsequent code that does import <name> (or sys.modules.get(name)) would receive that partial module.

The callers' finally blocks do sys.modules.pop(mod_name, None) and would clean this up — but only if the exception propagates back to them. If any intervening code catches the exception and ignores it, the stale partial module persists. This is a fragile implicit contract.

The fix is a three-line change inside _load_pipeline_module:

sys.modules[name] = mod
try:
    spec.loader.exec_module(mod)
except BaseException:
    sys.modules.pop(name, None)
    raise
return mod

F17 — Trailing-whitespace misalignment in \u-escape assertion comments (cosmetic but actionable)

The three new assertion lines have inconsistent alignment before the inline comments:

assert mod.SYM_OK == "✓"    # CHECK MARK (U+2713)   ← 4 spaces
assert mod.SYM_FAIL == "✗"  # BALLOT X (U+2717)     ← 2 spaces
assert mod.SYM_ARROW == "→"  # RIGHTWARDS ARROW (U+2192)  ← 2 spaces

SYM_OK is shorter than SYM_FAIL/SYM_ARROW, so the original code likely aligned the # column. With escape sequences the strings are all the same length ("✓", "✗", "→"), so the SYM_OK line now has 2 extra spaces before the #. This is harmless but inconsistent; normalizing to 2 spaces (or aligning all to the same column) improves readability.


Positive notes

  • F10 fix is architecturally correct. Registering the module in sys.modules before exec_module is the CPython-prescribed pattern (mirroring importlib._bootstrap._load_unlocked). This makes the pre/post pop calls semantically meaningful for the first time in this codebase.
  • F11 combined skipif is cleaner and less surprising. A single decorator with a single reason eliminates the decorator-order ambiguity and ensures the skip output is always self-consistent.
  • F12 normalisation-before-existsSync is a solid improvement. The function is now fully consistent: all path operations use the same normalized value, and the existence-check error message matches what ts-morph will see.
  • F14 escape-sequence assertions are the right approach. Using \u escapes keeps the test source pure-ASCII and explicitly documents the Unicode code points with inline comments — this is the correct idiom for encoding-sensitive test code.
  • The commit message is exemplary. The per-finding callouts (F10–F14) with concise technical descriptions make the history easy to follow during future bisect/blame sessions.

…rollback, comment alignment

F15: snapshot and restore sys.path around exec_module in
_load_pipeline_module so that module-level sys.path.insert calls
in the pipeline files do not accumulate across parametrized test runs.

F16: roll back sys.modules[name] on exec_module failure (CPython importlib
convention) so a failed load does not leave a partially-initialized module
poisoning the cache.

F17: fix inconsistent inline comment spacing on the three \u-escape
assertion lines — normalise to 2 spaces before # for all three.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@joshbouncesecurity
Copy link
Copy Markdown
Contributor Author

🤖 Automatically written by Claude

All 3 final findings from iteration 3 addressed in commit fb31f2a:

  • F15: fixed (commit fb31f2a) — Added path_snapshot = sys.path[:] before exec_module and sys.path[:] = path_snapshot in a finally block inside _load_pipeline_module. Module-level sys.path.insert calls within the pipeline files no longer accumulate across parametrized test runs.
  • F16: fixed (commit fb31f2a) — Wrapped exec_module in try/except that calls sys.modules.pop(name, None) on any exception before re-raising. Follows CPython convention: a failed exec_module should not leave a partially-initialized module in sys.modules.
  • F17: fixed (commit fb31f2a) — Normalised all three \u-escape assertion comment lines to 2 spaces before the # (was 4 on the SYM_OK line, 2 on SYM_FAIL and SYM_ARROW).

@joshbouncesecurity
Copy link
Copy Markdown
Contributor Author

Manual verification (Windows-specific)

  • openant parse <ts-repo> on Windows: reports a non-zero file count (was 0 before this PR due to ts-morph backslash bug).
  • Mixed paths: parser handles a repo whose path contains both forward and backslashes without crashing.
  • CRLF file lists: file-list inputs with \r\n line endings are parsed cleanly — no trailing \r in unit IDs or filenames.
  • cp1252 console: openant scan runs to completion without UnicodeEncodeError — status output uses ASCII fallback (OK / FAIL / ->) instead of ✓ ✗ →.
  • UTF-8 console (Windows Terminal, modern PowerShell): keeps the prettier Unicode glyphs (✓ ✗ →).
  • Cross-platform sanity: same parser invocation on Linux/macOS still yields a non-zero file count and identical output.

@joshbouncesecurity
Copy link
Copy Markdown
Contributor Author

Local test results

Ran the JS pipeline on Windows against libs/openant-core/tests/fixtures/sample_js_repo (3 .js files).

Commands run:

cd libs/openant-core/parsers/javascript
python test_pipeline.py <sample_js_repo> --output <out> --processing-level all

Outcome:

  • repository_scanner Files: 3 (non-zero) ✅
  • typescript_analyzer Functions: 5 (would have been 0 before this fix on Windows due to ts-morph backslash interpretation) ✅
  • All functionIds in the resulting dataset.json use forward slashes — checked all 5 unit ids, 0 contained a backslash ✅
  • Pipeline status output ran cleanly on cp1252 stdout — fell back to OK/-> glyphs as designed (no UnicodeEncodeError) ✅

Sample unit ids (note POSIX separators):

src/db.js:getConnection
src/db.js:getUser
src/db.js:createUser

@joshbouncesecurity joshbouncesecurity marked this pull request as ready for review May 7, 2026 11:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant