Fix JSON node types cache: commit missing file and repair union deserialization #235

Merged
bashandbone merged 5 commits into fix-insecure-deserialization-node-types-cache-15389003928708879868 from copilot/sub-pr-233 on Mar 16, 2026

Conversation

Contributor

Copilot AI commented Mar 16, 2026

The JSON cache was never actually loading — the missing node_types_cache.json file caused fallback to full parse on every startup, and a silent KeyError bug in the union deserialization meant the cache would have failed even if the file existed.

Root cause: DirectConnection | PositionalConnections union always failed

TypeAdapter tries DirectConnection first on all connection dicts. Connection.__init__ calls super().__init__(**data) which triggers pydantic's internal field extractor — a compiled lambda that raises KeyError: 'role' (not ValidationError) when a PositionalConnections dict is processed. The original code silently swallowed this via except KeyError, making the cache a permanent no-op.

# Before: union TypeAdapter always KeyErrors on PositionalConnections dicts
connections: list[DirectConnection | PositionalConnections]  # ← never works

# After: validate as raw dicts, reconstruct per-item using 'role' as discriminator
connections: list[dict]  # in TypedDict
# then, for each raw connection dict c:
conn = (
    DirectConnection.model_validate(c)
    if isinstance(c, dict) and "role" in c
    else PositionalConnections.model_validate(c)
)
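
To make the discriminator idea concrete, here is a minimal sketch that runs without pydantic. The two dataclasses are stand-ins for the real models, and the field names `target` and `targets` and the helper name `reconstruct_connection` are assumptions made for illustration, not taken from node_type_parser.py. Only the rule itself comes from the description above: a dict that has a "role" key becomes a DirectConnection, anything else becomes a PositionalConnections.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple, Union


@dataclass(frozen=True)
class DirectConnection:
    # Stand-in for the pydantic model; 'target' is an illustrative field.
    role: str
    target: str


@dataclass(frozen=True)
class PositionalConnections:
    # Stand-in for the pydantic model; note that it has no 'role' field.
    targets: Tuple[str, ...]


def reconstruct_connection(
    raw: Dict[str, Any],
) -> Union[DirectConnection, PositionalConnections]:
    """Choose the concrete type by checking for the 'role' key."""
    if "role" in raw:
        return DirectConnection(role=raw["role"], target=raw["target"])
    return PositionalConnections(targets=tuple(raw["targets"]))


# Hypothetical cached entries, one of each shape.
cached = [
    {"role": "body", "target": "block"},
    {"targets": ["identifier", "expression"]},
]
connections = [reconstruct_connection(c) for c in cached]
```

Dispatching on the key before constructing anything means the wrong model is never attempted, so the outcome does not depend on which exception type a failed attempt would raise.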

Changes

  • node_types_cache.json — generated and committed (13 MB compact JSON); added REUSE license sidecar
  • node_type_parser.py — fix union deserialization; move ValidationError/TypeAdapter imports to module scope (fixes fragile import-inside-try); extract _ensure_cache_adapter(), _reconstruct_cache(), _clear_stale_cached_properties() helpers; cache TypeAdapter as ClassVar[TypeAdapter[Any] | None] built once per class lifetime; remove redundant json.JSONDecodeError (covered by ValidationError)
  • preprocess-node-types.py — emit compact JSON (no indentation) to stay under the 13 MB VCS limit
  • hk.pkl — raise check-added-large-files limit from 10 000 → 15 000 KB to accommodate the cache artifact
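
As a rough illustration of the compact-JSON change, the sketch below compares indented and compact output from the standard `json` module. The `sample` payload is invented for the example, and whether preprocess-node-types.py uses `json.dumps` or a different serializer is an assumption.

```python
import json

# Hypothetical stand-in for a small slice of the node types data.
sample = {
    "python": {
        "function_definition": {"fields": ["name", "body"], "named": True}
    }
}

pretty = json.dumps(sample, indent=2)
# Dropping indentation and the spaces after ',' and ':' gives the smallest form.
compact = json.dumps(sample, separators=(",", ":"))

# Both forms decode to the same data; only the whitespace differs.
assert json.loads(pretty) == json.loads(compact)
print(len(pretty), len(compact))
```

The saving grows with nesting depth, since every nested line in the indented form carries its own leading whitespace.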

Copilot AI and others added 2 commits March 16, 2026 16:41
…tract helper methods, add cache file

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
…c, simplify adapter guard

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 16, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • openaipublic.blob.core.windows.net
    • Triggering command: /home/REDACTED/work/codeweaver/codeweaver/.venv/bin/python python -m pytest tests/unit/ -v --timeout=60 --no-cov -q (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI changed the title from "[WIP] Fix insecure deserialization vulnerability in node types parser" to "Fix JSON node types cache: commit missing file and repair union deserialization" on Mar 16, 2026
Copilot AI requested a review from bashandbone March 16, 2026 16:45
@bashandbone bashandbone marked this pull request as ready for review March 16, 2026 17:37
Copilot AI review requested due to automatic review settings March 16, 2026 17:37
Contributor

Copilot AI left a comment

Pull request overview

Fixes the semantic node-types JSON cache so it can be loaded reliably at startup (instead of silently falling back to full JSON parsing), and commits the generated cache artifact needed for runtime loading.

Changes:

  • Repair cache deserialization by validating cached connections as raw dicts and reconstructing DirectConnection vs PositionalConnections using "role" as a discriminator.
  • Add and license the generated node_types_cache.json artifact; update preprocessing to emit compact JSON.
  • Bump the check-added-large-files hook limit and pin hk in mise.toml to accommodate the cache artifact.
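
The "silent fallback" problem can be sketched with the standard library alone. This is not the code in node_type_parser.py: the function name `load_node_types`, the `full_parse` callback, and the use of plain `json` instead of a pydantic TypeAdapter are all assumptions made for illustration. The point is only that a missing or unusable cache should be logged before falling back, so that a permanently failing cache is visible.

```python
import json
import logging
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


def load_node_types(
    cache_path: Path, full_parse: Callable[[], Dict[str, Any]]
) -> Dict[str, Any]:
    """Return cached data when usable; otherwise log why and fall back."""
    try:
        data = json.loads(cache_path.read_text(encoding="utf-8"))
        if not isinstance(data, dict):
            raise ValueError("cache root must be a JSON object")
        return data
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError subclasses ValueError, so it is covered here.
        logger.warning("node types cache unusable (%s); using full parse", exc)
        return full_parse()


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "node_types_cache.json"

    # The file does not exist yet, so the fallback runs.
    from_fallback = load_node_types(cache_file, lambda: {"source": "full-parse"})

    # Once the file exists and is valid, the cache is used.
    cache_file.write_text('{"source":"cache"}', encoding="utf-8")
    from_cache = load_node_types(cache_file, lambda: {"source": "full-parse"})
```

Catching a narrow set of exceptions and logging them keeps the fallback behaviour while making it obvious when the cache path is never taken.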

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/codeweaver/semantic/node_type_parser.py | Fixes cache load/deserialization; adds cached TypeAdapter and reconstruction helpers. |
| src/codeweaver/semantic/data/node_types_cache.json | Adds the missing runtime cache artifact (large JSON). |
| src/codeweaver/semantic/data/node_types_cache.json.license | Adds REUSE/SPDX licensing sidecar for the cache artifact. |
| scripts/build/preprocess-node-types.py | Writes compact JSON to reduce cache size. |
| mise.toml | Pins hk tool version. |
| hk.pkl | Raises large-file hook threshold to allow committing the cache file. |

bashandbone and others added 2 commits March 16, 2026 13:52
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
@bashandbone bashandbone merged commit a6438c7 into fix-insecure-deserialization-node-types-cache-15389003928708879868 Mar 16, 2026
3 checks passed
@bashandbone bashandbone deleted the copilot/sub-pr-233 branch March 16, 2026 17:52
bashandbone added a commit that referenced this pull request Mar 16, 2026
* 🔒 Replace insecure pickle with JSON for node types cache

Mitigate insecure deserialization vulnerability by switching the
tree-sitter node types cache from pickle to JSON.

- Updated scripts/build/preprocess-node-types.py to serialize cache to JSON.
- Updated src/codeweaver/semantic/node_type_parser.py to load and validate
  the JSON cache using pydantic.TypeAdapter.
- Corrected build and CI artifact paths in mise.dev.toml.
- Updated documentation in src/codeweaver/semantic/data/__init__.py.

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>

* 🔒 Switch node types cache to JSON and fix CI issues

- Switch tree-sitter node types cache from pickle to JSON to mitigate
  insecure deserialization (S301).
- Use Pydantic's TypeAdapter for safe validation and loading.
- Fix CI failures on Python 3.14 by using the project's internal uuid7
  utility instead of uuid_extensions.
- Clean up classification_result during cache loading to ensure fresh
  recomputation.
- Correct artifact paths in mise.dev.toml and documentation.

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>

* Potential fix for pull request finding 'Unused import'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>

* Fix JSON node types cache: commit missing file and repair union deserialization (#235)

* Initial plan

* Fix JSON cache loading: resolve connection union, fix type errors, extract helper methods, add cache file

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>

* Address code review: clarify noqa comment, improve circular-import doc, simplify adapter guard

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>

---------

Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI added a commit that referenced this pull request Mar 17, 2026