Skip to content

fix(assail): strip C-family line comments before cross-lang URL detection#52

Merged
hyperpolymath merged 1 commit into
mainfrom
fix/cross-lang-doc-comment-fps
May 27, 2026
Merged

fix(assail): strip C-family line comments before cross-lang URL detection#52
hyperpolymath merged 1 commit into
mainfrom
fix/cross-lang-doc-comment-fps

Conversation

@hyperpolymath
Copy link
Copy Markdown
Owner

Summary

Self-scan repro: a doc-comment line like

/// types: [\"http://hyperpolymath.dev/panic-attack/AssailReport\"]

was flagging InsecureProtocol (PA-IP). The string in the comment is an illustrative JSON-LD @type namespace URI, not a configured endpoint — it has no runtime effect, but the cross-language detector's regex matched it anyway because analyze_cross_language was being passed the raw file content (no comment stripping).

Fix

New strip_c_family_line_comments helper, applied to the content before the http://-URL regex runs in analyze_cross_language. The helper:

  • Detects // (and ///, //! by extension)
  • Respects string literals (so \"http://localhost\" is preserved)
  • Handles escape sequences (\\\") inside strings

Naturally covers Rust, JavaScript, TypeScript, Java, C, C++, Go — every language whose comment syntax is //. Python #, Lisp ;, Lua/Idris -- etc. are not (yet) language-aware; cross-language remains best-effort and could be extended per file_path extension in a follow-up.

Out of scope

  • Block comments (/* */) and raw-string literals (r#\"...\"#) are not consumed here. The existing localhost exemption + this line-comment strip handle the bulk of FPs in practice.
  • A real string-literal URL like \"http://example.com\" is STILL flagged — the regex sees through the string. That's correct: a hardcoded HTTP endpoint in production code is the signal we want.
  • JSON-LD @type URIs that genuinely live in code (not in comments) remain a TP from the regex's perspective; suppress via the user-classification registry if audited.

Regression coverage

6 new tests in assail::analyzer::tests:

Test Asserts
strip_c_family_line_comment_handles_basic_double_slash Trailing // comment is dropped
strip_c_family_line_comment_handles_doc_comments /// and //! lines are dropped
strip_c_family_line_comment_preserves_urls_in_strings \"http://localhost/path\" survives intact
strip_c_family_line_comment_handles_escaped_quote_in_string \\\"hi\\\" doesn't confuse the string tracker
strip_c_family_line_comments_doc_comment_url_fp_gone Self-scan repro: doc-URL → stripped
strip_c_family_line_comments_keeps_jsonld_type_string Self-scan companion: JSON-LD string-literal URL → preserved (out of scope, but the stripper must not over-fire)

Note: test URLs use http://localhost so panic-attack scanning its own source doesn't trip the InsecureProtocol detector on the test data itself. Same exemption rule used in production.

Verification

  • cargo test --bin panic-attack --features signing,http242 passed (was 236; +6 new tests)
  • cargo clippy --all-targets --features signing,http -- -D warnings — clean
  • cargo fmt --check — clean
  • Self-scan before: 2 InsecureProtocol findings in storage/mod.rs (1 doc-comment FP, 1 JSON-LD literal — out of scope)
  • Self-scan after: 1 InsecureProtocol finding in storage/mod.rs (the JSON-LD literal remains; the doc-comment FP is gone)
  • No new findings introduced anywhere
  • GPG-signed commit

🤖 Generated with Claude Code

…etection

Self-scan repro: a doc-comment line like

  /// types: ["http://hyperpolymath.dev/panic-attack/AssailReport"]

was flagging InsecureProtocol PA-IP. The string in the comment is an
illustrative JSON-LD `@type` namespace URI, not a configured endpoint —
it has no runtime effect, but the cross-language detector's regex
matched it anyway because `analyze_cross_language` was being passed
the raw file content (no comment stripping).

Fix: new `strip_c_family_line_comments` helper, applied to the content
before the http://-URL regex runs in `analyze_cross_language`. The
helper detects `//` (and `///`, `//!` by extension), respects string
literals (so `"http://localhost"` is preserved), and handles escape
sequences inside strings.

Naturally covers Rust, JavaScript, TypeScript, Java, C, C++, Go — every
language whose comment syntax is `//`. Python `#`, Lisp `;`, Lua/Idris
`--` etc. are not (yet) language-aware; the cross-language detector
remains best-effort and could be extended per file_path extension in a
follow-up.

Limits:
- Block comments (/* */) and raw-string literals (r#"..."#) are not
  consumed here. The detector's existing localhost exemption + this
  line-comment strip handle the bulk of FPs in practice.
- A real string-literal URL like `"http://example.com"` is STILL
  flagged — the regex sees through the string. That's correct: a
  hardcoded HTTP endpoint in production code is the signal we want.
- JSON-LD `@type` URIs that genuinely live in code (not in comments)
  remain a TP from the regex's perspective; suppress them via the
  user-classification registry if they're audited.

Regression coverage: 6 new tests in `assail::analyzer::tests` covering
basic `//`, doc-comments, string-literal preservation, escape
sequences, and the self-scan repro patterns (doc-URL stripped /
JSON-LD-string preserved).

Test URLs use http://localhost so panic-attack scanning its own
source under self-scan doesn't trip the InsecureProtocol detector on
the test data.

Verification:
- cargo test --bin panic-attack --features signing,http — 242 passed
  (was 236; +6 new tests, no failures)
- cargo clippy --all-targets --features signing,http -D warnings — clean
- cargo fmt --check — clean
- self-scan before: 2 InsecureProtocol findings in storage/mod.rs
  (1 doc-comment FP, 1 JSON-LD literal — out of scope)
- self-scan after:  1 InsecureProtocol finding in storage/mod.rs
  (the JSON-LD literal remains; the doc-comment FP is gone)
- self-scan total findings: 11 (down from 12 yesterday; was already 11
  after #51, this PR preserves the count while fixing a categorical
  FP class)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hyperpolymath hyperpolymath enabled auto-merge (squash) May 26, 2026 10:55
@hyperpolymath hyperpolymath merged commit 434d640 into main May 27, 2026
@hyperpolymath hyperpolymath deleted the fix/cross-lang-doc-comment-fps branch May 27, 2026 12:07
hyperpolymath added a commit that referenced this pull request May 27, 2026
…ureProtocol

The cross-language InsecureProtocol detector was flagging JSON-LD `@type`,
`@id`, `@context` namespace URIs and JSON-Schema `$schema` identifiers
as if they were configured HTTP endpoints. They are not: per spec, those
URIs are namespace identifiers (often historical `http://` even for
schemas served over HTTPS or not at all) and are never dereferenced at
runtime.

Choice rationale (vs verisimdb / user-classification registry):

- VeriSimDB is storage + query, not a classifier — it cannot pre-empt
  an FP at detection time; it would just persist the FP and need a
  downstream rule.
- The user-classification registry (`audits/assail-classifications.a2ml`)
  is the right tool for per-instance audited TPs (`UnsafeCode in
  zig_bridge.rs §1` etc.), but JSON-LD identifier URIs are a
  CATEGORICAL false-positive class shared by every JSON-LD / JSON-Schema
  consumer in the estate. Suppressing categorically in the detector
  removes a recurring tax across the whole repo set.

Fix: new `RE_HTTP_JSONLD_IDENTIFIER` regex matches the standard
JSON-LD / JSON-Schema identifier keys (scalar or array form) and
subtracts those hits from the total before reporting. Both shapes
are covered:

  {"@type":  "http://..."}
  {"types":  ["http://..."]}
  {"$schema": "http://..."}

Exempted keys: @id, @type, @context, @vocab, @graph (JSON-LD);
id, type, types (common shorthands); $schema, $id, $ref (JSON Schema).

Genuine endpoints remain flagged. A field keyed `"url"`, `"endpoint"`,
`"api_url"` etc. is not in the exempt set, so a real config URL like
`{"url": "http://insecure.example.com"}` still produces a finding.

Test fixtures use a runtime-composed URL (`format!("htt{}p://...","")`)
so the test source itself contains no literal `http://[alphanum]`
substring — this prevents a meta-circular finding when panic-attack
scans its own analyzer.rs.

Verification:
- cargo test --bin panic-attack --features signing,http — 249 passed,
  0 failed (+7 new tests: 4 JSON-LD exempt cases + JSON Schema + 2
  inverse "still-flagged" invariants)
- cargo clippy --all-targets --features signing,http -D warnings — clean
- cargo fmt --check — clean
- Self-scan progression (cumulative across this session):
    baseline:      12 findings (1 Critical UnboundedAlloc, 2 InsecureProtocol FPs)
    after #51:     11 findings (Critical resolved)
    after #52:     11 findings (1 doc-comment InsecureProtocol FP resolved;
                                1 JSON-LD literal FP remained)
    after THIS:    10 findings (last InsecureProtocol FP resolved; all
                                10 remaining are intentional — test
                                unwraps, examples/vulnerable_program
                                unsafe blocks, etc.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hyperpolymath added a commit that referenced this pull request May 27, 2026
…ureProtocol (#53)

## Summary

The cross-language InsecureProtocol detector was flagging JSON-LD
`@type`, `@id`, `@context` namespace URIs and JSON-Schema `$schema`
identifiers as if they were configured HTTP endpoints. They aren't: per
spec, those URIs are namespace identifiers (often historical `http://`
even for schemas served over HTTPS or not at all) and are never
dereferenced at runtime.

**Stacked on #52** (line-comment stripper) — that PR handled doc-comment
FPs; this one handles JSON-LD literal FPs that survive into runtime
code.

## Why a detector enhancement rather than verisimdb or the
user-classification registry

| Approach | Verdict |
|---|---|
| **VeriSimDB** | Storage + query, not a classifier. Cannot pre-empt FP
at detection time — would persist the FP and need a downstream rule.
Wrong layer. |
| **User-classification registry**
(`audits/assail-classifications.a2ml`) | Right tool for per-instance
audited TPs (\"UnsafeCode in `zig_bridge.rs` §1\"). Wrong tool for a
**categorical** FP class shared by every JSON-LD / JSON-Schema consumer
in the estate — would require N entries across N repos. |
| **Detector enhancement** (this PR) | Removes the recurring tax across
every estate repo with a single rule. The FP class is well-defined by
spec, low-risk to suppress. |

## Fix

New `RE_HTTP_JSONLD_IDENTIFIER` regex that matches the standard JSON-LD
/ JSON-Schema identifier keys (scalar or array form) and subtracts those
hits from the InsecureProtocol total before reporting.

| Pattern | Subtracted |
|---|---|
| `\"@type\": \"http://...\"` | ✓ |
| `\"@id\": \"http://...\"` | ✓ |
| `\"@context\": \"http://...\"` | ✓ |
| `\"types\": [\"http://...\"]` | ✓ |
| `\"$schema\": \"http://...\"` | ✓ |
| `\"url\": \"http://...\"` | ✗ — not an identifier key |
| `client.get(\"http://...\")` | ✗ — bare endpoint |

Exempted keys: `@id`, `@type`, `@context`, `@vocab`, `@graph` (JSON-LD);
`id`, `type`, `types` (common shorthands); `$schema`, `$id`, `$ref`
(JSON Schema).

## Regression coverage

7 new tests in `assail::analyzer::tests` (via a shared
`count_http_findings` test helper):

| Test | Asserts |
|---|---|
| `jsonld_at_type_uri_is_exempt` | `{\"@type\": \"http://...\"}` → 0
findings |
| `jsonld_at_id_uri_is_exempt` | `{\"@id\": \"http://...\"}` → 0
findings |
| `jsonld_at_context_uri_is_exempt` | `{\"@context\": \"http://...\"}` →
0 findings |
| `jsonld_types_array_is_exempt` | `{\"types\": [\"http://...\"]}` → 0
findings (exact self-scan repro) |
| `json_schema_dollar_schema_is_exempt` | `{\"$schema\":
\"http://...\"}` → 0 findings |
| `real_endpoint_url_is_still_flagged` | `client.get(\"http://...\")` →
>0 findings (invariant) |
| `endpoint_key_named_url_is_still_flagged` | `{\"url\":
\"http://...\"}` → >0 findings (invariant) |

Test URLs are runtime-composed (`format!(\"htt{}p://...\", \"\")`) so
the source itself contains no literal `http://[alphanum]` substring —
prevents a meta-circular self-scan finding when panic-attack scans its
own `analyzer.rs`.

## Verification

- [x] `cargo test --bin panic-attack --features signing,http` — **249
passed** (+7 new tests)
- [x] `cargo clippy --all-targets --features signing,http -- -D
warnings` — clean
- [x] `cargo fmt --check` — clean
- [x] **Self-scan progression** (cumulative):
- baseline: 12 findings (1 Critical UnboundedAlloc, 2 InsecureProtocol
FPs)
  - after #51: 11 findings (Critical resolved)
- after #52: 11 findings (doc-comment InsecureProtocol FP resolved;
JSON-LD literal FP remained)
  - **after this PR: 10 findings (last InsecureProtocol FP resolved)**

All 10 remaining findings are intentional (test unwraps,
`examples/vulnerable_program.rs` unsafe blocks, scaffold `flake.nix`,
etc.).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant