fix(codegen-js): emit \uXXXX/\u{X} for non-ASCII string literals (closes #460) #463

Merged: hyperpolymath merged 1 commit into main from feat/codegen-js-unicode-string-lit-460 on May 30, 2026
@hyperpolymath
Owner

Summary

Closes #460 — non-ASCII string literals in AffineScript source no longer break strict-mode ESM in the Deno/Node JS backends.

Root cause

OCaml's `String.escaped` emits each non-ASCII byte as a three-digit `\NNN` decimal sequence. JavaScript parses `\NNN` as an octal escape, which strict-mode ESM rejects:

```
SyntaxError: Octal escape sequences are not allowed in strict mode.
```

(And even outside strict mode the bytes would decode to the wrong characters — `\226` octal = 0x96, not the 0xE2 lead-byte of ❌.)
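
The mismatch can be reproduced in plain JavaScript. The following is a minimal sketch, not code from the PR: it rebuilds the decimal escapes by hand (the variable names are illustrative) and uses `new Function` to parse the resulting literal in sloppy and strict mode.

```javascript
// UTF-8 bytes of "❌" (U+274C), written the way OCaml's String.escaped
// writes non-ASCII bytes: a backslash plus three DECIMAL digits per byte.
const bytes = Array.from(new TextEncoder().encode("❌")); // [226, 157, 140]
const ocamlStyle = bytes
  .map((b) => "\\" + String(b).padStart(3, "0"))
  .join("");
console.log(ocamlStyle); // \226\157\140

// JavaScript reads the same digits as OCTAL, so \226 means 0x96, not 0xE2.
const firstAsOctal = parseInt("226", 8);
console.log(firstAsOctal.toString(16)); // 96

// Sloppy mode: the literal parses but decodes to the wrong characters.
const sloppy = new Function('return "' + ocamlStyle + '";')();
console.log(sloppy === "\u0096o`"); // true

// Strict mode (which ES modules always use): the literal does not parse.
let strictError = null;
try {
  new Function('"use strict"; return "' + ocamlStyle + '";');
} catch (e) {
  strictError = e.name;
}
console.log(strictError); // SyntaxError
```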

Fix

New helper `Js_codegen.js_string_lit` walks the UTF-8 byte sequence, decodes code points, and emits:

| Character class | Output |
| --- | --- |
| Printable ASCII (0x20-0x7E except `\` `"`) | as-is |
| `\` `"` `\n` `\r` `\t` | conventional escape |
| Other ASCII (control bytes) | `\xHH` |
| Non-ASCII BMP (U+0080..U+FFFF) | `\uXXXX` |
| Non-BMP (U+10000+) | `\u{XXXXX}` |

Wired into both `js_codegen.ml` (Node target) and `codegen_deno.ml` (Deno-ESM target) at the `LitString`/`LitChar` emit sites.
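
To make the escaping rules concrete, here is a hedged JavaScript re-sketch of the rules listed above. It is not the PR's implementation: the real helper is the OCaml function `Js_codegen.js_string_lit`, which decodes UTF-8 bytes, whereas this sketch iterates the code points of a JavaScript string directly. The names `jsStringLit` and `hex` are hypothetical.

```javascript
// Hypothetical sketch of the escaping rules; the actual helper is written
// in OCaml and its source is not reproduced here.
function hex(cp, width) {
  return cp.toString(16).toUpperCase().padStart(width, "0");
}

function jsStringLit(s) {
  let out = '"';
  for (const ch of s) {            // for...of iterates by code point
    const cp = ch.codePointAt(0);
    if (ch === "\\") out += "\\\\";
    else if (ch === '"') out += '\\"';
    else if (ch === "\n") out += "\\n";
    else if (ch === "\r") out += "\\r";
    else if (ch === "\t") out += "\\t";
    else if (cp >= 0x20 && cp <= 0x7e) out += ch;             // printable ASCII
    else if (cp < 0x80) out += "\\x" + hex(cp, 2);            // ASCII control
    else if (cp <= 0xffff) out += "\\u" + hex(cp, 4);         // non-ASCII BMP
    else out += "\\u{" + cp.toString(16).toUpperCase() + "}"; // non-BMP
  }
  return out + '"';
}

console.log(jsStringLit("❌"));  // "\u274C"
console.log(jsStringLit("😭")); // "\u{1F62D}"
console.log(jsStringLit("plain ASCII")); // "plain ASCII"
```

Because every non-ASCII code point becomes a `\u` escape, the emitted literal is pure ASCII and contains no digit-only backslash sequences that a JavaScript parser could read as octal.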

Test plan

New `tests/codegen-deno/non_ascii.affine` fixture + harness:

```affine
pub fn emoji_cross() -> String { return "❌"; } // BMP U+274C
pub fn non_bmp_sob() -> String { return "😭"; } // non-BMP U+1F62D
pub fn cjk_hello() -> String { return "你好"; }
pub fn latin_accent() -> String { return "café résumé"; }
pub fn mixed() -> String { return "[OK] café 你好 ❌"; }
pub fn ascii_only() -> String { return "plain ASCII"; }
pub fn quotes_and_backslash() -> String { return "\"escaped\" and \\back"; }
```

The `import` itself is the strictest test: if the emitted `.deno.js` contains octal escapes, the module fails to parse and the harness import throws SyntaxError before any assertion runs.

  • Local `./tools/run_codegen_deno_tests.sh`: 13/13 harnesses green (including the new fixture)
  • Local `dune test`: 352/352 unit tests green
  • Compiler output spot-check: `emoji_cross` emits `return "\u274C";`, `non_bmp_sob` emits `return "\u{1F62D}";`, ASCII passes through unchanged
  • Manual: emitted `.deno.js` parses + runs under Node 20 ESM (ES modules are always strict-mode code)
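
As a small illustration of the spot-check above, the two emitted escape forms decode back to the original characters. This is a standalone sketch and not part of the PR's harness.

```javascript
// The escapes reported in the spot-check, written as ordinary JS literals.
const cross = "\u274C";   // BMP: four hex digits
const sob = "\u{1F62D}";  // non-BMP: braced code point escape (ES2015+)

console.log(cross === "❌"); // true
console.log(sob === "😭");  // true

// A non-BMP character occupies two UTF-16 code units but is one code point.
console.log(sob.length);      // 2
console.log([...sob].length); // 1
```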

Out of scope

  • `rescript_codegen.ml` also uses `String.escaped` but emits ReScript source (which the rescript compiler then transforms to JS). Whether ReScript inherits the same bug is a separate question; not addressed here.
  • Other non-JS codegens (lua, c, rust, julia, gleam, nickel, why3) keep `String.escaped` — they target languages with their own escape conventions.

Refs

🤖 Generated with Claude Code

OCaml's `String.escaped` emits non-ASCII bytes as `\NNN` *decimal*
sequences. JavaScript parses `\NNN` as *octal* escapes which strict-mode
ESM rejects outright (`SyntaxError: Octal escape sequences are not
allowed in strict mode`), and which would decode to wrong characters
even outside strict mode.

Adds `Js_codegen.js_string_lit` that walks the UTF-8 byte sequence,
decodes code points, and emits `\uXXXX` (BMP) or `\u{XXXXX}` (non-BMP)
Unicode escapes. ASCII printable bytes pass through unchanged; `\\` `\"`
`\n` `\r` `\t` use conventional escapes; ASCII control bytes use
`\xHH`. Wired into both `js_codegen.ml` (Node target) and
`codegen_deno.ml` (Deno-ESM target) LitString/LitChar emit sites.

Regression fixture `tests/codegen-deno/non_ascii.affine` + harness
exercise BMP emoji (❌ ✓), CJK (你好), Latin accents (café résumé),
non-BMP code points (😭 = U+1F62D), mixed strings, and the
existing-escape regression path (\\ and \"). Pre-fix: harness
`import` itself fails with SyntaxError. Post-fix: 8/8 assertions pass.

Verified: full `tools/run_codegen_deno_tests.sh` (13/13 harnesses
green); full `dune test` suite (352/352 green).

Closes #460
Refs hyperpolymath/standards#284 (the seam-analyst PR that surfaced
this)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hyperpolymath hyperpolymath enabled auto-merge (squash) May 30, 2026 14:08
@github-actions

🔍 Hypatia Security Scan

Findings: 83 issues detected

| Severity | Count |
| --- | --- |
| 🔴 Critical | 4 |
| 🟠 High | 11 |
| 🟡 Medium | 68 |

⚠️ Action Required: Critical security issues found!

View findings
[
  {
    "reason": "Action perpolymath/standards/.github/workflows/governance-reusable.yml@main\n needs attention",
    "type": "unpinned_action",
    "file": "governance.yml",
    "action": "pin_sha",
    "rule_module": "workflow_audit",
    "severity": "medium"
  },
  {
    "reason": "Action ons/checkout@v6\n    needs attention",
    "type": "unpinned_action",
    "file": "publish-jsr.yml",
    "action": "pin_sha",
    "rule_module": "workflow_audit",
    "severity": "medium"
  },
  {
    "reason": "Action land/setup-deno@v2\n    needs attention",
    "type": "unpinned_action",
    "file": "publish-jsr.yml",
    "action": "pin_sha",
    "rule_module": "workflow_audit",
    "severity": "medium"
  },
  {
    "reason": "Issue in affine-vscode-publish.yml",
    "type": "unknown",
    "file": "affine-vscode-publish.yml",
    "action": "flag",
    "rule_module": "workflow_audit",
    "severity": "medium"
  },
  {
    "reason": "Issue in casket-pages.yml",
    "type": "unknown",
    "file": "casket-pages.yml",
    "action": "flag",
    "rule_module": "workflow_audit",
    "severity": "medium"
  },
  {
    "reason": "Issue in casket-pages.yml",
    "type": "unknown",
    "file": "casket-pages.yml",
    "action": "flag",
    "rule_module": "workflow_audit",
    "severity": "medium"
  },
  {
    "reason": "Issue in ci.yml",
    "type": "unknown",
    "file": "ci.yml",
    "action": "flag",
    "rule_module": "workflow_audit",
    "severity": "medium"
  },
  {
    "reason": "Issue in ci.yml",
    "type": "unknown",
    "file": "ci.yml",
    "action": "flag",
    "rule_module": "workflow_audit",
    "severity": "medium"
  },
  {
    "reason": "Issue in ci.yml",
    "type": "unknown",
    "file": "ci.yml",
    "action": "flag",
    "rule_module": "workflow_audit",
    "severity": "medium"
  },
  {
    "reason": "Issue in ci.yml",
    "type": "unknown",
    "file": "ci.yml",
    "action": "flag",
    "rule_module": "workflow_audit",
    "severity": "medium"
  }
]

Powered by Hypatia Neurosymbolic CI/CD Intelligence

@hyperpolymath hyperpolymath merged commit d5437f8 into main May 30, 2026
26 of 27 checks passed
@hyperpolymath hyperpolymath deleted the feat/codegen-js-unicode-string-lit-460 branch May 30, 2026 14:11
hyperpolymath added a commit that referenced this pull request May 30, 2026
…464)

## Summary

Closes #458: `String < String` (and `>` / `<=` / `>=`) now
type-check, lowering to JS's native lexicographic string comparison.
Pre-fix: `TypeMismatch (String, Int)`.

## Implementation

Single addition to the existing comparison dispatch in
`Typecheck.synth_expr` for `ExprBinary`:

```ocaml
match repr lhs_ty with
| TCon "Float" -> ...
| TCon "String" ->
    let* () = check ctx rhs ty_string in
    Ok ty_bool
| _ -> ...   (* legacy Int monomorphism *)
```

Pattern mirrors the existing Float dispatch a few lines up. No codegen
changes are needed: JavaScript's `<` / `>` / `<=` / `>=` on strings
is lexicographic comparison natively, and the JS-family backends already
emit those operators verbatim.
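
For readers unfamiliar with the JavaScript side, a brief sketch of what the emitted operators do on strings. The helper names `lt` and `le` are illustrative, not taken from the fixture. One caveat worth keeping in mind: JavaScript compares strings by UTF-16 code unit, which differs from code-point order once non-BMP characters are involved.

```javascript
// Relational operators on two strings compare lexicographically,
// one UTF-16 code unit at a time.
const lt = (a, b) => a < b;
const le = (a, b) => a <= b;

console.log(lt("", "a"));        // true  (empty string sorts first)
console.log(le("", ""));         // true
console.log(lt("abc", "abcd"));  // true  (a proper prefix sorts first)
console.log(lt("x", "x"));       // false
console.log(lt("Z", "a"));       // true  (0x5A < 0x61)

// Code-unit order is not code-point order: U+FF5E is a smaller code point
// than U+1F62D, but 0xFF5E is larger than the lead surrogate 0xD83D.
console.log(lt("\uFF5E", "\u{1F62D}")); // false
```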

## Test plan

New regression fixture `tests/codegen-deno/string_lex_cmp.affine` +
harness with **22 assertions**:

- All four ops via functional form (`lt(a, b)`, etc.): covers each
operator's positive/negative direction
- All four ops via literal form (`first_lt()`, etc.)
- Equal-string corner cases: `x <= x` true, `x >= x` true, `x <
x` false
- Empty strings: `"" < "a"`, `"" <= ""`
- Prefix relations: `"abc" < "abcd"`

- [x] Local `./tools/run_codegen_deno_tests.sh`: **14/14** harnesses
green
- [x] Local `dune test`: **352/352** green
- [x] Smoke compile: `return a < b;` emits as `return (a < b);` (JS
native)

## Out of scope

- **Non-ASCII string comparison** in the fixture: this branch forked
from `main` before #463 (the companion Unicode-escape codegen fix for
#460) has landed, so non-ASCII source literals would still emit
OCaml-style `\NNN` decimal escapes, which JavaScript reads as octal and
strict-mode ESM rejects. The relational typecheck change is orthogonal
to literal encoding; non-ASCII lex compare works naturally once both PRs
merge. A non-ASCII assertion can be added in a follow-up commit after
#463 merges, or added here after a rebase, whichever order the two land
in.
- **Other backends** (rescript, wasm, lua, c, rust): out of scope; #458
specifically called out the JS-family ergonomic gap. If `String <`
lowering for other backends becomes load-bearing, file separately.

## Refs

- Closes #458
- Refs hyperpolymath/standards#284 (the seam-analyst PR with the
`str_lt` workaround)
- Companion: #463 (#460 Unicode-escape codegen, lands together to
unblock non-ASCII relational comparisons)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

codegen --deno-esm: non-ASCII string literals lower to octal escapes, blocked by strict-mode ESM
