
fix(landing): render real Unicode in language tags instead of literal \u escapes#74

Merged
heznpc merged 1 commit into main from fix/landing-page-unicode-labels
Apr 10, 2026

@heznpc heznpc commented Apr 10, 2026

Summary

The landing page at https://heznpc.github.io/skillBridge/ has been displaying literal \uXXXX escape sequences for non-Latin language names since the lang-tag list was first auto-generated. Verified live before this fix.

| Language | Was showing | Should show |
| --- | --- | --- |
| Korean | \ud55c\uad6d\uc5b4 | 한국어 |
| Japanese | \u65e5\u672c\u8a9e | 日本語 |
| Chinese (Simplified) | \u4e2d\u6587(\u7b80\u4f53) | 中文(简体) |
| Chinese (Traditional) | \u4e2d\u6587(\u7e41\u9ad4) | 中文(繁體) |
| Russian | \u0420\u0443\u0441\u0441\u043a\u0438\u0439 | Русский |
| (also Vietnamese, Ukrainian, Czech, Turkish, Arabic, Hindi, Thai, Bengali, Hebrew, Romanian, Greek in + N more) | \u... | actual chars |

Languages whose names are pure Latin (Español, Français, Deutsch, Português (BR), etc.) rendered fine because they were already stored with literal characters.

Root cause

src/lib/constants.js stored each language label as a string literal containing \uXXXX escape sequences:

const PREMIUM_LANGUAGES = [
  { code: 'ko', label: '\ud55c\uad6d\uc5b4' },   // ← escape literal
  { code: 'ja', label: '\u65e5\u672c\u8a9e' },
  // ...
  { code: 'de', label: 'Deutsch' },              // ← already plain UTF-8
];

At runtime this is fine. The JS engine decodes \uXXXX when parsing the source, so popup.js, header-controls.js, translator.js, and the in-extension language picker all see the correct characters and display them properly.

But scripts/generate-docs.js reads constants.js as a TEXT FILE and uses a regex to extract the label between single quotes:

const entryRe = /\{\s*code:\s*'([^']+)',\s*label:\s*'([^']+)'\s*\}/g;

The regex captures the raw source characters \, u, d, 5, 5, c, ... (the escape sequence as written, not the character it stands for), which then get written verbatim into docs/index.html. Browsers don't decode \uXXXX in HTML text content, so users see the escape sequences literally on the live page.
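The mismatch can be reproduced in isolation. The snippet below is a minimal sketch: it applies the same entryRe to a made-up two-entry excerpt of constants.js (the sourceText variable is an assumption for illustration, and String.raw is used only to keep the backslashes as they would appear on disk).

```javascript
// Hypothetical excerpt of constants.js as text on disk.
// String.raw keeps the backslashes, so this mimics reading the file as a string.
const sourceText = String.raw`
const PREMIUM_LANGUAGES = [
  { code: 'ko', label: '\ud55c\uad6d\uc5b4' },
  { code: 'de', label: 'Deutsch' },
];
`;

// Same pattern as in scripts/generate-docs.js.
const entryRe = /\{\s*code:\s*'([^']+)',\s*label:\s*'([^']+)'\s*\}/g;

// Collect the second capture group (the label) from every match.
const labels = [...sourceText.matchAll(entryRe)].map((m) => m[2]);

console.log(labels[0]);        // the 18-character escape text, not 한국어
console.log(labels[0].length); // 18 (three escapes of six characters each)
console.log(labels[1]);        // Deutsch
```

Nothing in this path ever hands the label to the JS parser, so the escapes are never decoded.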

Fix

Convert every \uXXXX escape in PREMIUM_LANGUAGES and AVAILABLE_LANGUAGES to its literal UTF-8 character. constants.js is already a UTF-8 source file, every other label was already non-escaped (Deutsch, Italiano, Polski...), and prettier/eslint accept either form — there was no reason to use escapes.

Once the source has real characters, the script's regex captures them and the HTML output is correct UTF-8.
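For illustration only, the conversion could be done mechanically along these lines. This is a sketch, not necessarily how the labels were converted in this PR; decodeUnicodeEscapes is a hypothetical helper, and it assumes the text contains only plain four-digit \uXXXX escapes (it would mangle an escaped quote such as \u0027 or an escaped backslash followed by u).

```javascript
// Hypothetical one-off helper: replace each \uXXXX escape in source TEXT
// with the UTF-16 code unit it denotes. Surrogate pairs work because the
// two adjacent code units end up next to each other in the result.
function decodeUnicodeEscapes(text) {
  return text.replace(/\\u([0-9a-fA-F]{4})/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

// A single entry as it would appear on disk before the fix.
const before = String.raw`{ code: 'ko', label: '\ud55c\uad6d\uc5b4' }`;
const after = decodeUnicodeEscapes(before);

console.log(after); // { code: 'ko', label: '한국어' }

// Running it again changes nothing, since no escapes remain.
console.log(decodeUnicodeEscapes(after) === after); // true
```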

Why this is safe at runtime

  • '\ud55c\uad6d\uc5b4' and '한국어' produce identical strings when the JS engine parses them
  • All runtime consumers (popup.js:112-128, header-controls.js:78-178, translator.js:33, content.js:455) use lang.label as text and don't string-compare against escape literals
  • tests/constants.test.js only checks length, code presence, and label truthiness — no string content assertions
  • Tests pass unchanged (309/309)
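The first point can be checked directly in any JS console. A minimal sketch using two of the labels from the table above:

```javascript
// The escaped and literal spellings are the same string after parsing.
const escapedKo = '\ud55c\uad6d\uc5b4';
const literalKo = '한국어';

const escapedRu = '\u0420\u0443\u0441\u0441\u043a\u0438\u0439';
const literalRu = 'Русский';

console.log(escapedKo === literalKo); // true
console.log(escapedRu === literalRu); // true
console.log(escapedKo.length);        // 3
```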

Verification

| Check | Result |
| --- | --- |
| npm test | 309/309 passing (same as baseline) |
| npm run lint | clean |
| npm run format:check | clean (Prettier accepts UTF-8) |
| npm run docs | idempotent; second run produces no further diff |
| Raw byte inspection of docs/index.html | confirmed valid UTF-8 with real Unicode characters |
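A check like the last one could also be automated. The sketch below is an assumption rather than part of this PR: findLiteralEscapes is a hypothetical helper, and the span markup with class lang-tag is invented sample input, not the actual contents of docs/index.html.

```javascript
// Hypothetical regression check: list any literal \uXXXX sequences that
// leaked into generated HTML text.
function findLiteralEscapes(html) {
  return html.match(/\\u[0-9a-fA-F]{4}/g) ?? [];
}

// Invented sample markup, before and after the fix.
const brokenHtml = String.raw`<span class="lang-tag">\u65e5\u672c\u8a9e</span>`;
const fixedHtml = '<span class="lang-tag">日本語</span>';

console.log(findLiteralEscapes(brokenHtml).length); // 3
console.log(findLiteralEscapes(fixedHtml).length);  // 0
```

In a real test the HTML string would come from reading docs/index.html as UTF-8 text, and the assertion would be that the returned array is empty.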

Out of scope (separate fix candidate)

README.md:7 links to https://heznpc.github.io/skillbridge/ (lowercase b) but the actual GitHub Pages URL is https://heznpc.github.io/skillBridge/ (capital B). Lowercase returns 404. Same project area but a separate concern — will track separately if you'd like.

Test plan

🤖 Generated with Claude Code

… \u escapes

The landing page at heznpc.github.io/skillBridge/ has been displaying
literal escape sequences for non-Latin language names since the lang-tag
list was first auto-generated:

  Korean   : \ud55c\uad6d\uc5b4   (should be 한국어)
  Japanese : \u65e5\u672c\u8a9e   (should be 日本語)
  Chinese  : \u4e2d\u6587(\u7b80\u4f53)   (should be 中文(简体))
  Russian  : \u0420\u0443\u0441\u0441\u043a\u0438\u0439   (should be Русский)
  ... and 5 more

Verified live at https://heznpc.github.io/skillBridge/ before this fix.
Languages whose labels were already stored as literal characters (Spanish,
French, German) rendered fine.

Root cause: src/lib/constants.js stored each language label as a string
literal containing \uXXXX escape sequences. At runtime this is fine — the
JS engine decodes them when parsing the source — so popup.js, header-
controls.js, and the in-extension language picker all show the right
characters. But scripts/generate-docs.js reads constants.js as a TEXT
FILE and uses a regex to extract the label between single quotes. The
regex captures the raw source characters \, u, d, 5, 5, c, ... which then get
written verbatim into docs/index.html. Browsers don't decode \uXXXX in
HTML text content, so the live page shows the escape sequences literally.

Fix: convert every \uXXXX escape in PREMIUM_LANGUAGES and AVAILABLE_LANGUAGES
to its literal UTF-8 character. constants.js is already a UTF-8 source
file, every other label was already non-escaped (Deutsch, Italiano, etc.),
and prettier/eslint accept either form — there was no reason to use
escapes in the first place. Once the source has real characters, the
script's regex captures them and the HTML gets correct UTF-8 output.

Why this is safe at runtime:
- The JS engine produces an identical string from '\ud55c\uad6d\uc5b4' and '한국어'.
- All runtime consumers (popup.js:112-128, header-controls.js:78-178,
  translator.js:33, content.js:455) use lang.label as text and don't
  string-compare against escape literals.
- tests/constants.test.js only checks length, code presence, and label
  truthiness — no string content assertions. Tests pass unchanged.

Verification:
- npm test           — 309/309 passing (same as baseline)
- npm run lint       — clean
- npm run format:check — clean (Prettier accepts UTF-8 in source)
- npm run docs       — idempotent; second run produces no diff
- docs/index.html    — confirmed via raw byte read that the new content
                       is valid UTF-8 with real Unicode characters

Out of scope (separate fix candidate):
- README.md L7 links to https://heznpc.github.io/skillbridge/ (lowercase b)
  but the actual GitHub Pages URL is https://heznpc.github.io/skillBridge/
  (capital B). Lowercase returns 404. Same project area but a separate concern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@heznpc heznpc merged commit 2d1314b into main Apr 10, 2026
2 checks passed
@heznpc heznpc deleted the fix/landing-page-unicode-labels branch April 10, 2026 11:46