Skip to content

Enforce UTF-8 validity for RFC 8259 string parsing#22

Merged
ionux merged 1 commit intomainfrom
codex/analyze-for-rfc-8259-compliance
Mar 11, 2026
Merged

Enforce UTF-8 validity for RFC 8259 string parsing#22
ionux merged 1 commit intomainfrom
codex/analyze-for-rfc-8259-compliance

Conversation

@ionux
Copy link
Copy Markdown
Owner

@ionux ionux commented Mar 11, 2026

Motivation

  • Ensure JSON string handling conforms to RFC 8259 §8.1 by rejecting ill-formed UTF-8 bytes in string contents.

Description

  • Added okj_is_utf8_continuation() and okj_validate_utf8_sequence() to validate single UTF-8 scalar sequences (1–4 bytes) and detect invalid forms such as overlong encodings, surrogate-range encodings, invalid continuation bytes, truncated sequences, and out-of-range 4-byte forms.
  • Integrated UTF-8 validation into the string-parsing loop in src/ok_json.c so non-escaped non-ASCII bytes are checked and the parser advances by the validated byte length instead of assuming single-byte characters.
  • On UTF-8 validation failure the parser now returns OKJ_ERROR_BAD_STRING for the offending string token.
  • Added unit tests in test/ok_json_tests.c: test_utf8_valid_multibyte, test_utf8_invalid_overlong, test_utf8_invalid_surrogate, and test_utf8_invalid_truncated to cover valid multibyte text and representative invalid UTF-8 cases.

Testing

  • Built and ran the full test suite with make; the test runner executed and all tests passed including the new UTF-8 tests.
  • Test files exercised: test/ok_json_tests.c (new UTF-8 cases) and src/ok_json.c (parser changes) and reported All OK_JSON tests passed! on the test runner output.

Codex Task

@ionux ionux merged commit b019cef into main Mar 11, 2026
2 checks passed
@ionux ionux deleted the codex/analyze-for-rfc-8259-compliance branch March 11, 2026 02:27
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7745e260b1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/ok_json.c
return 0U;
}

b2 = (uint8_t)src[pos + 2U];
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Prevent UTF-8 validation from reading past buffer end

okj_validate_utf8_sequence() reads src[pos + 2U] (and later src[pos + 3U]) without first verifying those bytes exist, so a malformed input that ends mid-sequence (for example, an unterminated string whose last byte is 0xE0) can trigger an out-of-bounds read before the parser reports an error. In this case the parser enters the string loop because src[pos] != '\0', then dereferences beyond the terminator, which is undefined behavior and can crash under sanitizers.

Useful? React with 👍 / 👎.

Comment thread src/ok_json.c
break;
}

parser->position += utf8_advance;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce max string length after UTF-8 byte advance

The max-length check is performed before consuming the next character, but parser->position now advances by the full UTF-8 sequence length, so a final multibyte character can push the token past OKJ_MAX_STRING_LEN and still be accepted if the next byte is the closing quote. For example, a 63-byte ASCII prefix plus a 2-byte UTF-8 character parses successfully with length 65, which bypasses the configured hard limit.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant