Enforce UTF-8 validity for RFC 8259 string parsing #22
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7745e260b1
        return 0U;
    }

    b2 = (uint8_t)src[pos + 2U];
Prevent UTF-8 validation from reading past buffer end
okj_validate_utf8_sequence() reads src[pos + 2U] (and later src[pos + 3U]) without first verifying those bytes exist, so a malformed input that ends mid-sequence (for example, an unterminated string whose last byte is 0xE0) can trigger an out-of-bounds read before the parser reports an error. In this case the parser enters the string loop because src[pos] != '\0', then dereferences beyond the terminator, which is undefined behavior and can crash under sanitizers.
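One way to avoid the over-read described above is to check each continuation byte before touching the next one. The sketch below is not the PR's code: is_cont() and cont_bytes_available() are hypothetical names, and it assumes the input is NUL-terminated and that src[pos] is a non-NUL lead byte. Because a NUL byte never has the 10xxxxxx form, stopping at the first non-continuation byte also stops at the terminator.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if b has the continuation form 10xxxxxx. */
static int is_cont(uint8_t b)
{
    return (b & 0xC0U) == 0x80U;
}

/*
 * Hypothetical sketch: count the continuation bytes that follow src[pos],
 * reading at most `need` of them and stopping at the first byte that is
 * not a continuation byte. Assumes src is NUL-terminated and src[pos] is
 * not the terminator, so src[pos + 1] is always readable; each later byte
 * is read only after the previous one was confirmed to be non-NUL.
 */
static size_t cont_bytes_available(const char *src, size_t pos, size_t need)
{
    size_t i;

    for (i = 1U; i <= need; i++) {
        if (!is_cont((uint8_t)src[pos + i])) {
            break; /* includes the '\0' terminator */
        }
    }
    return i - 1U;
}
```

A caller could compare the returned count with the number of continuation bytes the lead byte requires and report a bad string when it is smaller, before any range checks on the individual bytes.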
        break;
    }

    parser->position += utf8_advance;
Enforce max string length after UTF-8 byte advance
The max-length check is performed before consuming the next character, but parser->position now advances by the full UTF-8 sequence length, so a final multibyte character can push the token past OKJ_MAX_STRING_LEN and still be accepted if the next byte is the closing quote. For example, a 63-byte ASCII prefix plus a 2-byte UTF-8 character parses successfully with length 65, which bypasses the configured hard limit.
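The length issue above comes down to checking the limit against the length after the advance rather than before it. A minimal sketch follows, with hypothetical names: SKETCH_MAX_STRING_LEN stands in for OKJ_MAX_STRING_LEN, and its value of 64 is an assumption inferred from the 63 + 2 example, not taken from the source.

```c
#include <stddef.h>

/* Assumed limit for illustration only; the real OKJ_MAX_STRING_LEN may differ. */
#define SKETCH_MAX_STRING_LEN 64U

/*
 * Hypothetical sketch: returns 1 if a sequence of `advance` bytes can be
 * appended to a string of `current_len` bytes without exceeding the limit,
 * 0 otherwise. Written as a subtraction so the comparison cannot overflow.
 */
static int fits_after_advance(size_t current_len, size_t advance)
{
    return (current_len <= SKETCH_MAX_STRING_LEN)
        && (advance <= (SKETCH_MAX_STRING_LEN - current_len));
}
```

Under that assumed limit, a 63-byte prefix followed by a 1-byte character fits, while the same prefix followed by a 2-byte character is rejected.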
Motivation
Description
- okj_is_utf8_continuation() and okj_validate_utf8_sequence() to validate single UTF-8 scalar sequences (1–4 bytes) and detect invalid forms such as overlong encodings, surrogate-range encodings, invalid continuation bytes, truncated sequences, and out-of-range 4-byte forms.
- src/ok_json.c so non-escaped non-ASCII bytes are checked and the parser advances by the validated byte length instead of assuming single-byte characters.
- OKJ_ERROR_BAD_STRING for the offending string token.
- test/ok_json_tests.c: test_utf8_valid_multibyte, test_utf8_invalid_overlong, test_utf8_invalid_surrogate, and test_utf8_invalid_truncated to cover valid multibyte text and representative invalid UTF-8 cases.
Testing
- make; the test runner executed and all tests passed including the new UTF-8 tests.
- test/ok_json_tests.c (new UTF-8 cases) and src/ok_json.c (parser changes) and reported "All OK_JSON tests passed!" on the test runner output.
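For readers unfamiliar with the invalid forms named in the description (overlong, surrogate-range, truncated, out-of-range), the sketch below shows one common way to validate a single sequence, based on the well-formed byte ranges in the Unicode Standard. It is not the PR's okj_validate_utf8_sequence(): the name utf8_seq_len() is hypothetical, and the function assumes NUL-terminated input.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: returns the length in bytes (1-4) of the well-formed
 * UTF-8 sequence starting at src[pos], or 0 if the bytes are ill-formed.
 * Assumes src is NUL-terminated. A NUL lead byte is reported as length 1;
 * the caller is expected to handle end-of-input before calling.
 */
static size_t utf8_seq_len(const char *src, size_t pos)
{
    uint8_t b0 = (uint8_t)src[pos];
    uint8_t lo = 0x80U; /* allowed range for the second byte */
    uint8_t hi = 0xBFU;
    size_t len;
    size_t i;

    if (b0 < 0x80U) {
        return 1U;                       /* ASCII */
    } else if ((b0 >= 0xC2U) && (b0 <= 0xDFU)) {
        len = 2U;                        /* C0 and C1 would be overlong */
    } else if ((b0 >= 0xE0U) && (b0 <= 0xEFU)) {
        len = 3U;
        if (b0 == 0xE0U) { lo = 0xA0U; } /* reject overlong 3-byte forms */
        if (b0 == 0xEDU) { hi = 0x9FU; } /* reject surrogates U+D800..U+DFFF */
    } else if ((b0 >= 0xF0U) && (b0 <= 0xF4U)) {
        len = 4U;
        if (b0 == 0xF0U) { lo = 0x90U; } /* reject overlong 4-byte forms */
        if (b0 == 0xF4U) { hi = 0x8FU; } /* reject values above U+10FFFF */
    } else {
        return 0U;                       /* stray continuation or F5..FF */
    }

    for (i = 1U; i < len; i++) {
        uint8_t b = (uint8_t)src[pos + i];
        if ((b < lo) || (b > hi)) {
            return 0U;                   /* also stops at the '\0' terminator */
        }
        lo = 0x80U;                      /* later bytes use the plain range */
        hi = 0xBFU;
    }
    return len;
}
```

Each continuation byte is read only after the previous one passed its range check, so a sequence truncated by the terminator is rejected without reading past it.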