various: Don't allow creation of invalid UTF8 strings or identifiers #17862

jepler wants to merge 5 commits into micropython:master from
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##           master   #17862   +/-  ##
=====================================
  Coverage   98.38%   98.39%
=====================================
  Files         171      171
  Lines       22296    22300       +4
=====================================
+ Hits        21937    21941       +4
  Misses        359      359
```
Code size report:
```c
#if MICROPY_PY_BUILTINS_STR_UNICODE && MICROPY_PY_BUILTINS_STR_UNICODE_CHECK
// Throws an exception if string content is not UTF-8
void utf8_require(const byte *p, size_t len);
```
I suggest to give this function the mp_ namespace prefix. We haven't always been consistent about this, but I think any new functions should have it.
good suggestion. done.
```
@@ -0,0 +1,5 @@
UnicodeError
```
Can we make the output match CPython so this test doesn't need a .exp file?
I see that CPython raises SyntaxError though, which is different to MicroPython here... not sure what the best way forward is.
Would it be better to make MicroPython raise SyntaxError, to match CPython?
I don't know if that really matters though, this is a pretty rare case, and saving code size (reusing the same UnicodeError) is probably more important.
OTOH, if you had some code like this:
```python
try:
    exec(some_code)
except SyntaxError:
    handle_syntax_error()
```

you might be surprised that the UnicodeError is raised and escapes the error handling.
```c
mp_parse_node_t pn;
mp_lexer_t *lex = parser->lexer;
if (lex->tok_kind == MP_TOKEN_NAME || lex->tok_kind == MP_TOKEN_STRING) {
    mp_utf8_require((byte *)lex->vstr.buf, lex->vstr.len);
```
This is going to impact performance for compilation of all code. I guess there's not really any way around that if we want to validate the utf8 properly.
Agreed on both counts. If the overhead is intolerable for some use case, it can be compile-time disabled just like the existing uses of mp_utf8_require (gated by MICROPY_PY_BUILTINS_STR_UNICODE_CHECK, on by default when unicode is on).
(It's been rattling around in my head to have a build flag for "remove all checks a good static checker would diagnose", e.g., code that passes mypy --strict, is checked for valid utf-8, etc. but I think there would still end up being a lot of judgement calls and also you can't quite type check micropython/circuitpython code because of lack of exactly matching pyi stubs…)
It's not necessarily a great comparison, but I made a file called "huge.py" from 16 copies of tests/perf_bench/bm_hexiom.py (total size 268928 bytes), then benchmarked compiling it with the old and new mpy-cross on my x86_64 Linux desktop system. Statistically, it was a wash.
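The same methodology can be sketched in plain CPython with `timeit` (illustrative only: it times the builtin `compile()`, not mpy-cross, and the generated source is a stand-in for the concatenated benchmark file):

```python
import timeit

# Build a large source buffer by repeating a small function many
# times, loosely mimicking the "16 copies of bm_hexiom.py" setup.
snippet = "def f{i}(x):\n    return x + {i}\n"
source = "".join(snippet.format(i=i) for i in range(2000))

# Time a handful of full compiles; for a change like this PR you
# would run the identical measurement against both builds.
t = timeit.timeit(lambda: compile(source, "<huge>", "exec"), number=3)
print("compiled", len(source), "bytes x3 in", round(t, 3), "s")
```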
I changed things a bit so SyntaxError can be raised in this case. I agree it seems preferable if the cost isn't high.
That takes e.g. RP2040 from -16 bytes to +16 bytes (a net 32-byte difference to throw the correct exception). Other ports see a similar increase.
.. from non UTF-8 inputs. In this case, MicroPython raises UnicodeError while CPython uses SyntaxError. By catching either exception, the test does not require an .exp file. Signed-off-by: Jeff Epler <jepler@gmail.com>
All sites immediately threw a UnicodeError, so roll that into the new function utf8_require. unicode.c was designed not to require runtime.h, so move the checking function into objstr.c. Reduce the number of #if sites by making a do-nothing variant that is used instead when !STR_UNICODE or !STR_UNICODE_CHECK. Signed-off-by: Jeff Epler <jepler@gmail.com>
.. even when compiling non UTF-8 files or byte strings. Closes: micropython#17855 Signed-off-by: Jeff Epler <jepler@gmail.com>
This catches for instance the cases I found in micropython#13084. It does not bring the behavior in line with standard Python, but it does throw errors in the case where a string object would be created with invalid UTF-8 content. Signed-off-by: Jeff Epler <jepler@gmail.com>
Signed-off-by: Jeff Epler <jepler@gmail.com>
Because it naturally built on the code re-org done here, I added a UTF8 validity check in a path that catches the problem cases of #13084, as well as a cpydiff example for the docs. It doesn't implement CPython compatibility for formatting unicode code points via %c or {:c}, but it does prevent invalid strings from being formed that way and escaping.
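For context, this is the CPython behavior being referenced: an integer code point is formatted by %c and {:c}, and out-of-range values are rejected rather than producing an invalid string (runnable under CPython; MicroPython's differing behavior is what the cpydiff entry documents):

```python
# CPython converts an in-range integer code point to the
# corresponding character with both formatting styles.
print("%c" % 0x1F600)          # U+1F600
print("{:c}".format(0x1F600))  # same character via str.format

# Values beyond the Unicode range (> 0x10FFFF) are rejected
# instead of silently creating an invalid string.
try:
    "%c" % 0x110000
except (OverflowError, ValueError) as e:
    print("rejected:", type(e).__name__)
```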
```c
static mp_obj_t mp_obj_new_str_type_from_vstr(const mp_obj_type_t *type, vstr_t *vstr) {
    // if not a bytes object, look if a qstr with this data already exists
    if (type == &mp_type_str) {
        mp_utf8_require((byte *)vstr->buf, vstr->len);
```
Does it make a measurable difference in performance if we move this after the if statement?
I'm not familiar with how to measure performance but it does seem like it could have a beneficial effect.
```c
        mp_raise_msg(&mp_type_UnicodeError, NULL);
    }
    #endif // MICROPY_PY_BUILTINS_STR_UNICODE && MICROPY_PY_BUILTINS_STR_UNICODE_CHECK
    mp_utf8_require((byte *)vstr->buf, vstr->len);
```
This is now a duplicate check.
```diff
 #if MICROPY_PY_BUILTINS_STR_UNICODE && MICROPY_PY_BUILTINS_STR_UNICODE_CHECK
 mp_obj_t mp_obj_new_str_from_utf8_vstr(vstr_t *vstr) {
-    // bypasses utf8_check.
+    // bypasses utf8_require.
```
This comment is no longer true.
Hm, in fact with this change it's no longer possible to bypass the check, so the entire existence of this function becomes moot.
Maybe I should back the string formatting stuff out again; it's not as clean a change as I thought. What do you think @dlech?
We could add a
The MP_IS_COMPRESSED_ROM_STRING macro in qstr.h only checks if the first byte of a string is 0xff (the compression marker). This caused user-allocated strings on the heap that happened to start with 0xff (a utf-8 continuation byte) to be incorrectly treated as a compressed ROM string. Modified decompress_error_text_maybe() to add heap pointer validation before attempting decompression. The fix checks if the pointer is in the GC heap; if it is, it cannot be a ROM compressed string and should not be decompressed. The validation uses the same logic as the VERIFY_PTR macro from gc.c. Alternative to: micropython#17862 Fixes: micropython#17855 Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
Summary
Fuzz testing found that it was possible to create invalid UTF-8 strings when the program input was not UTF-8. This could occur because a disk file was not UTF-8, or because a byte string passed to eval()/exec() was not UTF-8.
Besides leading to the problems that the introduction of utf8_check was intended to fix (#9044), the fuzzer found an actual crash when the first byte was \xff and the string was used as an exception argument (#17855). I also noticed that the check could be generalized a little to avoid constructing non-UTF-8 identifiers, which could also lead to problems.
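The failing inputs are easy to reproduce at the byte level; in CPython the decode step rejects them up front (illustrative sketch, not the fuzzer's actual corpus):

```python
# 0xff can never appear in well-formed UTF-8, a lone continuation
# byte (0x80-0xbf) is invalid, and a lead byte with missing
# continuations is a truncated sequence.
for bad in (b"\xff", b"\x80", b"abc\xf0"):
    try:
        bad.decode("utf-8")
        print(bad, "decoded")
    except UnicodeDecodeError:
        print(bad, "rejected")
```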
I re-organized the code to pay for the size cost of the new check in the lexer.
Testing
I added a new test, using eval() and exec() of byte strings, to ensure that these cases are caught by the lexer.
Trade-offs and Alternatives
Could check that the whole code buffer is UTF-8 instead.