
various: Don't allow creation of invalid UTF8 strings or identifiers #17862

Closed
jepler wants to merge 5 commits into micropython:master from jepler:issue17855

Conversation

@jepler
Contributor

@jepler jepler commented Aug 7, 2025

Summary

Fuzz testing found that it was possible to create invalid UTF-8 strings when the program input was not UTF-8. This could occur because a disk file was not UTF-8, or because a byte string passed to eval()/exec() was not UTF-8.

Besides leading to the problems that the introduction of utf8_check was intended to fix (#9044), the fuzzer found an actual crash when the first byte was \xff and the string was used as an exception argument (#17855).

I also noticed that the check could be generalized a little to avoid constructing non-UTF-8 identifiers, which could also lead to problems.

I re-organized the code to offset the size cost of the new check in the lexer.

Testing

I added a new test, using eval() and exec() of byte strings, to ensure that these cases are caught by the lexer.
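A minimal illustration of the inputs involved (a hypothetical snippet, not the PR's actual test file; on CPython the same input raises SyntaxError):

```python
# A byte string that is not valid UTF-8: \xff can never begin a UTF-8 sequence.
bad_source = b'x = "\xff"'

# On an unpatched MicroPython the lexer accepted these bytes and could build an
# invalid-UTF-8 str object; with this change (and on CPython) an exception is
# raised instead.
try:
    exec(bad_source)
    result = "accepted"
except (SyntaxError, UnicodeError):
    result = "rejected"
print(result)  # → rejected
```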

Trade-offs and Alternatives

The whole code buffer could be checked for UTF-8 validity up front instead, rather than checking token by token.

@codecov

codecov bot commented Aug 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.39%. Comparing base (9939565) to head (cee3bba).
⚠️ Report is 428 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #17862   +/-   ##
=======================================
  Coverage   98.38%   98.39%           
=======================================
  Files         171      171           
  Lines       22296    22300    +4     
=======================================
+ Hits        21937    21941    +4     
  Misses        359      359           


@github-actions

github-actions bot commented Aug 7, 2025

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:  +104 +0.012% standard
      stm32:   +32 +0.008% PYBV10
     mimxrt:   +40 +0.011% TEENSY40
        rp2:   +24 +0.003% RPI_PICO_W
       samd:   +32 +0.012% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:   +36 +0.008% VIRT_RV32

@jepler jepler changed the title parse: Don't allow creation of invalid UTF8 strings parse: Don't allow creation of invalid UTF8 strings or identifiers Aug 7, 2025
Comment thread py/objstr.h Outdated

#if MICROPY_PY_BUILTINS_STR_UNICODE && MICROPY_PY_BUILTINS_STR_UNICODE_CHECK
// Throws an exception if string content is not UTF-8
void utf8_require(const byte *p, size_t len);
Contributor


I suggest to give this function the mp_ namespace prefix. We haven't always been consistent about this, but I think any new functions should have it.

Contributor Author


good suggestion. done.
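For context, the shape of the check behind this function can be modeled in a few lines. This is an illustrative Python sketch of the C logic, not the actual py/unicode.c code, and it is deliberately lenient about overlong encodings:

```python
def utf8_check(data: bytes) -> bool:
    # Validate UTF-8 structure: each lead byte must be followed by the right
    # number of continuation bytes (0x80-0xBF), and no byte may be stray.
    need = 0  # continuation bytes still expected
    for c in data:
        if need:
            if (c & 0xC0) != 0x80:
                return False  # expected a continuation byte, got something else
            need -= 1
        elif c >= 0xF0:
            need = 3  # 4-byte sequence
        elif c >= 0xE0:
            need = 2  # 3-byte sequence
        elif c >= 0xC0:
            need = 1  # 2-byte sequence
        elif c >= 0x80:
            return False  # stray continuation byte
    return need == 0  # a sequence must not be truncated at end of input

def utf8_require(data: bytes) -> None:
    # The "require" wrapper this PR adds: check, and raise on failure.
    if not utf8_check(data):
        raise UnicodeError("invalid utf-8")
```

Note that a lone `\xff` fails because it is treated as a lead byte whose continuation bytes never arrive.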

@dpgeorge dpgeorge added py-core Relates to py/ directory in source unicode Bugs and enhancements related to Unicode/UTF-8 support. labels Aug 8, 2025
Comment thread tests/unicode/unicode_parser.py.exp Outdated
@@ -0,0 +1,5 @@
UnicodeError
Member


Can we make the output match CPython so this test doesn't need a .exp file?

I see that CPython raises SyntaxError though, which is different to MicroPython here... not sure what the best way forward is.

Contributor Author


Updated.

Member


Would it be better to make MicroPython raise SyntaxError, to match CPython?

I don't know if that really matters though, this is a pretty rare case, and saving code size (reusing the same UnicodeError) is probably more important.

OTOH, if you had some code like this:

try:
    exec(some_code)
except SyntaxError:
    handle_syntax_error()

you might be surprised that the UnicodeError is raised and escapes the error handling.
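The concern can be shown concretely (hypothetical handler; on CPython both non-UTF-8 byte input and ordinary syntax errors land in the SyntaxError branch, which is the compatibility argument here):

```python
def run_snippet(source):
    # A handler written against CPython semantics: it only expects SyntaxError.
    try:
        exec(source)
        return "ok"
    except SyntaxError:
        return "handled"

print(run_snippet("x = 1"))        # → ok
print(run_snippet("1 +"))          # → handled (ordinary syntax error)
print(run_snippet(b'x = "\xff"'))  # → handled on CPython; pre-change
                                   #   MicroPython raised UnicodeError,
                                   #   which would escape this except clause
```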

Comment thread py/parse.c Outdated
mp_parse_node_t pn;
mp_lexer_t *lex = parser->lexer;
if (lex->tok_kind == MP_TOKEN_NAME || lex->tok_kind == MP_TOKEN_STRING) {
mp_utf8_require((byte *)lex->vstr.buf, lex->vstr.len);
Member


This is going to impact performance for compilation of all code. I guess there's not really any way around that if we want to validate the utf8 properly.

Contributor Author


Agreed on both counts. If the overhead is intolerable for some use case, it can be compile-time disabled just like the existing uses of mp_utf8_require (gated by MICROPY_PY_BUILTINS_STR_UNICODE_CHECK, on by default when unicode is on).

(It's been rattling around in my head to have a build flag for "remove all checks a good static checker would diagnose", e.g., code that passes mypy --strict, is checked for valid utf-8, etc. but I think there would still end up being a lot of judgement calls and also you can't quite type check micropython/circuitpython code because of lack of exactly matching pyi stubs…)

@jepler
Contributor Author

jepler commented Aug 15, 2025

It's not necessarily a great comparison but I made a file called "huge.py" from 16 copies of tests/perf_bench/bm_hexiom.py (total size 268928 bytes) then benchmarked compiling it with the old and new mpy-cross on my x86_64 linux desktop system. Statistically, it was a wash.

Benchmark 1: ./mpy-cross-pr17862 huge.py
  Time (mean ± σ):      66.1 ms ±   0.3 ms    [User: 58.6 ms, System: 7.4 ms]
  Range (min … max):    65.3 ms …  67.0 ms    100 runs
 
Benchmark 2: ./mpy-cross-v1.27.0-preview-42-gb7cfafc1ee huge.py
  Time (mean ± σ):      66.3 ms ±   0.3 ms    [User: 59.6 ms, System: 6.6 ms]
  Range (min … max):    65.8 ms …  67.7 ms    100 runs
 
Summary
  ./mpy-cross-pr17862 huge.py ran
    1.00 ± 0.01 times faster than ./mpy-cross-v1.27.0-preview-42-gb7cfafc1ee huge.py

Linux perf stat counted about 4 million more instructions (+0.59%), but another way to look at it is roughly 15 added instructions per byte of source code (the baseline is around 2554 instructions per byte).

 Performance counter stats for './mpy-cross-pr17862 huge.py' (30 runs):

             66.82 msec task-clock                       #    0.990 CPUs utilized               ( +-  0.09% )
                 0      context-switches                 #    0.000 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
               329      page-faults                      #    4.924 K/sec                       ( +-  0.09% )
       282,271,960      cycles                           #    4.224 GHz                         ( +-  0.05% )  (57.31%)
        41,553,651      stalled-cycles-frontend          #   14.72% frontend cycles idle        ( +-  0.33% )  (63.35%)
       690,916,317      instructions                     #    2.45  insn per cycle            
                                                  #    0.06  stalled cycles per insn     ( +-  0.09% )  (63.77%)
       193,477,661      branches                         #    2.896 G/sec                       ( +-  0.08% )  (60.80%)
         1,477,191      branch-misses                    #    0.76% of all branches             ( +-  1.10% )  (54.76%)

         0.0674732 +- 0.0000710 seconds time elapsed  ( +-  0.11% )


 Performance counter stats for './mpy-cross-v1.27.0-preview-42-gb7cfafc1ee huge.py' (30 runs):

             65.99 msec task-clock                       #    0.991 CPUs utilized               ( +-  0.05% )
                 0      context-switches                 #    0.000 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
               328      page-faults                      #    4.971 K/sec                       ( +-  0.07% )
       283,490,040      cycles                           #    4.296 GHz                         ( +-  0.05% )  (55.98%)
        42,627,980      stalled-cycles-frontend          #   15.04% frontend cycles idle        ( +-  0.37% )  (62.06%)
       686,832,657      instructions                     #    2.42  insn per cycle            
                                                  #    0.06  stalled cycles per insn     ( +-  0.13% )  (63.52%)
       194,634,958      branches                         #    2.950 G/sec                       ( +-  0.17% )  (62.25%)
         1,374,874      branch-misses                    #    0.71% of all branches             ( +-  1.75% )  (56.18%)

         0.0665979 +- 0.0000371 seconds time elapsed  ( +-  0.06% )
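The per-byte figures above can be cross-checked from the two instruction counts (arithmetic only, using the numbers quoted in this comment):

```python
src_bytes = 268928       # size of huge.py
insns_new = 690_916_317  # ./mpy-cross-pr17862
insns_old = 686_832_657  # ./mpy-cross-v1.27.0-preview-42-gb7cfafc1ee

delta = insns_new - insns_old
print(delta)                              # → 4083660 extra instructions
print(round(100 * delta / insns_old, 2))  # → 0.59 (percent increase)
print(round(delta / src_bytes, 1))        # → 15.2 (added instructions/byte)
print(round(insns_old / src_bytes))       # → 2554 (baseline instructions/byte)
```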

@jepler
Contributor Author

jepler commented Aug 16, 2025

I changed things a bit so SyntaxError can be raised in this case. I agree it seems preferable if the cost isn't high.

@jepler
Contributor Author

jepler commented Aug 16, 2025

That takes e.g. RP2040 from -16 bytes to +16 bytes (a net difference of 32 bytes to throw the correct exception). Other ports see a similar increase.

.. from non UTF-8 inputs. In this case, MicroPython raises
UnicodeError while CPython uses SyntaxError. By catching either
exception, the test does not require an .exp file.

Signed-off-by: Jeff Epler <jepler@gmail.com>
All sites immediately threw a UnicodeError, so roll that into
the new function utf8_require.

unicode.c was designed not to require runtime.h, so move the
checking function into objstr.c.

Reduce the number of #if sites by making a do-nothing variant
that is used instead when !STR_UNICODE or !STR_UNICODE_CHECK.

Signed-off-by: Jeff Epler <jepler@gmail.com>
.. even when compiling non UTF-8 files or byte strings.

Closes: micropython#17855
Signed-off-by: Jeff Epler <jepler@gmail.com>
This catches for instance the cases I found in micropython#13084.
It does not bring the behavior in line with standard Python,
but it does throw errors in the case where a string object
would be created with invalid UTF-8 content.

Signed-off-by: Jeff Epler <jepler@gmail.com>
Signed-off-by: Jeff Epler <jepler@gmail.com>
@jepler
Contributor Author

jepler commented Sep 4, 2025

Because it naturally built on the code re-org done here, I added a UTF-8 validity check in a path that catches the problem cases of #13084, as well as a cpydiff example for the docs. It doesn't implement CPython compatibility for formatting unicode code points via %c or {:c}, but it does prevent invalid strings from being formed that way and escaping.

Comment thread py/objstr.c
static mp_obj_t mp_obj_new_str_type_from_vstr(const mp_obj_type_t *type, vstr_t *vstr) {
// if not a bytes object, look if a qstr with this data already exists
if (type == &mp_type_str) {
mp_utf8_require((byte *)vstr->buf, vstr->len);
Contributor


Does it make a measurable difference in performance if we move this after the if statement?

Contributor Author


I'm not familiar with how to measure performance but it does seem like it could have a beneficial effect.

Comment thread py/objstr.c
mp_raise_msg(&mp_type_UnicodeError, NULL);
}
#endif // MICROPY_PY_BUILTINS_STR_UNICODE && MICROPY_PY_BUILTINS_STR_UNICODE_CHECK
mp_utf8_require((byte *)vstr->buf, vstr->len);
Contributor


This is now a duplicate check.

Comment thread py/objstr.c
#if MICROPY_PY_BUILTINS_STR_UNICODE && MICROPY_PY_BUILTINS_STR_UNICODE_CHECK
mp_obj_t mp_obj_new_str_from_utf8_vstr(vstr_t *vstr) {
// bypasses utf8_check.
// bypasses utf8_require.
Contributor


This comment is no longer true.

Contributor Author


hm in fact with this change it's no longer possible to bypass the check, so the entire existence of this function becomes moot.

@jepler jepler changed the title parse: Don't allow creation of invalid UTF8 strings or identifiers various: Don't allow creation of invalid UTF8 strings or identifiers Sep 4, 2025
@jepler
Contributor Author

jepler commented Sep 4, 2025

maybe I should back the string formatting stuff out again, it's not as clean a change as I thought. What do you think @dlech ?

@dlech
Contributor

dlech commented Sep 4, 2025

We could add a bool parameter to mp_obj_new_str_type_from_vstr() to optionally enable the check, but of course that comes as some small cost in code size and a few extra instructions executed.

@jepler
Contributor Author

jepler commented Dec 26, 2025

Closing this up, as it stands this is not a cleanup, it's just making things uglier. #17855 and (new) #18609 track the two issues of creating non UTF-8 string literals & identifiers.

@jepler jepler closed this Dec 26, 2025
Josverl added a commit to Josverl/micropython that referenced this pull request Mar 20, 2026
The MP_IS_COMPRESSED_ROM_STRING macro in qstr.h only checks
if the first byte of a string is 0xff (the compression marker).
This caused user-allocated strings on the heap that happened
to start with 0xff (an invalid UTF-8 byte) to be incorrectly
treated as compressed ROM strings.

Modified decompress_error_text_maybe() to add heap pointer validation
before attempting decompression. The fix checks if the pointer is in
the GC heap - if it is, it cannot be a ROM compressed string and
should not be decompressed.
The validation uses the same logic as the VERIFY_PTR macro from gc.c

Alternative to: micropython#17862

Fixes: micropython#17855
Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
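The guard described in this commit can be modeled as follows (hypothetical addresses and function names; the real fix is C code reusing the VERIFY_PTR logic from py/gc.c):

```python
HEAP_START = 0x2000_0000  # assumed GC heap bounds, for illustration only
HEAP_END   = 0x2004_0000

def in_gc_heap(addr):
    # A pointer inside the GC heap range refers to a runtime allocation.
    return HEAP_START <= addr < HEAP_END

def is_compressed_rom_string(addr, first_byte):
    # The old macro looked only at the 0xff marker byte; the fix also
    # rejects pointers into the GC heap, which can never be ROM strings.
    return first_byte == 0xFF and not in_gc_heap(addr)

# A heap-allocated string starting with 0xff is no longer misclassified:
print(is_compressed_rom_string(0x2000_1000, 0xFF))  # → False (heap pointer)
print(is_compressed_rom_string(0x0800_4242, 0xFF))  # → True (ROM marker byte)
```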
Josverl added a commit to Josverl/micropython that referenced this pull request Apr 2, 2026 (same commit message as above)
Josverl added a commit to Josverl/micropython that referenced this pull request Apr 2, 2026 (same commit message as above)