
various: Don't allow creation of invalid UTF8 strings or identifiers #17862

Closed
jepler wants to merge 5 commits into micropython:master from jepler:issue17855

Conversation

@jepler
Contributor

@jepler jepler commented Aug 7, 2025

Summary

Fuzz testing found that it was possible to create invalid UTF-8 strings when the program input was not UTF-8. This could occur because a disk file was not UTF-8, or because a byte string passed to eval()/exec() was not UTF-8.

Besides leading to the problems that the introduction of utf8_check was intended to fix (#9044), the fuzzer found an actual crash when the first byte was \xff and the string was used as an exception argument (#17855).

I also noticed that the check could be generalized a little to avoid constructing non-UTF-8 identifiers, which could also lead to problems.

I re-organized the code to offset the size cost of the new check in the lexer.

Testing

I added a new test, using eval() and exec() of byte strings, to ensure that these cases are caught by the lexer.
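A minimal illustration of the inputs involved (a hypothetical snippet, not the PR's actual test file; on CPython the same input raises SyntaxError):

```python
# A byte string that is not valid UTF-8: \xff can never begin a UTF-8 sequence.
bad_source = b'x = "\xff"'

# On an unpatched MicroPython the lexer accepted these bytes and could build an
# invalid-UTF-8 str object; with this change (and on CPython) an exception is
# raised instead.
try:
    exec(bad_source)
    result = "accepted"
except (SyntaxError, UnicodeError):
    result = "rejected"
print(result)  # → rejected
```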

Trade-offs and Alternatives

The whole code buffer could be checked for UTF-8 validity up front instead, rather than checking token by token.

@codecov

codecov bot commented Aug 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.39%. Comparing base (9939565) to head (cee3bba).
⚠️ Report is 428 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #17862   +/-   ##
=======================================
  Coverage   98.38%   98.39%           
=======================================
  Files         171      171           
  Lines       22296    22300    +4     
=======================================
+ Hits        21937    21941    +4     
  Misses        359      359           


@github-actions

github-actions bot commented Aug 7, 2025

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:  +104 +0.012% standard
      stm32:   +32 +0.008% PYBV10
     mimxrt:   +40 +0.011% TEENSY40
        rp2:   +24 +0.003% RPI_PICO_W
       samd:   +32 +0.012% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:   +36 +0.008% VIRT_RV32

@jepler jepler changed the title parse: Don't allow creation of invalid UTF8 strings parse: Don't allow creation of invalid UTF8 strings or identifiers Aug 7, 2025
Comment thread py/objstr.h Outdated

#if MICROPY_PY_BUILTINS_STR_UNICODE && MICROPY_PY_BUILTINS_STR_UNICODE_CHECK
// Throws an exception if string content is not UTF-8
void utf8_require(const byte *p, size_t len);
Contributor


I suggest to give this function the mp_ namespace prefix. We haven't always been consistent about this, but I think any new functions should have it.

Contributor Author


good suggestion. done.
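For context, the shape of the check behind this function can be modeled in a few lines. This is an illustrative Python sketch of the C logic, not the actual py/unicode.c code, and it is deliberately lenient about overlong encodings:

```python
def utf8_check(data: bytes) -> bool:
    # Validate UTF-8 structure: each lead byte must be followed by the right
    # number of continuation bytes (0x80-0xBF), and no byte may be stray.
    need = 0  # continuation bytes still expected
    for c in data:
        if need:
            if (c & 0xC0) != 0x80:
                return False  # expected a continuation byte, got something else
            need -= 1
        elif c >= 0xF0:
            need = 3  # 4-byte sequence
        elif c >= 0xE0:
            need = 2  # 3-byte sequence
        elif c >= 0xC0:
            need = 1  # 2-byte sequence
        elif c >= 0x80:
            return False  # stray continuation byte
    return need == 0  # a sequence must not be truncated at end of input

def utf8_require(data: bytes) -> None:
    # The "require" wrapper this PR adds: check, and raise on failure.
    if not utf8_check(data):
        raise UnicodeError("invalid utf-8")
```

Note that a lone `\xff` fails because it is treated as a lead byte whose continuation bytes never arrive.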

@dpgeorge dpgeorge added py-core Relates to py/ directory in source unicode Bugs and enhancements related to Unicode/UTF-8 support. labels Aug 8, 2025
Comment thread tests/unicode/unicode_parser.py.exp Outdated
@@ -0,0 +1,5 @@
UnicodeError
Member


Can we make the output match CPython so this test doesn't need a .exp file?

I see that CPython raises SyntaxError though, which is different to MicroPython here... not sure what the best way forward is.

Contributor Author


Updated.

Member


Would it be better to make MicroPython raise SyntaxError, to match CPython?

I don't know if that really matters though, this is a pretty rare case, and saving code size (reusing the same UnicodeError) is probably more important.

OTOH, if you had some code like this:

try:
    exec(some_code)
except SyntaxError:
    handle_syntax_error()

you might be surprised that the UnicodeError is raised and escapes the error handling.
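The concern can be shown concretely (hypothetical handler; on CPython both non-UTF-8 byte input and ordinary syntax errors land in the SyntaxError branch, which is the compatibility argument here):

```python
def run_snippet(source):
    # A handler written against CPython semantics: it only expects SyntaxError.
    try:
        exec(source)
        return "ok"
    except SyntaxError:
        return "handled"

print(run_snippet("x = 1"))        # → ok
print(run_snippet("1 +"))          # → handled (ordinary syntax error)
print(run_snippet(b'x = "\xff"'))  # → handled on CPython; pre-change
                                   #   MicroPython raised UnicodeError,
                                   #   which would escape this except clause
```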

Comment thread py/parse.c Outdated
mp_parse_node_t pn;
mp_lexer_t *lex = parser->lexer;
if (lex->tok_kind == MP_TOKEN_NAME || lex->tok_kind == MP_TOKEN_STRING) {
mp_utf8_require((byte *)lex->vstr.buf, lex->vstr.len);
Member


This is going to impact performance for compilation of all code. I guess there's not really any way around that if we want to validate the utf8 properly.

Contributor Author


Agreed on both counts. If the overhead is intolerable for some use case, it can be compile-time disabled just like the existing uses of mp_utf8_require (gated by MICROPY_PY_BUILTINS_STR_UNICODE_CHECK, on by default when unicode is on).

(It's been rattling around in my head to have a build flag for "remove all checks a good static checker would diagnose", e.g., code that passes mypy --strict, is checked for valid utf-8, etc. but I think there would still end up being a lot of judgement calls and also you can't quite type check micropython/circuitpython code because of lack of exactly matching pyi stubs…)

@jepler
Contributor Author

jepler commented Aug 15, 2025

It's not necessarily a great comparison but I made a file called "huge.py" from 16 copies of tests/perf_bench/bm_hexiom.py (total size 268928 bytes) then benchmarked compiling it with the old and new mpy-cross on my x86_64 linux desktop system. Statistically, it was a wash.

Benchmark 1: ./mpy-cross-pr17862 huge.py
  Time (mean ± σ):      66.1 ms ±   0.3 ms    [User: 58.6 ms, System: 7.4 ms]
  Range (min … max):    65.3 ms …  67.0 ms    100 runs
 
Benchmark 2: ./mpy-cross-v1.27.0-preview-42-gb7cfafc1ee huge.py
  Time (mean ± σ):      66.3 ms ±   0.3 ms    [User: 59.6 ms, System: 6.6 ms]
  Range (min … max):    65.8 ms …  67.7 ms    100 runs
 
Summary
  ./mpy-cross-pr17862 huge.py ran
    1.00 ± 0.01 times faster than ./mpy-cross-v1.27.0-preview-42-gb7cfafc1ee huge.py

Linux perf stat counted about 4 million more instructions (+0.59%), but another way to look at it is roughly 15 added instructions per byte of source code (the baseline is around 2554 instructions per byte).

 Performance counter stats for './mpy-cross-pr17862 huge.py' (30 runs):

             66.82 msec task-clock                       #    0.990 CPUs utilized               ( +-  0.09% )
                 0      context-switches                 #    0.000 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
               329      page-faults                      #    4.924 K/sec                       ( +-  0.09% )
       282,271,960      cycles                           #    4.224 GHz                         ( +-  0.05% )  (57.31%)
        41,553,651      stalled-cycles-frontend          #   14.72% frontend cycles idle        ( +-  0.33% )  (63.35%)
       690,916,317      instructions                     #    2.45  insn per cycle            
                                                  #    0.06  stalled cycles per insn     ( +-  0.09% )  (63.77%)
       193,477,661      branches                         #    2.896 G/sec                       ( +-  0.08% )  (60.80%)
         1,477,191      branch-misses                    #    0.76% of all branches             ( +-  1.10% )  (54.76%)

         0.0674732 +- 0.0000710 seconds time elapsed  ( +-  0.11% )


 Performance counter stats for './mpy-cross-v1.27.0-preview-42-gb7cfafc1ee huge.py' (30 runs):

             65.99 msec task-clock                       #    0.991 CPUs utilized               ( +-  0.05% )
                 0      context-switches                 #    0.000 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
               328      page-faults                      #    4.971 K/sec                       ( +-  0.07% )
       283,490,040      cycles                           #    4.296 GHz                         ( +-  0.05% )  (55.98%)
        42,627,980      stalled-cycles-frontend          #   15.04% frontend cycles idle        ( +-  0.37% )  (62.06%)
       686,832,657      instructions                     #    2.42  insn per cycle            
                                                  #    0.06  stalled cycles per insn     ( +-  0.13% )  (63.52%)
       194,634,958      branches                         #    2.950 G/sec                       ( +-  0.17% )  (62.25%)
         1,374,874      branch-misses                    #    0.71% of all branches             ( +-  1.75% )  (56.18%)

         0.0665979 +- 0.0000371 seconds time elapsed  ( +-  0.06% )
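The per-byte figures above can be cross-checked from the two instruction counts (arithmetic only, using the numbers quoted in this comment):

```python
src_bytes = 268928       # size of huge.py
insns_new = 690_916_317  # ./mpy-cross-pr17862
insns_old = 686_832_657  # ./mpy-cross-v1.27.0-preview-42-gb7cfafc1ee

delta = insns_new - insns_old
print(delta)                              # → 4083660 extra instructions
print(round(100 * delta / insns_old, 2))  # → 0.59 (percent increase)
print(round(delta / src_bytes, 1))        # → 15.2 (added instructions/byte)
print(round(insns_old / src_bytes))       # → 2554 (baseline instructions/byte)
```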

@jepler
Contributor Author

jepler commented Aug 16, 2025

I changed things a bit so SyntaxError can be raised in this case. I agree it seems preferable if the cost isn't high.

@jepler
Contributor Author

jepler commented Aug 16, 2025

That takes e.g. RP2040 from -16 bytes to +16 bytes (a net difference of 32 bytes to throw the correct exception). Other ports see a similar increase.

.. from non UTF-8 inputs. In this case, MicroPython raises
UnicodeError while CPython uses SyntaxError. By catching either
exception, the test does not require an .exp file.

Signed-off-by: Jeff Epler <jepler@gmail.com>
All sites immediately threw a UnicodeError, so roll that into
the new function utf8_require.

unicode.c was designed not to require runtime.h, so move the
checking function into objstr.c.

Reduce the number of #if sites by making a do-nothing variant
that is used instead when !STR_UNICODE or !STR_UNICODE_CHECK.

Signed-off-by: Jeff Epler <jepler@gmail.com>
.. even when compiling non UTF-8 files or byte strings.

Closes: micropython#17855
Signed-off-by: Jeff Epler <jepler@gmail.com>
This catches for instance the cases I found in micropython#13084.
It does not bring the behavior in line with standard Python,
but it does throw errors in the case where a string object
would be created with invalid UTF-8 content.

Signed-off-by: Jeff Epler <jepler@gmail.com>
Signed-off-by: Jeff Epler <jepler@gmail.com>
@jepler
Contributor Author

jepler commented Sep 4, 2025

Because it naturally built on the code re-org done here, I added a UTF-8 validity check in a path that catches the problem cases of #13084, as well as a cpydiff example for the docs. It doesn't implement CPython compatibility for formatting unicode code points via %c or {:c}, but it does prevent invalid strings from being formed that way and escaping.

Comment thread py/objstr.c
static mp_obj_t mp_obj_new_str_type_from_vstr(const mp_obj_type_t *type, vstr_t *vstr) {
// if not a bytes object, look if a qstr with this data already exists
if (type == &mp_type_str) {
mp_utf8_require((byte *)vstr->buf, vstr->len);
Contributor


Does it make a measurable difference in performance if we move this after the if statement?

Contributor Author


I'm not familiar with how to measure performance but it does seem like it could have a beneficial effect.

Comment thread py/objstr.c
mp_raise_msg(&mp_type_UnicodeError, NULL);
}
#endif // MICROPY_PY_BUILTINS_STR_UNICODE && MICROPY_PY_BUILTINS_STR_UNICODE_CHECK
mp_utf8_require((byte *)vstr->buf, vstr->len);
Contributor


This is now a duplicate check.

Comment thread py/objstr.c
#if MICROPY_PY_BUILTINS_STR_UNICODE && MICROPY_PY_BUILTINS_STR_UNICODE_CHECK
mp_obj_t mp_obj_new_str_from_utf8_vstr(vstr_t *vstr) {
// bypasses utf8_check.
// bypasses utf8_require.
Contributor


This comment is no longer true.

Contributor Author


hm in fact with this change it's no longer possible to bypass the check, so the entire existence of this function becomes moot.

@jepler jepler changed the title parse: Don't allow creation of invalid UTF8 strings or identifiers various: Don't allow creation of invalid UTF8 strings or identifiers Sep 4, 2025
@jepler
Contributor Author

jepler commented Sep 4, 2025

maybe I should back the string formatting stuff out again, it's not as clean a change as I thought. What do you think @dlech ?

@dlech
Contributor

dlech commented Sep 4, 2025

We could add a bool parameter to mp_obj_new_str_type_from_vstr() to optionally enable the check, but of course that comes as some small cost in code size and a few extra instructions executed.

@jepler
Contributor Author

jepler commented Dec 26, 2025

Closing this up, as it stands this is not a cleanup, it's just making things uglier. #17855 and (new) #18609 track the two issues of creating non UTF-8 string literals & identifiers.

@jepler jepler closed this Dec 26, 2025
Josverl added a commit to Josverl/micropython that referenced this pull request Mar 20, 2026
The MP_IS_COMPRESSED_ROM_STRING macro in qstr.h only checks
if the first byte of a string is 0xff (the compression marker).
This caused user-allocated strings on the heap that happened
to start with 0xff (an invalid UTF-8 byte) to be incorrectly
treated as compressed ROM strings.

Modified decompress_error_text_maybe() to add heap pointer validation
before attempting decompression. The fix checks if the pointer is in
the GC heap - if it is, it cannot be a ROM compressed string and
should not be decompressed.
The validation uses the same logic as the VERIFY_PTR macro from gc.c

Alternative to: micropython#17862

Fixes: micropython#17855
Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
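The guard described in this commit can be modeled as follows (hypothetical addresses and function names; the real fix is C code reusing the VERIFY_PTR logic from py/gc.c):

```python
HEAP_START = 0x2000_0000  # assumed GC heap bounds, for illustration only
HEAP_END   = 0x2004_0000

def in_gc_heap(addr):
    # A pointer inside the GC heap range refers to a runtime allocation.
    return HEAP_START <= addr < HEAP_END

def is_compressed_rom_string(addr, first_byte):
    # The old macro looked only at the 0xff marker byte; the fix also
    # rejects pointers into the GC heap, which can never be ROM strings.
    return first_byte == 0xFF and not in_gc_heap(addr)

# A heap-allocated string starting with 0xff is no longer misclassified:
print(is_compressed_rom_string(0x2000_1000, 0xFF))  # → False (heap pointer)
print(is_compressed_rom_string(0x0800_4242, 0xFF))  # → True (ROM marker byte)
```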
Josverl added a commit to Josverl/micropython that referenced this pull request Apr 2, 2026 (same commit message as above)
Josverl added a commit to Josverl/micropython that referenced this pull request Apr 2, 2026 (same commit message as above)