str() will convert invalid utf8 from bytes object #4310

@ddiminnie

I've reproduced this behavior on the Windows, pyboard (via the JavaScript emulator), and CircuitPython 'atmel-samd' ports. The output of the Windows port is shown below.
If a bytes object contains values that are not valid UTF-8 (more specifically, invalid continuation bytes), CPython raises an appropriate exception:

Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> b = b"\xf0\xe0\xed\xe8"
>>> s = str(b, "utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: invalid continuation byte

However, MicroPython (at least, the MicroPython ports mentioned above) will happily perform the conversion, with 'interesting' results:

MicroPython v1.9.4 on 2018-11-19; win32 version
Use Ctrl-D to exit, Ctrl-E for paste mode
>>> b = b"\xf0\xe0\xed\xe8"
>>> s = str(b, "utf8")
>>> len(s)
4
>>> s[0]
'\x00\x00\r\x08'
>>> s[1]
'\x00\r\x08'
>>> s[2]
'\r\x08\x00'
>>> s[3]
'\x08\x00\x16'
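
One possible reading of the lengths above (an assumption, not something confirmed in this report) is that each indexed character's width is taken from the high bits of its lead byte without checking that enough continuation bytes follow. The sketch below only models that hypothesis in plain Python; `implied_length` is a hypothetical helper, not a MicroPython function.

```python
def implied_length(lead_byte):
    """Sequence length implied by the high bits of a UTF-8 lead byte."""
    if lead_byte < 0x80:
        return 1  # ASCII: single byte
    length = 0
    mask = 0x80
    # Count the leading 1 bits: 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4
    while lead_byte & mask:
        length += 1
        mask >>= 1
    return length


data = b"\xf0\xe0\xed\xe8"

# Hypothesis: none of these bytes is a continuation byte (10xxxxxx),
# so each one is treated as the start of its own character.
lengths = [implied_length(b) for b in data]

# Number of bytes each "character" would extend past the end of the buffer.
overruns = [max(0, i + n - len(data)) for i, n in enumerate(lengths)]

for i, (n, over) in enumerate(zip(lengths, overruns)):
    print(i, hex(data[i]), "implied length:", n, "bytes past end:", over)
```

Under this model only the last two indices run past the end of the four-byte buffer, which would be consistent with the observation about 's[2]' and 's[3]'.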

What's somewhat disturbing is that the values returned for 's[2]' and 's[3]' in the example above appear to contain the contents of memory outside of the original bytes object (vague flashbacks to the now-infamous 'Heartbleed' bug come to mind). At any rate, this (and similar) example(s) almost certainly ought to trigger an appropriate exception...
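
Until the conversion itself rejects such input, a caller could validate the bytes before decoding. The following is a minimal sketch of that idea, not a proposed fix for MicroPython itself; `is_valid_utf8` and `safe_str` are hypothetical names introduced here, and `ValueError` is used only because it should be available on every port.

```python
def is_valid_utf8(data):
    """Return True if data is a well-formed UTF-8 byte sequence."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:            # ASCII
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:   # 2-byte sequence
            need, min_cp, cp = 1, 0x80, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF: # 3-byte sequence
            need, min_cp, cp = 2, 0x800, lead & 0x0F
        elif 0xF0 <= lead <= 0xF4: # 4-byte sequence
            need, min_cp, cp = 3, 0x10000, lead & 0x07
        else:                      # stray continuation byte or invalid lead
            return False
        if i + need >= n:          # truncated sequence
            return False
        for j in range(1, need + 1):
            b = data[i + j]
            if b & 0xC0 != 0x80:   # not a continuation byte (10xxxxxx)
                return False
            cp = (cp << 6) | (b & 0x3F)
        # Reject overlong encodings, surrogates, and out-of-range values.
        if cp < min_cp or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            return False
        i += need + 1
    return True


def safe_str(data):
    """Decode data as UTF-8, raising ValueError on malformed input."""
    if not is_valid_utf8(data):
        raise ValueError("invalid UTF-8 in bytes object")
    return str(data, "utf8")
```

With this guard, `safe_str(b"\xf0\xe0\xed\xe8")` raises instead of returning a string whose characters extend past the buffer.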
