Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid UTF-8 Strings when compiling with MRB_UTF8_STRING compute length incorrectly #5269

lopopolo opened this issue Jan 8, 2021 · 2 comments


Copy link

lopopolo commented Jan 8, 2021

When compiling mruby with UTF-8 Strings (e.g. setting CFLAGS="-DMRB_UTF8_STRING"), mruby incorrectly computes the length of strings with invalid UTF-8 byte sequences.


$ git rev-parse HEAD
$ git show
commit 69482dbc8e590ed66f0944e9b48c4f9c2f83c873 (HEAD -> master, origin/master, origin/HEAD)
Merge: 6587269a f7ff4810
Author: Yukihiro "Matz" Matsumoto <>
Date:   Fri Jan 8 23:10:49 2021 +0900

    Merge pull request #5265 from shuujii/reapply-116e128b-because-it-is-back-at-456878ba

    Reapply 116e128b because it is back at 456878ba

Reproduction steps

rake clean

Executing in mirb:

$ ./bin/mirb
mirb - Embeddable Interactive Ruby Shell

> xs = [192, 128].pack("C*")
 => "��"
> xs.bytes
 => [192, 128]
> xs.length
 => 1

Reference MRI execution

$ irb
[2.6.6] > xs = [192, 128].pack("C*")
=> "\xC0\x80"
[2.6.6] > xs.bytes
=> [192, 128]
[2.6.6] > xs.length
=> 2

With forced UTF-8 encoding:

$ irb
[2.6.6] > xs = [192, 128].pack("C*")
=> "\xC0\x80"
[2.6.6] > xs = xs.force_encoding(Encoding::UTF_8)
=> "\xC0\x80"
[2.6.6] > xs.encoding
=> #<Encoding:UTF-8>
[2.6.6] > xs.length
=> 2
lopopolo added a commit to artichoke/artichoke that referenced this issue Jan 8, 2021
- Fix quickcheck test setup for owned and borrowed bytes tests. An
  upstream mruby bug (mruby/mruby#5269) prevents the length of invalid
  UTF-8 strings from being correctly calculated. The quickcheck harness
  generates a wider range of inputs which requires converting the tests
  to use `bytes`, `byteslice`, and `bytesize`.
- Add an additional `convert_with_trailing_nul` test to the `bytes`
  converter module.
- Fix quickcheck test setup for float tests. The quickcheck harness
  generates a wider range of inputs which revealed issues with the test
  when comparing NaN, infinities, and, some pairs which caused a
  subtraction overflow.
- Fix a test in `spinoso-securerandom` to no longer use the `chars()`
  iterator because `alphanumeric` returns a `Vec<u8>` now.

This commit also renames the cargo features in `spinoso-random` from
`rand_core` to `rand-traits` and from `rand` to `random-rand`. These
changes more closely align `spinoso-random` with its related `rand_mt`
crate. These changes remove the package name overrides for `rand` and
`rand_core` in `Cargo.toml`. Doc comments and doc tests have been
updated for these changes.
Copy link

matz commented Jan 9, 2021

The length of broken UTF-8 strings is undefined, I think. You may get whatever length.
CRuby does validate if the string is UTF-8 compliant. mruby just assumes the string is valid UTF-8.
I will investigate if we can validate without hindering the performance.

@matz matz closed this as completed in f81591c Jan 9, 2021
Copy link

matz commented Jan 9, 2021

For this particular case (the first byte is in 80..C1) is checked now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet
None yet

No branches or pull requests

2 participants