Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid UTF-8 Strings when compiling with MRB_UTF8_STRING compute length incorrectly #5269

Closed
lopopolo opened this issue Jan 8, 2021 · 2 comments

Comments

@lopopolo
Copy link
Contributor

lopopolo commented Jan 8, 2021

When compiling mruby with UTF-8 Strings (e.g. setting CFLAGS="-DMRB_UTF8_STRING"), mruby incorrectly computes the length of strings with invalid UTF-8 byte sequences.

mruby

$ git rev-parse HEAD
69482dbc8e590ed66f0944e9b48c4f9c2f83c873
$ git show
commit 69482dbc8e590ed66f0944e9b48c4f9c2f83c873 (HEAD -> master, origin/master, origin/HEAD)
Merge: 6587269a f7ff4810
Author: Yukihiro "Matz" Matsumoto <matz@ruby.or.jp>
Date:   Fri Jan 8 23:10:49 2021 +0900

    Merge pull request #5265 from shuujii/reapply-116e128b-because-it-is-back-at-456878ba

    Reapply 116e128b because it is back at 456878ba

Reproduction steps

rake clean
CFLAGS="-DMRB_UTF8_STRING" rake

Executing in mirb:

$ ./bin/mirb
mirb - Embeddable Interactive Ruby Shell

> xs = [192, 128].pack("C*")
 => "��"
> xs.bytes
 => [192, 128]
> xs.length
 => 1

Reference MRI execution

$ irb
[2.6.6] > xs = [192, 128].pack("C*")
=> "\xC0\x80"
[2.6.6] > xs.bytes
=> [192, 128]
[2.6.6] > xs.length
=> 2

With forced UTF-8 encoding:

$ irb
[2.6.6] > xs = [192, 128].pack("C*")
=> "\xC0\x80"
[2.6.6] > xs = xs.force_encoding(Encoding::UTF_8)
=> "\xC0\x80"
[2.6.6] > xs.encoding
=> #<Encoding:UTF-8>
[2.6.6] > xs.length
=> 2
lopopolo added a commit to artichoke/artichoke that referenced this issue Jan 8, 2021
- Fix quickcheck test setup for owned and borrowed bytes tests. An
  upstream mruby bug (mruby/mruby#5269) prevents the length of invalid
  UTF-8 strings from being correctly calculated. The quickcheck harness
  generates a wider range of inputs which requires converting the tests
  to use `bytes`, `byteslice`, and `bytesize`.
- Add an additional `convert_with_trailing_nul` test to the `bytes`
  converter module.
- Fix quickcheck test setup for float tests. The quickcheck harness
  generates a wider range of inputs which revealed issues with the test
  when comparing NaN, infinities, and, some pairs which caused a
  subtraction overflow.
- Fix a test in `spinoso-securerandom` to no longer use the `chars()`
  iterator because `alphanumeric` returns a `Vec<u8>` now.

This commit also renames the cargo features in `spinoso-random` from
`rand_core` to `rand-traits` and from `rand` to `random-rand`. These
changes more closely align `spinoso-random` with its related `rand_mt`
crate. These changes remove the package name overrides for `rand` and
`rand_core` in `Cargo.toml`. Doc comments and doc tests have been
updated for these changes.
@matz
Copy link
Member

matz commented Jan 9, 2021

The length of broken UTF-8 strings is undefined, I think. You may get whatever length.
CRuby does validate if the string is UTF-8 compliant. mruby just assumes the string is valid UTF-8.
I will investigate if we can validate without hindering the performance.

@matz matz closed this as completed in f81591c Jan 9, 2021
@matz
Copy link
Member

matz commented Jan 9, 2021

For this particular case (the first byte is in 80..C1) is checked now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants