Invalid UTF-8 Strings when compiling with MRB_UTF8_STRING compute length incorrectly #5269

Closed
@lopopolo

Description

When mruby is compiled with UTF-8 string support (e.g. by setting CFLAGS="-DMRB_UTF8_STRING"), it computes the wrong length for strings that contain invalid UTF-8 byte sequences: the two-byte string "\xC0\x80" reports a length of 1, whereas MRI reports 2.

mruby

$ git rev-parse HEAD
69482dbc8e590ed66f0944e9b48c4f9c2f83c873
$ git show
commit 69482dbc8e590ed66f0944e9b48c4f9c2f83c873 (HEAD -> master, origin/master, origin/HEAD)
Merge: 6587269a f7ff4810
Author: Yukihiro "Matz" Matsumoto <matz@ruby.or.jp>
Date:   Fri Jan 8 23:10:49 2021 +0900

    Merge pull request #5265 from shuujii/reapply-116e128b-because-it-is-back-at-456878ba

    Reapply 116e128b because it is back at 456878ba

Reproduction steps

rake clean
CFLAGS="-DMRB_UTF8_STRING" rake

Executing in mirb:

$ ./bin/mirb
mirb - Embeddable Interactive Ruby Shell

> xs = [192, 128].pack("C*")
 => "��"
> xs.bytes
 => [192, 128]
> xs.length
 => 1
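
One way a length of 1 could arise is sketched below. This is a guess at the mechanism, not a copy of mruby's code: `naive_utf8_length` is a hypothetical helper that trusts the width announced by the lead byte and only checks that the following bytes look like continuation bytes (`10xxxxxx`). Since 0xC0 announces a two-byte sequence and 0x80 looks like a continuation byte, the pair is counted as one character, even though 0xC0 can never appear in well-formed UTF-8.

```ruby
# Hypothetical helper for illustration only (an assumption, not mruby's source).
# Counts "characters" by trusting the lead byte's declared width.
def naive_utf8_length(str)
  bytes = str.bytes
  count = 0
  i = 0
  while i < bytes.length
    b = bytes[i]
    # Width announced by the lead byte; anything unrecognised counts as 1 byte.
    width =
      if b < 0x80 then 1
      elsif (b & 0xE0) == 0xC0 then 2
      elsif (b & 0xF0) == 0xE0 then 3
      elsif (b & 0xF8) == 0xF0 then 4
      else 1
      end
    # Only check: are the next (width - 1) bytes present and of the form 10xxxxxx?
    tail = bytes[i + 1, width - 1] || []
    well_formed = tail.length == width - 1 && tail.all? { |t| (t & 0xC0) == 0x80 }
    width = 1 unless well_formed
    count += 1
    i += width
  end
  count
end

puts naive_utf8_length([192, 128].pack("C*")) # prints 1
```

A counter that matches MRI here would additionally have to reject the lead bytes 0xC0 and 0xC1 (and the other ill-formed ranges) and count each such byte individually.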

Reference MRI execution

$ irb
[2.6.6] > xs = [192, 128].pack("C*")
=> "\xC0\x80"
[2.6.6] > xs.bytes
=> [192, 128]
[2.6.6] > xs.length
=> 2

With forced UTF-8 encoding:

$ irb
[2.6.6] > xs = [192, 128].pack("C*")
=> "\xC0\x80"
[2.6.6] > xs = xs.force_encoding(Encoding::UTF_8)
=> "\xC0\x80"
[2.6.6] > xs.encoding
=> #<Encoding:UTF-8>
[2.6.6] > xs.length
=> 2
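
For anyone comparing against MRI, the following sketch gathers the relevant facts about the same string in one place. It assumes MRI (CRuby) and uses only core `String` methods; the `report` hash is just an illustrative container.

```ruby
xs = [192, 128].pack("C*")                    # binary (ASCII-8BIT) string
utf8 = xs.dup.force_encoding(Encoding::UTF_8) # same bytes, tagged as UTF-8

report = {
  bytesize: utf8.bytesize,                 # number of bytes
  length: utf8.length,                     # number of characters as MRI counts them
  valid: utf8.valid_encoding?,             # whether the bytes are well-formed UTF-8
  chars_as_bytes: utf8.chars.map(&:bytes), # how MRI splits the string into characters
}
p report
```

On MRI this reports the string as invalid UTF-8, with each of the two bytes treated as its own character, which is consistent with the length of 2 shown above.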
