Implement approximate length and other length routines for proper broken character processing #26

lopex · 2019-04-25T16:54:12Z

MRI has several character length routines with different semantics, and they are used quite inconsistently. See the wiki: https://github.com/jruby/jruby/wiki/Encodings-in-JRuby.

For now we only have two semantics:

Return -1 for a broken character, or (-1 - n) when n bytes are missing from the stream (in jcodings itself).
StringSupport.preciseLength in JRuby core.
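
To make the first convention concrete, here is a minimal sketch of a UTF-8 length routine that follows it. The class `LengthSketch` and its `length` method are hypothetical names for illustration only, not the jcodings API, and the validation is deliberately simplified (it does not reject overlong forms or surrogates).

```java
public class LengthSketch {
    /**
     * Hypothetical helper, NOT the jcodings API.
     * Returns the byte length of the character starting at p,
     * -1 if the bytes are broken, or (-1 - n) if n more bytes are needed.
     * Assumes p < end.
     */
    static int length(byte[] bytes, int p, int end) {
        int b = bytes[p] & 0xff;
        int need;
        if (b < 0x80) need = 1;
        else if (b >= 0xc2 && b <= 0xdf) need = 2;
        else if (b >= 0xe0 && b <= 0xef) need = 3;
        else if (b >= 0xf0 && b <= 0xf4) need = 4;
        else return -1; // invalid lead byte

        // Check the continuation bytes that are actually present.
        int avail = Math.min(need, end - p);
        for (int i = 1; i < avail; i++) {
            if ((bytes[p + i] & 0xc0) != 0x80) return -1; // broken sequence
        }
        // Truncated input: report how many bytes are missing.
        if (avail < need) return -1 - (need - avail);
        return need;
    }
}
```

A caller that adds such a result directly to its position will move backwards on any negative value, which is why every call site has to check the sign.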

There are several issues:
#25
jruby/joni#38
jruby/joni#17
jruby/joni#46

All of those are related to semantics where length returns 1 for an invalid character, so scans can keep advancing while consuming arrays (whereas we return -1 and fall into infinite loops or an ArrayIndexOutOfBoundsException).
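
As an illustration of the difference, the sketch below wraps a strict length routine in an "approximate" one that reports 1 for anything that is not a complete character, so a scan loop always makes forward progress. All names here (`ScanSketch`, `strictLength`, `approximateLength`, `countChars`) are hypothetical, and treating truncated input as length 1 is a simplifying assumption of this sketch, not a statement of what MRI does in every routine.

```java
public class ScanSketch {
    // Strict length (simplified UTF-8): > 0 for a complete character,
    // -1 for broken bytes, (-1 - n) when n bytes are missing. Assumes p < end.
    static int strictLength(byte[] bytes, int p, int end) {
        int b = bytes[p] & 0xff;
        int need;
        if (b < 0x80) need = 1;
        else if (b >= 0xc2 && b <= 0xdf) need = 2;
        else if (b >= 0xe0 && b <= 0xef) need = 3;
        else if (b >= 0xf0 && b <= 0xf4) need = 4;
        else return -1;
        int avail = Math.min(need, end - p);
        for (int i = 1; i < avail; i++) {
            if ((bytes[p + i] & 0xc0) != 0x80) return -1;
        }
        return avail < need ? -1 - (need - avail) : need;
    }

    // Approximate length: never negative, so "p += len" always advances.
    // Assumption: broken and truncated characters both count as one byte.
    static int approximateLength(byte[] bytes, int p, int end) {
        int n = strictLength(bytes, p, end);
        return n > 0 ? n : 1;
    }

    // Counts characters, treating each invalid byte as its own character.
    // With strictLength here instead, p would decrease on invalid input and
    // the loop would either never terminate or index outside the array.
    static int countChars(byte[] bytes) {
        int p = 0, count = 0;
        while (p < bytes.length) {
            p += approximateLength(bytes, p, bytes.length);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'A', an invalid byte, a complete 3-byte character, a truncated lead byte
        byte[] mixed = {0x41, (byte) 0xff, (byte) 0xe2, (byte) 0x82, (byte) 0xac, (byte) 0xe2};
        System.out.println(countChars(mixed)); // prints 4
    }
}
```

The cost of this approach is the one raised in this issue: every step revalidates bytes, which is wasted work for strings already known to be valid.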

Presto mitigated some of that by using our NonStrictUtf8Encoding here:
prestodb/presto#8711

Ultimately, we need to decide whether to litter our code with more costly validating length routines (which would be wasteful for already validated Strings), or to try a less wasteful approach by expanding on https://github.com/jruby/jcodings/tree/unsafe-encoding

lopex mentioned this issue Apr 25, 2019

Unable to find org.jcodings.specific.BaseUTF8Encoding.mbcCaseFold #25

Closed

lopex mentioned this issue Feb 11, 2020

how to interrupt hanging thread? jruby/joni#46

Open

sebbASF mentioned this issue Feb 22, 2023

JRuby 9.3+ fails with Java::JavaLang::ArrayIndexOutOfBoundsException mikel/mail#1569

Open

ahorek mentioned this issue May 2, 2023

Running specific regex with Regexp::IGNORECASE flag on text starting with specific pipe character results in java.lang.ArrayIndexOutOfBoundsException jruby/jruby#7730

Closed
