So I've been having problems with encodings after upgrading from JRuby 1.7.6 (I think the next version I tried after that was 1.7.8, but surely 1.7.11).
Reproduce in 1.7.18:
require 'nokogiri' require 'open-uri' Nokogiri::HTML(open("https://www.youtube.com/results?search_query=bears+in+exile+guppy+gmma+indicator+part+2")).to_html
It works fine with 1.7.6 (also works on MRI 2.1.2). I'm not sure if 1.7.11 had the exact same issue, I just know that I had issues with encodings.
The text was updated successfully, but these errors were encountered:
This is an ugly one to fix in 1.7.
The logic for StringIO#write in MRI calls rb_str_conv_env, which calls rb_str_conv_enc_opts, which calls rb_econv_convert. Given the result from rb_econv_convert, it either increases the buffer size for the target buffer (if the result is "destination buffer full"), or it returns the newly-transcoded buffer (if the result is "finished"), or it returns the original string and ignores the result altogether (all other results, including errors). This results in behavior that seems incorrect on MRI, but which raises an error on JRuby:
sio = StringIO.new('123'.force_encoding('US-ASCII')) p sio.string.encoding sio.write('øøø') p sio.string p sio.string.encoding
Output on MRI:
Clearly this string is no longer US-ASCII because of the multibyte UTF-8 characters, but this is how MRI handles the situation.
The basic problem is that our transcoder is set up to always raise an error if there's a result from transcoding. Our StringIO#write calls roughly equivalent logic in Transcoder.strConvEnc/Opts, but the logic differs in two important ways:
The challenge of fixing this is that most of the transcoding consumers in our codebase depend on the transcoder raising errors on its own. I tried to just remove the
The safest way to fix this would be to make a separate path for our strConvEncOpts that handles the error appropriately outside CharsetTranscoder, and make sure all paths using strConvEncOpts are using it properly.
When there's an error from encoding, MRI's version of this function just returns the incoming bytes as its result. Ours used a path that always raises an error when transcoding fails. The new version uses a path more in line with MRI that requires an out buffer and returns a result, so we can handle it appropriately. Fixes #2419.