Join GitHub today
GitHub is home to over 31 million developers working together to host and review code, manage projects, and build software together.Sign up
Encoding issue after v1.7.6 #2419
So I've been having problems with encodings after upgrading from JRuby 1.7.6 (I think the next version I tried after that was 1.7.8, but surely 1.7.11).
Reproduce in 1.7.18:
require 'nokogiri' require 'open-uri' Nokogiri::HTML(open("https://www.youtube.com/results?search_query=bears+in+exile+guppy+gmma+indicator+part+2")).to_html
It works fine with 1.7.6 (also works on MRI 2.1.2). I'm not sure if 1.7.11 had the exact same issue, I just know that I had issues with encodings.
This is an ugly one to fix in 1.7.
The logic for StringIO#write in MRI calls rb_str_conv_env, which calls rb_str_conv_enc_opts, which calls rb_econv_convert. Given the result from rb_econv_convert, it either increases the buffer size for the target buffer (if the result is "destination buffer full"), or it returns the newly-transcoded buffer (if the result is "finished"), or it returns the original string and ignores the result altogether (all other results, including errors). This results in behavior that seems incorrect on MRI, but which raises an error on JRuby:
sio = StringIO.new('123'.force_encoding('US-ASCII')) p sio.string.encoding sio.write('øøø') p sio.string p sio.string.encoding
Output on MRI:
Clearly this string is no longer US-ASCII because of the multibyte UTF-8 characters, but this is how MRI handles the situation.
The basic problem is that our transcoder is set up to always raise an error if there's a result from transcoding. Our StringIO#write calls roughly equivalent logic in Transcoder.strConvEnc/Opts, but the logic differs in two important ways:
The challenge of fixing this is that most of the transcoding consumers in our codebase depend on the transcoder raising errors on its own. I tried to just remove the
The safest way to fix this would be to make a separate path for our strConvEncOpts that handles the error appropriately outside CharsetTranscoder, and make sure all paths using strConvEncOpts are using it properly.