Some data can cause String#encode to hang #2856
I have code that attempts to remove invalid characters by converting a (supposedly) UTF-8 input string to UTF-16 and back to UTF-8.
If given a string of random binary data instead of mostly valid UTF-8, the first call to encode (UTF-8 -> UTF-16) can hang and appears to never return. I wrote a test script that demonstrates a case where this happens consistently. It does not happen with all random data, but it's pretty easy to find a case that does this by just randomly generating bytes.
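For readers who have not seen this idiom, the round trip described above might look roughly like the following. This is only a sketch of the general technique, not the reporter's actual code: the helper name `scrub_via_utf16` and the exact `encode` options are assumptions.

```ruby
# Hypothetical helper (name and options are assumptions, not from the report):
# drop invalid byte sequences by round-tripping through UTF-16.
def scrub_via_utf16(input)
  # Reinterpret the raw bytes as UTF-8 without mutating the caller's string.
  utf8 = input.dup.force_encoding(Encoding::UTF_8)

  # First hop (UTF-8 -> UTF-16). This is the call the report says can hang.
  # invalid: :replace with replace: "" removes malformed sequences.
  utf16 = utf8.encode(Encoding::UTF_16, invalid: :replace, undef: :replace, replace: "")

  # Second hop back to UTF-8; the result should now be valid UTF-8.
  utf16.encode(Encoding::UTF_8)
end

cleaned = scrub_via_utf16("abc\xFFdef".b)
puts cleaned                  # the stray 0xFF byte is dropped
puts cleaned.valid_encoding?
```

On Ruby 2.1 and later, `String#scrub` covers the same use case without the intermediate UTF-16 step.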
Here's me running it on MRI Ruby. It took less than 1 second:
Here's me running it on the latest stable JRuby. I gave up after 4 minutes:
Here's what that thread was doing before I killed it:
I tried reproducing this, but both JRuby 1.7.19 and master produced results almost immediately. There is a bit of a disparity in the resulting data length, however. MRI 2.2 and JRuby master report 1121, whereas JRuby 1.7.19 reports 2242 (mirroring your results for MRI 2.0.0p353).
I've been able to reproduce this consistently on OS X and Linux. I also had someone else reproduce it on their OS X machine.
I'm on OSX 10.10.3 and Java 7. JRuby is installed with rbenv.
What else can I do to help find out what's causing this?
In this gist there are 12 base64 encoded strings that have this problem. And it also includes a script I created to randomly create strings and run encode on them until it finds one that hangs.
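A random search of the kind described above could be sketched as follows. This is not the script from the gist: the helper name `find_hanging_input`, the input size, the attempt count, and the time limit are all assumptions made for illustration.

```ruby
require "securerandom"
require "timeout"
require "base64"

# Hypothetical helper (not the gist's script): feed random bytes to
# String#encode and return the Base64 form of the first input whose call
# exceeds `limit` seconds, or nil if none does within `attempts` tries.
def find_hanging_input(attempts: 100, size: 1024, limit: 5)
  attempts.times do
    bytes = SecureRandom.random_bytes(size)
    begin
      Timeout.timeout(limit) do
        bytes.dup.force_encoding(Encoding::UTF_8)
             .encode(Encoding::UTF_16, invalid: :replace, undef: :replace, replace: "")
      end
    rescue Timeout::Error
      # Base64 keeps the offending bytes printable so they can be shared.
      return Base64.strict_encode64(bytes)
    end
  end
  nil
end

found = find_hanging_input(attempts: 20, size: 512, limit: 5)
puts(found ? "Possible hang on input: #{found}" : "No hang detected")
```

One caveat: `Timeout` may not be able to interrupt a loop that is spinning inside the runtime's own transcoder, in which case an external watchdog (for example, running each attempt in a child process) would be needed instead.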
I could not get master to fail in any case, but this isn't surprising...the transcoder is now identical to MRI's.
JRuby 1.7 worked ok on Java 8u40, but on Java 7u67 I was able to reproduce your results with both your original script and on the longer-running random search.
So it seems there may be a Java bug here. I will investigate a bit to see if I can improve my transcoder to avoid this problem.
Java 7 appears to raise different errors for some cases, and these cases were not handled in the encoder loop. As a result, they could trigger an infinite loop on bad input. The trigger appears to be UTF-16 surrogate pairs that get cleaved on Java 7, leading to underflow; on Java 8 those pairs do not get cleaved. Fixes #2856.