I have a program (a large one at the moment) that's stuck on this stack trace:
"main" prio=6 tid=0x0000000002090800 nid=0xe70 runnable [0x0000000000536000]
Judging from the stack trace, this might only occur on Windows, if CRLFStreamWrapper is the problem.
A potentially suspicious spot, though, is MultiByteEncoding.strCodeAt. It loops, advancing an offset by a character length, but never checks whether that length is 0. It should probably check that the length is >= 1 and throw an exception rather than looping forever.
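To sketch the guard I mean (hypothetical helper and class names, not JRuby's actual code): if the per-character length ever comes back as 0, the offset never advances, so failing loudly is better than spinning.

```java
public class LoopGuard {
    // Hypothetical guard for a strCodeAt-style loop: a character length of 0
    // would leave the offset unchanged forever, so throw instead of looping.
    static int advance(int offset, int charLength) {
        if (charLength < 1) {
            throw new IllegalStateException(
                "non-positive character length " + charLength + " at offset " + offset);
        }
        return offset + charLength;
    }

    public static void main(String[] args) {
        System.out.println(advance(0, 2)); // normal multibyte advance
        try {
            advance(2, 0); // the case that would have infinite-looped
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```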
Maybe this is just a severe performance issue?
I wrote a test program to replicate what I see JRuby doing when it seems to hang:
// simulate what we found in the debugger when the real script was running
ByteList byteList = new ByteList(FileUtils.readFileToByteArray(new File("Z:\\Data\\bytes.dat")));
// simulate what we saw JRuby doing when writing the string out
int length = byteList.length();
for (int i = 0; i < length; i++) {
    // loop body elided; this is where getEnc() ends up being called per character
}
I'm waiting to see if it breaks on some value in the file, but getEnc seems to take longer and longer for each subsequent value of i. At the moment it's taking about 1 second per iteration (up to around iteration 500,000 now), but at the start it was quite a bit faster. It seems like getEnc() is O(n) in the length of the string; combine that with iterating the string to print it out, which is also O(n), and you get O(n^2) for the overall operation. The string we're putting through is about 8 MB, so it will probably be quite slow once it gets near the end...
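The shape of the problem can be shown with a tiny self-contained sketch (hypothetical helper names, not JRuby's internals): locating the i-th character of a multibyte string by re-walking from byte 0 each time costs O(i), so touching all n characters that way does ~n^2/2 scan steps, versus n for a single pass that keeps its byte offset between iterations.

```java
import java.nio.charset.StandardCharsets;

public class QuadraticScan {
    static long steps = 0; // counts charLength() calls so the two strategies can be compared

    // Byte length of the UTF-8 character starting at buf[pos] (assumes valid UTF-8).
    static int charLength(byte[] buf, int pos) {
        steps++;
        int b = buf[pos] & 0xFF;
        if (b < 0x80) return 1;
        if (b < 0xE0) return 2;
        if (b < 0xF0) return 3;
        return 4;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo w\u00f6rld";              // 11 characters, 13 UTF-8 bytes
        byte[] buf = s.getBytes(StandardCharsets.UTF_8);
        int n = s.length();

        // Quadratic: find character i by re-walking from the start, for every i.
        steps = 0;
        for (int i = 0; i < n; i++) {
            int pos = 0;
            for (int c = 0; c < i; c++) pos += charLength(buf, pos);
        }
        System.out.println("re-walk steps: " + steps);

        // Linear: keep the byte offset between iterations instead.
        steps = 0;
        int pos = 0;
        for (int i = 0; i < n; i++) pos += charLength(buf, pos);
        System.out.println("single-pass steps: " + steps);
    }
}
```

At 8 MB of multibyte text, that gap is the difference between milliseconds and the hours-long "hang" described above.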
The workaround of breaking the string into multiple smaller chunks and writing each chunk seems to work fine, but it raises a different issue.
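In Java terms the workaround amounts to something like this (a generic sketch, not our actual code): never hand the whole 8 MB buffer to the write path at once, so each write's cost stays bounded.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class ChunkedWrite {
    // Write data in fixed-size slices instead of one huge write.
    static void writeInChunks(OutputStream out, byte[] data, int chunkSize) throws IOException {
        for (int off = 0; off < data.length; off += chunkSize) {
            out.write(data, off, Math.min(chunkSize, data.length - off));
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10_000];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeInChunks(out, data, 4096); // writes of 4096, 4096, 1808 bytes
        System.out.println(out.size());
    }
}
```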
Somehow the contents of the string being printed are not UTF-8 (which is the source of the issue) - even though str.encoding == UTF-8.
I tried force_encoding back to windows-1252 (which is what the bytes actually appear to be) and then re-encoding to UTF-8, but this just gives some other error, which I'm sure is a different bug.
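What that Ruby force_encoding/encode dance does is, roughly, this Java transcoding (a sketch; the single byte value is an assumed example): reinterpret the raw bytes as windows-1252, then produce real UTF-8 from the resulting characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Reencode {
    public static void main(String[] args) {
        // Suppose the bytes are really windows-1252 even though the string
        // claims UTF-8: 0xE9 is 'é' in windows-1252 but invalid alone in UTF-8.
        byte[] raw = {(byte) 0xE9};
        String s = new String(raw, Charset.forName("windows-1252"));
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println((int) s.charAt(0)); // code point of the decoded character
        System.out.println(utf8.length);       // byte length after re-encoding
    }
}
```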
Everything is just so broken at the moment... I wonder if anyone other than us has ever fed international characters through JRuby before.
I am changing how 1.9 does this (hopefully in time for 1.7.3), but we need a better mechanism for this; it was a simple oversight... having to re-walk the string for each char is crazy :)
GH #504: Infinite loop (or performance issue) writing using MultiByteEncoding (O(n^2) -> O(n))
@trejkaz can you give me a snippet for the encoding error you get with the split-up strings? We do have another open issue on encoding on Windows, but any extra snippets might help us figure that out quicker. The new performance of CRLF should be sane now: https://gist.github.com/4624040
Yep, hopefully today I will have time to reproduce that in isolation. I managed to get it to happen with our full application running over a single email with a non-ASCII subject, but obviously I still have to cut the "full application" bit out of the picture.
Also can you see if your problem goes away if you start JRuby with -Eutf-8? I was just working on the other Windows encoding bug and MRI behaved exactly like we did. Once I changed both to -Eutf-8 everything seemed ok.
I don't seem to get the error with my cut-down example, but the wrong text ends up in the file anyway.
Minimal code here: https://gist.github.com/4631108
Summary is that -Eutf-8 makes no difference from using -Dfile.encoding=UTF-8.
Ah thanks for the minimal test. I will play with this and see what I can find out.
I left a comment on that gist for you to look at. FWIW, I believe this is fixed on master and would appreciate it if you could confirm that. Making -Dfile.encoding=UTF-8 the sane default is something we will discuss for 1.7.3, since I think very, very few people want the Java default charset to be Windows-1252.
Well this is interesting. My reduced example certainly works correctly now, but our tests are still spitting out the same error as before. I'll have to investigate further.
Turns out it is fixed after all. We had another workaround of setting the encoding on the stream - removing that workaround makes it all work again. (Phew.)