Infinite loop (or performance issue) writing using MultiByteEncoding #504

Closed
trejkaz opened this Issue Jan 21, 2013 · 11 comments

@trejkaz
trejkaz commented Jan 21, 2013

I have a program ... it's a large program at the moment, but it's stuck on this stack trace:

"main" prio=6 tid=0x0000000002090800 nid=0xe70 runnable [0x0000000000536000]
   java.lang.Thread.State: RUNNABLE
        at org.jcodings.MultiByteEncoding.strCodeAt(MultiByteEncoding.java:221)
        at org.jruby.util.ByteList.getEnc(ByteList.java:604)
        at org.jruby.util.io.CRLFStreamWrapper.convertLFToCRLF(CRLFStreamWrapper.java:303)
        at org.jruby.util.io.CRLFStreamWrapper.fwrite(CRLFStreamWrapper.java:120)
        at org.jruby.RubyIO.fwrite(RubyIO.java:1521)
        at org.jruby.RubyIO.write(RubyIO.java:1399)
        ...

Judging from the stack trace this might only occur on Windows, if CRLFStreamWrapper is the problem.

A potentially suspicious spot, though, is MultiByteEncoding.strCodeAt. It loops by adding a computed character length to an offset without checking whether that length is 0. It should probably check that the length is >= 1 and throw an exception rather than infinite-loop.
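To illustrate the suggested guard, here is a minimal, self-contained sketch of the same loop shape. The names and the UTF-8 length table are mine, not the actual jcodings code; the point is only that a zero-length step should raise instead of spinning forever:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the guard suggested above; this is NOT
// the real jcodings code, just a minimal model of the same loop shape.
public class CodePointWalker {
    // Byte length of the UTF-8 character whose lead byte is at p.
    // Returns 0 for an invalid lead byte, which is exactly the case
    // that would stall an unguarded loop.
    static int charLengthAt(byte[] bytes, int p) {
        int b = bytes[p] & 0xFF;
        if (b < 0x80) return 1;           // ASCII
        if ((b & 0xE0) == 0xC0) return 2; // 2-byte sequence
        if ((b & 0xF0) == 0xE0) return 3; // 3-byte sequence
        if ((b & 0xF8) == 0xF0) return 4; // 4-byte sequence
        return 0;                         // invalid lead byte
    }

    // Byte offset of the index-th character, guarding against a
    // zero-length step so bad input raises instead of looping forever.
    static int byteOffsetOfChar(byte[] bytes, int index) {
        int p = 0;
        for (int i = 0; i < index; i++) {
            int len = charLengthAt(bytes, p);
            if (len < 1) {
                throw new IllegalStateException("zero-width step at byte " + p);
            }
            p += len;
        }
        return p;
    }

    public static void main(String[] args) {
        byte[] b = "héllo".getBytes(StandardCharsets.UTF_8);
        // 'h' is 1 byte and 'é' is 2, so character index 2 starts at byte 3
        System.out.println(byteOffsetOfChar(b, 2));
    }
}
```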

@trejkaz
trejkaz commented Jan 22, 2013

Maybe this is just a severe performance issue?

I wrote a test program to replicate what I see JRuby doing when it seems to hang:

    // simulate what we found in the debugger when the real script was running
    ByteList byteList = new ByteList(FileUtils.readFileToByteArray(new File("Z:\\Data\\bytes.dat")));
    byteList.setEncoding(Encoding.load("UTF8"));

    // simulate what we saw JRuby doing when writing the string out
    int length = byteList.length();
    for (int i = 0; i < length; i++)
    {
        byteList.getEnc(i);
    }

I'm waiting to see if it breaks on some value in the file, but getEnc seems to take longer and longer for each subsequent value of i. At the moment it's taking about 1 second per iteration (up to around iteration 500,000 now), but at the start it was quite a bit faster. It looks like getEnc() is O(n) in the length of the string; calling it once per character while iterating to print the string (another O(n) loop) makes the overall operation O(n^2). The string we're putting through is about 8MB, so it will probably be quite slow once it gets near the end...
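The quadratic cost described above can be modeled with a toy version of the two access patterns. This is an illustration with made-up names, not JRuby's actual code: one lookup re-walks from byte 0 on every call, the other is a single pass that keeps its position.

```java
// Toy model of the O(n^2) vs O(n) access patterns described above;
// names are illustrative, not JRuby's actual implementation.
public class ScanCost {
    static long byteVisits = 0; // counts work so the asymptotics are visible

    // UTF-8-style character length from the lead byte (the ASCII-only
    // inputs below keep this trivial, but the shape matches a multibyte walk).
    static int charLen(byte[] b, int p) {
        int c = b[p] & 0xFF;
        if (c < 0x80) return 1;
        if ((c & 0xE0) == 0xC0) return 2;
        if ((c & 0xF0) == 0xE0) return 3;
        return 4;
    }

    // Like calling getEnc(i) once per index: re-walks from byte 0 every
    // time, so looping i from 0..n-1 over this does O(n^2) total work.
    static int byteOffsetOfChar(byte[] b, int i) {
        int p = 0;
        for (int k = 0; k < i; k++) {
            byteVisits++;
            p += charLen(b, p);
        }
        return p;
    }

    // One pass that remembers its position between characters: O(n) total.
    static int countChars(byte[] b) {
        int p = 0, chars = 0;
        while (p < b.length) {
            byteVisits++;
            p += charLen(b, p);
            chars++;
        }
        return chars;
    }
}
```

With an 8MB string the difference is the whole story: roughly 8,000,000 positional walks averaging 4,000,000 bytes each, versus one walk of 8,000,000 bytes.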

@trejkaz
trejkaz commented Jan 22, 2013

The workaround of breaking the string into multiple smaller chunks and writing each chunk seems to work fine. It has, however, surfaced a different issue.
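For reference, the chunking workaround can be sketched roughly like this. The chunk size and all names here are my own choices for illustration, not code from the thread:

```java
import java.io.ByteArrayOutputStream;

// Rough sketch of the chunked-write workaround; the 64 KiB chunk size
// is an arbitrary choice for illustration.
public class ChunkedWriter {
    static final int CHUNK = 64 * 1024;

    // Writes data in fixed-size chunks and returns the number of writes,
    // instead of handing the stream one multi-megabyte string at once.
    static int writeInChunks(byte[] data, ByteArrayOutputStream out) {
        int writes = 0;
        for (int off = 0; off < data.length; off += CHUNK) {
            out.write(data, off, Math.min(CHUNK, data.length - off));
            writes++;
        }
        return writes;
    }
}
```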

Somehow the contents of the string being printed are not UTF-8 (which is the source of the issue), even though str.encoding == UTF-8.

I tried to force_encoding back to windows-1252 (which it seems like it is) and then re-encode to UTF-8 - this just gives some other error, which I'm sure is a different bug.

Everything is just so broken at the moment... I wonder if anyone other than us has ever fed international characters through JRuby before.

@enebo
Member
enebo commented Jan 23, 2013

I am changing how 1.9 does this (hopefully in time for 1.7.3), but we must have a better mechanism for this; it was a simple oversight... having to re-walk the string for each char is crazy :)

@enebo enebo added a commit that referenced this issue Jan 24, 2013
@enebo enebo GH #504: Infinite loop (or performance issue) writing using MultiByteEncoding (O(n^2) -> O(n))
ec13fc8
@enebo
Member
enebo commented Jan 24, 2013

@trejkaz can you give me a snippet for the encoding error you have with the split-up strings? We do have another open issue on encoding on Windows, but any extra snippets might help us figure that out quicker. The new performance of CRLF should be sane now: https://gist.github.com/4624040

@enebo enebo closed this Jan 24, 2013
@trejkaz
trejkaz commented Jan 24, 2013

Yep, hopefully today I will have time to reproduce that in isolation. I managed to get it to happen with our full application running over a single email with a non-ASCII subject, but obviously I still have to cut the "full application" bit out of the picture.

@enebo
Member
enebo commented Jan 24, 2013

Also can you see if your problem goes away if you start JRuby with -Eutf-8? I was just working on the other Windows encoding bug and MRI behaved exactly like we did. Once I changed both to -Eutf-8 everything seemed ok.

@trejkaz
trejkaz commented Jan 25, 2013

I don't seem to get the error with my cut-down example, but the wrong text ends up in the file anyway.

Minimal code here: https://gist.github.com/4631108

Summary is that -Eutf-8 makes no difference from using -Dfile.encoding=UTF-8.

@enebo
Member
enebo commented Jan 25, 2013

Ah thanks for the minimal test. I will play with this and see what I can find out.

@enebo
Member
enebo commented Jan 25, 2013

I left a comment on that gist for you to look at. fwiw, I believe this is fixed on master and would appreciate it if you could confirm that. Making -Dfile.encoding=UTF-8 a sane default is something we will discuss for 1.7.3, since I think very few people want the Java default charset to be Windows-1252.

@trejkaz
trejkaz commented Feb 22, 2013

Well this is interesting. My reduced example certainly works correctly now, but our tests are still spitting out the same error as before. I'll have to investigate further.

@trejkaz
trejkaz commented Feb 22, 2013

Turns out it is fixed after all. We had another workaround of setting the encoding on the stream - removing that workaround makes it all work again. (Phew.)
