String#encode('utf-8') throws IllegalArgumentException on long strings (1505 MByte) #845

Closed
petervandenabeele opened this Issue · 2 comments

3 participants

@petervandenabeele

As requested by Charles Oliver Nutter, I am filing the bug here:

/Users/peter_v/dbd $ cat bin/test_3.rb
# encoding=us-ascii

#row = "A" * 300 # NEVER fails with this value of `row`
row = "A" * 301 # ALWAYS fails with this value of `row`
count = 5_000_000

csv_string = row * count
encoded_string = csv_string.encode("utf-8")
/Users/peter_v/dbd $ time jruby -J-Xmx18000m bin/test_3.rb
CharBuffer.java:311:in `allocate': java.lang.IllegalArgumentException
    from CharsetDecoder.java:775:in `decode'
    from CharsetTranscoder.java:81:in `transcode'
    from CharsetTranscoder.java:64:in `transcode'
    from CharsetTranscoder.java:110:in `transcode'
    from RubyString.java:7649:in `transcode'
    from RubyString.java:7590:in `encode'
    from RubyString$INVOKER$i$encode.gen:-1:in `call'
    from CachingCallSite.java:326:in `cacheAndCall'
    from CachingCallSite.java:170:in `call'
    from bin/test_3.rb:8:in `__file__'
    from bin/test_3.rb:-1:in `load'
    from Ruby.java:807:in `runScript'
    from Ruby.java:800:in `runScript'
    from Ruby.java:669:in `runNormally'
    from Ruby.java:518:in `runFromMain'
    from Main.java:390:in `doRunFromMain'
    from Main.java:279:in `internalRun'
    from Main.java:221:in `run'
    from Main.java:201:in `main'

real    0m5.942s
user    0m5.839s
sys    0m1.705s
/Users/peter_v/dbd $

Reproduced with jruby-1.7.4 on Mac OS X and 64-bit Ubuntu 12.04.

Charles' analysis (by e-mail on July 1, 2013):

This appears to be a JDK bug. The following code in CharsetDecoder
attempts to grow the CharBuffer it's decoding into as it goes, but as
you get close to the signed 32-bit max for the incoming ByteBuffer,
this will overflow to negative and cause IllegalArgumentException in
CharBuffer.allocate.

    if (cr.isOverflow()) {
        n = 2*n + 1;    // Ensure progress; n might be 0!
        CharBuffer o = CharBuffer.allocate(n);

The only workaround I can offer is to not transcode such a large string.

JDK should probably be fixed to not overflow integer max here and be
more conservative growing the CharBuffer when approaching 2GB.

We can fix this by expanding the use of our own decode loop, which
tries to avoid over-allocating buffers. We could also fix it by
getting jcodings transcoding logic working, probably. But working with
String data close to signed 32-bit max is likely to run into other
issues since the JVM can only index arrays (e.g. byte[] in a String)
up to 32-bit size.

- Charlie
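Charles' suggested workaround (not transcoding one huge string) can be sketched in Ruby. This is a hypothetical helper, not part of JRuby or the original report; it assumes the input is pure ASCII (as in the repro above), so splitting at byte offsets never cuts a character in half:

```ruby
# Hypothetical sketch: encode a very large string in fixed-size chunks,
# so no single String#encode call has to allocate a CharBuffer anywhere
# near the signed 32-bit limit. Safe only because the source bytes are
# ASCII; a multi-byte source encoding would need character-aware splits.
def encode_in_chunks(str, enc = "utf-8", chunk_size = 64 * 1024 * 1024)
  chunks = []
  offset = 0
  while offset < str.bytesize
    chunks << str.byteslice(offset, chunk_size).encode(enc)
    offset += chunk_size
  end
  chunks # write each piece to an IO rather than re-joining them
end
```

Joining the chunks back into one string would of course recreate the original problem; in practice each encoded piece would be written straight to a file or socket.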

My work-around (in CSV.generate) has been to set

    def csv_defaults
      {force_quotes: true,
       encoding: 'utf-8'}
    end

(see petervandenabeele/dbd@55382d4 )

which UTF-8 encodes the individual cells as they are written into the CSV string, so the complete CSV string does not have to be transcoded afterwards.
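For illustration, those defaults would be applied to CSV.generate roughly like this (the row contents here are made up; only the options hash comes from the report):

```ruby
require "csv"

def csv_defaults
  { force_quotes: true,
    encoding: 'utf-8' }
end

# Each cell is quoted and encoded as it is appended, so the finished
# string never needs a separate String#encode pass.
csv_string = CSV.generate(**csv_defaults) do |csv|
  csv << ["id", "value"]
  csv << [1, "A" * 301]
end
```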

The proper solution, though, was to start using

CSV.open(filename, 'w', csv_defaults) do |csv|
  push_facts(csv)
end

which streams the CSV output directly to a file on disk and never builds the multi-GB string at all.
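A minimal, self-contained version of that streaming approach might look like the following sketch (Tempfile stands in for the real filename, and the generated rows stand in for push_facts):

```ruby
require "csv"
require "tempfile"

file = Tempfile.new(["facts", ".csv"])

# Rows go straight to disk as they are produced; no multi-GB Ruby
# string is ever assembled, so nothing approaches the 2 GB transcode limit.
CSV.open(file.path, "w", force_quotes: true, encoding: "utf-8") do |csv|
  3.times { |i| csv << [i, "A" * 301] } # stand-in for push_facts(csv)
end
```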

@BanzaiMan
Owner

http://markmail.org/message/6cmy2giu3dc626gc

@headius
Owner

Recent work to improve transcoding has made this logic more robust, but it is still subject to the 2 GB limit that all Java arrays impose. The new version of CharsetTranscoder will grow the target array up to nearly the maximum array size, but it can never exceed it. If your string content is near that limit, it may go over just because of header sizes and the like.

I'm going to have to close this as Won't Fix since it's a limitation of the JVM.

@headius headius closed this