
Exception NegativeArraySizeException during JSON.dump of large Hash #6265

Open
jbaiza opened this issue Jun 4, 2020 · 6 comments
jbaiza commented Jun 4, 2020

Environment Information

Provide at least:

  • JRuby version (jruby -v) and command line (flags, JRUBY_OPTS, etc)
    Originally detected on 9.2.9.0:
    jruby 9.2.9.0 (2.5.7) 2019-10-30 458ad3e OpenJDK 64-Bit Server VM 11.0.5+10 on 11.0.5+10 [darwin-x86_64]
    but can be reproduced also on the latest 9.2.11.1:
    jruby 9.2.11.1 (2.5.7) 2020-03-25 b1f55b1 OpenJDK 64-Bit Server VM 11.0.5+10 on 11.0.5+10 [darwin-x86_64]
  • Operating system and platform (e.g. uname -a)
    Linux staging-app1 4.4.0-170-generic #199-Ubuntu SMP Thu Nov 14 01:45:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
    also on Dev box:
    Darwin JBA-MacBook-Pro.local 19.4.0 Darwin Kernel Version 19.4.0: Wed Mar 4 22:28:40 PST 2020; root:xnu-6153.101.6~15/RELEASE_X86_64 x86_64

Other relevant info you may wish to add:

  • Installed or activated gems - N/A
  • Application/framework version (e.g. Rails, Sinatra) - N/A
  • Environment variables - increase Java memory with JAVA_OPTS=-Xmx4g

Expected Behavior

  • Describe your expectation of how JRuby should behave, perhaps by showing how CRuby/MRI behaves.
    JSON.dump completes without an error for a large hash. On MRI 2.5.1 the provided sample code completes without an error.
  • Provide an executable Ruby script or a link to an example repository.
require 'json'
arr = [{"0" => "0", "1" => "1", "2" => "2", "3" => "3", "4" => "4", "5" => "5"}]
begin
  25.times do
    arr.concat arr
    puts "ARR size: #{arr.size}"
  end
  puts "JSON.dump size: #{JSON.dump(arr).size}"
  arr.concat arr
  puts "ARR size: #{arr.size}"
  puts "JSON.dump size: #{JSON.dump(arr).size}"
end

Actual Behavior

An error is thrown:

Traceback (most recent call last):
       16: from json.ext.Generator$Handler.generateNew(Generator.java:194)
       15: from json.ext.Generator$4.generate(Generator.java:272)
       14: from json.ext.Generator$4.generate(Generator.java:315)
       13: from json.ext.Generator$5.generate(Generator.java:329)
       12: from json.ext.Generator$5.generate(Generator.java:356)
       11: from org.jruby.dist/org.jruby.RubyHash.visitAll(RubyHash.java:2746)
       10: from org.jruby.dist/org.jruby.RubyHash.visitLimited(RubyHash.java:699)
        9: from org.jruby.dist/org.jruby.RubyHash$Visitor.visit(RubyHash.java:677)
        8: from json.ext.Generator$5$1.visit(Generator.java:377)
        7: from json.ext.Generator$6.generate(Generator.java:391)
        6: from json.ext.Generator$6.generate(Generator.java:412)
        5: from json.ext.StringEncoder.encode(StringEncoder.java:51)
        4: from json.ext.ByteListTranscoder.quoteStop(ByteListTranscoder.java:147)
        3: from org.jruby.dist/org.jruby.util.ByteList.append(ByteList.java:530)
        2: from org.jruby.dist/org.jruby.util.ByteList.append(ByteList.java:546)
        1: from org.jruby.dist/org.jruby.util.ByteList.grow(ByteList.java:1107)
Java::JavaLang::NegativeArraySizeException (-1746927586)

It seems that 2 GB is the size at which the error starts to occur.
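A quick size estimate is consistent with that limit (the per-element figure below is computed from the sample hash above, not taken from the report):

```ruby
require 'json'

# One element of the sample array from the script above.
elem = { "0" => "0", "1" => "1", "2" => "2", "3" => "3", "4" => "4", "5" => "5" }

# Serialized size of one element, plus one byte for the separating comma.
per_element = JSON.generate(elem).bytesize + 1   # 50 bytes

java_max_array = 2**31 - 1                       # max byte[] length on the JVM

# After 25 doublings the array has 2**25 elements; after 26, 2**26.
puts 2**25 * per_element                         # ~1.7 GB -- still fits in a byte[]
puts 2**26 * per_element                         # ~3.4 GB -- exceeds the JVM limit
puts 2**26 * per_element > java_max_array        # => true
```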

headius commented Nov 20, 2020

Not too surprising... this is dumping the JSON output into one of our ByteList instances, which are backed by a Java byte array, and on the JVM the limit for such arrays is 2GB.

The only workaround I can think of at the moment would be to dump to an IO stream, rather than dumping to a >2GB in-memory buffer.

Unfortunately fixing this is a much larger challenge, since the result of JSON.dump is a Ruby String, and String in JRuby is backed by a single ByteList. There have been many other such bug reports that we have had to close as "won't fix" mostly because this is a JVM limitation.
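A hand-rolled version of the IO workaround mentioned above could look like this; `dump_array_to_io` is a hypothetical helper, not part of the json API. Each element is serialized separately, so no single Ruby String (and hence no single byte[] on JRuby) ever approaches the 2GB limit:

```ruby
require 'json'
require 'stringio'

# Hypothetical helper: stream a large Array to an IO one element at a
# time. Each JSON.generate call only builds a small per-element String,
# so the full document never exists in memory at once.
def dump_array_to_io(arr, io)
  io.write("[")
  arr.each_with_index do |elem, i|
    io.write(",") unless i.zero?
    io.write(JSON.generate(elem))
  end
  io.write("]")
  io
end

io = dump_array_to_io([{ "a" => 1 }, { "b" => 2 }], StringIO.new)
puts io.string  # => [{"a":1},{"b":2}]
```

In practice the sink would be a File opened for writing rather than a StringIO, so the output lands on disk instead of in memory.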

headius commented Nov 20, 2020

@enebo @lopex I don't know that we are any closer to a solution on this now than we were ten years ago. One option would be modifying ByteList to use either multiple arrays or long-based addressing, but clearly that would impact a huge amount of code that expects to be able to get a byte[] out. On the other hand, the vast majority of cases will still be under 2GB, so perhaps we could incrementally add support for ranges outside int32 and raise an error for cases that expect a real byte[].

jbaiza commented Mar 31, 2023

Hello,
We have upgraded our JRuby version to 9.3.7.0, and the same sample code now produces an OutOfMemoryError:

Traceback (most recent call last):
       16: from json.ext.Generator$4.generate(Generator.java:253)
       15: from json.ext.Generator$4.generate(Generator.java:296)
       14: from json.ext.Generator$5.generate(Generator.java:310)
       13: from json.ext.Generator$5.generate(Generator.java:339)
       12: from org.jruby.dist/org.jruby.RubyHash.visitAll(RubyHash.java:2882)
       11: from org.jruby.dist/org.jruby.RubyHash.visitLimited(RubyHash.java:727)
       10: from org.jruby.dist/org.jruby.RubyHash$Visitor.visit(RubyHash.java:705)
        9: from json.ext.Generator$5$1.visit(Generator.java:358)
        8: from json.ext.Generator$6.generate(Generator.java:372)
        7: from json.ext.Generator$6.generate(Generator.java:393)
        6: from json.ext.StringEncoder.encode(StringEncoder.java:52)
        5: from json.ext.ByteListTranscoder.quoteStop(ByteListTranscoder.java:147)
        4: from org.jruby.dist/org.jruby.util.ByteList.append(ByteList.java:547)
        3: from org.jruby.dist/org.jruby.util.ByteList.append(ByteList.java:563)
        2: from org.jruby.dist/org.jruby.util.ByteList.grow(ByteList.java:1125)
        1: from org.jruby.dist/org.jruby.runtime.Helpers.calculateBufferLength(Helpers.java:493)
Java::JavaLang::OutOfMemoryError (Requested array size exceeds VM limit)

and on the latest version 9.4.2.0 the same error with a bit different stack trace formatting:

org.jruby.dist/org.jruby.runtime.Helpers.calculateBufferLength(Helpers.java:492): Requested array size exceeds VM limit (Java::JavaLang::OutOfMemoryError)
	from org.jruby.dist/org.jruby.util.ByteList.grow(ByteList.java:1125)
	from org.jruby.dist/org.jruby.util.ByteList.append(ByteList.java:563)
	from org.jruby.dist/org.jruby.util.ByteList.append(ByteList.java:547)
	from json.ext.ByteListTranscoder.quoteStop(ByteListTranscoder.java:147)
	from json.ext.StringEncoder.encode(StringEncoder.java:52)
	from json.ext.Generator$6.generate(Generator.java:393)
	from json.ext.Generator$6.generate(Generator.java:372)
	from json.ext.Generator$5$1.visit(Generator.java:358)
	from org.jruby.dist/org.jruby.RubyHash$Visitor.visit(RubyHash.java:715)
	from org.jruby.dist/org.jruby.RubyHash.visitLimited(RubyHash.java:759)
	from org.jruby.dist/org.jruby.RubyHash.visitAll(RubyHash.java:2982)
	from json.ext.Generator$5.generate(Generator.java:339)
	from json.ext.Generator$5.generate(Generator.java:310)
	from json.ext.Generator$4.generate(Generator.java:296)
	from json.ext.Generator$4.generate(Generator.java:253)
	from json.ext.Generator$Handler.generateNew(Generator.java:175)
	... 180 levels...

The JVM was given more than 2 GB of memory: export JAVA_OPTS="-Xms256m -Xmx10048m"

headius commented Mar 31, 2023

This remains a Java limitation. The buffer into which your large hash is being dumped eventually grows to be larger than 2GB, which is the limit of a Java array. Since we only have one implementation of Ruby's String, and that implementation uses a Java byte[], we cannot grow a string any larger than the 2GB limit.

The solutions to this in the Java world are to use multiple arrays or to use a native block of memory via a native ByteBuffer. I can see two paths forward for fixing this with ByteBuffer:

  • We can implement either a special RubyString or ByteList that uses a ByteBuffer rather than a byte[] as its storage. We have discussed this possibility in the past as a way to improve I/O performance and memory usage since an entire file could be memory-mapped rather than copied into the JVM heap. But implementing this is no small task.
  • We may be able to coax the json library to dump to a custom data type written in Ruby that maintains a native ByteBuffer and implements key methods of String. If so, we would have an option for cases that require more than 2GB of String data, and a possible path forward to enhancing or replacing our existing byte[] String with a more flexible implementation.

I'm going to poke around the json library and see if the latter option might work in the short term.
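A minimal sketch of the shape of that second option, with an array of small Ruby String chunks standing in for the native ByteBuffer (`ChunkedBuffer` and its limit are illustrative names, not anything in JRuby or the json gem):

```ruby
# Illustrative only: a String-like sink backed by many small chunks
# rather than one contiguous buffer, so no single allocation has to
# hold the whole result. A real implementation would use native
# memory; plain Ruby Strings stand in for the buffers here.
class ChunkedBuffer
  CHUNK_LIMIT = 64 * 1024  # deliberately small for the sketch

  def initialize
    @chunks = [String.new]
  end

  # The key String method a generator needs: append.
  def <<(str)
    @chunks << String.new if @chunks.last.bytesize >= CHUNK_LIMIT
    @chunks.last << str
    self
  end

  # Total size can exceed any single chunk's limit.
  def bytesize
    @chunks.sum(&:bytesize)
  end

  # Stream the content out without ever joining it into one String.
  def each_chunk(&blk)
    @chunks.each(&blk)
  end
end

buf = ChunkedBuffer.new
buf << '{"key":' << '"value"}'
puts buf.bytesize  # => 15
```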

headius commented Mar 31, 2023

Unfortunately the json library is not compatible with the approach I outlined. In all three implementations of the generator (the pure-Ruby version, the C version, and the Java version), all JSON is written first to a String, and then that String is returned or written to an IO. There's no streaming of JSON data into an abstract "write" or "append" interface, so there's no way to trick it into using a different type of buffer.

I was also mistaken when I said ByteBuffer could be used to work around this. ByteBuffers can only be constructed with a size specified in a Java int, effectively limiting them to 2GB. So that leaves us to use a native buffer in some other way, such as through Ruby FFI, Java FFI libraries like jnr-ffi, or the new OpenJDK project Panama's support for native memory buffers.

So the project is pretty big but also could be pretty valuable.

For IO-heavy json use cases, writing to a String buffer is obviously not going to be the most efficient option; there would be value in enhancing json to write not to a String but to any String-like or IO-like object provided by the caller. That would in turn allow us to pass it a native memory wrapper and avoid the 2GB byte[] limit.

The wrapper itself could be implemented today with FFI, but for better efficiency the JIT enhancements that come with project Panama make it a more attractive target. Panama is only available in preview form as of JDK 19, having incubated in JDK 17 and 18. I will be presenting on this topic (in part) next week, so I'm looking into the possibilities right now.
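The "String-like or IO-like object" interface described above could be sketched as follows; `generate_to` is a hypothetical function, not part of the json gem, and scalar escaping is delegated to JSON.generate:

```ruby
require 'json'

# Hypothetical: a generator that appends to any object responding to
# `<<` -- a String, an IO, or a native-memory wrapper -- instead of
# accumulating the whole document in one String.
def generate_to(obj, sink)
  case obj
  when Hash
    sink << "{"
    obj.each_with_index do |(k, v), i|
      sink << "," unless i.zero?
      generate_to(k.to_s, sink)
      sink << ":"
      generate_to(v, sink)
    end
    sink << "}"
  when Array
    sink << "["
    obj.each_with_index do |v, i|
      sink << "," unless i.zero?
      generate_to(v, sink)
    end
    sink << "]"
  else
    sink << JSON.generate(obj)  # scalars: reuse the gem's escaping
  end
  sink
end

out = generate_to({ "a" => [1, 2], "b" => "x" }, +"")
puts out  # => {"a":[1,2],"b":"x"}
```

Because the sink only needs `<<`, the same code can write to a File, a StringIO, or a native-memory wrapper without changes.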

headius added a commit to headius/json that referenced this issue Aug 15, 2023
JSON.dump allows you to pass an IO to which the dump output will
be sent, but it still buffers the entire output in memory before
sending it to the given IO. This leads to issues on JRuby like
jruby/jruby#6265 when it tries to create a byte[] that exceeds the
maximum size of a signed int (JVM's array size limit).

This commit plumbs the IO all the way through the generation logic
so that it can be written to directly without filling a temporary
memory buffer first. This allows JRuby to dump object graphs that
would normally produce more content than the JVM can hold in a
single array, providing a workaround for jruby/jruby#6265.

It is unfortunately a bit slow to dump directly to IO due to the
many small writes that all acquire locks and participate in the
IO encoding subsystem. A more direct path that can skip some of
these pieces could be more competitive with the in-memory version,
but functionally it expands the size of graphs that can be dumped
when using JRuby.

See flori#54
headius added a commit to headius/json that referenced this issue Aug 15, 2023

See flori#524
headius commented Aug 15, 2023

See flori/json#524 for a proof-of-concept streaming dump implementation. This is likely the closest we can get in the near term to defeating the JVM array-size limit, but I could use some help cleaning it up and getting it shipped.
