CSV parsing performance is poor #3348
Comments
A quick profile indicated that most time is being spent generating stack traces, and digging deeper I found that [...] Also, the converter procs don't JIT in 9k because we don't JIT blocks (yet). We'll need to fix that to get full perf out of this. So yeah, it's basically a bug...a really bad algorithm for conversion in CSV.
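To make the cost concrete, here is a minimal sketch of a rescue-driven converter chain that counts how often it raises. `RAISED`, `INTEGER`, `FLOAT`, and `convert` are hypothetical stand-ins for illustration, not the code in csv.rb, and the encoding step is left out.

```ruby
# Hypothetical stand-in for a rescue-driven converter chain; not the csv.rb code.
# Each converter signals "no match" by raising, and the rescue swallows it.
RAISED = Hash.new(0)

INTEGER = lambda do |field|
  begin
    Integer(field)
  rescue ArgumentError
    RAISED[:integer] += 1 # every non-integer field pays for one exception
    field
  end
end

FLOAT = lambda do |field|
  begin
    Float(field)
  rescue ArgumentError
    RAISED[:float] += 1 # every non-numeric field pays for a second one
    field
  end
end

# Try each converter in turn and stop at the first one that changes the type.
def convert(field, chain)
  chain.each do |converter|
    result = converter.call(field)
    return result unless result.is_a?(String)
  end
  field
end

row = ["42", "3.14", "hello", "world"]
converted = row.map { |f| convert(f, [INTEGER, FLOAT]) }

puts converted.inspect
puts RAISED.inspect
```

In this sketch four fields trigger five exceptions, and any runtime where building an exception backtrace is expensive pays that price on every plain-text field.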
I made two hacks to get a better picture of performance:
With those changes, JRuby is now back to being the fastest. The difference in perf between JRuby:
MRI:
Rubinius:
All three ran with the same modified copy of csv.rb, but MRI and Rubinius were still generating their backtraces (albeit at a MUCH lower cost than JRuby).
Temporary patch for csv to use methods instead of procs:
diff --git a/lib/ruby/stdlib/csv.rb b/lib/ruby/stdlib/csv.rb
index 54b820d..d36550a 100644
--- a/lib/ruby/stdlib/csv.rb
+++ b/lib/ruby/stdlib/csv.rb
@@ -944,29 +944,35 @@ class CSV
   # To add a combo field, the value should be an Array of names. Combo fields
   # can be nested with other combo fields.
   #
-  Converters = { integer: lambda { |f|
-                   Integer(f.encode(ConverterEncoding)) rescue f
-                 },
-                 float: lambda { |f|
-                   Float(f.encode(ConverterEncoding)) rescue f
-                 },
+  module ConverterMethods
+    def self.integer(f)
+      Integer(f.encode(ConverterEncoding)) rescue f
+    end
+    def self.float(f)
+      Float(f.encode(ConverterEncoding)) rescue f
+    end
+    def self.date(f)
+      begin
+        e = f.encode(ConverterEncoding)
+        e =~ DateMatcher ? Date.parse(e) : f
+      rescue # encoding conversion or date parse errors
+        f
+      end
+    end
+    def self.date_time(f)
+      begin
+        e = f.encode(ConverterEncoding)
+        e =~ DateTimeMatcher ? DateTime.parse(e) : f
+      rescue # encoding conversion or date parse errors
+        f
+      end
+    end
+  end
+  Converters = { integer: ConverterMethods.method(:integer),
+                 float: ConverterMethods.method(:float),
                  numeric: [:integer, :float],
-                 date: lambda { |f|
-                   begin
-                     e = f.encode(ConverterEncoding)
-                     e =~ DateMatcher ? Date.parse(e) : f
-                   rescue # encoding conversion or date parse errors
-                     f
-                   end
-                 },
-                 date_time: lambda { |f|
-                   begin
-                     e = f.encode(ConverterEncoding)
-                     e =~ DateTimeMatcher ? DateTime.parse(e) : f
-                   rescue # encoding conversion or date parse errors
-                     f
-                   end
-                 },
+                 date: ConverterMethods.method(:date),
+                 date_time: ConverterMethods.method(:date_time),
                  all: [:date_time, :numeric] }
   #
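The swap in the patch works because a `Method` object offers the same calling interface as a lambda. Here is a small sketch of that equivalence. The module below borrows the name from the patch but is a simplified, hypothetical version (no encoding step), and the `converters` hash is made up for illustration.

```ruby
# Simplified, hypothetical version of the patch's module; not the csv.rb code.
module ConverterMethods
  def self.integer(field)
    Integer(field) rescue field
  end

  def self.float(field)
    Float(field) rescue field
  end
end

# A lambda and a Method object both respond to #call, #[] and #arity, so code
# that only invokes the converter does not care which one it is given.
converters = {
  integer_lambda: lambda { |f| Integer(f) rescue f },
  integer_method: ConverterMethods.method(:integer),
  float_method:   ConverterMethods.method(:float)
}

results = converters.map { |name, converter| [name, converter.call("7")] }.to_h
arities = converters.map { |name, converter| [name, converter.arity] }.to_h

puts results.inspect
puts arities.inspect
```

The point of the swap is that the body of each converter now lives in a real method, which the JIT could already compile at the time, while callers keep treating the hash values as callable objects.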
The converter logic that is causing all the exceptions is at lines 947 to 970 in f192460.
And it turns out this has come up before, but we never got a fix from csv. See #1816. Thanks @tenderlove for finding that!
Steps I followed to investigate this:
With this ugly patch we are slightly faster than MRI, but not as fast as with all stack traces disabled. It seems there's still some exception flow control happening that does not log.
This is a temporary fix to improve perf of converters since JRuby does not currently JIT blocks. See #3348.
This is a temporary fix to improve perf of converters since JRuby does not currently JIT blocks. See jruby/jruby#3348.
This is now fixed on master.
@enebo improved our compiler + exception logic to not raise exceptions when the nearest upstream "rescue" only returns a simple expression rather than capturing the exception and using its contents. See fb4dcb4, c6ce091, 5e0eece, 20acc1b, and 42278a5.
I finally got rootless blocks to JIT on their own, which allows the converters in csv.rb to run at full speed. For this case, it doesn't change perf a great deal (the exception backtrace fix was the big money). See 318c853 and 771fe3f. We reverted csv.rb to stock in b228d42.
See also e4727c3 and 06f1c2b, in which I modify some internally-thrown exceptions to not generate JVM stack traces. There may be more such cases where we use exceptions solely to unroll the stack back to a point where we re-raise them as a Ruby exception.
With all of the above work in place, JRuby runs the given CSV benchmarks faster than either MRI or Rubinius, by a fairly wide margin.
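The two rescue shapes mentioned above can be sketched as follows. This only illustrates the distinction between a rescue that ignores the exception and one that uses it. Exactly which patterns the compiler recognizes is determined by the commits listed, not by this example, and both method names are hypothetical.

```ruby
# Shape 1: the rescue returns a simple expression and never looks at the
# exception, so the program cannot observe its message or backtrace.
def to_integer_or_self(field)
  Integer(field) rescue field
end

# Shape 2: the rescue captures the exception and uses its contents, so a
# real exception object has to be available to the handler.
def to_integer_or_message(field)
  Integer(field)
rescue ArgumentError => e
  "not an integer: #{e.message}"
end

puts to_integer_or_self("12").inspect
puts to_integer_or_self("twelve").inspect
puts to_integer_or_message("twelve")
```

The csv.rb converters all have the first shape, which is why skipping the backtrace for that shape removes most of the cost without changing behavior.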
JRuby 9.0.1.0 is a few orders of magnitude slower than MRI 2.2.3 and RBX 2.5.8 in a simple CSV parsing experiment. Code to follow:
On my system (OS X 10.10.5, 24GB RAM, 1TB SSD), here are the results of the benchmarks for each runtime:
The JRuby runtime never went above 250MB of RAM usage, so it doesn't appear to be memory pressure.
An earlier attempt at deducing the cause had Kernel.Integer and Kernel.Float show up as heavy hitters on the profile.
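One way to check whether `Integer` and `Float` are expensive because of the exceptions they raise is to time a raising version against a non-raising one. The sketch below is a hypothetical micro-benchmark, not the code used in this report: the field values and iteration count are made up, and it relies on the `exception: false` keyword, which exists only in Ruby 2.6 and later and so was not available on the runtimes discussed here.

```ruby
# Hypothetical micro-benchmark; field values and iteration count are made up.
FIELDS = ["123", "4.5", "abc", "2015-09-21", ""].freeze
ITERATIONS = 20_000

# Raises (and rescues) once or twice for every field that is not numeric.
def with_rescue(field)
  Integer(field) rescue (Float(field) rescue field)
end

# Needs Ruby 2.6+, where Integer() and Float() accept exception: false
# and return nil instead of raising.
def without_raise(field)
  Integer(field, exception: false) || Float(field, exception: false) || field
end

# Wall-clock time of the given block, in seconds.
def time_it
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

rescue_time   = time_it { ITERATIONS.times { FIELDS.each { |f| with_rescue(f) } } }
no_raise_time = time_it { ITERATIONS.times { FIELDS.each { |f| without_raise(f) } } }

puts format("rescue-based: %.3fs, exception: false: %.3fs", rescue_time, no_raise_time)
```

Both helpers return the same values for these inputs, so any gap between the two timings on a given runtime comes from raising and rescuing rather than from the numeric parsing itself.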
The test file is available at this link:
https://www.dropbox.com/s/l5lze28kpd7dx8u/testfile.zip?dl=0