Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV parsing performance is poor #3348

Closed
chuckremes opened this issue Sep 25, 2015 · 7 comments
Closed

CSV parsing performance is poor #3348

chuckremes opened this issue Sep 25, 2015 · 7 comments

Comments

@chuckremes
Copy link

JRuby 9.0.1.0 is a few orders of magnitude slower than MRI 2.2.3 and RBX 2.5.8 in a simple CSV parsing experiment. Code to follow:

require 'benchmark'
require 'csv'

FILE = "./testfile.csv"

Benchmark.bmbm(22) do |x|
  x.report('parse CSV') do
    io = File.open(FILE)
    csv = CSV.new(io, :converters => :all)

    while line = csv.gets
    end

    io.close
  end
end

On my system (OS X 10.10.5, 24GB RAM, 1TB SSD), here are the results of the benchmarks for each runtime:

GuestOSX:options_database cremes$ ruby -v
ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-darwin14]
GuestOSX:options_database cremes$ ruby benchmarks.rb 
Rehearsal ----------------------------------------------------------
parse CSV                7.910000   0.010000   7.920000 (  7.924954)
------------------------------------------------- total: 7.920000sec

                             user     system      total        real
parse CSV                8.040000   0.020000   8.060000 (  8.053098)

GuestOSX:options_database cremes$ chruby jruby
GuestOSX:options_database cremes$ ruby -v
jruby 9.0.1.0 (2.2.2) 2015-09-02 583f336 Java HotSpot(TM) 64-Bit Server VM 25.60-b23 on 1.8.0_60-b27 +jit [darwin-x86_64]
GuestOSX:options_database cremes$ ruby -J-Xmx2g benchmarks.rb 
Rehearsal ----------------------------------------------------------
parse CSV              238.740000   1.480000 240.220000 (233.007993)
----------------------------------------------- total: 240.220000sec

                             user     system      total        real
parse CSV              229.210000   1.080000 230.290000 (227.212325)
GuestOSX:options_database cremes$ chruby rbx
GuestOSX:options_database cremes$ ruby -v
rubinius 2.5.8 (2.1.0 bef51ae3 2015-09-24 3.5.1 JI) [x86_64-darwin14.5.0]
GuestOSX:options_database cremes$ ruby benchmarks.rb 
Rehearsal ----------------------------------------------------------
parse CSV               16.264571   0.161624  16.426195 ( 10.562584)
------------------------------------------------ total: 16.426195sec

                             user     system      total        real
parse CSV                9.084859   0.033108   9.117967 (  9.010402)

The JRuby runtime never went above 250MB of RAM usage, so it doesn't appear to be memory pressure.

An earlier attempt at deducing the cause had Kernel.Integer and Kernel.Float show up as heavy hitters on the profile.

The test file is available at this link:
https://www.dropbox.com/s/l5lze28kpd7dx8u/testfile.zip?dl=0

@headius
Copy link
Member

headius commented Sep 25, 2015

A quick profile indicated that most time is being spent generating stack traces, and digging deeper I found that :converters => :all is the culprit. The CSV lib chains a series of procs together to do value conversion, using exceptions to trigger the next proc in the chain. Kernel#Integer appears to be the main culprit. At worst, getting through the converters will trigger up to four exception raises before the value remains unconverted.

Also, the converter procs don't JIT in 9k because we don't JIT blocks (yet). We'll need to fix that to get full perf out of this.

So yeah, it's basically a bug...a really bad algorithm for conversion in CSV.

@headius
Copy link
Member

headius commented Sep 25, 2015

I made two hacks to get a better picture of performance:

  1. Disable stack traces for the JVM (temporary fix for exception-based flow happening here).
  2. Turn the converters into normal methods (closures have overhead in all impls).

With those changes, JRuby is now back to being the fastest. The difference in perf between :converters => :all and no converters is still enormous though; I'd recommend you find some way to do the conversions yourself rather than letting csv.rb do it.

JRuby:

Rehearsal ----------------------------------------------------------
parse CSV               17.040000   0.370000  17.410000 ( 11.373649)
------------------------------------------------ total: 17.410000sec

                             user     system      total        real
parse CSV               10.780000   0.100000  10.880000 ( 10.013685)
Rehearsal ----------------------------------------------------------
parse CSV               10.410000   0.090000  10.500000 ( 10.075584)
------------------------------------------------ total: 10.500000sec

                             user     system      total        real
parse CSV               10.070000   0.100000  10.170000 (  9.992680)
Rehearsal ----------------------------------------------------------
parse CSV                9.970000   0.080000  10.050000 (  9.612474)
------------------------------------------------ total: 10.050000sec

                             user     system      total        real
parse CSV                9.860000   0.070000   9.930000 (  9.817421)

MRI:

Rehearsal ----------------------------------------------------------
parse CSV               15.060000   0.120000  15.180000 ( 15.198348)
------------------------------------------------ total: 15.180000sec

                             user     system      total        real
parse CSV               15.030000   0.110000  15.140000 ( 15.171767)
Rehearsal ----------------------------------------------------------
parse CSV               15.420000   0.140000  15.560000 ( 15.680886)
------------------------------------------------ total: 15.560000sec

                             user     system      total        real
parse CSV               15.030000   0.090000  15.120000 ( 15.154154)
Rehearsal ----------------------------------------------------------
parse CSV               15.150000   0.110000  15.260000 ( 15.282513)
------------------------------------------------ total: 15.260000sec

                             user     system      total        real
parse CSV               15.380000   0.140000  15.520000 ( 15.563353)

Rubinius:

Rehearsal ----------------------------------------------------------
parse CSV               23.919160   0.151872  24.071032 ( 15.913175)
------------------------------------------------ total: 24.071032sec

                             user     system      total        real
parse CSV               14.293140   0.074804  14.367944 ( 14.319268)
Rehearsal ----------------------------------------------------------
parse CSV               14.721981   0.072734  14.794715 ( 14.502657)
------------------------------------------------ total: 14.794715sec

                             user     system      total        real
parse CSV               13.755248   0.060286  13.815534 ( 13.843445)

All three ran with the same modified copy of csv.rb, but MRI and Rubinius were still generating their backtraces (albeit at a MUCH lower cost than JRuby).

@headius
Copy link
Member

headius commented Sep 25, 2015

Temporary patch for csv to use methods instead of procs:

diff --git a/lib/ruby/stdlib/csv.rb b/lib/ruby/stdlib/csv.rb
index 54b820d..d36550a 100644
--- a/lib/ruby/stdlib/csv.rb
+++ b/lib/ruby/stdlib/csv.rb
@@ -944,29 +944,35 @@ class CSV
   # To add a combo field, the value should be an Array of names.  Combo fields
   # can be nested with other combo fields.
   #
-  Converters  = { integer:   lambda { |f|
-                    Integer(f.encode(ConverterEncoding)) rescue f
-                  },
-                  float:     lambda { |f|
-                    Float(f.encode(ConverterEncoding)) rescue f
-                  },
+  module ConverterMethods
+    def self.integer(f)
+      Integer(f.encode(ConverterEncoding)) rescue f
+    end
+    def self.float(f)
+      Float(f.encode(ConverterEncoding)) rescue f
+    end
+    def self.date(f)
+      begin
+        e = f.encode(ConverterEncoding)
+        e =~ DateMatcher ? Date.parse(e) : f
+      rescue  # encoding conversion or date parse errors
+        f
+      end
+    end
+    def self.date_time(f)
+      begin
+        e = f.encode(ConverterEncoding)
+        e =~ DateTimeMatcher ? DateTime.parse(e) : f
+      rescue  # encoding conversion or date parse errors
+        f
+      end
+    end
+  end
+  Converters  = { integer:   ConverterMethods.method(:integer),
+                  float:     ConverterMethods.method(:float),
                   numeric:   [:integer, :float],
-                  date:      lambda { |f|
-                    begin
-                      e = f.encode(ConverterEncoding)
-                      e =~ DateMatcher ? Date.parse(e) : f
-                    rescue  # encoding conversion or date parse errors
-                      f
-                    end
-                  },
-                  date_time: lambda { |f|
-                    begin
-                      e = f.encode(ConverterEncoding)
-                      e =~ DateTimeMatcher ? DateTime.parse(e) : f
-                    rescue  # encoding conversion or date parse errors
-                      f
-                    end
-                  },
+                  date:      ConverterMethods.method(:date),
+                  date_time: ConverterMethods.method(:date_time),
                   all:       [:date_time, :numeric] }

   #

@headius
Copy link
Member

headius commented Sep 25, 2015

Converter logic that is causing all the exceptions is at

Converters = { integer: lambda { |f|
Integer(f.encode(ConverterEncoding)) rescue f
},
float: lambda { |f|
Float(f.encode(ConverterEncoding)) rescue f
},
numeric: [:integer, :float],
date: lambda { |f|
begin
e = f.encode(ConverterEncoding)
e =~ DateMatcher ? Date.parse(e) : f
rescue # encoding conversion or date parse errors
f
end
},
date_time: lambda { |f|
begin
e = f.encode(ConverterEncoding)
e =~ DateTimeMatcher ? DateTime.parse(e) : f
rescue # encoding conversion or date parse errors
f
end
},
all: [:date_time, :numeric] }

@headius
Copy link
Member

headius commented Sep 25, 2015

And it turns out this has come up before, but we never got a fix from csv. See #1816. Thanks @tenderlove for finding that!

@headius
Copy link
Member

headius commented Sep 25, 2015

Steps I followed to investigate this:

  1. Ran original bench with no flags to confirm problem.
  2. Ran bench with --sample (sampled profiler) and discovered stack trace generation was the biggest problem.
  3. Ran with -Xlog.backtraces=true to get traces for the exceptions and found that they were from csv.rb and Kernel conversion methods.
  4. Ran with -J-XX:MaxJavaStackTraceLevel=0 to disable stack traces in JVM. This fixed 90% of the performance issue.

With this ugly patch we are slightly faster than MRI, but not as fast as with all stack traces disabled. It seems there's still some exception flow control happening that does not log.

headius added a commit that referenced this issue Sep 25, 2015
This is a temporary fix to improve perf of converters since JRuby
does not currently JIT blocks.

See #3348.
headius added a commit that referenced this issue Sep 25, 2015
This is a temporary fix to improve perf of converters since JRuby
does not currently JIT blocks.

See #3348.
headius added a commit to jruby/ruby that referenced this issue Sep 25, 2015
This is a temporary fix to improve perf of converters since JRuby
does not currently JIT blocks.

See jruby/jruby#3348.
headius added a commit to jruby/ruby that referenced this issue Sep 25, 2015
This is a temporary fix to improve perf of converters since JRuby
does not currently JIT blocks.

See jruby/jruby#3348.
@headius headius added this to the JRuby 9.0.2.0 milestone Oct 2, 2015
@headius
Copy link
Member

headius commented Oct 2, 2015

This is now fixed on master.

@enebo improved our compiler + exception logic to not raise exceptions when the nearest upstream "rescue" only returns a simple expression rather than capturing the exception and using its contents. See fb4dcb4, c6ce091, 5e0eece, 20acc1b, and 42278a5.

I finally got rootless blocks to JIT on their own, which allows the converters in csv.rb to have full speed performance. For this case, it doesn't change perf a great deal (the exception backtrace fix was the big money). See 318c853 and 771fe3f.

We reverted csv.rb to stock in b228d42.

See also e4727c3 and 06f1c2b in which I modify some internally-thrown exceptions to not generate JVM stack traces. There may be more such cases where we use exceptions solely to unroll stack back to a point where we re-raise them as a Ruby exception.

With all above work in place, JRuby runs the given CSV benchmarks faster than either MRI or Rubinius, by a fairly wide margin.

@headius headius closed this as completed Oct 2, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants