Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
csv.rb uses postfix exceptions when converters are specified #1816
Comments
headius
added this to the
JRuby 9000 milestone
Jul 15, 2014
headius
added
stdlib
labels
Jul 15, 2014
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Show comment Hide comment
headius
Jul 15, 2014
Owner
Link to the converters, where the postfix rescues live: https://github.com/jruby/jruby/blob/master/lib/ruby/2.1/csv.rb#L947
Link to the convert_fields method: https://github.com/jruby/jruby/blob/master/lib/ruby/2.1/csv.rb#L2165
Note that if working on this, JRuby master is currently a little unstable...better to use JRuby 1.7.x
Link to the converters, where the postfix rescues live: https://github.com/jruby/jruby/blob/master/lib/ruby/2.1/csv.rb#L947 Link to the convert_fields method: https://github.com/jruby/jruby/blob/master/lib/ruby/2.1/csv.rb#L2165 Note that if working on this, JRuby master is currently a little unstable...better to use JRuby 1.7.x |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Show comment Hide comment
enebo
Jul 15, 2014
Owner
Admittedly, a JRuby peformance bug, but we also do not invoke lambdas the same speed as ordinary methods; So if anyone also wants to change converters to simple classes with a convert method it might be interesting to see the perf difference from that activity too (but postfix exception is the bigger target obviously).
Admittedly, a JRuby peformance bug, but we also do not invoke lambdas the same speed as ordinary methods; So if anyone also wants to change converters to simple classes with a convert method it might be interesting to see the perf difference from that activity too (but postfix exception is the bigger target obviously). |
Copying @JEG2. Modifying the Integer converter to use to_i and a nil check (as opposed to the postfix rescue) is around 25% faster in MRI too. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Show comment Hide comment
JEG2
Jul 15, 2014
But Integer()
does more than a nil
check, right?
>> Integer("bug")
ArgumentError: invalid value for Integer(): "bug"
from (irb):1:in `Integer'
from (irb):1
from /Users/jeg2/.rubies/ruby-2.1.2/bin/irb:11:in `<main>'
JEG2
commented
Jul 15, 2014
But >> Integer("bug")
ArgumentError: invalid value for Integer(): "bug"
from (irb):1:in `Integer'
from (irb):1
from /Users/jeg2/.rubies/ruby-2.1.2/bin/irb:11:in `<main>' |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Show comment Hide comment
headius
Jul 15, 2014
Owner
@JEG2 I believe @tenderlove and I came up with a reasonable no-exception "maybe convert" logic for Psych, but I don't recall exactly where it was. I believe it involved some simple regexp matches to filter out strings we know won't parse.
f.to_i || f
is almost reasonable, but it will allow through strings that only start with a digit.
@JEG2 I believe @tenderlove and I came up with a reasonable no-exception "maybe convert" logic for Psych, but I don't recall exactly where it was. I believe it involved some simple regexp matches to filter out strings we know won't parse.
|
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Show comment Hide comment
JEG2
commented
Jul 15, 2014
So do we need to change this in CSV to make JRuby perform better? |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Show comment Hide comment
tenderlove
Jul 15, 2014
Contributor
Looks like changing CSV makes MRI faster:
diff --git a/lib/csv.rb b/lib/csv.rb
index 60b22e7..af9c918 100644
--- a/lib/csv.rb
+++ b/lib/csv.rb
@@ -945,7 +945,13 @@ class CSV
# can be nested with other combo fields.
#
Converters = { integer: lambda { |f|
- Integer(f.encode(ConverterEncoding)) rescue f
+ begin
+ num = f.encode(ConverterEncoding)
+ return f unless num =~ /^[-+]?\d+$/
+ Integer(num)
+ rescue
+ f
+ end
},
float: lambda { |f|
Float(f.encode(ConverterEncoding)) rescue f
Before:
[aaron@higgins ruby (trunk)]$ ruby -Ilib -rbenchmark -rcsv -e 'data = "\"one\",\"two\",\"three\"\n\"1\",\"2\",\"3\"".encode("UTF-16BE") * 10_000; loop { puts Benchmark.measure { CSV.parse(data, converters: :integer) } }'
1.830000 0.020000 1.850000 ( 1.845980)
1.840000 0.010000 1.850000 ( 1.860188)
1.830000 0.000000 1.830000 ( 1.849151)
1.820000 0.010000 1.830000 ( 1.831595)
After:
[aaron@higgins ruby (trunk)]$ ruby -Ilib -rbenchmark -rcsv -e 'data = "\"one\",\"two\",\"three\"\n\"1\",\"2\",\"3\"".encode("UTF-16BE") * 10_000; loop { puts Benchmark.measure { CSV.parse(data, converters: :integer) } }'
1.460000 0.010000 1.470000 ( 1.473279)
1.440000 0.010000 1.450000 ( 1.466242)
1.460000 0.010000 1.470000 ( 1.470897)
1.500000 0.000000 1.500000 ( 1.506282)
1.460000 0.010000 1.470000 ( 1.476430)
Should make all rubies faster AFAICT.
Looks like changing CSV makes MRI faster: diff --git a/lib/csv.rb b/lib/csv.rb
index 60b22e7..af9c918 100644
--- a/lib/csv.rb
+++ b/lib/csv.rb
@@ -945,7 +945,13 @@ class CSV
# can be nested with other combo fields.
#
Converters = { integer: lambda { |f|
- Integer(f.encode(ConverterEncoding)) rescue f
+ begin
+ num = f.encode(ConverterEncoding)
+ return f unless num =~ /^[-+]?\d+$/
+ Integer(num)
+ rescue
+ f
+ end
},
float: lambda { |f|
Float(f.encode(ConverterEncoding)) rescue f Before:
After:
Should make all rubies faster AFAICT. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Show comment Hide comment
headius
Jul 15, 2014
Owner
@JEG2 @tenderlove Yes, as mentioned in #1816 (comment) fixing it makes MRI faster too. About 25% for MRI, and about 100x faster for JRuby. Seems like a win.
@JEG2 @tenderlove Yes, as mentioned in #1816 (comment) fixing it makes MRI faster too. About 25% for MRI, and about 100x faster for JRuby. Seems like a win. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Show comment Hide comment
headius
Jul 15, 2014
Owner
Ideally the other converters would not be so prone to throw-away exceptions too.
Ideally the other converters would not be so prone to throw-away exceptions too. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Show comment Hide comment
JEG2
Jul 15, 2014
I'll be honest and say that I'm less inclined to view this as a win.
Integer()
has the desired behavior here and trying to reimplement it sounds error prone to me. For example, @tenderlove's Regexp
misses this input: "tricky\n42"
. I think we can fix that by using /\A[+-]\d+\z/
instead, but I'm worried about how deep this rabbit hole goes.
However, if we need this, I guess we need it. I'm happy to apply a patch if someone wants to try and work out the equivalent code.
JEG2
commented
Jul 15, 2014
I'll be honest and say that I'm less inclined to view this as a win.
However, if we need this, I guess we need it. I'm happy to apply a patch if someone wants to try and work out the equivalent code. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Show comment Hide comment
JEG2
Jul 15, 2014
Just in case I wasn't very clear there, I feel we're trading simpler, easier to maintain code for faster code with this change. I've never felt the pain of slow CSV conversions, so it's hard for me to justify doing that. I will though, if needed.
JEG2
commented
Jul 15, 2014
Just in case I wasn't very clear there, I feel we're trading simpler, easier to maintain code for faster code with this change. I've never felt the pain of slow CSV conversions, so it's hard for me to justify doing that. I will though, if needed. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Show comment Hide comment
tenderlove
Jul 15, 2014
Contributor
For me, it's two wins: speed and easier to use debugging output. Check out the output with ruby -d
:
Before the patch:
[aaron@higgins ruby (trunk)]$ ruby -Ilib -d -rcsv -e 'data = "\"one\",\"two\",\"three\"\n\"1\",\"2\",\"3\"".encode("UTF-16BE") * 10; CSV.parse(data, converters: :integer)'
Exception `LoadError' at /Users/aaron/git/ruby/lib/rubygems.rb:1194 - cannot load such file -- rubygems/defaults/operating_system
Exception `LoadError' at /Users/aaron/git/ruby/lib/rubygems.rb:1203 - cannot load such file -- rubygems/defaults/ruby
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "3\"one"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "two"
Exception `ArgumentError' at /Users/aaron/git/ruby/lib/csv.rb:948 - invalid value for Integer(): "three"
After:
[aaron@higgins ruby (trunk)]$ ruby -Ilib -d -rcsv -e 'data = "\"one\",\"two\",\"three\"\n\"1\",\"2\",\"3\"".encode("UTF-16BE") * 10; CSV.parse(data, converters: :integer)'
Exception `LoadError' at /Users/aaron/git/ruby/lib/rubygems.rb:1194 - cannot load such file -- rubygems/defaults/operating_system
Exception `LoadError' at /Users/aaron/git/ruby/lib/rubygems.rb:1203 - cannot load such file -- rubygems/defaults/ruby
[aaron@higgins ruby (trunk)]$
For me, it's two wins: speed and easier to use debugging output. Check out the output with Before the patch:
After:
|
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Show comment Hide comment
headius
Jul 15, 2014
Owner
@tenderlove You win for best example of why rescue nil here is a problem. I always end up fixating on perf, but that output is completely unusable.
@tenderlove You win for best example of why rescue nil here is a problem. I always end up fixating on perf, but that output is completely unusable. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Show comment Hide comment
headius
Jul 15, 2014
Owner
@JEG2 The problem here is that Integer does something else we don't want: it raises an exception if it can't handle the input. Your use of rescue f
is basically hacking around Integer behavior we don't want, rather than implementing the behavior we do want. And rescue f
is well-known to be a really bad antipattern for Ruby...so bad matz wishes he hadn't added it.
If Integer returned false or nil when it can't parse, your logic would be simplest of all: Integer(f) || f
. Unfortunately that's not the case, and so we have to hack around the decision to raise a (heavy on all impls...some more than others) throw-away exception for every field that fails to convert.
Perhaps there's a feature request possible here, like a form of Integer that takes an option, or a more robust String#to_i. That's not in scope for me :-)
@JEG2 The problem here is that Integer does something else we don't want: it raises an exception if it can't handle the input. Your use of If Integer returned false or nil when it can't parse, your logic would be simplest of all: Perhaps there's a feature request possible here, like a form of Integer that takes an option, or a more robust String#to_i. That's not in scope for me :-) |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Show comment Hide comment
JEG2
commented
Jul 15, 2014
As I said, I'll apply a patch. |
@JEG2 Great, thank you very much! I'll leave this open until we can import the modified version from MRI. |
enebo
modified the milestone:
JRuby 9.0.0.0
Jul 14, 2015
headius
added this to the JRuby 1.7.23 milestone
Sep 25, 2015
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Show comment Hide comment
Marking to fix for 1.7.23 (and 9.0.2.0). |
enebo
modified the milestones:
JRuby 1.7.23,
JRuby 1.7.24
Nov 24, 2015
enebo
modified the milestones:
JRuby 1.7.24,
JRuby 1.7.25
Jan 20, 2016
enebo
modified the milestones:
Invalid or Duplicate,
JRuby 1.7.25
Apr 11, 2016
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Show comment Hide comment
enebo
Apr 11, 2016
Owner
I am closing this. We did indirectly address performance issue itself on 9k already without applying the better error reporting logic which was mentioned above. I think the train for improving general performance on 1.7 should be spent on 9k at this point. It's dead...
I am closing this. We did indirectly address performance issue itself on 9k already without applying the better error reporting logic which was mentioned above. I think the train for improving general performance on 1.7 should be spent on 9k at this point. It's dead... |
headius commentedJul 15, 2014
See code in csv.rb, in CSV#convert_fields and the lambda-based converters around line 947.
Simple benchmark to show the devastating effect:
Without converters, we perform about like MRI.