Investigate univocity performances #11

Closed
nrinaudo opened this issue Dec 29, 2015 · 14 comments

@nrinaudo (Owner) commented Dec 29, 2015
Univocity's performance is terrible, at least ten times slower than the slowest other engine in our benchmarks.

This is highly suspicious, as they have published benchmarks in which they outperformed all other parsers.

While this is not strictly a tabulate issue, it might be one with our benchmarks. It's worth investigating, possibly discussing with the univocity team, and fixing the benchmarks if needed.

If performance really is that bad, we should probably drop univocity from our benchmarks.

@jbax commented Jan 2, 2016

Hi, I'm the author of the library and found this issue on Google. Take care when building a benchmark around small files: by default, univocity-parsers runs with concurrent input reading enabled. If you run the benchmark on a small input, multiple times, all you are measuring is the time the parser spends waiting for the input thread to get ready.

If this is the case, disable the extra thread by calling setReadInputOnSeparateThread(false) on the parser settings object.

By default, the parser also allocates a 1 MB buffer at startup, which might be overkill if you are testing inputs with just a few dozen records. You can tweak this with the setInputBufferSize() method.

Lastly, keep in mind that the parser supports many different configurations, and depending on the configuration it will perform some initialization work, such as automatic detection of line endings. If you run a benchmark on small inputs, multiple times, you will effectively be testing the performance of that initialization rather than the parsing itself.
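
For illustration, a minimal sketch of how these two settings might be combined (the 16 KB buffer size is an arbitrary example value, not a recommendation):

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

CsvParserSettings settings = new CsvParserSettings();
// don't spawn a separate input-reading thread for small, repeated runs
settings.setReadInputOnSeparateThread(false);
// shrink the default 1 MB input buffer to something suited to tiny inputs
settings.setInputBufferSize(16 * 1024);
CsvParser parser = new CsvParser(settings);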

@nrinaudo (Owner) commented Jan 2, 2016

Thanks for taking the time to post this. You're right that univocity's default settings are a bad fit for what my benchmark is testing: after disabling the extra thread and setting the buffer size to something a bit more conservative, performance has improved drastically - still not the best of the lot (that'd be jackson), but at least part of the competition now.

I don't intend to disable automatic line ending detection - that wouldn't be fair to the other parsers I'm using, as they also have that feature enabled. I'll look into the other settings, though - if there's something like automatic quote character or column separator detection on by default, that's certainly unfair to univocity when all the other parsers know to expect " and ,.
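
As an illustration of what pinning the format could look like, a minimal sketch using the settings object's CsvFormat accessors (so no delimiter or quote detection ever has to run):

CsvParserSettings settings = new CsvParserSettings();
// fix the format up front so the parser never has to guess it
settings.getFormat().setDelimiter(',');
settings.getFormat().setQuote('"');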

@jbax commented Jan 3, 2016

A few other things that univocity does by default and other parsers don't:

  • leading/trailing whitespace removal on each parsed value
  • comment skipping
  • blank line skipping
  • unescaped quote handling
  • line ending detection that works by analysing the input, not by using the
    operating system's default line separator

Also, the parser is designed with the RowProcessor in mind. This is a callback interface used to delegate the processing of each parsed row to all sorts of custom requirements. It is generally slower to use the parser.parseNext() method, as it triggers a few calls to the RowProcessor under the hood; a sketch of the callback style follows.
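
For illustration, a minimal sketch of the callback style (method names as in univocity-parsers 2.x; older versions register the callback with setRowProcessor rather than setProcessor):

import java.io.StringReader;
import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.RowProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

CsvParserSettings settings = new CsvParserSettings();
// each parsed row is pushed to the callback instead of being pulled by the caller
settings.setProcessor(new RowProcessor() {
    public void processStarted(ParsingContext context) {}
    public void rowProcessed(String[] row, ParsingContext context) {
        // handle one parsed row here
    }
    public void processEnded(ParsingContext context) {}
});
new CsvParser(settings).parse(new StringReader("a,b,c\n1,2,3"));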

@nrinaudo (Owner) commented Jan 3, 2016

The other libraries I'm currently testing are:

  • my own, tabulate.
  • opencsv
  • jackson-csv
  • apache commons-csv
  • product-collections, which I believe uses opencsv as the underlying parser.

> leading/trailing whitespace removal on each parsed value

Technically, that's not RFC compliant ("Spaces are considered part of a field and should not be ignored", per RFC 4180). Is it something that I can disable?

> comment skipping

You're right that I'd want to disable that, as it's not something I'm testing against. Come to think of it, this is also something I should disable for the other libraries - Jackson also does it by default, I think, as does commons-csv. Not sure about opencsv; I'll need to check.

> blank line skipping

I think all the libraries I'm testing do that. I'll need to make sure, but I think I even have an explicit test for it.

> unescaped quote handling

This I know for a fact is handled by default by all the libraries I'm using except opencsv. I'm still pondering whether this is worth dropping opencsv for.

> line ending detection that works by analysing the input, not by using the operating system's default line separator

Is it anything fancier than accepting both LF and CRLF as row separators when not escaped or quoted? If so, I'd be quite interested in reading about it if you have links or source code to share. If not, that's also supported by default by all the libraries I'm benchmarking, and validated by an explicit test.

> Also, the parser is designed with the RowProcessor in mind.

Interesting. I am indeed calling parser.parseNext directly, as I need iterator-like access to the CSV rows. Looking at the code, though, it seems that in my case the parser uses an instance of NoopRowProcessor, and I'm assuming the cost is minimal, if not zero - I'd have thought this is the kind of dead code that the JIT in a primed JVM would remove altogether. Note that all benchmarks are executed 10 times and discarded before I actually start collecting metrics, so most JIT optimisations should have kicked in (and we can indeed see a large performance gain between the warmup iterations and the measured ones).
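
For reference, a minimal sketch of the iterator-like access I mean (beginParsing/parseNext, with a tiny inline input standing in for the benchmark data):

import java.io.StringReader;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

CsvParser parser = new CsvParser(new CsvParserSettings());
parser.beginParsing(new StringReader("a,b\n1,2\n3,4"));
String[] row;
while ((row = parser.parseNext()) != null) {
    // consume one row at a time, iterator-style
}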

@jbax commented Jan 4, 2016

> leading/trailing whitespace removal on each parsed value
> Technically, that's not RFC compliant ("Spaces are considered part of a field and should not be ignored", per RFC 4180).

Note that the RFC is just a proposal and not a standard. It's rarely followed, and there are many sorts of non-conforming CSV input out there.

> Is it something that I can disable?

settings.setIgnoreLeadingWhitespaces(false);
settings.setIgnoreTrailingWhitespaces(false);

> unescaped quote handling
> This I know for a fact is handled by default by all the libraries I'm using except opencsv. [...]

No parser except univocity handles this. Try this:

something,"a quoted value "with unescaped quotes" can be parsed", something

univocity will parse 3 values instead of blowing up:

  1. something
  2. a quoted value "with unescaped quotes" can be parsed
  3. something

To disable this behavior and get an exception instead, use settings.setParseUnescapedQuotes(false);
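
A minimal sketch that feeds that exact line to the parser to check the behaviour (expected values as in the list above):

import java.io.StringReader;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

CsvParser parser = new CsvParser(new CsvParserSettings());
parser.beginParsing(new StringReader(
    "something,\"a quoted value \"with unescaped quotes\" can be parsed\", something"));
String[] row = parser.parseNext();
// row[0] = "something"
// row[1] = "a quoted value \"with unescaped quotes\" can be parsed"
// row[2] = "something" (leading whitespace trimmed by default)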

> line ending detection that works by analysing the input, not by using the operating system's default line separator
> Is it anything fancier than accepting both LF and CRLF as row separators when not escaped or quoted? [...]

It just analyses the first loaded input buffer and tries to identify which line ending (CRLF, LF or CR) is present in the input. To make sure it doesn't run, use settings.setLineSeparatorDetectionEnabled(false).

> Also, the parser is designed with the RowProcessor in mind.
> Interesting. I am indeed calling parser.parseNext directly, as I need iterator-like access to the CSV rows. [...]

Try it and see for yourself. It is ~15% slower.
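
Putting together all of the settings mentioned in this thread, a sketch of a configuration for a like-for-like benchmark (the buffer size is the example value from earlier, not a recommendation):

CsvParserSettings settings = new CsvParserSettings();
settings.setReadInputOnSeparateThread(false);     // no separate input-reading thread
settings.setInputBufferSize(16 * 1024);           // example size; the default is 1 MB
settings.setIgnoreLeadingWhitespaces(false);      // keep whitespace, per the RFC
settings.setIgnoreTrailingWhitespaces(false);
settings.setParseUnescapedQuotes(false);          // throw on unescaped quotes instead
settings.setLineSeparatorDetectionEnabled(false); // skip line ending detection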

@nrinaudo (Owner) commented Jan 4, 2016

Mmm, you're right. I thought that by unescaped quote handling you meant something like something,a non quoted value with "quotes",something. Tabulate can handle your specific example, but it does break down if the character following the unescaped quote is a line break or a column separator.

As for the RowProcessor thing - you're the author, I'm sure you're right: it must be faster to use the callback-based API rather than the iterator-like one. I'm specifically benchmarking iterator-like access, though - and I must say, if I'm using univocity for a use case it wasn't specifically optimised for, the results are quite impressive.

@nrinaudo (Owner) commented Apr 9, 2016

All the gross misconfigurations have been fixed, and the best parsers and serializers, including univocity, now sit within the same very narrow performance bracket.

nrinaudo closed this as completed Apr 9, 2016

@jbax commented Apr 11, 2016

Thanks!

As an FYI, version 2.1.0 will be considerably faster than 2.0.2, and you can already test univocity-parsers-2.1.0-SNAPSHOT for parsing.

@nrinaudo (Owner)

I have set things up to be notified of new versions of the parsers I'm benchmarking, and will be sure to update the results as soon as 2.1.0 is released - although I'm not sure it can get much faster than it already is; there's not much room for improvement left :)

@jbax commented Apr 11, 2016

Well, it got at least 30% faster. I ran a preliminary test against some other parsers using the worldcitiespop.txt file: univocity parsed everything in 880 ms on my machine, while Jackson took ~1.1 seconds and opencsv ~1.9, so it might be worth testing again with your scenario.

@nrinaudo (Owner)

I'm not surprised about opencsv - my results show that 2.0.2 is already almost twice as fast - but faster than jackson is quite an achievement.

@jbax commented May 2, 2016

Version 2.1.0 has been released, including parsing and writing performance improvements. It should be way faster now.

@nrinaudo (Owner) commented May 2, 2016

Quite (see http://nrinaudo.github.io/kantan.csv/tut/benchmarks.html) - second only to jackson in my benchmarks now.

@jbax commented May 2, 2016

Thank you!

