Investigate univocity performances #11

Closed
nrinaudo opened this issue Dec 29, 2015 · 14 comments

@nrinaudo (Owner) commented Dec 29, 2015
Univocity's performance is terrible, at least ten times slower than the slowest other engine in our benchmarks.

This is highly suspicious, as they have published benchmarks in which they outperformed all other parsers.

While this is not strictly a tabulate issue, it might be one with our benchmarks. It's worth investigating, possibly discussing with the univocity team, and fixing the benchmarks if needed.

If performance really is that bad, we should probably drop univocity from our benchmarks.

@jbax commented Jan 2, 2016

Hi, I'm the author of the library and found this issue on Google. Take care when building a benchmark around small files: by default, univocity-parsers runs with concurrent input reading enabled. If you run the benchmark on a small input, multiple times, all you are measuring is the time the parser spends waiting for the input thread to get ready.

If this is the case, disable the extra thread by calling setReadInputOnSeparateThread(false) on the parser settings object.

By default, the parser also allocates a 1 MB buffer at startup, which might be overkill if you are testing inputs with just a few dozen records. You can tweak this with the setInputBufferSize() method.

Lastly, keep in mind that the parser supports many different configurations, and depending on the configuration it will perform some initialization work, such as automatic detection of line endings. If you run a benchmark on small inputs, multiple times, you will effectively be testing the performance of that initialization rather than the parsing itself.
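
For illustration, a minimal sketch of how these two settings might be combined (the 16 KB buffer size is an arbitrary example value, not a recommendation):

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

CsvParserSettings settings = new CsvParserSettings();
// don't spawn a separate input-reading thread for small, repeated runs
settings.setReadInputOnSeparateThread(false);
// shrink the default 1 MB input buffer to something suited to tiny inputs
settings.setInputBufferSize(16 * 1024);
CsvParser parser = new CsvParser(settings);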

@nrinaudo (Owner) commented Jan 2, 2016

Thanks for taking the time to post this. You're right that univocity's default settings are a bad fit for what my benchmark is testing: after disabling the extra thread and setting the buffer size to something a bit more conservative, performance has improved drastically - still not the best of the lot (that'd be jackson), but at least part of the competition now.

I don't intend to disable automatic line ending detection - that wouldn't be fair to the other parsers I'm using, as they also have that feature enabled. I'll look into the other settings, though - if there's something like automatic quote character or column separator detection on by default, that's certainly unfair to univocity when all the other parsers know to expect " and ,.
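
As an illustration of what pinning the format could look like, a minimal sketch using the settings object's CsvFormat accessors (so no delimiter or quote detection ever has to run):

CsvParserSettings settings = new CsvParserSettings();
// fix the format up front so the parser never has to guess it
settings.getFormat().setDelimiter(',');
settings.getFormat().setQuote('"');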

@jbax commented Jan 3, 2016

A few other things that univocity does by default and other parsers don't:

  • leading/trailing whitespace removal on each parsed value
  • comment skipping
  • blank line skipping
  • unescaped quote handling
  • line ending detection that works by analysing the input, not by using the
    operating system's default line separator

Also, the parser is designed with the RowProcessor in mind. This is a callback interface used to delegate the processing of each parsed row to all sorts of custom requirements. It is generally slower to use the parser.parseNext() method, as it triggers a few calls to the RowProcessor under the hood; a sketch of the callback style follows.
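
For illustration, a minimal sketch of the callback style (method names as in univocity-parsers 2.x; older versions register the callback with setRowProcessor rather than setProcessor):

import java.io.StringReader;
import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.RowProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

CsvParserSettings settings = new CsvParserSettings();
// each parsed row is pushed to the callback instead of being pulled by the caller
settings.setProcessor(new RowProcessor() {
    public void processStarted(ParsingContext context) {}
    public void rowProcessed(String[] row, ParsingContext context) {
        // handle one parsed row here
    }
    public void processEnded(ParsingContext context) {}
});
new CsvParser(settings).parse(new StringReader("a,b,c\n1,2,3"));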

@nrinaudo (Owner) commented Jan 3, 2016

The other libraries I'm currently testing are:

  • my own, tabulate.
  • opencsv
  • jackson-csv
  • apache commons-csv
  • product-collections, which I believe uses opencsv as the underlying parser.

> leading/trailing whitespace removal on each parsed value

Technically, that's not RFC compliant ("Spaces are considered part of a field and should not be ignored", per RFC 4180). Is it something that I can disable?

> comment skipping

You're right that I'd want to disable that, as it's not something I'm testing against. Come to think of it, this is also something I should disable for the other libraries - Jackson also does it by default, I think, as does commons-csv. Not sure about opencsv; I'll need to check.

> blank line skipping

I think all the libraries I'm testing do that. I'll need to make sure, but I think I even have an explicit test for it.

> unescaped quote handling

This I know for a fact is handled by default by all the libraries I'm using except opencsv. I'm still pondering whether this is worth dropping opencsv for.

> line ending detection that works by analysing the input, not by using the operating system's default line separator

Is it anything fancier than accepting both LF and CRLF as row separators when not escaped or quoted? If so, I'd be quite interested in reading about it if you have links or source code to share. If not, that's also supported by default by all the libraries I'm benchmarking, and validated by an explicit test.

> Also, the parser is designed with the RowProcessor in mind.

Interesting. I am indeed calling parser.parseNext directly, as I need iterator-like access to the CSV rows. Looking at the code, though, it seems that in my case the parser uses an instance of NoopRowProcessor, and I'm assuming the cost is minimal, if not zero - I'd have thought this is the kind of dead code that the JIT in a primed JVM would remove altogether. Note that all benchmarks are executed 10 times and discarded before I actually start collecting metrics, so most JIT optimisations should have kicked in (and we can indeed see a large performance gain between the warmup iterations and the measured ones).
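
For reference, a minimal sketch of the iterator-like access I mean (beginParsing/parseNext, with a tiny inline input standing in for the benchmark data):

import java.io.StringReader;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

CsvParser parser = new CsvParser(new CsvParserSettings());
parser.beginParsing(new StringReader("a,b\n1,2\n3,4"));
String[] row;
while ((row = parser.parseNext()) != null) {
    // consume one row at a time, iterator-style
}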

@jbax commented Jan 4, 2016

> leading/trailing whitespace removal on each parsed value
> Technically, that's not RFC compliant ("Spaces are considered part of a field and should not be ignored", per RFC 4180).

Note that the RFC is just a proposal and not a standard. It's rarely followed, and there are many sorts of non-conforming CSV input out there.

> Is it something that I can disable?

settings.setIgnoreLeadingWhitespaces(false);
settings.setIgnoreTrailingWhitespaces(false);

> unescaped quote handling
> This I know for a fact is handled by default by all the libraries I'm using except opencsv. [...]

No parser except univocity handles this. Try this:

something,"a quoted value "with unescaped quotes" can be parsed", something

univocity will parse 3 values instead of blowing up:

  1. something
  2. a quoted value "with unescaped quotes" can be parsed
  3. something

To disable this behavior and get an exception instead, use settings.setParseUnescapedQuotes(false);
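
A minimal sketch that feeds that exact line to the parser to check the behaviour (expected values as in the list above):

import java.io.StringReader;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

CsvParser parser = new CsvParser(new CsvParserSettings());
parser.beginParsing(new StringReader(
    "something,\"a quoted value \"with unescaped quotes\" can be parsed\", something"));
String[] row = parser.parseNext();
// row[0] = "something"
// row[1] = "a quoted value \"with unescaped quotes\" can be parsed"
// row[2] = "something" (leading whitespace trimmed by default)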

> line ending detection that works by analysing the input, not by using the operating system's default line separator
> Is it anything fancier than accepting both LF and CRLF as row separators when not escaped or quoted? [...]

It just analyses the first loaded input buffer and tries to identify which line ending (CRLF, LF or CR) is present in the input. To make sure it doesn't run, use settings.setLineSeparatorDetectionEnabled(false).

> Also, the parser is designed with the RowProcessor in mind.
> Interesting. I am indeed calling parser.parseNext directly, as I need iterator-like access to the CSV rows. [...]

Try it and see for yourself. It is ~15% slower.
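
Putting together all of the settings mentioned in this thread, a sketch of a configuration for a like-for-like benchmark (the buffer size is the example value from earlier, not a recommendation):

CsvParserSettings settings = new CsvParserSettings();
settings.setReadInputOnSeparateThread(false);     // no separate input-reading thread
settings.setInputBufferSize(16 * 1024);           // example size; the default is 1 MB
settings.setIgnoreLeadingWhitespaces(false);      // keep whitespace, per the RFC
settings.setIgnoreTrailingWhitespaces(false);
settings.setParseUnescapedQuotes(false);          // throw on unescaped quotes instead
settings.setLineSeparatorDetectionEnabled(false); // skip line ending detection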

@nrinaudo (Owner) commented Jan 4, 2016

Mmm, you're right. I thought that by unescaped quote handling you meant something like something,a non quoted value with "quotes",something. Tabulate can handle your specific example, but it does break down if the character following the unescaped quote is a line break or a column separator.

As for the RowProcessor thing - you're the author, I'm sure you're right: it must be faster to use the callback-based API rather than the iterator-like one. I'm specifically benchmarking iterator-like access, though - and I must say, if I'm using univocity for a use case it wasn't specifically optimised for, the results are quite impressive.

@nrinaudo (Owner) commented Apr 9, 2016

All the gross misconfigurations have been fixed, and the best parsers and serializers, including univocity, now sit within the same very narrow performance bracket.

nrinaudo closed this as completed Apr 9, 2016

@jbax commented Apr 11, 2016

Thanks!

As an FYI, version 2.1.0 will be considerably faster than 2.0.2, and you can already test univocity-parsers-2.1.0-SNAPSHOT for parsing.

@nrinaudo (Owner)

I have set things up to be notified of new versions of the parsers I'm benchmarking, and will be sure to update the results as soon as 2.1.0 is released - although I'm not sure it can get much faster than it already is; there's not much room for improvement left :)

@jbax commented Apr 11, 2016

Well, it got at least 30% faster. I ran a preliminary test against some other parsers using the worldcitiespop.txt file: univocity parsed everything in 880 ms on my machine, while Jackson took ~1.1 seconds and opencsv ~1.9, so it might be worth testing again with your scenario.

@nrinaudo (Owner)

I'm not surprised about opencsv - my results show that 2.0.2 is already almost twice as fast - but faster than jackson is quite an achievement.

@jbax commented May 2, 2016

Version 2.1.0 has been released, including parsing and writing performance improvements. It should be way faster now.

@nrinaudo (Owner) commented May 2, 2016

Quite (see http://nrinaudo.github.io/kantan.csv/tut/benchmarks.html) - second only to jackson in my benchmarks now.

@jbax commented May 2, 2016

Thank you!

