Investigate univocity performance #11
Hi, I'm the author of the library and found this issue on Google. Take care when building a benchmark using small files.

By default, univocity-parsers executes with concurrent input reading enabled. If you run the benchmark with a small input, multiple times, all you are testing is the time the parser waits for the input thread to get ready. If this is the case, disable the extra thread in the parser settings.

By default it also allocates a 1MB buffer at startup, which might be overkill if you are testing inputs with just a few dozen records. You can tweak this by reducing the buffer size in the settings.

Lastly, keep in mind the parser supports a lot of different configurations, and depending on the configuration it will perform some initialization such as automatic detection of line endings. If you run a benchmark with small inputs, multiple times, you will effectively be testing the performance of the initialization process and not the parsing itself.
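For reference, a minimal sketch of the two settings mentioned above, based on my understanding of the univocity-parsers 2.x API (the method names are assumptions on my part and worth verifying against the version under test):

```java
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class SmallInputSettings {
    public static CsvParser newParser() {
        CsvParserSettings settings = new CsvParserSettings();
        // Don't read the input on a separate thread; with tiny files the benchmark
        // would otherwise mostly measure the wait for that thread to get ready.
        settings.setReadInputOnSeparateThread(false);
        // Shrink the default 1MB input buffer when parsing only a few dozen records.
        settings.setInputBufferSize(16 * 1024);
        return new CsvParser(settings);
    }
}
```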
Thanks for taking the time to post this. You're right, though I don't intend to disable automatic line ending detection - that wouldn't be fair to the other parsers I'm using, as they also have that feature enabled. I'll look into the other settings though - if there's something like automatic quote character or column separator detection on by default, that's certainly unfair to univocity.
A few other things that univocity does by default and other parsers don't:
Also, the parser is designed with the use of the RowProcessor in mind.
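As a rough illustration of that usage pattern, here is a sketch; the class and method names are my recollection of the univocity-parsers 2.x API rather than anything stated in this thread, so double-check them:

```java
import com.univocity.parsers.common.processor.RowListProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.FileReader;

public class RowProcessorExample {
    public static void main(String[] args) throws Exception {
        CsvParserSettings settings = new CsvParserSettings();
        // Push every parsed row through a RowProcessor instead of pulling rows one by one.
        RowListProcessor processor = new RowListProcessor();
        settings.setRowProcessor(processor);

        // The file name is just a placeholder for whatever input the benchmark uses.
        new CsvParser(settings).parse(new FileReader("worldcitiespop.txt"));
        System.out.println("Parsed " + processor.getRows().size() + " rows");
    }
}
```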
The other libraries I'm currently testing include Jackson, commons-csv and opencsv.
Technically, that's not RFC compliant.
You're right that I'd want to disable that, as it's not something that I'm testing against. Come to think of it, this is also something I should disable for the other libraries - Jackson also does it by default, I think, as well as commons-csv. Not sure about opencsv, I'll need to check.
I think all libraries I'm testing do that. I'll need to make sure, but I think I even have an explicit test against it.
This I know for a fact is handled by default by all libraries I'm using except opencsv. I'm still pondering whether this is worth dropping opencsv for.
Is it anything more fancy than accepting both LF and CRLF as row separators when not escaped or quoted? If so, I'd be quite interested in reading about that if you have links / source code you can share. If not, that's also supported by default by all the libraries I'm benchmarking, and validated by an explicit test.
Interesting. I am indeed calling the parser directly, row by row, rather than going through a RowProcessor.
Note that the RFC is just a proposal and not a standard. It's rarely followed, and there are many sorts of non-conforming CSV input out there.
No parser except univocity handles unescaped quotes inside a quoted field: given such an input, univocity will parse 3 values instead of blowing up. To disable this behavior and get an exception instead, turn the corresponding option off in the parser settings.
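A minimal sketch of what I understand this to mean; the sample input is my own, and `setParseUnescapedQuotes` is my recollection of the 2.x settings API, so both should be double-checked:

```java
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.util.Arrays;

public class UnescapedQuoteExample {
    public static void main(String[] args) {
        // Hypothetical row whose second field contains unescaped quotes.
        String line = "a,\"value with \"unescaped\" quotes\",c";

        // Default settings: expected to recover and yield 3 values.
        String[] row = new CsvParser(new CsvParserSettings()).parseLine(line);
        System.out.println(row.length + " values: " + Arrays.toString(row));

        // Stricter settings: expected to throw instead of silently recovering.
        CsvParserSettings strict = new CsvParserSettings();
        strict.setParseUnescapedQuotes(false);
        new CsvParser(strict).parseLine(line);
    }
}
```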
It just analyses the first loaded input buffer and tries to identify which line ending (CRLF, LF or CR) is present in the input. To make sure it doesn't run, set the line separator explicitly in the parser settings instead of relying on detection.
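If I remember the API correctly (again worth verifying against the actual release), that would look something like this:

```java
import com.univocity.parsers.csv.CsvParserSettings;

public class FixedLineEndings {
    public static CsvParserSettings newSettings() {
        CsvParserSettings settings = new CsvParserSettings();
        // Skip auto-detection of CRLF/LF/CR and declare the line ending up front
        // (method names assumed from the 2.x API).
        settings.setLineSeparatorDetectionEnabled(false);
        settings.getFormat().setLineSeparator("\n");
        return settings;
    }
}
```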
Try and see for yourself. It is ~15% slower.
Mmm, you're right. I thought that by unescaped quote handling, you meant something else entirely.
All the best parsers and serializers, including univocity, are now in the same very small performance bracket. All the gross misconfigurations are now fixed.
Thanks! FYI, version 2.1.0 will be considerably faster than 2.0.2, and you can already test univocity-parsers-2.1.0-SNAPSHOT for parsing.
I have set things up to be notified of new versions of the parsers I'm benchmarking and will be sure to update the results as soon as 2.1.0 is released - although I'm not sure it can get much faster than it already is; there's not much room for improvement left :)
Well, it got at least 30% faster. I ran a preliminary test against some other parsers using the worldcitiespop.txt file and it parsed everything in 880ms on my machine, while Jackson took ~1.1 seconds and opencsv ~1.9 seconds, so it might be good to test again using your test scenario.
I'm not surprised about opencsv - my results show that 2.0.2 is already almost twice as fast - but faster than Jackson is quite an achievement.
Version 2.1.0 has been released, with parsing and writing performance improvements. It should be way faster now.
Quite, second only to Jackson in my benchmarks now.
Thank you!
Univocity's performance is horrible - at least ten times slower than the slowest engine in our benchmarks.
This is highly suspicious, as they published benchmarks in which they were performing better than all other parsers.
While this is not strictly a tabulate issue, it might be one with our benchmarks. It might be worth investigating, possibly discussing with the univocity team, and fixing the benchmarks if needed. If the performance really is that bad, we should probably drop univocity from our benchmarks.