Flaw in the benchmark? #2
i started by using Benchmark.js as well, but its results were always significantly slower than what i measured in a raw loop. so i wrote something that seemed to have lower overhead. i tried https://github.com/tinylibs/tinybench just to make sure i wasn't crazy, and it gave me the results i expected, faster than Benchmark.js. maybe it's the runner; maybe we can dig a bit further to see if it's just that or something else.
A benchmark runner with more overhead would just make all (absolute) results equally lower, right? But I don't think it would impact how the results compare to each other (relatively).
i will try to dig in a bit more and would appreciate help here. this benchmark loop is super straightforward, so 🤷. i think a lot depends on how much breathing room the bench runner gives to the GC between each cycle, in combo with how much each lib stresses the GC. can you try tinybench and see what you get?
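For reference, a minimal tinybench cross-check of the kind being discussed; this is a sketch, assuming tinybench's CommonJS build and using placeholder CSV input rather than the thread's actual datasets:

```js
const { Bench } = require('tinybench');
const Papa = require('papaparse');
const { inferSchema, initParser } = require('udsv');

const csvStr = 'a,b\n1,2\n3,4'; // placeholder; load a real dataset here

(async () => {
  const bench = new Bench({ time: 1000 });

  bench
    .add('uDSV typedDeep', () => {
      const parser = initParser(inferSchema(csvStr));
      parser.typedDeep(csvStr);
    })
    .add('PapaParse objects', () => {
      Papa.parse(csvStr, { header: true });
    });

  await bench.run();
  console.table(bench.table()); // ops/sec, margin, number of samples per task
})();
```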
i think your Papa Parse numbers are worse because you're using `header: true`. the difference is huge: https://github.com/leeoniya/uDSV/tree/main/bench#output-formats

it's a legit setting for the nested parsing comparison, but as i wrote in the csv42 repo, Papa Parse + flat does not produce valid output, so my benchmarks do not compare Papa for the structured case.
The `header: true` setting determines the shape of the output. If you mix these things in a single benchmark, you're comparing apples with pears. A useful benchmark should compare different ways to get to the same output. And it should create different categories of benchmarks for different options resulting in other types of output (unless it explicitly wants to show the impact of different output types, like here, of course). Otherwise the benchmark is meaningless and misleading for users who want to compare library X and Y for their use case Z.
Can you share what you tried? I just tested papa+flat again to be sure, and it does actually work as intended: parsing nested objects and producing the same output as the other CSV libraries that I compare it with.
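For context, a minimal sketch of the papa+flat combination under discussion, matching the pattern in the full benchmark code quoted later in this thread (the sample data here is made up):

```js
const Papa = require('papaparse');
const flat = require('flat');

const csvStr = 'a.b,a.c\n1,2'; // made-up sample with dotted headers

// parse rows to flat objects keyed by the dotted headers, then expand
// each one into a nested object: { 'a.b': '1' } -> { a: { b: '1' } }
const nested = Papa.parse(csvStr, { header: true }).data.map(flat.unflatten);

console.log(nested); // [ { a: { b: '1', c: '2' } } ]
```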
the problem with comparing the same outputs is that a bunch of libraries dont support one or the other output, which means you have to do additional translation in userland, which is extremely expensive. in my view, the core of parsing is simply finding the correct column and row offsets in the full csv string. this is what allows you to access the values. imo if you have objects or tuples, there is little practical difference which format you structure your app around, and swapping formats is trivial. i did not want to penalize 20% of the libs in either direction. i also have no desire to show/maintain/rerun twice the amount of benchmarks than i already do :D typed parsing is indeed very different, since you cannot avoid type conversion if you actually need it. so i do have that broken out.
im not trying to show what libraries you can swap into existing apps with zero effort. with a bunch of them you cannot since they have different apis, like async, streaming, etc. my goal here is to show the max possible parsing perf of each lib. you can always fiddle with options and datasets to then shoot yourself in the foot, perf-wise.
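As a rough illustration of the tuple vs object distinction, here is a sketch using uDSV's `inferSchema`/`initParser`/`typedDeep` API (confirmed elsewhere in this thread); the `stringArrs` and `typedObjs` accessor names are assumptions based on uDSV's README naming scheme:

```js
const { inferSchema, initParser } = require('udsv');

const csvStr = 'a,b\n1,true\n2,false';

const parser = initParser(inferSchema(csvStr));

// same column/row offsets, different output shapes:
const tuples = parser.stringArrs(csvStr); // [['1','true'], ['2','false']] -- fastest
const objs   = parser.typedObjs(csvStr);  // [{ a: 1, b: true }, { a: 2, b: false }]
const deep   = parser.typedDeep(csvStr);  // nested objects when headers use dot notation
```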
i'll dig it up today...
i put a breakpoint in here: uDSV/bench/non-streaming/typed/PapaParse-deep.cjs, lines 30 to 42 @ b912796
Looking at the code: you run unflatten twice. Shouldn't it just be:
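Presumably the suggestion was the single-pass version that appears in the full benchmark code later in the thread, along these lines:

```js
// unflatten each parsed row exactly once
Papa.parse(csvStr, { header: true, transform }).data.map(flat.unflatten);
```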
did not help :(
I think the reason is that
Ahh, I will have a look at that and correct it. |
True. There are only three main outputs though: array, object, or nested object. It's not that complex, and these different formats have a large impact on how usable the data is. It's very relevant for the user, who probably has to do this "expensive extra translation" themselves if the output format doesn't align with what they need. It is very helpful if the benchmark is transparent about this.
I understand your point, thanks for elaborating. I don't think this is a good approach though. In my opinion, a good benchmark should change only a single property at a time and compare that. It should not change two different properties at the same time; that's just mixing apples and pears. So, one benchmark can keep the output the same and compare different CSV libraries, and a second benchmark can compare different output formats (array, object, nested object) so you can get a feel for the impact of that.
Ehhhh... I hope that's not a real argument to decide which benchmarks are useful to give the best insight 😉
No and no. The same argument holds for objects: maybe the table component that you need requires objects. Both type parsing and object parsing are optional steps, and depending on the use case you cannot avoid either of them. And these optional processing steps are not trivial; they can come with a large performance penalty, so in a performance benchmark it is definitely relevant to reckon with them.
running these benchmarks is currently a pretty manual process, since not all libs can run all datasets, etc. in addition i have to manually update the tables in the writeup. the whole thing takes 90 mins to complete...it is really tedious.

it would be dishonest if i said there is zero marketing aspect to these benchmarks. i dont have an infinite amount of time on my hands to demonstrate every permutation of libs, options, datasets, manual userland conversion, etc, beyond what is necessary to demonstrate uDSV's performance. i think that's fair, given how much effort is involved here. i have shown quite well that Papa degrades massively once you switch it to object mode and enable typing (this is a big reason i added the Sample column). i expect users to apply these demonstrated handicaps/caveats mentally.

if you have an interest in showing that csv42 is much faster than Papa+flat at the typed/nested case, then you have already done so in your own writeup, and i have no concerns that your benchmarks are inaccurate; you clearly also did your homework. i will add Papa + flat to my table once i get its output to be the same, which may require additional litmus gen, and i'll need to update uDSV to support the dot-index notation.
Working out benchmarks is definitely very time-intensive 😅. Ideally we should create one fully automated benchmark repo for all of this. Not sure where to find the time for that though. Maybe what frustrates me is that your benchmark may be fair to uDSV and gives the right perspective there (and uDSV is actually very fast, credits there!), but at the same time the benchmark gives the impression that csv42 is much slower than the popular papaparse, whereas csv42 actually does very well when properly comparing apples to apples.
i get it. i certainly have no nefarious plan to make other parsers look worse than their best self, quite the opposite, in fact! not everyone parses to or needs to parse to objects, nested objects are niche, and typing may in fact be unneeded if all you're doing is sticking the data into JSX to render a table. it makes no sense to force libs into a slow path without knowing the final use case, and i dont intend to dictate one here. if you add a faster, untyped tuple mode to csv42, then i will gladly use it!
👍
i added support for dot notation in 91cea7b, which makes
Nice! Yeah the funny thing is that
I've extended
I've done a small test adding uDSV to the benchmarks I did in csv42, to see if they give comparable results. In the results you post above, you claim that uDSV is about 4.5x faster than csv42. How can I reproduce your 4.5x difference with the nested JSON data?
okay, i figured out what it is. my bench loop has an `await sleep()` in each cycle to let the GC run:

```js
function geoMean(arr) {
  let logSum = arr.reduce((acc, val) => acc + Math.log(val), 0);
  return Math.exp(logSum / arr.length);
}

const sleep = () => new Promise(resolve => setImmediate(resolve));

let cycles = 200;
let durs = [];
let rss = [];

(async () => {
  let i = 0;

  while (i++ < cycles) {
    let st = performance.now();

    let schema = inferSchema(csvStr);
    let parser = initParser(schema);
    await Promise.resolve(parser.typedDeep(csvStr));

    await sleep(); // let GC collect garbage between each cycle

    durs.push(performance.now() - st);
    rss.push(process.memoryUsage.rss());
  }

  console.log((1e3 / geoMean(durs)).toFixed(1) + ' ops/s', (Math.max(...rss) / 1024 / 1024).toFixed(1) + ' peak RSS (MiB)');
})();
```

if you comment out that `await sleep()` line, the numbers shift dramatically.

with sleep():

without sleep():
you can see that if we just allow garbage to accumulate, the high RSS has an outsized impact on performance in cases where the GC can theoretically be efficient if given the opportunity. i think this is the correct way to run benchmarks. not because it makes uDSV win more, but because it represents real load patterns more accurately, except perhaps for the case of batch-processing a directory of CSV files in a loop. typically if you do this in an HTTP request handler, or invoke a script to do this one file at a time, the GC will have plenty of time to run between each workload.

here is the full code if you'd like to test it:

```js
const fs = require('fs');

const csvStr = fs.readFileSync('./bench/data/csv42_nested_10k_dot.csv', 'utf8');

const Papa = require('papaparse');
const flat = require('flat');
const { inferSchema, initParser } = require('../dist/uDSV.cjs.js');
const { csv2json } = require('csv42');

function geoMean(arr) {
  let logSum = arr.reduce((acc, val) => acc + Math.log(val), 0);
  return Math.exp(logSum / arr.length);
}

const sleep = () => new Promise(resolve => setImmediate(resolve));

let cycles = 200;
let durs = [];
let rss = [];

// uDSV (typed, nested)
(async () => {
  let i = 0;

  while (i++ < cycles) {
    let st = performance.now();

    let schema = inferSchema(csvStr);
    let parser = initParser(schema);
    await Promise.resolve(parser.typedDeep(csvStr));

    await sleep();

    durs.push(performance.now() - st);
    rss.push(process.memoryUsage.rss());
  }

  console.log((1e3 / geoMean(durs)).toFixed(1) + ' ops/s', (Math.max(...rss) / 1024 / 1024).toFixed(1) + ' peak RSS (MiB)');
})();

// csv42 (nested objects) -- uncomment to run
// (async () => {
//   let i = 0;
//
//   while (i++ < cycles) {
//     let st = performance.now();
//
//     await Promise.resolve(csv2json(csvStr, { nested: true, header: true }));
//
//     await sleep();
//
//     durs.push(performance.now() - st);
//     rss.push(process.memoryUsage.rss());
//   }
//
//   console.log((1e3 / geoMean(durs)).toFixed(1) + ' ops/s', (Math.max(...rss) / 1024 / 1024).toFixed(1) + ' peak RSS (MiB)');
// })();

// PapaParse + flat (typed via transform, then unflattened) -- uncomment to run
// (async () => {
//   function transform(value) {
//     const number = Number(value);
//
//     if (!isNaN(number) && !isNaN(parseFloat(value))) {
//       return number;
//     }
//
//     if (value === 'true') {
//       return true;
//     }
//
//     if (value === 'false') {
//       return false;
//     }
//
//     if (value === 'null' || value === '') {
//       return null;
//     }
//
//     if (value[0] === '{' || value[0] === '[') {
//       return JSON.parse(value);
//     }
//
//     return value;
//   }
//
//   let i = 0;
//
//   while (i++ < cycles) {
//     let st = performance.now();
//
//     await Promise.resolve(Papa.parse(csvStr, { header: true, transform }).data.map(flat.unflatten));
//
//     await sleep();
//
//     durs.push(performance.now() - st);
//     rss.push(process.memoryUsage.rss());
//   }
//
//   console.log((1e3 / geoMean(durs)).toFixed(1) + ' ops/s', (Math.max(...rss) / 1024 / 1024).toFixed(1) + ' peak RSS (MiB)');
// })();
```
Yes, I encountered the same. It is broken when using Node.js 20 but works on 18. I have to fix that.
Oh wow, that is really interesting. I totally agree that a good benchmark should test a real-world situation where you run the operation once. I was just testing with

Do you understand why running the operation just a single time, measuring with a `console.time`, gives quite different (but consistent) results? That's a true cold start, but I'm not sure how close that is to "reality".
I'll give your code a try.
you need enough cycles to warm up the JIT. the first run will always be slower by 2x-10x.
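A small sketch of that warm-up effect, with a hypothetical `warmup` count and placeholder workload; only steady-state cycles are recorded:

```js
// placeholder workload; substitute the parse call being measured
const doWork = () => JSON.parse('[1, 2, 3]');

const warmup = 20;   // hypothetical: discard the first N cycles while the JIT optimizes
const cycles = 200;
const durs = [];

for (let i = 0; i < cycles; i++) {
  const st = performance.now();
  doWork();
  if (i >= warmup) durs.push(performance.now() - st); // keep steady-state timings only
}

console.log((1e3 * durs.length / durs.reduce((a, b) => a + b, 0)).toFixed(1) + ' ops/s');
```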
That makes sense indeed. Running your benchmark code on my machine gives more or less the same results as with other benchmark libraries:

I added the code here: https://github.com/josdejong/csv-benchmark/blob/main/benchmark-custom.js

Maybe this has to do with OS/Node.js version, etc.? I'm running Windows 11, and tested with Node.js 18 and 20.

Edit: running on Ubuntu (Windows WSL) gives comparable results too:
totally probable. could be CPU differences, branch prediction, etc. it's the wild west!
Running on GitHub Actions gives the same kind of result (not a dedicated machine I guess, so we need to take that with a grain of salt, but the values have a similar ratio and are reproducible). So what OS/machine are you using? It's quite odd that this 1.5 times speed increase for
i agree it's weird. my env is here: https://github.com/leeoniya/uDSV/tree/main/bench#environment
Can you reproduce the 1.5x speed increase on another machine? A good benchmark should be reproducible (if it's not reproducible, it's not very trustworthy).
can you first run all benchmarks in this repo on your existing machine, instead of just the csv42 dataset benchmarks? after you do this and we see significant differences across multiple benchmarks, i'll do more work. i have an Intel i7 / Windows 10 machine i can try without WSL2. i can also try this in a Debian Unstable Linode.

50% fluctuation between hardware and architectures and environments is not unexpected. it is likely impossible to reproduce the same results across everything, because everything is different. i would not consider these benchmarks invalid simply because one library goes from No 10 to No 6 on different hardware. if uDSV ends up being No 5 instead of No 1, then that would be problematic; i don't want to tell people it's the fastest when it isn't. by how much is a more subtle, and variable, answer.
Yes, I was indeed trying to run uDSV/bench, but that requires getting all data files in place. I'll give that a try tomorrow; I hope I'll get it running. The performance will indeed differ on different machines, though I would expect a more or less similar ratio. To me, how much faster a library is is definitely relevant. If some library is only faster by "a bit", I don't have a big reason to switch (switching costs time and effort, etc.). But if it is like a factor 5 or 10, it is definitely worth considering.
except that csv42 is nowhere near "a bit" slower. i don't think you're going to find any hardware on earth that will materially change this. we can re-visit this discussion when the deficit is 33%, not 66% or 75%. looking forward to your runs. if i have time today i'll try the other devices. it may very well be "only" 3x.
btw, you dont need to waste time benchmarking the obviously-slow libs; sometimes there's just no hope. for typed parsing, i think we can make some educated guesses from just these:
here is my Win 10 / i7 desktop (showing 3.5x):
OK, I let my computer do some crunching on most of the benchmarks (I skipped the streaming ones). Most of them are roughly comparable to your outcomes; that gives me confidence. As for the nested objects:

So your Linux machine gets a factor 4.5 difference, your Windows machine 3.5, my Windows machine 4, and my benchmark with benchmark.js a factor 2.5. I'm still surprised that the relative differences can vary that much.

Sorry if I overreacted yesterday, I was just triggered by the benchmarks giving such wildly different results than I had seen before with my own benchmarks and tried out myself (putting the library that I built with love in a bad spotlight). A large part of it is because of mixing object/array formats in a single benchmark, and part of it is apparently a combination of the hardware and the benchmarking tool used.

I still don't understand why you find it important to differentiate between typed/untyped but not between object/array format (both are optional processing steps, and the latter has much more performance impact). I hope you'll split them someday (I like the Sample column, by the way).

Computer: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz (max 4.20GHz), 16 GB RAM, 64-bit, Windows 11 Home, Node.js 18.
no worries, thanks for running them :)
because typed output is necessarily expensive. all libs must take a hit here (uDSV eats 30%, and that's the best case). additionally, typed output is not something you can make optional -- if your app needs types, there's simply no way around it. object vs array output only impacts libs that do it poorly. if someone insists on using a particular parser, it's possible to change their app to accept tuples rather than objects...yes, it's uglier in some cases, but possible if they're willing to trade ergonomics for perf. hope that clarifies why they're in different categories.

i'm gonna start working on another unplanned thing now...because Github finally completely fucked up the home page feed 🤮 https://github.com/orgs/community/discussions/categories/feed
ha ha, we're starting to run in circles here. The same holds for objects: if your app needs objects, there is simply no way around converting into objects. Taking X months to write your own array-based Table Component Pro® because the GUI table component used company-wide only supports objects may not be practical. I guess we have a different definition of a "necessary" operation here. Anyways, I won't bother you anymore. Good luck with the Github feed issues 😓
you're right. if you have to integrate with software that's out of your control, and you need objects+types+nested, then csv42 is 3x faster than Papa+typing+flat. nothing i've measured invalidates your article or benchmarks...and uDSV is another 3x-4x faster than that :)
yes true, it's really impressive!
I was a bit surprised to see Papaparse being twice as fast as csv42 in many of the benchmarks. I did similar benchmarks before, and there csv42 was always at least a bit faster than Papaparse. Now, it can be that I'm overlooking something here and comparing apples with pears, but I would love to understand what's going on; these benchmarks should give similar kinds of results.

I created a minimal benchmark to try to figure out what's going on: https://github.com/josdejong/csv-benchmark

The results show csv42 being twice as fast as Papaparse with, for example, the HPI_master.csv file, and also another file:

Can it be that papaparse isn't configured correctly? When omitting the `header: true` setting, it will create arrays instead of objects, which is about 3 times faster, giving results similar to the results shown here. Not sure, but here papaparse isn't configured with `header: true`: https://github.com/leeoniya/uDSV/blob/main/bench/non-streaming/untyped/PapaParse.cjs

Did you verify that the output is indeed what you expect for all the libraries in all the benchmarks? (i.e. an object with parsed numbers and booleans)
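To illustrate the configuration difference described above, a small sketch of the two Papa Parse modes (the sample data is made up):

```js
const Papa = require('papaparse');

const csvStr = 'a,b\n1,true\n2,false';

// header: true -> array of objects keyed by the header row (the slower mode)
const objs = Papa.parse(csvStr, { header: true }).data;
// [ { a: '1', b: 'true' }, { a: '2', b: 'false' } ]

// default (no header option) -> array of arrays, including the header row
const rows = Papa.parse(csvStr).data;
// [ [ 'a', 'b' ], [ '1', 'true' ], [ '2', 'false' ] ]
```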