
Expanding upon and improving existing benchmarks #2390

Open
MrPowers opened this issue Jan 15, 2022 · 4 comments


@MrPowers

Polars has done an amazing job participating in the h2o benchmarks and has shown impressive results. These performance benchmarks are a compelling reason to use Polars.

The h2o benchmarks are somewhat limited because they only show performance for groupby and join queries. We've already seen how additional benchmarks, like reading in 1,097 Parquet files and running a filter operation, can encourage optimizations that allow for 10x performance gains.

Here are some queries I'm especially interested in benchmarking:

  • filtering
  • I/O (both with one huge file and thousands of smaller files)
  • multiple queries at once (e.g. read in thousands of Parquet files, filter, groupby, write out single CSV). These should better mimic "real world" workflows.

I plan to expose all the Polars benchmarks via Jupyter notebooks as well, so they're easily readable. Here are the Polars groupby queries for example. These should help the community learn Polars syntax.

I'm planning on adding this code in the mrpowers-benchmarking repo.

I'm also planning on adding Vaex and arrow-datafusion to the benchmarking analysis.

Here are some next steps / discussion points:

  • The h2o benchmarks run on datasets that are created with R scripts; here's an example. These scripts are really slow and painful to work with. Anyone want to collaborate on a Rust / Polars program to generate these CSV files in a more performant manner? The current R script outputs a 50GB uncompressed CSV file. It'd be better to output multiple files in parallel. We'll eventually want to be able to run benchmarks on terabytes of data.
  • Any suggestions on additional types of queries we should benchmark?
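Until a faster Rust generator exists, a multi-file generator could be sketched in Python. The schema and value ranges below are inferred from the sample rows later in this thread and are assumptions; the R script's exact group/NA semantics are simplified:

```python
# Minimal sketch of a multi-file generator for h2o-style groupby CSVs.
# Column ranges (small id groups, v1 in 1-5, v2 in 1-15) are inferred from
# sample rows in this thread and are assumptions, not the R script's exact logic.
import os
import random

def write_h2o_csv_files(out_dir: str, n_rows: int, n_groups: int,
                        n_files: int, seed: int = 0) -> list:
    rng = random.Random(seed)
    os.makedirs(out_dir, exist_ok=True)
    rows_per_file = n_rows // n_files
    paths = []
    for f in range(n_files):
        path = os.path.join(out_dir, f"groupby_part_{f:04d}.csv")
        with open(path, "w") as fh:
            fh.write("id1,id2,id3,id4,id5,id6,v1,v2,v3\n")
            for _ in range(rows_per_file):
                fh.write(
                    f"id{rng.randrange(1, n_groups + 1):03d},"   # id1: small group
                    f"id{rng.randrange(1, n_groups + 1):03d},"   # id2: small group
                    f"id{rng.randrange(1, n_rows + 1):010d},"    # id3: near-unique id
                    f"{rng.randrange(1, n_groups + 1)},"         # id4
                    f"{rng.randrange(1, n_groups + 1)},"         # id5
                    f"{rng.randrange(1, n_rows + 1)},"           # id6
                    f"{rng.randrange(1, 6)},"                    # v1: 1-5
                    f"{rng.randrange(1, 16)},"                   # v2: 1-15
                    f"{rng.uniform(0, 100):.6f}\n"               # v3: float measure
                )
        paths.append(path)
    return paths
```

Since each output file is independent, the per-file loop is trivially parallelizable with multiprocessing, which gets at the "multiple files in parallel" goal above.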
@ghuls
Collaborator

ghuls commented Jan 16, 2022

I quickly wrote an AWK script to generate the CSV files. These are the speeds I'm getting:

$ generate_db_bench 1e8 1e3 0 0 | tail
id620,id383,id0000079265,351,138,4355,5,6,79.258110
id305,id100,id0000036585,550,122,80876,2,4,37.077903
id470,id062,id0000000374,509,451,42425,3,1,3.860967
id545,id123,id0000088747,565,224,71621,2,11,1.102061
id111,id825,id0000073112,962,60,5389,2,8,3.934497
id416,id803,id0000096000,687,384,45704,3,8,47.554632
id408,id720,id0000053989,327,234,46885,5,4,1.627670
id202,id870,id0000081570,16,359,82880,4,7,35.411490
id829,id090,id0000080259,485,676,79456,1,7,41.790627
id099,id870,id0000036577,493,71,28130,2,6,27.070123

took 1m48s


$ generate_db_bench 1e7 1e2 0 0 | tail
id027,id096,id0000056446,77,63,3762,4,10,10.649226
id071,id077,id0000010959,51,80,43160,5,3,10.232036
id080,id078,id0000038057,87,68,41867,4,14,41.382498
id023,id042,id0000082354,3,9,9801,1,7,14.909657
id092,id066,id0000019551,26,99,40258,5,12,68.537331
id049,id060,id0000053163,89,77,10105,5,2,30.914155
id025,id006,id0000032002,39,44,78478,5,3,48.230367
id013,id014,id0000040569,34,97,35626,4,3,17.827776
id060,id100,id0000044250,16,19,76827,1,15,19.643650
id009,id004,id0000011484,48,10,4997,2,1,2.519661

took 11s

What do you get with R for those settings?

I still need to implement adding %x NAs to my script.

@MrPowers
Author

@ritchie46 - Databricks performed a benchmarking analysis on 157GB of NYC taxi data as described in this blog post.

I was thinking we should reproduce this benchmarking analysis with Polars. It's a larger dataset and it deals with some messy, real-world data, so it should be realistic. Thoughts?

@ritchie46
Member

Sounds interesting. I could help on writing the most performant queries. You would however require quite a large VM for this.

@MrPowers
Author

MrPowers commented Apr 3, 2022

We've been making some good progress here. I created a new script to generate the h2o groupby data. The current h2o data generation script is limited because it generates a single CSV file and often errors out when generating 1e9 rows of data (50 GB). My script outputs multiple files, so it is scalable. I will eventually want to run all benchmarks on 1e10 rows of data (500 GB).

Filter benchmarks

I added 5 filter benchmark queries; here are the results for Polars on my local machine for the 1e8 row dataset (MacBook Air with 8 GB of RAM):

task  polars-parquet  polars-csv  polars-single-csv
q1          4.147290   16.049853          14.464048
q2          7.025856   23.496531          62.517372
q3         16.354966   76.546058         161.577843
q4          8.573848   41.575043          31.605694
q5          6.021159   12.663761          14.181147

Here's the script in case you'd like to take a look at the queries or have any suggestions on how to better structure the code.

Unlike the h2o benchmarks, I don't want to persist any data in memory. All queries should read from disk and then execute.

I will be presenting benchmarking results for a single CSV file, multiple CSV files, and multiple Parquet files. The 1e10 results probably won't include a single CSV file, because I don't think it's practical to create a 500 GB CSV file.

Multiple operation benchmarks

I am going to add these soon and will keep you posted. These will be queries like filtering and then grouping, or grouping and then filtering, etc.

Longer term, I also plan on adding large ETL benchmarks (e.g. reading 500 GB of CSV data, running transformations, writing results as 2,000 Parquet files).

Groupby benchmarks

I am able to run the h2o groupby queries with Polars using multiple CSV files or a single CSV file. I'm having trouble running the query on multiple Parquet files. I tried with Parquet files generated by both Dask & PySpark.

Here's the script. When I run python benchmarks/polars_h2o_groupby.py 1e7 I get this error: "RuntimeError: Any(SchemaMisMatch("cannot vstack: because column names in the two DataFrames do not match for left.name='v1' != right.name='id1'"))".

Can you help me figure out how to get past this error?

Thanks!!
