This repository has been archived by the owner on Nov 10, 2023. It is now read-only.

Benchmarks #6

Open
ritchie46 opened this issue Nov 28, 2021 · 7 comments

Comments

@ritchie46
Member

As we compare different tools here, it would be cool to run benchmarks from this repo.

Maybe in CI, and later maybe even a dedicated runner.

These could then be shown on the website. I am already assuming here that polars does great. 😄

@koaning
Collaborator

koaning commented Nov 28, 2021

Benchmarks are most certainly the plan! Any preference on how though? Part of me likes the idea of running them on GitHub Actions, but I'm wondering if they provide consistent hardware. There's also "what datasets shall we use" and "where to host those". We may also want to consider multiple versions of each function. After all, there may be multiple ways to implement "sessionize".

@koaning
Collaborator

koaning commented Nov 28, 2021

Come to think of it, do we really want to download large datasets and run potentially long-running benchmarks in GitHub Actions?

@ritchie46
Member Author

We could create a CI job that only runs on manual triggers. In the polars repo, we create the datasets instead of downloading them.

The VMs are shared, but I do think that within a pipeline we have the same compute (not really sure though). That would keep relative comparisons sensible within one run.
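
To make the "relative comparisons within one run" idea concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not this repo's benchmark harness: `best_time`, `relative_timings`, and the two toy functions are hypothetical names, and the assumption is that all candidates are timed back to back in the same process on the same machine.

```python
import time
from typing import Callable, Dict


def best_time(func: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time in seconds over several repeats."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    # The minimum is less sensitive to noise from other jobs on a shared VM.
    return min(timings)


def relative_timings(
    candidates: Dict[str, Callable[[], object]], baseline: str
) -> Dict[str, float]:
    """Time every candidate in the same run and express it relative to the baseline."""
    absolute = {name: best_time(func) for name, func in candidates.items()}
    return {name: seconds / absolute[baseline] for name, seconds in absolute.items()}


# Hypothetical stand-ins for two implementations of the same operation.
def sum_with_loop() -> int:
    total = 0
    for i in range(100_000):
        total += i
    return total


def sum_builtin() -> int:
    return sum(range(100_000))


ratios = relative_timings(
    {"loop": sum_with_loop, "builtin": sum_builtin}, baseline="loop"
)
print(ratios)
```

The absolute seconds will differ between runners, but the ratios come from one machine in one run, which is the only thing this sketch relies on.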

@koaning
Collaborator

koaning commented Nov 30, 2021

Fair enough. Let's try and start with GitHub CI just to keep things simple.

Where would you want to store the data from the benchmark results? Do we want to store the results of the runs in git?
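
One way storing results in git could look is an append-only JSON Lines file that a CI job commits after each run. This is a rough sketch under assumptions: the file layout, the field names, and the helpers `append_result` and `load_results` are all made up for illustration, and the record values are placeholders, not measurements.

```python
import json
import tempfile
from pathlib import Path


def append_result(path: Path, record: dict) -> None:
    """Append one benchmark record as a single JSON line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        # sort_keys keeps the line layout stable, which keeps git diffs small.
        handle.write(json.dumps(record, sort_keys=True) + "\n")


def load_results(path: Path) -> list:
    """Read all records back, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Demo in a temporary directory; in CI this would be a path inside the checkout.
results_file = Path(tempfile.mkdtemp()) / "results" / "benchmarks.jsonl"

# Placeholder values only, not real benchmark numbers.
append_result(results_file, {"tool": "tool_a", "task": "sessionize", "seconds": 1.0})
append_result(results_file, {"tool": "tool_b", "task": "sessionize", "seconds": 2.0})

print(load_results(results_file))
```

Appending one line per result means two runs never rewrite each other's lines, so merges stay simple if the file lives on a dedicated branch.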

@ritchie46
Member Author

Hmm.. that's maybe a good idea yes. We could store it in a separate clean branch. The whole benchmarking is a large todo still.

I also want to run TPC-H benchmarks in the polars repo, which would need dedicated compute. I can imagine eventually setting up a database etc.

@koaning
Collaborator

koaning commented Nov 30, 2021

I just got a base thing goin' on my local multiple dispatch branch.

image

What does TPC-H stand for?

Also, simulating some of these datasets is tricky. How might we properly simulate a session dataset?
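
A possible starting point for a simulated session dataset, sketched with the standard library only. Everything here is an assumption for illustration: the function name `simulate_events`, the column names, and the rule that a session ends when a user is idle for longer than `session_gap_minutes`. Gaps inside a session are drawn strictly below that threshold and gaps between sessions strictly above it, so the `true_session` label can serve as ground truth when checking a "sessionize" implementation.

```python
import random
from datetime import datetime, timedelta


def simulate_events(
    n_users: int = 10,
    sessions_per_user: int = 3,
    events_per_session: int = 5,
    session_gap_minutes: int = 30,
    seed: int = 0,
) -> list:
    """Generate per-user event rows with a known session label."""
    rng = random.Random(seed)  # seeded so every benchmark run sees the same data
    start = datetime(2021, 1, 1)
    rows = []
    for user_id in range(n_users):
        current = start + timedelta(minutes=rng.randint(0, 600))
        for session_index in range(sessions_per_user):
            for _ in range(events_per_session):
                rows.append(
                    {"user": user_id, "timestamp": current, "true_session": session_index}
                )
                # Within a session: idle time strictly below the threshold.
                current += timedelta(seconds=rng.randint(1, 60 * session_gap_minutes - 1))
            # Between sessions: idle time strictly above the threshold.
            current += timedelta(minutes=session_gap_minutes + rng.randint(1, 600))
    return rows


events = simulate_events(n_users=3)
print(len(events), events[0])
```

Real traffic is messier than this (uneven session lengths, bursts, overlapping devices), so this only covers the clean case where the threshold separates sessions perfectly.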

@koaning
Collaborator

koaning commented Nov 30, 2021

I was thinking about building a memo script, but I'm open to other ideas too. It kind of depends on how accurate you'd like these numbers to be. There's also stuff like measuring Parquet vs. CSV and/or varying the number of CPUs.
