Whole script performance #177

Open
ritchie46 opened this issue Jan 5, 2021 · 2 comments
@ritchie46
Contributor

Should the whole query be measured for a tool (loading data, casting types, answering question)?

I ask because I am running the db-benchmark, and the pandas solution spends most of its time converting strings to categorical dtype. That doesn't seem entirely fair, as the runtime cost of this conversion is enormous.

It's probably related to #20. It seems that by converting to categorical we optimize for one specific operation but pay a cost somewhere else.
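To make the imbalance concrete, here is a minimal, self-contained sketch (not the actual db-benchmark harness; the column names and sizes are made up) that times the string-to-categorical cast separately from the groupby it is meant to accelerate:

```python
import time

import numpy as np
import pandas as pd

# Hypothetical data: 1M rows keyed by a low-cardinality string column.
n = 1_000_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": rng.integers(0, 100, n).astype(str),  # string group keys
    "v": rng.random(n),
})

t0 = time.perf_counter()
df["id"] = df["id"].astype("category")  # one-off conversion cost
t_cast = time.perf_counter() - t0

t0 = time.perf_counter()
res = df.groupby("id", observed=True)["v"].sum()  # the measured query
t_query = time.perf_counter() - t0

print(f"cast: {t_cast:.3f}s  query: {t_query:.3f}s")
```

If only the groupby is timed, the cast looks free; charging it to a "read" phase or to whole-script time changes the picture considerably.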

@jangorecki
Contributor

jangorecki commented Jan 5, 2021

Thank you for the suggestion.
I recall a discussion about adding a task that measures whole-script execution. The motivation was to have a test that balances the benefits of R's global string cache against the cost of reading strings into an R session (which is single threaded because of that global cache). Now that we use categorical/factor columns, this specific case is no longer relevant, but I agree that this kind of test would be useful. The most challenging part is actually designing it well.

Should the whole query be measured for a tool (loading data, casting types, answering question)?

Let's call it a "benchmark script" rather than a "query". The term "query" is used for atomic queries against the data (side note: we run 2 queries per question).

pandas solution takes most of its time converting strings to categorical dtype.

This can be (and is going to be) outsourced to Python datatable, just as we already outsource pandas read_csv to datatable's fread.

Which doesn't seem totally fair, as the runtime cost of this conversion seems gigantic.

Strictly speaking, it is fair for the "groupby" (or "join") task. The cost of importing (or casting) data into an environment seems to fit best into the "read" task: #131
With the processes well separated, we can present them coherently in the report. I don't see any good way to fit such extra timings into the current benchmark plots.
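One way to keep those phases well separated in a benchmark script is to time each phase independently and report the timings side by side. A minimal sketch using only the standard library (the phase names and workloads are placeholders, not the actual db-benchmark tasks):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def phase(name):
    # Record the wall-clock time of one benchmark phase, so the
    # "read"/"cast" costs can be reported apart from the queries.
    t0 = time.perf_counter()
    yield
    timings[name] = time.perf_counter() - t0

with phase("read"):
    data = list(range(1_000_000))       # stand-in for loading a CSV
with phase("cast"):
    data = [float(x) for x in data]     # stand-in for a dtype conversion
with phase("q1"):
    total = sum(data)                   # stand-in for a groupby query

print({k: round(v, 3) for k, v in timings.items()})
```

Summing per-phase timings gives a whole-script number, while the individual entries can still feed the existing per-task plots.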

@jangorecki jangorecki changed the title Whole query performance Whole script performance Jan 5, 2021
@ritchie46
Contributor Author

Right, I agree that casting could be seen as reading or preparation, and is not part of the groupby/join operation itself.

Anyway, great work! I hope that a whole-script performance task becomes part of the benchmark.
