Whole script performance #177

Open
ritchie46 opened this issue Jan 5, 2021 · 2 comments
@ritchie46
Contributor

Should the whole query be measured for a tool (loading data, casting types, answering question)?

I ask because I am running the db-benchmark, and the pandas solution spends most of its time converting strings to categorical dtype. That doesn't seem entirely fair, as the runtime cost of this conversion is enormous.

It's probably related to #20. It seems that by converting to categorical we optimize for one specific operation but pay a cost somewhere else.
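To make the imbalance concrete, here is a minimal, self-contained sketch (not the actual db-benchmark harness; the column names and sizes are made up) that times the string-to-categorical cast separately from the groupby it is meant to accelerate:

```python
import time

import numpy as np
import pandas as pd

# Hypothetical data: 1M rows keyed by a low-cardinality string column.
n = 1_000_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": rng.integers(0, 100, n).astype(str),  # string group keys
    "v": rng.random(n),
})

t0 = time.perf_counter()
df["id"] = df["id"].astype("category")  # one-off conversion cost
t_cast = time.perf_counter() - t0

t0 = time.perf_counter()
res = df.groupby("id", observed=True)["v"].sum()  # the measured query
t_query = time.perf_counter() - t0

print(f"cast: {t_cast:.3f}s  query: {t_query:.3f}s")
```

If only the groupby is timed, the cast looks free; charging it to a "read" phase or to whole-script time changes the picture considerably.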

@jangorecki
Contributor

jangorecki commented Jan 5, 2021

Thank you for the suggestion.
I recall a discussion about adding a task that measures whole-script execution. The motivation was to have a test that balances the benefits of R's global string cache against the cost of reading strings into an R session (which is single threaded because of that global cache). Now that we use categorical/factor columns, this specific case is no longer relevant, but I agree that this kind of test would be useful. The most challenging part is actually designing it well.

Should the whole query be measured for a tool (loading data, casting types, answering question)?

Let's call it a "benchmark script" rather than a "query". The term "query" is used for atomic queries against the data (side note: we run 2 queries per question).

pandas solution takes most of its time converting strings to categorical dtype.

This can be (and is going to be) outsourced to Python datatable, just as we already outsource pandas read_csv to datatable's fread.

Which doesn't seem totally fair, as the runtime cost of this conversion seems gigantic.

Strictly speaking, it is fair for the "groupby" (or "join") task. The cost of importing (or casting) data into an environment seems to fit best into the "read" task: #131
With the processes well separated, we can present them coherently in the report. I don't see any good way to fit such extra timings into the current benchmark plots.
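One way to keep those phases well separated in a benchmark script is to time each phase independently and report the timings side by side. A minimal sketch using only the standard library (the phase names and workloads are placeholders, not the actual db-benchmark tasks):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def phase(name):
    # Record the wall-clock time of one benchmark phase, so the
    # "read"/"cast" costs can be reported apart from the queries.
    t0 = time.perf_counter()
    yield
    timings[name] = time.perf_counter() - t0

with phase("read"):
    data = list(range(1_000_000))       # stand-in for loading a CSV
with phase("cast"):
    data = [float(x) for x in data]     # stand-in for a dtype conversion
with phase("q1"):
    total = sum(data)                   # stand-in for a groupby query

print({k: round(v, 3) for k, v in timings.items()})
```

Summing per-phase timings gives a whole-script number, while the individual entries can still feed the existing per-task plots.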

@jangorecki jangorecki changed the title Whole query performance Whole script performance Jan 5, 2021
@ritchie46
Contributor Author

Right, I agree that casting could be seen as reading or preparation, and is not part of the groupby/join operation itself.

Anyway, great work! I hope that a whole-script performance task becomes part of the benchmark.
