Should the whole query be measured for a tool (loading data, casting types, answering the question)?

I ask because I am running the db-benchmark, and the pandas solution takes most of its time converting strings to categorical dtype. That doesn't seem totally fair, as the runtime cost of this conversion is gigantic.

It's probably related to #20. It seems that if we convert to categorical, we optimize for a specific operation but pay a cost somewhere else.
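To make the cost split concrete, here is a small illustrative sketch (not the actual db-benchmark harness; the data, column names, and sizes are made up) that times the string-to-categorical cast separately from the groupby query:

```python
import time

import pandas as pd

# Synthetic data standing in for the benchmark's string grouping column.
n = 100_000
df = pd.DataFrame({
    "id": [f"id{i % 100:03d}" for i in range(n)],
    "v": range(n),
})

t0 = time.perf_counter()
df["id"] = df["id"].astype("category")  # the costly cast under discussion
cast_time = time.perf_counter() - t0

t0 = time.perf_counter()
ans = df.groupby("id", observed=True)["v"].sum()  # the measured query
query_time = time.perf_counter() - t0

print(f"cast: {cast_time:.3f}s, query: {query_time:.3f}s")
```

On large string columns the cast often dominates, which is exactly the asymmetry raised above: the per-query timing looks fast only because the expensive conversion happened beforehand.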
Thank you for the suggestion.

I recall a discussion about having a task that measures whole-script execution.
The motivation was to have a test that balances the benefits of R's global string cache against the cost of reading strings into an R session (which is single-threaded due to R's global cache). Now that we use categorical/factor, this specific case is no longer valid, but I agree that having this kind of test would be useful. The most challenging part is actually designing it well.
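As a rough sketch of what such a whole-script measurement could look like (the script body below is a tiny placeholder, not an actual benchmark solution), the entire run, including load, cast, and query, could be timed from the outside:

```python
import pathlib
import subprocess
import sys
import time

# Placeholder "benchmark script": load, cast to categorical, answer a query.
script = pathlib.Path("bench_script.py")
script.write_text(
    "import pandas as pd\n"
    "df = pd.DataFrame({'id': list('ababab'), 'v': range(6)})\n"
    "df['id'] = df['id'].astype('category')\n"
    "print(df.groupby('id', observed=True)['v'].sum().to_dict())\n"
)

# Time the whole script end to end, so load and cast costs are included,
# in contrast to the per-query timings recorded inside the script.
t0 = time.perf_counter()
result = subprocess.run(
    [sys.executable, str(script)],
    capture_output=True, text=True, check=True,
)
elapsed = time.perf_counter() - t0

print(result.stdout.strip(), f"| whole-script time: {elapsed:.2f}s")
```

The design question is what exactly goes inside that script and how interpreter startup and I/O noise are kept comparable across tools.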
> Should the whole query be measured for a tool (loading data, casting types, answering the question)?
Let's call it a "benchmark script" rather than a "query". The term "query" is used for atomic queries against data (side note: we run 2 queries per question).
> pandas solution takes most of its time converting strings to categorical dtype.
This can be (and is going to be) outsourced to Python datatable, the same way we already outsource pandas read_csv to datatable fread.
> Which doesn't seem totally fair, as the runtime cost of this conversion seems gigantic.
Strictly speaking, it is fair for the "groupby" (or "join") task. The cost of importing (or casting) data into the environment seems to fit best into the "read" task: #131

With these processes well separated, we can present them coherently in the report. I don't see any good way to fit such extra timings into the current benchmark plots.