Whole script performance #177
Thank you for the suggestion.
Let's call it a "benchmark script" rather than a "query". The term "query" is used for atomic queries against the data (side note: we run 2 queries per question).
This can be (and is going to be) outsourced to Python datatable, the same way we already outsource pandas.
Strictly speaking, that is fair for the "groupby" (or "join") task. The cost of importing (or casting) data into the environment seems to best fit into the "read" task: #131
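To make the separation concrete, here is a minimal sketch (with a hypothetical synthetic dataset, not the actual benchmark data) of how the one-off cast to categorical can be charged to the "read" task, so the "groupby" task measures only the grouping itself:

```python
import time
import pandas as pd

def timed(fn):
    """Run fn once and return (elapsed_seconds, result)."""
    t0 = time.perf_counter()
    out = fn()
    return time.perf_counter() - t0, out

# Hypothetical small dataset standing in for the benchmark input.
df_raw = pd.DataFrame({
    "id": ["a", "b", "a", "c"] * 1000,
    "v": range(4000),
})

# "read" task: include the cast to categorical here, not in "groupby".
t_read, df = timed(lambda: df_raw.assign(id=df_raw["id"].astype("category")))

# "groupby" task: measured on its own, after preparation is done.
t_group, res = timed(lambda: df.groupby("id", observed=True)["v"].sum())

print(f"read+cast: {t_read:.4f}s  groupby: {t_group:.4f}s")
```

The `timed` helper and the column names are illustrative only; the point is that the expensive `astype("category")` call sits outside the timed groupby.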
Right, I agree that casting could be seen as read/preparation and is not part of the groupby/join operation. Anyway, great work! And I hope that a whole-script performance task becomes part of the benchmark.
Should the whole query be measured for a tool (loading data, casting types, answering the question)?
I ask because I am running the db-benchmark, and the pandas solution spends most of its time converting strings to the categorical dtype. That doesn't seem entirely fair, as the runtime cost of this conversion is enormous. It's probably related to #20. It seems that if we convert to categorical we optimize for a specific operation but pay a cost somewhere else.
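To illustrate the trade-off being described, here is a minimal sketch (on hypothetical synthetic data, with illustrative sizes) that measures the one-off string-to-categorical conversion separately from the query that benefits from it:

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for the benchmark's string grouping column.
keys = pd.Series(rng.integers(0, 100, size=200_000).astype(str))
vals = pd.Series(rng.random(200_000))

t0 = time.perf_counter()
cat = keys.astype("category")                  # the one-off conversion cost
t_convert = time.perf_counter() - t0

t0 = time.perf_counter()
ans = vals.groupby(cat, observed=True).mean()  # the query itself
t_query = time.perf_counter() - t0

print(f"convert: {t_convert:.4f}s  query: {t_query:.4f}s")
```

Whether the conversion pays off depends on how many queries reuse the categorical column: the cast is paid once, while every subsequent groupby on it is cheaper than grouping raw strings.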