perf: use clustered table and standard SQL for lower query costs #107
Conversation
By using the clustered data table, performance is improved because BigQuery can skip blocks of data that don't match the desired project.
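The exact query text isn't shown in this excerpt. As an illustration, a standard-SQL query that filters on the clustering column (so BigQuery can prune non-matching blocks) might look like the sketch below. The table name `bigquery-public-data.pypi.file_downloads` and the `file.project` clustering column are assumptions about the public PyPI dataset, not taken from this PR.

```python
# Sketch: build a standard-SQL download-count query that filters on the
# clustering column, letting BigQuery prune blocks for other projects.
# Table and column names are assumptions, not taken from this PR.

CLUSTERED_TABLE = "bigquery-public-data.pypi.file_downloads"

def build_query(days: int = 30) -> str:
    """Return a standard-SQL query counting downloads for one project.

    Uses a named query parameter (@project) rather than string
    interpolation, so the project name is passed safely at run time.
    """
    return f"""
    SELECT COUNT(*) AS num_downloads
    FROM `{CLUSTERED_TABLE}`
    WHERE file.project = @project  -- clustering column: enables pruning
      AND DATE(timestamp) BETWEEN
          DATE_SUB(CURRENT_DATE(), INTERVAL {days} DAY)
          AND CURRENT_DATE()
    """
```

Because the filter is on the clustering column, BigQuery only reads the storage blocks containing that project's rows instead of scanning the whole table, which is where the cost reduction comes from.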
Codecov Report
@@            Coverage Diff             @@
##           master     #107      +/-   ##
==========================================
+ Coverage   94.05%   94.15%    +0.09%
==========================================
  Files           6        6
  Lines         353      359        +6
  Branches       36       36
==========================================
+ Hits          332      338        +6
  Misses         12       12
  Partials        9        9
Continue to review full report at Codecov.
I'm not sure why, but the download count is higher with the new table 🤔 Can confirm that estimated query costs are much lower with this change, though.
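The lower estimated cost mentioned above can be checked with a BigQuery dry run, which reports the bytes a query would scan without executing it. A minimal sketch, assuming on-demand pricing of $5 per TiB scanned (the published rate around the time of this PR; it may differ now):

```python
def estimated_cost_usd(bytes_processed: int, usd_per_tib: float = 5.0) -> float:
    """Convert bytes scanned to an on-demand cost estimate (USD).

    Assumes the $5/TiB on-demand rate; pass a different rate if needed.
    """
    return bytes_processed / 2**40 * usd_per_tib

def dry_run_bytes(query: str) -> int:
    """Ask BigQuery how many bytes a query would scan, without running it.

    Requires the google-cloud-bigquery package and application
    credentials; kept inside the function so the rest of this sketch
    runs without them.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    job = client.query(query, job_config=bigquery.QueryJobConfig(dry_run=True))
    return job.total_bytes_processed
```

Comparing `estimated_cost_usd(dry_run_bytes(...))` for the old and new queries is one way to quantify the "~40x" savings reported below.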
|
~40x quota savings, great! @fhoffa Do you have any idea why the download count is (~50%) higher with the new tables?
@di was recently working on this dataset. Perhaps he has ideas?
Yes, likely because the service that writes to the old dataset has been having issues: https://twitter.com/sethmlarson/status/1347236470688542721?s=19
If the new counts are in fact accurate, then LGTM! @hugovk feel free to merge if you want
Thanks!
https://pypi.org/project/pypinfo/18.0.0/ Thanks again @tswast!!!
Tested locally with all supported fields.
Closes #64