perf: use clustered table and standard SQL for lower query costs #107

tswast · 2021-01-12T21:51:07Z

By using the clustered data table, performance if improved because
BigQuery can skip of data that doesn't match the desired project.

Tested locally:

$ pypinfo google-cloud-bigquery        Served from cache: False
Data processed: 740.43 MiB
Data billed: 741.00 MiB
Estimated cost: $0.01

| download_count |
| -------------- |
|     10,149,146 |

Tested locally with all supported fields:

$ pypinfo google-cloud-bigquery project version file pyversion percent3 percent2 impl impl-version openssl date month year country installer installer-version setuptools-version system system-release distro distro-version cpu libc libc-version
Served from cache: False
Data processed: 4.62 GiB
Data billed: 4.62 GiB
Estimated cost: $0.03

| project               | version | file                                              | python_version | percent_3 | percent_2 | implementation | impl_version | openssl_version | download_date | download_month | download_year | country | installer_name | installer_version | setuptools_version | system_name | system_release    | distro_name      | distro_version | cpu    | libc_name | libc_version | download_count |
| --------------------- | ------- | ------------------------------------------------- | -------------- | --------- | --------- | -------------- | ------------ | --------------- | ------------- | -------------- | ------------- | ------- | -------------- | ----------------- | ------------------ | ----------- | ----------------- | ---------------- | -------------- | ------ | --------- | ------------ | -------------- |
| google-cloud-bigquery | 1.24.0  | google_cloud_bigquery-1.24.0-py2.py3-none-any.whl | 3.7            | 100.0     | 0.0       | CPython        | 3.7          | 1.1.0l          | 2020-12-15    | 2020-12        |         2,020 | US      | pip            | 20.0.2            | 45.1.0             | Linux       | 5.4.49+           | Debian GNU/Linux |              9 | x86_64 | glibc     | 2.24         |        171,636 |
| google-cloud-bigquery | 1.24.0  | google_cloud_bigquery-1.24.0-py2.py3-none-any.whl | 3.7            | 100.0     | 0.0       | CPython        | 3.7          | 1.1.0l          | 2020-12-13    | 2020-12        |         2,020 | US      | pip            | 20.0.2            | 45.1.0             | Linux       | 5.4.49+           | Debian GNU/Linux |              9 | x86_64 | glibc     | 2.24         |        160,152 |
| google-cloud-bigquery | 2.6.1   | google_cloud_bigquery-2.6.1-py2.py3-none-any.whl  | 3.7            | 100.0     | 0.0       | CPython        | 3.7          | 1.1.1           | 2021-01-07    | 2021-01        |         2,021 | IN      | pip            | 20.0.2            | 45.2.0             | Linux       | 4.15.0-1092-azure | Ubuntu           |          18.04 | x86_64 | glibc     | 2.27         |        136,529 |
| google-cloud-bigquery | 2.6.1   | google_cloud_bigquery-2.6.1-py2.py3-none-any.whl  | 3.7            | 100.0     | 0.0       | CPython        | 3.7          | 1.1.1           | 2021-01-06    | 2021-01        |         2,021 | IN      | pip            | 20.0.2            | 45.2.0             | Linux       | 4.15.0-1092-azure | Ubuntu           |          18.04 | x86_64 | glibc     | 2.27         |        133,927 |
| google-cloud-bigquery | 1.24.0  | google_cloud_bigquery-1.24.0-py2.py3-none-any.whl | 3.7            | 100.0     | 0.0       | CPython        | 3.7          | 1.1.0l          | 2020-12-14    | 2020-12        |         2,020 | US      | pip            | 20.0.2            | 45.1.0             | Linux       | 5.4.49+           | Debian GNU/Linux |              9 | x86_64 | glibc     | 2.24         |        130,400 |
| google-cloud-bigquery | 2.6.1   | google_cloud_bigquery-2.6.1-py2.py3-none-any.whl  | 3.7            | 100.0     | 0.0       | CPython        | 3.7          | 1.1.1           | 2021-01-05    | 2021-01        |         2,021 | IN      | pip            | 20.0.2            | 45.2.0             | Linux       | 4.15.0-1092-azure | Ubuntu           |          18.04 | x86_64 | glibc     | 2.27         |        129,844 |
| google-cloud-bigquery | 2.6.1   | google_cloud_bigquery-2.6.1-py2.py3-none-any.whl  | 3.7            | 100.0     | 0.0       | CPython        | 3.7          | 1.1.1           | 2021-01-09    | 2021-01        |         2,021 | IN      | pip            | 20.0.2            | 45.2.0             | Linux       | 4.15.0-1050-azure | Ubuntu           |          18.04 | x86_64 | glibc     | 2.27         |        126,544 |
| google-cloud-bigquery | 2.6.1   | google_cloud_bigquery-2.6.1-py2.py3-none-any.whl  | 3.7            | 100.0     | 0.0       | CPython        | 3.7          | 1.1.1           | 2021-01-04    | 2021-01        |         2,021 | IN      | pip            | 20.0.2            | 45.2.0             | Linux       | 4.15.0-1092-azure | Ubuntu           |          18.04 | x86_64 | glibc     | 2.27         |        124,986 |
| google-cloud-bigquery | 2.6.1   | google_cloud_bigquery-2.6.1-py2.py3-none-any.whl  | 3.7            | 100.0     | 0.0       | CPython        | 3.7          | 1.1.1           | 2020-12-30    | 2020-12        |         2,020 | IN      | pip            | 20.0.2            | 45.2.0             | Linux       | 4.15.0-1092-azure | Ubuntu           |          18.04 | x86_64 | glibc     | 2.27         |        121,202 |
| google-cloud-bigquery | 2.6.1   | google_cloud_bigquery-2.6.1-py2.py3-none-any.whl  | 3.7            | 100.0     | 0.0       | CPython        | 3.7          | 1.1.1           | 2021-01-10    | 2021-01        |         2,021 | IN      | pip            | 20.0.2            | 45.2.0             | Linux       | 4.15.0-1050-azure | Ubuntu           |          18.04 | x86_64 | glibc     | 2.27         |        121,104 |
| Total                 |         |                                                   |                |           |           |                |              |                 |               |                |               |         |                |                   |                    |             |                   |                  |                |        |           |              |      1,356,324 |

Closes #64

By using the clustered data table, performance if improved because BigQuery can skip of data that doesn't match the desired project.

codecov · 2021-01-12T22:02:08Z

Codecov Report

Merging #107 (2718bb5) into master (a5ce368) will increase coverage by 0.09%.
The diff coverage is 94.44%.

@@            Coverage Diff             @@
##           master     #107      +/-   ##
==========================================
+ Coverage   94.05%   94.15%   +0.09%     
==========================================
  Files           6        6              
  Lines         353      359       +6     
  Branches       36       36              
==========================================
+ Hits          332      338       +6     
  Misses         12       12              
  Partials        9        9

Impacted Files	Coverage Δ
pypinfo/core.py	`89.06% <92.30%> (+0.35%)`	⬆️
pypinfo/fields.py	`100.00% <100.00%> (ø)`
tests/test_core.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a5ce368...2718bb5. Read the comment docs.

tswast · 2021-01-12T22:14:26Z

I'm not sure why, but the download count is higher with the new table 🤔 Can confirm that estimated query costs are much lower with this change, though.

$ git checkout master


$ pypinfo pytest               
Served from cache: False
Data processed: 78.79 GiB
Data billed: 78.79 GiB
Estimated cost: $0.39

| download_count |
| -------------- |
|     11,695,271 |


$ git checkout issue64-query-optimize 
Switched to branch 'issue64-query-optimize'


$ pypinfo pytest                     
Served from cache: False
Data processed: 731.78 MiB
Data billed: 732.00 MiB
Estimated cost: $0.01

| download_count |
| -------------- |
|     16,845,649 |

hugovk · 2021-01-16T11:51:48Z

~40x quota savings, great!

@fhoffa Do you have any idea why the download count is (~50%) higher with the new tables?

tswast · 2021-01-19T16:34:00Z

@di was recently working on this dataset. Perhaps he has ideas?

di · 2021-01-19T16:55:41Z

Yes, likely because the service that writes to the old dataset has been having issues: https://twitter.com/sethmlarson/status/1347236470688542721?s=19

ofek

If the new counts are in fact accurate, then LGTM! @hugovk feel free to merge if you want

hugovk · 2021-01-19T21:08:00Z

Thanks!

ofek · 2021-01-25T03:00:09Z

https://pypi.org/project/pypinfo/18.0.0/

Thanks again @tswast!!!

tswast added 2 commits January 12, 2021 15:48

perf: use clustered table and standard SQL for lower query costs

bca7b92

By using the clustered data table, performance if improved because BigQuery can skip of data that doesn't match the desired project.

fix unit tests

2718bb5

ofek approved these changes Jan 19, 2021

View reviewed changes

hugovk merged commit 9f95cfa into ofek:master Jan 19, 2021

tswast mentioned this pull request Feb 5, 2021

Disclose BigTable Costs in Readme #105

Closed

tswast deleted the issue64-query-optimize branch February 5, 2021 22:57

tswast mentioned this pull request Feb 18, 2021

Use standard SQL #28

Closed

REestwick pushed a commit to REestwick/pypinfo that referenced this pull request Apr 1, 2024

Merge pull request ofek#107 from tswast/issue64-query-optimize

9d22a7f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: use clustered table and standard SQL for lower query costs #107

perf: use clustered table and standard SQL for lower query costs #107

tswast commented Jan 12, 2021 •

edited

codecov bot commented Jan 12, 2021 •

edited

tswast commented Jan 12, 2021

hugovk commented Jan 16, 2021

tswast commented Jan 19, 2021

di commented Jan 19, 2021

ofek left a comment

hugovk commented Jan 19, 2021

ofek commented Jan 25, 2021

perf: use clustered table and standard SQL for lower query costs #107

perf: use clustered table and standard SQL for lower query costs #107

Conversation

tswast commented Jan 12, 2021 • edited

codecov bot commented Jan 12, 2021 • edited

Codecov Report

tswast commented Jan 12, 2021

hugovk commented Jan 16, 2021

tswast commented Jan 19, 2021

di commented Jan 19, 2021

ofek left a comment

Choose a reason for hiding this comment

hugovk commented Jan 19, 2021

ofek commented Jan 25, 2021

tswast commented Jan 12, 2021 •

edited

codecov bot commented Jan 12, 2021 •

edited