Update BQ row limit to 100MB and make it more efficient. #417
Conversation
Pull Request Test Coverage Report for Build 1453
💛 - Coveralls
allieychen
left a comment
It looks awesome! LGTM! Can you also update https://github.com/googlegenomics/gcp-variant-transforms/blob/master/docs/bigquery_schema.md?
    # on sampling rather than exact byte size.
    _MAX_BIGQUERY_ROW_SIZE_BYTES = 90 * 1024 * 1024
    # Maximum number of calls to sample for BigQuery row size estimate.
    _MAX_NUM_CALL_SAMPLES = 5
If I understand correctly, this is actually the minimum number of calls (we always sample 5 or 5+1 calls). Maybe NUM_CALL_SAMPLES is good enough.
Good point! Done!
arostamianfar
left a comment
Thanks! Updated the doc as well, although we should remove that page completely once the cloud page is updated.
Good point! Done!
The BQ row limit has changed from 10MB to 100MB. I also made this method more efficient: it turns out that JSON-encoding every single call/variant is a very expensive operation, and this PTransform is now 50% faster. The sampling logic could be improved further (e.g. randomizing the samples rather than picking particular locations, or dynamically choosing the sample size based on the input), but I don't realistically expect anyone to hit the 100MB limit anytime soon, and this method should be good enough even then.
Tested:
huge_testslater.