Conversation

@arostamianfar (Contributor):

The BigQuery row size limit has changed from 10MB to 100MB. I also made this method more efficient; it turns out that JSON-serializing every single call/variant is a very expensive operation, and this PTransform is now 50% faster! The sampling logic can be improved (e.g., randomizing the samples rather than picking particular locations, or dynamically choosing the sample size based on the input), but I realistically don't think anyone is going to hit the 100MB limit anytime soon, and this method should be good enough even then.
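
To illustrate the idea (a minimal sketch, not the code in this PR; the helper names and the row/calls shapes are assumptions), the row size can be estimated by JSON-serializing only a handful of calls and extrapolating to the full call list, instead of serializing every call:

    import json

    def _json_size_bytes(obj):
      # Size of the JSON-serialized object in bytes (assumes ASCII output).
      return len(json.dumps(obj))

    def estimate_row_size_bytes(row_without_calls, calls, num_call_samples=5):
      # Serialize only a few calls, average their size, and scale by the
      # total call count; this avoids jsonifying every single call.
      base_size = _json_size_bytes(row_without_calls)
      if not calls:
        return base_size
      sampled = calls[:num_call_samples]
      avg_call_size = sum(_json_size_bytes(c) for c in sampled) / float(len(sampled))
      return int(base_size + avg_call_size * len(calls))

A caller can then start a new row whenever this estimate approaches the conservative 90MB threshold discussed in the review below.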

Tested:

  • unit + integration tests
  • Also ran on a synthetic dataset with 500k+ files to force a row split. We can consider creating this dataset as part of huge_tests later.

@coveralls commented Nov 23, 2018:

Pull Request Test Coverage Report for Build 1453

  • 32 of 33 (96.97%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.03%) to 87.639%

Changes missing coverage:

  File                                                         Covered  Changed/Added  %
  gcp_variant_transforms/libs/bigquery_vcf_data_converter.py   28       29             96.55%

Totals:

  • Change from base Build 1449: +0.03%
  • Covered Lines: 6445
  • Relevant Lines: 7354

💛 - Coveralls

@allieychen (Contributor) left a comment on the diff:

    # on sampling rather than exact byte size.
    _MAX_BIGQUERY_ROW_SIZE_BYTES = 90 * 1024 * 1024
    # Maximum number of calls to sample for BigQuery row size estimate.
    _MAX_NUM_CALL_SAMPLES = 5
@allieychen: If I understand correctly, this is actually the min number of calls (we always sample 5 calls or 5+1 calls). Maybe NUM_CALL_SAMPLES is good enough.
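
For illustration only (a guess at the shape of the sampling logic, not the actual implementation in this repo): if the estimator always serializes the first NUM_CALL_SAMPLES calls, plus one extra boundary call when more are available, the constant acts as a floor on the sample count rather than a ceiling:

    NUM_CALL_SAMPLES = 5

    def _pick_sample_calls(calls):
      # Hypothetical sampler: the first NUM_CALL_SAMPLES calls, plus the last
      # call when there are more -- i.e. 5 or 5+1 calls get sampled, so the
      # constant behaves as a minimum, not a maximum.
      sampled = calls[:NUM_CALL_SAMPLES]
      if len(calls) > NUM_CALL_SAMPLES:
        sampled.append(calls[-1])
      return sampled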

@arostamianfar (Author): Good point! Done!

@arostamianfar (Author) left a comment:

Thanks! Updated the doc as well, although we'll need to remove that page completely once the cloud page is updated.


@arostamianfar merged commit 33c6c69 into googlegenomics:master on Nov 26, 2018.
@arostamianfar deleted the bqlimit branch on November 27, 2018.