Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query/combiner] Add options to better partition imported GVCFs #9224

Merged
merged 2 commits into from
Aug 6, 2020

Conversation

chrisvittal
Copy link
Collaborator

@chrisvittal chrisvittal commented Aug 6, 2020

[query/combiner] Add options to better partition imported GVCFs

Teach the VCF combiner three new options:

  • --genomes/-G to indicate that it is operating on genomes
  • --exomes/-E to indicate that it is operating on exomes
  • --import-interval-size to more finely control the partitioning of
    the reference genome on GVCF import.

These are specified as a required mutually exclusive group, exactly one
of them must be present.

This goes along with removing default_exome_intervals and adding
calculate_even_genome_partitioning. The new function takes a reference
genome and a target interval size and computes a partitioning of the
reference into intervals of length at most the target size. It does so
by looping over the 'regular' contigs of the reference (as such it only
supports GRCh37 and GRCh38 for now) and dividing each contig's length
by the target size and rounding up, obtaining the number of partitions
the contig will be subdivided into. Then dividing the length of the
contig by the number of partitions, getting the true partition size.
Then using the true partition size, generating the intervals. This
hopefully leads to few degenerate cases where some partitions at the end
of contigs are very small.

Copy link
Contributor

@tpoterba tpoterba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one question to address, otherwise looks good

Comment on lines 22 to 23
required=False)
parser.add_argument('--genomes', dest='is_genomes', action='store_true',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we require the user to add this flag? That seems like a good idea to avoid people unintentionally using exome intervals for genome data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question to me is what we would like the defaults to be. I thought about having genomes be the default, but decided against it while making this change. I have no particular reason or justification for doing so.

The reason I added both --genomes and --exomes was to better allow users to indicate their intentions in scripts.

Thoughts on the default?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, wasn't clear -- I'm proposing that we don't actually have a default, but instead require that one of the flags is passed explicitly by the user.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, here's what changed between the force pushes:

diff --git a/hail/python/hail/experimental/vcf_combiner/__main__.py b/hail/python/hail/experimental/vcf_combiner/__main__.py
index 91092add2..9a5847372 100644
--- a/hail/python/hail/experimental/vcf_combiner/__main__.py
+++ b/hail/python/hail/experimental/vcf_combiner/__main__.py
@@ -14,25 +14,24 @@ def main():
     parser.add_argument('out_file', help='Path to final combiner output.')
     parser.add_argument('tmp_path', help='Path to folder for intermediate output '
                                          '(should be an object store path, if running on the cloud).')
+    group = parser.add_mutually_exclusive_group(required=True)
+    group.add_argument('--genomes', '-G', dest='is_genomes', action='store_true',
+                       help='Indicates that the combiner is operating on genomes. '
+                            'Affects how the genome is partitioned on input.')
+    group.add_argument('--exomes', '-E', dest='is_genomes', action='store_false',
+                       help='Indicates that the combiner is operating on exomes. '
+                            'Affects how the genome is partitioned on input.')
+    group.add_argument('--import-interval-size', type=int,
+                       help='Interval size for partitioning the reference genome for GVCF import.')
     parser.add_argument('--log', help='Hail log path.')
     parser.add_argument('--header',
                         help='External header, must be readable by all executors. '
                              'WARNING: if this option is used, the sample names in the '
                              'GVCFs will be overridden by the names in sample map.',
                         required=False)
-    parser.add_argument('--genomes', dest='is_genomes', action='store_true',
-                        help='Indicates that the combiner is operating on genomes. '
-                             'Affects how the genome is partitioned on input. '
-                             'Ignored if --import-interval-size is used')
-    parser.add_argument('--exomes', dest='is_genomes', action='store_false',
-                        help='Indicates that the combiner is operating on exomes '
-                             '(This is the default). '
-                             'Affects how the genome is partitioned on input. '
-                             'Ignored if --import-interval-size is used')
     parser.add_argument('--branch-factor', type=int, default=CombinerConfig.default_branch_factor, help='Branch factor.')
     parser.add_argument('--batch-size', type=int, default=CombinerConfig.default_batch_size, help='Batch size.')
     parser.add_argument('--target-records', type=int, default=CombinerConfig.default_target_records, help='Target records per partition.')
-    parser.add_argument('--import-interval-size', type=int, help='Interval size for partitioning the reference genome for GVCF import.')
     parser.add_argument('--overwrite', help='overwrite the output path', action='store_true')
     parser.add_argument('--key-by-locus-and-alleles', help='Key by both locus and alleles in the final output.', action='store_true')
     parser.add_argument('--reference-genome', default='GRCh38', help='Reference genome.')

Teach the VCF combiner three new options:
  * --genomes/-G to indicate that it is operating on genomes
  * --exomes/-E to indicate that it is operating on exomes
  * --import-interval-size to more finely control the partitioning of
    the reference genome on GVCF import.

These are specified as a required mutually exclusive group, exactly one
of them must be present.

This goes along with removing `default_exome_intervals` and adding
`calculate_even_genome_partitioning`. The new function takes a reference
genome and a target interval size and computes a partitioning of the
reference into intervals of length at most the target size. It does so
by looping over the 'regular' contigs of the reference (as such it only
supports GRCh37 and GRCh38 for now) and dividing each contig's length
by the target size and rounding up, obtaining the number of partitions
the contig will be subdivided into. Then dividing the length of the
contig by the number of partitions, getting the true partition size.
Then using the true partition size, generating the intervals. This
hopefully leads to few degenerate cases where some partitions at the end
of contigs are very small.
@chrisvittal chrisvittal force-pushed the vcf-combiner/updated-partitions branch from 4a2bcfe to 89669e3 Compare August 6, 2020 18:08
Copy link
Contributor

@tpoterba tpoterba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, didn't know about that argparse feature.

@danking danking merged commit 6e0f46c into hail-is:main Aug 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants