Handle pending scenarios and rename fill-gaps operation. #877

Open
6 of 13 tasks
j-coll opened this issue Jul 23, 2018 · 0 comments

j-coll commented Jul 23, 2018

There are some scenarios that the fill-gaps (and fill-missing) operation currently skips, and that should be completed.

Pending scenarios

  1. Gap found in archive.
    It may happen that the original file did not cover some region. This situation is normal in VCFs, but really uncommon in gVCFs. Currently, the FillGapsTask does not write anything, which is then understood as HOM_REF (0/0). In this scenario we should write ?/? to indicate that the information at this locus is unknown.

When executing the fill-missing operation (aka aggregate) we may find a lot of gaps, because we are not reading the reference blocks with HOM_REF genotypes. This should be taken into account.
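
A minimal sketch of the gap rule described above, assuming a simplified model where each archive record covers an inclusive range and carries one genotype. `GapGenotypeSketch`, `ArchiveRegion` and `genotypeAt` are hypothetical names used only for illustration; they are not part of FillGapsTask. The `allArchiveRecordsRead` flag stands in for the difference between fill-gaps (all archive records read, so a gap really is unknown) and fill-missing (reference blocks not read, so gaps are expected).

```java
import java.util.List;
import java.util.Optional;

/** Illustrative only: none of these names exist in the project. */
final class GapGenotypeSketch {

    static final String UNKNOWN_GENOTYPE = "?/?";

    /** Hypothetical stand-in for one archive record (inclusive coordinates). */
    static final class ArchiveRegion {
        final int start;
        final int end;
        final String genotype;

        ArchiveRegion(int start, int end, String genotype) {
            this.start = start;
            this.end = end;
            this.genotype = genotype;
        }
    }

    /**
     * Returns the genotype to write at the given position.
     * An empty result means "write nothing", which is the current behaviour
     * and is kept here whenever the archive was only partially read.
     */
    static Optional<String> genotypeAt(List<ArchiveRegion> archive, int position,
                                       boolean allArchiveRecordsRead) {
        for (ArchiveRegion region : archive) {
            if (region.start <= position && position <= region.end) {
                return Optional.of(region.genotype);
            }
        }
        // Gap in the archive: only claim "unknown" when every record was read.
        return allArchiveRecordsRead ? Optional.of(UNKNOWN_GENOTYPE) : Optional.empty();
    }
}
```

With this shape, a position that falls between two archive records yields ?/? only when the whole archive was read. How fill-missing should treat the same gap is left open here, as it is in the issue.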

  2. Insertion not overlapping with any variant.
    This scenario is quite similar to the previous one: no overlapping variant can be found in the archive. This may happen because we are trying to complete an insertion that lies between two variants. In this scenario we can do one of two things:

    1. Try to overlap with the variant in the previous position (if any)
    2. Write a 0/0, indicating that the insertion does not occur in these samples
  3. Multiple overlaps
    In this scenario, one variant has multiple overlapping positions. This may happen for several reasons:

    • Deletion from sample A overlapping with N smaller variants from sample B
    • Inconsistent input VCF with overlapping variants

    In this scenario, we should mark that there is something at this position, but that we cannot determine what. For this, we should use the special allele <*> from the VCF spec v4.3 (known as <NON_REF> in GATK).

    • Deletion from sample A overlapping with N reference blocks from sample B

      PENDING SCENARIO

    • Overlap with a split multi-allelic variant

      In this scenario, a variant from file A may overlap with many variants produced by splitting a multi-allelic variant from file B. The information in these split variants from B is not inconsistent, so we know exactly what is at this position. This is the case when all the overlapping variants share the same FileEntry.call. We should just take any of them.

  4. Structural variants
    We should not try to merge structural variants with other smaller variants.
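
The rules for multiple overlaps and structural variants could be sketched as a single decision function. This is only an illustration under stated assumptions: `OverlapResolutionSketch`, `Overlap`, `Resolution` and `resolve` are hypothetical names, the `call` field merely stands in for FileEntry.call and is treated as an opaque string, and the cases the issue leaves open (no overlap at all, overlap with several reference blocks) are returned as explicit "undecided" values rather than resolved.

```java
import java.util.List;

/** Illustrative only: none of these names exist in the project. */
final class OverlapResolutionSketch {

    enum Resolution {
        SKIP_STRUCTURAL,     // structural variants are never merged
        NO_OVERLAP,          // undecided: previous position or 0/0
        USE_SINGLE,          // exactly one overlapping record
        USE_ANY_SAME_CALL,   // split multi-allelic: any record will do
        PENDING_REF_BLOCKS,  // undecided: several reference blocks
        WRITE_NON_REF        // ambiguous: write the <*> allele
    }

    /** Hypothetical stand-in for one overlapping archive record. */
    static final class Overlap {
        final String call;            // stands in for FileEntry.call, may be null
        final boolean referenceBlock; // true for a HOM_REF reference block

        Overlap(String call, boolean referenceBlock) {
            this.call = call;
            this.referenceBlock = referenceBlock;
        }
    }

    static Resolution resolve(boolean structuralVariant, List<Overlap> overlaps) {
        if (structuralVariant) {
            return Resolution.SKIP_STRUCTURAL;
        }
        if (overlaps.isEmpty()) {
            return Resolution.NO_OVERLAP;
        }
        if (overlaps.size() == 1) {
            return Resolution.USE_SINGLE;
        }
        if (overlaps.stream().allMatch(o -> o.referenceBlock)) {
            return Resolution.PENDING_REF_BLOCKS;
        }
        // Records that share one call come from the same split multi-allelic variant.
        String firstCall = overlaps.get(0).call;
        boolean sameCall = firstCall != null
                && overlaps.stream().allMatch(o -> firstCall.equals(o.call));
        return sameCall ? Resolution.USE_ANY_SAME_CALL : Resolution.WRITE_NON_REF;
    }
}
```

Keeping the undecided cases as their own enum values makes the open questions visible in the code instead of hiding them behind a default.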

Rename operation

After some deliberation, we decided that these operations should be called "aggregate" and "aggregate-family". Therefore, the command and internal classes should be renamed to match these new names.

Tasks

  • Write unknown genotype ?/? when gaps are found in the archive file
    • Applies only to the fill-gaps (aka aggregate-family) operation, when reading all archive records.
  • Use the NON_REF allele for multiple overlaps.
  • Handle overlap with symbolic ref blocks (<*>)
  • Handle multiple overlaps
    • Special scenario: Overlap with multiple reference blocks
    • Special scenario: Overlap with multiple variants from the same multi-allelic variant
  • Decide what to do with insertions between two variants
  • Ensure structural variants are not being merged
  • Rename operation
    • Rename command line
    • Internal rename