Merge SVs with high percentage of overlap #13

Closed
cai1991 opened this issue Apr 4, 2021 · 4 comments

cai1991 commented Apr 4, 2021

Hi,

I'm trying your pipeline to merge my SVs, which were generated by whole genome comparisons among several de novo assemblies, into a single vcf file. I'm wondering:

  1. Is it possible to use Jasmine to merge SVs that overlap by a high percentage but fail to meet the "max_dist" requirement? Below are two examples that Jasmine did not merge (I only set the "--output_genotypes" parameter; all others were left at their defaults). The two examples correspond to the SVs in the figure.
    [Figure: SV examples]
  • Example 1: Same end breakpoint, 93% overlap

C3 10180346 0_INV27953 N <INV> . PASS END=10361415;SVLEN=181069;SVTYPE=INV;AVG_LEN=181069.000000;AVG_START=10180346.000000;AVG_END=10361414.000000;SUPP_VEC_EXT=10;IDLIST_EXT=INV27953;SUPP_EXT=1;SUPP_VEC=10;SUPP=1;SVMETHOD=JASMINE;IDLIST=INV27953
C3 10192856 1_INV34939 N <INV> . PASS END=10361415;SVLEN=168559;SVTYPE=INV;AVG_LEN=168559.000000;AVG_START=10192856.000000;AVG_END=10361414.000000;SUPP_VEC_EXT=01;IDLIST_EXT=INV34939;SUPP_EXT=1;SUPP_VEC=01;SUPP=1;SVMETHOD=JASMINE;IDLIST=INV34939

  • Example 2: Same start breakpoint, 96% overlap

C3 29342378 0_INV27963 N <INV> . PASS END=29948423;SVLEN=606045;SVTYPE=INV;AVG_LEN=606045.000000;AVG_START=29342378.000000;AVG_END=29948422.000000;SUPP_VEC_EXT=10;IDLIST_EXT=INV27963;SUPP_EXT=1;SUPP_VEC=10;SUPP=1;SVMETHOD=JASMINE;IDLIST=INV27963
C3 29342378 1_INV34950 N <INV> . PASS END=29973346;SVLEN=630968;SVTYPE=INV;AVG_LEN=630968.000000;AVG_START=29342378.000000;AVG_END=29973345.000000;SUPP_VEC_EXT=01;IDLIST_EXT=INV34950;SUPP_EXT=1;SUPP_VEC=01;SUPP=1;SVMETHOD=JASMINE;IDLIST=INV34950

  2. For insertions, how should the length of the variant be indicated with the SVLEN INFO field? Does SVLEN equal the length of the inserted sequence? Is the example below correct?
    C1 498768 INS37 N <INS> . PASS END=498768;ChrB=C1;StartB=496550;EndB=496651;Parent=SYN44;VarType=ShV;DupType=.;SVLEN=102;SVTYPE=INS;STRANDS=+
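
The overlap percentages quoted for Examples 1 and 2 can be reproduced with a short calculation. This is only a sketch: `reciprocal_overlap` is a hypothetical helper, not part of Jasmine, and it assumes each interval's length is END minus POS, which matches the SVLEN values in the records above.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Fraction of the longer interval covered by the shared region.

    Assumption: interval length is end - start, as in the SVLEN
    values of the example records. Hypothetical helper, not Jasmine code.
    """
    shared = max(0, min(end_a, end_b) - max(start_a, start_b))
    longest = max(end_a - start_a, end_b - start_b)
    return shared / longest if longest > 0 else 0.0


# Example 1: INV27953 vs INV34939 (same end breakpoint)
example_1 = reciprocal_overlap(10180346, 10361415, 10192856, 10361415)

# Example 2: INV27963 vs INV34950 (same start breakpoint)
example_2 = reciprocal_overlap(29342378, 29948423, 29342378, 29973346)

print(f"Example 1: {example_1:.0%}")  # about 93%
print(f"Example 2: {example_2:.0%}")  # about 96%
```

Dividing by the longer interval gives the smaller of the two per-variant overlap fractions, which is the usual meaning of reciprocal overlap.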

Thank you very much in advance for your help.

Best regards,
Chengcheng

Owner

mkirsche commented Apr 5, 2021

Hi Chengcheng,

Thanks a lot for your interest in using Jasmine!

As for your first question, the best approach would be to make the distance thresholds depend on the length of the variants using the max_dist_linear parameter. While it doesn't explicitly look at overlap, it gives these large variants large distance thresholds so that they can be correctly merged with each other. I recommend something like this (though the exact values depend on the organism being studied and the upstream pipeline you are using):

    max_dist_linear=0.1 min_dist=50 --mutual_distance

The --mutual_distance parameter was only added very recently to the GitHub build, so it is not in the conda release yet if you are using that, but it will be added to conda in the next release later this week. Just to briefly explain the parameters:

  • max_dist_linear=0.1 sets the distance threshold for each variant to 10% of its length
  • min_dist=50 prevents the threshold from getting impossibly small for SVs of length below 100 or so by setting a lower bound of 50 on their distance threshold, so the threshold for each SV will be max(10% of length, 50)
  • --mutual_distance requires that for a pair of variants to be merged, they must both be within each other's distance threshold. By default, they only have to be within the larger of the two thresholds, which can be problematic for very large variants since they can end up merging with much smaller variants at similar genomic positions
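
The three parameters above can be sketched as follows. This is an illustration of the rules as described in this comment, not Jasmine's actual implementation: `distance_threshold` and `can_merge` are hypothetical names, and the distance between two variants is passed in as a plain number because the thread does not say how Jasmine computes it.

```python
def distance_threshold(svlen, max_dist_linear=0.1, min_dist=50):
    """Per-variant threshold: max(fraction of length, lower bound)."""
    return max(max_dist_linear * abs(svlen), min_dist)


def can_merge(dist, len_a, len_b, mutual_distance=False,
              max_dist_linear=0.1, min_dist=50):
    """Decide whether two variants at distance `dist` may be merged.

    Assumption: `dist` is whatever distance the merger uses between
    two variants; it is supplied by the caller here.
    """
    thr_a = distance_threshold(len_a, max_dist_linear, min_dist)
    thr_b = distance_threshold(len_b, max_dist_linear, min_dist)
    if mutual_distance:
        # Both variants must be within each other's threshold.
        limit = min(thr_a, thr_b)
    else:
        # Default described above: the larger threshold is enough.
        limit = max(thr_a, thr_b)
    return dist <= limit


# Thresholds for the two inversions in Example 1 (10% of each length).
print(distance_threshold(181069), distance_threshold(168559))

# A 5000 bp variant and a 200 bp variant, 300 apart (made-up numbers):
print(can_merge(300, 5000, 200))                        # larger threshold (500) applies
print(can_merge(300, 5000, 200, mutual_distance=True))  # smaller threshold (50) applies
```

The last two calls show the situation the --mutual_distance bullet describes: without it, the large variant's threshold lets it absorb a much smaller neighbor.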

For your second question, that format is correct. Jasmine can infer the length from the REF and ALT fields if they are filled out (so if they are e.g. A and ATGTATGCGT it will automatically use 9 as the SVLEN value). But if not, it falls back to the SVLEN field.
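
The fallback order described above can be sketched like this. It is a simplified illustration, not Jasmine's code: `infer_svlen` is a hypothetical helper, and treating REF and ALT as "filled out" when both are plain base strings is an assumption made for the example.

```python
def infer_svlen(ref, alt, info):
    """Length from REF/ALT when both are explicit sequences, else SVLEN.

    Assumption: "filled out" means REF and ALT consist only of
    A/C/G/T/N characters; symbolic alleles such as <INS> do not count.
    """
    bases = set("ACGTN")
    explicit = bool(ref) and bool(alt) and \
        set(ref.upper()) <= bases and set(alt.upper()) <= bases
    if explicit and len(ref) != len(alt):
        # Positive for insertions, negative for deletions.
        return len(alt) - len(ref)
    # Fall back to the SVLEN key of the INFO column.
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    return int(fields["SVLEN"]) if "SVLEN" in fields else None


# The REF/ALT pair mentioned above: 10 bases minus 1 base.
print(infer_svlen("A", "ATGTATGCGT", "SVTYPE=INS"))

# The insertion record from the question: symbolic ALT, so SVLEN is used.
info = ("END=498768;ChrB=C1;StartB=496550;EndB=496651;Parent=SYN44;"
        "VarType=ShV;DupType=.;SVLEN=102;SVTYPE=INS;STRANDS=+")
print(infer_svlen("N", "<INS>", info))
```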

I hope that helps, and please don't hesitate to reach out with any other questions!

Best,
Melanie

Author

cai1991 commented Apr 5, 2021

Hi Melanie,

Thanks a lot for your clear explanation. I will try your suggestions.

Best regards,
Chengcheng

Author

cai1991 commented Apr 8, 2021

Hi Melanie,

I see you have added a new parameter (min_overlap) to Jasmine to set the minimum reciprocal overlap. How does it work? If two variants have a reciprocal overlap greater than "min_overlap", will Jasmine still take "max_dist_linear" or "max_dist" into account when deciding whether to merge them?

Best,
Chengcheng

Owner

mkirsche commented Apr 8, 2021 via email

@Qijie0615 Qijie0615 mentioned this issue Mar 19, 2022