Merge SVs with high percentage of overlap #13

cai1991 · 2021-04-04T16:25:29Z

Hi,

I'm trying your pipeline to merge my SVs, which were generated by whole genome comparisons among several de novo assemblies, into a single vcf file. I'm wondering:

Is it possible to merge SVs that have a high percentage of overlap but fail to meet the "max_dist" requirement using Jasmine? Below are two examples that Jasmine did not merge (I used only the "--output_genotypes" parameter; all others were left at their defaults). The two examples correspond to the SVs in the figure.

Example 1: Same end breakpoint, 93% overlap

C3 10180346 0_INV27953 N <INV> . PASS END=10361415;SVLEN=181069;SVTYPE=INV;AVG_LEN=181069.000000;AVG_START=10180346.000000;AVG_END=10361414.000000;SUPP_VEC_EXT=10;IDLIST_EXT=INV27953;SUPP_EXT=1;SUPP_VEC=10;SUPP=1;SVMETHOD=JASMINE;IDLIST=INV27953
C3 10192856 1_INV34939 N <INV> . PASS END=10361415;SVLEN=168559;SVTYPE=INV;AVG_LEN=168559.000000;AVG_START=10192856.000000;AVG_END=10361414.000000;SUPP_VEC_EXT=01;IDLIST_EXT=INV34939;SUPP_EXT=1;SUPP_VEC=01;SUPP=1;SVMETHOD=JASMINE;IDLIST=INV34939

Example 2: Same start breakpoint, 96% overlap

C3 29342378 0_INV27963 N <INV> . PASS END=29948423;SVLEN=606045;SVTYPE=INV;AVG_LEN=606045.000000;AVG_START=29342378.000000;AVG_END=29948422.000000;SUPP_VEC_EXT=10;IDLIST_EXT=INV27963;SUPP_EXT=1;SUPP_VEC=10;SUPP=1;SVMETHOD=JASMINE;IDLIST=INV27963
C3 29342378 1_INV34950 N <INV> . PASS END=29973346;SVLEN=630968;SVTYPE=INV;AVG_LEN=630968.000000;AVG_START=29342378.000000;AVG_END=29973345.000000;SUPP_VEC_EXT=01;IDLIST_EXT=INV34950;SUPP_EXT=1;SUPP_VEC=01;SUPP=1;SVMETHOD=JASMINE;IDLIST=INV34950

For insertions, how should the length of the variant be indicated with the SVLEN INFO field? Does SVLEN equal the length of the inserted sequence? Is the example below correct?
C1 498768 INS37 N <INS> . PASS END=498768;ChrB=C1;StartB=496550;EndB=496651;Parent=SYN44;VarType=ShV;DupType=.;SVLEN=102;SVTYPE=INS;STRANDS=+

Thank you very much in advance for your help.

Best regards,
Chengcheng


mkirsche · 2021-04-05T16:27:20Z

Hi Chengcheng,

Thanks a lot for your interest in using Jasmine!

As for your first question, the best approach would be to have the distance thresholds depend on the length of the variants using the max_dist_linear parameter. While it doesn't explicitly look at overlap, it gives these large variants correspondingly large distance thresholds so that they can be correctly merged with each other. I recommend something like this (though the exact values depend on the organism being studied and the upstream pipeline you are using):

max_dist_linear=0.1 min_dist=50 --mutual_distance

The --mutual_distance parameter was only added very recently to the GitHub build, so it is not in the conda release yet if you are using that, but it will be added to conda in the next release later this week. Just to briefly explain the parameters:

max_dist_linear=0.1 sets the distance threshold for each variant to 10% of its length
min_dist=50 prevents the threshold from getting impossibly small for SVs of length below 100 or so by setting a lower bound of 50 on their distance threshold, so the threshold for each SV will be max(10% of length, 50)
--mutual_distance requires that for a pair of variants to be merged, they must both be within each other's distance threshold. By default, they only have to be within the larger of the two thresholds, which can be problematic for very large variants since they can end up merging with much smaller variants at similar genomic positions
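
The threshold logic described in these three bullets can be sketched in a few lines of Python. This is only an illustration of the rules as stated above, not Jasmine's actual code: the function names `distance_threshold` and `within_threshold` are hypothetical, and the breakpoint distance `dist` is passed in directly because the way Jasmine measures it is not spelled out here.

```python
def distance_threshold(sv_length, max_dist_linear=0.1, min_dist=50):
    """Per-variant threshold: max(max_dist_linear * length, min_dist)."""
    return max(max_dist_linear * abs(sv_length), min_dist)


def within_threshold(dist, len_a, len_b, mutual_distance=False,
                     max_dist_linear=0.1, min_dist=50):
    """Decide whether a pair at breakpoint distance `dist` is close enough.

    Default rule: dist must be within the LARGER of the two thresholds.
    Mutual rule:  dist must be within BOTH thresholds, i.e. the smaller one.
    """
    thr_a = distance_threshold(len_a, max_dist_linear, min_dist)
    thr_b = distance_threshold(len_b, max_dist_linear, min_dist)
    limit = min(thr_a, thr_b) if mutual_distance else max(thr_a, thr_b)
    return dist <= limit


# A 100 kb variant (threshold 10,000) next to a 200 bp variant (threshold 50),
# with an assumed breakpoint distance of 5,000 between them.
print(within_threshold(5000, 100_000, 200))                        # default rule
print(within_threshold(5000, 100_000, 200, mutual_distance=True))  # mutual rule
```

With the default rule the large variant's threshold alone lets the pair through, which is the problematic case mentioned above; with the mutual rule the small variant's threshold of 50 blocks the merge.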

For your second question, that format is correct. Jasmine can infer the length from the REF and ALT fields if they are filled out (so if they are e.g. A and ATGTATGCGT it will automatically use 9 as the SVLEN value). But if not, it falls back to the SVLEN field.
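
As a rough sketch of the fallback described above (not Jasmine's actual implementation; the helper names `is_sequence` and `infer_svlen` are hypothetical), the length can be taken from explicit REF/ALT sequences when they are present, and from the SVLEN INFO value otherwise:

```python
def is_sequence(allele):
    """True if the allele is an explicit base sequence rather than a symbolic one like <INS>."""
    return bool(allele) and set(allele.upper()) <= set("ACGTN")


def infer_svlen(ref, alt, info_svlen=None):
    """Use len(ALT) - len(REF) when both are explicit sequences; otherwise fall back to SVLEN."""
    if is_sequence(ref) and is_sequence(alt):
        return len(alt) - len(ref)
    return info_svlen


print(infer_svlen("A", "ATGTATGCGT"))   # explicit alleles: 10 - 1 = 9
print(infer_svlen("N", "<INS>", 102))   # symbolic ALT: falls back to SVLEN=102
```

The second call mirrors the INS37 record from the question, where ALT is the symbolic `<INS>` and the length therefore has to come from SVLEN.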

I hope that helps, and please don't hesitate to reach out with any other questions!

Best,
Melanie

cai1991 · 2021-04-05T18:21:23Z

Hi Melanie,

Thanks a lot for your clear explanation. I will try based on your suggestions.

Best regards,
Chengcheng

cai1991 · 2021-04-08T09:05:17Z

Hi Melanie,

I see you have added a new parameter (min_overlap) in Jasmine to set the minimum reciprocal overlap. I'm wondering how it works? If two variants have reciprocal overlap greater than "min_overlap", will Jasmine still take "max_dist_linear" or "max_dist" into account to decide whether to merge or not?

Best,
Chengcheng

mkirsche · 2021-04-08T12:35:48Z

Hi Chengcheng,

When using this parameter, the overlap requirement is in addition to the breakpoint distance requirement. So Jasmine checks only variant pairs with breakpoints that are within the required merging distance of one another, and then among those merges only the pairs with sufficient overlap. I would still recommend using the max_dist_linear parameter to merge variant pairs which have high overlap but also large breakpoint distances, but this new setting is available in case you also want to avoid merging variant pairs with small breakpoint distances but little overlap.

Best,
Melanie
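
The two-stage check described in this reply can be illustrated with a small Python sketch. It is an assumption-laden illustration rather than Jasmine's code: the function names are hypothetical, reciprocal overlap is taken here to mean the overlap length divided by the longer variant's length (the usual definition, though the exact formula Jasmine uses is not stated in this thread), and the breakpoint distance and threshold are passed in as plain numbers.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Overlap fraction relative to each variant; return the smaller of the two."""
    len_a, len_b = end_a - start_a, end_b - start_b
    if len_a <= 0 or len_b <= 0:
        return 0.0
    overlap = max(0, min(end_a, end_b) - max(start_a, start_b))
    return min(overlap / len_a, overlap / len_b)


def passes_both(dist, threshold, start_a, end_a, start_b, end_b, min_overlap):
    """Stage 1: breakpoint distance within threshold. Stage 2: enough reciprocal overlap."""
    if dist > threshold:
        return False
    return reciprocal_overlap(start_a, end_a, start_b, end_b) >= min_overlap


# The two inversion pairs from the original question.
print(round(reciprocal_overlap(10180346, 10361415, 10192856, 10361415), 2))  # 0.93
print(round(reciprocal_overlap(29342378, 29948423, 29342378, 29973346), 2))  # 0.96

# Two 100 bp variants whose starts are 90 bp apart: close breakpoints, 10% overlap.
print(passes_both(90, 100, 1000, 1100, 1090, 1190, min_overlap=0.5))  # False
```

The last call shows the case the new setting targets: the pair passes the distance check but is rejected because the variants barely overlap. The two inversion pairs have high overlap, so for them the deciding factor remains the distance threshold.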


mkirsche closed this as completed Apr 29, 2021

Qijie0615 mentioned this issue Mar 19, 2022

min_overlap #32

Closed
