Merging SV from Oxford Nanopore - expected runtime #41

Closed
agolicz opened this issue Jun 15, 2022 · 3 comments

agolicz commented Jun 15, 2022

Hi,
I am trying to merge SVs discovered from Oxford Nanopore data (plant species, 60 samples, 20,000-50,000 SVs/sample). It has been running for 10 hours now. Is that the expected runtime? Would it make sense to merge stepwise? For example, find the most closely related samples, merge those first (say, in groups of 5-10), and then merge the merged files to get the final non-redundant SV set?
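Something along these lines is what I have in mind, assuming dysgu merge will also accept already-merged VCFs as input (the group file names are hypothetical):

# hypothetical stepwise merge: merge closely related samples in groups first,
# then merge the group-level VCFs into the final non-redundant set
dysgu merge groupA_*.pass.vcf > groupA.merged.vcf
dysgu merge groupB_*.pass.vcf > groupB.merged.vcf
dysgu merge group*.merged.vcf > long.vcf

Current pipeline: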

# align long reads and sort
minimap2 --MD -t 16 -ax map-ont ../short/Express617_v1.fa /vol/agcpgl/jlee/BreedPath_nanopore/${ID}.fq.gz | samtools sort -o ${ID}.bam
# call SVs per sample
dysgu call -p 8 -v 2 --min-support 5 --mode nanopore ../short/Express617_v1.fa temp_dir.$ID $ID.bam > $ID.vcf
# keep only PASS calls
python flt_vcf.py $ID.vcf > $ID.pass.vcf
# merge across all samples
dysgu merge *pass.vcf > long.vcf
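For reference, flt_vcf.py is essentially just a PASS filter, roughly along these lines (simplified sketch; the real script may differ):

#!/usr/bin/env python
# simplified sketch of flt_vcf.py: print header lines and records whose FILTER column is PASS
import sys

with open(sys.argv[1]) as vcf:
    for line in vcf:
        if line.startswith("#") or line.rstrip("\n").split("\t")[6] == "PASS":
            sys.stdout.write(line)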

kcleal commented Jun 15, 2022

Hi @agolicz,
Thanks for reporting this. That is much longer than expected, and I will try to help get it fixed. Would you mind checking a few things for me? If possible, could you check how much memory is being consumed? Also, if you have time, would you mind trying to merge just two of your samples rather than the whole cohort? That should only take a minute or so, and it would be useful to know whether it completes in a reasonable time. I have mainly tested merging on larger cohorts of short-read data, so it's possible there is a scaling issue for long-read data.
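If it helps, one way to capture peak memory is GNU time, e.g. (the file names below are just placeholders):

/usr/bin/time -v dysgu merge sample1.pass.vcf sample2.pass.vcf > two_sample_test.vcf
# the "Maximum resident set size" line in the report is the peak memory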

I had a quick scan of the code, and it looks like there could be a scaling issue when a complex region of the genome contains lots of overlapping SVs or a lot of diversity - this situation is common near centromeric regions in humans, for example. Merging is essentially an all-vs-all comparison within these regions, which might give rise to the long runtime. However, dysgu usually assigns these types of rearrangements a low probability, so it's possible you have already filtered them out with the flt_vcf.py script?
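Just to illustrate the scaling concern (this is not dysgu's code, only the rough arithmetic of an all-vs-all comparison within a single cluster of overlapping SVs):

# pair count for an all-vs-all comparison of n overlapping SVs
def pairwise_comparisons(n):
    return n * (n - 1) // 2

for n in (100, 1000, 10000):
    print(n, pairwise_comparisons(n))
# 100 -> 4950, 1000 -> 499500, 10000 -> 49995000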


agolicz commented Jun 15, 2022

Hi,
Thanks for the reply.
It actually finished:

2022-06-14 21:50:43,619 [INFO   ]  [dysgu-merge] Version: 1.3.11
2022-06-14 21:54:54,489 [INFO   ]  Merge distance: 500 bp
2022-06-15 08:56:46,338 [INFO   ]  SVs output to stdout
2022-06-15 08:56:46,342 [INFO   ]  Input samples: ['a702', 'a703', 'a705', 'a709', 'a711', 'a714', 'a715', 'a716', 'a717', 'a723', 'a724', 'a726', 'a727', 'a728', 'a729', 'a730', 'a731', 'a732', 'a733', 'a734', 'a735', 'a743', 'a748', 'a762', 'a764', 'a765', 'a776', 'a778', 'a779', 'a783', 'a784', 'a786', 'a790', 'a792', 'a796', 'a797', 'a802', 'a810', 'a815', 'a816', 'a817', 'a818', 'a820', 'a823', 'a824', 'a825', 'a827', 'a828', 'a830', 'a833', 'a834', 'a835', 'a836', 'a838', 'a839', 'a840', 'a841', 'a842', 'a843', 'a845']
2022-06-15 09:08:44,000 [INFO   ]  Sample rows before merge [27524, 39428, 36668, 18995, 42091, 42029, 30433, 26344, 27145, 40743, 1269, 36795, 36315, 43578, 39322, 43658, 35629, 37191, 41179, 29260, 34499, 20557, 39690, 36079, 38506, 42710, 49017, 40979, 41427, 44644, 42835, 40906, 42328, 40995, 32513, 51, 41343, 40148, 39025, 46364, 39965, 20092, 31426, 35372, 37474, 14263, 20316, 33790, 44510, 40746, 37944, 19803, 40951, 32969, 24995, 10389, 24662, 31023, 15909, 26496], rows after 375685
2022-06-15 09:08:44,009 [INFO   ]  dysgu merge complete h:m:s, 11:18:00

cat *pass.vcf | grep -v "^#" | wc -l    # SV records across all samples before merging
2013307
grep -v "^#" long.vcf | wc -l           # SV records after merging
375685

Yes, flt_vcf.py only keeps the variants with PASS.
Can't check the exact memory usage, but it had to be less than 40G, which was the limit.
Just tried merging two files:

dysgu merge a843.pass.vcf a845.pass.vcf > dm.t.vcf
2022-06-15 11:54:15,586 [INFO   ]  [dysgu-merge] Version: 1.3.11
2022-06-15 11:54:18,394 [INFO   ]  Merge distance: 500 bp
2022-06-15 11:54:34,354 [INFO   ]  SVs output to stdout
2022-06-15 11:54:34,392 [INFO   ]  Input samples: ['a843', 'a845']
2022-06-15 11:54:47,542 [INFO   ]  Sample rows before merge [15909, 26496], rows after 36978
2022-06-15 11:54:47,543 [INFO   ]  dysgu merge complete h:m:s, 0:00:31

11 hrs is not too bad (we're used to that in plants :)). I was just surprised because merging 100 short-read samples was much quicker.

If you are interested in testing merging for long reads, I am planning to run SVJedi to genotype, and I can report if there are any issues, sites with too many missing genotypes, etc.

minimap2+dysgu have done very well in our in-house comparisons for Brassica napus! :)


kcleal commented Jun 15, 2022

Glad it finished! In that case I think the runtime is probably caused by high genome complexity. I would be very interested to hear how you get on - feedback from users is very valuable! If you have not come across it already, Jasmine could also be a useful tool for merging: https://github.com/mkirsche/Jasmine
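From memory the invocation is roughly as below, but please check the Jasmine README for the exact parameters (I have not tested this on your data):

ls *pass.vcf > vcf_list.txt
jasmine file_list=vcf_list.txt out_file=jasmine_merged.vcf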
