Merging SV from Oxford Nanopore - expected runtime #41

Closed
agolicz opened this issue Jun 15, 2022 · 3 comments

agolicz commented Jun 15, 2022

Hi,
I am trying to merge SVs discovered from Oxford Nanopore data (plant species, 60 samples, 20,000-50,000 SVs/sample). It has been running for 10 hours now. Is that the expected runtime? Would it make sense to merge stepwise? For example, find the most closely related samples, merge those first (say, in groups of 5-10), and then merge the merged files to get the final non-redundant SV set?
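Something along these lines is what I have in mind, assuming dysgu merge will also accept already-merged VCFs as input (the group file names are hypothetical):

# hypothetical stepwise merge: merge closely related samples in groups first,
# then merge the group-level VCFs into the final non-redundant set
dysgu merge groupA_*.pass.vcf > groupA.merged.vcf
dysgu merge groupB_*.pass.vcf > groupB.merged.vcf
dysgu merge group*.merged.vcf > long.vcf

Current pipeline: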

# align long reads and sort
minimap2 --MD -t 16 -ax map-ont ../short/Express617_v1.fa /vol/agcpgl/jlee/BreedPath_nanopore/${ID}.fq.gz | samtools sort -o ${ID}.bam
# call SVs per sample
dysgu call -p 8 -v 2 --min-support 5 --mode nanopore ../short/Express617_v1.fa temp_dir.$ID $ID.bam > $ID.vcf
# keep only PASS calls
python flt_vcf.py $ID.vcf > $ID.pass.vcf
# merge across all samples
dysgu merge *pass.vcf > long.vcf
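For reference, flt_vcf.py is essentially just a PASS filter, roughly along these lines (simplified sketch; the real script may differ):

#!/usr/bin/env python
# simplified sketch of flt_vcf.py: print header lines and records whose FILTER column is PASS
import sys

with open(sys.argv[1]) as vcf:
    for line in vcf:
        if line.startswith("#") or line.rstrip("\n").split("\t")[6] == "PASS":
            sys.stdout.write(line)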

kcleal commented Jun 15, 2022

Hi @agolicz,
Thanks for reporting this. That is much longer than expected, and I will try to help get it fixed. Would you mind checking a few things for me? If possible, could you check how much memory is being consumed? Also, if you have time, would you mind trying to merge just two of your samples rather than the whole cohort? That should only take a minute or so, and it would be useful to know whether it completes in a reasonable time. I have mainly tested merging on larger cohorts of short-read data, so it's possible there is a scaling issue for long-read data.
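If it helps, one way to capture peak memory is GNU time, e.g. (the file names below are just placeholders):

/usr/bin/time -v dysgu merge sample1.pass.vcf sample2.pass.vcf > two_sample_test.vcf
# the "Maximum resident set size" line in the report is the peak memory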

I had a quick scan of the code, and it looks like there could be a scaling issue when a complex region of the genome contains lots of overlapping SVs or a lot of diversity - this situation is common near centromeric regions in humans, for example. Merging is essentially an all-vs-all comparison within these regions, which might give rise to the long runtime. However, dysgu usually assigns these types of rearrangements a low probability, so it's possible you have already filtered them out with the flt_vcf.py script?
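Just to illustrate the scaling concern (this is not dysgu's code, only the rough arithmetic of an all-vs-all comparison within a single cluster of overlapping SVs):

# pair count for an all-vs-all comparison of n overlapping SVs
def pairwise_comparisons(n):
    return n * (n - 1) // 2

for n in (100, 1000, 10000):
    print(n, pairwise_comparisons(n))
# 100 -> 4950, 1000 -> 499500, 10000 -> 49995000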


agolicz commented Jun 15, 2022

Hi,
Thanks for the reply.
It actually finished:

2022-06-14 21:50:43,619 [INFO   ]  [dysgu-merge] Version: 1.3.11
2022-06-14 21:54:54,489 [INFO   ]  Merge distance: 500 bp
2022-06-15 08:56:46,338 [INFO   ]  SVs output to stdout
2022-06-15 08:56:46,342 [INFO   ]  Input samples: ['a702', 'a703', 'a705', 'a709', 'a711', 'a714', 'a715', 'a716', 'a717', 'a723', 'a724', 'a726', 'a727', 'a728', 'a729', 'a730', 'a731', 'a732', 'a733', 'a734', 'a735', 'a743', 'a748', 'a762', 'a764', 'a765', 'a776', 'a778', 'a779', 'a783', 'a784', 'a786', 'a790', 'a792', 'a796', 'a797', 'a802', 'a810', 'a815', 'a816', 'a817', 'a818', 'a820', 'a823', 'a824', 'a825', 'a827', 'a828', 'a830', 'a833', 'a834', 'a835', 'a836', 'a838', 'a839', 'a840', 'a841', 'a842', 'a843', 'a845']
2022-06-15 09:08:44,000 [INFO   ]  Sample rows before merge [27524, 39428, 36668, 18995, 42091, 42029, 30433, 26344, 27145, 40743, 1269, 36795, 36315, 43578, 39322, 43658, 35629, 37191, 41179, 29260, 34499, 20557, 39690, 36079, 38506, 42710, 49017, 40979, 41427, 44644, 42835, 40906, 42328, 40995, 32513, 51, 41343, 40148, 39025, 46364, 39965, 20092, 31426, 35372, 37474, 14263, 20316, 33790, 44510, 40746, 37944, 19803, 40951, 32969, 24995, 10389, 24662, 31023, 15909, 26496], rows after 375685
2022-06-15 09:08:44,009 [INFO   ]  dysgu merge complete h:m:s, 11:18:00

cat *pass.vcf | grep -v "^#" | wc -l    # SV records across all samples before merging
2013307
grep -v "^#" long.vcf | wc -l           # SV records after merging
375685

Yes, flt_vcf.py only keeps the variants with PASS.
Can't check the exact memory usage, but it had to be less than 40G, which was the limit.
Just tried merging two files:

dysgu merge a843.pass.vcf a845.pass.vcf > dm.t.vcf
2022-06-15 11:54:15,586 [INFO   ]  [dysgu-merge] Version: 1.3.11
2022-06-15 11:54:18,394 [INFO   ]  Merge distance: 500 bp
2022-06-15 11:54:34,354 [INFO   ]  SVs output to stdout
2022-06-15 11:54:34,392 [INFO   ]  Input samples: ['a843', 'a845']
2022-06-15 11:54:47,542 [INFO   ]  Sample rows before merge [15909, 26496], rows after 36978
2022-06-15 11:54:47,543 [INFO   ]  dysgu merge complete h:m:s, 0:00:31

11 hrs is not too bad (we're used to that in plants :)). I was just surprised because merging 100 short-read samples was much quicker.

If you are interested in testing merging for long reads, I am planning to run SVJedi to genotype, and I can report if there are any issues, sites with too many missing genotypes, etc.

minimap2+dysgu have done very well in our in-house comparisons for Brassica napus! :)


kcleal commented Jun 15, 2022

Glad it finished! In that case I think the runtime is probably caused by high genome complexity. I would be very interested to hear how you get on - feedback from users is very valuable! If you have not come across it already, Jasmine could also be a useful tool for merging: https://github.com/mkirsche/Jasmine
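From memory the invocation is roughly as below, but please check the Jasmine README for the exact parameters (I have not tested this on your data):

ls *pass.vcf > vcf_list.txt
jasmine file_list=vcf_list.txt out_file=jasmine_merged.vcf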
