sambamba error #165
This is a new one for me. A quick search says that adding the CLI option --overflow-list-size (e.g. --overflow-list-size 600000) should fix it. Could you try this? You would need to add it here: snpArcher/workflow/rules/fastq2bam.smk Line 58 in f0fe796
If this works, you can open a PR, or let me know and I can add it.
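For illustration only, here is a sketch of what the added flag could look like inside the rule's shell directive. The rule name and file paths below are placeholders, not a verbatim copy of fastq2bam.smk; the wildcard names follow the command quoted later in this thread, and 600000 is the value suggested above:

```python
# Sketch of a markdup rule with the extra sambamba flag added.
# Rule name and paths are placeholders, not snpArcher's actual rule.
rule dedup:
    input:
        bam = "results/{sample}_sorted.bam"       # placeholder path
    output:
        dedupBam = "results/{sample}_dedup.bam"   # placeholder path
    log:
        "logs/dedup/{sample}.txt"                 # placeholder path
    threads: 4
    shell:
        "sambamba markdup --overflow-list-size 600000 -t {threads} "
        "{input.bam} {output.dedupBam} 2> {log}"
```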
Hi Cade, I tried adjusting the overflow-list-size to 600000 and hit the same issue, so I then tried changing it to 10,000,000, almost as a dare to the computer, and it worked! That may be overkill, but it allowed the markdup job to run. Probably best that you just make the change to the .smk file so I don't do anything inadvertent. I used:

    shell:
        "sambamba markdup --overflow-list-size 10000000 -t {threads} {input.bam} {output.dedupBam} 2> {log}"

I thought I'd wait to give you the update until the job had finished, in case anything else fouled, but the haplotype caller has been running for ~15 hours now and I'm not sure how much longer it will run. Not that I have any real basis to be surprised, since I don't know the ins and outs of the workflow, but I was surprised that the genome was only split into 9 GVCF intervals when it was split into 50+ intervals for the alignment step. Does this sound about right to you? The computer I'm running it on has 192GB of memory and 24 cores (16 performance + 8 efficiency). I allocated 23 cores to the job. My two input fastq.gz files were ~3GB each, with genome sizes of ~350Mb. The reference genome I used is scaffold level, so maybe that was part of it, or maybe 9 intervals would be expected in this situation. Really appreciate your time and expertise!
Wow, glad it worked. Curious how big the BAMs are going into dedup. Is this public data? If so, could you share it? I'd be interested to see if I can replicate this. Your config.yaml and sample.csv might also be useful. 50+ intervals for the alignment step does not sound right to me, as we do not scatter the alignment step over intervals. I'm not sure how the alignment step would get split over intervals, so if you share the dry-run summary of your workflow, that might help me understand what's going on. 9 GVCF intervals does sound in the right range, though this is heavily dependent on the contiguity of your reference genome and the relevant config setting.
Unfortunately the resequencing data aren't yet public, but I've uploaded the fastq.gz files for the two samples I used, plus the reference, to a Drive folder. The reference is what was uploaded to NCBI from the CCGP. I also uploaded the config.yaml and sample.csv, which I called test.csv. I dropped the logs folder in there as well, in its current state. I may have misunderstood the term 'interval' for the alignment step. What I was referring to was the number of pre-merged BAMs. I thought all of these pre-merged BAMs for each individual were the result of splitting the genome into intervals, but it sounds like that's not right. Sounds like the intervals are just for the HaplotypeCaller step? Not sure how large the BAMs were going into dedup. My impression is that the pre-dedup BAMs are removed and replaced with _final.bam, so I couldn't go in and check? The _final.bams are ~6.5GB, if that's helpful. I also added a .txt file called dry_run.txt to the Drive where I just copied and pasted the output from a dry run. Let me know if you'd like to see it a different way. Thanks Cade!
Great, I'll see if I can replicate this. Thanks for uploading that. As for the intervals - yes, intervals are only used for BAM->GVCF->VCF. We have a detailed description of how we generate intervals in the paper.
Sounds great. Thank you for doing that. I think I get it now. As you can probably tell, a lot of this stuff just skims the top of my head as it sails right over it. I appreciate your patience and time.
Hi Jason, I ran your data through snpArcher on a Linux server running Ubuntu. On the Ubuntu machine:
Whereas on my M2 MacBook Pro:
I believe you can adjust this value by entering the command
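Assuming the value in question is the shell's open-file-descriptor limit (an assumption based on context - sambamba markdup keeps many temporary files open, and the default limit differs between Linux and macOS), checking and raising it for the current shell session looks like this:

```bash
# Show the current soft limit on open file descriptors for this shell
ulimit -n

# Raise it for this shell session before launching the workflow
# (4096 is an example value; the hard limit caps how far it can be raised)
ulimit -n 4096
```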
Hi Cade, Thanks very much for clarifying that. I got the workflow to run on the two test samples using our lab computer and it seemed to go all the way through, although it didn't produce QC files, even though the two samples had the same reference genome. Maybe there has to be more than two samples? I imagine you had the same output when you ran it. The workflow ran in ~52 hours on our lab computer with 192GB of memory and 23/24 cores used. When I run the workflow for real it will be on the UCSC Hummingbird cluster with files of similar size, although the reference genomes will be different, and potentially more fragmented. I have a request in with the Hummingbird IT to help guide me on which partition to use and how many cores and how much memory to request. I'm wondering if there are any parameters on the config.yaml end that you would adjust to get it to run faster, or is it mostly just a matter of more computing power? Thanks!
@jasonwjohns Generally, snpArcher is optimized to run many single-core jobs in parallel. I think the default number of concurrent jobs is something around 1000, which you probably shouldn't need to change. You will need to change the partition you submit to, ideally to something with at least a 3-day run time. I think the Hummingbird cluster uses the SLURM job management system; we have a profile for SLURM built into snpArcher (profiles/slurm), but if you or your IT colleagues have a Snakemake profile optimized for the Hummingbird cluster, that would probably be ideal (although you'll need to make sure that use-conda is set to True). Note that we are still working on adapting our profile system to Snakemake 8, which makes some big changes.
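For a rough idea of what such a profile can contain, here is a minimal sketch of a Snakemake (pre-version-8) profile config.yaml. The partition name, memory, and walltime values are placeholders to replace with whatever Hummingbird's IT recommends; this is not the exact profile that ships with snpArcher:

```yaml
# Hypothetical cluster profile sketch (e.g. profiles/hummingbird/config.yaml).
use-conda: True        # required so each rule runs in its conda environment
jobs: 1000             # cap on jobs submitted to the scheduler at once
latency-wait: 60       # tolerate shared-filesystem lag before failing on missing outputs
cluster: "sbatch --partition=PLACEHOLDER_PARTITION --cpus-per-task={threads} --mem={resources.mem_mb} --time=3-00:00:00"
default-resources: ["mem_mb=4000"]
```

Invoked with `snakemake --profile <path-to-profile-dir>`, these options are applied as if they had been passed on the command line.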
Hi Jason, That's odd - I just checked and you are right - my run did not produce the QC html. I will look into this. EDIT: Number of samples has to be greater than 2. I misspoke on that earlier. In any case, if you want to run the QC module now that the workflow has completed, you can do so like this:
Just check that the dry run makes sense; it should be something like this:
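As a generic illustration of that kind of targeted dry run (the target path below is purely a placeholder, not the actual snpArcher QC output name):

```bash
# -n / --dry-run previews which jobs would be run without executing anything;
# the target is a placeholder for the QC report the workflow is expected to produce.
snakemake --use-conda --cores 23 -n results/PLACEHOLDER_QC_REPORT.html
```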
As for Hummingbird and settings there, let me get back to you. I'm currently working on refactoring some of the code to clean up how we specify computational resources. I'll also be updating the docs about this, as well as about using SLURM (which Hummingbird uses).
Hi Tim and Cade, This is all great. I so appreciate all of your personalized help, not to mention all the work you did to write the workflow. Cheers to you! Jason
Hi guys,
Sorry to be back so soon. After successfully getting the workflow to run with one sample with very low coverage this morning, I ran it again on two samples. These samples are of a size (~3GB fastq.gz files) more comparable to what I will be running in my upcoming job. The workflow ran great, all the way to sambamba markdup, when I got the below error for each file.
This is essentially a wild guess, but maybe it could have something to do with the fact that I manually input phony SRR values in the 'Run' column of the sample sheet - '1' and '2'? I may be grasping at straws there.
Any ideas?
Thank you!
Jason