sambamba error #165
This is a new one for me. A quick search says that adding the CLI option --overflow-list-size (e.g. --overflow-list-size 600000) should fix it. Could you try this? You would need to add it here: snpArcher/workflow/rules/fastq2bam.smk Line 58 in f0fe796
If this works, you can open a PR, or let me know and I can add it.
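For illustration only, here is a sketch of what the added flag could look like inside the rule's shell directive. The rule name and file paths below are placeholders, not a verbatim copy of fastq2bam.smk; the wildcard names follow the command quoted later in this thread, and 600000 is the value suggested above:

```python
# Sketch of a markdup rule with the extra sambamba flag added.
# Rule name and paths are placeholders, not snpArcher's actual rule.
rule dedup:
    input:
        bam = "results/{sample}_sorted.bam"       # placeholder path
    output:
        dedupBam = "results/{sample}_dedup.bam"   # placeholder path
    log:
        "logs/dedup/{sample}.txt"                 # placeholder path
    threads: 4
    shell:
        "sambamba markdup --overflow-list-size 600000 -t {threads} "
        "{input.bam} {output.dedupBam} 2> {log}"
```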
Hi Cade, I tried adjusting the overflow-list-size to 600000 and hit the same issue, so I then tried changing it to 10,000,000, almost as a dare to the computer, and it worked! That may be overkill, but it allowed the markdup job to run. Probably best that you just make the change to the .smk file so I don't do anything inadvertent. I used:

    shell:
        "sambamba markdup --overflow-list-size 10000000 -t {threads} {input.bam} {output.dedupBam} 2> {log}"

I thought I'd wait to give you the update until the job had finished, in case anything else fouled, but the haplotype caller has been running for ~15 hours now and I'm not sure how much longer it will run. Not that I have any real basis to be surprised, since I don't know the ins and outs of the workflow, but I was surprised that the genome was only split into 9 GVCF intervals when it was split into 50+ intervals for the alignment step. Does this sound about right to you? The computer I'm running it on has 192GB of memory and 24 cores (16 performance + 8 efficiency). I allocated 23 cores to the job. My two input fastq.gz files were ~3GB each, with genome sizes of ~350Mb. The reference genome I used is scaffold level, so maybe that was part of it, or maybe 9 intervals would be expected in this situation. Really appreciate your time and expertise!
Wow, glad it worked. Curious how big the BAMs are going into dedup. Is this public data? If so, could you share it? I'd be interested to see if I can replicate this. Your config.yaml and sample.csv might also be useful. 50+ intervals for the alignment step does not sound right to me, as we do not scatter the alignment step over intervals. I'm not sure how the alignment step would get split over intervals, so if you share the dry-run summary of your workflow, that might help me understand what's going on. 9 GVCF intervals does sound in the right range, though this is heavily dependent on the contiguity of your reference genome and the relevant config setting.
Unfortunately the resequencing data aren't yet public, but I've uploaded the fastq.gz files for the two samples I used, plus the reference, to a Drive folder. The reference is what was uploaded to NCBI from the CCGP. I also uploaded the config.yaml and sample.csv, which I called test.csv. I dropped the logs folder in there as well, in its current state. I may have misunderstood the term 'interval' for the alignment step. What I was referring to was the number of pre-merged BAMs. I thought all of these pre-merged BAMs for each individual were the result of splitting the genome into intervals, but it sounds like that's not right. Sounds like the intervals are just for the HaplotypeCaller step? Not sure how large the BAMs were going into dedup. My impression is that the pre-dedup BAMs are removed and replaced with _final.bam, so I couldn't go in and check? The _final.bams are ~6.5GB, if that's helpful. I also added a .txt file called dry_run.txt to the Drive where I just copied and pasted the output from a dry run. Let me know if you'd like to see it a different way. Thanks Cade!
Great, I'll see if I can replicate this. Thanks for uploading that. As for the intervals - yes, intervals are only used for BAM->GVCF->VCF. We have a detailed description of how we generate intervals in the paper.
Sounds great. Thank you for doing that. I think I get it now. As you can probably tell, a lot of this stuff just skims the top of my head as it sails right over it. I appreciate your patience and time.
Hi Jason, I ran your data through snpArcher on a Linux server running Ubuntu. On the Ubuntu machine:
Whereas on my M2 MacBook Pro:
I believe you can adjust this value by entering the command
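Assuming the value in question is the shell's open-file-descriptor limit (an assumption based on context - sambamba markdup keeps many temporary files open, and the default limit differs between Linux and macOS), checking and raising it for the current shell session looks like this:

```bash
# Show the current soft limit on open file descriptors for this shell
ulimit -n

# Raise it for this shell session before launching the workflow
# (4096 is an example value; the hard limit caps how far it can be raised)
ulimit -n 4096
```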
Hi Cade, Thanks very much for clarifying that. I got the workflow to run on the two test samples using our lab computer and it seemed to go all the way through, although it didn't produce QC files, even though the two samples had the same reference genome. Maybe there has to be more than two samples? I imagine you had the same output when you ran it. The workflow ran in ~52 hours on our lab computer with 192GB of memory and 23/24 cores used. When I run the workflow for real it will be on the UCSC Hummingbird cluster with files of similar size, although the reference genomes will be different, and potentially more fragmented. I have a request in with the Hummingbird IT to help guide me on which partition to use and how many cores and how much memory to request. I'm wondering if there are any parameters on the config.yaml end that you would adjust to get it to run faster, or is it mostly just a matter of more computing power? Thanks!
@jasonwjohns Generally, snpArcher is optimized to run many single-core jobs in parallel. I think the default number of concurrent jobs is something around 1000, which you probably shouldn't need to change. You will need to change the partition you submit to, ideally to something with at least a 3-day run time. I think the Hummingbird cluster uses the SLURM job management system; we have a profile for SLURM built into snpArcher (profiles/slurm), but if you or your IT colleagues have a Snakemake profile optimized for the Hummingbird cluster, that would probably be ideal (although you'll need to make sure that use-conda is set to True). Note that we are still working on adapting our profile system to Snakemake 8, which makes some big changes.
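For a rough idea of what such a profile can contain, here is a minimal sketch of a Snakemake (pre-version-8) profile config.yaml. The partition name, memory, and walltime values are placeholders to replace with whatever Hummingbird's IT recommends; this is not the exact profile that ships with snpArcher:

```yaml
# Hypothetical cluster profile sketch (e.g. profiles/hummingbird/config.yaml).
use-conda: True        # required so each rule runs in its conda environment
jobs: 1000             # cap on jobs submitted to the scheduler at once
latency-wait: 60       # tolerate shared-filesystem lag before failing on missing outputs
cluster: "sbatch --partition=PLACEHOLDER_PARTITION --cpus-per-task={threads} --mem={resources.mem_mb} --time=3-00:00:00"
default-resources: ["mem_mb=4000"]
```

Invoked with `snakemake --profile <path-to-profile-dir>`, these options are applied as if they had been passed on the command line.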
Hi Jason, That's odd - I just checked and you are right - my run did not produce the QC html. I will look into this. EDIT: Number of samples has to be greater than 2. I misspoke on that earlier. In any case, if you want to run the QC module now that the workflow has completed, you can do so like this:
Just check that the dry run makes sense; it should be something like this:
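As a generic illustration of that kind of targeted dry run (the target path below is purely a placeholder, not the actual snpArcher QC output name):

```bash
# -n / --dry-run previews which jobs would be run without executing anything;
# the target is a placeholder for the QC report the workflow is expected to produce.
snakemake --use-conda --cores 23 -n results/PLACEHOLDER_QC_REPORT.html
```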
As for Hummingbird and settings there, let me get back to you. I'm currently working on refactoring some of the code to clean up how we specify computational resources. I'll also be updating the docs about this, as well as about using SLURM (which Hummingbird uses).
Hi Tim and Cade, This is all great. I so appreciate all of your personalized help, not to mention all the work you did to write the workflow. Cheers to you! Jason
Hi guys,
Sorry to be back so soon. After successfully getting the workflow to run with one sample with very low coverage this morning, I ran it again on two samples. These samples are of a size (~3GB fastq.gz files) more comparable to what I will be running in my upcoming job. The workflow ran great, all the way to sambamba markdup, when I got the below error for each file.
This is essentially a wild guess, but maybe it could have something to do with the fact that I manually input phony SRR values in the 'Run' column of the sample sheet - '1' and '2'? I may be grasping at straws there.
Any ideas?
Thank you!
Jason