Disk usage of the metagenomics workflow #954
Hi Anvi'o team,
I use the metagenomics workflow to process 25 libraries. I tried the workflow with 6 libraries and, so far, so good. But with 25 samples, I ran out of disk space while merging the files for the co-assembly with IDBA-UD (merge_fastas_for_co_assembly).
I had 500 GB available, but it was not enough.
The workflow runs fq2fa to merge paired reads, then cat to concatenate the reads per group for IDBA, then gzip on the quality-checked fastq files. These temporary files are too big for my setup right now, so I would rather process the files per group: fq2fa, gzip the quality-checked fastq, then cat for Group1; then Group2; ...; GroupN.
The idea is to gzip the quality-checked fastq files as soon as possible to drastically lower the disk footprint.
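The per-group ordering being requested could be sketched like this (group and sample names are made up for illustration; the commands are only recorded in a plan, not actually run):

```python
# Sketch of the requested per-group ordering: finish every step for one
# group (convert, concatenate, compress) before starting the next group.
# Group and sample names are hypothetical.
groups = {
    "Group1": ["sample_01", "sample_02"],
    "Group2": ["sample_03"],
}

plan = []  # the commands, in the order they would run
for group, samples in groups.items():
    for sample in samples:
        plan.append(f"fq2fa {sample}.fastq {sample}.fa")
        plan.append(f"cat {sample}.fa >> {group}.fa")
        plan.append(f"gzip {sample}.fastq")  # compress the QC'd fastq right away
```

With this ordering, at most one group's worth of uncompressed intermediates exists on disk at any time.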
Thank you for the great work!
@FlorianTrigodet, I'm looking into this right now, and I'm not sure how to implement such a thing.
You are asking to put the jobs related to one group ahead of the jobs of another group in the job execution order (and I have no idea how to do that). Alternatively, we could concatenate the fastq files one at a time, and compress each one immediately after its concatenation is done. What do you think of this solution?
If you prefer your original suggestion, and you don't have a suggestion for how to implement it, then I'll try posting the following question to the snakemake Google group:
I have a workflow with multiple steps, and I would like all of the rules for one specific value of a wildcard to be executed before these steps start executing for the next value of the wildcard.
For example, suppose the workflow has two steps and the wildcard takes two values, A and B. I would like the order of executed commands to run both steps for A before either step runs for B; what I don't want is the first step running for A and B before the second step starts.
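A minimal illustration of the two orderings, using hypothetical step names (step1, step2) and wildcard values (A, B):

```python
# Two steps and two wildcard values; compare the per-wildcard ordering
# (wanted) with the per-step ordering (not wanted).
wildcard_values = ["A", "B"]
steps = ["step1", "step2"]

# Wanted: everything for A runs, then everything for B.
wanted = [f"{step} {value}" for value in wildcard_values for step in steps]

# Not wanted: step1 runs for every value, then step2 runs for every value.
not_wanted = [f"{step} {value}" for step in steps for value in wildcard_values]
```

The disk-space benefit comes from the "wanted" ordering: intermediates for one wildcard value can be cleaned up before the next value starts.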
Ok, so this is what I think:
Instead of having multiple rules, we should have just one rule that performs all the steps (unzip, convert to fasta, concatenate).
The only issue is that this rule could run in parallel for two samples of the same group (two processes could then try to concatenate into the same file at the same time). I will send a question to the snakemake people to see if they have a suggestion.
Ok, I take it back. Here is what makes sense to me: one rule will do everything with a simple loop over all the samples of a specific group, and inside the loop it will unzip, convert to fasta, and concatenate.
Before this rule is executed, all fastq files would be zipped, so the price we pay here is that files are zipped and then (temporarily) unzipped for this concatenation. But I think the gain in disk space is worth it.
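A minimal Python sketch of that single rule, assuming the quality-checked fastqs sit gzipped on disk. Here `fastq_to_fasta` is a toy stand-in for fq2fa, and `merge_group` is a hypothetical name, not anvi'o's actual code:

```python
import gzip

def fastq_to_fasta(fastq_text):
    # Toy stand-in for fq2fa: keep the header and sequence line of each
    # 4-line fastq record, rewriting '@name' as '>name'.
    lines = fastq_text.splitlines()
    records = []
    for i in range(0, len(lines) - 3, 4):
        records.append(">" + lines[i][1:])
        records.append(lines[i + 1])
    return "\n".join(records) + "\n"

def merge_group(sample_gz_paths, group_fasta_path):
    # One rule, one loop: for each sample of the group, decompress the
    # gzipped fastq, convert it to fasta, and append it to the group file.
    # Only one sample is held uncompressed at a time; the gzipped
    # originals stay untouched on disk.
    with open(group_fasta_path, "a") as out:
        for path in sample_gz_paths:
            with gzip.open(path, "rt") as fq:
                out.write(fastq_to_fasta(fq.read()))
```

Because one loop handles the whole group inside a single rule, two processes can never append to the same group file concurrently, which sidesteps the race condition mentioned above.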
What do you think?