
Disk usage of the metagenomics workflow #954

Closed
FlorianTrigodet opened this issue Sep 3, 2018 · 4 comments

FlorianTrigodet commented Sep 3, 2018

Hi Anvi'o team,

I use the metagenomics workflow to process 25 libraries. I first tried the workflow with 6 libraries and, so far, so good. But with 25 samples, I ran out of disk space while the FASTA files were being merged for co-assembly with IDBA-UD (merge_fastas_for_co_assembly).

I had 500 GB available, but it was not enough.

The workflow involves fq2fa to merge the paired reads, then cat to merge the reads per group for IDBA-UD, then gzip for the quality-checked fastq files. These temporary files are too big for my setup right now, so I would rather process the files per group: fq2fa, gzip the quality-checked fastq, then cat for Group1; then Group2; ...; then GroupN.
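
To illustrate, the structure looks roughly like this (a simplified sketch with placeholder sample and file names, not the workflow's actual rules; the fq2fa call follows IDBA-UD's documented usage):

# Simplified sketch (placeholder names, not the actual anvi'o rules).
SAMPLES = ['s01', 's02']  # ... 25 in my case

rule fq2fa:
    # IDBA-UD's fq2fa merges an R1/R2 pair into one interleaved fasta
    input:
        r1='{sample}_R1.fastq',
        r2='{sample}_R2.fastq'
    output: '{sample}.fa'
    shell: 'fq2fa --merge {input.r1} {input.r2} {output}'

rule merge_fastas_for_co_assembly:
    # with 25 samples, all 25 fastas plus the merged copy coexist here
    input: expand('{sample}.fa', sample=SAMPLES)
    output: 'Group1.fa'
    shell: 'cat {input} > {output}'

# (a separate rule then gzips the quality-checked fastq files)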

The idea is to gzip the quality-checked fastq files as soon as possible to drastically lower the disk footprint.
Anyway, that's what I will do given the 500 GB I have available.

Thank you for the great work!

ShaiberAlon (Member) commented Oct 10, 2018

@FlorianTrigodet, I'm looking into this right now, and I'm not sure how to implement such a thing.
Do you have a suggestion? Are you familiar with snakemake in general?

You are asking to put the jobs relating to one group before the jobs of another group in the job execution order (and I have no idea how to do that). Alternatively, what we could do is concatenate one fastq file at a time, and compress each one immediately after its concatenation is done. What do you think of this solution?

If you prefer your original suggestion, and if you don't have an idea of how to implement such a thing, then I'll try posting the following question on the snakemake Google group:

I have a workflow with multiple steps, and I would like all rules for a specific value of a wildcard to be executed before execution of these steps starts for the next value of the wildcard.

So for example if:

S = ['s1','s2']

rule all:
    input: expand('{s}.txt', s=S)

rule a:
    output: '{s}.temp'
    shell: 'touch {output}'

rule b:
    input: rules.a.output
    output: '{s}.txt'
    shell: 'cat {input} > {output}'

So I would like the order of executed commands to be:

touch s1.temp
cat s1.temp > s1.txt
touch s2.temp
cat s2.temp > s2.txt

What I don't want is:

touch s1.temp
touch s2.temp
cat s1.temp > s1.txt
cat s2.temp > s2.txt
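
For the record, the only trick I can think of so far is to chain the wildcard values explicitly through an input function (just a sketch; maybe there is a cleaner snakemake idiom):

S = ['s1', 's2']

def previous_final_output(wildcards):
    # make each value's first job depend on the finished output of the
    # previous value, so s2's jobs cannot start before s1.txt exists
    i = S.index(wildcards.s)
    return [] if i == 0 else [S[i - 1] + '.txt']

rule all:
    input: expand('{s}.txt', s=S)

rule a:
    input: previous_final_output
    output: '{s}.temp'
    shell: 'touch {output}'

rule b:
    input: rules.a.output
    output: '{s}.txt'
    shell: 'cat {input} > {output}'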

ShaiberAlon (Member) commented Oct 11, 2018

Ok, so this is what I think:

Instead of having multiple rules, we should have just one rule with the steps:

  1. unzipping the fastq
  2. converting to fasta
  3. concatenating to the merged fasta
  4. zipping

The only issue is that we can't have this rule run in parallel for two samples of the same group (because we would have two processes trying to concatenate into the same file at the same time); see the sketch below. I will send a question to the snakemake people to see if they have a suggestion.
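
To make the problem concrete, such a rule could look like this (placeholder names; the appending cat is exactly where two parallel jobs of the same group would collide):

rule process_sample:
    # per sample: unzip to a copy (so the .gz survives), convert to
    # fasta, append to the group's merged fasta, then clean up
    input: '{group}-{sample}.fastq.gz'
    output: touch('{group}-{sample}.done')
    shell:
        """
        gunzip -c {input} > {wildcards.group}-{wildcards.sample}.fastq
        fq2fa {wildcards.group}-{wildcards.sample}.fastq {wildcards.group}-{wildcards.sample}.fa
        # DANGER: two samples of the same group running in parallel
        # would both append to the same file here:
        cat {wildcards.group}-{wildcards.sample}.fa >> {wildcards.group}.fa
        rm {wildcards.group}-{wildcards.sample}.fastq {wildcards.group}-{wildcards.sample}.fa
        """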

ShaiberAlon (Member) commented Oct 11, 2018

Ok, I take it back. This is what makes sense to me: one rule will do everything, with a simple loop going through all the samples of a specific group and, for each sample, unzipping, converting to fasta, and concatenating.

Before this rule is executed, all fastq files will be zipped, so the price we pay here is that things are zipped and then unzipped (temporarily) for this concatenation. But I think that the gain in disk space is worth it.
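
Something like this sketch (placeholder names again, and assuming a dictionary that maps each group to its gzipped fastq files):

# Placeholder: maps each co-assembly group to its quality-checked,
# gzipped fastq files.
GROUPS = {'Group1': ['s01.fastq.gz', 's02.fastq.gz']}

rule merge_fastas_for_co_assembly:
    input: lambda wildcards: GROUPS[wildcards.group]
    output: '{group}.fa'
    run:
        import os
        for fastq_gz in input:
            fastq = fastq_gz[:-len('.gz')]
            fasta = fastq + '.fa'
            # unzip to a copy so the .gz stays, convert, append, clean
            # up; only one sample is unzipped on disk at any moment
            shell('gunzip -c {fastq_gz} > {fastq}')
            shell('fq2fa {fastq} {fasta}')
            shell('cat {fasta} >> {output}')
            os.remove(fastq)
            os.remove(fasta)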

What do you think?

FlorianTrigodet commented Oct 12, 2018

Hi @ShaiberAlon,

I agree with you! This is a great idea that would considerably reduce the space used by the unzipped fasta files.

Thanks a lot
