Unknown intermediate files in workflow #266

cmkobel · 2020-03-23T12:52:06Z

I have a workflow where I extract all genes from a number of genomes and then compute a series of statistics on each of the resulting gene files. This means that when I start the workflow, I don't know how many of these files there will be or what they will be named.

I can't find a way to have gwf dynamically parse the individual gene files from the beginning. This means that I need to restart the workflow for each "checkpoint" where a number of intermediate files are generated.

An alternative is to have a single target that loops over the files, but this does not scale. With ~1000 files, it takes a long time even if I allocate the maximum number of cores on the local HPC.

Is there a nice solution to the problem of unknown intermediate files in gwf, or must I stick to restarting my present workflow for each checkpoint?

Forgive me for not having read the documentation in full.

dansondergaard · 2020-03-23T19:58:40Z

> I have a workflow where I extract all genes from a number of genomes, and then perform a series of statistics on these individual files. This means that when I start the workflow I don't know how many and what the names of these files will be.

The way gwf works, it needs to know about all input/output files to create the dependency graph, which happens before anything is actually run. So no, there's no nice solution.

> I can't find a way to have gwf dynamically parse the individual gene files from the beginning. This means that I need to restart the workflow for each "checkpoint" where a number of intermediate files are generated.

From this it sounds like you actually do know which files you're going to end up with? Do you have files that specify the genes in the genomes already? If you do, it should be trivial to parse these in your workflow file and create the appropriate targets.
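A minimal sketch of what that parsing could look like, assuming a hypothetical listing file with one gene name per line and a hypothetical `compute_stats` command. The helper names and directory layout are illustrative and do not come from this thread. The targets are collected as plain dictionaries so the example runs without gwf installed; in a real workflow file each one would be turned into a gwf target instead.

```python
import re
from pathlib import Path


def read_gene_names(listing):
    """Return gene names from a listing file, skipping blanks and '#' comments."""
    names = []
    for line in Path(listing).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names


def stats_targets(gene_names, gene_dir="genes", out_dir="stats"):
    """Build one target description per gene (hypothetical file layout)."""
    targets = []
    for gene in gene_names:
        # Target names are kept to letters, digits and underscores.
        safe = re.sub(r"\W", "_", gene)
        source = f"{gene_dir}/{gene}.fasta"
        result = f"{out_dir}/{gene}.tsv"
        targets.append({
            "name": f"stats_{safe}",
            "inputs": [source],
            "outputs": [result],
            # `compute_stats` is a placeholder for the real statistics command.
            "spec": f"compute_stats {source} > {result}",
        })
    return targets


# In a workflow file, each dictionary would roughly correspond to one call like
# gwf.target(name, inputs=..., outputs=...) << spec (check the gwf docs for details).
```

Because every input and output path is derived from the listing before anything runs, the full dependency graph is known up front, which is the requirement described above.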

Cheers,
Dan

cmkobel · 2020-03-23T20:49:48Z

Hi Dan.

Thanks for your quick answer.

> From this it sounds like you actually do know which files you're going to end up with? Do you have files that specify the genes in the genomes already? If you do, it should be trivial to parse these in your workflow file and create the appropriate targets.

The names of the genes in the genomes come from a genome annotation step (Prokka) that is itself part of the pipeline, so I can only do what you suggest if I pre-annotate.
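Without pre-annotation, the "checkpoint" approach from the original question could be sketched as follows: the workflow file looks at which gene files exist on disk each time it is evaluated, so the per-gene targets only appear on the run after the extraction step has finished. This is a sketch under assumptions, not gwf functionality: the function names, the `.fasta` extension and the directory layout are hypothetical, and targets are plain dictionaries so the example runs on its own.

```python
from pathlib import Path


def discover_gene_files(gene_dir):
    """Return the gene files present on disk right now, in a stable order."""
    return sorted(Path(gene_dir).glob("*.fasta"))


def second_stage_targets(gene_dir, out_dir="stats"):
    """Describe one statistics target per gene file that already exists.

    Before the extraction step has run, the directory is empty and this
    returns an empty list; rerunning the workflow afterwards picks up the
    new files.
    """
    targets = []
    for path in discover_gene_files(gene_dir):
        targets.append({
            "name": f"stats_{path.stem}",
            "inputs": [path.as_posix()],
            "outputs": [f"{out_dir}/{path.stem}.tsv"],
        })
    return targets
```

The cost of this design is the one already noted in the thread: the workflow has to be started once per checkpoint, because the second-stage targets cannot be known until the first stage has produced its files.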

Best, Carl.

dansondergaard added the question label Mar 23, 2020

dansondergaard self-assigned this Mar 23, 2020

cmkobel closed this as completed Mar 23, 2020

cmkobel changed the title ~~Identical output from many targets.~~ Unknown intermediate files in workflow Mar 23, 2020
