I have a workflow where I extract all genes from a number of genomes and then compute a series of statistics on each of the resulting gene files. This means that when I start the workflow, I don't know how many of these files there will be or what they will be named.
I can't find a way to have gwf dynamically parse the individual gene files from the beginning. This means that I need to restart the workflow for each "checkpoint" where a number of intermediate files are generated.
An alternative implementation is to have a single target that loops over the files, but this does not scale. With ~1000 files it takes a long time, even if I allocate the maximum number of cores on the local HPC.
Is there a nice solution to the problem of unknown intermediate files in gwf, or must I stick to restarting my present workflow for each checkpoint?
Forgive me for not having read the documentation to its full extent.
> I have a workflow where I extract all genes from a number of genomes, and then perform a series of statistics on these individual files. This means that when I start the workflow I don't know how many and what the names of these files will be.
The way gwf works, it needs to know about all input/output files to create the dependency graph, which happens before anything is actually run. So no, there's no nice solution.
> I can't find a way to have gwf dynamically parse the individual gene files from the beginning. This means that I need to restart the workflow for each "checkpoint" where a number of intermediate files are generated.
From this it sounds like you actually do know which files you're going to end up with? Do you have files that specify the genes in the genomes already? If you do, it should be trivial to parse these in your workflow file and create the appropriate targets.
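To illustrate the suggestion above, here is a minimal sketch of building one target per gene from a gene listing that exists before the workflow starts. The listing format (one gene ID per line), the directory layout, and the `compute_stats` command are all assumptions made for illustration, not anything from the actual pipeline. The gwf calls are shown only in comments and should be checked against the gwf version in use.

```python
def gene_names(listing_text):
    """Return gene IDs from a plain-text listing.

    Assumed (hypothetical) format: one gene ID per line,
    blank lines and lines starting with '#' are ignored.
    """
    names = []
    for line in listing_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names


def stats_target_spec(gene):
    """Describe one per-gene target: name, inputs, outputs, shell spec.

    Paths and the `compute_stats` command are placeholders.
    """
    infile = f"genes/{gene}.fasta"
    outfile = f"stats/{gene}.txt"
    return {
        "name": f"stats_{gene}",
        "inputs": [infile],
        "outputs": [outfile],
        "spec": f"compute_stats {infile} > {outfile}",
    }


# Example listing; in a real workflow file this text would be read from disk.
listing = "# genes found by a previous annotation\ngeneA\ngeneB\n"
specs = [stats_target_spec(g) for g in gene_names(listing)]

# In workflow.py the specs could then be registered roughly like this
# (assuming the classic gwf `Workflow.target(...) << spec` interface):
#
#     from gwf import Workflow
#     gwf = Workflow()
#     for s in specs:
#         gwf.target(s["name"], inputs=s["inputs"], outputs=s["outputs"]) << s["spec"]
```

Because the targets are created in an ordinary Python loop, each gene becomes its own job that the backend can schedule independently, instead of one large target looping over every file. Gene IDs that contain characters not allowed in target names would need to be sanitized first.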
> From this it sounds like you actually do know which files you're going to end up with? Do you have files that specify the genes in the genomes already? If you do, it should be trivial to parse these in your workflow file and create the appropriate targets.
The names of the genes in the genomes stem from a genome annotation (Prokka) which is part of the pipeline. So only if I pre-annotate can I do what you suggest.
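When the gene names only exist after annotation, the "checkpoint" approach described in the question can at least be made mechanical: the workflow file can look for gene files on disk each time it is loaded, so that re-running gwf after the annotation step picks up whatever was produced. The sketch below shows only the discovery part in plain Python; the `genes/` and `stats/` directories and the `.fasta` extension are hypothetical.

```python
import glob
import os


def discovered_genes(gene_dir):
    """Return the gene IDs for which a .fasta file currently exists in gene_dir."""
    pattern = os.path.join(gene_dir, "*.fasta")
    return sorted(
        os.path.splitext(os.path.basename(path))[0]
        for path in glob.glob(pattern)
    )


def plan_stage_two(gene_dir, stats_dir="stats"):
    """Return (target_name, input_path, output_path) for every gene file found.

    Returns an empty list before the annotation step has run, so the
    second stage contributes no targets until its inputs exist.
    """
    return [
        (
            f"stats_{gene}",
            os.path.join(gene_dir, f"{gene}.fasta"),
            os.path.join(stats_dir, f"{gene}.txt"),
        )
        for gene in discovered_genes(gene_dir)
    ]
```

Each tuple from `plan_stage_two("genes")` could then be turned into a target in the workflow file. This does not remove the need to run the workflow twice, since the dependency graph is still built before anything runs; it only avoids editing the workflow file between the two runs.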