Unknown intermediate files in workflow #266

Closed
cmkobel opened this issue Mar 23, 2020 · 2 comments

cmkobel commented Mar 23, 2020

I have a workflow where I extract all genes from a number of genomes, and then perform a series of statistics on these individual files. This means that when I start the workflow I don't know how many and what the names of these files will be.

I can't find a way to have gwf dynamically parse the individual gene files from the beginning. This means that I need to restart the workflow for each "checkpoint" where a number of intermediate files are generated.

An alternative is to have a single target that loops over the files, but this does not scale. With ~1000 files it takes a long time, even if I allocate the maximum number of cores on the local HPC cluster.

Is there a nice solution to the problem of unknown intermediate files in gwf, or must I stick to restarting my present workflow for each checkpoint?

Forgive me for not having read the documentation to its full extent.

@dansondergaard (Collaborator)

> I have a workflow where I extract all genes from a number of genomes, and then perform a series of statistics on these individual files. This means that when I start the workflow I don't know how many and what the names of these files will be.

The way gwf works, it needs to know about all input/output files to create the dependency graph, which happens before anything is actually run. So no, there's no nice solution.
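To illustrate why every file has to be declared up front, here is a toy model of static dependency resolution. It is only a sketch and not gwf's actual implementation; the target names and file paths are made up for the example. The point is that the graph is derived purely from the declared inputs and outputs, so a file that is never declared cannot be linked to anything.

```python
# Toy model of static dependency resolution; NOT gwf's actual code.
# All target names and paths below are hypothetical.
def build_dependencies(targets):
    """Map each target name to the targets that produce its inputs."""
    producers = {}
    for name, spec in targets.items():
        for path in spec["outputs"]:
            producers[path] = name
    return {
        name: sorted({producers[p] for p in spec["inputs"] if p in producers})
        for name, spec in targets.items()
    }


targets = {
    "annotate": {"inputs": ["genome.fasta"], "outputs": ["annotation.gff"]},
    "extract": {"inputs": ["annotation.gff"], "outputs": ["genes/geneA.fasta"]},
    "stats": {"inputs": ["genes/geneA.fasta"], "outputs": ["stats/geneA.txt"]},
}
deps = build_dependencies(targets)
```

Here `stats` can only be connected to `extract` because `genes/geneA.fasta` is listed as an output before anything has run.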

> I can't find a way to have gwf dynamically parse the individual gene files from the beginning. This means that I need to restart the workflow for each "checkpoint" where a number of intermediate files are generated.

From this it sounds like you actually do know which files you're going to end up with? Do you have files that specify the genes in the genomes already? If you do, it should be trivial to parse these in your workflow file and create the appropriate targets.

Cheers,
Dan
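A minimal sketch of the suggestion above, assuming a pre-existing list of gene names with one name per line. The file layout (`genes/`, `stats/`), the `compute_stats` command, and the helper names are all hypothetical. The code only builds plain dictionaries describing one target per gene; registering them with gwf is indicated in a comment, assuming the `gwf.target(name, inputs=..., outputs=...) << spec` form from the gwf documentation.

```python
def parse_gene_names(text):
    """Return gene names from text with one name per line, skipping blanks and # comments."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names


def plan_stats_targets(gene_names):
    """Describe one statistics target per gene (paths and command are hypothetical)."""
    plans = []
    for gene in gene_names:
        infile = f"genes/{gene}.fasta"
        outfile = f"stats/{gene}.txt"
        plans.append({
            "name": f"stats_{gene}",
            "inputs": [infile],
            "outputs": [outfile],
            "spec": f"compute_stats {infile} > {outfile}",
        })
    return plans


# In a workflow file this might be used roughly as follows (assumed gwf usage):
#   for plan in plan_stats_targets(parse_gene_names(open("genes.txt").read())):
#       gwf.target(plan["name"], inputs=plan["inputs"], outputs=plan["outputs"]) << plan["spec"]
```

Because the list is read when the workflow file is evaluated, all per-gene targets exist before the dependency graph is built.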


cmkobel commented Mar 23, 2020

Hi Dan.

Thanks for your quick answer.

> From this it sounds like you actually do know which files you're going to end up with? Do you have files that specify the genes in the genomes already? If you do, it should be trivial to parse these in your workflow file and create the appropriate targets.

The names of the genes in the genomes stem from a genome annotation (Prokka) which is part of the pipeline. So only if I pre-annotate can I do what you suggest.

Best, Carl.
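The checkpoint approach described in this thread could be sketched as a workflow file that only defines the per-gene targets once the annotation output exists. This is a hedged illustration, not a gwf feature: it assumes the annotation step leaves behind a tab-separated table with `locus_tag` and `ftype` columns (which I believe matches Prokka's `.tsv` output, but treat that as an assumption), and the target names are hypothetical.

```python
import csv
from pathlib import Path


def genes_from_annotation(tsv_path):
    """Return locus tags of CDS rows, or None if the table does not exist yet."""
    path = Path(tsv_path)
    if not path.exists():
        return None
    with path.open(newline="") as handle:
        rows = csv.DictReader(handle, delimiter="\t")
        return [row["locus_tag"] for row in rows if row.get("ftype") == "CDS"]


def plan_workflow(tsv_path):
    """List the target names to define on this pass of the workflow."""
    plan = ["annotate"]  # the annotation target is always defined
    genes = genes_from_annotation(tsv_path)
    if genes is not None:
        # Second pass: the annotation exists, so per-gene targets can be named.
        plan += [f"stats_{gene}" for gene in genes]
    return plan
```

On the first run only `annotate` is planned; after it finishes, running the workflow again picks up the gene names. This still requires one restart per checkpoint, which is the limitation discussed above.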

cmkobel closed this as completed on Mar 23, 2020
cmkobel changed the title from "Identical output from many targets." to "Unknown intermediate files in workflow" on Mar 23, 2020