Yet another "huge intermediate files" issue #2468

@tamasgal

I have read through a couple of issues (like #452) but have not found a solution to my problem, although I can see that many people are struggling with similar ones.
Either I am (or we are) doing something wrong, or this does not fit the scope of Nextflow. Let me show you a simplified example; maybe someone has an idea how to design a workflow for it elegantly.

Here is the extremely simplified pipeline with the following processes:

  1. GrabDataFromTapeServers - input is a list of remote file locations on a Grid, output is a single 5-20 GB file
  2. ReconstructEvents - input is the output of GrabDataFromTapeServers (the 5-20 GB file), output is a small 100 MB file

I have the file grabbing functionality in a separate process (GrabDataFromTapeServers) because there are different configuration options depending on which batch farm is used and this process is also used in other workflows.

The problem is: we need to launch tens of thousands of jobs and we have at best a few hundred TB of working space, depending largely on the batch farm used. The "big" file is usually used by only one subsequent process and can be thrown away afterwards.
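
To make the scale concrete, here is a back-of-the-envelope sketch in Python. The job count of 20,000 is an assumption chosen for illustration (the text above only says "tens of thousands"), and the helper name is hypothetical; the 5-20 GB file sizes come from the process list above.

```python
def intermediate_footprint_tb(n_jobs, file_size_gb):
    """Disk space in TB (decimal units) if every job's big file stays on disk."""
    return n_jobs * file_size_gb / 1000


# Assumed job count, for illustration only.
N_JOBS = 20_000

low = intermediate_footprint_tb(N_JOBS, 5)    # smallest intermediate file
high = intermediate_footprint_tb(N_JOBS, 20)  # largest intermediate file

print(f"{low:.0f}-{high:.0f} TB of intermediate files")
```

Under that assumption the intermediate files alone would take 100 to 400 TB, which is on the order of the whole working space, before any other data is counted.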

Deleting it from within the following process does not work, since that process (in this case ReconstructEvents) only has access to a symlink, so removing it leaves the real file in place. I'd rather not hack around by following the symlink and deleting the file, but maybe that's the way to go?
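
For reference, the symlink hack could look roughly like the following Python sketch. It is only an illustration: the function name, the directory layout and the file name raw.dat are all made up, and it assumes the staged input really is a symlink to the upstream output. Since it removes the upstream task's output, it would presumably interfere with resuming or re-running the downstream process.

```python
import os
import tempfile


def delete_symlink_target(link_path):
    """Delete the real file behind a symlink and return the resolved path."""
    if not os.path.islink(link_path):
        raise ValueError(f"{link_path} is not a symlink")
    target = os.path.realpath(link_path)  # follow the link to the real file
    os.remove(target)                     # frees the space; the link now dangles
    return target


# Tiny demo with two fake task directories (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    upstream = os.path.join(tmp, "upstream_task")
    downstream = os.path.join(tmp, "downstream_task")
    os.makedirs(upstream)
    os.makedirs(downstream)

    big_file = os.path.join(upstream, "raw.dat")
    with open(big_file, "wb") as fh:
        fh.write(b"\0" * 1024)  # stand-in for the 5-20 GB file

    staged = os.path.join(downstream, "raw.dat")
    os.symlink(big_file, staged)  # the downstream task only sees this link

    removed = delete_symlink_target(staged)
    target_gone = not os.path.exists(removed)
    link_dangling = os.path.islink(staged) and not os.path.exists(staged)
```

After the call the real file is gone and only a dangling symlink remains in the downstream directory.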

Using scratch true is not a solution either, since the output file is copied to the work/ directory anyway. We do use it, though, to minimise east-west traffic.

A solution would be to integrate the file-grabbing process into ReconstructEvents and delete the original file in the same process. The problem is that we would need to change a lot of workflows which use GrabDataFromTapeServers, which would mean code duplication and additional maintenance overhead.

Is there a recommended solution for such a workflow?
