
glacierAutoRetrieval doesn't work #4511

Closed

berguner opened this issue Nov 13, 2023 · 13 comments

Comments

@berguner

Bug report

Expected behavior and actual behavior

I tried running the nf-core/demultiplex pipeline on data that is stored in S3 Glacier Deep Archive, but it failed. I had already restored an Illumina run folder from S3 Glacier Deep Archive using the AWS CLI and set aws.client.glacierAutoRetrieval = true, but I am still getting the error below.
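
For reference, the setting mentioned above corresponds to a nextflow.config entry along these lines (a minimal sketch; all other AWS settings omitted):

// nextflow.config
aws {
    client {
        glacierAutoRetrieval = true
    }
}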

Program output

Object is of storage class GLACIER. Unable to perform download operations on GLACIER objects. You must restore the object to be able to perform the operation. See aws s3 download help for additional parameter options to ignore or force these transfers.

Environment

  • Nextflow version: 23.10.0.5889
  • Java version: openjdk version "11.0.20.1" 2023-08-24
  • Operating system: Ubuntu 20.04
  • Bash version: GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)

Additional context

I tried downloading the files with the AWS CLI to make sure that they were actually restored, and I had to enable the --force-glacier-transfer option in order to download them recursively. Could it be that this option wasn't enabled by Nextflow?
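
For reference, the recursive download only worked with a command along these lines (bucket and prefix are placeholders):

# --force-glacier-transfer tells the CLI to attempt the copy even though the
# objects still report a Glacier/Deep Archive storage class after being restored.
aws s3 cp --recursive --force-glacier-transfer s3://my-bucket/runs/RUN_FOLDER/ ./RUN_FOLDER/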

@bentsherman
Member

Can you share the entire log please?

@berguner
Author

Here is the log file:
nextflow_aws-glacier-auto-retreival-err.log

@bentsherman
Member

I think it is failing because we never implemented the Glacier auto-retrieval for tasks

@pditommaso if I recall correctly, we only implemented the glacier restore for the Nextflow head node because otherwise you could have many tasks doing nothing while waiting for the restore. On the other hand, currently the only way to trigger the glacier restore is to download the entire file to the head node.

What would be ideal is a way to only trigger the glacier restore from the head node before launching the tasks, let the head node do all the waiting, and then launch new tasks only when their inputs are restored

@pditommaso
Member

Glacier restore can take ages; I think it makes no sense to run hundreds of tasks hanging around waiting for the data to become available.

@berguner
Author

Yes, tasks shouldn't wait for the restore to complete. It would be an amazing feature if the head node could check the restoration status at a set interval (e.g. every hour) and launch the tasks once the files are restored. According to the AWS documentation, it should be possible to query the restoration status of the files, so the head node wouldn't have to download them.
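
For example, a status check along these lines works without downloading the object (bucket and key are placeholders); the Restore field in the response shows whether a restore is still in progress and when the restored copy expires:

# The response includes a field such as:
#   "Restore": "ongoing-request=\"false\", expiry-date=\"Fri, 24 Nov 2023 00:00:00 GMT\""
# ongoing-request="false" means the restored copy is ready to be read.
aws s3api head-object --bucket my-bucket --key runs/RUN_FOLDER/RunInfo.xml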

@bentsherman
Member

@pditommaso I agree, but what do you think about this:

What would be ideal is a way to only trigger the glacier restore from the head node before launching the tasks, let the head node do all the waiting, and then launch new tasks only when their inputs are restored

@pditommaso
Member

It's too complex a task for a niche use case. I don't think it's worth it.

@bentsherman
Member

@berguner as a workaround, you could read the first line of each file you want to restore (docs) in the Nextflow code. That should trigger the restore; then you can pass the files to the tasks and they will run as normal. For example:

// Hypothetical file list; reading one line from each file on the head node
// triggers the download (and hence the Glacier restore) before any task starts.
input_files = files('*.fastq')
input_files.each { file ->
  file.withReader { it.readLine() }
}

workflow {
  // Pass the (now restoring/restored) files to the process as usual.
  Channel.of(input_files) | MY_PROC
}

Of course this requires you to modify the pipeline code, and this particular example isn't ideal because it will wait for each file one at a time. If instead you read the first line in the process body, that should restore all the files in parallel
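
A rough alternative that stays in the workflow script is to issue the reads concurrently, for example with a parallel stream (a sketch, assuming aws.client.glacierAutoRetrieval is enabled so the head node performs the retrieval, and reusing the hypothetical input_files list from above):

input_files = files('*.fastq')

// Read one line from every file concurrently so the slow restore/download
// calls overlap instead of blocking on one file at a time.
input_files.parallelStream().forEach { f ->
  f.withReader { it.readLine() }
}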

@bentsherman
Member

It might be easier to just run a script before the pipeline execution (or as a step in the pipeline) that restores all the input files you want to retrieve from glacier.
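
For instance, a pre-restore step could look something like this (bucket, prefix, retention days, and retrieval tier are placeholders to adapt); it only issues the restore requests and downloads nothing:

#!/bin/bash
# Request a restore for every archived object under a given prefix.
BUCKET=my-bucket
PREFIX=runs/RUN_FOLDER/

aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$PREFIX" \
    --query 'Contents[].Key' --output text | tr '\t' '\n' | \
while read -r key; do
  aws s3api restore-object --bucket "$BUCKET" --key "$key" \
      --restore-request '{"Days":7,"GlacierJobParameters":{"Tier":"Standard"}}'
done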

@berguner
Author

berguner commented Nov 15, 2023

Thanks for the suggestion @bentsherman, but I had already tried restoring prior to running the pipeline and it still failed. I don't know how each task downloads the files from S3, but I had to set the --force-glacier-transfer option to get aws s3 cp --recursive to work. Is there a way to add aws s3 cp --recursive --force-glacier-transfer to the wrapper scripts that Nextflow generates to run the tasks?

@bentsherman
Member

I had already tried restoring prior to running the pipeline and it still failed

What exactly did you try? You should be able to use the aws CLI just like the task script, but with the extra option, like this:

# For each S3 URI in $files, stream the object to /dev/null in the background
# so the transfers (and any Glacier handling) run in parallel.
for file in $files; do
  aws s3 cp --force-glacier-transfer $file - > /dev/null &
done

# Wait for all background transfers to finish.
wait

Although I would use the restore-object command to simply restore each object without downloading it.

@pditommaso
Member

Alternatively, a plugin may be able to handle this.

@pditommaso
Member

This feature has been removed in #4705.
