
glacierAutoRetrieval doesn't work #4511

Closed

berguner opened this issue Nov 13, 2023 · 13 comments

Comments

@berguner

Bug report

Expected behavior and actual behavior

I tried running the nf-core/demultiplex pipeline on data that is stored in S3 Glacier Deep Archive, but it failed. I had already restored an Illumina run folder from S3 Glacier Deep Archive using the AWS CLI and set aws.client.glacierAutoRetrieval = true, but I am still getting the error below.
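
For reference, the setting mentioned above corresponds to a nextflow.config entry along these lines (a minimal sketch; all other AWS settings omitted):

// nextflow.config
aws {
    client {
        glacierAutoRetrieval = true
    }
}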

Program output

Object is of storage class GLACIER. Unable to perform download operations on GLACIER objects. You must restore the object to be able to perform the operation. See aws s3 download help for additional parameter options to ignore or force these transfers.

Environment

  • Nextflow version: 23.10.0.5889
  • Java version: openjdk version "11.0.20.1" 2023-08-24
  • Operating system: Ubuntu 20.04
  • Bash version: GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)

Additional context

I tried downloading the files with the AWS CLI to make sure that they were actually restored, and I had to enable the --force-glacier-transfer option in order to download them recursively. Could it be that this option wasn't enabled by Nextflow?
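
For reference, the recursive download only worked with a command along these lines (bucket and prefix are placeholders):

# --force-glacier-transfer tells the CLI to attempt the copy even though the
# objects still report a Glacier/Deep Archive storage class after being restored.
aws s3 cp --recursive --force-glacier-transfer s3://my-bucket/runs/RUN_FOLDER/ ./RUN_FOLDER/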

@bentsherman
Member

Can you share the entire log please?

@berguner
Author

Here is the log file:
nextflow_aws-glacier-auto-retreival-err.log

@bentsherman
Member

I think it is failing because we never implemented the Glacier auto-retrieval for tasks

@pditommaso if I recall correctly, we only implemented the glacier restore for the Nextflow head node because otherwise you could have many tasks doing nothing while waiting for the restore. On the other hand, currently the only way to trigger the glacier restore is to download the entire file to the head node.

What would be ideal is a way to only trigger the glacier restore from the head node before launching the tasks, let the head node do all the waiting, and then launch new tasks only when their inputs are restored

@pditommaso
Member

Glacier restore can take ages; I think it makes no sense to run hundreds of tasks hanging around waiting for the data to become available.

@berguner
Author

Yes, tasks shouldn't wait for the restore to complete. It would be an amazing feature if the head node could check the restoration status at a set interval (e.g. every hour) and launch the tasks once the files are restored. According to the AWS documentation, it should be possible to query the restoration status of the files, so the head node wouldn't have to download them.
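
For example, a status check along these lines works without downloading the object (bucket and key are placeholders); the Restore field in the response shows whether a restore is still in progress and when the restored copy expires:

# The response includes a field such as:
#   "Restore": "ongoing-request=\"false\", expiry-date=\"Fri, 24 Nov 2023 00:00:00 GMT\""
# ongoing-request="false" means the restored copy is ready to be read.
aws s3api head-object --bucket my-bucket --key runs/RUN_FOLDER/RunInfo.xml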

@bentsherman
Member

@pditommaso I agree, but what do you think about this:

What would be ideal is a way to only trigger the glacier restore from the head node before launching the tasks, let the head node do all the waiting, and then launch new tasks only when their inputs are restored

@pditommaso
Member

It's too complex a task for a niche use case. I don't think it's worth it.

@bentsherman
Member

@berguner as a workaround, you could read the first line of each file you want to restore (docs) in the Nextflow code. That should trigger the restore; then you can pass the files to the tasks and they will run as normal. For example:

// Hypothetical file list; reading one line from each file on the head node
// triggers the download (and hence the Glacier restore) before any task starts.
input_files = files('*.fastq')
input_files.each { file ->
  file.withReader { it.readLine() }
}

workflow {
  // Pass the (now restoring/restored) files to the process as usual.
  Channel.of(input_files) | MY_PROC
}

Of course this requires you to modify the pipeline code, and this particular example isn't ideal because it will wait for each file one at a time. If instead you read the first line in the process body, that should restore all the files in parallel
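
A rough alternative that stays in the workflow script is to issue the reads concurrently, for example with a parallel stream (a sketch, assuming aws.client.glacierAutoRetrieval is enabled so the head node performs the retrieval, and reusing the hypothetical input_files list from above):

input_files = files('*.fastq')

// Read one line from every file concurrently so the slow restore/download
// calls overlap instead of blocking on one file at a time.
input_files.parallelStream().forEach { f ->
  f.withReader { it.readLine() }
}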

@bentsherman
Member

It might be easier to just run a script before the pipeline execution (or as a step in the pipeline) that restores all the input files you want to retrieve from glacier.
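
For instance, a pre-restore step could look something like this (bucket, prefix, retention days, and retrieval tier are placeholders to adapt); it only issues the restore requests and downloads nothing:

#!/bin/bash
# Request a restore for every archived object under a given prefix.
BUCKET=my-bucket
PREFIX=runs/RUN_FOLDER/

aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$PREFIX" \
    --query 'Contents[].Key' --output text | tr '\t' '\n' | \
while read -r key; do
  aws s3api restore-object --bucket "$BUCKET" --key "$key" \
      --restore-request '{"Days":7,"GlacierJobParameters":{"Tier":"Standard"}}'
done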

@berguner
Author

berguner commented Nov 15, 2023

Thanks for the suggestion @bentsherman, but I had already tried restoring prior to running the pipeline and it still failed. I don't know how each task downloads the files from S3, but I had to set the --force-glacier-transfer option to get aws s3 cp --recursive to work. Is there a way to add aws s3 cp --recursive --force-glacier-transfer to the wrapper scripts that Nextflow generates to run the tasks?

@bentsherman
Member

I had already tried restoring prior to running the pipeline and it still failed

What exactly did you try? You should be able to use the aws CLI just like the task script, but with the extra option, like this:

# For each S3 URI in $files, stream the object to /dev/null in the background
# so the transfers (and any Glacier handling) run in parallel.
for file in $files; do
  aws s3 cp --force-glacier-transfer $file - > /dev/null &
done

# Wait for all background transfers to finish.
wait

Although I would use the restore-object command to simply restore each object without downloading it.

@pditommaso
Member

Alternatively, a plugin may be able to handle this.

@pditommaso
Member

This feature has been removed in #4705.
