glacierAutoRetrieval doesn't work #4511
Comments
Can you share the entire log please?
Here is the log file:
I think it is failing because we never implemented the Glacier auto-retrieval for tasks. @pditommaso if I recall correctly, we only implemented the Glacier restore for the Nextflow head node, because otherwise you could have many tasks doing nothing while waiting for the restore. On the other hand, currently the only way to trigger the Glacier restore is to download the entire file to the head node. What would be ideal is a way to trigger only the Glacier restore from the head node before launching the tasks, let the head node do all the waiting, and then launch new tasks only once their inputs are restored.
Glacier restore can take ages; I think it makes no sense to run hundreds of tasks hanging while waiting for the data to be available.
Yes, tasks shouldn't wait for the restore to complete. It would be an amazing feature if the head node could check the restoration status at a set interval (e.g. every hour) and launch the tasks once the files are restored. According to the AWS documentation, it should be possible to query the restoration status of the files, so the head node wouldn't have to download them.
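For reference, a minimal sketch of such a status check with the AWS CLI, assuming a hypothetical bucket and key (the `Restore` field only appears once a restore has been requested):

```bash
# Query the object's metadata without downloading any data.
# The bucket and key below are placeholders.
aws s3api head-object \
    --bucket my-bucket \
    --key runs/sample_1.fastq.gz

# While the restore is in progress, the response includes:
#   "Restore": "ongoing-request=\"true\""
# Once the object is available again, it becomes:
#   "Restore": "ongoing-request=\"false\", expiry-date=\"...\""
```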
@pditommaso I agree, but what do you think about this:
It's too complex a task for a niche use case. I don't think it's worth it.
@berguner as a workaround, you could read the first line of each file (docs) you want to restore in the Nextflow code. That should trigger the restore; then you can pass the files to the tasks and they will run as normal. For example:

```groovy
input_files = files('*.fastq')

input_files.each { file ->
    file.withReader { it.readLine() }
}

workflow {
    Channel.of(input_files) | MY_PROC
}
```

Of course this requires you to modify the pipeline code, and this particular example isn't ideal because it will wait for each file one at a time. If instead you read the first line in the process body, that should restore all the files in parallel.
It might be easier to just run a script before the pipeline execution (or as a step in the pipeline) that restores all the input files you want to retrieve from Glacier.
Thanks for the suggestion @bentsherman, but I already tried restoring prior to running the pipeline and it still failed. I don't know how each task downloads the files from S3, but I had to set `--force-glacier-transfer` to be able to download the files with the AWS CLI myself.
What exactly did you try? You should be able to use the `--force-glacier-transfer` option, for example:

```bash
for file in $files; do
    aws s3 cp --force-glacier-transfer $file - > /dev/null &
done
wait
```

Although I would use `restore-object` to simply restore each object without downloading it.
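For reference, a minimal sketch of that approach with the AWS CLI, assuming a hypothetical bucket and key (Deep Archive only supports the Standard and Bulk retrieval tiers):

```bash
# Request a restore without downloading the object.
# Bucket, key, and the 7-day retention window are placeholders.
aws s3api restore-object \
    --bucket my-bucket \
    --key runs/sample_1.fastq.gz \
    --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}}'
```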
Alternatively, a plugin may be able to handle this.
This feature has been removed in #4705.
Bug report

Expected behavior and actual behavior

I tried running the nf-core/demultiplex pipeline on data stored in S3 Glacier Deep Archive, but it failed. I had already restored an Illumina run folder from S3 Glacier Deep Archive using the AWS CLI and set `aws.client.glacierAutoRetrieval = true`, but I am still getting the error below.

Program output

Environment

Additional context

I tried downloading the files with the AWS CLI to make sure that they were actually restored, and I had to enable the `--force-glacier-transfer` option in order to download them recursively. Could it be that this option wasn't enabled by Nextflow?