workflow cannot access public reference data S3 bucket in a different region #1596

Closed
dtenenba opened this issue May 4, 2020 · 21 comments

@dtenenba

dtenenba commented May 4, 2020

Bug report

This is on the edge between a bug report and a new feature request. I am honestly not sure which it is, or maybe it's both. After some debate I am filing it as a bug report.

Expected behavior and actual behavior

I am running a workflow in nf-core: https://github.com/nf-core/rnaseq
That workflow makes use of a reference data bucket:
https://github.com/nf-core/rnaseq/blob/master/conf/base.config#L71-L75

That bucket is in the eu-west-1 region while all my infrastructure is in us-west-2.

When I run the workflow, I get:

The bucket is in this region: eu-west-1. Please use this region to retry the request (Service: Amazon S3;
Status Code: 301; Error Code: PermanentRedirect; Request ID: BAE09F61CA1E2DFB; S3 Extended Request ID: iA6
xn8pYkVpEbshpqrJNxC0bEzK0GkAnx0scLW0zAks1jTnaKIe1sfEqJTyeefI7whGpXOfyB0A=)

This essentially means that nobody can run this nf-core workflow unless their infrastructure happens to be in the eu-west-1 region.

It seems to me that there are a couple of approaches that could fix this issue.
One would be to create an S3 client object that is not tied to a specific region. I am not an expert in the Java/Groovy AWS SDK, but Python's boto3 does not seem to restrict a client to the region_name given when it is created; at any rate, I can specify a region and still interact with buckets in other regions.
Another approach would be to catch the error that comes back and retry the operation with a client created for the appropriate region.
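
To sketch the second approach (this is only an illustration on my part using the v1 Java SDK, not Nextflow code; the bucket name and the error-detail key are assumptions):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.AmazonS3Exception;

public class RetryInBucketRegion {
    public static void main(String[] args) {
        String bucket = "ngi-igenomes";  // example bucket that lives in eu-west-1
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion("us-west-2").build();
        try {
            s3.listObjectsV2(bucket);
        } catch (AmazonS3Exception e) {
            // 301 PermanentRedirect: the bucket lives in a different region.
            // Assumption: the SDK exposes that region in the error details map.
            String bucketRegion = e.getAdditionalDetails() != null
                    ? e.getAdditionalDetails().get("Region") : null;
            if (e.getStatusCode() == 301 && bucketRegion != null) {
                AmazonS3 retry = AmazonS3ClientBuilder.standard()
                        .withRegion(bucketRegion).build();
                retry.listObjectsV2(bucket);  // retry against the bucket's own region
            } else {
                throw e;
            }
        }
    }
}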

Otherwise there are only two workarounds and they are both prohibitively cumbersome and/or expensive:

  1. Make a copy of the bucket in another region. Can I convince the original bucket owner to do this, or will I do it myself? How will we manage syncing the data across regions?
  2. Create a redundant AWS environment (VPC, Batch environment/queue, S3 bucket) in eu-west-1 just to get around this error.

Steps to reproduce the problem

Run nextflow run nf-core/rnaseq with a config that specifies an AWS region other than eu-west-1.

Program output

The bucket is in this region: eu-west-1. Please use this region to retry the request (Service: Amazon S3;
Status Code: 301; Error Code: PermanentRedirect; Request ID: BAE09F61CA1E2DFB; S3 Extended Request ID: iA6
xn8pYkVpEbshpqrJNxC0bEzK0GkAnx0scLW0zAks1jTnaKIe1sfEqJTyeefI7whGpXOfyB0A=)

Environment

  • Nextflow version: 20.01.0 build 5264
  • Java version: 1.8.0_181
  • Operating system: Ubuntu Linux 14.04
  • Bash version: GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)


@wleepang

wleepang commented May 6, 2020

@dtenenba - if you omit the region from the config file the underlying AmazonCloudDriver will use the "default" region from the execution environment.

@dtenenba
Author

dtenenba commented May 6, 2020

I just tried that and I am still getting the same error. I also commented out the region in my ~/.aws/config file as well and still got the same result.

@wleepang

wleepang commented May 7, 2020

Are you running from a local workstation or on an EC2 instance? If the latter, is your account part of an organization, and is there an SCP applied to the account?

@dtenenba
Author

dtenenba commented May 7, 2020

@wleepang Local workstation.

@pditommaso
Member

The region is set when the client is created.

https://github.com/nextflow-io/nextflow-s3fs/blob/83181d0b5f9f3d7dbbde94baa4242f4ced0be598/src/main/java/com/upplication/s3fs/S3FileSystemProvider.java#L834-L836

@wleepang Can the AWS client only access buckets in that region?

@wleepang

wleepang commented May 7, 2020

I'm not sure if this translates to the Java SDK. I whipped up a quick test with the Python SDK, and explicitly setting the region on the client does not affect access to the bucket - I was able to list and get objects.

@dtenenba
Author

dtenenba commented May 7, 2020

I had the same results with a Python test -- no problem accessing a bucket in eu-west-1 with a client created for us-west-2, and vice versa. I am not sure if the Java SDK behaves differently or if it's an implementation issue. I do notice that the error message (Please use this region to retry the request) does come from aws-sdk-java.

@wleepang

wleepang commented May 7, 2020

I wonder if this (explicitly setting the api endpoint) could be the root cause:
https://github.com/nextflow-io/nextflow-s3fs/blob/83181d0b5f9f3d7dbbde94baa4242f4ced0be598/src/main/java/com/upplication/s3fs/S3FileSystemProvider.java#L831

although, again - this isn't an issue with the Python SDK

@wleepang

@dtenenba - ran a more elaborate test today:

Installed nextflow on an EC2 instance in us-east-2 with a "restricted" instance profile (very minimal permissions) and ran the following demo workflow:

https://github.com/wleepang/demo-genomics-workflow-nextflow

which sources public input data from us-west-2 and public reference data from us-east-1.

I was only able to replicate your error when I specifically added

aws.region = "us-east-2"

to ~/.nextflow/config:

N E X T F L O W  ~  version 20.04.1
Launching `wleepang/demo-genomics-workflow-nextflow` [fabulous_sax] - revision: 9b06eeeb04 [master]
script: f8fa70949143134785feefa8720a76f3
session: e2181429-a717-4621-a8e8-179ee372908b

sample-id: NIST7035
The bucket is in this region: us-east-1. Please use this region to retry the request (Service: Amazon S3; Status Code: 301; Error Code: PermanentRedirect; Request ID: 3E7C6C0C95D75853; S3 Extended Request ID: EHSkfvR72NdrmpTcNyFLWdRnIIKQSH41xHIAfr5YCwDL2XsNYO/vAiOvAND58esllkwhu+WhsNk=)
The bucket is in this region: us-west-2. Please use this region to retry the request (Service: Amazon S3; Status Code: 301; Error Code: PermanentRedirect; Request ID: 6E17F598696E07E3; S3 Extended Request ID: wo3xEbs5iD1IeIrndthuQ6zPsroEjOKl/EjcOVy1oCxPnVGN6ZUcvPTVAkkZ/3alrMRRLgKQEMo=)

When I remove that config option, the workflow runs fine.

It looks like eu-west-1 is specified as the default in the nextflow.config file that ships with the nf-core/rnaseq workflow:
https://github.com/nf-core/rnaseq/blob/3b6df9bd104927298fcdf69e97cca7ff1f80527c/nextflow.config#L95

@pditommaso - what is the suggested way to override this? Could one set aws.region = null in a nextflow.config in the current working directory where nextflow is called?

@dtenenba
Author

@wleepang Thanks for the testing! I am curious how this test would have gone if you had been on a local workstation and not on an EC2 instance. How would nextflow know which region you are using for e.g. your AWS Batch queue, if it didn't have instance metadata to fall back on? I guess it would look in ~/.aws/config or for an AWS_DEFAULT_REGION variable.

I guess I already know what happens, because, as I mentioned above, I tried commenting out the region in my nextflow.config file and still got the original error.

@sminot

sminot commented Jun 10, 2020

I just wanted to refresh this thread and see if there has been any progress.

It sounds like Nextflow does not currently support data in multiple S3 regions, and if that is likely to continue then I can just start working around it by copying reference data into my own region (at a cost, unfortunately).

Any update would be greatly appreciated!

@pditommaso
Member

We need some AWS guru to dig deep into this issue. Also, you may want to upvote the following feature request, which would give NF better support for S3 storage.

@pditommaso
Member

Sorry, forgot to include the link aws/aws-sdk-java-v2#1388

@dtenenba
Author

dtenenba commented Jun 11, 2020

Hi @pditommaso ,

I am definitely not a guru - I have not touched Java for many years, but I was able to get a simple test working that lists objects from buckets in two different regions with the same client. So it is definitely possible, and this does not seem to be an issue at the Java level. Here is what I did:

git clone https://github.com/awslabs/aws-java-sample.git
cd aws-java-sample

Then I made the following changes to src/main/java/com/amazonaws/samples/S3Sample.java:

diff --git a/src/main/java/com/amazonaws/samples/S3Sample.java b/src/main/java/com/amazonaws/samples/S3Sample.java
index 39beedd..3d665f8 100644
--- a/src/main/java/com/amazonaws/samples/S3Sample.java
+++ b/src/main/java/com/amazonaws/samples/S3Sample.java
@@ -63,8 +63,8 @@ public class S3Sample {
          */
 
         AmazonS3 s3 = new AmazonS3Client();
-        Region usWest2 = Region.getRegion(Regions.US_WEST_2);
-        s3.setRegion(usWest2);
+        // Region usWest2 = Region.getRegion(Regions.US_WEST_2);
+        // s3.setRegion(usWest2);
 
         String bucketName = "my-first-s3-bucket-" + UUID.randomUUID();
         String key = "MyObjectKey";
@@ -82,17 +82,17 @@ public class S3Sample {
              * You can optionally specify a location for your bucket if you want to
              * keep your data closer to your applications or users.
              */
-            System.out.println("Creating bucket " + bucketName + "\n");
-            s3.createBucket(bucketName);
+            // System.out.println("Creating bucket " + bucketName + "\n");
+            // s3.createBucket(bucketName);
 
             /*
              * List the buckets in your account
              */
-            System.out.println("Listing buckets");
-            for (Bucket bucket : s3.listBuckets()) {
-                System.out.println(" - " + bucket.getName());
-            }
-            System.out.println();
+            // System.out.println("Listing buckets");
+            // for (Bucket bucket : s3.listBuckets()) {
+            //     System.out.println(" - " + bucket.getName());
+            // }
+            // System.out.println();
 
             /*
              * Upload an object to your bucket - You can easily upload a file to
@@ -102,8 +102,8 @@ public class S3Sample {
              * like content-type and content-encoding, plus additional metadata
              * specific to your applications.
              */
-            System.out.println("Uploading a new object to S3 from a file\n");
-            s3.putObject(new PutObjectRequest(bucketName, key, createSampleFile()));
+            // System.out.println("Uploading a new object to S3 from a file\n");
+            // s3.putObject(new PutObjectRequest(bucketName, key, createSampleFile()));
 
             /*
              * Download an object - When you download an object, you get all of
@@ -117,10 +117,10 @@ public class S3Sample {
              * conditional downloading of objects based on modification times,
              * ETags, and selectively downloading a range of an object.
              */
-            System.out.println("Downloading an object");
-            S3Object object = s3.getObject(new GetObjectRequest(bucketName, key));
-            System.out.println("Content-Type: "  + object.getObjectMetadata().getContentType());
-            displayTextInputStream(object.getObjectContent());
+            // System.out.println("Downloading an object");
+            // S3Object object = s3.getObject(new GetObjectRequest(bucketName, key));
+            // System.out.println("Content-Type: "  + object.getObjectMetadata().getContentType());
+            // displayTextInputStream(object.getObjectContent());
 
             /*
              * List objects in your bucket by prefix - There are many options for
@@ -130,10 +130,19 @@ public class S3Sample {
              * use the AmazonS3.listNextBatchOfObjects(...) operation to retrieve
              * additional results.
              */
-            System.out.println("Listing objects");
+            System.out.println("Listing objects in broad-references");
             ObjectListing objectListing = s3.listObjects(new ListObjectsRequest()
-                    .withBucketName(bucketName)
-                    .withPrefix("My"));
+                    .withBucketName("broad-references"));
+            for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
+                System.out.println(" - " + objectSummary.getKey() + "  " +
+                        "(size = " + objectSummary.getSize() + ")");
+            }
+            System.out.println();
+
+            System.out.println("Listing objects in ngi-igenomes/igenomes");
+            objectListing = s3.listObjects(new ListObjectsRequest()
+                    .withBucketName("ngi-igenomes")
+                    .withPrefix("igenomes"));
             for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
                 System.out.println(" - " + objectSummary.getKey() + "  " +
                         "(size = " + objectSummary.getSize() + ")");
@@ -144,16 +153,16 @@ public class S3Sample {
              * Delete an object - Unless versioning has been turned on for your bucket,
              * there is no way to undelete an object, so use caution when deleting objects.
              */
-            System.out.println("Deleting an object\n");
-            s3.deleteObject(bucketName, key);
+            // System.out.println("Deleting an object\n");
+            // s3.deleteObject(bucketName, key);
 
             /*
              * Delete a bucket - A bucket must be completely empty before it can be
              * deleted, so remember to delete any objects from your buckets before
              * you try to delete them.
              */
-            System.out.println("Deleting bucket " + bucketName + "\n");
-            s3.deleteBucket(bucketName);
+            // System.out.println("Deleting bucket " + bucketName + "\n");
+            // s3.deleteBucket(bucketName);
         } catch (AmazonServiceException ase) {
             System.out.println("Caught an AmazonServiceException, which means your request made it "
                     + "to Amazon S3, but was rejected with an error response for some reason.");

Then build and run with:

mvn clean compile exec:java

First it lists the bucket broad-references, which (I believe) is in us-east-1.
Then it lists the bucket ngi-igenomes (using the igenomes prefix, because there are many thousands of objects in the root of the bucket), which is in eu-west-1.

It works successfully, so it is definitely possible to interact with buckets in different regions with the same client.

Things to note:

  • I commented out the code which specified a region for the S3 client. The code fails if I leave this code in.
  • I made the SDK aware of my credentials by setting the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, but I did not set AWS_DEFAULT_REGION.
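
For reference, here is a condensed version of the working part of that test (my own distillation of the modified sample above; same buckets, v1 Java SDK):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class CrossRegionListing {
    public static void main(String[] args) {
        // No region is set on the client; credentials come from
        // AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION is unset.
        AmazonS3 s3 = new AmazonS3Client();

        // broad-references is (I believe) in us-east-1
        listObjects(s3, new ListObjectsRequest().withBucketName("broad-references"));

        // ngi-igenomes is in eu-west-1
        listObjects(s3, new ListObjectsRequest().withBucketName("ngi-igenomes").withPrefix("igenomes"));
    }

    private static void listObjects(AmazonS3 s3, ListObjectsRequest request) {
        for (S3ObjectSummary summary : s3.listObjects(request).getObjectSummaries()) {
            System.out.println(" - " + summary.getKey() + "  (size = " + summary.getSize() + ")");
        }
    }
}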

Hope this is helpful.

@wleepang

I happened to hit this issue the other day in Cloud9. The fix was to remove the region specification from both nextflow.config and ~/.aws/config.

@wleepang

@pditommaso - will clients from the client factory fail if the region is left null?

region = config.region ?: Global.getAwsRegion() ?: fetchRegion()
if( !region )
    throw new AbortOperationException('Missing AWS region -- Make sure to define in your system environment the variable `AWS_DEFAULT_REGION`')

I optionally set that when creating the client for CodeCommit. Speaking of which, I noticed that the code got refactored and I can't find the AwsCodeCommitRepository provider. Where did that get moved to?

@pditommaso
Member

The S3 file system uses its own client. Based on what you are saying, it would be enough to remove the region setting (or not specify it):

https://github.com/nextflow-io/nextflow-s3fs/blob/83181d0b5f9f3d7dbbde94baa4242f4ced0be598/src/main/java/com/upplication/s3fs/S3FileSystemProvider.java#L834-L836

(the codecommit feature is here)

@pditommaso
Member

pditommaso commented Jun 14, 2020

I confirm that this only happens when setting the aws.region property in the NF config or the AWS_DEFAULT_REGION env var. Therefore, it should be enough to stop setting that property on the S3 client.

As a quick workaround, it's enough to not specify the above config options.
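
As a side note (only a sketch, and not necessarily how the S3 file system client is built): the v1 Java SDK client builder also offers a "force global bucket access" option which, as I understand it, lets a client configured for one region resolve buckets in other regions.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class GlobalBucketAccessSketch {
    public static void main(String[] args) {
        // Client configured for us-west-2, but allowed to follow buckets in any region:
        // the SDK resolves the bucket's actual region and redirects the request there.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("us-west-2")
                .withForceGlobalBucketAccessEnabled(true)
                .build();

        s3.listObjectsV2("ngi-igenomes", "igenomes");  // bucket in eu-west-1
    }
}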

pditommaso added a commit that referenced this issue Jun 14, 2020
@sminot

sminot commented Jun 16, 2020

When I tried this fix, I got this error:

Missing AWS region -- Make sure to define in your system environment the variable `AWS_DEFAULT_REGION`

Can you advise?

@pditommaso
Member

Maybe I was too optimistic about the workaround. The fix will be included in the next release.

@pditommaso pditommaso modified the milestones: v20.07.0, v20.04.0 Jul 2, 2020
@pditommaso
Member

Included in release 20.06.0-edge. If the problem persists feel free to reopen this issue.
