workflow cannot access public reference data S3 bucket in a different region #1596

Closed
dtenenba opened this issue May 4, 2020 · 21 comments

@dtenenba

dtenenba commented May 4, 2020

Bug report

This is on the edge between a bug report and a new feature request. I am honestly not sure which it is, or maybe it's both. After some debate I am filing it as a bug report.

Expected behavior and actual behavior

I am running a workflow in nf-core: https://github.com/nf-core/rnaseq
That workflow makes use of a reference data bucket:
https://github.com/nf-core/rnaseq/blob/master/conf/base.config#L71-L75

That bucket is in the eu-west-1 region while all my infrastructure is in us-west-2.

When I run the workflow, I get:

The bucket is in this region: eu-west-1. Please use this region to retry the request (Service: Amazon S3;
Status Code: 301; Error Code: PermanentRedirect; Request ID: BAE09F61CA1E2DFB; S3 Extended Request ID: iA6
xn8pYkVpEbshpqrJNxC0bEzK0GkAnx0scLW0zAks1jTnaKIe1sfEqJTyeefI7whGpXOfyB0A=)

This essentially means that nobody can run this nf-core workflow unless their infrastructure happens to be in the eu-west-1 region.

It seems to me that there are a couple of approaches that could fix this issue.
One would be to create an S3 client object that is not tied to a specific region. I am not an expert in the Java/Groovy AWS SDK, but Python's boto3 does not seem to restrict a client to the region_name given when it is created; at any rate, I can specify a region and still interact with buckets in other regions.
Another approach would be to catch the error that comes back and retry the operation with a client created for the appropriate region.
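
To sketch the second approach (this is only an illustration on my part using the v1 Java SDK, not Nextflow code; the bucket name and the error-detail key are assumptions):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.AmazonS3Exception;

public class RetryInBucketRegion {
    public static void main(String[] args) {
        String bucket = "ngi-igenomes";  // example bucket that lives in eu-west-1
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion("us-west-2").build();
        try {
            s3.listObjectsV2(bucket);
        } catch (AmazonS3Exception e) {
            // 301 PermanentRedirect: the bucket lives in a different region.
            // Assumption: the SDK exposes that region in the error details map.
            String bucketRegion = e.getAdditionalDetails() != null
                    ? e.getAdditionalDetails().get("Region") : null;
            if (e.getStatusCode() == 301 && bucketRegion != null) {
                AmazonS3 retry = AmazonS3ClientBuilder.standard()
                        .withRegion(bucketRegion).build();
                retry.listObjectsV2(bucket);  // retry against the bucket's own region
            } else {
                throw e;
            }
        }
    }
}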

Otherwise there are only two workarounds and they are both prohibitively cumbersome and/or expensive:

  1. Make a copy of the bucket in another region. Can I convince the original bucket owner to do this, or will I do it myself? How will we manage syncing the data across regions?
  2. Create a redundant AWS environment (VPC, Batch environment/queue, S3 bucket) in eu-west-1 just to get around this error.

Steps to reproduce the problem

Run nextflow run nf-core/rnaseq with a config that specifies an AWS region other than eu-west-1.

Program output

The bucket is in this region: eu-west-1. Please use this region to retry the request (Service: Amazon S3;
Status Code: 301; Error Code: PermanentRedirect; Request ID: BAE09F61CA1E2DFB; S3 Extended Request ID: iA6
xn8pYkVpEbshpqrJNxC0bEzK0GkAnx0scLW0zAks1jTnaKIe1sfEqJTyeefI7whGpXOfyB0A=)

Environment

  • Nextflow version: 20.01.0 build 5264
  • Java version: 1.8.0_181
  • Operating system: Ubuntu Linux 14.04
  • Bash version: GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)


@wleepang

wleepang commented May 6, 2020

@dtenenba - if you omit the region from the config file the underlying AmazonCloudDriver will use the "default" region from the execution environment.

@dtenenba
Author

dtenenba commented May 6, 2020

I just tried that and I am still getting the same error. I also commented out the region in my ~/.aws/config file as well and still got the same result.

@wleepang

wleepang commented May 7, 2020

Are you running from a local workstation or on an EC2 instance? If the latter, is your account part of an organization, and is there an SCP applied to the account?

@dtenenba
Author

dtenenba commented May 7, 2020

@wleepang Local workstation.

@pditommaso
Member

The region is set when the client is created.

https://github.com/nextflow-io/nextflow-s3fs/blob/83181d0b5f9f3d7dbbde94baa4242f4ced0be598/src/main/java/com/upplication/s3fs/S3FileSystemProvider.java#L834-L836

@wleepang Can the AWS client only access buckets in that region?

@wleepang

wleepang commented May 7, 2020

I'm not sure if this translates to the Java SDK. I whipped up a quick test with the Python SDK, and explicitly setting the region on the client does not affect access to the bucket - I was able to list and get objects.

@dtenenba
Author

dtenenba commented May 7, 2020

I had the same results with a Python test -- no problem accessing a bucket in eu-west-1 with a client created for us-west-2, and vice versa. I am not sure if the Java SDK behaves differently or if it's an implementation issue. I do notice that the error message (Please use this region to retry the request) does come from aws-sdk-java.

@wleepang

wleepang commented May 7, 2020

I wonder if this (explicitly setting the api endpoint) could be the root cause:
https://github.com/nextflow-io/nextflow-s3fs/blob/83181d0b5f9f3d7dbbde94baa4242f4ced0be598/src/main/java/com/upplication/s3fs/S3FileSystemProvider.java#L831

although, again - this isn't an issue with the Python SDK

@wleepang

@dtenenba - ran a more elaborate test today:

Installed nextflow on an EC2 instance in us-east-2 with a "restricted" instance profile (very minimal permissions) and ran the following demo workflow:

https://github.com/wleepang/demo-genomics-workflow-nextflow

which sources public input data from us-west-2 and public reference data from us-east-1.

I was only able to replicate your error when I specifically added

aws.region = "us-east-2"

to ~/.nextflow/config:

N E X T F L O W  ~  version 20.04.1
Launching `wleepang/demo-genomics-workflow-nextflow` [fabulous_sax] - revision: 9b06eeeb04 [master]
script: f8fa70949143134785feefa8720a76f3
session: e2181429-a717-4621-a8e8-179ee372908b

sample-id: NIST7035
The bucket is in this region: us-east-1. Please use this region to retry the request (Service: Amazon S3; Status Code: 301; Error Code: PermanentRedirect; Request ID: 3E7C6C0C95D75853; S3 Extended Request ID: EHSkfvR72NdrmpTcNyFLWdRnIIKQSH41xHIAfr5YCwDL2XsNYO/vAiOvAND58esllkwhu+WhsNk=)
The bucket is in this region: us-west-2. Please use this region to retry the request (Service: Amazon S3; Status Code: 301; Error Code: PermanentRedirect; Request ID: 6E17F598696E07E3; S3 Extended Request ID: wo3xEbs5iD1IeIrndthuQ6zPsroEjOKl/EjcOVy1oCxPnVGN6ZUcvPTVAkkZ/3alrMRRLgKQEMo=)

When I remove that config option, the workflow runs fine.

It looks like eu-west-1 is specified as the default in the nextflow.config file that ships with the nf-core/rnaseq workflow:
https://github.com/nf-core/rnaseq/blob/3b6df9bd104927298fcdf69e97cca7ff1f80527c/nextflow.config#L95

@pditommaso - what is the suggested way to override this? Could one set aws.region = null in a nextflow.config in the current working directory where nextflow is called?

@dtenenba
Author

@wleepang Thanks for the testing! I am curious how this test would have gone if you had been on a local workstation and not on an EC2 instance. How would nextflow know which region you are using for e.g. your AWS Batch queue, if it didn't have instance metadata to fall back on? I guess it would look in ~/.aws/config or for an AWS_DEFAULT_REGION variable.

I guess I already know what happens, because, as I mentioned above, I tried commenting out the region in my nextflow.config file and still got the original error.

@sminot

sminot commented Jun 10, 2020

I just wanted to refresh this thread and see if there has been any progress.

It sounds like Nextflow does not currently support data in multiple S3 regions, and if that is likely to continue then I can just start working around it by copying reference data into my own region (at a cost, unfortunately).

Any update would be greatly appreciated!

@pditommaso
Member

We need some AWS guru to dig deep into this issue. Also, you may want to upvote the following feature request, which would give NF better support for S3 storage.

@pditommaso
Member

Sorry, forgot to include the link aws/aws-sdk-java-v2#1388

@dtenenba
Author

dtenenba commented Jun 11, 2020

Hi @pditommaso ,

I am definitely not a guru - I have not touched Java for many years, but I was able to get a simple test working that lists objects from buckets in two different regions with the same client. So it is definitely possible, and this does not seem to be an issue at the Java level. Here is what I did:

git clone https://github.com/awslabs/aws-java-sample.git
cd aws-java-sample

Then I made the following changes to src/main/java/com/amazonaws/samples/S3Sample.java:

diff --git a/src/main/java/com/amazonaws/samples/S3Sample.java b/src/main/java/com/amazonaws/samples/S3Sample.java
index 39beedd..3d665f8 100644
--- a/src/main/java/com/amazonaws/samples/S3Sample.java
+++ b/src/main/java/com/amazonaws/samples/S3Sample.java
@@ -63,8 +63,8 @@ public class S3Sample {
          */
 
         AmazonS3 s3 = new AmazonS3Client();
-        Region usWest2 = Region.getRegion(Regions.US_WEST_2);
-        s3.setRegion(usWest2);
+        // Region usWest2 = Region.getRegion(Regions.US_WEST_2);
+        // s3.setRegion(usWest2);
 
         String bucketName = "my-first-s3-bucket-" + UUID.randomUUID();
         String key = "MyObjectKey";
@@ -82,17 +82,17 @@ public class S3Sample {
              * You can optionally specify a location for your bucket if you want to
              * keep your data closer to your applications or users.
              */
-            System.out.println("Creating bucket " + bucketName + "\n");
-            s3.createBucket(bucketName);
+            // System.out.println("Creating bucket " + bucketName + "\n");
+            // s3.createBucket(bucketName);
 
             /*
              * List the buckets in your account
              */
-            System.out.println("Listing buckets");
-            for (Bucket bucket : s3.listBuckets()) {
-                System.out.println(" - " + bucket.getName());
-            }
-            System.out.println();
+            // System.out.println("Listing buckets");
+            // for (Bucket bucket : s3.listBuckets()) {
+            //     System.out.println(" - " + bucket.getName());
+            // }
+            // System.out.println();
 
             /*
              * Upload an object to your bucket - You can easily upload a file to
@@ -102,8 +102,8 @@ public class S3Sample {
              * like content-type and content-encoding, plus additional metadata
              * specific to your applications.
              */
-            System.out.println("Uploading a new object to S3 from a file\n");
-            s3.putObject(new PutObjectRequest(bucketName, key, createSampleFile()));
+            // System.out.println("Uploading a new object to S3 from a file\n");
+            // s3.putObject(new PutObjectRequest(bucketName, key, createSampleFile()));
 
             /*
              * Download an object - When you download an object, you get all of
@@ -117,10 +117,10 @@ public class S3Sample {
              * conditional downloading of objects based on modification times,
              * ETags, and selectively downloading a range of an object.
              */
-            System.out.println("Downloading an object");
-            S3Object object = s3.getObject(new GetObjectRequest(bucketName, key));
-            System.out.println("Content-Type: "  + object.getObjectMetadata().getContentType());
-            displayTextInputStream(object.getObjectContent());
+            // System.out.println("Downloading an object");
+            // S3Object object = s3.getObject(new GetObjectRequest(bucketName, key));
+            // System.out.println("Content-Type: "  + object.getObjectMetadata().getContentType());
+            // displayTextInputStream(object.getObjectContent());
 
             /*
              * List objects in your bucket by prefix - There are many options for
@@ -130,10 +130,19 @@ public class S3Sample {
              * use the AmazonS3.listNextBatchOfObjects(...) operation to retrieve
              * additional results.
              */
-            System.out.println("Listing objects");
+            System.out.println("Listing objects in broad-references");
             ObjectListing objectListing = s3.listObjects(new ListObjectsRequest()
-                    .withBucketName(bucketName)
-                    .withPrefix("My"));
+                    .withBucketName("broad-references"));
+            for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
+                System.out.println(" - " + objectSummary.getKey() + "  " +
+                        "(size = " + objectSummary.getSize() + ")");
+            }
+            System.out.println();
+
+            System.out.println("Listing objects in ngi-igenomes/igenomes");
+            objectListing = s3.listObjects(new ListObjectsRequest()
+                    .withBucketName("ngi-igenomes")
+                    .withPrefix("igenomes"));
             for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
                 System.out.println(" - " + objectSummary.getKey() + "  " +
                         "(size = " + objectSummary.getSize() + ")");
@@ -144,16 +153,16 @@ public class S3Sample {
              * Delete an object - Unless versioning has been turned on for your bucket,
              * there is no way to undelete an object, so use caution when deleting objects.
              */
-            System.out.println("Deleting an object\n");
-            s3.deleteObject(bucketName, key);
+            // System.out.println("Deleting an object\n");
+            // s3.deleteObject(bucketName, key);
 
             /*
              * Delete a bucket - A bucket must be completely empty before it can be
              * deleted, so remember to delete any objects from your buckets before
              * you try to delete them.
              */
-            System.out.println("Deleting bucket " + bucketName + "\n");
-            s3.deleteBucket(bucketName);
+            // System.out.println("Deleting bucket " + bucketName + "\n");
+            // s3.deleteBucket(bucketName);
         } catch (AmazonServiceException ase) {
             System.out.println("Caught an AmazonServiceException, which means your request made it "
                     + "to Amazon S3, but was rejected with an error response for some reason.");

Then build and run with:

mvn clean compile exec:java

First it lists the bucket broad-references, which (I believe) is in us-east-1.
Then it lists the bucket ngi-igenomes (using the igenomes prefix, because there are many thousands of objects in the root of the bucket), which is in eu-west-1.

It works successfully, so it is definitely possible to interact with buckets in different regions with the same client.

Things to note:

  • I commented out the code which specified a region for the S3 client. The code fails if I leave this code in.
  • I made the SDK aware of my credentials by setting the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, but I did not set AWS_DEFAULT_REGION.
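
For reference, here is a condensed version of the working part of that test (my own distillation of the modified sample above; same buckets, v1 Java SDK):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class CrossRegionListing {
    public static void main(String[] args) {
        // No region is set on the client; credentials come from
        // AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION is unset.
        AmazonS3 s3 = new AmazonS3Client();

        // broad-references is (I believe) in us-east-1
        listObjects(s3, new ListObjectsRequest().withBucketName("broad-references"));

        // ngi-igenomes is in eu-west-1
        listObjects(s3, new ListObjectsRequest().withBucketName("ngi-igenomes").withPrefix("igenomes"));
    }

    private static void listObjects(AmazonS3 s3, ListObjectsRequest request) {
        for (S3ObjectSummary summary : s3.listObjects(request).getObjectSummaries()) {
            System.out.println(" - " + summary.getKey() + "  (size = " + summary.getSize() + ")");
        }
    }
}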

Hope this is helpful.

@wleepang

I happened to hit this issue the other day in Cloud9. The fix was to remove the region specification from both nextflow.config and ~/.aws/config.

@wleepang

@pditommaso - will clients from the client factory fail if the region is left null?

region = config.region ?: Global.getAwsRegion() ?: fetchRegion()
if( !region )
    throw new AbortOperationException('Missing AWS region -- Make sure to define in your system environment the variable `AWS_DEFAULT_REGION`')

I optionally set that when creating the client for CodeCommit. Speaking of which, I noticed that the code got refactored and I can't find the AwsCodeCommitRepository provider. Where did that get moved to?

@pditommaso
Member

The S3 file system uses its own client. Based on what you are saying, it would be enough to remove the region setting (or not specify it):

https://github.com/nextflow-io/nextflow-s3fs/blob/83181d0b5f9f3d7dbbde94baa4242f4ced0be598/src/main/java/com/upplication/s3fs/S3FileSystemProvider.java#L834-L836

(the codecommit feature is here)

@pditommaso
Member

pditommaso commented Jun 14, 2020

I confirm that this only happens when setting the aws.region property in the NF config or the AWS_DEFAULT_REGION env var. Therefore, it should be enough to stop setting that property on the S3 client.

As a quick workaround, it's enough to not specify the above config options.
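
As a side note (only a sketch, and not necessarily how the S3 file system client is built): the v1 Java SDK client builder also offers a "force global bucket access" option which, as I understand it, lets a client configured for one region resolve buckets in other regions.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class GlobalBucketAccessSketch {
    public static void main(String[] args) {
        // Client configured for us-west-2, but allowed to follow buckets in any region:
        // the SDK resolves the bucket's actual region and redirects the request there.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("us-west-2")
                .withForceGlobalBucketAccessEnabled(true)
                .build();

        s3.listObjectsV2("ngi-igenomes", "igenomes");  // bucket in eu-west-1
    }
}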

pditommaso added a commit that referenced this issue Jun 14, 2020
@sminot

sminot commented Jun 16, 2020

When I tried this fix, I got this error:

Missing AWS region -- Make sure to define in your system environment the variable `AWS_DEFAULT_REGION`

Can you advise?

@pditommaso
Member

Maybe I was too optimistic about the workaround. The fix will be included in the next release.

@pditommaso pditommaso modified the milestones: v20.07.0, v20.04.0 Jul 2, 2020
@pditommaso
Member

Included in release 20.06.0-edge. If the problem persists feel free to reopen this issue.
