
DM-24975: Use GCS instead of AWS-S3 for the Butler datastore #290

Merged
merged 2 commits into master from tickets/DM-24975 on May 27, 2020

Conversation

hsinfang (Contributor)

No description provided.

@timj timj (Member) left a comment

Looks good. My main comment is that I don't want Config to require boto3, so I still want the runtime checks on boto3 that only trigger if you try to do something with S3. Other than that, consider getS3Client as the name and see what you think of my implementation.

try:
    import boto3
except ImportError:
    boto3 = None
import boto3

Member

I would still like this import to be protected please.


from .location import ButlerURI, Location


def getClient():

Member

and this should raise ModuleNotFoundError if boto3 was not found.
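
A minimal sketch of the guarded import and deferred failure being asked for here; the helper name getS3Client and the error message are illustrative assumptions, not the merged code:

try:
    import boto3
except ImportError:
    boto3 = None


def getS3Client():
    # Only fail when S3 functionality is actually requested, so that
    # importing Config does not require boto3 to be installed.
    if boto3 is None:
        raise ModuleNotFoundError("Could not find boto3; it is required for S3 support.")
    return boto3.client("s3")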

@@ -61,12 +72,8 @@ def s3CheckFileExists(path, bucket=None, client=None):
.. _manual: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/\
configuration.html#configuring-credentials
"""
if boto3 is None:

Member

please put this back

@@ -119,15 +126,12 @@ def bucketExists(bucketName, client=None):
.. _manual: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/\
configuration.html#configuring-credentials
"""
if boto3 is None:

Member

put back


    Look for the crendentials in the env, and connect to the S3 client
    with the default AWS or the Google Cloud Storage endpoint
    """

Member

Add a Returns field for sphinx.
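
For example, a numpydoc-style Returns section could look like this (a sketch; the return type and wording are assumptions, not the merged docstring):

def getClient():
    """Return an S3 client.

    Returns
    -------
    client : `botocore.client.BaseClient`
        A boto3 S3 client connected to either the default AWS endpoint
        or the Google Cloud Storage interoperability endpoint.
    """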

@@ -42,6 +42,7 @@
import lsst.utils
from lsst.utils import doImport
from .location import ButlerURI
from .s3utils import getClient

Member

Since getClient is generic I'd prefer not to import it with the generic name. Maybe

from .s3utils import getClient as getS3Client

? (or call it that to start with).

    with the default AWS or the Google Cloud Storage endpoint
    """
    try:
        key = os.environ['AWS_ACCESS_KEY_ID']

Member

I think I'd prefer:

key = os.environ.get("AWS_ACCESS_KEY_ID", "")
endpoint = "https://storage.googleapis.com" if key.startswith("GOOG") else None
return boto3.client("s3", endpoint_url=endpoint)

PS: daf_butler uses double quotes, not single quotes.

@@ -61,12 +72,8 @@ def s3CheckFileExists(path, bucket=None, client=None):
.. _manual: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/\
configuration.html#configuring-credentials
"""
if boto3 is None:
raise ModuleNotFoundError(("Could not find boto3. "

Member

When these go back please remove the unnecessary parentheses.
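
That is, dropping the redundant inner parentheses around the implicitly concatenated string (the second string below is a placeholder, since the diff truncates the real message):

# Before: extra parentheses around the message
raise ModuleNotFoundError(("Could not find boto3. "
                           "<rest of message>"))
# After:
raise ModuleNotFoundError("Could not find boto3. "
                          "<rest of message>")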

def getClient():
    """Return a s3 client

    Look for the crendentials in the env, and connect to the S3 client

Member

Typo: crendentials

    with the default AWS or the Google Cloud Storage endpoint
    """
    try:
        key = os.environ['AWS_ACCESS_KEY_ID']

Collaborator

I very much disagree with this choice. This completely locks us into only one way of authentication, and the worst and most unsafe one at that. If you want to lock GCE into this authentication strategy I guess that's fine by me, but I instantiate a generic client to take advantage of the different, much safer, authentication mechanisms such as IAM, `~/.aws` credential files, etc.

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#configuring-credentials

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#best-practices-for-configuring-credentials

Member

I'm happy to take your input on whether the use of the environment variable is the right thing. Do you think that getS3Client needs to exist at all? Or would configuration changes mean that no code changes are needed at all here?

Collaborator

Hm, the problem is that I don't know how GCE handles authentication, tbh. From the code it seems to me that HFC needed to pass in the endpoint URL and that it didn't work out for her otherwise.
I don't want to have two different classes for the object datastore* if the only thing that's different is that they instantiate the client differently, but if boto3 needs the endpoint URL that seems like the way to go to me. I guess the other option is to wrap the boto3.client('s3') call, catch the authentication exception, try with the Google endpoint instead, and re-raise if required. Then everything can remain the same.

I assume that another benefit to having two classes is that there is then enough flexibility to diverge and adjust to individual object store specifics. But again, I don't know if there are any such use cases, so maybe I shouldn't run before I walk (i.e. I need to do some reading up on it).

  • EDIT: because it might be better to be able to specify a single datastore class in config and just have it work for both AWS and GCE

@timj timj (Member), May 21, 2020

I'm not sure I understand why we would ever want to have different AWS and GCE datastore implementations if the only difference is the endpoint URL. We clearly need a way to specify whether we are using the AWS, Google, or US Data Facility S3 endpoint. This has to happen presumably before you try to access a butler.yaml file via an s3:// URI, so it can't be in that butler.yaml file (it's too late by then). This patch tries to put all the code that could use an endpoint_url in a single place, and then the next question is how that code can work out where to look. We either use environment variables or we use some other dot file. This is orthogonal to credentials.

Member

Also, always calling out to AWS, failing, and then trying Google, does not seem like the right answer to me.


ts_salobj uses env variable S3_ENDPOINT_URL as per https://ts-salobj.lsst.io/configuration.html#environment-variables

Collaborator

Ok, I'm wrong. Sorry about that. I guess I was in too much of a rush to give it a good thought.

I still see many benefits to having a specialized datastore for GCS that uses a GCS SDK instead of the GCS Interop API. For example, as mentioned in the docs, any boto3 functionality calling list_objects_v2 (which is the new and recommended way to do it) under the hood wouldn't work; it's not supported. I assume this was the reason the Buckets API in the tests had to be replaced. I'm not sure the errors returned by the GCS Interop are the same as for AWS S3; this would need to be tested, because S3Datastore relies heavily on correctly handling error cases.

I still think allowing only strictly environment-variable authentication is not a good idea. We are stuffing Google Cloud Storage keys under AWS_ACCESS_KEY_ID values. I agree with Tim's latest suggestion that we should at least rename the env var.

Member

Ok. Now I understand your other comment. We are talking about multiple things in parallel. The first is "how do we specify arbitrary S3 endpoints" and the second is "in the future it might be worth writing a Datastore that uses google-native APIs rather than S3 compatibility layer". I think we have agreed that an endpoint environment variable specifying the actual endpoint to use is a reasonable solution. I am more than happy to end up with a google-specific datastore at some point in the future.

Contributor Author

We are also talking about two kinds of env var here. One is the endpoint, and the other is the authentication key.

For the authentication key, using the env var AWS_ACCESS_KEY_ID for GCS is the quickest way to migrate, but we will want to look into other ways too. For AWS the authentication mechanism is not changed; AWS credential files continue to work as before.

For the endpoint, it sounds like we want to read it from the env var S3_ENDPOINT_URL like ts_salobj does, instead of hard-coding it. My current patch uses AWS_ACCESS_KEY_ID to tell whether it's GCS or AWS, thus avoiding one more env var.

Contributor Author

The patch has been replaced by one using the env var S3_ENDPOINT_URL. It should be less confusing now.

DM-25093 has been created for looking into other GCS authentication methods.
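
A minimal sketch of what the endpoint handling might look like with S3_ENDPOINT_URL (illustrative only; not the merged implementation):

import os
import boto3


def getS3Client():
    # Use the endpoint from S3_ENDPOINT_URL when set, e.g.
    # https://storage.googleapis.com for GCS; an unset or empty variable
    # falls back to boto3's default AWS endpoint.
    endpoint = os.environ.get("S3_ENDPOINT_URL") or None
    return boto3.client("s3", endpoint_url=endpoint)

With this shape, pointing the same code at GCS is just a matter of setting S3_ENDPOINT_URL to https://storage.googleapis.com, while AWS users leave it unset.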


bucket = s3.Bucket(self.bucketName)
bucket.delete()
objects = self.client.list_objects(Bucket=self.bucketName)

Collaborator

This only fetches the first 1000 objects. In tests this should not be a problem, since it's probably safe to assume that the number of objects put into the buckets in tests will always be far fewer than 1000, but I still wanted to point it out just in case.

Contributor Author

Thanks for pointing it out. I saw that limit but thought it's okay for the unit tests. I'll add a comment here to make it clearer. Strictly speaking the tests don't need changes because it's all done through moto. Nonetheless, the new approach should work on both AWS and GCS, assuming there are not that many objects.

Contributor Author

From the boto3 documentation it looks like the approach on master via the object collection (bucket.objects.all().delete()) has the 1000-object limit too?
Either seems adequate for cleaning up in the unit tests here.

Collaborator

Hmm, maybe I didn't understand the documentation myself then. This is what motivated that particular use: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Bucket.objects

boto3.Bucket.objects
A collection of ObjectSummary resources. A ObjectSummary Collection will include all resources by default, and extreme caution should be taken when performing actions on all resources.

Collaborator

I don't require you to change the code; I just wanted to point out that the purpose of SDKs is to wrap the base API calls so that they behave in a more pleasant and expected way: paging, caching, not paying the overhead of opening a new connection on every request, etc.

On one side, an unknown percentage of this could be lost for GCE when the current workaround is applied. On the other side, I would urge you to be maximally careful not to "un-generalize" the S3Datastore by changing calls without thinking through how these things behave in the background, is all. That was the point of yesterday's objections.

Reading back over yesterday's posts I can see how my objections might seem very opinionated, but I don't see anything in this PR that breaks anything else. I certainly also didn't intend the objections in that strong a way, just as a caution sign. It's fine by me if this goes forward as is. It seems to work for everyone else too.

Here and Here you can see how paging is implemented, so it does return all objects (think along the lines of an aws s3 --recursive type of command).
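
For reference, a sketch of how the client-level cleanup could page through every key rather than only the first 1000; the bucket name is a placeholder, and list_objects is used instead of list_objects_v2 because the latter was reported above as unsupported by the GCS interop layer:

import boto3

client = boto3.client("s3")
bucketName = "example-test-bucket"  # placeholder

# Each list_objects call returns at most 1000 keys; the paginator follows
# the markers so every object in the bucket is visited and deleted.
paginator = client.get_paginator("list_objects")
for page in paginator.paginate(Bucket=bucketName):
    for obj in page.get("Contents", []):
        client.delete_object(Bucket=bucketName, Key=obj["Key"])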

self.client.upload_file(Bucket=location.netloc, Key=location.relativeToPathRoot,
                        Filename=tmpFile.name)
with open(tmpFile.name, 'rb') as f:
    self.client.put_object(Bucket=location.netloc,

Collaborator

So, to point out what I said in the comments on the call-type changes in the test utils: changing from upload_file to put_object replaces the underlying high-level call through the S3Transfer class with a low-level API call.

The upload_file call will automatically handle multipart uploads and upload the file in multiple parts, whereas put_object does not. Additionally, if anyone ever wanted to implement a better, tighter upload control configuration, this would now prevent them from doing so.

This wouldn't work on GCS either because, from what I've read yesterday, GCS will save each uploaded part and then an additional key that is a concatenation of all the individual parts too.
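
If tighter upload control were ever wanted while keeping the high-level call, boto3's TransferConfig would be one option. A sketch, assuming raising multipart_threshold is acceptable (file and bucket names are placeholders; this is an alternative, not what the PR does, since the PR switches to put_object):

import boto3
from boto3.s3.transfer import TransferConfig

client = boto3.client("s3")

# Raising multipart_threshold keeps upload_file on a single PUT for files
# smaller than the threshold instead of splitting them into multipart uploads.
config = TransferConfig(multipart_threshold=1024 ** 3)  # 1 GiB, illustrative
client.upload_file(Filename="/tmp/example.bin", Bucket="example-bucket",
                   Key="example.bin", Config=config)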

Contributor Author

The multipart upload issue was indeed the motivation behind this change (or call it a workaround):

upload_file fails for large files on GCS from its
use of multipart upload.

An alternative is to disable auto multipart upload
in the bucket config.
so to specify the endpoint more easily.

Some changes are not strictly necessary in test_s3utils.py but to
- use the client more consistently
- use lowercase for bucket names because GCS does not allow uppercase
@hsinfang hsinfang merged commit 6378a8f into master May 27, 2020
@timj timj deleted the tickets/DM-24975 branch February 16, 2024 17:18