
DM-24975: Use GCS instead of AWS-S3 for the Butler datastore #290

Merged
merged 2 commits into master from tickets/DM-24975 on May 27, 2020

Conversation

hsinfang (Contributor)

No description provided.

@timj timj (Member) left a comment

Looks good. My main comment is that I don't want Config to require boto3, so I still want the runtime checks on boto3 that only trigger if you try to do something with S3. Other than that, consider getS3Client as the name and see what you think of my implementation.

try:
    import boto3
except ImportError:
    boto3 = None
import boto3

Member

I would still like this import to be protected please.


from .location import ButlerURI, Location


def getClient():

Member

and this should raise ModuleNotFoundError if boto3 was not found.
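
A minimal sketch of the guarded import and deferred failure being asked for here; the helper name getS3Client and the error message are illustrative assumptions, not the merged code:

try:
    import boto3
except ImportError:
    boto3 = None


def getS3Client():
    # Only fail when S3 functionality is actually requested, so that
    # importing Config does not require boto3 to be installed.
    if boto3 is None:
        raise ModuleNotFoundError("Could not find boto3; it is required for S3 support.")
    return boto3.client("s3")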

@@ -61,12 +72,8 @@ def s3CheckFileExists(path, bucket=None, client=None):
.. _manual: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/\
configuration.html#configuring-credentials
"""
if boto3 is None:

Member

please put this back

@@ -119,15 +126,12 @@ def bucketExists(bucketName, client=None):
.. _manual: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/\
configuration.html#configuring-credentials
"""
if boto3 is None:

Member

put back


    Look for the crendentials in the env, and connect to the S3 client
    with the default AWS or the Google Cloud Storage endpoint
    """

Member

Add a Returns field for sphinx.
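
For example, a numpydoc-style Returns section could look like this (a sketch; the return type and wording are assumptions, not the merged docstring):

def getClient():
    """Return an S3 client.

    Returns
    -------
    client : `botocore.client.BaseClient`
        A boto3 S3 client connected to either the default AWS endpoint
        or the Google Cloud Storage interoperability endpoint.
    """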

@@ -42,6 +42,7 @@
import lsst.utils
from lsst.utils import doImport
from .location import ButlerURI
from .s3utils import getClient

Member

Since getClient is generic I'd prefer not to import it with the generic name. Maybe

from .s3utils import getClient as getS3Client

? (or call it that to start with).

    with the default AWS or the Google Cloud Storage endpoint
    """
    try:
        key = os.environ['AWS_ACCESS_KEY_ID']

Member

I think I'd prefer:

key = os.environ.get("AWS_ACCESS_KEY_ID", "")
endpoint = "https://storage.googleapis.com" if key.startswith("GOOG") else None
return boto3.client("s3", endpoint_url=endpoint)

PS: daf_butler uses double quotes, not single quotes.

@@ -61,12 +72,8 @@ def s3CheckFileExists(path, bucket=None, client=None):
.. _manual: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/\
configuration.html#configuring-credentials
"""
if boto3 is None:
raise ModuleNotFoundError(("Could not find boto3. "

Member

When these go back please remove the unnecessary parentheses.
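
That is, dropping the redundant inner parentheses around the implicitly concatenated string (the second string below is a placeholder, since the diff truncates the real message):

# Before: extra parentheses around the message
raise ModuleNotFoundError(("Could not find boto3. "
                           "<rest of message>"))
# After:
raise ModuleNotFoundError("Could not find boto3. "
                          "<rest of message>")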

def getClient():
    """Return a s3 client

    Look for the crendentials in the env, and connect to the S3 client

Member

Typo: crendentials

    with the default AWS or the Google Cloud Storage endpoint
    """
    try:
        key = os.environ['AWS_ACCESS_KEY_ID']

Collaborator

I very much disagree with this choice. This completely locks us into only one way of authentication, and the worst and most unsafe one at that. If you want to lock GCE into this authentication strategy I guess that's fine by me, but I instantiate a generic client to take advantage of the different, much safer, authentication mechanisms such as IAM, `~/.aws` credential files, etc.

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#configuring-credentials

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#best-practices-for-configuring-credentials

Member

I'm happy to take your input on whether the use of the environment variable is the right thing. Do you think that getS3Client needs to exist at all? Or would configuration changes mean that no code changes are needed at all here?

Collaborator

Hm, the problem is that I don't know how GCE handles authentication, tbh. From the code it seems to me that HFC needed to pass in the endpoint URL and that it didn't work out for her otherwise.
I don't want to have two different classes for the object datastore* if the only thing that's different is that they instantiate the client differently, but if boto3 needs the endpoint URL that seems like the way to go to me. I guess the other option is to wrap the boto3.client('s3') call, catch the authentication exception, try with the Google endpoint instead, and re-raise if required. Then everything can remain the same.

I assume that another benefit to having two classes is that there is then enough flexibility to diverge and adjust to individual object store specifics. But again, I don't know if there are any such use cases, so maybe I shouldn't run before I walk (i.e. I need to do some reading up on it).

  • EDIT: because it might be better to be able to specify a single datastore class in config and just have it work for both AWS and GCE

@timj timj (Member), May 21, 2020

I'm not sure I understand why we would ever want to have different AWS and GCE datastore implementations if the only difference is the endpoint URL. We clearly need a way to specify whether we are using the AWS, Google, or US Data Facility S3 endpoint. This has to happen presumably before you try to access a butler.yaml file via an s3:// URI, so it can't be in that butler.yaml file (it's too late by then). This patch tries to put all the code that could use an endpoint_url in a single place, and then the next question is how that code can work out where to look. We either use environment variables or we use some other dot file. This is orthogonal to credentials.

Member

Also, always calling out to AWS, failing, and then trying Google, does not seem like the right answer to me.


ts_salobj uses env variable S3_ENDPOINT_URL as per https://ts-salobj.lsst.io/configuration.html#environment-variables

Collaborator

Ok, I'm wrong. Sorry about that. I guess I was in too much of a rush to give it a good thought.

I still see many benefits to having a specialized datastore for GCS that uses a GCS SDK instead of the GCS Interop API. For example, as mentioned in the docs, any boto3 functionality calling list_objects_v2 (which is the new and recommended way to do it) under the hood wouldn't work; it's not supported. I assume this was the reason the Buckets API in the tests had to be replaced. I'm not sure the errors returned by the GCS Interop are the same as for AWS S3; this would need to be tested, because S3Datastore relies heavily on correctly handling error cases.

I still think allowing only strictly environment-variable authentication is not a good idea. We are stuffing Google Cloud Storage keys under AWS_ACCESS_KEY_ID values. I agree with Tim's latest suggestion that we should at least rename the env var.

Member

Ok. Now I understand your other comment. We are talking about multiple things in parallel. The first is "how do we specify arbitrary S3 endpoints" and the second is "in the future it might be worth writing a Datastore that uses google-native APIs rather than S3 compatibility layer". I think we have agreed that an endpoint environment variable specifying the actual endpoint to use is a reasonable solution. I am more than happy to end up with a google-specific datastore at some point in the future.

Contributor Author

We are also talking about two kinds of env var here. One is the endpoint, and the other is the authentication key.

For the authentication key, using the env var AWS_ACCESS_KEY_ID for GCS is the quickest way to migrate, but we will want to look into other ways too. For AWS the authentication mechanism is not changed; AWS credential files continue to work as before.

For the endpoint, it sounds like we want to read it from the env var S3_ENDPOINT_URL like ts_salobj does, instead of hard-coding it. My current patch uses AWS_ACCESS_KEY_ID to tell whether it's GCS or AWS, thus avoiding one more env var.

Contributor Author

The patch has been replaced by one using the env var S3_ENDPOINT_URL. It should be less confusing now.

DM-25093 has been created for looking into other GCS authentication methods.
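
A minimal sketch of what the endpoint handling might look like with S3_ENDPOINT_URL (illustrative only; not the merged implementation):

import os
import boto3


def getS3Client():
    # Use the endpoint from S3_ENDPOINT_URL when set, e.g.
    # https://storage.googleapis.com for GCS; an unset or empty variable
    # falls back to boto3's default AWS endpoint.
    endpoint = os.environ.get("S3_ENDPOINT_URL") or None
    return boto3.client("s3", endpoint_url=endpoint)

With this shape, pointing the same code at GCS is just a matter of setting S3_ENDPOINT_URL to https://storage.googleapis.com, while AWS users leave it unset.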


bucket = s3.Bucket(self.bucketName)
bucket.delete()
objects = self.client.list_objects(Bucket=self.bucketName)

Collaborator

This only fetches the first 1000 objects. In tests this should not be a problem, since it's probably safe to assume that the number of objects put into the buckets in tests will always be far fewer than 1000, but I still wanted to point it out just in case.

Contributor Author

Thanks for pointing it out. I saw that limit but thought it's okay for the unit tests. I'll add a comment here to make it clearer. Strictly speaking the tests don't need changes because it's all done through moto. Nonetheless, the new approach should work on both AWS and GCS, assuming there are not that many objects.

Contributor Author

From the boto3 documentation it looks like the approach on master via the object collection (bucket.objects.all().delete()) has the 1000-object limit too?
Either seems adequate for cleaning up in the unit tests here.

Collaborator

Hmm, maybe I didn't understand the documentation myself then. This is what motivated that particular use: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Bucket.objects

boto3.Bucket.objects
A collection of ObjectSummary resources. A ObjectSummary Collection will include all resources by default, and extreme caution should be taken when performing actions on all resources.

Collaborator

I don't require you to change the code; I just wanted to point out that the purpose of SDKs is to wrap the base API calls so that they behave in a more pleasant and expected way: paging, caching, not paying the overhead of opening a new connection on every request, etc.

On one side, an unknown percentage of this could be lost for GCE when the current workaround is applied. On the other side, I would urge you to be maximally careful not to "un-generalize" the S3Datastore by changing calls without thinking through how these things behave in the background, is all. That was the point of yesterday's objections.

Reading back over yesterday's posts I can see how my objections might seem very opinionated, but I don't see anything in this PR that breaks anything else. I certainly also didn't intend the objections in that strong a way, just as a caution sign. It's fine by me if this goes forward as is. It seems to work for everyone else too.

Here and Here you can see how paging is implemented, so it does return all objects (think along the lines of an aws s3 --recursive type of command).
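
For reference, a sketch of how the client-level cleanup could page through every key rather than only the first 1000; the bucket name is a placeholder, and list_objects is used instead of list_objects_v2 because the latter was reported above as unsupported by the GCS interop layer:

import boto3

client = boto3.client("s3")
bucketName = "example-test-bucket"  # placeholder

# Each list_objects call returns at most 1000 keys; the paginator follows
# the markers so every object in the bucket is visited and deleted.
paginator = client.get_paginator("list_objects")
for page in paginator.paginate(Bucket=bucketName):
    for obj in page.get("Contents", []):
        client.delete_object(Bucket=bucketName, Key=obj["Key"])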

self.client.upload_file(Bucket=location.netloc, Key=location.relativeToPathRoot,
                        Filename=tmpFile.name)
with open(tmpFile.name, 'rb') as f:
    self.client.put_object(Bucket=location.netloc,

Collaborator

So, to point out what I said in the comments on the call-type changes in the test utils: changing from upload_file to put_object replaces the underlying high-level call through the S3Transfer class with a low-level API call.

The upload_file call will automatically handle multipart uploads and upload the file in multiple parts, whereas put_object does not. Additionally, if anyone ever wanted to implement a better, tighter upload control configuration, this would now prevent them from doing so.

This wouldn't work on GCS either because, from what I've read yesterday, GCS will save each uploaded part and then an additional key that is a concatenation of all the individual parts too.
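
If tighter upload control were ever wanted while keeping the high-level call, boto3's TransferConfig would be one option. A sketch, assuming raising multipart_threshold is acceptable (file and bucket names are placeholders; this is an alternative, not what the PR does, since the PR switches to put_object):

import boto3
from boto3.s3.transfer import TransferConfig

client = boto3.client("s3")

# Raising multipart_threshold keeps upload_file on a single PUT for files
# smaller than the threshold instead of splitting them into multipart uploads.
config = TransferConfig(multipart_threshold=1024 ** 3)  # 1 GiB, illustrative
client.upload_file(Filename="/tmp/example.bin", Bucket="example-bucket",
                   Key="example.bin", Config=config)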

Contributor Author

The multipart upload issue was indeed the motivation behind this change (or call it a workaround):

upload_file fails for large files on GCS from its
use of multipart upload.

An alternative is to disable auto multipart upload
in the bucket config.
so to specify the endpoint more easily.

Some changes are not strictly necessary in test_s3utils.py but to
- use the client more consistently
- use lowercase for bucket names because GCS does not allow uppercase
@hsinfang hsinfang merged commit 6378a8f into master May 27, 2020
@timj timj deleted the tickets/DM-24975 branch February 16, 2024 17:18