[fs] allows reading from public access buckets #14292
Conversation
Thanks for fixing this! I think there are some annoying subtleties around how we cache storage class lookups now, but otherwise it's great to see how small a PR this is. It's also worth documenting very explicitly that if we can read the default storage class of a bucket, we will not query the storage class of every object and will instead assume each object matches the bucket's default.
@@ -630,22 +635,28 @@ def schemes() -> Set[str]:
     def storage_location(self, uri: str) -> str:
         return self.get_bucket_and_name(uri)[0]

-    async def is_hot_storage(self, location: str) -> bool:
+    async def is_hot_storage(self, location: str, uri: str) -> bool:
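To make the two lookup paths behind the new signature concrete, here is a minimal, self-contained sketch. The helper names and the `SketchFS` class are hypothetical, not the real `GoogleStorageAsyncFS` API; the sketch only illustrates "prefer the bucket's default storage class, fall back to the object's own class when the bucket metadata is unreadable (e.g. a public access bucket)".

```python
# Hypothetical sketch only: SketchFS and its fields are illustrative,
# not the real GoogleStorageAsyncFS implementation.
import asyncio

HOT_CLASSES = {'STANDARD'}  # storage classes treated as "hot"

class SketchFS:
    def __init__(self, bucket_classes, object_classes):
        self._bucket_classes = bucket_classes  # bucket -> default storage class
        self._object_classes = object_classes  # uri -> per-object storage class

    async def is_hot_storage(self, location: str, uri: str) -> bool:
        # Prefer the bucket's default storage class; if the bucket metadata
        # is unreadable (e.g. a public-access bucket), fall back to the
        # storage class of the specific object named by uri.
        default = self._bucket_classes.get(location)
        if default is not None:
            return default in HOT_CLASSES
        return self._object_classes[uri] in HOT_CLASSES

fs = SketchFS({'my-bucket': 'STANDARD'}, {'gs://public/obj': 'COLDLINE'})
print(asyncio.run(fs.is_hot_storage('my-bucket', 'gs://my-bucket/x')))  # True
print(asyncio.run(fs.is_hot_storage('public', 'gs://public/obj')))      # False
```

The reviewer's question below (whether the caller must ensure `uri` is derived from `location`) is visible here: nothing in the sketch enforces that relationship.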
I think it's worth moving the added functionality here into a distinct method; adding to this method is starting to confuse this signature for me. Is it on the caller to ensure that `uri` is derived from `location`? Can the bucket be hot storage by default while the object was explicitly moved to cold storage, and what happens then?
Since the part of `_async_validate_file` that checks for hot storage and errors accordingly just calls methods on the fs, I moved the extraction of the location from the uri into this method, along with all downstream functionality. Let me know what you think; I can break it up into smaller methods if that helps.
    ]
    public_access_uri1 = f"gs://{public_access_bucket}/references/human_g1k_v37.fasta.gz"
    fs = RouterAsyncFS()
nit: need to run `fs.close` when this is done
added!
@@ -35,7 +35,7 @@ async def _async_validate_file(
     if isinstance(fs, GoogleStorageAsyncFS):
         location = fs.storage_location(uri)
         if location not in fs.allowed_storage_locations:
-            if not await fs.is_hot_storage(location):
+            if not await fs.is_hot_storage(location, uri):
I think there's a subtle bug here for the following case:
validate_file('gs://public-bucket/hot-storage-obj') # OK because the object is hot storage
validate_file('gs://public-bucket/cold-storage-obj') # doesn't error because public-bucket is cached in allowed_storage_locations
Whether we're querying the bucket or the object needs to somehow affect our caching strategy.
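The caching pitfall the reviewer describes can be reproduced in a few lines. This is a deliberately simplified mock, not the hail code: `CachingValidator` and its fields are hypothetical, and it only demonstrates that caching by bucket lets a cold object slip through after a hot object from the same bucket was validated.

```python
# Hypothetical mock of the bucket-level caching bug described above.
import asyncio

class CachingValidator:
    def __init__(self, object_class):
        self.allowed_storage_locations = []  # cache of keys already deemed hot
        self._object_class = object_class    # uri -> storage class (mocked lookup)

    async def is_hot_storage(self, location, uri):
        return self._object_class[uri] == 'STANDARD'

    async def validate_file(self, location, uri, cache_per_object=False):
        # Caching by bucket (the default here) reproduces the bug; caching
        # per object (cache_per_object=True) is one possible fix.
        key = uri if cache_per_object else location
        if key in self.allowed_storage_locations:
            return True
        ok = await self.is_hot_storage(location, uri)
        if ok:
            self.allowed_storage_locations.append(key)
        return ok

v = CachingValidator({'gs://pub/hot': 'STANDARD', 'gs://pub/cold': 'COLDLINE'})
asyncio.run(v.validate_file('pub', 'gs://pub/hot'))           # OK, caches 'pub'
print(asyncio.run(v.validate_file('pub', 'gs://pub/cold')))   # True -> the bug
```

With `cache_per_object=True` the second call correctly returns `False`, at the cost of one metadata fetch per object.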
I was hoping to avoid adding too much overhead for public access buckets by assuming that if one object in the bucket is hot storage, the rest are. That's obviously not a sound assumption in all cases, but since our current strategy also makes the unsound assumption that if the default storage class of the bucket is hot, all the objects inside will be hot storage, it seemed okay to do the same here.

I'm not really sure what the best tradeoff is in terms of performance vs. safety. We could check and cache the storage class per individual object in all cases, or just for public access buckets, or do it for each bucket's default policy and, failing that, infer that policy from the first object we check for public access buckets. Do you think public access buckets are enough of an edge case that the performance hit of checking each object is fine, or should we consider one of the other strategies?
Ah, good point. I think it is OK to assume that storage classes are uniform across all objects in a bucket so long as we very explicitly document that. I don't know if there's a sound middle ground between that and checking all the objects. The thing I most want to avoid is fetching metadata for all files in a 50k-partition dataset.
@@ -630,22 +635,28 @@ def schemes() -> Set[str]:
     def storage_location(self, uri: str) -> str:
         return self.get_bucket_and_name(uri)[0]

-    async def is_hot_storage(self, location: str) -> bool:
+    async def is_hot_storage(self, location: str, uri: str) -> bool:
What should happen if `uri` is a directory?
I'm not actually sure; is there a use case for opening a directory instead of an object from a public access bucket? When writing the tests I tried using a directory in a public access bucket for the remote tmpdir of a `ServiceBackend`; it errored out because the directory is not itself an object, but I initially wrote that off as unimportant because I don't think there's a situation where a user would actually want to do that with a public access bucket. I'm not really sure what we can check if we just have a directory; if there's a way for us to list the objects in it, I guess we could just check the first one we see? What are your thoughts?
Ya, I mean "directory" here loosely, in the sense that object stores try to look like actual filesystems. Indeed there does not appear to be an obviously correct solution in this case, but unfortunately I think directories are going to be a common use case. Since hail datasets are partitioned across multiple files, any `.ht` or `.mt` is actually a directory. See for example the gnomad public data: gs://gcp-public-data--gnomad/release/4.0/ht/genomes/gnomad.genomes.v4.0.sites.ht.

> if there's a way for us to list objects in it i guess we could just check the first one we see?

Regardless, I think this is a very reasonable solution. If someone gives us a directory, it is most often a directory of data that belong together, like a Hail dataset. I am fine expecting those all to be the same storage class.
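The "check the first object under the prefix" idea agreed on above can be sketched as follows. The listing function is mocked here; in the real fs it would page through a GCS list-objects call. All names are illustrative, not the actual hail API.

```python
# Hypothetical sketch: sample the first listed object's storage class and
# assume it is representative of the whole "directory" (e.g. a .ht or .mt).
import asyncio

async def first_object_storage_class(list_objects, prefix):
    # list_objects(prefix) is assumed to yield (name, storage_class) pairs.
    async for name, storage_class in list_objects(prefix):
        return storage_class
    raise FileNotFoundError(prefix)

async def fake_listing(prefix):
    # Mocked listing of a partitioned dataset directory.
    for entry in [('part-00000', 'STANDARD'), ('part-00001', 'STANDARD')]:
        yield entry

print(asyncio.run(first_object_storage_class(fake_listing, 'gs://bkt/table.ht/')))
# STANDARD
```

This matches the tradeoff discussed above: one metadata fetch per directory instead of one per partition file.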
updated!
    async def object_info(self, bucket: str, name: str) -> Dict[str, Any]:
        kwargs: Dict[str, Any] = {}
        self._update_params_with_user_project(kwargs, bucket)
        return await self.get(f'/b/{bucket}/o/{urllib.parse.quote(name, safe="")}', **kwargs)
I think you can use `statfile` and add a `storage_class` field on `GetObjectFileStatus`.
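The suggested shape, exposing the per-object storage class on the file-status object instead of a separate `object_info` call, could look roughly like this. The field layout is a guess based on the GCS `objects.get` JSON metadata (which carries a `storageClass` key); the class body here is a sketch, not the real hail `GetObjectFileStatus`.

```python
# Hypothetical sketch of a storage_class accessor on a file-status object.
class GetObjectFileStatus:
    def __init__(self, items):
        self._items = items  # raw JSON object metadata from GCS objects.get

    def size(self) -> int:
        return int(self._items['size'])

    def storage_class(self) -> str:
        # GCS object metadata reports the per-object storage class here.
        return self._items['storageClass']

status = GetObjectFileStatus({'size': '123', 'storageClass': 'COLDLINE'})
print(status.storage_class())  # COLDLINE
```

This keeps the hot-storage check on the same code path as any other `statfile` call, rather than adding a second metadata round trip.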
updated!
Left comments in previous discussions.
This is a great improvement, just one last request.
                error = next_error
            raise error
        self.allowed_storage_locations.append(location)
        return is_hot_storage
Ya, I think that when it comes to readability I care a bit less about the duplicated method call than about the interpretation of the signature. I find it surprising that a method called `is_hot_storage` would side-effect, and despite being `-> bool` it looks from this implementation more like `True | ValueError`, which I find harder to reason about. I think this revision looks great; my only request would be to move that `allowed_storage_locations` check, along with the big error message, up into the `validate_file` function.
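The requested split can be sketched as follows: `is_hot_storage` stays a pure boolean query with no side effects, while the caller owns both the cache and the user-facing error. All names here are illustrative stand-ins for the hail code, not the actual implementation.

```python
# Hypothetical sketch of the requested refactor: pure query on the fs,
# caching and error reporting in validate_file.
import asyncio

class FS:
    def __init__(self, hot):
        self.allowed_storage_locations = []
        self._hot = hot  # location -> bool (mocked storage-class lookup)

    async def is_hot_storage(self, location, uri):
        # Pure query: no caching, no raising -- it just answers the question.
        return self._hot[location]

async def validate_file(fs, uri, location):
    if location in fs.allowed_storage_locations:
        return
    if not await fs.is_hot_storage(location, uri):
        # The big user-facing error message lives at the call site now.
        raise ValueError(f'{uri} is not in hot storage')
    fs.allowed_storage_locations.append(location)

fs = FS({'hot-bkt': True, 'cold-bkt': False})
asyncio.run(validate_file(fs, 'gs://hot-bkt/x', 'hot-bkt'))
print(fs.allowed_storage_locations)  # ['hot-bkt']
```

With this shape, `is_hot_storage`'s `-> bool` annotation is honest, and the cache mutation is visible in exactly one place.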
Closes #14291.