[fs] allows reading from public access buckets#14292
hail-ci-robot merged 41 commits into hail-is:main
Conversation
Thanks for fixing this! I think there's some annoying subtleties around how we cache storage class lookups now but otherwise it's great to see how small a PR this is. I think it's also worth very explicitly documenting how if we can read the default storage class of a bucket we will not query the storage class of every object and assume that it matches that of the bucket.
```diff
         return self.get_bucket_and_name(uri)[0]

-    async def is_hot_storage(self, location: str) -> bool:
+    async def is_hot_storage(self, location: str, uri: str) -> bool:
```
I think it's worth moving the added functionality here into a distinct method. Adding to this method is starting to confuse this signature for me. Is it on the caller to ensure that uri is derived from location? Can the bucket be hot storage by default but the object was explicitly moved to cold storage / what happens then?
Since the part of `_async_validate_file` that checks for hot storage and errors accordingly just calls methods on the fs, I moved the extraction of the location from the uri into this method, along with all downstream functionality. Let me know what you think; I can break it up into smaller methods if that helps.
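A minimal sketch of that refactor shape (all names here are hypothetical stand-ins, not the real `RouterAsyncFS` API): the location extraction lives inside the method, so callers pass only the uri and cannot hand in a mismatched `(location, uri)` pair.

```python
import asyncio

# Hypothetical stand-in for the fs discussed above; only the refactor
# shape is modeled, not the real RouterAsyncFS behavior.
class FakeFS:
    def __init__(self, hot_buckets):
        self._hot_buckets = hot_buckets

    def storage_location(self, uri: str) -> str:
        # e.g. 'gs://bucket/path/obj' -> 'bucket'
        return uri.removeprefix('gs://').split('/', 1)[0]

    async def is_hot_storage(self, uri: str) -> bool:
        # Location extraction happens inside the method, so the caller
        # no longer has to guarantee that uri is derived from location.
        location = self.storage_location(uri)
        return location in self._hot_buckets

fs = FakeFS(hot_buckets={'hot-bucket'})
assert asyncio.run(fs.is_hot_storage('gs://hot-bucket/obj'))
assert not asyncio.run(fs.is_hot_storage('gs://cold-bucket/obj'))
```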
```python
        for bucket in [fake_bucket1, fake_bucket2, public_access_bucket, no_perms_bucket, cold_bucket]
    ]
    public_access_uri1 = f"gs://{public_access_bucket}/references/human_g1k_v37.fasta.gz"
    fs = RouterAsyncFS()
```
nit: need to call `fs.close` when this is done
```diff
         location = fs.storage_location(uri)
         if location not in fs.allowed_storage_locations:
-            if not await fs.is_hot_storage(location):
+            if not await fs.is_hot_storage(location, uri):
```
I think there's a subtle bug here for the following case:
```python
validate_file('gs://public-bucket/hot-storage-obj')   # OK because the object is hot storage
validate_file('gs://public-bucket/cold-storage-obj')  # doesn't error because public-bucket is cached in allowed_storage_locations
```
Whether we're querying the bucket or the object needs to somehow affect our caching strategy.
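A toy reproduction of that hazard, with hypothetical names standing in for the real fs: because the cache is keyed on the bucket while the storage class is a property of the object, the second call never inspects the cold object.

```python
import asyncio

# Toy model: storage class is per-object, but the cache is per-bucket.
STORAGE_CLASS = {
    'gs://public-bucket/hot-storage-obj': 'STANDARD',
    'gs://public-bucket/cold-storage-obj': 'ARCHIVE',
}
allowed_storage_locations = set()

def storage_location(uri):
    return uri.removeprefix('gs://').split('/', 1)[0]

async def is_hot_storage(location, uri):
    return STORAGE_CLASS[uri] == 'STANDARD'

async def validate_file(uri):
    location = storage_location(uri)
    if location in allowed_storage_locations:
        return True  # cache hit: the object itself is never inspected
    if not await is_hot_storage(location, uri):
        raise ValueError(f'{uri} is not hot storage')
    allowed_storage_locations.add(location)
    return True

asyncio.run(validate_file('gs://public-bucket/hot-storage-obj'))
# The cold object now slips through because the whole bucket was cached:
assert asyncio.run(validate_file('gs://public-bucket/cold-storage-obj'))
```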
I was hoping to avoid adding too much overhead for public access buckets by assuming that if one object in the bucket is hot storage, the rest are. That's obviously not a sound assumption in all cases, but since our current strategy also makes the unsound assumption that if the default storage class of the bucket is hot, the objects inside will all be hot storage, it seemed okay to me to do this here.
I'm not really sure what the best tradeoff is in terms of performance vs. safety. We could check and cache the storage class per individual object in all cases, or just for public access buckets, or do it for each bucket's default policy and, failing that, infer that policy from the first object we check for public access buckets. Do you think public access buckets are enough of an edge case that the performance hit for checking each object should be fine, or should we consider one of the other strategies?
Ah good point, I think it is OK to assume that storage classes are uniform across all objects in a bucket so long as we very explicitly document that. I don't know if there's a sound middle ground between that and checking all the objects. The thing I most want to avoid is fetching metadata for all files in a 50k partition dataset.
```diff
         return self.get_bucket_and_name(uri)[0]

-    async def is_hot_storage(self, location: str) -> bool:
+    async def is_hot_storage(self, location: str, uri: str) -> bool:
```
What should happen if uri is a directory?
I'm not actually sure; is there a use case for opening a directory instead of an object from a public access bucket? When writing the tests I tried using a directory in a public access bucket for the remote tmpdir of a ServiceBackend; it errored out because the directory is not itself an object, but I wrote that off as unimportant initially because I don't think there's a situation where a user would actually want to do that with a public access bucket. I'm not really sure what we can check if we just have a directory; if there's a way for us to list objects in it, I guess we could just check the first one we see? What are your thoughts?
Ya, I mean "directory" here loosely in the sense that object stores try to look like actual filesystems. Indeed there does not appear to be an obviously correct solution in this case, but unfortunately I think that directories are going to be a common use case. Since Hail datasets are partitioned across multiple files, any .ht or .mt is actually a directory. You can see for example the gnomAD public data: `gs://gcp-public-data--gnomad/release/4.0/ht/genomes/gnomad.genomes.v4.0.sites.ht`.
> if there's a way for us to list objects in it i guess we could just check the first one we see?
Regardless though, I think this is a very reasonable solution. If someone gives us a directory it is most often a directory of data that belong together, like a Hail dataset. I am fine expecting those all to be the same storage type.
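Under that "check the first object we see" strategy, the directory case could look roughly like this (the listing API and every name here are hypothetical stand-ins, not the real `RouterAsyncFS` interface):

```python
import asyncio
from typing import AsyncIterator

# Hypothetical listing: maps a directory prefix to the storage classes
# of the objects under it.
FAKE_LISTING = {
    'gs://bucket/dataset.ht/': ['STANDARD', 'STANDARD', 'STANDARD'],
}

async def list_storage_classes(dir_uri: str) -> AsyncIterator[str]:
    for storage_class in FAKE_LISTING[dir_uri]:
        yield storage_class

async def dir_is_hot_storage(dir_uri: str) -> bool:
    # Assume all files in a dataset directory share a storage class,
    # so inspecting the first listed object is enough.
    async for storage_class in list_storage_classes(dir_uri):
        return storage_class == 'STANDARD'
    raise FileNotFoundError(dir_uri)  # empty "directory": nothing to check

assert asyncio.run(dir_is_hot_storage('gs://bucket/dataset.ht/'))
```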
```python
    async def object_info(self, bucket: str, name: str) -> Dict[str, Any]:
        kwargs: Dict[str, Any] = {}
        self._update_params_with_user_project(kwargs, bucket)
        return await self.get(f'/b/{bucket}/o/{urllib.parse.quote(name, safe="")}', **kwargs)
```
I think you can use `statfile` and add a `storage_class` field on `GetObjectFileStatus`
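If the object metadata returned by `statfile` is kept around, exposing the field could be as small as this sketch (class and attribute names are hypothetical; the GCS object resource does carry a `storageClass` property):

```python
from typing import Any, Dict

# Hypothetical file-status wrapper around the JSON object resource
# returned by the GCS API; only the storage_class accessor is new.
class GetObjectFileStatus:
    def __init__(self, items: Dict[str, Any]):
        self._items = items

    def storage_class(self) -> str:
        # 'storageClass' is part of the GCS object resource,
        # e.g. 'STANDARD', 'NEARLINE', 'COLDLINE', or 'ARCHIVE'.
        return self._items['storageClass']

status = GetObjectFileStatus({'name': 'obj', 'storageClass': 'COLDLINE'})
assert status.storage_class() == 'COLDLINE'
```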
daniel-goldstein left a comment:
left comments in previous discussions
daniel-goldstein left a comment:
This is a great improvement, just one last request.
```python
                error = next_error
            raise error
        self.allowed_storage_locations.append(location)
        return is_hot_storage
```
Ya, I think that when it comes to readability I care a bit less about the duplicated method call than about the interpretation of the signature. I find it surprising that a method called `is_hot_storage` would side-effect, and despite being `-> bool` it looks from this implementation like it's more `True | ValueError`, which I find harder to reason about. I think this revision looks great; my only request would be to move that `allowed_storage_locations` check, along with the big error message, up into the `validate_file` function.
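The requested shape, sketched with hypothetical names: `is_hot_storage` stays a pure boolean query, while the cache update and the error live in `validate_file`.

```python
import asyncio

class FakeFS:
    # Hypothetical stand-in for RouterAsyncFS; only the control flow
    # discussed in this thread is modeled.
    def __init__(self, hot_locations):
        self._hot = hot_locations
        self.allowed_storage_locations = []

    def storage_location(self, uri):
        return uri.removeprefix('gs://').split('/', 1)[0]

    async def is_hot_storage(self, location, uri):
        # Pure query: no side effects, no raising, just True/False.
        return location in self._hot

async def validate_file(uri, fs):
    location = fs.storage_location(uri)
    if location not in fs.allowed_storage_locations:
        if not await fs.is_hot_storage(location, uri):
            raise ValueError(f'{uri} is not in hot storage')
        # Caching moved up here, next to the check that consults it.
        fs.allowed_storage_locations.append(location)

fs = FakeFS(hot_locations={'hot-bucket'})
asyncio.run(validate_file('gs://hot-bucket/obj', fs))
assert fs.allowed_storage_locations == ['hot-bucket']
```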
Closes #14291.