
[fs] allows reading from public access buckets #14292

Merged
hail-ci-robot merged 41 commits into hail-is:main from iris-garden:cold-storage/public-access on Jul 29, 2024

Conversation

@iris-garden (Contributor) commented Feb 14, 2024

Closes #14291.

@iris-garden iris-garden force-pushed the cold-storage/public-access branch 2 times, most recently from 5f76a8f to 94e2c48 Compare February 15, 2024 16:26
@iris-garden iris-garden force-pushed the cold-storage/public-access branch from b6d2286 to 94ce6dc Compare February 23, 2024 15:51
@daniel-goldstein (Contributor) left a comment:

Thanks for fixing this! I think there are some annoying subtleties around how we cache storage class lookups now, but otherwise it's great to see how small a PR this is. I also think it's worth documenting very explicitly that if we can read the default storage class of a bucket, we will not query the storage class of every object and will instead assume that each object's class matches the bucket's.

return self.get_bucket_and_name(uri)[0]

async def is_hot_storage(self, location: str) -> bool:
async def is_hot_storage(self, location: str, uri: str) -> bool:
Contributor:

I think it's worth moving the added functionality here into a distinct method. Adding to this method is starting to make the signature confusing to me. Is it on the caller to ensure that uri is derived from location? Can the bucket be hot storage by default while the object was explicitly moved to cold storage, and what happens then?

Contributor Author:

Since the part of _async_validate_file that checks for hot storage and errors accordingly just calls methods on the fs, I moved the extraction of the location from the uri into this method, along with all downstream functionality. Let me know what you think; I can break it up into smaller methods if that helps.
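For concreteness, the reshuffled shape being described might look like this minimal sketch. All names, helpers, and storage-class values here are illustrative stand-ins, not Hail's actual implementation:

```python
import asyncio

# Hypothetical sketch: the fs extracts the location (the bucket) from the uri
# itself and carries the downstream hot-storage check, as described above.
class SketchFS:
    HOT_CLASSES = {"STANDARD", "MULTI_REGIONAL", "REGIONAL"}

    def storage_location(self, uri: str) -> str:
        # For a gs:// uri, the storage location is the bucket name.
        return uri.removeprefix("gs://").split("/", 1)[0]

    async def _bucket_default_class(self, bucket: str) -> str:
        # Stand-in for a real bucket-metadata lookup.
        return "STANDARD"

    async def is_hot_storage(self, location: str, uri: str) -> bool:
        storage_class = await self._bucket_default_class(location)
        return storage_class in self.HOT_CLASSES

    async def validate_file(self, uri: str) -> None:
        location = self.storage_location(uri)
        if not await self.is_hot_storage(location, uri):
            raise ValueError(f"{uri} is in cold storage")
```

Folding the uri into the method keeps the caller simple, at the cost of the two-argument signature debated above.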

for bucket in [fake_bucket1, fake_bucket2, public_access_bucket, no_perms_bucket, cold_bucket]
]
public_access_uri1 = f"gs://{public_access_bucket}/references/human_g1k_v37.fasta.gz"
fs = RouterAsyncFS()
Contributor:

nit: need to run fs.close when this is done

Contributor Author:

added!

location = fs.storage_location(uri)
if location not in fs.allowed_storage_locations:
if not await fs.is_hot_storage(location):
if not await fs.is_hot_storage(location, uri):
Contributor:

I think there's a subtle bug here for the following case:

validate_file('gs://public-bucket/hot-storage-obj')  # OK because the object is hot storage
validate_file('gs://public-bucket/cold-storage-obj')  # doesn't error because public-bucket is cached in allowed_storage_locations

Whether we're querying the bucket or the object needs to somehow affect our caching strategy.

Contributor Author:

I was hoping to avoid adding too much overhead for public access buckets by assuming that if one object in the bucket is hot storage, the rest are. That's obviously not a sound assumption in all cases, but since our current strategy also makes the unsound assumption that if the bucket's default storage class is hot, all the objects inside will be hot, it seemed okay to do the same here.

I'm not really sure what the best tradeoff is in terms of performance vs. safety. We could check and cache the storage class per individual object in all cases, or do that just for public access buckets, or check each bucket's default policy and, failing that, infer the policy from the first object we check in a public access bucket. Do you think public access buckets are enough of an edge case that the performance hit of checking each object is fine, or should we consider one of the other strategies?

Contributor:

Ah, good point. I think it is OK to assume that storage classes are uniform across all objects in a bucket, so long as we very explicitly document that. I don't know if there's a sound middle ground between that and checking all the objects. The thing I most want to avoid is fetching metadata for all files in a 50k-partition dataset.
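Under that agreement, the lookup strategy might be sketched as follows: prefer the bucket's default class, fall back to a single object's class when the bucket metadata is unreadable, and cache one answer per bucket under the documented uniformity assumption. All names here are illustrative:

```python
import asyncio

# Illustrative sketch: try the bucket's default storage class first; if the
# bucket metadata is unreadable (as with a public-access bucket), fall back to
# one object's class, caching a single answer per bucket on the documented
# assumption that storage classes are uniform within a bucket.
class StrategySketch:
    def __init__(self) -> None:
        self._bucket_is_hot: dict[str, bool] = {}

    async def _bucket_default_class(self, bucket: str) -> str:
        # Simulate a public-access bucket whose metadata we cannot read.
        raise PermissionError(bucket)

    async def _object_class(self, bucket: str, name: str) -> str:
        # Stand-in for a per-object metadata lookup.
        return "STANDARD"

    async def is_hot_storage(self, bucket: str, name: str) -> bool:
        if bucket in self._bucket_is_hot:
            return self._bucket_is_hot[bucket]
        try:
            storage_class = await self._bucket_default_class(bucket)
        except PermissionError:
            storage_class = await self._object_class(bucket, name)
        hot = storage_class == "STANDARD"
        self._bucket_is_hot[bucket] = hot
        return hot
```

This avoids the per-file metadata fetch for large partitioned datasets at the cost of the uniformity assumption discussed above.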

return self.get_bucket_and_name(uri)[0]

async def is_hot_storage(self, location: str) -> bool:
async def is_hot_storage(self, location: str, uri: str) -> bool:
Contributor:

What should happen if uri is a directory?

Contributor Author:

I'm not actually sure; is there a use case for opening a directory instead of an object from a public access bucket? When writing the tests, I tried using a directory in a public access bucket as the remote tmpdir of a ServiceBackend; it errored out because the directory is not itself an object, but I initially wrote that off as unimportant because I don't think there's a situation where a user would actually want to do that with a public access bucket. I'm not really sure what we can check if we just have a directory; if there's a way for us to list the objects in it, I guess we could just check the first one we see? What are your thoughts?

Contributor:

Ya, I mean "directory" here loosely, in the sense that object stores try to look like actual filesystems. Indeed there does not appear to be an obviously correct solution in this case, but unfortunately I think directories are going to be a common use case. Since Hail datasets are partitioned across multiple files, any .ht or .mt is actually a directory. See, for example, the gnomAD public data: gs://gcp-public-data--gnomad/release/4.0/ht/genomes/gnomad.genomes.v4.0.sites.ht.

if there's a way for us to list objects in it i guess we could just check the first one we see?

Regardless, I think this is a very reasonable solution. If someone gives us a directory, it is most often a directory of data that belong together, like a Hail dataset. I am fine with expecting those all to have the same storage type.

Contributor Author:

updated!
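The "check the first object we list" fallback for directories could be sketched like this. The listing and metadata helpers are stand-ins, not Hail's actual API:

```python
import asyncio
from typing import AsyncIterator

# Illustrative sketch of the directory fallback: list entries under the
# prefix and use the first object's storage class as a proxy for the whole
# dataset (e.g. a partitioned .ht or .mt directory).
async def list_objects(prefix: str) -> AsyncIterator[str]:
    # Stand-in for a real object-listing call.
    for name in ("part-000", "part-001"):
        yield f"{prefix.rstrip('/')}/{name}"

async def storage_class_of(uri: str) -> str:
    # Stand-in for a per-object metadata lookup.
    return "STANDARD"

async def first_object_storage_class(prefix: str) -> str:
    async for obj_uri in list_objects(prefix):
        return await storage_class_of(obj_uri)
    raise FileNotFoundError(f"no objects found under {prefix}")
```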

async def object_info(self, bucket: str, name: str) -> Dict[str, Any]:
kwargs: Dict[str, Any] = {}
self._update_params_with_user_project(kwargs, bucket)
return await self.get(f'/b/{bucket}/o/{urllib.parse.quote(name, safe="")}', **kwargs)
Contributor:

I think you can use statfile and add a storage_class field on GetObjectFileStatus

Contributor Author:

updated!
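The suggested shape, exposing the storage class through the file status rather than a separate metadata call, might look roughly like this. The "storageClass" key follows the GCS objects resource, but the class here is a simplified stand-in for Hail's GetObjectFileStatus:

```python
from typing import Any, Dict

# Illustrative sketch: surface the storage class via the file-status object
# returned by a statfile-style call, instead of a second metadata request.
class GetObjectFileStatus:
    def __init__(self, items: Dict[str, Any]):
        self._items = items

    def size(self) -> int:
        return int(self._items['size'])

    def storage_class(self) -> str:
        # 'storageClass' is the field name used by the GCS objects resource.
        return self._items['storageClass']
```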

@iris-garden iris-garden force-pushed the cold-storage/public-access branch 2 times, most recently from 5ff4174 to 3d55bad Compare February 23, 2024 17:56
@iris-garden iris-garden force-pushed the cold-storage/public-access branch 2 times, most recently from 9317f5e to 78aed73 Compare March 4, 2024 17:38
@daniel-goldstein (Contributor) left a comment:

left comments in previous discussions

@iris-garden iris-garden force-pushed the cold-storage/public-access branch 8 times, most recently from e660584 to de61786 Compare March 11, 2024 17:06
@iris-garden iris-garden force-pushed the cold-storage/public-access branch from b8a26ae to b7bb2b3 Compare March 13, 2024 14:35
@daniel-goldstein (Contributor) left a comment:

This is a great improvement; just one last request.

error = next_error
raise error
self.allowed_storage_locations.append(location)
return is_hot_storage
Contributor:

Ya, I think when it comes to readability I care a bit less about the duplicated method call than about the interpretation of the signature. I find it surprising that a method called is_hot_storage would have side effects, and despite being -> bool, this implementation behaves more like True | ValueError, which I find harder to reason about. I think this revision looks great; my only request would be to move that allowed_storage_locations check, along with the big error message, up into the validate_file function.
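The requested split, with is_hot_storage as a pure predicate and the cache plus the user-facing error living in validate_file, might look like this sketch (names are simplified stand-ins for the real fs):

```python
import asyncio

# Illustrative sketch: is_hot_storage is side-effect-free and strictly -> bool;
# validate_file owns the allowed_storage_locations cache and the error message.
class FS:
    def __init__(self) -> None:
        self.allowed_storage_locations: list[str] = []

    def storage_location(self, uri: str) -> str:
        return uri.removeprefix("gs://").split("/", 1)[0]

    async def is_hot_storage(self, location: str, uri: str) -> bool:
        # Stand-in for the real storage-class lookup; no caching, no raising.
        return True

async def validate_file(uri: str, fs: FS) -> None:
    location = fs.storage_location(uri)
    if location in fs.allowed_storage_locations:
        return
    if not await fs.is_hot_storage(location, uri):
        raise ValueError(f"{uri} is in cold storage and cannot be read")
    fs.allowed_storage_locations.append(location)
```

Keeping the predicate pure makes its signature match its name, and concentrates the caching policy in one place.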

@daniel-goldstein daniel-goldstein removed their assignment Jul 22, 2024
@hail-ci-robot hail-ci-robot merged commit ab730f1 into hail-is:main Jul 29, 2024


Development

Successfully merging this pull request may close these issues.

[query] validation system needs to gracefully handle public access buckets

3 participants