[fs] allows reading from public access buckets #14292

Merged: 41 commits from the cold-storage/public-access branch into hail-is:main on Jul 29, 2024

Conversation

@iris-garden (Collaborator) commented on Feb 14, 2024:

Closes #14291.

@daniel-goldstein (Contributor) left a comment:

Thanks for fixing this! I think there are some annoying subtleties around how we cache storage class lookups now, but otherwise it's great to see how small a PR this is. I also think it's worth very explicitly documenting that if we can read the default storage class of a bucket, we will not query the storage class of every object and will assume that each object's class matches the bucket's.
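For illustration, the kind of explicit documentation being requested might look something like the following docstring sketch (hypothetical wording, not the text that ultimately landed in the PR):

```python
async def is_hot_storage(self, location: str, uri: str) -> bool:
    """Report whether ``uri`` is stored in a hot (standard) storage class.

    Storage classes are assumed to be uniform within a bucket: if the bucket's
    default storage class is readable, it is used for every object in the
    bucket and individual objects are never inspected. Only when the default
    cannot be read (e.g. for public access buckets) is a single object's
    metadata checked, and that object is then taken as representative of the
    whole bucket.
    """
    ...
```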

```diff
@@ -630,22 +635,28 @@ def schemes() -> Set[str]:
     def storage_location(self, uri: str) -> str:
         return self.get_bucket_and_name(uri)[0]

-    async def is_hot_storage(self, location: str) -> bool:
+    async def is_hot_storage(self, location: str, uri: str) -> bool:
```
@daniel-goldstein (Contributor):

I think it's worth moving the added functionality here into a distinct method. Adding to this method is starting to make the signature confusing for me. Is it on the caller to ensure that uri is derived from location? Can the bucket be hot storage by default while the object was explicitly moved to cold storage, and what happens then?

@iris-garden (Collaborator, author):

Since the part of _async_validate_file that checks for hot storage and errors accordingly just calls methods on the fs, I moved the extraction of the location from the URI into this method, along with all downstream functionality. Let me know what you think; I can break it up into smaller methods if that helps.
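As a loose sketch of the shape being discussed (shown as a method body out of its class context; get_bucket_info and the exception type are hypothetical, while get_bucket_and_name and object_info do appear in the diff):

```python
from typing import Set

# Which GCS storage classes count as "hot" is an assumption made for this sketch.
HOT_STORAGE_CLASSES: Set[str] = {'STANDARD', 'MULTI_REGIONAL', 'REGIONAL'}

async def is_hot_storage(self, location: str, uri: str) -> bool:
    # Prefer the bucket's default storage class when we're allowed to read it.
    try:
        bucket_info = await self.get_bucket_info(location)  # hypothetical helper
        return bucket_info['storageClass'] in HOT_STORAGE_CLASSES
    except PermissionError:  # stand-in for whatever a forbidden bucket lookup raises
        # Public access buckets often don't let us read bucket metadata, so fall
        # back to inspecting the object named by the URI itself.
        bucket, name = self.get_bucket_and_name(uri)
        object_info = await self.object_info(bucket, name)
        return object_info['storageClass'] in HOT_STORAGE_CLASSES
```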

```python
]
public_access_uri1 = f"gs://{public_access_bucket}/references/human_g1k_v37.fasta.gz"
fs = RouterAsyncFS()
```
@daniel-goldstein (Contributor):

nit: need to run fs.close when this is done
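A minimal way to address the nit, assuming RouterAsyncFS.close is awaitable and using isfile purely as a stand-in for whatever the test actually exercises (import path shown from memory):

```python
from hailtop.aiotools.router_fs import RouterAsyncFS

async def check_public_object(uri: str) -> bool:
    fs = RouterAsyncFS()
    try:
        # ... whatever the test needs to do with the filesystem ...
        return await fs.isfile(uri)
    finally:
        # release the underlying client sessions when the test is done
        await fs.close()
```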

@iris-garden (Collaborator, author):

Added!

```diff
@@ -35,7 +35,7 @@ async def _async_validate_file(
     if isinstance(fs, GoogleStorageAsyncFS):
         location = fs.storage_location(uri)
         if location not in fs.allowed_storage_locations:
-            if not await fs.is_hot_storage(location):
+            if not await fs.is_hot_storage(location, uri):
```
@daniel-goldstein (Contributor):

I think there's a subtle bug here for the following case:

```python
validate_file('gs://public-bucket/hot-storage-obj')   # OK because the object is hot storage
validate_file('gs://public-bucket/cold-storage-obj')  # doesn't error because public-bucket is cached in allowed_storage_locations
```

Whether we're querying the bucket or the object needs to somehow affect our caching strategy.
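One way to let the cache reflect whether a decision came from a bucket-wide default or from a single object's metadata (a sketch; is_hot_storage_with_scope is a hypothetical variant, not part of this PR):

```python
from typing import List


class HotStorageValidator:
    """Only caches a location when the hot-storage decision applies bucket-wide."""

    def __init__(self):
        self.allowed_storage_locations: List[str] = []

    async def validate(self, fs, uri: str) -> None:
        location = fs.storage_location(uri)
        if location in self.allowed_storage_locations:
            return
        # Hypothetical variant of is_hot_storage that also reports whether the
        # answer was derived from the bucket's default storage class (True) or
        # from the metadata of this one object (False).
        is_hot, bucket_wide = await fs.is_hot_storage_with_scope(location, uri)
        if not is_hot:
            raise ValueError(f'{uri} is not in hot storage')
        if bucket_wide:
            # Caching an object-level answer would hide cold objects elsewhere
            # in the same public bucket, which is exactly the bug described above.
            self.allowed_storage_locations.append(location)
```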

@iris-garden (Collaborator, author):

I was hoping to avoid adding too much overhead for public access buckets by assuming that if one object in the bucket is hot storage, the rest are. That's obviously not a sound assumption in all cases, but since our current strategy already makes the unsound assumption that if the default storage class of the bucket is hot, all of the objects inside will be hot storage, it seemed okay to me to do the same here.

I'm not really sure what the best tradeoff between performance and safety would be. We could check and cache the storage class of each individual object in all cases, do so just for public access buckets, or check each bucket's default policy and, failing that, infer the policy from the first object we check in a public access bucket. Do you think public access buckets are enough of an edge case that the performance hit of checking each object is fine, or should we consider one of the other strategies?

@daniel-goldstein (Contributor):

Ah, good point. I think it is OK to assume that storage classes are uniform across all objects in a bucket so long as we very explicitly document that. I don't know if there's a sound middle ground between that and checking all the objects. The thing I most want to avoid is fetching metadata for all files in a 50k-partition dataset.

```diff
@@ -630,22 +635,28 @@ def schemes() -> Set[str]:
     def storage_location(self, uri: str) -> str:
         return self.get_bucket_and_name(uri)[0]

-    async def is_hot_storage(self, location: str) -> bool:
+    async def is_hot_storage(self, location: str, uri: str) -> bool:
```
@daniel-goldstein (Contributor):

What should happen if uri is a directory?

@iris-garden (Collaborator, author):

I'm not actually sure; is there a use case for opening a directory instead of an object from a public access bucket? When writing the tests I tried using a directory in a public access bucket as the remote tmpdir of a ServiceBackend, and it errored out because the directory is not itself an object, but I initially wrote that off as unimportant because I don't think there's a situation where a user would actually want to do that with a public access bucket. I'm also not sure what we can check if we just have a directory; if there's a way for us to list the objects in it, I guess we could just check the first one we see? What are your thoughts?

@daniel-goldstein (Contributor):

Ya, I mean "directory" here loosely, in the sense that object stores try to look like actual filesystems. Indeed, there does not appear to be an obviously correct solution in this case, but unfortunately I think directories are going to be a common use case. Since Hail datasets are partitioned across multiple files, any .ht or .mt is actually a directory. You can see, for example, the gnomAD public data: gs://gcp-public-data--gnomad/release/4.0/ht/genomes/gnomad.genomes.v4.0.sites.ht.

> if there's a way for us to list objects in it i guess we could just check the first one we see?

Regardless, I think this is a very reasonable solution. If someone gives us a directory, it is most often a directory of data that belongs together, like a Hail dataset. I am fine with expecting those all to be the same storage type.
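The "check the first object we see" idea might look roughly like this; isfile and list_object_uris stand in for whatever listing API the filesystem actually exposes, so treat the names as illustrative:

```python
async def representative_object_uri(fs, uri: str) -> str:
    """Return ``uri`` if it names an object, otherwise the first object found
    under it when it is treated as a directory-like prefix.

    ``fs.isfile`` and ``fs.list_object_uris`` are stand-ins, not real API calls.
    """
    if await fs.isfile(uri):
        return uri
    # Hail tables (.ht) and matrix tables (.mt) are directories containing many
    # files, so pick one file and treat its storage class as representative of
    # the whole dataset.
    async for object_uri in fs.list_object_uris(uri):
        return object_uri
    raise FileNotFoundError(uri)
```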

@iris-garden (Collaborator, author):

Updated!

```python
    async def object_info(self, bucket: str, name: str) -> Dict[str, Any]:
        # GCS JSON API objects.get; the returned metadata includes the object's storageClass.
        kwargs: Dict[str, Any] = {}
        self._update_params_with_user_project(kwargs, bucket)
        return await self.get(f'/b/{bucket}/o/{urllib.parse.quote(name, safe="")}', **kwargs)
```
@daniel-goldstein (Contributor):

I think you can use statfile and add a storage_class field on GetObjectFileStatus
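A rough sketch of that suggestion, assuming GetObjectFileStatus wraps the raw object metadata returned by the GCS JSON API (whose objects resource includes a storageClass field); the accessor name and the exact wiring are guesses, not the merged code:

```python
from typing import Any, Dict


class GetObjectFileStatus:
    def __init__(self, items: Dict[str, Any]):
        self._items = items

    def storage_class(self) -> str:
        # 'storageClass' comes straight from the GCS objects resource,
        # e.g. 'STANDARD', 'NEARLINE', 'COLDLINE', or 'ARCHIVE'.
        return self._items['storageClass']


# Caller side: stat the file instead of issuing a bespoke objects.get request.
#   status = await fs.statfile(uri)
#   is_hot = status.storage_class() == 'STANDARD'
```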

@iris-garden (Collaborator, author):

Updated!

@iris-garden force-pushed the cold-storage/public-access branch 2 times, most recently from 5ff4174 to 3d55bad on February 23, 2024 at 17:56.
@iris-garden force-pushed the cold-storage/public-access branch 2 times, most recently from 9317f5e to 78aed73 on March 4, 2024 at 17:38.
@daniel-goldstein (Contributor) left a comment:

Left comments in the previous discussions.

@iris-garden force-pushed the cold-storage/public-access branch 8 times, most recently from e660584 to de61786 on March 11, 2024 at 17:06.
@daniel-goldstein (Contributor) left a comment:

This is a great improvement; just one last request.

```python
error = next_error
raise error
self.allowed_storage_locations.append(location)
return is_hot_storage
```
@daniel-goldstein (Contributor):

Ya, I think that when it comes to readability I care a bit less about the duplicated method call than I do about the interpretation of the signature. I find it surprising that a method called is_hot_storage would have side effects, and despite being declared -> bool, this implementation makes it behave more like True | ValueError, which I find harder to reason about. I think this revision looks great; my only request would be to move the allowed_storage_locations check, along with the big error message, up into the validate_file function.
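Roughly the shape being requested, as a sketch (the error message and exception type are placeholders, and the GoogleStorageAsyncFS isinstance check from the original diff is elided):

```python
async def _async_validate_file(uri: str, fs) -> None:
    # The cache check and the user-facing error live at the call site, so that
    # is_hot_storage stays a side-effect-free query that only returns a bool.
    location = fs.storage_location(uri)
    if location in fs.allowed_storage_locations:
        return
    if not await fs.is_hot_storage(location, uri):
        raise ValueError(
            f'{uri} is in cold storage; reading cold-storage data can incur '
            'significant retrieval costs.'
        )
    fs.allowed_storage_locations.append(location)
```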

@daniel-goldstein removed their assignment on Jul 22, 2024.
@hail-ci-robot merged commit ab730f1 into hail-is:main on Jul 29, 2024. 2 checks passed.

Successfully merging this pull request may close these issues:

[query] validation system needs to gracefully handle public access buckets (#14291)

3 participants