[fs] allows reading from public access buckets #14292

Merged: 41 commits from the cold-storage/public-access branch into hail-is:main on Jul 29, 2024

Conversation

@iris-garden (Collaborator) commented on Feb 14, 2024:

Closes #14291.

@daniel-goldstein (Contributor) left a comment:

Thanks for fixing this! I think there are some annoying subtleties around how we cache storage class lookups now, but otherwise it's great to see how small a PR this is. I also think it's worth very explicitly documenting that if we can read the default storage class of a bucket, we will not query the storage class of every object and will assume that each object's class matches the bucket's.
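For illustration, the kind of explicit documentation being requested might look something like the following docstring sketch (hypothetical wording, not the text that ultimately landed in the PR):

```python
async def is_hot_storage(self, location: str, uri: str) -> bool:
    """Report whether ``uri`` is stored in a hot (standard) storage class.

    Storage classes are assumed to be uniform within a bucket: if the bucket's
    default storage class is readable, it is used for every object in the
    bucket and individual objects are never inspected. Only when the default
    cannot be read (e.g. for public access buckets) is a single object's
    metadata checked, and that object is then taken as representative of the
    whole bucket.
    """
    ...
```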

```diff
@@ -630,22 +635,28 @@ def schemes() -> Set[str]:
     def storage_location(self, uri: str) -> str:
         return self.get_bucket_and_name(uri)[0]

-    async def is_hot_storage(self, location: str) -> bool:
+    async def is_hot_storage(self, location: str, uri: str) -> bool:
```
@daniel-goldstein (Contributor):

I think it's worth moving the added functionality here into a distinct method. Adding to this method is starting to make the signature confusing for me. Is it on the caller to ensure that uri is derived from location? Can the bucket be hot storage by default while the object was explicitly moved to cold storage, and what happens then?

@iris-garden (Collaborator, author):

Since the part of _async_validate_file that checks for hot storage and errors accordingly just calls methods on the fs, I moved the extraction of the location from the URI into this method, along with all downstream functionality. Let me know what you think; I can break it up into smaller methods if that helps.
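As a loose sketch of the shape being discussed (shown as a method body out of its class context; get_bucket_info and the exception type are hypothetical, while get_bucket_and_name and object_info do appear in the diff):

```python
from typing import Set

# Which GCS storage classes count as "hot" is an assumption made for this sketch.
HOT_STORAGE_CLASSES: Set[str] = {'STANDARD', 'MULTI_REGIONAL', 'REGIONAL'}

async def is_hot_storage(self, location: str, uri: str) -> bool:
    # Prefer the bucket's default storage class when we're allowed to read it.
    try:
        bucket_info = await self.get_bucket_info(location)  # hypothetical helper
        return bucket_info['storageClass'] in HOT_STORAGE_CLASSES
    except PermissionError:  # stand-in for whatever a forbidden bucket lookup raises
        # Public access buckets often don't let us read bucket metadata, so fall
        # back to inspecting the object named by the URI itself.
        bucket, name = self.get_bucket_and_name(uri)
        object_info = await self.object_info(bucket, name)
        return object_info['storageClass'] in HOT_STORAGE_CLASSES
```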

```python
]
public_access_uri1 = f"gs://{public_access_bucket}/references/human_g1k_v37.fasta.gz"
fs = RouterAsyncFS()
```
@daniel-goldstein (Contributor):

nit: need to run fs.close when this is done
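A minimal way to address the nit, assuming RouterAsyncFS.close is awaitable and using isfile purely as a stand-in for whatever the test actually exercises (import path shown from memory):

```python
from hailtop.aiotools.router_fs import RouterAsyncFS

async def check_public_object(uri: str) -> bool:
    fs = RouterAsyncFS()
    try:
        # ... whatever the test needs to do with the filesystem ...
        return await fs.isfile(uri)
    finally:
        # release the underlying client sessions when the test is done
        await fs.close()
```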

@iris-garden (Collaborator, author):

Added!

```diff
@@ -35,7 +35,7 @@ async def _async_validate_file(
     if isinstance(fs, GoogleStorageAsyncFS):
         location = fs.storage_location(uri)
         if location not in fs.allowed_storage_locations:
-            if not await fs.is_hot_storage(location):
+            if not await fs.is_hot_storage(location, uri):
```
@daniel-goldstein (Contributor):

I think there's a subtle bug here for the following case:

```python
validate_file('gs://public-bucket/hot-storage-obj')   # OK because the object is hot storage
validate_file('gs://public-bucket/cold-storage-obj')  # doesn't error because public-bucket is cached in allowed_storage_locations
```

Whether we're querying the bucket or the object needs to somehow affect our caching strategy.
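One way to let the cache reflect whether a decision came from a bucket-wide default or from a single object's metadata (a sketch; is_hot_storage_with_scope is a hypothetical variant, not part of this PR):

```python
from typing import List


class HotStorageValidator:
    """Only caches a location when the hot-storage decision applies bucket-wide."""

    def __init__(self):
        self.allowed_storage_locations: List[str] = []

    async def validate(self, fs, uri: str) -> None:
        location = fs.storage_location(uri)
        if location in self.allowed_storage_locations:
            return
        # Hypothetical variant of is_hot_storage that also reports whether the
        # answer was derived from the bucket's default storage class (True) or
        # from the metadata of this one object (False).
        is_hot, bucket_wide = await fs.is_hot_storage_with_scope(location, uri)
        if not is_hot:
            raise ValueError(f'{uri} is not in hot storage')
        if bucket_wide:
            # Caching an object-level answer would hide cold objects elsewhere
            # in the same public bucket, which is exactly the bug described above.
            self.allowed_storage_locations.append(location)
```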

@iris-garden (Collaborator, author):

I was hoping to avoid adding too much overhead for public access buckets by assuming that if one object in the bucket is hot storage, the rest are. That's obviously not a sound assumption in all cases, but since our current strategy already makes the unsound assumption that if the default storage class of the bucket is hot, all of the objects inside will be hot storage, it seemed okay to me to do the same here.

I'm not really sure what the best tradeoff between performance and safety would be. We could check and cache the storage class of each individual object in all cases, do so just for public access buckets, or check each bucket's default policy and, failing that, infer the policy from the first object we check in a public access bucket. Do you think public access buckets are enough of an edge case that the performance hit of checking each object is fine, or should we consider one of the other strategies?

@daniel-goldstein (Contributor):

Ah, good point. I think it is OK to assume that storage classes are uniform across all objects in a bucket so long as we very explicitly document that. I don't know if there's a sound middle ground between that and checking all the objects. The thing I most want to avoid is fetching metadata for all files in a 50k-partition dataset.

```diff
@@ -630,22 +635,28 @@ def schemes() -> Set[str]:
     def storage_location(self, uri: str) -> str:
         return self.get_bucket_and_name(uri)[0]

-    async def is_hot_storage(self, location: str) -> bool:
+    async def is_hot_storage(self, location: str, uri: str) -> bool:
```
@daniel-goldstein (Contributor):

What should happen if uri is a directory?

@iris-garden (Collaborator, author):

I'm not actually sure; is there a use case for opening a directory instead of an object from a public access bucket? When writing the tests I tried using a directory in a public access bucket as the remote tmpdir of a ServiceBackend, and it errored out because the directory is not itself an object, but I initially wrote that off as unimportant because I don't think there's a situation where a user would actually want to do that with a public access bucket. I'm also not sure what we can check if we just have a directory; if there's a way for us to list the objects in it, I guess we could just check the first one we see? What are your thoughts?

@daniel-goldstein (Contributor):

Ya, I mean "directory" here loosely, in the sense that object stores try to look like actual filesystems. Indeed, there does not appear to be an obviously correct solution in this case, but unfortunately I think directories are going to be a common use case. Since Hail datasets are partitioned across multiple files, any .ht or .mt is actually a directory. You can see, for example, the gnomAD public data: gs://gcp-public-data--gnomad/release/4.0/ht/genomes/gnomad.genomes.v4.0.sites.ht.

> if there's a way for us to list objects in it i guess we could just check the first one we see?

Regardless, I think this is a very reasonable solution. If someone gives us a directory, it is most often a directory of data that belongs together, like a Hail dataset. I am fine with expecting those all to be the same storage type.
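The "check the first object we see" idea might look roughly like this; isfile and list_object_uris stand in for whatever listing API the filesystem actually exposes, so treat the names as illustrative:

```python
async def representative_object_uri(fs, uri: str) -> str:
    """Return ``uri`` if it names an object, otherwise the first object found
    under it when it is treated as a directory-like prefix.

    ``fs.isfile`` and ``fs.list_object_uris`` are stand-ins, not real API calls.
    """
    if await fs.isfile(uri):
        return uri
    # Hail tables (.ht) and matrix tables (.mt) are directories containing many
    # files, so pick one file and treat its storage class as representative of
    # the whole dataset.
    async for object_uri in fs.list_object_uris(uri):
        return object_uri
    raise FileNotFoundError(uri)
```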

@iris-garden (Collaborator, author):

Updated!

```python
    async def object_info(self, bucket: str, name: str) -> Dict[str, Any]:
        # GCS JSON API objects.get; the returned metadata includes the object's storageClass.
        kwargs: Dict[str, Any] = {}
        self._update_params_with_user_project(kwargs, bucket)
        return await self.get(f'/b/{bucket}/o/{urllib.parse.quote(name, safe="")}', **kwargs)
```
@daniel-goldstein (Contributor):

I think you can use statfile and add a storage_class field on GetObjectFileStatus
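A rough sketch of that suggestion, assuming GetObjectFileStatus wraps the raw object metadata returned by the GCS JSON API (whose objects resource includes a storageClass field); the accessor name and the exact wiring are guesses, not the merged code:

```python
from typing import Any, Dict


class GetObjectFileStatus:
    def __init__(self, items: Dict[str, Any]):
        self._items = items

    def storage_class(self) -> str:
        # 'storageClass' comes straight from the GCS objects resource,
        # e.g. 'STANDARD', 'NEARLINE', 'COLDLINE', or 'ARCHIVE'.
        return self._items['storageClass']


# Caller side: stat the file instead of issuing a bespoke objects.get request.
#   status = await fs.statfile(uri)
#   is_hot = status.storage_class() == 'STANDARD'
```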

@iris-garden (Collaborator, author):

Updated!

@iris-garden force-pushed the cold-storage/public-access branch 2 times, most recently from 5ff4174 to 3d55bad on February 23, 2024 at 17:56.
@iris-garden force-pushed the cold-storage/public-access branch 2 times, most recently from 9317f5e to 78aed73 on March 4, 2024 at 17:38.
@daniel-goldstein (Contributor) left a comment:

Left comments in the previous discussions.

@iris-garden force-pushed the cold-storage/public-access branch 8 times, most recently from e660584 to de61786 on March 11, 2024 at 17:06.
@daniel-goldstein (Contributor) left a comment:

This is a great improvement; just one last request.

```python
error = next_error
raise error
self.allowed_storage_locations.append(location)
return is_hot_storage
```
@daniel-goldstein (Contributor):

Ya, I think that when it comes to readability I care a bit less about the duplicated method call than I do about the interpretation of the signature. I find it surprising that a method called is_hot_storage would have side effects, and despite being declared -> bool, this implementation makes it behave more like True | ValueError, which I find harder to reason about. I think this revision looks great; my only request would be to move the allowed_storage_locations check, along with the big error message, up into the validate_file function.
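Roughly the shape being requested, as a sketch (the error message and exception type are placeholders, and the GoogleStorageAsyncFS isinstance check from the original diff is elided):

```python
async def _async_validate_file(uri: str, fs) -> None:
    # The cache check and the user-facing error live at the call site, so that
    # is_hot_storage stays a side-effect-free query that only returns a bool.
    location = fs.storage_location(uri)
    if location in fs.allowed_storage_locations:
        return
    if not await fs.is_hot_storage(location, uri):
        raise ValueError(
            f'{uri} is in cold storage; reading cold-storage data can incur '
            'significant retrieval costs.'
        )
    fs.allowed_storage_locations.append(location)
```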

@daniel-goldstein removed their assignment on Jul 22, 2024.
@hail-ci-robot merged commit ab730f1 into hail-is:main on Jul 29, 2024. 2 checks passed.

Successfully merging this pull request may close these issues:

[query] validation system needs to gracefully handle public access buckets (#14291)

3 participants