[azure] Optimize file status operations #13368

jigold · 2023-08-02T20:46:44Z

I think this change will substantially reduce the number of operations we're making. My Scala skills are not great, so I'm not sure this is written correctly.

Basically, to test whether a path was a directory, we were listing the blobs recursively, which streamed through the first 5000 records. I set the page size to 1 record, since we don't care about all records. Next, I get the blob properties directly rather than calling exists and then getting the blob properties, which cuts the number of HTTP calls in half for every blob. Lastly, listing the items in a directory, which is used for globbing, was fetching the metadata for each file, and that in turn made the 3 API calls above to check whether it's a directory, whether it exists, and what the blob properties are. All of this information is already in the original result from listing the blobs by hierarchy, so I now use it directly.
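The effect of the last change on call counts can be illustrated with a rough, self-contained sketch. It does not use the Azure SDK: `CountingStore`, `Listed`, `Status`, `listOld`, and `listNew` are hypothetical names invented for illustration, and each method on the store stands in for one HTTP round trip. The per-item path mirrors the old shape (directory check, then existence check, then properties), while the new path builds each status from the listing result alone.

```scala
object ListingSketch {
  // Hypothetical listing entry: what a hierarchical listing is assumed to return.
  final case class Listed(name: String, size: Long, isPrefix: Boolean)
  final case class Status(path: String, size: Long, isDir: Boolean)

  // Hypothetical in-memory store; every public method counts as one simulated HTTP call.
  class CountingStore(blobs: Map[String, Long]) {
    var calls = 0

    // One call returns the direct children of a prefix, including sizes.
    def listByHierarchy(prefix: String): Seq[Listed] = {
      calls += 1
      blobs.keys.toSeq
        .filter(_.startsWith(prefix))
        .map { name =>
          val rest = name.drop(prefix.length)
          val i = rest.indexOf('/')
          if (i >= 0) Listed(prefix + rest.take(i + 1), 0L, isPrefix = true)
          else Listed(name, blobs(name), isPrefix = false)
        }
        .distinct
        .sortBy(_.name)
    }

    def hasChildren(path: String): Boolean = { calls += 1; blobs.keys.exists(_.startsWith(path + "/")) }
    def exists(path: String): Boolean = { calls += 1; blobs.contains(path) }
    def size(path: String): Long = { calls += 1; blobs(path) }
  }

  // Old shape: up to three extra calls for every listed item.
  def statusPerItem(store: CountingStore, path: String): Status = {
    val p = path.stripSuffix("/")
    if (store.hasChildren(p)) Status(p, 0L, isDir = true)
    else if (!store.exists(p)) throw new java.io.FileNotFoundException(p)
    else Status(p, store.size(p), isDir = false)
  }

  def listOld(store: CountingStore, prefix: String): Seq[Status] =
    store.listByHierarchy(prefix).map(item => statusPerItem(store, item.name))

  // New shape: the listing result already carries everything needed.
  def listNew(store: CountingStore, prefix: String): Seq[Status] =
    store.listByHierarchy(prefix).map(item => Status(item.name.stripSuffix("/"), item.size, item.isPrefix))
}
```

In this toy model, listing a directory with two files and one subdirectory costs eight simulated calls the old way and one the new way, with identical results.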

danking · 2023-08-03T17:58:00Z

build.yaml

@@ -2605,6 +2605,7 @@ steps:
    scopes:
      - deploy
      - dev
+      - test


note to ourselves to nix this before merging

danking

Wow this is a killer find. Excellent work. I think we should hold off on enabling QoB tests. Let's do that as a separate PR and maybe run it 10-20 times to convince ourselves it's finally reliable first.

danking · 2023-08-03T20:14:47Z

hail/src/main/scala/is/hail/io/fs/AzureStorageFS.scala

@@ -389,8 +399,9 @@ class AzureStorageFS(val credentialsJSON: Option[String] = None) extends FS {
    val prefixMatches = blobContainerClient.listBlobsByHierarchy(prefix)

    prefixMatches.forEach(blobItem => {
-      statList += fileStatus(url.withPath(blobItem.getName))
+      statList += AzureStorageFileStatus(blobItem)


holy yikes-a-molly! Great change, this is awful.

danking · 2023-08-03T20:35:22Z

hail/src/main/scala/is/hail/io/fs/AzureStorageFS.scala


-    new BlobStorageFileStatus(path, modificationTime, size, isDir)
+  def apply(blobItem: BlobItem): BlobStorageFileStatus = {
+    val properties = blobItem.getProperties


While you're changing build.yaml, also push this into the else branch.

danking · 2023-08-03T20:37:06Z

hail/src/main/scala/is/hail/io/fs/AzureStorageFS.scala

@@ -389,8 +399,9 @@ class AzureStorageFS(val credentialsJSON: Option[String] = None) extends FS {
    val prefixMatches = blobContainerClient.listBlobsByHierarchy(prefix)

    prefixMatches.forEach(blobItem => {
-      statList += fileStatus(url.withPath(blobItem.getName))
+      statList += AzureStorageFileStatus(blobItem)


Ah, I think this is actually a bit of a bug. The name of a blobItem is like /foo/bar/baz but we need the URL which is hail-az://account/container/foo/bar/baz.
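A minimal sketch of the distinction, assuming a hypothetical `AzureUrl` case class (the real URL type in AzureStorageFS.scala may look different): the blob name only carries the path within the container, so the account and container have to come from the URL that was being listed.

```scala
object UrlSketch {
  // Hypothetical model of a hail-az URL, for illustration only.
  final case class AzureUrl(account: String, container: String, path: String) {
    // Keep the account and container, swap in the blob's name as the path.
    def withPath(newPath: String): AzureUrl = copy(path = newPath.stripPrefix("/"))

    override def toString: String = s"hail-az://$account/$container/$path"
  }
}
```

With this model, `AzureUrl("account", "container", "foo/").withPath("/foo/bar/baz").toString` gives the full `hail-az://account/container/foo/bar/baz` form rather than the bare blob name.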

jigold · 2023-08-03T20:56:17Z

I'm putting the WIP tag to remind ourselves to turn off the Azure tests before merging.

done

danking

LGTM. Excellent change.

jigold · 2023-08-03T23:17:44Z

@danking I had to make some changes I'm not 100% confident in with stripping trailing "/".

needs another look

danking

Yeah, this is right, the old code for directories did the same thing:

    val filename = dropTrailingSlash(url.toString)
    if (!isDir && !blobClient.exists()) {
      throw new FileNotFoundException(s"File not found: $filename")
    }

    if (isDir) {
      new BlobStorageFileStatus(path = filename, null, 0, isDir = true)
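For reference, a helper like `dropTrailingSlash` could be sketched as below. This is an assumption about its behavior (strip every trailing "/"), not the actual Hail implementation.

```scala
object SlashSketch {
  // Assumed behavior: remove all trailing '/' characters, leave everything else alone.
  def dropTrailingSlash(s: String): String = {
    var end = s.length
    while (end > 0 && s.charAt(end - 1) == '/') end -= 1
    s.substring(0, end)
  }
}
```

Under that assumption, a directory listed as `hail-az://account/container/foo/` and a status requested for `hail-az://account/container/foo` end up with the same path string.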

[azure] Optimize file status operations

81985e2

jigold assigned danking Aug 2, 2023

jigold mentioned this pull request Aug 2, 2023

[query] In Azure, QoB sees elevated rates of weird errors from the Azure Blob Storage SDK #13351

Closed

danking reviewed Aug 3, 2023

danking requested changes Aug 3, 2023

danking reviewed Aug 3, 2023

danking previously requested changes Aug 3, 2023

fix bug

442c70f

jigold force-pushed the azure-fs-debugging branch from 2953849 to 442c70f Compare August 3, 2023 20:51

jigold added the WIP label Aug 3, 2023

jigold added 2 commits August 3, 2023 16:59

more fixes

05d99c5

fix

e651fdc

danking previously approved these changes Aug 3, 2023

fixes

15b6150

jigold removed the WIP label Aug 3, 2023

danking approved these changes Aug 5, 2023

danking merged commit a31f941 into hail-is:main Aug 5, 2023
8 checks passed
