Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[azure] Optimize file status operations #13368

Merged
merged 5 commits into from Aug 5, 2023

Conversation

jigold
Copy link
Collaborator

@jigold jigold commented Aug 2, 2023

I think this change will help the number of operations we're making substantially. My Scala skills are not great, so I don't know if this is written correctly.

Basically, we were making a call to list the blobs recursively to test if the path was a directory which was streaming through the first 5000 records. I made the page size equal to 1 record as we don't care about all records. The next thing I did was to just get the blob properties rather than calling exists + get blob properties. So that will cut the number of HTTP calls by half for every blob. Lastly, listing items in a directory which is used for globbing was making a call to get the metadata for each file and then it was making the 3 API calls above to check whether it's a directory, whether it exists, and what the blob properties are. All of this information is in the original result from listing the blobs in the hierarchy so I just use that information directly.

build.yaml Outdated
@@ -2605,6 +2605,7 @@ steps:
scopes:
- deploy
- dev
- test
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to ourselves to nix this before merging

Copy link
Collaborator

@danking danking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow this is a killer find. Excellent work. I think we should hold off on enabling QoB tests. Let's do that as a separate PR and maybe run it 10-20 times to convince ourselves its finally reliable first.

@@ -389,8 +399,9 @@ class AzureStorageFS(val credentialsJSON: Option[String] = None) extends FS {
val prefixMatches = blobContainerClient.listBlobsByHierarchy(prefix)

prefixMatches.forEach(blobItem => {
statList += fileStatus(url.withPath(blobItem.getName))
statList += AzureStorageFileStatus(blobItem)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

holy yikes-a-molly! Great change, this is awful.


new BlobStorageFileStatus(path, modificationTime, size, isDir)
def apply(blobItem: BlobItem): BlobStorageFileStatus = {
val properties = blobItem.getProperties
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you're changing build.yaml, also push this into the else branch.

danking
danking previously requested changes Aug 3, 2023
@@ -389,8 +399,9 @@ class AzureStorageFS(val credentialsJSON: Option[String] = None) extends FS {
val prefixMatches = blobContainerClient.listBlobsByHierarchy(prefix)

prefixMatches.forEach(blobItem => {
statList += fileStatus(url.withPath(blobItem.getName))
statList += AzureStorageFileStatus(blobItem)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think this is actually a bit of a bug. The name of a blobItem is like /foo/bar/baz but we need the URL which is hail-az://account/container/foo/bar/baz.

@jigold
Copy link
Collaborator Author

jigold commented Aug 3, 2023

I'm putting the WIP tag to remind ourselves to turn off the Azure tests before merging.

danking
danking previously approved these changes Aug 3, 2023
Copy link
Collaborator

@danking danking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Excellent change.

@jigold jigold removed the WIP label Aug 3, 2023
@jigold
Copy link
Collaborator Author

jigold commented Aug 3, 2023

@danking I had to make some changes I'm not 100% confident in with stripping trailing "/".

@jigold jigold dismissed danking’s stale review August 3, 2023 23:18

needs another look

Copy link
Collaborator

@danking danking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is right, the old code for directories did the same thing:

    val filename = dropTrailingSlash(url.toString)
    if (!isDir && !blobClient.exists()) {
      throw new FileNotFoundException(s"File not found: $filename")
    }

    if (isDir) {
      new BlobStorageFileStatus(path = filename, null, 0, isDir = true)

@danking danking merged commit a31f941 into hail-is:main Aug 5, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants