[azure] Optimize file status operations #13368
build.yaml
Outdated
@@ -2605,6 +2605,7 @@ steps:
scopes:
- deploy
- dev
- test
note to ourselves to nix this before merging
Wow, this is a killer find. Excellent work. I think we should hold off on enabling QoB tests. Let's do that as a separate PR and maybe run it 10-20 times first to convince ourselves it's finally reliable.
@@ -389,8 +399,9 @@ class AzureStorageFS(val credentialsJSON: Option[String] = None) extends FS {
val prefixMatches = blobContainerClient.listBlobsByHierarchy(prefix)

prefixMatches.forEach(blobItem => {
statList += fileStatus(url.withPath(blobItem.getName))
statList += AzureStorageFileStatus(blobItem)
Holy yikes-a-molly! Great change; the old code was awful.
new BlobStorageFileStatus(path, modificationTime, size, isDir)
def apply(blobItem: BlobItem): BlobStorageFileStatus = {
val properties = blobItem.getProperties
While you're changing build.yaml, also push this into the else branch.
@@ -389,8 +399,9 @@ class AzureStorageFS(val credentialsJSON: Option[String] = None) extends FS {
val prefixMatches = blobContainerClient.listBlobsByHierarchy(prefix)

prefixMatches.forEach(blobItem => {
statList += fileStatus(url.withPath(blobItem.getName))
statList += AzureStorageFileStatus(blobItem)
Ah, I think this is actually a bit of a bug. The name of a blobItem is like /foo/bar/baz, but we need the URL, which is hail-az://account/container/foo/bar/baz.
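A minimal sketch of the fix this comment points at, rebuilding the full URL from a listed blob's container-relative name. `HailAzUrl`, its fields, and `urlForBlob` are hypothetical stand-ins for illustration; the project's real URL type and `withPath` may behave differently.

```scala
// Hypothetical stand-in for the project's URL type; the field names are assumptions.
final case class HailAzUrl(account: String, container: String, path: String) {
  // Return a copy of this URL that points at another path in the same container.
  // Leading slashes are dropped so that "/foo/bar/baz" does not produce "//".
  def withPath(newPath: String): HailAzUrl = copy(path = newPath.dropWhile(_ == '/'))

  override def toString: String = s"hail-az://$account/$container/$path"
}

object BlobNameToUrl {
  // Combine the URL that was listed with a blob's container-relative name
  // to get the full URL that a file status should report.
  def urlForBlob(listed: HailAzUrl, blobName: String): String =
    listed.withPath(blobName).toString
}
```

With these assumptions, `BlobNameToUrl.urlForBlob(HailAzUrl("account", "container", "foo/bar"), "/foo/bar/baz")` yields `hail-az://account/container/foo/bar/baz` rather than the bare blob name.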
Force-pushed from 2953849 to 442c70f.
I'm adding the WIP tag to remind us to turn off the Azure tests before merging.
LGTM. Excellent change.
@danking I had to make some changes to how a trailing "/" is stripped, and I'm not 100% confident in them.
Yeah, this is right, the old code for directories did the same thing:
val filename = dropTrailingSlash(url.toString)
if (!isDir && !blobClient.exists()) {
throw new FileNotFoundException(s"File not found: $filename")
}
if (isDir) {
new BlobStorageFileStatus(path = filename, null, 0, isDir = true)
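For readers unfamiliar with the helper, the trailing-slash normalization discussed above could look roughly like the sketch below. This is an assumption about behavior, not the project's actual `dropTrailingSlash`; in particular, whether it strips one slash or all of them is not stated in this thread.

```scala
object TrailingSlash {
  // Assumed behavior: remove every trailing '/' so that "dir/" and "dir"
  // refer to the same path. The real helper may strip only a single slash.
  def dropTrailingSlash(s: String): String = {
    var end = s.length
    while (end > 0 && s.charAt(end - 1) == '/') end -= 1
    s.substring(0, end)
  }
}
```

Under this assumption, a directory URL such as `hail-az://account/container/foo/` is reported as `hail-az://account/container/foo`, which matches what the old directory branch quoted above produced.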
I think this change will substantially reduce the number of operations we make. My Scala skills are not great, so I don't know whether this is written correctly.
Basically, to test whether a path was a directory, we were listing the blobs recursively, which streamed through the first 5000 records. I set the page size to 1 record, since we don't need all the records. Next, I fetch the blob properties directly rather than calling exists and then fetching the blob properties, which halves the number of HTTP calls for every blob. Lastly, listing the items in a directory (which is used for globbing) was fetching the metadata for each file, and each of those fetches made the 3 API calls above to check whether the path is a directory, whether it exists, and what the blob properties are. All of this information is already in the result of listing the blobs in the hierarchy, so I now use that information directly.
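To make the call-count argument concrete, here is a small self-contained sketch that simulates the before and after strategies against an in-memory store. Everything in it (`CountingStore`, `BlobInfo`, `FileStatus`, and the `ListingStrategies` methods) is hypothetical and only models the description above; it does not use the Azure SDK, and it counts one simulated call per request regardless of page size.

```scala
// Hypothetical in-memory stand-ins; none of these are the project's or Azure's types.
final case class BlobInfo(name: String, size: Long, modificationTime: Long)
final case class FileStatus(path: String, size: Long, modificationTime: Long, isDir: Boolean)

final class CountingStore(blobs: Seq[BlobInfo]) {
  var calls: Int = 0 // number of simulated HTTP requests

  // One request per listing; pageSize only limits how many records come back.
  def listPage(prefix: String, pageSize: Int): Seq[BlobInfo] = {
    calls += 1
    blobs.filter(_.name.startsWith(prefix)).take(pageSize)
  }
  def exists(name: String): Boolean = { calls += 1; blobs.exists(_.name == name) }
  def properties(name: String): Option[BlobInfo] = { calls += 1; blobs.find(_.name == name) }
}

object ListingStrategies {
  private def toStatus(b: BlobInfo): FileStatus =
    FileStatus(b.name, b.size, b.modificationTime, isDir = false)

  // Before: directory probe over up to 5000 records, then exists, then properties.
  def statOld(store: CountingStore, name: String): Option[FileStatus] =
    if (store.listPage(name + "/", 5000).nonEmpty) Some(FileStatus(name, 0L, 0L, isDir = true))
    else if (!store.exists(name)) None
    else store.properties(name).map(toStatus)

  // After: one-record directory probe, then a single properties request.
  def statNew(store: CountingStore, name: String): Option[FileStatus] =
    if (store.listPage(name + "/", 1).nonEmpty) Some(FileStatus(name, 0L, 0L, isDir = true))
    else store.properties(name).map(toStatus)

  // Before: one listing request plus a full stat for every listed item.
  def listOld(store: CountingStore, prefix: String): Seq[FileStatus] =
    store.listPage(prefix, 5000).flatMap(b => statOld(store, b.name))

  // After: reuse the metadata already present in the listing result.
  def listNew(store: CountingStore, prefix: String): Seq[FileStatus] =
    store.listPage(prefix, 5000).map(toStatus)
}
```

In this model, listing a directory that contains three files costs 1 + 3 × 3 = 10 simulated requests with `listOld` and a single request with `listNew`, while a single-file stat drops from 3 requests to 2.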