Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs.find: cache path ids #286

Merged
merged 1 commit into from
Jun 22, 2023
Merged

fs.find: cache path ids #286

merged 1 commit into from
Jun 22, 2023

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Jun 22, 2023

GDriveFileSystem was previously caching dir ids of root, and was using those on fs.find().
This worked well when the remote cache was at the root, but now since dvc uses /files/md5/ by default, the dir ids are no longer in the cache and find ends up returning an empty list.

This PR checks if the path is cached, and if not, it caches the ID of the path.

Tests passes for dvc in iterative/dvc-gdrive#28

@skshetry skshetry temporarily deployed to internal June 22, 2023 03:08 — with GitHub Actions Inactive
@skshetry skshetry marked this pull request as draft June 22, 2023 03:08
@skshetry

This comment was marked as resolved.

@skshetry skshetry requested a review from shcheklein June 22, 2023 03:31
@skshetry skshetry marked this pull request as ready for review June 22, 2023 03:31
@skshetry
Copy link
Member Author

skshetry commented Jun 22, 2023

@shcheklein, GDriveFileSystem does cache files id by default.

On fs.find("/root/<>/files/md5"), the base here will be /files/md5. find() checks for path starting with /files/md5 which does not exist.

query_ids = {
dir_id: dir_name
for dir_id, dir_name in dir_ids.pop().items()
if posixpath.commonpath([base, dir_name]) == base
if dir_id not in seen_paths
}

So it uses base here instead of self.base. I don't think it's worth it to save one API call here (which usually gets cached anyway).

@shcheklein
Copy link
Member

shcheklein commented Jun 22, 2023

okay, one more consideration here - it's still not optimal for DVC I think. It will be running an extra query on each find to fetch roots, right? may be even two?

@shcheklein
Copy link
Member

okay, one more consideration here - it's still not optimal for DVC I think. It will be running an extra query on each find to fetch roots, right? may be even two?

may be also not an issue, depends on how we feed things to it ... do we ever ask find('files/md5')? w/o any prefix? (that would mean an extra cost of getting to the children 00 .... ff every time).

@skshetry
Copy link
Member Author

skshetry commented Jun 22, 2023

may be also not an issue, depends on how we feed things to it ... do we ever ask find('files/md5')? w/o any prefix?

yes, we do find('files/md5'), and is what was broken. It's just one extra API call though, right? Which gets cached?

@skshetry
Copy link
Member Author

It will be running an extra query on each find to fetch roots, right? may be even two?

For files/md5, that'll be just one. files/ is cached, so only an id for files/md5 will be fetched (which will be cached).

that would mean an extra cost of getting to the children 00 .... ff every time)

So, similarly here, fetching id for files/md5 will take one query, and a listing on files/md5 will take another. Both of them are cached, and won't be fetched again.

@shcheklein
Copy link
Member

One cached initial call is fine. What happens next is when we are getting an extra call every single time. We need to run a query to get all chidren of the 'files/md'. That will be happening again and again unless i'm missing something (?).

@shcheklein
Copy link
Member

We cache only id to name (path) and back. We don't cache query results (like the list of subdirectories) afaiu.

@skshetry
Copy link
Member Author

We cache only id to name (path) and back. We don't cache query results (like the list of subdirectories) afaiu.

It should be just one extra query because on subsequent find() call, it'll use query id of all dir_ids matching files/md5 (so all prefixes should include that).

query_ids = {
dir_id: dir_name
for dir_id, dir_name in dir_ids.pop().items()
if posixpath.commonpath([base, dir_name]) == base
if dir_id not in seen_paths
}

@shcheklein
Copy link
Member

It should be just one extra query because on subsequent find() call,

yep. which is not that bad - but still the same query again and again. I don't remember by now if we do that in parallel and rapidly (don't see a reason from the top of my head). The situation we want to avoid where that leads to 2x queries per second - that would be bad for us. If that's not the case- that's fine.

@skshetry
Copy link
Member Author

I am not sure I understand. That was the same case before too. We used to query for union of all prefixes over and over again.

We can think of using dircache in the future, similar to which is implemented in s3fs/gcsfs/adlfs.

@shcheklein
Copy link
Member

I am not sure I understand. That was the same case before too. We used to query for union of all prefixes over and over again.

no, here we are making an extra call now to get first the list of 00 ... ff under the files/md5. Before we were starting from 00 ... ff. That's the difference. At least from what I see, May be there is something else.

@skshetry
Copy link
Member Author

skshetry commented Jun 22, 2023

I am not sure I understand. That was the same case before too. We used to query for union of all prefixes over and over again.

no, here we are making an extra call now to get first the list of 00 ... ff under the files/md5. Before we were starting from 00 ... ff. That's the difference. At least from what I see, May be there is something else.

@shcheklein, on first find('files/md5') call, three things happen:

  1. There is no id for files/md5. It is fetched.
  2. Then, it tries to get dir id for path starting with files/md5. Only one exists which was recently fetched. So this is essentially only a files/md5 listing.
  3. Listing of prefixes gets cached, and it recurses.

On subsequent find('files/md5') call, id for files/md5 is already cached, and the query id for path matching files/md5 does return all paths with prefixes.

@skshetry
Copy link
Member Author

So at the end, subsequent find('files/md5') is just one query, same as before.

@skshetry skshetry temporarily deployed to internal June 22, 2023 13:23 — with GitHub Actions Inactive
Base automatically changed from dvc-issue-9607 to main June 22, 2023 13:25
@skshetry skshetry temporarily deployed to internal June 22, 2023 13:26 — with GitHub Actions Inactive
@skshetry skshetry temporarily deployed to internal June 22, 2023 13:29 — with GitHub Actions Inactive
@shcheklein
Copy link
Member

LGTM, @skshetry ! Let's merge it and release.

@shcheklein
Copy link
Member

Thanks!

@skshetry skshetry merged commit 53ee84c into main Jun 22, 2023
@skshetry skshetry deleted the cache-path-ids-find branch June 22, 2023 15:55
if not cached:
dir_ids = self._path_to_item_ids(base)
self._cache_path_id(base, *dir_ids)

dir_ids = [self._ids_cache["ids"].copy()]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, coming back here to double check :) one potential race condition here is if _cache_path_id that is being executed in some other thread got to the point where is update the first dictionary, but not yet the second. In this case dir_ids might not contain the cache for base yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With base, do you mean the root of the filesystem (aka self.base) or the path passed in find (aka base here)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the local var base value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, looks like we need to lock self._cache_path_id.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it'd have been okay if we used dirs here instead of ids. But I went with locking in #289.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants