fs.find: cache path ids #286

skshetry · 2023-06-22T03:08:22Z

GDriveFileSystem previously cached the dir ids of the root and used those in fs.find().
This worked well when the remote cache was at the root, but since dvc now uses /files/md5/ by default, the relevant dir ids are no longer in the cache and find() ends up returning an empty list.

This PR checks whether the path is cached and, if not, caches the ID of the path.

Tests pass for dvc in iterative/dvc-gdrive#28.

skshetry · 2023-06-22T04:15:45Z

@shcheklein, GDriveFileSystem does cache file ids by default.

On fs.find("/root/<>/files/md5"), the base here will be /files/md5. find() looks in the cache for paths starting with /files/md5, and none exist.

PyDrive2/pydrive2/fs/spec.py

Lines 484 to 489 in 4897344

    query_ids = {
        dir_id: dir_name
        for dir_id, dir_name in dir_ids.pop().items()
        if posixpath.commonpath([base, dir_name]) == base
        if dir_id not in seen_paths
    }

So it uses base here instead of self.base. I don't think it's worth it to save one API call here (which usually gets cached anyway).
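To see why the filter comes back empty, here is a minimal sketch that runs the same comprehension against a toy cache. The helper name `select_query_ids`, the ids, and the relative paths are made up for illustration; only the filter itself mirrors the snippet above.

```python
import posixpath


def select_query_ids(base, dir_ids, seen_paths):
    """Keep cached dirs that are at or under `base` and were not queried yet."""
    return {
        dir_id: dir_name
        for dir_id, dir_name in dir_ids.items()
        if posixpath.commonpath([base, dir_name]) == base
        if dir_id not in seen_paths
    }


# Hypothetical cache that only knows a directory directly under the root.
cached = {"id-files": "files"}
before = select_query_ids("files/md5", cached, set())

# Once the id of the requested path itself is cached, the filter has a starting point.
cached["id-md5"] = "files/md5"
after = select_query_ids("files/md5", cached, set())

print(before)  # {}
print(after)   # {'id-md5': 'files/md5'}
```

The parent directory `files` is rejected because its common path with `files/md5` is `files`, not `files/md5`, so nothing is left to query until the path's own id is in the cache.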

shcheklein · 2023-06-22T04:39:18Z

Okay, one more consideration here: it's still not optimal for DVC, I think. It will run an extra query on each find to fetch roots, right? Maybe even two?

shcheklein · 2023-06-22T04:43:24Z

Maybe it's also not an issue; it depends on how we feed things to it. Do we ever call find('files/md5') without any prefix? (That would mean the extra cost of getting to the children 00 ... ff every time.)

skshetry · 2023-06-22T05:17:44Z

> may be also not an issue, depends on how we feed things to it ... do we ever ask find('files/md5')? w/o any prefix?

Yes, we do call find('files/md5'), and that is what was broken. It's just one extra API call though, right? And it gets cached?

skshetry · 2023-06-22T05:29:28Z

> It will be running an extra query on each find to fetch roots, right? may be even two?

For files/md5, that'll be just one. files/ is cached, so only an id for files/md5 will be fetched (which will be cached).

> that would mean an extra cost of getting to the children 00 .... ff every time)

So, similarly here, fetching id for files/md5 will take one query, and a listing on files/md5 will take another. Both of them are cached, and won't be fetched again.

shcheklein · 2023-06-22T05:29:46Z

One cached initial call is fine. What matters is what happens next, if we end up making an extra call every single time. We need to run a query to get all children of 'files/md5'. That will happen again and again, unless I'm missing something (?).

shcheklein · 2023-06-22T05:31:53Z

We cache only id to name (path) and back. We don't cache query results (like the list of subdirectories) afaiu.

skshetry · 2023-06-22T05:37:48Z

> We cache only id to name (path) and back. We don't cache query results (like the list of subdirectories) afaiu.

It should be just one extra query, because on a subsequent find() call it'll build query_ids from all dir_ids matching files/md5 (so that should include all the prefixes).

PyDrive2/pydrive2/fs/spec.py

Lines 484 to 489 in 4897344

    query_ids = {
        dir_id: dir_name
        for dir_id, dir_name in dir_ids.pop().items()
        if posixpath.commonpath([base, dir_name]) == base
        if dir_id not in seen_paths
    }

shcheklein · 2023-06-22T05:42:07Z

> It should be just one extra query because on subsequent find() call,

Yep, which is not that bad, but it's still the same query again and again. I don't remember whether we do that in parallel and rapidly (I don't see a reason to, off the top of my head). The situation we want to avoid is one where this leads to 2x queries per second; that would be bad for us. If that's not the case, that's fine.

skshetry · 2023-06-22T05:44:28Z

I am not sure I understand. That was the same case before too. We used to query for union of all prefixes over and over again.

We can think about using a dircache in the future, similar to what is implemented in s3fs/gcsfs/adlfs.
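As a rough idea of what such a listing cache could look like, here is a minimal sketch. `ListingCache` and `fake_fetch` are hypothetical names; this is not the fsspec, s3fs, gcsfs, or adlfs implementation, just a dictionary keyed by directory path with explicit invalidation.

```python
class ListingCache:
    """Hypothetical per-directory listing cache (illustration only)."""

    def __init__(self, fetch_listing):
        self._fetch_listing = fetch_listing  # callable: path -> list of entries
        self._listings = {}

    def ls(self, path):
        # Only hit the backend the first time a directory is listed.
        if path not in self._listings:
            self._listings[path] = self._fetch_listing(path)
        return list(self._listings[path])

    def invalidate(self, path=None):
        # Drop one listing, or everything when no path is given.
        if path is None:
            self._listings.clear()
        else:
            self._listings.pop(path, None)


calls = []


def fake_fetch(path):
    # Stand-in for a remote query; records each simulated request.
    calls.append(path)
    return [path + "/00", path + "/ff"]


cache = ListingCache(fake_fetch)
cache.ls("files/md5")
cache.ls("files/md5")
print(len(calls))  # 1

cache.invalidate("files/md5")
cache.ls("files/md5")
print(len(calls))  # 2
```

The trade-off is staleness: any write through the filesystem would need to invalidate the affected directory, which is why this is a separate design decision from caching path ids.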

shcheklein · 2023-06-22T05:46:15Z

> I am not sure I understand. That was the same case before too. We used to query for union of all prefixes over and over again.

No, here we are now making an extra call to first get the list of 00 ... ff under files/md5. Before, we were starting from 00 ... ff. That's the difference, at least from what I see. Maybe there is something else.

skshetry · 2023-06-22T05:55:44Z

> > I am not sure I understand. That was the same case before too. We used to query for union of all prefixes over and over again.
>
> no, here we are making an extra call now to get first the list of 00 ... ff under the files/md5. Before we were starting from 00 ... ff. That's the difference. At least from what I see, May be there is something else.

@shcheklein, on the first find('files/md5') call, three things happen:

1. There is no id for files/md5. It is fetched.
2. Then, it tries to get dir ids for paths starting with files/md5. Only one exists, which was just fetched. So this is essentially only a files/md5 listing.
3. The listing of prefixes gets cached, and it recurses.

On subsequent find('files/md5') calls, the id for files/md5 is already cached, and the query_ids filter for paths matching files/md5 returns all the cached prefix directories as well.

skshetry · 2023-06-22T05:57:04Z

So at the end, subsequent find('files/md5') is just one query, same as before.
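The request accounting described above can be checked with a toy model. This is only a sketch under stated assumptions: `ToyDrive` and `ToyFS` are hypothetical stand-ins (not PyDrive2 classes), ids are fabricated as `"id:" + path`, and one simulated request lists the children of all queried directories at once.

```python
import posixpath


class ToyDrive:
    """Hypothetical stand-in for the remote API; counts simulated requests."""

    def __init__(self, tree):
        self.tree = tree  # {dir path: {"dirs": [...], "files": [...]}}
        self.calls = 0

    def lookup_id(self, path):
        self.calls += 1
        return "id:" + path

    def list_children(self, dir_paths):
        # One request returns the children of every directory in dir_paths.
        self.calls += 1
        dirs, files = [], []
        for d in dir_paths:
            dirs += [posixpath.join(d, n) for n in self.tree[d]["dirs"]]
            files += [posixpath.join(d, n) for n in self.tree[d]["files"]]
        return dirs, files


class ToyFS:
    def __init__(self, drive):
        self.drive = drive
        self.ids = {"id:files": "files"}  # dir id -> dir path; files/ starts cached

    def find(self, base):
        if base not in self.ids.values():  # the check this PR adds
            self.ids[self.drive.lookup_id(base)] = base
        seen, found = set(), []
        pending = [dict(self.ids)]
        while pending:
            query = {
                i: p
                for i, p in pending.pop().items()
                if posixpath.commonpath([base, p]) == base
                if i not in seen
            }
            if not query:
                continue
            seen.update(query)
            dirs, files = self.drive.list_children(list(query.values()))
            found += files
            new_dirs = {"id:" + d: d for d in dirs}
            self.ids.update(new_dirs)  # listed dirs get cached for later calls
            pending.append(new_dirs)
        return sorted(found)


TREE = {
    "files/md5": {"dirs": ["00", "ff"], "files": []},
    "files/md5/00": {"dirs": [], "files": ["aaa"]},
    "files/md5/ff": {"dirs": [], "files": ["bbb"]},
}
drive = ToyDrive(TREE)
fs = ToyFS(drive)

first = fs.find("files/md5")
first_calls = drive.calls
second = fs.find("files/md5")
second_calls = drive.calls - first_calls

print(first_calls)   # 3: id lookup, files/md5 listing, prefix listing
print(second_calls)  # 1: a single query over all cached dirs under files/md5
```

In this model the first call pays for the id lookup plus one listing per directory level, while a repeated call issues a single query over every cached directory under the base.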

shcheklein · 2023-06-22T15:53:58Z

LGTM, @skshetry ! Let's merge it and release.

shcheklein · 2023-06-22T15:54:03Z

Thanks!

shcheklein · 2023-06-23T03:04:14Z

pydrive2/fs/spec.py

+        if not cached:
+            dir_ids = self._path_to_item_ids(base)
+            self._cache_path_id(base, *dir_ids)
+
        dir_ids = [self._ids_cache["ids"].copy()]


Sorry, coming back here to double-check :) One potential race condition here: if _cache_path_id, executing in some other thread, has reached the point where it has updated the first dictionary but not yet the second, then dir_ids might not contain the cached entry for base yet.

skshetry · 2023-06-23

With base, do you mean the root of the filesystem (aka self.base) or the path passed to find (aka base here)?

shcheklein · 2023-06-23

I mean the value of the local variable base.

skshetry · 2023-06-23

Yeah, looks like we need to lock self._cache_path_id.

skshetry · 2023-06-23

Maybe it would have been okay if we had used dirs here instead of ids, but I went with locking in #289.
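For illustration only, the locking idea could look roughly like the sketch below. This is not the change made in #289; the `PathIdCache` class, its two dictionaries, and `snapshot_ids` are assumptions chosen to mirror the "first dictionary, then second" update discussed in this thread.

```python
import threading
from collections import defaultdict


class PathIdCache:
    """Hypothetical two-way cache guarded by a lock (illustration only)."""

    def __init__(self):
        self._lock = threading.RLock()
        self._dirs = defaultdict(list)  # path -> [item ids]
        self._ids = {}                  # item id -> path

    def cache_path_id(self, path, *item_ids):
        # Both dictionaries are updated under the same lock, so a reader
        # holding the lock never sees one updated without the other.
        with self._lock:
            for item_id in item_ids:
                self._dirs[path].append(item_id)
                self._ids[item_id] = path

    def snapshot_ids(self):
        # Copy under the lock, like the `.copy()` in find(), to get a
        # consistent view even while other threads are caching paths.
        with self._lock:
            return dict(self._ids)


cache = PathIdCache()
workers = [
    threading.Thread(target=cache.cache_path_id, args=(f"files/md5/{n:02x}", f"id-{n}"))
    for n in range(16)
]
for w in workers:
    w.start()
for w in workers:
    w.join()

snapshot = cache.snapshot_ids()
print(len(snapshot))  # 16
```

The lock only helps if every reader that depends on both dictionaries takes it too, which is why the snapshot is taken under the same lock here.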

skshetry marked this pull request as draft June 22, 2023 03:08

skshetry requested a review from shcheklein June 22, 2023 03:31

skshetry marked this pull request as ready for review June 22, 2023 03:31

shcheklein reviewed Jun 22, 2023

shcheklein approved these changes Jun 22, 2023

skshetry mentioned this pull request Jun 22, 2023

remove find implementation #285

Closed

skshetry force-pushed the dvc-issue-9607 branch from e6e2b2e to 8181bc6 Compare June 22, 2023 13:22

Base automatically changed from dvc-issue-9607 to main June 22, 2023 13:25

find: cache path item ids

b9fc857

skshetry force-pushed the cache-path-ids-find branch from 0464326 to b9fc857 Compare June 22, 2023 13:28

skshetry merged commit 53ee84c into main Jun 22, 2023

skshetry deleted the cache-path-ids-find branch June 22, 2023 15:55

shcheklein reviewed Jun 23, 2023
