implement autosync_interval trait #25

dlqqq · 2022-10-24T20:22:42Z

See discussion in #20.

kevin-bates · 2022-10-24T22:17:05Z

Hi @dlqqq - thanks for the ping.

We should avoid exposing implementation details, like a boolean autosync parameter, on get_path(). Instead, could we defer synchronization until we find that the path that corresponds to id does not exist? I think this will trigger far fewer synchronizations (and zero in some cases like cloud/controlled deployments).

get_path(id):

    query for id, if not found return None

    id found, 
    stat(path), if found return path

    path not found,
    if initial attempt - sync-all and repeat

    else (second attempt), drop row (or whatever should be done when files are deleted)
    return None

Regarding auto-sync functionality, I figured that would be done in a background process, but, regardless, you'd still run the risk of an out-of-band change between intervals. By deferring synchronization to only when necessary, yes, you penalize the poor client that found an anomaly, but that synchronization likely resolves the next client's anomaly (kinda like crowdsourcing 😄).

dlqqq · 2022-10-25T00:16:00Z

@kevin-bates

We should avoid exposing implementation details, like a boolean autosync parameter, on get_path().

Yeah, I had initially added this because I figured we should give users some way of disabling autosync, but honestly I don't see much use-case for it given that autosync is already being smart about not calling sync_all() too many times. I'll remove it.

Instead, could we defer synchronization until we find that the path that corresponds to id does not exist? I think this will trigger far fewer synchronizations (and zero in some cases like cloud/controlled deployments).

Yeah, this would be ideal. This would be a huge performance gain for calling get_path() on a file that wasn't moved. I had planned on implementing this as soon as I got back into file ID, but unfortunately more pressing matters emerged. Let me track this in a separate issue: #27

Regarding auto-sync functionality, I figured that would be done in a background process, but, regardless, you'd still run the risk of an out-of-band change between intervals. By deferring synchronization to only when necessary, yes, you penalize the poor client that found an anomaly, but that synchronization likely resolves the next client's anomaly (kinda like crowdsourcing 😄).

Yup, running sync_all() in a separate process would be ideal because that would mean that get_path() will never block the current thread for too long by calling sync_all(). Right now, the behavior is that if your last invocation of get_path() was longer than 5 seconds ago, you're just going to have to endure the 0-400ms delay.

However, because sync_all() involves writing to the database, this won't work right now because SQLite does not work with concurrent writes. We'll likely have all database writes get routed to a separate process polling requests from ZMQ. We will need to think about this issue more deeply in the future, but for now the current implementation will not need to handle multiple processes. I think we can be track further discussion about this here: #13

As usual, thank you for asking the hard questions.

kevin-bates · 2022-10-25T00:26:33Z

It's awesome to see #27 - yes, an optimistic approach is a simple way to express it (thanks).

I suspect the autosync_interval trait could go away if you find optimistic get_path() sufficient and I wonder if that approach should be checked out first.

kevin-bates · 2022-10-25T00:27:08Z

oh - never mind I guess. I see this was just merged.

dlqqq · 2022-10-25T01:15:36Z

I don't think any of the performance enhancements we had discussed are mutually exclusive with one another. They all have different use cases:

optimistic get_path(): applies to all use cases
autosync interval: used when having a slight delay is OK (e.g. path is only used for rendering the name of the file in the file browser, nothing important)
execution in a separate process: applies to cases when autosync is used and when multiple processes need to call file ID manager

implement autosync_interval trait

39388d6

dlqqq requested review from davidbrochart, ellisonbg and kevin-bates October 24, 2022 20:23

ellisonbg approved these changes Oct 24, 2022

View reviewed changes

remove autosync arg

ed22f89

dlqqq added the enhancement New feature or request label Oct 25, 2022

dlqqq merged commit 867be79 into jupyter-server:main Oct 25, 2022

dlqqq deleted the autosync branch October 25, 2022 00:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

implement autosync_interval trait #25

implement autosync_interval trait #25

dlqqq commented Oct 24, 2022

kevin-bates commented Oct 24, 2022

dlqqq commented Oct 25, 2022

kevin-bates commented Oct 25, 2022

kevin-bates commented Oct 25, 2022

dlqqq commented Oct 25, 2022

implement autosync_interval trait #25

implement autosync_interval trait #25

Conversation

dlqqq commented Oct 24, 2022

kevin-bates commented Oct 24, 2022

dlqqq commented Oct 25, 2022

kevin-bates commented Oct 25, 2022

kevin-bates commented Oct 25, 2022

dlqqq commented Oct 25, 2022