
Fix race condition issues on slower FS #564

Merged: 3 commits, merged Apr 20, 2020

Conversation

@mlucool (Contributor) commented Mar 17, 2020

This adds locking before executing commands. While this may add locking in some cases that are not actually races (e.g. commands that do not need to acquire index.lock), it solves issues seen when using this extension with a git directory on NFS.
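
For context, here is a minimal sketch of the approach, assuming (as this PR does) that every git command funnels through a single async execute helper; the names and the subprocess handling below are illustrative, not the PR's exact code:

```python
import subprocess

import tornado.locks
from tornado.ioloop import IOLoop

# One process-wide lock shared by every git invocation.
execution_lock = tornado.locks.Lock()


async def execute(cmdline, cwd):
    # Serialize git commands so that concurrent requests cannot race on
    # .git/index.lock, which is especially problematic on NFS.
    async with execution_lock:
        return await IOLoop.current().run_in_executor(
            None,
            lambda: subprocess.run(cmdline, cwd=cwd, capture_output=True, text=True),
        )
```

Any request handler that awaits execute then queues behind whatever git command is already in flight.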

@mlucool (Contributor, Author) commented Mar 17, 2020

Let me know if you generally like this. I'm not sure why it can't find tornado.locks, as it has been there a while: https://www.tornadoweb.org/en/stable/locks.html

@telamonian (Member) commented Mar 18, 2020

Nice work. I like the thinking behind this PR. We will need a better description of the error/issue this is trying to solve (e.g. a specimen of the actual error message; it might be nicest if it were written down in a separate issue). I'd also like to better understand what locking (if any) the git CLI itself does.

Ideally, this PR should also include a test to prove that it fixes the problem it's trying to solve. One idea for the test: explicitly lock a resource for ~5 seconds, then immediately start another operation that would create a race condition if locking is not present.

@mlucool (Contributor, Author) commented Mar 18, 2020

I'd also like to better understand what locking (if any) the git CLI itself does.

Git locking: https://docs.microsoft.com/en-us/azure/devops/repos/git/git-index-lock?view=azure-devops

Waiting on locks: https://stackoverflow.com/a/36364687

So we could improve this by telling execute which commands care about the lock file and which do not (e.g. I don't believe status does), but I think that's an over-optimization at this point and error prone.
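
For illustration, a hedged sketch of the "wait for index.lock" idea from the links above; apart from the lock file name, everything here (function name, timeouts) is assumed rather than taken from this PR:

```python
import os

from tornado import gen


async def wait_for_index_lock(repo_path, max_wait_s=5.0, poll_s=0.1):
    """Poll until .git/index.lock disappears, or give up after max_wait_s."""
    lock_path = os.path.join(repo_path, ".git", "index.lock")
    waited = 0.0
    while os.path.exists(lock_path):
        if waited >= max_wait_s:
            raise TimeoutError("git index.lock was not released in time")
        await gen.sleep(poll_s)  # yield to the IOLoop instead of blocking it
        waited += poll_s
```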

We will need a better description of the error/issue this is trying to solve

I think #555 is the same issue. Symptoms are at least

this PR should also include a test to prove that it fixes the problem it's trying to solve

I'll add one

@mlucool (Contributor, Author) commented Mar 18, 2020

@telamonian do you have an example where execute is tested? pytest --cov=jupyterlab_git --cov-report html jupyterlab_git makes it seem like we always mock that out

@telamonian (Member):

do you have an example where execute is tested? pytest --cov=jupyterlab_git --cov-report html jupyterlab_git makes it seem like we always mock that out

Hmm, not off the top of my head. Frederic's the expert, since he wrote execute and most of the related code (and the related tests). @fcollonval Any ideas?

@mlucool mentioned this pull request Mar 19, 2020

@telamonian (Member):

I thought of a potential concern: should read-only ops (in particular git status) ignore locking? Basically, I'm wondering if there are any potential interactions between this PR and how our UI updates.

@mlucool (Contributor, Author) commented Mar 20, 2020

Per my comment:

While this may add locking in some cases that are not a race (i.e. things that do not need to acquire index.lock)

Yes, git status will now unnecessarily wait for the lock. We can trivially remove the locking in some cases (i.e. opt out), but that feels like a premature optimization based on my local testing. If you feel strongly, I'm happy to make this change.

@mlucool (Contributor, Author) commented Mar 21, 2020

@telamonian let me know what you'd like me to do here to get this merged so that it's included in 0.10.

@fcollonval (Member):

do you have an example where execute is tested? pytest --cov=jupyterlab_git --cov-report html jupyterlab_git makes it seem like we always mock that out

Hmm, not off the top of my head. Frederic's the expert, since he wrote execute and most of the related code (and the related tests). @fcollonval Any ideas?

Hey, sorry for the delay.
Unfortunately, as stated in the (out-of-date) roadmap, there are no integration tests, meaning that all git commands are mocked. When I introduced the asynchronous calls, I chose to mock the execute function directly.

But you could test your nice feature by calling the execute function directly and mocking run_in_executor; the definition to patch is tornado.ioloop.IOLoop.run_in_executor.
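
A rough, self-contained sketch of that patching approach (Frédéric notes below that the real test ended up a bit more involved); the fake return value here is an assumption about what execute expects back from the executor:

```python
from unittest.mock import patch

import tornado.ioloop
import tornado.testing


class TestRunInExecutorPatch(tornado.testing.AsyncTestCase):
    @tornado.testing.gen_test
    async def test_patched_executor_short_circuits(self):
        async def fake_run_in_executor(io_loop, executor, func, *args):
            # Stand in for the thread-pool call; pretend the git process exited cleanly.
            return 0, "mocked stdout", ""

        with patch("tornado.ioloop.IOLoop.run_in_executor", new=fake_run_in_executor):
            # Anything routed through run_in_executor now gets the fake result,
            # so execute() could be exercised here without spawning a real git.
            result = await tornado.ioloop.IOLoop.current().run_in_executor(None, print)
            self.assertEqual(result, (0, "mocked stdout", ""))
```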

@fcollonval (Member) commented Mar 25, 2020

Testing is a bit more complicated than what I said, but this commit shows how to start achieving it: fcollonval@6ede2a8

[Review comment on jupyterlab_git/git.py: outdated, resolved]
@fcollonval (Member):

One additional comment:

The argument cwd of execute should now be mandatory:

-    cwd: "Optional[str]" = None,
+    cwd: str,

@telamonian added this to the 0.20.0 milestone Mar 26, 2020
@telamonian (Member) commented Mar 26, 2020

I'm targeting this for the next release (which should be 0.20.0). Once this PR is pulled in, we'll also backport it to the (jlab 1.0 compatible) 0.11.0 release.

Marc Udoff added 2 commits March 26, 2020 13:57
This adds locking before executing commands. While this may
add locking in some cases that are not a race (i.e. things that
do not need to acquire index.lock), this solves issues related to
using this with a git directory on NFS.
@mlucool (Contributor, Author) commented Mar 26, 2020

@fcollonval this should be all set now. I'll admit the unit test is mediocre because we don't have integration testing. It does prove that execute will sleep if a lock exists, and it simulates a command failing when a lock exists.

@mlucool (Contributor, Author) commented Mar 26, 2020

@telamonian

I'm targetting this for the next release (which should be 0.20.0). Once this PR is pulled in we'll also backport it to (jlab 1.0 compatible) 0.11.0

There is no need to backport this to 0.11 for me at this time. I'm hoping to be on 2.x in the near future, but I'll let you know if something changes. If you want to backport anyway, it may be a good idea.

Also, let me know if you want me to handle the cases where we know a command does not touch index.lock.

@fcollonval (Member) left a comment:

@mlucool

Thanks for the update and the test. Here is a proposal for a better, less-mocked test. The idea is simply to mock sleep so that it checks the lock status and then removes the lock file.
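
For illustration, a self-contained pytest-style sketch of that idea (not the actual test added in this PR; wait_for_index_lock below is a hypothetical synchronous stand-in for the extension's real wait loop):

```python
import os
import time
from unittest.mock import patch


def wait_for_index_lock(git_dir, max_wait_s=5.0, poll_s=0.1):
    """Stand-in for the real wait loop: block until index.lock disappears."""
    lock_path = os.path.join(git_dir, "index.lock")
    waited = 0.0
    while os.path.exists(lock_path):
        if waited >= max_wait_s:
            raise TimeoutError("index.lock still present")
        time.sleep(poll_s)
        waited += poll_s


def test_wait_returns_once_lock_is_removed(tmp_path):
    git_dir = tmp_path / ".git"
    git_dir.mkdir()
    lock_file = git_dir / "index.lock"
    lock_file.touch()  # simulate another git process holding the lock

    def fake_sleep(_seconds):
        # Instead of actually waiting, check the lock was seen, then release it.
        assert lock_file.exists()
        lock_file.unlink()

    with patch("time.sleep", side_effect=fake_sleep):
        wait_for_index_lock(str(git_dir))  # returns instead of timing out

    assert not lock_file.exists()
```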

[Review comment on jupyterlab_git/tests/test_execute.py: outdated, resolved]
Co-Authored-By: Frédéric Collonval <fcollonval@gmail.com>
@fcollonval (Member) left a comment:

Thanks @mlucool for this PR and the discussion.

I won't merge it immediately, so that @telamonian can have a look.

@kgryte (Member) commented Mar 28, 2020

It is not clear to me that we should continually retry until a command succeeds when an index.lock is already present, as opposed to parsing the error message and displaying a dialog in the UI that instructs the user to retry the command after waiting a specified period of time.

The main issue I see with this PR is the possibility of a race condition. We don't "lock" the UI to prevent further commands, and, from my reading, we run the risk that the pending command could be executed after a subsequent command (e.g., another commit).

If we want to add retry logic, we should presumably lock the UI to prevent further interactions. IMO, that is not particularly desirable, in comparison to pushing the retry to the user, which is exactly what happens when using Git at the command-line.

@mlucool (Contributor, Author) commented Mar 28, 2020

It is not clear to me that we should continually retry until command success in the event that an index.lock is already present

We only loop while waiting for the lock; we do not rerun the command.

we run the risk that the pending command could be executed after a subsequent command (e.g., another commit).

This enforces that each command waits on execution_lock.acquire. I assumed, perhaps wrongly, that waiters acquire the lock in FIFO order and thus ordering is preserved. If this is not true, we should enforce it somehow.
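
As a quick self-contained check of that FIFO assumption (illustrative only, not part of this PR): several coroutines contend for one tornado lock and we record the order in which they acquire it.

```python
import asyncio

import tornado.locks

lock = tornado.locks.Lock()
order = []


async def task(name):
    async with lock:               # later tasks queue up as waiters on the lock
        order.append(name)
        await asyncio.sleep(0.01)  # hold the lock briefly


async def main():
    await asyncio.gather(*(task(i) for i in range(5)))
    print(order)  # [0, 1, 2, 3, 4] if waiters are granted the lock in FIFO order


asyncio.run(main())
```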

The reason for this commit has less to do with needing retries and more to do with the fact that some actions in the UI require multiple commands to be sent (e.g. a git add followed by a git commit).

@kgryte (Member) commented Apr 6, 2020

@mlucool Yes, you are correct; from my reading, tornado uses something akin to a FIFO queue. In which case, the race condition I alluded to earlier is not applicable.

However, another concern is what happens if a user shuts down JupyterLab before an index.lock file is removed and the queued command(s) have executed.

Meaning, is it possible that one or more pending commands (e.g., one or more commit commands for "saving" changes) could fail to be executed if JupyterLab closes while waiting for an index.lock file to be removed?

If, instead of locking on the backend, where the locking behavior is hidden from the user, the extension erred immediately and displayed an error modal recommending that the user manually retry the action, then, IMO, while arguably more laborious, expectations could be better managed.

In short, I am largely concerned about all the ways that this change could cause the extension to fail. Are there edge cases that we are missing?

Lastly, I am leery about hardcoding intervals for wait times. Wait times would seem to be highly particular to a user's environment and the relevant repository.

For example, one of my side projects is rather large, and I frequently encounter the presence of an index.lock which sometimes takes upward of 30-60 seconds to disappear, depending on the current resource constraints on my laptop.

Personally, I don't have a good sense as to what could be considered reasonable.

@mlucool (Contributor, Author) commented Apr 6, 2020

However, another concern is what happens if a user shuts down a kernel before an index.lock file is removed and the queued command/commands has/have executed?

I thought extension-added routes and interactions are not done via a kernel?

All actions that consist of more than one command affecting index.lock (e.g. git add + commit in simple staging mode) fail because the commands happen too close together on NFS in my case. These would have to be made atomic from the UI's point of view, and the backend would have to enforce waiting and ordering between them. I'll also note that having more than one UI instance means you can have race conditions between them, which is another reason to put this in the backend.

Lastly, I am leery about hardcoding intervals for wait times. Wait times would seem to be highly particular to a user's environment and the relevant repository.

You are right, we should let users change these. There is no "right" amount of time to wait across very different environments. I think we should pick a good default on human timescales and then let advanced users change it. Should we add both MAX_WAIT_FOR_EXECUTE_S and MAX_WAIT_FOR_LOCK_S?
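
As a purely hypothetical sketch (not this PR's code), those two timeouts could be exposed as server-side configuration via traitlets; the class name, defaults, and help strings below are assumptions:

```python
from traitlets import Float
from traitlets.config import Configurable


class JupyterLabGit(Configurable):
    """Hypothetical configurable holding the git extension's wait limits."""

    max_wait_for_execute_s = Float(
        20.0,  # default chosen for illustration only
        help="Maximum time (seconds) a queued git command may wait for the execution lock.",
    ).tag(config=True)

    max_wait_for_lock_s = Float(
        5.0,  # default chosen for illustration only
        help="Maximum time (seconds) to wait for .git/index.lock to disappear.",
    ).tag(config=True)
```

Users on slower filesystems could then raise the limits from their Jupyter config, e.g. c.JupyterLabGit.max_wait_for_lock_s = 60.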

@saulshanabrook (Member):

These would have to be made atomic from the UI POV and the backend would have to enforce waiting and ordering between those.

It seems to make sense that if we are allowing a long-running process to take place on the server, then we should make that clear to the user. So we should add some way for them to see which actions are being run on the server, or block the UI while an action is running.

If that could be done after this PR is merged, instead of in this PR itself, that seems like it could be fine as well.

@mlucool (Contributor, Author) commented Apr 8, 2020

So we should add some way for them to see what actions are being run on the server, or block the UI while one action is being run.

This already happens for at least some commands (a popup with a spinner). I agree that we should have feedback for all commands to make it more clear to a user that something is being done.

FWIW, in practice this wait is still below human-perceptible scale for the cases I tested (fast network, NFS, smallish repo). Any case that would be on a human scale (e.g. the very large repos @kgryte pointed out) would either fail (e.g. git add and git commit would step on each other) or take approximately as long as it would from the CLI.

@kgryte (Member) commented Apr 10, 2020

@telamonian Do you have any thoughts on this?

Proposal:

  1. Modify this PR to make the retry timeouts configurable.
  2. Follow-up PR to suspend UI interactions until the pending command completes (e.g., a modal with a spinner). I believe this should be straightforward, as we just need to show a blocking UI element until an HTTP response is received.

The need for the latter is to prevent/dissuade the user from closing the JupyterLab server before pending commands have had a chance to complete (e.g., before an index.lock file is removed and git add && git commit run, thus potentially leading to the discarding of user changes).

@mlucool (Contributor, Author) commented Apr 13, 2020

I'm happy with @kgryte's proposal if that means @telamonian / @fcollonval will accept this PR. It's been stuck for a while, and it would be good to wrap up the work here.

@fcollonval (Member):

Thanks @mlucool for your patience. I'll merge this and open an issue with the enhancement proposal from @kgryte.

@fcollonval (Member):

@meeseeksdev backport to 0.11.x

meeseeksmachine pushed a commit to meeseeksmachine/jupyterlab-git that referenced this pull request May 28, 2020
fcollonval added a commit that referenced this pull request May 28, 2020: Backport PR #564 on branch 0.11.x (Fix race condition issues on slower FS)