
Fix race condition issues on slower FS #564

Merged: 3 commits, merged Apr 20, 2020

Conversation

@mlucool (Contributor) commented Mar 17, 2020

This adds locking before executing commands. While this may add locking in some cases that are not actually races (e.g. commands that do not need to acquire index.lock), it solves issues seen when using this extension with a git directory on NFS.
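
For context, here is a minimal sketch of the approach, assuming (as this PR does) that every git command funnels through a single async execute helper; the names and the subprocess handling below are illustrative, not the PR's exact code:

```python
import subprocess

import tornado.locks
from tornado.ioloop import IOLoop

# One process-wide lock shared by every git invocation.
execution_lock = tornado.locks.Lock()


async def execute(cmdline, cwd):
    # Serialize git commands so that concurrent requests cannot race on
    # .git/index.lock, which is especially problematic on NFS.
    async with execution_lock:
        return await IOLoop.current().run_in_executor(
            None,
            lambda: subprocess.run(cmdline, cwd=cwd, capture_output=True, text=True),
        )
```

Any request handler that awaits execute then queues behind whatever git command is already in flight.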

@mlucool (Contributor, Author) commented Mar 17, 2020

Let me know if you generally like this. I'm not sure why it can't find tornado.locks, as it has been there a while: https://www.tornadoweb.org/en/stable/locks.html

@telamonian (Member) commented Mar 18, 2020

Nice work. I like the thinking behind this PR. We will need a better description of the error/issue this is trying to solve (e.g. a specimen of the actual error message; it might be nicest if it were written down in a separate issue). I'd also like to better understand what locking (if any) the git CLI itself does.

Ideally, this PR should also include a test to prove that it fixes the problem it's trying to solve. One idea for the test: explicitly lock a resource for ~5 seconds, then immediately start another operation that would create a race condition if locking is not present.

@mlucool (Contributor, Author) commented Mar 18, 2020

I'd also like to better understand what locking (if any) the git CLI itself does.

Git locking: https://docs.microsoft.com/en-us/azure/devops/repos/git/git-index-lock?view=azure-devops

Waiting on locks: https://stackoverflow.com/a/36364687

So we could improve this by telling execute which commands care about the lock file and which do not (e.g. I don't believe status does), but I think that's an over-optimization at this point and error prone.
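
For illustration, a hedged sketch of the "wait for index.lock" idea from the links above; apart from the lock file name, everything here (function name, timeouts) is assumed rather than taken from this PR:

```python
import os

from tornado import gen


async def wait_for_index_lock(repo_path, max_wait_s=5.0, poll_s=0.1):
    """Poll until .git/index.lock disappears, or give up after max_wait_s."""
    lock_path = os.path.join(repo_path, ".git", "index.lock")
    waited = 0.0
    while os.path.exists(lock_path):
        if waited >= max_wait_s:
            raise TimeoutError("git index.lock was not released in time")
        await gen.sleep(poll_s)  # yield to the IOLoop instead of blocking it
        waited += poll_s
```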

We will need a better description of the error/issue this is trying to solve

I think #555 is the same issue. Symptoms are at least

this PR should also include a test to prove that it fixes the problem it's trying to solve

I'll add one

@mlucool (Contributor, Author) commented Mar 18, 2020

@telamonian do you have an example where execute is tested? pytest --cov=jupyterlab_git --cov-report html jupyterlab_git makes it seem like we always mock that out

@telamonian (Member):

do you have an example where execute is tested? pytest --cov=jupyterlab_git --cov-report html jupyterlab_git makes it seem like we always mock that out

Hmm, not off the top of my head. Frederic's the expert, since he wrote execute and most of the related code (and the related tests). @fcollonval Any ideas?

@mlucool mentioned this pull request Mar 19, 2020

@telamonian (Member):

I thought of a potential concern: should read-only ops (in particular git status) ignore locking? Basically, I'm wondering if there are any potential interactions between this PR and how our UI updates.

@mlucool (Contributor, Author) commented Mar 20, 2020

Per my comment:

While this may add locking in some cases that are not a race (i.e. things that do not need to acquire index.lock)

Yes, git status will now unnecessarily wait for the lock. We can trivially remove the locking in some cases (i.e. opt out), but that feels like a premature optimization based on my local testing. If you feel strongly, I'm happy to make this change.

@mlucool (Contributor, Author) commented Mar 21, 2020

@telamonian let me know what you'd like me to do here to get this merged so that it's included in 0.10.

@fcollonval (Member):

do you have an example where execute is tested? pytest --cov=jupyterlab_git --cov-report html jupyterlab_git makes it seem like we always mock that out

Hmm, not off the top of my head. Frederic's the expert, since he wrote execute and most of the related code (and the related tests). @fcollonval Any ideas?

Hey, sorry for the delay.
Unfortunately, as stated in the (out-of-date) roadmap, there are no integration tests, meaning that all git commands are mocked. When I introduced the asynchronous calls, I chose to mock the execute function directly.

But you could test your nice feature by calling the execute function directly and mocking run_in_executor; the definition to patch is tornado.ioloop.IOLoop.run_in_executor.
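
A rough, self-contained sketch of that patching approach (Frédéric notes below that the real test ended up a bit more involved); the fake return value here is an assumption about what execute expects back from the executor:

```python
from unittest.mock import patch

import tornado.ioloop
import tornado.testing


class TestRunInExecutorPatch(tornado.testing.AsyncTestCase):
    @tornado.testing.gen_test
    async def test_patched_executor_short_circuits(self):
        async def fake_run_in_executor(io_loop, executor, func, *args):
            # Stand in for the thread-pool call; pretend the git process exited cleanly.
            return 0, "mocked stdout", ""

        with patch("tornado.ioloop.IOLoop.run_in_executor", new=fake_run_in_executor):
            # Anything routed through run_in_executor now gets the fake result,
            # so execute() could be exercised here without spawning a real git.
            result = await tornado.ioloop.IOLoop.current().run_in_executor(None, print)
            self.assertEqual(result, (0, "mocked stdout", ""))
```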

@fcollonval (Member) commented Mar 25, 2020

Testing is a bit more complicated than what I said, but this commit shows how to start achieving it: fcollonval@6ede2a8

[Review comment on jupyterlab_git/git.py: outdated, resolved]
@fcollonval (Member):

One additional comment:

The argument cwd of execute should now be mandatory:

-    cwd: "Optional[str]" = None,
+    cwd: str,

@telamonian added this to the 0.20.0 milestone Mar 26, 2020
@telamonian (Member) commented Mar 26, 2020

I'm targeting this for the next release (which should be 0.20.0). Once this PR is pulled in, we'll also backport it to the (jlab 1.0 compatible) 0.11.0 release.

Marc Udoff added 2 commits March 26, 2020 13:57
This adds locking before executing commands. While this may
add locking in some cases that are not a race (i.e. things that
do not need to acquire index.lock), this solves issues related to
using this with a git directory on NFS.
@mlucool (Contributor, Author) commented Mar 26, 2020

@fcollonval this should be all set now. I'll admit the unit test is mediocre because we don't have integration testing. It does prove that execute will sleep if a lock exists, and it simulates a command failing when a lock exists.

@mlucool (Contributor, Author) commented Mar 26, 2020

@telamonian

I'm targetting this for the next release (which should be 0.20.0). Once this PR is pulled in we'll also backport it to (jlab 1.0 compatible) 0.11.0

There is no need to backport this to 0.11 for me at this time. I'm hoping to be on 2.x in the near future, but I'll let you know if something changes. If you want to backport anyway, it may be a good idea.

Also, let me know if you want me to handle the cases where we know a command does not touch index.lock.

@fcollonval (Member) left a comment:

@mlucool

Thanks for the update and the test. Here is a proposal for a better, less-mocked test. The idea is simply to mock sleep so that it checks the lock status and then removes the lock file.
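
For illustration, a self-contained pytest-style sketch of that idea (not the actual test added in this PR; wait_for_index_lock below is a hypothetical synchronous stand-in for the extension's real wait loop):

```python
import os
import time
from unittest.mock import patch


def wait_for_index_lock(git_dir, max_wait_s=5.0, poll_s=0.1):
    """Stand-in for the real wait loop: block until index.lock disappears."""
    lock_path = os.path.join(git_dir, "index.lock")
    waited = 0.0
    while os.path.exists(lock_path):
        if waited >= max_wait_s:
            raise TimeoutError("index.lock still present")
        time.sleep(poll_s)
        waited += poll_s


def test_wait_returns_once_lock_is_removed(tmp_path):
    git_dir = tmp_path / ".git"
    git_dir.mkdir()
    lock_file = git_dir / "index.lock"
    lock_file.touch()  # simulate another git process holding the lock

    def fake_sleep(_seconds):
        # Instead of actually waiting, check the lock was seen, then release it.
        assert lock_file.exists()
        lock_file.unlink()

    with patch("time.sleep", side_effect=fake_sleep):
        wait_for_index_lock(str(git_dir))  # returns instead of timing out

    assert not lock_file.exists()
```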

[Review comment on jupyterlab_git/tests/test_execute.py: outdated, resolved]
Co-Authored-By: Frédéric Collonval <fcollonval@gmail.com>
@fcollonval (Member) left a comment:

Thanks @mlucool for this PR and the discussion.

I won't merge it immediately, so that @telamonian can have a look.

@kgryte (Member) commented Mar 28, 2020

It is not clear to me that we should continually retry until a command succeeds when an index.lock is already present, as opposed to parsing the error message and displaying a dialog in the UI that instructs the user to retry the command after waiting a specified period of time.

The main issue I see with this PR is the possibility of a race condition. We don't "lock" the UI to prevent further commands, and, from my reading, we run the risk that the pending command could be executed after a subsequent command (e.g., another commit).

If we want to add retry logic, we should presumably lock the UI to prevent further interactions. IMO, that is not particularly desirable, in comparison to pushing the retry to the user, which is exactly what happens when using Git at the command-line.

@mlucool (Contributor, Author) commented Mar 28, 2020

It is not clear to me that we should continually retry until command success in the event that an index.lock is already present

We only loop while waiting for the lock; we do not rerun the command.

we run the risk that the pending command could be executed after a subsequent command (e.g., another commit).

This enforces that each command waits on execution_lock.acquire. I assumed, perhaps wrongly, that waiters acquire the lock in FIFO order and thus ordering is preserved. If this is not true, we should enforce it somehow.
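
As a quick self-contained check of that FIFO assumption (illustrative only, not part of this PR): several coroutines contend for one tornado lock and we record the order in which they acquire it.

```python
import asyncio

import tornado.locks

lock = tornado.locks.Lock()
order = []


async def task(name):
    async with lock:               # later tasks queue up as waiters on the lock
        order.append(name)
        await asyncio.sleep(0.01)  # hold the lock briefly


async def main():
    await asyncio.gather(*(task(i) for i in range(5)))
    print(order)  # [0, 1, 2, 3, 4] if waiters are granted the lock in FIFO order


asyncio.run(main())
```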

The reason for this commit has less to do with needing retries and more to do with the fact that some actions in the UI require multiple commands to be sent (e.g. a git add followed by a git commit).

@kgryte (Member) commented Apr 6, 2020

@mlucool Yes, you are correct; from my reading, tornado uses something akin to a FIFO queue. In which case, the race condition I alluded to earlier is not applicable.

However, another concern is what happens if a user shuts down JupyterLab before an index.lock file is removed and the queued command(s) have executed.

Meaning, is it possible that one or more pending commands (e.g., one or more commit commands for "saving" changes) could fail to be executed if JupyterLab closes while waiting for an index.lock file to be removed?

If, instead of locking on the backend, where the locking behavior is hidden from the user, the extension erred immediately and displayed an error modal recommending that the user manually retry the action, then, IMO, while arguably more laborious, expectations could be better managed.

In short, I am largely concerned about all the ways that this change could cause the extension to fail. Are there edge cases that we are missing?

Lastly, I am leery about hardcoding intervals for wait times. Wait times would seem to be highly particular to a user's environment and the relevant repository.

For example, one of my side projects is rather large, and I frequently encounter the presence of an index.lock which sometimes takes upward of 30-60 seconds to disappear, depending on the current resource constraints on my laptop.

Personally, I don't have a good sense as to what could be considered reasonable.

@mlucool (Contributor, Author) commented Apr 6, 2020

However, another concern is what happens if a user shuts down a kernel before an index.lock file is removed and the queued command/commands has/have executed?

I thought extension-added routes and interactions are not done via a kernel?

All actions that consist of more than one command affecting index.lock (e.g. git add + commit in simple staging mode) fail because the commands happen too close together on NFS in my case. These would have to be made atomic from the UI's point of view, and the backend would have to enforce waiting and ordering between them. I'll also note that having more than one UI instance means you can have race conditions between them, which is another reason to put this in the backend.

Lastly, I am leery about hardcoding intervals for wait times. Wait times would seem to be highly particular to a user's environment and the relevant repository.

You are right, we should let users change these. There is no "right" amount of time to wait across very different environments. I think we should pick a good default on human timescales and then let advanced users change it. Should we add both MAX_WAIT_FOR_EXECUTE_S and MAX_WAIT_FOR_LOCK_S?
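
As a purely hypothetical sketch (not this PR's code), those two timeouts could be exposed as server-side configuration via traitlets; the class name, defaults, and help strings below are assumptions:

```python
from traitlets import Float
from traitlets.config import Configurable


class JupyterLabGit(Configurable):
    """Hypothetical configurable holding the git extension's wait limits."""

    max_wait_for_execute_s = Float(
        20.0,  # default chosen for illustration only
        help="Maximum time (seconds) a queued git command may wait for the execution lock.",
    ).tag(config=True)

    max_wait_for_lock_s = Float(
        5.0,  # default chosen for illustration only
        help="Maximum time (seconds) to wait for .git/index.lock to disappear.",
    ).tag(config=True)
```

Users on slower filesystems could then raise the limits from their Jupyter config, e.g. c.JupyterLabGit.max_wait_for_lock_s = 60.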

@saulshanabrook (Member):

These would have to be made atomic from the UI POV and the backend would have to enforce waiting and ordering between those.

It seems to make sense that if we are allowing a long-running process to take place on the server, then we should make that clear to the user. So we should add some way for them to see which actions are being run on the server, or block the UI while an action is running.

If that could be done after this PR is merged, instead of in this PR itself, that seems like it could be fine as well.

@mlucool (Contributor, Author) commented Apr 8, 2020

So we should add some way for them to see what actions are being run on the server, or block the UI while one action is being run.

This already happens for at least some commands (a popup with a spinner). I agree that we should have feedback for all commands to make it more clear to a user that something is being done.

FWIW, in practice this wait is still below human-perceptible scale for the cases I tested (fast network, NFS, smallish repo). Any case that would be on a human scale (e.g. the very large repos @kgryte pointed out) would either fail (e.g. git add and git commit would step on each other) or take approximately as long as it would from the CLI.

@kgryte (Member) commented Apr 10, 2020

@telamonian Do you have any thoughts on this?

Proposal:

  1. Modify this PR to make the retry timeouts configurable.
  2. Follow-up PR to suspend UI interactions until the pending command completes (e.g., a modal with a spinner). I believe this should be straightforward, as we just need to show a blocking UI element until an HTTP response is received.

The need for the latter is to prevent/dissuade the user from closing the JupyterLab server before pending commands have had a chance to complete (e.g., before an index.lock file is removed and git add && git commit run, thus potentially leading to the discarding of user changes).

@mlucool (Contributor, Author) commented Apr 13, 2020

I'm happy with @kgryte's proposal if that means @telamonian / @fcollonval will accept this PR. It's been stuck for a while, and it would be good to wrap up the work here.

@fcollonval (Member):

Thanks @mlucool for your patience. I'll merge this and open an issue with the enhancement proposal from @kgryte.

@fcollonval (Member):

@meeseeksdev backport to 0.11.x

meeseeksmachine pushed a commit to meeseeksmachine/jupyterlab-git that referenced this pull request May 28, 2020
fcollonval added a commit that referenced this pull request May 28, 2020: Backport PR #564 on branch 0.11.x (Fix race condition issues on slower FS)