
Synchronization loop doesn't monitor for context cancellation during scan operations #71

Closed
IngCr3at1on opened this issue Feb 26, 2019 · 13 comments

@IngCr3at1on

Firstly, I want to point out that I ran into this while trying to disable the sessions that are apparently the cause of my crash in #70; I wanted to ensure that the other sessions would continue to run until I had a chance to look at #70 again later.

When trying to pause the session, though, it just hangs on 'Pausing session ...'. I have seen this before, but it was months ago and I don't have any details on it. I expect this issue will require the same logs I need to debug #70, so I'll add more here after I've taken a look, but I wanted to throw something up in case others have seen this behavior.

@xenoscopic
Member

xenoscopic commented Feb 26, 2019

Can you tell me which protocols the session URLs are using? Local? SSH? Docker?

The best way to debug this hanging would be to run the daemon in an interactive mode with mutagen daemon run as described in #70, initiate the mutagen pause command and let it hang, and then do kill -ABRT <daemon-pid> (obtained, e.g., through ps aux | grep mutagen). Then, on the console where the daemon is running, a stack trace of every running Goroutine will be printed. If you can paste that here, it should be trivial to isolate the hang.
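For reference, the suggested procedure looks roughly like this (the `<daemon-pid>` placeholder is whatever PID `ps` reports for the daemon):

```shell
# Terminal 1: run the daemon interactively (as described in #70).
mutagen daemon run

# Terminal 2: trigger the hang, then dump stacks.
mutagen pause
ps aux | grep mutagen     # note the daemon's PID
kill -ABRT <daemon-pid>   # terminal 1 prints a stack trace of every Goroutine
```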

@IngCr3at1on
Author

IngCr3at1on commented Feb 26, 2019

@havoc-io I'm using a domain (not IP) URL over SSH with key authentication. I'll have to get back to you with the other data. Yes, I agree it should be trivial, but I still wanted to track it for you so we could try to solve it (as I've seen it in other cases outside of the crash mentioned in #70) 😄

@xenoscopic
Member

@IngCr3at1on Okay, cool, thanks for the info.

Additionally, can you tell me which OS you're running? My guess is that the synchronization halting operation is being performed, but it's waiting for the synchronization loop to exit and that's being blocked waiting for the underlying ssh process to terminate (or it's not detecting the broken pipe to the ssh process). The way that Go does asynchronous I/O can vary from platform to platform, so it'd be useful to know what you're running.

At the end of the day though, the daemon Goroutine stack traces will suss this out pretty quickly, so I look forward to having a peek at those.

@IngCr3at1on
Author

IngCr3at1on commented Feb 26, 2019

@havoc-io the host machine is running Debian Squeeze (using bash), while the agents are running on a MacBook and an Arch Linux desktop (both using zsh).

It is worth noting (though likely unrelated) that the host is a cloud based VM.

@xenoscopic
Member

@IngCr3at1on So was the pause hang due to the daemon crashing, or is this an unrelated issue? If the daemon has crashed, then the mutagen pause command should time out after a few seconds with a message saying Error: unable to connect to daemon: context deadline exceeded, but it sounds like it's hanging indefinitely?

@IngCr3at1on
Author

@havoc-io the pause is indefinite. In the event the crash does occur, then yes, I see an unable to connect to daemon message when trying to call pause.

What's interesting to me is that the pause only seems to block this way if a file scan is currently in progress (it occurs to me I didn't include that detail initially). I've been half looking at this in between work-related tasks, so I have not had a chance to run your suggested steps for this issue.

@xenoscopic
Member

@IngCr3at1on Okay, that explains it. The scanning operation is the only place where the synchronization loop doesn't monitor for cancellation during a blocking operation. It's usually not an issue since scans typically finish quickly, but it's clearly not ideal behavior. I'll add a fix in v0.9.0.

@xenoscopic xenoscopic changed the title Attempting to pause a session sometimes hangs indefinitely Synchronization loop doesn't monitor for context cancellation during scan operations Feb 27, 2019
@xenoscopic xenoscopic self-assigned this Feb 27, 2019
@xenoscopic xenoscopic added this to the v0.9.x milestone Feb 27, 2019
@IngCr3at1on
Author

@havoc-io can you provide any advice in the meantime on how I can force-terminate the hanging sessions that I know are going to crash because of memory restrictions?

@xenoscopic
Member

@IngCr3at1on I think your best bet would be to kill (maybe kill -9) the ssh (or docker) process that's being used as a transport. I think that should break the hang, but again, it really depends on the particular OS pipe implementation. It should be the process invoking mutagen-agent on the remote.
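Concretely, something along these lines (the grep pattern is an assumption; match it to however your transport shows up in `ps`):

```shell
# Find the local ssh process that is invoking mutagen-agent on the remote,
# then force-kill it to break the pipe.
ps aux | grep '[m]utagen-agent'
kill -9 <ssh-pid>
```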

@IngCr3at1on
Author

Well unfortunately that didn't work; the ssh session would not allow me to kill it O.o

I wound up killing the entire daemon and deleting everything associated with the offending session inside the $HOME/.mutagen directory and the broken session is now gone.

@xenoscopic
Member

Well unfortunately that didn't work; the ssh session would not allow me to kill it O.o

That's... bizarre :-/

Glad to hear you were able to get rid of it. I'll see if I can make the proposed change quickly and maybe backport it for v0.8.1 or something this week. Sorry for the delay.

@driskell

I had a similar issue where it was stuck scanning and would not terminate. The scan was hitting an error due to an absolute symbolic link on the remote, and it would then retry scanning perpetually. Unfortunately, this error did not propagate to the terminal running the project terminate, nor to the project start (which was also hanging indefinitely with a forced-sync message).

@xenoscopic xenoscopic modified the milestones: v0.11.x, v0.12.x Jan 5, 2021
@xenoscopic
Member

This should be fixed as of 6085451, available in Mutagen v0.11 and later.
