
Terminating sync session is stuck forever on Docker Desktop for Mac Edge #223

Closed
sprymiker opened this issue Aug 27, 2020 · 7 comments

@sprymiker
Which version of Mutagen are you using (mutagen version)?

0.12.0-beta1

Which operating system (platform/version/architecture) are you using?

macOS 10.14.6 (18G6020)

What is the issue that you're experiencing?

A terminating sync session is stuck forever.

What are the steps to reproduce the issue?

  1. Download and install https://desktop.docker.com/mac/edge/46980/Docker.dmg
  2. brew install mutagen-io/mutagen/mutagen-beta
  3. cd /tmp
  4. git clone https://github.com/mutagen-io/example-voting-app.git ./voting
  5. cd ./voting
  6. mutagen compose up -d
  7. mutagen compose down OR mutagen sync terminate vote

What is the expected result of these steps in the absence of the issue?

The compose project is stopped. The sync session is terminated.

What is the actual result?

Terminating forwarding sessions
Terminating session fwrd_VTKv9HCdxMLBQsaIQemxlGfuin3c1YaFtt73ChtqD2x...

And this hangs for hours.

Is there any other information that might be helpful?

Running:

$ mutagen compose --project-directory /Users/kalinin/b2c-demo-shop --project-name spryker_b2c_dev -f ./docker/deployment/default/docker-compose.yml down --remove-orphans
Terminating forwarding sessions
Terminating synchronization sessions                                            
Terminating session sync_5NXJcgyuBpRofHP0hdyra8INgkvMGWtZ0c4Rl8aMeqg...

In a different terminal, the list shows 2 identical sessions (not always 2). The status can vary between "Watching files" and "Error", but it does not matter in this case.

$ mutagen list
--------------------------------------------------------------------------------
Name: codebase
Identifier: sync_5NXJcgyuBpRofHP0hdyra8INgkvMGWtZ0c4Rl8aMeqg
Labels:
        io.mutagen.compose.daemon.identifier: 4VYO_F3UV_HN4Z_M46A_KY72_6ARZ_3KEN_6SST_NYMU_ELZM_ZLSA_MQBJ
        io.mutagen.compose.project.name: spryker_b2c_dev
Alpha:
        URL: /Users/kalinin/b2c-demo-shop/docker/deployment/default
        Connection state: Connected
Beta:
        URL: docker://spryker_b2c_dev_mutagen_1/volumes/spryker_b2c_dev_dev_data_sync
        Connection state: Connected
Status: Scanning files
--------------------------------------------------------------------------------
$ ps -a | grep mutagen
20028 ttys005    0:00.04 mutagen compose --project-directory /Users/kalinin/b2c-demo-shop --project-name spryker_b2c_dev -f ./docker/deployment/default/docker-compose.yml down --remove-orphans

mutagen daemon stop stops the daemon but leaves the .lock file behind. So mutagen daemon stop && rm ~/.mutagen/daemon/daemon.lock && mutagen daemon start helps; however, the problem persists.

The only way to fix the problem:
mutagen daemon stop && rm -rf ~/.mutagen

@xenoscopic
Member

Thanks for bringing this to my attention. I don't see anything in the recent Docker for Mac Edge release notes that would explain this, so I'll have to try to reproduce and debug it. I'm traveling at the moment, but I'll try to have a look as soon as possible. Do you know if it also happens on the 2.3.5.0 build of Docker for Mac Edge?

@sprymiker
Author

I have not tried 2.3.5.0, as my goal is a speedy development environment, and 2.3.5.0 does nothing to reach that goal; it just removes the :delegated mutagen feature :(.

I am really sorry, but I have no time to spare for such an experiment. However, it is so easy to reproduce that it should not be a problem for you to check it out with 2.3.5.0 in the background while you work on something else.

@PhilETaylor

This was my first try with mutagen compose after mutagen was removed from Docker for Mac Edge, and I get this same issue: terminating a session never ends. Using Docker for Mac Edge 2.3.5.0.

@marickvantuil

Having the same issue on Edge 2.3.5.0. I think (part of) the issue lies in the sidecar container. If I run docker-compose down, it stops all containers except the sidecar, and running mutagen compose down gets stuck again. If I run docker-compose down --remove-orphans (or stop the mutagenio/sidecar container manually) and then run mutagen compose down, it does shut down correctly.

@ryansch

ryansch commented Sep 9, 2020

I can still repro this on edge 2.3.6.1.

xenoscopic added a commit that referenced this issue Sep 10, 2020
We were using SIGKILL for transport process termination, but this led to
zombie processes in certain cases (e.g. in issue #223). This commit
switches to using SIGTERM, which is not only more idiomatic, but also
doesn't have this problem. Of course, SIGTERM has some downsides, namely
that transport processes can ignore it and may not forward it properly,
but that doesn't really happen with our transports (SSH and Docker), and
even if it did we would still just block in a wait call until someone
kill -9'd the offending process. The only reason we were using SIGKILL
to begin with was because it was offered by a convenience method.

On Windows, we stick with os.Process.Kill, but this isn't sending a
signal; it's using TerminateProcess.

Signed-off-by: Jacob Howard <jacob@havoc.io>
@xenoscopic
Member

Thanks for all of the reports here, they were very helpful.

I think I've diagnosed the issue: Docker for Mac's docker executable starts another private executable (com.docker.cli) as a subprocess to do its actual work. I'm not sure why this design is used, but the problem arises because Mutagen sends SIGKILL to its transport processes when terminating synchronization/forwarding sessions. That works fine with a single child process, but here it causes the docker process to become a zombie that never exits: it sits waiting for com.docker.cli to exit, but the SIGKILL isn't (and probably can't be) forwarded to the child process.

I'm sorry I didn't catch this during earlier testing; I primarily use a standalone docker executable from Homebrew.

It seems that things work when the sidecar container is shut down because the com.docker.cli process terminates once the container does.

So there are two solutions: either switch to a forwardable termination signal (e.g. the more appropriate SIGTERM) or forward the SIGKILL to the entire process group via killpg. Using SIGTERM is more idiomatic, but it relies on the child process behaving well when it comes to handling and forwarding signals. Using SIGKILL on the process group is more effective, but it's a rather harsh design.

The current use of SIGKILL is just an artifact of using Go's os.Process.Kill convenience function; I don't remember it being an intentional choice. Given that Mutagen has a fairly limited set of transport executables (namely ssh and docker) and they behave well, I think switching to SIGTERM makes the most sense.

I've just released v0.12.0-beta2 with this fix (as well as an update to Go 1.15.2). Please let me know if this issue still occurs.

@ryansch

ryansch commented Sep 11, 2020

@havoc-io I can confirm that this is fixed on edge 2.3.6.1 with mutagen v0.12.0-beta2. 🎉

xenoscopic added a commit that referenced this issue Oct 20, 2020
(Same commit message as the Sep 10 commit above; this is a backport from v0.12.x.)

Signed-off-by: Jacob Howard <jacob@havoc.io>
xenoscopic added a commit that referenced this issue Feb 3, 2022
We've been bitten again by golang/go#23019, so rather than a quick hack,
it's time to properly fix agent termination signaling. While the
original plan was to use killpg and Windows job objects to manage agent
process hierarchies, it quickly became clear when trying to implement
that behavior that the APIs needed to accomplish it simply weren't there
on POSIX or Windows. Moreover, it became even less clear what the
correct signaling and forceful termination mechanisms should be,
especially since Mutagen has no insight into transport process
hierarchies. Thus, Mutagen now takes a staged approach to transport
process termination, first waiting, then closing standard input, then
(on POSIX) sending SIGTERM, and finally forcing termination. It relies
on transport processes to correctly forward these mechanisms to the
Mutagen agent and to manage their own internal process hierarchies
accordingly once the Mutagen agent terminates.

This commit also takes the opportunity to impose a size limit on the
in-memory error buffer used to capture transport errors after a
handshake failure.

Updates #223
Updates mutagen-io/mutagen-compose#11

Signed-off-by: Jacob Howard <jacob@mutagen.io>
xenoscopic added a commit that referenced this issue Feb 4, 2022
We've already seen several manifestations of golang/go#23019, so this
commit refactors the way that agent input, output, and error streams are
managed, as well as the way that agent process termination is signaled.

Historically (in 6c1a47c), we avoided closing
standard input/output pipes because they were blocking, meaning that a
close wouldn't preempt a read/write and that the close itself could
potentially block. However, this hasn't been the case since Go 1.9, when
os.File was switched to use polling I/O (at least for pipes and other
pollable files).

As such, we can now use closure of standard input as a signal to agent
processes (via their intermediate transport processes) that they should
terminate. Failing that, we still fall back to signal-based termination,
but this standard input closure mechanism is particularly important on
Windows, where no "soft" signaling mechanism (like SIGTERM) exists and
transport process termination via TerminateProcess often triggers
golang/go#23019. This is especially problematic with Docker Desktop,
where an intermediate com.docker.cli process is used underneath the
primary Docker CLI, and standard input closure is the only reliable
termination signaling mechanism.

Just to clarify, there are mechanisms like WM_CLOSE and CTRL_CLOSE_EVENT
on Windows, which some runtimes (such as Go's) will convert to a
SIGTERM, but there's no telling how intermediate transport processes
will interpret these messages. They don't necessarily have the same
semantics as SIGTERM.

And, just in case none of our signaling mechanisms works, we now avoid
using os/exec.Cmd's forwarding Goroutines entirely, meaning that its
Wait method will close all pipes as soon as the child process is
terminated.

As part of this refactor, I've also looked at switching to a more
systematic approach to manage the process hierarchies that could
potentially be generated by transport processes. Things like killpg on
POSIX or job objects on Windows could theoretically facilitate such
management. However, the fact of the matter is that there's simply no
way to reliably enforce termination of such hierarchies and, more
importantly, no way for Mutagen to know what termination signaling
mechanism would be appropriate for any intermediate processes.
Essentially, we just have to rely on transport processes to either
correctly forward standard input closure (especially on Windows)
and/or correctly forward SIGTERM on POSIX. But, if they don't, we will
forcibly terminate them and any associated resources in the Mutagen
daemon. If their subprocesses linger on afterward, that's a bug in the
transport process, not Mutagen.

This commit also takes the opportunity to impose a size limit on the
in-memory error buffer used to capture transport errors after a
handshake failure.

Updates #223
Updates mutagen-io/mutagen-compose#11

Signed-off-by: Jacob Howard <jacob@mutagen.io>
xenoscopic added a commit that referenced this issue Feb 4, 2022
(Same commit message as the Feb 4 commit above; backport of 8556e07.)

Signed-off-by: Jacob Howard <jacob@mutagen.io>
xenoscopic added a commit that referenced this issue Feb 4, 2022
(Same commit message as the Feb 4 commit above; backport of 8556e07.)

Signed-off-by: Jacob Howard <jacob@mutagen.io>
coryb pushed a commit to coryb/mutagen that referenced this issue Feb 23, 2022
(Same commit message as the Feb 4 commit above.)

Signed-off-by: Jacob Howard <jacob@mutagen.io>