
Terminating sync session is stuck forever on Docker Desktop for Mac Edge #223

Closed
sprymiker opened this issue Aug 27, 2020 · 7 comments

@sprymiker
Which version of Mutagen are you using (mutagen version)?

0.12.0-beta1

Which operating system (platform/version/architecture) are you using?

macOS 10.14.6 (18G6020)

What is the issue that you're experiencing?

A terminating sync session is stuck forever.

What are the steps to reproduce the issue?

  1. Download and install https://desktop.docker.com/mac/edge/46980/Docker.dmg
  2. brew install mutagen-io/mutagen/mutagen-beta
  3. cd /tmp
  4. git clone https://github.com/mutagen-io/example-voting-app.git ./voting
  5. cd ./voting
  6. mutagen compose up -d
  7. mutagen compose down OR mutagen sync terminate vote

What is the expected result of these steps in the absence of the issue?

The compose project is stopped. The sync session is terminated.

What is the actual result?

Terminating forwarding sessions
Terminating session fwrd_VTKv9HCdxMLBQsaIQemxlGfuin3c1YaFtt73ChtqD2x...

And this hangs for hours.

Is there any other information that might be helpful?

Running:

$ mutagen compose --project-directory /Users/kalinin/b2c-demo-shop --project-name spryker_b2c_dev -f ./docker/deployment/default/docker-compose.yml down --remove-orphans
Terminating forwarding sessions
Terminating synchronization sessions                                            
Terminating session sync_5NXJcgyuBpRofHP0hdyra8INgkvMGWtZ0c4Rl8aMeqg...

In a different terminal, the list shows 2 identical sessions (not always 2). The status can vary between "Watching files" and "Error", but it does not matter in this case.

$ mutagen list
--------------------------------------------------------------------------------
Name: codebase
Identifier: sync_5NXJcgyuBpRofHP0hdyra8INgkvMGWtZ0c4Rl8aMeqg
Labels:
        io.mutagen.compose.daemon.identifier: 4VYO_F3UV_HN4Z_M46A_KY72_6ARZ_3KEN_6SST_NYMU_ELZM_ZLSA_MQBJ
        io.mutagen.compose.project.name: spryker_b2c_dev
Alpha:
        URL: /Users/kalinin/b2c-demo-shop/docker/deployment/default
        Connection state: Connected
Beta:
        URL: docker://spryker_b2c_dev_mutagen_1/volumes/spryker_b2c_dev_dev_data_sync
        Connection state: Connected
Status: Scanning files
--------------------------------------------------------------------------------
$ ps -a | grep mutagen
20028 ttys005    0:00.04 mutagen compose --project-directory /Users/kalinin/b2c-demo-shop --project-name spryker_b2c_dev -f ./docker/deployment/default/docker-compose.yml down --remove-orphans

mutagen daemon stop stops the daemon but leaves the .lock file behind. So mutagen daemon stop && rm ~/.mutagen/daemon/daemon.lock && mutagen daemon start helps; however, the problem persists.

The only way to fix the problem:
mutagen daemon stop && rm -rf ~/.mutagen

@xenoscopic
Member

Thanks for bringing this to my attention. I don't see anything in the recent Docker for Mac Edge release notes that would explain this, so I'll have to try to reproduce and debug it. I'm traveling at the moment, but I'll try to have a look as soon as possible. Do you know if it also happens on the 2.3.5.0 build of Docker for Mac Edge?

@sprymiker
Author

I have not tried 2.3.5.0, as my goal is a speedy development environment, and 2.3.5.0 does nothing to reach that goal; it just removes the :delegated mutagen feature :(.

I am really sorry, but I have no time to spare for such an experiment. However, it is so easy to reproduce that it should not be a problem for you to check it out with 2.3.5.0 in the background while you work on something else.

@PhilETaylor

This was my first try with mutagen compose after mutagen was removed from Docker for Mac Edge, and I get this same issue: terminating a session never ends. Using Docker for Mac Edge 2.3.5.0.

@marickvantuil

Having the same issue on Edge 2.3.5.0. I think (part of) the issue lies in the sidecar container. If I run docker-compose down, it stops all containers except the sidecar, and running mutagen compose down gets stuck again. If I run docker-compose down --remove-orphans (or stop the mutagenio/sidecar container manually) and then run mutagen compose down, it does shut down correctly.

@ryansch

ryansch commented Sep 9, 2020

I can still repro this on edge 2.3.6.1.

xenoscopic added a commit that referenced this issue Sep 10, 2020
We were using SIGKILL for transport process termination, but this led to
zombie processes in certain cases (e.g. in issue #223). This commit
switches to using SIGTERM, which is not only more idiomatic, but also
doesn't have this problem. Of course, SIGTERM has some downsides, namely
that transport processes can ignore it and may not forward it properly,
but that doesn't really happen with our transports (SSH and Docker), and
even if it did we would still just block in a wait call until someone
kill -9'd the offending process. The only reason we were using SIGKILL
to begin with was because it was offered by a convenience method.

On Windows, we stick with os.Process.Kill, but this isn't sending a
signal; it's using TerminateProcess.

Signed-off-by: Jacob Howard <jacob@havoc.io>
@xenoscopic
Member

Thanks for all of the reports here, they were very helpful.

I think I've diagnosed the issue: Docker for Mac's docker executable starts another private executable (com.docker.cli) as a subprocess to do its actual work. I'm not sure why this design is used, but the problem arises because Mutagen sends SIGKILL to its transport processes when terminating synchronization/forwarding sessions. That works fine with a single child process, but here it causes the docker process to become a zombie that never exits: it sits waiting for com.docker.cli to exit, but the SIGKILL isn't (and probably can't be) forwarded to the child process.

I'm sorry I didn't catch this during earlier testing; I primarily use a standalone docker executable from Homebrew.

It seems that things work when the sidecar container is shut down because the com.docker.cli process terminates once the container does.

So there are two solutions: either switch to a forwardable termination signal (e.g. the more appropriate SIGTERM) or forward the SIGKILL to the entire process group via killpg. Using SIGTERM is more idiomatic, but it relies on the child process behaving well when it comes to handling and forwarding signals. Using SIGKILL on the process group is more effective, but it's a rather harsh design.

The current use of SIGKILL is just an artifact of using Go's os.Process.Kill convenience function; I don't remember it being an intentional choice. Given that Mutagen has a fairly limited set of transport executables (namely ssh and docker) and they behave well, I think switching to SIGTERM makes the most sense.

I've just released v0.12.0-beta2 with this fix (as well as an update to Go 1.15.2). Please let me know if this issue still occurs.

@ryansch

ryansch commented Sep 11, 2020

@havoc-io I can confirm that this is fixed on edge 2.3.6.1 with mutagen v0.12.0-beta2. 🎉

xenoscopic added a commit that referenced this issue Oct 20, 2020
(Same commit message as the Sep 10 commit above; this is a backport from v0.12.x.)

Signed-off-by: Jacob Howard <jacob@havoc.io>
xenoscopic added a commit that referenced this issue Feb 3, 2022
We've been bitten again by golang/go#23019, so rather than a quick hack,
it's time to properly fix agent termination signaling. While the
original plan was to use killpg and Windows job objects to manage agent
process hierarchies, it quickly became clear when trying to implement
that behavior that the APIs needed to accomplish it simply weren't there
on POSIX or Windows. Moreover, it became even less clear what the
correct signaling and forceful termination mechanisms should be,
especially since Mutagen has no insight into transport process
hierarchies. Thus, Mutagen now takes a staged approach to transport
process termination, first waiting, then closing standard input, then
(on POSIX) sending SIGTERM, and finally forcing termination. It relies
on transport processes to correctly forward these mechanisms to the
Mutagen agent and to manage their own internal process hierarchies
accordingly once the Mutagen agent terminates.

This commit also takes the opportunity to impose a size limit on the
in-memory error buffer used to capture transport errors after a
handshake failure.

Updates #223
Updates mutagen-io/mutagen-compose#11

Signed-off-by: Jacob Howard <jacob@mutagen.io>
xenoscopic added a commit that referenced this issue Feb 4, 2022
We've already seen several manifestations of golang/go#23019, so this
commit refactors the way that agent input, output, and error streams are
managed, as well as the way that agent process termination is signaled.

Historically (in 6c1a47c), we avoided closing
standard input/output pipes because they were blocking, meaning that a
close wouldn't preempt a read/write and that the close itself could
potentially block. However, this hasn't been the case since Go 1.9, when
os.File was switched to use polling I/O (at least for pipes and other
pollable files).

As such, we can now use closure of standard input as a signal to agent
processes (via their intermediate transport processes) that they should
terminate. Failing that, we still fall back to signal-based termination,
but this standard input closure mechanism is particularly important on
Windows, where no "soft" signaling mechanism (like SIGTERM) exists and
transport process termination via TerminateProcess often triggers
golang/go#23019. This is especially problematic with Docker Desktop,
where an intermediate com.docker.cli process is used underneath the
primary Docker CLI, and standard input closure is the only reliable
termination signaling mechanism.

Just to clarify, there are mechanisms like WM_CLOSE and CTRL_CLOSE_EVENT
on Windows, which some runtimes (such as Go's) will convert to a
SIGTERM, but there's no telling how intermediate transport processes
will interpret these messages. They don't necessarily have the same
semantics as SIGTERM.

And, just in case none of our signaling mechanisms works, we now avoid
using os/exec.Cmd's forwarding Goroutines entirely, meaning that its
Wait method will close all pipes as soon as the child process is
terminated.

As part of this refactor, I've also looked at switching to a more
systematic approach to manage the process hierarchies that could
potentially be generated by transport processes. Things like killpg on
POSIX or job objects on Windows could theoretically facilitate such
management. However, the fact of the matter is that there's simply no
way to reliably enforce termination of such hierarchies and, more
importantly, no way for Mutagen to know what termination signaling
mechanism would be appropriate for any intermediate processes.
Essentially, we just have to rely on transport processes to either
correctly forward standard input closure (especially on Windows)
and/or correctly forward SIGTERM on POSIX. But, if they don't, we will
forcibly terminate them and any associated resources in the Mutagen
daemon. If their subprocesses linger on afterward, that's a bug in the
transport process, not Mutagen.

This commit also takes the opportunity to impose a size limit on the
in-memory error buffer used to capture transport errors after a
handshake failure.

Updates #223
Updates mutagen-io/mutagen-compose#11

Signed-off-by: Jacob Howard <jacob@mutagen.io>
xenoscopic added a commit that referenced this issue Feb 4, 2022
(Same commit message as the Feb 4 commit above; backport of 8556e07.)

Signed-off-by: Jacob Howard <jacob@mutagen.io>
xenoscopic added a commit that referenced this issue Feb 4, 2022
(Same commit message as the Feb 4 commit above; backport of 8556e07.)

Signed-off-by: Jacob Howard <jacob@mutagen.io>
coryb pushed a commit to coryb/mutagen that referenced this issue Feb 23, 2022
(Same commit message as the Feb 4 commit above.)

Signed-off-by: Jacob Howard <jacob@mutagen.io>