
mutagen-compose down fails #11

Closed
darrylkuhn opened this issue Jan 19, 2022 · 14 comments
@darrylkuhn

I'm new to Mutagen and Mutagen Compose but have been able to get things up and running (what an improvement in our hot module reloading times compared to bind mounts!). I am running into one issue though (with mutagen-compose, I think). I bring the orchestration up with mutagen-compose --project-name vue up and everything works as expected; however, when I run mutagen-compose --project-name vue down, the vue container comes down but the mutagen sidecar and sync session just keep running:

 - Container vue-mutagen-1  Stopping                                                                             336.0s
 - Container vue-vue-1      Removed                                                                               11.6s
 - Mutagen                  Pausing session sync_afv4HMEvIQedDHD9zYR5nCQm7gokr7gazBO0BU8xtz6...                  336.0s

As far as I can tell, the SIGTERM from the mutagen-compose down command is not making it to the container. When I issue docker container stop vue-mutagen-1, the rest of the orchestration comes down (network, etc.) and the sync is stopped and removed.

I am running:

  • Windows 11 Enterprise 21H2 (OS build 22000.434)
  • Mutagen version 0.13.0
  • Compose version v2.2.3
  • Docker version v20.10.7 (backed by WSL 2)

And for reference here is the mutagen compose file:

# This compose file must be run using mutagen-compose - please be sure you have
# it installed (see: https://mutagen.io/documentation/orchestration/compose)
version: '3.4'

services:
  vue:
    image: "node:14-alpine"
    volumes:
      # use a mutagen sync to address performance issues of bind mounts
      - "vuesrc:/web"
      # write node_modules to a named volume to reduce sync io
      - "node_modules:/web/node_modules" 
    working_dir: /web
    # We use the existence of the /web/node_modules/.bin directory as a proxy
    # for whether or not node modules are already installed. The command below
    # tests whether it exists and, if not, runs npm install (which takes a
    # while); in either case it then runs the serve command.
    command:
      - /bin/sh
      - -c
      - |
        test -d /web/node_modules/.bin || npm install
        npm run serve
    ports:
     - "8080:8080"
    environment:
     # Required to support hot module reloading
     - CHOKIDAR_USEPOLLING=true
x-mutagen:
  sync:
    defaults:
      ignore:
        vcs: true
        paths:
          - "/node_modules"
    vuesrc:
      alpha: "."
      beta: "volume://vuesrc"
volumes:
  vuesrc:
  node_modules:

Happy to provide any other details, logs, etc...

@darrylkuhn (Author)

Bump @xenoscopic - anything you'd recommend I try here to mitigate the issue? I'm happy to share more logs, run a beta build, make a video capture, or hop on screen share. Cheers!

@xenoscopic (Member)

I apologize that I haven't gotten back to you on this yet (especially after such a detailed report). I'll have a look at this tomorrow morning and let you know if I need anything else, but hopefully it should be pretty easy to suss out.

@darrylkuhn (Author)

@xenoscopic no worries and I appreciate any attention you can give it 👍

@xenoscopic (Member)

So, reviewing the issue, it seems like what's blocking shutdown is the pause operation on the Mutagen synchronization session. I've had sporadic reports of session pausing/termination hanging with Docker, but the only way I've been able to reproduce this is by trying to pause/terminate a Mutagen synchronization session targeting a paused Docker container (which is a known issue that will be solved soon with an additional heartbeat mechanism).

You didn't mention it, so I assume the Mutagen container is not paused in this case, but it's entirely possible a similar issue is preventing the sync session from shutting down.

Are you able to reproduce this hang at least somewhat reliably? If so, it would be really useful if I could get a stack trace of the Mutagen daemon on your system once this hang occurs. Unfortunately this is a little bit complicated and would require doing a custom build of Mutagen, but I could provide instructions if you're willing/able to reproduce.

@darrylkuhn (Author)

Willing and able; it seems to happen every time for me.

@xenoscopic (Member)

Awesome! Here are the basic instructions. Some minor tweaking may be necessary depending on your exact system setup, but if something doesn't work, let me know - definitely don't spend hours digging into it 😅. Delve can be a little finicky, but it should work.

First, you'll need a Go 1.17 toolchain installed and in your path.

Then, the setup is basically the following:

# Install Delve (assuming you have $GOPATH/bin in your path), otherwise install
# instructions are available here:
# https://github.com/go-delve/delve/tree/master/Documentation/installation
go install github.com/go-delve/delve/cmd/dlv@latest

# Create a debuggable Mutagen build
git clone https://github.com/mutagen-io/mutagen.git
cd mutagen
git checkout v0.13.0
go run scripts/build.go --mode=slim   # ignore warnings about cgo/macOS

# Stop the existing Mutagen daemon (if running - ignore errors if not)
mutagen daemon stop

# Run the daemon in a debugger
dlv exec ./build/mutagen.exe -- daemon run
(dlv) continue

# In a separate terminal, start up your Compose project, then tear it down to
# the point where it hangs on "Pausing session ..."

# Then, back in the debugger, hit CTRL+C, and then type
(dlv) goroutines -t
# ...and copy this output...

If you can copy that output, either here or to a Gist, I should be able to figure out exactly what's hanging. I may have some follow-up questions, but hopefully it should be all the info I need. Feel free to audit the output for any info that needs to be redacted, but there shouldn't be any user-visible data in it.

Thanks in advance!

@darrylkuhn (Author)

@xenoscopic thanks for the clear instructions! Here is the output:

PS C:\Users\DarrylKuhn\source\mutagen> dlv exec .\build\mutagen.exe -- daemon run
Type 'help' for list of commands.
(dlv) continue
2022/02/02 08:40:42 [forwarding|info] Looking for existing sessions
2022/02/02 08:40:42 [forwarding|info] Session manager initialized
2022/02/02 08:40:42 [synchronization|info] Looking for existing sessions
2022/02/02 08:40:42 [synchronization|info] Loading session sync_kUvNxRuXM33krgJxAkd6M1e8OKCewX8W9D85VWXZmtp
2022/02/02 08:40:42 [synchronization.sync_kUvNxRuXM33krgJxAkd6M1e8OKCewX8W9D85VWXZmtp|info] Session loaded
2022/02/02 08:40:42 [synchronization|info] Session manager initialized
2022/02/02 08:41:16 [synchronization.sync_S3d2G5WizvG7p7PZTiyadq6NRPsgNw4qMbRG5adjgD3|info] Connecting to alpha endpoint
2022/02/02 08:41:16 [synchronization.sync_S3d2G5WizvG7p7PZTiyadq6NRPsgNw4qMbRG5adjgD3|info] Connecting to beta endpoint
2022/02/02 08:41:20 [synchronization.sync_S3d2G5WizvG7p7PZTiyadq6NRPsgNw4qMbRG5adjgD3.beta.remote|info] [housekeeping|info] Performing initial housekeeping
2022/02/02 08:41:20 [synchronization.sync_S3d2G5WizvG7p7PZTiyadq6NRPsgNw4qMbRG5adjgD3|info] Starting synchronization loop
2022/02/02 08:41:20 [synchronization.sync_S3d2G5WizvG7p7PZTiyadq6NRPsgNw4qMbRG5adjgD3|info] Session initialized
received SIGINT, stopping process (will not forward signal)
2022/02/02 08:43:23 [synchronization|info] Shutting down
2022/02/02 08:43:23 [synchronization|info] Halting session sync_kUvNxRuXM33krgJxAkd6M1e8OKCewX8W9D85VWXZmtp
2022/02/02 08:43:23 [synchronization|info] Halting session sync_S3d2G5WizvG7p7PZTiyadq6NRPsgNw4qMbRG5adjgD3
Stopped at: 0x7fffe7cc3dc4
=>   1: no source available
(dlv) goroutines -t
  Goroutine 1 - User: c:/program files/go/src/runtime/proc.go:6382 sync.runtime_doSpin (0xc7f01a) (thread 15404)
         0  0x0000000000c84346 in runtime.procyield
             at c:/program files/go/src/runtime/asm_amd64.s:657
         1  0x0000000000c7f01a in sync.runtime_doSpin
             at c:/program files/go/src/runtime/proc.go:6382
         2  0x0000000000c9c225 in sync.(*Mutex).lockSlow
             at c:/program files/go/src/sync/mutex.go:101
         3  0x000000000109fabb in sync.(*Mutex).Lock
             at c:/program files/go/src/sync/mutex.go:81
         4  0x000000000109fabb in github.com/mutagen-io/mutagen/pkg/synchronization.(*controller).halt
             at c:/users/darrylkuhn/source/mutagen/pkg/synchronization/controller.go:563
         5  0x00000000010a5876 in github.com/mutagen-io/mutagen/pkg/synchronization.(*Manager).Shutdown
             at c:/users/darrylkuhn/source/mutagen/pkg/synchronization/manager.go:194
         6  0x00000000010daf86 in github.com/mutagen-io/mutagen/cmd/mutagen/daemon.runMain·dwrap·4
             at c:/users/darrylkuhn/source/mutagen/cmd/mutagen/daemon/run.go:60
         7  0x00000000010dac40 in github.com/mutagen-io/mutagen/cmd/mutagen/daemon.runMain
             at c:/users/darrylkuhn/source/mutagen/cmd/mutagen/daemon/run.go:113
         8  0x0000000000da7a8e in github.com/spf13/cobra.(*Command).execute
             at c:/users/darrylkuhn/go/pkg/mod/github.com/spf13/cobra@v1.3.0/command.go:856
         9  0x0000000000da817c in github.com/spf13/cobra.(*Command).ExecuteC
             at c:/users/darrylkuhn/go/pkg/mod/github.com/spf13/cobra@v1.3.0/command.go:974
        10  0x00000000010f8431 in github.com/spf13/cobra.(*Command).Execute
             at c:/users/darrylkuhn/go/pkg/mod/github.com/spf13/cobra@v1.3.0/command.go:902
        11  0x00000000010f8431 in main.main
             at c:/users/darrylkuhn/source/mutagen/cmd/mutagen/main.go:115
        (truncated)
  Goroutine 2 - User: c:/program files/go/src/runtime/proc.go:367 runtime.gopark (0xc595d6) [force gc (idle) 456404h17m47.1402569s]
        0  0x0000000000c595d6 in runtime.gopark
            at c:/program files/go/src/runtime/proc.go:367
        1  0x0000000000c59471 in runtime.goparkunlock
            at c:/program files/go/src/runtime/proc.go:372
        2  0x0000000000c59471 in runtime.forcegchelper
            at c:/program files/go/src/runtime/proc.go:306
        3  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 3 - User: c:/program files/go/src/runtime/proc.go:367 runtime.gopark (0xc595d6) [GC sweep wait]
        0  0x0000000000c595d6 in runtime.gopark
            at c:/program files/go/src/runtime/proc.go:367
        1  0x0000000000c43238 in runtime.goparkunlock
            at c:/program files/go/src/runtime/proc.go:372
        2  0x0000000000c43238 in runtime.bgsweep
            at c:/program files/go/src/runtime/mgcsweep.go:182
        3  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 4 - User: c:/program files/go/src/runtime/proc.go:367 runtime.gopark (0xc595d6) [GC scavenge wait]
        0  0x0000000000c595d6 in runtime.gopark
            at c:/program files/go/src/runtime/proc.go:367
        1  0x0000000000c414a8 in runtime.goparkunlock
            at c:/program files/go/src/runtime/proc.go:372
        2  0x0000000000c414a8 in runtime.bgscavenge
            at c:/program files/go/src/runtime/mgcscavenge.go:314
        3  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 5 - User: c:/program files/go/src/runtime/proc.go:367 runtime.gopark (0xc595d6) [finalizer wait 456404h17m47.1422572s]
        0  0x0000000000c595d6 in runtime.gopark
            at c:/program files/go/src/runtime/proc.go:367
        1  0x0000000000c38c73 in runtime.goparkunlock
            at c:/program files/go/src/runtime/proc.go:372
        2  0x0000000000c38c73 in runtime.runfinq
            at c:/program files/go/src/runtime/mfinal.go:177
        3  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 6 - User: c:/program files/go/src/runtime/sigqueue.go:169 os/signal.signal_recv (0xc7ff18) (thread 2740)
        0  0x00007fffe7cc33e4 in ???
            at ?:-1
        1  0x0000000000c82400 in runtime.systemstack_switch
            at c:/program files/go/src/runtime/asm_amd64.s:350
        2  0x0000000000c52f5d in runtime.stdcall
            at c:/program files/go/src/runtime/os_windows.go:1068
        3  0x0000000000c53105 in runtime.stdcall2
            at :0
        4  0x0000000000c5232f in runtime.semasleep
            at c:/program files/go/src/runtime/os_windows.go:818
        5  0x0000000000c2bb96 in runtime.notetsleep_internal
            at c:/program files/go/src/runtime/lock_sema.go:214
        6  0x0000000000c2becf in runtime.notetsleepg
            at c:/program files/go/src/runtime/lock_sema.go:296
        7  0x0000000000c7ff18 in os/signal.signal_recv
            at c:/program files/go/src/runtime/sigqueue.go:169
        8  0x0000000001044f79 in os/signal.loop
            at c:/program files/go/src/os/signal/signal_unix.go:24
        9  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 11 - User: c:/program files/go/src/runtime/syscall_windows.go:497 syscall.Syscall6 (0xc80dba) (thread 8244)
        0  0x00007fffe7cc3484 in ???
            at ?:-1
        1  0x0000000000c82400 in runtime.systemstack_switch
            at c:/program files/go/src/runtime/asm_amd64.s:350
        2  0x0000000000c2430c in runtime.cgocall
            at c:/program files/go/src/runtime/cgocall.go:166
        3  0x0000000000c80dba in syscall.Syscall6
            at c:/program files/go/src/runtime/syscall_windows.go:497
        4  0x0000000000c86f1a in syscall.Syscall6
            at :0
        5  0x0000000001030f74 in github.com/Microsoft/go-winio.getQueuedCompletionStatus
            at c:/users/darrylkuhn/go/pkg/mod/github.com/!microsoft/go-winio@v0.5.1/zsyscall_windows.go:357
        6  0x000000000102d1aa in github.com/Microsoft/go-winio.ioCompletionProcessor
            at c:/users/darrylkuhn/go/pkg/mod/github.com/!microsoft/go-winio@v0.5.1/file.go:167
        7  0x000000000102cd66 in github.com/Microsoft/go-winio.initIo·dwrap·1
            at c:/users/darrylkuhn/go/pkg/mod/github.com/!microsoft/go-winio@v0.5.1/file.go:73
        8  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 31 - User: c:/program files/go/src/runtime/syscall_windows.go:497 syscall.Syscall6 (0xc80dba) (thread 18068)
         0  0x00007fffe7cc3424 in ???
             at ?:-1
         1  0x0000000000c82400 in runtime.systemstack_switch
             at c:/program files/go/src/runtime/asm_amd64.s:350
         2  0x0000000000c2430c in runtime.cgocall
             at c:/program files/go/src/runtime/cgocall.go:166
         3  0x0000000000c80dba in syscall.Syscall6
             at c:/program files/go/src/runtime/syscall_windows.go:497
         4  0x0000000000c86f1a in syscall.Syscall6
             at :0
         5  0x0000000000cd201e in syscall.ReadFile
             at c:/program files/go/src/syscall/zsyscall_windows.go:1024
         6  0x0000000000ccac2e in syscall.Read
             at c:/program files/go/src/syscall/syscall_windows.go:380
         7  0x0000000000cf4a54 in internal/poll.(*FD).Read
             at c:/program files/go/src/internal/poll/fd_windows.go:427
         8  0x0000000000cfdfde in os.(*File).read
             at c:/program files/go/src/os/file_posix.go:32
         9  0x0000000000cfdfde in os.(*File).Read
             at c:/program files/go/src/os/file.go:119
        10  0x0000000000cc3972 in io.copyBuffer
             at c:/program files/go/src/io/io.go:423
        (truncated)
  Goroutine 50 - User: c:/program files/go/src/runtime/proc.go:367 runtime.gopark (0xc595d6) [GC worker (idle) 456404h17m47.2254867s]
        0  0x0000000000c595d6 in runtime.gopark
            at c:/program files/go/src/runtime/proc.go:367
        1  0x0000000000c3b725 in runtime.gcBgMarkWorker
            at c:/program files/go/src/runtime/mgc.go:1200
        2  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 51 - User: c:/program files/go/src/runtime/proc.go:367 runtime.gopark (0xc595d6) [GC worker (idle)]
        0  0x0000000000c595d6 in runtime.gopark
            at c:/program files/go/src/runtime/proc.go:367
        1  0x0000000000c3b725 in runtime.gcBgMarkWorker
            at c:/program files/go/src/runtime/mgc.go:1200
        2  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 52 - User: c:/program files/go/src/runtime/proc.go:367 runtime.gopark (0xc595d6) [GC worker (idle) 456404h16m30.2663036s]
        0  0x0000000000c595d6 in runtime.gopark
            at c:/program files/go/src/runtime/proc.go:367
        1  0x0000000000c3b725 in runtime.gcBgMarkWorker
            at c:/program files/go/src/runtime/mgc.go:1200
        2  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 66 - User: c:/program files/go/src/runtime/proc.go:367 runtime.gopark (0xc595d6) [GC worker (idle)]
        0  0x0000000000c595d6 in runtime.gopark
            at c:/program files/go/src/runtime/proc.go:367
        1  0x0000000000c3b725 in runtime.gcBgMarkWorker
            at c:/program files/go/src/runtime/mgc.go:1200
        2  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 67 - User: c:/program files/go/src/runtime/proc.go:367 runtime.gopark (0xc595d6) [GC worker (idle)]
        0  0x0000000000c595d6 in runtime.gopark
            at c:/program files/go/src/runtime/proc.go:367
        1  0x0000000000c3b725 in runtime.gcBgMarkWorker
            at c:/program files/go/src/runtime/mgc.go:1200
        2  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 82 - User: c:/program files/go/src/runtime/proc.go:367 runtime.gopark (0xc595d6) [GC worker (idle)]
        0  0x0000000000c595d6 in runtime.gopark
            at c:/program files/go/src/runtime/proc.go:367
        1  0x0000000000c3b725 in runtime.gcBgMarkWorker
            at c:/program files/go/src/runtime/mgc.go:1200
        2  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 98 - User: c:/program files/go/src/runtime/proc.go:367 runtime.gopark (0xc595d6) [GC worker (idle)]
        0  0x0000000000c595d6 in runtime.gopark
            at c:/program files/go/src/runtime/proc.go:367
        1  0x0000000000c3b725 in runtime.gcBgMarkWorker
            at c:/program files/go/src/runtime/mgc.go:1200
        2  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 114 - User: c:/program files/go/src/runtime/proc.go:367 runtime.gopark (0xc595d6) [GC worker (idle)]
        0  0x0000000000c595d6 in runtime.gopark
            at c:/program files/go/src/runtime/proc.go:367
        1  0x0000000000c3b725 in runtime.gcBgMarkWorker
            at c:/program files/go/src/runtime/mgc.go:1200
        2  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 149 - User: c:/users/darrylkuhn/source/mutagen/pkg/process/connection.go:159 github.com/mutagen-io/mutagen/pkg/process.(*Stream).Close (0x103c6fa) [chan receive 456404h16m30.2980813s]
        0  0x0000000000c595d6 in runtime.gopark
            at c:/program files/go/src/runtime/proc.go:367
        1  0x0000000000c269cc in runtime.chanrecv
            at c:/program files/go/src/runtime/chan.go:576
        2  0x0000000000c263f8 in runtime.chanrecv1
            at c:/program files/go/src/runtime/chan.go:439
        3  0x000000000103c6fa in github.com/mutagen-io/mutagen/pkg/process.(*Stream).Close
            at c:/users/darrylkuhn/source/mutagen/pkg/process/connection.go:159
        4  0x00000000010d47a2 in github.com/mutagen-io/mutagen/pkg/synchronization/endpoint/remote.(*endpointClient).Shutdown
            at c:/users/darrylkuhn/source/mutagen/pkg/synchronization/endpoint/remote/client.go:452
        5  0x00000000010a0667 in github.com/mutagen-io/mutagen/pkg/synchronization.(*controller).run
            at c:/users/darrylkuhn/source/mutagen/pkg/synchronization/controller.go:781
        6  0x000000000109dc3e in github.com/mutagen-io/mutagen/pkg/synchronization.newSession·dwrap·1
            at c:/users/darrylkuhn/source/mutagen/pkg/synchronization/controller.go:232
        7  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 214 - User: c:/program files/go/src/runtime/sema.go:56 sync.runtime_Semacquire (0xc7f785) [semacquire]
        0  0x0000000000c595d6 in runtime.gopark
            at c:/program files/go/src/runtime/proc.go:367
        1  0x0000000000c686b3 in runtime.goparkunlock
            at c:/program files/go/src/runtime/proc.go:372
        2  0x0000000000c686b3 in runtime.semacquire1
            at c:/program files/go/src/runtime/sema.go:144
        3  0x0000000000c7f785 in sync.runtime_Semacquire
            at c:/program files/go/src/runtime/sema.go:56
        4  0x0000000000c9db71 in sync.(*WaitGroup).Wait
            at c:/program files/go/src/sync/waitgroup.go:130
        5  0x00000000010163cc in google.golang.org/grpc.(*Server).serveStreams
            at c:/users/darrylkuhn/go/pkg/mod/google.golang.org/grpc@v1.43.0/server.go:931
        6  0x0000000001015ca6 in google.golang.org/grpc.(*Server).handleRawConn.func1
            at c:/users/darrylkuhn/go/pkg/mod/google.golang.org/grpc@v1.43.0/server.go:847
        7  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 215 - User: c:/users/darrylkuhn/source/mutagen/pkg/synchronization/controller.go:578 github.com/mutagen-io/mutagen/pkg/synchronization.(*controller).halt (0x109fb45) [chan receive 456404h16m30.3312283s]
         0  0x0000000000c595d6 in runtime.gopark
             at c:/program files/go/src/runtime/proc.go:367
         1  0x0000000000c269cc in runtime.chanrecv
             at c:/program files/go/src/runtime/chan.go:576
         2  0x0000000000c263f8 in runtime.chanrecv1
             at c:/program files/go/src/runtime/chan.go:439
         3  0x000000000109fb45 in github.com/mutagen-io/mutagen/pkg/synchronization.(*controller).halt
             at c:/users/darrylkuhn/source/mutagen/pkg/synchronization/controller.go:578
         4  0x00000000010a66bc in github.com/mutagen-io/mutagen/pkg/synchronization.(*Manager).Pause
             at c:/users/darrylkuhn/source/mutagen/pkg/synchronization/manager.go:375
         5  0x00000000010ae969 in github.com/mutagen-io/mutagen/pkg/service/synchronization.(*Server).Pause
             at c:/users/darrylkuhn/source/mutagen/pkg/service/synchronization/server.go:97
         6  0x00000000010b3f50 in github.com/mutagen-io/mutagen/pkg/service/synchronization._Synchronization_Pause_Handler
             at c:/users/darrylkuhn/source/mutagen/pkg/service/synchronization/synchronization_grpc.pb.go:231
         7  0x0000000001018c6f in google.golang.org/grpc.(*Server).processUnaryRPC
             at c:/users/darrylkuhn/go/pkg/mod/google.golang.org/grpc@v1.43.0/server.go:1282
         8  0x000000000101ca4a in google.golang.org/grpc.(*Server).handleStream
             at c:/users/darrylkuhn/go/pkg/mod/google.golang.org/grpc@v1.43.0/server.go:1616
         9  0x0000000001016798 in google.golang.org/grpc.(*Server).serveStreams.func1.2
             at c:/users/darrylkuhn/go/pkg/mod/google.golang.org/grpc@v1.43.0/server.go:921
        10  0x0000000000c84781 in runtime.goexit
             at c:/program files/go/src/runtime/asm_amd64.s:1581
  Goroutine 216 - User: c:/program files/go/src/os/exec/exec.go:515 os/exec.(*Cmd).Wait (0xe49ca8) [chan receive 456404h16m30.3589574s]
        0  0x0000000000c595d6 in runtime.gopark
            at c:/program files/go/src/runtime/proc.go:367
        1  0x0000000000c269cc in runtime.chanrecv
            at c:/program files/go/src/runtime/chan.go:576
        2  0x0000000000c263f8 in runtime.chanrecv1
            at c:/program files/go/src/runtime/chan.go:439
        3  0x0000000000e49ca8 in os/exec.(*Cmd).Wait
            at c:/program files/go/src/os/exec/exec.go:515
        4  0x000000000103c869 in github.com/mutagen-io/mutagen/pkg/process.(*Stream).Close.func1
            at c:/users/darrylkuhn/source/mutagen/pkg/process/connection.go:109
        5  0x0000000000c84781 in runtime.goexit
            at c:/program files/go/src/runtime/asm_amd64.s:1581
[20 goroutines]
(dlv)

Let me know if you need anything else. Cheers!

xenoscopic added a commit to mutagen-io/mutagen that referenced this issue Feb 2, 2022
We were previously hesitant to close standard input because the close
could block indefinitely if writes were pending, but with the poll-based
pipe I/O introduced in Go 1.10 (see golang/go@187957d), this should no
longer be an issue, and it's going to be necessary (for now) to work
around the issues described in golang/go#23019, which are causing Docker
transport shutdown hangs.

Updates mutagen-io/mutagen-compose#11

Signed-off-by: Jacob Howard <jacob@mutagen.io>
@xenoscopic (Member) commented Feb 2, 2022

Excellent, thank you very much!

As expected, that pointed right to the issue. It would have been really difficult to figure out without the stack trace, so thanks again!

I'm fairly certain it's the problem addressed by these two proposals.

In Mutagen's case, the issue is that (with Docker Desktop) the docker command launches an internal com.docker.cli command to actually do its work. When Mutagen wants to terminate its agent processes running via docker exec, it issues SIGTERM on POSIX and uses TerminateProcess on Windows. The problem is that (for reasons I don't fully understand, the code being closed-source) the termination doesn't always propagate to com.docker.cli (and I'm actually not sure that it should via TerminateProcess). However, since com.docker.cli has inherited standard input/output/error from docker, those pipes stay open, and the Wait operation that Mutagen performs here ends up waiting on a Goroutine that's waiting for an EOF from com.docker.cli's standard error output that will never come.
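
To make the failure mode concrete, here is a minimal, self-contained reproduction of the golang/go#23019 pattern. This is an illustrative POSIX sketch, not Mutagen's actual transport code: the /bin/sh child stands in for docker, and the backgrounded sleep stands in for com.docker.cli.

package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// The child immediately backgrounds a long-lived grandchild (which
	// inherits the child's stdout/stderr pipe write ends) and then exits.
	cmd := exec.Command("/bin/sh", "-c", "sleep 300 & exit 0")

	// Because Stdout/Stderr are not *os.File values, os/exec creates OS
	// pipes plus internal copying goroutines, and Wait blocks until those
	// goroutines see EOF.
	var output bytes.Buffer
	cmd.Stdout = &output
	cmd.Stderr = &output

	if err := cmd.Start(); err != nil {
		panic(err)
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		fmt.Println("Wait returned:", err)
	case <-time.After(2 * time.Second):
		// The child exited almost immediately, but the grandchild still
		// holds the pipe write ends open, so Wait is stuck waiting for EOF.
		fmt.Println("Wait is hanging: the grandchild keeps the pipes open")
	}
}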

So as a quick fix, I've added closure of the standard input pipe, which should trickle down to com.docker.cli and trigger the mutagen-agent process to exit on its own. Mutagen used to do this a really long time ago, but OS pipes used to be blocking in Go, so I abandoned that behavior. However, since Go 1.9, pipes are non-blocking, so this should no longer be an issue, and should be sufficient to signal termination to com.docker.cli. It's worth noting that this is only an issue with the Docker Desktop docker command.
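
Here is a sketch of that quick-fix idea, with cat standing in for a transport process that exits once its standard input reaches EOF (this is simplified and not the actual connection.go change): the transport's standard input becomes an explicit pipe, and closing that pipe delivers an EOF that signals shutdown, letting Wait return.

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// "cat" is a stand-in for a transport process that exits on stdin EOF;
	// for Docker Desktop the real command would be something along the
	// lines of docker exec -i <container> mutagen-agent ...
	cmd := exec.Command("cat")

	stdin, err := cmd.StdinPipe()
	if err != nil {
		panic(err)
	}
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// ... normal operation would happen here ...

	// Shutdown: instead of (only) terminating the process, close its
	// standard input. The resulting EOF trickles down to any intermediate
	// process (e.g. com.docker.cli) that inherited the pipe, prompting it
	// to exit, which in turn lets Wait return normally.
	if err := stdin.Close(); err != nil {
		fmt.Println("close stdin:", err)
	}
	fmt.Println("wait:", cmd.Wait())
}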

I've got a build testing this change, but it would be really helpful if you could test whether or not it actually fixes the issue for you before I tag a release.

Would you be able to replace <mutagen>/pkg/process/connection.go (in your v0.13.0 checked-out tree) with this copy and then run:

go run scripts/build.go --mode=slim   # Again, ignore macOS warnings
build/mutagen daemon stop      # Ignore warnings if not running
build/mutagen daemon run

Then try to bring the Compose project up/down and see if it still hangs.

If it does, can you redo the Delve procedure (with this new connection.go code build) and provide the same output?

xenoscopic added a commit to mutagen-io/mutagen that referenced this issue Feb 2, 2022
We were previously hesitant to close standard input pipes because the
close operation could block indefinitely if writes were pending.
However, with the poll-based pipe I/O introduced in Go 1.10 (see
golang/go@187957d), this should no longer be an issue, and it's going to
be necessary (for now) to work around the issues described in
golang/go#23019, which are causing Docker transport shutdown hangs.
Essentially, closing the standard input pipe should (in theory) signal
to the com.docker.cli process that it's time to shut down.

Updates mutagen-io/mutagen-compose#11

Signed-off-by: Jacob Howard <jacob@mutagen.io>
@darrylkuhn (Author)

@xenoscopic I rebuilt and ran as described above using the new connection.go, and I'm happy to report that it worked like a charm.

@xenoscopic (Member)

@darrylkuhn Awesome, thanks for all your help. I've got a bit more testing and validation I want to do for this change tomorrow, but I'll get it tagged into v0.13.1 and get that out the door ASAP. I'll close out this issue once that's shipped.

@darrylkuhn (Author)

Fantastic - appreciate it!

xenoscopic added a commit to mutagen-io/mutagen that referenced this issue Feb 3, 2022
We've been bitten again by golang/go#23019, so rather than a quick hack,
it's time to properly fix agent termination signaling. While the
original plan was to use killpg and Windows job objects to manage agent
process hierarchies, it quickly became clear when trying to implement
that behavior that the APIs needed to accomplish it simply weren't there
on POSIX or Windows. Moreover, it became even less clear what the
correct signaling and forceful termination mechanisms should be,
especially since Mutagen has no insight into transport process
hierarchies. Thus, Mutagen now takes a staged approach to transport
process termination, first waiting, then closing standard input, then
(on POSIX) sending SIGTERM, and finally forcing termination. It relies
on transport processes to correctly forward these mechanisms to the
Mutagen agent and to manage their own internal process hierarchies
accordingly once the Mutagen agent terminates.

This commit also takes the opportunity to impose a size limit on the
in-memory error buffer used to capture transport errors after a
handshake failure.

Updates #223
Updates mutagen-io/mutagen-compose#11

Signed-off-by: Jacob Howard <jacob@mutagen.io>
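
For illustration, the staged escalation described in that commit might be sketched as follows. The helper name terminateStaged and all of the stage timeouts are invented for this example; the real logic lives in Mutagen's agent transport code.

package transport

import (
	"io"
	"os/exec"
	"runtime"
	"syscall"
	"time"
)

// terminateStaged escalates through the termination stages described in the
// commit message: wait, close standard input, SIGTERM (POSIX only), then a
// forced kill. All timeouts are illustrative placeholders.
func terminateStaged(cmd *exec.Cmd, stdin io.Closer) error {
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	waitFor := func(d time.Duration) (bool, error) {
		select {
		case err := <-done:
			return true, err
		case <-time.After(d):
			return false, nil
		}
	}

	// Stage 1: give the process a brief window to exit on its own.
	if ok, err := waitFor(time.Second); ok {
		return err
	}

	// Stage 2: close standard input; EOF is the only reliable "soft"
	// termination signal that also works on Windows.
	stdin.Close()
	if ok, err := waitFor(5 * time.Second); ok {
		return err
	}

	// Stage 3 (POSIX only): send SIGTERM.
	if runtime.GOOS != "windows" {
		cmd.Process.Signal(syscall.SIGTERM)
		if ok, err := waitFor(5 * time.Second); ok {
			return err
		}
	}

	// Stage 4: force termination (SIGKILL on POSIX, TerminateProcess on
	// Windows).
	cmd.Process.Kill()
	return <-done
}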
@xenoscopic (Member)

Just an additional update: while the quick hack will work, I've decided to address this more comprehensively in mutagen-io/mutagen@b910ae7. I'll need to do a bit more testing with this tomorrow since it's such a core change, but should be able to ship it by the end of the weekend.

xenoscopic added a commit to mutagen-io/mutagen that referenced this issue Feb 4, 2022
We've already seen several manifestations of golang/go#23019, so this
commit refactors the way that agent input, output, and error streams are
managed, as well as the way that agent process termination is signaled.

Historically (in 6c1a47c), we avoided closing
standard input/output pipes because they were blocking, meaning that a
close wouldn't preempt a read/write and that the close itself could
potentially block. However, this hasn't been the case since Go 1.9, when
os.File was switched to use polling I/O (at least for pipes and other
pollable files).

As such, we can now use closure of standard input as a signal to agent
processes (via their intermediate transport processes) that they should
terminate. Failing that, we still fall back to signal-based termination,
but this standard input closure mechanism is particularly important on
Windows, where no "soft" signaling mechanism (like SIGTERM) exists and
transport process termination via TerminateProcess often triggers
golang/go#23019. This is especially problematic with Docker Desktop,
where an intermediate com.docker.cli process is used underneath the
primary Docker CLI, and standard input closure is the only reliable
termination signaling mechanism.

Just to clarify, there are mechanisms like WM_CLOSE and CTRL_CLOSE_EVENT
on Windows, which some runtimes (such as Go's) will convert to a
SIGTERM, but there's no telling how intermediate transport processes
will interpret these messages. They don't necessarily have the same
semantics as SIGTERM.

And, just in case none of our signaling mechanisms works, we now avoid
using os/exec.Cmd's forwarding Goroutines entirely, meaning that its
Wait method will close all pipes as soon as the child process is
terminated.

As part of this refactor, I've also looked at switching to a more
systematic approach to manage the process hierarchies that could
potentially be generated by transport processes. Things like killpg on
POSIX or job objects on Windows could theoretically facilitate such
management. However, the fact of the matter is that there's simply no
way to reliably enforce termination of such hierarchies and, more
importantly, no way for Mutagen to know what termination signaling
mechanism would be appropriate for any intermediate processes.
Essentially, we just have to rely on transport processes to either
correctly forward standard input closure (especially on Windows)
and/or correctly forward SIGTERM on POSIX. But, if they don't, we will
forcibly terminate them and any associated resources in the Mutagen
daemon. If their subprocesses linger on afterward, that's a bug in the
transport process, not Mutagen.

This commit also takes the opportunity to impose a size limit on the
in-memory error buffer used to capture transport errors after a
handshake failure.

Updates #223
Updates mutagen-io/mutagen-compose#11

Signed-off-by: Jacob Howard <jacob@mutagen.io>
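
One point in the commit worth unpacking is avoiding os/exec.Cmd's forwarding Goroutines: when the Stdin/Stdout/Stderr fields hold real *os.File values, os/exec passes the descriptors directly to the child and spawns no internal copy loops, so Wait returns as soon as the child exits, regardless of which grandchildren still hold the pipes. A hypothetical sketch of that wiring (not Mutagen's actual stream plumbing):

package transport

import (
	"os"
	"os/exec"
)

// startWithOwnedPipes starts a command with explicit os.Pipe-backed standard
// streams. Because the Cmd fields receive *os.File values, os/exec creates no
// forwarding goroutines, so Wait won't block on pipe EOFs held open by
// orphaned grandchildren. The helper name and error handling are illustrative.
func startWithOwnedPipes(name string, args ...string) (cmd *exec.Cmd, stdin *os.File, output *os.File, err error) {
	stdinR, stdinW, err := os.Pipe()
	if err != nil {
		return nil, nil, nil, err
	}
	outputR, outputW, err := os.Pipe()
	if err != nil {
		stdinR.Close()
		stdinW.Close()
		return nil, nil, nil, err
	}

	cmd = exec.Command(name, args...)
	cmd.Stdin = stdinR   // *os.File: wired directly into the child
	cmd.Stdout = outputW // no internal copy goroutines are created
	cmd.Stderr = outputW

	if err := cmd.Start(); err != nil {
		stdinR.Close()
		stdinW.Close()
		outputR.Close()
		outputW.Close()
		return nil, nil, nil, err
	}

	// The parent must close its copies of the child's pipe ends so that
	// reads on outputR terminate once the child-side writers are gone.
	stdinR.Close()
	outputW.Close()

	return cmd, stdinW, outputR, nil
}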
xenoscopic added a commit to mutagen-io/mutagen that referenced this issue Feb 4, 2022
(Backport of 8556e07; commit message identical to the refactor above.)
xenoscopic added a commit to mutagen-io/mutagen that referenced this issue Feb 4, 2022
(Backport of 8556e07; commit message identical to the refactor above.)
@xenoscopic (Member)

This should now be fixed in Mutagen/Mutagen Compose v0.13.1 (note that you'll have to update both due to their version matching requirement at the moment).

Please let me know if you run into trouble with this newer fix. It's not exactly the same as the code you tried, but it's the same idea.

Thanks again for your help.

@darrylkuhn (Author)

Downloaded and installed v0.13.1 and can confirm the issue remains resolved. Thanks @xenoscopic!

coryb pushed a commit to coryb/mutagen that referenced this issue Feb 23, 2022
(Commit message identical to the refactor above.)