Skip to content
This repository has been archived by the owner on Jul 28, 2021. It is now read-only.

Intermittent failures on docker pull (block devices not present) #135

Closed
rn opened this issue Sep 13, 2017 · 8 comments · Fixed by #136
Closed

Intermittent failures on docker pull (block devices not present) #135

rn opened this issue Sep 13, 2017 · 8 comments · Fixed by #136

Comments

@rn
Copy link
Contributor

rn commented Sep 13, 2017

[I think this is a known issue, but could not find an issue tracking it]

docker pull busybox or docker pull alpine fail intermittently on a fresh start (with the c:\lcow directory empty.

This is with:

  • OpenGCS commit: 203a54283e0b0d58c1bc9d8d0f0b4bea8503fe37
  • dockerd commit: 0dbc182b6
  • LinuxKit kernel 4.11.1

I've seen two types of error messages (see below), both related to block devices for layers not being present.

  • failed to 'ls /sys/bus/scsi/devices/0:0:0:2/block':
C:\Users\rneugeba\Desktop\docker\docker.exe: failed to register layer: failed to create SVM scratch VHDX (applydiff e7b97c3c849c8c1e46023e68798d10075375068d4b11651400252fbe2e296dd1): failed to `ls /sys/bus/scsi/devices/0:0:0:2/block` following hot-add C:\lcow\lcow\scratch\e7b97c3c849c8c1e46023e68798d10075375068d4b11651400252fbe2e296dd1.vhdx to utility VM: failed to create process (&{Options:{KirdPath:C:\Program Files\Linux Containers KernelFile:bootx64.efi InitrdFile:initrd.img Vhdx:C:\Program Files\Linux Containers\uvm.vhdx TimeoutSeconds:0 BootParameters:} Name:e7b97c3c849c8c1e46023e68798d10075375068d4b11651400252fbe2e296dd1_svm RequestedMode:3 ActualMode:2 UvmTimeoutSeconds:300 Uvm:0xc042966300 MappedVirtualDisks:[]}) in utility VM: container e7b97c3c849c8c1e46023e68798d10075375068d4b11651400252fbe2e296dd1_svm encountered an error during CreateProcess: failure in a Windows system call: Unspecified error (0x80004005) extra info: {"CommandLine":"ls /sys/bus/scsi/devices/0:0:0:2/block","WorkingDirectory":"/bin","Environment":{"PATH":"/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:"},"CreateStdInPipe":true,"CreateStdOutPipe":true,"CreateStdErrPipe":true,"ConsoleSize":[0,0],"CreateInUtilityVm":true}.
  • "mkfs.ext4 -q -E lazy_itable_init=1 -O ^has_journal,sparse_super2,uninit_bg,^resize_inode /dev/sda" failing
C:\Users\rneugeba\Desktop\docker\docker.exe: failed to register layer: failed to create SVM scratch VHDX (applydiff 8a461f34126ea60d11f66bf68679d4e53b3de4fa34a590c8e8d401daff8349ab): failed to RunProcess "C:\\lcow\\lcow\\scratch\\8a461f34126ea60d11f66bf68679d4e53b3de4fa34a590c8e8d401daff8349ab.vhdx" following hot-add mkfs.ext4 -q -E lazy_itable_init=1 -O ^has_journal,sparse_super2,uninit_bg,^resize_inode /dev/sda to utility VM: failed to create process (&{Options:{KirdPath:C:\Program Files\Linux Containers KernelFile:bootx64.efi InitrdFile:initrd.img Vhdx:C:\Program Files\Linux Containers\uvm.vhdx TimeoutSeconds:0 BootParameters:} Name:8a461f34126ea60d11f66bf68679d4e53b3de4fa34a590c8e8d401daff8349ab_svm RequestedMode:3 ActualMode:2 UvmTimeoutSeconds:300 Uvm:0xc04297ca80 MappedVirtualDisks:[]}) in utility VM: container 8a461f34126ea60d11f66bf68679d4e53b3de4fa34a590c8e8d401daff8349ab_svm encountered an error during CreateProcess: failure in a Windows system call: Unspecified error (0x80004005) extra info: {"CommandLine":"mkfs.ext4 -q -E lazy_itable_init=1 -O ^has_journal,sparse_super2,uninit_bg,^resize_inode /dev/sda","WorkingDirectory":"/bin","Environment":{"PATH":"/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:"},"CreateStdInPipe":true,"CreateStdOutPipe":true,"CreateStdErrPipe":true,"ConsoleSize":[0,0],"CreateInUtilityVm":true}.
@rn
Copy link
Contributor Author

rn commented Sep 13, 2017

The error message might be a red herring. Here is the output from the daemon logs:

DEBU[2017-09-13T15:52:09.458936500+01:00] HCSShim::Container::Modify succeeded id=76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6_svm
DEBU[2017-09-13T15:52:09.458936500+01:00] opengcs: HotAddVhd: C:\lcow\lcow\scratch\76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6.vhdx added successfully
DEBU[2017-09-13T15:52:09.459935500+01:00] opengcs: CreateExt4Vhdx: C:\lcow\lcow\scratch\76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6.vhdx querying mapped virtual disks
DEBU[2017-09-13T15:52:09.459935500+01:00] HCSShim::Container::MappedVirtualDiskList id=76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6_svm
DEBU[2017-09-13T15:52:09.460955300+01:00] HCSShim::Container::MappedVirtualDiskList succeeded id=76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6_svm
DEBU[2017-09-13T15:52:09.460955300+01:00] opengcs: CreateExt4Vhdx: C:\lcow\lcow\scratch\76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6.vhdx at C=0 L=2
DEBU[2017-09-13T15:52:09.460955300+01:00] opengcs: RunProcess: test -d /sys/bus/scsi/devices/0:0:0:2
DEBU[2017-09-13T15:52:09.460955300+01:00] opengcs: createUtilsProcess
DEBU[2017-09-13T15:52:09.460955300+01:00] HCSShim::Container::CreateProcess id=76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6_svm config={"CommandLine":"test -d /sys/bus/scsi/devices/0:0:0:2","WorkingDirectory":"/bin","Environment":{"PATH":"/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:"},"CreateStdInPipe":true,"CreateStdOutPipe":true,"CreateStdErrPipe":true,"ConsoleSize":[0,0],"CreateInUtilityVm":true}
DEBU[2017-09-13T15:52:09.471936000+01:00] HCSShim::Container::CreateProcess succeeded id=76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6_svm processid=408
DEBU[2017-09-13T15:52:09.471936000+01:00] HCSShim::Process::Stdio processid=408
DEBU[2017-09-13T15:52:09.472939100+01:00] HCSShim::Process::Stdio succeeded processid=408
DEBU[2017-09-13T15:52:09.472939100+01:00] opengcs: createUtilsProcess success: pid 408
DEBU[2017-09-13T15:52:09.472939100+01:00] opengcs: runProcess success: test -d /sys/bus/scsi/devices/0:0:0:2
DEBU[2017-09-13T15:52:09.472939100+01:00] HCSShim::Process::WaitTimeout processid=408
DEBU[2017-09-13T15:52:09.526251200+01:00] HCSShim::Process::WaitTimeout succeeded processid=408
DEBU[2017-09-13T15:52:09.527248700+01:00] HCSShim::Process::ExitCode processid=408
DEBU[2017-09-13T15:52:09.527248700+01:00] HCSShim::Process::properties processid=408
DEBU[2017-09-13T15:52:09.528258300+01:00] HCSShim::Process::properties succeeded processid=408, properties={"ProcessId":408,"Exited":true,"ExitCode":0,"LastWaitResult":0}
DEBU[2017-09-13T15:52:09.528258300+01:00] HCSShim::Process::ExitCode succeeded processid=408 exitCode=0
DEBU[2017-09-13T15:52:09.528258300+01:00] opengcs: RunProcess: ls /sys/bus/scsi/devices/0:0:0:2/block
DEBU[2017-09-13T15:52:09.529341700+01:00] opengcs: createUtilsProcess
DEBU[2017-09-13T15:52:09.529341700+01:00] HCSShim::Container::CreateProcess id=76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6_svm config={"CommandLine":"ls /sys/bus/scsi/devices/0:0:0:2/block","WorkingDirectory":"/bin","Environment":{"PATH":"/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:"},"CreateStdInPipe":true,"CreateStdOutPipe":true,"CreateStdErrPipe":true,"ConsoleSize":[0,0],"CreateInUtilityVm":true}
DEBU[2017-09-13T15:52:11.609893700+01:00] Result: {"Error":-2147467259,"ErrorMessage":"Unspecified error","ErrorEvents":[{"Message":"failed creating stdin Connection: failed connecting the VsockConnection: failed connect() to 00000002.40000080: connection timed out","StackTrace":"\ngithub.com/Microsoft/opengcs/service/gcs/transport.(*VsockTransport).Dial\n\t/go/src/github.com/Microsoft/opengcs/service/gcs/transport/vsock.go:27\ngithub.com/Microsoft/opengcs/service/gcs/bridge.connectStdio\n\t/go/src/github.com/Microsoft/opengcs/service/gcs/bridge/connection.go:21\ngithub.com/Microsoft/opengcs/service/gcs/bridge.(*Bridge).execProcess\n\t/go/src/github.com/Microsoft/opengcs/service/gcs/bridge/bridge.go:382\ngithub.com/Microsoft/opengcs/service/gcs/bridge.(*Bridge).(github.com/Microsoft/opengcs/service/gcs/bridge.execProcess)-fm\n\t/go/src/github.com/Microsoft/opengcs/service/gcs/bridge/bridge.go:199\ngithub.com/Microsoft/opengcs/service/gcs/bridge.HandlerFunc.ServeMsg\n\t/go/src/github.com/Microsoft/opengcs/service/gcs/bridge/bridge.go:45\ngithub.com/Microsoft/opengcs/service/gcs/bridge.(*Mux).ServeMsg\n\t/go/src/github.com/Microsoft/opengcs/service/gcs/bridge/bridge.go:107\ngithub.com/Microsoft/opengcs/service/gcs/bridge.(*Bridge).ListenAndServe.func2.1\n\t/go/src/github.com/Microsoft/opengcs/service/gcs/bridge/bridge.go:262\nruntime.goexit\n\t/usr/lib/go/src/runtime/asm_amd64.s:2197","Provider":"00000000-0000-0000-0000-000000000000","EventId":0}]}
DEBU[2017-09-13T15:52:11.610930700+01:00] opengcs: CreateExt4Vhdx: hot-remove of C:\lcow\lcow\scratch\76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6.vhdx
DEBU[2017-09-13T15:52:11.615887400+01:00] opengcs: HotRemoveVhd: C:\lcow\lcow\scratch\76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6.vhdx
DEBU[2017-09-13T15:52:11.615887400+01:00] HCSShim::Container::Modify id=76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6_svm request={"ResourceType":"MappedVirtualDisk","Settings":{"HostPath":"C:\\lcow\\lcow\\scratch\\76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6.vhdx","ContainerPath":"","CreateInUtilityVM":true,"AttachOnly":false},"RequestType":"Remove"}
DEBU[2017-09-13T15:52:11.656885000+01:00] HCSShim::Container::Modify succeeded id=76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6_svm
DEBU[2017-09-13T15:52:11.656885000+01:00] opengcs: HotRemoveVhd: C:\lcow\lcow\scratch\76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6.vhdx removed successfully
DEBU[2017-09-13T15:52:11.657879800+01:00] HCSShim::Process::Close processid=408
DEBU[2017-09-13T15:52:11.657879800+01:00] HCSShim::Process::Close succeeded processid=408
DEBU[2017-09-13T15:52:11.657879800+01:00] lcowdriver: startservicevmifnotrunning 76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6: (applydiff 76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6): releasing serviceVM on error path from CreateExt4Vhdx: failed to `ls /sys/bus/scsi/devices/0:0:0:2/block` following hot-add C:\lcow\lcow\scratch\76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6.vhdx to utility VM: failed to create process (&{Options:{KirdPath:C:\Program Files\Linux Containers KernelFile:bootx64.efi InitrdFile:initrd.img Vhdx:C:\Program Files\Linux Containers\uvm.vhdx TimeoutSeconds:0 BootParameters:} Name:76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6_svm RequestedMode:3 ActualMode:2 UvmTimeoutSeconds:300 Uvm:0xc042942fc0 MappedVirtualDisks:[]}) in utility VM: container 76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6_svm encountered an error during CreateProcess: failure in a Windows system call: Unspecified error (0x80004005) extra info: {"CommandLine":"ls /sys/bus/scsi/devices/0:0:0:2/block","WorkingDirectory":"/bin","Environment":{"PATH":"/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:"},"CreateStdInPipe":true,"CreateStdOutPipe":true,"CreateStdErrPipe":true,"ConsoleSize":[0,0],"CreateInUtilityVm":true}
DEBU[2017-09-13T15:52:11.658893100+01:00] lcowdriver: startservicevmifnotrunning 76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6: (applydiff 76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6): releasing cachedScratchMutex on error path

in there is this:

DEBU[2017-09-13T15:52:09.528258300+01:00] opengcs: RunProcess: ls /sys/bus/scsi/devices/0:0:0:2/block
DEBU[2017-09-13T15:52:09.529341700+01:00] opengcs: createUtilsProcess
DEBU[2017-09-13T15:52:09.529341700+01:00] HCSShim::Container::CreateProcess id=76c1196c65017f90705ed697f01646b81e132e7319edbacf52768b1bb996e0b6_svm config={"CommandLine":"ls /sys/bus/scsi/devices/0:0:0:2/block","WorkingDirectory":"/bin","Environment":{"PATH":"/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:"},"CreateStdInPipe":true,"CreateStdOutPipe":true,"CreateStdErrPipe":true,"ConsoleSize":[0,0],"CreateInUtilityVm":true}
DEBU[2017-09-13T15:52:11.609893700+01:00] Result: {"Error":-2147467259,"ErrorMessage":"Unspecified error","ErrorEvents":[{"Message":"failed creating stdin Connection: failed connecting the VsockConnection: failed connect() to 00000002.40000080: connection timed out","StackTrace":"\ngithub.com/Microsoft/opengcs/service/gcs/transport.(*VsockTransport).Dial\n\t/go/src/github.com/Microsoft/opengcs/service/gcs/transport/vsock.go:27\ngithub.com/Microsoft/opengcs/service/gcs/bridge.connectStdio\n\t/go/src/github.com/Microsoft/opengcs/service/gcs/bridge/connection.go:21\ngithub.com/Microsoft/opengcs/service/gcs/bridge.(*Bridge).execProcess\n\t/go/src/github.com/Microsoft/opengcs/service/gcs/bridge/bridge.go:382\ngithub.com/Microsoft/opengcs/service/gcs/bridge.(*Bridge).(github.com/Microsoft/opengcs/service/gcs/bridge.execProcess)-fm\n\t/go/src/github.com/Microsoft/opengcs/service/gcs/bridge/bridge.go:199\ngithub.com/Microsoft/opengcs/service/gcs/bridge.HandlerFunc.ServeMsg\n\t/go/src/github.com/Microsoft/opengcs/service/gcs/bridge/bridge.go:45\ngithub.com/Microsoft/opengcs/service/gcs/bridge.(*Mux).ServeMsg\n\t/go/src/github.com/Microsoft/opengcs/service/gcs/bridge/bridge.go:107\ngithub.com/Microsoft/opengcs/service/gcs/bridge.(*Bridge).ListenAndServe.func2.1\n\t/go/src/github.com/Microsoft/opengcs/service/gcs/bridge/bridge.go:262\nruntime.goexit\n\t/usr/lib/go/src/runtime/asm_amd64.s:2197","Provider":"00000000-0000-0000-0000-000000000000","EventId":0}]}

which indicates a time out error on a VsockConnection: failed creating stdin Connection: failed connecting the VsockConnection: failed connect() to 00000002.40000080: connection timed out

So the real reason why the ls of the block device fails may have nothing to do with the command itself but with the plumbing of stdin/stdout for the process.

This could potentially be a dup of #124

@gupta-ak
Copy link
Member

It looks like the same error as #124. Can you repro it if you cherry-pick the top commit from the retry-connect branch? (PR is #133)

@rn
Copy link
Contributor Author

rn commented Sep 13, 2017

I've tried something slightly different https://github.com/rn/opengcs/tree/dial-timeout-hack
basically do a errors.Wrap() in virtsock and then look at the error returned and retry on ETIMEDOUT and with that I can't repro. I've only tried a few times and weirdly I haven't seen the log messages I added...I also don't see the logrus.Info() messages were already in the code (ie https://github.com/Microsoft/opengcs/blob/master/service/gcs/transport/vsock.go#L23). Any idea why they are not visible?

I'll try to debug more tomorrow (I'm in the UK and it's evening now)

@gupta-ak
Copy link
Member

gupta-ak commented Sep 13, 2017

I can try out your branch and see why the logs aren't showing up. At a quick glance, it looks like you're missing the commit where the gcs logs are sent to the HCS, so maybe that's the reason if you're trying to find them from docker.

@gupta-ak
Copy link
Member

Using your branch, I definitely see the logs. Here is a truncated snippet from /tmp/gcs.log:

vsock Dial port (1073741955)
external process 380 exited
bridge: response sent
opengcs: re-dial 1073741955
vsock Connect port (107374195
vsock Dial port (1073741956)
vsock Connect port (107374195
vsock Dial port (1073741957)
vsock Connect port (107374195

@rn
Copy link
Contributor Author

rn commented Sep 14, 2017

Cool, thanks for checking. So this is definitely the same as #124 then.

Do you want me to tidy this up (and roll in the virtsock changes)? I think it's cleaner to just check for ETIMEDOUT as it doesn't mask other potential errors.

@gupta-ak
Copy link
Member

Yeah, I agree. I originally had it check etimedout, but I realised it was getting masked. With the virtsock update, I think your change of only retrying on etimedout is right.

@rn
Copy link
Contributor Author

rn commented Sep 14, 2017

yes, only retrying on ETIMEDOUT, other errors error out properly

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants