Kata fails to start #702
Comments
I think it would be good to get more kata-proxy logs here to see what happens with the VM boot.
|
Unfortunately, I just see this
|
Hi @ydjainopensource - A few things:
|
Same issue here with ubuntu 18.04. It's like a kind of timeout: sometimes it works, sometimes not. kata-runtime 1.3.0~rc1

docker run -d -ti --runtime kata-runtime my.registry/images/apache

Right after:

docker run -d -ti --runtime kata-runtime my.registry/images/images/apache |
Right - the |
If those affected could please:
The OP was running on a new architecture which hasn't landed in master yet (see #667). Hence, the more data points we can get, the easier it is going to be for us to fix ;) |
There is a related mention over on kata-containers/tests#766 (comment) from me. I was running parallel launches. Before I can paste more details here I need to re-create, as I think my docker/runtime install is a little busted at the moment (updated after I'd seen the error). |
This makes me think the agent sometimes takes longer to start serving gRPC calls. Agreed, it would be helpful to get kata-collect-data.sh info, or, when it fails, to collect the proxy logs and look at the boot log. |
Wait, if only we had agent tracing ...! 😄 |
@fredbcode please try with vsocks |
@devimc how? I will post debug Monday |
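For reference, a hedged sketch of what trying with vsocks involves on Kata 1.x (key name, section, and config path assumed from the default configuration layout; with vsock enabled, kata-proxy is bypassed entirely):

$ # load the host kernel module vsock needs
$ sudo modprobe vhost_vsock
$ # flip the runtime option on (use_vsock is assumed to live under [hypervisor.qemu])
$ sudo sed -i 's/^#*use_vsock.*/use_vsock = true/' /usr/share/defaults/kata-containers/configuration.toml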
@devimc not better, but when it works it seems faster than before.

docker run -d -ti --name test1 --runtime kata-runtime app_apache

I will provide full debug logs |
Fresh reboot, all images removed, without vsocks, ubuntu 18.04 fresh install:

Test 1: OK
Test 2: OK - docker run -d -ti --name test1 --runtime kata-runtime registry.test/images/nginx
Test 3: KO - docker run -d -ti --name test2 --runtime kata-runtime registry.test/images/nginx
Test 4: OK - docker run -d -ti --name test3 --runtime kata-runtime registry.test/images/nginx

(kata-collect-data.sh meta details attached)
|
Hi @fredbcode - thanks for posting this. Please could you attach the full proxy log:
|
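A hedged way to capture that log in full (kata-proxy writes to the system journal under its own syslog identifier):

$ # dump everything the proxy has logged, message text only, for attaching
$ sudo journalctl -q -o cat -a -t kata-proxy > kata-proxy.log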
Thanks @fredbcode - attaching that file here to the thread (for posterity ;-) ) If that is not OK with you then let me know and I'll drop it again. |
No problem, let me known if something is needed, I'm familiar with dev tools (but not at all with go language :)) |
One kind of timeout I've hit is when memory preallocation is enabled and kata runs on a slow machine, where allocating the required memory can take some time. Then the very first |
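For context, a hedged sketch of the knob being described (option name assumed from the Kata 1.x default configuration):

$ # with enable_mem_prealloc set, QEMU is started with -mem-prealloc and must
$ # fault in all guest RAM before boot, which is slow on a loaded host
$ grep -n enable_mem_prealloc /usr/share/defaults/kata-containers/configuration.toml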
Yes, that first check call is very slow: ... but I still cannot recreate this issue. |
Hi @fredbcode - would you be able to create a clean set of logs, basically following the simple instructions here?:

I think we only need the proxy + runtime logs. If you could clear the systemd journal first, that would be helpful so we only get a minimal set of logs from

It would also be very useful to know if you have ever seen this on a single run (first

We really need a reliable way to recreate this to allow us to debug. If we can't do this, we could create a special version of

The error message suggests that the gRPC layer isn't ready yet (VM or agent not ready). Note that the agent hasn't died, as we don't see any crashes in the proxy log.

@fredbcode - you mention you are seeing this with version

One other thing you can try is to update the
|
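A hedged sketch of the clean-capture procedure being asked for here (standard journalctl flags; image and runtime names reused from the repro below):

$ # flush and drop old journal entries, reproduce once, then capture only
$ # the runtime and proxy messages from the fresh journal
$ sudo journalctl --rotate && sudo journalctl --vacuum-time=1s
$ docker run --rm -d -ti --runtime kata-runtime fedora bash
$ sudo journalctl -q -t kata-runtime -t kata-proxy > kata-702.log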
I have spent some time on this issue this morning and I have been able to reproduce it very easily. $ docker run --rm -d -ti --runtime kata-runtime fedora bash
3d66c99ea21eddb5046a1e1636adbe35148091594471e16c64f23b55c4eb40a6
$ docker run --rm -d -ti --runtime kata-runtime fedora bash
docker: Error response from daemon: OCI runtime create failed: Failed to check if grpc server is working: rpc error: code = Unavailable desc = transport is closing: unknown.

Mostly, it happened on the second or third container run. Because the first container is detached, it's still running in the background, and it's still consuming the entropy of the host, as you can see here:

$ cat /proc/sys/kernel/random/entropy_avail
3844
$ docker run --rm -d -ti --runtime kata-runtime fedora bash
b0ed37de68328fc59f1b8fa644ff5f0ac184d4782231b737e17b7199debebe74
$ cat /proc/sys/kernel/random/entropy_avail
833
$ docker run --rm -d -ti --runtime kata-runtime fedora bash
b0ed37de68328fc59f1b8fa644ff5f0ac184d4782231b737e17b7199debebe74
$ cat /proc/sys/kernel/random/entropy_avail
53
$ docker run --rm -d -ti --runtime kata-runtime fedora bash
docker: Error response from daemon: OCI runtime create failed: Failed to check if grpc server is working: rpc error: code = Unavailable desc = transport is closing: unknown.

The problem comes from the fact that those containers are actually consuming a large amount of the host entropy, leaving almost nothing for the next ones. I'm not sure if this is related to one of our recent PRs in the agent, because I don't think we were hitting this issue when the |
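To make the depletion visible in one shot, a minimal repro sketch built only from the commands above:

$ # print the host entropy pool before each detached Kata container starts;
$ # on an affected host the count drops sharply until the gRPC check times out
$ for i in 1 2 3 4; do cat /proc/sys/kernel/random/entropy_avail; docker run --rm -d -ti --runtime kata-runtime fedora bash; done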
Today we use /dev/random as the entropy source. This is a blocking source: when the host has a low amount of entropy, Kata VM startup takes longer because it blocks trying to get entropy from the source. This changes the entropy source to /dev/urandom, which is non-blocking. Fixes: kata-containers#702 Signed-off-by: Jose Carlos Venegas Munoz <jose.carlos.venegas.munoz@intel.com>
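For reference, a hedged sketch of the QEMU device whose backend this commit changes (standard virtio-rng flags; the exact kata-generated command line may differ):

$ # old, blocking backend: -object rng-random,id=rng0,filename=/dev/random
$ # new, non-blocking backend:
$ qemu-system-x86_64 -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0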
Yes, but I'm using a test machine without internet access; if you could somehow share a package with me (or at least just the binaries for 18.04), that would help a lot. |
@sboeuf @jon et al. I had a memory there was a kernel-related upgrade item not long ago, when we first saw these random number hang-up issues on our CI (@jcvenegas @chavafg for any reference...). I had a quick dig, and came up with https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v4.14.40&id=6e513bc20ca63f594632eca4e1968791240b8f18, referenced from linuxkit/linuxkit#3032 . |
I confirm the entropy impact. I generated a lot of entropy, and now kata works better (better, because I now have another bug that I will report later; related or not, I don't know).

cat /proc/sys/kernel/random/entropy_avail

As you can see, the value is stable and the gRPC error is gone |
great, thanks @fredbcode. I also confirm - I installed |
@grahamwhaley with a stable value in /proc/sys/kernel/random/entropy_avail like me? |
/me goes to hack script.... |
@grahamwhaley Yes, I confirm it's stable because rng-tools (haveged) runs in the background and permanently generates entropy |
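For anyone hitting this, a hedged sketch of the workaround (Ubuntu 18.04 package names; either daemon keeps the pool topped up):

$ sudo apt-get install -y haveged    # or: sudo apt-get install -y rng-tools
$ # watch the pool stay stable while containers start
$ watch -n1 cat /proc/sys/kernel/random/entropy_avail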
This adds a config option to choose the VM entropy source. Fixes: kata-containers#702 Signed-off-by: Jose Carlos Venegas Munoz <jose.carlos.venegas.munoz@intel.com>
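For reference, a hedged sketch of how the new option surfaces (key name and section assumed from this PR against the Kata 1.x configuration layout):

$ # the VM entropy source becomes configurable under [hypervisor.qemu],
$ # e.g. entropy_source = "/dev/urandom"
$ grep -n entropy_source /usr/share/defaults/kata-containers/configuration.toml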
@grahamwhaley so one suggestion is to add a recommendation to our docs to run a daemon such as haveged, to help the kernel collect entropy quickly. The other question is whether we leave urandom as the default, or document the issue for users who run out of entropy. |
@grahamwhaley Actually my env is around 2.5; probably that is why I cannot reproduce it. |
@grahamwhaley so what would be the best approach here? That being said, there's one point where I totally agree with you: we should understand why this issue started happening, since I don't think we had it before. |
@sboeuf I think |
@jcvenegas sure but if we go with |
My gut tells me we:
My only concern there is if passing

But, we need to do some more research first on VM state-of-the-art recommendations, and see if/what turns up on the mailing list. I think we have some members of the kata community who have experience in this area (scaling out VMs) who can probably contribute, or know others who can. |
I am not sure if running |
:-) well, technically we don't have to 'fix' the lack of entropy, we just have to document where and why it happens, and note that a fix will be needed if you are a heavy entropy user :-) We can of course point to some tools, like |
from @bergwolf in the ML.
@sboeuf @grahamwhaley here is a kernel with the patch; let me know if it solves the issue. |
Based on the feedback from the ML, we are moving to /dev/urandom as the VM's source of entropy. |
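A hedged way to verify the switch on a running system (plain process-list inspection, nothing kata-specific):

$ # confirm which rng backend QEMU was launched with for the Kata VM
$ pgrep -af qemu | grep -o 'rng-random[^ ]*'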
@jcvenegas I'm presuming you don't need us to test that patched kernel any more? |
@grahamwhaley No, it is unnecessary now that we've decided to feed virtio-rng with host urandom. |
Description of problem
I was trying to port the runtime to Z. It passes all the unit tests as of now. I used the following command to run:
kata-runtime --log /dev/stdout create -b ../osbuilder/ubuntu_rootfs test
However, I got:
Failed to check if grpc server is working: rpc error: code = DeadlineExceeded desc = context deadline exceeded
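To get more than a bare DeadlineExceeded out of a failing run, a hedged sketch of turning on full debug (enable_debug appears in several sections of the default Kata 1.x configuration.toml; paths may differ on a local build):

$ # enable debug in every component section, then pull the combined logs
$ sudo sed -i 's/^#\(enable_debug\).*/\1 = true/g' /usr/share/defaults/kata-containers/configuration.toml
$ sudo journalctl -q -t kata-runtime -t kata-proxy -t kata-shim --since "5 min ago"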
Meta details

Running kata-collect-data.sh version 1.2.0 (commit 1820047c4e028314a2a19d1773d1aa26aa42e1f8-dirty) at 2018-09-05.15:29:54.876826586-0400.

Runtime is /usr/local/bin/kata-runtime.

kata-env

Output of "/usr/local/bin/kata-runtime kata-env":

Runtime config files

Runtime default config files
Runtime config file contents
Config file /etc/kata-containers/configuration.toml not found
Output of "cat "/usr/share/defaults/kata-containers/configuration.toml"":

Image details

No image

Initrd details

No initrd

Logfiles

Runtime logs
Recent runtime problems found in system journal:

Proxy logs
Recent proxy problems found in system journal:

Shim logs
No recent shim problems found in system journal.

Container manager details

Have docker

Docker
Output of "docker version":
Output of "docker info":
Output of "systemctl show docker":

No kubectl

Packages

Have dpkg
Output of "dpkg -l|egrep "(cc-oci-runtime|cc-runtime|runv|kata-proxy|kata-runtime|kata-shim|kata-containers-image|linux-container|qemu-)"":

No rpm