Slow node registration (~20 minutes) with the symptom that the protokube image cannot be downloaded #9692
@yuha0 Could you check the kops-configuration log?
@hakman In the kops-configuration log, I see:
And it printed this error message repeatedly for 20 minutes until:
So I think it is nodeup (via the kops-configuration service) that loads the protokube image. The next thing to find out is why the docker load kept failing for those 20 minutes.
When the docker daemon started, it kept complaining about two things:
The pull failures, I guess, were caused by the init script repeatedly attempting to run protokube before its image had been loaded.
The oomkill stopped after the successful docker load. I am really confused why dockerd can get OOMKilled. As you can see, this instance is an m5.xlarge with 15G of memory, and docker itself doesn't seem to have a cgroup memory limit.
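One way to check this, assuming docker is managed by systemd and the node uses cgroup v1 (paths differ on cgroup v2):

```bash
# Show the systemd-level memory limit on the docker unit ("infinity" means no limit)
systemctl show docker -p MemoryLimit

# And the raw cgroup v1 limit (a very large number also means "no limit")
cat /sys/fs/cgroup/memory/system.slice/docker.service/memory.limit_in_bytes
```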
I can manually reproduce this on any new m5.xlarge node in AWS:
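A minimal sketch of the repro, assuming the protokube image tarball is already on disk (the filename below is illustrative):

```bash
# Terminal 1: watch the kernel log for oom-kill events
journalctl -k -f | grep -i oom

# Terminal 2: load the image; on a fresh m5.xlarge this is the step
# that triggered the OOM kill of dockerd for me
docker load -i protokube.tar.gz
```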
The kubeReserved:

```yaml
kubeReserved:
  cpu: "200m"
  memory: "512Mi"
  ephemeral-storage: "1Gi"
```
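A sketch of bumping this in the cluster spec (`kubeReserved` sits under `spec.kubelet` in the kops API; the exact value here is illustrative):

```yaml
spec:
  kubelet:
    kubeReserved:
      cpu: "200m"
      memory: "1Gi"            # illustrative bump from 512Mi, to give dockerd headroom
      ephemeral-storage: "1Gi"
```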
@hakman Thank you. I can confirm that increasing the memory reservation shortened the startup time (from 20+ minutes to about 5 minutes). dockerd still got OOMKilled once, but the protokube image was successfully loaded on the second attempt. https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#kube-reserved
Maybe my understanding was incorrect, but I was not aware that docker falls under kubeReserved. The documentation indeed says container runtime reservation is done via kube-reserved.
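For reference, that enforcement maps to kubelet flags roughly like this (a sketch of the mechanism, not necessarily the exact flags kops passes):

```bash
kubelet \
  --kube-reserved=cpu=200m,memory=512Mi,ephemeral-storage=1Gi \
  --kube-reserved-cgroup=/kube-reserved \
  --enforce-node-allocatable=pods,kube-reserved
```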
Seems so; I didn't have to dig that deep. In any case, in real life, reserving 100Mi for something will not help much.
I think the process was "runc", but anyway, you have docker-cli -> dockerd -> containerd -> runc. I am updating the kops example to be closer to what someone should set, though. Hoping that will help in the future.
I am not very familiar with the lower-level concepts, but I thought `docker load` would not involve runc at all. But anyway, thank you for the help. My problem is resolved.
It may be interesting to find out what exactly happens, but for sure the docker command sends a command to dockerd to load the image, not doing the work itself. Not sure either how runc plays into this, or if it is just a side effect of kubelet already running.
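One way to see that the CLI only delegates: the same load can be driven directly against the daemon socket, bypassing the CLI entirely (tarball name illustrative):

```bash
# POST the image tarball to dockerd's /images/load endpoint over the unix socket
curl --unix-socket /var/run/docker.sock \
     -X POST \
     -H "Content-Type: application/x-tar" \
     --data-binary @protokube.tar \
     http://localhost/images/load
```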
1. What kops version are you running? The command `kops version` will display this information.
1.18
2. What Kubernetes version are you running? `kubectl version` will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
1.18.6
3. What cloud provider are you using?
aws
4. What commands did you run? What is the simplest way to reproduce this issue?
No need to run commands; simply bring up a node and observe the slow start time (see the sketch below).
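For completeness, a sketch of one way to bring up an extra node and watch how long registration takes:

```bash
kops edit ig nodes              # bump minSize/maxSize by one
kops update cluster --yes
kubectl get nodes -w            # watch for the new node to go Ready
```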
5. What happened after the commands executed?
N/A
6. What did you expect to happen?
In my previous experience with kops on a Debian AMI, node registration was pretty fast, usually within 5 minutes.
7. Please provide your cluster manifest. Execute `kops get --name my.example.com -o yaml` to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
8. Please run the commands with most verbose logging by adding the `-v 10` flag. Paste the logs into this report, or in a gist and provide the gist link here.
N/A
9. Anything else we need to know?
AFAIK, protokube is not on a public container registry like Docker Hub; instead, it is loaded from S3 (a GitHub kops release artifact)?
I am not sure which service is responsible for doing this, but the symptom is that protokube.service kept failing for about 20 minutes until the protokube image was somehow magically loaded. I attached the syslog from the node, which recorded all the protokube service start attempts.
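The restart attempts are easy to pull out of the journal on the node (a simple filter, nothing kops-specific):

```bash
journalctl -u protokube.service --no-pager | grep -iE 'start|fail'
```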
I also pasted one of the three master IG manifests and the nodes IG manifest. The weird thing is that this can only be reproduced on `nodes`, not on `masters`.
syslog.txt