Status inconsistencies between deployment and its pods #82405

Closed
kayrus opened this issue Sep 6, 2019 · 14 comments
Labels: kind/bug, sig/apps

Comments

@kayrus
Contributor

kayrus commented Sep 6, 2019

What happened:

I'm facing an issue with the deployment status: its pods are ready and in the Running state, but the deployment status does not reflect their readiness.

$ kubectl get pods -l role=server
NAME                             READY   STATUS    RESTARTS   AGE
server-bc9c5c7b8-2vv6f   3/3     Running   0          44h
server-bc9c5c7b8-6m9j5   3/3     Running   2          46h
server-bc9c5c7b8-76zdb   3/3     Running   3          44h
$ kubectl get deployment server
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
server   2/3     3            2           21d

What you expected to happen:

I expect to see a Ready 3/3 deployment status.

How to reproduce it (as minimally and precisely as possible):

This happens when a kubelet re-establishes a connection to a kube-apiserver. Not always, but I'm able to reproduce the issue about 50% of the time.

Some further debugging showed that the pod status cache map in pkg/kubelet/status/status_manager.go is stuck with Ready True as both the old and the new value, so reconciliation is not triggered.

The cache map in pkg/kubelet/config/config.go is stuck with Ready False for both the old and the new pod status, so reconciliation is not triggered there either.

For some reason the reconciliation loops don't merge these two values, and they persist until you restart the kubelet. I'm still trying to understand what exactly is wrong (probably a one-line fix is needed where a pointer is used instead of a DeepCopy, or possibly a missing mutex lock); see the sketch of the suspected aliasing problem below.
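
To illustrate the suspected aliasing problem, here is a generic, hypothetical Go sketch (not the actual kubelet code): if a cache stores a pointer to the very status object that is later mutated in place, the "old" and "new" entries always compare equal, so a change never looks like a change and reconciliation is never triggered.

package main

import (
	"fmt"
	"reflect"
)

// PodStatus is a stand-in for v1.PodStatus; only the Ready flag matters here.
type PodStatus struct {
	Ready bool
}

func main() {
	current := &PodStatus{Ready: true}

	// Buggy caching: the cache keeps the very pointer that is mutated later.
	cachedByPointer := current

	// Correct caching: the cache keeps an independent deep copy.
	copied := *current
	cachedByCopy := &copied

	// The pod becomes not-ready (e.g. while the apiserver connection is down).
	current.Ready = false

	// With the aliased pointer, "old" and "new" can never differ, so a
	// DeepEqual-based reconciliation check never fires.
	fmt.Println("pointer cache sees a change:", !reflect.DeepEqual(cachedByPointer, current)) // false
	fmt.Println("deep-copy cache sees a change:", !reflect.DeepEqual(cachedByCopy, current))  // true
}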

Environment:

  • Kubernetes version (use kubectl version): 1.15.4
  • Cloud provider or hardware configuration: openstack
  • OS (e.g: cat /etc/os-release): coreos stable

@kubernetes/sig-scheduling

kayrus added the kind/bug label Sep 6, 2019
k8s-ci-robot added the needs-sig label Sep 6, 2019
@kayrus
Contributor Author

kayrus commented Sep 6, 2019

@kubernetes/sig-scheduling
/sig scheduling

k8s-ci-robot added the sig/scheduling label and removed the needs-sig label Sep 6, 2019
@k82cn
Member

k82cn commented Sep 8, 2019

/sig apps

k8s-ci-robot added the sig/apps label Sep 8, 2019
k82cn removed the sig/scheduling label Sep 8, 2019
@Joseph-Irving
Member

Can you reproduce this issue on a more recent version of Kubernetes? v1.5.0 is a few years old now and is no longer supported; the currently supported versions are v1.13, v1.14 and v1.15.

@kayrus
Contributor Author

kayrus commented Sep 11, 2019

@Joseph-Irving sorry, it's a typo, the version is 1.15

@Joseph-Irving
Member

Can you show the full output of the deployment, kubectl get deployment server -o yaml?

@kayrus
Contributor Author

kayrus commented Sep 11, 2019

@Joseph-Irving here is the output for another deployment with the same problem:
https://gist.github.com/kayrus/8a5956126f08fe974c91f5fb64e2e6a4

@Joseph-Irving
Member

Joseph-Irving commented Sep 11, 2019

Interesting, there are a few things that don't look right there: the Pod's Ready condition is false, yet its containers are ready.

The Deployment says it has 1 unavailable replica.

The ReplicaSet seems to have an incomplete status field; I would expect it to have readyReplicas and availableReplicas.

@kayrus
Contributor Author

kayrus commented Sep 12, 2019

@Joseph-Irving do you have a clue where to dig further?

@Joseph-Irving
Member

Joseph-Irving commented Sep 12, 2019

So the reason the Deployment/ReplicaSet don't appear to have the correct status is that the Pod has its Ready condition set to False.
However, this should be True: per https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/status/generate.go#L94, PodReady is set to True if ContainersReady is True and all of your readiness gates are ready (you're not using any pod readiness gates). So in your case PodReady should be True.

updateStatusInternal should then go and update the pod status on the API server (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/status/status_manager.go#L362), so I would have a look at your kubelet log, specifically at entries from status_manager.go, to see if there are any errors when attempting to update the status.
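
For reference, here is a simplified, self-contained Go sketch of that rule (the real implementation is GeneratePodReadyCondition in pkg/kubelet/status/generate.go; the types and names below are reduced for illustration and are not the actual kubelet code):

package main

import "fmt"

// Condition is a reduced stand-in for v1.PodCondition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// podReady mirrors the rule described above: PodReady is True only when
// ContainersReady is True and every configured readiness gate condition
// is True; otherwise it reflects the failure.
func podReady(containersReady string, readinessGates []string, conditions []Condition) Condition {
	if containersReady != "True" {
		return Condition{Type: "Ready", Status: containersReady}
	}
	for _, gate := range readinessGates {
		found := false
		for _, c := range conditions {
			if c.Type == gate && c.Status == "True" {
				found = true
				break
			}
		}
		if !found {
			return Condition{Type: "Ready", Status: "False"}
		}
	}
	return Condition{Type: "Ready", Status: "True"}
}

func main() {
	// No readiness gates (as in this issue) and ContainersReady=True:
	// PodReady must come out True.
	fmt.Println(podReady("True", nil, nil)) // {Ready True}
}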

@kayrus
Contributor Author

kayrus commented Sep 12, 2019

@Joseph-Irving there were issues when the kubelet tried to connect to the api-server, but those issues were related to another pod, and that pod and its corresponding deployment are fine now. I suppose the issue is related to a kube-apiserver restart.

https://gist.github.com/kayrus/eac4891efdf1b7817e40d0bf15c0a277

UPD: I restarted the kubelet and it fixed the consistency issue. I suppose that once the kubelet can't connect to the server, it stops retrying.

@kayrus
Contributor Author

kayrus commented Nov 15, 2019

@Joseph-Irving I was able to reproduce this case by using an iptables rule:

iptables -I OUTPUT -d kubeapiserver -j REJECT --reject-with icmp-port-unreachable

See the attached animation (kubelet_status_bug).

UPD: there is a race condition somewhere, because sometimes, after the same operation, the pod does get a proper status update through the code path that logs

klog.V(3).Infof("Pod status is inconsistent with cached status for pod %q, a reconciliation should be triggered:\n %+v", format.Pod(pod),

See the logs: https://gist.github.com/kayrus/d1b09a51822983a1951fbfeb22ed46f8
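
For context, that log line is emitted when the status manager decides that its cached status and the pod object it received via the pod manager disagree. A minimal, hypothetical sketch of such a check (the real code is needsReconcile in pkg/kubelet/status/status_manager.go; the types here are simplified and illustrative):

package main

import (
	"fmt"
	"reflect"
)

// Condition is a reduced stand-in for v1.PodCondition.
type Condition struct {
	Type   string
	Status string
}

// needsReconcile is a simplified model: reconcile only when the status
// cached by the status manager differs from the status on the pod object
// delivered by the config/pod manager.
func needsReconcile(cached, fromPodManager []Condition) bool {
	if reflect.DeepEqual(cached, fromPodManager) {
		return false
	}
	fmt.Println("Pod status is inconsistent with cached status, a reconciliation should be triggered")
	return true
}

func main() {
	cached := []Condition{{Type: "Ready", Status: "True"}}
	fromPodManager := []Condition{{Type: "Ready", Status: "True"}}
	// The apiserver actually reports Ready=False, but if the copy held by
	// the pod manager is never refreshed (or is overwritten with the
	// kubelet's own value), both sides agree and no reconcile is queued.
	fmt.Println("reconcile needed:", needsReconcile(cached, fromPodManager)) // false
}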

@kayrus
Contributor Author

kayrus commented Nov 17, 2019

I added more debugging around

} else if m.needsReconcile(uid, status.status) {

and it appears that status.status contains Type:Ready Status:True, while the actual kube-apiserver status is Type:Ready Status:False.

Then I added more debugging around

func checkAndUpdatePod(existing, ref *v1.Pod) (needUpdate, needReconcile, needGracefulDelete bool) {

and it appeared that both existing and ref contain Ready False.

So far I suspect this func:

updatePodsFunc := func(newPods []*v1.Pod, oldPods, pods map[types.UID]*v1.Pod) {
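
To make that suspicion concrete, here is a hypothetical Go sketch of how a merge step shaped like checkAndUpdatePod could hide the difference (simplified types; this illustrates the suspected failure mode described above, not a confirmed root cause):

package main

import (
	"fmt"
	"reflect"
)

// Pod is a reduced stand-in for *v1.Pod, with only a Ready flag as its status.
type Pod struct {
	UID   string
	Ready bool
}

// checkAndUpdatePod mirrors the shape of the function in
// pkg/kubelet/config/config.go: if the two statuses differ, request a
// reconcile so the kubelet re-asserts its own view on the apiserver.
func checkAndUpdatePod(existing, ref *Pod) (needReconcile bool) {
	if !reflect.DeepEqual(existing.Ready, ref.Ready) {
		existing.Ready = ref.Ready
		needReconcile = true
	}
	return
}

func main() {
	// Cached pod (what the kubelet kept from a previous SET/UPDATE).
	existing := &Pod{UID: "server-bc9c5c7b8-2vv6f", Ready: false}
	// Fresh pod from the apiserver after the connection is restored.
	ref := &Pod{UID: "server-bc9c5c7b8-2vv6f", Ready: false}

	// Both copies carry Ready=false (as seen in the added debug output),
	// even though the status manager believes Ready=true. The comparison
	// below therefore never requests a reconcile, and the stale status on
	// the apiserver is never corrected.
	fmt.Println("reconcile requested:", checkAndUpdatePod(existing, ref)) // false
}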

@kayrus
Contributor Author

kayrus commented Dec 4, 2019

@databus23

The patch below, adapted to v1.15.x from PR #84951, solved the issue. Tested multiple times.

kayrus@e2a18d9

@kayrus
Contributor Author

kayrus commented Jan 28, 2020

fixed in k8s 1.15.8 (1.15.9)
