Status inconsistencies between deployment and its pods #82405

Closed
kayrus opened this issue Sep 6, 2019 · 14 comments
Labels: kind/bug, sig/apps

Comments

@kayrus
Contributor

kayrus commented Sep 6, 2019

What happened:

I'm facing an issue with the deployment status: its pods are ready and in the Running state, but the deployment status does not reflect their readiness.

$ kubectl get pods -l role=server
NAME                             READY   STATUS    RESTARTS   AGE
server-bc9c5c7b8-2vv6f   3/3     Running   0          44h
server-bc9c5c7b8-6m9j5   3/3     Running   2          46h
server-bc9c5c7b8-76zdb   3/3     Running   3          44h
$ kubectl get deployment server
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
server   2/3     3            2           21d

What you expected to happen:

I expect to see a Ready 3/3 deployment status.

How to reproduce it (as minimally and precisely as possible):

This happens when a kubelet re-establishes a connection to a kube-apiserver. Not always, but I'm able to reproduce the issue about 50% of the time.

Some further debugging showed that the pod status cache map in pkg/kubelet/status/status_manager.go is stuck with Ready True as both the old and the new value, so reconciliation is not triggered.

The cache map in pkg/kubelet/config/config.go is stuck with Ready False for both the old and the new pod status, so reconciliation is not triggered there either.

For some reason the reconciliation loops don't merge these two values, and they persist until you restart the kubelet. I'm still trying to understand what exactly is wrong (probably a one-line fix is needed where a pointer is used instead of a DeepCopy, or possibly a missing mutex lock); see the sketch of the suspected aliasing problem below.
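
To illustrate the suspected aliasing problem, here is a generic, hypothetical Go sketch (not the actual kubelet code): if a cache stores a pointer to the very status object that is later mutated in place, the "old" and "new" entries always compare equal, so a change never looks like a change and reconciliation is never triggered.

package main

import (
	"fmt"
	"reflect"
)

// PodStatus is a stand-in for v1.PodStatus; only the Ready flag matters here.
type PodStatus struct {
	Ready bool
}

func main() {
	current := &PodStatus{Ready: true}

	// Buggy caching: the cache keeps the very pointer that is mutated later.
	cachedByPointer := current

	// Correct caching: the cache keeps an independent deep copy.
	copied := *current
	cachedByCopy := &copied

	// The pod becomes not-ready (e.g. while the apiserver connection is down).
	current.Ready = false

	// With the aliased pointer, "old" and "new" can never differ, so a
	// DeepEqual-based reconciliation check never fires.
	fmt.Println("pointer cache sees a change:", !reflect.DeepEqual(cachedByPointer, current)) // false
	fmt.Println("deep-copy cache sees a change:", !reflect.DeepEqual(cachedByCopy, current))  // true
}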

Environment:

  • Kubernetes version (use kubectl version): 1.15.4
  • Cloud provider or hardware configuration: openstack
  • OS (e.g: cat /etc/os-release): coreos stable

@kubernetes/sig-scheduling

kayrus added the kind/bug label Sep 6, 2019
k8s-ci-robot added the needs-sig label Sep 6, 2019
@kayrus
Contributor Author

kayrus commented Sep 6, 2019

@kubernetes/sig-scheduling
/sig scheduling

k8s-ci-robot added the sig/scheduling label and removed the needs-sig label Sep 6, 2019
@k82cn
Member

k82cn commented Sep 8, 2019

/sig apps

k8s-ci-robot added the sig/apps label Sep 8, 2019
k82cn removed the sig/scheduling label Sep 8, 2019
@Joseph-Irving
Member

Can you reproduce this issue on a more recent version of Kubernetes? v1.5.0 is a few years old now and is no longer supported; the currently supported versions are v1.13, v1.14 and v1.15.

@kayrus
Contributor Author

kayrus commented Sep 11, 2019

@Joseph-Irving sorry, it's a typo, the version is 1.15

@Joseph-Irving
Member

Can you show the full output of the deployment, kubectl get deployment server -o yaml?

@kayrus
Contributor Author

kayrus commented Sep 11, 2019

@Joseph-Irving here is the output for another deployment with the same problem:
https://gist.github.com/kayrus/8a5956126f08fe974c91f5fb64e2e6a4

@Joseph-Irving
Member

Joseph-Irving commented Sep 11, 2019

Interesting, there are a few things that don't look right there: the Pod's Ready condition is false, yet its containers are ready.

The Deployment says it has 1 unavailable replica.

The ReplicaSet seems to have an incomplete status field; I would expect it to have readyReplicas and availableReplicas.

@kayrus
Contributor Author

kayrus commented Sep 12, 2019

@Joseph-Irving do you have a clue where to dig further?

@Joseph-Irving
Member

Joseph-Irving commented Sep 12, 2019

So the reason the Deployment/ReplicaSet don't appear to have the correct status is that the Pod has its Ready condition set to False.
However, this should be True: per https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/status/generate.go#L94, PodReady is set to True if ContainersReady is True and all of your readiness gates are ready (you're not using any pod readiness gates). So in your case PodReady should be True.

updateStatusInternal should then go and update the pod status on the API server (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/status/status_manager.go#L362), so I would have a look at your kubelet log, specifically at entries from status_manager.go, to see if there are any errors when attempting to update the status.
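
For reference, here is a simplified, self-contained Go sketch of that rule (the real implementation is GeneratePodReadyCondition in pkg/kubelet/status/generate.go; the types and names below are reduced for illustration and are not the actual kubelet code):

package main

import "fmt"

// Condition is a reduced stand-in for v1.PodCondition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// podReady mirrors the rule described above: PodReady is True only when
// ContainersReady is True and every configured readiness gate condition
// is True; otherwise it reflects the failure.
func podReady(containersReady string, readinessGates []string, conditions []Condition) Condition {
	if containersReady != "True" {
		return Condition{Type: "Ready", Status: containersReady}
	}
	for _, gate := range readinessGates {
		found := false
		for _, c := range conditions {
			if c.Type == gate && c.Status == "True" {
				found = true
				break
			}
		}
		if !found {
			return Condition{Type: "Ready", Status: "False"}
		}
	}
	return Condition{Type: "Ready", Status: "True"}
}

func main() {
	// No readiness gates (as in this issue) and ContainersReady=True:
	// PodReady must come out True.
	fmt.Println(podReady("True", nil, nil)) // {Ready True}
}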

@kayrus
Contributor Author

kayrus commented Sep 12, 2019

@Joseph-Irving there were issues when the kubelet tried to connect to the api-server, but those issues were related to another pod, and that pod and its corresponding deployment are fine now. I suppose the issue is related to a kube-apiserver restart.

https://gist.github.com/kayrus/eac4891efdf1b7817e40d0bf15c0a277

UPD: I restarted the kubelet and it fixed the consistency issue. I suppose that once the kubelet can't connect to the server, it stops retrying.

@kayrus
Contributor Author

kayrus commented Nov 15, 2019

@Joseph-Irving I was able to reproduce this case by using an iptables rule:

iptables -I OUTPUT -d kubeapiserver -j REJECT --reject-with icmp-port-unreachable

See the attached animation (kubelet_status_bug).

UPD: there is a race condition somewhere, because sometimes, after the same operation, the pod does get a proper status update through the code path that logs

klog.V(3).Infof("Pod status is inconsistent with cached status for pod %q, a reconciliation should be triggered:\n %+v", format.Pod(pod),

See the logs: https://gist.github.com/kayrus/d1b09a51822983a1951fbfeb22ed46f8
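
For context, that log line is emitted when the status manager decides that its cached status and the pod object it received via the pod manager disagree. A minimal, hypothetical sketch of such a check (the real code is needsReconcile in pkg/kubelet/status/status_manager.go; the types here are simplified and illustrative):

package main

import (
	"fmt"
	"reflect"
)

// Condition is a reduced stand-in for v1.PodCondition.
type Condition struct {
	Type   string
	Status string
}

// needsReconcile is a simplified model: reconcile only when the status
// cached by the status manager differs from the status on the pod object
// delivered by the config/pod manager.
func needsReconcile(cached, fromPodManager []Condition) bool {
	if reflect.DeepEqual(cached, fromPodManager) {
		return false
	}
	fmt.Println("Pod status is inconsistent with cached status, a reconciliation should be triggered")
	return true
}

func main() {
	cached := []Condition{{Type: "Ready", Status: "True"}}
	fromPodManager := []Condition{{Type: "Ready", Status: "True"}}
	// The apiserver actually reports Ready=False, but if the copy held by
	// the pod manager is never refreshed (or is overwritten with the
	// kubelet's own value), both sides agree and no reconcile is queued.
	fmt.Println("reconcile needed:", needsReconcile(cached, fromPodManager)) // false
}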

@kayrus
Contributor Author

kayrus commented Nov 17, 2019

I added more debugging around

} else if m.needsReconcile(uid, status.status) {

and it appears that status.status contains Type:Ready Status:True, while the actual kube-apiserver status is Type:Ready Status:False.

Then I added more debugging around

func checkAndUpdatePod(existing, ref *v1.Pod) (needUpdate, needReconcile, needGracefulDelete bool) {

and it appeared that both existing and ref contain Ready False.

So far I suspect this func:

updatePodsFunc := func(newPods []*v1.Pod, oldPods, pods map[types.UID]*v1.Pod) {
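
To make that suspicion concrete, here is a hypothetical Go sketch of how a merge step shaped like checkAndUpdatePod could hide the difference (simplified types; this illustrates the suspected failure mode described above, not a confirmed root cause):

package main

import (
	"fmt"
	"reflect"
)

// Pod is a reduced stand-in for *v1.Pod, with only a Ready flag as its status.
type Pod struct {
	UID   string
	Ready bool
}

// checkAndUpdatePod mirrors the shape of the function in
// pkg/kubelet/config/config.go: if the two statuses differ, request a
// reconcile so the kubelet re-asserts its own view on the apiserver.
func checkAndUpdatePod(existing, ref *Pod) (needReconcile bool) {
	if !reflect.DeepEqual(existing.Ready, ref.Ready) {
		existing.Ready = ref.Ready
		needReconcile = true
	}
	return
}

func main() {
	// Cached pod (what the kubelet kept from a previous SET/UPDATE).
	existing := &Pod{UID: "server-bc9c5c7b8-2vv6f", Ready: false}
	// Fresh pod from the apiserver after the connection is restored.
	ref := &Pod{UID: "server-bc9c5c7b8-2vv6f", Ready: false}

	// Both copies carry Ready=false (as seen in the added debug output),
	// even though the status manager believes Ready=true. The comparison
	// below therefore never requests a reconcile, and the stale status on
	// the apiserver is never corrected.
	fmt.Println("reconcile requested:", checkAndUpdatePod(existing, ref)) // false
}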

@kayrus
Contributor Author

kayrus commented Dec 4, 2019

@databus23

The patch below, adapted to v1.15.x from PR #84951, solved the issue. Tested multiple times.

kayrus@e2a18d9

@kayrus
Contributor Author

kayrus commented Jan 28, 2020

fixed in k8s 1.15.8 (1.15.9)
