
Pod with container with failed StartupProbe stays in Ready: false. #84178

Closed
odinuge opened this issue Oct 22, 2019 · 8 comments · Fixed by #84279
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

odinuge (Member) commented Oct 22, 2019

What happened:

Pod with container with failed StartupProbe stays in Ready: false.

What you expected to happen:

Pod/container should be killed and restarted and eventually enter CrashLoopBackOff.

How to reproduce it (as minimally and precisely as possible):

Start a cluster with the feature gate StartupProbe=true.

Deploy this pod:

apiVersion: v1
kind: Pod
metadata:
  name: startup-probe-pod
spec:
  containers:
    - image: fedora
      command:
        - sleep
        - inf
      name: example1
      readinessProbe:
        exec:
          command:
            - "true"
        failureThreshold: 2
        periodSeconds: 2
      livenessProbe:
        exec:
          command:
            - "true"
        failureThreshold: 2
        periodSeconds: 2
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 2
        periodSeconds: 2

Watch output:

> kubectl get pod
NAME                READY   STATUS    RESTARTS   AGE
startup-probe-pod   0/1     Running   0          3m28s
> kubectl describe pod
Name:         startup-probe-pod                                                                                                                                             
Namespace:    default                                                                                                                                                       
Priority:     0                                                                                                                                                             
Node:         kind-control-plane/172.17.0.2                                                                                                                                 
Start Time:   Tue, 22 Oct 2019 09:54:33 +0200                                                                                                                               
Labels:       <none>                                                                                                                                                        
Annotations:  kubectl.kubernetes.io/last-applied-configuration:                                                                                                             
                {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"startup-probe-pod","namespace":"default"},"spec":{"containers":[{"com...               
Status:       Running                                                                                                                                                       
IP:           10.244.0.2                                                                                                                                                    
IPs:                                                                                                                                                                        
  IP:  10.244.0.2                                                                                                                                                           
Containers:                                                                                                                                                                 
  example1:                                                                                                                                                                 
    Container ID:  containerd://8451c1b089635e2fde65e59a8b00813db728cb7ad8942bf1a81ab0279496b4d9                                                                            
    Image:         fedora                                                                                                                                                   
    Image ID:      docker.io/library/fedora@sha256:8a91dbd4b9d283ca1edc2de5dbeef9267b68bb5dae2335ef64d2db77ddf3aa68                                                         
    Port:          <none>                                                                                                                                                   
    Host Port:     <none>                                                                                                                                                   
    Command:                                                                                                                                                                
      sleep
      inf
    State:          Running
      Started:      Tue, 22 Oct 2019 09:54:54 +0200
    Ready:          False
    Restart Count:  0
    Liveness:       exec [true] delay=0s timeout=1s period=2s #success=1 #failure=2
    Readiness:      exec [true] delay=0s timeout=1s period=2s #success=1 #failure=2
    Startup:        http-get http://:8080/healthz delay=0s timeout=1s period=2s #success=1 #failure=2
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-s8pdb (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-s8pdb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-s8pdb
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                    From                         Message
  ----     ------            ----                   ----                         -------
  Warning  FailedScheduling  <unknown>              default-scheduler            0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
  Normal   Scheduled         <unknown>              default-scheduler            Successfully assigned default/startup-probe-pod to kind-control-plane
  Normal   Pulling           3m54s                  kubelet, kind-control-plane  Pulling image "fedora"
  Normal   Pulled            3m36s                  kubelet, kind-control-plane  Successfully pulled image "fedora"
  Normal   Created           3m33s                  kubelet, kind-control-plane  Created container example1
  Normal   Started           3m33s                  kubelet, kind-control-plane  Started container example1
  Warning  Unhealthy         3m30s (x2 over 3m32s)  kubelet, kind-control-plane  Startup probe failed: Get http://10.244.0.2:8080/healthz: dial tcp 10.244.0.2:8080: connect: connection refused

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.0-alpha.1", GitCommit:"0960c74c3788b1724bd7e7b9933bc49c7e5b5afa", GitTreeState:"clean", BuildDate:"2019-10-02T23:24:03Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-16T07:13:29Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: kind
  • OS (e.g: cat /etc/os-release): Arch Linux
  • Kernel (e.g. uname -a): Linux xps13 5.3.7-arch1-1-ARCH #1 SMP PREEMPT Fri Oct 18 00:17:03 UTC 2019 x86_64 GNU/Linux
  • Install tools: kind
# this config file contains all config fields with comments
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
# patch the generated kubeadm config with some extra settings
kubeadmConfigPatches:
- |
  apiVersion: kubeadm.k8s.io/v1beta2
  kind: ClusterConfiguration
  metadata:
    name: config
  apiServer:
    extraArgs:
      "feature-gates": "StartupProbe=true"
  scheduler:
    extraArgs:
      "feature-gates": "StartupProbe=true"
  controllerManager:
    extraArgs:
      "feature-gates": "StartupProbe=true"
- |
  apiVersion: kubeadm.k8s.io/v1beta2
  kind: InitConfiguration
  metadata:
    name: config
  nodeRegistration:
    kubeletExtraArgs:
      "feature-gates": "StartupProbe=true"
# 1 control plane node
nodes:
- role: control-plane
  • Network plugin and version (if this is a network-related bug):
  • Others:
@odinuge odinuge added the kind/bug Categorizes issue or PR as related to a bug. label Oct 22, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 22, 2019
odinuge (Member, Author) commented Oct 22, 2019

/sig node
/cc @matthyx

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 22, 2019
matthyx (Contributor) commented Oct 22, 2019

Right, I was blindly hoping that this was sufficient:

if (w.probeType == liveness || w.probeType == startup) && result == results.Failure {
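
For context, that check sits in the prober worker's result handling; roughly, and paraphrased rather than copied exactly, it only puts the worker on hold after a failure so probing stops until a new container ID appears:

if (w.probeType == liveness || w.probeType == startup) && result == results.Failure {
    // The container failed a liveness/startup check and will be restarted,
    // so stop probing until we see a new container ID.
    w.onHold = true
    w.resultRun = 0
}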

Let me correct that, and thanks for the hint @Pingan2017!

odinuge (Member, Author) commented Oct 22, 2019

> We need to add a block for StartupProbe failure.

} else if liveness, found := m.livenessManager.Get(containerStatus.ID); found && liveness == proberesults.Failure {
    // If the container failed the liveness probe, we should kill it.
    message = fmt.Sprintf("Container %s failed liveness probe", container.Name)
} else {
    // Keep the container.
    keepCount++
    continue
}

Did some research a few days ago, and it turns out that adding a StartupProbe check there will not be sufficient to fix the issue. Since we initialize the probe result with Failure, the container will keep restarting if we do so. I think we have to rewrite the probing mechanism a bit to achieve what we want, since the code was written primarily for the liveness and readiness probes. I have a small WIP solution that works, so I guess I can upload that one when I get some time.
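
For illustration only, the change discussed above would look roughly like this, assuming a startupManager field analogous to livenessManager existed on the runtime manager (and, as noted, this alone is not sufficient because the startup result is initialized to Failure):

} else if startup, found := m.startupManager.Get(containerStatus.ID); found && startup == proberesults.Failure {
    // If the container failed the startup probe, we should kill it.
    message = fmt.Sprintf("Container %s failed startup probe", container.Name)
} else {
    // Keep the container.
    keepCount++
    continue
}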

matthyx (Contributor) commented Oct 24, 2019

> I think we have to rewrite the probing mechanism a bit to achieve what we want, since the code was written primarily for the liveness and readiness probes. I have a small WIP solution that works, so I guess I can upload that one when I get some time.

I have proposed a PR, but could you elaborate on your WIP? It doesn't seem this case is correctly detected by the tests I have added.

odinuge (Member, Author) commented Oct 24, 2019

Thanks for the PR! My WIP didn't get much further, and I am still not sure what the best approach is.

The "main" difficulty I see is the fact that a probe result can only hold one of two values:

type Result bool

const (
    // Success is encoded as "true" (type Result)
    Success Result = true
    // Failure is encoded as "false" (type Result)
    Failure Result = false
)

For the StartupProbe we need to initialize the result with one of them and then detect the next update, whether it is a success or a failure. The result manager will only propagate changes when the probe result changes, but that isn't sufficient for us.

func (m *manager) Set(id kubecontainer.ContainerID, result Result, pod *v1.Pod) {
    if m.setInternal(id, result) {
        m.updates <- Update{id, result, pod.UID}
    }
}
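
For context, setInternal only reports a change when the cached value is new or actually flips; roughly (paraphrased, not an exact copy of the source):

// setInternal updates the cache and returns true only when the stored result
// is new or different, so Set pushes an update only on actual changes.
func (m *manager) setInternal(id kubecontainer.ContainerID, result Result) bool {
    m.Lock()
    defer m.Unlock()
    prev, exists := m.cache[id]
    if !exists || prev != result {
        m.cache[id] = result
        return true
    }
    return false
}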

Adding a new value to the Result type may work, e.g. Unknown/NotAvailable (roughly sketched below). We can then use it as the default value for the probe, making it possible to detect both Success and Failure when they occur.
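
A minimal sketch of that idea, assuming the Result type is changed from bool to an int-backed type (the names here are only suggestions, not the final implementation):

package results

// Result of a probe run. Backing it with an int instead of a bool leaves room
// for a third state, so the cache can start out as "no result yet".
type Result int

const (
    // Unknown is the default before the first probe run has completed.
    Unknown Result = iota
    // Success means the probe succeeded.
    Success
    // Failure means the probe failed.
    Failure
)

With Unknown as the initial cached value, the first real Success or Failure is always a change, so setInternal would report it and Set would propagate the update.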

This is just some of my thoughts, and there may be better ways to achieve the same.

matthyx (Contributor) commented Oct 24, 2019

@thockin maybe you could jump in?

thockin (Member) commented Oct 24, 2019

Unfortunately, I just don't know this code at all any more. Clearly we have some state that we track, since they all have failureThreshold, so there's something between "success" and "fail" for the signal.

I mean, this should be functionally the same as livenessProbe, right?
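
For reference, the worker does keep that intermediate state: it counts consecutive identical results and only updates the cached Result once the threshold is reached, roughly (paraphrased, not an exact copy of the source):

if w.lastResult == result {
    w.resultRun++
} else {
    w.lastResult = result
    w.resultRun = 1
}
if (result == results.Failure && w.resultRun < int(w.spec.FailureThreshold)) ||
    (result == results.Success && w.resultRun < int(w.spec.SuccessThreshold)) {
    // Below threshold - leave the cached probe state unchanged.
    return true
}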

thockin (Member) commented Oct 24, 2019

@Random-Liu can maybe help guide?
