
Pod with container with failed StartupProbe stays in Ready: false. #84178

Closed
odinuge opened this issue Oct 22, 2019 · 8 comments · Fixed by #84279
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

odinuge (Member) commented Oct 22, 2019

What happened:

Pod with container with failed StartupProbe stays in Ready: false.

What you expected to happen:

Pod/container should be killed and restarted and eventually enter CrashLoopBackOff.

How to reproduce it (as minimally and precisely as possible):

Start a cluster with the feature gate StartupProbe=true.

Deploy this pod:

apiVersion: v1
kind: Pod
metadata:
  name: startup-probe-pod
spec:
  containers:
    - image: fedora
      command:
        - sleep
        - inf
      name: example1
      readinessProbe:
        exec:
          command:
            - "true"
        failureThreshold: 2
        periodSeconds: 2
      livenessProbe:
        exec:
          command:
            - "true"
        failureThreshold: 2
        periodSeconds: 2
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 2
        periodSeconds: 2

Watch output:

> kubectl get pod
NAME                READY   STATUS    RESTARTS   AGE
startup-probe-pod   0/1     Running   0          3m28s
> kubectl describe pod
Name:         startup-probe-pod                                                                                                                                             
Namespace:    default                                                                                                                                                       
Priority:     0                                                                                                                                                             
Node:         kind-control-plane/172.17.0.2                                                                                                                                 
Start Time:   Tue, 22 Oct 2019 09:54:33 +0200                                                                                                                               
Labels:       <none>                                                                                                                                                        
Annotations:  kubectl.kubernetes.io/last-applied-configuration:                                                                                                             
                {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"startup-probe-pod","namespace":"default"},"spec":{"containers":[{"com...               
Status:       Running                                                                                                                                                       
IP:           10.244.0.2                                                                                                                                                    
IPs:                                                                                                                                                                        
  IP:  10.244.0.2                                                                                                                                                           
Containers:                                                                                                                                                                 
  example1:                                                                                                                                                                 
    Container ID:  containerd://8451c1b089635e2fde65e59a8b00813db728cb7ad8942bf1a81ab0279496b4d9                                                                            
    Image:         fedora                                                                                                                                                   
    Image ID:      docker.io/library/fedora@sha256:8a91dbd4b9d283ca1edc2de5dbeef9267b68bb5dae2335ef64d2db77ddf3aa68                                                         
    Port:          <none>                                                                                                                                                   
    Host Port:     <none>                                                                                                                                                   
    Command:                                                                                                                                                                
      sleep
      inf
    State:          Running
      Started:      Tue, 22 Oct 2019 09:54:54 +0200
    Ready:          False
    Restart Count:  0
    Liveness:       exec [true] delay=0s timeout=1s period=2s #success=1 #failure=2
    Readiness:      exec [true] delay=0s timeout=1s period=2s #success=1 #failure=2
    Startup:        http-get http://:8080/healthz delay=0s timeout=1s period=2s #success=1 #failure=2
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-s8pdb (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-s8pdb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-s8pdb
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                    From                         Message
  ----     ------            ----                   ----                         -------
  Warning  FailedScheduling  <unknown>              default-scheduler            0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
  Normal   Scheduled         <unknown>              default-scheduler            Successfully assigned default/startup-probe-pod to kind-control-plane
  Normal   Pulling           3m54s                  kubelet, kind-control-plane  Pulling image "fedora"
  Normal   Pulled            3m36s                  kubelet, kind-control-plane  Successfully pulled image "fedora"
  Normal   Created           3m33s                  kubelet, kind-control-plane  Created container example1
  Normal   Started           3m33s                  kubelet, kind-control-plane  Started container example1
  Warning  Unhealthy         3m30s (x2 over 3m32s)  kubelet, kind-control-plane  Startup probe failed: Get http://10.244.0.2:8080/healthz: dial tcp 10.244.0.2:8080: connect: connection refused

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.0-alpha.1", GitCommit:"0960c74c3788b1724bd7e7b9933bc49c7e5b5afa", GitTreeState:"clean", BuildDate:"2019-10-02T23:24:03Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-16T07:13:29Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: kind
  • OS (e.g: cat /etc/os-release): Arch Linux
  • Kernel (e.g. uname -a): Linux xps13 5.3.7-arch1-1-ARCH #1 SMP PREEMPT Fri Oct 18 00:17:03 UTC 2019 x86_64 GNU/Linux
  • Install tools: kind
# this config file contains all config fields with comments
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
# patch the generated kubeadm config with some extra settings
kubeadmConfigPatches:
- |
  apiVersion: kubeadm.k8s.io/v1beta2
  kind: ClusterConfiguration
  metadata:
    name: config
  apiServer:
    extraArgs:
      "feature-gates": "StartupProbe=true"
  scheduler:
    extraArgs:
      "feature-gates": "StartupProbe=true"
  controllerManager:
    extraArgs:
      "feature-gates": "StartupProbe=true"
- |
  apiVersion: kubeadm.k8s.io/v1beta2
  kind: InitConfiguration
  metadata:
    name: config
  nodeRegistration:
    kubeletExtraArgs:
      "feature-gates": "StartupProbe=true"
# 1 control plane node
nodes:
- role: control-plane
  • Network plugin and version (if this is a network-related bug):
  • Others:
@odinuge odinuge added the kind/bug Categorizes issue or PR as related to a bug. label Oct 22, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 22, 2019
odinuge (Member, Author) commented Oct 22, 2019

/sig node
/cc @matthyx

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 22, 2019
matthyx (Contributor) commented Oct 22, 2019

Right, I was blindly hoping that this was sufficient:

if (w.probeType == liveness || w.probeType == startup) && result == results.Failure {
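
For context, that check sits in the prober worker's result handling; roughly, and paraphrased rather than copied exactly, it only puts the worker on hold after a failure so probing stops until a new container ID appears:

if (w.probeType == liveness || w.probeType == startup) && result == results.Failure {
    // The container failed a liveness/startup check and will be restarted,
    // so stop probing until we see a new container ID.
    w.onHold = true
    w.resultRun = 0
}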

Let me correct that, and thanks for the hint @Pingan2017!

odinuge (Member, Author) commented Oct 22, 2019

> We need to add a block for StartupProbe failure.

} else if liveness, found := m.livenessManager.Get(containerStatus.ID); found && liveness == proberesults.Failure {
    // If the container failed the liveness probe, we should kill it.
    message = fmt.Sprintf("Container %s failed liveness probe", container.Name)
} else {
    // Keep the container.
    keepCount++
    continue
}

Did some research a few days ago, and it turns out that adding a StartupProbe check there will not be sufficient to fix the issue. Since we initialize the probe result with Failure, the container will keep restarting if we do so. I think we have to rewrite the probing mechanism a bit to achieve what we want, since the code was written primarily for the liveness and readiness probes. I have a small WIP solution that works, so I guess I can upload that one when I get some time.
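
For illustration only, the change discussed above would look roughly like this, assuming a startupManager field analogous to livenessManager existed on the runtime manager (and, as noted, this alone is not sufficient because the startup result is initialized to Failure):

} else if startup, found := m.startupManager.Get(containerStatus.ID); found && startup == proberesults.Failure {
    // If the container failed the startup probe, we should kill it.
    message = fmt.Sprintf("Container %s failed startup probe", container.Name)
} else {
    // Keep the container.
    keepCount++
    continue
}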

matthyx (Contributor) commented Oct 24, 2019

> I think we have to rewrite the probing mechanism a bit to achieve what we want, since the code was written primarily for the liveness and readiness probes. I have a small WIP solution that works, so I guess I can upload that one when I get some time.

I have proposed a PR, but could you elaborate on your WIP? It doesn't seem this case is correctly detected by the tests I have added.

odinuge (Member, Author) commented Oct 24, 2019

Thanks for the PR! My WIP didn't get much further, and I am still not sure what the best approach is.

The "main" difficulty I see is the fact that a probe result can only hold one of two values:

type Result bool

const (
    // Success is encoded as "true" (type Result)
    Success Result = true
    // Failure is encoded as "false" (type Result)
    Failure Result = false
)

For the StartupProbe we need to initialize the result with one of them and then detect the next update, whether it is a success or a failure. The result manager will only propagate changes when the probe result changes, but that isn't sufficient for us.

func (m *manager) Set(id kubecontainer.ContainerID, result Result, pod *v1.Pod) {
    if m.setInternal(id, result) {
        m.updates <- Update{id, result, pod.UID}
    }
}
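
For context, setInternal only reports a change when the cached value is new or actually flips; roughly (paraphrased, not an exact copy of the source):

// setInternal updates the cache and returns true only when the stored result
// is new or different, so Set pushes an update only on actual changes.
func (m *manager) setInternal(id kubecontainer.ContainerID, result Result) bool {
    m.Lock()
    defer m.Unlock()
    prev, exists := m.cache[id]
    if !exists || prev != result {
        m.cache[id] = result
        return true
    }
    return false
}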

Adding a new value to the Result type may work, e.g. Unknown/NotAvailable (roughly sketched below). We can then use it as the default value for the probe, making it possible to detect both Success and Failure when they occur.
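
A minimal sketch of that idea, assuming the Result type is changed from bool to an int-backed type (the names here are only suggestions, not the final implementation):

package results

// Result of a probe run. Backing it with an int instead of a bool leaves room
// for a third state, so the cache can start out as "no result yet".
type Result int

const (
    // Unknown is the default before the first probe run has completed.
    Unknown Result = iota
    // Success means the probe succeeded.
    Success
    // Failure means the probe failed.
    Failure
)

With Unknown as the initial cached value, the first real Success or Failure is always a change, so setInternal would report it and Set would propagate the update.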

This is just some of my thoughts, and there may be better ways to achieve the same.

matthyx (Contributor) commented Oct 24, 2019

@thockin maybe you could jump in?

thockin (Member) commented Oct 24, 2019

Unfortunately, I just don't know this code at all any more. Clearly we have some state that we track, since they all have failureThreshold, so there's something between "success" and "fail" for the signal.

I mean, this should be functionally the same as livenessProbe, right?
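
For reference, the worker does keep that intermediate state: it counts consecutive identical results and only updates the cached Result once the threshold is reached, roughly (paraphrased, not an exact copy of the source):

if w.lastResult == result {
    w.resultRun++
} else {
    w.lastResult = result
    w.resultRun = 1
}
if (result == results.Failure && w.resultRun < int(w.spec.FailureThreshold)) ||
    (result == results.Success && w.resultRun < int(w.spec.SuccessThreshold)) {
    // Below threshold - leave the cached probe state unchanged.
    return true
}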

thockin (Member) commented Oct 24, 2019

@Random-Liu can maybe help guide?
