Readiness probe is failing, but the service is accessible when exec'd #51096

Closed
chrissound opened this issue Aug 22, 2017 · 36 comments
Labels
sig/network Categorizes an issue or PR as relevant to SIG Network. triage/unresolved Indicates an issue that can not or will not be resolved.

Comments

@chrissound

chrissound commented Aug 22, 2017

What happened: Readiness probe failed

What you expected to happen: Successful container startup

How to reproduce it (as minimally and precisely as possible): I created a NodePort service, and that is when the issue seemed to start. I deleted the NodePort service and reverted to the previous LoadBalancer service, but the issue remains.

If I remove the liveness probe from my deployment config, the containers start up successfully, and I'm able to access them via the LoadBalancer service's external IP.

Cloud provider or hardware configuration: Google Cloud Platform

The container in question is an Nginx container listening on port 80.


The event from kubectl describe is:
Unhealthy Readiness probe failed: Get http://10.24.1.33:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

I can exec into the container and run wget 10.24.1.33, which works correctly (HTTP 200 response).


Readiness probe config:

          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5 
            periodSeconds: 5
@k8s-github-robot

@chrissound
There are no sig labels on this issue. Please add a sig label by:

  1. mentioning a sig: @kubernetes/sig-<group-name>-<group-suffix>
    e.g., @kubernetes/sig-contributor-experience-<group-suffix> to notify the contributor experience sig, OR

  2. specifying the label manually: /sig <label>
    e.g., /sig scalability to apply the sig/scalability label

Note: Method 1 will trigger an email to the group. You can find the group list here and label list here.
The <group-suffix> in method 1 has to be replaced with one of these: bugs, feature-requests, pr-reviews, test-failures, proposals

@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Aug 22, 2017
@chrissound
Author

/sig network

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Aug 22, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Aug 22, 2017
@chrissound
Author

If I SSH into the GCP instance I'm also able to get an HTTP 200 from the container IP.

@chrissound
Author

The Nginx logs show this:

10.24.1.1 - - [23/Aug/2017:16:47:19 +0000] "GET / HTTP/1.1" 499 0 "-" "Go-http-client/1.1"
10.24.1.1 - - [23/Aug/2017:16:47:24 +0000] "GET / HTTP/1.1" 499 0 "-" "Go-http-client/1.1"
10.24.1.1 - - [23/Aug/2017:16:47:29 +0000] "GET / HTTP/1.1" 499 0 "-" "Go-http-client/1.1"
10.24.1.1 - - [23/Aug/2017:16:47:34 +0000] "GET / HTTP/1.1" 499 0 "-" "Go-http-client/1.1"
10.24.1.1 - - [23/Aug/2017:16:47:39 +0000] "GET / HTTP/1.1" 499 0 "-" "Go-http-client/1.1"
10.24.1.1 - - [23/Aug/2017:16:47:44 +0000] "GET / HTTP/1.1" 499 0 "-" "Go-http-client/1.1"
10.24.1.1 - - [23/Aug/2017:16:47:49 +0000] "GET / HTTP/1.1" 499 0 "-" "Go-http-client/1.1"
10.24.1.1 - - [23/Aug/2017:16:47:54 +0000] "GET / HTTP/1.1" 499 0 "-" "Go-http-client/1.1"
10.24.1.1 - - [23/Aug/2017:16:47:59 +0000] "GET / HTTP/1.1" 499 0 "-" "Go-http-client/1.1"

Which is probably the readiness probe.

@chrissound
Author

Nginx debug logs:

2017/08/23 17:03:26 [debug] 5#5: accept on 0.0.0.0:80, ready: 1
2017/08/23 17:03:26 [debug] 5#5: posix_memalign: 000055DEEC1D5BE0:512 @16
2017/08/23 17:03:26 [debug] 5#5: *9 accept: 10.24.1.1:53422 fd:3
2017/08/23 17:03:26 [debug] 5#5: *9 event timer add: 3: 60000:1503507866741
2017/08/23 17:03:26 [debug] 5#5: *9 reusable connection: 1
2017/08/23 17:03:26 [debug] 5#5: *9 epoll add event: fd:3 op:1 ev:80002001
2017/08/23 17:03:26 [debug] 5#5: accept() not ready (11: Resource temporarily unavailable)
2017/08/23 17:03:26 [debug] 5#5: *9 http wait request handler
2017/08/23 17:03:26 [debug] 5#5: *9 malloc: 000055DEEC1CBEC0:1024
2017/08/23 17:03:26 [debug] 5#5: *9 recv: eof:0, avail:1
2017/08/23 17:03:26 [debug] 5#5: *9 recv: fd:3 113 of 1024
2017/08/23 17:03:26 [debug] 5#5: *9 reusable connection: 0
2017/08/23 17:03:26 [debug] 5#5: *9 posix_memalign: 000055DEEC1D9E80:4096 @16
2017/08/23 17:03:26 [debug] 5#5: *9 http process request line
2017/08/23 17:03:26 [debug] 5#5: *9 http request line: "GET / HTTP/1.1"
2017/08/23 17:03:26 [debug] 5#5: *9 http uri: "/"
2017/08/23 17:03:26 [debug] 5#5: *9 http args: ""
2017/08/23 17:03:26 [debug] 5#5: *9 http exten: ""
2017/08/23 17:03:26 [debug] 5#5: *9 posix_memalign: 000055DEEC1DAEA0:4096 @16
2017/08/23 17:03:26 [debug] 5#5: *9 http process request header line
2017/08/23 17:03:26 [debug] 5#5: *9 http header: "Host: 10.24.1.56:80"
2017/08/23 17:03:26 [debug] 5#5: *9 http header: "User-Agent: Go-http-client/1.1"
2017/08/23 17:03:26 [debug] 5#5: *9 http header: "Accept-Encoding: gzip"
2017/08/23 17:03:26 [debug] 5#5: *9 http header: "Connection: close"
2017/08/23 17:03:26 [debug] 5#5: *9 http header done
2017/08/23 17:03:26 [debug] 5#5: *9 event timer del: 3: 1503507866741
2017/08/23 17:03:26 [debug] 5#5: *9 generic phase: 0
2017/08/23 17:03:26 [debug] 5#5: *9 rewrite phase: 1
2017/08/23 17:03:26 [debug] 5#5: *9 test location: "/"
2017/08/23 17:03:26 [debug] 5#5: *9 test location: ~ "\.php$"
2017/08/23 17:03:26 [debug] 5#5: *9 using configuration "/"
2017/08/23 17:03:26 [debug] 5#5: *9 http cl:-1 max:1048576
2017/08/23 17:03:26 [debug] 5#5: *9 rewrite phase: 3
2017/08/23 17:03:26 [debug] 5#5: *9 post rewrite phase: 4
2017/08/23 17:03:26 [debug] 5#5: *9 generic phase: 5
2017/08/23 17:03:26 [debug] 5#5: *9 generic phase: 6
2017/08/23 17:03:26 [debug] 5#5: *9 generic phase: 7
2017/08/23 17:03:26 [debug] 5#5: *9 access phase: 8
2017/08/23 17:03:26 [debug] 5#5: *9 access phase: 9
2017/08/23 17:03:26 [debug] 5#5: *9 access phase: 10
2017/08/23 17:03:26 [debug] 5#5: *9 post access phase: 11
2017/08/23 17:03:26 [debug] 5#5: *9 try files phase: 12
2017/08/23 17:03:26 [debug] 5#5: *9 http script var: "/"
2017/08/23 17:03:26 [debug] 5#5: *9 trying to use file: "/" "/var/www/html/"
2017/08/23 17:03:26 [debug] 5#5: *9 http script var: "/"
2017/08/23 17:03:26 [debug] 5#5: *9 trying to use dir: "/" "/var/www/html/"
2017/08/23 17:03:26 [debug] 5#5: *9 try file uri: "/"
2017/08/23 17:03:26 [debug] 5#5: *9 content phase: 13
2017/08/23 17:03:26 [debug] 5#5: *9 content phase: 14
2017/08/23 17:03:26 [debug] 5#5: *9 open index "/var/www/html/index.php"
2017/08/23 17:03:26 [debug] 5#5: *9 internal redirect: "/index.php?"
2017/08/23 17:03:26 [debug] 5#5: *9 rewrite phase: 1
2017/08/23 17:03:26 [debug] 5#5: *9 test location: "/"
2017/08/23 17:03:26 [debug] 5#5: *9 test location: ~ "\.php$"
2017/08/23 17:03:26 [debug] 5#5: *9 using configuration "\.php$"
2017/08/23 17:03:26 [debug] 5#5: *9 http cl:-1 max:1048576
2017/08/23 17:03:26 [debug] 5#5: *9 rewrite phase: 3
2017/08/23 17:03:26 [debug] 5#5: *9 post rewrite phase: 4
2017/08/23 17:03:26 [debug] 5#5: *9 generic phase: 5
2017/08/23 17:03:26 [debug] 5#5: *9 generic phase: 6
2017/08/23 17:03:26 [debug] 5#5: *9 generic phase: 7
2017/08/23 17:03:26 [debug] 5#5: *9 access phase: 8
2017/08/23 17:03:26 [debug] 5#5: *9 access phase: 9
2017/08/23 17:03:26 [debug] 5#5: *9 access phase: 10
2017/08/23 17:03:26 [debug] 5#5: *9 post access phase: 11
2017/08/23 17:03:26 [debug] 5#5: *9 try files phase: 12
2017/08/23 17:03:26 [debug] 5#5: *9 http init upstream, client timer: 0
2017/08/23 17:03:26 [debug] 5#5: *9 epoll add event: fd:3 op:3 ev:80002005
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "GATEWAY_INTERFACE"
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "CGI/1.1"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "GATEWAY_INTERFACE: CGI/1.1"
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "SERVER_SOFTWARE"
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "nginx"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "SERVER_SOFTWARE: nginx"
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "QUERY_STRING"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "QUERY_STRING: "
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "REQUEST_METHOD"
2017/08/23 17:03:26 [debug] 5#5: *9 http script var: "GET"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "REQUEST_METHOD: GET"
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "CONTENT_TYPE"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "CONTENT_TYPE: "
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "CONTENT_LENGTH"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "CONTENT_LENGTH: "
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "SCRIPT_FILENAME"
2017/08/23 17:03:26 [debug] 5#5: *9 http script var: "/var/www/html"
2017/08/23 17:03:26 [debug] 5#5: *9 http script var: "/index.php"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "SCRIPT_FILENAME: /var/www/html/index.php"
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "SCRIPT_NAME"
2017/08/23 17:03:26 [debug] 5#5: *9 http script var: "/index.php"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "SCRIPT_NAME: /index.php"
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "REQUEST_URI"
2017/08/23 17:03:26 [debug] 5#5: *9 http script var: "/"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "REQUEST_URI: /"
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "DOCUMENT_URI"
2017/08/23 17:03:26 [debug] 5#5: *9 http script var: "/index.php"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "DOCUMENT_URI: /index.php"
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "DOCUMENT_ROOT"
2017/08/23 17:03:26 [debug] 5#5: *9 http script var: "/var/www/html"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "DOCUMENT_ROOT: /var/www/html"
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "SERVER_PROTOCOL"
2017/08/23 17:03:26 [debug] 5#5: *9 http script var: "HTTP/1.1"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "SERVER_PROTOCOL: HTTP/1.1"
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "REMOTE_ADDR"
2017/08/23 17:03:26 [debug] 5#5: *9 http script var: "10.24.1.1"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "REMOTE_ADDR: 10.24.1.1"
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "REMOTE_PORT"
2017/08/23 17:03:26 [debug] 5#5: *9 http script var: "53422"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "REMOTE_PORT: 53422"
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "SERVER_ADDR"
2017/08/23 17:03:26 [debug] 5#5: *9 http script var: "10.24.1.56"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "SERVER_ADDR: 10.24.1.56"
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "SERVER_PORT"
2017/08/23 17:03:26 [debug] 5#5: *9 http script var: "80"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "SERVER_PORT: 80"
2017/08/23 17:03:26 [debug] 5#5: *9 http script copy: "SERVER_NAME"
2017/08/23 17:03:26 [debug] 5#5: *9 http script var: ""
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "SERVER_NAME: "
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "HTTP_HOST: 10.24.1.56:80"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "HTTP_USER_AGENT: Go-http-client/1.1"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "HTTP_ACCEPT_ENCODING: gzip"
2017/08/23 17:03:26 [debug] 5#5: *9 fastcgi param: "HTTP_CONNECTION: close"
2017/08/23 17:03:26 [debug] 5#5: *9 http cleanup add: 000055DEEC1DB990
2017/08/23 17:03:26 [debug] 5#5: *9 get rr peer, try: 2
2017/08/23 17:03:26 [debug] 5#5: *9 get rr peer, current: 000055DEEC1EA890 -1
2017/08/23 17:03:26 [debug] 5#5: *9 stream socket 12
2017/08/23 17:03:26 [debug] 5#5: *9 epoll add connection: fd:12 ev:80002005
2017/08/23 17:03:26 [debug] 5#5: *9 connect to 127.0.0.1:9000, fd:12 #10
2017/08/23 17:03:26 [debug] 5#5: *9 http upstream connect: -2
2017/08/23 17:03:26 [debug] 5#5: *9 posix_memalign: 000055DEEC1F1460:128 @16
2017/08/23 17:03:26 [debug] 5#5: *9 event timer add: 12: 60000:1503507866742
2017/08/23 17:03:26 [debug] 5#5: *9 http finalize request: -4, "/index.php?" a:1, c:3
2017/08/23 17:03:26 [debug] 5#5: *9 http request count:3 blk:0
2017/08/23 17:03:26 [debug] 5#5: *9 http finalize request: -4, "/index.php?" a:1, c:2
2017/08/23 17:03:26 [debug] 5#5: *9 http request count:2 blk:0
2017/08/23 17:03:26 [debug] 5#5: *9 http run request: "/index.php?"
2017/08/23 17:03:26 [debug] 5#5: *9 http upstream check client, write event:1, "/index.php"
2017/08/23 17:03:26 [debug] 5#5: *9 http upstream request: "/index.php?"
2017/08/23 17:03:26 [debug] 5#5: *9 http upstream send request handler
2017/08/23 17:03:26 [debug] 5#5: *9 http upstream send request
2017/08/23 17:03:26 [debug] 5#5: *9 http upstream send request body
2017/08/23 17:03:26 [debug] 5#5: *9 chain writer buf fl:0 s:504
2017/08/23 17:03:26 [debug] 5#5: *9 chain writer in: 000055DEEC1DB9D0
2017/08/23 17:03:26 [debug] 5#5: *9 writev: 504 of 504
2017/08/23 17:03:26 [debug] 5#5: *9 chain writer out: 0000000000000000
2017/08/23 17:03:26 [debug] 5#5: *9 event timer del: 12: 1503507866742
2017/08/23 17:03:26 [debug] 5#5: *9 event timer add: 12: 60000:1503507866742
2017/08/23 17:03:27 [debug] 5#5: *9 http run request: "/index.php?"
2017/08/23 17:03:27 [debug] 5#5: *9 http upstream check client, write event:0, "/index.php"
2017/08/23 17:03:27 [info] 5#5: *9 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream, client: 10.24.1.1, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "10.24.1.56:80"
2017/08/23 17:03:27 [debug] 5#5: *9 finalize http upstream request: 499
2017/08/23 17:03:27 [debug] 5#5: *9 finalize http fastcgi request
2017/08/23 17:03:27 [debug] 5#5: *9 free rr peer 2 0
2017/08/23 17:03:27 [debug] 5#5: *9 close http upstream connection: 12
2017/08/23 17:03:27 [debug] 5#5: *9 free: 000055DEEC1F1460, unused: 48
2017/08/23 17:03:27 [debug] 5#5: *9 event timer del: 12: 1503507866742
2017/08/23 17:03:27 [debug] 5#5: *9 reusable connection: 0
2017/08/23 17:03:27 [debug] 5#5: *9 http finalize request: 499, "/index.php?" a:1, c:1
2017/08/23 17:03:27 [debug] 5#5: *9 http terminate request count:1
2017/08/23 17:03:27 [debug] 5#5: *9 http terminate cleanup count:1 blk:0
2017/08/23 17:03:27 [debug] 5#5: *9 http posted request: "/index.php?"
2017/08/23 17:03:27 [debug] 5#5: *9 http terminate handler count:1
2017/08/23 17:03:27 [debug] 5#5: *9 http request count:1 blk:0
2017/08/23 17:03:27 [debug] 5#5: *9 http close request
2017/08/23 17:03:27 [debug] 5#5: *9 http log handler
2017/08/23 17:03:27 [debug] 5#5: *9 free: 000055DEEC1D9E80, unused: 0
2017/08/23 17:03:27 [debug] 5#5: *9 free: 000055DEEC1DAEA0, unused: 1089
2017/08/23 17:03:27 [debug] 5#5: *9 close http connection: 3
2017/08/23 17:03:27 [debug] 5#5: *9 reusable connection: 0
2017/08/23 17:03:27 [debug] 5#5: *9 free: 000055DEEC1CBEC0
2017/08/23 17:03:27 [debug] 5#5: *9 free: 000055DEEC1D5BE0, unused: 120

And according to https://stackoverflow.com/questions/15613452/nginx-issues-http-499-error-after-60-seconds-despite-config-php-and-aws#answer-15621223 it sounds like the probe is closing the connection....

Hmm

@jiaxuanzhou
Contributor

@chrissound
What happens when the "host" is set to "127.0.0.1"?

@chrissound
Author

The issue was that the container was taking longer than a second to respond.

@saurabhdevops

@chrissound Would you mind suggesting what you did to resolve this? I am facing a similar issue.

@mgalgs

mgalgs commented Jan 17, 2018

@saurabhdevops You can increase the probe timeout if you don't expect the container to respond within 1 second (livenessProbe.timeoutSeconds).
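
A minimal sketch of that change, applied to the readiness probe from the original report (timeoutSeconds defaults to 1 second when unset, so any response slower than that counts as a failure; the values below are illustrative, not a recommendation):

          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 5      # default is 1; allow the container up to 5s to answer
            failureThreshold: 3    # default; the probe must fail this many times in a row before the pod is marked unready

The same timeoutSeconds field exists on livenessProbe; note that a persistently failing liveness probe restarts the container, whereas a failing readiness probe only removes the pod from Service endpoints.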

@Macmee

Macmee commented Sep 29, 2018

sorry to bump this, but I'm on EKS (if it matters) and my nginx ingress readiness probes are taking more than 1 second to respond too. Sometimes they hang upwards of 5-8 seconds and then respond. I increased livenessProbe.timeoutSeconds but I'm trying to figure out why this delay occurs. Does anyone here have any suggestions?

@tproenca

tproenca commented Oct 9, 2018

I'm having the same issue and I have increased the timeoutSeconds to 5.

@JorgenG

JorgenG commented Nov 19, 2018

@Macmee Having the same strange issue here. Basically we started using NodePort, and it sporadically times out the readiness / liveness probe checks.

When I inspect our monitoring metrics, the endpoints used for this have a maximum response time of 4.5ms; in other words, something is stalling the connection into the container/pod somehow.

The way I run it is that two services attach to the container, one headless and one http. I'm curious if that could be part of the problem?

Have you learned anything more about the issue?
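
For context, a headless Service plus a regular Service selecting the same pods would look roughly like the sketch below (names and ports are illustrative, not taken from this thread). Note that the kubelet runs httpGet readiness/liveness probes directly against the pod IP, so neither Service is on the probe path.

apiVersion: v1
kind: Service
metadata:
  name: myapp-headless        # hypothetical name
spec:
  clusterIP: None             # headless: DNS resolves straight to the pod IPs
  selector:
    app: myapp
  ports:
    - port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: myapp                 # hypothetical name
spec:
  type: NodePort              # matches the NodePort setup mentioned above
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 80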

@stafot
Contributor

stafot commented Dec 4, 2018

Same experience here on EKS too, c5 instances.

@nomcopter

Seeing this issue in GCP as well. Not sure where this stall is coming from.

@wwyhy

wwyhy commented Feb 18, 2019

Same experience here on EKS too, c5 instances.

We are using the same C5 instances, but not EKS; we deploy k8s with kops.

We set timeoutSeconds to 5, no help.

@futare

futare commented Mar 12, 2019

Same experience. I'm using kops on AWS v1.10.6, with timeoutSeconds set to 5 (I tried higher also), but metrics show sub-second response times. Using Tomcat in the pod. I get 200s in the Tomcat access logs, and externally the app responds fine. I can also reach it fine from the kubelet host. Using r5 instances. Does anyone here have suggestions?

@blaketastic2

Did anyone ever solve this?

@mbelang

mbelang commented May 31, 2019

Same problem with m5 instances on EKS (I don't think EKS is related), k8s 1.12.x and a gunicorn app. If I expose the Flask server directly the issue doesn't appear anymore...

What is super odd is that I see a log line from the pod, and it is coming from the probe:

2019-05-31T13:16:46.863528Z {"url": "/healthcheck", "event": "Healthcheck", "timestamp": "2019-05-31T13:16:46.863284Z", "logger": "demo.main", "level": "info"}

We have other services running uvicorn and gunicorn with no problem.

@cattermo

Did anyone find a solution to this?

@gitnik

gitnik commented Nov 18, 2019

Any updates on this?

@AndrewGrachov

Ran into this issue today. Really strange behaviour.

@Biga-incorta

Any updates?

@rohitrd0987

Same issue; this is affecting only one of my GCP projects. I have 3 private GKE clusters and only 2 of them are being affected. The same code running in different projects does not behave the same. Only one container is failing the readiness/liveness probes, and this is seen with multiple pods across different namespaces. In the kube-system namespace, the kube-dns, l7-default-backend, and heapster pods are restarting with the same behavior. Any insight is much appreciated.

@ervinb

ervinb commented Jan 13, 2020

Try doubling your CPU requests to see if that helps. I saw readiness probes failing due to lack of resources.
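
As a rough illustration (the numbers are placeholders, not taken from this thread), CPU requests and limits are set per container in the pod spec; a container that is throttled or starved on CPU can answer probes slowly enough to exceed the probe timeout:

          resources:
            requests:
              cpu: 500m        # e.g. doubled from a hypothetical 250m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi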

@muscovitebob

I have encountered this with a resource-limited namespace that was close to capacity (unsure if this was causative). Setting the timeout as suggested here fixed it.

@vanhaxl

vanhaxl commented Jan 17, 2020

Hi everyone. I wonder how we can know that the root cause is resources?
This issue is currently happening to our software. I checked CPU usage and it is only at 40%, and memory at 36%.
Is this a case of thread starvation?

@vanhaxl

vanhaxl commented Jan 17, 2020

To add more info: the service is still available when I make a REST call from Postman. But within the Kubernetes ecosystem, when other services call it, the unhealthy pods take a long time to respond, which causes a timeout exception.

@rockysingh

To add more info: the service is still available when I make a REST call from Postman. But within the Kubernetes ecosystem, when other services call it, the unhealthy pods take a long time to respond, which causes a timeout exception.

did you resolve this issue?

@vanhaxl

vanhaxl commented Jan 22, 2020

Hi rockysingh, not yet.

@vanhaxl

vanhaxl commented Jan 22, 2020

I tried to increase the timeout (twice) in RestTemplate when calling the API on the unhealthy pods, but I still received some timeout exceptions. Still looking for the root cause of this.

@vatsramkesh

Facing the same issue here: Readiness probe failed: Get http://127.0.0.1:8085/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers), even though the service returns a 200 response.

@Nittarab

I have the same issue... It is not a resources problem.

I can see the correct response in the pod's logs:

10.48.8.1 - - [25/Feb/2020:12:30:05 +0000] "GET /wp-login.php HTTP/1.1" 200 1340
10.48.8.1 - - [25/Feb/2020:12:30:09 +0000] "GET /wp-login.php HTTP/1.1" 200 1340
10.156.0.12 - - [25/Feb/2020:12:30:11 +0000] "GET /wp-login.php HTTP/1.1" 200 3206

But if I describe the pod:

Events:
  Type     Reason     Age                   From                                          Message
  ----     ------     ----                  ----                                          -------
  Normal   Scheduled  3m6s                  default-scheduler                             Successfully assigned default/prod-shopmates-crm-74db965897-g9xzl to gke-shop-mates-pool-3-10336c2f-q5jf
  Warning  Unhealthy  105s (x4 over 2m15s)  kubelet, gke-shop-mates-pool-3-10336c2f-q5jf  Readiness probe failed: Get http://10.48.8.21:80/wp-login.php: dial tcp 10.48.8.21:80: connect: connection refused
  Warning  Unhealthy  101s                  kubelet, gke-shop-mates-pool-3-10336c2f-q5jf  Liveness probe failed: Get http://10.48.8.21:80/wp-login.php: dial tcp 10.48.8.21:80: connect: connection refused
  Warning  Unhealthy  25s (x6 over 75s)     kubelet, gke-shop-mates-pool-3-10336c2f-q5jf  Readiness probe failed: Get http://10.48.8.21:80/wp-login.php: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  21s (x6 over 71s)     kubelet, gke-shop-mates-pool-3-10336c2f-q5jf  Liveness probe failed: Get http://10.48.8.21:80/wp-login.php: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /wp-login.php
            port: 80
            scheme: HTTP
          initialDelaySeconds: 80
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /wp-login.php
            port: 80
            scheme: HTTP
          initialDelaySeconds: 40
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10



Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.9-gke.8", GitCommit:"a9973cbb2722793e2ea08d20880633ca61d3e669", GitTreeState:"clean", BuildDate:"2020-02-07T00:50:57Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}

@yuripastushenko

/triage unresolved

@rockysingh

It is worth checking if the pod's CPU spikes on startup.

@rajeshkumaratappviewx

rajeshkumaratappviewx commented Nov 9, 2020

Started seeing the problem when there was very high CPU and memory usage during performance testing. We continued testing after increasing the liveness and readiness timeouts, since it worked after that change, but it later started failing again even with the increased values. Some of the pods fail to register as services in Kubernetes even though the Docker service is up properly.

Looking for any help to understand what causes this problem.

Note: MongoDB also runs in the same setup and is consuming almost all the memory; it runs as a StatefulSet. After a few more rounds of testing, even the MongoDB pods went into CrashLoopBackOff due to liveness failures.

@hs18manish

I am also facing this issue on k8s versions 1.26, 1.27 and 1.28.

NAME="Rocky Linux"
VERSION="8.9 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.9 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2029-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.9"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.9"
