
Istio readiness Probe rewrite should respect the user specified timeout duration #18242

Closed
rnkhouse opened this issue Oct 23, 2019 · 13 comments
Labels: area/security, lifecycle/staleproof

@rnkhouse

rnkhouse commented Oct 23, 2019

Bug description
I followed the guide at https://istio.io/docs/ops/app-health-check/ and used the "Enable globally via Helm option" method to enable readiness probes for my application. sidecarInjectorWebhook.rewriteAppHTTPProbe=true has been set in the global config.

Deployment

readinessProbe:
      httpGet:
        path: /healthz
        port: 3000
      initialDelaySeconds: 700
      periodSeconds: 10
      timeoutSeconds: 10

Error

Warning  Unhealthy  16s (x2 over 26s)  kubelet, aks-agentpool-17141372-2  Readiness probe failed: Get http://10.240.0.78:15020/app-health/example-app/readyz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  13s                kubelet, aks-agentpool-17141372-2  Readiness probe failed: HTTP probe failed with statuscode: 503
Warning  Unhealthy  12s                kubelet, aks-agentpool-17141372-2  Readiness probe failed: Get http://10.240.0.78:15020/app-health/example-app/readyz: EOF
Warning  Unhealthy  11s                kubelet, aks-agentpool-17141372-2  Readiness probe failed: Get http://10.240.0.78:15020/healthz/ready: dial tcp 10.240.0.78:15020: connect: connection refused
Normal   Pulled     9s (x2 over 12m)   kubelet, aks-agentpool-17141372-2  Container image "docker.io/istio/proxyv2:1.2.6" already present on machine

Expected behavior
It should check the readiness probe and route traffic once the pod is ready. It should also point to the application's /healthz path for the readiness check, as specified in the deployment.

Version (include the output of istioctl version --remote and kubectl version)
Istio version: 1.2.6
Kubernetes version: 1.14.5

How was Istio installed?
Helm template

Environment where bug was observed (cloud vendor, OS, etc)
AKS

@incfly

incfly commented Oct 24, 2019

Do you mind doing the following?

  • Confirm that your app's health check works before sidecar injection. The annotation sidecar.istio.io/inject: false allows you to do that. I'm asking because it looks like your app is not listening on that port at all.
  • Could you show us the relevant logs from kubectl logs $pod -c istio-proxy? We should see an error from the pilot agent, in prober.go/status.go or similar.
  • Could you show the pod's istio-proxy environment variable section, to make sure the rewrite is correct?

@rnkhouse
Author

rnkhouse commented Oct 24, 2019

@incfly
The app is working fine on port 3000. I checked by exec'ing into the container, and it works on the gateway domain if I don't use a readiness probe at all.

example-app logs

> Ready on http://localhost:3000

Istio proxy logs

2019-10-24T14:41:16.702184Z	warn	Epoch 0 terminated with an error: signal: killed
2019-10-24T14:41:16.702206Z	warn	Aborted all epochs
2019-10-24T14:41:16.748544Z	error	Request to probe app failed: Get http://localhost:15020/app-health/example-app/readyz: net/http: request canceled (Client.Timeout exceeded while awaiting headers), original URL path = /app-health/example-app/readyz app URL path = /app-health/example-app/readyz

Environment variables istio-proxy:

Environment:
  POD_NAME:                          example-app-75c6cd597f-l8kzv (v1:metadata.name)
  POD_NAMESPACE:                     default (v1:metadata.namespace)
  INSTANCE_IP:                        (v1:status.podIP)
  ISTIO_META_POD_NAME:               example-app-75c6cd597f-l8kzv (v1:metadata.name)
  ISTIO_META_CONFIG_NAMESPACE:       default (v1:metadata.namespace)
  ISTIO_META_INTERCEPTION_MODE:      REDIRECT
  ISTIO_META_INCLUDE_INBOUND_PORTS:  3000
  ISTIO_METAJSON_ANNOTATIONS:        {"sidecar.istio.io/status":"{\"version\":\"761ebc5a63976754715f22fcf548f05270fb4b8db07324894aebdb31fa81d960\",\"initContainers\":[\"istio-init\"],\"containers\":[\"istio-proxy\"],\"volumes\":[\"istio-envoy\",\"istio-certs\"],\"imagePullSecrets\":null}"}
                                     
  ISTIO_METAJSON_LABELS:             {"app":"example-app","pod-template-hash":"75c6cd597f"}
                                     
  ISTIO_KUBE_APP_PROBERS:            {"/app-health/example-app/readyz":{"path":"/app-health/example-app/readyz","port":15020,"scheme":"HTTP"}}

PS: When I deploy the app with sidecar.istio.io/inject: false it works fine. Also, I can see ISTIO_KUBE_APP_PROBERS: {"/app-health/example-app/readyz":{"path":"/healthz","port":3000}} in the env variables of istio-proxy.

@incfly

incfly commented Oct 29, 2019

ISTIO_KUBE_APP_PROBERS: {"/app-health/example-app/readyz":{"path":"/app-health/example-app/readyz","port":15020,"scheme":"HTTP"}}

This is what you pasted from the env var section, which is different from

ISTIO_KUBE_APP_PROBERS: {"/app-health/example-app/readyz":{"path":"/healthz","port":3000}}

Which one is the actual env var of the istio-proxy container?

@incfly

incfly commented Oct 29, 2019

Also, what is your app health check endpoint's timeout/latency? I vaguely remember we have a hard-coded timeout in the pilot agent; if your app takes longer than that, we may hit the timeout.

incfly@b412d59#diff-c1bdb159b354a382bbfe6a0bb13863b8R156

Ours is 10 seconds, which might be less than what the kubelet's implementation allows...
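To make the concern concrete, here is a minimal Go sketch of the problem being described, not the pilot-agent's actual code: a client-wide Timeout caps every forwarded app probe at a fixed 10 seconds, regardless of the timeoutSeconds the user set in the pod spec. The package and function names below are illustrative assumptions.

package status // hypothetical package name, echoing the pilot-agent status server

import (
	"net/http"
	"time"
)

// newProbeClient illustrates the hardcoded behavior: every app probe the agent
// forwards is cancelled after 10s, even if the kubelet's probe allows more time.
func newProbeClient() *http.Client {
	return &http.Client{Timeout: 10 * time.Second}
}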

@rnkhouse
Author

@incfly
I get ISTIO_KUBE_APP_PROBERS: {"/app-health/example-app/readyz":{"path":"/app-health/example-app/readyz","port":15020,"scheme":"HTTP"}} in the environment variables.

When I add sidecar.istio.io/inject: false and re-deploy the app, I get ISTIO_KUBE_APP_PROBERS: {"/app-health/example-app/readyz":{"path":"/healthz","port":3000}}

My app health check timeout is 10 seconds.

@incfly

incfly commented Oct 31, 2019

Okay, I see the kubelet code has a timeout duration specified; we don't respect that and instead use a hardcoded 10 seconds. https://sourcegraph.com/github.com/kubernetes/kubernetes@a381f7cb3efa64395661c73961c07b02594acf64/-/blob/pkg/kubelet/prober/prober.go#L172

In your case, you explicitly set the timeout to 10 seconds, which is right on the boundary, so we might time out too early.

I think we can fix this by propagating the timeout to the Istio pilot-agent so it respects the user-specified value.


BTW, I don't quite understand the ISTIO_KUBE_APP_PROBERS values... if you annotate the pod to not inject, you should not see an istio-proxy sidecar container at all, right?

What I'm expecting:

  • with the sidecar injected, ISTIO_KUBE_APP_PROBERS: {"/app-health/example-app/readyz":{"path":"/healthz","port":3000}};
  • without the sidecar injected, no env var at all. (I just want to confirm whether your health check worked before, as a control comparison.)

@incfly incfly changed the title from "Readiness probe failed - connection refused" to "Istio readiness Probe rewrite should respect the user specified timeout duration" on Oct 31, 2019
@rnkhouse
Author

I have auto-inject enabled, but the deployment-level annotation should override the global setting.

Please take a look at the logs below (sidecar injection enabled):

Name:           example-app-7c6ddc6c4b-xrzn8
Namespace:      default
Priority:       0
Node:           aks-agentpool-17141372-vmss000003/10.0.11.35
Start Time:     Thu, 31 Oct 2019 14:13:38 -0400
Labels:         app=example-app
                pod-template-hash=7c6ddc6c4b
Annotations:    sidecar.istio.io/status:
                {"version":"761ebc5a63976754715f22fcf548f05270fb4b8db07324894aebdb31fa81d960","initContainers":["istio-init"],"containers":["istio-proxy"]...
Status:         Running
IP:             10.0.11.55
IPs:            <none>
Controlled By:  ReplicaSet/example-app-7c6ddc6c4b
Init Containers:
istio-init:
    Container ID:  docker://9fdf2850b104e6b722235c00f1825317431827dd41eddbcacfda38bf1caa6d83
    Image:         docker.io/istio/proxy_init:1.2.6
    Image ID:      docker-pullable://istio/proxy_init@sha256:dfa6205d7f7f53fa3fe1b8844d439f2c367930901a8e73856e806c71a094dcaf
    Port:          <none>
    Host Port:     <none>
    Args:
    -p
    15001
    -u
    1337
    -m
    REDIRECT
    -i
    *
    -x
    
    -b
    3000
    -d
    15020
    State:          Terminated
    Reason:       Completed
    Exit Code:    0
    Started:      Thu, 31 Oct 2019 14:13:40 -0400
    Finished:     Thu, 31 Oct 2019 14:13:40 -0400
    Ready:          True
    Restart Count:  0
    Limits:
    cpu:     100m
    memory:  50Mi
    Requests:
    cpu:        10m
    memory:     10Mi
    Environment:  <none>
    Mounts:
    /var/run/secrets/kubernetes.io/serviceaccount from default-token-5g6qf (ro)
Containers:
example-app:
    Container ID:   docker://a63cbaeafd1ad30a0a09162a1d18973e08ecc978dd278eb4ba8ad318107ee75c
    Image:          mtesacr.azurecr.io/remitbee-cp:90-staging2019_10_31_14_10_01
    Image ID:       docker-pullable://mtesacr.azurecr.io/remitbee-cp@sha256:00eb646bfdea1ba9eec570bb7966f0395af363693e24deeb7fdd1c6a81ab11a4
    Port:           3000/TCP
    Host Port:      0/TCP
    State:          Running
    Started:      Thu, 31 Oct 2019 14:14:04 -0400
    Ready:          False
    Restart Count:  0
    Limits:
    cpu:     1
    memory:  1536Mi
    Requests:
    cpu:        800m
    memory:     1536Mi
    Readiness:    http-get http://:15020/app-health/example-app/readyz delay=200s timeout=10s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
    /var/run/secrets/kubernetes.io/serviceaccount from default-token-5g6qf (ro)
istio-proxy:
    Container ID:  docker://7cd9aac781e82eb3b2de8cab4e665fa6777b864e8e4f3d4e8e73fea5bfc1dbd1
    Image:         docker.io/istio/proxyv2:1.2.6
    Image ID:      docker-pullable://istio/proxyv2@sha256:cc9850e0281bb471a3fe6bec8a5303182f7f49fab687367015ba1fa8db6b7cf3
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
    proxy
    sidecar
    --domain
    $(POD_NAMESPACE).svc.cluster.local
    --configPath
    /etc/istio/proxy
    --binaryPath
    /usr/local/bin/envoy
    --serviceCluster
    example-app.$(POD_NAMESPACE)
    --drainDuration
    45s
    --parentShutdownDuration
    1m0s
    --discoveryAddress
    istio-pilot.istio-system:15010
    --zipkinAddress
    zipkin.istio-system:9411
    --dnsRefreshRate
    300s
    --connectTimeout
    10s
    --proxyAdminPort
    15000
    --concurrency
    2
    --controlPlaneAuthPolicy
    NONE
    --statusPort
    15020
    --applicationPorts
    3000
    State:          Running
    Started:      Thu, 31 Oct 2019 14:14:04 -0400
    Ready:          True
    Restart Count:  0
    Limits:
    cpu:     40m
    memory:  128Mi
    Requests:
    cpu:      20m
    memory:   64Mi
    Readiness:  http-get http://:15020/healthz/ready delay=1s timeout=1s period=2s #success=1 #failure=30
    Environment:
    POD_NAME:                          example-app-7c6ddc6c4b-xrzn8 (v1:metadata.name)
    POD_NAMESPACE:                     default (v1:metadata.namespace)
    INSTANCE_IP:                        (v1:status.podIP)
    ISTIO_META_POD_NAME:               example-app-7c6ddc6c4b-xrzn8 (v1:metadata.name)
    ISTIO_META_CONFIG_NAMESPACE:       default (v1:metadata.namespace)
    ISTIO_META_INTERCEPTION_MODE:      REDIRECT
    ISTIO_META_INCLUDE_INBOUND_PORTS:  3000
    ISTIO_METAJSON_LABELS:             {"app":"example-app"}
                                        
    ISTIO_KUBE_APP_PROBERS:            {"/app-health/example-app/readyz":{"path":"/healthz","port":3000}}
    Mounts:
    /etc/certs/ from istio-certs (ro)
    /etc/istio/proxy from istio-envoy (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from default-token-5g6qf (ro)
Conditions:
Type              Status
Initialized       True 
Ready             False 
ContainersReady   False 
PodScheduled      True 
Volumes:
istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
istio-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  istio.default
    Optional:    true
default-token-5g6qf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-5g6qf
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type     Reason     Age                From                                        Message
----     ------     ----               ----                                        -------
Normal   Scheduled  93s                default-scheduler                           Successfully assigned default/example-app-7c6ddc6c4b-xrzn8 to aks-agentpool-17141372-vmss000003
Normal   Pulled     92s                kubelet, aks-agentpool-17141372-vmss000003  Container image "docker.io/istio/proxy_init:1.2.6" already present on machine
Normal   Created    92s                kubelet, aks-agentpool-17141372-vmss000003  Created container istio-init
Normal   Started    91s                kubelet, aks-agentpool-17141372-vmss000003  Started container istio-init
Normal   Pulling    90s                kubelet, aks-agentpool-17141372-vmss000003  Pulling image "mtesacr.azurecr.io/remitbee-cp:90-staging2019_10_31_14_10_01"
Normal   Pulled     76s                kubelet, aks-agentpool-17141372-vmss000003  Successfully pulled image "mtesacr.azurecr.io/remitbee-cp:90-staging2019_10_31_14_10_01"
Normal   Created    68s                kubelet, aks-agentpool-17141372-vmss000003  Created container example-app
Normal   Started    67s                kubelet, aks-agentpool-17141372-vmss000003  Started container example-app
Normal   Pulled     67s                kubelet, aks-agentpool-17141372-vmss000003  Container image "docker.io/istio/proxyv2:1.2.6" already present on machine
Normal   Created    67s                kubelet, aks-agentpool-17141372-vmss000003  Created container istio-proxy
Normal   Started    67s                kubelet, aks-agentpool-17141372-vmss000003  Started container istio-proxy
Warning  Unhealthy  63s (x2 over 64s)  kubelet, aks-agentpool-17141372-vmss000003  Readiness probe failed: HTTP probe failed with statuscode: 503
Warning  Unhealthy  60s                kubelet, aks-agentpool-17141372-vmss000003  Readiness probe failed: Get http://10.0.11.55:15020/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

As you can see, ISTIO_KUBE_APP_PROBERS is set correctly, but the container's readiness probe still does not point to the application's own path. Also, the initial delay is 200s, yet my pod is in the Running state even though one container is not ready.

@incfly
Copy link

incfly commented Oct 31, 2019

As you can see, ISTIO_KUBE_APP_PROBERS is correctly mentioned but still, it's not pointing to the same readiness probe.

The probers point to a new endpoint on port 15020; this is expected. As the name indicates, "probe rewrite" means the prober configuration of the container in the pod spec is rewritten to a different path and port.

The buggy part is that we drop the delay & timeout settings... this needs to be implemented in the sidecar injector, so the original timeout is respected as well.
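For readers following along, here is a minimal Go sketch of how the rewrite works conceptually, assuming ISTIO_KUBE_APP_PROBERS decodes to a map keyed by the rewritten path; the struct and field names are illustrative, not Istio's exact types.

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// httpProbe mirrors the per-probe entries seen in ISTIO_KUBE_APP_PROBERS above.
type httpProbe struct {
	Path   string `json:"path"`
	Port   int    `json:"port"`
	Scheme string `json:"scheme,omitempty"`
}

func main() {
	// e.g. {"/app-health/example-app/readyz":{"path":"/healthz","port":3000}}
	raw := os.Getenv("ISTIO_KUBE_APP_PROBERS")

	probers := map[string]httpProbe{}
	if err := json.Unmarshal([]byte(raw), &probers); err != nil {
		fmt.Println("cannot parse ISTIO_KUBE_APP_PROBERS:", err)
		return
	}

	// The kubelet now hits :15020/app-health/example-app/readyz; the agent looks
	// up the original app probe by that path and forwards the request to the app.
	for rewritten, orig := range probers {
		fmt.Printf("%s -> localhost:%d%s\n", rewritten, orig.Port, orig.Path)
	}
}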

@incfly

incfly commented Nov 3, 2019

I do agree we miss the timeout part, which should be carried over to the pilot agent -> app request. But I can't tell that anything else is wrong.

Also, given that your app's timeout is 10s, which equals our hardcoded pilot agent -> app timeout, I'm not sure fixing that will address your issue.


Actually, because we use a JSON patch replace action for only the httpGet part of the probe, we leave the initial delay seconds as they are. The initialDelaySeconds part of the probe should not be relevant to this bug.

I confirmed this in my own deployment with readinessProbe.initialDelaySeconds = 50, and it is kept untouched after the rewrite.
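As an illustration of why sibling fields survive, here is a small Go sketch (the field paths and values are assumptions, not the injector's exact patch): a JSON patch replace on the probe's httpGet field rewrites only the path and port, leaving initialDelaySeconds and timeoutSeconds in the pod spec untouched, even though the timeout is not yet propagated to the agent.

package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is a single RFC 6902 JSON patch operation.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

func main() {
	// Replace only the httpGet block of the first container's readinessProbe;
	// sibling fields such as initialDelaySeconds stay exactly as the user wrote them.
	patch := []patchOp{{
		Op:   "replace",
		Path: "/spec/containers/0/readinessProbe/httpGet",
		Value: map[string]interface{}{
			"path": "/app-health/example-app/readyz",
			"port": 15020,
		},
	}}

	out, _ := json.MarshalIndent(patch, "", "  ")
	fmt.Println(string(out))
}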

Are you sure you set the original prober's initial delay to 700 seconds and then saw the injected app prober become 200s?


Warning Unhealthy 11s kubelet, aks-agentpool-17141372-2 Readiness probe failed: Get http://10.240.0.78:15020/healthz/ready: dial tcp 10.240.0.78:15020: connect: connection refused

This error is complaining that the sidecar's own health check is failing, which should not be related to the app probe rewrite... I don't think the app's 200s/700s initial delay has anything to do with the sidecar's health check, though.

@rnkhouse
Author

rnkhouse commented Nov 4, 2019

@incfly I changed it from 700s to 200s myself, so that's not the issue (sorry for the confusion). But when I set it to 200s, it shouldn't check readiness before that.

The app should become ready only after 200 seconds; in this case, the check happens immediately, which seems incorrect.

Warning  Unhealthy  63s (x2 over 64s)  kubelet, aks-agentpool-17141372-vmss000003  Readiness probe failed: HTTP probe failed with statuscode: 503
Warning  Unhealthy  60s                kubelet, aks-agentpool-17141372-vmss000003  Readiness probe failed: Get http://10.0.11.55:15020/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Even though the pod description shows a timeout, the app is in the Running state.

example-app-5c66db474f-n8htf    1/2     Running   0          87s

The initial delay is 200 seconds and the container is not ready to serve traffic, but the state is still Running.

@incfly

incfly commented Nov 4, 2019 via email

@incfly

incfly commented Nov 7, 2019

Here are some pointers on where the changes can be made to pass the timeout (a rough sketch follows below):

  • See the env var definition for how we propagate the app prober information (ISTIO_KUBE_APP_PROBERS); you can add a separate field to the JSON encoding to pass the original timeout duration.
  • When constructing the HTTP client, use the timeout extracted from https://github.com/istio/istio/blob/master/pilot/cmd/pilot-agent/status/server.go#L220 rather than hardcoding it.
  • The sidecar injector (both the webhook and istioctl kube-inject) needs a corresponding change to the JSON encoding returned by DumpProbers.
  • Note that createProbeRewritePatch is specific to the webhook, which uses a different JSON patch to do sidecar injection. We need to make sure both istioctl kube-inject and the webhook injector work.

Hope this helps @johnma14
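A rough Go sketch of the direction described in the pointers above, not the actual patch: carry the original timeoutSeconds inside the per-probe JSON and apply it per request in the agent, instead of relying on a single hardcoded client timeout. All type, field, and function names here are assumptions.

package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// appProber extends the per-probe JSON with the timeout carried over from the pod spec.
type appProber struct {
	Path           string `json:"path"`
	Port           int    `json:"port"`
	TimeoutSeconds int32  `json:"timeoutSeconds,omitempty"`
}

// probeApp forwards one kubelet probe to the app, bounded by the probe's own timeout.
func probeApp(client *http.Client, p appProber) error {
	timeout := 10 * time.Second // keep the old default as a fallback
	if p.TimeoutSeconds > 0 {
		timeout = time.Duration(p.TimeoutSeconds) * time.Second
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	url := fmt.Sprintf("http://localhost:%d%s", p.Port, p.Path)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 400 {
		return fmt.Errorf("probe returned status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	client := &http.Client{} // no client-wide Timeout; the per-request context decides
	p := appProber{Path: "/healthz", Port: 3000, TimeoutSeconds: 10}
	if err := probeApp(client, p); err != nil {
		fmt.Println("probe failed:", err)
	}
}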

@incfly incfly added lifecycle/staleproof Indicates a PR or issue has been deemed to be immune from becoming stale and/or automatically closed and removed lifecycle/needs-triage labels Nov 18, 2019
@yuya894

yuya894 commented Jan 29, 2020

The same problem occurred in my environment, but it was due to a "double Istio injection".

  • The namespace is labeled istio-injection=enabled
  • And the YAML generated by istioctl kube-inject is applied

There is no problem if only one of them is used. https://istio.io/docs/ops/configuration/mesh/app-health-check/#enable-globally-via-install-option works fine too.

Solution Example:

$ kubectl get cm istio-sidecar-injector -n istio-system -o yaml | sed -e 's/"rewriteAppHTTPProbe":false/"rewriteAppHTTPProbe":true/' | kubectl apply -f - 
configmap/istio-sidecar-injector configured
$ kubectl label namespace default istio-injection=enabled --overwrite
namespace/default labeled
$ kubectl apply -f  samples/health-check/liveness-http-same-port.yaml
service/liveness-http created
deployment.apps/liveness-http created

Or

$ kubectl get cm istio-sidecar-injector -n istio-system -o yaml | sed -e 's/"rewriteAppHTTPProbe":false/"rewriteAppHTTPProbe":true/' | kubectl apply -f - 
configmap/istio-sidecar-injector configured
$ kubectl label namespace default istio-injection=disabled --overwrite
namespace/default labeled
$ kubectl apply -f <(istioctl kube-inject -f samples/health-check/liveness-http-same-port.yaml)
service/liveness-http created
deployment.apps/liveness-http created

My version:

$ kubectl version | awk '{print $5}'
GitVersion:"v1.17.0",
GitVersion:"v1.15.7",
$ istioctl version
client version: 1.4.0
control plane version: 1.4.0
data plane version: 1.4.0 (3 proxies)
$ sudo minikube version
minikube version: v1.4.0
commit: 7969c25a98a018b94ea87d949350f3271e9d64b6

johnma14 added a commit to johnma14/istio that referenced this issue Mar 2, 2020
…ied timeout

The httpClient created in the pilot agent code currently uses the default timeout of 10s.
Updated the code to use the timeout specified by the user instead.

Fixes: istio#18242
dgn pushed a commit to dgn/istio that referenced this issue Mar 3, 2020
…tio#18242)

* support timeout retention on livez and readyz probe rewrite (fixes istio#18242)

* fix incorrect error formatting

* fix tests for pilot-agent

* use internal struct for prober configuration
istio-testing pushed a commit that referenced this issue Mar 3, 2020
…8242) (#21755)

* support timeout retention on livez and readyz probe rewrite (fixes #18242)

* fix incorrect error formatting

* fix tests for pilot-agent

* use internal struct for prober configuration

Co-authored-by: Yordi Pauptit <4119194+yp28@users.noreply.github.com>