
Admission controller fails on timeout when failurePolicy set to Ignore #71508

Closed
omri86 opened this issue Nov 28, 2018 · 19 comments

Comments

@omri86 commented Nov 28, 2018:

I'm trying to set up a validating admission webhook on my GKE cluster using the following YAML:

apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: test-admission-webhook
webhooks:
  - name: my-admission-webhook.io
    rules:
      - apiGroups:
          - ""
        apiVersions:
          - "v1"
        operations:
          - "CREATE"
        resources:
          - "pods"
    failurePolicy: Ignore
    clientConfig:
      url: "https://192.168.99.1:8080"
      caBundle: %%TEST_BUNDLE%%

While my webhook server is down, I can't run any pod on my cluster; I get the following error:

Error creating: Timeout: request did not complete within allowed duration

Even when I removed the failurePolicy field from my YAML file (which is supposed to default to Ignore, as mentioned in the official docs), I got the same error.

/sig api-machinery
/kind bug

@yue9944882 (Member) commented Nov 28, 2018:

Error creating: Timeout: request did not complete within allowed duration

This happens when the webhook call exceeds the request timeout; the cause is that your webhook is not responding.

@omri86 (Author) commented Nov 28, 2018:

@yue9944882 Thanks for your reply.

I know this happens because my webhook is not responding, but from what I understand, the failurePolicy field defines what happens in exactly this situation:

FailurePolicy defines how unrecognized errors from the admission endpoint are handled - allowed values are Ignore or Fail. Defaults to Ignore.

So when setting it to Ignore - the pod should be up and running even though the webhook is not responding, am I wrong?

@yue9944882 (Member) commented Nov 28, 2018:

So when setting it to Ignore - the pod should be up and running even though the webhook is not responding, am I wrong?

No. Actually, it's your CREATE request that is failing, not the admission request; that is, your CREATE request failed before the apiserver could see and ignore the (missing) response from your webhook. You could work around this by setting a shorter timeout in your webhook server, I suppose.

@yue9944882 (Member) commented Nov 28, 2018:

@kubernetes/sig-api-machinery-feature-requests what do you think about adding a timeout option to the webhook configuration?

/kind feature

@omri86 (Author) commented Nov 28, 2018:

@yue9944882 Thanks again for your help, but I'm still not sure I follow.

I've pointed the webhook at some address, and now I'm sending a CREATE request that should reach that webhook. Are you saying that as long as the webhook is down, every CREATE request will fail? That seems a bit odd. Is there any way around this? How would reducing the timeout help in this scenario?

By the way, a feature for a timeout on the webhook configuration already exists here: #60914

@yue9944882 (Member) commented Nov 29, 2018:

So when setting it to Ignore - the pod should be up and running even though the webhook is not responding, am I wrong?

I'm trying to explain why your requests failed. The failurePolicy defines how we deal with responses returned from webhooks, but the apiserver doesn't receive anything if the webhook is not responding, so the failurePolicy didn't take effect. Currently, calling webhooks doesn't have its own timeout, but we do have a generic timeout (defaulting to 60s) for all incoming requests, which is why you can't run any new pods in your case. Does that make sense?

Is there any way around this? How would reducing the timeout help in this scenario?

To clarify, I suggest setting a timeout in your webhook server as well, so that it can return an error status like a 5XX instead of not responding at all. To work around the issue, could you change the matching rules to bypass pod creation, or simply delete the webhook configuration until the server is up? Would that help in your scenario?
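A minimal sketch of what such a server-side timeout could look like in Go, assuming a /validate path, port, and certificate file names that are purely illustrative; http.TimeoutHandler is the standard-library helper that replies 503 Service Unavailable when the wrapped handler overruns:

package main

import (
	"net/http"
	"time"
)

func main() {
	admit := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// ... decode the AdmissionReview from r.Body and write a response here ...
	})

	mux := http.NewServeMux()
	// TimeoutHandler replies 503 with the given message if admit overruns 5s,
	// so the apiserver sees a 5XX instead of a hang. The 5s figure is an assumption.
	mux.Handle("/validate", http.TimeoutHandler(admit, 5*time.Second, "admission review timed out"))

	srv := &http.Server{
		Addr:         ":8443",
		Handler:      mux,
		ReadTimeout:  10 * time.Second, // also bound slow reads at the connection level
		WriteTimeout: 10 * time.Second,
	}
	if err := srv.ListenAndServeTLS("tls.crt", "tls.key"); err != nil {
		panic(err)
	}
}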

@liggitt (Member) commented Nov 29, 2018:

The failurePolicy defines how we deal with responses returned from webhooks, but the apiserver doesn't receive anything if the webhook is not responding, so the failurePolicy didn't take effect

It should work. A failure policy of Ignore should fail open on a timeout or other call errors.

@yue9944882 (Member) commented Nov 29, 2018:

Then we need a proper timeout here:

// TODO: Figure out if adding one second timeout make sense here.
ctx := context.TODO()
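For illustration, a minimal sketch of what a bounded call could look like, using context.WithTimeout in place of context.TODO(); the 10-second figure and the bare HTTP POST to the URL from the issue are assumptions, not the actual dispatcher code:

package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Derive a context with a deadline instead of context.TODO(), so a hung
	// webhook fails fast and the failurePolicy can take effect.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	req, err := http.NewRequest(http.MethodPost, "https://192.168.99.1:8080", nil)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req.WithContext(ctx))
	if err != nil {
		// A timeout or connection failure lands here; with failurePolicy:
		// Ignore, the apiserver should treat it as a call error and fail open.
		fmt.Println("webhook call failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("webhook responded:", resp.Status)
}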

@fedebongio (Contributor) commented Nov 29, 2018:

@gyuho (Member) commented Feb 13, 2019:

Any updates?

@roycaihw (Member) commented Mar 4, 2019:

/assign

@roycaihw (Member) commented Mar 5, 2019:

It's not necessarily the admission webhook that is causing the failure.

@omri86 I suspect you may be seeing the error because of a component other than the admission webhook, but it's hard to tell with the given information.

@omri86 (Author) commented Mar 6, 2019:

@roycaihw Do you need me to supply more data?

@floriankoch commented Mar 6, 2019:

@roycaihw we also hit this bug. In our case, we run the admission controller on Kubernetes itself, and when updating a node (on which the admission controller runs), the networking (Calico) does not come up: the admission controller is down, this bug kicks in, and so the failure policy does not work.

@liggitt (Member) commented Mar 6, 2019:

the failure policy does not work

How long does the API request take before failing? It is possible you are encountering #60914 (comment), where the client-side timeout aborts the request before the webhook timeout is reached.

@roycaihw (Member) commented Mar 6, 2019:

I think @yue9944882 and @liggitt were right. It's your create request that is timing out, not your admission request.

tl;dr: it's behaving by design, but we shouldn't use the same timeout for the client request and the admission request. You could do one of the following to fix it (a sketch of the first option follows the list):

  • configure a timeout (> 30s) in your client request by setting Timeout in your restclient.Config (e.g. the config here). It will change the timeout for all your client requests (per-request timeout configuration is WIP)
  • configure a timeout (< 30s) in your webhook server, as @yue9944882 suggested
  • configure a timeout (< 30s) for the admission request using #74562 (it's in 1.14; probably the last option you'd want)
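For the first option, a minimal sketch using client-go; the kubeconfig path and the 60s figure are assumptions:

package main

import (
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	// rest.Config.Timeout applies to every request made by this client; raise
	// it above the apiserver's webhook timeout so the apiserver can fail open
	// (Ignore) before the client gives up.
	config.Timeout = 60 * time.Second

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	_ = clientset // then create pods as usual, e.g. clientset.CoreV1().Pods(ns).Create(...)
}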

(Longer version) I think what happened is:

There are two different requests, built on the same client package with the same timeout:

  1. your client sends a create request to the apiserver using client-go (which builds on the rest client, and eventually on an http client). The http client has a timeout set for every request (I think the config is defaulted to 30s somewhere)
  2. the apiserver receives the create request from the client and sends an admission request to the webhook server, also using the rest client with a 30s timeout

Since the webhook server is unresponsive, both requests hang:

  1. your client hits its timeout and returns an error first
  2. the apiserver hits its timeout talking to the webhook server. It could have ignored the error based on the policy and created the pod successfully, but the client has already dropped

(You can tell from the error message: it would contain the text "Internal error" if the apiserver actually didn't honor the Ignore policy and returned an error itself.)
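A minimal sketch of telling these cases apart from the returned error, assuming err comes from a clientset Create call; IsTimeout and IsInternalError are existing helpers in k8s.io/apimachinery/pkg/api/errors, while the function itself is illustrative:

package admissioncheck

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// explainCreateError classifies the error from a Pods().Create() call to tell
// a plain timeout apart from an apiserver "Internal error".
func explainCreateError(err error) string {
	switch {
	case err == nil:
		return "pod created; admission passed or failed open"
	case apierrors.IsTimeout(err):
		return "request timed out before admission finished; the Ignore policy never applied"
	case apierrors.IsInternalError(err):
		return "apiserver surfaced the webhook failure (Internal error); Ignore was not honored"
	default:
		return "other error: " + err.Error()
	}
}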

@fejta-bot commented Jun 4, 2019:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@roycaihw (Member) commented Jun 11, 2019:

/close

as it's working as expected

@k8s-ci-robot (Contributor) commented Jun 11, 2019:

@roycaihw: Closing this issue.

In response to this:

/close

as it's working as expected

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
