
Add Probe-level terminationGracePeriodSeconds #99375

Merged: 6 commits into kubernetes:master on Mar 12, 2021

Conversation

@ehashman (Member) commented Feb 23, 2021

What type of PR is this?

/kind bug
/kind api-change

/sig node
/priority important-soon

What this PR does / why we need it:

Adds a Probe-level terminationGracePeriodSeconds.

  • Add field to APIs
  • Add feature flag for field (ProbeTerminationGracePeriod)
  • Wire up code changes
  • Add e2e tests (maybe in a follow-up?)
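The override behaviour these changes wire up can be sketched roughly as follows. This is an illustrative sketch, not the actual kubelet code: the helper name `effectiveGracePeriod` and the simplified types are assumptions, though the precedence (probe-level wins when the gate is on, otherwise pod-level, otherwise the 30-second default) matches the feature's design.

```go
package main

import "fmt"

// Simplified stand-ins for the real API types (the actual Kubernetes
// types live in k8s.io/api/core/v1 and carry many more fields).
type Probe struct {
	TerminationGracePeriodSeconds *int64 // the new optional probe-level field
}

type PodSpec struct {
	TerminationGracePeriodSeconds *int64
}

// effectiveGracePeriod returns the grace period the kubelet should use
// when killing a container that failed its liveness probe: the
// probe-level value wins when the feature gate is enabled and the field
// is set; otherwise the pod-level value (or the 30s default) applies.
func effectiveGracePeriod(spec PodSpec, probe *Probe, gateEnabled bool) int64 {
	if gateEnabled && probe != nil && probe.TerminationGracePeriodSeconds != nil {
		return *probe.TerminationGracePeriodSeconds
	}
	if spec.TerminationGracePeriodSeconds != nil {
		return *spec.TerminationGracePeriodSeconds
	}
	return 30 // Kubernetes default terminationGracePeriodSeconds
}

func main() {
	pod, probe := int64(500), int64(10)
	spec := PodSpec{TerminationGracePeriodSeconds: &pod}
	p := &Probe{TerminationGracePeriodSeconds: &probe}
	fmt.Println(effectiveGracePeriod(spec, p, true))  // prints 10: probe-level wins
	fmt.Println(effectiveGracePeriod(spec, p, false)) // prints 500: pod-level applies
}
```

This mirrors the manual test below: a pod-level grace period of 500s overridden to 10s by the liveness probe when the gate is on.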

Which issue(s) this PR fixes:

Fixes #64715.

See kubernetes/enhancements#2238 and linked KEP below for details.

Special notes for your reviewer:

Manually tested code changes with ./hack/local-up-cluster.sh, both with no feature gates and with FEATURE_GATES="ProbeTerminationGracePeriod=true".

Test pod spec:

---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  terminationGracePeriodSeconds: 500
  containers:
    - name: test
      image: nginx
      command: [bash, -c, "sleep 100000000"]
      ports:
        - containerPort: 8080
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 1
        periodSeconds: 60
        terminationGracePeriodSeconds: 10

With the feature gate on, the pod-level termination grace period is successfully overridden by the probe-level value:

Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  49s               default-scheduler  Successfully assigned default/test-pod to 127.0.0.1
  Normal   Pulled     48s               kubelet            Successfully pulled image "nginx" in 1.413504483s
  Warning  Unhealthy  19s               kubelet            Liveness probe failed: Get "http://172.17.0.5:8080/healthz": dial tcp 172.17.0.5:8080: connect: connection refused
  Normal   Killing    19s               kubelet            Container test failed liveness probe, will be restarted
  Normal   Pulling    9s (x2 over 49s)  kubelet            Pulling image "nginx"
  Normal   Created    7s (x2 over 48s)  kubelet            Created container test
  Normal   Started    7s (x2 over 48s)  kubelet            Started container test
  Normal   Pulled     7s                kubelet            Successfully pulled image "nginx" in 1.630529866s

Without the feature gate, behaviour is the same as the status quo:

Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  96s   default-scheduler  Successfully assigned default/test-pod to 127.0.0.1
  Normal   Pulling    95s   kubelet            Pulling image "nginx"
  Normal   Pulled     92s   kubelet            Successfully pulled image "nginx" in 3.370430829s
  Normal   Created    92s   kubelet            Created container test
  Normal   Started    92s   kubelet            Started container test
  Warning  Unhealthy  41s   kubelet            Liveness probe failed: Get "http://172.17.0.3:8080/healthz": dial tcp 172.17.0.3:8080: connect: connection refused
  Normal   Killing    41s   kubelet            Container test failed liveness probe, will be restarted
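The gate-off behaviour above is consistent with the usual Kubernetes pattern of clearing disabled alpha fields on write. A minimal sketch of that pattern, under assumptions: the helper name `dropDisabledProbeFields` and the simplified type are illustrative, not the actual apiserver strategy code.

```go
package main

import "fmt"

// Simplified stand-in for the real probe type in k8s.io/api/core/v1.
type Probe struct {
	TerminationGracePeriodSeconds *int64
}

// dropDisabledProbeFields clears the new alpha field when the
// ProbeTerminationGracePeriod feature gate is off, so pods written
// with the gate disabled behave exactly as before the change.
func dropDisabledProbeFields(probe *Probe, gateEnabled bool) {
	if probe == nil || gateEnabled {
		return
	}
	probe.TerminationGracePeriodSeconds = nil
}

func main() {
	v := int64(10)
	p := &Probe{TerminationGracePeriodSeconds: &v}
	dropDisabledProbeFields(p, false)
	fmt.Println(p.TerminationGracePeriodSeconds == nil) // prints true: field dropped when gate is off
}
```

With the field dropped, the kubelet only ever sees the pod-level grace period, which is why the second event log matches the status quo.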

Does this PR introduce a user-facing change?

Add Probe-level terminationGracePeriodSeconds field

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2238-liveness-probe-grace-period

@k8s-ci-robot k8s-ci-robot added release-note do-not-merge/work-in-progress kind/bug kind/api-change size/XXL sig/node priority/important-soon cncf-cla: yes needs-triage sig/apps labels Feb 23, 2021
@k8s-ci-robot k8s-ci-robot requested review from andrewsykim and dims Feb 23, 2021
@ehashman ehashman added this to Waiting on Author in SIG Node PR Triage Feb 24, 2021
@ehashman ehashman changed the title [WIP] Add Probe-level terminationGracePeriodSeconds Add Probe-level terminationGracePeriodSeconds Mar 1, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress label Mar 1, 2021
@ehashman (Member, Author) commented Mar 1, 2021

Taking off WIP since there is probably enough here for an initial review/feedback. I'm working on the e2es but should have that shortly.

@ehashman ehashman moved this from Waiting on Author to Triage in SIG Node PR Triage Mar 1, 2021
@ehashman (Member, Author) commented Mar 1, 2021

Since this is an alpha feature, it's unclear to me if the e2es should land in this PR. I think for that to work I'd need to add the feature gate (which won't exist until this lands) to the test-infra jobs, like kubernetes/test-infra#20369. Should I do that after this lands, or simultaneously? @dims?

@ehashman (Member, Author) commented Mar 2, 2021

/cc @smarterclayton @matthyx

@dims (Member) commented Mar 2, 2021

@ehashman we would land the e2e test and feature here, and follow up with a PR to test-infra to switch it on (then iterate if something fails).

@ehashman (Member, Author) commented Mar 2, 2021

@dims (just thinking out loud) it won't be able to actually exercise any of the new logic with the alpha flag turned on until I add it to test-infra, and I'm not sure I can add the flag to test-infra if it's not yet defined. Am I missing something?

@liggitt (Member) commented Mar 2, 2021

/label api-review

@k8s-ci-robot k8s-ci-robot added the api-review label Mar 2, 2021
@liggitt liggitt added this to Assigned in API Reviews Mar 2, 2021
@dims (Member) commented Mar 2, 2021

@ehashman we would fake-test it with an extra commit that turns it on, run it a few times, and remove the extra commit before getting reviews and merging the PR.

@k8s-ci-robot k8s-ci-robot removed lgtm needs-rebase labels Mar 10, 2021
@ehashman ehashman moved this from Waiting on Author to Needs Approver in SIG Node PR Triage Mar 10, 2021
@smarterclayton (Contributor) commented Mar 10, 2021

/approve

For the API changes. I'm ok with an exception request since we have had this mostly baked for a while and the API issues were minor. I assume sig-node is ok too. Risk seems minimal.

@k8s-ci-robot k8s-ci-robot added the approved label Mar 10, 2021
@palnabarun (Member) commented Mar 11, 2021

/milestone v1.21

Restoring the milestone since the exception request was APPROVED. ref: https://groups.google.com/g/kubernetes-sig-release/c/eOphdiRmK7k/m/4cSfBcneAgAJ

@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Mar 11, 2021
@rphillips (Member) commented Mar 11, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Mar 11, 2021
@ehashman (Member, Author) commented Mar 11, 2021

/retest

@mrunalp mrunalp moved this from Needs Approver to Done in SIG Node PR Triage Mar 11, 2021
@k8s-ci-robot (Contributor) commented Mar 11, 2021

@ehashman: The following test failed, say /retest to rerun all failed tests:

Test name: pull-kubernetes-node-e2e-alpha
Commit: afc3fdb
Rerun command: /test pull-kubernetes-node-e2e-alpha

Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

@k8s-ci-robot k8s-ci-robot removed the lgtm label Mar 12, 2021
@ehashman (Member, Author) commented Mar 12, 2021

I think the verify failure was a rebase/merge conflict issue; hopefully the update will pass. It passes locally.

@k8s-ci-robot (Contributor) commented Mar 12, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, derekwaynecarr, ehashman, matthyx, smarterclayton


@matthyx (Contributor) commented Mar 12, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Mar 12, 2021
@k8s-ci-robot k8s-ci-robot merged commit faa5c8c into kubernetes:master Mar 12, 2021
15 of 16 checks passed
@liggitt liggitt moved this from Assigned to API review completed, 1.21 in API Reviews Apr 6, 2021
@thockin (Member) commented Oct 30, 2021

Hi. I know this is old, so I am not sure we want to do anything about this comment, but I wanted to ask:

I'm investigating places where we share API structs across use-cases but some fields don't apply. We have a few cases in Probe, and I stumbled on this one (which is new to me). I skimmed the KEP but I didn't see this aspect really discussed. Was it considered?

For example, maybe we should have un-shared the struct? Or maybe this should have been Pod.spec.abnormalTerminationGracePeriod (so it applies to all relevant probes and maybe other things without changing the probe struct)?

Labels
api-review approved area/kubelet area/test cncf-cla: yes kind/api-change kind/bug lgtm priority/important-soon release-note sig/api-machinery sig/apps sig/node sig/testing size/XXL triage/accepted
Projects
API Reviews
API review completed, 1.21