add readyz handling to netexec #110174

Merged: 1 commit, May 24, 2022

deads2k
Contributor

@deads2k deads2k commented May 23, 2022

Testing services and endpoint handling requires readyz handling during termination of a pod. Chasing a possible bug, I found this gap in the server where we have no external signal for a kubelet to mark ready=false.

/kind bug
/priority important-soon
/sig network

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/network Categorizes an issue or PR as relevant to SIG Network. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 23, 2022
@k8s-ci-robot
Contributor

@deads2k: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 23, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 23, 2022
mux.HandleFunc("/", rootHandler)
mux.HandleFunc("/clientip", clientIPHandler)
mux.HandleFunc("/header", headerHandler)
mux.HandleFunc("/dial", dialHandler)
mux.HandleFunc("/echo", echoHandler)
mux.HandleFunc("/exit", func(w http.ResponseWriter, req *http.Request) { exitHandler(w, req, exitCh) })
mux.HandleFunc("/healthz", healthzHandler)
mux.HandleFunc("/readyz", readyzHandler(termCh))

return

default:
if serverReady.get() {
Member

if we set the serverReady.get() variable as unready based on the term signal, will that not work for both healthz and readyz handler and simplify the code here?

Contributor Author

if we set the serverReady.get() variable as unready based on the term signal, will that not work for both healthz and readyz handler and simplify the code here?

No, we want to remain healthy during something like termination, but we also want readyz to be false. So healthz and readyz will have different values. The term signal should turn readyz=false and leave healthz=true.
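The split described above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: only the `readyzHandler(termCh)` call shape and the `/healthz` and `/readyz` paths come from the diff; the status codes, `newMux`, and `probe` are assumptions made for the example, and the real handler also consults `serverReady`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthzHandler always reports healthy, including during termination.
func healthzHandler(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// readyzHandler reports ready until termCh is closed, then not ready.
// The 412 status code is an assumption chosen for this sketch.
func readyzHandler(termCh <-chan struct{}) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		select {
		case <-termCh:
			w.WriteHeader(http.StatusPreconditionFailed)
		default:
			w.WriteHeader(http.StatusOK)
		}
	}
}

// newMux (hypothetical helper) wires both probes onto one mux.
func newMux(termCh <-chan struct{}) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", healthzHandler)
	mux.HandleFunc("/readyz", readyzHandler(termCh))
	return mux
}

// probe (hypothetical helper) issues an in-memory GET and returns the status.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	termCh := make(chan struct{})
	mux := newMux(termCh)
	fmt.Println(probe(mux, "/healthz"), probe(mux, "/readyz")) // 200 200
	close(termCh)                                              // simulate the term signal
	fmt.Println(probe(mux, "/healthz"), probe(mux, "/readyz")) // 200 412
}
```

Closing a channel rather than flipping a shared boolean keeps the two probes independent: nothing in the healthz path can observe the termination signal.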

@aojea
Member

aojea commented May 23, 2022

/assign @aojea @andrewsykim

hmm, this handler was added to be able to test the "termination endpoints" functionality IIRC

#108750

I need to check carefully if this is going to break it

The problem is that changes to the images are not tested directly in CI: the images have to be promoted and the k/k code updated to use the new image before CI picks it up. There is a trick, though: build the image locally, publish it to a personal registry, and add a temporary commit to test the new image in CI.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 23, 2022
@aojea
Member

aojea commented May 23, 2022

There are more changes needed for the image, like updating the version

https://github.com/kubernetes/kubernetes/blob/5ee7de7b5cd682380e036dc769cd1fde053a6626/test/images/agnhost/VERSION

the whole process is documented here

https://github.com/kubernetes/kubernetes/tree/master/test/images#updating-the-tests-images

a trick for testing your changes in CI is to build your custom image and publish it to a personal repo, for example

$ cd test/images
$ REGISTRY=quay.io/aojea make all-push WHAT=agnhost

and then "hijack" the image by replacing it in the manifest used to configure the e2e registries

configs[Agnhost] = Config{list.PromoterE2eRegistry, "agnhost", "2.36"}

or just add the image URL directly to the tests

@deads2k
Contributor Author

deads2k commented May 23, 2022

updated for version and comments.

@deads2k
Contributor Author

deads2k commented May 23, 2022

attempting proof in #110176

@aojea
Member

aojea commented May 24, 2022

/retest
/lgtm

Tests with the new image pass
#110176

@deads2k feel free to unhold, remember you still need 2 additional steps https://github.com/kubernetes/kubernetes/tree/master/test/images#promoting-images

  1. wait for the postsubmit job and PR k8s.io to promote the image
  2. update k/k tests with the new image

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 24, 2022
@deads2k
Contributor Author

deads2k commented May 24, 2022

@deads2k feel free to unhold, remember you still need 2 additional steps https://github.com/kubernetes/kubernetes/tree/master/test/images#promoting-images

  1. wait for the postsubmit job and PR k8s.io to promote the image
  2. update k/k tests with the new image

Acknowledged

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 24, 2022
@k8s-ci-robot k8s-ci-robot merged commit 78a4ba6 into kubernetes:master May 24, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone May 24, 2022
@deads2k
Contributor Author

deads2k commented May 24, 2022

next step here: kubernetes/k8s.io#3759

close(sigTermReceived)
}()

if delayShutdown > 0 {
Member

this is the problem: if delayShutdown is not set, the process keeps hanging on SIGTERM

Contributor Author

this is the problem: if delayShutdown is not set, the process keeps hanging on SIGTERM

Which particular line does the process hang on? I don't mind reverting or fixing, but this line existed gating off this go func before.

Member

Before, we didn't capture the SIGTERM signal if delayShutdown was 0.
Now we always capture the SIGTERM signal, but if delayShutdown == 0 we don't execute os.Exit(), so nothing tells the process to exit.
