proxy healthz server for dualstack clusters #118146

Merged: 3 commits, Oct 16, 2023

aroradaman
Member

@aroradaman aroradaman commented May 20, 2023

What type of PR is this?

/kind cleanup
/sig network

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

kube-proxy now reports its health more accurately in dual-stack clusters when there are problems with only one IP family.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 20, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 20, 2023
@k8s-ci-robot k8s-ci-robot added area/ipvs area/kube-proxy sig/network Categorizes an issue or PR as relevant to SIG Network. sig/windows Categorizes an issue or PR as relevant to SIG Windows. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 20, 2023
@aroradaman
Member Author

This will partly fix #116486.

I have created this draft to get initial feedback on the approach for fixing the issue.
/cc @danwinship @aojea @uablrek

@aroradaman aroradaman force-pushed the fix/proxy-healthzserver branch 2 times, most recently from 41f4ab2 to 9afaf97 Compare May 20, 2023 10:07
@aroradaman aroradaman changed the title proxy healthz server for daulstack clusters proxy healthz server for dualstack clusters May 20, 2023
Contributor

@danwinship danwinship left a comment

Overall I think this is probably right.

This is going to conflict with #116470. I'm not sure if @alexanderConstantinescu is working on that PR again currently?

Regardless of which is going to merge first, it would be good for you to look at that PR and make sure you're not changing things in ways that will need to be reverted or rewritten completely differently for that PR.

pkg/proxy/iptables/proxier.go Outdated
pkg/proxy/healthcheck/proxy_health.go Outdated
pkg/proxy/healthcheck/proxy_health.go Outdated
@alexanderConstantinescu
Member

I'm not sure if @alexanderConstantinescu is working on that PR again currently?

I am. I've pinged people off-band for a review on it. In case you have some cycles @danwinship: feel free to have a look.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 26, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 26, 2023
@aroradaman
Member Author

@alexanderConstantinescu if you are just waiting for final review/approval on #116470 I'll wait for it to merge first.

@aroradaman aroradaman force-pushed the fix/proxy-healthzserver branch 2 times, most recently from c7e1960 to 9838c4d Compare October 14, 2023 16:25
@aroradaman
Member Author

/retest

Contributor

@danwinship danwinship left a comment

Yeah, I like this better, overall.

pkg/proxy/healthcheck/healthcheck_test.go Outdated
pkg/proxy/healthcheck/proxier_health.go Outdated
pkg/proxy/healthcheck/proxier_health.go Outdated
pkg/proxy/healthcheck/proxier_health.go Outdated
pkg/proxy/healthcheck/proxier_health.go Outdated
pkg/proxy/winkernel/proxier.go Outdated
Signed-off-by: Daman Arora <aroradaman@gmail.com>
@aroradaman aroradaman force-pushed the fix/proxy-healthzserver branch 3 times, most recently from f3d686a to 9a55712 Compare October 15, 2023 17:26
// Set oldestPendingQueuedMap only if it's currently zero
if val, ok := hs.oldestPendingQueuedMap[ipFamily]; ok && val == zeroTime {
        hs.oldestPendingQueuedMap[ipFamily] = hs.clock.Now()
}
Contributor

This logic is wrong now, because we don't initialize oldestPendingQueuedMap to { v1.IPv4Protocol: zeroTime, v1.IPv6Protocol: zeroTime }. We initialize it to {}. So you want:

// Set oldestPendingQueuedMap[ipFamily] only if it's currently unset
if _, set := hs.oldestPendingQueuedMap[ipFamily]; !set {
        hs.oldestPendingQueuedMap[ipFamily] = hs.clock.Now()
}

Member Author

if _, set := hs.oldestPendingQueuedMap[ipFamily]; !set {

This feels wrong to me: it only allows the first QueuedUpdate() call to set hs.oldestPendingQueuedMap[ipFamily] to the current time. Subsequent calls won't update hs.oldestPendingQueuedMap[ipFamily] because the key is already in the map, so the proxy will always respond unhealthy.

I may be wrong here, but the current code does not initialize oldestPendingQueued. Since it is of type atomic.Value, its zero value is not zeroTime, so hs.oldestPendingQueued.CompareAndSwap(zeroTime, hs.clock.Now()) won't swap because the old value is not zeroTime.

hs.oldestPendingQueued.CompareAndSwap(zeroTime, hs.clock.Now())

I experimented with it here: https://go.dev/play/p/q2onw6P3aZB. CompareAndSwap(zeroTime, currentTime) only works if we explicitly set oldestPendingQueued to zeroTime, which I guess happens in the first Updated() call.

hs.oldestPendingQueued.Store(zeroTime)

Contributor

this will only allow the first QueuedUpdate() call to set hs.oldestPendingQueuedMap[ipFa

oh, right, Updated needs to be updated too.

Anyway, the stuff with comparing against zeroTime is just because, with the old code, oldestPendingQueued stored a time.Time, not a *time.Time, so we couldn't store nil to mean "nothing is queued", so we recognized the zero time as meaning that instead. Which is kind of weird... we really should have used a pointer.

Anyway, with the map, there's no reason to worry about comparing against the zero time. You should just have a value in the map if there is an update queued, and no value in the map when there is no update queued. So Updated should do delete(hs.oldestPendingQueuedMap, ipFamily).

Member Author

Got it, thanks!

case oldestPendingQueued.IsZero():
        // The proxier is healthy while it's starting up
        // or the proxier is fully synced.
        continue
Contributor

So again, this case would never get hit. The initial state is that there's no entry for that IP family, not a zero entry. But we don't need any special case for this here now; when the proxy is starting up (i.e., oldestPendingQueuedMap is empty), the for loop just gets skipped and we return true below.

pkg/proxy/healthcheck/proxier_health.go Outdated
Signed-off-by: Daman Arora <aroradaman@gmail.com>
@aroradaman
Member Author

/retest

@danwinship
Contributor

/lgtm
/approve

/kind bug

@aroradaman I feel like this is worthy of a small release note. Something like "kube-proxy now reports its health more accurately in dual-stack clusters when there are problems with only one IP family." Can you update the release note field in the initial comment, and then /hold cancel?

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Oct 16, 2023
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 16, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 375e63aee9ce61600c8c9a0db2bb54d9549a80a2

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aroradaman, danwinship

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Oct 16, 2023
@aroradaman
Member Author

/hold cancel
I added the release-note.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 16, 2023
@k8s-ci-robot k8s-ci-robot merged commit b5ba899 into kubernetes:master Oct 16, 2023
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 16, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/ipvs area/kube-proxy cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

4 participants