Fix IPVS proxier to update stale real server after restart #111635

Merged
merged 2 commits into kubernetes:master from the ipvs-restart branch on Aug 27, 2022

Conversation

aryan9600
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

Update the IPVS proxier to have a bool updateWeights which is set to
true during the initial syncs performed by OnEndpointSlicesSynced and
OnServiceSynced to make sure any real servers with stale weights are
updated accordingly at startup. This logic is gated behind a bool to
avoid doing this during every sync as it's an expensive operation.
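
The gating described above can be sketched roughly as follows. This is a minimal model, not the kube-proxy code: the `proxier` and `realServer` types, the `passes` counter, and the addresses are hypothetical stand-ins, and the real proxier talks to IPVS rather than to an in-memory slice.

```go
package main

import (
	"fmt"
	"sync"
)

// realServer is a hypothetical stand-in for an IPVS destination.
type realServer struct {
	addr   string
	weight int
}

// proxier is a simplified, hypothetical model; it is not the kube-proxy type.
type proxier struct {
	mu            sync.Mutex
	updateWeights bool // set once at startup, cleared after the first pass
	servers       []*realServer
	passes        int // how many times the expensive pass ran
}

func newProxier() *proxier {
	return &proxier{
		updateWeights: true,
		servers: []*realServer{
			{addr: "10.0.0.1:80", weight: 0}, // stale weight left over from before the restart
			{addr: "10.0.0.2:80", weight: 1},
		},
	}
}

// syncProxyRules walks every real server only while updateWeights is set.
func (p *proxier) syncProxyRules() {
	p.mu.Lock()
	defer p.mu.Unlock()

	if !p.updateWeights {
		return // later syncs skip the expensive pass
	}
	for _, rs := range p.servers {
		rs.weight = 1
	}
	p.passes++
	p.updateWeights = false // the mutex is held, so clearing the flag here is safe
}

func main() {
	p := newProxier()
	p.syncProxyRules() // startup sync: fixes the stale weight
	p.syncProxyRules() // regular sync: flag is cleared, nothing to do
	fmt.Println(p.servers[0].weight, p.passes) // prints "1 1"
}
```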

Which issue(s) this PR fixes:

Fixes #108319

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 2, 2022
@k8s-ci-robot
Contributor

@aryan9600: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 2, 2022
@k8s-ci-robot
Contributor

Hi @aryan9600. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Aug 2, 2022
@aryan9600
Member Author

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. area/ipvs and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 2, 2022
@aojea
Member

aojea commented Aug 3, 2022

/assign @uablrek @andrewsykim
/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Aug 3, 2022
@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 3, 2022
proxier.mu.Unlock()

// Sync unconditionally - this is called once per lifetime.
proxier.syncProxyRules()

Contributor

Remove the proxier.updateWeights = false from here and instead do it where it's checked (see below)

proxier.mu.Unlock()

// Sync unconditionally - this is called once per lifetime.
proxier.syncProxyRules()

proxier.mu.Lock()
Contributor

same as above

// if we want to update weights, loop through all current destinations and
// reset their weight.
if proxier.updateWeights {
for _, dest := range curDests {
Contributor

Add proxier.updateWeights = false here. The mutex is held so it's safe

Contributor

@uablrek uablrek Aug 25, 2022

(btw, you may make a comment that the mutex is held and that it's a one-time event)

Member Author

we can't add this here, as this is inside a loop. adding this outside the loop instead.

Contributor

Outside the "for _, ep := range newEndpoints.List() {" loop you mean? Seems reasonable.

Member Author

on second thought, is it completely safe to have proxier.updateWeights = false inside syncProxyRules() instead of OnServiceSynced()/OnEndpointSynced()?
i'm asking because in OnServiceSynced() we do something like:

proxier.mu.Lock()
...
proxier.updateWeights = true
proxier.mu.Unlock()

proxier.syncProxyRules()

so even though proxier.syncProxyRules() would acquire the mutex first thing, theoretically another goroutine that called syncProxyRules() could execute proxier.updateWeights = false in the window between OnServiceSynced releasing the lock and proxier.syncProxyRules() acquiring it.
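
One way to explore the interleaving described here is a small concurrent model. This is only a sketch under stated assumptions: `flagProxier`, `markSynced`, `passes`, and `runConcurrent` are hypothetical names, and the model ignores everything the real sync does besides the flag. It shows that when the flag is checked and cleared inside the same critical section as the one-time work, whichever caller gets the lock first does the work, and the pass still runs exactly once. It does not cover a sync that clears the flag without doing the work.

```go
package main

import (
	"fmt"
	"sync"
)

// flagProxier is a hypothetical model, not the real Proxier.
type flagProxier struct {
	mu            sync.Mutex
	updateWeights bool
	passes        int // how many times the one-time pass ran
}

// markSynced mimics the handler shape quoted above: set the flag under the
// lock, release the lock, then call syncProxyRules.
func (p *flagProxier) markSynced() {
	p.mu.Lock()
	p.updateWeights = true
	p.mu.Unlock()

	// Another goroutine may run syncProxyRules in this window.
	p.syncProxyRules()
}

// syncProxyRules checks and clears the flag while holding the mutex.
func (p *flagProxier) syncProxyRules() {
	p.mu.Lock()
	defer p.mu.Unlock()

	if p.updateWeights {
		p.passes++ // stands in for the expensive weight update
		p.updateWeights = false
	}
}

// runConcurrent races one markSynced call against n plain syncs and
// returns how many times the one-time pass ran.
func runConcurrent(n int) int {
	p := &flagProxier{}
	var wg sync.WaitGroup
	wg.Add(n + 1)
	go func() {
		defer wg.Done()
		p.markSynced()
	}()
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			p.syncProxyRules()
		}()
	}
	wg.Wait()
	return p.passes
}

func main() {
	fmt.Println(runConcurrent(8)) // prints "1"
}
```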

Contributor

@uablrek uablrek Aug 26, 2022

Silly mistake, now it works https://go.dev/play/p/CuVZg3B2itE

Member Author

yes, i noticed first needs to be initialized to true instead of false

Contributor

Seems Go has support for "Once": https://golangcode.com/run-code-once-with-sync/

You may see if you can use it.

Contributor

I am unsure if it's better since we already have the proxier.mu mutex.

Member Author

@aryan9600 aryan9600 Aug 26, 2022

I think it's better to avoid sync.Once because if in the future we want to do more things only on startup, the underlying function will get large. Having a flag gives us flexibility about where we want to run the one-time logic during sync.
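
For comparison, the sync.Once alternative discussed in this thread might look roughly like the sketch below. The `onceProxier` type and its fields are hypothetical and are not taken from the PR. All startup-only logic has to live inside the closure passed to `Do`, which is the growth concern raised above, whereas a plain bool can be checked at several points during a sync.

```go
package main

import (
	"fmt"
	"sync"
)

// onceProxier is a hypothetical model that uses sync.Once instead of a flag.
type onceProxier struct {
	mu          sync.Mutex
	startupOnce sync.Once
	passes      int // how many times the startup-only logic ran
}

// syncProxyRules runs the startup-only logic on the first call only.
func (p *onceProxier) syncProxyRules() {
	p.mu.Lock()
	defer p.mu.Unlock()

	// Everything that should happen only at startup goes into this closure.
	p.startupOnce.Do(func() {
		p.passes++ // stands in for resetting stale real server weights
	})

	// Regular per-sync work would follow here.
}

func main() {
	p := &onceProxier{}
	for i := 0; i < 3; i++ {
		p.syncProxyRules()
	}
	fmt.Println(p.passes) // prints "1"
}
```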

@aryan9600 aryan9600 force-pushed the ipvs-restart branch 2 times, most recently from 20054a6 to da19d62 Compare August 26, 2022 07:53
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aryan9600, uablrek

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 26, 2022
@uablrek
Contributor

uablrek commented Aug 26, 2022

EDIT: this is a bad solution since it checks proxier.initialized directly. Please see the updated proposal below.

I was a bit hasty with the approve button 😊 The code looks good but it doesn't solve the problem. It seems the update needs to be done for both endpoints and endpointSlices. I found my old tests for the issue and, as it is now, the weight stays at 0.

I altered the code to:

func (proxier *Proxier) syncProxyRules() {
	proxier.mu.Lock()
	defer proxier.mu.Unlock()
	// To ensure complete initialization we can only consider the
	// initial sync done when the proxier is initialized.
	defer func() {
		if proxier.initialized == 1 {
			proxier.initialSync = false
		}
	}()

then it works.

@aryan9600
Member Author

okay, thanks for catching this. let me confirm this and update 👍

@aojea
Member

aojea commented Aug 26, 2022

I was a bit hasty with the approve button. The code looks good but it doesn't solve the problem. It seems the update needs to be done for both endpoints and endpointSlices. I found my old tests for the issue and, as it is now, the weight stays at 0.

/hold

Just to avoid merging this unintentionally. Once you are good, unhold it; this is just a precaution.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 26, 2022
@uablrek
Contributor

uablrek commented Aug 26, 2022

Taking another look, I see that proxier.initialized is checked just a few lines down, with an atomic operation. So the deferred function just has to be moved to after the check:

func (proxier *Proxier) syncProxyRules() {
	proxier.mu.Lock()
	defer proxier.mu.Unlock()

	// don't sync rules till we've received services and endpoints
	if !proxier.isInitialized() {
		klog.V(2).InfoS("Not syncing ipvs rules until Services and Endpoints have been received from master")
		return
	}

	defer func() {
		proxier.initialSync = false
	}()

@uablrek
Contributor

uablrek commented Aug 26, 2022

(the proposed fix above is tested and works)
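
Why the placement of the deferred reset matters can be seen in a small runnable model. This is a sketch under assumptions, not the merged code: `initProxier`, `server`, and `setInitialized` are hypothetical, and only `initialSync`, `isInitialized`, and the lock/defer ordering follow the snippet above. A sync that returns early because the proxier is not yet initialized leaves `initialSync` untouched, so the first real sync still runs the startup-only logic.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// server is a hypothetical stand-in for an IPVS real server.
type server struct{ weight int }

// initProxier is a simplified, hypothetical model of the proxier.
type initProxier struct {
	mu          sync.Mutex
	initialized int32 // read and written atomically
	initialSync bool  // true until the first complete sync
	servers     []*server
}

func (p *initProxier) isInitialized() bool {
	return atomic.LoadInt32(&p.initialized) > 0
}

// setInitialized is a hypothetical helper that marks the proxier ready.
func (p *initProxier) setInitialized(v bool) {
	var n int32
	if v {
		n = 1
	}
	atomic.StoreInt32(&p.initialized, n)
}

func (p *initProxier) syncProxyRules() {
	p.mu.Lock()
	defer p.mu.Unlock()

	// Early return: the deferred reset below is not registered yet,
	// so initialSync stays true.
	if !p.isInitialized() {
		return
	}

	// Runs before the deferred Unlock, so the mutex is still held.
	defer func() {
		p.initialSync = false
	}()

	if p.initialSync {
		for _, s := range p.servers {
			s.weight = 1 // reset stale weights once, at startup
		}
	}
}

func main() {
	p := &initProxier{initialSync: true, servers: []*server{{weight: 0}}}

	p.syncProxyRules() // not initialized yet: returns early
	fmt.Println(p.initialSync, p.servers[0].weight) // prints "true 0"

	p.setInitialized(true)
	p.syncProxyRules() // first complete sync
	fmt.Println(p.initialSync, p.servers[0].weight) // prints "false 1"
}
```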

Update the IPVS proxier to have a bool `initialSync` which is set to
true when a new proxier is initialized and then set to false on all
syncs. This lets us run startup-only logic, which in turn lets us
update the real server only when needed and avoid any expensive
operations.

Signed-off-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>
@uablrek
Contributor

uablrek commented Aug 27, 2022

Tested and works.

/lgtm
/unhold

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Aug 27, 2022
@k8s-ci-robot k8s-ci-robot merged commit 41df816 into kubernetes:master Aug 27, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Aug 27, 2022
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/ipvs cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/network Categorizes an issue or PR as relevant to SIG Network. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

kube-proxy in ipvs mode leaves real server weight as 0 after restart
5 participants