
[kube-proxy] Fix for non-blocking updates during min-sync-period #37726

Closed
timothysc wants to merge 1 commit

Conversation

@timothysc (Member) commented Nov 30, 2016

What this PR does / why we need it:
Addresses the issue that was seen in #36281

Fixes: #33693
Related: #36332
Related: #35334

/cc @thockin @bprashanth @jeremyeder

Special notes for your reviewer:
Follow on PR from #35334

Release note:

NONE

@timothysc timothysc added area/kube-proxy release-note-none Denotes a PR that doesn't merit a release note. labels Nov 30, 2016
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 30, 2016

@k8s-github-robot k8s-github-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Nov 30, 2016
// Lastly, the sync timer will be set to ensure we catch the last update.
if proxier.throttle.TryAccept() == false {
duration := time.Since(proxier.lastSync)
if duration < proxier.minSyncPeriod {
Contributor

say you get one notification, and it starts syncing proxy rules, taking 10s
in that 10s you get 5 other notifications
they all get saved somewhere, but TryAccept fails all 5 times

now we need to wait for IPTableSyncPeriod before the 5 notifications take effect?

can we just drop the no-op updates before applying any rate limits?

Member Author

You have to wait the min-sync-period, not the full sync period.

> can we just drop the no-op updates before applying any rate limits?

Potentially yes; there is a separate patch for that: #36006

Member

@timothysc The #36006 patch checks for no-op after any rate limits would be applied though, so that won't help this issue.

Member Author

doesn't really matter when you think about it.

@jeremyeder

@knobunc iptables is a major concern, and is our current lowest ceiling for scale, because it consumes such a large amount of CPU time on every node as the rulesets grow. At larger scales it can easily consume an entire core.

@knobunc (Contributor) commented Dec 1, 2016

@jeremyeder thanks. @dcbw has looked at the iptables performance problems.

@jeremyeder

The RHBZ that corresponds to this PR is: https://bugzilla.redhat.com/show_bug.cgi?id=1387149

@timothysc (Member Author) commented Dec 1, 2016

@saad-ali I'd propose this as a cherry-pick for 1.5, it's a performance issue for large scale clusters.

@timothysc timothysc added this to the v1.5 milestone Dec 1, 2016
@timothysc timothysc force-pushed the proxy_min_sync_fix branch 2 times, most recently from 6a01357 to 9ae62e3 on December 1, 2016 22:10
@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 1, 2016
@saad-ali (Member) commented Dec 2, 2016

> @saad-ali I'd propose this as a cherry-pick for 1.5, it's a performance issue for large scale clusters.

It is pretty late. We are less than 1 week from launch. Is this critical? If it does get in, what is the risk to the 1.5 release?

CC @bprashanth @thockin

@bprashanth (Contributor)

If we get a repro for the soft lockup, I'd like to try it with this rate limit. It's the only known issue that results in more than the required CPU for frequent (once-a-second) syncs.

@timothysc (Member Author)

@saad-ali Given the change is an opt-in behavior now, I see little risk.

The benefit is that we will see reduced load on the nodes, with modest effects on SLOs.

@saad-ali (Member) commented Dec 2, 2016

Spoke with @bprashanth offline. @timothysc Is everything in this PR 100% flag gated, and is the default disabled? If so, it is fine for 1.5.

@dims (Member) commented Dec 2, 2016

@k8s-bot gce etcd3 e2e test this

@timothysc (Member Author)

@k8s-bot gce etcd3 e2e test this

@timothysc timothysc added cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. and removed cherrypick-candidate labels Dec 2, 2016
@timothysc (Member Author)

@saad-ali tests are all green, but someone needs to review.

@bprashanth (Contributor) left a comment

I'm fine with this if you can do the due diligence to confirm what @saad-ali asked (if the command line flag is unset or set to 0, this has no net change other than the replacement of a ticker with a timer+reset).

While this is desirable overall, I'm more comfortable switching the default rate limit flag in 1.5.X, with more testing.
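
(For context, a minimal sketch of the ticker-to-timer replacement being described; the variable names here are illustrative, not the PR's exact code:)

// Before: a fixed-period ticker drives the sync loop.
t := time.NewTicker(proxier.syncPeriod)
defer t.Stop()
for {
    <-t.C
    proxier.Sync()
}

// After: a timer drives the loop, so other code paths can push the next
// sync out (or pull it in) by stopping and resetting the timer.
proxier.timer = time.NewTimer(proxier.syncPeriod)
for {
    <-proxier.timer.C
    proxier.Sync() // Sync() stops and resets proxier.timer
}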

@@ -788,11 +792,25 @@ func (proxier *Proxier) execConntrackTool(parameters ...string) error {
// assumes proxier.mu is held
func (proxier *Proxier) syncProxyRules() {
if proxier.throttle != nil {
Contributor

can you confirm that proxier.throttle is nil if no ratelimits are set via command line?

Member Author

Yes, the default atm is to do nothing: proxier.throttle is nil without the input option.

@@ -169,6 +169,8 @@ type Proxier struct {
haveReceivedServiceUpdate bool // true once we've seen an OnServiceUpdate event
haveReceivedEndpointsUpdate bool // true once we've seen an OnEndpointsUpdate event
throttle flowcontrol.RateLimiter
timer *time.Timer
lastSync time.Time
Contributor

I feel like the way you've structured this makes it hard to reason about and unit test. We need something that can:

  1. use a fake clock
  2. take time.Duration uniformly
  3. no-op trivially if no rate limits are specified

i.e.:

type syncThrottler struct {
  throttle flowcontrol.RateLimiter
  timer    *time.Timer
  lastSync time.Time
  clock    clock.Clock
}

// canSync returns true if we are below the allowed rate limits.
// every time it returns false it also tries to reduce the frequency
// of periodic sync events returned by nextSync
func (s *syncThrottler) canSync() bool {
  if s.throttle == nil {
    return true
  }
  // rest of the function
}

// setLastSync records the last sync timestamp
func (s *syncThrottler) setLastSync(t time.Time) {}

// setNextSync resets the timer to when we expect the next sync notification
func (s *syncThrottler) setNextSync(d time.Duration) {}

// nextSync hangs till it's time for the next periodic sync
func (s *syncThrottler) nextSync() {}

func newSyncThrottler() *syncThrottler {}

...
func (proxier *Proxier) Sync() {
  proxier.syncProxyRules()
  proxier.syncThrottle.setNextSync(proxier.syncPeriod)
}

...
func (proxier *Proxier) syncProxyRules() {
  if !proxier.syncThrottle.canSync() {
    return
  }

  defer proxier.syncThrottle.setLastSync(time.Now())

  // sync
}

but this is too close to the release to ask for a restructure.

Member

We could leave this out of 1.5 entirely, or we could take a minimal patch for a 1.5 cherrypick followed by this sort of cleanup AND TESTING. @timothysc what do you think?

Member Author

If we'd prefer, I could clean it up and aim for 1.5.1 rather than holding this up. I don't disagree with @bprashanth's suggestion. This isn't really a feature addition at this point.

Member Author

k, so I looked through this several times, and I see what you're saying, but I would move it out as a cleanup item.

imho the proxy will likely be due for an overhaul in the near-to-mid term due to the endpoints spamming. Even with the filter changes this is still a 1:N broadcast, and as we add a number of other HA components that becomes C*N. So a refactor has to come.

Contributor

I think avoiding endpoint spam should be pretty easy without an extensive refactor. The recommended code structure is 100% limited to this PR, not the surrounding code, so we don't make things any harder to debug.

Member

my concern is that testing without a fake clock is ~impossible. Testing this could be possible if we captured all of the timer logic into a simple struct-and-function-pointer sort of design. If we merge as-is, I'd like a commitment to fix it please :)

This code is not well tested enough; let's not make it worse when the fix is O(easy)
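
(For illustration, a minimal sketch of the kind of fake-clock test being asked for, assuming the throttle were refactored to take an injected clock; the newSyncThrottler signature and its clock parameter are hypothetical, not this PR's code:)

import (
    "testing"
    "time"

    "k8s.io/kubernetes/pkg/util/clock"
)

func TestAllowSyncThrottles(t *testing.T) {
    fakeClock := clock.NewFakeClock(time.Now())
    // hypothetical constructor: minSyncPeriod of 2s, injected fake clock
    st := newSyncThrottler(2*time.Second, fakeClock)

    if !st.allowSync() {
        t.Fatal("first sync should be allowed")
    }
    if st.allowSync() {
        t.Fatal("immediate second sync should be throttled")
    }

    // advance fake time past minSyncPeriod without sleeping
    fakeClock.Step(3 * time.Second)
    if !st.allowSync() {
        t.Fatal("sync after minSyncPeriod should be allowed")
    }
}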

@@ -408,14 +410,16 @@ func (proxier *Proxier) Sync() {
proxier.mu.Lock()
defer proxier.mu.Unlock()
proxier.syncProxyRules()
proxier.timer.Stop()
Contributor

Is the SyncLoop the only caller of this function? If not, are we guaranteed that the timer will be initialized before all calls?

Member Author

Only the SyncLoop.

Member

comment on this? Or maybe rename to sync or syncOnce

for {
<-t.C
<-proxier.timer.C
Contributor

if someone refactors the Reset()s embedded somewhere in the call path, this timer will hang. Coupling like that should always be encapsulated in a struct, so some eager newbie doesn't shoot themselves in the foot. Please add a comment explaining where we reset it.

Member Author

Sounds reasonable, I'll add a comment.

@timothysc (Member Author)

@thockin PTAL.

@timothysc (Member Author)

@bprashanth could you walk over and poke @thockin for me ;-)

@timothysc timothysc added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 16, 2017
@timothysc (Member Author)

@thockin re-ping.

}

if minSyncPeriod != 0 {
syncsPerSecond := float32(time.Second) / float32(minSyncPeriod)
Member

Is minSyncPeriod allowed to be greater than time.Second? E.g. will it work OK if syncsPerSecond is 0.5? It looks like a float32 all the way through to juju/ratelimit, but it would be nice to confirm. Maybe newSyncThrottle could get some godoc stating any limitations on its arguments?

Member Author

minSyncPeriod is a duration, and time.Second is divided by it, so the rate can go to 0.5 and below. I can add suppositions on the behavior above.
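
(Worked example of the arithmetic above: with minSyncPeriod = 2 * time.Second, syncsPerSecond = float32(time.Second) / float32(2 * time.Second) = 0.5, i.e. the limiter refills one token every two seconds. Fractional rates stay representable since the value is a float all the way down.)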

@timothysc (Member Author)

@smarterclayton unblocking wand request, I know @thockin is busy, but this is pretty important to us. @jeremyeder has slides and graphs showing how bad it is.

@k8s-github-robot

[APPROVALNOTIFIER] Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @thockin
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

// it's not exported.
type syncThrottle struct {
rl flowcontrol.RateLimiter
timer *time.Timer
Member

comments on these would be nice

func (s *syncThrottle) allowSync() bool {
if s.rl != nil {
if s.rl.TryAccept() == false {
duration := s.timeEllapsedSinceLastSync()
Member

"Elapsed" only has one "l"

Member Author

Fixed.


if s.rl != nil {
if s.rl.TryAccept() == false {
duration := s.timeEllapsedSinceLastSync()
if duration < s.minSyncPeriod {
Member

if TryAccept() == false, and duration >= s.minSyncPeriod, you'll hit the true case, but it's not at all clear to me that this is correct. I am not intimately familiar with the RateLimiter.

Can you convince me, by way of comments, that this is all correct?

Member Author

Redux'd to remove my latent paranoia.
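
(To make the thread easier to follow, a commented sketch of the semantics implied by the hunks quoted above; the comments and the final fall-through are my reading of the diff, not annotations from the PR itself:)

// allowSync reports whether a sync should proceed now.
func (s *syncThrottle) allowSync() bool {
    if s.rl == nil {
        return true // no rate limit configured: always sync
    }
    if s.rl.TryAccept() {
        return true // a token was available: sync now
    }
    // Bucket is empty. If we are still within minSyncPeriod of the last
    // sync, defer: reset the timer so the deferred update is retried.
    duration := s.timeElapsedSinceLastSync()
    if duration < s.minSyncPeriod {
        s.timer.Reset(s.minSyncPeriod - duration)
        return false
    }
    // Enough wall time has passed since the last sync; allow it even
    // though TryAccept failed (this is the case questioned above).
    return true
}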

// utility wrapper to handle time based sync updates
// it's specifically meant for the iptables proxy so
// it's not exported.
type syncThrottle struct {
Member

does this thing need a mutex? It's not obviously safe in the face of concurrent access, and it's not documented as being safe for some non-obvious reason

Member Author

Updated.

"k8s.io/kubernetes/pkg/util/flowcontrol"
)

// utility wrapper to handle time based sync updates
Member

This comment isn't super helpful. Since this problem warrants a whole new type, I'd really like a thorough explanation here.

rl flowcontrol.RateLimiter
timer *time.Timer
lastSync time.Time
minSyncPeriod time.Duration
Member

"minSyncPeriod" is a weird name, and it makes reading code hard. It's not a minimum sync period, it's more like a hysteresis or rest, I think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same name as before, just shuffled. Technically correct.

@k8s-ci-robot (Contributor)

Jenkins GKE smoke e2e failed for commit cc45ab9. Full PR test history.

cc @timothysc, your PR dashboard

The magic incantation to run this job again is @k8s-bot cvm gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.


@thockin (Member) commented Jan 25, 2017

I am looking right now; I am still not 100% sure I follow the intent. I have patched it into a client, and I have an idea to simplify. Back in a little while with results.

@thockin (Member) commented Jan 25, 2017

I stared and I stared and I stared and I finally figured out what felt wrong.

You can not Reset() a Timer while a read is pending. At the very least you have to Stop() it, and the docs are unclear if that is actually enough. This makes it a little complicated. While staring at it I kept jumping back and forth, and I felt like we could encapsulate it better.

https://gist.github.com/thockin/1c05beb4075025798e9d242e082e4852

I did a lot of manual testing, but did not write a test yet.

@timothysc (Member Author)

> You can not Reset() a Timer while a read is pending. At the very least you have to Stop() it, and the docs are unclear if that is actually enough

That statement isn't true; see https://golang.org/pkg/time/#Timer.Reset.
The docs state that you can't "reuse" a Timer without calling Stop and draining the channel. That is, to use the Timer again after it has been signaled, it must be drained and Stop called. I might be missing something, but given the code paths I don't see how this isn't true.
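
(For reference, the reuse idiom documented for time.Timer in the standard library is roughly the following; this is the stdlib's documented pattern, not code from this PR:)

// Stop the timer; if Stop reports that the timer already fired, drain
// the stale value so the next receive sees only the post-Reset tick.
// Per the docs, this must not be done concurrently with other receives
// from t.C.
if !t.Stop() {
    <-t.C
}
t.Reset(d)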

Can you find a test condition in the code that fails a test case? b/c I wrote the wrapper explicitly on your request for testability, so it seems perfectly reasonable to write a test case where the wrapper fails.

At this point I don't even care what it looks like anymore so long as the behavior is correct, and there is sound evidence and test behind the choice.

FWIW - It is very easy to beat the heck out of this by changing the timeout intervals behind the endpoint annotation updates and setting minsync to say 2s and just letting it run with log level=4.

@thockin (Member) commented Jan 25, 2017 via email

@thockin (Member) commented Jan 25, 2017

From the go docs: "This should not be done concurrent to other receives from the Timer's channel." It is unclear whether "this" is Reset() or the draining procedure.

@timothysc (Member Author)

I stop and reset the timer here: https://gist.github.com/thockin/1c05beb4075025798e9d242e082e4852#file-periodic_runner_2_timers-go-L70

After it's been pulled off the channel.

@thockin (Member) commented Jan 25, 2017

Other than this bit, I am OK with it.

if s.rl.TryAccept() == false {
duration := s.timeElapsedSinceLastSync()
glog.V(4).Infof("Attempting to synch too often. Duration: %v, min period: %v", duration, s.minSyncPeriod)
s.timer.Reset(s.minSyncPeriod - duration)
Member

This is the Reset that concerns me - I think you at least need to Stop() first, and even that is not clear, since the docs say not to do Reset() concurrent to receives.

Imagine goroutine 1 is in SyncLoop() receiving on s.timer.C, and goroutine 2 calls Sync(), which calls syncProxyRules(), which calls allowSync(), which calls Reset(). From looking at the Timer core code, it looks like you can call Stop() here and then Reset(), and it's possible the timer will fire in between, but that's OK. I'm more concerned with the potential for corruption or other badness if you Reset while the timer is not stopped. The docs just aren't clear on this.

Member

I emailed with go folks, and they said that if you call Stop() before Reset() it is "safe" with the exception that you might still take a delivery in the interim. That's fine for us, I think.
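
(A sketch of the agreed-upon pattern at this call site; note that it deliberately does not drain the channel, since SyncLoop may be receiving on it concurrently and a stray delivery was deemed acceptable:)

// Stop first, then Reset. Stop's return value is intentionally ignored:
// if the timer already fired, SyncLoop will consume the stray tick,
// which is harmless here, and draining timer.C ourselves could race
// with SyncLoop's receive.
s.timer.Stop()
s.timer.Reset(s.minSyncPeriod - duration)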

if proxier.throttle != nil {
proxier.throttle.Accept()
if !proxier.throttle.allowSync() {
return
Copy link
Member

@thockin thockin Jan 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hold up. syncProxyRules() is called by OnEndpointsUpdate and OnServiceUpdate, both of which call something AFTER the sync. They expect that the sync actually synced, and if it didn't (it was deferred), this is going to do the wrong thing.

I think we need a larger change than this, I am sad to say. We need to defer the events, not just the sync.

I am sorry this review is so painful - this code is very async and overly delicate.
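
(To illustrate the direction being suggested, a purely hypothetical sketch of deferring the work rather than silently dropping it; the pendingUpdate field and its replay are invented for illustration and are not code from this PR:)

// Hypothetical: when throttled, remember that state changed so the
// periodic loop replays the sync later, instead of losing the update.
func (proxier *Proxier) syncProxyRules() {
    if proxier.throttle != nil && !proxier.throttle.allowSync() {
        proxier.pendingUpdate = true // hypothetical field, guarded by proxier.mu
        return
    }
    proxier.pendingUpdate = false
    // ... actually sync iptables rules ...
}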

@k8s-github-robot

@timothysc PR needs rebase

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 2, 2017
@timothysc (Member Author)

closing in favor of #40868; I'll let @danwinship drive this through.

@timothysc timothysc closed this Feb 2, 2017
@thockin (Member) commented Feb 2, 2017 via email

thockin added a commit to thockin/kubernetes that referenced this pull request Feb 6, 2017
Labels
area/kube-proxy
cherry-pick-approved - Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
needs-rebase - Indicates a PR cannot be merged because it has merge conflicts with HEAD.
priority/important-soon - Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
release-note-none - Denotes a PR that doesn't merit a release note.
sig/scalability - Categorizes an issue or PR as relevant to SIG Scalability.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.