
servicecontroller requeues bad changes at the end of the queue, ignoring subsequent changes #21952

Closed
justinsb opened this issue Feb 25, 2016 · 6 comments

@justinsb (Member)

I found a concrete problem with the servicecontroller (more concrete than some of my locking complaints - I saw this one in the real world). When we hit an error applying an update, we requeue the changes at the end of the queue: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/service/servicecontroller.go#L191

So suppose the user makes a "bad" change to a service (e.g. suppose we try to enable UDP on an AWS LoadBalancer). The service controller loop will observe that change, try to apply it, fail, and keep re-enqueuing it for retry indefinitely. Not great, but OK.

Now suppose the user notices the bad change and makes a corrective "good" change. The good change will join the queue and be processed. But the bad change will still be getting requeued, so it will keep being retried. We are now trying to apply an older version of the service than the one we have already successfully applied.

In my particular case I also had a path-dependency bug, so I made the changes Good1, Bad, Good2. Bad could not be applied on top of Good1, but could be applied on top of Good2 (a bug, but that's an issue in my own code). So the sequence actually applied was Good1 -> Good2 -> Bad, i.e. I ended up on Bad, not Good2.
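A minimal standalone sketch of this failure mode (hypothetical names, not the real servicecontroller code, just the requeue-the-object pattern reduced to a toy):

```go
package main

import "fmt"

type update struct {
	version int
	bad     bool
}

func main() {
	queue := []update{
		{version: 1},            // Good1
		{version: 2, bad: true}, // Bad (e.g. UDP on an AWS load balancer)
		{version: 3},            // Good2, the corrective change
	}
	current := 0 // version currently reflected in the cloud provider

	apply := func(u update) error {
		// Mimic the path-dependency bug described above: the bad change
		// fails on top of Good1 but happens to succeed on top of Good2.
		if u.bad && current < 3 {
			return fmt.Errorf("cannot apply v%d on top of v%d", u.version, current)
		}
		current = u.version
		return nil
	}

	for len(queue) > 0 {
		u := queue[0]
		queue = queue[1:]
		if err := apply(u); err != nil {
			queue = append(queue, u) // requeue the stale object itself
		}
	}
	fmt.Println("ended on version", current) // prints 2 (Bad), not 3 (Good2)
}
```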

justinsb added this to the v1.2 milestone on Feb 25, 2016
@justinsb (Member Author)

cc @thockin because I think you were having a look at this area

@justinsb
Copy link
Member Author

Actually, I'm confused as to what is happening here. I think AddIfNotPresent is supposed to prevent this, but it doesn't appear to be doing so in practice.

@bprashanth (Contributor)

I think if you have a reliable repro you should send a surgical fix for 1.2, and post-1.2 we should just rewrite it per #21625 (comment).

Without having dug through the code yet, I'm going to wave my hands a bit and say it should be requeuing a key, not the actual object. The key that's requeued should be used to look up the object, which should always mirror the object in etcd.
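To make the hand-waving a bit more concrete, here is a rough sketch of that pattern (hypothetical names, not an actual Kubernetes API): the queue holds only keys, and the worker always looks the object up in a store that mirrors etcd, so a retry processes the latest version instead of a stale copy.

```go
package main

import "fmt"

type service struct {
	name string
	spec string
}

type controller struct {
	store map[string]service // assumed to be kept in sync with etcd by a watch
	queue []string           // keys only, never full objects
}

func (c *controller) enqueue(key string) {
	c.queue = append(c.queue, key)
}

func (c *controller) processNext() {
	if len(c.queue) == 0 {
		return
	}
	key := c.queue[0]
	c.queue = c.queue[1:]

	svc, ok := c.store[key]
	if !ok {
		return // object was deleted; nothing stale left to retry
	}
	if err := c.sync(svc); err != nil {
		// Requeue the key, not the object: the next attempt reads
		// whatever version the store holds at that point.
		c.enqueue(key)
	}
}

func (c *controller) sync(svc service) error {
	fmt.Println("syncing", svc.name, "with spec", svc.spec)
	return nil
}

func main() {
	c := &controller{store: map[string]service{
		"default/my-service": {name: "my-service", spec: "good"},
	}}
	c.enqueue("default/my-service")
	c.processNext()
}
```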

@justinsb (Member Author)

Yes, I agree 100%. I was thinking about this, and I think it might actually be good news:

  1. If there's a bug here with the requeuing priority, it should be easy to fix. It looks like it is not supposed to requeue if there's a newer version, but that didn't match what (I thought) I was seeing. I'm going to look at this issue first thing this morning.
  2. We could put the requeue into a defer. It's not great, but it'll likely be good enough for 1.2. Slowing down the retries should make any races less impactful.
  3. Those changes should be a big improvement for 1.2, will likely address some of the things that have actually been reported against 1.1, and then we can defer deeper work to 1.3.

@justinsb (Member Author)

I don't mean a literal defer... I mean a goroutine which sleeps before re-enqueuing; a delayed re-enqueue.
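A minimal self-contained sketch of that delayed re-enqueue idea (hypothetical names, not the actual 1.2 fix):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type delayingQueue struct {
	mu   sync.Mutex
	keys []string
}

func (q *delayingQueue) add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.keys = append(q.keys, key)
}

// addAfter re-enqueues the key from a goroutine that sleeps first, slowing
// retries and giving newer updates a chance to land before the next attempt.
func (q *delayingQueue) addAfter(key string, delay time.Duration) {
	go func() {
		time.Sleep(delay)
		q.add(key)
	}()
}

func main() {
	q := &delayingQueue{}
	q.addAfter("default/my-service", 100*time.Millisecond)
	time.Sleep(200 * time.Millisecond) // wait for the delayed add in this toy example
	fmt.Println(q.keys)
}
```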

@a-robinson (Contributor)

Should be resolved by #22069

justinsb added a commit to justinsb/kubernetes that referenced this issue Feb 29, 2016
justinsb added a commit to justinsb/kubernetes that referenced this issue Feb 29, 2016