
servicecontroller requeues bad changes at the end of the queue, ignoring subsequent changes #21952

Closed
justinsb opened this issue Feb 25, 2016 · 6 comments

@justinsb (Member)

I found a concrete problem with the servicecontroller (more concrete than some of my locking complaints - I saw this one in the real world). When we hit an error applying an update, we requeue the changes at the end of the queue: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/service/servicecontroller.go#L191

So suppose the user makes a "bad" change to a service (e.g. suppose we try to enable UDP on an AWS LoadBalancer). The service controller loop will observe that change, try to apply it, fail, and keep re-enqueuing it for retry indefinitely. Not great, but OK.

Now suppose the user notices the bad change and makes a corrective "good" change. The good change will join the queue and be processed. But the bad change will still be getting requeued, so it will keep being retried. We are now trying to apply an older version of the service than the one we have already successfully applied.

In my particular case I also had a path-dependency bug, so I made the changes Good1, Bad, Good2. Bad could not be applied on top of Good1, but could be applied on top of Good2 (a bug, but that's an issue in my own code). So the sequence actually applied was Good1 -> Good2 -> Bad, i.e. I ended up on Bad, not Good2.
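A minimal standalone sketch of this failure mode (hypothetical names, not the real servicecontroller code, just the requeue-the-object pattern reduced to a toy):

```go
package main

import "fmt"

type update struct {
	version int
	bad     bool
}

func main() {
	queue := []update{
		{version: 1},            // Good1
		{version: 2, bad: true}, // Bad (e.g. UDP on an AWS load balancer)
		{version: 3},            // Good2, the corrective change
	}
	current := 0 // version currently reflected in the cloud provider

	apply := func(u update) error {
		// Mimic the path-dependency bug described above: the bad change
		// fails on top of Good1 but happens to succeed on top of Good2.
		if u.bad && current < 3 {
			return fmt.Errorf("cannot apply v%d on top of v%d", u.version, current)
		}
		current = u.version
		return nil
	}

	for len(queue) > 0 {
		u := queue[0]
		queue = queue[1:]
		if err := apply(u); err != nil {
			queue = append(queue, u) // requeue the stale object itself
		}
	}
	fmt.Println("ended on version", current) // prints 2 (Bad), not 3 (Good2)
}
```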

justinsb added this to the v1.2 milestone on Feb 25, 2016
@justinsb (Member Author)

cc @thockin because I think you were having a look at this area

@justinsb
Copy link
Member Author

Actually, I'm confused as to what is happening here. I think AddIfNotPresent is supposed to prevent this, but it doesn't appear to be doing so in practice.

@bprashanth (Contributor)

I think if you have a reliable repro you should send a surgical fix for 1.2, and post-1.2 we should just rewrite it per #21625 (comment).

Without having dug through the code yet, I'm going to wave my hands a bit and say it should be requeuing a key, not the actual object. The key that's requeued should be used to look up the object, which should always mirror the object in etcd.
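To make the hand-waving a bit more concrete, here is a rough sketch of that pattern (hypothetical names, not an actual Kubernetes API): the queue holds only keys, and the worker always looks the object up in a store that mirrors etcd, so a retry processes the latest version instead of a stale copy.

```go
package main

import "fmt"

type service struct {
	name string
	spec string
}

type controller struct {
	store map[string]service // assumed to be kept in sync with etcd by a watch
	queue []string           // keys only, never full objects
}

func (c *controller) enqueue(key string) {
	c.queue = append(c.queue, key)
}

func (c *controller) processNext() {
	if len(c.queue) == 0 {
		return
	}
	key := c.queue[0]
	c.queue = c.queue[1:]

	svc, ok := c.store[key]
	if !ok {
		return // object was deleted; nothing stale left to retry
	}
	if err := c.sync(svc); err != nil {
		// Requeue the key, not the object: the next attempt reads
		// whatever version the store holds at that point.
		c.enqueue(key)
	}
}

func (c *controller) sync(svc service) error {
	fmt.Println("syncing", svc.name, "with spec", svc.spec)
	return nil
}

func main() {
	c := &controller{store: map[string]service{
		"default/my-service": {name: "my-service", spec: "good"},
	}}
	c.enqueue("default/my-service")
	c.processNext()
}
```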

@justinsb (Member Author)

Yes, I agree 100%. I was thinking about this, and I think it might actually be good news:

  1. If there's a bug here with the requeuing priority, it should be easy to fix. It looks like it is not supposed to requeue if there's a newer version, but that didn't match what (I thought) I was seeing. I'm going to look at this issue first thing this morning.
  2. We could put the requeue into a defer. It's not great, but it'll likely be good enough for 1.2. Slowing down the retries should make any races less impactful.
  3. Those changes should be a big improvement for 1.2, will likely address some of the things that have actually been reported against 1.1, and then we can defer deeper work to 1.3.

@justinsb (Member Author)

I don't mean a literal defer... I mean a goroutine which sleeps before re-enqueuing; a delayed re-enqueue.
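A minimal self-contained sketch of that delayed re-enqueue idea (hypothetical names, not the actual 1.2 fix):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type delayingQueue struct {
	mu   sync.Mutex
	keys []string
}

func (q *delayingQueue) add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.keys = append(q.keys, key)
}

// addAfter re-enqueues the key from a goroutine that sleeps first, slowing
// retries and giving newer updates a chance to land before the next attempt.
func (q *delayingQueue) addAfter(key string, delay time.Duration) {
	go func() {
		time.Sleep(delay)
		q.add(key)
	}()
}

func main() {
	q := &delayingQueue{}
	q.addAfter("default/my-service", 100*time.Millisecond)
	time.Sleep(200 * time.Millisecond) // wait for the delayed add in this toy example
	fmt.Println(q.keys)
}
```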

@a-robinson (Contributor)

Should be resolved by #22069

justinsb added a commit to justinsb/kubernetes that referenced this issue Feb 29, 2016
justinsb added a commit to justinsb/kubernetes that referenced this issue Feb 29, 2016