Wake up rcs when pods get DeletionFinalStateUnknown tombstones #8822
bprashanth commented on May 26, 2015
- Delta FIFO includes a (possibly stale) version of the object in DeletedFinalStateUnknown keys
- RC manager wakes up the appropriate rc without waiting for the next relist when it encounters a tombstone
ref #8676
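A minimal sketch of what the two bullets above imply for a delete handler, not the actual rc manager code. `Pod`, `podFromDeleteEvent`, `ownerToWake`, and the "rc" label lookup are hypothetical stand-ins (the real manager matches pods to rcs differently), and the tombstone struct only approximates the `Key`/`Obj` shape discussed in this PR.

```go
package main

import "fmt"

// Pod is a hypothetical stand-in for the real API object.
type Pod struct {
	Name   string
	Labels map[string]string
}

// DeletedFinalStateUnknown approximates the tombstone discussed in this PR:
// the key of the deleted object plus its last known, possibly stale, state.
type DeletedFinalStateUnknown struct {
	Key string
	Obj interface{}
}

// podFromDeleteEvent (hypothetical helper) unwraps a delete notification.
// It accepts either a pod or a tombstone that carries one.
func podFromDeleteEvent(obj interface{}) (*Pod, bool) {
	switch t := obj.(type) {
	case *Pod:
		return t, t != nil
	case DeletedFinalStateUnknown:
		pod, ok := t.Obj.(*Pod)
		return pod, ok && pod != nil
	default:
		return nil, false
	}
}

// ownerToWake (hypothetical) names the controller to wake for a deleted pod.
// Assumption: the owner is recorded in an "rc" label, purely for illustration.
func ownerToWake(obj interface{}) (string, bool) {
	pod, ok := podFromDeleteEvent(obj)
	if !ok {
		return "", false
	}
	rc, found := pod.Labels["rc"]
	return rc, found
}

func main() {
	stale := &Pod{Name: "web-1", Labels: map[string]string{"rc": "web"}}
	tombstone := DeletedFinalStateUnknown{Key: "default/web-1", Obj: stale}
	if rc, ok := ownerToWake(tombstone); ok {
		fmt.Println("wake up rc:", rc)
	}
}
```

The point of the sketch is that a tombstone with a stale object is still enough to find which controller to wake, so the handler does not have to wait for the next relist.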
Addresses v1.0 issue. Assigning reviewer.
v, exists, err := f.knownObjects.GetByKey(k)
if err != nil || !exists {
v = nil
glog.Infof("Unable to lookup key %v returned by list, ignoring", k)
nit: maybe also log err if it's non-nil?
Or even better: since you are returning "exists" anyway, maybe it's fine for GetByKey() to just return a pair: "interface{}, bool"?
At least, the error is always nil in the existing code.
We need to preserve the signature of GetByKey for the store interface. I've added the error log line.
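A rough sketch of the lookup with the error logged separately from the "does not exist" case, under the three-value GetByKey signature quoted in this thread. `lookupOrNil`, `failingStore`, and the injected `logf` parameter are hypothetical; the actual change logs through glog inside DeltaFIFO.

```go
package main

import (
	"errors"
	"log"
)

// KeyGetter carries the lookup signature quoted in this thread.
type KeyGetter interface {
	GetByKey(key string) (interface{}, bool, error)
}

// lookupOrNil (hypothetical helper) returns the known object for key, or nil
// when the lookup fails or the key is absent. The reason is reported via logf.
func lookupOrNil(store KeyGetter, key string, logf func(string, ...interface{})) interface{} {
	v, exists, err := store.GetByKey(key)
	switch {
	case err != nil:
		logf("Unable to look up key %v returned by list: %v; ignoring", key, err)
		return nil
	case !exists:
		logf("Key %v returned by list does not exist; ignoring", key)
		return nil
	}
	return v
}

// failingStore is a hypothetical store whose lookups always fail.
type failingStore struct{}

func (failingStore) GetByKey(string) (interface{}, bool, error) {
	return nil, false, errors.New("store unavailable")
}

func main() {
	obj := lookupOrNil(failingStore{}, "default/pod-a", log.Printf)
	log.Printf("object: %v", obj)
}
```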
Thanks for this change @bprashanth. LGTM - just some minor nits
Also @bprashanth - let's have it merged today to make scalability on Jenkins green asap :)
@wojtek-t this should no longer be a problem because we increased the timeout, right? Addressed nits.
Yes, but it's cheating so I don't like this approach :) LGTM - thanks for this change.
The Shippable failure is due to non-regenerated conversions. Can you please regenerate them with:
Done. I think it was because I hadn't rebased.
Ofc, just checking that it wasn't taking > 10m now.
It seems to fail with the same error...
No I haven't seen failures due to it since increasing the timeout.
@@ -43,13 +45,13 @@ import (
// affects error retrying.
//
// Also see the comment on DeltaFIFO.
func NewDeltaFIFO(keyFunc KeyFunc, compressor DeltaCompressor, knownObjectKeys KeyLister) *DeltaFIFO {
func NewDeltaFIFO(keyFunc KeyFunc, compressor DeltaCompressor, knownObjects KeyLookup) *DeltaFIFO {
I would really prefer that you not add the requirement to support GetByKey here.
As a concrete suggestion, why not leave this the same, but add the interface below, too, and say in the docs that if knownObjectKeys supports it, it will be called instead of just dropping in finalStateUnknown markers.
type KeyGetter interface {
	GetByKey(key string) (interface{}, bool, error)
}
The problem is that the known objects may not have the final state, so it may be just plain wrong to pass the last known state as the final state in a deletion.
Hmm, are you suggesting I use reflect to check whether the given knownObjectKeys struct satisfies KeyGetter, and only call GetByKey if it does?
Regarding staleness, I figured it's up to the clients to trust or re-get when they find a delete marker, since not all clients really want the most up-to-date state. I documented that in a comment.
Erm, I guess you mean a simple type assertion. Sure.
I still think it should be up to the clients, but doing as suggested should be easy enough.
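The type-assertion approach discussed here can be sketched roughly as follows. `KeyGetter` is the interface proposed in the review; the `KeyLister` method set, `tombstoneFor`, `keysOnly`, and `mapStore` are assumptions made for illustration, not the code that landed.

```go
package main

import "fmt"

// KeyLister is assumed to be the minimal interface DeltaFIFO already requires.
type KeyLister interface {
	ListKeys() []string
}

// KeyGetter is the optional interface proposed in the review.
type KeyGetter interface {
	GetByKey(key string) (interface{}, bool, error)
}

// DeletedFinalStateUnknown approximates the tombstone shape from this PR.
type DeletedFinalStateUnknown struct {
	Key string
	Obj interface{}
}

// tombstoneFor (hypothetical) fills Obj only when known also supports lookups.
func tombstoneFor(known KeyLister, key string) DeletedFinalStateUnknown {
	var obj interface{}
	if getter, ok := known.(KeyGetter); ok {
		if v, exists, err := getter.GetByKey(key); err == nil && exists {
			obj = v // last known state, possibly stale
		}
	}
	return DeletedFinalStateUnknown{Key: key, Obj: obj}
}

// keysOnly lists keys but cannot look objects up.
type keysOnly []string

func (k keysOnly) ListKeys() []string { return k }

// mapStore supports both listing and lookup.
type mapStore map[string]interface{}

func (m mapStore) ListKeys() []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	return keys
}

func (m mapStore) GetByKey(key string) (interface{}, bool, error) {
	v, ok := m[key]
	return v, ok, nil
}

func main() {
	fmt.Printf("%+v\n", tombstoneFor(keysOnly{"default/a"}, "default/a"))
	fmt.Printf("%+v\n", tombstoneFor(mapStore{"default/a": "last known state"}, "default/a"))
}
```

A plain type assertion keeps the constructor signature unchanged and needs no reflect: callers that only implement key listing keep getting bare tombstones, while stores that also implement GetByKey get the last known object attached.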
@lavalamp PTAL
@@ -334,7 +338,21 @@ func (f *DeltaFIFO) Replace(list []interface{}) error {
continue
}
}
if err := f.queueActionLocked(Deleted, DeletedFinalStateUnknown{k}); err != nil {
var deletedObj interface{}
if keyGetter, ok := f.knownObjectKeys.(KeyGetter); ok {
Thanks, this is what I was looking for.
Addressed nits, PTAL
// an object was deleted but the watch deletion event was missed. In this
// case we don't know the final "resting" state of the object. Callers that
// absolutely need an upto-date state should not rely on the Obj member
// of the following struct, but relist using the Key.
A relist is not possible because the object has been deleted...
Yes yes, I thought I removed that, but apparently I had it in multiple places.
CI passed, PTAL
Thanks @bprashanth - LGTM
It looks very much like this broke our e2e tests and will need to be rolled back. @thockin (on call) FYI.
Starting from test run number 6448 in our Jenkins CI job "kubernetes-e2e-gce", a bunch of tests started failing. The only merge between that run and the previous one is this PR. I haven't yet debugged exactly what's going on, but I'm pretty sure the fault is with this PR.
Echoing this. Trying to get a better handle on it, but this is the first victim if I can't get something soon.
Will debug, but feel free to revert. I ran e2e when I uploaded the PR but not after that (2 days ago). At the time there was only a single failure, from my logs
I am looking into each of the reproducible cases now.
No luck reproducing failures so far. Going to broad-spectrum solutions, reverting this.
@bprashanth - will you be able to look into it and resubmit it? I was looking through our scalability tests on Jenkins and observed cases where this problem appeared (although they passed thanks to the higher timeout).
@wojtek-t this was reverted to clear the water; there's nothing wrong with it. I'm going to unrevert it (after I rerun e2e at head etc.) since we figured out the leaky routes problem yesterday.
Great, thanks!