Wake up rcs when pods get DeletionFinalStateUnknown tombstones #8822

bprashanth · 2015-05-26T18:50:05Z

Delta fifo includes a (possibly stale) version of the object in DeletedFinalStateUnknown tombstones
Rc manager wakes up the appropriate rc without waiting for the next relist when it encounters a tombstone
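
For readers following along, here is a minimal sketch of what handling such a tombstone in a delete handler could look like. `DeletedFinalStateUnknown` mirrors the type touched by this PR; the `Pod` struct and the `podFromDelete` helper are hypothetical stand-ins for illustration, not the rc manager's actual code.

```go
package main

import "fmt"

// DeletedFinalStateUnknown mirrors the tombstone discussed in this PR:
// Key identifies the deleted object, Obj is its last known (possibly stale) state.
type DeletedFinalStateUnknown struct {
	Key string
	Obj interface{}
}

// Pod is a hypothetical, stripped-down stand-in for the real API type.
type Pod struct {
	Name   string
	Labels map[string]string
}

// podFromDelete (hypothetical helper) extracts a pod from a delete
// notification, which may carry either the pod itself or a tombstone.
func podFromDelete(obj interface{}) (*Pod, bool) {
	switch t := obj.(type) {
	case *Pod:
		return t, true
	case DeletedFinalStateUnknown:
		// The tombstone may carry no object at all, or one of another type.
		pod, ok := t.Obj.(*Pod)
		return pod, ok
	default:
		return nil, false
	}
}

func main() {
	stale := &Pod{Name: "web-1", Labels: map[string]string{"app": "web"}}
	tombstone := DeletedFinalStateUnknown{Key: "default/web-1", Obj: stale}
	if pod, ok := podFromDelete(tombstone); ok {
		// The labels may be stale, but they are enough to pick which rc to wake.
		fmt.Printf("wake the rc matching labels %v (pod %s)\n", pod.Labels, pod.Name)
	}
	if _, ok := podFromDelete(DeletedFinalStateUnknown{Key: "default/web-2"}); !ok {
		fmt.Println("no last known state; fall back to the next relist")
	}
}
```

The point of the sketch is that a tombstone with an object lets the handler act immediately, while a tombstone without one leaves it no better off than before this change.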

bprashanth · 2015-05-26T18:50:24Z

@lavalamp @wojtek-t

bprashanth · 2015-05-26T18:50:43Z

ref #8676

saad-ali · 2015-05-27T00:17:41Z

Addresses v1.0 issue. Assigning reviewer.

wojtek-t · 2015-05-27T07:11:27Z

pkg/client/cache/delta_fifo.go

+		v, exists, err := f.knownObjects.GetByKey(k)
+		if err != nil || !exists {
+			v = nil
+			glog.Infof("Unable to lookup key %v returned by list, ignoring", k)


nit: maybe also log err if it's non-nil?

Or even better: since you are returning "exists" anyway, maybe it's fine for GetByKey() to just return a pair: "interface{}, bool"?
At least, the error is always nil in the existing code.

We need to preserve the signature of GetByKey for the store interface, I've added the error log line

wojtek-t · 2015-05-27T07:21:48Z

Thanks for this change @bprashanth

LGTM - just some minor nits

wojtek-t · 2015-05-27T13:59:20Z

Also @bprashanth - let's have it merged today to make the scalability tests on Jenkins green asap :)

bprashanth · 2015-05-27T16:15:21Z

@wojtek-t this should already not be a problem because we increased the timeout right? addressed nits.

wojtek-t · 2015-05-27T16:32:51Z

> @wojtek-t this should already not be a problem because we increased the timeout right? addressed nits.

Yes, but it's cheating so I don't like this approach :)

LGTM - thanks for this change.

wojtek-t · 2015-05-27T16:35:20Z

The Shippable failure is due to non-regenerated conversions. Can you please regenerate them with:
hack/update-generated-conversions.sh
?

bprashanth · 2015-05-27T16:39:42Z

> Can you please regenerate them with: hack/update-generated-conversions.sh

Done. I think it was because I hadn't rebased.

> Yes, but it's cheating so I don't like this approach :)

Ofc, just checking that it wasn't taking > 10m now.

wojtek-t · 2015-05-27T17:07:09Z

It seems to fail with the same error...

> Ofc, just checking that it wasn't taking > 10m now.

No I haven't seen failures due to it since increasing the timeout.

bprashanth · 2015-05-27T17:30:47Z

@wojtek-t #8872

lavalamp · 2015-05-27T19:56:36Z

pkg/client/cache/delta_fifo.go

@@ -43,13 +45,13 @@ import (
 //       affects error retrying.
 //
 // Also see the comment on DeltaFIFO.
-func NewDeltaFIFO(keyFunc KeyFunc, compressor DeltaCompressor, knownObjectKeys KeyLister) *DeltaFIFO {
+func NewDeltaFIFO(keyFunc KeyFunc, compressor DeltaCompressor, knownObjects KeyLookup) *DeltaFIFO {


I would really prefer that you not add the requirement to support GetByKey here.

As a concrete suggestion, why not leave this the same, but add the interface below, too, and say in the docs that if knownObjectKeys supports it, it will be called instead of just dropping in finalStateUnknown markers.

type KeyGetter interface { GetByKey(key string) (interface{}, bool, error) }

The problem is that the known objects may not have the final state, so it may be just plain wrong to pass the last known state as the final state in a deletion.

Hmm, are you suggesting I use reflect to check if the given knownObjectKeys struct satisfies KeyGetter, and only call GetByKey if it does?

Regarding staleness, I figured it's up to the clients to trust or re-get when they find a delete marker, since not all clients really want the most up-to-date state? I documented that in a comment.

Erm, I guess you mean a simple type assertion. Sure.
I still think it should be up to the clients, but doing as suggested should be easy enough.

bprashanth · 2015-05-27T21:10:57Z

@lavalamp PTAL

lavalamp · 2015-05-27T22:44:27Z

pkg/client/cache/delta_fifo.go

@@ -334,7 +338,21 @@ func (f *DeltaFIFO) Replace(list []interface{}) error {
 				continue
 			}
 		}
-		if err := f.queueActionLocked(Deleted, DeletedFinalStateUnknown{k}); err != nil {
+		var deletedObj interface{}
+		if keyGetter, ok := f.knownObjectKeys.(KeyGetter); ok {


Thanks, this is what I was looking for.
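
As a rough illustration of the surrounding logic in this hunk, the sketch below computes which known keys are missing from a fresh list and builds one tombstone per missing key, carrying the last known object. The `tombstonesFor` function and its map-based inputs are hypothetical simplifications; the real `Replace` works on the queue's own state under a lock and queues deltas rather than returning a slice.

```go
package main

import (
	"fmt"
	"sort"
)

// DeletedFinalStateUnknown mirrors the tombstone type from the diff above.
type DeletedFinalStateUnknown struct {
	Key string
	Obj interface{}
}

// tombstonesFor (hypothetical) returns one tombstone for every key present
// in known but absent from listed, attaching the last known object.
func tombstonesFor(known map[string]interface{}, listed []string) []DeletedFinalStateUnknown {
	present := make(map[string]struct{}, len(listed))
	for _, k := range listed {
		present[k] = struct{}{}
	}
	var out []DeletedFinalStateUnknown
	for k, obj := range known {
		if _, ok := present[k]; ok {
			continue // still listed, so not deleted
		}
		out = append(out, DeletedFinalStateUnknown{Key: k, Obj: obj})
	}
	// Map iteration order is random; sort for a stable result.
	sort.Slice(out, func(i, j int) bool { return out[i].Key < out[j].Key })
	return out
}

func main() {
	known := map[string]interface{}{"ns/a": "pod-a", "ns/b": "pod-b"}
	for _, t := range tombstonesFor(known, []string{"ns/b"}) {
		fmt.Printf("deleted %s, last known state %v\n", t.Key, t.Obj)
	}
}
```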

bprashanth · 2015-05-27T23:31:10Z

Addressed nits, PTAL

lavalamp · 2015-05-27T23:34:41Z

pkg/client/cache/delta_fifo.go

+// an object was deleted but the watch deletion event was missed. In this
+// case we don't know the final "resting" state of the object. Callers that
+// absolutely need an upto-date state should not rely on the Obj member
+// of the following struct, but relist using the Key.


A relist is not possible because the object has been deleted...

yes yes, thought i removed that but apparently i had it in multiple places

bprashanth · 2015-05-28T02:02:32Z

CI passed, PTAL

wojtek-t · 2015-05-28T07:31:59Z

Thanks @bprashanth - LGTM

Wake up rcs when pods get DeletionFinalStateUnknown tombstones

ghost · 2015-05-28T15:05:30Z

It looks very much like this broke our e2e tests, and will need to be rolled back.

@thockin oncall FYI
@bprashanth FYI

Starting from test run number 6448 in our Jenkins CI "kubernetes-e2e-gce" a bunch of tests started failing. The only merge between that run and the previous one is this PR.

I haven't yet debugged exactly what's going on, but I'm pretty sure the fault is with this PR.

thockin · 2015-05-28T16:07:09Z

Echoing this. Trying to get a better handle on it, but this is the first victim if I can't get something soon.

bprashanth · 2015-05-28T16:29:58Z

Will debug but feel free to revert, I ran e2e when i uploaded the pr but not after that (2 days ago). At the time there was only a single failure, from my logs [Fail] PD [It] should schedule a pod w/ a RW PD, remove it, then schedule it on another host.

ghost · 2015-05-28T16:34:23Z

I am looking into each of the reproducible cases now


thockin · 2015-05-28T17:23:34Z

No luck reproducing failures so far. Going to broad-spectrum solutions, reverting this.

wojtek-t · 2015-05-29T07:50:20Z

@bprashanth - will you be able to look into it and resubmit it? I was looking through our scalability tests on Jenkins and observed cases when this problem appeared (although due to higher timeout they passed).

bprashanth · 2015-05-29T16:19:11Z

@wojtek-t this was reverted to clear the water, there's nothing wrong with it. I'm going to unrevert it (after I rerun e2e at head etc) since we figured out the leaky routes problem yesterday

wojtek-t · 2015-05-29T16:40:59Z

Great thanks!

googlebot added the cla: yes label May 26, 2015

saad-ali mentioned this pull request May 26, 2015

Room for optimizing rc stop operations #8676

Closed

saad-ali assigned wojtek-t May 27, 2015

wojtek-t reviewed May 27, 2015

wojtek-t mentioned this pull request May 27, 2015

Failing performance tests on Jenkins #7561

Closed

bprashanth force-pushed the fifo_rc branch from a9a790d to 8a439c8 Compare May 27, 2015 16:14

wojtek-t added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 27, 2015

bprashanth force-pushed the fifo_rc branch from 8a439c8 to 4d0bde3 Compare May 27, 2015 16:38

bprashanth force-pushed the fifo_rc branch from 4d0bde3 to 99f303f Compare May 27, 2015 16:56

bprashanth force-pushed the fifo_rc branch from 99f303f to bb227dc Compare May 27, 2015 18:16

lavalamp reviewed May 27, 2015

bprashanth force-pushed the fifo_rc branch from 031c523 to 77aa405 Compare May 27, 2015 23:30

lavalamp reviewed May 27, 2015

bprashanth force-pushed the fifo_rc branch from 77aa405 to a0558ba Compare May 27, 2015 23:39

Delta fifo includes objects in DeleteFinalStateUnknow, rcs stop faster

8fa66bd

bprashanth force-pushed the fifo_rc branch from a0558ba to 8fa66bd Compare May 27, 2015 23:46

wojtek-t added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 28, 2015

wojtek-t added a commit that referenced this pull request May 28, 2015

Merge pull request #8822 from bprashanth/fifo_rc

6ffe46a

Wake up rcs when pods get DeletionFinalStateUnknown tombstones

wojtek-t merged commit 6ffe46a into kubernetes:master May 28, 2015

thockin mentioned this pull request May 28, 2015

Revert "Wake up rcs when pods get DeletionFinalStateUnknown tombstones" #8927

Merged

bprashanth deleted the fifo_rc branch October 26, 2015 00:38
