Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wake up rcs when pods get DeletionFinalStateUnknown tombstones #8822

Merged
merged 1 commit into from
May 28, 2015

Conversation

bprashanth
Copy link
Contributor

  1. Delta fifo includes (possibly stale) version of the object in DeletetionFinalStateUnknown keys
  2. Rc manager wakes up the appropriate rc without waiting for next relist when it encounters a tombstone

@bprashanth
Copy link
Contributor Author

@lavalamp @wojtek-t

@bprashanth
Copy link
Contributor Author

ref #8676

@saad-ali
Copy link
Member

Addresses v1.0 issue. Assigning reviewer.

v, exists, err := f.knownObjects.GetByKey(k)
if err != nil || !exists {
v = nil
glog.Infof("Unable to lookup key %v returned by list, ignoring", k)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe also log err if it's non-nil?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or event better: since you are returning "exists" anyway, maybe it's fine for GetByKey() to just return a pair: "interface{}, bool"?
At least, the error is always nil in the existing code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to preserve the signature of GetByKey for the store interface, I've added the error log line

@wojtek-t
Copy link
Member

Thanks for this change @bprashanth

LGTM - just some minor nits

@wojtek-t
Copy link
Member

Also @bprashanth - let's have it merged today so make scalability on Jenkins green asap :)

@bprashanth
Copy link
Contributor Author

@wojtek-t this should already not be a problem because we increased the timeout right? addressed nits.

@wojtek-t
Copy link
Member

@wojtek-t this should already not be a problem because we increased the timeout right? addressed nits.

Yes, but it's cheating so I don't like this approach :)

LGTM - thanks for this change.

@wojtek-t wojtek-t added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 27, 2015
@wojtek-t
Copy link
Member

The Shippable failure is due to non-regenerated conversions. Can you please regenerate them with:
hack/update-generated-conversions.sh
?

@bprashanth
Copy link
Contributor Author

Can you please regenerate them with: hack/update-generated-conversions.sh

Done. I think it was because I hadn't rebased.

Yes, but it's cheating so I don't like this approach :)

Ofc, just checking that it wasn't taking > 10m now.

@wojtek-t
Copy link
Member

It seems to fail with the same error...

Ofc, just checking that it wasn't taking > 10m now.

No I haven't seen failures due to it since increasing the timeout.

@bprashanth
Copy link
Contributor Author

@wojtek-t #8872

@@ -43,13 +45,13 @@ import (
// affects error retrying.
//
// Also see the comment on DeltaFIFO.
func NewDeltaFIFO(keyFunc KeyFunc, compressor DeltaCompressor, knownObjectKeys KeyLister) *DeltaFIFO {
func NewDeltaFIFO(keyFunc KeyFunc, compressor DeltaCompressor, knownObjects KeyLookup) *DeltaFIFO {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really prefer that you not add the requirement to support GetByKey here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a concrete suggestion, why not leave this the same, but add the interface below, too, and say in the docs that if knownObjectKeys supports it, it will be called instead of just dropping in finalStateUnknown markers.

type KeyGetter interface {
  GetByKey(key string) (interface{}, bool, error)
}

The problem is that the known objects may not have the final state, so it may be just plain wrong to pass the last known state as the final state in a deletion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, are you suggesting i use reflect to check if the given knownObjectKeys struct satisfies KeyGetter, and only call GetByKey if it does?

regarding staleness, I figured it's upto the clients to trust or re-get when they find a deletemarker, since not all clients really want the most upto date state? I documented that in a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Erm, i guess you mean a simple type assertion. Sure.
I still think it should be upto the clients, but doing as suggested should be easy enough.

@bprashanth
Copy link
Contributor Author

@lavalamp PTAL

@@ -334,7 +338,21 @@ func (f *DeltaFIFO) Replace(list []interface{}) error {
continue
}
}
if err := f.queueActionLocked(Deleted, DeletedFinalStateUnknown{k}); err != nil {
var deletedObj interface{}
if keyGetter, ok := f.knownObjectKeys.(KeyGetter); ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is what I was looking for.

@bprashanth
Copy link
Contributor Author

Addressed nits, PTAL

// an object was deleted but the watch deletion event was missed. In this
// case we don't know the final "resting" state of the object. Callers that
// absolutely need an upto-date state should not rely on the Obj member
// of the following struct, but relist using the Key.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A relist is not possible because the object has been deleted...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes yes, thought i removed that but apparently i had it in multiple places

@bprashanth
Copy link
Contributor Author

CI passed, PTAL

@wojtek-t
Copy link
Member

Thanks @bprashanth - LGTM

@wojtek-t wojtek-t added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 28, 2015
wojtek-t added a commit that referenced this pull request May 28, 2015
Wake up rcs when pods get DeletionFinalStateUnknown tombstones
@wojtek-t wojtek-t merged commit 6ffe46a into kubernetes:master May 28, 2015
@ghost
Copy link

ghost commented May 28, 2015

It looks very much like this broke our e2e tests, and will need to be rolled back.

@thockin oncall FYI
@bprashanth FYI

Starting from test run number 6448 in our Jenkins CI "kubernetes-e2e-gce" a bunch of tests started failing. The only merge between that run and the previous one is this PR.

I haven't yet debugged exactly what's going on, but I'm pretty sure the fault is with this PR.

@thockin
Copy link
Member

thockin commented May 28, 2015

Echoing this. Trying to get a better handle on it, but this is the first victim if I can't get something soon.

@bprashanth
Copy link
Contributor Author

Will debug but feel free to revert, I ran e2e when i uploaded the pr but not after that (2 days ago). At the time there was only a single failure, from my logs [Fail] PD [It] should schedule a pod w/ a RW PD, remove it, then schedule it on another host.

@ghost
Copy link

ghost commented May 28, 2015

I am looking into each of the reporducable cases now

On Thu, May 28, 2015 at 9:30 AM, Prashanth B notifications@github.com
wrote:

Will debug but feel free to revert, I ran e2e when i uploaded the pr but
not after that (2 days ago). At the time there was only a single failure,
from my logs [Fail] PD [It] should schedule a pod w/ a RW PD, remove it,
then schedule it on another host.


Reply to this email directly or view it on GitHub
#8822 (comment)
.

@thockin
Copy link
Member

thockin commented May 28, 2015

No luck reproducing failures so far. Going to broad-spectrum solutions, reverting this.

@wojtek-t
Copy link
Member

@bprashanth - will you be able to look into it and resubmit it? I was looking through our scalability tests on Jenkins and observed cases when this problem appeared (although due to higher timeout they passed).

@bprashanth
Copy link
Contributor Author

@wojtek-t this was reverted to clear the water, there's nothing wrong with it. I'm going to unrevert it (after I rerun e2e at head etc) since we figured out the leaky routes problem yesterday

@wojtek-t
Copy link
Member

Great thanks!

@bprashanth bprashanth deleted the fifo_rc branch October 26, 2015 00:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm "Looks good to me", indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants