Fix legacy fallback stale for aggregated discovery #115770
Conversation
/triage accepted
/retest
Force-pushed from 6979a0f to 1b2a71c
Force-pushed from 1b2a71c to ed98c1d
@@ -292,9 +292,9 @@ func (dm *discoveryManager) fetchFreshDiscoveryForService(gv metav1.GroupVersion
		lastUpdated: now,
	}

	// Save the resolve, because it is still useful in case other services
	// are already marked dirty. They can use it without making http request
	dm.setCacheEntryForService(info.service, cached)
prior to this change, the discovery check interval was effectively once every minute for each apiservice. After this change it is an unbounded number of checks per unit time. Can we get back the "once every minute" aspect of the original?
The one check every minute is done here (kubernetes/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery.go, lines 426 to 439 in ed98c1d):
	wait.PollUntil(1*time.Minute, func() (done bool, err error) {
		dm.servicesLock.Lock()
		defer dm.servicesLock.Unlock()
		now := time.Now()
		// Mark all non-local APIServices as dirty
		for key, info := range dm.apiServices {
			info.lastMarkedDirty = now
			dm.apiServices[key] = info
			dm.dirtyAPIServiceQueue.Add(key)
		}
		return false, nil
	}, stopCh)
The behavior before was that the check is once a minute, but if multiple apiservices shared the same service, a fetch would not be performed (the cached result was used instead), so the behavior was effectively at most one check per apiservice per minute. Now, the behavior is exactly one check per minute per apiservice for apiservices that need to use the legacy fallback, while for apiservices that support aggregated discovery the behavior is unchanged: at most one check per apiservice per minute.
Now, the behavior is exactly one check per minute per apiservice for apiservices that need to use the legacy fallback, while for apiservices that support aggregated discovery
If AddAPIService is called more than once for whatever reason (maybe status got touched?), after this PR it is checked more than once per minute, right?
I think also prior to this PR, calling AddAPIService more than once per minute would result in the service being added to the queue every time, if it had synced by then, and result in refetching the discovery document.
Before and after this PR, the controller treats an update to the APIService as a possible change to the underlying service. I think it is worth filing a separate issue to fix.
I think at the very least maybe we should add a check to at least see if the Spec.Service has changed at all before marking the APIService/service as dirty.
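For concreteness, here is a minimal sketch of what such a guard could look like; the serviceReference type and field names are simplified assumptions for this example, not the actual apiregistration types:

package main

import "fmt"

// Minimal sketch of the "has Spec.Service changed?" guard suggested above.
// The serviceReference type and field names are simplified assumptions for
// illustration, not the actual apiregistration/kube-aggregator types.
type serviceReference struct {
	Namespace string
	Name      string
	Port      int32
}

// sameBackingService reports whether two APIService specs point at the same
// underlying service; if so, a status-only update would not need to mark the
// service dirty and trigger a refetch.
func sameBackingService(old, updated *serviceReference) bool {
	if old == nil || updated == nil {
		return old == updated // both nil means a local APIService with no backing service
	}
	return old.Namespace == updated.Namespace &&
		old.Name == updated.Name &&
		old.Port == updated.Port
}

func main() {
	before := &serviceReference{Namespace: "kube-system", Name: "metrics-server", Port: 443}
	after := &serviceReference{Namespace: "kube-system", Name: "metrics-server", Port: 443}
	fmt.Println(sameBackingService(before, after)) // true: skip marking dirty
}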
I think also prior to this PR, calling AddAPIService more than once per minute would result in the service being added to the queue every time
I thought the dirty time was set for one minute later, but I can't find that line now. I'll double check tomorrow, but if I can't find it, I'm ok with this change.
I think at the very least maybe we should add a check to at least see if the Spec.Service has changed at all before marking the APIService/service as dirty.
The check needs to be slightly smarter: in the modern case you do need to periodically re-check the content to see if the etag changed, and in the legacy case you probably want to check every few minutes.
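A hedged sketch of such a policy (the function, names, and durations are assumptions for illustration, not the actual kube-aggregator behavior):

package main

import (
	"fmt"
	"time"
)

// Hedged sketch of the "slightly smarter" freshness policy described above.
// The names and durations are illustrative assumptions.
func shouldRefetch(lastUpdated time.Time, usesLegacyFallback bool, now time.Time) bool {
	// Aggregated-discovery case: re-check roughly every minute so a changed
	// ETag is noticed reasonably quickly.
	maxAge := 1 * time.Minute
	if usesLegacyFallback {
		// Legacy case: refetching the full document every few minutes is enough.
		maxAge = 5 * time.Minute
	}
	return now.Sub(lastUpdated) > maxAge
}

func main() {
	now := time.Now()
	fmt.Println(shouldRefetch(now.Add(-30*time.Second), false, now)) // false: still fresh
	fmt.Println(shouldRefetch(now.Add(-2*time.Minute), true, now))   // false: legacy tolerates a few minutes
	fmt.Println(shouldRefetch(now.Add(-10*time.Minute), true, now))  // true: stale, refetch
}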
Looks like now everywhere. I must have mis-remembered.
/approve leaving lgtm with @alexzielenski since he's in it too.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deads2k, Jefftree
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment
/lgtm
LGTM label has been added. Git tree hash: ecd84b95950e45c4056c5344a057109976d27057
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to the following rules:
You can:
/retest
/kind bug
What this PR does / why we need it:
Fixes legacy fallback returning no resources for API Services that share the same service object.
The test does exercise this specific case, but I don't know if it's the best we can do. The bug reproduces only when the number of APIServices added exceeds the number the workers can process concurrently (currently 2 workers, see https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery.go#L402), so the test has three APIServices.
In this specific case, lastMarkedDirty is set to a time before the first request to the legacy fallback and subsequent calls to fetchFreshDiscoveryForService will use the cached version rather than fetching from the legacy endpoint.
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery.go#L182-L187
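Paraphrasing the linked check as a sketch (the types and names here are illustrative, not the linked code verbatim): the cached result is reused whenever the cache entry was written after the APIService was last marked dirty.

package main

import (
	"fmt"
	"time"
)

// Paraphrased sketch of the short-circuit at the linked lines; the types and
// names are illustrative, not the kube-aggregator code verbatim.
type cachedResult struct {
	lastUpdated time.Time
}

type serviceInfo struct {
	lastMarkedDirty time.Time
}

// useCached reports whether a fresh fetch can be skipped because the cache
// entry was written after the APIService was last marked dirty.
func useCached(cached cachedResult, info serviceInfo) bool {
	return cached.lastUpdated.After(info.lastMarkedDirty)
}

func main() {
	dirtyAt := time.Now()
	// A sibling APIService behind the same service performs the legacy-fallback
	// fetch shortly after dirtyAt and updates the shared cache entry, so this
	// APIService reuses it, even though the entry only contains the sibling's
	// group-version and therefore has no resources for this one.
	cached := cachedResult{lastUpdated: dirtyAt.Add(2 * time.Second)}
	fmt.Println(useCached(cached, serviceInfo{lastMarkedDirty: dirtyAt})) // true: cached copy reused
}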
The cache is problematic because we cache the result for the entire service as a whole. However, in the case of the legacy fallback, only the resources for one group-version are fetched, so we should not cache that as the return value for the entire service, since the service can also serve other group-versions.
@deads2k has an alternate fix that partitions the cache by APIService rather than by service: #115728. It has about the same performance if every service has a legacy fallback, but is probably slightly less efficient once the service migrates to aggregated discovery.
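To make the trade-off concrete, here is a hedged sketch of the two cache keyings; the types and data are invented for illustration, not the kube-aggregator code:

package main

import "fmt"

// Invented type for this example: a discovery document reduced to
// "group/version" -> resource names.
type gvResources map[string][]string

func main() {
	// Keyed by backing service (the pre-existing layout): one entry is shared by
	// every APIService behind the service. With the legacy fallback, a fetch for
	// one group-version populates the shared entry, and siblings that reuse it
	// find no resources for their own group-version.
	cacheByService := map[string]gvResources{
		"metrics-server": {"metrics.k8s.io/v1beta1": {"nodes", "pods"}},
	}

	// Keyed by APIService (the alternate fix in #115728): each group-version owns
	// its own entry, so a sibling can never be served a document that is missing
	// its group-version, at the cost of duplicate fetches once the service
	// supports aggregated discovery.
	cacheByAPIService := map[string]gvResources{
		"v1beta1.metrics.k8s.io": {"metrics.k8s.io/v1beta1": {"nodes", "pods"}},
	}

	fmt.Println(cacheByService, cacheByAPIService)
}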
Which issue(s) this PR fixes:
Fixes #115559
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/cc @deads2k @alexzielenski @apelisse @seans3