
Fix legacy fallback stale for aggregated discovery #115770

Merged

Conversation

Jefftree
Member

/kind bug

What this PR does / why we need it:

Fixes legacy fallback returning no resources for API Services that share the same service object.

The test does exercise this specific case, but I'm not sure it's the best we can do. The bug reproduces only when the number of APIServices added exceeds the number the workers can process concurrently (currently 2 workers: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery.go#L402), so the test uses three APIServices.

In this specific case, lastMarkedDirty is set to a time before the first request to the legacy fallback, so subsequent calls to fetchFreshDiscoveryForService use the cached version rather than fetching from the legacy endpoint.

https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery.go#L182-L187

The cache is problematic because we cache the result for the entire service as a whole. However, in the case of the legacy fallback, only the resources for one group-version are fetched, so we should not cache that as the return value for the entire service, since the service can also serve other group-versions.
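
Roughly, the check at the linked lines behaves like the sketch below (simplified; the exists variable and the exact condition are paraphrased rather than copied from the code):

if exists && cached.lastUpdated.After(info.lastMarkedDirty) {
	// Cache hit for the whole service. With the legacy fallback, this cached
	// document only contains the one group-version that was fetched, so every
	// other group-version backed by the same service comes back empty.
	return cached, nil
}
// Cache miss: fetch a fresh discovery document from the service.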

@deads2k has an alternate fix (#115728) that partitions the cache by APIService rather than by service. It has about the same performance if every service uses the legacy fallback, but is probably slightly less efficient once a service migrates to aggregated discovery.
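
For contrast, a rough sketch of the two cache layouts (the type and variable names below are illustrative only; the real declarations live in handler_discovery.go and in #115728):

// Illustrative types only.
type serviceKey struct{ Namespace, Name string }
type cachedResult struct{} // discovery document, etag, lastUpdated, ...

// This PR's starting point: one cached document per backing service, so a
// legacy-fallback fetch for a single group-version overwrites the entry for
// the whole service.
var cacheByService map[serviceKey]*cachedResult

// Alternate fix (#115728): one cached document per APIService (keyed by name,
// e.g. "v1beta1.example.com"), so a legacy-fallback fetch only affects the
// APIService it was made for, at the cost of one entry per APIService.
var cacheByAPIService map[string]*cachedResult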

Which issue(s) this PR fixes:

Fixes #115559

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Yes, the discovery document will correctly return the resources for aggregated apiservers that do not implement aggregated discovery.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/cc @deads2k @alexzielenski @apelisse @seans3

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 14, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 14, 2023
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 14, 2023
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 14, 2023
@Jefftree
Member Author

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 14, 2023
@Jefftree
Member Author

/retest

@Jefftree Jefftree force-pushed the aggregated-discovery-legacy-fix branch 2 times, most recently from 6979a0f to 1b2a71c Compare February 14, 2023 17:45
@Jefftree Jefftree force-pushed the aggregated-discovery-legacy-fix branch from 1b2a71c to ed98c1d Compare February 14, 2023 17:46
@@ -292,9 +292,9 @@ func (dm *discoveryManager) fetchFreshDiscoveryForService(gv metav1.GroupVersion
lastUpdated: now,
}

// Save the resolve, because it is still useful in case other services
// are already marked dirty. They can use it without making an http request
dm.setCacheEntryForService(info.service, cached)
Contributor

Prior to this change, the discovery check interval was effectively once every minute for each apiservice. After this change, it is an unbounded number of checks per unit time. Can we get back the "once every minute" aspect of the original?

Member Author

The one check every minute is done here:

wait.PollUntil(1*time.Minute, func() (done bool, err error) {
	dm.servicesLock.Lock()
	defer dm.servicesLock.Unlock()
	now := time.Now()
	// Mark all non-local APIServices as dirty
	for key, info := range dm.apiServices {
		info.lastMarkedDirty = now
		dm.apiServices[key] = info
		dm.dirtyAPIServiceQueue.Add(key)
	}
	return false, nil
}, stopCh)
and is untouched in this PR.

The behavior before was that the check runs once a minute, but if multiple apiservices shared the same service, a fetch was not performed for each of them (the cached result was used instead), so it was effectively at most one check per apiservice per minute. Now, the behavior is exactly one check per minute per apiservice for apiservices that need to use the legacy fallback, while for apiservices that support aggregated discovery the behavior is unchanged: at most one check per apiservice per minute.

Contributor

Now, the behavior is exactly one check per minute per apiservice for apiservices that need to use the legacy fallback, while for apiservices that support aggregated discovery

If AddAPIService is called more than once for whatever reason (maybe status got touched?), after this PR it is checked more than once per minute, right?

Contributor

I think that also prior to this PR, calling AddAPIService more than once per minute would result in the service being added to the queue every time (if it had synced by then) and the discovery document being refetched.

Both before and after this PR, the controller treats an update to the APIService as a possible change to the underlying service. I think it is worth filing a separate issue to fix that.

I think at the very least we should add a check to see if the Spec.Service has changed at all before marking the APIService/service as dirty.
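
Something roughly like the following (a hypothetical sketch, not code from this PR; the helper name is made up, and apiequality is k8s.io/apimachinery's semantic equality package):

import (
	apiequality "k8s.io/apimachinery/pkg/api/equality"
	apiregistrationv1 "k8s.io/kube-aggregator/pkg/apis/apiregistration/v1"
)

// specServiceChanged is a hypothetical guard for the update path: only mark
// the APIService (and its backing service) dirty when the referenced
// Spec.Service actually changed.
func specServiceChanged(oldAPIService, newAPIService *apiregistrationv1.APIService) bool {
	return !apiequality.Semantic.DeepEqual(oldAPIService.Spec.Service, newAPIService.Spec.Service)
}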

Contributor

I think that also prior to this PR, calling AddAPIService more than once per minute would result in the service being added to the queue every time

I thought the dirty time was set for one minute later, but I can't find that line now. I'll double check tomorrow, but if I can't find it, I'm ok with this change.

I think at the very least we should add a check to see if the Spec.Service has changed at all before marking the APIService/service as dirty.

The check needs to be slightly smarter: in the modern case you still need to periodically check the content to see whether the etag changed, and in the legacy case you probably want to re-check every few minutes.
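
For illustration only, such a policy could look roughly like this (all names and the interval below are hypothetical, not anything in this PR):

import "time"

// Hypothetical recheck interval for services that only support the legacy
// (unaggregated) discovery endpoints.
const legacyRecheckInterval = 5 * time.Minute // illustrative value

// needsRefetch sketches the suggested policy: services that support aggregated
// discovery are always re-fetched (an unchanged etag keeps that cheap), while
// legacy services are only re-fetched once the recheck interval has passed.
func needsRefetch(supportsAggregatedDiscovery bool, lastUpdated, now time.Time) bool {
	if supportsAggregatedDiscovery {
		return true
	}
	return now.Sub(lastUpdated) > legacyRecheckInterval
}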

Contributor

Looks like it's set to now everywhere. I must have misremembered.

@deads2k
Contributor

deads2k commented Feb 15, 2023

/approve
/assign @alexzielenski

leaving lgtm with @alexzielenski since he's in it too.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, Jefftree

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 15, 2023
@deads2k
Contributor

deads2k commented Feb 21, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: ecd84b95950e45c4056c5344a057109976d27057

@k8s-triage-robot

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

1 similar comment

Successfully merging this pull request may close these issues.

Aggregated Discovery should not cache on legacy fallback