
Fix legacy fallback stale for aggregated discovery #115770

Merged

Conversation

Jefftree
Member

/kind bug

What this PR does / why we need it:

Fixes legacy fallback returning no resources for API Services that share the same service object.

The test does exercise this specific case, but I'm not sure it's the best we can do. The bug reproduces only when the number of APIServices added exceeds the number the workers can process concurrently (currently 2 workers: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery.go#L402), so the test uses three APIServices.

In this specific case, lastMarkedDirty is set to a time before the first request to the legacy fallback, so subsequent calls to fetchFreshDiscoveryForService use the cached version rather than fetching from the legacy endpoint.

https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery.go#L182-L187

The cache is problematic because we cache the result for the entire service as a whole. However, in the case of the legacy fallback, only the resources for one group-version are fetched, so we should not cache that as the return value for the entire service, since the service can also serve other group-versions.
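
Roughly, the check at the linked lines behaves like the sketch below (simplified; the exists variable and the exact condition are paraphrased rather than copied from the code):

if exists && cached.lastUpdated.After(info.lastMarkedDirty) {
	// Cache hit for the whole service. With the legacy fallback, this cached
	// document only contains the one group-version that was fetched, so every
	// other group-version backed by the same service comes back empty.
	return cached, nil
}
// Cache miss: fetch a fresh discovery document from the service.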

@deads2k has an alternate fix (#115728) that partitions the cache by APIService rather than by service. It has about the same performance if every service uses the legacy fallback, but is probably slightly less efficient once a service migrates to aggregated discovery.
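
For contrast, a rough sketch of the two cache layouts (the type and variable names below are illustrative only; the real declarations live in handler_discovery.go and in #115728):

// Illustrative types only.
type serviceKey struct{ Namespace, Name string }
type cachedResult struct{} // discovery document, etag, lastUpdated, ...

// This PR's starting point: one cached document per backing service, so a
// legacy-fallback fetch for a single group-version overwrites the entry for
// the whole service.
var cacheByService map[serviceKey]*cachedResult

// Alternate fix (#115728): one cached document per APIService (keyed by name,
// e.g. "v1beta1.example.com"), so a legacy-fallback fetch only affects the
// APIService it was made for, at the cost of one entry per APIService.
var cacheByAPIService map[string]*cachedResult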

Which issue(s) this PR fixes:

Fixes #115559

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Yes, the discovery document will correctly return the resources for aggregated apiservers that do not implement aggregated discovery.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/cc @deads2k @alexzielenski @apelisse @seans3

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 14, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 14, 2023
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 14, 2023
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 14, 2023
@Jefftree
Member Author

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 14, 2023
@Jefftree
Member Author

/retest

@Jefftree Jefftree force-pushed the aggregated-discovery-legacy-fix branch 2 times, most recently from 6979a0f to 1b2a71c Compare February 14, 2023 17:45
@Jefftree Jefftree force-pushed the aggregated-discovery-legacy-fix branch from 1b2a71c to ed98c1d Compare February 14, 2023 17:46
@@ -292,9 +292,9 @@ func (dm *discoveryManager) fetchFreshDiscoveryForService(gv metav1.GroupVersion
lastUpdated: now,
}

// Save the resolve, because it is still useful in case other services
// are already marked dirty. They can use it without making an http request
dm.setCacheEntryForService(info.service, cached)
Contributor

Prior to this change, the discovery check interval was effectively once every minute for each apiservice. After this change, it is an unbounded number of checks per unit time. Can we get back the "once every minute" aspect of the original?

Member Author

The one check every minute is done here:

wait.PollUntil(1*time.Minute, func() (done bool, err error) {
	dm.servicesLock.Lock()
	defer dm.servicesLock.Unlock()
	now := time.Now()
	// Mark all non-local APIServices as dirty
	for key, info := range dm.apiServices {
		info.lastMarkedDirty = now
		dm.apiServices[key] = info
		dm.dirtyAPIServiceQueue.Add(key)
	}
	return false, nil
}, stopCh)
and is untouched in this PR.

The behavior before was that the check runs once a minute, but if multiple apiservices shared the same service, a fetch was not performed for each of them (the cached result was used instead), so it was effectively at most one check per apiservice per minute. Now, the behavior is exactly one check per minute per apiservice for apiservices that need to use the legacy fallback, while for apiservices that support aggregated discovery the behavior is unchanged: at most one check per apiservice per minute.

Contributor

Now, the behavior is exactly one check per minute per apiservice for apiservices that need to use the legacy fallback, while for apiservices that support aggregated discovery

If AddAPIService is called more than once for whatever reason (maybe status got touched?), after this PR it is checked more than once per minute, right?

Contributor

I think that also prior to this PR, calling AddAPIService more than once per minute would result in the service being added to the queue every time (if it had synced by then) and the discovery document being refetched.

Both before and after this PR, the controller treats an update to the APIService as a possible change to the underlying service. I think it is worth filing a separate issue to fix that.

I think at the very least we should add a check to see if the Spec.Service has changed at all before marking the APIService/service as dirty.
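
Something roughly like the following (a hypothetical sketch, not code from this PR; the helper name is made up, and apiequality is k8s.io/apimachinery's semantic equality package):

import (
	apiequality "k8s.io/apimachinery/pkg/api/equality"
	apiregistrationv1 "k8s.io/kube-aggregator/pkg/apis/apiregistration/v1"
)

// specServiceChanged is a hypothetical guard for the update path: only mark
// the APIService (and its backing service) dirty when the referenced
// Spec.Service actually changed.
func specServiceChanged(oldAPIService, newAPIService *apiregistrationv1.APIService) bool {
	return !apiequality.Semantic.DeepEqual(oldAPIService.Spec.Service, newAPIService.Spec.Service)
}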

Contributor

I think that also prior to this PR, calling AddAPIService more than once per minute would result in the service being added to the queue every time

I thought the dirty time was set for one minute later, but I can't find that line now. I'll double check tomorrow, but if I can't find it, I'm ok with this change.

I think at the very least we should add a check to see if the Spec.Service has changed at all before marking the APIService/service as dirty.

The check needs to be slightly smarter: in the modern case you still need to periodically check the content to see whether the etag changed, and in the legacy case you probably want to re-check every few minutes.
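
For illustration only, such a policy could look roughly like this (all names and the interval below are hypothetical, not anything in this PR):

import "time"

// Hypothetical recheck interval for services that only support the legacy
// (unaggregated) discovery endpoints.
const legacyRecheckInterval = 5 * time.Minute // illustrative value

// needsRefetch sketches the suggested policy: services that support aggregated
// discovery are always re-fetched (an unchanged etag keeps that cheap), while
// legacy services are only re-fetched once the recheck interval has passed.
func needsRefetch(supportsAggregatedDiscovery bool, lastUpdated, now time.Time) bool {
	if supportsAggregatedDiscovery {
		return true
	}
	return now.Sub(lastUpdated) > legacyRecheckInterval
}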

Contributor

Looks like it's set to now everywhere. I must have misremembered.

@deads2k
Contributor

deads2k commented Feb 15, 2023

/approve
/assign @alexzielenski

leaving lgtm with @alexzielenski since he's in it too.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, Jefftree

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 15, 2023
@deads2k
Contributor

deads2k commented Feb 21, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: ecd84b95950e45c4056c5344a057109976d27057

@k8s-triage-robot

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

1 similar comment

Successfully merging this pull request may close these issues.

Aggregated Discovery should not cache on legacy fallback