absence of openapi configuration in integration tests makes server-side apply panic (broke apf controller when it switched to SSA) #107727

Closed
tkashem opened this issue Jan 24, 2022 · 47 comments · Fixed by #110173
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@tkashem
Contributor

tkashem commented Jan 24, 2022

What happened?

Integration test timed out due to the following error

E0120 14:05:02.398256  118058 runtime.go:77] Observed a panic: FieldManager must be installed to run apply

The error appeared 748 times in the log. It looks like a lot of churn happens during the apf configuration bootstrapping phase.
Is there a race condition between when the type is registered and when the API calls are made?

Stack trace:

I0120 14:05:02.397911  118058 apf_controller.go:879] Triggered API priority and fairness config reloading because priority level exempt is undesired and idle
I0120 14:05:02.398135  118058 panic.go:1038] "HTTP" verb="APPLY" URI="/apis/flowcontrol.apiserver.k8s.io/v1beta2/flowschemas/system-node-high/status?fieldManager=api-priority-and-fairness-config-consumer-v1&force=true" latency="650.951µs" userAgent="Go-http-client/1.1" audit-ID="bc3c561e-1a80-403e-b611-63ba7bd484b9" srcIP="127.0.0.1:46572" apf_pl="exempt" apf_fs="exempt" apf_fd="" resp=0
E0120 14:05:02.398256  118058 runtime.go:77] Observed a panic: FieldManager must be installed to run apply
goroutine 136765 [running]:
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest.func1.1()
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/finisher/finisher.go:105 +0xaf
panic({0x3fc0bc0, 0x542eb00})
	/usr/local/go/src/runtime/panic.go:1038 +0x215
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers.(*applyPatcher).applyPatchToCurrentObject(0x551ceb0, {0x553da98, 0xc0887ccc00}, {0x551ceb0, 0xc0887da480})
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/patch.go:482 +0x449
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers.(*patcher).applyPatch(0xc0bf781a40, {0x553da98, 0xc0887ccc00}, {0x4c296a0, 0xc0887da480}, {0x551ceb0, 0xc0887da480})
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/patch.go:566 +0xd3
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/rest.(*defaultUpdatedObjectInfo).UpdatedObject(0x81f4878, {0x553da98, 0xc0887ccc00}, {0x551ceb0, 0xc0887da480})
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/rest/update.go:229 +0xd0
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).Update.func1({0x551ceb0, 0xc0887da480}, {0xc07278e397b5dfde, 0x8b5f27eafb})
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry/store.go:533 +0x1f9
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/etcd3.(*store).updateState(0xc03d204bd0, 0xc06ba15310, 0x42)
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/etcd3/store.go:894 +0x3e
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/etcd3.(*store).GuaranteedUpdate(0xc03d204bd0, {0x553da98, 0xc0887ccc00}, {0xc01908f2e0, 0x1d}, {0x551ceb0, 0xc0887da300}, 0x1, 0x40ef94, 0xc0848d34a0, ...)
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/etcd3/store.go:365 +0x56e
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.(*Cacher).GuaranteedUpdate(0xc0e9754c60, {0x553da98, 0xc0887ccc00}, {0xc01908f2e0, 0x1d}, {0x551ceb0, 0xc0887da300}, 0xa8, 0x4b86ae0, 0xc0848d34a0, ...)
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher/cacher.go:721 +0x1b5
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*DryRunnableStorage).GuaranteedUpdate(0x4cbc218, {0x553da98, 0xc0887ccc00}, {0xc01908f2e0, 0x4cac0b0}, {0x551ceb0, 0xc0887da300}, 0xf2, 0x1, 0xc0848d34a0, ...)
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry/dryrun.go:97 +0x1c7
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).Update(0xc0e5c5b2c0, {0x553da98, 0xc0887ccc00}, {0xc0412d2b3d, 0xb}, {0x5523328, 0xc0887cccf0}, 0xc0105369b0, 0x4ea4ac8, 0x0, ...)
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry/store.go:521 +0x508
k8s.io/kubernetes/pkg/registry/flowcontrol/flowschema/storage.(*StatusREST).Update(0xc0887ccd50, {0x553da98, 0xc0887ccc00}, {0xc0412d2b3d, 0xc086dd5bf0}, {0x5523328, 0xc0887cccf0}, 0xc0991a2748, 0x40ef94, 0x1, ...)
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/registry/flowcontrol/flowschema/storage/storage.go:93 +0x52
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers.(*patcher).patchResource.func2()
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/patch.go:665 +0xa7
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers.(*patcher).patchResource.func3()
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/patch.go:671 +0x38
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest.func1()
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/finisher/finisher.go:117 +0x8f
created by k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/finisher/finisher.go:92 +0xe5

Integration job link: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/107456/pull-kubernetes-integration/1484156493719670784
PR: #107456

I searched for this error in CI and in this PR; so far it appears only in this PR, so I am hoping it has not introduced any flakes. (We may need to let more time pass before flakes start to show up.)

What did you expect to happen?

We should not see the error FieldManager must be installed to run apply.

How can we reproduce it (as minimally and precisely as possible)?

I saw it only in the PR mentioned.

@tkashem tkashem added the kind/bug Categorizes issue or PR as related to a bug. label Jan 24, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 24, 2022
@tkashem
Contributor Author

tkashem commented Jan 24, 2022

/sig api-machinery

cc @wojtek-t @MikeSpreitzer

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 24, 2022
@MikeSpreitzer
Member

I see

	if p.fieldManager == nil {
		panic("FieldManager must be installed to run apply")
	}

@MikeSpreitzer
Member

The fact that it happens to a non-mandatory object implies that the object was received from the server, so the type must be registered in the server.

The APF config-consuming controller does this:

		applyOptions := metav1.ApplyOptions{FieldManager: cfgCtlr.asFieldManager, Force: true}

		// the condition field in fsStatusUpdate holds the new condition we want to update.
		// TODO: this will break when we have multiple conditions for a flowschema
		_, err := fsIfc.ApplyStatus(context.TODO(), toFlowSchemaApplyConfiguration(fsu), applyOptions)
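For context, here is a minimal self-contained sketch of that pattern using the generated client-go apply configurations for flowcontrol/v1beta2. The helper name and the condition values are illustrative, not the APF controller's actual code:

package apfexample

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	flowcontrolac "k8s.io/client-go/applyconfigurations/flowcontrol/v1beta2"
	"k8s.io/client-go/kubernetes"
)

// applyFlowSchemaCondition is a hypothetical helper: build an apply
// configuration that carries only the status condition this controller owns,
// then send it with server-side apply under the controller's field manager.
func applyFlowSchemaCondition(ctx context.Context, cs kubernetes.Interface, name string) error {
	fs := flowcontrolac.FlowSchema(name).
		WithStatus(flowcontrolac.FlowSchemaStatus().
			WithConditions(flowcontrolac.FlowSchemaCondition().
				WithType("Dangling").
				WithStatus("False").
				WithReason("Found").
				WithMessage("bound to a priority level")))
	_, err := cs.FlowcontrolV1beta2().FlowSchemas().ApplyStatus(ctx, fs,
		metav1.ApplyOptions{FieldManager: "api-priority-and-fairness-config-consumer-v1", Force: true})
	return err
}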

@MikeSpreitzer
Member

Look for the output from this line:

	klog.V(2).Infof("NewTestableController %q with serverConcurrencyLimit=%d, requestWaitLimit=%s, name=%s, asFieldManager=%q", cfgCtlr.name, cfgCtlr.serverConcurrencyLimit, cfgCtlr.requestWaitLimit, cfgCtlr.name, cfgCtlr.asFieldManager)

@MadhavJivrajani
Contributor

Another PR with this failure: #105483

@liggitt
Member

liggitt commented Jan 25, 2022

The panic means the request-scope FieldManager is nil.

That is initialized here:

if a.group.OpenAPIModels != nil && utilfeature.DefaultFeatureGate.Enabled(features.ServerSideApply) {
	reqScope.FieldManager, err = fieldmanager.NewDefaultFieldManager(
		a.group.TypeConverter,
		a.group.UnsafeConvertor,
		a.group.Defaulter,
		a.group.Creater,
		fqKindToRegister,
		reqScope.HubGroupVersion,
		subresource,
		resetFields,
	)
	if err != nil {
		return nil, nil, fmt.Errorf("failed to create field manager: %v", err)
	}
}

OpenAPIModels is set here:

// installAPIResources is a private method for installing the REST storage backing each api groupversionresource
func (s *GenericAPIServer) installAPIResources(apiPrefix string, apiGroupInfo *APIGroupInfo, openAPIModels openapiproto.Models) error {
	var resourceInfos []*storageversion.ResourceInfo
	for _, groupVersion := range apiGroupInfo.PrioritizedVersions {
		if len(apiGroupInfo.VersionedResourcesStorageMap[groupVersion.Version]) == 0 {
			klog.Warningf("Skipping API %v because it has no resources.", groupVersion)
			continue
		}
		apiGroupVersion := s.getAPIGroupVersion(apiGroupInfo, groupVersion, apiPrefix)
		if apiGroupInfo.OptionsExternalVersion != nil {
			apiGroupVersion.OptionsExternalVersion = apiGroupInfo.OptionsExternalVersion
		}
		apiGroupVersion.OpenAPIModels = openAPIModels

called from here:

openAPIModels, err := s.getOpenAPIModels(APIGroupPrefix, apiGroupInfos...)
if err != nil {
	return fmt.Errorf("unable to get openapi models: %v", err)
}
for _, apiGroupInfo := range apiGroupInfos {
	if err := s.installAPIResources(APIGroupPrefix, apiGroupInfo, openAPIModels); err != nil {

and openAPIModels is constructed here:

// getOpenAPIModels is a private method for getting the OpenAPI models
func (s *GenericAPIServer) getOpenAPIModels(apiPrefix string, apiGroupInfos ...*APIGroupInfo) (openapiproto.Models, error) {
	if s.openAPIConfig == nil {
		return nil, nil
	}

It looks like it is possible for openAPIConfig to be unset, which causes the OpenAPI models to be nil, which in turn causes apply to panic. I don't know whether only the integration tests are subject to that configuration.
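For illustration, a minimal sketch of what a test server's generic config needs so that getOpenAPIModels returns non-nil models and the per-request FieldManager gets installed. It reuses the same helpers kube-apiserver's own setup uses (DefaultOpenAPIConfig, NewDefinitionNamer); the wrapper function itself is hypothetical:

package apfexample

import (
	openapinamer "k8s.io/apiserver/pkg/endpoints/openapi"
	genericapiserver "k8s.io/apiserver/pkg/server"
	"k8s.io/kubernetes/pkg/api/legacyscheme"
	generatedopenapi "k8s.io/kubernetes/pkg/generated/openapi"
)

// withOpenAPI is a hypothetical helper: it populates OpenAPIConfig on a generic
// apiserver config. Without this, getOpenAPIModels returns (nil, nil), the
// request scope's FieldManager is never created, and any APPLY request panics
// with "FieldManager must be installed to run apply".
func withOpenAPI(genericConfig *genericapiserver.Config) {
	namer := openapinamer.NewDefinitionNamer(legacyscheme.Scheme)
	genericConfig.OpenAPIConfig = genericapiserver.DefaultOpenAPIConfig(generatedopenapi.GetOpenAPIDefinitions, namer)
	genericConfig.OpenAPIConfig.Info.Title = "Kubernetes"
}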

cc @apelisse @Jefftree
/priority important-soon
(since this is now being exercised in bring-up of integration test servers)

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 25, 2022
@liggitt
Member

liggitt commented Jan 25, 2022

if APF is creating/bootstrapping the objects without using apply, and then updating status using apply, that would explain how the object exists but apply fails

@liggitt
Member

liggitt commented Jan 25, 2022

I suspect that APF is not successfully updating status at all in the integration tests with this server configuration

@tkashem
Contributor Author

tkashem commented Jan 25, 2022

@liggitt do you want me to revert #107456, while we work on a solution?

@liggitt
Member

liggitt commented Jan 25, 2022

I don't know how bad the test impact is. Work with @apelisse / @Jefftree to determine if we should revert or fix forward

@MikeSpreitzer
Member

@liggitt : I do not follow why you talked about server-side stuff then suggested the problem is in the client (a controller in the kube-apiserver) not supplying a fieldManager.

The client does supply the fieldManager, for both creates and updates. See the following lines:

@liggitt
Member

liggitt commented Jan 25, 2022

@liggitt : I do not follow why you talked about server-side stuff then suggested the problem is in the client (a controller in the kube-apiserver) not supplying a fieldManager.

Different field manager. reqScope.FieldManager is set server-side in the endpoint handler based on the server configuration.

@MikeSpreitzer
Member

So if the problem is server-side, why did you suggest the problem is in the client?

@MikeSpreitzer
Member

Regarding the client, I looked in https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/107456/pull-kubernetes-integration/1484156493719670784/build-log.txt and found this line:

I0120 13:57:15.480218  117563 apf_controller.go:217] NewTestableController "Controller" with serverConcurrencyLimit=600, requestWaitLimit=15s, name=Controller, asFieldManager="api-priority-and-fairness-config-consumer-v1"

There are actually several lines of the same form, all with the same fieldManager.

@liggitt
Member

liggitt commented Jan 25, 2022

So if the problem is server-side, why did you suggest the problem is in the client?

I don't think I did suggest that :) all of the setup and calls in #107727 (comment) are server-side

@MikeSpreitzer
Member

MikeSpreitzer commented Jan 25, 2022

@liggitt : please help me understand #107727 (comment). That is about the client side. Are you saying that a client that creates an object using Create and later updates the object using Apply --- with a non-empty fieldManager in both --- will run into this server-side scenario? If so, is that because this client behavior is wrong or because there is a bug server-side?

@liggitt
Member

liggitt commented Jan 25, 2022

@liggitt : please help me understand #107727 (comment). That is about the client side.

That was in response to #107727 (comment), trying to figure out how the object was created but could not be updated.

Are you saying that a client that creates an object using Create and later updates the object using Apply --- with a non-empty fieldManager in both --- will run into this server-side scenario?

Yes, when the server is not configured to enable openapi and therefore does not have apply enabled.

If so, is that because this client behavior is wrong or because there is a bug server side?

Because the server is misconfigured (which, from what I can tell, is limited to some of the server instances started for integration tests)

@MikeSpreitzer
Member

Looking at https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/ and https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/ it is not obvious to me whether or how openapi could be disabled. Would that be by turning off the OpenAPIV3 feature flag? FYI, the inclusion of "V3" in the flag makes me unsure what the server does about other versions.

@liggitt : I think you are saying that it is possible to disable openapi and that this in turn disables server-side apply. That seems kind of surprising to me. Does this mean that it is simply incorrect to run a client that does server-side apply with a server that has openapi disabled? Is this intended? Is it documented anywhere?

@liggitt
Member

liggitt commented Jan 25, 2022

@liggitt : I think you are saying that it is possible to disable openapi and that this in turn disables server-side apply

It's not intended as an externalized configuration option. Integration tests have the ability to fiddle with many of the config struct options in the server in ways a kube-apiserver binary invocation could not. They intentionally use this to do things like inject specialized authorizers / authenticators for testing purposes, but it can also unintentionally result in a test server that doesn't match a real API server invocation.

I think this is an integration test server configuration problem, which should be fixed.

@Jefftree
Member

@liggitt is correct that we have a misconfiguration in the integration tests and SSA is only enabled on a subset of integration tests.

I'd like to understand the test failure conditions a bit better though. The Observed a panic: FieldManager must be installed to run apply message can be consistently reproduced on a test that does not target SSA (e.g. TestUnschedulableNodeDaemonDoesLaunchPod, selected from one of the test failures). However, I'm seeing the panic message while the test is still green and passes locally. It's a bit puzzling what contributes to the flakiness of the test, given that the FieldManager panic can be reproduced 100% of the time.

@liggitt
Member

liggitt commented Jan 25, 2022

The panic occurs when APF is updating status (which it just switched to do using SSA). The APF status update failure does not block the API requests the test cares about, so the test passes. I could see extra CPU usage contributing to flakiness.

@liggitt
Member

liggitt commented Mar 25, 2022

Looks like it got reintroduced in #107797; a revert is open in #109030.

@lavalamp
Member

For integration tests, maybe we can cheat and cache the OpenAPI data, so that it's not recomputed for each test?

@liggitt
Member

liggitt commented Mar 25, 2022

:-/

Maybe... but I really don't want to have non-real init paths in integration tests if at all possible.

@apelisse
Member

I wish we could go a little further in lazily initializing openapi (especially v2), so let's see what we can do here (after code-freeze probably)

@liggitt
Member

liggitt commented Mar 27, 2022

If SSA is actively using the openapi, making it lazy won't help, right?

@dims
Member

dims commented Mar 27, 2022

@liggitt does the revert #109030 mean that we can remove milestone v1.24 from this issue?

@liggitt liggitt added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Mar 27, 2022
@liggitt liggitt removed this from the v1.24 milestone Mar 27, 2022
@liggitt
Member

liggitt commented Mar 27, 2022

Yes, cleared

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 25, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 25, 2022
@MikeSpreitzer
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 25, 2022
@alexzielenski
Contributor

alexzielenski commented Jul 25, 2022

Taking a fresh look at this issue and wondering if it has been mostly fixed already.

From @liggitt's comments, my understanding is that this panic occurred due to two conditions:

  1. The kube-apiserver running as part of the integration test suite is asked to service an SSA request while OpenAPIModels is nil. The apiserver has logic to panic in this case. From the code path @liggitt commented on, OpenAPIModels is nil when the OpenAPIConfig the server is created with is nil.
  2. An APF worker (running alongside an integration test) issues an SSA request to the test apiserver. Since the apiserver is misconfigured, it panics. The panic could happen sporadically because sometimes the tests finish before the APF worker is able to issue the request.

As a brief check to see whether this might be fixed, I ran the TestUnschedulableNodeDaemonDoesLaunchPod test that @Jefftree mentioned was failing from SSA. I logged the Nodes that were created and found that they all had managed fields. I took this as an indication that SSA was now functioning properly.
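For reference, a check of that sort could look like the following sketch (a hypothetical helper, not the test's actual code), assuming a client-go clientset pointed at the test server:

package apfexample

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodesHaveManagedFields lists Nodes and reports whether every one of them has
// metadata.managedFields populated, which only happens when the server's field
// manager is installed.
func nodesHaveManagedFields(ctx context.Context, cs kubernetes.Interface) (bool, error) {
	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, node := range nodes.Items {
		if len(node.ManagedFields) == 0 {
			return false, nil
		}
	}
	return len(nodes.Items) > 0, nil
}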

Indeed, to start the test server the setup function calls kubeapiservertesting.StartTestServerOrDie. This function has a code path which unconditionally leads to the apiserver's OpenAPIConfig being set.

StartTestServer calls CreateServerChain before running the returned apiserver:

t.Logf("runtime-config=%v", completedOptions.APIEnablement.RuntimeConfig)
t.Logf("Starting kube-apiserver on port %d...", s.SecureServing.BindPort)
server, err := app.CreateServerChain(completedOptions)
if err != nil {
return result, fmt.Errorf("failed to create server chain: %v", err)
}
if instanceOptions.StorageVersionWrapFunc != nil {
server.GenericAPIServer.StorageVersionManager = instanceOptions.StorageVersionWrapFunc(server.GenericAPIServer.StorageVersionManager)
}
errCh = make(chan error)
go func(stopCh <-chan struct{}) {
defer close(errCh)
prepared, err := server.PrepareRun()
if err != nil {
errCh <- err
} else if err := prepared.Run(stopCh); err != nil {
errCh <- err
}
}(stopCh)

CreateServerChain calls CreateKubeAPIServerConfig before using the returned config to create the returned apiserver:

func CreateServerChain(completedOptions completedServerRunOptions) (*aggregatorapiserver.APIAggregator, error) {
	kubeAPIServerConfig, serviceResolver, pluginInitializer, err := CreateKubeAPIServerConfig(completedOptions)
	if err != nil {
		return nil, err
	}
	// If additional API servers are added, they should be gated.
	apiExtensionsConfig, err := createAPIExtensionsConfig(*kubeAPIServerConfig.GenericConfig, kubeAPIServerConfig.ExtraConfig.VersionedInformers, pluginInitializer, completedOptions.ServerRunOptions, completedOptions.MasterCount,
		serviceResolver, webhook.NewDefaultAuthenticationInfoResolverWrapper(kubeAPIServerConfig.ExtraConfig.ProxyTransport, kubeAPIServerConfig.GenericConfig.EgressSelector, kubeAPIServerConfig.GenericConfig.LoopbackClientConfig, kubeAPIServerConfig.GenericConfig.TracerProvider))
	if err != nil {
		return nil, err
	}
	notFoundHandler := notfoundhandler.New(kubeAPIServerConfig.GenericConfig.Serializer, genericapifilters.NoMuxAndDiscoveryIncompleteKey)
	apiExtensionsServer, err := createAPIExtensionsServer(apiExtensionsConfig, genericapiserver.NewEmptyDelegateWithCustomHandler(notFoundHandler))
	if err != nil {
		return nil, err
	}
	kubeAPIServer, err := CreateKubeAPIServer(kubeAPIServerConfig, apiExtensionsServer.GenericAPIServer)

CreateKubeAPIServerConfig uses buildGenericConfig which sets OpenAPIConfig:

func CreateKubeAPIServerConfig(s completedServerRunOptions) (
	*controlplane.Config,
	aggregatorapiserver.ServiceResolver,
	[]admission.PluginInitializer,
	error,
) {
	proxyTransport := CreateProxyTransport()
	genericConfig, versionedInformers, serviceResolver, pluginInitializers, admissionPostStartHook, storageFactory, err := buildGenericConfig(s.ServerRunOptions, proxyTransport)
	if err != nil {
		return nil, nil, nil, err
	}

	// wrap the definitions to revert any changes from disabled features
	getOpenAPIDefinitions := openapi.GetOpenAPIDefinitionsWithoutDisabledFeatures(generatedopenapi.GetOpenAPIDefinitions)
	genericConfig.OpenAPIConfig = genericapiserver.DefaultOpenAPIConfig(getOpenAPIDefinitions, openapinamer.NewDefinitionNamer(legacyscheme.Scheme, extensionsapiserver.Scheme, aggregatorscheme.Scheme))
	genericConfig.OpenAPIConfig.Info.Title = "Kubernetes"
	if utilfeature.DefaultFeatureGate.Enabled(genericfeatures.OpenAPIV3) {
		genericConfig.OpenAPIV3Config = genericapiserver.DefaultOpenAPIV3Config(getOpenAPIDefinitions, openapinamer.NewDefinitionNamer(legacyscheme.Scheme, extensionsapiserver.Scheme, aggregatorscheme.Scheme))
		genericConfig.OpenAPIV3Config.Info.Title = "Kubernetes"
	}

Thus, any test using kubeapiservertesting.StartTestServerOrDie should no longer be affected by this panic. The question then becomes: do all integration tests start their servers in this way?

It's hard to say. I found a PR (#110529) which refactors many occurrences of the old stanza:

controlPlaneConfig := framework.NewIntegrationTestControlPlaneConfig()
_, server, closeFn := framework.RunAnAPIServer(controlPlaneConfig)

Into something using the new method:

server := kubeapiservertesting.StartTestServerOrDie(t, nil, nil, framework.SharedEtcd())
defer server.TearDownFn()

But it appears RunAnAPIServer is still being used in a few integration tests.

Should these occurrences be refactored to use kubeapiservertesting.StartTestServerOrDie too?

@wojtek-t
Member

Should these occurrences be refactored to use kubeapiservertesting.StartTestServerOrDie too?

There were a ton of changes I did around that, but the last PR (which removes the old RunAnAPIServer entirely) is already out for review: #111395
@aojea - FYI

Thinking about it now, I guess I agree that the problem might have been the 10 different ways kube-apiserver was set up across different tests. I just don't know why it was causing flakes instead of failing consistently...

@liggitt
Member

liggitt commented Jul 26, 2022

There were definitely test setups that were not enabling openapi in the recent past.

Two possibilities for flakes:

  • Enabling openapi in all integration test setups was expensive and caused spurious timeouts
  • The excessive number of background panics handled and logged in integration tests when the API server tried to handle SSA requests bogged down the CI machine (I think something like 40,000 panics, each with a full stack trace, were getting logged)

@wojtek-t
Member

Enabling openapi in all integration test setups was expensive and caused spurious timeouts

I don't think that's the case. With the recent changes I made as part of #108483, I think OpenAPI is already enabled for all tests that run kube-apiserver (almost all of them), and that seems to work.

But you reminded me that back then we also had an issue where we were running many more tests simultaneously, which could have been contributing to this problem significantly.

The second reason is definitely a possible contributor too.

I guess my concern was that if some requests were panicking due to lack of OpenAPI, it means those requests aren't really needed. But that actually might make sense - the fact that we weren't able to update the status of PriorityLevels/FlowSchemas shouldn't prevent non-P&F tests from succeeding, and it did work in the P&F tests...

@liggitt
Member

liggitt commented Jul 26, 2022

I guess my concern was that if some requests were panicking due to lack of OpenAPI, it means those requests aren't really needed.

I think the panics were happening in async controllers, and no functionality blocked by those controllers intersected with the surface being tested; it just (maybe) caused logging/performance issues

@tkashem
Contributor Author

tkashem commented Sep 14, 2022

I believe #110529 has resolved the panic issue. I was able to verify this with #110173, which changes the priority & fairness controller to use server-side apply.

Before #110529 merged, I analyzed the integration log:

The integration test log file from an unrelated PR contains no trace of the panic, as expected:

$ cat junit_20220606-141751.stdout | grep 'Observed a panic: FieldManager must be installed to run apply' | wc -l
0

On the other hand, the log file from the integration job for #110173 (before #110529 merged) had plenty of these panics, even though the integration job turned green:

$ cat ssa-enabled-junit_20220624-163055.stdout | grep 'Observed a panic: FieldManager must be installed to run apply' | wc -l
1482

I wanted to check which tests produced these panics, so I did a breakdown by package:

$ cat ssa-enabled-junit_20220624-163055.stdout | jq -r '. | select(.Output | . != null and contains("Observed a panic: FieldManager must be installed to run apply"))' | jq -r '.Package' | sort | uniq -c
   1170 k8s.io/kubernetes/test/integration/controlplane
    312 k8s.io/kubernetes/test/integration/serviceaccount

Then I narrowed it down to the individual test(s):

$ cat ssa-enabled-junit_20220624-163055.stdout | jq -r '. | select(.Output | . != null and contains("Observed a panic: FieldManager must be installed to run apply"))' | jq -r '.Package + " " + .Test' | sort | uniq -c
     78 k8s.io/kubernetes/test/integration/controlplane TestAccept
     78 k8s.io/kubernetes/test/integration/controlplane TestAPIServerService
     78 k8s.io/kubernetes/test/integration/controlplane TestAppsGroupBackwardCompatibility
     78 k8s.io/kubernetes/test/integration/controlplane TestAppsPrefix
     78 k8s.io/kubernetes/test/integration/controlplane TestAutoscalingGroupBackwardCompatibility
     78 k8s.io/kubernetes/test/integration/controlplane TestAutoscalingPrefix
     78 k8s.io/kubernetes/test/integration/controlplane TestBatchPrefix
     78 k8s.io/kubernetes/test/integration/controlplane TestEmptyList
     78 k8s.io/kubernetes/test/integration/controlplane TestKubernetesService
     78 k8s.io/kubernetes/test/integration/controlplane TestObjectSizeResponses
     78 k8s.io/kubernetes/test/integration/controlplane TestServiceAlloc
     78 k8s.io/kubernetes/test/integration/controlplane TestStatus/401
     78 k8s.io/kubernetes/test/integration/controlplane TestStatus/403
     78 k8s.io/kubernetes/test/integration/controlplane TestStatus/404
     78 k8s.io/kubernetes/test/integration/controlplane TestWatchSucceedsWithoutArgs
     78 k8s.io/kubernetes/test/integration/serviceaccount TestServiceAccountAutoCreate
     78 k8s.io/kubernetes/test/integration/serviceaccount TestServiceAccountTokenAuthentication
     78 k8s.io/kubernetes/test/integration/serviceaccount TestServiceAccountTokenAutoCreate
     78 k8s.io/kubernetes/test/integration/serviceaccount TestServiceAccountTokenAutoMount

Before #110529 merged, the above tests used framework.RunAnAPIServer(config *controlplane.Config), where either config is nil or config.GenericConfig.OpenAPIConfig is nil, which results in a.group.OpenAPIModels being nil.
But a quick code search revealed that many more integration tests used framework.RunAnAPIServer(config *controlplane.Config), so why were the other tests not affected by the panic?

I inspected the log of a test that was not affected by the panic, and grepped for apf_controller.go, and found the following:

$ cat ssa-enabled-junit_20220624-163055.stdout | jq -r 'select(.Test == "TestAPIServerProcessMetrics") | select(.Output | . != null ) | .Output' | grep "apf_controller.go"
I0624 16:47:28.069669  119514 apf_controller.go:316] Starting API Priority and Fairness config controller
I0624 16:47:28.169773  119514 apf_controller.go:321] Running API Priority and Fairness config worker
I0624 16:47:29.369326  119514 apf_controller.go:325] Shutting down API Priority and Fairness config worker

In the above case, the apf controller had not advanced far enough; it was active for only about 1s. These tests may finish quickly and signal the server to shut down before the apf controller can send an APPLY request.

On the other hand, the log from a test that was affected by the panic shows that the apf controller was active for more than 5s and had a lot more activity:

$ cat ssa-enabled-junit_20220624-163055.stdout | jq -r 'select(.Test == "TestAccept") | select(.Output | . != null ) | .Output' | grep "apf_controller.go"
I0624 16:42:48.237080  116136 apf_controller.go:316] Starting API Priority and Fairness config controller
I0624 16:42:48.337227  116136 apf_controller.go:321] Running API Priority and Fairness config worker
...
I0624 16:42:48.421153  116136 apf_controller.go:630] Retaining mandatory priority level "exempt" despite lack of API object
I0624 16:42:48.438136  116136 apf_controller.go:879] Triggered API priority and fairness config reloading because priority level exempt is undesired and idle
I0624 16:42:52.559560  116136 apf_controller.go:325] Shutting down API Priority and Fairness config worker

I am assuming that integration tests have at least log level 4.

Also, the size of the log file differed significantly: around 45MB without panics, and over 200MB with panics.

#110529 and #110569 changed all these tests to use kubeapiservertesting.StartTestServerOrDie, and the integration log for #110173 now shows no panics.

@wojtek-t
Member

Great! (This also proves why eliminating a gazillion different ways of initializing things in favor of one makes sense.)
@liggitt - FYI ^^

@tkashem - let's merge the PR then - I just added one question there, so let's wait for Jordan to respond to it.

@apelisse
Member

Can we just force the openapi/fieldmanager to be set at start time now that SSA is GA and should basically always be present, rather than failing when SSA is unavailable (which is probably wrong, by the way)? If that triggers any test flakiness, we would then fix those cases. We can wait for kubernetes/kube-openapi#315 to be resolved, since it will almost certainly have a huge impact on this.

@apelisse
Member

Can we just force the openapi/fieldmanager to be set at start time now that SSA is GA and should basically always be present, rather than failing when SSA is unavailable (which is probably wrong, by the way)? If that triggers any test flakiness, we would then fix those cases. We can wait for kubernetes/kube-openapi#315 to be resolved, since it will almost certainly have a huge impact on this.

Has this been done Jeff?

@apelisse
Member

@Jefftree
