absence of openapi configuration in integration tests makes server-side apply panic (broke apf controller when it switched to SSA) #107727

Closed
tkashem opened this issue Jan 24, 2022 · 47 comments · Fixed by #110173
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@tkashem
Contributor

tkashem commented Jan 24, 2022

What happened?

Integration test timed out due to the following error

E0120 14:05:02.398256  118058 runtime.go:77] Observed a panic: FieldManager must be installed to run apply

The error appeared 748 times in the log. It looks like a lot of churn happens during the apf configuration bootstrapping phase.
Is there a race condition between when the type is registered and when the API calls are made?

Stack trace:

I0120 14:05:02.397911  118058 apf_controller.go:879] Triggered API priority and fairness config reloading because priority level exempt is undesired and idle
I0120 14:05:02.398135  118058 panic.go:1038] "HTTP" verb="APPLY" URI="/apis/flowcontrol.apiserver.k8s.io/v1beta2/flowschemas/system-node-high/status?fieldManager=api-priority-and-fairness-config-consumer-v1&force=true" latency="650.951µs" userAgent="Go-http-client/1.1" audit-ID="bc3c561e-1a80-403e-b611-63ba7bd484b9" srcIP="127.0.0.1:46572" apf_pl="exempt" apf_fs="exempt" apf_fd="" resp=0
E0120 14:05:02.398256  118058 runtime.go:77] Observed a panic: FieldManager must be installed to run apply
goroutine 136765 [running]:
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest.func1.1()
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/finisher/finisher.go:105 +0xaf
panic({0x3fc0bc0, 0x542eb00})
	/usr/local/go/src/runtime/panic.go:1038 +0x215
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers.(*applyPatcher).applyPatchToCurrentObject(0x551ceb0, {0x553da98, 0xc0887ccc00}, {0x551ceb0, 0xc0887da480})
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/patch.go:482 +0x449
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers.(*patcher).applyPatch(0xc0bf781a40, {0x553da98, 0xc0887ccc00}, {0x4c296a0, 0xc0887da480}, {0x551ceb0, 0xc0887da480})
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/patch.go:566 +0xd3
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/rest.(*defaultUpdatedObjectInfo).UpdatedObject(0x81f4878, {0x553da98, 0xc0887ccc00}, {0x551ceb0, 0xc0887da480})
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/rest/update.go:229 +0xd0
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).Update.func1({0x551ceb0, 0xc0887da480}, {0xc07278e397b5dfde, 0x8b5f27eafb})
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry/store.go:533 +0x1f9
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/etcd3.(*store).updateState(0xc03d204bd0, 0xc06ba15310, 0x42)
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/etcd3/store.go:894 +0x3e
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/etcd3.(*store).GuaranteedUpdate(0xc03d204bd0, {0x553da98, 0xc0887ccc00}, {0xc01908f2e0, 0x1d}, {0x551ceb0, 0xc0887da300}, 0x1, 0x40ef94, 0xc0848d34a0, ...)
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/etcd3/store.go:365 +0x56e
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.(*Cacher).GuaranteedUpdate(0xc0e9754c60, {0x553da98, 0xc0887ccc00}, {0xc01908f2e0, 0x1d}, {0x551ceb0, 0xc0887da300}, 0xa8, 0x4b86ae0, 0xc0848d34a0, ...)
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher/cacher.go:721 +0x1b5
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*DryRunnableStorage).GuaranteedUpdate(0x4cbc218, {0x553da98, 0xc0887ccc00}, {0xc01908f2e0, 0x4cac0b0}, {0x551ceb0, 0xc0887da300}, 0xf2, 0x1, 0xc0848d34a0, ...)
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry/dryrun.go:97 +0x1c7
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).Update(0xc0e5c5b2c0, {0x553da98, 0xc0887ccc00}, {0xc0412d2b3d, 0xb}, {0x5523328, 0xc0887cccf0}, 0xc0105369b0, 0x4ea4ac8, 0x0, ...)
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry/store.go:521 +0x508
k8s.io/kubernetes/pkg/registry/flowcontrol/flowschema/storage.(*StatusREST).Update(0xc0887ccd50, {0x553da98, 0xc0887ccc00}, {0xc0412d2b3d, 0xc086dd5bf0}, {0x5523328, 0xc0887cccf0}, 0xc0991a2748, 0x40ef94, 0x1, ...)
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/registry/flowcontrol/flowschema/storage/storage.go:93 +0x52
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers.(*patcher).patchResource.func2()
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/patch.go:665 +0xa7
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers.(*patcher).patchResource.func3()
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/patch.go:671 +0x38
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest.func1()
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/finisher/finisher.go:117 +0x8f
created by k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/handlers/finisher/finisher.go:92 +0xe5

Integration job link: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/107456/pull-kubernetes-integration/1484156493719670784
PR: #107456

I searched for this error in CI and in this PR; so far it appears only in this PR, so I am hoping it has not introduced any flakes. (We may need to let more time pass before flakes start to show up.)

What did you expect to happen?

We should not see the error FieldManager must be installed to run apply.

How can we reproduce it (as minimally and precisely as possible)?

I saw it only in the PR mentioned.

@tkashem tkashem added the kind/bug Categorizes issue or PR as related to a bug. label Jan 24, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 24, 2022
@tkashem
Contributor Author

tkashem commented Jan 24, 2022

/sig api-machinery

cc @wojtek-t @MikeSpreitzer

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 24, 2022
@MikeSpreitzer
Member

I see

	if p.fieldManager == nil {
		panic("FieldManager must be installed to run apply")
	}

@MikeSpreitzer
Member

The fact that it happens to a non-mandatory object implies that the object was received from the server, so the type must be registered in the server.

The APF config-consuming controller does this:

		applyOptions := metav1.ApplyOptions{FieldManager: cfgCtlr.asFieldManager, Force: true}

		// the condition field in fsStatusUpdate holds the new condition we want to update.
		// TODO: this will break when we have multiple conditions for a flowschema
		_, err := fsIfc.ApplyStatus(context.TODO(), toFlowSchemaApplyConfiguration(fsu), applyOptions)
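For context, here is a minimal self-contained sketch of that pattern using the generated client-go apply configurations for flowcontrol/v1beta2. The helper name and the condition values are illustrative, not the APF controller's actual code:

package apfexample

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	flowcontrolac "k8s.io/client-go/applyconfigurations/flowcontrol/v1beta2"
	"k8s.io/client-go/kubernetes"
)

// applyFlowSchemaCondition is a hypothetical helper: build an apply
// configuration that carries only the status condition this controller owns,
// then send it with server-side apply under the controller's field manager.
func applyFlowSchemaCondition(ctx context.Context, cs kubernetes.Interface, name string) error {
	fs := flowcontrolac.FlowSchema(name).
		WithStatus(flowcontrolac.FlowSchemaStatus().
			WithConditions(flowcontrolac.FlowSchemaCondition().
				WithType("Dangling").
				WithStatus("False").
				WithReason("Found").
				WithMessage("bound to a priority level")))
	_, err := cs.FlowcontrolV1beta2().FlowSchemas().ApplyStatus(ctx, fs,
		metav1.ApplyOptions{FieldManager: "api-priority-and-fairness-config-consumer-v1", Force: true})
	return err
}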

@MikeSpreitzer
Member

Look for the output from this line:

	klog.V(2).Infof("NewTestableController %q with serverConcurrencyLimit=%d, requestWaitLimit=%s, name=%s, asFieldManager=%q", cfgCtlr.name, cfgCtlr.serverConcurrencyLimit, cfgCtlr.requestWaitLimit, cfgCtlr.name, cfgCtlr.asFieldManager)

@MadhavJivrajani
Contributor

Another PR with this failure: #105483

@liggitt
Member

liggitt commented Jan 25, 2022

The panic means the request-scope FieldManager is nil.

That is initialized here:

if a.group.OpenAPIModels != nil && utilfeature.DefaultFeatureGate.Enabled(features.ServerSideApply) {
	reqScope.FieldManager, err = fieldmanager.NewDefaultFieldManager(
		a.group.TypeConverter,
		a.group.UnsafeConvertor,
		a.group.Defaulter,
		a.group.Creater,
		fqKindToRegister,
		reqScope.HubGroupVersion,
		subresource,
		resetFields,
	)
	if err != nil {
		return nil, nil, fmt.Errorf("failed to create field manager: %v", err)
	}
}

OpenAPIModels is set here:

// installAPIResources is a private method for installing the REST storage backing each api groupversionresource
func (s *GenericAPIServer) installAPIResources(apiPrefix string, apiGroupInfo *APIGroupInfo, openAPIModels openapiproto.Models) error {
	var resourceInfos []*storageversion.ResourceInfo
	for _, groupVersion := range apiGroupInfo.PrioritizedVersions {
		if len(apiGroupInfo.VersionedResourcesStorageMap[groupVersion.Version]) == 0 {
			klog.Warningf("Skipping API %v because it has no resources.", groupVersion)
			continue
		}
		apiGroupVersion := s.getAPIGroupVersion(apiGroupInfo, groupVersion, apiPrefix)
		if apiGroupInfo.OptionsExternalVersion != nil {
			apiGroupVersion.OptionsExternalVersion = apiGroupInfo.OptionsExternalVersion
		}
		apiGroupVersion.OpenAPIModels = openAPIModels

called from here:

openAPIModels, err := s.getOpenAPIModels(APIGroupPrefix, apiGroupInfos...)
if err != nil {
	return fmt.Errorf("unable to get openapi models: %v", err)
}
for _, apiGroupInfo := range apiGroupInfos {
	if err := s.installAPIResources(APIGroupPrefix, apiGroupInfo, openAPIModels); err != nil {

and openAPIModels is constructed here:

// getOpenAPIModels is a private method for getting the OpenAPI models
func (s *GenericAPIServer) getOpenAPIModels(apiPrefix string, apiGroupInfos ...*APIGroupInfo) (openapiproto.Models, error) {
	if s.openAPIConfig == nil {
		return nil, nil
	}

It looks like it is possible for openAPIConfig to be unset, which causes the OpenAPI models to be nil, which in turn causes apply to panic. I don't know whether only the integration tests are subject to that configuration.
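For illustration, a minimal sketch of what a test server's generic config needs so that getOpenAPIModels returns non-nil models and the per-request FieldManager gets installed. It reuses the same helpers kube-apiserver's own setup uses (DefaultOpenAPIConfig, NewDefinitionNamer); the wrapper function itself is hypothetical:

package apfexample

import (
	openapinamer "k8s.io/apiserver/pkg/endpoints/openapi"
	genericapiserver "k8s.io/apiserver/pkg/server"
	"k8s.io/kubernetes/pkg/api/legacyscheme"
	generatedopenapi "k8s.io/kubernetes/pkg/generated/openapi"
)

// withOpenAPI is a hypothetical helper: it populates OpenAPIConfig on a generic
// apiserver config. Without this, getOpenAPIModels returns (nil, nil), the
// request scope's FieldManager is never created, and any APPLY request panics
// with "FieldManager must be installed to run apply".
func withOpenAPI(genericConfig *genericapiserver.Config) {
	namer := openapinamer.NewDefinitionNamer(legacyscheme.Scheme)
	genericConfig.OpenAPIConfig = genericapiserver.DefaultOpenAPIConfig(generatedopenapi.GetOpenAPIDefinitions, namer)
	genericConfig.OpenAPIConfig.Info.Title = "Kubernetes"
}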

cc @apelisse @Jefftree
/priority important-soon
(since this is now being exercised in bring-up of integration test servers)

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 25, 2022
@liggitt
Member

liggitt commented Jan 25, 2022

if APF is creating/bootstrapping the objects without using apply, and then updating status using apply, that would explain how the object exists but apply fails

@liggitt
Member

liggitt commented Jan 25, 2022

I suspect that APF is not successfully updating status at all in the integration tests with this server configuration

@tkashem
Contributor Author

tkashem commented Jan 25, 2022

@liggitt do you want me to revert #107456, while we work on a solution?

@liggitt
Member

liggitt commented Jan 25, 2022

I don't know how bad the test impact is. Work with @apelisse / @Jefftree to determine if we should revert or fix forward

@MikeSpreitzer
Member

@liggitt : I do not follow why you talked about server-side stuff then suggested the problem is in the client (a controller in the kube-apiserver) not supplying a fieldManager.

The client does supply the fieldManager, for both creates and updates. See the following lines:

@liggitt
Member

liggitt commented Jan 25, 2022

@liggitt : I do not follow why you talked about server-side stuff then suggested the problem is in the client (a controller in the kube-apiserver) not supplying a fieldManager.

Different field manager. reqScope.FieldManager is set server-side in the endpoint handler based on the server configuration.

@MikeSpreitzer
Member

So if the problem is server-side, why did you suggest the problem is in the client?

@MikeSpreitzer
Member

Regarding the client, I looked in https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/107456/pull-kubernetes-integration/1484156493719670784/build-log.txt and found this line:

I0120 13:57:15.480218  117563 apf_controller.go:217] NewTestableController "Controller" with serverConcurrencyLimit=600, requestWaitLimit=15s, name=Controller, asFieldManager="api-priority-and-fairness-config-consumer-v1"

There are actually several lines of the same form, all with the same fieldManager.

@liggitt
Member

liggitt commented Jan 25, 2022

So if the problem is server-side, why did you suggest the problem is in the client?

I don't think I did suggest that :) all of the setup and calls in #107727 (comment) are server-side

@MikeSpreitzer
Member

MikeSpreitzer commented Jan 25, 2022

@liggitt : please help me understand #107727 (comment). That is about the client side. Are you saying that a client that creates an object using Create and later updates the object using Apply --- with a non-empty fieldManager in both --- will run into this server-side scenario? If so, is that because this client behavior is wrong or because there is a bug server-side?

@liggitt
Member

liggitt commented Jan 25, 2022

@liggitt : please help me understand #107727 (comment). That is about the client side.

That was in response to #107727 (comment), trying to figure out how the object was created but could not be updated.

Are you saying that a client that creates an object using Create and later updates the object using Apply --- with a non-empty fieldManager in both --- will run into this server-side scenario?

Yes, when the server is not configured to enable openapi and therefore does not have apply enabled.

If so, is that because this client behavior is wrong or because there is a bug server side?

Because the server is misconfigured (which, from what I can tell, is limited to some of the server instances started for integration tests)

@MikeSpreitzer
Member

Looking at https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/ and https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/ it is not obvious to me whether or how openapi could be disabled. Would that be by turning off the OpenAPIV3 feature flag? FYI, the inclusion of "V3" in the flag makes me unsure what the server does about other versions.

@liggitt : I think you are saying that it is possible to disable openapi and that this in turn disables server-side apply. That seems kind of surprising to me. Does this mean that it is simply incorrect to run a client that does server-side apply with a server that has openapi disabled? Is this intended? Is it documented anywhere?

@liggitt
Member

liggitt commented Jan 25, 2022

@liggitt : I think you are saying that it is possible to disable openapi and that this in turn disables server-side apply

It's not intended as an externalized configuration option. Integration tests have the ability to fiddle with many of the config struct options in the server in ways a kube-apiserver binary invocation could not. They intentionally use this to do things like inject specialized authorizers / authenticators for testing purposes, but it can also unintentionally result in a test server that doesn't match a real API server invocation.

I think this is an integration test server configuration problem, which should be fixed.

@Jefftree
Member

@liggitt is correct that we have a misconfiguration in the integration tests and SSA is only enabled on a subset of integration tests.

I'd like to understand the test failure conditions a bit better though. The Observed a panic: FieldManager must be installed to run apply message can be consistently reproduced on a test that does not target SSA (e.g. TestUnschedulableNodeDaemonDoesLaunchPod, selected from one of the test failures). However, I'm seeing the panic message while the test is still green and passes locally. It's a bit puzzling what contributes to the flakiness of the test, given that the FieldManager panic can be reproduced 100% of the time.

@liggitt
Member

liggitt commented Jan 25, 2022

The panic occurs when APF is updating status (which it just switched to do using SSA). The APF status update failure does not block the API requests the test cares about, so the test passes. I could see extra CPU usage contributing to flakiness.

@liggitt
Member

liggitt commented Mar 25, 2022

Looks like it got reintroduced in #107797; a revert is open in #109030.

@lavalamp
Member

For integration tests, maybe we can cheat and cache the OpenAPI data, so that it's not recomputed for each test?

@liggitt
Member

liggitt commented Mar 25, 2022

:-/

Maybe... but I really don't want to have non-real init paths in integration tests if at all possible.

@apelisse
Member

I wish we could go a little further in lazily initializing openapi (especially v2), so let's see what we can do here (after code-freeze probably)

@liggitt
Member

liggitt commented Mar 27, 2022

If SSA is actively using the openapi, making it lazy won't help, right?

@dims
Member

dims commented Mar 27, 2022

@liggitt does the revert #109030 mean that we can remove milestone v1.24 from this issue?

@liggitt liggitt added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Mar 27, 2022
@liggitt liggitt removed this from the v1.24 milestone Mar 27, 2022
@liggitt
Member

liggitt commented Mar 27, 2022

Yes, cleared

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 25, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 25, 2022
@MikeSpreitzer
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 25, 2022
@alexzielenski
Contributor

alexzielenski commented Jul 25, 2022

Taking a fresh look at this issue and wondering if it has been mostly fixed already.

From @liggitt's comments, my understanding is that this panic occurred due to two conditions:

  1. The kube-apiserver running as part of the integration test suite is asked to service an SSA request while OpenAPIModels is nil. The apiserver has logic to panic in this case. From the code path @liggitt commented on, OpenAPIModels is nil when the OpenAPIConfig the server is created with is nil.
  2. An APF worker (running alongside an integration test) issues an SSA request to the test apiserver. Since the apiserver is misconfigured, it panics. The panic could happen sporadically because sometimes the tests finish before the APF worker is able to issue the request.

As a brief check to see whether this might be fixed, I ran the TestUnschedulableNodeDaemonDoesLaunchPod test that @Jefftree mentioned was failing from SSA. I logged the Nodes that were created and found that they all had managed fields. I took this as an indication that SSA was now functioning properly.
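For reference, a check of that sort could look like the following sketch (a hypothetical helper, not the test's actual code), assuming a client-go clientset pointed at the test server:

package apfexample

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodesHaveManagedFields lists Nodes and reports whether every one of them has
// metadata.managedFields populated, which only happens when the server's field
// manager is installed.
func nodesHaveManagedFields(ctx context.Context, cs kubernetes.Interface) (bool, error) {
	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, node := range nodes.Items {
		if len(node.ManagedFields) == 0 {
			return false, nil
		}
	}
	return len(nodes.Items) > 0, nil
}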

Indeed, to start the test server the setup function calls kubeapiservertesting.StartTestServerOrDie. This function has a code path which unconditionally leads to the apiserver's OpenAPIConfig being set.

StartTestServer calls CreateServerChain before running the returned apiserver:

t.Logf("runtime-config=%v", completedOptions.APIEnablement.RuntimeConfig)
t.Logf("Starting kube-apiserver on port %d...", s.SecureServing.BindPort)
server, err := app.CreateServerChain(completedOptions)
if err != nil {
return result, fmt.Errorf("failed to create server chain: %v", err)
}
if instanceOptions.StorageVersionWrapFunc != nil {
server.GenericAPIServer.StorageVersionManager = instanceOptions.StorageVersionWrapFunc(server.GenericAPIServer.StorageVersionManager)
}
errCh = make(chan error)
go func(stopCh <-chan struct{}) {
defer close(errCh)
prepared, err := server.PrepareRun()
if err != nil {
errCh <- err
} else if err := prepared.Run(stopCh); err != nil {
errCh <- err
}
}(stopCh)

CreateServerChain calls CreateKubeAPIServerConfig before using the returned config to create the returned apiserver:

func CreateServerChain(completedOptions completedServerRunOptions) (*aggregatorapiserver.APIAggregator, error) {
	kubeAPIServerConfig, serviceResolver, pluginInitializer, err := CreateKubeAPIServerConfig(completedOptions)
	if err != nil {
		return nil, err
	}
	// If additional API servers are added, they should be gated.
	apiExtensionsConfig, err := createAPIExtensionsConfig(*kubeAPIServerConfig.GenericConfig, kubeAPIServerConfig.ExtraConfig.VersionedInformers, pluginInitializer, completedOptions.ServerRunOptions, completedOptions.MasterCount,
		serviceResolver, webhook.NewDefaultAuthenticationInfoResolverWrapper(kubeAPIServerConfig.ExtraConfig.ProxyTransport, kubeAPIServerConfig.GenericConfig.EgressSelector, kubeAPIServerConfig.GenericConfig.LoopbackClientConfig, kubeAPIServerConfig.GenericConfig.TracerProvider))
	if err != nil {
		return nil, err
	}
	notFoundHandler := notfoundhandler.New(kubeAPIServerConfig.GenericConfig.Serializer, genericapifilters.NoMuxAndDiscoveryIncompleteKey)
	apiExtensionsServer, err := createAPIExtensionsServer(apiExtensionsConfig, genericapiserver.NewEmptyDelegateWithCustomHandler(notFoundHandler))
	if err != nil {
		return nil, err
	}
	kubeAPIServer, err := CreateKubeAPIServer(kubeAPIServerConfig, apiExtensionsServer.GenericAPIServer)

CreateKubeAPIServerConfig uses buildGenericConfig which sets OpenAPIConfig:

func CreateKubeAPIServerConfig(s completedServerRunOptions) (
	*controlplane.Config,
	aggregatorapiserver.ServiceResolver,
	[]admission.PluginInitializer,
	error,
) {
	proxyTransport := CreateProxyTransport()
	genericConfig, versionedInformers, serviceResolver, pluginInitializers, admissionPostStartHook, storageFactory, err := buildGenericConfig(s.ServerRunOptions, proxyTransport)
	if err != nil {
		return nil, nil, nil, err
	}

	// wrap the definitions to revert any changes from disabled features
	getOpenAPIDefinitions := openapi.GetOpenAPIDefinitionsWithoutDisabledFeatures(generatedopenapi.GetOpenAPIDefinitions)
	genericConfig.OpenAPIConfig = genericapiserver.DefaultOpenAPIConfig(getOpenAPIDefinitions, openapinamer.NewDefinitionNamer(legacyscheme.Scheme, extensionsapiserver.Scheme, aggregatorscheme.Scheme))
	genericConfig.OpenAPIConfig.Info.Title = "Kubernetes"
	if utilfeature.DefaultFeatureGate.Enabled(genericfeatures.OpenAPIV3) {
		genericConfig.OpenAPIV3Config = genericapiserver.DefaultOpenAPIV3Config(getOpenAPIDefinitions, openapinamer.NewDefinitionNamer(legacyscheme.Scheme, extensionsapiserver.Scheme, aggregatorscheme.Scheme))
		genericConfig.OpenAPIV3Config.Info.Title = "Kubernetes"
	}

Thus, any test using kubeapiservertesting.StartTestServerOrDie should no longer be affected by this panic. The question then becomes: do all integration tests start their servers in this way?

It's hard to say. I found a PR (#110529) which refactors many occurrences of the old stanza:

controlPlaneConfig := framework.NewIntegrationTestControlPlaneConfig()
_, server, closeFn := framework.RunAnAPIServer(controlPlaneConfig)

Into something using the new method:

server := kubeapiservertesting.StartTestServerOrDie(t, nil, nil, framework.SharedEtcd())
defer server.TearDownFn()

But it appears RunAnAPIServer is still being used in a few integration tests.

Should these occurrences be refactored to use kubeapiservertesting.StartTestServerOrDie too?

@wojtek-t
Member

Should these occurrences be refactored to use kubeapiservertesting.StartTestServerOrDie too?

There were a ton of changes I did around that, but the last PR (which removes the old RunAnAPIServer entirely) is already out for review: #111395
@aojea - FYI

Thinking about it now, I guess I agree that the problem might have been the 10 different ways kube-apiserver was set up across different tests. I just don't know why it was causing flakes instead of failing consistently...

@liggitt
Member

liggitt commented Jul 26, 2022

There were definitely test setups that were not enabling openapi in the recent past.

Two possibilities for flakes:

  • Enabling openapi in all integration test setups was expensive and caused spurious timeouts
  • The excessive number of background panics handled and logged in integration tests when the API server tried to handle SSA requests bogged down the CI machine (I think something like 40,000 panics, each with a full stack trace, were getting logged)

@wojtek-t
Member

Enabling openapi in all integration test setups was expensive and caused spurious timeouts

I don't think that's the case. With the recent changes I made as part of #108483, I think OpenAPI is already enabled for all tests that run kube-apiserver (almost all of them), and that seems to work.

But you reminded me that back then we also had an issue where we were running many more tests simultaneously, which could have been contributing to this problem significantly.

The second reason is definitely a possible contributor too.

I guess my concern was that if some requests were panicking due to lack of OpenAPI, it means those requests aren't really needed. But that actually might make sense - the fact that we weren't able to update the status of PriorityLevels/FlowSchemas shouldn't prevent non-P&F tests from succeeding, and it did work in the P&F tests...

@liggitt
Member

liggitt commented Jul 26, 2022

I guess my concern was that if some requests were panicking due to lack of OpenAPI, it means those requests aren't really needed.

I think the panics were happening in async controllers, and no functionality blocked by those controllers intersected with the surface being tested; it just (maybe) caused logging/performance issues

@tkashem
Contributor Author

tkashem commented Sep 14, 2022

I believe #110529 has resolved the panic issue. I was able to verify this with #110173, which changes the priority & fairness controller to use server-side apply.

Before #110529 merged, I analyzed the integration log:

The integration test log file from an unrelated PR contains no trace of the panic, as expected:

$ cat junit_20220606-141751.stdout | grep 'Observed a panic: FieldManager must be installed to run apply' | wc -l
0

On the other hand, the log file from the integration job for #110173 (before #110529 merged) had plenty of these panics, even though the integration job turned green:

$ cat ssa-enabled-junit_20220624-163055.stdout | grep 'Observed a panic: FieldManager must be installed to run apply' | wc -l
1482

I wanted to check which tests produced these panics, so I did a breakdown by package:

$ cat ssa-enabled-junit_20220624-163055.stdout | jq -r '. | select(.Output | . != null and contains("Observed a panic: FieldManager must be installed to run apply"))' | jq -r '.Package' | sort | uniq -c
   1170 k8s.io/kubernetes/test/integration/controlplane
    312 k8s.io/kubernetes/test/integration/serviceaccount

Then I narrowed it down to the individual test(s):

$ cat ssa-enabled-junit_20220624-163055.stdout | jq -r '. | select(.Output | . != null and contains("Observed a panic: FieldManager must be installed to run apply"))' | jq -r '.Package + " " + .Test' | sort | uniq -c
     78 k8s.io/kubernetes/test/integration/controlplane TestAccept
     78 k8s.io/kubernetes/test/integration/controlplane TestAPIServerService
     78 k8s.io/kubernetes/test/integration/controlplane TestAppsGroupBackwardCompatibility
     78 k8s.io/kubernetes/test/integration/controlplane TestAppsPrefix
     78 k8s.io/kubernetes/test/integration/controlplane TestAutoscalingGroupBackwardCompatibility
     78 k8s.io/kubernetes/test/integration/controlplane TestAutoscalingPrefix
     78 k8s.io/kubernetes/test/integration/controlplane TestBatchPrefix
     78 k8s.io/kubernetes/test/integration/controlplane TestEmptyList
     78 k8s.io/kubernetes/test/integration/controlplane TestKubernetesService
     78 k8s.io/kubernetes/test/integration/controlplane TestObjectSizeResponses
     78 k8s.io/kubernetes/test/integration/controlplane TestServiceAlloc
     78 k8s.io/kubernetes/test/integration/controlplane TestStatus/401
     78 k8s.io/kubernetes/test/integration/controlplane TestStatus/403
     78 k8s.io/kubernetes/test/integration/controlplane TestStatus/404
     78 k8s.io/kubernetes/test/integration/controlplane TestWatchSucceedsWithoutArgs
     78 k8s.io/kubernetes/test/integration/serviceaccount TestServiceAccountAutoCreate
     78 k8s.io/kubernetes/test/integration/serviceaccount TestServiceAccountTokenAuthentication
     78 k8s.io/kubernetes/test/integration/serviceaccount TestServiceAccountTokenAutoCreate
     78 k8s.io/kubernetes/test/integration/serviceaccount TestServiceAccountTokenAutoMount

Before #110529 merged, the above tests used framework.RunAnAPIServer(config *controlplane.Config), where either config is nil or config.GenericConfig.OpenAPIConfig is nil, which results in a.group.OpenAPIModels being nil.
But a quick code search revealed that many more integration tests used framework.RunAnAPIServer(config *controlplane.Config), so why were the other tests not affected by the panic?

I inspected the log of a test that was not affected by the panic, and grepped for apf_controller.go, and found the following:

$ cat ssa-enabled-junit_20220624-163055.stdout | jq -r 'select(.Test == "TestAPIServerProcessMetrics") | select(.Output | . != null ) | .Output' | grep "apf_controller.go"
I0624 16:47:28.069669  119514 apf_controller.go:316] Starting API Priority and Fairness config controller
I0624 16:47:28.169773  119514 apf_controller.go:321] Running API Priority and Fairness config worker
I0624 16:47:29.369326  119514 apf_controller.go:325] Shutting down API Priority and Fairness config worker

In the above case, the apf controller had not advanced far enough; it was active for only about 1s. These tests may finish quickly and signal the server to shut down before the apf controller can send an APPLY request.

On the other hand, the log from a test that was affected by the panic shows that the apf controller was active for more than 5s and had a lot more activity:

$ cat ssa-enabled-junit_20220624-163055.stdout | jq -r 'select(.Test == "TestAccept") | select(.Output | . != null ) | .Output' | grep "apf_controller.go"
I0624 16:42:48.237080  116136 apf_controller.go:316] Starting API Priority and Fairness config controller
I0624 16:42:48.337227  116136 apf_controller.go:321] Running API Priority and Fairness config worker
...
I0624 16:42:48.421153  116136 apf_controller.go:630] Retaining mandatory priority level "exempt" despite lack of API object
I0624 16:42:48.438136  116136 apf_controller.go:879] Triggered API priority and fairness config reloading because priority level exempt is undesired and idle
I0624 16:42:52.559560  116136 apf_controller.go:325] Shutting down API Priority and Fairness config worker

I am assuming that integration tests have at least log level 4.

Also, the size of the log file differed significantly: around 45MB without panics, and over 200MB with panics.

#110529 and #110569 changed all these tests to use kubeapiservertesting.StartTestServerOrDie, and the integration log for #110173 now shows no panics.

@wojtek-t
Member

Great! (This also proves why eliminating a gazillion different ways of initializing things in favor of one makes sense.)
@liggitt - FYI ^^

@tkashem - let's merge the PR then - I just added one question there, so let's wait for Jordan to respond to it.

@apelisse
Member

Can we just force the openapi/fieldmanager to be set at start time now that SSA is GA and should basically always be present, rather than failing when SSA is unavailable (which is probably wrong, by the way)? If that triggers any test flakiness, we would then fix those cases. We can wait for kubernetes/kube-openapi#315 to be resolved, since it will almost certainly have a huge impact on this.

@apelisse
Member

Can we just force the openapi/fieldmanager to be set at start time now that SSA is GA and should basically always be present, rather than failing when SSA is unavailable (which is probably wrong, by the way)? If that triggers any test flakiness, we would then fix those cases. We can wait for kubernetes/kube-openapi#315 to be resolved, since it will almost certainly have a huge impact on this.

Has this been done Jeff?

@apelisse
Member

@Jefftree
