absence of openapi configuration in integration tests makes server-side apply panic (broke apf controller when it switched to SSA) #107727
/sig api-machinery
I see
The fact that it happens to a non-mandatory object implies that the object was received from the server, so the type must be registered in the server. The APF config-consuming controller does this:
Look for the output from this line:
Another PR with this failure: #105483
The panic means the request scope's FieldManager is nil; it is initialized here: kubernetes/staging/src/k8s.io/apiserver/pkg/endpoints/installer.go Lines 607 to 621 in e5ae3f8
OpenAPIModels is set here: kubernetes/staging/src/k8s.io/apiserver/pkg/server/genericapiserver.go Lines 533 to 546 in e5ae3f8
called from here: kubernetes/staging/src/k8s.io/apiserver/pkg/server/genericapiserver.go Lines 610 to 616 in e5ae3f8
and openAPIModels is constructed here: kubernetes/staging/src/k8s.io/apiserver/pkg/server/genericapiserver.go Lines 703 to 707 in e5ae3f8
It looks like it is possible for the OpenAPI config to be unset, which causes the OpenAPI models to be nil, which in turn causes apply to panic. I don't know whether only the integration tests are subject to that configuration. cc @apelisse @Jefftree
If APF is creating/bootstrapping the objects without using apply, and then updating status using apply, that would explain how the object exists but apply fails.
I suspect that APF is not successfully updating status at all in the integration tests with this server configuration.
@liggitt: I do not follow why you talked about server-side stuff and then suggested the problem is in the client (a controller in the kube-apiserver) not supplying a fieldManager. The client does supply the fieldManager, for both creates and updates. See the following lines:
Different field manager.
So if the problem is server-side, why did you suggest the problem is in the client?
Regarding the client, I looked in https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/107456/pull-kubernetes-integration/1484156493719670784/build-log.txt and found this line:
Actually there are several of the same form, all with the same fieldManager.
I don't think I did suggest that :) All of the setup and calls in #107727 (comment) are server-side.
@liggitt: please help me understand #107727 (comment). That is about the client side. Are you saying that a client that creates an object using …
That was in response to #107727 (comment), trying to figure out how the object was created but could not be updated.
Yes, when the server is not configured to enable openapi and therefore does not have apply enabled.
Because the server is misconfigured (which, from what I can tell, is limited to some of the server instances started for integration tests).
Looking at https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/ and https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/ it is not obvious to me whether or how openapi could be disabled. Would that be by turning off the …

@liggitt: I think you are saying that it is possible to disable openapi and that this in turn disables server-side apply. That seems kind of surprising to me. Does this mean that it is simply incorrect to run a client that does server-side apply against a server that has openapi disabled? Is this intended? Is it documented anywhere?
It's not intended as an externalized configuration option. Integration tests have the ability to fiddle with many of the config struct options in the server that a kube-apiserver binary invocation could not. They intentionally use this to do things like injecting specialized authorizers/authenticators for testing purposes, but it can also unintentionally result in a test server that doesn't match a real API server invocation. I think this is an integration test server configuration problem, which should be fixed.
@liggitt is correct that we have a misconfiguration in the integration tests and SSA is only enabled on a subset of integration tests. I'd like to understand the test failure conditions a bit better though. The …
The panic occurs when APF is updating status (which it just switched to do using SSA). The APF status update failure does not block the API requests the test cares about, so the test passes. I could see extra CPU usage contributing to flakiness. |
For integration tests, maybe we can cheat and cache the OpenAPI data, so that it's not recomputed for each test? |
:-/ maybe... but I really don't want to have non-real init paths in integration if at all possible |
I wish we could go a little further in lazily initializing openapi (especially v2), so let's see what we can do here (after code-freeze probably) |
if SSA is actively using the openapi, it being lazy won't help, right? |
Yes, cleared |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
/remove-lifecycle rotten
Taking a fresh look at this issue and wondering if it has been mostly fixed already. From @liggitt's comments my understanding is that this panic occurred due to two conditions:
As a brief check to see if this might be fixed, I ran the test …. Indeed, to start the test server the following code paths are exercised:
kubernetes/cmd/kube-apiserver/app/testing/testserver.go Lines 212 to 231 in 6f70677
kubernetes/cmd/kube-apiserver/app/server.go Lines 176 to 195 in 0527a0d
kubernetes/cmd/kube-apiserver/app/server.go Lines 237 to 248 in 0527a0d
kubernetes/cmd/kube-apiserver/app/server.go Lines 388 to 396 in 0527a0d
Thus, any test using …

It's hard to say. I found a PR (#110529) which refactors many occurrences of the old stanza:

```go
controlPlaneConfig := framework.NewIntegrationTestControlPlaneConfig()
_, server, closeFn := framework.RunAnAPIServer(controlPlaneConfig)
```

into something using the new method:

```go
server := kubeapiservertesting.StartTestServerOrDie(t, nil, nil, framework.SharedEtcd())
defer server.TearDownFn()
```

But it appears …
Should these occurrences be refactored to use …?
There were a ton of changes I did around that. But the last PR (already removing the old …) … Thinking about it now, I guess I agree that the problem might have been the 10 different ways kube-apiserver was set up through different tests. I just don't know why it was causing flakes rather than failing consistently...
There were definitely test setups that were not enabling openapi in the recent past. Two possibilities for flakes:
I don't think that's the case. With the recent changes that I was doing as part of #108483, I think OpenAPI is already enabled for all tests having kube-apiserver (almost all), and that seems to work. But you reminded me that back then we were also running many more tests simultaneously, which could have been contributing to this problem significantly. The second reason is definitely a possible contributor too. I guess my concern was that if some requests were panicking due to lack of OpenAPI, this means that these aren't really needed. But that actually might make sense: the fact that we weren't able to update the status of PriorityLevels/FlowSchemas shouldn't prevent non-P&F tests from succeeding, and that worked in P&F tests...
I think the panics were happening in async controllers, and no functionality blocked by the controllers intersected with the surface being tested, it just (maybe) caused log/perf issues |
I believe #110529 has resolved the panic issue; I was able to verify with #110173, which changes the priority & fairness controller to use server-side apply. Before #110529 merged, I analyzed the integration logs. The integration test log file from an unrelated PR does not contain any trace of the panic, as expected:
On the other hand, the log file from the integration job for #110173 (before #110529 merged) had plenty of these panics even though the integration job turned green:
I wanted to check which tests produced these panics, so I did a breakdown by package:
Then I narrowed it down to the individual test(s):
Before #110529 merged, the above tests used …. I inspected the log of a test that was not affected by the panic, and grepped for …:
In the above case, the apf controller had not advanced far enough; the controller was active for about …. On the other hand, the log from a test that was affected by the panic shows that the apf controller was active for more than …:
I am assuming that integration tests run with at least log level 4. Also, the size of the log file differed significantly; without panics it was around …. #110529 and #110569 changed all these tests to use …
Can we just force the openapi/fieldmanager to be set at start-time? Now that SSA is GA it should basically always be present, rather than failing when SSA is unavailable (that's probably wrong by the way). If that triggers any test flakiness, then we would surface and solve those cases. We can wait for kubernetes/kube-openapi#315 to be solved, since it will almost certainly have a huge impact on this.
Has this been done, Jeff?
What happened?
Integration test timed out due to the following error:
The error appeared 748 times in the log. It looks like a lot of churn happens during the APF configuration bootstrapping phase.
Is there a race condition between when the type is being registered and the API calls?
Stack trace:
Integration job link: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/107456/pull-kubernetes-integration/1484156493719670784
PR: #107456
I searched for this error in CI and in this PR; it looks like we only see this error in this PR so far, so I am hoping it has not introduced any flakes. (Maybe we need to let more time pass before we start seeing flakes.)
What did you expect to happen?
We should not see this error
FieldManager must be installed to run apply
How can we reproduce it (as minimally and precisely as possible)?
I saw it only in the PR mentioned.