Revert "Ensure there is one running static pod with the same full name" #107734
Revert "Ensure there is one running static pod with the same full name" #107734
Conversation
/kind regression
I think the revert makes sense. I noticed, when I was trying to add coverage for #107695, that we lack coverage for this: kubernetes/pkg/kubelet/pod_workers.go, lines 572 to 576 at 5426da8.
So @rphillips tried to add in the missing pod fullname, but I'm not 100% sure we have a reproducer that can catch that. I started adding a case:
diff --git a/pkg/kubelet/pod_workers_test.go b/pkg/kubelet/pod_workers_test.go
index 4028c06c292..fc5f975acfd 100644
--- a/pkg/kubelet/pod_workers_test.go
+++ b/pkg/kubelet/pod_workers_test.go
@@ -164,6 +164,21 @@ func newStaticPod(uid, name string) *v1.Pod {
}
}
+func newStaticPodWithPhase(uid, name string, phase v1.PodPhase) *v1.Pod {
+ return &v1.Pod{
+ ObjectMeta: metav1.ObjectMeta{
+ UID: types.UID(uid),
+ Name: name,
+ Annotations: map[string]string{
+ kubetypes.ConfigSourceAnnotationKey: kubetypes.FileSource,
+ },
+ },
+ Status: v1.PodStatus{
+ Phase: phase,
+ },
+ }
+}
+
// syncPodRecord is a record of a sync pod call
type syncPodRecord struct {
name string
@@ -856,6 +871,24 @@ func Test_allowPodStart(t *testing.T) {
},
allowed: true,
},
+ {
+ desc: "static pod if the static pod is terminated and rebooting",
+ pod: newStaticPodWithPhase("uid-0", "foo", v1.PodSucceeded),
+ podSyncStatuses: map[types.UID]*podSyncStatus{
+ "uid-0": {
+ fullname: "foo_",
+ },
+ },
+ // waitingToStartStaticPodsByFullname: map[string][]types.UID{
+ // "foo_": {
+ // types.UID("uid-0"),
+ // },
+ // },
+ startedStaticPodsByFullname: map[string]types.UID{
+ "foo_": types.UID("uid-0"),
+ },
+ allowed: true,
+ },
}
 	for _, tc := range testCases {
But then I realized that we don't have anything mocking the worker podCache, which will be necessary to give proper coverage, and so we probably have a whole matrix of missing test cases. Hence, I think it might be easiest to revert for now.
/lgtm
/release-note-edit
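For context on the missing coverage mentioned above, here is a minimal sketch of what a test-controllable pod status cache could look like. The type and method names are illustrative assumptions, not the kubelet's actual cache API in pkg/kubelet/container.

```go
// Hedged sketch: a stand-in for the pod status cache a unit test would need to
// control. The interface and field names here are assumptions for illustration.
package podworkerssketch

import (
	"sync"
	"time"
)

// podStatus is a placeholder for the runtime status the pod worker consults.
type podStatus struct {
	UID        string
	Terminated bool
	ObservedAt time.Time
}

// fakePodCache lets a test dictate exactly what status the worker observes for
// a given UID, so races like "old static pod still terminating while the new
// one tries to start" can be reproduced deterministically.
type fakePodCache struct {
	mu       sync.Mutex
	statuses map[string]podStatus
}

func newFakePodCache() *fakePodCache {
	return &fakePodCache{statuses: map[string]podStatus{}}
}

// Set records the status a test wants the worker to see for a pod UID.
func (c *fakePodCache) Set(uid string, s podStatus) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.statuses[uid] = s
}

// Get returns the recorded status; the zero value means "nothing observed yet".
func (c *fakePodCache) Get(uid string) podStatus {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.statuses[uid]
}
```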
@ehashman: /release-note-edit must be used with a single release note block.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Found a reproducer here and made a fix for it. Maybe the fix could be an alternative.
I believe the reproducer is to delete and recreate the static pod multiple times, such that we hit a race between teardown and setup. We are seeing that sometimes the new static pod never comes up, such as in the linked run above. @rphillips has confirmed through some local testing that a revert fixes the issue we're seeing, whereas other patches still seem to run into it (e.g. #107854 (comment)).
Are we sure that this patch fixes the bug in question? We had a lot of other patches go in from Clayton in this release, so it's possible that the bug this was intended to fix was already fixed elsewhere.
I think right now the best path forward is to revert this, add test coverage that demonstrates the problem, and then add a patch that fixes the problem. The original PR here only added unit tests and did not add an e2e loop with static pods.
/lgtm
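To make the reproducer described above concrete, here is a hedged sketch of the delete-and-recreate loop, assuming a node whose kubelet watches a static pod manifest directory. The manifest path, image, and timings are illustrative assumptions, not taken from the linked run.

```go
// Hedged sketch of the reproducer: repeatedly delete and recreate a static pod
// manifest so the kubelet races teardown of the old instance against startup
// of the new one. Adjust manifestPath to the node's --pod-manifest-path.
package main

import (
	"fmt"
	"os"
	"time"
)

const manifestPath = "/etc/kubernetes/manifests/static-test.yaml" // assumed path

const manifest = `apiVersion: v1
kind: Pod
metadata:
  name: static-test
spec:
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
`

func main() {
	for i := 0; i < 20; i++ {
		// Remove the manifest so the kubelet begins terminating the static pod.
		_ = os.Remove(manifestPath)
		// Recreate it before termination finishes; this is the window where the
		// new pod could get blocked on the old fullname and never start.
		time.Sleep(500 * time.Millisecond)
		if err := os.WriteFile(manifestPath, []byte(manifest), 0644); err != nil {
			fmt.Println("write failed:", err)
			return
		}
		time.Sleep(2 * time.Second)
	}
}
```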
/hold cancel
+1 with @ehashman ... we need to revisit the original PR and get more testing around it.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mrunalp, rphillips
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/test pull-kubernetes-e2e-gce-ubuntu-containerd
/hold
So one serious concern here is that the fix was addressing a demonstrable hard break in static pods. If we've regressed elsewhere, reverting just moves the problem around (it doesn't result in fewer problems). I'd prefer to spend a small amount of time and find a fix, while at the same time improving the test coverage based on our new failure mode.
I believe the demonstrable break was your PR #104847, which fixed #104648. I think there may be some mixed communication around bugs and patches which caused this patch to land; it is not clear to me what this is actually fixing, and it's definitely causing a regression.
/test pull-kubernetes-e2e-gce-ubuntu-containerd
+1, the original PR clearly regresses static pods, and the issue behind its reasoning may have already been fixed by the follow-up pod lifecycle PRs we landed last year.
#104743 (comment) and #104743 (comment) summarize why that fix is necessary - we don't want people to use static UIDs with static pods, and we want the "obvious correct behavior". All static pod updates look like force deletion on the apiserver to the kubelet - the old version disappears, and the new version is held on the kubelet until the old version completes. Therefore, reverting #104743 would allow static pods without static UIDs to be reentrant, which is surprising.
The attached fix looks correct, and after reviewing the test it also seems to reproduce the problem described. I had some additional feedback on the issue, but I don't think a revert is appropriate until we determine the fix has further issues or there is another issue buried underneath it.
@smarterclayton Prior to #104743 there was logic to prevent static pods with the same name from being started. The name is derived from the [namespace]_[pod name] tuple. This check would prevent re-entrancy even without a static UID, since the block is derived just from the tuple and not from the UID.
I guess that the logic has not been enough to prevent the new static pod from being reentrant, and #104743 was intended to guarantee the graceful termination of a static pod. I can understand that this bug is more urgent to fix. However, the graceful termination of a static pod that #104743 wanted to guarantee should still be fixed someday, unless we decide not to support it.
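As an illustration of the pre-#104743 logic being discussed, here is a minimal sketch of keying static pods by a full name derived from the name/namespace tuple and blocking a second start while that full name is still owned by another UID. The function and map names are assumptions, not the kubelet's exact code.

```go
// Hedged sketch: block a static pod from starting while another UID with the
// same full name is still running. Names are illustrative only.
package main

import "fmt"

// podFullName mirrors the name_namespace key the kubelet uses for static pods.
func podFullName(name, namespace string) string {
	return name + "_" + namespace
}

func main() {
	// startedByFullName tracks which UID currently "owns" each full name.
	startedByFullName := map[string]string{}

	start := func(uid, name, namespace string) {
		fn := podFullName(name, namespace)
		if owner, ok := startedByFullName[fn]; ok && owner != uid {
			// A different UID with the same full name is still running:
			// hold the new pod until the old one finishes terminating.
			fmt.Printf("blocking %s: full name %q already owned by %s\n", uid, fn, owner)
			return
		}
		startedByFullName[fn] = uid
		fmt.Printf("starting %s as %q\n", uid, fn)
	}

	start("uid-0", "foo", "kube-system")
	start("uid-1", "foo", "kube-system") // same full name, different UID: blocked
}
```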
 	}, 19*time.Second, 200*time.Millisecond).Should(gomega.BeNil())
 	})

 	ginkgo.It("mirror pod termination should satisfy grace period when static pod is updated [NodeConformance]", func() {
I guess the e2e test of this PR demonstrates the failed case, or you can reproduce it using the original issue's description in #97722 (comment).
@rphillips
@rphillips One handles the recreation of static pods having the same UID with the same fullname (technically one pod).
@gjkim42 After reviewing the original PR, I noticed this change in logic: completeTerminating does not clean up startedStaticPodsByFullname, whereas it previously cleaned up terminatingStaticPodFullnames. I tested a build with similar cleanup logic and it does seem to help the issue.
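A minimal sketch of the cleanup idea described above: dropping the started-by-fullname entry once termination completes, so a recreated static pod with the same full name is allowed to start. This is an illustration under assumed names, not the actual pod_workers.go change.

```go
// Hedged sketch of clearing the fullname bookkeeping when a static pod
// finishes terminating. Struct, field, and method names are assumptions.
package main

import "fmt"

type podWorkers struct {
	startedStaticPodsByFullname map[string]string // fullname -> UID
}

// completeTerminating is called once a pod's containers are fully stopped.
func (p *podWorkers) completeTerminating(uid, fullname string) {
	// Only clear the entry if this UID is still the recorded owner; a newer
	// pod with the same fullname may already have claimed it.
	if owner, ok := p.startedStaticPodsByFullname[fullname]; ok && owner == uid {
		delete(p.startedStaticPodsByFullname, fullname)
	}
	fmt.Printf("terminated %s, remaining: %v\n", uid, p.startedStaticPodsByFullname)
}

func main() {
	p := &podWorkers{startedStaticPodsByFullname: map[string]string{"foo_": "uid-0"}}
	p.completeTerminating("uid-0", "foo_")
	// A recreated static pod "foo_" with a new UID would now be allowed to start.
}
```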
@rphillips |
closing in favor of #107900 |
Reverts #104743 #107695
Issue: #107733