Bug 1948551: apiserver-watcher: use a lockfile. #2674
Conversation
@@ -122,6 +134,9 @@ func runRunCmd(cmd *cobra.Command, args []string) error {
	if err := handler.onFailure(); err != nil {
		glog.Infof("Failed to mark service down on signal: %s", err)
	}
	if err := handler.lockfile.Unlock(); err != nil {
I couldn't put much thought into this, sorry, but I have the feeling this can go wrong in several places. Just brainstorming here:
- we get the lock on start and fail if we can't
- we ONLY unlock when a signal is received
I don't know if there are edge cases where we can exit without unlocking, so we would never be able to start again.
If instead we always unlock on exit, a parallel pod restart will release the lock even though the original pod is still active, since it never checks again whether it still holds the lock ...
Since we're in the HostPID namespace, we'll do the right thing and remove stale locks if the holding process is dead. I think we should be safe in this regard.
ahh, smart
// Please note, that existing lockfiles containing pids of dead processes
// and lockfiles containing no pid at all are simply deleted.
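(For context, here is a minimal sketch of the stale-lock check that doc comment describes, using only the Go standard library. It is not the PR's code; the lock path and helper name are hypothetical. The idea is to read the recorded pid and probe it with signal 0, which only sees host processes because the pod shares the host PID namespace.)

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// isStale reports whether the lockfile at path points at a dead or missing process.
func isStale(path string) (bool, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return false, nil // no lockfile at all, nothing to clean up
	}
	if err != nil {
		return false, err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return true, nil // a lockfile with no usable pid is treated as stale
	}
	// Signal 0 checks whether the pid exists without delivering anything;
	// from a hostPID pod this reflects the real host process table.
	if err := syscall.Kill(pid, 0); err == syscall.ESRCH {
		return true, nil // holder is gone, the lock can safely be deleted
	}
	return false, nil
}

func main() {
	stale, err := isStale("/run/apiserver-watcher.lock") // hypothetical path
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("lock is stale:", stale)
}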
|
hey @aojea I've rebased this for 4.10. Want to take a look? There's nobody else who understands this better than you and me. |
@mfojtik you've implemented something similar for the apiserver-operator, and I remember the topic of the kubelet updating static pods was discussed, but I can't remember the conclusion. I know someone proposed that if the kubelet keeps 2 instances of the same static pod running, that should be considered a bug |
|
Right, the issue is that we want to change the namespace of the watcher. So those two pods are independent. It's not a kubelet bug, just a rename :/ |
	if err := os.MkdirAll(filepath.Dir(lockpath), 0755); err != nil {
		return nil, fmt.Errorf("could not create run directory %s: %w", filepath.Dir(lockpath), err)
	}
how was this created before?
It was being created by the iptables scripts.
|
/retest |
|
/lgtm |
|
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale |
We eventually want to rename this pod (since it's in the wrong namespace), but before we can do that, we have to handle multiple instances of the process. This is because the kubelet often handles static pod updates by creating the new one before deleting the old one. So, write a lockfile. Fortunately, we're hostPID, so we can actually detect stale lockfiles. Whew.
And do make go-deps.
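(A rough, hedged sketch of the flow described above: take the lock at startup, fail if a live instance already holds it, and release it on termination. It assumes the github.com/nightlyone/lockfile package, whose TryLock documentation matches the comment quoted in the review thread; the lock path and overall structure are illustrative, not the PR's exact code.)

package main

import (
	"log"
	"os"
	"os/signal"
	"path/filepath"
	"syscall"

	"github.com/nightlyone/lockfile"
)

func main() {
	lockpath := "/run/apiserver-watcher/lock" // hypothetical location

	// Make sure the parent directory exists, mirroring the MkdirAll hunk in the diff.
	if err := os.MkdirAll(filepath.Dir(lockpath), 0755); err != nil {
		log.Fatalf("could not create run directory %s: %v", filepath.Dir(lockpath), err)
	}

	lf, err := lockfile.New(lockpath)
	if err != nil {
		log.Fatalf("could not set up lockfile: %v", err)
	}
	// TryLock deletes stale lockfiles (dead pid, or no pid at all) before giving up,
	// which is why running in the host PID namespace makes this safe.
	if err := lf.TryLock(); err != nil {
		log.Fatalf("another instance appears to be running: %v", err)
	}

	// ... start the watcher here ...

	// Block until asked to shut down, then release the lock so the
	// replacement static pod can start.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	<-sigs

	if err := lf.Unlock(); err != nil {
		log.Printf("failed to release lockfile: %v", err)
	}
}

If the process dies without unlocking, the next start still succeeds, because TryLock treats a lockfile whose pid is dead as stale and removes it.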
|
@aojea @s-urbaniak reminded me we need to get this in. So, I've rebased and answered your question. Would you mind re-lgtm-ing? |
|
@squeed: This pull request references Bugzilla bug 1948551, which is invalid.
|
/lgtm /retest required |
|
@aojea: The
The following commands are available to trigger optional jobs:
Use
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: aojea, squeed. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. |
|
/test e2e-agnostic-upgrade |
|
/bugzilla refresh |
|
@aojea: This pull request references Bugzilla bug 1948551, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
Requesting review from QA contact: |
|
@squeed: The following tests failed, say
Full PR test history. Your PR dashboard. |
|
the alternative looks much simpler 😄 |
We eventually want to rename this pod (since it's in the wrong namespace), but before we can do that, we have to handle multiple instances of the process. This is because the kubelet often handles static pod updates by creating the new one before deleting the old one.
So, write a lockfile. Fortunately, we're hostPID, so we can actually detect stale lockfiles. Whew.
(This is so we can fix https://bugzilla.redhat.com/show_bug.cgi?id=1948551 in 4.10)
/cc @aojea