Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1948551: apiserver-watcher: use a lockfile. #2674

Closed
wants to merge 2 commits into from
Closed

Bug 1948551: apiserver-watcher: use a lockfile. #2674

wants to merge 2 commits into from

Conversation

squeed
Copy link
Contributor

@squeed squeed commented Jul 12, 2021

We eventually want to rename this pod (since it's in the wrong namespace), but before we can do that, we have to handle multiple instances of the process. This is because the kubelet often handles static pod updates by creating the new one before deleting the old one.

So, write a lockfile. Fortunately, we're hostPID, so we can actually detect stale lockfiles. Whew.

(This is so we can fix https://bugzilla.redhat.com/show_bug.cgi?id=1948551 in 4.10)

/cc @aojea

@openshift-ci openshift-ci bot requested a review from aojea July 12, 2021 11:31
@@ -122,6 +134,9 @@ func runRunCmd(cmd *cobra.Command, args []string) error {
if err := handler.onFailure(); err != nil {
glog.Infof("Failed to mark service down on signal: %s", err)
}
if err := handler.lockfile.Unlock(); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't put much though on this sorry, but I have the feeling this can go wrong in several places, just brainstorming here:

  1. we get the lock on start and fail if we can't
  2. we ONLY unlock on signal received

I don't know if there are some edge case we can exit without unlocking so we never will be able to start.

In case we always unlock on exit, a parallel pod restart will release the lock, despite the original pod is still active, since it never checks again if he has the lock ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're in HostPID namespace, we'll do the right thing and remove stale locks if the holding process is dead. I think we should be safe in this regard.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, smart

// Please note, that existing lockfiles containing pids of dead processes
// and lockfiles containing no pid at all are simply deleted.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 25, 2021
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 8, 2021
@squeed
Copy link
Contributor Author

squeed commented Sep 8, 2021

hey @aojea I've rebased this for 4.10. Want to take a look? There's nobody else who understands this better than you and me.

@aojea
Copy link
Contributor

aojea commented Sep 8, 2021

This is because the kubelet often handles static pod updates by creating the new one before deleting the old one.

@mfojtik you've implemented something similar for the apiserver-operator and I remember the topic of kubelet updating static pod was discussed, but I can't remember the conclusion , I know that someone proposed that if the kubelet keeps 2 instance of the same static pod that should be considered a bug

@squeed
Copy link
Contributor Author

squeed commented Sep 8, 2021

Right, the issue is that we want to change the namespace of the watcher. So those two pods are independent. It's not a kubelet bug, just a rename :/

Comment on lines +168 to +170
if err := os.MkdirAll(filepath.Dir(lockpath), 0755); err != nil {
return nil, fmt.Errorf("could not create run directory %s: %w", filepath.Dir(lockpath), err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how was this created before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was being created by the iptables scripts.

@aojea
Copy link
Contributor

aojea commented Sep 8, 2021

/retest
lgtm
one question about why we didnt have to create the folder before

@aojea
Copy link
Contributor

aojea commented Sep 9, 2021

/lgtm
/hold
holding just for this question https://github.com/openshift/machine-config-operator/pull/2674/files#r704800469 , the tests are passing , so feel free to unhold if needed

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 9, 2021
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 9, 2021
@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 17, 2021
@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 21, 2022
We eventually want to rename this pod (since it's in the wrong
namespace), but before we can do that, we have to handle multiple
instances of the process. This is because the kubelet often handles
static pod updates by creating the new one before deleting the old one.

So, write a lockfile. Fortunately, we're hostPID, so we can actually
detect stale lockfiles. Whew.
@openshift-ci openshift-ci bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Apr 20, 2022
@squeed
Copy link
Contributor Author

squeed commented Apr 20, 2022

@aojea @s-urbaniak reminded me we need to get this in. So, I've rebased and answered your question. Would you mind re-lgtm-ing?

@kikisdeliveryservice kikisdeliveryservice changed the title apiserver-watcher: use a lockfile. Bugzilla1948551: apiserver-watcher: use a lockfile. Apr 20, 2022
@kikisdeliveryservice kikisdeliveryservice changed the title Bugzilla1948551: apiserver-watcher: use a lockfile. Bug 1948551: apiserver-watcher: use a lockfile. Apr 20, 2022
@openshift-ci openshift-ci bot added bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Apr 20, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 20, 2022

@squeed: This pull request references Bugzilla bug 1948551, which is invalid:

  • expected the bug to target the "4.11.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1948551: apiserver-watcher: use a lockfile.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kikisdeliveryservice kikisdeliveryservice removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 21, 2022
@aojea
Copy link
Contributor

aojea commented Apr 22, 2022

/lgtm
/approve
/hold cancel
Sorry for missing it 😞

/retest required

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 22, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 22, 2022

@aojea: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test 4.11-upgrade-from-stable-4.10-images
  • /test cluster-bootimages
  • /test e2e-agnostic-upgrade
  • /test e2e-aws
  • /test e2e-gcp-op
  • /test images
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test 4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade
  • /test bootstrap-unit
  • /test e2e-aws-disruptive
  • /test e2e-aws-ovn-workers-rhel8
  • /test e2e-aws-proxy
  • /test e2e-aws-serial
  • /test e2e-aws-single-node
  • /test e2e-aws-upgrade
  • /test e2e-aws-upgrade-single-node
  • /test e2e-aws-workers-rhel8
  • /test e2e-azure
  • /test e2e-azure-ovn-upgrade
  • /test e2e-azure-upgrade
  • /test e2e-gcp-op-single-node
  • /test e2e-gcp-single-node
  • /test e2e-gcp-upgrade
  • /test e2e-metal-assisted
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-openstack
  • /test e2e-openstack-parallel
  • /test e2e-ovirt
  • /test e2e-ovirt-upgrade
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere
  • /test e2e-vsphere-upgrade
  • /test e2e-vsphere-upi
  • /test okd-e2e-aws
  • /test okd-e2e-gcp-op
  • /test okd-e2e-upgrade
  • /test okd-e2e-vsphere
  • /test okd-images

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-machine-config-operator-master-bootstrap-unit
  • pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-aws
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/lgtm
/approve
/hold cancel
Sorry for missing it 😞

/retest required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 22, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aojea, squeed
To complete the pull request process, please assign yuqi-zhang after the PR has been reviewed.
You can assign the PR to them by writing /assign @yuqi-zhang in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aojea
Copy link
Contributor

aojea commented Apr 22, 2022

/test e2e-agnostic-upgrade

@aojea
Copy link
Contributor

aojea commented Apr 22, 2022

/bugzilla refresh

@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Apr 22, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 22, 2022

@aojea: This pull request references Bugzilla bug 1948551, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @xingxingxia

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from xingxingxia April 22, 2022 07:39
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 22, 2022

@squeed: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-workers-rhel8 f326b5044cac7690e0a52037c29bdee4f9c892ef link /test e2e-aws-workers-rhel8
ci/prow/e2e-ovn-step-registry f326b5044cac7690e0a52037c29bdee4f9c892ef link /test e2e-ovn-step-registry
ci/prow/e2e-aws-workers-rhel7 f326b5044cac7690e0a52037c29bdee4f9c892ef link /test e2e-aws-workers-rhel7
ci/prow/e2e-aws-techpreview-featuregate f326b5044cac7690e0a52037c29bdee4f9c892ef link /test e2e-aws-techpreview-featuregate
ci/prow/okd-e2e-aws f326b5044cac7690e0a52037c29bdee4f9c892ef link /test okd-e2e-aws
ci/prow/e2e-metal-ipi f326b5044cac7690e0a52037c29bdee4f9c892ef link /test e2e-metal-ipi
ci/prow/e2e-aws-disruptive f326b5044cac7690e0a52037c29bdee4f9c892ef link /test e2e-aws-disruptive
ci/prow/e2e-gcp-op-single-node f326b5044cac7690e0a52037c29bdee4f9c892ef link /test e2e-gcp-op-single-node
ci/prow/4.12-upgrade-from-stable-4.11-images f326b5044cac7690e0a52037c29bdee4f9c892ef link true /test 4.12-upgrade-from-stable-4.11-images
ci/prow/e2e-gcp-op ea2e595 link true /test e2e-gcp-op
ci/prow/e2e-agnostic-upgrade ea2e595 link true /test e2e-agnostic-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kikisdeliveryservice
Copy link
Contributor

There is an alternative PR #3106

Adding a hold to both and you can let me know which you want to go with.

/hold
/assign @deads2k @squeed @aojea

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 22, 2022
@aojea
Copy link
Contributor

aojea commented Apr 26, 2022

the alternative looks much simpler 😄

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants