Bug 1893362: Ensure tail processes exit with parent #859

Merged: 1 commit into openshift:master from the tail-pdeathsig branch on Nov 4, 2020

cgwalters
Member

Linux has a great feature, prctl(PR_SET_PDEATHSIG), that sends a signal
to a child process when its parent dies. This strongly "lifecycle binds"
two processes together and avoids the default Unix behavior where the
child process is just reparented.

I think this will help avoid problems like
https://bugzilla.redhat.com/show_bug.cgi?id=1893362
where when the bash process exits, the tail process sticks around.
Perhaps something isn't removing the pid file? In any case when
this pod is requested to terminate, the tail processes should die too.
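
To make the mechanism concrete, here is a minimal sketch of parent-death signalling using `setpriv --pdeathsig` from util-linux, the same flag used in the test command later in this thread. The `sleep 60` stand-in, the `is_running` helper, and the delays are illustrative assumptions, not code from this PR. The flag needs util-linux 2.33 or newer, so the sketch probes for it first.

```shell
#!/bin/bash
# Sketch only: is_running, the sleep stand-in, and the delays are illustrative.

# True if PID $1 exists and is not an unreaped zombie (Linux /proc).
is_running() {
  local state=""
  [ -r "/proc/$1/stat" ] && read -r _ _ state _ < "/proc/$1/stat"
  [ -n "$state" ] && [ "$state" != "Z" ]
}

result=skipped
# Older setpriv builds lack --pdeathsig, so probe the help text first.
if setpriv --help 2>&1 | grep -q -e '--pdeathsig'; then
  # The inner bash is the parent: it starts a child that requests SIGTERM
  # on parent death, prints the child's PID, lingers briefly so setpriv
  # has time to call prctl(), then exits.
  child=$(bash -c 'setpriv --pdeathsig TERM -- sleep 60 >/dev/null 2>&1 &
                   echo $!
                   sleep 0.5')
  sleep 0.5   # let the SIGTERM sent on parent exit take effect
  if is_running "$child"; then result=survived; else result=terminated; fi
fi
echo "child was: $result"
```

Dropping `--pdeathsig TERM` from the inner command should leave the `sleep` running after its parent exits, which is the reparenting behavior described above.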

@cgwalters
Member Author

(Not tested)

@cgwalters
Member Author

e.g. coreos/coreos-assembler@2118af4

@cgwalters cgwalters changed the title Ensure tail processes exit with parent Bug 1893362: Ensure tail processes exit with parent Nov 2, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Nov 2, 2020
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Bugzilla bug 1893362, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1893362: Ensure tail processes exit with parent

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Member Author

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Nov 2, 2020
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Bugzilla bug 1893362, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh

@aojea
Contributor

aojea commented Nov 3, 2020

/retest

@cgwalters
Member Author

cgwalters commented Nov 3, 2020

Hm actually the new problem may be the exec tail -F path. Looking at that.

EDIT: yep it's the other one.

@cgwalters
Member Author

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 3, 2020
@cgwalters
Member Author

OK updated.
/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 3, 2020
@cgwalters cgwalters force-pushed the tail-pdeathsig branch 2 times, most recently from a721c06 to 9c33133, on November 3, 2020 14:30
@cgwalters
Member Author

cgwalters commented Nov 3, 2020

What I've been doing to test is:

$ oc run tailtest --image=registry.fedoraproject.org/fedora:33 --restart=Never -- bash -c 'function quit {
                                                  echo got SIGTERM
                                                  exit 0
                                                  }
                                                  trap quit SIGTERM
                                                  setpriv --pdeathsig TERM -- tail -f /etc/hosts & wait; echo done'
$ oc delete pod/tailtest

Terminates right away. But if I remove the setpriv or the trap it hangs.

So while I didn't test this complete change in a cluster, I am pretty confident this should fix it.

@cgwalters
Member Author

Why is this bash code duplicated in openshift-sdn and ovn-kubernetes btw?

@dcbw
Member

dcbw commented Nov 3, 2020

> Why is this bash code duplicated in openshift-sdn and ovn-kubernetes btw?

@cgwalters because we don't have a bash library script that we add to all the containers that we can call?

@cgwalters
Member Author

> @cgwalters because we don't have a bash library script that we add to all the containers that we can call?

OK, that seems not too hard to fix but we can do that later.

You might say there's a long tail of problems here.

@cgwalters
Member Author

cgwalters commented Nov 3, 2020

Argh, util-linux in RHEL8 is too old to have setpriv --pdeathsig. That's rather annoying.

I switched to using tail --pid=$BASHPID, which basically works though it's less elegant.
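
For reference, a minimal sketch of what `tail --pid` buys here, assuming GNU coreutils `tail`. A throwaway `sleep` stands in for the bash process and `/dev/null` for the OVS logs; both are illustrative, as is the shortened `-s 0.1` check interval.

```shell
#!/bin/bash
# Stand-in for the parent shell: a process we can kill on demand.
sleep 60 &
parent=$!

# GNU tail exits on its own once the PID given to --pid is gone.
# -s 0.1 shortens the check interval so the demo finishes quickly.
tail --pid="$parent" -s 0.1 -f /dev/null &
tailpid=$!

kill "$parent"
wait "$parent" 2>/dev/null   # reap it; an unreaped zombie still counts as alive
wait "$tailpid"              # returns once tail notices and exits
tail_status=$?
echo "tail exited with status $tail_status"
```

Unlike the parent-death signal, this relies on `tail` periodically checking the PID, so the exit is not immediate.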

We were proxying SIGTERM in the "ovs in container" path, we need
to do the same with the ovs-on-host path.

I think this will help avoid problems like
https://bugzilla.redhat.com/show_bug.cgi?id=1893362
@cgwalters
Member Author

/test e2e-agnostic-upgrade
/test e2e-azure-ovn
Flakes

@cgwalters
Member Author

This one is passing CI, can I get the approves and lgtms and 🎉 emoji etc.?

@aojea
Contributor

aojea commented Nov 3, 2020

/lgtm
impressive lesson on linux signals and processes 😄

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 3, 2020
@cgwalters
Member Author

Also needs a /approve, thanks!

 # Don't need to worry about restoring flows; this can only change if we've rebooted
-exec tail -F /host/var/log/openvswitch/ovs-vswitchd.log /host/var/log/openvswitch/ovsdb-server.log
+tail --pid=$BASHPID -F /host/var/log/openvswitch/ovs-vswitchd.log /host/var/log/openvswitch/ovsdb-server.log &
Contributor

Question: why does exec tail ... not exit when it gets a SIGTERM? Shouldn't that always exit? Bash is gone...

Member Author

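In Linux (possibly Unix in general), pid 1 ignores any signal it has not installed a handler for, and that includes SIGTERM. After exec, tail takes over bash's pid, which here is pid 1, and it never installs a SIGTERM handler, so the signal is dropped. For basically bad reasons this is still the default with pid namespaces; it clearly would have been better to change it but we can't now.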

https://vagga.readthedocs.io/en/latest/pid1mode.html
https://hynek.me/articles/docker-signals/
etc.
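
A common way around this, and what the test command earlier in this thread does, is to keep bash as the parent, trap SIGTERM there, and pass it on to the child. A minimal sketch of that pattern follows. It runs the wrapper as an ordinary child process rather than as pid 1, and the `sleep 60` child, the `quit` body, and the one-second delay are illustrative assumptions.

```shell
#!/bin/bash
# Run the wrapper in a child bash so it can be signalled from outside.
bash -c '
  quit() {
    echo "got SIGTERM"
    kill -TERM "$child" 2>/dev/null   # pass the signal on to the child
    exit 0
  }
  trap quit TERM
  sleep 60 &          # stand-in for the long-running tail
  child=$!
  wait "$child"       # wait is interruptible, so the trap runs promptly
' &
wrapper=$!

sleep 1               # give the wrapper time to install its trap
kill -TERM "$wrapper"
wait "$wrapper"
wrapper_status=$?
echo "wrapper exited with status $wrapper_status"
```

Run as an ordinary process without the trap, the wrapper would simply be killed by the signal's default action. Run as pid 1 without the trap, it would not react at all, which matches the hang described in the test above.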

Contributor

OHHHHHHHH. Wow.

@squeed
Contributor

squeed commented Nov 4, 2020

/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, cgwalters, squeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 4, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@cgwalters
Member Author

vsphere looks like provisioning failures.
upgrade...hmmm, looks fairly red across the board.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 4720ead into openshift:master Nov 4, 2020
@openshift-ci-robot
Contributor

@cgwalters: All pull requests linked via external trackers have merged:

Bugzilla bug 1893362 has been moved to the MODIFIED state.

In response to this:

Bug 1893362: Ensure tail processes exit with parent

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
7 participants