Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1878602: Support for stall daemon. #152

Merged
merged 1 commit into from Sep 14, 2020

Conversation

jmencak
Copy link
Contributor

@jmencak jmencak commented Aug 27, 2020

This PR adds support for the stalld program to prevent the starvation of
operating system threads in a Linux system.

Changes:

  • support for [service] Tuned plugin
  • support the installation of stalld binary and systemd units on hosts fully-managed by the Machine Config Operator (MCO).

Example usage:

$ oc label node <node-name> node-role.kubernetes.io/worker-rt=

$ oc create -f- <<EOF
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-realtime
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift realtime profile
      include=openshift-node,realtime
      [service]
      service.stalld=start,enable
    name: openshift-realtime

  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-rt"
    priority: 30
    profile: openshift-realtime
EOF

$ oc create -f examples/realtime-mcp.yaml

A host-supplied configuration file can be used to override the stalld systemd unit. This file can then be referred to by prefixing the absolute path to the file on the host by the /host prefix. For example:

      [service]
      service.stalld=start,enable,file:/host/etc/tuned/openshift-realtime/stalld.conf

Note that the file needs to be created on the host by other means, for example, by using a MachineConfig. One of the uses of the drop-in file is to supply environment variables to the stalld unit. For example, the drop-in file may contain:

[Service]
Environment=CLIST=
Environment=AGGR=
Environment=BP="-p 1000000000"
Environment=BR="-r 10000"
Environment=BD="-d 3"
Environment=THRESH="-t 60"
Environment=LOGGING="--log_syslog --log_kmsg"
Environment=FG=--foreground
Environment=PF="--pidfile /run/stalld.pid"

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 27, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 27, 2020
@jmencak
Copy link
Contributor Author

jmencak commented Aug 27, 2020

/cc @MarSik
/cc @yanirq

Copy link

@kpouget kpouget left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't test the code, but what I read/understood in the patch seem sensible and consistent, thanks!
I only made a minor style question.

I didn't read in detail the Python code (the service plugin patch), is it new code or is it coming from elsewhere?

pkg/operator/controller.go Outdated Show resolved Hide resolved
@jmencak
Copy link
Contributor Author

jmencak commented Aug 28, 2020

The "Operand cleanup, using long option for stalld." got rid of the limitation in the description. Removing. This is a Friday night commit -- it seems to work but more testing and cleanup needed.

@yanirq
Copy link
Contributor

yanirq commented Aug 31, 2020

Will tests be added in this PR ? (should be)

@jmencak
Copy link
Contributor Author

jmencak commented Aug 31, 2020

Will tests be added in this PR ? (should be)

As time allows once all issues are resolved. I still see two issues that need to be addressed.

Edit: only 1 known issue left: redhat-performance/tuned#293 (comment)

@jmencak
Copy link
Contributor Author

jmencak commented Sep 2, 2020

/test e2e-aws-upgrade

@yanirq
Copy link
Contributor

yanirq commented Sep 8, 2020

I would add the PR's description to NTO's readme section (or open a new sub folder doc)

@jmencak jmencak changed the title WiP: Support for stall daemon. Support for stall daemon. Sep 10, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2020
@jmencak
Copy link
Contributor Author

jmencak commented Sep 10, 2020

@yanirq cannot reproduce the "degraded" MCP state issues I was having last week after several successful tests. There were some changes in MCO which hopefully fixed those. I believe I've addressed all your concerns? Could you please have another look through the code and indicate whether there's something missing? Thanks!

@yanirq
Copy link
Contributor

yanirq commented Sep 13, 2020

/retest

@yanirq
Copy link
Contributor

yanirq commented Sep 13, 2020

/lgtm
/hold
looks good overall (unless im missing something) if you feel comfortable release the hold
I do think we need tests embedded (they usually come with the PR) but because of urgency we can merge then in a very soon to follow PR

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 13, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 13, 2020
@jmencak
Copy link
Contributor Author

jmencak commented Sep 14, 2020

/lgtm
/hold
looks good overall (unless im missing something) if you feel comfortable release the hold
I do think we need tests embedded (they usually come with the PR) but because of urgency we can merge then in a very soon to follow PR

Thanks @yanirq . Which tests do you have in mind? There is an embedded e2e tests for this new functionality.

@jmencak
Copy link
Contributor Author

jmencak commented Sep 14, 2020

@kpouget / @dagrayvid would you be able to provide feedback for this PR? This is close to being complete. Thank you!

@jmencak jmencak changed the title Support for stall daemon. Bug 1878602: Support for stall daemon. Sep 14, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Sep 14, 2020
@openshift-ci-robot
Copy link
Contributor

@jmencak: This pull request references Bugzilla bug 1878602, which is invalid:

  • expected the bug to target the "4.6.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1878602: Support for stall daemon.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Sep 14, 2020
@jmencak
Copy link
Contributor Author

jmencak commented Sep 14, 2020

/bugzilla refresh

@openshift-ci-robot
Copy link
Contributor

@jmencak: This pull request references Bugzilla bug 1878602, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. labels Sep 14, 2020
@jmencak
Copy link
Contributor Author

jmencak commented Sep 14, 2020

Reiterating
/hold
from Yanir as the commit messages need to be squashed before merge. We also need another review from PSAP team.

Copy link

@kpouget kpouget left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, except a possible race condition. Please have a look and let me know if you agree, or if I missed something in the synchronization!

assets/tuned/patches/070-service-plugin.diff Outdated Show resolved Hide resolved
pkg/tuned/host_payload.go Outdated Show resolved Hide resolved
pkg/tuned/tuned.go Outdated Show resolved Hide resolved
@openshift-ci-robot
Copy link
Contributor

@jmencak: This pull request references Bugzilla bug 1878602, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1878602: Support for stall daemon.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

This PR adds support for the stalld program to prevent the starvation of
operating system threads in a Linux system.

Changes:
  - support for [service] Tuned plugin
  - support the installation of stalld binary and systemd units on hosts
    fully-managed by the Machine Config Operator (MCO).
@jmencak
Copy link
Contributor Author

jmencak commented Sep 14, 2020

/retest

@jmencak
Copy link
Contributor Author

jmencak commented Sep 14, 2020

/hold cancel
I believe I've addressed reviewer's comments. Commits squashed. If this still looks good, @yanirq feel free to lgtm.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2020
@yanirq
Copy link
Contributor

yanirq commented Sep 14, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 14, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmencak, yanirq

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit a395118 into openshift:master Sep 14, 2020
@openshift-ci-robot
Copy link
Contributor

@jmencak: All pull requests linked via external trackers have merged:

Bugzilla bug 1878602 has been moved to the MODIFIED state.

In response to this:

Bug 1878602: Support for stall daemon.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jmencak jmencak deleted the 4.6-stalld branch September 14, 2020 18:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants