
Machine Configs should support multiple roles #269

Closed
kikisdeliveryservice opened this issue Jan 7, 2019 · 24 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Jan 7, 2019

The current labelling scheme does not support having multiple roles for one machine config. Because labels are a map, we are unable to have 2 values (roles) for the same key. Switching to a different labelling method would allow us to create 1 machineconfig that can be applied to multiple roles.
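A toy illustration (not MCO code) of the map constraint described above: Kubernetes labels are a string-to-string map, so a second write under the same key silently overwrites the first.

```python
# Labels are a plain map, so one key can hold only one value:
labels = {}
labels["machineconfiguration.openshift.io/role"] = "master"
labels["machineconfiguration.openshift.io/role"] = "worker"
# Only the last-assigned role survives.
print(labels)  # {'machineconfiguration.openshift.io/role': 'worker'}
```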

For example, when creating ssh machine configs we have to create 1 per role instead of having a master ssh machineconfig that could be applied to all/certain roles: https://github.com/kikisdeliveryservice/machine-config-operator/blob/a46fcadaea8bc4af095e5826645f8096ab1ae5f6/pkg/controller/template/render.go#L82

This came about due to my SSH key work and conversation with @abhinavdahiya. Happy to work on this once the last PR is completed.

@abhinavdahiya
Contributor

So currently we label machineconfigs with the label

machineconfiguration.openshift.io/role=<role-name>

and we want the ssh key to be picked up by all current roles, which means we need these labels:

machineconfiguration.openshift.io/role=<role-name-0>
machineconfiguration.openshift.io/role=<role-name-1>

but as @kikisdeliveryservice mentioned, labels are a map, so we can't have 2 values for the same key...
we need to switch to a different scheme for labels...

I suggest:
role.machineconfiguration.openshift.io/<role-name>=""

This will allow us to add a machineconfig that is used for multiple roles (a common use case):

role.machineconfiguration.openshift.io/<role-name-0>=""
role.machineconfiguration.openshift.io/<role-name-1>=""
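Under the proposed scheme, a single MachineConfig could carry one empty-valued label per role, and a pool could select on key existence rather than key value. A rough sketch of what that might look like (the object name and the `Exists` selector are my illustration, not something decided in this thread):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 00-ssh          # hypothetical name
  labels:
    role.machineconfiguration.openshift.io/master: ""
    role.machineconfiguration.openshift.io/worker: ""
# A pool could then match on label-key existence instead of a shared value:
#   machineConfigSelector:
#     matchExpressions:
#       - {key: role.machineconfiguration.openshift.io/worker, operator: Exists}
```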

@cgwalters
Member

This is then symmetric with the way node-role.kubernetes.io/worker label acts too right?

@abhinavdahiya
Contributor

This is then symmetric with the way node-role.kubernetes.io/worker label acts too right?

correct
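
For reference, node roles already encode the role name in the label key with an empty value, which is what lets a single node carry several roles at once (hypothetical node sketch):

```yaml
kind: Node
metadata:
  name: worker-0        # hypothetical node name
  labels:
    node-role.kubernetes.io/worker: ""
    node-role.kubernetes.io/infra: ""
```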

@kikisdeliveryservice
Contributor Author

/assign

@abhinavdahiya
Contributor

ping @kikisdeliveryservice

Is this going to be fixed before 4.0 GA?

@runcom runcom added the jira label May 11, 2019
@williamcaban
Contributor

An alternate solution could be supporting nested MCPs (i.e. a child MCP that inherits all the MCs from the parent MCP, or from multiple MCPs).

@yrobla
Contributor

yrobla commented Jul 22, 2019

Inheritance makes a lot of sense when talking about heterogeneous nodes. All workers are going to be just workers and need all the generic MachineConfigs that apply to a worker, but there can be workers that need different configurations (real-time, SR-IOV, DPDK nodes, different types of tuning...), so they need specific MachineConfigs applied to them.

@runcom
Member

runcom commented Jul 22, 2019

There's already something like this: you can create an MCP for worker-like pools that uses the worker MCs as the base plus ad-hoc MCs for the given custom pool.

@runcom
Member

runcom commented Jul 22, 2019

More info and steps here #429 (comment)

@yrobla
Contributor

yrobla commented Aug 6, 2019

I tested that approach, doing the following:

  • enrolled a worker node, labelled with just worker role
  • also labelled the node as worker-ran, so the node has 2 roles: worker and worker-ran
  • created an MCP with the following:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-ran
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker, worker-ran]}
  maxUnavailable: null
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-ran: ""
  paused: false
  • created an MC with the following content:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker-ran
  name: machine-config-worker-ran-rt
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
        - contents:
            source: data:text/plain;base64,xxx
          filesystem: root
          mode: 0777
          path: /opt/setup_rt.sh
    systemd:
      units:
        - contents: |
            [Unit]
            Description=One time service to install rt and tuned profile
            After=network-online.target
            ConditionPathExists=!/opt/rt_executed
            [Service]
            Type=oneshot
            ExecStart=/opt/setup_rt.sh
            [Install]
            WantedBy=multi-user.target
          enabled: true
          name: install_realtime.service

The behaviour I see is the following:

  • the machineconfigpool worker had been properly updated before I applied the MCP and MC, so I start from a working environment
  • as soon as I create the machineconfigpool/machineconfig for worker-ran, the content is applied correctly. The node is rebooted and I can see the expected modifications.
  • but after a while, it seems the system is trying to apply the machineconfigpool for worker again. Then I see:
    worker rendered-worker-e5227386b7abc32c08c2c946cecfdb28 False True False
    worker-ran rendered-worker-ran-2f36b6d1eb977bcf752cc0d1a901256e True False False

At this point, worker never finishes the update, and the node is being constantly rebooted (max uptime I've been able to see is over 30 min). This is what I see in mcp/worker:

  - lastTransitionTime: "2019-08-05T20:43:27Z"
    message: ""
    reason: All nodes are updating to rendered-worker-e5227386b7abc32c08c2c946cecfdb28
    status: "True"
    type: Updating

In the machine-config-daemon logs for the worker pool, I see:
I0806 08:11:12.678614 4915 daemon.go:904] Validated on-disk state
I0806 08:11:12.693262 4915 update.go:801] logger doesn't support --jounald, logging json directly
I0806 08:11:12.702897 4915 daemon.go:938] Completing pending config rendered-worker-ran-2f36b6d1eb977bcf752cc0d1a901256e
I0806 08:11:12.708129 4915 update.go:836] completed update for config rendered-worker-ran-2f36b6d1eb977bcf752cc0d1a901256e
I0806 08:11:12.718672 4915 daemon.go:944] In desired config rendered-worker-ran-2f36b6d1eb977bcf752cc0d1a901256e
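
When a node seems stuck between two rendered configs like this, comparing the MCO's per-node annotations can show whether the current and desired config disagree (these are standard MCO node annotations; the node name is a placeholder):

```shell
oc get node <node-name> -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}'
oc get node <node-name> -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}'
```

If the two values keep diverging after each reboot, the node is being claimed by two pools at once.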

@runcom
Member

runcom commented Aug 6, 2019

@kikisdeliveryservice is working on this

@runcom
Member

runcom commented Aug 8, 2019

I'm verifying the bug reported above also

@runcom
Member

runcom commented Aug 8, 2019

  • but after a while, seems that the system is trying to apply machineconfigpool for worker again. Then, i see:
    worker rendered-worker-e5227386b7abc32c08c2c946cecfdb28 False True False
    worker-ran rendered-worker-ran-2f36b6d1eb977bcf752cc0d1a901256e True False False

Can you be more precise on this? After a while? Following your report, I have a perfectly fine running cluster without the behavior described.

@runcom
Member

runcom commented Aug 8, 2019

I have a cluster running for hours now, with 2 custom pools; I cannot reproduce #269 (comment):

NAME         CONFIG                                                 UPDATED   UPDATING   DEGRADED
infra        rendered-infra-7d597e9ce0c1b032ad97285c96773085        True      False      False
master       rendered-master-2519381d008a0e622f1e0cbad71401a4       True      False      False
worker       rendered-worker-d4c9dc5b0a0dc47047ef5d259d1b8a37       True      False      False
worker-ran   rendered-worker-ran-7d597e9ce0c1b032ad97285c96773085   True      False      False

@yrobla
Contributor

yrobla commented Aug 8, 2019

I was hitting that every 30 min; I was watching uptime and the node was rebooted every 30 min. The node I use is RHEL, not RHCOS; not sure if that's a difference.

@runcom
Member

runcom commented Aug 8, 2019

use is RHEL, not RHCOS, not sure if that's a difference.

uhm, it shouldn't make any difference AFAICT. Do you have a cluster with RHEL nodes you can hand to me so I can check?

@yrobla
Contributor

yrobla commented Aug 9, 2019

No, sorry, the testing I am doing is with a local cluster on my machines... they don't have external access.

@kikisdeliveryservice
Contributor Author

kikisdeliveryservice commented Aug 9, 2019

What version of the installer/MCO are you running? Are you running libvirt? Have you reproduced this in AWS/cloud?

@kikisdeliveryservice
Contributor Author

kikisdeliveryservice commented Aug 9, 2019

Also, I would really prefer if this were discussed in a separate issue rather than this one.

@nnachefski

So I ran into this... having worker nodes with two 'roles' blew up the MC controller. My use case is:

All RHCOS 4.4 RC1 nodes: 3 masters, 3 infras, 3 workers (all KVM/libvirt VMs), and 1 bare-metal node. All OSes were delivered with iPXE (using matchbox). The bare-metal RHCOS node has two Nvidia GPUs that I intend to test ML workloads with. When going through the process of adding an MC for entitled pods (required for the Nvidia operator), the rollout gets stuck. The two roles I have on my Nvidia metal node are 'nvidia,worker'. I guess I don't really need a second 'nvidia' role. I also had two roles on my 'infra' nodes (infra,worker), which played hell with the MCP/MC too. That was an easy fix: I just removed the 'worker' role on those nodes and created a separate 'infra' MCP as mentioned above.

I guess this boils down to "don't have multiple roles on ANY nodes".
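
The one-role-per-node workaround described above can be applied by removing the extra role label from the node (a trailing `-` deletes a label; the node name is a placeholder):

```shell
# Remove the extra 'worker' role so the node only carries its custom role:
oc label node <node-name> node-role.kubernetes.io/worker-
```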

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 4, 2020
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 3, 2020
@openshift-ci-robot openshift-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 3, 2020
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot
Contributor

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
