Machine Configs should support multiple roles #269
Comments
Currently we label MachineConfigs with a single role label,
and we want the SSH key to be picked up by all current roles, which means we need a label for each of those roles,
but as @kikisdeliveryservice mentioned, labels are a map, so we can't have 2 values for the same key. I suggest changing this so that we can add a MachineConfig that is used for multiple roles (a common use case).
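As a minimal sketch of the limitation (assuming the standard `machineconfiguration.openshift.io/role` label key; the config name and contents are made up), this is roughly what today's labelling looks like:

```yaml
# Today: one MachineConfig is bound to exactly one role via a single label value.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-ssh            # hypothetical name
  labels:
    # Labels are a map, so a second
    # "machineconfiguration.openshift.io/role" entry would overwrite
    # "worker" rather than add a second role such as "master".
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 2.2.0
```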
This is then symmetric with the way
correct
/assign
ping @kikisdeliveryservice Is this going to be fixed before 4.0 GA?
An alternative solution could be supporting nested MCPs (i.e. a child MCP that inherits all the MCs from its parent MCP) or inheriting from multiple MCPs.
Inheritance makes a lot of sense when talking about heterogeneous nodes. All workers are just workers and need every generic MachineConfig that applies to a worker, but some workers need different configurations (real-time, SR-IOV, DPDK, different types of tuning...), so they need specific MachineConfigs applied to them.
There's already something like this: you can create an MCP for a worker-like pool that uses the worker MCs as the base plus ad-hoc MCs for the given custom pool.
More info and steps here: #429 (comment)
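Roughly, that approach looks like the following (the pool name `infra` and the node label are chosen for illustration): a custom MachineConfigPool whose machineConfigSelector matches both the base worker MCs and the pool-specific ones, and whose nodeSelector targets nodes carrying the custom role label.

```yaml
# A custom pool that reuses the worker MCs as a base and adds its own.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      # Pick up MachineConfigs labelled with either role.
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, infra]
  nodeSelector:
    matchLabels:
      # Only nodes carrying this label join the pool.
      node-role.kubernetes.io/infra: ""
```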
I tested that approach, applying two manifests (both apiVersion: machineconfiguration.openshift.io/v1). The behaviour I see is the following:
At this point, the worker pool never finishes the update and the node is constantly rebooted (the maximum uptime I've been able to see is about 30 min). This is what I see in mcp/worker:
In the machine-config-daemon for the worker pool, I see:
@kikisdeliveryservice is working on this
I'm also verifying the bug reported above
Can you be more precise on this? After a while? After following your report, I have a perfectly fine running cluster without the behavior described.
I have a cluster running for hours now with 2 custom pools, and I cannot reproduce #269 (comment):
I was hitting that every 30 min; I was watching uptime and the node was rebooted every 30 min. The node I use is RHEL, not RHCOS, not sure if that makes a difference.
Hmm, it shouldn't make any difference AFAICT. Do you have a cluster with RHEL nodes you can hand to me so I can check?
No, sorry, the test I am doing is with a local cluster on my machines... they don't have external access.
What version of the installer/MCO are you running? Are you running libvirt? Have you reproduced this in AWS/cloud?
Also, I'd really prefer if this was discussed in a separate issue rather than this one.
So I ran into this... having worker nodes with two 'roles' blew up the MC controller. My use case: all RHCOS 4.4 RC1 nodes, 3 masters, 3 infras, 3 workers (all KVM/libvirt VMs), and 1 bare-metal node. All OSes were delivered with iPXE (using matchbox). The bare-metal RHCOS node has two Nvidia GPUs that I intend to test ML workloads with.
When going through the process of adding an MC for entitled pods (required for the Nvidia operator), the rollout gets stuck. The two roles I have on my Nvidia metal node are 'nvidia,worker'; I guess I don't really need a second 'nvidia' role. I also had two roles on my 'infra' nodes (infra,worker), and that plays hell with the MCP/MC too. That was an easy fix: I just removed the 'worker' role on those nodes and created a separate 'infra' MCP as mentioned above.
I guess this boils down to "don't have multiple roles on ANY nodes".
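For context, "two roles" here just means two node-role labels on the same node, so two pools end up matching it. A sketch of that situation (the node name is made up; the `nvidia` role comes from the comment above):

```yaml
# A node labelled with two roles; both the worker pool and a custom
# "nvidia" pool would consider it theirs, which is the conflict described.
apiVersion: v1
kind: Node
metadata:
  name: gpu-metal-0               # hypothetical name
  labels:
    node-role.kubernetes.io/worker: ""
    node-role.kubernetes.io/nvidia: ""
```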
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The current labelling scheme does not support having multiple roles for one MachineConfig. Because labels are a map, we are unable to have 2 values (roles) for the same key. Switching to a different labelling method would allow us to create one MachineConfig that can be applied to multiple roles.
For example, when creating SSH MachineConfigs we have to create one per role instead of having a single SSH MachineConfig that could be applied to all/certain roles: https://github.com/kikisdeliveryservice/machine-config-operator/blob/a46fcadaea8bc4af095e5826645f8096ab1ae5f6/pkg/controller/template/render.go#L82
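One hypothetical shape such a labelling scheme could take (purely illustrative, not an existing MCO API): one label key per role, so a single MachineConfig could be selected by several pools.

```yaml
# Hypothetical labelling only: per-role keys instead of a single
# "role" value, letting one SSH MachineConfig serve multiple pools.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-ssh-authorized-keys    # hypothetical name
  labels:
    machineconfiguration.openshift.io/role-master: ""
    machineconfiguration.openshift.io/role-worker: ""
```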
This came about due to my SSH key work and a conversation with @abhinavdahiya. Happy to work on this once the last PR is completed.