MCO fails to apply rendered configuration to infra nodes #1270

Closed
samuelvl opened this issue Nov 15, 2019 · 9 comments

samuelvl commented Nov 15, 2019

Description

When using a custom machine config pool for infra nodes, the MCO fails to manage the rolling update when a new worker node is relabeled as infra.

Steps to reproduce the issue

  1. Create a custom machine config pool for infra nodes that also inherits the worker configuration:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/role
      operator: In
      values:
      - worker
      - infra
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""
  2. Create one or more infra nodes using a MachineSet, following the official OpenShift documentation:
oc scale machineset cluster-8kk6g-infra-eu-central-1a --replicas=1
  3. Wait until the node is created (about 4 minutes):
NAME                                            STATUS   ROLES    AGE     VERSION
ip-10-0-141-253.eu-central-1.compute.internal   Ready    worker   72s    v1.14.6+7e13ab9a7
  4. Change the default worker label to infra so the custom MCP applies:
oc label node ip-10-0-141-253.eu-central-1.compute.internal node-role.kubernetes.io/infra= node-role.kubernetes.io/worker-
  5. Check that the label is applied correctly:
NAME                                            STATUS   ROLES    AGE     VERSION
ip-10-0-141-253.eu-central-1.compute.internal   Ready    infra    4m34s   v1.14.6+7e13ab9a7
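
After relabeling, pool membership and the rendered configuration the pool targets can be checked with standard oc commands (a verification sketch; rendered configs are typically named rendered-<pool>-<hash>):

# Show the infra pool status and machine counts
oc get machineconfigpool infra
# List nodes carrying the infra role, with their labels
oc get nodes -l node-role.kubernetes.io/infra --show-labels
# Look for the rendered config generated for the infra pool
oc get machineconfig | grep rendered-infra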

Describe the results you received

The MCO cannot handle the label change because the machine-config-controller is continually restarted by a nil pointer dereference panic.

...
I1115 11:44:35.408664       1 node_controller.go:457] Pool infra: node ip-10-0-141-253.eu-central-1.compute.internal changed labels
E1115 11:44:35.408934       1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/asm_amd64.s:522
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:513
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:82
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/signal_unix.go:390
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:84
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:261
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:261
/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:604
/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:615
/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:125
/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:475
/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:116
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:202
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:552
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:265
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:265
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:548
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:546
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:390
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/asm_amd64.s:1333
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xe60bf5]

goroutine 155 [running]:
github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x108
panic(0x15aec80, 0x28dbcd0)
	/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:513 +0x1b9
github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1.(*MachineConfigPool).GetNamespace(0x0, 0x0, 0x19e98a0)
	<autogenerated>:1 +0x5
github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache.MetaNamespaceKeyFunc(0x1778ea0, 0x0, 0xc0001d8de0, 0x1528080, 0xc0008cebc0, 0x12a05f200)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:84 +0x114
github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache.DeletionHandlingMetaNamespaceKeyFunc(0x1778ea0, 0x0, 0xc0008cebc0, 0x12a05f200, 0x0, 0x0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:261 +0x6a
github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueAfter(0xc000430280, 0x0, 0x12a05f200)
	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:604 +0x41
github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueDefault(0xc000430280, 0x0)
	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:615 +0x44
github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueDefault-fm(0x0)
	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:125 +0x34
github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).updateNode(0xc000430280, 0x1794e00, 0xc000ab9080, 0x1794e00, 0xc000f40840)
	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:475 +0x84c
github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).updateNode-fm(0x1794e00, 0xc000ab9080, 0x1794e00, 0xc000f40840)
	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:116 +0x52
github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(0xc0004b24d0, 0xc0004b24e0, 0xc0004b24f0, 0x1794e00, 0xc000ab9080, 0x1794e00, 0xc000f40840)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:202 +0x5d
github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache.(*processorListener).run.func1.1(0x0, 0xc00033a300, 0xc000148550)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:552 +0x18b
github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0x0, 0xc000551e18, 0x429692, 0xc000148580)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:265 +0x51
github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:548 +0x79
github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0007aef68)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000551f68, 0xdf8475800, 0x0, 0x1582101, 0xc0004d2300)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xbe
github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc0007aef68, 0xdf8475800, 0xc0004d2300)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache.(*processorListener).run(0xc0008b8f00)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:546 +0x8d
github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache.(*processorListener).run-fm()
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:390 +0x2a
github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc000372df0, 0xc0004b2b90)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x4f
created by github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait.(*Group).Start
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:69 +0x62

Describe the results you expected:

The MCO should apply MachineConfigPool configuration to the new infra node as expected.

Additional environment details (platform, options, etc.):

  • Provider: AWS IPI
  • OpenShift: 4.2.2
  • MCO: v4.2.1-201910230818-dirty (d73d5c6)
@kikisdeliveryservice
Contributor

As per https://github.com/openshift/machine-config-operator/blob/master/docs/custom-pools.md

You are not changing the default pool but adding your own custom pool. Custom pools inherit from worker, so nodes in a custom pool should carry both the worker and infra labels (or whatever the custom name is). In step 5 your node only has the infra role, which will cause MachineConfigs not to roll out to it.
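
For reference, a labeling sequence that keeps the worker role while adding infra could look like the following (node name reused from the report above, purely as an illustration):

# Add the infra role without removing the worker role
oc label node ip-10-0-141-253.eu-central-1.compute.internal node-role.kubernetes.io/infra=
# Confirm both role labels are present
oc get node ip-10-0-141-253.eu-central-1.compute.internal --show-labels

With both roles present, the node matches the infra pool's nodeSelector while still satisfying the worker inheritance described in the custom-pools document.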

Author

samuelvl commented Nov 15, 2019

machineConfigSelector renders all machine configs that satisfy the matchExpressions condition into a single rendered configuration, and nodeSelector applies that rendered configuration to every node that matches the label criteria. Since infra nodes are matched by the nodeSelector, the MCP should apply to them.
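
For reference, the rendered config the infra pool is targeting can be inspected directly (field paths per the MachineConfigPool API, to the best of my understanding):

# Desired rendered config for the pool
oc get machineconfigpool infra -o jsonpath='{.spec.configuration.name}{"\n"}'
# Rendered config currently applied to the pool
oc get machineconfigpool infra -o jsonpath='{.status.configuration.name}{"\n"}'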

Indeed, the custom MCP works fine if it is created after the labeled nodes already exist, but not when a new node is created afterwards.

It seems the panic is not handled correctly, so the pod is restarted forever and normal operator behaviour is broken. There is a nil pointer dereference in the error trace that makes the controller stop working.

E1115 11:44:35.408934       1 runtime.go:69] Observed a panic: "invalid memory address... 

Contributor

kikisdeliveryservice commented Nov 15, 2019

Please add the infra label in addition to the worker label as shown in the documentation and report back if you hit the same issue, thanks:
https://github.com/openshift/machine-config-operator/blob/master/docs/custom-pools.md

The MCO dictates that custom pools inherit from the worker pool. Your step 4 above shows that your node is labelled incorrectly. The error logs (above) show that you hit the problem when the controller gets the pools for the node, because it expects both a custom pool and a worker pool.

@kikisdeliveryservice
Contributor

Note: will add a nil check to error out, to avoid pushing a nil pool, and to ensure that nodes in custom pools have the correct labelling (custom pool & worker).

@ChetHosey

Could you clarify -- what labels should infrastructure-only nodes have, and how should clusters be configured so that workloads aren't scheduled on them?

I'm on UPI. I first used the following to ensure that workloads are only scheduled on nodes with the worker label:

oc patch scheduler/cluster --type='json' -p='[{"op":"replace","path":"/spec/defaultNodeSelector","value":"node-role.kubernetes.io/worker="}]'

To create the infra nodes I started by booting machines with the worker ignition config. After signing the initial bootstrap CSR, I used "oc edit node" to replace the "worker" label with "infra", and then signed the machine certificate. I then moved the infra components as described in https://docs.openshift.com/container-platform/4.1/machine_management/creating-infrastructure-machinesets.html.

This worked well on 4.1 -- normal workloads ended up on normal workers, and infra workloads ran on the infra nodes. (We don't want to mix the two, as infra-only nodes don't count toward licensing.)

If infra nodes are expected to have both "infra" and "worker" labels under 4.2, then I suppose we'd have to apply a "yesthisreallyisaworker" label to "real" workers, and use that as the default node selector for scheduling.


amosbeh commented Dec 3, 2019

I am also facing a similar error when reproducing the steps.
I have put back the worker role to allow the update to succeed.
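
For reference, re-adding the worker role is just the label command again (the node name below is a placeholder):

# Restore the worker role so the node matches the worker pool again
oc label node <node-name> node-role.kubernetes.io/worker=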

Contributor

kikisdeliveryservice commented Dec 3, 2019

> I am also facing a similar error when reproducing the steps.
> I have put back the worker role to allow the update to succeed.

The infra/worker labelling is correct, and I believe the official documentation will be updated w/r/t taints to land workloads correctly.

For now the official docs do show that infra/worker is the correct setup: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/html/machine_management/creating-infrastructure-machinesets#moving-resources-to-infrastructure-machinesets
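
As a sketch of that taint-based approach (node name and taint value are illustrative only), the infra node can be tainted so ordinary workloads stay off it unless they carry a matching toleration:

# Keep untolerated workloads off the infra node
oc adm taint nodes ip-10-0-141-253.eu-central-1.compute.internal node-role.kubernetes.io/infra=reserved:NoSchedule

Infra components meant to land on those nodes would then need a toleration for that taint in addition to a node selector on node-role.kubernetes.io/infra.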

@samuelvl
Author

It has been fixed in the latest OpenShift versions; see BZ-1772490 and BZ-1772680.
