machine-api is failing on 4.1.0 rc.3 #537

Closed
4 tasks done
stbenjam opened this issue May 15, 2019 · 12 comments

@stbenjam
Member

stbenjam commented May 15, 2019

machine-api fails to come up on the latest rebase PRs:

openshift-machine-api                                   machine-api-controllers-54586c8cb9-759kz                          2/3     CrashLoopBackOff   8          19m
# oc logs machine-api-controllers-54586c8cb9-759kz -n openshift-machine-api -c controller-manager

2019/05/15 15:39:21 Registering Components.

2019/05/15 15:39:21 no matches for kind "MachineDeployment" in version "machine.openshift.io/v1beta1"
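
For anyone triaging similar logs: this error means the API server is not serving the named kind in the named group/version. Below is a minimal sketch for pulling those two pieces out of such a line. The helper name `parse_no_match` and the regex are hypothetical, inferred only from the single log line quoted above.

```python
import re

# Hypothetical pattern, inferred from the one log line in this issue.
# The trailing quote is optional so that a truncated line still parses.
NO_MATCH_RE = re.compile(
    r'no matches for kind "(?P<kind>[^"]+)" in version "(?P<group_version>[^"\s]+)"?'
)


def parse_no_match(line):
    """Return the kind, API group, and version named in the error, or None."""
    match = NO_MATCH_RE.search(line)
    if match is None:
        return None
    # "machine.openshift.io/v1beta1" -> group "machine.openshift.io", version "v1beta1"
    group, _, version = match.group("group_version").partition("/")
    return {"kind": match.group("kind"), "group": group, "version": version}


if __name__ == "__main__":
    line = ('2019/05/15 15:39:21 no matches for kind "MachineDeployment" '
            'in version "machine.openshift.io/v1beta1"')
    print(parse_no_match(line))
```

The result tells you which resource definition to look for on the cluster before digging into the controller itself.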

TODO checklist (CAPBM is cluster-api-provider-baremetal)

@russellb russellb self-assigned this May 15, 2019
@russellb
Member

I'm on it.

@russellb
Member

Here's the deal ...

cluster-api-provider-baremetal is integrated into OpenShift, but since we've been using a pinned release, we haven't tested it with the latest version of OpenShift until now. It appears it went stale and needs updates. I'm working on that now.

This means that it will remain broken until we get a fix into a newer release of OpenShift. To use this OpenShift release, we'll have to hack up dev-scripts to run a custom image of cluster-api-provider-baremetal. To do that, we'll first have to tell CVO to stop managing machine-api-operator. It's far from ideal. There are some hints about this here: https://github.com/openshift-metal3/dev-scripts/blob/master/docs/custom-mao-and-capbm.md

I'm going to fix cluster-api-provider-baremetal first, and then I can help with how to layer it on a release with a broken cluster-api-provider-baremetal ...

@markmc
Contributor

markmc commented May 15, 2019

This means that it will remain broken until we get a fix into a newer release of OpenShift. To use this OpenShift release, we'll have to hack up dev-scripts to run a custom image of cluster-api-provider-baremetal.

Well, we have the ability to produce our own release payloads - we could totally add a custom image for this as a temporary workaround

i.e. we can build on #401

@russellb
Member

This means that it will remain broken until we get a fix into a newer release of OpenShift. To use this OpenShift release, we'll have to hack up dev-scripts to run a custom image of cluster-api-provider-baremetal.

Well, we have the ability to produce our own release payloads - we could totally add a custom image for this as a temporary workaround

i.e. we can build on #401

Oh, awesome. That sounds much nicer. I'll follow up once I have a fixed version available.

@russellb
Member

First PR is up. openshift/cluster-api-provider-baremetal#23

I'll start a checklist at the top of this issue.

@russellb
Member

I'm going to continue tracking the changes needed in openshift in this issue, but to unblock testing, I pushed a fixed image here: quay.io/openshift-metal3/baremetal-machine-controllers. So, if we build a new release image based on 4.1.0-rc3 that replaces the baremetal machine controllers image with that one, we should get past this problem.

@markmc
Contributor

markmc commented May 16, 2019

Ok, I've built registry.svc.ci.openshift.org/kni/release:4.1.0-rc.3-kni.0 with this baremetal-machine-controllers image

Details on the build here: https://gist.github.com/markmc/f8a78d7cea7252a9e0f29dadcfaa1253

Haven't tested yet
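
For readers who don't want to open the gist, here is a rough sketch of how such an override could be expressed. It only assembles a command line and runs nothing. The helper `release_new_argv` is hypothetical, and the `oc adm release new` flags used (`--from-release`, `--to-image`, and `component=pullspec` arguments) are my assumption of the interface, so check them against `oc adm release new --help` and the gist before relying on them. The base release pullspec is deliberately left as a parameter because it is not stated in this thread.

```python
def release_new_argv(from_release, to_image, overrides):
    """Build an argv list for a release payload with some component images replaced.

    Assumption: `oc adm release new` accepts --from-release, --to-image and
    positional `component=pullspec` overrides. Verify before use.
    """
    argv = [
        "oc", "adm", "release", "new",
        f"--from-release={from_release}",
        f"--to-image={to_image}",
    ]
    # Sort the overrides so the generated command is stable between runs.
    for component, pullspec in sorted(overrides.items()):
        argv.append(f"{component}={pullspec}")
    return argv


if __name__ == "__main__":
    argv = release_new_argv(
        from_release="<4.1.0-rc.3 release pullspec>",  # placeholder, not from this thread
        to_image="registry.svc.ci.openshift.org/kni/release:4.1.0-rc.3-kni.0",
        overrides={
            "baremetal-machine-controllers":
                "quay.io/openshift-metal3/baremetal-machine-controllers",
        },
    )
    print(" ".join(argv))
```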

@russellb
Member

Our workaround is in place in a custom release, and all of the appropriate fixes have landed in openshift/cluster-api-provider-baremetal, so I'm closing this as resolved.

@russellb russellb reopened this May 16, 2019
@russellb
Member

russellb commented May 16, 2019

I'm looking at a cluster deployed with the new cluster-api provider, and the worker deployment still isn't working, but for a new reason.

I0516 15:25:08.244896       1 controller.go:129] Reconciling Machine "ostest-worker-0-f6gtp"
E0516 15:25:08.245035       1 controller.go:133] "ostest-worker-0-f6gtp" machine validation failed: spec.labels: Invalid value: map[string]string{"sigs.k8s.io/cluster-api-machineset":"ostest-worker-0", "sigs.k8s.io/cluster-api-cluster":"ostest", "sigs.k8s.io/cluster-api-machine-role":"worker", "sigs.k8s.io/cluster-api-machine-type":"worker"}: missing machine.openshift.io/cluster-api-cluster label.
{"level":"error","ts":1558020308.2450933,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"machine-controller","request":"openshift-machine-api/ostest-worker-0-f6gtp","error":"\"ostest-worker-0-f6gtp\" machine validation failed: spec.labels: Invalid value: map[string]string{\"sigs.k8s.io/cluster-api-machineset\":\"ostest-worker-0\", \"sigs.k8s.io/cluster-api-cluster\":\"ostest\", \"sigs.k8s.io/cluster-api-machine-role\":\"worker\", \"sigs.k8s.io/cluster-api-machine-type\":\"worker\"}: missing machine.openshift.io/cluster-api-cluster label.","stacktrace":"github.com/openshift/cluster-api-provider-baremetal/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/cluster-api-provider-baremetal/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/openshift/cluster-api-provider-baremetal/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/cluster-api-provider-baremetal/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/openshift/cluster-api-provider-baremetal/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/cluster-api-provider-baremetal/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/openshift/cluster-api-provider-baremetal/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/cluster-api-provider-baremetal/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/openshift/cluster-api-provider-baremetal/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/cluster-api-provider-baremetal/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/openshift/cluster-api-provider-baremetal/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/cluster-api-provider-baremetal/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

I need to sort out what's causing that. Because of that error, our cluster-api provider is not claiming a BareMetalHost for a Machine, so a worker will not get deployed.
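
To make the failure concrete, here is a minimal sketch of the kind of check the log implies: the Machine carries only `sigs.k8s.io/...` labels, while validation wants `machine.openshift.io/cluster-api-cluster`. This is not the actual openshift/cluster-api code. The function name and error wording are hypothetical, and only the label keys and values are taken from the log above.

```python
# Label key named in the validation error above.
REQUIRED_LABEL = "machine.openshift.io/cluster-api-cluster"


def validate_machine_labels(labels):
    """Return a list of validation errors for a Machine's label map.

    Hypothetical stand-in for the real validation; it checks only the single
    label that the log reports as missing.
    """
    errors = []
    if not labels.get(REQUIRED_LABEL):
        errors.append(f"missing {REQUIRED_LABEL} label")
    return errors


if __name__ == "__main__":
    # Labels exactly as they appear in the log for ostest-worker-0-f6gtp.
    labels = {
        "sigs.k8s.io/cluster-api-machineset": "ostest-worker-0",
        "sigs.k8s.io/cluster-api-cluster": "ostest",
        "sigs.k8s.io/cluster-api-machine-role": "worker",
        "sigs.k8s.io/cluster-api-machine-type": "worker",
    }
    print(validate_machine_labels(labels))
```

With the labels from the log, the check reports the missing key even though a cluster label is present under the old prefix.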

@russellb
Member

To fix this last problem, we need the following change in kni-installer for the baremetal platform: openshift/installer@be7bf8e

@russellb
Member

To fix this last problem, we need the following change in kni-installer for the baremetal platform: openshift/installer@be7bf8e

I'll send a kni-installer PR for this

russellb added a commit to russellb/kni-installer that referenced this issue May 16, 2019
openshift/installer changed the Machine label names in commit
be7bf8e.  The newest
openshift/cluster-api code validates Machine objects against this
updated label as well, causing the failure seen here:

openshift-metal3/dev-scripts#537 (comment)
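
As an illustration of the rename described in that commit message, here is a small sketch that rewrites the old label keys to the new prefix. It assumes every `sigs.k8s.io/cluster-api-*` key maps to the same suffix under `machine.openshift.io/`. This thread confirms that only for the `cluster-api-cluster` label, so treat the rest as an assumption and check the referenced installer commit. The helper name `migrate_labels` is hypothetical.

```python
OLD_PREFIX = "sigs.k8s.io/"
NEW_PREFIX = "machine.openshift.io/"


def migrate_labels(labels):
    """Return a copy of `labels` with old cluster-api keys moved to the new prefix.

    Assumption: all `sigs.k8s.io/cluster-api-*` keys keep their suffix and only
    the prefix changes. Keys that do not match are copied through unchanged.
    """
    migrated = {}
    for key, value in labels.items():
        if key.startswith(OLD_PREFIX + "cluster-api-"):
            key = NEW_PREFIX + key[len(OLD_PREFIX):]
        migrated[key] = value
    return migrated


if __name__ == "__main__":
    old = {
        "sigs.k8s.io/cluster-api-cluster": "ostest",
        "sigs.k8s.io/cluster-api-machine-role": "worker",
    }
    print(migrate_labels(old))
```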
@russellb
Member

Fix here: openshift-metal3/kni-installer#80

markmc pushed a commit to openshift-metal3/kni-installer that referenced this issue May 20, 2019
stbenjam pushed a commit to stbenjam/kni-installer that referenced this issue May 20, 2019
stbenjam pushed a commit to openshift-metal3/kni-installer that referenced this issue May 29, 2019
stbenjam pushed a commit to stbenjam/kni-installer that referenced this issue May 29, 2019
stbenjam pushed a commit to stbenjam/kni-installer that referenced this issue Jun 3, 2019
stbenjam pushed a commit to stbenjam/kni-installer that referenced this issue Jun 18, 2019
stbenjam pushed a commit to stbenjam/kni-installer that referenced this issue Jun 19, 2019
stbenjam pushed a commit to stbenjam/kni-installer that referenced this issue Jun 21, 2019
stbenjam pushed a commit to stbenjam/kni-installer that referenced this issue Jul 1, 2019
stbenjam pushed a commit to stbenjam/kni-installer that referenced this issue Jul 3, 2019
markmc pushed a commit to markmc/kni-installer that referenced this issue Jul 8, 2019
markmc pushed a commit to openshift-metal3/kni-installer that referenced this issue Jul 8, 2019
hardys pushed a commit to hardys/kni-installer that referenced this issue Jul 12, 2019