Certs not getting auto approved, stuck in pending state #260

Closed · 11 of 13 tasks
russellb opened this issue Apr 1, 2019 · 30 comments

Comments

russellb commented Apr 1, 2019

This issue looks related: openshift/cluster-machine-approver#15

russellb commented Apr 1, 2019

My machine-approver log has a bunch of:

I0321 18:09:39.113975       1 main.go:97] CSR csr-grgft added
I0321 18:09:39.133134       1 main.go:123] CSR csr-grgft not authorized: No target machine
I0321 18:14:32.453059       1 main.go:97] CSR csr-f66gv added
I0321 18:14:32.464451       1 main.go:123] CSR csr-f66gv not authorized: No target machine
I0321 18:18:11.088381       1 main.go:97] CSR csr-jmf46 added
I0321 18:18:11.102598       1 main.go:123] CSR csr-jmf46 not authorized: No target machine
I0321 18:22:27.112564       1 main.go:97] CSR csr-zxjkn added
I0321 18:22:27.127977       1 main.go:123] CSR csr-zxjkn not authorized: No target machine
I0321 18:27:20.454085       1 main.go:97] CSR csr-f85k9 added
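
(While this gets sorted out, the stuck CSRs can be listed and approved by hand as a stopgap; the CSR name below is just an example from the log above.)

$ oc get csr                             # stuck ones show CONDITION "Pending"
$ oc adm certificate approve csr-grgft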

The code of interest is authorizeCSR() in here:

https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L91

It's failing right now because it can't find a Machine whose NodeRef name matches the requestor. The request includes something like machine-0, and that does match the Node names in our env.

The issue appears to be that NodeRef isn't getting set on our Machines, and that is because we don't have our actuator / machine controller running. That's what would be setting NodeRef.
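
A quick way to confirm that (assuming the usual openshift-machine-api namespace) is to dump each Machine's nodeRef, which comes back empty in this state:

$ oc get machines -n openshift-machine-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeRef.name}{"\n"}{end}'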

If I'm reading all this right, we need to complete integration of our cluster-api provider to get this to work properly.

openshift/machine-api-operator#235

russellb commented Apr 1, 2019

Right now NodeRef will never get set because our Nodes do not have the machine.openshift.io/machine annotation set on the Node.

That is supposed to be set by the nodelink-controller here, which is run by the machine-api-operator.

https://github.com/openshift/machine-api-operator/blob/master/cmd/nodelink-controller/main.go#L330

In my current cluster, the machine-api-operator is not running the clusterapi-manager-controllers pod, which would include the nodelink controller. This is due to other errors during machine-api-operator startup.
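
For reference, whether that annotation is present, and whether the controllers pod is running at all, can be checked with something like:

$ oc get nodes -o yaml | grep machine.openshift.io/machine
$ oc get pods -n openshift-machine-api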

russellb commented Apr 1, 2019

machine-api-operator has the following error:

E0401 19:16:52.677137       1 operator.go:177] Failed getting operator config: no platform provider found on install config

which implies that it's seeing the provider type as blank in the Infrastructure config object.

https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/config.go#L48

https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/operator.go#L192-L200

However, I see it set to BareMetal when I look at it manually:

$ oc get infrastructures.config.openshift.io --all-namespaces -oyaml                                                                                                     
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Infrastructure
  metadata:
    creationTimestamp: 2019-04-01T17:15:56Z
    generation: 1
    name: cluster
    resourceVersion: "396"
    selfLink: /apis/config.openshift.io/v1/infrastructures/cluster
    uid: cfc641b8-54a1-11e9-ba61-52fdfc072182
  spec: {}
  status:
    apiServerURL: https://api.ostest.test.metalkube.org:6443
    etcdDiscoveryDomain: ostest.test.metalkube.org
    platform: BareMetal
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Now I'm trying to understand why the machine-api-operator isn't picking that up successfully ...
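
For a quick check, the exact value the operator is supposed to read can be pulled directly:

$ oc get infrastructures.config.openshift.io cluster -o jsonpath='{.status.platform}'
BareMetal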

russellb commented Apr 1, 2019

I started running the machine-api-operator locally and got different behavior. It was able to get the provider type successfully from the Infrastructure config object, but was falling back to the no-op provider due to the following problem:

openshift/machine-api-operator#268

russellb commented Apr 1, 2019

Next problem: the clusterapi-manager-controllers pod is running, but it's in CrashLoopBackOff.

Here's the error:

$ oc logs -n openshift-machine-api pod/clusterapi-manager-controllers-58f6f9769f-xckj7 -c controller-manager
flag provided but not defined: -logtostderr
Usage of ./manager:
  -kubeconfig string
        Paths to a kubeconfig. Only required if out-of-cluster.
  -master string
        The address of the Kubernetes API server. Overrides any value in kubeconfig. Only required if out-of-cluster.

This is the same problem we fixed in metal3-io/cluster-api-provider-baremetal#39, except this time it's the manager binary that we build from ./vendor/github.com/openshift/cluster-api/cmd/manager.

Indeed, if I build and run this locally from openshift/cluster-api-provider-baremetal, I get the same behavior.

$ make
go build -o bin/machine-controller-manager ./cmd/manager
go build -o bin/manager ./vendor/github.com/openshift/cluster-api/cmd/manager

$ bin/manager -logtostderr=true
flag provided but not defined: -logtostderr
Usage of bin/manager:
  -kubeconfig string
    	Paths to a kubeconfig. Only required if out-of-cluster.
  -master string
    	The address of the Kubernetes API server. Overrides any value in kubeconfig. Only required if out-of-cluster.

russellb commented Apr 1, 2019

It seems this flags issue was fixed in openshift/cluster-api-provider-aws by directly modifying the vendored copy of openshift/cluster-api, rather than upstreaming that fix into openshift/cluster-api.

openshift/cluster-api-provider-aws@50732ee#diff-6ae4328a95448a2a20bcb23ee01dca50

Next we need this fix applied to openshift/cluster-api, and then to update our vendored copy of it in openshift/cluster-api-provider-baremetal.

russellb commented Apr 1, 2019

also for reference, openshift/cluster-api-provider-libvirt got the fix in a similar way, by directly modifying the vendored openshift/cluster-api in this commit:

openshift/cluster-api-provider-libvirt@07a9858#diff-6ae4328a95448a2a20bcb23ee01dca50

russellb commented Apr 2, 2019

Patch proposed to fix this directly in openshift/cluster-api-provider-baremetal here: openshift/cluster-api-provider-baremetal#9

It turns out the vendored cluster-api copies were not modified directly after all; those providers switched to a different cluster-api branch, and we needed to switch our provider to it as well.

russellb commented Apr 2, 2019

The above patch to openshift/cluster-api-provider-baremetal merged, but I needed a way to run a custom build of the actuator. I documented how I'm doing that manually here: #271

The next error is in the container that's actually running our actuator.

$ oc logs -n openshift-machine-api clusterapi-manager-controllers-5cdf45d7f9-n9z4b -c machine-controller

...
E0402 20:22:42.603016       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metalkube.org is forbidden: User "system:serviceaccount:openshift-machine-api:default" cannot list resource "baremetalhosts" in API group "metalkube.org" at the cluster scope
E0402 20:22:43.605226       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metalkube.org is forbidden: User "system:serviceaccount:openshift-machine-api:default" cannot list resource "baremetalhosts" in API group "metalkube.org" at the cluster scope
E0402 20:22:44.606683       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metalkube.org is forbidden: User "system:serviceaccount:openshift-machine-api:default" cannot list resource "baremetalhosts" in API group "metalkube.org" at the cluster scope
...
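
For the record, the missing permission boils down to RBAC along these lines; this is only an illustrative sketch (the ClusterRole name is made up), since the proper fix belongs in the operator-managed manifests:

$ cat <<'EOF' | oc apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: baremetalhost-reader            # illustrative name only
rules:
- apiGroups: ["metalkube.org"]
  resources: ["baremetalhosts"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: baremetalhost-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: baremetalhost-reader
subjects:
- kind: ServiceAccount
  name: default                         # the service account named in the error above
  namespace: openshift-machine-api
EOF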

russellb commented Apr 2, 2019

The RBAC error above is fixed by the following PR: openshift/machine-api-operator#271

With that in place, the baremetal machine controller is running successfully. \o/

Two things:

  1. There are enough fixes in place that we should be able to proceed with actuator driven worker node deployments.

  2. The original cert issue is not yet resolved, but we've gotten enough fixed that we're back to working on the Machine and Node resource linkage necessary to resolve the cert issue. The lack of that linkage is causing failures in: cluster-machine-approver pod, plus the nodelink-controller and machine-healthcheck containers of the clusterapi-machine-controllers pod.

zaneb added a commit to zaneb/dev-scripts that referenced this issue Apr 9, 2019
Workaround for openshift-metal3#260. Because life is too short for broken certs.

Signed-off-by: Zane Bitter <zbitter@redhat.com>

@russellb

We now have Ironic introspection data available on a BareMetalHost.

I put up a PR to make the nodelink-controller deal with the fact that we may have multiple internal IPs listed for a bare metal Machine object: openshift/machine-api-operator#314
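
To see what the nodelink-controller has to match on, the addresses recorded on each Machine can be dumped with:

$ oc get machines -n openshift-machine-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.addresses}{"\n"}{end}'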

@bcrochet

Actuator update to populate the IPs from the BMH: metal3-io/cluster-api-provider-baremetal#24

russellb commented Jul 4, 2019

We are much closer here. Much of the plumbing to get addresses on Machines is done. We still lack addresses on BareMetalHosts that represent masters, as those don’t go through introspection driven by the baremetal operator. A bug needs to be opened for this.

hardys commented Jul 5, 2019

> We are much closer here. Much of the plumbing to get addresses on Machines is done. We still lack addresses on BareMetalHosts that represent masters, as those don’t go through introspection driven by the baremetal operator. A bug needs to be opened for this.

I was expecting that to be solved via openshift-metal3/kni-installer#46 combined with openshift-metal3/terraform-provider-ironic#28 but perhaps @dhellmann and @stbenjam can confirm if we need an additional issue to track wiring the introspection data into the BMH resources registered via the installer.

russellb commented Jul 5, 2019

> > We are much closer here. Much of the plumbing to get addresses on Machines is done. We still lack addresses on BareMetalHosts that represent masters, as those don’t go through introspection driven by the baremetal operator. A bug needs to be opened for this.
>
> I was expecting that to be solved via openshift-metal3/kni-installer#46 combined with openshift-metal3/terraform-provider-ironic#28 but perhaps @dhellmann and @stbenjam can confirm if we need an additional issue to track wiring the introspection data into the BMH resources registered via the installer.

A couple of complications:

  • Manifests from the installer are generated prior to running terraform.
  • Introspection data isn't available until after terraform is running and the cluster is coming up.
  • The hardware details are in the status section, which can't be set in a manifest anyway.

We first need a way to set this at all, and then we'll have to figure out how to use the installer and terraform to get the information into the cluster. I filed this to create a way to pass in the info: metal3-io/baremetal-operator#242

@dhellmann

I expect hardware data to be available when https://github.com/metal3-io/metal3-docs/blob/master/design/hardware-status.md is implemented.

russellb added a commit to russellb/dev-scripts that referenced this issue Jul 5, 2019
The only remaining change we have running after exiting early from the
install process is adding IPs to the master Machines.  This is to
enable auto approval of CSRs.  We also have a cron job that does this,
so it's not necessary to do this in the middle of the install.  The
change moves it to post-install.

Related issues:
openshift-metal3/kni-installer#60
openshift-metal3#260
metal3-io/baremetal-operator#242

@russellb

Here are some updates to the latest status of this issue.

I recently cleaned up dev-scripts a bit to remove some related hacks that are no longer required: #686

I reviewed the current code in OpenShift that automatically approves CSRs and documented my understanding of the process here: openshift/cluster-machine-approver#32

Now, status for the baremetal platform:

  • CSRs for masters get automatically approved successfully. This is done by a service that runs on the bootstrap VM during installation.

  • CSRs for workers do not get approved successfully yet, but we are very close:

Approval of the client CSR was blocked on cluster-api-provider-baremetal not populating the NodeInternalDNS address field with the hostname. That was resolved here: metal3-io/cluster-api-provider-baremetal#100 and then added to OpenShift here: openshift/cluster-api-provider-baremetal#40

Once we are running a new enough release image to include the above change, automatic CSR approval for workers should be working, but this needs validation.

There are still things to consider for improvements:

Automatic CSR approval for workers relies on hardware introspection data on the BareMetalHost object. If the IPs used by the host change for some reason from what the host had during introspection, this will break. "Continuous Introspection" discussed in previous comments would resolve this.

Note that once the first CSR is approved, future CSRs will get approved automatically. The addresses just need to line up for the first one, which occurs immediately post-deployment, so the problem of mismatched addresses is pretty unlikely.
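
(To sanity-check that input, the introspection data carried on each BareMetalHost can be inspected; exact field paths may differ between versions, but roughly:)

$ oc get baremetalhosts -n openshift-machine-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.hardware.nics[*].ip}{"\n"}{end}'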

@russellb

Further clarification about masters. Indeed, both the client and server certs will be approved during bootstrapping. However, only the rotations of the client cert will get automatically approved from then on. Automation is still required for approval of the server certs.
I looked into it and now understand why this is required.

The docs were just updated last week to clarify that only the node client cert will be auto approved on an ongoing basis, and that some automation will always be required for the server certs.

https://bugzilla.redhat.com/show_bug.cgi?id=1720178

openshift/openshift-docs@6912507

openshift/openshift-docs#16060

This means that we are back to requiring the addresses on masters to enable the cluster-machine-approver to automate the server CSR approval for masters. Otherwise, we have to use a cron job of sorts to do it, either outside the cluster like dev-scripts has been doing, or perhaps inside the cluster by doing something like https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_bootstrap_autoapprover/files/openshift-bootstrap-controller.yaml
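
Whatever form the automation takes, the core of it is just approving anything still pending, along the lines of:

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
    | xargs --no-run-if-empty oc adm certificate approve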

karmab commented Jul 31, 2019

Proposing #713, which is a cron job running within the cluster.

russellb added a commit to russellb/dev-scripts that referenced this issue Jul 31, 2019
This reverts commit a991cca.

Some of the removed code is still needed.  In particular, we still
need to set the IPs on master Machines, as we have no other mechanism
to do that.  Without it, the master Machines and Nodes will not get
linked, which the UI depends on.

Related to issue openshift-metal3#260
russellb added a commit to russellb/dev-scripts that referenced this issue Jul 31, 2019
We should no longer set IPs on the worker Machine objects, as
cluster-api-provider-baremetal should be doing that automatically.

Since we're not setting IPs, there's no requirement for the script to
wait for the worker to come up, either.

This is still a useful verification, so move it to run_ci.sh, instead.

Related to issue openshift-metal3#260
dhellmann pushed a commit to dhellmann/metal3-dev-scripts that referenced this issue Aug 7, 2019
This reverts commit a991cca.

Some of the removed code is still needed.  In particular, we still
need to set the IPs on master Machines, as we have no other mechanism
to do that.  Without it, the master Machines and Nodes will not get
linked, which the UI depends on.

Related to issue openshift-metal3#260
dhellmann pushed a commit to dhellmann/metal3-dev-scripts that referenced this issue Aug 7, 2019
We should no longer set IPs on the worker Machine objects, as
cluster-api-provider-baremetal should be doing that automatically.

Since we're not setting IPs, there's no requirement for the script to
wait for the worker to come up, either.

This is still a useful verification, so move it to run_ci.sh, instead.

Related to issue openshift-metal3#260
russellb commented Aug 9, 2019

https://bugzilla.redhat.com/show_bug.cgi?id=1737611#c2

This proposes a change to the cluster machine approver to be able to automatically approve server cert refreshes in a way that would not require us to solve the IP addresses-on-master-Machines issue.

@dhellmann

#730 is a temporary work-around for this

russellb added a commit to russellb/dev-scripts that referenced this issue Aug 29, 2019
The journey of openshift-metal3#260 continues.

This was previously removed because automatic CSR approval was working
for workers.  Automatic CSR approval has stopped working because a
piece of required information (the hostname) is no longer present on
Machine objects.

The hostname is copied to the Machine from the hardware introspection
data on the BareMetalHost.  Ironic was updated to support reporting
the hostname received via DHCP here: https://review.opendev.org/#/c/663991/
More history about adding this to the baremetal-operator is here:
metal3-io/baremetal-operator#190

Ironic was downgraded in OCP 4.2, meaning we lost the change to report
the hostname.  We'll need this cron job until we're able to upgrade
Ironic again.

For more information about automated CSR approval, see:
https://github.com/openshift/cluster-machine-approver/blob/master/README.md

Related issue: openshift-metal3#706

@russellb

Automatic CSR approval doesn't work anymore after downgrading Ironic; more info here: #782

karmab commented Sep 18, 2019

As the fix for this bug got merged, can we consider this issue closed now?

hardys commented Sep 18, 2019

openshift/cluster-machine-approver#38 landed and @karmab reported an install-scripts deployment that stayed up for >24h without fix_certs, so this may now be resolved?

@russellb

The PR you refer to resolves our issues for masters, but not workers. You'd still have to manually approve the first CSRs for each worker without some hack in place. This is because we're lacking the hostname after downgrading Ironic, and that's one of the pieces of info needed in the automated CSR approval workflow.
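
(You can see the missing piece by checking for a hostname in the BareMetalHost introspection data and for a corresponding NodeInternalDNS entry in the Machine addresses; field paths here are from memory and may differ by version.)

$ oc get baremetalhosts -n openshift-machine-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.hardware.hostname}{"\n"}{end}'
$ oc get machines -n openshift-machine-api -o jsonpath='{.items[*].status.addresses}'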

@oglok
oglok commented Jan 28, 2020

Hi @russellb!

Do you mean that we're still hitting this bug? I can see that workers are not joining the cluster (at the node level), but they are provisioned and functional at the machine/baremetalhost level. To make it work, I have to manually approve the pending CSRs. Thanks in advance!

Btw, I'm talking about a real baremetal deployment, not dev-scripts.

@russellb
Copy link
Member Author

AFAIK, this issue should be resolved.

For workers, we have enough information on the BareMetalHost and Machine resources to support automatic CSR approval. If that's not working, it would be a regression, and we'd need to check the logs of the cluster-machine-approver.
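
(Assuming the default namespace and deployment names for the approver, that would be something like:)

$ oc get pods -n openshift-cluster-machine-approver
$ oc logs -n openshift-cluster-machine-approver deployment/machine-approver --all-containers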

Master CSRs are automatically approved during installation. Refreshing them should have been resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1737611

@russellb

We've removed all hacks from dev-scripts as of #915.

Please open new bugs if anyone sees a similar issue at this point
