Certs not getting auto approved, stuck in pending state #260
Comments
This issue looks related: openshift/cluster-machine-approver#15
My machine-approver log has a bunch of recurring errors.
The code of interest is authorizeCSR(), in here: https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L91 It's failing right now because it can't find a Machine for the node. The issue appears to be that some required fields are never populated, and that is because we don't have our actuator / machine controller running; that's what would be setting them. If I'm reading all this right, we need to complete integration of our cluster-api provider to get this to work properly.
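To make the failure mode concrete, here is a minimal standalone sketch of the kind of lookup being described: the approver needs a Machine whose NodeRef points at the node named in the CSR, and with no actuator or linking running that field stays nil. The types and names below are hypothetical simplifications for illustration, not the actual csr_check.go code:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the real API types.
type NodeRef struct{ Name string }

type Machine struct {
	Name    string
	NodeRef *NodeRef // nil until something links the Machine to a Node
}

// findMachineForNode mimics the precondition the approver needs:
// a Machine whose NodeRef matches the node name from the CSR.
func findMachineForNode(machines []Machine, nodeName string) (*Machine, bool) {
	for i := range machines {
		m := &machines[i]
		if m.NodeRef != nil && m.NodeRef.Name == nodeName {
			return m, true
		}
	}
	return nil, false
}

func main() {
	machines := []Machine{
		{Name: "worker-0"}, // NodeRef never set: its CSR stays pending
		{Name: "worker-1", NodeRef: &NodeRef{Name: "node-1"}},
	}
	_, ok := findMachineForNode(machines, "node-0")
	fmt.Println("node-0 approvable:", ok)
	_, ok = findMachineForNode(machines, "node-1")
	fmt.Println("node-1 approvable:", ok)
}
```

Once the machine controller and node-linking are running, the lookup succeeds and the CSR can move past this check.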
Right now NodeRef will never get set, because our Nodes do not have the machine annotation linking them back to a Machine. That is supposed to be set by the nodelink-controller: https://github.com/openshift/machine-api-operator/blob/master/cmd/nodelink-controller/main.go#L330 In my current cluster, the nodelink-controller is not running,
which implies that it's seeing the provider type as blank in the Infrastructure config object. https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/config.go#L48 https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/operator.go#L192-L200 However, I do see it set in the Infrastructure object on my cluster.
Now I'm trying to understand why the provider type is being read as blank.
I started running the missing pieces manually to keep debugging.
Next problem: the machine controller fails to start with a flags error.
This is the same problem we fixed in metal3-io/cluster-api-provider-baremetal#39, except this time it's in the openshift fork. Indeed, if I build and run this locally from openshift/cluster-api-provider-baremetal, I get the same behavior.
It seems this flags issue was fixed in openshift/cluster-api-provider-aws@50732ee#diff-6ae4328a95448a2a20bcb23ee01dca50 Next we need the same fix applied to openshift/cluster-api-provider-baremetal.
Also for reference: openshift/cluster-api-provider-libvirt@07a9858#diff-6ae4328a95448a2a20bcb23ee01dca50
A patch has been proposed to fix this directly in openshift/cluster-api-provider-baremetal. The cluster-api copies were not modified directly: upstream switched to a different branch, and we needed to switch our provider as well.
The above patch to openshift/cluster-api-provider-baremetal merged, but I needed a way to run a custom build of the actuator. I documented how I'm doing that manually here: #271 The next error is in the container that's actually running our actuator.
The RBAC error above is fixed by the following PR: openshift/machine-api-operator#271 With that in place, the baremetal machine controller is running successfully. \o/ Two things:
Workaround for openshift-metal3#260. Because life is too short for broken certs. Signed-off-by: Zane Bitter <zbitter@redhat.com>
We now have Ironic introspection data available on a BareMetalHost. I put up a PR to make the nodelink-controller deal with the fact that we may have multiple internal IPs listed for a bare metal Machine object: openshift/machine-api-operator#314
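As a sketch of what that PR is dealing with: the Node-to-Machine match is essentially an address intersection, and on bare metal a Machine can carry several InternalIP entries (e.g. one per NIC), so the match has to succeed if any machine address equals any node address. A minimal illustration with hypothetical simplified types, not the actual nodelink-controller code:

```go
package main

import "fmt"

// Address is a simplified stand-in for corev1.NodeAddress.
type Address struct {
	Type    string // e.g. "InternalIP"
	Address string
}

// machineMatchesNode reports whether any InternalIP on the Machine
// matches any InternalIP on the Node. A single-address comparison
// would wrongly miss matches on multi-NIC bare metal hosts.
func machineMatchesNode(machineAddrs, nodeAddrs []Address) bool {
	for _, ma := range machineAddrs {
		if ma.Type != "InternalIP" {
			continue
		}
		for _, na := range nodeAddrs {
			if na.Type == "InternalIP" && na.Address == ma.Address {
				return true
			}
		}
	}
	return false
}

func main() {
	machine := []Address{
		{"InternalIP", "192.168.111.20"},
		{"InternalIP", "172.22.0.20"}, // e.g. a provisioning-network NIC
	}
	node := []Address{{"InternalIP", "192.168.111.20"}}
	fmt.Println("linked:", machineMatchesNode(machine, node))
}
```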
Actuator update to populate the IPs from the BMH: metal3-io/cluster-api-provider-baremetal#24
We are much closer here. Much of the plumbing to get addresses on Machines is done. We still lack addresses on BareMetalHosts that represent masters, as those don't go through introspection driven by the baremetal operator. A bug needs to be opened for this.
I was expecting that to be solved via openshift-metal3/kni-installer#46 combined with openshift-metal3/terraform-provider-ironic#28, but perhaps @dhellmann and @stbenjam can confirm whether we need an additional issue to track wiring the introspection data into the BMH resources registered via the installer.
A couple of complications:
We first need a way to set this at all, and then we'll have to figure out how to use the installer and terraform to get the information into the cluster. I filed this to create a way to pass in the info: metal3-io/baremetal-operator#242
I expect hardware data to be available when https://github.com/metal3-io/metal3-docs/blob/master/design/hardware-status.md is implemented.
The only remaining change we have running after exiting early from the install process is adding IPs to the master Machines. This is to enable auto approval of CSRs. We also have a cron job that does this, so it's not necessary to do this in the middle of the install. The change moves it to post-install. Related issues: openshift-metal3/kni-installer#60 openshift-metal3#260 metal3-io/baremetal-operator#242
Here are some updates on the latest status of this issue. I recently cleaned up dev-scripts a bit to remove some related hacks that are no longer required: #686 I reviewed the current code in OpenShift that automatically approves CSRs and documented my understanding of the process here: openshift/cluster-machine-approver#32 Now, the current status:
Approval of the client CSR was blocked on cluster-api-provider-baremetal not populating the addresses on Machine objects. Once we are running a new enough release image to include the above change, automatic CSR approval for workers should be working, but that needs validation. There are still things to consider as improvements: automatic CSR approval for workers relies on hardware introspection data on the BareMetalHost. Note that once the first CSR is approved, future CSRs will get approved automatically. The addresses just need to line up for the first one, which occurs immediately post-deployment, so the problem of mismatched addresses is pretty unlikely.
Further clarification about masters: indeed, both the client and server certs will be approved during bootstrapping. However, only rotations of the client cert will get automatically approved from then on; automation is still required for approval of the server certs. The docs were just updated last week to clarify that only the node client cert will be auto approved on an ongoing basis, and that some automation will always be required for the server certs. https://bugzilla.redhat.com/show_bug.cgi?id=1720178 openshift/openshift-docs@6912507 openshift/openshift-docs#16060 This means that we are back to requiring the addresses on masters to enable the cluster-machine-approver to automate the server CSR approval for masters. Otherwise, we have to use a cron job of sorts to do it, either outside the cluster like dev-scripts has been doing, or perhaps inside the cluster by doing something like https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_bootstrap_autoapprover/files/openshift-bootstrap-controller.yaml
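To illustrate why the addresses matter for server certs: an approver can only safely sign a serving-cert CSR if every name and IP the kubelet requests is already known for that Machine. A hedged sketch with hypothetical helpers, not the real cluster-machine-approver logic:

```go
package main

import "fmt"

// serverCSRApprovable sketches the check described above: every SAN
// the kubelet requests in its serving-cert CSR must be accounted for
// by an address or hostname recorded on the corresponding Machine.
// Hypothetical helper, for illustration only.
func serverCSRApprovable(csrSANs, machineAddresses []string) bool {
	known := make(map[string]bool, len(machineAddresses))
	for _, a := range machineAddresses {
		known[a] = true
	}
	for _, san := range csrSANs {
		if !known[san] {
			return false // unknown SAN: leave the CSR pending
		}
	}
	return true
}

func main() {
	// A master Machine with no recorded addresses can never pass this
	// check, which is why the CSRs sit pending without automation.
	machine := []string{"master-0", "192.168.111.10"}
	fmt.Println(serverCSRApprovable([]string{"master-0", "192.168.111.10"}, machine))
	fmt.Println(serverCSRApprovable([]string{"master-0", "10.0.0.99"}, machine))
}
```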
Proposing #713, which is a cron job running within the cluster.
This reverts commit a991cca. Some of the removed code is still needed. In particular, we still need to set the IPs on master Machines, as we have no other mechanism to do that. Without it, the master Machines and Nodes will not get linked, which the UI depends on. Related to issue openshift-metal3#260
We should no longer set IPs on the worker Machine objects, as cluster-api-provider-baremetal should be doing that automatically. Since we're not setting IPs, there's no requirement for the script to wait for the worker to come up, either. This is still a useful verification, so move it to run_ci.sh, instead. Related to issue openshift-metal3#260
https://bugzilla.redhat.com/show_bug.cgi?id=1737611#c2 This proposes a change to the cluster-machine-approver to be able to automatically approve server cert refreshes in a way that would not require us to solve the IP-addresses-on-master-Machines issue.
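The idea in that proposal, as I understand it, is that a refresh can be judged against the serving cert the node already holds rather than against Machine addresses. A rough standalone sketch under that assumption (hypothetical types, not the proposed implementation):

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// CertIdentity is a hypothetical distillation of what a serving cert
// or a renewal CSR asks for: the node identity plus its SANs.
type CertIdentity struct {
	CommonName string
	SANs       []string
}

// renewalApprovable sketches the approach: a server-cert refresh is
// safe to approve if it requests exactly the identity of the cert the
// node already holds, so no Machine addresses are needed at all.
func renewalApprovable(current, requested CertIdentity) bool {
	if current.CommonName != requested.CommonName {
		return false
	}
	a := append([]string(nil), current.SANs...)
	b := append([]string(nil), requested.SANs...)
	sort.Strings(a)
	sort.Strings(b)
	return reflect.DeepEqual(a, b)
}

func main() {
	cur := CertIdentity{"system:node:master-0", []string{"master-0", "192.168.111.10"}}
	// Same identity, different SAN order: fine to approve.
	fmt.Println(renewalApprovable(cur, CertIdentity{"system:node:master-0", []string{"192.168.111.10", "master-0"}}))
	// A SAN not in the existing cert: leave pending.
	fmt.Println(renewalApprovable(cur, CertIdentity{"system:node:master-0", []string{"evil.example.com"}}))
}
```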
#730 is a temporary workaround for this.
The journey of openshift-metal3#260 continues. This was previously removed because automatic CSR approval was working for workers. Automatic CSR approval has stopped working because a piece of required information (the hostname) is no longer present on Machine objects. The hostname is copied to the Machine from the hardware introspection data on the BareMetalHost. Ironic was updated to support reporting the hostname received via DHCP here: https://review.opendev.org/#/c/663991/ More history about adding this to the baremetal-operator is here: metal3-io/baremetal-operator#190 Ironic was downgraded in OCP 4.2, meaning we lost the change to report the hostname. We'll need this cron job until we're able to upgrade Ironic again. For more information about automated CSR approval, see: https://github.com/openshift/cluster-machine-approver/blob/master/README.md Related issue: openshift-metal3#706
Automatic CSR approval doesn't work anymore after downgrading Ironic; more info here: #782
As the fix for this bug got merged, can we consider this issue closed now?
openshift/cluster-machine-approver#38 landed, and @karmab reported an install-scripts deployment that stayed up for >24h without fix_certs, so this may now be resolved?
The PR you refer to resolves our issues for masters, but not workers. You'd still have to manually approve the first CSRs for each worker without some hack in place. This is because we're lacking the hostname after downgrading Ironic, and that's one of the pieces of info needed in the automated CSR approval workflow.
Hi @russellb! Do you mean that we're still hitting this bug? I can see how workers are not joining the cluster (at the node level), but they are provisioned and functional at the machine/baremetalhost level. To make it work, I have to manually approve the pending CSRs. Thanks in advance! Btw, I'm talking about a real baremetal deployment, not dev-scripts.
AFAIK, this issue should be resolved. For workers, we have enough information on the BareMetalHost and Machine resources to support automatic CSR approval. If that's not working, we need to check the logs of the cluster-machine-approver. Master CSRs are automatically approved during installation; refreshing them should have been resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1737611
We've removed all hacks from dev-scripts as of #915. Please open new bugs if anyone sees a similar issue at this point.
All of our deployments have a problem where certs aren't being approved automatically, and the following command is needed as a workaround:
Other PRs / issues tracked through this investigation: