Certs not getting auto approved, stuck in pending state #260

Closed · 11 of 13 tasks
russellb opened this issue Apr 1, 2019 · 30 comments

Comments

russellb commented Apr 1, 2019

This issue looks related: openshift/cluster-machine-approver#15

russellb commented Apr 1, 2019

My machine-approver log has a bunch of:

I0321 18:09:39.113975       1 main.go:97] CSR csr-grgft added
I0321 18:09:39.133134       1 main.go:123] CSR csr-grgft not authorized: No target machine
I0321 18:14:32.453059       1 main.go:97] CSR csr-f66gv added
I0321 18:14:32.464451       1 main.go:123] CSR csr-f66gv not authorized: No target machine
I0321 18:18:11.088381       1 main.go:97] CSR csr-jmf46 added
I0321 18:18:11.102598       1 main.go:123] CSR csr-jmf46 not authorized: No target machine
I0321 18:22:27.112564       1 main.go:97] CSR csr-zxjkn added
I0321 18:22:27.127977       1 main.go:123] CSR csr-zxjkn not authorized: No target machine
I0321 18:27:20.454085       1 main.go:97] CSR csr-f85k9 added
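
(While this gets sorted out, the stuck CSRs can be listed and approved by hand as a stopgap; the CSR name below is just an example from the log above.)

$ oc get csr                             # stuck ones show CONDITION "Pending"
$ oc adm certificate approve csr-grgft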

The code of interest is authorizeCSR() in here:

https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L91

It's failing right now because it can't find a Machine whose NodeRef name matches the requestor. The request includes something like machine-0, and that does match the Node names in our env.

The issue appears to be that NodeRef isn't getting set on our Machines, and that is because we don't have our actuator / machine controller running. That's what would be setting NodeRef.
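
A quick way to confirm that (assuming the usual openshift-machine-api namespace) is to dump each Machine's nodeRef, which comes back empty in this state:

$ oc get machines -n openshift-machine-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeRef.name}{"\n"}{end}'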

If I'm reading all this right, we need to complete integration of our cluster-api provider to get this to work properly.

openshift/machine-api-operator#235

russellb commented Apr 1, 2019

Right now NodeRef will never get set because our Nodes do not have the machine.openshift.io/machine annotation set on the Node.

That is supposed to be set by the nodelink-controller here, which is run by the machine-api-operator.

https://github.com/openshift/machine-api-operator/blob/master/cmd/nodelink-controller/main.go#L330

In my current cluster, the machine-api-operator is not running the clusterapi-manager-controllers pod, which would include the nodelink controller. This is due to other errors during machine-api-operator startup.
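
For reference, whether that annotation is present, and whether the controllers pod is running at all, can be checked with something like:

$ oc get nodes -o yaml | grep machine.openshift.io/machine
$ oc get pods -n openshift-machine-api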

russellb commented Apr 1, 2019

machine-api-operator has the following error:

E0401 19:16:52.677137       1 operator.go:177] Failed getting operator config: no platform provider found on install config

which implies that it's seeing the provider type as blank in the Infrastructure config object.

https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/config.go#L48

https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/operator.go#L192-L200

However, I see it set to BareMetal when I look at it manually:

$ oc get infrastructures.config.openshift.io --all-namespaces -oyaml                                                                                                     
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Infrastructure
  metadata:
    creationTimestamp: 2019-04-01T17:15:56Z
    generation: 1
    name: cluster
    resourceVersion: "396"
    selfLink: /apis/config.openshift.io/v1/infrastructures/cluster
    uid: cfc641b8-54a1-11e9-ba61-52fdfc072182
  spec: {}
  status:
    apiServerURL: https://api.ostest.test.metalkube.org:6443
    etcdDiscoveryDomain: ostest.test.metalkube.org
    platform: BareMetal
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Now I'm trying to understand why the machine-api-operator isn't picking that up successfully ...
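
For a quick check, the exact value the operator is supposed to read can be pulled directly:

$ oc get infrastructures.config.openshift.io cluster -o jsonpath='{.status.platform}'
BareMetal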

russellb commented Apr 1, 2019

I started running the machine-api-operator locally and got different behavior. It was able to get the provider type successfully from the Infrastructure config object, but was falling back to the no-op provider due to the following problem:

openshift/machine-api-operator#268

russellb commented Apr 1, 2019

Next problem: the clusterapi-manager-controllers pod is running, but it's in CrashLoopBackOff.

Here's the error:

$ oc logs -n openshift-machine-api pod/clusterapi-manager-controllers-58f6f9769f-xckj7 -c controller-manager
flag provided but not defined: -logtostderr
Usage of ./manager:
  -kubeconfig string
        Paths to a kubeconfig. Only required if out-of-cluster.
  -master string
        The address of the Kubernetes API server. Overrides any value in kubeconfig. Only required if out-of-cluster.

This is the same problem we fixed in metal3-io/cluster-api-provider-baremetal#39, except this time it's the manager binary that we build from ./vendor/github.com/openshift/cluster-api/cmd/manager.

Indeed, if I build and run this locally from openshift/cluster-api-provider-baremetal, I get the same behavior.

$ make
go build -o bin/machine-controller-manager ./cmd/manager
go build -o bin/manager ./vendor/github.com/openshift/cluster-api/cmd/manager

$ bin/manager -logtostderr=true
flag provided but not defined: -logtostderr
Usage of bin/manager:
  -kubeconfig string
    	Paths to a kubeconfig. Only required if out-of-cluster.
  -master string
    	The address of the Kubernetes API server. Overrides any value in kubeconfig. Only required if out-of-cluster.

russellb commented Apr 1, 2019

It seems this flags issue was fixed in openshift/cluster-api-provider-aws by directly modifying the vendored copy of openshift/cluster-api, rather than upstreaming that fix into openshift/cluster-api.

openshift/cluster-api-provider-aws@50732ee#diff-6ae4328a95448a2a20bcb23ee01dca50

Next we need this fix applied to openshift/cluster-api, and then to update our vendored copy of it in openshift/cluster-api-provider-baremetal.

russellb commented Apr 1, 2019

also for reference, openshift/cluster-api-provider-libvirt got the fix in a similar way, by directly modifying the vendored openshift/cluster-api in this commit:

openshift/cluster-api-provider-libvirt@07a9858#diff-6ae4328a95448a2a20bcb23ee01dca50

russellb commented Apr 2, 2019

Patch proposed to fix this directly in openshift/cluster-api-provider-baremetal here: openshift/cluster-api-provider-baremetal#9

It turns out the vendored cluster-api copies were not modified directly after all; those providers switched to a different cluster-api branch, and we needed to switch our provider to it as well.

russellb commented Apr 2, 2019

The above patch to openshift/cluster-api-provider-baremetal merged, but I needed a way to run a custom build of the actuator. I documented how I'm doing that manually here: #271

The next error is in the container that's actually running our actuator.

$ oc logs -n openshift-machine-api clusterapi-manager-controllers-5cdf45d7f9-n9z4b -c machine-controller

...
E0402 20:22:42.603016       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metalkube.org is forbidden: User "system:serviceaccount:openshift-machine-api:default" cannot list resource "baremetalhosts" in API group "metalkube.org" at the cluster scope
E0402 20:22:43.605226       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metalkube.org is forbidden: User "system:serviceaccount:openshift-machine-api:default" cannot list resource "baremetalhosts" in API group "metalkube.org" at the cluster scope
E0402 20:22:44.606683       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metalkube.org is forbidden: User "system:serviceaccount:openshift-machine-api:default" cannot list resource "baremetalhosts" in API group "metalkube.org" at the cluster scope
...
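
For the record, the missing permission boils down to RBAC along these lines; this is only an illustrative sketch (the ClusterRole name is made up), since the proper fix belongs in the operator-managed manifests:

$ cat <<'EOF' | oc apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: baremetalhost-reader            # illustrative name only
rules:
- apiGroups: ["metalkube.org"]
  resources: ["baremetalhosts"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: baremetalhost-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: baremetalhost-reader
subjects:
- kind: ServiceAccount
  name: default                         # the service account named in the error above
  namespace: openshift-machine-api
EOF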

russellb commented Apr 2, 2019

The RBAC error above is fixed by the following PR: openshift/machine-api-operator#271

With that in place, the baremetal machine controller is running successfully. \o/

Two things:

  1. There are enough fixes in place that we should be able to proceed with actuator driven worker node deployments.

  2. The original cert issue is not yet resolved, but we've gotten enough fixed that we're back to working on the Machine and Node resource linkage necessary to resolve the cert issue. The lack of that linkage is causing failures in: cluster-machine-approver pod, plus the nodelink-controller and machine-healthcheck containers of the clusterapi-machine-controllers pod.

zaneb added a commit to zaneb/dev-scripts that referenced this issue Apr 9, 2019
Workaround for openshift-metal3#260. Because life is too short for broken certs.

Signed-off-by: Zane Bitter <zbitter@redhat.com>

@russellb

We now have Ironic introspection data available on a BareMetalHost.

I put up a PR to make the nodelink-controller deal with the fact that we may have multiple internal IPs listed for a bare metal Machine object: openshift/machine-api-operator#314
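
To see what the nodelink-controller has to match on, the addresses recorded on each Machine can be dumped with:

$ oc get machines -n openshift-machine-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.addresses}{"\n"}{end}'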

@bcrochet

Actuator update to populate the IPs from the BMH: metal3-io/cluster-api-provider-baremetal#24

russellb commented Jul 4, 2019

We are much closer here. Much of the plumbing to get addresses on Machines is done. We still lack addresses on BareMetalHosts that represent masters, as those don’t go through introspection driven by the baremetal operator. A bug needs to be opened for this.

hardys commented Jul 5, 2019

> We are much closer here. Much of the plumbing to get addresses on Machines is done. We still lack addresses on BareMetalHosts that represent masters, as those don’t go through introspection driven by the baremetal operator. A bug needs to be opened for this.

I was expecting that to be solved via openshift-metal3/kni-installer#46 combined with openshift-metal3/terraform-provider-ironic#28 but perhaps @dhellmann and @stbenjam can confirm if we need an additional issue to track wiring the introspection data into the BMH resources registered via the installer.

russellb commented Jul 5, 2019

> > We are much closer here. Much of the plumbing to get addresses on Machines is done. We still lack addresses on BareMetalHosts that represent masters, as those don’t go through introspection driven by the baremetal operator. A bug needs to be opened for this.
>
> I was expecting that to be solved via openshift-metal3/kni-installer#46 combined with openshift-metal3/terraform-provider-ironic#28 but perhaps @dhellmann and @stbenjam can confirm if we need an additional issue to track wiring the introspection data into the BMH resources registered via the installer.

A couple of complications:

  • Manifests from the installer are generated prior to running terraform.
  • Introspection data isn't available until after terraform is running and the cluster is coming up.
  • The hardware details are in the status section, which can't be set in a manifest anyway.

We first need a way to set this at all, and then we'll have to figure out how to use the installer and terraform to get the information into the cluster. I filed this to create a way to pass in the info: metal3-io/baremetal-operator#242

@dhellmann

I expect hardware data to be available when https://github.com/metal3-io/metal3-docs/blob/master/design/hardware-status.md is implemented.

russellb added a commit to russellb/dev-scripts that referenced this issue Jul 5, 2019
The only remaining change we have running after exiting early from the
install process is adding IPs to the master Machines.  This is to
enable auto approval of CSRs.  We also have a cron job that does this,
so it's not necessary to do this in the middle of the install.  The
change moves it to post-install.

Related issues:
openshift-metal3/kni-installer#60
openshift-metal3#260
metal3-io/baremetal-operator#242

@russellb

Here are some updates to the latest status of this issue.

I recently cleaned up dev-scripts a bit to remove some related hacks that are no longer required: #686

I reviewed the current code in OpenShift that automatically approves CSRs and documented my understanding of the process here: openshift/cluster-machine-approver#32

Now, status for the baremetal platform:

  • CSRs for masters get automatically approved successfully. This is done by a service that runs on the bootstrap VM during installation.

  • CSRs for workers do not get approved successfully yet, but we are very close:

Approval of the client CSR was blocked on cluster-api-provider-baremetal not populating the NodeInternalDNS address field with the hostname. That was resolved here: metal3-io/cluster-api-provider-baremetal#100 and then added to OpenShift here: openshift/cluster-api-provider-baremetal#40

Once we are running a new enough release image to include the above change, automatic CSR approval for workers should be working, but this needs validation.

There are still things to consider for improvements:

Automatic CSR approval for workers relies on hardware introspection data on the BareMetalHost object. If the IPs used by the host change for some reason from what the host had during introspection, this will break. "Continuous Introspection" discussed in previous comments would resolve this.

Note that once the first CSR is approved, future CSRs will get approved automatically. The addresses just need to line up for the first one, which occurs immediately post-deployment, so the problem of mismatched addresses is pretty unlikely.
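
(To sanity-check that input, the introspection data carried on each BareMetalHost can be inspected; exact field paths may differ between versions, but roughly:)

$ oc get baremetalhosts -n openshift-machine-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.hardware.nics[*].ip}{"\n"}{end}'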

@russellb

Further clarification about masters. Indeed, both the client and server certs will be approved during bootstrapping. However, only the rotations of the client cert will get automatically approved from then on. Automation is still required for approval of the server certs.
I looked into it and now understand why this is required.

The docs were just updated last week to clarify that only the node client cert will be auto approved on an ongoing basis, and that some automation will always be required for the server certs.

https://bugzilla.redhat.com/show_bug.cgi?id=1720178

openshift/openshift-docs@6912507

openshift/openshift-docs#16060

This means that we are back to requiring the addresses on masters to enable the cluster-machine-approver to automate the server CSR approval for masters. Otherwise, we have to use a cron job of sorts to do it, either outside the cluster like dev-scripts has been doing, or perhaps inside the cluster by doing something like https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_bootstrap_autoapprover/files/openshift-bootstrap-controller.yaml
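
Whatever form the automation takes, the core of it is just approving anything still pending, along the lines of:

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
    | xargs --no-run-if-empty oc adm certificate approve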

karmab commented Jul 31, 2019

Proposing #713, which is a cron job running within the cluster.

russellb added a commit to russellb/dev-scripts that referenced this issue Jul 31, 2019
This reverts commit a991cca.

Some of the removed code is still needed.  In particular, we still
need to set the IPs on master Machines, as we have no other mechanism
to do that.  Without it, the master Machines and Nodes will not get
linked, which the UI depends on.

Related to issue openshift-metal3#260
russellb added a commit to russellb/dev-scripts that referenced this issue Jul 31, 2019
We should no longer set IPs on the worker Machine objects, as
cluster-api-provider-baremetal should be doing that automatically.

Since we're not setting IPs, there's no requirement for the script to
wait for the worker to come up, either.

This is still a useful verification, so move it to run_ci.sh, instead.

Related to issue openshift-metal3#260
dhellmann pushed a commit to dhellmann/metal3-dev-scripts that referenced this issue Aug 7, 2019
This reverts commit a991cca.

Some of the removed code is still needed.  In particular, we still
need to set the IPs on master Machines, as we have no other mechanism
to do that.  Without it, the master Machines and Nodes will not get
linked, which the UI depends on.

Related to issue openshift-metal3#260
dhellmann pushed a commit to dhellmann/metal3-dev-scripts that referenced this issue Aug 7, 2019
We should no longer set IPs on the worker Machine objects, as
cluster-api-provider-baremetal should be doing that automatically.

Since we're not setting IPs, there's no requirement for the script to
wait for the worker to come up, either.

This is still a useful verification, so move it to run_ci.sh, instead.

Related to issue openshift-metal3#260
russellb commented Aug 9, 2019

https://bugzilla.redhat.com/show_bug.cgi?id=1737611#c2

This proposes a change to the cluster machine approver to be able to automatically approve server cert refreshes in a way that would not require us to solve the IP addresses-on-master-Machines issue.

@dhellmann

#730 is a temporary work-around for this

russellb added a commit to russellb/dev-scripts that referenced this issue Aug 29, 2019
The journey of openshift-metal3#260 continues.

This was previously removed because automatic CSR approval was working
for workers.  Automatic CSR approval has stopped working because a
piece of required information (the hostname) is no longer present on
Machine objects.

The hostname is copied to the Machine from the hardware introspection
data on the BareMetalHost.  Ironic was updated to support reporting
the hostname received via DHCP here: https://review.opendev.org/#/c/663991/
More history about adding this to the baremetal-operator is here:
metal3-io/baremetal-operator#190

Ironic was downgraded in OCP 4.2, meaning we lost the change to report
the hostname.  We'll need this cron job until we're able to upgrade
Ironic again.

For more information about automated CSR approval, see:
https://github.com/openshift/cluster-machine-approver/blob/master/README.md

Related issue: openshift-metal3#706

@russellb

Automatic CSR approval doesn't work anymore after downgrading Ironic; more info here: #782

karmab commented Sep 18, 2019

As the fix for this bug got merged, can we consider this issue closed now?

hardys commented Sep 18, 2019

openshift/cluster-machine-approver#38 landed and @karmab reported an install-scripts deployment that stayed up for >24h without fix_certs, so this may now be resolved?

@russellb

The PR you refer to resolves our issues for masters, but not workers. You'd still have to manually approve the first CSRs for each worker without some hack in place. This is because we're lacking the hostname after downgrading Ironic, and that's one of the pieces of info needed in the automated CSR approval workflow.
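
(You can see the missing piece by checking for a hostname in the BareMetalHost introspection data and for a corresponding NodeInternalDNS entry in the Machine addresses; field paths here are from memory and may differ by version.)

$ oc get baremetalhosts -n openshift-machine-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.hardware.hostname}{"\n"}{end}'
$ oc get machines -n openshift-machine-api -o jsonpath='{.items[*].status.addresses}'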

@oglok
oglok commented Jan 28, 2020

Hi @russellb!

Do you mean that we're still hitting this bug? I can see that workers are not joining the cluster (at the node level), but they are provisioned and functional at the machine/baremetalhost level. To make it work, I have to manually approve the pending CSRs. Thanks in advance!

Btw, I'm talking about a real baremetal deployment, not dev-scripts.

@russellb
Copy link
Member Author

AFAIK, this issue should be resolved.

For workers, we have enough information on the BareMetalHost and Machine resources to support automatic CSR approval. If that's not working, it would be a regression, and we'd need to check the logs of the cluster-machine-approver.
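
(Assuming the default namespace and deployment names for the approver, that would be something like:)

$ oc get pods -n openshift-cluster-machine-approver
$ oc logs -n openshift-cluster-machine-approver deployment/machine-approver --all-containers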

Master CSRs are automatically approved during installation. Refreshing them should have been resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1737611

@russellb

We've removed all hacks from dev-scripts as of #915.

Please open new bugs if anyone sees a similar issue at this point
