CSR not getting approved for libvirt & baremetal platforms #1893

Closed
zeenix opened this issue Jun 24, 2019 · 18 comments

Comments

@zeenix
Contributor

zeenix commented Jun 24, 2019

Version: master

Platform: libvirt

What happened?

Our e2e-libvirt CI job currently fails because CSRs are not getting approved. This is likely due to a misconfigured firewall, but it needs investigation.

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1883/pull-ci-openshift-installer-master-e2e-libvirt/478/build-log.txt

/assign @praveenkumar
/label platform/libvirt

@zeenix
Contributor Author

zeenix commented Jun 26, 2019

/priority critical-urgent

@openshift-ci-robot openshift-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jun 26, 2019
@zeenix
Contributor Author

zeenix commented Jun 28, 2019

/assign

@zeenix
Contributor Author

zeenix commented Jul 1, 2019

Some hopefully helpful logs:

$ oc get csr
NAME        AGE    REQUESTOR                                                                   CONDITION
csr-q2ppk   12m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-xbdkl   11m    system:node:pupu-tdcvn-master-0                                             Approved,Issued
csr-xbwjw   5m7s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending

$ oc logs pods/machine-approver-669dff69cb-rn2nj -n openshift-cluster-machine-approver
I0701 16:35:10.855320       1 config.go:33] using default as failed to load config /var/run/configmaps/config/config.yaml: open /var/run/configmaps/config/config.yaml: no such file or directory
I0701 16:35:10.855593       1 config.go:23] machine approver config: {NodeClientCert:{Disabled:false}}
E0701 16:35:10.856664       1 reflector.go:126] github.com/openshift/cluster-machine-approver/main.go:185: Failed to list *v1beta1.CertificateSigningRequest: Get https://127.0.0.1:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
E0701 16:35:11.858356       1 reflector.go:126] github.com/openshift/cluster-machine-approver/main.go:185: Failed to list *v1beta1.CertificateSigningRequest: Get https://127.0.0.1:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
# same message over and over again.
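
For anyone who wants to script the check, here is a minimal sketch that picks the Pending requests out of the `oc get csr` listing above. The helper name `pending_csrs` is hypothetical, and the sample text is copied from this comment. The script only prints `oc adm certificate approve` commands; it does not run anything against a cluster, and whether a given CSR should be approved is still a judgment call.

```python
# Sample text copied from the `oc get csr` output in this thread.
SAMPLE = """\
NAME        AGE    REQUESTOR                                                                   CONDITION
csr-q2ppk   12m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-xbdkl   11m    system:node:pupu-tdcvn-master-0                                             Approved,Issued
csr-xbwjw   5m7s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
"""


def pending_csrs(listing):
    """Return the names of CSRs whose CONDITION column is exactly 'Pending'.

    Hypothetical helper: assumes the four-column table layout shown above.
    """
    names = []
    for line in listing.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 4 and fields[-1] == "Pending":
            names.append(fields[0])
    return names


if __name__ == "__main__":
    for name in pending_csrs(SAMPLE):
        # Printed only; review each request before approving it by hand.
        print(f"oc adm certificate approve {name}")
```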

@zeenix
Contributor Author

zeenix commented Jul 2, 2019

After a bit more testing, I believe the:

E0701 16:35:11.858356       1 reflector.go:126] github.com/openshift/cluster-machine-approver/main.go:185: Failed to list *v1beta1.CertificateSigningRequest: Get https://127.0.0.1:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused

error goes away after a while (I guess when the API server is finally up), but then we have another issue:

E0701 18:15:26.348374       1 main.go:174] CSR csr-xl8fs creation time 2019-07-01 18:15:26 +0000 UTC not in range (2019-07-01 17:40:40 +0000 UTC, 2019-07-01 17:50:50 +0000 UTC)

So I think the main issue we're facing here is that the API server takes far too long to come up in the nested virt environment. The check for CSR creation time was added in mid-May.

@zeenix
Contributor Author

zeenix commented Jul 2, 2019

With iptables flushed, the installer still fails (it didn't when I started debugging this last week), and the issue is:

E0702 14:52:06.642505       1 main.go:174] CSR csr-z98ff creation time 2019-07-02 14:52:06 +0000 UTC not in range (2019-07-02 14:18:33 +0000 UTC, 2019-07-02 14:28:43 +0000 UTC)

So most likely we now have two issues. To be sure, I'll test again with the full (default) firewall rules and see whether we still get this error.
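
To make the numbers in that error easier to read, here is a rough sketch that parses the machine-approver message and reports how far outside the allowed range the CSR was created. The regular expression is written against the one log line quoted in this comment and is an assumption about the format, not something taken from the approver source; `parse_rejection` is a hypothetical helper.

```python
import re
from datetime import datetime

# Log line copied from this comment.
LOG_LINE = (
    "E0702 14:52:06.642505       1 main.go:174] CSR csr-z98ff creation time "
    "2019-07-02 14:52:06 +0000 UTC not in range "
    "(2019-07-02 14:18:33 +0000 UTC, 2019-07-02 14:28:43 +0000 UTC)"
)

# One timestamp as it appears in the message (assumes UTC, as in the log above).
TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \+0000 UTC"
PATTERN = re.compile(
    rf"CSR (\S+) creation time {TS} not in range \({TS}, {TS}\)"
)


def parse_rejection(line):
    """Return (name, created, range_start, range_end), or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    name = match.group(1)
    created, start, end = (
        datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
        for text in match.groups()[1:]
    )
    return name, created, start, end


if __name__ == "__main__":
    name, created, start, end = parse_rejection(LOG_LINE)
    print(f"{name}: range length {end - start}, created {created - end} after range end")
```

For the line above, the range is 10 minutes and 10 seconds long, and the CSR was created a little over 23 minutes after it closed.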

@zeenix
Contributor Author

zeenix commented Jul 3, 2019

With the default iptables rules, it's exactly the same:

E0702 15:49:11.485028       1 main.go:174] CSR csr-fd8fp creation time 2019-07-02 15:49:11 +0000 UTC not in range (2019-07-02 15:14:22 +0000 UTC, 2019-07-02 15:24:32 +0000 UTC)
E0702 16:04:19.872160       1 main.go:174] CSR csr-28flw creation time 2019-07-02 16:04:19 +0000 UTC not in range (2019-07-02 15:14:22 +0000 UTC, 2019-07-02 15:24:32 +0000 UTC)

So maybe the actual issue we faced last week is somehow gone and now we have this one? I say that because last week flushing iptables allowed a successful cluster creation.

@cgwalters
Member

Maybe related to https://bugzilla.redhat.com/show_bug.cgi?id=1723955

@zeenix
Contributor Author

zeenix commented Jul 3, 2019

@cgwalters yeah, seems like the same issue to me. No upstream issue for this?

@staebler
Contributor

staebler commented Jul 3, 2019

Is it taking more than 10 minutes between when the Machine resource is created and when the CSR is created? If so, then the auto-approver will reject the request, and the user is required to manually approve the CSR.
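
As an illustration of the rule described here, the following is a sketch in Python, not the approver's actual Go code. The 10-minute limit comes from this comment; the 10-second allowance before the Machine creation time is only inferred from the 10m10s ranges in the logs earlier in this thread, and the inclusive boundaries are likewise an assumption. The names `csr_in_window`, `MAX_MACHINE_DELTA`, and `MAX_CLOCK_SKEW` are hypothetical, as is the Machine creation time used in the example.

```python
from datetime import datetime, timedelta

MAX_MACHINE_DELTA = timedelta(minutes=10)  # limit mentioned in this comment
MAX_CLOCK_SKEW = timedelta(seconds=10)     # assumption inferred from the log ranges


def csr_in_window(machine_created, csr_created):
    """Return True if the CSR was created close enough to the Machine.

    Assumed window: [machine_created - skew, machine_created + delta].
    """
    start = machine_created - MAX_CLOCK_SKEW
    end = machine_created + MAX_MACHINE_DELTA
    return start <= csr_created <= end


if __name__ == "__main__":
    # Hypothetical Machine creation time, chosen so the window matches the
    # range (14:18:33, 14:28:43) seen in one of the logs in this thread.
    machine = datetime(2019, 7, 2, 14, 18, 43)
    late_csr = datetime(2019, 7, 2, 14, 52, 6)
    print(csr_in_window(machine, late_csr))
```

Under these assumptions, a CSR created more than half an hour after the Machine falls outside the window, which matches the "not in range" errors reported above.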

praveenkumar added a commit to praveenkumar/openshift4-libvirt-gcp that referenced this issue Jul 5, 2019
As of now because of openshift/installer#1893
csr approval not going through so as a workaround we need to approve it
ourself.
@zeenix
Contributor Author

zeenix commented Jul 8, 2019

I'm removing libvirt label since it doesn't seem to be specific to libvirt any more (#1893 (comment)).

/remove-label platform/libvirt

@zeenix zeenix removed their assignment Jul 8, 2019
@zeenix
Contributor Author

zeenix commented Jul 8, 2019

/unassign @praveenkumar

@zeenix zeenix changed the title e2e-libvirt CI job broken because of CSR not getting approved CSR not getting approved for libvirt and baremetal platforms Jul 8, 2019
@zeenix zeenix changed the title CSR not getting approved for libvirt and baremetal platforms CSR not getting approved for libvirt & baremetal platforms Jul 8, 2019
@DanyC97
Contributor

DanyC97 commented Jul 8, 2019

Is it taking more than 10 minutes between when the Machine resource is created and when the CSR is created? If so, then the auto-approver will reject the request, and the user is required to manually approve the CSR.

Hi @staebler, curious: where in the code do we have this cutoff time of 10 minutes?

@staebler
Contributor

staebler commented Jul 8, 2019

Is it taking more than 10 minutes between when the Machine resource is created and when the CSR is created? If so, then the auto-approver will reject the request, and the user is required to manually approve the CSR.

Hi @staebler, curious: where in the code do we have this cutoff time of 10 minutes?

https://github.com/openshift/cluster-machine-approver/blob/da894d46de80bc1e1fff5256addfc744868f8df3/csr_check.go#L31

@zeenix
Contributor Author

zeenix commented Jul 22, 2019

Not sure why the bot relabeled this as libvirt.

/remove-label platform/libvirt

@abhinavdahiya
Contributor

This seems like it belongs in https://github.com/openshift/cluster-machine-approver
Please open a Bugzilla or an issue against cluster-machine-approver.

/close

@openshift-ci-robot
Contributor

@abhinavdahiya: Closing this issue.

In response to this:

This seems like it belongs in https://github.com/openshift/cluster-machine-approver
Please open a Bugzilla or an issue against cluster-machine-approver.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@zeenix
Contributor Author

zeenix commented Jul 26, 2019

@abhinavdahiya openshift/cluster-machine-approver#36

/remove-priority critical-urgent

@openshift-ci-robot openshift-ci-robot removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jul 26, 2019
@TuranTimur

Hi. For people like me who have no idea what is going on in this thread, please check the link below.

https://github.com/openshift/installer/tree/release-4.2/docs/dev/libvirt#console-doesnt-come-up
