lack of feedback when using invalid pull secret #901

Closed
joelddiaz opened this issue Dec 13, 2018 · 7 comments

Comments

@joelddiaz
Contributor

Version

$ openshift-install version
[jdiaz@minigoomba os-install-0.6.0]$ ./openshift-install version
./openshift-install v0.6.0
Terraform v0.11.10

Platform (aws|libvirt|openstack):

aws

What happened?

Attempted an install with an invalid pull secret. The installer makes a lot of progress and ends up endlessly waiting with these lines repeated:

DEBUG Apply complete! Resources: 157 added, 0 changed, 0 destroyed. 
DEBUG                                              
DEBUG The state of your infrastructure has been saved to the path 
DEBUG below. This state is required to modify and destroy your 
DEBUG infrastructure, so keep it safe. To inspect the complete state 
DEBUG use the `terraform show` command.            
DEBUG                                              
DEBUG State path: /tmp/openshift-install-190141245/terraform.tfstate 
INFO Waiting 30m0s for the Kubernetes API...      
DEBUG Still waiting for the Kubernetes API: Get https://jdiaz-tectonic2-api.new-installer.openshift.com:6443/version?timeout=32s: dial tcp 34.206.246.140:6443: i/o timeout 
DEBUG Still waiting for the Kubernetes API: Get https://jdiaz-tectonic2-api.new-installer.openshift.com:6443/version?timeout=32s: dial tcp 34.206.246.140:6443: connect: connection refused 
DEBUG Still waiting for the Kubernetes API: Get https://jdiaz-tectonic2-api.new-installer.openshift.com:6443/version?timeout=32s: dial tcp 54.211.253.207:6443: connect: connection refused 
DEBUG Still waiting for the Kubernetes API: Get https://jdiaz-tectonic2-api.new-installer.openshift.com:6443/version?timeout=32s: dial tcp 34.196.94.179:6443: connect: connection refused 
DEBUG Still waiting for the Kubernetes API: Get https://jdiaz-tectonic2-api.new-installer.openshift.com:6443/version?timeout=32s: dial tcp 34.236.211.202:6443: connect: connection refused

Jumping into the bootstrap node you can see that it's an auth problem:

[jdiaz@minigoomba Downloads]$ ssh -i ~/.ssh/libra.pem -l core ec2-54-160-13-191.compute-1.amazonaws.com journalctl -b -f -u bootkube.service
-- Logs begin at Thu 2018-12-13 20:00:23 UTC. --
Dec 13 20:13:51 ip-10-0-15-229 systemd[1]: bootkube.service failed.
Dec 13 20:13:56 ip-10-0-15-229 systemd[1]: bootkube.service holdoff time over, scheduling restart.
Dec 13 20:13:56 ip-10-0-15-229 systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Dec 13 20:13:56 ip-10-0-15-229 systemd[1]: Started Bootstrap a Kubernetes cluster.
Dec 13 20:13:56 ip-10-0-15-229 bootkube.sh[12040]: Pulling release image...
Dec 13 20:13:56 ip-10-0-15-229 bootkube.sh[12040]: Trying to pull quay.io/openshift-release-dev/ocp-release:4.0.0-4...Failed
Dec 13 20:13:56 ip-10-0-15-229 bootkube.sh[12040]: error pulling image "quay.io/openshift-release-dev/ocp-release:4.0.0-4": unable to pull quay.io/openshift-release-dev/ocp-release:4.0.0-4: unable to pull image: Error determining manifest MIME type for docker://quay.io/openshift-release-dev/ocp-release:4.0.0-4: unable to retrieve auth token: invalid username/password
Dec 13 20:13:56 ip-10-0-15-229 systemd[1]: bootkube.service: main process exited, code=exited, status=125/n/a
Dec 13 20:13:56 ip-10-0-15-229 systemd[1]: Unit bootkube.service entered failed state.
Dec 13 20:13:56 ip-10-0-15-229 systemd[1]: bootkube.service failed.

What you expected to happen?

Installer should make it more clear that I'm trying to use an invalid pull secret.

How to reproduce it (as minimally and precisely as possible)?

Attempt to install using https://github.com/openshift/installer/releases/download/v0.6.0/openshift-install-linux-amd64 with an invalid pull secret (I changed a couple of characters in my pull secret to get an invalid one).

Anything else we need to know?

?

References

@staebler
Contributor

#711 adds some validation to the format of the pull secret. It won't validate that the pull secret provided will grant the necessary permissions to pull all the images used by the cluster, though.
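To illustrate the gap between format validation and permission validation, here is a minimal sketch of a format-only check. It is not the code from #711; the function name `validate_pull_secret_format` is hypothetical, and the specific checks (a JSON object with an `auths` map whose entries carry a base64 `auth` field decoding to `user:password`) are assumptions based on the usual registry-auth JSON layout.

```python
import base64
import binascii
import json
from typing import List


def validate_pull_secret_format(raw: str) -> List[str]:
    """Return format problems found in a pull secret; an empty list means none found.

    This only checks the shape of the secret. It cannot tell whether the
    registry will accept the credentials.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]

    auths = data.get("auths") if isinstance(data, dict) else None
    if not isinstance(auths, dict) or not auths:
        return ['missing or empty "auths" object']

    problems = []
    for registry, entry in auths.items():
        token = entry.get("auth") if isinstance(entry, dict) else None
        if not isinstance(token, str) or not token:
            problems.append(f'{registry}: missing "auth" field')
            continue
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            problems.append(f"{registry}: auth is not valid base64")
            continue
        if ":" not in decoded:
            problems.append(f"{registry}: auth does not decode to user:password")
    return problems
```

A secret with a couple of characters changed, as in the reproduction steps above, can easily still pass a check like this, which is why a format check alone does not catch the failure reported here.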

@wking
Member

wking commented Dec 15, 2018

#711 has landed. Personally I'm fine closing this based on that PR. We still don't cover the "are my creds sufficient for all the images I'll need?" case, but short of drilling into the update payload and attempting to pull each referenced image (which would be a lot of overhead) I don't see a robust way to pre-check. We may be able to surface these issues though, with something like @crawford's streaming-logs idea or similar. Thoughts?

@joelddiaz
Contributor Author

It's not just the pre-flight checks (which are helpful nonetheless). Having no visibility when the install isn't progressing because of a bad pull secret makes things hard for someone not familiar with the internals of the installation process.

@thomasmckay

This is a very bad experience. All of the pods get stuck in Pending and there is no feedback as to why. In addition to wasting a half hour in raw time, there is the lost time in trying to debug what could possibly be wrong. Throw on top of this that there is no option to re-prompt during create cluster so new users are forced to figure out that problem as well in order to reset the auth.

@wking
Member

wking commented Jan 2, 2019

All of the pods get stuck in Pending and there is no feedback as to why.

Can you post more details from your cluster? What version of the installer did you use? What were your bootkube.service logs? How did you determine there were pending pods? I'd have expected an invalid pull secret to fail to pull the update payload, like @joelddiaz's:

Dec 13 20:13:56 ip-10-0-15-229 bootkube.sh[12040]: Trying to pull quay.io/openshift-release-dev/ocp-release:4.0.0-4...Failed
Dec 13 20:13:56 ip-10-0-15-229 bootkube.sh[12040]: error pulling image "quay.io/openshift-release-dev/ocp-release:4.0.0-4": unable to pull quay.io/openshift-release-dev/ocp-release:4.0.0-4: unable to pull image: Error determining manifest MIME type for docker://quay.io/openshift-release-dev/ocp-release:4.0.0-4: unable to retrieve auth token: invalid username/password

from the topic post. But in that case there would be no cluster at all, so what API would be showing you pending pods?

Throw on top of this that there is no option to re-prompt during create cluster so new users are forced to figure out that problem as well in order to reset the auth.

How are you re-running create cluster? The current recommendation for creating multiple clusters from one config run is to save your install-config.yaml, and in that case it should be fairly straightforward to edit the YAML to replace your broken pull secret with a new value.

... but just having no visibility when the reason the install isn't progressing is due to the bad pull secret makes it hard for someone not familiar with the internals of the installation process.

I agree that increasing visibility here would be good. Currently you only see:

DEBUG Still waiting for the Kubernetes API: Get https://jdiaz-tectonic2-api.new-installer.openshift.com:6443/version?timeout=32s: dial tcp 34.236.211.202:6443: connect: connection refused

so newcomers will need to start at the "Kubernetes API is Unavailable" docs and work down until they get to the "Troubleshooting the Bootstrap Node" docs. Something like @crawford's bootstrap log-streaming would help, but at the moment the installer is just as clueless about why the Kubernetes API isn't coming up as the user is. And I don't see a clear way for the installer to SSH into the bootstrap node to check on things, because it only has the public SSH key (not the private key) and the user may have decided to not configure an SSH key at all.
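Until something like log streaming exists, a user who can SSH to the bootstrap node can at least scan the journal for this failure mechanically. The following is a rough sketch, not installer functionality: the helper name `find_pull_auth_failures` is hypothetical, and the two patterns are taken only from the bootkube.service output quoted in this issue, so other error wordings would be missed.

```python
import re

# Signatures copied from the bootkube.service journal output quoted in this issue.
AUTH_FAILURE = re.compile(r"unable to retrieve auth token: invalid username/password")
PULL_ERROR = re.compile(r'error pulling image "([^"]+)"')


def find_pull_auth_failures(journal_lines):
    """Return the sorted image references whose pull failed with an auth error."""
    images = set()
    for line in journal_lines:
        if AUTH_FAILURE.search(line):
            match = PULL_ERROR.search(line)
            # Fall back to a placeholder if the line has no quoted image reference.
            images.add(match.group(1) if match else "<unknown image>")
    return sorted(images)
```

The input would be the lines produced by a `journalctl -b -u bootkube.service` call on the bootstrap node, such as the SSH command in the original report. A non-empty result points at the pull secret rather than at networking or DNS.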

@wking
Member

wking commented Jan 24, 2019

Cross-linking Bugzilla.

@eparis
Member

eparis commented Feb 20, 2019

Closing this issue as bz is a much better tracking tool.
