
[aws] pulling ocp-release images includes @sha256@sha256 on latest installer 0.9.1 causing installation break #1066

Closed
jatanmalde opened this issue Jan 15, 2019 · 6 comments


@jatanmalde

Version

[root@localhost ~]# openshift-install version
openshift-install v0.9.1
[root@localhost ~]# oc version
oc v4.0.0-0.79.0
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Platform (aws|libvirt|openstack):

aws

What happened?

Hey guys, I am trying to install a cluster with one master and three worker nodes for OCP 4.0 on AWS.
I can see the bootstrap node and the master nodes on the AWS console.
Checking the openshift-install.log file, I can see it failing here:

time="2019-01-14T23:10:06+05:30" level=debug msg="Still waiting for the Kubernetes API: Get https://test-api.aws.cee.redhat.com:6443/version?timeout=32s: dial tcp 18.224.189.175:6443: connect: connection refused"
time="2019-01-14T23:10:57+05:30" level=info msg="API v1.11.0+0583818 up"
time="2019-01-14T23:10:57+05:30" level=info msg="Waiting up to 30m0s for the bootstrap-complete event..."
time="2019-01-14T23:10:57+05:30" level=debug msg="added kube-controller-manager.1579c7d32898b0b7: ip-10-0-1-144_8699fbf7-1823-11e9-83c1-02978bf4bc4e became leader"
time="2019-01-14T23:10:57+05:30" level=debug msg="added kube-scheduler.1579c7d3575f93ce: ip-10-0-1-144_86bb137d-1823-11e9-a99e-02978bf4bc4e became leader"
time="2019-01-15T09:58:28+05:30" level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 92"
time="2019-01-15T09:58:29+05:30" level=warning msg="Failed to connect events watcher: Get https://test-api.aws.cee.redhat.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=92&watch=true: dial tcp 18.224.189.175:6443: connect: connection refused"

Checking the bootstrap node, I could see bootkube.service in a failed state, reporting:

[core@ip-10-0-1-144 ~]$ journalctl -b -f -u bootkube.service
-- Logs begin at Mon 2019-01-14 17:35:11 UTC. --
Jan 15 09:53:32 ip-10-0-1-144 systemd[1]: bootkube.service: main process exited, code=exited, status=125/n/a
Jan 15 09:53:32 ip-10-0-1-144 systemd[1]: Unit bootkube.service entered failed state.
Jan 15 09:53:32 ip-10-0-1-144 systemd[1]: bootkube.service failed.
Jan 15 09:53:37 ip-10-0-1-144 systemd[1]: bootkube.service holdoff time over, scheduling restart.
Jan 15 09:53:37 ip-10-0-1-144 systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Jan 15 09:53:37 ip-10-0-1-144 systemd[1]: Started Bootstrap a Kubernetes cluster.
Jan 15 09:53:37 ip-10-0-1-144 bootkube.sh[3055]: unable to pull quay.io/openshift-release-dev/ocp-release@sha256@sha256:e237499d3b118e25890550daad8b17274af93baf855914a9c6f8f07ebc095dea: error getting default registries to try: invalid reference format

I can see the ocp-release image being pulled, but the @sha256 keyword is repeated in its reference.
Checking the /usr/local/bin/bootkube.sh file, I cannot see any reference to the sha256 value, only the image tag, which works fine when pulled manually.
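For reference, the manual pull that works is by tag alone, something like this (the actual tag comes from bootkube.sh and is not shown here, so the placeholder below is only illustrative):

$ podman pull quay.io/openshift-release-dev/ocp-release:<tag-from-bootkube.sh>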

How can I resume the installation? Do I have to destroy the cluster and rebuild it?

Let me know if you need more logs.

I can see this issue was already reported in #2086, but it is happening with the latest installer as well.

What you expected to happen?

  • The single-master cluster should be up and running on AWS.

@wking
Member

wking commented Jan 15, 2019

Dup of #933 and #1032. We're just waiting on a newer RHCOS for Podman 1.0 and containers/podman#2106.

/close

@openshift-ci-robot
Contributor

@wking: Closing this issue.

In response to this:

Dup of #933 and #1032. We're just waiting on a newer RHCOS for Podman 1.0 and containers/podman#2106.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jatanmalde
Author

@wking I get your point. But how do we troubleshoot this if we hit it again in the future, since I can see you reopened #933?

And is there something we can do about the stuck installation?

@wking
Member

wking commented Jan 15, 2019

But how do we troubleshoot this if we hit it again in the future...

You comment in #933?

And is there something we can do about the stuck installation?

Destroy it and launch a new cluster. I dunno why some folks see this more than others, but you can apply #1032 locally if you get tired of it and can't wait for a RHCOS with Podman 1.0 and its fix.
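If you want to check whether the RHCOS build on your bootstrap node already ships a fixed Podman before bothering with the local workaround, something like this on the bootstrap node should tell you (just a sanity check, not the fix itself):

$ podman version
$ rpm -q podman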

@jatanmalde
Author

@wking I replied on the other issue, and went ahead and made the changes to bootkube.sh on the bootstrap node. I can see the service restarting:

[root@ip-10-0-1-144 ~]# systemctl status bootkube.service
● bootkube.service - Bootstrap a Kubernetes cluster
   Loaded: loaded (/etc/systemd/system/bootkube.service; static; vendor preset: disabled)
   Active: active (running) since Wed 2019-01-16 09:13:40 UTC; 2min 10s ago
 Main PID: 15506 (bash)
   Memory: 17.7M
   CGroup: /system.slice/bootkube.service
           ├─15506 bash /usr/local/bin/bootkube.sh
           └─17231 podman run --rm --volume /var/opt/openshift:/assets:z --volume /etc/kubernetes:/etc/kubernetes:z --network=host quay.io/openshift-release-dev/ocp-v4.0@sha256:8e6fdc3f01...

Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/openshift-service-signer-secret.yaml Secret openshift-service-cert-signer/service-serving-ce...eady exists
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/pull.json Secret kube-system/coreos-pull-secret: secrets "coreos-pull-secret" already exists
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/secret-aggregator-client.yaml Secret openshift-kube-apiserver/aggregator-client: secrets "ag...eady exists
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/secret-cluster-signing-ca.yaml Secret openshift-kube-controller-manager/cluster-signing-ca: ...eady exists
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/secret-kubeconfig.yaml Secret openshift-kube-controller-manager/controller-manager-kubeconfi...eady exists
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/secret-kubelet-client.yaml Secret openshift-kube-apiserver/kubelet-client: secrets "kubelet-...eady exists
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/secret-service-account-private-key.yaml Secret openshift-kube-controller-manager/service-acc...eady exists
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/secret-serving-cert.yaml Secret openshift-kube-apiserver/serving-cert: secrets "serving-cert...eady exists
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: NOTE: Bootkube failed to create some cluster assets. It is important that manifest errors are resolved and resubmitted to the apiserver.
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: For example, after resolving issues: kubectl create -f <failed-manifest>
Hint: Some lines were ellipsized, use -l to show in full.

[root@ip-10-0-1-144 ~]# journalctl -b -f -u bootkube.service
-- Logs begin at Tue 2019-01-15 10:06:43 UTC. --
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/openshift-service-signer-secret.yaml Secret openshift-service-cert-signer/service-serving-cert-signer-signing-key: secrets "service-serving-cert-signer-signing-key" already exists
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/pull.json Secret kube-system/coreos-pull-secret: secrets "coreos-pull-secret" already exists
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/secret-aggregator-client.yaml Secret openshift-kube-apiserver/aggregator-client: secrets "aggregator-client" already exists
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/secret-cluster-signing-ca.yaml Secret openshift-kube-controller-manager/cluster-signing-ca: secrets "cluster-signing-ca" already exists
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/secret-kubeconfig.yaml Secret openshift-kube-controller-manager/controller-manager-kubeconfig: secrets "controller-manager-kubeconfig" already exists
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/secret-kubelet-client.yaml Secret openshift-kube-apiserver/kubelet-client: secrets "kubelet-client" already exists
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/secret-service-account-private-key.yaml Secret openshift-kube-controller-manager/service-account-private-key: secrets "service-account-private-key" already exists
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/secret-serving-cert.yaml Secret openshift-kube-apiserver/serving-cert: secrets "serving-cert" already exists
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: NOTE: Bootkube failed to create some cluster assets. It is important that manifest errors are resolved and resubmitted to the apiserver.
Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: For example, after resolving issues: kubectl create -f <failed-manifest>

But how do I proceed now? Do I need to run the installer again in the same directory, or just go onto the master and check the status of the operators?

What do you suggest?

@wking
Member

wking commented Jan 16, 2019

Jan 16 09:14:06 ip-10-0-1-144 bootkube.sh[15506]: Failed creating /assets/manifests/openshift-service-signer-secret.yaml Secret openshift-service-cert-signer/service-serving-cert-signer-signing-key: secrets "service-serving-cert-signer-signing-key" already exists

These mean an earlier round of bootkube.service died after having pushed some resources. You should reach farther back into the bootkube.service log to find that original error.
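For example, dropping -f and the pager lets you read the unit log for the whole boot from the start and hunt for the first failure (the grep is just one way to spot it):

$ journalctl -b -u bootkube.service --no-pager
$ journalctl -b -u bootkube.service --no-pager | grep -i -m1 error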

Do I need to run the installer again in the same directory, or just go onto the master and check the status of the operators?

You should clean up and start over from scratch. The bootkube service will never recover from this situation, and it's likely that the cluster is missing important resources. You could probably force things through manually, but it's less work to just launch a fresh cluster.
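Roughly, that means tearing down with the installer and creating a new cluster from a clean asset directory, along these lines (the directory names here are only placeholders):

$ openshift-install destroy cluster --dir ./mycluster
$ openshift-install create cluster --dir ./mycluster-new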
