Cluster deployment does not finish on vsphere #1884
This looks to be an occurrence of a bug that happens occasionally with the installation. The memory requirements look to be sufficient. Would you be willing to retry the installation?
I retried the installation. I used the https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.1.3/openshift-install-linux-4.1.3.tar.gz installer. But obviously I shouldn't have:
I'll try again now with 4.1.2.
OK. Retrying with 4.1.2 comes up to this stage:
I'll try and dig a little deeper. If you have any hints for me, feel free :-)
Now this:
What does |
Not even 4.1.0 works:
Why is even 4.1.0 pulled from the "fast" channel? I'd have expected "channel: stable-4.1". Where is this channel selected?
Have you set a storage backend for the image registry? See https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/html/installing/installing-on-vsphere#installation-operators-config_installing-vsphere. For the cluster operators that are not available, you can get the reason for the failure from the yaml for the operator. For example, |
I haven't set up a storage backend because the cluster did not complete its setup yet. Should I proceed anyway? The prerequisite seems to be a "provisioned persistent volume (PV) with ReadWriteMany access mode, such as NFS." How do I provision such a volume (and why do I have to do it when I gave the vsphere user the required rights beforehand)? "Verify you do not have a registry pod:"
I already have one, obviously. "Check the registry configuration:"
You need to set the storage backend for the image registry in order for the installation to complete. The image-registry operator will not become available until the storage backend has been configured. For the production use case, you need to provision your own storage because the vSphere cloud provider does not support ReadWriteMany access for its storage. If this is for non-production purposes, you can set the storage backend to emptyDir.
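For the non-production route just described, the registry operator's config object is pointed at ephemeral storage. A minimal sketch of what that config would look like (resource names as documented for the image registry operator; applied via `oc edit` or an `oc patch`):

```yaml
# Sketch (non-production only): back the image registry with emptyDir.
# Data is lost whenever the registry pod restarts.
apiVersion: imageregistry.operator.openshift.io/v1
kind: Config
metadata:
  name: cluster
spec:
  storage:
    emptyDir: {}
```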
For the errors in the other operators, do you have a wildcard DNS entry for the Ingress router pods? This is by default a |
I have NFS available in that environment (I guess, there's a NetApp somewhere). How/Where would I configure the mountpoint for that nfs volume?
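In general terms, the provisioning mentioned earlier is done by creating a PersistentVolume object that describes the NFS export (created with `oc create -f`). A sketch, in which the server address, export path, name, and size are all hypothetical placeholders:

```yaml
# Sketch of an NFS-backed PV for the image registry.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: registry-pv          # hypothetical name
spec:
  capacity:
    storage: 100Gi           # hypothetical size
  accessModes:
  - ReadWriteMany            # the access mode the registry prerequisite asks for
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: nfs.example.com  # hypothetical NFS server (e.g. the NetApp)
    path: /exports/registry  # hypothetical export path
```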
Regarding the DNS wildcard record: yes, it exists and resolves to a VIP that balances across the worker nodes.
OK. From the co list, it looks like the ingress operator is having problems. What is the output of |
What about |
Do you have worker VMs that are not getting added as nodes? Or do you not have any worker VMs? A handful of operators will not function without worker nodes.
Also, it looks like your machines do not have hostnames that are resolvable by the other machines. It looks like the hostnames are using the default
The hostname is used as the node name for the machine. If all of the machines have the same hostname of |
I have worker nodes, and a kubelet gets deployed on them.
We have changed that now. Is there a recommendation for the names? Does it have to be an FQDN, or is "worker1" sufficient?
You can use any name you like, so long as the other machines can resolve the name to an IP address. The node name has a limit of 64 characters. If your FQDN fits within that limit, then you can use that. If you use a shortname, then you can configure your machines to have a DNS search domain for your domain. Personally, I use shortnames of control-plane-0, control-plane-1, compute-0, etc.
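The search-domain approach mentioned above boils down to resolver configuration on each machine. A sketch of the relevant lines, with placeholder domain and nameserver values (on RHCOS this file is generated, so in practice the search domain would normally be delivered via DHCP or the node's network configuration):

```
# /etc/resolv.conf (illustrative placeholder values)
search example.com
nameserver 192.0.2.53
```

With this in place, a lookup of the shortname "worker1" is retried as "worker1.example.com".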
That was a big part of the problem.
authentication gives this error now:
console this:
and obviously:
The kube-apiserver operator is still progressing. Things may settle down some after it completes its progression. If that operator does not progress fully, please add the yaml for that operator.
I seem to have a similar issue on Azure (#1817). Simply put, the Prometheus operator is not installing the CRDs (servicemonitors.monitoring.coreos.com):
Can you point me to the place in the code where this gets executed?
Everything (except console and authentication) is AVAILABLE now:
OK - console problem was a loadbalancer configuration issue - current status:
That fixed authentication as well:
Thank you for your relentless support :-)
Whew! I'm glad that it all worked out for you in the end. Sorry that it wasn't a smoother journey. I will take some of the pitfalls that you ran into as a cause for improving areas of the docs.
Thanks for the info!! This type of info I'd expect to be in the docs, as I bet folks will enable it and then... surprise surprise.
It's in the docs: https://docs.openshift.com/container-platform/4.1/storage/understanding-persistent-storage.html#pv-access-modes_understanding-persistent-storage
I like to provide an nfs-share for the registry - but I have to use a second network interface for this. Will that work with the provided ignition file or do I have to modify it? If yes, would that be along the lines of https://coreos.com/os/docs/latest/network-config-with-networkd.html? Can this be used with RHCOS?
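Along the lines of the linked networkd doc, a second interface would get its own unit file. A sketch with a hypothetical interface name and addressing (whether RHCOS honors networkd units the same way Container Linux does is not confirmed here, which is part of the question above):

```
# /etc/systemd/network/25-storage.network (hypothetical unit)
[Match]
Name=ens224                # hypothetical second NIC

[Network]
Address=192.168.50.10/24   # hypothetical storage-network address
```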
I've transferred this to another issue (#1943). This issue can be closed.
Version
Platform (aws|libvirt|openstack):
Platform is vsphere
[root@console ~]# govc about
Name: VMware vCenter Server
Vendor: VMware, Inc.
Version: 6.7.0
Build: 13007421
OS type: linux-x64
API type: VirtualCenter
API version: 6.7.2
Product ID: vpx
UUID: 1d884c6e-a1ac-4daf-9e25-b197e7f6bd91
VMs have a hw version of 15.
What happened?
The cluster does not finish deployment.
[root@console]# oc --config=/root/sgl-1/auth/kubeconfig get clusterversion -oyaml
apiVersion: v1
items:
- kind: ClusterVersion
  metadata:
    creationTimestamp: "2019-06-21T10:11:17Z"
    generation: 1
    name: version
    resourceVersion: "180651"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: e8d08059-940c-11e9-91f9-005056acf7ce
  spec:
    channel: stable-4.1
    clusterID: 8ff0010a-1f47-4f05-a555-3e7ef1321d70
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - status: "False"
      type: Available
    - status: "False"
      type: Failing
    - message: 'Working towards 4.1.2: 80% complete'
      status: "True"
      type: Progressing
    - status: "True"
      type: RetrievedUpdates
    desired:
      force: false
      image: quay.io/openshift-release-dev/ocp-release@sha256:9c5f0df8b192a0d7b46cd5f6a4da2289c155fd5302dec7954f8f06c878160b8b
      version: 4.1.2
    history:
    - image: quay.io/openshift-release-dev/ocp-release@sha256:9c5f0df8b192a0d7b46cd5f6a4da2289c155fd5302dec7954f8f06c878160b8b
      startedTime: "2019-06-21T10:11:47Z"
      state: Partial
      verified: false
      version: 4.1.2
    observedGeneration: 1
    versionHash: CGRQCirWw8Y=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
[root@console sgl-1]# oc --config=/root/sgl-1/auth/kubeconfig get clusterversion version -o=jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{" "}{.message}{"\n"}{end}'
Available False
Failing False
Progressing True Working towards 4.1.2: 73% complete
RetrievedUpdates True
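The jsonpath one-liner above just flattens the status conditions into "TYPE STATUS MESSAGE" lines. The same summary can be sketched in Python against `-o json` output parsed into a dict; the sample data below mirrors the conditions shown in this issue:

```python
# Sketch: summarize ClusterVersion status conditions the way the
# jsonpath query above does. Input is the parsed JSON/YAML of
# `oc get clusterversion version -o json` (sample dict used here).
def summarize_conditions(clusterversion):
    """Return one 'TYPE STATUS [MESSAGE]' line per status condition."""
    lines = []
    for cond in clusterversion.get("status", {}).get("conditions", []):
        parts = [cond.get("type", ""), cond.get("status", "")]
        if cond.get("message"):
            parts.append(cond["message"])
        lines.append(" ".join(parts))
    return lines

sample = {
    "status": {
        "conditions": [
            {"type": "Available", "status": "False"},
            {"type": "Failing", "status": "False"},
            {"type": "Progressing", "status": "True",
             "message": "Working towards 4.1.2: 73% complete"},
            {"type": "RetrievedUpdates", "status": "True"},
        ]
    }
}
print("\n".join(summarize_conditions(sample)))
```

This is just a convenience for reading the conditions at a glance; it does not replace inspecting the full operator yaml when something is stuck.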
[root@console sgl-1]# oc --config=/root/sgl-1/auth/kubeconfig get clusteroperator
NAME                                 VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential                     4.1.2     True        False         False      102m
cluster-autoscaler                   4.1.2     True        False         False      102m
dns                                  4.1.2     True        False         False      11m
kube-apiserver                       4.1.2     False       True          False      101m
kube-controller-manager              4.1.2     False       True          False      102m
kube-scheduler                       4.1.2     False       True          False      101m
machine-api                          4.1.2     True        False         False      102m
machine-config                       4.1.2     False       True          False      102m
network                              4.1.2     True        False         False      102m
openshift-apiserver                  4.1.2     Unknown     Unknown       False      102m
openshift-controller-manager                   False       True          False      101m
operator-lifecycle-manager           4.1.2     True        False         False      98m
operator-lifecycle-manager-catalog   4.1.2     True        False         False      98m
service-ca                                     True        True          False      101m
The master nodes reboot pretty often for reasons unknown to me. They have the required resources regarding memory and CPU according to the docs. Those VMs are the only ones on the vsphere cluster. Basically each VM runs on its own ESX host.
What you expected to happen?
I expect the openshift cluster setup to succeed.
Anything else we need to know?
The vcenter is flagging the VMs "red" because of their memory consumption. Is more memory (than 16 GB) needed on the master nodes?
As you see, I could not execute the above commands fast enough before the master nodes rebooted themselves. That's why we once see "Working towards 4.1.2: 80% complete" and the second time "Progressing True Working towards 4.1.2: 73% complete". I guess we do not reach 100%. The question is why (and what progress is indicated here)?
I'm available for installation debugging on vsphere as this is a non-production cluster.
Thanx for your help.