
OKD 4.5 GA installation fails on UPI #292

Closed
youdzinn opened this issue Aug 8, 2020 · 3 comments

Labels
triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@youdzinn

youdzinn commented Aug 8, 2020

Describe the bug
Hi all, I'm trying to install OKD 4.5 GA on KVM VMs, following these guides: vSphere, Proxmox.
There are 3 masters and no worker config, and after wait-for bootstrap-complete the nodes look just fine:

NAME                STATUS   ROLES           AGE   VERSION
mgr-0.okd.lifo.ml   Ready    master,worker   50m   v1.18.3
mgr-1.okd.lifo.ml   Ready    master,worker   51m   v1.18.3
mgr-2.okd.lifo.ml   Ready    master,worker   50m   v1.18.3
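For reference, the bootstrap wait mentioned above is the installer's wait-for subcommand; a minimal sketch, assuming the install assets live in a directory named install-dir:

openshift-install wait-for bootstrap-complete --dir=install-dir --log-level=info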

But the authentication and kube-* operators are in a degraded state:

NAME                                       VERSION                            AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                                Unknown     Unknown       True       56m
cloud-credential                           4.5.0-0.okd-2020-07-14-153706-ga   True        False         False      64m
cluster-autoscaler
config-operator
console
csi-snapshot-controller                    4.5.0-0.okd-2020-07-14-153706-ga   True        False         False      44m
dns                                        4.5.0-0.okd-2020-07-14-153706-ga   True        False         False      53m
etcd                                       4.5.0-0.okd-2020-07-14-153706-ga   True        False         False      51m
image-registry                             4.5.0-0.okd-2020-07-14-153706-ga   True        False         False      45m
ingress                                    4.5.0-0.okd-2020-07-14-153706-ga   True        False         False      41m
insights                                   4.5.0-0.okd-2020-07-14-153706-ga   True        False         False      47m
kube-apiserver                             4.5.0-0.okd-2020-07-14-153706-ga   True        True          True       51m
kube-controller-manager                    4.5.0-0.okd-2020-07-14-153706-ga   True        True          True       52m
kube-scheduler                             4.5.0-0.okd-2020-07-14-153706-ga   True        True          True       51m

It seems to be a certificate issue:

# oc get co authentication -oyaml 
... trim...
status:
  conditions:
  - lastTransitionTime: "2020-08-08T22:27:23Z"
    message: 'RouteHealthDegraded: failed to GET route: x509: certificate is valid
      for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local,
      openshift, openshift.default, openshift.default.svc, openshift.default.svc.cluster.local,
      172.30.0.1, not oauth-openshift.apps.okd.lifo.ml'
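A quick way to see which certificate the oauth route endpoint is actually serving (hostname taken from the error above; running this from a host that resolves *.apps.okd.lifo.ml is assumed):

echo | openssl s_client -connect oauth-openshift.apps.okd.lifo.ml:443 -servername oauth-openshift.apps.okd.lifo.ml 2>/dev/null | openssl x509 -noout -subject -issuer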

Version
Proxmox 6.2-10
FCOS 32.20200715.3.0
OKD 4.5.0-0.okd-2020-07-14-153706-ga

How reproducible
100%

Log bundle
bundle

install-config.yaml

apiVersion: v1
baseDomain: lifo.ml
compute:
- name: worker
  replicas: 0
controlPlane:
  name: master
  replicas: 3
metadata:
  name: okd
networking:
  clusterNetworks:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  none: {}
pullSecret: '{"auths":{"fake":{"auth": "bar"}}}'
sshKey: 'ssh-rsa AAA...'

Any hint is appreciated, thanks in advance.

@vrutkovs
Member

You should tear down the bootstrap node after wait-for bootstrap-complete - it is no longer necessary, and leaving it in place prevents the LB from using the API server on the masters.

Could you check that the route resolves correctly on the nodes after the bootstrap machine is destroyed (and the LB is updated to no longer reference it)?
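Something along these lines, run against one of the masters, should show both the DNS answer and what responds on the route (node name taken from the oc get nodes output above; getent is used instead of dig since dig may not be present on the FCOS host):

oc debug node/mgr-0.okd.lifo.ml -- chroot /host sh -c 'getent hosts oauth-openshift.apps.okd.lifo.ml; curl -kI https://oauth-openshift.apps.okd.lifo.ml/healthz'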

@vrutkovs added the triage/needs-information label Aug 13, 2020
@youdzinn
Author

Hi @vrutkovs, first of all thanks for the response.

After numerous tries and tedious investigation I found that the master VMs had only two CPUs configured (obviously my bad!). That was probably what prevented the kube-* pods from starting properly; for example, oc get co -oyaml kube-apiserver returned the following:

status:
  conditions:
  - lastTransitionTime: "2020-08-13T05:44:46Z"
    message: "NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded:
      \nStaticPodsDegraded: pods \"kube-apiserver-mgr-2.okd.lifo.ml\" not found"
    reason: NodeInstaller_InstallerPodFailed::StaticPods_Error
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-08-13T05:42:47Z"
    message: 'NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at
      revision 6; 0 nodes have achieved new revision 7'
    reason: NodeInstaller
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-08-13T06:45:53Z"
    message: 'StaticPodsAvailable: 2 nodes are active; 1 nodes are at revision 0;
      2 nodes are at revision 6; 0 nodes have achieved new revision 7'
    reason: AsExpected
    status: "True"
    type: Available

Redeploying these pods didn't do the trick:

oc patch kubeapiserver/cluster --type merge -p "{\"spec\":{\"forceRedeploymentReason\":\"Forcing new revision with random number $RANDOM to make message unique\"}}"
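To check whether a forced redeployment actually rolls a node forward, the per-node revision status can be read back from the operator resource; a sketch using the nodeStatuses fields of the kubeapiserver operator's status:

oc get kubeapiserver cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" "}{.currentRevision}{"/"}{.targetRevision}{"\n"}{end}'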

But after changing the master VM config to 3 CPUs, the installation completed in about 30 minutes, so CPU really is crucial here. The conclusion: one must not ignore the minimum resource requirements :)

I think we can close the issue.

@vrutkovs
Member

Ooh, right, I was wondering why the kube-apiserver operator was degraded. The resource requirements are not in fact "minimal" but rather "a sane number that will keep working throughout the cluster's lifetime". You might want to adjust the masters to 4 vCPUs and 16 GB RAM so the cluster won't get stuck during an upgrade.
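On Proxmox, a sketch of that change for one master VM (the VM ID 101 is a placeholder; the new values take effect after the VM is restarted):

qm set 101 --cores 4 --memory 16384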

Okay, let's close it.
