
OKD 4.5 GA installation fails on UPI #292

Closed
youdzinn opened this issue Aug 8, 2020 · 3 comments

Labels
triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@youdzinn

youdzinn commented Aug 8, 2020

Describe the bug
Hi all, I'm trying to install OKD 4.5 GA on KVM VMs, following these guides: vSphere, Proxmox.
There are 3 masters and no worker config, and after wait-for bootstrap-complete the nodes look just fine:

NAME                STATUS   ROLES           AGE   VERSION
mgr-0.okd.lifo.ml   Ready    master,worker   50m   v1.18.3
mgr-1.okd.lifo.ml   Ready    master,worker   51m   v1.18.3
mgr-2.okd.lifo.ml   Ready    master,worker   50m   v1.18.3
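For reference, the bootstrap wait mentioned above is the installer's wait-for subcommand; a minimal sketch, assuming the install assets live in a directory named install-dir:

openshift-install wait-for bootstrap-complete --dir=install-dir --log-level=info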

But the authentication and kube-* operators are in a degraded state:

NAME                                       VERSION                            AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                                Unknown     Unknown       True       56m
cloud-credential                           4.5.0-0.okd-2020-07-14-153706-ga   True        False         False      64m
cluster-autoscaler
config-operator
console
csi-snapshot-controller                    4.5.0-0.okd-2020-07-14-153706-ga   True        False         False      44m
dns                                        4.5.0-0.okd-2020-07-14-153706-ga   True        False         False      53m
etcd                                       4.5.0-0.okd-2020-07-14-153706-ga   True        False         False      51m
image-registry                             4.5.0-0.okd-2020-07-14-153706-ga   True        False         False      45m
ingress                                    4.5.0-0.okd-2020-07-14-153706-ga   True        False         False      41m
insights                                   4.5.0-0.okd-2020-07-14-153706-ga   True        False         False      47m
kube-apiserver                             4.5.0-0.okd-2020-07-14-153706-ga   True        True          True       51m
kube-controller-manager                    4.5.0-0.okd-2020-07-14-153706-ga   True        True          True       52m
kube-scheduler                             4.5.0-0.okd-2020-07-14-153706-ga   True        True          True       51m

It seems to be a certificate issue:

# oc get co authentication -oyaml 
... trim...
status:
  conditions:
  - lastTransitionTime: "2020-08-08T22:27:23Z"
    message: 'RouteHealthDegraded: failed to GET route: x509: certificate is valid
      for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local,
      openshift, openshift.default, openshift.default.svc, openshift.default.svc.cluster.local,
      172.30.0.1, not oauth-openshift.apps.okd.lifo.ml'
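A quick way to see which certificate the oauth route endpoint is actually serving (hostname taken from the error above; running this from a host that resolves *.apps.okd.lifo.ml is assumed):

echo | openssl s_client -connect oauth-openshift.apps.okd.lifo.ml:443 -servername oauth-openshift.apps.okd.lifo.ml 2>/dev/null | openssl x509 -noout -subject -issuer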

Version
Proxmox 6.2-10
FCOS 32.20200715.3.0
OKD 4.5.0-0.okd-2020-07-14-153706-ga

How reproducible
100%

Log bundle
bundle

install-config.yaml

apiVersion: v1
baseDomain: lifo.ml
compute:
- name: worker
  replicas: 0
controlPlane:
  name: master
  replicas: 3
metadata:
  name: okd
networking:
  clusterNetworks:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  none: {}
pullSecret: '{"auths":{"fake":{"auth": "bar"}}}'
sshKey: 'ssh-rsa AAA...'

Any hint is appreciated, thanks in advance.

@vrutkovs
Member

You should tear down the bootstrap node after wait-for bootstrap-complete - it is no longer necessary, and leaving it in place prevents the LB from using the API server on the masters.

Could you check that the route resolves correctly on the nodes after the bootstrap machine is destroyed (and the LB is updated to no longer reference it)?
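Something along these lines, run against one of the masters, should show both the DNS answer and what responds on the route (node name taken from the oc get nodes output above; getent is used instead of dig since dig may not be present on the FCOS host):

oc debug node/mgr-0.okd.lifo.ml -- chroot /host sh -c 'getent hosts oauth-openshift.apps.okd.lifo.ml; curl -kI https://oauth-openshift.apps.okd.lifo.ml/healthz'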

@vrutkovs added the triage/needs-information label Aug 13, 2020
@youdzinn
Author

Hi @vrutkovs, first of all thanks for the response.

After numerous tries and tedious investigation I found that the master VMs had only two CPUs configured (obviously my bad!). That was probably what prevented the kube-* pods from starting properly; for example, oc get co -oyaml kube-apiserver returned the following:

status:
  conditions:
  - lastTransitionTime: "2020-08-13T05:44:46Z"
    message: "NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded:
      \nStaticPodsDegraded: pods \"kube-apiserver-mgr-2.okd.lifo.ml\" not found"
    reason: NodeInstaller_InstallerPodFailed::StaticPods_Error
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-08-13T05:42:47Z"
    message: 'NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at
      revision 6; 0 nodes have achieved new revision 7'
    reason: NodeInstaller
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-08-13T06:45:53Z"
    message: 'StaticPodsAvailable: 2 nodes are active; 1 nodes are at revision 0;
      2 nodes are at revision 6; 0 nodes have achieved new revision 7'
    reason: AsExpected
    status: "True"
    type: Available

Redeploying these pods didn't do the trick:

oc patch kubeapiserver/cluster --type merge -p "{\"spec\":{\"forceRedeploymentReason\":\"Forcing new revision with random number $RANDOM to make message unique\"}}"
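To check whether a forced redeployment actually rolls a node forward, the per-node revision status can be read back from the operator resource; a sketch using the nodeStatuses fields of the kubeapiserver operator's status:

oc get kubeapiserver cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" "}{.currentRevision}{"/"}{.targetRevision}{"\n"}{end}'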

But after changing the master VM config to 3 CPUs, the installation completed in about 30 minutes, so CPU really is crucial here. The conclusion: one must not ignore the minimum resource requirements :)

I think we can close the issue.

@vrutkovs
Member

Ooh, right, I was wondering why the kube-apiserver operator was degraded. The resource requirements are not in fact "minimal" but rather "a sane number that will keep working throughout the cluster's lifetime". You might want to adjust the masters to 4 vCPUs and 16 GB RAM so the cluster won't get stuck during an upgrade.
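On Proxmox, a sketch of that change for one master VM (the VM ID 101 is a placeholder; the new values take effect after the VM is restarted):

qm set 101 --cores 4 --memory 16384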

Okay, let's close it.
