
Console pod CrashLoopBackOff: 'secrets "webconsole-serving-cert" not found' #37

Closed
naumannt opened this Issue Apr 26, 2018 · 18 comments


naumannt commented Apr 26, 2018

I used the AWS playbooks located here to set up an OpenShift cluster; I have just installed the master node and wanted to make sure everything is running fine. The external master ELB leads to the error message detailed in this issue, so I tried to install the console from the template linked in this repo's README.

Used configuration file:

apiVersion: webconsole.config.openshift.io/v1
kind: WebConsoleConfiguration
clusterInfo:
  consolePublicURL: https://ec****.eu-central-1.compute.amazonaws.com:8443/console/
  loggingPublicURL: ""
  logoutPublicURL: ""
  masterPublicURL: https://ec****.eu-central-1.compute.amazonaws.com:8443
  metricsPublicURL: ""
extensions:
  scriptURLs: []
  stylesheetURLs: []
  properties: null
features:
  inactivityTimeoutMinutes: 0
  clusterResourceOverridesEnabled: false
servingInfo:
  bindAddress: 0.0.0.0:8443
  bindNetwork: tcp4
  certFile: /root/certs/tls.crt
  clientCA: ""
  keyFile: /root/certs/tls.key
  maxRequestsInFlight: 0
  namedCertificates: null
  requestTimeoutSeconds: 0
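
(For reference: the stock config shipped with the template appears to point certFile/keyFile at the secret mounted into the pod rather than at host paths; a hedged sketch, inferred from the /var/serving-cert errors reported later in this thread:)

servingInfo:
  bindAddress: 0.0.0.0:8443
  certFile: /var/serving-cert/tls.crt   # served from the mounted webconsole-serving-cert secret
  keyFile: /var/serving-cert/tls.key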

It's difficult to get a clear error message from the pod; it apparently just prints the config file when I run 'oc logs -f '.
Event log:

Events:
  Type     Reason                 Age                 From                                                     Message
  ----     ------                 ----                ----                                                     -------
  Normal   Scheduled              28m                 default-scheduler                                        Successfully assigned webconsole-5d4cfd95db-7m49z to ip-****.eu-central-1.compute.internal
  Normal   SuccessfulMountVolume  28m                 kubelet, ip-****.eu-central-1.compute.internal  MountVolume.SetUp succeeded for volume "webconsole-config"
  Normal   SuccessfulMountVolume  28m                 kubelet, ip-****.eu-central-1.compute.internal  MountVolume.SetUp succeeded for volume "webconsole-token-2spzk"
  Warning  FailedMount            28m                 kubelet, ip-****.eu-central-1.compute.internal  MountVolume.SetUp failed for volume "serving-cert" : secrets "webconsole-serving-cert" not found
  Normal   SuccessfulMountVolume  28m                 kubelet, ip-****.eu-central-1.compute.internal  MountVolume.SetUp succeeded for volume "serving-cert"
  Normal   Created                28m (x4 over 28m)   kubelet, ip-****.eu-central-1.compute.internal  Created container
  Normal   Started                28m (x4 over 28m)   kubelet, ip-****.eu-central-1.compute.internal  Started container
  Normal   Pulled                 27m (x5 over 28m)   kubelet, ip-****.eu-central-1.compute.internal  Container image "openshift/origin-web-console:latest" already present on machine
  Warning  BackOff                3m (x119 over 28m)  kubelet, ip-****.eu-central-1.compute.internal  Back-off restarting failed container

Why is there no clearer documentation on this? It's a pretty big change from earlier versions (the console no longer comes with the master by default).
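
(A hedged diagnostic sketch, not part of the original report: the FailedMount warning suggests the serving-cert secret was never generated, which can be checked directly. The annotation name in the comment below is the usual one for the 3.x service-serving-cert signer and is an assumption here.)

# Does the generated secret exist?
oc get secret webconsole-serving-cert -n openshift-web-console

# Is the webconsole service annotated so the signer will create the secret?
# (typically service.alpha.openshift.io/serving-cert-secret-name)
oc get svc webconsole -n openshift-web-console -o yaml | grep -i serving-cert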


naumannt commented Apr 26, 2018

For some unknown reason (no changes) I can now see a clearer error message in the pod:

Error: AssetConfig.webconsole.config.openshift.io "" is invalid: [config.servingInfo.certFile: Invalid value: "/root/certs/tls.crt": could not read file: stat /root/certs/tls.crt: no such file or directory, config.servingInfo.keyFile: Invalid value: "/root/certs/tls.key": could not read file: stat /root/certs/tls.key: no such file or directory]

I'm not entirely sure why this is the case; the certs are present at those paths on the host. Is it because the pod is not able to access the file system the way I expected? How do I get the certs in there if not via a path?
(Again, documentation for the config file would be great.)


spadgett commented Apr 27, 2018

@naumannt Can you include the last part of the output from Ansible you saw if you still have it? It should print a lot of debugging info when the console install fails, which would tell us what went wrong.

The issue you pointed to occurred because users had installed an earlier OpenShift version with playbooks from the master branch. You're probably hitting another problem. It would be good to understand what went wrong in case it's a bug in the AWS playbooks.

You should not need to install the template manually... It is possible to just run the console playbook with your inventory file if needed.
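
(For reference, a hedged sketch of what running only the console playbook might look like; the playbook path is an assumption and can differ between openshift-ansible releases:)

ansible-playbook -i /path/to/inventory playbooks/openshift-web-console/config.yml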

I'm not entirely sure why this is the case; the certs are in the folders. Is it because the pod is not able to access the file system the way I expected? How do I put the certs in there if not via a path?
(again documentation for the config file would be great)

This is not a property you should need to change... These certs are not what users see in the browser.

Since the console is running in a pod, it won't be able to access files on the master filesystem. It's using a generated cert that is mounted in from a secret. Can you try to revert the changes you made to those paths and apply the template again?
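
(For context, a hedged sketch of how the generated cert reaches the pod; the stanzas below are illustrative, with the mount path taken from the /var/serving-cert errors seen later in this thread, not copied from the actual template:)

containers:
- name: webconsole
  volumeMounts:
  - name: serving-cert
    mountPath: /var/serving-cert       # the console config's certFile/keyFile point here
    readOnly: true
volumes:
- name: serving-cert
  secret:
    secretName: webconsole-serving-cert   # generated by the service serving-cert signer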


naumannt commented Apr 27, 2018

I think the master software installation ran without any problems, so I have no logs saved from that run. I only noticed that the console was missing when I checked afterwards.
I don't really have an inventory file (the referenced playbook folder builds an inventory dynamically based on AWS resources, afaik).
Will try running the console template again with the original paths asap.


naumannt commented May 14, 2018

Finally got around to running the template with the default paths; the only things I changed from the original configuration are the master and console hostnames.

Result stays the same:

F0514 09:44:11.667874       1 console.go:35] unable to load server certificate: open /var/serving-cert/tls.crt: permission denied

spadgett commented May 14, 2018

@naumannt Can you see if this fixes it for you?

https://access.redhat.com/solutions/3428351

@sdodson fyi

spadgett self-assigned this May 14, 2018


naumannt commented May 14, 2018

I used oc adm policy add-scc-to-user privileged system:serviceaccount:openshift-web-console:webconsole. I'm not entirely sure whether that worked because I can't confirm it with oc describe sa webconsole (it doesn't say anything about privileged access).
Still crash-looping with the same error message, though.
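
(A hedged aside: oc describe sa doesn't list SCCs, because in 3.x SCC membership is recorded on the SCC object itself. One way to check, matching the SCC dumps later in this thread:)

# See whether the webconsole service account appears in the SCC's users list
oc get scc privileged -o yaml | grep -A10 '^users:'
oc get scc anyuid -o yaml | grep -A10 '^users:'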


spadgett commented May 14, 2018

Yeah, the console shouldn't need privileged access. Apparently there is a bug in the install where the service account is incorrectly added to the anyuid SCC. Can you try removing anyuid and using restricted instead? (See the link above.)
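
(A hedged sketch of commands matching this suggestion; note that restricted is the default SCC for authenticated users, so explicitly adding it may not even be necessary:)

oc adm policy remove-scc-from-user anyuid system:serviceaccount:openshift-web-console:webconsole
oc adm policy remove-scc-from-user privileged system:serviceaccount:openshift-web-console:webconsole
oc adm policy add-scc-to-user restricted system:serviceaccount:openshift-web-console:webconsole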


naumannt commented May 14, 2018

Sorry, I misunderstood the linked solution. I have now tried
oc adm policy remove-scc-from-user privileged system:serviceaccount:openshift-web-console:webconsole
and
oc adm policy remove-scc-from-user anyuid system:serviceaccount:openshift-web-console:webconsole
and
oc adm policy add-role-to-user restricted system:serviceaccount:openshift-web-console:webconsole
Sadly, the result is unchanged:

F0514 12:43:12.672295       1 console.go:35] unable to load server certificate: open /var/serving-cert/tls.crt: permission denied

naumannt commented May 14, 2018

We noticed that the "annotations" section of the service account isn't updated by the oc adm policy command; oc export sa webconsole returned:

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{},"labels":{"app":"openshift-web-console"},"name":"webconsole","namespace":"openshift-web-console"}}
  creationTimestamp: null
  labels:
    app: openshift-web-console
  name: webconsole

I edited it via oc edit sa webconsole and added "restricted" in there:

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{"restricted"},"labels":{"app":"openshift-web-console"},"name":"webconsole","namespace":"openshift-web-console"}}
  creationTimestamp: null
  labels:
    app: openshift-web-console
  name: webconsole

Update: oc get rolebindings now correctly displays that webconsole is restricted, so the problem should no longer stem from that direction.
Sadly, still the same error loop.


naumannt commented May 14, 2018

Another update: after the rolebinding changes, the pod now starts successfully. The console is "available" at the expected URL (hostname/console), but it's not responsive: "Loading..." keeps going for quite a while until the sidebars for the catalog view appear, but those never finish loading their contents.


spadgett commented May 14, 2018

@naumannt Can you check if there are any JavaScript or Network errors in your browser developer tools?

https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_are_browser_developer_tools


naumannt commented May 14, 2018

There is a warning regarding bootstrap-slider.js, and when the view switches from "Loading..." to the sidebars, three Failed to load resource: net::ERR_CONNECTION_TIMED_OUT errors appear, their URLs pointing to the original EC2 public IP (from before the Route53 hostname). I have a suspicion why this might be the case; I'll try the template with a modified hostname real quick.


cmoulliard commented May 18, 2018

If we use oc cluster up and kill the ReplicaSet of the web console, the pod is restarted correctly.

1) Start

oc cluster up --create-machine
Creating docker-machine openshift
Pulling image openshift/origin:v3.9.0
Pulled 1/4 layers, 26% complete
Pulled 1/4 layers, 48% complete
Pulled 1/4 layers, 70% complete
Pulled 2/4 layers, 81% complete
Pulled 3/4 layers, 86% complete
Pulled 3/4 layers, 100% complete
Pulled 4/4 layers, 100% complete
Extracting
Image pull complete
Using Docker shared volumes for OpenShift volumes
Using docker-machine IP 192.168.99.100 as the host IP
Using 192.168.99.100 as the server IP
Starting OpenShift using openshift/origin:v3.9.0 ...
OpenShift server started.

The server is accessible via web console at:
    https://192.168.99.100:8443

You are logged in as:
    User:     developer
    Password: <any value>

To login as administrator:
oc login -u system:admin
Logged into "https://192.168.99.100:8443" as "system:admin" using existing credentials.

You have access to the following projects and can switch between them with 'oc project <projectname>':

    default
    kube-public
    kube-system
  * myproject
    openshift
    openshift-infra
    openshift-node
    openshift-web-console

Using project "myproject".

2) Add cluster-role
oc login -u system:admin
oc adm policy add-cluster-role-to-user cluster-admin admin
cluster role "cluster-admin" added: "admin"

oc login -u admin -p admin
Login successful.

You have access to the following projects and can switch between them with 'oc project <projectname>':

    default
    kube-public
    kube-system
  * myproject
    openshift
    openshift-infra
    openshift-node
    openshift-web-console

Using project "myproject".

oc project openshift-web-console
Now using project "openshift-web-console" on server "https://192.168.99.100:8443".


3) Check annotation of the webconsole pod
oc get pods
NAME                          READY     STATUS    RESTARTS   AGE
webconsole-7dfbffd44d-w8mwj   1/1       Running   0          2m

oc describe po/webconsole-7dfbffd44d-w8mwj
Name:           webconsole-7dfbffd44d-w8mwj
Namespace:      openshift-web-console
Node:           localhost/10.0.2.15
Start Time:     Fri, 18 May 2018 08:52:04 +0200
Labels:         app=openshift-web-console
                pod-template-hash=3896998008
                webconsole=true
Annotations:    openshift.io/scc=restricted
Status:         Running
IP:             172.17.0.5
...
Events:
  Type    Reason                 Age   From                Message
  ----    ------                 ----  ----                -------
  Normal  Scheduled              3m    default-scheduler   Successfully assigned webconsole-7dfbffd44d-w8mwj to localhost
  Normal  SuccessfulMountVolume  3m    kubelet, localhost  MountVolume.SetUp succeeded for volume "webconsole-config"
  Normal  SuccessfulMountVolume  3m    kubelet, localhost  MountVolume.SetUp succeeded for volume "serving-cert"
  Normal  SuccessfulMountVolume  3m    kubelet, localhost  MountVolume.SetUp succeeded for volume "webconsole-token-scn9w"
  Normal  Pulling                2m    kubelet, localhost  pulling image "openshift/origin-web-console:v3.9.0"
  Normal  Pulled                 2m    kubelet, localhost  Successfully pulled image "openshift/origin-web-console:v3.9.0"
  Normal  Created                2m    kubelet, localhost  Created container
  Normal  Started                2m    kubelet, localhost  Started container

4) Check scc config

oc get scc/restricted -o yaml
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegedContainer: false
allowedCapabilities: null
allowedFlexVolumes: null
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: MustRunAs
groups:
- system:authenticated
kind: SecurityContextConstraints
metadata:
  annotations:
    kubernetes.io/description: restricted denies access to all host features and requires
      pods to be run with a UID, and SELinux context that are allocated to the namespace.  This
      is the most restrictive SCC and it is used by default for authenticated users.
  creationTimestamp: 2018-05-18T06:51:51Z
  name: restricted
  resourceVersion: "70"
  selfLink: /apis/security.openshift.io/v1/securitycontextconstraints/restricted
  uid: f18c6d5c-5a67-11e8-86cd-627578c225b4
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users: []
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret

oc get scc/anyuid -o yaml
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegedContainer: false
allowedCapabilities: null
allowedFlexVolumes: null
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: RunAsAny
groups:
- system:cluster-admins
kind: SecurityContextConstraints
metadata:
  annotations:
    kubernetes.io/description: anyuid provides all features of the restricted SCC
      but allows users to run with any UID and any GID.
  creationTimestamp: 2018-05-18T06:51:51Z
  name: anyuid
  resourceVersion: "71"
  selfLink: /apis/security.openshift.io/v1/securitycontextconstraints/anyuid
  uid: f18cc483-5a67-11e8-86cd-627578c225b4
priority: 10
readOnlyRootFilesystem: false
requiredDropCapabilities:
- MKNOD
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users: []
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret

oc get scc/privileged -o yaml
allowHostDirVolumePlugin: true
allowHostIPC: true
allowHostNetwork: true
allowHostPID: true
allowHostPorts: true
allowPrivilegedContainer: true
allowedCapabilities:
- '*'
allowedFlexVolumes: null
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: RunAsAny
groups:
- system:cluster-admins
- system:nodes
- system:masters
kind: SecurityContextConstraints
metadata:
  annotations:
    kubernetes.io/description: 'privileged allows access to all privileged and host
      features and the ability to run as any user, any group, any fsGroup, and with
      any SELinux context.  WARNING: this is the most relaxed SCC and should be used
      only for cluster administration. Grant with caution.'
  creationTimestamp: 2018-05-18T06:51:51Z
  name: privileged
  resourceVersion: "314"
  selfLink: /apis/security.openshift.io/v1/securitycontextconstraints/privileged
  uid: f18802b0-5a67-11e8-86cd-627578c225b4
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities: null
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
seccompProfiles:
- '*'
supplementalGroups:
  type: RunAsAny
users:
- system:admin
- system:serviceaccount:openshift-infra:build-controller
- system:serviceaccount:default:pvinstaller
- system:serviceaccount:default:registry
- system:serviceaccount:default:router
volumes:
- '*'

oc get scc
NAME               PRIV      CAPS      SELINUX     RUNASUSER          FSGROUP     SUPGROUP    PRIORITY   READONLYROOTFS   VOLUMES
anyuid             false     []        MustRunAs   RunAsAny           RunAsAny    RunAsAny    10         false            [configMap downwardAPI emptyDir persistentVolumeClaim projected secret]
hostaccess         false     []        MustRunAs   MustRunAsRange     MustRunAs   RunAsAny    <none>     false            [configMap downwardAPI emptyDir hostPath persistentVolumeClaim projected secret]
hostmount-anyuid   false     []        MustRunAs   RunAsAny           RunAsAny    RunAsAny    <none>     false            [configMap downwardAPI emptyDir hostPath nfs persistentVolumeClaim projected secret]
hostnetwork        false     []        MustRunAs   MustRunAsRange     MustRunAs   MustRunAs   <none>     false            [configMap downwardAPI emptyDir persistentVolumeClaim projected secret]
nonroot            false     []        MustRunAs   MustRunAsNonRoot   RunAsAny    RunAsAny    <none>     false            [configMap downwardAPI emptyDir persistentVolumeClaim projected secret]
privileged         true      [*]       RunAsAny    RunAsAny           RunAsAny    RunAsAny    <none>     false            [*]
restricted         false     []        MustRunAs   MustRunAsRange     MustRunAs   RunAsAny    <none>     false            [configMap downwardAPI emptyDir persistentVolumeClaim projected secret]

5) Recreate pod
oc get rs
NAME                    DESIRED   CURRENT   READY     AGE
webconsole-7dfbffd44d   1         1         1         4m

oc delete rs/webconsole-7dfbffd44d
replicaset "webconsole-7dfbffd44d" deleted

oc get pods -w
NAME                          READY     STATUS        RESTARTS   AGE
webconsole-7dfbffd44d-crq9c   0/1       Running       0          4s
webconsole-7dfbffd44d-w8mwj   0/1       Terminating   0          4m
webconsole-7dfbffd44d-w8mwj   0/1       Terminating   0         4m
webconsole-7dfbffd44d-w8mwj   0/1       Terminating   0         4m
webconsole-7dfbffd44d-crq9c   1/1       Running   0         12s
^C
oc get pods
NAME                          READY     STATUS    RESTARTS   AGE
webconsole-7dfbffd44d-crq9c   1/1       Running   0          27s

6) The secret is properly mounted with the certs (ls -la /var/serving-cert from inside the pod)

total 4
drwxrwsrwt  3 root 1000060000  120 May 18 06:56 .
drwxr-xr-x 27 root root       4096 May 18 06:56 ..
drwxr-sr-x  2 root 1000060000   80 May 18 06:56 ..2018_05_18_06_56_54.575948199
lrwxrwxrwx  1 root root         31 May 18 06:56 ..data -> ..2018_05_18_06_56_54.575948199
lrwxrwxrwx  1 root root         14 May 18 06:56 tls.crt -> ..data/tls.crt
lrwxrwxrwx  1 root root         14 May 18 06:56 tls.key -> ..data/tls.key
sh-4.2$ ls -la /var/serving-cert/..data
lrwxrwxrwx 1 root root 31 May 18 06:56 /var/serving-cert/..data -> ..2018_05_18_06_56_54.575948199

djfoley01 commented Jul 9, 2018

Resolved the issue by changing the secret permissions. This probably doesn't address the root cause, but the console is working again for us.

oc edit deploy webconsole
Change the permissions for the secret and configmap to 444.
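
(A hedged sketch of what that edit might look like in the deployment's volumes section; defaultMode is the standard Kubernetes field for this, but the exact original values in the template may differ:)

volumes:
- name: serving-cert
  secret:
    secretName: webconsole-serving-cert
    defaultMode: 0444   # octal; readable by the non-root container user
- name: webconsole-config
  configMap:
    name: webconsole-config
    defaultMode: 0444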



spadgett commented Jul 9, 2018

@djfoley01 Yeah, we recently changed the file permissions in the template used during install. They were incorrect before.

See https://github.com/openshift/openshift-ansible/pull/8558/files

This should fix the problem. If not, please reopen the issue. Thanks!

/close


mahurtado commented Aug 31, 2018

Hello @djfoley01, @spadgett, same problem here. I am using minishift. When I try your idea, the changes are not saved. Does your workaround work in minishift?

oc edit deploy webconsole

I change the mode to 444 and save, but it does not persist:
...



spadgett commented Aug 31, 2018

@mahurtado minishift uses cluster up, which uses a version of the template without this fix:

https://github.com/openshift/origin/blob/master/install/origin-web-console/console-template.yaml

The console operator is likely overwriting your change. I'm not sure there's a good way to work around this without fixing it in origin.

cc @deads2k


mahurtado commented Sep 2, 2018

@spadgett thank you for your response. It looks like the web console is then unavailable from minishift.
Does it make sense to reopen the issue or create a new one?

praveenkumar added a commit to praveenkumar/origin that referenced this issue Oct 15, 2018
