mas install pre check fails on ARO cluster #561

Closed
kathleenhosang opened this issue Oct 16, 2023 · 16 comments

kathleenhosang commented Oct 16, 2023

Running mas install on ARO cluster yields the following error message:

TASK [ibm.mas_devops.ocp_verify : Debug cluster certificate secret search] *****
ok: [localhost] => 
  msg:
  - Found Router Default Secret ........... False
  - Found Cluster Ingress Secret .......... False
  - Found Cluster Primary Secret .......... False
  - Cluster Ingress Cert Secret Name ...... missing
  - Cluster Ingress Cert .................. missing

TASK [ibm.mas_devops.ocp_verify : Fail if one of the cluster required secrets does not exist] ***
fatal: [localhost]: FAILED! => changed=false 
  assertion: cluster_ingress_secret_name is defined
  evaluated_to: false
  msg: This cluster does not contain any of the secrets known to contain the TLS certificate for the cluster ingress.

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
localhost                  : ok=15   changed=1    unreachable=0    failed=1    skipped=3    rescued=0    ignored=0   

Saw issue in client environment and replicated issue in test ARO cluster.

I looked at the default ingress controller configuration in test env and can see the default ingress certificate as part of the yaml file:

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2023-10-16T19:59:03Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 2
  name: default
  namespace: openshift-ingress-operator
  resourceVersion: "46153"
  uid: 6df12313-5db2-47a8-993f-06899e95ca60
spec:
  clientTLS:
    clientCA:
      name: ""
    clientCertificatePolicy: ""
  defaultCertificate:
    name: e89b2efb-e26c-447a-89d3-29a9160ba725-ingress

I should be able to get around this by using the ansible playbooks directly, but I wanted to raise it, since it will be difficult to manage ARO clusters without mas cli support.
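
For anyone checking their own cluster, the lookup described above can be sketched with two `oc` calls: read `spec.defaultCertificate.name` from the default IngressController, then confirm that a secret of that name exists. This is a minimal sketch and not what the `ocp_verify` role does. The function names are made up for illustration, and it assumes the referenced secret lives in the `openshift-ingress` namespace, which is where OpenShift expects a custom default certificate secret to be.

```shell
# Print the secret name configured as the default ingress certificate.
# Prints an empty string when spec.defaultCertificate is not set.
ingress_default_cert_name() {
  oc get ingresscontroller default \
    -n openshift-ingress-operator \
    -o jsonpath='{.spec.defaultCertificate.name}'
}

# Succeed only if a name is configured and that secret exists in
# the openshift-ingress namespace (assumed location, see above).
ingress_default_cert_exists() {
  name="$(ingress_default_cert_name)" || return 1
  [ -n "$name" ] || return 1
  oc get secret "$name" -n openshift-ingress >/dev/null 2>&1
}
```

Usage, after `oc login`: `ingress_default_cert_exists && echo "found: $(ingress_default_cert_name)"`.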

@durera durera self-assigned this Oct 17, 2023
@durera
Contributor

durera commented Oct 17, 2023

Been meaning to make improvements in this area for a while, using this as the kick to finally do it...

@durera durera added the Bug Report Something isn't working label Oct 17, 2023
@durera
Contributor

durera commented Oct 17, 2023

I have no access to a system to test the fix. @kathleenhosang, could you please check whether this change helps:

Start a docker container with the fix: docker run -ti quay.io/ibmmas/cli:7.4.4-pre.ingress

Inside the container, create a playbook named ocpverify.yaml like so:

- hosts: localhost
  any_errors_fatal: true
  vars:
    verify_cluster: False
    verify_catalogsources: False
    verify_subscriptions: False
    verify_workloads: False
    verify_ingress: True
  roles:
    - ibm.mas_devops.ocp_verify

Run the playbook: ansible-playbook ocpverify.yaml

Should hopefully see output like this:

TASK [ocp_verify : Debug cluster certificate secret search] ********
ok: [localhost] =>
  msg:
  - Found Router Default Secret ........... False
  - Found Cluster Ingress Secret .......... False
  - Found Cluster Primary Secret .......... False
  - Found Ingress Controller Secret ....... True
  - Cluster Ingress Cert Secret Name ...... YOURCERTNAME

TASK [ocp_verify : Fail if one of the cluster required secrets does not exist] *******
ok: [localhost] => changed=false
  msg: All assertions passed

I don't have any clusters where the first 3 ways of finding the certificate fail. I had hoped the IngressController would be a universal way to get the secret name, but sadly it's as fallible as the others; there doesn't seem to be a single reliable lookup that works across all OCP clusters.
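
Since no single lookup works everywhere, the fallback idea can be sketched as a loop that tries candidate secrets in order and reports the first one that exists. This only illustrates the approach and is not the role's actual implementation: `first_existing_secret` is a hypothetical helper, and the caller has to supply the candidate names.

```shell
# Print the first secret from the argument list that exists in the namespace.
# Usage: first_existing_secret NAMESPACE SECRET...
first_existing_secret() {
  ns="$1"
  shift
  for secret in "$@"; do
    # Skip empty candidates (for example, an unset lookup result).
    [ -n "$secret" ] || continue
    if oc get secret "$secret" -n "$ns" >/dev/null 2>&1; then
      printf '%s\n' "$secret"
      return 0
    fi
  done
  return 1  # none of the candidates exist
}
```

For example: `first_existing_secret openshift-ingress router-certs-default "$custom_cert_name"`, where `router-certs-default` is the default router certificate secret on many OpenShift clusters and `$custom_cert_name` is a placeholder for whatever name the IngressController reports.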

@kathleenhosang
Author

Running the playbook failed:

TASK [ibm.mas_devops.ocp_verify : Debug cluster certificate secret search] *********************************************************************************************************************************************************************************************
ok: [localhost] => 
  msg:
  - Found Router Default Secret ........... False
  - Found Cluster Ingress Secret .......... False
  - Found Cluster Primary Secret .......... False
  - Found Ingress Controller Secret ....... False
  - Cluster Ingress Cert Secret Name ...... missing

TASK [ibm.mas_devops.ocp_verify : Fail if one of the cluster required secrets does not exist] **************************************************************************************************************************************************************************
fatal: [localhost]: FAILED! => changed=false 
  assertion: cluster_ingress_secret_name is defined
  evaluated_to: false
  msg: This cluster does not contain any of the secrets known to contain the TLS certificate for the cluster ingress.

NO MORE HOSTS LEFT *****************************************************************************************************************************************************************************************************************************************************

PLAY RECAP *************************************************************************************************************************************************************************************************************************************************************
localhost                  : ok=14   changed=0    unreachable=0    failed=1    skipped=10   rescued=0    ignored=0   

[ibmmas/cli:7.4.4-pre.ingress]mascli$ 

@kathleenhosang
Author

kathleenhosang commented Oct 17, 2023

@durera I need a little help getting around this issue in the short term :)

When trying to run ansible-playbook ibm.mas_devops.oneclick_core, I get the following error:

TASK [ibm.mas_devops.mongodb : community : install : Create new MongoDb admin user credentials secret] *****************************************************************************************************************************************************************
fatal: [localhost]: FAILED! => 
  msg: 'An unhandled exception occurred while running the lookup plugin ''template''. Error was a <class ''ansible.errors.AnsibleError''>, original message: An unhandled exception occurred while running the lookup plugin ''password''. Error was a <class ''PermissionError''>, original message: [Errno 13] Permission denied: b''/tmp/55347da53eb8de5cc6a655730b79772d98ca41a6.ansible_lockfile''. [Errno 13] Permission denied: b''/tmp/55347da53eb8de5cc6a655730b79772d98ca41a6.ansible_lockfile''. An unhandled exception occurred while running the lookup plugin ''password''. Error was a <class ''PermissionError''>, original message: [Errno 13] Permission denied: b''/tmp/55347da53eb8de5cc6a655730b79772d98ca41a6.ansible_lockfile''. [Errno 13] Permission denied: b''/tmp/55347da53eb8de5cc6a655730b79772d98ca41a6.ansible_lockfile'''

NO MORE HOSTS LEFT *****************************************************************************************************************************************************************************************************************************************************

PLAY RECAP *************************************************************************************************************************************************************************************************************************************************************
localhost                  : ok=74   changed=5    unreachable=0    failed=1    skipped=17   rescued=0    ignored=0  

Rather than try to debug, I thought it would be easier to have the pipeline skip the verify ingress task. I commented that task out in the following files locally:

/mascli/ansible-devops/roles/ocp_verify/tasks/main.yml 
/mascli/ansible-devops/roles/ocp_verify/defaults/main.yml

But the pipeline is still checking for the ingress secret and failing. I must be missing something- how do you suggest I move forward?

@durera
Contributor

durera commented Oct 18, 2023

> Running the playbook failed:
>
> [...]
>   - Found Ingress Controller Secret ....... False
>   - Cluster Ingress Cert Secret Name ...... missing
> [...]

What's the output from before that? These are the 4 new tasks I added; I would need to see what they returned in your cluster: https://github.com/ibm-mas/ansible-devops/blob/ec5f3f0a762cb72c3041d900c83b31e8dd4d8ad6/ibm/mas_devops/common_tasks/get_signed_ingress_cert.yml#L96-L139

@durera
Contributor

durera commented Oct 18, 2023

> But the pipeline is still checking for the ingress secret and failing. I must be missing something- how do you suggest I move forward?

This isn't an optional check that can be skipped. We put it at the front to avoid wasting time debugging install failures that will happen later if we can't identify which secret contains this certificate. Some of our dependencies use this secret on their routes; if we don't know which secret they are using, we can't set everything up.

@kathleenhosang
Author

@durera ok, makes sense. But the ansible playbook to install mas core should work? I am getting the permissions error

TASK [ibm.mas_devops.mongodb : community : install : Create new MongoDb admin user credentials secret] ***
fatal: [localhost]: FAILED! => 
  msg: 'An unhandled exception occurred while running the lookup plugin ''template''. Error was a <class ''ansible.errors.AnsibleError''>, original message: An unhandled exception occurred while running the lookup plugin ''password''. Error was a <class ''PermissionError''>, original message: [Errno 13] Permission denied: b''/tmp/55347da53eb8de5cc6a655730b79772d98ca41a6.ansible_lockfile''. [Errno 13] Permission denied: b''/tmp/55347da53eb8de5cc6a655730b79772d98ca41a6.ansible_lockfile''. An unhandled exception occurred while running the lookup plugin ''password''. Error was a <class ''PermissionError''>, original message: [Errno 13] Permission denied: b''/tmp/55347da53eb8de5cc6a655730b79772d98ca41a6.ansible_lockfile''. [Errno 13] Permission denied: b''/tmp/55347da53eb8de5cc6a655730b79772d98ca41a6.ansible_lockfile'''

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
localhost                  : ok=74   changed=5    unreachable=0    failed=1    skipped=17   rescued=0    ignored=0   

[ibmmas/cli:7.4.3]mascli$ 

I can't imagine this is related to the ingress secret? Is this a separate bug?

@durera
Contributor

durera commented Oct 19, 2023

Looks like a problem in the container: it is unable to write to the /tmp directory.
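
A quick way to confirm this from inside the container is to try creating a file in `/tmp`, since the lock file named in the error above is created there. A minimal sketch; `dir_is_writable` is a name chosen for illustration:

```shell
# Succeed if the current user can create (and then remove) a file
# in the given directory.
dir_is_writable() {
  probe="$1/.write_probe_$$"
  # The subshell keeps a failed redirection from aborting the caller.
  if ( : > "$probe" ) 2>/dev/null; then
    rm -f "$probe"
    return 0
  fi
  return 1
}
```

For example: `dir_is_writable /tmp || echo "/tmp is not writable by $(id -un)"`.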

@kathleenhosang
Author

kathleenhosang commented Oct 23, 2023

@durera you are correct, resolved that issue, but now seeing a new one.

The mas-mongo-ce-0 pod is crashing because its readiness probe is failing. I replicated this issue in a test ARO env, and ran identical commands in fyre with no issues. The certificates, secrets, and configmaps for mongo seem to have been created properly; my hunch is that we may be running into the same issue with finding the certificate?

Let me know what you think, we can't move forward with the mas cli or ansible playbooks. The client opened an IBM support ticket (TS014529626) which we will use to debug this issue.

@mudspringhiker
Contributor

@kathleenhosang I asked for the mongoce logs in the support ticket. I'll create an internal issue when I receive the logs. Thanks.

@terenceq
Contributor

Just for anyone that comes across this - with respect to the MongoDB CE Operator and ARO - the following storage class should be leveraged for the PVCs:

managed-premium, with provider kubernetes.io/azure-disk

@kathleenhosang
Author

> Just for anyone that comes across this - with respect to the MongoDB CE Operator and ARO - the following storage class should be leveraged for the PVCs:
>
> managed-premium, with provider kubernetes.io/azure-disk

In ARO 4.12, the default storage class is managed-csi, provisioned by disk.csi.azure.com. This can also be used with MongoDB.
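
To see which of these storage classes a given cluster actually offers before choosing one for the MongoDB PVCs, something like the following could be used. It is a sketch only; the two helper names are invented for this example.

```shell
# List storage classes, one "name provisioner" pair per line.
list_storage_classes() {
  oc get storageclass \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.provisioner}{"\n"}{end}'
}

# Succeed if a storage class with the given name exists.
storage_class_exists() {
  list_storage_classes |
    awk -v want="$1" '$1 == want { found = 1 } END { exit !found }'
}
```

For example: `storage_class_exists managed-csi && echo "managed-csi is available"`.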

@kathleenhosang
Author

kathleenhosang commented Oct 25, 2023

FYI, the initially noted ingress cert error surfaces when using ansible directly to install MAS core. This is being debugged in TS014555001

@lahmad1

lahmad1 commented Nov 2, 2023

I am seeing the same behavior when performing a MAS core upgrade on an OpenShift cluster hosted in Azure.

TASK [ibm.mas_devops.ocp_verify : Debug cluster certificate secret search] *****
ok: [localhost] =>
  msg:
  - Found Router Default Secret ........... False
  - Found Cluster Ingress Secret .......... False
  - Found Cluster Primary Secret .......... False
  - Cluster Ingress Cert Secret Name ...... missing
  - Cluster Ingress Cert .................. missing

TASK [ibm.mas_devops.ocp_verify : Fail if one of the cluster required secrets does not exist] ***
fatal: [localhost]: FAILED! => changed=false
  assertion: cluster_ingress_secret_name is defined
  evaluated_to: false
  msg: This cluster does not contain any of the secrets known to contain the TLS certificate for the cluster ingress.

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
localhost : ok=13 changed=0 unreachable=0 failed=1 skipped=7 rescued=0 ignored=0

The issue only occurs when we are using publicly signed certificates. Is there a way to specify the certificate secret explicitly?

@mmgbrouwers

Any progress on this?

@terenceq
Contributor

Closing, as the original ingress-related issue was addressed by the following PR:

ibm-mas/ansible-devops#1197
