[Bug]: "SSL connection errors" connecting to a Data Science Pipelines apiserver instance #244

Closed
1 task done
gregsheremeta opened this issue Aug 2, 2023 · 15 comments
Labels
field-priority Identified as a high priority issue by users in the field. kind/bug Something isn't working priority/normal An issue with the product; fix when possible triage/accepted

Comments

@gregsheremeta
Contributor

gregsheremeta commented Aug 2, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Deploy type

ODH Dashboard UI

Version

varies

Environment

varies

Current Behavior

We've encountered a few instances of people contacting us about components getting "SSL connection errors" when they try to connect to a Data Science Pipelines instance. I'm concerned that something is nebulously "not quite right" with the way the DSP apiserver is using TLS.

One report:
[RHODS-8860] Data Science Pipelines error "unable to get issuer certificate" on OSD cluster
https://issues.redhat.com/browse/RHODS-8860
For this one, QE tests passed on an OpenStack cluster, but the problem was seen on an OSD cluster, which uses a LetsEncrypt certificate installed by the OSD infrastructure.

Another report:
"I've got a partner running into certification issues while submitting pipelines through Elyra. They're using a custom certificate in their OCP cluster, which seems to break the Data Science Pipelines integration. Has anyone encountered similar issues? Elyra requests to Data Science Pipelines fail with this error message:
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129). Running RHODS self-managed 1.29.0 on OCP 4.11.44."

Developers for Elyra and Dashboard have worked around this problem by allowing insecure connections in the caller, but that is a sub-optimal solution -- we need to get to the bottom of this.

Expected Behavior

In clusters with trusted certificates installed (e.g. LetsEncrypt), SSL connections to DSP apiserver from Dashboard, Elyra, or a laptop should all "just work" without needing to allow insecure connections or passing a custom certificate bundle to the client code.

In clusters with self-signed certificates installed, ???

  • for clients connecting to the DSP apiserver from outside the cluster, the client would need to allow insecure connections or be passed the signer's certificate
  • for pods running in the cluster, ??? - is there a way we can make these "just work" too?

Steps To Reproduce

unknown

Workaround (if any)

No response

Anything else

No response

@gregsheremeta gregsheremeta added kind/bug Something isn't working priority/normal An issue with the product; fix when possible labels Aug 2, 2023
@gregsheremeta
Contributor Author

related: opendatahub-io/notebooks#159

@HumairAK HumairAK added field-priority Identified as a high priority issue by users in the field. triage/accepted labels Aug 3, 2023
@amadhusu amadhusu self-assigned this Aug 23, 2023
@gregsheremeta
Contributor Author

[internal Red Hat] slack thread about this: https://redhat-internal.slack.com/archives/C05KDB2HFQQ/p1693240911518529

@amadhusu
Contributor

Seeing another case of this from the field. I will post their findings here once I receive them.

@amadhusu
Contributor

amadhusu commented Oct 5, 2023

#362 - This issue is also relevant to the topic being discussed here. I am working on a fix which might solve both the referenced issue and this one. I will need to find a way to test both fixes once I am done.

@HumairAK HumairAK self-assigned this Oct 17, 2023
@gregsheremeta
Contributor Author

Investigated this a bit. Note this drawing

Screenshot from 2023-10-17 07-33-23

This ticket is about the blue circle, whereas #362 is about the red circle. They will have different solutions.

@HumairAK
Collaborator

We should confirm whether this is a problem on the DSP backend side.

@gregsheremeta
Contributor Author

gregsheremeta commented Nov 2, 2023

We were seeing this on known good properly secured clusters -- specifically OSD, which plumbs in LetsEncrypt on the apps route. So this is not quite the same problem as the "use my self signed cert everywhere" problem. With LE properly on board the route, we were seeing both Elyra and Dashboard having TLS problems connecting to our route.

In the case where LE is on the route, neither Elyra nor Dashboard should have a problem connecting to our route. The fact that they both do made me nervous that something was wrong with the DSP route: if Elyra and Dashboard both have to enable insecure connections to reach it, something is fishy. The chance that they are both struggling to load the correct OS-level CA bundle from their respective container image bases seems pretty low.

@HumairAK
Collaborator

HumairAK commented Nov 3, 2023

Okay so I tried taking a look at this and my conclusion is that this is not a DSP related issue.

I reproduced this error by enabling TLS verification in the Node.js server for odh-dashboard here, and deploying both DSP and ODH Dashboard on a HyperShift cluster secured via LetsEncrypt, which is recognized by most major operating systems and browsers.

Let's understand this toggle:

rejectUnauthorized: If not false, the server certificate is verified against the list of supplied CAs. An 'error' event is emitted if verification fails; err.code contains the OpenSSL error code. Default: true. (source)

As we all know, odh-dashboard currently disables this field to bypass the issue above, which is not ideal.

Technically we would expect connections from Dashboard -> Pipelines route to always work (even on self-signed certs), because we use Re-encrypt Routes. Which means the Dashboard pod default cert bundles should be able to validate the cert chains received from DSP route.

Indeed, if I curl the DSP route from the Dashboard pod on a LetsEncrypt-secured OCP cluster, I do not get any insecure TLS errors:

Curl successful from ODH Dashboard pod to DSP pod:

sh-4.4$ curl -I --request GET "https://ds-pipeline-pipelines-definition-test.apps.rosa.greg-1102.kv5l.p3.openshiftapps.com/apis/v1beta1/runs"  -H "Authorization: Bearer <omitted>"
HTTP/1.1 200 OK
content-length: 2
content-type: application/json
date: Fri, 03 Nov 2023 17:52:58 GMT
gap-auth: hukhan@redhat.com@cluster.local
gap-upstream-address: localhost:8888
grpc-metadata-content-type: application/grpc
set-cookie: 16569b7999b96e75b5553be61c12f275=c764d98e5e4250350bd87f7716e922e4; path=/; HttpOnly; Secure; SameSite=None
cache-control: private

If it were behind an unrecognized cert, hitting the route would yield:

$ curl -I https://ds-pipeline-sample-dspa.apps.hukhan-3.dev.datahub.redhat.com/apis/v1beta1/runs -H "Authorization: Bearer <omitted>"
curl: (60) SSL certificate problem: self-signed certificate in certificate chain
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

And I would need to enable curl's --insecure flag to disable TLS verification and bypass this error, much like how odh-dashboard currently bypasses TLS verification.

So we know the odh-dashboard pod can indeed run REST calls against the DSP route with TLS verification enabled. Then why do we see the error above only when running calls via the odh-dashboard proxy?

Here I have no clear answers, but my hunch is the culprit is here.

When making proxy calls this CA seems to be passed in as part of the https.RequestOptions, and I wonder if this is somehow forcing the request calls to ignore the well-known CAs shipped as part of the default pod bundles. I think this option is described here:

Default is to trust the well-known CAs curated by Mozilla. Mozilla's CAs are completely replaced when CAs are explicitly specified using this option. The value can be a string or Buffer, or an Array of strings and/or Buffers. Any string or Buffer can contain multiple PEM CAs concatenated together.....

I tried to mess around with the CAs being mounted here, though, and found little success. Currently it seems the CA passed in is the same as the one found in /var/run/secrets/kubernetes.io/serviceaccount. If this is not the issue, I suspect the solution probably lies somewhere in how the TLS requests are being configured when proxy requests are being made.

@HumairAK
Collaborator

HumairAK commented Nov 3, 2023

Before closing this out, I think it would be good to get an ack from the dashboard team that this is on the dashboard/UI side based on the above. If there are still opinions that this is a DSP issue, we can continue investigating further.

@andrewballantyne

@HumairAK
Collaborator

HumairAK commented Nov 6, 2023

It is also worth mentioning that Dashboard currently connects to the DSP API server via an OCP route and not the internal service A record, so traffic probably leaves the cluster and is then routed back in. I think if Dashboard instead used the internal A record to communicate with the API server, for example something like https://ds-pipeline-sample.dspa.svc.cluster.local:8443, then the traffic would be kept internal and the insecure flag is not as big of a deal.
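
As a rough sketch of what an in-cluster call might look like, the snippet below builds the service URL and calls the runs endpoint while verifying against the service CA mounted in the pod. The function names are hypothetical. It assumes the API server's serving certificate is signed by the OpenShift service CA, that the service listens on 8443 as in the example URL, and that TOKEN already holds a valid bearer token.

```shell
# Hypothetical helper: build the in-cluster URL for a Kubernetes service.
# Service DNS names follow <service>.<namespace>.svc.cluster.local.
dsp_service_url() {
  svc="$1"
  ns="$2"
  port="${3:-8443}"   # assumption: 8443, taken from the example URL
  printf 'https://%s.%s.svc.cluster.local:%s\n' "$svc" "$ns" "$port"
}

# Hypothetical helper: call the runs endpoint from inside a pod, verifying the
# server against the service CA mounted into the pod (no --insecure needed).
# Assumption: the serving cert is signed by the OpenShift service CA.
dsp_list_runs() {
  url="$(dsp_service_url "$1" "$2")"
  curl -I --request GET "${url}/apis/v1beta1/runs" \
    -H "Authorization: Bearer ${TOKEN}" \
    --cacert /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
}
```

From a pod in the cluster this would be invoked as `dsp_list_runs ds-pipeline-sample dspa`. If the certificate is not issued for the service name, curl fails verification rather than silently succeeding.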

@gregsheremeta
Contributor Author

then the traffic would be kept internal and the insecure flag is not as big of a deal

I would encourage the dashboard team to remove the insecure flag, especially if they change to connect to the svc.cluster.local URL instead of the route. Insecure should not be needed when connecting to svc.cluster.local.

@HumairAK
Collaborator

HumairAK commented Nov 9, 2023

Technically we would expect connections from Dashboard -> Pipelines route to always work (even on self-signed certs), because we use Re-encrypt Routes. Which means the Dashboard pod default cert bundles should be able to validate the cert chains received from DSP route.

Just wanted to expand on this and provide a simple script anyone can use to validate these findings themselves on self-signed clusters. You can retrieve the certs from the dashboard pods and make API requests against the DSP API server without using --insecure in curl:

# Set some env vars
DS_PROJECT=testing
DS_ROUTE="https://$(oc get routes -n "${DS_PROJECT}" ds-pipeline-pipelines-definition --template='{{.spec.host}}')"
TOKEN=$(oc whoami --show-token)
DASHBOARD_NAMESPACE=opendatahub
# Look up the pod in the dashboard namespace; take the first match in case there are several replicas
DASHBOARD_POD_NAME=$(oc get pods -n "${DASHBOARD_NAMESPACE}" | grep odh-dashboard | awk 'NR==1 {print $1}')

# Our working directory
cd "$(mktemp -d)"

# Copy the symlinked cert to a known location so we can easily retrieve it from the pod.
# If connecting via the k8s service and not the route, the path used should be
# /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
# (oc rsh opens an interactive shell; the next three commands run inside the pod)
oc rsh -n "${DASHBOARD_NAMESPACE}" "${DASHBOARD_POD_NAME}"
mkdir -p /tmp/testcert && cd /tmp/testcert
cp /var/run/secrets/kubernetes.io/serviceaccount/ca.crt .
exit

# Get the cert to our local machine from the dashboard pod
oc cp "${DASHBOARD_NAMESPACE}/${DASHBOARD_POD_NAME}:/tmp/testcert/ca.crt" ./ca.crt

# Use this cert to make a recognized connection to the dsp route
curl -I --request GET "${DS_ROUTE}/apis/v1beta1/runs" \
  -H "Authorization: Bearer ${TOKEN}" --cacert "$(pwd)/ca.crt"

Interestingly enough, when rejectUnauthorized: true is set on self-signed clusters, the proxy calls work, which is probably because, like I said here:

it seems the CA passed in is the same as the one found in /var/run/secrets/kubernetes.io/serviceaccount

@HumairAK
Collaborator

HumairAK commented Nov 10, 2023

Okay yeah think I've got it. It is indeed because of this.

I believe dashboard can do the following to allow secure connections via proxy calls:

  1. Remove the CA being passed in here. This will use the default well-known trusted CAs for proxy calls.
  2. Re-enable rejectUnauthorized here
  3. Then set the following:
      env:
        - name: NODE_EXTRA_CA_CERTS
          value: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

Set this in the dashboard container here; it covers the self-signed case.

This should allow dashboard to re-enable secure connections with DSP (and other public routes using well-known trusted CAs). Using NODE_EXTRA_CA_CERTS extends the well-known trusted CAs, whereas passing the CA in to httpsRequest.request overrides them.
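
The same extend-versus-override distinction applies to clients that are handed a single CA file in place of their default bundle, such as curl with --cacert. A minimal sketch of the "extend" approach for such clients follows. The helper names are hypothetical and not something Dashboard or Elyra ships: it concatenates a system bundle with the cluster CA so that both public and cluster-internal issuers are trusted.

```shell
# Hypothetical helper: concatenate PEM bundles into one file so a client that
# accepts a single CA file trusts all of them (extend, not override).
# Usage: merge_ca_bundles OUTPUT INPUT...
merge_ca_bundles() {
  out="$1"
  shift
  : > "$out"                      # start from an empty output file
  for bundle in "$@"; do
    if [ -r "$bundle" ]; then
      cat "$bundle" >> "$out"
      printf '\n' >> "$out"       # keep PEM blocks separated if a file lacks a trailing newline
    else
      echo "skipping unreadable bundle: $bundle" >&2
    fi
  done
}

# Count the certificates in a PEM bundle, to sanity-check the merge.
count_certs() {
  grep -c -e '-----BEGIN CERTIFICATE-----' "$1"
}
```

For example, on a RHEL-based image (assuming the system bundle lives at /etc/pki/tls/certs/ca-bundle.crt), `merge_ca_bundles /tmp/combined.crt /etc/pki/tls/certs/ca-bundle.crt /var/run/secrets/kubernetes.io/serviceaccount/ca.crt` produces a file that can be passed to `curl --cacert /tmp/combined.crt`.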

@HumairAK
Collaborator

fyi @andrewballantyne ^

@HumairAK
Collaborator

Based on the findings above, we conclude that the issue is not on the DSPO backend side. Closing this issue; see the comment above for what we suggest Dashboard can do to re-enable secure connections. We suspect Elyra will need to do something similar (i.e. extend the well-known bundle rather than override it). We have forwarded this information to both dev teams.

Please feel free to:

  • Re-open this issue if there is reason to believe it is caused by the DSP backend
  • Continue to ask questions or request clarification on the above findings, either here or in Slack
