Load Kubeflow Pipelines in RHODS Pipelines #156

Closed
Zongshun96 opened this issue Jul 3, 2023 · 6 comments
Labels
openshift This issue pertains to NERC OpenShift

Comments

@Zongshun96

Description

It seems the Kubeflow Pipelines SDK cannot be used with RHODS Pipelines at the moment: the Kubeflow pipeline endpoint is not exposed.

Forwarding the ds-pipeline-pipelines-definition service in my OpenShift project (namespace) didn't solve the problem, as my code (kfp_tekton with my bearer token, adapted from here) complained about certificate verify failed. It is also not safe to simply forward the service.

Trevor Royer commented that the problem could be that "the container likely has a cert built into it that is self signed so your cert verification fails."

Proposed Solution

Trevor suggested adding a route for the oauth port in the service, e.g., oc create route reencrypt --service=dsp-def-service --port=oauth. He mentioned that "you can setup a route and the cluster will create a new cert that is already trusted for you."

(screenshot: Jun 30, 2023)
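
For reference, a minimal sketch of that approach (assuming the route keeps the default name, i.e. the service name, and that dsp-def-service is the DSP service in your project):

# Create a reencrypt route for the oauth port of the DSP service (Trevor's suggestion)
oc create route reencrypt --service=dsp-def-service --port=oauth

# The hostname the cluster assigns to the route becomes the kfp endpoint
oc get route dsp-def-service -o jsonpath='{.spec.host}'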

He also mentioned some workarounds. While I think we need a permanent fix to this problem, I am listing them here for the record.

  • "You can extract the cert from the container and add it to your trusted certs list on your machine"
  • "kpf_tekton may provide an option to allow you to connect without authenticating the cert"

Reproducibility

kfp_tekton~=1.5.0
https://github.com/adrien-legros/rhods-mnist/blob/main/docs/lab-instructions.md#add-a-pipeline-step
https://github.com/rh-datascience-and-edge-practice/kubeflow-examples/blob/main/pipelines/1_test_connection.py
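
A minimal sketch of the failing call; kubeflow_endpoint and bearer_token below are placeholders (the token can be obtained with oc whoami --show-token):

import kfp_tekton

# Placeholders: the forwarded/routed DSP endpoint and an OpenShift bearer token
kubeflow_endpoint = "https://<dsp-endpoint>"
bearer_token = "<token from `oc whoami --show-token`>"

client = kfp_tekton.TektonClient(host=kubeflow_endpoint, existing_token=bearer_token)
print(client.list_experiments())  # raises "certificate verify failed" without a trusted CA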

Notes

  • Please manually create the S3 bucket for your Pipeline Server. (Check oc describe <pod> and AWS CloudTrail.)
  • "The other gotcha that I have found is that in upstream you need to point kfp to <route>/pipelines and in dsp you will need to use just the route. No /pipelines"
  • "I think when you create a DSP instance it creates a pod called something like data-science-pipelines-definition and the route for the API endpoint will be the one pointing towards that"
  • "So basic rule of thumb is anything for connecting to the dsp server or compiling you want to use kfp_tekton. Any normal pipeline definition pieces use kfp." (see the sketch below)
@Zongshun96
Author

Now we need to figure out how to correctly configure the TLS params for the routes.

Following this tutorial (https://www.redhat.com/sysadmin/cert-manager-operator-openshift), we made sure that cert-manager is running, and I was able to create an Issuer in my namespace.

Then we tried to follow this tutorial to manually generate a certificate and add the secret to my route. There were two issues here. First, I cannot create a Certificate using the ClusterIssuer; it stays in the issuing stage forever. Second, although I can create a Certificate with the Issuer in my namespace, after adding the TLS information to my route (based on the secret generated for the Certificate), I still saw the same certificate verify failed error.
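
For context, the Certificate in both attempts looked roughly like the following sketch; the names, namespace, dnsNames, and issuerRef are placeholders rather than the exact values used:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: dsp-route-cert
  namespace: <my-namespace>
spec:
  secretName: dsp-route-cert-tls   # cert-manager writes tls.crt/tls.key here
  dnsNames:
    - <route-hostname>             # must match the route's host
  issuerRef:
    name: <issuer-name>            # the Issuer (or ClusterIssuer) from above
    kind: Issuer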

Trying to create a ClusterIssuer

(screenshots: Jul 13, 2023)

Using the Issuer in my namespace

(screenshots: Jul 13, 2023)

Next Attempt

Can we apply cert-manager-openshift-routes in the test cluster? It should generate the certificates and populate my routes automatically.
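
If we go that way, the project is annotation-driven; a rough sketch is below (the annotation names and values should be double-checked against the linked README):

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: ds-pipeline-route          # placeholder
  annotations:
    cert-manager.io/issuer-name: selfsigned
    cert-manager.io/issuer-kind: ClusterIssuer
spec:
  to:
    kind: Service
    name: ds-pipeline-pipelines-definition
  tls:
    termination: reencrypt         # the controller should populate the certificate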

@Zongshun96
Author

Zongshun96 commented Jul 14, 2023

It turns out that, even without cert-manager-openshift-routes, we can still manually set up certificates for routes and connect to the KFP endpoint. While I think having this done automatically in the test cluster would be nice, the steps to do it manually are below.
Thank you Trevor and Dylan for all the help!

Steps

  1. Generate a Certificate with the ClusterIssuer (named selfsigned in the NERC test cluster). It will create the corresponding secret in the same namespace/project.
    (screenshots: Jul 14, 2023)

  2. Copy & paste the cert and private key to your route. Configure your route with spec.tls.termination: reencrypt
    Screen Shot 2023-07-14 at 3 54 45 PM

  3. Add the certificate to your kfp client in Python (kfp_tekton==1.5.0).
    Note: with Python 3.10 there are version conflicts between pyyaml and urllib3. Try python3 -m pip install kfp_tekton==1.5.0 pyyaml==5.3.1 urllib3==1.26.15 requests-toolbelt==0.10.1 kubernetes

import kfp_tekton

# kubeflow_endpoint is the https:// URL of the route; bearer_token can be
# obtained with `oc whoami --show-token`.
client = kfp_tekton.TektonClient(
    host=kubeflow_endpoint,
    existing_token=bearer_token,
    ssl_ca_cert='/home/ubuntu/Praxi-Pipeline/ca.crt'  # the certificate from step 1, saved locally
)
  4. Now your kfp client should be able to connect to the kfp endpoint.
    (screenshot: Jul 14, 2023)
    https://github.com/rh-datascience-and-edge-practice/kubeflow-examples/blob/main/pipelines/1_test_connection.py
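
For step 2, the resulting route looks roughly like this sketch; the certificate and key are the tls.crt/tls.key values from the secret generated in step 1 (the route name is a placeholder):

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: ds-pipeline-route          # placeholder
spec:
  tls:
    termination: reencrypt
    certificate: |
      -----BEGIN CERTIFICATE-----
      ...
    key: |
      -----BEGIN PRIVATE KEY-----
      ...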

@dystewart

In the meantime we should document this as a solution. I also agree it would be nice to have an automated way of doing this. Nice work @Zongshun96!

@Zongshun96
Author

Zongshun96 commented Jul 19, 2023

Problem Description

I am facing a new error when deploying a kfp pipeline with intermediate data: the PVC is bound, but the container cannot mount the volume.
It seems the CephFS/RBD plugin pod is not working correctly (see rook/rook#4896 (comment)).

Reproducibility 1

Deploying the pipeline below.
https://github.com/rh-datascience-and-edge-practice/kubeflow-examples/blob/0b3b0f837b1b7ea988e0c9242ca016dfec9f2bd6/pipelines/11_iris_training_pipeline.py

Error Logs

(screenshots: Jul 19, 2023)

Reproducibility 2

Deploying a single busybox container pod with a PVC also shows the same error.

interm-pvc.yaml

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: interm-pvc
  namespace: praxi 
  # labels:
  #   app: snapshot
spec:
  storageClassName: ocs-external-storagecluster-ceph-rbd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

fake-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: snapshot-fake-deployment
  namespace: praxi 
spec:
  replicas: 1
  selector:
    matchLabels:
      app: snapshot
  template:
    metadata:
      labels:
        app: snapshot
    spec:
      containers:
        - name: snapshot-fake
          image: busybox:latest
          imagePullPolicy: "IfNotPresent"
          command: [ "/bin/bash", "-c", "--" ]
          args: [ "while true; do sleep 30; done;" ]
          volumeMounts:
            - mountPath: /fake-snapshot
              name: snapshot-vol1
      volumes:
        - name: snapshot-vol1
          persistentVolumeClaim:
            claimName: interm-pvc

Error Logs

(screenshots: Jul 19, 2023)
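
To reproduce and inspect the mount failure with the manifests above, something along these lines can be used (the namespace and label match the sketches above):

oc apply -f interm-pvc.yaml -f fake-deployment.yaml
oc get pvc interm-pvc -n praxi                  # should report Bound
oc describe pod -l app=snapshot -n praxi        # the mount error shows up under Events
oc get events -n praxi --sort-by=.lastTimestamp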

@Zongshun96
Author

Zongshun96 commented Jul 31, 2023

The storage class issue was fixed by enforcing node affinity to avoid the newly introduced GPU nodes. There seem to be some permissions that haven't been set up yet (#170).

For now, my fix is to apply node affinity to my components. The following is an example of applying add_affinity to the generate_loadmod_op component.

    # Requires the Kubernetes Python client: import kubernetes
    # create node affinity objects that keep the pod off the GPU nodes wrk-10 and wrk-11
    terms = kubernetes.client.models.V1NodeSelectorTerm(
        match_expressions=[
            {'key': 'kubernetes.io/hostname',
            'operator': 'NotIn',
            'values': ["wrk-10", "wrk-11"]}
        ]
    )
    node_selector = kubernetes.client.models.V1NodeSelector(node_selector_terms=[terms])
    node_affinity = kubernetes.client.models.V1NodeAffinity(
        required_during_scheduling_ignored_during_execution=node_selector
    )
    affinity = kubernetes.client.models.V1Affinity(node_affinity=node_affinity)

    model = generate_loadmod_op().apply(use_image_pull_policy()).add_affinity(affinity)

A working pipeline is shown here.
https://github.com/ai4cloudops/Praxi-Pipeline/blob/7fac19b79ac56f41b098d5adb380a510038f3ddf/Praxi-Pipeline-xgb.py

Some useful pointers to recall

kfp_tekton~=1.5.0
https://github.com/adrien-legros/rhods-mnist/blob/main/docs/lab-instructions.md#add-a-pipeline-step
https://github.com/rh-datascience-and-edge-practice/kubeflow-examples/tree/0b3b0f837b1b7ea988e0c9242ca016dfec9f2bd6/pipelines
https://github.com/cert-manager/openshift-routes#usage

Thank you!

@Zongshun96
Author

It seems the AWS access key in the mlpipeline-minio-artifact secret is not updated automatically to reflect changes (a new AWS access key and secret) in the data connection. This causes pods to fail with the following error.

ubuntu@test-retrieving-logs:~/Praxi-Pipeline$ oc logs submitted-pipeline-4c585-generate-changesets-pod -c step-copy-artifacts
tar: Removing leading `/' from member names
/tekton/home/tep-results/args
upload failed: ./args.tgz to s3://rhods-data-connection/artifacts/submitted-pipeline-4c585/generate-changesets/args.tgz An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
tar: Removing leading `/' from member names
/tekton/home/tep-results/cs
upload failed: ./cs.tgz to s3://rhods-data-connection/artifacts/submitted-pipeline-4c585/generate-changesets/cs.tgz An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.

The fix is to manually update the mlpipeline-minio-artifact secret with the new AWS access key and secret. This can be done through the OpenShift console.
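
The same update can also be done from the CLI; a sketch, assuming the secret uses the usual accesskey/secretkey keys (verify with oc get secret mlpipeline-minio-artifact -o yaml first):

oc patch secret mlpipeline-minio-artifact -n <project> --type merge \
  -p '{"stringData": {"accesskey": "<new-access-key>", "secretkey": "<new-secret-key>"}}'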

@joachimweyl added the openshift (This issue pertains to NERC OpenShift) label on Aug 16, 2023