Skip to content

Conversation

@Bobbins228
Copy link
Contributor

@Bobbins228 Bobbins228 commented Dec 13, 2023

Issue link

Closes #431

What changes have been made

Ingresses have the label ingress-owner: appwrapper which is used to identify them.
When mcad=True the SDK gets the the ownership reference from the ingress object to identify and check it is a match with the appwrapper name.

When mcad=False the ingresses are listed and the domain is gathered from the ingress with the ingress-owner label.
For both scenarios the ingress domain is extracted from the host of the ingress.

Included steps to recreate scenarios where ingress_options are set.

Fixed missing ingress creation steps for the _component_resources_up and _component_resources_down functions for cases when mcad=False.

Verification steps

Run through a demo notebook and create a Ray Cluster on Kind and OpenShift.
run cluster = get_cluster(name, namespace)
cluster.down() and other cluster. commands should work.

Create a Ray Cluster on Kind with ingress options set see here.
Run cluster = get_cluster(name, namespace)

Create a Ray cluster with mcad=False an ingress should be created now and get_cluster() should work too.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@sutaakar
Copy link
Contributor

The get_cluster() function now asks for ingress_domain for the case of Kind clusters. It defaults at None and is not required on OpenShift clusters.

Why does get_cluster() function need to know Ingress domain? The cluster already exists, Ingress was applied when cluster was created.

@Bobbins228
Copy link
Contributor Author

@sutaakar

Why does get_cluster() function need to know Ingress domain? The cluster already exists, Ingress was applied when cluster was created.

The get_cluster() function uses the from_k8_cluster_object(rc, mcad=True) function to create a new Cluster Configuration

We use the Ray cluster spec rc to gather info like image, name, machine_types, etc.
Unfortunately we cannot gather the domain name which was used originally to create the cluster because it is not included in the Ray cluster spec and was put in place when the ingress object was created in the AppWrapper.

This is necessary as we plan to reintroduce routes in #407 and we will not be supplying domains for them as OpenShift does so automatically.

Alternatively we could go under the assumption that the ingress is created successfully and use list_namespaced_object to gather the ingress and extract the domain from the host that way. If that doesn't work we can ask the user to supply the ingress name in get_cluster WDYT?

@sutaakar
Copy link
Contributor

@Bobbins228 I am still not sure what do you need the domain for when calling get_cluster().
If you create a cluster then you need this value to create Ingress resource. But for get_cluster(), the Ingress already exists, you don't need to create it any more. So, why is the value needed?

Do you need domain to compute what is the Dashboard URL, or is there any other reason?

@Bobbins228
Copy link
Contributor Author

@sutaakar
When from_k8_cluster_object creates that new ClusterConfiguration an appwrapper is created the exact same way we would create one before calling cluster.up() see ClusterConfiguration init here

To create this app wrapper we need all required fields to be fulfilled in the case of Kind clusters ingress_domain is required despite the fact the cluster and ingresses have already been created previously.

from_k8_cluster_object does most of the job filling in the appwrapper for us but ingress_domain is not as simple to fill in automatically.

@sutaakar
Copy link
Contributor

@Bobbins228 What will happen in case user fills some random value as ingress_domain, i.e. "abcde.com"?
From your description it looks like the value is solely needed to fulfill internal implementation requirements without any functionality overlaps.

@Bobbins228
Copy link
Contributor Author

@sutaakar
I see your point in the case the user provides the wrong ingress_domain I think you could still use it for any cluster. functions but ultimately the aw is wrong in that scenario.

It is not ideal hence the suggestion to first check for the ingress in the namespace and gather the domain from the host before asking for the ingress_domain.

@KPostOffice
Copy link
Contributor

Yeah, I'm not a huge fan of this either. Could you add an annotation to the ray-cluster that has the ingress domain and then get it directly from there?

@sutaakar
Copy link
Contributor

Another option could be to assign an owner for Ingress/Route (if it doesn't have) and use it to query for Ingress belonging to the AppWrapper or Ray cluster.

@Bobbins228
Copy link
Contributor Author

@KPostOffice @sutaakar
I have updated the PR,

We now get the appwrapper object and extract the domain from the ingress Host when mcad=True and in the opposite case we get the ingress object that has the ingress-owner: cluster_name label and extract the domain from the host there.

You no longer have to provide the ingress_domain for get_cluster

@Bobbins228
Copy link
Contributor Author

@sutaakar I was able to apply the owner functionality in relation to ray clusters created with app wrappers but ingresses created when mcad=False do not have an owner so I have kept in the label for that case.

@sutaakar
Copy link
Contributor

sutaakar commented Dec 20, 2023

When does the mcad=False case happen?

Edit - It is provided as a parameter for ClusterConfiguration.
What is its meaning? Does it mean that AppWrapper is not used? If so, what is used then?

@Bobbins228
Copy link
Contributor Author

@sutaakar
Yes it is a parameter of the cluster configuration.
By setting mcad=False an appwrapper is not created instead a yaml file featuring the RC, Ingress and secret.
The SDK will use KubeRay to create the ray cluster and individually create the ingress and secrets which would make them have no owner in the traditional sense.

@sutaakar
Copy link
Contributor

sutaakar commented Dec 20, 2023

hmm, so SDK creates RayCluster CR directly. Would it have sense to add owner for Secret and Ingress pointing to RayCluster CR?

From my perspective they are coupled together (there is no reason for Secret or Ingress to exist outside of RayCluster CR lifecycle)?

@Bobbins228
Copy link
Contributor Author

@sutaakar To add a owner reference I need to add in the uid of the ray cluster right?

@sutaakar
Copy link
Contributor

@Bobbins228 Right, uid is required.

@Bobbins228
Copy link
Contributor Author

@sutaakar We won't be able to apply the uid of the Ray Cluster to the ingress because it is created when the rc is up. Ingresses and secrets are created at the same time.

For the case of mcad=False we will have to keep the ingress-owner label

@sutaakar
Copy link
Contributor

sutaakar commented Jan 3, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 3, 2024
@sutaakar
Copy link
Contributor

sutaakar commented Jan 3, 2024

@Srihari1192 do you want to review the PR too?

@Srihari1192
Copy link
Contributor

@Srihari1192 do you want to review the PR too?

Sure will take a look

@Srihari1192 Srihari1192 self-requested a review January 4, 2024 06:28
@Srihari1192
Copy link
Contributor

@Bobbins228 @sutaakar I have noticed below error when testing with this changes in this PR at step get_cluster('mnist',namespace)

Traceback (most recent call last): File "mnist_rayjob.py", line 12, in <module> cluster = get_cluster('mnist',namespace) File "/codeflare-sdk/codeflare-sdk/src/codeflare_sdk/cluster/cluster.py", line 710, in get_cluster ingress_domain = ingress_host.split(".", 1)[1] IndexError: list index out of range

@Bobbins228
Copy link
Contributor Author

@Srihari1192
Have you set your ingress up with ingress_domain or ingress_options?

@Srihari1192
Copy link
Contributor

@Bobbins228 Yes the Ingress_options is set same as in sdk test here and also i tried verifying existing sdk tests same error here aswell https://github.com/project-codeflare/codeflare-operator/actions/runs/7406823001/job/20151892482?pr=424

@Bobbins228
Copy link
Contributor Author

@Bobbins228
Copy link
Contributor Author

@Srihari1192
Okay there is an issue here with the way we get the ingress domain from the host.
Here we get the host from the ingress associated with the Ray Cluster and try to get the domain by extracting the string after a .

I will have a look further to see if there is a better way to extract the ingress domain from the host.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 8, 2024
Copy link
Contributor

@Srihari1192 Srihari1192 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested the changes here looks good

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 9, 2024
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 10, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: KPostOffice, Srihari1192

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 10, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 0feab0f into project-codeflare:main Jan 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Failing to get Clusters info using get_cluster() function for the Ray cluster created with Ingress configuration

4 participants