
Conversation

@Bobbins228 (Contributor) commented Mar 5, 2024

Issue link

References RHOAIENG-4028

What changes have been made

The addition of the new volumeMounts caused the get_cluster() function to fail because it assumed the new volume mounts were the local-interactive ones.
This made the function try to recreate the local_interactive-related resources, which raised the
ValueError: ingress_domain is invalid. For creating the client route/ingress please specify an ingress domain
error because ingress_domain was not specified.

Also added a simplified import for the get_cluster() command. It can now be imported like this:
from codeflare_sdk import get_cluster

Verification steps

Run through a demo notebook and create a Ray Cluster on Kind and OpenShift.
Run cluster = get_cluster(name, namespace).
cluster.down() and the other cluster.* commands should work (see the sketch below).
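
For reference, a minimal usage sketch of the simplified import and the commands above; the name and namespace values are placeholders:

```python
from codeflare_sdk import get_cluster

# Retrieve an existing Ray Cluster by name and namespace
# ("raytest" and "default" are placeholder values).
cluster = get_cluster("raytest", "default")

cluster.details()  # inspect the reconstructed cluster
cluster.down()     # tear the cluster down when finished
```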

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@Fiona-Waters (Contributor) left a comment

Have run through a notebook on OpenShift and all works as expected. Will run on a kind cluster next. Just a couple of nitpicks at the moment. 🙂

@Bobbins228 Bobbins228 changed the title fix: fix get_cluster logic fix: get_cluster logic Mar 11, 2024
@Srihari1192 (Contributor) commented Mar 20, 2024

@Bobbins228 Cluster details are fetched successfully with the get_cluster() function, but when we try to submit a job after getting the cluster details via get_cluster(), the job submission fails with the error below.
Tested against RHOAI 2.8 in notebooks, with cluster creation using openshift_oauth=True.

Note: Job submission works fine when the cluster details are not fetched with get_cluster().

Error log :

SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1129)

During handling of the above exception, another exception occurred:

MaxRetryError                             Traceback (most recent call last)
File /opt/app-root/lib64/python3.9/site-packages/requests/adapters.py:486, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    485 try:
--> 486     resp = conn.urlopen(
    487         method=request.method,
    488         url=url,
    489         body=request.body,
    490         headers=request.headers,
    491         redirect=False,
    492         assert_same_host=False,
    493         preload_content=False,
    494         decode_content=False,
    495         retries=self.max_retries,
    496         timeout=timeout,
    497         chunked=chunked,
    498     )
    500 except (ProtocolError, OSError) as err:

File /opt/app-root/lib64/python3.9/site-packages/urllib3/connectionpool.py:799, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    797     e = ProtocolError("Connection aborted.", e)
--> 799 retries = retries.increment(
    800     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    801 )
    802 retries.sleep()

File /opt/app-root/lib64/python3.9/site-packages/urllib3/util/retry.py:592, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    591 if new_retry.is_exhausted():
--> 592     raise MaxRetryError(_pool, url, error or ResponseError(cause))
    594 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)
File /opt/app-root/lib64/python3.9/site-packages/codeflare_sdk/job/jobs.py:191, in DDPJob.__init__(self, job_definition, cluster)
    189 self.cluster = cluster
    190 if self.cluster:
--> 191     definition, runner = job_definition._dry_run(cluster)
    192     self._app_handle = runner.schedule(definition)
    193     self._runner = runner

File /opt/app-root/lib64/python3.9/site-packages/codeflare_sdk/job/jobs.py:99, in DDPJobDefinition._dry_run(self, cluster)
     97 def _dry_run(self, cluster: "Cluster"):
     98     j = f"{cluster.config.num_workers}x{max(cluster.config.num_gpus, 1)}"  # # of proc. = # of gpus
---> 99     runner = get_runner(ray_client=cluster.job_client)
    100     runner._scheduler_instances["ray"] = RayScheduler(
    101         session_name=runner._name, ray_client=cluster.job_client
    102     )
    103     return (
    104         runner.dryrun(
    105             app=ddp(
   (...)
    129         runner,
    130     )

File /opt/app-root/lib64/python3.9/site-packages/codeflare_sdk/cluster/cluster.py:106, in Cluster.job_client(self)
    100     self._job_submission_client = JobSubmissionClient(
    101         self.cluster_dashboard_uri(),
    102         headers=self._client_headers,
    103         verify=self._client_verify_tls,
    104     )
    105 else:
--> 106     self._job_submission_client = JobSubmissionClient(
    107         self.cluster_dashboard_uri()
    108     )
    109 return self._job_submission_client

File /opt/app-root/lib64/python3.9/site-packages/ray/dashboard/modules/job/sdk.py:110, in JobSubmissionClient.__init__(self, address, create_cluster_if_needed, cookies, metadata, headers, verify)
    100 api_server_url = get_address_for_submission_client(address)
    102 super().__init__(
    103     address=api_server_url,
    104     create_cluster_if_needed=create_cluster_if_needed,
   (...)
    108     verify=verify,
    109 )
--> 110 self._check_connection_and_version(
    111     min_version="1.9",
    112     version_error_message="Jobs API is not supported on the Ray "
    113     "cluster. Please ensure the cluster is "
    114     "running Ray 1.9 or higher.",
    115 )
    117 # In ray>=2.0, the client sends the new kwarg `submission_id` to the server
    118 # upon every job submission, which causes servers with ray<2.0 to error.
    119 if packaging.version.parse(self._client_ray_version) > packaging.version.parse(
    120     "2.0"
    121 ):

File /opt/app-root/lib64/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py:248, in SubmissionClient._check_connection_and_version(self, min_version, version_error_message)
    245 def _check_connection_and_version(
    246     self, min_version: str = "1.9", version_error_message: str = None
    247 ):
--> 248     self._check_connection_and_version_with_url(min_version, version_error_message)

File /opt/app-root/lib64/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py:278, in SubmissionClient._check_connection_and_version_with_url(self, min_version, version_error_message, url)
    273         raise RuntimeError(
    274             f"Ray version {running_ray_version} is running on the cluster. "
    275             + version_error_message
    276         )
    277 except requests.exceptions.ConnectionError:
--> 278     raise ConnectionError(
    279         f"Failed to connect to Ray at address: {self._address}."
    280     )
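
The failing frame in cluster.py:106 above is the branch that builds the JobSubmissionClient without auth headers or a TLS verification setting. A minimal sketch of the two call shapes, with a placeholder dashboard URI, token, and CA path:

```python
from ray.job_submission import JobSubmissionClient

dashboard_uri = "https://ray-dashboard-example.apps.example.com"  # placeholder

# With OAuth/TLS details available, the client is built with headers and verify:
client_with_auth = JobSubmissionClient(
    dashboard_uri,
    headers={"Authorization": "Bearer <token>"},  # placeholder token
    verify="/path/to/ca.crt",  # CA bundle, or False to skip verification
)

# Without them, default verification is used, which fails against a
# self-signed certificate chain as in the traceback above:
client_plain = JobSubmissionClient(dashboard_uri)
```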

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 20, 2024
@Bobbins228 (Contributor, Author)

@Srihari1192
I tested this out by performing the following:

  • Set up on RHOAI 2.8
  • Launched a Notebook
  • Authenticated with Token + Server with skip_tls=True
  • Created a Ray Cluster
  • Ran get_cluster()
  • Submitted a job

I was able to submit a job without failure. I am unsure what other differences there could be between our setups, or how to recreate the error, unless I missed a step?
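
A minimal sketch of that test sequence, assuming top-level codeflare_sdk imports; the token, server, and cluster name are placeholders:

```python
from codeflare_sdk import Cluster, ClusterConfiguration, TokenAuthentication, get_cluster

# Authenticate with a token and server, skipping TLS verification
# (mirrors skip_tls=True in the steps above; values are placeholders).
auth = TokenAuthentication(
    token="sha256~<token>",
    server="https://api.<cluster-domain>:6443",
    skip_tls=True,
)
auth.login()

# Create a Ray Cluster, then re-fetch it with get_cluster().
cluster = Cluster(ClusterConfiguration(name="jobtest", namespace="default", num_workers=1))
cluster.up()
cluster.wait_ready()

cluster = get_cluster("jobtest", "default")

# Submit a job through the cluster's Ray job client (entrypoint is a placeholder).
submission_id = cluster.job_client.submit_job(entrypoint="python train.py")
print(submission_id)
```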

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 20, 2024
@ChristianZaccaria (Contributor) commented Mar 25, 2024

Using a diff checker, I compared the YAML created when setting the ClusterConfiguration [left side] with the one retrieved via get_cluster [right side].

[screenshot: diff of the generated YAML]
The same is seen under WorkerGroupSpec as well.
I think both ways work; should we keep the use of integers here when retrieving the cluster?

@ChristianZaccaria (Contributor)

ca.crt and ca.key differ too, is this expected?

[screenshot: diff of ca.crt and ca.key]

@Bobbins228 (Contributor, Author)

Using a diff checker, I compared the YAML created when setting the ClusterConfiguration [left side] with the one retrieved via get_cluster [right side].

I have a local fix for that. I will push later today :)

ca.crt and ca.key differ too, is this expected?

With the way get_cluster() works, yes.
The get_cluster() command essentially creates a new ClusterConfiguration based on an existing Ray Cluster.
This causes a brand-new cert to be created for the retrieved RayCluster/AppWrapper. I am not sure whether this breaks the local_interactive demo, as the certs already exist on the Ray Cluster in your K8s environment, but I will investigate. @ChristianZaccaria

@Srihari1192 Srihari1192 self-requested a review March 26, 2024 11:59
@Srihari1192 (Contributor)

@Bobbins228 Cluster details are fetched successfully with the get_cluster() function, but when we try to submit a job after getting the cluster details via get_cluster(), the job submission fails with the error below. Tested against RHOAI 2.8 in notebooks, with cluster creation using openshift_oauth=True.

Note: Job submission works fine when the cluster details are not fetched with get_cluster().

[quoted from the Mar 20 comment above, including the same SSLCertVerificationError / MaxRetryError traceback]

@Bobbins228 Tested with the new changes; the get_cluster function works as expected.
However, the error I reported looks similar to the known issue https://issues.redhat.com/browse/RHOAIENG-4240. Maybe the fix you have implemented will address that as well.
CC @KPostOffice

@KPostOffice (Contributor) left a comment

lgtm other than one small nit

@Bobbins228 Bobbins228 requested a review from KPostOffice March 27, 2024 19:18
@KPostOffice (Contributor) left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 28, 2024
@openshift-ci openshift-ci bot commented Mar 28, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: KPostOffice

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 28, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit f1a8622 into project-codeflare:main Mar 28, 2024