
Conversation

@Bobbins228 (Contributor) commented Mar 5, 2024

Issue link

References RHOAIENG-4028

What changes have been made

The addition of the new volumeMounts caused the get_cluster() function to fail because it assumed the new volume mounts were the local-interactive ones.
This made the function try to recreate the local_interactive-related resources, which raised the
ValueError: ingress_domain is invalid. For creating the client route/ingress please specify an ingress domain
error because ingress_domain was not specified.

Also added a simplified import for the get_cluster() command. It can now be imported like this:
from codeflare_sdk import get_cluster

Verification steps

Run through a demo notebook and create a Ray Cluster on Kind and OpenShift.
Run cluster = get_cluster(name, namespace).
cluster.down() and the other cluster.* commands should work (see the sketch below).
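
For reference, a minimal usage sketch of the simplified import and the commands above; the name and namespace values are placeholders:

```python
from codeflare_sdk import get_cluster

# Retrieve an existing Ray Cluster by name and namespace
# ("raytest" and "default" are placeholder values).
cluster = get_cluster("raytest", "default")

cluster.details()  # inspect the reconstructed cluster
cluster.down()     # tear the cluster down when finished
```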

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@Fiona-Waters (Contributor) left a comment

Have run through a notebook on OpenShift and all works as expected. Will run on a kind cluster next. Just a couple of nitpicks at the moment. 🙂

@Bobbins228 Bobbins228 changed the title fix: fix get_cluster logic fix: get_cluster logic Mar 11, 2024
@Srihari1192 (Contributor) commented Mar 20, 2024

@Bobbins228 Cluster details are fetched successfully with the get_cluster() function, but when we try to submit a job after getting the cluster details via get_cluster(), the job submission fails with the error below.
Tested against RHOAI 2.8 in notebooks, with cluster creation using openshift_oauth=True.

Note: Job submission works fine when the cluster details are not fetched with get_cluster().

Error log :

SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1129)

During handling of the above exception, another exception occurred:

MaxRetryError                             Traceback (most recent call last)
File /opt/app-root/lib64/python3.9/site-packages/requests/adapters.py:486, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    485 try:
--> 486     resp = conn.urlopen(
    487         method=request.method,
    488         url=url,
    489         body=request.body,
    490         headers=request.headers,
    491         redirect=False,
    492         assert_same_host=False,
    493         preload_content=False,
    494         decode_content=False,
    495         retries=self.max_retries,
    496         timeout=timeout,
    497         chunked=chunked,
    498     )
    500 except (ProtocolError, OSError) as err:

File /opt/app-root/lib64/python3.9/site-packages/urllib3/connectionpool.py:799, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    797     e = ProtocolError("Connection aborted.", e)
--> 799 retries = retries.increment(
    800     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    801 )
    802 retries.sleep()

File /opt/app-root/lib64/python3.9/site-packages/urllib3/util/retry.py:592, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    591 if new_retry.is_exhausted():
--> 592     raise MaxRetryError(_pool, url, error or ResponseError(cause))
    594 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)
File /opt/app-root/lib64/python3.9/site-packages/codeflare_sdk/job/jobs.py:191, in DDPJob.__init__(self, job_definition, cluster)
    189 self.cluster = cluster
    190 if self.cluster:
--> 191     definition, runner = job_definition._dry_run(cluster)
    192     self._app_handle = runner.schedule(definition)
    193     self._runner = runner

File /opt/app-root/lib64/python3.9/site-packages/codeflare_sdk/job/jobs.py:99, in DDPJobDefinition._dry_run(self, cluster)
     97 def _dry_run(self, cluster: "Cluster"):
     98     j = f"{cluster.config.num_workers}x{max(cluster.config.num_gpus, 1)}"  # # of proc. = # of gpus
---> 99     runner = get_runner(ray_client=cluster.job_client)
    100     runner._scheduler_instances["ray"] = RayScheduler(
    101         session_name=runner._name, ray_client=cluster.job_client
    102     )
    103     return (
    104         runner.dryrun(
    105             app=ddp(
   (...)
    129         runner,
    130     )

File /opt/app-root/lib64/python3.9/site-packages/codeflare_sdk/cluster/cluster.py:106, in Cluster.job_client(self)
    100     self._job_submission_client = JobSubmissionClient(
    101         self.cluster_dashboard_uri(),
    102         headers=self._client_headers,
    103         verify=self._client_verify_tls,
    104     )
    105 else:
--> 106     self._job_submission_client = JobSubmissionClient(
    107         self.cluster_dashboard_uri()
    108     )
    109 return self._job_submission_client

File /opt/app-root/lib64/python3.9/site-packages/ray/dashboard/modules/job/sdk.py:110, in JobSubmissionClient.__init__(self, address, create_cluster_if_needed, cookies, metadata, headers, verify)
    100 api_server_url = get_address_for_submission_client(address)
    102 super().__init__(
    103     address=api_server_url,
    104     create_cluster_if_needed=create_cluster_if_needed,
   (...)
    108     verify=verify,
    109 )
--> 110 self._check_connection_and_version(
    111     min_version="1.9",
    112     version_error_message="Jobs API is not supported on the Ray "
    113     "cluster. Please ensure the cluster is "
    114     "running Ray 1.9 or higher.",
    115 )
    117 # In ray>=2.0, the client sends the new kwarg `submission_id` to the server
    118 # upon every job submission, which causes servers with ray<2.0 to error.
    119 if packaging.version.parse(self._client_ray_version) > packaging.version.parse(
    120     "2.0"
    121 ):

File /opt/app-root/lib64/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py:248, in SubmissionClient._check_connection_and_version(self, min_version, version_error_message)
    245 def _check_connection_and_version(
    246     self, min_version: str = "1.9", version_error_message: str = None
    247 ):
--> 248     self._check_connection_and_version_with_url(min_version, version_error_message)

File /opt/app-root/lib64/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py:278, in SubmissionClient._check_connection_and_version_with_url(self, min_version, version_error_message, url)
    273         raise RuntimeError(
    274             f"Ray version {running_ray_version} is running on the cluster. "
    275             + version_error_message
    276         )
    277 except requests.exceptions.ConnectionError:
--> 278     raise ConnectionError(
    279         f"Failed to connect to Ray at address: {self._address}."
    280     )
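
The failing frame in cluster.py:106 above is the branch that builds the JobSubmissionClient without auth headers or a TLS verification setting. A minimal sketch of the two call shapes, with a placeholder dashboard URI, token, and CA path:

```python
from ray.job_submission import JobSubmissionClient

dashboard_uri = "https://ray-dashboard-example.apps.example.com"  # placeholder

# With OAuth/TLS details available, the client is built with headers and verify:
client_with_auth = JobSubmissionClient(
    dashboard_uri,
    headers={"Authorization": "Bearer <token>"},  # placeholder token
    verify="/path/to/ca.crt",  # CA bundle, or False to skip verification
)

# Without them, default verification is used, which fails against a
# self-signed certificate chain as in the traceback above:
client_plain = JobSubmissionClient(dashboard_uri)
```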

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 20, 2024
@Bobbins228 (Contributor, Author)

@Srihari1192
I tested this out by performing the following:

  • Set up on RHOAI 2.8
  • Launched a Notebook
  • Authenticated with Token + Server with skip_tls=True
  • Created a Ray Cluster
  • Ran get_cluster()
  • Submitted a job

I was able to submit a job without failure. I am unsure what other differences there could be between our setups, or how to recreate the error, unless I missed a step?
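
A minimal sketch of that test sequence, assuming top-level codeflare_sdk imports; the token, server, and cluster name are placeholders:

```python
from codeflare_sdk import Cluster, ClusterConfiguration, TokenAuthentication, get_cluster

# Authenticate with a token and server, skipping TLS verification
# (mirrors skip_tls=True in the steps above; values are placeholders).
auth = TokenAuthentication(
    token="sha256~<token>",
    server="https://api.<cluster-domain>:6443",
    skip_tls=True,
)
auth.login()

# Create a Ray Cluster, then re-fetch it with get_cluster().
cluster = Cluster(ClusterConfiguration(name="jobtest", namespace="default", num_workers=1))
cluster.up()
cluster.wait_ready()

cluster = get_cluster("jobtest", "default")

# Submit a job through the cluster's Ray job client (entrypoint is a placeholder).
submission_id = cluster.job_client.submit_job(entrypoint="python train.py")
print(submission_id)
```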

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 20, 2024
@ChristianZaccaria (Contributor) commented Mar 25, 2024

Using a diff checker, I compared the YAML created when setting the ClusterConfiguration [left side] with the one retrieved via get_cluster [right side].

[screenshot: diff of the generated YAML]
The same is seen under WorkerGroupSpec as well.
I think both ways work; should we keep the use of integers here when retrieving the cluster?

@ChristianZaccaria (Contributor)

ca.crt and ca.key differ too, is this expected?

[screenshot: diff of ca.crt and ca.key]

@Bobbins228 (Contributor, Author)

Using a diff checker, I compared the YAML created when setting the ClusterConfiguration [left side] with the one retrieved via get_cluster [right side].

I have a local fix for that. I will push later today :)

ca.crt and ca.key differ too, is this expected?

With the way get_cluster() works, yes.
The get_cluster() command essentially creates a new ClusterConfiguration based on an existing Ray Cluster.
This causes a brand-new cert to be created for the retrieved RayCluster/AppWrapper. I am not sure whether this breaks the local_interactive demo, as the certs already exist on the Ray Cluster in your K8s environment, but I will investigate. @ChristianZaccaria

@Srihari1192 Srihari1192 self-requested a review March 26, 2024 11:59
@Srihari1192 (Contributor)

@Bobbins228 Cluster details are fetched successfully with the get_cluster() function, but when we try to submit a job after getting the cluster details via get_cluster(), the job submission fails with the error below. Tested against RHOAI 2.8 in notebooks, with cluster creation using openshift_oauth=True.

Note: Job submission works fine when the cluster details are not fetched with get_cluster().

[quoted from the Mar 20 comment above, including the same SSLCertVerificationError / MaxRetryError traceback]

@Bobbins228 Tested with the new changes; the get_cluster function works as expected.
However, the error I reported looks similar to the known issue https://issues.redhat.com/browse/RHOAIENG-4240. Maybe the fix you have implemented will address that as well.
CC @KPostOffice

@KPostOffice (Contributor) left a comment

lgtm other than one small nit

@Bobbins228 Bobbins228 requested a review from KPostOffice March 27, 2024 19:18
@KPostOffice (Contributor) left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 28, 2024
@openshift-ci openshift-ci bot commented Mar 28, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: KPostOffice

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 28, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit f1a8622 into project-codeflare:main Mar 28, 2024