Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removal of init_myservice initContainer #412

Merged

Conversation

Fiona-Waters
Copy link
Contributor

Issue link

Jira Issue
Closes #297

What changes have been made

Removal of the init_myservice initContainer for the ray workgroupspec because the operator also injects a wait-gcs-ready initContainer which does similar thing.

Verification steps

Install codeflare sdk on a cluster - test notebooks and ensure that they work correctly.
Run unit tests and ensure that they work correctly.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 27, 2023
@Fiona-Waters Fiona-Waters reopened this Nov 27, 2023
@Fiona-Waters Fiona-Waters marked this pull request as ready for review November 28, 2023 09:33
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 28, 2023
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 28, 2023
@Fiona-Waters
Copy link
Contributor Author

After speaking to @Bobbins228 I tried out the local interactive demo notebook and unfortunately am seeing this error. Not sure what it means yet but looking into it. @tedhtchang would you have any insight here? Thanks

2023-11-28 06:59:10,622 INFO usage_lib.py:416 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2023-11-28 06:59:10,622 INFO scripts.py:744 -- Local node IP: 10.131.2.27
Traceback (most recent call last):
File "/home/ray/anaconda3/bin/ray", line 8, in <module>
sys.exit(main())
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/scripts/scripts.py", line 2498, in main
return cli()
File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
return f(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/scripts/scripts.py", line 771, in start
node = ray._private.node.Node(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/node.py", line 307, in __init__
self.start_head_processes()
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/node.py", line 1404, in start_head_processes
assert self.get_gcs_client() is not None
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/node.py", line 676, in get_gcs_client
self._init_gcs_client()
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/node.py", line 721, in _init_gcs_client
raise RuntimeError(
RuntimeError: Failed to start GCS. Last 1 lines of error files:
[2023-11-28 06:59:10,701 C 13 13] (gcs_server) grpc_server.cc:119: Check failed: server_ Failed to start the grpc server. The specified port is 6379. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running sudo lsof -i :6379 to check if there are other processes listening to the port.
.Please check /tmp/ray/session_2023-11-28_06-59-10_623863_8/logs/gcs_server.out for details

@tedhtchang
Copy link
Member

tedhtchang commented Nov 29, 2023

After speaking to @Bobbins228 I tried out the local interactive demo notebook and unfortunately am seeing this error. Not sure what it means yet but looking into it. @tedhtchang would you have any insight here? Thanks

2023-11-28 06:59:10,622 INFO usage_lib.py:416 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2023-11-28 06:59:10,622 INFO scripts.py:744 -- Local node IP: 10.131.2.27
Traceback (most recent call last):
File "/home/ray/anaconda3/bin/ray", line 8, in <module>
sys.exit(main())
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/scripts/scripts.py", line 2498, in main
return cli()
File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ray/anaconda3/lib/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
return f(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/scripts/scripts.py", line 771, in start
node = ray._private.node.Node(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/node.py", line 307, in __init__
self.start_head_processes()
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/node.py", line 1404, in start_head_processes
assert self.get_gcs_client() is not None
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/node.py", line 676, in get_gcs_client
self._init_gcs_client()
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/node.py", line 721, in _init_gcs_client
raise RuntimeError(
RuntimeError: Failed to start GCS. Last 1 lines of error files:
[2023-11-28 06:59:10,701 C 13 13] (gcs_server) grpc_server.cc:119: Check failed: server_ Failed to start the grpc server. The specified port is 6379. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running sudo lsof -i :6379 to check if there are other processes listening to the port.
.Please check /tmp/ray/session_2023-11-28_06-59-10_623863_8/logs/gcs_server.out for details

When the the local_interactive is true the SSL communication is enabled between the worker and head node. This requires all nodes to have SSL certificates which is what the create-cert initContainer does. I suspect this was the reason for this error.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 30, 2023
@Fiona-Waters Fiona-Waters force-pushed the remove-init-myservice branch 2 times, most recently from 67c8c95 to e378fd3 Compare December 4, 2023 22:59
Copy link
Contributor

@Bobbins228 Bobbins228 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few pieces on this.

We should reintroduce the init containers necessary for that local interactive use case and also add the old deletion logic back and update it to delete just those init containers.

If they are added back to the base template there we shouldn't need to have the extra local interactive template.

src/codeflare_sdk/utils/generate_yaml.py Show resolved Hide resolved
src/codeflare_sdk/templates/base-template.yaml Outdated Show resolved Hide resolved
@Fiona-Waters
Copy link
Contributor Author

Just a few pieces on this.

We should reintroduce the init containers necessary for that local interactive use case and also add the old deletion logic back and update it to delete just those init containers.

If they are added back to the base template there we shouldn't need to have the extra local interactive template.

Thanks @Bobbins228 I have made changes as discussed. Please review again when you have time.

Copy link
Collaborator

@KPostOffice KPostOffice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small comment

src/codeflare_sdk/utils/generate_yaml.py Outdated Show resolved Hide resolved
Copy link
Contributor

@Bobbins228 Bobbins228 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm Awesome stuff Fiona

Copy link
Member

@tedhtchang tedhtchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested local_interactive notebook on a KinD cluster. /LGTM

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 5, 2023
Copy link
Contributor

@Bobbins228 Bobbins228 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Copy link
Contributor

openshift-ci bot commented Dec 6, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti, Bobbins228, tedhtchang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 6, 2023
@openshift-merge-bot openshift-merge-bot bot merged commit b4f19db into project-codeflare:main Dec 6, 2023
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider remove the init_myservice initContainer
5 participants