Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move APIs to codeflare.dev groups #591

Merged
merged 3 commits into from
Aug 31, 2023
Merged

Move APIs to codeflare.dev groups #591

merged 3 commits into from
Aug 31, 2023

Conversation

astefanutti
Copy link
Contributor

@astefanutti astefanutti commented Aug 21, 2023

Part of project-codeflare/adr#7.

Closes #341.

@jbusche
Copy link
Contributor

jbusche commented Aug 22, 2023

I've built the MCAD image and helm-installed it - I'm able to successfully deploy perf-test appwrappers. It's a huge PR though, not sure how to fully test everything to ensure it all works with InstaScale, etc...

oc get pods |grep mcad
mcad-controller-78df9fb9d4-njbmx          1/1     Running   0             10m

and

test/perf-test/perf.sh
....
How many appwrapper jobs do you want?10
jobs number is 10
 
Jobs started at: 11:37:54
 
Submitting job 1
appwrapper.workload.codeflare.dev/defaultaw-schd-spec-with-timeout-1 created
Submitting job 2
appwrapper.workload.codeflare.dev/defaultaw-schd-spec-with-timeout-2 created
Submitting job 3
appwrapper.workload.codeflare.dev/defaultaw-schd-spec-with-timeout-3 created
Submitting job 4
appwrapper.workload.codeflare.dev/defaultaw-schd-spec-with-timeout-4 created
Submitting job 5
appwrapper.workload.codeflare.dev/defaultaw-schd-spec-with-timeout-5 created
Submitting job 6
appwrapper.workload.codeflare.dev/defaultaw-schd-spec-with-timeout-6 created
Submitting job 7
appwrapper.workload.codeflare.dev/defaultaw-schd-spec-with-timeout-7 created
Submitting job 8
appwrapper.workload.codeflare.dev/defaultaw-schd-spec-with-timeout-8 created
Submitting job 9
appwrapper.workload.codeflare.dev/defaultaw-schd-spec-with-timeout-9 created
Submitting job 10
appwrapper.workload.codeflare.dev/defaultaw-schd-spec-with-timeout-10 created
No resources found in default namespace.
Number of completed jobs is:  0  and the goal is:  10
No resources found in default namespace.
Number of completed jobs is:  0  and the goal is:  10
Number of completed jobs is:  4  and the goal is:  10
Number of completed jobs is:  8  and the goal is:  10
 
All 10 jobs finished: 11:38:41
Total amount of time for 10 appwrappers is: 47 seconds

(Note, I had to make a modification to the preempt-exp.yaml, commenting out the Items: [] line due to some recent change, doesn't seem related to this PR... I'll create an issue to fix this.)

@jbusche
Copy link
Contributor

jbusche commented Aug 22, 2023

@asm582, is there additional/better ways to test other than building and helm installing?

@asm582
Copy link
Member

asm582 commented Aug 22, 2023

@asm582, is there additional/better ways to test other than building and helm installing?

@jbusche I thought we were moving away from helm and planned to use only codeflare operator to deploy MCAD.
Anyways I think for now helm seems an easy way.
We would have to run manual tests for MCAD + InstaScale:

  • Deploy MCAD through the helm
  • Deploy InstaScale using the steps mentioned in the repo
  • Fire AW that needs resources and check if InstaScale picks up the AW (Note we only check logs whether InstaScale got invoked without actually adding resources to the cluster)

@jbusche
Copy link
Contributor

jbusche commented Aug 22, 2023

Tried some appwrappers with 100 fake KWOK nodes too - seems slower, but is completing.

@jbusche
Copy link
Contributor

jbusche commented Aug 22, 2023

Sure @asm582, let me try with helm and InstaScale - I'd have to think how I'd get a KFDEF deployed with this MCAD branch...

@asm582
Copy link
Member

asm582 commented Aug 22, 2023

Tried some appwrappers with 100 fake KWOK nodes too - seems slower, but is completing.

Thanks, it is great that we are using KWOK. can you please create a documentation issue on using KWOK to do scale tests please?
Also, a simple shell script would be great if we can do it, based on your previous perf script, KWOK enabled shell script should ask something like:

  • How many nodes do you need?
  • How many GPUs per node do you need?
  • Default num pods should be 110 per node
  • Finally should ask, How many AWs do you need?
    Fire those many AWs and produce a report.

I am thinking if we can reuse this PR: https://github.com/project-codeflare/multi-cluster-app-dispatcher/pull/469/files

@astefanutti
Copy link
Contributor Author

@jbusche thanks for the testing. Here is the accompanying PR for InstaScale, that can be used for testing project-codeflare/instascale#127.

@anishasthana
Copy link
Member

/hold

@jbusche
Copy link
Contributor

jbusche commented Aug 24, 2023

OK - I just tested this MCAD PR 591 together with the InstaScale PR 127, building images for both and using helm install to deploy of each specific branch, and it's looking good to me. Note, I don't actually have a real scalable cluster that can leverage InstaScale, but InstaScale IS picking up the appwrappers:

oc logs -f instascale-7bd6d676b9-trk9z
I0824 23:50:11.983705       1 request.go:690] Waited for 1.037949519s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/operators.coreos.com/v1alpha1?timeout=32s
2023-08-24T23:50:14Z	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": ":8080"}
E0824 23:50:14.092261       1 appwrapper_controller.go:165] Got config map named instascale-config from namespace kube-system that configures max nodes in cluster to value 15
E0824 23:50:14.094696       1 appwrapper_controller.go:189] Error getting instascale-ocm-secret from namespace default: secrets "instascale-ocm-secret" not found
I0824 23:50:14.094712       1 appwrapper_controller.go:190] If you are looking to use OCM, ensure that the 'instascale-ocm-secret' secret is available on the cluster within namespace default
I0824 23:50:14.094717       1 appwrapper_controller.go:191] Setting useMachineSets to true.
2023-08-24T23:50:14Z	INFO	setup	starting manager
2023-08-24T23:50:14Z	INFO	Starting server	{"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
2023-08-24T23:50:14Z	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8081"}
2023-08-24T23:50:14Z	INFO	Starting EventSource	{"controller": "appwrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "source": "kind source: *v1beta1.AppWrapper"}
2023-08-24T23:50:14Z	INFO	Starting Controller	{"controller": "appwrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper"}
2023-08-24T23:50:14Z	INFO	Starting workers	{"controller": "appwrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "worker count": 1}
I0824 23:51:46.063611       1 appwrapper_controller.go:242] Found Appwrapper named defaultaw-schd-spec-with-timeout-1 that has status Pending
I0824 23:51:46.063637       1 appwrapper_controller.go:293] Found AW defaultaw-schd-spec-with-timeout-1 that cannot be scaled due to missing orderedinstance label
I0824 23:51:46.063646       1 machineset.go:225] The nodes allowed: 15 and total nodes in cluster after node scale-out 0
I0824 23:51:46.063651       1 appwrapper_controller.go:328] Completed Scaling for defaultaw-schd-spec-with-timeout-1
I0824 23:51:46.349739       1 appwrapper_controller.go:242] Found Appwrapper named defaultaw-schd-spec-with-timeout-2 that has status 
I0824 23:51:46.715860       1 appwrapper_controller.go:242] Found Appwrapper named defaultaw-schd-spec-with-timeout-3 that has status 
I0824 23:51:47.085314       1 appwrapper_controller.go:242] Found Appwrapper named defaultaw-schd-spec-with-timeout-4 that has status 
I0824 23:51:47.451514       1 appwrapper_controller.go:242] Found Appwrapper named defaultaw-schd-spec-with-timeout-5 that has status 
I0824 23:52:06.778000       1 appwrapper_controller.go:263] Job completed, deleting resources owned
I0824 23:52:11.878350       1 appwrapper_controller.go:263] Job completed, deleting resources owned
I0824 23:52:11.953459       1 appwrapper_controller.go:263] Job completed, deleting resources owned
I0824 23:52:12.658344       1 appwrapper_controller.go:263] Job completed, deleting resources owned
I0824 23:52:13.598901       1 appwrapper_controller.go:263] Job completed, deleting resources owned
I0824 23:52:23.954608       1 appwrapper_controller.go:396] Appwrapper defaultaw-schd-spec-with-timeout-1 deleted, scaling down machines
I0824 23:52:24.068309       1 appwrapper_controller.go:396] Appwrapper defaultaw-schd-spec-with-timeout-2 deleted, scaling down machines
I0824 23:52:24.182505       1 appwrapper_controller.go:396] Appwrapper defaultaw-schd-spec-with-timeout-3 deleted, scaling down machines
I0824 23:52:24.291684       1 appwrapper_controller.go:396] Appwrapper defaultaw-schd-spec-with-timeout-4 deleted, scaling down machines
I0824 23:52:24.390921       1 appwrapper_controller.go:396] Appwrapper defaultaw-schd-spec-with-timeout-5 deleted, scaling down machines

@jbusche
Copy link
Contributor

jbusche commented Aug 25, 2023

Deployed with 20 fake kwok nodes, and both MCAD and InstaScale detected and processed the appwrappers:

oc logs -f instascale-7bd6d676b9-trk9z
...
I0825 00:13:34.160127       1 appwrapper_controller.go:263] Job completed, deleting resources owned
I0825 00:13:47.903276       1 appwrapper_controller.go:263] Job completed, deleting resources owned
I0825 00:13:47.993945       1 appwrapper_controller.go:263] Job completed, deleting resources owned
I0825 00:13:48.619900       1 appwrapper_controller.go:263] Job completed, deleting resources owned
I0825 00:13:49.568952       1 appwrapper_controller.go:263] Job completed, deleting resources owned
I0825 00:14:05.074198       1 appwrapper_controller.go:396] Appwrapper fake-defaultaw-schd-spec-with-timeout-1 deleted, scaling down machines
I0825 00:14:05.191447       1 appwrapper_controller.go:396] Appwrapper fake-defaultaw-schd-spec-with-timeout-2 deleted, scaling down machines
I0825 00:14:05.314616       1 appwrapper_controller.go:396] Appwrapper fake-defaultaw-schd-spec-with-timeout-3 deleted, scaling down machines
I0825 00:14:05.462589       1 appwrapper_controller.go:396] Appwrapper fake-defaultaw-schd-spec-with-timeout-4 deleted, scaling down machines
I0825 00:14:05.550734       1 appwrapper_controller.go:396] Appwrapper fake-defaultaw-schd-spec-with-timeout-5 deleted, scaling down machines

@astefanutti
Copy link
Contributor Author

@jbusche thanks a lot for the testing and feedback. I think it's safe to assume that the rest of the InstaScale auto-scaling logic should not be impacted, as long as it can process the AppWrapper in the new API group. So I think that's enough to validate this PR, so we can start rolling the migration in downstream components.

Let's wait for the upcoming release, and unhold / merge this at the beginning of the next iteration.

@asm582
Copy link
Member

asm582 commented Aug 30, 2023

/lgtm

@asm582
Copy link
Member

asm582 commented Aug 30, 2023

quota borrowing tests which was a recently merged PR are failing, I think we missed something during the merge. borrow quota trees are not found in the build.

@anishasthana
Copy link
Member

/unhold

@astefanutti
Copy link
Contributor Author

quota borrowing tests which was a recently merged PR are failing, I think we missed something during the merge. borrow quota trees are not found in the build.

Ah good catch, thanks. I've updated the tests.

@asm582
Copy link
Member

asm582 commented Aug 31, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Aug 31, 2023
@asm582
Copy link
Member

asm582 commented Aug 31, 2023

/approve

@openshift-ci
Copy link

openshift-ci bot commented Aug 31, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: asm582

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 05b1af8 into project-codeflare:main Aug 31, 2023
2 checks passed
@astefanutti astefanutti deleted the pr-12 branch August 31, 2023 15:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Migrate MCAD CRDs to CodeFlare domain
5 participants