Move 8x V100 & 2x A100SXM4 hosts out of OpenStack #680
Comments
OpenStack maintenance is going to be scheduled for the whole day on Tuesday, Sep 3, 2024.
@joachimweyl These nodes were removed during yesterday's maintenance. That said, @hakasapl should sync up with @aabaris to confirm their status and also to figure out the switch ports that are required in order to flip OBM and data over to ESI nets.
@hakasapl can you please move the nodes ASAP? The InstructLab team is currently blocked by this issue, and they are my highest priority Red Hat project.
@hpdempsey I just checked with @aabaris; he would like to confirm at the 1/2PM meetings today that they are ready to be moved, and then I can move them. I am at MGHPCC today, so I will do this tonight after I get home, assuming we get the confirmations we need.
Thanks.
@hpdempsey Sorry for only just noticing this. The V100s from OpenStack are in the NERC network, not in the MOC network, so getting those onto ESI right away will be impossible. The A100s I can do ASAP. After we set up the new NERC core switches, the plan was to move all the NERC switches to the MOCA network, but that's not in place yet. I suggest we discuss this further in the 1pm meeting.
If they are needed ASAP, one option is to physically move them to a rack where we have MOCA switching. I could do that today and set them up in the evening, depending on the priority level.
@hakasapl The InstructLab team that is blocked is using the A100s.
Perfect, those I can do right away.
Thank you!
@tzumainn PR ready for you: CCI-MOC/esi-pilot#72
thanks! testing the nodes now; I'll update when things are ready or if there's a problem
nodes work great! I've assigned them as follows:
@jtriley what are the next steps for getting the 7 V100s into OpenShift Prod?
@joachimweyl I'm working on adding the nodes to prod now - I'll update the checklist in the description when it's completed.
The 7x V100 nodes have been added to prod: NOTE:
cc @naved001 Working on adding the 1x V100 host to the ocp-test cluster now.
The 1x V100 node has been added to ocp-test:
@joachimweyl Just a note: I updated the title and description to reference the number of hosts rather than the number of GPU cards, as that was a bit confusing.
Motivation
OpenStack is not utilizing the GPUs it has.
Completion Criteria
8 V100s moved out of OpenStack (1 into OpenShift Test and 7 into OpenShift Prod); 8 A100SXM4s moved out of OpenStack and into OpenShift Prod.
Description
Completion dates
Desired - 2024-08-21
Required - TBD