
Move 8x V100 & 2x A100SXM4 hosts out of OpenStack #680

Closed
7 tasks done
joachimweyl opened this issue Aug 7, 2024 · 20 comments
joachimweyl commented Aug 7, 2024

Motivation

OpenStack is not utilizing the GPUs it has.

Completion Criteria

8 V100 hosts moved out of OpenStack (1 into OpenShift Test, 7 into OpenShift Prod); 2 A100SXM4 hosts moved out of OpenStack and into ESI.

Description

  • Schedule OpenStack maintenance
  • Track down which 8x V100 hosts have no VMs on them
  • Track down which 2x A100SXM4 hosts have no VMs on them
  • Move GPU hosts out of OpenStack during the maintenance window
  • Move 2 A100SXM4 Nodes to ESI
    • let these be floaters
  • Move 1 V100 Node to New hardware OpenShift Test Cluster
  • Move 7 V100 Nodes to OpenShift Prod Cluster

Completion dates

Desired - 2024-08-21
Required - TBD

@Milstein

OpenStack maintenance is going to be scheduled for the whole day on Tuesday, Sep 3, 2024.

@joachimweyl

@jtriley can we confirm these nodes are out of OpenStack and ready for @hakasapl to provide the node manifests to @tzumainn to add to ESI?

jtriley commented Sep 4, 2024

@joachimweyl These nodes were removed during yesterday's maintenance. That said, @hakasapl should sync up with @aabaris to confirm their status and also to figure out the switch ports that are required in order to flip OBM and data over to ESI nets.

@hpdempsey

@hakasapl can you please move the nodes ASAP? The InstructLab team is blocked by this issue currently, and they are my highest priority Red Hat project.

hakasapl commented Sep 4, 2024

@hpdempsey I just checked with @aabaris; he would like to confirm at today's 1/2 PM meetings that they are ready to be moved, and then I can move them. I am at MGHPCC today, so I will do this tonight after I get home, assuming we get the confirmations we need.

@hpdempsey

Thanks.

hakasapl commented Sep 4, 2024

@hpdempsey Sorry for just noticing this. The V100s from OpenStack are in the NERC network, not the MOC network, so getting those onto ESI right away will be impossible. The A100s I can do ASAP.

The plan was to move all the NERC switches to the MOCA network once the new NERC core switches are set up, but that's not in place yet. I suggest we discuss this further in the 1 PM meeting.

hakasapl commented Sep 4, 2024

If they are needed ASAP, one option is to physically move them to a rack where we have MOCA switching. I could do that today and set them up in the evening, depending on the priority level.

tssala23 commented Sep 4, 2024

@hakasapl The InstructLab team that is blocked is using the A100s.

hakasapl commented Sep 4, 2024

Perfect, those I can do right away

tssala23 commented Sep 4, 2024

Thank you!

hakasapl commented Sep 4, 2024

@tzumainn PR ready for you: CCI-MOC/esi-pilot#72

tzumainn commented Sep 4, 2024

thanks! testing the nodes now; I'll update when things are ready or if there's a problem

tzumainn commented Sep 4, 2024

Nodes work great! I've assigned them as follows:

  • MOC-R8PAC23U25: OpenShiftBeta
  • MOC-R8PAC23U26: research_rhelai

tssala23 commented Sep 4, 2024

Thank you! @hakasapl @tzumainn

@joachimweyl joachimweyl assigned jtriley and unassigned jtriley and hakasapl Sep 6, 2024
@joachimweyl

@jtriley what are the next steps for getting the 7 V100s into OpenShift Prod?

jtriley commented Sep 10, 2024

@joachimweyl I'm working on adding the nodes to prod now - I'll update the checklist in the description when it's completed.

jtriley commented Sep 10, 2024

The 7x V100 nodes have been added to prod:

NOTE: wrk-88 and wrk-89 were the 2x V100 nodes already in the cluster.

$ oc get nodes -l 'nvidia.com/gpu.product=Tesla-V100-PCIE-32GB'
NAME      STATUS   ROLES    AGE     VERSION
wrk-102   Ready    worker   5m53s   v1.28.11+add48d0
wrk-103   Ready    worker   5m50s   v1.28.11+add48d0
wrk-104   Ready    worker   5m49s   v1.28.11+add48d0
wrk-105   Ready    worker   5m53s   v1.28.11+add48d0
wrk-106   Ready    worker   5m50s   v1.28.11+add48d0
wrk-107   Ready    worker   5m51s   v1.28.11+add48d0
wrk-108   Ready    worker   5m47s   v1.28.11+add48d0
wrk-88    Ready    worker   356d    v1.28.11+add48d0
wrk-89    Ready    worker   356d    v1.28.11+add48d0

cc @naved001

Working on adding the 1x V100 host to the ocp-test cluster now.

jtriley commented Sep 10, 2024

The 1x V100 node has been added to ocp-test:

$ oc get nodes -l 'nvidia.com/gpu.product=Tesla-V100-PCIE-32GB'
NAME    STATUS   ROLES    AGE   VERSION
wrk-3   Ready    worker   35m   v1.28.12+396c881
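As a complement to the label-selector checks above, GPU capacity on the new workers can be read straight from the node objects; a sketch, assuming `oc get nodes -o json` output and the `nvidia.com/gpu` allocatable resource published by the NVIDIA GPU operator (node names and counts below are illustrative stand-ins):

```python
def gpu_allocatable(nodes_json):
    """Map node name -> allocatable nvidia.com/gpu count, skipping
    nodes that expose no GPU resource."""
    out = {}
    for item in nodes_json["items"]:
        alloc = item["status"]["allocatable"]
        if "nvidia.com/gpu" in alloc:
            out[item["metadata"]["name"]] = int(alloc["nvidia.com/gpu"])
    return out

# Stand-in for `oc get nodes -o json`; values are illustrative only.
sample = {
    "items": [
        {"metadata": {"name": "wrk-102"},
         "status": {"allocatable": {"nvidia.com/gpu": "4", "cpu": "95500m"}}},
        {"metadata": {"name": "wrk-1"},
         "status": {"allocatable": {"cpu": "95500m"}}},
    ]
}

print(gpu_allocatable(sample))  # {'wrk-102': 4}
```

A node showing `Ready` but missing the `nvidia.com/gpu` resource usually means the GPU operator hasn't finished labeling it yet, so this is a useful post-add sanity check.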

@joachimweyl joachimweyl self-assigned this Sep 11, 2024
@jtriley jtriley changed the title Move 8 V100s & 8 A100SXM4s out of OpenStack Move 8x V100 & 2x A100SXM4 hosts out of OpenStack Sep 11, 2024
jtriley commented Sep 11, 2024

@joachimweyl Just a note: I updated the title and description to reference the number of hosts rather than the number of GPU cards, as that was a bit confusing.
