follow up: New NVIDIA A100 GPUs - Quality Test #504

Closed
3 of 8 tasks
schwesig opened this issue Mar 25, 2024 · 1 comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request help wanted Extra attention is needed observability rhods RHODS

Comments

@schwesig
Member

follow up: New NVIDIA A100 GPUs - Quality Test

Nice-to-have follow-up for #482.
Feel free to participate in this testing and to share your experiences and results.
If help is wanted or needed for observing the tests, contact @schwesig.

  • We are planning to conduct a quantity test for the newly installed NVIDIA A100 GPUs by spinning up 200 RHODS images with GPU claims.

We are planning to conduct a quality test for the newly installed NVIDIA A100 GPUs by running a newly developed "Tensorflow Jupyter CUDA" image, designed to test the computing power within our OpenShift AI environment. The test focuses on the GPUs' performance and compatibility.
This test will utilize the new Tensorflow Jupyter CUDA image with 02_model_training_basics.ipynb.

This test does not need to be exclusive to this image/script. If anything is missing, or if other scripts or images would be useful, feel free to add them.

Test Objectives:

  • Ensure stability and performance when utilizing the GPUs.
  • Verify compatibility and stability when using the Tensorflow Jupyter CUDA image.
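
A compatibility check along these lines could start by confirming that TensorFlow inside the image actually sees a GPU. The snippet below is only a sketch: the helper name `gpu_report` is made up for illustration, and it is not taken from 02_model_training_basics.ipynb. It falls back to an empty report if TensorFlow is not installed, so it can also be run outside the image.

```python
def gpu_report():
    """Describe what TensorFlow can see, or why it cannot be loaded.

    Hypothetical helper for illustration; not part of the test notebook.
    """
    try:
        import tensorflow as tf
    except ImportError as exc:
        # TensorFlow is missing: report that instead of crashing.
        return {"tensorflow": None, "built_with_cuda": False,
                "gpus": [], "error": str(exc)}

    gpus = tf.config.list_physical_devices("GPU")
    return {
        "tensorflow": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "gpus": [gpu.name for gpu in gpus],
        "error": None,
    }


if __name__ == "__main__":
    print(gpu_report())
```

An empty `gpus` list inside a notebook that has a GPU claim would be worth documenting as a compatibility issue.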

Test Environments:

follow up, nice to have:

  • On Test Cluster
    • parallel workloads (2, 3, 4)
  • On Prod Cluster, ideally before classes start again (03/18/2024)
    • basic, single workload
    • parallel workloads (2, 3, 4)

Procedure:

  1. Conduct a series of TensorFlow jobs to test GPU performance.
  2. Monitor system stability, noting any crashes or errors.
  3. Deploy the "Tensorflow Jupyter CUDA" image on the Prod Cluster.
  4. Repeat the testing procedure on the Prod Cluster, ensuring consistency and reliability on different clusters.
  5. Document any issues encountered and the outcomes of the tests for both clusters.
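
The run, monitor, and document steps above, combined with the parallel levels (2, 3, 4) from the test environments, could be sketched roughly as follows. Everything here is an assumption for illustration: `placeholder_workload` stands in for a real TensorFlow job or notebook run, and the function names are hypothetical. Threads are used only to keep the sketch self-contained; on the cluster each workload would more likely be its own notebook or pod.

```python
import time
import traceback
from concurrent.futures import ThreadPoolExecutor


def placeholder_workload(worker_id):
    # Hypothetical stand-in for a real GPU job (e.g. running the notebook).
    return sum(i * i for i in range(100_000))


def _timed(workload, worker_id):
    # Run one workload, recording duration and any error instead of raising.
    start = time.perf_counter()
    try:
        workload(worker_id)
        status, error = "ok", None
    except Exception:
        status, error = "failed", traceback.format_exc()
    return {
        "worker": worker_id,
        "status": status,
        "seconds": time.perf_counter() - start,
        "error": error,
    }


def run_parallel(workload, levels=(2, 3, 4)):
    # For each parallelism level, start that many workloads at once
    # and collect one result record per workload.
    results = {}
    for n in levels:
        with ThreadPoolExecutor(max_workers=n) as pool:
            futures = [pool.submit(_timed, workload, i) for i in range(n)]
            results[n] = [f.result() for f in futures]
    return results


if __name__ == "__main__":
    for level, records in run_parallel(placeholder_workload).items():
        failed = [r for r in records if r["status"] != "ok"]
        print(f"{level} parallel workloads: {len(failed)} failed")
```

The per-workload records (status, duration, traceback) could then be pasted into this issue as the documentation of outcomes for each cluster.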

This quality test aims to confirm that the new NVIDIA A100 GPUs are working and can be used for upcoming classes and projects.

@schwesig schwesig added documentation Improvements or additions to documentation enhancement New feature or request help wanted Extra attention is needed rhods RHODS observability labels Mar 25, 2024
@schwesig
Member Author

closed to follow up on the more reusable #534 idea
