Incorrect GPU Specification and Machine Type Mapping for A100 in Vertex API #37

Open
jeffhernandez1995 opened this issue Dec 12, 2023 · 0 comments

Hello,

I'd like to express my appreciation for the xmanager tool! However, I've noticed three issues with how the A100 GPU and its associated machine types are specified for the Vertex API, which I'd like to bring to your attention:

  1. GPU Naming Discrepancy:
    According to the Google Cloud resource documentation, the correct name for the A100 GPU with 80GB is A100_80GB, not A100_80GIB. This naming inconsistency leads to an error when requesting this resource. Reference: Google Cloud Documentation. Additionally, I've attached an image from the documentation.
    [Image: Documentation Screenshot]

  2. Incorrect API Call Formation:
    When A100_80GIB is requested, the Vertex API call is built with the accelerator type 'NVIDIA_TESLA_A100_80GIB', whereas it should be 'NVIDIA_A100_80GB'. I believe this error stems from the line accelerator_type = 'NVIDIA_TESLA_' + str(resource).upper() in the vertex.py script.

  3. Machine Type Mismatch:
    The A100_80GB GPU should be associated with machine types such as 'a2-ultragpu-1g', 'a2-ultragpu-2g', 'a2-ultragpu-4g', and 'a2-ultragpu-8g'. However, the current specification only maps A100 GPUs to the following machine types:

_A100_GPUS_TO_MACHINE_TYPE = {
    1: 'a2-highgpu-1g',
    2: 'a2-highgpu-2g',
    4: 'a2-highgpu-4g',
    8: 'a2-highgpu-8g',
    16: 'a2-megagpu-16g',
}
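
To make the three points above concrete, here is a rough sketch of one possible fix. It is not xmanager's actual code: the names `_A100_80GB_GPUS_TO_MACHINE_TYPE`, `_ACCELERATOR_NAME_OVERRIDES`, `vertex_accelerator_type`, and `a100_machine_type` are hypothetical, and keeping `A100_80GIB` as an accepted alias is my assumption rather than something the maintainers have agreed to.

```python
# Existing mapping quoted from the issue (A100 40GB machine types).
_A100_GPUS_TO_MACHINE_TYPE = {
    1: 'a2-highgpu-1g',
    2: 'a2-highgpu-2g',
    4: 'a2-highgpu-4g',
    8: 'a2-highgpu-8g',
    16: 'a2-megagpu-16g',
}

# Hypothetical new mapping for the 80GB variant (machine types from point 3).
_A100_80GB_GPUS_TO_MACHINE_TYPE = {
    1: 'a2-ultragpu-1g',
    2: 'a2-ultragpu-2g',
    4: 'a2-ultragpu-4g',
    8: 'a2-ultragpu-8g',
}

# Hypothetical table of names that do not follow the 'NVIDIA_TESLA_' prefix
# rule. Accepting the old 'A100_80GIB' spelling is an assumption.
_ACCELERATOR_NAME_OVERRIDES = {
    'A100_80GB': 'NVIDIA_A100_80GB',
    'A100_80GIB': 'NVIDIA_A100_80GB',
}

_MACHINE_TYPES_BY_ACCELERATOR = {
    'NVIDIA_TESLA_A100': _A100_GPUS_TO_MACHINE_TYPE,
    'NVIDIA_A100_80GB': _A100_80GB_GPUS_TO_MACHINE_TYPE,
}


def vertex_accelerator_type(resource_name: str) -> str:
    """Builds the Vertex accelerator type string for a resource name."""
    name = resource_name.upper()
    # Fall back to the prefix rule currently used in vertex.py.
    return _ACCELERATOR_NAME_OVERRIDES.get(name, 'NVIDIA_TESLA_' + name)


def a100_machine_type(resource_name: str, gpu_count: int) -> str:
    """Picks the machine type for an A100 variant and GPU count."""
    accelerator = vertex_accelerator_type(resource_name)
    table = _MACHINE_TYPES_BY_ACCELERATOR.get(accelerator)
    if table is None:
        raise ValueError(f'{resource_name} is not an A100 variant')
    if gpu_count not in table:
        raise ValueError(f'No machine type for {gpu_count} x {accelerator}')
    return table[gpu_count]
```

With something like this, `a100_machine_type('A100_80GB', 4)` would select 'a2-ultragpu-4g', while plain `A100` requests would continue to use the existing a2-highgpu and a2-megagpu mapping.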

Thank you for your attention to this matter.
