[BUG] If you add two vGPUs the VM won't boot #5289

Closed
noahgildersleeve opened this issue Mar 5, 2024 · 5 comments
Labels

  • kind/bug: Issues that are defects reported by users or that we know have reached a real release
  • not-require/test-plan: Skip to create a e2e automation test issue
  • reproduce/always: Reproducible 100% of the time
  • require/doc: Improvements or additions to documentation
  • severity/2: Function working but has a major issue w/o workaround (a major incident with significant impact)

Comments

@noahgildersleeve
noahgildersleeve commented Mar 5, 2024

Describe the bug

When you add two vGPUs to a VM, the VM won't boot.

To Reproduce
Steps to reproduce the behavior:

  1. Enable two vGPUs on an existing VM
  2. Save, then restart the VM (or start it if it is stopped)
  3. Wait for the VM to start

Expected behavior

It should either boot or if we only support one vGPU it should

Support bundle

supportbundle_bf973e5e-935b-45fd-b911-432fb8a2a038_2024-03-05T01-35-13Z.zip

Environment

  • Harvester ISO version: v1.3.0-rc3
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): 2 nodes DL360 servers bare metal. A102

Additional context
Found while testing #2764
The issue may be related to the following error, which I'm seeing in the events. I also included screenshots.
Server error. command SyncVMI failed: "LibvirtError(Code=67, Domain=20, Message='unsupported configuration: Only one vgpu device can have 'ramfb' enabled')"

Greenshot 2024-03-04 17 44 17
Greenshot 2024-03-04 17 31 19

@noahgildersleeve noahgildersleeve added kind/bug Issues that are defects reported by users or that we know have reached a real release severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) reproduce/always Reproducible 100% of the time labels Mar 5, 2024
@bk201 bk201 added this to the v1.3.0 milestone Mar 5, 2024
@ibrokethecloud
Contributor

I can confirm that a VM with multiple vGPUs can boot once the following virtualGPUOptions are applied to one of the vGPU devices:

virtualGPUOptions:
  display:
    ramFB:
      enabled: false

However, for this to work, the vGPU profile must support multiple vGPU allocation. Based on the documentation at https://docs.nvidia.com/grid/16.0/grid-vgpu-release-notes-generic-linux-kvm/index.html, our GPUs only support multiple vGPU allocation if Q-series vGPUs are used:

image

I was able to create a VM with two A2-4Q vGPU profiles:

image

And attach them to the VM, with the additional virtualGPUOptions set on one of the vGPUs:

image

After this change, the VM boots successfully and the devices are visible to the guest:
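
For reference, here is a minimal sketch of how the workaround might look in the GPU section of a KubeVirt VirtualMachine spec. The `name` values and the `deviceName` resource name are assumptions for illustration and are not taken from this issue; use the names your cluster actually exposes. The first vGPU keeps the default display settings, and only the second has ramfb disabled, so at most one device has ramfb enabled.

```yaml
# Illustrative fragment only, not a complete VirtualMachine manifest.
spec:
  template:
    spec:
      domain:
        devices:
          gpus:
            # First vGPU: no virtualGPUOptions, so display defaults apply.
            - name: vgpu-1                          # hypothetical name
              deviceName: nvidia.com/NVIDIA_A2-4Q   # assumed resource name
            # Second vGPU: ramfb explicitly disabled (the workaround above).
            - name: vgpu-2                          # hypothetical name
              deviceName: nvidia.com/NVIDIA_A2-4Q   # assumed resource name
              virtualGPUOptions:
                display:
                  ramFB:
                    enabled: false
```

With more than two vGPUs, the same idea would presumably apply: disable ramfb on every device except one.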

image

@bk201 bk201 added the require/doc Improvements or additions to documentation label Mar 5, 2024
@noahgildersleeve
Author

I validated the workaround with version master-a2c98e96-head.

@bk201
Member

bk201 commented Mar 14, 2024

The doc PR covers this: harvester/docs#526

@bk201 bk201 added the not-require/test-plan Skip to create a e2e automation test issue label Mar 14, 2024
@harvesterhci-io-github-bot

Pre Ready-For-Testing Checklist

  • If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted?
    The HEP PR is at:

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Have the backend code been merged (harvester, harvester-installer, etc) (including backport-needed/*)?
    The PR is at:

    • Does the PR include the explanation for the fix or the feature?

    • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
      The PR for the YAML change is at:
      The PR for the chart change is at:

  • If labeled: area/ui Has the UI issue filed or ready to be merged?
    The UI issue/PR is at:

  • If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged?
    The documentation/KB PR is at:

  • If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?

    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:

@bk201
Member

bk201 commented Mar 15, 2024

@bk201 bk201 closed this as completed Mar 15, 2024

4 participants