[BUG] LiveMigration fails because of same product_uuid on same model hardware servers #4025

staedter · 2023-06-01T16:55:48Z

Describe the bug
We are setting up an on-prem Harvester bare-metal cluster. When we try to use the LiveMigration feature, we get the following error message in the events of the virtualmachine resource:

VirtualMachineInstance migration uid c9091bc7-60ce-49b4-84dd-83adb17fbd9d failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: Attempt to migrate guest to the same host 03000200-0400-0500-0006-000700080009')

This is in line with the fact that all our hosts have the same UUID in the "More Information" section of the host resource (e.g. two examples)

We have verified that on all nodes the SMBIOS interface file /sys/class/dmi/id/product_uuid contains exactly the same value, 03000200-0400-0500-0006-000700080009, whereas /sys/class/dmi/id/product_serial contains consecutive numbers such as 9000160160 and 9000160161.
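The check described above can be repeated on each node with a small program. The following is a minimal sketch, not part of Harvester or kubelet: the names `dmiIdentity` and `readDMIIdentity` are made up for illustration, and the file reader is injected so the logic can be exercised without root access or real hardware.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// dmiIdentity holds the two SMBIOS fields compared in this report.
// (Hypothetical type, introduced only for this sketch.)
type dmiIdentity struct {
	ProductUUID   string
	ProductSerial string
}

// readDMIIdentity reads product_uuid and product_serial through readFile
// and trims surrounding whitespace such as the trailing newline.
func readDMIIdentity(readFile func(name string) ([]byte, error)) (dmiIdentity, error) {
	uuid, err := readFile("product_uuid")
	if err != nil {
		return dmiIdentity{}, fmt.Errorf("read product_uuid: %w", err)
	}
	serial, err := readFile("product_serial")
	if err != nil {
		return dmiIdentity{}, fmt.Errorf("read product_serial: %w", err)
	}
	return dmiIdentity{
		ProductUUID:   strings.TrimSpace(string(uuid)),
		ProductSerial: strings.TrimSpace(string(serial)),
	}, nil
}

func main() {
	const dmiDir = "/sys/class/dmi/id" // path quoted in this report

	id, err := readDMIIdentity(func(name string) ([]byte, error) {
		return os.ReadFile(filepath.Join(dmiDir, name))
	})
	if err != nil {
		// These files are usually readable by root only.
		fmt.Fprintln(os.Stderr, "could not read DMI data:", err)
		return
	}
	fmt.Printf("product_uuid=%s product_serial=%s\n", id.ProductUUID, id.ProductSerial)
}
```

Running it on every node and comparing the output should show whether only product_uuid is shared while product_serial differs, as observed here.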

We would like to know how to fix this so that Rancher recognizes all nodes as different nodes and we can use the LiveMigration feature.

Best regards
Chris

To Reproduce
Steps to reproduce the behavior:

Go to any VM running on the Harvester cluster
Select Migrate and choose a migration target
After a short while, the migration fails with the above error message

Expected behavior
When I use the LiveMigration feature the chosen VM is migrated successfully to another node with a different UUID

Support bundle

I sent the support bundle to harvester-support-bundle@suse.com with the correct issue ID

Environment

Harvester ISO version: harvester-v1.1.2-amd64
Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): 5x bare-metal AMD Dual-CPU RA2112-ASEPN Server (2x EPYC 75F3 CPUs)

Additional context
As the screenshot shows, we are also experiencing the AMD KVM issue described in #3900, but we are waiting for the release of the announced patch.

w13915984028 · 2023-06-01T18:20:16Z

@staedter According to the source code, if product_uuid is present, kubelet (via cAdvisor) uses it directly.

You could try updating /sys/class/dmi/id/product_uuid manually; the new value should then be propagated to the system info of the NODE object.

If the change is not synced to the NODE after your modification (I am not sure whether Linux allows the change or reverts it), kill the kubelet process on that NODE; the newly created kubelet will pick it up.

But I suspect the change will not survive a NODE-level reboot.

https://utcc.utoronto.ca/~cks/space/blog/linux/DMIDataInSysfs

In sysfs, DMI information is found at /sys/class/dmi/id, which is a symlink to /sys/devices/virtual/dmi/id. This will commonly expose the DMI 'BIOS', 'Base Board', 'Chassis', and 'System Information' sections as bios_*, board_*, chassis_*, and product_*. The sys_vendor file is the Vendor field from the DMI 'System Information' section. There's also a modalias file that summarizes much of this if you want to have it all in one spot.

Update it in SMBIOS?

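// GetSystemUUID returns the first system UUID source that can be read:
// the DMI product_uuid, then the PowerPC device tree, then the s390x machine-id.
func (fs *realSysFs) GetSystemUUID() (string, error) {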
        if id, err := ioutil.ReadFile(path.Join(dmiDir, "id", "product_uuid")); err == nil {
                return strings.TrimSpace(string(id)), nil
        } else if id, err = ioutil.ReadFile(path.Join(ppcDevTree, "system-id")); err == nil {
                return strings.TrimSpace(strings.TrimRight(string(id), "\000")), nil
        } else if id, err = ioutil.ReadFile(path.Join(ppcDevTree, "vm,uuid")); err == nil {
                return strings.TrimSpace(strings.TrimRight(string(id), "\000")), nil
        } else if id, err = ioutil.ReadFile(path.Join(s390xDevTree, "machine-id")); err == nil {
                return strings.TrimSpace(string(id)), nil
        } else {
                return "", err
        }
}


        blockDir     = "/sys/block"
        cacheDir     = "/sys/devices/system/cpu/cpu"
        netDir       = "/sys/class/net"
        dmiDir       = "/sys/class/dmi"
        ppcDevTree   = "/proc/device-tree"
        s390xDevTree = "/etc" // s390/s390x changes

staedter · 2023-06-02T08:01:01Z

Hello and thank you for the quick response.

Manually editing /sys/class/dmi/id/product_uuid is unfortunately not possible because the file is, AFAIK, just an interface to certain kernel functions. In any case, none of our attempts to change, modify, or override this file made the migration succeed. Mounting another file in place of the original looked promising and updated the "More Information" section in the Harvester GUI, but after a while it reverted to the original value, and even in the meantime the migrations did not succeed.

The only way to edit it from inside the OS would be to change those kernel files and recompile the whole kernel of the underlying OS, which is of course not a viable option for us.

We are already in contact with our hardware vendor to try to change the SMBIOS values that the kernel reads from the mainboard by reflashing the whole BIOS. This looks like a promising avenue for us because our vendor is very helpful and knowledgeable. Nonetheless, that might not be the case for every vendor, so maybe this kind of edge case should also be considered and caught on the Harvester/hypervisor side.

For example, would it be possible to make the default value read from product_uuid overridable, perhaps with a special annotation, so that such a case would not require reflashing the BIOS on each bare-metal server?

w13915984028 · 2023-06-02T08:25:58Z

@staedter

This seems to be the first time Harvester has encountered several machines with the same product_uuid. I read the value on my local PC and on a VM running on it, and each is unique.

PC:
sudo -i cat /sys/class/dmi/id/product_uuid 
30eb9a66-bcb7-25d8-4f74-04421ae88eac

KVM VM on this PC:
sudo -i cat /sys/class/dmi/id/product_uuid
39c53d5b-e7ec-45d1-b53a-263c4d46034f

As the source code is in kubelet, an agent running on each k8s NODE, I suspect the code is not easy to change. It would be much more complex than it is now: kubelet would need to read the value from the underlying layer, compare it with the api-server, and, when it is duplicated, randomize it or use some other value.

This has also been discussed before:
kubernetes/kubeadm#31 (comment)

list this as a requirement for everything running smoothly.
Kubernetes and things running on top might require/assume that
a) The product_uuid is unique
b) The MAC address is unique
for every node.

@guangbochen @bk201 We could also add this into Harvester requirements.
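A pre-flight check for the uniqueness requirement quoted above could be sketched as follows. This is an illustration only and not an existing Harvester or Kubernetes feature: `duplicateUUIDs` and the node names are hypothetical, and the input map is assumed to have been collected beforehand, for example from each Node object's `.status.nodeInfo.systemUUID` field or from product_uuid on each host.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// duplicateUUIDs groups node names by system UUID and returns only the
// UUIDs reported by more than one node. UUIDs are compared
// case-insensitively and with surrounding whitespace removed.
func duplicateUUIDs(nodeUUIDs map[string]string) map[string][]string {
	byUUID := make(map[string][]string)
	for node, uuid := range nodeUUIDs {
		key := strings.ToLower(strings.TrimSpace(uuid))
		byUUID[key] = append(byUUID[key], node)
	}

	dups := make(map[string][]string)
	for uuid, nodes := range byUUID {
		if len(nodes) > 1 {
			sort.Strings(nodes) // stable output regardless of map order
			dups[uuid] = nodes
		}
	}
	return dups
}

func main() {
	// Hypothetical node names; the UUID values are taken from this thread.
	nodes := map[string]string{
		"node-1": "03000200-0400-0500-0006-000700080009",
		"node-2": "03000200-0400-0500-0006-000700080009",
		"node-3": "30eb9a66-bcb7-25d8-4f74-04421ae88eac",
	}
	for uuid, names := range duplicateUUIDs(nodes) {
		fmt.Printf("duplicate product_uuid %s on nodes %v\n", uuid, names)
	}
}
```

The same grouping could be applied to MAC addresses to cover the second part of the requirement.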

harvesterhci-io-github-bot · 2023-06-02T08:53:24Z

Pre Ready-For-Testing Checklist

If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted?
The HEP PR is at:
Where is the reproduce steps/test steps documented?
The reproduce steps/test steps are at: issue description
Is there a workaround for the issue? If so, where is it documented?
The workaround is at: issue description

* [ ] Have the backend code been merged (harvester, harvester-installer, etc) (including `backport-needed/*`)? The PR is at: https://github.com/harvester/docs/pull/324
* [ ] Does the PR include the explanation for the fix or the feature?
* [ ] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart? The PR for the YAML change is at: The PR for the chart change is at:

If labeled: area/ui Has the UI issue filed or ready to be merged?
The UI issue/PR is at:
If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged?
The documentation/KB PR is at: emphasis Product_UUID needs to be unique docs#324

If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?
- The automation skeleton PR is at:
- The automation test case PR is at:
If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
The compatibility issue is filed at:

harvesterhci-io-github-bot · 2023-06-02T08:53:25Z

Automation e2e test issue: harvester/tests#844

staedter · 2023-06-05T06:46:46Z

We were able to fix the issue on our servers by changing the SMBIOS information on our ASUS mainboards with the amidmi.exe tool.

We had to create a FreeDOS boot stick and then use this command:

amidmi.exe /u "00000000000000000000123456789123"

After a reboot, /sys/class/dmi/id/product_uuid finally showed different values and KubeVirt started working correctly.
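For clusters with more than a handful of servers, the per-node values could be generated rather than typed by hand. The sketch below is an assumption-laden illustration: `newUUIDHex` is a hypothetical helper, and the only thing known about the argument format is the single command shown above (32 digits without dashes), so whether amidmi.exe accepts the letters A-F in this form should be verified against the tool's own documentation.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUIDHex returns 16 random bytes rendered as 32 uppercase hex digits,
// the same shape as the value passed to `amidmi.exe /u` in this thread.
func newUUIDHex() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	return fmt.Sprintf("%X", b[:]), nil
}

func main() {
	// One value per server; the report mentions five bare-metal nodes.
	for i := 1; i <= 5; i++ {
		id, err := newUUIDHex()
		if err != nil {
			fmt.Println("could not generate value:", err)
			return
		}
		fmt.Printf("server %d: amidmi.exe /u \"%s\"\n", i, id)
	}
}
```

Random values make accidental collisions between nodes extremely unlikely, which is the property the migration check depends on.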

Thank you for the help. The issue has been resolved.

staedter added the labels kind/bug (Issues that are defects reported by users or that we know have reached a real release), reproduce/needed (Reminder to add a reproduce label and to remove this one), and severity/needed (Reminder to add a severity label and to remove this one) Jun 1, 2023

w13915984028 added the label require/doc (Improvements or additions to documentation) Jun 2, 2023

w13915984028 self-assigned this Jun 2, 2023

w13915984028 mentioned this issue Jun 2, 2023

emphasis Product_UUID needs to be unique harvester/docs#324

Merged

w13915984028 added this to the v1.2.0 milestone Jun 2, 2023

harvesterhci-io-github-bot mentioned this issue Jun 2, 2023

[e2e] [BUG] LiveMigration fails because of same product_uuid on same model hardware servers harvester/tests#844

Open

1 task

w13915984028 added the labels not-require/test-plan (Skip to create a e2e automation test issue), severity/4 (Function working but has a minor issue (a minor incident with low impact)), and reproduce/rare (Reproducible less than 10% of the time) Jun 2, 2023

staedter closed this as completed Jun 5, 2023

LucasSaintarbor added the require/doc-pr-opened label Jun 27, 2023
