[BUG] LiveMigration fails because of same product_uuid on same model hardware servers #4025

Closed
staedter opened this issue Jun 1, 2023 · 6 comments
Labels: kind/bug, not-require/test-plan, reproduce/needed, reproduce/rare, require/doc, severity/needed, severity/4

@staedter

staedter commented Jun 1, 2023

Describe the bug
We are setting up an on-prem Harvester bare-metal cluster. When we try to use the LiveMigration feature, we get the following error message in the events of the VirtualMachine resource.

VirtualMachineInstance migration uid c9091bc7-60ce-49b4-84dd-83adb17fbd9d failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: Attempt to migrate guest to the same host 03000200-0400-0500-0006-000700080009')

image

This is in line with the fact that all our hosts have the same UUID in the "More Information" section of the host resource (e.g. two examples)
image
image

We have verified that on all nodes the SMBIOS interface file /sys/class/dmi/id/product_uuid contains exactly the same value, 03000200-0400-0500-0006-000700080009, while the contents of /sys/class/dmi/id/product_serial are consecutive numbers such as 9000160160 and 9000160161.

We would like to know how to fix this so that Rancher recognizes all nodes as different nodes and we can use the LiveMigration feature.
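The check described above can be sketched in Go. This is only an illustration: the node names are hypothetical, and the values are assumed to have been collected by hand (for example by running `cat /sys/class/dmi/id/product_uuid` on each host). The third UUID is made up to show a non-colliding host.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// duplicateUUIDs groups node names by their product_uuid and returns only
// the UUIDs that are reported by more than one node.
func duplicateUUIDs(uuidByNode map[string]string) map[string][]string {
	nodesByUUID := make(map[string][]string)
	for node, uuid := range uuidByNode {
		// Normalize case and strip the trailing newline that sysfs adds.
		key := strings.ToLower(strings.TrimSpace(uuid))
		nodesByUUID[key] = append(nodesByUUID[key], node)
	}
	dups := make(map[string][]string)
	for uuid, nodes := range nodesByUUID {
		if len(nodes) > 1 {
			sort.Strings(nodes)
			dups[uuid] = nodes
		}
	}
	return dups
}

func main() {
	// Hypothetical node names; values as read from product_uuid on each host.
	collected := map[string]string{
		"node-1": "03000200-0400-0500-0006-000700080009\n",
		"node-2": "03000200-0400-0500-0006-000700080009\n",
		"node-3": "11111111-2222-3333-4444-555555555555\n", // made-up value
	}
	for uuid, nodes := range duplicateUUIDs(collected) {
		fmt.Printf("product_uuid %s is shared by %v\n", uuid, nodes)
	}
}
```

Any UUID that appears in the result is shared by at least two hosts, which is the situation that makes libvirt refuse the migration with "Attempt to migrate guest to the same host".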

Best regards
Chris

To Reproduce
Steps to reproduce the behavior:

  1. Go to any VM running on the Harvester cluster
  2. Select Migrate and choose a migration target
  3. After a short while the migration fails with the above error message

Expected behavior
When I use the LiveMigration feature the chosen VM is migrated successfully to another node with a different UUID

Support bundle

I sent the support bundle to harvester-support-bundle@suse.com with the correct issue ID

Environment

  • Harvester ISO version: harvester-v1.1.2-amd64
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): 5x bare-metal AMD Dual-CPU RA2112-ASEPN Server (2x EPYC 75F3 CPUs)

Additional context
As the screenshot shows, we are also experiencing the AMD KVM issue described in #3900, but we are waiting for the release of the announced patch.

@staedter staedter added kind/bug Issues that are defects reported by users or that we know have reached a real release reproduce/needed Reminder to add a reproduce label and to remove this one severity/needed Reminder to add a severity label and to remove this one labels Jun 1, 2023
@w13915984028
Member

w13915984028 commented Jun 1, 2023

@staedter Per the source code, if product_uuid is present, kubelet (via cAdvisor) uses it directly.

You could try to update /sys/class/dmi/id/product_uuid manually; the new value should then be propagated to the Node object's system info.

If the value is not synced to the Node object after your change (I am not sure Linux allows the change or will not revert it), kill the kubelet process on that node; the newly started kubelet will pick it up.

But I suspect the change will not survive a node reboot.

https://utcc.utoronto.ca/~cks/space/blog/linux/DMIDataInSysfs

In sysfs, DMI information is found at /sys/class/dmi/id, which is a symlink to /sys/devices/virtual/dmi/id. This will commonly expose the DMI 'BIOS', 'Base Board', 'Chassis', and 'System Information' sections as bios_, board_, chassis_, and product_. The sys_vendor file is the Vendor field from the DMI 'System Information' section. There's also a modalias file that summarizes much of this if you want to have it all in one spot.

Update it in SMBIOS?

func (fs *realSysFs) GetSystemUUID() (string, error) {
        if id, err := ioutil.ReadFile(path.Join(dmiDir, "id", "product_uuid")); err == nil {
                return strings.TrimSpace(string(id)), nil
        } else if id, err = ioutil.ReadFile(path.Join(ppcDevTree, "system-id")); err == nil {
                return strings.TrimSpace(strings.TrimRight(string(id), "\000")), nil
        } else if id, err = ioutil.ReadFile(path.Join(ppcDevTree, "vm,uuid")); err == nil {
                return strings.TrimSpace(strings.TrimRight(string(id), "\000")), nil
        } else if id, err = ioutil.ReadFile(path.Join(s390xDevTree, "machine-id")); err == nil {
                return strings.TrimSpace(string(id)), nil
        } else {
                return "", err
        }
}


        blockDir     = "/sys/block"
        cacheDir     = "/sys/devices/system/cpu/cpu"
        netDir       = "/sys/class/net"
        dmiDir       = "/sys/class/dmi"
        ppcDevTree   = "/proc/device-tree"
        s390xDevTree = "/etc" // s390/s390x changes

@staedter
Author

staedter commented Jun 2, 2023

Hello and thank you for the quick response.

Manually editing /sys/class/dmi/id/product_uuid is unfortunately not possible because the file, AFAIK, is just an interface to certain kernel functions. In any case, none of our attempts to change, modify or override this file worked well enough for the migration to succeed. Mounting another file in place of the original looked promising and updated the "More Information" section in the Harvester GUI, but after a while it reverted to the original value, and even in the meantime the migrations did not succeed.

The only way to edit it from inside the OS would be to change those kernel files and recompile the whole kernel of the underlying OS, which is of course not a viable option for us.

We are already in contact with our hardware vendor to try to change the SMBIOS values that the kernel reads from the mainboard by reflashing the whole BIOS. This looks like a promising avenue for us because our vendor is very helpful and knowledgeable. Nonetheless, that might not be the case for every vendor, so maybe this kind of edge case should be considered and caught on the Harvester/hypervisor side as well.

For example, would it be possible to make the default value read from product_uuid overridable, perhaps with a special annotation, so that such a case would not require reflashing the BIOS on each bare-metal server?
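To make the proposal concrete, here is a minimal sketch of what such an override could look like. It is purely hypothetical: neither Harvester, KubeVirt nor kubelet is known to offer this, and the annotation key `example.invalid/system-uuid-override` and the function `effectiveUUID` are invented for this illustration.

```go
package main

import "fmt"

// overrideAnnotation is a made-up key used only for this illustration.
// It is not an annotation that any real component is known to honor.
const overrideAnnotation = "example.invalid/system-uuid-override"

// effectiveUUID returns the override from the node annotations when one is
// set and non-empty, and falls back to the value read from the hardware.
func effectiveUUID(annotations map[string]string, hardwareUUID string) string {
	if v, ok := annotations[overrideAnnotation]; ok && v != "" {
		return v
	}
	return hardwareUUID
}

func main() {
	hw := "03000200-0400-0500-0006-000700080009"

	// Without an override, the hardware value is used unchanged.
	fmt.Println(effectiveUUID(nil, hw))

	// With the hypothetical override, the annotated value wins.
	ann := map[string]string{overrideAnnotation: "00000000-0000-0000-0000-000000000001"}
	fmt.Println(effectiveUUID(ann, hw))
}
```

The sketch only covers the selection logic; it says nothing about how the chosen value would reach libvirt, which is where the migration check actually fails.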

@w13915984028
Member

w13915984028 commented Jun 2, 2023

@staedter

This seems to be the first time Harvester has encountered several machines with the same product_uuid. I read the values from my local PC and from a VM running on it, and each is unique.

PC:
sudo -i cat /sys/class/dmi/id/product_uuid 
30eb9a66-bcb7-25d8-4f74-04421ae88eac

KVM VM on this PC:
sudo -i cat /sys/class/dmi/id/product_uuid
39c53d5b-e7ec-45d1-b53a-263c4d46034f

As the source code is in kubelet, an agent running on each Kubernetes node, I suspect the code is not easy to change. It would become much more complex than it is now: kubelet would need to read the value from the underlying layer, compare it with the API server, and, when it is duplicated, randomize it or use some other value.

This has been discussed before:
kubernetes/kubeadm#31 (comment)

list this as a requirement for everything running smoothly.
Kubernetes and things running on top might require/assume that
a) The product_uuid is unique
b) The MAC address is unique
for every node.

@guangbochen @bk201 We could also add this into Harvester requirements.

@w13915984028 w13915984028 added the require/doc Improvements or additions to documentation label Jun 2, 2023
@w13915984028 w13915984028 self-assigned this Jun 2, 2023
@w13915984028 w13915984028 added this to the v1.2.0 milestone Jun 2, 2023
@harvesterhci-io-github-bot

harvesterhci-io-github-bot commented Jun 2, 2023

Pre Ready-For-Testing Checklist

  • If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted?
    The HEP PR is at:

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: issue description

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at: issue description

* [ ] Have the backend code been merged (harvester, harvester-installer, etc) (including `backport-needed/*`)? The PR is at: https://github.com/harvester/docs/pull/324
* [ ] Does the PR include the explanation for the fix or the feature? 

* [ ] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
The PR for the YAML change is at:
The PR for the chart change is at:
  • If labeled: area/ui Has the UI issue filed or ready to be merged?
    The UI issue/PR is at:

  • If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged?
    The documentation/KB PR is at: emphasis Product_UUID needs to be unique docs#324

  • If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?

    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:

@harvesterhci-io-github-bot

Automation e2e test issue: harvester/tests#844

@w13915984028 w13915984028 added not-require/test-plan Skip to create a e2e automation test issue severity/4 Function working but has a minor issue (a minor incident with low impact) reproduce/rare Reproducible less than 10% of the time labels Jun 2, 2023
@staedter
Author

staedter commented Jun 5, 2023

We were able to fix the issue on our servers by changing the SMBIOS information on our ASUS mainboards with the amidmi.exe tool.

We had to create a FreeDOS boot stick and then run this command:

amidmi.exe /u "00000000000000000000123456789123"

After a reboot, /sys/class/dmi/id/product_uuid finally showed a different value on each node, and KubeVirt started working correctly.

Thank you for the help. The issue has been resolved.
