CPU measurement fails after VM migrate between xen hosts #2871
Steps to reproduce:
CPU Measurement should continue without trouble.
No more CPU measurements are reported, and the following error appears:
Rebooting host fixes the issue.
Thanks for your help!
Before migration:
After migration:
Can you also run it again a few seconds later? The code in the plugin is written so that even if the values do get screwed up once, subsequent gather intervals should work fine. Your report indicates this is not the case, so I want to see how the values keep changing after the migration has completed.
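For anyone following along, here is a minimal sketch of why a single bad sample should only cost one interval. This is not the plugin's actual code: `CpuTimes`, `usage_percent` and `Sampler` are hypothetical names, and the numbers in the usage note are invented. It assumes usage is derived from the difference between two consecutive samples, and that the latest sample is always stored as the new baseline, even when the interval is rejected.

```python
from dataclasses import dataclass

FIELDS = ("user", "system", "idle", "steal")


@dataclass
class CpuTimes:
    """Cumulative CPU time counters (hypothetical, reduced to four fields)."""
    user: float
    system: float
    idle: float
    steal: float

    def total(self) -> float:
        return sum(getattr(self, name) for name in FIELDS)


def usage_percent(prev: CpuTimes, cur: CpuTimes) -> dict:
    """Percentage of the interval spent in each state, from two samples."""
    total_delta = cur.total() - prev.total()
    if total_delta <= 0:
        # Counters went backwards (or did not move): the interval is unusable.
        raise ValueError("current total CPU time is not greater than previous total")
    return {
        name: 100.0 * (getattr(cur, name) - getattr(prev, name)) / total_delta
        for name in FIELDS
    }


class Sampler:
    """Keeps the last sample and reports usage for each new one."""

    def __init__(self) -> None:
        self.prev = None

    def gather(self, cur: CpuTimes):
        # Always replace the baseline first, so a bad sample only
        # poisons the interval it belongs to.
        prev, self.prev = self.prev, cur
        if prev is None:
            return None  # first sample, nothing to compare against
        return usage_percent(prev, cur)
```

With this shape, feeding `CpuTimes(100, 50, 800, 50)`, then `CpuTimes(101, 51, 805, 10)` (total goes backwards, so the call raises), then `CpuTimes(111, 56, 885, 15)` gives a valid result again on the third call. If the error keeps repeating on every interval, the counters themselves must still be jumping around.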
Here are the new values (a few minutes later):
I can also reproduce the issue when checking just a few seconds later.
@danielnelson How do you want to handle this? This smells like a bug in the kernel. Those steal times don't make any sense. At first I thought it might be some weird integer overflow, but that doesn't hold up either: the values should all be changing at roughly the same rate, and instead they're all over the place:
I googled around a little bit, but couldn't find any reported issues of screwed up CPU steal times on Xen after a migration.
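A rough way to check the "changing at roughly the same rate" expectation is to diff the per-CPU steal counters between two snapshots. The sketch below assumes the snapshots are the text of `/proc/stat`, where steal is the eighth numeric column of each `cpuN` line. The helper names (`parse_steal`, `steal_deltas`, `went_backwards`) are made up for illustration.

```python
def parse_steal(proc_stat_text: str) -> dict:
    """Return {"cpu0": steal_ticks, ...} from the text of /proc/stat."""
    steal = {}
    for line in proc_stat_text.splitlines():
        fields = line.split()
        # Per-CPU lines look like: cpuN user nice system idle iowait irq softirq steal ...
        # The aggregate "cpu" line and non-CPU lines are skipped.
        if len(fields) >= 9 and fields[0].startswith("cpu") and fields[0][3:].isdigit():
            steal[fields[0]] = int(fields[8])
    return steal


def steal_deltas(before_text: str, after_text: str) -> dict:
    """Per-CPU change in steal ticks between two snapshots."""
    before = parse_steal(before_text)
    after = parse_steal(after_text)
    return {cpu: after[cpu] - before[cpu] for cpu in before if cpu in after}


def went_backwards(deltas: dict) -> list:
    """CPUs whose cumulative steal counter decreased, which should never happen."""
    return sorted(cpu for cpu, delta in deltas.items() if delta < 0)
```

On a healthy guest the deltas should be small, non-negative, and similar across CPUs over the same interval. Any CPU listed by `went_backwards` has a counter that is supposed to be cumulative but is not behaving that way.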
Edit: My wild ass theory on what's going on is that the steal time uses some values provided by the hypervisor in addition to values tracked in the guest kernel, and performs math involving both of them (for example: hypervisor time since boot and guest kernel time since boot). Since the VM moved to another host, the hypervisor values changed, and so some calculations in the kernel are getting screwed up.
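To make the theory concrete, here is a toy model. It is purely hypothetical and is not how the Xen or Linux code is written: `ToyGuest`, its baseline, and the counter values are all invented. It only illustrates how mixing a guest-side baseline with a host-side counter would produce nonsense once the host changes underneath the guest.

```python
MASK64 = (1 << 64) - 1  # emulate an unsigned 64-bit counter


class ToyGuest:
    """Hypothetical guest that derives steal time from a host-provided counter."""

    def __init__(self, host_counter_at_boot: int) -> None:
        # Baseline captured once, from whichever host the guest booted on.
        self.baseline = host_counter_at_boot

    def steal(self, host_counter_now: int) -> int:
        # Guest-side baseline subtracted from a host-side value; the result
        # only makes sense while both refer to the same host.
        return (host_counter_now - self.baseline) & MASK64
```

In this model, a guest booted with a host counter of `1_000_000` reports `500` when the same host later reads `1_000_500`. If the guest is then migrated to a host whose counter reads `2_000`, the subtraction goes negative and wraps to a huge unsigned value. A real implementation would be more involved, so treat this only as a picture of the kind of mismatch being suggested.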