
CPU measurement fails after VM migration between Xen hosts #2871

Closed
gallart opened this Issue May 31, 2017 · 8 comments

@gallart

gallart commented May 31, 2017

Bug report

Relevant telegraf.conf:

[[inputs.cpu]]
  # Whether to report per-cpu stats or not
  percpu = true
  # Whether to report total system cpu stats or not
  totalcpu = true
  # Comment this line if you want the raw CPU time metrics
  fielddrop = ["time_*"]

System info:

  • Telegraf version : 1.3.0
  • OS : Gentoo Linux
  • Virtual env : XenServer

Steps to reproduce:

  1. Start telegraf on a Linux guest
  2. Live-migrate the guest from one hypervisor to another

Expected behavior:

CPU measurement should continue without trouble.

Actual behavior:

No more CPU measurements; the following error is logged:

* Plugin: inputs.cpu, Collection 1
* Plugin: inputs.cpu, Collection 2
2017-05-31T13:33:26Z E! Error: current total CPU time is less than previous total CPU time

Additional info:

Rebooting the host fixes the issue.

Thanks for your help!

@phemmer

Contributor

phemmer commented May 31, 2017

This is really interesting. Would you be able to provide the output of grep '^cpu' /proc/stat before the migration, after the migration, and then again a few seconds later?

@gallart

gallart commented May 31, 2017

Before migration:

cpu  8496572 305946 943142 328322278 3023148 0 63260 194028 0 0
cpu0 1760022 80046 233060 81701764 1393126 0 24800 58059 0 0
cpu1 2273818 72875 234925 82237240 526798 0 12985 32368 0 0
cpu2 2292952 72327 238975 82189843 543429 0 13002 32781 0 0
cpu3 2169779 80696 236181 82193430 559794 0 12472 70820 0 0

After migration:

cpu  8497805 305947 943369 328393621 3023475 0 63270 1750538050819 0 0
cpu0 1760294 80048 233098 81719428 1393368 0 24804 1825121610327 0 0
cpu1 2274094 72875 234959 82255199 526853 0 12988 1841457961672 0 0
cpu2 2293320 72327 239009 82207724 543447 0 13003 1839967821051 0 0
cpu3 2170096 80696 236302 82211269 559806 0 12473 1778013879880 0 0
@phemmer

Contributor

phemmer commented May 31, 2017

Can you also do it again a few seconds later? The code in the plugin is such that even if the values do get screwed up once, subsequent gather intervals should work fine. But your report indicates this is not the case, so I want to see how the values change after the migration has completed.
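
For reference, here is a rough sketch of the kind of guard I mean (this is not the actual Telegraf source; the type and field names are made up): the plugin diffs the new counters against the previous sample and, when the total goes backwards, reports the error for that interval but still stores the new sample, so the next interval normally recovers.

// Rough sketch, not the actual Telegraf source; all names here are hypothetical.
package main

import (
    "errors"
    "fmt"
)

// cpuTimes mirrors the counters of one /proc/stat "cpu" line.
type cpuTimes struct {
    User, Nice, System, Idle, Iowait, Irq, Softirq, Steal float64
}

// total sums every counter, including steal.
func (t cpuTimes) total() float64 {
    return t.User + t.Nice + t.System + t.Idle + t.Iowait + t.Irq + t.Softirq + t.Steal
}

type cpuGatherer struct {
    last *cpuTimes
}

// gather compares the new sample against the previous one. The new sample is
// always stored, so a single bad reading should only spoil one interval.
func (g *cpuGatherer) gather(now cpuTimes) error {
    prev := g.last
    g.last = &now
    if prev == nil {
        return nil // first interval: nothing to diff against yet
    }
    if now.total() < prev.total() {
        return errors.New("current total CPU time is less than previous total CPU time")
    }
    // normally the per-field deltas would be divided by the total delta here
    return nil
}

func main() {
    g := &cpuGatherer{}
    fmt.Println(g.gather(cpuTimes{User: 100, Steal: 194028}))        // <nil>
    fmt.Println(g.gather(cpuTimes{User: 101, Steal: 1750538050819})) // <nil>: a huge forward jump still passes
    fmt.Println(g.gather(cpuTimes{User: 102, Steal: 540640815829}))  // error: the counters went backwards again
}

With steal bouncing around the way it does in your dumps, the summed counters keep going backwards on later intervals as well, which would explain why the error repeats instead of clearing after one gather.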

@gallart

gallart commented May 31, 2017

Here are the new values (a few minutes later):

cpu  8497805 305947 943369 328628732 3024618 0 63270 540640815829 0 0
cpu0 1760294 80048 233098 81778381 1393541 0 24804 709390804805 0 0
cpu1 2274094 72875 234959 82313823 527160 0 12988 1037106086598 0 0
cpu2 2293320 72327 239009 82266574 543783 0 13003 670685834267 0 0
cpu3 2170096 80696 236302 82269952 560133 0 12473 1812806904899 0 0

I can reproduce the issue, so I can also capture a "few seconds later" snapshot if needed.

@phemmer

Contributor

phemmer commented May 31, 2017

"few minutes" is fine. Thanks :-)

That also shows the issue: 540640815829 - 1750538050819 = -1209897234990. WTF?

@phemmer

Contributor

phemmer commented May 31, 2017

@danielnelson How do you want to handle this? This smells like a bug in the kernel. Those steal times don't make any sense. At first I thought it might be some weird integer overflow, but that doesn't hold up either: the values should all be changing at roughly the same rate, yet they're all over the place:

540640815829 - 1750538050819 = -1209897234990
709390804805 - 1825121610327 = -1115730805522
1037106086598 - 1841457961672 = -804351875074
670685834267 - 1839967821051 = -1169281986784
1812806904899 - 1778013879880 = 34793025019

I googled around a little bit, but couldn't find any reported issues of screwed up CPU steal times on Xen after a migration.

Edit: My wild ass theory on what's going on is that the steal time uses some values provided by the hypervisor in addition to values tracked in the guest kernel, and performs math involving both of them (for example: hypervisor time since boot and guest kernel time since boot). Since the VM moved to another host, the hypervisor values changed, and so some calculations in the kernel are getting screwed up.
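
If it helps with further repro attempts, here is a small standalone sketch (a hypothetical helper, not part of Telegraf) that polls the steal column of /proc/stat every ten seconds and prints the per-CPU deltas:

// Hypothetical helper, not part of Telegraf: watch the steal column of
// /proc/stat and flag deltas that go backwards.
package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
    "time"
)

// readSteal returns the steal counter (the 9th whitespace-separated field,
// counting the "cpuN" label) for every cpu line in /proc/stat.
func readSteal() (map[string]uint64, error) {
    f, err := os.Open("/proc/stat")
    if err != nil {
        return nil, err
    }
    defer f.Close()

    steal := make(map[string]uint64)
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        fields := strings.Fields(sc.Text())
        if len(fields) < 9 || !strings.HasPrefix(fields[0], "cpu") {
            continue
        }
        v, err := strconv.ParseUint(fields[8], 10, 64)
        if err != nil {
            return nil, err
        }
        steal[fields[0]] = v
    }
    return steal, sc.Err()
}

func main() {
    prev, err := readSteal()
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    for {
        time.Sleep(10 * time.Second)
        cur, err := readSteal()
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        for cpu, v := range cur {
            delta := int64(v) - int64(prev[cpu])
            if delta < 0 {
                fmt.Printf("%s: steal went BACKWARDS by %d ticks\n", cpu, -delta)
            } else {
                fmt.Printf("%s: steal +%d ticks\n", cpu, delta)
            }
        }
        prev = cur
    }
}

Running it across a migration and for a minute or two afterwards should show whether the steal counters settle down or keep jumping around.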

@danielnelson

Contributor

danielnelson commented May 31, 2017

You should probably ask the XenServer team whether this is a bug on their side. I'm curious: how does the output of top or other CPU monitoring tools look?

@danielnelson

Contributor

danielnelson commented Jun 5, 2017

I'm going to close this issue for now since I don't think we can fix this in Telegraf. I hope you will keep us updated; feel free to comment here with any new information.
