Fix time going backwards on biolatency and biotop #2660
base: master
Conversation
When, for some reason, the bpf_ktime_get_ns() in trace_req_done results in a number smaller than the one stored in trace_req_start, the latency for this request goes all wrong and the histogram shows something like this:

```
Tracing block device I/O... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                    |
         2 -> 3          : 0        |                    |
         4 -> 7          : 0        |                    |
         8 -> 15         : 0        |                    |
        16 -> 31         : 25       |                    |
        32 -> 63         : 32       |                    |
        64 -> 127        : 254      |                    |
       128 -> 255        : 2164     |*                   |
       256 -> 511        : 5421     |***                 |
       512 -> 1023       : 29795    |********************|
      1024 -> 2047       : 17042    |***********         |
      2048 -> 4095       : 6400     |****                |
      4096 -> 8191       : 4935     |***                 |
      8192 -> 16383      : 588      |                    |
     16384 -> 32767      : 54       |                    |
     32768 -> 65535      : 14       |                    |
     65536 -> 131071     : 4        |                    |
    131072 -> 262143     : 4        |                    |
    262144 -> 524287     : 0        |                    |
    524288 -> 1048575    : 0        |                    |
   1048576 -> 2097151    : 0        |                    |
   2097152 -> 4194303    : 0        |                    |
   4194304 -> 8388607    : 0        |                    |
   8388608 -> 16777215   : 0        |                    |
  16777216 -> 33554431   : 0        |                    |
  33554432 -> 67108863   : 0        |                    |
  67108864 -> 134217727  : 0        |                    |
 134217728 -> 268435455  : 0        |                    |
 268435456 -> 536870911  : 0        |                    |
 536870912 -> 1073741823 : 0        |                    |
1073741824 -> 2147483647 : 0        |                    |
2147483648 -> 4294967295 : 0        |                    |
4294967296 -> 8589934591 : 0        |                    |
8589934592 -> 17179869183 : 0       |                    |
17179869184 -> 34359738367 : 0      |                    |
34359738368 -> 68719476735 : 0      |                    |
68719476736 -> 137438953471 : 0     |                    |
137438953472 -> 274877906943 : 0    |                    |
274877906944 -> 549755813887 : 0    |                    |
549755813888 -> 1099511627775 : 0   |                    |
1099511627776 -> 2199023255551 : 0  |                    |
2199023255552 -> 4398046511103 : 0  |                    |
4398046511104 -> 8796093022207 : 0  |                    |
8796093022208 -> 17592186044415 : 0 |                    |
17592186044416 -> 35184372088831 : 0 |                    |
35184372088832 -> 70368744177663 : 0 |                    |
70368744177664 -> 140737488355327 : 0 |                    |
140737488355328 -> 281474976710655 : 0 |                    |
281474976710656 -> 562949953421311 : 0 |                    |
562949953421312 -> 1125899906842623 : 0 |                    |
1125899906842624 -> 2251799813685247 : 0 |                    |
2251799813685248 -> 4503599627370495 : 0 |                    |
4503599627370496 -> 9007199254740991 : 0 |                    |
9007199254740992 -> 18014398509481983 : 0 |                    |
18014398509481984 -> 36028797018963967 : 17 |                    |
```

With this bug corrected, the output is:

```
Tracing block device I/O... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 163      |                                        |
        32 -> 63         : 339      |                                        |
        64 -> 127        : 258      |                                        |
       128 -> 255        : 2259     |**                                      |
       256 -> 511        : 5677     |******                                  |
       512 -> 1023       : 32662    |****************************************|
      1024 -> 2047       : 20252    |************************                |
      2048 -> 4095       : 12544    |***************                         |
      4096 -> 8191       : 11006    |*************                           |
      8192 -> 16383      : 1230     |*                                       |
     16384 -> 32767      : 201      |                                        |
     32768 -> 65535      : 544      |                                        |
     65536 -> 131071     : 746      |                                        |
    131072 -> 262143     : 881      |*                                       |
    262144 -> 524287     : 309      |                                        |
```

Which is much easier to read.
When, for some reason, the bpf_ktime_get_ns() in trace_req_completion results in a number smaller than the one stored in trace_req_start, the latency for this request goes all wrong and a very huge number appears in the biotop output. Added a check to discard the measured request when this situation happens.
The last commit changed delta_us to delta, but the conversion operation was wrong (it was * 1000, but needs to be / 1000).
Reject. This patch just tosses out I/O where time went backwards, resulting in misleading output that is missing events. In some observability tools this may be the practical approach, and I would only accept it if the documentation in the man page made it crystal clear that the output may be incomplete. I think a better approach is to update a map to say "dropped %d events due to backwards time" with the output. This patch does neither. You can update the patch to print the dropped message, or we can close this PR without merging.
Just to be clear: output with completely garbage values is preferable to silently dropping values. The completely garbage values tell the end user that something is wrong. By dropping them, you give them no chance.
Ok, I will edit the patch to add the dropped count message. Thanks!
Added a message warning the user about the number of I/O requests that are dropped because the bpf_ktime_get_ns() value at the end of the request is older than the one collected at the start.
[buildbot, test this please]
[buildbot, ok to test]
@piterpunk I would like to understand more about when the negative time could happen. Could you help collect status, including the CPU (bpf_get_smp_processor_id()), for the request start and completion events?
Ok, I modified biosnoop to show the CPU in 'start' and 'done'; these are the last two fields. I ran it for four minutes and filtered out the user processes. This resulted in 707 events, 692 of which began and ended on different CPUs; below were the 15 that began and ended on the same CPU:
If you want more information, feel free to ask. Thanks!
@piterpunk so when negative time happens, are the 'start' and 'done' events on different CPUs?
@yonghong-song Not always. In my four-minute collection, for the majority (97.878%) of events with negative time, the 'start' and 'end' were on different CPUs. But there were a few (2.122%) events with negative time on the same CPU; those events were all listed in my previous post.
Thanks for the information. I am wondering whether we missed something in the kernel that explains why we could get negative time on the same CPU. Could you give some details on how to reproduce the issue?
Well, the issue happens every time I execute biosnoop, biotop, or biolatency. The specific machine where I am running these tests is a RHEL 7.7 system with kernel 3.10.0-1062.4.1.el7.x86_64. The server has four sockets, each with 12 cores and HT enabled, for a total of 96 available CPUs. The machine has around 300 LUNs with multipath, so it sees circa 1200 block devices, and it averages 7,000-8,000 IOPS, with peaks of 17,000 IOPS. All the machines I have access to with this size and comparable throughput that can use bcc are RHEL 7 machines.
Ok, searching a bit more, it seems to be a known issue: #728. If this backwards-in-time situation is caused by the same issue, the fix seems not to have been applied on RHEL 7.7 or, if applied, it doesn't work. I compared the file kernel/time/timekeeping.c between kernels 3.10.0-1062.4.1, which I know has the problem, and 3.10.0-1062.9.1, the latest available for RHEL 7.7.z, and both are identical.
Sometimes the time collected at request completion is older than the one collected at request start. This results in enormous numbers in the measured latency. The wrong latency makes the histograms from biolatency occupy a lot of screen lines, and shows very high (and impossible) numbers in biotop.
This fix adds a check for whether the signed delta is negative; in that case, the request completion time is older than the request start time and the offending request is discarded.
This is the biolatency output before this change:
And this one is after the bug is fixed:
Thanks!