Fix time going backwards on biolatency and biotop #2660
base: master
Conversation
When, for some reason, the bpf_ktime_get_ns() in trace_req_done results in a number smaller than the one stored in trace_req_start, the latency for this request goes all wrong and the histogram shows something like this:

```
Tracing block device I/O... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                    |
         2 -> 3          : 0        |                    |
         4 -> 7          : 0        |                    |
         8 -> 15         : 0        |                    |
        16 -> 31         : 25       |                    |
        32 -> 63         : 32       |                    |
        64 -> 127        : 254      |                    |
       128 -> 255        : 2164     |*                   |
       256 -> 511        : 5421     |***                 |
       512 -> 1023       : 29795    |********************|
      1024 -> 2047       : 17042    |***********         |
      2048 -> 4095       : 6400     |****                |
      4096 -> 8191       : 4935     |***                 |
      8192 -> 16383      : 588      |                    |
     16384 -> 32767      : 54       |                    |
     32768 -> 65535      : 14       |                    |
     65536 -> 131071     : 4        |                    |
    131072 -> 262143     : 4        |                    |
    262144 -> 524287     : 0        |                    |
    524288 -> 1048575    : 0        |                    |
   1048576 -> 2097151    : 0        |                    |
   2097152 -> 4194303    : 0        |                    |
   4194304 -> 8388607    : 0        |                    |
   8388608 -> 16777215   : 0        |                    |
  16777216 -> 33554431   : 0        |                    |
  33554432 -> 67108863   : 0        |                    |
  67108864 -> 134217727  : 0        |                    |
 134217728 -> 268435455  : 0        |                    |
 268435456 -> 536870911  : 0        |                    |
 536870912 -> 1073741823 : 0        |                    |
1073741824 -> 2147483647 : 0        |                    |
2147483648 -> 4294967295 : 0        |                    |
4294967296 -> 8589934591 : 0        |                    |
8589934592 -> 17179869183 : 0       |                    |
17179869184 -> 34359738367 : 0      |                    |
34359738368 -> 68719476735 : 0      |                    |
68719476736 -> 137438953471 : 0     |                    |
137438953472 -> 274877906943 : 0    |                    |
274877906944 -> 549755813887 : 0    |                    |
549755813888 -> 1099511627775 : 0   |                    |
1099511627776 -> 2199023255551 : 0  |                    |
2199023255552 -> 4398046511103 : 0  |                    |
4398046511104 -> 8796093022207 : 0  |                    |
8796093022208 -> 17592186044415 : 0 |                    |
17592186044416 -> 35184372088831 : 0 |                    |
35184372088832 -> 70368744177663 : 0 |                    |
70368744177664 -> 140737488355327 : 0 |                    |
140737488355328 -> 281474976710655 : 0 |                    |
281474976710656 -> 562949953421311 : 0 |                    |
562949953421312 -> 1125899906842623 : 0 |                    |
1125899906842624 -> 2251799813685247 : 0 |                    |
2251799813685248 -> 4503599627370495 : 0 |                    |
4503599627370496 -> 9007199254740991 : 0 |                    |
9007199254740992 -> 18014398509481983 : 0 |                    |
18014398509481984 -> 36028797018963967 : 17 |                    |
```

With this bug corrected, the output is:

```
Tracing block device I/O... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 163      |                                        |
        32 -> 63         : 339      |                                        |
        64 -> 127        : 258      |                                        |
       128 -> 255        : 2259     |**                                      |
       256 -> 511        : 5677     |******                                  |
       512 -> 1023       : 32662    |****************************************|
      1024 -> 2047       : 20252    |************************                |
      2048 -> 4095       : 12544    |***************                         |
      4096 -> 8191       : 11006    |*************                           |
      8192 -> 16383      : 1230     |*                                       |
     16384 -> 32767      : 201      |                                        |
     32768 -> 65535      : 544      |                                        |
     65536 -> 131071     : 746      |                                        |
    131072 -> 262143     : 881      |*                                       |
    262144 -> 524287     : 309      |                                        |
```

Which is much easier to read.
When, for some reason, the bpf_ktime_get_ns() in trace_req_completion results in a number smaller than the one stored in trace_req_start, the latency for this request goes all wrong and a very huge number appears in the biotop output. Added a check to discard the measured request when this situation happens.
The last commit changed delta_us to delta, but the conversion operation was wrong (it was * 1000, but needs to be / 1000).
Reject. This patch just tosses out I/O where time went backwards, resulting in misleading output that is missing events. In some observability tools this may be the practical approach, and I would only accept it if the documentation in the man page made it crystal clear that the output may be incomplete. I think a better approach is to update a map to say "dropped %d events due to backwards time" with the output. This patch does neither. You can update the patch to print the dropped message, or we can close this PR without merging.
Just to be clear: output with completely garbage values is preferable to silently dropping values. The completely garbage values tell the end user that something is wrong. By dropping them, you give them no chance.
Ok, I will edit the patch to add the dropped count message. Thanks!
Added a message warning the user about the number of I/O requests that are dropped because the bpf_ktime_get_ns() value at the end of the request is older than the one collected at the start.
[buildbot, test this please]
[buildbot, ok to test]
@piterpunk I would like to understand more about when the negative time could happen. Could you help collect status, including the CPU (bpf_get_smp_processor_id()), for the request start and completion events?
Ok, I modified biosnoop to show the CPU in 'start' and 'done'; these are the last two fields. I ran it for four minutes and filtered out the user processes. This resulted in 707 events, 692 of which began and ended on different CPUs; below were the 15 that began and ended on the same CPU:
If you want more information, feel free to ask. Thanks!
@piterpunk so when negative time happens, are the 'start' and 'done' events on different CPUs?
@yonghong-song Not always. In my four-minute collection, for the majority (97.878%) of events with negative time, the 'start' and 'end' were on different CPUs. But there were a few (2.122%) events with negative time on the same CPU; those events were all listed in my previous post.
Thanks for the information. I am wondering whether we missed something in the kernel that explains why we could get negative time on the same CPU. Could you give some details on how to reproduce the issue?
Well, the issue happens every time I execute biosnoop, biotop, or biolatency. The specific machine where I am running these tests is a RHEL 7.7 system with kernel 3.10.0-1062.4.1.el7.x86_64. The server has four sockets, each with 12 cores and HT enabled, for a total of 96 available CPUs. The machine has around 300 LUNs with multipath, so it sees circa 1200 block devices, and it averages 7,000-8,000 IOPS, with peaks of 17,000 IOPS. All the machines I have access to with this size and comparable throughput that can use bcc are RHEL 7 machines.
Ok, searching a bit more, it seems to be a known issue: #728. If this backwards-in-time situation is caused by the same issue, the fix seems not to have been applied on RHEL 7.7 or, if applied, it doesn't work. I compared the file kernel/time/timekeeping.c between kernels 3.10.0-1062.4.1, which I know has the problem, and 3.10.0-1062.9.1, the latest available for RHEL 7.7.z, and both are identical.
Sometimes the time collected at request completion is older than the one collected at request start. This results in enormous numbers in the measured latency. The wrong latency makes the histograms from biolatency occupy a lot of screen lines, and shows very high (and impossible) numbers in biotop.
This fix adds a check for whether the signed delta is negative; in that case, the request completion time is older than the request start time and the offending request is discarded.
This is the biolatency output before this change:
And this one is after the bug is fixed:
Thanks!