
Fix time going backwards on biolatency and biotop #2660

Open · wants to merge 5 commits into base: master
Conversation

piterpunk

Sometimes the time collected at request completion is older than the time collected at request start. This results in enormous numbers in the measured latency. The wrong latency makes biolatency's histograms occupy a lot of screen lines, and makes biotop show very high (and impossible) numbers.

This fix adds a check on the signed delta: if it is negative, the request completion time is older than the request start time, and the offending request is discarded.
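For illustration, a minimal sketch of the check in the completion probe, written against biolatency.py's BPF program (the `start` hash and `dist` histogram exist upstream; treat the exact snippet as a sketch, not the merged diff):

```c
// Sketch of the negative-delta check in biolatency's completion probe
// (illustrative; not the exact merged patch).
int trace_req_done(struct pt_regs *ctx, struct request *req) {
    u64 *tsp = start.lookup(&req);
    if (tsp == 0) {
        return 0;   // missed the request start; nothing to measure
    }
    s64 delta = (s64)(bpf_ktime_get_ns() - *tsp);
    if (delta < 0) {
        // Completion timestamp is older than the start timestamp:
        // discard the offending request instead of recording a bogus latency.
        start.delete(&req);
        return 0;
    }
    dist.increment(bpf_log2l(delta / 1000));   // ns -> usecs histogram slot
    start.delete(&req);
    return 0;
}
```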

This is the biolatency output before this change:

Tracing block device I/O... Hit Ctrl-C to end.
^C
               usecs                         : count     distribution
                   0 -> 1                    : 0        |                    |
                   2 -> 3                    : 0        |                    |
                   4 -> 7                    : 0        |                    |
                   8 -> 15                   : 0        |                    |
                  16 -> 31                   : 25       |                    |
                  32 -> 63                   : 32       |                    |
                  64 -> 127                  : 254      |                    |
                 128 -> 255                  : 2164     |*                   |
                 256 -> 511                  : 5421     |***                 |
                 512 -> 1023                 : 29795    |********************|
                1024 -> 2047                 : 17042    |***********         |
                2048 -> 4095                 : 6400     |****                |
                4096 -> 8191                 : 4935     |***                 |
                8192 -> 16383                : 588      |                    |
               16384 -> 32767                : 54       |                    |
               32768 -> 65535                : 14       |                    |
               65536 -> 131071               : 4        |                    |
              131072 -> 262143               : 4        |                    |
              262144 -> 524287               : 0        |                    |
              524288 -> 1048575              : 0        |                    |
             1048576 -> 2097151              : 0        |                    |
             2097152 -> 4194303              : 0        |                    |
             4194304 -> 8388607              : 0        |                    |
             8388608 -> 16777215             : 0        |                    |
            16777216 -> 33554431             : 0        |                    |
            33554432 -> 67108863             : 0        |                    |
            67108864 -> 134217727            : 0        |                    |
           134217728 -> 268435455            : 0        |                    |
           268435456 -> 536870911            : 0        |                    |
           536870912 -> 1073741823           : 0        |                    |
          1073741824 -> 2147483647           : 0        |                    |
          2147483648 -> 4294967295           : 0        |                    |
          4294967296 -> 8589934591           : 0        |                    |
          8589934592 -> 17179869183          : 0        |                    |
         17179869184 -> 34359738367          : 0        |                    |
         34359738368 -> 68719476735          : 0        |                    |
         68719476736 -> 137438953471         : 0        |                    |
        137438953472 -> 274877906943         : 0        |                    |
        274877906944 -> 549755813887         : 0        |                    |
        549755813888 -> 1099511627775        : 0        |                    |
       1099511627776 -> 2199023255551        : 0        |                    |
       2199023255552 -> 4398046511103        : 0        |                    |
       4398046511104 -> 8796093022207        : 0        |                    |
       8796093022208 -> 17592186044415       : 0        |                    |
      17592186044416 -> 35184372088831       : 0        |                    |
      35184372088832 -> 70368744177663       : 0        |                    |
      70368744177664 -> 140737488355327      : 0        |                    |
     140737488355328 -> 281474976710655      : 0        |                    |
     281474976710656 -> 562949953421311      : 0        |                    |
     562949953421312 -> 1125899906842623     : 0        |                    |
    1125899906842624 -> 2251799813685247     : 0        |                    |
    2251799813685248 -> 4503599627370495     : 0        |                    |
    4503599627370496 -> 9007199254740991     : 0        |                    |
    9007199254740992 -> 18014398509481983    : 0        |                    |
   18014398509481984 -> 36028797018963967    : 17       |                    |

And this is the output after the bug is fixed:

Tracing block device I/O... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 163      |                                        |
        32 -> 63         : 339      |                                        |
        64 -> 127        : 258      |                                        |
       128 -> 255        : 2259     |**                                      |
       256 -> 511        : 5677     |******                                  |
       512 -> 1023       : 32662    |****************************************|
      1024 -> 2047       : 20252    |************************                |
      2048 -> 4095       : 12544    |***************                         |
      4096 -> 8191       : 11006    |*************                           |
      8192 -> 16383      : 1230     |*                                       |
     16384 -> 32767      : 201      |                                        |
     32768 -> 65535      : 544      |                                        |
     65536 -> 131071     : 746      |                                        |
    131072 -> 262143     : 881      |*                                       |
    262144 -> 524287     : 309      |                                        |

Thanks!

When, for some reason, the bpf_ktime_get_ns() in trace_req_done returns a value smaller than the one stored by trace_req_start, the latency for that request goes all wrong and the histogram shows impossible values. With the bug corrected, the output is much easier to read (the before/after histograms are the same as shown above).
When, for some reason, the bpf_ktime_get_ns() in trace_req_completion returns a value smaller than the one stored by trace_req_start, the latency for that request goes all wrong and a huge number appears in the biotop output. Added a check to discard the measured request when this situation happens.
The last commit changed delta_us to delta, but the conversion operation was wrong (it was * 1000, but needs to be / 1000).
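In other words (an illustrative two-liner), converting the nanosecond delta to microseconds is a division, not a multiplication:

```c
u64 delta = bpf_ktime_get_ns() - *tsp;   // delta in nanoseconds
u64 delta_us = delta / 1000;             // correct: ns -> us divides by 1000
```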
@brendangregg
Member

Reject. This patch just tosses out I/O where time went backwards, resulting in misleading output that is missing events. In some observability tools this may be the practical approach, and I would only accept it if the documentation in the man page made it crystal clear that the output may be incomplete. I think a better approach is to update a map to say "dropped %d events due to backwards time" with the output.

This patch does neither. You can update the patch to print the dropped message, or we can close this PR without merging.
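A minimal sketch of the counter approach described above, assuming a bcc-style BPF program (the map name backwards_count is hypothetical, not from the actual patch):

```c
// Hypothetical drop counter (sketch of the suggested approach, not the
// merged patch). Counts requests discarded due to backwards time.
BPF_ARRAY(backwards_count, u64, 1);

// In the completion probe, count the drop instead of discarding silently:
if (delta < 0) {
    int zero = 0;
    u64 *cnt = backwards_count.lookup(&zero);
    if (cnt)
        __sync_fetch_and_add(cnt, 1);   // atomic, safe across CPUs
    start.delete(&req);
    return 0;
}
```

The userland side of the tool can then read backwards_count when printing the histogram and emit something like "dropped N events due to backwards time".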

@brendangregg
Member

Just to be clear: output with completely garbage values is preferable to silently dropped values. The garbage values tell the end user that something is wrong; by dropping them, you give the user no chance to notice.

@piterpunk
Author

Just to be clear: output with completely garbage values is preferable to silently dropped values. The garbage values tell the end user that something is wrong; by dropping them, you give the user no chance to notice.

Ok, I will edit the patch to add the dropped count message.

Thanks!

Added a message warning the user about the number of I/O requests dropped because the bpf_ktime_get_ns() at the end of the request returned a value older than the one collected at the start.
@yonghong-song
Collaborator

[buildbot, test this please]

@yonghong-song
Collaborator

[buildbot, ok to test]

@yonghong-song
Collaborator

@piterpunk I would like to understand more about when the negative time could happen. Could you help collect stats, including the cpu (bpf_get_smp_processor_id()) at start and at done? When negative time happens, is the cpu the same or not? If they are not the same, that may explain it: different cpus' clocks may drift a little. If they are the same, maybe we have other issues. Thanks!

@piterpunk
Author

@piterpunk I would like to understand more about when the negative time could happen. Could you help collect stats, including the cpu (bpf_get_smp_processor_id()) at start and at done? When negative time happens, is the cpu the same or not? If they are not the same, that may explain it: different cpus' clocks may drift a little. If they are the same, maybe we have other issues. Thanks!

Ok, I modified biosnoop to show the CPU at 'start' and at 'done'; these are the last two fields. I ran it for four minutes and filtered out the user processes. This resulted in 707 events; 692 of them began and ended on different CPUs. Below are the 15 that began and ended on the same CPU:

17.119888000   jbd2/dm-6-8    48356  sdb     W  275117216 4096    18446744073708.59 67 67
21.443884000   kdmwork-253:33 2427   sdkw    W  348780239 122880  18446744073709.09 63 63
44.421883000   kdmwork-253:36 7013   sdrf    R  64105855  8192    18446744073708.79 0 0
48.284886000   kworker/u197:0 66870  sdb     W  93254656  524288  18446744073708.80 26 26
49.380885000   jbd2/dm-6-8    48356  sdb     W  275185152 135168  18446744073708.62 14 14
49.388890000   jbd2/dm-6-8    48356  sdb     W  275186504 4096    18446744073708.59 16 16
67.748884000   jbd2/dm-6-8    48356  sdb     W  275232080 4096    18446744073708.59 14 14
83.055887000   jbd2/dm-6-8    48356  sdb     W  275005784 32768   18446744073708.61 15 15
102.285886000  kdmwork-253:54 9340   sduv    R  234457039 8192    18446744073708.70 0 0
104.871898000  kdmwork-253:47 8351   sdaoe   R  67526975  8192    18446744073709.23 0 0
141.691886000  jbd2/dm-6-8    48356  sdb     W  275157720 45056   18446744073708.61 38 38
207.450888000  kworker/u200:2 60710  sdb     W  254371528 4096    18446744073708.61 45 45
211.662883000  jbd2/dm-6-8    48356  sdb     W  275104760 4096    18446744073708.59 33 33
230.056884000  jbd2/dm-6-8    48356  sdb     W  275141696 4096    18446744073708.58 32 32
231.651885000  jbd2/dm-6-8    48356  sdb     W  275146344 4096    18446744073708.59 33 33

If you want more information, feel free to ask.

Thanks!
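For reference, a minimal sketch of how the start/done CPU ids could be captured in a bcc program (hypothetical struct and field names; not the exact biosnoop modification used here):

```c
// Store the start timestamp together with the CPU that issued the request
// (hypothetical names; biosnoop's real data layout differs).
struct start_info {
    u64 ts;
    u32 cpu;
};
BPF_HASH(start, struct request *, struct start_info);

int trace_req_start(struct pt_regs *ctx, struct request *req) {
    struct start_info info = {};
    info.ts = bpf_ktime_get_ns();
    info.cpu = bpf_get_smp_processor_id();
    start.update(&req, &info);
    return 0;
}

// In the completion probe, bpf_get_smp_processor_id() identifies the
// "done" CPU, which can be emitted next to info.cpu in the perf event.
```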

@yonghong-song
Collaborator

@piterpunk so when negative time happens, start and done always happen on different cpus, is that right?

@piterpunk
Author

@piterpunk so when negative time happens, start and done always happen on different cpus, is that right?

@yonghong-song Not always. In my four-minute collection, for the majority (97.878%) of events with negative time, the 'start' and 'done' were on different CPUs. But there were a few (2.122%) events with negative time that were on the same CPU; those events are all listed in my previous post.

@yonghong-song
Collaborator

Thanks for the information. I am wondering whether we missed something in the kernel that could explain why we get negative time on the same cpu. Could you give some details on how to reproduce the issue?

@piterpunk
Author

Thanks for the information. I am wondering whether we missed something in the kernel that could explain why we get negative time on the same cpu. Could you give some details on how to reproduce the issue?

Well, the issue happens every time I run biosnoop, biotop or biolatency. The specific machine where I am running these tests is a RHEL 7.7 system with kernel 3.10.0-1062.4.1.el7.x86_64.

The server has four sockets, each with 12 cores and HT enabled, for a total of 96 available CPUs. The machine has around 300 LUNs with multipath, so it sees circa 1200 block devices, and it does an average of 7000-8000 IOPS, with peaks of 17000 IOPS.

All the machines I have access to of this size, and with comparable throughput, that can run bcc are RHEL7 machines.

@piterpunk
Author

piterpunk commented Dec 23, 2019

Ok, searching a bit more, it seems to be a known issue: #728

If this backwards-in-time situation is caused by the same issue, the fix seems not to have been applied on RHEL 7.7 or, if applied, it doesn't work.

I compared the file kernel/time/timekeeping.c between kernels 3.10.0-1062.4.1, which I know has the problem, and 3.10.0-1062.9.1, the latest one available for RHEL7.7.z, and both are identical.
