This repository has been archived by the owner on Nov 7, 2019. It is now read-only.

9234 reduce apic calibration error by taking multiple measurements #578

Closed
wants to merge 1 commit into from

Commits on May 16, 2018

  1. DLPX-50219 reduce apic calibration error by taking multiple measurements

    Reviewed by: George Wilson <george.wilson@delphix.com>
    Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
    Reviewed by: Igor Kozhukhov <igor@dilos.org>
    
    The APIC is used as a timer in illumos. Specifically, it is used by the
    callout and cyclic frameworks to generate an interrupt around the time that
    the closest timer would expire. Once in interrupt context, those
    frameworks call `gethrtime()` to determine which timers have expired, so
    the system doesn't rely solely on the accuracy of the APIC.
    
    If the APIC lags behind real time, we will see more jitter, and shorter
    timeouts will tend to be late. If the APIC runs faster than it should, we
    will generate an excessive number of interrupts, because the APIC fires
    before any timer has expired. In any case, I've tested what happens when
    the APIC is severely miscalibrated (10% or 1000% of target speed), and it
    doesn't seem to create any instability on the system.
    
    At 1000% of the target speed, we see a significant increase in the number
    of interrupts fired, especially when the system is idle:
    
        CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  dt idl
          0   41   0    5  9711  247  343    6   20    3    0   527    1   3   0  96
          1   79   0   14  9366  409 1046    8   20    4    0  2894    1   3   0  96
    
    Versus a normally calibrated APIC:
    
        CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  dt idl
          0  120   0   10   797  254 1082    9   20    3    0  2564    1   2   0  97
          1   80   0   11   830  387  385    7   19    4    0  1175    1   1   0  98
    
    The APIC is calibrated against the 8254 fixed-frequency timer (PIT). We
    wait for the PIT to count a certain number of ticks, then check how many
    ticks the APIC counted in the same interval. The main issue is that on
    some hypervisors, notably Hyper-V, both the 8254 and the APIC are emulated
    and thus can sometimes be inconsistent.
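
    The arithmetic behind that comparison can be sketched as follows. This is
    only an illustration, not the illumos calibration code: the helper names
    `apic_ticks_per_sec()` and `nsec_to_apic_ticks()` are hypothetical, and
    the only hardware fact assumed is the standard 8254 input clock of
    1,193,182 Hz.

```c
#include <assert.h>
#include <stdint.h>

/* The 8254 PIT counts at a fixed 1,193,182 Hz. */
#define PIT_HZ  1193182ULL
#define NANOSEC 1000000000ULL

/*
 * Hypothetical helper (not the illumos code): given how many ticks each
 * timer counted over the same interval, estimate the APIC tick rate in
 * ticks per second, using the PIT as the reference clock.
 */
uint64_t
apic_ticks_per_sec(uint64_t pit_ticks, uint64_t apic_ticks)
{
    assert(pit_ticks > 0);
    return (apic_ticks * PIT_HZ / pit_ticks);
}

/*
 * Hypothetical helper: convert a timeout in nanoseconds into APIC ticks
 * at the calibrated rate. The intermediate product can overflow for very
 * long timeouts; a real implementation would have to guard against that.
 */
uint64_t
nsec_to_apic_ticks(uint64_t nsec, uint64_t apic_hz)
{
    return (nsec * apic_hz / NANOSEC);
}
```

    If either emulated timer misreports its count during the interval, the
    ratio, and therefore every timeout programmed from it, is off by the same
    proportion.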
    
    I ran an experiment to measure how much those inconsistencies affect the
    APIC calibration factor (which determines how many APIC ticks pass in a
    given number of nanoseconds). Here are the results for about 15000
    measurements (done by performing 1000 measurements at a time on each boot).
    
    The main observation is that calibration doesn't seem to change from boot to
    boot and that the accuracy of a measurement doesn't seem to correlate with
    when it was taken, which means that very inaccurate measurements happen
    randomly. Most measurements are quite accurate, except for some rare
    outliers (as can be seen in the graph). It was determined that a 5-value
    median filter would significantly reduce the worst-case miscalibrations.
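
    A 5-value median filter of this kind can be sketched as below. This is a
    minimal illustration under assumptions, not the code from this change:
    `NSAMPLES`, `cmp_u64()` and `median_of_samples()` are hypothetical names,
    and the caller is assumed to have already collected five calibration
    measurements.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NSAMPLES 5  /* number of calibration measurements per filter */

/* qsort() comparator for uint64_t values, ascending order. */
static int
cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;

    return ((x > y) - (x < y));
}

/*
 * Hypothetical helper: return the median of NSAMPLES measurements.
 * The input array is left untouched; a sorted copy is used instead.
 */
uint64_t
median_of_samples(const uint64_t samples[NSAMPLES])
{
    uint64_t sorted[NSAMPLES];

    assert(NSAMPLES % 2 == 1);  /* an odd count has a single middle value */
    memcpy(sorted, samples, sizeof (sorted));
    qsort(sorted, NSAMPLES, sizeof (sorted[0]), cmp_u64);
    return (sorted[NSAMPLES / 2]);
}
```

    Because the median ignores the magnitude of the extreme values, up to two
    wildly wrong measurements out of five cannot move the result, which fits
    the observation that bad measurements are rare and random.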
    
    In the results below, `stdev %` is the standard deviation divided by the
    average; `min %` is how far the lowest measured calibration value is from
    the average; and `max %` is how far the highest measured calibration value
    is from the average.
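
    As a worked illustration of the `min %` and `max %` columns, the sketch
    below assumes both are distances from the average, expressed as a
    percentage of the average. `spread_t` and `calib_spread()` are
    hypothetical names, and the sample values in the usage note are made up
    for the example rather than taken from the measurements. `stdev %` is
    left out to keep the sketch free of floating-point library calls; it
    would be divided by the same average.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: worst-case deviations from the average. */
typedef struct spread {
    double min_pct;  /* (average - lowest) as a percentage of average */
    double max_pct;  /* (highest - average) as a percentage of average */
} spread_t;

/*
 * Hypothetical helper: compute how far the lowest and highest calibration
 * values lie from the average of all n values, in percent.
 */
spread_t
calib_spread(const uint64_t *vals, size_t n)
{
    spread_t s;
    double sum = 0.0;
    uint64_t lo, hi;
    size_t i;

    assert(n > 0);
    lo = hi = vals[0];
    for (i = 0; i < n; i++) {
        sum += (double)vals[i];
        if (vals[i] < lo)
            lo = vals[i];
        if (vals[i] > hi)
            hi = vals[i];
    }

    double avg = sum / (double)n;
    s.min_pct = 100.0 * (avg - (double)lo) / avg;
    s.max_pct = 100.0 * ((double)hi - avg) / avg;
    return (s);
}
```

    For the made-up values 90, 100, 100 and 110, the average is 100, so both
    `min_pct` and `max_pct` come out as 10.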
    
        Base Results:
                    stdev %   min %    max %
        dcenter        0.02     0.2      0.2
        AWS            0.02     1.4      0.1
        hyperv         0.79     6.4      5.5
        Azure          2.87    35.1    331.1
    
        Using 5-value Median Filter:
                    stdev %   min %    max %
        dcenter        0.01    0.02     0.04
        AWS            0.01    0.01     0.03
        hyperv         0.47    1.47     1.76
        Azure          0.50    2.67     1.39
    
    As we can see, using the median filter significantly reduces the worst-case
    (min/max) miscalibrations on all platforms, and seems to be a necessity on
    Azure to ensure a proper worst-case calibration.
    
    Closes openzfs#578
    pzakha authored and prakashsurya committed May 16, 2018
    2e9d99f