CPU frequency-dependent timer issues starting with 2.0.2 #3003

Closed
AndreasKempfNEC opened this Issue Feb 17, 2017 · 70 comments

AndreasKempfNEC commented Feb 17, 2017

Dear Open MPI team,

A few days ago, my colleague Daniel Tameling noticed severe performance issues when running the HPCC benchmark with Open MPI. After spending quite some time tracking down the cause, we suspect that a regression was introduced between Open MPI 2.0.1 and 2.0.2. More specifically: the Open MPI 2.0.1 release tarball seems to be fine, while the 2.0.2 release shows issues that persist through the openmpi-v2.0.x-201702170256-5fa504b nightly build.

The new releases appear to be severely affected by the CPU frequency set via the acpi-cpufreq driver. When the "ondemand" governor is active and allowed to scale between 1.20 GHz and 2.40 GHz, the performance difference between Open MPI 2.0.1 and versions ≥ 2.0.2 is almost a factor of two. Only when the "userspace" governor is used to pin the frequency to the maximum (with turbo) do the two versions show similar performance.

It does not seem to depend on the PML, BTL, MTL, or even the fabric. We tested FDR, EDR, openib, mxm, ob1, yalla, and cm (on InfiniBand with Slurm), as well as cm and psm2 (on OPA with PBS).

The following latencies were measured on 2 nodes with two-socket Intel Xeon E5-2680v4 CPUs, connected via InfiniBand EDR:

ompi-2.0.3-5fa504b, "ondemand" at 1.20 GHz-2.40 GHz:

$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   WARNING:

You should always run with libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. libnvidia-ml.so in TDK package is a stub library that is attached only for build purposes (e.g. machine that you build your application doesn't have to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[hsw006:37946] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:47684] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[1487330514.524672] [hsw006:37959:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 2401.00
[1487330514.522429] [hsw007:47691:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 2401.00
# OSU MPI Latency Test v5.3
# Size          Latency (us)
0                       2.23
1                       2.23
2                       2.20
4                       2.19
8                       2.19
16                      2.27
32                      2.28
64                      2.37
128                     3.21
256                     3.37
512                     3.61
1024                    3.98
2048                    4.76
4096                    6.41
8192                   10.12
16384                  16.37
32768                  20.39
65536                  26.40
131072                 37.37
262144                 59.33
524288                102.96
1048576               191.14
2097152               364.29
4194304               711.79

ompi-2.0.1, "ondemand" at 1.20 GHz-2.40 GHz:

$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
[snip warning]
[hsw006:37973] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:47714] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[1487330533.943496] [hsw006:37990:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 2401.00
[1487330533.948853] [hsw007:47721:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 2401.00
# OSU MPI Latency Test v5.3
# Size          Latency (us)
0                       1.15
1                       1.14
2                       1.13
4                       1.10
8                       1.10
16                      1.13
32                      1.14
64                      1.17
128                     1.58
256                     1.65
512                     1.77
1024                    1.96
2048                    2.35
4096                    3.23
8192                    5.05
16384                   8.17
32768                  10.21
65536                  13.27
131072                 18.74
262144                 29.59
524288                 51.32
1048576                95.44
2097152               182.00
4194304               355.76

ompi-2.0.3-5fa504b, "userspace" at 1.8 GHz:

$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
[snip warning]
[hsw006:41654] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:51373] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
# OSU MPI Latency Test v5.3
# Size          Latency (us)
0                       1.74
1                       1.78
2                       1.78
4                       1.78
8                       1.78
16                      1.83
32                      1.84
64                      1.93
128                     2.54
256                     2.70
512                     2.93
1024                    3.29
2048                    4.04
4096                    5.62
8192                    9.20
16384                  12.25
32768                  14.89
65536                  18.98
131072                 26.13
262144                 41.60
524288                 69.95
1048576               128.21
2097152               244.13
4194304               475.81

ompi-2.0.1, "userspace" at 1.8 GHz:

$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
[snip warning]
[hsw006:41690] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:51407] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
# OSU MPI Latency Test v5.3
# Size          Latency (us)
0                       1.27
1                       1.30
2                       1.30
4                       1.30
8                       1.30
16                      1.35
32                      1.35
64                      1.38
128                     1.86
256                     1.97
512                     2.14
1024                    2.43
2048                    2.99
4096                    4.23
8192                    6.81
16384                   9.12
32768                  11.11
65536                  14.12
131072                 19.51
262144                 30.47
524288                 52.76
1048576                95.91
2097152               182.81
4194304               356.55

ompi-2.0.3-5fa504b, "userspace" at 2.4 GHz (turbo on):

$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
[snip warning]
[hsw006:45372] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:55141] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
# OSU MPI Latency Test v5.3
# Size          Latency (us)
0                       1.09
1                       1.10
2                       1.10
4                       1.09
8                       1.09
16                      1.14
32                      1.14
64                      1.17
128                     1.60
256                     1.68
512                     1.79
1024                    1.98
2048                    2.37
4096                    3.21
8192                    5.07
16384                   8.20
32768                  10.20
65536                  13.22
131072                 18.69
262144                 29.67
524288                 51.43
1048576                95.50
2097152               182.12
4194304               355.87

ompi-2.0.1, "userspace" at 2.4 GHz (turbo on):

$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
[snip warning]
[hsw006:45403] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:55175] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
# OSU MPI Latency Test v5.3
# Size          Latency (us)
0                       1.10
1                       1.11
2                       1.09
4                       1.09
8                       1.08
16                      1.13
32                      1.13
64                      1.16
128                     1.57
256                     1.65
512                     1.76
1024                    1.96
2048                    2.35
4096                    3.23
8192                    5.04
16384                   8.15
32768                  10.17
65536                  13.21
131072                 18.71
262144                 29.55
524288                 51.31
1048576                95.37
2097152               182.00
4194304               355.76

The Open MPI version (2.0.2a pre-release) in the HPC-X toolkit version 1.8.0 shows the same issues. Earlier releases (e.g., 1.10.2) also seem to be unaffected.

We are quite stumped as to what could be going on. (My gut feeling would be to blame the recent timer changes, but I really have no idea.)

In any case, thank you for your work on Open MPI!

@jsquyres jsquyres added this to the v2.0.3 milestone Feb 17, 2017

ggouaillardet commented Feb 17, 2017

That could be a side effect of the timer change, and what you observe could be either genuinely higher latency or incorrect timers suggesting higher latency (!)

Can you measure the elapsed wall time of the osu_latency benchmark (ideally only zero-size messages, with enough iterations that the run takes about a minute), and then compare it against the elapsed time with the other ompi release? If both elapsed times are similar but the reported performance differs by 2x, then the timer is incorrect and performance has not actually dropped.

AndreasKempfNEC commented Feb 17, 2017

Funnily enough, the wall time does indeed seem to be roughly equal in the OSU test:

("userspace" at 1.2 GHz)

$ time ~/opt/ompi-2.0.3-5fa504b/bin/mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ./a.out -i 30000000'; time ~/opt/ompi-2.0.1-check/bin/mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ./a.out -i 30000000';
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   WARNING:

You should always run with libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. libnvidia-ml.so in TDK package is a stub library that is attached only for build purposes (e.g. machine that you build your application doesn't have to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[hsw010:17030] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw011:00649] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
# OSU MPI Latency Test
# Size            Latency (us)
0                         3.04

real	1m32.585s
user	1m31.418s
sys	0m0.250s
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   WARNING:

You should always run with libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. libnvidia-ml.so in TDK package is a stub library that is attached only for build purposes (e.g. machine that you build your application doesn't have to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[hsw010:17091] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw011:00714] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
# OSU MPI Latency Test
# Size            Latency (us)
0                         1.50

real	1m31.565s
user	1m30.389s
sys	0m0.233s

From a cursory glance at the HPCC source code, the times there all seem to be measured using MPI_Wtime as well. It is quite possible that the wall time never changed during our longer tests and we simply did not notice. But then, why would anyone doubt the notion of time?

ggouaillardet commented Feb 17, 2017

My initial intuition is that the new timer is based on clock ticks, with time = ticks / frequency. If the CPU clock increases, you then get more ticks for the same elapsed time. Since we do not seem to adjust the frequency, MPI_Wtime() runs too fast, so the reported latency is higher than it really is.

For the time being, I suggest you force a timer based on gettimeofday() (I am afk and do not know how to do that offhand).

@hjelmn @bosilca can you please comment on this issue?

AndreasKempfNEC commented Feb 17, 2017

In opal/mca/timer/linux/timer_linux_component.c, the frequency seems to be taken from /proc/cpuinfo, which is modified by acpi-cpufreq. The comment in the code was probably true until the timer was changed:

    fp = fopen("/proc/cpuinfo", "r");
    ...
    if (0 == opal_timer_linux_freq) {
        /* find the CPU speed - most timers are 1:1 with CPU speed */
        loc = find_info(fp, "cpu MHz", buf, 1024);
        if (NULL != loc) {
            ret = sscanf(loc, "%f", &cpu_f);
            if (1 == ret) {
                /* numer is in MHz - convert to Hz and make an integer */
                opal_timer_linux_freq = (opal_timer_t) (cpu_f * 1000000);
            }
        }
    }

Now that I think about it, this is particularly insidious when using "ondemand" with HPCC because all the tests are different and the CPU might run them at different frequencies.

zzzoom commented Feb 21, 2017

My changes to the timers aren't in 2.0.x yet, and they shouldn't affect this.
Newer processors increment the TSC at the same rate regardless of frequency. @AndreasKempfNEC, does your /proc/cpuinfo list constant_tsc in its flags? If it doesn't, the problem might be that the tick rate is queried at init, before the processor increases its frequency under load.

bosilca commented Feb 21, 2017

Here is a simple patch to disable the TSC if it is not constant and fall back to a more costly but more accurate approach (clock_gettime, if supported).

diff --git a/opal/mca/timer/linux/timer_linux_component.c b/opal/mca/timer/linux/timer_linux_component.c
index 5c16ac9..64a55aa 100644
--- a/opal/mca/timer/linux/timer_linux_component.c
+++ b/opal/mca/timer/linux/timer_linux_component.c
@@ -97,7 +97,7 @@ find_info(FILE* fp, char *str, char *buf, size_t buflen)
     return NULL;
 }

-static int opal_timer_linux_find_freq(void)
+static int opal_timer_linux_find_freq(int* constant_tsc)
 {
     FILE *fp;
     char *loc;
@@ -111,11 +111,15 @@ static int opal_timer_linux_find_freq(void)
     }

     opal_timer_linux_freq = 0;
-
+    *constant_tsc = 0;
 #if OPAL_ASSEMBLY_ARCH == OPAL_ARM64
-       opal_timer_linux_freq = opal_sys_timer_freq();
+    opal_timer_linux_freq = opal_sys_timer_freq();
 #endif
-
+    loc = find_info(fp, "flags", buf, 1024);
+    if( NULL != loc) {
+        if( NULL != strstr(loc, "constant_tsc") )
+            *constant_tsc = 1;
+    }
     if (0 == opal_timer_linux_freq) {
         /* first, look for a timebase field.  probably only on PPC,
            but one never knows */
@@ -164,10 +168,10 @@ static int opal_timer_linux_find_freq(void)

 int opal_timer_linux_open(void)
 {
-    int ret = OPAL_SUCCESS;
+    int ret = OPAL_SUCCESS, constant_tsc = 0;

     if (mca_timer_base_monotonic && !opal_sys_timer_is_monotonic ()) {
-#if OPAL_HAVE_CLOCK_GETTIME && (0 == OPAL_TIMER_MONOTONIC)
+#if OPAL_HAVE_CLOCK_GETTIME
         struct timespec res;
         if( 0 == clock_getres(CLOCK_MONOTONIC, &res)) {
             opal_timer_linux_freq = 1.e3;
@@ -175,14 +179,28 @@ int opal_timer_linux_open(void)
             opal_timer_base_get_usec = opal_timer_base_get_usec_clock_gettime;
             return ret;
         }
-#else
+#endif  /* OPAL_HAVE_CLOCK_GETTIME && (0 == OPAL_TIMER_MONOTONIC) */
         /* Monotonic time requested but cannot be found. Complain! */
         opal_show_help("help-opal-timer-linux.txt", "monotonic not supported", true);
-#endif  /* OPAL_HAVE_CLOCK_GETTIME && (0 == OPAL_TIMER_MONOTONIC) */
     }
-    ret = opal_timer_linux_find_freq();
-    opal_timer_base_get_cycles = opal_timer_base_get_cycles_sys_timer;
-    opal_timer_base_get_usec = opal_timer_base_get_usec_sys_timer;
+    ret = opal_timer_linux_find_freq(&constant_tsc);
+    if( ret >= 0 ) {
+        if( !(mca_timer_base_monotonic && (0 == constant_tsc)) ) {
+            opal_timer_base_get_cycles = opal_timer_base_get_cycles_sys_timer;
+            opal_timer_base_get_usec = opal_timer_base_get_usec_sys_timer;
+            return OPAL_SUCCESS;
+        }
+    }
+#if OPAL_HAVE_CLOCK_GETTIME
+    /* The frequency might not be constant, so we should not use it */
+    struct timespec res;
+    if( 0 == clock_getres(CLOCK_MONOTONIC, &res)) {
+        opal_timer_linux_freq = 1.e3;
+        opal_timer_base_get_cycles = opal_timer_base_get_cycles_clock_gettime;
+        opal_timer_base_get_usec = opal_timer_base_get_usec_clock_gettime;
+        return OPAL_SUCCESS;
+    }
+#endif  /* OPAL_HAVE_CLOCK_GETTIME */
     return ret;
 }
AndreasKempfNEC commented Feb 22, 2017

Sorry, I do not have that much time to look into this, so I might get some stuff wrong.

First of all, the CPU has the constant_tsc flag set. As far as I can see, the regression was introduced by #2596 because a checkout of b393c3b shows my issue and a5ab6eb does not.

My current understanding is that constant_tsc means all CPU cores tick at a constant rate (I would assume 2.4 GHz for the E5-2680v4), independent of the frequency they are currently operating at, and the commit referenced above makes use of that. I suspect the issue arises when converting these CPU ticks to seconds for MPI_Wtime.

From the few minutes I could spare looking into the code, I saw that the "cpu MHz" field in /proc/cpuinfo might be used to convert ticks to seconds. However, /proc/cpuinfo is definitely not constant: it reflects the frequency set via acpi-cpufreq and can even change with CPU load when the "ondemand" governor is in use. If /proc/cpuinfo is used for the conversion while the TSC ticks at a constant 2.4 GHz, setting the CPU to 1.2 GHz would make Open MPI count two seconds for every 2.4e9 ticks. That would be consistent with the results of the OSU tests I ran. I do not know whether the conversion factor changes during an MPI run; the reference value might only be read once at startup, yet what is in /proc/cpuinfo can depend on the current CPU load.

As I said, this could be incorrect as I am not familiar with the Open MPI code.

(The value in /proc/cpuinfo also depends on whether the turbo is on or off. With turbo on, the frequency is written as 2401.000 MHz. I guess that would mean that there would be a drift of about 1.5 seconds per hour.)

hppritcha commented Feb 23, 2017

@jsquyres do we want a fix for this in 2.1.0?

hjelmn commented Feb 23, 2017

@hppritcha This is a bug. It looks like we have been using the wrong clock for some time. It is correct in most cases, but if the user boosts the base clock I think we can get the wrong result.

@AndreasKempfNEC Does your BIOS boost the base clock? I don't see any other way the tsc calculation could be wrong.

bosilca commented Feb 23, 2017

I understand that "cpu MHz" is a mere supposition. Basically, we read the frequency of the first processor at the moment the process starts up (not even the core associated with the current process) and assume this frequency remains unchanged for the rest of the application. The more we dig into this, the more obvious it becomes that we should restrict our usage of the TSC to the cases where the scaling governor is set to performance and the TSC is constant. In all other cases, we should fall back to a safer, and certainly more expensive, solution (clock_gettime).

There is also the case where the governor is set to ondemand, in which case we could use the max frequency and assume that all cores with MPI processes will be escalated to the max frequency. Or just rely on a more expensive, but more accurate, timer.

ggouaillardet commented Feb 23, 2017

If I understand correctly, with the constant_tsc flag the ticks occur at the processor's nominal speed.
Could we try to parse the "model name" value of /proc/cpuinfo?
For example, on my VM:

model name	: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
cpu MHz		: 3591.947

So could using the 3.60 GHz frequency (instead of 3.591947 GHz) be good enough?

bosilca commented Feb 23, 2017

We could, as this value is indeed supposed to be maximal. But then again, this only works if the governor switches the cores to max perf.

ggouaillardet commented Feb 23, 2017

Did you mean "only works if the constant_tsc flag is set" instead?
Or am I missing something?

zzzoom commented Feb 23, 2017

Intel® 64 and IA-32 Architectures Software Developer’s Manual
18.18.3 Determining the Processor Base Frequency
For Intel processors in which the nominal core crystal clock frequency is enumerated in CPUID.15H.ECX and the core crystal clock ratio is encoded in CPUID.15H (see Table 3-8 “Information Returned by CPUID Instruction”), the nominal TSC frequency can be determined by using the following equation:
Nominal TSC frequency = ( CPUID.15H.ECX[31:0] * CPUID.15H.EBX[31:0] ) ÷ CPUID.15H.EAX[31:0]

rhc54 commented Feb 23, 2017

Just as an FYI: there are several installations intending to field dynamic power mgmt in the next few years. These will definitely operate by adjusting freq on-the-fly, including adjustments on a per-cpu (or per-group-of-cpus) basis. So I don't think you'll be able to rely on a constant value.

hjelmn commented Feb 23, 2017

@rhc54 We can rely on a constant value for the tsc clock unless intel is thinking about abandoning the constant tsc. If they do we will detect it by the constant tsc bit not being set in the cpuid register. @zzzoom has the right approach for getting the constant tsc. I can take a crack at implementing that unless @zzzoom wants to do it.

zzzoom commented Feb 23, 2017

I can give it a shot over the weekend if it isn't urgent.

bosilca commented Feb 23, 2017

What about all the other Linux-based machines (i.e., the ones that are not based on IA-32)?

AndreasKempfNEC commented Feb 23, 2017

Just for reference, the "cpu MHz" fields in /proc/cpuinfo of a two-socket Xeon E5-2670 (Sandy Bridge) with hyperthreading on, using the "Intel P-state driver" (this one does not report clean values, and is probably what ggouaillardet has active), look like the following:

$ grep 'cpu MHz' /proc/cpuinfo 
cpu MHz		: 1873.523 [<- so this one would be taken for the conversion, I guess?]
cpu MHz		: 1550.757
cpu MHz		: 1898.914
cpu MHz		: 2189.890
cpu MHz		: 1282.226
cpu MHz		: 1212.757
cpu MHz		: 1481.085
cpu MHz		: 1398.921
cpu MHz		: 1641.046
cpu MHz		: 1516.429
cpu MHz		: 2069.437
cpu MHz		: 1900.640
cpu MHz		: 1294.007
cpu MHz		: 1497.234
cpu MHz		: 1255.210
cpu MHz		: 1428.781
cpu MHz		: 1510.539
cpu MHz		: 1220.171
cpu MHz		: 2089.445
cpu MHz		: 1801.007
cpu MHz		: 1221.796
cpu MHz		: 1202.398
cpu MHz		: 1674.257
cpu MHz		: 1301.929
cpu MHz		: 2600.000
cpu MHz		: 2115.242
cpu MHz		: 1320.921
cpu MHz		: 1201.789
cpu MHz		: 1252.773
cpu MHz		: 1199.960
cpu MHz		: 2338.375
cpu MHz		: 1214.789

But why did this only become a problem after b393c3b? Did the OPAL_TIMER_MONOTONIC stuff change the default timer?

Reading the nominal frequency from a CPU register sounds very much like the correct solution.

I will change the title of this report in the meantime.

@AndreasKempfNEC AndreasKempfNEC changed the title from CPU frequency-dependent performance issues starting with 2.0.2 to CPU frequency-dependent timer issues starting with 2.0.2 Feb 23, 2017

rhc54 commented Feb 23, 2017

@hjelmn I wasn't trying to imply anything about future directions. I was only trying to indicate that you can't just take one value from /proc/cpuinfo and assume one can use it for the entire life of the process. I concur with the proposed register-based approach so long as you periodically update from it, at least for IA machines.

zzzoom commented Feb 27, 2017

Reading the TSC frequency is definitely the best (and only?) way to handle the constant_tsc case. However I've been toying around with CPUID leaf 15h, and this method only works for a subset of processors, while recent ones require a hardcoded table because the data isn't available on the processor to query.
A good implementation of this would need to be kept up to date because Intel seems to change this randomly every generation or so (e.g. the linux kernel implementation has a new hardcoded entry for Skylake-X which hasn't even been released yet), and that doesn't look to me like something that Open MPI should need to be concerned about.
So yeah, tl;dr it can be done, but it would be ugly, and we should probably be prodding linux kernel/glibc/some hardware library devs to export this information (if it isn't exported somewhere yet) because they need to maintain it anyway.

ccaamad commented Mar 1, 2017

Hi there,

This problem is a deal breaker for us, but I need some of the other fixes that were made in 2.0.2. Is it sensible to go with the following revert until 2.0.3 (or 2.1.0) is out with the proper fix, please? I had to hack a couple of files about a bit...

ccaamad/ompi@0fdc34b

(a slow MPI_Wtime is better than a broken MPI_Wtime!)

Thanks,

Mark

yqin commented Mar 15, 2017

@bosilca, is that patch considered production ready or simply a hack?

ggouaillardet commented Mar 15, 2017

@yqin the patch above is a fix for CPUs on which the constant_tsc flag is not set.
In this case, the flag is set, and the root cause is that the CPU frequency is not nominal
(the governor is ondemand or userspace),
so depending on your configuration, the patch might or might not help.

jsquyres added a commit to jsquyres/ompi that referenced this issue Mar 15, 2017

MPI_WTICK/WTIME: fall back to gettimeofday() for now
Per open-mpi#3003, there's an issue
with the native clock frequencies that we're using for Wtick/Wtime
(namely: they aren't constant throughout the life of the process).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
hjelmn commented Mar 15, 2017

@bosilca That patch isn't correct. We already have a check for an invariant tsc using cpuid. That is the function of opal_sys_timer_is_monotonic().

I did some research, and it looks like we may be able to use bogomips to figure out the tsc frequency. @ccaamad is that constant across the cores in your system? On all my systems both bogomips and cpu MHz are constant.

jsquyres commented Mar 15, 2017

There have been several off-list / off-issue discussions about this today. Summary:

  1. The patch that @bosilca put above looks like it is redundant / unnecessary. It is doing the same thing that opal_sys_timer_is_monotonic() is doing (i.e., checking if the processor supports the invariant TSC) -- but via /proc/cpuinfo rather than via CPU ID assembly. Hence, the code that is already there on master/v3.x/v2.x/v2.0.x all already check (via opal_sys_timer_is_monotonic()/assembly) for invariant TSC support.
  2. The underlying problem is that https://github.com/open-mpi/ompi/blob/master/opal/mca/timer/linux/timer_linux_component.c#L133-L134 is reading the "cpu MHz" from /proc/cpuinfo, and on some machines, that's returning the max frequency of first processor. On other machines, it's returning the current frequency. Hence, it is not a good solution.
  3. The timer component is used in two places:
    1. opal_progress(), to determine how frequently to call libevent. It's not a tragedy if we're a bit imprecise in clocks here -- if we go into libevent a few more/less times that we need to, it's not a huge problem.
    2. MPI_WTICK/MPI_WTIME. Having imprecise WTICK/WTIME is problematic (obviously).
  4. It looks like the proper fix for the Linux timer component is to use the bogomips value instead of the "cpu MHz" value.
    1. This would fix both the opal_progress() and MPI_WTICK/MPI_WTIME usage of the Linux timer.
    2. Disclaimer: @hjelmn and I talked about using bogomips a little this afternoon, but didn't spend enough time to figure out if this is the real/correct solution or not (and how hard it will be to reverse the bogomips number into a usable frequency).
  5. The short-term fix that should likely be applied to v2.0.3 and v2.1.0 is to simply nail down MPI_WTICK/MPI_WTIME to use gettimeofday(3).
    1. It's less precise, but it will be correct, and gives us time to figure out the Correct answer for the Linux timer component.
    2. As noted above, it's ok to leave a little imprecision in opal_progress() while figuring out the real fix. Fixing MPI_WTICK/MPI_WTIME is the immediate goal.

PR's for nailing WTICK/WTIME to gettimeofday(3) will be filed shortly.

bosilca commented Mar 15, 2017

Indeed, our monotonic function check for invariance instead of monotonicity.

According to POSIX gettimeofday should be considered obsolete, and applications should use clock_gettime instead.

hjelmn commented Mar 15, 2017

@bosilca Yeah, I named the function opal_sys_timer_is_monotonic() since what we care about is that the tsc doesn't jump as the process moves around. Probably could have picked a better name.

hjelmn commented Mar 15, 2017

bogomips looks like it will work for our case. According to this: https://en.wikipedia.org/wiki/BogoMips the relevant processors (i.e., both x86 AND invariant tsc) all have a 2x multiplier on the value.

I would prefer a more complete solution but after looking at the code Linux uses to determine the tsc clock rate I don't know if we should go there (http://lxr.free-electrons.com/source/arch/x86/kernel/tsc.c#L667 and http://lxr.free-electrons.com/source/arch/x86/kernel/tsc.c#L743).

jsquyres commented Mar 15, 2017

@bosilca Unfortunately, clock_gettime() has horrid portability issues on OS X. ☹️ We had to stop using clock_gettime() in Libfabric because of said issues -- check out Homebrew/homebrew-core#6551 for the sordid details (you can probably start reading around Homebrew/homebrew-core#6551 (comment) or so).

Meaning: it is fine to use clock_gettime() in the Linux timer component. But it's problematic to use it up in MPI_WTICK/MPI_WTIME because OS X is evil with regards to clock_gettime(3) (we might be able to add a bunch of extra configury checks to figure it all out -- but that doesn't seem worth it if using gettimeofday(3) is a temporarily solution for WTICK/WTIME). IMHO.

bosilca commented Mar 15, 2017

Our detection of clock_gettime is already doing the right things (aka. it is not only checking the existence of the prototype). So we can use it as long as the configure detects it, and fall back to gettimeofday for all other cases.

markalle added a commit to markalle/ompi that referenced this issue Apr 5, 2017

timer: hack use of clock_gettime
better solution needed later
 workaround for #3003

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
sjeaugey commented May 1, 2017

It seems this is also affecting 1.10.6 (maybe also 1.10.5) because of commit 5a44018.

We're seeing MPI_Wtime report wrong results on our machines.
The CPUs are "Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz"; they do have the constant_tsc flag, but using rdtsc doesn't seem to work well...

I'm tempted to add the "critical" flag. Should we open a new bug for v1.10?

@hjelmn @hppritcha what do you think?

sjeaugey commented May 1, 2017

I wrote a simple reproducer comparing MPI_Wtime and gettimeofday. The "cpu MHz" value in /proc/cpuinfo is indeed not the TSC frequency at all and completely depends on cpufreq.

Replacing "cpu MHz" by "bogomips" and dividing by two seems to fix the issue, but it seems dangerous. There seems to be no robust way to determine the TSC frequency from userspace, so I think we should just roll back those changes.

jsquyres commented May 2, 2017

@sjeaugey Does this also affect v2.0.x, v2.1.x, master, and/or 3.0.x?

sjeaugey commented May 2, 2017

I didn't check all the branches but it was opened on 2.0.2, the bug is still open, and the code I see in master seems to be the same as in 1.10.6.
It seems you all added a patch to your own local branches to fix wtime but I don't see any commit on master (unless it was not referenced here). I would imagine you know better than I do.

hjelmn commented May 2, 2017

We can not roll back the tsc change. We can not use clock_gettime() as the internal timer without killing performance. We addressed the wtime issue on master. Did we not bring it to v2.0.x?

sjeaugey commented May 2, 2017

@hjelmn can you point me to how it was fixed in master? I don't see any commit referencing this issue.

hjelmn commented May 2, 2017

4009ba6
48d13aa
b933152

Blame Intel for no standard way of getting the TSC frequency. Total PIA.

drossetti commented May 2, 2017

I might be wrong, but if constant_tsc is true, 'tsc ticks per second' would be a calibration constant throughout the life of the process.
Why can't you simply calculate it once and for all at run-time in some _init() function by using gettimeofday()?

hjelmn commented May 2, 2017

That is certainly a way to do it. The problem is not that there isn't a solution but that the solution takes developer cycles that are not available. If someone has time and opens a PR we will gladly take a look.

sjeaugey commented May 2, 2017

Ok, my only concern is v1.10 right now. @jsquyres Could we release 1.10.7 with a fix ? Or do we no longer release 1.x and moved to 2.x ?

I'm just worried that many people are still picking 1.10.6 as the latest "stable" version and get weird performance results.

hjelmn commented May 2, 2017 (Member)

That is all up to @rhc54. The master fixes should apply to v1.10.x.

rhc54 commented May 2, 2017 (Member)

I find myself confused by the above - it started with a specific issue on the 2.0 branch, and then sprawled all over the place. I see multiple PRs and commits referencing this issue, and then commits @hjelmn referenced that have nothing to do with those PRs and include #if 0 code blocks.

Before I can make a judgement on v1.10, can someone please explain where we actually wound up on this mess, and remove those dead code blocks?

jsquyres commented May 3, 2017 (Member)

So from my reading of the code:

  • v1.10: wtick/wtime still uses the opal routines to use TSC.
  • master: wtick/wtime uses clock_gettime(3) on Linux, gettimeofday(3) elsewhere
  • v2.0.x: wtick/wtime uses clock_gettime(3) on Linux, gettimeofday(3) elsewhere
  • v2.x: wtick/wtime uses clock_gettime(3) on Linux, gettimeofday(3) elsewhere
  • v3.x: wtick/wtime uses clock_gettime(3) on Linux, gettimeofday(3) elsewhere

@hjelmn You mentioned above that we should not be using clock_gettime(3) -- I'm assuming that that is in reference to the internal OPAL timers, and using clock_gettime(3) / gettimeofday(3) for wtick/wtime is fine.

rhc54 commented May 3, 2017 (Member)

I think this is what has me so confused. The initial report targeted the 2.x series - yet somehow, that code seems to be the one being pushed here. So is the code in v2.x wrong? If so, why would we want to push it into 1.10? Or did someone modify master and then backport the correct fix to everything but 1.10?

hjelmn commented May 3, 2017 (Member)

@jsquyres Correct. We can not use clock_gettime() for the internal timer used in opal_progress(). In v1.10.0-v1.10.1 we used the tsc for both opal_progress() and wtime. A user noticed that the clock was non-monotonic on an older x86 system so we disabled it in v1.10.2. I added a check for the monotonic tsc and re-enabled the tsc for v1.10.3. This issue came up shortly after that because the way we calculate the tsc frequency is wrong in some cases. In response we made a couple of changes to master (and v2.0.x, v2.x, etc) to make wtime always use gettimeofday() or clock_gettime() and always use a cycle counter (if available) for opal_progress(). This last set of changes is apparently not in v1.10.x and @sjeaugey is requesting that we make a v1.10.7 with it.

rhc54 commented May 3, 2017 (Member)

I see - and the dead code block was left for future developers to consider (as per the comment).

I have no objections to releasing a final 1.10.7 for cleanup purposes, if someone wants to assemble the PR.

sjeaugey added a commit to sjeaugey/ompi that referenced this issue May 4, 2017

mpi/c: Force wtick/wtime to use gettimeofday
 * See open-mpi#3003 for a discussion about
   this patch. Once we get a better version in place we can revert this
   change.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>

sjeaugey added a commit to sjeaugey/ompi that referenced this issue May 4, 2017

timer: hack use of clock_gettime
better solution needed later
 workaround for #3003

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
sjeaugey commented May 4, 2017 (Member)

Assembling the PR, I took 1.10, then added 48d13aa and b933152 (visible at https://github.com/sjeaugey/ompi/tree/wtime-fix-v1.10).

@hjelmn I see 4009ba6 is a bug fix, but is it really related to this issue? I can add it too; I'm just trying to avoid changing too many things.

And in any case, I'm not very comfortable with those "workaround patches" (48d13aa and b933152) and I'm wondering if simply reverting 5a44018 wouldn't roll back the situation to what it was in 1.10.5. @hjelmn, did we need that patch to fix something (or for some other code) or was it just meant as an improvement?

hjelmn commented May 5, 2017 (Member)

@sjeaugey We absolutely can not roll back 5a44018. That will re-introduce the performance regression. The work-around patches are ok and well contained. I don't think we need 4009ba6 but it is a nice to have.

AndreasKempfNEC commented May 5, 2017

Okay, if the latest versions of all release branches are free from this problem, this report no longer needs to stay open.

I guess I will close it pretty soon, then. Or somebody else can do so.

jsquyres commented May 5, 2017 (Member)

@AndreasKempfNEC Probably want to wait for @sjeaugey's PR for the v1.10 branch -- that should close out the issue on all the current release branches.

yosefe commented May 16, 2017 (Contributor)

@jsquyres @hppritcha we see on some systems that MPI_Wtime() reports wrong results (due to cpu frequency scaling). Why was it decided to use clock_gettime() on Linux, and not gettimeofday()?
Can we go with the initial suggestion to "nail down MPI_WTICK/MPI_WTIME to use gettimeofday(3)"?

jsquyres commented May 16, 2017 (Member)

There were user requests to make MPI_WTIME more accurate.

All the branches should now have -- at least temporarily? -- set MPI_WTICK/MPI_WTIME to use gettimeofday(3).

yosefe commented May 16, 2017 (Contributor)

@jsquyres according to https://github.com/open-mpi/ompi/blob/v2.x/ompi/mpi/c/wtime.c#L60 clock_gettime() is used on linux - what am i missing?

jsquyres commented May 16, 2017 (Member)

@yosefe My mistake -- I was thinking about the fact that we commented out the opal_* calls to get the times, which definitely had problems with frequency scaling.

You're saying that clock_gettime() itself is showing frequency scaling issues. That is new information to me. Can you show the code where this is happening?

yosefe commented May 18, 2017 (Contributor)

@jsquyres we are running the osu_bandwidth benchmark, which reported unreasonable [too good to be true] results on a particular system (Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz), and after changing MPI_Wtime() to gettimeofday() the results are OK.
Can't point to the exact code in clock_gettime which is causing this, though...

jsquyres commented May 18, 2017 (Member)

@bwbarrett @bosilca @hjelmn Have any of you guys heard of clock_gettime() suffering from frequency scaling issues?

rhc54 commented May 18, 2017 (Member)

According to what I can find in Intel manuals, clock_gettime(CLOCK_MONOTONIC) is guaranteed to be independent of cpu freq starting with Nehalem in 2008.

jsquyres commented May 19, 2017 (Member)

@yosefe I don't doubt your results, but per Intel's documentation, I have to wonder if something else was going on when you ran your tests...? Can you make a small example that shows that clock_gettime(CLOCK_MONOTONIC) is dependent on CPU frequency?

yosefe commented May 19, 2017 (Contributor)

@jsquyres thanks, i'll try to dig some more into this.

yosefe commented May 24, 2017 (Contributor)

@jsquyres we checked it again and we had an older version of OMPI which measured the cycle count and not clock_gettime(). Also, a small clock_gettime() test showed the correct timing. So we are good, sorry for the confusion.
