Improve x86-64 timer performance #2926

Merged
merged 2 commits into open-mpi:master on Feb 7, 2017

Conversation

4 participants
@zzzoom
Contributor

zzzoom commented Feb 4, 2017

WIP

On x86-64 a significant amount of processor time is currently spent in opal_timer_base_get_usec_sys_timer, used mainly by opal_progress. The function calls opal_sys_timer_get_cycles to get a cycle count and divides it by a frequency.
In asm this comes out as either cpuid; rdtsc; salq; orq; divq or (with RDTSCP support) rdtscp; cpuid; salq; orq; divq. The problem is that both cpuid and divq are expensive.

This PR tackles both:

  • cpuid is replaced by the cheaper lfence for serialization, like rdtsc_ordered does in the Linux kernel (a sketch of the resulting read follows below). This is safe on Intel processors (http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html page 8-17), but apparently not on AMD, which requires mfence instead. I'm not sure where in Open MPI it would be best to perform runtime CPU detection and set up the dispatch.
  • divq is avoided by using cycles directly instead of usecs in opal_progress where possible. This is done by flipping the timer priority from USEC_NATIVE > CYCLES_NATIVE > fallback cycles to CYCLES_NATIVE > USEC_NATIVE > fallback cycles.

I haven't looked for a good microbenchmark that isolates the effect, but I've been getting decent time improvements on NPB (mainly FT) and Quantum Espresso.
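For reference, a minimal sketch of the serialized counter read described above, assuming GCC-style inline asm; get_cycles and cycles_per_usec are illustrative names, not the actual OPAL symbols:

```c
#include <stdint.h>

/* Sketch of an lfence-serialized TSC read, in the spirit of the kernel's
 * rdtsc_ordered(). On Intel, lfence waits for all prior instructions to
 * complete before rdtsc executes; AMD would need mfence instead (see the
 * discussion below). */
static inline uint64_t get_cycles(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("lfence\n\t"
                         "rdtsc"
                         : "=a" (lo), "=d" (hi));
    /* the salq/orq pair from the instruction listing above */
    return ((uint64_t) hi << 32) | lo;
}

/* The usec path then still pays the divq:
 *     usec = get_cycles() / cycles_per_usec;
 * which is what flipping the priority to CYCLES_NATIVE avoids. */
```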

zzzoom added some commits Feb 4, 2017

opal_progress: use usec native timer only when a native cycle counter isn't available

Signed-off-by: Carlos Bederián <bc@famaf.unc.edu.ar>
amd64 timers: use lfence instead of cpuid for serialization
Signed-off-by: Carlos Bederián <bc@famaf.unc.edu.ar>
@zzzoom
Contributor

zzzoom commented Feb 4, 2017

Who maintains @ibm-ompi? Those logs aren't very useful.

@jjhursey
Member

jjhursey commented Feb 6, 2017

@zzzoom I maintain @ibm-ompi - It looks like our CI got stuck. You can re-run it by adding a comment:
bot:ibm:retest

@jjhursey
Member

jjhursey commented Feb 6, 2017

The IBM CI is hitting a local file system issue that I need to resolve with the sysadmins. I'm turning it off for now; disregard the CI failures. I'll retest when it comes back online.

@hjelmn
Member

hjelmn commented Feb 6, 2017

👍 Thanks for doing this.

: "=r" (h), "=r" (l)
:: "rax", "rbx", "rcx", "rdx");
#endif
: "=a" (l), "=d" (h));

@hjelmn
hjelmn Feb 6, 2017
Member

One question. Why remove the rdtscp path? Is it slower than the rdtsc instruction?

@zzzoom
zzzoom Feb 6, 2017
Contributor

According to http://agner.org/optimize/instruction_tables.pdf, rdtscp has higher latency than lfence+rdtsc and similar or worse ordering guarantees.

@hjelmn
hjelmn Feb 7, 2017
Member

Ok. Thanks! I plan to merge this soon and PR it to 2.0.x and 2.1.x.

@zzzoom
zzzoom Feb 7, 2017
Contributor

On AMD processors mfence should be used instead of lfence, but I don't know where in opal I should detect the processor and change the function pointer.
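A hedged sketch of what that runtime check could look like (cpu_is_amd is a hypothetical helper, not existing OPAL code; CPUID leaf 0 does return the vendor string split across EBX, EDX, ECX), leaving open where the resulting dispatch would live:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical vendor check: CPUID with EAX=0 yields "GenuineIntel" or
 * "AuthenticAMD" in EBX:EDX:ECX. An init-time hook could use this to select
 * an mfence-based timer variant on AMD and the lfence-based one otherwise,
 * e.g. by assigning a function pointer once at startup. */
static bool cpu_is_amd(void)
{
    uint32_t eax = 0, ebx, ecx, edx;
    char vendor[13];

    __asm__ __volatile__("cpuid"
                         : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx));
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';

    return strcmp(vendor, "AuthenticAMD") == 0;
}
```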

@bosilca
bosilca Feb 7, 2017
Member

Such an approach will force us to lose the benefit of the inlined function. I like how the Linux kernel solves this issue, but it might be a little too intrusive to bring all that mechanism into OMPI.

What happens on AMD processors if we use lfence instead of mfence? Would the timer stop being monotonic?

@zzzoom
zzzoom Feb 7, 2017
Contributor

@bosilca no, it will have more variance.

@zzzoom
zzzoom Feb 9, 2017
Contributor

@bosilca I just took another look at the code: opal_sys_timer_get_cycles only gets inlined into opal_timer_base_get_usec_sys_timer and opal_timer_base_get_cycles_sys_timer (which just forwards the result). These two functions are then called through the opal_timer_base_get_usec and opal_timer_base_get_cycles pointers, so there might be some gains to be had if we got rid of all the indirection and allowed inlining.
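A simplified sketch of the call chain being described (real OPAL symbol names, but illustrative bodies and types; in OPAL the pointer is assigned during timer-component initialization, not statically):

```c
#include <stdint.h>

/* opal_sys_timer_get_cycles() inlines into the _sys_timer wrapper... */
static inline uint64_t opal_sys_timer_get_cycles(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("lfence\n\trdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t) hi << 32) | lo;
}

static uint64_t opal_timer_base_get_cycles_sys_timer(void)
{
    /* ...but this wrapper is only ever reached through a pointer, */
    return opal_sys_timer_get_cycles();
}

/* ...so every caller pays an indirect call that blocks further inlining. */
uint64_t (*opal_timer_base_get_cycles)(void) =
    opal_timer_base_get_cycles_sys_timer;
```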

@hjelmn
Member

hjelmn commented Feb 6, 2017

:bot:mellanox:retest

@hjelmn
Member

hjelmn commented Feb 6, 2017

bot:ibm:retest

@hjelmn
Member

hjelmn commented Feb 6, 2017

@zzzoom I don't even know if we care about instruction ordering with the rdtsc instruction. All we need for MPI_Wtime() is monotonicity. For opal_progress all we care about is that the event library fires at some point.

@zzzoom
Contributor

zzzoom commented Feb 6, 2017

@hjelmn people use MPI_Wtime() to time code, so I didn't want to introduce a regression by making it less accurate in some weird case where the processor is particularly crafty about reordering instructions.
opal_progress could use a lower-precision timer that only does rdtsc, but the 4-cycle latency penalty of lfence probably isn't worth the refactoring.

@hjelmn
Member

hjelmn commented Feb 7, 2017

bot:ibm:retest

@hjelmn hjelmn merged commit 9f073d7 into open-mpi:master Feb 7, 2017

7 of 10 checks passed

IBM-CI (GNU Compiler): Test FAILED
IBM-CI (PGI Compiler): Test FAILED
IBM-CI (XL Compiler): Test FAILED
LANL-CI: Test Passed
LANL-OS-X: Test Passed
LANL-disable-dlopen: Test Passed
LANL-distcheck: Test Passed
Mellanox: Build finished.
Signed-off-by: All commits signed off. Yay!
continuous-integration/travis-ci/pr: The Travis CI build passed

@zzzoom zzzoom referenced this pull request Feb 7, 2017

Closed

Fix configure RDTSCP check #2865
