AVX-based MPI_OP performance regression #8334
@rajachan how much performance degradation did you measure? My understanding is that on recent processors, unaligned load/store instructions are as efficient as aligned ones when the data happens to be aligned. Is it fair to say that the performance hit is caused by the combined use of unaligned load/store and gcc 4.8.5?
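For readers skimming the thread, a minimal standalone sketch (illustrative only, not the op/avx component's actual code) contrasting the aligned and unaligned 256-bit forms of a double-precision SUM; the component moved to the unaligned intrinsics because MPI buffers are not guaranteed to be 32-byte aligned:

#include <immintrin.h>
#include <stddef.h>

/* inout[i] += in[i]; both pointers must be 32-byte aligned */
static void sum_double_aligned(const double *in, double *inout, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d a = _mm256_load_pd(in + i);               /* aligned load */
        __m256d b = _mm256_load_pd(inout + i);
        _mm256_store_pd(inout + i, _mm256_add_pd(a, b));  /* aligned store */
    }
    for (; i < n; i++) inout[i] += in[i];                 /* scalar tail */
}

/* Same operation with no alignment requirement on the buffers */
static void sum_double_unaligned(const double *in, double *inout, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d a = _mm256_loadu_pd(in + i);               /* unaligned load */
        __m256d b = _mm256_loadu_pd(inout + i);
        _mm256_storeu_pd(inout + i, _mm256_add_pd(a, b));  /* unaligned store */
    }
    for (; i < n; i++) inout[i] += in[i];
}

Compiled with -mavx, both variants perform the same arithmetic; only the load/store encoding differs.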
@ggouaillardet it's a ~10% degradation. The average loop time of LAMMPS (in seconds) jumps from 7.7 to 8.6.
@shijin-aws thanks. Well, that is 10% of the user app (e.g. not 10% of the MPI_Allreduce itself)? What happens if you disable the op/avx component?
@ggouaillardet Using
@shijin-aws Thanks. Can you please confirm GCC 4.8.5 is to blame here? In that case, should we simply not build the op/avx component when building with GCC 4.8.5?
@ggouaillardet Sure, @rajachan suggested the same thing. I am trying to build it with gcc7 on my machine to confirm the root cause is gcc 4.8.5.
You can always use the reduce_local test in test/datatype with the type and op of your liking to see if there is any performance degradation in the AVX part of the reduction operation. On a Skylake machine I see a difference of about 15% for the double SUM local reduction operation between gcc 4.8.5 and gcc 10.2.0. Looking a little deeper into this, it seems that gcc 4.8.5 does not understand the
@bosilca I am afraid this is a different issue. The reported issue is with MPI_Allreduce in the application. I applied the patch below to the reduce_local test. Then I found
diff --git a/test/datatype/reduce_local.c b/test/datatype/reduce_local.c
index 97890f9..9a69b1b 100644
--- a/test/datatype/reduce_local.c
+++ b/test/datatype/reduce_local.c
@@ -115,11 +115,12 @@ do { \
const TYPE *_p1 = ((TYPE*)(INBUF)), *_p3 = ((TYPE*)(CHECK_BUF)); \
TYPE *_p2 = ((TYPE*)(INOUT_BUF)); \
skip_op_type = 0; \
- for(int _k = 0; _k < min((COUNT), 4); +_k++ ) { \
- memcpy(_p2, _p3, sizeof(TYPE) * (COUNT)); \
- tstart = MPI_Wtime(); \
- MPI_Reduce_local(_p1+_k, _p2+_k, (COUNT)-_k, (MPITYPE), (MPIOP)); \
- tend = MPI_Wtime(); \
+ tstart = MPI_Wtime(); \
+ for(int _k = 0; _k < min((COUNT), d); +_k++ ) { \
+ for(int _r = 0; _r < repeats; _r++) { \
+ memcpy(_p2, _p3, sizeof(TYPE) * (COUNT)); \
+ MPI_Reduce_local(_p1+_k, _p2+_k, (COUNT)-_k, (MPITYPE), (MPIOP)); \
+ } \
if( check ) { \
for( i = 0; i < (COUNT)-_k; i++ ) { \
if(((_p2+_k)[i]) == (((_p1+_k)[i]) OPNAME ((_p3+_k)[i]))) \
@@ -131,6 +132,7 @@ do { \
} \
} \
} \
+ tend = MPI_Wtime(); \
goto check_and_continue; \
} while (0)
@@ -163,15 +165,21 @@ int main(int argc, char **argv)
{
static void *in_buf = NULL, *inout_buf = NULL, *inout_check_buf = NULL;
int count, type_size = 8, rank, size, provided, correctness = 1;
- int repeats = 1, i, c;
+ int repeats = 1, i, c, d = 4;
double tstart, tend;
bool check = true;
char type[5] = "uifd", *op = "sum", *mpi_type;
int lower = 1, upper = 1000000, skip_op_type;
MPI_Op mpi_op;
- while( -1 != (c = getopt(argc, argv, "l:u:t:o:s:n:vfh")) ) {
+ while( -1 != (c = getopt(argc, argv, "d:l:u:t:o:s:n:vr:fh")) ) {
switch(c) {
+ case 'd':
+ d = atoi(optarg);
+ if( d < 1 ) {
+ fprintf(stderr, "Disalignment must be greater than zero\n");
+ exit(-1);
+ }
case 'l':
lower = atoi(optarg);
if( lower <= 0 ) {
@ggouaillardet @rajachan I rebuilt the application with gcc/g++ 7.2.1 on the same OS (alinux1), but the performance does not go back to the level of Open MPI 4.0.5.
Here's what I am seeing with vanilla reduce_local and 32-bit float sums:
So are we converging on this being a compiler issue? I.e., certain versions of gcc give terrible performance? If so, is there a way we can detect this in configure and react appropriately?
That's what it is looking like to me. I'm going to try @shijin-aws's test with the actual application again to make sure he wasn't inadvertently running with the older compiler.
I've reproduced @shijin-aws's observation. With the LAMMPS application, even with the newer gcc (7.2.1), runs using op/avx perform worse than the ones without:
$ /shared/ompi/install/bin/mpirun --mca op ^avx -n 1152 -N 36 -hostfile /shared/ompi/hfile /shared/lammps/bin/lmp -in /shared/lammps/bin/in.chute.scaled -var x 90 -var y 90
$ /shared/ompi/install/bin/mpirun --mca op avx -n 1152 -N 36 -hostfile /shared/ompi/hfile /shared/lammps/bin/lmp -in /shared/lammps/bin/in.chute.scaled -var x 90 -var y 90
$ gcc --version
From the OMPI config log:
Looks like there's more to it than the compiler versions and their AVX support.
@rajachan thanks for confirming there is more to it than the gcc version. Would you be able to reproduce this issue with a smaller config?
Yup, it is more evident with a single-node run.
With op/avx (--mca op avx -n 24 -N 24):
Without op/avx (--mca op ^avx -n 24 -N 24):
Times are in seconds. Will run it through a profiler.
Stating the obvious with some pretty charts, but the mpiP profile from the run without op/avx shows the aggregate Allreduce cost across ranks. Here are the mpiP profiles from the two runs and some more charts in case you want to look them over. I'll take a closer look too.
This is totally puzzling. Assuming we are pointing toward the AVX support as the culprit behind this performance regression, I went ahead and tested just the MPI_OP, and I am unable to replicate it anywhere. I've tried Skylake with gcc 4.8.5, 7.0.2 and 10.2.0. Again, I have not looked at the performance of the MPI_Allreduce collective, but specifically at the performance of the MPI_OP. As it was not clear from the discussion which particular MPI_Allreduce introduced the issue, a quick grep in the LAMMPS code highlights 2 operations that stand out: sum and max on doubles. I also modified the reduce_local test to be able to test specific shifts or misalignments of the buffers, to see if that could be the issue. Unfortunately, all these efforts were in vain; nothing unusual popped up, and performance looks 15-20% better when AVX is turned on, for both sum and max, and for all of the compilers mentioned above. It would be great if you could run the same tests on your setup. You will need to patch your code with 20be3fc (from #8322), and run
Same here, with the enhanced test and pinning the process; make sure the
My previous tests were looking at the performance of a single MPI_OP running undisturbed on the machine, so I thought maybe the issue does not come from the MPI_OP itself but from running multiple of these MPI_OPs simultaneously. So I ran the OSU allreduce test on all the Skylake cores I had access to, and it reflected the same finding as above: the AVX version is 5.7% faster than the non-AVX one (2092.11 us vs. 2170.22 us) for the code compiled with gcc 4.8.5.
With 20be3fc from #8322 cherry-picked on v4.1.x and GCC 4.8.5: op/avx excluded:
op/avx included:
The reproduction seems limited to the application's pattern. I will take a closer look at LAMMPS's usage of the collective today.
I've been looking at this for a while now, and I still do not have a silver bullet here. Just the use of AVX for the op seems to be slowing things down. Poking around the literature, I see several references to frequency scaling caused by heavy use of AVX on multiple cores simultaneously, and that causing slowdowns. Is this something you are aware of, and could it be a probable cause? https://dl.acm.org/doi/10.1145/3409963.3410488
I am just running the benchmark case that comes with LAMMPS, in case you want to give it a try on your end. Like I mentioned earlier, I can reproduce this with newer versions of GCC and a single compute instance.
@rajachan thanks for the report. Frequency scaling is indeed a documented drawback of AVX that, in the worst case, slows things down, especially on a loaded system.
Let's not jump to conclusions yet. If I read the graphs posted by @rajachan correctly, we are looking at a 10x performance decrease for the allreduce between the AVX and the non-AVX version, while the papers talk about a 10% decrease for a similar workload (a mix of AVX and non-AVX operations). We already have an MCA parameter, op_avx_support, to control how much of the hardware AVX support the user allows: 0 means no AVX/SSE, 31 means just AVX, 53 means AVX and AVX2, and leaving it unchanged allows everything possible/available.
@wesbland Have you seen anything like this? Can you perhaps connect us to someone over there who can help us figure out the right path forward?
@bosilca With LAMMPS running on 24 ranks on a single compute node, here's what I see with the various AVX levels:
OMPI_MCA_op_avx_support=0
OMPI_MCA_op_avx_support=31
OMPI_MCA_op_avx_support=53
Default:
I am running this on a Skylake system with the following capabilities:
Based on these numbers it seems we should leave the avx component enabled and with high priority on x86, but restrict it to use only AVX instructions.
Agreed, this sounds like the safest option without having to change the priority.
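For illustration, a hypothetical sketch of what that default could look like: runtime detection with GCC's __builtin_cpu_supports(), with AVX2/AVX-512 kept behind an explicit opt-in such as the op_avx_support MCA parameter (names are illustrative, not the component's actual symbols):

#include <stdbool.h>

typedef struct {
    bool use_avx;
    bool use_avx2;
    bool use_avx512;
} op_caps_t;

/* Detect what the CPU offers, but only enable 256-bit AVX by default;
 * wider vector sets require the user to opt in explicitly. */
static op_caps_t select_caps(bool user_opted_in_wide_vectors)
{
    op_caps_t caps = { false, false, false };
    caps.use_avx    = __builtin_cpu_supports("avx");
    caps.use_avx2   = user_opted_in_wide_vectors && __builtin_cpu_supports("avx2");
    caps.use_avx512 = user_opted_in_wide_vectors && __builtin_cpu_supports("avx512f");
    return caps;
}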
I just discovered that the Intel compiler does not define the AVX* macros without a specific -m option. Kudos to the icc folks, way to go. I have a patch; I will restrict AVX512 as well.
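For context, the vectorized paths are selected at build time via compile checks that rely on the compiler's predefined AVX* macros, so a compiler that does not define them for the options in use silently drops to the fallback path. A minimal illustration (not the actual Open MPI probe):

#include <stdio.h>

int main(void)
{
#if defined(__AVX512F__)
    printf("compiler advertises AVX-512F\n");
#elif defined(__AVX2__)
    printf("compiler advertises AVX2\n");
#elif defined(__AVX__)
    printf("compiler advertises AVX\n");
#else
    printf("no AVX macros defined; scalar fallback\n");
#endif
    return 0;
}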
@rajachan did you bind MPI tasks to a single core (e.g. --bind-to core)? I suspect AVX512 frequency scaling might cause unnecessary task migration that could severely impact performance.
I used the default binding policy in that last run, but there's a degradation after pinning to cores as well, just not as pronounced:
OMPI_MCA_op_avx_support=0
OMPI_MCA_op_avx_support=31
OMPI_MCA_op_avx_support=53
Default:
@rajachan thanks for the interesting numbers. Without the
As discussed in open-mpi#8334 the intensive use of AVX2 and AVX512 will lead to significant frequency scaling, with a drastic impact on application performance. Until a better way to prevent this comes up, we decided to artificially disable support for AVX2 and AVX512. The support for AVX2 and AVX512 can be enabled by users via the corresponding MCA parameter (op_avx_support). Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Yes, that's a bit puzzling too, and should be looked at separately in addition to the AVX512 issue.
We are trying to replicate these results on a single, Skylake-based node, but so far we are unable to highlight any performance regression with AVX2 or AVX512 turned on. @dong0321 will post the results soon. Meanwhile, I will amend #8372 and #8373 to remove the part where I alter the flags of the AVX component, such that we can pull in the fix for icc, but without reducing [yet] the capabilities of the AVX component.
I did the same experiments as @rajachan described.
Experiment environment:
Here are the command lines and results:
The results show a different story: without op/avx the performance is the worst. With AVX enabled (AVX alone, AVX2, AVX512, or a mix of those), we see a speedup of 14%-35%.
I'm talking to George offline about this. I am setting up a test cluster for @dong0321 to check out the differences between our two runs. We will report back with findings.
I had a vanilla build of OMPI, but @dong0321 had
I reproduced Raja's results on Skylake (Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz):
~/opt/ompi/4.1.x/bin/mpirun --mca op avx --mca op_avx_support 0xfff --bind-to core -np 24 path/lmp_mpi -in /path/in.chute.scaled -var x 30 -var y 30
~/opt/ompi/4.1.x/bin/mpirun --mca op avx --mca op_avx_support 0x3f --bind-to core -np 24 path/lmp_mpi -in /path/in.chute.scaled -var x 30 -var y 30
~/opt/ompi/4.1.x/bin/mpirun --mca op avx --mca op_avx_support 0x1f --bind-to core -np 24 path/lmp_mpi -in /path/in.chute.scaled -var x 30 -var y 30
I also tested on Cascade Lake (Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz), which shows no performance decrease with AVX512 or AVX2.
We took another approach: modify reduce_local to replace the local reduce with an MPI_Allreduce on MPI_COMM_WORLD, with unaligned and aligned data. It shows no performance decrease. This is on Cascade Lake (Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz). Testing allreduce with aligned and unaligned buffers, the result shows AVX is still faster than non-AVX.
$ /home/zhongdong/opt/git/george_branch/bin/mpirun -np 24 reduce_local ...
$ /home/zhongdong/opt/git/george_branch/bin/mpirun -mca op ^avx -np 24 reduce_local ...
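A sketch of the shape of that modification (assumed, not the exact patch used): the timed MPI_Reduce_local call in reduce_local.c is swapped for an MPI_Allreduce over MPI_COMM_WORLD, keeping the same pointer shift so the reduction still sees misaligned buffers when requested:

#include <mpi.h>
#include <string.h>

/* Time `repeats` allreduces of `count - shift` doubles, starting `shift`
 * elements into the buffers to exercise misaligned access.
 * Assumes MPI has already been initialized by the caller. */
static double time_allreduce(const double *in, double *inout, const double *check,
                             int count, int shift, int repeats)
{
    double tstart = MPI_Wtime();
    for (int r = 0; r < repeats; r++) {
        memcpy(inout, check, sizeof(double) * count);        /* reset inout */
        MPI_Allreduce(in + shift, inout + shift, count - shift,
                      MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }
    return MPI_Wtime() - tstart;
}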
@dong0321 Can you repeat this test on the Skylake system where we reproduced the impact on LAMMPS?
Allreduce results on Skylake (Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz):
$ opt/ompi/4.1.x/bin/mpirun -mca op ^avx -np 24 reduce_local ...
$ opt/ompi/4.1.x/bin/mpirun -np 24 reduce_local ...
This regression does not happen when compiling OMPI with icc. The issue seems confined to the use of gcc (tested with multiple versions, up to a v11 candidate built from source). LAMMPS developers have confirmed they are not making explicit use of AVX512 here. @bosilca I propose updating #8376 to conditionally use those combinations only when the Intel compilers are used.
Allow me to summarize the situation as I understand it. We have a performance regression in one application, on a particular processor, when compiled with a particular compiler (many versions of the same compiler). Analysis of the regression in the application context pinpoints the performance issue on an MPI_Allreduce, but we are unable to replicate it (even using the same set of conditions) in any standalone benchmark. In addition, we have not been able to reproduce the performance regression with other applications, even on the exact same setup. So, I'm not sure I understand the proposal here. Allow AVX only when OMPI is compiled with icc? When the application is compiled with icc? Both of these are possible, but unnecessarily restrictive. At this point we have no common denominator and no understanding of the root cause. I would advocate we do nothing, add some wording to the FAQ, and, while we can leave this ticket open for future inquiries, move forward and remove the blocking label.
Given this is a performance optimization we are talking about, and given it was just introduced in this series, yes, that is exactly why I am proposing we be conservative. We have one affected application that we know of and we don't have a full understanding of the problem, so we cannot say no other application is impacted (we don't know what we don't know). We learned from the LAMMPS developers that there should be nothing special about their use of Allreduce. I am just repeating myself at this point, but the fact that we need more investigation is enough to say we should not make this the default for everyone. In my tests with the Intel compiler, I had just compiled OMPI with icc, not the app. We have a few different drivers for a 4.1.x bugfix release and I don't want to hold that up any further, so if you want to take the FAQ route I'm fine with that.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From source tarball, default configuration built with GCC 4.8.5.
Please describe the system on which you are running
Details of the problem
We noticed an OS-specific regression with LAMMPS (in.chute.scaled case) with 4.1.0. Bisecting through the commits, this seems to have been introduced with the AVX-based MPI_OP changes that got backported into this series. Specifically, the commit which moved to the unaligned SSE memory access primitives for reduce OPs seems to be causing it: #7957
This was added to address the Accumulate issue, so it is a necessary correctness fix (#7954).
The actual PR which introduced the SSE-based MPI_OP in the first place was backported from master: #7935
Broadly, allreduce performance seems to have taken a hit in 4.1.0 compared to 4.0.5 in this environment because of these changes. We do not see this with Amazon Linux 2 (which has a 7.x series GCC) or Ubuntu 18, for instance.
Tried with #8322 just in case; that does not help either.
@bosilca does anything obvious stand out to you?