Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

op/aarch64: refactor SVE functions #12683

Merged
merged 1 commit into from
Jul 17, 2024

Conversation

ggouaillardet
Copy link
Contributor

Refactor SVE functions and incidentally make NVIDIA compilers a happy panda again.

@ggouaillardet
Copy link
Contributor Author

@bosilca this is a refactor of the SVE functions, to make them more lookalike the SVE VLA examples provided by Arm.

I was initially unable to compile the op/aarch64 component with NVIDIA compilers 24.5 because a compiler issue.
The compiler bug can be evidenced with the following code (built on a Grace CPU fwiw)

#include <arm_sve.h>
#include <unistd.h>

#if 1
void ompi_op_aarch64_2buff_max_uint8_t_sve (const void *_in, void *_out, int *count) {
        int types_per_step = svcntb();
        size_t idx = 0, left_over = *count;
        uint8_t *in = (uint8_t *) _in, *out = (uint8_t *) _out;
        svuint8_t vsrc, vdst;
        svbool_t pred = svwhilelt_b8(idx, left_over);
        do {
            vsrc = svld1(pred, &in[idx]);
            vdst = svld1(pred, &out[idx]);
            vdst = svmax_x(pred, vdst, vsrc);
            svst1(pred, &out[idx], vdst);
            idx += types_per_step;
            pred = svwhilelt_b8(idx, left_over);
        } while (svptest_any(svptrue_b8(), pred));
    }
#endif

#if 1
void ompi_op_aarch64_2buff_max_int8_t_sve (const void *_in, void *_out, int *count) {
        int types_per_step = svcntb();
        size_t idx = 0, left_over = *count;
        int8_t *in = (int8_t *) _in, *out = (int8_t *) _out;
        svint8_t vsrc, vdst;
        svbool_t pred = svwhilelt_b8(idx, left_over);
        do {
            vsrc = svld1(pred, &in[idx]);
            vdst = svld1(pred, &out[idx]);
            vdst = svmax_x(pred, vdst, vsrc);
            svst1(pred, &out[idx], vdst);
            idx += types_per_step;
            pred = svwhilelt_b8(idx, left_over);
        } while (svptest_any(svptrue_b8(), pred));
    }

#endif

and then

$ nvc --version

nvc 24.5-1 linuxarm64 target on aarch64 Linux -tp neoverse-v2 
NVIDIA Compilers and Tools
Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

$ nvc -c test.c
NVC++-S-0000-Internal compiler error. cngtyp problem      42  (test.c: 36)
NVC++/arm Linux 24.5-1: compilation completed with severe errors

FWIW, if the source file contains only one subroutine, compilation works fine, but if it contains two or more subroutines, then the compiler crashes.

@bosilca @jeffhammond can one of you please report this internally to the compiler team?

@jeffhammond
Copy link
Contributor

I reported on Slack. Will escalate if necessary.

@ggouaillardet
Copy link
Contributor Author

Thanks @jeffhammond !

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a fundamental difference with the original code and I really don't care much if it looks like the ARM documentation on SVE. But, if restructuring the code has no performance difference and makes another compiler happy, I'm all for it.

@cparrott73
Copy link

@jeffhammond alerted me to this issue. I will open an internal bug report on it here today. Sorry for any inconvenience caused by this issue.

We have already frozen development for the 24.7 release, but I'll get this into the pipeline to be fixed as soon as possible subsequent to that.

Refactor SVE functions and incidentally make NVIDIA compilers
a happy panda again.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
@ggouaillardet
Copy link
Contributor Author

Performance is basically on par (with some "interesting" variations")

  • LLVM 18.1.8
length	STOCK	REFACTOR
0	0.03	0.03
4	0.04	0.04
8	0.04	0.04
16	0.04	0.04
32	0.04	0.04
64	0.04	0.04
128	0.04	0.04
256	0.04	0.04
512	0.05	0.05
1024	0.06	0.06
2048	0.08	0.09
4096	0.14	0.14
8192	0.26	0.26
16384	0.48	0.48
32768	0.93	0.93
65536	1.83	1.83
131072	3.59	3.59
262144	7.17	7.16
524288	14.39	14.35
1048576	28.58	28.68
2097152	59.41	59.70
4194304	135.84	135.37
  • ARM 24.04
length	STOCK	REFACTOR
0	0.03	0.03
4	0.04	0.04
8	0.04	0.04
16	0.04	0.04
32	0.04	0.04
64	0.04	0.04
128	0.04	0.04
256	0.04	0.04
512	0.05	0.05
1024	0.06	0.06
2048	0.08	0.08
4096	0.13	0.13
8192	0.23	0.23
16384	0.43	0.43
32768	0.83	0.83
65536	1.66	1.66
131072	3.19	3.18
262144	6.34	6.33
524288	12.71	12.71
1048576	25.79	25.73
2097152	60.02	59.74
4194304	137.73	135.04
  • GCC 14.1.0
length	STOCK	REFACTOR
0	0.03	0.03
4	0.04	0.04
8	0.04	0.04
16	0.04	0.04
32	0.04	0.04
64	0.04	0.04
128	0.04	0.04
256	0.04	0.04
512	0.04	0.04
1024	0.05	0.05
2048	0.07	0.07
4096	0.11	0.11
8192	0.20	0.20
16384	0.35	0.35
32768	0.67	0.67
65536	1.63	1.63
131072	3.12	3.12
262144	6.25	6.25
524288	12.57	12.60
1048576	25.56	25.50
2097152	59.33	59.92
4194304	134.40	136.26

@ggouaillardet ggouaillardet marked this pull request as ready for review July 17, 2024 06:52
@bosilca bosilca merged commit 069a8c4 into open-mpi:main Jul 17, 2024
13 checks passed
@cparrott73
Copy link

Just to update - our developer has identified the issue and we are testing a fix in our development branch. This fix seems to have resolved the above issue with SVE intrinsics. We believe it will land in the HPC SDK 24.9 release, but we will keep you posted. (Ping me if I forget to check in here after that.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants