
x86 backend generates unoptimal loop alignment for Skylake and Sandy Bridge processors #89937

@lucic71

When compiling the attached LLVM IR file (test.ll), we observed a considerable performance gain after changing the loop alignment that LLVM generates inside the function called test. The default alignment generated by LLVM is 16 bytes (p2align 4), as shown in test.S at line 34.

On a Skylake machine we got the following performance with the alignment generated by LLVM:

Elapsed time: 2.512872 seconds

By changing the p2align 4 at line 34 inside test.S to a p2align 5, we got:

Elapsed time: 2.196517 seconds

That's about 12.6% less run time (roughly a 1.14x speedup)!

On another machine, this time Sandy Bridge, we saw the following numbers: 4.3s for p2align4 and 2.5s for p2align5.
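To make the comparison between the two machines explicit, here is a small sketch that turns the timings quoted above into a run-time reduction and a speedup factor. The helper name `runtime_reduction` is made up for illustration; the only inputs are the four timings reported in this issue.

```python
def runtime_reduction(before_s, after_s):
    """Return (fractional reduction in run time, speedup factor)."""
    return (before_s - after_s) / before_s, before_s / after_s


# Timings reported in this issue: (p2align 4, p2align 5), in seconds.
measurements = {
    "Skylake": (2.512872, 2.196517),
    "Sandy Bridge": (4.3, 2.5),
}

for machine, (p2align4_s, p2align5_s) in measurements.items():
    reduction, speedup = runtime_reduction(p2align4_s, p2align5_s)
    print(f"{machine}: {reduction:.1%} less time, {speedup:.2f}x speedup")
```

With these inputs the reduction comes out at roughly 12.6% on Skylake and roughly 41.9% on Sandy Bridge.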

The steps for reproducing the issue are:

$ llc -o test.S test.ll
$ clang -o test test.S
$ ./test
Elapsed time: 2.512872 seconds
$ # in test.S at line 34, replace `p2align 4` with `p2align 5`
$ clang -o test test.S
$ ./test
Elapsed time: 2.196517 seconds
$ clang --version
16.0.6
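The manual edit in the steps above could also be scripted. The following is only a sketch: the helper `bump_p2align` is hypothetical, and it assumes the directive on that line looks like `.p2align 4, 0x90` (the exact text of line 34 in test.S is not shown in this issue). It runs on an in-memory sample rather than on the real file.

```python
import re


def bump_p2align(asm_text, line_no, old=4, new=5):
    """Rewrite the p2align exponent on one 1-based line of an assembly listing."""
    lines = asm_text.splitlines(keepends=True)
    # Assumed directive shape: ".p2align <old>" optionally followed by a fill value.
    pattern = re.compile(rf"(\.?p2align\s+){old}\b")
    patched, count = pattern.subn(rf"\g<1>{new}", lines[line_no - 1], count=1)
    if count == 0:
        raise ValueError(f"no 'p2align {old}' directive on line {line_no}")
    lines[line_no - 1] = patched
    return "".join(lines)


# Stand-in for test.S; the real file would be read from and written back to disk.
sample = "test:\n\t.p2align\t4, 0x90\n.LBB0_1:\n"
print(bump_p2align(sample, line_no=2), end="")
```

For the real file, the same function would be called with `line_no=34` on the contents of test.S before rebuilding with clang.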

This is what the assembly looks like for the test function with p2align4:

00000000000011b0 <test>:
    11b0:       48 85 f6                test   %rsi,%rsi
    11b3:       7e 13                   jle    11c8 <test+0x18>
    11b5:       48 8d 46 ff             lea    -0x1(%rsi),%rax
    11b9:       39 54 b7 fc             cmp    %edx,-0x4(%rdi,%rsi,4)
    11bd:       48 89 c6                mov    %rax,%rsi
    11c0:       75 ee                   jne    11b0 <test>
    11c2:       b8 01 00 00 00          mov    $0x1,%eax
    11c7:       c3                      ret
    11c8:       b8 ff ff ff ff          mov    $0xffffffff,%eax
    11cd:       c3                      ret

And this is how it looks with p2align5:

00000000000011d0 <test>:
    11d0:       66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
    11d7:       00 00 00
    11da:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
    11e0:       48 85 f6                test   %rsi,%rsi
    11e3:       7e 13                   jle    11f8 <test+0x28>
    11e5:       48 8d 46 ff             lea    -0x1(%rsi),%rax
    11e9:       39 54 b7 fc             cmp    %edx,-0x4(%rdi,%rsi,4)
    11ed:       48 89 c6                mov    %rax,%rsi
    11f0:       75 ee                   jne    11e0 <test+0x10>
    11f2:       b8 01 00 00 00          mov    $0x1,%eax
    11f7:       c3                      ret
    11f8:       b8 ff ff ff ff          mov    $0xffffffff,%eax
    11fd:       c3                      ret

Notice that the loop is 18 bytes long in both cases (from the test instruction through the jne instruction); the only thing that differs is the start address of the loop. For p2align4 the start address is 11b0 (16-byte aligned) and for p2align5 the start address is 11e0 (32-byte aligned).
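The difference between the two layouts can be checked with a little address arithmetic. This sketch counts how many aligned 32-byte windows the loop body overlaps for each start address taken from the two listings. It assumes that 32-byte windows are the relevant granularity, which is what the proposal quoted further down in this issue works with; the function name `windows_touched` is made up for illustration.

```python
def windows_touched(start, size, window=32):
    """Number of aligned `window`-byte blocks overlapped by [start, start + size)."""
    first = start // window
    last = (start + size - 1) // window
    return last - first + 1


# From the listings: the loop runs from the test instruction through the jne.
LOOP_SIZE = 0x11C2 - 0x11B0  # 18 bytes

for label, start in (("p2align4", 0x11B0), ("p2align5", 0x11E0)):
    print(
        f"{label}: start={start:#x}, start % 32 = {start % 32}, "
        f"32-byte windows touched = {windows_touched(start, LOOP_SIZE)}"
    )
```

With the addresses above, the p2align4 loop starts at offset 16 within its window and spills into the next one (two windows), while the p2align5 loop fits in a single window.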

There was a similar issue raised a few years ago in which Maxim Kazantsev suggested that this workload is decode-bound, so we gathered the same perf counters as he did:

p2align4:
     3,181,514,079      idq.all_dsb_cycles_4_uops ( +-  0.21% )  (29.99%)
     6,271,689,140      idq.all_dsb_cycles_any_uops ( +-  0.23% )  (30.08%)
     6,306,733,259      idq.dsb_cycles ( +-  0.22% )  (30.08%)
    16,393,899,425      idq.dsb_uops ( +-  0.12% )  (30.13%)
p2align5:
     3,194,348,681      idq.all_dsb_cycles_4_uops ( +-  0.18% )  (29.93%)
     3,367,347,174      idq.all_dsb_cycles_any_uops ( +-  0.15% )  (30.03%)
     3,361,525,370      idq.dsb_cycles ( +-  0.14% )  (30.13%)
    16,302,023,646      idq.dsb_uops ( +-  0.12% )  (30.15%)

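Notice that with p2align5 almost every cycle in which the DSB delivers uops delivers a 4-uop batch (about 95% of those cycles, versus about 51% for p2align4), and the number of cycles spent in the DSB drops from about 6.3 billion to about 3.4 billion, while the number of uops delivered stays roughly the same.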
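The counters are easier to compare as ratios. The sketch below derives two figures from the perf output above: the fraction of DSB-delivering cycles that deliver a 4-uop batch, and the average number of DSB uops per delivering cycle. The helper name `dsb_summary` is made up for illustration, and since each counter was only sampled about 30% of the time (the last column), the inputs are perf's scaled estimates rather than exact counts.

```python
def dsb_summary(cycles_4_uops, cycles_any_uops, dsb_uops):
    """Return (share of delivering cycles with a 4-uop batch, uops per delivering cycle)."""
    return cycles_4_uops / cycles_any_uops, dsb_uops / cycles_any_uops


# Values copied from the perf output in this issue.
counters = {
    "p2align4": (3_181_514_079, 6_271_689_140, 16_393_899_425),
    "p2align5": (3_194_348_681, 3_367_347_174, 16_302_023_646),
}

for label, values in counters.items():
    share_4_uops, uops_per_cycle = dsb_summary(*values)
    print(f"{label}: {share_4_uops:.1%} of DSB cycles deliver 4 uops, "
          f"{uops_per_cycle:.2f} uops per DSB cycle")
```

On these numbers the 4-uop share goes from roughly 51% to roughly 95%, and the average goes from about 2.6 to about 4.8 uops per delivering cycle.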

In the linked issue, Maxim proposed some interesting solutions for solving this problem:

Align loops by 32 if:
  *   They are innermost;
  *   The loop size mod 32 is between 16 and 31 (only in this case will alignment by 32 strictly reduce the number of 32-byte window crossings by 1);
  *   (Optional) The loop is small, e.g. less than 32 bytes;
  *   (Optional) We could make even sharper checks trying to ensure that all other conditions of DSB max utilization are met (may be very complex!)
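The first three conditions in the list above could be expressed as a simple predicate. This is a sketch in Python, not LLVM code: the function name, its parameters, and the `require_small` switch are made up for illustration, the thresholds are copied from the list as written without trying to validate them, and the fourth (DSB utilization) condition is not modeled at all.

```python
def should_align_to_32(is_innermost, loop_size, require_small=False):
    """Apply the proposed rule: align a loop to 32 bytes only when it is likely to pay off.

    is_innermost  -- True if the loop contains no nested loop
    loop_size     -- size of the loop body in bytes
    require_small -- also apply the optional "loop is smaller than 32 bytes" check
    """
    if not is_innermost:
        return False
    if not 16 <= loop_size % 32 <= 31:
        return False
    if require_small and loop_size >= 32:
        return False
    return True


# The loop from this issue: innermost and 18 bytes long.
print(should_align_to_32(is_innermost=True, loop_size=18))                      # True
print(should_align_to_32(is_innermost=True, loop_size=18, require_small=True))  # True
# A 40-byte loop has size mod 32 == 8, so the rule leaves its alignment alone.
print(should_align_to_32(is_innermost=True, loop_size=40))                      # False
```

Under this rule the 18-byte loop in the test function would be selected for 32-byte alignment, which matches the manual change that produced the speedup.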

cc: @nunoplopes
