x86 backend generates unoptimal loop alignment for Skylake and Sandy Bridge processors

When compiling the following LLVM IR file, [test.ll](https://drive.google.com/file/d/1s_ERe3hP-OmoMrVrFELLkYAdfy2MPbWA/view?usp=sharing), we observed a considerable performance gain after changing the loop alignment that LLVM generates inside the function `test`. The default alignment generated by LLVM is 16 bytes (`p2align 4`), as shown in [test.S](https://drive.google.com/file/d/1qRDBV9KHpCZaogN7QI09cDMJDfkwWYI_/view?usp=sharing) at line 34.

On a Skylake machine we got the following performance with the alignment generated by LLVM:
```
Elapsed time: 2.512872 seconds
```

By changing the `p2align 4` at line 34 inside [test.S](https://drive.google.com/file/d/1qRDBV9KHpCZaogN7QI09cDMJDfkwWYI_/view?usp=sharing) to a `p2align 5`, we got:
```
Elapsed time: 2.196517 seconds
```

That's a reduction in run time of about 12.6%!

On another machine, this time a Sandy Bridge, we measured 4.3s with `p2align 4` and 2.5s with `p2align 5`, a reduction of roughly 42%.

The steps for reproducing the issue are:
```
$ llc -o test.S test.ll
$ clang -o test test.S
$ ./test
Elapsed time: 2.512872 seconds
$ # in test.S at line 34 replace `p2align 4` with `p2align 5`
$ clang -o test test.S
$ ./test
Elapsed time: 2.196517 seconds
$ clang --version
16.0.6
```
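The manual edit in the steps above can also be scripted. The following is only a sketch: the helper name `bump_alignment` is hypothetical, and it assumes the directive on the target line is spelled `.p2align 4` (possibly followed by a fill value), which should be checked against the actual test.S.

```python
import re


def bump_alignment(asm_text: str, line_no: int, old: int = 4, new: int = 5) -> str:
    """Rewrite `.p2align <old>` to `.p2align <new>` on one 1-based line.

    Hypothetical helper; assumes the directive is spelled `.p2align`.
    """
    lines = asm_text.splitlines(keepends=True)
    pattern = re.compile(rf"(\.p2align\s+){old}\b")
    # Only touch the requested line so other alignment directives stay as-is.
    patched, count = pattern.subn(rf"\g<1>{new}", lines[line_no - 1], count=1)
    if count == 0:
        raise ValueError(f"no '.p2align {old}' directive on line {line_no}")
    lines[line_no - 1] = patched
    return "".join(lines)


# Example on a made-up three-line snippet (not the real test.S):
sample = "\t.text\n\t.p2align\t4, 0x90\nloop:\n"
print(bump_alignment(sample, line_no=2), end="")
```

For the real file, the same function could be applied to the contents of test.S with `line_no=34` before re-running `clang -o test test.S`.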

This is what the assembly looks like for the `test` function with `p2align 4`:
```asm
00000000000011b0 <test>:
    11b0:       48 85 f6                test   %rsi,%rsi
    11b3:       7e 13                   jle    11c8 <test+0x18>
    11b5:       48 8d 46 ff             lea    -0x1(%rsi),%rax
    11b9:       39 54 b7 fc             cmp    %edx,-0x4(%rdi,%rsi,4)
    11bd:       48 89 c6                mov    %rax,%rsi
    11c0:       75 ee                   jne    11b0 <test>
    11c2:       b8 01 00 00 00          mov    $0x1,%eax
    11c7:       c3                      ret
    11c8:       b8 ff ff ff ff          mov    $0xffffffff,%eax
    11cd:       c3                      ret
```

And this is how it looks with p2align5:
```asm
00000000000011d0 <test>:
    11d0:       66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
    11d7:       00 00 00
    11da:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
    11e0:       48 85 f6                test   %rsi,%rsi
    11e3:       7e 13                   jle    11f8 <test+0x28>
    11e5:       48 8d 46 ff             lea    -0x1(%rsi),%rax
    11e9:       39 54 b7 fc             cmp    %edx,-0x4(%rdi,%rsi,4)
    11ed:       48 89 c6                mov    %rax,%rsi
    11f0:       75 ee                   jne    11e0 <test+0x10>
    11f2:       b8 01 00 00 00          mov    $0x1,%eax
    11f7:       c3                      ret
    11f8:       b8 ff ff ff ff          mov    $0xffffffff,%eax
    11fd:       c3                      ret
```

Notice that the loop is 18 bytes long in both cases (from the `test` instruction through the `jne` instruction); the only difference is the loop's start address. With `p2align 4` the loop starts at 0x11b0 (16-byte aligned), and with `p2align 5` it starts at 0x11e0 (32-byte aligned).
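To make the effect of the start address concrete, here is a small sketch that counts how many aligned 32-byte windows the loop body overlaps in each layout. The addresses come from the two disassembly listings above; the function name `windows_touched` is just an illustrative choice, and the 32-byte window size is the granularity the linked thread talks about.

```python
def windows_touched(start: int, size: int, window: int = 32) -> int:
    """Number of aligned `window`-byte blocks overlapped by [start, start + size)."""
    first = start // window
    last = (start + size - 1) // window
    return last - first + 1


# Loop body spans from the `test` instruction up to the byte after `jne`.
LOOP_SIZE = 0x11C2 - 0x11B0  # 18 bytes, same in both listings

for label, start in (("p2align 4", 0x11B0), ("p2align 5", 0x11E0)):
    print(
        f"{label}: start={start:#x}, "
        f"offset in 32-byte window={start % 32}, "
        f"windows touched={windows_touched(start, LOOP_SIZE)}"
    )
```

With `p2align 4` the loop starts 16 bytes into a window and spills into the next one; with `p2align 5` it fits entirely inside a single window.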

There was a [similar issue](https://lists.llvm.org/pipermail/llvm-dev/2021-January/148177.html) raised a few years ago, in which Maxim Kazantsev thought that the workload was bound by decoding, so we gathered the same perf counters as he did:
```
p2align4:
     3,181,514,079      idq.all_dsb_cycles_4_uops ( +-  0.21% )  (29.99%)
     6,271,689,140      idq.all_dsb_cycles_any_uops ( +-  0.23% )  (30.08%)
     6,306,733,259      idq.dsb_cycles ( +-  0.22% )  (30.08%)
    16,393,899,425      idq.dsb_uops ( +-  0.12% )  (30.13%)
```
```
p2align5:
     3,194,348,681      idq.all_dsb_cycles_4_uops ( +-  0.18% )  (29.93%)
     3,367,347,174      idq.all_dsb_cycles_any_uops ( +-  0.15% )  (30.03%)
     3,361,525,370      idq.dsb_cycles ( +-  0.14% )  (30.13%)
    16,302,023,646      idq.dsb_uops ( +-  0.12% )  (30.15%)
```

Notice that `p2align 5` delivers slightly more 4-uop batches than `p2align 4` (about 3.19G vs. 3.18G cycles), while the number of cycles spent in the DSB drops by almost half (`idq.dsb_cycles` falls from about 6.31G to 3.36G).
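As a rough way to compare the two runs, the counters above can be turned into a uops-per-DSB-cycle ratio. This is only a back-of-the-envelope sketch: the values are copied from the perf output above, the counters were multiplexed (each was sampled about 30% of the time), and the dictionary layout is just for illustration.

```python
# Counter values copied from the perf output above.
counters = {
    "p2align4": {"idq.dsb_cycles": 6_306_733_259, "idq.dsb_uops": 16_393_899_425},
    "p2align5": {"idq.dsb_cycles": 3_361_525_370, "idq.dsb_uops": 16_302_023_646},
}


def uops_per_dsb_cycle(run: dict) -> float:
    """Average number of uops delivered per cycle in which the DSB was active."""
    return run["idq.dsb_uops"] / run["idq.dsb_cycles"]


for name, run in counters.items():
    print(f"{name}: {uops_per_dsb_cycle(run):.2f} uops per DSB cycle")

# How many times more DSB cycles the 16-byte-aligned loop needs for
# roughly the same number of uops.
cycle_ratio = (
    counters["p2align4"]["idq.dsb_cycles"] / counters["p2align5"]["idq.dsb_cycles"]
)
print(f"DSB cycle ratio (p2align4 / p2align5): {cycle_ratio:.2f}")
```

The number of uops delivered is nearly identical in both runs, so the difference shows up almost entirely as DSB cycles needed to deliver them.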

In the linked thread, Maxim proposed the following heuristic for this problem:
```
Align loops by 32 if:
  *   They are innermost;
  *   Size of loop mod 32 is between 16 and 31 (only in this case alignment by 32 will strictly reduce the number of 32 window crossings by 1);
  *   (Optional) The loop is small, e.g. less than 32 bytes;
  *   (Optional) We could make even sharper checks trying to ensure that all other conditions of DSB max utilization are met (may be very complex!)
```
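The first three conditions of the quoted proposal can be sketched as a simple predicate. This is not LLVM code and not an existing API; the function name and parameters are hypothetical, and the last (open-ended) condition about other DSB constraints is left out.

```python
def should_align_to_32(
    is_innermost: bool, loop_size: int, require_small: bool = False
) -> bool:
    """Illustrative predicate for the quoted heuristic (hypothetical, not LLVM's)."""
    # Condition 1: only innermost loops are considered.
    if not is_innermost:
        return False
    # Condition 2: size mod 32 must be in [16, 31]; only then does 32-byte
    # alignment strictly remove one 32-byte window crossing.
    if not 16 <= loop_size % 32 <= 31:
        return False
    # Condition 3 (optional): restrict to loops smaller than 32 bytes.
    if require_small and loop_size >= 32:
        return False
    return True


# The loop from this issue is 18 bytes long and innermost.
print(should_align_to_32(is_innermost=True, loop_size=18, require_small=True))
```

Under these assumptions, the 18-byte loop in `test` satisfies all three conditions, which matches the observation that 32-byte alignment helps it.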

cc: @nunoplopes 

x86 backend generates unoptimal loop alignment for Skylake and Sandy Bridge processors #89937
