* test: https://gcc.godbolt.org/z/4Ks98cP1W
```
real_t s311(struct args_t * func_args)
{
    real_t sum = (real_t)0.;
    for (int i = 0; i < LEN_1D; i++) {
        sum += a[i];
    }
    return sum;
}
```
* gcc: uses `whilelo` to recompute the loop predicate on every iteration, which folds the tail loop into the kernel loop
```
.L2:
        ld1w    z31.s, p7/z, [x2, x0, lsl 2]
        add     x0, x0, x3
        fadda   s0, p7, s0, z31.s
        whilelo p7.s, w0, w1
        b.any   .L2
```
* clang: uses an ordinary compare-and-branch (`cmp` / `b.ne`) for the kernel loop body `.LBB0_1`
```
.LBB0_1:                                // =>This Inner Loop Header: Depth=1
        ld1w    { z1.s }, p0/z, [x12, x10, lsl #2]
        add     x10, x10, x9
        cmp     x13, x10
        fadda   s0, p0, s0, z1.s
        b.ne    .LBB0_1
```
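
To make the difference between the two loop shapes concrete, here is a rough scalar model in plain C. It is only a sketch, not what either compiler emits: `VL`, `sum_predicated`, and `sum_main_plus_tail` are hypothetical names, the vector length is fixed at 4 lanes for illustration (real SVE hardware chooses it at run time), and the second function assumes the remainder is handled by a separate scalar tail, which the clang excerpt above does not show.

```c
#include <stddef.h>

/* Hypothetical fixed vector length in lanes; real SVE is length-agnostic. */
#define VL 4

/* Models the whilelo-style loop: each iteration derives a per-lane
   predicate (lane active iff i + lane < n), so the last partial vector
   is processed by the same loop body and no tail loop is needed. */
static float sum_predicated(const float *a, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i += VL) {
        for (size_t lane = 0; lane < VL; lane++) {
            int active = (i + lane) < n;   /* models whilelo */
            if (active)
                sum += a[i + lane];        /* models fadda: strictly in-order add */
        }
    }
    return sum;
}

/* Models a kernel loop with an all-true predicate and a plain
   compare-and-branch, followed by an assumed scalar tail for the
   n % VL leftover elements. */
static float sum_main_plus_tail(const float *a, size_t n)
{
    float sum = 0.0f;
    size_t main_end = n - (n % VL);
    size_t i = 0;
    for (; i != main_end; i += VL)         /* models cmp / b.ne */
        for (size_t lane = 0; lane < VL; lane++)
            sum += a[i + lane];
    for (; i < n; i++)                     /* assumed scalar tail */
        sum += a[i];
    return sum;
}
```

Both functions add the elements in the same order, so they return the same value; the difference is only whether the remainder is absorbed by the predicate or handled by extra code after the loop.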