
[RISCV] Missing opportunities to optimize RVV instructions #80392

Open
wangpc-pp opened this issue Feb 2, 2024 · 7 comments

@wangpc-pp
Contributor

At the SelectionDAG level, we have several code paths that generate RVV pseudos:

  1. RVV intrinsics -> RVV pseudos.
  2. ISD nodes -> RVV pseudos.
  3. RISCVISD nodes -> RVV pseudos.
  4. RVV intrinsics -> RISCVISD nodes -> RVV pseudos.
  5. ISD nodes -> RISCVISD nodes -> RVV pseudos.
  6. etc.

Most of the optimizations for RVV are based on RISCVISD nodes, so we may miss opportunities to optimize some code.
For example (https://godbolt.org/z/f1jWEfhG7):

vuint8m1_t dup(uint8_t* data) {
    return __riscv_vmv_v_x_u8m1(*data, __riscv_vsetvlmax_e8m1());
}

vuint8m1_t dup2(uint8_t* data) {
    return __riscv_vlse8_v_u8m1(data, 0, __riscv_vsetvlmax_e8m1());
}
dup:
        vsetvli a1, zero, e8, m1, ta, ma
        vlse8.v v8, (a0), zero
        ret
dup2:
        vsetvli a1, zero, e8, m1, ta, ma
        vlse8.v v8, (a0), zero
        ret

These two snippets produce the same assembly because we lower the vmv.v.x intrinsic to RISCVISD::VMV_V_X first, and can then optimize it to a zero-strided load if profitable.
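For reference, here is a much-simplified sketch of what that fold looks like conceptually during instruction selection. This is illustrative only, not a verbatim excerpt of RISCVISelDAGToDAG.cpp; the in-tree node is spelled RISCVISD::VMV_V_X_VL, the helper name is invented, and the operand layout is simplified:

// Hypothetical sketch, inside the RISC-V backend (assumes the usual
// RISCVISelLowering.h / RISCVSubtarget.h definitions).
static bool shouldFoldSplatOfLoad(SDNode *Node,
                                  const RISCVSubtarget &Subtarget) {
  // Because the intrinsic is lowered to a RISCVISD node first, selection
  // can look through the splat and see the scalar load feeding it.
  if (Node->getOpcode() != RISCVISD::VMV_V_X_VL)
    return false;
  // Operand layout simplified: take the splatted scalar operand.
  SDValue Scalar = Node->getOperand(1);
  auto *Ld = dyn_cast<LoadSDNode>(Scalar);
  // Indexed loads also produce an updated pointer; leave those alone.
  if (!Ld || Ld->isIndexed())
    return false;
  // Only fold when the core implements zero-strided loads efficiently
  // (the HasOptimizedZeroStrideLoad subtarget feature discussed below).
  return Subtarget.hasOptimizedZeroStrideLoad();
}

When a predicate like this holds, selection can emit a vlse with x0 as the stride instead of a scalar load plus vmv.v.x, which is exactly the vlse8.v ... zero seen above.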
But this is not the case for other intrinsics:

vuint16m2_t vadd(vuint16m2_t a, vuint8m1_t b) {
    int vl = __riscv_vsetvlmax_e8m1();
    vuint16m2_t c = __riscv_vzext_vf2_u16m2(b, vl);
    return __riscv_vadd_vv_u16m2(a, c, vl);
}

vuint16m2_t vwaddu(vuint16m2_t a, vuint8m1_t b) {
    return __riscv_vwaddu_wv_u16m2(a, b, __riscv_vsetvlmax_e16m2());
}
vadd:
        vsetvli a0, zero, e16, m2, ta, ma
        vzext.vf2       v12, v10
        vadd.vv v8, v8, v12
        ret
vwaddu:
        vsetvli a0, zero, e8, m1, ta, ma
        vwaddu.wv       v8, v8, v10
        ret

We can't optimize vzext.vf2+vadd.vv into vwaddu.wv because we lower these intrinsics directly to RVV pseudos.
Of course, the same problem exists for the ISD->RVV pseudos path:

typedef vuint8m1_t v16xi8 __attribute__((riscv_rvv_vector_bits(__riscv_v_fixed_vlen)));
typedef vuint16m2_t v16xi32 __attribute__((riscv_rvv_vector_bits(__riscv_v_fixed_vlen * 2)));

v16xi32 add(v16xi32 a, v16xi8 b) {
    v16xi32 c = __riscv_vzext_vf2_u16m2(b, 16);
    return a + c;
}
add:
        vsetivli        zero, 16, e16, m2, ta, ma
        vzext.vf2       v12, v10
        vadd.vv v8, v12, v8
        ret

I think we need a universal representation (RISCVISD?) on which to do these optimizations. But when GlobalISel is supported, will we need to redo all the optimizations on generic MIR? Or should we move all optimizations to later MIR passes?
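To make that concrete: if the vzext/vadd intrinsics were first lowered to RISCVISD VL nodes, a combine along these lines could form vwaddu.wv. This is a hypothetical sketch, not in-tree code; the node names exist in RISCVISelLowering.h, but the helper name is invented and the operand layout of the binary _VL nodes is simplified to (LHS, RHS, Passthru, Mask, VL):

// Hypothetical sketch: fold ADD_VL(a, VZEXT_VL(b)) -> VWADDU_W_VL(a, b),
// i.e. vzext.vf2 + vadd.vv -> vwaddu.wv, over RISCVISD VL nodes.
static SDValue combineAddVLToVWAdduWV(SDNode *N, SelectionDAG &DAG) {
  assert(N->getOpcode() == RISCVISD::ADD_VL && "expected ADD_VL");
  SDValue LHS = N->getOperand(0), RHS = N->getOperand(1);
  SDValue Passthru = N->getOperand(2);
  SDValue Mask = N->getOperand(3), VL = N->getOperand(4);

  // Canonicalize the zero-extended operand to RHS.
  if (RHS.getOpcode() != RISCVISD::VZEXT_VL)
    std::swap(LHS, RHS);
  if (RHS.getOpcode() != RISCVISD::VZEXT_VL || !RHS.hasOneUse())
    return SDValue();

  SDValue Narrow = RHS.getOperand(0);
  MVT WideVT = N->getSimpleValueType(0);
  // Only an SEW-doubling extend (vzext.vf2) matches vwaddu.wv.
  if (WideVT.getScalarSizeInBits() !=
      2 * Narrow.getSimpleValueType().getScalarSizeInBits())
    return SDValue();
  // The extend and the add must agree on mask and VL (simplified check).
  if (RHS.getOperand(1) != Mask || RHS.getOperand(2) != VL)
    return SDValue();

  return DAG.getNode(RISCVISD::VWADDU_W_VL, SDLoc(N), WideVT,
                     {LHS, Narrow, Passthru, Mask, VL});
}

Today a pattern like this can never fire on the intrinsic path, because the RISCVISD nodes it would match are never created.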


@topperc
Collaborator

topperc commented Feb 2, 2024

The last example can be optimized by using ISD nodes throughout instead of mixing in intrinsics.

v16xi32 add(v16xi32 a, v16xi8 b) {
    v16xi32 c = __builtin_convertvector(b, v16xi32);
    return a + c;
}
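(With __builtin_convertvector the extension is expressed as a generic ISD::ZERO_EXTEND rather than an opaque intrinsic call, so the existing DAG combines presumably get a chance to form vwaddu.wv.)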

@lukel97
Contributor

lukel97 commented Feb 2, 2024

But when GlobalISel is supported, will we need to redo all the optimizations on generic MIR? Or should we move all optimizations to later MIR passes?

RISCVFoldMasks and #71764 are an effort to move some of the SelectionDAG code out into MIR passes.

We can't optimize vzext.vf2+vadd.vv into vwaddu.wv because we lower these intrinsics directly to RVV pseudos.

I can't remember where I first heard this argument, but I think there was a question as to whether intrinsics should be optimised away at all, since there might be an expectation that if the user writes __riscv_vzext_vf2_u16m2 then there should be a vzext.vf2 in the resulting code.

@wangpc-pp
Contributor Author

The last example can be optimized by using ISD nodes throughout instead of mixing in intrinsics.

v16xi32 add(v16xi32 a, v16xi8 b) {
    v16xi32 c = __builtin_convertvector(b, v16xi32);
    return a + c;
}

Thanks! I think my not knowing about this just illustrates the potential for missed optimizations. 😄

@wangpc-pp
Contributor Author

But when GlobalISel is supported, will we need to redo all the optimizations on generic MIR? Or should we move all optimizations to later MIR passes?

RISCVFoldMasks and #71764 are an effort to move some of the SelectionDAG code out into MIR passes.

Yeah! Thanks for mentioning this work!

We can't optimize vzext.vf2+vadd.vv into vwaddu.wv because we lower these intrinsics directly to RVV pseudos.

I can't remember where I first heard this argument, but I think there was a question as to whether intrinsics should be optimised away at all, since there might be an expectation that if the user writes __riscv_vzext_vf2_u16m2 then there should be a vzext.vf2 in the resulting code.

As my example shows, we have already broken this convention for the vmv.v.x intrinsic.

@topperc
Collaborator

topperc commented Feb 2, 2024

Not directly related to this, but I'm not sure HasOptimizedZeroStrideLoad should default to true.

@wangpc-pp
Contributor Author

Not directly related to this, but I'm not sure HasOptimizedZeroStrideLoad should default to true.

See https://reviews.llvm.org/D137699. cc @preames
