Enable RVV GEMM/IGEMM 7 x m4 in operator config #6411

bhbruce · 2024-05-14T06:02:39Z

This PR enables the RVV GEMM, IGEMM, and X32-PACKW microkernels in the GEMM config,
which in turn enables the RVV implementation in the operator API.

bhbruce · 2024-05-14T06:10:24Z

@alankelly @fbarchard Could you please review it?
Also, what is the appropriate way to enable RVV-only nr2 selection logic in the following files?

src/operators/convolution-nhwc.c
1653:  const struct xnn_gemm_config* gemm_nr2_config = xnn_init_f32_gemm_nr2_config();
src/operators/dynamic-fully-connected-nc.c
215:  const struct xnn_gemm_config* gemm_nr2_config = xnn_init_f32_gemm_nr2_config();
src/operators/fully-connected-nc.c
754:  const struct xnn_gemm_config* gemm_nr2_config = xnn_init_f32_gemm_nr2_config();
src/operators/deconvolution-nhwc.c
898:  const struct xnn_gemm_config* gemm_nr2_config = xnn_init_f32_gemm_nr2_config();

The current logic uses nr2_config (half nr) if gemm_config->nr > output_channels.
For RVV, I would like to specialize the condition to either

gemm_nr2_config->nr <= output_channels
or
gemm_config->nr / 2 <= output_channels

However, src/operators/ does not use any arch-specific macros.
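The selection conditions discussed above can be compared side by side in a minimal sketch. `struct gemm_cfg` and the three helper functions are hypothetical stand-ins, not XNNPACK definitions, and treating the two proposed conditions as extra requirements on top of the current one is an assumption, since the comment does not say how they would be combined.

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical stand-in for the single field of struct xnn_gemm_config used here.
struct gemm_cfg {
  size_t nr;  // output channels produced per GEMM tile
};

// Condition described in the thread: switch to the nr2 config when the
// default tile is wider than the number of output channels.
static bool use_nr2_current(const struct gemm_cfg* gemm, size_t output_channels) {
  return gemm->nr > output_channels;
}

// Proposed option 1, read as an additional requirement (assumption):
// the nr2 tile itself must fit within output_channels.
static bool use_nr2_option1(const struct gemm_cfg* gemm,
                            const struct gemm_cfg* nr2,
                            size_t output_channels) {
  return use_nr2_current(gemm, output_channels) && nr2->nr <= output_channels;
}

// Proposed option 2, same reading, using half of the default nr instead of
// consulting the nr2 config.
static bool use_nr2_option2(const struct gemm_cfg* gemm, size_t output_channels) {
  return use_nr2_current(gemm, output_channels) && gemm->nr / 2 <= output_channels;
}
```

With a default nr of 64 and an nr2 tile of 32, 40 output channels select nr2 under all three conditions, while 16 output channels select it only under the current one.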

fbarchard · 2024-05-16T09:01:00Z

nr2 is an MRx2 GEMM: 2 floats wide.
On SSE and NEON, which normally use 4 floats per vector, it allows a faster GEMM.
But it is optional. Any GEMM can output an NC of less than a full vector, and on RVV it shouldn't make a difference.

src/configs/gemm-config.c

bhbruce · 2024-05-23T10:30:09Z

Hi @fbarchard @alankelly
Could you help merge this PR?

Signed-off-by: Bruce Lai <bruce.lai@sifive.com>

fbarchard · 2024-09-06T01:47:22Z

Re nr2 - if you didn't have such huge vectors you wouldn't have this problem :-)

nr2 doesn't come up much, and you don't have to specialize for it, especially on RVV.
A regular GEMM can do nr=2; it's just handled as a remainder case.
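To illustrate the remainder case, here is a scalar reference for a single-row tile. It is a sketch only: the function name and the weight layout `w[k * nr + n]` are assumptions made for this example and are not taken from XNNPACK.

```c
#include <stddef.h>

// Scalar reference for a 1 x nr GEMM tile.
// nc is the number of valid output columns left; when nc < nr, only the
// first nc results are computed and stored (the remainder case).
// Assumed (hypothetical) weight layout: w[k * nr + n], padded to nr columns.
static void gemm_1xnr_ref(size_t nr, size_t nc, size_t kc,
                          const float* a, const float* w, float* c) {
  const size_t cols = nc < nr ? nc : nr;
  for (size_t n = 0; n < cols; n++) {
    float acc = 0.0f;
    for (size_t k = 0; k < kc; k++) {
      acc += a[k] * w[k * nr + n];
    }
    c[n] = acc;
  }
}
```

Calling it with nr=4 and nc=2 writes two outputs and leaves the rest of `c` untouched, which is the behavior a wide kernel needs in order to serve a 2-channel output without a dedicated nr2 kernel.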

Add an entry to
static void init_f32_gemm_nr2_config(void) {
with a pack function that can do nr=2, e.g. xnn_x32_packw_gemm_goi_ukernel_x2__scalar_float_u4.
Normally it would be
f32_gemm_nr2_config.nr = 2;
meaning 2 floats. Hmm... I see your issue. You want something like
// nr is set to vlen * 4 / sizeof(float) = 4 * VLENB * 8 / 32 = VLENB
f32_gemm_config.nr = hardware_config->vlenb;
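The arithmetic in that comment works out as follows: a register group of LMUL=4 holds 4 * VLEN bits, and VLEN = 8 * VLENB, so the number of 32-bit floats is 4 * 8 * VLENB / 32 = VLENB. A small sketch of the same calculation, where `rvv_f32_nr` is a hypothetical helper written for this illustration:

```c
#include <stddef.h>

// Number of floats held by a register group of `lmul` vector registers,
// each vlenb bytes wide. For lmul = 4 this equals vlenb.
static size_t rvv_f32_nr(size_t vlenb, size_t lmul) {
  return lmul * vlenb / sizeof(float);
}
```

For example, VLEN=128 (vlenb=16) gives nr=16 at m4, and VLEN=512 (vlenb=64) gives nr=64.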

What if you break from convention and fill in nr=2, meaning 2 floats = 8 bytes,
and implement the GEMM using u1v, which will work most of the time?
You could check, in the gemm-config, that vlenb >= 8.
You could also check if vlenb >= 16, and configure an nr2 GEMM.
But considering how rarely these come up, I'd just do the basic u1v and add a TODO to revisit it.
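The vlenb >= 8 check follows from needing two floats in one register. A minimal sketch of that guard is below. Both helper names are hypothetical, and "u1v" is read here as a single vector register (LMUL=1), which is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

// Floats that fit in one vector register (LMUL = 1).
static size_t f32_lanes_m1(size_t vlenb) {
  return vlenb / sizeof(float);
}

// nr=2 needs two floats (8 bytes) in a single register.
static bool nr2_fits_m1(size_t vlenb) {
  return vlenb >= 2 * sizeof(float);
}
```

A config initializer could consult such a guard before filling in an nr2 entry and leave the entry unset otherwise.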

It's also possible to implement nr=2 GEMMs more efficiently than the obvious way. I did some for NEON, using 4 floats per vector, and for nr=1 you can do 4 floats at a time. If that's possible on RVV, it would likely be faster.
I forget the exact method, but look at 4x2-aarch64-neonfma-ld128.S.in,
which does a trick to load 2 blocks at a time (4 floats) and then a paired add outside the loop.
# Main loop - 4 floats of A (16 bytes)
1:
    LDR q0, [x3], 16
    LD2 {v20.4s, v21.4s}, [x5], 32
    LDR q1, [x11], 16
    LDR q2, [x12], 16
    LDR q3, [x4], 16
    SUBS x0, x0, 16
    FMLA v24.4s, v20.4s, v0.4s
    FMLA v25.4s, v21.4s, v0.4s
    FMLA v26.4s, v20.4s, v1.4s
    FMLA v27.4s, v21.4s, v1.4s
    FMLA v28.4s, v20.4s, v2.4s
    FMLA v29.4s, v21.4s, v2.4s
    FMLA v30.4s, v20.4s, v3.4s
    FMLA v31.4s, v21.4s, v3.4s
    B.HS 1b

    FADDP v24.4s, v24.4s, v25.4s
    FADDP v26.4s, v26.4s, v27.4s
    FADDP v28.4s, v28.4s, v29.4s
    FADDP v30.4s, v30.4s, v31.4s
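One way to read the excerpt: the vector lanes run along K rather than N, each accumulator collects per-lane partial products for one output column, and the horizontal reduction happens once after the loop. The scalar C emulation below sketches that idea for a single row of A. It is not a translation of the kernel: the function name is hypothetical, kc is assumed to be a multiple of 4, the interleaved layout of `b` is an assumption chosen to match what a de-interleaving LD2 load would split into two columns, and the reduction is collapsed into one step instead of staged pairwise adds.

```c
#include <stddef.h>

// Scalar emulation of a 1x2 GEMM that vectorizes along K instead of N.
// Assumptions: kc is a multiple of 4, and b is interleaved as
// [k0n0, k0n1, k1n0, k1n1, ...].
static void gemm_1x2_k4(size_t kc, const float* a, const float* b, float c[2]) {
  float acc0[4] = {0.0f, 0.0f, 0.0f, 0.0f};  // lane partial sums, column 0
  float acc1[4] = {0.0f, 0.0f, 0.0f, 0.0f};  // lane partial sums, column 1
  for (size_t k = 0; k < kc; k += 4) {
    // One "vector" step: 4 floats of A against 8 floats of B.
    for (size_t i = 0; i < 4; i++) {
      acc0[i] += a[k + i] * b[2 * (k + i) + 0];
      acc1[i] += a[k + i] * b[2 * (k + i) + 1];
    }
  }
  // Horizontal reduction, done once outside the loop.
  c[0] = (acc0[0] + acc0[1]) + (acc0[2] + acc0[3]);
  c[1] = (acc1[0] + acc1[1]) + (acc1[2] + acc1[3]);
}
```

The appeal of this shape is that every lane does useful work in the loop even though only two output columns exist. Whether the same structure pays off on RVV depends on the cost of the final reduction, which this sketch does not model.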

fbarchard · 2024-09-06T11:16:07Z

Enable RVV GEMM/IGEMM 7 x m4 has landed in #7035.
You can close this PR and, if needed, add an nr2 enable as a follow-up.

bhbruce · 2024-09-06T13:07:24Z

@fbarchard Thanks for your help.

bhbruce force-pushed the rv-gemm-config branch from 3d65776 to 6bdfc89 Compare May 15, 2024 01:18

fbarchard reviewed May 16, 2024

src/configs/gemm-config.c (outdated, resolved)

fbarchard approved these changes May 16, 2024

fbarchard reviewed May 18, 2024

src/configs/gemm-config.c (outdated, resolved)

fbarchard approved these changes May 18, 2024

bhbruce force-pushed the rv-gemm-config branch 3 times, most recently from 9804699 to 603cff1 Compare May 23, 2024 10:29

fbarchard approved these changes Jun 25, 2024

bhbruce added 3 commits June 26, 2024 03:35

Enable RVV 7 x m4 GEMM & IGEMM in operators

f6c040a

Signed-off-by: Bruce Lai <bruce.lai@sifive.com>

Remove RVV GEMM linear/relu from config

338576d

Signed-off-by: Bruce Lai <bruce.lai@sifive.com>

Remove empty space in x32-packw/rvv.c.in

3b0dc00

Signed-off-by: Bruce Lai <bruce.lai@sifive.com>

bhbruce force-pushed the rv-gemm-config branch from 603cff1 to 3b0dc00 Compare June 26, 2024 10:36

bhbruce closed this Sep 6, 2024
