Programming with RISC-V Vector Instructions #6963

Open
guevara opened this issue Aug 7, 2020 · 0 comments
Programming with RISC-V Vector Instructions



https://ift.tt/2DF1jSF






This article compares the two main styles of vector ISAs, discusses a string processing example that is implemented using RISC-V "V" draft version 0.8 (current as of early 2020) vector instructions, and details how to set up a RISC-V "V" development environment under Linux.

The solution to all this is to design a variable length vector instruction set. That way, the instructions are agnostic to the vector register size of a concrete CPU implementation. Thus, the binary code is portable between low-end, mid-range and high-end CPUs, and automatically makes use of wider registers in newer CPUs.

The RISC-V vector extension "V" implements such a vector instruction set. As of early 2020, the RISC-V "V" specification is at version 0.8 and has draft status.

RISC-V "V" adds 32 vector registers, where the first register can be used as mask register and up to 8 registers can be grouped together. The operands of a vector instruction such as vadd.vv are single vector registers or vector register groups.

Since vector registers are of variable length, RISC-V "V" code has to indicate the maximum vector length it wants to work with, e.g.:

vsetvli t0, a2, e8

Meaning that a vector length (vl) of up to a2 8 bit wide (e8) elements is requested while the instruction returns the resulting length in register t0. Thus, if the a2 register is set to - say - 4096, on a CPU with a vector register length (VLEN) of 128 bits, the following vector instructions work on 16 element wide vectors and t0 is thus set to 16, while on a CPU with 512 bit registers the vectors are configured to be 64 elements wide and t0 is set to 64.
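The arithmetic behind those numbers can be sketched in C. This is a minimal model that assumes the simplest behaviour consistent with the description above, i.e. vl = min(requested length, VLMAX) with VLMAX = VLEN / SEW * LMUL. The function name model_vsetvli and its parameters are hypothetical and only serve the illustration; a real implementation may be free to choose differently for some request sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of any toolchain: models the simplest
 * vsetvli behaviour, vl = min(AVL, VLMAX), where
 * VLMAX = VLEN / SEW * LMUL.
 *   avl       - requested number of elements (a2 in the text)
 *   vlen_bits - vector register length of the CPU (VLEN)
 *   sew_bits  - configured element width (8 for e8, 32 for e32, ...)
 *   lmul      - number of grouped registers (1 if no mN is given) */
size_t model_vsetvli(size_t avl, size_t vlen_bits,
                     size_t sew_bits, size_t lmul)
{
    assert(sew_bits > 0 && vlen_bits % sew_bits == 0);
    size_t vlmax = vlen_bits / sew_bits * lmul;
    return avl < vlmax ? avl : vlmax;
}
```

With a request of 4096 elements of 8 bits, this model yields 16 for VLEN = 128 and 64 for VLEN = 512, matching the values given in the text.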

This approach also simplifies loops that iterate over an input array in vector length chunks. For example (where a1 contains the address of an array of a2 times 4 bytes):

.Loop:                        # local symbol name because of .L prefix
    vsetvli t0, a2, e32       # configure vectors of 32 bit elements
    vlw.v   v4, (a1)          # Load t0 elements into v4,
                              # starting at the address stored in a1

    ...                       # work with that chunk

    slli    t1, t0, 2         # shift-left logical, i.e. times 4
    add     a1, a1, t1        # increment src by read elements
    sub     a2, a2, t0        # decrement n
    bnez    a2, .Loop         # branch to loop head if not equal to zero

    ...                       # continue

In cases where a2 isn't a multiple of the maximum vector length, the last iteration sets the vector length to a smaller value and the following vector instructions ignore the unused trailing elements. This implicit masking mechanism is orthogonal to the optional mask operand that is supported by most RISC-V vector instructions.
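To make the control flow of such a loop concrete, here is a scalar C sketch of the same chunking scheme. It is only a model: VLMAX is an assumed constant standing in for whatever the hardware reports, the inner for loop stands in for a single vector instruction, and the function name increment_all is made up for this illustration. Unlike the assembly loop, which tests a2 only at the end, the sketch also tolerates n == 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for illustration: at most VLMAX elements are handled per
 * iteration (e.g. VLEN = 128 bits with 32 bit elements). */
enum { VLMAX = 4 };

/* Adds 1 to each of the n 32 bit elements at p, chunk by chunk, and
 * returns the number of loop iterations that were needed. */
size_t increment_all(uint32_t *p, size_t n)
{
    assert(p != NULL || n == 0);
    size_t iterations = 0;
    while (n != 0) {
        size_t vl = n < VLMAX ? n : VLMAX;  /* vsetvli t0, a2, e32 */
        for (size_t i = 0; i < vl; ++i)     /* stands in for one vector op */
            p[i] += 1;
        p += vl;                            /* slli t1, t0, 2; add a1, a1, t1 */
        n -= vl;                            /* sub a2, a2, t0 */
        ++iterations;
    }
    return iterations;
}
```

For n = 10 the chunks are 4, 4 and 2 elements long, so the last iteration simply runs with a shorter vl and no separate tail loop is needed.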

In contrast to that, with a vector length specific ISA, the main loop usually has to be followed by some finalization code block to explicitly deal with the last elements that don't fill a complete register, e.g.:

const unsigned char *p = inp;
size_t l = n / VECTOR_LENGTH; // number of full vectors (n counts elements)
for (size_t i = 0; i < l; ++i, p += VECTOR_LENGTH * ELEMENT_BYTES) {
    ... // load p into a vector register
    ... // execute some vector instructions
}
// deal with the remaining elements
// e.g. by setting up a mask or work on single elements
for (size_t i = l * VECTOR_LENGTH; i < n; ++i, p += ELEMENT_BYTES) {
    ... // work on the next element located at p
}

To illustrate RISC-V "V" with a real example, this section shows how to implement a vectorized function that converts a string of binary coded decimals (BCD) into an ASCII string. Why BCD to ASCII conversion? The task is complex enough that most of the different vector instructions are used. On the other hand, it's simple enough to fit into a small article and doesn't require domain specific knowledge. It also demonstrates some perhaps not entirely obvious ways in which vector instructions can be used for string processing, where those instructions might be assumed to be useful only for calculations.

With BCD, a byte (8 bits) is divided into two nibbles (4 bits) such that each nibble stores a (hexa-)decimal digit. Note that 4 bits can encode exactly 2^4 = 16 values, so using a nibble just for storing a decimal digit isn't a very efficient encoding.

For the purpose of our example, the exercise is to write vector code that efficiently converts a BCD string such as { 0x12, 0x34, ..., 0xcd, 0xef } to a corresponding ASCII string (e.g. { '1', '2', '3', '4', ..., 'c', 'd', 'e', 'f' }). At a high level, a solution involves separating the nibbles into single bytes and then converting each byte to the matching ASCII value.

The complete example source code is available in my GitHub repository.

Our function has the following signature:

void bcd2ascii(void* dst, void const * src, size_t n);

Meaning that n input bytes are read from src and the conversion writes 2*n bytes into the dst output buffer. Under the RISC-V calling conventions, dst is passed in register a0, src in register a1 and n in register a2.
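As a point of comparison, the intended behaviour can be written down as a plain scalar C function. This is a reference sketch only, not the vectorized implementation discussed in this article; the name bcd2ascii_scalar is chosen here to avoid clashing with the assembly function.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference sketch: reads n bytes from src and writes 2*n ASCII
 * characters to dst, high nibble first. No terminating NUL is written. */
void bcd2ascii_scalar(void *dst, const void *src, size_t n)
{
    static const char digits[] = "0123456789abcdef";
    const unsigned char *in = src;
    char *out = dst;

    assert((dst != NULL && src != NULL) || n == 0);
    for (size_t i = 0; i < n; ++i) {
        out[2 * i]     = digits[in[i] >> 4];    /* high nibble */
        out[2 * i + 1] = digits[in[i] & 0x0f];  /* low nibble */
    }
}
```

Such a function is also handy for cross-checking the output of the vector version on arbitrary inputs.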

.Loop:                        # local symbol name because of .L prefix
    vsetvli a3, a2, e16, m8   # switch to 16 bit element size,
                              # 4 groups of 8 registers
    # --> a3 = min(a2, 8*vlenb/2)
    vlbu.v v16, (a1)          # Load a3 unsigned bytes,
                              # one byte per 16 bit element, zero-extend,
                              # starting at addr stored in a1
    # --> v16 = | 0, a1[vlenb/2-1], ..., 0, a1[1], 0, a1[0] |, ...,
    # v23 = | 0, a1[a3-1], ..., 0, a1[7*vlenb/2] |
    # --> v16 = | ... 00mn 00kl 00ij 00gh |
    add a1, a1, a3            # increment src by read elements
    sub a2, a2, a3            # decrement n

The main loop starts with configuring a vector element size of 16 bit (e16), grouping 8 registers together (m8) and requesting a vector length that equals the number of remaining source bytes or the CPU maximum. With this grouping, each register group is accessed by using a vector register with a number that is divisible by 8. That means v0 identifies the group consisting of v0, v1, ..., v7, v8 identifies v8, ..., v15, etc.

The vl*.v load instruction comes in different variants. Here, the vlbu.v variant zero-extends each input byte into a 16 bit element, which is useful in our example because this directly leaves room for shuffling the nibbles. In other words, it's a widening load and thus saves a separate widening operation such as vwaddu.vx.

That means on CPUs with 256 bit vector registers, this code loads up to 128 input bytes into the v16 register group.

Note that register content in the comments is enclosed in | | and written right to left, starting with the least significant element. Arbitrary nibbles are sometimes denoted by placeholder variables such as g, h, ....

The actual nibble shuffling:

    vsll.vi v24, v16, 8       # shift-left-logical each element by 8 bits
    # --> v24 = | ... mn00 kl00 ij00 gh00 |

    vsrl.vi v16, v16, 4       # shift-right-logical each element by 4 bits
    # --> v16 = | ... 000m 000k 000i 000g |

    slli a3, a3, 1            # shift left logical by immediate,
                              # i.e. to double the number of vector elements
    vsetvli t4, a3, e8, m8    # switch to 8 bit element size,
                              # 4 groups of 8 registers

    vand.vx v24, v24, t2      # and each element with 0x0f,
                              # i.e. zero-out the high nibbles
    # --> v24 = | ... 0n 00 0l 00 0j 00 0h 00 |
    vor.vv v16, v16, v24      # or each element
    # --> v16 = | ... 0n 0m 0l 0k 0j 0i 0h 0g |

So far the example shows most of the syntactic conventions of the "V" ISA. Vector instructions start with v, and a suffix such as .vi, .vx or .vv describes the source operand types, i.e. vector-immediate, vector-scalar and vector-vector.

The bit-shift instructions don't cross element boundaries. Thus, just vector group v24 has to be zero-masked and not v16. The mask is located in register t2 which is set before the loop start.
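The effect of the shuffling on a single 16 bit element can be replayed in scalar C. This is only a sketch of what happens per element: the function name split_nibbles is made up, and the byte-wise masking that the assembly performs after switching to 8 bit elements is modelled here by masking both bytes of the 16 bit value at once with 0x0f0f.

```c
#include <assert.h>
#include <stdint.h>

/* Models one 16 bit element after the widening load (0x00mn) and
 * returns the two separated digits as 0x0n0m. Stored little-endian,
 * the byte 0x0m (the high nibble of the input) comes first in memory. */
uint16_t split_nibbles(uint8_t byte)
{
    uint16_t e  = byte;                  /* vlbu.v:  00mn */
    uint16_t hi = (uint16_t)(e << 8);    /* vsll.vi: mn00 */
    uint16_t lo = (uint16_t)(e >> 4);    /* vsrl.vi: 000m */
    hi &= 0x0f0f;                        /* vand.vx with 0x0f per byte: 0n00 */
    uint16_t result = (uint16_t)(hi | lo); /* vor.vv: 0n0m */
    assert((result & 0xf0f0) == 0);      /* each byte holds a single digit */
    return result;
}
```

For the input byte 0x3c this yields 0x0c03, i.e. the bytes 0x03 and 0x0c in memory order, which are exactly the indices needed for the later table lookup.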

Switching the vector register configuration to 8 bit elements (e8) at this point allows using 0xf as mask value instead of the larger 0xf00. Thus, it fits into the immediate operand of the load immediate instruction (i.e. addi t2,zero,15) such that one additional instruction is saved. It even fits into the immediate operand of the compressed load immediate instruction (i.e. c.li), which encodes into just two bytes instead of the regular four.
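The immediate ranges behind this argument can be checked with a few lines of C. The sketch assumes the usual RISC-V encodings, i.e. a 12 bit signed immediate for addi (and thus a single-instruction li) and a 6 bit signed immediate for the compressed c.li and c.addi; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* addi (and single-instruction li): 12 bit signed immediate */
bool fits_imm12(int32_t v) { return v >= -2048 && v <= 2047; }

/* c.li and c.addi: 6 bit signed immediate */
bool fits_imm6(int32_t v) { return v >= -32 && v <= 31; }

/* The values discussed in the text. */
void check_article_immediates(void)
{
    assert(fits_imm12(0xf) && !fits_imm12(0xf00)); /* 0xf00 needs 2 instructions */
    assert(fits_imm6(0xf));                        /* 0xf works with c.li */
    assert(fits_imm6(-9));                         /* -9 works with c.addi */
}
```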

The final clean result of separated digits is located in vector group v16.

The actual conversion is done in one instruction:

vrgather.vv v24, v8, v16
# --> v24[i] = (v16[i] >= VLMAX) ? 0 : v8[v16[i]]

Here, vector group v8 is used as table to look up the ASCII values. That means the v8 lookup table maps the integers {0, 1, 2, ..., 0xd, 0xe, 0xf } to the ASCII characters { '0', '1', '2', ..., 'd', 'e', 'f' }.
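The semantics given in the comment above translate directly into a scalar loop. The following C sketch models them for byte elements; the function name model_vrgather and its parameter names are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a gather with 8 bit elements:
 * dst[i] = (idx[i] >= vlmax) ? 0 : table[idx[i]] for 0 <= i < vl.
 * table must provide at least vlmax entries. */
void model_vrgather(uint8_t *dst, const uint8_t *table,
                    const uint8_t *idx, size_t vl, size_t vlmax)
{
    assert(vl <= vlmax);
    for (size_t i = 0; i < vl; ++i)
        dst[i] = idx[i] >= vlmax ? 0 : table[idx[i]];
}
```

Since the separated digits are all in the range 0 to 15, every index hits one of the 16 table entries and the out-of-range case never triggers in our example.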

Of course, this lookup table has to be constructed before the loop is entered:

    li a6, 16                 # load immediate (pseudo instruction)
    vsetvli t0, a6, e8, m8    # switch to 8 bit element size,
                              # i.e. 4 groups of 8 registers

    vid.v v8                  # store Vector Element Indices,
                              # i.e. v8 = | 15, ..., 2, 1, 0 |
    vmsgtu.vi v0, v8, 9       # set mask-bit if greater than unsigned immediate
    # --> v0 = | 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |

    li a7, 48                 # load immediate, i.e. '0'
    vadd.vx v8, v8, a7        # add that scalar to each element

    addi a7, a7, -9           # add immediate, i.e. set to 39 == 'a'-'0'-10,
                              # i.e. to arrive at 'a', 'b', ...
    vadd.vx v8, v8, a7, v0.t  # masked add for the additional offset

Configuring a grouping of 8 registers for a vector of 16 elements might look like overkill because 128 bit vector registers are sufficient and should be widely available. On the other hand, there might be a CPU with "V" support that just implements - say - 64 bit vector registers where we would need to group 2 registers. Since a grouping thus may be needed it really doesn't hurt to configure the maximum here.

The v0.t syntax is just a marker that v0 is used as mask. Note that masks always just consist of one vector register, even if register groups are configured. With the current "V" 0.8 draft, the v0 register is the only valid choice for a mask operand.

Similar to before, the value 39 is constructed with addi instead of directly loading it with the pseudo-instruction li into another register because -9 fits into the immediate operand of the compressed c.addi instruction.
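The table construction can be mirrored step by step in scalar C, which makes it easy to verify the offsets 48 and 39. This is a sketch with a hypothetical function name; each loop iteration corresponds to one vector element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fills table with '0'..'9', 'a'..'f' using the same steps as the
 * vector code: element index, mask for indices > 9, add '0', and a
 * masked add of the extra offset 'a' - '0' - 10 == 39. */
void build_hex_table(uint8_t table[16])
{
    assert(table != NULL);
    for (size_t i = 0; i < 16; ++i) {
        uint8_t v = (uint8_t)i;      /* vid.v v8 */
        int mask = v > 9;            /* vmsgtu.vi v0, v8, 9 */
        v = (uint8_t)(v + 48);       /* vadd.vx v8, v8, a7 (a7 == '0') */
        if (mask)
            v = (uint8_t)(v + 39);   /* vadd.vx v8, v8, a7, v0.t (a7 == 39) */
        table[i] = v;
    }
}
```

For index 10 this gives 10 + 48 + 39 = 97, which is the ASCII code of 'a'.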

vsb.v v24, (a0)           # write result to dst
# --> a0[0] = v24[0], a0[1] = v24[1], ..., a0[vl-1] = v24[vlenb-1], ...,
# a0[vlenb*7] = v31[0], ..., a0[t0-1] = v31[vlenb-1]
# --> a0[0..t0-1] = [ 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n' ]
add  a0, a0, t3           # increment dst
bnez a2, .Loop            # branch to loop head if not equal to zero
ret

The loop and the function are left once the complete input buffer has been processed. Note that while the syntax of most RISC-V instructions follows the destination-source order, store instructions have this order inverted.

Since, as of early 2020, the "V" vector extension still has draft status and version 0.8 was released only recently, support for it isn't widely available. That means there is no hardware with a RISC-V "V" CPU available, and some well-known RISC-V emulators such as QEMU either don't support the "V" extension or support only an older version of it. Similarly, support for "V" version 0.8 in the standard development toolchain (binutils, gcc) is available, but not yet upstreamed. This means that one has to hunt down repositories, identify the right branches and compile those with the right flags, instead of just being able to use distro packages.

Another pitfall is that the "V" extension (similar to "F" and "D" floating point extensions) has to be enabled in the running system by setting a status register. Since the status register can only be accessed in machine-/system-mode that means that one also needs kernel support for the "V" extension.

This section details how to build the different components required for a RISC-V "V" 0.8 toolchain and an emulator.

The Spike RISC-V emulator does have "V" version 0.8 support. As of early 2020, there is one other emulator with "V" 0.8 support but it isn't open source.

Building Spike is straightforward:

sudo dnf install dtc  # i.e. device-tree-compiler
git clone https://github.com/riscv/riscv-isa-sim.git --depth 1
cd riscv-isa-sim
mkdir build
cd build
../configure --prefix=$HOME/local/riscvv08/spike
make
make install

Of course, the --depth 1 switch is optional, it just saves some disk space.

Make sure to have a fresh clone that has "V" support fixed.

By default Spike enables the RV64IMAFDC ISAs, but this default can be changed at runtime (or even at configure time), for example by calling spike like this:

spike --isa=RV64IMAFDCV ...
spike --isa=RV64gcV     ...    # equivalent

For executing user-space programs such as our example, spike needs the Proxy-Kernel (pk).

The RISC-V Proxy-Kernel (pk) implements enough to get a user-space program in Spike running, i.e. including setting up some status registers in machine-mode, switching to user-mode and implementing some syscalls. That means that calling the write syscall to write to stdout then just works in Spike and the text is printed to the console.

The pk needs to be cross-compiled with the GNU Toolchain (see previous Section).

git clone --depth 1 https://github.com/riscv/riscv-pk.git
cd riscv-pk
mkdir build
cd build
PATH=$HOME/local/riscvv08/gnu/bin:$PATH ../configure --prefix=$HOME/local/riscvv08/pk \
                                                     --host=riscv64-unknown-elf
PATH=$HOME/local/riscvv08/gnu/bin:$PATH make
PATH=$HOME/local/riscvv08/gnu/bin:$PATH make install

Again make sure to get a recent pk clone with fixed "V" support.

If you already have the GNU Toolchain you can skip this (as it already contains the binutils with "V" support). This is just relevant if you have obtained the Proxy-Kernel with "V" support in binary form and want to skip building the GNU Toolchain.

git clone https://github.com/riscv/riscv-binutils-gdb.git --branch rvv-0.8.x \
          --single-branch --depth 1 riscv-binutils-gdb_rvv-0.8.x
cd riscv-binutils-gdb_rvv-0.8.x
mkdir build
cd build
../configure --prefix=$HOME/local/riscvv08/binutils --target riscv64-unknown-elf \
             --enable-multilib
make
make install

Finally, to actually execute our example, a small test program is needed that calls the bcd2ascii() function with some sample input and prints the results. If the complete GNU toolchain is available the simplest thing is to write that part in C, e.g.:

#include <stddef.h>

void bcd2ascii(void *dst, const void *src, size_t n);

static const unsigned char inp[] = {
    0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef,
    0xfe, 0xdc, 0xba, 0x98, 0x76, 0x54, 0x32, 0x10,
    0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef,
    0xfe, 0xdc, 0xba, 0x98, 0x76, 0x54, 0x32, 0x10
};

#include <stdio.h>

int main()
{
    char out[sizeof inp * 2 + 1] = {0};
    // expected output:
    // out = { '0', '1', '2', '3', ... }

    bcd2ascii(out, inp, sizeof inp);
    puts(out);
    return 0;
}

Everything can then be cross-assembled, cross-compiled and linked with:

~/local/riscvv08/gnu/bin/riscv64-unknown-elf-as -march=rv64gcv -o bcd2ascii.o bcd2ascii.s
~/local/riscvv08/gnu/bin/riscv64-unknown-elf-gcc -Wall  main_bcd2a.c -o bcd2a bcd2ascii.o

Supplying just -march=rv64gv disables the use of compressed instructions.

Alternatively, without a C cross compiler but cross binutils, we need an assembly test program such as:

    .text                     # Start text section
    .balign 4                 # align 4 byte instructions by 4 bytes
    .global _start            # global
_start:
                              # check if vector extension is enabled
                              # user-mode doesn't have privileges to
                              # read mstatus/sstatus/misa CSRs
                              # thus, unclear how to check for V support
    li    t1, 0x1800000       # disable this check for now
    #csrr t1, mstatus # control and status register, i.e. read the
                              # mstatus register
    li    t2, 0b11            # load immediate mask
    slli  t2, t2, 23          # shift left logical immediate by 23 bits
                              # because "V" draft 0.8 defines the vector
                              # context status field VS as mstatus[24:23]
                              # (0b00 -> off, 0b01 -> initial, 0b10 -> clean,
                              # 0b11 -> dirty)
    and   t3, t1, t2
    beqz  t3, v_disabled_error
                              # Prepare calling bcd2ascii()
    addi  sp, sp, -68         # grow stack by 64+4 bytes, some additional
                              # space but keep it 4 byte aligned
    mv    a0, sp              # store output on stack
    lui   a1, %hi(inp)        # load start address of
    addi  a1, a1, %lo(inp)    # the input string
    li    a2, 32              # load immediate: sizeof inp
    call  bcd2ascii           # we don't need to save/restore our
                              # return address because we don't return ...
    li    t0, 0xa             # load immediate: newline
    sb    t0, 64(sp)          # store byte
                              # i.e. terminate output string with '\n'
    li    a0, 1               # stdout
    mv    a1, sp              # read output located on the stack
    li    a2, 65              # i.e. 64+1 characters
    li    a7, 64              # write syscall number
    ecall                     # call write(2)

    li    a0, 0               # set exit status to zero
exit:
    li    a7, 93              # exit syscall number
    ecall                     # call exit(2)
1:
    j     1b                  # loop forever in case exit failed ...

v_disabled_error:
    li    a0, 2               # stderr
    lui   a1, %hi(err_msg)    # load error message start address
    addi  a1, a1, %lo(err_msg)
    lui   a2, %hi(err_msg_size) # load error message size
    addi  a2, a2, %lo(err_msg_size)
    li    a7, 64              # write syscall number
    ecall                     # call write(2)
    li    a0, 1               # load immediate exit argument
    j     exit

    .section .rodata          # Start read-only data section
    .balign 4                 # align to 4 bytes
inp:
    .byte 0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef
    .byte 0xfe, 0xdc, 0xba, 0x98, 0x76, 0x54, 0x32, 0x10
    .byte 0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef
    .byte 0xfe, 0xdc, 0xba, 0x98, 0x76, 0x54, 0x32, 0x10
err_msg:
    .string "ERROR: RISC-V 'V' vector extension is disabled!\n"
    .set err_msg_size, . - err_msg

Cross-assembling and linking everything:

~/local/riscvv08/riscv64-unknown-elf/bin/as -march=rv64gcv -o bcd2ascii.o bcd2ascii.s
~/local/riscvv08/riscv64-unknown-elf/bin/as -march=rv64gcv -o start_bcd2a.o start_bcd2a.s
~/local/riscvv08/riscv64-unknown-elf/bin/ld start_bcd2a.o bcd2ascii.o -o bcd2a

Of course, my repository also contains a makefile to simplify building the example.

Example emulation session:

$ ~/local/riscvv08/spike/bin/spike --isa=RV64gcV \
        ~/local/riscvv08/riscv64-unknown-elf/bin/pk bcd2a
bbl loader
0123456789abcdeffedcba98765432100123456789abcdeffedcba9876543210

Spike also has an interactive mode that allows stepping through the instructions, inspecting registers etc. For example:

$ ~/local/riscvv08/spike/bin/spike -d --isa=RV64gcV \
        ~/local/riscvv08/riscv64-unknown-elf/bin/pk bcd2a
: until pc 0 100e2
bbl loader
: vreg 0 8
VLEN=128 bits; ELEN=32 bits
v8 : [3]: 0x00000000 [2]: 0x020ae6a0 [1]: 0x00000000 [0]: 0x020ae630
:
core 0: 0x00000000000100e2 (0x5208a457) vid.v v8
: vreg 0 8
VLEN=128 bits; ELEN=32 bits
v8 : [3]: 0x0f0e0d0c [2]: 0x0b0a0908 [1]: 0x07060504 [0]: 0x03020100
: q

In comparison with GDB, the interactive prompt is a bit spartan and doesn't really report syntax errors in the interactive commands, but it's sufficient. The help can be displayed with h, <ENTER> steps to the next instruction and q quits it.

The address 100e2 in the above example session comes from the disassembled bcd2a executable (i.e. using objdump).







via Georg's Log

August 7, 2020 at 03:46PM