Compiler ABI

Jeff Bush edited this page Apr 10, 2018 · 61 revisions

An LLVM backend targets this architecture and supports scalar and vector operations. The LLVM infrastructure can serve as a backend for any language, but the focus is currently on the clang C/C++ compiler. The port is based on LLVM trunk and supports many recent language features, up to C++1z.

This backend does not support everything: there is not a C++ standard library port yet.

The compiler defines the preprocessor macro __NYUZI__.

The toolchain is installed by default in /usr/local/llvm-nyuzi. The tools are in the bin/ directory:

  • clang/clang++: C/C++ compiler with integrated assembler
  • ld.lld: LLD linker (this symlink invokes the 'ld' flavor of LLD)
  • lldb: Symbolic debugger
  • elf2hex: Converts ELF executables into a format that can be run in simulator/emulator/FPGA
  • llvm-ar: LLVM version of ar for creating static libraries
  • llvm-objdump: Object dump utility, useful for seeing the assembly listing of a generated file

Inline assembly

Inline assembler statements are available using GCC's extended assembler syntax. The 'v' constraint is used for vector operands and 'r' for scalar operands, for example:

asm("store_v %0, (%1)" : : "v" (value), "r" (address));

Vector Support

The compiler supports vector types using GCC syntax. The 'ext_vector_type' attribute indicates vector types:

typedef int veci16_t __attribute__((ext_vector_type(16)));
typedef unsigned int vecu16_t __attribute__((ext_vector_type(16)));
typedef float vecf16_t __attribute__((ext_vector_type(16)));

Vectors are first class types that can be local variables, global variables, parameters, or struct/class members. The compiler uses registers to store these wherever possible.

If a vector is a member of a structure, that structure must be aligned on a 64 byte boundary. The compiler automatically aligns vector members within structures, stack allocated local variables, and global variables. However, if a structure is heap allocated, the heap implementation must align it (this is not the default behavior of most implementations).

Standard arithmetic operators are available for vector operations. For example, to add two vectors:

    veci16_t foo;
    veci16_t bar;
    veci16_t baz;
    foo = bar + baz;

Individual elements of a vector are set/read using the array operator. These compile to the getlane instruction and mask register moves.

    veci16_t foo;
    int total = 0;
    for (int i = 0; i < 16; i++)
    {
        total += foo[i];
        foo[i] += i;
    }

Vectors can be initialized using curly bracket syntax. If the members are constant, this is loaded from the constant pool.

  const veci16_t steps = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };

You can also use non-constant members, in which case the compiler will generate a series of masked moves to load the vector.

  int a, b, c, d;
  veci16_t values = { a, b, c, d, a, b, c, d, a, b, c, d, a, b, c, d };

Scalar and vector values can be mixed:

    veci16_t foo;
    int bar;
    veci16_t baz;

    foo = baz + bar;

The backend recognizes when it can use mixed vector/scalar instructions. For example:

    add_i v0, v0, s0

In some situations, you may need to cast a scalar type to widen it:

   void somefunc(vecu16_t f);

   somefunc((vecu16_t) 12);

Floating point conversions are a little weird. Consider the following code:

    veci16_t i;
    vecf16_t f;

    f = (vecf16_t) i;

If these were scalar types, this would cause an integer to floating point conversion (eg. 1 would become 1.0). However, because they are vectors it does a bitcast instead. This is standard GCC behavior. Use __builtin_convertvector to convert the type:

    vecu16_t a;
    vecf16_t b = __builtin_convertvector(a, vecf16_t);

The GCC syntax supports vector comparisons that result in another vector type. For example:

    veci16_t a, b, c;
    a = b > c;

The instruction set does not natively support this. Comparisons set bitmasks in scalar registers. The compiler emulates the former behavior using masked move instructions. Builtins support native bitmask comparisons (f is for floats, i is for ints), for example:

    veci16_t b, c;
    uint32_t a = __builtin_nyuzi_mask_cmpi_sgt(b, c);  // Signed greater than

Two flexible compiler builtins support predicated instructions: __builtin_nyuzi_vector_mixf and __builtin_nyuzi_vector_mixi. Each takes a mask and two vectors. Each of the low 16 bits in the mask selects whether the corresponding vector lane comes from the first parameter or the second: a one bit pulls from the first, a zero from the second. These builtins don't necessarily emit mix instructions; the compiler inserts predicated instructions where possible. For example:

   vecf16_t a = __builtin_nyuzi_vector_mixf(mask, a + b, a);

Generates a single instruction:

   add_f_mask v0, s0, v0, v1

The __builtin_nyuzi_shuffle builtins rearrange vector lanes and map to the shuffle instruction. They take two vector parameters: the first is a source vector and the second is a set of indices (0-15) into the first. For shuffles where the indices are hardcoded, the standard __builtin_shufflevector can be used instead. It provides more flexibility to mix and shuffle two vectors, and the backend will emit optimized code depending on the indices.

While the LLVM toolchain supports auto-vectorization, the backend for this processor doesn't. The focus is on explicit vectorization in code.

Built-in functions

Read and write control registers:

int __builtin_nyuzi_read_control_reg(int index);
void __builtin_nyuzi_write_control_reg(int index, int value);

Each of the 16 bits of 'mask' corresponds to a vector lane. A 1 bit in the lane selects from vector 'a', a zero from vector 'b' (see section above for notes about code generation):

veci16_t __builtin_nyuzi_vector_mixi(unsigned short mask, veci16_t a, veci16_t b);
vecf16_t __builtin_nyuzi_vector_mixf(unsigned short mask, vecf16_t a, vecf16_t b);

Return a new vector where each lane in the 'laneIndices' vector selects a lane from sourceVector.

veci16_t __builtin_nyuzi_shufflei(veci16_t sourceVector, veci16_t laneIndices);
vecf16_t __builtin_nyuzi_shufflef(vecf16_t sourceVector, veci16_t laneIndices);

Each lane in 'sourcePtrs' is a pointer to a 32-bit value. This loads one value for each lane into a destination vector. For the _masked versions, each bit in the mask register corresponds to a lane. If a lane's mask bit is 1, the value is fetched; if it is 0, the lane is ignored (its value in the result is undefined).

veci16_t __builtin_nyuzi_gather_loadi(veci16_t sourcePtrs);
veci16_t __builtin_nyuzi_gather_loadi_masked(veci16_t sourcePtrs, unsigned short mask);
veci16_t __builtin_nyuzi_gather_loadf(veci16_t pointers);
veci16_t __builtin_nyuzi_gather_loadf_masked(veci16_t pointers, unsigned short mask);

The opposite of the gather load functions: each lane of the source vector is stored through the pointer in the corresponding lane of 'destPtrs'.

void __builtin_nyuzi_scatter_storei(veci16_t destPtrs, veci16_t sourceValue);
void __builtin_nyuzi_scatter_storei_masked(veci16_t destPtrs, veci16_t sourceValue, unsigned short mask);
void __builtin_nyuzi_scatter_storef(veci16_t destPtrs, vecf16_t sourceValue);
void __builtin_nyuzi_scatter_storef_masked(veci16_t destPtrs, vecf16_t sourceValue, unsigned short mask);
void __builtin_nyuzi_block_storei_masked(veci16_t *dest, veci16_t values, unsigned short mask);
void __builtin_nyuzi_block_storef_masked(vecf16_t *dest, vecf16_t values, unsigned short mask);

Vector unsigned integer comparisons. Each bit in the return value corresponds to a lane.

  • gt Greater than
  • ge Greater than or equal
  • lt Less than
  • le Less than or equal
  • eq Equal
  • ne Not equal
unsigned short __builtin_nyuzi_mask_cmpi_ugt(vecu16_t a, vecu16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_uge(vecu16_t a, vecu16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_ult(vecu16_t a, vecu16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_ule(vecu16_t a, vecu16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_eq(veci16_t a, veci16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_ne(veci16_t a, veci16_t b);

Vector signed integer comparisons.

unsigned short __builtin_nyuzi_mask_cmpi_sgt(veci16_t a, veci16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_sge(veci16_t a, veci16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_slt(veci16_t a, veci16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_sle(veci16_t a, veci16_t b);

Vector floating point comparisons

unsigned short __builtin_nyuzi_mask_cmpf_gt(vecf16_t a, vecf16_t b);
unsigned short __builtin_nyuzi_mask_cmpf_ge(vecf16_t a, vecf16_t b);
unsigned short __builtin_nyuzi_mask_cmpf_lt(vecf16_t a, vecf16_t b);
unsigned short __builtin_nyuzi_mask_cmpf_le(vecf16_t a, vecf16_t b);
unsigned short __builtin_nyuzi_mask_cmpf_eq(vecf16_t a, vecf16_t b);
unsigned short __builtin_nyuzi_mask_cmpf_ne(vecf16_t a, vecf16_t b);


It is often useful to see a disassembled listing of the executable to debug issues. The llvm-objdump command disassembles ELF output files from the compiler.

/usr/local/llvm-nyuzi/bin/llvm-objdump --disassemble program.elf

    95b8:	bd 03 ff 02 	add_i sp, sp, -64
    95bc:	1d f3 00 88 	store_32 s24, 60(sp)
    95c0:	3d e3 00 88 	store_32 s25, 56(sp)
    95c4:	5d d3 00 88 	store_32 s26, 52(sp)
    95c8:	dd c3 00 88 	store_32 ra, 48(sp)
    95cc:	1d 60 00 88 	store_32 s0, 24(sp)

The -S flag can also be used with objdump to annotate the disassembly with the original source code. This uses line number debug information, so the original program must be compiled with debugging enabled (-g flag). However, because of the way instruction scheduling works, the assembly instructions may be reordered quite a bit and may not be easy to correlate with the original source code.

ABI/Code Generation

type        size
char        1 byte
short       2 bytes
int         4 bytes
long        4 bytes
long long   8 bytes
void*       4 bytes
float       4 bytes
double      4 bytes (see below)
  • The compiler passes the first 8 scalar and vector function arguments in registers. It pushes the rest of the arguments on the stack in order, aligned by size.
  • 64-bit values are passed in two adjacent scalar registers, with the lower numbered register being the least significant word.
  • If a function has a variable number of arguments, it pushes all arguments on the stack.
  • When a function returns a struct by value, the caller reserves space for the result in its own stack frame and passes the address of that region in s0. The parameters of the function then start at s1.
  • Scalar regs 24-27, gp, fp, ra, and vector regs 26-31 are callee save. The others are caller save.
  • s28 is the global pointer (gp), which points to the global offset table in position independent code.
  • s29 is used as a frame pointer (fp) for function calls when needed. Most of the time it isn't; it is only needed if the function accesses the frame or return address via the __builtin_frame_address and __builtin_return_address intrinsics, or if it makes variable-sized stack allocations.
  • s30 is the stack pointer (sp), which is 64 byte (vector width) aligned.
  • The hardware uses s31 as the return address register (ra). It sets this when a call instruction is executed.
  • The 'double' type is 32 bits wide and is actually an IEEE single precision float. This is because there is no hardware support for double precision floating point and the compiler defaults to double for many operations. While unusual, I believe this is technically spec compliant.
  • Integer modulus and division are not supported in hardware and generate calls to library functions __udivsi3, __divsi3, __umodsi3, and __modsi3. These are in libclang_rt.builtins-nyuzi.a, which are built as part of the compiler-rt project in the toolchain repository.
  • Floating point division operations emit a reciprocal estimate instruction followed by two Newton-Raphson iterations (9 instructions for a reciprocal, 10 for a full division).

The Nyuzi ELF format supports the following relocation types:

ID  Name               Description
1   R_NYUZI_ABS32      32 bit absolute relocation
2   R_NYUZI_BRANCH20   20 bit PC-relative offset in bits 24-5 (short branch instruction)
3   R_NYUZI_BRANCH25   25 bit PC-relative offset in bits 24-0 (long branch instruction)
4   R_NYUZI_HI19       Upper 19 bits of absolute address in bits 18-5, 4-0 (immediate arithmetic, movehi instruction)
5   R_NYUZI_IMM_LO13   Lower 13 bits of absolute address in bits 23-10 (immediate arithmetic instruction)
6   R_NYUZI_GOT        Load instruction, offset into global offset table

The ELF machine ID for Nyuzi is currently set to 9999, as it doesn't have an official ID.

Other tools

The elf2hex tool builds memory images that tools load:

  • It uses the format that the Verilog $readmemh system function understands. Each line is 8 characters hexadecimal ASCII, which encodes four bytes. The processor is little endian, so, if a line is "002600f6", the processor will read the instruction as 0xf6002600.
  • It unpacks the ELF file into a flat memory representation with the segments at their proper addresses, BSS regions cleared, etc.
  • It clobbers the first word of the unused ELF header with a jump instruction to the start address.