Flexible vectors overview

The goal of this proposal is to provide flexible vector instructions for WebAssembly as a way to bridge the gap between the SIMD instruction sets available on various platforms. More specifically, it aims to make better use of the processing capabilities of existing SIMD hardware and to bring the performance of vector operations in WebAssembly closer to native. The simd128 proposal already identified operations that commonly work on the platforms important to WebAssembly; this proposal attempts to extend the same operations to variable vector lengths.

Types

The proposal introduces the following vector types:

  • vec.i8 : 8-bit integer lanes
  • vec.i16: 16-bit integer lanes
  • vec.i32: 32-bit integer lanes
  • vec.i64: 64-bit integer lanes
  • vec.f32: single precision floating point lanes
  • vec.f64: double precision floating point lanes

Lane division interpretation

In the semantic pseudocode, S is the particular vector type, S.LaneBits is the size of a lane in bits, and S.Lanes is the number of lanes, which is dynamic.

| S | S.LaneBits |
|---|---|
| vec.i8 | 8 |
| vec.i16 | 16 |
| vec.i32 | 32 |
| vec.f32 | 32 |
| vec.i64 | 64 |
| vec.f64 | 64 |

Restrictions

Lane values are intended to be handled exactly as in the simd128 proposal, with the following differences applying to the overall types:

  • The runtime sets the maximum vector length for every type
  • The number of lanes is set separately for different lane sizes
  • Vectors with different lane sizes are not immediately interoperable

Immediate operands

The value range is TBD and depends on the instruction encoding.

  • ImmLaneIdxV8: lane index for 8-bit lanes
  • ImmLaneIdxV16: lane index for 16-bit lanes
  • ImmLaneIdxV32: lane index for 32-bit lanes
  • ImmLaneIdxV64: lane index for 64-bit lanes

Operations

The completely new operations introduced in this proposal are those that provide an interface to the vector length.

Vector length

Querying the length of a supported vector:

  • vec.i8.length -> i32
  • vec.i16.length -> i32
  • vec.i32.length -> i32
  • vec.i64.length -> i32
  • vec.f32.length -> i32
  • vec.f64.length -> i32
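
Because the lane count is only known at run time, code written against these queries typically processes data in steps of the queried length. The following is a minimal Python sketch of that pattern, not part of the proposal: `vec_i32_length` is a hypothetical stand-in for `vec.i32.length`, it is assumed to return a positive lane count, and the default of 4 lanes is arbitrary.

```python
def vec_i32_length(runtime_lanes=4):
    """Hypothetical stand-in for vec.i32.length; 4 lanes is an arbitrary choice."""
    return runtime_lanes


def add_i32_arrays(xs, ys, lanes):
    """Add two equal-length lists, `lanes` elements per step, wrapping to 32 bits."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    out = []
    full = len(xs) - len(xs) % lanes  # elements covered by whole vectors
    for start in range(0, full, lanes):
        # One iteration models one vector load, add, and store.
        for i in range(start, start + lanes):
            out.append((xs[i] + ys[i]) & 0xFFFFFFFF)
    for i in range(full, len(xs)):
        # Scalar loop for the leftover elements.
        out.append((xs[i] + ys[i]) & 0xFFFFFFFF)
    return out
```

The result does not depend on the lane count the runtime happens to choose, which is the property that lets the same module run on hardware with different vector widths.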

Constructing vector values

Create a vector with identical lanes:

  • vec.i8.splat(x:i32) -> vec.i8
  • vec.i16.splat(x:i32) -> vec.i16
  • vec.i32.splat(x:i32) -> vec.i32
  • vec.i64.splat(x:i64) -> vec.i64
  • vec.f32.splat(x:f32) -> vec.f32
  • vec.f64.splat(x:f64) -> vec.f64

Construct a vector with x replicated to all lanes:

def S.splat(x):
    result = S.New()
    for i in range(S.Lanes):
        result[i] = x
    return result

Accessing lanes

Extract lane as a scalar

  • vec.i8.extract_lane_s(a: vec.i8, imm: ImmLaneIdxV8) -> i32
  • vec.i8.extract_lane_u(a: vec.i8, imm: ImmLaneIdxV8) -> i32
  • vec.i16.extract_lane_s(a: vec.i16, imm: ImmLaneIdxV16) -> i32
  • vec.i16.extract_lane_u(a: vec.i16, imm: ImmLaneIdxV16) -> i32
  • vec.i32.extract_lane(a: vec.i32, imm: ImmLaneIdxV32) -> i32
  • vec.i64.extract_lane(a: vec.i64, imm: ImmLaneIdxV64) -> i64
  • vec.f32.extract_lane(a: vec.f32, imm: ImmLaneIdxV32) -> f32
  • vec.f64.extract_lane(a: vec.f64, imm: ImmLaneIdxV64) -> f64

Extract the scalar value of the lane in a specified by the immediate mode operand imm. The {interpretation}.extract_lane{_s}{_u} instructions are encoded with one immediate byte providing the index of the lane to extract.

def S.extract_lane(a, i):
    return a[i]

The _s and _u variants will sign-extend or zero-extend the lane value to i32 respectively.
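
To illustrate the difference between the two variants, here is a small runnable sketch that models a vector as a Python list of raw lane bit patterns. The helper names and the explicit `lane_bits` parameter are assumptions made for this example only.

```python
def extract_lane_u(a, i, lane_bits):
    """Zero-extend: return the lane's bit pattern as a non-negative integer."""
    return a[i] & ((1 << lane_bits) - 1)


def extract_lane_s(a, i, lane_bits):
    """Sign-extend: interpret the lane's top bit as the sign bit."""
    value = extract_lane_u(a, i, lane_bits)
    sign_bit = 1 << (lane_bits - 1)
    return value - (1 << lane_bits) if value & sign_bit else value
```

For an 8-bit lane holding 0x80, the unsigned variant yields 128 and the signed variant yields -128.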

Replace lane value

  • vec.i8.replace_lane(a: vec.i8, imm: ImmLaneIdxV8, x: i32) -> vec.i8
  • vec.i16.replace_lane(a: vec.i16, imm: ImmLaneIdxV16, x: i32) -> vec.i16
  • vec.i32.replace_lane(a: vec.i32, imm: ImmLaneIdxV32, x: i32) -> vec.i32
  • vec.i64.replace_lane(a: vec.i64, imm: ImmLaneIdxV64, x: i64) -> vec.i64
  • vec.f32.replace_lane(a: vec.f32, imm: ImmLaneIdxV32, x: f32) -> vec.f32
  • vec.f64.replace_lane(a: vec.f64, imm: ImmLaneIdxV64, x: f64) -> vec.f64

Return a new vector with lanes identical to a, except for the lane specified in the immediate mode operand imm, which has the value x. The {interpretation}.replace_lane instructions are encoded with an immediate byte providing the index of the lane whose value is to be replaced.

def S.replace_lane(a, i, x):
    result = S.New()
    for j in range(S.Lanes):
        result[j] = a[j]
    result[i] = x
    return result

The input lane value, x, is interpreted the same way as for the splat instructions. For the i8 and i16 lanes, the high bits of x are ignored.
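
A minimal sketch of the truncation behavior, modeling a vector as a Python list of lane bit patterns (the `lane_bits` parameter is an assumption of this example, standing in for S.LaneBits):

```python
def replace_lane(a, i, x, lane_bits):
    """Return a copy of `a` with lane i set to x truncated to lane_bits bits."""
    result = list(a)
    result[i] = x & ((1 << lane_bits) - 1)  # high bits of x are ignored
    return result
```

With 8-bit lanes, replacing a lane with 0x1234 stores 0x34, and the input vector is left unchanged.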

Shuffles

Left lane-wise shift by scalar

  • vec.i8.lshl(a: vec.i8, x: i32) -> vec.i8
  • vec.i16.lshl(a: vec.i16, x: i32) -> vec.i16
  • vec.i32.lshl(a: vec.i32, x: i32) -> vec.i32
  • vec.i64.lshl(a: vec.i64, x: i32) -> vec.i64

Returns a new vector whose lanes are taken from the input vector a, shifted to the left by the number of lanes given in the integer argument x, with zero values shifted in.

def S.lshl(a, x):
    result = S.New()
    for i in range(S.Lanes):
        if i < x:
            result[i] = 0
        else:
            result[i] = a[i - x]
    return result

Right lane-wise shift by scalar

  • vec.i8.lshr(a: vec.i8, x: i32) -> vec.i8
  • vec.i16.lshr(a: vec.i16, x: i32) -> vec.i16
  • vec.i32.lshr(a: vec.i32, x: i32) -> vec.i32
  • vec.i64.lshr(a: vec.i64, x: i32) -> vec.i64

Returns a new vector whose lanes are taken from the input vector a, shifted to the right by the number of lanes given in the integer argument x, with zero values shifted in.

def S.lshr(a, x):
    result = S.New()
    for i in range(S.Lanes):
        if i < S.Lanes - x:
            result[i] = a[i + x]
        else:
            result[i] = 0
    return result
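
The two lane-wise shifts can be tried out with the following runnable sketch, which models a vector as a Python list and takes the lane count from the list length. It assumes x is treated as a non-negative (unsigned) count; the pseudocode above does not say how a negative i32 would be interpreted.

```python
def lshl(a, x):
    """Move lanes toward higher indices by x positions, filling with zeros."""
    lanes = len(a)
    return [0 if i < x else a[i - x] for i in range(lanes)]


def lshr(a, x):
    """Move lanes toward lower indices by x positions, filling with zeros."""
    lanes = len(a)
    return [a[i + x] if i < lanes - x else 0 for i in range(lanes)]
```

Under this model a shift count equal to or larger than the lane count produces an all-zero vector.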

Integer arithmetic

Wrapping integer arithmetic discards the high bits of the result.

def S.Reduce(x):
    bitmask = (1 << S.LaneBits) - 1
    return x & bitmask

The integer division operation is omitted for compatibility with 128-bit SIMD.

Integer addition

  • vec.i8.add(a: vec.i8, b: vec.i8) -> vec.i8
  • vec.i16.add(a: vec.i16, b: vec.i16) -> vec.i16
  • vec.i32.add(a: vec.i32, b: vec.i32) -> vec.i32
  • vec.i64.add(a: vec.i64, b: vec.i64) -> vec.i64

Lane-wise wrapping integer addition:

def S.add(a, b):
    def add(x, y):
        return S.Reduce(x + y)
    return S.lanewise_binary(add, a, b)
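
The pseudocode relies on S.lanewise_binary, which is not spelled out in this document. A plausible reading, written as runnable Python over lists of lane values, is sketched below; the function names and the explicit `lane_bits` parameter are assumptions of this example.

```python
def reduce_bits(x, lane_bits):
    """Model of S.Reduce: keep only the low lane_bits bits."""
    return x & ((1 << lane_bits) - 1)


def lanewise_binary(op, a, b):
    """Apply op to each pair of corresponding lanes."""
    assert len(a) == len(b)
    return [op(x, y) for x, y in zip(a, b)]


def add(a, b, lane_bits):
    """Lane-wise wrapping addition."""
    return lanewise_binary(lambda x, y: reduce_bits(x + y, lane_bits), a, b)
```

With 8-bit lanes, 250 + 10 wraps to 4 because only the low eight bits of 260 are kept.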

Integer subtraction

  • vec.i8.sub(a: vec.i8, b: vec.i8) -> vec.i8
  • vec.i16.sub(a: vec.i16, b: vec.i16) -> vec.i16
  • vec.i32.sub(a: vec.i32, b: vec.i32) -> vec.i32
  • vec.i64.sub(a: vec.i64, b: vec.i64) -> vec.i64

Lane-wise wrapping integer subtraction:

def S.sub(a, b):
    def sub(x, y):
        return S.Reduce(x - y)
    return S.lanewise_binary(sub, a, b)

Integer multiplication

  • vec.i8.mul(a: vec.i8, b: vec.i8) -> vec.i8
  • vec.i16.mul(a: vec.i16, b: vec.i16) -> vec.i16
  • vec.i32.mul(a: vec.i32, b: vec.i32) -> vec.i32
  • vec.i64.mul(a: vec.i64, b: vec.i64) -> vec.i64

Lane-wise wrapping integer multiplication:

def S.mul(a, b):
    def mul(x, y):
        return S.Reduce(x * y)
    return S.lanewise_binary(mul, a, b)

Integer negation

  • vec.i8.neg(a: vec.i8) -> vec.i8
  • vec.i16.neg(a: vec.i16) -> vec.i16
  • vec.i32.neg(a: vec.i32) -> vec.i32
  • vec.i64.neg(a: vec.i64) -> vec.i64

Lane-wise wrapping integer negation. In wrapping arithmetic, y = -x is the unique value such that x + y == 0.

def S.neg(a):
    def neg(x):
        return S.Reduce(-x)
    return S.lanewise_unary(neg, a)

Saturating integer arithmetic

Saturating integer arithmetic behaves differently on signed and unsigned lanes.

def S.SignedSaturate(x):
    if x < S.Smin:
        return S.Smin
    if x > S.Smax:
        return S.Smax
    return x

def S.UnsignedSaturate(x):
    if x < 0:
        return 0
    if x > S.Umax:
        return S.Umax
    return x

Saturating integer addition

  • vec.i8.add_sat_s(a: vec.i8, b: vec.i8) -> vec.i8
  • vec.i8.add_sat_u(a: vec.i8, b: vec.i8) -> vec.i8
  • vec.i16.add_sat_s(a: vec.i16, b: vec.i16) -> vec.i16
  • vec.i16.add_sat_u(a: vec.i16, b: vec.i16) -> vec.i16
  • vec.i32.add_sat_s(a: vec.i32, b: vec.i32) -> vec.i32
  • vec.i32.add_sat_u(a: vec.i32, b: vec.i32) -> vec.i32
  • vec.i64.add_sat_s(a: vec.i64, b: vec.i64) -> vec.i64
  • vec.i64.add_sat_u(a: vec.i64, b: vec.i64) -> vec.i64

Lane-wise saturating addition:

def S.add_sat_s(a, b):
    def addsat(x, y):
        return S.SignedSaturate(x + y)
    return S.lanewise_binary(addsat, S.AsSigned(a), S.AsSigned(b))

def S.add_sat_u(a, b):
    def addsat(x, y):
        return S.UnsignedSaturate(x + y)
    return S.lanewise_binary(addsat, S.AsUnsigned(a), S.AsUnsigned(b))
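
As a concrete illustration, the sketch below implements both saturating additions over Python lists of raw lane bit patterns. S.Smin, S.Smax, and S.Umax are derived here from an explicit `lane_bits` parameter, which is an assumption of this example rather than something the document defines.

```python
def as_signed(x, lane_bits):
    """Interpret a lane bit pattern as a two's-complement signed value."""
    x &= (1 << lane_bits) - 1
    return x - (1 << lane_bits) if x >> (lane_bits - 1) else x


def signed_saturate(x, lane_bits):
    smin = -(1 << (lane_bits - 1))
    smax = (1 << (lane_bits - 1)) - 1
    return max(smin, min(smax, x))


def unsigned_saturate(x, lane_bits):
    umax = (1 << lane_bits) - 1
    return max(0, min(umax, x))


def add_sat_s(a, b, lane_bits):
    mask = (1 << lane_bits) - 1
    # Saturate the signed sum, then store it back as a raw bit pattern.
    return [signed_saturate(as_signed(x, lane_bits) + as_signed(y, lane_bits),
                            lane_bits) & mask
            for x, y in zip(a, b)]


def add_sat_u(a, b, lane_bits):
    mask = (1 << lane_bits) - 1
    return [unsigned_saturate((x & mask) + (y & mask), lane_bits)
            for x, y in zip(a, b)]
```

With 8-bit lanes, the unsigned sum 200 + 100 clamps to 255, while the signed sum 127 + 1 clamps to 127 (bit pattern 0x7F).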

Saturating integer subtraction

  • vec.i8.sub_sat_s(a: vec.i8, b: vec.i8) -> vec.i8
  • vec.i8.sub_sat_u(a: vec.i8, b: vec.i8) -> vec.i8
  • vec.i16.sub_sat_s(a: vec.i16, b: vec.i16) -> vec.i16
  • vec.i16.sub_sat_u(a: vec.i16, b: vec.i16) -> vec.i16
  • vec.i32.sub_sat_s(a: vec.i32, b: vec.i32) -> vec.i32
  • vec.i32.sub_sat_u(a: vec.i32, b: vec.i32) -> vec.i32
  • vec.i64.sub_sat_s(a: vec.i64, b: vec.i64) -> vec.i64
  • vec.i64.sub_sat_u(a: vec.i64, b: vec.i64) -> vec.i64

Lane-wise saturating subtraction:

def S.sub_sat_s(a, b):
    def subsat(x, y):
        return S.SignedSaturate(x - y)
    return S.lanewise_binary(subsat, S.AsSigned(a), S.AsSigned(b))

def S.sub_sat_u(a, b):
    def subsat(x, y):
        return S.UnsignedSaturate(x - y)
    return S.lanewise_binary(subsat, S.AsUnsigned(a), S.AsUnsigned(b))

Lane-wise integer minimum

  • vec.i8.min_s(a: vec.i8, b: vec.i8) -> vec.i8
  • vec.i8.min_u(a: vec.i8, b: vec.i8) -> vec.i8
  • vec.i16.min_s(a: vec.i16, b: vec.i16) -> vec.i16
  • vec.i16.min_u(a: vec.i16, b: vec.i16) -> vec.i16
  • vec.i32.min_s(a: vec.i32, b: vec.i32) -> vec.i32
  • vec.i32.min_u(a: vec.i32, b: vec.i32) -> vec.i32
  • vec.i64.min_s(a: vec.i64, b: vec.i64) -> vec.i64
  • vec.i64.min_u(a: vec.i64, b: vec.i64) -> vec.i64

Compares lane-wise signed/unsigned integers, and returns the minimum of each pair.

def S.min(a, b):
    return S.lanewise_binary(min, a, b)
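
The S.min pseudocode does not show how the _s and _u variants differ; the difference lies in how each lane's bits are interpreted before comparing. The following sketch makes that explicit, assuming the same S.AsSigned / S.AsUnsigned interpretation used by the saturating operations. Helper names and the `lane_bits` parameter are assumptions of this example.

```python
def as_signed(x, lane_bits):
    """Interpret a lane bit pattern as a two's-complement signed value."""
    x &= (1 << lane_bits) - 1
    return x - (1 << lane_bits) if x >> (lane_bits - 1) else x


def min_u(a, b, lane_bits):
    mask = (1 << lane_bits) - 1
    return [min(x & mask, y & mask) for x, y in zip(a, b)]


def min_s(a, b, lane_bits):
    mask = (1 << lane_bits) - 1
    # Compare as signed values, then return the raw bit pattern.
    return [min(as_signed(x, lane_bits), as_signed(y, lane_bits)) & mask
            for x, y in zip(a, b)]
```

For 8-bit lanes holding 0x80 and 0x01, the unsigned minimum is 0x01 (1 < 128) while the signed minimum is 0x80 (-128 < 1).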

Lane-wise integer maximum

  • vec.i8.max_s(a: vec.i8, b: vec.i8) -> vec.i8
  • vec.i8.max_u(a: vec.i8, b: vec.i8) -> vec.i8
  • vec.i16.max_s(a: vec.i16, b: vec.i16) -> vec.i16
  • vec.i16.max_u(a: vec.i16, b: vec.i16) -> vec.i16
  • vec.i32.max_s(a: vec.i32, b: vec.i32) -> vec.i32
  • vec.i32.max_u(a: vec.i32, b: vec.i32) -> vec.i32
  • vec.i64.max_s(a: vec.i64, b: vec.i64) -> vec.i64
  • vec.i64.max_u(a: vec.i64, b: vec.i64) -> vec.i64

Compares lane-wise signed/unsigned integers, and returns the maximum of each pair.

def S.max(a, b):
    return S.lanewise_binary(max, a, b)

Lane-wise integer rounding average

  • vec.i8.avgr_u(a: vec.i8, b: vec.i8) -> vec.i8
  • vec.i16.avgr_u(a: vec.i16, b: vec.i16) -> vec.i16
  • vec.i32.avgr_u(a: vec.i32, b: vec.i32) -> vec.i32
  • vec.i64.avgr_u(a: vec.i64, b: vec.i64) -> vec.i64

Lane-wise rounding average:

def S.RoundingAverage(x, y):
    return (x + y + 1) // 2

def S.avgr_u(a, b):
    return S.lanewise_binary(S.RoundingAverage, S.AsUnsigned(a), S.AsUnsigned(b))
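
A runnable version of the rounding average over Python lists of unsigned lane values is sketched below (the `lane_bits` parameter is an assumption of this example). The intermediate sum is computed at full precision, so the result always fits in the lane.

```python
def rounding_average(x, y):
    """Average of two unsigned values, rounding halves up."""
    return (x + y + 1) // 2


def avgr_u(a, b, lane_bits):
    mask = (1 << lane_bits) - 1
    return [rounding_average(x & mask, y & mask) for x, y in zip(a, b)]
```

For example, the average of 1 and 2 rounds up to 2, and the average of 255 and 255 in 8-bit lanes is 255 rather than an overflowed value.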

Lane-wise integer absolute value

  • vec.i8.abs(a: vec.i8) -> vec.i8
  • vec.i16.abs(a: vec.i16) -> vec.i16
  • vec.i32.abs(a: vec.i32) -> vec.i32
  • vec.i64.abs(a: vec.i64) -> vec.i64

Lane-wise wrapping absolute value.

def S.abs(a):
    return S.lanewise_unary(abs, S.AsSigned(a))
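
"Wrapping" matters for the most negative lane value, whose magnitude does not fit in a signed lane. The sketch below shows this on Python lists of raw lane bit patterns; the helper names and `lane_bits` parameter are assumptions of this example.

```python
def as_signed(x, lane_bits):
    """Interpret a lane bit pattern as a two's-complement signed value."""
    x &= (1 << lane_bits) - 1
    return x - (1 << lane_bits) if x >> (lane_bits - 1) else x


def abs_lanes(a, lane_bits):
    mask = (1 << lane_bits) - 1
    # abs(-128) is 128, which wraps back to the bit pattern 0x80 in 8 bits.
    return [abs(as_signed(x, lane_bits)) & mask for x in a]
```

With 8-bit lanes, 0xFF (-1) becomes 1, while 0x80 (-128) maps back to 0x80.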

Bit shifts

Left shift by scalar

  • vec.i8.shl(a: vec.i8, y: i32) -> vec.i8
  • vec.i16.shl(a: vec.i16, y: i32) -> vec.i16
  • vec.i32.shl(a: vec.i32, y: i32) -> vec.i32
  • vec.i64.shl(a: vec.i64, y: i32) -> vec.i64

Shift the bits in each lane to the left by the same amount. The shift count is taken modulo lane width:

def S.shl(a, y):
    # Number of bits to shift: 0 .. S.LaneBits - 1.
    amount = y % S.LaneBits
    def shift(x):
        return S.Reduce(x << amount)
    return S.lanewise_unary(shift, a)

Right shift by scalar

  • vec.i8.shr_s(a: vec.i8, y: i32) -> vec.i8
  • vec.i8.shr_u(a: vec.i8, y: i32) -> vec.i8
  • vec.i16.shr_s(a: vec.i16, y: i32) -> vec.i16
  • vec.i16.shr_u(a: vec.i16, y: i32) -> vec.i16
  • vec.i32.shr_s(a: vec.i32, y: i32) -> vec.i32
  • vec.i32.shr_u(a: vec.i32, y: i32) -> vec.i32
  • vec.i64.shr_s(a: vec.i64, y: i32) -> vec.i64
  • vec.i64.shr_u(a: vec.i64, y: i32) -> vec.i64

Shift the bits in each lane to the right by the same amount. The shift count is taken modulo lane width. This is an arithmetic right shift for the _s variants and a logical right shift for the _u variants.

def S.shr_s(a, y):
    # Number of bits to shift: 0 .. S.LaneBits - 1.
    amount = y % S.LaneBits
    def shift(x):
        return x >> amount
    return S.lanewise_unary(shift, S.AsSigned(a))

def S.shr_u(a, y):
    # Number of bits to shift: 0 .. S.LaneBits - 1.
    amount = y % S.LaneBits
    def shift(x):
        return x >> amount
    return S.lanewise_unary(shift, S.AsUnsigned(a))
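
The three bit shifts can be exercised with the following sketch over Python lists of raw lane bit patterns. The helper names and the `lane_bits` parameter are assumptions of this example, and y is assumed to be non-negative.

```python
def as_signed(x, lane_bits):
    """Interpret a lane bit pattern as a two's-complement signed value."""
    x &= (1 << lane_bits) - 1
    return x - (1 << lane_bits) if x >> (lane_bits - 1) else x


def shl(a, y, lane_bits):
    amount = y % lane_bits
    mask = (1 << lane_bits) - 1
    return [(x << amount) & mask for x in a]


def shr_u(a, y, lane_bits):
    amount = y % lane_bits
    mask = (1 << lane_bits) - 1
    return [(x & mask) >> amount for x in a]  # logical: zeros shifted in


def shr_s(a, y, lane_bits):
    amount = y % lane_bits
    mask = (1 << lane_bits) - 1
    # Arithmetic: Python's >> on a negative int replicates the sign bit.
    return [(as_signed(x, lane_bits) >> amount) & mask for x in a]
```

With 8-bit lanes, a shift count of 9 behaves like a count of 1, and shifting 0x80 right by one gives 0x40 logically but 0xC0 arithmetically.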

Bitwise operations

Bitwise logic

  • vec.i8.and(a: vec.i8, b: vec.i8) -> vec.i8
  • vec.i8.or(a: vec.i8, b: vec.i8) -> vec.i8
  • vec.i8.xor(a: vec.i8, b: vec.i8) -> vec.i8
  • vec.i8.not(a: vec.i8) -> vec.i8

The logical operations defined on the scalar integer types are also available on flexible vectors, where they operate bitwise in the same way that C's &, |, ^, and ~ operators work on an unsigned type.

Bitwise AND-NOT

  • vec.i8.andnot(a: vec.i8, b: vec.i8) -> vec.i8

Bitwise AND of bits of a and the logical inverse of bits of b. This operation is equivalent to vec.i8.and(a, vec.i8.not(b)).

Bitwise select

  • vec.i8.bitselect(v1: vec.i8, v2: vec.i8, c: vec.i8) -> vec.i8

Use the bits in the control mask c to select the corresponding bit from v1 where the mask bit is 1 and from v2 where it is 0. This is the same as vec.i8.or(vec.i8.and(v1, c), vec.i8.and(v2, vec.i8.not(c))).

Note that the normal WebAssembly select instruction also works with vector types. It selects between two whole vectors controlled by a single scalar value, rather than selecting bits controlled by a control mask vector.
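
The bitselect and andnot identities given above can be checked with a short sketch over Python lists of lane bit patterns. The default of 8-bit lanes and the helper names are assumptions of this example.

```python
def andnot(a, b, lane_bits=8):
    """Bitwise a AND (NOT b), lane by lane."""
    mask = (1 << lane_bits) - 1
    return [x & ~y & mask for x, y in zip(a, b)]


def bitselect(v1, v2, c, lane_bits=8):
    """Take each bit from v1 where c has a 1 and from v2 where c has a 0."""
    mask = (1 << lane_bits) - 1
    return [((x & m) | (y & ~m)) & mask for x, y, m in zip(v1, v2, c)]
```

For example, selecting between 0xAA and 0x55 with mask 0xF0 takes the high nibble from the first value and the low nibble from the second, giving 0xA5.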

Boolean horizontal reductions

TBD

Comparisons

TBD

Load and store

  • vec.v8.load(memarg) -> vec.v8
  • vec.v16.load(memarg) -> vec.v16
  • vec.v32.load(memarg) -> vec.v32
  • vec.v64.load(memarg) -> vec.v64

Load a vector from the given heap address.

  • vec.v8.store(memarg, data:vec.v8)
  • vec.v16.store(memarg, data:vec.v16)
  • vec.v32.store(memarg, data:vec.v32)
  • vec.v64.store(memarg, data:vec.v64)

Store a vector to the given heap address.

Floating-point sign bit operations

TBD

Floating-point min and max

TBD

Floating-point arithmetic

The floating-point arithmetic operations are all lane-wise versions of the existing scalar WebAssembly operations.

Addition

  • vec.f32.add(a: vec.f32, b: vec.f32) -> vec.f32
  • vec.f64.add(a: vec.f64, b: vec.f64) -> vec.f64

Lane-wise IEEE addition.

Subtraction

  • vec.f32.sub(a: vec.f32, b: vec.f32) -> vec.f32
  • vec.f64.sub(a: vec.f64, b: vec.f64) -> vec.f64

Lane-wise IEEE subtraction.

Division

  • vec.f32.div(a: vec.f32, b: vec.f32) -> vec.f32
  • vec.f64.div(a: vec.f64, b: vec.f64) -> vec.f64

Lane-wise IEEE division.

Multiplication

  • vec.f32.mul(a: vec.f32, b: vec.f32) -> vec.f32
  • vec.f64.mul(a: vec.f64, b: vec.f64) -> vec.f64

Lane-wise IEEE multiplication.

Square root

  • vec.f32.sqrt(a: vec.f32) -> vec.f32
  • vec.f64.sqrt(a: vec.f64) -> vec.f64

Lane-wise IEEE squareRoot.

Conversions

TBD

Setting vector length

TBD whether this should be included

  • 8-bit lanes
    • vec.i8.set_length(len: i32) -> i32
    • vec.i8.set_length_imm(imm: ImmLaneIdxV8) -> i32
  • 16-bit lanes
    • vec.i16.set_length(len: i32) -> i32
    • vec.i16.set_length_imm(imm: ImmLaneIdxV16) -> i32
  • 32-bit lanes
    • vec.i32.set_length(len: i32) -> i32
    • vec.i32.set_length_imm(imm: ImmLaneIdxV32) -> i32
    • vec.f32.set_length(len: i32) -> i32
    • vec.f32.set_length_imm(imm: ImmLaneIdxV32) -> i32
  • 64-bit lanes
    • vec.i64.set_length(len: i32) -> i32
    • vec.i64.set_length_imm(imm: ImmLaneIdxV64) -> i32
    • vec.f64.set_length(len: i32) -> i32
    • vec.f64.set_length_imm(imm: ImmLaneIdxV64) -> i32

The above operations set the number of lanes for the corresponding vector type to the minimum of the supported vector length and the requested length. The resulting length is then returned on the stack.

This sets the number of lanes for vector operations working on the corresponding vector types. Setting the vector length to zero turns the corresponding vector operations (aside from the set_length operations) into NOPs.
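
One way to picture these semantics is a small state object that clamps the requested length and reports the result, which a loop can then use to handle a final partial step without a separate scalar path. This is only a sketch under stated assumptions: the `LaneState` class, its `max_lanes` field, and the `add_with_tail` helper are hypothetical names, and the requested length is assumed to be non-negative.

```python
class LaneState:
    """Hypothetical per-type lane count, bounded by a runtime maximum."""

    def __init__(self, max_lanes):
        if max_lanes < 1:
            raise ValueError("max_lanes must be positive")
        self.max_lanes = max_lanes
        self.lanes = max_lanes

    def set_length(self, requested):
        """Clamp the request to the supported maximum and return the result."""
        self.lanes = min(self.max_lanes, requested)
        return self.lanes


def add_with_tail(xs, ys, state):
    """Add two equal-length lists, asking for the remaining count each step."""
    out = []
    i = 0
    while i < len(xs):
        n = state.set_length(len(xs) - i)  # full width, or fewer at the end
        out.extend(x + y for x, y in zip(xs[i:i + n], ys[i:i + n]))
        i += n
    return out
```

With a maximum of 4 lanes and 10 elements, the loop runs with lengths 4, 4, and 2.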

The following sections describe operations that work on flexible vector types.