
Releases: microsoft/Accera

v1.2.29

18 Apr 06:08

What's Changed


  • Merged PR 3211: Upgrade hatlib dependency to 0.0.39. [Captain Jack
    Sparrow]

  • Merged PR 3209: Support AffineParallelOp and scf::ParallelOp in
    RangeValue utils. [Mason Remy]

  • Merged PR 3207: Fix parallelization and enable file checker in tests.
    [Captain Jack Sparrow]

  • Merged PR 3195: [LLVM 15] progressive upgrade (24a37a396a9b), disable
    macos builds. [Lisa Ong]

    The first of a series of progressive upgrades from LLVM 14.0.6 to LLVM 15.0.7 (and possibly beyond).

    Current LLVM version:
    https://intelligentdevices.visualstudio.com/ELL/_git/accera.llvm?version=GBaccera/llvmorg-15-24a37a396a9b&_a=history

    This is llvmorg-15.0.0-init, fast-forwarded by about 100 "relevant" MLIR commits (the actual number of commits is higher).

    Performance on AVX2 is verified for Windows (no regressions).

    Breaking Change: macOS builds
    With this upgrade we are also retiring the macOS pipelines due to a lack of build resources for LLVM macOS/Intel Conan packages. This only affects internal developer scenarios; public developers continue to rely on vcpkg builds.

  • Merged PR 3172: Adds better support for compiling specifically for
    AVX2 targets. [Kern Handa]

    • Plumb AVX2 flags to LLVM, with a block for macOS. We plan to remove official support for macOS/Intel starting from LLVM 15 due to limited build resources.
    • Initialize Target.HOST extensions using cpu_info
    • Added more AVX2 filecheck tests to catch LLVM lowering regressions before moving to LLVM 15 [MasonR]

    Breaking Change: Target.HOST no longer unconditionally enables the AVX2 extension; it is enabled only if the underlying CPU supports it, since codegen may otherwise emit unsupported instructions.

    To compile for AVX2 when your host doesn't support AVX2, specify an explicit Target. For example, plan = schedule.create_plan(Target("Intel 6700"))
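
    A minimal sketch of the explicit-target flow (assuming the usual Accera imports and an existing schedule; the model string comes from the example above):

        from accera import Target

        # Compile for an AVX2-capable CPU model even when the build host lacks AVX2.
        avx2_target = Target("Intel 6700")        # AVX2-capable model from the example above
        plan = schedule.create_plan(avx2_target)  # `schedule` is an existing Schedule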

  • Merged PR 3203: Plumb target device info into llvm lowering. [Denny
    Sun]

    LLVM lowering used to depend on static compiler macros to check target device info, which breaks cross-compilation support:

    // TODO: check `TargetDeviceInfo` for the OS instead
    #ifdef WIN32

    const int hostBitSize = 64; // TODO: FIXME: this assumes that the host is always 64bit
    // Should query the target hardware
    auto llvmIntTy = hostBitSize == 32 ? llvmI32Ty : llvmI64Ty;
    

Full Changelog: v1.2.28...v1.2.29

v1.2.28

04 Apr 09:52

What's Changed

  • Merged PR 3199: Rename _slice to slice and add docs. [Captain Jack
    Sparrow]
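
    A minimal usage sketch (shapes and argument form are hypothetical, inferred from the slice lowering in the next entry; see the docs added by this PR for the exact signature):

        # Hypothetical: slice away dimension 0 of a (1, 30, 256) array at index 0,
        # yielding a (30, 256) view, mirroring the accv.slice lowering shown below.
        A = Array(role=Role.INPUT, element_type=ScalarType.uint8, shape=(1, 30, 256))
        A_sliced = A.slice([0], [0])  # assumed form: (slice_dimensions, indices)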

  • Merged PR 3197: Preserve dest memref shape during SliceOp to SubViewOp
    lowering. [Captain Jack Sparrow]

    Without this change, the subview op would discard the dest memref type required by the slice op. For example,

    %7 = "accv.slice"(%arg0, %6) {sliceDimensions = [0]} : (memref<1x30x256xui8>, index) -> memref<30x256xui8, affine_map<...>>
    

    would get lowered to:

    %4 = memref.subview %arg0[%3, 0, 0] [1, 30, 256] [1, 1, 1] : memref<1x30x256xui8> to memref<1x30x256xui8, affine_map<...>>
    %5 = memref.cast %4 : memref<1x30x256xui8, affine_map<...>> to memref<?x?x?xui8, affine_map<...>>
    

    which does not drop the first dimension as expected. With this fix, the slice op correctly lowers to:

    %4 = memref.subview %arg0[%3, 0, 0] [1, 30, 256] [1, 1, 1] : memref<1x30x256xui8> to memref<30x256xui8, affine_map<...>>
    %5 = memref.cast %4 : memref<30x256xui8, affine_map<...>> to memref<30x256xui8, affine_map<...>>
    
  • Merged PR 3194: Reorder the ops in GetTimeOpLowering to improve the
    timing accuracy. [Denny Sun]

    In order to get the most accurate timing, we need to reorder the operations, from:

            Independent logic
            GetTime()
            Independent logic
            Main logic to profile
            Independent logic
            GetTime()
            Independent logic

    to:

            Independent logic
            Independent logic
            GetTime()
            Main logic to profile
            GetTime()
            Independent logic
            Independent logic
    
  • Merged PR 3187: Fully dynamic split_dimension op. [Denny Sun]

    This change enables Accera to split a dynamic dimension by a dynamic size:

            M, N, MN = create_dimensions()

            Input = Array(role=Role.INPUT, element_type=ScalarType.float32, shape=(MN, ))
            Output = Array(role=Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N))

            nest = Nest(shape=(M, N))
            i, j = nest.get_indices()

            @nest.iteration_logic
            def _():
                split_input = Input._split_dimension(0, N)
                Output[i, j] = split_input[i, j]

            package.add(nest, args=(MN, M, N, Input, Output), base_name=f"{test_name}_fn")
    
  • Merged PR 3185: [nfc] Adds tests for vectorization, fast_exp_sum.
    [Kern Handa]

  • Merged PR 3168: [docs] Tensorization tutorials and type name updates.
    [Captain Jack Sparrow]

Full Changelog: v1.2.27...v1.2.28

v1.2.27

27 Mar 02:19

What's Changed

  • Merged PR 3181: Fix bug with reinterpret_cast of partially-dynamic
    array. [Mason Remy]

  • Merged PR 3180: Enable getting a memref shape from a memref_cast
    result. [Mason Remy]

  • Merged PR 3179: Fix vulkan-specific smoke test break. [Lisa Ong]

    Missing an import for test_vulkan_gpu_matmul(). This test code path is only exercised when Vulkan is installed.

            format = self.PACKAGE_FORMAT if "VULKAN_SDK" in os.environ else Package.Format.HAT_STATIC
            with verifiers.VerifyPackage(self, "test_vulkan_gpu_matmul", TEST_PACKAGE_DIR):
                package.build(
                    name="test_vulkan_gpu_matmul", format=format, mode=self.PACKAGE_MODE, output_dir=TEST_PACKAGE_DIR
                )
    

Full Changelog: v1.2.26...v1.2.27

v1.2.26

24 Mar 04:20

What's Changed

  • Merged PR 3176: [Accera] split_dim op supports dynamic dims with
    static split size. [Denny Sun]

    With this fix, the following test case, which has dynamic dims with a static split size, succeeds.

            M, MN = create_dimensions()
            N = 16
    
            Input = Array(role=Role.INPUT, element_type=ScalarType.float32, shape=(MN,))
            Output = Array(role=Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N))
    
            nest = Nest(shape=(M, N))
            i, j = nest.get_indices()
    
            @nest.iteration_logic
            def _():
                split_input = Input._split_dimension(0, cast(16, ScalarType.index))
                Output[i, j] = split_input[i, j]
    
  • Merged PR 3174: Ensure any dynamic allocations are heap allocs that
    get dealloced. [Mason Remy]

  • Merged PR 3171: [test] Add some tests for Dimensions. [Kern Handa]

  • Merged PR 3175: Support reinterpret cast of same bitwidth without
    changing layout. [Mason Remy]

  • Merged PR 3167: Remove hack to treat INPUT_OUTPUT Arrays with shape
    (1,) as Elements. [Kern Handa]

    I don't have complete context on this, so this might break something. If it does, that should be fixed separately rather than keeping this hack around, which breaks semantics in non-obvious ways.

  • Merged PR 3165: [build] Fix clang 14 release build warnings treated as
    errors on macOS/Apple. [Lisa Ong]

    Errors are showing up on release builds:

    cmake .. -DCMAKE_BUILD_TYPE=Release -G Ninja
    cmake --build . --config Release
    

    Clang version:

    Apple clang version 14.0.0 (clang-1400.0.29.202)
    Target: arm64-apple-darwin22.3.0
    Thread model: posix
    
  • Merged PR 3162: Bump vcpkg to latest release. [Lisa Ong]

    The last release was Sept 2022. Update to the latest tag (2023.02.24).

    Preparation for LLVM 15 upgrade

  • Merged PR 3161: Fix cache reduce scale constant hoisting. [Mason Remy]

  • Merged PR 3163: Extend vector masked loads/stores to handle arbitrary
    bin ops and constant operands. [Mason Remy]

Full Changelog: v1.2.25...v1.2.26

v1.2.25

16 Mar 06:50

What's Changed

  • Merged PR 3160: [security] bump onnx to 1.13.0. [Lisa Ong]

    This resolves a high-severity Dependabot alert.

  • Merged PR 3157: Dynamic split dim tests. [Mason Remy]

  • Merged PR 3158: Do not unroll the profiling ops when vectorization
    enabled. [Denny Sun]

    When vectorization is enabled, the ops in the kernel get unrolled. Without this fix, for example, the timer added to the inner kernel would end up with 8 copies, which is clearly wrong.

  • Merged PR 3153: Fix the lowering issue of the profiling ops. [Denny
    Sun]

    With this fix, kernel-level profiling works end to end. Here is an example of how to use it:

            @tile_nest.iteration_logic
            def _tile_logic():
                EnterProfileRegion("pack_b_fn_outer")
                pack_b_fn(B, B_temp, j, k)
                ExitProfileRegion("pack_b_fn_outer")
    
                EnterProfileRegion("matmul_fn_outer")
                matmul_fn(A, B, C, B_temp, i, j, k)
                ExitProfileRegion("matmul_fn_outer")
    
                PrintProfileResults()
    

    The timings printed out look like:

    matmul_fn_outer 1       0.000100 ms
    pack_b_fn_outer 1       0.000400 ms
    matmul_fn_outer 2       0.000400 ms
    pack_b_fn_outer 2       0.001200 ms
    matmul_fn_outer 3       0.000600 ms
    pack_b_fn_outer 3       0.001700 ms
    matmul_fn_outer 4       0.000800 ms
    pack_b_fn_outer 4       0.002300 ms
    matmul_fn_outer 5       0.000900 ms
    pack_b_fn_outer 5       0.002700 ms
    matmul_fn_outer 6       0.001200 ms
    pack_b_fn_outer 6       0.003200 ms
    matmul_fn_outer 7       0.001500 ms
    pack_b_fn_outer 7       0.003700 ms
    matmul_fn_outer 8       0.001700 ms
    pack_b_fn_outer 8       0.004000 ms
    matmul_fn_outer 9       0.002000 ms
    pack_b_fn_outer 9       0.004500 ms
    matmul_fn_outer 10      0.002200 ms
    pack_b_fn_outer 10      0.004800 ms
    matmul_fn_outer 11      0.002400 ms
    pack_b_fn_outer 11      0.005300 ms
    matmul_fn_outer 12      0.002700 ms
    pack_b_fn_outer 12      0.006500 ms
    matmul_fn_outer 13      0.003100 ms
    pack_b_fn_outer 13      0.007400 ms
    matmul_fn_outer 14      0.003400 ms
    pack_b_fn_outer 14      0.007800 ms
    matmul_fn_outer 15      0.003700 ms
    pack_b_fn_outer 15      0.008300 ms
    matmul_fn_outer 16      0.004000 ms
    pack_b_fn_outer 16      0.008800 ms
    matmul_fn_outer 17      0.004400 ms
    pack_b_fn_outer 17      0.009199 ms
    matmul_fn_outer 18      0.004800 ms
    pack_b_fn_outer 18      0.009599 ms
    matmul_fn_outer 19      0.005100 ms
    pack_b_fn_outer 19      0.010099 ms
    matmul_fn_outer 20      0.005400 ms
    pack_b_fn_outer 20      0.010599 ms
    matmul_fn_outer 21      0.006000 ms
    pack_b_fn_outer 21      0.011299 ms
    matmul_fn_outer 22      0.006300 ms
    pack_b_fn_outer 22      0.011899 ms
    matmul_fn_outer 23      0.006500 ms
    pack_b_fn_outer 23      0.012299 ms
    matmul_fn_outer 24      0.006701 ms
    pack_b_fn_outer 24      0.012699 ms
    matmul_fn_outer 25      0.006901 ms
    pack_b_fn_outer 25      0.013099 ms
    matmul_fn_outer 26      0.007101 ms
    pack_b_fn_outer 26      0.013399 ms
    matmul_fn_outer 27      0.007300 ms
    pack_b_fn_outer 27      0.013799 ms
    matmul_fn_outer 28      0.007401 ms
    pack_b_fn_outer 28      0.014100 ms
    matmul_fn_outer 29      0.007601 ms
    pack_b_fn_outer 29      0.014600 ms
    matmul_fn_outer 30      0.007801 ms
    pack_b_fn_outer 30      0.015000 ms
    matmul_fn_outer 31      0.007901 ms
    pack_b_fn_outer 31      0.015399 ms
    matmul_fn_outer 32      0.008101 ms
    pack_b_fn_outer 32      0.015699 ms
    matmul_fn_outer 33      0.008301 ms
    pack_b_fn_outer 33      0.015999 ms
    matmul_fn_outer 34      0.008601 ms
    pack_b_fn_outer 34      0.016...
    
  • Merged PR 3152: [nfc] [test] Skip fast_exp mlas tests on unsupported
    Aarch64. [Lisa Ong]

    These tests generate llvm.x86.avx.max.ps.256, which is not supported on non-Intel processors such as the Apple M1:

      %28 = load <8 x float>, <8 x float>* %27, align 4, !dbg !19
      %29 = call <8 x float> @llvm.x86.avx.max.ps.256(<8 x float> %28, <8 x float> <float 0xC0561814A0000000, float 0xC0561814A0000000, float 0xC0561814A0000000, float 0xC0561814A0000000, float 0xC0561814A0000000, float 0xC0561814A0000000, float 0xC0561814A0000000, float 0xC0561814A0000000>), !dbg !20
      %30 = call <8 x float> @llvm.fmuladd.v8f32(<8 x float> %29, <8 x float> <float 0x3FF7154760000000, float 0x3FF7154760000000, float 0x3FF7154760000000, float 0x3FF7154760000000, float 0x3FF7154760000000, float 0x3FF7154760000000, float 0x3FF7154760000000, float 0x3FF7154760000000>, <8 x float> <float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000>), !dbg !21
      %31 = fsub <8 x float> %30, <float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000>, !dbg !22
    
    

Full Changelog: v1.2.24...v1.2.25

v1.2.24

10 Mar 07:49

What's Changed

  • Merged PR 3150: Change high precision fp to not perform contraction.
    [Mason Remy]

    Also change value library FMA to use the math dialect FmaOp and
    vectorize to the vector dialect FMAOp

  • Merged PR 3147: Fix vector cast with same bitwidth. [Mason Remy]

    accv.cast vector<16xi8> to vector<16xui8>
    was erroneously lowering to
    cast vector<16xi8> to ui8

  • Merged PR 3149: Improve 1-D horizontal sum reductions for 8xf32 and
    8xi32. [Mason Remy]
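
    For reference, a sketch of the kind of 1-D sum reduction this targets, in the DSL style used elsewhere in these notes (shapes assumed):

        Input = Array(role=Role.INPUT, element_type=ScalarType.float32, shape=(8, ))
        Output = Array(role=Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(1, ))

        nest = Nest(shape=(8, ))
        i, = nest.get_indices()

        @nest.iteration_logic
        def _():
            Output[0] += Input[i]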

  • Merged PR 3148: Adds Package level FP precision override. [Kern Handa]

  • Merged PR 3144: Removes fp precision as an option for Package.build.
    [Kern Handa]

    The fp-contract option used in accc.py was overriding the recently added per-function fp precision specification. Since each function now has an equivalent default, we no longer need to pass the option to llc and opt at build time.

  • Merged PR 3143: Add dsl test for profiling op. [Denny Sun]

    1. Add a profiling enable flag to Package.build()
    2. Add a DSL test
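
    A hypothetical call shape (the exact flag name is not shown in these notes):

        package.build(name="profiled_package", enable_profiling=True)  # flag name assumed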
  • Merged PR 3022: Assert the arg order in debug mode. [Denny Sun]

    Dimension args should precede array args in the arg list for debug mode.
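
    For example, mirroring the package.add calls elsewhere in these notes (names illustrative):

        # Dimension args (M, N) precede array args (Input, Output) in debug mode.
        package.add(nest, args=(M, N, Input, Output), base_name="debug_fn")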

  • Merged PR 3137: expose profiling function to DSL. [Denny Sun]

  • Merged PR 3142: [Release] Tie accera-llvm versioning to LLVM version.
    [Lisa Ong]

    This change introduces a new versioning schema for accera-llvm that follows LLVM's versioning, while allowing for Accera versioned forks:

    <llvm_major>.<llvm_minor>.<llvm_micro><accera_micro> = (N+).(N+).(N+)(N{2})

    This overloads the micro version field due to constraints on Python versioning: https://peps.python.org/pep-0440/

    Examples:

    • Current LLVM fork is 14.0.6-2: accera_llvm.14.0.602, which means LLVM 14.0.6 + accera fork v2
    • If/when upgrading to LLVM 15.0.7: accera_llvm.15.0.700
    • Then when we rev the Accera fork to LLVM 15.0.7-1: accera_llvm.15.0.701
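
    A minimal sketch of how the schema composes a version string (pure formatting; the function name is illustrative):

        def accera_llvm_version(llvm_major, llvm_minor, llvm_micro, accera_micro):
            # Overload the micro field: <llvm_micro> followed by a two-digit fork number.
            return f"{llvm_major}.{llvm_minor}.{llvm_micro}{accera_micro:02d}"

        assert accera_llvm_version(14, 0, 6, 2) == "14.0.602"  # LLVM 14.0.6 + Accera fork v2
        assert accera_llvm_version(15, 0, 7, 0) == "15.0.700"  # LLVM 15.0.7, fork v0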

    Limitations:

    • We don't expect Accera's fork to span beyond 2-digit versions

    Alternatives:

    • Omit the 0 delimiters, if we think it is unlikely that Accera forks will rev micro versions beyond single-digit. Accera forks may rev more often if we don't update LLVM.
    • Use a dev version, e.g. accera_llvm.14.0.6.dev4. Downside is that this looks unofficial - devN is intended for developmental releases rather than official PyPI releases. That said, the whole Accera project is developmental :)
  • Merged PR 3139: Allows setting precision of fp ops per function. [Kern
    Handa]

  • Merged PR 3140: Fix bug with reinterpret casts of unrealized
    conversion casts. [Mason Remy]

    This happens when we do a heap alloc followed by a reinterpret cast, but
    it can come up in other scenarios too

  • Merged PR 3135: [nfc] Add XeonE5 benchmark machine to targets, bump
    hatlib dependency. [Lisa Ong]

    Best guesses at cache sizes and cache lines from: https://en.wikichip.org/wiki/intel/xeon_e5/e5-2673_v4

Full Changelog: v1.2.23...v1.2.24

v1.2.23

02 Mar 22:04

What's Changed

  • Merged PR 3131: Set masked load/store inbounds flag to true. [Mason
    Remy]

    The mask we generate, as well as the rest of our infrastructure, prevents out-of-bounds accesses when used properly, so for performance reasons we don't want MLIR to generate runtime bounds checks.

  • Merged PR 3130: Recognize and simplify always true EQ and NE CmpOps.
    [Mason Remy]

    These would already get simplified after converting to the builtin
    dialects, but this makes them happen earlier in the lowering

  • Merged PR 3129: Optimize 1-row horizontal i16->i32 sum reduction.
    [Mason Remy]

  • Merged PR 3118: vectorize accumulation of results of two masked load
    ops. [JUBI TANEJA]

    This PR vectorizes a pattern that occurs in MMIF where there are two conditional loads, followed by an accumulation operation, and a conditional store. On vectorizing the following DSL:

            N_input = 8
            N_output = 5
            Input = Array(role=Role.INPUT, element_type=ScalarType.int32, shape=(N_input, ))
            Output = Array(role=Role.INPUT_OUTPUT, element_type=ScalarType.int32, shape=(N_output, ))
            nest = Nest(shape=(N_input, ))
            i, = nest.get_indices()
    
            @nest.iteration_logic
            def _nest():
    
                def store_value():
                    Output[i] += Input[i]
    
                _If(i < N_output, store_value)
    

    It produces the following assembly. We are looking for vpmaskmovd instructions that correspond to vector.transfer_read/vector.transfer_write ops in MLIR.

    0000000000000030 <test_vectorized_masked_accumulate_3e5de44f3dcca64e>:
      30:   c5 fd 6f 05 00 00 00    vmovdqa 0x0(%rip),%ymm0        # 38 <test_vectorized_masked_accumulate_3e5de44f3dcca64e+0x8>
      37:   00
      38:   c4 e2 7d 8c 0e          vpmaskmovd (%rsi),%ymm0,%ymm1
      3d:   c4 e2 7d 8c 17          vpmaskmovd (%rdi),%ymm0,%ymm2
      42:   c5 ed fe c9             vpaddd %ymm1,%ymm2,%ymm1
      46:   c4 e2 7d 8e 0e          vpmaskmovd %ymm1,%ymm0,(%rsi)
      4b:   c5 f8 77                vzeroupper
      4e:   c3                      retq
    
  • Merged PR 3126: [test] Adds more tests for vectorized transpose. [Kern
    Handa]

  • Merged PR 3121: [nfc] Separate bounds checking into separate pass
    file. [Mason Remy]

    This removes the bounds checking code from
    ExecutionPlanToAffineLoweringPass and creates a separate pass file for
    it. There is no change in when and where the checking occurs (currently
    it only happens for caching-generated loads and stores).

    In a future change we will further separate the pass and run it at a
    different phase of the lowering and plumb controls for
    enabling/disabling it to the DSL

  • Merged PR 3122: Fix reinterpret_cast output memref shape. [Mason Remy]

  • Merged PR 3115: Normalize AffineForOps to have unit stride and begin
    at 0. [Mason Remy]

  • Merged PR 3117: Vectorize horizontal multi-dim sum reductions. [Mason
    Remy]

    Recognizes and vectorizes these sum reductions:
    4x16xi16 -> 4x1xi32
    4x8xi32 -> 4x1xi32
    4x8xf32 -> 4x1xf32
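
    A sketch of the 4x8xf32 -> 4x1xf32 case in the DSL style used elsewhere in these notes (shapes taken from the list above):

        In = Array(role=Role.INPUT, element_type=ScalarType.float32, shape=(4, 8))
        Out = Array(role=Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(4, 1))

        nest = Nest(shape=(4, 8))
        i, j = nest.get_indices()

        @nest.iteration_logic
        def _():
            Out[i, 0] += In[i, j]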

  • Merged PR 3099: Adds pattern rewriting for AVX2 vectorized transpose.
    [Kern Handa]

Full Changelog: v1.2.22...v1.2.23

v1.2.22

24 Feb 05:04

What's Changed

  • Merged PR 3107: Make vectorization happen after inlining and
    simplification. [Mason Remy]

    This change fills out the vectorization passes and removes vectorization from LoopNestToValueFunc. It also fixes some bugs that were exposed in the process.

    Since vectorization is now a separate pass, MLIR FileCheck lit tests can be run more easily. This change adds the initial file with one test, but we should continue expanding this test suite.

  • Merged PR 3108: extend vectorization for masked store case. [JUBI
    TANEJA]

  • Merged PR 3109: Set conan version < 2.0.0. [Mason Remy]

    Our infra isn't set up for the new Conan 2 behavior, so pin our usage to version 1 until we take the upgrade intentionally.

  • Merged PR 3104: Position fusing dim after the fused dimensions.
    [Captain Jack Sparrow]

  • Merged PR 3096: Add "RelWithDebInfo"-like option to accc. [Chuck
    Jacobs]

    This PR adds another option to the Options flag for AcceraProject.generate_and_emit to keep some debug info (frame pointers) around when building the Accera project. This can be helpful when trying to interpret perf profiler output.

Full Changelog: v1.2.21...v1.2.22

v1.2.21

20 Feb 09:58

What's Changed

  • Merged PR 3101: [build] install pkg-config for macos buddy builds.
    [Lisa Ong]

    Fixes the macOS packaging build failure:

    https://intelligentdevices.visualstudio.com/ELL/_build/results?buildId=47235&view=results

  • Merged PR 3098: [nfc] Move vectorization code to separate files.
    [Mason Remy]

    Moves vectorization code out of ExecutionPlanToAffineLoweringPass, in preparation for separating out a vectorization pass that can run at a later stage than vectorization currently does.

  • Merged PR 3100: Adds CMake dependencies to acc-translate to ensure
    correct build. [Kern Handa]

  • Merged PR 3095: Remove duplicate SubArray class. [Mason Remy]

  • Merged PR 3073: vectorize masked load store. [JUBI TANEJA]

    This PR handles vectorization specifically for a masked buffer fill, where the output size is larger than the input. There is a conditional load and vector store.

    Given the nest:

            @nest.iteration_logic
            def _nest():
                def store_value():
                    Output[i] = Input[i]
                def store_zero():
                    Output[i] = 0
                _If(i < N_input, store_value).Else(store_zero)
    

    The unoptimized MLIR is as follows:

      %c0_i32 = arith.constant 0 : i32
      %c5 = arith.constant 5 : index
      "accv.lambda"() ({
        affine.for %arg2 = 0 to 8 {
          %0 = "accv.cmp"(%arg2, %c5) {predicate = 2 : i64} : (index, index) -> i1
          scf.if %0 {
            %1 = affine.load %arg0[%arg2] : memref<5xi32>
            affine.store %1, %arg1[%arg2] : memref<8xi32>
          } else {
            affine.store %c0_i32, %arg1[%arg2] : memref<8xi32>
          }
        }
    

    On vectorizing this for loop, we get the vectorized MLIR (simplified version) as follows:

      %c5 = arith.constant 5 : index
      %cst = arith.constant dense<false> : vector<8xi1>
      %c0 = arith.constant 0 : index
      %c1 = arith.constant 1 : index
      %c2 = arith.constant 2 : index
      %c3 = arith.constant 3 : index
      %c4 = arith.constant 4 : index
      %c6 = arith.constant 6 : index
      %c7 = arith.constant 7 : index
      %c0_i32 = arith.constant 0 : i32
      "accv.lambda"() ({
        affine.for %arg2 = 0 to 8 step 8 {
    
          %7 = "accv.cmp"(%arg2, %c5) {predicate = 2 : i64} : (index, index) -> i1
          %9 = "accv.cmp"(%0, %c5) {predicate = 2 : i64} : (index, index) -> i1
          %11 = "accv.cmp"(%1, %c5) {predicate = 2 : i64} : (index, index) -> i1
          %13 = "accv.cmp"(%2, %c5) {predicate = 2 : i64} : (index, index) -> i1
          %15 = "accv.cmp"(%3, %c5) {predicate = 2 : i64} : (index, index) -> i1
          %17 = "accv.cmp"(%4, %c5) {predicate = 2 : i64} : (index, index) -> i1
          %19 = "accv.cmp"(%5, %c5) {predicate = 2 : i64} : (index, index) -> i1
          %21 = "accv.cmp"(%6, %c5) {predicate = 2 : i64} : (index, index) -> i1
    
          %23 = memref.reinterpret_cast %arg0 to offset: [0], sizes: [5], strides: [1] : memref<5xi32> to memref<5xi32>
          %24 = vector.transfer_read %23[%arg2], %c0_i32, %22 : memref<5xi32>, vector<8xi32>
    
          %25 = memref.reinterpret_cast %arg1 to offset: [0], sizes: [8], strides: [1] : memref<8xi32> to memref<8xi32>
          vector.store %24, %25[%arg2] : memref<8xi32>, vector<8xi32>
        }
    
  • Merged PR 3093: Add meaningful error messages for c++ exceptions.
    [Captain Jack Sparrow]

  • Merged PR 3092: Add type size getter utility. [Captain Jack Sparrow]

  • Merged PR 3074: Add rudimentary pass to fix redundant load/store
    issue. [Chuck Jacobs]

    This PR adds a simple pattern to ValueSimplifyPass that looks for the redundant load/store pattern we often see at the end of kernels and removes it.

  • Merged PR 3075: Enable fast_exp operation. [Chuck Jacobs]

    This PR makes a few changes to enable the fast_exp operation:

    • Adds fast_exp to the python DSL (see the sketch after this list)
    • Enables vectorization of abs instruction (which is used by fast_exp)

    It also makes a couple of other minor changes:

    • Improves auto-naming of nest indices
    • Better support for using custom LLVM builds with Accera
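
    A minimal sketch of using the new fast_exp from iteration logic (import location assumed):

        from accera import fast_exp  # assumed import location

        @nest.iteration_logic
        def _():
            Output[i] = fast_exp(Input[i])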
  • Merged PR 3088: Support dynamic sub_array shape, split_dim size.
    [Mason Remy]

    This still requires that the sizes are static before lowering, but it
    supports dynamic sizes temporarily before inlining into an outer static
    function

  • Merged PR 3078: Adds reinterpret_cast functionality to Array. [Kern
    Handa]
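
    A hypothetical usage sketch (the method name comes from the PR title; the exact signature is assumed):

        # Hypothetical: view a float32 array as int32 without copying (same bitwidth).
        A = Array(role=Role.INPUT, element_type=ScalarType.float32, shape=(16, ))
        A_view = A.reinterpret_cast(ScalarType.int32)  # assumed signature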

  • Merged PR 3070: Fixes for sub_array and _split_dimension. [Mason Remy]

    This fixes the sub array and split dim ops to work with the Accera codebase that has been updated around them. Some MemoryLayout assumptions are getting in the way and have been disabled in the short term; longer term, our memory layout behavior should more closely match what MLIR affine maps can represent, for more generalized dynamic support.

  • Merged PR 3063: Refactor Dimension with C++ backend container class
    and a few other fixes. [Captain Jack Sparrow]

    • Refactor Dimension with C++ backend container (ScalarDimension)
    • Enable output scalar variables
    • Fix dynamic sized TEMP arrays
  • Merged PR 3072: Bump hatlib version to 0.0.34, skip unsupported test
    on arm64 macOS, minor targets doc update. [Lisa Ong]

    Update hatlib version since there is no incompatibility

Full Changelog: v1.2.20...v1.2.21

v1.2.20

09 Feb 03:35

What's Changed

  • Merged PR 3070: Fixes for sub_array and _split_dimension [Mason Remy]

    This fixes the sub array and split dim ops to work with the Accera codebase that has been updated around them. Some MemoryLayout assumptions are getting in the way and have been disabled in the short term; longer term, our memory layout behavior should more closely match what MLIR affine maps can represent, for more generalized dynamic support.

  • Merged PR 3063: Refactor Dimension with C++ backend container class and a few other fixes [Captain Jack Sparrow]

    • Refactor Dimension with C++ backend container (ScalarDimension)
    • Enable output scalar variables
    • Fix dynamic sized TEMP arrays
  • Merged PR 3072: Bump hatlib version to 0.0.34, skip unsupported test on arm64 macOS, minor targets doc update [Lisa Ong]

    Update hatlib version since there is no incompatibility

Full Changelog: v1.2.19...v1.2.20