Releases: microsoft/Accera

v1.2.19

03 Feb 07:34

What's Changed

  • Merged PR 3069: Set target device features on module and check when
    matching avx2/512 ops. [Mason Remy]

  • Merged PR 3060: Adds support for sqrt op in acc-translate. [Kern
    Handa]

Full Changelog: v1.2.18...v1.2.19

v1.2.18

26 Jan 00:22

What's Changed

  • Merged PR 3055: Move value unrolling to after function inlining and
    loop simplification. [Mason Remy]

    This enables dynamically-sized inner functions that get inlined into
    statically-sized regions to have loop unrolling applied to their
    effectively static-sized loops when possible

  • Merged PR 3053: Add package.build flags for building with higher-
    precision FP vector ops. [Mason Remy]

    Setting this new flag prevents a vmulps -> vaddps sequence
    from being contracted into a vfmaddps
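
    For context: a fused multiply-add computes a*b + c with a single rounding,
    while a separate vmulps/vaddps pair rounds twice, so contraction can change
    the low-order bits. A numpy sketch (not Accera API) emulating the float32
    difference:

        import numpy as np

        a = np.float32(1 + 2**-12)
        b = np.float32(1 + 2**-12)
        c = np.float32(-(1 + 2**-11))

        # Separate mul/add: the product rounds to float32 before the add
        two_roundings = a * b + c   # -> 0.0
        # Fused mul/add: exact product, then a single rounding (emulated via float64)
        one_rounding = np.float32(np.float64(a) * np.float64(b) + np.float64(c))  # -> 2**-24

        print(two_roundings, one_rounding)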

  • Merged PR 3052: Place heap allocations at the top level of the
    function. [Mason Remy]

  • Merged PR 3050: [non-func, API] Change Nest.get_shape() to always
    return a list. [Captain Jack Sparrow]

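    A minimal sketch of the new contract (assuming the usual Nest constructor):

        from accera import Nest

        # As of this change, get_shape() always returns a list,
        # even for one-dimensional nests:
        assert Nest(shape=(16, 32)).get_shape() == [16, 32]
        assert Nest(shape=(64,)).get_shape() == [64]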

  • Merged PR 3030: Include acc-translate whenever accera is installed.
    [Lisa Ong]

    Perhaps a longer-term fix is to merge the accera-gpu package into accera-compilers so we have one less package to maintain.

    However, that adds constraints to the binary size of acc-opt (to not push us past the 100MB PyPI hard limit), so punting until we have cycles for this.

  • Merged PR 3035: [nfc] Adds my machine to targets.py. [Kern Handa]

Full Changelog: v1.2.17...v1.2.18

v1.2.17

18 Jan 01:02

What's Changed

  • Merged PR 3029: Work around constraint resolution issues with dynamic
    split size 1. [Mason Remy]

Full Changelog: v1.2.16...v1.2.17

v1.2.16

16 Jan 02:05

What's Changed

  • Merged PR 3027: Hack required to use Array as output element argument
    (Dimension). [Captain Jack Sparrow]

  • Merged PR 3025: Add arg name and size string required for hat
    metadata. [Captain Jack Sparrow]

  • Merged PR 3017: Output array supports gather function. [Denny Sun]

    Add the DSL test for the gather function.
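
    The note doesn't spell out the DSL signature; for reference, a numpy sketch
    of the usual gather semantics being tested:

        import numpy as np

        data = np.array([10.0, 20.0, 30.0, 40.0], dtype=np.float32)
        indices = np.array([3, 0, 2], dtype=np.int64)
        output = data[indices]  # gather: output[n] = data[indices[n]]
        print(output)           # [40. 10. 30.]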

Full Changelog: v1.2.15...v1.2.16

v1.2.15

12 Jan 03:22

What's Changed

  • Merged PR 3018: Use VS 17.4.3-built binaries. This is in a separate
    channel to allow older versions to keep working. [Mason Remy]

  • Merged PR 3012: Correctness check for output array support for range
    node. [Denny Sun]

    A successful correctness check means output array support works end to end.

  • Merged PR 3015: Update hatlib version to support floating type as
    function arg. [Denny Sun]

  • Merged PR 3010: Disable BinOp simplification for floating types.
    [Captain Jack Sparrow]

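    A plausible motivation (not stated in the PR note): integer-style BinOp
    identities are unsound for floats because of NaN/Inf and rounding, e.g.:

        import math

        x = float("nan")
        print(x * 0.0)                                  # nan, not 0.0
        print(math.inf - math.inf)                      # nan, not 0.0
        print(0.1 + (0.2 + 0.3) == (0.1 + 0.2) + 0.3)   # False: float + is not associative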

  • Merged PR 3013: Apply major version in docs. [Lisa Ong]

    Removes the need to update docs versions every time we release

  • Merged PR 2981: Prologue and Epilogue op support with tensorization
    and caching. [Captain Jack Sparrow]

    • Add optional prologue and epilogue support for tensorization
    • Supported gemm parameters with fragment ops are: {alpha: 1, beta: any} and {alpha: >1, beta: 0}
    • ReLU, SET, and SCALE added as standard fragment ops (reference sketch below)

    Related work items: #3704
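
    Reference semantics sketch (numpy, not the Accera API) for a tensorized
    GEMM with a ReLU epilogue fragment in the supported {alpha: 1, beta: any}
    case:

        import numpy as np

        def gemm_relu_epilogue(A, B, C, alpha=1.0, beta=1.0):
            # Tensorization accumulates alpha*(A @ B) + beta*C; the epilogue
            # fragment op (ReLU here) is then applied to the accumulated tile.
            return np.maximum(alpha * (A @ B) + beta * C, 0.0)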

Full Changelog: v1.2.14...v1.2.15

v1.2.14

15 Dec 09:29

What's Changed

  • Merged PR 3001: [test] Expect failures on macos for x86 intrinsics
    tests. [Lisa Ong]

    macOS does not support x86 and x86 AVX intrinsics

  • Merged PR 3000: Expect failures for macos in vpmaddwd tests. [Lisa
    Ong]

  • Merged PR 2994: Bump hatlib to 0.0.32. [Lisa Ong]

  • Merged PR 2997: Support more casting cases in vpmaddwd matcher. [Mason
    Remy]

  • Merged PR 2996: [release] bump docs to 1.2.14 for next release. [Lisa
    Ong]

Full Changelog: v1.2.13...v1.2.14

v1.2.13

14 Dec 10:10

What's Changed

  • Merged PR 2987: Add support for max/min/round ops and vectorizing
    those ops. [Mason Remy]

  • Merged PR 2963: Control TEMP array allocation location. [Mason Remy]

  • Merged PR 2962: Expand vpmaddwd matching and add intrinsic call.
    [Mason Remy]

    Matches more vpmaddwd cases and creates a pathway to invoking the LLVM
    intrinsic directly.

  • Merged PR 2961: Match more vectorization patterns and support
    vectorized cast. [Mason Remy]

    Tries to match and rewrite vectorization patterns:

    • 2-loop interleaving store -> vector shuffle and store
    • simple horizontal reductions (not always efficient currently)
    • vectorized casts

    Vectorizing a non-innermost loop now performs a per-op "in-place" unroll
    and vectorizes the innermost loop (see the sketch below).
    TODO: update the documentation to describe this behavior better
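
    A DSL-level sketch (shapes and names chosen here, not from the PR):

        from accera import Array, Nest, ScalarType

        A = Array(shape=(64, 32), element_type=ScalarType.float32,
                  role=Array.Role.INPUT_OUTPUT)
        nest = Nest(shape=(64, 32))
        i, j = nest.get_indices()

        @nest.iteration_logic
        def _():
            A[i, j] += 1.0

        schedule = nest.create_schedule()
        plan = schedule.create_plan()
        # Vectorizing the non-innermost index i now unrolls each op "in place"
        # and vectorizes the innermost loop (j):
        plan.vectorize(i)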

  • Merged PR 2960: Enable marking functions as no-inline-into. [Mason
    Remy]

    Functions marked no-inline-into won't inline calls to other functions
    within their body. This is a useful compiler-performance (not emitted-code
    performance) optimization when we have many nested function calls.
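
    The note doesn't show the API spelling, so the keyword below is
    hypothetical; the intent is that calls inside the marked function's body
    are left un-inlined:

        # Hypothetical keyword; see the PR for the real mechanism.
        function = package.add(nest, args=(A, B, C), base_name="outer_fn",
                               no_inline_into=True)  # assumed name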

  • Merged PR 2986: [output array] Emit range function with input_output
    type arguments. [Denny Sun]

    Instead of the output type, we use input_output to generate two functions
    for the Range function. Accera can now successfully generate code for the
    range function.

    # Generate functions like:
    # get_size(float start, float limit, float delta, int64_t* output_dim);
    # get_array(int64_t input_dim, float* output, float start, float delta);
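
    The implied calling pattern, given those signatures (Python-side sketch;
    in practice these are invoked through the built HAT package):

        import numpy as np

        output_dim = np.zeros(1, dtype=np.int64)
        get_size(start, limit, delta, output_dim)        # 1) query the output size
        output = np.empty(output_dim[0], dtype=np.float32)
        get_array(output_dim[0], output, start, delta)   # 2) fill the array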
    
  • Merged PR 2959: Improved affine for op range simplification. [Mason
    Remy]

    Add range value / constant-cmp-result patterns and affine for op range
    simplifications to the affine simplification pass and run it after
    inlining functions.
    When inlining a dynamically-sized function into a statically-sized
    function, this change is useful for resolving the dynamic ranges to
    constants and pruning dynamic-range loops that are not needed given the
    specific constant value being used.

  • Merged PR 2958: Hack to erase loops in a nest to support nest-of-nest or
    overfused scenarios. [Mason Remy]

    This change enables an action plan to erase loops. Typically this would
    be used when an outer nest traverses tiles and invokes an inner nest (or
    multiple nests) which operate within each tile. The outer nest still
    needs to cover the full iteration space; however, after splitting by the
    tile sizes, a user will not want the outer nest to perform the inner
    loops itself.

  • Merged PR 2985: [release] Rev docs to 1.2.13. [Lisa Ong]

  • Merged PR 2983: Increase timeouts of GPU benchmarks. [Captain Jack
    Sparrow]

  • Merged PR 2982: Work around bug with redundant splits of dynamic
    dimensions. [Mason Remy]

  • Merged PR 2972: Build both static and dynamic binaries by default, put
    both in aux dependencies. [Kern Handa]

  • Merged PR 2975: Updates llc/opt build flags to enable more
    optimizations by default. [Kern Handa]

  • Merged PR 2977: Updates CMake to do FindPython before pybind11 config.
    [Kern Handa]

  • Merged PR 2955: Reduce Linux PR runtime to under 60mins. [Lisa Ong]

    Filter DEV_MODE reruns to dsl_tests.py; this is not comprehensive and is a best effort.

Full Changelog: v1.2.12...v1.2.13

v1.2.12

21 Nov 03:02

What's Changed

  • Merged PR 2953: Workaround debug mode failures with dimension argument
    ordering. [Lisa Ong]

    • Order dimension arguments after Array args (sketched after this list) to avoid this lowering issue in Debug mode (until Debug mode is fixed):
    test_all_dynamic_sizes_static_unroll_matmul_llvm.mlir:236:28: error: use of value '%7' expects different type than prior uses: 'i64' vs '!llvm.struct<(ptr<f32>, ptr<f32>, i64, array<2 x i64>, array<2 x i64>)>'
        %42 = llvm.insertvalue %7, %41[3, 0] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<2 x i64>, array<2 x i64>)>
                               ^
    /Users/lisaong/work/staging/Accera/build/lib.macosx-11.1-arm64-3.10/test_acccgen/test_all_dynamic_sizes_static_unroll_matmul/_tmp/test_all_dynamic_sizes_static_unroll_matmul/test_all_dynamic_sizes_static_unroll_matmul_llvm.mlir:201:5: note: prior use here
        %7 = llvm.insertvalue %arg6, %6[4, 1] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<2 x i64>, array<2 x i64>)>
        ^
    
    • Enable DEV_MODE tests in one CI pipeline so that we can catch these in the future
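
    The workaround, sketched on the usual dynamic-size function signature
    (names as in the v1.2.10 test further below):

        # Before: dimension args first - triggers the Debug-mode lowering issue
        package.add(nest, args=(M, N, K, A, B, C), base_name="matmul")
        # Workaround: order the dimension args after the Array args
        package.add(nest, args=(A, B, C, M, N, K), base_name="matmul")
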
  • Merged PR 2950: [Release] Rev docs to v1.2.12. [Lisa Ong]

    In preparation for 1.2.12 release EOW

  • Merged PR 2946: Fix hierarchical partial fusing. [Mason Remy]

    Index attributes in fragment predicate ops weren't getting updated as
    part of fusion mapping old indices to new fused indices. This fix is a
    quick change to recursively walk predicates and update their index
    attributes manually.
    In the future we could use SymbolicIndexOps and rely on
    BlockAndValueMapping replacements in clone, however this will also
    require that we don't create as many duplicate SymbolicIndexOps for the
    same Index

  • Merged PR 2942: Hold onto intermediate split indices when fusing.
    [Mason Remy]

    When we split a loop multiple times, the outer index references the
    inner intermediate split indices in affine expressions, even if those
    indices get further split and are no longer loop indices. We have been
    discarding them because they aren't loop indices or dimension indices,
    but they wound up getting re-added to the transformed domain by
    serialization and this led to fusion bugs.

  • Merged PR 2834: match and rewrite a pattern to vectorize int16 matmul.
    [JUBI TANEJA]

    This rewrite rule matches the jj and kk loops in int16 matmul, where the outer loop jj {0..8} is followed by an inner loop kk {0..2}. It vectorizes the jj and kk loops and replaces each affine op with a vectorized op. At the end, it generates the vpmaddwd instruction for the matmul (see the sketch below).
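
    For reference, vpmaddwd multiplies packed signed 16-bit lanes and adds
    adjacent pairs into 32-bit accumulators; a numpy sketch of its semantics:

        import numpy as np

        def vpmaddwd(a, b):
            # a, b: int16 vectors of even length
            prod = a.astype(np.int32) * b.astype(np.int32)
            return prod[0::2] + prod[1::2]  # pairwise horizontal add

        a = np.arange(16, dtype=np.int16)
        b = np.ones(16, dtype=np.int16)
        print(vpmaddwd(a, b))  # [ 1  5  9 13 17 21 25 29]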

  • Merged PR 2918: Support vectorization and static size caching for split
    dynamic range loops. [Mason Remy]

  • Merged PR 2914: Support static loop splits of dynamic sized ranges.
    [Mason Remy]

    This change creates a specialization of the AffineConstraintsHelper that
    works with Loopnest concepts and uses that in LoopNestBuilder to update
    the loop split generation
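
    A DSL-level sketch of what this enables (harness details assumed):

        from accera import Array, Dimension, Nest, ScalarType

        N = Dimension()  # runtime size
        A = Array(shape=(N, 16), element_type=ScalarType.float32,
                  role=Array.Role.INPUT_OUTPUT)
        nest = Nest(shape=(N, 16))
        i, j = nest.get_indices()

        @nest.iteration_logic
        def _():
            A[i, j] += 1.0

        schedule = nest.create_schedule()
        ii = schedule.split(i, 4)  # static split size over the dynamic range N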

  • Merged PR 2911: Support dynamic ranges in ScheduledLoopOp. [Mason
    Remy]

  • Merged PR 2907: Implement initial affine constraint helper for dynamic
    size loop handling. [Mason Remy]

    Implements a wrapper around mlir::FlatAffineValueConstraints and a set
    of low-level tests using it that enable static-sized splitting of
    dynamic loop ranges

  • Merged PR 2935: Remove thread coarsening factor > 4 from GPU
    benchmarks. [Captain Jack Sparrow]

  • Merged PR 2932: Upgrade to CUDA 11.8. [Captain Jack Sparrow]

  • Merged PR 2931: Update to ROCm 5.3. [Captain Jack Sparrow]

  • Merged PR 2926: Plumb parameter usages to emitted HAT files. [Lisa
    Ong]

  • Merged PR 2927: Reduce benchmark configs using thread coarsening.
    [Captain Jack Sparrow]

  • Merged PR 2925: Add optional optimization hint for number of thread
    blocks per SM. [Captain Jack Sparrow]

    Related work items: #3736

Full Changelog: v1.2.11...v1.2.12

v1.2.11

18 Oct 07:45

What's Changed

  • Merged PR 2924: Update hatlib dependency in setup.cfg, add comment.
    [Lisa Ong]

  • Merged PR 2922: [Github] Update vcpkg. [Lisa Ong]

    Updates vcpkg (commit c2177e6).

  • Merged PR 2910: Updates hatlib dependency to 0.0.29. [Kern Handa]

  • Merged PR 2905: Fix internal param name in GPU benchmarks. [Captain
    Jack Sparrow]

  • Merged PR 2902: Increase ROCm baseline benchmark timeout to 10 hours.
    [Captain Jack Sparrow]

    • Increase ROCm baseline benchmark timeout to 10 hours
    • Add category to the gemm input for classification
  • Merged PR 2901: Increase ROCm baseline timeout to 7 hours. [Captain
    Jack Sparrow]

  • Merged PR 2900: Prune gemm benchmark input for big sizes by removing
    NT and TT configs. [Captain Jack Sparrow]

    • Prune gemm benchmark input for big sizes by removing NT and TT configs
    • Disable verification for resnet sizes
    • Fix baseline tagging for pytorch
  • Merged PR 2896: Dynamic shared memory allocation support. [Captain
    Jack Sparrow]

    • Add optional param in plan.cache for the memory offset
    • Add optional param in schedule.create_plan for the total dynamic memory size in bytes (both sketched below)
    • Update benchmarks to allow dynamic shared memory usage

    Related work items: #3735
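
    A sketch of the shape of these options (parameter names here are
    hypothetical; the note doesn't give exact spellings):

        plan = schedule.create_plan(target,
                                    dynamic_shared_memory_bytes=16384)  # assumed name
        plan.cache(A, index=ii,
                   offset=0)  # assumed name for the memory-offset param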

  • Merged PR 2898: Add pytorch gemm implementation for GPU benchmark
    baselines. [Ritwik Das]

  • Merged PR 2897: Generalize partial dynamic size support. [Mason Remy]

    More generically plumbs through the mappings from arrays to the args
    that provide their dimension sizes.

    This also generalizes dynamic size support beyond matmul scenarios.

    Note: due to assumptions in the debug mode plumbing, the size arguments
    still must occur first in the argument list, and a later PR should
    generalize that

  • Merged PR 2894: Add one test case for partially dynamic sized array.
    [Denny Sun]

  • Merged PR 2891: [nfc][release] Rev docs to 1.2.11. [Lisa Ong]

  • Merged PR 2882: Add tests for thread coarsening and update GPU
    benchmarks. [Ritwik Das]

    Related work items: #3684

  • Merged PR 2890: Add folding scenario for cast ops where the only
    downcasts are internally-generated. [Mason Remy]

    This is useful for converting uint8*uint8->uint8 computations to
    int16*int16->int32 using cache element types, as is needed in the
    vpmaddwd matmul scenario

  • Merged PR 2889: [refactoring] Prevent overloading of keyword "Tensor" -
    disambiguate with "MMAFragment". [Ritwik Das]

Full Changelog: v1.2.10...v1.2.11

v1.2.10

29 Sep 01:33

What's Changed

  • Merged PR 2886: [release] Bump docs to 1.2.10, sync GH to ADO. [Lisa
    Ong]

    • Bulk docs version update
    • Bump protobuf from 3.20.1 to 3.20.2 in /accera/onnx-emitter/test (d1b87ec)
    • Also fixing a minor docs bug (errant backtick)

  • Merged PR 2884: Add DSL test for runtime size correctness. [Denny Sun]

  • Merged PR 2878: Optimize warp id calculation by forcing scalar
    registers. [Ritwik Das]

    • ROCM: use __builtin_amdgcn_readfirstlane to force scalar reg usage
    • CUDA: don't use anything special since __shfl_sync seems to generate slower code
  • Merged PR 2885: Updates python dependencies. [Kern Handa]

    Updates hatlib version

  • Merged PR 2881: Fix the runtime crash caused by incorrectly generated
    LLVM IR. [Denny Sun]

    1. Call the specific version of LLVM type converter for dynamic memory
    2. Create MemRefDescriptor from dynamic memory shape by associating the arrays with correct size arguments

    With this change, the following DSL test succeeds and passes the correctness check.

        # (Assumed harness: this test body relies on the usual accera imports
        # and test globals, e.g.
        #   from accera import Array, Dimension, Nest, Package, ScalarType
        # plus numpy as np, verifiers, TEST_FORMAT, TEST_MODE, TEST_PACKAGE_DIR.)
        M = Dimension()
        N = Dimension()
        K = Dimension()

        A = Array(shape=(M, K), element_type=ScalarType.float32,
                  role=Array.Role.INPUT)

        B = Array(shape=(K, N), element_type=ScalarType.float32,
                  role=Array.Role.INPUT)

        C = Array(shape=(M, N),
                  element_type=ScalarType.float32,
                  role=Array.Role.INPUT_OUTPUT)

        # The nest and its indices are implied by the iteration logic below:
        nest = Nest(shape=(M, N, K))
        i, j, k = nest.get_indices()

        @nest.iteration_logic
        def _():
            C[i, j] += A[i, k] * B[k, j]

        M_test = np.int64(64)
        N_test = np.int64(128)
        K_test = np.int64(32)
        A_test = np.random.random((M_test, K_test)).astype(np.float32)
        B_test = np.random.random((K_test, N_test)).astype(np.float32)
        C_test = np.random.random((M_test, N_test)).astype(np.float32)

        correctness_check_values = {
            "pre": [M_test, N_test, K_test, A_test, B_test, C_test],
            "post": [M_test, N_test, K_test, A_test, B_test, C_test + A_test @ B_test],
        }

        package = Package()
        function = package.add(nest, args=(M, N, K, A, B, C), base_name="runtimesizes")

        with verifiers.VerifyPackage(self, "test_runtimesizes", TEST_PACKAGE_DIR) as v:
            package.build("test_runtimesizes",
                          format=TEST_FORMAT | Package.Format.MLIR_VERBOSE,
                          mode=TEST_MODE, output_dir=TEST_PACKAGE_DIR)
            if correctness_check_values:
                v.check_correctness(
                    function.name,
                    before=correctness_check_values["pre"],
                    after=correctness_check_values["post"],
                )

  • Merged PR 2879: Fix exception in GPU baseline benchmark. [Ritwik Das]

  • Merged PR 2856: Enable output caching in ROCM for all MMA shapes.
    [Ritwik Das]

  • Merged PR 2876: Introduce warp bindings in CUDA. [Ritwik Das]

    • Bind indices to WARP_X/Y along with tensorization (exclusively from thread id mapping)
    • The warp x dim is always a multiple of the warp size in the x dimension; e.g., dividing a 64x64 block tile into 4 subtiles of 32x32, each computed by a single warp, gives a blockDim of (64, 2, 1) (4 warps x 32 threads = 128 threads).
    • This is required because with tensorization we want block dims to be generated in a specific way that differs from the non-tensorized case; calculating offsets within the matrix based on warps is non-trivial, if not impossible, with thread bindings alone.

    Related work items: #3726

  • Merged PR 2874: Add unrolled convolution case study link (#50) [Lisa
    Ong]

    • Update README.md: add the unrolled convolution case study reference link
    • Update the reference link according to the latest updates in the case study

  • Merged PR 2873: Convert function signature from dynamic memref type to
    llvm type. [Denny Sun]

    With this change, Accera is able to write the correct function signature of dynamic memref type to the HAT file

  • Merged PR 2871: Update hatlib version. [Denny Sun]

    from 0.0.23 to 0.0.25

  • Merged PR 2870: Filter benchmark kernels based on scheduling policy.
    [Ritwik Das]

  • Merged PR 2867: [build][github] Update test path in github actions.
    [Lisa Ong]

    Fixes https://github.com/microsoft/Accera/actions/runs/3071905923

Full Changelog: v1.2.9...v1.2.10