Releases · microsoft/Accera

17 Sep 03:37

masonremy

v1.2.9

adda009


Merged PR 2862: write runtime size of index type to Hat. [Denny Sun]

write runtime size of index type to Hat
Merged PR 2861: Fix cache_C benchmark variable which is not getting
set properly for CUDA. [Ritwik Das]

Fix cache_C benchmark variable which is not getting set properly for CUDA
Merged PR 2864: [build]: fix breaks due to agent image updates. [Lisa
Ong]

Latest version of azure pipelines images now set VCPKG_ROOT, which overrides the submodule used by Accera.

See: actions/runner-images@ef638dd
- Only pipelines that rely on azure build agents are affected.
- We still need to keep the submodule around to enable external builds from the Github repo.
- Remove defunct pipeline
- Update vcpkg submodule while we're here
Merged PR 2839: Enable CUDA output caching. [Ritwik Das]
- Add a Tensor memory space type to denote memory fragments for caching (e.g. C in gemm). This might go away in the future and be replaced with Private once the caching code is unified with ROCM behavior.
- Change caching code to generate MMALoad/StoreOps for caching of the output.
Related work items: #3725
Merged PR 2813: Add pass to recognize patterns that look like int16
matrix multiply. [Chuck Jacobs]

This PR adds a pass to rewrite GEMM-like loops that multiply-accumulate int16 matrices into an int32 result. If this pattern gets invoked, the output should contain the much-sought vpmaddwd instruction.

It also fixes some old low-level tests of integer arithmetic.
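
To illustrate the arithmetic behind the int16 pattern in PR 2813, here is a minimal Python sketch of what a vpmaddwd-style operation computes: adjacent pairs of int16 products are summed into one int32 lane. The function names are hypothetical and are not part of Accera; this models the instruction's semantics only, not the compiler pass.

```python
def wrap_int32(x):
    """Wrap a Python int to signed 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def pmaddwd(a, b):
    """Multiply int16 lanes pairwise and add each adjacent pair into one int32 lane."""
    assert len(a) == len(b) and len(a) % 2 == 0
    assert all(-32768 <= v <= 32767 for v in list(a) + list(b)), "inputs must fit in int16"
    return [
        wrap_int32(a[i] * b[i] + a[i + 1] * b[i + 1])
        for i in range(0, len(a), 2)
    ]


print(pmaddwd([1, 2, 3, 4], [5, 6, 7, 8]))  # [17, 53]
```

A GEMM inner loop over k that accumulates `int32 += int16 * int16` maps onto this operation two k-steps at a time.
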
Merged PR 2847: [release] Bump docs version to 1.2.9 and update github
action container. [Lisa Ong]
- Rev docs to 1.2.9
- Update github workflow to reference updated tag for 14.0.6-1
Merged PR 2845: Filter GPU benchmarks by de-parameterizing cache
layouts. [Ritwik Das]

Filter GPU benchmarks by de-parameterizing cache layouts
Merged PR 2843: Fix bug in GPU benchmark to calculate valid variant.
[Ritwik Das]
- Fix bug in GPU benchmark to calculate valid variant
- Add cosmosdb util to cleanup old entries
Merged PR 2835: Merge in MLIR fixes for LocationSnapshot and
MemRefCastOp. [Lisa Ong]

From 1abc4a981067ef1fd9bf717d7fabc4f6d75520d1 Mon Sep 17 00:00:00 2001
Merged PR 2842: Parameterize cache strategy in GPU benchmarks and fix
kernel filters. [Ritwik Das]

Parameterize cache strategy in GPU benchmarks and fix kernel filters

Merged PR 2836: Value DSL support for runtime sized output arrays.
[Lisa Ong]

This adds memref-in-memref support for output arrays that are allocated in the function:
- A new "Pointer" Value wrapper class with a Store() function that creates an accv.StoreOp, similar to Array and Scalar
- accv.StoreOp is updated to support memrefs-in-memrefs

Value pointer levels are defined as follows:

| Layout | Example | Pointer level | C-type |
|---|---|---|---|
| scalar | `int16`, `float32`, `index`, ... | 0 | `int16_t`, `float32_t`, `int64_t`, ... |
| single-level memref | `memref<1xindex>`, `memref<3x2xf32>`, `memref<10x16x11x?xf32>` | 1 | `int64_t*`, `float32_t*`, `float32_t*` |
| memref in memref | `memref<memref<?x?x?xf32>>` | at least 2 (= the number of levels of memrefs) | `float32_t**` |
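
As a rough illustration of the pointer-level rule, the sketch below counts nested memref levels in an MLIR-style type string. The helper `pointer_level` is hypothetical and is not an Accera API; it only restates the rule that the pointer level equals the number of memref nesting levels.

```python
def pointer_level(type_str):
    """Return the number of nested memref levels in a type string (0 for scalars)."""
    return type_str.replace(" ", "").count("memref<")


for t in ["float32", "memref<3x2xf32>", "memref<memref<?x?xf32>>"]:
    print(t, "->", pointer_level(t))
```
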

Future work:

- End-to-end lowering through Python DSL
- Bare pointer convention for output arrays
- Custom allocator functions. Currently we use the built-in std alloc.

Related work items: #3730

Merged PR 2840: [nfc] Remove redundant ACR info from docker scripts.
[Lisa Ong]

The container registry allows pull-only access
Merged PR 2838: Runtime sized Array lowering to LLVM, accv.alloc to
LLVM malloc. [Denny Sun]
1. make deep copy of range end of value type when cloning ops
2. plumbing runtime size to LLVM
3. transform memref.alloc to LLVM malloc
4. conversion between block argument and symbol name
The generated IRs:

Initial.mlir

    %2 = "accv.alloc"(%arg0, %arg1) {sym_name = "diff"} : (index, index) -> memref<?x?xf32> loc(#loc)

LoopNestToValueFunc.mlir
```
%2 = "accv.alloc"(%arg0, %arg1) {sym_name = "diff"} : (index, index) -> memref<?x?xf32> loc(#loc)
affine.for %arg4 = 0 to %arg0 {
    affine.for %arg5 = 0 to %arg1 {
    }
}
```
ConvertValueToStd.mlir
```
%0 = memref.alloc(%arg0, %arg1) : memref<?x?xf32>
```
ConvertValueToLLVM.mlir
```
%8 = llvm.mul %arg1, %arg0  : i64
%9 = llvm.mlir.null : !llvm.ptr<f32>
%10 = llvm.getelementptr %9[%8] : (!llvm.ptr<f32>, i64) -> !llvm.ptr<f32>
%11 = llvm.ptrtoint %10 : !llvm.ptr<f32> to i64
%12 = llvm.call @malloc(%11) : (i64) -> !llvm.ptr<i8>
```
Related work items: #3733
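
The ConvertValueToLLVM.mlir snippet in PR 2838 computes the malloc size by multiplying the two runtime dimensions and then taking the address of that element index relative to a null `f32` pointer, which yields the element count times the element size. A minimal Python sketch of the same arithmetic follows; the function name is hypothetical and the default element type is assumed to be a 4-byte float, as in the IR above.

```python
import ctypes


def alloc_size_bytes(dim0, dim1, elem_type=ctypes.c_float):
    """Byte count passed to malloc for a dim0 x dim1 array of elem_type."""
    # llvm.mul: total number of elements
    num_elements = dim0 * dim1
    # getelementptr from null + ptrtoint: num_elements * sizeof(element)
    return num_elements * ctypes.sizeof(elem_type)


print(alloc_size_bytes(3, 5))  # 60
```
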
Merged PR 2831: Record unique IDs so that different processes acting
on a value module. [Mason Remy]

Record unique IDs so that different processes acting on a value module
don't produce conflicting IDs
Merged PR 2837: Fix WPT calculation to prevent 0 work and filter
benchmarks. [Ritwik Das]

Fix WPT calculation to prevent 0 work and filter benchmarks
Merged PR 2832: Caching strategy flag and thread ID optimization (GPU)
[Ritwik Das]
- Add a flag to plan.cache() to expose the different thread <--> data arrangements
- Optimize thread ID calculation to check blockdim first
Merged PR 2829: Add handwritten caching implementation for GPU.
[Ritwik Das]

Add GPUBlockCacheOp, which lowers to a handwritten caching implementation on the GPU. The implementation supports access patterns that minimize bank conflicts in shared memory and maximize coalesced global memory access.
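
To show why the access pattern matters for shared memory, here is a small Python sketch that counts how many threads land on the same bank when each thread reads the first element of a different row of a row-major tile. It assumes 32 banks of 4-byte words (an assumption about the hardware, not something stated in the PR) and is not the GPUBlockCacheOp implementation.

```python
from collections import Counter

NUM_BANKS = 32        # assumption: 32 shared-memory banks
BANK_WIDTH_BYTES = 4  # assumption: each bank serves 4-byte words


def max_bank_conflict(row_stride_elems, num_threads=32, elem_bytes=4):
    """Worst-case number of threads hitting one bank when thread t reads tile[t][0]."""
    banks = Counter(
        (t * row_stride_elems * elem_bytes // BANK_WIDTH_BYTES) % NUM_BANKS
        for t in range(num_threads)
    )
    return max(banks.values())


print(max_bank_conflict(32))  # all threads share a bank
print(max_bank_conflict(33))  # padding the row by one element spreads them out
```
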
Merged PR 2821: Fixes constraint logic for fusion of more than two
schedules. [Kern Handa]

Fixes constraint logic for fusion of more than two schedules
Merged PR 2830: Fixes macOS CI build. [Kern Handa]

Fixes macOS CI build
Merged PR 2806: Enable specifying cache element type. [Mason Remy]

Enable specifying cache element type
- Supports accumulating and/or computing in a different element type and
  batching up the casts for those types
- Also adds support for binop/castop expansion and castop folding
Merged PR 2818: Upgrade hatlib dependency to v0.0.23. [Ritwik Das]

Upgrade hatlib dependency to v0.0.23
Merged PR 2792: Refactor cast to a value cast op. [Mason Remy]

Refactor cast to a value cast op
Merged PR 2788: Re-enabled fusing test that was taking too long.
[Chuck Jacobs]

This PR just re-enables a skipped test that was taking too long
Merged PR 2816: Upgrade hatlib requirement to 0.0.22. [Ritwik Das]

Upgrade hatlib requirement to 0.0.22
Merged PR 2811: [nfc] Upgrade CUDA to 11.7 on NVidia benchmark
machines. [Lisa Ong]

According to https://hub.docker.com/r/nvidia/cuda/tags, 11.7.0 is still the latest.

Full Changelog: v1.2.8...v1.2.9


10 Aug 03:29

CaptnJackSparrow

v1.2.8

a9ab6bd


What's Changed

Set license field in metadata of package by @tonybaloney in #46
Github codespaces configuration by @lisaong in #48

Merged PR 2814: Parameterize batch_size in GPU benchmarks. [Ritwik
Das]

Parameterize batch_size in GPU benchmarks
Merged PR 2810: [release] [nfc] Bump docs version to 1.2.8, bump
github actions to llvm 14.0.6. [Lisa Ong]

Preparation for 1.2.8 release
Merged PR 2808: [ci] Add vcpkg caching for buddy builds, disable flaky
parallelized tests. [Lisa Ong]
- Enable vcpkg binary caching for CI pipelines that use non-custom agents. This reduces vcpkg install time from 2-3 minutes to ~30 seconds.
- ctest --parallel on macOS can sometimes fail randomly. The tests will need to be updated to support running in parallel.
Merged PR 2804: [ci] Reduce runtimes of PR Buddy Builds. [Lisa Ong]
- Remove redundant setup.py builds in pipelines with cmake builds
- Build debug for Linux only (the fastest config)
- Add pipeline caching for ccache, conan, and pip where applicable
- Add parallel configs where applicable
- Filter out some tests on windows due to slow runtimes. These should have coverage on Linux and macOS.
Merged PR 2807: Enable verification for CK baselines. [Ritwik Das]
- Enable verification for CK baselines
- increase timeout for cuda resnet
- add functionality for extracting kernel code from cosmosdb
Merged PR 2802: Fix barrier optimization pass. [Chuck Jacobs]

This PR fixes a couple of barrier-related issues:
- The barrier optimization pass wasn't keeping barriers that protected vector load/store ops
- Multiple barriers were getting generated when hoisting barriers out of conditionals
Related work items: #3732
Merged PR 2800: Add max_threads to parallelize and change default
behavior. [Ritwik Das]
- Add max_threads to parallelize
- change default behavior to count the number of iterations of the given indices
- Update documentation
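
A minimal sketch of the default described in PR 2800: the thread count defaults to the number of iterations of the parallelized indices. The helper name is hypothetical, and capping the default with the thread-count argument (called max_threads in the PR title) is an assumption about how the two interact.

```python
from math import prod


def resolve_num_threads(iteration_counts, max_threads=None):
    """Default to the total iteration count; optionally cap it (assumed behavior)."""
    default = prod(iteration_counts)
    return default if max_threads is None else min(default, max_threads)


print(resolve_num_threads([4, 8]))      # 32
print(resolve_num_threads([4, 8], 16))  # 16
```
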
Merged PR 2801: Remove verification on cuda-fp32-big benchmark.
[Ritwik Das]

Remove verification on cuda-fp32-big benchmark
Merged PR 2798: LLVM 14.0.6 upgrade. [Lisa Ong]

An incremental upgrade with minimal or no changes to MLIR
Merged PR 2796: Makes NestedPassAdaptor's pipeline consistent. [Kern
Handa]

Makes NestedPassAdaptor's pipeline consistent

This change makes NestedPassAdaptor create a new pass manager
every time a new pass is added. Prior to this change, if dumpPasses was
false, the same nested pass manager would be reused. If dumpPasses was
true, a new nested pass manager would be created per call to addPass.
This difference in behavior also caused the lowering pipeline to differ
depending on the value of dumpPasses.

For example, in the following code in AcceraPasses.cpp, all the passes
that are added to funcOpPM ran BEFORE createConvertSCFToOpenMPPass
if dumpPasses was false.
```
    auto funcOpPM = pmAdaptor.nestPassManager([&]() -> OpPassManager& { return pm.nest<v::ValueModuleOp>().nest<FuncOp>(); });
    funcOpPM.addPass(createConvertLinalgToAffineLoopsPass());
    funcOpPM.addPass(createSimplifyAffineStructuresPass());
    funcOpPM.addPass(createCanonicalizerPass());
    funcOpPM.addPass(createLoopInvariantCodeMotionPass());
    funcOpPM.addPass(createCSEPass());

    pmAdaptor.addPass(createConvertSCFToOpenMPPass());
    pmAdaptor.addPass(value::createValueToStdPass(options.enableProfile));
    funcOpPM.addPass(value::createBarrierOptPass(options.writeBarrierGraph.getValue(), options.barrierGraphFilename.getValue()));
    pmAdaptor.addPass(value::createRangeValueOptimizePass());
    pmAdaptor.addPass(createCanonicalizerPass());
    pmAdaptor.addPass(createCSEPass());
```
Additionally, this change exposed the fact that the BarrierOpt pass is
incorrectly erasing barriers, and so has been made into a no-op until
this correctness issue has been fixed.
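
The ordering difference described in PR 2796 can be reproduced with a toy model. The class below is a hypothetical stand-in and not the real NestedPassAdaptor or MLIR API: it only shows how reusing one nested pass manager pulls later nested passes ahead of top-level passes, while creating a fresh nested pass manager per pass preserves the written order.

```python
class ToyPassAdaptor:
    """Toy model of a pass adaptor; NOT the real NestedPassAdaptor API."""

    def __init__(self, fresh_nested_pm_per_pass):
        self.fresh = fresh_nested_pm_per_pass
        self.pipeline = []          # entries: a pass name, or a list (nested PM)
        self._shared_nested = None  # reused nested PM when fresh is False

    def add_pass(self, name):
        self.pipeline.append(name)

    def add_nested_pass(self, name):
        if self.fresh:
            self.pipeline.append([name])
        else:
            if self._shared_nested is None:
                self._shared_nested = []
                self.pipeline.append(self._shared_nested)
            self._shared_nested.append(name)

    def execution_order(self):
        order = []
        for entry in self.pipeline:
            order.extend(entry if isinstance(entry, list) else [entry])
        return order


def build(fresh):
    pm = ToyPassAdaptor(fresh)
    pm.add_nested_pass("CSE")
    pm.add_pass("ConvertSCFToOpenMP")
    pm.add_pass("ValueToStd")
    pm.add_nested_pass("BarrierOpt")
    return pm.execution_order()


print(build(fresh=False))  # nested passes run together, before the top-level ones
print(build(fresh=True))   # passes run in the order they were added
```
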
Merged PR 2795: [docs] Cleanup viz scripts, clarify reorder
illustrations. [Lisa Ong]
- Clarify in the labels while working on the animated version
- Cleanup and rename .js files for (slightly) easier lookup
Merged PR 2475: LLVM 14.0.0 upgrade. [Lisa Ong]

Tag: llvmorg-14.0.0

Notable changes:
- std dialect ops are now moved to arith, math dialects
- StrEnumAttribute is now replaced by simple enums. This affects things like gpu.dimension.x
- [Issue] linalg.copy is removed, replaced by memref.copy, which introduces a runtime dependency on a memrefCopy C function for non-identity layout copies. This affects Array.sub_array in debug mode.
- [Regression] OMP to LLVM lowering will crash in mlir-translate findAlloc due to an empty set of blocks being emitted. This only affects dynamic scheduling with collapsed loops.
- Lots of renames
- Upgraded macOS to macOS-12
Related work items: #3646

Merged PR 2753: accera.Dimension and runtime-sized Arrays in the
Python DSL. [Denny Sun]

With this change, Accera can generate the initial MLIR for runtime-sized Arrays. IR lowering does not fully work yet because of a bug, which will be fixed in later changes.

        M = Dim()
        N = Dim()
        K = Dim()

        A = Array(shape=(M, K), element_type=ScalarType.float32, role=Array.Role.INPUT)
        B = Array(shape=(K, N), element_type=ScalarType.float32, role=Array.Role.INPUT)
        C = Array(shape=(M, N), element_type=ScalarType.float32, role=Array.Role.INPUT_OUTPUT)

        nest = Nest((M, N, K))
        i, j, k = nest.get_indices()

        @nest.iteration_logic
        def _():
            C[i, j] += A[i, k] * B[k, j]

        package.add()
        package.build()

#domain0 = #accln<"idomain{{i,3}={0:{op_idx:0}:1}, {j,4}={0:{op_idx:1}:1}, {k,5}={0:{op_idx:2}:1}}">
#domain1 = #accln<"idomain{{i,9}={0:{op_idx:0}:1}, {j,10}={0:{op_idx:1}:1}}">
#domain2 = #accln<"idomain{{i,6}={0:1:1}}">

#map = affine_map<(d0, d1)[s0] -> (d0 * s0 + d1)>
#xdomain0 = #accln<"xfdomain{dims: {{i,3}, {j,4}, {k,5}}, indices: {{{i,3} : {0:{op_idx:0}:1}}, {{j,4} : {0:{op_idx:1}:1}}, {{k,5} : {0:{op_idx:2}:1}}}}">
#xdomain1 = #accln<"xfdomain{dims: {{i,9}, {j,10}}, indices: {{{i,9} : {0:{op_idx:0}:1}}, {{j,10} : {0:{op_idx:1}:1}}}}">
#xdomain2 = #accln<"xfdomain{dims: {{i,6}}, indices: {{{i,6} : {0:1:1}}}}">
module @test_runtimesizes attributes {llvm.data_layout = "... ..."}  {
  accv.module "test_runtimesizes"  {
    accv.func nested @runtimesizes_..._impl_...(%arg0: index loc(unknown), %arg1: index loc(unknown), %arg2: index loc(unknown), %arg3: memref<?x?xf32, #map> loc(unknown), %arg4: memref<?x?xf32, #map> loc(unknown), %arg5: memref<?x?xf32, #map> loc(unknown)) attributes {accv.output_verifiers = ["", "", "", "", "", "_debug_check_allclose_<accera.lang.Dim.Dim object at ...>_<accera.lang.Dim.Dim object at ...>_..."], exec_target = 0 : i64} {
      %0 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(#loc)
      %1 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(#loc)
      %2 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(#loc)
      "accln.nest"(%0, %1, %2) ( {
        %3 = accln.sym_index {name = "i"} #accln<"index{i,3}"> loc(#loc)
        %4 = accln.sym_index {name = "j"} #accln<"index{j,4}"> loc(#loc)
        %5 = accln.sym_index {name = "k"} #accln<"index{k,5}"> loc(#loc)
        "accln.kernel"() ( {
          %7 = "accv.slice"(%arg5, %3, %4) {sliceDimensions = [0, 1]} : (memref<?x?xf32, #map>, index, index) -> memref<f32> loc(#loc)
          ... ...
          accln.terminator loc(#loc)
        }) {sym_name = "_"} : () -> () loc(#loc)
        ... ...
        accln.terminator loc(#loc)
      }) {domain = #domain0, exec_target = 0 : i64, kernels = []} : (index, index, index) -> () loc(#loc)
      accv.return loc(#loc)
    } loc(#loc)
    accv.func @runtimesizes_...(%arg0: index loc(unknown), %arg1: index loc(unknown), %arg2: index lo...

Merged PR 2793: support sign extend op in canVectorize() function to
improve generated MLIR. [JUBI TANEJA]

While optimizing int16 MatMul with the vectorize transformation in the DSL, we noticed an unrolled loop of load, binop, sexti, and store instructions, with no vector instructions emitted. This indicated that the sign-extend op was not supported in the canVectorize() function. With this op now supported, vector instructions are emitted in the MLIR.
Merged PR 2790: Filter invalid kernels from GPU benchmarks. [Ritwik
Das]
- Filter invalid kernels from GPU benchmarks
- Disable verification on cuda f16 benchmarks
- Remove frequent cleanups
Merged PR 2787: Remove MLIR flag from package format in benchmarks.
[Ritwik Das]

Remove MLIR flag from package format in benchmarks
Merged PR 2784: Merge Github changes to ADO. [Lisa Ong]
Merged PR 2776: Make fusing more efficient. [Chuck Jacobs]

This PR refactors the code generation for schedules and makes it more efficient. This makes a big difference for complex schedules with constraints on the kernels (like the ones generated when fusing schedules).

Here are some timings on a few tests (modified versions of Mason's example script) I ran:

| test | main branch | PR branch |
|---|---|---|
| 3 fused schedules, tile first only | 18.8s | 5.8s |
| 3 fused schedules, tile 1 & 2 | 190s | 6.2s |
| 3 fused schedules, tile all 3 | ???? | 7.2s |

Related work items: #3731
Merged PR 2781: Fix benchmark with MLIR format and add repro test.
[Ritwik Das]
Merged PR 2780: Type support for tensor ops in CUDA. [R...


Contributors

tonybaloney and lisaong


13 Jul 18:06

kernhanda

v1.2.7

1d3b1a3


Merged PR 2744: [doc] Fixes link in reference/functions/cast.md, revs
version on all docs. [Kern Handa]

[doc] Fixes link in reference/functions/cast.md
Merged PR 2743: [DSL] Document implicit casting rules and the explicit
cast function. [Lisa Ong]
- Document implicit casting rules implemented by !2693
- Promote acc.cast to a documented function to give the user control to override implicit casting behavior
Merged PR 2739: Updates ROCM tensorization pattern to handle casting.
[Kern Handa]

Updates ROCM tensorization pattern to handle casting
Merged PR 2643: Some fixes for last major array caching in
tensorization. [Mason Remy]

Some fixes for last major array caching in tensorization
Merged PR 2693: Updates DSL codegen to implicitly cast if possible.
[Kern Handa]

Updates DSL codegen to implicitly cast if possible
Merged PR 2735: Pass multiple input files as comma-separated list to
benchmark tool. [Ritwik Das]

https://intelligentdevices.visualstudio.com/ELL/_build/results?buildId=41588&view=logs&j=d78921a4-2f18-50b0-77ad-4c6803f3371b&t=f97c60f6-ada7-5ec9-5ea1-510216c408e9

The pipeline above did not run the second set of input sizes because the first process did not exit until the pipeline timeout was hit. After the fix, there is always a single job.
Merged PR 2721: Remove unnecessary logging in benchmarks. [Ritwik Das]

Remove unnecessary logging in benchmarks

Merged PR 2674: Support emitting runtime array sizes in the Value DSL.
[Lisa Ong]

Minimum set of changes to support runtime sizes in the Value DSL without transformations:
- Add a ScalarDimension type (name TBC) which is aliased to Scalar
- Support variable ends in MemoryLayout, ScheduledLoopOp, RangeValueAnalysis
- Use mlir::ShapedType::kDynamicSize and mlir::ShapedType::kDynamicStrideOrOffset as sentinel values, following the pattern in MemRefOps, TensorOps, etc.
- TODO: E2E verification in the next PR
- TODO: Python DSL changes in the next PR

Output of mlir-translate for the runtime_sizes_all case, where %21, %22 and %23 are the runtime sizes for M, N, and K:

define void @NestMatMul(float* %0, float* %1, i64 %2, i64 %3, i64 %4, i64 %5, i64 %6, float* %7, float* %8, i64 %9, i64 %10, i64 %11, i64 %12, i64 %13, float* %14, float* %15, i64 %16, i64 %17, i64 %18, i64 %19, i64 %20, i64 %21, i64 %22, i64 %23) !dbg !3 {
  br label %25, !dbg !7

25:                                               ; preds = %57, %24
  %26 = phi i64 [ %58, %57 ], [ 0, %24 ]
  %27 = icmp slt i64 %26, %21, !dbg !9
  br i1 %27, label %28, label %59, !dbg !10

28:                                               ; preds = %25
  br label %29, !dbg !11

29:                                               ; preds = %55, %28
  %30 = phi i64 [ %56, %55 ], [ 0, %28 ]
  %31 = icmp slt i64 %30, %22, !dbg !12
  br i1 %31, label %32, label %57, !dbg !13

32:                                               ; preds = %29
  br label %33, !dbg !14

33:                                               ; preds = %36, %32
  %34 = phi i64 [ %54, %36 ], [ 0, %32 ]
  %35 = icmp slt i64 %34, %23, !dbg !15
  br i1 %35, label %36, label %55, !dbg !16

36:                                               ; preds = %33
  %37 = mul i64 %26, %5, !dbg !17
  %38 = add i64 %37, %34, !dbg !18
  %39 = getelementptr float, float* %1, i64 %38, !dbg !19
  %40 = load float, float* %39, align 4, !dbg !20
  %41 = mul i64 %34, %12, !dbg !21
  %42 = add i64 %41, %30, !dbg !22
  %43 = getelementptr float, float* %8, i64 %42, !dbg !23
  %44 = load float, float* %43, align 4, !dbg !24
  %45 = fmul float %40, %44, !dbg !25
  %46 = mul i64 %26, %19, !dbg !26
  %47 = add i64 %46, %30, !dbg !27
  %48 = getelementptr float, float* %15, i64 %47, !dbg !28
  %49 = load float, float* %48, align 4, !dbg !29
  %50 = fadd float %49, %45, !dbg !30
  %51 = mul i64 %26, %19, !dbg !31
  %52 = add i64 %51, %30, !dbg !32
  %53 = getelementptr float, float* %15, i64 %52, !dbg !33
  store float %50, float* %53, align 4, !dbg !34
  %54 = add i64 %34, 1, !dbg !35
  br label %33, !dbg !36

55:                                               ; preds = %33
  %56 = add i64 %30, 1, !dbg !37
  br label %29, !dbg !38

57:                                               ; preds = %29
  %58 = add i64 %26, 1, !dbg !39
  br label %25, !dbg !40

59:                                               ; preds = %25
  ret void, !dbg !41
}

Related work items: #3716, #3717
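
For readers who prefer not to trace the LLVM IR, the following Python sketch mirrors what the three nested loops in PR 2674's NestMatMul output compute, using flat buffers, a runtime row stride for each array, and runtime sizes M, N, K. The function and parameter names are chosen for illustration and do not appear in the generated code.

```python
def nest_matmul(A, a_stride0, B, b_stride0, C, c_stride0, M, N, K):
    """C[i, j] += A[i, k] * B[k, j] over flat buffers with runtime sizes."""
    for i in range(M):          # loop bounded by the runtime size M
        for j in range(N):      # loop bounded by the runtime size N
            for k in range(K):  # loop bounded by the runtime size K
                a = A[i * a_stride0 + k]
                b = B[k * b_stride0 + j]
                C[i * c_stride0 + j] += a * b


A = [1.0, 2.0, 3.0, 4.0]
B = [5.0, 6.0, 7.0, 8.0]
C = [0.0] * 4
nest_matmul(A, 2, B, 2, C, 2, M=2, N=2, K=2)
print(C)  # [19.0, 22.0, 43.0, 50.0]
```
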

Merged PR 2682: Add nvidia device optimized sizes and some benchmark
fixes. [Ritwik Das]

Add nvidia dev opt sizes and some bench fixes
Merged PR 2676: Add automated weekly rocm baseline benchmark. [Ritwik
Das]

https://intelligentdevices.visualstudio.com/ELL/_build/results?buildId=41316&view=logs&j=4f7f213a-5f0f-58b0-1189-99ef12faf0d8&t=687344d2-d6b6-5d8c-dd9d-6aab558fd96c

https://intelligentdevices.visualstudio.com/ELL/_build/results?buildId=41314&view=logs&j=4f7f213a-5f0f-58b0-1189-99ef12faf0d8
Merged PR 2673: Add automated weekly baseline benchmarks on Nvidia
GPU. [Ritwik Das]


17 Jun 06:43

lisaong

v1.2.6

2459eb8


What's Changed

Bump urllib3 from 1.25.8 to 1.26.5 in /tools/benchmarkers by @dependabot in #42
[ci] Fix out of disk space errors for CI workflow by @lisaong in #43
Bump bottle from 0.12.19 to 0.12.20 in /tools/viz by @dependabot in #44

Merged PR 2657: Add conversion pass from gpu ops to rocdl ops. [Ritwik Das]
- switch to gpu dialect for gpu index ops
- add conversion pass from gpu dialect to rocdl
Merged PR 2652: Add integer tensor ops support for AMD targets.
[Ritwik Das]
- int mfma ops
- tests
- static_cast in c++
Related work items: #3727
Merged PR 2650: [release] Docs version to 1.2.6, sync Github to ADO.
[Lisa Ong]
Merged PR 2624: Add more MMA shapes for CUDA. [Ritwik Das]

Add more MMA shapes for CUDA
- 32x8x16
- 8x32x16
Merged PR 2644: Enable CUDA benchmarks only for A6000. [Lisa Ong]
- Manually set the Target.Model user capability on agents running A6000
- Update benchmarking pipelines to demand A6000s
https://docs.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml#feedback
Merged PR 2634: Remove couple more big gemm sizes. [Ritwik Das]

Remove couple more big gemm sizes
Merged PR 2626: [refactor] Moving debug mode to its own lowering pass.
[Lisa Ong]

Move the emitting of the debug mode wrapper function out of MLIREmitterContext into a lowering pass.

This makes it easier to expand debug mode in the future.
Merged PR 2633: Bump hatlib to 0.0.19 to unblock CUDA T4 devices.
[Lisa Ong]

https://github.com/microsoft/hat/releases/tag/v0.0.19
Merged PR 2630: Add batched gemm support with tensorization. [Ritwik
Das]

Related work items: #3677
Merged PR 2631: Add cosmosdb key env var and shuffle gemm sizes.
[Ritwik Das]
- Add env var for ACCOUNT_KEY
- shuffle gemm sizes from small to big
- remove correctness check from big inputs and fp16
Merged PR 2607: Infrastructure for plan.auto() to support a basic none
cache heuristics approach. [JUBI TANEJA]

Infrastructure for plan.auto() to support a basic none cache heuristics approach

This is a basic approach to test parameterization of the cache arguments index and layout.
The user only needs to specify the source they want to cache, and AutoPlanner's
NoneCacheHeuristics algorithm will synthesize the remaining caching parameters
with a possible set of values.

Overall idea at DSL level:
Given input -
schedule.reorder(i, j, k, ii, jj, kk)
plan.auto(accera.algorithms.NoneCacheHeuristics(source = B, index = j))

Internally, auto() invokes cache and adds two functions, each with
a unique value of layout.

plan.cache(source = B, index = j, layout = {FIRST_MAJOR, LAST_MAJOR})
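
As a rough sketch of how a parameter with a possible set of values expands into concrete cache configurations, the helper below enumerates one configuration per combination. `expand_choices` is a hypothetical name and is not part of Accera's algorithms module; the layout names are written as plain strings for illustration.

```python
from itertools import product


def expand_choices(params):
    """Expand {name: value or list of candidate values} into one dict per combination."""
    names = list(params)
    candidates = [
        list(v) if isinstance(v, (list, tuple)) else [v]
        for v in params.values()
    ]
    return [dict(zip(names, combo)) for combo in product(*candidates)]


configs = expand_choices(
    {"source": "B", "index": "j", "layout": ["FIRST_MAJOR", "LAST_MAJOR"]}
)
for cfg in configs:
    print(cfg)  # one cache configuration (and so one function) per layout
```
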

Important changes in this PR:
- Add a new algorithms module in Accera
- Instead of delaying resolution of delayed parameters to get the value,
  parameters can now be set with a possible set of values, which can be
  passed between heuristics and the plan object. See: Parameter.py
- Parameters constructed by heuristics are termed "heuristic parameters".
  They are not available to external users of Accera; they are only named
  separately in the implementation to differentiate them from user-defined "parameters".
Limitation/Changes coming in the subsequent PRs:
- Allow user-defined parameters and heuristic parameters both for AutoPlanner test cases.
  For now, the code only focuses on testing AutoPlanner without any user-defined parameters
  that one can create using API: create_parameters.
- Documentation of AutoPlanner -- design goals, tutorial, API description, etc. is coming in the
  next PR.
Merged PR 2600: Refactor MFMA indexing calculations. [Mason Remy]

Refactor MFMA indexing calculations
- Use the iteration space position when determining MFMA computation
  locations rather than computing the position from the thread id
- Construct the full subschedules for AMD MFMA ops so that the bound
  loop indices are ordered appropriately for the MFMA op being invoked
- Update unit tests accordingly. The schedule changes may need to be
  moved to an under-the-hood feature of tensorization
Merged PR 2627: Raise error for invalid block dimensions. [Ritwik Das]

Raise error for invalid block dimensions based on target info

Related work items: #3715
Merged PR 2625: [nfc] Block debug mode for unsupported GPU targets.
[Lisa Ong]

Debug mode is not yet supported for GPU targets
- Fail early
- Update documentation
Merged PR 2622: Fix dependencies for benchmark tools. [Ritwik Das]

Fix dependencies for benchmark tools
Merged PR 2604: Add bfloat16 support for tensor ops on rocm. [Ritwik
Das]

Add bfloat16 support for tensor ops on cuda and rocm

Related work items: #3713
Merged PR 2621: Merge changes from Github repo. [Lisa Ong]

commit 5b5f5ef
Merged PR 2620: Upgrade GPU self-hosted agents to g++-10. [Lisa Ong]

The stock g++-9 from Ubuntu 20.04 crashes when compiling pybind11 alongside mlir/Dialect/IR/Affine/AffineOp.h.

This change updates to g++-10 for the self-hosted images only, as this issue only affects images that we build for ROCm and CUDA.

Azure DevOps agents will continue to run on their pre-installed g++-9.

Merged PR 2619: Parameterize Plan.bind. [Denny Sun]

        P0, P1, P2, P3, P4, P5 = create_parameters()

        plan.bind(mapping={
            P0: P3,
            P1: P4,
            P2: P5
        })

        package.add(
            plan,
            args=(A, B, C),
            parameters={
                P0: i,
                P1: j,
                P2: k,
                P3: v100.GridUnit.BLOCK_X,
                P4: v100.GridUnit.THREAD_X,
                P5: v100.GridUnit.THREAD_Y,
            },
            base_name=test_name)

Related work items: #3708
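
One way to picture the parameterized bind in PR 2619 is as a substitution step: once values are supplied for the parameters, both sides of the mapping are replaced by their concrete values. The sketch below uses plain strings as stand-ins for parameters, loop indices, and grid units; `resolve_mapping` is a hypothetical helper, not Accera's implementation.

```python
def resolve_mapping(mapping, parameter_values):
    """Replace every parameter in `mapping` (keys and values) with its concrete value."""
    return {parameter_values[k]: parameter_values[v] for k, v in mapping.items()}


mapping = {"P0": "P3", "P1": "P4", "P2": "P5"}
values = {
    "P0": "i", "P1": "j", "P2": "k",
    "P3": "BLOCK_X", "P4": "THREAD_X", "P5": "THREAD_Y",
}
print(resolve_mapping(mapping, values))
```
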

Merged PR 2599: Support parameterizing caches based on memory space.
[Mason Remy]

Support parameterizing caches based on memory space
- Identifies bound indices that the cache should be parameterized on,
  rather than shaped by.
  e.g. for a private memory cache inserted at a gpu block level, the
  computed memory space will not be the full active block at that level,
  but the portion derived from loops that weren't bound to gpu thread
  dims.
- Adds some BoundProcessorOp utilities and shares some common binding
  code
Merged PR 2618: Fix memory allocation bug during benchmark
verification. [Ritwik Das]

Fix memory allocation bug during benchmark verification
Merged PR 2617: [nfc] [doc] Fix typo and re-sync models table. [Lisa
Ong]
Merged PR 2616: Formatting Python code a bit for the better
readability. [Denny Sun]
1. Some functions have a long list of parameters, add line wrap
2. Separate external imports from internal ones
Merged PR 2614: Remove redundant variable and cosmosdb fix. [Ritwik
Das]

Cosmos DB error when upserting from multiple processes:

Process runner0:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/azp/_work/2/s/tools/benchmarkers/accera_gemm.py", line 633, in gemm_runner
cosmosdb.upsert_benchmark_results(resultRows, containerName, verboseLogs)
File "/azp/_work/2/s/tools/benchmarkers/cosmosdb.py", line 27, in upsert_benchmark_results
container = get_container(containerName, verboseLogs)
File "/azp/_work/2/s/tools/benchmarkers/cosmosdb.py", line 18, in get_container
container = db.create_container_if_not_exists(id=containerName, partition_key=PartitionKey(path='/partitionKey'))
File "/usr/local/lib/python3.8/dist-packages/azure/core/tracing/decorator.py", line 62, in wrapper_use_tracer
return func(*args, **kwargs) # type: ignore
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/database.py", line 287, in create_container_if_not_exists
container_proxy.read(
File "/usr/local/lib/python3.8/dist-packages/azure/core/tracing/decorator.py", line 62, in wrapper_use_tracer
return func(*args, **kwargs) # type: ignore
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/container.py", line 145, in read
self._properties = self.client_connection.ReadContainer(
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 469, in ReadContainer
return self.Read(path, "colls", collection_id, None, options, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 2162, in Read
result, self.last_response_headers = self.__Get(path, request_params, headers, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 2209, in __Get
return synchronized_request.SynchronizedRequest(
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_synchronized_request.py", line 210, in SynchronizedRequest
return _retry_utility.Execute(
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_retry_utility.py", line 73, in Execute
result = ExecuteFunction(function, global_endpoint_manager, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_retry_utility.py", line 130, in ExecuteFunction
return function(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_synchronized_request.py", line 158, in _Reques...

Contributors

lisaong and dependabot


24 May 06:26

lisaong

v1.2.5

9d8ae51


What's Changed

Add link to the NCHWc 2D Convolution Case Study by @marina-neseem in #41

Merged PR 2593: [docs] [release] bump docs version to 1.2.5 in
preparation for release. [Lisa Ong]

bump docs version to 1.2.5 in preparation for release

Merged PR 2586: Loop order and indices as parameters [Denny Sun]

With this change, the user can write a schedule with loop_order parameterized:

    loop_order = create_parameters()
    schedule.reorder(order=loop_order)

    parameter_grid = {
        loop_order: (j, k, i, ii, jj, kk)
    }

    parameters = create_parameter_grid(parameter_grid,
                                       filter_func=lambda *p: schedule.is_valid_loop_order(p[0][0]),
                                       sample=5)

    # Add another function to the package
    package.add(
        plan,
        args=(A, B, C),
        parameters=parameters,
        base_name="matmul_256_256_256"
    )

Related work items: #3693
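
The filter-then-sample idea in PR 2586 can be sketched without Accera. The code below enumerates candidate loop orders, keeps the valid ones, and draws a seeded sample. All names are hypothetical, and the validity rule used here (each split index must come after the index it was split from) is an assumption about what is_valid_loop_order checks.

```python
import random
from itertools import permutations


def is_valid_order(order, parent):
    """Assumed rule: every inner (split) index appears after its parent index."""
    pos = {idx: n for n, idx in enumerate(order)}
    return all(pos[p] < pos[c] for c, p in parent.items())


def sample_loop_orders(indices, parent, sample, seed):
    """Filter all permutations of `indices`, then draw a reproducible sample."""
    valid = [o for o in permutations(indices) if is_valid_order(o, parent)]
    rng = random.Random(seed)
    return rng.sample(valid, min(sample, len(valid)))


indices = ("i", "j", "k", "ii", "jj", "kk")
parent = {"ii": "i", "jj": "j", "kk": "k"}
for order in sample_loop_orders(indices, parent, sample=5, seed=0):
    print(order)
```

Fixing the seed makes the sampled subset repeatable across runs.
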

Merged PR 2591: Fixes more warnings. Enables STRICT_MODE for Linux PR
CI. [Kern Handa]
Merged PR 2588: [test] Trim out redundant tests from ROCm pipeline.
[Lisa Ong]

The ROCm pipeline is currently on a single agent, avoid running CPU tests that are already running in other pipelines to speed up the pipeline execution.
Merged PR 2590: [nfc] Fixes a bunch of warnings in C++ layer. [Kern
Handa]

[nfc] Fixes a bunch of warnings in C++ layer
Merged PR 2589: [test] Adds DSL tests for Schedule.pad. [Kern Handa]

Adds DSL tests for Schedule.pad
Merged PR 2587: Sync Github to ADO. [Lisa Ong]

commit b934ad05f6b8cd84420226b93f57b8ac3229eadc
Merged PR 2585: Use conditional instead of loop-unswitching on GPU.
[Chuck Jacobs]

This PR changes how boundary conditions are handled on GPU-bound loop indices. If a loop's increment doesn't evenly divide its bounds, the body is guarded by a conditional instead of unswitching that loop.

Related work items: #3703
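
The difference between the two boundary-handling strategies can be sketched in plain Python. This is only an illustration of the general technique, not the code Accera emits; the function names `run_unswitched` and `run_guarded` are hypothetical.

```python
def run_unswitched(n, step, body):
    """Loop unswitching: a full-tile main loop plus a separate remainder loop."""
    main_end = (n // step) * step
    for i in range(0, main_end, step):
        for ii in range(step):          # full tiles need no bounds check
            body(i + ii)
    for idx in range(main_end, n):      # peeled boundary iterations
        body(idx)


def run_guarded(n, step, body):
    """Conditional guard: one uniform loop whose body checks the bound."""
    num_tiles = (n + step - 1) // step  # ceil-divide so the partial tile is covered
    for tile in range(num_tiles):
        for ii in range(step):
            idx = tile * step + ii
            if idx < n:                 # boundary handled by a conditional
                body(idx)


visited_unswitched, visited_guarded = [], []
run_unswitched(10, 4, visited_unswitched.append)
run_guarded(10, 4, visited_guarded.append)
```

Both versions visit the same indices. The guarded form keeps a single uniform loop structure, which is the shape that maps naturally onto a grid of GPU threads.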
Merged PR 2571: Add random seed to enable reproducible sampling.
[Denny Sun]

Giving users control over sampling strategies.
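
As a rough illustration of why a seed makes sampling reproducible, here is a minimal standard-library sketch. The helper `sample_candidates` and its `seed` keyword are hypothetical and are not Accera's API.

```python
import random


def sample_candidates(candidates, k, seed=None):
    """Pick up to k candidates; a fixed seed yields the same picks every run."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return rng.sample(candidates, min(k, len(candidates)))


candidates = list(range(100))
first = sample_candidates(candidates, 5, seed=2022)
second = sample_candidates(candidates, 5, seed=2022)
```

With the same seed, `first` and `second` contain the same elements in the same order.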

Merged PR 2581: Add CUDA tensor core support. [Ritwik Das]

Added CUDA tensor ops (no caching)
Added validation tests
Changed MMA enum names
Bit of generated tensor op code in cuda:

...
vhalf *var11 = (vhalf*)arg2;
wmma::fragment<wmma::accumulator, 16, 16, 16, vhalf> mmaMatrix_12;
wmma::load_matrix_sync(mmaMatrix_12, var11 + var9 * 16 + var10, 16, wmma::layout_t::mem_row_major);
vhalf *var13 = (vhalf*)arg0;
wmma::fragment<wmma::matrix_a, 16, 16, 16, vhalf, wmma::row_major> mmaMatrix_14;
wmma::load_matrix_sync(mmaMatrix_14, var13 + var9 * 16 + 0, 16);
vhalf *var15 = (vhalf*)arg1;
wmma::fragment<wmma::matrix_b, 16, 16, 16, vhalf, wmma::row_major> mmaMatrix_16;
wmma::load_matrix_sync(mmaMatrix_16, var15 + 0 * 16 + var10, 16);
wmma::fragment<wmma::accumulator, 16, 16, 16, vhalf> mmaMatrix_17;
wmma::mma_sync(mmaMatrix_17, mmaMatrix_14, mmaMatrix_16, mmaMatrix_12);
wmma::store_matrix_sync(var11 + var9 * 16 + var10, mmaMatrix_17, 16, wmma::layout_t::mem_row_major);

Related work items: #3694

Merged PR 2584: Adds cublas_gemm benchmarking tool. [Kern Handa]

Adds cublas_gemm benchmarking tool
Merged PR 2583: Don't hold ResolveWarpSize results with rvalue. [Mason
Remy]

Don't hold ResolveWarpSize results with rvalue

gcc appears to inline ResolveWarpSize incorrectly in some cases, and
not holding the result with an rvalue pair appears to fix it.

This was resulting in some mod 0's and floordiv 0's where we would expect
the warp size constants to be exactly 32 or 64.
Merged PR 2580: Fixes rocblas_gemm's fp32 -> fp16 conversion. [Kern
Handa]
Merged PR 2579: Improves accera_gemm.py's handling of unsupported
configs. [Kern Handa]

Improves accera_gemm.py's handling of unsupported configs
Merged PR 2578: Fixes time unit conversions in accera_gemm.py. [Kern
Handa]

Also addresses comments for the previous rocblas_gemm PR
Merged PR 2577: Fixes accera_gemm.py code after Plan.tensorize API
change. [Kern Handa]

Fixes accera_gemm.py code after Plan.tensorize API change
Merged PR 2575: Adds library warmup to rocblas_gemm benchmarker. [Kern
Handa]

Adds library warmup to rocblas_gemm benchmarker
Merged PR 2572: [nfc] Move accera/viz -> tools/viz. [Kern Handa]

[nfc] Move accera/viz -> tools/viz
Merged PR 2573: Update setup.cfg hatlib dependency version. [Mason
Remy]

Update setup.cfg hatlib dependency version
Merged PR 2557: Overhauls the benchmarking tool. [Kern Handa]

This change moves the benchmarking tool to a top-level tools/benchmarkers directory. The tool has also been split up so that the accera portion is in its own file, while the driver portion of the tool remains intact and has gained the ability to run a rocblas gemm benchmarking utility.

The aforementioned rocblas gemm benchmarking utility is also added in this change. rocblas_gemm is a new executable that is not built by default since it relies on the rocblas library, which may not be available everywhere. Once this tool has been explicitly built, it can be passed in as an argument to the benchmarker tool, which will use it to generate a comparison between accera's benchmark results and rocblas's.

An example:
```
<build accera like usual>
ninja -C `git rev-parse --show-toplevel`/build/temp.linux-x86_64-3.8 rocblas_gemm
cd tools/benchmarkers
mkdir ~/accera_benchmarks
./gpu_benchmark_tool.py -i sgemm_bert_assorted.csv -t 'AMD MI100' -o ~/accera_benchmarks/results -r `git rev-parse --show-toplevel`/build/temp.linux-x86_64-3.8/tools/benchmarkers/rocblas/rocblas_gemm
```
Related work items: #3685
Merged PR 2569: Make tensorization passes configurable, remove
dependency from split indices. [Ritwik Das]
- Make the mfma type a required parameter for tensorize() - this only chooses the underlying mfma op to use
- Additionally, the user can pass in the total number of passes to run (defaults to 1) instead of having a square tile calculated implicitly.
- Added documentation for the new enum type.
- Added some tests
- Current code does not work with K > M (still investigating this, but should not block this PR)
Related work items: #3688
Merged PR 2567: Fix vectorized access of LAST_MAJOR arrays. [Mason
Remy]

Fix vectorized access of LAST_MAJOR arrays
- mlir::vector::LoadOp and mlir::vector::StoreOp only support unit
  strides on the minor dimension of the memref they access, so
  reinterpretcast the memref to a flat buffer to pass that check
- add translation for reinterpretcastop
- improve vectorization of LAST_MAJOR matrices in cache accesses by
  changing the traversal order of the cache region (when
  filling/reducing) based on the memory ordering of the outer array
  being acted on.
Merged PR 2568: [Compliance] [nfc] Switch to Azure Container Registry
for ROCm build agent. [Lisa Ong]
Merged PR 2560: Make register allocation during tensorization tunable.
[Ritwik Das]
- Add controllable number of fused mfma passes
- Add controllable scheduling policy of mfma ops
- Add tests
Related work items: #3687
Merged PR 2565: [build] bump hatlib dependency to 0.0.13. [Lisa Ong]

hatlib 0.0.13 contains a fix to unblock ROCm buddy builds

New Contributors

@marina-neseem made their first contribution in #41

Full Changelog: v1.2.4...v1.2.5

Contributors

marina-neseem


06 May 02:12

lisaong

v1.2.4

d4ca84f

v1.2.4

What's Changed

Docs refactoring install by @Arslan-e-Mustafa in #27
Revise Pi3_Cross_Compilation.md by @Arslan-e-Mustafa in #28
Docs refactoring tutorials hello matmul by @Arslan-e-Mustafa in #29
Docs refactoring tutorials hello matmul gpu by @Arslan-e-Mustafa in #30
Docs refactoring tutorials optimized matmul by @Arslan-e-Mustafa in #31
Refactoring of Accera.md from reference docs by @Arslan-e-Mustafa in #32
Complete refactoring of safety analysis by @Arslan-e-Mustafa in #33
Refactoring of functions docs in reference files by @Arslan-e-Mustafa in #34
Demo fixes for hatlib 0.0.11 by @lisaong in #36
[nfc] [doc] Update arrow label positions by @lisaong in #35
completed reference docs by @Arslan-e-Mustafa in #37
Update docstrings to match reference doc changes by @lisaong in #38
[ci][nfc] Update CI pipeline to Azure Container Registry by @lisaong in #39
[doc] Contributing guide for Case Studies by @lisaong in #40

Merged PR 2563: Add a table of operators and code examples to the
Parameters.md. [Denny Sun]

Update the Manuals with the supported operators and code examples.
Merged PR 2562: [nfc] Add some macOS targets and synced Model.md.
[Lisa Ong]
- Re-generated Model.md to add missing models
- Handle zero (unknown) vector_bytes cases in tests
- Opportunistically added these models used during development:
  - 2016 macbook pro
  - M1 max
Merged PR 2561: [docs][nfc] Sync changes from Github remote, bump doc
versions to 1.2.4. [Lisa Ong]

Merged PR 2558: [nfc] update requirements to latest version of six.
[Lisa Ong]

Fixes this warning:

<frozen importlib._bootstrap>:914: ImportWarning: _SixMetaPathImporter.find_spec() not found; falling back to find_module()

Merged PR 2559: Finer-granularity error reporting for python tests.
[Chuck Jacobs]

This PR modifies how the python tests are invoked, so that they can report pass/fail results per test. Hopefully that'll make it easier to pinpoint where things are failing during CI builds.
Merged PR 2556: [non-functional] Change ROCM code to generate gcn
intrinsics when possible. [Ritwik Das]
- Use amd gcn intrinsics when possible (threadIdx, blockIdx, barrier)
- Add helpers which automatically check for runtime before emitting the proper code
Related work items: #3698
Merged PR 2547: [non-functional] Change custom mfma types to Memref
and some refactoring. [Ritwik Das]

Make initial changes to remove custom mfma types

Related work items: #3691
Merged PR 2555: create_parameters(count: int) no longer needs count as
an argument. [Denny Sun]
1. Remove the count of parameters to be created from the DSL
2. Throw exception when users write the following code:
  create_parameters()
3. The correct way of calling create_parameters() is:
  p1, p2, p3, ..., pN = create_parameters()
Merged PR 2554: [doc] Updated some missing enums and fixed Case Study
path. [Lisa Ong]
Merged PR 2522: Generalize array indexing in tensorized GEMM. [Chuck
Jacobs]

This PR generalizes the MFMA tensorization pass to improve the handling of code in the innermost loop. It recognizes more ways of writing the GEMM kernel, and rejects many ill-formed GEMM kernels.

There are also a number of tests.

This PR doesn't yet generalize to batch-GEMM, where the matrices (typically) have 3 indices.

Related work items: #3676
Merged PR 2551: [nfc][ci] Switch hosted pipelines to 1ES hosted pool.
[Lisa Ong]
- The Linux1ESPool is created to support internal builds of LLVM
- Fix regression in pipeline due to overzealous .dockerignore
Merged PR 2550: [nfc] [docs] Merge changes from GitHub remote. [Lisa
Ong]

In preparation for merge from ADO to GitHub for Case Studies publishing
Merged PR 2549: [Compliance] Switching from Dockerhub to ACR for third
party containers. [Lisa Ong]

Updating Dockerfile references
Merged PR 2548: Add README file for case studies. [Denny Sun]

README file has a table where each case study points to the external repo link.
Merged PR 2546: [dev] [nfc] Natively support macOS/arm64 for
development. [Lisa Ong]

Limited to local development scenarios (LLVM_SETUP_VARIANT=Default)

No plans to release pip packages until there is CI support

Verified on: Big Sur (MacOSX 12.3 arm64) / Python 3.10
Merged PR 2543: Add precomputed offset map optimization for
tensorization (no caching) [Ritwik Das]
- Add flag to tensorize() to enable optimization (off by default)
- Optimization only affects load/store of accumulator (C) argument
- Supports all 4 mfma shapes
Related work items: #3671
Merged PR 2542: An assortment of minor fixes. [Chuck Jacobs]

This PR is a hodgepodge of tiny fixes. I'm happy to split it up into separate PRs if a kitchen-sink PR is too gross.

The specific things are:
- Add 2 new target models to Targets.py (that correspond to my local dev boxes)
- Change the snapshot IR format for sub-passes to use the same format as the top-level passes (that is, not "generic" format)
- Print a warning message if check_correctness skips a correctness check because no hat file was generated
- Add a "minimum version" constraint to requirements.txt for hatlib
Merged PR 2545: Unifies CUDA and CPP enum values to SOURCE for
Package.Format. [Kern Handa]

Unifies CUDA and CPP enum values to SOURCE for Package.Format

Related work items: #3679
Merged PR 2544: [nfc] Removes now unnecessary ldebug output. [Kern
Handa]

[nfc] Removes now unnecessary ldebug output
Merged PR 2527: Enable vectorized shared memory write. [Mason Remy]

Enable vectorized shared memory write
- This adds mod simplification support needed for vectorizing shared
  memory writes
- Also refactors some of the affine simplification code slightly to
  share some common code between the floordiv and mod simplifications
Related work items: #3586, #3661, #3689
Merged PR 2526: Enable GPU global read vectorization. [Mason Remy]

Enable GPU global read vectorization
- Implements a floor div simplification that enables better recognition
  of vectorizable load and stores
Related work items: #3661, #3690
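
One common form of floordiv and mod simplification relies on a known value range for the operand. The sketch below assumes that approach and is not the MLIR implementation; `simplify_floordiv_mod` is a hypothetical helper.

```python
def simplify_floordiv_mod(lo, hi, divisor):
    """For every x in [lo, hi] (inclusive), try to prove that
    x // divisor == q and x % divisor == x - offset.

    Returns (q, offset) when the whole range lies inside one multiple of
    the divisor, otherwise None (no simplification is possible).
    """
    if divisor <= 0 or lo > hi:
        return None
    q = lo // divisor
    if hi // divisor != q:   # the range straddles a multiple of the divisor
        return None
    return q, q * divisor


# A lane index known to be in [0, 31] divided by 32 is always 0,
# and taking it modulo 32 leaves it unchanged.
lane_case = simplify_floordiv_mod(0, 31, 32)
second_window = simplify_floordiv_mod(32, 63, 32)
straddling = simplify_floordiv_mod(0, 32, 32)
```

Once an index expression loses its floordiv and mod terms, it becomes easier to recognize that consecutive iterations touch consecutive addresses.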
Merged PR 2541: Fix a few issues with GEMM benchmarking script. [Chuck
Jacobs]

This PR fixes a couple of errors:
- there was a bug in the GEMM kernel
- sometimes hatlib would fail to return a compiled function, but not throw an exception. These are now flagged as "uncompilable"
It makes a couple of other tweaks:
- it fails if the alpha and beta parameters aren't 1.0 and 0.0
- it culls some variants with known-uncompilable tensorization parameters before trying to compile them
Merged PR 2538: Fix std::pair unpacking issue in
TensorizeAffineForOpConversion. [Lisa Ong]

In debug builds, we are getting garbage values for warpSizeX and warpSizeY, resulting in division by 0 errors in the emitted .cu files
Merged PR 2536: Parameter supports most of the
arithmetic/binary/unary operations defined in operator lib. [Denny
Sun]

Parameter supports the basic arithmetic operations (+, -, *, //, %), for example, the user can write the following code:

fma_unit_count, vector_size = acc.create_parameters(2)
jjj = schedule.split(jj, fma_unit_count * vector_size)
jjjj = schedule.split(jjj, vector_size)

Related work items: #3692
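
A minimal sketch of how a placeholder type can support arithmetic through operator overloading follows. `DelayedParameter` is a hypothetical stand-in written for illustration and is not Accera's parameter class.

```python
import operator


class DelayedParameter:
    """Hypothetical placeholder whose value is supplied later."""

    def __init__(self, name=None, evaluate=None):
        self.name = name
        # A leaf parameter looks its value up by name; a derived one
        # carries a closure that combines its operands.
        self._evaluate = evaluate or (lambda values: values[name])

    def resolve(self, values):
        return self._evaluate(values)

    def _binary(self, other, op):
        def evaluate(values):
            rhs = other.resolve(values) if isinstance(other, DelayedParameter) else other
            return op(self.resolve(values), rhs)
        return DelayedParameter(evaluate=evaluate)

    def __add__(self, other): return self._binary(other, operator.add)
    def __sub__(self, other): return self._binary(other, operator.sub)
    def __mul__(self, other): return self._binary(other, operator.mul)
    def __floordiv__(self, other): return self._binary(other, operator.floordiv)
    def __mod__(self, other): return self._binary(other, operator.mod)


fma_unit_count = DelayedParameter("fma_unit_count")
vector_size = DelayedParameter("vector_size")
split_size = fma_unit_count * vector_size   # still unresolved at this point
resolved = split_size.resolve({"fma_unit_count": 4, "vector_size": 8})
```

The expression is built once and evaluated only when concrete values are supplied, which is what lets a derived split size be written before any parameter values are chosen.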
Merged PR 2539: [nfc][docs] Merging commits from Github/main. [Lisa
Ong]

commit ee28126a338d905eb5931038d3c5daba6ead3811
Merged PR 2535: [ci] Self-hosted Azure DevOps build agent for ROCm
smoke tests. [Lisa Ong]
- Docker image for self-hosted build agent on the ROCm development machine
- Pipeline will front-load the Python ROCm tests so that we fail faster
- The agent runs ROCm 5.1.1 (the current latest). We can build/launch different containers for different versions if needed.
- CUDA_VISIBLE_DEVICES = 0 by default. This can be overridden at pipeline scheduling time.
- The pipeline currently fails in the ROCm Python tests, so it does not block completion of the PR.
- Included some fixes that are not related to ROCm but generally needed to run on systems whose CPU names are resolved (e.g. "zen2"), i.e. the build agent itself.
Related work items: #3682
Merged PR 2537: [Compliance] Make dependency on ffmpeg optional. [Lisa
Ong]

ffmpeg-python is only needed for video export from the Iteration Visualizer Tool

Removing the hard dependency from the tool.
Merged PR 2525: Fix vectorization plumbing for GPU scenarios. [Mason
Remy]

Fix vectorization plumbing for GPU scenarios

Related work items: #3661
Merged PR 2531: [nfc][docs] Merging weekly commits from Github/main.
[Lisa Ong]

commit d75d4a6b9cec2ccf90bdf27911d843be1833bc8d
Merged PR 2530: Adds initial GPU benchmarking infrastructure. [Kern
Handa]

Related work items: #3685
Merged PR 2524: [nfc] Refactor RangeValue utilities to separate file.
[Mason Remy]

[nfc] Refactor RangeValue utilities to separate file

Related work items: #3661
Merged PR 2532: [prog] Fallback to known TargetDevice names for
looking up the LLVM triple. [Lisa Ong]

Resolves the issue where the CPU type is resolved (e.g....

Contributors

lisaong and Arslan-e-Mustafa


07 Apr 00:11

lisaong

v1.2.3

70f0f63

v1.2.3

What's Changed

Docs refactoring manual fusing by @Arslan-e-Mustafa in #26

Merged PR 2508: [release] Bump docs version to 1.2.3. [Lisa Ong]

In preparation for a PyPI release to facilitate community contributions for case studies

Synced doc editorials from public Github repo
Merged PR 2503: [prog] Support unsigned integer types in the DSL.
[Lisa Ong]
- Add ScalarType.uint8/16/32/64 support
- Use UnrealizedConversionCastOps to convert these unsigned ints to signless ints
- Refactored CastImpl now that we have to handle both unsigned and signless cases for casts to/from ints
- Use a tuple of (mlir Type, llvm Type) to infer the C type when writing function declarations in the HAT file. The former holds sign-ness information, the latter determines the C type (e.g. pointer or not)
- Simplified CheckAllClose function to reduce unnecessary casting
- Doc updates
- Fixed HAT file issues with ScalarType.bool
Merged PR 2507: Updates acc-translate output for ROCm 5.1. [Kern
Handa]
Merged PR 2437: Add more known targets(from our team's devices) [Denny
Sun]

The new list covers the following CPUs, which are being used by our devs:
Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz
11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz 2.11 GHz
Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz

Related work items: #3546
Merged PR 2505: [nfc] Rename parameters for schedule.tile and
plan.bind. [Kern Handa]

[nfc] Rename parameters for schedule.tile and plan.bind
Merged PR 2501: Adds support for more than one GPU function per
package. [Kern Handa]

Adds support for more than one GPU function per package

Related work items: #3686
Merged PR 2504: [docs] Update stale versions in Reference docs. [Lisa
Ong]

Fixing while considering better approaches....
Merged PR 2499: Updates the syntax for schedule.tile. [Kern Handa]

Updates the syntax for schedule.tile
Merged PR 2498: Updates the syntax for plan.bind. [Kern Handa]

Updates the syntax for plan.bind

Related work items: #3678
Merged PR 2500: Adds support for specifying index bitwidth for acc-
translate. [Kern Handa]

Adds support for specifying index bitwidth for acc-translate

Story #3669

Related work items: #3669
Merged PR 2490: Restore CMake Export. [Abdul Dakkak]

Restore the CMake Export feature as it is used by argo-experiments. Note that you cannot use this feature if you are using the vcpkg llvm build
Merged PR 2497: Fix vectorization plumbing to correctly handle zero
vectorization budget cases in cache reduce ops. [Mason Remy]

Fix vectorization plumbing to correctly handle zero vectorization budget cases in cache reduce ops
Merged PR 2496: [nfc] Switch docs versioning to bump2version, replace
VERSION with simple git tag-based version. [Lisa Ong]
- Populate ACCERA_VERSION from the latest git tag
- bump2version is now configured for the docs/ tree
Merged PR 2495: [test] Import break with python -m unittest discover.
[Lisa Ong]

python -m unittest discover accera/test *.py will interrogate verifiers.py and fail because of the relative import
Merged PR 2492: Updates test verifier code to match hatlib API
changes. [Kern Handa]

Updates test verifier code to match hatlib API changes
Merged PR 2488: Simplify RangeValue analysis. [Abdul Dakkak]

Uses LLVM's ConstantRange instead of implementing our own to delete a lot of code
Merged PR 2489: add missing type_traits include. [Abdul Dakkak]

add missing type_traits include
Merged PR 2482: Fix parameterized caches producing multiple caches
erroneously. [Mason Remy]

Fix parameterized caches producing multiple caches erroneously
- This is more of a one-off fix. A more generalized fix for resetting
  schedules/plans for different parameter value resolution should be
  implemented down the road
Merged PR 2479: FP16 tensorization for ROCM. [Abdul Dakkak]
Merged PR 2472: Tensorization + Caching. [Abdul Dakkak]
Merged PR 2485: Add another keyword to function's auxiliary table.
[Denny Sun]

Add a 'parameters' keyword for the parameter values in a function's auxiliary table. The table then looks like:

[functions.matmul_256_256_256_bdec0fac.auxiliary.accera]
[functions.matmul_256_256_256_bdec0fac.auxiliary.accera.parameters]
p_m_split_size = 16
p_n_split_size = 128
p_s_split_size = 256
p_s_split_2_size = 8
p_n_split_2_size = 16
p_n_split_3_size = 4

Related work items: #3662
Merged PR 2484: [Pipelines] Enable uploads to PyPI when tagging a
release. [Lisa Ong]

Configurable service connection variable, allows setting of test and production PyPI service connections during scheduling.

Also cleaned up a stale workaround for auditwheel in the ManyLinux pipeline.
Merged PR 2471: Fix to caching. [Abdul Dakkak]

This avoids the aggressive cache deletion specifically when it occurs within loop. This is a temporary fix, and a more elegant one is to handle memory access info across loop boundaries.
Merged PR 2476: Add accera.create_parameter_grid() with self-defined
filter and sample as arguments. [Denny Sun]

Provides a generic DSL function that lets users create the list of parameters from a dictionary (grid), define their own filter function to remove invalid parameter values, and limit the size of the parameter grid and therefore the number of functions generated.

We identified the need for this function while updating our matmul grid search case study.

Related work items: #3662
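
The general idea (expand a grid, filter it, then sample it) can be sketched with the standard library. `expand_parameter_grid` is a hypothetical stand-in and its filter signature is an assumption; it is not `accera.create_parameter_grid()`.

```python
import itertools
import random


def expand_parameter_grid(grid, filter_func=None, sample=None):
    """Expand {name: candidate values} into a list of {name: value} dicts."""
    keys = list(grid)
    combos = itertools.product(*(grid[key] for key in keys))
    if filter_func is not None:
        # Drop combinations the caller considers invalid.
        combos = (combo for combo in combos if filter_func(*combo))
    combos = list(combos)
    if sample is not None and sample < len(combos):
        # Cap the number of variants, and so the number of generated functions.
        combos = random.sample(combos, sample)
    return [dict(zip(keys, combo)) for combo in combos]


grid = {"m_split": [8, 16, 32], "n_split": [4, 8, 16]}
variants = expand_parameter_grid(grid, filter_func=lambda m, n: m >= n, sample=3)
```

Filtering before sampling means the sample budget is spent only on valid combinations.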

Merged PR 2483: [Test] Integrate FileCheck into Python tests. [Lisa
Ong]

- Added FileCheck utility to the accera-llvm package
- Can be run on any output file produced by the Package.build process, e.g. .cu, .mlir
- Support some basic directives
- Added examples for caching and rocm validation

Example error spew:

/root/Accera/build/lib.linux-x86_64-3.9/accera/bin/FileCheck /root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu.filecheck --input-file /root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu

/root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu.filecheck:2:16: error: CHECK-COUNT: expected string not found in input (4 out of 4)
CHECK-COUNT-4: for (int64_t idx{{[0-9]}} = 0; idx{{[0-9]}} < 16; idx{{[0-9]}} += 1) {
               ^
/root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu:42:47: note: scanning from here
for (int64_t idx2 = 0; idx2 < 16; idx2 += 1) {
                                              ^

Input file: /root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu
Check file: /root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu.filecheck

-dump-input=help explains the following input dump.

Input was:
<<<<<<
         .
         .
         .
        37:
        38:
        39: extern "C" __global__ __launch_bounds__(1) void test_rocm_gemm_tiled_output_710d7d7d2ca9ca9e__gpu__(float *arg0, float *arg1, float *arg2) {
        40: for (int64_t idx0 = 0; idx0 < 16; idx0 += 1) {
        41: for (int64_t idx1 = 0; idx1 < 16; idx1 += 1) {
        42: for (int64_t idx2 = 0; idx2 < 16; idx2 += 1) {
count:2                                                   X error: no match found
        43: /*%0 = memref.load %arg0[%arg3, %arg5] : memref<16x16xf32, affine_map<(d0, d1) -> (d0 * 16 + d1)>>*/
count:2     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        44: const auto arg0_offset0 = affine_map_func_0_i0(idx0, idx2);
count:2     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        45: float var3 = ((float*)arg0)[arg0_offset0];
count:2     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        46: /*%1 = memref.load %arg1[%arg5, %arg4] : memref<16x16xf32, affine_map<(d0, d1) -> (d0 * 16 + d1)>>*/
count:2     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        47: const auto arg1_offset1 = affine_map_func_0_i0(idx2, idx1);
count:2     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         .
         .
         .
>>>>>>

Merged PR 2480: Clean up cache vectorization argument plumbing. [Mason
Remy]

Clean up cache vectorization argument plumbing
Merged PR 2481: Enables verification for ROCm smoke tests. [Kern
Handa]
Merged PR 2473: Extends range analysis by adding support for
udiv,sdiv,urem,srem. [Abdul Dakkak]

these come up during code gen
Merged PR 2474: Add vectorize arg to plan.cache. [Mason Remy]

Add vectorize arg to plan.cache
- Enables specifying whether or not to vectorize ops for a given cache,
  including an "AUTO" option, which will behave how caching
  vectorization has behaved in the past, where it vectorizes the cache
  if any loop in the loopnest is also vectorized
- Also fix some include paths

Full Changelog: v1.2.2...v1.2.3

Contributors

Arslan-e-Mustafa


18 Mar 05:32

lisaong

v1.2.2

2060948

v1.2.2

What's Changed

Full Changelog: v1.2.1...v1.2.2

Add Ubuntu CI workflow by @lisaong in #9
Rework documentation sections by @lisaong in #8
Manually run script to update doc versions by @lisaong in #10
Show more perf gains for the quickstart example by @lisaong in #12
Fix post merge build break by @lisaong in #14
README.md refactoring by @Arslan-e-Mustafa in #13
Complete refactoring of file array.md and simple affine loop nests.md file in manual docs by @Arslan-e-Mustafa in #16
complete refactoring of introduction.md file in manual docs by @Arslan-e-Mustafa in #15
Complete refactoring of vectorization and parallelization of manual docs by @Arslan-e-Mustafa in #20
Complete refactoring of targets.md and previous typos by @Arslan-e-Mustafa in #18
Complete refactoring of caching.md file by @Arslan-e-Mustafa in #19
Complete refactoring of schedule.md file from manual docs by @Arslan-e-Mustafa in #17
Complete refactoring of manual docs: Deferred Layout of Constant Arrays by @Arslan-e-Mustafa in #21
Refactoring of packages(dot)md from manual docs by @Arslan-e-Mustafa in #23
Complete refactoring of parameters dot md from manual docs by @Arslan-e-Mustafa in #22

Merged PR 2439: Downstream doc changes from github/main. [Lisa Ong]

Squashed commit of the following:

commit 8a6e553
Merged PR 2440: Enable tensorization for Rocm target. [Abdul Dakkak]
Merged PR 2470: Adds support for the execution of GPU (CUDA only)
functions via hat. [Kern Handa]
Merged PR 2467: Adding multiple functions in package.add() can't work
with stateful auxiliary metadata and index_map. [Denny Sun]

These bugs all stem from sharing Python objects, such as auxiliary metadata and the schedule's indices, among different functions. When we call package.add() to add multiple parameterized functions, we add the functions one by one and then emit them one by one. Each step changes the state of the shared Python objects, so only the first function added is emitted correctly. To make _add_function work, these shared Python objects need to be stateless.

Related work items: #3662
Merged PR 2469: Convert 'Local' memory space to 'Private' [Mason Remy]

Convert 'Local' memory space to 'Private'
Merged PR 2463: Enable specifying double buffer memory space. [Mason
Remy]

Enable specifying double buffer memory space
Merged PR 2468: Move to VS2022 for builds. [Kern Handa]

Move to VS2022 for builds
Merged PR 2465: extend gpu target spec. [Abdul Dakkak]

extend gpu target spec

Merged PR 2464: Compute a stable hash for function name suffixes.
[Lisa Ong]

Create a stable hash using md5 and json serialization of these stringized entries:

Array args: shape, type, role, layout
parameter dictionary
Target

Example output:

test_unequal_iteration_space_fusing_1 (__main__.DSLTest_04Fusing) ... DEBUG:root:Adding wrapped function
DEBUG:root:Adding wrapped function
Building function fusing_test_32d12fb1a01061ec
DEBUG:root:Detected logic function _ uses indices i,j
DEBUG:root:Detected logic function _ uses indices i,j
Building function _debug_check_allclose_16_16_4cfd65a8b606655b
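
A minimal sketch of a stable suffix built from md5 and JSON serialization follows. The payload layout, the helper name `stable_suffix`, and the truncation to 16 hex characters (inferred from the example output above) are assumptions.

```python
import hashlib
import json


def stable_suffix(array_args, parameters, target, length=16):
    """Derive a name suffix that is identical across runs and machines."""
    payload = json.dumps(
        {"args": array_args, "parameters": parameters, "target": target},
        sort_keys=True,  # key order must not influence the hash
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()[:length]


args = [{"shape": "(16, 16)", "type": "float32", "role": "INPUT", "layout": "FIRST_MAJOR"}]
suffix_a = stable_suffix(args, {"p_m_split_size": "16"}, "HOST")
suffix_b = stable_suffix(args, {"p_m_split_size": "16"}, "HOST")
suffix_c = stable_suffix(args, {"p_m_split_size": "32"}, "HOST")
```

Python's built-in `hash()` is salted per process for strings, so it cannot serve this purpose; a digest over a canonical serialization can.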

Merged PR 2460: [nfc] Fix build.sh setting for vcpkg debug builds.
[Lisa Ong]
Merged PR 2461: Replace MemoryType with MemorySpace for consistency.
[Mason Remy]

Replace MemoryType with MemorySpace for consistency
Merged PR 2416: Implement initial thrifty caching support. [Mason
Remy]

Implement initial thrifty caching support
- This is a simple brute-force approach where each thrifty cache is
  examined element-by-element alongside the array it is caching to check
  whether there is a stride of 1 between every access
- Currently this thrifty analysis and the potential erasing of thrifty
  caches happens after the cache ops have been created. This is due to
  needing the cache mapping to have already run in order to support
  hierarchical caching scenarios. Eventually this should be refactored
  and the thrifty analysis should be used to prevent creating the cache
  ops, but that is a larger refactor than the scope for this task.
- When creating affine loads and stores into caches, this change also
  tacks on some attributes onto the load/store ops to indicate how the
  original load or store accessed the base array. Since the base array
  -> cache position mapping is not always invertible (consider
  coefficient cache layout cases), this is one of the only ways to
  encode this information. Unfortunately, canonicalization on affine
  load/store ops will scrub away these attributes, so any reliance on
  them has to occur before a canonicalization pass. Similarly, the
  MakeCacheOps record which operands of their accesses are the base
  array positions, which depends on the operand list being unchanged.
  However, canonicalization may remove operands if it determines they
  are not used. While this is fine for the load/store op itself, any
  assumption like "base array indices are at positions N...N+K in the
  operand list" is no longer valid
Related work items: #3575
Merged PR 2459: Changes the order of the LLVM_SETUP_VARIANT detection.
[Kern Handa]

Changes the order of the LLVM_SETUP_VARIANT detection
Merged PR 2458: Fixes building with clang++ on Linux/WSL. [Kern Handa]

Fixes building with clang++ on Linux/WSL
Merged PR 2438: Support for double-buffer caching. [Mason Remy]

Support for double-buffer caching
- Adds plumbing from python dsl for double_buffer flag to cache API
- Implements double buffering by hoisting the initial cache fill outside
  of the cache trigger loop parent, then creating a prologue subnest
  that fills a temporary buffer with the (i+1)-th iteration's data and an
  epilogue subnest that moves that temporary buffer data into the main
  cache buffer. The last iteration of the trigger loop parent loop is
  unswitched and no cache filling is done in that iteration.
- On GPU the temporary buffer is allocated in private memory and if the
  cache is in shared memory each thread just holds onto their own
  contribution to the cache in their own private memory buffer until the
  epilogue fill nest
- Barrier ops are hoisted out of conditionals to avoid potential for
  deadlocks. The conditionals introduced in this PR should be
  always-true or always-false, but this is added as a safety measure.
  Currently the hoisting is naive - any barrier within a conditional is
  erased and barriers are placed before and after the conditional block.
  This is not correct for all future conditional scenarios as any
  operations that happen within the conditional that depend on the
  barrier existing will be broken, however it works for how conditionals
  are used currently and can be improved on over time
Related work items: #3659
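
The loop structure described above can be modeled sequentially in Python. This is only a sketch of the control flow (hoisted initial fill, prologue, body, epilogue, unswitched last iteration). It does not overlap loads with compute the way hardware would, and `sum_tiles_double_buffered` is a hypothetical name.

```python
def sum_tiles_double_buffered(tiles):
    """Sum all tile elements while prefetching the next tile into a temporary buffer."""
    if not tiles:
        return 0
    cache = list(tiles[0])           # initial cache fill, hoisted out of the loop
    total = 0
    for i in range(len(tiles) - 1):
        temp = list(tiles[i + 1])    # prologue: fetch iteration i+1's data
        total += sum(cache)          # body: compute on the current cache contents
        cache[:] = temp              # epilogue: move the temporary buffer into the cache
    total += sum(cache)              # unswitched last iteration: no cache fill
    return total


result = sum_tiles_double_buffered([[1, 2], [3, 4], [5, 6]])
```

On real hardware the prologue load for iteration i+1 can be in flight while the body for iteration i runs, which is where the benefit comes from.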
Merged PR 2450: Automatically add parameter dict as auxiliary data.
[Denny Sun]

Automatically add parameter dict as auxiliary data

Related work items: #3662
Merged PR 2456: Updates CUDA source emission based on testing with
nvrtc. [Kern Handa]

Updates CUDA source emission based on testing with nvrtc
Merged PR 2453: Sets CPU targets to default to openmp. [Kern Handa]

Sets CPU targets to default to openmp
Merged PR 2443: Add FP16 support. [Abdul Dakkak]

In preparation for adding CUDA mfma support, which only operates on FP16
Merged PR 2452: Updates GPU source emitting path to emit host launcher
and device function pairs. [Kern Handa]
Merged PR 2451: Updates IR util ResolveExec[Target,Runtime] to allow
for exact matches. [Kern Handa]

Updates IR util ResolveExec[Target,Runtime] to allow for exact matches
Merged PR 2447: Makes Vulkan specific behavior pred. on Runtime. [Kern
Handa]

Makes Vulkan specific behavior pred. on Runtime
Merged PR 2446: Updates Runtime enum in Targets.py to be more
comprehensive. [Kern Handa]

Updates Runtime enum in Targets.py to be more comprehensive
Merged PR 2449: [Cleanup] Replace "rc*" prefixes with "acc*"
prefixes in tablegen'ed code. [Lisa Ong]

For *.td, perform the following replacements for ops:

s/rcv_/accv_/g
s/rc_/acc_/g
s/rcxp_/accxp_/g
s/rcln_/accln_/g
Merged PR 2448: fix typo in the condition for mod in range analysis.
[Abdul Dakkak]

fix typo in the condition for mod in range analysis
Merged PR 2445: Fix bind command when index is further split. [Abdul
Dakkak]
Merged PR 2444: add range remainder. [Abdul Dakkak]

add range remainder
Merged PR 2441: Fix APInt usage in RangeValueOptimizePass. [Mason
Remy]

Run the RangeValueOptimizePass as part of acc-to-llvm
Merged PR 2442: Move ExecutionOptions to ir lib and create arrayattr
<-> struct utils. [Mason Remy]

Move ExecutionOptions to ir lib and create arrayattr <-> struct utils
Simplify target passthrough layer. [Mason Remy]
Move ExecutionOptions to ir lib and create arrayattr <-> struct utils.
[Mason Remy]
Merged PR 2430: Remove unnecessary barrier ops. [Chuck Jacobs]

This PR adds an optim...

Contributors

lisaong and Arslan-e-Mustafa

26 Jan 11:08

lisaong

v1.2.1

e7879bc

v1.2.1

What's Changed

Merged PR 2391: Update quickstart example, updated docs structure per
feedback. [Lisa Ong]
- Teasers for transformations in the Quickstart sample (to differentiate Accera from others), with benchmarking
- Removed the Miscellaneous section, redistributed various docs to various related locations
- Renamed the cross compilation tutorial so that it is ordered last
Merged PR 2392: Populate Target.Models based on known devices. [Kern
Handa]

Populate Target.Models based on known devices
Merged PR 2390: Merge multiple HAT files during project building.
[Kern Handa]

Merge multiple HAT files during project building

Related work items: #3559
Merged PR 2386: Add support for various targets. [Kern Handa]

Add support for various targets

Related work items: #3631
Merged PR 2389: [nfc] Doc typos and consistency fixes. [Lisa Ong]
Merged PR 2388: Update quickstart example, add binder quickstart.
[Lisa Ong]
- Update quickstart example to perform a matmul + ReLU (unoptimized)
- Add Launch in Binder button to run everything in the browser
Merged PR 2387: Placeholder GPU GridUnit definitions, add library
creation from multiple object files. [Lisa Ong]

Dependent HAT PR: microsoft/hat#21
- GridUnit definitions are static until we have real GPU targets. These are updated just to be consistent with the Manual
- When not cross compiling, combine multiple .obj/.o into .lib/.a
Related work items: #3576
Merged PR 2384: Update target docs, split Intel generation 8 and 9 for
consistency. [Lisa Ong]
- Update target docs to list the name of the target in the table
- Define separate models for Intel generation 8 and 9 for consistency
Related work items: #3631
Merged PR 2383: Support dynamic libs from Package.build [Lisa Ong]
- Add static and dynamic variants to the HAT and MLIR formats
- MLIR format is also split because we'd want to support MLIR inspection of the cross-compilation scenario without forcing users to switch between dynamic and static
- Updated README sample
Left for future work:
- Combining multiple object files into a static lib or dynamic lib. We'd need to think about how HAT packages can be merged together (for example, how to reconcile the metadata in the HAT file, such as description, author - do we merge all metadata or just pick the first HAT file encountered as the "master", etc)
Related PR: microsoft/hat#18

Related work items: #3576
Merged PR 2382: [nfc] Move Case Studies out of the Accera repo. [Lisa
Ong]

Case Studies will live in other repositories, and be cross linked from the Accera repo's Case Studies README.md (to be added in the future).

Related work items: #3632
Merged PR 2379: Specify dynamic lib dependencies from the HAT Package.
[Lisa Ong]

This is the final missing piece before we transition to building static / dynamic libs using hatlib.
- Plan infers additional dynamic dependencies when the target is GPU or when parallelization is requested.
- Package.add collects the dependency info from the various Plan instances.
- In Package.build, the platform parameter is used to resolve each dependency to the appropriate library (either a path or a -l directive).
  - For library paths that cannot be fully determined in advance, we default to the current working directory, so that the user can perhaps put the lib in the same path as the binaries. (This needs to be fleshed out more.)
- Removed dead code
Dependent hatlib PR: https://github.com/microsoft/hat/pull/16/files

Related work items: #3576
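
The resolution step described in PR 2379 might look roughly like the sketch below. Everything in it is hypothetical: the table `DYNAMIC_DEPENDENCIES`, the platform keys, the library names, and the function `resolve_dependency` are illustrative and are not taken from the Accera or hatlib sources. It only shows the idea of mapping a platform to either a `-l` directive or a path, with the current working directory as the fallback.

```python
import os

# Hypothetical dependency table; the names and values are illustrative only.
DYNAMIC_DEPENDENCIES = {
    "openmp": {"linux": "-lomp", "windows": "libomp.dll"},
}


def resolve_dependency(name, platform, search_dir=None):
    """Return a -l directive or a library path for the given platform."""
    entry = DYNAMIC_DEPENDENCIES[name][platform]
    if entry.startswith("-l"):
        # A linker directive is passed through unchanged.
        return entry
    # A bare file name whose location is not known in advance:
    # fall back to the current working directory.
    base = search_dir if search_dir is not None else os.getcwd()
    return os.path.join(base, entry)
```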
Merged PR 2380: Add Raspberry Pi 4 (B) support. [Kern Handa]

Related work items: #3631

Merged PR 2368: Update and optimize acc-translate. [Abdul Dakkak]

- Propagate constants while generating C++ code
- Inline MLIR within the C++ code to ease debugging
- Increase support for vector ops
- Silence a lot of warnings that were being emitted in the acc-translate codebase

The following

```
// CONFIG: {"K":2048,"M":2048,"N":2048,"block":{"x":16,"y":16,"z":1},"grid":{"x":128,"y":128,"z":1}}
module @gemm_naive_14479263422999410716_module attributes {gpu.binary = "HSACO"} {
  func @gemm_naive_14479263422999410716(%arg0: memref<2048x2048xf32> loc(unknown), %arg1: memref<2048x2048xf32> loc(unknown), %arg2: memref<2048x2048xf32> loc(unknown)) {
    %c16 = constant 16 : index loc(unknown)
    %c0 = constant 0 : index loc(unknown)
    %c2048 = constant 2048 : index loc(unknown)
    %c1 = constant 1 : index loc(unknown)
    %cst = constant 0.000000e+00 : f32 loc(unknown)
    %0 = "gpu.thread_id"() {dimension = "x"} : () -> index loc(unknown)
    %1 = "gpu.thread_id"() {dimension = "y"} : () -> index loc(unknown)
    %2 = "gpu.block_id"() {dimension = "x"} : () -> index loc(unknown)
    %3 = "gpu.block_id"() {dimension = "y"} : () -> index loc(unknown)
    %4 = scf.for %arg3 = %c0 to %c2048 step %c1 iter_args(%arg4 = %cst) -> (f32) {
      %11 = muli %3, %c16 : index loc(unknown)
      %12 = addi %1, %11 : index loc(unknown)
      %13 = memref.load %arg0[%12, %arg3] : memref<2048x2048xf32> loc(unknown)
      %14 = muli %2, %c16 : index loc(unknown)
      %15 = addi %0, %14 : index loc(unknown)
      %16 = memref.load %arg1[%arg3, %15] : memref<2048x2048xf32> loc(unknown)
      %17 = mulf %13, %16 {RelaxedPrecision} : f32 loc(unknown)
      %18 = addf %arg4, %17 {RelaxedPrecision} : f32 loc(unknown)
      scf.yield %18 : f32 loc(unknown)
    } loc(unknown)
    %5 = muli %3, %c16 : index loc(unknown)
    %6 = addi %1, %5 : index loc(unknown)
    %7 = muli %2, %c16 : index loc(unknown)
    %8 = addi %0, %7 : index loc(unknown)
    %9 = memref.load %arg2[%6, %8] : memref<2048x2048xf32> loc(unknown)
    %10 = addf %9, %4 {RelaxedPrecision} : f32 loc(unknown)
    memref.store %10, %arg2[%6, %8] : memref<2048x2048xf32> loc(unknown)
    return loc(unknown)
  } loc(unknown)
} loc(unknown)
```

generates the following cpp file

```
#if defined(__HIP_PLATFORM_AMD__)
#include <hip/hip_runtime.h>
using vfloatx2_t = float __attribute__((ext_vector_type(2)));
using vfloatx4_t = float __attribute__((ext_vector_type(4)));
using vfloatx16_t = float __attribute__((ext_vector_type(16)));
#else
#include "cuda_fp16.h"
#endif // !defined(__HIP_PLATFORM_AMD__)

#include <math.h>
#include <stdint.h>

__global__ void gemm_naive_14479263422999410716(float (*arg0)[2048], float (*arg1)[2048], float (*arg2)[2048])
{
    /*%0 = "gpu.thread_id"() {dimension = "x"} : () -> index*/
    const uint threadIdx_x_0 = threadIdx.x;
    /*%1 = "gpu.thread_id"() {dimension = "y"} : () -> index*/
    const uint threadIdx_y_1 = threadIdx.y;
    /*%2 = "gpu.block_id"() {dimension = "x"} : () -> index*/
    const uint blockIdx_x_2 = blockIdx.x;
    /*%3 = ...
```
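
For readers who want to check the index arithmetic of the MLIR kernel in PR 2368, here is a small pure-Python emulation. It is a sketch written for this note, not output of acc-translate: the function names and the serial loop over blocks and threads are assumptions, and `block_dim` stands in for the constant 16 (`%c16`) so that small inputs can be tried.

```python
def gemm_naive_thread(A, B, C, block_id, thread_id, block_dim=16):
    """Work done by one GPU thread; block_id and thread_id are (x, y) pairs."""
    row = thread_id[1] + block_id[1] * block_dim  # %12 and %6 in the MLIR
    col = thread_id[0] + block_id[0] * block_dim  # %15 and %8 in the MLIR
    acc = 0.0  # %cst, the initial iter_args value
    for k in range(len(B)):  # the scf.for loop over K
        acc += A[row][k] * B[k][col]
    C[row][col] += acc  # load from %arg2, add, store back


def gemm_naive(A, B, C, block_dim=16):
    """Emulate the launch by visiting every (block, thread) pair serially."""
    grid_y, grid_x = len(C) // block_dim, len(C[0]) // block_dim
    for by in range(grid_y):
        for bx in range(grid_x):
            for ty in range(block_dim):
                for tx in range(block_dim):
                    gemm_naive_thread(A, B, C, (bx, by), (tx, ty), block_dim)
    return C
```

With the CONFIG shown in the MLIR (16x16 blocks on a 128x128 grid), each of the 2048x2048 output elements is handled by exactly one thread, which is why the kernel needs no synchronization.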

Merged PR 2376: [build] Install acc-lsp-server as an internal tool.
[Lisa Ong]

Removes acc-lsp-server from accera-compilers

Minor CMake macro renames to (hopefully) improve usability
Merged PR 2378: [doc] Update doc links after DSL changes, fix missing
file warnings. [Lisa Ong]

Verified by:
```
cd <accera_root>
pip install mkdocs-material mkdocs-git-revision-date-plugin
mkdocs serve
```
Merged PR 2377: Retire Benchmark.py, use hatlib for benchmarking and
shared library creation. [Lisa Ong]

This cleanup work precedes the actual work to produce static or dynamic libraries by migrating existing HAT Python scripts to consume hatlib. Next PRs will consume hatlib to produce those libraries.

hatlib defines a HAT package as .hat files and a library.
- Remove accera.tuning.AutoBenchmark and replace usages with hat.run_benchmark in case studies
- Removed accera.tuning.CorrectnessCheck. Baked correctness checking into accera.test.verifiers
- Disabled some tests in preparation for coming work (next PRs)
  - parallelization tests: need to specify lomp as a link target dependency in the HAT file, and update hatlib to honor this flag
  - emit_unpacked_buffer_tests: to resolve multi-MLIR-module scenario where we have a globals module in addition to the package module
Depends on this PR: microsoft/hat#15

Related work items: #3556
Merged PR 2374: Retain and honor the order of functions added to the
package. [Kern Handa]

Retain and honor the order of functions added to the package

Related work items: #3629
Merged PR 2371: add lsp server for accera. [Abdul Dakkak]

This adds an LSP server to be used with the MLIR VS Code extension (https://marketplace.visualstudio.com/items?itemName=llvm-vs-code-extensions.vscode-mlir). You will have to specify the LSP server in your settings.json. On my system, this means adding the following setting:
```
  "mlir.server_path": "${workspaceFolder}/build/accera/acc-lsp-server/acc-lsp-server",
```
It's not super robust, though.
Merged PR 2372: Reduce install size. For example, on Linux the install
size goes from 873M to 742M. [Abdul Dakkak]

Reduce install size. For example, on Linux the install dir goes from 873M to 742M. More can be done along those lines.
Merged PR 2369: run clang-format on acc_translate. [Abdul Dakkak]

run clang-format on acc_translate. There are no modifications to the code
Merged PR 2367: Selectively emit GPU utilities. [Kern Handa]

Selectively emit GPU utilities

Related work items: #3559
Merged PR 2366: [build] Fix manylinux package build. [Lisa Ong]

Apply updated requirements.txt without rebuilding docker image
Merged PR 2365: Unify Package.add_function and Package.add_functions
into Package.add. [Kern Handa]

Related work items: #3549
Merged P...
