Releases: microsoft/Accera

v1.2.9

17 Sep 03:37
  • Merged PR 2862: write runtime size of index type to Hat. [Denny Sun]

    write runtime size of index type to Hat

  • Merged PR 2861: Fix cache_C benchmark variable which is not getting
    set properly for CUDA. [Ritwik Das]

    Fix cache_C benchmark variable which is not getting set properly for CUDA

  • Merged PR 2864: [build]: fix breaks due to agent image updates. [Lisa
    Ong]

    The latest version of the Azure Pipelines images now sets VCPKG_ROOT, which overrides the submodule used by Accera.

    See: actions/runner-images@ef638dd

    • Only pipelines that rely on azure build agents are affected.
    • We still need to keep the submodule around to enable external builds from the Github repo.
    • Remove defunct pipeline
    • Update vcpkg submodule while we're here
  • Merged PR 2839: Enable CUDA output caching. [Ritwik Das]

    • Add Tensor memory space type to denote memory fragments for caching (e.g. C in gemm). This might go away in the future and be replaced with Private once the caching code is unified with the ROCM behavior.
    • Change caching code to generate MMALoad/StoreOps for caching of the output.

    Related work items: #3725

  • Merged PR 2813: Add pass to recognize patterns that look like int16
    matrix multiply. [Chuck Jacobs]

    This PR adds a pass to rewrite GEMM-like loops that multiply-accumulate int16 matrices into an int32 result. If this pattern gets invoked, the output should contain the much-sought vpmaddwd instruction.

    It also fixes some old low-level tests of integer arithmetic.
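To make the targeted pattern concrete, here is a minimal pure-Python sketch of an int16 multiply-accumulate into an int32 result, alongside an emulation of what vpmaddwd computes (multiply adjacent signed 16-bit pairs, then add the two 32-bit products). All function names are hypothetical and not part of Accera; the sketch illustrates only the arithmetic, not the rewrite pass itself.

```python
def wrap_int32(x):
    """Wrap a Python int to the signed 32-bit range, like hardware arithmetic."""
    return (x + 2**31) % 2**32 - 2**31


def madd_pairs(a, b):
    """Emulate vpmaddwd on lists of signed 16-bit values.

    Adjacent pairs are multiplied element-wise and the two 32-bit
    products are added, yielding one int32 per pair.
    """
    assert len(a) == len(b) and len(a) % 2 == 0
    return [
        wrap_int32(a[p] * b[p] + a[p + 1] * b[p + 1])
        for p in range(0, len(a), 2)
    ]


def matmul_i16_to_i32(A, B):
    """GEMM-like loop nest: int16 inputs, int32 accumulator."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i][j] = wrap_int32(C[i][j] + A[i][k] * B[k][j])
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul_i16_to_i32(A, B)

# One output element is the sum of the pairwise results for a row of A and
# a column of B, which is why this loop nest maps onto vpmaddwd.
row, col = A[0], [B[0][0], B[1][0]]
pairwise = sum(madd_pairs(row, col))
```

The pairwise form matters because the reduction dimension is consumed two elements at a time, so a vectorized kernel needs the k loop to be a multiple of two.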

  • Merged PR 2847: [release] Bump docs version to 1.2.9 and update github
    action container. [Lisa Ong]

    • Rev docs to 1.2.9

    • Update github workflow to reference updated tag for 14.0.6-1

  • Merged PR 2845: Filter GPU benchmarks by de-parameterizing cache
    layouts. [Ritwik Das]

    Filter GPU benchmarks by de-parameterizing cache layouts

  • Merged PR 2843: Fix bug in GPU benchmark to calculate valid variant.
    [Ritwik Das]

    • Fix bug in GPU benchmark to calculate valid variant
    • Add cosmosdb util to cleanup old entries
  • Merged PR 2835: Merge in MLIR fixes for LocationSnapshot and
    MemRefCastOp. [Lisa Ong]

    From 1abc4a981067ef1fd9bf717d7fabc4f6d75520d1 Mon Sep 17 00:00:00 2001

  • Merged PR 2842: Parameterize cache strategy in GPU benchmarks and fix
    kernel filters. [Ritwik Das]

    Parameterize cache strategy in GPU benchmarks and fix kernel filters

  • Merged PR 2836: Value DSL support for runtime sized output arrays.
    [Lisa Ong]

    • This adds memref-in-memref support for output arrays that are allocated in the function
    • A new "Pointer" Value wrapper class with a Store() function which creates an accv.StoreOp, similar to Array, Scalar
    • Update accv.StoreOp to support memrefs-in-memrefs

    Value pointer levels are defined as follows:

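    | Layout              | Example                                                    | Pointer level                                  | C-type                             |
    |---------------------|------------------------------------------------------------|------------------------------------------------|------------------------------------|
    | scalar              | int16, float32, index, ...                                 | 0                                              | int16_t, float32_t, int64_t, ...   |
    | single-level memref | memref<1xindex>, memref<3x2xf32>, memref<10x16x11x?xf32>   | 1                                              | int64_t*, float32_t*, float32_t*   |
    | memref in memref    | memref<memref<?x?x?xf32>>                                  | at least 2 (= the number of levels of memrefs) | float32_t**                        |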

    Future work:

    • End-to-end lowering through Python DSL
    • Bare pointer convention for output arrays
    • Custom allocator functions. Currently we use the built-in std alloc.

    Related work items: #3730

  • Merged PR 2840: [nfc] Remove redundant ACR info from docker scripts.
    [Lisa Ong]

    The container registry allows pull-only access

  • Merged PR 2838: Runtime sized Array lowering to LLVM, accv.alloc to
    LLVM malloc. [Denny Sun]

    1. Make a deep copy of the range end of value type when cloning ops
    2. Plumb the runtime size through to LLVM
    3. Transform memref.alloc to LLVM malloc
    4. Convert between block argument and symbol name

    The generated IR at each stage:

    Initial.mlir

    %2 = "accv.alloc"(%arg0, %arg1) {sym_name = "diff"} : (index, index) -> memref<?x?xf32> loc(#loc)

    LoopNestToValueFunc.mlir

    %2 = "accv.alloc"(%arg0, %arg1) {sym_name = "diff"} : (index, index) -> memref<?x?xf32> loc(#loc)
    affine.for %arg4 = 0 to %arg0 {
        affine.for %arg5 = 0 to %arg1 {
        }
    }
    

    ConvertValueToStd.mlir

    %0 = memref.alloc(%arg0, %arg1) : memref<?x?xf32>
    

    ConvertValueToLLVM.mlir

    %8 = llvm.mul %arg1, %arg0  : i64
    %9 = llvm.mlir.null : !llvm.ptr<f32>
    %10 = llvm.getelementptr %9[%8] : (!llvm.ptr<f32>, i64) -> !llvm.ptr<f32>
    %11 = llvm.ptrtoint %10 : !llvm.ptr<f32> to i64
    %12 = llvm.call @malloc(%11) : (i64) -> !llvm.ptr<i8>
    

    Related work items: #3733
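The ConvertValueToLLVM.mlir output above computes the malloc size as (number of elements) x (element size), using the getelementptr-on-null idiom to obtain the byte count. A minimal sketch of the same arithmetic in Python follows; the function names are hypothetical and ctypes is used only as a stand-in for the allocation, not as a description of how Accera allocates.

```python
import ctypes


def alloc_size_bytes(dim0, dim1, elem_type=ctypes.c_float):
    """Byte count for a runtime-sized dim0 x dim1 array."""
    num_elements = dim0 * dim1                       # llvm.mul %arg1, %arg0
    return num_elements * ctypes.sizeof(elem_type)   # getelementptr null + ptrtoint


def alloc_f32_buffer(dim0, dim1):
    """Allocate a zero-initialized float buffer sized at runtime."""
    return (ctypes.c_float * (dim0 * dim1))()


# Sizes are only known at call time, as with memref<?x?xf32>.
m, n = 3, 5
size = alloc_size_bytes(m, n)
buf = alloc_f32_buffer(m, n)
```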

  • Merged PR 2831: Record unique IDs so that different processes acting
    on a value module. [Mason Remy]

    Record unique IDs so that different processes acting on a value module
    don't produce conflicting IDs

  • Merged PR 2837: Fix WPT calculation to prevent 0 work and filter
    benchmarks. [Ritwik Das]

    Fix WPT calculation to prevent 0 work and filter benchmarks

  • Merged PR 2832: Caching strategy flag and thread ID optimization (GPU)
    [Ritwik Das]

    • Add a flag to plan.cache() to expose the different thread <--> data arrangements
    • Optimize thread ID calculation to check blockdim first
  • Merged PR 2829: Add handwritten caching implementation for GPU.
    [Ritwik Das]

    Add GPUBlockCacheOp, which lowers to a handwritten caching implementation on the GPU. The implementation supports access patterns that minimize bank conflicts in shared memory and maximize coalesced global memory access.

  • Merged PR 2821: Fixes constraint logic for fusion of more than two
    schedules. [Kern Handa]

    Fixes constraint logic for fusion of more than two schedules

  • Merged PR 2830: Fixes macOS CI build. [Kern Handa]

    Fixes macOS CI build

  • Merged PR 2806: Enable specifying cache element type. [Mason Remy]

    Enable specifying cache element type

    • Supports accumulating and/or computing in a different element type and
      batching up the casts for those types
    • Also adds support for binop/castop expansion and castop folding
  • Merged PR 2818: Upgrade hatlib dependency to v0.0.23. [Ritwik Das]

    Upgrade hatlib dependency to v0.0.23

  • Merged PR 2792: Refactor cast to a value cast op. [Mason Remy]

    Refactor cast to a value cast op

  • Merged PR 2788: Re-enabled fusing test that was taking too long.
    [Chuck Jacobs]

    This PR just re-enables a skipped test that was taking too long

  • Merged PR 2816: Upgrade hatlib requirement to 0.0.22. [Ritwik Das]

    Upgrade hatlib requirement to 0.0.22

  • Merged PR 2811: [nfc] Upgrade CUDA to 11.7 on NVidia benchmark
    machines. [Lisa Ong]

    According to https://hub.docker.com/r/nvidia/cuda/tags, 11.7.0 is still the latest.

Full Changelog: v1.2.8...v1.2.9

v1.2.8

10 Aug 03:29

What's Changed


  • Merged PR 2814: Parameterize batch_size in GPU benchmarks. [Ritwik
    Das]

    Parameterize batch_size in GPU benchmarks

  • Merged PR 2810: [release] [nfc] Bump docs version to 1.2.8, bump
    github actions to llvm 14.0.6. [Lisa Ong]

    Preparation for 1.2.8 release

  • Merged PR 2808: [ci] Add vcpkg caching for buddy builds, disable flaky
    parallelized tests. [Lisa Ong]

    • Enable vcpkg binary caching for CI pipelines that are using non custom agents. This reduces vcpkg install time from 2-3 minutes to ~30 seconds
    • ctest --parallel on macos can sometimes fail randomly. The tests will need to be updated to support running in parallel
  • Merged PR 2804: [ci] Reduce runtimes of PR Buddy Builds. [Lisa Ong]

    • Remove redundant setup.py builds in pipelines with cmake builds
    • Build debug for Linux only (the fastest config)
    • Add pipeline caching for ccache, conan, and pip where applicable
    • Add parallel configs where applicable
    • Filter out some tests on windows due to slow runtimes. These should have coverage on Linux and macOS.
  • Merged PR 2807: Enable verification for CK baselines. [Ritwik Das]

    • Enable verification for CK baselines
    • increase timeout for cuda resnet
    • add functionality for extracting kernel code from cosmosdb
  • Merged PR 2802: Fix barrier optimization pass. [Chuck Jacobs]

    This PR fixes a couple of barrier-related issues:

    • The barrier optimization pass wasn't keeping barriers that protected vector load/store ops
    • Multiple barriers were getting generated when hoisting barriers out of conditionals

    Related work items: #3732

  • Merged PR 2800: Add max_threads to parallelize and change default
    behavior. [Ritwik Das]

    • Add num_threads to parallelize
    • change default behavior to count the number of iterations of the given indices
    • Update documentation
  • Merged PR 2801: Remove verification on cuda-fp32-big benchmark.
    [Ritwik Das]

    Remove verification on cuda-fp32-big benchmark

  • Merged PR 2798: LLVM 14.0.6 upgrade. [Lisa Ong]

    An incremental upgrade with minimal or no changes to MLIR

  • Merged PR 2796: Makes NestedPassAdaptor's pipeline consistent. [Kern
    Handa]

    Makes NestedPassAdaptor's pipeline consistent

    This change makes it so NestedPassAdaptor creates a new pass manager
    every time a new pass is added. Prior to this change, if dumpPasses was
    false, the same nested pass manager would be used. If dumpPasses was
    true, a new nested pass manager would be created per call to addPass.
    This difference in behavior also caused the lowering pipeline to
    differ depending on the value of dumpPasses.

    For example, in the following code in AcceraPasses.cpp, all the passes
    that are added to funcOpPM run BEFORE createConvertSCFToOpenMPPass
    if dumpPasses was false.

        auto funcOpPM = pmAdaptor.nestPassManager([&]() -> OpPassManager& { return pm.nest<v::ValueModuleOp>().nest<FuncOp>(); });
        funcOpPM.addPass(createConvertLinalgToAffineLoopsPass());
        funcOpPM.addPass(createSimplifyAffineStructuresPass());
        funcOpPM.addPass(createCanonicalizerPass());
        funcOpPM.addPass(createLoopInvariantCodeMotionPass());
        funcOpPM.addPass(createCSEPass());
    
        pmAdaptor.addPass(createConvertSCFToOpenMPPass());
        pmAdaptor.addPass(value::createValueToStdPass(options.enableProfile));
        funcOpPM.addPass(value::createBarrierOptPass(options.writeBarrierGraph.getValue(), options.barrierGraphFilename.getValue()));
        pmAdaptor.addPass(value::createRangeValueOptimizePass());
        pmAdaptor.addPass(createCanonicalizerPass());
        pmAdaptor.addPass(createCSEPass());

    Additionally, this change exposed the fact that the BarrierOpt pass is
    incorrectly erasing barriers, and so has been made into a no-op until
    this correctness issue has been fixed.
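The ordering problem described above can be reproduced with a small model. The sketch below is a simplified stand-in, not the MLIR API: PassManager, NestedAdaptor, and the pass labels are all hypothetical. It shows that reusing one nested pass manager pulls later nested passes ahead of passes that were added to the parent in between, while creating a fresh nested manager per pass preserves program order.

```python
class PassManager:
    """Toy pass manager: entries are pass names or nested PassManagers."""

    def __init__(self):
        self.entries = []

    def add_pass(self, name):
        self.entries.append(name)

    def nest(self):
        nested = PassManager()
        self.entries.append(nested)
        return nested

    def flatten(self):
        """Return pass names in the order they would run."""
        order = []
        for entry in self.entries:
            if isinstance(entry, PassManager):
                order.extend(entry.flatten())
            else:
                order.append(entry)
        return order


class NestedAdaptor:
    """Adds passes to a nested manager, either reused or fresh per pass."""

    def __init__(self, pm, fresh_nest_per_pass):
        self.pm = pm
        self.fresh = fresh_nest_per_pass
        self.cached = None

    def add_pass(self, name):
        if self.fresh or self.cached is None:
            self.cached = self.pm.nest()
        self.cached.add_pass(name)


def build_pipeline(fresh_nest_per_pass):
    pm = PassManager()
    func_pm = NestedAdaptor(pm, fresh_nest_per_pass)
    func_pm.add_pass("canonicalize")
    func_pm.add_pass("cse")
    pm.add_pass("convert-scf-to-openmp")
    pm.add_pass("value-to-std")
    func_pm.add_pass("barrier-opt")
    pm.add_pass("range-value-optimize")
    return pm.flatten()


reused = build_pipeline(fresh_nest_per_pass=False)
fresh = build_pipeline(fresh_nest_per_pass=True)
```

In the reused case "barrier-opt" lands before "convert-scf-to-openmp"; in the fresh case it runs where it was written.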

  • Merged PR 2795: [docs] Cleanup viz scripts, clarify reorder
    illustrations. [Lisa Ong]

    • Clarify in the labels while working on the animated version

    • Cleanup and rename .js files for (slightly) easier lookup

  • Merged PR 2475: LLVM 14.0.0 upgrade. [Lisa Ong]

    Tag: llvmorg-14.0.0

    Notable changes:

    • std dialect ops are now moved to arith, math dialects
    • StrEnumAttribute is now replaced by simple enums. This affects things like gpu.dimension.x
    • [Issue] linalg.copy is removed, replaced by memref.copy, which introduces a runtime dependency on a memrefCopy C function for non-identity layout copies. This affects Array.sub_array in debug mode.
    • [Regression] OMP to LLVM lowering will crash in mlir-translate findAlloc due to an empty set of blocks being emitted. This only affects dynamic scheduling with collapsed loops.
    • Lots of renames
    • Upgraded macOS to macOS-12

    Related work items: #3646

  • Merged PR 2753: accera.Dimension and runtime-sized Arrays in the
    Python DSL. [Denny Sun]

    With this change, Accera can generate the initial MLIR for runtime-sized Arrays. The IR lowering is not yet fully working because of a bug, which can be fixed in later changes.

            M = Dim()
            N = Dim()
            K = Dim()
    
            A = Array(shape=(M, K), element_type=ScalarType.float32, role=Array.Role.INPUT)
            B = Array(shape=(K, N), element_type=ScalarType.float32, role=Array.Role.INPUT)
            C = Array(shape=(M, N), element_type=ScalarType.float32, role=Array.Role.INPUT_OUTPUT)
    
            nest = Nest((M, N, K))
            i, j, k = nest.get_indices()
    
            @nest.iteration_logic
            def _():
                C[i, j] += A[i, k] * B[k, j]
    
            package.add()
            package.build()
    
    #domain0 = #accln<"idomain{{i,3}={0:{op_idx:0}:1}, {j,4}={0:{op_idx:1}:1}, {k,5}={0:{op_idx:2}:1}}">
    #domain1 = #accln<"idomain{{i,9}={0:{op_idx:0}:1}, {j,10}={0:{op_idx:1}:1}}">
    #domain2 = #accln<"idomain{{i,6}={0:1:1}}">
    
    #map = affine_map<(d0, d1)[s0] -> (d0 * s0 + d1)>
    #xdomain0 = #accln<"xfdomain{dims: {{i,3}, {j,4}, {k,5}}, indices: {{{i,3} : {0:{op_idx:0}:1}}, {{j,4} : {0:{op_idx:1}:1}}, {{k,5} : {0:{op_idx:2}:1}}}}">
    #xdomain1 = #accln<"xfdomain{dims: {{i,9}, {j,10}}, indices: {{{i,9} : {0:{op_idx:0}:1}}, {{j,10} : {0:{op_idx:1}:1}}}}">
    #xdomain2 = #accln<"xfdomain{dims: {{i,6}}, indices: {{{i,6} : {0:1:1}}}}">
    module @test_runtimesizes attributes {llvm.data_layout = "... ..."}  {
      accv.module "test_runtimesizes"  {
        accv.func nested @runtimesizes_..._impl_...(%arg0: index loc(unknown), %arg1: index loc(unknown), %arg2: index loc(unknown), %arg3: memref<?x?xf32, #map> loc(unknown), %arg4: memref<?x?xf32, #map> loc(unknown), %arg5: memref<?x?xf32, #map> loc(unknown)) attributes {accv.output_verifiers = ["", "", "", "", "", "_debug_check_allclose_<accera.lang.Dim.Dim object at ...>_<accera.lang.Dim.Dim object at ...>_..."], exec_target = 0 : i64} {
          %0 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(#loc)
          %1 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(#loc)
          %2 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(#loc)
          "accln.nest"(%0, %1, %2) ( {
            %3 = accln.sym_index {name = "i"} #accln<"index{i,3}"> loc(#loc)
            %4 = accln.sym_index {name = "j"} #accln<"index{j,4}"> loc(#loc)
            %5 = accln.sym_index {name = "k"} #accln<"index{k,5}"> loc(#loc)
            "accln.kernel"() ( {
              %7 = "accv.slice"(%arg5, %3, %4) {sliceDimensions = [0, 1]} : (memref<?x?xf32, #map>, index, index) -> memref<f32> loc(#loc)
              ... ...
              accln.terminator loc(#loc)
            }) {sym_name = "_"} : () -> () loc(#loc)
            ... ...
            accln.terminator loc(#loc)
          }) {domain = #domain0, exec_target = 0 : i64, kernels = []} : (index, index, index) -> () loc(#loc)
          accv.return loc(#loc)
        } loc(#loc)
        accv.func @runtimesizes_...(%arg0: index loc(unknown), %arg1: index loc(unknown), %arg2: index lo...
    
  • Merged PR 2793: support sign extend op in canVectorize() function to
    improve generated MLIR. [JUBI TANEJA]

    While trying to optimize int16 MatMul with the vectorize transformation in the DSL, we noticed an unrolled loop of load, binop, sexti, and store instructions with no vector instructions emitted. This hinted that the sign-extend instruction was not supported in the canVectorize() function. With this op supported, vector instructions can now be emitted in the MLIR.

  • Merged PR 2790: Filter invalid kernels from GPU benchmarks. [Ritwik
    Das]

    • Filter invalid kernels from GPU benchmarks
    • Disable verification on cuda f16 benchmarks
    • Remove frequent cleanups
  • Merged PR 2787: Remove MLIR flag from package format in benchmarks.
    [Ritwik Das]

    Remove MLIR flag from package format in benchmarks

  • Merged PR 2784: Merge Github changes to ADO. [Lisa Ong]

  • Merged PR 2776: Make fusing more efficient. [Chuck Jacobs]

    This PR refactors the code generation for schedules and makes it more efficient. This makes a big difference for complex schedules with constraints on the kernels (like the ones generated when fusing schedules).

    Here are some timings on a few tests (modified versions of Mason's example script) I ran:

    | test                               | main branch | PR branch |
    |------------------------------------|-------------|-----------|
    | 3 fused schedules, tile first only | 18.8s       | 5.8s      |
    | 3 fused schedules, tile 1 & 2      | 190s        | 6.2s      |
    | 3 fused schedules, tile all 3      | ????        | 7.2s      |

    Related work items: #3731

  • Merged PR 2781: Fix benchmark with MLIR format and add repro test.
    [Ritwik Das]

  • Merged PR 2780: Type support for tensor ops in CUDA. [R...


v1.2.7

13 Jul 18:06

v1.2.7

  • Merged PR 2744: [doc] Fixes link in reference/functions/cast.md, revs
    version on all docs. [Kern Handa]

    [doc] Fixes link in reference/functions/cast.md

  • Merged PR 2743: [DSL] Document implicit casting rules and the explicit
    cast function. [Lisa Ong]

    • Document implicit casting rules implemented by !2693
    • Promote acc.cast to a documented function to give the user control to override implicit casting behavior
  • Merged PR 2739: Updates ROCM tensorization pattern to handle casting.
    [Kern Handa]

    Updates ROCM tensorization pattern to handle casting

  • Merged PR 2643: Some fixes for last major array caching in
    tensorization. [Mason Remy]

    Some fixes for last major array caching in tensorization

  • Merged PR 2693: Updates DSL codegen to implicitly cast if possible.
    [Kern Handa]

    Updates DSL codegen to implicitly cast if possible

  • Merged PR 2735: Pass multiple input files as comma-separated list to
    benchmark tool. [Ritwik Das]

    https://intelligentdevices.visualstudio.com/ELL/_build/results?buildId=41588&view=logs&j=d78921a4-2f18-50b0-77ad-4c6803f3371b&t=f97c60f6-ada7-5ec9-5ea1-510216c408e9

    The pipeline above did not run the second set of input sizes because the first process did not exit until the pipeline timeout was hit. After the fix, there is always a single job.

  • Merged PR 2721: Remove unnecessary logging in benchmarks. [Ritwik Das]

    Remove unnecessary logging in benchmarks

  • Merged PR 2674: Support emitting runtime array sizes in the Value DSL.
    [Lisa Ong]

    • Minimum set of changes to support runtime sizes in the Value DSL without transformations
    • Add a ScalarDimension type (name TBC) which is aliased to Scalar
    • Support variable ends in MemoryLayout, ScheduledLoopOp, RangeValueAnalysis
    • Use mlir::ShapedType::kDynamicSize and mlir::ShapedType::kDynamicStrideOrOffset as sentinel values, following the pattern in MemRefOps, TensorOps, etc.
    • TODO: E2E verification in the next PR
    • TODO: Python DSL changes in the next PR

    Output of mlir-translate for the runtime_sizes_all case, where %21, %22 and %23 are the runtime sizes for M, N, and K:

    define void @NestMatMul(float* %0, float* %1, i64 %2, i64 %3, i64 %4, i64 %5, i64 %6, float* %7, float* %8, i64 %9, i64 %10, i64 %11, i64 %12, i64 %13, float* %14, float* %15, i64 %16, i64 %17, i64 %18, i64 %19, i64 %20, i64 %21, i64 %22, i64 %23) !dbg !3 {
      br label %25, !dbg !7
    
    25:                                               ; preds = %57, %24
      %26 = phi i64 [ %58, %57 ], [ 0, %24 ]
      %27 = icmp slt i64 %26, %21, !dbg !9
      br i1 %27, label %28, label %59, !dbg !10
    
    28:                                               ; preds = %25
      br label %29, !dbg !11
    
    29:                                               ; preds = %55, %28
      %30 = phi i64 [ %56, %55 ], [ 0, %28 ]
      %31 = icmp slt i64 %30, %22, !dbg !12
      br i1 %31, label %32, label %57, !dbg !13
    
    32:                                               ; preds = %29
      br label %33, !dbg !14
    
    33:                                               ; preds = %36, %32
      %34 = phi i64 [ %54, %36 ], [ 0, %32 ]
      %35 = icmp slt i64 %34, %23, !dbg !15
      br i1 %35, label %36, label %55, !dbg !16
    
    36:                                               ; preds = %33
      %37 = mul i64 %26, %5, !dbg !17
      %38 = add i64 %37, %34, !dbg !18
      %39 = getelementptr float, float* %1, i64 %38, !dbg !19
      %40 = load float, float* %39, align 4, !dbg !20
      %41 = mul i64 %34, %12, !dbg !21
      %42 = add i64 %41, %30, !dbg !22
      %43 = getelementptr float, float* %8, i64 %42, !dbg !23
      %44 = load float, float* %43, align 4, !dbg !24
      %45 = fmul float %40, %44, !dbg !25
      %46 = mul i64 %26, %19, !dbg !26
      %47 = add i64 %46, %30, !dbg !27
      %48 = getelementptr float, float* %15, i64 %47, !dbg !28
      %49 = load float, float* %48, align 4, !dbg !29
      %50 = fadd float %49, %45, !dbg !30
      %51 = mul i64 %26, %19, !dbg !31
      %52 = add i64 %51, %30, !dbg !32
      %53 = getelementptr float, float* %15, i64 %52, !dbg !33
      store float %50, float* %53, align 4, !dbg !34
      %54 = add i64 %34, 1, !dbg !35
      br label %33, !dbg !36
    
    55:                                               ; preds = %33
      %56 = add i64 %30, 1, !dbg !37
      br label %29, !dbg !38
    
    57:                                               ; preds = %29
      %58 = add i64 %26, 1, !dbg !39
      br label %25, !dbg !40
    
    59:                                               ; preds = %25
      ret void, !dbg !41
    }
    

    Related work items: #3716, #3717
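For readers who prefer not to trace the LLVM IR, the following is a hand-written Python reading of the loop nest above. It assumes flat row-major buffers and uses the leading stride of each array (%5, %12, and %19 in the IR) together with the runtime sizes M, N, and K (%21, %22, %23). The function and parameter names are illustrative only.

```python
def nest_matmul(A, lda, B, ldb, C, ldc, M, N, K):
    """C[i, j] += A[i, k] * B[k, j] over flat buffers with runtime sizes."""
    for i in range(M):                  # block 25: i < M
        for j in range(N):              # block 29: j < N
            for k in range(K):          # block 33: k < K
                a = A[i * lda + k]      # %37..%40
                b = B[k * ldb + j]      # %41..%44
                C[i * ldc + j] += a * b # %45..%53


# 2x3 times 3x2, all sizes supplied at call time.
M, N, K = 2, 2, 3
A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
B = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
C = [0.0] * (M * N)
nest_matmul(A, K, B, N, C, N, M, N, K)
```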

  • Merged PR 2682: Add nvidia device optimized sizes and some benchmark
    fixes. [Ritwik Das]

    Add nvidia dev opt sizes and some bench fixes

  • Merged PR 2676: Add automated weekly rocm baseline benchmark. [Ritwik
    Das]

    https://intelligentdevices.visualstudio.com/ELL/_build/results?buildId=41316&view=logs&j=4f7f213a-5f0f-58b0-1189-99ef12faf0d8&t=687344d2-d6b6-5d8c-dd9d-6aab558fd96c

    https://intelligentdevices.visualstudio.com/ELL/_build/results?buildId=41314&view=logs&j=4f7f213a-5f0f-58b0-1189-99ef12faf0d8

  • Merged PR 2673: Add automated weekly baseline benchmarks on Nvidia
    GPU. [Ritwik Das]

v1.2.6

17 Jun 06:43

What's Changed

  • Bump urllib3 from 1.25.8 to 1.26.5 in /tools/benchmarkers by @dependabot in #42
  • [ci] Fix out of disk space errors for CI workflow by @lisaong in #43
  • Bump bottle from 0.12.19 to 0.12.20 in /tools/viz by @dependabot in #44
  • Merged PR 2657: Add conversion pass from gpu ops to rocdl ops. [Ritwik Das]

    • switch to gpu dialect for gpu index ops
    • add conversion pass from gpu dialect to rocdl
  • Merged PR 2652: Add integer tensor ops support for AMD targets.
    [Ritwik Das]

    • int mfma ops
    • tests
    • static_cast in c++

    Related work items: #3727

  • Merged PR 2650: [release] Docs version to 1.2.6, sync Github to ADO.
    [Lisa Ong]

  • Merged PR 2624: Add more MMA shapes for CUDA. [Ritwik Das]

    Add more MMA shapes for CUDA

    • 32x8x16
    • 8x32x16
  • Merged PR 2644: Enable CUDA benchmarks only for A6000. [Lisa Ong]

    • Manually set the Target.Model user capability on agents running A6000
    • Update benchmarking pipelines to demand A6000s

    https://docs.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml#feedback

  • Merged PR 2634: Remove couple more big gemm sizes. [Ritwik Das]

    Remove couple more big gemm sizes

  • Merged PR 2626: [refactor] Moving debug mode to its own lowering pass.
    [Lisa Ong]

    Move the emitting of the debug mode wrapper function out of MLIREmitterContext into a lowering pass.

    This makes it easier to expand debug mode in the future.

  • Merged PR 2633: Bump hatlib to 0.0.19 to unblock CUDA T4 devices.
    [Lisa Ong]

    https://github.com/microsoft/hat/releases/tag/v0.0.19

  • Merged PR 2630: Add batched gemm support with tensorization. [Ritwik
    Das]

    Related work items: #3677

  • Merged PR 2631: Add cosmosdb key env var and shuffle gemm sizes.
    [Ritwik Das]

    • Add env var for ACCOUNT_KEY
    • shuffle gemm sizes from small to big
    • remove correctness check from big inputs and fp16
  • Merged PR 2607: Infrastructure for plan.auto() to support a basic none
    cache heuristics approach. [JUBI TANEJA]

    Infrastructure for plan.auto() to support a basic none cache heuristics approach

    This is a basic approach to test parameterization of cache arguments, index and layout.
    User only needs to specify the source they want to cache, and AutoPlanner's
    NoneCacheHeuristics algorithm will synthesize the remaining parameters for caching
    with possible set of values.

    Overall idea at DSL level:
    Given input -
    schedule.reorder(i, j, k, ii, jj, kk)
    plan.auto(accera.algorithms.NoneCacheHeuristics(source = B, index = j))

    Internally, auto() invokes cache and adds two functions with
    a unique value of layout.

    plan.cache(source = B, index = j, layout = {FIRST_MAJOR, LAST_MAJOR})

    Important change in this PR:

    • Add a new algorithms module in Accera
    • Do not delay resolution of delayed parameters to get the value. Instead,
      parameters can now be set with a possible set of values, and this set can be
      passed between the heuristics and the plan object. Check: Parameter.py
    • Parameters constructed by heuristics are termed "heuristic parameters".
      They are not available to external users of Accera, and are only named
      separately in the implementation to differentiate them from user-defined "parameters".

    Limitation/Changes coming in the subsequent PRs:

    • Allow user-defined parameters and heuristic parameters both for AutoPlanner test cases.
      For now, the code only focuses on testing AutoPlanner without any user-defined parameters
      that one can create using API: create_parameters.
    • Documentation of AutoPlanner -- design goals, tutorial, API description, etc. is coming in the
      next PR.
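As a rough illustration of the idea that a parameter can carry a set of possible values and that each combination becomes its own function, here is a small standalone sketch. It does not use Accera; expand_variants and the dictionary keys are hypothetical names chosen for the example, and the layout strings simply mirror the ones mentioned above.

```python
from itertools import product


def expand_variants(fixed, choices):
    """Return one argument dict per combination of the candidate values.

    fixed   -- arguments the user specified (e.g. source and index)
    choices -- arguments left to the heuristic, each with candidate values
    """
    names = sorted(choices)
    variants = []
    for combo in product(*(choices[name] for name in names)):
        variant = dict(fixed)
        variant.update(zip(names, combo))
        variants.append(variant)
    return variants


variants = expand_variants(
    fixed={"source": "B", "index": "j"},
    choices={"layout": ["FIRST_MAJOR", "LAST_MAJOR"]},
)
```

With a single heuristic parameter that has two candidate layouts, this yields two variants, matching the "adds two functions" behavior described above.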
  • Merged PR 2600: Refactor MFMA indexing calculations. [Mason Remy]

    Refactor MFMA indexing calculations

    • Use the iteration space position when determining MFMA computation
      locations rather than computing the position from the thread id
    • Construct the full subschedules for AMD MFMA ops so that the bound
      loop indices are ordered appropriately for the MFMA op being invoked
    • Update unit tests accordingly. The schedule changes may need to be
      moved to an under-the-hood feature of tensorization
  • Merged PR 2627: Raise error for invalid block dimensions. [Ritwik Das]

    Raise error for invalid block dimensions based on target info

    Related work items: #3715

  • Merged PR 2625: [nfc] Block debug mode for unsupported GPU targets.
    [Lisa Ong]

    Debug mode is not yet supported for GPU targets

    • Fail early
    • Update documentation
  • Merged PR 2622: Fix dependencies for benchmark tools. [Ritwik Das]

    Fix dependencies for benchmark tools

  • Merged PR 2604: Add bfloat16 support for tensor ops on rocm. [Ritwik
    Das]

    Add bfloat16 support for tensor ops on cuda and rocm

    Related work items: #3713

  • Merged PR 2621: Merge changes from Github repo. [Lisa Ong]

    commit 5b5f5ef

  • Merged PR 2620: Upgrade GPU self-hosted agents to g++-10. [Lisa Ong]

    The stock g++-9 from Ubuntu 20.04 crashes when compiling pybind11 alongside mlir/Dialect/IR/Affine/AffineOp.h.

    This change updates to g++-10 for the self-hosted images only, as this issue only affects images that we build for ROCm and CUDA.

    Azure DevOps agents will continue to run on their pre-installed g++-9.

  • Merged PR 2619: Parameterize Plan.bind. [Denny Sun]

            P0, P1, P2, P3, P4, P5 = create_parameters()
    
            plan.bind(mapping={
                P0: P3,
                P1: P4,
                P2: P5
            })
    
            package.add(
                plan,
                args=(A, B, C),
                parameters={
                    P0: i,
                    P1: j,
                    P2: k,
                    P3: v100.GridUnit.BLOCK_X,
                    P4: v100.GridUnit.THREAD_X,
                    P5: v100.GridUnit.THREAD_Y,
                },
                base_name=test_name)
    

    Related work items: #3708

  • Merged PR 2599: Support parameterizing caches based on memory space.
    [Mason Remy]

    Support parameterizing caches based on memory space

    • Identifies bound indices that the cache should be parameterized on,
      rather than shaped by.
      e.g. for a private memory cache inserted at a gpu block level, the
      computed memory space will not be the full active block at that level,
      but the portion derived from loops that weren't bound to gpu thread
      dims.

    • Adds some BoundProcessorOp utilities and shares some common binding
      code

  • Merged PR 2618: Fix memory allocation bug during benchmark
    verification. [Ritwik Das]

    Fix memory allocation bug during benchmark verification

  • Merged PR 2617: [nfc] [doc] Fix typo and re-sync models table. [Lisa
    Ong]

  • Merged PR 2616: Formatting Python code a bit for the better
    readability. [Denny Sun]

    1. Some functions have a long list of parameters, add line wrap
    2. Separate external imports from internal ones
  • Merged PR 2614: Remove redundant variable and cosmosdb fix. [Ritwik
    Das]

    Cosmos DB error when upserting from multiple processes:

    Process runner0:
    Traceback (most recent call last):
    File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
    File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    File "/azp/_work/2/s/tools/benchmarkers/accera_gemm.py", line 633, in gemm_runner
    cosmosdb.upsert_benchmark_results(resultRows, containerName, verboseLogs)
    File "/azp/_work/2/s/tools/benchmarkers/cosmosdb.py", line 27, in upsert_benchmark_results
    container = get_container(containerName, verboseLogs)
    File "/azp/_work/2/s/tools/benchmarkers/cosmosdb.py", line 18, in get_container
    container = db.create_container_if_not_exists(id=containerName, partition_key=PartitionKey(path='/partitionKey'))
    File "/usr/local/lib/python3.8/dist-packages/azure/core/tracing/decorator.py", line 62, in wrapper_use_tracer
    return func(*args, **kwargs) # type: ignore
    File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/database.py", line 287, in create_container_if_not_exists
    container_proxy.read(
    File "/usr/local/lib/python3.8/dist-packages/azure/core/tracing/decorator.py", line 62, in wrapper_use_tracer
    return func(*args, **kwargs) # type: ignore
    File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/container.py", line 145, in read
    self._properties = self.client_connection.ReadContainer(
    File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 469, in ReadContainer
    return self.Read(path, "colls", collection_id, None, options, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 2162, in Read
    result, self.last_response_headers = self.__Get(path, request_params, headers, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 2209, in __Get
    return synchronized_request.SynchronizedRequest(
    File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_synchronized_request.py", line 210, in SynchronizedRequest
    return _retry_utility.Execute(
    File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_retry_utility.py", line 73, in Execute
    result = ExecuteFunction(function, global_endpoint_manager, *args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_retry_utility.py", line 130, in ExecuteFunction
    return function(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_synchronized_request.py", line 158, in _Reques...

v1.2.5

24 May 06:26

What's Changed

  • Merged PR 2593: [docs] [release] bump docs version to 1.2.5 in
    preparation for release. [Lisa Ong]

    bump docs version to 1.2.5 in preparation for release

  • Merged PR 2586: Loop order and indices as parameters [Denny Sun]

    With this change, the user can write a schedule with loop_order parameterized:

    loop_order = create_parameters()
    schedule.reorder(order=loop_order)

    parameter_grid = {
        loop_order: (j, k, i, ii, jj, kk)
    }

    parameters = create_parameter_grid(parameter_grid,
                                       filter_func=lambda *p: schedule.is_valid_loop_order(p[0][0]),
                                       sample=5)

    # Add another function to the package
    package.add(
        plan,
        args=(A, B, C),
        parameters=parameters,
        base_name="matmul_256_256_256"
    )

    Related work items: #3693
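
    A minimal sketch of the kind of filtering that a loop-order validity check can perform, assuming the only rule is that a split (inner) index must come after the index it was split from. The helper is_valid_order and the PARENT table are hypothetical and are not Accera APIs.

```python
from itertools import permutations

# Hypothetical parent table: inner index -> the index it was split from.
PARENT = {"ii": "i", "jj": "j", "kk": "k"}

def is_valid_order(order, parent=PARENT):
    """Return True if every split index appears after its parent index."""
    position = {name: pos for pos, name in enumerate(order)}
    return all(position[p] < position[c] for c, p in parent.items())

indices = ("i", "j", "k", "ii", "jj", "kk")
# Enumerate every permutation and keep only the valid loop orders.
valid_orders = [o for o in permutations(indices) if is_valid_order(o)]
```

    Under this assumed rule, only a fraction of the 720 permutations survives, which is why a filter function is useful before sampling a parameter grid.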

  • Merged PR 2591: Fixes more warnings. Enables STRICT_MODE for Linux PR
    CI. [Kern Handa]

  • Merged PR 2588: [test] Trim out redundant tests from ROCm pipeline.
    [Lisa Ong]

    The ROCm pipeline currently runs on a single agent, so avoid running CPU tests that already run in other pipelines to speed up pipeline execution.

  • Merged PR 2590: [nfc] Fixes a bunch of warnings in C++ layer. [Kern
    Handa]

    [nfc] Fixes a bunch of warnings in C++ layer

  • Merged PR 2589: [test] Adds DSL tests for Schedule.pad. [Kern Handa]

    Adds DSL tests for Schedule.pad

  • Merged PR 2587: Sync Github to ADO. [Lisa Ong]

    commit b934ad05f6b8cd84420226b93f57b8ac3229eadc

  • Merged PR 2585: Use conditional instead of loop-unswitching on GPU.
    [Chuck Jacobs]

    This PR changes how boundary conditions are handled on GPU-bound loop indices. If a loop's increment doesn't evenly divide its bounds, the body is guarded by a conditional instead of unswitching that loop.

    Related work items: #3703
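
    The two boundary-handling strategies named above can be sketched in plain Python. This illustrates the control flow only, not the generated GPU code, and the function names are hypothetical.

```python
def sum_unswitched(data, step):
    """Loop unswitching: a full-tile main loop plus a separate remainder loop."""
    n = len(data)
    total = 0
    main_end = n - n % step
    for start in range(0, main_end, step):
        for offset in range(step):          # no bounds check needed here
            total += data[start + offset]
    for idx in range(main_end, n):          # boundary handled separately
        total += data[idx]
    return total

def sum_guarded(data, step):
    """Conditional guard: one loop nest, with the body predicated on the bound."""
    n = len(data)
    total = 0
    for start in range(0, n, step):
        for offset in range(step):
            idx = start + offset
            if idx < n:                     # guard replaces the remainder loop
                total += data[idx]
    return total
```

    Both functions visit the same elements. The guarded form keeps a single loop body, which is the shape this PR describes for loop indices bound to the GPU.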

  • Merged PR 2571: Add random seed to enable reproducible sampling.
    [Denny Sun]

    Giving users control over sampling strategies.
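
    As a rough illustration of why a seed makes sampling reproducible, the sketch below uses Python's standard random module. The function sample_candidates and its seed argument are hypothetical names, not the Accera API.

```python
import random

def sample_candidates(candidates, k, seed=None):
    """Pick up to k candidates; a fixed seed makes the choice repeatable."""
    rng = random.Random(seed)       # private generator, global state untouched
    return rng.sample(candidates, min(k, len(candidates)))

grid = list(range(100))
first = sample_candidates(grid, 5, seed=42)
second = sample_candidates(grid, 5, seed=42)   # same seed, same selection
```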

  • Merged PR 2581: Add CUDA tensor core support. [Ritwik Das]

    • Added CUDA tensor ops (no caching)
    • Added validation tests
    • Changed MMA enum names
    • A snippet of the generated tensor op code in CUDA:
    ...
    vhalf *var11 = (vhalf*)arg2;
    wmma::fragment<wmma::accumulator, 16, 16, 16, vhalf> mmaMatrix_12;
    wmma::load_matrix_sync(mmaMatrix_12, var11 + var9 * 16 + var10, 16, wmma::layout_t::mem_row_major);
    vhalf *var13 = (vhalf*)arg0;
    wmma::fragment<wmma::matrix_a, 16, 16, 16, vhalf, wmma::row_major> mmaMatrix_14;
    wmma::load_matrix_sync(mmaMatrix_14, var13 + var9 * 16 + 0, 16);
    vhalf *var15 = (vhalf*)arg1;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, vhalf, wmma::row_major> mmaMatrix_16;
    wmma::load_matrix_sync(mmaMatrix_16, var15 + 0 * 16 + var10, 16);
    wmma::fragment<wmma::accumulator, 16, 16, 16, vhalf> mmaMatrix_17;
    wmma::mma_sync(mmaMatrix_17, mmaMatrix_14, mmaMatrix_16, mmaMatrix_12);
    wmma::store_matrix_sync(var11 + var9 * 16 + var10, mmaMatrix_17, 16, wmma::layout_t::mem_row_major);
    

    Related work items: #3694
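
    In the generated code above, wmma::mma_sync computes D = A x B + C on one 16x16x16 tile. The pure-Python reference below shows the same arithmetic for checking results on the host. It is a sketch only: mma_reference is a hypothetical helper, and it ignores half-precision rounding.

```python
def mma_reference(a, b, c):
    """Return d = a @ b + c for row-major nested lists (one tile)."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [c[i][j] + sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]

TILE = 16
identity = [[1.0 if i == j else 0.0 for j in range(TILE)] for i in range(TILE)]
ones = [[1.0] * TILE for _ in range(TILE)]
result = mma_reference(identity, ones, ones)   # identity @ ones + ones
```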

  • Merged PR 2584: Adds cublas_gemm benchmarking tool. [Kern Handa]

    Adds cublas_gemm benchmarking tool

  • Merged PR 2583: Don't hold ResolveWarpSize results with rvalue. [Mason
    Remy]

    Don't hold ResolveWarpSize results with rvalue

    gcc appears to be inlining ResolveWarpSize incorrectly in some cases and
    not holding the result with an rvalue pair appears to fix it.

    This was resulting in some mod 0's and floordiv 0's when we would expect
    the warp size constants to either be 32 or 64 exactly.

  • Merged PR 2580: Fixes rocblas_gemm's fp32 -> fp16 conversion. [Kern
    Handa]

  • Merged PR 2579: Improves accera_gemm.py's handling of unsupported
    configs. [Kern Handa]

    Improves accera_gemm.py's handling of unsupported configs

  • Merged PR 2578: Fixes time unit conversions in accera_gemm.py. [Kern
    Handa]

    Also addresses comments for the previous rocblas_gemm PR

  • Merged PR 2577: Fixes accera_gemm.py code after Plan.tensorize API
    change. [Kern Handa]

    Fixes accera_gemm.py code after Plan.tensorize API change

  • Merged PR 2575: Adds library warmup to rocblas_gemm benchmarker. [Kern
    Handa]

    Adds library warmup to rocblas_gemm benchmarker

  • Merged PR 2572: [nfc] Move accera/viz -> tools/viz. [Kern Handa]

    [nfc] Move accera/viz -> tools/viz

  • Merged PR 2573: Update setup.cfg hatlib dependency version. [Mason
    Remy]

    Update setup.cfg hatlib dependency version

  • Merged PR 2557: Overhauls the benchmarking tool. [Kern Handa]

    This change moves the benchmarking tool to a top-level tools/benchmarkers directory. The tool has also been split up so that the accera portion is in its own file, while the driver portion of the tool remains intact and has gained the ability to run a rocblas gemm benchmarking utility.

    The aforementioned rocblas gemm benchmarking utility is also added in this change. rocblas_gemm is a new executable that is not built by default since it relies on the rocblas library, which may not be available everywhere. Once this tool has been explicitly built, it can be passed in as an argument to the benchmarker tool, which will use it to generate a comparison between accera's benchmark results and rocblas's.

    An example:

    <build accera like usual>
    ninja -C `git rev-parse --show-toplevel`/build/temp.linux-x86_64-3.8 rocblas_gemm
    cd tools/benchmarkers
    mkdir ~/accera_benchmarks
    ./gpu_benchmark_tool.py -i sgemm_bert_assorted.csv -t 'AMD MI100' -o ~/accera_benchmarks/results -r `git rev-parse --show-toplevel`/build/temp.linux-x86_64-3.8/tools/benchmarkers/rocblas/rocblas_gemm

    Related work items: #3685

  • Merged PR 2569: Make tensorization passes configurable, remove
    dependency from split indices. [Ritwik Das]

    • Make the mfma type a required parameter for tensorize(). This only chooses the underlying mfma op to use.
    • Additionally, the user can pass in the total number of passes to run (defaults to 1) instead of relying on an implicitly calculated square tile.
    • Added documentation for the new enum type.
    • Added some tests
    • Current code does not work with K > M (still investigating this, but should not block this PR)

    Related work items: #3688

  • Merged PR 2567: Fix vectorized access of LAST_MAJOR arrays. [Mason
    Remy]

    Fix vectorized access of LAST_MAJOR arrays

    • mlir::vector::LoadOp and mlir::vector::StoreOp only support unit
      strides on the minor dimension of the memref they access, so
      reinterpretcast the memref to a flat buffer to pass that check
    • add translation for reinterpretcastop
    • improve vectorization of LAST_MAJOR matrices in cache accesses by
      changing the traversal order of the cache region (when
      filling/reducing) based on the memory ordering of the outer array
      being acted on.
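
    A small sketch of why the traversal order depends on the memory ordering, assuming FIRST_MAJOR means row-major and LAST_MAJOR means column-major for a 2-D array. The helper names are hypothetical.

```python
def flat_offset(i, j, rows, cols, layout):
    """Offset of element (i, j) in a flat buffer for a 2-D array."""
    if layout == "FIRST_MAJOR":      # assumed row-major: j is unit-stride
        return i * cols + j
    if layout == "LAST_MAJOR":       # assumed column-major: i is unit-stride
        return j * rows + i
    raise ValueError(f"unknown layout: {layout}")

def unit_stride_dim(rows, cols, layout):
    """Return 0 or 1: the dimension whose neighbours are adjacent in memory."""
    base = flat_offset(0, 0, rows, cols, layout)
    if flat_offset(1, 0, rows, cols, layout) - base == 1:
        return 0
    return 1
```

    A vector load of consecutive elements is contiguous only along the unit-stride dimension, so the innermost traversal loop should follow that dimension.
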
  • Merged PR 2568: [Compliance] [nfc] Switch to Azure Container Registry
    for ROCm build agent. [Lisa Ong]

  • Merged PR 2560: Make register allocation during tensorization tunable.
    [Ritwik Das]

    • Add controllable number of fused mfma passes
    • Add controllable scheduling policy of mfma ops
    • Add tests

    Related work items: #3687

  • Merged PR 2565: [build] bump hatlib dependency to 0.0.13. [Lisa Ong]

    hatlib 0.0.13 contains a fix to unblock ROCm buddy builds

Full Changelog: v1.2.4...v1.2.5

v1.2.4

06 May 02:12

What's Changed

  • Merged PR 2563: Add a table of operators and code examples to the
    Parameters.md. [Denny Sun]

    Update the Manuals with the supported operators and code examples.

  • Merged PR 2562: [nfc] Add some macOS targets and synced Model.md.
    [Lisa Ong]

    • Re-generated Model.md to add missing models
    • Handle zero (unknown) vector_bytes cases in tests
    • Opportunistically added these models used during development:
      • 2016 macbook pro
      • M1 max
  • Merged PR 2561: [docs][nfc] Sync changes from Github remote, bump doc
    versions to 1.2.4. [Lisa Ong]

  • Merged PR 2558: [nfc] update requirements to latest version of six.
    [Lisa Ong]

    Fixes this warning:

    <frozen importlib._bootstrap>:914: ImportWarning: _SixMetaPathImporter.find_spec() not found; falling back to find_module()
    
  • Merged PR 2559: Finer-granularity error reporting for python tests.
    [Chuck Jacobs]

    This PR modifies how the python tests are invoked, so that they can report pass/fail results per test. Hopefully that'll make it easier to pinpoint where things are failing during CI builds.

  • Merged PR 2556: [non-functional] Change ROCM code to generate gcn
    intrinsics when possible. [Ritwik Das]

    • Use amd gcn intrinsics when possible (threadIdx, blockIdx, barrier)
    • Add helpers which automatically check for runtime before emitting the proper code

    Related work items: #3698

  • Merged PR 2547: [non-functional] Change custom mfma types to Memref
    and some refactoring. [Ritwik Das]

    Make initial changes to remove custom mfma types

    Related work items: #3691

  • Merged PR 2555: create_parameters(count: int) no longer needs count as
    an argument. [Denny Sun]

    1. Remove the count of parameters to be created from the DSL
    2. Throw exception when users write the following code:
      create_parameters()
    3. The correct way of calling create_parameters() is:
      p1, p2 , p3 ..., pN = create_parameters()
  • Merged PR 2554: [doc] Updated some missing enums and fixed Case Study
    path. [Lisa Ong]

  • Merged PR 2522: Generalize array indexing in tensorized GEMM. [Chuck
    Jacobs]

    This PR generalizes the MFMA tensorization pass to improve the handling of code in the innermost loop. It recognizes more ways of writing the GEMM kernel, and rejects many ill-formed GEMM kernels.

    There are also a number of tests.

    This PR doesn't yet generalize to batch-GEMM, where the matrices (typically) have 3 indices.

    Related work items: #3676

  • Merged PR 2551: [nfc][ci] Switch hosted pipelines to 1ES hosted pool.
    [Lisa Ong]

    • The Linux1ESPool is created to support internal builds of LLVM

    • Fix regression in pipeline due to overzealous .dockerignore

  • Merged PR 2550: [nfc] [docs] Merge changes from GitHub remote. [Lisa
    Ong]

    In preparation for merge from ADO to GitHub for Case Studies publishing

  • Merged PR 2549: [Compliance] Switching from Dockerhub to ACR for third
    party containers. [Lisa Ong]

    Updating Dockerfile references

  • Merged PR 2548: Add README file for case studies. [Denny Sun]

    README file has a table where each case study points to the external repo link.

  • Merged PR 2546: [dev] [nfc] Natively support macOS/arm64 for
    development. [Lisa Ong]

    Limited to local development scenarios (LLVM_SETUP_VARIANT=Default)

    No plans to release pip packages until there is CI support

    Verified on: Big Sur (MacOSX 12.3 arm64) / Python 3.10

  • Merged PR 2543: Add precomputed offset map optimization for
    tensorization (no caching) [Ritwik Das]

    • Add flag to tensorize() to enable optimization (off by default)
    • Optimization only affects load/store of accumulator (C) argument
    • Supports all 4 mfma shapes

    Related work items: #3671

  • Merged PR 2542: An assortment of minor fixes. [Chuck Jacobs]

    This PR is a hodgepodge of tiny fixes. I'm happy to split it up into separate PRs if a kitchen-sink PR is too gross.

    The specific things are:

    • Add 2 new target models to Targets.py (that correspond to my local dev boxes)
    • Change the snapshot IR format for sub-passes to use the same format as the top-level passes (that is, not "generic" format)
    • Print a warning message if check_correctness skips a correctness check because no hat file was generated
    • Add a "minimum version" constraint to requirements.txt for hatlib
  • Merged PR 2545: Unifies CUDA and CPP enum values to SOURCE for
    Package.Format. [Kern Handa]

    Unifies CUDA and CPP enum values to SOURCE for Package.Format

    Related work items: #3679

  • Merged PR 2544: [nfc] Removes now unnecessary ldebug output. [Kern
    Handa]

    [nfc] Removes now unnecessary ldebug output

  • Merged PR 2527: Enable vectorized shared memory write. [Mason Remy]

    Enable vectorized shared memory write

    • This adds mod simplification support needed for vectorizing shared
      memory writes
    • Also refactors some of the affine simplification code slightly to
      share some common code between the floordiv and mod simplifications

    Related work items: #3586, #3661, #3689

  • Merged PR 2526: Enable GPU global read vectorization. [Mason Remy]

    Enable GPU global read vectorization

    • Implements a floor div simplification that enables better recognition
      of vectorizable load and stores

    Related work items: #3661, #3690
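
    Floor-div and mod simplifications of this kind rest on known value ranges. The sketch below shows one pair of identities, assuming 0 <= x < d: (a*d + x) // d equals a, and (a*d + x) % d equals x. Whether these are the exact rewrites implemented in the PRs is an assumption, and the function name is hypothetical.

```python
def simplify_floordiv_mod(a, x, d, x_range):
    """Return ((a*d + x) // d, (a*d + x) % d), simplified when 0 <= x < d.

    x_range is a half-open (lo, hi) bound that is assumed to be known for x.
    """
    lo, hi = x_range
    if d > 0 and 0 <= lo and hi <= d:
        return a, x                      # no division or modulo needed
    value = a * d + x
    return value // d, value % d         # fall back to the general form
```

    Once the quotient and remainder are plain index expressions, consecutive accesses are easier to recognize as vectorizable.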

  • Merged PR 2541: Fix a few issues with GEMM benchmarking script. [Chuck
    Jacobs]

    This PR fixes a couple of errors:

    • there was a bug in the GEMM kernel
    • sometimes hatlib would fail to return a compiled function, but not throw an exception. These are now flagged as "uncompilable"

    It makes a couple of other tweaks:

    • it fails if the alpha and beta parameters aren't 1.0 and 0.0
    • it culls some variants with known-uncompilable tensorization parameters before trying to compile them
  • Merged PR 2538: Fix std::pair unpacking issue in
    TensorizeAffineForOpConversion. [Lisa Ong]

    In debug builds, we are getting garbage values for warpSizeX and warpSizeY, resulting in division by 0 errors in the emitted .cu files

  • Merged PR 2536: Parameter supports most of the
    arithmetic/binary/unary operations defined in operator lib. [Denny
    Sun]

    Parameter supports the basic arithmetic operations (+, -, *, //, %), for example, the user can write the following code:

    fma_unit_count, vector_size = acc.create_parameters(2)
    jjj = schedule.split(jj, fma_unit_count * vector_size)
    jjjj = schedule.split(jjj, vector_size)

    Related work items: #3692
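
    One way to support arithmetic on values that are not yet known is operator overloading that records the expression and evaluates it later. The Param class below is a hypothetical sketch of that idea, not Accera's implementation.

```python
import operator

class Param:
    """Hypothetical deferred value: records arithmetic until values are bound."""

    def __init__(self, name=None, fn=None):
        self.name = name
        self._fn = fn                    # None for a leaf parameter

    def evaluate(self, bindings):
        if self._fn is None:
            return bindings[self.name]
        return self._fn(bindings)

    def _binary(self, other, op):
        def fn(bindings):
            rhs = other.evaluate(bindings) if isinstance(other, Param) else other
            return op(self.evaluate(bindings), rhs)
        return Param(fn=fn)

    def __add__(self, other): return self._binary(other, operator.add)
    def __sub__(self, other): return self._binary(other, operator.sub)
    def __mul__(self, other): return self._binary(other, operator.mul)
    def __floordiv__(self, other): return self._binary(other, operator.floordiv)
    def __mod__(self, other): return self._binary(other, operator.mod)

fma_unit_count, vector_size = Param("fma_unit_count"), Param("vector_size")
split_size = fma_unit_count * vector_size   # still symbolic at this point
```

    The product is only computed when concrete values are supplied, for example split_size.evaluate({"fma_unit_count": 4, "vector_size": 8}).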

  • Merged PR 2539: [nfc][docs] Merging commits from Github/main. [Lisa
    Ong]

    commit ee28126a338d905eb5931038d3c5daba6ead3811

  • Merged PR 2535: [ci] Self-hosted Azure DevOps build agent for ROCm
    smoke tests. [Lisa Ong]

    • Docker image for self-hosted build agent on the ROCm development machine
    • Pipeline will front-load the Python ROCm tests so that we fail faster
    • The agent runs ROCm 5.1.1 (the current latest). We can build/launch different containers for different versions if needed.
    • CUDA_VISIBLE_DEVICES = 0 by default. This can be overridden at pipeline scheduling time.
    • The pipeline currently fails in the ROCm Python tests, so it does not block completion of the PR.
    • Included some fixes that are not related to ROCm but generally needed to run on systems whose CPU names are resolved (e.g. "zen2"), i.e. the build agent itself.

    Related work items: #3682

  • Merged PR 2537: [Compliance] Make dependency on ffmpeg optional. [Lisa
    Ong]

    ffmpeg-python is only needed for video export from the Iteration Visualizer Tool

    Removing the hard dependency from the tool.

  • Merged PR 2525: Fix vectorization plumbing for GPU scenarios. [Mason
    Remy]

    Fix vectorization plumbing for GPU scenarios

    Related work items: #3661

  • Merged PR 2531: [nfc][docs] Merging weekly commits from Github/main.
    [Lisa Ong]

    commit d75d4a6b9cec2ccf90bdf27911d843be1833bc8d

  • Merged PR 2530: Adds initial GPU benchmarking infrastructure. [Kern
    Handa]

    Related work items: #3685

  • Merged PR 2524: [nfc] Refactor RangeValue utilities to separate file.
    [Mason Remy]

    [nfc] Refactor RangeValue utilities to separate file

    Related work items: #3661

  • Merged PR 2532: [prog] Fallback to known TargetDevice names for
    looking up the LLVM triple. [Lisa Ong]

    Resolves the issue where the CPU type is resolved (e.g....

v1.2.3

07 Apr 00:11

What's Changed

  • Merged PR 2508: [release] Bump docs version to 1.2.3. [Lisa Ong]

    In preparation for a PyPI release to facilitate community contributions for case studies

    Synced doc editorials from public Github repo

  • Merged PR 2503: [prog] Support unsigned integer types in the DSL.
    [Lisa Ong]

    • Add ScalarType.uint8/16/32/64 support
    • Use UnrealizedConversionCastOps to convert these unsigned ints to signless ints
    • Refactored CastImpl now that we have to handle both unsigned and signless cases for casts to/from ints
    • Use a tuple of (mlir Type, llvm Type) to infer the C type when writing function declarations in the HAT file. The former holds sign-ness information, the latter determines the C type (e.g. pointer or not)
    • Simplified CheckAllClose function to reduce unnecessary casting
    • Doc updates
    • Fixed HAT file issues with ScalarType.bool
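
    The need to carry sign-ness separately from the stored bits can be seen by reading one bit pattern both ways. This is a plain-Python sketch, and reinterpret is a hypothetical helper.

```python
def reinterpret(value, bits, signed):
    """Reinterpret the low `bits` bits of value as a signed or unsigned int."""
    raw = value & ((1 << bits) - 1)                 # keep only the bit pattern
    data = raw.to_bytes(bits // 8, "little", signed=False)
    return int.from_bytes(data, "little", signed=signed)

as_uint8 = reinterpret(0xFF, 8, signed=False)   # same bits, unsigned reading
as_int8 = reinterpret(0xFF, 8, signed=True)     # same bits, signed reading
```

    The stored bits are identical in both cases, so a signless representation needs the sign-ness from elsewhere when the C type is chosen.
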
  • Merged PR 2507: Updates acc-translate output for ROCm 5.1. [Kern
    Handa]

  • Merged PR 2437: Add more known targets(from our team's devices) [Denny
    Sun]

    The new list covers the following CPUs, which are in use by our devs:
    Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz
    11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
    Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz 2.11 GHz
    Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
    Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz

    Related work items: #3546

  • Merged PR 2505: [nfc] Rename parameters for schedule.tile and
    plan.bind. [Kern Handa]

    [nfc] Rename parameters for schedule.tile and plan.bind

  • Merged PR 2501: Adds support for more than one GPU function per
    package. [Kern Handa]

    Adds support for more than one GPU function per package

    Related work items: #3686

  • Merged PR 2504: [docs] Update stale versions in Reference docs. [Lisa
    Ong]

    Fixing while considering better approaches....

  • Merged PR 2499: Updates the syntax for schedule.tile. [Kern Handa]

    Updates the syntax for schedule.tile

  • Merged PR 2498: Updates the syntax for plan.bind. [Kern Handa]

    Updates the syntax for plan.bind

    Related work items: #3678

  • Merged PR 2500: Adds support for specifying index bitwidth for acc-
    translate. [Kern Handa]

    Adds support for specifying index bitwidth for acc-translate

    Story #3669

    Related work items: #3669

  • Merged PR 2490: Restore CMake Export. [Abdul Dakkak]

    Restore the CMake Export feature as it is used by argo-experiments. Note that you cannot use this feature if you are using the vcpkg LLVM build.

  • Merged PR 2497: Fix vectorization plumbing to correctly handle zero
    vectorization budget cases in cache reduce ops. [Mason Remy]

    Fix vectorization plumbing to correctly handle zero vectorization budget cases in cache reduce ops

  • Merged PR 2496: [nfc] Switch docs versioning to bump2version, replace
    VERSION with simple git tag-based version. [Lisa Ong]

    • Populate ACCERA_VERSION from the latest git tag
    • bump2version is now configured for the docs/ tree
  • Merged PR 2495: [test] Import break with python -m unittest discover.
    [Lisa Ong]

    python -m unittest discover accera/test *.py will interrogate verifiers.py and fail because of the relative import

  • Merged PR 2492: Updates test verifier code to match hatlib API
    changes. [Kern Handa]

    Updates test verifier code to match hatlib API changes

  • Merged PR 2488: Simplify RangeValue analysis. [Abdul Dakkak]

    Uses LLVM's ConstantRange instead of implementing our own to delete a lot of code

  • Merged PR 2489: add missing type_traits include. [Abdul Dakkak]

    add missing type_traits include

  • Merged PR 2482: Fix parameterized caches producing multiple caches
    erroneously. [Mason Remy]

    Fix parameterized caches producing multiple caches erroneously

    • This is more of a one-off fix. A more generalized fix for resetting
      schedules/plans for different parameter value resolution should be
      implemented down the road
  • Merged PR 2479: FP16 tensorization for ROCM. [Abdul Dakkak]

  • Merged PR 2472: Tensorization + Caching. [Abdul Dakkak]

  • Merged PR 2485: Add another keyword to function's auxiliary table.
    [Denny Sun]

    Add a 'parameters' keyword for the parameter values in a function's auxiliary table. The table will then look like:

    [functions.matmul_256_256_256_bdec0fac.auxiliary.accera]
    [functions.matmul_256_256_256_bdec0fac.auxiliary.accera.parameters]
    p_m_split_size = 16
    p_n_split_size = 128
    p_s_split_size = 256
    p_s_split_2_size = 8
    p_n_split_2_size = 16
    p_n_split_3_size = 4

    Related work items: #3662

  • Merged PR 2484: [Pipelines] Enable uploads to PyPI when tagging a
    release. [Lisa Ong]

    Configurable service connection variable, allows setting of test and production PyPI service connections during scheduling.

    Also cleaned up a stale workaround for auditwheel in the ManyLinux pipeline.

  • Merged PR 2471: Fix to caching. [Abdul Dakkak]

    This avoids the aggressive cache deletion specifically when it occurs within a loop. This is a temporary fix; a more elegant one would handle memory access info across loop boundaries.

  • Merged PR 2476: Add accera.create_parameter_grid() with self-defined
    filter and sample as arguments. [Denny Sun]

    Provide a generic DSL function that creates the list of parameters from a dictionary (grid). Users can define their own filter function to remove invalid parameter values, and can limit the number of parameter combinations and therefore the number of functions generated.

    We identified the need for this function while updating our matmul grid search case study.

    Related work items: #3662
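
    The grid expansion described above can be sketched with the standard library. expand_grid is a hypothetical stand-in, and it passes each combination to the filter as a dict, which may differ from the calling convention of accera.create_parameter_grid().

```python
import itertools
import random

def expand_grid(grid, filter_func=None, sample=None):
    """Expand {name: candidate values} into a list of {name: value} dicts."""
    names = list(grid)
    combos = [dict(zip(names, values))
              for values in itertools.product(*(grid[n] for n in names))]
    if filter_func is not None:
        combos = [c for c in combos if filter_func(c)]   # drop invalid points
    if sample is not None and sample < len(combos):
        combos = random.sample(combos, sample)           # cap the count
    return combos

grid = {"m_split": [8, 16, 32], "n_split": [4, 8, 16, 32]}
valid = expand_grid(grid, filter_func=lambda c: c["m_split"] >= c["n_split"])
capped = expand_grid(grid, sample=5)
```

    Filtering before sampling means the sample size bounds the number of valid functions generated, rather than the number of raw grid points.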

  • Merged PR 2483: [Test] Integrate FileCheck into Python tests. [Lisa
    Ong]

    • Added FileCheck utility to the accera-llvm package
    • Can be run on any output file produced by the Package.build process, e.g. .cu, .mlir
    • Support some basic directives
    • Added examples for caching and rocm validation

    Example error spew:

    /root/Accera/build/lib.linux-x86_64-3.9/accera/bin/FileCheck /root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu.filecheck --input-file /root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu
    
    /root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu.filecheck:2:16: error: CHECK-COUNT: expected string not found in input (4 out of 4)
    CHECK-COUNT-4: for (int64_t idx{{[0-9]}} = 0; idx{{[0-9]}} < 16; idx{{[0-9]}} += 1) {
                   ^
    /root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu:42:47: note: scanning from here
    for (int64_t idx2 = 0; idx2 < 16; idx2 += 1) {
                                                  ^
    
    Input file: /root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu
    Check file: /root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu.filecheck
    
    -dump-input=help explains the following input dump.
    
    Input was:
    <<<<<<
             .
             .
             .
            37:
            38:
            39: extern "C" __global__ __launch_bounds__(1) void test_rocm_gemm_tiled_output_710d7d7d2ca9ca9e__gpu__(float *arg0, float *arg1, float *arg2) {
            40: for (int64_t idx0 = 0; idx0 < 16; idx0 += 1) {
            41: for (int64_t idx1 = 0; idx1 < 16; idx1 += 1) {
            42: for (int64_t idx2 = 0; idx2 < 16; idx2 += 1) {
    count:2                                                   X error: no match found
            43: /*%0 = memref.load %arg0[%arg3, %arg5] : memref<16x16xf32, affine_map<(d0, d1) -> (d0 * 16 + d1)>>*/
    count:2     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            44: const auto arg0_offset0 = affine_map_func_0_i0(idx0, idx2);
    count:2     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            45: float var3 = ((float*)arg0)[arg0_offset0];
    count:2     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            46: /*%1 = memref.load %arg1[%arg5, %arg4] : memref<16x16xf32, affine_map<(d0, d1) -> (d0 * 16 + d1)>>*/
    count:2     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            47: const auto arg1_offset1 = affine_map_func_0_i0(idx2, idx1);
    count:2     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
             .
             .
             .
    >>>>>>
    
  • Merged PR 2480: Clean up cache vectorization argument plumbing. [Mason
    Remy]

    Clean up cache vectorization argument plumbing

  • Merged PR 2481: Enables verification for ROCm smoke tests. [Kern
    Handa]

  • Merged PR 2473: Extends range analysis by adding support for
    udiv,sdiv,urem,srem. [Abdul Dakkak]

    These come up during code generation.

  • Merged PR 2474: Add vectorize arg to plan.cache. [Mason Remy]

    Add vectorize arg to plan.cache

    • Enables specifying whether or not to vectorize ops for a given cache,
      including an "AUTO" option, which will behave how caching
      vectorization has behaved in the past, where it vectorizes the cache
      if any loop in the loopnest is also vectorized
    • Also fix some include paths

Full Changelog: v1.2.2...v1.2.3

v1.2.2

18 Mar 05:32

What's Changed

Full Changelog: v1.2.1...v1.2.2

  • Merged PR 2439: Downstream doc changes from github/main. [Lisa Ong]

    Squashed commit of the following:

    commit 8a6e553

  • Merged PR 2440: Enable tensorization for Rocm target. [Abdul Dakkak]

  • Merged PR 2470: Adds support for the execution of GPU (CUDA only)
    functions via hat. [Kern Handa]

  • Merged PR 2467: Adding multiple functions in package.add() can't work
    with stateful auxiliary metadata and index_map. [Denny Sun]

    These bugs all stem from sharing Python objects, such as auxiliary metadata and the schedule's indexes, among different functions. When package.add() is called to add multiple parameterized functions, the functions are added one by one and then emitted one by one. Each step changes the state of the shared Python objects, so only the first function added is emitted correctly. To make _add_function work, these shared Python objects need to be stateless.

    Related work items: #3662

  • Merged PR 2469: Convert 'Local' memory space to 'Private' [Mason Remy]

    Convert 'Local' memory space to 'Private'

  • Merged PR 2463: Enable specifying double buffer memory space. [Mason
    Remy]

    Enable specifying double buffer memory space

  • Merged PR 2468: Move to VS2022 for builds. [Kern Handa]

    Move to VS2022 for builds

  • Merged PR 2465: extend gpu target spec. [Abdul Dakkak]

    extend gpu target spec

  • Merged PR 2464: Compute a stable hash for function name suffixes.
    [Lisa Ong]

    Create a stable hash using md5 and json serialization of these stringized entries:

    • Array args: shape, type, role, layout
    • parameter dictionary
    • Target

    Example output:

    test_unequal_iteration_space_fusing_1 (__main__.DSLTest_04Fusing) ... DEBUG:root:Adding wrapped function
    DEBUG:root:Adding wrapped function
    Building function fusing_test_32d12fb1a01061ec
    DEBUG:root:Detected logic function _ uses indices i,j
    DEBUG:root:Detected logic function _ uses indices i,j
    Building function _debug_check_allclose_16_16_4cfd65a8b606655b
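
    A minimal sketch of the approach described above, using hashlib and json from the standard library. The field names, the stable_suffix helper, and the 16-character truncation are assumptions for illustration. Python's built-in hash() is randomized per process for strings, which is one reason a digest is used instead.

```python
import hashlib
import json

def stable_suffix(array_args, parameters, target, length=16):
    """Hash stringized entries so the suffix is identical across runs."""
    entries = {
        "args": [str(arg) for arg in array_args],
        "parameters": {str(k): str(v) for k, v in parameters.items()},
        "target": str(target),
    }
    # sort_keys gives a canonical serialization regardless of dict order.
    payload = json.dumps(entries, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()[:length]

args = [("float32", (16, 16), "INPUT", "FIRST_MAJOR")]
suffix_a = stable_suffix(args, {"p_m_split_size": 16}, "HOST")
suffix_b = stable_suffix(args, {"p_m_split_size": 16}, "HOST")
suffix_c = stable_suffix(args, {"p_m_split_size": 32}, "HOST")
```
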
    
  • Merged PR 2460: [nfc] Fix build.sh setting for vcpkg debug builds.
    [Lisa Ong]

  • Merged PR 2461: Replace MemoryType with MemorySpace for consistency.
    [Mason Remy]

    Replace MemoryType with MemorySpace for consistency

  • Merged PR 2416: Implement initial thrifty caching support. [Mason
    Remy]

    Implement initial thrifty caching support

    • This is a simple brute-force approach where each thrifty cache is
      examined element-by-element alongside the array it is caching to check
      whether there is a stride of 1 between every access
    • Currently this thrifty analysis and the potential erasing of thrifty
      caches happens after the cache ops have been created. This is due to
      needing the cache mapping to have already run in order to support
      hierarchical caching scenarios. Eventually this should be refactored
      and the thrifty analysis should be used to prevent creating the cache
      ops, but that is a larger refactor than the scope for this task.
    • When creating affine loads and stores into caches, this change also
      tacks on some attributes onto the load/store ops to indicate how the
      original load or store accessed the base array. Since the base array
      -> cache position mapping is not always invertible (consider
      coefficient cache layout cases), this is one of the only ways to
      encode this information. Unfortunately, canonicalization on affine
      load/store ops will scrub away these attributes, so any reliance on
      them has to occur before a canonicalization pass. Similarly, the
      MakeCacheOps record which arguments to their accesses are the base
      array positions, and this depends on the operand list being unchanged.
      However, canonicalization may remove operands if it determines they are
      not used. While this is fine for the load/store op itself, any assumption
      like "base array indices are at positions N...N+K in the operand list"
      is no longer valid.

    Related work items: #3575
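
    The brute-force check described in the first bullet can be sketched as follows, assuming a 2-D cache and a function that maps each cache position to a flat offset in the base array. The names are hypothetical, and the real analysis works on IR rather than Python callables.

```python
def is_redundant_cache(cache_shape, base_offset_of):
    """True if walking the cache in order visits the base array with stride 1.

    base_offset_of maps a cache position (i, j) to a flat base-array offset.
    """
    rows, cols = cache_shape
    offsets = [base_offset_of(i, j) for i in range(rows) for j in range(cols)]
    return all(b - a == 1 for a, b in zip(offsets, offsets[1:]))

BASE_COLS = 8   # assumed row-major base array with 8 columns

# A cache spanning full rows of the base array is already contiguous there.
full_rows = is_redundant_cache((2, 8), lambda i, j: i * BASE_COLS + j)
# A 2x4 tile skips 4 elements at each row boundary, so the copy is not redundant.
sub_tile = is_redundant_cache((2, 4), lambda i, j: i * BASE_COLS + j)
```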

  • Merged PR 2459: Changes the order of the LLVM_SETUP_VARIANT detection.
    [Kern Handa]

    Changes the order of the LLVM_SETUP_VARIANT detection

  • Merged PR 2458: Fixes building with clang++ on Linux/WSL. [Kern Handa]

    Fixes building with clang++ on Linux/WSL

  • Merged PR 2438: Support for double-buffer caching. [Mason Remy]

    Support for double-buffer caching

    • Adds plumbing from python dsl for double_buffer flag to cache API
    • Implements double buffering by hoisting the initial cache fill outside
      of the cache trigger loop parent, then creating a prologue subnest
      that fills a temporary buffer with the (i+1)-th iteration's data and an
      epilogue subnest that moves that temporary buffer data into the main
      cache buffer. The last iteration of the trigger loop parent loop is
      unswitched and no cache filling is done in that loop.
    • On GPU the temporary buffer is allocated in private memory, and if the
      cache is in shared memory each thread holds onto its own
      contribution to the cache in its own private memory buffer until the
      epilogue fill nest
    • Barrier ops are hoisted out of conditionals to avoid potential
      deadlocks. The conditionals introduced in this PR should be
      always-true or always-false, but this is added as a safety measure.
      Currently the hoisting is naive: any barrier within a conditional is
      erased and barriers are placed before and after the conditional block.
      This is not correct for all future conditional scenarios, because any
      operations within the conditional that depend on the barrier will be
      broken. However, it works for how conditionals are used currently and
      can be improved over time
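
    The double-buffering schedule described in the list above can be sketched in plain Python. This is only an illustration of the loop structure (hoisted initial fill, prologue, epilogue, unswitched last iteration), not Accera's generated code. The names load_tile and sum_tiles_double_buffered are hypothetical, and the sketch assumes the data length is a multiple of the tile size.

```python
def load_tile(data, i, tile):
    """Stand-in for a cache fill: copy tile i out of 'global' memory."""
    return list(data[i * tile:(i + 1) * tile])


def sum_tiles_double_buffered(data, tile):
    """Sum `data` tile by tile using a main cache and a temporary buffer.

    Hypothetical helper; assumes len(data) is a multiple of `tile`.
    Returns the sum and a trace of the schedule that was followed.
    """
    num_tiles = len(data) // tile
    if num_tiles == 0:
        return 0, []
    trace = []

    # Initial cache fill, hoisted out of the trigger loop.
    cache = load_tile(data, 0, tile)
    trace.append("fill cache <- tile 0")

    total = 0
    # Every iteration except the last: prologue, compute, epilogue.
    for i in range(num_tiles - 1):
        # Prologue: fetch iteration i+1's data into the temporary buffer.
        temp = load_tile(data, i + 1, tile)
        trace.append(f"prologue temp <- tile {i + 1}")

        # Compute on the data already resident in the main cache.
        total += sum(cache)
        trace.append(f"compute tile {i}")

        # Epilogue: move the temporary buffer into the main cache.
        cache = temp
        trace.append("epilogue cache <- temp")

    # Last iteration, unswitched: compute only, no cache filling.
    total += sum(cache)
    trace.append(f"compute tile {num_tiles - 1}")
    return total, trace
```

    Calling sum_tiles_double_buffered(list(range(12)), 4) walks three tiles. The returned trace shows one hoisted fill, two prologue/epilogue pairs, and a final compute step with no fill.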

    Related work items: #3659
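
    The naive barrier hoisting described in this entry can be illustrated on a toy op list. This is a sketch under assumptions, not the actual MLIR pass: ops are plain strings, a conditional is modeled as the tuple ("if", [ops]), and barriers in nested conditionals are assumed to be erased as well.

```python
def _strip_barriers(ops):
    """Remove every barrier from `ops` (recursing into nested conditionals).

    Returns the remaining ops and whether any barrier was found.
    """
    kept = []
    found = False
    for op in ops:
        if op == "barrier":
            found = True
        elif isinstance(op, tuple) and op[0] == "if":
            inner, inner_found = _strip_barriers(op[1])
            found = found or inner_found
            kept.append(("if", inner))
        else:
            kept.append(op)
    return kept, found


def hoist_barriers(ops):
    """Erase barriers inside top-level conditionals and re-emit them around the block."""
    result = []
    for op in ops:
        if isinstance(op, tuple) and op[0] == "if":
            body, found = _strip_barriers(op[1])
            if found:
                # One barrier before and one after the conditional block.
                result.append("barrier")
                result.append(("if", body))
                result.append("barrier")
            else:
                result.append(("if", body))
        else:
            result.append(op)
    return result
```

    The limitation noted above is visible in this model: an op inside the conditional that relied on the erased barrier now runs without it, so the rewrite is only safe when the conditional is uniformly taken or not taken by all threads.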

  • Merged PR 2450: Automatically add parameter dict as auxiliary data.
    [Denny Sun]

    Automatically add parameter dict as auxiliary data

    Related work items: #3662

  • Merged PR 2456: Updates CUDA source emission based on testing with
    nvrtc. [Kern Handa]

    Updates CUDA source emission based on testing with nvrtc

  • Merged PR 2453: Sets CPU targets to default to openmp. [Kern Handa]

    Sets CPU targets to default to openmp

  • Merged PR 2443: Add FP16 support. [Abdul Dakkak]

    Preparation for adding mfma support for CUDA, which only operates on FP16

  • Merged PR 2452: Updates GPU source emitting path to emit host launcher
    and device function pairs. [Kern Handa]

  • Merged PR 2451: Updates IR util ResolveExec[Target,Runtime] to allow
    for exact matches. [Kern Handa]

    Updates IR util ResolveExec[Target,Runtime] to allow for exact matches

  • Merged PR 2447: Makes Vulkan-specific behavior predicated on Runtime.
    [Kern Handa]

    Makes Vulkan-specific behavior predicated on Runtime

  • Merged PR 2446: Updates Runtime enum in Targets.py to be more
    comprehensive. [Kern Handa]

    Updates Runtime enum in Targets.py to be more comprehensive

  • Merged PR 2449: [Cleanup] Replace "rc*" prefixes with "acc*"
    prefixes in tablegen'ed code. [Lisa Ong]

    For *.td, perform the following replacements for ops:

    s/rcv_/accv_/g
    s/rc_/acc_/g
    s/rcxp_/accxp_/g
    s/rcln_/accln_/g
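
    As an illustration, the four substitutions above can be reproduced in Python. The sample identifiers in the usage note (rcv_load, rc_nest and so on) are made up, and the word-boundary variant is an added assumption rather than something the PR describes.

```python
import re

# The four substitutions listed above, as (old prefix, new prefix) pairs.
RENAMES = [
    ("rcv_", "accv_"),
    ("rc_", "acc_"),
    ("rcxp_", "accxp_"),
    ("rcln_", "accln_"),
]


def rename_prefixes(text):
    """Apply each substitution globally, like running the sed commands in order."""
    for old, new in RENAMES:
        text = text.replace(old, new)
    return text


def rename_prefixes_at_word_start(text):
    """Stricter variant (an assumption, not from the PR): match only at a word boundary."""
    for old, new in RENAMES:
        text = re.sub(r"\b" + re.escape(old), new, text)
    return text
```

    For example, rename_prefixes("rcv_load rc_nest") gives "accv_load acc_nest". A plain global substitution also rewrites "rc_" inside longer identifiers such as "src_file", which the word-boundary variant leaves untouched.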

  • Merged PR 2448: fix typo in the condition for mod in range analysis.
    [Abdul Dakkak]

    fix typo in the condition for mod in range analysis

  • Merged PR 2445: Fix bind command when index is further split. [Abdul
    Dakkak]

  • Merged PR 2444: add range remainder. [Abdul Dakkak]

    add range remainder
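
    For readers unfamiliar with range analysis, here is a minimal sketch of what a remainder rule can look like. It is not Accera's implementation. The function name remainder_range is hypothetical, and the sketch assumes a non-negative input range and a positive constant divisor.

```python
def remainder_range(lo, hi, d):
    """Return (min, max) of x % d over all integers x with lo <= x <= hi.

    Hypothetical helper; assumes 0 <= lo <= hi and d > 0.
    """
    if lo < 0 or hi < lo or d <= 0:
        raise ValueError("expects 0 <= lo <= hi and d > 0")
    # If the interval stays inside one block [k*d, (k+1)*d), the remainder
    # increases monotonically from lo % d to hi % d.
    if lo // d == hi // d:
        return (lo % d, hi % d)
    # Otherwise the interval crosses a multiple of d, so it contains both
    # k*d - 1 (remainder d - 1) and k*d (remainder 0).
    return (0, d - 1)
```

    A tight bound on the remainder matters because later passes can use it, for instance to prove that an index stays within a tile.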

  • Merged PR 2441: Fix APInt usage in RangeValueOptimizePass. [Mason
    Remy]

    Run the RangeValueOptimizePass as part of acc-to-llvm

  • Merged PR 2442: Move ExecutionOptions to ir lib and create arrayattr
    <-> struct utils. [Mason Remy]

    Move ExecutionOptions to ir lib and create arrayattr <-> struct utils

  • Simplify target passthrough layer. [Mason Remy]

  • Move ExecutionOptions to ir lib and create arrayattr <-> struct utils.
    [Mason Remy]

  • Merged PR 2430: Remove unnecessary barrier ops. [Chuck Jacobs]

    This PR adds an optim...

v1.2.1

26 Jan 11:08

What's Changed

  • Merged PR 2391: Update quickstart example, updated docs structure per
    feedback. [Lisa Ong]

    • Teasers for transformations in the Quickstart sample (to differentiate Accera from others), with benchmarking
    • Removed the Miscellaneous section, redistributed various docs to various related locations
    • Renamed the cross-compilation tutorial so that it is ordered last

  • Merged PR 2392: Populate Target.Models based on known devices. [Kern
    Handa]

    Populate Target.Models based on known devices

  • Merged PR 2390: Merge multiple HAT files during project building.
    [Kern Handa]

    Merge multiple HAT files during project building

    Related work items: #3559
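
    One possible merge policy is sketched below, assuming each HAT file has already been parsed into a dictionary. The "functions" key and the keep-the-first-file's-metadata rule are assumptions made for illustration. The PR does not state how Accera reconciles metadata.

```python
import copy


def merge_hat_tables(tables):
    """Merge parsed HAT tables: the first table supplies the metadata,
    later tables only contribute their function entries.

    Hypothetical helper, not part of Accera or hatlib.
    """
    if not tables:
        raise ValueError("need at least one table to merge")
    merged = copy.deepcopy(tables[0])
    merged.setdefault("functions", {})
    for table in tables[1:]:
        for name, entry in table.get("functions", {}).items():
            if name in merged["functions"]:
                # Refuse to silently overwrite a function of the same name.
                raise ValueError(f"duplicate function: {name}")
            merged["functions"][name] = copy.deepcopy(entry)
    return merged
```

    Raising on duplicate function names is a conservative choice. A real merge might instead need to compare the two declarations.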

  • Merged PR 2386: Add support for various targets. [Kern Handa]

    Add support for various targets

    Related work items: #3631

  • Merged PR 2389: [nfc] Doc typos and consistency fixes. [Lisa Ong]

  • Merged PR 2388: Update quickstart example, add binder quickstart.
    [Lisa Ong]

    • Update quickstart example to perform a matmul + ReLU (unoptimized)
    • Add Launch in Binder button to run everything in the browser

  • Merged PR 2387: Placeholder GPU GridUnit definitions, add library
    creation from multiple object files. [Lisa Ong]

    Dependent HAT PR: microsoft/hat#21

    • GridUnit definitions are static until we have real GPU targets. These are updated just to be consistent with the Manual
    • When not cross compiling, combine multiple .obj/.o into .lib/.a

    Related work items: #3576
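
    Combining object files into a static library typically comes down to one archiver invocation. The sketch below only builds the command line and does not run it. The helper name archive_command and the platform strings are hypothetical, and Accera's build may invoke the tools differently.

```python
def archive_command(object_files, output_stem, platform):
    """Build the argv for bundling object files into a static library.

    Hypothetical helper: `platform` is "windows" or anything else
    (treated as a Unix-like system with `ar` available).
    """
    objs = list(object_files)
    if not objs:
        raise ValueError("no object files given")
    if platform == "windows":
        # MSVC librarian: lib /OUT:<name>.lib a.obj b.obj
        return ["lib", f"/OUT:{output_stem}.lib", *objs]
    # ar modifiers: r = insert or replace members, c = create the archive
    # without a warning, s = write an object-file index.
    return ["ar", "rcs", f"{output_stem}.a", *objs]
```

    The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where the corresponding tool is on the PATH.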

  • Merged PR 2384: Update target docs, split Intel generation 8 and 9 for
    consistency. [Lisa Ong]

    • Update target docs to list the name of the target in the table
    • Define separate models for Intel generation 8 and 9 for consistency

    Related work items: #3631

  • Merged PR 2383: Support dynamic libs from Package.build [Lisa Ong]

    • Add static and dynamic variants to the HAT and MLIR formats
    • MLIR format is also split because we'd want to support MLIR inspection of the cross-compilation scenario without forcing users to switch between dynamic and static
    • Updated README sample

    Left for future work:

    • Combining multiple object files into a static lib or dynamic lib. We'd need to think about how HAT packages can be merged together (for example, how to reconcile the metadata in the HAT file, such as description, author - do we merge all metadata or just pick the first HAT file encountered as the "master", etc)

    Related PR: microsoft/hat#18

    Related work items: #3576

  • Merged PR 2382: [nfc] Move Case Studies out of the Accera repo. [Lisa
    Ong]

    Case Studies will live in other repositories, and be cross linked from the Accera repo's Case Studies README.md (to be added in the future).

    Related work items: #3632

  • Merged PR 2379: Specify dynamic lib dependencies from the HAT Package.
    [Lisa Ong]

    This is the final missing piece before we transition to building static / dynamic libs using hatlib.

    • Plan infers additional dynamic dependencies when the target is GPU or when parallelization is requested.
    • Package.add collects the dependency info from the various Plan instances.
    • In Package.build, the platform parameter is used to resolve to the appropriate library (either a path or a -l directive).
      • For library paths that cannot be fully determined in advance, we default to the current working directory, so the user can perhaps put the lib in the same path as the binaries (this needs to be fleshed out more).
    • Removed dead code

    Dependent hatlib PR: https://github.com/microsoft/hat/pull/16/files

    Related work items: #3576

  • Merged PR 2380: Add Raspberry Pi 4 (B) support. [Kern Handa]

    Related work items: #3631

  • Merged PR 2368: Update and optimize acc-translate. [Abdul Dakkak]

    • propagate constants while generating C++ code
    • inline mlir within the C++ code to ease debugging
    • increase support for vector ops
    • silence a lot of warnings that were being emitted in the acc-translate codebase

    The following MLIR

    // CONFIG: {"K":2048,"M":2048,"N":2048,"block":{"x":16,"y":16,"z":1},"grid":{"x":128,"y":128,"z":1}}
    module @gemm_naive_14479263422999410716_module attributes {gpu.binary = "HSACO"} {
      func @gemm_naive_14479263422999410716(%arg0: memref<2048x2048xf32> loc(unknown), %arg1: memref<2048x2048xf32> loc(unknown), %arg2: memref<2048x2048xf32> loc(unknown)) {
        %c16 = constant 16 : index loc(unknown)
        %c0 = constant 0 : index loc(unknown)
        %c2048 = constant 2048 : index loc(unknown)
        %c1 = constant 1 : index loc(unknown)
        %cst = constant 0.000000e+00 : f32 loc(unknown)
        %0 = "gpu.thread_id"() {dimension = "x"} : () -> index loc(unknown)
        %1 = "gpu.thread_id"() {dimension = "y"} : () -> index loc(unknown)
        %2 = "gpu.block_id"() {dimension = "x"} : () -> index loc(unknown)
        %3 = "gpu.block_id"() {dimension = "y"} : () -> index loc(unknown)
        %4 = scf.for %arg3 = %c0 to %c2048 step %c1 iter_args(%arg4 = %cst) -> (f32) {
          %11 = muli %3, %c16 : index loc(unknown)
          %12 = addi %1, %11 : index loc(unknown)
          %13 = memref.load %arg0[%12, %arg3] : memref<2048x2048xf32> loc(unknown)
          %14 = muli %2, %c16 : index loc(unknown)
          %15 = addi %0, %14 : index loc(unknown)
          %16 = memref.load %arg1[%arg3, %15] : memref<2048x2048xf32> loc(unknown)
          %17 = mulf %13, %16 {RelaxedPrecision} : f32 loc(unknown)
          %18 = addf %arg4, %17 {RelaxedPrecision} : f32 loc(unknown)
          scf.yield %18 : f32 loc(unknown)
        } loc(unknown)
        %5 = muli %3, %c16 : index loc(unknown)
        %6 = addi %1, %5 : index loc(unknown)
        %7 = muli %2, %c16 : index loc(unknown)
        %8 = addi %0, %7 : index loc(unknown)
        %9 = memref.load %arg2[%6, %8] : memref<2048x2048xf32> loc(unknown)
        %10 = addf %9, %4 {RelaxedPrecision} : f32 loc(unknown)
        memref.store %10, %arg2[%6, %8] : memref<2048x2048xf32> loc(unknown)
        return loc(unknown)
      } loc(unknown)
    } loc(unknown)
    

    generates the following cpp file

    #if defined(__HIP_PLATFORM_AMD__)
    #include <hip/hip_runtime.h>
    using vfloatx2_t = float __attribute__((ext_vector_type(2)));
    using vfloatx4_t = float __attribute__((ext_vector_type(4)));
    using vfloatx16_t = float __attribute__((ext_vector_type(16)));
    #else
    #include "cuda_fp16.h"
    #endif // !defined(__HIP_PLATFORM_AMD__)
    
    #include <math.h>
    #include <stdint.h>
    
    __global__ void gemm_naive_14479263422999410716(float (*arg0)[2048], float (*arg1)[2048], float (*arg2)[2048])
    {
        /*%0 = "gpu.thread_id"() {dimension = "x"} : () -> index*/
        const uint threadIdx_x_0 = threadIdx.x;
        /*%1 = "gpu.thread_id"() {dimension = "y"} : () -> index*/
        const uint threadIdx_y_1 = threadIdx.y;
        /*%2 = "gpu.block_id"() {dimension = "x"} : () -> index*/
        const uint blockIdx_x_2 = blockIdx.x;
        /*%3 = ...
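
    To make the index arithmetic in the MLIR above easier to follow, here is a sequential Python emulation of the same naive GEMM. It is an illustrative sketch, not output of acc-translate. The function name and the (x, y) tuple layout for block and grid are assumptions.

```python
def gemm_naive_emulated(A, B, C, block, grid):
    """Run the kernel body once per (block, thread) pair, sequentially.

    `block` and `grid` are (x, y) sizes. A, B, C are lists of rows.
    """
    K = len(B)
    for block_y in range(grid[1]):
        for block_x in range(grid[0]):
            for thread_y in range(block[1]):
                for thread_x in range(block[0]):
                    # Mirrors %6 = thread_id.y + block_id.y * 16 and
                    # %8 = thread_id.x + block_id.x * 16 in the MLIR,
                    # with 16 replaced by the block size.
                    row = thread_y + block_y * block[1]
                    col = thread_x + block_x * block[0]
                    acc = 0.0
                    # The scf.for loop: accumulate over the K dimension.
                    for k in range(K):
                        acc += A[row][k] * B[k][col]
                    # Load C, add the accumulated value, store back.
                    C[row][col] += acc
    return C
```

    With the CONFIG shown above, a 128 x 128 grid of 16 x 16 blocks covers exactly 2048 x 2048 output elements, one per thread. The emulation can be tried with much smaller sizes, for example 4 x 4 matrices with block=(2, 2) and grid=(2, 2).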
    
    
  • Merged PR 2376: [build] Install acc-lsp-server as an internal tool.
    [Lisa Ong]

    Removes acc-lsp-server from accera-compilers

    Minor CMake macro renames to (hopefully) improve usability

  • Merged PR 2378: [doc] Update doc links after DSL changes, fix missing
    file warnings. [Lisa Ong]

    Verified by:

    cd <accera_root>
    pip install mkdocs-material mkdocs-git-revision-date-plugin
    mkdocs serve
    
  • Merged PR 2377: Retire Benchmark.py, use hatlib for benchmarking and
    shared library creation. [Lisa Ong]

    This cleanup work precedes the actual work to produce static or dynamic libraries by migrating existing HAT Python scripts to consume hatlib. Next PRs will consume hatlib to produce those libraries.

    hatlib defines a HAT package as .hat files and a library.

    • Remove accera.tuning.AutoBenchmark and replace usages with hat.run_benchmark in case studies
    • Removed accera.tuning.CorrectnessCheck. Baked correctness checking into accera.test.verifiers
    • Disabled some tests in preparation for coming work (next PRs)
      • parallelization tests: need to specify lomp as a link target dependency in the HAT file, and update hatlib to honor this flag
      • emit_unpacked_buffer_tests: to resolve multi-MLIR-module scenario where we have a globals module in addition to the package module

    Depends on this PR: microsoft/hat#15

    Related work items: #3556

  • Merged PR 2374: Retain and honor the order of functions added to the
    package. [Kern Handa]

    Retain and honor the order of functions added to the package

    Related work items: #3629

  • Merged PR 2371: add lsp server for accera. [Abdul Dakkak]

    This adds an LSP server to be used with the MLIR VS Code extension (https://marketplace.visualstudio.com/items?itemName=llvm-vs-code-extensions.vscode-mlir). You will have to specify the LSP server in your settings.json. On my system this means adding the following setting:

      "mlir.server_path": "${workspaceFolder}/build/accera/acc-lsp-server/acc-lsp-server",
    

    It's not super robust though

  • Merged PR 2372: reduce install size. For example, on Linux the install
    size goes from 873M to 742M. [Abdul Dakkak]

    Reduce install size. For example, on Linux the install dir goes from 873M to 742M. More can be done along those lines.

  • Merged PR 2369: run clang-format on acc_translate. [Abdul Dakkak]

    run clang-format on acc_translate. There are no modifications to the code

  • Merged PR 2367: Selectively emit GPU utilities. [Kern Handa]

    Selectively emit GPU utilities

    Related work items: #3559

  • Merged PR 2366: [build] Fix manylinux package build. [Lisa Ong]

    Apply updated requirements.txt without rebuilding docker image

  • Merged PR 2365: Unify Package.add_function and Package.add_functions
    into Package.add. [Kern Handa]

    Related work items: #3549

  • Merged P...
