Demo fixes for hatlib 0.0.11 #36

Merged 1 commit on Apr 20, 2022


@lisaong lisaong commented Apr 20, 2022

Describe the pull request

  • What does your PR fix?

    Fixes signature changes to hat.load from hatlib 0.0.11

  • Is this a documentation-only fix?

    Yes

If you are still working on the PR, open it as a Draft: https://github.blog/2019-02-14-introducing-draft-pull-requests/

@lisaong lisaong merged commit 31b8ff7 into main Apr 20, 2022
@lisaong lisaong deleted the dev/onglisa/demo_hatlib_0.0.11 branch April 20, 2022 13:34
lisaong pushed a commit that referenced this pull request Apr 27, 2022
commit 6654c774fb0d2d6fac760b911a547b4e66b23127
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Wed Apr 27 00:44:53 2022 +0000

    Merged PR 2522: Generalize array indexing in tensorized GEMM

    This PR generalizes the MFMA tensorization pass to improve the handling of code in the innermost loop. It recognizes more ways of writing the GEMM kernel, and rejects many ill-formed GEMM kernels.

    There are also a number of tests.

    This PR doesn't yet generalize to batch-GEMM, where the matrices (typically) have 3 indices.

    Related work items: #3676

commit 4d030709101f3653712b805bd8f3698e0e293bd3
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Apr 26 17:50:18 2022 +0000

    Merged PR 2551: [nfc][ci] Switch hosted pipelines to 1ES hosted pool

    * The Linux1ESPool is created to support internal builds of LLVM

    * Fix regression in pipeline due to overzealous .dockerignore

commit 9b9d6b4b77c46b12788665412b9d0d1c2ff62d18
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Apr 26 10:43:28 2022 +0000

    Merged PR 2550: [nfc] [docs] Merge changes from GitHub remote

    In preparation for merge from ADO to GitHub for Case Studies publishing

commit c1298946d18fb785788c556ea2959b9438f9c6b7
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Apr 26 08:10:47 2022 +0000

    Merged PR 2549: [Compliance] Switching from Dockerhub to ACR for third party containers

    Updating Dockerfile references

commit 0c7a3610ba082e82e554297bdadbf9579b094745
Author: Denny Sun <dennys@microsoft.com>
Date:   Tue Apr 26 04:40:05 2022 +0000

    Merged PR 2548: Add README file for case studies

    README file has a table where each case study points to the external repo link.

commit edbc50edd00efe8f12a675735d7e52371e43f7b1
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Mon Apr 25 23:49:15 2022 +0000

    Merged PR 2546: [dev] [nfc] Natively support macOS/arm64 for development

    Limited to local development scenarios (LLVM_SETUP_VARIANT=Default)

    No plans to release pip packages until there is CI support

    Verified on: Big Sur (MacOSX 12.3 arm64) / Python 3.10

commit 166e333a3d10b77c804dc3edc1c71bfc5716c768
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Mon Apr 25 17:50:22 2022 +0000

    Merged PR 2543: Add precomputed offset map optimization for tensorization (no caching)

    - Add flag to tensorize() to enable optimization (off by default)
    - Optimization only affects load/store of accumulator (C) argument
    - Supports all 4 mfma shapes

    Related work items: #3671

commit e11c4d4e87bbae87f7cb9035eff8e6af650c9d1a
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Sun Apr 24 01:00:41 2022 +0000

    Merged PR 2542: An assortment of minor fixes

    This PR is a hodgepodge of tiny fixes. I'm happy to split it up into separate PRs if a kitchen-sink PR is too gross.

    The specific things are:
    - Add 2 new target models to `Targets.py` (that correspond to my local dev boxes)
    - Change the snapshot IR format for sub-passes to use the same format as the top-level passes (that is, not "generic" format)
    - Print a warning message if `check_correctness` skips a correctness check because no hat file was generated
    - Add a "minimum version" constraint to `requirements.txt` for `hatlib`

commit 8da7903ac9b6d8612711593308e49a7a3e82678d
Author: Kern Handa <kerha@microsoft.com>
Date:   Sat Apr 23 23:59:53 2022 +0000

    Merged PR 2545: Unifies CUDA and CPP enum values to SOURCE for Package.Format

    Unifies CUDA and CPP enum values to SOURCE for Package.Format

    Related work items: #3679

commit fe2c40fa8f1c28dcf47e1533223457fd3e6bf195
Author: Kern Handa <kerha@microsoft.com>
Date:   Sat Apr 23 23:17:43 2022 +0000

    Merged PR 2544: [nfc] Removes now unnecessary ldebug output

    [nfc] Removes now unnecessary ldebug output

commit 32090d786ce13299bb77a6675c3478b3d7cdf48c
Author: Mason Remy <masonr@microsoft.com>
Date:   Fri Apr 22 21:31:01 2022 +0000

    Merged PR 2527: Enable vectorized shared memory write

    Enable vectorized shared memory write

    - This adds mod simplification support needed for vectorizing shared
      memory writes
    - Also refactors some of the affine simplification code slightly to
      share some common code between the floordiv and mod simplifications

    Related work items: #3586, #3661, #3689

commit 0eb698af118b94bf3f4d4862a142c86055f8b7bb
Author: Mason Remy <masonr@microsoft.com>
Date:   Fri Apr 22 19:13:27 2022 +0000

    Merged PR 2526: Enable GPU global read vectorization

    Enable GPU global read vectorization

    - Implements a floor div simplification that enables better recognition
      of vectorizable load and stores

    Related work items: #3661, #3690

commit df849f066ff6c2c82c796d9b48e3bea6390c7877
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Fri Apr 22 06:03:27 2022 +0000

    Merged PR 2541: Fix a few issues with GEMM benchmarking script

    This PR fixes a couple of errors:
    - there was a bug in the GEMM kernel
    - sometimes hatlib would fail to return a compiled function, but not throw an exception. These are now flagged as "uncompilable"

    It makes a couple of other tweaks:
    - it fails if the `alpha` and `beta` parameters aren't `1.0` and `0.0`
    - it culls some variants with known-uncompilable tensorization parameters before trying to compile them
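The "uncompilable" handling described above can be sketched as follows. This is a minimal illustration, not the benchmarking script's code: `classify_variant` and the `compile_fn` callable are hypothetical names, and the only behavior taken from the commit message is that a missing compiled function is treated like a compile failure.

```python
def classify_variant(compile_fn):
    """Return (status, func) for one kernel variant.

    compile_fn is a hypothetical zero-argument callable that either
    returns a compiled function, returns None, or raises.
    """
    try:
        func = compile_fn()
    except Exception:
        # An explicit failure is flagged as uncompilable.
        return "uncompilable", None
    if func is None:
        # A silent failure (no exception, no function) gets the same flag.
        return "uncompilable", None
    return "ok", func


def _raises():
    raise RuntimeError("compile failed")


silent_status, _ = classify_variant(lambda: None)
raised_status, _ = classify_variant(_raises)
ok_status, ok_func = classify_variant(lambda: (lambda x: x + 1))
```

Checking for `None` separately from the exception path keeps silent failures from being counted as successful builds.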

commit 339253767ae4bb4f7e5c323f77fc938ba1a4ab92
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Fri Apr 22 01:26:53 2022 +0000

    Merged PR 2538: Fix std::pair unpacking issue in TensorizeAffineForOpConversion

    In debug builds, we are getting garbage values for warpSizeX and warpSizeY, resulting in division by 0 errors in the emitted .cu files

commit 075c83247d34bfd9fb291e4ea6b9df059a94993a
Author: Denny Sun <dennys@microsoft.com>
Date:   Fri Apr 22 00:26:56 2022 +0000

    Merged PR 2536: Parameter supports most of the arithmetic/binary/unary operations defined in operator lib

    Parameter supports the basic arithmetic operations (+, -, *, //, %). For example, the user can write the following code:

    fma_unit_count, vector_size = acc.create_parameters(2)
    jjj = schedule.split(jj, fma_unit_count * vector_size)
    jjjj = schedule.split(jjj, vector_size)
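
One way arithmetic on parameters could work is for each operator to return a new deferred expression that is evaluated only once concrete values are supplied. The sketch below illustrates that idea; it is not Accera's implementation, and the `DeferredParam` class and its `resolve` method are hypothetical names.

```python
import operator


class DeferredParam:
    """Hypothetical stand-in for a parameter whose value is supplied later."""

    def __init__(self, name=None, fn=None, operands=()):
        self.name = name          # set for leaf parameters
        self.fn = fn              # set for derived expressions
        self.operands = operands

    def _derive(self, other, fn):
        # Build a new deferred expression instead of computing a value now.
        return DeferredParam(fn=fn, operands=(self, other))

    def __add__(self, other):
        return self._derive(other, operator.add)

    def __sub__(self, other):
        return self._derive(other, operator.sub)

    def __mul__(self, other):
        return self._derive(other, operator.mul)

    def __floordiv__(self, other):
        return self._derive(other, operator.floordiv)

    def __mod__(self, other):
        return self._derive(other, operator.mod)

    def resolve(self, values):
        """Evaluate the expression given a dict of leaf names to values."""
        if self.fn is None:
            return values[self.name]
        args = [
            op.resolve(values) if isinstance(op, DeferredParam) else op
            for op in self.operands
        ]
        return self.fn(*args)


fma_unit_count = DeferredParam("fma_unit_count")
vector_size = DeferredParam("vector_size")
split_size = fma_unit_count * vector_size
resolved = split_size.resolve({"fma_unit_count": 4, "vector_size": 8})
```

Here `split_size` stays symbolic until `resolve` is called, which mirrors how a split size written as a product of two parameters can be fixed later, when the package is built.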

    Related work items: #3692

commit 6d5e71899c6fb606e32ec46ee871ae1af25d3cd6
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Thu Apr 21 18:22:12 2022 +0000

    Merged PR 2539: [nfc][docs] Merging commits from Github/main

    commit ee28126a338d905eb5931038d3c5daba6ead3811
    Author: Lisa Ong <11318241+lisaong@users.noreply.github.com>
    Date:   Wed Apr 20 21:35:20 2022 +0800

        Update arrow label positions (#35)

        * [nfc] [doc] Update arrow label positions

        * make arrowhead more visible

        * nfc

    commit ddcecaaffd9dd0861999a6d29443dc7c37d79665
    Author: Lisa Ong <11318241+lisaong@users.noreply.github.com>
    Date:   Wed Apr 20 21:34:40 2022 +0800

        demo fixes for hatlib 0.0.11 (#36)
kernhanda added a commit that referenced this pull request Jul 13, 2022
commit f3a1a2becb6740ae8cf7873b5029c6df140f5c19
Author: Kern Handa <kerha@microsoft.com>
Date:   Tue Jul 12 16:52:41 2022 +0000

    Merged PR 2744: [doc] Fixes link in reference/functions/cast.md, revs version on all docs

    [doc] Fixes link in reference/functions/cast.md

commit 23f4c8fbf2415b02e8b0090a76380d34790205fa
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Jul 12 05:55:48 2022 +0000

    Merged PR 2743: [DSL] Document implicit casting rules and the explicit `cast` function

    * Document implicit casting rules implemented by !2693
    * Promote `acc.cast` to a documented function to give the user control to override implicit casting behavior

commit 3ec63b62705327a65decc4da7ec4cb5412dc7299
Author: Kern Handa <kerha@microsoft.com>
Date:   Mon Jul 11 23:57:23 2022 +0000

    Merged PR 2739: Updates ROCM tensorization pattern to handle casting

    Updates ROCM tensorization pattern to handle casting

commit 60c082dd38ff1b0bc030a7e28dc19f553bad9099
Author: Mason Remy <masonr@microsoft.com>
Date:   Mon Jul 11 22:58:42 2022 +0000

    Merged PR 2643: Some fixes for last major array caching in tensorization

    Some fixes for last major array caching in tensorization

commit 812c3065b7d4d6c9d716acf4fb1df4be66ef101d
Author: Kern Handa <kerha@microsoft.com>
Date:   Mon Jul 11 20:43:12 2022 +0000

    Merged PR 2693: Updates DSL codegen to implicitly cast if possible

    Updates DSL codegen to implicitly cast if possible

commit 6ed316e50e8f9e398f9ee6b8bfa8e6aa05fbffb1
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Sat Jul 9 05:52:22 2022 +0000

    Merged PR 2735: Pass multiple input files as comma-separated list to benchmark tool

    https://intelligentdevices.visualstudio.com/ELL/_build/results?buildId=41588&view=logs&j=d78921a4-2f18-50b0-77ad-4c6803f3371b&t=f97c60f6-ada7-5ec9-5ea1-510216c408e9

    Above pipeline did not run the 2nd set of input sizes since the 1st process did not exit until pipeline timeout was hit. After the fix, we will always have a single job.
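
Accepting several input files through one comma-separated argument, so that a single process handles all of them, might look like the sketch below. The `--input` flag name and the file names are hypothetical; the benchmark tool's actual argument names are not shown in the commit message.

```python
import argparse


def comma_separated(text):
    """Split 'a,b,c' into ['a', 'b', 'c'], ignoring empty entries."""
    return [part.strip() for part in text.split(",") if part.strip()]


def parse_args(argv):
    parser = argparse.ArgumentParser()
    # Hypothetical flag: one argument carries every input file.
    parser.add_argument("--input", type=comma_separated, default=[])
    return parser.parse_args(argv)


args = parse_args(["--input", "sizes_small.csv, sizes_big.csv"])
```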

commit e5010caebc5a135e40464a06432a5cf1fc965203
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Mon Jun 27 23:32:49 2022 +0000

    Merged PR 2721: Remove unnecessary logging in benchmarks

    Remove unnecessary logging in benchmarks

commit e0c5945d3ef218a5be858bc0934274793972abdb
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Jun 21 01:12:02 2022 +0000

    Merged PR 2674: Support emitting runtime array sizes in the Value DSL

    * Minimum set of changes to support runtime sizes in the Value DSL without transformations
    * Add a ScalarDimension type (name TBC) which is aliased to Scalar
    * Support variable ends in MemoryLayout, ScheduledLoopOp, RangeValueAnalysis
    * Use mlir::ShapedType::kDynamicSize and mlir::ShapedType::kDynamicStrideOrOffset as sentinel values, following the pattern in MemRefOps, TensorOps, etc.
    * TODO: E2E verification in the next PR
    * TODO: Python DSL changes in the next PR

    Output of mlir-translate for the runtime_sizes_all case, where %21, %22 and %23 are the runtime sizes for M, N, and K:

    ```
    define void @NestMatMul(float* %0, float* %1, i64 %2, i64 %3, i64 %4, i64 %5, i64 %6, float* %7, float* %8, i64 %9, i64 %10, i64 %11, i64 %12, i64 %13, float* %14, float* %15, i64 %16, i64 %17, i64 %18, i64 %19, i64 %20, i64 %21, i64 %22, i64 %23) !dbg !3 {
      br label %25, !dbg !7

    25:                                               ; preds = %57, %24
      %26 = phi i64 [ %58, %57 ], [ 0, %24 ]
      %27 = icmp slt i64 %26, %21, !dbg !9
      br i1 %27, label %28, label %59, !dbg !10

    28:                                               ; preds = %25
      br label %29, !dbg !11

    29:                                               ; preds = %55, %28
      %30 = phi i64 [ %56, %55 ], [ 0, %28 ]
      %31 = icmp slt i64 %30, %22, !dbg !12
      br i1 %31, label %32, label %57, !dbg !13

    32:                                               ; preds = %29
      br label %33, !dbg !14

    33:                                               ; preds = %36, %32
      %34 = phi i64 [ %54, %36 ], [ 0, %32 ]
      %35 = icmp slt i64 %34, %23, !dbg !15
      br i1 %35, label %36, label %55, !dbg !16

    36:                                               ; preds = %33
      %37 = mul i64 %26, %5, !dbg !17
      %38 = add i64 %37, %34, !dbg !18
      %39 = getelementptr float, float* %1, i64 %38, !dbg !19
      %40 = load float, float* %39, align 4, !dbg !20
      %41 = mul i64 %34, %12, !dbg !21
      %42 = add i64 %41, %30, !dbg !22
      %43 = getelementptr float, float* %8, i64 %42, !dbg !23
      %44 = load float, float* %43, align 4, !dbg !24
      %45 = fmul float %40, %44, !dbg !25
      %46 = mul i64 %26, %19, !dbg !26
      %47 = add i64 %46, %30, !dbg !27
      %48 = getelementptr float, float* %15, i64 %47, !dbg !28
      %49 = load float, float* %48, align 4, !dbg !29
      %50 = fadd float %49, %45, !dbg !30
      %51 = mul i64 %26, %19, !dbg !31
      %52 = add i64 %51, %30, !dbg !32
      %53 = getelementptr float, float* %15, i64 %52, !dbg !33
      store float %50, float* %53, align 4, !dbg !34
      %54 = add i64 %34, 1, !dbg !35
      br label %33, !dbg !36

    55:                                               ; preds = %33
      %56 = add i64 %30, 1, !dbg !37
      br label %29, !dbg !38

    57:                                               ; preds = %29
      %58 = add i64 %26, 1, !dbg !39
      br label %25, !dbg !40

    59:                                               ; preds = %25
      ret void, !dbg !41
    }
    ```
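
Read as a loop nest, the IR above is a three-level matrix multiplication whose loop bounds are the runtime values M, N and K. A rough Python equivalent is sketched below, assuming flat buffers indexed as `row * row_stride + column`, which is how the `mul`/`add`/`getelementptr` sequences in the IR address each array. The function and parameter names are hypothetical; the row strides correspond to `%5`, `%12` and `%19`.

```python
def nest_matmul(a, a_row_stride, b, b_row_stride, c, c_row_stride, m, n, k):
    """Accumulate A (m x k) times B (k x n) into C (m x n), all flat buffers."""
    for i in range(m):              # bounded by the runtime size M (%21)
        for j in range(n):          # bounded by the runtime size N (%22)
            for p in range(k):      # bounded by the runtime size K (%23)
                a_val = a[i * a_row_stride + p]
                b_val = b[p * b_row_stride + j]
                c[i * c_row_stride + j] += a_val * b_val


# 2x3 times 3x2, stored row-major in flat lists.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
c = [0.0] * 4
nest_matmul(a, 3, b, 2, c, 2, m=2, n=2, k=3)
```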

    Related work items: #3716, #3717

commit 51a07e5c60009c47c3b375b402ac96f47619ca8f
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Tue Jun 21 00:18:02 2022 +0000

    Merged PR 2682: Add nvidia device optimized sizes and some benchmark fixes

    Add nvidia dev opt sizes and some bench fixes

commit 6325b5e5bc68136d29e4a65d657699a4e781214d
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Sat Jun 18 17:59:50 2022 +0000

    Merged PR 2676: Add automated weekly rocm baseline benchmark

    https://intelligentdevices.visualstudio.com/ELL/_build/results?buildId=41316&view=logs&j=4f7f213a-5f0f-58b0-1189-99ef12faf0d8&t=687344d2-d6b6-5d8c-dd9d-6aab558fd96c

    https://intelligentdevices.visualstudio.com/ELL/_build/results?buildId=41314&view=logs&j=4f7f213a-5f0f-58b0-1189-99ef12faf0d8

commit 940e599ff7026e7c41cb1b2566eec44d70709e96
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Fri Jun 17 16:34:22 2022 +0000

    Merged PR 2673: Add automated weekly baseline benchmarks on Nvidia GPU

commit 1a521c78783538e233230ecc90d2bc347092e0da
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Thu Jun 16 00:07:26 2022 +0000

    Merged PR 2657: Add conversion pass from gpu ops to rocdl ops

    - switch to gpu dialect for gpu index ops
    - add conversion pass from gpu dialect to rocdl

commit 3c3271deb2a2a42bac5e6356f28d7e04e6ff7678
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Wed Jun 15 02:52:50 2022 +0000

    Merged PR 2652: Add integer tensor ops support for AMD targets

    - int mfma ops
    - tests
    - static_cast in c++

    Related work items: #3727

commit 8b96d139067119de37ab6bcd1ecb0382ce7f46b3
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Jun 14 20:25:33 2022 +0000

    Merged PR 2650: [release] Docs version to 1.2.6, sync Github to ADO

    Author: Lisa Ong <onglisa@microsoft.com>
    Date:   Tue Jun 14 17:04:53 2022 +0800

        bump docs version to 1.2.6

    commit 4014c176e8e5eb270404915c1bdc7404d657d9f6
    Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    Date:   Sat Jun 4 07:11:15 2022 +0800

        Bump bottle from 0.12.19 to 0.12.20 in /tools/viz (#44)

        Bumps [bottle](https://github.com/bottlepy/bottle) from 0.12.19 to 0.12.20.
        - [Release notes](https://github.com/bottlepy/bottle/releases)
        - [Changelog](https://github.com/bottlepy/bottle/blob/master/docs/changelog.rst)
        - [Commits](https://github.com/bottlepy/bottle/compare/0.12.19...0.12.20)

commit 72ede2b19b5e21f3d96fc800bcc1ed30d69c389c
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Mon Jun 13 23:17:59 2022 +0000

    Merged PR 2624: Add more MMA shapes for CUDA

    Add more MMA shapes for CUDA
    - 32x8x16
    - 8x32x16

commit f237d89d12e4935af98c85b8a6ef5cdd2777272f
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Mon Jun 13 08:19:08 2022 +0000

    Merged PR 2644: Enable CUDA benchmarks only for A6000

    * Manually set the Target.Model user capability on agents running A6000
    * Update benchmarking pipelines to demand A6000s

    https://docs.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml#feedback

commit ad46d3c4bd399b1bd3fa6b24a90aa29f1f0bf685
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Fri Jun 10 19:44:02 2022 +0000

    Merged PR 2634: Remove couple more big gemm sizes

    Remove couple more big gemm sizes

commit 21638e465b02e4730842f14ac08f3caa28942b65
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Thu Jun 9 23:23:10 2022 +0000

    Merged PR 2626: [refactor] Moving debug mode to its own lowering pass

    Move the emitting of the debug mode wrapper function out of MLIREmitterContext into a lowering pass.

    This makes it easier to expand debug mode in the future.

commit 91c178f348c61b45a7505ee69da068f55536cdc9
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Thu Jun 9 17:20:50 2022 +0000

    Merged PR 2633: Bump hatlib to 0.0.19 to unblock CUDA T4 devices

    https://github.com/microsoft/hat/releases/tag/v0.0.19

commit 209effd66c47a36d7399a83a286447f9bc70f5b4
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Thu Jun 9 09:22:03 2022 +0000

    Merged PR 2630: Add batched gemm support with tensorization

    Related work items: #3677

commit 12cb73e175bb590e7a1b575b0ae75a3d99b812ff
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Thu Jun 9 07:12:17 2022 +0000

    Merged PR 2631: Add cosmosdb key env var and shuffle gemm sizes

    - Add env var for ACCOUNT_KEY
    - shuffle gemm sizes from small to big
    - remove correctness check from big inputs and fp16

commit bb4de9008fdfbc36768e922ce05efe5e28733cfd
Author: JUBI TANEJA <jubitaneja@microsoft.com>
Date:   Thu Jun 9 00:02:41 2022 +0000

    Merged PR 2607: Infrastructure for plan.auto() to support a basic none cache heuristics approach

    Infrastructure for plan.auto() to support a basic none cache heuristics approach

    This is a basic approach to test parameterization of the cache arguments index and layout.
    The user only needs to specify the source they want to cache, and AutoPlanner's
    NoneCacheHeuristics algorithm will synthesize the remaining caching parameters,
    each with a set of possible values.

    **Overall idea at DSL level:**
    Given input -
    schedule.reorder(i, j, k, ii, jj, kk)
    plan.auto(accera.algorithms.NoneCacheHeuristics(source = B, index = j))

    Internally, auto() invokes cache and adds two functions, each with
    a unique value of layout.

    plan.cache(source = B, index = j, layout = {FIRST_MAJOR, LAST_MAJOR})
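
The expansion described above, where one `auto()` call yields one function per candidate layout, can be pictured as enumerating every combination of the unspecified cache arguments. The sketch below is illustrative only: `enumerate_cache_configs` is a hypothetical helper, and plain strings stand in for the arrays, indices and layout enum values.

```python
from itertools import product


def enumerate_cache_configs(fixed, candidates):
    """Combine user-fixed arguments with every combination of candidate values.

    fixed:      dict of arguments the user specified
    candidates: dict mapping each unspecified argument to its possible values
    """
    names = list(candidates)
    configs = []
    for combo in product(*(candidates[name] for name in names)):
        config = dict(fixed)
        config.update(zip(names, combo))
        configs.append(config)
    return configs


configs = enumerate_cache_configs(
    fixed={"source": "B", "index": "j"},
    candidates={"layout": ["FIRST_MAJOR", "LAST_MAJOR"]},
)
```

With a single unspecified argument that has two candidate values, this produces two configurations, matching the two functions mentioned in the description.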

    **Important change in this PR:**
    - Add a new algorithms module in Accera
    - Instead of delaying resolution of delayed parameters to get their value,
      parameters can now be set with a set of possible values, and this can be
      passed between the heuristics and the plan object. See: Parameter.py
    - Parameters constructed by heuristics are termed "heuristic parameters".
      They are not available to external users of Accera; they are only named
      separately in the implementation to differentiate them from user-defined "parameters".

    **Limitation/Changes coming in the subsequent PRs:**
    - Allow user-defined parameters and heuristic parameters both for AutoPlanner test cases.
      For now, the code only focuses on testing AutoPlanner without any user-defined parameters
      that one can create using API: `create_parameters`.
    - Documentation of AutoPlanner -- design goals, tutorial, API description, etc. is coming in the
      next PR.

commit c437a48aede39d2bb8dcb963de6c3c3b5c6ac682
Author: Mason Remy <masonr@microsoft.com>
Date:   Tue Jun 7 07:30:34 2022 +0000

    Merged PR 2600: Refactor MFMA indexing calculations

    Refactor MFMA indexing calculations

    - Use the iteration space position when determining MFMA computation
      locations rather than computing the position from the thread id
    - Construct the full subschedules for AMD MFMA ops so that the bound
      loop indices are ordered appropriately for the MFMA op being invoked
    - Update unit tests accordingly. The schedule changes may need to be
      moved to an under-the-hood feature of tensorization

commit ea74434226671de7dac68dd70eb01c6142a1257e
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Mon Jun 6 23:29:26 2022 +0000

    Merged PR 2627: Raise error for invalid block dimensions

    Raise error for invalid block dimensions based on target info

    Related work items: #3715

commit 1ccd939a800ec565835c5b4917d3411aa526f992
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Mon Jun 6 17:12:42 2022 +0000

    Merged PR 2625: [nfc] Block debug mode for unsupported GPU targets

    Debug mode is not yet supported for GPU targets

    * Fail early
    * Update documentation

commit e757df5e8792d358d37ab8da3fe90d1310485274
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Fri Jun 3 21:36:08 2022 +0000

    Merged PR 2622: Fix dependencies for benchmark tools

    Fix dependencies for benchmark tools

commit 34a6f3f061ded0a0e0620393f120ce0e6431bc92
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Fri Jun 3 18:16:30 2022 +0000

    Merged PR 2604: Add bfloat16 support for tensor ops on rocm

    Add bfloat16 support for tensor ops on cuda and rocm

    Related work items: #3713

commit f5c864eaed962e7e3775e732abbf5e73df8e6c4b
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Fri Jun 3 00:41:50 2022 +0000

    Merged PR 2621: Merge changes from Github repo

    commit 5b5f5eff6b0c5422f026634153b5de219db2c628
    Author: Lisa Ong <11318241+lisaong@users.noreply.github.com>
    Date:   Fri Jun 3 00:14:04 2022 +0800

        [ci] Fix out of disk space errors for CI workflow (#43)

        * Split into debug/release builds
        * Cleanup vcpkg buildtrees

    commit b7135deb61766e52b4583db0800926799084dafe
    Author: Lisa Ong <11318241+lisaong@users.noreply.github.com>
    Date:   Thu Jun 2 22:21:56 2022 +0800

        Update ci.yml

        Cleanup build folder

    commit b59a60435eb72334898f0ca227ad24a15e003498
    Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    Date:   Thu Jun 2 16:20:50 2022 +0800

        Bump urllib3 from 1.25.8 to 1.26.5 in /tools/benchmarkers (#42)

        Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.25.8 to 1.26.5.
        - [Release notes](https://github.com/urllib3/urllib3/releases)
        - [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
        - [Commits](https://github.com/urllib3/urllib3/compare/1.25.8...1.26.5)

        ---
        updated-dependencies:
        - dependency-name: urllib3
          dependency-type: direct:production
        ...

        Signed-off-by: dependabot[bot] <support@github.com>

        Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit cf9868ebfb0346903335d55d745eefdc051d9cbd
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Thu Jun 2 22:00:11 2022 +0000

    Merged PR 2620: Upgrade GPU self-hosted agents to g++-10

    The stock g++-9 from Ubuntu 20.04 crashes when compiling pybind11 alongside mlir/Dialect/IR/Affine/AffineOp.h.

    This change updates to g++-10 for the self-hosted images only, as this issue only affects images that we build for ROCm and CUDA.

    Azure DevOps agents will continue to run on their pre-installed g++-9.

commit efdff3963b07469af391ffc950c6b751d41b81df
Author: Denny Sun <dennys@microsoft.com>
Date:   Thu Jun 2 04:06:57 2022 +0000

    Merged PR 2619: Parameterize Plan.bind

    ```
            P0, P1, P2, P3, P4, P5 = create_parameters()

            plan.bind(mapping={
                P0: P3,
                P1: P4,
                P2: P5
            })

            package.add(
                plan,
                args=(A, B, C),
                parameters={
                    P0: i,
                    P1: j,
                    P2: k,
                    P3: v100.GridUnit.BLOCK_X,
                    P4: v100.GridUnit.THREAD_X,
                    P5: v100.GridUnit.THREAD_Y,
                },
                base_name=test_name)
    ```

    Related work items: #3708

commit 5c0fdfde2512baed8a37bde8c50c4bce649930b3
Author: Mason Remy <masonr@microsoft.com>
Date:   Wed Jun 1 20:17:00 2022 +0000

    Merged PR 2599: Support parameterizing caches based on memory space

    Support parameterizing caches based on memory space

    - Identifies bound indices that the cache should be parameterized on,
      rather than shaped by.
      e.g. for a private memory cache inserted at a gpu block level, the
      computed memory space will not be the full active block at that level,
      but the portion derived from loops that weren't bound to gpu thread
      dims.

    - Adds some BoundProcessorOp utilities and shares some common binding
      code

commit 2c4cceca16084554b4af458525fc68f06503bea1
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Wed Jun 1 08:46:52 2022 +0000

    Merged PR 2618: Fix memory allocation bug during benchmark verification

    Fix memory allocation bug during benchmark verification

commit 2977b3b905109d2405524d7e6eb1583eed9f2d7d
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Wed Jun 1 04:19:38 2022 +0000

    Merged PR 2617: [nfc] [doc] Fix typo and re-sync models table

commit 53dfbdb0e42250cf877da0c0e04f93acedc5caf4
Author: Denny Sun <dennys@microsoft.com>
Date:   Wed Jun 1 03:38:27 2022 +0000

    Merged PR 2616: Formatting Python code a bit for the better readability

    1. Some functions have a long list of parameters, add line wrap
    2. Separate external imports from internal ones

commit 92565fa4495e828efb698078e30a5233d343f287
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Tue May 31 07:41:56 2022 +0000

    Merged PR 2614: Remove redundant variable and cosmosdb fix

    Cosmos DB error when upserting from multiple processes:

    Process runner0:
    Traceback (most recent call last):
      File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
      File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
      File "/azp/_work/2/s/tools/benchmarkers/accera_gemm.py", line 633, in gemm_runner
        cosmosdb.upsert_benchmark_results(resultRows, containerName, verboseLogs)
      File "/azp/_work/2/s/tools/benchmarkers/cosmosdb.py", line 27, in upsert_benchmark_results
        container = get_container(containerName, verboseLogs)
      File "/azp/_work/2/s/tools/benchmarkers/cosmosdb.py", line 18, in get_container
        container = db.create_container_if_not_exists(id=containerName, partition_key=PartitionKey(path='/partitionKey'))
      File "/usr/local/lib/python3.8/dist-packages/azure/core/tracing/decorator.py", line 62, in wrapper_use_tracer
        return func(*args, **kwargs) # type: ignore
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/database.py", line 287, in create_container_if_not_exists
        container_proxy.read(
      File "/usr/local/lib/python3.8/dist-packages/azure/core/tracing/decorator.py", line 62, in wrapper_use_tracer
        return func(*args, **kwargs) # type: ignore
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/container.py", line 145, in read
        self._properties = self.client_connection.ReadContainer(
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 469, in ReadContainer
        return self.Read(path, "colls", collection_id, None, options, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 2162, in Read
        result, self.last_response_headers = self.__Get(path, request_params, headers, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 2209, in __Get
        return synchronized_request.SynchronizedRequest(
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_synchronized_request.py", line 210, in SynchronizedRequest
        return _retry_utility.Execute(
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_retry_utility.py", line 73, in Execute
        result = ExecuteFunction(function, global_endpoint_manager, *args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_retry_utility.py", line 130, in ExecuteFunction
        return function(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_synchronized_request.py", line 158, in _Request
        raise exceptions.CosmosHttpResponseError(message=data, response=response)
    azure.cosmos.exceptions.CosmosHttpResponseError: Status code: 400
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN""http://www.w3.org/TR/html4/strict.dtd">
    <HTML><HEAD><TITLE>Bad Request</TITLE>
    <META HTTP-EQUIV="Content-Type" Content="text/html; charset=us-ascii"></HEAD>
    <BODY><h2>Bad Request</h2>
    <hr><p>HTTP Erro...

commit 34fadcd529161a656a34f67288fb934b42a73524
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Tue May 31 00:37:59 2022 +0000

    Merged PR 2613: Enable daily CUDA benchmarks

    - Enable CUDA benchmarks
    - some refactoring

commit 52d7f77ecea2d437f658be9cf898724ee220e670
Author: Mason Remy <masonr@microsoft.com>
Date:   Mon May 30 05:57:31 2022 +0000

    Merged PR 2596: Updates to affine simplifications

    Updates to affine simplifications

    - Run simplifications on AffineApplyOps
    - Detect and simplify some single-element-numerator cases for floordiv
      and mod
    - Detect GPU constants such as grid dim size and block dim size and
      incorporate those constants into affine maps for later simplification
    - Detect GPU bound dimensions block id and thread id in affine ops and
      incorporate those ranges into simplification passes
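
The value of knowing index ranges can be shown with a small range-based check: when an index is known to lie in `[lo, hi)`, a `floordiv` by a constant may collapse to a single value and a `mod` may become the identity. This sketch only illustrates that reasoning; it is not the MLIR pass, and both function names are hypothetical.

```python
def constant_floordiv(lo, hi, divisor):
    """Return the constant value of x // divisor for all x in [lo, hi),
    or None if the result varies over the range."""
    if divisor <= 0 or lo >= hi:
        raise ValueError("expected a positive divisor and a non-empty range")
    first = lo // divisor
    last = (hi - 1) // divisor
    return first if first == last else None


def mod_is_identity(lo, hi, divisor):
    """True when x % divisor == x for every x in [lo, hi)."""
    return 0 <= lo and hi <= divisor


# A thread id known to be in [0, 64), divided by a block dimension of 64.
floordiv_result = constant_floordiv(0, 64, 64)
mod_result = mod_is_identity(0, 64, 64)
```

With the thread id bounded by the block dimension, `thread_id floordiv 64` folds to `0` and `thread_id mod 64` folds to `thread_id`.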

    Related work items: #3667

commit 2e1b837854692a64df920b7b3ef44d9d5a5ca3fa
Author: Mason Remy <masonr@microsoft.com>
Date:   Fri May 27 22:02:47 2022 +0000

    Merged PR 2594: Always resolve unrealized loopnest indices when computing cache positions

    Always resolve unrealized loopnest indices when
    computing cache positions

commit 53edefaddd4de4f2eba90033fda3e9b61d2c3c95
Author: Mason Remy <masonr@microsoft.com>
Date:   Fri May 27 21:14:50 2022 +0000

    Merged PR 2574: Support binding multiple indices to a processor handle

    Support binding multiple indices to a processor handle

    - This creates a mapping of the processor handle to the index iterations
      based on the ordering of the indices in the tuple
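
A mapping from one processor handle to several indices can be pictured as unflattening the handle's value into one coordinate per bound index. The sketch below assumes the first index in the tuple varies slowest; the commit message only states that the mapping follows the tuple order, so that direction is an assumption, and `unflatten` is a hypothetical helper.

```python
def unflatten(handle_value, extents):
    """Split a flat processor id into one value per bound index.

    extents lists the iteration count of each index, in tuple order.
    Assumption: the first index is the most significant (slowest varying).
    """
    values = []
    for extent in reversed(extents):
        values.append(handle_value % extent)
        handle_value //= extent
    return tuple(reversed(values))


# Indices (i, j) with 4 and 8 iterations bound to one handle of 32 threads.
i_val, j_val = unflatten(13, (4, 8))
```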

commit 6cf6faf95f72c7df3bf6acf39a29bb2591275781
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Fri May 27 19:55:20 2022 +0000

    Merged PR 2611: Fix issue when splitting indices by factors that don't divide evenly

    This PR fixes an issue when splitting by a factor that doesn't evenly divide the parent index's range. E.g., if `i` has a range of `[0, 320)`, then `ii = split(i, 128)` would end up with `ii` having a range of `192` instead of `128`.
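
The expected behavior can be checked with a small calculation: the inner index never covers more than the split factor, and only the last tile is shorter. This is a sketch of the arithmetic, not the DSL's code; `split_extents` is a hypothetical helper.

```python
def split_extents(size, factor):
    """Return (start, extent) of the inner index for each outer iteration."""
    return [
        (start, min(factor, size - start))
        for start in range(0, size, factor)
    ]


# i has range [0, 320) and is split by 128.
extents = split_extents(320, 128)
largest_inner_range = max(extent for _, extent in extents)
```

For this example the tiles are 128, 128 and 64 elements long, so the inner range is 128 with a 64-element remainder tile, not 192.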

commit 8c1a2da2e83e92fa07dd45c6d734cca1ab678f74
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Fri May 27 16:09:39 2022 +0000

    Merged PR 2612: Add missing psutil dependency

    - Add missing psutil dependency
    - Remove private branch from benchmarks

commit 3e4a765acb1ab2905e57e6a37ee9e2448faaa870
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Fri May 27 07:17:14 2022 +0000

    Merged PR 2608: Caching fixes and benchmarking optimizations

    - Explore k_split independently of outer tile dims, allows for arbitrary k splits
    - Fix for workPerThread < 1 (from Mason). This was exposed because the benchmark now explores k-splits of size 1, 2, 4, etc., which produces small active blocks for caching; when the work per thread drops below 1, the compiler crashes during package.build.

commit 11810900f837a261d97bb7589620a8fe7e82b70a
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Fri May 27 01:29:45 2022 +0000

    Merged PR 2610: Opportunistically add more targets used in CI machines and update Model.md

    * Renamed some fields to add units
    * Added some Intel Xeon models as we encounter them
    * Updated some cache sizes

commit e6d31889676e69d9f884da9e543b295b1bfc7aed
Author: Denny Sun <dennys@microsoft.com>
Date:   Thu May 26 19:47:02 2022 +0000

    Merged PR 2606: Parameterize Array.sub_array

    ```
    P0, P1 = create_parameters()
    arr = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(256, 256))
    arr0 = arr.sub_array(offsets=(0, 0), shape=(P0, P1))
    package.add(nest, args=(arr0, ), parameters={P0: 128, P1: 128})
    ```

    Related work items: #3707

commit cd39c8d4ead0392c7be87a059a2df70e6cf373a4
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Thu May 26 09:13:30 2022 +0000

    Merged PR 2609: [build] peg protobuf to 3.20.1 due to incompatibilities with latest version

    Even though we peg to onnx==1.9.0, onnx requires protobuf >= 3.20.1 which pulls an incompatible version of protobuf (4.x).

commit ee0ad728c60353a3e8562d428dc1f6d9355cffa2
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Wed May 25 08:08:31 2022 +0000

    Merged PR 2576: [doc] MFMA thread assignment visualizations for AMD

    Some helper visualizations for MFMA:
    * 2x2x16
    * 4x4x32

commit 90edd8ad70675bfb10bf448e0919f18a95b8453f
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Wed May 25 07:12:07 2022 +0000

    Merged PR 2601: [ci] CUDA pipeline and buddy build

    * Container for CUDA self-hosted Azure devops agent
    * Initial buddy build pipeline (similar to ROCm)
    * Replaces references to Dockerhub with Azure Container Registry for compliance purposes

commit 0c2d6416bb1bf96e184b43289c55cf01522fa05b
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Wed May 25 06:25:53 2022 +0000

    Merged PR 2603: Add CUDA pipeline host to known targets

    Note that the reported CPU frequencies conflict; I went with cpuinfo and dmesg.

    References:

    ```
    > python -m cpuinfo

    Python Version: 3.8.10.final.0 (64 bit)
    Cpuinfo Version: 8.0.0
    Vendor ID Raw: AuthenticAMD
    Hardware Raw:
    Brand Raw: AMD EPYC 7V12 64-Core Processor
    Hz Advertised Friendly: 3.3049 GHz
    Hz Actual Friendly: 3.3049 GHz
    Hz Advertised: (3304919000, 0)
    Hz Actual: (3304919000, 0)
    Arch: X86_64
    Bits: 64
    Count: 128
    Arch String Raw: x86_64
    L1 Data Cache Size: 2 MiB
    L1 Instruction Cache Size: 2 MiB
    L2 Cache Size: 32 MiB
    L2 Cache Line Size: 512
    L2 Cache Associativity: 6
    L3 Cache Size: 524288
    Stepping:
    Model: 49
    Family: 23
    Processor Type:
    Flags: 3dnowext, 3dnowprefetch, abm, adx, aes, aperfmperf, apic, arat, avic, avx, avx2, bmi1, bmi2, bpext, cat_l3, cdp_l3, clflush, clflushopt, clwb, clzero, cmov, cmp_legacy, constant_tsc, cpb, cpuid, cqm, cqm_llc, cqm_mbm_local, cqm_mbm_total, cqm_occup_llc, cr8_legacy, cx16, cx8, dbx, de, decodeassists, extapic, extd_apicid, f16c, flushbyasid, fma, fpu, fsgsbase, fxsr, fxsr_opt, ht, hw_pstate, ibpb, ibrs, ibs, irperf, lahf_lm, lbrv, lm, mba, mca, mce, misalignsse, mmx, mmxext, monitor, movbe, msr, mtrr, mwaitx, nonstop_tsc, nopl, npt, nrip_save, nx, osvw, osxsave, overflow_recov, pae, pat, pausefilter, pci_l2i, pclmulqdq, pdpe1gb, perfctr_core, perfctr_llc, perfctr_nb, pfthreshold, pge, pni, popcnt, pqe, pqm, pse, pse36, rdpid, rdrand, rdrnd, rdseed, rdt_a, rdtscp, rep_good, sep, sev, sha, sha_ni, skinit, smap, smca, sme, smep, ssbd, sse, sse2, sse4_1, sse4_2, sse4a, ssse3, stibp, succor, svm, svm_lock, syscall, tce, topoext, tsc, tsc_scale, umip, v_vmsave_vmload, vgif, vmcb_clean, vme, vmmcall, wbnoinvd, wdt, xgetbv1, xsave, xsavec, xsaveerptr, xsaveopt, xsaves
    ```

    ```
    > dmesg | grep MHz
    [    0.000000] tsc: Detected 2450.083 MHz processor
    [    7.731766] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
    [    8.979712] tsc: Refined TSC clocksource calibration: 2449.961 MHz
    ```

    ```
    > lscpu

    Architecture:                    x86_64
    CPU op-mode(s):                  32-bit, 64-bit
    Byte Order:                      Little Endian
    Address sizes:                   43 bits physical, 48 bits virtual
    CPU(s):                          128
    On-line CPU(s) list:             0-127
    Thread(s) per core:              2
    Core(s) per socket:              64
    Socket(s):                       1
    NUMA node(s):                    1
    Vendor ID:                       AuthenticAMD
    CPU family:                      23
    Model:                           49
    Model name:                      AMD EPYC 7V12 64-Core Processor
    Stepping:                        0
    Frequency boost:                 enabled
    CPU MHz:                         1497.558
    CPU max MHz:                     2450.0000
    CPU min MHz:                     1500.0000
    BogoMIPS:                        4900.16
    Virtualization:                  AMD-V
    L1d cache:                       2 MiB
    L1i cache:                       2 MiB
    L2 cache:                        32 MiB
    L3 cache:                        2...
    ```

commit d97008bfa45e1acbc2a37efee640bd15c3dfb8b4
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Wed May 25 05:47:07 2022 +0000

    Merged PR 2602: Add rocwmma plumbing in tensorize

    - Add rocwmma plumbing in tensorize
    - This flag cannot be used until the ROCm 5.2 release, which natively supports rocWMMA.

    Related work items: #3672

commit 426d382e6f2441fe40b29748895a1eba2cb2677d
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Wed May 25 03:52:48 2022 +0000

    Merged PR 2570: Enhancements to the gpu benchmark tool

    - Add multiprocess package builders and runners
    - Support for running on different GPU devices
    - Add clock speed determinism
    - Add composable_kernel benchmarks
    - Add cutlass benchmarks
    - Add cublas and rocblas benchmarks
    - Add Cosmos DB result upload capability

    Related work items: #3683, #3700, #3705, #3685

commit 4454f6455b1324b1a97706604f9051cc9076f97b
Author: Mason Remy <masonr@microsoft.com>
Date:   Wed May 25 01:23:34 2022 +0000

    Merged PR 2598: Fix mfma enum name typo

    Fix mfma enum name typo

commit aaeb8d1883ce24873fd614efa9cc835383576f64
Author: Kern Handa <kerha@microsoft.com>
Date:   Tue May 24 07:29:17 2022 +0000

    Merged PR 2595: [nfc] Renames smoke_test.py -> smoke_tests.py

    [nfc] Renames smoke_test.py -> smoke_tests.py

commit fa7e10bcfd7f7424c976a7decaaaa1569d818525
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue May 24 04:51:43 2022 +0000

    Merged PR 2593: [docs] [release] bump docs version to 1.2.5 in preparation for release

    bump docs version to 1.2.5 in preparation for release

commit 33aaf54a3018ab0904e722f965f441df3c9c9f9f
Author: Denny Sun <dennys@microsoft.com>
Date:   Mon May 23 23:05:17 2022 +0000

    Merged PR 2586: Loop order and indices as parameters

    With this change, the user can write a schedule with loop_order parameterized:

        loop_order = create_parameters()
        schedule.reorder(order=loop_order)

        parameter_grid = {
            loop_order: (j, k, i, ii, jj, kk)
        }

        parameters = create_parameter_grid(parameter_grid,
                                           filter_func=lambda *p: schedule.is_valid_loop_order(p[0][0]),
                                           sample=5)

        # Add another function to the package
        package.add(
            plan,
            args=(A, B, C),
            parameters=parameters,
            base_name="matmul_256_256_256"
        )

    Related work items: #3693
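
The filtering step in this commit can be sketched without Accera. Here `is_valid_order` is a hypothetical stand-in for `schedule.is_valid_loop_order`, and the rule it checks (an index produced by a split must come after its parent) is an assumption made for illustration, not a statement of the library's exact validity rules.

```python
import itertools
import random

# Hypothetical parent relationships: ii was split from i, and so on.
PARENT = {"ii": "i", "jj": "j", "kk": "k"}


def is_valid_order(order):
    """Accept an order only if every split index follows its parent."""
    position = {name: n for n, name in enumerate(order)}
    return all(position[parent] < position[child]
               for child, parent in PARENT.items())


indices = ("i", "j", "k", "ii", "jj", "kk")
valid_orders = [p for p in itertools.permutations(indices) if is_valid_order(p)]

# Draw a small sample of candidates, mirroring sample=5 in the commit message.
sampled = random.Random(0).sample(valid_orders, 5)
```

Of the 720 permutations of six indices, this assumed rule keeps 90, and the order `(j, k, i, ii, jj, kk)` from the commit message is one of them.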

commit efe3934c4bf4619999995322da5bea00d162e1f4
Author: Kern Handa <kerha@microsoft.com>
Date:   Fri May 20 19:49:29 2022 +0000

    Merged PR 2591: Fixes more warnings. Enables STRICT_MODE for Linux PR CI

commit 6a882080f7b19e4cc7e4f10cd3b82956fb758e8f
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Fri May 20 11:35:29 2022 +0000

    Merged PR 2588: [test] Trim out redundant tests from ROCm pipeline

    Because the ROCm pipeline currently runs on a single agent, avoid running CPU tests that already run in other pipelines, to speed up pipeline execution.

commit d9d11c44ad2d576b1c42ad09a5b267d1c8694994
Author: Kern Handa <kerha@microsoft.com>
Date:   Fri May 20 10:06:50 2022 +0000

    Merged PR 2590: [nfc] Fixes a bunch of warnings in C++ layer

    [nfc] Fixes a bunch of warnings in C++ layer

commit 1440a549ce2510c99f920f29c1cff15ebd159b55
Author: Kern Handa <kerha@microsoft.com>
Date:   Fri May 20 09:20:18 2022 +0000

    Merged PR 2589: [test] Adds DSL tests for Schedule.pad

    Adds DSL tests for Schedule.pad

commit adbf2ecd95d1389905e742e522702ef6fe66b615
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Fri May 20 01:31:22 2022 +0000

    Merged PR 2587: Sync Github to ADO

    commit b934ad05f6b8cd84420226b93f57b8ac3229eadc

    Author: Lisa Ong <11318241+lisaong@users.noreply.github.com>
    Date:   Thu May 12 08:44:15 2022 +0800

        Update CONTRIBUTING.md

    commit f9f967cbbf36d8fe85b3078ae2e55c64501ac839
    Author: Marina Neseem <marinahesham21@gmail.com>
    Date:   Wed May 11 20:40:07 2022 -0400

        Add link to the NCHWc 2D Convolution Case Study (#41)

        * Add link to the NCHWC 2D Convolution Case Study

        * Update README.md

commit ea3b02fa3736c13b65f1eee38cee3035c9830b3b
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Thu May 19 20:16:27 2022 +0000

    Merged PR 2585: Use conditional instead of loop-unswitching on GPU

    This PR changes how boundary conditions are handled on GPU-bound loop indices. If a loop's increment doesn't evenly divide its bounds, the body is guarded by a conditional instead of unswitching that loop.

    Related work items: #3703
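
The two boundary-handling strategies named in this commit can be compared in plain Python. This is an illustration of the control flow only, not the emitted GPU code; `guarded` and `unswitched` are hypothetical names.

```python
def guarded(extent, step, body):
    """Every block runs the full inner loop; a conditional skips out-of-bounds work."""
    for start in range(0, extent, step):
        for offset in range(step):
            if start + offset < extent:
                body(start + offset)


def unswitched(extent, step, body):
    """Full blocks run without a check; the remainder gets its own loop."""
    full = (extent // step) * step
    for start in range(0, full, step):
        for offset in range(step):
            body(start + offset)
    for index in range(full, extent):
        body(index)


visited_guarded, visited_unswitched = [], []
guarded(10, 4, visited_guarded.append)
unswitched(10, 4, visited_unswitched.append)
```

Both forms visit the same iterations; the guarded form keeps a single loop body for every block at the cost of one comparison per iteration.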

commit f34beb256d542e2f30b72c36b92bd02d96a7dba8
Author: Denny Sun <dennys@microsoft.com>
Date:   Wed May 18 20:12:00 2022 +0000

    Merged PR 2571: Add random seed to enable reproducible sampling

    Giving users control over sampling strategies.
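
A minimal sketch of why a seed makes sampling reproducible, using only the Python standard library. `sample_parameter_points` and the grid values are hypothetical and do not reflect the Accera API.

```python
import random


def sample_parameter_points(points, sample, seed=None):
    """Draw `sample` points; the same seed always yields the same selection."""
    rng = random.Random(seed)
    return rng.sample(points, sample)


grid = [(m, n) for m in (16, 32, 64) for n in (16, 32, 64)]
first = sample_parameter_points(grid, 3, seed=42)
second = sample_parameter_points(grid, 3, seed=42)
```

Using a dedicated `random.Random` instance keeps the selection independent of any other use of the global random state.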

commit 7ece029087bcb0458e3c597b8ed1c3340b4307c1
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Wed May 18 03:19:13 2022 +0000

    Merged PR 2581: Add CUDA tensor core support

    - Added CUDA tensor ops (no caching)
    - Added validation tests
    - Changed MMA enum names
    - A snippet of the generated tensor op code in CUDA:
    ```
    ...
    vhalf *var11 = (vhalf*)arg2;
    wmma::fragment<wmma::accumulator, 16, 16, 16, vhalf> mmaMatrix_12;
    wmma::load_matrix_sync(mmaMatrix_12, var11 + var9 * 16 + var10, 16, wmma::layout_t::mem_row_major);
    vhalf *var13 = (vhalf*)arg0;
    wmma::fragment<wmma::matrix_a, 16, 16, 16, vhalf, wmma::row_major> mmaMatrix_14;
    wmma::load_matrix_sync(mmaMatrix_14, var13 + var9 * 16 + 0, 16);
    vhalf *var15 = (vhalf*)arg1;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, vhalf, wmma::row_major> mmaMatrix_16;
    wmma::load_matrix_sync(mmaMatrix_16, var15 + 0 * 16 + var10, 16);
    wmma::fragment<wmma::accumulator, 16, 16, 16, vhalf> mmaMatrix_17;
    wmma::mma_sync(mmaMatrix_17, mmaMatrix_14, mmaMatrix_16, mmaMatrix_12);
    wmma::store_matrix_sync(var11 + var9 * 16 + var10, mmaMatrix_17, 16, wmma::layout_t::mem_row_major);
    ```

    Related work items: #3694

commit 92e02a82046a1d8547b4ed4cae027cebabf663ff
Author: Kern Handa <kerha@microsoft.com>
Date:   Tue May 17 20:42:38 2022 +0000

    Merged PR 2584: Adds cublas_gemm benchmarking tool

    Adds cublas_gemm benchmarking tool

commit 35ef308ded44d277a62ba3861b5aadcef8c327c7
Author: Mason Remy <masonr@microsoft.com>
Date:   Mon May 16 20:08:47 2022 +0000

    Merged PR 2583: Don't hold ResolveWarpSize results with rvalue

    Don't hold ResolveWarpSize results with rvalue

    gcc appears to be inlining ResolveWarpSize incorrectly in some cases and
    not holding the result with an rvalue pair appears to fix it.

    This was resulting in some mod 0's and floordiv 0's when we would expect
    the warp size constants to either be 32 or 64 exactly.

commit bece463adb08fd82a0698e571afe1e9cf850c082
Author: Kern Handa <kerha@microsoft.com>
Date:   Fri May 13 23:44:58 2022 +0000

    Merged PR 2580: Fixes rocblas_gemm's fp32 -> fp16 conversion

commit 3023409e3d229811e5c3e1bfb1c522684cbdf090
Author: Kern Handa <kerha@microsoft.com>
Date:   Thu May 12 09:22:08 2022 +0000

    Merged PR 2579: Improves accera_gemm.py's handling of unsupported configs

    Improves accera_gemm.py's handling of unsupported configs

commit 279b916ad2c38f069170ed457df3a2f41c7b4afd
Author: Kern Handa <kerha@microsoft.com>
Date:   Thu May 12 07:46:35 2022 +0000

    Merged PR 2578: Fixes time unit conversions in accera_gemm.py

    Also addresses comments for the previous rocblas_gemm PR

commit 2e34b46db3bb524085dddc91250d6f95ec04ec02
Author: Kern Handa <kerha@microsoft.com>
Date:   Wed May 11 22:17:22 2022 +0000

    Merged PR 2577: Fixes accera_gemm.py code after Plan.tensorize API change

    Fixes accera_gemm.py code after Plan.tensorize API change

commit 754c2125c7a2a254552bd27e8030c4deab18e8ae
Author: Kern Handa <kerha@microsoft.com>
Date:   Wed May 11 17:27:09 2022 +0000

    Merged PR 2575: Adds library warmup to rocblas_gemm benchmarker

    Adds library warmup to rocblas_gemm benchmarker

commit 01fed5a32d19c2ca2cf212cb73ff7244a1bcc94f
Author: Kern Handa <kerha@microsoft.com>
Date:   Tue May 10 12:50:59 2022 +0000

    Merged PR 2572: [nfc] Move accera/viz -> tools/viz

    [nfc] Move accera/viz -> tools/viz

commit 1eaf20d24fab93e08709c4639acfd2aaa9a7f072
Author: Mason Remy <masonr@microsoft.com>
Date:   Tue May 10 09:09:58 2022 +0000

    Merged PR 2573: Update setup.cfg hatlib dependency version

    Update setup.cfg hatlib dependency version

commit e946c9f8c6b6bbd5263c48340231d51137fdd1fd
Author: Kern Handa <kerha@microsoft.com>
Date:   Tue May 10 07:39:40 2022 +0000

    Merged PR 2557: Overhauls the benchmarking tool

    This change moves the benchmarking tool to a top-level `tools/benchmarkers` directory. The tool has also been split up so that the accera portion is in its own file, while the driver portion of the tool remains intact and has gained the ability to run a rocblas gemm benchmarking utility.

    The aforementioned rocblas gemm benchmarking utility is also added in this change. `rocblas_gemm` is a new executable that is not built by default since it relies on the rocblas library, which may not be available everywhere. Once this tool has been explicitly built, it can be passed in as an argument to the benchmarker tool, which will use it to generate a comparison between accera's benchmark results and rocblas's.

    An example:
    ```sh
    <build accera like usual>
    ninja -C `git rev-parse --show-toplevel`/build/temp.linux-x86_64-3.8 rocblas_gemm
    cd tools/benchmarkers
    mkdir ~/accera_benchmarks
    ./gpu_benchmark_tool.py -i sgemm_bert_assorted.csv -t 'AMD MI100' -o ~/accera_benchmarks/results -r `git rev-parse --show-toplevel`/build/temp.linux-x86_64-3.8/tools/benchmarkers/rocblas/rocblas_gemm
    ```

    Related work items: #3685

commit 6a41aa27c07bdacd1475bff0f830bfd4c6fd514b
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Tue May 10 03:32:50 2022 +0000

    Merged PR 2569: Make tensorization passes configurable, remove dependency from split indices

    - Make the mfma type a required parameter for tensorize(); this only chooses the underlying mfma op to use
    - Additionally, the user can pass in the total number of passes to run (defaults to 1) instead of having it derived implicitly by calculating a square tile.
    - Added documentation for the new enum type.
    - Added some tests
    - Current code does not work with K > M (still investigating this, but should not block this PR)

    Related work items: #3688

commit 9897093985ed70d9f62a4c980f03a37c94ae46d6
Author: Mason Remy <masonr@microsoft.com>
Date:   Tue May 10 02:32:45 2022 +0000

    Merged PR 2567: Fix vectorized access of LAST_MAJOR arrays

    Fix vectorized access of LAST_MAJOR arrays

    - mlir::vector::LoadOp and mlir::vector::StoreOp only support unit
      strides on the minor dimension of the memref they access, so
      reinterpretcast the memref to a flat buffer to pass that check
    - add translation for reinterpretcastop
    - improve vectorization of LAST_MAJOR matrices in cache accesses by
      changing the traversal order of the cache region (when
      filling/reducing) based on the memory ordering of the outer array
      being acted on.
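
The traversal-order change in the last bullet can be illustrated with flat offsets. This sketch assumes that, for a 2D array, FIRST_MAJOR corresponds to row-major and LAST_MAJOR to column-major storage; the function names are hypothetical and the code is not the compiler's implementation.

```python
def flat_offset(row, col, rows, cols, layout):
    """Offset of element (row, col) in a flat buffer for the assumed layouts."""
    if layout == "FIRST_MAJOR":  # assumed row-major
        return row * cols + col
    if layout == "LAST_MAJOR":   # assumed column-major
        return col * rows + row
    raise ValueError(f"unknown layout: {layout}")


def traversal(rows, cols, layout):
    """Visit elements so that consecutive visits touch consecutive offsets."""
    if layout == "FIRST_MAJOR":
        return [(r, c) for r in range(rows) for c in range(cols)]
    return [(r, c) for c in range(cols) for r in range(rows)]


rows, cols = 3, 4
offsets = [flat_offset(r, c, rows, cols, "LAST_MAJOR")
           for r, c in traversal(rows, cols, "LAST_MAJOR")]
```

With the traversal matched to the layout, the offsets come out in unit-stride order, which is the property vector loads and stores need.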

commit 79d169d7ab8acaf989e19b6fb13cf960ed5f6260
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Mon May 9 22:50:55 2022 +0000

    Merged PR 2568: [Compliance] [nfc] Switch to Azure Container Registry for ROCm build agent

commit 43d0883d6b353533dbe754092d4b34435d71fb2f
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Fri May 6 07:12:21 2022 +0000

    Merged PR 2560: Make register allocation during tensorization tunable

    - Add controllable number of fused mfma passes
    - Add controllable scheduling policy of mfma ops
    - Add tests

    Related work items: #3687

commit 056e108c3e31a5fc1a2529690062c695a3613005
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Fri May 6 05:11:23 2022 +0000

    Merged PR 2565: [build] bump hatlib dependency to 0.0.13

    hatlib 0.0.13 contains a fix to unblock ROCm buddy builds

commit 5c3050fa5dcc64af16aa46fe882c25795a4ec9ac
Author: Denny Sun <dennys@microsoft.com>
Date:   Thu May 5 06:08:05 2022 +0000

    Merged PR 2563: Add a table of operators and code examples to the Parameters.md

    Update the Manuals with the supported operators and code examples.

commit 5417fa787349a866adda36f3d7d8531c83e42699
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Thu May 5 01:43:25 2022 +0000

    Merged PR 2562: [nfc] Add some macOS targets and synced Model.md

    * Re-generated Model.md to add missing models
    * Handle zero (unknown) vector_bytes cases in tests
    * Opportunistically added these models used during development:
      * 2016 macbook pro
      * M1 max

commit 9672086febe0523785988a47922e44692ae18a00
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Thu May 5 00:09:18 2022 +0000

    Merged PR 2561: [docs][nfc] Sync changes from Github remote, bump doc versions to 1.2.4

commit 7b67f68344fe3ecedf9e5a84a4ac9667d7eaea96
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Wed May 4 18:26:56 2022 +0000

    Merged PR 2558: [nfc] update requirements to latest version of six

    Fixes this warning:

    ```
    <frozen importlib._bootstrap>:914: ImportWarning: _SixMetaPathImporter.find_spec() not found; falling back to find_module()
    ```

commit 6d3f21a5136c9122feee71bb9990f85970e3db98
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Tue May 3 19:03:44 2022 +0000

    Merged PR 2559: Finer-granularity error reporting for python tests

    This PR modifies how the python tests are invoked, so that they can report pass/fail results per test. Hopefully that'll make it easier to pinpoint where things are failing during CI builds.

commit b77a99ea7cce3a82a3b36272458895f47773b666
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Sat Apr 30 01:57:09 2022 +0000

    Merged PR 2556: [non-functional] Change ROCM code to generate gcn intrinsics when possible

    - Use amd gcn intrinsics when possible (threadIdx, blockIdx, barrier)
    - Add helpers which automatically check for runtime before emitting the proper code

    Related work items: #3698

commit ca0a6fe390900c46ea345ac03662c44827827ca5
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Fri Apr 29 05:43:55 2022 +0000

    Merged PR 2547: [non-functional] Change custom mfma types to Memref and some refactoring

    Make initial changes to remove custom mfma types

    Related work items: #3691

commit b8b9631601eb5eab88af035eca1e5ecdf27741ba
Author: Denny Sun <dennys@microsoft.com>
Date:   Thu Apr 28 01:11:54 2022 +0000

    Merged PR 2555: create_parameters(count: int) no longer needs count as an argument

    1. Remove the count of parameters to be created from the DSL
    2. Throw an exception when users write the following code:
           create_parameters()
    3. The correct way of calling create_parameters() is:
           p1, p2, p3, ..., pN = create_parameters()

commit 68eb52ad7cea75671916c49a13617d40f6488471
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Wed Apr 27 18:23:57 2022 +0000

    Merged PR 2554: [doc] Updated some missing enums and fixed Case Study path

commit 6654c774fb0d2d6fac760b911a547b4e66b23127
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Wed Apr 27 00:44:53 2022 +0000

    Merged PR 2522: Generalize array indexing in tensorized GEMM

    This PR generalizes the MFMA tensorization pass to improve the handling of code in the innermost loop. It recognizes more ways of writing the GEMM kernel, and rejects many ill-formed GEMM kernels.

    There are also a number of tests.

    This PR doesn't yet generalize to batch-GEMM, where the matrices (typically) have 3 indices.

    Related work items: #3676

commit 4d030709101f3653712b805bd8f3698e0e293bd3
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Apr 26 17:50:18 2022 +0000

    Merged PR 2551: [nfc][ci] Switch hosted pipelines to 1ES hosted pool

    * The Linux1ESPool is created to support internal builds of LLVM

    * Fix regression in pipeline due to overzealous .dockerignore

commit 9b9d6b4b77c46b12788665412b9d0d1c2ff62d18
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Apr 26 10:43:28 2022 +0000

    Merged PR 2550: [nfc] [docs] Merge changes from GitHub remote

    In preparation for merge from ADO to GitHub for Case Studies publishing

commit c1298946d18fb785788c556ea2959b9438f9c6b7
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Apr 26 08:10:47 2022 +0000

    Merged PR 2549: [Compliance] Switching from Dockerhub to ACR for third party containers

    Updating Dockerfile references

commit 0c7a3610ba082e82e554297bdadbf9579b094745
Author: Denny Sun <dennys@microsoft.com>
Date:   Tue Apr 26 04:40:05 2022 +0000

    Merged PR 2548: Add README file for case studies

    README file has a table where each case study points to the external repo link.

commit edbc50edd00efe8f12a675735d7e52371e43f7b1
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Mon Apr 25 23:49:15 2022 +0000

    Merged PR 2546: [dev] [nfc] Natively support macOS/arm64 for development

    Limited to local development scenarios (LLVM_SETUP_VARIANT=Default)

    No plans to release pip packages until there is CI support

    Verified on: Big Sur (MacOSX 12.3 arm64) / Python 3.10

commit 166e333a3d10b77c804dc3edc1c71bfc5716c768
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Mon Apr 25 17:50:22 2022 +0000

    Merged PR 2543: Add precomputed offset map optimization for tensorization (no caching)

    - Add flag to tensorize() to enable optimization (off by default)
    - Optimization only affects load/store of accumulator (C) argument
    - Supports all 4 mfma shapes

    Related work items: #3671

commit e11c4d4e87bbae87f7cb9035eff8e6af650c9d1a
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Sun Apr 24 01:00:41 2022 +0000

    Merged PR 2542: An assortment of minor fixes

    This PR is a hodgepodge of tiny fixes. I'm happy to split it up into separate PRs if a kitchen-sink PR is too gross.

    The specific things are:
    - Add 2 new target models to `Targets.py` (that correspond to my local dev boxes)
    - Change the snapshot IR format for sub-passes to use the same format as the top-level passes (that is, not "generic" format)
    - Print a warning message if `check_correctness` skips a correctness check because no hat file was generated
    - Add a "minimum version" constraint to `requirements.txt` for `hatlib`

commit 8da7903ac9b6d8612711593308e49a7a3e82678d
Author: Kern Handa <kerha@microsoft.com>
Date:   Sat Apr 23 23:59:53 2022 +0000

    Merged PR 2545: Unifies CUDA and CPP enum values to SOURCE for Package.Format

    Unifies CUDA and CPP enum values to SOURCE for Package.Format

    Related work items: #3679

commit fe2c40fa8f1c28dcf47e1533223457fd3e6bf195
Author: Kern Handa <kerha@microsoft.com>
Date:   Sat Apr 23 23:17:43 2022 +0000

    Merged PR 2544: [nfc] Removes now unnecessary ldebug output

    [nfc] Removes now unnecessary ldebug output

commit 32090d786ce13299bb77a6675c3478b3d7cdf48c
Author: Mason Remy <masonr@microsoft.com>
Date:   Fri Apr 22 21:31:01 2022 +0000

    Merged PR 2527: Enable vectorized shared memory write

    Enable vectorized shared memory write

    - This adds mod simplification support needed for vectorizing shared
      memory writes
    - Also refactors some of the affine simplification code slightly to
      share some common code between the floordiv and mod simplifications

    Related work items: #3586, #3661, #3689

commit 0eb698af118b94bf3f4d4862a142c86055f8b7bb
Author: Mason Remy <masonr@microsoft.com>
Date:   Fri Apr 22 19:13:27 2022 +0000

    Merged PR 2526: Enable GPU global read vectorization

    Enable GPU global read vectorization

    - Implements a floor div simplification that enables better recognition
      of vectorizable loads and stores

    Related work items: #3661, #3690

commit df849f066ff6c2c82c796d9b48e3bea6390c7877
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Fri Apr 22 06:03:27 2022 +0000

    Merged PR 2541: Fix a few issues with GEMM benchmarking script

    This PR fixes a couple of errors:
    - there was a bug in the GEMM kernel
    - sometimes hatlib would fail to return a compiled function, but not throw an exception. These are now flagged as "uncompilable"

    It makes a couple of other tweaks:
    - it fails if the `alpha` and `beta` parameters aren't `1.0` and `0.0`
    - it culls some variants with known-uncompilable tensorization parameters before trying to compile them

commit 339253767ae4bb4f7e5c323f77fc938ba1a4ab92
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Fri Apr 22 01:26:53 2022 +0000

    Merged PR 2538: Fix std::pair unpacking issue in TensorizeAffineForOpConversion

    In debug builds, we are getting garbage values for warpSizeX and warpSizeY, resulting in division by 0 errors in the emitted .cu files

commit 075c83247d34bfd9fb291e4ea6b9df059a94993a
Author: Denny Sun <dennys@microsoft.com>
Date:   Fri Apr 22 00:26:56 2022 +0000

    Merged PR 2536: Parameter supports most of the arithmetic/binary/unary operations defined in operator lib

    Parameter supports the basic arithmetic operations (+, -, *, //, %). For example, the user can write the following code:

    fma_unit_count, vector_size = acc.create_parameters(2)
    jjj = schedule.split(jj, fma_unit_count * vector_size)
    jjjj = schedule.split(jjj, vector_size)

    Related work items: #3692
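
One way arithmetic on not-yet-bound parameters can work is to defer evaluation until values are supplied. The `Param` class below is a hypothetical stand-in written for illustration; it is not Accera's parameter class and makes no claim about how the library implements this.

```python
import operator


class Param:
    """Hypothetical delayed parameter: arithmetic builds a deferred computation."""

    def __init__(self, evaluate):
        self._evaluate = evaluate  # callable: dict of bindings -> value

    @classmethod
    def named(cls, name):
        return cls(lambda bindings: bindings[name])

    def _binary(self, other, op):
        rhs = other if isinstance(other, Param) else Param(lambda _: other)
        return Param(lambda b: op(self._evaluate(b), rhs._evaluate(b)))

    def __add__(self, other):
        return self._binary(other, operator.add)

    def __sub__(self, other):
        return self._binary(other, operator.sub)

    def __mul__(self, other):
        return self._binary(other, operator.mul)

    def __floordiv__(self, other):
        return self._binary(other, operator.floordiv)

    def __mod__(self, other):
        return self._binary(other, operator.mod)

    def resolve(self, bindings):
        """Evaluate the deferred expression with concrete values."""
        return self._evaluate(bindings)


fma_unit_count = Param.named("fma_unit_count")
vector_size = Param.named("vector_size")
split_size = fma_unit_count * vector_size
resolved = split_size.resolve({"fma_unit_count": 4, "vector_size": 8})
```

The expression `fma_unit_count * vector_size` stays symbolic until `resolve` is called, which is what lets one schedule definition be instantiated with many parameter values.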

commit 6d5e71899c6fb606e32ec46ee871ae1af25d3cd6
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Thu Apr 21 18:22:12 2022 +0000

    Merged PR 2539: [nfc][docs] Merging commits from Github/main

    commit ee28126a338d905eb5931038d3c5daba6ead3811
    Author: Lisa Ong <11318241+lisaong@users.noreply.github.com>
    Date:   Wed Apr 20 21:35:20 2022 +0800

        Update arrow label positions (#35)

        * [nfc] [doc] Update arrow label positions

        * make arrowhead more visible

        * nfc

    commit ddcecaaffd9dd0861999a6d29443dc7c37d79665
    Author: Lisa Ong <11318241+lisaong@users.noreply.github.com>
    Date:   Wed Apr 20 21:34:40 2022 +0800

        demo fixes for hatlib 0.0.11 (#36)

commit 9531a2eb4bb9edf9484d09a20a7b2fd74b73720c
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Thu Apr 21 09:55:57 2022 +0000

    Merged PR 2535: [ci] Self-hosted Azure DevOps build agent for ROCm smoke tests

    * Docker image for self-hosted build agent on the ROCm development machine
    * Pipeline will front-load the Python ROCm tests so that we fail faster
    * The agent runs ROCm 5.1.1 (the current latest). We can build/launch different containers for different versions if needed.
    * CUDA_VISIBLE_DEVICES = 0 by default. This can be overwritten at pipeline scheduling time.
    * The pipeline currently fails in the ROCm Python tests, so it does not block completion of the PR.
    * Included some fixes that are not related to ROCm but generally needed to run on systems whose CPU names are resolved (e.g. "zen2"), i.e. the build agent itself.

    Related work items: #3682

commit 49f176ad8f2f56a20b3028e9c1648b0518b71bd4
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Thu Apr 21 07:27:28 2022 +0000

    Merged PR 2537: [Compliance] Make dependency on ffmpeg optional

    ffmpeg-python is only needed for video export from the Iteration Visualizer Tool

    Removing the hard dependency from the tool.
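
A common pattern for an optional dependency of this kind is to import it lazily and degrade gracefully when it is missing. The sketch below uses hypothetical helper names and is not the tool's actual code; it relies only on the fact that the ffmpeg-python package is imported as `ffmpeg`.

```python
import importlib


def load_optional(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def video_export_available():
    """Video export is offered only when ffmpeg-python can be imported."""
    return load_optional("ffmpeg") is not None
```

With this shape, the rest of the tool runs without ffmpeg-python installed, and only the video export path needs to check `video_export_available()`.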

commit 8519558ff63c2142ad1cb6ab3ebcfec556416432
Author: Mason Remy <masonr@microsoft.com>
Date:   Thu Apr 21 01:06:25 2022 +0000

    Merged PR 2525: Fix vectorization plumbing for GPU scenarios

    Fix vectorization plumbing for GPU scenarios

    Related work items: #3661

commit 793343a838492fba63b344dbe0c3c147721b11da
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Thu Apr 21 00:09:09 2022 +0000

    Merged PR 2531: [nfc][docs] Merging weekly commits from Github/main

    commit d75d4a6b9cec2ccf90bdf27911d843be1833bc8d
    Author: Arslan-e-Mustafa <70168134+Arslan-e-Mustafa@users.noreply.github.com>
    Date:   Mon Apr 18 20:15:49 2022 +0500

        Refactoring of functions docs in reference files (#34)

        * complete refactoring of safety analysis

        * minor tweaks

        * rebasing and minor tweak

        * Update create_parameter_grid.md

        Co-authored-by: Lisa Ong <11318241+lisaong@users.noreply.github.com>

    commit fe880fb269cefb5af774b7085b0e4c1a95692630
    Author: Arslan-e-Mustafa <70168134+Arslan-e-Mustafa@users.noreply.github.com>
    Date:   Mon Apr 18 19:12:46 2022 +0500

        Complete refactoring of safety analysis (#33)

        * minor fixes and ensure that all links are working

        * complete refactoring of safety analysis

        * Update accera.md

        * Update safety_analysis.md

        Co-authored-by: Lisa Ong <11318241+lisaong@users.noreply.github.com>

    commit d21918b8b366c369d63a507ede696236cbbd8dc6
    Author: Arslan-e-Mustafa <70168134+Arslan-e-Mustafa@users.noreply.github.com>
    Date:   Mon Apr 18 16:22:41 2022 +0500

        Refactoring of Accera.md from reference docs (#32)

        * minor fixes and ensure that all links are working

        * Update accera.md

        Co-authored-by: Lisa Ong <11318241+lisaong@users.noreply.github.com>

    commit 4dc0ce9ee841db837350f9288c821257df53acc3
    Author: Arslan-e-Mustafa <70168134+Arslan-e-Mustafa@users.noreply.github.com>
    Date:   Mon Apr 18 07:26:00 2022 +0500

        Docs refactoring tutorials optimized matmul (#31)

        * minor tweaks

        * initial fixes

        * complete the file with minor tweaks and grammatical fixes

        * did grammatical fixes, rephrasing, and ensure conciseness

        * Update Optimized_MatMul.md

        * Update Optimized_MatMul.md

        Co-authored-by: Lisa Ong <11318241+lisaong@users.noreply.github.com>

    commit ae92be4e67b45b60dd37921974387bc8dd34088e
    Author: Arslan-e-Mustafa <70168134+Arslan-e-Mustafa@users.noreply.github.com>
    Date:   Mon Apr 18 07:07:20 2022 +0500

        Docs refactoring tutorials hello matmul gpu (#30)

        * minor tweaks

        * initial fixes

        * complete the file with minor tweaks and grammatical fixes

        * Addressed the provided feedback

commit 543ea83e8272923b2e44c363bc376a50131622d5
Author: Kern Handa <kerha@microsoft.com>
Date:   Wed Apr 20 21:05:53 2022 +0000

    Merged PR 2530: Adds initial GPU benchmarking infrastructure

    Related work items: #3685

commit 6cf59ed41eada571019a71e1af11a887d39a7aad
Author: Mason Remy <masonr@microsoft.com>
Date:   Wed Apr 20 20:17:49 2022 +0000

    Merged PR 2524: [nfc] Refactor RangeValue utilities to separate file

    [nfc] Refactor RangeValue utilities to separate file

    Related work items: #3661

commit fe78918d25f2925053a37b13ee8c71d7f111b32f
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Wed Apr 20 07:44:57 2022 +0000

    Merged PR 2532: [prog] Fallback to known TargetDevice names for looking up the LLVM triple

    Resolves the issue where the CPU type is resolved (e.g. "zen2"), but does not match anything in the known triples list in TargetDevice.cpp

    Future work can consider lifting the TargetDevice.cpp list to the Python layer

commit 734dd15193c571bef2d0ce3b62f24c29778f01d8
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Apr 19 23:46:54 2022 +0000

    Merged PR 2523: [nfc][docs] Incorporate generated visualizations from Iteration Space Visualizer

    * Add Alex's visualization tool to our tree
    * Updated Schedule documentation and examples to align with existing visualizations
    * Moved logos to subfolder under assets

    TODO: Add Fusing visualizations in a subsequent PR

commit b48296b39a6ef84b3dd3220624fa1c681b98caf0
Author: Kern Handa <kerha@microsoft.com>
Date:   Sun Apr 17 18:33:10 2022 +0000

    Merged PR 2521: Updates formatting of the unknown HOST warning message

    Updates formatting of the unknown HOST warning message

commit 8d487fa45d49c2379a4899810716aa3dcde2eb46
Author: Kern Handa <kerha@microsoft.com>
Date:   Fri Apr 15 09:47:42 2022 +0000

    Merged PR 2514: Makes module compilation resilient to func compilation failures

    Makes module compilation resilient to func compilation failures

commit 6bcbd1892edba8bcd5c0c82d7cbde57ff5896c0b
Author: Denny Sun <dennys@microsoft.com>
Date:   Thu Apr 14 00:24:40 2022 +0000

    Merged PR 2517: Get the known device for host machine and give a warning if the host is an unknown device

    When the target is the host, we call cpuinfo to query the CPU model of the host machine, then use a regex to match it against the model names in the known devices. If a match is found, we use the configs of that known device; otherwise we use default configs to generate code for the host target and warn the user that the generated code may be suboptimal.

    Related work items: #3546
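
The lookup described in PR 2517 can be sketched roughly as follows. This is a minimal illustration, not Accera's actual implementation: `KNOWN_DEVICES`, `DEFAULT_CONFIG`, `resolve_host_config`, and the config contents are hypothetical names chosen for the example, and the CPU brand string is passed in as a parameter rather than queried from cpuinfo.

```python
import re
import warnings

# Hypothetical table: regex over the CPU brand string -> illustrative config.
# The real list of known devices and its fields are not shown in the commit.
KNOWN_DEVICES = [
    (re.compile(r"Xeon\(R\) W-2123"), {"name": "xeon-w-2123"}),
    (re.compile(r"Core\(TM\) i7-8650U"), {"name": "i7-8650u"}),
]

# Hypothetical fallback used when no known device matches.
DEFAULT_CONFIG = {"name": "generic-host"}


def resolve_host_config(brand: str) -> dict:
    """Return the config of the first known device whose pattern matches."""
    for pattern, config in KNOWN_DEVICES:
        if pattern.search(brand):
            return config
    # No match: fall back to defaults and tell the user why.
    warnings.warn(
        f"Unknown host device {brand!r}; using default settings, "
        "generated code may be suboptimal."
    )
    return DEFAULT_CONFIG
```

Calling `resolve_host_config("Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz")` returns the first table entry, while an unrecognized brand string returns `DEFAULT_CONFIG` and emits a warning.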

commit 41166ffd6b012464dd70eb021efba0dc1485fe0f
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Wed Apr 13 18:40:23 2022 +0000

    Merged PR 2519: Merging changes from Github remote

    commit ee8ad1ed7b7911109d76a40fb3990a419de05fe5
    Author: Arslan-e-Mustafa <70168134+Arslan-e-Mustafa@users.noreply.github.com>
    Date:   Tue Apr 12 16:16:38 2022 +0500

    Revise Pi3_Cross_Compilation.md (28)

    commit aa1b9672d1d76fe1f1493959c8cced6b89e1b0a0
    Author: Arslan-e-Mustafa <70168134+Arslan-e-Mustafa@users.noreply.github.com>
    Date:   Fri Apr 8 17:51:34 2022 +0500

    Docs refactoring install (27)

commit 77f4ae34c3cf1dd506bbe6bd148577c670d4ec53
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Tue Apr 12 20:12:52 2022 +0000

    Merged PR 2513: Removed inaccurate warp size computation for Vulkan targets

    The previous barrier optimization PR added some inaccurate code to `util::resolveWarpSize()` for Vulkan targets. This PR removes that, and fixes up some tests that depended on it.

commit 9fa84695b15c73a403649c269a4b71c31c2375d5
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Tue Apr 12 17:59:15 2022 +0000

    Merged PR 2516: Add fp16 support for mfma in the DSL (+tests)

    - Add support for fp16 input and fp32 output
    - Support fp16 input and output
    - Clean up some tests

    Related work items: #3670

commit 79594afb12332e368e8f156393461838c8592a1e
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Fri Apr 8 05:42:24 2022 +0000

    Merged PR 2510: Add different mfma tile sizes for FP32

    - Fix a couple of offset bugs
    - Add multi-block tile sizes
    - Add unit tests

    Related work items: #3666

commit 2046b0529f9aac47c0a7ea50467a1b08a36dc5bb
Author: Mason Remy <masonr@microsoft.com>
Date:   Fri Apr 8 02:26:37 2022 +0000

    Merged PR 2511: Enable smoke test GPU matmul correctness checks

    Enable smoke test GPU matmul correctness checks

    - Also fix some FP16 scenarios
    - Add some more Accera <-> numpy mapping utilities

commit 40ad2ee4bfb1dc44ec028baeef5908204f375c30
Author: Mason Remy <masonr@microsoft.com>
Date:   Thu Apr 7 18:37:31 2022 +0000

    Merged PR 2502: Support different input array layouts for GPU caching

    Support different input array layouts for GPU caching

    This change mainly configures the thread assignments in order to get
    coalesced global memory access. The logical access pattern should
    already have been correct; this is primarily for performance.

    Related work items: #3660

commit 3a62c46308be00ae70d28e3f05df9a60b4999021
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Thu Apr 7 16:32:34 2022 +0000

    Merged PR 2487: Barrier optimization, part 2

    This PR improves the previous barrier optimization code. It now works with non-straight-line code (if/else constructs and loops).

    It doesn't yet do the "move barriers outside of loops" optimization.

    For debugging, there's an option to output a graphviz dot file showing the graph of relevant instructions that are used during the optimization:

    ```
    acc-opt ... --barrier-opt-dot --barrier-opt-dot-filename="barrier.dot"
    ```

    Related work items: #3649

commit a306c54ce09c8257d791a7927da8ac9f80ddafa4
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Thu Apr 7 00:59:58 2022 +0000

    Merged PR 2509: [nfc] sync quickstart demo from GitHub/demo branch

    Use a subset of MLAS optimizations that are sufficient to show a 3x improvement over the default schedule.

    This version has already been in the GitHub repo for some time.

commit 7af0adfe88b4db022c9a7c8cfe712181ab3c4df5
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Wed Apr 6 22:34:28 2022 +0000

    Merged PR 2508: [release] Bump docs version to 1.2.3

    In preparation for a PyPI release to facilitate community contributions for case studies

    Synced doc editorials from public Github repo

commit 958d64e74ed02a08462d2316c488b21dc51804cd
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Wed Apr 6 09:32:27 2022 +0000

    Merged PR 2503: [prog] Support unsigned integer types in the DSL

    * Add ScalarType.uint8/16/32/64 support
    * Use UnrealizedConversionCastOps to convert these unsigned ints to signless ints
    * Refactored CastImpl now that we have to handle both unsigned and signless cases for casts to/from ints
    * Use a tuple of (mlir Type, llvm Type) to infer the C type when writing function declarations in the HAT file. The former holds sign-ness information, the latter determines the C type (e.g. pointer or not)
    * Simplified CheckAllClose function to reduce unnecessary casting
    * Doc updates
    * Fixed HAT file issues with ScalarType.bool

    Note: Pipelines will fail until the next release of hatlib (https://github.com/microsoft/hat/pull/37)

    Related work items: #3520

commit 510f2f59eef1de8c6a4872c3bb5a6ff147da2f5f
Author: Kern Handa <kerha@microsoft.com>
Date:   Wed Apr 6 01:58:48 2022 +0000

    Merged PR 2507: Updates acc-translate output for ROCm 5.1

commit adcc6f0888fa111d76049e90e9d77168e0a47c68
Author: Denny Sun <dennys@microsoft.com>
Date:   Wed Apr 6 00:48:02 2022 +0000

    Merged PR 2437: Add more known targets (from our team's devices)

    The new list covers the following CPUs, which are in use by our devs:
    Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz
    11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
    Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz   2.11 GHz
    Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
    Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz

    Related work items: #3546

commit c2d5af44077a6344eb1cc595cd4bcf137f77164c
Author: Kern Handa <kerha@microsoft.com>
Date:   Tue Apr 5 07:32:47 2022 +0000

    Merged PR 2505: [nfc] Rename parameters for schedule.tile and plan.bind

    [nfc] Rename parameters for schedule.tile and plan.bind

commit d61827f078676de5d660ba819f7d30d2896cb449
Author: Kern Handa <kerha@microsoft.com>
Date:   Tue Apr 5 06:00:32 2022 +0000

    Merged PR 2501: Adds support for more than one GPU function per package

    Adds support for more than one GPU function per package

    Related work items: #3686

commit 2d2605a3b90ffc70fc9b078315237bbff51b1273
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Apr 5 05:16:10 2022 +0000

    Merged PR 2504: [docs] Update stale versions in Reference docs

    Fixing while considering better approaches....

commit 7fd0d8f0412343d0dd82a8817775bf853a700074
Author: Kern Handa <kerha@microsoft.com>
Date:   Tue Apr 5 00:53:42 2022 +0000

    Merged PR 2499: Updates the syntax for schedule.tile

    Updates the syntax for schedule.tile

commit e221cad3ffabe4c31f7c92e9bc0e112002581b1a
Author: Kern Handa <kerha@microsoft.com>
Date:   Mon Apr 4 23:21:14 2022 +0000

    Merged PR 2498: Updates the syntax for plan.bind

    Updates the syntax for plan.bind

    Re…