v1.2.4

@lisaong lisaong released this 06 May 02:12
· 49 commits to main since this release

What's Changed

  • Merged PR 2563: Add a table of operators and code examples to the
    Parameters.md. [Denny Sun]

    Update the Manuals with the supported operators and code examples.

  • Merged PR 2562: [nfc] Add some macOS targets and synced Model.md.
    [Lisa Ong]

    • Re-generated Model.md to add missing models
    • Handle zero (unknown) vector_bytes cases in tests
    • Opportunistically added these models used during development:
      • 2016 macbook pro
      • M1 max
  • Merged PR 2561: [docs][nfc] Sync changes from Github remote, bump doc
    versions to 1.2.4. [Lisa Ong]

  • Merged PR 2558: [nfc] update requirements to latest version of six.
    [Lisa Ong]

    Fixes this warning:

    <frozen importlib._bootstrap>:914: ImportWarning: _SixMetaPathImporter.find_spec() not found; falling back to find_module()
    
  • Merged PR 2559: Finer-granularity error reporting for python tests.
    [Chuck Jacobs]

    This PR modifies how the python tests are invoked, so that they can report pass/fail results per test. Hopefully that'll make it easier to pinpoint where things are failing during CI builds.
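
    The PR text does not describe the mechanism it uses. As a rough sketch of per-test reporting with the standard `unittest` module (the names `PerTestResult`, `make_sample_tests` and `run_with_per_test_report` are hypothetical, chosen for illustration), a custom result object can record one outcome per test:

```python
import unittest


class PerTestResult(unittest.TestResult):
    """Records a pass/fail/error outcome for every individual test."""

    def __init__(self):
        super().__init__()
        self.outcomes = {}

    def _record(self, test, outcome):
        # test.id() looks like "module.Class.method"; keep the method name only
        self.outcomes[test.id().split(".")[-1]] = outcome

    def addSuccess(self, test):
        super().addSuccess(test)
        self._record(test, "pass")

    def addFailure(self, test, err):
        super().addFailure(test, err)
        self._record(test, "fail")

    def addError(self, test, err):
        super().addError(test, err)
        self._record(test, "error")


def make_sample_tests():
    """Builds a hypothetical test case with one passing and one failing test."""

    class SampleTests(unittest.TestCase):
        def test_ok(self):
            self.assertEqual(1 + 1, 2)

        def test_broken(self):
            self.assertEqual(1 + 1, 3)

    return SampleTests


def run_with_per_test_report(test_case_class):
    """Runs every test in the class and returns {test name: outcome}."""
    suite = unittest.TestLoader().loadTestsFromTestCase(test_case_class)
    result = PerTestResult()
    suite.run(result)
    return result.outcomes
```

    A CI step could print the returned dictionary so that a failing build points directly at the failing test names.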

  • Merged PR 2556: [non-functional] Change ROCM code to generate gcn
    intrinsics when possible. [Ritwik Das]

    • Use amd gcn intrinsics when possible (threadIdx, blockIdx, barrier)
    • Add helpers which automatically check for runtime before emitting the proper code

    Related work items: #3698

  • Merged PR 2547: [non-functional] Change custom mfma types to Memref
    and some refactoring. [Ritwik Das]

    Make initial changes to remove custom mfma types

    Related work items: #3691

  • Merged PR 2555: create_parameters(count: int) no longer needs count as
    an argument. [Denny Sun]

    1. Remove the count of parameters to be created from the DSL
    2. Throw an exception when users write the following code:
      create_parameters()
    3. The correct way of calling create_parameters() is:
      p1, p2, p3, ..., pN = create_parameters()
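
    These notes do not say how the count is inferred. One way a count-free call could work in CPython, shown here purely as a sketch and not as Accera's implementation, is to inspect the caller's bytecode and read the number of unpacking targets. The `Parameter` class below is a hypothetical placeholder, and this simplified version only accepts tuple unpacking:

```python
import dis
import inspect


class Parameter:
    """Hypothetical placeholder standing in for a DSL parameter."""


def create_parameters():
    # Inspect the caller's bytecode to see what happens to the return value.
    caller = inspect.currentframe().f_back
    following = next(
        (
            ins
            for ins in dis.get_instructions(caller.f_code)
            if ins.offset > caller.f_lasti
        ),
        None,
    )
    # A bare call (result discarded) is not followed by UNPACK_SEQUENCE.
    if following is None or following.opname != "UNPACK_SEQUENCE":
        raise ValueError(
            "create_parameters() must be unpacked, "
            "e.g. p1, p2 = create_parameters()"
        )
    # The UNPACK_SEQUENCE argument is the number of assignment targets.
    return tuple(Parameter() for _ in range(following.argval))
```

    With this sketch, `p1, p2, p3 = create_parameters()` yields three placeholders, while a bare `create_parameters()` raises. Bytecode inspection is specific to CPython, which is one reason to treat this as an illustration only.
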
  • Merged PR 2554: [doc] Updated some missing enums and fixed Case Study
    path. [Lisa Ong]

  • Merged PR 2522: Generalize array indexing in tensorized GEMM. [Chuck
    Jacobs]

    This PR generalizes the MFMA tensorization pass to improve the handling of code in the innermost loop. It recognizes more ways of writing the GEMM kernel, and rejects many ill-formed GEMM kernels.

    There are also a number of tests.

    This PR doesn't yet generalize to batch-GEMM, where the matrices (typically) have 3 indices.

    Related work items: #3676

  • Merged PR 2551: [nfc][ci] Switch hosted pipelines to 1ES hosted pool.
    [Lisa Ong]

    • The Linux1ESPool is created to support internal builds of LLVM

    • Fix regression in pipeline due to overzealous .dockerignore

  • Merged PR 2550: [nfc] [docs] Merge changes from GitHub remote. [Lisa
    Ong]

    In preparation for merge from ADO to GitHub for Case Studies publishing

  • Merged PR 2549: [Compliance] Switching from Dockerhub to ACR for third
    party containers. [Lisa Ong]

    Updating Dockerfile references

  • Merged PR 2548: Add README file for case studies. [Denny Sun]

    README file has a table where each case study points to the external repo link.

  • Merged PR 2546: [dev] [nfc] Natively support macOS/arm64 for
    development. [Lisa Ong]

    Limited to local development scenarios (LLVM_SETUP_VARIANT=Default)

    No plans to release pip packages until there is CI support

    Verified on: Big Sur (MacOSX 12.3 arm64) / Python 3.10

  • Merged PR 2543: Add precomputed offset map optimization for
    tensorization (no caching) [Ritwik Das]

    • Add flag to tensorize() to enable optimization (off by default)
    • Optimization only affects load/store of accumulator (C) argument
    • Supports all 4 mfma shapes

    Related work items: #3671

  • Merged PR 2542: An assortment of minor fixes. [Chuck Jacobs]

    This PR is a hodgepodge of tiny fixes. I'm happy to split it up into separate PRs if a kitchen-sink PR is too gross.

    The specific things are:

    • Add 2 new target models to Targets.py (that correspond to my local dev boxes)
    • Change the snapshot IR format for sub-passes to use the same format as the top-level passes (that is, not "generic" format)
    • Print a warning message if check_correctness skips a correctness check because no hat file was generated
    • Add a "minimum version" constraint to requirements.txt for hatlib
  • Merged PR 2545: Unifies CUDA and CPP enum values to SOURCE for
    Package.Format. [Kern Handa]

    Related work items: #3679

  • Merged PR 2544: [nfc] Removes now unnecessary ldebug output. [Kern
    Handa]

  • Merged PR 2527: Enable vectorized shared memory write. [Mason Remy]

    • This adds mod simplification support needed for vectorizing shared
      memory writes
    • Also refactors some of the affine simplification code slightly to
      share some common code between the floordiv and mod simplifications
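
    The rewrite rules themselves are not listed in these notes. As a sketch of the kind of range-based reasoning such simplifications rely on (an assumption for illustration, not the compiler's code; `floordiv_scale` and `holds_for_all` are hypothetical helpers): for an index `coeff * outer + inner`, if `coeff` is a multiple of `divisor` and `inner` is known to lie in `[0, divisor)`, then the floordiv reduces to `(coeff // divisor) * outer` and the mod reduces to `inner`.

```python
def floordiv_scale(coeff, inner_range, divisor):
    """Return s such that (coeff*outer + inner) // divisor == s*outer and
    (coeff*outer + inner) % divisor == inner, or None if not provable.

    inner_range is an inclusive (lo, hi) bound on the value of `inner`.
    """
    lo, hi = inner_range
    if divisor <= 0 or coeff % divisor != 0:
        return None  # coeff*outer would leave a remainder
    if lo < 0 or hi >= divisor:
        return None  # inner could spill into the next multiple of divisor
    return coeff // divisor


def holds_for_all(coeff, inner_range, divisor, outer_values):
    """Brute-force check of the two identities over concrete values."""
    scale = floordiv_scale(coeff, inner_range, divisor)
    if scale is None:
        return False
    lo, hi = inner_range
    for outer in outer_values:
        for inner in range(lo, hi + 1):
            index = coeff * outer + inner
            if index // divisor != scale * outer or index % divisor != inner:
                return False
    return True
```

    When the mod collapses to the inner index, consecutive inner iterations map to consecutive offsets, which is the kind of pattern a vectorizer can recognize.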

    Related work items: #3586, #3661, #3689

  • Merged PR 2526: Enable GPU global read vectorization. [Mason Remy]

    • Implements a floor div simplification that enables better recognition
      of vectorizable load and stores

    Related work items: #3661, #3690

  • Merged PR 2541: Fix a few issues with GEMM benchmarking script. [Chuck
    Jacobs]

    This PR fixes a couple of errors:

    • there was a bug in the GEMM kernel
    • sometimes hatlib would fail to return a compiled function, but not throw an exception. These are now flagged as "uncompilable"

    It makes a couple of other tweaks:

    • it fails if the alpha and beta parameters aren't 1.0 and 0.0
    • it culls some variants with known-uncompilable tensorization parameters before trying to compile them
  • Merged PR 2538: Fix std::pair unpacking issue in
    TensorizeAffineForOpConversion. [Lisa Ong]

    In debug builds, we are getting garbage values for warpSizeX and warpSizeY, resulting in division by 0 errors in the emitted .cu files

  • Merged PR 2536: Parameter supports most of the
    arithmetic/binary/unary operations defined in operator lib. [Denny
    Sun]

    Parameter supports the basic arithmetic operations (+, -, *, //, %). For example, the user can write the following code:

    fma_unit_count, vector_size = acc.create_parameters(2)
    jjj = schedule.split(jj, fma_unit_count * vector_size)
    jjjj = schedule.split(jjj, vector_size)
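
    To illustrate how an expression such as `fma_unit_count * vector_size` can stay symbolic until concrete values are supplied, here is a minimal sketch using Python operator overloading. The `Param` class is hypothetical and is not Accera's implementation; it covers only the five operators listed above.

```python
import operator


class Param:
    """Hypothetical deferred parameter (illustration only)."""

    def __init__(self, name=None, compute=None):
        self.name = name
        self._compute = compute  # None means "leaf": look the value up by name

    def evaluate(self, values):
        """Resolve this parameter or expression against a name -> value dict."""
        if self._compute is None:
            return values[self.name]
        return self._compute(values)

    def _combine(self, other, op, reflected=False):
        def compute(values):
            lhs = self.evaluate(values)
            rhs = other.evaluate(values) if isinstance(other, Param) else other
            return op(rhs, lhs) if reflected else op(lhs, rhs)

        return Param(compute=compute)

    def __add__(self, other):
        return self._combine(other, operator.add)

    def __radd__(self, other):
        return self._combine(other, operator.add, reflected=True)

    def __sub__(self, other):
        return self._combine(other, operator.sub)

    def __rsub__(self, other):
        return self._combine(other, operator.sub, reflected=True)

    def __mul__(self, other):
        return self._combine(other, operator.mul)

    def __rmul__(self, other):
        return self._combine(other, operator.mul, reflected=True)

    def __floordiv__(self, other):
        return self._combine(other, operator.floordiv)

    def __rfloordiv__(self, other):
        return self._combine(other, operator.floordiv, reflected=True)

    def __mod__(self, other):
        return self._combine(other, operator.mod)

    def __rmod__(self, other):
        return self._combine(other, operator.mod, reflected=True)
```

    In this sketch, `(Param("fma_unit_count") * Param("vector_size")).evaluate({"fma_unit_count": 4, "vector_size": 8})` multiplies the two supplied values only at evaluation time.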

    Related work items: #3692

  • Merged PR 2539: [nfc][docs] Merging commits from Github/main. [Lisa
    Ong]

    commit ee28126a338d905eb5931038d3c5daba6ead3811

  • Merged PR 2535: [ci] Self-hosted Azure DevOps build agent for ROCm
    smoke tests. [Lisa Ong]

    • Docker image for self-hosted build agent on the ROCm development machine
    • Pipeline will front-load the Python ROCm tests so that we fail faster
    • The agent runs ROCm 5.1.1 (the current latest). We can build/launch different containers for different versions if needed.
    • CUDA_VISIBLE_DEVICES = 0 by default. This can be overwritten at pipeline scheduling time.
    • The pipeline currently fails in the ROCm Python tests, so it does not block completion of the PR.
    • Included some fixes that are not related to ROCm but generally needed to run on systems whose CPU names are resolved (e.g. "zen2"), i.e. the build agent itself.

    Related work items: #3682

  • Merged PR 2537: [Compliance] Make dependency on ffmpeg optional. [Lisa
    Ong]

    ffmpeg-python is only needed for video export from the Iteration Visualizer Tool

    Removing the hard dependency from the tool.
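
    A common way to make such a dependency optional is to import it lazily and fail only when the feature is requested. The sketch below shows that general pattern; the helper names are hypothetical and the tool's actual code may differ. The `ffmpeg-python` package is imported as the module `ffmpeg`.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def require_optional(module_name, feature, package_name):
    """Return the module, or raise a clear error naming the missing package."""
    module = optional_import(module_name)
    if module is None:
        raise RuntimeError(
            f"{feature} requires the optional package '{package_name}' "
            f"(pip install {package_name})"
        )
    return module
```

    A video export function could call `require_optional("ffmpeg", "Video export", "ffmpeg-python")` as its first step, leaving every other feature usable without the package.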

  • Merged PR 2525: Fix vectorization plumbing for GPU scenarios. [Mason
    Remy]

    Related work items: #3661

  • Merged PR 2531: [nfc][docs] Merging weekly commits from Github/main.
    [Lisa Ong]

    commit d75d4a6b9cec2ccf90bdf27911d843be1833bc8d

  • Merged PR 2530: Adds initial GPU benchmarking infrastructure. [Kern
    Handa]

    Related work items: #3685

  • Merged PR 2524: [nfc] Refactor RangeValue utilities to separate file.
    [Mason Remy]

    Related work items: #3661

  • Merged PR 2532: [prog] Fallback to known TargetDevice names for
    looking up the LLVM triple. [Lisa Ong]

    Resolves the issue where the CPU type is resolved (e.g. "zen2"), but does not match anything in the known triples list in TargetDevice.cpp

    Future work can consider lifting the TargetDevice.cpp list to the Python layer

  • Merged PR 2523: [nfc][docs] Incorporate generated visualizations from
    Iteration Space Visualizer. [Lisa Ong]

    • Add Alex's visualization tool to our tree
    • Updated Schedule documentation and examples to align with existing visualizations
    • Moved logos to subfolder under assets
  • Merged PR 2521: Updates formatting of the unknown HOST warning
    message. [Kern Handa]

  • Merged PR 2514: Makes module compilation resilient to function
    compilation failures. [Kern Handa]

  • Merged PR 2517: Get the known device for host machine and give a
    warning if the host is an unknown device. [Denny Sun]

    For a host target, we call cpuinfo to query the CPU model of the host machine, then use a regex to match it against the model names of the known devices. If a match is found, we use that device's configuration; otherwise we generate code for the host with default settings and warn the user that the code may be suboptimal.
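
    The lookup can be sketched as follows. The device table, its values, the matching rule and the function name are all hypothetical stand-ins (the real known-device list and its fields are not shown in these notes), and the CPU brand string is passed in directly rather than queried:

```python
import re
import warnings

# Hypothetical table: names and values are placeholders for illustration.
KNOWN_DEVICES = {
    "Intel 6700": {"vector_bytes": 32},
    "AMD 5950X": {"vector_bytes": 32},
}
DEFAULT_CONFIG = {"vector_bytes": 16}


def resolve_host_config(cpu_brand, known=KNOWN_DEVICES, default=DEFAULT_CONFIG):
    """Return (matched model name or None, config) for a CPU brand string."""
    for model, config in known.items():
        # Match the model's tokens in order, ignoring case and whatever
        # text the vendor places between them.
        pattern = ".*".join(re.escape(token) for token in model.split())
        if re.search(pattern, cpu_brand, flags=re.IGNORECASE):
            return model, config
    warnings.warn(
        f"Unknown host CPU '{cpu_brand}'; using default settings, "
        "generated code may be suboptimal"
    )
    return None, default
```

    For example, the brand string "Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz" matches the placeholder entry "Intel 6700", while an unrecognized string falls back to the defaults and emits a warning.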

    Related work items: #3546

  • Merged PR 2519: Merging changes from Github remote. [Lisa Ong]

    commit ee8ad1ed7b7911109d76a40fb3990a419de05fe5

  • Merged PR 2513: Removed inaccurate warp size computation for Vulkan
    targets. [Chuck Jacobs]

    The previous barrier optimization PR added some inaccurate code to util::resolveWarpSize() for Vulkan targets. This PR removes that, and fixes up some tests that depended on it.

  • Merged PR 2516: Add fp16 support for mfma in the DSL (+tests) [Ritwik
    Das]

    • Add support for fp16 input and fp32 output
    • Support fp16 input and output
    • Clean up some tests

    Related work items: #3670

  • Merged PR 2510: Add different mfma tile sizes for FP32. [Ritwik Das]

    • Fix a couple of offset bugs
    • Add multi-block tile sizes
    • Add unit tests

    Related work items: #3666

  • Merged PR 2511: Enable smoke test GPU matmul correctness checks.
    [Mason Remy]

    • Also fix some FP16 scenarios
    • Add some more Accera <-> numpy mapping utilities
  • Merged PR 2502: Support different input array layouts for GPU caching.
    [Mason Remy]

    This change mainly configures the thread assignments in order to get
    coalesced global memory access. The logical access should already have
    been correct; this is primarily for performance.

    Related work items: #3660

  • Merged PR 2487: Barrier optimization, part 2. [Chuck Jacobs]

    This PR improves the previous barrier optimization code. It now works with non-straight-line code (if/else constructs and loops).

    It doesn't yet do the "move barriers outside of loops" optimization.

    For debugging, there's an option to output a graphviz dot file showing the graph of relevant instructions that are used during the optimization:

    acc-opt ... --barrier-opt-dot --barrier-opt-dot-filename="barrier.dot"
    

    Related work items: #3649

  • Merged PR 2509: [nfc] sync quickstart demo from GitHub/demo branch.
    [Lisa Ong]

    Use a subset of MLAS optimizations that are sufficient to show a 3x improvement over the default schedule.

    This version was already in the Github repo for some time.

Full Changelog: v1.2.3...v1.2.4