v1.2.4
What's Changed
- Docs refactoring install by @Arslan-e-Mustafa in #27
- Revise Pi3_Cross_Compilation.md by @Arslan-e-Mustafa in #28
- Docs refactoring tutorials hello matmul by @Arslan-e-Mustafa in #29
- Docs refactoring tutorials hello matmul gpu by @Arslan-e-Mustafa in #30
- Docs refactoring tutorials optimized matmul by @Arslan-e-Mustafa in #31
- Refactoring of Accera.md from reference docs by @Arslan-e-Mustafa in #32
- Complete refactoring of safety analysis by @Arslan-e-Mustafa in #33
- Refactoring of functions docs in reference files by @Arslan-e-Mustafa in #34
- Demo fixes for hatlib 0.0.11 by @lisaong in #36
- [nfc] [doc] Update arrow label positions by @lisaong in #35
- completed reference docs by @Arslan-e-Mustafa in #37
- Update docstrings to match reference doc changes by @lisaong in #38
- [ci][nfc] Update CI pipeline to Azure Container Registry by @lisaong in #39
- [doc] Contributing guide for Case Studies by @lisaong in #40
-
Merged PR 2563: Add a table of operators and code examples to the
Parameters.md. [Denny Sun]
Update the Manuals with the supported operators and code examples.
-
Merged PR 2562: [nfc] Add some macOS targets and synced Model.md.
[Lisa Ong]
- Re-generated Model.md to add missing models
- Handle zero (unknown) vector_bytes cases in tests
- Opportunistically added these models used during development:
  - 2016 macbook pro
  - M1 max
-
Merged PR 2561: [docs][nfc] Sync changes from Github remote, bump doc
versions to 1.2.4. [Lisa Ong]
-
Merged PR 2558: [nfc] update requirements to latest version of six.
[Lisa Ong]
Fixes this warning:
<frozen importlib._bootstrap>:914: ImportWarning: _SixMetaPathImporter.find_spec() not found; falling back to find_module()
-
Merged PR 2559: Finer-granularity error reporting for python tests.
[Chuck Jacobs]
This PR modifies how the python tests are invoked, so that they can report pass/fail results per test. Hopefully that'll make it easier to pinpoint where things are failing during CI builds.
-
Merged PR 2556: [non-functional] Change ROCM code to generate gcn
intrinsics when possible. [Ritwik Das]
- Use amd gcn intrinsics when possible (threadIdx, blockIdx, barrier)
- Add helpers which automatically check for runtime before emitting the proper code
Related work items: #3698
-
Merged PR 2547: [non-functional] Change custom mfma types to Memref
and some refactoring. [Ritwik Das]
Make initial changes to remove custom mfma types
Related work items: #3691
-
Merged PR 2555: create_parameters(count: int) no longer needs count as
an argument. [Denny Sun]
- Remove the count of parameters to be created from the DSL
- Throw exception when users write the following code:
  create_parameters()
- The correct way of calling create_parameters() is:
  p1, p2, p3, ..., pN = create_parameters()
-
Merged PR 2554: [doc] Updated some missing enums and fixed Case Study
path. [Lisa Ong]
-
Merged PR 2522: Generalize array indexing in tensorized GEMM. [Chuck
Jacobs]
This PR generalizes the MFMA tensorization pass to improve the handling of code in the innermost loop. It recognizes more ways of writing the GEMM kernel, and rejects many ill-formed GEMM kernels.
There are also a number of tests.
This PR doesn't yet generalize to batch-GEMM, where the matrices (typically) have 3 indices.
Related work items: #3676
-
Merged PR 2551: [nfc][ci] Switch hosted pipelines to 1ES hosted pool.
[Lisa Ong]
- The Linux1ESPool is created to support internal builds of LLVM
- Fix regression in pipeline due to overzealous .dockerignore
-
Merged PR 2550: [nfc] [docs] Merge changes from GitHub remote. [Lisa
Ong]
In preparation for merge from ADO to GitHub for Case Studies publishing
-
Merged PR 2549: [Compliance] Switching from Dockerhub to ACR for third
party containers. [Lisa Ong]
Updating Dockerfile references
-
Merged PR 2548: Add README file for case studies. [Denny Sun]
README file has a table where each case study points to the external repo link.
-
Merged PR 2546: [dev] [nfc] Natively support macOS/arm64 for
development. [Lisa Ong]
Limited to local development scenarios (LLVM_SETUP_VARIANT=Default)
No plans to release pip packages until there is CI support
Verified on: Big Sur (MacOSX 12.3 arm64) / Python 3.10
-
Merged PR 2543: Add precomputed offset map optimization for
tensorization (no caching) [Ritwik Das]
- Add flag to tensorize() to enable optimization (off by default)
- Optimization only affects load/store of accumulator (C) argument
- Supports all 4 mfma shapes
Related work items: #3671
-
Merged PR 2542: An assortment of minor fixes. [Chuck Jacobs]
This PR is a hodgepodge of tiny fixes. I'm happy to split it up into separate PRs if a kitchen-sink PR is too gross.
The specific things are:
- Add 2 new target models to Targets.py (that correspond to my local dev boxes)
- Change the snapshot IR format for sub-passes to use the same format as the top-level passes (that is, not "generic" format)
- Print a warning message if check_correctness skips a correctness check because no hat file was generated
- Add a "minimum version" constraint to requirements.txt for hatlib
-
Merged PR 2545: Unifies CUDA and CPP enum values to SOURCE for
Package.Format. [Kern Handa]
Unifies CUDA and CPP enum values to SOURCE for Package.Format
Related work items: #3679
-
Merged PR 2544: [nfc] Removes now unnecessary ldebug output. [Kern
Handa]
[nfc] Removes now unnecessary ldebug output
-
Merged PR 2527: Enable vectorized shared memory write. [Mason Remy]
Enable vectorized shared memory write
- This adds mod simplification support needed for vectorizing shared memory writes
- Also refactors some of the affine simplification code slightly to share some common code between the floordiv and mod simplifications
Related work items: #3586, #3661, #3689
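As a rough illustration of why floordiv and mod simplifications can help, the sketch below applies one such rule to an index of the form `scale * i + j` with `0 <= j < j_upper`. It is a standalone Python toy, not Accera's implementation (which works on affine expressions inside the compiler); the names `simplify_floordiv_mod` and `check_rule` and their arguments are hypothetical.

```python
def simplify_floordiv_mod(scale, j_upper, divisor):
    """Try to simplify (scale * i + j) // divisor and (scale * i + j) % divisor.

    Assumes i is any integer and 0 <= j < j_upper. When scale is a multiple
    of divisor and j never reaches divisor, the floordiv reduces to k * i and
    the mod reduces to j. Returns k, or None when the rule does not apply.
    """
    if divisor <= 0 or j_upper <= 0:
        return None
    if scale % divisor == 0 and j_upper <= divisor:
        return scale // divisor
    return None


def check_rule(scale, j_upper, divisor, i_values=range(-4, 5)):
    """Brute-force check of the simplified forms against Python's // and %."""
    k = simplify_floordiv_mod(scale, j_upper, divisor)
    if k is None:
        return False
    return all(
        (scale * i + j) // divisor == k * i and (scale * i + j) % divisor == j
        for i in i_values
        for j in range(j_upper)
    )


print(simplify_floordiv_mod(128, 16, 64))  # 2
print(check_rule(128, 16, 64))  # True
print(simplify_floordiv_mod(48, 16, 64))  # None
```

Once the mod reduces to plain `j`, consecutive values of `j` are visibly contiguous, which is presumably the kind of pattern that makes a load or store easier to recognize as vectorizable.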
-
Merged PR 2526: Enable GPU global read vectorization. [Mason Remy]
Enable GPU global read vectorization
- Implements a floor div simplification that enables better recognition of vectorizable loads and stores
Related work items: #3661, #3690
-
Merged PR 2541: Fix a few issues with GEMM benchmarking script. [Chuck
Jacobs]
This PR fixes a couple of errors:
- there was a bug in the GEMM kernel
- sometimes hatlib would fail to return a compiled function, but not throw an exception. These are now flagged as "uncompilable"
It makes a couple of other tweaks:
- it fails if the alpha and beta parameters aren't 1.0 and 0.0
- it culls some variants with known-uncompilable tensorization parameters before trying to compile them
-
Merged PR 2538: Fix std::pair unpacking issue in
TensorizeAffineForOpConversion. [Lisa Ong]
In debug builds, we are getting garbage values for warpSizeX and warpSizeY, resulting in division by 0 errors in the emitted .cu files
-
Merged PR 2536: Parameter supports most of the
arithmetic/binary/unary operations defined in operator lib. [Denny
Sun]
Parameter supports the basic arithmetic operations (+, -, *, //, %). For example, the user can write the following code:
fma_unit_count, vector_size = acc.create_parameters(2)
jjj = schedule.split(jj, fma_unit_count * vector_size)
jjjj = schedule.split(jjj, vector_size)
Related work items: #3692
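To show how a placeholder object can support arithmetic before its value is known, here is a minimal sketch using Python operator overloading. The `Param` class, its `evaluate` method, and the dictionary of values are all hypothetical and stand in for the idea only; this is not Accera's Parameter implementation.

```python
import operator


class Param:
    """Hypothetical placeholder whose concrete value is supplied later."""

    def __init__(self, name=None, compute=None):
        self.name = name
        self._compute = compute  # None for a leaf parameter

    def evaluate(self, values):
        """Resolve this parameter (or expression) against a name -> value dict."""
        if self._compute is None:
            return values[self.name]
        return self._compute(values)

    def _binary(self, other, op, reflected=False):
        # Build a new Param that applies `op` only when evaluate() is called.
        def compute(values):
            lhs = self.evaluate(values)
            rhs = other.evaluate(values) if isinstance(other, Param) else other
            return op(rhs, lhs) if reflected else op(lhs, rhs)

        return Param(compute=compute)

    def __add__(self, other): return self._binary(other, operator.add)
    def __radd__(self, other): return self._binary(other, operator.add, True)
    def __sub__(self, other): return self._binary(other, operator.sub)
    def __rsub__(self, other): return self._binary(other, operator.sub, True)
    def __mul__(self, other): return self._binary(other, operator.mul)
    def __rmul__(self, other): return self._binary(other, operator.mul, True)
    def __floordiv__(self, other): return self._binary(other, operator.floordiv)
    def __rfloordiv__(self, other): return self._binary(other, operator.floordiv, True)
    def __mod__(self, other): return self._binary(other, operator.mod)
    def __rmod__(self, other): return self._binary(other, operator.mod, True)


fma_unit_count = Param("fma_unit_count")
vector_size = Param("vector_size")
split_size = fma_unit_count * vector_size  # still symbolic at this point
print(split_size.evaluate({"fma_unit_count": 4, "vector_size": 8}))  # 32
```

Deferring the computation this way lets an expression such as `fma_unit_count * vector_size` be passed around as a single value and resolved once concrete parameter values are chosen.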
-
Merged PR 2539: [nfc][docs] Merging commits from Github/main. [Lisa
Ong]
commit ee28126a338d905eb5931038d3c5daba6ead3811
-
Merged PR 2535: [ci] Self-hosted Azure DevOps build agent for ROCm
smoke tests. [Lisa Ong]
- Docker image for self-hosted build agent on the ROCm development machine
- Pipeline will front-load the Python ROCm tests so that we fail faster
- The agent runs ROCm 5.1.1 (the current latest). We can build/launch different containers for different versions if needed.
- CUDA_VISIBLE_DEVICES = 0 by default. This can be overwritten at pipeline scheduling time.
- The pipeline currently fails in the ROCm Python tests, so it does not block completion of the PR.
- Included some fixes that are not related to ROCm but generally needed to run on systems whose CPU names are resolved (e.g. "zen2"), i.e. the build agent itself.
Related work items: #3682
-
Merged PR 2537: [Compliance] Make dependency on ffmpeg optional. [Lisa
Ong]
ffmpeg-python is only needed for video export from the Iteration Visualizer Tool
Removing the hard dependency from the tool.
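A common way to make a dependency like this optional is to import it lazily and fail only when the feature that needs it is actually used. The sketch below shows that general pattern; the helper names `optional_import` and `require_for_video_export` are hypothetical and are not taken from the tool's source.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def require_for_video_export(module_name="ffmpeg"):
    """Import the video backend lazily, failing only when export is requested."""
    module = optional_import(module_name)
    if module is None:
        raise RuntimeError(
            f"Video export needs the optional module {module_name!r}; "
            "install ffmpeg-python to enable it."
        )
    return module
```

With this shape, every other feature of a tool keeps working without the package installed, and the error message appears only on the code path that needs it.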
-
Merged PR 2525: Fix vectorization plumbing for GPU scenarios. [Mason
Remy]
Fix vectorization plumbing for GPU scenarios
Related work items: #3661
-
Merged PR 2531: [nfc][docs] Merging weekly commits from Github/main.
[Lisa Ong]
commit d75d4a6b9cec2ccf90bdf27911d843be1833bc8d
-
Merged PR 2530: Adds initial GPU benchmarking infrastructure. [Kern
Handa]
Related work items: #3685
-
Merged PR 2524: [nfc] Refactor RangeValue utilities to separate file.
[Mason Remy]
[nfc] Refactor RangeValue utilities to separate file
Related work items: #3661
-
Merged PR 2532: [prog] Fallback to known TargetDevice names for
looking up the LLVM triple. [Lisa Ong]
Resolves the issue where the CPU type is resolved (e.g. "zen2"), but does not match anything in the known triples list in TargetDevice.cpp
Future work can consider lifting the TargetDevice.cpp list to the Python layer
-
Merged PR 2523: [nfc][docs] Incorporate generated visualizations from
Iteration Space Visualizer. [Lisa Ong]
- Add Alex's visualization tool to our tree
- Updated Schedule documentation and examples to align with existing visualizations
- Moved logos to subfolder under assets
-
Merged PR 2521: Updates formatting of the unknown HOST warning
message. [Kern Handa]
Updates formatting of the unknown HOST warning message
-
Merged PR 2514: Makes module compilation resist func compilation
fails. [Kern Handa]
Makes module compilation resist func compilation fails
-
Merged PR 2517: Get the known device for host machine and give a
warning if the host is an unknown device. [Denny Sun]
For a host target, we call cpuinfo to query the CPU model of the host machine, then use a regex to match it against the model names in the known devices. If there is a match, we use that device's configs; otherwise we fall back to default configs to generate code for the host target and warn the user that the generated code may be suboptimal.
Related work items: #3546
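The lookup described above can be sketched roughly as follows. Everything here is illustrative: the device table, the model strings, the `num_cores` field, and the function name `resolve_host_config` are made up for the example, and the CPU model string is passed in directly instead of being queried from cpuinfo.

```python
import re
import warnings

# Hypothetical table: regex over the CPU model string -> code generation settings.
KNOWN_DEVICES = {
    r"Example CPU A-\d+": {"vector_bytes": 32, "num_cores": 8},
    r"Example CPU B": {"vector_bytes": 64, "num_cores": 16},
}

# Hypothetical conservative defaults used when nothing matches.
DEFAULT_CONFIG = {"vector_bytes": 16, "num_cores": 1}


def resolve_host_config(cpu_model):
    """Return the settings of the first known device whose pattern matches."""
    for pattern, config in KNOWN_DEVICES.items():
        if re.search(pattern, cpu_model, flags=re.IGNORECASE):
            return config
    warnings.warn(
        f"Unknown host CPU {cpu_model!r}; using default settings, "
        "so the generated code may be suboptimal."
    )
    return DEFAULT_CONFIG


print(resolve_host_config("Example CPU A-100 @ 3.0GHz"))
```

Falling back with a warning, instead of raising, keeps code generation working on unrecognized machines while still telling the user why performance might be lower than expected.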
-
Merged PR 2519: Merging changes from Github remote. [Lisa Ong]
commit ee8ad1ed7b7911109d76a40fb3990a419de05fe5
-
Merged PR 2513: Removed inaccurate warp size computation for Vulkan
targets. [Chuck Jacobs]
The previous barrier optimization PR added some inaccurate code to util::resolveWarpSize() for Vulkan targets. This PR removes that, and fixes up some tests that depended on it.
-
Merged PR 2516: Add fp16 support for mfma in the DSL (+tests) [Ritwik
Das]
- Add support for fp16 input and fp32 output
- Support fp16 input and output
- Clean up some tests
Related work items: #3670
-
Merged PR 2510: Add different mfma tile sizes for FP32. [Ritwik Das]
- Fix a couple of offset bugs
- Add multi-block tile sizes
- Add unit tests
Related work items: #3666
-
Merged PR 2511: Enable smoke test GPU matmul correctness checks.
[Mason Remy]
Enable smoke test GPU matmul correctness checks
- Also fix some FP16 scenarios
- Add some more Accera <-> numpy mapping utilities
-
Merged PR 2502: Support different input array layouts for GPU caching.
[Mason Remy]
Support different input array layouts for GPU caching
This change mainly configures the thread assignments in order to get
coalesced global memory access. The logical accessing should have
already been correct; this is primarily for performance.
Related work items: #3660
-
Merged PR 2487: Barrier optimization, part 2. [Chuck Jacobs]
This PR improves the previous barrier optimization code. It now works with non-straight-line code (if/else constructs and loops).
It doesn't yet do the "move barriers outside of loops" optimization.
For debugging, there's an option to output a graphviz dot file showing the graph of relevant instructions that are used during the optimization:
acc-opt ... --barrier-opt-dot --barrier-opt-dot-filename="barrier.dot"
Related work items: #3649
-
Merged PR 2509: [nfc] sync quickstart demo from GitHub/demo branch.
[Lisa Ong]
Use a subset of MLAS optimizations that are sufficient to show a 3x improvement over the default schedule.
This version was already in the Github repo for some time.
Full Changelog: v1.2.3...v1.2.4