IREE: An Experimental MLIR Execution Environment
DISCLAIMER: This is not an officially supported Google product. It's an experimental playground for low-level/tightly integrated machine learning libraries that make use of modern hardware acceleration APIs and techniques (see non goals).
Table of Contents
IREE (Intermediate Representation Execution Environment, pronounced as "eerie") is an experimental compiler backend for MLIR that lowers ML models to an IR that is optimized for real-time mobile/edge inference against heterogeneous hardware accelerators.
The IR produced contains the sequencing information required to communicate pipelined data dependencies and parallelism to low-level hardware APIs like Vulkan and embed hardware/API-specific binaries such as SPIR-V or compiled ARM code. As the IR is specified against an abstract execution environment there are many potential ways to run a compiled model, and one such way is included as an example and testbed for runtime optimization experiments.
The included layered runtime scales from generated code for a particular API (such as emitting C code calling external DSP kernels), to a HAL (Hardware Abstraction Layer) that allows the same generated code to target multiple APIs (like Vulkan and Direct3D 12), to a full VM allowing runtime model loading for flexible deployment options and heterogeneous execution. Consider both the compiler and the included runtime a toolbox for making it easier - via the versatility of MLIR - to take ML models from their source to some varying degree of integration with your application.
IREE has been developed alongside MLIR and is used as an example of how non-traditional ML compiler backends and runtimes can be built: it focuses more on the math being performed and how that math is scheduled rather than graphs of "ops" and in some cases allows doing away with a runtime entirely. It seeks to show how more holistic approaches that exploit the MLIR framework and its various dialects can be both easy to understand and powerful in the optimizations to code size, runtime complexity, and performance they enable.
Demonstrate Advanced Models
By using models with much greater complexity than the usual references (such as MobileNet) we want to show how weird things can get when model authors are allowed to get creative: dynamic shapes, dynamic flow control, dynamic multi-model dispatch (including models that conditionally dispatch other models), streaming models, tree-based search algorithms, etc. We are trying to build IREE from the ground-up to enable these models and run them efficiently on modern hardware. Many of our example models are sequence-to-sequence language models from the Lingvo project representing cutting edge speech recognition and translation work.
An observation that has driven the development of IREE is one of ML workloads not being much different than traditional game rendering workloads: math is performed on buffers with varying levels of concurrency and ordering in a pipelined fashion against accelerators designed to make such operations fast. In fact, most ML is performed on the same hardware that was designed for games! Our approach is to use the compiler to transform ML workloads to ones that look eerily (pun intended) similar to what a game performs in per-frame render workloads, optimize for low-latency and predictable execution, and integrate well into existing systems both for batched and interactive usage. The IREE runtime is designed to feel more like game engine middleware than a standalone ML inference system, though we still have much work towards that goal. This should make it easy to use existing tools for high-performance/low-power optimization of GPU workloads, identify driver or system issues introducing latency, and help to improve the ecosystem overall.
Demonstrate Standards-based ML via Vulkan and SPIR-V
With the above observation that ML can look like games from the systems perspective it follows that APIs and technologies good for games should probably also be good for ML. In many cases we've identified only a few key differences that exist and just as extensions have been added and API improvements have been made to graphics/compute standards for decades we hope to demonstrate and evaluate small, tactical changes that can have large impacts on ML performance through these standard APIs. We would love to allow hardware vendors to be able to make ML efficient on their hardware without the need for bespoke runtimes and special access such that any ML workload produced by any tool runs well. We'd consider the IREE experiment a success if what resulted was some worked examples that help advance the entire ecosystem!
- Replace parts of the supported TensorFlow ecosystem of tools: The authors within Google work closely with TensorFlow and contribute to it regularly. However, IREE is exploring some different angles of the problem and is experimental. We will seek to leverage anything of value that we learn in an appropriate way to make TensorFlow better over time, but the two should not be conflated.
- Providing an SLA of any kind: IREE is infrastructure research, not a supported product. If it gains mind-share or traction, we would revisit that in conjunction with finding a more permanent way to align it with the broader constellation of ML tooling.
We are currently just at the starting line, with basic MNIST MLP running end-to-end on both a CPU interpreter and Vulkan. As we scale out the compiler we will be increasing the complexity of the models that can run and demonstrating more of the optimizations we've found useful in previous efforts to run them efficiently on edge devices.
A short-term Roadmap is available talking about the major areas where are focusing on in addition to the more infrastructure-focused work listed below.
We'll be setting up GitHub milestones with issues tracking major feature work we are planning. For now, our areas of work are:
- Allocation tracking and aliasing in the compiler
- Pipelined scheduler in the VM for issuing proper command buffers
- New CPU interpreter that enables lightweight execution on ARM and x86
- C code generator and API to demonstrate "runtimeless" mode
- Quantization using the MLIR quantization framework
- High-level integration and examples when working with TensorFlow 2.0
- Switching from IREE's XLA-to-SPIR-V backend to the general MLIR SPIR-V backend
Things we are interested in but don't yet have in-progress:
- Ahead-of-time compiled ARM NEON backend (perhaps via SPIRV-LLVM, SPIRV-to-ISPC, or some other technique)
- HAL backends for Metal 2 and Direct3D 12
- Profile-guided optimization support for scheduling feedback
Coming soon :)
Build System and CI
We are still working to port our build configuration to Bazel and will be open sourcing the BUILD files and Starlark rules we use soon. We will also be maintaining a CMake build for integration with the dependent projects (Abseil and MLIR).
Code and Style
The project is currently very early and a mix of code written prior to a lot of the more recent ergonomics improvements in MLIR and its tablegen. Future changes will replace the legacy code style with prettier forms and simplify the project structure to make it easier to separate the different components. Some entire portions of the code (such as the CPU interpreter) will likely be dropped or rewritten. For now, assume churn!
The compiler portions of the code (almost exclusively under
follows the LLVM style guide and has the same system requirements as MLIR
itself. It general requires a more modern C++ compiler.
The runtime portions vary but most are designed to work with C++11 and use Abseil to bring in future C++14 and C++17 features.
We are mostly targeting Vulkan and Metal on recent mobile devices and as such
have limited our usage of hardware features and vendor extensions to those we
have broad access to there. This is mainly just to keep our focus tight and does
not preclude usage of features outside the standard sets or for other hardware
types (in fact, we have a lot of fun ideas for
VK_NVX_device_generated_commands and Metal 2.1's Indirect Command Buffers!).
NOTE: during the initial open source release we are still cleaning things up. If there's weird dependencies/layering that makes life difficult for your particular use case please file an issue so we can make sure to fix it.
The compiler has several layers that allow scaling the dependencies required based on the source and target formats. In all cases MLIR is required and for models not originating from TensorFlow (or already in XLA HLO format) it is the only dependency. When targeting the IREE Runtime VM and HAL FlatBuffers is required for serialization. Converting from TensorFlow models requires a dependency on TensorFlow (however only those parts required for conversion).
The VM providing dynamic model deployment and advanced scheduling behavior requires Abseil for its common types, however contributions are welcome to make it possible to replace Abseil with other libraries via shims/forwarding. The core types used by the runtime (excluding command line flags and such in tools) are limited to types coming in future C++ versions (variant, optional, string_view, etc), cheap types (absl::Span), or simple standard containers (absl::InlinedVector). FlatBuffers is used to load compiled modules.
The HAL and the provided implementations (Vulkan, etc) also use Abseil. Contributions are welcome to allow other types to be swapped in. A C99 HAL API is planned for code generation targets that will use no dependencies.
Testing and Tooling
Swiftshader is used to provide fast hardware-independent testing of the Vulkan and SPIR-V portions of the toolchain.
More Coming soon! Performing full model translation may require a few steps (such as ensuring you have a working TensorFlow build), however we'll have pre-translated example models that allow independent testing of the runtime portions.
IREE is licensed under the terms of the Apache license. See LICENSE for more information.