
Adding an executable environment and plumbing through processor information. #8436

Merged
benvanik merged 3 commits into main from benvanik-processor-info
Mar 8, 2022

Conversation

@benvanik
Collaborator

@benvanik benvanik commented Mar 2, 2022

This adds an iree_hal_executable_environment_*_t struct to wrap up the existing import table, a new processor info struct, and a new reserved spot for specialization constants.

An "executable environment" is now provided to executables as they are loaded as well as on each dispatch invocation. On initial query the executable code can use the optional processor information, import functions, and executable-level specialization constants to decide which implementation of functions to use by switching the pointers returned in the library table. There may be future work around that area to support different concurrently specialized versions of the same library, but for today our usage doesn't trigger that issue. Dispatch functions receive the same processor information, imports, and constants, which enables per-invocation specialization (switching paths based on the current processor microarchitecture in a heterogeneous system, using a runtime-derived executable-level constant computed by the host code such as workgroup sizes, etc.).
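As a rough illustration of the load-time selection described above, here is a minimal sketch in C. Every name in it (`example_environment_t`, `example_library_query`, `EXAMPLE_FEATURE_WIDE_SIMD`, the field index, the array size) is hypothetical and chosen only for this example; the actual struct layouts and query signature live in iree/hal/local/executable_library.h and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-ins for the structs this PR describes; not the real ABI.
typedef struct example_processor_t {
  uint64_t data[8];  // opaque fields filled in by the runtime (size assumed)
} example_processor_t;

typedef struct example_environment_t {
  const uint32_t* constants;      // reserved spot for specialization constants
  const void* const* imports;     // import table
  example_processor_t processor;  // optional processor information
} example_environment_t;

typedef int (*example_dispatch_fn_t)(const example_environment_t* env, int x);

typedef struct example_library_t {
  example_dispatch_fn_t dispatch;
} example_library_t;

// Assumed field index and feature bit, for illustration only.
enum { EXAMPLE_FEATURE_FIELD = 0 };
#define EXAMPLE_FEATURE_WIDE_SIMD (UINT64_C(1) << 0)

// Two implementations of the same dispatch: a baseline and a specialized one.
static int dispatch_baseline(const example_environment_t* env, int x) {
  (void)env;
  return x * 2;
}
static int dispatch_wide_simd(const example_environment_t* env, int x) {
  (void)env;
  return x * 2;  // same result; imagine a vectorized body here
}

static const example_library_t kBaselineLibrary = {dispatch_baseline};
static const example_library_t kWideSimdLibrary = {dispatch_wide_simd};

// Called once at load time: picks which function pointers to hand back.
// Missing processor information falls back to the baseline.
static const example_library_t* example_library_query(
    const example_environment_t* env) {
  if (env && (env->processor.data[EXAMPLE_FEATURE_FIELD] &
              EXAMPLE_FEATURE_WIDE_SIMD)) {
    return &kWideSimdLibrary;
  }
  return &kBaselineLibrary;
}

// Called per invocation: the dispatch sees the same environment again.
static int example_invoke(const example_environment_t* env, int x) {
  const example_library_t* library = example_library_query(env);
  assert(library && library->dispatch);
  return library->dispatch(env, x);
}
```

This sketch returns one of two immutable tables rather than patching a shared table in place, which sidesteps the concurrent-specialization question mentioned above; that is a choice made for the example, not something this PR prescribes.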

The general equivalence to regular software is that flags provided to clang become HAL configurations in iree-translate, runtime queries done once to switch between implementations compiled in become specialization constants, and queries done per invocation become the environment/processor info. This design gives us the same capabilities normal software has while preserving our intra-architecture artifact portability and forward compatibility along with the ability to separate host and device (running in enclaves, on remote devices, etc).

Isolation is being maintained between the internal parts of the runtime that only need to be source compatible (such as how to query the current processor ordinal via a platform/arch-specific impl) and those that bridge into the compiler. The executable library structs are part of the public stable ABI for the executables and, once we go stable, can only be appended to or versioned coarsely. By construction it's not possible to load an executable compiled for any other major architecture than the one doing the loading, so each can have its own data format and there's no need to normalize all the various architectural quirks - instead the runtime just passes data by memcpy from some function that produces it to the executable that uses it. Beyond keeping things simple, this also allows the runtime to be architecture-agnostic with respect to out-of-tree ISAs/OSes/bare-metal HALs/etc.; there's no enum of architectures or list of features in the runtime that needs to change beyond custom data query functions that can be overlaid into the build.
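To make the "runtime just copies a blob" point concrete, here is a small hedged sketch. The names `example_query_processor_data` and `example_populate_processor`, the field count, and the placeholder value are all assumptions for illustration; the idea is only that a platform-specific (possibly out-of-tree) query function produces opaque fields and the runtime forwards them without interpreting them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_PROCESSOR_DATA_CAPACITY 8  // assumed size, not the real ABI

typedef struct example_processor_t {
  uint64_t data[EXAMPLE_PROCESSOR_DATA_CAPACITY];
} example_processor_t;

// Hypothetical platform/arch-specific query. An out-of-tree port would supply
// its own version of this function; the rest of the runtime never looks at
// what the bits mean. Returns the number of fields written.
static size_t example_query_processor_data(uint64_t* out_fields,
                                           size_t capacity) {
  if (capacity == 0) return 0;
  out_fields[0] = UINT64_C(0x1);  // placeholder bits with arch-defined meaning
  return 1;
}

// Architecture-agnostic side: zero-fill, query, and copy the blob across.
// Fields the query did not provide stay zero, so consumers can treat zero
// as "no information".
static void example_populate_processor(example_processor_t* out_processor) {
  assert(out_processor != NULL);
  uint64_t fields[EXAMPLE_PROCESSOR_DATA_CAPACITY] = {0};
  (void)example_query_processor_data(fields, EXAMPLE_PROCESSOR_DATA_CAPACITY);
  memcpy(out_processor->data, fields, sizeof(out_processor->data));
}
```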

The processor data itself is represented as a set of opaque uint64_t fields. One could just store the architectural info registers right in the data, though of all the thousands of bits there are only a handful we care about and we can append to the set as we need them. A benefit of the arch registers would be that the runtime does not need to be updated to run models using newer features, but that's risky (it may mean the kernel or hosting code doesn't support it - like trying to use avx512 in an app compiled for avx2). The arch registers may not always be available to query either (sandboxing/fingerprinting avoidance), so having something that is robust to partial information is important. Another example use of the fields is a bitmap indicating which logical processors are running on a little core in a big.LITTLE setup, such that testing per-invocation which code path should be used becomes (processor.data[CORE_TYPE_FIELD] >> processor_id) & 1. All other information about cache topology and such can come through specialization constants queried from the host side, as that's what needs to adjust workgroup parameters anyway. The first few bits we add will likely be those from iree/tools/utils/cpu_features.h.
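The big.LITTLE bitmap test above can be sketched as follows. The field index `EXAMPLE_CORE_TYPE_FIELD`, the struct name, and the eight-field size are assumptions for illustration; only the shift-and-mask expression comes from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Assumed index of the field holding the core-type bitmap (hypothetical).
enum { EXAMPLE_CORE_TYPE_FIELD = 1 };

typedef struct example_processor_t {
  uint64_t data[8];  // opaque fields; size assumed for this sketch
} example_processor_t;

// Returns true when bit `processor_id` of the core-type bitmap is set,
// meaning that logical processor is a little core. A single uint64_t field
// covers logical processors 0..63.
static inline bool example_is_little_core(const example_processor_t* processor,
                                          uint32_t processor_id) {
  assert(processor_id < 64);
  return ((processor->data[EXAMPLE_CORE_TYPE_FIELD] >> processor_id) & 1u) != 0;
}
```

With a bitmap of `0x0F`, logical processors 0 through 3 test as little cores and the rest as big cores. An all-zero field (no information available) reports every core as big, so the default code path should be one that is safe everywhere.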

Note that the intent here is that executables we produce from the compiler always run on some baseline; unless a user is hyper optimizing for size the additional fallback paths would at most double the small executable size but realistically only increase it a marginal amount: not every dispatch is going to use all of the conditional features and need to be duplicated. There's going to need to be some compiler infra to make that happen, though (plumbing in MLIR through to LLVM's function multi-versioning), so today we're probably stuck with whatever the user specifies to the compiler :(

Progress on #5417. Follow-on PRs will fold in the iree/tools/utils/cpu_features.h work though there will need to be some discussion about how to represent it. Prepares for #8469.
Work remains to wire up the ability to specify the specialization constants in the compiler/runtime HAL layers; what's here is just the plumbing for CPU.

@benvanik benvanik added compiler/dialects Relating to the IREE compiler dialects (flow, hal, vm) runtime Relating to the IREE runtime library codegen/llvm LLVM code generation compiler backend hal/cpu Runtime Host/CPU-based HAL backend labels Mar 2, 2022
@benvanik benvanik force-pushed the benvanik-processor-info branch 2 times, most recently from 9407628 to 81395f6 Compare March 2, 2022 15:09
@iree-github-actions-bot
Contributor

iree-github-actions-bot commented Mar 2, 2022

Abbreviated Benchmark Summary

@ commit 2412f882f2f096beaef70726bd6f9e18ed9768f4 (vs. base c3cefa5151e0d94d08dbf2ed9019594df24dd3fd)

Improved Benchmarks 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileNetV3Small [fp32,imagenet] (TFLite) big-core,full-inference,experimental-flags with IREE-Dylib-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 14 (vs. 15, 6.67%↓) | 14 | 1 |
| PoseNet [fp32] (TFLite) full-inference,experimental-flags with IREE-Vulkan @ Pixel-6-Pro (GPU-Mali-G78) | 16 (vs. 17, 5.88%↓) | 16 | 0 |


@benvanik
Collaborator Author

benvanik commented Mar 5, 2022

(this intentionally doesn't decide what information we give to executables or how we use it in codegen, but defines how we move that information from the runtime into executable code)

Comment thread iree/base/internal/BUILD
Comment thread iree/hal/local/executable_library.h Outdated
Comment thread iree/hal/local/executable_library.h Outdated
Comment thread experimental/web/sample_static/device_sync.c
@benvanik benvanik force-pushed the benvanik-processor-info branch from 1798703 to 2412f88 Compare March 7, 2022 18:33
@MaheshRavishankar MaheshRavishankar removed their request for review March 8, 2022 00:30
@MaheshRavishankar
Collaborator

Thanks Ben. I can't really grok all the details here, but I know the intent. I'll play around with this a bit more and sync with you if I have any questions.

@benvanik
Collaborator Author

benvanik commented Mar 8, 2022

No problem - you can consider it mostly FYI that this is now possible to use. It's not taking any opinions about the actual data used or how it integrates into codegen as I figured that'd be a bigger design discussion - at the end of the day all this does is memcpy a blob :P

benvanik and others added 3 commits March 7, 2022 16:46
This adds an iree_hal_executable_environment_*_t struct to wrap up
the existing import table, a new processor info struct, and a new
reserved spot for specialization constants.
Matches the style of other parameter structs.
@benvanik benvanik force-pushed the benvanik-processor-info branch from 2412f88 to 9f3b9fe Compare March 8, 2022 00:47
@benvanik benvanik merged commit 7e1322e into main Mar 8, 2022
@benvanik benvanik deleted the benvanik-processor-info branch March 8, 2022 01:54
ScottTodd added a commit that referenced this pull request Mar 29, 2022
…8672)

Tested on Windows with emsdk 3.1.8

---

Warnings are now treated as errors thanks to applying `IREE_DEFAULT_COPTS`. I hesitate to use an internal IREE CMake variable in a sample, but the convenience seems worth it here. Anyone forking this can just set the options they want.

---

Mismatched pointer type was from #8436 (since these don't build/test on presubmit and warnings were not treated as errors before). These samples now match https://github.com/google/iree/blob/2cf4a787ad631be33b092eaf65223252fd63eab8/iree/samples/static_library/static_library_demo.c#L33-L41

---

The `-sMAIN_MODULE=2` change is to fix this warning:

```
[10/10] Linking C executable experimental\web\sample_dynamic\web-sample-dynamic-sync.js
emcc: warning: EXPORTED_FUNCTIONS is not valid with LINKABLE set (normally due to SIDE_MODULE=1/MAIN_MODULE=1) since all functions are exported this mode.  To export only a subset use SIDE_MODULE=2/MAIN_MODULE=2 [-Wunused-command-line-argument]
```

See the docs: https://emscripten.org/docs/compiling/Dynamic-Linking.html#code-size. Setting to `2` enables DCE and `EXPORTED_FUNCTIONS` again (we tightly control the symbols we need so the default export all behavior isn't what we want).
