Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finish support for devirtualizing the HAL API. #6641

Closed
benvanik opened this issue Aug 4, 2021 · 4 comments
Closed

Finish support for devirtualizing the HAL API. #6641

benvanik opened this issue Aug 4, 2021 · 4 comments
Labels
hal/api IREE's public C hardware abstraction layer API performance ⚡ Performance/optimization related work across the compiler and runtime
Projects

Comments

@benvanik
Copy link
Collaborator

benvanik commented Aug 4, 2021

The HAL C interface was designed to be devirtualized in size-optimized builds where only a single HAL backend is linked in. The implementation was not completed, though, and needs finishing.

The intent is that all of the vtables used within the HAL like iree_hal_buffer_vtable_t are stripped and dispatches to the vtables that all route through IREE_HAL_VTABLE_DISPATCH become direct function calls.

Instead of having a vtable that references some implementation function like iree_hal_heap_buffer_map_range, an alias for iree_hal_buffer_map_range_direct that points at iree_hal_heap_buffer_map_range would be created. Then any dispatch to the vtable through that macro becomes a direct call to type_prefix##method_name#_direct. The vtable is then never used, stripped, and LTO takes care of stripping unused HAL implementations and such.

This would reduce total binary size as only those HAL methods used would be required, improves call overhead as direct functions are used vs indirections through the vtables, and reduces binary size as no per-type vtables need to be stored.

There's a few gotchas that were never finished - like we still may need distinct vtable-like identifiers for type checking - but they can be unique byte memory locations vs. a full vtable with function pointers such that simple pointer equality works. We'll also need something that emits the _direct aliases - either automatically, via copy/paste, or macro magic around vtable definition. There may be some easier techniques for that too like making the vtable static and having _direct calls go through the vtable - in theory a compiler should be plenty smart to devirtualize that for us (so we still have the vtable struct but it's only ever used from within the compilation unit).

@benvanik benvanik created this issue from a note in Tiny IREE (Backlog) Aug 4, 2021
@benvanik benvanik added hal/api IREE's public C hardware abstraction layer API performance ⚡ Performance/optimization related work across the compiler and runtime labels Aug 4, 2021
@ergawy
Copy link
Contributor

ergawy commented Aug 16, 2021

Hello,
As mentioned on #34, I would like to start working on this issue.

I had a look under the hal directory to better understand what's there. However, since this is huge, I still miss the starting point. Since you stated devirtualization was started but not finished, can you point me to the relevant parts of the code where devirtualization is currently implemented and the test code that exercises it (if any)?

(note: I am new the runtime world so I hope you bear with the possibly stupid questions I might have :))

@benvanik
Copy link
Collaborator Author

Did some tests with the current style and the struct handle style: https://godbolt.org/z/3Wr6Ko5ET
clang is impressive O_O

We'd need to analyze our usage to see if this optimization is able to kick in - if it does work then we're done 🎉
This does require LTO, but anyone caring about performance or binary size needs to use LTO anyway.

@benvanik
Copy link
Collaborator Author

benvanik commented Dec 17, 2021

The static const on the vtable is load-bearing to DCE - for some reason extern const is used in IREE right now - LTO still kills it but it'd be nice to fix that.

update: c++ needs it to be extern; I can fix up all the C impls though

benvanik added a commit that referenced this issue Dec 17, 2021
benvanik added a commit that referenced this issue Dec 17, 2021
benvanik added a commit that referenced this issue Dec 17, 2021
benvanik added a commit that referenced this issue Dec 17, 2021
benvanik added a commit that referenced this issue Dec 17, 2021
benvanik added a commit that referenced this issue Dec 19, 2021
The extra validation can still be toggled with the build option to
remove the binary size overhead but now is enabled by default unless
requested per-command buffer. This avoids one additional allocation and
the additional indirect jump that was defeating devirtualization analysis.

Progress on #6641.
benvanik added a commit that referenced this issue Dec 20, 2021
The extra validation can still be toggled with the build option to
remove the binary size overhead but now is enabled by default unless
requested per-command buffer. This avoids one additional allocation and
the additional indirect jump that was defeating devirtualization analysis.

Progress on #6641.
benvanik added a commit that referenced this issue Dec 20, 2021
The extra validation can still be toggled with the build option to
remove the binary size overhead but now is enabled by default unless
requested per-command buffer. This avoids one additional allocation and
the additional indirect jump that was defeating devirtualization analysis.

Progress on #6641.
@benvanik
Copy link
Collaborator Author

Not going to pursue this for now. Making the vtables smaller would be more worthwhile.

@benvanik benvanik closed this as not planned Won't fix, can't repro, duplicate, stale Jun 30, 2022
Tiny IREE automation moved this from Backlog to Done Jun 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hal/api IREE's public C hardware abstraction layer API performance ⚡ Performance/optimization related work across the compiler and runtime
Projects
No open projects
Archived in project
Development

No branches or pull requests

2 participants