Skip to content

Conversation

pbalcer
Copy link
Contributor

@pbalcer pbalcer commented Sep 25, 2025

Devices store a raw pointer to back to their owning Platform. Platforms are stored directly inside of a vector. Modifying this vector risks invalidating all the platform pointers stored in devices.

This change moves the host platform emplace to happen before device object creation, eliminating the risk of vector reallocation after platform pointers are set.

This popped up when trying to use #158900 with liboffload. The platforms vector was reallocated and liboffload segfaulted when iterating over the platforms. This is an attempt at the simplest possible fix.

Since the number of platforms is fixed, I initially wanted to refactor this code so that it became impossible to modify the platforms container after Context is created, but, when that patch grew large, I pivoted to this instead. Another option is to use shared_ptr/weak_ptr to allocate the Platform instead of storing it in the vector directly.

Given, #159636, which requires a non-trivial change to this same code anyway, I'm thinking I can do follow-up refactor that tackles both issues at the same time.
Thoughts?

Copy link

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Copy link
Member

llvmbot commented Sep 25, 2025

@llvm/pr-subscribers-offload

Author: Piotr Balcer (pbalcer)

Changes

Devices store a raw pointer to back to their owning Platform. Platforms are stored directly inside of a vector. Modifying this vector risks invalidating all the platform pointers stored in devices.

This change moves the host platform emplace to happen before device object creation, eliminating the risk of vector reallocation after platform pointers are set.

This popped up when trying to use #158900 with liboffload. The platforms vector was reallocated and liboffload segfaulted when iterating over the platforms. This is at attempt at the simplest possible fix.

Since the number of platforms is fixed, I initially wanted to refactor this code so that it became impossible to modify the platforms container after Context is created, but, when that patch grew large, I pivoted to this instead. Another option is to use shared_ptr/weak_ptr to allocate the Platform instead of storing it in the vector directly.

Given, #159636, which requires a non-trivial change to this same code anyway, I'm thinking I can do follow-up refactor that tackles both issues at the same time.
Thoughts?


Full diff: https://github.com/llvm/llvm-project/pull/160702.diff

1 Files Affected:

  • (modified) offload/liboffload/src/OffloadImpl.cpp (+12-2)
diff --git a/offload/liboffload/src/OffloadImpl.cpp b/offload/liboffload/src/OffloadImpl.cpp
index 08a2e25b97d85..0701ff12c4953 100644
--- a/offload/liboffload/src/OffloadImpl.cpp
+++ b/offload/liboffload/src/OffloadImpl.cpp
@@ -256,9 +256,18 @@ Error initPlugins(OffloadContext &Context) {
           pluginNameToBackend(#Name)});                                        \
   } while (false);
 #include "Shared/Targets.def"
+  // Host platform must be added before initializing plugin devices because
+  // devices contain a pointer back to the owning platform, and modifying the
+  // Platforms vector risks reallocating the underlying storage, thus invalidating
+  // all the platform pointers.
+  Context.Platforms.emplace_back(
+    ol_platform_impl_t{nullptr, OL_PLATFORM_BACKEND_HOST});
 
   // Preemptively initialize all devices in the plugin
   for (auto &Platform : Context.Platforms) {
+    if (Platform.BackendType == OL_PLATFORM_BACKEND_HOST)
+      continue;
+
     auto Err = Platform.Plugin->init();
     [[maybe_unused]] std::string InfoMsg = toString(std::move(Err));
     for (auto DevNum = 0; DevNum < Platform.Plugin->number_of_devices();
@@ -273,10 +282,11 @@ Error initPlugins(OffloadContext &Context) {
       }
     }
   }
+  // The Context.Platforms cannot be modified after this point without updating
+  // all the Device.Platform pointers.
 
   // Add the special host device
-  auto &HostPlatform = Context.Platforms.emplace_back(
-      ol_platform_impl_t{nullptr, OL_PLATFORM_BACKEND_HOST});
+  auto &HostPlatform = Context.Platforms.back();
   HostPlatform.Devices.emplace_back(
       std::make_unique<ol_device_impl_t>(-1, nullptr, nullptr, InfoTreeNode{}));
   Context.HostDevice()->Platform = &HostPlatform;

Copy link
Contributor

@jhuber6 jhuber6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How difficult would it be to just make SmallVector<ol_platform_impl_t, 4> Platforms{}; take a unique pointer instead? In the future I really want to remove this restriction that all platforms and devices must be initialized at init time, as it's extremely expensive if you don't end up needing them.

@pbalcer
Copy link
Contributor Author

pbalcer commented Sep 25, 2025

How difficult would it be to just make SmallVector<ol_platform_impl_t, 4> Platforms{}; take a unique pointer instead?

I was thinking of SmallVector<shared_ptr<ol_platform_impl_t>, 4> Platforms{}; in the context and weak_ptr<ol_platform_impl_t> Platform in devices. That's one of the options I mentioned in the commit message.
Using unique_ptr would probably be simpler, but would require the devices to have a non-owning raw pointer taken from that unique_ptr (or a reference maybe? platforms will outlive devices).
I have no strong preference.

In the future I really want to remove this restriction that all platforms and devices must be initialized at init time, as it's extremely expensive if you don't end up needing them.

I did see #159636, maybe we could represent that as a fixed-size collection of optionals that are initialized on first use? Given we have an enum of backends (enum ol_platform_backend_t), it could be a simple array. Devices could just store the enum / index into that array.

Devices store a raw pointer to back to their owning Platform.
Platforms are stored directly inside of a vector. Modifying this
vector risks invalidating all the platform pointers stored in devices.

This change moves the host platform emplace to happen before
device object creation, eliminating the risk of vector reallocation
after platform pointers are set.
@jhuber6
Copy link
Contributor

jhuber6 commented Sep 25, 2025

How difficult would it be to just make SmallVector<ol_platform_impl_t, 4> Platforms{}; take a unique pointer instead?

I was thinking of SmallVector<shared_ptr<ol_platform_impl_t>, 4> Platforms{}; in the context and weak_ptr<ol_platform_impl_t> Platform in devices. That's one of the options I mentioned in the commit message. Using unique_ptr would probably be simpler, but would require the devices to have a non-owning raw pointer taken from that unique_ptr (or a reference maybe? platforms will outlive devices). I have no strong preference.

I figured we'd just keep a reference since the platform outlives all the devices, but I don't know the code intimately.

In the future I really want to remove this restriction that all platforms and devices must be initialized at init time, as it's extremely expensive if you don't end up needing them.

I did see #159636, maybe we could represent that as a fixed-size collection of optionals that are initialized on first use? Given we have an enum of backends (enum ol_platform_backend_t), it could be a simple array. Devices could just store the enum / index into that array.

Since all access goes through this, I was wondering if we turned it first into a smart pointers we could then wrap that interface with some kind of lazy initializer that checks if it's active and does so on first use. Would mean we have some fixed overhead on every single API call, but I'm assuming it's worthwhile.

@pbalcer
Copy link
Contributor Author

pbalcer commented Sep 26, 2025

Closing in favor of #160888

@pbalcer pbalcer closed this Sep 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants