feat: Windows SHA-stamped surrogate filename and configurable surrogate pool size #1339
ludfjig
left a comment
Did you consider making these knobs on SandboxConfiguration instead?
src/hyperlight_host/src/hypervisor/surrogate_process_manager.rs
Yes, but I wanted the values to apply equally to every surrogate process manager in the process, and there isn't a place to put them. The values are also not really sandbox configuration. Maybe we can look at this again when we revise the API.
d91522c to d933e50
Hmm, not sure I understand this. You really want the config per process, not per sandbox?
The config cannot be per sandbox: all sandboxes in a single hyperlight implementation share the same surrogate process manager and surrogate processes.
src/hyperlight_host/src/hypervisor/surrogate_process_manager.rs
ludfjig
left a comment
LGTM. nit: there's a potential TOCTOU issue with if !p.exists() { ... , where another process can create the file in between the check and the file creation. Could be fixed with something like https://doc.rust-lang.org/std/fs/struct.File.html#method.create_new, but up to you, not critical.
We also seem to be using both blake3 and sha256, maybe we can remove 1 of these dependencies? Not blocking this PR though
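For reference, the TOCTOU fix suggested above could look roughly like the sketch below. It is not the code in this PR. The helper name write_if_absent is hypothetical, and it uses OpenOptions::create_new, the builder form of the same atomic "create only if absent" open that File::create_new provides.

```rust
use std::fs::OpenOptions;
use std::io::{self, ErrorKind, Write};
use std::path::Path;

/// Hypothetical helper: writes `bytes` to `path` only if no file exists there.
/// Returns Ok(true) if this call created the file, Ok(false) if it already existed.
fn write_if_absent(path: &Path, bytes: &[u8]) -> io::Result<bool> {
    // create_new(true) folds the existence check and the creation into a single
    // atomic open, so there is no window between "check" and "create".
    match OpenOptions::new().write(true).create_new(true).open(path) {
        Ok(mut file) => {
            file.write_all(bytes)?;
            Ok(true)
        }
        // Someone else (possibly another process) created the file first.
        Err(e) if e.kind() == ErrorKind::AlreadyExists => Ok(false),
        Err(e) => Err(e),
    }
}
```

One caveat with this sketch: a caller that gets Ok(false) may be looking at a file that another process is still writing, so it only closes the check-then-create race, not every ordering problem.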
src/hyperlight_host/src/hypervisor/surrogate_process_manager.rs
- Use SHA-256 content hash in surrogate binary filename
(hyperlight_surrogate_{sha8}.exe) to eliminate cross-version
ACCESS_DENIED race when multiple hyperlight versions coexist.
- Add HYPERLIGHT_INITIAL_SURROGATES env var (1-512, default 512)
to control how many surrogate processes are pre-created at startup.
- Add HYPERLIGHT_MAX_SURROGATES env var (>=initial, <=512, default 512)
as hard cap with on-demand CAS growth when pool is exhausted.
- Rollback created_count on process creation failure to prevent
permanent capacity loss from transient errors.
- Increment created_count per-process (not store-after-loop) to
prevent count drift on partial init failure.
- Warn when env var values are clamped to valid range.
- Add tests for env var parsing (with #[serial] for thread safety)
and locked-file extraction resilience.
- Update surrogate development notes documentation.
Signed-off-by: Simon Davies <simongdavies@users.noreply.github.com>
5f1912b
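The created_count handling described in the commit message above (CAS growth up to a hard cap, per-process increment, rollback on failure) can be illustrated with a minimal sketch. The PoolCounter type and its method names are hypothetical and are not taken from surrogate_process_manager.rs. The create closure stands in for whatever actually spawns a surrogate process.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical counter for a bounded pool; not the real manager type.
struct PoolCounter {
    created: AtomicUsize,
    max: usize,
}

impl PoolCounter {
    fn new(max: usize) -> Self {
        PoolCounter { created: AtomicUsize::new(0), max }
    }

    fn created(&self) -> usize {
        self.created.load(Ordering::Acquire)
    }

    /// Reserve one slot with a compare-and-swap loop; false if the cap is reached.
    fn try_reserve(&self) -> bool {
        let mut current = self.created.load(Ordering::Acquire);
        loop {
            if current >= self.max {
                return false;
            }
            match self.created.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                // Another thread changed the count; retry with the observed value.
                Err(observed) => current = observed,
            }
        }
    }

    /// Reserve a slot, run `create`, and give the slot back if creation fails,
    /// so a transient error does not permanently reduce capacity.
    /// Returns None when the pool is already at its cap.
    fn grow_with<T, E>(&self, create: impl FnOnce() -> Result<T, E>) -> Option<Result<T, E>> {
        if !self.try_reserve() {
            return None;
        }
        let result = create();
        if result.is_err() {
            self.created.fetch_sub(1, Ordering::AcqRel);
        }
        Some(result)
    }
}
```

Because the count moves one process at a time, a failure part-way through initial pool creation leaves the counter equal to the number of processes that actually exist.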
d933e50 to 5f1912b
Fixed
Fixes hyperlight surrogate process manager on Windows:
If two different implementations of hyperlight are used by a single host (e.g. hyperlight and hyperlight-js), they may, although this is highly unlikely, carry different versions of the hyperlight surrogate binary (if their hyperlight versions differ). More likely they use the same version, in which case the second implementation tries to overwrite the surrogate process exe that is already in use.
If there are multiple implementations of hyperlight, each with its own surrogate process manager, then under the current behavior each one spins up 512 surrogate processes. Not only does this waste resources and take time, it also means that two implementations create 1024 surrogate processes, which is more than can be supported in a single process.
The file-copying issue prevents hyperagent from running on Windows (as it uses both hyperlight and hyperlight-js). hyperagent also does not need the overhead of 512 surrogate processes.
There are other scenarios where hyperlight may be used where this upfront creation of surrogate processes is both unnecessary and wasteful.
This PR introduces the following changes to deal with this:
- Use SHA-256 content hash in surrogate binary filename (hyperlight_surrogate_{sha8}.exe) to eliminate cross-version ACCESS_DENIED race when multiple hyperlight versions coexist.
- Add HYPERLIGHT_INITIAL_SURROGATES env var (1-512, default 512) to control how many surrogate processes are pre-created at startup.
- Add HYPERLIGHT_MAX_SURROGATES env var (>=initial, <=512, default 512) as hard cap with on-demand CAS growth when pool is exhausted.
- Add tests for env var parsing (with #[serial] for thread safety) and locked-file extraction resilience.
- Update surrogate development notes documentation.
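To make the documented ranges for the two env vars concrete, here is a minimal sketch of the clamping rules. The function names are hypothetical and this is not the parsing code from the PR. Falling back to the default of 512 for an unset or unparsable value is an assumption, and the warning emitted when a value is clamped is left out.

```rust
/// Upper bound shared by both settings, per the ranges listed above.
const SURROGATE_LIMIT: usize = 512;

/// Parse one raw value and clamp it into [min, 512].
/// Assumption: unset or unparsable input falls back to the default of 512.
fn parse_clamped(raw: Option<&str>, min: usize) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(SURROGATE_LIMIT)
        .clamp(min, SURROGATE_LIMIT)
}

/// Returns (initial, max) with 1 <= initial <= max <= 512.
fn pool_sizes(initial_raw: Option<&str>, max_raw: Option<&str>) -> (usize, usize) {
    let initial = parse_clamped(initial_raw, 1);
    // The max can never be lower than the initial pool size.
    let max = parse_clamped(max_raw, initial);
    (initial, max)
}

/// Reads the two env vars named in this PR and applies the rules above.
fn pool_sizes_from_env() -> (usize, usize) {
    let initial = std::env::var("HYPERLIGHT_INITIAL_SURROGATES").ok();
    let max = std::env::var("HYPERLIGHT_MAX_SURROGATES").ok();
    pool_sizes(initial.as_deref(), max.as_deref())
}
```

Keeping the parsing in a function that takes plain strings means most cases can be tested without mutating process-wide environment variables, which is the reason the env var tests in this PR need #[serial].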