Skip to content

feat: Windows SHA-stamped surrogate filename and configurable surrogate pool size#1339

Merged
simongdavies merged 1 commit intohyperlight-dev:mainfrom
simongdavies:update-surrogate
Mar 26, 2026
Merged

feat: Windows SHA-stamped surrogate filename and configurable surrogate pool size#1339
simongdavies merged 1 commit intohyperlight-dev:mainfrom
simongdavies:update-surrogate

Conversation

@simongdavies
Copy link
Copy Markdown
Contributor

Fixes hyperlight surrogate process manager on Windows:

If two different implementations of hyperlight are used by a single host (e.g. hyperlight and hyperlight-js) then they may (even if this is highly unlikely) have different versions of the hyperlight surrogate binary (if the hyperlight versions are different). More likely is that they use the same version, when this happens, then the second implementation will try to overwrite the in already in use surrogate process exe.

If there are multiple implementations of hyperlight each with their own surrogate process manager then under current behavior each one will spin up 512 surrogate process, not only does this waste resources and take time it also means that there will be 1024 which is more that can be supported in a single process.

The issue with file copying prevents hyperagent from running on Windows (as it uses both hyperlight and hyperlight-js). It also does not need the overhead of 512 surrogate processes.

There are other scenarios where hyperlight may be used where this upfront creation of surrogate processes is both unnecessary and wasteful.

This PR introduces the following changes to deal with this:

  • Use SHA-256 content hash in surrogate binary filename (hyperlight_surrogate_{sha8}.exe) to eliminate cross-version ACCESS_DENIED race when multiple hyperlight versions coexist.

  • Add HYPERLIGHT_INITIAL_SURROGATES env var (1-512, default 512) to control how many surrogate processes are pre-created at startup.

  • Add HYPERLIGHT_MAX_SURROGATES env var (>=initial, <=512, default 512) as hard cap with on-demand CAS growth when pool is exhausted.

  • Add tests for env var parsing (with #[serial] for thread safety) and locked-file extraction resilience.

  • Update surrogate development notes documentation.

Copy link
Copy Markdown
Contributor

@ludfjig ludfjig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider making these knobs on SandboxConfigurtion instead?

@simongdavies
Copy link
Copy Markdown
Contributor Author

Did you consider making these knobs on SandboxConfigurtion instead?

Yes, but since I wanted the values to apply equally to any surrogate process managers in the process there isn't a place to put it, the values are also not SandboxConfiguration, I think maybe when we revise the API we can look at this again

@simongdavies simongdavies force-pushed the update-surrogate branch 3 times, most recently from d91522c to d933e50 Compare March 25, 2026 12:14
@ludfjig
Copy link
Copy Markdown
Contributor

ludfjig commented Mar 25, 2026

Did you consider making these knobs on SandboxConfigurtion instead?

Yes, but since I wanted the values to apply equally to any surrogate process managers in the process there isn't a place to put it, the values are also not SandboxConfiguration, I think maybe when we revise the API we can look at this again

Hmm not sure I understand this? You really want the config per process not per sandbox?

@simongdavies
Copy link
Copy Markdown
Contributor Author

Did you consider making these knobs on SandboxConfigurtion instead?

Yes, but since I wanted the values to apply equally to any surrogate process managers in the process there isn't a place to put it, the values are also not SandboxConfiguration, I think maybe when we revise the API we can look at this again

Hmm not sure I understand this? You really want the config per process not per sandbox?

The config cannot be per sandbox , all sandboxes in a single hyperlight implementation share the same surrogate process manager and surrogate processes

jsturtevant
jsturtevant previously approved these changes Mar 25, 2026
ludfjig
ludfjig previously approved these changes Mar 25, 2026
Copy link
Copy Markdown
Contributor

@ludfjig ludfjig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. nit: there's a potential TOCTOU issue with if !p.exists() { ... where another process can create the file inbetween the check. Could be fixed with something like https://doc.rust-lang.org/std/fs/struct.File.html#method.create_new, but up to you, not critical.

We also seem to be using both blake3 and sha256, maybe we can remove 1 of these dependencies? Not blocking this PR though

dblnz
dblnz previously approved these changes Mar 25, 2026
Copy link
Copy Markdown
Contributor

@dblnz dblnz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

- Use SHA-256 content hash in surrogate binary filename
  (hyperlight_surrogate_{sha8}.exe) to eliminate cross-version
  ACCESS_DENIED race when multiple hyperlight versions coexist.

- Add HYPERLIGHT_INITIAL_SURROGATES env var (1-512, default 512)
  to control how many surrogate processes are pre-created at startup.

- Add HYPERLIGHT_MAX_SURROGATES env var (>=initial, <=512, default 512)
  as hard cap with on-demand CAS growth when pool is exhausted.

- Rollback created_count on process creation failure to prevent
  permanent capacity loss from transient errors.

- Increment created_count per-process (not store-after-loop) to
  prevent count drift on partial init failure.

- Warn when env var values are clamped to valid range.

- Add tests for env var parsing (with #[serial] for thread safety)
  and locked-file extraction resilience.

- Update surrogate development notes documentation.

Signed-off-by: Simon Davies <simongdavies@users.noreply.github.com>
@simongdavies simongdavies dismissed stale reviews from dblnz, ludfjig, and jsturtevant via 5f1912b March 25, 2026 22:06
@simongdavies
Copy link
Copy Markdown
Contributor Author

LGTM. nit: there's a potential TOCTOU issue with if !p.exists() { ... where another process can create the file inbetween the check. Could be fixed with something like https://doc.rust-lang.org/std/fs/struct.File.html#method.create_new, but up to you, not critical.

We also seem to be using both blake3 and sha256, maybe we can remove 1 of these dependencies? Not blocking this PR though

Fixed

@simongdavies simongdavies merged commit ced010b into hyperlight-dev:main Mar 26, 2026
132 of 154 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bugfix For PRs that fix bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants