Decouple bandwidth reservations from oversubscription#238

Merged
rgarcia merged 1 commit into main from hypeship/diskio-reservation-cfn
May 19, 2026

@rgarcia rgarcia commented May 19, 2026

Summary

  • decouple default per-VM network and disk I/O reservations from the same oversubscription ratios used for admission
  • keep oversubscription ratios as aggregate admission multipliers via resource effective limits
  • optionally configure the AWS quickstart gp3 data volume IOPS/throughput, and set CAPACITY__DISK_IO only when DataVolumeThroughput is provided
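The second bullet can be read as the following minimal sketch of aggregate admission. The function names `effectiveLimit` and `canAdmit` are hypothetical and are not taken from `lib/resources`; the only thing assumed from this PR is that admission compares total reservations against `capacity * oversubscription_ratio`.

```go
package main

import "fmt"

// effectiveLimit is the aggregate admission ceiling: raw capacity scaled by
// the oversubscription ratio. Hypothetical helper, not the repository's API.
func effectiveLimit(capacity, ratio float64) float64 {
	return capacity * ratio
}

// canAdmit reports whether a new reservation fits under the effective limit,
// given the sum of reservations already admitted on the host.
func canAdmit(allocated, request, capacity, ratio float64) bool {
	return allocated+request <= effectiveLimit(capacity, ratio)
}

func main() {
	// Values in MB/s. The host already has 1000 MB/s reserved.
	fmt.Println(canAdmit(1000, 31.25, 1000, 1.0)) // no headroom at ratio 1.0
	fmt.Println(canAdmit(1000, 31.25, 1000, 2.0)) // headroom at ratio 2.0
}
```

The ratio only moves the ceiling; it says nothing about how large each default request is, which is the separation this PR is after.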

Why

Default per-VM bandwidth reservations previously used the oversubscribed effective capacity as their input. That meant raising the oversubscription ratio increased the aggregate limit and each default VM request by the same factor, so the ratio did not let the host pack more default VMs when that resource was the binding constraint.

Illustrative example for disk I/O on a 32-vCPU host with 1GB/s disk I/O and 1-vCPU VMs:

| ratio | effective disk I/O limit | old default per VM | max default VMs |
|---|---|---|---|
| 1.0 | 1GB/s | 31.25MB/s | 32 |
| 2.0 | 2GB/s | 62.5MB/s | 32 |
| 4.0 | 4GB/s | 125MB/s | 32 |

Network had the same shape. This change makes the default per-VM reservation proportional to raw capacity, while admission still uses capacity * oversubscription_ratio.
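The arithmetic behind the example can be reproduced with a short sketch. It assumes the per-VM default is a vCPU-proportional share of capacity, which matches the numbers in the example but is not confirmed against the code in `lib/resources`; all function names here are hypothetical.

```go
package main

import "fmt"

// defaultPerVMOld models the pre-change behavior: the default reservation is
// a vCPU-proportional share of the effective (oversubscribed) capacity.
// The proportional-share formula is an assumption.
func defaultPerVMOld(capacityMBps, ratio float64, hostVCPUs, vmVCPUs int) float64 {
	return capacityMBps * ratio * float64(vmVCPUs) / float64(hostVCPUs)
}

// defaultPerVMNew models the post-change behavior: the default reservation is
// a share of raw capacity and ignores the oversubscription ratio.
func defaultPerVMNew(capacityMBps float64, hostVCPUs, vmVCPUs int) float64 {
	return capacityMBps * float64(vmVCPUs) / float64(hostVCPUs)
}

// maxDefaultVMs is how many default-sized VMs fit under capacity * ratio.
func maxDefaultVMs(capacityMBps, ratio, perVM float64) int {
	return int(capacityMBps * ratio / perVM)
}

func main() {
	const capacity, hostVCPUs, vmVCPUs = 1000.0, 32, 1 // 1GB/s host, 1-vCPU VMs
	for _, ratio := range []float64{1.0, 2.0, 4.0} {
		oldPerVM := defaultPerVMOld(capacity, ratio, hostVCPUs, vmVCPUs)
		newPerVM := defaultPerVMNew(capacity, hostVCPUs, vmVCPUs)
		fmt.Printf("ratio=%.1f old: %.2fMB/s x %d VMs, new: %.2fMB/s x %d VMs\n",
			ratio,
			oldPerVM, maxDefaultVMs(capacity, ratio, oldPerVM),
			newPerVM, maxDefaultVMs(capacity, ratio, newPerVM))
	}
}
```

Under these assumptions the old calculation yields 32 VMs at every ratio, while the new one yields 32, 64, and 128 VMs at ratios 1.0, 2.0, and 4.0, because the per-VM default stays at 31.25MB/s.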

Validation

  • go test ./lib/resources
  • go test ./... in deploy/aws/cloudformation
  • git diff --check

A full repo-wide go test ./... was not run successfully from this checkout because some test packages require embedded local binaries that are not present in the repository checkout.

@rgarcia rgarcia requested a review from sjmiller609 May 19, 2026 12:43
@rgarcia rgarcia marked this pull request as ready for review May 19, 2026 12:43
@firetiger-agent

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

Any PR that changes the kernel API. Monitor changes to API endpoints (packages/api/cmd/api/) and Temporal workflows (packages/api/lib/temporal) in the kernel repo

Reason: PR modifies resource reservation logic and CloudFormation templates, but does not appear to change API endpoints (packages/api/cmd/api/) or Temporal workflows (packages/api/lib/temporal) that would trigger kernel API monitoring.

To monitor this PR anyway, reply with @firetiger monitor this.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issue. You can view the agent here.

Reviewed by Cursor Bugbot for commit e603ced. Configure here.

install -d -m 755 /etc/systemd/system/hypeman.service.d
cat >/etc/systemd/system/hypeman.service.d/disk-io-capacity.conf <<EOF
[Service]
Environment="CAPACITY__DISK_IO=${DataVolumeThroughput}MB/s"

MiB vs MB unit mismatch in disk I/O capacity

Low Severity

The DataVolumeThroughput parameter is documented as "throughput in MiB/s" (matching AWS gp3 units), but the generated environment variable appends MB/s, a decimal megabyte unit. The parseDiskIOLimit function and the c2h5oh/datasize library treat MB as 1,000,000 bytes, while AWS's MiB equals 1,048,576 bytes. This causes CAPACITY__DISK_IO to be ~4.6% lower than the actual provisioned throughput, so per-VM defaults and admission limits are consistently underestimated.

Bugbot Autofix determined this is a false positive.

This is not a real capacity mismatch because the datasize parser used by parseDiskIOLimit interprets MB as 1024^2 bytes (MiB), so ${DataVolumeThroughput}MB/s maps to the same throughput value provisioned in MiB/s.
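The size of the disputed gap can be checked with plain arithmetic. The sketch below does not call the datasize library; the constants and the `shortfallPercent` helper are defined locally for illustration, and the 125 figure is only an example throughput value.

```go
package main

import "fmt"

const (
	decimalMB = 1000 * 1000 // bytes in an SI megabyte
	binaryMiB = 1 << 20     // bytes in a mebibyte (1024^2)
)

// shortfallPercent returns how much smaller parsed is than provisioned,
// as a percentage of provisioned. Hypothetical helper for this example.
func shortfallPercent(parsed, provisioned int64) float64 {
	return 100 * float64(provisioned-parsed) / float64(provisioned)
}

func main() {
	const throughput = 125 // example DataVolumeThroughput value

	provisioned := int64(throughput) * binaryMiB
	asDecimal := int64(throughput) * decimalMB
	asBinary := int64(throughput) * binaryMiB

	// If "MB" were parsed as 10^6 bytes, capacity would be understated.
	fmt.Printf("decimal reading: %.2f%% below provisioned\n",
		shortfallPercent(asDecimal, provisioned))
	// If "MB" is parsed as 1024^2 bytes, the values agree.
	fmt.Printf("binary reading:  %.2f%% below provisioned\n",
		shortfallPercent(asBinary, provisioned))
}
```

Per the Autofix reply above, the parser takes the binary reading, so the second case applies and the configured capacity matches the provisioned throughput.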

@rgarcia rgarcia changed the title from "Decouple disk IO reservation from oversubscription" to "Decouple bandwidth reservations from oversubscription" May 19, 2026
@rgarcia rgarcia force-pushed the hypeship/diskio-reservation-cfn branch from e603ced to 2aa19a3 Compare May 19, 2026 13:46

@sjmiller609 sjmiller609 left a comment

makes sense, thanks

@rgarcia rgarcia merged commit 58b9a9f into main May 19, 2026
15 checks passed
@rgarcia rgarcia deleted the hypeship/diskio-reservation-cfn branch May 19, 2026 14:10
@rgarcia rgarcia restored the hypeship/diskio-reservation-cfn branch May 19, 2026 14:35