
Add AWS spot instance support for EC2 hypervisors#63

Open
gamado wants to merge 2 commits into openshift-eng:main from gamado:aws-spot-support-to-reduce-cousts

Conversation

Contributor

@gamado gamado commented Apr 21, 2026

Spot instances can reduce EC2 costs by up to 90%. This change uses persistent spot requests with stop (not terminate) interruption behavior, so instances can be restarted after capacity is restored.

  • Add USE_SPOT_INSTANCE and AVAILABILITY_ZONE options to instance.env
  • Create a dedicated spot launch template in CloudFormation
  • Make spot and capacity reservations mutually exclusive
  • Skip capacity reservation API calls for spot instances on start
  • Detect spot lifecycle in start.sh to avoid unsupported operations

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
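
For reference, enabling the feature in instance.env might look like the following; the variable names come from this PR, while the defaults, comments, and the ENABLE_CAPACITY_RESERVATION entry are illustrative assumptions.

USE_SPOT_INSTANCE=true    # request a persistent spot instance instead of on-demand
AVAILABILITY_ZONE=""      # optional AZ pin, e.g. "us-east-2a"; empty lets AWS choose
# Ignored when USE_SPOT_INSTANCE=true, since spot and capacity
# reservations are mutually exclusive:
ENABLE_CAPACITY_RESERVATION=false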
@openshift-ci openshift-ci Bot requested review from jaypoulz and qJkee April 21, 2026 09:41

openshift-ci Bot commented Apr 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gamado
Once this PR has been reviewed and has the lgtm label, please assign jerpeter1 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 21, 2026

openshift-ci Bot commented Apr 21, 2026

Hi @gamado. Thanks for your PR.

I'm waiting for an openshift-eng member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


coderabbitai Bot commented Apr 21, 2026

Warning

Rate limit exceeded

@gamado has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 46 minutes and 50 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 46 minutes and 50 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 443fda13-20ec-4ac0-9244-fabf492f2bfe

📥 Commits

Reviewing files that changed from the base of the PR and between 683cac2 and 8a662e9.

📒 Files selected for processing (1)
  • deploy/aws-hypervisor/scripts/destroy.sh

Walkthrough

The pull request adds AWS Spot Instance support to the hypervisor deployment infrastructure. It introduces environment variables for spot instance configuration and availability zone override, updates deployment scripts to enforce mutual exclusion between spot instances and capacity reservations, and modifies CloudFormation templates to conditionally configure spot instance purchasing and interruption behavior.

Changes

Cohort / File(s) | Summary

Configuration & Environment Defaults
deploy/aws-hypervisor/instance.env.template, deploy/aws-hypervisor/scripts/common.sh
Added documentation and default values for the USE_SPOT_INSTANCE and AVAILABILITY_ZONE environment variables. Updated the capacity reservation section header to indicate it is disregarded when spot instances are enabled.

Deployment Script Logic
deploy/aws-hypervisor/scripts/create.sh, deploy/aws-hypervisor/scripts/start.sh
Added spot instance status reporting and parameter passing to CloudFormation. Implemented mutual-exclusion logic to disable capacity reservations when spot instances are active. Added instance lifecycle detection to conditionally skip capacity reservation preference setup for spot instances.

CloudFormation Template
deploy/aws-hypervisor/templates/rhel-instance.yaml
Added a UseSpot parameter and UseSpotInstance condition. Introduced the RHELSpotLaunchTemplate resource with spot instance market options (persistent requests, stop-on-interruption). Updated instance launch template selection to prefer the spot template when enabled, otherwise falling back to the existing capacity reservation logic.
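
In script terms, the wiring above might look roughly like this in create.sh; this is a sketch only, and the stack name handling, template path, and use of cloudformation deploy (rather than create-stack) are assumptions.

# Sketch: disable capacity reservations when spot is requested, then pass
# the UseSpot parameter through to the CloudFormation template.
if [[ "${USE_SPOT_INSTANCE:-false}" == "true" ]]; then
    ENABLE_CAPACITY_RESERVATION="false"   # mutually exclusive with spot
    USE_SPOT_PARAM="Yes"
else
    USE_SPOT_PARAM="No"
fi
aws --region "${REGION}" cloudformation deploy \
    --stack-name "${STACK_NAME}" \
    --template-file deploy/aws-hypervisor/templates/rhel-instance.yaml \
    --parameter-overrides "UseSpot=${USE_SPOT_PARAM}"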

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant DeployScript as Deployment Script<br/>(create.sh)
    participant Env as Environment Config<br/>(common.sh)
    participant CloudFormation as CloudFormation
    participant EC2 as EC2 API

    User->>DeployScript: Execute deployment with USE_SPOT_INSTANCE
    DeployScript->>Env: Read USE_SPOT_INSTANCE & AVAILABILITY_ZONE
    alt Spot Instance Enabled
        DeployScript->>DeployScript: Set ENABLE_CAPACITY_RESERVATION=false
        DeployScript->>CloudFormation: Pass UseSpot="Yes"
    else Spot Instance Disabled
        DeployScript->>CloudFormation: Pass UseSpot="No"<br/>& capacity reservation config
    end
    CloudFormation->>CloudFormation: Select launch template<br/>(spot or standard)
    CloudFormation->>EC2: Create instance with<br/>selected template
    EC2-->>CloudFormation: Instance created
    CloudFormation-->>DeployScript: Stack creation complete
sequenceDiagram
    participant StartScript as Start Script<br/>(start.sh)
    participant EC2 as EC2 API
    participant StartScript2 as Start Script<br/>(continued)

    StartScript->>EC2: Query instance state &<br/>InstanceLifecycle
    EC2-->>StartScript: Return state & lifecycle
    alt Instance is Spot
        StartScript->>StartScript2: Log lifecycle="spot"
        alt Instance Stopped/Stopping
            StartScript2->>StartScript2: Skip capacity<br/>reservation preference
        end
    else Instance is On-Demand
        alt Instance Stopped/Stopping
            StartScript2->>StartScript2: Ensure capacity<br/>reservation preference
        end
    end
    StartScript2->>EC2: Start instance
    EC2-->>StartScript2: Instance started

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
deploy/aws-hypervisor/scripts/start.sh (1)

55-73: ⚠️ Potential issue | 🟡 Minor

Spot instance in stopped state may fail to start due to capacity.

Persistent spot instances stopped via AWS interruption can fail ec2 start-instances with InsufficientInstanceCapacity until capacity returns. AWS normally auto-restarts persistent spot instances when capacity is available, so manually invoking start on such an instance may race with that behavior or fail outright with no retry/backoff here.

Consider:

  • Detecting StateReason.Code == "Server.SpotInstanceShutdown" / Client.InstanceInitiatedShutdown for spot and either waiting for AWS to auto-restart, or retrying the start on capacity errors with backoff.
  • Surfacing a clearer error message to the user when a spot start fails due to capacity.
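
A bounded retry along the lines of the first bullet might look like this; a sketch only, assuming INSTANCE_ID and REGION are already set by the surrounding script.

# Sketch: retry start-instances with exponential backoff to ride out
# temporary spot capacity shortages.
for attempt in 1 2 3 4 5; do
    if aws --region "${REGION}" ec2 start-instances \
            --instance-ids "${INSTANCE_ID}" --no-cli-pager; then
        break
    fi
    if [[ "${attempt}" -eq 5 ]]; then
        echo "ERROR: start failed after ${attempt} attempts; likely a spot capacity (InsufficientInstanceCapacity) issue" >&2
        exit 1
    fi
    sleep $(( 30 * 2 ** (attempt - 1) ))   # 30s, 60s, 120s, 240s
done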
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/aws-hypervisor/scripts/start.sh` around lines 55 - 73, The spot
instance start path may fail with InsufficientInstanceCapacity when the stopped
instance was a persistent spot; update the "stopped" case around the aws ec2
start-instances call to detect spot shutdowns and handle capacity errors: first
query the instance StateReason.Code (use INSTANCE_ID/REGION and existing
ensure_open_capacity_preference helper) and if it equals
"Server.SpotInstanceShutdown" or "Client.InstanceInitiatedShutdown" either wait
for AWS to auto-restart instead of calling start or implement a retry loop that
calls aws ec2 start-instances and on InsufficientInstanceCapacity errors
performs exponential backoff and retries a bounded number of times; ensure you
log a clear, actionable error when retries are exhausted or when start fails due
to capacity so the user knows it’s a spot capacity issue.
🧹 Nitpick comments (3)
deploy/aws-hypervisor/scripts/create.sh (1)

60-65: Consider warning (not just info) when overriding an explicit capacity reservation setting.

If a user explicitly sets ENABLE_CAPACITY_RESERVATION=true in instance.env alongside USE_SPOT_INSTANCE=true, the script silently downgrades it with only an info-level message. Consider promoting this to msg_warning so the conflict is more visible, or erroring out so users are forced to resolve the conflict deliberately.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/aws-hypervisor/scripts/create.sh` around lines 60 - 65, The script
currently logs a silent info when USE_SPOT (from USE_SPOT_INSTANCE) forces
ENABLE_CAPACITY_RESERVATION="false"; change the behavior in the block where
USE_SPOT is checked (symbols: USE_SPOT, USE_SPOT_INSTANCE,
ENABLE_CAPACITY_RESERVATION, msg_info) to use msg_warning instead of msg_info
and include text that clearly states the user explicitly set
ENABLE_CAPACITY_RESERVATION and it is being overridden by spot selection;
alternatively, if you prefer fail-fast behavior, replace the msg_info with an
error/exit path that prints the conflict and aborts so the user must resolve it.
deploy/aws-hypervisor/scripts/start.sh (1)

44-49: Collapse two describe-instances calls into one.

Lines 44 and 45 issue two separate describe-instances API calls for the same instance. You can fetch both fields in a single call to reduce API churn and the chance of race conditions between the two queries.

♻️ Proposed refactor
-INSTANCE_STATE=$(aws --region "${REGION}" ec2 describe-instances --instance-ids "${INSTANCE_ID}" --query 'Reservations[0].Instances[0].State.Name' --output text --no-cli-pager)
-INSTANCE_LIFECYCLE=$(aws --region "${REGION}" ec2 describe-instances --instance-ids "${INSTANCE_ID}" --query 'Reservations[0].Instances[0].InstanceLifecycle' --output text --no-cli-pager || echo "unknown")
+INSTANCE_INFO=$(aws --region "${REGION}" ec2 describe-instances --instance-ids "${INSTANCE_ID}" \
+    --query 'Reservations[0].Instances[0].[State.Name, InstanceLifecycle]' \
+    --output text --no-cli-pager || echo "unknown unknown")
+INSTANCE_STATE=$(echo "${INSTANCE_INFO}" | awk '{print $1}')
+INSTANCE_LIFECYCLE=$(echo "${INSTANCE_INFO}" | awk '{print $2}')

Note: InstanceLifecycle is returned as the literal string None (not empty) for on-demand instances, so the != "spot" guard on lines 57 and 71 is correct.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/aws-hypervisor/scripts/start.sh` around lines 44 - 49, The script
currently calls aws ec2 describe-instances twice to populate INSTANCE_STATE and
INSTANCE_LIFECYCLE; combine into a single call that queries both
Reservations[0].Instances[0].State.Name and
Reservations[0].Instances[0].InstanceLifecycle and capture the two outputs
(e.g., with --query returning a list or JSON then parse into INSTANCE_STATE and
INSTANCE_LIFECYCLE). Update the variables INSTANCE_STATE and INSTANCE_LIFECYCLE
assignment to use that single aws invocation (keep REGION and INSTANCE_ID
usage), preserve the existing logic that treats InstanceLifecycle == "spot" and
the msg_info call, and ensure fallback/unknown handling remains when
InstanceLifecycle is missing or "None".
deploy/aws-hypervisor/instance.env.template (1)

9-20: Document spot-mode limitations.

The spot comments explain cost savings and interruption behavior but omit a few operational caveats worth surfacing here so users aren't surprised:

  • Spot instances may experience an unbounded stopped duration until capacity returns; start.sh does not currently retry on InsufficientInstanceCapacity.
  • c5n.metal (the default EC2_INSTANCE_TYPE) is a bare-metal type and spot availability for bare-metal is historically much more constrained than for virtualized types — users defaulting to metal may rarely see capacity.
  • Setting AVAILABILITY_ZONE with spot pins capacity to a single AZ; leaving it empty lets AWS pick.

A short note on each will save downstream debugging.
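
As a sketch of what those notes could look like in the template (wording illustrative, not from the PR):

# --- Spot Instance Settings ---
# USE_SPOT_INSTANCE=true launches the hypervisor as a persistent spot instance.
# Caveats:
#   - start.sh does not retry on InsufficientInstanceCapacity, so a stopped
#     spot instance may stay stopped until AWS capacity returns.
#   - Bare-metal types such as the default c5n.metal have much tighter spot
#     availability than virtualized types.
#   - AVAILABILITY_ZONE pins spot capacity to a single AZ; leave it empty to
#     let AWS choose.
USE_SPOT_INSTANCE=false
AVAILABILITY_ZONE=""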

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/aws-hypervisor/instance.env.template` around lines 9 - 20, Update the
spot instance comments to document operational caveats: mention that start.sh
does not retry on InsufficientInstanceCapacity so stopped durations can be
unbounded, warn that the default EC2_INSTANCE_TYPE (c5n.metal) is a bare-metal
type with much more constrained spot availability, and note that setting
AVAILABILITY_ZONE pins capacity to a single AZ (leaving it empty lets AWS pick);
reference USE_SPOT_INSTANCE as the toggle to enable spot behavior and keep the
notes brief and adjacent to the existing Spot Instance Settings block.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 6b8ce601-caee-443b-b63c-c99f86fccb1c

📥 Commits

Reviewing files that changed from the base of the PR and between 55c91e1 and 683cac2.

📒 Files selected for processing (5)
  • deploy/aws-hypervisor/instance.env.template
  • deploy/aws-hypervisor/scripts/common.sh
  • deploy/aws-hypervisor/scripts/create.sh
  • deploy/aws-hypervisor/scripts/start.sh
  • deploy/aws-hypervisor/templates/rhel-instance.yaml

Comment on lines +249 to +260
  # Launch Template for Spot Instances (only created when spot is used)
  RHELSpotLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Condition: UseSpotInstance
    Properties:
      LaunchTemplateData:
        InstanceMarketOptions:
          MarketType: spot
          SpotOptions:
            SpotInstanceType: persistent
            InstanceInterruptionBehavior: stop


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Look for any spot-request cancellation logic across the deploy tree
rg -nP -C2 'cancel-spot-instance-requests|describe-spot-instance-requests|SpotInstanceRequestId' 
fd -t f 'destroy|delete|cleanup|teardown' deploy/aws-hypervisor | xargs -I{} sh -c 'echo "=== {} ==="; cat "{}"'

Repository: openshift-eng/two-node-toolbox

Length of output: 2729


🌐 Web query:

When an EC2 instance is launched via CloudFormation AWS::EC2::Instance using a LaunchTemplate with InstanceMarketOptions MarketType spot and SpotInstanceType persistent, is the associated spot instance request cancelled automatically when the CloudFormation stack is deleted?

💡 Result:

No, the associated Spot Instance request is not cancelled automatically when the CloudFormation stack is deleted. CloudFormation deletes the EC2 instance by terminating it, but for persistent Spot requests, termination causes the request to become open again, potentially launching a new instance. Multiple user reports confirm the Spot request persists after stack deletion, requiring manual cancellation.



Add spot instance request cancellation to teardown process.

The destroy.sh script handles capacity reservation cleanup but does not cancel the persistent Spot Instance Request created by SpotInstanceType: persistent. When the CloudFormation stack deletes the EC2 instance, the persistent Spot request remains open and can relaunch a replacement instance, leading to orphaned instances and continued billing.

The teardown script must explicitly cancel all Spot Instance Requests associated with the stack before or after CloudFormation stack deletion. Reference the Spot request ID stored during instance launch and use aws ec2 cancel-spot-instance-requests to clean it up during the destroy phase.
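
A minimal sketch of that cleanup, assuming destroy.sh has already resolved INSTANCE_ID and REGION and the instance still exists when it runs:

# Look up the spot request tied to the instance, then cancel it.
SPOT_REQUEST_ID=$(aws --region "${REGION}" ec2 describe-instances \
    --instance-ids "${INSTANCE_ID}" \
    --query 'Reservations[0].Instances[0].SpotInstanceRequestId' \
    --output text --no-cli-pager)
if [[ -n "${SPOT_REQUEST_ID}" && "${SPOT_REQUEST_ID}" != "None" ]]; then
    aws --region "${REGION}" ec2 cancel-spot-instance-requests \
        --spot-instance-request-ids "${SPOT_REQUEST_ID}" --no-cli-pager
fi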

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/aws-hypervisor/templates/rhel-instance.yaml` around lines 249 - 260,
The teardown lacks cancellation of persistent Spot Instance Requests created by
the RHELSpotLaunchTemplate (SpotOptions -> SpotInstanceType: persistent); update
the destroy.sh script to locate the Spot Request ID recorded when the instance
was launched (or query for requests tied to the CloudFormation stack/instance
tags) and call aws ec2 cancel-spot-instance-requests --spot-instance-request-ids
<ids> as part of teardown (before/after stack delete) to ensure persistent
requests are removed and no replacement instances are launched.

Persistent spot requests survive instance termination and could launch
a new instance. Cancel the spot request explicitly before deleting the
CloudFormation stack to prevent orphaned resources.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@fonta-rh
Contributor

Hey Gal, thanks for putting this together — spot support for TNT is a great idea and the cost savings are real. I've gone through the implementation and have some feedback. The core direction is sound but I think the persistent spot request lifecycle needs more attention before this is ready.

Architecture: spot request as a managed resource

The existing codebase treats capacity reservations as a first-class managed resource — created, persisted to a file, cleaned up in both success and error paths. The spot request that persistent + stop creates is an equally important resource, but the PR treats it as an implicit side effect of instance creation.

I'd suggest following the same pattern as capacity reservations:

  1. After stack creation in create.sh, query and persist the SpotInstanceRequestId to instance-data/spot-request-id
  2. stop.sh reads that file and disables the request (prevents auto-restart while preserving it)
  3. start.sh reads that file and re-enables the request
  4. destroy.sh reads that file and cancels the request

This makes the spot lifecycle explicit, file-driven, and consistent with how capacity reservations are already managed.
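
Step 1 might look like the following after stack creation completes; a sketch, and the instance-data/spot-request-id path simply mirrors the capacity reservation pattern rather than anything already in this PR.

# Persist the spot request ID next to the other instance metadata.
SPOT_REQUEST_ID=$(aws --region "${REGION}" ec2 describe-instances \
    --instance-ids "${INSTANCE_ID}" \
    --query 'Reservations[0].Instances[0].SpotInstanceRequestId' \
    --output text --no-cli-pager)
if [[ "${SPOT_REQUEST_ID}" != "None" ]]; then
    echo "${SPOT_REQUEST_ID}" > instance-data/spot-request-id
fi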

Blocking: make stop will auto-restart spot instances

With persistent + InstanceInterruptionBehavior: stop, a persistent SpotInstanceRequest is created alongside the instance. When a persistent spot instance is stopped — whether by AWS interruption or user-initiated stop-instances — the spot request remains active and AWS will restart the instance when capacity is available.

Neither stop.sh nor force-stop.sh is modified by this PR, which means make stop → instance stops → AWS restarts it. This breaks the expected semantics.

The fix is to disable (not cancel) the spot request on stop, and re-enable on start. AWS supports this via:

# On stop: prevent auto-restart
aws ec2 disable-spot-instance-requests --spot-instance-request-ids "$SPOT_REQUEST_ID"

# On start: allow auto-restart again (for AWS-initiated interruptions)
aws ec2 enable-spot-instance-requests --spot-instance-request-ids "$SPOT_REQUEST_ID"

This way, user-initiated stops stay stopped, but if the user restarts and AWS later interrupts, the persistent request handles recovery correctly.

Suggestion: combine API calls in start.sh

The PR adds a second describe-instances call for InstanceLifecycle. These can be combined:

INSTANCE_INFO=$(aws --region "${REGION}" ec2 describe-instances \
    --instance-ids "${INSTANCE_ID}" \
    --query 'Reservations[0].Instances[0].[State.Name, InstanceLifecycle]' \
    --output text --no-cli-pager)
INSTANCE_STATE=$(echo "$INSTANCE_INFO" | awk '{print $1}')
INSTANCE_LIFECYCLE=$(echo "$INSTANCE_INFO" | awk '{print $2}')

Minor, but sets a good precedent — metal instance API calls aren't free latency-wise.

Suggestion: separate the AVAILABILITY_ZONE change

The change from AVAILABILITY_ZONE="" to AVAILABILITY_ZONE="${AVAILABILITY_ZONE:-}" in create.sh enables AZ pinning for all instance types, not just spot. This is independently useful — could you split it into its own commit? It makes the spot PR easier to review and gives the AZ fix its own visibility.

Important: orphaned spot request risk in destroy.sh

The spot cancellation code queries the instance for SpotInstanceRequestId via describe-instances. This has two problems:

  1. If the instance is already gone (manually terminated, or a previous failed destroy), describe-instances returns nothing and the spot request is never canceled. A persistent spot request that isn't canceled will keep trying to launch new instances.

  2. No trap for partial failure. If destroy.sh is interrupted after stack deletion starts but before spot cancellation completes, the spot request is orphaned. Unlike an expired capacity reservation (which is time-limited and harmless), an orphaned persistent spot request is actively dangerous — it will keep requesting instances using resources that may no longer exist.

Both problems go away if the spot request ID is persisted to a file at creation time (per the architecture suggestion above). Then destroy.sh can cancel it from the file regardless of whether the instance still exists, and a trap can ensure cleanup on any exit path.
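
For the trap, registering cleanup early in destroy.sh would cover interrupted runs; a sketch under the same file-driven assumption as above.

# Cancel the persisted spot request on any exit path of destroy.sh.
cleanup_spot_request() {
    local id_file="instance-data/spot-request-id"
    [[ -s "${id_file}" ]] || return 0
    if aws --region "${REGION}" ec2 cancel-spot-instance-requests \
            --spot-instance-request-ids "$(cat "${id_file}")" --no-cli-pager; then
        rm -f "${id_file}"
    fi
}
trap cleanup_spot_request EXIT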

Informational: documentation gaps

A couple of things worth adding to the instance.env.template comments or a short doc:

  • Interruption frequency: c5n.metal spot interruption rates can be 10-25%+ (per AWS Spot Advisor). Worth a note so users set expectations.
  • MaxPrice behavior: No MaxPrice is set, which defaults to the on-demand price ceiling. This is AWS's recommended approach, but users might assume "spot" means they could overpay. A one-liner clarifying this would help.
  • Cluster impact: A spot interruption kills the hypervisor, which means all cluster VMs go down hard. Recovery requires make startup-cluster or make redeploy-cluster after the instance restarts. This should be documented.
