
Accumulating spot instances that are not attached to ASG's #484

Closed
jjones-smug opened this issue Mar 28, 2022 · 5 comments

Comments

@jjones-smug
Contributor


Issue type

  • Bug Report

Build number

This is a custom build, but from an as-is clone of the repo, meaning there are no local code modifications in place.

Configuration

We're running this as a Lambda, invoked by CW Events every 2 minutes.

Environment

  • us-east-1
  • VPC
  • Anonymized launch configuration: I don't know what this means?

Summary

We upgraded our "live" ASGs to the latest release of the AutoSpotting code last week, after running it without issue in our "test" environment for several days. But after about 12 hours we wound up with a group of unattached spot instances that had been launched, were running, and were tagged like one of the AutoSpotting-enabled ASGs. In other words, it looked like spot replacements were provisioned and configured but never ended up getting attached to the ASGs.

Steps to reproduce

I don't know what specific steps/events are required to cause "stranded spot instances". We've just noticed that they accumulate over time (and are never terminated).

Expected results

N/A

Actual results

What I do have is sample output from CloudWatch Logs and CloudTrail API calls that shows something I think is unusual. I'll submit those as attachments, but will also describe what I think they show.

One invocation of the Lambda appears to terminate an OD instance after replacing it with a spot instance. That appears to work fine, and our EC2 logs show the OD instance being terminated by AWS. But a subsequent Lambda invocation ~4 minutes later logs that it doesn't like the state of that same instance and tries to terminate it again. That API call fails, since the instance is no longer running.
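The race described above could be avoided with a state guard before the termination call. A minimal sketch of such a guard, in Python with purely hypothetical names (AutoSpotting itself is written in Go, and this is not its actual logic):

```python
# Hypothetical idempotency guard: skip instances that an earlier
# Lambda invocation already terminated, instead of retrying the
# TerminateInstanceInAutoScalingGroup call and getting an API error.

# EC2 states in which a second terminate call is pointless.
TERMINAL_STATES = {"shutting-down", "terminated"}

def should_terminate(instance_state: str) -> bool:
    """Only act on instances that are still alive."""
    return instance_state not in TERMINAL_STATES

def safe_terminate(instance_id: str, instance_state: str, terminate_fn):
    """Invoke terminate_fn once; later invocations become no-ops."""
    if not should_terminate(instance_state):
        return f"skipped {instance_id} (state: {instance_state})"
    return terminate_fn(instance_id)

# Simulate two Lambda invocations ~4 minutes apart acting on the
# same on-demand instance:
log = []
terminate = lambda iid: (log.append(iid), f"terminated {iid}")[1]
first = safe_terminate("i-0abc123", "running", terminate)
second = safe_terminate("i-0abc123", "terminated", terminate)
print(first)   # → terminated i-0abc123
print(second)  # → skipped i-0abc123 (state: terminated)
```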

@jjones-smug
Contributor Author

Sample fields from several TerminateInstanceInAutoScalingGroup API calls. The ones that are failing have null "request" data; the ones that work have a request blob with an instance id.

cloudtrail-api-calls-extract.txt
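The distinction in the attachment can be expressed as a short filter over CloudTrail records. The sample events below are fabricated placeholders following the standard CloudTrail schema (`eventName`, `requestParameters`); only the null-vs-populated request-blob split is taken from the description above:

```python
# Split TerminateInstanceInAutoScalingGroup CloudTrail records by
# whether the call carried a populated request blob. Sample data is
# fabricated for illustration.
records = [
    {"eventName": "TerminateInstanceInAutoScalingGroup",
     "requestParameters": {"instanceId": "i-0aaa111"}},
    {"eventName": "TerminateInstanceInAutoScalingGroup",
     "requestParameters": None},  # call with no request blob
]

terminations = [r for r in records
                if r["eventName"] == "TerminateInstanceInAutoScalingGroup"]
null_request = [r for r in terminations if r["requestParameters"] is None]
with_request = [r for r in terminations if r["requestParameters"]]

print(len(null_request), len(with_request))  # → 1 1
```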

Scrubbed CloudWatch logs from two Lambda invocations.

cloudwatch-logs-extract.txt

@cristim
Member

cristim commented Mar 29, 2022

Thanks for reporting this issue.

I currently only work on issues and feature requests submitted by paying customers.

If you installed AutoSpotting from the marketplace, tell me your account number on Slack and I'll look into this.

Otherwise I'm going to leave this open just in case someone else is willing to give it a try and contribute a pull request, which I'm more than happy to consider.

@cristim
Member

cristim commented May 19, 2022

I've recently committed a change to the instance replacement logic that should address this.

We now essentially use the same replacement logic we use for new instances: instead of waiting out the grace period with the Spot instance outside the ASG, we just swap the instances once every 30 minutes.
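As a rough illustration of that cadence (hypothetical names, not the actual Go implementation), the replacement run becomes a simple 30-minute rate limit per group instead of a grace-period wait:

```python
# Sketch: swap at most one instance per 30-minute window per ASG,
# tracked by the timestamp of the last swap. Names are illustrative.
SWAP_INTERVAL = 30 * 60  # seconds

def due_for_swap(last_swap: float, now: float) -> bool:
    """True once at least 30 minutes have passed since the last swap."""
    return now - last_swap >= SWAP_INTERVAL

def maybe_swap(asg: dict, now: float) -> bool:
    """Swap an on-demand instance for a spot one, if the window allows."""
    if not due_for_swap(asg["last_swap"], now):
        return False
    asg["last_swap"] = now
    return True

asg = {"name": "live-asg", "last_swap": 0.0}
print(maybe_swap(asg, 1800.0))  # → True  (first 30-minute window elapsed)
print(maybe_swap(asg, 2700.0))  # → False (only 15 minutes since last swap)
print(maybe_swap(asg, 3600.0))  # → True  (30 minutes since last swap)
```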

Let me know if this addresses your issues.

@cristim cristim closed this as completed May 19, 2022
@jjones-smug
Contributor Author

Awesome! Thanks for the update...

@cristim
Member

cristim commented Jul 27, 2022

@jjones-smug did you get the chance to test this out?
