Cloud runner terminate improvements #653

Merged: 12 commits merged into master from runner-dispose-improvement on Jul 20, 2021

Conversation

@DavidGOrtega (Contributor) commented Jul 11, 2021

@DavidGOrtega temporarily deployed to internal July 11, 2021 18:56 Inactive
@DavidGOrtega self-assigned this Jul 11, 2021
@DavidGOrtega marked this pull request as draft July 11, 2021 18:57
@DavidGOrtega temporarily deployed to internal July 19, 2021 21:29 Inactive
@DavidGOrtega temporarily deployed to internal July 19, 2021 22:41 Inactive
@DavidGOrtega changed the title from "Use api to check runner idle status to avoid dead instances" to "Cloud runner terminate improvements" on Jul 19, 2021
@DavidGOrtega temporarily deployed to internal July 19, 2021 23:20 Inactive
@DavidGOrtega temporarily deployed to internal July 20, 2021 00:16 Inactive
@DavidGOrtega temporarily deployed to internal July 20, 2021 08:50 Inactive
@DavidGOrtega marked this pull request as ready for review July 20, 2021 10:20
@@ -5,6 +5,7 @@ const { homedir } = require('os');

 const fs = require('fs').promises;
 const yargs = require('yargs');
+const { SpotNotifier } = require('ec2-spot-notification');
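This added import is the crux of the review thread below. For illustration only, a sketch of how such a notifier might be wired into the runner; the `ec2-spot-notification` API surface is not shown in this diff, so the EventEmitter-style shape and the shutdown behaviour below are assumptions, not the actual CML code:

```js
// Hypothetical wiring, assuming an EventEmitter-style SpotNotifier API;
// the real package interface may differ.
const { SpotNotifier } = require('ec2-spot-notification');

SpotNotifier.on('termination', () => {
  // On a spot termination notice, stop taking jobs and exit cleanly so the
  // instance can be disposed of instead of lingering as a dead runner.
  console.log('Spot termination notice received; shutting the runner down');
  process.exit(0);
});
```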
@0x2b3bfa0 (Member) commented Jul 20, 2021:

Can we separate this functionality and move it to the AMI, falling back on the terraform-provider-iterative repository? While this is a must-have, keeping it as part of the CML code would introduce another unrelated responsibility to this codebase.

@0x2b3bfa0 (Member) commented:

We can just curl the 169.254.169.254 endpoint in a loop and run systemctl stop cml once we get a termination notice, all from the instance script.
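A minimal sketch of that suggestion, assuming the runner is managed by a systemd unit named `cml`. The `spot/instance-action` metadata path returns 404 until AWS schedules a termination, which typically gives about two minutes of warning:

```sh
#!/bin/sh
# Poll the EC2 instance metadata service for a spot termination notice.
# curl -f makes the request exit non-zero on the 404 returned while no
# termination is scheduled, so the loop keeps polling until a notice appears.
while true; do
  if curl -fs http://169.254.169.254/latest/meta-data/spot/instance-action > /dev/null; then
    systemctl stop cml  # assumed systemd unit name for the runner
    break
  fi
  sleep 5
done
```

The design point here is that the loop lives entirely in the instance startup script, keeping the cloud-specific polling out of the CML codebase.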

@DavidGOrtega (Contributor, Author) commented:

Let's open another ticket and, for now, use the work that's already done and working.

@DavidGOrtega (Contributor, Author) commented:

Also, we might only want this behaviour for the runner, so I guess it should go in the TPI runner resource at most.

@DavidGOrtega (Contributor, Author) commented:

> Can we separate this functionality and move it to the AMI, falling back on the terraform-provider-iterative repository? While this is a must-have, keeping it as part of the CML code would introduce another unrelated responsibility to this codebase.

Actually, I disagree. It's part of the logic we are putting in place to handle long-running job runners, the same as what we do with the 72-hour limit, for example. And it ensures the integrity and responsibility of the runner.

@0x2b3bfa0 (Member) commented:

> I'm thinking of doing it all in JS executed with V8. At the end of the day we are using the vendor's SDK, which already exists in JS

Not sure about the specific advantages of V8 over any other engine for this use case, but anyhow, this is a completely separate discussion topic.

> • More complex in the TPI

Not that complex.

> • More obscure having another service or whatever...

Not as obscure as moving cloud-specific code to the CML repository instead of abstracting over signals, in my opinion.

> • This runner could be launched manually inside an AWS instance and still work on its own

Do we really want to support this use case? What's the advantage of supporting it if we don't do the same for the other providers?

@DavidGOrtega (Contributor, Author) commented:

All of these are advantages, even if you disagree.

@DavidGOrtega (Contributor, Author) commented:

> Not as obscure as moving cloud-specific code to the CML repository

The CML repository is also a runner, one that can check whatever is needed to ensure the feature it offers: transparent spot instances.

@DavidGOrtega (Contributor, Author) commented Jul 20, 2021:

> Not that complex.

We would also have to check the noRetry... pass it... create another service or whatever; the business logic would be broken into multiple parts... Ugly!

@DavidGOrtega (Contributor, Author) commented:

> Not sure about the specific advantages of V8 over any other engine for this use case, but anyhow, this is a completely separate discussion topic.

I'm telling you a solution to avoid Golang source code; I'm also planning to unify the base source code that we were discussing.

@DavidGOrtega temporarily deployed to internal July 20, 2021 10:50 Inactive
Review threads on src/drivers/github.js (1) and src/drivers/gitlab.js (3): outdated, resolved.
@DavidGOrtega temporarily deployed to internal July 20, 2021 11:36 Inactive
@DavidGOrtega temporarily deployed to internal July 20, 2021 11:40 Inactive
@DavidGOrtega (Contributor, Author) commented:

@0x2b3bfa0 are you approving this? We have several P0s that are fixed here. I want to release it ASAP.

@0x2b3bfa0 (Member) commented:

I'm hesitant to approve the inclusion of SpotNotifier in the CML code base, especially without a follow-up technical debt / discussion issue.

I'm going to run tests on Google Cloud to confirm that #661 gets fixed and check everything for approval, but I'd prefer to discuss this in the weekly meeting so we can make better decisions.

@DavidGOrtega (Contributor, Author) commented Jul 20, 2021:

> I'm hesitant to approve the inclusion of SpotNotifier in the CML code base, especially without a follow-up technical debt / discussion issue.

Open a ticket to follow up. I do agree with it and won't argue the opposite, so... to me there is no technical debt here. A working solution should always be released, not held back for discussion.

> I'm going to run tests on Google Cloud to confirm that #661 gets fixed and check everything for approval, but I'd prefer to discuss this in the weekly meeting so we can make better decisions.

I have tested it. The machines are disposed of.

@0x2b3bfa0 (Member) left a review:

Looks good to me, except for the inclusion of SpotNotifier. Approving just because of #653 (comment) to avoid blocking priority 0 fixes.

@DavidGOrtega temporarily deployed to internal July 20, 2021 13:44 Inactive
@DavidGOrtega temporarily deployed to internal July 20, 2021 14:38 Inactive
@DavidGOrtega temporarily deployed to internal July 20, 2021 15:36 Inactive
@DavidGOrtega temporarily deployed to internal July 20, 2021 15:52 Inactive
@DavidGOrtega merged commit 0c6abf3 into master Jul 20, 2021
@DavidGOrtega deleted the runner-dispose-improvement branch July 20, 2021 16:20
@ivyleavedtoadflax commented:

@DavidGOrtega @0x2b3bfa0 thanks so much for this!

@lemontheme commented:

Reporter of #661 checking in. Indeed, thanks for jumping on this!

Unnnnnfortunately, I'm afraid I'm still seeing the same behavior as before. (I'm assuming that the iterative/setup-cml@v1 action uses the latest cml release.) In 5 out of 5 separate runs, my Google CE instances stayed alive rather than being shut down.

Seeing as your tests passed though, maybe it's user error on my end?

@DavidGOrtega (Contributor, Author) commented:

@lemontheme I would need to see your workflow. Tests with TPI indicate that the instances are disposed of after the expected time.

@lemontheme commented:

@DavidGOrtega sorry for the late reply. I tested with the same workflow as in #661

@0x2b3bfa0 (Member) commented:

@lemontheme, can you please open a new issue?
