ScaleDown lambda fails to run if a "bad" runner exists #3910

Open
PerGon opened this issue May 13, 2024 · 0 comments

PerGon commented May 13, 2024

We recently hit an issue where a "bad" runner was created (I've opened a PR to fix the source of the issue), but I think the ScaleDown lambda still needs to be improved to prevent this from happening again.

We ran into a problem that made the ScaleDown lambda fail on every run, and we ended up running thousands of VMs for many hours (💸).
The problem occurs when a user-level repo triggers a job: the ScaleUp lambda creates an EC2 VM, but the ScaleDown lambda fails to terminate that VM because of the following error:

HttpError: Request failed with status code 404
    at /var/task/index.js:109709:11
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async getOrCreateOctokit (/var/task/index.js:153999:12)
    at async listGitHubRunners (/var/task/index.js:154033:20)
    at async evaluateAndRemoveRunners (/var/task/index.js:154101:31)
    at async scaleDown (/var/task/index.js:154169:5)
    at async Runtime.scaleDownHandler [as handler] (/var/task/index.js:153696:9)

Whenever that error happens, the lambda stops iterating through the VMs that are candidates for termination and ends its execution. After the bug has been triggered, the ScaleDown lambda will still terminate some VMs for a while, but eventually the "bad" runner will always be the first in line to be checked and the lambda will bail without cleaning up anything, over and over again.
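To illustrate the kind of improvement I have in mind, here is a minimal sketch of isolating failures per runner owner so that one bad lookup does not abort the whole pass. All names here (`evaluateOwners`, `handleOwner`, `OwnerResult`) are hypothetical and are not taken from the project's code; the real scale-down logic is more involved.

```typescript
// Hypothetical sketch: none of these names come from the project's code.
type RunnerOwner = string;

interface OwnerResult {
  owner: RunnerOwner;
  ok: boolean;
  error?: string;
}

// Process every owner independently so that one failing lookup
// (for example a 404 from GitHub) does not abort the whole scale-down pass.
async function evaluateOwners(
  owners: RunnerOwner[],
  handleOwner: (owner: RunnerOwner) => Promise<void>,
): Promise<OwnerResult[]> {
  const results: OwnerResult[] = [];
  for (const owner of owners) {
    try {
      await handleOwner(owner);
      results.push({ owner, ok: true });
    } catch (e) {
      // Record the failure and continue with the next owner instead of rethrowing.
      const message = e instanceof Error ? e.message : String(e);
      results.push({ owner, ok: false, error: message });
    }
  }
  return results;
}
```

With something like this, the failing owner would be reported (and could be logged) while the remaining owners still get evaluated on every run.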

We were able to replicate the issue easily in our Dev setup: use only runners with enable_organization_runners: true, then add the GitHub App to a User and create a GitHub Actions workflow in a repository owned by that User. This produces a job request for a repository that belongs to a user; the ScaleUp lambda creates the VM and the ScaleDown lambda is never able to terminate it.

I was able to narrow down the exact place where the error happens to this line: https://github.com/philips-labs/terraform-aws-github-runner/blob/main/lambdas/functions/control-plane/src/scale-runners/scale-down.ts#L34
I believe that particular GitHub endpoint returns a 404 when called for a repo owner that is not an organization.
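If that belief is right, one option would be to treat a 404 from the organization-level lookup as "this owner is not an organization" rather than as a fatal error. The sketch below assumes the thrown error carries a numeric `status` field (as the HttpError in the trace appears to); `isNotFound` and `lookupOrNull` are hypothetical helpers, not existing functions in the module.

```typescript
// Hypothetical helper: assumes the thrown error exposes a numeric `status` field.
function isNotFound(error: unknown): boolean {
  return (
    typeof error === "object" &&
    error !== null &&
    "status" in error &&
    (error as { status: unknown }).status === 404
  );
}

// Hypothetical wrapper: run a lookup and return null on a 404 instead of throwing.
// Any other error is rethrown so that real failures stay visible.
async function lookupOrNull<T>(lookup: () => Promise<T>): Promise<T | null> {
  try {
    return await lookup();
  } catch (e) {
    if (isNotFound(e)) {
      return null;
    }
    throw e;
  }
}
```

A caller could then skip, or handle separately, any owner for which the lookup returns null, instead of letting the exception end the run.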

The fix in PR #3909 should reduce the likelihood of this problem happening, but I think the ScaleDown lambda itself should still be fixed.
I can try to contribute a fix, but I might need some guidance 🤔
