We recently faced an issue where a "bad" runner was created (I've created a PR to fix the source of the issue), but I think the ScaleDown lambda still needs to be improved to avoid this happening again.
We ran into a problem that made the ScaleDown lambda fail on every run, and eventually we were running thousands of VMs for many hours (💸).
The problem happens when a user-level repo triggers a job: the ScaleUp lambda will create an EC2 VM, but the ScaleDown lambda will fail to terminate that VM because of the following error:
```
HttpError: Request failed with status code 404
    at /var/task/index.js:109709:11
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async getOrCreateOctokit (/var/task/index.js:153999:12)
    at async listGitHubRunners (/var/task/index.js:154033:20)
    at async evaluateAndRemoveRunners (/var/task/index.js:154101:31)
    at async scaleDown (/var/task/index.js:154169:5)
    at async Runtime.scaleDownHandler [as handler] (/var/task/index.js:153696:9)
```
Whenever that error happens, the lambda simply stops iterating through all the potential VMs to terminate and finishes its execution. After the bug has been triggered, the ScaleDown lambda will still terminate some VMs for a while, but eventually the "bad" runner will always be the first in line to be checked and the lambda will bail without cleaning anything up, over and over again.
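To illustrate what I mean, here is a rough sketch (not the actual repo code; the names and signature below are made up) of how the per-owner evaluation could be wrapped in a try/catch so that one failing owner does not abort cleanup for everyone else:

```typescript
// Hypothetical sketch: isolate per-owner failures during scale-down.
interface RunnerInfo {
  instanceId: string;
  owner: string;
}

async function scaleDownAllOwners(
  runnersByOwner: Map<string, RunnerInfo[]>,
  evaluateAndRemoveRunners: (owner: string, runners: RunnerInfo[]) => Promise<void>,
): Promise<void> {
  for (const [owner, runners] of runnersByOwner) {
    try {
      await evaluateAndRemoveRunners(owner, runners);
    } catch (e) {
      // Log and keep going: a 404 for one owner should not stop
      // termination of idle VMs belonging to the remaining owners.
      console.warn(`scale-down failed for owner ${owner}, skipping`, e);
    }
  }
}
```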
We were able to replicate the issue easily in our dev setup, using only runners with `enable_organization_runners: true`, then adding the GitHub App to a user and creating a GitHub Actions workflow in a user repository. This creates an action request for a repository that belongs to a user; the ScaleUp lambda will create the VM and the ScaleDown lambda will never be able to terminate it.
I was able to nail down the exact place where the error happens to this line: https://github.com/philips-labs/terraform-aws-github-runner/blob/main/lambdas/functions/control-plane/src/scale-runners/scale-down.ts#L34. I believe that particular GitHub endpoint will throw a `404` if called for a non-organization repo owner. The fix in PR #3909 should reduce the likelihood of this problem happening, but I think this should still be fixed.
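As an illustration of one possible direction (this is only a sketch under my own assumptions, not the repo's actual `listGitHubRunners` implementation), the owner type could be checked before calling the org-level runners endpoint, falling back to the repo-level endpoint for user-owned repos instead of letting the 404 propagate:

```typescript
import { Octokit } from '@octokit/rest';

// Hypothetical helper: list self-hosted runners for an owner, avoiding the
// org-level endpoint (which 404s) when the owner is a user account.
async function listRunnersForOwner(octokit: Octokit, owner: string, repo?: string) {
  const { data: account } = await octokit.rest.users.getByUsername({ username: owner });

  if (account.type === 'Organization') {
    const { data } = await octokit.rest.actions.listSelfHostedRunnersForOrg({ org: owner });
    return data.runners;
  }

  if (repo !== undefined) {
    // For user-owned repos, only the repo-level runners endpoint is valid.
    const { data } = await octokit.rest.actions.listSelfHostedRunnersForRepo({ owner, repo });
    return data.runners;
  }

  return [];
}
```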
I can try to contribute a fix, but I might need some guidelines 🤔