
Agents aren't spawning on infra.ci #3918

Closed
NotMyFault opened this issue Jan 21, 2024 · 12 comments

Comments

@NotMyFault
Member

Service(s)

infra.ci.jenkins.io

Summary

The build queue has 30+ items at the time of writing, but the executor status is stuck in the launching state:
(screenshot, 2024-01-21 at 16:06:27)

ref https://infra.ci.jenkins.io/job/kubernetes-jobs/job/kubernetes-management/view/change-requests/job/PR-4886/ and other PRs

Reproduction steps

No response

@NotMyFault NotMyFault added the triage Incoming issues that need review label Jan 21, 2024
@smerle33
Contributor

smerle33 commented Jan 22, 2024

Trying to manually trigger an arm64 node in the node pool, it seems that the autoscaling fails.

(screenshot, 2024-01-22 at 08:54:50)
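
For anyone debugging this, the autoscaler's own view of the node groups can be inspected from inside the cluster. A minimal sketch with the Kubernetes Python client, assuming kubeconfig access and that AKS publishes the usual cluster-autoscaler-status ConfigMap in kube-system:

```python
# Sketch only: dump the cluster autoscaler status (per node-group health,
# last scale-up activity) from the ConfigMap AKS normally maintains.
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig pointing at the cluster
v1 = client.CoreV1Api()

cm = v1.read_namespaced_config_map("cluster-autoscaler-status", "kube-system")
print(cm.data.get("status", "no status published"))
```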

@smerle33
Contributor

Once the first node has spawned, the autoscaling works:

(screenshot, 2024-01-22 at 08:56:30)
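
For the record, forcing that first node by hand could look like the sketch below with the Azure SDK for Python (azure-identity + azure-mgmt-containerservice). The subscription, resource group, cluster and pool names are placeholders, not the actual jenkins-infra values, and the sketch bumps min_count because directly setting count is usually rejected while the autoscaler is enabled on the pool:

```python
# Sketch only: keep at least one node in the arm64 spot pool so the
# autoscaler has a node to scale from. All names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerservice import ContainerServiceClient

subscription_id = "<subscription-id>"      # placeholder
resource_group = "<aks-resource-group>"    # placeholder
cluster_name = "<aks-cluster-name>"        # placeholder
pool_name = "<arm64-spot-pool>"            # placeholder

aks = ContainerServiceClient(DefaultAzureCredential(), subscription_id)

pool = aks.agent_pools.get(resource_group, cluster_name, pool_name)
pool.min_count = 1  # keep one node up instead of scaling from zero
aks.agent_pools.begin_create_or_update(
    resource_group, cluster_name, pool_name, pool
).result()
```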

@smerle33 smerle33 removed the triage Incoming issues that need review label Jan 22, 2024
@smerle33 smerle33 self-assigned this Jan 22, 2024
@NotMyFault NotMyFault changed the title from "Agen't aren't spawning on infra.ci" to "Agents aren't spawning on infra.ci" Jan 22, 2024
@smerle33
Contributor

I opened a support ticket with Azure/Microsoft.

The details of my ticket are not in the email, but the title was:
Issue Definition: autoscaling not working from 0 but does from 1 node

The first feedback was:

Hi Stephane,
 
I hope you're doing well.

Starting the scaling at 0 may prevent the autoscaling from working as expected. The cluster autoscaler component watches for pods that can't be scheduled due to resource constraints. If there are no unscheduled pods, the autoscaler may not trigger the scale-up operation. It is recommended to start the scaling at 1 to ensure proper functioning of the autoscaling mechanism.

If my understanding is correct, I probably explained the problem badly, so I replied:

Hi All,

I hope you're doing well.

`The cluster autoscaler component watches for pods that can't be scheduled due to resource constraints.` This is exactly the problem: arm64 pods cannot be scheduled as there is no node in the node pool, so I expect the autoscaler to spawn one node.
Starting the scaling from 1 means spending more money, with a node doing nothing for part of the time.

Stéphane
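
For context, the "pods that can't be scheduled" that the autoscaler watches for are simply Pending pods whose PodScheduled condition is False with reason Unschedulable. A minimal sketch with the Kubernetes Python client to list them (assuming kubeconfig access to the cluster):

```python
# Sketch only: list the unschedulable pods the cluster autoscaler is
# expected to react to (e.g. arm64 agent pods while the arm64 pool is empty).
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
for pod in pending.items:
    for cond in pod.status.conditions or []:
        if cond.type == "PodScheduled" and cond.reason == "Unschedulable":
            print(f"{pod.metadata.namespace}/{pod.metadata.name}: {cond.message}")
```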

@timja
Member

timja commented Jan 23, 2024

We had similar problems with spot instances: if we had it set to 0 it didn't work, whereas 1 did.

@dduportal dduportal modified the milestones: infra-team-sync-2024-01-23, infra-team-sync-2024-01-30 Jan 24, 2024
@smerle33
Contributor

We had similar problems with spot instances: if we had it set to 0 it didn't work, whereas 1 did.

Thanks @timja, I told the Azure support team I spoke with on a video call; they will look into it. I will keep you informed here, but it sounds right that the spot instances are the trouble here.

@dduportal
Contributor

We had similar problems with spot instances: if we had it set to 0 it didn't work, whereas 1 did.

Thanks @timja, I told the Azure support team I spoke with on a video call; they will look into it. I will keep you informed here, but it sounds right that the spot instances are the trouble here.

Given we can't afford (in the current subscription) to use non-spot instances, WDYT about starting to work on a new AKS cluster dedicated to infra.ci.jenkins.io and release.ci.jenkins.io Kubernetes agents in the new "sponsored" subscription?

Multiple achievements for us:

  • Less consumption in the current subscription, and we consume sponsored credits
  • Separation of concerns between controllers and agents
  • No problem to use non-spot instances until MS solves the problem

If it makes sense, I propose closing this issue and tracking the solution above in a new one (ping @smerle33 if you don't mind writing it).

The new issue would need to mention:

@timja
Member

timja commented Jan 26, 2024

Is there an issue with just leaving 1 spot instance?

@dduportal
Contributor

Is there an issue with just leaving 1 spot instance?

Cost 😅 (but it might be OK; @smerle33 do you mind checking the cost?)

@smerle33
Contributor

Is there an issue with just leaving 1 spot instance?

Cost 😅 (but it might be OK; @smerle33 do you mind checking the cost?)

I think it's quite acceptable: less than $15/month.

(screenshots of the cost estimate)
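
For scale: $15/month over roughly 730 hours comes out to about $0.02 per hour to keep a single node up.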

smerle33 added a commit to jenkins-infra/azure that referenced this issue Jan 26, 2024
@smerle33 smerle33 modified the milestones: infra-team-sync-2024-01-30, infra-team-sync-2024-02-13 Jan 30, 2024
@smerle33
Contributor

smerle33 commented Feb 8, 2024

The last answer from Microsoft confirms our choice:

According to the Microsoft guidelines,
One node should be present where there are unscheduled pods, so that cluster can further autoscale based on those unscheduled pods.
Starting the scaling at 0 may prevent the autoscaling from working as expected.
The cluster autoscaler component watches for pods that can't be scheduled due to resource constraints.
If there are no unscheduled pods, the autoscaler may not trigger the scale-up operation.
It is recommended to start the scaling at 1 to ensure proper functioning of the autoscaling mechanism.

We don't have any official documents for the spot instance.

@smerle33 smerle33 closed this as completed Feb 8, 2024
@smerle33
Contributor

New answer from Microsoft:

One node should be present where there are unscheduled pods, so that cluster can further autoscale based on those unscheduled pods.
Starting the scaling at 0 may prevent the autoscaling from working as expected.
The cluster autoscaler component watches for pods that can't be scheduled due to resource constraints.
If there are no unscheduled pods, the autoscaler may not trigger the scale-up operation.
It is recommended to start the scaling at 1 to ensure proper functioning of the autoscaling mechanism.

@lemeurherve
Member

Isn't it exactly the same response as last week minus the last phrase? ^^
