
Agents aren't spawning on infra.ci #3918

Closed
NotMyFault opened this issue Jan 21, 2024 · 12 comments

Comments

@NotMyFault
Member

Service(s)

infra.ci.jenkins.io

Summary

The build queue has 30+ items at the time of writing, but the executor status is stuck in the launching state:
(screenshot, 2024-01-21 at 16:06:27)

ref https://infra.ci.jenkins.io/job/kubernetes-jobs/job/kubernetes-management/view/change-requests/job/PR-4886/ and other PRs

Reproduction steps

No response

@NotMyFault NotMyFault added the triage Incoming issues that need review label Jan 21, 2024
@smerle33
Contributor

smerle33 commented Jan 22, 2024

Trying to manually trigger an arm64 node in the node pool, it seems that the autoscaling fails.

(screenshot, 2024-01-22 at 08:54:50)
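
For anyone debugging this, the autoscaler's own view of the node groups can be inspected from inside the cluster. A minimal sketch with the Kubernetes Python client, assuming kubeconfig access and that AKS publishes the usual cluster-autoscaler-status ConfigMap in kube-system:

```python
# Sketch only: dump the cluster autoscaler status (per node-group health,
# last scale-up activity) from the ConfigMap AKS normally maintains.
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig pointing at the cluster
v1 = client.CoreV1Api()

cm = v1.read_namespaced_config_map("cluster-autoscaler-status", "kube-system")
print(cm.data.get("status", "no status published"))
```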

@smerle33
Contributor

Once the first node has spawned, the autoscaling works:

(screenshot, 2024-01-22 at 08:56:30)
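
For the record, forcing that first node by hand could look like the sketch below with the Azure SDK for Python (azure-identity + azure-mgmt-containerservice). The subscription, resource group, cluster and pool names are placeholders, not the actual jenkins-infra values, and the sketch bumps min_count because directly setting count is usually rejected while the autoscaler is enabled on the pool:

```python
# Sketch only: keep at least one node in the arm64 spot pool so the
# autoscaler has a node to scale from. All names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerservice import ContainerServiceClient

subscription_id = "<subscription-id>"      # placeholder
resource_group = "<aks-resource-group>"    # placeholder
cluster_name = "<aks-cluster-name>"        # placeholder
pool_name = "<arm64-spot-pool>"            # placeholder

aks = ContainerServiceClient(DefaultAzureCredential(), subscription_id)

pool = aks.agent_pools.get(resource_group, cluster_name, pool_name)
pool.min_count = 1  # keep one node up instead of scaling from zero
aks.agent_pools.begin_create_or_update(
    resource_group, cluster_name, pool_name, pool
).result()
```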

@smerle33 smerle33 removed the triage Incoming issues that need review label Jan 22, 2024
@smerle33 smerle33 self-assigned this Jan 22, 2024
@NotMyFault NotMyFault changed the title from "Agen't aren't spawning on infra.ci" to "Agents aren't spawning on infra.ci" Jan 22, 2024
@smerle33
Contributor

I opened a support ticket with Azure/Microsoft.

The details of my ticket are not in the email, but the title was:
Issue Definition: autoscaling not working from 0 but does from 1 node

The first feedback was:

Hi Stephane,
 
I hope you're doing well.

Starting the scaling at 0 may prevent the autoscaling from working as expected. The cluster autoscaler component watches for pods that can't be scheduled due to resource constraints. If there are no unscheduled pods, the autoscaler may not trigger the scale-up operation. It is recommended to start the scaling at 1 to ensure proper functioning of the autoscaling mechanism.

If my understanding is correct, I probably explained the problem badly, so I replied:

Hi All,

I hope you're doing well.

`The cluster autoscaler component watches for pods that can't be scheduled due to resource constraints.` This is exactly the problem: arm64 pods cannot be scheduled as there is no node in the node pool, so I expect the autoscaler to spawn one node.
Starting the scaling from 1 means spending more money, with a node doing nothing for part of the time.

Stéphane
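
For context, the "pods that can't be scheduled" that the autoscaler watches for are simply Pending pods whose PodScheduled condition is False with reason Unschedulable. A minimal sketch with the Kubernetes Python client to list them (assuming kubeconfig access to the cluster):

```python
# Sketch only: list the unschedulable pods the cluster autoscaler is
# expected to react to (e.g. arm64 agent pods while the arm64 pool is empty).
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
for pod in pending.items:
    for cond in pod.status.conditions or []:
        if cond.type == "PodScheduled" and cond.reason == "Unschedulable":
            print(f"{pod.metadata.namespace}/{pod.metadata.name}: {cond.message}")
```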

@timja
Member

timja commented Jan 23, 2024

We had similar problems with spot instances: if we had it set to 0 it didn't work, whereas 1 did.

@dduportal dduportal modified the milestones: infra-team-sync-2024-01-23, infra-team-sync-2024-01-30 Jan 24, 2024
@smerle33
Contributor

We had similar problems with spot instances: if we had it set to 0 it didn't work, whereas 1 did.

Thanks @timja, I told the Azure support team I spoke with on a video call; they will look into it. I will keep you informed here, but it sounds right that the spot instances are the trouble here.

@dduportal
Contributor

We had similar problems with spot instances: if we had it set to 0 it didn't work, whereas 1 did.

Thanks @timja, I told the Azure support team I spoke with on a video call; they will look into it. I will keep you informed here, but it sounds right that the spot instances are the trouble here.

Given we can't afford (in the current subscription) to use non-spot instances, WDYT about starting to work on a new AKS cluster dedicated to infra.ci.jenkins.io and release.ci.jenkins.io Kubernetes agents in the new "sponsored" subscription?

Multiple achievements for us:

  • Less consumption in the current subscription, and we consume sponsored credits
  • Separation of concerns between controllers and agents
  • No problem to use non-spot instances until MS solves the problem

If it makes sense, I propose closing this issue and tracking the solution above in a new one (ping @smerle33 if you don't mind writing it).

The new issue would need to mention:

@timja
Member

timja commented Jan 26, 2024

Is there an issue with just leaving 1 spot instance?

@dduportal
Contributor

Is there an issue with just leaving 1 spot instance?

Cost 😅 (but it might be OK; @smerle33 do you mind checking the cost?)

@smerle33
Contributor

Is there an issue with just leaving 1 spot instance?

Cost 😅 (but it might be OK; @smerle33 do you mind checking the cost?)

I think it's quite acceptable: less than $15/month.

(screenshots of the cost estimate)
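
For scale: $15/month over roughly 730 hours comes out to about $0.02 per hour to keep a single node up.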

smerle33 added a commit to jenkins-infra/azure that referenced this issue Jan 26, 2024
@smerle33 smerle33 modified the milestones: infra-team-sync-2024-01-30, infra-team-sync-2024-02-13 Jan 30, 2024
@smerle33
Contributor

smerle33 commented Feb 8, 2024

The last answer from Microsoft confirms our choice:

According to the Microsoft guidelines,
One node should be present where there are unscheduled pods, so that cluster can further autoscale based on those unscheduled pods.
Starting the scaling at 0 may prevent the autoscaling from working as expected.
The cluster autoscaler component watches for pods that can't be scheduled due to resource constraints.
If there are no unscheduled pods, the autoscaler may not trigger the scale-up operation.
It is recommended to start the scaling at 1 to ensure proper functioning of the autoscaling mechanism.

We don't have any official documents for the spot instance.

@smerle33 smerle33 closed this as completed Feb 8, 2024
@smerle33
Contributor

New answer from Microsoft:

One node should be present where there are unscheduled pods, so that cluster can further autoscale based on those unscheduled pods.
Starting the scaling at 0 may prevent the autoscaling from working as expected.
The cluster autoscaler component watches for pods that can't be scheduled due to resource constraints.
If there are no unscheduled pods, the autoscaler may not trigger the scale-up operation.
It is recommended to start the scaling at 1 to ensure proper functioning of the autoscaling mechanism.

@lemeurherve
Member

Isn't it exactly the same response as last week minus the last phrase? ^^
