
[ci] Azure Mariner CI jobs regularly failing: "File not found: 'docker'" #6316

Closed

jameslamb opened this issue Feb 15, 2024 · 11 comments

@jameslamb
Collaborator

Description

Since switching the Linux CI jobs at Azure DevOps to Mariner Linux in #6222, we've seen an increased rate of Azure DevOps jobs failing.

Typically, the failure shows up as an error like this in the "Initialize job" stage:

##[error]File not found: 'docker'

Creating this issue to track those cases.
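For context on what the error means (as far as I understand the agent's behavior): when a job runs in a container, the Azure Pipelines agent looks for the docker executable on the host during job initialization and fails immediately if it is missing. A rough Python illustration of that lookup (the agent itself is not Python; this is only a sketch of the failure mode):

import shutil
import sys

# Rough illustration: during "Initialize job" for container jobs, the agent
# resolves the 'docker' executable on the host's PATH and aborts if absent.
docker_path = shutil.which("docker")
if docker_path is None:
    # Corresponds to the reported error: ##[error]File not found: 'docker'
    sys.exit("File not found: 'docker'")
print(f"docker found at: {docker_path}")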

Reproducible example

A few recent cases:

All had similar logs, like this:

Starting: Initialize job
Agent name: 'lgbub6c540001AS'
Agent machine name: 'lgbub6c540001AS'
Current agent version: '3.234.0'
Agent running as: 'AzDevOps'
Prepare build directory.
Set build variables.
Download all required tasks.
Downloading task: 1ESHostedPoolValidation (1.0.26350327)
Downloading task: CmdLine (2.231.1)
Downloading task: Bash (3.231.5)
Downloading task: PublishBuildArtifacts (1.231.1)
Downloading task: ComponentGovernanceComponentDetection (0.2420208.1)
Checking job knob settings.
   Knob: DockerActionRetries = true Source: $(VSTSAGENT_DOCKER_ACTION_RETRIES) 
   Knob: AgentEnablePipelineArtifactLargeChunkSize = true Source: $(AGENT_ENABLE_PIPELINEARTIFACT_LARGE_CHUNK_SIZE) 
   Knob: ContinueAfterCancelProcessTreeKillAttempt = true Source: $(VSTSAGENT_CONTINUE_AFTER_CANCEL_PROCESSTREEKILL_ATTEMPT) 
   Knob: ProcessHandlerTelemetry = true Source: $(AZP_75787_ENABLE_COLLECT) 
   Knob: IgnoreVSTSTaskLib = true Source: $(AZP_AGENT_IGNORE_VSTSTASKLIB) 
   Knob: FailJobWhenAgentDies = true Source: $(FAIL_JOB_WHEN_AGENT_DIES) 
   Knob: CheckForTaskDeprecation = true Source: $(AZP_AGENT_CHECK_FOR_TASK_DEPRECATION) 
   Knob: MountWorkspace = true Source: $(AZP_AGENT_MOUNT_WORKSPACE) 
Finished checking job knob settings.
##[error]File not found: 'docker'
Finishing: Initialize job

Environment info

N/A

Additional Comments

Pulling this into its own issue (original conversation started in #6307 (comment)).

@jameslamb
Collaborator Author

@jameslamb
Collaborator Author

And one on #6019:

@shiyu1994
Collaborator

I will investigate this. Thanks for creating the issue.

@jameslamb
Collaborator Author

Thanks! I'll keep sharing these links, but will stop if you tell me it's not necessary any more.

Saw many more in the last 24 hours:

Since we made this change in #6222 (December 19, 2023), it definitely seems that the rate of pipeline failures on Azure DevOps has increased.

There's some data on that available at https://dev.azure.com/lightgbm-ci/lightgbm-ci/_pipeline/analytics/stageawareoutcome?definitionId=1&contextType=build.

(screenshot: Azure DevOps stage-level outcome analytics)

Over the last 30 days, for example, 62% of all LightGBM's jobs on Azure DevOps have failed, and 30% of those have been in the Initialize step, where Azure DevOps connects to the agent process.
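If anyone wants to pull similar numbers outside of the analytics UI, a sketch like the following could tally recent build results from the Azure DevOps REST API. This assumes a personal access token with Build (read) scope in an AZDO_PAT environment variable; the percentages above come from the linked dashboard, not from this script, and pagination of results is omitted.

import os
from collections import Counter
from datetime import datetime, timedelta, timezone

import requests

# Sketch: count Azure DevOps build results for the last 30 days.
# Assumes a PAT with "Build (read)" scope in the AZDO_PAT environment variable.
org, project = "lightgbm-ci", "lightgbm-ci"
min_time = (datetime.now(timezone.utc) - timedelta(days=30)).isoformat()
url = f"https://dev.azure.com/{org}/{project}/_apis/build/builds"
params = {"minTime": min_time, "api-version": "7.0"}
resp = requests.get(url, params=params, auth=("", os.environ["AZDO_PAT"]))
resp.raise_for_status()
# Builds still in progress have no 'result' field yet.
results = Counter(b.get("result", "inProgress") for b in resp.json()["value"])
total = sum(results.values())
for result, count in results.most_common():
    print(f"{result}: {count} ({100 * count / total:.0f}%)")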

@jameslamb
Collaborator Author

I'm not going to post as many of these since I think it's clear at this point that there's an issue.

But I want to share this from a run on #6341 about 16 hours ago: 8 of 18 jobs on that run failed at initialization with the error reported in the original post of this issue.

(screenshots of the run's failed jobs)

(build link)

@jameslamb
Collaborator Author

In addition to these frequent failures, over the last few days I've observed what looks like significantly reduced capacity.

For example, the Linux tasks in https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=15983&view=results have all been stuck in Queued for 11 hours, despite there being 0 other commits currently building on Azure DevOps in the LightGBM project.

@shiyu1994 could you look into that? Is LightGBM competing with other pipelines for capacity?

@jameslamb
Collaborator Author

I just restarted the jobs on #6364 and saw all the Linux jobs on Azure DevOps get queued, with messages like this:

The agent request is not running because all potential agents are running other requests. Current position in queue: 131

Some jobs have been stuck in "queued", waiting to be picked up, for more than 4 days.

(screenshot: queued jobs, 2024-03-28 at 9:48:57 PM)

@shiyu1994 can you please look into this? Is LightGBM competing with other projects, or is Azure's capacity for these types of VMs just very limited?

We really rely heavily on these Linux jobs on Azure DevOps and this disruption is blocking development on the project.

@jameslamb
Collaborator Author

jameslamb commented Mar 31, 2024

I've added the blocking label. It's now been 5+ days since we had a single successful run of CI.

#6394 was opened about 12 hours ago, for example, and all of its Linux CI jobs are stuck in Queued: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=16010&view=results

Jobs not being run at all is a different problem from the one reported in the first post here ("File not found: 'docker'"), but I'm including it so all discussion about instability with these new runners stays in one place.

@shiyu1994
Collaborator

The VM scale set on Azure for the CI has failed. I'm fixing it.

@jameslamb
Collaborator Author

Thanks! Whenever it's fixed, I can take care of rebuilding + merging all the already-approved PRs.

@jameslamb
Collaborator Author

Over the last 2 weeks at least, I have not seen the "File not found: 'docker'" issue a single time! 🎉

It's becoming much more common that all of CI across all providers passes with 0 manual re-runs, as just happened on master after I merged #6398.

I wonder if some combination of #6407, #6416, and other fixes made by Azure have all contributed to stabilizing this. Either way, I think we can close this issue for now and re-open it if the problems come up again.

Thanks for all your help @shiyu1994 !!!
