[ci] CUDA jobs failing when installing packages #6001

Closed
jameslamb opened this issue Jul 21, 2023 · 11 comments

Comments

@jameslamb
Collaborator

Description

All the CUDA jobs across several PRs (e.g. #5997, #5999) started failing yesterday, with the following errors.

The following packages were automatically installed and are no longer required:
  clamav clamav-base clamav-freshclam libclamav9 libllvm3.9 libtfm1
  linux-azure-5.4-cloud-tools-5.4.0-1031
  linux-azure-5.4-cloud-tools-5.4.0-1032
  linux-azure-5.4-cloud-tools-5.4.0-1034
  linux-azure-5.4-cloud-tools-5.4.0-1035
  linux-azure-5.4-cloud-tools-5.4.0-1036
  linux-azure-5.4-cloud-tools-5.4.0-1039
  linux-azure-5.4-cloud-tools-5.4.0-1040
  linux-azure-5.4-cloud-tools-5.4.0-1041
  linux-azure-5.4-cloud-tools-5.4.0-1043
  linux-azure-5.4-cloud-tools-5.4.0-1044
  linux-azure-5.4-cloud-tools-5.4.0-1046
  linux-azure-5.4-cloud-tools-5.4.0-1047
  linux-azure-5.4-cloud-tools-5.4.0-1048
  linux-azure-5.4-cloud-tools-5.4.0-1051
  linux-azure-5.4-cloud-tools-5.4.0-1055
  linux-azure-5.4-cloud-tools-5.4.0-1056
  linux-azure-5.4-cloud-tools-5.4.0-1058
  linux-azure-5.4-cloud-tools-5.4.0-1059
  linux-azure-5.4-cloud-tools-5.4.0-1061
  linux-azure-5.4-cloud-tools-5.4.0-1062
  linux-azure-5.4-cloud-tools-5.4.0-1063
  linux-azure-5.4-cloud-tools-5.4.0-1064
  linux-azure-5.4-cloud-tools-5.4.0-1065
  linux-azure-5.4-cloud-tools-5.4.0-1067
  linux-azure-5.4-cloud-tools-5.4.0-1068
  linux-azure-5.4-cloud-tools-5.4.0-1069
  linux-azure-5.4-cloud-tools-5.4.0-1070
  linux-azure-5.4-cloud-tools-5.4.0-1072
  linux-azure-5.4-cloud-tools-5.4.0-1073
  linux-azure-5.4-cloud-tools-5.4.0-1074
  linux-azure-5.4-cloud-tools-5.4.0-1077
  linux-azure-5.4-cloud-tools-5.4.0-1078
  linux-azure-5.4-cloud-tools-5.4.0-1080
  linux-azure-5.4-cloud-tools-5.4.0-1083
  linux-azure-5.4-cloud-tools-5.4.0-1085
  linux-azure-5.4-cloud-tools-5.4.0-1086
  linux-azure-5.4-cloud-tools-5.4.0-1089
  ... truncated ...
  linux-azure-5.4-tools-5.4.0-1091 nvidia-kernel-source-515 nvidia-utils-515
  xserver-xorg-video-nvidia-515
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 115 not upgraded.
3 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Setting up azure-mdsd (1.8.0-build.master.189) ...

Configuration file '/etc/default/mdsd'
 ==> Modified (by you or by a script) since installation.
 ==> Package distributor has shipped an updated version.
   What would you like to do about it ?  Your options are:
    Y or I  : install the package maintainer's version
    N or O  : keep your currently-installed version
      D     : show the differences between the versions
      Z     : start a shell to examine the situation
 The default action is to keep your current version.
*** mdsd (Y/I/N/O/D/Z) [default=N] ? dpkg: error processing package azure-mdsd (--configure):
 end of file on stdin at conffile prompt
Setting up auoms (2.7.0.11) ...

Configuration file '/etc/opt/microsoft/auoms/auoms.conf'
 ==> Modified (by you or by a script) since installation.
 ==> Package distributor has shipped an updated version.
   What would you like to do about it ?  Your options are:
    Y or I  : install the package maintainer's version
    N or O  : keep your currently-installed version
      D     : show the differences between the versions
      Z     : start a shell to examine the situation
 The default action is to keep your current version.
*** auoms.conf (Y/I/N/O/D/Z) [default=N] ? dpkg: error processing package auoms (--configure):
 end of file on stdin at conffile prompt
dpkg: dependency problems prevent configuration of azsec-monitor:
 azsec-monitor depends on auoms (>= 2.4.5); however:
  Package auoms is not configured yet.
  Version of auoms on system, provided by auoms:amd64, is <none>.

dpkg: error processing package azsec-monitor (--configure):
 dependency problems - leaving unconfigured
No apport report written because the error message indicates its a followup error from a previous failure.
Errors were encountered while processing:
 azure-mdsd
 auoms
 azsec-monitor
E: Sub-process /usr/bin/dpkg returned an error code (1)
Error: Process completed with exit code 100.
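
In a log this long, the packages that actually failed are easy to miss. As a rough sketch (the function name `failed_packages` is hypothetical and not part of LightGBM's CI scripts), the `dpkg: error processing package ...` lines can be filtered out of a saved log like this:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the LightGBM repo.
# Reads an apt/dpkg log on stdin and prints the name of each package
# that dpkg reported as failing, one per line.
failed_packages() {
    # dpkg can print its error right after a conffile prompt on the same
    # line, so match the phrase anywhere in the line rather than at the start.
    grep -oE 'dpkg: error processing package [^ ]+' | awk '{ print $5 }'
}
```

Assuming the job output was saved to a file (the name `apt.log` is only an example), `failed_packages < apt.log` would list `azure-mdsd`, `auoms`, and `azsec-monitor` for the log above.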

Reproducible example

This is happening on master and all PRs.

(example build link)

Additional Comments

Some resources that might be helpful:

@jameslamb
Collaborator Author

I just tried another rebuild on #5999, and this is still happening. (build link)

I think fixing this requires some administrative action. @shiyu1994 since you're the only one with access to the machine these CUDA jobs run on, could you please try the following:

Run the following on that machine, and choose Y if presented with that interactive menu.

sudo apt-get update
sudo apt autoremove
sudo apt-get install --no-install-recommends -y \
    curl \
    lsb-release \
    software-properties-common

Then try re-triggering the CUDA jobs, e.g. at https://github.com/microsoft/LightGBM/actions/runs/5619064630/job/15276054852?pr=5999.
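
The failure in the log comes from dpkg asking a conffile question when stdin is closed, so the same repair could in principle be scripted without the interactive menu. The following is only a sketch under assumptions: it assumes that answering "Y" (install the package maintainer's version) is acceptable for every pending conffile on that machine, and the `run` and `repair_pending_packages` helpers and the `DRY_RUN` variable are hypothetical names, not something used in this repo. It is not known whether this matches what was eventually run on the machine.

```shell
#!/usr/bin/env bash
# Sketch only: non-interactive variant of the manual steps above.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

repair_pending_packages() {
    # --force-confnew makes dpkg install the maintainer's version of a
    # modified conffile without prompting (the equivalent of answering "Y").
    # Use --force-confold instead to keep the locally modified file.
    run sudo env DEBIAN_FRONTEND=noninteractive \
        dpkg --force-confnew --configure -a

    # Pass the same choice through apt-get so later installs do not prompt.
    run sudo env DEBIAN_FRONTEND=noninteractive apt-get \
        -o Dpkg::Options::=--force-confnew \
        install --no-install-recommends -y \
        curl lsb-release software-properties-common
}
```

Calling `repair_pending_packages` with `DRY_RUN=1` set prints the two commands for review before anything is changed on the runner.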


@jameslamb
Collaborator Author

@shiyu1994 can you please help?

Sorry to keep @-ing you, but development in the repo is blocked until these jobs are fixed and you're the only one with access. If I had access to the machine, I'd be happy to handle this.

I just tried re-running again, and the jobs failed the same way: https://github.com/microsoft/LightGBM/actions/runs/5648086657

This problem won't go away on its own.

@shiyu1994
Collaborator

@jameslamb Sorry for the late response. I'll check the machine.

@jameslamb
Collaborator Author

I just tried re-running again, and the jobs failed the same way:

https://github.com/microsoft/LightGBM/actions/runs/5640538711/job/15486941200

@shiyu1994 if you don't have time to help with this, can I just have access to the machine so I can fix it? I want development to restart in the repo as soon as possible.

@shiyu1994
Collaborator

The issue is fixed, and the CUDA CI jobs now seem OK to run.

@shiyu1994
Collaborator

And sorry for the delay.

@jameslamb
Collaborator Author

Thank you so much!

And sorry for the delay

Is it possible for me to get access to the machine so I can do things like this in the future? Or for someone else like @guolinke to also have access?

These parts of our development where there is only one person who can do something pose a big risk of long disruptions like this.

jameslamb removed the blocking label Aug 4, 2023
@shiyu1994
Collaborator

I'm trying with @guolinke to see if the machine can be safely accessed by other maintainers.

@jameslamb
Collaborator Author

Thank you, that would be helpful!

As a general rule, having more than one person for every operational responsibility in the repo would improve the long-term health and sustainability of this project.

@jameslamb
Collaborator Author

This issue was fixed a while ago, and we've moved the discussion about expanding access to administer the machine for these jobs into a private space.

This can be closed.
