Machines keep going offline during builds #232

Closed
evanlucas opened this issue Oct 29, 2015 · 18 comments

Comments

@evanlucas

https://ci.nodejs.org/job/node-test-commit-plinux/180/ is an example of it

@jbergstroem
Member

Thing is, they don't. I've been logged into the slaves while the timeout occurred. There's also:

..so I'm suspecting either jenkins (yay) or perhaps even the ci host.

@jbergstroem changed the title from "PPC machines keep going offline during builds" to "Machines keep going offline during builds" on Oct 30, 2015
@jbergstroem
Member

Minor update: I've updated most of our slaves to 1.52, but it doesn't seem to have helped; I've seen Windows slaves go offline since.

@rvagg
Member

rvagg commented Nov 1, 2015

FWIW, @joaocgreis mentioned that the machines on Azure had some weird networking problem that was causing the on/off behaviour. They are unique in this respect, judging by the CI status emails we're getting anyway.

@jbergstroem
Member

@rvagg let's assume that was the case then. I'll try to keep a close eye on failures over the next few days.

@jbergstroem
Member

I'm starting to think the host is to blame. Should we try updating jenkins? Can't find anything relevant in the changelog.

@joaocgreis
Member

I've had connection problems very frequently when I was setting up the cross compiler machine (running Linux). I changed the connection to ssh from Jenkins and haven't seen it fail since.

I've installed Cygwin on node-msft-win10-5 to try ssh to Windows (it should work), but no luck connecting so far.

It's strange that the machines in Azure are constantly having this problem, but the ones on Rackspace are always fine.

@joaocgreis
Member

I might have figured out the problem with the Azure machines. The Jenkins slave has a keep-alive signal with a default interval of 5 minutes. That seems to be too long for Azure, and the connections were being broken because of it. I added a JVM option to the Azure slaves to reduce it to 2 minutes (that's what Azure uses for SSH). Let's see if that's the correct fix for the correct problem, but I'm hopeful.
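
For anyone wanting to try the same change, here is a minimal sketch of how such a JVM option might be passed when launching a slave. The thread does not name the option; this assumes it is the `hudson.slaves.ChannelPinger.pingInterval` system property (minutes, default 5) that Jenkins documents. The jar path, the JNLP URL, and the idle timeout used in the check are hypothetical placeholders, not values taken from this CI setup.

```python
# Assumption: the JVM option is hudson.slaves.ChannelPinger.pingInterval,
# which Jenkins documents as the ping frequency in minutes (default 5).
# The thread itself does not say which option was used.
PING_PROPERTY = "hudson.slaves.ChannelPinger.pingInterval"


def agent_command(ping_minutes,
                  jar="slave.jar",  # hypothetical path to the agent jar
                  jnlp_url="https://ci.example.org/computer/NODE/slave-agent.jnlp"):  # placeholder URL
    """Build the java command line that starts a slave with a custom ping interval."""
    if ping_minutes < 1:
        raise ValueError("ping interval must be at least 1 minute")
    # JVM system properties (-D...) must come before -jar to take effect.
    return ["java", f"-D{PING_PROPERTY}={ping_minutes}", "-jar", jar, "-jnlpUrl", jnlp_url]


def keeps_connection_alive(ping_minutes, idle_timeout_minutes):
    """A keep-alive only helps if it fires before the network drops an idle connection."""
    return ping_minutes < idle_timeout_minutes


if __name__ == "__main__":
    print(" ".join(agent_command(2)))
```

The second helper only spells out the reasoning: whatever idle timeout the network enforces, the ping interval has to be shorter than it, which is why dropping from 5 minutes to 2 could plausibly make a difference.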

On the other hand, Jenkins has been completely broken since https://ci.nodejs.org/job/node-test-commit/1107/ . Apparently, sub jobs are being started only if Jenkins detects a change in git, even though that option is explicitly disabled everywhere.

Right now, my best guess is that some plugin update broke it. The multijob plugin was updated (does it have automatic updates?) to 1.19, which introduced the "Resume build" button, and that button appears for the first time in the first build with problems. This might be a coincidence. I'm still looking into it; this is just to share progress.

@rmg

rmg commented Nov 13, 2015

@joaocgreis I've noticed a similar problem on my own multi-jobs. I had to enable the "Only build when VCS changes are detected" option because it doesn't actually mean what it says.

See https://issues.jenkins-ci.org/browse/JENKINS-30952

@joaocgreis
Member

I downgraded the multijob plugin to 1.18 and it's building, looks good so far. I'd rather leave it at 1.18 instead of flipping all the "build only if VCS" checkboxes because we have quite a few. That issue is 8 hours old, perhaps the fix won't take too long.

@joaocgreis
Member

The test-binary jobs did not work after downgrading the multijob plugin, had to upgrade again and flip all the switches. We'll probably have to flip them again when this gets fixed.

But they still don't work: https://ci.nodejs.org/job/node-test-binary-arm/482/console and https://ci.nodejs.org/job/node-test-binary-windows/284/console EDIT: I cloned the jobs to clear the history, they seem to be working now.

@joaocgreis
Member

I haven't seen Azure machines failing again, so I assume the keep alive interval change fixed it. As for jenkins, jobs seem to be running well now.

So, what's keeping this issue alive is the (far fewer) random failures not tied to a specific set of slaves. Are those still happening?

@jbergstroem
Member

I think we're improving on all fronts 👍

@jbergstroem
Member

We haven't seen disconnects for a long while. Very good news! Let's close this and sleep better at night, hoping it won't be reopened. I guess the bad part is that we never really identified why some of the disconnects happened, but it's pretty much established that lowering the ping interval between master and slaves did a lot.
