
Major outage right now #749

Closed
rvagg opened this issue Jun 7, 2017 · 16 comments

@rvagg
Member

rvagg commented Jun 7, 2017

The power company has shut me down just now. The ARM cluster is offline, as are the primary OSX build / release machines. This could last for up to ~5 hours apparently; I'll keep this updated as I know more. Sorry for the lack of notice.

@nodejs/release no releases possible just yet.

@mikeal
Contributor

mikeal commented Jun 7, 2017

This is a good time to figure out what service we want to pay for to get OSX builds. Budget is already allocated for it by the board.

@jbergstroem
Member

I don't think downtime is a good time to talk about future plans, but I'll bite. We already have initiatives (#724, #741, or the classic #367) to get Mac resources in place. To me it seems we lack man hours. As you can see, we think the main issue is a lack of sponsorship visibility, which could be seen as a marketing effort. Any suggestions on how to improve and speed this up?

@rvagg
Member Author

rvagg commented Jun 7, 2017

Not unreasonable, @mikeal; we've been in limbo with our OSX resources for a bit too long. We've decided on some concrete steps we need to take ASAP, and we've also looped @mhdawson into the process so we're not reliant on fragile and unreliable dependencies (me). Best to take discussion of the specifics to email at this stage, though.

@jasnell
Member

jasnell commented Jun 7, 2017

@rvagg ... For the ARM cluster, have you written up a detailed description of what it would take to set up an equivalent mirror? If not, having one would be quite helpful.

@mikeal
Contributor

mikeal commented Jun 7, 2017

@jbergstroem we have budget allocated to pay a provider. I'm not saying we shouldn't pursue donors, but we should get set up on a paid provider, if only for the reliability in case a donor goes down.

@jasnell
Member

jasnell commented Jun 7, 2017

@rvagg ... It would be helpful to have a post-mortem write-up of this (and of all outages, really) for the CTC/TSC once it's resolved. We need to have more visibility into these kinds of things.

@rvagg
Member Author

rvagg commented Jun 8, 2017

👍 @jasnell. FYI, we're tracking to a point where the only difficult-to-duplicate resource is armv6, and we already have a precedent of allowing releases to go out without it. OSX is about to be sorted (I just got good news via email on that front, @mikeal), and armv8 has two new providers stepping up to take the load off the noisy boxes in my garage (I might decommission those entirely once we're redundant). I'll provide more detail when we're over this current hump, though; I know a number of people are concerned about our resilience at the moment.

@rvagg
Member Author

rvagg commented Jun 8, 2017

We've just kicked off a relationship with MacStadium, initially for a six-month term with a review after that to see how it's working. We're still setting up some of the resources, but this will nix the weakest point in our infra at the moment! I'll give more details soon, but I thought the good news was worth sharing.

@jbergstroem
Member

jbergstroem commented Jun 8, 2017

Just want to add that I'm also involved in both setting up the new macOS cluster and offloading ARM to new sponsors. Downtimes like these shouldn't have to be a driver for improvement, but they usually end up being one anyway (remember the DigitalOcean issues?).

@rvagg
Member Author

rvagg commented Jun 8, 2017

All working again. I also took the opportunity to do some minor maintenance across the Pis.

@rvagg rvagg closed this as completed Jun 8, 2017
@refack
Contributor

refack commented Jun 8, 2017

@rvagg
Member Author

rvagg commented Jun 9, 2017

OK, so the Pi cluster is back to "healthy" again. My guess is that poor NFS performance is to blame: because I cleared out the workspaces after restarting everything, each machine had to start from scratch, and that leads to all of them simultaneously reading and writing over the network.

The disk can handle it, the server should be able to handle it, and the individual machines should be able to handle it too, so it's either the network topology or hardware that sucks, or simply NFS that sucks. I've been assuming the latter and will continue to do so without better insight. Anyone with the nodejs_build_test key should be able to get into these machines to do diagnosis, so if you think you have the skills to dig, you're welcome to.

Perhaps I should be exploring alternatives to NFS? Why hasn't NFS matured more than it has? Are there better solutions? Should I be using CIFS (Windows/Samba), sshfs, or something else?
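A quick way to sanity-check whether NFS throughput really is the bottleneck would be a small benchmark along these lines. This is just a rough sketch: the mount point and local path below are placeholders, not the actual layout on the Pi hosts.

```python
#!/usr/bin/env python3
"""Rough sequential I/O comparison: NFS-mounted workspace vs. local storage.

The paths are placeholders; point them at the real NFS mount and a local
directory on one of the Pi build machines before running.
"""
import os
import time

TARGETS = {
    "nfs-workspace": "/mnt/nfs/workspace/bench.tmp",  # placeholder NFS path
    "local-disk": "/tmp/bench.tmp",                   # placeholder local path
}

BLOCK = b"\0" * (1 << 20)  # 1 MiB per write
TOTAL_MB = 256             # keep it modest so the cluster isn't hammered


def bench_write(path, total_mb=TOTAL_MB):
    """Write total_mb sequentially, fsync, and return throughput in MB/s."""
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(total_mb):
            f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reaches the server
    return total_mb / (time.time() - start)


def bench_read(path):
    """Read the file back in 1 MiB chunks and return throughput in MB/s.

    Note: results can be inflated by the client page cache.
    """
    size_mb = os.path.getsize(path) / (1 << 20)
    start = time.time()
    with open(path, "rb") as f:
        while f.read(1 << 20):
            pass
    return size_mb / (time.time() - start)


if __name__ == "__main__":
    for name, path in TARGETS.items():
        write_speed = bench_write(path)
        read_speed = bench_read(path)
        os.remove(path)
        print(f"{name:15s} write {write_speed:7.1f} MB/s   read {read_speed:7.1f} MB/s")
```

If the NFS numbers come out dramatically lower than the local ones, the next things to try would probably be tuning mount options (rsize/wsize, async) or moving the workspaces onto local SD/USB storage, before swapping protocols entirely.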

@refack
Contributor

refack commented Jun 9, 2017

In my experience, CIFS is no better than NFS.

@rvagg
Member Author

rvagg commented Jun 13, 2017

@jasnell, @mhdawson, @Trott, @nodejs/build I did a write-up of the outage here: https://github.com/nodejs/build/wiki/Service-disruption-post-mortems#2017-06-07-ci-infrastructure-partial-outages

Includes:

  • Details
  • Impact
  • Resolution
  • Weaknesses exposed
  • Action items post-outage

Plus links to the various related issues.

I'd like us to produce similar write-ups on the same wiki page for future outages. It'd be a good habit for us to build: it keeps us accountable and gives us a single place to record and share the info instead of scattering it across GitHub, IRC & email.

@jasnell, @mhdawson & @Trott can I get you to weigh in on whether this needs to be shared more widely?

@jasnell
Member

jasnell commented Jun 13, 2017

Thank you @rvagg. I think making it available on the wiki or even as repo issues somewhere is sufficient.

@mhdawson
Member

I think putting these in a directory in the repo is probably good enough. Maybe something like doc/service-disruptions-post-mortems (or something shorter if somebody has a better name).

On the PPC/AIX front, I agree with your write-up that it's not a high priority to get additional redundancy, as the uptime for those systems has been quite good. I can't remember the last time they were down; it was just the unfortunate timing of an unplanned power outage at OSUOSL this time.
