
Major outage right now #749

Closed
rvagg opened this issue Jun 7, 2017 · 16 comments

@rvagg
Member

rvagg commented Jun 7, 2017

The power company has shut me down just now. The ARM cluster is offline, as are the primary OSX build / release machines. This could last for up to ~5 hours apparently; I'll keep this updated as I know more. Sorry for the lack of notice.

@nodejs/release no releases possible just yet.

@mikeal
Contributor

mikeal commented Jun 7, 2017

This is a good time to figure out what service we want to pay for to get OSX builds. Budget is already allocated for it by the board.

@jbergstroem
Member

I don't think downtime is a good time to talk about future plans, but I'll bite. We already have initiatives (#724, #741, or the classic #367) to get Mac resources in place. To me it seems we lack man hours. As you can see, we think the main issue is a lack of sponsorship visibility, which could be seen as a marketing effort. Any suggestions on how to improve and speed this up?

@rvagg
Member Author

rvagg commented Jun 7, 2017

Not unreasonable, @mikeal; we've been in limbo with our OSX resources for a bit too long. We've decided on some concrete steps we need to take ASAP, and we've also looped @mhdawson into the process so we're not reliant on fragile and unreliable dependencies (me). Best to take discussion of the specifics to email at this stage, though.

@jasnell
Member

jasnell commented Jun 7, 2017

@rvagg ... For the ARM cluster, have you written up a detailed description of what it would take to set up an equivalent mirror? If not, having one would be quite helpful.

@mikeal
Contributor

mikeal commented Jun 7, 2017

@jbergstroem we have budget allocated to pay a provider. I'm not saying we shouldn't pursue donors, but we should get set up on a paid provider, if only for the reliability in case a donor goes down.

@jasnell
Member

jasnell commented Jun 7, 2017

@rvagg ... It would be helpful to have a post-mortem write-up of this (and of all outages, really) for the CTC/TSC once it's resolved. We need to have more visibility into these kinds of things.

@rvagg
Member Author

rvagg commented Jun 8, 2017

👍 @jasnell. FYI, we're tracking to a point where the only difficult-to-duplicate resource is armv6, and we already have a precedent of allowing releases to go out without it. OSX is about to be sorted (I just got good news via email on that front, @mikeal), and armv8 has two new providers stepping up to take the load off the noisy boxes in my garage (I might decommission those entirely once we're redundant). I'll provide more detail when we're over this current hump, though; I know a number of people are concerned about our resilience at the moment.

@rvagg
Member Author

rvagg commented Jun 8, 2017

We've just kicked off a relationship with MacStadium, initially for a six-month term with a review after that to see how it's working. We're still setting up some of the resources, but this will nix the weakest point in our infra at the moment! I'll give more details soon, but I thought the good news was worth sharing.

@jbergstroem
Member

jbergstroem commented Jun 8, 2017

Just want to add that I'm also involved in both setting up the new macOS cluster and offloading ARM to new sponsors. Downtimes like these shouldn't have to be a driver for improvement, but they usually end up being one anyway (remember the DigitalOcean issues?).

@rvagg
Member Author

rvagg commented Jun 8, 2017

All working again. I also took the opportunity to do some minor maintenance across the Pis.

@rvagg rvagg closed this as completed Jun 8, 2017
@refack
Contributor

refack commented Jun 8, 2017

@rvagg
Member Author

rvagg commented Jun 9, 2017

OK, so the Pi cluster is back to "healthy" again. My guess is that poor NFS performance is to blame: because I cleared out the workspaces after restarting everything, each machine had to start from scratch, and that leads to all of them simultaneously reading and writing over the network.

The disk can handle it, the server should be able to handle it, and the individual machines should be able to handle it too, so it's either the network topology or hardware that sucks, or simply NFS that sucks. I've been assuming the latter and will continue to do so without better insight. Anyone with the nodejs_build_test key should be able to get into these machines to do diagnosis, so if you think you have the skills to dig, you're welcome to.

Perhaps I should be exploring alternatives to NFS? Why hasn't NFS matured more than it has? Are there better solutions? Should I be using CIFS (Windows/Samba), sshfs, or something else?
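A quick way to sanity-check whether NFS throughput really is the bottleneck would be a small benchmark along these lines. This is just a rough sketch: the mount point and local path below are placeholders, not the actual layout on the Pi hosts.

```python
#!/usr/bin/env python3
"""Rough sequential I/O comparison: NFS-mounted workspace vs. local storage.

The paths are placeholders; point them at the real NFS mount and a local
directory on one of the Pi build machines before running.
"""
import os
import time

TARGETS = {
    "nfs-workspace": "/mnt/nfs/workspace/bench.tmp",  # placeholder NFS path
    "local-disk": "/tmp/bench.tmp",                   # placeholder local path
}

BLOCK = b"\0" * (1 << 20)  # 1 MiB per write
TOTAL_MB = 256             # keep it modest so the cluster isn't hammered


def bench_write(path, total_mb=TOTAL_MB):
    """Write total_mb sequentially, fsync, and return throughput in MB/s."""
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(total_mb):
            f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reaches the server
    return total_mb / (time.time() - start)


def bench_read(path):
    """Read the file back in 1 MiB chunks and return throughput in MB/s.

    Note: results can be inflated by the client page cache.
    """
    size_mb = os.path.getsize(path) / (1 << 20)
    start = time.time()
    with open(path, "rb") as f:
        while f.read(1 << 20):
            pass
    return size_mb / (time.time() - start)


if __name__ == "__main__":
    for name, path in TARGETS.items():
        write_speed = bench_write(path)
        read_speed = bench_read(path)
        os.remove(path)
        print(f"{name:15s} write {write_speed:7.1f} MB/s   read {read_speed:7.1f} MB/s")
```

If the NFS numbers come out dramatically lower than the local ones, the next things to try would probably be tuning mount options (rsize/wsize, async) or moving the workspaces onto local SD/USB storage, before swapping protocols entirely.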

@refack
Contributor

refack commented Jun 9, 2017

In my experience, CIFS is no better than NFS.

@rvagg
Member Author

rvagg commented Jun 13, 2017

@jasnell, @mhdawson, @Trott, @nodejs/build I did a write-up of the outage here: https://github.com/nodejs/build/wiki/Service-disruption-post-mortems#2017-06-07-ci-infrastructure-partial-outages

Includes:

  • Details
  • Impact
  • Resolution
  • Weaknesses exposed
  • Action items post-outage

Plus links to the various related issues.

I'd like us to produce similar write-ups on the same wiki page for future outages. It'd be a good habit for us to build: it keeps us accountable and gives us a single place to record and share the info instead of scattering it across GitHub, IRC & email.

@jasnell, @mhdawson & @Trott can I get you to weigh in on whether this needs to be shared more widely?

@jasnell
Member

jasnell commented Jun 13, 2017

Thank you @rvagg. I think making it available on the wiki or even as repo issues somewhere is sufficient.

@mhdawson
Member

I think putting these in a directory in the repo is probably good enough. Maybe something like doc/service-disruptions-post-mortems (or something shorter if somebody has a better name).

On the PPC/AIX front, I agree with your write-up that it's not a high priority to get additional redundancy, as the uptime for those systems has been quite good. I can't remember the last time they were down; it was just the unfortunate timing of an unplanned power outage at OSUOSL this time.
