
Downloads of binaries from nodejs.org are slow / not working #1993

Closed
watson opened this issue Oct 23, 2019 · 54 comments

@watson (Member) commented Oct 23, 2019

All binary downloads from nodejs.org are very slow currently. E.g. https://nodejs.org/dist/v13.0.1/node-v13.0.1-darwin-x64.tar.xz.

In many cases they time out, which affects CI systems like Travis that rely on nvm to install Node.js.

According to @targos, the server's CPU is pegged at 100%, almost entirely by nginx, but we can't figure out what's wrong.
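A quick way to quantify the slowness from any given vantage point is to time a download and report the effective transfer rate. The sketch below uses the same dist URL mentioned above; the flags are standard curl options and the output format string is just illustrative:

```sh
# Measure effective download speed and total time for a dist artifact.
curl -fSL -o /dev/null \
  -w 'speed: %{speed_download} bytes/s, total: %{time_total}s, http: %{response_code}\n' \
  https://nodejs.org/dist/v13.0.1/node-v13.0.1-darwin-x64.tar.xz
```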

@targos (Member) commented Oct 23, 2019

Top of the top:
[screenshot: `top` output]

@targos (Member) commented Oct 23, 2019

@atulmy commented Oct 23, 2019

Same here; using tj/n, the download just terminates after a while.

[screenshot: terminal output]

@kachkaev commented Oct 23, 2019

Azure DevOps pipelines also suffer, because the standard Node.js setup task downloads the binary from nodejs.org.

[screenshot: Azure DevOps task log]

Kudos to those who are looking into the CPU usage ✌️ You'll nail it! 🙌

@sam-github (Member) commented Oct 23, 2019

@targos what machine is the top output from?

Fwiw, the only infra team members are listed at https://github.com/orgs/nodejs/teams/build-infra/members, and Gibson isn't current; he should be removed.

@MylesBorins (Member) commented Oct 23, 2019

I wonder what traffic looks like... could be a DoS.

We've talked about it in the past, but we really should be serving these resources via a CDN... I believe the sticking point was maintaining our metrics. I can try to find some time to help out with the infrastructure here if others don't have the time.

@sam-github (Member) commented Oct 23, 2019

It could also be that 13.x is super popular? Hard to know without access to the access logs.

@MylesBorins the infrastructure team is even less staffed than the build team as a whole; if you wanted to pick that up, it would be 🎉 fabulous.

@mhdawson (Member) commented Oct 23, 2019

Metrics have been the reason the downloads are not on a CDN; @joaocgreis and @rvagg have an active discussion going on what we might do on that front.

@MylesBorins if you can get help for this specific issue, that might be good (please check with @rvagg and @joaocgreis). Better still would be ongoing, sustained engagement and contribution to the overall build work.

@targos (Member) commented Oct 23, 2019

@sam-github it's from direct.nodejs.org. I don't have root access to the machine, so top is basically the only thing I could do to help.

@sam-github (Member) commented Oct 23, 2019

I don't even have non-root access; I guess you do because you're on the release team.

We'll have to wait for an infra person.

@mhdawson (Member) commented Oct 23, 2019

I can help take a look after the TSC meeting.

@michaelsbradleyjr commented Oct 23, 2019

In addition to downloads only intermittently succeeding, when they do succeed it seems that sometimes (?) the bundled node-gyp is broken. I'm seeing that behavior in CI builds on Azure Pipelines.

For Linux builds with successful downloads of Node.js v8.x and v10.x, node-gyp fails to build sources that were previously unproblematic.

For Windows builds with successful downloads of Node.js v8.x and v10.x, I'm seeing errors like this:

gyp ERR! stack Error: win-x86/node.lib local checksum 629a559d347863902b81fad9b09ba12d08d939d20a0762578e904c6ee1a7adab not match remote fea7d0c5a94fc834a78b44b7da206f0e60ef11dd20d84dc0d49d57ee77e20e16

That is, on Windows node-gyp doesn't attempt to compile; it fails outright owing to the checksum error.
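For what it's worth, a truncated or corrupted download is enough to produce a checksum mismatch like the one above. One way to check whether an artifact you fetched is intact is to compare it against the SHASUMS256.txt published with each release; the version and filename below are illustrative, not the exact ones from this CI run:

```sh
# Verify a downloaded release artifact against the published checksums.
# VERSION and FILE are illustrative.
VERSION=v10.17.0
FILE=node-$VERSION-win-x64.zip
curl -fsSLO "https://nodejs.org/dist/$VERSION/$FILE"
curl -fsSL "https://nodejs.org/dist/$VERSION/SHASUMS256.txt" \
  | grep " $FILE\$" \
  | sha256sum -c -
```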

@mhdawson (Member) commented Oct 23, 2019

@michaelsbradleyjr I suspect that the node-gyp failures would have been related to downloading header files.

@mhdawson (Member) commented Oct 23, 2019

I was looking into this with @targos. It seems to have resolved itself at this point.

Looking at the access logs, it was a bit hard to tell whether there was more traffic or not, as the files for the previous day are compressed. We should be able to compare more easily tomorrow.

@mhdawson (Member) commented Oct 23, 2019

top on the machine also looks pretty much the same as it did while there were issues.

@rvagg (Member) commented Oct 23, 2019

hm, that top output doesn't look exceptional (the load average would have been nice to see though, @targos; it's at the top right of top's output). Cloudflare also monitors our servers for overload and switches between them as needed, and they're not recording any incidents. It's true that this is a weak point in our infra and that getting downloads fully CDNified is a high priority (we're working on it, but are having trouble finding solid solutions that meet all our needs), but I'm not sure we can fully blame this on our main server; there might be something network-related at play that we don't have insight into.

btw, it's myself and @jbergstroem having the discussion about this, @mhdawson, not so much @joaocgreis, although you and he are also on the email chain.

@MylesBorins one thing that has come up that might be relevant to you: if we can get access to Cloudflare's Logpush feature (still negotiating), then we'd need a place to put the logs that they support. GCP Storage is an option for that, so getting hold of some credits may be handy in this instance.

@mhdawson (Member) commented Oct 23, 2019

@rvagg sorry, you're right, @jbergstroem; I got the wrong name.

@mhdawson (Member) commented Oct 23, 2019

Possibly unrelated, but I did see a complaint about a completely different site being slow around the same time.

@watson (Member, Author) commented Oct 24, 2019

I can ask my company if they'd be interested in sponsoring us with a hosted Elasticsearch and Kibana setup that can pull all our logs and server metrics and make them easily accessible. This will also give us the ability to send out alerts etc. Would we be interested in that?

@rvagg (Member) commented Oct 24, 2019

Thanks @watson. It's not the metrics-gathering process that's difficult; it's gathering the logs in the first place, and doing it in a reliable enough way that we have confidence it'll keep working even if we're not checking it regularly. We don't have dedicated infra staff to do that level of monitoring.

Our Cloudflare sponsorship got a minor bump to include Logpull, normally an enterprise feature, but so far we've not come up with a solution that we're confident enough to set and forget, like we currently can with nginx logs. Logpush, a newer feature, would be nice because they take care of storing the logs in a place we nominate. Part of this is also the people-time needed to do it: access to server logs is not something we can hand off to just anyone, so there's only a small number of individuals with enough access and trust to implement all of this.

@bvitale commented Oct 24, 2019

Hi folks, this seems to be cropping up again this morning.

@mhdawson (Member) commented Oct 24, 2019

Current load average:

cat /proc/loadavg
2.12 2.15 1.96 1/240 15965
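(A quick key for these numbers, for readers who don't stare at /proc/loadavg often; the sample shown is the one above:)

```sh
# /proc/loadavg fields: 1-, 5- and 15-minute load averages,
# runnable/total scheduling entities, and the most recently created PID.
cat /proc/loadavg   # e.g. "2.12 2.15 1.96 1/240 15965"
```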
@mhdawson (Member) commented Oct 24, 2019

Load average seems fairly steady:

cat /proc/loadavg
1.76 1.90 1.91 3/252 26955

@renaudaste commented Oct 24, 2019

Hello,
Having the issue again on our Azure Pipelines (West Europe):

[screenshot: Azure Pipelines log]

Thank you and good luck to the team!

@mhdawson (Member) commented Oct 24, 2019

Captured ps -ef to /root/processes-oct-24-11am so that we can compare later on.

@mbwhite commented Oct 25, 2019

FYI: it's 08:00 GMT and the Azure Pipelines Node task that failed yesterday due to this is working now.. so :-)

@mhdawson (Member) commented Oct 25, 2019

Load today seems lower than the past 2 days:

cat /proc/loadavg
0.57 0.63 0.62 1/202 23650
root@infra-digitalocean-ubuntu1604-x64-1:~# sh net.sh eth0
eth0  {rx: 2843, tx: 1566456}
eth0  {rx: 3563, tx: 1901493}
eth0  {rx: 3656, tx: 1849576}
eth0  {rx: 2042, tx: 1089992}
eth0  {rx: 2664, tx: 1552519}
eth0  {rx: 2375, tx: 1052912}
eth0  {rx: 4914, tx: 2037342}
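The net.sh helper itself isn't included in the thread. A minimal sketch of a sampler that would produce output in roughly this shape, reading the kernel's per-interface byte counters, might look like the following; the one-second interval and byte units are assumptions:

```sh
#!/bin/sh
# Print rx/tx byte deltas per interval for an interface, e.g. `sh net.sh eth0`.
IFACE="${1:-eth0}"
STATS="/sys/class/net/$IFACE/statistics"
prev_rx=$(cat "$STATS/rx_bytes")
prev_tx=$(cat "$STATS/tx_bytes")
while sleep 1; do
  rx=$(cat "$STATS/rx_bytes")
  tx=$(cat "$STATS/tx_bytes")
  echo "$IFACE  {rx: $((rx - prev_rx)), tx: $((tx - prev_tx))}"
  prev_rx=$rx
  prev_tx=$tx
done
```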
rvagg added a commit that referenced this issue Oct 26, 2019
Ref: #1993
@OriginalEXE commented Oct 29, 2019

Seems to be happening again at the moment.

@merlinnot commented Oct 29, 2019

I can confirm that.

@vibou commented Oct 29, 2019

we confirm too

@mhdawson (Member) commented Oct 29, 2019

Load average:

cat /proc/loadavg
1.82 1.86 1.82 2/213 27297

network traffic:

eth0  {rx: 21581, tx: 2055858}
eth0  {rx: 21093, tx: 2027874}
eth0  {rx: 21810, tx: 2051826}
eth0  {rx: 30523, tx: 2278048}
eth0  {rx: 22663, tx: 2091662}
eth0  {rx: 20852, tx: 1983614}
eth0  {rx: 21671, tx: 2062691}
eth0  {rx: 21534, tx: 2043597}
eth0  {rx: 21430, tx: 2029947}
eth0  {rx: 21972, tx: 2070486}
eth0  {rx: 22611, tx: 2047423}
eth0  {rx: 20209, tx: 1935555}
eth0  {rx: 20411, tx: 1940259}

Although it seems to have improved a bit.

@shouze commented Oct 29, 2019

I confirm too.

@rvagg @mhdawson You could probably benefit from enabling sendfile in the nginx config, and from reading up on how to optimize the listen backlog queue?
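For readers following along, the kind of tuning being suggested looks roughly like the sketch below. This is illustrative only: the file path, port, and numbers are assumptions, not the actual nodejs.org configuration.

```sh
# Sketch of the suggested nginx tuning (illustrative values and paths).
cat > /etc/nginx/conf.d/tuning.conf <<'EOF'
sendfile    on;   # let the kernel copy file data directly to the socket
tcp_nopush  on;   # send headers and the start of the file in fewer packets
tcp_nodelay on;
EOF

# A larger accept backlog goes on the server block's listen directive,
# e.g. `listen 443 ssl http2 backlog=4096;`, and must be allowed by the kernel:
sysctl -w net.core.somaxconn=4096
nginx -t && systemctl reload nginx
```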

@shouze commented Oct 29, 2019

or maybe some people at https://bintray.com/ @bintray (or some other competitor) would offer to host the Node.js artifacts for the cause? 💝 👼 🙏

@rvagg (Member) commented Oct 30, 2019

Thanks for the sendfile hint @shouze, I've done that now.

Looking at the metrics in DigitalOcean, these brownouts are correlated with bandwidth peaks; neither CPU nor I/O is a problem. We're hitting new highs that we haven't hit before, and I think this is the ultimate cause of our problems. I don't know where that problem lies, whether in our main server or somewhere in the DigitalOcean network pipeline. Droplets are not individually constrained on bandwidth, so in theory that pipeline should be fine.

Anyway, we're making good progress on getting fully CDNified. While we're thankful for suggestions, we have some constraints that don't make it as simple as just switching on a full CDN for these things. But we have promising progress that should let us get there pretty soon. There's obviously some urgency with this given these outages! Will update here when we've made the transition.

Please continue to let us know here if you still experience outages; it helps us correlate with other metrics to try and identify the true cause.

rvagg added a commit that referenced this issue Oct 30, 2019
Ref: #1993
@bvitale commented Oct 30, 2019

Downloads are slow again

@MattIPv4 commented Oct 30, 2019

Seeing this as an issue again

@SamChou19815 commented Oct 30, 2019

Downloads are slow again

@MattIPv4 commented Oct 30, 2019

@rvagg If there is anything we can help with on the DigitalOcean side of things, please do reach out to our support team :)

@visnup referenced this issue Oct 30, 2019
rvagg added a commit that referenced this issue Oct 30, 2019
@rvagg (Member) commented Oct 30, 2019

Yep, that's the next step; thanks @MattIPv4. We have nicer metrics now: these outages correlate with bandwidth peaks and nothing else, and nothing we've done on the server seems to have made an impact.

@rvagg (Member) commented Oct 31, 2019

Very prompt intervention from DO Support: they appear to agree that there's something odd going on at the network level; we hit a peak and it seems like a ceiling. So our primary droplet has been migrated to new hardware and we'll continue to monitor. Please let us know if problems persist. We have a peak time of day that correlates directly with the US workday, starting around 2pm UTC / 7am Pacific and lasting roughly 4 hours. That's when these problems are most likely to be experienced, so please let us know.

I've also tightened up the health check in Cloudflare so it will hopefully recognise a problem more quickly and shunt traffic over to our backup server. The load-balancing logs don't show any problems over the last week, so either our health check has been too relaxed, or the problems aren't being experienced at the same level there and the cause is somewhere else.

@jbergstroem and I are continuing to work on better CDN fronting with some urgency.

(EDIT: there was a bit of trouble with the CDN health-check tweaks; hopefully that didn't have flow-on effects for users during the 10 minutes it was having difficulties, but if you had download troubles at roughly 1:30am UTC then that was the cause. Apologies!)

@MylesBorins (Member) commented Oct 31, 2019

@chrizzo84 commented Oct 31, 2019

Hi all,
the problem still exists... Tested on an Azure DevOps agent:

gzip: stdin: unexpected end of file
/bin/tar: Unexpected EOF in archive
/bin/tar: Unexpected EOF in archive
/bin/tar: Error is not recoverable: exiting now
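Those gzip/tar "unexpected EOF" errors are the signature of a truncated download. If retrying by hand, resuming the partial download and testing the archive before unpacking makes the failure mode clearer; the version and filename below are illustrative:

```sh
# Resume a partial download and test the archive before unpacking.
# Version/filename are illustrative.
wget -c --tries=5 https://nodejs.org/dist/v12.13.0/node-v12.13.0-linux-x64.tar.gz
if gzip -t node-v12.13.0-linux-x64.tar.gz; then
  tar -xzf node-v12.13.0-linux-x64.tar.gz
else
  echo "archive is truncated or corrupt; delete it and re-download" >&2
fi
```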

@jbergstroem (Member) commented Nov 4, 2019

Small update: we are now carefully testing a caching strategy for the most common (artifact) file formats. If things continue to go well (as they seem to be), we will start covering more parts of the site and be more generous with TTLs.
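One way to observe this rollout from the outside is to look at the cache-related response headers on a download; `cf-cache-status` is Cloudflare's standard indicator of whether a response was served from cache. The URL below is illustrative:

```sh
# Inspect caching headers on a dist artifact (URL is illustrative).
curl -sI https://nodejs.org/dist/v12.13.0/node-v12.13.0-linux-x64.tar.xz \
  | grep -iE 'cf-cache-status|cache-control|age'
```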

@rvagg (Member) commented Nov 6, 2019

I don't believe we'll see this particular incarnation of problems with nodejs.org from now on. Future problems are likely to be of a different nature, so I'm closing this issue.

We're now fully fronting our downloads with the CDN and it appears to be working well. It's taken considerable load off our backend servers, of course, and we're unlikely to hit the bandwidth peaks at the network interface that appear to have been the problem previously.

Primary server bandwidth:

[screenshot: primary server bandwidth graph]

Cloudflare caching:

[screenshot: Cloudflare caching dashboard]

@rvagg closed this Nov 6, 2019
@OriginalEXE commented Nov 6, 2019

Thank you to everyone for their hard work, sending virtual 🤗.
