ansible: metrics server (WIP) & removal of Cloudflare cache bypass #2025

rvagg · 2019-11-01T09:14:03Z

(This is a WIP and may not end up being merged, explained below)

Background

#1993 - we need to ditch our CDN bypasses on nodejs.org. We bypass for most /dist and /download URIs so we can serve fully from Cloudflare. Our primary server has been having problems serving the content (apparently, and it could have been specific hardware, our droplet has been migrated to new hardware but we won't be able to try out a new peak traffic until Monday or Tuesday US time).

We have bypasses in place so we can collect download metrics. I made a major push to get log collection from Cloudflare a couple of years ago but I could never get it stable enough to have assurance that we were going to reliably collect all of the logs and not miss out. See #987 for some of that work.

Cloudflare now have a "logpush" service where they take care of offloading log files onto a block storage service you manage. We've enabled this and @jbergstroem has set up a GCP Storage bucket we control which is collecting logs (he's currently donating this personally but we should try and get donation of credits from either GCP or Azure at some point).

What this is

I've been working to verify that the logs we are collecting replace what we have now and we're not missing out on any fields or having any other weirdness that might surprise us later when we dig in properly.

This config is for a dedicated metrics server that I have running. It's collecting all logs from both of or servers and from the GCP bucket. I've translated some of my original AWK scripts to JavaScript and have some very basic processing utilities in this PR. I can process the Cloudflare logs and produce the same data as in https://nodejs.org/metrics/logs/ and there's a utility in here to print a basic JSON summary of a given set of input. This is a precursor to what's needed to start producing the summaries @ https://nodejs.org/metrics/summaries/, but can come later.

Comparing our logs to Cloudflare logs

So, regarding assurance that we're collecting the data we need so we can make a switch. Comparing the log summaries for both the 29th and the 30th of October I have discrepancies, but they are not massive. We show more downloads and even more bytes from our servers than Cloudflare is recording.

The total download count for Cloudflare are 99.76% of that we have on our servers, the total bytes is 96.07%. The count differential could be accounted for by cancels I think. Perhaps it's connections that hit CF, CF downloads from one of our servers but the client cancels and our server has sent the data to CF before it can cancel on our side. Date differences could play a small role too, the timestamp we're using from CF is the original connection time of the client. I'm not sure what nginx uses but it'd be slightly delayed in any case so we might not have the day boundaries quite right. The bytes differential could perhaps be accounted for by compression but that's a wild guess, I don't know what CF does with already-compressed resources.

The days I'm comparing were days with outages reported by users so there's a good chance that's having an impact on discrepancies.

Next steps

I think I'm satisfied that the data is good enough to make a switch and get rid of the bypass. I wouldn't mind collecting a clean (outage free) day over the weekend and making another comparison just to reassure myself. I can share results if anyone's particularly interested. It would also be good to make a switch-over before the US work week. So please interject if you'd like to participate in this decision.

Metrics generation for https://nodejs.org/metrics/ will have to pause, we won't have the pieces hooked up. I don't think that's critical as any problems with that site don't tend to get reported quickly so I don't think many people keep a keen eye on it. But the scripts here are part way to restoring that. The cut-over point will have to be carefully calibrated though so we don't get double-ups.

@jbergstroem has suggested we do at least some of this log processing on GCP workers (or Cloudflare workers? I'm not sure exactly what the idea here is). That would be great because the logs we're collecting have way more data than we need for just /dist/ and /download/ right now. So maybe we modernise this while we're messing with it and make a small step toward getting specialised work off our primary server. So this PR doesn't necessarily have to be merged, it could be a temporary arrangement until we get the rest sorted out.

rvagg · 2019-11-04T02:03:29Z

Stats differential for 2019-11-02 is roughly the same, 727,246 downloads from cloudflare, 729901 from our server. Over 99.5% the same, but bytes is still down at 96%. I'm calling it good enough and will disassemble the bypass so CF is serving the majority of our traffic.

jbergstroem · 2019-11-04T12:17:09Z

Post-merge LGTM. We still have some cache header work to do, but I'm on it.

targos · 2019-11-04T12:20:23Z

@jbergstroem this is not merged

jbergstroem · 2019-11-04T12:20:25Z

@jbergstroem has suggested we do at least some of this log processing on GCP workers (or Cloudflare workers? I'm not sure exactly what the idea here is). That would be great because the logs we're collecting have way more data than we need for just /dist/ and /download/ right now. So maybe we modernise this while we're messing with it and make a small step toward getting specialised work off our primary server. So this PR doesn't necessarily have to be merged, it could be a temporary arrangement until we get the rest sorted out.

Here are my thoughts:

Cloudflare logs are currently stored in daily "folders" in gcs. Since changes to files emit events in GCP-land, we can attach a cloud function that would act as cron job and do the parse we now via above. I have a rudimentary JQ oneliner that transforms the json into a common log format that the old scripts can accept, but we shoudl obviously either generate the csv's straight from here or look at moving to another format that can provide more input to the curious user.

jbergstroem · 2019-11-04T12:22:34Z

@jbergstroem this is not merged

Hm, well, changes to cloudflare seems to have been made and the metrics server is up and running. Most traffic still passes to nginx (cache-control needs to be improved) but the logs are being pushed.

Edit: re-read the full thread, I got the impression it got merged with @rvagg's last comment. My bad!

rvagg · 2019-11-04T23:43:02Z

This isn't merged because I'm not convinced it's needed, we need to figure out medium and long-term strategy. The scripts in here to do the transforms should be helpful whichever way we choose though.

jbergstroem · 2019-11-04T23:55:20Z

This isn't merged because I'm not convinced it's needed, we need to figure out medium and long-term strategy. The scripts in here to do the transforms should be helpful whichever way we choose though.

Ah, ok. Thanks for elaborating. At least they're working for now.

BridgeAR · 2020-02-14T22:23:09Z

Is there any update here? Getting back the metrics would be really great!

rvagg · 2020-02-17T03:10:15Z

no movements since last time, still needs me to allocate time for it

AshCripps · 2020-05-07T15:51:19Z

Ive been looking into this recently and I agree with @jbergstroem that using a google cloud function to process the data that triggers everytime cloudflare pushes a log to the bucket is best. The next thing though is where to put this data afterwards? Do we want to put it in a seperate bucket and then pull it down to the metrics server to serve the webpage, or could we get the webpage to get its data straight from GCP or is that too complicated?

@rvagg how often were the metrics updated in the old system? was it once a day or every x minutes? if its more frequent than once a day we still we have to set something up to pull from GCP storage very frequent which may cause us to have loads of random files on the metrics server.

rvagg · 2020-05-08T02:00:57Z

I think I had a cron job for every 6 hours. But it can be whatever we want and is mostly down to what the users of this data want. @bnb and @mhdawson are two consumers that I’m aware of that might have an opinion about how “fresh” the data needs to be. My guess is that once a day would be perfectly fine since most people are looking at much longer timeframes for trends.

So, I don’t think you need to be too constrained by the update frequency in designing a new workflow. Consider what’s optimal for the new workflow and we’ll work back from there. The thing we need to be consistent is the data itself.

bnb · 2020-05-08T02:40:55Z

Once a day totally works for me 👍

mhdawson · 2020-05-08T16:07:42Z

Once a day would be often enough.

AshCripps · 2020-05-11T09:57:12Z

So this is what I have in my head of a basic workflow:

Cloudflare pushes a file to GCP -> This triggers a cloud function which processes the file and sends the result to a new file in a different bucket -> This file is appened to throughout the day -> at the end of the day this processed file is pulled down to the metrics server -> This file is used to produce the graphs and summaries on nodejs.org/metrics.

one thing im not certain on is how to automate the back filling of data that needs processing

AshCripps · 2020-05-11T16:27:50Z

Think ive managed to convert process-cloudflare to be used inside a GCP function, here is this main change I made:

exports.processLogs = (data, context) => {
  const file = data;
  bucketName = file.bucket;
  fileName = file.name;
  processedFile = fileName.split("/")[0];

  pipeline(
    storage.bucket(bucketName).file(fileName).createReadStream(),
    split2(),
    jsonStream,
    logTransformStream,
    storage.bucket('processed-logs-nodejs').file(processedFile).createWriteStream({ resumable: false, flags: 'a'}), //We want to appened to the file all of the days results
    (err) => {
      if (err) {
        console.error('ERROR', err)
        process.exit(1)
      }
    }
  )
}

I will try this in an example function tomorrow

rvagg · 2020-05-12T07:15:53Z

sounds reasonable

Re backfilling, that's a one-off process and maybe I can put my hand up to swallow that one. There'll be some manual sync+process steps in there and it could be achieved with some Bash batch work -- which isn't as daunting if it doesn't need to be automated and has a human watching it.

AshCripps · 2020-05-12T12:13:20Z

So the code seemingly can be deployed in a function but its not writing anything to the new bucket, I think this is because metrics service account is a viewer only, so I may need to grant it the ability to edit storage bucket to allow it to write to the new one

rvagg · 2020-05-12T23:32:27Z

Or make a new one for this specific purpose. The read only one is still useful as it is.

AshCripps · 2020-05-14T16:03:41Z

Still battling with this.

The code deploys to a function which is triggered by a file being uploaded and successfully reads the file and processes the data.

The issue im hitting is the data is just not being written to a new file, yet the code works locally just not in the function.

Another issue ive discovered is google storage objects are immutable which means I cannot just append to one big file as I was hoping, but im just using the same folder scheme we use for the cloudflare logs currently in the new bucket to keep everything neat and tidy.

AshCripps · 2020-05-20T14:46:38Z

I think im going to have to call the google functions solutions quits for the moment as its just refusing to work, the code works locally and I can see its triggered but it just wont write to a new file - its give no errors or anything.

So I think ill have to tweak it slightly to work on the metrics server, I got some code which I can merge with the process cloudflare script to allow for the pulling down of files and processing them into the smaller chunks. ATM I have it so it could pull all the days files at once, do the processing and stick them in the csv to serve the webpage.

example code which ill tweak for this:

#!/usr/bin/env node
const {Storage} = require('@google-cloud/storage');

const storage = new Storage({keyFilename: "gcp-metrics-service-acct-key.json"});
const fs = require('fs');

async function main(
  bucketName = 'processed-logs-nodejs',
) {

  let date = new Date();
  date.setDate(date.getDate() - 1);
  let folder = date.toISOString().replace(/-/g, '').split('T')[0];
  console.log(folder);

    // Lists files in the bucket
    const [files] = await storage.bucket(bucketName).getFiles({ prefix: folder});
    files.forEach(file => {
      console.log(file.name);
      storage.bucket(bucketName).file(file.name).createReadStream()
      .on('error', function(err) { console.error(err) })
      .pipe(fs.createWriteStream(`${folder}.csv`, {flags: 'a'}));
    });

}
main(...process.argv.slice(2)).catch(console.error);

@rvagg with the metrics server being full currently I remember you saying there were some files that could be cleared off but I can't remember which files they were.

rvagg · 2020-05-22T03:04:43Z

@jbergstroem are you able to help @AshCripps out with GCP functions? He's trying to engineer the thing you originally suggested - process logs as they come in from Cloudflare.

I think im going to have to call the google functions solutions quits for the moment as its just refusing to work, the code works locally and I can see its triggered but it just wont write to a new file - its give no errors or anything.

@AshCripps regarding space on that server -- it's all up for grabs, there's nothing on that server that's super critical and couldn't be deleted. Mostly it was just a WIP.

The thing that filled it up was /home/nodejs/logs/cloudflare/ which started off as a mirror of the cloudflare logs to start processing like you're talking about. I've deleted everything in there now so there's ~128G free. /home is a 500G disk, it's currently got mirrors of logs from nodejs.org and the backup server in /home/nodejs/logs - these could be deleted if need be. And also the processed metrics served at nodejs.org/metrics/ in /home/nodejs/metrics-out, although it might be nice to keep those and use that as the authoritative home of what we push onto nodejs.org to be served at /metrics. They're also useful for comparing against new files to make sure we're getting it right!

There's also the option of replacing /home with a larger disk. I just chose 500G as a fairly arbitrary number to start with but if it's helpful to have a larger cache space then we can do that easily enough. Let me know.

I've cleaned it up a little, rebooted and updated the OS so it should be good to go now. I'm pretty sure all the timers I had that did syncing are all disabled so it shouldn't fill up again without asking it to.

The (hopefully disabled) timer I used for syncing GCP are in /lib/systemd/system/sync-cloudflare-periodic.service and /lib/systemd/system/sync-cloudflare-periodic.timer if that's helpful.

jbergstroem · 2020-05-22T03:47:52Z

@jbergstroem are you able to help @AshCripps out with GCP functions? He's trying to engineer the thing you originally suggested - process logs as they come in from Cloudflare.

Sure, no problem. Perhaps reach out to me through the #build channel on IRC (jbergstroem) and I'll provide access/feedback as needed.

As a final note; processing logs outside of gcp is probably a no-go; the cost of egress traffic was in the hundreds/mo (whereas our sponsorship was ~$100/mo).

MylesBorins · 2020-06-21T07:16:45Z

What is the current progress here? Can I help at all?

jbergstroem · 2020-06-21T13:24:11Z

What is the current progress here? Can I help at all?

I believe I might be in the line of fire. I was looking at the worker created by @AshCripps and didn't quite figure out why it wasn't working properly. Help of course appreciated; perhaps hit me up on IRC this week and we can figure it out?

AshCripps · 2020-06-21T15:26:41Z

Yeah I have a worker which does the processing part but seemingly doesn't want to write the output back out to a file in google storage

AshCripps · 2020-06-29T15:10:37Z

@MylesBorins you may be able to help. I dont know if you came across anything similar during your time at google or might know who to ask about it but basically heres the flow.

New file uploaded to google cloud -> triggers a function -> function creates a readstream of the file and pipes the data through some transform streams that cleans the data -> a createwritestream is called send the data to a file in another bucket in storage.

The code all works locally and creates the new file in the cloud storage but when in the function the createwritestream seems to fail with no error message or anything, the function just completes with no file being created.

Any ideas?

MylesBorins · 2020-07-01T21:24:45Z

I can attempt to help with that. You might not want to use the stream interface, especially if it is dealing with ingress / egress (network interfaces might have a. Issue)

Happy birthday to pair 1:1 on this.

Cloud run might work better than a function of you want to deploy an entire application

AshCripps · 2020-07-02T11:46:12Z

Ive pushed up what I had (I think may have crossed some wires) if you want to take a quick look @MylesBorins and pick apart anything that you think should be done differently.

ansible/roles/metrics/files/process-cloudflare/process-cloudflare.js

AshCripps · 2020-07-08T11:37:00Z

Update: we have files 🎉

Big shout out to @BethGriggs who massively helped me out to rewrite the code!

AshCripps · 2020-07-08T13:36:29Z

ive pushed up the code we have that works but now this faces a new issue, after about 15 minutes (or 3 run throughs) it starts failing on Exception from a finished function: Error [ERR_STREAM_WRITE_AFTER_END]: write after end but the functions itself finishes on ok so I cant just set it to retry on error. This ends up creating empty files.

From some googling theres seems to be a similar error with streams in the google pub/sub library - googleapis/nodejs-pubsub#825

AshCripps · 2020-07-21T14:14:29Z

Still running into this issue with Exception from a finished function: Error [ERR_STREAM_WRITE_AFTER_END]: write after end, Annoyingly I can't seem to catch the exception and google offers not output besides "this exception happened" so its really hard to debug.

ansible/roles/metrics/files/process-cloudflare/process-cloudflare.js

rvagg · 2020-07-28T22:27:39Z

ansible/roles/metrics/files/process-cloudflare/process-cloudflare.js

+  console.log("PROCESSEDFILENAME " + processedFile);
+
+  storage.bucket(bucketName).file(file.name).createReadStream()
+  .on('error', function(err) { console.error(err) })


Each one of the pipes below here could emit an error you won't see. Copy this line and paste it below each pipe and you may catch an error that's otherwise being lost. The alternative is to use util.pipeline() which handles error propagation nicely.

@rvagg After doing this the log now reads like this

Did you try removing the e from the jsonStream error callback? Perhaps we're getting a line somewhere that we can't parse properly for some reason but should be able to ignore - might be worth trying. It could be that an error in either jsonStream or logTransformStream is closing the downstream streams and split2 can't push anymore in.

util.pipeline() is really worth trying here, not just for the error handling but also for the nice cleanup it does and managing these complicated events.

Yep removed the e changed the script to this

exports.processLogs = (data, context, callback) => { console.log('Node version is: ' + process.version); const file = data; bucketName = file.bucket; fileName = file.name; console.log("DATA " + data); console.log("BUCKET " + bucketName); console.log("FILENAME " + fileName); processedFile = fileName.split(".")[0]; processedFile = processedFile.split("_")[0].concat("_", processedFile.split("_")[1]); console.log("PROCESSEDFILENAME " + processedFile); pipeline( storage.bucket(bucketName).file(file.name).createReadStream(), split2(), jsonStream, logTransformStream, storage.bucket('processed-logs-nodejs').file(processedFile).createWriteStream({resumable: false}, (err) => { if (err) { console.log("PIPELINE HAS FAILED", err) } else { console.log("PIPELINE SUCCESS") callback() } } ) )

been running for halfhour with no errors yet but more interestingly it constantly timesout - no matter how big I set the timeout, So I assume im not calling the callback() in the right place so the function doesn't end

Feel free to keep on pushing your changes to this branch, at least we can have a discussion around the same code in that case.

One of the tricky things about custom streams is that you have to always be aware of calling the callback on transform() and flush() on any path through your code. Failure to do so will cause weird errors that involve back-pressure that has a delayed effect so it can be hard to track down. So the jsonStream and logTransformStream are worth studying to find codepaths that may not properly trigger a callback(), possibly even on the case of an uncaught exception which will just halt that execution path.

You do have one obvious instance of a failure to call callback though, at the bottom in the failure case, maybe it should be callback(err) after the console.log("PIPELINE HAS FAILED", err), just for completeness. But presumably you'd see that somewhere in the output (not necessarily right at the end though).

ive pushed my code up, ive realised that I dont see any pipeline finish message - either pass or fail but the files seem to be roughly correct in size (its not an exact science) so I assume the processing is working but the pipeline is just not ending untill GCP kills it.

rebased as @BethGriggs pointed out I had a bracket in the wrong place - time outs now stop but we are back to the weird behaviour of before with empty out files. the pipeline is reporting a success however @rvagg.

rvagg · 2020-07-28T22:31:13Z

ansible/roles/metrics/files/process-cloudflare/process-cloudflare.js

+      return callback()
+    }
+
+    if (!extensionRe.test(chunk.ClientRequestPath)) { // not a file we care about


These regexes are going to assume this property exists. Maybe we should be checking for the existence of each of the properties on chunk that we care about at the top of the function (chunk.foo !== undefined or 'foo' in chunk style).

AshCripps · 2020-08-11T14:36:38Z

This is still broken but interestingly when left for a while the logs are full of the above message - maybe the error is caused by a memory leak we arent dealing with?

AshCripps · 2020-09-01T15:25:38Z

@MylesBorins are you able to look at the code in the latest commit?

ive converted to a docker image and now use cloud run instead of cloud functions but still get the same issue of it working once or twice and then just sending empty files.

The logs all say PIPELINE SUCCESS

AshCripps · 2021-01-08T12:51:31Z

Update: Ive really screwed up my branches so have been forced to open #2518 as a replacement - would appreciate people checking it over for anything ive missed before closing this one.

I now a solution that has been working for approximately 2 days with no faults or errors 🎉 going to give it over the weekend to make sure before moving on the next stage of summary.

Would appreciate some input from @rvagg and @jbergstroem on whats the best method for that stage, remaining in google or running on a seperate machine? run once a day or on every new file? and etc.

rvagg · 2021-01-12T02:51:33Z

whats the best method for that stage, remaining in google or running on a seperate machine?

I think the conclusion on this was that it's too cost prohibitive on data transfer for the amount we're dealing with to get data in and out of GCP so we have to do it internally, with whatever tools are available within GCP.

AshCripps · 2021-01-18T16:13:49Z

Merged #2518 instead, closing this and will open new PR for next steps/additions

rvagg force-pushed the rvagg/metrics-server branch from b42cc8f to fa793e8 Compare November 1, 2019 09:42

rvagg mentioned this pull request Nov 5, 2019

Node.js Foundation Technical Steering Committee (TSC) Meeting 2019-11-06 nodejs/TSC#773

Closed

rvagg mentioned this pull request Nov 28, 2019

www: add initial cloudflare log-collect tooling #987

Closed

richardlau mentioned this pull request Dec 17, 2019

Node.js download metrics nodejs/TSC#794

Closed

richardlau mentioned this pull request Feb 16, 2020

Why did download metrics drop significantly around October 2019? #2179

Closed

rvagg mentioned this pull request Mar 10, 2020

Including Node.js Docker Image Builds in the Node.js Metrics Microsite #1289

Closed

richardlau mentioned this pull request Jun 10, 2020

Version download metrics are all wonky nodejs/nodejs.org#3244

Closed

AshCripps mentioned this pull request Jun 12, 2020

Another massive drop in download numbers #2349

Closed

MylesBorins reviewed Jul 2, 2020

View reviewed changes

ansible/roles/metrics/files/process-cloudflare/process-cloudflare.js Outdated Show resolved Hide resolved

AshCripps force-pushed the rvagg/metrics-server branch from 51a8342 to 869fe04 Compare July 8, 2020 13:33

rvagg commented Jul 28, 2020

View reviewed changes

AshCripps force-pushed the rvagg/metrics-server branch from ba1474d to 9fdaaed Compare August 5, 2020 15:13

rvagg and others added 4 commits September 8, 2020 16:07

ansible: metrics server

33e176e

update for google cloud function

86d7653

Change to use pipeline

5fede92

Convert metrics to use cloud run

ca1f940

AshCripps force-pushed the rvagg/metrics-server branch from f5b1db8 to ca1f940 Compare September 8, 2020 15:07

Ash Cripps added 2 commits September 17, 2020 15:19

fixup! Convert metrics to use cloud run

2e5248f

test emitClose

0773ec1

AshCripps mentioned this pull request Jan 8, 2021

Metrics #2518

Merged

AshCripps mentioned this pull request Jan 12, 2021

Node.js Build WorkGroup Meeting 2021-01-12 #2513

Closed

AshCripps closed this Jan 18, 2021

ansible: metrics server (WIP) & removal of Cloudflare cache bypass #2025

ansible: metrics server (WIP) & removal of Cloudflare cache bypass #2025

Conversation

rvagg commented Nov 1, 2019

Background

What this is

Comparing our logs to Cloudflare logs

Next steps

rvagg commented Nov 4, 2019

jbergstroem commented Nov 4, 2019

targos commented Nov 4, 2019

jbergstroem commented Nov 4, 2019

jbergstroem commented Nov 4, 2019 • edited Loading

rvagg commented Nov 4, 2019

jbergstroem commented Nov 4, 2019

BridgeAR commented Feb 14, 2020

rvagg commented Feb 17, 2020

AshCripps commented May 7, 2020

rvagg commented May 8, 2020

bnb commented May 8, 2020

mhdawson commented May 8, 2020

AshCripps commented May 11, 2020

AshCripps commented May 11, 2020

rvagg commented May 12, 2020

AshCripps commented May 12, 2020

rvagg commented May 12, 2020

AshCripps commented May 14, 2020

AshCripps commented May 20, 2020

rvagg commented May 22, 2020

jbergstroem commented May 22, 2020

MylesBorins commented Jun 21, 2020

jbergstroem commented Jun 21, 2020

AshCripps commented Jun 21, 2020

AshCripps commented Jun 29, 2020

MylesBorins commented Jul 1, 2020

AshCripps commented Jul 2, 2020

AshCripps commented Jul 8, 2020

AshCripps commented Jul 8, 2020

AshCripps commented Jul 21, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AshCripps commented Aug 11, 2020

AshCripps commented Sep 1, 2020

AshCripps commented Jan 8, 2021

rvagg commented Jan 12, 2021

AshCripps commented Jan 18, 2021

jbergstroem commented Nov 4, 2019 •

edited

Loading