
ansible: metrics server (WIP) & removal of Cloudflare cache bypass #2025

Closed

Conversation

rvagg
Member

@rvagg rvagg commented Nov 1, 2019

(This is a WIP and may not end up being merged, explained below)

Background

#1993 - we need to ditch our CDN bypasses on nodejs.org. We currently bypass the cache for most /dist and /download URIs; removing the bypass would let us serve those fully from Cloudflare. Our primary server has apparently been having problems serving the content (it could have been specific hardware; our droplet has been migrated to new hardware, but we won't be able to test it under peak traffic until Monday or Tuesday US time).

We have bypasses in place so we can collect download metrics. I made a major push to get log collection from Cloudflare a couple of years ago but I could never get it stable enough to be confident we would reliably collect all of the logs without missing any. See #987 for some of that work.

Cloudflare now have a "logpush" service where they take care of offloading log files onto an object storage service you manage. We've enabled this and @jbergstroem has set up a GCP Storage bucket we control which is collecting logs (he's currently donating this personally, but we should try to get a donation of credits from either GCP or Azure at some point).

What this is

I've been working to verify that the logs we are collecting can replace what we have now, and that we're not missing any fields or running into any other weirdness that might surprise us later when we dig in properly.

This config is for a dedicated metrics server that I have running. It's collecting all logs from both of our servers and from the GCP bucket. I've translated some of my original AWK scripts to JavaScript and have some very basic processing utilities in this PR. I can process the Cloudflare logs and produce the same data as in https://nodejs.org/metrics/logs/ and there's a utility in here to print a basic JSON summary of a given set of input. This is a precursor to what's needed to start producing the summaries at https://nodejs.org/metrics/summaries/, but that can come later.
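
As a rough illustration only, the sort of per-day roll-up that summary step produces looks something like this (EdgeStartTimestamp and EdgeResponseBytes are standard logpush HTTP-request fields; the actual script in this PR may read different fields and emit a different shape):

// sketch only: per-day download/byte counts from Cloudflare logpush NDJSON on stdin
const split2 = require('split2')

const summary = {} // day -> { downloads, bytes }

process.stdin
  .pipe(split2(JSON.parse))
  .on('data', (entry) => {
    const day = String(entry.EdgeStartTimestamp).slice(0, 10) // assumes RFC3339 timestamps
    const s = summary[day] || (summary[day] = { downloads: 0, bytes: 0 })
    s.downloads++
    s.bytes += entry.EdgeResponseBytes
  })
  .on('end', () => console.log(JSON.stringify(summary, null, 2)))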

Comparing our logs to Cloudflare logs

So, regarding assurance that we're collecting the data we need before we make the switch: comparing the log summaries for both the 29th and the 30th of October I have discrepancies, but they are not massive. We show more downloads, and even more so more bytes, from our servers than Cloudflare is recording.

The total download count from Cloudflare is 99.76% of what we have on our servers; the total bytes is 96.07%. The count differential could be accounted for by cancels, I think. Perhaps a client connects to CF, CF starts downloading from one of our servers, the client cancels, and our server has already sent the data to CF before the cancel reaches our side. Date differences could play a small role too: the timestamp we're using from CF is the original connection time of the client. I'm not sure what nginx uses, but it'd be slightly delayed in any case, so we might not have the day boundaries quite right. The bytes differential could perhaps be accounted for by compression, but that's a wild guess - I don't know what CF does with already-compressed resources.

The days I'm comparing were days with outages reported by users so there's a good chance that's having an impact on discrepancies.

Next steps

I think I'm satisfied that the data is good enough to make a switch and get rid of the bypass. I wouldn't mind collecting a clean (outage-free) day over the weekend and making another comparison just to reassure myself. I can share results if anyone's particularly interested. It would also be good to make a switch-over before the US work week. So please interject if you'd like to participate in this decision.

Metrics generation for https://nodejs.org/metrics/ will have to pause, as we won't have the pieces hooked up. I don't think that's critical: any problems with that site don't tend to get reported quickly, so I don't think many people keep a keen eye on it. But the scripts here are part way to restoring that. The cut-over point will have to be carefully calibrated, though, so we don't get double-ups.

@jbergstroem has suggested we do at least some of this log processing on GCP workers (or Cloudflare workers? I'm not sure exactly what the idea here is). That would be great because the logs we're collecting have way more data than we need for just /dist/ and /download/ right now. So maybe we modernise this while we're messing with it and make a small step toward getting specialised work off our primary server. So this PR doesn't necessarily have to be merged, it could be a temporary arrangement until we get the rest sorted out.

@rvagg
Member Author

rvagg commented Nov 4, 2019

The stats differential for 2019-11-02 is roughly the same: 727,246 downloads from Cloudflare, 729,901 from our server. Over 99.5% the same, but bytes is still down at 96%. I'm calling it good enough and will disassemble the bypass so CF is serving the majority of our traffic.

@jbergstroem
Member

Post-merge LGTM. We still have some cache header work to do, but I'm on it.

@targos
Member

targos commented Nov 4, 2019

@jbergstroem this is not merged

@jbergstroem
Member

@jbergstroem has suggested we do at least some of this log processing on GCP workers (or Cloudflare workers? I'm not sure exactly what the idea here is). That would be great because the logs we're collecting have way more data than we need for just /dist/ and /download/ right now. So maybe we modernise this while we're messing with it and make a small step toward getting specialised work off our primary server. So this PR doesn't necessarily have to be merged, it could be a temporary arrangement until we get the rest sorted out.

Here are my thoughts:

Cloudflare logs are currently stored in daily "folders" in GCS. Since changes to files emit events in GCP-land, we can attach a cloud function that would act as a cron job and do the parsing we do now via the scripts above. I have a rudimentary jq one-liner that transforms the JSON into a common log format the old scripts can accept, but we should obviously either generate the CSVs straight from here or look at moving to another format that can provide more input to the curious user.
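
For reference, a hedged Node sketch of that transform (same idea as the jq one-liner): turn one logpush JSON record into a combined-log-format-ish line the old scripts could read. The field names are standard logpush HTTP-request fields; the exact set depends on how the logpush job is configured.

// sketch only: one logpush JSON record -> one common-log-format-ish line
function toCommonLogFormat (entry) {
  const ts = new Date(entry.EdgeStartTimestamp).toISOString() // assumes RFC3339 timestamps
  return [
    entry.ClientIP,
    '-',
    '-',
    '[' + ts + ']',
    '"' + entry.ClientRequestMethod + ' ' + entry.ClientRequestURI + ' HTTP/1.1"',
    entry.EdgeResponseStatus,
    entry.EdgeResponseBytes
  ].join(' ')
}
// e.g. usable as the body of a transform stream, or mapped over split2(JSON.parse) output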

@jbergstroem
Member

jbergstroem commented Nov 4, 2019

@jbergstroem this is not merged

Hm, well, changes to Cloudflare seem to have been made and the metrics server is up and running. Most traffic still passes to nginx (cache-control needs to be improved) but the logs are being pushed.

Edit: re-read the full thread, I got the impression it got merged with @rvagg's last comment. My bad!

@rvagg
Member Author

rvagg commented Nov 4, 2019

This isn't merged because I'm not convinced it's needed; we need to figure out the medium and long-term strategy. The scripts in here to do the transforms should be helpful whichever way we choose, though.

@jbergstroem
Member

This isn't merged because I'm not convinced it's needed; we need to figure out the medium and long-term strategy. The scripts in here to do the transforms should be helpful whichever way we choose, though.

Ah, ok. Thanks for elaborating. At least they're working for now.

@BridgeAR
Member

Is there any update here? Getting back the metrics would be really great!

@rvagg
Member Author

rvagg commented Feb 17, 2020

No movement since last time; it still needs me to allocate time for it.

@AshCripps
Member

I've been looking into this recently and I agree with @jbergstroem that using a Google Cloud Function to process the data, triggered every time Cloudflare pushes a log to the bucket, is best. The next question is where to put the data afterwards. Do we want to put it in a separate bucket and then pull it down to the metrics server to serve the webpage, or could we get the webpage to fetch its data straight from GCP, or is that too complicated?

@rvagg how often were the metrics updated in the old system? Was it once a day or every x minutes? If it's more frequent than once a day we'd have to set something up to pull from GCP storage quite frequently, which may leave us with loads of random files on the metrics server.

@rvagg
Member Author

rvagg commented May 8, 2020

I think I had a cron job for every 6 hours. But it can be whatever we want and is mostly down to what the users of this data want. @bnb and @mhdawson are two consumers that I’m aware of that might have an opinion about how “fresh” the data needs to be. My guess is that once a day would be perfectly fine since most people are looking at much longer timeframes for trends.

So, I don't think you need to be too constrained by the update frequency in designing a new workflow. Consider what's optimal for the new workflow and we'll work back from there. The thing that needs to be consistent is the data itself.

@bnb

bnb commented May 8, 2020

Once a day totally works for me 👍

@mhdawson
Member

mhdawson commented May 8, 2020

Once a day would be often enough.

@AshCripps
Member

So this is the basic workflow I have in my head:

Cloudflare pushes a file to GCP -> this triggers a cloud function which processes the file and sends the result to a new file in a different bucket -> this file is appended to throughout the day -> at the end of the day this processed file is pulled down to the metrics server -> this file is used to produce the graphs and summaries on nodejs.org/metrics.

One thing I'm not certain of is how to automate the backfilling of data that needs processing.

@AshCripps
Member

I think I've managed to convert process-cloudflare to be used inside a GCP function; here is the main change I made:

exports.processLogs = (data, context) => {
  // `storage`, `pipeline`, `split2`, `jsonStream` and `logTransformStream` are declared elsewhere in the converted script
  const file = data;
  const bucketName = file.bucket;
  const fileName = file.name;
  const processedFile = fileName.split("/")[0];

  pipeline(
    storage.bucket(bucketName).file(fileName).createReadStream(),
    split2(),
    jsonStream,
    logTransformStream,
    storage.bucket('processed-logs-nodejs').file(processedFile).createWriteStream({ resumable: false, flags: 'a' }), // we want to append all of the day's results to the file
    (err) => {
      if (err) {
        console.error('ERROR', err)
        process.exit(1)
      }
    }
  )
}

I will try this in an example function tomorrow

@rvagg
Member Author

rvagg commented May 12, 2020

sounds reasonable

Re backfilling, that's a one-off process and maybe I can put my hand up to swallow that one. There'll be some manual sync+process steps in there and it could be achieved with some Bash batch work -- which isn't as daunting if it doesn't need to be automated and has a human watching it.
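
If it helps, a rough Node sketch of what that one-off could look like, reusing the same GCS client as the other scripts here; the bucket name and the processFile() helper below are hypothetical stand-ins for whatever the final processing step ends up being.

// rough one-off backfill sketch (names are placeholders, not part of this PR)
const { Storage } = require('@google-cloud/storage')
const storage = new Storage({ keyFilename: 'gcp-metrics-service-acct-key.json' })

async function processFile (file) {
  // hypothetical stand-in: here each object would be run through the same
  // split2 -> jsonStream -> logTransformStream pipeline the cloud function uses
  console.log('would process', file.name)
}

async function backfill (bucketName, fromDay, toDay) {
  for (let d = new Date(fromDay); d <= new Date(toDay); d.setDate(d.getDate() + 1)) {
    const prefix = d.toISOString().replace(/-/g, '').split('T')[0] // e.g. "20191101"
    const [files] = await storage.bucket(bucketName).getFiles({ prefix })
    for (const file of files) {
      await processFile(file)
    }
  }
}

backfill('cloudflare-logs-hypothetical', '2019-11-01', '2019-11-30').catch(console.error)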

@AshCripps
Member

So the code can seemingly be deployed in a function but it's not writing anything to the new bucket. I think this is because the metrics service account is viewer-only, so I may need to grant it the ability to edit the storage bucket to allow it to write to the new one.

@rvagg
Member Author

rvagg commented May 12, 2020

Or make a new one for this specific purpose. The read-only one is still useful as it is.

@AshCripps
Member

Still battling with this.

The code deploys to a function which is triggered by a file being uploaded and successfully reads the file and processes the data.

The issue I'm hitting is that the data is just not being written to a new file; the code works locally, just not in the function.

Another issue I've discovered is that Google Storage objects are immutable, which means I cannot just append to one big file as I was hoping, so I'm just using the same folder scheme we currently use for the Cloudflare logs in the new bucket, to keep everything neat and tidy.
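
In other words, one processed object per incoming logpush object, named under the same date prefix - something along these lines (the naming here is illustrative, based on logpush object names like "20200512/20200512T000000Z_20200512T000500Z_<id>.log.gz"):

// illustrative naming only: one output object per input object, same daily "folder"
function processedName (fileName) {
  const [day, objectName] = fileName.split('/')                 // e.g. "20200512", "20200512T..._<id>.log.gz"
  return day + '/' + objectName.replace(/\.log\.gz$/, '.csv')   // hypothetical output extension
}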

@AshCripps
Member

I think I'm going to have to call the Google Functions solution quits for the moment as it's just refusing to work; the code works locally and I can see it's triggered, but it just won't write to a new file - it gives no errors or anything.

So I think I'll have to tweak it slightly to work on the metrics server. I've got some code which I can merge with the process-cloudflare script to allow for pulling down the files and processing them into smaller chunks. ATM I have it so it can pull all of the day's files at once, do the processing and stick them in the CSV to serve the webpage.

Example code which I'll tweak for this:

#!/usr/bin/env node
const { Storage } = require('@google-cloud/storage');

const storage = new Storage({ keyFilename: "gcp-metrics-service-acct-key.json" });
const fs = require('fs');

async function main (
  bucketName = 'processed-logs-nodejs',
) {
  let date = new Date();
  date.setDate(date.getDate() - 1);
  let folder = date.toISOString().replace(/-/g, '').split('T')[0];
  console.log(folder);

  // Lists files in the bucket
  const [files] = await storage.bucket(bucketName).getFiles({ prefix: folder });
  files.forEach(file => {
    console.log(file.name);
    storage.bucket(bucketName).file(file.name).createReadStream()
      .on('error', function (err) { console.error(err) })
      .pipe(fs.createWriteStream(`${folder}.csv`, { flags: 'a' }));
  });
}

main(...process.argv.slice(2)).catch(console.error);

@rvagg, with the metrics server currently being full, I remember you saying there were some files that could be cleared off, but I can't remember which files they were.

@rvagg
Member Author

rvagg commented May 22, 2020

@jbergstroem are you able to help @AshCripps out with GCP functions? He's trying to engineer the thing you originally suggested - process logs as they come in from Cloudflare.

I think I'm going to have to call the Google Functions solution quits for the moment as it's just refusing to work; the code works locally and I can see it's triggered, but it just won't write to a new file - it gives no errors or anything.

@AshCripps regarding space on that server -- it's all up for grabs, there's nothing on that server that's super critical and couldn't be deleted. Mostly it was just a WIP.

The thing that filled it up was /home/nodejs/logs/cloudflare/, which started off as a mirror of the Cloudflare logs to start processing like you're talking about. I've deleted everything in there now so there's ~128G free. /home is a 500G disk; it currently has mirrors of logs from nodejs.org and the backup server in /home/nodejs/logs - these could be deleted if need be. And also the processed metrics served at nodejs.org/metrics/ in /home/nodejs/metrics-out, although it might be nice to keep those and use that as the authoritative home of what we push onto nodejs.org to be served at /metrics. They're also useful for comparing against new files to make sure we're getting it right!

There's also the option of replacing /home with a larger disk. I just chose 500G as a fairly arbitrary number to start with but if it's helpful to have a larger cache space then we can do that easily enough. Let me know.

I've cleaned it up a little, rebooted and updated the OS so it should be good to go now. I'm pretty sure all the timers I had that did syncing are all disabled so it shouldn't fill up again without asking it to.

The (hopefully disabled) timer I used for syncing from GCP is defined in /lib/systemd/system/sync-cloudflare-periodic.service and /lib/systemd/system/sync-cloudflare-periodic.timer, if that's helpful.

@jbergstroem
Member

@jbergstroem are you able to help @AshCripps out with GCP functions? He's trying to engineer the thing you originally suggested - process logs as they come in from Cloudflare.

Sure, no problem. Perhaps reach out to me through the #build channel on IRC (jbergstroem) and I'll provide access/feedback as needed.

As a final note: processing logs outside of GCP is probably a no-go; the cost of egress traffic was in the hundreds of dollars per month (whereas our sponsorship was ~$100/mo).

@MylesBorins
Contributor

What is the current progress here? Can I help at all?

@jbergstroem
Member

What is the current progress here? Can I help at all?

I believe I might be in the line of fire. I was looking at the worker created by @AshCripps and didn't quite figure out why it wasn't working properly. Help of course appreciated; perhaps hit me up on IRC this week and we can figure it out?

@AshCripps
Member

Yeah, I have a worker which does the processing part, but it seemingly doesn't want to write the output back out to a file in Google Storage.

@AshCripps
Member

@MylesBorins you may be able to help. I don't know if you came across anything similar during your time at Google, or might know who to ask about it, but basically here's the flow:

New file uploaded to Google Cloud -> triggers a function -> function creates a read stream of the file and pipes the data through some transform streams that clean the data -> a createWriteStream is called to send the data to a file in another bucket in storage.

The code all works locally and creates the new file in cloud storage, but in the function the createWriteStream seems to fail with no error message or anything; the function just completes with no file being created.

Any ideas?

@MylesBorins
Contributor

I can attempt to help with that. You might not want to use the stream interface, especially if it is dealing with ingress/egress (network interfaces might have an issue).

Happy to pair 1:1 on this.

Cloud Run might work better than a function if you want to deploy an entire application.

@AshCripps
Member

I've pushed up what I had (I think I may have crossed some wires) if you want to take a quick look, @MylesBorins, and pick apart anything that you think should be done differently.

@AshCripps
Member

Update: we have files 🎉

Big shout out to @BethGriggs who massively helped me rewrite the code!

@AshCripps
Member

I've pushed up the code we have that works, but now it faces a new issue: after about 15 minutes (or 3 run-throughs) it starts failing on Exception from a finished function: Error [ERR_STREAM_WRITE_AFTER_END]: write after end, but the function itself finishes OK so I can't just set it to retry on error. This ends up creating empty files.

From some googling there seems to be a similar error with streams in the Google Pub/Sub library - googleapis/nodejs-pubsub#825

@AshCripps
Member

Still running into this issue with Exception from a finished function: Error [ERR_STREAM_WRITE_AFTER_END]: write after end. Annoyingly I can't seem to catch the exception and Google offers no output besides "this exception happened", so it's really hard to debug.

console.log("PROCESSEDFILENAME " + processedFile);

storage.bucket(bucketName).file(file.name).createReadStream()
.on('error', function(err) { console.error(err) })
Member Author

Each one of the pipes below here could emit an error you won't see. Copy this line and paste it below each pipe and you may catch an error that's otherwise being lost. The alternative is to use stream.pipeline() which handles error propagation nicely.
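
In sketch form (readStream/writeStream stand in for the GCS streams already in the function, and jsonStream/logTransformStream are the existing transforms), the two options look roughly like this:

// option 1: .pipe() does not forward errors, so listen on every stage
readStream
  .on('error', (err) => console.error('read', err))
  .pipe(split2())
  .on('error', (err) => console.error('split', err))
  .pipe(jsonStream)
  .on('error', (err) => console.error('json', err))
  .pipe(logTransformStream)
  .on('error', (err) => console.error('transform', err))
  .pipe(writeStream)
  .on('error', (err) => console.error('write', err))

// option 2: stream.pipeline() wires up error propagation and cleanup for you
const { pipeline } = require('stream')
pipeline(readStream, split2(), jsonStream, logTransformStream, writeStream, (err) => {
  if (err) console.error('pipeline failed', err)
})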

Member

@rvagg After doing this the log now reads like this
[screenshot of the function log output]

Member Author

Did you try removing the e from the jsonStream error callback? Perhaps we're getting a line somewhere that we can't parse properly for some reason but should be able to ignore - might be worth trying. It could be that an error in either jsonStream or logTransformStream is closing the downstream streams and split2 can't push any more in.

stream.pipeline() is really worth trying here, not just for the error handling but also for the nice cleanup it does in managing these complicated events.

Member

Yep, removed the e and changed the script to this:

exports.processLogs = (data, context, callback) => {
  console.log('Node version is: ' + process.version);
  const file = data;
  bucketName = file.bucket;
  fileName = file.name;
  console.log("DATA " + data);
  console.log("BUCKET " + bucketName);
  console.log("FILENAME " + fileName);
  processedFile = fileName.split(".")[0];
  processedFile = processedFile.split("_")[0].concat("_", processedFile.split("_")[1]); 
  console.log("PROCESSEDFILENAME " + processedFile);

  pipeline(
    storage.bucket(bucketName).file(file.name).createReadStream(),
    split2(),
    jsonStream,
    logTransformStream,
    storage.bucket('processed-logs-nodejs').file(processedFile).createWriteStream({resumable: false},
    (err) => {
      if (err) {
        console.log("PIPELINE HAS FAILED", err)
        
      } else {
        console.log("PIPELINE SUCCESS")
        callback()
      }
    }
  )
  )

It's been running for half an hour with no errors yet, but more interestingly it constantly times out - no matter how big I set the timeout. So I assume I'm not calling the callback() in the right place, so the function doesn't end.

Member Author

Feel free to keep on pushing your changes to this branch; at least we can have a discussion around the same code in that case.

One of the tricky things about custom streams is that you have to always be aware of calling the callback on transform() and flush() on any path through your code. Failure to do so will cause weird errors involving back-pressure that have a delayed effect, so it can be hard to track down. So the jsonStream and logTransformStream are worth studying to find codepaths that may not properly trigger a callback(), possibly even in the case of an uncaught exception, which will just halt that execution path.

You do have one obvious instance of a failure to call callback though, at the bottom in the failure case; maybe it should be callback(err) after the console.log("PIPELINE HAS FAILED", err), just for completeness. But presumably you'd see that somewhere in the output (not necessarily right at the end though).
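
As a sketch of the shape to aim for - a transform that reaches its callback on every path, including unparseable lines (this jsonStream is illustrative, not the exact one in this PR):

const { Transform } = require('stream')

// illustrative jsonStream: always reach callback(), even for bad input
const jsonStream = new Transform({
  objectMode: true,
  transform (chunk, encoding, callback) {
    let parsed
    try {
      parsed = JSON.parse(chunk)
    } catch (e) {
      return callback() // skip unparseable lines instead of stalling the pipeline
    }
    callback(null, parsed)
  },
  flush (callback) {
    callback() // also required, or the stream never signals completion
  }
})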

Member

I've pushed my code up. I've realised that I don't see any pipeline finish message - either pass or fail - but the files seem to be roughly correct in size (it's not an exact science), so I assume the processing is working but the pipeline is just not ending until GCP kills it.

Member

Rebased: as @BethGriggs pointed out, I had a bracket in the wrong place - the timeouts now stop but we are back to the weird behaviour from before with empty output files. The pipeline is reporting a success however, @rvagg.

return callback()
}

if (!extensionRe.test(chunk.ClientRequestPath)) { // not a file we care about
Member Author

These regexes are going to assume this property exists. Maybe we should be checking for the existence of each of the properties on chunk that we care about at the top of the function (chunk.foo !== undefined or 'foo' in chunk style).
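
Something like this at the top of transform(), say (the field list here is a hypothetical guess at what the script actually reads):

// hypothetical guard; adjust the field list to whatever the transform actually uses
function hasRequiredFields (chunk) {
  return ['ClientRequestPath', 'EdgeStartTimestamp', 'EdgeResponseStatus', 'EdgeResponseBytes']
    .every((field) => chunk[field] !== undefined)
}

// inside transform():
//   if (!hasRequiredFields(chunk)) return callback() // skip incomplete records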

@AshCripps
Member

[screenshot of the repeated log message]

This is still broken, but interestingly when left for a while the logs are full of the above message - maybe the error is caused by a memory leak we aren't dealing with?

@AshCripps
Member

@MylesBorins are you able to look at the code in the latest commit?

I've converted to a Docker image and now use Cloud Run instead of Cloud Functions, but still get the same issue of it working once or twice and then just sending empty files.

The logs all say PIPELINE SUCCESS

@AshCripps AshCripps mentioned this pull request Jan 8, 2021
@AshCripps
Member

Update: I've really screwed up my branches so have been forced to open #2518 as a replacement - would appreciate people checking it over for anything I've missed before closing this one.

I now have a solution that has been working for approximately 2 days with no faults or errors 🎉 Going to give it over the weekend to make sure before moving on to the next stage: summaries.

Would appreciate some input from @rvagg and @jbergstroem on what's the best method for that stage: remaining in Google or running on a separate machine? Run once a day or on every new file? Etc.

@rvagg
Member Author

rvagg commented Jan 12, 2021

what's the best method for that stage: remaining in Google or running on a separate machine?

I think the conclusion on this was that data transfer in and out of GCP is too cost-prohibitive for the amount we're dealing with, so we have to do it internally, with whatever tools are available within GCP.

@AshCripps
Member

Merged #2518 instead; closing this and will open a new PR for next steps/additions.

@AshCripps AshCripps closed this Jan 18, 2021