Improve logging #511
Open
4 tasks
tsibley opened this issue Mar 31, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@tsibley
Member

tsibley commented Mar 31, 2022

Logging in nextstrain.org is currently ad-hoc: sometimes useful, sometimes not enough. The goal is to make what's logged more systematic and comprehensively useful while also improving our ability to inspect and act on the logs.

Below are some proposed changes to make, spanning both the codebase itself and our deployment infra. I expect to break each of these items out into separate issues in due course.

  • Standardize on debug or bunyan for emitting logs.

    debug is already used by Express and several related deps (set DEBUG='*' to see these logs), which is a point in its favor, but it's relatively limited (e.g. no log levels, no structured logs, no child loggers). bunyan provides a more traditional and modern set of logging features, including a focus on structured logs which greatly helps with analysis.

    Retire our existing logging functions (e.g. utils.verbose()) in favor of the above.

  • Switch from ad-hoc messages that embed limited context to structured logs with increased context across the board (e.g. all logs include req context, incl. critical request ids).

    The ad-hoc messages don't all have to go away—they're still useful for skimming by eye, particularly if well-written—but they shouldn't be the only thing produced by a log line. (See the bunyan sketch after this list for what this could look like.)

  • Archive raw logs to S3 for retrospective debugging/analysis.

    Archives should be raw, straight from Heroku's Logplex service.

    Two quick examples of things archives are useful for:

    • Investigating reports like "this weird thing happened last
      week"
    • Analyzing the impact of a bug over time, e.g. "how many requests
      were impacted by this over the last month?"

    Keep the retention period fairly short so we're not storing (potentially sensitive) data we don't need. Maybe retain for 30 days? (which is also the minimum duration you pay for with S3's infrequent-access storage classes). A sketch of a matching lifecycle rule follows this list.

    Papertrail supports S3 archiving, but we don't currently ingest raw logs into Papertrail due to account limitations (see our incoming log filters).

  • Ingest logs into analysis system for real-time inspection.

    Currently this is Papertrail, a Heroku add-on. It's got a nice, straightforward interface. However, we're on a pretty stringent free tier (10MB/day, 2d retention for search), so we have to omit lots of useful logs (e.g. all GET requests!). The retention period is also pretty short. Paid tiers that would accommodate our usage largely unfiltered get pricey.

    Logtail is another (glossier) app that also has a Heroku add-on, with a free tier that's a bit more generous than Papertrail (5GB/month, 8d retention for search). Even it, however, would require a bit of filtering on ingest and prices increase quickly from free. Interestingly, the Heroku add-on's free tier is more generous than the non-Heroku free tier of the same product.

    Grafana Logs is part of Grafana's hosted offering, Grafana Cloud, with a generous free tier of 50GB of logs and 14d retention. There's no Heroku add-on, so we'd have to manually configure the Heroku log drain → Grafana connection. This seems straightforward to do with Vector (a log-processing multi-tool), though it means we'd have to run Vector somewhere persistent. The smallest EC2 instance or Heroku dyno would almost certainly be fine, both at around $7/month. Fly.io uses Vector exactly like this for their fly-log-shipper sidecar service.

    A big benefit of using Grafana is that we'd also get an industry-leading suite of metrics, monitoring, dashboarding, and alerting tools. For the Seattle Flu infra, Grafana was a very nice place to ship our metrics and logs to. We don't have non-Heroku-generated metrics for nextstrain.org right now, but there are several I think would be good to start tracking. The downside is that it adds an additional moving part we'd have to manage (the connection between Heroku and Grafana). Primarily for this reason, it could make sense to stick with Papertrail (or switch to Logtail) and pay a bit more (or ask for a gratis bump up as a non-profit).
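
To make the first two items concrete, here's a minimal sketch (assuming bunyan; all names are illustrative, not a settled design) of an Express middleware that gives each request an id and a bunyan child logger, so every subsequent log line automatically carries that context:

```js
// Sketch only: assumes bunyan as a dependency; names are illustrative.
import bunyan from "bunyan";
import { randomUUID } from "crypto";

const logger = bunyan.createLogger({name: "nextstrain"});

// Express middleware: assign each request an id and a child logger that
// automatically attaches it (plus basic request context) to every line.
function requestLogging(req, res, next) {
  req.id = req.headers["x-request-id"] || randomUUID();
  req.log = logger.child({req_id: req.id, method: req.method, path: req.path});
  req.log.info("request received");   // emits structured JSON including the context above
  next();
}

// Later, in any handler:
//   req.log.warn({bucket, key}, "object not found");
// which stays skimmable by eye but carries machine-readable context too.
```

bunyan's JSON output could then be pretty-printed during dev with the bunyan CLI and parsed by whatever ingest/analysis tool we land on.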
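
For the archiving item, a hedged sketch of the 30-day expiration as an S3 lifecycle rule, using the AWS SDK for JavaScript v3 (bucket and prefix are placeholders, not our actual infra):

```js
// Sketch only: bucket/prefix are placeholders.
import { S3Client, PutBucketLifecycleConfigurationCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });

// Expire archived raw logs after 30 days so we don't keep
// (potentially sensitive) data longer than we need it.
await s3.send(new PutBucketLifecycleConfigurationCommand({
  Bucket: "example-log-archive",            // placeholder
  LifecycleConfiguration: {
    Rules: [{
      ID: "expire-raw-logs",
      Status: "Enabled",
      Filter: { Prefix: "logplex/" },       // placeholder prefix
      Expiration: { Days: 30 },
    }],
  },
}));
```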

victorlin added the enhancement (New feature or request) label Apr 13, 2022
tsibley added a commit that referenced this issue May 10, 2022
Makes tracing what's happening during dev much easier.  The full session
id isn't included since it's sensitive; a short prefix should be enough
for troubleshooting when multiple sessions are involved.

I opted to use the "debug" module instead of our own utils.verbose()
because this is very chatty log output (i.e. at least one line for every
authenticated request, even static assets).  The "debug" module is
already used by many of our dependencies, most notably Express and its
ecosystem, so it isn't much of a change to use it ourselves.  That said,
this isn't an endorsement of using it longer term and we should evaluate
that with other logging improvements.¹

¹ #511
tsibley mentioned this issue Jun 8, 2022
tsibley added a commit that referenced this issue Jun 22, 2022
Ostensibly, the main reason for going through CloudFront for requests
from the nextstrain.org server → S3¹ is that it's cheaper (and maybe
faster).

While this is true for requests from the internet, our Heroku dynos run
on AWS EC2 in us-east-1 (the same region as our S3 buckets) and thus get
free internal data transfer for the purposes of pricing.  I don't
believe CloudFront marks down data egress back into AWS separately from
other egress like S3.  CloudFront also has more expensive per-request
pricing than S3, esp. for our most common requests (GET).  That said,
CloudFront has generous free tier quotas for both bandwidth and
requests, so it's hard to know the net balance of the two services
without our request/egress numbers for S3 ⟷ Heroku (which we don't have
and can't easily acquire without a better story for logging and
metrics¹).

Re: speed, given that the nextstrain.org server is a fixed point in
network space rather than a moving set of users around the world (i.e.
the nextstrain.org *users*), it's hard to see how CloudFront could
improve response speed by *adding* a layer of indirection.

All of that said, the biggest actual win gained by removing CloudFront
is that it means removing a caching layer and the chance for bugs due to
stale data (particularly read-after-write inconsistency issues).  In
this regard, I don't think CloudFront is helping, and by itself this
makes the switch worth it to me.

¹ See #511
jameshadfield added a commit that referenced this issue May 2, 2024
Following on from #847 and 3cf94a7, use the 'debug' logging library
which is already in use (explicitly and implicitly) within the
nextstrain.org codebase. See #511 for more context.
@jameshadfield
Member

Standardize on debug or bunyan for emitting logs.

We only have 40-50 usages of our utils.{verbose,log,warn,error} functions in the entire server and I'd imagine switching them all to {debug,console.*} calls would be well worth it in the short term. This doesn't rule out a move to some other logging tool in the future, but debug calls are already widely used throughout the server (mostly implicitly).

Is there an agreed standard for general want-to-always-see-this logging when using the 'debug' module? Its docs seem to suggest we'd do something like debug('nextstrain') and then habitually run with at least DEBUG=nextstrain.

@tsibley
Member Author

tsibley commented May 2, 2024

Is there an agreed standard for general want-to-always-see-this logging when using the 'debug' module? Its docs seem to suggest we'd do something like debug('nextstrain') and then habitually run with at least DEBUG=nextstrain.

Not really? I mean, you always have to specify DEBUG at invocation, or modify it programmatically very early at process start (but there be pitfalls). It really is meant for debug logging which defaults to off.

This is why I do think we want a proper logging library beyond debug.

But agreed that ditching utils.{verbose,log,warn,error} for {debug,console.*} in the short-term would probably be worthwhile.
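
For reference, the namespace pattern in question looks roughly like this (namespace strings and values are illustrative):

```js
// Sketch only: namespace strings are illustrative.
import debugFactory from "debug";

// One namespaced logger per module, all under a shared "nextstrain:" prefix.
const debug = debugFactory("nextstrain:authn");

const sessionIdPrefix = "abc123";  // illustrative value
debug("session %s authenticated", sessionIdPrefix);

// Off by default; only emitted when enabled at invocation, e.g.
//   DEBUG='nextstrain:*' node server.js
// which is why it suits opt-in debug output rather than always-on logging.
```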
