Improve logging #511
Makes tracing what's happening during dev much easier. The full session id isn't included since it's sensitive; a short prefix should be enough for troubleshooting when multiple sessions are involved. I opted to use the "debug" module instead of our own utils.verbose() because this is very chatty log output (i.e. at least one line for every authenticated request, even static assets). The "debug" module is already used by many of our dependencies, most notably Express and its ecosystem, so it isn't much of a change to use it ourselves. That said, this isn't an endorsement of using it longer term, and we should evaluate that with other logging improvements.¹

¹ #511
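A minimal sketch (no external deps) of the pattern described above: namespaced, chatty dev logging that includes only a short, non-sensitive prefix of the session id. With the real "debug" module the logger would come from `require("debug")("…")` instead; the namespace (`nextstrain:authn`), the helper names, and the sample session id here are all illustrative, not taken from the actual codebase.

```javascript
const SESSION_ID_PREFIX_LENGTH = 8;

// Truncate a session id to a short prefix: enough to distinguish concurrent
// sessions while troubleshooting, without logging the full sensitive value.
function sessionLogPrefix(sessionId) {
  return sessionId.slice(0, SESSION_ID_PREFIX_LENGTH) + "…";
}

// Tiny stand-in for a debug()-style namespaced logger, writing to stderr
// like the real "debug" module does.
function makeLogger(namespace) {
  return (msg) => console.error(`${namespace} ${msg}`);
}

const debug = makeLogger("nextstrain:authn");

// One line per authenticated request, e.g.:
debug(`session=${sessionLogPrefix("3f9c2b71d4e8a605")} GET /charon/getDataset`);
```

With the real module, the equivalent lines only appear when `DEBUG=nextstrain:*` (or similar) is set in the environment.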
It seems that ostensibly the main reason for going through CloudFront for requests from the nextstrain.org server → S3¹ is that it's cheaper (and maybe faster). While this is true for requests from the internet, our Heroku dynos run on AWS EC2 in us-east-1 (the same region as our S3 buckets) and thus get free internal data transfer for the purposes of pricing. I don't believe CloudFront marks down data egress back into AWS separately from other egress like S3. CloudFront also has more expensive per-request pricing than S3, esp. for our most common requests (GET). That said, CloudFront has generous free tier quotas for both bandwidth and requests, so it's hard to know the net balance of the two services without our request/egress numbers for S3 ⟷ Heroku (which we don't have and can't easily acquire without a better story for logging and metrics¹). Re: speed, given that the nextstrain.org server is a fixed point in network space rather than a moving set of users around the world (i.e. the nextstrain.org *users*), it's hard to see how CloudFront could improve response speed by *adding* a layer of indirection. All of that said, the biggest actual win gained by removing CloudFront is that it means removing a caching layer and the chance for bugs due to stale data (particularly read-after-write inconsistency issues). In this regard, I don't think CloudFront is helping, and by itself this makes the switch worth it to me. ¹ See #511
We only have 40-50 usages of our … Is there an agreed standard for general want-to-always-see-this logging when using the 'debug' module? Its docs seem to suggest we'd do something like …
Not really? I mean, you always have to specify … This is why I do think we want a proper logging library beyond … But agreed that ditching …
Logging in nextstrain.org is currently ad-hoc and sometimes useful, sometimes not enough. The goal is to make what's logged more systematic and comprehensively useful while at the same time improving our ability to inspect and act on the logs.
Below are some proposed changes to make, spanning both the codebase itself as well as our deployment infra. I expect to break each of these items out into separate issues in due course.
- Standardize on `debug` or `bunyan` for emitting logs.
  - `debug` is already used by Express and several related deps (set `DEBUG='*'` to see these logs), which is a point in its favor, but it's relatively limited (e.g. no log levels, no structured logs, no child loggers).
  - `bunyan` provides a more traditional and modern set of logging features, including a focus on structured logs, which greatly helps with analysis.
- Retire our existing logging functions (e.g. `utils.verbose()`) in favor of the above.
- Switch from ad-hoc messages that embed limited context to structured logs with increased context across the board (e.g. all logs include req context, incl. critical request ids). The ad-hoc messages don't all have to go away—they're still useful for skimming by eye, particularly if well-written—but they shouldn't be the only thing produced by a log line.
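To make the "structured logs with bound request context" idea concrete, here is a dependency-free sketch of the shape involved: every log line is one JSON object carrying both request context and a human-readable message. bunyan's real API provides this via `log.child({...})`; the field names used here (`req_id`, `path`, `msg`) are illustrative choices, not an existing nextstrain.org convention.

```javascript
// Create a logger with some bound context; `write` is injectable for testing.
function createLogger(baseContext, write = (line) => console.log(line)) {
  return {
    // bunyan-style child(): a new logger with extra bound context.
    child(extraContext) {
      return createLogger({...baseContext, ...extraContext}, write);
    },
    // Emit one structured record plus a skimmable human message.
    info(fields, msg) {
      write(JSON.stringify({
        level: "info",
        time: new Date().toISOString(),
        ...baseContext,
        ...fields,
        msg,
      }));
    },
  };
}

// Per-request child logger, e.g. created in an Express middleware:
const log = createLogger({app: "nextstrain.org"});
const reqLog = log.child({req_id: "abc123", path: "/charon/getAvailable"});
reqLog.info({status: 200}, "request completed");
```

The payoff is that every line a request emits carries its `req_id` automatically, so logs from concurrent requests can be separated after the fact.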
- Archive raw logs to S3 for retrospective debugging/analysis.
  - Archives should be raw, straight from Heroku's Logplex service.
  - Two quick examples of things archives are useful for:
    - "… week"
    - "… were impacted by this over the last month?"
  - Keep the retention period fairly short so we're not storing (potentially sensitive) data we don't need. Maybe retain for 30 days? (That's also the minimum duration you pay for with S3.)
  - Papertrail supports S3 archiving, but we don't currently ingest raw logs into Papertrail due to account limitations (see our incoming log filters).
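As a sketch of what retrospective analysis over raw archives could look like: Heroku's Logplex drains emit syslog-style lines roughly in the shape of the sample below (`<pri>version timestamp host source dyno - message`). The exact framing can vary, so treat this regex and the constructed sample line as illustrative, not a complete Logplex parser.

```javascript
// Rough match for a syslog-style Logplex line (illustrative, not exhaustive).
const LOGPLEX_LINE = /^<(\d+)>\d+ (\S+) (\S+) (\S+) (\S+) - (.*)$/;

// Parse one archived line back into fields; returns null on non-matching lines.
function parseLogLine(line) {
  const m = LOGPLEX_LINE.exec(line);
  if (!m) return null;
  const [, pri, timestamp, host, source, dyno, message] = m;
  return {pri: Number(pri), timestamp, host, source, dyno, message};
}

// Constructed sample in the shape of a Heroku router log line:
const sample =
  '<158>1 2021-03-01T12:00:00+00:00 host heroku router - at=info method=GET path="/zika" status=200';

const parsed = parseLogLine(sample);
// From here, answering "how many requests were impacted?" is a filter/count
// over the parsed lines for the relevant time window.
```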
- Ingest logs into an analysis system for real-time inspection.
  - Currently this is Papertrail, a Heroku add-on. It's got a nice, straightforward interface. However, we're using a pretty stringent free tier (10 MB/day, 2d retention for search), so we have to omit lots of useful logs (e.g. all GET requests!). The retention period is also pretty short. Paid tiers that would accommodate our usage largely unfiltered get pricey.
  - Logtail is another (glossier) app that also has a Heroku add-on, with a free tier that's a bit more generous than Papertrail's (5 GB/month, 8d retention for search). Even it, however, would require a bit of filtering on ingest, and prices increase quickly from free. Interestingly, the Heroku add-on's free tier is more generous than the non-Heroku free tier of the same product.
  - Grafana Logs is part of Grafana's hosted offering, Grafana Cloud, with a generous free tier of 50 GB of logs and 14d retention. There's no Heroku add-on, so we'd have to manually configure the Heroku log drain → Grafana connection. This seems straightforward to do with Vector (a log-processing multi-tool), though it means we'll have to run Vector somewhere persistent. The smallest EC2 instance or Heroku dyno would almost certainly be fine, both at around $7/month. Fly.io uses Vector exactly like this for their fly-log-shipper sidecar service.
  - A big benefit of using Grafana is that we'd also get an industry-leading suite of metrics, monitoring, dashboarding, and alerting tools. For the Seattle Flu infra, Grafana was a very nice place to ship our metrics and logs to. We don't have non-Heroku-generated metrics for nextstrain.org right now, but there are several I think would be good to start tracking. The downside is that it adds another moving part we'd have to manage (the connection between Heroku and Grafana). Primarily for this reason, it could make sense to stick with Papertrail (or switch to Logtail) and pay a bit more (or ask for a gratis bump as a non-profit).