Improve logging #511
Open
4 tasks
tsibley opened this issue Mar 31, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@tsibley
Member

tsibley commented Mar 31, 2022

Logging in nextstrain.org is currently ad-hoc: sometimes useful, sometimes not enough. The goal is to make what's logged more systematic and comprehensively useful while also improving our ability to inspect and act on the logs.

Below are some proposed changes to make, spanning both the codebase itself and our deployment infra. I expect to break each of these items out into separate issues in due course.

  • Standardize on debug or bunyan for emitting logs.

    debug is already used by Express and several related deps (set DEBUG='*' to see these logs), which is a point in its favor, but it's relatively limited (e.g. no log levels, no structured logs, no child loggers). bunyan provides a more traditional and modern set of logging features, including a focus on structured logs which greatly helps with analysis.

    Retire our existing logging functions (e.g. utils.verbose()) in favor of the above.

  • Switch from ad-hoc messages that embed limited context to structured logs with increased context across the board (e.g. all logs include req context, incl. critical request ids).

    The ad-hoc messages don't all have to go away—they're still useful for skimming by eye, particularly if well-written—but they shouldn't be the only thing produced by a log line. (See the bunyan sketch after this list for what this could look like.)

  • Archive raw logs to S3 for retrospective debugging/analysis.

    Archives should be raw, straight from Heroku's Logplex service.

    Two quick examples of things archives are useful for:

    • Investigating reports like "this weird thing happened last
      week"
    • Analyzing the impact of a bug over time, e.g. "how many requests
      were impacted by this over the last month?"

    Keep the retention period fairly short so we're not storing (potentially sensitive) data we don't need. Maybe retain for 30 days? (which is also the minimum duration you pay for with S3's infrequent-access storage classes). A sketch of a matching lifecycle rule follows this list.

    Papertrail supports S3 archiving, but we don't currently ingest raw logs into Papertrail due to account limitations (see our incoming log filters).

  • Ingest logs into analysis system for real-time inspection.

    Currently this is Papertrail, a Heroku add-on. It's got a nice, straightforward interface. However, we're on a pretty stringent free tier (10MB/day, 2d retention for search), so we have to omit lots of useful logs (e.g. all GET requests!). The retention period is also pretty short. Paid tiers that would accommodate our usage largely unfiltered get pricey.

    Logtail is another (glossier) app that also has a Heroku add-on, with a free tier that's a bit more generous than Papertrail (5GB/month, 8d retention for search). Even it, however, would require a bit of filtering on ingest and prices increase quickly from free. Interestingly, the Heroku add-on's free tier is more generous than the non-Heroku free tier of the same product.

    Grafana Logs is part of Grafana's hosted offering, Grafana Cloud, with a generous free tier of 50GB of logs and 14d retention. There's no Heroku add-on, so we'd have to manually configure the Heroku log drain → Grafana connection. This seems straightforward to do with Vector (a log-processing multi-tool), though it means we'd have to run Vector somewhere persistent. The smallest EC2 instance or Heroku dyno would almost certainly be fine, both at around $7/month. Fly.io uses Vector exactly like this for their fly-log-shipper sidecar service.

    A big benefit of using Grafana is that we'd also get an industry-leading suite of metrics, monitoring, dashboarding, and alerting tools. For the Seattle Flu infra, Grafana was a very nice place to ship our metrics and logs to. We don't have non-Heroku-generated metrics for nextstrain.org right now, but there are several I think would be good to start tracking. The downside is that it adds an additional moving part we'd have to manage (the connection between Heroku and Grafana). Primarily for this reason, it could make sense to stick with Papertrail (or switch to Logtail) and pay a bit more (or ask for a gratis bump up as a non-profit).
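
To make the first two items concrete, here's a minimal sketch (assuming bunyan; all names are illustrative, not a settled design) of an Express middleware that gives each request an id and a bunyan child logger, so every subsequent log line automatically carries that context:

```js
// Sketch only: assumes bunyan as a dependency; names are illustrative.
import bunyan from "bunyan";
import { randomUUID } from "crypto";

const logger = bunyan.createLogger({name: "nextstrain"});

// Express middleware: assign each request an id and a child logger that
// automatically attaches it (plus basic request context) to every line.
function requestLogging(req, res, next) {
  req.id = req.headers["x-request-id"] || randomUUID();
  req.log = logger.child({req_id: req.id, method: req.method, path: req.path});
  req.log.info("request received");   // emits structured JSON including the context above
  next();
}

// Later, in any handler:
//   req.log.warn({bucket, key}, "object not found");
// which stays skimmable by eye but carries machine-readable context too.
```

bunyan's JSON output could then be pretty-printed during dev with the bunyan CLI and parsed by whatever ingest/analysis tool we land on.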
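
For the archiving item, a hedged sketch of the 30-day expiration as an S3 lifecycle rule, using the AWS SDK for JavaScript v3 (bucket and prefix are placeholders, not our actual infra):

```js
// Sketch only: bucket/prefix are placeholders.
import { S3Client, PutBucketLifecycleConfigurationCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });

// Expire archived raw logs after 30 days so we don't keep
// (potentially sensitive) data longer than we need it.
await s3.send(new PutBucketLifecycleConfigurationCommand({
  Bucket: "example-log-archive",            // placeholder
  LifecycleConfiguration: {
    Rules: [{
      ID: "expire-raw-logs",
      Status: "Enabled",
      Filter: { Prefix: "logplex/" },       // placeholder prefix
      Expiration: { Days: 30 },
    }],
  },
}));
```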

victorlin added the enhancement (New feature or request) label Apr 13, 2022
tsibley added a commit that referenced this issue May 10, 2022
Makes tracing what's happening during dev much easier.  The full session
id isn't included since it's sensitive; a short prefix should be enough
for troubleshooting when multiple sessions are involved.

I opted to use the "debug" module instead of our own utils.verbose()
because this is very chatty log output (i.e. at least one line for every
authenticated request, even static assets).  The "debug" module is
already used by many of our dependencies, most notably Express and its
ecosystem, so it isn't much of a change to use it ourselves.  That said,
this isn't an endorsement of using it longer term and we should evaluate
that with other logging improvements.¹

¹ #511
tsibley mentioned this issue Jun 8, 2022
tsibley added a commit that referenced this issue Jun 22, 2022
Ostensibly, the main reason for going through CloudFront for requests
from the nextstrain.org server → S3¹ is that it's cheaper (and maybe
faster).

While this is true for requests from the internet, our Heroku dynos run
on AWS EC2 in us-east-1 (the same region as our S3 buckets) and thus get
free internal data transfer for the purposes of pricing.  I don't
believe CloudFront marks down data egress back into AWS separately from
other egress like S3.  CloudFront also has more expensive per-request
pricing than S3, esp. for our most common requests (GET).  That said,
CloudFront has generous free tier quotas for both bandwidth and
requests, so it's hard to know the net balance of the two services
without our request/egress numbers for S3 ⟷ Heroku (which we don't have
and can't easily acquire without a better story for logging and
metrics¹).

Re: speed, given that the nextstrain.org server is a fixed point in
network space rather than a moving set of users around the world (i.e.
the nextstrain.org *users*), it's hard to see how CloudFront could
improve response speed by *adding* a layer of indirection.

All of that said, the biggest actual win gained by removing CloudFront
is that it means removing a caching layer and the chance for bugs due to
stale data (particularly read-after-write inconsistency issues).  In
this regard, I don't think CloudFront is helping, and by itself this
makes the switch worth it to me.

¹ See #511
jameshadfield added a commit that referenced this issue May 2, 2024
Following on from #847 and 3cf94a7, use the 'debug' logging library
which is already in use (explicitly and implicitly) within the
nextstrain.org codebase. See #511 for more context.
@jameshadfield
Member

Standardize on debug or bunyan for emitting logs.

We only have 40-50 usages of our utils.{verbose,log,warn,error} functions in the entire server and I'd imagine switching them all to {debug,console.*} calls would be well worth it in the short term. This doesn't rule out a move to some other logging tool in the future, but debug calls are already widely used throughout the server (mostly implicitly).

Is there an agreed standard for general want-to-always-see-this logging when using the 'debug' module? Its docs seem to suggest we'd do something like debug('nextstrain') and then habitually run with at least DEBUG=nextstrain.

@tsibley
Member Author

tsibley commented May 2, 2024

Is there an agreed standard for general want-to-always-see-this logging when using the 'debug' module? Its docs seem to suggest we'd do something like debug('nextstrain') and then habitually run with at least DEBUG=nextstrain.

Not really? I mean, you always have to specify DEBUG at invocation, or modify it programmatically very early at process start (but there be pitfalls). It really is meant for debug logging which defaults to off.

This is why I do think we want a proper logging library beyond debug.

But agreed that ditching utils.{verbose,log,warn,error} for {debug,console.*} in the short-term would probably be worthwhile.
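
For reference, the namespace pattern in question looks roughly like this (namespace strings and values are illustrative):

```js
// Sketch only: namespace strings are illustrative.
import debugFactory from "debug";

// One namespaced logger per module, all under a shared "nextstrain:" prefix.
const debug = debugFactory("nextstrain:authn");

const sessionIdPrefix = "abc123";  // illustrative value
debug("session %s authenticated", sessionIdPrefix);

// Off by default; only emitted when enabled at invocation, e.g.
//   DEBUG='nextstrain:*' node server.js
// which is why it suits opt-in debug output rather than always-on logging.
```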
