Monitoring
Chapter 6 of the SRE Book
Monitoring
Monitoring and alerting enable a system to tell us when it’s broken, or perhaps to tell us what’s about to break.
Effective alerting systems have good signal and very low noise.
A good monitoring system tells you what's broken (the symptom) and why (the cause).
The Four Golden Signals
- Latency
- The time it takes to respond to a request
- Traffic
- Web: HTTP requests per second
- Key/value storage: transactions and retrievals per second
- Errors
- Saturation
- Knowing when your service is 'full'
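As a rough illustration of the four signals above, here is a minimal sketch that summarizes one window of request records. The `Request` type, the `golden_signals` function, and the use of in-flight requests divided by capacity as the saturation measure are all assumptions made for this example, not something the SRE Book prescribes.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to respond to the request
    status: int        # HTTP status code returned


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one time window into the four golden signals.

    Saturation here is in-flight requests / capacity, which is only one
    possible (hypothetical) definition of how 'full' a service is.
    """
    saturation = in_flight / capacity
    if not requests:
        return {"latency_p99_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": saturation}

    latencies = sorted(r.latency_ms for r in requests)
    n = len(latencies)
    # Nearest-rank 99th percentile, using integer math: ceil(0.99 * n)
    rank = max(1, -(-99 * n // 100))
    p99 = latencies[rank - 1]

    # Count server-side failures (5xx) as errors
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p99_ms": p99,
        "traffic_rps": n / window_s,
        "error_rate": errors / n,
        "saturation": saturation,
    }


if __name__ == "__main__":
    sample = [Request(10.0, 200)] * 98 + [Request(500.0, 500)] * 2
    print(golden_signals(sample, window_s=10, in_flight=40, capacity=50))
```

A real system would usually track the latency of successful and failed requests separately, since a fast error and a slow success mean different things.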
When monitoring, keep things simple:
The rules that catch real incidents most often should be as simple, predictable, and reliable as possible. Data collection, aggregation, and alerting configuration that is rarely exercised (e.g., less than once a quarter for some SRE teams) should be up for removal. Signals that are collected, but not exposed in any prebaked dashboard nor used by any alert, are candidates for removal.
https://landing.google.com/sre/workbook/chapters/monitoring/
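The "rarely exercised" rule quoted above can be approximated with a periodic audit. The sketch below assumes you can export, for each alert rule, the last time it fired; the `removal_candidates` function and the 90-day cutoff (standing in for "once a quarter") are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def removal_candidates(last_fired, now, max_idle=timedelta(days=90)):
    """Return the names of rules that never fired or have been idle too long.

    last_fired maps a rule name to the datetime it last fired, or None
    if it has never fired.
    """
    return sorted(
        name
        for name, fired_at in last_fired.items()
        if fired_at is None or now - fired_at > max_idle
    )


if __name__ == "__main__":
    now = datetime(2024, 6, 1)
    rules = {
        "high_error_rate": datetime(2024, 5, 20),
        "disk_full": datetime(2023, 12, 1),
        "legacy_queue_depth": None,
    }
    print(removal_candidates(rules, now))
```

The output is a list of candidates for review, not an automatic delete list: a rule that rarely fires may still guard against a rare but severe failure.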
Metrics and structured logging
A good monitoring system has the following attributes:
- Speed
- How 'fresh' do you need your data to be?
- How fast can you retrieve your data?
- Calculations
- Interfaces
- You should give people different options when looking at the data (types of graphs, ways of drilling down into the data)
- Alerts
- Some alerts are more important than others. How do you classify your alerts? (Slack-only vs. PagerDuty)
- Sources of monitoring data: logs, metrics, distributed tracing and runtime introspection
- Metrics are numbers that represent attributes and events
- Logs are an append-only record of events
- Alert when your SLI metrics show that your error budget is under threat
- SLI metrics should be easy to see on the landing page of your dashboard
- Make sure you can tweak alerting to tell you when you have made changes to your codebase
- Monitor responses coming from important dependencies
- Write tests for your monitoring systems
- Good luck, you'll have to develop a DSL
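Two of the points above, classifying alerts by importance and alerting when the error budget is under threat, can be combined in one small sketch. It assumes a burn-rate style check: the observed error rate divided by the error budget implied by the SLO. The function names and the thresholds (`page_at=10.0`, `notify_at=2.0`) are hypothetical and would need tuning for a real service.

```python
def burn_rate(error_rate, slo_target):
    """Return how fast the error budget is being consumed.

    A value of 1.0 means errors arrive exactly at the rate the SLO
    allows; higher values mean the budget will run out early.
    """
    budget = 1.0 - slo_target  # e.g. a 99.9% SLO leaves a 0.1% budget
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def route_alert(rate, page_at=10.0, notify_at=2.0):
    """Classify an alert by severity (thresholds are illustrative only)."""
    if rate >= page_at:
        return "page"   # wake someone up, e.g. via a paging service
    if rate >= notify_at:
        return "chat"   # post to a chat channel for working hours
    return "none"


if __name__ == "__main__":
    for observed in (0.02, 0.003, 0.0005):
        rate = burn_rate(observed, slo_target=0.999)
        print(f"error_rate={observed} burn_rate={rate:.1f} -> {route_alert(rate)}")
```

Because the logic is a pair of pure functions, it is also easy to cover with unit tests, which is one simple way to act on the advice to write tests for your monitoring.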