Monitoring

kimschles edited this page Jul 12, 2019 · 4 revisions

Chapter 6 of the SRE Book

Monitoring

Monitoring and alerting enable a system to tell us when it’s broken, or perhaps to tell us what’s about to break.

Effective alerting systems have good signal and very low noise.

A good monitoring system tells you what's broken (the symptom) and why (the cause).

The Four Golden Signals

  • Latency
    • The time it takes to respond to a request
  • Traffic
    • Web: HTTP requests per second
    • Key/value storage: transactions and retrievals per second
  • Errors
  • Saturation
    • Knowing when your service is 'full'
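
As a rough illustration, the four signals can be derived from a window of request records. This is a minimal sketch, not a production collector: the `Request` type, the `golden_signals` function, and the use of traffic divided by an assumed capacity as the saturation measure are all hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to respond to the request
    ok: bool           # False if the request failed


def golden_signals(requests, window_seconds, capacity_rps):
    """Summarise one window of requests as the four golden signals.

    capacity_rps is an assumed, known limit used as a stand-in for 'full'.
    """
    count = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 99th percentile; 0.0 when the window is empty.
    p99 = latencies[math.ceil(0.99 * count) - 1] if count else 0.0
    failures = sum(1 for r in requests if not r.ok)
    traffic_rps = count / window_seconds
    return {
        "p99_latency_ms": p99,                             # Latency
        "traffic_rps": traffic_rps,                        # Traffic
        "error_ratio": failures / count if count else 0.0, # Errors
        "saturation": traffic_rps / capacity_rps,          # Saturation
    }
```

A saturation value close to 1.0 would suggest the service is nearly 'full' under this simplified model.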

When monitoring, keep things simple:

The rules that catch real incidents most often should be as simple, predictable, and reliable as possible. Data collection, aggregation, and alerting configuration that is rarely exercised (e.g., less than once a quarter for some SRE teams) should be up for removal. Signals that are collected, but not exposed in any prebaked dashboard nor used by any alert, are candidates for removal.
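
The pruning advice above can be turned into a periodic audit. The sketch below assumes you can export each alert rule's last firing time and the set of signals referenced by dashboards and alerts; the function names, the input shapes, and the 90-day default (standing in for "once a quarter") are hypothetical.

```python
from datetime import datetime, timedelta


def removal_candidates(rules, now, max_idle=timedelta(days=90)):
    """Return names of rules that have not fired within max_idle.

    rules maps a rule name to its last firing time, or None if it never fired.
    """
    stale = []
    for name, last_fired in rules.items():
        if last_fired is None or now - last_fired > max_idle:
            stale.append(name)
    return sorted(stale)


def unused_signals(collected, dashboard_signals, alert_signals):
    """Return collected signals that no dashboard and no alert refers to."""
    return sorted(set(collected) - set(dashboard_signals) - set(alert_signals))
```

Both functions only produce candidates; a person should still decide whether a rarely exercised rule is safe to delete.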

Chapter 4 of the SRE Workbook

https://landing.google.com/sre/workbook/chapters/monitoring/

Metrics and structured logging

A good monitoring system has the following attributes:

  • Speed
    • How 'fresh' do you need your data to be?
    • How fast can you retrieve your data?
  • Calculations
  • Interfaces
    • You should give people different options when looking at the data (types of graphs, ways of drilling down into the data)
  • Alerts
    • Some alerts are more important than others. How do you classify your alerts? (Slack-only vs. PagerDuty)
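
One way to make alert classification explicit is a small routing table. This is a sketch under assumptions: the two severity levels, the `classify` rule (page only for user-facing problems that need immediate action), and the channel names are hypothetical, and no real Slack or PagerDuty API is called.

```python
from enum import Enum


class Severity(Enum):
    PAGE = "page"      # wake someone up
    NOTIFY = "notify"  # look at it during working hours


# Hypothetical mapping from severity to a notification channel.
ROUTES = {Severity.PAGE: "pagerduty", Severity.NOTIFY: "slack"}


def classify(user_facing, needs_immediate_action):
    """Page only when users are affected and a human must act now."""
    if user_facing and needs_immediate_action:
        return Severity.PAGE
    return Severity.NOTIFY


def route(severity):
    """Return the channel name an alert of this severity should go to."""
    return ROUTES[severity]
```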

Sources of Monitoring Data

  • Sources of monitoring data: logs, metrics, distributed tracing and runtime introspection
  • Metrics are numbers that represent attributes and events
  • Logs are an append-only record of events

Metrics with Purpose

  • Alert when your SLI metrics show that your error budget is under threat
  • SLI metrics should be easy to see on the landing page of your dashboard
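
A common way to express "error budget under threat" is a burn rate: the observed error ratio divided by the error ratio the SLO allows. The sketch below assumes a request-based availability SLI; the function names and the default threshold of 14.4 are illustrative assumptions to be tuned per service, not values taken from these notes.

```python
def burn_rate(errors, total, slo):
    """Return how fast the error budget is being consumed.

    1.0 means the budget would be used up exactly at the end of the SLO
    period; higher values mean it runs out sooner.
    """
    if total == 0:
        return 0.0
    allowed_error_ratio = 1.0 - slo  # e.g. about 0.001 for a 99.9% SLO
    return (errors / total) / allowed_error_ratio


def budget_under_threat(errors, total, slo, threshold=14.4):
    """Return True when the burn rate exceeds the (assumed) threshold."""
    return burn_rate(errors, total, slo) > threshold
```

For example, 20 failed requests out of 1000 against a 99.9% SLO gives a burn rate of roughly 20, which would trip the default threshold.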

Intended Changes

  • Make sure you can tweak alerting so that you know when you have made changes to your codebase

Dependencies

  • Monitor responses coming from important dependencies
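
A minimal sketch of dependency monitoring is to count responses per dependency and status class at the call site. The `DependencyMonitor` class and the choice to treat only 5xx responses as errors are hypothetical simplifications for illustration.

```python
from collections import Counter


class DependencyMonitor:
    """Counts responses from each dependency, grouped by status class."""

    def __init__(self):
        # Key: (dependency name, status class such as 2 for 2xx).
        self.responses = Counter()

    def record(self, dependency, status_code):
        self.responses[(dependency, status_code // 100)] += 1

    def error_ratio(self, dependency):
        """Return the share of 5xx responses seen from this dependency."""
        total = sum(
            n for (dep, _), n in self.responses.items() if dep == dependency
        )
        errors = self.responses[(dependency, 5)]
        return errors / total if total else 0.0
```

The resulting ratio could be exported as a metric and alerted on like any other SLI.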

Saturation

Status of Served Traffic

Testing Alerting Logic

  • Write tests for your monitoring systems
  • Good luck, you'll have to develop a DSL
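
If an alert condition can be expressed as a plain function over recent samples, it can be tested like any other code. The sketch below is an assumption-laden stand-in for a real alerting DSL: `should_alert`, its threshold, and the "must hold for N consecutive samples" rule are hypothetical.

```python
def should_alert(error_ratios, threshold=0.05, for_samples=3):
    """Fire only when the last for_samples samples all exceed threshold."""
    if len(error_ratios) < for_samples:
        return False
    return all(r > threshold for r in error_ratios[-for_samples:])


def test_brief_spike_does_not_fire():
    # A single bad sample should not page anyone.
    assert not should_alert([0.0, 0.9, 0.0])


def test_sustained_errors_fire():
    # Three consecutive samples above the threshold should fire.
    assert should_alert([0.0, 0.1, 0.2, 0.3])


def test_not_enough_data_does_not_fire():
    assert not should_alert([0.9, 0.9])
```

The test functions follow the pytest naming convention, so they can be collected by a test runner or called directly.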