x/build: collect key metrics from the build infrastructure #47325

@cagedmantis

This is a tracking issue for the collection of key operational metrics from the build infrastructure. These metrics are being collected for the following reasons:

  • Increasing the speed of root cause analysis when an issue arises.
  • Understanding how changes correlate to performance changes.
  • Identifying key areas for possible optimizations.
  • Facilitating monitoring and alerting on metrics.

The task list below will be extended once a detailed list of key metrics has been identified.

  • Collect Metrics

  • Create Dashboards

  • GCP Aggregate Service API metrics

  • AWS Aggregate Service API metrics

  • GitHub Aggregate Service API metrics

  • Gerrit Aggregate Service API metrics

  • TLS certificate lifetime

  • General OS/Application/Container specific metrics

Coordinator

  • Amount of time waiting for VM quota
  • Buildlet creation latency by stage and type
  • Total buildlet creation latency
  • VM instance creation failures
  • Instance creation queue depth
  • Instance creation queue latency
  • Active Trybot count, latency, failures by type
  • Buildlet count by pool
  • Pending build count by type
  • Pending build latency by type
  • Uptime
  • Build rate
  • General API instrumentation (like ochttp)

Gomote

  • Sessions created
  • Sessions destroyed
  • Session duration
  • Command usage (SSH, put, etc.)

@golang/release

Metadata

Labels

  • Builders: x/build issues (builders, bots, dashboards)
  • NeedsFix: The path to resolution is known, but the work has not been done.

Projects

Status

Planned
