# RFC 3191 - 2020-08-26 - Collecting host-based metrics

This RFC introduces a new metrics source for consuming host-based metrics. The high-level plan is to implement one (or more?) sources that collect CPU, disk, network, and memory metrics from hosts running Linux, Windows, or macOS.

## Scope

This RFC will cover:

- A new source for host-based metrics, specifically:
- CPU
- Memory
- Disk
- Network
- Collection on Linux, Windows, and macOS.

This RFC will not cover:

- Other host metrics.
- Other platforms.

## Motivation

Users want to collect, transform, and forward metrics to better observe how their hosts are performing.

## Internal Proposal

Build a single source called `host_metrics` (name to be confirmed) to collect host/system level metrics.

I've found a number of possible Rust-based solutions for implementing this collection cross-platform:

- https://crates.io/crates/heim (Rust)
- https://lib.rs/crates/sys-info (Rust + C)
- https://github.com/myfreeweb/systemstat (pure Rust)
- https://docs.rs/sysinfo/0.3.19/sysinfo/index.html

The Heim crate has a useful comparison doc showing the available tools and their capabilities.

- https://github.com/heim-rs/heim/blob/master/COMPARISON.md

For this implementation, we're recommending the Heim crate based on its platform and feature coverage.
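
As a concrete illustration, here's a rough sketch of what collecting CPU times with Heim might look like. This is a hedged sketch under assumptions, not the final implementation: `collect_cpu` is a hypothetical function name, the printed output stands in for emitting Vector metric events, and Heim's stream-based async API is used as documented at the time (assumed dependencies: `heim = "0.0"`, `futures = "0.3"`).

```rust
use futures::StreamExt;
use heim::units::time::second;

/// Hypothetical collector body. Inside Vector this would run on the
/// existing async runtime and emit `cpu_seconds_total` counters
/// rather than printing.
async fn collect_cpu() -> heim::Result<()> {
    // heim::cpu::times() yields one CpuTime per logical CPU.
    let mut times = heim::cpu::times().enumerate().boxed();
    while let Some((cpu, cpu_time)) = times.next().await {
        let t = cpu_time?;
        // Each value maps to a `cpu_seconds_total` counter tagged with
        // the mode (user/system/idle) and the CPU index.
        println!(
            "cpu={} user={}s system={}s idle={}s",
            cpu,
            t.user().get::<second>(),
            t.system().get::<second>(),
            t.idle().get::<second>()
        );
    }
    Ok(())
}

fn main() -> heim::Result<()> {
    futures::executor::block_on(collect_cpu())
}
```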

We'd use one of these to collect the following metrics:

- `cpu_seconds_total` tagged with mode (idle, nice, system, user) and CPU (counter)
- `disk_read_bytes_total` tagged with the disk (counter)
- `disk_read_errors_total` tagged with the disk (counter)
- `disk_read_retries_total` tagged with the disk (counter)
- `disk_read_sectors_total` tagged with the disk (counter)
- `disk_read_time_seconds_total` tagged with the disk (counter)
- `disk_reads_completed_total` tagged with the disk (counter)
- `disk_write_errors_total` tagged with the disk (counter)
- `disk_write_retries_total` tagged with the disk (counter)
- `disk_write_time_seconds_total` tagged with the disk (counter)
- `disk_writes_completed_total` tagged with the disk (counter)
- `disk_written_bytes_total` tagged with the disk (counter)
- `disk_written_sectors_total` tagged with the disk (counter)
- `filesystem_avail_bytes` tagged with the device, filesystem type, and mountpoint (gauge)
- `filesystem_device_error` tagged with the device, filesystem type, and mountpoint (gauge)
- `filesystem_free_bytes` tagged with the device, filesystem type, and mountpoint (gauge)
- `filesystem_size_bytes` tagged with the device, filesystem type, and mountpoint (gauge)
- `filesystem_total_file_nodes` tagged with the device, filesystem type, and mountpoint (gauge)
- `filesystem_free_file_nodes` tagged with the device, filesystem type, and mountpoint (gauge)
- `load1` (gauge)
- `load5` (gauge)
- `load15` (gauge)
- `memory_active_bytes` (gauge)
- `memory_compressed_bytes` (gauge)
- `memory_free_bytes` (gauge)
- `memory_inactive_bytes` (gauge)
- `memory_swap_total_bytes` (gauge)
- `memory_swap_used_bytes` (gauge)
- `memory_swapped_in_bytes_total` (gauge)
- `memory_swapped_out_bytes_total` (gauge)
- `memory_total_bytes` (gauge)
- `memory_wired_bytes` (gauge)
- `network_receive_bytes_total` tagged with device (counter)
- `network_receive_errs_total` tagged with device (counter)
- `network_receive_multicast_total` tagged with device (counter)
- `network_receive_packets_total` tagged with device (counter)
- `network_transmit_bytes_total` tagged with device (counter)
- `network_transmit_errs_total` tagged with device (counter)
- `network_transmit_multicast_total` tagged with device (counter)
- `network_transmit_packets_total` tagged with device (counter)

Users should be able to limit the collection of metrics to specific classes: `cpu`, `memory`, `disk`, `filesystem`, `load`, and `network`.

Metrics will also be tagged with:

- `host`: the hostname of the monitored host.

And a `collector` tag for the type of metric:

- `disk` for disk-based metrics.
- `cpu` for CPU-based metrics.
- `filesystem` for filesystem-based metrics.
- `memory` for memory-based metrics.
- `load` for load-based metrics.
- `network` for network-based metrics.

Specific explanation of some of the filesystem metrics:

- `filesystem_avail_bytes` = Filesystem space available to non-root users in bytes (excluding reserved blocks).
- `filesystem_free_bytes` = Filesystem free space in bytes (including reserved blocks).
- `filesystem_size_bytes` = Filesystem size in bytes.

For example, on an ext4 filesystem that reserves the default 5% of blocks for root, `filesystem_free_bytes` exceeds `filesystem_avail_bytes` by that reserved amount.

## Doc-level Proposal

The following additional source configuration will be added:

```toml
[sources.my_source_id]
type = "host_metrics" # required
collectors = [ "all" ] # optional, defaults to collecting all metrics
filesystem.mountpoints = [ "*" ] # optional, defaults to all mountpoints
disk.devices = [ "*" ] # optional, defaults to all disk devices
network.devices = [ "*" ] # optional, defaults to all network devices
scrape_interval_secs = 15 # optional, default shown, in seconds
namespace = "host" # optional, defaults to "host"; namespace to put metrics under
```
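
Internally, this configuration might deserialize into something like the following. This is a hedged sketch: the struct and field names are hypothetical and sidestep Vector's actual configuration machinery (assumed dependencies: `serde = { version = "1", features = ["derive"] }`, `toml = "0.5"`).

```rust
use serde::Deserialize;

/// Hypothetical shape of the `host_metrics` source configuration.
#[derive(Debug, Deserialize)]
struct HostMetricsConfig {
    #[serde(default = "default_collectors")]
    collectors: Vec<String>,
    #[serde(default)]
    filesystem: FilterConfig,
    #[serde(default)]
    disk: FilterConfig,
    #[serde(default)]
    network: FilterConfig,
    #[serde(default = "default_scrape_interval_secs")]
    scrape_interval_secs: u64,
    #[serde(default = "default_namespace")]
    namespace: String,
}

/// `mountpoints` applies to the filesystem collector, `devices` to the
/// disk/network collectors; a real implementation would likely use a
/// distinct type per collector.
#[derive(Debug, Default, Deserialize)]
struct FilterConfig {
    mountpoints: Option<Vec<String>>,
    devices: Option<Vec<String>>,
}

fn default_collectors() -> Vec<String> { vec!["all".into()] }
fn default_scrape_interval_secs() -> u64 { 15 }
fn default_namespace() -> String { "host".into() }

fn main() {
    // Omitted fields fall back to the documented defaults.
    let config: HostMetricsConfig = toml::from_str(
        r#"
        collectors = ["cpu", "memory", "network"]
        network.devices = ["eth*"]
        "#,
    )
    .expect("valid config");
    println!("{:?}", config);
}
```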

For `collectors` (name open to discussion) we'd specify an array of metric collectors:

```toml
collectors = [ "cpu", "memory", "network" ]
```

For disk and network devices and filesystem mountpoints, the default is to collect for all (`"*"`) devices and mountpoints. Alternatively, you can configure Vector to collect only from specific devices, for example:

```toml
network.devices = [ "eth0" ]
```

Or, if we think it's feasible, using globbing like so:

```toml
network.devices = [ "eth*" ]
```

And, if feasible, we'd add syntax for excluding resources, as sketched below.
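
Both globbing and exclusion look feasible with an off-the-shelf matcher. Below is a hedged sketch using the `glob` crate's `Pattern` type (assumed dependency: `glob = "0.3"`); the `!`-prefix exclusion syntax is a hypothetical choice for illustration, not a settled design.

```rust
use glob::Pattern;

/// Hypothetical semantics: patterns prefixed with `!` exclude; a device is
/// collected if it matches at least one include pattern and no exclude
/// pattern.
fn device_matches(patterns: &[&str], device: &str) -> bool {
    let mut included = false;
    for p in patterns {
        if let Some(excluded) = p.strip_prefix('!') {
            if Pattern::new(excluded).map_or(false, |pat| pat.matches(device)) {
                return false;
            }
        } else if Pattern::new(p).map_or(false, |pat| pat.matches(device)) {
            included = true;
        }
    }
    included
}

fn main() {
    let patterns = ["eth*", "!eth1"];
    assert!(device_matches(&patterns, "eth0"));
    assert!(!device_matches(&patterns, "eth1"));
    assert!(!device_matches(&patterns, "lo"));
}
```

The same matcher would apply to filesystem mountpoints.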

We'd also add a guide for doing this on Docker (similar to https://github.com/prometheus/node_exporter#using-docker).

## Rationale

CPU, memory, disk, and network are the basic building blocks of host-based monitoring. They are considered "table stakes" for most metrics-based monitoring of hosts. Additionally, if we do not support collecting these metrics, it is likely to push people to use another tool.

As part of Vector's vision to be the "one tool" for ingesting and shipping
observability data, it makes sense to add as many sources as possible to reduce
the likelihood that a user will not be able to ingest metrics from their tools.

## Prior Art

- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/cpu
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/mem
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/disk
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/net
- https://github.com/elastic/beats/tree/master/metricbeat
- https://github.com/prometheus/node_exporter
- https://docs.fluentbit.io/manual/pipeline/inputs/cpu-metrics
- https://docs.fluentbit.io/manual/pipeline/inputs/disk-io-metrics
- https://docs.fluentbit.io/manual/pipeline/inputs/memory-metrics
- https://docs.fluentbit.io/manual/pipeline/inputs/network-io-metrics

## Drawbacks

- Additional maintenance and integration testing burden of a new source

## Alternatives

### Having users run Telegraf or Prometheus' node exporter and using Vector's `prometheus` source to scrape it

Instead of adding the source directly to Vector, we could instruct users to run
Telegraf or Prometheus' node exporter and point Vector at the exposed Prometheus scrape endpoint. This would leverage the already supported inputs from those projects.

I decided against this as it would be at odds with one of Vector's listed
principles:

> One Tool. All Data. - One simple tool gets your logs, metrics, and traces
> (coming soon) from A to B.

([Vector principles](https://vector.dev/docs/about/what-is-vector/#who-should-use-vector))

On the same page, it is mentioned that Vector should be a replacement for
Telegraf.

> You SHOULD use Vector to replace Logstash, Fluent*, Telegraf, Beats, or
> similar tools.

If users are already running Telegraf or Node Exporter, though, they could opt for this path.

## Outstanding Questions

- One source or many? Should we have `host_metrics` or `cpu_metrics`, `mem_metrics`, `disk_metrics`, or `load_metrics`?

## Plan Of Attack

Incremental steps that execute this change. Generally this is in the form of:

- [ ] Submit a PR with the initial source implementation

## Future Work

- Extend source to collect additional system-level metrics.
- Identify additional potential platforms.
