# RFC 3191 - 2020-08-26 - Collecting host-based metrics

This RFC introduces a new metrics source for consuming host-based metrics. The high-level plan is to implement one (or more?) sources that collect CPU, disk, network, and memory metrics from hosts running Linux, Windows, or macOS.

## Scope

This RFC will cover:

- A new source for host-based metrics, specifically:
- CPU
- Memory
- Disk
- Network
- Collection on Linux, Windows, and macOS.

This RFC will not cover:

- Other host metrics.
- Other platforms.

## Motivation

Users want to collect, transform, and forward metrics to better observe how their hosts are performing.

## Internal Proposal

Build a single source called `host_metrics` (name to be confirmed) to collect host/system level metrics.

I've found a number of possible Rust-based solutions for implementing this collection cross-platform:

- https://crates.io/crates/heim (Rust)
- https://lib.rs/crates/sys-info (Rust + C)
- https://github.com/myfreeweb/systemstat (pure Rust)
- https://docs.rs/sysinfo/0.3.19/sysinfo/index.html

The Heim crate has a useful comparison doc showing the available tools and their capabilities.

- https://github.com/heim-rs/heim/blob/master/COMPARISON.md

For this implementation, we're recommending the Heim crate based on its platform and feature coverage.
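
As a concrete illustration, here's a rough sketch of what collecting CPU times with Heim might look like. This is a hedged sketch under assumptions, not the final implementation: `collect_cpu` is a hypothetical function name, the printed output stands in for emitting Vector metric events, and Heim's stream-based async API is used as documented at the time (assumed dependencies: `heim = "0.0"`, `futures = "0.3"`).

```rust
use futures::StreamExt;
use heim::units::time::second;

/// Hypothetical collector body. Inside Vector this would run on the
/// existing async runtime and emit `cpu_seconds_total` counters
/// rather than printing.
async fn collect_cpu() -> heim::Result<()> {
    // heim::cpu::times() yields one CpuTime per logical CPU.
    let mut times = heim::cpu::times().enumerate().boxed();
    while let Some((cpu, cpu_time)) = times.next().await {
        let t = cpu_time?;
        // Each value maps to a `cpu_seconds_total` counter tagged with
        // the mode (user/system/idle) and the CPU index.
        println!(
            "cpu={} user={}s system={}s idle={}s",
            cpu,
            t.user().get::<second>(),
            t.system().get::<second>(),
            t.idle().get::<second>()
        );
    }
    Ok(())
}

fn main() -> heim::Result<()> {
    futures::executor::block_on(collect_cpu())
}
```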

We'd use one of these to collect the following metrics:

- `cpu_seconds_total` tagged with mode (idle, nice, system, user) and CPU (counter)
- `disk_read_bytes_total` tagged with the disk (counter)
- `disk_read_errors_total` tagged with the disk (counter)
- `disk_read_retries_total` tagged with the disk (counter)
- `disk_read_sectors_total` tagged with the disk (counter)
- `disk_read_time_seconds_total` tagged with the disk (counter)
- `disk_reads_completed_total` tagged with the disk (counter)
- `disk_write_errors_total` tagged with the disk (counter)
- `disk_write_retries_total` tagged with the disk (counter)
- `disk_write_time_seconds_total` tagged with the disk (counter)
- `disk_writes_completed_total` tagged with the disk (counter)
- `disk_written_bytes_total` tagged with the disk (counter)
- `disk_written_sectors_total` tagged with the disk (counter)
- `filesystem_avail_bytes` tagged with the device, filesystem type, and mountpoint (gauge)
- `filesystem_device_error` tagged with the device, filesystem type, and mountpoint (gauge)
- `filesystem_free_bytes` tagged with the device, filesystem type, and mountpoint (gauge)
- `filesystem_size_bytes` tagged with the device, filesystem type, and mountpoint (gauge)
- `filesystem_total_file_nodes` tagged with the device, filesystem type, and mountpoint (gauge)
- `filesystem_free_file_nodes` tagged with the device, filesystem type, and mountpoint (gauge)
- `load1` (gauge)
- `load5` (gauge)
- `load15` (gauge)
- `memory_active_bytes` (gauge)
- `memory_compressed_bytes` (gauge)
- `memory_free_bytes` (gauge)
- `memory_inactive_bytes` (gauge)
- `memory_swap_total_bytes` (gauge)
- `memory_swap_used_bytes` (gauge)
- `memory_swapped_in_bytes_total` (gauge)
- `memory_swapped_out_bytes_total` (gauge)
- `memory_total_bytes` (gauge)
- `memory_wired_bytes` (gauge)
- `network_receive_bytes_total` tagged with device (counter)
- `network_receive_errs_total` tagged with device (counter)
- `network_receive_multicast_total` tagged with device (counter)
- `network_receive_packets_total` tagged with device (counter)
- `network_transmit_bytes_total` tagged with device (counter)
- `network_transmit_errs_total` tagged with device (counter)
- `network_transmit_multicast_total` tagged with device (counter)
- `network_transmit_packets_total` tagged with device (counter)

Users should be able to limit the collection of metrics to specific classes: `cpu`, `memory`, `disk`, `filesystem`, `load`, and `network`.

Metrics will also be tagged with:

- `host`: the hostname of the monitored host.

And a `collector` tag for the type of metric:

- `disk` for disk-based metrics.
- `cpu` for CPU-based metrics.
- `filesystem` for filesystem-based metrics.
- `memory` for memory-based metrics.
- `load` for load-based metrics.
- `network` for network-based metrics.

Specific explanation of some of the filesystem metrics:

- `filesystem_avail_bytes` = Filesystem space available to non-root users in bytes (excluding reserved blocks).
- `filesystem_free_bytes` = Filesystem free space in bytes (including reserved blocks).
- `filesystem_size_bytes` = Filesystem size in bytes.

For example, on an ext4 filesystem that reserves the default 5% of blocks for root, `filesystem_free_bytes` exceeds `filesystem_avail_bytes` by that reserved amount.

## Doc-level Proposal

The following additional source configuration will be added:

```toml
[sources.my_source_id]
type = "host_metrics" # required
collectors = [ "all" ] # optional, defaults to collecting all metrics
filesystem.mountpoints = [ "*" ] # optional, defaults to all mountpoints
disk.devices = [ "*" ] # optional, defaults to all disk devices
network.devices = [ "*" ] # optional, defaults to all network devices
scrape_interval_secs = 15 # optional, default shown, in seconds
namespace = "host" # optional, defaults to "host"; namespace to put metrics under
```
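
Internally, this configuration might deserialize into something like the following. This is a hedged sketch: the struct and field names are hypothetical and sidestep Vector's actual configuration machinery (assumed dependencies: `serde = { version = "1", features = ["derive"] }`, `toml = "0.5"`).

```rust
use serde::Deserialize;

/// Hypothetical shape of the `host_metrics` source configuration.
#[derive(Debug, Deserialize)]
struct HostMetricsConfig {
    #[serde(default = "default_collectors")]
    collectors: Vec<String>,
    #[serde(default)]
    filesystem: FilterConfig,
    #[serde(default)]
    disk: FilterConfig,
    #[serde(default)]
    network: FilterConfig,
    #[serde(default = "default_scrape_interval_secs")]
    scrape_interval_secs: u64,
    #[serde(default = "default_namespace")]
    namespace: String,
}

/// `mountpoints` applies to the filesystem collector, `devices` to the
/// disk/network collectors; a real implementation would likely use a
/// distinct type per collector.
#[derive(Debug, Default, Deserialize)]
struct FilterConfig {
    mountpoints: Option<Vec<String>>,
    devices: Option<Vec<String>>,
}

fn default_collectors() -> Vec<String> { vec!["all".into()] }
fn default_scrape_interval_secs() -> u64 { 15 }
fn default_namespace() -> String { "host".into() }

fn main() {
    // Omitted fields fall back to the documented defaults.
    let config: HostMetricsConfig = toml::from_str(
        r#"
        collectors = ["cpu", "memory", "network"]
        network.devices = ["eth*"]
        "#,
    )
    .expect("valid config");
    println!("{:?}", config);
}
```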

For `collectors` (name open to discussion) we'd specify an array of metric collectors:

```toml
collectors = [ "cpu", "memory", "network" ]
```

For disk and network devices and filesystem mountpoints, the default is to collect for all (`"*"`) devices and mountpoints. Alternatively, you can configure Vector to collect only from specific devices, for example:

```toml
network.devices = [ "eth0" ]
```

Or, if we think it's feasible, using globbing like so:

```toml
network.devices = [ "eth*" ]
```

And, if feasible, we'd add syntax for excluding resources, as sketched below.
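
Both globbing and exclusion look feasible with an off-the-shelf matcher. Below is a hedged sketch using the `glob` crate's `Pattern` type (assumed dependency: `glob = "0.3"`); the `!`-prefix exclusion syntax is a hypothetical choice for illustration, not a settled design.

```rust
use glob::Pattern;

/// Hypothetical semantics: patterns prefixed with `!` exclude; a device is
/// collected if it matches at least one include pattern and no exclude
/// pattern.
fn device_matches(patterns: &[&str], device: &str) -> bool {
    let mut included = false;
    for p in patterns {
        if let Some(excluded) = p.strip_prefix('!') {
            if Pattern::new(excluded).map_or(false, |pat| pat.matches(device)) {
                return false;
            }
        } else if Pattern::new(p).map_or(false, |pat| pat.matches(device)) {
            included = true;
        }
    }
    included
}

fn main() {
    let patterns = ["eth*", "!eth1"];
    assert!(device_matches(&patterns, "eth0"));
    assert!(!device_matches(&patterns, "eth1"));
    assert!(!device_matches(&patterns, "lo"));
}
```

The same matcher would apply to filesystem mountpoints.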

We'd also add a guide for doing this on Docker (similar to https://github.com/prometheus/node_exporter#using-docker).

## Rationale

CPU, memory, disk, and network are the basic building blocks of host-based monitoring. They are considered "table stakes" for most metrics-based monitoring of hosts. Additionally, if we do not support collecting these metrics, it is likely to push people to use another tool.

As part of Vector's vision to be the "one tool" for ingesting and shipping
observability data, it makes sense to add as many sources as possible to reduce
the likelihood that a user will not be able to ingest metrics from their tools.

## Prior Art

- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/cpu
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/mem
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/disk
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/net
- https://github.com/elastic/beats/tree/master/metricbeat
- https://github.com/prometheus/node_exporter
- https://docs.fluentbit.io/manual/pipeline/inputs/cpu-metrics
- https://docs.fluentbit.io/manual/pipeline/inputs/disk-io-metrics
- https://docs.fluentbit.io/manual/pipeline/inputs/memory-metrics
- https://docs.fluentbit.io/manual/pipeline/inputs/network-io-metrics

## Drawbacks

- Additional maintenance and integration testing burden of a new source

## Alternatives

### Having users run Telegraf or Prometheus' node exporter and using Vector's `prometheus` source to scrape it

Instead of adding the source directly to Vector, we could instruct users to run
Telegraf or Prometheus' node exporter and point Vector at the exposed Prometheus scrape endpoint. This would leverage the already supported inputs from those projects.

I decided against this as it would be at odds with one of Vector's listed
principles:

> One Tool. All Data. - One simple tool gets your logs, metrics, and traces
> (coming soon) from A to B.

([Vector principles](https://vector.dev/docs/about/what-is-vector/#who-should-use-vector))

On the same page, it is mentioned that Vector should be a replacement for
Telegraf.

> You SHOULD use Vector to replace Logstash, Fluent*, Telegraf, Beats, or
> similar tools.

If users are already running Telegraf or Node Exporter, though, they could opt for this path.

## Outstanding Questions

- One source or many? Should we have `host_metrics` or `cpu_metrics`, `mem_metrics`, `disk_metrics`, or `load_metrics`?

## Plan Of Attack

Incremental steps that execute this change. Generally this is in the form of:

- [ ] Submit a PR with the initial source implementation

## Future Work

- Extend source to collect additional system-level metrics.
- Identify additional potential platforms.
