feat(writeData): adding intel_rdt and ras telegraf plugins #19746

Merged
merged 3 commits into from
Oct 14, 2020
9 changes: 8 additions & 1 deletion ui/src/writeData/components/telegrafPlugins/cloudwatch.md
@@ -58,6 +58,13 @@ API endpoint. In the following order the plugin will attempt to authenticate.
## gaps or overlap in pulled data
interval = "5m"

## Recommended if "delay" and "period" are both within 3 hours of request time. Invalid values will be ignored.
## Recently Active feature will only poll for CloudWatch ListMetrics values that occurred within the last 3 Hours.
## If enabled, it will reduce total API usage of the CloudWatch ListMetrics API and require less memory to retain.
## Do not enable if "period" or "delay" is longer than 3 hours, as it will not return data more than 3 hours old.
## See https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_ListMetrics.html
#recently_active = "PT3H"

## Configure the TTL for the internal cache of metrics.
# cache_ttl = "1h"

@@ -150,7 +157,7 @@ To maximize efficiency and savings, consider making fewer requests by increasing

### Measurements & Fields:

Each CloudWatch Namespace monitored records a measurement with fields for each available Metric Statistic
Each CloudWatch Namespace monitored records a measurement with fields for each available Metric Statistic.
Namespace and Metrics are represented in [snake case](https://en.wikipedia.org/wiki/Snake_case)

- cloudwatch_{namespace}
26 changes: 24 additions & 2 deletions ui/src/writeData/components/telegrafPlugins/consul.md
@@ -17,6 +17,14 @@ report those stats already using StatsD protocol if needed.
## URI scheme for the Consul server, one of "http", "https"
# scheme = "http"

## Metric version controls the mapping from Consul metrics into
## Telegraf metrics. Version 2 moved all fields with string values
## to tags.
##
## example: metric_version = 1; deprecated in 1.16
## metric_version = 2; recommended version
# metric_version = 1

## ACL token used in every request
# token = ""

@@ -41,7 +49,7 @@ report those stats already using StatsD protocol if needed.
```

### Metrics:

##### metric_version = 1:
- consul_health_checks
- tags:
- node (node that check/service is registered on)
@@ -55,9 +63,23 @@ report those stats already using StatsD protocol if needed.
- critical (integer)
- warning (integer)

##### metric_version = 2:
- consul_health_checks
- tags:
- node (node that check/service is registered on)
- service_name
- check_id
- check_name
- service_id
- status
- fields:
- passing (integer)
- critical (integer)
- warning (integer)

`passing`, `critical`, and `warning` are integer representations of the health
check state. A value of `1` represents that the status was the state of the
the health check at this sample.
health check at this sample. `status` is a string representation of the same state.
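
The move from `metric_version = 1` to `metric_version = 2` can be pictured with a minimal Python sketch. This is not Telegraf's implementation: the helper name `to_metric_version_2`, the sample values, and the exact version 1 tag/field split are assumptions made for illustration.

```python
# Hypothetical helper, not part of Telegraf: it only illustrates the rule
# "version 2 moved all fields with string values to tags".
def to_metric_version_2(tags, fields):
    """Return (tags, fields) with every string-valued field moved into tags."""
    new_tags = dict(tags)
    new_fields = {}
    for key, value in fields.items():
        if isinstance(value, str):
            new_tags[key] = value      # string values become tags
        else:
            new_fields[key] = value    # numeric values stay fields
    return new_tags, new_fields


# Made-up sample point; the version 1 layout shown here is an assumption.
v1_tags = {"node": "localhost", "service_name": "web", "check_id": "service:web"}
v1_fields = {
    "check_name": "web check",
    "service_id": "web-1",
    "status": "passing",
    "passing": 1,
    "critical": 0,
    "warning": 0,
}

v2_tags, v2_fields = to_metric_version_2(v1_tags, v1_fields)
```

After the conversion only the integer counters remain as fields, which matches the `metric_version = 2` layout listed above.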

## Example output

2 changes: 1 addition & 1 deletion ui/src/writeData/components/telegrafPlugins/exec.md
@@ -1,6 +1,6 @@
# Exec Input Plugin

The `exec` plugin executes the `commands` on every interval and parses metrics from
The `exec` plugin executes all the `commands` in parallel on every interval and parses metrics from
their output in any one of the accepted [Input Data Formats](https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md).

This plugin can be used to poll for custom metrics from any source.
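
As a rough illustration of running all commands in parallel and collecting their output, here is a small Python sketch. It is not how the plugin is implemented; `run_all` and the two sample commands are hypothetical, and the printed strings merely stand in for metrics in an accepted input format.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands):
    """Run every command concurrently and return each stdout, in input order."""
    def run(cmd):
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout

    # One worker per command so that no command waits for another to finish.
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(run, commands))


# Hypothetical commands; each prints one line that a parser could consume.
commands = [
    [sys.executable, "-c", "print('cpu_custom value=1')"],
    [sys.executable, "-c", "print('mem_custom value=2')"],
]
outputs = run_all(commands)
```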
8 changes: 3 additions & 5 deletions ui/src/writeData/components/telegrafPlugins/http_response.md
@@ -7,9 +7,7 @@ This input plugin checks HTTP/HTTPS connections.
```toml
# HTTP/HTTPS request given an address a method and a timeout
[[inputs.http_response]]
## Deprecated in 1.12, use 'urls'
## Server address (default http://localhost)
# address = "http://localhost"
## address is Deprecated in 1.12, use 'urls'

## List of urls to query.
# urls = ["http://localhost"]
@@ -39,8 +37,8 @@ This input plugin checks HTTP/HTTPS connections.
# {'fake':'data'}
# '''

## Optional name of the field that will contain the body of the response.
## By default it is set to an empty String indicating that the body's content won't be added
## Optional name of the field that will contain the body of the response.
## By default it is set to an empty String indicating that the body's content won't be added
# response_body_field = ''

## Maximum allowed HTTP response body size in bytes.
Original file line number Diff line number Diff line change
@@ -9,12 +9,15 @@ The `/api/v2/write` endpoint supports the `precision` query parameter and can be
to one of `ns`, `us`, `ms`, `s`. All other parameters are ignored and
defer to the output plugins configuration.
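
A client writing to the listener might build its request URL as in the following Python sketch. The helper `write_url` and the base URL `http://localhost:8086` are assumptions for illustration; only the `/api/v2/write` path and the four precision values come from the text above.

```python
from urllib.parse import urlencode

# Precision values accepted by the endpoint, per the description above.
VALID_PRECISIONS = ("ns", "us", "ms", "s")


def write_url(base_url, precision="ns"):
    """Build the /api/v2/write URL, rejecting unsupported precision values."""
    if precision not in VALID_PRECISIONS:
        raise ValueError(f"precision must be one of {VALID_PRECISIONS}")
    return f"{base_url}/api/v2/write?{urlencode({'precision': precision})}"


# Assumed listener address; adjust to match your service_address setting.
url = write_url("http://localhost:8086", "ms")
```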

Telegraf minimum version: Telegraf 1.16.0

### Configuration:

```toml
[[inputs.influxdb_v2_listener]]
## Address and port to host InfluxDB listener on
service_address = ":9999"
## (Double check the port. Could be 9999 if using OSS Beta)
service_address = ":8086"

## Maximum allowed HTTP request body size in bytes.
## 0 means to use the default of 32MiB.
108 changes: 108 additions & 0 deletions ui/src/writeData/components/telegrafPlugins/intel_rdt.md
@@ -0,0 +1,108 @@
# Intel RDT Input Plugin
The `intel_rdt` plugin collects information provided by the monitoring features of
Intel Resource Director Technology (Intel(R) RDT). Cache Monitoring Technology (CMT),
Memory Bandwidth Monitoring (MBM), Cache Allocation Technology (CAT), and Code
and Data Prioritization (CDP) provide the hardware framework to monitor
and control the utilization of shared resources such as last level cache and memory bandwidth.
Together, these technologies comprise Intel's Resource Director Technology (RDT).
As multithreaded and multicore platform architectures emerge and run workloads in
single-threaded, multithreaded, or complex virtual machine environments,
the last level cache and memory bandwidth become key resources to manage. Intel introduced CMT,
MBM, CAT, and CDP to manage these workloads across shared resources.

To gather Intel RDT metrics, the plugin uses the _pqos_ CLI tool, which is part of the [Intel(R) RDT Software Package](https://github.com/intel/intel-cmt-cat).
Before using this plugin, make sure _pqos_ is properly installed and configured, because the plugin
runs _pqos_ in `OS Interface` mode. This plugin supports _pqos_ version 4.0.0 and above.
Note that the _pqos_ tool needs root privileges to work properly.

Metrics will be constantly reported from the following `pqos` commands within the given interval:

#### In case of cores monitoring:
```
pqos -r --iface-os --mon-file-type=csv --mon-interval=INTERVAL --mon-core=all:[CORES]\;mbt:[CORES]
```
where `CORES` is a group of cores provided in the config. The user can provide many groups.

#### In case of process monitoring:
```
pqos -r --iface-os --mon-file-type=csv --mon-interval=INTERVAL --mon-pid=all:[PIDS]\;mbt:[PIDS]
```
where `PIDS` is the group of process IDs whose process name matches a process name provided in the config.
The user can provide many process names, which results in many process groups.

In both cases `INTERVAL` is equal to sampling_interval from config.
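
To make the command template concrete, the following Python sketch assembles the argument list for a single core group. It is an illustration only: `pqos_core_args` is a hypothetical helper, and how the plugin combines several groups into one invocation is not shown in this document, so the sketch deliberately handles just one group.

```python
def pqos_core_args(cores, sampling_interval=10):
    """Build the pqos argument list for one core group, e.g. "0-3"."""
    return [
        "pqos",
        "-r",
        "--iface-os",
        "--mon-file-type=csv",
        f"--mon-interval={sampling_interval}",
        # The backslash before ';' in the template is only shell escaping;
        # it is not needed when arguments are passed as a list.
        f"--mon-core=all:[{cores}];mbt:[{cores}]",
    ]


args = pqos_core_args("0-3", sampling_interval=10)
```

Passing the arguments as a list (for example to `subprocess.run`) avoids shell quoting entirely, which is why the semicolon appears unescaped here.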

Because the PIDs associated with a process can change at any moment, the Intel RDT plugin
checks on every interval whether the desired processes have changed their PID association.
If a change is detected, the plugin restarts the _pqos_ tool with new arguments. If a process name
provided by the user does not match any available process, it is omitted and the plugin keeps
checking for the process to become available.

### Useful links
Pqos installation process: https://github.com/intel/intel-cmt-cat/blob/master/INSTALL
Enabling OS interface: https://github.com/intel/intel-cmt-cat/wiki, https://github.com/intel/intel-cmt-cat/wiki/resctrl
More about Intel RDT: https://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html

### Configuration
```toml
# Read Intel RDT metrics
[[inputs.intel_rdt]]
## Optionally set sampling interval to Nx100ms.
## This value is propagated to pqos tool. Interval format is defined by pqos itself.
## If not provided or provided 0, will be set to 10 = 10x100ms = 1s.
# sampling_interval = "10"

## Optionally specify the path to pqos executable.
## If not provided, auto discovery will be performed.
# pqos_path = "/usr/local/bin/pqos"

## Optionally specify if IPC and LLC_Misses metrics shouldn't be propagated.
## If not provided, default value is false.
# shortened_metrics = false

## Specify the list of groups of CPU core(s) to be provided as pqos input.
## Mandatory if processes aren't set and forbidden if processes are specified.
## e.g. ["0-3", "4,5,6"] or ["1-3,4"]
# cores = ["0-3"]

## Specify the list of processes for which Metrics will be collected.
## Mandatory if cores aren't set and forbidden if cores are specified.
## e.g. ["qemu", "pmd"]
# processes = ["process"]
```
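
The interval arithmetic and the cores/processes rule from the comments above can be sketched in Python as follows. Both helpers (`effective_interval_seconds`, `validate_targets`) are hypothetical names used for illustration and are not part of the plugin.

```python
def effective_interval_seconds(sampling_interval=""):
    """Convert an Nx100ms sampling interval into seconds.

    An empty value or 0 falls back to 10, i.e. 10 x 100 ms = 1 s.
    """
    n = int(sampling_interval) if sampling_interval else 0
    if n == 0:
        n = 10
    return n / 10  # each unit is 100 ms


def validate_targets(cores=None, processes=None):
    """Require exactly one of `cores` or `processes` to be set."""
    if bool(cores) == bool(processes):
        raise ValueError("set either cores or processes, but not both")
    return "cores" if cores else "processes"


default_interval = effective_interval_seconds("")    # falls back to 10 units
custom_interval = effective_interval_seconds("25")   # 25 x 100 ms
mode = validate_targets(cores=["0-3"])
```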

### Exposed metrics
| Name | Full name | Description |
|---------------|-----------------------------------------------|-------------|
| MBL | Memory Bandwidth on Local NUMA Node | Memory bandwidth utilization by the relevant CPU core/process on the local NUMA memory channel |
| MBR | Memory Bandwidth on Remote NUMA Node | Memory bandwidth utilization by the relevant CPU core/process on the remote NUMA memory channel |
| MBT | Total Memory Bandwidth | Total memory bandwidth utilized by a CPU core/process on local and remote NUMA memory channels |
| LLC | L3 Cache Occupancy | Total Last Level Cache occupancy by a CPU core/process |
| *LLC_Misses | L3 Cache Misses | Total Last Level Cache misses by a CPU core/process |
| *IPC | Instructions Per Cycle | Total instructions per cycle executed by a CPU core/process |

*optional

### Troubleshooting
Pointing to a non-existent core causes _pqos_ to throw an error, and the plugin will not work properly.
Be sure to check that the provided core number exists on the target system.

Note that Intel RDT metrics cannot be read by _pqos_ simultaneously on the same resource.
Do not run any other _pqos_ instance that monitors the same cores or PIDs on the system.
It is also not possible to monitor the same cores or PIDs in different groups.

The PIDs associated with a given process can be checked manually with the `pidof` command, e.g.:
```
pidof PROCESS
```
where `PROCESS` is the process name.

### Example Output
```
> rdt_metric,cores=12\,19,host=r2-compute-20,name=IPC,process=top value=0 1598962030000000000
> rdt_metric,cores=12\,19,host=r2-compute-20,name=LLC_Misses,process=top value=0 1598962030000000000
> rdt_metric,cores=12\,19,host=r2-compute-20,name=LLC,process=top value=0 1598962030000000000
> rdt_metric,cores=12\,19,host=r2-compute-20,name=MBL,process=top value=0 1598962030000000000
> rdt_metric,cores=12\,19,host=r2-compute-20,name=MBR,process=top value=0 1598962030000000000
> rdt_metric,cores=12\,19,host=r2-compute-20,name=MBT,process=top value=0 1598962030000000000
```
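
Note that the `cores` tag value contains an escaped comma (`12\,19`). The Python sketch below shows one way to split such a line while respecting that escape. It is a simplified reader for lines shaped like the example output, not a full line protocol parser, and `parse_rdt_line` is a hypothetical name.

```python
import re


def parse_rdt_line(line):
    """Split one example-output line into measurement, tags, field, timestamp."""
    if line.startswith("> "):
        line = line[2:]
    # Simplification: assumes no spaces inside tag or field values.
    head, field, timestamp = line.split(" ")
    # Split on commas that are not preceded by a backslash.
    parts = re.split(r"(?<!\\),", head)
    measurement, tag_parts = parts[0], parts[1:]
    tags = {}
    for part in tag_parts:
        key, value = part.split("=", 1)
        tags[key] = value.replace("\\,", ",")  # unescape the comma
    field_key, field_value = field.split("=", 1)
    return measurement, tags, {field_key: field_value}, int(timestamp)


sample = r"> rdt_metric,cores=12\,19,host=r2-compute-20,name=IPC,process=top value=0 1598962030000000000"
measurement, tags, fields, timestamp = parse_rdt_line(sample)
```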
2 changes: 2 additions & 0 deletions ui/src/writeData/components/telegrafPlugins/proxmox.md
@@ -2,6 +2,8 @@

The proxmox plugin gathers metrics about containers and VMs using the Proxmox API.

Telegraf minimum version: Telegraf 1.16.0

### Configuration:

```toml
58 changes: 58 additions & 0 deletions ui/src/writeData/components/telegrafPlugins/ras.md
@@ -0,0 +1,58 @@
# RAS Input Plugin

The `RAS` plugin gathers and counts errors provided by [RASDaemon](https://github.com/mchehab/rasdaemon).

### Configuration

```toml
[[inputs.ras]]
## Optional path to RASDaemon sqlite3 database.
## Default: /var/lib/rasdaemon/ras-mc_event.db
# db_path = ""
```

In addition, `RASDaemon` runs with the `--enable-sqlite3` flag by default. If you encounter problems with the SQLite3 database, verify that this is still the default option.

### Metrics

- ras
- tags:
- socket_id
- fields:
- memory_read_corrected_errors
- memory_read_uncorrectable_errors
- memory_write_corrected_errors
- memory_write_uncorrectable_errors
- cache_l0_l1_errors
- tlb_instruction_errors
- cache_l2_errors
- upi_errors
- processor_base_errors
- processor_bus_errors
- internal_timer_errors
- smm_handler_code_access_violation_errors
- internal_parity_errors
- frc_errors
- external_mce_errors
- microcode_rom_parity_errors
- unclassified_mce_errors

Please note that `processor_base_errors` is an aggregate counter measuring the following MCE events:
- internal_timer_errors
- smm_handler_code_access_violation_errors
- internal_parity_errors
- frc_errors
- external_mce_errors
- microcode_rom_parity_errors
- unclassified_mce_errors
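
The aggregate relationship can be written out as a short Python sketch. The counter names come from the list above; the helper `processor_base_errors` and the sample counts are made up for illustration and do not reflect the plugin's internals.

```python
# The seven MCE counters that make up processor_base_errors (from the list above).
BASE_ERROR_COUNTERS = (
    "internal_timer_errors",
    "smm_handler_code_access_violation_errors",
    "internal_parity_errors",
    "frc_errors",
    "external_mce_errors",
    "microcode_rom_parity_errors",
    "unclassified_mce_errors",
)


def processor_base_errors(counters):
    """Sum the individual MCE counters; missing counters count as zero."""
    return sum(counters.get(name, 0) for name in BASE_ERROR_COUNTERS)


# Illustrative sample: one error recorded for each of the seven counters.
sample_counters = {name: 1 for name in BASE_ERROR_COUNTERS}
total = processor_base_errors(sample_counters)
```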

### Permissions

This plugin requires access to the SQLite3 database from `RASDaemon`. Please make sure that the user has the required permissions for this database.

### Example Output

```
ras,host=ubuntu,socket_id=0 external_mce_base_errors=1i,frc_errors=1i,instruction_tlb_errors=5i,internal_parity_errors=1i,internal_timer_errors=1i,l0_and_l1_cache_errors=7i,memory_read_corrected_errors=25i,memory_read_uncorrectable_errors=0i,memory_write_corrected_errors=5i,memory_write_uncorrectable_errors=0i,microcode_rom_parity_errors=1i,processor_base_errors=7i,processor_bus_errors=1i,smm_handler_code_access_violation_errors=1i,unclassified_mce_base_errors=1i 1598867393000000000
ras,host=ubuntu level_2_cache_errors=0i,upi_errors=0i 1598867393000000000
```
5 changes: 5 additions & 0 deletions ui/src/writeData/components/telegrafPlugins/redis.md
@@ -14,6 +14,11 @@
## If no servers are specified, then localhost is used as the host.
## If no port is specified, 6379 is used
servers = ["tcp://localhost:6379"]
## Optional. Specify redis commands to retrieve values
# [[inputs.redis.commands]]
# command = ["get", "sample-key"]
# field = "sample-key-value"
# type = "string"

## specify server password
# password = "s#cr@t%"
22 changes: 11 additions & 11 deletions ui/src/writeData/components/telegrafPlugins/smart.md
@@ -7,13 +7,13 @@ SMART information is separated between different measurements: `smart_device` is

If no devices are specified, the plugin will scan for SMART devices via the following command:

```bash
```
smartctl --scan
```

Metrics will be reported from the following `smartctl` command:

```bash
```
smartctl --info --attributes --health -n <nocheck> --format=brief <device>
```

@@ -23,7 +23,7 @@ Also, NVMe capabilities were introduced in version 6.5.

To enable SMART on a storage device run:

```bash
```
smartctl -s on <device>
```
## NVMe vendor specific attributes
@@ -35,29 +35,29 @@ In case of `nvme-cli` absence NVMe vendor specific metrics will not be obtained.

Vendor specific SMART metrics for NVMe disks may be reported from the following `nvme` command:

```bash
```
nvme <vendor> smart-log-add <device>
```

Note that vendor plugins for `nvme-cli` may require a different naming convention and report format.

To see the installed plugin extensions, which depend on the nvme-cli version, look at the bottom of:
```bash
```
nvme help
```

To gather disk vendor id (vid) `id-ctrl` could be used:
```bash
```
nvme id-ctrl <device>
```
The association between a vid and a company can be found here: https://pcisig.com/membership/member-companies.

Whether a device is NVMe or non-NVMe is determined by:
```bash
```
smartctl --scan
```
and:
```bash
```
smartctl --scan -d nvme
```

@@ -203,16 +203,16 @@ If this plugin is not working as expected for your SMART enabled device,
please run these commands and include the output in a bug report:

For non NVMe devices (from smartctl version >= 7.0 this will also return NVMe devices by default):
```bash
```
smartctl --scan
```
For NVMe devices:
```bash
```
smartctl --scan -d nvme
```
Run the following command replacing your configuration setting for NOCHECK and
the DEVICE (name of the device could be taken from the previous command):
```bash
```
smartctl --info --health --attributes --tolerance=verypermissive --nocheck NOCHECK --format=brief -d DEVICE
```
If you try to gather vendor specific metrics, please provide this command
3 changes: 3 additions & 0 deletions ui/src/writeData/components/telegrafPlugins/snmp.md
@@ -35,6 +35,9 @@ information.
## SNMP community string.
# community = "public"

## Agent host tag
# agent_host_tag = "agent_host"

## Number of retries to attempt.
# retries = 3
