
feat: New input plugin for libvirt #11814

Merged
merged 7 commits into influxdata:master on Oct 12, 2022

Conversation

p-zak (Collaborator) commented Sep 15, 2022

resolves #65
resolves #70
resolves #690

This is a continuation of the work done in the following PRs:

This PR uses https://github.com/digitalocean/go-libvirt, which has become a very mature project. It is cgo-free and provides a pure Go interface for interacting with libvirt.
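For context, talking to libvirt over its local unix socket with go-libvirt looks roughly like the sketch below. This is a minimal illustration, not code from this PR; the socket path and the error handling are assumptions.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"time"

	"github.com/digitalocean/go-libvirt"
)

func main() {
	// Dial the local libvirt unix socket directly - no cgo, no C bindings.
	c, err := net.DialTimeout("unix", "/var/run/libvirt/libvirt-sock", 2*time.Second)
	if err != nil {
		log.Fatalf("failed to dial libvirt: %v", err)
	}

	l := libvirt.New(c)
	if err := l.Connect(); err != nil {
		log.Fatalf("failed to connect: %v", err)
	}
	defer l.Disconnect()

	// List the defined domains; the plugin gathers per-domain statistics for these.
	domains, err := l.Domains()
	if err != nil {
		log.Fatalf("failed to list domains: %v", err)
	}
	for _, d := range domains {
		fmt.Println(d.Name)
	}
}
```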

The plugin exposes all possible domain statistics which can be gathered from the newest versions of libvirt (>= 7.x.y); it will do its best to expose as much as possible from older versions.

List of exposed metrics:

| Statistics group | Metric name | Exposed Telegraf field | Description |
|------------------|-------------|------------------------|-------------|
| state | `state.state` | state | state of the VM, returned as number from virDomainState enum |
| | `state.reason` | reason | reason for entering given state, returned as int from virDomain*Reason enum corresponding to given state |
| cpu_total | `cpu.time` | time | total cpu time spent for this domain in nanoseconds |
| | `cpu.user` | user | user cpu time spent in nanoseconds |
| | `cpu.system` | system | system cpu time spent in nanoseconds |
| | `cpu.haltpoll.success.time` | haltpoll_success_time | cpu halt polling success time spent in nanoseconds |
| | `cpu.haltpoll.fail.time` | haltpoll_fail_time | cpu halt polling fail time spent in nanoseconds |
| | `cpu.cache.monitor.count` | count | the number of cache monitors for this domain |
| | `cpu.cache.monitor.<num>.name` | name | the name of cache monitor `<num>`, not available for kernels from 4.14 upwards |
| | `cpu.cache.monitor.<num>.vcpus` | vcpus | vcpu list of cache monitor `<num>`, not available for kernels from 4.14 upwards |
| | `cpu.cache.monitor.<num>.bank.count` | bank_count | the number of cache banks in cache monitor `<num>`, not available for kernels from 4.14 upwards |
| | `cpu.cache.monitor.<num>.bank.<index>.id` | id | host allocated cache id for bank `<index>` in cache monitor `<num>`, not available for kernels from 4.14 upwards |
| | `cpu.cache.monitor.<num>.bank.<index>.bytes` | bytes | the number of bytes of last level cache that the domain is using on cache bank `<index>`, not available for kernels from 4.14 upwards |
| balloon | `balloon.current` | current | the memory in KiB currently used |
| | `balloon.maximum` | maximum | the maximum memory in KiB allowed |
| | `balloon.swap_in` | swap_in | the amount of data read from swap space (in KiB) |
| | `balloon.swap_out` | swap_out | the amount of memory written out to swap space (in KiB) |
| | `balloon.major_fault` | major_fault | the number of page faults when disk IO was required |
| | `balloon.minor_fault` | minor_fault | the number of other page faults |
| | `balloon.unused` | unused | the amount of memory left unused by the system (in KiB) |
| | `balloon.available` | available | the amount of usable memory as seen by the domain (in KiB) |
| | `balloon.rss` | rss | Resident Set Size of running domain's process (in KiB) |
| | `balloon.usable` | usable | the amount of memory which can be reclaimed by balloon without causing host swapping (in KiB) |
| | `balloon.last-update` | last_update | timestamp of the last update of statistics (in seconds) |
| | `balloon.disk_caches` | disk_caches | the amount of memory that can be reclaimed without additional I/O, typically disk caches (in KiB) |
| | `balloon.hugetlb_pgalloc` | hugetlb_pgalloc | the number of successful huge page allocations from inside the domain via virtio balloon |
| | `balloon.hugetlb_pgfail` | hugetlb_pgfail | the number of failed huge page allocations from inside the domain via virtio balloon |
| vcpu | `vcpu.current` | current | current number of online virtual CPUs |
| | `vcpu.maximum` | maximum | maximum number of online virtual CPUs |
| | `vcpu.<num>.state` | state | state of the virtual CPU `<num>`, as number from virVcpuState enum |
| | `vcpu.<num>.time` | time | virtual cpu time spent by virtual CPU `<num>` (in microseconds) |
| | `vcpu.<num>.wait` | wait | virtual cpu time spent by virtual CPU `<num>` waiting on I/O (in microseconds) |
| | `vcpu.<num>.halted` | halted | virtual CPU `<num>` is halted: yes or no (may indicate the processor is idle or even disabled, depending on the architecture) |
| | `vcpu.<num>.halted` | halted_i | virtual CPU `<num>` is halted: 1 (for "yes") or 0 (for other values) (may indicate the processor is idle or even disabled, depending on the architecture) |
| | `vcpu.<num>.delay` | delay | time the vCPU `<num>` thread was enqueued by the host scheduler, but was waiting in the queue instead of running; exposed to the VM as steal time |
| | --- | cpu_id | information about mapping vcpu_id to cpu_id (id of physical cpu); should only be exposed when statistics_group contains vcpu and additional_statistics contains vcpu_mapping in the config (see the configuration sketch below the tables) |
| interface | `net.count` | count | number of network interfaces on this domain |
| | `net.<num>.name` | name | name of the interface `<num>` |
| | `net.<num>.rx.bytes` | rx_bytes | number of bytes received |
| | `net.<num>.rx.pkts` | rx_pkts | number of packets received |
| | `net.<num>.rx.errs` | rx_errs | number of receive errors |
| | `net.<num>.rx.drop` | rx_drop | number of receive packets dropped |
| | `net.<num>.tx.bytes` | tx_bytes | number of bytes transmitted |
| | `net.<num>.tx.pkts` | tx_pkts | number of packets transmitted |
| | `net.<num>.tx.errs` | tx_errs | number of transmission errors |
| | `net.<num>.tx.drop` | tx_drop | number of transmit packets dropped |
| perf | `perf.cmt` | cmt | the cache usage in bytes currently used, not available for kernels from 4.14 upwards |
| | `perf.mbmt` | mbmt | total system bandwidth from one level of cache, not available for kernels from 4.14 upwards |
| | `perf.mbml` | mbml | bandwidth of memory traffic for a memory controller, not available for kernels from 4.14 upwards |
| | `perf.cpu_cycles` | cpu_cycles | the count of cpu cycles (total/elapsed) |
| | `perf.instructions` | instructions | the count of instructions |
| | `perf.cache_references` | cache_references | the count of cache hits |
| | `perf.cache_misses` | cache_misses | the count of cache misses |
| | `perf.branch_instructions` | branch_instructions | the count of branch instructions |
| | `perf.branch_misses` | branch_misses | the count of branch misses |
| | `perf.bus_cycles` | bus_cycles | the count of bus cycles |
| | `perf.stalled_cycles_frontend` | stalled_cycles_frontend | the count of stalled frontend cpu cycles |
| | `perf.stalled_cycles_backend` | stalled_cycles_backend | the count of stalled backend cpu cycles |
| | `perf.ref_cpu_cycles` | ref_cpu_cycles | the count of ref cpu cycles |
| | `perf.cpu_clock` | cpu_clock | the count of cpu clock time |
| | `perf.task_clock` | task_clock | the count of task clock time |
| | `perf.page_faults` | page_faults | the count of page faults |
| | `perf.context_switches` | context_switches | the count of context switches |
| | `perf.cpu_migrations` | cpu_migrations | the count of cpu migrations |
| | `perf.page_faults_min` | page_faults_min | the count of minor page faults |
| | `perf.page_faults_maj` | page_faults_maj | the count of major page faults |
| | `perf.alignment_faults` | alignment_faults | the count of alignment faults |
| | `perf.emulation_faults` | emulation_faults | the count of emulation faults |
| block | `block.count` | count | number of block devices being listed |
| | `block.<num>.name` | name | name of the target of the block device `<num>` (the same name for multiple entries if --backing is present) |
| | `block.<num>.backingIndex` | backingIndex | when --backing is present, matches up with the `<backingStore>` index listed in domain XML for backing files |
| | `block.<num>.path` | path | file source of block device `<num>`, if it is a local file or block device |
| | `block.<num>.rd.reqs` | rd_reqs | number of read requests |
| | `block.<num>.rd.bytes` | rd_bytes | number of read bytes |
| | `block.<num>.rd.times` | rd_times | total time (ns) spent on reads |
| | `block.<num>.wr.reqs` | wr_reqs | number of write requests |
| | `block.<num>.wr.bytes` | wr_bytes | number of written bytes |
| | `block.<num>.wr.times` | wr_times | total time (ns) spent on writes |
| | `block.<num>.fl.reqs` | fl_reqs | total flush requests |
| | `block.<num>.fl.times` | fl_times | total time (ns) spent on cache flushing |
| | `block.<num>.errors` | errors | Xen only: the 'oo_req' value |
| | `block.<num>.allocation` | allocation | offset of the highest written sector in bytes |
| | `block.<num>.capacity` | capacity | logical size of source file in bytes |
| | `block.<num>.physical` | physical | physical size of source file in bytes |
| | `block.<num>.threshold` | threshold | threshold (in bytes) for delivering the VIR_DOMAIN_EVENT_ID_BLOCK_THRESHOLD event; see domblkthreshold |
| iothread | `iothread.count` | count | maximum number of IOThreads in the subsequent list as unsigned int; each IOThread in the list will use its iothread_id value as the `<id>`; there may be fewer `<id>` entries than the iothread.count value if the polling values are not supported |
| | `iothread.<id>.poll-max-ns` | poll_max_ns | maximum polling time in nanoseconds used by the `<id>` IOThread; a value of 0 (zero) indicates polling is disabled |
| | `iothread.<id>.poll-grow` | poll_grow | polling time grow value; a value of 0 (zero) indicates growth is managed by the hypervisor |
| | `iothread.<id>.poll-shrink` | poll_shrink | polling time shrink value; a value of 0 (zero) indicates shrink is managed by the hypervisor |
| memory | `memory.bandwidth.monitor.count` | count | the number of memory bandwidth monitors for this domain, not available for kernels from 4.14 upwards |
| | `memory.bandwidth.monitor.<num>.name` | name | the name of monitor `<num>`, not available for kernels from 4.14 upwards |
| | `memory.bandwidth.monitor.<num>.vcpus` | vcpus | the vcpu list of monitor `<num>`, not available for kernels from 4.14 upwards |
| | `memory.bandwidth.monitor.<num>.node.count` | node_count | the number of memory controllers in monitor `<num>`, not available for kernels from 4.14 upwards |
| | `memory.bandwidth.monitor.<num>.node.<index>.id` | id | host allocated memory controller id for controller `<index>` of monitor `<num>`, not available for kernels from 4.14 upwards |
| | `memory.bandwidth.monitor.<num>.node.<index>.bytes.local` | bytes_local | the accumulative bytes consumed by @vcpus that pass through the memory controller in the same processor that the scheduled host CPU belongs to, not available for kernels from 4.14 upwards |
| | `memory.bandwidth.monitor.<num>.node.<index>.bytes.total` | bytes_total | the total bytes consumed by @vcpus that pass through all memory controllers, either local or remote, not available for kernels from 4.14 upwards |
| dirtyrate | `dirtyrate.calc_status` | calc_status | the status of the last memory dirty rate calculation, returned as number from virDomainDirtyRateStatus enum |
| | `dirtyrate.calc_start_time` | calc_start_time | the start time of the last memory dirty rate calculation |
| | `dirtyrate.calc_period` | calc_period | the period of the last memory dirty rate calculation |
| | `dirtyrate.megabytes_per_second` | megabytes_per_second | the calculated memory dirty rate in MiB/s |
| | `dirtyrate.calc_mode` | calc_mode | the calculation mode used for the last measurement (page-sampling/dirty-bitmap/dirty-ring) |
| | `dirtyrate.vcpu.<num>.megabytes_per_second` | megabytes_per_second | the calculated memory dirty rate for a virtual cpu in MiB/s |

And additional statistics:

| Statistics group | Exposed Telegraf tag | Exposed Telegraf field | Description |
|------------------|----------------------|------------------------|-------------|
| vcpu_mapping | vcpu_id | --- | ID of Virtual CPU |
| | --- | cpu_id | Comma separated list (exposed as a string) of Physical CPU IDs |
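
For reference, a configuration along these lines selects the statistics groups from the first table plus the additional vcpu_mapping statistic. This is only a sketch: the option names used here (`libvirt_uri`, `domains`, `statistics_groups`, `additional_statistics`) should be checked against plugins/inputs/libvirt/README.md in this PR, which is the authoritative reference.

```toml
# Configuration sketch - consult plugins/inputs/libvirt/README.md for the
# exact option names and defaults; the values below are illustrative only.
[[inputs.libvirt]]
  ## URI of the libvirt instance to connect to
  libvirt_uri = "qemu:///system"

  ## Domains to monitor; an empty list means all defined domains
  # domains = []

  ## Statistics groups to gather (see the table above)
  statistics_groups = ["state", "cpu_total", "balloon", "vcpu", "interface", "block"]

  ## Additional statistics, e.g. the vcpu-to-physical-CPU mapping
  additional_statistics = ["vcpu_mapping"]
```

With a setup like this, the emitted metrics would hypothetically look something like `libvirt_balloon,domain_name=ubuntu22,host=node1 current=4193280,maximum=4194304 ...` in line protocol, with one measurement per statistics group (the exact measurement and tag names are defined by the plugin, not by this sketch).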

telegraf-tiger bot added the feat label on Sep 15, 2022
p-zak added the new plugin and plugin/input labels on Sep 15, 2022
reimda added the ready for final review label on Sep 20, 2022
powersj (Contributor) left a comment

Thanks for driving this one!

reimda (Contributor) commented Oct 3, 2022

Hi @p-zak, this looks pretty good. Thanks!

I have a question about future metric format changes. I assume libvirt's data model doesn't change very often, but if it does, how will it affect the metric format of this plugin? I don't see a hard-coded mapping in the code like the table in the description, so I assume there's a pattern mapping and a change in libvirt would change the metric format.

I would like to avoid the situation where a user starts using telegraf + inputs.libvirt with one version of libvirt, then upgrades libvirt to a version that removes or renames a field. Telegraf would then produce metrics of a slightly different format which, depending on the outputs being used, can cause write errors or query errors downstream (see the format changes doc).

p-zak (Collaborator, Author) commented Oct 3, 2022

@reimda I believe that indeed libvirt's data model doesn't change very often (that's probably why most of the metrics are in snake_case format, but there are a few in dash-case or camelCase which weren't corrected).

And that's why there is a mapping from the source metrics (from libvirt) to the metrics exposed by this plugin. You can find it in libvirt_metric_format.go. It exposes only metrics which are known (up to libvirt 8.7.0). If something changes (removal, addition, renaming), it will need to be adjusted in this plugin.
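
For illustration only, a simplified sketch of what such a name-translation step could look like (this is not the actual contents of libvirt_metric_format.go, whose table covers every statistic known up to libvirt 8.7.0):

```go
package libvirt

// translateField is a hypothetical, trimmed-down example of mapping a raw
// libvirt statistic name (e.g. "cpu.haltpoll.success.time") to the field
// name the plugin emits (e.g. "haltpoll_success_time"). Names that are not
// in the table are dropped, so a new or renamed libvirt statistic never
// reaches the output with an unreviewed format.
func translateField(raw string) (string, bool) {
	known := map[string]string{
		"cpu.time":                  "time",
		"cpu.haltpoll.success.time": "haltpoll_success_time",
		"balloon.last-update":       "last_update",
		"net.rx.bytes":              "rx_bytes",
	}
	field, ok := known[raw]
	return field, ok
}
```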

I hope that this is the approach you want to achieve? :)

reimda (Contributor) commented Oct 5, 2022

> @reimda I believe that indeed libvirt's data model doesn't change very often (that's probably why most of the metrics are in snake_case format, but there are a few in dash-case or camelCase which weren't corrected).
>
> And that's why there is a mapping from the source metrics (from libvirt) to the metrics exposed by this plugin. You can find it in libvirt_metric_format.go. It exposes only metrics which are known (up to libvirt 8.7.0). If something changes (removal, addition, renaming), it will need to be adjusted in this plugin.
>
> I hope that this is the approach you want to achieve? :)

I like how the current mapping code makes the names uniform. The only potential problem I see is that since it is pattern based, if the data from libvirt changes, it will change the metrics telegraf produces. If someone stores the metrics in a database and builds an application or dashboard that queries the database, then when the metrics change it has the potential to break queries and break the downstream application.

This is a problem that some other telegraf plugins have. We don't need to prevent it from happening here. I am ok with relying on libvirt's data model not changing often, but maybe we should put something in the docs that lets users know they need to expect changes in the metric format depending on which version of libvirt they use. What do you think? Could you add a note in readme.md?

Hipska (Contributor) left a comment

Only minor points

Comment on lines 94 to 95:

> Below the table containing a list of all metrics
> supported by the libvirt plugin is presented.
reimda (Contributor) commented:
This might be a good place to put in a note that the metric format could change in the future depending on what statistics libvirt reports.

Suggested change (replacing the two lines above):

> See the table below for a list of metrics produced by the plugin.
> The exact metric format depends on the statistics libvirt reports, which may vary depending on the version of libvirt on your system.

p-zak (Collaborator, Author) replied:

@reimda Right, changed.


reimda merged commit 94e39fa into influxdata:master on Oct 12, 2022
reimda (Contributor) commented Oct 12, 2022

Thanks Paweł!
