Add total VMs created metric #10418

Merged

machadovilaca
Member

What this PR does / why we need it:

Track the total number of VMs created by virt-controller

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note:

Add total VMs created metric
https://issues.redhat.com/browse/CNV-15536

/cc @sradco @enp0s3

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/L area/monitoring labels Sep 12, 2023
docs/metrics.md Outdated
@@ -81,6 +81,9 @@ The current available memory of the VM containers based on the rss. Type: Gauge.
### kubevirt_vm_container_free_memory_bytes_based_on_working_set_bytes
The current available memory of the VM containers based on the working set. Type: Gauge.

### kubevirt_vm_created_total
Amount of VMs created, broken down by namespace. Type: Counter.
Member

Created since when?
Since install, update, controller pod restart, phase of the moon :) …

Member Author

we can't obviously have the counter track VMs from previous versions, but other than that Prometheus takes care of summing all counter values.

maybe this is a good opportunity to think once again about more metrics metadata, release lifecycle, version when it was added, when they are deprecated, etc... (would also help with the discussion in https://groups.google.com/g/kubevirt-dev/c/7p3q5Lo71hs/m/jIJGNJpdAwAJ)

@@ -81,6 +81,9 @@ The current available memory of the VM containers based on the rss. Type: Gauge.
### kubevirt_vm_container_free_memory_bytes_based_on_working_set_bytes
The current available memory of the VM containers based on the working set. Type: Gauge.

### kubevirt_vm_created_total
Amount of VMs created, broken down by namespace. Type: Counter.

Member

And why do we need this metric?

Member Author

providers using KubeVirt (through insights) and users would benefit from this because it would ease the process of understanding the VM usage through time (with absolute values and increase/delta/etc functions to understand trends)

Member

The counter appears to be incremented whenever a VM is:

  1. Created for the first time
  2. Deleted and recreated
  3. Reconciled the first time by a new instance of virt-controller

On a system with a lot of churn there will be a big drop when virt-controller is restarted. Maybe that's not a big deal? How is that handled by consumers?

Member Author

Prometheus handles the counter restarting at 0, when virt-controller is restarted.

The only issue here would be point 3: "Reconciled the first time by a new instance of virt-controller". But with:

// VM is nil or already processed
if vm == nil || len(vm.Status.Conditions) != 0 {
	return
}

which I added in func NewVMCreated(vm *v1.VirtualMachine), I would not expect that to happen, since the VM would already have its status conditions set. Am I missing something?

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 29, 2023
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 9, 2023
@machadovilaca machadovilaca force-pushed the add-total-vm-created-metric branch 2 times, most recently from 10f6bf8 to cc992c2 Compare October 9, 2023 14:38
[]string{"namespace"},
)

vmMap = map[string]bool{}
Contributor

So we're essentially keeping state in the virt-controller here, which (correct me if I'm wrong) means that the created VM metric will start lying as soon as the virt-controller pod dies?

Member Author

this state is there to avoid double counting "concurrent" requests when the VM is first created,
because vm.Status.Conditions is not yet set in those requests (once it is set, this function is skipped, as I mentioned in #10418 (comment))
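
A minimal sketch of the guard described in this thread, using only the standard library. `VirtualMachine`, `createdTracker`, and `Record` are hypothetical stand-ins and not the PR's code: a VM that already has status conditions is treated as processed, and an in-memory set absorbs repeated reconciles of a VM whose conditions are not yet set. The set lives only in process memory, so it starts empty again after a restart.

```go
package main

import "sync"

// VirtualMachine is a hypothetical stand-in for the KubeVirt v1.VirtualMachine
// type, reduced to the fields this sketch needs.
type VirtualMachine struct {
	Namespace  string
	Name       string
	Conditions []string // stands in for vm.Status.Conditions
}

// createdTracker is a hypothetical helper that counts creations per namespace
// and remembers which VMs it has already counted.
type createdTracker struct {
	mu      sync.Mutex
	seen    map[string]bool // plays the role of vmMap in the diff
	created map[string]int  // plays the role of the counter, keyed by namespace
}

func newCreatedTracker() *createdTracker {
	return &createdTracker{seen: map[string]bool{}, created: map[string]int{}}
}

// Record counts vm at most once and reports whether the counter was incremented.
func (t *createdTracker) Record(vm *VirtualMachine) bool {
	// VM is nil or already processed.
	if vm == nil || len(vm.Conditions) != 0 {
		return false
	}
	key := vm.Namespace + "/" + vm.Name

	t.mu.Lock()
	defer t.mu.Unlock()
	if t.seen[key] {
		// A concurrent or repeated reconcile of the same new VM.
		return false
	}
	t.seen[key] = true
	t.created[vm.Namespace]++
	return true
}
```

Calling `Record` twice for the same new VM increments the namespace count once; a VM that already carries conditions is never counted.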

Contributor

I mean, on a system with existing VMs, when the virt-controller dies because of maintenance or something,
the metric would reset to 0 (lie). Is that not the case?

Member Author

That is expected with counters, Prometheus handles those counter resets

Contributor

@akalenyu akalenyu Oct 19, 2023

So we would be okay with this metric not reporting the correct amount of VMs after a restart?

Member Author

I wouldn't put it that way. We want the metric to correctly report the number of VMs created by that instance of 'virt-controller'

Contributor

So you can query it generally? That is, the sum of all instances that ever existed, including ones that died?

Contributor

@enp0s3 enp0s3 left a comment

@machadovilaca @sradco Hi. My apologies, but I find it hard to follow the motivation behind creating this metric. From my POV we can also add metrics for VM deletion and VM modification. But what is the use case we are trying to reach here? Why is it important to track the creation rate?

@machadovilaca
Member Author

machadovilaca commented Oct 30, 2023

> @machadovilaca @sradco Hi. My apologies, but I find it hard to follow the motivation behind creating this metric. From my POV we can also add metrics for VM deletion and VM modification. But what is the use case we are trying to reach here? Why is it important to track the creation rate?

@enp0s3
Currently we have no way of getting historical information on how many VMs were created in a given cluster.
Yes, we have some metrics tracking status changes, but those are only good during the Prometheus retention time; after that, those numbers are gone.
This metric would allow us to know the 'real' over-time usage of KubeVirt.

@enp0s3
Contributor

enp0s3 commented Oct 30, 2023

> > @machadovilaca @sradco Hi. My apologies, but I find it hard to follow the motivation behind creating this metric. From my POV we can also add metrics for VM deletion and VM modification. But what is the use case we are trying to reach here? Why is it important to track the creation rate?

> @enp0s3 Currently we have no way of getting historical information on how many VMs were created in a given cluster. Yes, we have some metrics tracking status changes, but those are only good during the Prometheus retention time; after that, those numbers are gone. This metric would allow us to know the 'real' over-time usage of KubeVirt.

@machadovilaca Thank you for the reply. I see this as an answer to what but not to why

@sradco
Contributor

sradco commented Oct 31, 2023

The why is as @machadovilaca mentioned:
"This metric would allow us to know the 'real' over-time usage of KubeVirt".

@xpivarc
Member

xpivarc commented Nov 8, 2023

Did we check if any existing Kubernetes metric could be used for this?

@machadovilaca
Member Author

> Did we check if any existing Kubernetes metric could be used for this?

Yes, we could track virt-launcher pods or even use the KubeVirt VM status metric. The problem with those approaches is that we would lose information from before the retention period.

@machadovilaca
Member Author

@xpivarc

Member

@xpivarc xpivarc left a comment

I wonder why we don't implement this in the admitter. The admitter will process each VM exactly once (at least this holds for creation). It seems to me we will get the wrong numbers for the metric in the current implementation.

@machadovilaca
Member Author

> I wonder why we don't implement this in the admitter. The admitter will process each VM exactly once (at least this holds for creation). It seems to me we will get the wrong numbers for the metric in the current implementation.

@xpivarc good suggestion, I updated the PR to push the metric in vms-admitter

)
)

func NewVMCreated(vm *v1.VirtualMachine) {
Member

Can we reduce the vm to namespace?

Member Author

since we are using a pointer I think we don't have any performance problems with copying the structure, and personally, I think it makes more sense that a function NewVMCreated receives a VM as an argument and not a Namespace. Even if in the future we want more labels or create new metrics in the function, it would be easier

@@ -210,6 +211,10 @@ func (admitter *VMsAdmitter) Admit(ar *admissionv1.AdmissionReview) *admissionv1
reviewResponse := admissionv1.AdmissionResponse{}
reviewResponse.Allowed = true

if ar.Request.Operation == admissionv1.Create {
metrics.NewVMCreated(&vm)
Member

I think vm might not have the namespace set. Could you check if we populate it?

Member Author

I think a resource always comes with a namespace; even if it is not set explicitly, it gets the value of the current namespace by default
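
Whether the decoded VM object carries a namespace at admission time is the open question in this exchange, and it is not settled here. One defensive option, assuming the admission request exposes its own namespace (as `admissionv1.AdmissionRequest` does through its `Namespace` field), is to fall back to that value. `requestInfo` and `metricNamespace` below are hypothetical names used only for this sketch.

```go
package main

// requestInfo is a hypothetical stand-in for the part of
// admissionv1.AdmissionRequest that this sketch needs.
type requestInfo struct {
	Namespace string // namespace the API server associates with the request
}

// metricNamespace picks the value for the "namespace" label: the namespace on
// the object if present, otherwise the namespace on the request.
func metricNamespace(objectNamespace string, req *requestInfo) string {
	if objectNamespace != "" {
		return objectNamespace
	}
	if req != nil {
		return req.Namespace
	}
	return ""
}
```

With this shape, an object decoded without a namespace is still attributed to the namespace the request was sent to.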

docs/metrics.md Outdated
@@ -66,6 +66,9 @@ The current available memory of the VM containers based on the rss. Type: Gauge.
### kubevirt_vm_container_free_memory_bytes_based_on_working_set_bytes
The current available memory of the VM containers based on the working set. Type: Gauge.

### kubevirt_vm_created_total
Amount of VMs created, broken down by namespace. Type: Counter.
Member

As @fabiand pointed out, you might want to elaborate on this.

Member Author

updated

@machadovilaca machadovilaca force-pushed the add-total-vm-created-metric branch 2 times, most recently from e9704dd to 1787ffc Compare November 21, 2023 14:02
@xpivarc
Member

xpivarc commented Nov 28, 2023

/cc @fossedihelm

Contributor

@fossedihelm fossedihelm left a comment

Thanks @machadovilaca! few comments below

pkg/monitoring/virt-api/metrics/metrics.go (resolved)
pkg/monitoring/virt-api/metrics/vm_metrics.go (resolved)
// setup monitoring
err = metrics.SetupMetrics()
if err != nil {
return
Contributor

is it correct to silently return?

Member Author

I think you are correct, added a panic

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 29, 2023
@kubevirt-bot kubevirt-bot added size/L and removed size/M needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 29, 2023
@fossedihelm
Contributor

@machadovilaca IIUC the metric is reset every time the instance of the virt-api is restarted.
This means that there is the possibility that the reported value is not correct.
From this point of view, how can a metric that can report wrong values be useful?
I mean, the reported value might not be real and, as a consequence, neither would any statistics derived from it.
WDYT?
Thanks!

@machadovilaca
Member Author

> @machadovilaca IIUC the metric is reset every time the instance of the virt-api is restarted. This means that there is the possibility that the reported value is not correct. From this point of view, how can a metric that can report wrong values be useful? I mean, the reported value might not be real and, as a consequence, neither would any statistics derived from it. WDYT? Thanks!

@fossedihelm that is the expected behavior and common in Prometheus. Since we are using a counter, which is monotonically increasing, Prometheus has mechanisms to handle a value dropping back to a lower one (with the restart of the virt-api for example, where we would start counting from zero again), what they call a 'counter reset'
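
To illustrate the reset handling referred to above, here is a simplified sketch, not Prometheus's implementation: it sums the increase over a series of scraped counter values and treats any decrease as a restart from zero. The function name `increaseWithResets` is hypothetical, and the extrapolation and timestamp handling that the real `rate()` and `increase()` functions apply are left out.

```go
package main

// increaseWithResets returns the total increase across consecutive counter
// samples. A sample lower than its predecessor is assumed to come from a
// restarted process that began counting from zero again.
func increaseWithResets(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		prev, cur := samples[i-1], samples[i]
		if cur < prev {
			// Counter went backwards: count everything since the restart.
			total += cur
			continue
		}
		total += cur - prev
	}
	return total
}
```

For the samples 0, 5, 12, 3, 7 the drop from 12 to 3 is read as a restart, giving 5 + 7 + 3 + 4 = 19. Creations that happen after the last scrape before a restart are never observed, so the result is a lower bound and not an exact count.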

@fossedihelm
Contributor

/lgtm
Thank you!

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Dec 4, 2023
@@ -210,6 +211,10 @@ func (admitter *VMsAdmitter) Admit(ar *admissionv1.AdmissionReview) *admissionv1
reviewResponse := admissionv1.AdmissionResponse{}
reviewResponse.Allowed = true

if ar.Request.Operation == admissionv1.Create {
Member

This might be a dry run, please add a check.

Member Author

added

Member

Please, check for nils as well
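
The dry-run and nil checks requested here could look roughly like the following. This is a sketch with hypothetical local types (`operation`, `admissionRequest`, `admissionReview`) that mirror the `Operation` and `DryRun` fields of `admissionv1.AdmissionRequest`; it is not the code that was merged.

```go
package main

// operation mirrors admissionv1.Operation for the purposes of this sketch.
type operation string

const (
	opCreate operation = "CREATE"
	opUpdate operation = "UPDATE"
)

// admissionRequest is a hypothetical stand-in for admissionv1.AdmissionRequest.
type admissionRequest struct {
	Operation operation
	DryRun    *bool // nil when the client did not set it
}

// admissionReview is a hypothetical stand-in for admissionv1.AdmissionReview.
type admissionReview struct {
	Request *admissionRequest
}

// shouldCountCreation reports whether the review describes a real create,
// that is, a CREATE operation that is not a dry run.
func shouldCountCreation(ar *admissionReview) bool {
	if ar == nil || ar.Request == nil {
		return false
	}
	if ar.Request.Operation != opCreate {
		return false
	}
	if ar.Request.DryRun != nil && *ar.Request.DryRun {
		return false
	}
	return true
}
```

Keeping the decision in one small predicate means the admitter only has to call the metric function when it returns true.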

@kubevirt-bot kubevirt-bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 4, 2023
Signed-off-by: João Vilaça <jvilaca@redhat.com>
@machadovilaca
Member Author

/retest

Member

@xpivarc xpivarc left a comment

/approve

@xpivarc
Member

xpivarc commented Dec 4, 2023

@fossedihelm PTAL

@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: xpivarc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 4, 2023
Contributor

@fossedihelm fossedihelm left a comment

Thanks

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Dec 4, 2023
@kubevirt-bot kubevirt-bot merged commit a4ee72f into kubevirt:main Dec 5, 2023
39 checks passed
@machadovilaca machadovilaca deleted the add-total-vm-created-metric branch December 5, 2023 10:08
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/monitoring dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L

9 participants