Negative cpu rate with measures aggregation #1044

Closed
berndbausch opened this issue Aug 2, 2019 · 19 comments

@berndbausch

Running two CPU-intensive instances on OpenStack Stein. CPU rate measures of individual instances look correct, but CPU rate measures of the aggregation of the two instances are often negative. In my mind this is impossible, as CPU is a cumulative measure, so that the rate must always be positive.
Even if not negative, rate CPU measures don't seem to be correlated to the non-rate CPU measures.
I am sure I am missing something.

Which version of Gnocchi are you using

4.3.1.dev38
Installed by Stein Devstack.

How to reproduce your problem

On a stable/stein Devstack, I configure this default archive policy in gnocchi_resources.yaml:

  - name: ceilometer-medium-rate
    aggregation_methods:
      - mean
      - rate:mean
    back_window: 0
    definition:
      - granularity: 1 minute
        timespan: 7 days
      - granularity: 1 hour
        timespan: 365 days

Ceilometer adds this policy to Gnocchi, as expected:

$ gnocchi archive-policy list
| ceilometer-medium-rate |           0 | - points: 10080, granularity: 0:01:00, timespan: 7 days, 0:00:00      | rate:mean, mean                 |

and gnocchi metric list confirms that all metrics use the ceilometer-medium-rate policy.

I run two CPU-intensive instances:

openstack server create --property metering.server_group=myapp ... cpu-user1
openstack server create --property metering.server_group=myapp ... cpu-user2

What is the result that you get

$ gnocchi measures aggregation --query server_group=myapp --resource-type instance --aggregation mean --metric cpu
+---------------------------+-------------+---------------+
| timestamp                 | granularity |         value |
+---------------------------+-------------+---------------+
| 2019-08-02T15:13:00+09:00 |        60.0 | 19995000000.0 |
| 2019-08-02T15:14:00+09:00 |        60.0 | 46495000000.0 |
| 2019-08-02T15:15:00+09:00 |        60.0 | 62710000000.0 |
| 2019-08-02T15:16:00+09:00 |        60.0 | 87570000000.0 |
| 2019-08-02T15:17:00+09:00 |        60.0 |   1.16445e+11 |
| 2019-08-02T15:18:00+09:00 |        60.0 |   1.40075e+11 |
| 2019-08-02T15:19:00+09:00 |        60.0 |    1.4856e+11 |

$ gnocchi measures aggregation --query server_group=myapp --resource-type instance --aggregation rate:mean --metric cpu
+---------------------------+-------------+----------------+
| timestamp                 | granularity |          value |
+---------------------------+-------------+----------------+
| 2019-08-02T15:15:00+09:00 |        60.0 | -10285000000.0 |
| 2019-08-02T15:16:00+09:00 |        60.0 |   8645000000.0 |
| 2019-08-02T15:17:00+09:00 |        60.0 |   4015000000.0 |
| 2019-08-02T15:18:00+09:00 |        60.0 |  -5245000000.0 |
| 2019-08-02T15:19:00+09:00 |        60.0 | -15145000000.0 |

What is result that you expected

Positive rate values. Also, the difference between the 15:16 and 15:15 CPU measure is 24860000000, but the rate is 8645000000.
Perhaps I misunderstand the meaning of "rate" in this case. What I want is the CPU utilization of an instance, no matter if measured in percent or in nanoseconds.

@chungg
Member

chungg commented Aug 2, 2019

gnocchi measures aggregation is actually the deprecated command i believe. there is a gnocchi aggregates command which has a DSL of sorts and allows you to specify more complex queries.

that said, what is happening i believe is that, because you don't specify the --reaggregation field, it ends up computing the rate of rates, or the acceleration of the timeseries. if you add --reaggregation mean it should give you what you expect. if you use the aggregates command you can divide by the granularity (in nanoseconds) to get a percentage.
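
The "rate of rates" effect can be checked against the numbers posted in this issue. The following is a minimal Python sketch, not Gnocchi code: it assumes that "rate" is simply the difference between consecutive points, and it uses the values from the `--aggregation mean` table above. The helper name `rate` is made up for illustration.

```python
# Mean cumulative CPU time in nanoseconds, copied from the
# "--aggregation mean" table in this issue (15:13 to 15:19).
cpu_mean = [
    19_995_000_000,
    46_495_000_000,
    62_710_000_000,
    87_570_000_000,
    116_445_000_000,
    140_075_000_000,
    148_560_000_000,
]


def rate(series):
    """Difference between consecutive points (assumed model of 'rate')."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


first_rate = rate(cpu_mean)      # CPU time consumed per minute, positive here
rate_of_rate = rate(first_rate)  # change of that rate, which can be negative

print(first_rate)
print(rate_of_rate)
```

Under that assumption, the values in `rate_of_rate` are the same as the rows of the `rate:mean` table above, and `first_rate[2]` is the 24860000000 difference between the 15:15 and 15:16 measures.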

@berndbausch
Author

berndbausch commented Aug 3, 2019

Many thanks Gord. Wow your answer was fast. Yes, by adding --reaggregation mean to my command, I get the expected figures. So for now my problem is solved, though I don't pretend to understand why reaggregation is needed, what reaggregation is in the first place, and what I am doing there. I have to study a bit more.

Just in case you are interested in helping a total newcomer to the world of measurement and statistics:
Is there a "time series for dummies" book somewhere, or anything where the terms "aggregation" and "reaggregation" are defined for dummies? An intro how to use Gnocchi in the context of OpenStack?

Also, it would be great if the Gnocchi client documentation mentioned which commands are deprecated. I would not mind helping out with this kind of work, if I knew what precisely is in fact deprecated:)

Thanks again, you removed a roadblock.

@berndbausch
Author

berndbausch commented Aug 3, 2019

I closed this issue because I have a solution, but here is an improvement.

This non-deprecated command provides the same result as the deprecated gnocchi measures aggregation:

gnocchi aggregates --resource-type instance   \
                   "(aggregate rate:mean (metric cpu mean))"    \
                   "server_group=myapp"

This is great, since I can now apply simple arithmetic to turn the nanosecond results into percentages:

gnocchi aggregates --resource-type instance  \
                   "(* ( / (aggregate rate:mean (metric cpu mean)) 60000000000.0) 100)"  \
                   "server_group=myapp"

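The arithmetic inside that expression, sketched in Python for clarity. The function name `cpu_percent` is hypothetical; the only facts taken from this thread are that the cpu metric is measured in nanoseconds and that the granularity is 60 seconds.

```python
NS_PER_SECOND = 1_000_000_000


def cpu_percent(rate_ns, granularity_s):
    """Convert CPU nanoseconds used in one granularity window to a percentage.

    Hypothetical helper: mirrors (* (/ rate <granularity in ns>) 100).
    """
    return rate_ns / (granularity_s * NS_PER_SECOND) * 100


# 24860000000 ns is the 15:15 -> 15:16 difference from the table above.
print(round(cpu_percent(24_860_000_000, 60), 1))
```

A window in which one CPU is busy the whole time yields 100; values above 100 are possible when an instance has more than one vCPU.
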
@chungg
Member

chungg commented Aug 3, 2019

i would welcome changes to the docs so feel free to contribute to https://github.com/gnocchixyz/gnocchi/tree/master/doc/source or https://github.com/gnocchixyz/python-gnocchiclient/tree/master/doc/source. unfortunately, it seems the publishing of the docs is not working so even the docs online do not reflect what is in the repository :(

yes, i can see how it is ambiguous, it is because gnocchi stores aggregates as its base and not raw datapoints. so in your archive policy you have rate:mean and mean at granularity: 0:01:00 which means gnocchi is actually storing two timeseries, one for each. so when you make your query, you first need to specify the aggregate you told gnocchi to store and then because you are dynamically aggregating on a metric across multiple resources, that is the reaggregation

so in the example:

gnocchi aggregates --resource-type instance   \
                   "(aggregate rate:mean (metric cpu mean))"    \
                   "server_group=myapp"

you are selecting all the stored cpu mean metrics for instances with server_group=myapp (which returns many series) and then you're telling gnocchi to aggregate those series into one by computing the mean rate across them.

(aggregate mean (metric cpu rate:mean)) would also work for you (and would be more accurate). it gets the rate:mean metrics and computes the mean across them to return one timeseries.

as a last example, (aggregate mean (metric cpu max)) would not work for you, because cpu max is not an aggregate stored according to your policy. alternatively, (aggregate max (metric cpu mean)) will work and will return the max of all your cpu mean metrics
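
The two-step idea (first select a stored aggregate, then reaggregate across resources) can be pictured with a toy model. This is only a sketch in Python, not how Gnocchi is implemented: the `stored` dictionary, the `metric` and `aggregate` functions, and all values are invented for illustration, and only the `mean` and `max` reaggregation methods are modelled.

```python
from statistics import mean

# Toy stand-in for what the archive policy stores: one series per
# (instance, aggregation method). All values are invented.
stored = {
    ("cpu-user1", "mean"): [10, 30, 60],
    ("cpu-user2", "mean"): [20, 50, 70],
    ("cpu-user1", "rate:mean"): [20, 30],
    ("cpu-user2", "rate:mean"): [30, 20],
}


def metric(aggregation):
    """Select the stored series of every instance, like (metric cpu <aggregation>)."""
    series = [values for (_, agg), values in stored.items() if agg == aggregation]
    if not series:
        raise KeyError(f"aggregation {aggregation!r} is not stored by the policy")
    return series


def aggregate(method, series_list):
    """Combine several series point by point, like (aggregate <method> ...)."""
    reducers = {"mean": mean, "max": max}
    return [reducers[method](points) for points in zip(*series_list)]


print(aggregate("mean", metric("rate:mean")))  # one series: mean of the stored rates
print(aggregate("max", metric("mean")))        # one series: max of the stored means
```

In this model, `metric("max")` raises an error because no `max` series was stored, which corresponds to the last example above.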

@berndbausch
Author

Whenever I leave a comment here I learn something new. Another lightbulb moment. Thanks Gord!

@giorgiove

Hi, just a quick one if I may.
How do you create an aodh alarm that tracks CPU util based on the output of this command?

gnocchi aggregates --resource-type instance  \
                   "(* ( / (aggregate rate:mean (metric cpu mean)) 60000000000.0) 100)"  \
                   "server_group=myapp"

I'm sure I'm missing something, but the documentation is rather concise, so to speak.
Thanks

@chungg
Member

chungg commented Jan 29, 2020

disclaimer: i don't remember much about aodh and didn't know much to begin with but you may want to look at gnocchi_aggregation_by_resources_threshold alarm type.

that said, you'll probably get better feedback from openstack community... if not, that's probably a good sign to find an alternative to aodh.

@ohryhorov

ohryhorov commented Apr 28, 2020

> Hi, just a quick one if I may.
> How do you create an aodh alarm that tracks CPU util based on the output of this command?
>
> gnocchi aggregates --resource-type instance  \
>                    "(* ( / (aggregate rate:mean (metric cpu mean)) 60000000000.0) 100)"  \
>                    "server_group=myapp"
>
> I'm sure I'm missing something, but the documentation is rather concise, so to speak.
> Thanks

Hello,
Have you managed to define an aodh alarm based on the calculated metric?
Of course, one could define the threshold in nanoseconds for the cpu metric, but that is definitely not convenient.
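
For the nanosecond route, the threshold can at least be computed rather than worked out by hand. A minimal Python sketch, assuming the alarm compares against the rate:mean value of the cpu metric for one granularity window and that the instance has a single vCPU; `cpu_threshold_ns` is a hypothetical helper, not part of aodh or Gnocchi.

```python
NS_PER_SECOND = 1_000_000_000


def cpu_threshold_ns(percent, granularity_s):
    """CPU nanoseconds per granularity window that equal `percent` of one CPU.

    Hypothetical helper for picking an alarm threshold.
    """
    return percent * granularity_s * NS_PER_SECOND / 100


# 80% of one CPU over a 60 second window.
print(cpu_threshold_ns(80, 60))
```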

@zhenjiangma

zhenjiangma commented Nov 25, 2020

Hey, I just ran into a problem.
I want to get cpu_util following the statements above, but my command doesn't work.
When I run "openstack metric aggregates '(metric cpu rate:mean)' id=e319d4e6-67fb-4398-be0f-3c6790b50eec", it works well.
However, when I run "openstack metric aggregates '(* (/ (aggregate rate:mean (metric cpu mean)) 30000000000) 100)' id=e319d4e6-67fb-4398-be0f-3c6790b50eec", it fails with "Invalid input: '*' operation invalid for dictionary value @ data[u'operations'] (HTTP 400)", and I don't know why. Could you help me?

@unlenen

unlenen commented Jan 26, 2021

Hi,
I have the same problem. As @berndbausch explained, I can compute the CPU usage of an instance, but I could not find a way to feed this aggregation into aodh. I need an alarm when instance CPU is higher than 90% for auto scaling. Could you help me with this?

My Test:

  • Server : 8 core , 8G Ram
  • Test route : stress-ng --cpu 8 --cpu-load 100
  • Granularity : 300
  • Calculation : gnocchi aggregates '(* (/ (aggregate rate:mean (metric cpu mean)) granularity*1000000000) 100)' id=<server_id>
  • Response : 800, which means every CPU runs at 100%

Thanks @berndbausch, this helped me understand what Gnocchi is 🥇
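
The 800 in that test suggests that the cpu metric adds up the CPU time of all vCPUs. Under that assumption (inferred from the test result, not confirmed from the Gnocchi or Ceilometer code), dividing by the vCPU count gives a value where a fully loaded instance reads 100. A minimal Python sketch with a hypothetical helper name:

```python
NS_PER_SECOND = 1_000_000_000


def cpu_percent_per_vcpu(rate_ns, granularity_s, vcpus):
    """Percentage of total capacity, so a fully loaded instance reads 100."""
    return rate_ns * 100 / (granularity_s * NS_PER_SECOND * vcpus)


# 8 vCPUs busy for a full 300 s window, as in the stress-ng test above.
busy_ns = 8 * 300 * NS_PER_SECOND
print(cpu_percent_per_vcpu(busy_ns, 300, 8))
```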

@zhenjiangma

zhenjiangma commented Jan 26, 2021 via email

@unlenen

unlenen commented Jan 26, 2021

So is it possible to create a measure or metric from an aggregate, so that we can extend its usage?

@GizemElove

Hi, I have a similar issue as well.
I tried the query below to retrieve CPU utilization and it worked:
"gnocchi aggregates '(* (/ (aggregate rate:mean (metric cpu mean)) 300000000000) 100)' id=..."
But I'm using OpenStack Tacker and I need to trigger automatic scaling of my VNF group when CPU utilization is greater than 80%. I tried to create an alarm with aodh, but I cannot use the query in this form.
Any help on this would be appreciated.

@paramite

paramite commented Jun 30, 2021

Unfortunately, looking at the Aodh code [1], there is no Gnocchi-based alarm evaluator that would call self._gnocchi_client.aggregates ([2][3]). There are only alarm types that call self._gnocchi_client.metric.aggregation, which does not support "operations". We would need to implement a new Aodh alarm type for aggregates.

[1] https://github.com/openstack/aodh/blob/stable/train/aodh/evaluator/gnocchi.py
[2] https://github.com/gnocchixyz/python-gnocchiclient/blob/master/gnocchiclient/v1/aggregates.py
[3] https://github.com/gnocchixyz/python-gnocchiclient/blob/master/gnocchiclient/v1/aggregates_cli.py#L49

@manuvakery1

> Hi, I have the same problem. As @berndbausch explained, I can compute the CPU usage of an instance, but I could not find a way to feed this aggregation into aodh. I need an alarm when instance CPU is higher than 90% for auto scaling. Could you help me with this?
>
> My Test:
>
>   • Server : 8 core , 8G Ram
>   • Test route : stress-ng --cpu 8 --cpu-load 100
>   • Granularity : 300
>   • Calculation : gnocchi aggregates '(* (/ (aggregate rate:mean (metric cpu mean)) granularity*1000000000) 100)' id=<server_id>
>   • Response : 800, which means every CPU runs at 100%
>
> Thanks @berndbausch, this helped me understand what Gnocchi is 🥇

@unlenen have you managed to do this?

@unlenen

unlenen commented Jan 28, 2022

Check this code patch in aodh. You need to restart aodh-evaluator after the code changes.

https://review.opendev.org/c/openstack/aodh/+/786880

  • Edit : You may need to upgrade gnocchi-client on the host where aodh is installed. Be sure that the gnocchi-client version is greater than 7.0.6. I also want to mention that this code only helps when your ceilometer notification interval is the same as the heat template interval.

@manuvakery1

> Check this code patch in aodh. You need to restart aodh-evaluator after the code changes.
>
> https://review.opendev.org/c/openstack/aodh/+/786880
>
>   • Edit : You may need to upgrade gnocchi-client on the host where aodh is installed. Be sure that the gnocchi-client version is greater than 7.0.6. I also want to mention that this code only helps when your ceilometer notification interval is the same as the heat template interval.

@unlenen ok .. thanks .. I will try this

@tobias-urdin
Contributor

Aodh will get Dynamic Aggregates API support with [1], though there is a small issue in Gnocchi [2] (hopefully fixed soon).

[1] https://review.opendev.org/c/openstack/aodh/+/829870
[2] #1202

@manuvakery1

@tobias-urdin can you please provide a sample alarm using the dynamic aggregate?
