
Some rate calculations on rc1 are incorrect #3345

Closed
smarterclayton opened this Issue Oct 24, 2017 · 11 comments

smarterclayton commented Oct 24, 2017

In 2.0.0-rc1 some queries for rate over counters seem to be wildly overstated / incorrectly calculated. This is a very large Kubernetes cluster that is receiving ~0.1 requests per second on average for a specific request error code, but the rate calculation is reporting 28 req/s.

Here are samples of a particular metric, apiserver_request_count, scraped directly from one of the three masters:

apiserver_request_count{client="openshift/v1.7.0+80709908fd (linux/amd64) kubernetes/b0608fa",code="404",contentType="application/vnd.kubernetes.protobuf",resource="secrets",scope="resource",subresource="",verb="GET"} 7584

... wait 5-10s

apiserver_request_count{client="openshift/v1.7.0+80709908fd (linux/amd64) kubernetes/b0608fa",code="404",contentType="application/vnd.kubernetes.protobuf",resource="secrets",scope="resource",subresource="",verb="GET"} 7584

... wait 5-10s

apiserver_request_count{client="openshift/v1.7.0+80709908fd (linux/amd64) kubernetes/b0608fa",code="404",contentType="application/vnd.kubernetes.protobuf",resource="secrets",scope="resource",subresource="",verb="GET"} 7587

Prometheus console reports the correct instantaneous value:

Query:
apiserver_request_count{client="openshift/v1.7.0+80709908fd (linux/amd64) kubernetes/b0608fa",code="404",contentType="application/vnd.kubernetes.protobuf",resource="secrets",scope="resource",subresource="",verb="GET"}

Result:
{client="openshift/v1.7.0+80709908fd (linux/amd64) kubernetes/b0608fa",code="404",contentType="application/vnd.kubernetes.protobuf",instance="172.31.21.249:443",job="kubernetes-apiservers",resource="secrets",scope="resource",subresource="",verb="GET"} | 7587

[graph screenshot]

Over the last six hours the total has only increased by about 3k, which should be a rate of about 0.15/sec.
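
A quick back-of-the-envelope check (plain Python, using the rounded figures above; the exact counts are assumptions) confirms the expected magnitude:

# Rough sanity check using the rounded numbers above (assumed values).
increase = 3000            # counter increased by ~3k over the window
window_seconds = 6 * 3600  # six hours

expected_rate = increase / window_seconds
print(expected_rate)       # ~0.14 requests/sec, i.e. roughly 0.15/sec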

But the rate query is reporting a 5m rate of 28/sec, which is impossible:

Query:
rate(apiserver_request_count{client="openshift/v1.7.0+80709908fd (linux/amd64) kubernetes/b0608fa",code="404",contentType="application/vnd.kubernetes.protobuf",instance="172.31.21.249:443",job="kubernetes-apiservers",resource="secrets",scope="resource",subresource="",verb="GET"}[5m])

Result:
{client="openshift/v1.7.0+80709908fd (linux/amd64) kubernetes/b0608fa",code="404",contentType="application/vnd.kubernetes.protobuf",instance="172.31.21.249:443",job="kubernetes-apiservers",resource="secrets",scope="resource",subresource="",verb="GET"} | 27.712352499999998

[graph screenshot]

smarterclayton commented Oct 24, 2017

The Prometheus instance has been running for about 20 hours since it was last cleared and reset due to #3283 and #3316.

smarterclayton commented Oct 24, 2017

Other rates on other series are also off by two or three orders of magnitude.

fabxc commented Oct 24, 2017

Thanks, does this sound like #3337?

smarterclayton commented Oct 25, 2017

Yes, I will check the reported samples in the morning.

smarterclayton commented Oct 25, 2017

Looks like this was fixed with rc2:

[graph screenshot]

smarterclayton commented Oct 25, 2017

There's a tiny tail of correct rate on the right edge.

fabxc commented Oct 25, 2017

Awesome, thanks for the quick response!

paultiplady commented Dec 14, 2017

@fabxc any idea when this regression was introduced? I seem to be hitting this on 1.7.0, is that possible? Thanks!

fabxc commented Dec 14, 2017

paultiplady commented Dec 14, 2017

Thanks for the quick response -- I'll do some more digging on my side and try upgrading to see if that clears things.
