
2.3.0 significant memory usage increase #4254

Closed
tcolgate opened this issue Jun 12, 2018 · 77 comments · 6 participants

@tcolgate (Contributor) commented Jun 12, 2018

Bug Report

What did you do?
Upgraded to 2.3.0

What did you expect to see?
General improvements.

What did you see instead? Under which circumstances?
Memory usage, possibly driven by queries, has considerably increased. Upgraded at 09:27; the drops in memory usage on the graph after that point are from container restarts due to OOM.

[image: graph of container_memory_usage_bytes]

Environment

Prometheus in Kubernetes 1.9

  • System information:
    Standard Docker containers, run by the kubelet via Docker on Linux.

  • Prometheus version:
    2.3.0

@brian-brazil (Member) commented Jun 12, 2018

Can you share your configuration, a snapshot of the benchmark dashboard, and whether you've made any other changes?

@tcolgate (Contributor, author) commented Jun 12, 2018

@brian-brazil where do I find the "benchmark dashboard"? The config for this is rather large; are there any specific areas of interest?

@brian-brazil (Member) commented Jun 12, 2018

The entire config please; we don't know what might be relevant.

@tcolgate (Contributor, author) commented Jun 12, 2018

Obfuscated by hand; hopefully I didn't introduce any additional problems.
We also have a largish number of recording rules; due to the heritage of the tooling, these are all in individual rule groups, one rule per group. Not sure if that will have any bearing on things.
config.yaml.txt

@brian-brazil (Member) commented Jun 12, 2018

Summary of config: 15s interval, using 13 gce_sd_configs and I think 36 kubernetes_sd_configs.

@brian-brazil (Member) commented Jun 12, 2018

How often is the config file being reloaded? Can you try running it without the rules to eliminate those?

@tcolgate (Contributor, author) commented Jun 12, 2018

Sounds about right.
We have a sidecar that discovers kube clusters within a GCE project, and we run one Prometheus per project (this config also has some additional scrapes for non-Kubernetes GCE instances). This is one of the larger instances, but it has been running fine on 2.2.1.
The kube scrapes could probably be simplified at this point to just the pod discovery, but I don't think the number of scrapes is the problem.

@tcolgate (Contributor, author) commented Jun 12, 2018

Config reload only occurs on recording rule publishing (on the order of < 1/day).
Unfortunately it will be a while before I can spin up another 2.3 just to test.

Here's a before/after (the blue annotation shows when I reverted to 2.2.1):
https://snapshot.raintank.io/dashboard/snapshot/p9PmKT5K5D57RicIZOZQzdrAsez7vOF8

FWIW, queries should be relatively consistent across the timespan of that plot.

@brian-brazil (Member) commented Jun 12, 2018

My primary suspicion would be kubernetes_sd, as that changed a good bit. Query memory should only have gone down, which your graphs show.

@tcolgate (Contributor, author) commented Jun 12, 2018

My original graph may be incredibly misleading. I thought the spikes were the cause of the OOM; now I realise it shows the total allocated CPU, so the spikes are actually two instances running, not one. I think the instance is already over its memory budget.

@fabxc (Member) commented Jun 12, 2018

@tcolgate could you provide a heap profile SVG when the usage is very high?

go tool pprof -symbolize=remote http://<prometheus>/debug/pprof/heap

And then, at the pprof prompt, svg > heap_inuse.svg.

@tcolgate (Contributor, author) commented Jun 12, 2018

@fabxc I've upped the memory request to 26GB, which I think is enough that at least it doesn't get OOM'd now (it's pretty much on a dedicated node).
During the high memory usage spikes, I see scrape interval percentiles climb and sample ingestion drop significantly.
I got the attached capture during one of these spikes (or at least on the way to one).
heap_inuse_break.svg.gz

@fabxc (Member) commented Jun 12, 2018

Thanks for the quick response. That appears to be a 30s CPU profile, which looks fairly normal.
Are you sure you used a URL ending in /heap and not /profile?

@tcolgate (Contributor, author) commented Jun 12, 2018

Doh
@fabxc I think I got this just as the scrape latency was increasing:
heap_inuse_break.svg.gz

@fabxc (Member) commented Jun 12, 2018

This profile does not show any service discovery at all, i.e. its memory usage is probably very minor. The graph you first posted also indicates that usage didn't increase continuously but rather spikes a lot. The baseline actually seems to be a bit lower than before.

Can you get another profile with the -alloc_space flag added? Would be good if the server has been seeing those spikes already for a while.

@tcolgate (Contributor, author) commented Jun 12, 2018

This is an alloc_space one from during the climbing memory usage:
allocs3.svg.gz

This one seems to suggest that the labels built up during scraping are the problem? What's changed around the scrapes? More concurrency maybe?

@brian-brazil (Member) commented Jun 12, 2018

Do you have a normal heap_inuse and one during a spike?

@tcolgate (Contributor, author) commented Jun 12, 2018

@brian-brazil I think the previous one was at the start of one. Timing is tricky, as things become unresponsive. I've had to revert now, as I've been testing these on a prod instance (thanks to the kind patience of some devs).

@fabxc (Member) commented Jun 12, 2018

The allocs profile only shows allocations from scraping, recording rule evaluations, and serving Prometheus's own /metrics endpoint. The rest is below the threshold for being displayed.

My best guess for now is thus PromQL. The CPU profile doesn't show much GC work, which indicates that the allocation improvements in 2.3 are generally working. But possibly the changed evaluation model pins too much memory for a single query at once? I've no hard reasoning for this, though.

@tcolgate any chance you can share (possibly privately) the recording rules that server is running so we can get an understanding of the queries that are running?

@brian-brazil (Member) commented Jun 12, 2018

Could you also spin up a test server without the recording rules?

@tcolgate (Contributor, author) commented Jun 12, 2018

@brian-brazil that's going to take me a while; I can try and find the time tomorrow.

@brian-brazil (Member) commented Jun 12, 2018

> But possibly the changed evaluation model pins too much memory for a single query at once?

I can only imagine that happening with very high churn in the underlying data, which the graphs don't show.

@krasi-georgiev (Member) commented Jun 12, 2018

@tcolgate did you do any other changes apart from just upgrading the prom version?

@tcolgate (Contributor, author) commented Jun 12, 2018

@krasi-georgiev nope, just upgraded. I've updated and reverted a few times; the behaviour is consistent: 2.3.0 crashes, 2.2.1 is stable.

@rajatjindal commented Jun 13, 2018

@tcolgate, the graphs that you shared are very interesting. We are also using Prometheus and running into performance issues when we enable remote storage (it might be a completely different issue).

Is there a place where we can import these dashboards from? It will be interesting to see these metrics.

@tcolgate (Contributor, author) commented Jun 13, 2018

@rajatjindal The Prometheus Benchmark dashboard is available on grafana.net (make sure you get the 2.0 version), and the other dashboard is our internal prom dashboard; better dashboards exist on grafana.net (ours is adapted from one of the earlier v1 Prometheus perf dashboards).

@free (Contributor) commented Jun 13, 2018

From the -alloc_space profile, it looks to me like more than half the memory allocations during scraping (16 of 29 GB, in scrape.mutateSampleLabels) and about a quarter of those during rule evaluation (14 of 55 GB, 10 GB in promql.dropMetricName and 4 GB in promql.(*evaluator).aggregation) are labels.Builder instances.

#4248 tries to optimize away the latter 4 GB (and I've seen it use 10x less memory for query_range requests for aggregated rate queries -- sum by(foo) (rate(bar[5m]))), but if that turns out not to be the main cause, I can think of a couple of ways of getting rid of the other, larger, allocations:

  • For scrape.mutateSampleLabels, barring any race conditions, reuse the labels in scrape.scrapeCache.seriesPrev where available, instead of creating new ones on each scrape.
  • For promql.dropMetricName, turn labels.Labels into an interface, with a base implementation (the current one) plus a wrapper around it that dynamically filters away some labels.

Neither is a particularly difficult technical challenge and might be worth pursuing even if the root cause turns out to be totally different (which I actually doubt).
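
[Editor's note] To make the second idea concrete, here is a minimal, standalone Go sketch. The Label and LabelView types below are hypothetical stand-ins, not the actual Prometheus labels package; the point is only that a filtering wrapper lets callers iterate a label set minus __name__ without allocating a new slice per sample.

```go
package main

import "fmt"

// Label is a hypothetical stand-in for a name/value label pair.
type Label struct {
	Name, Value string
}

// LabelView is the interface idea from the second bullet: callers iterate
// labels without knowing whether they hold a plain slice or a filtered view.
type LabelView interface {
	Range(f func(Label))
}

// plainLabels is the "base implementation": a simple sorted label slice.
type plainLabels []Label

func (ls plainLabels) Range(f func(Label)) {
	for _, l := range ls {
		f(l)
	}
}

// withoutName wraps another view and filters out one label name on the fly,
// so dropping __name__ needs no new slice allocation per sample.
type withoutName struct {
	inner LabelView
	drop  string
}

func (w withoutName) Range(f func(Label)) {
	w.inner.Range(func(l Label) {
		if l.Name != w.drop {
			f(l)
		}
	})
}

func main() {
	ls := plainLabels{{"__name__", "bar"}, {"job", "prometheus"}}
	// Analogous to what dropMetricName would do, but without copying labels.
	view := withoutName{inner: ls, drop: "__name__"}
	view.Range(func(l Label) { fmt.Printf("%s=%q\n", l.Name, l.Value) })
}
```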

@free (Contributor) commented Jun 13, 2018

Oh, and on a related note, while testing my humongous aggregated rate query_range requests, I'm seeing significantly more memory usage the first time I run said requests after long pauses (hours or days) than on subsequent runs. I.e. starting from a steady state go_memstats_heap_inuse_bytes of about 500 MB, the first spike goes to 2.3 GB, subsequent ones to only 1.1 GB.

I haven't managed to pinpoint the cause, but I imagine it's the TSDB loading everything.

@tcolgate (Contributor, author) commented Jun 14, 2018

I've managed to trigger it again and got an allocs SVG, but I don't think it is terribly useful (it seems to cover the whole time since the process started?)
heap_allocs.svg.gz

There seems to be some kind of time/event element to this, but it doesn't obviously align with any other event having occurred. Basically, if I leave the thing alone for an hour or so, I can trigger the OOM by using the federate query. Before that, I can hammer federate mercilessly without issues (tried hitting it with hey).

I need to head off soon. Tomorrow I'll try and capture a CPU profile during a crash.

@brian-brazil (Member) commented Jun 14, 2018

This is smelling like a TSDB issue and it doesn't align with blocks, so it's probably chunks.

@brian-brazil (Member) commented Jun 15, 2018

I've looked through the code changes on the federation codepath between 2.3.0 and 2.2.1; there are a few changes on the path, but none of them seem like a plausible cause.

@tcolgate (Contributor, author) commented Jun 15, 2018

I've not been able to trigger a crash this morning as yet. I'll give it a try every 10 mins or so and see if I can get a trace during the crash, 🤞

@brian-brazil (Member) commented Jun 15, 2018

I'm currently suspecting it's #4185. If you can reproduce again, try rolling that one back.

@brian-brazil (Member) commented Jun 15, 2018

Also, could you try the federate call without the {job="prometheus"}, to see if overlapping selectors are the issue.

@tcolgate (Contributor, author) commented Jun 15, 2018

Removing the {job="prometheus"} doesn't seem to trigger the problem any more reliably. I've not managed to crash it yet today.

@brian-brazil (Member) commented Jun 15, 2018

I would expect removing the {job="prometheus"} to reduce the chances of a crash, as it's a simpler code path.

@tcolgate (Contributor, author) commented Jun 15, 2018

Any suggestions for how I can change the /federate call to make it more likely to trigger your guess? I'd rather be able to reliably reproduce the problem.

@brian-brazil (Member) commented Jun 15, 2018

Adding some duplicate matchers might do it, but it's a bit of a shot in the dark.
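
[Editor's note] For anyone trying to reproduce this, a sketch of a federate request with a deliberately duplicated matcher. The address and matcher value are placeholders, and this is just one way to issue the call, not the exact reproduction used in this thread.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Placeholder address: point this at the Prometheus instance under test.
	base := "http://localhost:9090/federate"

	// Two identical match[] parameters: the overlapping selectors suspected
	// of triggering the problem.
	params := url.Values{}
	params.Add("match[]", `{job="prometheus"}`)
	params.Add("match[]", `{job="prometheus"}`)

	resp, err := http.Get(base + "?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("status %s, %d bytes of federated samples\n", resp.Status, len(body))
}
```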

@tcolgate (Contributor, author) commented Jun 15, 2018

I repeated one of the matchers a couple of times and boom, that did it; I appear to be able to crash it on demand. Impressive in-the-dark shooting there!
out.trace.gz

@brian-brazil (Member) commented Jun 15, 2018

Okay, I can reproduce locally now on a Prometheus scraping only itself.

@brian-brazil (Member) commented Jun 15, 2018

Okay, the issue is when you try to federate a NaN value from a time series that more than one selector returns. NaNs never equal themselves, so it looks like a different value even though it's the same. So we end up in an infinite loop, which also buffers up all these duplicate points in RAM.
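
[Editor's note] A small standalone Go illustration of the comparison pitfall described above (simplified types, not the actual federation merge code): a duplicate check that compares sample values never recognises a repeated NaN, so the duplicate point is never skipped.

```go
package main

import (
	"fmt"
	"math"
)

type point struct {
	t int64
	v float64
}

// dedupe mimics the faulty comparison: a point only counts as a duplicate
// when both timestamp AND value match the previous point. Because NaN never
// equals NaN, a repeated NaN sample is never treated as a duplicate.
func dedupe(points []point) []point {
	var out []point
	for _, p := range points {
		if len(out) > 0 {
			last := out[len(out)-1]
			if p.t == last.t && p.v == last.v { // always false for NaN values
				continue
			}
		}
		out = append(out, p)
	}
	return out
}

func main() {
	fmt.Println("NaN == NaN:", math.NaN() == math.NaN()) // false

	dup := []point{{t: 1000, v: math.NaN()}, {t: 1000, v: math.NaN()}}
	fmt.Println("NaN points kept:", len(dedupe(dup))) // 2: the duplicate survives

	ok := []point{{t: 1000, v: 1.5}, {t: 1000, v: 1.5}}
	fmt.Println("normal points kept:", len(dedupe(ok))) // 1: duplicates collapse
}
```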

brian-brazil added a commit that referenced this issue Jun 15, 2018

Avoid infinite loop on duplicate NaN values.
Fixes #4254

NaNs don't equal themselves, so a duplicate NaN would
always hit the break statement and never get popped.

We should not be returning multiple data point for the same
timestamp, so don't compare values at all.

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
@tcolgate (Contributor, author) commented Jun 15, 2018

@brian-brazil great catch, cheers. I'll try a build from release-2.3 once you've merged.

brian-brazil added a commit that referenced this issue Jun 18, 2018

Avoid infinite loop on duplicate NaN values. (#4275)
Fixes #4254

NaNs don't equal themselves, so a duplicate NaN would
always hit the break statement and never get popped.

We should not be returning multiple data point for the same
timestamp, so don't compare values at all.

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>

mknapphrt added a commit to mknapphrt/prometheus that referenced this issue Jul 26, 2018

Return whatever data is available when there is a failed remote read
Signed-off-by: Mark Knapp <mknapp@hudson-trading.com>


gouthamve added a commit to gouthamve/prometheus that referenced this issue Aug 1, 2018

Avoid infinite loop on duplicate NaN values. (prometheus#4275)
Fixes prometheus#4254

NaNs don't equal themselves, so a duplicate NaN would
always hit the break statement and never get popped.

We should not be returning multiple data point for the same
timestamp, so don't compare values at all.

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
