
Prometheus `/query` endpoint sometimes returns values that don't match the PromQL query #4261

Closed · jacksontj opened this issue Jun 12, 2018 · 22 comments · 3 participants

jacksontj (Contributor) commented Jun 12, 2018

Bug Report

What did you do?
I send a query such as http://localhost:9090/api/v1/query?query=http_request_duration_microseconds%5b301s%5d&time=1528830551.230 (i.e. http_request_duration_microseconds[301s]) and sometimes I get results for different metrics (in the example below I see results for http_request_duration_microseconds_count).

What did you expect to see?
I expect every result in the /query response to be a valid match for the PromQL query.

What did you see instead? Under which circumstances?
The results sometimes don't match the PromQL query. From my testing, I am only able to reproduce this when (1) the query is for a range, i.e. a matrix query, and (2) there are other in-flight queries for other time series over the same time range. This leads me to believe it is actually some TSDB-level issue, but I have not been able to identify the root cause yet; the two query forms are contrasted just below.
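
For comparison, the instant-vector form of the selector versus the range-vector form that triggers the mismatches (only the latter reproduces the problem for me):

http_request_duration_microseconds          <- instant vector: no mismatches observed
http_request_duration_microseconds[301s]    <- range vector (matrix): mismatched series can appear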

Environment

  • System information:

Linux 4.13.0-39-generic x86_64

  • Prometheus version:

Reproducible on 2.2 but not master

  • Prometheus configuration file:
scrape_configs:
- job_name: "prom"
  metrics_path: "/metrics"
  static_configs:
    - targets:
      - 127.0.0.1:9090

Repro code
To simplify reproduction, I've written the following program, which reproduces the issue against a locally running Prometheus instance:

package main

import (
	"context"
	"fmt"
	"net/http"
	"sync"
	"time"

	"github.com/jacksontj/promxy/promclient"
	"github.com/jacksontj/promxy/promhttputil"
	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/pkg/timestamp"
)

var urls = []string{
	"http://localhost:9090/api/v1/query?query=http_request_duration_microseconds%5b301s%5d",
	"http://localhost:9090/api/v1/query?query=http_request_duration_microseconds_count%5b301s%5d",
}

var WORKERS = 10

// spamNode hammers a single query URL with WORKERS concurrent requesters and
// panics as soon as a response contains a series whose __name__ differs from
// the rest of that result set.
func spamNode(ctx context.Context, client *http.Client, url string) {
	wg := &sync.WaitGroup{}
	for i := 0; i < WORKERS; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				result, err := promclient.GetData(ctx, url, client, nil)
				if err != nil {
					continue
				}

				if result.Status == promhttputil.StatusSuccess {
					iterators := promclient.IteratorsForValue(result.Data.Result)
					name := ""
					for _, iterator := range iterators {
						for _, label := range iterator.Labels() {
							if label.Name == "__name__" {
								if name == "" {
									name = label.Value
								} else {
									if label.Value != name {
										fmt.Println("mismatch in set", name, label)
										fmt.Println("while sending to", url)
										//fmt.Println(result.Data.Result)
										panic("mismatch")
									}
								}
								break
							}
						}
					}
				}

			}
		}()
	}
	wg.Wait()
}

func main() {
	ctx := context.Background()
	client := &http.Client{}

	wg := &sync.WaitGroup{}
	for _, url := range urls {
		// Pin each query to a fixed evaluation timestamp (now, at startup) so
		// repeated requests cover a stable time range.
		url = url + "&time=" + model.Time(timestamp.FromTime(time.Now())).String()
		wg.Add(1)
		go func(url string) {
			fmt.Println("starting worker for", url)
			defer fmt.Println("completing worker for", url)
			defer wg.Done()
			spamNode(ctx, client, url)
		}(url)
	}
	wg.Wait()

}
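
To run the reproducer: point it at a local Prometheus that scrapes itself on 127.0.0.1:9090 (as in the config above), so the http_request_duration_microseconds metrics exist. The program panics with "mismatch" as soon as a response mixes series names, and runs indefinitely otherwise.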

brian-brazil (Member) commented Jun 12, 2018

"Reproducible on 2.2 but not master"

PromQL changed a good bit, and there was a race with reusing the matrices in the old PromQL code that could explain this. It's worth investigating whether there's a TSDB-level issue that is now masked by the new PromQL code, though.

brian-brazil (Member) commented Jun 12, 2018

@gouthamve @fabxc This needs investigation.

jacksontj (Contributor, author) commented Jun 12, 2018

@brian-brazil do you know which commit fixes that PromQL race condition? If so, I can test that single commit. I'm still trying to get a bisect to work, but git doesn't like bisecting between the two release branches because the merge builds are broken too.

jacksontj (Contributor, author) commented Jun 12, 2018

I re-ran my repro steps against a Prometheus build compiled with -race and it prints a fairly large stack of race warnings -- so presumably that is related :)

jacksontj (Contributor, author) commented Jun 12, 2018

I was able to complete the bisect, and the issue is fixed by dd6781a (the large PromQL re-do). Any idea which change within that ~2k-line commit it would be? This issue seems problematic enough to warrant a backport, but that is a large commit to backport.

brian-brazil (Member) commented Jun 12, 2018

Even within the ~60 original commits behind that change, the fix was spread throughout. I'm still concerned that this issue actually lies down in the TSDB, as the race was across queries generally rather than within a time range.

brian-brazil (Member) commented Jun 12, 2018

The issue, incidentally, was reusing the matrix across queries: it was put back in the pool when execution completed rather than when the result was sent back to the user. Breaking the pooling in engine.go would eliminate that particular race, enough to see whether this was entirely that or also a TSDB issue.
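
For illustration only -- a minimal sketch of that pattern with hypothetical names, not the actual engine.go code. The pooled result slice is put back as soon as evaluation finishes, while the caller is still reading it, so a concurrent query can overwrite the data mid-response:

package main

import (
	"fmt"
	"sync"
)

// Sample and Series are hypothetical stand-ins for the engine's matrix types.
type Sample struct {
	Metric string
	Value  float64
}

type Series []Sample

var seriesPool = sync.Pool{
	New: func() interface{} { return make(Series, 0, 8) },
}

// evalQuery mimics the racy pattern: the pooled slice is released as soon as
// evaluation returns, even though the caller still holds and reads it.
func evalQuery(metric string) Series {
	s := seriesPool.Get().(Series)[:0]
	s = append(s, Sample{Metric: metric, Value: 1})
	// Put() happens at "end of execution" instead of after the result has been
	// sent to the user, so another query can reuse the same backing array while
	// our caller is still encoding its response.
	seriesPool.Put(s)
	return s
}

func main() {
	var wg sync.WaitGroup
	for _, m := range []string{"metric_a", "metric_b"} {
		m := m
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000000; i++ {
				res := evalQuery(m)
				// The "HTTP layer" reads the result after evaluation; with the
				// early Put() above, res[0] can already hold the other
				// goroutine's sample.
				if res[0].Metric != m {
					fmt.Println("mismatch:", res[0].Metric, "!=", m)
					return
				}
			}
		}()
	}
	wg.Wait()
}

Running this with go run -race flags the read of res[0] against the concurrent append, presumably the same shape of warning stack reported above.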

jacksontj (Contributor, author) commented Jun 12, 2018

@brian-brazil seeing that it is fixed on master by the PromQL optimization commit makes me think this is that pooling bug in PromQL rather than an issue in the TSDB. The symptom was data from another query, but if PromQL was handing the matrix to another query, its data would have been mutated and then potentially read (which looks like data leaking from the TSDB, but in reality is just a data race).

brian-brazil (Member) commented Jun 12, 2018

That doesn't explain your observation that the mixed data has to be from the same time range. The race I eliminated (by completely changing how that code worked) would apply across all queries.

jacksontj (Contributor, author) commented Jun 12, 2018

I don't know that they must be within the same time range -- that's just what I'm observing. However, it does require a matrix selector (querying foo instead of foo[300s] doesn't reproduce the issue).

jacksontj (Contributor, author) commented Jun 12, 2018

And to double-check, I confirmed that I can reproduce the issue with non-overlapping queries as long as they are matrix selectors.

brian-brazil (Member) commented Jun 12, 2018

That sounds more like the issue I fixed, then. Anything tied to time ranges would indicate a block-level issue in the TSDB.

jacksontj (Contributor, author) commented Jun 12, 2018

Would it be possible to get a fix for this issue backported to 2.2?

brian-brazil (Member) commented Jun 12, 2018

We're working on stabilising 2.3, and this issue is seemingly already fixed there.

jacksontj (Contributor, author) commented Jun 13, 2018

Does that mean we're waiting on 2.3 to stabilize to determine what backport to do? Or is the plan to never backport a fix and leave 2.2 as-is?

brian-brazil (Member) commented Jun 13, 2018

There are no plans to backport anything to 2.2. We generally don't backport.

brian-brazil (Member) commented Jun 13, 2018

I've dug into this a bit, and the only races I saw on 2.2 were due to the issue I already fixed, so there doesn't appear to be an underlying TSDB bug here. Thanks for the good reproduction case; this one has been making us scratch our heads for a while.

jacksontj (Contributor, author) commented Jun 13, 2018

Is there an ETA for 2.3? Until it is released, I think this bug should remain open, as there isn't a supported release that is unaffected by it.

krasi-georgiev (Member) commented Jun 13, 2018

@jacksontj it seems you missed the news 😄 2.3 is already out:
https://github.com/prometheus/prometheus/releases

There are a few reports of other, unrelated issues, but those should also be fixed in the next 2-3 weeks.

brian-brazil (Member) commented Jun 13, 2018

As a matter of general policy, we close bugs when they are fixed in code.

jacksontj (Contributor, author) commented Jun 13, 2018

@krasi-georgiev I did, thanks for pointing that out! :)

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited the conversation to collaborators Mar 22, 2019
