Prometheus `/query` endpoint sometimes returns values that don't match the PromQL query #4261
Comments
PromQL changed a good bit, and there was a race with reusing the matrices in the old PromQL code that could explain this. Worth investigating whether there's a TSDB-level issue now masked by the new PromQL code, though.
@gouthamve @fabxc This needs investigation.
brian-brazil added the kind/bug and component/local storage labels on Jun 12, 2018
@brian-brazil for that PromQL race condition, do you know the commit that fixes it? If so I can test that single commit. I'm still trying to get a bisect to work, but git doesn't like bisecting between the two release branches because the merge build is broken too.
I re-ran my repro steps with a prometheus build made with
I was able to complete the bisect and the issue is fixed with dd6781a (the large PromQL re-do). Any ideas on which commit in that 2k change it would be? This issue seems problematic enough to necessitate a backport, but that seems like a large commit to backport.
Even within the ~60 original commits of that, the fix was spread throughout. I'm still concerned that this issue is actually down in the TSDB, as the race was across queries generally rather than within a time range.
The issue, incidentally, was reusing the matrix across queries: it was put back in the pool when execution completed rather than when the result was sent back to the user. Breaking the pooling in engine.go will fix that particular race enough to see whether this was entirely that or also a TSDB issue.
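For illustration, here is a minimal sketch of the kind of race described above: a result slice is returned to a shared pool as soon as execution finishes, before the caller has consumed it, so a concurrent query can reuse and overwrite it. This is a simplified stand-in, not Prometheus's actual engine.go code; the `Sample` type and pool are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Sample is a hypothetical stand-in for one data point of a series.
type Sample struct {
	Metric string
	Value  float64
}

// A shared pool of result slices, reused across queries.
var matrixPool = sync.Pool{
	New: func() interface{} { return make([]Sample, 0, 4) },
}

// runQuery fills a pooled slice with a result for metric and returns it.
// BUG (mirrors the race described above): the slice goes back into the
// pool as soon as execution completes, before the caller has read it.
func runQuery(metric string) []Sample {
	m := matrixPool.Get().([]Sample)[:0]
	m = append(m, Sample{Metric: metric, Value: 1})
	matrixPool.Put(m) // premature: a concurrent query may now reuse and overwrite m
	return m
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			metric := fmt.Sprintf("metric_%d", i%2)
			res := runQuery(metric)
			// Occasionally res now holds a sample written by another query
			// that got the same backing array from the pool.
			if len(res) > 0 && res[0].Metric != metric {
				fmt.Println("leaked data from another query:", res[0].Metric)
			}
		}(i)
	}
	wg.Wait()
}
```

Running a program like this with `go run -race` should flag the data race; putting the result back in the pool before the consumer reads it is the essence of the bug.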
@brian-brazil seeing that it is fixed on master with the PromQL optimization commit makes me think it is that pooling bug in PromQL rather than an issue in TSDB. The symptom was data from another query, but if PromQL was handing the matrix to another query, its data would have been mutated and then potentially read (looking like data leaking from the TSDB, but in reality just a data race).
That doesn't explain your observation that the mixed data has to be from the same time range. The race I eliminated (by completely changing how that code worked) would apply across all queries.
I don't know that they must be within the same time range; that's just what I'm observing. However, it does require it to be a matrix selector (doing
And to double-check, I confirmed that I can reproduce the issue with non-overlapping queries as long as they are matrix selectors.
That's sounding more like the issue I fixed, then. Anything tied to ranges of time would indicate a block-level issue in the TSDB.
Would it be possible to get a fix for this issue backported to 2.2?
We're working on stabilising 2.3, and this issue is seemingly already fixed there.
Does that mean we're waiting on 2.3 to stabilize to determine what backport to do? Or is the plan to never backport a fix and leave 2.2 as-is?
There are no plans to backport anything to 2.2. We generally don't backport.
I've dug into this a bit, and the only races I saw on 2.2 were due to the issue I already fixed, so there doesn't look to be an underlying TSDB bug here. Thanks for the good reproduction case; this one has been making us scratch our heads for a while.
brian-brazil closed this on Jun 13, 2018
brian-brazil added the component/promql and component/api labels and removed the component/local storage label on Jun 13, 2018
Is there an ETA for 2.3? Until that is released I think this bug should remain open, as there isn't a supported release that is unaffected by it.
@jacksontj it seems you missed the news. There are a few reports for other unrelated issues, but these should also be fixed in the next 2-3 weeks.
As a matter of general policy, we close bugs when they are fixed in code.
@krasi-georgiev I did, thanks for pointing that out! :)
lock bot commented on Mar 22, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
jacksontj commented on Jun 12, 2018
Bug Report
What did you do?
I send a query like so:
http://localhost:9090/api/v1/query?query=http_request_duration_microseconds%5b301s%5d&time=1528830551.230
and sometimes I get results for different metrics (in the example below I see results for http_request_duration_microseconds_count).
What did you expect to see?
I expect all results in the /query response to be valid results for the PromQL query.
What did you see instead? Under which circumstances?
The results sometimes don't match the PromQL query. I have done some testing and I am only able to reproduce this when (1) the query is for a range, meaning a matrix query, and (2) there are other in-flight queries for other timeseries in the same time range. Because of this I believe it is actually some TSDB-level issue, but I have not been able to identify the root cause yet.
Environment
Linux 4.13.0-39-generic x86_64
Reproducible on 2.2 but not master
**Repro code**
To simplify reproducing the issue, I've made the following program, which reproduces it against a locally running Prometheus box:
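A minimal sketch of such a reproducer, assuming a Prometheus server on localhost:9090 that scrapes its own metrics; the metric names and query loop below are illustrative stand-ins, not the reporter's original program.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"sync"
	"time"
)

// apiResponse models just the parts of the /api/v1/query response we need.
type apiResponse struct {
	Data struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
		} `json:"result"`
	} `json:"data"`
}

// query issues an instant query with a matrix (range) selector for metric.
func query(metric string, ts time.Time) (*apiResponse, error) {
	q := url.Values{}
	q.Set("query", metric+"[301s]")
	q.Set("time", fmt.Sprintf("%d", ts.Unix()))
	resp, err := http.Get("http://localhost:9090/api/v1/query?" + q.Encode())
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var r apiResponse
	return &r, json.NewDecoder(resp.Body).Decode(&r)
}

func main() {
	// Two metrics assumed to exist on a Prometheus server scraping itself.
	metrics := []string{
		"http_request_duration_microseconds",
		"http_request_duration_microseconds_count",
	}
	now := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		for _, m := range metrics {
			wg.Add(1)
			go func(m string) {
				defer wg.Done()
				r, err := query(m, now)
				if err != nil {
					fmt.Println("query error:", err)
					return
				}
				// Every series in the matrix result should carry the
				// metric name we actually asked for.
				for _, res := range r.Data.Result {
					if name := res.Metric["__name__"]; name != m {
						fmt.Printf("queried %s but got series %s\n", m, name)
					}
				}
			}(m)
		}
	}
	wg.Wait()
}
```

If the race is present, the mismatch line should print occasionally once enough concurrent matrix-selector queries are in flight; on a fixed build it should stay silent.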