Promxy returns no rows when using long range #537

Closed
sc0rp10 opened this issue Feb 18, 2023 · 7 comments

@sc0rp10

sc0rp10 commented Feb 18, 2023

Hi, I'm trying to use promxy as a frontend for two VictoriaMetrics instances with a very simple config:

promxy:
    server_groups:
        -
            consul_sd_configs:
                -
                    services:
                        - victoriametrics

CLI arguments are

--query.max-samples=150000000
--access-log-destination=none
--log-level=debug

Everything works like a charm except for some strange behavior when I use a long time range. For example, the query up == 1 works fine with any time range within my data retention. But when I query node_filesystem_size_bytes{hostname="foobar"}, I get no results for time ranges >= 43 days.
The response is 200 OK with the body {"status":"success","data":{"resultType":"matrix","result":[]}}, and the debug logs show nothing special.

Could anybody point out what's wrong?

Thanks!

@jacksontj
Owner

I did some poking around on this and was unable to reproduce the issue. Could you provide some more information to reproduce it? Ideally that would be an example pointing at http://demo.robustperception.io:9090 (or some other public Prometheus API), or alternatively a tcpdump (or tracelog) of the issue occurring.

@D13410N3

D13410N3 commented Apr 2, 2023

@jacksontj I've collected a tcpdump for you:
http://ux.ci/promxy-2.pcap

In this case promxy is trying to collect data with the query up{job="victoriametrics"} for just "now - 7 days" from two VictoriaMetrics instances.

No issue occurs when the request is made directly to one of the instances.

@sc0rp10 sc0rp10 changed the title from "Promxy returns to rows when using long range" to "Promxy returns no rows when using long range" on Apr 2, 2023
@jacksontj
Owner

Thanks for the pcap, that definitely clears things up quite a bit!

Detailed Explanation

In this pcap there are 4 entities:

  • requestor (10.7.168.245)
  • Promxy (10.192.4.164)
  • VM A (10.192.4.92)
  • VM B (10.192.4.60)

In the pcap we can see the client send the following query to promxy:

GET /api/v1/query_range?query=up%7Bjob%3D%22victoriametrics%22%7D&start=1679827467.432&end=1680432267.432&step=2419 HTTP/1.1\r\n
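
For readers who want to check the request by hand, the query string above can be decoded with the Python standard library. This is only an illustrative sketch; the helper name decode_query_range is made up for this example and is not part of promxy.

```python
from urllib.parse import urlsplit, parse_qs

# Request target copied from the pcap line above.
REQUEST_TARGET = (
    "/api/v1/query_range?query=up%7Bjob%3D%22victoriametrics%22%7D"
    "&start=1679827467.432&end=1680432267.432&step=2419"
)


def decode_query_range(target: str) -> dict:
    """Decode the query string of a query_range request target (hypothetical helper)."""
    params = parse_qs(urlsplit(target).query)
    start = float(params["start"][0])
    end = float(params["end"][0])
    return {
        "query": params["query"][0],
        "start": start,
        "end": end,
        "step": float(params["step"][0]),
        "duration_days": (end - start) / 86400,
    }


decoded = decode_query_range(REQUEST_TARGET)
print(decoded["query"])          # the PromQL expression, percent-decoded
print(decoded["duration_days"])  # length of the requested window in days
```

The decoded window is 604800 seconds, which matches the "now - 7 days" range described in the report.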

All good so far. When we look at the queries that promxy sends to the VM downstreams we see similar data:

# downstream A
	HTML Form URL Encoded: application/x-www-form-urlencoded
		Form item: "end" = "1680432267.432"
		Form item: "query" = "up{job="victoriametrics"}"
		Form item: "start" = "1679827467.432"
		Form item: "step" = "2419"

# downstream B
Query from Promxy to B (10.192.4.60)
	HTML Form URL Encoded: application/x-www-form-urlencoded
		Form item: "end" = "1680432267.432"
		Form item: "query" = "up{job="victoriametrics"}"
		Form item: "start" = "1679827467.432"
		Form item: "step" = "2419"

At this point things still look good, the query was effectively just passed down to the downstream VM boxes -- which is what we expect. Next if we check the response from either (looking at "A" here):

{
	"status": "success",
	"data": {
		"resultType": "matrix",
		"result": [{
			"metric": {
				"__name__": "up",
				"env": "prod",
				"hostname": "vmetrics-2.node.eu.consul",
				"instance": "10.192.4.60:8428",
				"ip": "10.192.4.60",
				"job": "victoriametrics"
			},
			"values": [
				[1679826170, "1"],
				[1679828589, "1"],
				[1679831008, "1"],
				[1679833427, "1"],
				[1679835846, "1"],
				[1679838265, "1"],
				[1679840684, "1"],
...

Which at first glance seems reasonable, but upon further inspection we note that something is off with the times:

Who                 Start           End             Duration (s)
Downstream Request  1679827467.432  1680432267.432  604800
A Response          1679826170      1680430920      604750
B Response          1679826170      1680430920      604750
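
The response timestamps in the table are consistent with the requested start and end each being rounded down to a multiple of the step (2419 s). The sketch below checks that arithmetic. It assumes plain floor-to-step alignment, which matches the numbers in this pcap but is not taken from the VictoriaMetrics source; align_to_step is a hypothetical helper name.

```python
import math


def align_to_step(ts: float, step: int) -> int:
    """Round a timestamp down to a multiple of step (assumed behavior, see lead-in)."""
    return math.floor(ts / step) * step


start, end, step = 1679827467.432, 1680432267.432, 2419

aligned_start = align_to_step(start, step)
aligned_end = align_to_step(end, step)
offset = round(start - aligned_start, 3)

print(aligned_start, aligned_end, aligned_end - aligned_start)  # 1679826170 1680430920 604750
print(offset)  # 1297.432
```

Under that assumption the aligned window is 250 steps long, and the first returned sample sits 1297.432 s before the requested start.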

To put that in more concrete terms: the downstream VictoriaMetrics boxes are returning data for times close to what was requested, but not for what was actually requested. This is an (unfortunately) already known behavior of VictoriaMetrics that was captured in #202. The short version is that VM has some internal caching, but given the way the Prometheus API contract, iterators, PromQL engine, etc. work, the times aren't "close enough" (in this case "A"'s response was 1297.432s, roughly 22 minutes, off from the requested time). This ends up working for shorter time ranges because there the "incorrect" times are "close enough". Thankfully, since this is a known issue, it can be configured around by adding nocache (https://github.com/jacksontj/promxy/blob/master/cmd/promxy/config.yaml#L54-L58).
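
One plausible way to read "close enough" is in terms of the PromQL lookback window: at each evaluation time, only the newest sample within the lookback delta is used. The sketch below assumes the Prometheus default of 5 minutes and a simplified half-open window; value_at is a hypothetical helper and not promxy's actual evaluation code.

```python
LOOKBACK_SECONDS = 300.0  # Prometheus' default lookback delta (5 minutes); assumed here


def value_at(samples, t, lookback=LOOKBACK_SECONDS):
    """Return the newest sample value in (t - lookback, t], or None if none is fresh."""
    fresh = [s for s in samples if t - lookback < s[0] <= t]
    return max(fresh, key=lambda s: s[0])[1] if fresh else None


# First samples from the "A" response and the requested start/step from the pcap.
vm_samples = [(1679826170, "1"), (1679828589, "1"), (1679831008, "1")]
start, step = 1679827467.432, 2419
eval_times = [start + k * step for k in range(2)]

default_view = [value_at(vm_samples, t) for t in eval_times]
wide_view = [value_at(vm_samples, t, lookback=1300.0) for t in eval_times]

print(default_view)  # [None, None]
print(wide_view)     # ['1', '1']
```

With a small step the same kind of offset stays below the lookback window, which would explain why shorter ranges still return data.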

TLDR

This appears to be some VictoriaMetrics caching behavior causing issues -- which can be configured around by adding nocache (https://github.com/jacksontj/promxy/blob/master/cmd/promxy/config.yaml#L54-L58)
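
As a rough illustration of what the workaround amounts to on the wire, the sketch below builds the same form body seen in the pcap with an extra nocache=1 parameter appended. It assumes the fix boils down to sending that parameter with every downstream request; downstream_form is a hypothetical helper, and the linked config.yaml remains the reference for the actual promxy setting.

```python
from urllib.parse import urlencode, parse_qs


def downstream_form(query, start, end, step, extra_params=None):
    """Build a query_range form body, optionally with extra parameters (hypothetical helper)."""
    form = {"query": query, "start": start, "end": end, "step": step}
    form.update(extra_params or {})
    return urlencode(form)


body = downstream_form(
    'up{job="victoriametrics"}',
    "1679827467.432",
    "1680432267.432",
    "2419",
    extra_params={"nocache": "1"},  # assumed parameter name and value
)
print(body)
```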

@D13410N3

D13410N3 commented Apr 3, 2023

Thanks for your detailed response!
I'll try this solution

@D13410N3

D13410N3 commented Apr 4, 2023

@jacksontj thanks, your solution is working.

But, unfortunately, it causes high CPU utilization on VM instances, even though it's expected behaviour

Is there any way to configure Promxy to make parallel requests to each instance?
In my case there are two identical VM instances (they store the same metrics). For example, when I request up{foo="bar"} with the range now - 100 days, can Promxy split it to two different requests? In that case the first VM instance would receive a request for the range [now-100 days : now-50 days], the second one for [now-50 days : now], and Promxy would then join them into one result.

Can it be realised at this moment?
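
The idea being asked about can be sketched as follows. This is a hypothetical illustration of splitting a range and merging the partial results, not an existing Promxy feature; split_range and merge_series are made-up names, and the "now" timestamp is an arbitrary example value.

```python
def split_range(start, end, parts):
    """Split [start, end] into contiguous sub-ranges, one per backend."""
    width = (end - start) / parts
    bounds = [start + i * width for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))


def merge_series(chunks):
    """Merge per-backend (timestamp, value) lists, keeping the first value per timestamp."""
    merged = {}
    for chunk in chunks:
        for ts, value in chunk:
            merged.setdefault(ts, value)
    return sorted(merged.items())


DAY = 86400
now = 1680432267  # arbitrary example timestamp
ranges = split_range(now - 100 * DAY, now, parts=2)
print(ranges)  # one (start, end) pair per VM instance

# Pretend each instance answered for its half; the boundary sample appears in both.
old_half = [(now - 100 * DAY, "1"), (now - 50 * DAY, "1")]
new_half = [(now - 50 * DAY, "1"), (now, "1")]
joined = merge_series([old_half, new_half])
print(joined)
```

Note that each half is served by a single instance in this scheme, so one instance being down leaves a gap in the joined result.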

@jacksontj
Owner

can Promxy split it to two different requests?

Sorta, this can be achieved with relative or absolute time filters -- basically creating 2 server groups, one for "recent" and one for "old". But this wouldn't be HA -- it would just "shard" the query across those 2 nodes for a performance improvement at the expense of redundancy (think RAID 0 vs RAID 1 -- if that's a helpful analogy).

Can it be realised at this moment?

The only other hacky solution I can think of is to mess with the LookbackDelta (I think they changed the name, but that option) to have promql honor those incorrect timestamps from VM. Another idea is that we could hack the same time adjustment into promxy (basically implement the logic here), but I'm not so sure about that, as we'd start breaking the API contract ourselves... I'll have to think about that some.

it causes high CPU utilization on VM instances

Out of curiosity, do you have some data on how large the increase was (how much QPS, and what the CPU utilization was before and after)? In general this is unfortunate, as there is seemingly no way to enable caching but still honor the API contract. (Remember, the issue here was that the VM response didn't adhere to the start/end defined in the API call.) If the caller is something like Grafana, you could consider using Trickster to cache at the API layer -- that does add complexity but may be a reasonable approach? I do have an issue open to add caching to promxy, but that is a relatively large lift and hasn't been a major priority for most.

@jacksontj
Owner

After some consideration I don't think it's a good plan to implement the same adjust logic within promxy (as it does break the Prometheus API contract). That being said, making a middleware/proxy to do the VMAdjust for the query_range method should be pretty trivial (I hacked it up and it adds ~200k lines of dependencies, so it's just a bit much to add to this project -- which already has so many dependencies). IMO this sort of timestamp adjusting would make sense as a VM middleware proxy -- since it's their custom logic (which is non-standard and technically a violation of the API contract).

@jacksontj jacksontj closed this as not planned on Apr 16, 2023