
Large number of metrics causes excessive traffic and target-side computation load #2686

Closed
avikivity opened this Issue May 8, 2017 · 6 comments

avikivity commented May 8, 2017

What did you do?

I'm monitoring Scylla. The target exposes a large number of metrics. Those metrics are per-CPU, so a target with 48 logical cores (not uncommon) has 48 times as many metrics as a target with 1 logical core (very rare).

What did you expect to see?

Everything working smoothly.

What did you see instead? Under which circumstances?

Everything is working smoothly. However, the target sends 1.2 MB per scrape and has to spend considerable effort creating the response. This creates latency in the thread that prepares the response.

Environment

Not relevant -- the problem is not in the Prometheus server but in the target. However, the problem can be fixed by improving the protocol (and Prometheus will benefit from having less data to process).

I think this could be much improved by using a "prepared statement" type of protocol. Instead of having a single scrape endpoint, have two. The first one returns the metadata: all the various metrics, their types, and label sets, and assigns a version number to the metadata. The second one accepts a metadata version number, and responds with just the metrics data if the version numbers match, or an error otherwise (prompting Prometheus to re-fetch the metadata). Alternative approaches are possible, perhaps a single get-if-changed endpoint that accepts the metadata version number in a header, and responds with just the data (on match) or both data and metadata (on mismatch).

This could reduce the size of the response by more than an order of magnitude, and reduce processing time for both the target and Prometheus significantly.
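
For illustration only, here is a rough sketch of the scraper side of such a scheme; the endpoint paths, the X-Metadata-Version header, and the use of HTTP 412 are all made up for this example and are not part of any existing Prometheus protocol:

package main

import (
    "fmt"
    "io"
    "net/http"
)

// scrapeData asks the target for sample data only, presenting the metadata
// version cached from an earlier scrape. Endpoint paths, the header name,
// and the 412 convention are all hypothetical -- a sketch of the proposal,
// not an existing Prometheus API.
func scrapeData(base, metadataVersion string) (*http.Response, error) {
    req, err := http.NewRequest("GET", base+"/metrics/data", nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("X-Metadata-Version", metadataVersion)
    return http.DefaultClient.Do(req)
}

func main() {
    base := "http://target:9180" // hypothetical metrics address
    version := ""                // no metadata cached yet

    resp, err := scrapeData(base, version)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode == http.StatusPreconditionFailed {
        // Our metadata is stale (or missing): fetch the large, rarely
        // changing metadata once, cache it, and scrape the data again.
        meta, err := http.Get(base + "/metrics/metadata")
        if err != nil {
            panic(err)
        }
        defer meta.Body.Close()
        io.Copy(io.Discard, meta.Body) // parse and cache the metadata here
        version = meta.Header.Get("X-Metadata-Version")

        resp, err = scrapeData(base, version)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
    }

    data, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }
    fmt.Printf("scraped %d bytes of sample data (metadata version %q)\n", len(data), version)
}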

brian-brazil (Member) commented May 8, 2017

The second one accepts a metadata version number, and responds with just the metrics data if the version numbers match, or an error otherwise (prompting Prometheus to re-fetch the metadata).

This would require keeping a cache of every single scrape - which we would rarely get a hit on.

This could reduce the size of the response by more than an order of magnitude, and reduce processing time for both the target and Prometheus significantly.

This wouldn't help with either. Determining the metadata requires the exact same work as determining the data, as metrics and time series can change from scrape to scrape. It's also really fast to parse.

We also currently throw away the metadata.

avikivity (Author) commented May 8, 2017

Does metrics metadata really change from scrape to scrape? I would imagine it doesn't (usually the monitored series doesn't change), so you'd need a cache the size of the number of targets you monitor, and the hit rate would be 100%. We're probably not on the same page here.

Here's the first 1k of a scrape:

00000000  f0 09 0a 32 73 63 79 6c  6c 61 5f 73 74 6f 72 61  |...2scylla_stora|
00000010  67 65 5f 70 72 6f 78 79  5f 63 6f 6f 72 64 69 6e  |ge_proxy_coordin|
00000020  61 74 6f 72 5f 72 65 61  64 73 5f 72 65 6d 6f 74  |ator_reads_remot|
00000030  65 5f 6e 6f 64 65 12 24  77 68 65 72 65 20 64 69  |e_node.$where di|
00000040  64 20 74 68 65 20 64 65  73 63 72 69 70 74 69 6f  |d the descriptio|
00000050  6e 20 64 69 73 61 70 70  65 61 72 3f 18 00 22 7f  |n disappear?..".|
00000060  0a 2a 0a 08 69 6e 73 74  61 6e 63 65 12 1e 6e 31  |.*..instance..n1|
00000070  2e 73 61 68 61 72 61 2e  63 6c 6f 75 64 69 75 73  |.sahara.cloudius|
00000080  2d 73 79 73 74 65 6d 73  2e 63 6f 6d 0a 19 0a 0a  |-systems.com....|
00000090  64 61 74 61 63 65 6e 74  65 72 12 0b 64 61 74 61  |datacenter..data|
000000a0  63 65 6e 74 65 72 31 0a  0f 0a 07 6f 70 5f 74 79  |center1....op_ty|
000000b0  70 65 12 04 64 61 74 61  0a 0a 0a 05 73 68 61 72  |pe..data....shar|
000000c0  64 12 01 31 0a 0e 0a 04  74 79 70 65 12 06 64 65  |d..1....type..de|
000000d0  72 69 76 65 1a 09 09 00  00 00 00 00 00 08 40 22  |rive..........@"|
000000e0  7f 0a 2a 0a 08 69 6e 73  74 61 6e 63 65 12 1e 6e  |..*..instance..n|
000000f0  31 2e 73 61 68 61 72 61  2e 63 6c 6f 75 64 69 75  |1.sahara.cloudiu|
00000100  73 2d 73 79 73 74 65 6d  73 2e 63 6f 6d 0a 19 0a  |s-systems.com...|
00000110  0a 64 61 74 61 63 65 6e  74 65 72 12 0b 64 61 74  |.datacenter..dat|
00000120  61 63 65 6e 74 65 72 31  0a 0f 0a 07 6f 70 5f 74  |acenter1....op_t|
00000130  79 70 65 12 04 64 61 74  61 0a 0a 0a 05 73 68 61  |ype..data....sha|
00000140  72 64 12 01 32 0a 0e 0a  04 74 79 70 65 12 06 64  |rd..2....type..d|
00000150  65 72 69 76 65 1a 09 09  00 00 00 00 00 00 08 40  |erive..........@|
00000160  22 7f 0a 2a 0a 08 69 6e  73 74 61 6e 63 65 12 1e  |"..*..instance..|
00000170  6e 31 2e 73 61 68 61 72  61 2e 63 6c 6f 75 64 69  |n1.sahara.cloudi|
00000180  75 73 2d 73 79 73 74 65  6d 73 2e 63 6f 6d 0a 19  |us-systems.com..|
00000190  0a 0a 64 61 74 61 63 65  6e 74 65 72 12 0b 64 61  |..datacenter..da|
000001a0  74 61 63 65 6e 74 65 72  31 0a 0f 0a 07 6f 70 5f  |tacenter1....op_|
000001b0  74 79 70 65 12 04 64 61  74 61 0a 0a 0a 05 73 68  |type..data....sh|
000001c0  61 72 64 12 01 33 0a 0e  0a 04 74 79 70 65 12 06  |ard..3....type..|
000001d0  64 65 72 69 76 65 1a 09  09 00 00 00 00 00 00 08  |derive..........|
000001e0  40 22 7f 0a 2a 0a 08 69  6e 73 74 61 6e 63 65 12  |@"..*..instance.|
000001f0  1e 6e 31 2e 73 61 68 61  72 61 2e 63 6c 6f 75 64  |.n1.sahara.cloud|
00000200  69 75 73 2d 73 79 73 74  65 6d 73 2e 63 6f 6d 0a  |ius-systems.com.|
00000210  19 0a 0a 64 61 74 61 63  65 6e 74 65 72 12 0b 64  |...datacenter..d|
00000220  61 74 61 63 65 6e 74 65  72 31 0a 0f 0a 07 6f 70  |atacenter1....op|
00000230  5f 74 79 70 65 12 04 64  61 74 61 0a 0a 0a 05 73  |_type..data....s|
00000240  68 61 72 64 12 01 34 0a  0e 0a 04 74 79 70 65 12  |hard..4....type.|
00000250  06 64 65 72 69 76 65 1a  09 09 00 00 00 00 00 00  |.derive.........|
00000260  08 40 22 7f 0a 2a 0a 08  69 6e 73 74 61 6e 63 65  |.@"..*..instance|
00000270  12 1e 6e 31 2e 73 61 68  61 72 61 2e 63 6c 6f 75  |..n1.sahara.clou|
00000280  64 69 75 73 2d 73 79 73  74 65 6d 73 2e 63 6f 6d  |dius-systems.com|
00000290  0a 19 0a 0a 64 61 74 61  63 65 6e 74 65 72 12 0b  |....datacenter..|
000002a0  64 61 74 61 63 65 6e 74  65 72 31 0a 0f 0a 07 6f  |datacenter1....o|
000002b0  70 5f 74 79 70 65 12 04  64 61 74 61 0a 0a 0a 05  |p_type..data....|
000002c0  73 68 61 72 64 12 01 35  0a 0e 0a 04 74 79 70 65  |shard..5....type|
000002d0  12 06 64 65 72 69 76 65  1a 09 09 00 00 00 00 00  |..derive........|
000002e0  00 00 40 22 7f 0a 2a 0a  08 69 6e 73 74 61 6e 63  |..@"..*..instanc|
000002f0  65 12 1e 6e 31 2e 73 61  68 61 72 61 2e 63 6c 6f  |e..n1.sahara.clo|
00000300  75 64 69 75 73 2d 73 79  73 74 65 6d 73 2e 63 6f  |udius-systems.co|
00000310  6d 0a 19 0a 0a 64 61 74  61 63 65 6e 74 65 72 12  |m....datacenter.|
00000320  0b 64 61 74 61 63 65 6e  74 65 72 31 0a 0f 0a 07  |.datacenter1....|
00000330  6f 70 5f 74 79 70 65 12  04 64 61 74 61 0a 0a 0a  |op_type..data...|
00000340  05 73 68 61 72 64 12 01  36 0a 0e 0a 04 74 79 70  |.shard..6....typ|
00000350  65 12 06 64 65 72 69 76  65 1a 09 09 00 00 00 00  |e..derive.......|
00000360  00 00 00 40 22 7f 0a 2a  0a 08 69 6e 73 74 61 6e  |...@"..*..instan|
00000370  63 65 12 1e 6e 31 2e 73  61 68 61 72 61 2e 63 6c  |ce..n1.sahara.cl|
00000380  6f 75 64 69 75 73 2d 73  79 73 74 65 6d 73 2e 63  |oudius-systems.c|
00000390  6f 6d 0a 19 0a 0a 64 61  74 61 63 65 6e 74 65 72  |om....datacenter|
000003a0  12 0b 64 61 74 61 63 65  6e 74 65 72 31 0a 0f 0a  |..datacenter1...|
000003b0  07 6f 70 5f 74 79 70 65  12 04 64 61 74 61 0a 0a  |.op_type..data..|
000003c0  0a 05 73 68 61 72 64 12  01 37 0a 0e 0a 04 74 79  |..shard..7....ty|
000003d0  70 65 12 06 64 65 72 69  76 65 1a 09 09 00 00 00  |pe..derive......|
000003e0  00 00 00 00 40 22 7f 0a  2a 0a 08 69 6e 73 74 61  |....@"..*..insta|
000003f0  6e 63 65 12 1e 6e 31 2e  73 61 68 61 72 61 2e 63  |nce..n1.sahara.c|

There are about 7.5 metrics in that first kilobyte, so each metric takes ~136 bytes of data+metadata. Of that, ~128 bytes of metadata are exactly the same between scrapes, with just 8 bytes of data changing.

juliusv (Member) commented May 8, 2017

@avikivity Prometheus's scrape protocol is very well-established, so I don't think such a change (that also makes the protocol and its implementations significantly more complex) will happen.

Going back to your original problem report:

"Everything is working smoothly. However, the target sends 1.2 MB per scrape, and has to spend a considerable effort to create the response. This creates latency in the thread that prepares the response."

Is latency in the response-preparing thread a problem if that happens on an independent thread? What is the latency you are seeing? Do you know where the latency is actually coming from? From producing the values, or assembling them, or serializing, or sending them?

avikivity (Author) commented May 8, 2017

@juliusv of course, such a change, if made, would not remove the 0.0.4 format but rather define a new format which would be supported in parallel. Existing clients would not be impacted.

I don't think it's significantly more complex. You could still have a top-level MetricFamily type, which would look something like this:

message MetricFamily {
    required uint32 metadata_version = 1;
    optional Metadata metadata = 2;
    repeated Data data = 3;  // described in metadata
}

Targets can fill in metadata unconditionally if they don't care about reducing bandwidth.
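
Correspondingly, a rough sketch of the target side against that hypothetical schema, with plain Go structs standing in for the code protoc would generate (all names here are hypothetical):

package main

// Plain structs standing in for generated protobuf code for the sketched
// schema above; every name here is hypothetical.
type Metadata struct{ /* metric names, types, label sets */ }
type Data struct{ Value float64 }

type MetricFamily struct {
    MetadataVersion uint32
    Metadata        *Metadata // heavy, rarely changes
    Data            []Data    // small, changes every scrape
}

// respond attaches the metadata only when the scraper's cached version is
// stale; targets that don't care about bandwidth can attach it every time.
func respond(current *MetricFamily, scraperVersion uint32) *MetricFamily {
    out := &MetricFamily{
        MetadataVersion: current.MetadataVersion,
        Data:            current.Data,
    }
    if scraperVersion != current.MetadataVersion {
        out.Metadata = current.Metadata
    }
    return out
}

func main() {
    fam := &MetricFamily{MetadataVersion: 7, Metadata: &Metadata{}, Data: []Data{{Value: 3}}}
    _ = respond(fam, 7) // up-to-date scraper: no metadata in the response
    _ = respond(fam, 6) // stale scraper: metadata included
}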

Back to my system: it uses a thread-per-core architecture with user-space scheduling (not unlike Go's), so I don't have independent threads. I can yield within the thread (similar to the Gosched() function), but as the number of labels or metrics grows, the CPU and bandwidth consumption makes an impact.

I am seeing about 20 ms of latency from this code (that includes assembling and serialization, but not production or transmission). This is running in a database that can serve requests with sub-1 ms 99th-percentile latency under some loads, so these 20 ms have a huge impact.

I plan to optimize it by having the message object survive from iteration to iteration and changing just the data (essentially applying my proposal to the assembly part). That's just a band-aid that may buy me some time, but it won't solve the problem completely.
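
As a rough sketch of that band-aid (simplified stand-in types and a made-up metric name, not the real client library or Scylla code): the immutable series strings are assembled once and kept across scrapes, and only the values are re-read when the response is rendered.

package main

import (
    "fmt"
    "strings"
)

// sample pairs a pre-rendered, never-changing series string with a pointer
// to the live value it should report.
type sample struct {
    series string   // e.g. `scylla_reads{shard="1",op_type="data"}`
    value  *float64 // points at the counter owned by the metric source
}

// exposition is built once; each scrape only re-reads the values, so the
// expensive name/label assembly is not repeated per scrape.
type exposition struct{ samples []sample }

func (e *exposition) render() string {
    var b strings.Builder
    for _, s := range e.samples {
        fmt.Fprintf(&b, "%s %g\n", s.series, *s.value)
    }
    return b.String()
}

func main() {
    reads := 3.0
    exp := exposition{samples: []sample{{series: `scylla_reads{shard="1"}`, value: &reads}}}
    fmt.Print(exp.render())

    reads = 4.0 // next scrape: only the value changed, nothing is rebuilt
    fmt.Print(exp.render())
}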

brian-brazil (Member) commented Jul 14, 2017

I think the best option here is to hand-assemble the text format with just the names and values, no metadata.
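
For illustration, a minimal sketch of that approach: emit only the sample lines of the text exposition format and omit the optional # HELP and # TYPE lines (a real exporter would also need to escape label values). The metric name and labels are taken from the hexdump above.

package main

import (
    "fmt"
    "os"
)

// writeMinimal hand-assembles the text exposition format with only
// `name{labels} value` sample lines and no metadata lines.
func writeMinimal(samples map[string]float64) {
    for series, v := range samples {
        fmt.Fprintf(os.Stdout, "%s %g\n", series, v)
    }
}

func main() {
    writeMinimal(map[string]float64{
        `scylla_storage_proxy_coordinator_reads_remote_node{shard="1",op_type="data"}`: 3,
    })
    // Output (one line per sample, no # HELP / # TYPE):
    // scylla_storage_proxy_coordinator_reads_remote_node{shard="1",op_type="data"} 3
}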

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
