[Regression] Sample ingestion and latency worse after upgrade to 1.5.0 #2373

RichiH commented Jan 27, 2017

After upgrading to 1.5.0 from 1.4.1, I see increased jitter in sample ingestion and higher internal latencies. The upgrade time should be obvious from the graphs. The drop-off at the end is due to a system reload after a kernel upgrade.

https://snapshot.raintank.io/dashboard/snapshot/US5WvmgcNdtwZjCZNPHgilvt86G2VutD
https://snapshot.raintank.io/dashboard/snapshot/4t87AZipjxQL4Ia2r7efusm8qjXzalN1
Comments
Can you get a zoomed-in graph with https://grafana.net/dashboards/1244 ?
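The benchmark dashboard linked above presumably graphs ingestion as a rate over the ingested-samples counter; a minimal sketch, assuming the Prometheus 1.x local-storage metric name:

```
# Per-second sample ingestion rate over a 5m window
rate(prometheus_local_storage_ingested_samples_total[5m])
```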
Are you sure that's of the same server? The ingestion rate isn't the same.
Positive, yes. I can't say for sure when teams change their monitored targets, but these were stable over time:
Well, and not "by job".
There's a difference of 2k/s between those two. Can you get me both graphs zoomed into a 30m period?
The ingestion rate looks fairly normal there; the effect you see in the initial report is due to aliasing. I'm not sure exactly what's going on with the evals. My guess is that due to reduced memory usage, GC is happening more often and thus pushing out the tail. Can you graph the average latency with an irate?
They're summaries, use the _sum and _count.
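A minimal sketch of that query, assuming the summary being graphed is the 1.x rule-evaluation duration metric (treat prometheus_evaluator_duration_seconds as a placeholder for whichever summary is actually on the dashboard):

```
# Average latency from a summary's _sum and _count, using irate as asked above
# so short spikes aren't smoothed away
irate(prometheus_evaluator_duration_seconds_sum[5m])
/
irate(prometheus_evaluator_duration_seconds_count[5m])
```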
That's
Can you zoom that in and add the rules too?
That looks less worrying. The only correlation I see for the spikes is with checkpoints completing. There's also a minor correlation with indexing, so maybe there's something around the rules creating new timeseries. Have there been changes in your environment that'd affect what the rules were doing?
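One way to check that correlation is to plot the storage's own bookkeeping next to the latency graphs; a sketch, with metric names assumed from the 1.x local storage (adjust to whatever your server actually exposes):

```
# Series churn: a rising count suggests rules (or targets) creating new time series
prometheus_local_storage_memory_series

# Indexing backlog: spikes here line up with indexing work
prometheus_local_storage_indexing_queue_length
```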
TTBOMK, there have been no changes other than me compiling and restarting Prometheus.
The graphs indicate some form of change to the indexing.
I see no change in
jeinwag commented Jan 30, 2017

I'm currently seeing a similar issue. I upgraded from 1.4.1 to 1.5.0 a little over an hour ago, at ca. 15:30. This isn't due to changes in our environment.

EDIT: Link to benchmark dashboard:

EDIT2: It looks like it's got to do with consul service discovery, at least in our case. The "up" count of jobs which use consul SD has been fluctuating since the update; jobs with static target definitions are fine. I'm now certain that my issue is rather related to #2377.
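A simple way to watch that flapping is to graph the number of healthy targets per job; nothing here is specific to this thread beyond the standard up metric:

```
# Targets currently up, per job; consul-SD jobs fluctuating while static jobs
# stay flat points at service discovery rather than ingestion
sum by (job) (up)
```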
I'm currently assessing performance impact at SC. So far, everything is pointing to "faster with less RAM". :)
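For making the same before/after comparison, the memory side can be eyeballed from the server's own process metrics; a minimal sketch (the job="prometheus" selector is an assumption about how the server scrapes itself):

```
# Resident memory of the Prometheus server itself, before vs. after the upgrade
process_resident_memory_bytes{job="prometheus"}
```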
Wild guess: could you try to set …? It could be that 1.5.x creates more lock contention, and that flag would help with that. And I'm genuinely curious if it does.
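The flag name was lost from the comment above; given the mention of lock contention it was presumably the 1.x local-storage setting -storage.local.num-fingerprint-mutexes, so a hedged sketch of what trying it could look like (the value is an arbitrary illustration, not taken from this thread):

```
# Raise the number of fingerprint mutexes well above the default to reduce
# lock contention in the 1.x local storage (value chosen arbitrarily here)
prometheus -storage.local.num-fingerprint-mutexes=50000 -config.file=prometheus.yml
```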
This seems to have been the relabelling issue, combined with graph artifacts. 1.5.2 should work fine.
brian-brazil closed this on Feb 15, 2017
lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.