
[Regression] Sample ingestion and latency worse after upgrade to 1.5.0 #2373

Closed
RichiH opened this Issue Jan 27, 2017 · 24 comments


RichiH commented Jan 27, 2017

After upgrading to 1.5.0 from 1.4.1, I see increased jitter in sample ingestion and higher internal latencies. The upgrade time should be obvious from the graphs. The drop-off at the end is due to a system reload after a kernel upgrade.

https://snapshot.raintank.io/dashboard/snapshot/US5WvmgcNdtwZjCZNPHgilvt86G2VutD
https://snapshot.raintank.io/dashboard/snapshot/4t87AZipjxQL4Ia2r7efusm8qjXzalN1

brian-brazil commented Jan 27, 2017

Can you get a zoomed in graph with https://grafana.net/dashboards/1244 ?

RichiH's comment has been minimized.

brian-brazil commented Jan 27, 2017

Are you sure that's of the same server? The ingestion rate isn't the same.

RichiH commented Jan 27, 2017

Positive, yes. I can't say for sure when teams change their monitored targets, but these were stable over time:

sum(up) by (job)
count(up) by (job)

RichiH commented Jan 27, 2017

Well, and not "by job"

brian-brazil commented Jan 27, 2017

There's a difference of 2k/s between those two. Can you get me both graphs zoomed into a 30m period?
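
For context, the ingestion-rate panels being compared here are typically graphed with queries along these lines; the metric name below assumes the Prometheus 1.x local-storage counter:

# smoothed ingestion rate
rate(prometheus_local_storage_ingested_samples_total[5m])
# instantaneous rate over the last two samples; shows short dips that a longer rate window averages away
irate(prometheus_local_storage_ingested_samples_total[1m])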

RichiH commented Jan 27, 2017

(zoomed-in graphs attached)

brian-brazil commented Jan 27, 2017

The ingestion rate looks fairly normal there; the effect you see in the initial report is due to aliasing.

I'm not sure exactly what's going on with the evals. My guess is that due to reduced memory usage, GC is happening more often and thus pushing out the tail. Can you graph the average latency with an irate?
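
One rough way to check that GC theory, assuming the server scrapes its own /metrics and the standard Go client metrics are present:

# GC cycles per second
rate(go_gc_duration_seconds_count[5m])
# longest recent GC pause
go_gc_duration_seconds{quantile="1"}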

RichiH commented Jan 27, 2017

prometheus_notifications_latency_seconds only contains quantiles; what precisely do you want the average of?

brian-brazil commented Jan 27, 2017

They're summaries; use the _sum and _count.
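
In other words, the average is the ratio of the two counters; a sketch with a 2m range:

# average notification latency in seconds
irate(prometheus_notifications_latency_seconds_sum[2m]) / irate(prometheus_notifications_latency_seconds_count[2m])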

RichiH's comment has been minimized.

RichiH commented Jan 27, 2017

That's irate([2m])

brian-brazil commented Jan 27, 2017

Can you zoom that in and add the rules too?

RichiH's comment has been minimized.

brian-brazil commented Jan 27, 2017

That looks less worrying. The only correlation I see for the spikes is with checkpoints completing. There's also a minor correlation with indexing, so maybe there's something around the rules creating new timeseries.

Have there been changes in your environment that'd affect what the rules were doing?

RichiH commented Jan 27, 2017

To the best of my knowledge, there have been no changes other than me compiling and restarting Prometheus.

brian-brazil commented Jan 27, 2017

The graphs indicate some form of change to the indexing.

RichiH commented Jan 27, 2017

I see no change in count(up) during that time.

jeinwag commented Jan 30, 2017

I'm currently seeing a similar issue. I upgraded from 1.4.1 to 1.5.0 a little over an hour ago, at ca. 15:30:
https://snapshot.raintank.io/dashboard/snapshot/h5F1k4mvnOaO1uvJCH2huebFc8wl5OyI

This isn't due to changes in our environment.

EDIT: Link to benchmark dashboard:
https://snapshot.raintank.io/dashboard/snapshot/PmQdXZ9lZZLAg77GoPzS8ukQyNph1gwJ

EDIT2: It looks like it has to do with Consul service discovery, at least in our case. The "up" count of jobs that use Consul SD has been fluctuating since the update; jobs with static target definitions are fine.

I'm now certain that my issue is actually related to #2377.

beorn7 commented Jan 31, 2017

I'm currently assessing performance impact at SC. So far, everything is pointing to "faster with less RAM". :)

beorn7 commented Feb 10, 2017

Wild guess: could you try to set -storage.local.num-fingerprint-mutexes=100000 on the command line?

It could be that 1.5.x creates more lock contention, and the flag above would help with that. I'm genuinely curious whether it does.
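
If it helps, the flag simply gets appended to the existing Prometheus 1.x invocation; the config path below is only a placeholder:

prometheus -config.file=/etc/prometheus/prometheus.yml \
  -storage.local.num-fingerprint-mutexes=100000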

brian-brazil commented Feb 15, 2017

This seems to have been the relabelling issue, combined with graph artifacts. 1.5.2 should work fine.

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
