
Sample/s difference between 1.1.0 and 1.2.1 #2128

Closed · lightpriest opened this issue on Oct 27, 2016 · 5 comments

lightpriest commented Oct 27, 2016

This is more of a general question than a bug report.

We're running two identical instances of Prometheus, one on 1.1.0 and a mirror which we upgraded to 1.2.1. I've noticed that 1.1.0 consistently reports a higher sample ingestion rate (rate(prometheus_local_storage_ingested_samples_total[5m])). Is this something I should worry about? Is it a regression, or just a change in how samples are counted?
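For reference, this is the comparison query; assuming both servers are also scraped by a third Prometheus (so each shows up under its own instance label), the two rates can be plotted on a single graph:

```
# Sample ingestion rate over the last 5 minutes, per Prometheus server.
# Assumes both the 1.1.0 server and the 1.2.1 mirror are scraped by the
# server evaluating this query, so the instance label tells them apart.
rate(prometheus_local_storage_ingested_samples_total[5m])
```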

The two instances are completely identical: same configuration, same flags, and same job and target definitions. There is a slight delay (10 minutes at most) until file_sd picks up updated file contents, though.

[Screenshot attached: 2016-10-27 18-38-19]

In the attached screenshot, the series marked "mirror" in the legend is 1.2.1 and the unmarked one is 1.1.0.
The focus should be on the "Sample rate" graph.

brian-brazil commented Oct 27, 2016

It looks like the 1.2.1 instance has more targets. Can you see what the difference is there?
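A per-job target count, run on both servers, should show where they diverge, e.g.:

```
# Number of targets each server currently sees, broken down by job.
# Comparing the result from both servers shows which job gained or lost targets.
count by (job) (up)
```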

lightpriest commented Oct 27, 2016

Oh, sorry, this graph is a bit confusing.
The bottom part is actually up == 0 and the upper part is up == 1.
It's hard to see because the two series (the main one and the mirror) overlap almost completely.
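The two parts of that panel correspond to queries along the lines of:

```
# Upper part of the graph: targets that are currently up.
count(up == 1)

# Bottom part of the graph: targets that are currently down.
count(up == 0)
```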

brian-brazil commented Oct 27, 2016

There appears to be a difference in target scrapes. My best guess is that you're getting more timeouts.

Could you binary-search the config to narrow things down?
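A quick way to spot timing-out targets is to look for scrape durations pinned at the configured timeout, for example (assuming the default 10s scrape_timeout):

```
# Scrapes whose duration sits right at the scrape timeout (assumed 10s here)
# usually indicate targets that time out rather than fail fast.
scrape_duration_seconds > 9.5
```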

lightpriest commented Oct 30, 2016

The configuration is identical (process flags and prometheus.conf).
I also diffed the file_sd targets, and they're identical as well. I think that if the files were different, it would be visible in the "Targets (up)" graph.

I'm seeing that "Scrape duration (0.9, 0.99)" for 1.2.1 is slightly higher. That's the prometheus_target_interval_length_seconds metric (I borrowed it from Grafana's Prometheus example dashboard). I guess it's possible that 1.2.1 is on a slower host/network (EC2).
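For reference, that panel in the example dashboard is based on the quantiles of this summary metric, roughly:

```
# 90th and 99th percentile of the actual interval between consecutive scrapes,
# as exported by Prometheus itself.
prometheus_target_interval_length_seconds{quantile=~"0.9|0.99"}
```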

The whole reason I'm doing this check is that I was already on 1.2.1 on both instances and had a very problematic scrape/s rate (varying between 10K and 60K). I traced it back to the upgrade to 1.2.1, so I downgraded one instance to compare them (and reset the data directory completely). What can I use to see how target_interval_length_seconds is affected? Any other similar metrics/flags/logs?

grobie closed this on Mar 5, 2017

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators on Mar 23, 2019
