Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus retention config not resolved correctly when using defaults in prom #5734

Closed
israel-hdez opened this issue Jan 19, 2023 · 6 comments · Fixed by #6958
Closed

Prometheus retention config not resolved correctly when using defaults in prom #5734

israel-hdez opened this issue Jan 19, 2023 · 6 comments · Fixed by #6958
Assignees
Labels
backlog Triaged Issue added to backlog bug Something isn't working

Comments

@israel-hdez
Copy link
Member

israel-hdez commented Jan 19, 2023

Since long ago, Kiali fetches prometheus retention settings from Prometheus's config endpoints. Specifically, and apparently, retention settings seem to be retrieved using the /api/v1/status/flags endpoint of Prometheus.

When retention settings are set (via command line args when starting Prometheus), the /api/v1/status/flags will correctly list such configuration. However, if related CLI args are omitted, this endpoint may provide 0s and won't reflect the retention that went into effect (which at the time of writing is 15 days).

Looks like a better way to fetch the retention interval is by querying the /api/v1/status/runtimeinfo endpoint which seems to provide the configuration in effect, although if multiple retention settings are set, it takes more complex forms like the following:

image

@israel-hdez israel-hdez added the bug Something isn't working label Jan 19, 2023
@israel-hdez israel-hdez changed the title Prometheus retention config not resolved correctly Prometheus retention config not resolved correctly when using defaults in prom Jan 19, 2023
@jshaughn
Copy link
Collaborator

jshaughn commented Feb 8, 2023

We drive a few things off of the time-based retention, I don't think there is much we can do about the size-based retention. It is important that we get it right, so I guess you are saying that if storage.tsdb.retention.time is 0, we should not just use our default, but then try this alternate method?

@jshaughn jshaughn added the backlog Triaged Issue added to backlog label Feb 8, 2023
@israel-hdez
Copy link
Member Author

I guess you are saying that if storage.tsdb.retention.time is 0, we should not just use our default, but then try this alternate method

Either that, or always use the runtimeinfo endpoint, as we would need to add code to parse that field anyway.

Or perhaps, simply change Kiali's default to 15days (rather than 6 hours) to be aligned to what Prometheus would do, assuming it's going to be hard that Prometheus would ever change that default.

@jshaughn
Copy link
Collaborator

jshaughn commented Dec 6, 2023

@jmazzitelli , @nrfox , We've been reducing our use of the debug endpoints, what happens when api is disabled? Even if it isn't disabled, I wonder if we should do something here?

@nrfox
Copy link
Contributor

nrfox commented Dec 6, 2023

@jshaughn from what I can tell, this issue is not about istio. It's solely about prometheus and which endpoint to query for prometheus.

It looks like there's two recommendations here:

  1. Align Kiali default retention period with prometheus
  2. Use the correct endpoint /api/v1/status/runtimeinfo to get the retention info.

Seems like we could do both fairly easily.

@jshaughn
Copy link
Collaborator

jshaughn commented Dec 6, 2023

this issue is not about istio.

@nrfox Oh, right (duh moment for me). I think we should probably do this, especially with the increasing possibility of unified stores, etc.

@jmazzitelli jmazzitelli self-assigned this Dec 14, 2023
@jmazzitelli
Copy link
Collaborator

jmazzitelli commented Dec 14, 2023

Just some more details on prometheus storage retention config.

See https://prometheus.io/docs/prometheus/latest/storage/

--storage.tsdb.retention.time: When to remove old data. Defaults to 15d.
--storage.tsdb.retention.size: The maximum number of bytes of storage blocks to retain.
The oldest data will be removed first. Defaults to 0 or disabled.
Units supported: B, KB, MB, GB, TB, PB, EB. Ex: "512MB". Based on powers-of-2, so 1KB is 1024B.
If both time and size retention policies are specified, whichever triggers first will be used.

So the /api/v1/status/runtimeinfo endpoint's data.storageRetention field will be in the form of one of these three:

  1. time-based only: 15d
  2. size-based only: 100GB
  3. time-based and size-based: 15d or 100GB

We would need to parse that string and ignore the first space character and everything to the right of it. That resulting string may not be time-based - if we determine it is not time-based, we would then use the default 15d.

Based on this page, the format of the time-based retention period is:

... the retention time defaults to 15d. Units Supported: y, w, d, h, m, s, ms.

We know a value is size-based if it has a "B" in it (even doing a case-insensitive test). This is because all valid size units have a "B" in it, but no time-based units have a "B" in it. So that's an easy way to determine if the value we have is time-based or size-based.

So easy pseudo-code that allows us to determine what the storage retention period is could be:

str = JSON[data.storageRetention]
if str has a space character then
  str = the first word in str
endif 
if lowercase(str) does not contains "b" then
  return str
else
  return "15d"
endif

or

// parseStorageRetentionTime parses the storage retention as reported by Prometheus endpoint /api/v1/status/runtimeinfo
// see:
//   https://prometheus.io/docs/prometheus/latest/command-line/prometheus/
//   https://prometheus.io/docs/prometheus/latest/storage/
func parseStorageRetentionTime(str string) string {
   if strings.Index(str, " ") != -1 {
      str = (strings.Fields(str))[0]
   }
   if !strings.Contains(strings.ToLower(str), "b") {
      return str
   }
   return "15d" // only size-based retention is configured; return the Prometheus default of 15d
}

jmazzitelli added a commit to jmazzitelli/kiali that referenced this issue Dec 14, 2023
jmazzitelli added a commit that referenced this issue Dec 19, 2023
…od (#6958)

* get the prometheus runtime info in order to obtain the retention period

fixes: #5734

* linter suggestion
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlog Triaged Issue added to backlog bug Something isn't working
Projects
Development

Successfully merging a pull request may close this issue.

4 participants