Size NESE allocation correctly for nerc-ocp-infra cluster #38
Comments
From the slack conversation:
Out of curiosity I was looking at
Things that stood out to me:
I will be looking into alerts that can watch the available space of our Ceph Cluster. Here are some metrics I have found that can help. I need to make sure that all of these fields are also sent to Observability, because some are missing:
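As a sketch of what such an alert could look like, here is a hypothetical PrometheusRule built on the standard Ceph exporter metrics `ceph_cluster_total_bytes` and `ceph_cluster_total_used_bytes`. The rule name, namespace, threshold, and duration are placeholders, not values NERC has chosen:

```yaml
# Hypothetical sketch; the 85% threshold and all names are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ceph-capacity-alerts        # placeholder name
  namespace: openshift-monitoring   # placeholder namespace
spec:
  groups:
    - name: ceph-capacity
      rules:
        - alert: CephClusterNearFull
          # Fires when used capacity exceeds 85% of total for 30 minutes.
          expr: |
            ceph_cluster_total_used_bytes / ceph_cluster_total_bytes > 0.85
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "Ceph cluster is over 85% full"
```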
I was attempting to configure retention of Loki logs with this PR OCP-on-NERC/nerc-ocp-config#165, but this is not yet possible. The commits in this PR grafana/loki#7106 from Grafana Loki have been merged into OpenShift Loki (openshift/loki@9bee101), but the latest version of the Red Hat Loki Operator does not yet include them. As soon as a newer version of the Loki Operator is available, we can upgrade to provide retention. Until then, it seems there is no way for us to limit the amount of logs stored in Loki, unless there is a way to limit the data from the OpenShift side. I will look into limits on the OpenShift side.
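Once an operator version that includes those commits ships, stream retention would be configured on the LokiStack custom resource, roughly as in this fragment. The field layout matches later Loki Operator releases; the resource name, size, and storage class are placeholders:

```yaml
# Hypothetical LokiStack fragment; retention under spec.limits only
# became available in later Loki Operator releases. Values are placeholders.
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki             # placeholder name
  namespace: openshift-logging   # placeholder namespace
spec:
  size: 1x.small
  storageClassName: ocs-external-storagecluster-ceph-rbd  # placeholder
  limits:
    global:
      retention:
        days: 90   # delete log streams older than 90 days
```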
Documentation for acceptance tests has been updated to indicate that NERC currently plans to defer Data Management requirements 3 and 4, so we will proceed with acceptance tests and return to this issue later when NERC is ready to address it.
Whoops, sorry, I closed the wrong issue.
@larsks I have opened a RH Support Case for the Loki Operator Retention: https://access.redhat.com/support/cases/#/case/03391568
Closing because we were able to apply basic 90 day retention described in this issue here |
@computate is the existing volume size going to be sufficient for 90 days of data at the rate we're consuming space?
It's a good question @larsks, I will reopen this and do some estimations.
@larsks @jtriley Here is a look at the Ceph volume size for Loki:
I don't think this leaves a lot of room to collect more logs, especially with a lot of users in production. Are we constrained to 3.5 TiB, or is it possible to have a lot more? Any thoughts on the data?
I think I answered this in chat, but I should probably leave the answer here as well. We're not constrained to 3.5TB; we can request additional storage from NESE through @jtriley. From your comment it sounds like we currently have enough storage to support our target 90 day retention, but if we need longer term storage we can request additional space.
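The estimation being discussed can be sketched as a back-of-envelope calculation. The daily ingestion rate, headroom factor, and the example numbers below are hypothetical placeholders, not measured NERC values; only the 3.5 TiB figure comes from the thread:

```python
# Back-of-envelope check: does a pool of a given size cover N days of
# log retention at an assumed ingestion rate? All rates are hypothetical.

def required_tib(daily_ingest_gib: float, retention_days: int,
                 headroom: float = 1.2) -> float:
    """Storage needed (in TiB) for `retention_days` of logs, with a
    safety headroom for compaction lag and index overhead."""
    total_gib = daily_ingest_gib * retention_days * headroom
    return total_gib / 1024  # GiB -> TiB

# Example: an assumed 10 GiB/day for 90 days with 20% headroom.
needed = required_tib(daily_ingest_gib=10, retention_days=90)
print(f"{needed:.2f} TiB needed")   # 1.05 TiB
# Compare against the current 3.5 TiB allocation mentioned in the thread.
print("fits" if needed <= 3.5 else "request more space")
```

Measuring the real daily ingestion rate over a few weeks and plugging it in is the step this thread defers until 90 days of retention have actually accumulated.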
@computate now that we are able to properly size retention, is this still an issue? If so, what size should we request?
@joachimweyl We set up the new logging and retention features 2 weeks ago. I think we should wait until 90 days of logs and retention have been applied before deciding on the amount of storage here. 3.5 TB seems to be enough for the logs from 2 weeks of observation, at least at the current load on NERC.
Based on the recent issue "Unsupported volume count, the maximum supported volume count is 20" reported by @larsks, we need to be able to increase the |
Based on the latest comment in this issue regarding log backups reported by @computate, we are currently generating We currently have
@computate I'm not sure if 20 is a hard limit, or if it is somehow influenced by the number of nodes, the size of volumes, etc. I'm going to open a support case with those questions today.
Moving away from Infra housing observability.
We ran out of space today on the 1TB RBD pool currently allocated to the nerc-ocp-infra cluster. @jtriley has temporarily increased the pool size, but we would like to get a better sense of how our utilization has been increasing since we installed the observability tools, and use that to select a more appropriate pool allocation.
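One way to turn that utilization trend into an early warning is a `predict_linear` rule that extrapolates recent growth. This is a hypothetical sketch assuming the Ceph exporter's per-pool series (`ceph_pool_stored`, `ceph_pool_max_avail`) are scraped; the lookback window and 7-day horizon are placeholders:

```yaml
# Hypothetical rule fragment for a PrometheusRule spec.groups[].rules list.
- alert: CephPoolFullIn7Days
  # Extrapolates 6h of observed growth 7 days (7 * 86400 s) ahead and
  # fires if the projection exceeds the pool's total capacity.
  expr: |
    predict_linear(ceph_pool_stored[6h], 7 * 86400)
      > (ceph_pool_stored + ceph_pool_max_avail)
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Ceph pool projected to fill within 7 days"
```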