
Size NESE allocation correctly for nerc-ocp-infra cluster #38

Closed
larsks opened this issue Nov 17, 2022 · 18 comments
Labels: loki-logs, observability, openshift (This issue pertains to NERC OpenShift), research (This task is primarily about information discovery)

@larsks
Contributor

larsks commented Nov 17, 2022

We ran out of space today on the 1 TB RBD pool currently allocated to the nerc-ocp-infra cluster. @jtriley has temporarily increased the pool size, but we would like to get a better sense of how our utilization has been increasing since we installed the observability tools, and use that to select a more appropriate pool allocation.

@larsks
Contributor Author

larsks commented Nov 17, 2022

From the Slack conversation:

14:56 naved001 @ChristopherTate I am looking at those StorageSystem
    dashboards again, and I see:
14:56          Used Capacity is 624.7 GiB
14:56          Requested capacity is 2.36 TiB
14:56
14:56          and the ceph pool was definitely fully utilized (it
    reported 999 GiB) when things started to fail. What I am trying to
    figure out is how can we configure a useful alert to not run into
    this again.
14:56
14:56          The current pool size is 2 TiB, and if we use all the
    requested capacity we will run into it again. cc: @jtriley*
    (edited) [:heavy_check_mark:1]
14:59 jtriley* We probably need to bump the quota further but wanted
    to check with you folks re: forecasting. If we're logging things
    and collecting metrics on infra then we should probably bump it to
    at least 5T.
15:04 ChristopherTate @naved001 what is the right resource to look at
    when looking for the 2TiB limit? I can query
    PersistentVolumeClaims and the storage remaining as a %, but I
    don't think that's the right thing to look at here:
15:04                 https://multicloud-console.apps.nerc-ocp-infra.r
    c.fas.harvard.edu/grafana/explore?orgId=1&left=%5B%22now-
    1h%22,%22now%22,%22Observatorium%22,%7B%22exemplar%22:true,%22expr
    %22:%22kubelet_volume_stats_available_bytes%7Bpersistentvolumeclai
    m%3D~%5C%22db-noobaa-db-pg-.*%7Cmetrics-backing-store-noobaa-
    pvc-.*%7Cnoobaa-default-backing-store-noobaa-
    pvc-.*%5C%22%7D%2Fkubelet_volume_stats_capacity_bytes%22%7D%5D
15:07 naved001 I am not sure, I think the used capacity should be it
    but I suspect that there may be some orphan volumes in the ceph
    pool that don’t correspond to any PV (maybe from old odf
    installs).
15:11 jtriley* Something driven off raw ceph metrics would be better
    but not sure what metrics ODF publishes exactly.
15:11 naved001 well the number of volumes in ceph pool == pvs in
    openshift, so my theory about orphan volumes is wrong.
15:11 larsks @jtriley* I'm going to follow up with @ChristopherTate
    tomorrow (i hope) on the subject of storage metrics, alerts, etc.
    [:+1:1]
15:12 naved001 so yeah, if we could somehow get it from ceph metrics
    that would be the most reliable.
15:12 larsks I'm going to create an issue about sizing and
    forecasting. [:+1:1]
15:12 ChristopherTate I hope to find some metrics for entire
    StorageSystems, but so far I haven't had any luck.
15:13 ChristopherTate I'm around tomorrow too, so feel free to reach
    out @larsks
15:13 jtriley* > For internal Mode clusters, various alerts related to
    the storage metrics services, storage cluster, disk devices,
    cluster health, cluster capacity, and so on are displayed in the
    Block and File, and the object dashboards. These alerts are not
    available for external Mode. (edited)
15:13 jtriley* https://access.redhat.com/documentation/en-us/red_hat_o
    penshift_data_foundation/4.10/html/monitoring_openshift_data_found
    ation/alerts (edited)
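
As discussed above, an alert like this would probably be driven off the Ceph pool metrics that ODF exposes rather than per-PVC kubelet metrics. Below is a minimal sketch of what such a PrometheusRule could look like; it assumes the ceph_pool_stored and ceph_pool_max_avail metrics from the Ceph mgr prometheus module actually reach our Prometheus, and the rule name, namespace, and threshold are illustrative only, not something we have deployed.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ceph-pool-capacity               # illustrative name
  namespace: openshift-storage
spec:
  groups:
    - name: ceph-pool-capacity
      rules:
        - alert: CephPoolNearFull
          # Fraction of the pool's usable capacity already stored;
          # assumes ceph_pool_stored / ceph_pool_max_avail are scraped.
          expr: ceph_pool_stored / (ceph_pool_stored + ceph_pool_max_avail) > 0.80
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: Ceph pool usage is above 80% of available capacity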

@naved001

Out of curiosity I was looking at rbd du, which tells you how much is provisioned vs. how much we are actually using:

NAME                                          PROVISIONED  USED
csi-vol-1f6c31ee-f9a4-11ec-aa57-0a580a8000a8      954 MiB  224 MiB
csi-vol-1f6c3587-f9a4-11ec-aa57-0a580a8000a8       94 GiB  9.1 GiB
csi-vol-2f86b744-1f27-11ed-9e8c-0a580a820096       94 GiB  4.6 GiB
csi-vol-30292544-22de-11ed-9e8c-0a580a820096       10 GiB  320 MiB
csi-vol-302f4c6b-22de-11ed-9e8c-0a580a820096       50 GiB  472 MiB
csi-vol-3fa060d7-2a25-11ed-97a4-0a580a810068       10 GiB  124 MiB
csi-vol-3fa0646b-2a25-11ed-97a4-0a580a810068       10 GiB      0 B
csi-vol-3fa1cab1-2a25-11ed-97a4-0a580a810068       20 GiB  1.4 GiB
csi-vol-3fa1cd30-2a25-11ed-97a4-0a580a810068        1 GiB   32 MiB
csi-vol-3fa1cd3b-2a25-11ed-97a4-0a580a810068        5 GiB  248 MiB
csi-vol-3fa1ce2c-2a25-11ed-97a4-0a580a810068       10 GiB  336 MiB
csi-vol-3fa1cebb-2a25-11ed-97a4-0a580a810068        1 GiB   68 MiB
csi-vol-45c9235d-091f-11ed-be28-0a580a8000a8        1 GiB  248 MiB
csi-vol-49a5b626-2f8e-11ed-97a4-0a580a810068        5 GiB  516 MiB
csi-vol-4d08f73e-0920-11ed-be28-0a580a8000a8        1 GiB  420 MiB
csi-vol-56d48967-091f-11ed-be28-0a580a8000a8       10 GiB  528 MiB
csi-vol-572d8230-091f-11ed-be28-0a580a8000a8      100 GiB   26 GiB
csi-vol-572f7952-091f-11ed-be28-0a580a8000a8       10 GiB  540 MiB
csi-vol-57471448-091f-11ed-be28-0a580a8000a8        1 GiB  248 MiB
csi-vol-574883bc-091f-11ed-be28-0a580a8000a8      100 GiB   14 GiB
csi-vol-57aa819c-091f-11ed-be28-0a580a8000a8       10 GiB  492 MiB
csi-vol-57c9615a-091f-11ed-be28-0a580a8000a8        1 GiB  372 MiB
csi-vol-5c8d9ec6-3b59-11ed-ac22-0a580a820044       10 GiB  372 MiB
csi-vol-5c8d9fc9-3b59-11ed-ac22-0a580a820044      150 GiB  141 GiB
csi-vol-5ce35237-0920-11ed-be28-0a580a8000a8        1 GiB  412 MiB
csi-vol-677dff29-091f-11ed-be28-0a580a8000a8        1 GiB  248 MiB
csi-vol-7f665279-229b-11ed-9e8c-0a580a820096       50 GiB  168 MiB
csi-vol-9237c5a8-229b-11ed-9e8c-0a580a820096       50 GiB  168 MiB
csi-vol-95866f8c-0397-11ed-be28-0a580a8000a8       10 GiB  240 MiB
csi-vol-9586731f-0397-11ed-be28-0a580a8000a8       10 GiB  228 MiB
csi-vol-9589f83d-0397-11ed-be28-0a580a8000a8       10 GiB  240 MiB
csi-vol-ad6e4995-f934-11ec-aa57-0a580a8000a8      100 GiB  2.1 GiB
csi-vol-b29e4628-350c-11ed-97a4-0a580a810068       10 GiB  1.6 GiB
csi-vol-c32d498f-02e9-11ed-be28-0a580a8000a8       10 GiB   88 MiB
csi-vol-c32fe68d-02e9-11ed-be28-0a580a8000a8       10 GiB   76 MiB
csi-vol-c332e81b-02e9-11ed-be28-0a580a8000a8       10 GiB   72 MiB
csi-vol-c8e19345-f88c-11ec-aa57-0a580a8000a8       50 GiB   12 GiB
csi-vol-d3acf4f3-18c0-11ed-8ab8-0a580a8200c5       10 GiB   72 MiB
csi-vol-d3acf741-18c0-11ed-8ab8-0a580a8200c5      150 GiB  141 GiB
csi-vol-d3af0b2a-18c0-11ed-8ab8-0a580a8200c5       10 GiB   68 MiB
csi-vol-d3b6a961-18c0-11ed-8ab8-0a580a8200c5       50 GiB  168 MiB
csi-vol-d526632f-19c0-11ed-8ab8-0a580a8200c5      400 GiB  207 GiB
csi-vol-d528069a-19c0-11ed-8ab8-0a580a8200c5      400 GiB  209 GiB
csi-vol-d5296ef1-19c0-11ed-8ab8-0a580a8200c5      400 GiB  199 GiB
csi-vol-e86ab39f-091f-11ed-be28-0a580a8000a8      100 GiB   13 GiB
csi-vol-f3ddc05a-091f-11ed-be28-0a580a8000a8      100 GiB   17 GiB
csi-vol-f5ba8f7d-f8e8-11ec-aa57-0a580a8000a8       50 GiB   22 GiB
csi-vol-faa23e6d-2302-11ed-9e8c-0a580a820096       10 GiB   68 MiB
csi-vol-fabdd2a8-2302-11ed-9e8c-0a580a820096       50 GiB  168 MiB
<TOTAL>                                           2.7 TiB  1.0 TiB

Things that stood out to me:

  1. The total provisioned space matches what we see in the StorageCluster dashboard, but the used space is more than what's reported in the OpenShift dashboard.
  2. The 3 x 400 GiB volumes (openshift-storage/metrics-backing-store-noobaa-pvc) are at about 50% usage.
  3. There are 2 Loki-related 150 GiB volumes (openshift-operators-redhat/wal-lokistack-ingester-0 and loki-namespace/wal-lokistack-ingester-0), and those are 90%+ full.

@computate
Member

computate commented Nov 18, 2022

I will be looking into alerts that can watch the available space of our Ceph Cluster. Here are some metrics I have found that can help. I need to make sure that all of these fields are also sent to Observability, because some are missing:

@computate
Member

I was attempting to configure retention of Loki logs with this PR OCP-on-NERC/nerc-ocp-config#165, but this is not yet possible because the RetentionStreamSpec feature is not yet released in the latest Loki Operator.

The commits in PR grafana/loki#7106 from Grafana Loki have been merged into OpenShift Loki (openshift/loki@9bee101), but the latest version of the Red Hat Loki Operator, v5.5.4-4, does not yet include the commits that add RetentionStreamSpec. See the lokistack_types.go file.

As soon as a newer version of the Loki Operator is available, we can upgrade to enable retention. Until then, it seems there is no way for us to limit the amount of logs stored in Loki, unless there is a way to limit the data from the OpenShift side. I will look into limits on the OpenShift side.
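
For reference, once the operator does ship RetentionStreamSpec, the configuration would look roughly like the sketch below on the LokiStack custom resource. This is illustrative only: the name, namespace, exact apiVersion for whichever operator release we land on, and the stream selectors are assumptions, not our actual nerc-ocp-config manifests.

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: lokistack                        # illustrative name/namespace
  namespace: openshift-operators-redhat
spec:
  # size, storage, storageClassName, etc. omitted for brevity
  limits:
    global:
      retention:
        days: 90                         # global retention period
        streams:                         # RetentionStreamSpec entries
          - days: 30
            selector: '{log_type="infrastructure"}'   # assumed selector label
          - days: 30
            selector: '{log_type="application"}'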

@hpdempsey

Documentation for acceptance tests has been updated to indicate that NERC currently plans to defer Data Management requirements 3 and 4, so we will proceed with acceptance tests and return to this issue later when NERC is ready to address them.

@hpdempsey

Whoops -- sorry, I closed the wrong issue.

hpdempsey reopened this Dec 7, 2022
@computate
Member

@larsks I have opened a RH Support Case for the Loki Operator Retention: https://access.redhat.com/support/cases/#/case/03391568

@computate
Member

Closing because we were able to apply the basic 90-day retention described in #75.

@larsks
Contributor Author

larsks commented Feb 1, 2023

@computate is the existing volume size going to be sufficient for 90 days of data at the rate we're consuming space?

computate reopened this Feb 6, 2023
@computate
Member

It's a good question, @larsks; I will reopen this and do some estimates.

@computate
Member

@larsks @jtriley Here is a look at the Ceph volume size for Loki:

I don't think this leaves a lot of room to collect more logs, especially with a lot of users in production. Are we constrained to 3.5 TiB, or is it possible to have a lot more? Any thoughts on the data?

@larsks
Contributor Author

larsks commented Feb 15, 2023

I think I answered this in chat, but I should probably leave the answer here as well. We're not constrained to 3.5 TB; we can request additional storage from NESE through @jtriley. From your comment it sounds like we currently have enough storage to support our target 90-day retention, but if we need longer-term storage we can request additional space.

@joachimweyl
Contributor

@computate now that we are able to properly size retention, is this still an issue? If so, what size should we request?

@computate
Member

@joachimweyl We set up the new logging and retention features 2 weeks ago. I think we should wait until 90 days of logs and retention have been applied before deciding on the amount of storage here. 3.5 TB seems to be enough for the logs from 2 weeks of observation, at least at the current load on NERC.

larsks removed their assignment Apr 25, 2023
joachimweyl added the openshift label Aug 16, 2023
@computate
Member

Based on the recent issue "Unsupported volume count, the maximum supported volume count is 20" reported by @larsks, we need to be able to increase the numVolumes of NooBaa BackingStores over time. We cannot exceed numVolumes: 20, so we need to ensure that the storage value of each BackingStore is large enough that we never need to exceed numVolumes: 20.
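
For context, the relevant fields live on the NooBaa BackingStore custom resource. The sketch below is illustrative only (the name, storage class, and sizes are assumptions, not our current manifests); the point is that total pv-pool capacity is numVolumes times the per-volume storage request, so the per-volume size has to be chosen large enough up front that growth never pushes numVolumes past 20.

apiVersion: noobaa.io/v1alpha1
kind: BackingStore
metadata:
  name: metrics-backing-store           # illustrative name
  namespace: openshift-storage
spec:
  type: pv-pool
  pvPool:
    numVolumes: 12                       # hard ceiling of 20 per backing store
    resources:
      requests:
        storage: 400Gi                   # per-volume size; total capacity = numVolumes x storage
    storageClass: ocs-storagecluster-ceph-rbd   # assumed ODF RBD storage class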

@computate
Member

Based on the latest comment in this issue regarding log backups reported by @computate, we are currently generating 1 Ti / 9 days * 365.25 days / 12 months = 3.4 Ti of log data per month.

We currently have 4.8 Ti of Ceph storage available for logs. With our log retention of 90 days for audit logs and 30 days for infrastructure and application logs on numVolumes: 12, we're doing fine, staying between 1 Ti and 2 Ti of retained log storage on Ceph. I expect application logs to grow, and infrastructure logs to grow as we increase the number of clusters.

@larsks
Contributor Author

larsks commented Sep 13, 2023

@computate I'm not sure if 20 is a hard limit, or if it is somehow influenced by the number of nodes, size of volumes, etc. I'm going to open a support case with those questions today.

@joachimweyl
Contributor

We are moving away from the infra cluster housing observability.
