Monitor effects on timeouts and performance. (follow up #473 & #596) #618

schwesig · 2024-06-24T14:00:43Z

Follow up from

Timeouts Between Observability and Loki Pods in Infra Cluster #473
Follow Up #473 Timeouts between pods: Adding 3 Nodes to the infra-cluster to follow RH support recommendation #596
after nodes were successfully added.

Known checks

check warning on observability-thanos-store-shard-0-0
- https://access.redhat.com/support/cases/#/case/03764352
check retention on OBS
- https://access.redhat.com/support/cases/#/case/03861871

Status

Currently in the monitoring state
all needed nodes are available
currently memcache works fine
observability-thanos-store-shard-0-1 and observability-thanos-store-shard-0-2 are good
observability-thanos-store-shard-0-0 produces the following log output:

level=info ts=2024-06-24T13:46:37.338578312Z caller=fetcher.go:478 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=893.994626ms duration_ms=893 cached=2128 returned=694 partial=0

level=warn ts=2024-06-24T13:46:44.853880072Z caller=bucket.go:637 msg="loading block failed" elapsed=7.514862006s id=01HQ6GM... err="create index header reader: write index header: new index reader: get TOC from object storage of 01HQ6GM.../index: Get \"https://s3.openshift-storage.svc/observability-a6581571-.../01HQ6GM.../index\": Connection closed by foreign host https://s3.openshift-storage.svc/observability-a6581571-.../01HQ6GM.../index. Retry again."

schwesig · 2024-06-28T09:05:58Z

check warning on observability-thanos-store-shard-0-0

https://access.redhat.com/support/cases/#/case/03764352

observability-thanos-store-shard-0-0 link

level=warn ts=2024-06-28T08:52:39.405735916Z caller=bucket.go:637 msg="loading block failed" elapsed=2.930443482s id=01HQ6GMBQ9EWTE00ZSCME83RT3 err="create index header reader: write index header: new index reader: get TOC from object storage of 01HQ6GMBQ9EWTE00ZSCME83RT3/index: Get \"https://s3.openshift-storage.svc/observability-a6581571-5ded-446c-8ab8-9008e45e7e33/01HQ6GMBQ9EWTE00ZSCME83RT3/index\": Connection closed by foreign host https://s3.openshift-storage.svc/observability-a6581571-5ded-446c-8ab8-9008e45e7e33/01HQ6GMBQ9EWTE00ZSCME83RT3/index. Retry again."
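To track whether this warning becomes less frequent during the monitoring phase, the store log lines can be split into fields and counted per block id. The following is a minimal sketch, assuming the lines are logfmt-style `key=value` pairs as shown above. The helper names `parse_logfmt` and `go_duration_seconds` are hypothetical, and the duration helper handles only the `s` and `ms` forms seen in this issue.

```python
import shlex

# Sample line copied from the comment above, joined onto one line.
SAMPLE = (
    r'level=warn ts=2024-06-28T08:52:39.405735916Z caller=bucket.go:637 '
    r'msg="loading block failed" elapsed=2.930443482s '
    r'id=01HQ6GMBQ9EWTE00ZSCME83RT3 '
    r'err="create index header reader: write index header: new index reader: '
    r'get TOC from object storage of 01HQ6GMBQ9EWTE00ZSCME83RT3/index: '
    r'Get \"https://s3.openshift-storage.svc/observability-a6581571-5ded-446c-8ab8-9008e45e7e33/01HQ6GMBQ9EWTE00ZSCME83RT3/index\": '
    r'Connection closed by foreign host '
    r'https://s3.openshift-storage.svc/observability-a6581571-5ded-446c-8ab8-9008e45e7e33/01HQ6GMBQ9EWTE00ZSCME83RT3/index. '
    r'Retry again."'
)


def parse_logfmt(line):
    """Split a key=value log line into a dict (hypothetical helper).

    shlex handles the double-quoted values and the escaped quotes inside them.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens without a key
            fields[key] = value
    return fields


def go_duration_seconds(text):
    """Convert durations such as '2.93s' or '893.9ms' to seconds.

    Only the two suffixes that appear in this issue are supported.
    """
    if text.endswith("ms"):
        return float(text[:-2]) / 1000.0
    if text.endswith("s"):
        return float(text[:-1])
    raise ValueError(f"unsupported duration: {text!r}")


fields = parse_logfmt(SAMPLE)
if fields.get("level") == "warn" and fields.get("msg") == "loading block failed":
    print(fields["id"], go_duration_seconds(fields["elapsed"]))
```

Feeding every line of the pod log through `parse_logfmt` and counting the matches per `id` would show whether the failures are limited to a single block, as the two excerpts in this issue suggest.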

schwesig · 2024-06-28T10:00:01Z

check retention on OBS

https://access.redhat.com/support/cases/#/case/03861871

schwesig · 2024-06-28T10:00:27Z

/CC @computate

This PR addresses the retention issues discussed in nerc-project/operations#618 (comment), such as keeping more than 30d of raw data. It updates the retention and concurrency settings for the Thanos Compactor to improve observability and metrics performance. We stay with the defaults where possible and add remarks stating the defaults, so that later changes or possible errors are easier to understand.

The changes focus on the needs of class, cost, and invoice analysis, as well as future predictions:

- Updated `retentionResolutionRaw` from 30d to 90d (a quarter of high-detail data for deep analysis, especially of GPUs)
- Updated `retentionResolution5m` from 90d to 360d (for cost, usage, and invoices; 15 minutes could be enough, but is not a default option)
- Set `retentionResolution1h` to 0d (retain forever, following the default and recommendation)
- Added `blockDuration`, `cleanupInterval`, `deleteDelay`, `retentionInLocal`, `consistencyDelay`, `compactConcurrency`, and `downsampleConcurrency` settings (even where they stay at the default, this makes the options visible in case of future changes)

These changes aim to optimize data retention and resolution for the required use cases and to ensure better performance.

References:

1. [Thanos Compact Component](https://thanos.io/tip/components/compact.md/)
2. [Recommendations for Running Thanos and Prometheus](https://zapier.com/blog/five-recommendations-when-running-thanos-and-prometheus/)
3. [Red Hat Advanced Cluster Management Observability](https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.9/html/observability/customizing-observability#adding-advanced-config:~:text=is%20not%20displayed.-,4.3.%C2%A0Adding%20advanced%20configuration%20for%20retention,-Add%20the%20advanced)

Signed-off-by: /Thor(sten)?/ Schwesig <89909507+schwesig@users.noreply.github.com>
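The three retention values from the PR description can be sanity-checked with a short script. This is only an illustrative sketch: `parse_retention` and `check_ordering` are hypothetical helpers, only the `h`, `d`, and `w` units are handled, and the ordering check (coarser resolutions kept at least as long as finer ones) is an assumption made for this example, not a rule enforced by Thanos.

```python
import re
from datetime import timedelta

# Values taken from the PR description above.
PROPOSED = {
    "retentionResolutionRaw": "90d",
    "retentionResolution5m": "360d",
    "retentionResolution1h": "0d",
}

# Finest to coarsest resolution.
ORDER = ["retentionResolutionRaw", "retentionResolution5m", "retentionResolution1h"]

_UNITS = {"h": "hours", "d": "days", "w": "weeks"}


def parse_retention(value):
    """Return a timedelta, or None when the amount is zero (retain forever)."""
    match = re.fullmatch(r"(\d+)([hdw])", value)
    if match is None:
        raise ValueError(f"unsupported retention value: {value!r}")
    amount = int(match.group(1))
    if amount == 0:
        return None
    return timedelta(**{_UNITS[match.group(2)]: amount})


def check_ordering(settings):
    """List cases where finer-resolution data would outlive coarser data."""
    problems = []
    parsed = [(name, parse_retention(settings[name])) for name in ORDER]
    for (fine_name, fine), (coarse_name, coarse) in zip(parsed, parsed[1:]):
        if coarse is None:  # coarser data is kept forever, nothing to flag
            continue
        if fine is None or fine > coarse:
            problems.append(f"{fine_name} outlives {coarse_name}")
    return problems


for name in ORDER:
    parsed = parse_retention(PROPOSED[name])
    print(name, "forever" if parsed is None else f"{parsed.days} days")
print(check_ordering(PROPOSED) or "ordering looks consistent")
```

With the proposed values, raw data is kept for 90 days, 5-minute data for 360 days, and 1-hour data indefinitely, so the check reports no inconsistencies.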

schwesig mentioned this issue Jun 24, 2024

Follow Up #473 Timeouts between pods: Adding 3 Nodes to the infra-cluster to follow RH support recommendation #596

Closed

2 tasks

schwesig self-assigned this Jun 24, 2024

schwesig mentioned this issue Jun 24, 2024

Timeouts Between Observability and Loki Pods in Infra Cluster #473

Closed

1 task

schwesig changed the title ~~Monitor effects on timeouts and performance.~~ Monitor effects on timeouts and performance. (follow up https://github.com/nerc-project/operations/issues/473 & https://github.com/nerc-project/operations/issues/596) Jun 24, 2024

schwesig changed the title ~~Monitor effects on timeouts and performance. (follow up https://github.com/nerc-project/operations/issues/473 & https://github.com/nerc-project/operations/issues/596)~~ Monitor effects on timeouts and performance. (follow up #473 & #596) Jun 24, 2024

schwesig mentioned this issue Jun 28, 2024

feat: Update retention and concurrency for Thanos OCP-on-NERC/nerc-ocp-config#461

Open

This was referenced Jul 3, 2024

add acm-metrics object bucket claim OCP-on-NERC/nerc-ocp-config#463

Merged

PoC: Deploy ?VictoriaMetrics? on TEST then OBS #460

Open
