Monitor effects on timeouts and performance. (follow up #473 & #596) #618

Open
4 tasks
schwesig opened this issue Jun 24, 2024 · 3 comments
schwesig commented Jun 24, 2024

Follow up from #473 and #596

Known checks

Status

  • Currently in the monitoring state

  • All needed nodes are available

  • memcache currently works fine

  • observability-thanos-store-shard-0-1 and observability-thanos-store-shard-0-2 are good

  • observability-thanos-store-shard-0-0 logs the following info and warning messages:

level=info ts=2024-06-24T13:46:37.338578312Z caller=fetcher.go:478 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=893.994626ms duration_ms=893 cached=2128 returned=694 partial=0

level=warn ts=2024-06-24T13:46:44.853880072Z caller=bucket.go:637 msg="loading block failed" elapsed=7.514862006s id=01HQ6GM... err="create index header reader: write index header: new index reader: get TOC from object storage of 01HQ6GM.../index: Get \"https://s3.openshift-storage.svc/observability-a6581571-.../01HQ6GM.../index\": Connection closed by foreign host https://s3.openshift-storage.svc/observability-a6581571-.../01HQ6GM.../index. Retry again."

schwesig self-assigned this Jun 24, 2024
schwesig changed the title from "Monitor effects on timeouts and performance." to "Monitor effects on timeouts and performance. (follow up https://github.com/nerc-project/operations/issues/473 & https://github.com/nerc-project/operations/issues/596)" Jun 24, 2024
schwesig changed the title from "Monitor effects on timeouts and performance. (follow up https://github.com/nerc-project/operations/issues/473 & https://github.com/nerc-project/operations/issues/596)" to "Monitor effects on timeouts and performance. (follow up #473 & #596)" Jun 24, 2024

schwesig commented Jun 28, 2024

Check the warning on observability-thanos-store-shard-0-0.

https://access.redhat.com/support/cases/#/case/03764352

Log from observability-thanos-store-shard-0-0:

level=warn ts=2024-06-28T08:52:39.405735916Z caller=bucket.go:637 msg="loading block failed" elapsed=2.930443482s id=01HQ6GMBQ9EWTE00ZSCME83RT3 err="create index header reader: write index header: new index reader: get TOC from object storage of 01HQ6GMBQ9EWTE00ZSCME83RT3/index: Get \"https://s3.openshift-storage.svc/observability-a6581571-5ded-446c-8ab8-9008e45e7e33/01HQ6GMBQ9EWTE00ZSCME83RT3/index\": Connection closed by foreign host https://s3.openshift-storage.svc/observability-a6581571-5ded-446c-8ab8-9008e45e7e33/01HQ6GMBQ9EWTE00ZSCME83RT3/index. Retry again."
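
To see whether the same block keeps failing over time, the store logs can be tallied by block id. The following is a minimal sketch, not part of the issue itself: `parse_logfmt` and `count_failed_blocks` are hypothetical helper names, and the parser assumes the `key=value` layout with backslash-escaped quotes seen in the warning above.

```python
import re
from collections import Counter

# key=value pairs; a value is either a double-quoted string (with backslash
# escapes) or a bare token without whitespace.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_logfmt(line):
    """Split one log line into a dict of fields (hypothetical helper)."""
    fields = {}
    for key, raw in _PAIR.findall(line):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            # Strip the surrounding quotes and unescape inner quotes.
            raw = raw[1:-1].replace('\\"', '"')
        fields[key] = raw
    return fields


def count_failed_blocks(lines):
    """Count 'loading block failed' warnings per block id."""
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("level") == "warn" and fields.get("msg") == "loading block failed":
            counts[fields.get("id", "<unknown>")] += 1
    return counts


# Warning from the thread; the err text is shortened here (repeated URL dropped).
SAMPLE = (
    r'level=warn ts=2024-06-28T08:52:39.405735916Z caller=bucket.go:637 '
    r'msg="loading block failed" elapsed=2.930443482s '
    r'id=01HQ6GMBQ9EWTE00ZSCME83RT3 '
    r'err="create index header reader: write index header: new index reader: '
    r'get TOC from object storage of 01HQ6GMBQ9EWTE00ZSCME83RT3/index: '
    r'Get \"https://s3.openshift-storage.svc/observability-a6581571-5ded-446c-8ab8-9008e45e7e33/01HQ6GMBQ9EWTE00ZSCME83RT3/index\": '
    r'Connection closed by foreign host. Retry again."'
)

if __name__ == "__main__":
    for block_id, n in count_failed_blocks([SAMPLE]).items():
        print(block_id, n)
```

Feeding it the pod's log lines (however they are collected) gives a per-block failure count, which makes it easier to tell a single bad block from a general storage connectivity problem.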

schwesig commented

Check retention on OBS.

https://access.redhat.com/support/cases/#/case/03861871

schwesig commented

/CC @computate

schwesig added a commit to schwesig/OCP-on-NERC_nerc-ocp-config that referenced this issue Jul 1, 2024
This PR addresses the retention issues discussed in nerc-project/operations#618 (comment), such as keeping more than 30d of raw data.
The changes update the retention and concurrency settings for the Thanos Compactor to improve observability and metrics performance.
We stay with the defaults where possible and add remarks stating the defaults, so that later changes or possible errors are easier to understand.

The changes focus on the needs of class, cost, and invoice analysis, as well as future predictions:
- Updated `retentionResolutionRaw` from 30d to 90d (a quarter of high-detail data for deep analysis, especially GPUs)
- Updated `retentionResolution5m` from 90d to 360d (for cost, usage, and invoices; 15 minutes could be enough, but is not a default option)
- Set `retentionResolution1h` to 0d (retain forever, following the default and recommendation)
- Added `blockDuration`, `cleanupInterval`, `deleteDelay`, `retentionInLocal`, `consistencyDelay`, `compactConcurrency`, and `downsampleConcurrency` settings (even where they stay at the default, this makes the options visible in case of future changes)

These changes aim to optimize data retention and resolution for the required use cases and to ensure better performance.

References:
1. [Thanos Compact Component](https://thanos.io/tip/components/compact.md/)
2. [Recommendations for Running Thanos and Prometheus](https://zapier.com/blog/five-recommendations-when-running-thanos-and-prometheus/)
3. [Red Hat Advanced Cluster Management Observability](https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.9/html/observability/customizing-observability#adding-advanced-config:~:text=is%20not%20displayed.-,4.3.%C2%A0Adding%20advanced%20configuration%20for%20retention,-Add%20the%20advanced)

Signed-off-by: ​/Thor(sten)?/ Schwesig <89909507+schwesig@users.noreply.github.com>
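
The retention values in the commit message above can be sanity-checked before they are applied. This is a rough sketch under stated assumptions: the helper names are hypothetical, and the 40-hour and 10-day thresholds are the minimum block ages that the Thanos Compact documentation gives for creating 5m and 1h downsampled data. Treat them as assumptions and verify them against the Thanos version in use.

```python
from datetime import timedelta

# Values taken from the commit message; "0d" means "retain forever".
RETENTION = {
    "retentionResolutionRaw": "90d",
    "retentionResolution5m": "360d",
    "retentionResolution1h": "0d",
}

_UNITS = {"h": "hours", "d": "days", "w": "weeks"}

# Assumed thresholds (see the lead-in): raw blocks must survive this long
# before 5m data can be produced, and 5m blocks this long before 1h data.
MIN_AGE_FOR_5M = timedelta(hours=40)
MIN_AGE_FOR_1H = timedelta(days=10)


def parse_retention(value):
    """Return a timedelta, or None when the value means 'keep forever'."""
    number, unit = int(value[:-1]), value[-1]
    if unit not in _UNITS:
        raise ValueError(f"unsupported unit in {value!r}")
    if number == 0:
        return None
    return timedelta(**{_UNITS[unit]: number})


def check_retention(cfg):
    """List settings that would delete data before it can be downsampled."""
    problems = []
    raw = parse_retention(cfg["retentionResolutionRaw"])
    five_m = parse_retention(cfg["retentionResolution5m"])
    if raw is not None and raw < MIN_AGE_FOR_5M:
        problems.append("retentionResolutionRaw is shorter than 40h")
    if five_m is not None and five_m < MIN_AGE_FOR_1H:
        problems.append("retentionResolution5m is shorter than 10d")
    return problems


if __name__ == "__main__":
    print(check_retention(RETENTION) or "retention settings look consistent")
```

With the values from this change (90d raw, 360d at 5m, 1h kept forever) the check reports no problems. It only guards against retention that is too short for downsampling, and says nothing about storage cost.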
larsks pushed a commit to schwesig/OCP-on-NERC_nerc-ocp-config that referenced this issue Jul 2, 2024