[BUG] Prometheus volume failed #3260

albertkohl-monotek · 2022-12-08T16:45:35Z

Describe the bug
After some sort of disk corruption issue, I can't seem to get Prometheus monitoring healthy. See #3238. I would like to get it healthy before attempting an upgrade from v1.0.3 -> v1.1.1, to avoid any issues with the upgrade hanging.

To Reproduce
Hard to reproduce, because I believe it was caused by Longhorn disk corruption issues on an ext4 filesystem.

Expected behavior
Prometheus volumes should be healthy, or there should be a workaround to reset Prometheus.

Support bundle
attached.

Environment

Harvester ISO version: v1.0.3
Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): Baremetal R620s, 4 nodes

Additional context
I tried to delete the failed replica in Longhorn, but that is not offered as an option. I also tried to attach the volume to other nodes, with no luck.

Volume details in LH:
State: Detached
Health: Unknown
Ready for workload: Ready
Conditions: restore, scheduled
Frontend: Block Device
Attached Node & Endpoint:
Size: 50 Gi
Actual Size: 64.7 Gi
Data Locality: disabled
Access Mode: ReadWriteOnce
Engine Image: longhornio/longhorn-engine:v1.2.4
Created: 3 months ago
Encrypted: False
Node Tags:
Disk Tags:
Last Backup:
Last Backup At:
Replicas Auto Balance: ignored
Instance Manager:
Namespace: cattle-monitoring-system
PVC Name: prometheus-rancher-monitoring-prometheus-db-prometheus-rancher-monitoring-prometheus-0
PV Name: pvc-7616922b-a530-4bcc-b281-0a0438955d4d
PV Status: Bound
Revision Counter Disabled: False
Pod Name: prometheus-rancher-monitoring-prometheus-0
Pod Status: Pending
Workload Name: prometheus-rancher-monitoring-prometheus
Workload Type: StatefulSet

Charts / graphs fail to load, I'm guessing because of missing metrics.

Additional thread from Slack w/ @Vicente-Cheng:
https://rancher-users.slack.com/archives/C01GKHKAG0K/p1670220962452249 (hopefully this link works!)

albertkohl-monotek · 2022-12-30T18:03:18Z

I found the pod with kubectl and deleted it. It came up fine afterwards.
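
For anyone hitting the same thing, here is a minimal sketch of that workaround. The exact command that was run is not recorded in this thread, so this assumes the pod in question is the Pending one listed in the volume details above, and it reuses that pod name and namespace. Because the pod belongs to a StatefulSet, the controller recreates it after deletion. The helper names `build_delete_command` and `delete_pod` are made up for illustration; the script only prints the command unless `dry_run=False` is passed.

```python
import shlex
import subprocess

# Values copied from the volume details in this issue; adjust for your cluster.
NAMESPACE = "cattle-monitoring-system"
POD_NAME = "prometheus-rancher-monitoring-prometheus-0"


def build_delete_command(namespace, pod):
    """Return the kubectl argv that deletes a single pod in a namespace."""
    return ["kubectl", "delete", "pod", pod, "--namespace", namespace]


def delete_pod(namespace, pod, dry_run=True):
    """Print the command, and run it only when dry_run is False."""
    cmd = build_delete_command(namespace, pod)
    rendered = shlex.join(cmd)
    print(rendered)
    if not dry_run:
        # Needs kubectl on PATH and a kubeconfig pointing at the cluster.
        subprocess.run(cmd, check=True)
    return rendered


if __name__ == "__main__":
    # Dry run by default: shows the command without touching the cluster.
    delete_pod(NAMESPACE, POD_NAME)
```

Running the printed command by hand works just as well. The wrapper is only there so the command can be checked before anything is deleted.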

albertkohl-monotek added the kind/bug, reproduce/needed, and severity/needed labels Dec 8, 2022

rebeccazzzz added this to New in Community Issue Review via automation Dec 12, 2022

albertkohl-monotek closed this as completed Dec 30, 2022

rebeccazzzz moved this from New to Resolved/Scheduled in Community Issue Review Jan 9, 2023
