[BUG] Panic during collecting metrics #8098

Closed
derekbit opened this issue Mar 4, 2024 · 2 comments
Assignees
Labels
  • area/monitoring: System (cluster, node) or volume metrics, logs, stats
  • area/resilience: System or volume resilience
  • component/longhorn-manager: Longhorn manager (control plane)
  • kind/bug
  • kind/refactoring: Request for refactoring (code)
  • require/backport: Require backport. Only used when the specific versions to backport have not been defined.
  • require/qa-review-coverage: Require QA to review coverage
  • severity/4: Function working but has a minor issue (a minor incident with low impact)
Milestone

Comments

@derekbit
Member

derekbit commented Mar 4, 2024

Describe the bug

https://cloud-native.slack.com/archives/CNVPEL9U3/p1709281294886339

2024-03-01T10:28:17.583228766+01:00 time="2024-03-01T09:28:17Z" level=warning msg="Failed to get engine proxy of pvc-4c4da4f6-6585-4afe-8c2c-23b6624573ed-e-0 for volume pvc-4c4da4f6-6585-4afe-8c2c-23b6624573ed" func="metrics_collector.(*VolumeCollector).Collect" file="volume_collector.go:192" collector=volume error="failed to get binary client for engine pvc-4c4da4f6-6585-4afe-8c2c-23b6624573ed-e-0: cannot get client for engine pvc-4c4da4f6-6585-4afe-8c2c-23b6624573ed-e-0: engine is not running" node=core8
2024-03-01T10:28:17.583267365+01:00 time="2024-03-01T09:28:17Z" level=warning msg="Panic during collecting metrics" func="metrics_collector.(*VolumeCollector).Collect.func1" file="volume_collector.go:164" collector=volume error="runtime error: invalid memory address or nil pointer dereference" node=core8
2024-03-01T10:28:17.590616231+01:00 10.0.0.75 - - [01/Mar/2024:09:28:17 +0000] "GET /metrics HTTP/1.1" 200 33847 "" "Prometheus/2.47.1"
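
The log order above (a failed engine proxy lookup, then a nil pointer dereference in the same collector) suggests, though this issue does not confirm it, that a nil proxy was used after the lookup failed. Below is a minimal sketch of guarding against that pattern. `engineProxy`, `getEngineProxy`, and `collectLatencies` are hypothetical names made up for illustration; they are not the longhorn-manager code.

```go
package main

import (
	"errors"
	"fmt"
)

// engineProxy is a hypothetical stand-in for a client to a running engine.
type engineProxy struct{ latencyNs uint64 }

// getEngineProxy is a hypothetical lookup that fails when the engine is not running.
func getEngineProxy(running bool) (*engineProxy, error) {
	if !running {
		return nil, errors.New("engine is not running")
	}
	return &engineProxy{latencyNs: 1000}, nil
}

// collectLatencies skips engines whose proxy is unavailable
// instead of going on to use the nil proxy.
func collectLatencies(engines map[string]bool) map[string]uint64 {
	out := make(map[string]uint64)
	for name, running := range engines {
		proxy, err := getEngineProxy(running)
		if err != nil {
			fmt.Printf("warning: failed to get engine proxy of %s: %v\n", name, err)
			continue // without this, proxy.latencyNs below would dereference nil
		}
		out[name] = proxy.latencyNs
	}
	return out
}

func main() {
	fmt.Println(collectLatencies(map[string]bool{"e-0": false, "e-1": true}))
}
```

With the `continue` in place, an engine that is not running produces only the warning, and metrics for the remaining engines are still collected.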

To Reproduce

Expected behavior

Support bundle for troubleshooting

supportbundle_e3a48966-8520-476f-90bc-758701853d85_2024-03-01T09-29-10Z.zip

Environment

  • Longhorn version: v1.6.0
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

@derekbit derekbit added kind/bug require/qa-review-coverage Require QA to review coverage require/backport Require backport. Only used when the specific versions to backport have not been defined. backport/1.6.1 area/monitoring System (cluster, node) or volume metrics, logs, stats component/longhorn-manager Longhorn manager (control plane) labels Mar 4, 2024
@longhorn-io-github-bot

longhorn-io-github-bot commented Mar 4, 2024

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are:
  1. Monitor the volume metrics.
  2. Trigger [BUG] Volume attach/detach/delete operations stuck in version 1.6.0 #7915.
  3. The warning should appear in the longhorn-manager pod log.
  • Does the PR include the explanation for the fix or the feature?

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at longhorn/longhorn-manager#2665

  • Which areas/issues might this PR impact?
    Area: metrics
    Issues

@derekbit derekbit self-assigned this Mar 4, 2024
@derekbit derekbit added the area/resilience System or volume resilience label Mar 4, 2024
@derekbit derekbit added this to the v1.7.0 milestone Mar 4, 2024
@derekbit
Member Author

derekbit commented Mar 4, 2024

The issue is not harmful and won't crash the longhorn-manager pod, because a panic recovery mechanism is already in place.
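
For readers unfamiliar with the pattern, here is a minimal sketch of how a deferred `recover` keeps a nil pointer dereference from taking down the process. `volumeStats`, `collectOne`, and `safeCollect` are hypothetical names used only for illustration; the actual collector code in longhorn-manager may look different.

```go
package main

import (
	"fmt"
)

// volumeStats is a hypothetical stand-in for per-volume data a collector reads.
type volumeStats struct {
	readBytes uint64
}

// collectOne dereferences stats without a nil check, so a nil pointer panics.
func collectOne(stats *volumeStats) uint64 {
	return stats.readBytes
}

// safeCollect runs collectOne and converts any panic into an error,
// so the caller (and the whole process) keeps running.
func safeCollect(stats *volumeStats) (value uint64, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic during collecting metrics: %v", r)
		}
	}()
	return collectOne(stats), nil
}

func main() {
	// A nil input triggers the panic, which is reported as a warning.
	if _, err := safeCollect(nil); err != nil {
		fmt.Println("warning:", err)
	}
	// A valid input is collected normally afterwards.
	v, _ := safeCollect(&volumeStats{readBytes: 42})
	fmt.Println("collected:", v)
}
```

The recover only hides the symptom: the metrics for that scrape may be incomplete, which is why the nil dereference itself is still worth fixing.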

@innobead innobead added priority/0 Must be fixed in this release (managed by PO) kind/refactoring Request for refactoring (code) severity/4 Function working but has a minor issue (a minor incident with low impact) and removed priority/0 Must be fixed in this release (managed by PO) backport/1.6.1 backport/1.5.5 labels Mar 5, 2024
3 participants