[BUG] metric longhorn_volume_robustness did not reflect detached volume #8139

Closed
chriscchien opened this issue Mar 8, 2024 · 4 comments
Labels: area/monitoring, kind/bug, kind/regression, priority/0, reproduce/always, require/qa-review-coverage, severity/3

chriscchien commented Mar 8, 2024

Describe the bug

In test_volume_metric, the longhorn_volume_robustness metric is not reported for detached volumes. This issue cannot be reproduced on v1.6.x.

With all volumes detached, the longhorn_volume_robustness metric returns no results:

> k get volume -A
NAMESPACE         NAME                                       DATA ENGINE   STATE      ROBUSTNESS   SCHEDULED   SIZE         NODE   AGE
longhorn-system   test                                       v1            detached   unknown                  1073741824          4m20s
longhorn-system   pvc-67fc71f0-9880-4cac-88f1-63cc7ebd7be2   v1            detached   unknown                  2147483648          4m27s
> 
> k get pods -A -o wide | grep longhorn-manager
longhorn-system   longhorn-manager-vvm77                              1/1     Running     0               3m      10.42.1.25   ip-172-31-33-245   <none>           <none>
longhorn-system   longhorn-manager-c2rvg                              1/1     Running     1 (2m55s ago)   3m      10.42.0.27   ip-172-31-44-100   <none>           <none>
longhorn-system   longhorn-manager-dmkqj                              1/1     Running     0               3m      10.42.2.31   ip-172-31-45-235   <none>           <none>
> curl -sSL http://10.42.1.25:9500/metrics | grep longhorn_volume_robustness
curl -sSL http://10.42.0.27:9500/metrics | grep longhorn_volume_robustness
curl -sSL http://10.42.2.31:9500/metrics | grep longhorn_volume_robustness
>

Once a volume is attached, longhorn_volume_robustness reports the corresponding value:

> k get volume -A
NAMESPACE         NAME                                       DATA ENGINE   STATE      ROBUSTNESS   SCHEDULED   SIZE         NODE               AGE
longhorn-system   pvc-67fc71f0-9880-4cac-88f1-63cc7ebd7be2   v1            detached   unknown                  2147483648                      5m29s
longhorn-system   test                                       v1            attached   healthy                  1073741824   ip-172-31-33-245   5m22s
>
> curl -sSL http://10.42.1.25:9500/metrics | grep longhorn_volume_robustness
curl -sSL http://10.42.0.27:9500/metrics | grep longhorn_volume_robustness
curl -sSL http://10.42.2.31:9500/metrics | grep longhorn_volume_robustness
# HELP longhorn_volume_robustness Robustness of this volume
# TYPE longhorn_volume_robustness gauge
longhorn_volume_robustness{node="ip-172-31-33-245",pvc="",pvc_namespace="",volume="test"} 1
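
To compare scrapes programmatically rather than by eye, the metrics text can be parsed into a per-volume map. The following is a minimal sketch, assuming the endpoint returns the standard Prometheus text format shown above; `parse_robustness` is a hypothetical helper name and is not part of Longhorn.

```python
import re

# Matches one sample line of the Prometheus text format, for example:
#   longhorn_volume_robustness{node="n1",volume="test"} 1
# An optional trailing integer timestamp is tolerated.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
# Matches one label pair such as volume="test" (escaped quotes allowed).
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_robustness(metrics_text, metric="longhorn_volume_robustness"):
    """Return {volume_name: value} for every sample of `metric`."""
    result = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        result[labels.get("volume", "")] = float(match.group("value"))
    return result


# Sample taken from the output in this report.
sample = '''# HELP longhorn_volume_robustness Robustness of this volume
# TYPE longhorn_volume_robustness gauge
longhorn_volume_robustness{node="ip-172-31-33-245",pvc="",pvc_namespace="",volume="test"} 1
'''
print(parse_robustness(sample))
```

An empty result from every manager pod while volumes exist corresponds to the failure described in this issue.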

To Reproduce

  1. Deploy Longhorn master
  2. Create a volume and leave it detached
  3. Check the longhorn_volume_robustness metric
  4. The metric returns no result for the detached volume

Expected behavior

The longhorn_volume_robustness metric should be reported for detached volumes.

Support bundle for troubleshooting

N/A

Environment

  • Longhorn version: master

Additional context

#7626 (comment)

chriscchien added the kind/bug, reproduce/always, kind/regression, area/monitoring, severity/3, require/qa-review-coverage, and require/backport labels on Mar 8, 2024
chriscchien added this to the v1.7.0 milestone on Mar 8, 2024
innobead added the priority/0, backport/1.6.1, and backport/1.5.5 labels on Mar 8, 2024

c3y1huang commented Mar 12, 2024

All volume detached, no result of metric longhorn_volume_robustness

Have volume attached then longhorn_volume_robustness have the correspond value

Cause: The volume collector returns early because the engine is not running, which happens before longhorn_volume_robustness is reported. This is a side effect of longhorn/longhorn-manager#2665.

[longhorn-manager-467n7] time="2024-03-12T04:44:05Z" level=warning msg="Failed to get engine proxy of longhorn-testvol-z41999-e-0 for volume longhorn-testvol-z41999" func="metrics_collector.(*VolumeCollector).collectMetrics" file="volume_collector.go:196" collector=volume error="failed to get binary client for engine longhorn-testvol-z41999-e-0: cannot get client for engine longhorn-testvol-z41999-e-0: engine is not running" node=ip-10-0-2-36
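
The control-flow problem described above can be illustrated with a small sketch. This is not the longhorn-manager code (which is written in Go) and not necessarily how longhorn-manager#2691 fixes it; `Volume`, `collect_buggy`, `collect_fixed`, and `engine_metric` are hypothetical names, and the numeric mapping covers only the two values that appear in this issue (unknown as 0, healthy as 1).

```python
from dataclasses import dataclass

# Hypothetical stand-ins for illustration only. Only the two robustness
# values observed in this issue are mapped.
ROBUSTNESS_VALUES = {"unknown": 0, "healthy": 1}


@dataclass
class Volume:
    name: str
    robustness: str
    engine_running: bool


def collect_buggy(volume):
    """Bail out as soon as the engine client is unavailable."""
    metrics = {}
    if not volume.engine_running:
        # Early return: nothing at all is reported for a detached volume.
        return metrics
    metrics["engine_metric"] = 1  # placeholder for metrics that need the engine
    metrics["longhorn_volume_robustness"] = ROBUSTNESS_VALUES[volume.robustness]
    return metrics


def collect_fixed(volume):
    """Report engine-independent metrics first, then engine-dependent ones."""
    metrics = {
        "longhorn_volume_robustness": ROBUSTNESS_VALUES[volume.robustness],
    }
    if volume.engine_running:
        metrics["engine_metric"] = 1  # placeholder for metrics that need the engine
    return metrics


detached = Volume(name="test", robustness="unknown", engine_running=False)
print(collect_buggy(detached))
print(collect_fixed(detached))
```

The point of the sketch is ordering: any metric that does not depend on a running engine should be emitted before the check that can abort collection.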


c3y1huang commented Mar 12, 2024

This issue can not be reproduced on v1.6.x.

This doesn't need to be backported to v1.6.x or v1.5.x because longhorn/longhorn-manager#2665 was not backported to those branches.


longhorn-io-github-bot commented Mar 12, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: issue description

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at fix(metrics): longhorn_volume_robustness not collected for detached volume longhorn-manager#2691

  • Which areas/issues this PR might have potential impacts on?
    Area monitor, manager
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@chriscchien

Verified as passing on Longhorn master (longhorn-manager 340c10).

The longhorn_volume_robustness metric now reflects detached volumes, and the test case test_volume_metrics passed on the pipeline:

> k get volume -A
NAMESPACE         NAME                                       DATA ENGINE   STATE      ROBUSTNESS   SCHEDULED   SIZE         NODE               AGE
longhorn-system   pvc-e5099477-0725-4b34-a5f4-a77ebc190d94   v1            attached   healthy                  2147483648   ip-172-31-34-108   4m52s
longhorn-system   pvc-66d966af-28ae-4148-8679-2f61f0d7d7ae   v1            detached   unknown                  2147483648                      41s
>
> curl -sSL http://10.42.2.31:9500/metrics | grep longhorn_volume_robustness
# HELP longhorn_volume_robustness Robustness of this volume
# TYPE longhorn_volume_robustness gauge
longhorn_volume_robustness{node="ip-172-31-34-108",pvc="vol1",pvc_namespace="default",volume="pvc-e5099477-0725-4b34-a5f4-a77ebc190d94"} 1
longhorn_volume_robustness{node="ip-172-31-34-108",pvc="vol2",pvc_namespace="default",volume="pvc-66d966af-28ae-4148-8679-2f61f0d7d7ae"} 0
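
A regression check for this behavior could assert that every known volume has a sample, whether attached or detached. The sketch below assumes the volume names come from `k get volume -A` and the metrics text from the curl commands used in this issue; `volumes_missing_metric` is a hypothetical helper name.

```python
def volumes_missing_metric(volume_names, metrics_text,
                           metric="longhorn_volume_robustness"):
    """Return the volume names that have no sample of `metric`."""
    # Keep only sample lines of the metric; HELP/TYPE lines start with '#'.
    samples = [
        line for line in metrics_text.splitlines()
        if line.startswith(metric + "{")
    ]
    missing = []
    for name in volume_names:
        needle = 'volume="%s"' % name
        if not any(needle in line for line in samples):
            missing.append(name)
    return missing


# Metrics text and volume names taken from the verification output above.
verified = '''# HELP longhorn_volume_robustness Robustness of this volume
# TYPE longhorn_volume_robustness gauge
longhorn_volume_robustness{node="ip-172-31-34-108",pvc="vol1",pvc_namespace="default",volume="pvc-e5099477-0725-4b34-a5f4-a77ebc190d94"} 1
longhorn_volume_robustness{node="ip-172-31-34-108",pvc="vol2",pvc_namespace="default",volume="pvc-66d966af-28ae-4148-8679-2f61f0d7d7ae"} 0
'''
volumes = [
    "pvc-e5099477-0725-4b34-a5f4-a77ebc190d94",
    "pvc-66d966af-28ae-4148-8679-2f61f0d7d7ae",
]
print(volumes_missing_metric(volumes, verified))
```

In the earlier outputs only one of the three manager pods returned a sample for a given volume, so it is probably safest to concatenate the scrape of every manager pod before running the check.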
