
[BUG] Cannot add block-type disk to node resource due to timeout error #7253

Closed · derekbit opened this issue Dec 4, 2023 · 10 comments
Labels: area/v2-data-engine (v2 data engine (SPDK)), kind/bug, priority/0 (Must be fixed in this release, managed by PO), require/backport (Require backport. Only used when the specific versions to backport have not been defined.), require/qa-review-coverage (Require QA to review coverage)
Milestone: v1.6.0

Comments

derekbit (Member) commented Dec 4, 2023

Describe the bug (🐛 if you encounter this issue)

Cannot add a block-type disk to the node resource after updating longhorn-spdk-engine and go-spdk-helper in longhorn-instance-manager.

longhorn/longhorn-instance-manager@5652de6

    test-block-1:
      conditions:
      - lastProbeTime: ""
        lastTransitionTime: "2023-12-04T14:34:03Z"
        message: 'Disk test-block-1(/dev/loop0) on node rancher60-worker2 is not ready:
          failed to get disk config: error: rpc error: code = Internal desc = rpc
          error: code = Internal desc = failed to get AIO bdev with name test-block-1:
          error sending message, id 0, method bdev_get_bdevs, params {test-block-1
          0}: timeout 30s waiting for response during async message send'
        reason: NoDiskInfo
        status: "False"

UPDATE: The disk became ready after waiting for 30 seconds. The 30-second timeout seems not to be enough.

cc @shuo-wu

To Reproduce

  1. Create a single node cluster
  2. Enable v2-data-engine
  3. Add a block-type disk

Expected behavior

The disk should be added without triggering a timeout error.

Support bundle for troubleshooting

Environment

  • Longhorn version:
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:
  • Impacted Longhorn resources:
    • Volume names:

Additional context

derekbit added the kind/bug, require/qa-review-coverage (Require QA to review coverage), and require/backport (Require backport. Only used when the specific versions to backport have not been defined.) labels on Dec 4, 2023
derekbit changed the title from "[BUG] Cannot add block-type disk to node resource" to "[IMPROVEMENT] Cannot add block-type disk to node resource due to timeout error" on Dec 4, 2023
derekbit added the area/v2-data-engine (v2 data engine (SPDK)) label and removed the kind/bug label on Dec 4, 2023
derekbit added this to the v1.6.0 milestone on Dec 4, 2023
derekbit added the priority/0 (Must be fixed in this release, managed by PO) label on Dec 4, 2023
derekbit (Member, Author) commented Dec 4, 2023

Tried adding more log messages. Some responses are never received:

[longhorn-instance-manager] time="2023-12-04T14:23:41Z" level=info msg="Debug ===> dispatch msg: &{method:bdev_nvme_set_options params:{CtrlrLossTimeoutSec:30 ReconnectDelaySec:5 FastIOFailTimeoutSec:15 TransportAckTimeout:14} responseChan:0xc0004c6720}" func="jsonrpc.(*Client).dispatcher" file="client.go:187"
[longhorn-instance-manager] time="2023-12-04T14:23:41Z" level=info msg="Debug ==> id=2360" func="jsonrpc.(*Client).handleSend" file="client.go:152"
[longhorn-instance-manager] time="2023-12-04T14:23:41Z" level=info msg="Debug ==> resp.ID=2360" func="jsonrpc.(*Client).handleRecv" file="client.go:168"
[longhorn-instance-manager] time="2023-12-04T14:23:41Z" level=info msg="Instance Manager SPDK gRPC server listening to 0.0.0.0:8504" func=cmd.start file="start.go:220"
[longhorn-instance-manager] time="2023-12-04T14:23:44Z" level=info msg="Debug ===> dispatch msg: &{method:bdev_get_bdevs params:<nil> responseChan:0xc0004c6900}" func="jsonrpc.(*Client).dispatcher" file="client.go:187"
[longhorn-instance-manager] time="2023-12-04T14:23:44Z" level=info msg="Debug ==> id=2361" func="jsonrpc.(*Client).handleSend" file="client.go:152"
[longhorn-instance-manager] time="2023-12-04T14:23:44Z" level=info msg="Debug ==> resp.ID=2361" func="jsonrpc.(*Client).handleRecv" file="client.go:168"
[longhorn-instance-manager] time="2023-12-04T14:23:44Z" level=info msg="Debug ===> dispatch msg: &{method:bdev_lvol_get_lvstores params:<nil> responseChan:0xc0004c6ae0}" func="jsonrpc.(*Client).dispatcher" file="client.go:187"
[longhorn-instance-manager] time="2023-12-04T14:23:44Z" level=info msg="Debug ==> id=2362" func="jsonrpc.(*Client).handleSend" file="client.go:152"
[longhorn-instance-manager] time="2023-12-04T14:23:44Z" level=error msg="Response receiver queue is blocked for over 3s second" func="jsonrpc.(*Client).read" file="client.go:223"

derekbit changed the title from "[IMPROVEMENT] Cannot add block-type disk to node resource due to timeout error" back to "[BUG] Cannot add block-type disk to node resource due to timeout error" on Dec 4, 2023
derekbit (Member, Author) commented Dec 4, 2023

@shuo-wu Can you help check this issue? Thank you.

shuo-wu (Contributor) commented Dec 5, 2023

> UPDATE: The disk became ready after waiting for 3 seconds. 30-seconds timeout seems not enough.

Sorry... This should mean the 30-second timeout is enough. Is this a typo?

shuo-wu (Contributor) commented Dec 5, 2023

I see... It seems that this timer should be reset before select rather than after select.
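
For illustration only, here is a minimal sketch of that reset-before-select pattern, assuming a generic JSON-RPC-style receive loop; handleResponses, respChan, and timeout are hypothetical names, not the actual go-spdk-helper code. The Stop-and-drain step is the usual precaution so a value from an already-expired timer cannot satisfy the next select immediately.

```go
package sketch

import "time"

// handleResponses waits for responses and enforces a per-wait timeout by
// resetting the timer right before each select instead of after it.
func handleResponses(respChan <-chan []byte, timeout time.Duration) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	for {
		// Stop and drain a possibly already-fired timer so a stale value
		// left in timer.C cannot trip the next select immediately.
		if !timer.Stop() {
			select {
			case <-timer.C:
			default:
			}
		}
		// Reset just before waiting so the timeout window matches the
		// wait that is about to start.
		timer.Reset(timeout)

		select {
		case resp, ok := <-respChan:
			if !ok {
				return // channel closed, nothing more to receive
			}
			_ = resp // dispatch the response to its waiter here
		case <-timer.C:
			return // timed out waiting for a response
		}
	}
}
```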

derekbit (Member, Author) commented Dec 5, 2023

> Sorry... This should mean the 30-second timeout is enough. Is this a typo?

Ah, sorry, it's a typo.

The issue (timeout 30s waiting for response during async message send) still persists after applying longhorn/go-spdk-helper#53.

shuo-wu (Contributor) commented Dec 5, 2023

Weird... And I cannot reproduce it...

shuo-wu (Contributor) commented Dec 7, 2023

The culprit is that this timer does not work as expected. Once it expires, timer.Reset() won't function anymore and the select always falls into this case.

Changing the timer to a ticker does not work either. It won't reset the starting time after calling ticker.Reset() with the same interval, so the select still has a chance to fall into this case.

Hence, the final solution is to create a new timer each time before the select.
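
A minimal sketch of that final approach, again with made-up names (receiveLoop, respChan) rather than the real go-spdk-helper client code: a brand-new timer is created before every select, so an expired timer from an earlier iteration can never leak a stale value into the next wait.

```go
package main

import (
	"fmt"
	"time"
)

// receiveLoop waits for responses, creating a fresh timer before each select
// so a previously expired timer cannot satisfy the timeout case.
func receiveLoop(respChan <-chan string, timeout time.Duration) {
	for {
		timer := time.NewTimer(timeout)

		select {
		case resp, ok := <-respChan:
			timer.Stop() // release the timer; we are done with this wait
			if !ok {
				return // channel closed, stop the loop
			}
			fmt.Println("received response:", resp)
		case <-timer.C:
			fmt.Printf("timeout %v waiting for response\n", timeout)
			return
		}
	}
}

func main() {
	respChan := make(chan string, 1)
	go func() {
		respChan <- "bdev_get_bdevs result"
		close(respChan)
	}()
	receiveLoop(respChan, 30*time.Second)
}
```

Allocating one timer per iteration is cheap compared with a 30-second RPC timeout, and it avoids the Reset/drain subtleties of reusing a single timer or ticker.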

longhorn-io-github-bot commented Dec 7, 2023

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: the ticket description

  • Does the PR include the explanation for the fix or the feature?

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at: Fix the wrong timer for the json client response handling (go-spdk-helper#53)

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is: the same as the reproduce steps. Also check that there is no "Response receiver queue is blocked for over" log in the instance-manager pods.

innobead (Member) commented:
@derekbit is this resolved already from your testing? If yes, let's close this issue.

yangchiu (Member) commented:
Verified passed on master-head (longhorn-instance-manager 0d57324) following the test steps. Block-type disks can be added, and v2 volumes can be created.
