Support keep alive in spdk tcp backend #462

Open
LiadOz opened this issue Apr 18, 2024 · 11 comments

@LiadOz
Contributor

LiadOz commented Apr 18, 2024

According to the NVMe/TCP Transport Specification (2021.06.02):

The NVMe/TCP Transport requires the use of the Keep Alive feature (refer to section 7.12 in the NVMe
Base Specification). The NVMe/TCP Transport does not impose any limitations on the minimum and
maximum Keep Alive Timeout value. The minimum should be set large enough to account for any transient
fabric interconnect failures between the host and controller.
TCP level keep alive functionality is not prohibited but it is recommended that the TCP level keep alive
timeout is set to a higher value than the NVMe Keep Alive Timeout to avoid conflicting policies.

We have encountered the need for a keep alive timeout in our application in the case of a sudden shutdown. The shutdown caused the host to abort the commands, but the target kept them until it was brought back a few minutes later. After that time, commands that were sent before the shutdown were completed without the application knowing about them, causing data integrity issues. Had keep alive been active, the target would have aborted those commands.

In lib/xnvme_be_spdk_dev.c the keep_alive_timeout_ms option is disabled for the TCP and RDMA transport types, and it seems this has been the case since the SPDK backend was created.
Allowing the user to specify a keep alive timeout is not enough; here is how that keep alive should be used according to SPDK:

// include/spdk/nvme.h
struct spdk_nvme_ctrlr_opts {
    // ...

    /**
     * Keep alive timeout in milliseconds (0 = disabled).
     *
     * The NVMe library will set the Keep Alive Timer feature to this value and automatically
     * send Keep Alive commands as needed.  The library user must call
     * spdk_nvme_ctrlr_process_admin_completions() periodically to ensure Keep Alive commands
     * are sent.
     */
    uint32_t keep_alive_timeout_ms;
    // ...
};

So to fully support the feature, xNVMe needs to either create a thread that polls spdk_nvme_ctrlr_process_admin_completions or give the user the ability to call that function on their own.
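
To illustrate the first option, here is a minimal sketch of what such a polling thread could look like. This is not existing xNVMe code: the struct spdk_nvme_ctrlr handle would have to be exposed by the backend, and the admin_poller type and the 100 ms interval are assumptions made for the example.

#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

#include <spdk/nvme.h>

struct admin_poller {
    struct spdk_nvme_ctrlr *ctrlr; // would have to be exposed by the SPDK backend
    volatile bool running;
};

// Thread entry point, e.g. started with pthread_create(&tid, NULL, admin_poll_thread, &poller)
static void *
admin_poll_thread(void *arg)
{
    struct admin_poller *poller = arg;

    while (poller->running) {
        // Processes admin completions; with keep_alive_timeout_ms set,
        // SPDK also sends Keep Alive commands from this call as needed.
        spdk_nvme_ctrlr_process_admin_completions(poller->ctrlr);
        usleep(100 * 1000); // 100 ms, well below any reasonable keep alive timeout
    }

    return NULL;
}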

@karlowich
Contributor

Hi @LiadOz, thanks for putting this on our radar. We will have to spend some time coming up with the optimal solution.
For now, I think you should be able to achieve the desired outcome by sending the Keep Alive command "manually" through xnvme_cmd_pass_admin(). Can you try this out?
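
For reference, a manual Keep Alive via xnvme_cmd_pass_admin() could look roughly like the sketch below. The send_keep_alive helper is made up for illustration; 0x18 is the Keep Alive admin opcode from the NVMe Base Specification.

#include <libxnvme.h>

// Hypothetical helper: send a single Keep Alive command through the admin queue
static int
send_keep_alive(struct xnvme_dev *dev)
{
    struct xnvme_cmd_ctx ctx = xnvme_cmd_ctx_from_dev(dev);

    ctx.cmd.common.opcode = 0x18; // Keep Alive (NVMe Base Specification)

    // Keep Alive carries no data or metadata buffer
    return xnvme_cmd_pass_admin(&ctx, NULL, 0, NULL, 0);
}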

@LiadOz
Contributor Author

LiadOz commented Apr 18, 2024

This would technically work as a workaround. If keep_alive_timeout_ms is set, SPDK will send the keep alive command on its own when spdk_nvme_ctrlr_process_admin_completions is called. So with the solution you provided, two keep alives will be sent: one by the user and the other by SPDK. The target probably doesn't care that it will get multiple keep alives at the same time.
I will test this solution in my environment and see if it works.

@LiadOz
Contributor Author

LiadOz commented Apr 24, 2024

@karlowich This solution seems to work for my use case. In my fork I added a spdk_keep_alive_timeout_ms field to xnvme_opts. Can I open a PR for this change?

@karlowich
Contributor

karlowich commented Apr 24, 2024

@LiadOz as you point out in your previous comment, spdk_keep_alive_timeout_ms is not needed when sending the keep alive manually. So I'm not sure I see why we would use spdk_keep_alive_timeout_ms as long as we don't support manually calling spdk_nvme_ctrlr_process_admin_completions.

But maybe I'm missing something?

@LiadOz
Contributor Author

LiadOz commented Apr 24, 2024

spdk_keep_alive_timeout_ms is first passed to the target during the initialization phase. In addition, when spdk_nvme_ctrlr_process_admin_completions is called, it checks whether the keep alive timeout is enabled and, if a sufficient amount of time has passed, sends the keep alive command. So it has two purposes.

@karlowich
Contributor

@LiadOz Yes, but as far as I can tell, if you don't plan on calling spdk_nvme_ctrlr_process_admin_completions, there is no need to set spdk_keep_alive_timeout_ms.

@LiadOz
Contributor Author

LiadOz commented Apr 25, 2024

I'm saying that the spdk_keep_alive_timeout_ms parameter does two things. First, it sets the keep alive timeout value on the target. Once the target has received that value, it knows that if spdk_keep_alive_timeout_ms elapses without a keep alive, it should abort the outstanding commands. This is the behavior that I need.

In addition, when this value is set in SPDK on the host side and spdk_nvme_ctrlr_process_admin_completions is called, SPDK will send a keep alive command on its own once a sufficient amount of time has passed.

So with the solution you suggested, since I can't call spdk_nvme_ctrlr_process_admin_completions directly, I will send keep alive commands on my own. But I still need to let the target know what the acceptable keep alive timeout is, which is why I need to set spdk_keep_alive_timeout_ms.
The parameter affects multiple behaviors, which is why I still need it.
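
For illustration, opening a device with the field from my fork looks roughly like the sketch below; the field name spdk_keep_alive_timeout_ms and the target address are placeholders and may change in the final PR.

#include <libxnvme.h>

// Rough sketch: open an NVMe/TCP device with a keep alive timeout set
static struct xnvme_dev *
open_with_keep_alive(void)
{
    struct xnvme_opts opts = xnvme_opts_default();

    opts.be = "spdk";
    opts.nsid = 1;
    opts.spdk_keep_alive_timeout_ms = 10000; // 10 s timeout announced to the target

    return xnvme_dev_open("10.0.0.1:4420", &opts);
}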

@karlowich
Contributor

@LiadOz Sorry, you are absolutely right. Now I see the value of being able to set spdk_keep_alive_timeout_ms; go ahead and open a PR.

@karlowich
Contributor

@LiadOz Can I close this?

@LiadOz
Contributor Author

LiadOz commented May 15, 2024

My PR only added the option to enable the keep alive feature. The way we are sending the keep alive commands is a workaround/hack, so I don't think this ticket should be closed until a proper solution is implemented. But if you think the current workaround is good enough, you can close this ticket. I have no problem continuing to work with the hack.

@karlowich
Contributor

@LiadOz Good point, let's keep this open to remind us in the future.
