New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Device hot plug support #157
Conversation
Move ts_cmp from srp_daemon.h to util.h, to be used from another place in a downstream patch. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
ibverbs_init is called only in the first time that ibv_get_device_list is called by the user. This function does two main things: 1) Initialization actions like fork init, memory check, etc. 2) Scan the sysfs device list and build the cached ibv_device list. This patch refactors the ibverbs_init and moves out the built device list code into new function - ibverbs_get_device_list. Motivation: In downstream patch, we refresh the ibv_device list according to the updated snapshot of the sysfs devices, so we want that the initialization action will continue to be singleton, but rescan the sysfs in each call to ibv_get_device_list. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
libibverbs/device.c
Outdated
|
||
pthread_mutex_lock(&dev_list_lock); | ||
num_devices = ibverbs_get_device_list(&device_list); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we grab a mutex we don't need the pthread_once, just do the init within this mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be done, will upload soon a series with this small change.
@@ -262,6 +267,8 @@ struct ibv_context *__ibv_open_device(struct ibv_device *device) | |||
context->cmd_fd = cmd_fd; | |||
pthread_mutex_init(&context->mutex, NULL); | |||
|
|||
ibverbs_device_hold(device); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this should fail early here if the device is already known to be deleted and not grab another ref on it. eg consider the case where the dev node has been reclaimed by another device. There is a race here of some kind with the kernel. We need to do design this so that the kernel cannot re-use the cdev numbers or sysfs file names, and do the timestamp test appropriately to ensure that we are not opening the wrong instance of a ibv_device.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ref count is needed to keep the ibv_device memory regardless whether its kernel device was deleted. Cleaning the memory upon device deletion is wrong and may crash the application.
Application can get the 'IBV_EVENT_DEVICE_FATAL' event to know about its device deletion and upon ibv_close_device the memory will be freed.
The rare race that the kernel device was deleted and was re-claimed by another device just before that ibv_open_device is called is orthogonal to this series and might happen without this series as well.
This can be handled in some future support in kernel side or in the user area by using the timestamp creation just after calling cmd_fd = open(devpath, O_RDWR | O_CLOEXEC); as part of __ibv_open_device().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not talking about clearing memory, I am saying ibv_open_device should not return a device we already know to be gone.
I disagree about the race, it is very relevant to this series and must be addressed here, like you say, by checking the timestamp in proper places. Previously we expressly did not support hotplug while verbs apps are running, now we do, so we can no longer ignore these fringe cases.
You need to think of something to make sysfs safe as well, maybe openat()...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to address that in this series, we can go with a simple solution in the user area as follows:
As part of __ibv_open_device just after that the cmd_fd was opened successfully we can make sure that this is still the sysfs device that the application asked to open by re-reading the timestamp and comparing to what was already set on the ibv_device, In case that TS was changed it points on some "other" device and the API will return an error. Otherwise we can safely continue to allocate the context by calling the kernel device. In case that by that time the device was somehow dropped the API will fail and an error will be returned as expected.
Does it make sense ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is what I was asking for originally.
Plus you have to do some solution for everything that touches sysfs. Again, maybe openat() is the answer there.
Otherwise we can safely continue to allocate the context by calling the kernel device. In case that by that time the device was somehow dropped the API will fail and an error will be returned as expected.
I think you also need to check that the device hasn't been discovered to be removed before even attempting to open.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We made some extra checking re the option to use the openat system call and found that it can be done. However, it requires holding an open 'FD' per ibv_device. In some clouds environments which uses SRIOV and many NIC(s) it might be large numbers of 'FD's which is a limited resource per process. The alternative solution with TS doesn't require to hold any additional 'FD' and as such is preferred,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should do as I suggested, and hold the fd during the open_device and close it once the device is successfully opened - particularly for the downcall into the driver.
sysfs access outside the opendevice flow needs to be identified and handled via TS checking, I'm not sure how much of that there is..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open the FD as part of ibv_open_device is too late as it might open the new device and as such openat won't fail. If TS is used by that step, there is no extra need to use open/openat/close.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You still need to protect all sysfs access, hold the dirfd across sysfs accesses. I have some rework in a branch here, let me finish it and you can see how it would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already agreed that TS should be used to detect dropped devices as part of get device list flow.
This discussion is for further cases which are not directly related to this series and can happen also today, we prefer going ahead and take the series, any further handling can be done on top of this series.
Problem ======= Currently, libibverbs builds the cached ibv_device list only in the first time that ibv_get_device_list is called, as a result the list is never updated, no matter if there were hardware changes in the system. Solution ======== Modify the implementation of ibv_get_device_list() so that consecutive calls will re-scan the sysfs in the same manner as done today in order to create a fresh ibv_device list each time. For this purpose, the cached device list is change to be a real linked list and not a dynamic array. How we identify new devices =========================== Identifying same device is done according to timestamp creation of /sys/class/infiniband_verbs/uverbs%d/ibdev. We get the file status with stat syscall and use the st_mtime field for this purpose. When we rescan the sysfs devices, then we check for each sysfs device if it was already been in the last scan, if not then we allocate new ibv_device and add it to the cached device list. Next patch in this series handles the case that a device is no longer in use. Note: This patch changes the IBVERBS_PRIVATE symbol as required in the note above verbs_device struct. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Now, ibv_device list is refreshed each time that ibv_get_device_list is called, therefore we need to free devices from previous scan which aren't bound anymore, otherwise it could lead to memory leak of ibv_device structures. We can free the memory of ibv_device only if it isn't used anymore by the user. How we identify if device is still used ======================================== We add a reference count to the verbs device struct. This reference count is increased when: a. Set to one for having the device in the list until it should be deleted. b. User call to ibv_get_device_list. c. User call to ibv_open_device. The reference count is decreased when: a. User call to ibv_free_device_list. b. User call to ibv_close_device. c. Device is no longer exists in the sysfs. Device will be freed when the refcount is decreased to zero. For free the ibv_device struct, we add uninit_device callback function to verbs_device_ops. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
When libibvers calls to verbs_init_func of the provider, then the provider allocates the verbs_device. Add to all providers a function to free this memory once it's called. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
This series adds a device hot plug support based on the RFC and its follows discussion.