Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Device hot plug support #157

Merged
merged 5 commits into from Jul 18, 2017
Merged

Device hot plug support #157

merged 5 commits into from Jul 18, 2017

Conversation

yishaih
Copy link
Member

@yishaih yishaih commented Jun 29, 2017

This series adds a device hot plug support based on the RFC and its follows discussion.

Maor Gottlieb added 2 commits June 29, 2017 15:17
Move ts_cmp from srp_daemon.h to util.h, to be used from
another place in a downstream patch.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
ibverbs_init is called only in the first time that
ibv_get_device_list is called by the user.

This function does two main things:
1) Initialization actions like fork init, memory check, etc.
2) Scan the sysfs device list and build the cached ibv_device list.

This patch refactors the ibverbs_init and moves out the built
device list code into new function - ibverbs_get_device_list.

Motivation: In downstream patch, we refresh the ibv_device list
according to the updated snapshot of the sysfs devices, so we want
that the initialization action will continue to be singleton, but
rescan the sysfs in each call to ibv_get_device_list.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>

pthread_mutex_lock(&dev_list_lock);
num_devices = ibverbs_get_device_list(&device_list);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we grab a mutex we don't need the pthread_once, just do the init within this mutex.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be done, will upload soon a series with this small change.

@@ -262,6 +267,8 @@ struct ibv_context *__ibv_open_device(struct ibv_device *device)
context->cmd_fd = cmd_fd;
pthread_mutex_init(&context->mutex, NULL);

ibverbs_device_hold(device);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this should fail early here if the device is already known to be deleted and not grab another ref on it. eg consider the case where the dev node has been reclaimed by another device. There is a race here of some kind with the kernel. We need to do design this so that the kernel cannot re-use the cdev numbers or sysfs file names, and do the timestamp test appropriately to ensure that we are not opening the wrong instance of a ibv_device.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ref count is needed to keep the ibv_device memory regardless whether its kernel device was deleted. Cleaning the memory upon device deletion is wrong and may crash the application.
Application can get the 'IBV_EVENT_DEVICE_FATAL' event to know about its device deletion and upon ibv_close_device the memory will be freed.

The rare race that the kernel device was deleted and was re-claimed by another device just before that ibv_open_device is called is orthogonal to this series and might happen without this series as well.

This can be handled in some future support in kernel side or in the user area by using the timestamp creation just after calling cmd_fd = open(devpath, O_RDWR | O_CLOEXEC); as part of __ibv_open_device().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not talking about clearing memory, I am saying ibv_open_device should not return a device we already know to be gone.

I disagree about the race, it is very relevant to this series and must be addressed here, like you say, by checking the timestamp in proper places. Previously we expressly did not support hotplug while verbs apps are running, now we do, so we can no longer ignore these fringe cases.

You need to think of something to make sysfs safe as well, maybe openat()...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to address that in this series, we can go with a simple solution in the user area as follows:
As part of __ibv_open_device just after that the cmd_fd was opened successfully we can make sure that this is still the sysfs device that the application asked to open by re-reading the timestamp and comparing to what was already set on the ibv_device, In case that TS was changed it points on some "other" device and the API will return an error. Otherwise we can safely continue to allocate the context by calling the kernel device. In case that by that time the device was somehow dropped the API will fail and an error will be returned as expected.

Does it make sense ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is what I was asking for originally.

Plus you have to do some solution for everything that touches sysfs. Again, maybe openat() is the answer there.

Otherwise we can safely continue to allocate the context by calling the kernel device. In case that by that time the device was somehow dropped the API will fail and an error will be returned as expected.

I think you also need to check that the device hasn't been discovered to be removed before even attempting to open.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We made some extra checking re the option to use the openat system call and found that it can be done. However, it requires holding an open 'FD' per ibv_device. In some clouds environments which uses SRIOV and many NIC(s) it might be large numbers of 'FD's which is a limited resource per process. The alternative solution with TS doesn't require to hold any additional 'FD' and as such is preferred,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should do as I suggested, and hold the fd during the open_device and close it once the device is successfully opened - particularly for the downcall into the driver.

sysfs access outside the opendevice flow needs to be identified and handled via TS checking, I'm not sure how much of that there is..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open the FD as part of ibv_open_device is too late as it might open the new device and as such openat won't fail. If TS is used by that step, there is no extra need to use open/openat/close.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still need to protect all sysfs access, hold the dirfd across sysfs accesses. I have some rework in a branch here, let me finish it and you can see how it would work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already agreed that TS should be used to detect dropped devices as part of get device list flow.

This discussion is for further cases which are not directly related to this series and can happen also today, we prefer going ahead and take the series, any further handling can be done on top of this series.

Maor Gottlieb added 3 commits July 10, 2017 17:20
Problem
=======
Currently, libibverbs builds the cached ibv_device list only
in the first time that ibv_get_device_list is called, as a result
the list is never updated, no matter if there were hardware changes
in the system.

Solution
========
Modify the implementation of ibv_get_device_list() so that
consecutive calls will re-scan the sysfs in the same manner as done
today in order to create a fresh ibv_device list each time.
For this purpose, the cached device list is change to be a
real linked list and not a dynamic array.

How we identify new devices
===========================
Identifying same device is done according to timestamp creation of
/sys/class/infiniband_verbs/uverbs%d/ibdev.
We get the file status with stat syscall and use the st_mtime field for
this purpose.

When we rescan the sysfs devices, then we check for each sysfs device
if it was already been in the last scan, if not then we allocate new
ibv_device and add it to the cached device list.
Next patch in this series handles the case that a device is no
longer in use.

Note:
This patch changes the IBVERBS_PRIVATE symbol as required in the note
above verbs_device struct.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Now, ibv_device list is refreshed each time that
ibv_get_device_list is called, therefore we need to free devices
from previous scan which aren't bound anymore, otherwise it could lead
to memory leak of ibv_device structures. We can free the memory of
ibv_device only if it isn't used anymore by the user.

How we identify if device is still used
========================================
We add a reference count to the verbs device struct.
This reference count is increased when:
  a. Set to one for having the device in the list until it should be
     deleted.
  b. User call to ibv_get_device_list.
  c. User call to ibv_open_device.

The reference count is decreased when:
  a. User call to ibv_free_device_list.
  b. User call to ibv_close_device.
  c. Device is no longer exists in the sysfs.

Device will be freed when the refcount is decreased to zero.

For free the ibv_device struct, we add uninit_device callback function
to verbs_device_ops.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
When libibvers calls to verbs_init_func of the provider,
then the provider allocates the verbs_device.

Add to all providers a function to free this memory once it's
called.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
@dledford dledford merged commit d05bd7b into linux-rdma:master Jul 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants