vfio: Define device migration protocol v2
Replace the existing region based migration protocol with an ioctl based
protocol. The two protocols have the same general semantic behaviors, but
the way the data is transported is changed.

This is the mandatory portion of the new protocol; it defines the 5
mandatory states for basic stop and copy migration and the protocol to
move the migration data in/out of the kernel.

Compared to the clarification of the v1 protocol Alex proposed:

https://lore.kernel.org/r/163909282574.728533.7460416142511440919.stgit@omen

This has a few deliberate functional differences:

 - ERROR arcs allow the device function to remain unchanged.

 - The protocol is not required to return to the original state on
   transition failure. Instead we directly return the current state,
   whatever it may be. Userspace can execute an unwind back to the
   original state, reset, or do something else without needing kernel
   support. This simplifies the kernel design and, should userspace
   choose a policy like "always reset on error", avoids doing useless
   work in the kernel on error handling paths.

 - PRE_COPY is made optional; userspace must discover it before using it.
   This reflects the fact that the majority of drivers we are aware of
   right now will not implement PRE_COPY.

 - Segmentation is not part of the data stream protocol; the receiver
   does not have to reproduce the framing boundaries.

The hybrid FSM for the device_state is described as a Mealy machine by
documenting each of the arcs the driver is required to implement. The
remaining set of old/new device_state transitions are defined as
'combination transitions', which naturally take multiple FSM arcs along
the shortest path within the FSM's digraph. This allows a complete
matrix of transitions.
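The shortest-path stepping can be illustrated with a standalone sketch that mirrors the next-hop table implemented below in vfio_mig_get_next_state(). The state numbering here is illustrative, not the uAPI enum values:

```c
#include <assert.h>

/* Illustrative state numbering; not the uAPI values. */
enum { STOP, RUNNING, STOP_COPY, RESUMING, ERROR, NUM_STATES };

/* Next-hop table transcribed from the kernel's vfio_from_fsm_table. */
static const unsigned char next_state[NUM_STATES][NUM_STATES] = {
	[STOP]      = { [STOP] = STOP, [RUNNING] = RUNNING,
			[STOP_COPY] = STOP_COPY, [RESUMING] = RESUMING,
			[ERROR] = ERROR },
	[RUNNING]   = { [STOP] = STOP, [RUNNING] = RUNNING,
			[STOP_COPY] = STOP, [RESUMING] = STOP,
			[ERROR] = ERROR },
	[STOP_COPY] = { [STOP] = STOP, [RUNNING] = STOP,
			[STOP_COPY] = STOP_COPY, [RESUMING] = STOP,
			[ERROR] = ERROR },
	[RESUMING]  = { [STOP] = STOP, [RUNNING] = STOP,
			[STOP_COPY] = STOP, [RESUMING] = RESUMING,
			[ERROR] = ERROR },
	[ERROR]     = { [STOP] = ERROR, [RUNNING] = ERROR,
			[STOP_COPY] = ERROR, [RESUMING] = ERROR,
			[ERROR] = ERROR },
};

/*
 * Walk toward 'target' one arc at a time, recording each hop in 'path'.
 * Returns the number of hops, or -1 if the transition is impossible.
 * This is the repeated-call pattern the kernel helper supports.
 */
static int walk_fsm(int cur, int target, int *path, int max_hops)
{
	int n = 0;

	while (cur != target) {
		if (n == max_hops)
			return -1;
		cur = next_state[cur][target];
		if (cur == ERROR)
			return -1;
		path[n++] = cur;
	}
	return n;
}
```

Walking RUNNING -> STOP_COPY yields the two hops STOP, STOP_COPY, matching the combination transition list in the code comment below.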

A new IOCTL VFIO_DEVICE_MIG_SET_STATE is defined to replace writing to the
device_state field in the region. This allows returning more information
in the case of failure, and includes returning a brand new FD whenever the
requested arc opens a data transfer session.
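From userspace the new ioctl can be wrapped roughly as below. The field order and the request number are assumptions inferred from the kernel code in this patch, not a released uAPI header, so treat this strictly as a sketch:

```c
#include <stdint.h>
#include <sys/ioctl.h>

/* Layout inferred from the patch; the real definition belongs in <linux/vfio.h>. */
struct vfio_device_mig_set_state {
	uint32_t argsz;
	uint32_t device_state;	/* in: requested state; out: resulting state */
	uint32_t flags;		/* must be 0 */
	int32_t  data_fd;	/* out: -1, or FD for a new data transfer session */
};

/* Hypothetical request number; the real value comes from the uAPI header. */
#define VFIO_DEVICE_MIG_SET_STATE _IO(';', 120)

/*
 * Request a state change. On success *data_fd is -1 or an open session FD.
 * On failure *device_state reports where the FSM actually landed, so the
 * caller can unwind, reset, or give up without kernel help.
 */
static int vfio_mig_set_state(int device_fd, uint32_t *device_state,
			      int *data_fd)
{
	struct vfio_device_mig_set_state set_state = {
		.argsz = sizeof(set_state),
		.device_state = *device_state,
	};
	int ret = ioctl(device_fd, VFIO_DEVICE_MIG_SET_STATE, &set_state);

	*device_state = set_state.device_state;
	*data_fd = set_state.data_fd;
	return ret;
}
```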

The VFIO core code implements the new ioctl and provides a helper function
to the driver. Using the helper the driver only has to implement 6 of the
FSM arcs and the other combination transitions are elaborated consistently
from those arcs.

The ioctl VFIO_DEVICE_MIG_ARC_SUPPORTED is defined as a way to query the
kernel for support of FSM capabilities. This allows userspace to discover
optional FSM features, and provides a robust route for future
expansion. Combined with the ability of the kernel to execute combination
transitions there is a lot of flexibility to define new arcs and states in
the future while still providing a backward compatible SET_STATE interface
to userspace. The existing VFIO_DEVICE_FEATURE ioctl can also be used as
part of any future migration feature negotiation.
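A userspace probe for an optional arc such as PRE_COPY could look like the sketch below; again the struct layout is inferred from the kernel code in this patch and the request number is a placeholder, not the real uAPI:

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Layout inferred from the patch (minsz runs through to_state). */
struct vfio_device_mig_arc_supported {
	uint32_t argsz;
	uint32_t from_state;	/* enum vfio_device_mig_state */
	uint32_t to_state;
};

/* Hypothetical request number; the real value comes from the uAPI header. */
#define VFIO_DEVICE_MIG_ARC_SUPPORTED _IO(';', 121)

/* The kernel returns 0 if the arc exists and -ENOENT if it does not. */
static bool vfio_arc_supported(int device_fd, uint32_t from, uint32_t to)
{
	struct vfio_device_mig_arc_supported supp = {
		.argsz = sizeof(supp),
		.from_state = from,
		.to_state = to,
	};

	return ioctl(device_fd, VFIO_DEVICE_MIG_ARC_SUPPORTED, &supp) == 0;
}
```

Probing once at setup lets a management tool decide between a pre-copy and a stop-and-copy migration plan before any state change is attempted.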

Data transfer sessions are now carried over a file descriptor, instead of
the region. The FD remains valid for the lifetime of the data transfer
session. read() and write() transfer the data with normal Linux stream FD
semantics. This design allows future expansion to support poll(),
io_uring, and other performance optimizations.
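Because the data FD follows normal stream semantics, the save side in userspace reduces to an ordinary read() loop; a minimal sketch (the in-memory sink is illustrative, a real consumer would stream into the migration channel):

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Drain a migration data FD with plain stream-FD semantics: read() until
 * EOF, retrying on EINTR and tolerating short reads. Returns total bytes
 * read, or -1 on error.
 */
static ssize_t drain_mig_fd(int data_fd, char *sink, size_t cap)
{
	size_t total = 0;

	while (total < cap) {
		ssize_t n = read(data_fd, sink + total, cap - total);

		if (n == 0)		/* EOF: device finished emitting state */
			return total;
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		total += n;
	}
	return total;
}
```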

As the current qemu design requires the available data size up front, the
VFIO_DEVICE_MIG_FD_SEGMENT ioctl allows querying it so qemu can build the
data frame.

The complicated mmap mode for data transfer is discarded as current qemu
doesn't take meaningful advantage of it, and the new qemu implementation
avoids substantially all the performance penalty of using read() on the
region.

Change-Id: Iaf7940cd9804becf7a1040e019e39af7e0b75fa7
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
jgunthorpe committed Jan 23, 2022
1 parent c1fa42c commit 96734f6e53ef22b459d3ba12b8d22b4309a4cd15
Showing 3 changed files with 397 additions and 8 deletions.
@@ -1557,15 +1557,204 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
        return 0;
}

/*
 * vfio_mig_get_next_state - Compute the next step in the FSM
 * @cur_fsm - The current state the device is in
 * @new_fsm - The target state to reach
 *
 * Return the next step in the state progression between cur_fsm and new_fsm.
 * This breaks down requests for combination transitions into smaller steps and
 * returns the next step to get to new_fsm. The function may need to be called
 * multiple times before reaching new_fsm.
 *
 * VFIO_DEVICE_STATE_ERROR is returned if the state transition is not allowed.
 */
u32 vfio_mig_get_next_state(struct vfio_device *device,
                            enum vfio_device_mig_state cur_fsm,
                            enum vfio_device_mig_state new_fsm)
{
        enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RESUMING + 1 };
        /*
         * The coding in this table requires the driver to implement 6
         * FSM arcs:
         *         RESUMING -> STOP
         *         RUNNING -> STOP
         *         STOP -> RESUMING
         *         STOP -> RUNNING
         *         STOP -> STOP_COPY
         *         STOP_COPY -> STOP
         *
         * The coding will step through multiple states for these combination
         * transitions:
         *         RESUMING -> STOP -> RUNNING
         *         RESUMING -> STOP -> STOP_COPY
         *         RUNNING -> STOP -> RESUMING
         *         RUNNING -> STOP -> STOP_COPY
         *         STOP_COPY -> STOP -> RESUMING
         *         STOP_COPY -> STOP -> RUNNING
         */
        static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
                [VFIO_DEVICE_STATE_STOP] = {
                        [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
                        [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
                        [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
                        [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
                        [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
                },
                [VFIO_DEVICE_STATE_RUNNING] = {
                        [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
                        [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
                        [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
                        [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
                        [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
                },
                [VFIO_DEVICE_STATE_STOP_COPY] = {
                        [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
                        [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
                        [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
                        [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
                        [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
                },
                [VFIO_DEVICE_STATE_RESUMING] = {
                        [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
                        [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
                        [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
                        [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
                        [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
                },
                [VFIO_DEVICE_STATE_ERROR] = {
                        [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR,
                        [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
                        [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
                        [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
                        [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
                },
        };

        if (cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
            new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
                return VFIO_DEVICE_STATE_ERROR;

        return vfio_from_fsm_table[cur_fsm][new_fsm];
}
EXPORT_SYMBOL_GPL(vfio_mig_get_next_state);

/*
 * Convert the driver's struct file into a FD number and return it to userspace
 */
static int vfio_ioct_mig_return_fd(struct file *filp, void __user *arg,
                                   struct vfio_device_mig_set_state *set_state)
{
        int ret;
        int fd;

        fd = get_unused_fd_flags(O_CLOEXEC);
        if (fd < 0) {
                ret = fd;
                goto out_fput;
        }

        set_state->data_fd = fd;
        if (copy_to_user(arg, set_state, sizeof(*set_state))) {
                ret = -EFAULT;
                goto out_put_unused;
        }
        fd_install(fd, filp);
        return 0;

out_put_unused:
        put_unused_fd(fd);
out_fput:
        fput(filp);
        return ret;
}

static int vfio_ioctl_mig_set_state(struct vfio_device *device,
                                    void __user *arg)
{
        size_t minsz =
                offsetofend(struct vfio_device_mig_set_state, flags);
        enum vfio_device_mig_state final_state = VFIO_DEVICE_STATE_ERROR;
        struct vfio_device_mig_set_state set_state;
        struct file *filp;

        if (!device->ops->migration_set_state)
                return -EOPNOTSUPP;

        if (copy_from_user(&set_state, arg, minsz))
                return -EFAULT;

        if (set_state.argsz < minsz || set_state.flags)
                return -EOPNOTSUPP;

        /*
         * It is tempting to try to validate set_state.device_state here, but
         * then we can't return final_state. The validation is done in
         * vfio_mig_get_next_state().
         */
        filp = device->ops->migration_set_state(device, set_state.device_state,
                                                &final_state);
        set_state.device_state = final_state;
        if (IS_ERR(filp)) {
                if (WARN_ON(PTR_ERR(filp) == -EOPNOTSUPP ||
                            PTR_ERR(filp) == -ENOTTY ||
                            PTR_ERR(filp) == -EFAULT))
                        filp = ERR_PTR(-EINVAL);
                goto out_copy;
        }

        if (!filp)
                goto out_copy;
        return vfio_ioct_mig_return_fd(filp, arg, &set_state);

out_copy:
        set_state.data_fd = -1;
        if (copy_to_user(arg, &set_state, sizeof(set_state)))
                return -EFAULT;
        if (IS_ERR(filp))
                return PTR_ERR(filp);
        return 0;
}

static int vfio_ioctl_mig_arc_supported(struct vfio_device *device,
                                        void __user *arg)
{
        size_t minsz =
                offsetofend(struct vfio_device_mig_arc_supported, to_state);
        struct vfio_device_mig_arc_supported supp;

        if (!device->ops->migration_set_state)
                return -EOPNOTSUPP;

        if (copy_from_user(&supp, arg, minsz))
                return -EFAULT;

        if (supp.argsz < minsz)
                return -EINVAL;

        /*
         * The coding tables always have error in the first hop if the
         * ultimate destination is impossible.
         */
        if (vfio_mig_get_next_state(device, supp.from_state, supp.to_state) ==
            VFIO_DEVICE_STATE_ERROR)
                return -ENOENT;
        return 0;
}

static long vfio_device_fops_unl_ioctl(struct file *filep,
                                       unsigned int cmd, unsigned long arg)
{
        struct vfio_device *device = filep->private_data;

        switch (cmd) {
        case VFIO_DEVICE_MIG_SET_STATE:
                return vfio_ioctl_mig_set_state(device, (void __user *)arg);
        case VFIO_DEVICE_MIG_ARC_SUPPORTED:
                return vfio_ioctl_mig_arc_supported(device, (void __user *)arg);
        default:
                if (unlikely(!device->ops->ioctl))
                        return -EINVAL;
                return device->ops->ioctl(device, cmd, arg);
        }
}

static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,
@@ -55,6 +55,8 @@ struct vfio_device {
 * @match: Optional device name match callback (return: 0 for no-match, >0 for
 *         match, -errno for abort (ex. match with insufficient or incorrect
 *         additional args)
 * @migration_set_state: Optional callback to change the migration
 *         state for devices that support migration.
 */
struct vfio_device_ops {
        char *name;
@@ -69,6 +71,10 @@ struct vfio_device_ops {
        int (*mmap)(struct vfio_device *vdev, struct vm_area_struct *vma);
        void (*request)(struct vfio_device *vdev, unsigned int count);
        int (*match)(struct vfio_device *vdev, char *buf);
        struct file *(*migration_set_state)(
                struct vfio_device *device,
                enum vfio_device_mig_state new_state,
                enum vfio_device_mig_state *final_state);
};

void vfio_init_group_dev(struct vfio_device *device, struct device *dev,
@@ -82,6 +88,10 @@ extern void vfio_device_put(struct vfio_device *device);

int vfio_assign_device_set(struct vfio_device *device, void *set_id);

u32 vfio_mig_get_next_state(struct vfio_device *device,
                            enum vfio_device_mig_state cur_fsm,
                            enum vfio_device_mig_state new_fsm);

/*
* External user API
*/
