| 1 | +.. _hv-device-passthrough: |
| 2 | + |
| 3 | +Device Passthrough |
| 4 | +################## |
| 5 | + |
| 6 | +A critical part of virtualization is virtualizing devices: exposing all |
| 7 | +aspects of a device, including its I/O, interrupts, DMA, and configuration. |
| 8 | +There are three typical device |
| 9 | +virtualization methods: emulation, para-virtualization, and passthrough. |
| 10 | +The ACRN project uses both emulation and passthrough. Device |
| 11 | +emulation is discussed in :ref:`hld-io-emulation`, and |
| 12 | +device passthrough is discussed here. |
| 13 | + |
| 14 | +In the ACRN project, device emulation means emulating all existing hardware |
| 15 | +resources through a software component, the device model, running in the |
| 16 | +Service OS (SOS). Device |
| 17 | +emulation must maintain the same software interface as a native device, |
| 18 | +providing transparency to the VM software stack. Passthrough, implemented in the |
| 19 | +hypervisor, assigns a physical device to a VM so the VM can access |
| 20 | +the hardware device directly with minimal (if any) VMM involvement. |
| 21 | + |
| 22 | +The difference between device emulation and passthrough is shown in |
| 23 | +:numref:`emu-passthru-diff`. Device emulation has |
| 24 | +a longer access path, which results in lower performance than |
| 25 | +passthrough. Passthrough can deliver near-native performance, but |
| 26 | +it cannot support device sharing. |
| 27 | + |
| 28 | +.. figure:: images/passthru-image30.png |
| 29 | + :align: center |
| 30 | + :name: emu-passthru-diff |
| 31 | + |
| 32 | + Difference between emulation and passthrough |
| 33 | + |
| 34 | +Passthrough in the hypervisor provides the following functionality to |
| 35 | +allow a VM to access PCI devices directly: |
| 36 | + |
| 37 | +- DMA remapping by VT-d for PCI devices: the hypervisor sets up DMA |
| 38 | + remapping during the VM initialization phase. |
| 39 | +- MMIO remapping between virtual and physical BARs |
| 40 | +- Device configuration emulation |
| 41 | +- Interrupt remapping for PCI devices |
| 42 | +- ACPI configuration virtualization |
| 43 | +- GSI sharing violation check |
| 44 | + |
| 45 | +The following diagram details the passthrough initialization control flow in ACRN: |
| 46 | + |
| 47 | +.. figure:: images/passthru-image22.png |
| 48 | + :align: center |
| 49 | + |
| 50 | + Passthrough device initialization control flow |
| 51 | + |
| 52 | +Passthrough Device Status |
| 53 | +************************* |
| 54 | + |
| 55 | +Most common devices on supported platforms are enabled for |
| 56 | +passthrough, as detailed here: |
| 57 | + |
| 58 | +.. figure:: images/passthru-image77.png |
| 59 | + :align: center |
| 60 | + |
| 61 | + Passthrough Device Status |
| 62 | + |
| 63 | +DMA Remapping |
| 64 | +************* |
| 65 | + |
| 66 | +To enable passthrough, DMA addresses must be translated: a VM only knows |
| 67 | +guest physical addresses (GPA), while physical DMA requires host physical addresses (HPA). One workaround |
| 68 | +is to build an identity mapping so that GPA equals HPA, but this |
| 69 | +is not recommended because some VMs don't support relocation well. To |
| 70 | +address this issue, Intel introduced VT-d in the chipset, which adds a |
| 71 | +remapping engine that translates GPA to HPA for DMA operations. |
| 72 | + |
| 73 | +Each VT-d engine (DMAR unit) maintains a remapping structure, |
| 74 | +similar to a page table, that takes the device BDF (Bus/Device/Function) as input and yields the final |
| 75 | +page table for GPA/HPA translation as output. The GPA/HPA translation |
| 76 | +page table is similar to a normal multi-level page table. |
| 77 | + |
| 78 | +VM DMA depends on Intel VT-d to translate GPA to HPA, so the |
| 79 | +VT-d IOMMU engine must be enabled in ACRN before any device can be passed through. The SOS |
| 80 | +in ACRN is a VM running in non-root mode, and it also depends |
| 81 | +on VT-d to access devices. In the SOS DMA remapping |
| 82 | +engine settings, GPA is equal to HPA. |
| 83 | + |
| 84 | +The ACRN hypervisor checks the DMA-Remapping Hardware unit Definition (DRHD) structures in the |
| 85 | +host DMAR ACPI table to get basic information, then sets up each DMAR unit. For |
| 86 | +simplicity, ACRN reuses the owning VM's EPT table as the translation table in the DMAR |
| 87 | +unit for each passthrough device. The control flow is shown in the |
| 88 | +following figures: |
| 89 | + |
| 90 | +.. figure:: images/passthru-image72.png |
| 91 | + :align: center |
| 92 | + |
| 93 | + DMA Remapping control flow during HV init |
| 94 | + |
| 95 | +.. figure:: images/passthru-image86.png |
| 96 | + :align: center |
| 97 | + |
| 98 | + ptdev assignment control flow |
| 99 | + |
| 100 | +.. figure:: images/passthru-image42.png |
| 101 | + :align: center |
| 102 | + |
| 103 | + ptdev de-assignment control flow |
| 104 | + |
| 105 | + |
| 106 | +MMIO Remapping |
| 107 | +************** |
| 108 | + |
| 109 | +For a PCI MMIO BAR, the hypervisor builds an EPT mapping between the virtual BAR and the |
| 110 | +physical BAR, so the VM can access the MMIO region directly. |
| 111 | + |
| 112 | +Device configuration emulation |
| 113 | +****************************** |
| 114 | + |
| 115 | +PCI configuration space is accessed through I/O ports 0xCF8/0xCFC. ACRN |
| 116 | +implements PCI configuration emulation by handling 0xCF8/0xCFC accesses, and this |
| 117 | +emulation could be implemented in two places: in the hypervisor or in the SOS device |
| 118 | +model. |
| 119 | + |
| 120 | +- Implementing configuration emulation in the hypervisor, that is, intercepting the |
| 121 | + 0xCF8/0xCFC ports and emulating PCI configuration space accesses there, is |
| 122 | + tricky and unclean. Therefore, the final solution reuses the |
| 123 | + PCI emulation infrastructure of the SOS device model. The hypervisor |
| 124 | + routes UOS 0xCF8/0xCFC accesses to the device model and remains unaware of the |
| 125 | + physical PCI devices. Upon receiving a UOS PCI configuration space access |
| 126 | + request, the device model emulates some critical registers, for instance, |
| 127 | + the BARs, MSI capability, and INTLINE/INTPIN. |
| 128 | + |
| 129 | +- For other accesses, the device model |
| 130 | + reads/writes the physical configuration space on behalf of the UOS. To do |
| 131 | + this, the device model is linked with libpciaccess to access the physical PCI |
| 132 | + device. |
| 133 | + |
| 134 | +Interrupt Remapping |
| 135 | +******************* |
| 136 | + |
| 137 | +When a passthrough device raises a physical interrupt, the hypervisor has |
| 138 | +to deliver it to the relevant VM according to the interrupt remapping |
| 139 | +relationships. The structure ``ptdev_remapping_info`` defines |
| 140 | +which VM a physical interrupt belongs to, its |
| 141 | +virtual destination, and so on. See the following figure for details: |
| 142 | + |
| 143 | +.. figure:: images/passthru-image91.png |
| 144 | + :align: center |
| 145 | + |
| 146 | + Remapping of physical interrupts |
| 147 | + |
| 148 | +There are two types of interrupt source: IOAPIC and MSI. |
| 149 | +The hypervisor records different information for interrupt |
| 150 | +distribution: the physical and virtual IOAPIC pins for an IOAPIC source, and the |
| 151 | +physical and virtual BDFs and other information for an MSI source. |
| 152 | + |
| 153 | +SOS passthrough is also within the scope of interrupt remapping, which for the SOS is |
| 154 | +done on demand rather than during hypervisor initialization. |
| 155 | + |
| 156 | +.. figure:: images/passthru-image102.png |
| 157 | + :align: center |
| 158 | + :name: init-remapping |
| 159 | + |
| 160 | + Initialization of remapping of virtual IOAPIC interrupts for SOS |
| 161 | + |
| 162 | +:numref:`init-remapping` above illustrates how (virtual) IOAPIC |
| 163 | +interrupts are remapped for the SOS. A VM exit occurs whenever the SOS tries to |
| 164 | +unmask an interrupt in the (virtual) IOAPIC by writing to its Redirection |
| 165 | +Table Entry (RTE). The hypervisor then invokes the IOAPIC emulation |
| 166 | +handler (refer to :ref:`hld-io-emulation` for details on I/O emulation), which |
| 167 | +calls APIs to set up a remapping for the to-be-unmasked interrupt. |
| 168 | + |
| 169 | +Remapping of (virtual) PIC interrupts is set up in a similar sequence. |
| 170 | + |
| 171 | +.. figure:: images/passthru-image98.png |
| 172 | + :align: center |
| 173 | + |
| 174 | + Initialization of remapping of virtual MSI for SOS |
| 175 | + |
| 176 | +This figure illustrates how MSI and MSI-X mappings are set up for the |
| 177 | +SOS. The SOS is responsible for issuing a hypercall to notify the |
| 178 | +hypervisor before it writes the PCI configuration space to enable an |
| 179 | +MSI. The hypervisor takes this opportunity to set up a remapping for the |
| 180 | +given MSI or MSI-X before the SOS actually enables it. |
| 181 | + |
| 182 | +When the UOS accesses a physical device through passthrough, its interrupts |
| 183 | +are handled in the following steps: |
| 184 | + |
| 185 | +- The UOS gets a virtual interrupt. |
| 186 | +- A VM exit happens, and the trapped vCPU is the target into which the interrupt |
| 187 | + will be injected. |
| 188 | +- The hypervisor handles the interrupt and translates the vector |
| 189 | + according to ``ptdev_remapping_info``. |
| 190 | +- The hypervisor delivers the interrupt to the UOS. |
| 191 | + |
| 192 | +When the SOS uses a physical device, passthrough is also |
| 193 | +active, because the SOS is the first VM. The detailed steps are: |
| 194 | + |
| 195 | +- The SOS gets all physical interrupts. It assigns different interrupts to |
| 196 | + different VMs during initialization and reassigns them when a VM is created or |
| 197 | + deleted. |
| 198 | +- When a physical interrupt is trapped, a VM exit occurs, because the VMCS |
| 199 | + has been set up to trap it. |
| 200 | +- The hypervisor handles the VM exit according to |
| 201 | + ``ptdev_remapping_info`` and translates the vector. |
| 202 | +- The interrupt is injected in the same way as a virtual interrupt. |
| 203 | + |
| 204 | +ACPI Virtualization |
| 205 | +******************* |
| 206 | + |
| 207 | +ACPI virtualization is designed in ACRN with these assumptions: |
| 208 | + |
| 209 | +- HV has no knowledge of ACPI, |
| 210 | +- SOS owns all physical ACPI resources, |
| 211 | +- UOS sees virtual ACPI resources emulated by the device model. |
| 212 | + |
| 213 | +Some passthrough devices require a physical ACPI table entry for |
| 214 | +initialization. The device model creates such a device entry based on |
| 215 | +the physical one, according to the vendor ID and device ID. This virtualization is |
| 216 | +implemented in the SOS device model and is not in the scope of the hypervisor. |
| 217 | + |
| 218 | +GSI Sharing Violation Check |
| 219 | +*************************** |
| 220 | + |
| 221 | +All PCI devices that share the same GSI should be assigned to |
| 222 | +the same VM to avoid physical GSI sharing between multiple VMs. For |
| 223 | +devices that don't support MSI, the ACRN DM |
| 224 | +puts devices that share the same GSI pin into a GSI |
| 225 | +sharing group. The devices in the same group should be assigned together to |
| 226 | +the current VM; otherwise, none of them should be assigned to the |
| 227 | +current VM. Passthrough is rejected for any device that violates this |
| 228 | +rule. The checking logic is implemented in the device model and is not |
| 229 | +in the scope of the hypervisor. |
| 230 | + |
| 231 | +Data structures and interfaces |
| 232 | +****************************** |
| 233 | + |
| 234 | +.. note:: replace with reference to API docs |
| 235 | + |
| 236 | +The following APIs are provided to initialize interrupt remapping for |
| 237 | +SOS: |
| 238 | + |
| 239 | +- int ptdev_intx_pin_remap(struct vm \*vm, uint8_t virt_pin, enum |
| 240 | + ptdev_vpin_source vpin_src); |
| 241 | + |
| 242 | + Set up the remapping of the given virtual pin for the given vm. |
| 243 | + |
| 244 | +- int ptdev_msix_remap(struct vm \*vm, uint16_t virt_bdf, uint16_t |
| 245 | + entry_nr, struct ptdev_msi_info \*info); |
| 246 | + |
| 247 | +The following APIs are provided to manipulate the interrupt remapping |
| 248 | +for UOS. |
| 249 | + |
| 250 | +- int ptdev_add_intx_remapping(struct vm \*vm, uint16_t virt_bdf, |
| 251 | + uint16_t phys_bdf, uint8_t virt_pin, uint8_t phys_pin, bool |
| 252 | + pic_pin); |
| 253 | + |
| 254 | + Add mapping between the given virtual and physical pin for the |
| 255 | + given vm. |
| 256 | + |
| 257 | +- void ptdev_remove_intx_remapping(struct vm \*vm, uint8_t |
| 258 | + virt_pin, bool pic_pin); |
| 259 | + |
| 260 | + Remove the mapping of the given virtual pin for the given vm. |
| 261 | + |
| 262 | +- int ptdev_add_msix_remapping(struct vm \*vm, uint16_t virt_bdf, |
| 263 | + uint16_t phys_bdf, uint32_t vector_count); |
| 264 | + |
| 265 | + Add mapping of the given number of vectors between the given |
| 266 | + physical and virtual BDF for the given vm. |
| 267 | + |
| 268 | +- void ptdev_remove_msix_remapping(struct vm \*vm, uint16_t |
| 269 | + virt_bdf, uint32_t vector_count); |
| 270 | + |
| 271 | + Remove the mapping of the given number of vectors of the given virtual |
| 272 | + BDF for the given vm. |
| 273 | + |
| 274 | +The following APIs are provided to acknowledge a virtual interrupt. |
| 275 | + |