Permit MMIO exits to bypass the emulation. #164
Comments
I guess what you want is something like the macOS Hypervisor.framework API, which allows user space to access most VMCS registers, so the QEMU HVF accelerator (which runs in user space) can fetch the MMIO instruction, emulate it, and then resume the guest from the next instruction. The HAXM API provides a higher-level abstraction by design, which is why
With the HAXM API, you can't read/write the guest RIP alone, but have to sync a larger set of vCPU registers (again that's by design), so these round trips will be very costly. In fact, if the instruction only accesses one MMIO address (which is the most common case), I think the |
To be clear, I'm trying to emulate VGA, which requires an MMIO region 64K in size. HAX_EXIT_FAST_MMIO requires a round trip for every VRAM read and write, and there will be thousands of them while a frame is drawn, then none for a long period. Try a VGA program that uses unchained mode, such as Doom, in QEMU with HAXM and you'll see the performance is horrid, far slower than emulation. So I was trying to do it like dosemu, which exits KVM at the first VRAM access and emulates until it appears the program is finished. As you say, HAX_EXIT_PAGEFAULT would likely be the best solution, but it appears to have a 2MB granularity, which is far too large for the 64K VGA VRAM window and will put the entire VM address space within 2MB when running in real mode. |
I see, so you don't need to sync vCPU state very often. Is "kvm" a typo? I wonder if dosemu uses KVM at all, since it predates KVM. And I'm curious whether KVM API (
True. The 2MB granularity stems from the fact that HAXM divides each RAM block (host buffer backing guest memory) into 2MB chunks: Lines 37 to 38 in 0d3922d
Theoretically, it's possible to switch to 4KB chunks, but last time we tried that, we ran into some stability issue. Maybe you could try to make it work? |
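To put the granularity numbers from this exchange in concrete terms, here is a minimal sketch of the arithmetic. It is not taken from the HAXM source: the constant and function names are invented for illustration, and the only inputs are the 2MB and 4KB chunk sizes and the 64K VGA window mentioned above, plus the standard legacy VGA base address 0xA0000.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants; the names are hypothetical and do not come
 * from the HAXM source tree. */
#define CHUNK_2MB (2u * 1024 * 1024)
#define CHUNK_4KB (4u * 1024)
#define VGA_BASE  0xA0000u          /* start of the legacy VGA window */
#define VGA_SIZE  (64u * 1024)      /* 64K window discussed in the thread */

/* Number of fixed-size chunks that the range [base, base + size) touches. */
static uint32_t chunks_touched(uint32_t base, uint32_t size, uint32_t chunk)
{
    assert(size > 0 && chunk > 0);
    uint32_t first = base / chunk;
    uint32_t last = (base + size - 1) / chunk;
    return last - first + 1;
}

/* Bytes that end up trapped in addition to the requested range when
 * protection can only be applied to whole chunks. */
static uint32_t extra_bytes_trapped(uint32_t base, uint32_t size,
                                    uint32_t chunk)
{
    return chunks_touched(base, size, chunk) * chunk - size;
}
```

With 2MB chunks, `chunks_touched(VGA_BASE, VGA_SIZE, CHUNK_2MB)` is a single chunk that starts at address 0, so the trap also covers everything else a real-mode guest can address. With 4KB chunks the same window maps onto 16 chunks with no extra bytes trapped.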
Dosemu2 supports KVM, but the original dosemu presumably had the same problem, only with v86 mode instead. I think dosemu2 uses v86 mode in KVM for compatibility with the original dosemu and then uses regular page faults to trap VGA access. That makes it basically impossible to support any programs that need their own paging.
Well, I can take a stab at it I suppose. |
@cracyc I wanted to port/update dosemu support for NetBSD some time ago, but I was faced with very legacy code that needed modernization with regard to the interfaces it used (e.g. switching to mcontext from sigcontext). We have also dropped v86 support from the NetBSD kernel, and dosemu was using it. Has this situation changed? Is dosemu(2) switching to HAXM now? KVM restricts users to Linux, while HAXM now works on 4 major operating systems (including NetBSD). Unfortunately I had to give up my earlier porting effort, as it demanded too many generic improvements beyond adding compatibility code. @raphaelning I got a report from @polprog that OpenBSD in HAXM is terribly slow. Is this VGA related? |
Not that I know of, what I'm working on is a bit different. |
I have no idea. Other desktop OSes seem to run smoothly. Since VGA rendering is probably done with MMIO, you may be able to identify the bottleneck by observing exits to QEMU of type |
@raphaelning NetBSD with X Window is also unusable due to slowness... I will file a dedicated PR for it in the future, once all the booting issues are solved. |
If it's using basic VGA mode (4 plane/16 color), it's going to be very slow, at least from my experimentation. It appears that the SVGA Cirrus emulation in linear framebuffer mode doesn't take an MMIO exit for VRAM writes, which would be far faster. |
From what I have noticed, the apparent VGA slowness comes from general emulation slowness when extremely verbose logging is enabled. On NetBSD, this is caused by setting the following debug level variable to 0 (that is not the default value!).
|
Are you sure that performance bottlenecks disappear if we set |
OK, @polprog explained off-list that if we add extra logging performance is reduced.. but reverting to HEAD it's back to the current state. Unfortunately this doesn't solve our primary issue with MMIO bottleneck. |
ad "off-list explanation": To add a little context - when I was debugging, I would change that value back and forth as I needed to see more or less info in dmesg(8), and the VM would run either slow or fast. My test box is not the fastest, and with loglevel set to zero (most verbose) I could literally see QEMU's BIOS print the messages line by line, as if it were a 9600bps terminal! That also caused longer kernel load times, just as if the emulated CPU clock were a couple of times slower. - This is not a bug, since implementing those debug messages in a non-blocking way would be just plain over-engineering, and they are suppressed in most cases anyway. This is proof that it's not MMIO related; I just feel that it should be mentioned :) |
Do you mean that your report of slowness of OpenBSD was caused by debug level and not MMIO bottleneck? |
Yes, exactly. |
I'm facing the same problem: one MMIO exit for every write operation in the VGA area is incredibly slow, so in fact it's unusable. The dosemu approach of writing a whole emulator just for VGA emulation seems to be a ridiculous amount of work and way too complicated. No, it's not loglevel related; the round trips ARE costly, and there are thousands of them per second! What I have tried, and what at least improves the situation a bit, is to attach the VGA area to a page, check whether it changed on every HAXM exit (marking already-read bytes), and write the changed values to the emulated VGA. At least the performance is somewhere near usable, but there is still the problem of detecting changes (I tried the GetWriteWatch() API, but it doesn't fire when the memory area gets modified by the HAXM guest more than once; I can provide a simple test application as proof). KVM has a module called "coalesced IO", which collects all MMIO writes into a ring buffer that can be read and flushed by user space. Still, I'm not convinced by this concept, as it still has 1 exit per write, which is still costly when there are thousands of them per second. Now I had the following idea, maybe someone can tell me if it would work and if it can be implemented in HAXM that way:
Would that work, and is someone here knowledgeable enough to enhance HAXM functionality this way? |
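For readers unfamiliar with the ring-buffer idea described in this comment, the following is a minimal, self-contained sketch of the concept only. It is not KVM's actual structure layout and not any HAXM code: every type, field, and function name here (`mmio_ring`, `ring_push`, `ring_drain`, `vram_apply`) is hypothetical, and the capacity is arbitrary. The producer side stands in for whatever records the guest writes, and the consumer side stands in for user space replaying them in order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 64u              /* arbitrary capacity for this sketch */

/* One recorded guest write (hypothetical layout). */
struct mmio_write {
    uint64_t gpa;                   /* guest-physical address */
    uint32_t len;                   /* access size in bytes, 1..8 */
    uint8_t  data[8];
};

struct mmio_ring {
    uint32_t head;                  /* next slot the consumer reads */
    uint32_t tail;                  /* next slot the producer fills */
    struct mmio_write slot[RING_SLOTS];
};

/* Producer: queue one write. Returns false when the ring is full, which
 * is the point where the consumer has to be given a chance to drain. */
static bool ring_push(struct mmio_ring *r, uint64_t gpa,
                      const void *data, uint32_t len)
{
    assert(len >= 1 && len <= 8);
    uint32_t next = (r->tail + 1) % RING_SLOTS;
    if (next == r->head)
        return false;               /* one slot is kept free to mark "full" */
    struct mmio_write *w = &r->slot[r->tail];
    w->gpa = gpa;
    w->len = len;
    memcpy(w->data, data, len);
    r->tail = next;
    return true;
}

typedef void (*mmio_write_fn)(uint64_t gpa, const uint8_t *data,
                              uint32_t len, void *opaque);

/* Consumer: replay all queued writes in order; returns how many. */
static uint32_t ring_drain(struct mmio_ring *r, mmio_write_fn apply,
                           void *opaque)
{
    uint32_t n = 0;
    while (r->head != r->tail) {
        const struct mmio_write *w = &r->slot[r->head];
        apply(w->gpa, w->data, w->len, opaque);
        r->head = (r->head + 1) % RING_SLOTS;
        n++;
    }
    return n;
}

/* Example consumer callback: copy each write into a flat VRAM shadow. */
struct vram {
    uint64_t base;
    uint8_t  bytes[64 * 1024];
};

static void vram_apply(uint64_t gpa, const uint8_t *data, uint32_t len,
                       void *opaque)
{
    struct vram *v = opaque;
    if (gpa >= v->base && gpa - v->base + len <= sizeof v->bytes)
        memcpy(&v->bytes[gpa - v->base], data, len);
}
```

The sketch only shows how batching changes the number of trips to user space. As the comment points out, each guest write still has to be intercepted before it can be queued, so the per-write exit cost itself is not removed.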
I tried to implement coalesced MMIO writes in HAXM for evaluation. I got a performance gain of up to 50% (depending, of course, on the tested application), but as expected, the video performance is still unacceptable. Should anybody be interested in the experimental coalesced MMIO implementation, please tell me and I will commit it to my repository accordingly. Coalesced MMIO can be turned on via a flag in my patch, so it shouldn't break compatibility. The comparison approach works at least somewhat acceptably (even though you have a 1/255 chance that you miss a write), but it fails to detect reads, because, as the Intel SDM says, you can't have write-only pages in EPT, d'oh :-( |
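The comparison approach mentioned here can be pictured roughly as below. This is a sketch under assumptions, not the NTVDMx64 or HAXM code: `forward_changes`, `write_log`, and `log_write` are hypothetical names, and the guest-visible page is modelled as a plain byte array that is diffed against a snapshot after each exit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*byte_write_fn)(size_t offset, uint8_t value, void *opaque);

/* Compare the guest-visible page with the last snapshot, forward every
 * byte that differs to the emulated device, and update the snapshot.
 * Returns the number of bytes forwarded. */
static size_t forward_changes(const uint8_t *guest, uint8_t *snapshot,
                              size_t size, byte_write_fn emit, void *opaque)
{
    assert(guest && snapshot && emit);
    size_t changed = 0;
    for (size_t i = 0; i < size; i++) {
        if (guest[i] != snapshot[i]) {
            emit(i, guest[i], opaque);
            snapshot[i] = guest[i];
            changed++;
        }
    }
    return changed;
}

/* Example callback: count forwarded writes and remember the last one. */
struct write_log {
    size_t  count;
    size_t  last_offset;
    uint8_t last_value;
};

static void log_write(size_t offset, uint8_t value, void *opaque)
{
    struct write_log *wl = opaque;
    wl->count++;
    wl->last_offset = offset;
    wl->last_value = value;
}
```

The limitation quoted in the comment falls out of the comparison itself: a guest write that stores the same value the snapshot already holds produces no difference, so it is never forwarded. Reads leave no trace in memory at all, which is why this scheme cannot detect them.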
@AlexAltea maybe can help you with this. |
@leecher1337 Sorry for the late reply, I missed these GitHub notifications until @frymezim pinged me. I'd be happy to look over your patch. Coalesced MMIO sounds like an awesome feature, and I'd be happy to add support for it in QEMU's HAXM backend. |
Just for comparison purposes, I also tried to run QEMU with HAXM (and installing MS-DOS into it), and it suffers from exactly the same performance problem. I will create a branch with the coalesced MMIO support for your review, shall I create a PULL request for it too or just drop a link here for review? |
@leecher1337 Feel free to drop a link for now! Whether you want to submit a pull request is up to you (and up to the Intel team, if that could be merged). |
Hi, leecher1337@5d6a603 /* We have to following assumption on coalesced writes:
Here is an example for handling from NTVDMx64 HAXM implementation:
I hope that explains its usage. |
It appears to be impossible for a program to handle an MMIO exit in user space rather than round-tripping to the kernel every time. If the registers are reloaded after HAX_EXIT_FAST_MMIO, they will likely then be corrupted in em_emulate_insn. I also tried to use HAX_EPT_PERM_NONE, but it seems not to support small memory regions.