Memory usage #8

Open
moononournation opened this issue Sep 1, 2023 · 25 comments

@moononournation

I would like to try porting XTulator to an MCU such as the ESP32.

There is not enough memory for the arrays declared in memory.h:

uint8_t *memory_mapRead[MEMORY_RANGE];
uint8_t *memory_mapWrite[MEMORY_RANGE];
uint8_t (*memory_mapReadCallback[MEMORY_RANGE])(void *udata, uint32_t addr);
void (*memory_mapWriteCallback[MEMORY_RANGE])(void *udata, uint32_t addr, uint8_t value);
void *memory_udata[MEMORY_RANGE];

The ESP32 can have 8 MB of PSRAM, but the above structures require 2x MB to emulate 1 MB of RAM. Any hints on how to reduce the memory usage?

@ArnoldUK

ArnoldUK commented Sep 1, 2023

I'm not quite sure what you are attempting, but MEMORY_RANGE is a constant set to 1MB, so those 5 data structures require 5MB of memory. The emulator only emulates a 640K to 1MB system. This repo is very old and no longer updated. See my fork for all the latest updates and releases.
Faux86-remake

@moononournation
Author

Is Faux86-remake based on Fake86 or XTulator?

@moononournation
Author

I know MEMORY_RANGE is 1 MB, but each entry of uint8_t *memory_mapRead[MEMORY_RANGE] is a 4-byte pointer, so that one array alone requires 4 MB. With the four other pointer arrays, the total is 4 MB x 5 = 20 MB.
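
The arithmetic above can be checked with a short sketch. It assumes 4-byte pointers, as on the 32-bit ESP32, and one table entry per byte of a 1 MB address space; the helper name map_tables_bytes is made up for illustration and is not part of XTulator.

```c
#include <assert.h>
#include <stddef.h>

#define MEMORY_RANGE 0x100000u  /* 1 MB address space, one table entry per byte */
#define NUM_TABLES   5u         /* the five arrays declared in memory.h */

/* Bytes needed by all five tables for a given pointer size.
 * map_tables_bytes is a hypothetical helper, not an XTulator function. */
static size_t map_tables_bytes(size_t pointer_size)
{
    assert(pointer_size > 0);
    return (size_t)NUM_TABLES * MEMORY_RANGE * pointer_size;
}
```

With 4-byte pointers, map_tables_bytes(4) comes to 20971520 bytes (20 MB), well beyond 8 MB of PSRAM. With the 8-byte pointers of a 64-bit desktop build, the same tables take twice that.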

@mikechambers84
Owner

mikechambers84 commented Sep 1, 2023

This repo is not abandoned, I just haven't made any updates for a couple of years. I intend to get back to it. I tend to project-hop a lot.

I've actually done an MCU port of this already and it required a bit of restructuring for the memory and ports stuff of course.

I'm on my phone at the moment, but I'll come back later and share how I did it.

As it is, it makes sense on a desktop PC because it's fast and memory isn't an issue, but yeah not good for an MCU.

@ArnoldUK

ArnoldUK commented Sep 1, 2023

Faux86-remake is base on Fake86 or XTulator?

Both. I merged and updated code from Fake86, Faux86 and XTulator to create Faux86-remake.
I think you will struggle to run this emulator on an ESP32 without major code rework.

@ArnoldUK

ArnoldUK commented Sep 1, 2023

This repo is not abandoned, I just haven't made any updates for a couple of years. I intend to get back to it. I tend to project-hop a lot.

Sorry, I didn't say it was abandoned, only that it has not been updated recently. Yes, XTulator is coded for desktop PC use, whereas my fork, built on Fake86/Faux86, is ported and tuned for the RPi.

@ArnoldUK

ArnoldUK commented Sep 1, 2023

I know MEMORY_RANGE is 1 MB, but uint8_t *memory_mapRead[MEMORY_RANGE] mapping requires extra 4 MB. Then 4 more pointer arrays all requires 4 MB x 5 = 20 MB.

Yes, you are correct: they are pointer arrays, so each one takes 4 x MEMORY_RANGE bytes. Maybe you could create them dynamically and use some kind of memory-cached or file-cached array. I have no real idea how to reduce the size of a static pointer array. I don't think the full 1MB of memory is used anyway, so at most only 640KB is required.
640KB x 5 = 3.2MB
3.2MB x 4 = 12.8MB
But also don't forget video RAM, which is 256KB.

@mikechambers84
Owner

Yes, we can't usually use those huge static arrays on an embedded system. When I did an Arduino port, my memory.c file looked more like this:

https://pastebin.com/1Xu1KnL4

I made a struct with the data required to describe a memory area, made an array of 16 of them (which is overkill) and searched through the array on each RAM access. Maybe there's some faster way to do this.

You'll have to modify this code depending on exactly how you're accessing your memory. I was using an Arduino with SPI RAM. You'll need to supply your own memory_ramRead and memory_ramWrite functions.

I did something similar with the struct arrays in the ports.c file.

Hopefully this gets you started.
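
The approach described in this comment (a small array of region descriptors, searched on every access) might look roughly like the sketch below. This is not the pastebin code, which is not reproduced here. The struct layout, region_register, mem_read, mem_write and the demo callbacks are hypothetical names, and the signatures given to memory_ramRead and memory_ramWrite are assumptions; here they are backed by a plain host array so the sketch can run anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 16        /* the comment above mentions an array of 16 */
#define RAM_SIZE    0xA0000u  /* 640 KB of conventional memory (assumption) */

typedef uint8_t (*mem_read_cb)(void *udata, uint32_t addr);
typedef void (*mem_write_cb)(void *udata, uint32_t addr, uint8_t value);

typedef struct {
    uint32_t start;       /* first address covered by this region */
    uint32_t len;         /* number of bytes covered */
    mem_read_cb read;     /* NULL: reads return 0xFF */
    mem_write_cb write;   /* NULL: writes are ignored (ROM-like) */
    void *udata;          /* passed back to the callbacks */
} mem_region_t;

static mem_region_t regions[MAX_REGIONS];
static int region_count;

/* Stand-ins for the user-supplied RAM accessors (SPI RAM, PSRAM, ...). */
static uint8_t host_ram[RAM_SIZE];
uint8_t memory_ramRead(uint32_t addr) { return host_ram[addr]; }
void memory_ramWrite(uint32_t addr, uint8_t value) { host_ram[addr] = value; }

/* Returns 0 on success, -1 if the table is full. */
int region_register(uint32_t start, uint32_t len,
                    mem_read_cb rd, mem_write_cb wr, void *udata)
{
    assert(len > 0);
    if (region_count >= MAX_REGIONS) return -1;
    regions[region_count++] = (mem_region_t){ start, len, rd, wr, udata };
    return 0;
}

/* Linear search; unsigned wraparound makes addr < start fail the test too. */
static const mem_region_t *region_find(uint32_t addr)
{
    for (int i = 0; i < region_count; i++) {
        if (addr - regions[i].start < regions[i].len) return &regions[i];
    }
    return NULL;
}

uint8_t mem_read(uint32_t addr)
{
    const mem_region_t *r = region_find(addr);
    if (r != NULL) return r->read ? r->read(r->udata, addr) : 0xFF;
    if (addr < RAM_SIZE) return memory_ramRead(addr);
    return 0xFF;  /* unmapped address (assumed open-bus value) */
}

void mem_write(uint32_t addr, uint8_t value)
{
    const mem_region_t *r = region_find(addr);
    if (r != NULL) {
        if (r->write) r->write(r->udata, addr, value);
        return;
    }
    if (addr < RAM_SIZE) memory_ramWrite(addr, value);
}

/* Demo device: a 16-byte buffer exposed through callbacks. */
static uint8_t demo_read(void *udata, uint32_t addr)
{
    return ((uint8_t *)udata)[addr & 0xFu];
}
static void demo_write(void *udata, uint32_t addr, uint8_t value)
{
    ((uint8_t *)udata)[addr & 0xFu] = value;
}
```

The table costs a few hundred bytes instead of megabytes, at the price of a loop on every access. Registering the most frequently hit regions first keeps the average search short.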

@ArnoldUK

ArnoldUK commented Sep 1, 2023

You'd have to use a memory-mapped page file to get around this if the MCU does not have enough memory to map the full address range. That assumes you have access to file-based storage on the MCU, i.e. flash or a memory card.
Either way it's going to be slow and affect the core timing.

@mikechambers84
Owner

mikechambers84 commented Sep 1, 2023

Right, or something like SPI RAM ideally. He says he has 8 MB PSRAM so that should be good.

If the MCU doesn't have enough native RAM, then yeah it'll be slow and there's no way around it really.

I actually used this same concept to build a 16-bit DOS version (lol) and it uses as much real RAM as the system can allocate, but then hits a swap file when it needs to access more.

I've never seen your fork before by the way, I'm looking at it now. Very good stuff! I'm at work now but will dig deeper this weekend.

@ArnoldUK

ArnoldUK commented Sep 1, 2023

THANKS. I took a lot of code from your XTulator and I've noticed I did not give any credit in the readme files. I will update it tonight to highlight that your code was used for part of the remake. I spent a good few weeks tweaking code from all the repos that I could find.

@moononournation
Author

@mikechambers84, is your MCU project runnable now? On which platform?

@moononournation
Author

Since MCU resources are very limited, would it be simpler if I hard-coded a single, specific hardware configuration?

@moononournation
Author

I am developing something like this:
https://twitter.com/moononournation/status/1698190873639661968?s=20

Someone has already succeeded with Fake86, so maybe I will dig into that first.

@moononournation
Author

@mikechambers84, I followed your code and it is a big improvement:

esp32-XTulator.ino.elf section `.dram0.bss' will not fit in region `dram0_0_seg'
DRAM segment data does not fit.
DRAM segment data does not fit.
section .iram0.vectors VMA [0000000040080000,0000000040080402] overlaps section .dram0.bss VMA [000000003ffc1bb8,00000000404e326f]
region `dram0_0_seg' overflowed by 5271664 bytes

But I still need to track down another memory eater...

@moononournation
Author

I can run XTulator on the ESP32-S3 now, after switching vga_framebuffer to malloc. But it is noticeably slower than Fake86/Faux86, right from BIOS loading and VGA initialization. Any hints on what I can tune?
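
For readers hitting the same `.dram0.bss` overflow, the change described here (moving a large buffer out of static storage and onto the heap) can be sketched as follows. The dimensions, the 16-bit pixel type and the function name framebuffer_init are assumptions for illustration, not taken from the port. On ESP-IDF based builds, heap_caps_malloc with MALLOC_CAP_SPIRAM is one way to request external PSRAM explicitly; elsewhere the sketch falls back to plain malloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#ifdef ESP_PLATFORM
#include "esp_heap_caps.h"  /* heap_caps_malloc, MALLOC_CAP_SPIRAM */
#endif

/* Hypothetical dimensions; the real framebuffer size may differ. */
#define FB_WIDTH  640
#define FB_HEIGHT 480

/* Only the pointer lives in .bss now; the pixels live on the heap. */
static uint16_t *framebuffer;

/* Allocate and clear the framebuffer at run time.
 * Returns 0 on success, -1 if the allocation fails. */
int framebuffer_init(void)
{
    size_t bytes = (size_t)FB_WIDTH * FB_HEIGHT * sizeof(uint16_t);

    assert(framebuffer == NULL);  /* initialise only once */
#ifdef ESP_PLATFORM
    framebuffer = (uint16_t *)heap_caps_malloc(bytes, MALLOC_CAP_SPIRAM);
#else
    framebuffer = (uint16_t *)malloc(bytes);
#endif
    if (framebuffer == NULL) return -1;
    memset(framebuffer, 0, bytes);
    return 0;
}
```

Call framebuffer_init() once during startup and check its return value before the emulator touches the buffer. Keep in mind that external PSRAM is generally slower to access than internal RAM, which may matter for a buffer that is written constantly.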

@moononournation
Author

I don't know why it is simply so slow. Here is the boot video:
https://x.com/moononournation/status/1709219600150544456

@mikechambers84
Owner

mikechambers84 commented Oct 3, 2023

Hi, are you rendering the screen in the same thread that's doing the CPU emulation? That would take up a lot of CPU cycles.

When I did an ESP32 port, I ran another thread on the second core that was dedicated to screen rendering.

@moononournation
Author

The display should not be the reason, because it is still slow even when I force the FPS to 1.

@mikechambers84
Owner

I don't really have any other ideas without seeing the code of your port and how you're handling everything.

Either way, I'd highly recommend making the screen render on second core.

@moononournation
Author

My working code is at: https://github.com/moononournation/arduino-XTulator
I have not changed much: I revised the memory part as you suggested, changed the display class to use a 16-bit frame buffer, and used malloc() for memory allocation.

@mikechambers84
Owner

mikechambers84 commented Oct 4, 2023

It might be faster if you just get rid of all of the memory map code and do some if-else statements like

if (addr32 < 0xA0000) {
    // read from 640 KB memory array
}
else if (addr32 < 0xB0000) {
    return vga_readmemory(NULL, addr32);
}
else if (addr32 >= 0xFE000) {
    return BIOSROM[addr32 - 0xFE000];
}
And so on...

The way it's searching now for the memory map on each read/write might just be too heavy for an MCU to do quickly.
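
A compilable version of the if-else fragment above might look like this. The names addr32, vga_readmemory and BIOSROM come from the fragment; the RAM array, the vga_mem stand-in, the 8 KB ROM size, the 0xFF value for unmapped addresses and the function name mem_read_fast are assumptions made so the sketch is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RAM_SIZE  0xA0000u  /* 640 KB conventional memory */
#define BIOS_BASE 0xFE000u
#define BIOS_SIZE 0x02000u  /* assumed 8 KB, so the ROM ends at 0xFFFFF */

static uint8_t RAM[RAM_SIZE];
static uint8_t BIOSROM[BIOS_SIZE];
static uint8_t vga_mem[0x10000];  /* stand-in for the 0xA0000 video window */

/* Stand-in for the emulator's video memory handler. */
uint8_t vga_readmemory(void *udata, uint32_t addr32)
{
    (void)udata;
    return vga_mem[addr32 - 0xA0000u];
}

/* Hard-coded address decode: no table, no search loop. */
uint8_t mem_read_fast(uint32_t addr32)
{
    assert(addr32 <= 0xFFFFFu);  /* 20-bit address bus */

    if (addr32 < 0xA0000u) {
        return RAM[addr32];
    }
    else if (addr32 < 0xB0000u) {
        return vga_readmemory(NULL, addr32);
    }
    else if (addr32 >= BIOS_BASE) {
        return BIOSROM[addr32 - BIOS_BASE];
    }
    return 0xFF;  /* nothing mapped here (assumed open-bus value) */
}
```

Putting the RAM test first means the most common case costs a single comparison. The trade-off is that the hardware layout is fixed at compile time rather than assembled at startup.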

These memory and port maps help make it more easily modular with inserting hardware into the virtual system on startup based on the configuration desired. It's very fast on a PC because there's enough RAM where I can have those big arrays and just do a simple array lookup for the memory/port handler based on address. On this memory limited system, we're having to make some smaller structures and for-loop through them to find the right one, so I guess that's just too slow.
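
One possible middle ground between the per-byte arrays and the search loop, not something proposed in this thread, is a per-page table: one small index per 4 KB page instead of one pointer per byte. The sketch below assumes every device region starts and ends on a 4 KB boundary; all names (page_map_register, page_map_read, rom_read, and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT   12u                        /* 4 KB pages (assumption) */
#define ADDR_SPACE   0x100000u                  /* 1 MB */
#define NUM_PAGES    (ADDR_SPACE >> PAGE_SHIFT) /* 256 entries */
#define MAX_HANDLERS 16
#define NO_HANDLER   0xFFu

typedef uint8_t (*page_read_cb)(void *udata, uint32_t addr);

typedef struct {
    page_read_cb read;
    void *udata;
} page_handler_t;

static page_handler_t handlers[MAX_HANDLERS];
static uint8_t handler_count;
static uint8_t page_owner[NUM_PAGES];  /* 256 bytes instead of megabytes */

void page_map_init(void)
{
    memset(page_owner, NO_HANDLER, sizeof page_owner);
    handler_count = 0;
}

/* Returns 0 on success, -1 on a bad range, NULL callback or full table. */
int page_map_register(uint32_t start, uint32_t len,
                      page_read_cb read, void *udata)
{
    const uint32_t page_mask = (1u << PAGE_SHIFT) - 1u;

    if (read == NULL || handler_count >= MAX_HANDLERS) return -1;
    if (((start | len) & page_mask) != 0u) return -1;  /* not page aligned */
    if (len == 0u || start >= ADDR_SPACE || len > ADDR_SPACE - start) return -1;

    handlers[handler_count].read = read;
    handlers[handler_count].udata = udata;
    for (uint32_t p = start >> PAGE_SHIFT; p < ((start + len) >> PAGE_SHIFT); p++) {
        page_owner[p] = handler_count;
    }
    handler_count++;
    return 0;
}

/* Constant-time lookup: one shift, one byte load, one indirect call. */
uint8_t page_map_read(uint32_t addr)
{
    uint8_t h = page_owner[(addr & (ADDR_SPACE - 1u)) >> PAGE_SHIFT];
    if (h == NO_HANDLER) return 0xFF;  /* unmapped (assumed open-bus value) */
    assert(handlers[h].read != NULL);
    return handlers[h].read(handlers[h].udata, addr);
}

/* Demo handler: a 4 KB ROM image passed in through udata. */
static uint8_t rom_read(void *udata, uint32_t addr)
{
    return ((const uint8_t *)udata)[addr & 0xFFFu];
}
```

This keeps the startup-time modularity of the original maps while replacing the search loop with a 256-byte lookup. Whether it actually helps on an ESP32 would need to be measured.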

I was using a very fast ARM MCU running at 800+ MHz so it wasn't an issue there, but the ESP32 is more limited.

@moononournation
Author

moononournation commented Oct 7, 2023

I have not yet found out why XTulator runs slower than Faux86-remake.
But I found a place where Fake86/Faux86 and XTulator all look inefficient: the way the CPU executes. The opcode check is a very long switch-case statement. The worst case is 255 case checking for the opcode 0xFF. The XTulator memory code mentioned above uses an array of read/write callback functions to make the lookup faster. My wild guess is that you also planned for the CPU exec to use an array of opcode functions to make the lookup much faster.
I lack knowledge of x86 opcodes, so sorry that I cannot help much with it.

@dbalsom

dbalsom commented Mar 11, 2024

The worst case is 255 case checking for the opcode 0xFF.

that's not how switch cases work. They are usually implemented via jump tables, and a case per opcode is an extremely common method in emulation.

@lgblgblgb

The worst case is 255 case checking for the opcode 0xFF.

that's not how switch cases work. They are usually implemented via jump tables, and a case per opcode is an extremely common method in emulation.

Indeed, I once encountered a Z80 emulator written in C that used an array of function pointers. I modified it (with the intent of optimizing it) to use one big switch statement handling all opcodes, and it ran faster after that; in fact it was something like twice as fast on average (IIRC ...). Most (all?) modern C compilers optimize switch statements into jump tables if the cases are "well behaved" (that is, there are many cases and all or most of them are contiguous). That is generally faster than going through function pointers, which also carries the cost of a subroutine call. Of course, switch statements with sparse, "random" case values are a different story.
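
The two dispatch styles discussed above can be compared side by side on a toy interpreter. The four-opcode instruction set below is invented for illustration and has nothing to do with x86 or Z80. The sketch makes no speed claim: whether the switch becomes a jump table depends on the compiler and optimization level, which can be confirmed by inspecting the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Invented opcodes with dense, contiguous values. */
enum { OP_INC = 0, OP_DEC = 1, OP_DBL = 2, OP_NEG = 3, OP_COUNT };

/* Style 1: one switch with a case per opcode. With dense case values a
 * compiler can turn this into an indexed jump instead of a compare chain. */
int32_t run_switch(const uint8_t *prog, size_t len)
{
    int32_t acc = 0;
    assert(prog != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        switch (prog[i]) {
        case OP_INC: acc += 1;   break;
        case OP_DEC: acc -= 1;   break;
        case OP_DBL: acc *= 2;   break;
        case OP_NEG: acc = -acc; break;
        default:                 break;  /* unknown opcode: ignored */
        }
    }
    return acc;
}

/* Style 2: array of function pointers, one handler per opcode.
 * Each instruction pays for an indirect call and return. */
typedef void (*op_fn)(int32_t *acc);

static void op_inc(int32_t *acc) { *acc += 1; }
static void op_dec(int32_t *acc) { *acc -= 1; }
static void op_dbl(int32_t *acc) { *acc *= 2; }
static void op_neg(int32_t *acc) { *acc = -*acc; }

static const op_fn op_table[OP_COUNT] = { op_inc, op_dec, op_dbl, op_neg };

int32_t run_table(const uint8_t *prog, size_t len)
{
    int32_t acc = 0;
    assert(prog != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (prog[i] < OP_COUNT) op_table[prog[i]](&acc);
    }
    return acc;
}
```

Both functions compute the same result for any program, so they can be swapped freely when benchmarking on the target. In the switch version the accumulator can stay in a register across instructions, while the table version has to pass it through a pointer, which is one plausible reason for the speedup reported above.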
