Normal core FPU emulation is incompatible with "Fast" Pentium memcpy trick #119

joncampbell123 · 2016-01-01T12:42:51Z

Ref: http://collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/articles/CUJ/1996/9612/durham1/durl1.htm

The normal core attempts to use the host floating point support to emulate Intel x87 instructions, but it does so by typecasting everything to "double", even 80-bit extended formats.

Unfortunately, this typecasting causes data corruption for any demo or game that uses FILD+FISTP (64-bit integer load and store) as a "faster" method of copying memory on Pentium-class (pre-MMX) hardware. On an actual processor, the 64-bit integer fits normally into the mantissa of the 80-bit FPU register without losing bits, while in DOSBox-X's normal core, the typecast to "double" loses bits and causes data corruption. In most demos, since the fast memcpy trick is used to copy from a system buffer into video memory, this corruption is visible as black vertical bars on the screen that sometimes shift or blend into the picture content depending on how the 64-bit int -> double conversion truncates.

joncampbell123 · 2016-01-01T12:45:39Z

Example: "Toontown" demo

Breaking into the demo at the right point shows that it is indeed drawing to the screen with a fast FILD + FSTP loop, 16 bytes at a time, Pentium optimized loop.

ghost · 2016-01-04T04:14:54Z

Reproduced your very interesting finding! PCem seems to correctly handle the above conversion in its "normal core"; I think below may be (abbreviated) hints.


x87.h:
/*Hack for FPU copy. If set then ST_i64 contains the 64-bit integer loaded by FILD*/
#define TAG_UINT64 (1 << 2)
x87.c:
uint16_t x87_gettag()
{
...
        for (c = 0; c < 8; c++)
                if (tag[c] & TAG_UINT64)
                        ret |= 2 << (c*2);
                else
                        ret |= (tag[c] << (c*2));
x87_ops.h:
static inline void x87_st_fsave(int reg)
{
        reg = (TOP + reg) & 7;
        if (tag[reg] & TAG_UINT64)
        {
            writememl(easeg, eaaddr, ST_i64[reg] & 0xffffffff);
            writememl(easeg, eaaddr + 4, ST_i64[reg] >> 32);
            writememw(easeg, eaaddr + 8, 0x5555);
        }
        else
                x87_st80(ST[reg]);
static inline void x87_ld_frstor(int reg)
{
        reg = (TOP + reg) & 7;
        temp = readmemw(easeg, eaaddr + 8);
        if (temp == 0x5555 && tag[reg] == 2)
        {
                tag[reg] = TAG_UINT64;
                ST_i64[reg] = readmeml(easeg, eaaddr);
                ST_i64[reg] |= ((uint64_t)readmeml(easeg, eaaddr + 4) << 32);
                ST[reg] = (double)ST_i64[reg];
        }
        else
                ST[reg] = x87_ld80();
x87_ops_loadstore.h:
static int opFILDiq_a32(uint32_t fetchdat)
{
...
        ST_i64[TOP] = temp64;
        tag[TOP] |= TAG_UINT64;
static int FISTPiq_a32(uint32_t fetchdat)
{
...
        if (tag[TOP] & TAG_UINT64)
                temp64 = ST_i64[TOP];
        else
                temp64 = x87_fround(ST(0));

joncampbell123 · 2016-01-05T01:22:39Z

:) That's actually an interesting hack. But, it's a hack.

I think a better approach would be to use the "long double" datatype where possible to operate with 80-bit extended precision. I know that's not possible with Microsoft C++, but Linux+GCC has it, and that's where I can start to improve FPU emulation. DOSBox-X is more concerned with emulation accuracy, rather than shortcuts at the expense of FPU precision.

fuel-pcbox · 2016-01-05T01:34:06Z

Just emulate floating-point in software.

On Mon, Jan 4, 2016 at 7:22 PM, Jonathan Campbell notifications@github.com
wrote:

:) That's actually an interesting hack. But, it's a hack.

I think a better approach would be to use the "long double" datatype where
possible to operate with 80-bit extended precision. I know that's not
possible with Microsoft C++, but Linux+GCC has it, and that's where I can
start to improve FPU emulation. DOSBox-X is more concerned with emulation
accuracy, rather than shortcuts at the expense of FPU precision.

—
Reply to this email directly or view it on GitHub
#119 (comment)
.

joncampbell123 · 2016-01-05T01:47:10Z

Of course! But for x86 targets, we could keep the accuracy and gain some performance if the C++ compiler can map the "long double" type to the right FPU instruction at compile time in the way it does now for "double".

ghost · 2016-01-05T06:27:37Z

Here are further details on this issue:
http://dreamlayers.blogspot.com/2015/01/the-dosbox-fpu-emulator-and-ole2dispdll.html

Compiled the source code from the above link and the dos program is available here:
http://filebin.ca/2SNnbwgyMIHa/testfild.zip

That should provide a simple test of the fild-fbstp operation.

ghost · 2016-01-06T00:06:20Z

The fild/fistp issue is also reported here:
http://sourceforge.net/p/dosbox/bugs/369/

And at their forum:
http://www.vogons.org/viewtopic.php?t=24499&sid=9ae1bdc243d05d347f08553e7ad73c44
-and-
http://www.vogons.org/viewtopic.php?t=17000&highlight=fpu+precision
["A nice test to see the flaws of the non-80bit precise store operations
is carmageddon, you'll get black stripes in the graphics (the missing 16 bits)"].

And a comment from a user in that thread:
"Possible workaround, with a smaller overhead than a full soft float implementation:
You could duplicate the FPU stack (full and double precision). Have fild push to both, and fist(p) pop from the full precision stack only if no other FPU operation occurred in the mean time."

That sounds somewhat similar to the pcem method of correcting the OP issue. However, wd favored the compiler approach for accuracy, presumably the same as the "long double" approach as discussed above (full 80-bit and not just recovery of 64 bits for the integer operation).

Perhaps the already proven pcem method is worthwhile as an intermediate solution until the more thorough "long double" method is implemented; given that the full solution is difficult to implement, to fully test, and any drawbacks are unknown.

Verified that PCem without this "fpu copy patch" does result in "black vertical bars on the screen" as in dosbox-x:
https://bitbucket.org/pcem_emulator/pcem/commits/9c8cd36ad61f396fba5b820f9ab2d7dcdf591a06

Edit: confirmed that the above patch is running a parallel higher precision (HP) stack "ST_i64", at least for the integer operation, along with consistent use of bitwise operations to encode whether a fpu command should use one stack or the other. The other part are the functions "x87_st_fsave" and "x87_ld_frstor" which are called to pack and unpack the HP value (retain the integer portion) so bits are not lost as in the static_cast to a double.

The FILD command will activate the HP stack by setting tag[TOP] while the FISTP command will detect that same tag[TOP] value to identify which stack to use (FILD vs. others). I think FLD and FXCH operate on both stacks. The statement "tag[TOP] &= ~TAG_UINT64" turns off the 3rd bit of tag[TOP] so the normal stack is active (many arithmetic and trigonometric commands). The statement "tag[TOP] |= TAG_UINT64" turns on the 3rd bit so the HP stack is active (FILD command). Any tag[#] value at 3rd bit of 1 indicates the HP stack.

There are two issues here, the expansion of the fpu from 64-bits to the full 80-bits and the problem of static_cast demoting a 64-bit integer to 53-bits (or thereabouts :) ). The "solution" to the second issue would fix the problem in the demo from OP and the other instances mentioned in the web links above, but the loss of 80- to 64-bits seems to have less incompatibilities with software. If there was generally a loss of function from 80- to 64-bits, then wouldn't a Quake software renderer engine compiled without x86 code also exhibit graphical problems? However, I understand that dosbox-x goal is accuracy and not a subset of accuracy for running software.

ghost · 2016-01-08T04:53:12Z

Logged the 'Bit64s' FILD values generated during key games and demos. Here is the log statement at the end of function "FPU_FLD_I64" in fpu_instructions.h:
LOG_MSG("FILD value: %g",fpu.regs[store_to].d);

From brief testing, 'Bit64s" FILD values are:
Quake1: < 10^10 (~2^32, double word)
Quake2 (q2dos): < 10^8
SimCity 2K: < 10^7
Warcraft2: reportedly 486SX required; likewise, there were no values logged
ToonTown demo: < 10^19
Starcraft: < 10^4

Codeholio calculated that the 64 bit double precision value can hold a mantissa length of nearly 52 bits (my paraphrase). For these approximations, we can assume that a 51 bit value is where the true FILD integer value will overflow in the fpu emulation since the conversion to double in emulation is not complete. Here is the decimal value for binary 2^51: 2251799813685248. That is a 51 bit value. For comparison to our above tests, 2^51 converts to (decimal) ~10^16 in size. So this is a rough approximate value where the FILD/FISTP command pair will show inaccuracies.

None of the above test games show values near to 10^16 except for the Toontown demo, which exceeds it by 10^3, so the last 3 or 4 digits of those integers are presumably lost in the non-x86 fpu emulation. This is verified by examples from x86 source code where the FILD may operate on a double word (32-bit value) or on a quad word (64-bit); the latter shown by Codeholio's example during use of the "Fast Pentium memcpy trick" versus Quake source code where FILD is typically operating on double word values (x86).

joncampbell123 added the bug label Jan 1, 2016

joncampbell123 self-assigned this Jan 1, 2016

aybe added the topic: CPU label Sep 14, 2017

joncampbell123 closed this as completed Dec 17, 2017

QmwJlHuSg9pa mentioned this issue Aug 26, 2021

[Android] Vertical stripes all over Carmageddon schellingb/dosbox-pure#156

Open

johnnovak mentioned this issue Nov 15, 2023

Allow extended precision floating point calculations to be done in software dosbox-staging/dosbox-staging#3125

Open

2 tasks

kklobe mentioned this issue Nov 29, 2023

Fix vertical bars due to FILD/FIST memcpy trick dosbox-staging/dosbox-staging#3151

Merged

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Normal core FPU emulation is incompatible with "Fast" Pentium memcpy trick #119

Normal core FPU emulation is incompatible with "Fast" Pentium memcpy trick #119

joncampbell123 commented Jan 1, 2016

joncampbell123 commented Jan 1, 2016

ghost commented Jan 4, 2016

joncampbell123 commented Jan 5, 2016

fuel-pcbox commented Jan 5, 2016

joncampbell123 commented Jan 5, 2016

ghost commented Jan 5, 2016

ghost commented Jan 6, 2016

ghost commented Jan 8, 2016

Normal core FPU emulation is incompatible with "Fast" Pentium memcpy trick #119

Normal core FPU emulation is incompatible with "Fast" Pentium memcpy trick #119

Comments

joncampbell123 commented Jan 1, 2016

joncampbell123 commented Jan 1, 2016

ghost commented Jan 4, 2016

joncampbell123 commented Jan 5, 2016

fuel-pcbox commented Jan 5, 2016

joncampbell123 commented Jan 5, 2016

ghost commented Jan 5, 2016

ghost commented Jan 6, 2016

ghost commented Jan 8, 2016