Improving accuracy of vfpu_sin #16946

Open
2 tasks done
fp64 opened this issue Feb 9, 2023 · 239 comments

Comments

@fp64
Contributor

fp64 commented Feb 9, 2023

What should happen

This is a work-in-progress/research topic, I just thought I'd put it there for people to see.

This concerns implementation of y=vfpu_sin(x), which computes an approximation of $\sin(\frac{\pi}{2} x)$.

Technical: I'm sometimes using hex-floats notation for clarity.

Looking at the table the following things appear to be true:

  1. For $|x|>2^{32}$ the output seems to be garbage (small non-zero values), despite x always being an even integer in this case. How does one even get this wrong? Then again, the x87 fsin instruction returns x unchanged for $|x|>2^{63}$. Hopefully, games do not rely on this.
  2. For inf/nan the output is specific nan (0x7F800001 or 0xFF800001, the same sign bit as x). Note, that this appears to be a signaling nan, though apparently some processors treat it backwards.
  3. The floating point output y always has its two lowest bits as 0 (the next one may be 0 or 1).
  4. The usual symmetries are obeyed, which allows safely reducing x to $[0;1]$. Note: result is -0 when x is in [...,-6, -2, -0, +2, +6,...].
  5. The output is uniquely defined by trunc(x*0x1p+23f) - i.e. input is effectively ??:23 fixed point.
  6. Output is always multiple of $2^{-28}$, i.e. y*0x1p+28f is integer.

The above two points mean that the problem is reduced to essentially integer: a -> b, where a=int(0x1p+23f*x), and b=int(0x1p+28f*y).
This gives us a more manageable table of $2^{23}+1$ entries. Attached below:
table.bin.zip
The contents (in binary) are raw values of b as 32bit LE integers (i.e. 4 bytes at offset 4*a are the value of b that corresponds to a). Where the value is unknown (absent in original table) 0xFFFFFFFF is used (not possible to mistake for a real output, since it is larger than $2^{28}$).
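A minimal sketch of the mapping described above, in case it helps to read the file. The helper names (`table_entry`, `to_fixed23`, `from_fixed28`, `kUnknown`) are made up for illustration, and the buffer is assumed to hold the whole unzipped file:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Marker used in the file for entries with no known value.
static const uint32_t kUnknown = 0xFFFFFFFFu;

// Hypothetical helper: decode the little-endian 32-bit entry for index a
// (the 4 bytes at offset 4*a) from a raw byte buffer holding the file.
static inline uint32_t table_entry(const unsigned char *bytes, size_t a)
{
    const unsigned char *p = bytes + 4 * a;
    return uint32_t(p[0]) | uint32_t(p[1]) << 8 |
           uint32_t(p[2]) << 16 | uint32_t(p[3]) << 24;
}

// a = trunc(x * 2^23), for x already reduced to [0;1].
static inline uint32_t to_fixed23(float x)
{
    return uint32_t(int32_t(x * 0x1p+23f));
}

// y = b * 2^-28. Exact for real table values (at most 22 significant bits).
static inline float from_fixed28(uint32_t b)
{
    return float(int32_t(b)) * 0x1p-28f;
}
```

A lookup for reduced x would then be `from_fixed28(table_entry(buf, to_fixed23(x)))`, after checking the entry against `kUnknown`.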
The table is dense for x>=0.25, but has only 60 values below:

//        a,         b
{
{         0,         0},
{         1,        50},
{         2,       100},
{         3,       150},
{         4,       201},
{         5,       251},
{         6,       301},
{         7,       351},
{         8,       402},
{         9,       452},
{        10,       502},
{        11,       552},
{        12,       603},
{        13,       653},
{        14,       703},
{        15,       753},
{        16,       804},
{        32,      1608},
{        63,      3166},
{        64,      3216},
{       129,      6484},
{       255,     12817},
{       256,     12867},
{       258,     12968},
{       516,     25937},
{      1023,     51422},
{      1024,     51472},
{      1033,     51924},
{      2066,    103849},
{      4095,    205839},
{      4096,    205889},
{      4132,    207699},
{      4608,    231625},
{      8264,    415397},
{     16383,    823501},
{     16384,    823552},
{     16529,    830840},
{     22528,   1132380},
{     33059,   1661715},
{     65535,   3294065},
{     65536,   3294116},
{     66118,   3323368},
{    132237,   6646276},
{    149130,   7495120},
{    262143,  13171452},
{    262144,  13171504},
{    264474,  13288480},
{    298260,  14984388},
{    528948,  26544376},
{    596520,  29922048},
{   1048575,  52369104},
{   1048576,  52369152},
{   1057896,  52828544},
{   1168434,  58264416},
{   1193040,  59471152},
{   1310720,  65224496},
{   1435360,  71283488},
{   1572864,  77922688},
{   1714176,  84691872},
{   1917732,  94337248},
};

Note, that for $|x| < 10^{-3}$ we have $|\sin(\frac{\pi}{2} x)-(\frac{\pi}{2} x)| < 2^{-28}$.
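The bound follows from $|\sin t - t| \le |t|^3/6$ with $t=\frac{\pi}{2}x$: at $x=10^{-3}$ this is about $6.5\cdot10^{-10}$, below $2^{-28}\approx 3.7\cdot10^{-9}$. A quick numerical check in double precision (the function name and the sampling scheme are just for illustration):

```cpp
#include <cmath>

// Largest |sin(pi/2*x) - pi/2*x| over evenly spaced sample points
// in (0; x_max], computed in double precision.
static inline double max_linear_error(double x_max, int samples)
{
    const double half_pi = 1.5707963267948966;
    double worst = 0.0;
    for (int i = 1; i <= samples; ++i) {
        double t = half_pi * x_max * i / samples;
        double err = std::fabs(std::sin(t) - t);
        if (err > worst) worst = err;
    }
    return worst;
}
```

This only says the linear term is within one fixed 0.28 step of the true sine; it does not say how the hardware rounds.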

Assuming we can satisfactorily patch the blank spots in the table, this may turn into a compression problem, i.e. even without knowing what HW actually does we just need a way to recreate the exact table. It is, in principle, possible to bite the bullet and just (optionally) ship the table with PPSSPP (not that I actually advocate that) - it is 4MB zipped, and 32MB in mem, which while not nice, might be tolerable.

Please do take this with a grain of salt, it is possible I missed (or messed up) something.

Who would this benefit

If this does indeed improve emulation accuracy, it may make more games correctly playable, to the benefit of everyone concerned.

Platform (if relevant)

None

Games this would be useful in

Games that rely on specific values produced by vfpu, not sure which

Other emulators or software with a similar feature

No response

Checklist

@hrydgard
Owner

hrydgard commented Feb 9, 2023

Cool. So we should have the PSP generate a new table from 0.23 fixpoint, from 0 to 1 would be enough (or even 0 to 0.5).

Then I guess theoretically we can try to compute first, second and third derivatives of that sequence. If that series becomes a series of straight lines or plateaus at some derivative, we'll know we're dealing with polynomial interpolation of a much smaller table, which I think might be likely. Though I don't know if precision is enough for those derivatives to make sense...

Storing the full table will hopefully not be necessary, though by taking the derivative before compressing, the result should be really compressible... (actually that might be nonsense given how sin and cos are each other's derivative, sometimes with flipped signs, but we have to be dealing with an approximation here...)

@fp64
Contributor Author

fp64 commented Feb 9, 2023

from 0 to 1 would be enough (or even 0 to 0.5).

To be clear

The table is dense for x>=0.25

means, that all values for [0.25;1] are already known. Only [0;0.25] remains.

Storing the full table will hopefully not be necessary

For one thing 32MB might be larger than L3 cache.

though by taking the derivative before compressing, the result should be really compressible...

Tried it:

 33554436  table.bin
  3430864  table.bin.zip
  4645219  table.bin.zst
 33554436  delta1.bin
   171407  delta1.bin.zip
   238125  delta1.bin.zst
 33554436  delta2.bin
   175539  delta2.bin.zip
   220832  delta2.bin.zst

delta1 and delta2 are first and second finite differences. Both zip and zstd were invoked with -9.
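For reference, a sketch of how such delta files could be produced. Since the delta files have the same size as the table, this assumes the first entry is kept as-is (i.e. differenced against an implicit 0); that detail is a guess, and the function name is made up:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// First finite difference, same length as the input.
// Assumption: element 0 is differenced against an implicit 0.
static std::vector<int32_t> finite_difference(const std::vector<int32_t> &v)
{
    std::vector<int32_t> d(v.size());
    int32_t prev = 0;
    for (size_t i = 0; i < v.size(); ++i) {
        d[i] = v[i] - prev;
        prev = v[i];
    }
    return d;
}
```

Applying it once gives the first difference, and applying it to that result gives the second.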

@hrydgard
Owner

hrydgard commented Feb 9, 2023

Oh yeah, didn't read carefully enough, so a smaller range to dump.

Yeah, that got really compressible - but that it didn't get even more compressible at the second derivative .. well, not sure what it means to be honest :P

By the way, 4 way symmetry is good, but 8 way symmetry might be possible with some flipping around of signs and offsetting. Or hm, maybe not...

@fp64
Contributor Author

fp64 commented Feb 10, 2023

Note: result is -0 when x is in [...,-6, -2, -0, +2, +6,...].

Grr, no, it's [...,-8,-4,-0,+2,+6,...]: both sources of signflip (signbit(x) and (x mod 4==2)) just XOR with each other.
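Spelled out as code, for even integer x only (where the result is a zero). The function name is made up; this is just the XOR rule above, not anything taken from hardware:

```cpp
#include <cmath>

// For an even integer x, predicts whether the zero result is -0:
// signbit(x) XOR (|x| mod 4 == 2).
static inline bool zero_is_negative(float x)
{
    bool negative_input = std::signbit(x);
    bool odd_half_period = std::fmod(std::fabs(x), 4.0f) == 2.0f;
    return negative_input != odd_half_period;
}
```

So -8, -4, -0, +2 and +6 give -0, while -6, -2, +0 and +4 give +0.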

@unknownbrackets
Collaborator

Oh, hm. Maybe I overwrote that file I uploaded before with a smaller subset. It took a lot of time to generate, I thought I saved it... anyway, I had dumped some of it as binary. Let me just regenerate, at least for exponents between 0x60 - 0x7F....

-[Unknown]

@unknownbrackets
Collaborator

unknownbrackets commented Feb 10, 2023

https://forums.ppsspp.org/uploads/sin_60-7F.zip
https://forums.ppsspp.org/uploads/cos_60-7F.zip

Here's all the exponents from 60-7F, each as a separate file. 60-67 aren't that interesting, just included to represent. The data is simply input, output repeated in 4 byte pairs.

-[Unknown]

@fp64
Contributor Author

fp64 commented Feb 10, 2023

Here's all the exponents from 60-7F, each as a separate file.

Thanks!

Yeah, that got really compressible - but that it didn't get even more compressible at the second derivative .. well, not sure what it means to be honest :P

Well, after the first delta we already get long repeats of the same value, e.g.
50,50,50,50,50,49,49,49,49,49.
Second delta gets us long repeats of zeroes, e.g.:
0,0,0,0,0,-1,0,0,0,0
which isn't much better for compression, I think (e.g. for LZ the first example is something like LITERAL(50),COPY(-1,4),LITERAL(49),COPY(-1,4), and the second is LITERAL(0),COPY(-1,4),LITERAL(-1),LITERAL(0),COPY(-1,3) - so one command more).

@fp64
Contributor Author

fp64 commented Feb 11, 2023

Converted the provided files. So, now we have a complete fixed23->fixed28 table for sin:
sin.bin.zip
Again, just uint32 outputs (input is index, i.e. offset in file divided by 4), LE, in binary.
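Roughly what the conversion of one (input, output) pair amounts to, assuming each 4-byte value in the dump is the raw bit pattern of a float and that the input is already in [0;1]. The helper names are made up, and this is a sketch rather than the exact converter used:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float.
static inline float bits_to_float(uint32_t bits)
{
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Table index: a = trunc(x * 2^23), for x in [0;1].
static inline uint32_t input_to_index(uint32_t input_bits)
{
    return uint32_t(int32_t(bits_to_float(input_bits) * 0x1p+23f));
}

// Table value: b = y * 2^28, for y in [0;1].
// Exact whenever y is a multiple of 2^-28.
static inline uint32_t output_to_value(uint32_t output_bits)
{
    return uint32_t(int32_t(bits_to_float(output_bits) * 0x1p+28f));
}
```

Each pair then becomes `table[input_to_index(in)] = output_to_value(out)`.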

Haven't yet verified that cos uses the same table (reversed).

Again, thanks for actually dumping the data!

@fp64
Contributor Author

fp64 commented Feb 11, 2023

The draft implementation of the reduction:

#include <cmath>   // fabsf, copysignf
#include <cstdint> // uint32_t, int32_t

static uint32_t vfpu_sin_table[(1<<23)+1]={
#include "sin.bin.h" // Or #embed, or whatever.
};

static inline uint32_t vfpu_sin_fixed(uint32_t arg)
{
    // Table-based for now.
    return vfpu_sin_table[arg];
}

// For |x|<2^32 should produce exactly the
// same value (bitwise) as PSP. For finite
// x>=2^32 produces 0, whereas PSP sometimes
// produces junk (which looks like small
// non-zero values). Hopefully games do not rely
// on this. For inf/nan produces qNaN, but
// with a different payload than PSP (PSP produces
// sNaN).
static inline float vfpu_sin(float x)
{
    // Handle large/non-finite values.
    if(!(fabsf(x)<0x1p25f)) return copysignf(x-x,x); // Beware -ffast-math.
    // Reduce to [-2;+2].
    int32_t n=int32_t(0.5f*x); // Rounds to 0.
    x=x-2.0f*float(n); // Exact, no roundoff. Also, -0 remains -0.
    // Flip sign according to half-period.
    if(n&1) x=-x;
    // Convert to fixed 1.23.
    uint32_t arg=uint32_t(int32_t(fabsf(x)*0x1p23f));
    // Reduce to [0;1] (i.e. [0;2^23]).
    if(arg>=1u<<23) arg=(1u<<24)-arg;
    // Convert from fixed 1.28, and apply sign.
    return float(int32_t(vfpu_sin_fixed(arg)))*0x1p-28f*copysignf(1.0f,x);
}

// WARNING: not tested, this is just a guess.
static inline float vfpu_cos(float x)
{
    x=fabsf(x);
    // Handle large/non-finite values.
    if(!(fabsf(x)<0x1p25f)) return 1.0f+(x-x);
    // Reduce to [0;2) (x is non-negative here).
    int32_t n=int32_t(0.5f*x); // Rounds to 0.
    x=x-2.0f*float(n); // Exact, no roundoff.
    // Convert to fixed 1.23.
    uint32_t arg=uint32_t(int32_t(x*0x1p23f));
    // Reduce to [0;1] (i.e. [0;2^23]).
    if(arg>=1u<<23) {arg=(1u<<24)-arg; n^=1;}
    // Convert from fixed 1.28, and apply sign.
    return float(int32_t(vfpu_sin_fixed((1u<<23)-arg)))*0x1p-28f*(n&1?-1.0f:+1.0f);
}

Edit: minor fixes.

Not ready for deployment, but may be useful as-is to test if it unbreaks some picky games.

@fp64
Contributor Author

fp64 commented Feb 11, 2023

I'm using fixed 0.28 since it can represent all output values, but that might not be what is going on internally.
As I already mentioned, the 2 lower bits of the bitwise representation of floating point output are always 0, meaning output has at most 22 (binary) significant digits. In other words, the stepping of the output in fixed 0.28 is

uint32_t quantum(uint32_t x)
{
    return x<1u<<22?
        1u:
        1u<<(32-22-count_leading_zeroes(x));
}

One idea I had is that we need an approximation (treated as real-valued here) f(i) that passes each region table[i]<=f(i)<table[i]+quantum(table[i]) (this may be whole input range, or piecewise), and then simply truncate it (to quantum). This is similar to what RLIBM does. This does require a bit fancy solver though.
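To make the constraint concrete, here is a sketch of the per-entry check such a solver would need to satisfy. `quantum_step` is meant to match `quantum` above (written as a loop so it does not need a count-leading-zeroes intrinsic), and `fits_entry` is a made-up name:

```cpp
#include <cstdint>

// Output step in fixed 0.28 for a value with at most 22 significant bits.
static inline uint32_t quantum_step(uint32_t x)
{
    uint32_t q = 1;
    while (x / q >= (1u << 22)) q <<= 1;
    return q;
}

// True if truncating the real-valued approximation f to the quantum
// would reproduce the table entry: entry <= f < entry + quantum(entry).
static inline bool fits_entry(uint32_t entry, double f)
{
    return f >= double(entry) &&
           f < double(entry) + double(quantum_step(entry));
}
```

For example, an entry of 223176256 has a step of 64, so any f in [223176256; 223176320) would truncate to it.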

Maybe something simpler suffices.

@unknownbrackets, you mentioned that you don't think this is CORDIC. Can you elaborate?

@hrydgard
Owner

hrydgard commented Feb 11, 2023

I don't believe it's CORDIC either because CORDIC is iterative, so it takes a variable-amount of time depending on the angle, and I don't think we've observed that. I would guess it's an interpolated table lookup from a smaller table, but I could be wrong of course.

@fp64
Contributor Author

fp64 commented Feb 11, 2023

Hm, got runtime size down to 2MB: table of quadratic coefficients + table of exceptions.
It's ugly, but may be workable. Currently searches exceptions via binary search, but it can be improved.
In case you want it:
vfpu_sin_fixed.zip

This implements extern uint32_t vfpu_sin_fixed(uint32_t arg), which you can plug into the reduction code above to give you actual vfpu_sin and vfpu_cos.

It is verified to produce exactly the same result as vfpu_sin_table. The only advantage is that the memory for the lookup table is now down to 2MB from 32MB.
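Since the domain is only $2^{23}+1$ arguments, a check like this can simply be exhaustive. A generic sketch (the name `count_mismatches` is made up, and this is not necessarily the harness that was used):

```cpp
#include <cstdint>

// Counts arguments in [0; 2^23] where two candidate implementations
// of the fixed-point function disagree.
template <class F, class G>
static uint32_t count_mismatches(F reference, G candidate)
{
    uint32_t mismatches = 0;
    for (uint32_t arg = 0; arg <= (1u << 23); ++arg)
        if (reference(arg) != candidate(arg)) ++mismatches;
    return mismatches;
}
```

With the table in memory, the reference would be a lambda returning `vfpu_sin_table[arg]` and the candidate would be `vfpu_sin_fixed`; a result of 0 means a bitwise match over the whole range.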

@fp64
Contributor Author

fp64 commented Feb 11, 2023

The code itself (so you don't need to download to take a look):

vfpu_sin_fixed.cpp
#include "stdint.h"

#include "quadratic_coefs.h"
#include "exceptions.h"

static inline uint32_t quantum(uint32_t x)
{
    /*
    return x<1u<<22?
        1u:
        1u<<(32-22-__builtin_clz(x));
    */
    int i=0;
    while((x>>i)>=0x00400000) ++i;
    return uint32_t(1)<<i;
}

static inline uint32_t truncate_bits(uint32_t x)
{
    return x&-quantum(x);
}

extern uint32_t vfpu_sin_fixed(uint32_t arg)
{
    // Handle endpoints.
    if(arg==0u) return 0u;
    if(arg==0x00800000) return 0x10000000;
    // Get quadratic coefficients.
    int32_t i=int32_t(arg>>7),j=int32_t(arg&0x7F);
    const int32_t *coef=quadratic_coefs[i];
    int64_t A=coef[0],B=coef[1],C=coef[2];
    // Compute approximation. Is off by at most 1 quantum.
    const int PA=30,PB=25,PC=3;
    uint32_t v=uint32_t(((((A*j>>(PA-PB))+B)*j>>(PB-PC))+C)>>PC);
    v=truncate_bits(v);
    // Look up exceptions. Binary search for now, but this
    // can be done better.
    unsigned lo=0,hi=sizeof(exceptions)/sizeof(exceptions[0]);
    while(lo<hi)
    {
        uint32_t m=(lo+hi)/2;
        uint32_t b=exceptions[m];
        uint32_t e=(b>>31?~b:b);
        if(e==arg)
        {
            v+=quantum(v)*(b>>31?-1u:+1u);
            break;
        }
        else if(e<arg) lo=m+1;
        else           hi=m;
    }
    return v;
}

@hrydgard
Owner

That's cool, nicely done. Though 2MB is still of course far larger than the real table (or whatever it is) is likely to be in the actual hardware. Getting to the point where it could be worth it if it unbreaks stuff in games (that Ridge Racer replay, or whatever it was, maybe?).

@fp64
Contributor Author

fp64 commented Feb 11, 2023

Though 2MB is still of course far larger than the real table (or whatever it is) is likely to be in the actual hardware.

Yeah, e.g. this mentions sizes around 10KB.

@fp64
Contributor Author

fp64 commented Feb 11, 2023

Haven't yet verified that cos uses the same table (reversed).

and done.

@fp64
Contributor Author

fp64 commented Feb 13, 2023

Made it uglier, but tables fit in 720 KB. Not sure if it's worth it.

Code:

vfpu_sin_fixed.cpp
#include "stdint.h"

#include "vfpu_sin_lut.h"

static inline uint32_t quantum(uint32_t x)
{
    //return x<1u<<22?
    //    1u:
    //    1u<<(32-22-__builtin_clz(x));
    int i=0;
    while((x>>i)>=0x00400000) ++i;
    return uint32_t(1)<<i;
}

static inline uint32_t truncate_bits(uint32_t x)
{
    return x&-quantum(x);
}

static inline int64_t ilerp(int64_t a,int64_t b,int64_t n,int64_t d)
{
    return a+((b-a)*n)/d;
}

// Cubic interpolation, with control points P0, P1, and
// derivatives T0, T1.
static inline int64_t icubic(int64_t P0,int64_t T0,int64_t P1,int64_t T1,int64_t n,int64_t d)
{
    // This can be converted to Bezier, which can
    // be calculated via lerps.
    int64_t C0=P0+T0/3;
    int64_t C1=P1-T1/3;
    int64_t Q0=ilerp(P0,C0,n,d);
    int64_t Q1=ilerp(C0,C1,n,d);
    int64_t Q2=ilerp(C1,P1,n,d);
    int64_t L0=ilerp(Q0,Q1,n,d);
    int64_t L1=ilerp(Q1,Q2,n,d);
    return ilerp(L0,L1,n,d);
}

// Catmull-Rom interpolation.
static inline int64_t icerp(int64_t P0,int64_t P1,int64_t P2,int64_t P3,int64_t n,int64_t d)
{
    return icubic(P1,(P2-P0)/2,P2,(P3-P1)/2,n,d);
}

// Fixed point 4.28 multiply.
static inline uint32_t mul28(uint32_t x,uint32_t y)
{
    return (uint32_t)((uint64_t)x*(uint64_t)y>>28);
}

// Crude approximation, may be off by around 5 quantums.
static inline uint32_t vfpu_sin_approx(uint32_t x)
{
    if((int32_t)x<0) return -vfpu_sin_approx(-x);
    x&=0x00FFFFFFu;
    if(x>=0x00800000u) x=0x01000000u-x;
    x*=(1u<<(28-23))/2; // Coefficients are for sinpi, so we divide by 2.
    uint32_t x2=mul28(x,x);
    return mul28(x, 843314880u
         -mul28(x2,1387197952u
         -mul28(x2, 684549312u
         -mul28(x2, 160694304u
         -mul28(x2,  21002648u)))));
}

uint32_t vfpu_sin_fixed(uint32_t arg)
{
    // Handle endpoints.
    if(arg==0u) return 0u;
    if(arg==0x00800000) return 0x10000000;
    // Get cubic coefficients.
    // Note: coef stores deltas from crude approximation.
    const signed char *coef=vfpu_sin_lut_coefs[arg>>8];
    int64_t P[4];
    for(unsigned i=0;i<4;++i)
    {
        P[i]=int32_t(vfpu_sin_approx((arg&-256u)+256u*(i-1)));
        int32_t q=(int32_t)quantum(uint32_t(P[i]<0?-P[i]:P[i]));
        P[i]=8192LL*P[i]+512LL*coef[i]*q;
    }
    // Compute approximation. Is off by at most 1 quantum.
    uint32_t v=uint32_t(icerp(P[0],P[1],P[2],P[3],arg&255,256)/8192);
    v=truncate_bits(v);
    // Look up exceptions via binary search.
    // Note: vfpu_sin_lut_intervals stores
    // deltas from interval estimation.
    int32_t lo=int32_t(459376LL*( arg      &-128u)/N)+16384
        +vfpu_sin_lut_intervals[ arg      >>7];
    int32_t hi=int32_t(459376LL*((arg+128u)&-128u)/N)+16384
        +vfpu_sin_lut_intervals[(arg+128u)>>7];
    while(lo<hi)
    {
        int32_t m=(lo+hi)/2;
        // Note: vfpu_sin_lut_exceptions stores
        // index&127 (for each initial interval the
        // upper bits of index are the same, namely
        // arg&-128), inverted if the correction is
        // negative (correction is always either +1
        // or -1).
        int32_t b=vfpu_sin_lut_exceptions[m];
        uint32_t e=(arg&-128u)+uint32_t(b<0?~b:b);
        if(e==arg)
        {
            v+=quantum(v)*(b<0?-1u:+1u);
            break;
        }
        else if(e<arg) lo=m+1;
        else           hi=m;
    }
    return v;
}

vfpu_sin_fixed.zip

There should be a better way, but well...

Not sure if it's good enough to consider making pull request.

@fp64
Contributor Author

fp64 commented Feb 13, 2023

It's also probably somewhat slower than before.

Also, amusingly, the table is not monotonic:

TABLE[5242176]==223176256
TABLE[5242177]==223176192
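Spots like this are easy to find with a linear scan. A small sketch (the function name is made up):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns every index i with table[i + 1] < table[i].
static std::vector<size_t> find_decreases(const std::vector<uint32_t> &table)
{
    std::vector<size_t> out;
    for (size_t i = 0; i + 1 < table.size(); ++i)
        if (table[i + 1] < table[i]) out.push_back(i);
    return out;
}
```

Note that the drop in the example above is exactly one output step (64) at that magnitude.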

@fp64
Contributor Author

fp64 commented Feb 13, 2023

In case anyone is curious, the junk for $|x|>2^{32}$ looks like this:

Table
Input (hex) Input (as float) Output (hex) Output (as float)
7F800000 inf 7F800001 nan
FF800000 -inf FF800001 -nan
7E800000 8.50705917e+37 00000000 0
7F000000 1.70141183e+38 00000000 0
7EC00000 1.27605888e+38 00000000 0
7F400000 2.55211775e+38 00000000 0
4FFFFFFF 8.58993408e+09 00000000 0
4F812345 4.33314458e+09 00000000 0
CF812345 -4.33314458e+09 80000000 -0
4FFFFFFF 8.58993408e+09 00000000 0
50012345 8.66628915e+09 BCE4BBA0 -0.0279214978
D0012345 -8.66628915e+09 3CE4BBA0 0.0279214978
50FFFFFF 3.43597363e+10 B5490000 -7.4878335e-07
50812345 1.73325783e+10 3D64A4C4 0.0558211952
D0812345 -1.73325783e+10 BD64A4C4 -0.0558211952
50FFFFFF 3.43597363e+10 B5490000 -7.4878335e-07
51012345 3.46651566e+10 3DE44980 0.111468315
D1012345 -3.46651566e+10 BDE44980 -0.111468315
51FFFFFF 1.37438945e+11 B6490000 -2.9951334e-06
51812345 6.93303132e+10 3E62DD4C 0.221547306
D1812345 -6.93303132e+10 BE62DD4C -0.221547306
51FFFFFF 1.37438945e+11 B6490000 -2.9951334e-06
52012345 1.38660626e+11 3EDD3A0C 0.432083488
D2012345 -1.38660626e+11 BEDD3A0C -0.432083488
52FFFFFF 5.49755781e+11 B7490000 -1.19805336e-05
52812345 2.77321253e+11 3F47827C 0.779334784
D2812345 -2.77321253e+11 BF47827C -0.779334784
52FFFFFF 5.49755781e+11 B7490000 -1.19805336e-05
53012345 5.54642506e+11 3F7A0754 0.976674318
D3012345 -5.54642506e+11 BF7A0754 -0.976674318
53FFFFFF 2.19902312e+12 B8490C00 -4.79333103e-05
53812345 1.10928501e+12 BED6C01C -0.419434428
D3812345 -1.10928501e+12 3ED6C01C 0.419434428
53FFFFFF 2.19902312e+12 B8490C00 -4.79333103e-05
54012345 2.21857002e+12 3F42F284 0.761512995
D4012345 -2.21857002e+12 BF42F284 -0.761512995
54FFFFFF 8.7960925e+12 B9491000 -0.000191748142
54812345 4.43714005e+12 3F7CB5C4 0.987148523
D4812345 -4.43714005e+12 BF7CB5C4 -0.987148523
54FFFFFF 8.7960925e+12 B9491000 -0.000191748142
55012345 8.87428009e+12 BEA18974 -0.315501809
D5012345 -8.87428009e+12 3EA18974 0.315501809
55FFFFFF 3.518437e+13 BA491040 -0.000766996294
55812345 1.77485602e+13 3F194954 0.598775148
D5812345 -1.77485602e+13 BF194954 -0.598775148
55FFFFFF 3.518437e+13 BA491040 -0.000766996294
56012345 3.54971204e+13 3F758A18 0.959138393
D6012345 -3.54971204e+13 BF758A18 -0.959138393
56FFFFFF 1.4073748e+14 BB491000 -0.00306797028
56812345 7.09942407e+13 3F0AF1B4 0.542750597
D6812345 -7.09942407e+13 BF0AF1B4 -0.542750597
56FFFFFF 1.4073748e+14 BB491000 -0.00306797028
57012345 1.41988481e+14 BF696590 -0.911705971
D7012345 -1.41988481e+14 3F696590 0.911705971
57FFFFFF 5.6294992e+14 BC490E90 -0.0122715384
57812345 2.83976963e+14 BF3FC764 -0.749136209
D7812345 -2.83976963e+14 3F3FC764 0.749136209
57FFFFFF 5.6294992e+14 BC490E90 -0.0122715384
58012345 5.67953926e+14 3F7E1324 0.992479563
D8012345 -5.67953926e+14 BF7E1324 -0.992479563
58FFFFFF 2.25179968e+15 BD48FB30 -0.0490676761
58812345 1.13590785e+15 BE78CFCC -0.242980182
D8812345 -1.13590785e+15 3E78CFCC 0.242980182
58FFFFFF 2.25179968e+15 BD48FB30 -0.0490676761
59012345 2.2718157e+15 3EF15AE8 0.471396685
D9012345 -2.2718157e+15 BEF15AE8 -0.471396685
59FFFFFF 9.00719872e+15 BE47C5C0 -0.195090294
59812345 4.54363141e+15 3F54DB30 0.831469536
D9812345 -4.54363141e+15 BF54DB30 -0.831469536
59FFFFFF 9.00719872e+15 BE47C5C0 -0.195090294
5A012345 9.08726281e+15 3F6C835C 0.923879385
DA012345 -9.08726281e+15 BF6C835C -0.923879385
5AFFFFFF 3.60287949e+16 BF3504F0 -0.70710659
5A812345 1.81745256e+16 BF3504F0 -0.70710659
DA812345 -1.81745256e+16 3F3504F0 0.70710659
5AFFFFFF 3.60287949e+16 BF3504F0 -0.70710659
5B012345 3.63490513e+16 3F800000 1
DB012345 -3.63490513e+16 BF800000 -1
5BFFFFFF 1.44115179e+17 80000000 -0
5B812345 7.26981025e+16 80000000 -0
DB812345 -7.26981025e+16 00000000 0
5BFFFFFF 1.44115179e+17 80000000 -0
5C012345 1.45396205e+17 00000000 0
DC012345 -1.45396205e+17 80000000 -0
5CFFFFFF 5.76460718e+17 00000000 0
5C812345 2.9079241e+17 00000000 0
DC812345 -2.9079241e+17 80000000 -0
5CFFFFFF 5.76460718e+17 00000000 0
5D012345 5.8158482e+17 00000000 0
DD012345 -5.8158482e+17 80000000 -0
5DFFFFFF 2.30584287e+18 00000000 0
5D812345 1.16316964e+18 00000000 0
DD812345 -1.16316964e+18 80000000 -0
5DFFFFFF 2.30584287e+18 00000000 0
5E012345 2.32633928e+18 00000000 0
DE012345 -2.32633928e+18 80000000 -0
5EFFFFFF 9.22337149e+18 00000000 0
5E812345 4.65267856e+18 00000000 0
DE812345 -4.65267856e+18 80000000 -0
5EFFFFFF 9.22337149e+18 00000000 0
5F012345 9.30535712e+18 00000000 0
DF012345 -9.30535712e+18 80000000 -0
5FFFFFFF 3.68934859e+19 00000000 0
5F812345 1.86107142e+19 00000000 0
DF812345 -1.86107142e+19 80000000 -0
5FFFFFFF 3.68934859e+19 00000000 0
60012345 3.72214285e+19 BCE4BBA0 -0.0279214978
E0012345 -3.72214285e+19 3CE4BBA0 0.0279214978
60FFFFFF 1.47573944e+20 B5490000 -7.4878335e-07
60812345 7.4442857e+19 3D64A4C4 0.0558211952
E0812345 -7.4442857e+19 BD64A4C4 -0.0558211952
60FFFFFF 1.47573944e+20 B5490000 -7.4878335e-07
61012345 1.48885714e+20 3DE44980 0.111468315
E1012345 -1.48885714e+20 BDE44980 -0.111468315
61FFFFFF 5.90295775e+20 B6490000 -2.9951334e-06
61812345 2.97771428e+20 3E62DD4C 0.221547306
E1812345 -2.97771428e+20 BE62DD4C -0.221547306
61FFFFFF 5.90295775e+20 B6490000 -2.9951334e-06
62012345 5.95542856e+20 3EDD3A0C 0.432083488
E2012345 -5.95542856e+20 BEDD3A0C -0.432083488
62FFFFFF 2.3611831e+21 B7490000 -1.19805336e-05
62812345 1.19108571e+21 3F47827C 0.779334784
E2812345 -1.19108571e+21 BF47827C -0.779334784
62FFFFFF 2.3611831e+21 B7490000 -1.19805336e-05
63012345 2.38217142e+21 3F7A0754 0.976674318
E3012345 -2.38217142e+21 BF7A0754 -0.976674318
63FFFFFF 9.4447324e+21 B8490C00 -4.79333103e-05
63812345 4.76434285e+21 BED6C01C -0.419434428
E3812345 -4.76434285e+21 3ED6C01C 0.419434428
63FFFFFF 9.4447324e+21 B8490C00 -4.79333103e-05
64012345 9.52868569e+21 3F42F284 0.761512995
E4012345 -9.52868569e+21 BF42F284 -0.761512995
64FFFFFF 3.77789296e+22 B9491000 -0.000191748142
64812345 1.90573714e+22 3F7CB5C4 0.987148523
E4812345 -1.90573714e+22 BF7CB5C4 -0.987148523
64FFFFFF 3.77789296e+22 B9491000 -0.000191748142
65012345 3.81147428e+22 BEA18974 -0.315501809
E5012345 -3.81147428e+22 3EA18974 0.315501809
65FFFFFF 1.51115718e+23 BA491040 -0.000766996294
65812345 7.62294855e+22 3F194954 0.598775148
E5812345 -7.62294855e+22 BF194954 -0.598775148
65FFFFFF 1.51115718e+23 BA491040 -0.000766996294
66012345 1.52458971e+23 3F758A18 0.959138393
E6012345 -1.52458971e+23 BF758A18 -0.959138393
66FFFFFF 6.04462874e+23 BB491000 -0.00306797028
66812345 3.04917942e+23 3F0AF1B4 0.542750597
E6812345 -3.04917942e+23 BF0AF1B4 -0.542750597
66FFFFFF 6.04462874e+23 BB491000 -0.00306797028
67012345 6.09835884e+23 BF696590 -0.911705971
E7012345 -6.09835884e+23 3F696590 0.911705971
67FFFFFF 2.4178515e+24 BC490E90 -0.0122715384
67812345 1.21967177e+24 BF3FC764 -0.749136209
E7812345 -1.21967177e+24 3F3FC764 0.749136209
67FFFFFF 2.4178515e+24 BC490E90 -0.0122715384
68012345 2.43934354e+24 3F7E1324 0.992479563
E8012345 -2.43934354e+24 BF7E1324 -0.992479563
68FFFFFF 9.67140598e+24 BD48FB30 -0.0490676761
68812345 4.87868707e+24 BE78CFCC -0.242980182
E8812345 -4.87868707e+24 3E78CFCC 0.242980182
68FFFFFF 9.67140598e+24 BD48FB30 -0.0490676761
69012345 9.75737415e+24 3EF15AE8 0.471396685
E9012345 -9.75737415e+24 BEF15AE8 -0.471396685
69FFFFFF 3.86856239e+25 BE47C5C0 -0.195090294
69812345 1.95147483e+25 3F54DB30 0.831469536
E9812345 -1.95147483e+25 BF54DB30 -0.831469536
69FFFFFF 3.86856239e+25 BE47C5C0 -0.195090294
6A012345 3.90294966e+25 3F6C835C 0.923879385
EA012345 -3.90294966e+25 BF6C835C -0.923879385
6AFFFFFF 1.54742496e+26 BF3504F0 -0.70710659
6A812345 7.80589932e+25 BF3504F0 -0.70710659
EA812345 -7.80589932e+25 3F3504F0 0.70710659
6AFFFFFF 1.54742496e+26 BF3504F0 -0.70710659
6B012345 1.56117986e+26 3F800000 1
EB012345 -1.56117986e+26 BF800000 -1
6BFFFFFF 6.18969983e+26 80000000 -0
6B812345 3.12235973e+26 80000000 -0
EB812345 -3.12235973e+26 00000000 0
6BFFFFFF 6.18969983e+26 80000000 -0
6C012345 6.24471946e+26 00000000 0
EC012345 -6.24471946e+26 80000000 -0
6CFFFFFF 2.47587993e+27 00000000 0
6C812345 1.24894389e+27 00000000 0
EC812345 -1.24894389e+27 80000000 -0
6CFFFFFF 2.47587993e+27 00000000 0
6D012345 2.49788778e+27 00000000 0
ED012345 -2.49788778e+27 80000000 -0
6DFFFFFF 9.90351972e+27 00000000 0
6D812345 4.99577556e+27 00000000 0
ED812345 -4.99577556e+27 80000000 -0
6DFFFFFF 9.90351972e+27 00000000 0
6E012345 9.99155113e+27 00000000 0
EE012345 -9.99155113e+27 80000000 -0
6EFFFFFF 3.96140789e+28 00000000 0
6E812345 1.99831023e+28 00000000 0
EE812345 -1.99831023e+28 80000000 -0
6EFFFFFF 3.96140789e+28 00000000 0
6F012345 3.99662045e+28 00000000 0
EF012345 -3.99662045e+28 80000000 -0
6FFFFFFF 1.58456316e+29 00000000 0
6F812345 7.9932409e+28 00000000 0
EF812345 -7.9932409e+28 80000000 -0
6FFFFFFF 1.58456316e+29 00000000 0
70012345 1.59864818e+29 BCE4BBA0 -0.0279214978
F0012345 -1.59864818e+29 3CE4BBA0 0.0279214978
70FFFFFF 6.33825262e+29 B5490000 -7.4878335e-07
70812345 3.19729636e+29 3D64A4C4 0.0558211952
F0812345 -3.19729636e+29 BD64A4C4 -0.0558211952
70FFFFFF 6.33825262e+29 B5490000 -7.4878335e-07
71012345 6.39459272e+29 3DE44980 0.111468315
F1012345 -6.39459272e+29 BDE44980 -0.111468315
71FFFFFF 2.53530105e+30 B6490000 -2.9951334e-06
71812345 1.27891854e+30 3E62DD4C 0.221547306
F1812345 -1.27891854e+30 BE62DD4C -0.221547306
71FFFFFF 2.53530105e+30 B6490000 -2.9951334e-06
72012345 2.55783709e+30 3EDD3A0C 0.432083488
F2012345 -2.55783709e+30 BEDD3A0C -0.432083488
72FFFFFF 1.01412042e+31 B7490000 -1.19805336e-05
72812345 5.11567418e+30 3F47827C 0.779334784
F2812345 -5.11567418e+30 BF47827C -0.779334784
72FFFFFF 1.01412042e+31 B7490000 -1.19805336e-05
73012345 1.02313484e+31 3F7A0754 0.976674318
F3012345 -1.02313484e+31 BF7A0754 -0.976674318
73FFFFFF 4.05648168e+31 B8490C00 -4.79333103e-05
73812345 2.04626967e+31 BED6C01C -0.419434428
F3812345 -2.04626967e+31 3ED6C01C 0.419434428
73FFFFFF 4.05648168e+31 B8490C00 -4.79333103e-05
74012345 4.09253934e+31 3F42F284 0.761512995
F4012345 -4.09253934e+31 BF42F284 -0.761512995
74FFFFFF 1.62259267e+32 B9491000 -0.000191748142
74812345 8.18507868e+31 3F7CB5C4 0.987148523
F4812345 -8.18507868e+31 BF7CB5C4 -0.987148523
74FFFFFF 1.62259267e+32 B9491000 -0.000191748142
75012345 1.63701574e+32 BEA18974 -0.315501809
F5012345 -1.63701574e+32 3EA18974 0.315501809
75FFFFFF 6.49037069e+32 BA491040 -0.000766996294
75812345 3.27403147e+32 3F194954 0.598775148
F5812345 -3.27403147e+32 BF194954 -0.598775148
75FFFFFF 6.49037069e+32 BA491040 -0.000766996294
76012345 6.54806295e+32 3F758A18 0.959138393
F6012345 -6.54806295e+32 BF758A18 -0.959138393
76FFFFFF 2.59614827e+33 BB491000 -0.00306797028
76812345 1.30961259e+33 3F0AF1B4 0.542750597
F6812345 -1.30961259e+33 BF0AF1B4 -0.542750597
76FFFFFF 2.59614827e+33 BB491000 -0.00306797028
77012345 2.61922518e+33 BF696590 -0.911705971
F7012345 -2.61922518e+33 3F696590 0.911705971
77FFFFFF 1.03845931e+34 BC490E90 -0.0122715384
77812345 5.23845036e+33 BF3FC764 -0.749136209
F7812345 -5.23845036e+33 3F3FC764 0.749136209
77FFFFFF 1.03845931e+34 BC490E90 -0.0122715384
78012345 1.04769007e+34 3F7E1324 0.992479563
F8012345 -1.04769007e+34 BF7E1324 -0.992479563
78FFFFFF 4.15383724e+34 BD48FB30 -0.0490676761
78812345 2.09538014e+34 BE78CFCC -0.242980182
F8812345 -2.09538014e+34 3E78CFCC 0.242980182
78FFFFFF 4.15383724e+34 BD48FB30 -0.0490676761
79012345 4.19076029e+34 3EF15AE8 0.471396685
F9012345 -4.19076029e+34 BEF15AE8 -0.471396685
79FFFFFF 1.6615349e+35 BE47C5C0 -0.195090294
79812345 8.38152057e+34 3F54DB30 0.831469536
F9812345 -8.38152057e+34 BF54DB30 -0.831469536
79FFFFFF 1.6615349e+35 BE47C5C0 -0.195090294
7A012345 1.67630411e+35 3F6C835C 0.923879385
FA012345 -1.67630411e+35 BF6C835C -0.923879385
7AFFFFFF 6.64613958e+35 BF3504F0 -0.70710659
7A812345 3.35260823e+35 BF3504F0 -0.70710659
FA812345 -3.35260823e+35 3F3504F0 0.70710659
7AFFFFFF 6.64613958e+35 BF3504F0 -0.70710659
7B012345 6.70521646e+35 3F800000 1
FB012345 -6.70521646e+35 BF800000 -1
7BFFFFFF 2.65845583e+36 80000000 -0
7B812345 1.34104329e+36 80000000 -0
FB812345 -1.34104329e+36 00000000 0
7BFFFFFF 2.65845583e+36 80000000 -0
7C012345 2.68208658e+36 00000000 0
FC012345 -2.68208658e+36 80000000 -0
7CFFFFFF 1.06338233e+37 00000000 0
7C812345 5.36417317e+36 00000000 0
FC812345 -5.36417317e+36 80000000 -0
7CFFFFFF 1.06338233e+37 00000000 0
7D012345 1.07283463e+37 00000000 0
FD012345 -1.07283463e+37 80000000 -0
7DFFFFFF 4.25352933e+37 00000000 0
7D812345 2.14566927e+37 00000000 0
FD812345 -2.14566927e+37 80000000 -0
7DFFFFFF 4.25352933e+37 00000000 0
7E012345 4.29133853e+37 00000000 0
FE012345 -4.29133853e+37 80000000 -0
7EFFFFFF 1.70141173e+38 00000000 0
7E812345 8.58267707e+37 00000000 0
FE812345 -8.58267707e+37 80000000 -0
7EFFFFFF 1.70141173e+38 00000000 0
7F012345 1.71653541e+38 00000000 0
FF012345 -1.71653541e+38 80000000 -0

@hrydgard
Owner

Wow, this is some wacky stuff. Non-monotonicity within the wrapped range is really surprising! Might possibly be some kind of hint to what's really going on, or it's just another indication that we won't be able to find it, because it's so weird :P

@unknownbrackets
Collaborator

I think I had theorized that only some of the exponent bits are respected, causing this pattern. It seemed to mostly work. See:

	if (k > 0x80) {
		const uint8_t over = k & 0x1F;
		mantissa = (mantissa << over) & 0x00FFFFFF;
		k = 0x80;
	}

-[Unknown]

@fp64
Contributor Author

fp64 commented Feb 14, 2023

const uint8_t over = k & 0x1F;

When I tried it before, it didn't seem to work.
But I must have botched something, because now it does get mostly correct results:

https://godbolt.org/z/YbzhbshPd

Thanks!

@hrydgard
Owner

A common way to "linearize" floats between 0 and 1 before doing lookups into a table is to simply add 1.0, after that, the mantissa represents the linear position between 1 and 2 and you can just drop the exponent and use the mantissa from there on as your table lookup index. But this is from a software perspective. That the exponent seems to wrap like that makes me think that there's a table lookup involving that too, or some specialized hardware that effectively does the same as adding 1.0 but cheaper and thus limited in the range of exponents it can process...
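A minimal sketch of the add-1.0 trick described above, assuming x is already in [0, 1) and floats are IEEE 754 binary32. The names table_index and kTableBits and the 256-entry table size are hypothetical, chosen only for illustration; none of this is PPSSPP code.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical table size: 2^8 entries (illustrative only).
constexpr unsigned kTableBits = 8;

// Maps x in [0, 1) to a table index by adding 1.0f, so the sum lies in
// [1, 2), where the exponent is fixed and the 23-bit mantissa is linear in x.
static uint32_t table_index(float x) {
    float shifted = x + 1.0f;
    uint32_t bits;
    std::memcpy(&bits, &shifted, sizeof(bits));
    uint32_t mantissa = bits & 0x007FFFFFu;  // drop sign and exponent
    return mantissa >> (23u - kTableBits);   // keep the top kTableBits bits
}
```

For example, table_index(0.5f) gives 128 and table_index(0.25f) gives 64. Note that the addition rounds: an input within rounding distance of 1 can become exactly 2.0 and wrap to index 0, so real code would have to clamp or special-case the top of the range.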

@fp64
Contributor Author

fp64 commented Feb 14, 2023

It just looks like a shift nobody bothered to mask (same as x86 shr,shl instructions).

It might be even cyclic shift (so no need to branch on sign of exponent-0x7F) with some additional masking.
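To make the "unmasked shift" idea concrete, here is a small sketch of a left shift that honors only the low 5 bits of the count, which is what x86 shl does for 32-bit operands. The helper name shl_mod32 is hypothetical. Whether the PSP hardware works this way is only a conjecture in this thread.

```cpp
#include <cstdint>

// Left shift that only honors the low 5 bits of the count.
// In C++ a shift of a uint32_t by 32 or more is undefined behavior,
// so the wrap-around has to be written out explicitly.
static uint32_t shl_mod32(uint32_t value, uint32_t count) {
    return value << (count & 31u);
}
```

With exponent 0xF0 the nominal shift is 0xF0-0x7F = 113, which under this model acts as a shift by 113 & 31 = 17. This is the same wrapping that k & 0x1F expresses in the snippet quoted earlier.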

@unknownbrackets
Collaborator

I think a table that's down to 720 KB is interesting. Not that terrible RAM usage if it fixes things.

Games I know that are or could be affected by sin/cos accuracy:

  • FF3
  • Hitman Reborn
  • Ridge Racer
  • Dissidia
  • GTA

I wonder how much slower this variant is than what we do now, though... we'd probably end up not wanting to add it everywhere.

-[Unknown]

@hrydgard
Owner

hrydgard commented Feb 15, 2023

sin and cos are generally not the most frequently used mathematical operations, since they're normally used for things like setting up factors that are then multiplied with a lot of other things (a typical example being a rotation matrix), so a not-gigantic slowdown in these might not be so bad...

Though, given they are pretty fast on the hardware, maybe some games over-use them for other stuff..

And I agree that 720 KB is getting palatable..

@fp64
Contributor Author

fp64 commented Feb 15, 2023

Ok, I think this matches all available data:

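#include <cstdint>
#include <cstring>

// Fixed-point core (table-based), defined elsewhere; signature assumed from the call below.
uint32_t vfpu_sin_fixed(uint32_t a);

static float vfpu_sin(float x)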
{
    uint32_t bits;
    memcpy(&bits,&x,sizeof(x));
    uint32_t sign=bits&0x80000000u;
    uint32_t exponent=(bits>>23)&0xFFu;
    uint32_t significand=(bits&0x007FFFFFu)|0x00800000u;
    if(exponent==0xFFu)
    {
        // NOTE: this bitpattern is a signaling
        // NaN on x86, so maybe just return
        // a normal qNaN?
        float y;
        bits=sign^0x7F800001u;
        memcpy(&y,&bits,sizeof(y));
        return y;
    }
    if(exponent<0x7Fu)
    {
        if(exponent<0x7Fu-23u) significand=0u;
        else significand>>=(0x7F-exponent);
    }
    else if(exponent>0x7Fu)
    {
        // There is weirdness for large exponents.
        if(exponent-0x7Fu>=25u&&exponent-0x7Fu<32u) significand=0u;
        else if((exponent&0x9Fu)==0x9Fu) significand=0u;
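        // Only the low 5 bits of the shift count take effect (as with x86 shl);
        // masking explicitly avoids an undefined shift by >= 32 in C++.
        else significand<<=((exponent-0x7Fu)&31u);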
    }
    sign^=((significand<<7)&0x80000000u);
    significand&=0x00FFFFFFu;
    if(significand>0x00800000u) significand=0x01000000u-significand;
    uint32_t ret=vfpu_sin_fixed(significand);
    return (sign?-1.0f:+1.0f)*float(int32_t(ret))*3.7252903e-09f; // 0x1p-28f
}

Notes:

  • using x87 FPU (-mfpmath=387 on 32-bit) sNaN tends to get turned into qNaN.
  • memcpy is not endian-safe, this needs a proper float_as_bits (can be concocted with frexpf).

@hrydgard
Owner

Cool! So the remaining piece is reducing vfpu_sin_fixed further if possible.

memcpy is perfectly endian-safe for something like this, as I haven't heard of an architecture that has different endian-ness for its floating point. In addition, we only support little-endian architectures in practice anyway.
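For reference, a minimal sketch of the memcpy-based bit cast under discussion, assuming float is IEEE 754 binary32 and shares byte order with uint32_t (which holds on the little-endian targets mentioned above). The helper names float_to_bits and bits_to_float are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

// Reinterprets the object representation of a float as a 32-bit integer.
static uint32_t float_to_bits(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return u;
}

// Inverse of float_to_bits.
static float bits_to_float(uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}
```

Under these assumptions float_to_bits(1.0f) is 0x3F800000, the same pattern that shows up in the upper bits of the dumps in this thread.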

@fp64
Contributor Author

fp64 commented Feb 15, 2023

Actually, I still do not have the data to test reduction for cos (for larger values). It's easy to make a guess, but confirmation would be nice (e.g. does the sign of NaN match the input, or fabsf(input)?).

Should I make a PR? Not to be merged as-is, but just to have an artifact to test things with? If so:

  • Do I need to stick PPSSPP GPL text on vfpu_sin_lut.h (to be #include'd in MIPSVFPUUtils.cpp only)?
  • Do I need to mention it in CMakeLists.txt somewhere?

Speed, as measured on my machine:

Method                        Time
sinf(x*float(M_PI_2))         1.0x
(float)sin(x*M_PI_2)          1.6x
32MB table, lin. access       1.83x
32MB table, rnd. access       3.6x
720KB table, vfpu_sin_fixed   12.0x

Aside: pedantry compels me to use 'significand' instead of 'mantissa', but that's just me.

@fp64
Contributor Author

fp64 commented Feb 15, 2023

Does not seem to fix #2990 (the behavior is still the same as the 1st video). This much was already suspected.

Incidentally, when I got the sign of cos wrong (didn't do the signflip for [1;2)), the camera started behaving weirdly, but the path of the car itself seemed the same as before. Maybe sin/cos aren't used, or maybe angles are always in [-1;+1].

@fp64
Contributor Author

fp64 commented May 21, 2023

I guess it's not fair to evaluate a random number generator on bad seeds?

Yes and no? Making sure there aren't any bad seeds is an attractive (and achievable) design goal for an RNG. Especially if you control the seeding procedure (and vrnds does) - you can have bad states, but make sure no seeds ever lead to them (immediately or otherwise). Maybe something as simple as force-setting D-part to 1 at seeding might have completely fixed these problems, or maybe not.

And it matters, in practice, which seeds. Like "what happens with seed=0?" is about the first thing one tests about an RNG. And in this case the answer is "underwhelming RNG". If there is a doc somewhere that explicitly spells out "don't use seed 0, like, ever; or 0x40000000, really" - I haven't seen it.

Assuming it's not, it indeed doesn't seem bad then.

It does look pretty ok for normal seeds. But then again, at 256-bit state (or 160-bit, or 129-bit, since E-part is effectively 1-bit most of the time) it better be. To be fair, there are some well-known RNGs that fare worse while having larger state (original Mersenne Twister is an obvious example).

Very cool to have correct (or close to it) implementation finally, nicely done! Eagerly awaiting the PR :)

Besides more testing (on whole dataset, and I might formulate couple more requests - e.g. more seeds with top bit set to help verify E-part), there is some more plumbing to be done (actually interacting with thread context stuff), which I'm not super clear about either.

@fp64
Contributor Author

fp64 commented May 21, 2023

And yes, nice experience overall. Especially, since, unlike floating point stuff, we actually do get to uncover what HW does.

@unknownbrackets
Collaborator

Well, given how the GE works in some ways too, and even some missing checks for interlock in the VFPU (matrix overlap for src/dst), I wouldn't be surprised if the docs just said not to use certain seed values at all.

Here's the range of context values, the "_s" one is just a single vrndi.s per iteration. Note that this is the result of attempting to write context values with one bit - they read back always with 0x3F80 at the top, so that's what the file has. For completeness, I included writing 1 for those upper bits, although I assume the registers simply aren't that wide.

vrndi_hamming_ctx.dat.zst.zip
vrndi_hamming_ctx_s.dat.zst.zip
vrndi_rcx3_s.dat.zst.zip

Looking pretty cool so far.

-[Unknown]

@hrydgard
Owner

Besides more testing (on whole dataset, and I might formulate couple more requests - e.g. more seeds with top bit set to help verify E-part), there is some more plumbing to be done (actually interacting with thread context stuff), which I'm not super clear about either.

Hm, what thread context stuff? We already handle VFPU context switching between threads that have the VFPU bit enabled, and these values are just part of it, so it should be alright already.

@fp64
Contributor Author

fp64 commented May 22, 2023

So it's not fully accurate.

Seed 7F000000 fail.
Iteration 0.
Expected:
9DB9EC94 0BD9DB78 D96A341D 5FBCFBDD 3F802A58 3F80C21A 3F80000E 3F800022 3F80C213 3F8064A6 3F801600 3F807700 
Actual:
9DB9ED3D 0BD9DBBE D96A343A 5FBCFBE9 3F802A58 3F80C21A 3F800054 3F8000CB 3F80C213 3F8064A6 3F801600 3F807700 
Seed 7F014CD1...
Seed 7F014CD1 fail.
Iteration 0.
Expected:
0158A18E 919A5739 F7332AD6 FBBA6B99 3F802749 3F800FCA 3F80D918 3F806A7B 3F80CE76 3F80B2BD 3F8019C9 3F808025 
Actual:
0158A237 919A577F F7332AF3 FBBA6BA5 3F802749 3F800FCA 3F80D95E 3F806B24 3F80CE76 3F80B2BD 3F8019C9 3F808025 

@fp64
Contributor Author

fp64 commented May 22, 2023

All RII1 in vrndi_7F.dat matches though.

I'll also ask again: is there any difference between dumps for 7F/various vs noseed (generated by different code or something)? Because noseed generates 4 outputs per iteration, but the others generate 8 but show 4 (skip every other line in dump).

@unknownbrackets
Collaborator

No, the main loop is the same between both as long as the header type is the same (i.e. RIII1.) The only difference is that for noseed and the ctx ones, vrnds.s is never called before the loop. A set of 8 vmfvc and sv.q are used to read the context values (and lv.q / vmtvc to write in the ctx case.)

The code has gotten a little messy but looks like this:

	for (int i = 0; i < iterations; ++i) {
		ScePspVector4 *result = RunVrndOp(type, vsz);
		for (int j = 0; j < (vsz == VRNDS_ONE || vsz == VRNDS_ONE_S ? 1 : 32); ++j) {
			datwrite4(result[j].i);
		}
		if (vsz == VRNDS_ONE || vsz == VRNDS_ONE_S) {
			ctx = GetVrndContext();
			datwrite4(ctx);
			datwrite4(ctx + 4);
		}
	}

	if (vsz != VRNDS_ONE && vsz != VRNDS_ONE_S) {
		ctx = GetVrndContext();
		int ctxHeader[4] = { 0x3D3D3D3D, 0x00585443, 0x3D3D3D3D, typemarker | szmarker };
		datwrite4(ctxHeader);
		datwrite4(ctx);
		datwrite4(ctx + 4);
	}

RunVrndOp() just runs asm(...); for vrndi.qs and sv.qs, the same for any of the generated files. When using RIIQ are you also seeing 8 generated per iteration?

-[Unknown]

@fp64
Contributor Author

fp64 commented May 22, 2023

Just in case: GCC thinks asm is side-effect-free if you do not tell it otherwise (volatile/constraints/clobbers). It can happily rearrange/duplicate/unroll it.
Could it have something to do with this?
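As an illustration of what "telling it otherwise" looks like, here is a minimal GCC/Clang extended-asm sketch. The function name opaque_pass is hypothetical and the asm template is deliberately empty; the real test would put the VFPU instructions there, and that code is not shown in this thread.

```cpp
#include <cstdint>

// GCC/Clang extended asm. The empty template emits no instructions; what
// matters is what the compiler is told about the statement.
static uint32_t opaque_pass(uint32_t v) {
    // volatile:  the statement is not discarded as dead code or hoisted
    //            out of loops.
    // "+r"(v):   v is both read and written, in a register.
    // "memory":  the asm may read or write memory, so cached values are
    //            flushed before it and reloaded after it.
    asm volatile("" : "+r"(v) : : "memory");
    return v;
}
```

The value passes through unchanged, but the optimizer can no longer assume anything about it, which is the property a hardware test loop needs for every iteration to actually execute.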

When using RIIQ are you also seeing 8 generated per iteration?

Nope, RIIQ seems to have all output:

00000000|3D3D3D3D 44454553 3D3D3D3D 00000000|====SEED====....
00000010|3F800000 3F800000 3F800000 3F800000|...?...?...?...?
00000020|3F800000 3F800000 3F800000 3F800000|...?...?...?...?
00000030|3D3D3D3D 31494952 3D3D3D3D 00000064|====RII1====d...
00000040|63132A58 E3CC94B3 E723057A 2E130A5D|X*.c....z.#.]...
00000050|3F802A58 3F800000 3F800000 3F800000|X*.?...?...?...?
00000060|3F806313 3F800000 3F800000 3F800000|.c.?...?...?...?
00000070|2EDA2FB0 22DC176B 3F269B12 44EB1E55|./..k.."..&?U..D

0000db70|3D3D3D3D 44454553 3D3D3D3D 00000000|====SEED====....
0000db80|3F800000 3F800000 3F800000 3F800000|...?...?...?...?
0000db90|3F800000 3F800000 3F800000 3F800000|...?...?...?...?
0000dba0|3D3D3D3D 51494952 3D3D3D3D 00000064|====RIIQ====d...
0000dbb0|80648F81 1C5983F7 00010DCE 00000001|..d...Y.........
0000dbc0|63132A58 E3CC94B3 E723057A 2E130A5D|X*.c....z.#.]...
0000dbd0|A38863A4 2F8F472F E1D765E6 79D76079|.c../G./.e..y`.y
0000dbe0|2EDA2FB0 22DC176B 3F269B12 44EB1E55|./..k.."..&?U..D
0000dbf0|9E3C9A7C 478B4167 B76DD0FE C8D41FF1||.<.gA.G..m.....

So for RII1 we have:
63132A58 E3CC94B3 E723057A 2E130A5D
2EDA2FB0 22DC176B 3F269B12 44EB1E55
but for RIIQ:
80648F81 1C5983F7 00010DCE 00000001
63132A58 E3CC94B3 E723057A 2E130A5D
A38863A4 2F8F472F E1D765E6 79D76079
2EDA2FB0 22DC176B 3F269B12 44EB1E55
9E3C9A7C 478B4167 B76DD0FE C8D41FF1

@unknownbrackets
Collaborator

Weird, that definitely is happening. Maybe I fixed a bug without realizing? I've run into a couple optimizer bugs, having used -O1 here... hope it wasn't that. I generated a new 7F and it's got correct values now, for whatever reason. Sorry, this probably confused things a good bit.

Do you want me to upload the corrected one?

-[Unknown]

@fp64
Contributor Author

fp64 commented May 22, 2023

Do you want me to upload the corrected one?

Not particularly pressing (I've tested all previous RII1's for 7F, and it's very unlikely that "inbetween" iterations are off somehow), but it wouldn't hurt.
Corrected vrndi_various.dat is a bit more useful, but not much (all output is there in RIIQ's and state can be recovered from it).

@fp64
Contributor Author

fp64 commented May 24, 2023

Hm, so

    E=addition_overflows(C+E,C)&&
      addition_overflows(C+E,D)&&
      addition_overflows(C+E,C+D+E)&&
      addition_overflows(C+E,C+D);

matches RII1's in both vrndi_7F.dat and vrndi_various.dat.

@fp64
Contributor Author

fp64 commented May 24, 2023

Actually, I think I'd like corrected dumps after all.
It appears that there is something strange with RIIQ as well.
The vrndi_various.dat for seed=00000000 has:

80648F81 1C5983F7 00010DCE 00000001
63132A58 E3CC94B3 E723057A 2E130A5D
A38863A4 2F8F472F E1D765E6 79D76079
2EDA2FB0 22DC176B 3F269B12 44EB1E55
9E3C9A7C 478B4167 B76DD0FE C8D41FF1

But the vrnd_generate4 produces:

C35937CC 1C5983F7 00010DCE 00000001
63132A58 E3CC94B3 E723057A 2E130A5D
A38863A4 2F8F472F E1D765E6 79D76079
2EDA2FB0 22DC176B 3F269B12 44EB1E55
9E3C9A7C 478B4167 B76DD0FE C8D41FF1

which only differs in 0th output (3rd actually, since output register is filled backwards), but is on track after that. vrndi_rcx3_s.dat also has C35937CC at corresponding position.

@fp64
Contributor Author

fp64 commented May 25, 2023

Randomly caught this during compilation (too small to create an issue) while preparing implementation:

[ 54%] Building CXX object CMakeFiles/Common.dir/Common/StringUtils.cpp.o
In file included from ppsspp/Common/MemoryUtil.cpp:24:
ppsspp/Common/MemoryUtil.cpp: In function ‘void* AllocateAlignedMemory(size_t, size_t)’:
ppsspp/Common/MemoryUtil.cpp:260:31: warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 6 has type ‘size_t’ {aka ‘unsigned int’} [-Wformat=]
  260 |  _assert_msg_(ptr != nullptr, "Failed to allocate aligned memory of size %llu", size);
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~
      |                                                                                 |
      |                                                                                 size_t {aka unsigned int}
ppsspp/Common/Log.h:160:65: note: in definition of macro ‘_assert_msg_’
  160 |  (!HandleAssert(__FUNCTION__, __FILENAME__, __LINE__, #_a_, __VA_ARGS__)) Crash(); \
      |                                                             ^~~~~~~~~~~

ppsspp/Common/MemoryUtil.cpp:260:77: note: format string is defined here
  260 | tr != nullptr, "Failed to allocate aligned memory of size %llu", size);
      |                                                           ~~~^
      |                                                              |
      |                                                              long long unsigned int
      |                                                           %u

Can be fixed by either "%zu" (which PPSSPP seems to hardly ever use: 2 non-external matches via git grep) or (unsigned long long)size.
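Both options side by side, as a standalone sketch (format_size is a hypothetical helper, not part of PPSSPP):

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Formats a size_t portably. %zu matches size_t on both 32-bit and 64-bit
// targets; the alternative is casting to unsigned long long and using %llu.
static std::string format_size(size_t size) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "size %zu / %llu",
                  size, (unsigned long long)size);
    return buf;
}
```

Either form avoids the -Wformat warning, because the argument type now matches the conversion specifier regardless of how wide size_t is.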

@fp64
Contributor Author

fp64 commented May 25, 2023

Hm, what thread context stuff? We already handle VFPU context switching between threads that have the VFPU bit enabled, and these values are just part of it, so it should be alright already.

Oh, yeah currentMIPS is available, so it's probably fine.
Does removing MIPSState::rng break anything (savestate compatibility? do we care?)? It can, of course, remain as a dummy instead.

@hrydgard
Owner

The savestate handling just needs a version check. Maybe we should even run the old rng on old savestates, not sure.

@fp64
Contributor Author

fp64 commented May 25, 2023

Hm, previous change to

auto s = p.Section("MIPSState", 1, 3);

seems to have been 55500d4 (a while ago), so maybe a less abrupt switch is useful.

fp64 added a commit to fp64/ppsspp that referenced this issue May 25, 2023
See investigation starting
hrydgard#16946 (comment)
for more details.
Still needs more testing.
@fp64
Contributor Author

fp64 commented May 26, 2023

It appears that expressions

E = addition_overflows(C + E, C) &&
    addition_overflows(C + E, D) &&
    addition_overflows(C + E, C + D + E);

and

E = addition_overflows(C + E, C) &&
    addition_overflows(C + E, D) &&
    addition_overflows(C + E, C + D + E) &&
    addition_overflows(C + E, C + D);

both match current dataset.

They are not equivalent, e.g. the results are different for C=FFFFFF01, D=000000FF, E=00000080.
As such I'd like to see the RIIs dump for

rcx0=3F800000
rcx1=3F880000
rcx2=3F80FF01
rcx3=3F8000FF
rcx4=3F800000
rcx5=3F800000
rcx6=3F80FFFF
rcx7=3F800000

This corresponds to the above parameters. This particular state should not be possible to reach via either vrnds or noseed, but direct writing to RCX might work?
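The disagreement between the two candidate expressions at C=FFFFFF01, D=000000FF, E=00000080 can be checked with a short sketch. It assumes that addition_overflows(a, b) means "a + b does not fit in 32 bits" and that all sums wrap modulo 2^32; the thread does not spell out either definition, and the names next_e_three and next_e_four are hypothetical.

```cpp
#include <cstdint>

// Assumed meaning: true if a + b carries out of 32 bits.
static bool addition_overflows(uint32_t a, uint32_t b) {
    return uint32_t(a + b) < a;
}

// Three-clause candidate.
static bool next_e_three(uint32_t C, uint32_t D, uint32_t E) {
    return addition_overflows(C + E, C) &&
           addition_overflows(C + E, D) &&
           addition_overflows(C + E, C + D + E);
}

// Four-clause candidate: the same, plus the (C + E, C + D) clause.
static bool next_e_four(uint32_t C, uint32_t D, uint32_t E) {
    return next_e_three(C, D, E) && addition_overflows(C + E, C + D);
}
```

For these inputs C + D wraps to exactly 0, so the extra clause is false and next_e_four returns false, while next_e_three returns true. That is why a dump of this particular state would distinguish the two.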

On that topic:

Note that this is the result of attempting to write context values with one bit - they read back always with 0x3F80 at the top, so that's what the file has.

doesn't seem to match earlier

These can even be written, although only 30 bits:

What actually happens?

@fp64
Contributor Author

fp64 commented May 26, 2023

Oh wait, now that I tested on vrndi_hamming.dat both of the above actually fail.

@fp64
Contributor Author

fp64 commented May 27, 2023

Oh, interesting. It seems like seed=F0000000 bumps D by 2, rather than 1 on the next iteration after the initial one.
That's contrary to previously conjectured logic of E. Does this mean that E=2 after initial iteration? I lack RIIs dump of seed=F0000000 to check.

So presumably the sequence is:

C        D        E(presumed)
F0000000 F0000000 F0000000 <- after seeding
F0000000 C0000000 00000002 <- after initial iteration
C0000000 70000002 00000001
70000002 A0000005 00000000

Same seems to be true for seed=E0000000, but not seed=C0000000.

I'd like to request RIIs dumps for seeds xx000000 (i.e. 256 seeds in total), if that's not too much trouble.

@fp64
Contributor Author

fp64 commented May 27, 2023

From vrndi_hamming.dat we get, e.g. this:

seeds
Seed     C        D        E (presumed)

00400000 00400000 00400000 00400000
         00400000 01000000 00000000
         01000000 02400000 00000000
         02400000 05800000 00000000
         05800000 0D400000 00000000
         0D400000 20000000 00000000
         20000000 4D400000 00000000
         4D400000 BA800000 00000000
         BA800000 C2400000 00000000
         C2400000 3F000000 00000001
         3F000000 40400001 ........
00800000 00800000 00800000 00800000
         00800000 02000000 00000000
         02000000 04800000 00000000
         04800000 0B000000 00000000
         0B000000 1A800000 00000000
         1A800000 40000000 00000000
         40000000 9A800000 00000000
         9A800000 75000000 00000000
         75000000 84800000 00000000
         84800000 7E000000 00000000
         7E000000 80800000 ........
01000000 01000000 01000000 01000000
         01000000 04000000 00000000
         04000000 09000000 00000000
         09000000 16000000 00000000
         16000000 35000000 00000000
         35000000 80000000 00000000
         80000000 35000000 00000000
         35000000 EA000000 00000000
         EA000000 09000000 00000000
         09000000 FC000000 00000000
         FC000000 01000000 ........
02000000 02000000 02000000 02000000
         02000000 08000000 00000000
         08000000 12000000 00000000
         12000000 2C000000 00000000
         2C000000 6A000000 00000000
         6A000000 00000000 00000000
         00000000 6A000000 00000000
         6A000000 D4000000 00000000
         D4000000 12000000 00000000
         12000000 F8000000 00000000
         F8000000 02000000 ........
04000000 04000000 04000000 04000000
         04000000 10000000 00000000
         10000000 24000000 00000000
         24000000 58000000 00000000
         58000000 D4000000 00000000
         D4000000 00000000 00000000
         00000000 D4000000 00000000
         D4000000 A8000000 00000000
         A8000000 24000000 00000001
         24000000 F0000001 00000000
         F0000001 04000002 ........
08000000 08000000 08000000 08000000
         08000000 20000000 00000000
         20000000 48000000 00000000
         48000000 B0000000 00000000
         B0000000 A8000000 00000000
         A8000000 00000000 00000001
         00000000 A8000001 00000000
         A8000001 50000002 00000000
         50000002 48000005 00000000
         48000005 E000000C 00000000
         E000000C 0800001D ........
10000000 10000000 10000000 10000000
         10000000 40000000 00000000
         40000000 90000000 00000000
         90000000 60000000 00000000
         60000000 50000000 00000000
         50000000 00000000 00000000
         00000000 50000000 00000000
         50000000 A0000000 00000000
         A0000000 90000000 00000000
         90000000 C0000000 00000000
         C0000000 10000000 ........
20000000 20000000 20000000 20000000
         20000000 80000000 00000000
         80000000 20000000 00000000
         20000000 C0000000 00000000
         C0000000 A0000000 00000000
         A0000000 00000000 00000001
         00000000 A0000001 00000000
         A0000001 40000002 00000000
         40000002 20000005 00000000
         20000005 8000000C 00000000
         8000000C 2000001D ........
40000000 40000000 40000000 40000000
         40000000 00000000 00000000
         00000000 40000000 00000000
         40000000 80000000 00000000
         80000000 40000000 00000000
         40000000 00000000 00000000
         00000000 40000000 00000000
         40000000 80000000 00000000
         80000000 40000000 00000000
         40000000 00000000 00000000
         00000000 40000000 ........
80000000 80000000 80000000 80000000
         80000000 00000000 00000001
         00000000 80000001 00000000
         80000001 00000002 00000000
         00000002 80000005 00000000
         80000005 0000000C 00000000
         0000000C 8000001D 00000000
         8000001D 00000046 00000000
         00000046 800000A9 00000000
         800000A9 00000198 00000000
         00000198 800003D9 ........
01800000 01800000 01800000 01800000
         01800000 06000000 00000000
         06000000 0D800000 00000000
         0D800000 21000000 00000000
         21000000 4F800000 00000000
         4F800000 C0000000 00000000
         C0000000 CF800000 00000000
         CF800000 5F000000 00000001
         5F000000 8D800001 00000000
         8D800001 7A000002 00000000
         7A000002 81800005 ........
03000000 03000000 03000000 03000000
         03000000 0C000000 00000000
         0C000000 1B000000 00000000
         1B000000 42000000 00000000
         42000000 9F000000 00000000
         9F000000 80000000 00000000
         80000000 9F000000 00000000
         9F000000 BE000000 00000000
         BE000000 1B000000 00000000
         1B000000 F4000000 00000000
         F4000000 03000000 ........
06000000 06000000 06000000 06000000
         06000000 18000000 00000000
         18000000 36000000 00000000
         36000000 84000000 00000000
         84000000 3E000000 00000000
         3E000000 00000000 00000000
         00000000 3E000000 00000000
         3E000000 7C000000 00000000
         7C000000 36000000 00000000
         36000000 E8000000 00000000
         E8000000 06000000 ........
0C000000 0C000000 0C000000 0C000000
         0C000000 30000000 00000000
         30000000 6C000000 00000000
         6C000000 08000000 00000000
         08000000 7C000000 00000000
         7C000000 00000000 00000000
         00000000 7C000000 00000000
         7C000000 F8000000 00000000
         F8000000 6C000000 00000000
         6C000000 D0000000 00000001
         D0000000 0C000001 ........
18000000 18000000 18000000 18000000
         18000000 60000000 00000000
         60000000 D8000000 00000000
         D8000000 10000000 00000000
         10000000 F8000000 00000000
         F8000000 00000000 00000000
         00000000 F8000000 00000000
         F8000000 F0000000 00000000
         F0000000 D8000000 00000001
         D8000000 A0000001 00000001
         A0000001 18000003 ........
30000000 30000000 30000000 30000000
         30000000 C0000000 00000000
         C0000000 B0000000 00000000
         B0000000 20000000 00000001
         20000000 F0000001 00000000
         F0000001 00000002 00000000
         00000002 F0000005 00000000
         F0000005 E000000C 00000000
         E000000C B000001D 00000001
         B000001D 40000047 00000001
         40000047 300000AC ........
60000000 60000000 60000000 60000000
         60000000 80000000 00000000
         80000000 60000000 00000000
         60000000 40000000 00000000
         40000000 E0000000 00000000
         E0000000 00000000 00000000
         00000000 E0000000 00000000
         E0000000 C0000000 00000000
         C0000000 60000000 00000001
         60000000 80000001 00000000
         80000001 60000002 ........
C0000000 C0000000 C0000000 C0000000
         C0000000 00000000 00000001
         00000000 C0000001 00000000
         C0000001 80000002 00000000
         80000002 C0000005 00000001
         C0000005 0000000D 00000000
         0000000D C000001F 00000000
         C000001F 8000004B 00000000
         8000004B C00000B5 00000001
         C00000B5 000001B6 00000000
         000001B6 C0000421 ........
0E000000 0E000000 0E000000 0E000000
         0E000000 38000000 00000000
         38000000 7E000000 00000000
         7E000000 34000000 00000000
         34000000 E6000000 00000000
         E6000000 00000000 00000000
         00000000 E6000000 00000000
         E6000000 CC000000 00000000
         CC000000 7E000000 00000001
         7E000000 C8000001 00000001
         C8000001 0E000003 ........
1C000000 1C000000 1C000000 1C000000
         1C000000 70000000 00000000
         70000000 FC000000 00000000
         FC000000 68000000 00000000
         68000000 CC000000 00000001
         CC000000 00000001 00000000
         00000001 CC000002 00000000
         CC000002 98000005 00000000
         98000005 FC00000C 00000001
         FC00000C 9000001E 00000001
         9000001E 1C000049 ........
38000000 38000000 38000000 38000000
         38000000 E0000000 00000000
         E0000000 F8000000 00000000
         F8000000 D0000000 00000001
         D0000000 98000001 00000001
         98000001 00000003 00000001
         00000003 98000008 00000000
         98000008 30000013 00000000
         30000013 F800002E 00000000
         F800002E 2000006F 00000000
         2000006F 3800010C ........
70000000 70000000 70000000 70000000
         70000000 C0000000 00000001
         C0000000 F0000001 00000000
         F0000001 A0000002 00000001
         A0000002 30000006 00000001
         30000006 0000000F 00000000
         0000000F 30000024 00000000
         30000024 60000057 00000000
         60000057 F00000D2 00000000
         F00000D2 400001FB 00000000
         400001FB 700004C8 ........
E0000000 E0000000 E0000000 E0000000
         E0000000 80000000 00000002
         80000000 E0000002 00000001
         E0000002 40000005 00000000
         40000005 6000000C 00000001
         6000000C 0000001E 00000000
         0000001E 60000048 00000000
         60000048 C00000AE 00000000
         C00000AE E00001A4 00000000
         E00001A4 800003F6 00000001
         800003F6 E0000991 ........
0000000F 0000000F 0000000F 0000000F
         0000000F 0000003C 00000000
         0000003C 00000087 00000000
         00000087 0000014A 00000000
         0000014A 0000031B 00000000
         0000031B 00000780 00000000
         00000780 0000121B 00000000
         0000121B 00002BB6 00000000
         00002BB6 00006987 00000000
         00006987 0000FEC4 00000000
         0000FEC4 0002670F ........
003C0000 003C0000 003C0000 003C0000
         003C0000 00F00000 00000000
         00F00000 021C0000 00000000
         021C0000 05280000 00000000
         05280000 0C6C0000 00000000
         0C6C0000 1E000000 00000000
         1E000000 486C0000 00000000
         486C0000 AED80000 00000000
         AED80000 A61C0000 00000000
         A61C0000 FB100000 00000001
         FB100000 9C3C0001 ........
00780000 00780000 00780000 00780000
         00780000 01E00000 00000000
         01E00000 04380000 00000000
         04380000 0A500000 00000000
         0A500000 18D80000 00000000
         18D80000 3C000000 00000000
         3C000000 90D80000 00000000
         90D80000 5DB00000 00000000
         5DB00000 4C380000 00000000
         4C380000 F6200000 00000000
         F6200000 38780000 ........
00F00000 00F00000 00F00000 00F00000
         00F00000 03C00000 00000000
         03C00000 08700000 00000000
         08700000 14A00000 00000000
         14A00000 31B00000 00000000
         31B00000 78000000 00000000
         78000000 21B00000 00000000
         21B00000 BB600000 00000000
         BB600000 98700000 00000000
         98700000 EC400000 00000001
         EC400000 70F00001 ........
01E00000 01E00000 01E00000 01E00000
         01E00000 07800000 00000000
         07800000 10E00000 00000000
         10E00000 29400000 00000000
         29400000 63600000 00000000
         63600000 F0000000 00000000
         F0000000 43600000 00000000
         43600000 76C00000 00000001
         76C00000 30E00001 00000000
         30E00001 D8800002 00000000
         D8800002 E1E00005 ........
03C00000 03C00000 03C00000 03C00000
         03C00000 0F000000 00000000
         0F000000 21C00000 00000000
         21C00000 52800000 00000000
         52800000 C6C00000 00000000
         C6C00000 E0000000 00000000
         E0000000 86C00000 00000001
         86C00000 ED800001 00000001
         ED800001 61C00003 00000000
         61C00003 B1000007 00000001
         B1000007 C3C00012 ........
07800000 07800000 07800000 07800000
         07800000 1E000000 00000000
         1E000000 43800000 00000000
         43800000 A5000000 00000000
         A5000000 8D800000 00000000
         8D800000 C0000000 00000000
         C0000000 0D800000 00000000
         0D800000 DB000000 00000000
         DB000000 C3800000 00000000
         C3800000 62000000 00000001
         62000000 87800001 ........
0F000000 0F000000 0F000000 0F000000
         0F000000 3C000000 00000000
         3C000000 87000000 00000000
         87000000 4A000000 00000000
         4A000000 1B000000 00000000
         1B000000 80000000 00000000
         80000000 1B000000 00000000
         1B000000 B6000000 00000000
         B6000000 87000000 00000000
         87000000 C4000000 00000000
         C4000000 0F000000 ........
1E000000 1E000000 1E000000 1E000000
         1E000000 78000000 00000000
         78000000 0E000000 00000000
         0E000000 94000000 00000000
         94000000 36000000 00000000
         36000000 00000000 00000000
         00000000 36000000 00000000
         36000000 6C000000 00000000
         6C000000 0E000000 00000000
         0E000000 88000000 00000000
         88000000 1E000000 ........
3C000000 3C000000 3C000000 3C000000
         3C000000 F0000000 00000000
         F0000000 1C000000 00000000
         1C000000 28000000 00000000
         28000000 6C000000 00000000
         6C000000 00000000 00000000
         00000000 6C000000 00000000
         6C000000 D8000000 00000000
         D8000000 1C000000 00000000
         1C000000 10000000 00000000
         10000000 3C000000 ........
78000000 78000000 78000000 78000000
         78000000 E0000000 00000001
         E0000000 38000001 00000000
         38000001 50000002 00000000
         50000002 D8000005 00000000
         D8000005 0000000C 00000000
         0000000C D800001D 00000000
         D800001D B0000046 00000000
         B0000046 380000A9 00000001
         380000A9 20000199 00000000
         20000199 780003DB ........
F0000000 F0000000 F0000000 F0000000
         F0000000 C0000000 00000002
         C0000000 70000002 00000001
         70000002 A0000005 00000000
         A0000005 B000000C 00000000
         B000000C 0000001D 00000000
         0000001D B0000046 00000000
         B0000046 600000A9 00000000
         600000A9 70000198 00000000
         70000198 400003D9 00000000
         400003D9 F000094A ........

@fp64
Contributor Author

fp64 commented May 27, 2023

Also, C>=0x80000000 does not seem to be necessary for E to be set (which contradicts the addition_overflows(C + E, C) clause from before), though in the subset of data from vrndi_hamming.dat I'm using, such cases do have C>=0x70000000 and D>=0x40000000.

@fp64
Contributor Author

fp64 commented May 28, 2023

I did file an issue at jpcsp/jpcsp#507.
While the most recent commit there is about a year old, if that gets an extra pair of eyes on vrnd, the help would be appreciated.

The PR might have been a little hasty in retrospect; but amending it is expected to be a rather small change.

@fp64
Contributor Author

fp64 commented May 29, 2023

The following:

E=uint32_t((uint64_t(C)+uint64_t(D>>1)+uint64_t(E))>>32);

matches all of the data thrown at it (assuming tests are ok):

vrndi_7F.dat
vrndi_hamming_ctx_s.dat
vrndi_hamming.dat
vrndi_noseed1.dat
vrndi_noseedq.dat
vrndi_noseeds.dat
vrndi_rcx3_s.dat
vrndi_various.dat
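
For illustration, here is a minimal sketch of that update as a standalone function. The name `next_E` is made up for this example; C, D and E are taken to be 32-bit state words, as in the expression above. It just makes explicit that the new E is the carry out of a widened sum, and since C + (D>>1) + E is always below 2.5 * 2^32, the result can only be 0, 1 or 2.

```cpp
#include <cstdint>

// Hypothetical helper name; the body is the expression quoted above.
// The sum is done in 64 bits, and the new E is whatever spills past bit 31.
constexpr uint32_t next_E(uint32_t C, uint32_t D, uint32_t E) {
    return uint32_t((uint64_t(C) + uint64_t(D >> 1) + uint64_t(E)) >> 32);
}

// 0x70000000 + 0x20000000 = 0x90000000: no carry.
static_assert(next_E(0x70000000u, 0x40000000u, 0u) == 0, "no carry expected");

// 0xF0000000 + 0x20000000 = 0x110000000: carry of 1.
static_assert(next_E(0xF0000000u, 0x40000000u, 0u) == 1, "carry of 1 expected");

// Largest possible inputs: 0xFFFFFFFF + 0x7FFFFFFF + 0xFFFFFFFF = 0x27FFFFFFD.
static_assert(next_E(0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu) == 2, "carry of 2 expected");
```

The values above are arbitrary inputs chosen to hit each case, not entries from the .dat files.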

This leaves me with several questions. Namely: "Why?", "Wherefore?", "Inasmuch as which?".

Might postpone pushing, until I see the RIIs dumps for seeds xx000000 (sorry to ask for them again), to make sure this is really correct (do I create a new PR, or can I amend already-closed one?).

@hrydgard
Owner

Heh, weird. Maybe it was cheap in gates somehow.

You just create a new PR to fix it.

@fp64
Contributor Author

fp64 commented May 31, 2023

Just noticed that the expressions

E=uint32_t((uint64_t(C)+uint64_t(D>>1)+uint64_t(E))>>32);

and

E = addition_overflows(C + E, C) &&
    addition_overflows(C + E, D) &&
    addition_overflows(C + E, C + D + E) &&
    addition_overflows(C + E, C + D);

are, apparently, equivalent for E=0 (and have only a tiny chance of differing for E=1). Not hard to see in retrospect:

E = addition_overflows(C, C) && // C>=2^31 <-- actually redundant.
    addition_overflows(C, D) && // C+D>=2^32 => uint32_t(C+D)=C+D-2^32
    addition_overflows(C, C + D) && // C+uint32_t(C+D)>=2^32 => C+C+D>=2^33 => C+D/2>=2^32
    addition_overflows(C, C + D); // Same.

No wonder only large seed tests provided counter-examples (E can be large upon seeding/force-writing, but, presumably, not later; even E=2 only happens right after large E).
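
A rough sketch for checking this equivalence numerically. It assumes `addition_overflows(a, b)` means "a + b does not fit in 32 bits", which is how the comments above read it; the names `E_old`, `E_new` and `count_mismatches_E0` are made up for the example.

```cpp
#include <cstdint>
#include <random>

// Assumed meaning: true when the 32-bit unsigned sum a + b wraps around.
constexpr bool addition_overflows(uint32_t a, uint32_t b) {
    return uint32_t(a + b) < a;
}

// The four-clause form, as quoted above (result is 0 or 1).
constexpr uint32_t E_old(uint32_t C, uint32_t D, uint32_t E) {
    return addition_overflows(C + E, C) &&
           addition_overflows(C + E, D) &&
           addition_overflows(C + E, C + D + E) &&
           addition_overflows(C + E, C + D);
}

// The carry-out form, as quoted above (result is 0, 1 or 2).
constexpr uint32_t E_new(uint32_t C, uint32_t D, uint32_t E) {
    return uint32_t((uint64_t(C) + uint64_t(D >> 1) + uint64_t(E)) >> 32);
}

// With E=0 both forms reduce to "C + D/2 >= 2^32".
static_assert(E_old(0xC0000000u, 0x80000000u, 0u) == 1 &&
              E_new(0xC0000000u, 0x80000000u, 0u) == 1, "both set E");
static_assert(E_old(0x80000000u, 0xFFFFFFFFu, 0u) == 0 &&
              E_new(0x80000000u, 0xFFFFFFFFu, 0u) == 0, "both clear E");

// With a large E they can disagree: the old form cannot produce 2.
static_assert(E_old(0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu) == 1, "old form saturates at 1");
static_assert(E_new(0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu) == 2, "new form yields 2");

// Counts random (C, D) pairs on which the two forms disagree for E=0.
inline unsigned long count_mismatches_E0(unsigned long trials, uint32_t seed = 12345u) {
    std::mt19937 rng(seed);
    unsigned long bad = 0;
    for (unsigned long i = 0; i < trials; ++i) {
        const uint32_t C = uint32_t(rng());
        const uint32_t D = uint32_t(rng());
        if (E_old(C, D, 0u) != E_new(C, D, 0u)) ++bad;
    }
    return bad;
}
```

Calling `count_mismatches_E0` with any number of trials should report no mismatches if the argument above holds. This only spot-checks the E=0 case and says nothing about what the hardware does.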

@fp64
Contributor Author

fp64 commented Jun 2, 2023

Next would be vdiv.
As I mentioned, randomly doing x*rcp(y) doesn't seem to fix Ridge Racer, but some similar possibilities exist (e.g. zero some lower bits before/after).

@fp64
Contributor Author

fp64 commented Sep 13, 2023

Out of curiosity, @unknownbrackets, do you know if producing 4 lanes of output is slower than producing 1 (which would suggest that vrnd is, in fact, serial under the hood)?

Are there any suspect instructions left? You mentioned that vdiv likely matches HW, but if you want me to take a look at it, I'm down (would need data).

@fp64
Contributor Author

fp64 commented Sep 14, 2023

likely matches HW

Oh, wait that was dot, not vdiv, sorry.
