Improving accuracy of vfpu_sin #16946
Cool. So we should have the PSP generate a new table from 0.23 fixpoint; 0 to 1 would be enough (or even 0 to 0.5). Then, theoretically, we can try to compute the first, second and third derivatives of that sequence. If those series become straight lines or plateaus at some derivative, we'll know we're dealing with polynomial interpolation of a much smaller table, which I think might be likely. Though I don't know if the precision is enough for those derivatives to make sense... Storing the full table will hopefully not be necessary; by taking the derivative before compressing, the result should be really compressible... (actually that might be nonsense given how sin and cos are each other's derivatives, sometimes with flipped signs, but we have to be dealing with an approximation here...) |
To be clear
means that all values for [0.25;1] are already known. Only [0;0.25] remains.
For one thing 32MB might be larger than L3 cache.
Tried it:
|
Oh yeah, didn't read carefully enough - so a smaller range to dump. Yeah, that got really compressible - but that it didn't get even more compressible at the second derivative... well, not sure what that means, to be honest :P By the way, 4-way symmetry is good, but 8-way symmetry might be possible with some flipping around of signs and offsetting. Or hm, maybe not... |
Grr, no, it's [...,-8,-4,-0,+2,+6,...]: both sources of signflip ( |
Oh, hm. Maybe I overwrote that file I uploaded before with a smaller subset. It took a lot of time to generate, I thought I saved it... anyway, I had dumped some of it as binary. Let me just regenerate, at least for exponents between 0x60 - 0x7F.... -[Unknown] |
https://forums.ppsspp.org/uploads/sin_60-7F.zip Here's all the exponents from 60-7F, each as a separate file. 60-67 aren't that interesting, just included to represent. The data is simply input, output repeated in 4 byte pairs. -[Unknown] |
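The "input, output repeated in 4 byte pairs" layout described above can be decoded with a small helper like this (a sketch; the manual little-endian decode makes it host-independent):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Decode a 32-bit little-endian value from a byte pointer.
static uint32_t read_u32le(const uint8_t *p)
{
    return uint32_t(p[0]) | uint32_t(p[1]) << 8 |
           uint32_t(p[2]) << 16 | uint32_t(p[3]) << 24;
}

// Parse a dump of repeated (input, output) 32-bit LE pairs,
// 8 bytes per record.
static std::vector<std::pair<uint32_t, uint32_t>>
parse_dump(const std::vector<uint8_t> &raw)
{
    std::vector<std::pair<uint32_t, uint32_t>> out;
    for (size_t i = 0; i + 8 <= raw.size(); i += 8)
        out.push_back({read_u32le(&raw[i]), read_u32le(&raw[i + 4])});
    return out;
}
```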
Thanks!
Well, after the first delta we already get long repeats of the same value, e.g. |
Converted the provided files. So, now we have a complete Haven't yet verified that Again, thanks for actually dumping the data! |
The draft implementation of the reduction: static uint32_t vfpu_sin_table[(1<<23)+1]={
#include "sin.bin.h" // Or #embed, or whatever.
};
static inline uint32_t vfpu_sin_fixed(uint32_t arg)
{
// Table-based for now.
return vfpu_sin_table[arg];
}
// For |x|<2^32 should produce exactly the
// same value (bitwise) as PSP. For finite
// x>=2^32 produces 0, whereas PSP sometimes
// produces junk (which looks like small
// non-zero values). Hopefully games do not rely
// on this. For inf/nan produces qNaN, but
// with a different payload than PSP (PSP produces
// sNaN).
static inline float vfpu_sin(float x)
{
// Handle large/non-finite values.
if(!(fabsf(x)<0x1p25f)) return copysignf(x-x,x); // Beware -ffast-math.
// Reduce to [-2;+2].
int32_t n=int32_t(0.5f*x); // Rounds to 0.
x=x-2.0f*float(n); // Exact, no roundoff. Also, -0 remains -0.
// Flip sign according to half-period.
if(n&1) x=-x;
// Convert to fixed 1.23.
uint32_t arg=uint32_t(int32_t(fabsf(x)*0x1p23f));
// Reduce to [0;1] (i.e. [0;2^23]).
if(arg>=1u<<23) arg=(1u<<24)-arg;
// Convert from fixed 1.28, and apply sign.
return float(int32_t(vfpu_sin_fixed(arg)))*0x1p-28f*copysignf(1.0f,x);
}
// WARNING: not tested, this is just a guess.
static inline float vfpu_cos(float x)
{
x=fabsf(x);
// Handle large/non-finite values.
if(!(fabsf(x)<0x1p25f)) return 1.0f+(x-x);
// Reduce to [-2;+2].
int32_t n=int32_t(0.5f*x); // Rounds to 0.
x=x-2.0f*float(n); // Exact, no roundoff.
// Convert to fixed 1.23.
uint32_t arg=uint32_t(int32_t(x*0x1p23f));
// Reduce to [0;1] (i.e. [0;2^23]).
if(arg>=1u<<23) {arg=(1u<<24)-arg; n^=1;}
// Convert from fixed 1.28, and apply sign.
return float(int32_t(vfpu_sin_fixed((1u<<23)-arg)))*0x1p-28f*(n&1?-1.0f:+1.0f);
} Edit: minor fixes. Not ready for deployment, but may be useful as-is to test if it unbreaks some picky games. |
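The halve/truncate/flip reduction used in the draft above can be sanity-checked against the mathematical identity sin((pi/2)x) = (-1)^n sin((pi/2)(x-2n)) using the standard library. This only exercises the reduction logic, not the table:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Mirror of the reduction in the draft vfpu_sin: bring x into
// (-2;+2), tracking the sign flip from discarded half-periods,
// then evaluate sin(pi/2 * x) on the reduced argument.
static double reduced_sin(double x)
{
    int64_t n = int64_t(0.5 * x);    // Rounds toward 0.
    double r = x - 2.0 * double(n);  // Now in (-2;+2).
    if (n & 1) r = -r;               // Odd half-period flips sign.
    return std::sin(3.141592653589793 / 2.0 * r);
}
```

If the reduction is right, reduced_sin(x) agrees with directly evaluating sin(pi/2 * x) for moderate x (up to ordinary floating-point noise).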
I'm using fixed 0.28 since it can represent all output values, but that might not be what is going on internally. uint32_t quantum(uint32_t x)
{
return x<1u<<22?
1u:
1u<<(32-22-count_leading_zeroes(x));
} One idea I had is that we need an approximation (treated as real-valued here) Maybe something simpler suffices. @unknownbrackets, you mentioned that you don't think this is CORDIC. Can you elaborate? |
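The quantum() above encodes the observed output spacing: values below 2^22 are spaced 1 apart, and the spacing doubles with each extra magnitude bit (a floating-point-style ulp). The leading-zero-count form and a plain shift loop should agree; a quick cross-check sketch (count_leading_zeroes spelled as GCC/Clang's __builtin_clz here):

```cpp
#include <cassert>
#include <cstdint>

// Spacing ("quantum") between adjacent representable outputs
// near x: 1 below 2^22, doubling with each extra magnitude bit.
static uint32_t quantum_clz(uint32_t x)
{
    return x < 1u << 22 ? 1u
                        : 1u << (32 - 22 - __builtin_clz(x));
}

// Reference version: shift x right until it drops below 2^22.
static uint32_t quantum_loop(uint32_t x)
{
    int i = 0;
    while ((x >> i) >= 0x00400000u) ++i;
    return uint32_t(1) << i;
}
```

Note __builtin_clz(0) is undefined, which is why the clz variant branches on x < 2^22 first.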
I don't believe it's CORDIC either because CORDIC is iterative, so it takes a variable-amount of time depending on the angle, and I don't think we've observed that. I would guess it's an interpolated table lookup from a smaller table, but I could be wrong of course. |
Hm, got runtime size down to 2MB: table of quadratic coefficients + table of exceptions. This implements It is verified to produce exactly the same result as |
The code itself (so you don't need to download to take a look): vfpu_sin_fixed.cpp#include "stdint.h"
#include "quadratic_coefs.h"
#include "exceptions.h"
static inline uint32_t quantum(uint32_t x)
{
/*
return x<1u<<22?
1u:
1u<<(32-22-__builtin_clz(x));
*/
int i=0;
while((x>>i)>=0x00400000) ++i;
return uint32_t(1)<<i;
}
static inline uint32_t truncate_bits(uint32_t x)
{
return x&-quantum(x);
}
extern uint32_t vfpu_sin_fixed(uint32_t arg)
{
// Handle endpoints.
if(arg==0u) return 0u;
if(arg==0x00800000) return 0x10000000;
// Get quadratic coefficients.
int32_t i=int32_t(arg>>7),j=int32_t(arg&0x7F);
const int32_t *coef=quadratic_coefs[i];
int64_t A=coef[0],B=coef[1],C=coef[2];
// Compute approximation. Is off by at most 1 quantum.
const int PA=30,PB=25,PC=3;
uint32_t v=uint32_t(((((A*j>>(PA-PB))+B)*j>>(PB-PC))+C)>>PC);
v=truncate_bits(v);
// Look up exceptions. Binary search for now, but this
// can be done better.
unsigned lo=0,hi=sizeof(exceptions)/sizeof(exceptions[0]);
while(lo<hi)
{
uint32_t m=(lo+hi)/2;
uint32_t b=exceptions[m];
uint32_t e=(b>>31?~b:b);
if(e==arg)
{
v+=quantum(v)*(b>>31?-1u:+1u);
break;
}
else if(e<arg) lo=m+1;
else hi=m;
}
return v;
} |
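The exception table's encoding above packs the correction direction into the top bit via bitwise complement: a complemented entry means "correct down by one quantum". A minimal encode/decode sketch of just that convention (helper names are mine, not from the actual patch; it relies on arguments being below 2^31, which holds here since they are below 2^23):

```cpp
#include <cassert>
#include <cstdint>

// Store a correction direction in the top bit: negative
// corrections are stored bitwise-complemented, so the top bit
// doubles as the sign flag and the payload is still recoverable.
static uint32_t encode_exception(uint32_t arg, int dir)
{
    return dir < 0 ? ~arg : arg;
}

static uint32_t decode_arg(uint32_t b)
{
    return b >> 31 ? ~b : b;
}

static int decode_dir(uint32_t b)
{
    return b >> 31 ? -1 : +1;
}
```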
That's cool, nicely done. Though 2MB is still of course far larger than the real table (or whatever it is) is likely to be in the actual hardware. Getting to the point where it could be worth it if it unbreaks stuff in games (that Ridge Racer replay, or whatever it was, maybe?). |
Yeah, e.g. this mentions sizes around 10KB. |
and done. |
Made it uglier, but tables fit in 720 KB. Not sure if it's worth it. Code: vfpu_sin_fixed.cpp#include "stdint.h"
#include "vfpu_sin_lut.h"
static inline uint32_t quantum(uint32_t x)
{
//return x<1u<<22?
// 1u:
// 1u<<(32-22-__builtin_clz(x));
int i=0;
while((x>>i)>=0x00400000) ++i;
return uint32_t(1)<<i;
}
static inline uint32_t truncate_bits(uint32_t x)
{
return x&-quantum(x);
}
static inline int64_t ilerp(int64_t a,int64_t b,int64_t n,int64_t d)
{
return a+((b-a)*n)/d;
}
// Cubic interpolation, with control points P0, P1, and
// derivatives T0, T1.
static inline int64_t icubic(int64_t P0,int64_t T0,int64_t P1,int64_t T1,int64_t n,int64_t d)
{
// This can be converted to Bezier, which can
// be calculated via lerps.
int64_t C0=P0+T0/3;
int64_t C1=P1-T1/3;
int64_t Q0=ilerp(P0,C0,n,d);
int64_t Q1=ilerp(C0,C1,n,d);
int64_t Q2=ilerp(C1,P1,n,d);
int64_t L0=ilerp(Q0,Q1,n,d);
int64_t L1=ilerp(Q1,Q2,n,d);
return ilerp(L0,L1,n,d);
}
// Catmull-Rom interpolation.
static inline int64_t icerp(int64_t P0,int64_t P1,int64_t P2,int64_t P3,int64_t n,int64_t d)
{
return icubic(P1,(P2-P0)/2,P2,(P3-P1)/2,n,d);
}
// Fixed point 4.28 multiply.
static inline uint32_t mul28(uint32_t x,uint32_t y)
{
return (uint32_t)((uint64_t)x*(uint64_t)y>>28);
}
// Crude approximation, may be off by around 5 quantums.
static inline uint32_t vfpu_sin_approx(uint32_t x)
{
if((int32_t)x<0) return -vfpu_sin_approx(-x);
x&=0x00FFFFFFu;
if(x>=0x00800000u) x=0x01000000u-x;
x*=(1u<<(28-23))/2; // Coefficients are for sinpi, so we divide by 2.
uint32_t x2=mul28(x,x);
return mul28(x, 843314880u
-mul28(x2,1387197952u
-mul28(x2, 684549312u
-mul28(x2, 160694304u
-mul28(x2, 21002648u)))));
}
uint32_t vfpu_sin_fixed(uint32_t arg)
{
// Handle endpoints.
if(arg==0u) return 0u;
if(arg==0x00800000) return 0x10000000;
// Get cubic coefficients.
// Note: coef stores deltas from crude approximation.
const signed char *coef=vfpu_sin_lut_coefs[arg>>8];
int64_t P[4];
for(unsigned i=0;i<4;++i)
{
P[i]=int32_t(vfpu_sin_approx((arg&-256u)+256u*(i-1)));
int32_t q=(int32_t)quantum(uint32_t(P[i]<0?-P[i]:P[i]));
P[i]=8192LL*P[i]+512LL*coef[i]*q;
}
// Compute approximation. Is off by at most 1 quantum.
uint32_t v=uint32_t(icerp(P[0],P[1],P[2],P[3],arg&255,256)/8192);
v=truncate_bits(v);
// Look up exceptions via binary search.
// Note: vfpu_sin_lut_intervals stores
// deltas from interval estimation.
int32_t lo=int32_t(459376LL*( arg &-128u)/N)+16384
+vfpu_sin_lut_intervals[ arg >>7];
int32_t hi=int32_t(459376LL*((arg+128u)&-128u)/N)+16384
+vfpu_sin_lut_intervals[(arg+128u)>>7];
while(lo<hi)
{
int32_t m=(lo+hi)/2;
// Note: vfpu_sin_lut_exceptions stores
// index&127 (for each initial interval the
// upper bits of index are the same, namely
// arg&-128), inverted if the correction is
// negative (correction is always either +1
// or -1).
int32_t b=vfpu_sin_lut_exceptions[m];
uint32_t e=(arg&-128u)+uint32_t(b<0?~b:b);
if(e==arg)
{
v+=quantum(v)*(b<0?-1u:+1u);
break;
}
else if(e<arg) lo=m+1;
else hi=m;
}
return v;
} There should be a better way, but well... Not sure if it's good enough to consider making pull request. |
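The ilerp/icubic/icerp chain above is an integer de Casteljau evaluation: the cubic is converted to Bezier control points and evaluated by repeated lerps. At n=0 and n=d the divisions drop out, so the endpoints are reproduced exactly despite the truncating integer division. A standalone restatement with endpoint checks:

```cpp
#include <cassert>
#include <cstdint>

static int64_t ilerp(int64_t a, int64_t b, int64_t n, int64_t d)
{
    return a + ((b - a) * n) / d;  // Exact at n=0 and n=d.
}

// Cubic with endpoints P0,P1 and derivatives T0,T1, evaluated
// at n/d via conversion to Bezier + de Casteljau lerps.
static int64_t icubic(int64_t P0, int64_t T0, int64_t P1, int64_t T1,
                      int64_t n, int64_t d)
{
    int64_t C0 = P0 + T0 / 3;
    int64_t C1 = P1 - T1 / 3;
    int64_t Q0 = ilerp(P0, C0, n, d);
    int64_t Q1 = ilerp(C0, C1, n, d);
    int64_t Q2 = ilerp(C1, P1, n, d);
    int64_t L0 = ilerp(Q0, Q1, n, d);
    int64_t L1 = ilerp(Q1, Q2, n, d);
    return ilerp(L0, L1, n, d);
}

// Catmull-Rom through P1..P2, tangents from neighbors P0,P3.
static int64_t icerp(int64_t P0, int64_t P1, int64_t P2, int64_t P3,
                     int64_t n, int64_t d)
{
    return icubic(P1, (P2 - P0) / 2, P2, (P3 - P1) / 2, n, d);
}
```

Interior points do pick up small truncation error from the integer divisions, which is exactly why the real code pre-scales P[i] by 8192 before interpolating.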
It's also probably somewhat slower than before. Also, amusingly, the table is not monotonic:
|
In case anyone is curious, the junk for Table
|
Wow, this is some wacky stuff. Non-monotonicity within the wrapped range is really surprising! Might possibly be some kind of hint to what's really going on, or it's just another indication that we won't be able to find it, because it's so weird :P |
I think I had theorized that only some of the exponent bits are respected, causing this pattern. It seemed to mostly work. See: if (k > 0x80) {
const uint8_t over = k & 0x1F;
mantissa = (mantissa << over) & 0x00FFFFFF;
k = 0x80;
} -[Unknown] |
When I tried it before, it didn't seem to work. https://godbolt.org/z/YbzhbshPd Thanks! |
A common way to "linearize" floats between 0 and 1 before doing lookups into a table is to simply add 1.0, after that, the mantissa represents the linear position between 1 and 2 and you can just drop the exponent and use the mantissa from there on as your table lookup index. But this is from a software perspective. That the exponent seems to wrap like that makes me think that there's a table lookup involving that too, or some specialized hardware that effectively does the same as adding 1.0 but cheaper and thus limited in the range of exponents it can process... |
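The "add 1.0" trick, as a sketch: for x in [0;1), the float 1.0f+x has a fixed exponent (biased 0x7F), so its stored mantissa bits are exactly floor(x * 2^23) (up to rounding in the addition for inputs needing more than 23 fraction bits), directly usable as a table index:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Linearize x in [0;1) into a 23-bit table index by adding 1.0f
// and taking the mantissa bits of the result.
static uint32_t table_index(float x)
{
    float y = 1.0f + x;          // Now in [1;2): fixed exponent.
    uint32_t bits;
    std::memcpy(&bits, &y, sizeof(bits));
    return bits & 0x007FFFFFu;   // Mantissa = floor(x * 2^23).
}
```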
It just looks like a shift nobody bothered to mask (same as x86 It might even be a cyclic shift (so no need to branch on sign of |
I think a table that's down to 720 KB is interesting. Not that terrible RAM usage if it fixes things. Games I know that are or could be affected by sin/cos accuracy:
I wonder how much slower this variant is than what we do now, though... we'd probably end up not wanting to add it everywhere. -[Unknown] |
sin and cos are generally not the most frequently used mathematical operations, since they're normally used for things like setting up factors that are then multiplied with a lot of other things (a typical example being a rotation matrix), so a not-gigantic slowdown in these might not be so bad... Though, given that they are pretty fast on the hardware, maybe some games over-use them for other stuff. And I agree that 720 KB is getting palatable. |
Ok, I think this matches all available data: static float vfpu_sin(float x)
{
uint32_t bits;
memcpy(&bits,&x,sizeof(x));
uint32_t sign=bits&0x80000000u;
uint32_t exponent=(bits>>23)&0xFFu;
uint32_t significand=(bits&0x007FFFFFu)|0x00800000u;
if(exponent==0xFFu)
{
// NOTE: this bitpattern is a signaling
// NaN on x86, so maybe just return
// a normal qNaN?
float y;
bits=sign^0x7F800001u;
memcpy(&y,&bits,sizeof(y));
return y;
}
if(exponent<0x7Fu)
{
if(exponent<0x7Fu-23u) significand=0u;
else significand>>=(0x7F-exponent);
}
else if(exponent>0x7Fu)
{
// There is weirdness for large exponents.
if(exponent-0x7Fu>=25u&&exponent-0x7Fu<32u) significand=0u;
else if((exponent&0x9Fu)==0x9Fu) significand=0u;
else significand<<=(exponent-0x7Fu);
}
sign^=((significand<<7)&0x80000000u);
significand&=0x00FFFFFFu;
if(significand>0x00800000u) significand=0x01000000u-significand;
uint32_t ret=vfpu_sin_fixed(significand);
return (sign?-1.0f:+1.0f)*float(int32_t(ret))*3.7252903e-09f; // 0x1p-28f
} Notes:
|
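For experimenting with the bit-level wrapper above without the dumped table, a stand-in for vfpu_sin_fixed can simply round the real function into 0.28 fixed point. This is NOT hardware-exact (being exact is the whole point of this thread), but it lets the surrounding plumbing be tested:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Stand-in for the dumped table: round sin(pi/2 * arg/2^23)
// to 0.28 fixed point. Not PSP-exact, for plumbing tests only.
static uint32_t vfpu_sin_fixed_stub(uint32_t arg)
{
    double x = double(arg) * (1.0 / 8388608.0);     // arg / 2^23
    double s = std::sin(3.141592653589793 / 2.0 * x);
    return uint32_t(std::lround(s * 268435456.0));  // * 2^28
}
```

Swapping this in for vfpu_sin_fixed keeps the endpoint values (0 at 0, 2^28 at 2^23) identical to the real table, so both branches of the wrapper are exercised.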
Cool! So the remaining piece is reducing vfpu_sin_fixed further if possible. memcpy is perfectly endian-safe for something like this, as I haven't heard of an architecture that has different endianness for its floating point. In addition, we only support little-endian architectures in practice anyway. |
Actually, I still do not have the data to test reduction for Should I make a PR? Not to be merged as-is, but just to have an artifact to test things with? If so:
Speed, as measured on my machine:
Aside: pedantry compels me to use 'significand' instead of 'mantissa', but that's just me.
Does not seem to fix #2990 (the behavior is still the same as the 1st video). This much was already suspected. Incidentally, when I got the sign of cos wrong (didn't do the signflip for [1;2)), the camera started behaving weirdly, but the path of the car itself seemed the same as before. Maybe sin/cos aren't used, or maybe angles are always in [-1;+1]. |
Yes and no? Making sure there aren't any bad seeds is an attractive (and achievable) design goal for an RNG. Especially if you control the seeding procedure (and And it matters, in practice, which seeds. Like "what happens with seed=0?" is about the first thing one tests about an RNG. And in this case the answer is "underwhelming RNG". If there is a doc somewhere that explicitly spells out "don't use seed 0, like, ever; or 0x40000000, really" - I haven't seen it.
It does look pretty ok for normal seeds. But then again, at 256-bit state (or 160-bit, or 129-bit, since E-part is effectively 1-bit most of the time) it better be. To be fair, there are some well-known RNGs that fare worse while having larger state (original Mersenne Twister is an obvious example).
Besides more testing (on whole dataset, and I might formulate couple more requests - e.g. more seeds with top bit set to help verify E-part), there is some more plumbing to be done (actually interacting with thread context stuff), which I'm not super clear about either. |
And yes, nice experience overall. Especially, since, unlike floating point stuff, we actually do get to uncover what HW does. |
Well, given how the GE works in some ways too, and even some missing checks for interlock in the VFPU (matrix overlap for src/dst), I wouldn't be surprised if the docs just said not to use certain seed values at all. Here's the range of context values, the "_s" one is just a single vrndi_hamming_ctx.dat.zst.zip Looking pretty cool so far. -[Unknown] |
Hm, what thread context stuff? We already handle VFPU context switching between threads that have the VFPU bit enabled, and these values are just part of it, so it should be alright already. |
So it's not fully accurate.
|
All I'll also ask again: is there any difference between dumps for 7F/various vs noseed (generated by different code or something)? Because noseed generates 4 outputs per iteration, but the others generate 8 but show 4 (skip every other line in dump). |
No, the main loop is the same between both as long as the header type is the same (i.e. RIII1.) The only difference is that for noseed and the ctx ones, The code has gotten a little messy but looks like this: for (int i = 0; i < iterations; ++i) {
ScePspVector4 *result = RunVrndOp(type, vsz);
for (int j = 0; j < (vsz == VRNDS_ONE || vsz == VRNDS_ONE_S ? 1 : 32); ++j) {
datwrite4(result[j].i);
}
if (vsz == VRNDS_ONE || vsz == VRNDS_ONE_S) {
ctx = GetVrndContext();
datwrite4(ctx);
datwrite4(ctx + 4);
}
}
if (vsz != VRNDS_ONE && vsz != VRNDS_ONE_S) {
ctx = GetVrndContext();
int ctxHeader[4] = { 0x3D3D3D3D, 0x00585443, 0x3D3D3D3D, typemarker | szmarker };
datwrite4(ctxHeader);
datwrite4(ctx);
datwrite4(ctx + 4);
} RunVrndOp() just runs -[Unknown] |
Just in case: GCC thinks asm is side-effect-free if you do not tell it otherwise (volatile/constraints/clobbers). It can happily rearrange/duplicate/unroll it.
Nope, RIIQ seems to have all output:
So for RII1 we have: |
Weird, that definitely is happening. Maybe I fixed a bug without realizing? I've run into a couple optimizer bugs, having used Do you want me to upload the corrected one? -[Unknown] |
Not particularly pressing (I've tested all previous RII1's for 7F, and it's very unlikely that "inbetween" iterations are off somehow), but it wouldn't hurt. |
Hm, so E=addition_overflows(C+E,C)&&
addition_overflows(C+E,D)&&
addition_overflows(C+E,C+D+E)&&
addition_overflows(C+E,C+D); matches RII1's in both |
Actually, I think I'd like corrected dumps after all.
But the
which only differs in 0th output (3rd actually, since output register is filled backwards), but is on track after that. |
Randomly caught this during compilation (too small to create an issue) while preparing implementation:
Can be fixed by either |
Oh, yeah |
The savestate handling just needs a version check. Maybe we should even run the old rng on old savestates, not sure. |
See investigation starting hrydgard#16946 (comment) for more details. Still needs more testing.
It appears that expressions E = addition_overflows(C + E, C) &&
addition_overflows(C + E, D) &&
addition_overflows(C + E, C + D + E); and E = addition_overflows(C + E, C) &&
addition_overflows(C + E, D) &&
addition_overflows(C + E, C + D + E) &&
addition_overflows(C + E, C + D); both match current dataset. They are not equivalent, e.g. the results are different for
This corresponds to the above parameters. This particular state should not be possible to reach via either On that topic:
doesn't seem to match earlier
What actually happens? |
Oh wait, now that I tested on |
Oh, interesting. It seems like seed=F0000000 bumps D by 2, rather than 1 on the next iteration after the initial one. So presumably the sequence is:
Same seems to be true for I'd like to request |
From seeds
|
And |
I did file an issue at jpcsp/jpcsp#507. The PR might have been a little hasty in retrospect; but amending it is expected to be a rather small change. |
The following: E=uint32_t((uint64_t(C)+uint64_t(D>>1)+uint64_t(E))>>32); matches all of the data thrown at it (assuming tests are ok):
This leaves me with several questions. Namely: "Why?", "Wherefore?", "Inasmuch as which?". Might postpone pushing, until I see the |
Heh, weird. Maybe it was cheap in gates somehow. You can just create a new PR to fix it.
Just noticed, that expressions E=uint32_t((uint64_t(C)+uint64_t(D>>1)+uint64_t(E))>>32); and E = addition_overflows(C + E, C) &&
addition_overflows(C + E, D) &&
addition_overflows(C + E, C + D + E) &&
addition_overflows(C + E, C + D); are, apparently, equivalent for E=0 (and have only tiny chance to differ for E=1). Not hard to see in retrospect: E = addition_overflows(C, C) && // C>=2^31 <-- actually redundant.
addition_overflows(C, D) && // C+D>=2^32 => uint32_t(C+D)=C+D-2^32
addition_overflows(C, C + D) && // C+uint32_t(C+D)>=2^32 => C+C+D>=2^33 => C+D/2>=2^32
addition_overflows(C, C + D); // Same. No wonder only large seed tests provided counter-examples (E can be large upon seeding/force-writing, but, presumably, not later; even E=2 only happens right after large E).
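The equivalence argued above can be checked mechanically for E=0. Here addition_overflows is written out as a plain 32-bit carry test (the project's actual helper is assumed to behave the same), and the three distinct terms of the chain are compared against the closed form, the carry out of C + (D>>1):

```cpp
#include <cassert>
#include <cstdint>

// Carry-out of the 32-bit addition a+b.
static bool addition_overflows(uint32_t a, uint32_t b)
{
    return a > 0xFFFFFFFFu - b;
}

// The overflow chain with E=0 (4th term duplicates the 3rd).
static uint32_t e_chain(uint32_t C, uint32_t D)
{
    return addition_overflows(C, C) &&
           addition_overflows(C, D) &&
           addition_overflows(C, C + D) ? 1u : 0u;  // C+D wraps, as in the original.
}

// Closed form: carry out of C + (D>>1).
static uint32_t e_closed(uint32_t C, uint32_t D)
{
    return uint32_t((uint64_t(C) + uint64_t(D >> 1)) >> 32);
}

// Sampled sweep over (C,D) space; returns false on any mismatch.
static bool equivalence_holds_sampled()
{
    for (uint64_t C = 0; C <= 0xFFFFFFFFull; C += 0x00FF0011ull)
        for (uint64_t D = 0; D <= 0xFFFFFFFFull; D += 0x00EE0077ull)
            if (e_chain(uint32_t(C), uint32_t(D)) !=
                e_closed(uint32_t(C), uint32_t(D)))
                return false;
    return true;
}
```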
Next would be |
Out of curiosity, @unknownbrackets, do you know if producing 4 lanes of output is slower than producing 1 (which would suggest that Are there any suspect instructions left? You mentioned that |
Oh, wait that was |
What should happen

This is a work-in-progress/research topic, I just thought I'd put it there for people to see.

This concerns the implementation of y=vfpu_sin(x), which computes an approximation of $\sin(\frac{\pi}{2} x)$.

Technical: I'm sometimes using hex-float notation for clarity.

Looking at the table, the following things appear to be true:

- The output is -0 when x is in [...,-6, -2, -0, +2, +6,...], x always being an even integer in this case. How does one even get this wrong? Then again, on x87 the fsin instruction returns x unchanged for huge arguments.
- For inf/nan the output is a specific nan (0x7F800001 or 0xFF800001, with the same sign bit as x). Note that this appears to be a signaling nan, though apparently some processors treat it backwards.
- y always has its two lowest bits as 0 (the next one may be 0 or 1).
- The output appears to depend only on trunc(x*0x1p+23f), i.e. the input is effectively a ??:23 fixed-point value.
- y*0x1p+28f is an integer.

The above two points mean that the problem is reduced to an essentially integer mapping a -> b, where a=int(0x1p+23f*x) and b=int(0x1p+28f*y). This gives us a more manageable table of $2^{23}+1$ entries. Attached below:

table.bin.zip

The contents (in binary) are the raw values of b as 32-bit LE integers (i.e. the 4 bytes at offset 4*a are the value of b that corresponds to a). Where the value is unknown (absent in the original table), 0xFFFFFFFF is used (not possible to mistake for a real output, since it is larger than $2^{28}$).

The table is dense for x>=0.25, but has only 60 values below that.

Note that for $|x| < 10^{-3}$ we have $|\sin(\frac{\pi}{2} x)-\frac{\pi}{2} x| < 2^{-28}$.

Assuming we can satisfactorily patch the blank spots in the table, this may turn into a compression problem, i.e. even without knowing what the HW actually does, we just need a way to recreate the exact table. It is, in principle, possible to bite the bullet and just (optionally) ship the table with PPSSPP (not that I actually advocate that) - it is 4MB zipped and 32MB in memory, which, while not nice, might be tolerable.

Please do take this with a grain of salt, it is possible I missed (or messed up) something.

Who would this benefit

If this does indeed improve emulation accuracy, it may make more games correctly playable, to the benefit of everyone concerned.

Platform (if relevant)

None

Games this would be useful in

Games that rely on specific values produced by the VFPU, not sure which.

Other emulators or software with a similar feature

No response