The vmulComplex functions in XDSP.h could be further optimized like this:
// (r1, i1) * (r2, i2) = (r1r2 - i1i2, r1i2 + r2i1)
XMVECTOR vr1r2 = XMVectorMultiply(r1, r2);
XMVECTOR vr1i2 = XMVectorMultiply(r1, i2);
rResult = XMVectorNegativeMultiplySubtract(i1, i2, vr1r2); // real: (r1r2 - i1i2)
iResult = XMVectorMultiplyAdd(r2, i1, vr1i2); // imaginary: (r1i2 + r2i1)
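On SSE2 it makes no difference, since XMVectorMultiplyAdd and XMVectorNegativeMultiplySubtract still expand to a separate multiply followed by an add or subtract there. When compiling for ARM it does help, because the NEON code path can map them to multiply-accumulate instructions.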
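
To sanity-check the arithmetic without pulling in DirectXMath, here is a minimal portable sketch of the same four operations. Vec4, multiply, multiplyAdd, negativeMultiplySubtract and mulComplex are hypothetical scalar stand-ins written for this post, not part of XDSP.h. They only assume the documented semantics of the DirectXMath helpers: XMVectorMultiplyAdd(V1, V2, V3) returns V1 * V2 + V3 and XMVectorNegativeMultiplySubtract(V1, V2, V3) returns V3 - V1 * V2.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Four float lanes, standing in for XMVECTOR in this portable sketch (assumption).
using Vec4 = std::array<float, 4>;

// Hypothetical stand-in for XMVectorMultiply: a * b per lane.
inline Vec4 multiply(const Vec4& a, const Vec4& b) {
    Vec4 out{};
    for (std::size_t k = 0; k < 4; ++k) out[k] = a[k] * b[k];
    return out;
}

// Hypothetical stand-in for XMVectorMultiplyAdd: a * b + c per lane.
inline Vec4 multiplyAdd(const Vec4& a, const Vec4& b, const Vec4& c) {
    Vec4 out{};
    for (std::size_t k = 0; k < 4; ++k) out[k] = a[k] * b[k] + c[k];
    return out;
}

// Hypothetical stand-in for XMVectorNegativeMultiplySubtract: c - a * b per lane.
inline Vec4 negativeMultiplySubtract(const Vec4& a, const Vec4& b, const Vec4& c) {
    Vec4 out{};
    for (std::size_t k = 0; k < 4; ++k) out[k] = c[k] - a[k] * b[k];
    return out;
}

// (r1, i1) * (r2, i2) = (r1r2 - i1i2, r1i2 + r2i1), using two plain
// multiplies plus two multiply-accumulate style operations.
inline void mulComplex(Vec4& rResult, Vec4& iResult,
                       const Vec4& r1, const Vec4& i1,
                       const Vec4& r2, const Vec4& i2) {
    const Vec4 vr1r2 = multiply(r1, r2);
    const Vec4 vr1i2 = multiply(r1, i2);
    rResult = negativeMultiplySubtract(i1, i2, vr1r2); // real: r1r2 - i1i2
    iResult = multiplyAdd(r2, i1, vr1i2);              // imaginary: r1i2 + r2i1
}
```

Each lane holds one independent complex product, so feeding lane 0 with (1 + 2i) and (3 + 4i) should give (-5 + 10i). This only checks the math; any speed difference depends on what the real intrinsics compile to on the target.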