
softgpu: Use SIMD more for dot products #17571

Merged
merged 1 commit into hrydgard:master from softgpu-dot on Jun 13, 2023

Conversation

unknownbrackets
Collaborator

Had this in a stash from before the v1.15.x releases. It's a small gain, but it helps when there's a lot of vertex processing.

-[Unknown]

@unknownbrackets unknownbrackets added this to the v1.16.0 milestone Jun 13, 2023
@hrydgard hrydgard merged commit 10ae6f0 into hrydgard:master Jun 13, 2023
19 checks passed
@hrydgard
Owner

Small gains add up!

@fp64
Contributor

fp64 commented Jun 13, 2023

Is there any particular need for the explicit zeroing in Dot33SSE4, compared to rearranging it as:

__m128 Dot33SSE4(__m128 a, __m128 b) {
    __m128 multiplied = _mm_mul_ps(a, b);                            // x*x, y*y, z*z, w*w
    __m128 lanes3311 = _mm_movehdup_ps(multiplied);                  // y*y, y*y, w*w, w*w
    __m128 partial = _mm_add_ps(multiplied, lanes3311);              // lane 0: x*x + y*y
    return _mm_add_ss(partial, _mm_movehl_ps(partial, multiplied));  // lane 0: x*x + y*y + z*z
}

There might well be, since x86 SIMD performance is quirky (and the rearrangement does seem to introduce an extra move), but I don't see one offhand. It seems like a slight improvement when measured on godbolt.

The pure SSE2 version

__m128 Dot33SSE2(__m128 a, __m128 b)
{
    __m128 v = _mm_mul_ps(a, b);                                  // x*x, y*y, z*z, w*w
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 0, 1));  // y*y, x*x, z*z, w*w
    __m128 sums = _mm_add_ps(v, shuf);                            // lane 0: x*x + y*y
    shuf = _mm_movehl_ps(shuf, shuf);                             // lane 0: z*z
    return _mm_add_ss(sums, shuf);                                // lane 0: x*x + y*y + z*z
}

might be a bit slower.

Note: code is inspired by https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction/35270026#35270026 .

@hrydgard
Owner

hrydgard commented Jun 13, 2023

Given that we probably don't match the hardware dot products to the bit level here, we might even be able to use dpps (_mm_dp_ps) when SSE4_1 is available.

@fp64
Contributor

fp64 commented Jun 14, 2023

Hm, _mm_dp_ps(a, b, 0x71) seems neither faster nor slower, but shorter and more straightforward.
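
For reference, a minimal sketch of that variant (the helper name Dot33DPPS is purely illustrative; _mm_dp_ps comes from <smmintrin.h> and requires SSE4.1). The 0x71 mask multiplies and sums lanes 0-2 and writes the result to lane 0 only:

__m128 Dot33DPPS(__m128 a, __m128 b) {
    // High nibble 0x7: use the products of lanes 0, 1, 2 (skip lane 3).
    // Low nibble 0x1: store the sum in lane 0, zero the other lanes.
    return _mm_dp_ps(a, b, 0x71);
}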

@unknownbrackets unknownbrackets deleted the softgpu-dot branch June 14, 2023 01:45
@unknownbrackets
Collaborator Author

Mostly I avoid dpps because I've found it to be slow in the past. I just checked on godbolt and I'm seeing it much slower (just casually replacing Dot33SSE4 with it, and trying clang 16 as well). Maybe it depends on what runner it hits, because one time I did see it at about the same speed. This latest runner seems to prefer the SSE2 code:

dpps      2.555  2.587  2.506  2.563  2.465
SSE4, v2  1.594  1.538  1.577  1.536  1.610
SSE2      1.325  1.336  1.352  1.372  1.335

I was mostly trying to avoid the awful codegen in Dot(). Feel free to PR one of the other versions.

The template stuff was more about convincing MSVC to inline and use SSE regs more consistently. AFAIK, there aren't any processors without SSE4.1 that will be able to run softgpu at decent performance for most games, so I mostly just want to keep it running there and am not worrying about perf on SSE2. But if we can avoid the annoying hoops, all the better.

-[Unknown]

@fp64
Contributor

fp64 commented Jun 15, 2023

Yes, I've also seen dpps mentioned as slow on some archs.

I also don't understand the reason for && !PPSSPP_ARCH(X86) (e.g. Dot33SSE4 has it, but LightCeil doesn't).

AFAIK, there aren't any processors without SSE4.1 that will be able to run softgpu at decent performance for most games

Well, Valkyria Chronicles II runs at about 8 FPS on my SSE4-less machine in SW mode, so yeah, sounds about right (somewhat usable for debug purposes, but not for normal play). Dithering has its charm though.

@hrydgard
Owner

Oh right, I forgot that's why we haven't used dpps so far.

@fp64
Contributor

fp64 commented Jun 15, 2023

My understanding is that dpps basically isn't ever faster than doing it by hand, so the only wins are code size, convenience, and maybe register pressure.

fp64 added a commit to fp64/ppsspp that referenced this pull request Jun 15, 2023
Simpler, lower requirements, and doesn't seem to hurt speed. See hrydgard#17571.
@fp64 fp64 mentioned this pull request Jun 15, 2023
@unknownbrackets
Collaborator Author

&& !PPSSPP_ARCH(X86) was probably a copy/paste mistake, sorry. We have that in a bunch of places in the software renderer code now to avoid using SIMD in function arguments etc., because they end up unaligned (since the stack isn't guaranteed to be 16-byte aligned on x86_32) and cause crashes, IIRC. But also, there aren't any x86_32-only processors that motivate a performance focus either.
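
For illustration, the kind of gating being described might look roughly like this (a sketch only, not the actual PPSSPP source; _M_SSE is assumed here as the SSE-availability macro):

// Sketch: compile the SIMD overload out on 32-bit x86 builds, where the
// stack may not be 16-byte aligned and passing __m128 by value can fault.
#if defined(_M_SSE) && !PPSSPP_ARCH(X86)
static inline float Dot33(__m128 a, __m128 b) {
    return _mm_cvtss_f32(Dot33SSE4(a, b));
}
#else
static inline float Dot33(const float a[3], const float b[3]) {
    // Scalar fallback, also used where SIMD arguments would be risky.
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
#endif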

-[Unknown]

@fp64 fp64 mentioned this pull request Aug 25, 2023