
softgpu: Use SIMD more for dot products #17571

Merged
merged 1 commit into hrydgard:master from softgpu-dot on Jun 13, 2023

Conversation

unknownbrackets
Collaborator

Had this in a stash from before the v1.15.x releases. It's a small gain, but it helps when there's a lot of vertex processing.

-[Unknown]

@unknownbrackets unknownbrackets added this to the v1.16.0 milestone Jun 13, 2023
@hrydgard hrydgard merged commit 10ae6f0 into hrydgard:master Jun 13, 2023
19 checks passed
@hrydgard
Owner

Small gains add up!

@fp64
Contributor

fp64 commented Jun 13, 2023

Is there any particular need for the explicit zeroing in Dot33SSE4, compared to rearranging it as:

__m128 Dot33SSE4(__m128 a, __m128 b) {
    __m128 multiplied = _mm_mul_ps(a, b);                            // x*x, y*y, z*z, w*w
    __m128 lanes3311 = _mm_movehdup_ps(multiplied);                  // y*y, y*y, w*w, w*w
    __m128 partial = _mm_add_ps(multiplied, lanes3311);              // lane 0: x*x + y*y
    return _mm_add_ss(partial, _mm_movehl_ps(partial, multiplied));  // lane 0: x*x + y*y + z*z
}

There might well be, since x86 SIMD performance is quirky (and the rearrangement does seem to introduce an extra move), but I don't see one offhand. It seems like a slight improvement when measured on godbolt.

The pure SSE2 version

__m128 Dot33SSE2(__m128 a, __m128 b)
{
    __m128 v = _mm_mul_ps(a, b);                                  // x*x, y*y, z*z, w*w
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 0, 1));  // y*y, x*x, z*z, w*w
    __m128 sums = _mm_add_ps(v, shuf);                            // lane 0: x*x + y*y
    shuf = _mm_movehl_ps(shuf, shuf);                             // lane 0: z*z
    return _mm_add_ss(sums, shuf);                                // lane 0: x*x + y*y + z*z
}

might be a bit slower.

Note: code is inspired by https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction/35270026#35270026 .

@hrydgard
Owner

hrydgard commented Jun 13, 2023

Given that we probably don't match the hardware dot products to the bit level here, we might even be able to use dpps (_mm_dp_ps) when SSE4_1 is available.

@fp64
Contributor

fp64 commented Jun 14, 2023

Hm, _mm_dp_ps(a, b, 0x71) seems neither faster nor slower, but shorter and more straightforward.
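
For reference, a minimal sketch of that variant (the helper name Dot33DPPS is purely illustrative; _mm_dp_ps comes from <smmintrin.h> and requires SSE4.1). The 0x71 mask multiplies and sums lanes 0-2 and writes the result to lane 0 only:

__m128 Dot33DPPS(__m128 a, __m128 b) {
    // High nibble 0x7: use the products of lanes 0, 1, 2 (skip lane 3).
    // Low nibble 0x1: store the sum in lane 0, zero the other lanes.
    return _mm_dp_ps(a, b, 0x71);
}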

@unknownbrackets unknownbrackets deleted the softgpu-dot branch June 14, 2023 01:45
@unknownbrackets
Collaborator Author

Mostly I avoid dpps because I've found it to be slow in the past. I just checked on godbolt and I'm seeing it much slower (just casually replacing Dot33SSE4 with it, and trying clang 16 as well). Maybe it depends on what runner it hits, because one time I did see it at about the same speed. This latest runner seems to prefer the SSE2 code:

dpps      2.555  2.587  2.506  2.563  2.465
SSE4, v2  1.594  1.538  1.577  1.536  1.610
SSE2      1.325  1.336  1.352  1.372  1.335

I was mostly trying to avoid the awful codegen in Dot(). Feel free to PR one of the other versions.

The template stuff was more about convincing MSVC to inline and use SSE regs more consistently. AFAIK, there aren't any processors without SSE4.1 that will be able to run softgpu at decent performance for most games, so I mostly just want to keep it running there and am not worrying about perf on SSE2. But if we can avoid the annoying hoops, all the better.

-[Unknown]

@fp64
Contributor

fp64 commented Jun 15, 2023

Yes, I've also seen dpps mentioned as slow on some archs.

I also don't understand the reason for && !PPSSPP_ARCH(X86) (e.g. Dot33SSE4 has it, but LightCeil doesn't).

AFAIK, there aren't any processors without SSE4.1 that will be able to run softgpu at decent performance for most games

Well, Valkyria Chronicles II runs at about 8 FPS on my SSE4-less machine in SW mode, so yeah, sounds about right (somewhat usable for debug purposes, but not for normal play). Dithering has its charm though.

@hrydgard
Owner

Oh right, I forgot that's why we haven't used dpps so far.

@fp64
Contributor

fp64 commented Jun 15, 2023

My understanding is that dpps basically isn't ever faster than doing it by hand, so the only wins are code size, convenience, and maybe register pressure.

fp64 added a commit to fp64/ppsspp that referenced this pull request Jun 15, 2023
Simpler, lower requirements, and doesn't seem to hurt speed. See hrydgard#17571.
@fp64 fp64 mentioned this pull request Jun 15, 2023
@unknownbrackets
Collaborator Author

&& !PPSSPP_ARCH(X86) was probably a copy/paste mistake, sorry. We have that in a bunch of places in the software renderer code now to avoid using SIMD in function arguments etc., because they end up unaligned (since the stack isn't guaranteed to be 16-byte aligned on x86_32) and cause crashes, IIRC. But also, there aren't any x86_32-only processors that motivate a performance focus either.
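
For illustration, the kind of gating being described might look roughly like this (a sketch only, not the actual PPSSPP source; _M_SSE is assumed here as the SSE-availability macro):

// Sketch: compile the SIMD overload out on 32-bit x86 builds, where the
// stack may not be 16-byte aligned and passing __m128 by value can fault.
#if defined(_M_SSE) && !PPSSPP_ARCH(X86)
static inline float Dot33(__m128 a, __m128 b) {
    return _mm_cvtss_f32(Dot33SSE4(a, b));
}
#else
static inline float Dot33(const float a[3], const float b[3]) {
    // Scalar fallback, also used where SIMD arguments would be risky.
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
#endif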

-[Unknown]

@fp64 fp64 mentioned this pull request Aug 25, 2023