Implement a jit for drawing pixels in the software renderer #15163
Code looks great already! Yeah, it would indeed be a lot of data to pass in and out for 2x2, but you'd also get up to four times the work done in one invocation. But yeah, hard to say if the benefit would actually be that large.
Not really tested, just filling out parts.
Locking also in helpers, so need to nest locks.
But let's still special case the 512 path, since it's so common.
At this point, it's used in some areas in some games. Alpha blending is the main unimplemented path, then logic/masking.
Also use it for samplerjit.
Will use this for masking too.
Messed with SSE4 then realized there's no point, just use SHR.
Alpha not yet implemented.
It's easier to use it in these places, but seems it stalls longer on the dest reg.
Reuse it expanded where we can, in case of dither+fog+blend, etc.
This is slightly abusing PixelFuncID, but the intent is to provide some memory that's easily accessible from the jit func, but still associated with that calculation (i.e. not global.)
Okay, now all paths are implemented. Some stats:
Was honestly hoping for more, but this really improves things and makes drawPixel no longer a bottleneck in simpler cases. For areas where it runs slower, even excluding threading overhead, ApplyTexturing() is typically 20-40% (including samplerjit which can be 10-30%.) -[Unknown]
I did some initial experiments with four pixels at a time, and I think I found a way around the problem I saw before that hurt performance. Seems like it'll be a win, but I plan to build upon this and don't want to add more commits to this pull... Also looks like the sampler path and UV/ST handling could be using more SIMD and go through jit, so I might worry about that first. -[Unknown]
Sorry for slowness reviewing! Looks good, let's merge.
So I implemented a SIMD version: in the best cases, it gives an improvement of e.g. 254->278. In most cases, it's pretty flat. And in some (heavy DrawSprite users, which still draws 1 pixel at a time), it's slower (e.g. 600->500.) Not sure if I did something obviously silly, though I think the main bottleneck is indeed the sampler. -[Unknown]
Hm, that's interesting (and cool work!). Though when I thought SIMD would help a lot, I thought that a lot more of the execution time was taken by the pixel pipeline. As for the sampler, maybe texture filtering four pixels at a time using SIMD might be better than one, but then again, maybe not.
This is not yet battle tested or anything, but implements a jit on x64 only for the software renderer. I created a very simple reg cache for it, which I tried to design with arm64 in mind too.
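For readers unfamiliar with the idea, here is a minimal sketch of what a very simple register cache can look like. This is not the code from this pull: the class name `SimpleRegCache`, the `Purpose` values, and the register count are all hypothetical, chosen only to show the find/alloc/release pattern that a jit typically needs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical purposes a register may serve; names are illustrative only.
enum class Purpose : uint8_t { None, ArgColor, ArgZ, ArgFog, Temp0, Temp1 };

class SimpleRegCache {
public:
    static constexpr int kNumRegs = 4;  // assumed size; the real count depends on the arch
    static constexpr int kNoReg = -1;

    // Returns the index of the register already holding this purpose, or kNoReg.
    int Find(Purpose p) const {
        for (int i = 0; i < kNumRegs; ++i) {
            if (regs_[i] == p)
                return i;
        }
        return kNoReg;
    }

    // Claims the first free register for the purpose; kNoReg when none is free.
    int Alloc(Purpose p) {
        assert(p != Purpose::None && Find(p) == kNoReg);
        for (int i = 0; i < kNumRegs; ++i) {
            if (regs_[i] == Purpose::None) {
                regs_[i] = p;
                return i;
            }
        }
        return kNoReg;
    }

    // Marks the register as free again.
    void Release(int reg) {
        assert(reg >= 0 && reg < kNumRegs);
        regs_[reg] = Purpose::None;
    }

private:
    std::array<Purpose, kNumRegs> regs_{};  // value-initialized: all Purpose::None
};
```

Keeping the cache in terms of abstract register indices and purposes, rather than x64 register names, is one way such a design could stay portable to arm64; the emitter would map indices to real registers per architecture.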
I didn't test Linux at all, though I did attempt to support it.
Currently, seeing 50-80% improvement in some games, most notably Cave Story.
Alpha blending is not yet implemented (I want to check its accuracy more), which is the main thing preventing it from running in most games.
I had previously messed with drawing 4 pixels simultaneously with a mask. I didn't go nearly as far with it, but there were some complexities, and based on changing the C++ first, it initially seemed like it wouldn't be faster. I'm still not sure. Doing that would mean z and fog would be vec4s, and color would be packed 16x8 (4x4x8.) The mask would be passed in additionally. That could still be interesting, but I wanted to try this for now.
There are probably still some areas to improve; I've only looked a little at the produced assembly to weed out silly patterns.
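To make the 4-pixel layout described above concrete, here is a rough sketch in plain C++. It is only an illustration under assumptions: the struct name `PixelQuad`, the function `WriteQuad`, the 32-bit lane width for z and fog, and the RGBA8888 destination are all hypothetical and not taken from this pull.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical data for a batch of 4 pixels; not the renderer's actual types.
struct PixelQuad {
    uint32_t z[4];      // one depth value per pixel (the "vec4" of z); lane width assumed
    uint32_t fog[4];    // one fog factor per pixel; lane width assumed
    uint8_t color[16];  // 4 pixels x 4 channels x 8 bits, packed as 16x8
    uint8_t mask;       // bit i set means pixel i should be written
};

// Writes only the pixels enabled in the mask to an assumed RGBA8888 target.
// Returns how many pixels were written.
inline int WriteQuad(const PixelQuad &q, uint32_t *dst) {
    assert((q.mask & ~0xF) == 0);  // only the low 4 bits are meaningful
    int written = 0;
    for (int i = 0; i < 4; ++i) {
        if (!(q.mask & (1 << i)))
            continue;  // masked off: leave the destination untouched
        const uint8_t *c = &q.color[i * 4];
        dst[i] = uint32_t(c[0]) | (uint32_t(c[1]) << 8) |
                 (uint32_t(c[2]) << 16) | (uint32_t(c[3]) << 24);
        ++written;
    }
    return written;
}
```

The scalar loop stands in for what would be SIMD operations in a jit; the point is only how much state (z, fog, color, and mask) has to be passed in per invocation.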
Also: not currently well tested. Pretty much just wrote out all the code until "softjit: Initial color write" without any good way to test, but it worked with only a few small fixes, actually.
-[Unknown]