sound/discrete.cpp: Use appropriate memory barriers for task synchronisation #12034
Conversation
Self-reviewing to allow in-place explanatory comments.
 	std::unique_ptr<double []> node_buf;
-	const double *source;
-	volatile double *ptr;
+	const double * source;
+	std::atomic<double *> ptr;
 	int node_num;
Using `volatile` here doesn’t do what it’s supposed to, as it doesn’t enforce ordering of writes to the buffer relative to writes to the pointer itself.

Making `ptr` atomic guarantees consistent loads on systems where pointer-sized writes are not inherently atomic (typically systems where the memory bus is narrower than a pointer, e.g. a 32-bit CPU with 16-bit memory, or a 64-bit CPU with 32-bit memory). It also makes it easier to ensure the compiler doesn’t turn what appears to be a single access into multiple loads.
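To illustrate the point, here is a minimal producer-side sketch (not the actual MAME code; `example_buffer`, `write_ptr` and `produce` are made-up names) showing why publishing a buffer position through a plain or `volatile` pointer is unsafe, and how an atomic pointer with release semantics fixes it:

```cpp
#include <atomic>

// Hypothetical single-producer buffer; names are illustrative only.
struct example_buffer
{
	double buf[1024];
	std::atomic<double *> write_ptr{ buf };
};

// Producer: write the sample first, then publish the new position.
void produce(example_buffer &b, double sample)
{
	double *p = b.write_ptr.load(std::memory_order_relaxed);
	*p = sample;                                          // write the data first
	b.write_ptr.store(p + 1, std::memory_order_release);  // then publish with release
	// With a volatile pointer the compiler/CPU may reorder the data write after
	// the pointer update, so a consumer could observe the new position before
	// the sample it refers to has been written.
}
```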
-	volatile const double *ptr;     /* pointer into linked_outbuf.nodebuf */
-	output_buffer * linked_outbuf;  /* what output are we connected to ? */
-	double buffer;                  /* input[] will point here */
+	const double * ptr;             // pointer into linked_outbuf.nodebuf
+	output_buffer * linked_outbuf;  // what output are we connected to?
+	double buffer;                  // input[] will point here
There’s no point having `volatile` on `ptr` here, either. You just need to ensure that there’s an acquire barrier between determining the number of samples available and reading the samples.
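For reference, a minimal consumer-side counterpart to the sketch above (again with assumed names, not the device code), showing the acquire ordering this comment describes:

```cpp
#include <atomic>

// Hypothetical consumer: write_ptr is published by the producer with release
// semantics, read_ptr is only ever touched by the consumer.
double consume_sum(const std::atomic<double *> &write_ptr, const double *&read_ptr)
{
	// Acquire load: everything written before the matching release store
	// (i.e. the samples themselves) is guaranteed to be visible.
	const double *end = write_ptr.load(std::memory_order_acquire);
	double total = 0.0;
	while (read_ptr != end)
		total += *read_ptr++;   // safe to read the samples after the acquire
	return total;
}
```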
	friend class discrete_device;
public:
	virtual ~discrete_task() { }
Using `public`/`protected`/`private` randomly and relying on friends is a bad design. (There isn’t much point using `protected` at all here when there are no derived classes.)

There’s no point giving this a vtable as it has no other virtual member functions, and isn’t used as a base class for anything.
	discrete_task(discrete_device &pdev) : m_device(pdev), m_threadid(-1)
	{
		// FIXME: the code expects to be able to take pointers to members of elements of this vector before it's filled
		source_list.reserve(16);
	}

	bool lock_threadid(int32_t threadid)
	{
		int expected = -1;
		return m_threadid.compare_exchange_strong(expected, threadid, std::memory_order_acquire, std::memory_order_relaxed);
	}

protected:
	static void *task_callback(void *param, int threadid);
	inline bool process();

	void unlock()
	{
		m_threadid.store(-1, std::memory_order_release);
	}
Locking requires acquire semantics to ensure no task work can be reordered before it happens. Unlocking requires release semantics to ensure no task work can be reordered after it happens.
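As an aside, the same pattern in isolation looks roughly like this (a sketch with assumed names such as `example_task_lock` and `m_owner`, not the PR code):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical try-lock keyed on a thread id, -1 meaning "unowned".
class example_task_lock
{
public:
	bool try_lock(int32_t threadid)
	{
		int32_t expected = -1;
		// Acquire on success: work done inside the critical section cannot
		// be reordered before the lock is taken.
		return m_owner.compare_exchange_strong(expected, threadid,
				std::memory_order_acquire, std::memory_order_relaxed);
	}

	void unlock()
	{
		// Release: work done inside the critical section cannot be
		// reordered after the lock is dropped.
		m_owner.store(-1, std::memory_order_release);
	}

private:
	std::atomic<int32_t> m_owner{ -1 };
};
```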
-	/* buffer the outputs */
-	for (output_buffer &outbuf : m_buffers)
-		*outbuf.ptr++ = *outbuf.source;
+	// buffer the outputs
+	for (output_buffer &outbuf : m_buffers)
+		*outbuf.ptr.load(std::memory_order_relaxed) = *outbuf.source;
+	std::atomic_thread_fence(std::memory_order_release);
+	for (output_buffer &outbuf : m_buffers)
+		outbuf.ptr.store(outbuf.ptr.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);
This was pretty dodgy – there’s no guarantee of when the postincrement happens relative to the store.
We can use a single release barrier here between storing the samples and updating the pointers to allow some loop unrolling, etc.
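A minimal sketch of this fence-based variant (illustrative names, not the device code): the data stores can be freely unrolled or vectorised, and a single release fence before the relaxed pointer updates publishes all of them at once.

```cpp
#include <atomic>

// Hypothetical per-output state; the consumer reads samples up to write_pos.
struct example_output
{
	double buf[1024];
	std::atomic<double *> write_pos{ buf };
};

void publish_one_sample(example_output *outputs, const double *sources, int count)
{
	// Store all the new samples first; these plain writes can be unrolled.
	for (int i = 0; i < count; i++)
		*outputs[i].write_pos.load(std::memory_order_relaxed) = sources[i];

	// One release fence orders every store above before the pointer updates below.
	std::atomic_thread_fence(std::memory_order_release);

	// Relaxed pointer updates are sufficient after the fence.
	for (int i = 0; i < count; i++)
	{
		double *const p = outputs[i].write_pos.load(std::memory_order_relaxed);
		outputs[i].write_pos.store(p + 1, std::memory_order_relaxed);
	}
}
```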
-	/* check dependencies */
+	// check dependencies
 	for (input_buffer &sn : source_list)
 	{
-		int avail = sn.linked_outbuf->ptr - sn.ptr;
+		const int avail = sn.linked_outbuf->ptr.load(std::memory_order_relaxed) - sn.ptr;
 		if (avail < 0)
 			throw emu_fatalerror("discrete_task::process: available samples are negative");
 		if (avail < samples)
 			samples = avail;
 	}
+	std::atomic_thread_fence(std::memory_order_acquire);
We only need a single acquire barrier here between determining the minimum number of samples available across all inputs and consuming the samples.

Since `output_buffer::ptr` is atomic, we get a guarantee that there’s a single load here and we won’t get an inconsistent value.
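The consumer-side counterpart of the earlier release-fence sketch (again with made-up names) would look roughly like this:

```cpp
#include <atomic>
#include <algorithm>

// Hypothetical input: read_pos is private to the consumer, write_pos is
// published by the producer (see the release-fence sketch above).
struct example_input
{
	const double *read_pos;
	const std::atomic<double *> *write_pos;
};

int min_available(example_input *inputs, int count, int wanted)
{
	int samples = wanted;
	for (int i = 0; i < count; i++)
	{
		// Relaxed loads are fine while we're only counting samples.
		const double *end = inputs[i].write_pos->load(std::memory_order_relaxed);
		samples = std::min(samples, int(end - inputs[i].read_pos));
	}
	// One acquire fence: every sample counted above is now safe to read.
	std::atomic_thread_fence(std::memory_order_acquire);
	return samples;
}
```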
 void discrete_task::prepare_for_queue(int samples)
 {
+	m_threadid.store(-1, std::memory_order_relaxed); // unlock the thread
 	m_samples = samples;

-	/* set up task buffers */
+	// set up task buffers
 	for (output_buffer &ob : m_buffers)
-		ob.ptr = ob.node_buf.get();
+		ob.ptr.store(ob.node_buf.get(), std::memory_order_relaxed);

-	/* initialize sources */
+	// initialize sources
 	for (input_buffer &sn : source_list)
 	{
 		sn.ptr = sn.linked_outbuf->node_buf.get();
 	}
 }
We can use relaxed memory order here because there’s a barrier after preparing all the tasks.
-	buf.ptr = buf.node_buf.get();
+	buf.ptr.store(buf.node_buf.get(), std::memory_order_relaxed);
This only happens on start, and there’s a barrier after preparing tasks.
-	/* Setup tasks */
+	// Set up tasks
 	for (const auto &task : task_list)
 	{
-		/* unlock the thread */
-		task->unlock();
-
 		task->prepare_for_queue(samples);
 	}
+	std::atomic_thread_fence(std::memory_order_release);
The unlock operation has been moved inside `discrete_task::prepare_for_queue`. The release barrier here ensures preparation work can’t be reordered after starting the tasks.
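As a rough sketch of the overall setup pattern being described (hypothetical names; single-threaded setup followed by publication to worker threads):

```cpp
#include <atomic>
#include <vector>

// Hypothetical task state touched only on the main thread during setup.
struct example_task
{
	std::atomic<int> owner{ -1 };
	int samples = 0;

	void prepare(int n)
	{
		owner.store(-1, std::memory_order_relaxed); // relaxed is fine here...
		samples = n;
	}
};

void prepare_all(std::vector<example_task> &tasks, int samples)
{
	for (example_task &t : tasks)
		t.prepare(samples);

	// ...because this single release fence orders all the setup stores above
	// before whatever operation hands the tasks to the worker threads next.
	std::atomic_thread_fence(std::memory_order_release);
	// (queue the tasks to the worker threads here)
}
```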
I have tested this on an ARM64 build and it seems to work fine. Games with discrete sound no longer hang when numprocessors is not one. Thank you.
Just chiming in here: on Xbox RetroArch MAME (the current core, which is based on current MAME), discrete games still crash back to the dashboard and threads have to be turned off. Is there anything else here that could be causing the crash? I was hoping that since this fixed issues on the Pi 5 it would also benefit us as well.
We do not support hacked up RetroArch cores.
This was for fixing the issue on ARM-based systems, IIRC, not Intel/AMD x86-64-based systems like the Xbox. The best way to fix the problem would be to create a standalone version of MAME for Xbox instead of using RA. Or hope for a miracle.
Thanks for the reply @StLouisCPhT! Yeah, the Series X has really good speed on MAME RetroArch and we do have the option to turn off threading (the discrete games and some CHD games that use threading run just fine on Xbox once you disable it). I think even if the Series X and S had a standalone port, the discrete/CHD games would still crash unless threading was disabled, so I figured I'd ask here if there was anything we could try in case we ever do go that route. If not, oh well; most of the discrete games run just fine. It's just when you start getting into certain CHD games that use threading, like San Francisco Rush, Time Crisis, etc., that you start to get slowdown. That's where it'd be nice to get the speed benefit like the PC does. Anyway, I figured there was no harm in asking. Thank you again for your response! :)
After the fiasco with people violating MS's retail restrictions and the lack of activity by Sir Mangler in almost a year, I gave up on any real emulation support on the Xbox consoles and switched to a 2019 Shield TV Pro; everything works as well as or even better than RA on my Series X, plus I have more standalone options. Dolphin still needs work on the speed side, but most GC and Wii games now hit 30fps. @cuavas Does MAME still have any of the old UWP code in it, or was it completely stripped out?
The fix should be correct for all processor architectures; there is nothing ARM-specific about it. However, RetroArch doesn't use MAME's threading system, and with their intense focus on obsolete hardware they're a little angry we use threading at all.
The old UWP code was removed as it had rotted pretty badly and Microsoft moved to allowing a subset of Win32 for Store apps rather than requiring UWP.
So the crashing on Xbox is more on the RetroArch side then? Sorry, just trying to understand, as the only way to get those games to not crash is to turn off the threading in the MAME code.
As far as we're aware the fix is correct, and real MAME builds work well with it on all processors. So yes, the problem is RetroArch's.
The synchronisation barriers used in the discrete sound device aren’t really thought out properly. There are two variables where concurrent accesses matter:

- `output_buffer::ptr`, used to determine the number of samples available to the consumer.
- `discrete_task::m_threadid`, used to ensure only a single thread runs a given task at any given time.

Using the `volatile` qualifier for `output_buffer::ptr` doesn’t really help, as it doesn’t enforce ordering between writes to the buffer and writes to the pointer itself. The pointer itself needs to be updated with release semantics, while it needs to be read with acquire semantics when determining how many samples are available to the consumer.

`discrete_task::m_threadid` needs to be updated with acquire semantics when locking and release semantics when unlocking to ensure no work gets reordered outside the critical section. It’s a bad idea to use `compare_exchange_weak` when locking as it isn’t in a loop attempting to complete a lock-free operation. On x86 you’ll never notice the difference, but on most other architectures it can fail on contention for a cache line, which isn’t what we want to happen here.

`discrete_task` seems to use `public`/`protected`/`private` randomly, and doesn’t need a vtable as it’s never used as a base class.

This may or may not fix #11373, but I don’t like having dodgy synchronisation code around that works by chance at best.
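To make the `compare_exchange_weak` point concrete, here is a small sketch (not MAME code; `owner`, `try_lock` and `add_relaxed` are made-up names) of the two idioms. The weak form may fail spuriously even when the value matches, which is only acceptable inside a retry loop; a one-shot try-lock should use the strong form.

```cpp
#include <atomic>
#include <cstdint>

std::atomic<int32_t> owner{ -1 };

// One-shot try-lock: use the strong form, since a spurious failure here would
// make us wrongly believe another thread already holds the task.
bool try_lock(int32_t threadid)
{
	int32_t expected = -1;
	return owner.compare_exchange_strong(expected, threadid,
			std::memory_order_acquire, std::memory_order_relaxed);
}

// Retry loop (e.g. a lock-free counter update): the weak form is fine here,
// because a spurious failure simply goes around the loop again.
void add_relaxed(std::atomic<int> &counter, int amount)
{
	int old = counter.load(std::memory_order_relaxed);
	while (!counter.compare_exchange_weak(old, old + amount, std::memory_order_relaxed))
	{
		// 'old' has been reloaded with the current value; retry.
	}
}
```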