Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sound/discrete.cpp: Use appropriate memory barriers for task synchronisation #12034

Merged
merged 1 commit into from
Feb 15, 2024

Conversation

cuavas
Copy link
Member

@cuavas cuavas commented Feb 14, 2024

The synchronisation barriers used in the discrete sound device aren’t really thought out properly. There are two variable where concurrent accesses matter:

  • output_buffer::ptr, used to determine the number of samples available to the consumer.
  • discrete_task::m_threadid, used to ensure only a single thread runs a given task at any given time.

Using the volatile qualifier for output_buffer::ptr doesn’t really help, as it doesn’t enforce ordering between writes to the buffer and writes to the pointer itself. The pointer itself needs to be updated with release semantics, while it needs to be read with acquire semantics when determining how many samples are available to the consumer.

discrete_task::m_threadid needs to be updated with acquire semantics when locking and release semantics when unlocking to ensure no work gets reordered outside the critical section. It’s a bad idea to use compare_exchange_weak when locking as it isn’t in a loop attempting to complete a lock-free operation. On x86 you’ll never notice the difference, but on most other architectures it can fail on contention for a cache line, which isn’t what we want to happen here.

discrete_task seems to use public/protected/private randomly, and doesn’t need a vtable as it’s never used as a base class.

This may or may not fix #11373, but I don’t like having dodgy synchronisation code around that works by chance at best.

Copy link
Member Author

@cuavas cuavas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-reviewing to allow in-place explanatory comments.

Comment on lines 103 to 106
std::unique_ptr<double []> node_buf;
const double *source;
volatile double *ptr;
const double * source;
std::atomic<double *> ptr;
int node_num;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using volatile here doesn’t do what it’s supposed to, as it doesn’t enforce ordering of writes to the buffer relative to writes to the pointer itself.

Making ptr atomic guarantees consistent loads on systems where pointer-sized writes are not inherently atomic (typically systems where the memory bus is narrower than a pointer, e.g. 32-bit CPU with 16-bit memory, or 64-bit CPU with 32-bit memory). It also makes it easier to ensure the compiler doesn’t turn what appears to be a single access int multiple loads.

Comment on lines -102 to +113
volatile const double *ptr; /* pointer into linked_outbuf.nodebuf */
output_buffer * linked_outbuf; /* what output are we connected to ? */
double buffer; /* input[] will point here */
const double * ptr; // pointer into linked_outbuf.nodebuf
output_buffer * linked_outbuf; // what output are we connected to?
double buffer; // input[] will point here
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s no point having volatile on ptr here, either. You just need to ensure that there’s an acquire barrier between determining the number of samples available and reading the samples.

Comment on lines -109 to -111
friend class discrete_device;
public:
virtual ~discrete_task() { }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using public/protected/private randomly and relying on friends is a bad design. (There isn’t much point using protected at all here when there are no derived classes.)

There’s no point giving this a vtable as it has no other virtual member functions, and isn’t used as a base class for anything.

Comment on lines -130 to +148
discrete_task(discrete_device &pdev) : m_device(pdev), m_threadid(-1)
bool lock_threadid(int32_t threadid)
{
// FIXME: the code expects to be able to take pointers to members of elements of this vector before it's filled
source_list.reserve(16);
int expected = -1;
return m_threadid.compare_exchange_strong(expected, threadid, std::memory_order_acquire, std::memory_order_relaxed);
}

protected:
static void *task_callback(void *param, int threadid);
inline bool process();
void unlock()
{
m_threadid.store(-1, std::memory_order_release);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Locking requires acquire semantics to ensure no task work can be reordered before it happens. Unlocking requires release semantics to ensure no task work can be reordered after it happens.

Comment on lines -207 to +217
/* buffer the outputs */
// buffer the outputs
for (output_buffer &outbuf : m_buffers)
*outbuf.ptr.load(std::memory_order_relaxed) = *outbuf.source;
std::atomic_thread_fence(std::memory_order_release);
for (output_buffer &outbuf : m_buffers)
*outbuf.ptr++ = *outbuf.source;
outbuf.ptr.store(outbuf.ptr.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was pretty dodgy – there’s no guarantee of when the postincrement happens relative to the store.

We can use a single release barrier here between storing the samples and updating the pointers to allow some loop unrolling, etc.

Comment on lines -236 to +253
/* check dependencies */
// check dependencies
for (input_buffer &sn : source_list)
{
int avail = sn.linked_outbuf->ptr - sn.ptr;
const int avail = sn.linked_outbuf->ptr.load(std::memory_order_relaxed) - sn.ptr;
if (avail < 0)
throw emu_fatalerror("discrete_task::process: available samples are negative");
if (avail < samples)
samples = avail;
}
std::atomic_thread_fence(std::memory_order_acquire);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need a single acquire barrier here between determining the minimum number of samples available across all inputs and consuming the samples.

Since output_buffer::ptr is atomic, we get a guarantee that there’s a single load here and we won’t get an inconsistent value.

Comment on lines 272 to 284
void discrete_task::prepare_for_queue(int samples)
{
m_threadid.store(-1, std::memory_order_relaxed); // unlock the thread
m_samples = samples;
/* set up task buffers */

// set up task buffers
for (output_buffer &ob : m_buffers)
ob.ptr = ob.node_buf.get();
ob.ptr.store(ob.node_buf.get(), std::memory_order_relaxed);

/* initialize sources */
// initialize sources
for (input_buffer &sn : source_list)
{
sn.ptr = sn.linked_outbuf->node_buf.get();
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use relaxed memory order here because there’s a barrier after preparing all the tasks.

Comment on lines -320 to +329
buf.ptr = buf.node_buf.get();
buf.ptr.store(buf.node_buf.get(), std::memory_order_relaxed);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only happens on start, and there’s a barrier after preparing tasks.

Comment on lines -1020 to +1031
/* Setup tasks */
// Set up tasks
for (const auto &task : task_list)
{
/* unlock the thread */
task->unlock();

task->prepare_for_queue(samples);
}
std::atomic_thread_fence(std::memory_order_release);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unlock operation has been moved inside discrete_task::prepare_for_queue. The release barrier here ensures preparation work can’t be reordered after starting the tasks.

@cuavas cuavas merged commit 257d81f into mamedev:master Feb 15, 2024
5 checks passed
@cuavas cuavas deleted the discsync branch February 15, 2024 00:43
@seleuco
Copy link

seleuco commented Feb 15, 2024

I have tested this on an ARM64 build and it seems to work fine. Games with discrete sound now do not hung when numprocessors is not one. Thank you.

@joepogo
Copy link

joepogo commented Mar 1, 2024

Just chiming in here, on Xbox retroarch mame, the current core which is based on current mame, discrete games still crash back to dash and threads have to be turned off. Is there anything else here that could be causing the crash? Was hoping since this fixed issues on the pi5 it would also benefit us as well.

@cuavas
Copy link
Member Author

cuavas commented Mar 1, 2024

We do not support hacked up RetroArch cores.

@StLouisCPhT
Copy link

StLouisCPhT commented Mar 4, 2024

Just chiming in here, on Xbox retroarch mame, the current core which is based on current mame, discrete games still crash back to dash and threads have to be turned off. Is there anything else here that could be causing the crash? Was hoping since this fixed issues on the pi5 it would also benefit us as well.

This was for fixing the issue on ARM based systems iirc. Not Intel/AMD x86-64 based systems like Xbox.

Best way to fix the problem would be to create a standalone version of Mame for Xbox instead of using RA.

Or hope for a miracle.

@joepogo
Copy link

joepogo commented Mar 4, 2024

Thanks for the reply @StLouisCPhT! Yea, the series x has really good speed on mame retroarch and we do have the option to turn off threading (as the discrete games and some chd games that use threading run just fine on xbox once you disable it).

I think even if the series x and s had a standalone port, the discrete/chd games would still crash unless threading was disabled, so i figured id ask here if there was anything we could try in case we ever do go that route.

Anyhow, If not though, oh well, most of the discrete games run just fine it's just when you start getting into certain chd games that use threading like san francisco rush, time crsis, etc that you start to get slowdown. That's where it'd be nice to get the speed benefit like pc does.

Anyways, figured there was no harm in asking. Thank you again for you response! :)

@StLouisCPhT
Copy link

StLouisCPhT commented Mar 4, 2024

Thanks for the reply @StLouisCPhT! Yea, the series x has really good speed on mame retroarch and we do have the option to turn off threading (as the discrete games and some chd games that use threading run just fine on xbox once you disable it).

I think even if the series x and s had a standalone port, the discrete/chd games would still crash unless threading was disabled, so i figured id ask here if there was anything we could try in case we ever do go that route.

Anyhow, If not though, oh well, most of the discrete games run just fine it's just when you start getting into certain chd games that use threading like san francisco rush, time crisis, etc that you start to get slowdown. That's where it'd be nice to get the speed benefit like pc does.

Anyways, figured there was no harm in asking. Thank you again for you response! :)

After the fiasco with people violating MS's retail restrictions and the lack of activity by Sir Mangler in almost a year, I gave up on any real emulation support on the Xbox consoles and switched to a 2019 Shield TV Pro; everything works as well as or even better than RA on my Series X, plus i have more standalone options. Dolphin still needs work on the speed side, but most GC and Wii games now hit 30fps.

@cuavas Does Mame still have any of the old UWP code still in it or was it completely stripped out?

@rb6502
Copy link
Contributor

rb6502 commented Mar 5, 2024

This was for fixing the issue on ARM based systems iirc. Not Intel/AMD x86-64 based systems like Xbox.

The fix should be correct for all processor architectures, there is nothing ARM-specific about it. However, Retroarch doesn't use MAME's threading system, and with their intense focus on obsolete hardware they're a little angry we use threading at all.

@cuavas
Copy link
Member Author

cuavas commented Mar 5, 2024

@cuavas Does Mame still have any of the old UWP code still in it or was it completely stripped out?

The old UWP code was removed as it had rotted pretty badly and Microsoft moved to allowing a subset of Win32 for Store apps rather than requiring UWP.

@joepogo
Copy link

joepogo commented Mar 5, 2024

This was for fixing the issue on ARM based systems iirc. Not Intel/AMD x86-64 based systems like Xbox.

The fix should be correct for all processor architectures, there is nothing ARM-specific about it. However, Retroarch doesn't use MAME's threading system, and with their intense focus on obsolete hardware they're a little angry we use threading at all.

So the crashing for Xbox is more on the retroarch side then? Sorry, just trying to understand as the only way to get those games to not crash is to turn off the threading in the mame code.

@rb6502
Copy link
Contributor

rb6502 commented Mar 5, 2024

So the crashing for Xbox is more on the retroarch side then? Sorry, just trying to understand as the only way to get those games to not crash is to turn off the threading in the mame code.

As far as we're aware the fix is correct and real MAME builds work well with it on all processors. So yes, the problem is RetroArch's.

stonedDiscord pushed a commit to stonedDiscord/mame that referenced this pull request Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

src/devices/sound/discrete.cpp bad address and threading issues
5 participants