New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

sound/discrete.cpp: Use appropriate memory barriers for task synchronisation #12034

Merged

cuavas merged 1 commit into mamedev:master from cuavas:discsync

Feb 15, 2024

Member

cuavas commented Feb 14, 2024

The synchronisation barriers used in the discrete sound device aren’t really thought out properly. There are two variable where concurrent accesses matter:

output_buffer::ptr, used to determine the number of samples available to the consumer.
discrete_task::m_threadid, used to ensure only a single thread runs a given task at any given time.

Using the volatile qualifier for output_buffer::ptr doesn’t really help, as it doesn’t enforce ordering between writes to the buffer and writes to the pointer itself. The pointer itself needs to be updated with release semantics, while it needs to be read with acquire semantics when determining how many samples are available to the consumer.

discrete_task::m_threadid needs to be updated with acquire semantics when locking and release semantics when unlocking to ensure no work gets reordered outside the critical section. It’s a bad idea to use compare_exchange_weak when locking as it isn’t in a loop attempting to complete a lock-free operation. On x86 you’ll never notice the difference, but on most other architectures it can fail on contention for a cache line, which isn’t what we want to happen here.

discrete_task seems to use public/protected/private randomly, and doesn’t need a vtable as it’s never used as a base class.

This may or may not fix #11373, but I don’t like having dodgy synchronisation code around that works by chance at best.


          sound/discrete.cpp: Use appropriate memory barriers for task synchron…

4be9a67

…isation.

cuavas commented

View reviewed changes

Member Author

cuavas left a comment

Self-reviewing to allow in-place explanatory comments.

src/devices/sound/discrete.cpp

Comment on lines 103 to 106

    
              	std::unique_ptr<double []>  node_buf;

              	const double                *source;

              	volatile double             *ptr;

              	const double *              source;

              	std::atomic<double *>       ptr;

              	int                         node_num;

Member Author

cuavas Feb 14, 2024

Using volatile here doesn’t do what it’s supposed to, as it doesn’t enforce ordering of writes to the buffer relative to writes to the pointer itself.

Making ptr atomic guarantees consistent loads on systems where pointer-sized writes are not inherently atomic (typically systems where the memory bus is narrower than a pointer, e.g. 32-bit CPU with 16-bit memory, or 64-bit CPU with 32-bit memory). It also makes it easier to ensure the compiler doesn’t turn what appears to be a single access int multiple loads.

src/devices/sound/discrete.cpp

Comment on lines -102 to +113

    
              	volatile const double       *ptr;               /* pointer into linked_outbuf.nodebuf */

              	output_buffer *             linked_outbuf;      /* what output are we connected to ? */

              	double                      buffer;             /* input[] will point here */

              	const double *              ptr;                // pointer into linked_outbuf.nodebuf

              	output_buffer *             linked_outbuf;      // what output are we connected to?

              	double                      buffer;             // input[] will point here

Member Author

cuavas Feb 14, 2024

There’s no point having volatile on ptr here, either. You just need to ensure that there’s an acquire barrier between determining the number of samples available and reading the samples.

src/devices/sound/discrete.cpp

Comment on lines -109 to -111

-              	friend class discrete_device;
               public:
-              	virtual ~discrete_task() { }

Member Author

cuavas Feb 14, 2024

Using public/protected/private randomly and relying on friends is a bad design. (There isn’t much point using protected at all here when there are no derived classes.)

There’s no point giving this a vtable as it has no other virtual member functions, and isn’t used as a base class for anything.

src/devices/sound/discrete.cpp

Comment on lines -130 to +148

    
              	discrete_task(discrete_device &pdev) : m_device(pdev), m_threadid(-1)

              	bool lock_threadid(int32_t threadid)

              	{

              		// FIXME: the code expects to be able to take pointers to members of elements of this vector before it's filled

              		source_list.reserve(16);

              		int expected = -1;

              		return m_threadid.compare_exchange_strong(expected, threadid, std::memory_order_acquire, std::memory_order_relaxed);

              	}

              protected:

              	static void *task_callback(void *param, int threadid);

              	inline bool process();

              	void unlock()

              	{

              		m_threadid.store(-1, std::memory_order_release);

              	}

Member Author

cuavas Feb 14, 2024

Locking requires acquire semantics to ensure no task work can be reordered before it happens. Unlocking requires release semantics to ensure no task work can be reordered after it happens.

src/devices/sound/discrete.cpp

Comment on lines -207 to +217

    
              	/* buffer the outputs */

              	// buffer the outputs

              	for (output_buffer &outbuf : m_buffers)

              		*outbuf.ptr.load(std::memory_order_relaxed) = *outbuf.source;

              	std::atomic_thread_fence(std::memory_order_release);

              	for (output_buffer &outbuf : m_buffers)

              		*outbuf.ptr++ = *outbuf.source;

              		outbuf.ptr.store(outbuf.ptr.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);

Member Author

cuavas Feb 14, 2024

This was pretty dodgy – there’s no guarantee of when the postincrement happens relative to the store.

We can use a single release barrier here between storing the samples and updating the pointers to allow some loop unrolling, etc.

src/devices/sound/discrete.cpp

Comment on lines -236 to +253

    
              	/* check dependencies */

              	// check dependencies

              	for (input_buffer &sn : source_list)

              	{

              		int avail = sn.linked_outbuf->ptr - sn.ptr;

              		const int avail = sn.linked_outbuf->ptr.load(std::memory_order_relaxed) - sn.ptr;

              		if (avail < 0)

              			throw emu_fatalerror("discrete_task::process: available samples are negative");

              		if (avail < samples)

              			samples = avail;

              	}

              	std::atomic_thread_fence(std::memory_order_acquire);

Member Author

cuavas Feb 14, 2024

We only need a single acquire barrier here between determining the minimum number of samples available across all inputs and consuming the samples.

Since output_buffer::ptr is atomic, we get a guarantee that there’s a single load here and we won’t get an inconsistent value.

src/devices/sound/discrete.cpp

Comment on lines 272 to 284

    
              void discrete_task::prepare_for_queue(int samples)

              {

              	m_threadid.store(-1, std::memory_order_relaxed); // unlock the thread

              	m_samples = samples;

              	/* set up task buffers */

              	// set up task buffers

              	for (output_buffer &ob : m_buffers)

              		ob.ptr = ob.node_buf.get();

              		ob.ptr.store(ob.node_buf.get(), std::memory_order_relaxed);

              	/* initialize sources */

              	// initialize sources

              	for (input_buffer &sn : source_list)

              	{

              		sn.ptr = sn.linked_outbuf->node_buf.get();

              	}

              }

Member Author

cuavas Feb 14, 2024

We can use relaxed memory order here because there’s a barrier after preparing all the tasks.

src/devices/sound/discrete.cpp

Comment on lines -320 to +329

    
              							buf.ptr = buf.node_buf.get();

              							buf.ptr.store(buf.node_buf.get(), std::memory_order_relaxed);

Member Author

cuavas Feb 14, 2024

This only happens on start, and there’s a barrier after preparing tasks.

src/devices/sound/discrete.cpp

Comment on lines -1020 to +1031

    
              	/* Setup tasks */

              	// Set up tasks

              	for (const auto &task : task_list)

              	{

              		/* unlock the thread */

              		task->unlock();

              		task->prepare_for_queue(samples);

              	}

              	std::atomic_thread_fence(std::memory_order_release);

Member Author

cuavas Feb 14, 2024

The unlock operation has been moved inside discrete_task::prepare_for_queue. The release barrier here ensures preparation work can’t be reordered after starting the tasks.

cuavas merged commit 257d81f into mamedev:master

5 checks passed

cuavas deleted the discsync branch

February 15, 2024 00:43

seleuco commented Feb 15, 2024

I have tested this on an ARM64 build and it seems to work fine. Games with discrete sound now do not hung when numprocessors is not one. Thank you.

cuavas mentioned this pull request

src/devices/sound/discrete.cpp bad address and threading issues #11373

Closed

grant2258 mentioned this pull request

strider 2 black screen libretro/mame#470

Closed

Mokona pushed a commit to Mokona/mame that referenced this pull request


          sound/discrete.cpp: Use appropriate memory barriers for task synchron…

c6ecc37

…isation. (mamedev#12034)

joepogo commented Mar 1, 2024

Just chiming in here, on Xbox retroarch mame, the current core which is based on current mame, discrete games still crash back to dash and threads have to be turned off. Is there anything else here that could be causing the crash? Was hoping since this fixed issues on the pi5 it would also benefit us as well.

Member Author

cuavas commented Mar 1, 2024

We do not support hacked up RetroArch cores.

StLouisCPhT commented Mar 4, 2024 •

edited

Loading

Just chiming in here, on Xbox retroarch mame, the current core which is based on current mame, discrete games still crash back to dash and threads have to be turned off. Is there anything else here that could be causing the crash? Was hoping since this fixed issues on the pi5 it would also benefit us as well.

This was for fixing the issue on ARM based systems iirc. Not Intel/AMD x86-64 based systems like Xbox.

Best way to fix the problem would be to create a standalone version of Mame for Xbox instead of using RA.

Or hope for a miracle.

joepogo commented Mar 4, 2024

Thanks for the reply @StLouisCPhT! Yea, the series x has really good speed on mame retroarch and we do have the option to turn off threading (as the discrete games and some chd games that use threading run just fine on xbox once you disable it).

I think even if the series x and s had a standalone port, the discrete/chd games would still crash unless threading was disabled, so i figured id ask here if there was anything we could try in case we ever do go that route.

Anyhow, If not though, oh well, most of the discrete games run just fine it's just when you start getting into certain chd games that use threading like san francisco rush, time crsis, etc that you start to get slowdown. That's where it'd be nice to get the speed benefit like pc does.

Anyways, figured there was no harm in asking. Thank you again for you response! :)

StLouisCPhT commented Mar 4, 2024 •

edited

Loading

Thanks for the reply @StLouisCPhT! Yea, the series x has really good speed on mame retroarch and we do have the option to turn off threading (as the discrete games and some chd games that use threading run just fine on xbox once you disable it).

I think even if the series x and s had a standalone port, the discrete/chd games would still crash unless threading was disabled, so i figured id ask here if there was anything we could try in case we ever do go that route.

Anyhow, If not though, oh well, most of the discrete games run just fine it's just when you start getting into certain chd games that use threading like san francisco rush, time crisis, etc that you start to get slowdown. That's where it'd be nice to get the speed benefit like pc does.

Anyways, figured there was no harm in asking. Thank you again for you response! :)

After the fiasco with people violating MS's retail restrictions and the lack of activity by Sir Mangler in almost a year, I gave up on any real emulation support on the Xbox consoles and switched to a 2019 Shield TV Pro; everything works as well as or even better than RA on my Series X, plus i have more standalone options. Dolphin still needs work on the speed side, but most GC and Wii games now hit 30fps.

@cuavas Does Mame still have any of the old UWP code still in it or was it completely stripped out?

Contributor

rb6502 commented Mar 5, 2024

This was for fixing the issue on ARM based systems iirc. Not Intel/AMD x86-64 based systems like Xbox.

The fix should be correct for all processor architectures, there is nothing ARM-specific about it. However, Retroarch doesn't use MAME's threading system, and with their intense focus on obsolete hardware they're a little angry we use threading at all.

Member Author

cuavas commented Mar 5, 2024

@cuavas Does Mame still have any of the old UWP code still in it or was it completely stripped out?

The old UWP code was removed as it had rotted pretty badly and Microsoft moved to allowing a subset of Win32 for Store apps rather than requiring UWP.

joepogo commented Mar 5, 2024

This was for fixing the issue on ARM based systems iirc. Not Intel/AMD x86-64 based systems like Xbox.

The fix should be correct for all processor architectures, there is nothing ARM-specific about it. However, Retroarch doesn't use MAME's threading system, and with their intense focus on obsolete hardware they're a little angry we use threading at all.

So the crashing for Xbox is more on the retroarch side then? Sorry, just trying to understand as the only way to get those games to not crash is to turn off the threading in the mame code.

Contributor

rb6502 commented Mar 5, 2024

So the crashing for Xbox is more on the retroarch side then? Sorry, just trying to understand as the only way to get those games to not crash is to turn off the threading in the mame code.

As far as we're aware the fix is correct and real MAME builds work well with it on all processors. So yes, the problem is RetroArch's.

stonedDiscord pushed a commit to stonedDiscord/mame that referenced this pull request


          sound/discrete.cpp: Use appropriate memory barriers for task synchron…

511f4bb

…isation. (mamedev#12034)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment