
Ringbuffers are not thread safe #388

Closed

nicholashaydensmith opened this issue Oct 28, 2018 · 15 comments

@nicholashaydensmith

The ringbuffer code is not thread safe because the variables write_ptr and read_ptr are read from and written to by different threads. Not only are the size_t variables themselves subject to data races, the compiler and processor are also free to reorder any of the operations inside jack_ringbuffer_read and jack_ringbuffer_write because there is zero fencing. Consider the following scenario:

Producer thread: calls jack_ringbuffer_write, updates write_ptr before any data has actually been memcpy'd, and gets context-switched out.
Consumer thread: calls jack_ringbuffer_read and thinks there is data to be consumed, when in fact the producer thread hasn't actually written anything to memory.

The volatile keyword does not make anything thread-safe and shouldn't be used in this context. volatile is really only useful for memory-mapped IO and is never appropriate as a means of inter-thread communication.

Instead of using volatile, the write_ptr and read_ptr variables should be made atomic, with proper memory-order semantics to ensure thread safety: https://en.cppreference.com/w/cpp/atomic/memory_order.
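
As a rough illustration of that suggestion, a single-producer/single-consumer ring with atomic indices could look like the sketch below. The names (spsc_ring and friends) are illustrative, not JACK's actual API, and wrap-around is handled with a power-of-two mask:

#include <atomic>
#include <cstddef>

// Illustrative single-producer/single-consumer ring, not JACK's implementation.
// The indices run freely and are reduced modulo the (power-of-two) size.
struct spsc_ring {
    static constexpr size_t size = 1024;      // power of two
    static constexpr size_t mask = size - 1;
    char buf[size];
    std::atomic<size_t> write_ptr{0};
    std::atomic<size_t> read_ptr{0};

    // producer thread only
    bool write(const char *src, size_t n) {
        size_t w = write_ptr.load(std::memory_order_relaxed);
        size_t r = read_ptr.load(std::memory_order_acquire);
        if (size - (w - r) < n)               // not enough free space
            return false;
        for (size_t i = 0; i < n; ++i)
            buf[(w + i) & mask] = src[i];
        // release: the bytes copied above become visible before the new index
        write_ptr.store(w + n, std::memory_order_release);
        return true;
    }

    // consumer thread only
    bool read(char *dst, size_t n) {
        size_t w = write_ptr.load(std::memory_order_acquire);
        size_t r = read_ptr.load(std::memory_order_relaxed);
        if (w - r < n)                        // not enough data available yet
            return false;
        // the acquire load above guarantees the producer's bytes are visible here
        for (size_t i = 0; i < n; ++i)
            dst[i] = buf[(r + i) & mask];
        read_ptr.store(r + n, std::memory_order_release);
        return true;
    }
};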

99.9% of the time I bet the code, as it stands, works perfectly fine, so I understand this is probably not a super high priority issue. I just thought I'd raise it for correctness purposes. I would submit a pull request, but it would violate the terms of the agreement with my current employer.

@falkTX
Member

falkTX commented Oct 28, 2018

I am aware of this "issue", but to be honest it all seems to be just theory.
Before trying to fix this, we need to have a test case that proves the bad behavior.

@kmatheussen
Contributor

Hopefully it's only in theory that it would misbehave, at least on Intel CPUs. But it's undoubtedly a real problem if you use tsan, so this must be fixed.

Here's a discussion about it from a couple of years ago: http://linux-audio.4202.n7.nabble.com/Realtime-inter-thread-communication-td99157.html

I did supply a patch for the ringbuffer in that thread, but I've lost access to the account where it was uploaded. Anyway, it's an easy fix.

@kmatheussen
Contributor

The volatile keyword does not make anything thread-safe and shouldn't be used in this context. volatile is really only useful for memory-mapped IO and is never appropriate as a means of inter-thread communication.

Actually, the Microsoft C compiler guarantees that operations on volatile variables are not reordered: https://preshing.com/20141024/my-multicore-talk-at-cppcon-2014/
https://msdn.microsoft.com/en-us/library/12a04hfd.aspx
Seems like Microsoft did the right thing there.

@jcelerier

jcelerier commented Nov 2, 2018

Before trying to fix this, we need to have a test case that proves the bad behavior.

I had similar code in my codebase that worked fine and then failed when compiling at high optimisation levels with LTO. Since there is no actual protection, the compiler was able to "assume" that writing from two threads would be undefined behaviour, and since undefined behaviour can never happen from the compiler's point of view, the only logical alternative was to assume that everything happens on a single thread.

Actual writes were hence eschewed and everything would stay in CPU caches for who knows how long (in practice write synchronization didn't happen at all from what I remember). Using atomic<T>* instead of volatile T* solved it.

@falkTX
Member

falkTX commented Nov 2, 2018

Can you create a similar case in some small code?

In any case, multiple writers to a single ringbuffer does not sound like a good idea to me, even with memory fences and other techniques.

@jcelerier

jcelerier commented Nov 2, 2018

The simplest example code that reproduces this kind of problem would be this:

#include <thread>
#include <iostream>
int main()
{
    int i = 0;
    volatile int* p = &i;
    
    std::thread t([p] { 
        (*p)++;
    });
    
    while(i == 0) ;
    
    std::cerr << i << std::endl;   
    
    t.join();
}

For me, with GCC 8, it works fine at -O0 / -O1 but stays in an infinite loop at -O2 / -O3.
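
For comparison, a variant of the snippet above that replaces volatile with std::atomic (a sketch, not code from any project) terminates at every optimisation level, because the atomic load in the loop cannot be hoisted:

#include <atomic>
#include <thread>
#include <iostream>
int main()
{
    std::atomic<int> i{0};

    std::thread t([&i] {
        i.fetch_add(1);          // atomic increment, guaranteed to become visible
    });

    while (i.load() == 0)        // an atomic load must be re-read on every iteration
        ;

    std::cerr << i.load() << std::endl;

    t.join();
}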

@falkTX
Member

falkTX commented Nov 2, 2018

I know about that case, the optimization makes the final binary skip the while loop.
The "issue" there is the local variable.

A ringbuffer is quite different; the member variables in there are not local-scope variables like in the example.

I have yet to see working test code against it (i.e. proof of a mis-write or mis-read).

@jpcima

jpcima commented Jan 9, 2019

I've tried porting the robustness check from my own ring-buffer library as a test.
I send a sequence of integers and check them on the receiving side.

Apparently, I am able to reproduce a data race problem.
Without a forced delay following each read or write, I get a failure and a report from thread sanitizer.

test_case.cc.gz
test_case.txt.gz

Hope this helps.

jackdmp version 1.9.12 tmpdir /dev/shm protocol 8

@7890
Contributor

7890 commented Jan 10, 2019

Hi, I think you are using the ringbuffer in the wrong way.

When jack_ringbuffer_write() succeeds, there's no need to call jack_ringbuffer_write_advance(). The latter is only needed when the write pointers were retrieved via jack_ringbuffer_get_write_vector().

Now if you call jack_ringbuffer_write() and the returned count isn't what you requested, you'll already have written a portion of the data to your buffer and messed up the event boundaries from there on.
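
For reference, the two write patterns being contrasted here look roughly like the sketch below; the helper names are hypothetical and error handling is trimmed:

#include <jack/ringbuffer.h>
#include <string.h>

// Hypothetical helpers for illustration only.

// Pattern 1: all-or-nothing copy with jack_ringbuffer_write().
// The free space is checked first so a message is either written whole or not
// at all, and no jack_ringbuffer_write_advance() call is needed afterwards.
static int write_whole_message(jack_ringbuffer_t *rb, const char *msg, size_t len)
{
    if (jack_ringbuffer_write_space(rb) < len)
        return 0;                               // try again later
    return jack_ringbuffer_write(rb, msg, len) == len;
}

// Pattern 2: copy in place through the vector API; here the write pointer must
// be advanced explicitly once the data has been placed in the buffer.
static int write_whole_message_vec(jack_ringbuffer_t *rb, const char *msg, size_t len)
{
    jack_ringbuffer_data_t vec[2];
    jack_ringbuffer_get_write_vector(rb, vec);
    if (vec[0].len + vec[1].len < len)
        return 0;
    size_t n1 = len < vec[0].len ? len : vec[0].len;
    memcpy(vec[0].buf, msg, n1);
    if (n1 < len)
        memcpy(vec[1].buf, msg + n1, len - n1);
    jack_ringbuffer_write_advance(rb, len);
    return 1;
}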

I've slightly modified your testcase:

test_case.cc.txt

It runs without errors with delay=0.
Greetings

@jpcima

jpcima commented Jan 10, 2019

Tried it and you're right, @7890: it was a misuse, and the test case seems to work robustly now.
Thread sanitizer still emits its complaint, but it ran without error on my side.

@7890
Contributor

7890 commented Jan 10, 2019

Excellent :)

@7890
Contributor

7890 commented Jan 14, 2019

Please re-open if this poses a "real" issue with a reproducible case.

@Krasjet
Contributor

Krasjet commented Jan 23, 2021

Hey, I want to give an update on this issue. I am able to consistently reproduce it on a Raspberry Pi 4 using the test case from #388 (comment). It appears that the relaxed memory ordering on ARM platforms does pose a problem for the current implementation.

I wonder if anyone is able to reproduce this issue on Apple Silicon?

@Krasjet
Contributor

Krasjet commented Jan 23, 2021

Here is a sample output:

...
---------------- 418756/1000000 ----------------
---------------- 418757/1000000 ----------------
message (0) != expected (82)

I have deleted all the print statements between each iteration to speed up the testing.

@Krasjet
Contributor

Krasjet commented Jan 23, 2021

Here is the test program I'm using, with some minor cleanups:

//          Copyright Jean Pierre Cimalando 2018.
// Distributed under the Boost Software License, Version 1.0.
//    (See accompanying file LICENSE or copy at
//          http://www.boost.org/LICENSE_1_0.txt)

// g++ -O2 -o test_case test_case.cc -ljack -lpthread

#include <jack/ringbuffer.h>
#include <memory>
#include <thread>
#include <stdint.h> // for uint8_t
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>

jack_ringbuffer_t *ring_buffer;
size_t capacity = 1024;
size_t message_count = 100;
size_t message_size = 100;

void thread1()
{
  jack_ringbuffer_t *rb = ::ring_buffer;
  const size_t message_size = ::message_size;
  const size_t message_count = ::message_count;
  std::unique_ptr<uint8_t[]> message_buffer(new uint8_t[message_size]);

  for (size_t i = 0; i < message_count;) {
    *(size_t *)message_buffer.get() = i;

    size_t can_write=jack_ringbuffer_write_space(rb);
    size_t count=0;
    if (can_write>=message_size) {
      count = jack_ringbuffer_write(rb, (const char *)message_buffer.get(), message_size);
    }
    if (count == message_size)
      ++i;
    else
      sched_yield();
  }
}

void thread2()
{
  jack_ringbuffer_t *rb = ::ring_buffer;
  const size_t message_size = ::message_size;
  const size_t message_count = ::message_count;
  std::unique_ptr<uint8_t[]> message_buffer(new uint8_t[message_size]);

  for (size_t i = 0; i < message_count;) {
    size_t can_read=jack_ringbuffer_read_space(rb);
    size_t count=0;
    if (can_read>=message_size) {
      count = jack_ringbuffer_read(rb, (char *)message_buffer.get(), message_size);
    }

    if (count == message_size) {
      size_t msg = *(size_t *)message_buffer.get();
      if (msg != i) {
        printf("message (%zu) != expected (%zu)\n", msg, i);
        exit(1);
      }
      ++i;
    }
  }
}

int main()
{
  size_t ntimes = 1000000;
  for (size_t i = 0; i < ntimes; ++i)
  {
    printf("---------------- %4zu/%4zu ----------------\n", i + 1, ntimes);
    jack_ringbuffer_t *rb = ::ring_buffer = jack_ringbuffer_create(::capacity);
    std::thread t1(thread1);
    std::thread t2(thread2);
    t1.join();
    t2.join();
    ::ring_buffer = nullptr;
    jack_ringbuffer_free(rb); // created with jack_ringbuffer_create, so free with jack_ringbuffer_free, not delete
  }
  printf("-------------------------------------------\n");
  printf("success!\n");

  return 0;
}

Krasjet added a commit to Krasjet/jack2 that referenced this issue Jul 21, 2022
This patch addresses the thread safety problem of `jack_ringbuffer_t`
mentioned in jackaudio#715 and jackaudio#388. The overbound read bug caused by this problem
is impossible to reproduce on x86 due to its strong memory ordering, but
it is a problem on ARM and other weakly ordered architectures.

Basically, the main problem is that, on a weakly ordered architecture,
it is possible that the pointer increment after `memcpy` becomes visible
to the other thread before `memcpy` finishes:

	memcpy (&(rb->buf[rb->write_ptr]), src, n1);
	// vvv can be visible to reading thread before memcpy finishes
	rb->write_ptr = (rb->write_ptr + n1) & rb->size_mask;

If this happens, the other thread can read garbage values in `rb->buf`
that have not yet been overwritten by the still-unfinished `memcpy`.

To fix this, an explicit pair of release/acquire memory fences [1] is
used to ensure the copy on the other thread *happens after* the `memcpy`
finishes so no garbage values can be read.

[1]: https://preshing.com/20130922/acquire-and-release-fences/
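
In a stripped-down sketch, the fence pairing described above looks roughly like this. The field names mirror the snippet in the commit message, but this is only an illustration, not the actual patch, and wrap-around of the data copy itself is omitted:

#include <atomic>
#include <cstddef>
#include <cstring>

// Illustrative sketch, not the actual JACK patch.
struct ring {
    char  *buf;
    size_t size_mask;    // size is a power of two, so mask = size - 1
    size_t write_ptr;
    size_t read_ptr;
};

// writer thread: copy the data first, then publish the new index
void ring_write_chunk(ring *rb, const char *src, size_t n1)
{
    std::memcpy(&rb->buf[rb->write_ptr], src, n1);
    std::atomic_thread_fence(std::memory_order_release);  // data must be visible before the index
    rb->write_ptr = (rb->write_ptr + n1) & rb->size_mask;
}

// reader thread: read the index first, then it is safe to copy that many bytes
size_t ring_read_chunk(ring *rb, char *dst, size_t n1)
{
    size_t avail = (rb->write_ptr - rb->read_ptr) & rb->size_mask;
    std::atomic_thread_fence(std::memory_order_acquire);  // index is observed before the bytes it covers
    if (avail < n1)
        return 0;
    std::memcpy(dst, &rb->buf[rb->read_ptr], n1);
    rb->read_ptr = (rb->read_ptr + n1) & rb->size_mask;
    return n1;
}
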
falkTX pushed a commit to moddevices/jack2 that referenced this issue Nov 14, 2022
falkTX pushed a commit that referenced this issue Jan 28, 2023
* fix ringbuffer thread safety on ARM. fix #715 #388


* remove volatile qualifier on ringbuf r/w pointers

The volatile qualifiers are superfluous when compiler barriers are present.
They only generate unnecessary `mov` instructions when the pointers aren't
going to be updated.

* simplify read/write space calculations

This optimization is possible because the buffer size is always a power
of 2. See [1] for details.

[1]: drobilla/zix#1 (comment)

* move acq fences to separate lines
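
For context, the power-of-two simplification referred to in the commit above boils down to a single masked subtraction for each space calculation. A sketch, assuming the usual convention that one byte is kept free to distinguish a full buffer from an empty one:

#include <cstddef>

// Sketch only: with a power-of-two size, unsigned wrap-around plus the mask does all the work.
size_t read_space(size_t write_ptr, size_t read_ptr, size_t size_mask)
{
    return (write_ptr - read_ptr) & size_mask;       // bytes available to read
}

size_t write_space(size_t write_ptr, size_t read_ptr, size_t size_mask)
{
    return (read_ptr - write_ptr - 1) & size_mask;   // one byte kept free to tell full from empty
}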