Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change to lockless unbounded mpsc #21

Merged
merged 4 commits into from Nov 24, 2019
Merged

Conversation

mratsim
Copy link
Owner

@mratsim mratsim commented Nov 23, 2019

Implemented lockless unbounded MPSC channels.

Original issue

The initial implementation was based on snmalloc paper (itself inspired by Pony).
The idea was to use it for both:

  • the future memory pool that needs a way to transfer onership of memory block back to the thread that allocated them in the first place
  • steal request

However they have the following issues:

  • they hold on the last item of a queue (breaking for steal requests)
  • they require memory management of the dummy node (snmalloc deletes it and its memory doesn't seem to be reclaimed)
  • they never touch the "back" pointer of the queue when dequeuing, meaning if an item was last, dequeuing will still points to it. Pony has an emptiness check via tagged pointer and snmalloc does something unknown.

Overhead / Performance

On fib(40) with i9-9980XE (36 cores OC @ 4.1 GHz)

DeepinScreenshot_select-area_20191123183003

DeepinScreenshot_select-area_20191123183051

We gain 10-20 ms (5-10%) and also avoids a thread blocking a steal request channel


Investigation details

Full details from #19 (comment)

It seems like there is an issue with snmalloc channels (adapted from Pony's) or I adapted them improperly.

snmalloc pseudo code from paper

DeepinScreenshot_select-area_20191123105253

DeepinScreenshot_select-area_20191123105346

image

Code from: https://github.com/microsoft/snmalloc/blob/7faefbbb0ed69554d0e19bfe901ec5b28e046a82/src/ds/mpscq.h#L29-L83

    void init(T* stub)
    {
      stub->next.store(nullptr, std::memory_order_relaxed);
      front = stub;
      back.store(stub, std::memory_order_relaxed);
      invariant();
    }

    T* destroy()
    {
      T* fnt = front;
      back.store(nullptr, std::memory_order_relaxed);
      front = nullptr;
      return fnt;
    }

    inline bool is_empty()
    {
      T* bk = back.load(std::memory_order_relaxed);

      return bk == front;
    }

    void enqueue(T* first, T* last)
    {
      // Pushes a list of messages to the queue. Each message from first to
      // last should be linked together through their next pointers.
      invariant();
      last->next.store(nullptr, std::memory_order_relaxed);
      std::atomic_thread_fence(std::memory_order_release);
      T* prev = back.exchange(last, std::memory_order_relaxed);
      prev->next.store(first, std::memory_order_relaxed);
    }

    std::pair<T*, bool> dequeue()
    {
      // Returns the front message, or null if not possible to return a message.
      invariant();
      T* first = front;
      T* next = first->next.load(std::memory_order_relaxed);

      if (next != nullptr)
      {
        front = next;
        AAL::prefetch(&(next->next));
        assert(front);
        std::atomic_thread_fence(std::memory_order_acquire);
        invariant();
        return std::pair(first, true);
      }

      return std::pair(nullptr, false);
    }
  };
} // namespace snmalloc

Pony: https://qconlondon.com/london-2016/system/files/presentation-slides/sylvanclebsch.pdf

DeepinScreenshot_select-area_20191123110607

DeepinScreenshot_select-area_20191123110621

Code from https://github.com/ponylang/ponyc/blob/7145c2a84b68ae5b307f0756eee67c222aa02fda/src/libponyrt/actor/messageq.c

Note that Pony enqueue at the head and dequeue at the tail

static bool messageq_push(messageq_t* q, pony_msg_t* first, pony_msg_t* last)
{
  atomic_store_explicit(&last->next, NULL, memory_order_relaxed);

  // Without that fence, the store to last->next above could be reordered after
  // the exchange on the head and after the store to prev->next done by the
  // next push, which would result in the pop incorrectly seeing the queue as
  // empty.
  // Also synchronise with the pop on prev->next.
  atomic_thread_fence(memory_order_release);

  pony_msg_t* prev = atomic_exchange_explicit(&q->head, last,
    memory_order_relaxed);

  bool was_empty = ((uintptr_t)prev & 1) != 0;
  prev = (pony_msg_t*)((uintptr_t)prev & ~(uintptr_t)1);

#ifdef USE_VALGRIND
  // Double fence with Valgrind since we need to have prev in scope for the
  // synchronisation annotation.
  ANNOTATE_HAPPENS_BEFORE(&prev->next);
  atomic_thread_fence(memory_order_release);
#endif
  atomic_store_explicit(&prev->next, first, memory_order_relaxed);

  return was_empty;
}

void ponyint_messageq_init(messageq_t* q)
{
  pony_msg_t* stub = POOL_ALLOC(pony_msg_t);
  stub->index = POOL_INDEX(sizeof(pony_msg_t));
  atomic_store_explicit(&stub->next, NULL, memory_order_relaxed);

  atomic_store_explicit(&q->head, (pony_msg_t*)((uintptr_t)stub | 1),
    memory_order_relaxed);
  q->tail = stub;

#ifndef PONY_NDEBUG
  messageq_size_debug(q);
#endif
}

pony_msg_t* ponyint_thread_messageq_pop(messageq_t* q
#ifdef USE_DYNAMIC_TRACE
  , uintptr_t thr
#endif
  )
{
  pony_msg_t* tail = q->tail;
  pony_msg_t* next = atomic_load_explicit(&tail->next, memory_order_relaxed);

  if(next != NULL)
  {
    DTRACE2(THREAD_MSG_POP, (uint32_t) next->id, (uintptr_t) thr);
    q->tail = next;
    atomic_thread_fence(memory_order_acquire);
#ifdef USE_VALGRIND
    ANNOTATE_HAPPENS_AFTER(&tail->next);
    ANNOTATE_HAPPENS_BEFORE_FORGET_ALL(tail);
#endif
    ponyint_pool_free(tail->index, tail);
  }

  return next;
}

bool ponyint_messageq_markempty(messageq_t* q)
{
  pony_msg_t* tail = q->tail;
  pony_msg_t* head = atomic_load_explicit(&q->head, memory_order_relaxed);

  if(((uintptr_t)head & 1) != 0)
    return true;

  if(head != tail)
    return false;

  head = (pony_msg_t*)((uintptr_t)head | 1);

#ifdef USE_VALGRIND
  ANNOTATE_HAPPENS_BEFORE(&q->head);
#endif
  return atomic_compare_exchange_strong_explicit(&q->head, &tail, head,
    memory_order_release, memory_order_relaxed);
}

However, I probably have missed something here but it seems to me like:

  • The back of the queue (head in Pony) is never modified in the enqueue/pop routines.
    This is problematic as it would still points to the item that was just dequeued
  • In snmalloc the queue front is overwritten by next, meaning the initial dummy node would be overwritten on first dequeue. But its memory is never freed/recycled.
  • As they both erase the front in the dequeue/pop, but the dequeuing only advances when there is more than one item in the queue, the consumer will get blocked on the last message.
  • Snmalloc keeps an atomic count of enqueued objects next to the queue.
    In our case, both for the memory pool and for steal requests (for the adaptative steal strategy) we also need to keep an approximate count of enqueued items, we might as well integrate it in the queue.

- they hold on the last item of a queue (breaking for steal requests)
- they require memory management of the dummy node (snmalloc deletes it and its memory doesn't seem to be reclaimed)
- they never touch the "back" pointer of the queue when dequeuing, meaning if an item was last, dequeuing will still points to it. Pony has an emptiness check via tagged pointer and snmalloc does ???
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant