
efficient synchronization of external threads using futex #306

Merged
merged 13 commits into pmodels:main from pr/futex on Mar 1, 2021

Conversation

shintaro-iwasaki
Collaborator

Pull Request Description

Problem

Most Argobots functions now support external threads: Pthreads can call Argobots synchronization functions such as ABT_mutex_lock(), ABT_barrier_wait(), and ABT_thread_join() together with ULTs. However, the current implementation of Argobots uses spin-wait (possibly with pause or sched_yield()), which burns cores if the synchronization condition is not satisfied soon. This can cause catastrophic performance degradation when Pthreads call such synchronization functions, for example when users do not know who will call Argobots functions. This issue arises particularly when Argobots is used as a building block of another runtime system that is expected to be called by other high-level programs.

Solution

This change affects only cases where Pthreads (not ULTs or tasklets!) directly call Argobots synchronization functions.

Argobots now provides a futex-based suspension mechanism that affects all synchronization operations. Unless the user passes --enable-wait-policy=active, Argobots makes the underlying external thread block on a futex (as most POSIX synchronization operations do internally). The blocked external thread does not consume CPU resources in a busy loop, which alleviates thread oversubscription. ULT performance is unchanged: if you use these synchronization objects from ULTs (the typical use case), this change does not affect the behavior or increase the overhead of these operations.

Developer Notes

This synchronization mechanism is a little complicated. Some notes:

  1. futex is essentially Linux-only. As a fallback, Argobots has a POSIX-based implementation (pthread_mutex_t and pthread_cond_t) for other UNIX systems.

  2. Since we should not allocate a large memory region just for this additional mechanism, which not many programs use, ABTD_futex_xxx_t should be small and simple. Currently its size is 4 or 8 bytes, and it can be initialized by zero-clearing no matter whether POSIX or futex is used (note: pthread_mutex_t or pthread_cond_t can be 32 bytes or more, depending on the implementation). Zero-clear initialization is also necessary for this feature to coexist with the static initializers of ABT_mutex and ABT_cond.

  3. The semantics of some synchronization operations differ. Specifically, ABT_cond_wait() does not allow spurious wakeups while pthread_cond_wait() does. This semantics mismatch adds some overhead.

  4. External signals can interrupt system calls (i.e., futex or pthread_cond_wait()). Seven tests are newly added to check that external threads work while receiving many signals (ext_thread_join, ext_thread_mutex, ...).

Performance

The following shows the performance of the synchronization objects in four configurations:

  1. the corresponding POSIX implementation (pthread_mutex_lock() etc.)
  2. this PR with futex (I tried this on Linux, so futex was available on all the machines)
  3. this PR without futex (for reference)
  4. this PR with the active wait policy (= the current Argobots behavior)

on these machines:

  1. Intel Skylake (Intel Xeon 8180M, 28 cores / 56 hardware threads * 2 sockets) / openSUSE 42.2 / GCC 7.5
  2. Summit-like POWER 9 (IBM S822LC 10 cores / 80 hardware threads * 2 sockets) / RedHat 7.6 / GCC 9.3
  3. 64-bit ARM (JLSE's 64-bit ARM for HPC, in total 112 hardware threads)/ RedHat 7.6 / GCC 9.3

All use Pthreads for underlying threads.

Benchmark code:

/* gcc -O3 test.c -labt -lpthread */
#include <abt.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int g_cond_val = 0;
pthread_barrier_t pth_barrier;
pthread_mutex_t pth_mutex;
pthread_cond_t pth_cond;

ABT_barrier abt_barrier;
ABT_mutex abt_mutex;
ABT_cond abt_cond;

int g_num_pthreads = 0;
int g_num_repeats = 0;
int g_type = 0;

typedef struct {
    int tid;
    pthread_t thread;
} pthread_arg_t;

void *pthread_func(void *arg)
{
    pthread_arg_t *p_arg = (pthread_arg_t *)arg;
    int tid = p_arg->tid;
    int num_pthreads = g_num_pthreads;
    int num_repeats = g_num_repeats;

    pthread_barrier_wait(&pth_barrier);
    g_cond_val = 0;
    double start_time = ABT_get_wtime();
    pthread_barrier_wait(&pth_barrier);

    if (g_type == 0) {
        /* Pthreads mutex */
        for (int i = 0; i < num_repeats; i++) {
            pthread_mutex_lock(&pth_mutex);
            /* Do nothing. */
            pthread_mutex_unlock(&pth_mutex);
        }
    } else if (g_type == 1) {
        /* Argobots mutex */
        for (int i = 0; i < num_repeats; i++) {
            ABT_mutex_lock(abt_mutex);
            /* Do nothing. */
            ABT_mutex_unlock(abt_mutex);
        }
    } else if (g_type == 2) {
        /* Pthreads cond broadcast */
        assert(g_num_pthreads >= 2);
        while (g_cond_val < num_repeats) {
            pthread_mutex_lock(&pth_mutex);
            if (g_cond_val % num_pthreads == tid) {
                g_cond_val += 1;
                pthread_cond_broadcast(&pth_cond);
            } else if (g_cond_val < num_repeats) {
                pthread_cond_wait(&pth_cond, &pth_mutex);
            }
            pthread_mutex_unlock(&pth_mutex);
        }
    } else if (g_type == 3) {
        /* Argobots cond broadcast */
        assert(g_num_pthreads >= 2);
        while (g_cond_val < num_repeats) {
            ABT_mutex_lock(abt_mutex);
            if (g_cond_val % num_pthreads == tid) {
                g_cond_val += 1;
                ABT_cond_broadcast(abt_cond);
            } else if (g_cond_val < num_repeats) {
                ABT_cond_wait(abt_cond, abt_mutex);
            }
            ABT_mutex_unlock(abt_mutex);
        }
    } else if (g_type == 4) {
        /* Pthreads barrier */
        for (int i = 0; i < num_repeats; i++)
            pthread_barrier_wait(&pth_barrier);
    } else if (g_type == 5) {
        /* Argobots barrier */
        for (int i = 0; i < num_repeats; i++)
            ABT_barrier_wait(abt_barrier);
    }
    pthread_barrier_wait(&pth_barrier);

    if (tid == 0) {
        double elapsed_time = ABT_get_wtime() - start_time;
        printf("Unit execution time: %.6f [us]\n",
               elapsed_time / num_repeats * 1.0e6);
    }
    return NULL;
}

int main(int argc, const char **argv)
{
    if (argc != 4) {
        printf("Usage: ./a.out NUM_THREADS NUM_REPEATS BENCH_TYPE\n");
        printf("BENCH_TYPE = 0: POSIX mutex\n");
        printf("           = 1: Argobots mutex\n");
        printf("           = 2: POSIX cond\n");
        printf("           = 3: Argobots cond\n");
        printf("           = 4: POSIX barrier\n");
        printf("           = 5: Argobots barrier\n");
        return -1;
    }
    g_num_pthreads = atoi(argv[1]);
    g_num_repeats = atoi(argv[2]);
    g_type = atoi(argv[3]);

    /* Performance check. */
    ABT_init(0, NULL);

    /* Allocate all synchronization objects. */
    pthread_barrier_init(&pth_barrier, NULL, g_num_pthreads);
    pthread_mutex_init(&pth_mutex, NULL);
    pthread_cond_init(&pth_cond, NULL);
    ABT_barrier_create(g_num_pthreads, &abt_barrier);
    ABT_mutex_create(&abt_mutex);
    ABT_cond_create(&abt_cond);

    pthread_arg_t *pthread_args =
        (pthread_arg_t *)malloc(sizeof(pthread_arg_t) * g_num_pthreads);

    /* Check the mutex performance.  Note that we do not consider a fairness
     * issue */
    for (int i = 0; i < g_num_pthreads; i++) {
        pthread_args[i].tid = i;
        pthread_create(&pthread_args[i].thread, NULL, pthread_func,
                       &pthread_args[i]);
    }
    for (int i = 0; i < g_num_pthreads; i++) {
        pthread_join(pthread_args[i].thread, NULL);
    }

    pthread_barrier_destroy(&pth_barrier);
    pthread_mutex_destroy(&pth_mutex);
    pthread_cond_destroy(&pth_cond);
    ABT_barrier_free(&abt_barrier);
    ABT_mutex_free(&abt_mutex);
    ABT_cond_free(&abt_cond);
    free(pthread_args);

    ABT_finalize();
    return 0;
}

Experimental setting:

Little contention:
Mutex: 1 thread (Skylake/ARM64/POWER9)
Cond: 2 threads (Skylake/ARM64/POWER9)
Barrier: 2 threads (Skylake/ARM64/POWER9)

High contention (not oversubscribed):
Mutex/Cond/Barrier: 28 threads (Skylake/ARM64), 40 threads (POWER9)

High contention (oversubscribed):
Mutex/Cond/Barrier: 256 threads (Skylake/ARM64), 320 threads (POWER9)

Results are the average of six executions. Each execution repeats these operations many times (around 100,000 or more). Argobots is compiled with --enable-perf-opt.

[Performance result charts omitted]

Please note that this result is meant to help understand the performance. This PR's change is qualitative, so poor microbenchmark performance does not lessen the value of this PR much. This microbenchmark does not represent a typical Pthreads+Argobots use case: the purpose of the new mechanism is to put external threads to sleep while they wait for others' moderately coarse-grained work, and in that case this PR works very well whether POSIX or futex is used. I also note that the semantics (spurious wakeups or not), the provided features (Pthreads only vs. ULTs + Pthreads), and scheduling fairness (Argobots synchronization objects might not be as fair as those of Pthreads) differ, so this comparison is not apples-to-apples.

We can observe the following:

  1. In all cases, this mechanism works on several architectures without hanging.
  2. When not contended, the performance of Argobots looks okay.
  3. When contended, the performance of Argobots is not good.
  4. With the passive wait policy, the futex version is overall faster than the POSIX version.
  5. When Pthreads are oversubscribed, the performance of ABT_barrier and ABT_cond degrades significantly with the active wait policy (because Argobots spins on cores).

There is much room for improvement, so if further performance optimization is necessary, please let us know.

Known issues

  1. Some operations are obviously unoptimized (specifically ABT_cond_signal()).
  2. A few operations still use busy-wait (as far as I checked, when a thread waits for completion of a tasklet).
  3. Some users might want to set this configuration per synchronization object, or even per function (e.g., ABT_mutex_lock_active() and ABT_mutex_lock_passive()). This is future work (if there is such a demand).

Checklist

  • Reference appropriate issues (with "Fixes" or "See" as appropriate)
  • Commits are self-contained and do not do two things at once
  • Commit message is of the form: module: short description and follows good practice
  • Passes whitespace checkers

ABT_CONFIG_USE_LINUX_FUTEX is declared if futex is available on that
system.
--enable-wait-policy=<active|passive|default|auto> changes the wait
policy of synchronization objects.  This information can be retrieved by
ABT_info functions.
The current spinlock implementation relies only on ABTD_atomic features
and therefore can be classified as ABTD.  To introduce ABTD_futex in the
next commit, which uses this spinlock, this patch moves the spinlock
implementation from ABTI to ABTD.
ABTD_futex_multiple is an internal synchronization object for external
threads (Pthreads).  Unlike spinlock, external threads that block on
ABTD_futex_multiple will suspend, so waiting external threads do not
burn cores.  Since futex is a Linux-specific feature,
ABTD_futex_multiple has a POSIX version that implements the same
feature with pthread_mutex_t and pthread_cond_t.  To cover both futex
and Pthreads, ABTD_futex_multiple does not support all the futex
operations; it supports only wait, timedwait, and broadcast.
blocking is not used.  This patch removes it to simplify introduction of
ABTD_futex_multiple.
This patch refactors ABTI_waitlist_wait_timedout_and_unlock() to
minimize the change in the next commit.  This patch does not change the
logic of the existing algorithm.
Argobots synchronization objects such as ABT_cond and ABT_barrier
internally use ABTI_waitlist.  This patch introduces ABTD_futex_multiple
for ABTI_waitlist operations so that external threads and tasklets can
wait for those objects without spinning cores.
@shintaro-iwasaki

test:argobots/all
test:argobots/osx
test:argobots/freebsd
test:argobots/solaris

@shintaro-iwasaki shintaro-iwasaki linked an issue Mar 1, 2021 that may be closed by this pull request
To check synchronization objects on external threads, this patch adds
six tests: ext_thread_barrier, ext_thread_cond, ext_thread_eventual,
ext_thread_future, ext_thread_mutex, and ext_thread_rwlock.  Those tests
are run under signals, which cause spurious wakeups and can break a
futex-based implementation.
@shintaro-iwasaki

test:argobots/osx

ABTD_futex_single is an internal synchronization object for external
threads (Pthreads).  Unlike spinlock, external threads that block on
ABTD_futex_single will suspend, so waiting external threads do not
burn cores.  ABTD_futex_single has a POSIX version as a fallback.
ABTD_futex_single supports only suspend and resume.
This patch refactors ABTI_ythread_context_switch_to_child_internal() and
ythread_terminate() to prepare for the next commit.  This commit does
not change the existing join request handling.
When an external thread or a tasklet joins yieldable threads, it will
wait on ABTD_futex_single.  This improves CPU utilization.
A new test, ext_thread_join, checks ULT join and execution stream join
on external threads, which internally uses ABTD_futex_single.
New ext_thread tests create Pthreads on the primary execution stream.
On Linux, a newly created Pthread inherits its parent's affinity
setting, so when affinity is enabled and no specific affinity is set
(i.e., when no ABT_SET_AFFINITY is set), the primary execution stream and
external Pthreads are pinned to a single core, which significantly
degrades the performance because the primary execution stream does not
sleep (while Pthreads internally sleep on futex or pthread_cond_t).
This patch relaxes the affinity setting of the primary execution stream
to avoid this oversubscription.
@shintaro-iwasaki

test:argobots/osx

@shintaro-iwasaki

test:argobots/freebsd
test:argobots/solaris
test:argobots/all

@shintaro-iwasaki

We understand that the current synchronization mechanism has room for performance improvement. Let us know any use case so that we can optimize it.

@shintaro-iwasaki shintaro-iwasaki merged commit fdd5cf3 into pmodels:main Mar 1, 2021
@shintaro-iwasaki shintaro-iwasaki deleted the pr/futex branch March 2, 2021 15:21
Successfully merging this pull request may close these issues.

Avoid spin-wait in Argobots blocking routines