[Parallel Programming]Peterson Lock 研究报告

作者信息

Charles, Liu charlesliu.cn.bj@gmail.com 安远国际
Derek, Zhang anonymous@anonymous.com 百度

前言

前段时间研究Peterson Lock的时候遇到了一些问题，就将问题抛给了前Mentor（Derek, Zhang）。经过两天的讨论，将所有的细节罗列在这份报告里面，对于所有的外部资料的引用，只会贴出链接，作者并不做任何形式的翻译或者转述，这样可以保证所有信息的准确性。

Peterson Locking Algorithm实现

关于Peterson Locking Algorithm，请参考Wiki：Peterson's algorithm

#ifndef _PETERSON_LOCK_H
#define _PETERSON_LOCK_H
/*
 * Description: implementing peterson's locking algorithm
 * File: peterson_lock.h
 * Author: Charles, Liu.
 * Mailto: charlesliu.cn.bj@gmail.com
 */
#include <pthread.h>

typedef struct {
    volatile bool flag[2];
    volatile int victim;
} peterson_lock_t;

void peterson_lock_init(peterson_lock_t &lock) {
    lock.flag[0] = lock.flag[1] = false;
    lock.victim = 0;
}

void peterson_lock(peterson_lock_t &lock, int id) {
    lock.flag[id] = true; 
    lock.victim = id; 
    while (lock.flag[1 - id] && lock.victim == id);
}

void peterson_unlock(peterson_lock_t &lock, int id) {
    lock.flag[id] = false;
    lock.victim = id;
}

#endif

以下是测试程序：

#include <stdio.h>
#include "peterson_lock.h"

peterson_lock_t lock;
int count = 0;

void *routine0(void *arg) {
    int *cnt = (int *)arg;
    for (int i = 0; i < *cnt; ++i) {
        peterson_lock(lock, 0);
        ++count;
        printf("count: %d", count);
        peterson_unlock(lock, 0);
    }

    return NULL;
}

void *routine1(void *arg) {
    int *cnt = (int *)arg;
    for (int i = 0; i < *cnt; ++i) {
        peterson_lock(lock, 1);
        ++count;
        peterson_unlock(lock, 1);
    }
}

int main(int argc, char **argv) {
    peterson_lock_init(lock);
    pthread_t thread0, thread1;
    int count0 = 10000;
    int count1 = 20000;
    pthread_create(&thread0, NULL, routine0, (void *)&count0);
    pthread_create(&thread1, NULL, routine1, (void *)&count1);

    pthread_join(thread0, NULL);
    pthread_join(thread1, NULL);

    printf("Expected: %d\n", (count0 + count1));
    printf("Reality : %d\n", count);

    return 0;
}

如果你编译运行以上的测试，会发现Peterson Lock的实现并没有达到目的，原因就是编译时期和运行时期都会有指令重排的问题。

Memory Ordering

这里我不打算讲Memory Ordering的具体细节，所有信息都在下面的链接里：

这里是Memory Ordering的Wiki -> Memory ordering
这里的一篇文章讲到了Memory Ordering和Peterson Lock的问题 -> Who ordered memory fences on an x86?
这里有更详细的关于Intel IA64 Memory Ordering的资料 -> Intel® 64 Architecture Memory Ordering White Paper
C++ 里Memory Order相关的资料(由于atomic本身就用到了原子操作，所以我并不打算使用) -> std::memory_order

请你务必仔细看完以上的三份资料，不然无法继续进行接下来的内容。
以上Peterson Lock实现失败的原因就显而易见了，博主的机器是Intel x86-64 smp的机器，拥有strict memory order，也就是只有在store&load的情况下才会有指令重拍的问题，而以上Peterson Lock的实现正好就是store&load的情况，为此我们必须避免CPU的指令重排。

关于Memory Fence

上节提到避免指令重排需要加Memory Fence，指令重排分为两种:

一种是编译时期的指令重排，可以通过这个来防止：asm volatile ("" : : : "memory")
一种是运行时期的cpu指令重排，同时也包含了防止编译时期的指令重排：asm volatile ("mfence" : : : "memory") or asm volatile ("lfence" : : : "memory") or asm volatile ("sfence" : : : "memory")

而Memory Fence分为三种：

mfence -> Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any SFENCE and LFENCE instructions, and any serializing instructions (such as the CPUID instruction). Weakly ordered memory types can be used to achieve higher processor performance through such techniques as out-of-order issue, speculative reads, write-combining, and write-collapsing. The degree to which a consumer of data recognizes or knows that the data is weakly ordered varies among applications and may be unknown to the producer of this data. The MFENCE instruction provides a performance-efficient way of ensuring load and store ordering between routines that produce weakly-ordered results and routines that consume that data. It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type that permits speculative reads (that is, the WB, WC, and WT memory types). The PREFETCHh instruction is considered a hint to this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, the MFENCE instruction is not ordered with respect to PREFETCHh instructions or any other speculative fetching mechanism (that is, data could be speculatively loaded into the cache just before, during, or after the execution of an MFENCE instruction).
lfence -> Performs a serializing operation on all load-from-memory instructions that were issued prior the LFENCE instruction. This serializing operation guarantees that every load instruction that precedes in program order the LFENCE instruction is globally visible before any load instruction that follows the LFENCE instruction is globally visible. The LFENCE instruction is ordered with respect to load instructions, other LFENCE instructions, any MFENCE instructions, and any serializing instructions (such as the CPUID instruction). It is not ordered with respect to store instructions or the SFENCE instruction. Weakly ordered memory types can be used to achieve higher processor performance through such techniques as out-of-order issue and speculative reads. The degree to which a consumer of data recognizes or knows that the data is weakly ordered varies among applications and may be unknown to the producer of this data. The LFENCE instruction provides a performance-efficient way of insuring load ordering between routines that produce weakly-ordered results and routines that consume that data. It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type that permits speculative reads (that is, the WB, WC, and WT memory types). The PREFETCHh instruction is considered a hint to this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, the LFENCE instruction is not ordered with respect to PREFETCHh instructions or any other speculative fetching mechanism (that is, data could be speculative loaded into the cache just before, during, or after the execution of an LFENCE instruction).
sfence -> Performs a serializing operation on all store-to-memory instructions that were issued prior the SFENCE instruction. This serializing operation guarantees that every store instruction that precedes in program order the SFENCE instruction is globally visible before any store instruction that follows the SFENCE instruction is globally visible. The SFENCE instruction is ordered with respect store instructions, other SFENCE instructions, any MFENCE instructions, and any serializing instructions (such as the CPUID instruction). It is not ordered with respect to load instructions or the LFENCE instruction. Weakly ordered memory types can be used to achieve higher processor performance through such techniques as out-of-order issue, write-combining, and write-collapsing. The degree to which a consumer of data recognizes or knows that the data is weakly ordered varies among applications and may be unknown to the producer of this data. The SFENCE instruction provides a performance-efficient way of insuring store ordering between routines that produce weakly-ordered results and routines that consume this data.

参考Intel x86-64的Memory Ordering的设计，其实lfence和sfence在保证Memory Ordering这点上是没有意义的，因为Intel x86-64本来就是strict memory order，但是内存可见性这个点上，lfence和sfence任然有其价值，由于Peterson Lock是为了避免store&load的指令重排，所以我们使用mfence，即Full Memory Fence。加上MFENCE之后的代码如下：

#ifndef _PETERSON_LOCK_H
#define _PETERSON_LOCK_H
/*
 * Description: implementing peterson's locking algorithm
 * File: peterson_lock.h
 * Author: Charles, Liu.
 * Mailto: charlesliu.cn.bj@gmail.com
 */
#include <pthread.h>

typedef struct {
    volatile bool flag[2];
    volatile int victim;
} peterson_lock_t;

void peterson_lock_init(peterson_lock_t &lock) {
    lock.flag[0] = lock.flag[1] = false;
    lock.victim = 0;
}

void peterson_lock(peterson_lock_t &lock, int id) {
    lock.flag[id] = true; // 子句A
    lock.victim = id; // 子句B，子句A和B的顺序很重要，如果调换的话会出现问题
    asm volatile ("mfence" : : : "memory"); // MFENCE加在store和load之间
    while (lock.flag[1 - id] && lock.victim == id);
}

void peterson_unlock(peterson_lock_t &lock, int id) {
    lock.flag[id] = false;
    lock.victim = id;
}

#endif

首先看一下MFENCE添加的位置是在store和load之间，其次思考这么一个问题，如果调换子句A和子句B的位置会出现什么情况呢？

peterson_lock_0:                       peterson_lock_1:
-------------------------------------------------------
lock.victim = 0;
                                      lock.victim = 1;
                                      lock.flag[1] = true;
                                      asm volatile ("mfence" : : : "memory");
                                      while (lock.flag[0] && lock.victim == 1);
                                      // lock.flag[0] is false so
                                      // the process enters critical
                                      // section
lock.flag[0] = true;
asm volatile ("mfence" : : : "memory");
while (lock.flag[1] && lock.victim == 0);
// lock.victim is 1 so
// the process enters critical
// section

Thread0和Thread1会同时进入Critical Section，再思考如下的情况，

Thread0:                                Thread1:
-------------------------------------------------------
lock.flag[0] = true;
lock.victim = 0;						
                                        lock.flag[1] = true;
                                        lock.victim = 1;
                                        mfence;
                                        如果这个时候Thread0的lock.victim = 0对于Thread1可见，
                                        那么会进入Critical Section
mfence;
如果这个时候Thread1的lock.victim = 1对于
Thread0可见，那么会进入Critical Section

以上情况并没有发生，为什么呢？请看mfence那段加粗的文字，这正是mfence的另一个作用，内存可见性。

关于volatile和GCC优化

如果去掉上面的两个volatile，不加优化，运行正常，void peterson_lock(peterson_lock_t &lock, int id)的汇编结果如下:

0000000000400638 <_Z13peterson_lockR15peterson_lock_ti>:
  400638:   55                      push   rbp
  400639:   48 89 e5                mov    rbp,rsp
  40063c:   48 89 7d f8             mov    QWORD PTR [rbp-0x8],rdi
  400640:   89 75 f4                mov    DWORD PTR [rbp-0xc],esi
  400643:   48 8b 55 f8             mov    rdx,QWORD PTR [rbp-0x8]
  400647:   8b 45 f4                mov    eax,DWORD PTR [rbp-0xc]
  40064a:   48 98                   cdqe
  40064c:   c6 04 02 01             mov    BYTE PTR [rdx+rax*1],0x1 ; Clause 1
  400650:   48 8b 45 f8             mov    rax,QWORD PTR [rbp-0x8]  ; Clause 2
  400654:   8b 55 f4                mov    edx,DWORD PTR [rbp-0xc]  ; Clause 3
  400657:   89 50 04                mov    DWORD PTR [rax+0x4],edx  ; Clause 4
  40065a:   0f ae f0                mfence
  40065d:   90                      nop
  40065e:   b8 01 00 00 00          mov    eax,0x1
  400663:   2b 45 f4                sub    eax,DWORD PTR [rbp-0xc]
  400666:   48 8b 55 f8             mov    rdx,QWORD PTR [rbp-0x8]
  40066a:   48 98                   cdqe
  40066c:   0f b6 04 02             movzx  eax,BYTE PTR [rdx+rax*1]
  400670:   84 c0                   test   al,al
  400672:   74 0c                   je     400680 <_Z13peterson_lockR15peterson_lock_ti+0x48>
  400674:   48 8b 45 f8             mov    rax,QWORD PTR [rbp-0x8]
  400678:   8b 40 04                mov    eax,DWORD PTR [rax+0x4]
  40067b:   3b 45 f4                cmp    eax,DWORD PTR [rbp-0xc]
  40067e:   74 de                   je     40065e <_Z13peterson_lockR15peterson_lock_ti+0x26>
  400680:   5d                      pop    rbp
  400681:   c3                      ret

请看上述汇编中标注出来的Clause 1-4，正好对应下面的两句话：

lock.flag[id] = true;
lock.victim = id;

但是值并没有存在寄存器里面，也并没有直接从寄存器读取，而是直接读取内存。

如果我们加上-O2再编译一次，得到的void peterson_lock(peterson_lock_t &lock, int id)的汇编结果如下:

00000000004007a0 <_Z13peterson_lockR15peterson_lock_ti>:
  4007a0:   48 63 c6                movsxd rax,esi
  4007a3:   c6 04 07 01             mov    BYTE PTR [rdi+rax*1],0x1 ; Clause 1
  4007a7:   89 77 04                mov    DWORD PTR [rdi+0x4],esi  ; Clause 2
  4007aa:   0f ae f0                mfence
  4007ad:   b8 01 00 00 00          mov    eax,0x1
  4007b2:   29 f0                   sub    eax,esi
  4007b4:   48 98                   cdqe
  4007b6:   80 3c 07 00             cmp    BYTE PTR [rdi+rax*1],0x0
  4007ba:   75 04                   jne    4007c0 <_Z13peterson_lockR15peterson_lock_ti+0x20>
  4007bc:   f3 c3                   repz ret
  4007be:   66 90                   xchg   ax,ax
  4007c0:   39 77 04                cmp    DWORD PTR [rdi+0x4],esi
  4007c3:   75 f7                   jne    4007bc <_Z13peterson_lockR15peterson_lock_ti+0x1c>
  4007c5:   39 77 04                cmp    DWORD PTR [rdi+0x4],esi
  4007c8:   74 f6                   je     4007c0 <_Z13peterson_lockR15peterson_lock_ti+0x20>
  4007ca:   eb f0                   jmp    4007bc <_Z13peterson_lockR15peterson_lock_ti+0x1c>
  4007cc:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

请注意上面的Clause 1-2，发现没，lock.flag[id]变为了直接赋值，而lock.victim的值变为从esi寄存器读，由于现在CPU架构大多都是NUMA架构，我的也是，各个CPU独享自己的寄存器，然后gdb -tui [our program]，break peterson_lock然后运行发现，直接跳转如下：
并没有peterson_lock函数，也就是-O2直接将它inline掉了。如果运行程序会发现程序直接死锁了，下面我会给出解释的，先让我们看看-O3编译出的版本：

00000000004007a0 <_Z13peterson_lockR15peterson_lock_ti>:
  4007a0:   48 63 c6                movsxd rax,esi
  4007a3:   c6 04 07 01             mov    BYTE PTR [rdi+rax*1],0x1
  4007a7:   89 77 04                mov    DWORD PTR [rdi+0x4],esi
  4007aa:   0f ae f0                mfence
  4007ad:   b8 01 00 00 00          mov    eax,0x1
  4007b2:   29 f0                   sub    eax,esi
  4007b4:   48 98                   cdqe
  4007b6:   80 3c 07 00             cmp    BYTE PTR [rdi+rax*1],0x0
  4007ba:   74 05                   je     4007c1 <_Z13peterson_lockR15peterson_lock_ti+0x21>
  4007bc:   3b 77 04                cmp    esi,DWORD PTR [rdi+0x4]
  4007bf:   74 07                   je     4007c8 <_Z13peterson_lockR15peterson_lock_ti+0x28>
  4007c1:   f3 c3                   repz ret
  4007c3:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
  4007c8:   eb fe                   jmp    4007c8 <_Z13peterson_lockR15peterson_lock_ti+0x28>
  4007ca:   66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]

和-O2的版本并没有多大差别，如果运行的话还是会死锁。为什么会死锁呢？解释如下：

由于lock.victim的值直接从寄存器读取，那么lock.victim == id 的条件恒成立

Thread0                                         Thread1
------------------------------------------------------------------
lock.flag[0] = true;                            lock.flag[1] = true;
lock.victim = 0;                                lock.victim = 1;
mfence;                                         mfence;
while (lock.flag[1] && lock.victim == 0);       while (lock.flag[0] && lock.victim == 1);
同时进入死锁

其实由于Thread0写lock.flag[0]，读lock.flag[1]；Thread1写lock.flag[1]，读lock.flag[0]，即各自读写不同的内存，所以bool lock.flag[2]之前的volatile是可以去掉的，但是volatile int lock.victim前面的volatile是必须得保留的。
总结一下，volatile修饰的作用就是避免在Hardware层面上，变量的值会被写进寄存器，这样也就不会从寄存器读取了。

本报告的简略PDF

Peterson Locking Algorithm Report

小结

满满的干货，Good Luck, Have Fun!!!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Parallel Programming]Peterson Lock 研究报告

[Parallel Programming]Peterson Lock 研究报告

作者信息

前言

Peterson Locking Algorithm实现

Memory Ordering

关于Memory Fence

关于volatile和GCC优化

本报告的简略PDF

小结

Clone this wiki locally