
Gunrock races reported in PLDI 2018 paper #344

Open
jowens opened this issue Jun 26, 2018 · 6 comments

@jowens
Contributor

jowens commented Jun 26, 2018

"CURD: a dynamic CUDA race detector", https://dl.acm.org/citation.cfm?id=3192368

Yuanfeng Peng (@stilltracy) reported the following two races (the relevant line numbers are marked in comments in the snippets below):

On ../../gunrock/util/soa_tuple.cuh, line 136:

    template <typename TupleSlice>
    __host__ __device__ __forceinline__ void Set(
        const TupleSlice &tuple,
        const int row,
        const int col)
    {   
        t0[row][col] = tuple.t0; // line 136
        t1[row][col] = tuple.t1;
    }

and ../../gunrock/util/scan/warp_scan.cuh, line 159:

    template <typename T, typename WarpscanT, typename ReductionOp>
    static __device__ __forceinline__ T Invoke(
        T current_partial,                          // Input partial
        T &total_reduction,                         // Total reduction (out param)
        WarpscanT warpscan[][NUM_ELEMENTS],     // Smem for warpscanning.  Contains at least two segments of size NUM_ELEMENTS (the first being initialized to identity)
        ReductionOp scan_op,                        // Scan operator
        int warpscan_tid = threadIdx.x)             // Thread's local index into a segment of NUM_ELEMENTS items
    {   
        const int WIDTH = 1 << STEPS;
        T inclusive_partial = Iterate<1, WIDTH>::Invoke( // line 159
            current_partial,
            warpscan,
            scan_op,
            warpscan_tid);

@sgpyc can you take a look? @yzhwang it would be helpful if you looked as well.

@jowens
Contributor Author

jowens commented Jun 27, 2018

New info from @stilltracy!

Hi John,

I re-ran CURD on bfs and re-collected the race reports, and it seems the line numbers I sent you previously were outdated and didn't correspond to the right source locations (since CURD instruments the source files), sorry!

Below are the up-to-date reports from CURD:

intra-block(SMem) (W,R) race by 0 and 1 on < 260, 4>, between (../../gunrock/util/soa_tuple.cuh,line 136) and (../../gunrock/util/soa_tuple.cuh,line 154)

    template <typename TupleSlice>
    __host__ __device__ __forceinline__ void Set(
        const TupleSlice &tuple,
        const int row,
        const int col)
    {
        t0[row][col] = tuple.t0;     // 136
        t1[row][col] = tuple.t1;
    }

    template <typename TupleSlice>
    __host__ __device__ __forceinline__ void Get(
        TupleSlice &retval,
        const int row,
        const int col) const
    {
        retval = TupleSlice(t0[row][col], t1[row][col]);     // 154
    }

The race seems to be on accesses to t0[row][col], by threads n and n+1.
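
For illustration, here is a minimal standalone sketch of the flagged pattern (hypothetical kernel and array names, not the Gunrock code): lane t writes its partial into the two shared SOA arrays via Set, then reads lane t-1's slot via Get. Without a warp-level barrier between the write and the read, lane t's read of the neighbor's slot races with lane t-1's write of that same slot, which matches a conflict between threads n and n+1 on t0[row][col].

    // Minimal sketch of the Set/Get neighbor-access pattern (hypothetical
    // names; int payloads stand in for the tuple's t0/t1 slices).
    __global__ void soa_neighbor_read(int *out)
    {
        __shared__ int smem_t0[32];   // stand-in for the t0 slice
        __shared__ int smem_t1[32];   // stand-in for the t1 slice

        int lane = threadIdx.x;
        int t0 = lane, t1 = 2 * lane;

        // "Set": store this lane's partial (cf. soa_tuple.cuh:136)
        smem_t0[lane] = t0;
        smem_t1[lane] = t1;

        // Without this barrier, the reads below race with the neighbor's
        // writes above on GPUs with independent thread scheduling (Volta+).
        __syncwarp();

        if (lane >= 1) {
            // "Get": read the neighbor's partial (cf. soa_tuple.cuh:154)
            t0 += smem_t0[lane - 1];
            t1 += smem_t1[lane - 1];
        }

        out[lane] = t0 + t1;
    }

    int main()
    {
        int *d_out;
        cudaMalloc(&d_out, 32 * sizeof(int));
        soa_neighbor_read<<<1, 32>>>(d_out);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }

The original code appears to rely on implicit warp-synchronous execution plus volatile shared memory rather than an explicit barrier, which is the kind of pattern a dynamic detector will flag even when the lockstep assumption happens to hold on a given GPU.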

Another one is inside ../../gunrock/util/scan/soa/warp_soa_scan.cuh:

    template <int OFFSET_LEFT, int WIDTH>
    struct Iterate
    {
        // Scan
        template <
            typename Tuple,
            typename WarpscanSoa,
            typename ReductionOp>
        static __device__ __forceinline__ Tuple Scan(
            Tuple partial,
            WarpscanSoa warpscan_partials,
            ReductionOp scan_op,
            int warpscan_tid)
        {
            // Store exclusive partial
            warpscan_partials.Set(partial, 1, warpscan_tid);

            if (!WarpscanSoa::VOLATILE) __threadfence_block();

            if (ReductionOp::IDENTITY_STRIDES || (warpscan_tid >= OFFSET_LEFT)) {

                // Load current partial
                Tuple current_partial;
                warpscan_partials.Get(current_partial, 1, warpscan_tid - OFFSET_LEFT);     // 73, Read

                // Compute inclusive partial from exclusive and current partials
                partial = scan_op(current_partial, partial);                               // 76, Write
            }

This one is tricky, and I'm not 100% sure about it. I'll give my two cents here: line 73 reads from current_partial and the scan_op() on line 76 writes to it.
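
One reading (a sketch, not a verified diagnosis): the conflicting pair is more likely lane t's Get() of slot [warpscan_tid - OFFSET_LEFT] and the neighboring lane's Set() of that same slot; __threadfence_block() only orders a thread's own memory accesses and does not synchronize execution across lanes. Below is a minimal sketch of one scan step with a hedged fix (hypothetical names, int payload instead of a tuple; __syncwarp() requires CUDA 9+):

    // One step of a warp-level inclusive (Hillis-Steele) scan, using warp
    // barriers instead of relying on lockstep execution.
    #include <cstdio>

    template <int OFFSET_LEFT>
    __device__ int scan_step(int partial, volatile int *warpscan, int lane)
    {
        // Store this lane's partial (analogous to warpscan_partials.Set(...))
        warpscan[lane] = partial;

        // A fence alone is not enough; synchronize the warp before reading
        // a slot written by another lane.
        __syncwarp();

        if (lane >= OFFSET_LEFT) {
            // Load the neighbor's partial (analogous to warpscan_partials.Get(...))
            int current = warpscan[lane - OFFSET_LEFT];
            partial = current + partial;   // stand-in for scan_op(current, partial)
        }

        // Barrier again so the next step's Set() doesn't overwrite a slot
        // that a lagging lane is still reading.
        __syncwarp();
        return partial;
    }

    __global__ void scan_kernel(int *out)
    {
        __shared__ int warpscan[32];
        int lane = threadIdx.x;
        int partial = 1;                  // each lane contributes 1
        partial = scan_step<1>(partial, warpscan, lane);
        partial = scan_step<2>(partial, warpscan, lane);
        partial = scan_step<4>(partial, warpscan, lane);
        partial = scan_step<8>(partial, warpscan, lane);
        partial = scan_step<16>(partial, warpscan, lane);
        out[lane] = partial;              // inclusive scan: out[lane] == lane + 1
    }

    int main()
    {
        int h_out[32], *d_out;
        cudaMalloc(&d_out, sizeof(h_out));
        scan_kernel<<<1, 32>>>(d_out);
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
        printf("out[31] = %d (expect 32)\n", h_out[31]);
        cudaFree(d_out);
        return 0;
    }

Whether the original code needs this on pre-Volta GPUs depends on whether the warp really executes in lockstep there; the barrier version is the conservative choice.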

@sgpyc
Member

sgpyc commented Jun 28, 2018

The code pointed out is for warp/block-level scan and iteration; it is quite deep, and is used by the TWC_advance and CULL_filter operators. The data races are very difficult to detect. If @stilltracy has the conditions (dataset + command-line parameters) or any other info that can narrow down when the race could happen, it will save us a lot of time.

By the way, do the data races from CUB occur in similar operations?

@jowens
Contributor Author

jowens commented Jun 28, 2018

@stilltracy these are theoretical races, right? You didn't find them in practice; this is through program analysis?

@neoblizz
Member

Would this also be useful on the new dev-refactor branch with refactored code?

@stilltracy

@jowens these are actual races that were caught by CURD on a real execution.
@sgpyc I ran test/bfs with rgg and observed the reported races. Unfortunately, CURD currently can't give more detailed information about the call sites, such as call stacks.
Yeah, some of the data races in CUB are also on warp/block-level scans and similar operations.

@jowens
Contributor Author

jowens commented Jun 29, 2018

@sgpyc I don't know this low-level code, but is it possible to look at the code itself and figure out where the race might be, just from first principles?
