Fix for possible RNG offset calculation bug in cuda vectorized dropout with VEC=2 (pytorch#50110)

Summary:
The [offset calculation](https://github.com/pytorch/pytorch/blob/e3c56ddde67ca1a49159ffa886d889b6e65c7033/aten/src/ATen/native/cuda/Dropout.cu#L328) (which gives an estimated ceiling on the maximum number of 32-bit values in the philox sequence that any thread in the launch will use) uses the hardcoded UNROLL value of 4, and assumes the hungriest threads can use every value (.x, .y, .z, and .w) their curand_uniform4 calls provide.  However, the way fused_dropout_kernel_vec is currently written, that assumption isn't true in the VEC=2 case:  on each iteration of the `grid x VEC` stride loop, each thread calls curand_uniform4 once, uses rand.x and rand.y, and discards rand.z and rand.w.  This means (I _think_) curand_uniform4 may be called twice as many times per thread in the VEC=2 case as in the VEC=4 case or the fully unrolled code path, which means the offset calculation (a good estimate for the latter two cases) is probably wrong for the `fused_dropout_kernel_vec<..., /*VEC=*/2>` code path.

The present PR inserts some value-reuse in fused_dropout_kernel_vec to align the number of times curand_uniform4 is called for launches with the same totalElements in the VEC=2 and VEC=4 cases.  The diff should
- make the offset calculation valid for all code paths
- provide a very small perf boost by reducing the number of curand_uniform4 calls in the VEC=2 path
- ~~make results bitwise identical for all code paths~~ nvm, tensor elements are assigned to threads differently in the unrolled, VEC=2, and VEC=4 cases, so we're screwed here no matter what.

ngimel what do you think?

Pull Request resolved: pytorch#50110

Reviewed By: smessmer

Differential Revision: D25790121

Pulled By: ngimel

fbshipit-source-id: f8f533ad997268c6f323cf4d225de547144247a8
mcarilli authored and hwangdeyu committed Jan 14, 2021
1 parent d9a70c2 commit 8d6c093
Showing 1 changed file with 18 additions and 3 deletions: aten/src/ATen/native/cuda/Dropout.cu
```diff
@@ -57,6 +57,12 @@ fused_dropout_kernel_vec(at::cuda::detail::TensorInfo<scalar_t, IndexType> a,
 
   accscalar_t pinv = accscalar_t(1)/p;
 
+  // Helps align the total number of times curand_uniform4 is called by each thread for the same totalElements
+  // in the vec=2 and vec=4 cases.
+  bool gridxvec_loop_state = 0;
+
+  float4 rand;
+
   // Note: Vectorized loads means we'll stride each thread by an additional VEC factor, as we'll load VEC elements at a time
   for (IndexType linearIndex = idx * VEC;
        linearIndex < totalElements;
@@ -69,12 +75,21 @@ fused_dropout_kernel_vec(at::cuda::detail::TensorInfo<scalar_t, IndexType> a,
     //curand_uniform_double was pure evil anyway, not doing what it promises, and there's nothing for halfs, so generate float for everything
     // Note: need a new set of random values per 4 elements -- we'll handle VEC elements in this thread, so need ceil(VEC / 4)
     // sets of rand.
-    float4 rand = curand_uniform4(&state);
+    if ((VEC == 4) || (gridxvec_loop_state == 0)) {
+      rand = curand_uniform4(&state);
+    } else {
+      // sets up the last two values we generated last iteration to be used this iteration.
+      rand.x = rand.z;
+      rand.y = rand.w;
+      gridxvec_loop_state ^= 1;
+    }
 
     rand.x = rand.x < p;
     rand.y = rand.y < p;
-    rand.z = rand.z < p;
-    rand.w = rand.w < p;
+    if (VEC == 4) {
+      rand.z = rand.z < p;
+      rand.w = rand.w < p;
+    }
 
     // Note: We explicitly check for is_contiguous() before launching the vectorized kernel
     // and replace IndexToOffset call with linearIndex to allow vectorization of NHWC (or other)
```
