OpenMP barrier slowness in neural network examples with high core count. #485
Comments
So at first I thought it was … But then for this script:

```nim
import ../src/arraymancer
# Learning XOR function with a neural network.

proc main() =
  # Autograd context / neuralnet graph
  let ctx = newContext Tensor[float32]
  let bsz = 32 # batch size

  let x_train_bool = randomTensor([bsz * 100, 2], 1).astype(bool)
  let y_bool = x_train_bool[_,0] xor x_train_bool[_,1]
  let x_train = ctx.variable(x_train_bool.astype(float32))
  let y = y_bool.astype(float32)

  # We will build the following network:
  # Input --> Linear(out_features = 3) --> relu --> Linear(out_features = 1) --> Sigmoid --> Cross-Entropy Loss
  let layer_3neurons = ctx.variable(
    randomTensor(3, 2, 2.0f) -. 1.0f,
    requires_grad = true
  )
  let classifier_layer = ctx.variable(
    randomTensor(1, 3, 2.0f) -. 1.0f,
    requires_grad = true
  )

  # Stochastic Gradient Descent
  let optim = newSGD[float32](
    layer_3neurons, classifier_layer, 0.01f
  )

  # Learning loop
  for epoch in 0..10000:
    for batch_id in 0..<100:
      # minibatch offset in the Tensor
      let offset = batch_id * 32
      let x = x_train[offset ..< offset + 32, _]
      let target = y[offset ..< offset + 32, _]

      # Building the network
      let n1 = relu linear(x, layer_3neurons)
      let n2 = linear(n1, classifier_layer)
      let loss = n2.sigmoid_cross_entropy(target)

main()
```

We have: … which I suspected might be due to nested parallelism here: … But even rewritten with map-reduce fusion, and ultimately even run serially, we're still stuck in those barriers for seemingly no reason.
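For context, each OpenMP `parallel for` loop ends with an implicit barrier, so chaining many small per-operation loops means one synchronization per operation, and the wait grows with the number of cores that have to meet at each barrier. Below is a minimal sketch of that difference, using plain Nim's stdlib `||` OpenMP iterator and made-up proc names; it is an illustration only, not Arraymancer internals:

```nim
# Hypothetical illustration only, not Arraymancer code.
# Compile with OpenMP enabled, e.g.:
#   nim c -d:danger --passC:-fopenmp --passL:-fopenmp barriers.nim
# Without the OpenMP flags the `||` loops simply run serially.

proc addInPlace(x: var seq[float32], y: seq[float32]) =
  # One OpenMP `parallel for` region; all threads hit an implicit barrier
  # when the loop ends.
  for i in 0 || (x.len - 1):
    x[i] = x[i] + y[i]

proc reluInPlace(x: var seq[float32]) =
  # A second, separate parallel region with its own implicit barrier.
  for i in 0 || (x.len - 1):
    x[i] = max(x[i], 0.0'f32)

proc fusedAddRelu(x: var seq[float32], y: seq[float32]) =
  # "Fused" version: both element-wise ops in a single parallel region,
  # so the threads synchronize only once.
  for i in 0 || (x.len - 1):
    x[i] = max(x[i] + y[i], 0.0'f32)

when isMainModule:
  var x = newSeq[float32](32 * 100)
  let y = newSeq[float32](32 * 100)
  addInPlace(x, y)    # barrier 1
  reluInPlace(x)      # barrier 2
  fusedAddRelu(x, y)  # one barrier for the same work
```

Nested parallelism compounds this: if one of these procs is itself called from inside another `||` loop, the inner region either runs on a single thread (nesting disabled, the OpenMP default) or spawns yet another thread team with its own barrier.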
Then it seemed like …

Conclusion

I'm unsure if the issue is misuse of OpenMP parallel sections or tied to OpenMP's design, but we are clearly reaching its limits. I suspect Facebook came to the same conclusion in PyTorch and introduced their C10 threadpool, as did Halide with their custom threadpool. AKA, we likely need to introduce Weave sooner rather than later, as OpenMP doesn't cut it and is also hard to debug/profile (and a pain to install on Mac).
The following bench, reduced to only call `linear` (which is just a thin wrapper around BLAS), takes 1.6s without `-d:openmp` and 15s with `-d:openmp`. It seems like the machine stalls on OpenMP barriers. Also, it seems like the more cores you have, the more problematic it is.
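For reference, a reduced bench along those lines could look like the sketch below. This is a reconstruction under assumptions, not the exact script from this issue: it reuses the XOR example's shapes and Arraymancer calls shown above (`newContext`, `variable`, `randomTensor`, `-.`, variable slicing, `linear`) and times the loop with `epochTime` from `std/times`; only the `linear` forward pass is left in the hot loop, with no backward pass or optimizer step.

```nim
import arraymancer   # or ../src/arraymancer when building from the repo
import std/times

proc bench() =
  let ctx = newContext Tensor[float32]
  let bsz = 32

  # Same shapes and random data as the XOR example above.
  let x_train_bool = randomTensor([bsz * 100, 2], 1).astype(bool)
  let x_train = ctx.variable(x_train_bool.astype(float32))
  let layer_3neurons = ctx.variable(
    randomTensor(3, 2, 2.0f) -. 1.0f,
    requires_grad = true
  )

  let start = epochTime()
  for epoch in 0 .. 10000:
    for batch_id in 0 ..< 100:
      let offset = batch_id * bsz
      let x = x_train[offset ..< offset + bsz, _]
      # Only the thin BLAS wrapper (gemm) is exercised in the hot loop.
      discard linear(x, layer_3neurons)
  echo "elapsed: ", epochTime() - start, " s"

bench()

# Compare: nim c -r -d:release bench.nim
# versus:  nim c -r -d:release -d:openmp bench.nim
```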