OpenMP barrier slowness in neural network examples with high core count. #485
Comments
So at first I thought it was … But then for this script:

```nim
import ../src/arraymancer
# Learning XOR function with a neural network.

proc main() =
  # Autograd context / neuralnet graph
  let ctx = newContext Tensor[float32]
  let bsz = 32 # batch size

  let x_train_bool = randomTensor([bsz * 100, 2], 1).astype(bool)
  let y_bool = x_train_bool[_,0] xor x_train_bool[_,1]
  let x_train = ctx.variable(x_train_bool.astype(float32))
  let y = y_bool.astype(float32)

  # We will build the following network:
  # Input --> Linear(out_features = 3) --> relu --> Linear(out_features = 1) --> Sigmoid --> Cross-Entropy Loss
  let layer_3neurons = ctx.variable(
    randomTensor(3, 2, 2.0f) -. 1.0f,
    requires_grad = true
  )
  let classifier_layer = ctx.variable(
    randomTensor(1, 3, 2.0f) -. 1.0f,
    requires_grad = true
  )

  # Stochastic Gradient Descent
  let optim = newSGD[float32](
    layer_3neurons, classifier_layer, 0.01f
  )

  # Learning loop
  for epoch in 0..10000:
    for batch_id in 0..<100:
      # minibatch offset in the Tensor
      let offset = batch_id * 32
      let x = x_train[offset ..< offset + 32, _]
      let target = y[offset ..< offset + 32, _]

      # Building the network
      let n1 = relu linear(x, layer_3neurons)
      let n2 = linear(n1, classifier_layer)
      let loss = n2.sigmoid_cross_entropy(target)

main()
```

We have: … which I suspected might be due to nested parallelism here: … But even rewritten with map-reduce fusion, and ultimately even run serially, we're still stuck in those barriers for seemingly no reason.
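For context, each OpenMP `parallel for` loop ends with an implicit barrier, so chaining many small per-operation loops means one synchronization per operation, and the wait grows with the number of cores that have to meet at each barrier. Below is a minimal sketch of that difference, using plain Nim's stdlib `||` OpenMP iterator and made-up proc names; it is an illustration only, not Arraymancer internals:

```nim
# Hypothetical illustration only, not Arraymancer code.
# Compile with OpenMP enabled, e.g.:
#   nim c -d:danger --passC:-fopenmp --passL:-fopenmp barriers.nim
# Without the OpenMP flags the `||` loops simply run serially.

proc addInPlace(x: var seq[float32], y: seq[float32]) =
  # One OpenMP `parallel for` region; all threads hit an implicit barrier
  # when the loop ends.
  for i in 0 || (x.len - 1):
    x[i] = x[i] + y[i]

proc reluInPlace(x: var seq[float32]) =
  # A second, separate parallel region with its own implicit barrier.
  for i in 0 || (x.len - 1):
    x[i] = max(x[i], 0.0'f32)

proc fusedAddRelu(x: var seq[float32], y: seq[float32]) =
  # "Fused" version: both element-wise ops in a single parallel region,
  # so the threads synchronize only once.
  for i in 0 || (x.len - 1):
    x[i] = max(x[i] + y[i], 0.0'f32)

when isMainModule:
  var x = newSeq[float32](32 * 100)
  let y = newSeq[float32](32 * 100)
  addInPlace(x, y)    # barrier 1
  reluInPlace(x)      # barrier 2
  fusedAddRelu(x, y)  # one barrier for the same work
```

Nested parallelism compounds this: if one of these procs is itself called from inside another `||` loop, the inner region either runs on a single thread (nesting disabled, the OpenMP default) or spawns yet another thread team with its own barrier.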
Then it seemed like …

Conclusion

I'm unsure if the issue is misuse of OpenMP parallel sections or tied to OpenMP's design, but we are clearly reaching its limits. I suspect Facebook came to the same conclusion in PyTorch and introduced their C10 threadpool, as did Halide with their custom threadpool. AKA, we likely need to introduce Weave sooner rather than later, as OpenMP doesn't cut it and is also hard to debug/profile (and a pain to install on Mac).
The following bench, reduced to only call `linear` (which is just a thin wrapper around BLAS), takes 1.6s without `-d:openmp` and 15s with `-d:openmp`. It seems like the machine stalls on OpenMP barriers. Also, it seems like the more cores you have, the more problematic it is.
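For reference, a reduced bench along those lines could look like the sketch below. This is a reconstruction under assumptions, not the exact script from this issue: it reuses the XOR example's shapes and Arraymancer calls shown above (`newContext`, `variable`, `randomTensor`, `-.`, variable slicing, `linear`) and times the loop with `epochTime` from `std/times`; only the `linear` forward pass is left in the hot loop, with no backward pass or optimizer step.

```nim
import arraymancer   # or ../src/arraymancer when building from the repo
import std/times

proc bench() =
  let ctx = newContext Tensor[float32]
  let bsz = 32

  # Same shapes and random data as the XOR example above.
  let x_train_bool = randomTensor([bsz * 100, 2], 1).astype(bool)
  let x_train = ctx.variable(x_train_bool.astype(float32))
  let layer_3neurons = ctx.variable(
    randomTensor(3, 2, 2.0f) -. 1.0f,
    requires_grad = true
  )

  let start = epochTime()
  for epoch in 0 .. 10000:
    for batch_id in 0 ..< 100:
      let offset = batch_id * bsz
      let x = x_train[offset ..< offset + bsz, _]
      # Only the thin BLAS wrapper (gemm) is exercised in the hot loop.
      discard linear(x, layer_3neurons)
  echo "elapsed: ", epochTime() - start, " s"

bench()

# Compare: nim c -r -d:release bench.nim
# versus:  nim c -r -d:release -d:openmp bench.nim
```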