poor scaling performance in CPU multi-node #2739

Closed
yma11 opened this issue Mar 23, 2021 · 11 comments

yma11 commented Mar 23, 2021

Environment:

  1. Framework: TensorFlow
  2. Framework version: 2.4.1
  3. Horovod version: v0.21.3
  4. MPI version: 4.0.0
  5. CUDA version: N/A
  6. NCCL version:
  7. Python version:
  8. OS and version:
  9. GCC version:

Checklist:

  1. Did you search issues to find if somebody asked this question before?
  2. If your question is about a hang, did you read this doc?
  3. If your question is about docker, did you read this doc?
  4. Did you check if your question is answered in the troubleshooting guide?

Your question:
I ran the benchmark examples/tensorflow2/tensorflow2_synthetic_benchmark.py to measure scaling performance, but per-worker throughput drops as I scale up. What could the possible root causes be?

horovodrun -np 1 -H localhost:1 python tensorflow2_synthetic_benchmark.py
[1,0]:Iter #0: 11.5 img/sec per GPU
[1,0]:Iter #1: 11.5 img/sec per GPU
[1,0]:Iter #2: 11.6 img/sec per GPU
[1,0]:Iter #3: 11.8 img/sec per GPU
[1,0]:Iter #4: 11.7 img/sec per GPU
[1,0]:Iter #5: 11.8 img/sec per GPU
[1,0]:Iter #6: 11.6 img/sec per GPU
[1,0]:Iter #7: 12.0 img/sec per GPU
[1,0]:Iter #8: 12.1 img/sec per GPU
[1,0]:Iter #9: 12.1 img/sec per GPU
[1,0]:Img/sec per GPU: 11.8 +-0.4

Scaled to 4 processes on a single node, horovodrun -np 4 -H localhost:4 python tensorflow2_synthetic_benchmark.py:
[1,0]:Iter #0: 6.9 img/sec per GPU
[1,0]:Iter #1: 6.9 img/sec per GPU

Scaled to 8 processes across 2 nodes, horovodrun -np 8 -H localhost:4,sr225:4 python tensorflow2_synthetic_benchmark.py:
[1,0]:Iter #0: 6.7 img/sec per GPU
[1,0]:Iter #1: 6.6 img/sec per GPU
[1,0]:Iter #2: 6.8 img/sec per GPU
[1,0]:Iter #3: 6.6 img/sec per GPU

Thanks!

yma11 added the question label Mar 23, 2021

yma11 commented Mar 23, 2021

From the timeline, I can see that the wall duration of allgather and allreduce is more than 100 ms. Is that normal? The network throughput is only about 1.5 Gb/s, while my network is 100 Gb/s.

chongxiaoc (Collaborator) commented:

> From the timeline, I can see that the wall duration of allgather and allreduce is more than 100 ms. Is that normal? The network throughput is only about 1.5 Gb/s, while my network is 100 Gb/s.

It seems that you have a high-speed fabric other than plain Ethernet? In that case, you have to use mpirun explicitly to select the fabric rather than defaulting to TCP. See: https://horovod.readthedocs.io/en/stable/mpi_include.html
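(For reference, with Open MPI the fabric/interface can be selected on the mpirun command line; a sketch, assuming Open MPI is the MPI implementation and using eth0 as a placeholder for the fast interface:)

    mpirun -np 8 -H localhost:4,sr225:4 \
        -bind-to none -map-by slot \
        --mca btl_tcp_if_include eth0 \
        -x PATH -x LD_LIBRARY_PATH \
        python tensorflow2_synthetic_benchmark.py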

yma11 commented Mar 24, 2021

@chongxiaoc Oh, sorry, my network is actually only up to 25 Gb/s. I am not using RDMA.

chongxiaoc (Collaborator) commented:

How many CPU cores per node?
You are scaling with 4 ranks on a single node, which could cause thread oversubscription, since you are doing CPU training.
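(One quick way to rule oversubscription in or out is to cap the threads each rank can use at launch time; a sketch with Open MPI, where 24 is a placeholder — roughly cores_per_node divided by local ranks — and OMP_NUM_THREADS mainly matters for MKL/oneDNN builds of TensorFlow:)

    mpirun -np 4 -H localhost:4 \
        -bind-to none -map-by slot \
        -x OMP_NUM_THREADS=24 \
        python tensorflow2_synthetic_benchmark.py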

yma11 commented Mar 24, 2021

96 vCores per node, which should be quite enough.

chongxiaoc (Collaborator) commented:

Did you try fixing the number of torch threads per rank? I mean this example:
https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html

I think we have to fix the number of threads per rank first, and then look at the scalability.

yma11 commented Mar 24, 2021

Is there a similar interface in TensorFlow to fix the number of threads? I am using TensorFlow 2.4.1.

chongxiaoc (Collaborator) commented:

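(For reference, TensorFlow 2.x exposes per-process thread-pool controls under tf.config.threading; a minimal sketch, assuming it runs before any TF ops execute — the thread counts are placeholders to tune per rank:)

    import tensorflow as tf

    # Cap the per-process thread pools; call this before any TF ops run.
    # 24 / 2 are placeholders: aim intra-op at roughly physical_cores / local_ranks.
    tf.config.threading.set_intra_op_parallelism_threads(24)
    tf.config.threading.set_inter_op_parallelism_threads(2)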

yma11 commented Mar 24, 2021

@chongxiaoc Thanks very much for your info. By the way, besides this example, what I really want to run is Wide & Deep (W&D) training with TensorFlow; the key parts of the script are below:

    # imports assumed from context (not shown in the excerpt)
    import tensorflow as tf
    import horovod.tensorflow as hvd

    # shard dataset for each process
    dataset = tf.data.TextLineDataset(filename)
    dataset = dataset.shard(hvd.size(), hvd.rank())
    dataset = dataset.shuffle(buffer_size=20000)
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.prefetch(batch_size)
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(parse_csv, num_parallel_calls=28)
    dataset = dataset.prefetch(1)
    # wrap both optimizers (wide and deep) in Horovod's DistributedOptimizer
    opt = tf.keras.optimizers.Adagrad(0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)
    opt_lr = tf.keras.optimizers.Ftrl(0.01 * hvd.size())
    opt_lr = hvd.DistributedOptimizer(opt_lr)
    m = tf.estimator.DNNLinearCombinedClassifier(
        config=runconfig,
        model_dir=model_dir,
        linear_optimizer=opt_lr,
        dnn_optimizer=opt,
        linear_feature_columns=wide_columns,
        dnn_feature_columns=deep_columns,
        dnn_hidden_units=[1024, 512, 256])
    # train (build_estimator presumably wraps the construction above in the full script)
    m = build_estimator(model_type, checkpoint_dir, train_file, test_file)
    bcast_hook = hvd.BroadcastGlobalVariablesHook(0)
    m.train(input_fn=lambda: generate_input_fn(
        train_file, batch_size, int(no_of_epochs)),
        hooks=[bcast_hook], steps=int(train_steps) // hvd.size())

Besides the scaling problem, there is another strange phenomenon: with np=1, the single process can use up to 50 cores, but when scaling to 2 nodes with np=8, each process only uses ~4 cores. Why does this happen? Maybe because each process handles less data when np is larger?

chongxiaoc (Collaborator) commented:

> Besides the scaling problem, there is another strange phenomenon: with np=1, the single process can use up to 50 cores, but when scaling to 2 nodes with np=8, each process only uses ~4 cores. Why does this happen? Maybe because each process handles less data when np is larger?

I'm not sure I understand this question.

  • the single process can use up to 50 cores?
    You mentioned the machine has 96 vCores per node; that is probably 48 physical cores with 96 hardware threads, so a single rank using ~50 cores means it is already spreading across roughly all the physical cores.

  • 2 nodes with np=8, each process only uses ~4 cores?
    In this case you have the overhead of allreducing gradients both intra-node and inter-node; the communication cost is not free.
    But TBH, I'm not sure what you mean by using 4 cores here.
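(Horovod's timeline and autotuning can help quantify and reduce that allreduce cost; a sketch using Horovod's documented environment variables, with a placeholder output path — when launching through mpirun, export them to the workers with -x:)

    HOROVOD_TIMELINE=/tmp/horovod_timeline.json \
    HOROVOD_AUTOTUNE=1 \
    horovodrun -np 8 -H localhost:4,sr225:4 python tensorflow2_synthetic_benchmark.py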

yma11 commented Mar 24, 2021

for np=1,
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
644697 sparkus+ 20 0 25.0g 1.8g 154704 S 4790 0.4 34:30.36 python

for np=8 on 2 nodes,
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1425530 sparkus+ 20 0 24.0g 1.8g 156768 S 462.0 0.4 19:09.42 python
1425533 sparkus+ 20 0 23.9g 1.8g 153400 S 457.4 0.4 14:26.05 python
1425531 sparkus+ 20 0 23.9g 1.8g 157532 S 405.9 0.4 13:10.54 python
1425532 sparkus+ 20 0 23.9g 1.8g 152800 S 371.6 0.4 12:48.52 python

horovod locked and limited conversation to collaborators Mar 27, 2021

This issue was moved to a discussion.

You can continue the conversation there.
