poor scaling performance in CPU multi-node #2739

Closed
yma11 opened this issue Mar 23, 2021 · 11 comments

yma11 commented Mar 23, 2021

Environment:

  1. Framework: TensorFlow
  2. Framework version: 2.4.1
  3. Horovod version: v0.21.3
  4. MPI version: 4.0.0
  5. CUDA version: N/A
  6. NCCL version:
  7. Python version:
  8. OS and version:
  9. GCC version:

Checklist:

  1. Did you search issues to find if somebody asked this question before?
  2. If your question is about a hang, did you read this doc?
  3. If your question is about docker, did you read this doc?
  4. Did you check if your question is answered in the troubleshooting guide?

Your question:
I ran the benchmark examples/tensorflow2/tensorflow2_synthetic_benchmark.py to measure scaling performance, but per-worker throughput drops as I scale up. What could the possible root causes be?

horovodrun -np 1 -H localhost:1 python tensorflow2_synthetic_benchmark.py
[1,0]:Iter #0: 11.5 img/sec per GPU
[1,0]:Iter #1: 11.5 img/sec per GPU
[1,0]:Iter #2: 11.6 img/sec per GPU
[1,0]:Iter #3: 11.8 img/sec per GPU
[1,0]:Iter #4: 11.7 img/sec per GPU
[1,0]:Iter #5: 11.8 img/sec per GPU
[1,0]:Iter #6: 11.6 img/sec per GPU
[1,0]:Iter #7: 12.0 img/sec per GPU
[1,0]:Iter #8: 12.1 img/sec per GPU
[1,0]:Iter #9: 12.1 img/sec per GPU
[1,0]:Img/sec per GPU: 11.8 +-0.4

Scaled to 4 processes on a single node, horovodrun -np 4 -H localhost:4 python tensorflow2_synthetic_benchmark.py:
[1,0]:Iter #0: 6.9 img/sec per GPU
[1,0]:Iter #1: 6.9 img/sec per GPU

Scaled to 8 processes across 2 nodes, horovodrun -np 8 -H localhost:4,sr225:4 python tensorflow2_synthetic_benchmark.py:
[1,0]:Iter #0: 6.7 img/sec per GPU
[1,0]:Iter #1: 6.6 img/sec per GPU
[1,0]:Iter #2: 6.8 img/sec per GPU
[1,0]:Iter #3: 6.6 img/sec per GPU

Thanks!

yma11 added the question label Mar 23, 2021

yma11 commented Mar 23, 2021

From the timeline, I can see that the wall duration of allgather and allreduce is more than 100 ms. Is that normal? The network throughput is only about 1.5 Gb/s, while my network is 100 Gb/s.

chongxiaoc (Collaborator) commented:

> From the timeline, I can see that the wall duration of allgather and allreduce is more than 100 ms. Is that normal? The network throughput is only about 1.5 Gb/s, while my network is 100 Gb/s.

It seems that you have a high-speed fabric other than plain Ethernet? In that case, you have to use mpirun explicitly to select the fabric rather than defaulting to TCP. See: https://horovod.readthedocs.io/en/stable/mpi_include.html
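(For reference, with Open MPI the fabric/interface can be selected on the mpirun command line; a sketch, assuming Open MPI is the MPI implementation and using eth0 as a placeholder for the fast interface:)

    mpirun -np 8 -H localhost:4,sr225:4 \
        -bind-to none -map-by slot \
        --mca btl_tcp_if_include eth0 \
        -x PATH -x LD_LIBRARY_PATH \
        python tensorflow2_synthetic_benchmark.py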

yma11 commented Mar 24, 2021

@chongxiaoc Oh, sorry, my network is actually only up to 25 Gb/s. I am not using RDMA.

chongxiaoc (Collaborator) commented:

How many CPU cores per node?
You are scaling with 4 ranks on a single node, which could cause thread oversubscription, since you are doing CPU training.
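(One quick way to rule oversubscription in or out is to cap the threads each rank can use at launch time; a sketch with Open MPI, where 24 is a placeholder — roughly cores_per_node divided by local ranks — and OMP_NUM_THREADS mainly matters for MKL/oneDNN builds of TensorFlow:)

    mpirun -np 4 -H localhost:4 \
        -bind-to none -map-by slot \
        -x OMP_NUM_THREADS=24 \
        python tensorflow2_synthetic_benchmark.py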

yma11 commented Mar 24, 2021

96 vCores per node, which should be quite enough.

chongxiaoc (Collaborator) commented:

Did you try fixing the number of torch threads per rank? I mean this example:
https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html

I think we have to fix the number of threads per rank first, and then look at the scalability.

yma11 commented Mar 24, 2021

Is there a similar interface in TensorFlow to fix the number of threads? I am using TensorFlow 2.4.1.

chongxiaoc (Collaborator) commented:

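(For reference, TensorFlow 2.x exposes per-process thread-pool controls under tf.config.threading; a minimal sketch, assuming it runs before any TF ops execute — the thread counts are placeholders to tune per rank:)

    import tensorflow as tf

    # Cap the per-process thread pools; call this before any TF ops run.
    # 24 / 2 are placeholders: aim intra-op at roughly physical_cores / local_ranks.
    tf.config.threading.set_intra_op_parallelism_threads(24)
    tf.config.threading.set_inter_op_parallelism_threads(2)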

yma11 commented Mar 24, 2021

@chongxiaoc Thanks very much for your info. By the way, besides this example, what I really want to run is Wide & Deep (W&D) training with TensorFlow; the key parts of the script are below:

    # imports assumed from context (not shown in the excerpt)
    import tensorflow as tf
    import horovod.tensorflow as hvd

    # shard dataset for each process
    dataset = tf.data.TextLineDataset(filename)
    dataset = dataset.shard(hvd.size(), hvd.rank())
    dataset = dataset.shuffle(buffer_size=20000)
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.prefetch(batch_size)
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(parse_csv, num_parallel_calls=28)
    dataset = dataset.prefetch(1)
    # wrap both optimizers (wide and deep) in Horovod's DistributedOptimizer
    opt = tf.keras.optimizers.Adagrad(0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)
    opt_lr = tf.keras.optimizers.Ftrl(0.01 * hvd.size())
    opt_lr = hvd.DistributedOptimizer(opt_lr)
    m = tf.estimator.DNNLinearCombinedClassifier(
        config=runconfig,
        model_dir=model_dir,
        linear_optimizer=opt_lr,
        dnn_optimizer=opt,
        linear_feature_columns=wide_columns,
        dnn_feature_columns=deep_columns,
        dnn_hidden_units=[1024, 512, 256])
    # train (build_estimator presumably wraps the construction above in the full script)
    m = build_estimator(model_type, checkpoint_dir, train_file, test_file)
    bcast_hook = hvd.BroadcastGlobalVariablesHook(0)
    m.train(input_fn=lambda: generate_input_fn(
        train_file, batch_size, int(no_of_epochs)),
        hooks=[bcast_hook], steps=int(train_steps) // hvd.size())

Besides the scaling problem, there is another strange phenomenon: with np=1, the single process can use up to 50 cores, but when scaling to 2 nodes with np=8, each process only uses ~4 cores. Why does this happen? Maybe because each process handles less data when np is larger?

chongxiaoc (Collaborator) commented:

> Besides the scaling problem, there is another strange phenomenon: with np=1, the single process can use up to 50 cores, but when scaling to 2 nodes with np=8, each process only uses ~4 cores. Why does this happen? Maybe because each process handles less data when np is larger?

I'm not sure I understand this question.

  • the single process can use up to 50 cores?
    You mentioned the machine has 96 vCores per node; that is probably 48 physical cores with 96 hardware threads, so a single rank using ~50 cores means it is already spreading across roughly all the physical cores.

  • 2 nodes with np=8, each process only uses ~4 cores?
    In this case you have the overhead of allreducing gradients both intra-node and inter-node; the communication cost is not free.
    But TBH, I'm not sure what you mean by using 4 cores here.
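(Horovod's timeline and autotuning can help quantify and reduce that allreduce cost; a sketch using Horovod's documented environment variables, with a placeholder output path — when launching through mpirun, export them to the workers with -x:)

    HOROVOD_TIMELINE=/tmp/horovod_timeline.json \
    HOROVOD_AUTOTUNE=1 \
    horovodrun -np 8 -H localhost:4,sr225:4 python tensorflow2_synthetic_benchmark.py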

yma11 commented Mar 24, 2021

for np=1,
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
644697 sparkus+ 20 0 25.0g 1.8g 154704 S 4790 0.4 34:30.36 python

for np=8 on 2 nodes,
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1425530 sparkus+ 20 0 24.0g 1.8g 156768 S 462.0 0.4 19:09.42 python
1425533 sparkus+ 20 0 23.9g 1.8g 153400 S 457.4 0.4 14:26.05 python
1425531 sparkus+ 20 0 23.9g 1.8g 157532 S 405.9 0.4 13:10.54 python
1425532 sparkus+ 20 0 23.9g 1.8g 152800 S 371.6 0.4 12:48.52 python

horovod locked and limited conversation to collaborators Mar 27, 2021

This issue was moved to a discussion.

You can continue the conversation there.
