
How can I get a training throughput of over 180 TFLOPS? #22

Closed
crazyofapple opened this issue Jul 7, 2023 · 4 comments

@crazyofapple

I ran the code, but only got 90+ TFLOPS.

INFO train.py:317 in record_current_batch_training_metrics -- tflops=93.48098385143103,step=9,loss=7.502509117126465,tgs (tokens/gpu/second)=2104.89,lr=2.2e-06,loss_scale=65536.0,grad_norm=20.60409540743281,micro_num=4,num_consumed_tokens=2621440,inf_nan_skip_batches=0,num_samples_in_batch=13,largest_length=2048,largest_batch=4,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=6.15
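
As a rough cross-check (a sketch, not the repository's own metric code), the reported figure can be approximated from the logged tokens/gpu/second, assuming a ~7B-parameter model and the common estimate of about 6 × N FLOPs per token for forward plus backward:

```python
# Rough cross-check of the logged throughput. Assumptions (not from the repo):
# a ~7B-parameter model and ~6 * N FLOPs per token for forward + backward,
# ignoring attention terms and activation recomputation.
PARAMS = 7e9          # assumed parameter count
TGS = 2104.89         # tokens/gpu/second from the log above

flops_per_token = 6 * PARAMS
achieved_tflops = flops_per_token * TGS / 1e12
print(f"~{achieved_tflops:.0f} TFLOPS per GPU")  # ~88, in the same range as the reported ~93
```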

@sunpengsdu (Collaborator)

Hi @crazyofapple, can you provide more details about your platform? On our platform, we use up to 128 GPU nodes connected by 4×100 Gbps RoCE, and each node has 8 GPUs connected by NVLink.

@crazyofapple (Author)

Inter-node: 2× HDR100 InfiniBand (200 Gbps); intra-node: 8 GPUs connected via PCIe.
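
One way to confirm whether the GPUs within a node are connected by NVLink or only PCIe is the topology matrix printed by `nvidia-smi topo -m`; a minimal sketch (assuming `nvidia-smi` is on the PATH):

```python
# Print the GPU interconnect matrix. In the output, "NV#" entries mean the two
# GPUs are linked by NVLink; "PIX"/"PHB"/"NODE"/"SYS" mean the path goes over
# PCIe and/or the CPU interconnect.
import subprocess

print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True).stdout)
```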

@sunpengsdu (Collaborator)

The main performance bottleneck is intra-node communication over PCIe. We ran two experiments:

  1. On a single GPU node with NVLink. The training log is as follows:

2023-07-10 14:26:28,977 INFO train.py:317 in record_current_batch_training_metrics -- tflops=188.02533140299252,step=36,loss=5.459033012390137,tgs (tokens/gpu/second)=4233.73,lr=7.6e-06,loss_scale=65536.0,grad_norm=12.540833573326264,micro_num=4,num_consumed_tokens=4849664,inf_nan_skip_batches=0,num_samples_in_batch=15,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.72

  2. On a single GPU node without NVLink. The training log is as follows:

2023-07-10 14:34:49,024 INFO train.py:317 in record_current_batch_training_metrics -- tflops=99.1021732624673,step=18,loss=6.766777038574219,tgs (tokens/gpu/second)=2231.46,lr=4.000000000000001e-06,loss_scale=65536.0,grad_norm=12.957902089555239,micro_num=4,num_consumed_tokens=2490368,inf_nan_skip_batches=0,num_samples_in_batch=15,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=5.76

Since the optimizer requires a large amount of allreduce/broadcast communication, it is important to ensure high communication bandwidth between the GPUs within a node.
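
As a back-of-envelope illustration (the model size and bandwidth figures below are assumptions, not measurements from this setup), a ring allreduce over the gradients of a ~7B-parameter model moves roughly 2·(n−1)/n of the gradient bytes through each GPU's links, so PCIe-class bandwidth adds most of a second per gradient allreduce compared with NVLink-class bandwidth:

```python
# Back-of-envelope estimate of per-step gradient allreduce time on one 8-GPU
# node. All numbers are illustrative assumptions, not measurements.
PARAMS = 7e9                 # assumed parameter count
BYTES_PER_GRAD = 2           # fp16/bf16 gradients
N_GPUS = 8

grad_bytes = PARAMS * BYTES_PER_GRAD
# Ring allreduce: each GPU sends/receives about 2*(n-1)/n of the data.
traffic_per_gpu = 2 * (N_GPUS - 1) / N_GPUS * grad_bytes

for name, bandwidth in [("NVLink-class (~300 GB/s assumed)", 300e9),
                        ("PCIe Gen4 x16 (~25 GB/s assumed)", 25e9)]:
    print(f"{name}: ~{traffic_per_gpu / bandwidth:.2f} s per gradient allreduce")
```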

@crazyofapple (Author)

thx
