
How can I get a training throughput of over 180 TFLOPS? #22

Closed
crazyofapple opened this issue Jul 7, 2023 · 4 comments

@crazyofapple

I ran the code, but only got 90+ TFLOPS.

INFO train.py:317 in record_current_batch_training_metrics -- tflops=93.48098385143103,step=9,loss=7.502509117126465,tgs (tokens/gpu/second)=2104.89,lr=2.2e-06,loss_scale=65536.0,grad_norm=20.60409540743281,micro_num=4,num_consumed_tokens=2621440,inf_nan_skip_batches=0,num_samples_in_batch=13,largest_length=2048,largest_batch=4,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=6.15
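
As a rough cross-check (a sketch, not the repository's own metric code), the reported figure can be approximated from the logged tokens/gpu/second, assuming a ~7B-parameter model and the common estimate of about 6 × N FLOPs per token for forward plus backward:

```python
# Rough cross-check of the logged throughput. Assumptions (not from the repo):
# a ~7B-parameter model and ~6 * N FLOPs per token for forward + backward,
# ignoring attention terms and activation recomputation.
PARAMS = 7e9          # assumed parameter count
TGS = 2104.89         # tokens/gpu/second from the log above

flops_per_token = 6 * PARAMS
achieved_tflops = flops_per_token * TGS / 1e12
print(f"~{achieved_tflops:.0f} TFLOPS per GPU")  # ~88, in the same range as the reported ~93
```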

@sunpengsdu (Collaborator)

Hi @crazyofapple, can you provide more details about your platform? On our platform, we use up to 128 GPU nodes connected by 4×100 Gbps RoCE, and each node has 8 GPUs connected by NVLink.

@crazyofapple (Author)

Inter-node: 2× HDR100 InfiniBand (200 Gbps); intra-node: 8 GPUs connected via PCIe.
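
One way to confirm whether the GPUs within a node are connected by NVLink or only PCIe is the topology matrix printed by `nvidia-smi topo -m`; a minimal sketch (assuming `nvidia-smi` is on the PATH):

```python
# Print the GPU interconnect matrix. In the output, "NV#" entries mean the two
# GPUs are linked by NVLink; "PIX"/"PHB"/"NODE"/"SYS" mean the path goes over
# PCIe and/or the CPU interconnect.
import subprocess

print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True).stdout)
```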

@sunpengsdu (Collaborator)

The main performance bottleneck is intra-node communication over PCIe. We ran two experiments:

  1. On a single GPU node with NVLink. The training log is as follows:

2023-07-10 14:26:28,977 INFO train.py:317 in record_current_batch_training_metrics -- tflops=188.02533140299252,step=36,loss=5.459033012390137,tgs (tokens/gpu/second)=4233.73,lr=7.6e-06,loss_scale=65536.0,grad_norm=12.540833573326264,micro_num=4,num_consumed_tokens=4849664,inf_nan_skip_batches=0,num_samples_in_batch=15,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.72

  2. On a single GPU node without NVLink. The training log is as follows:

2023-07-10 14:34:49,024 INFO train.py:317 in record_current_batch_training_metrics -- tflops=99.1021732624673,step=18,loss=6.766777038574219,tgs (tokens/gpu/second)=2231.46,lr=4.000000000000001e-06,loss_scale=65536.0,grad_norm=12.957902089555239,micro_num=4,num_consumed_tokens=2490368,inf_nan_skip_batches=0,num_samples_in_batch=15,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=5.76

Since the optimizer requires a large amount of allreduce/broadcast communication, it is important to ensure high communication bandwidth between the GPUs within a node.
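
As a back-of-envelope illustration (the model size and bandwidth figures below are assumptions, not measurements from this setup), a ring allreduce over the gradients of a ~7B-parameter model moves roughly 2·(n−1)/n of the gradient bytes through each GPU's links, so PCIe-class bandwidth adds most of a second per gradient allreduce compared with NVLink-class bandwidth:

```python
# Back-of-envelope estimate of per-step gradient allreduce time on one 8-GPU
# node. All numbers are illustrative assumptions, not measurements.
PARAMS = 7e9                 # assumed parameter count
BYTES_PER_GRAD = 2           # fp16/bf16 gradients
N_GPUS = 8

grad_bytes = PARAMS * BYTES_PER_GRAD
# Ring allreduce: each GPU sends/receives about 2*(n-1)/n of the data.
traffic_per_gpu = 2 * (N_GPUS - 1) / N_GPUS * grad_bytes

for name, bandwidth in [("NVLink-class (~300 GB/s assumed)", 300e9),
                        ("PCIe Gen4 x16 (~25 GB/s assumed)", 25e9)]:
    print(f"{name}: ~{traffic_per_gpu / bandwidth:.2f} s per gradient allreduce")
```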

@crazyofapple (Author)

thx
