
The performance of model parallelism (MP) is not good #124

Closed
feifeibear opened this issue Jan 6, 2022 · 7 comments
Labels: enhancement (New feature or request)

Comments

@feifeibear
Contributor

Hello developers,

I found that the performance of the MP provided here is not good. I compared it with PatrickStar and DeepSpeed; could you check it with me? See PR #115.
BTW: I strongly recommend adding TFLOPS as a performance indicator.

Platform: one SuperPod node with 8×A100 GPUs and 1 TB of CPU memory. BS = batch size, pstar = PatrickStar, deeps = DeepSpeed.
Entries show throughput, computed as batch / elapsed time. The Xd-Xmp configurations use Colossal-AI.

| Model Scale | Global BS | 1d-4mp | 1d-8mp | 2d-4mp | 2d-8mp | 3d-4mp | 2.5d-4mp | pstar | deeps | deeps-mp4 | deeps-mp8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4B | 8 | 7.61 | 7.62 | 9.89 | 8.47 | failed | 10.31 | 8.78 | 1.15 | 1.26 | 1.26 |
| 4B | 16 | OOM | OOM | OOM | OOM | OOM | OOM | 16.67 | 2.26 | 2.42 | 2.36 |
| 4B | 128 | OOM | OOM | OOM | OOM | OOM | OOM | 28.39 | 12.51 | 10.80 | OOM |
| 10B | 2 | OOM | 3.62 | OOM | failed | OOM | OOM | - | - | 0.15 | 0.15 |
| 10B | 4 | OOM | 4.66 | OOM | OOM | OOM | OOM | - | - | 0.30 | 0.30 |
| 10B | 128 | OOM | OOM | OOM | OOM | OOM | OOM | 13.43 | OOM | 6.31 | 5.73 |
  1. As you can see, Colossal-AI's computing efficiency is the lowest of the three solutions at single-node scale. It is very competitive at the same batch size, but the batch sizes it can reach severely limit its overall performance.
  2. 2.5d-4mp is superior for the 4B model at BS 8, but 1d-8mp generalizes better across settings.
  3. At single-node scale, heterogeneous training (as in PatrickStar and DeepSpeed) may be a better solution than a complex MP strategy.
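
For the TFLOPS suggestion above, a minimal sketch of how it could be derived from the table, assuming the throughput values are samples per second, a sequence length of 1024 (not reported here), and the common 6 × params × tokens approximation for forward-plus-backward FLOPs:

```python
# Sketch only (not from the benchmark scripts): estimate per-GPU TFLOPS from a
# throughput entry in the table above. Assumptions not stated in this thread:
# throughput is samples/second, sequence length is 1024, and training FLOPs per
# token ~= 6 * parameter count (forward + backward, no activation checkpointing).

def estimated_tflops_per_gpu(samples_per_sec: float,
                             num_params: float,
                             seq_len: int = 1024,   # assumed
                             num_gpus: int = 8) -> float:
    tokens_per_sec = samples_per_sec * seq_len
    flops_per_sec = 6 * num_params * tokens_per_sec
    return flops_per_sec / num_gpus / 1e12

# Example: the 4B model at global BS 128 with PatrickStar (28.39 in the table).
print(f"{estimated_tflops_per_gpu(28.39, 4e9):.1f} TFLOPS per GPU")
```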
feifeibear changed the title from "[FEATURE] The performance of model parallelism (MP) is not good" to "The performance of model parallelism (MP) is not good" on Jan 6, 2022
@kurisusnowdeng
Member

kurisusnowdeng commented Jan 6, 2022

Hi @feifeibear, thank you so much for your effort. We would appreciate it if you could also share the configurations you used to test the same models with DeepSpeed and PatrickStar. We would like to evaluate and improve performance at a similar node scale as well as at larger scales.
BTW, did you try 3d-8mp? 3D parallelism requires the MP size to be a perfect cube.
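
For illustration, a small sketch of that constraint (the helper name is ours, not from the codebase):

```python
# Sketch: 3D tensor parallelism needs the MP degree to be a perfect cube,
# e.g. 8 = 2**3 is valid while 4 is not.
def is_valid_3d_mp(mp_size: int) -> bool:
    root = round(mp_size ** (1 / 3))
    return root ** 3 == mp_size

print(is_valid_3d_mp(4))  # False, which may explain the failed 3d-4mp run above
print(is_valid_3d_mp(8))  # True, so 3d-8mp is a valid configuration
```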

@feifeibear
Contributor Author

The DeepSpeed benchmark script:
https://github.com/feifeibear/DeepSpeedZeRO3Benchmark
The PatrickStar benchmark script:
https://github.com/Tencent/PatrickStar/blob/master/examples/run_transformers.sh
Benchmarking is very easy:

export SUFFIX="colossal_compare"
env GPU_NUM=8 MODEL_TYPE="GPT" MODEL_NAME=GPT3_10B BS=2 CPU_EBD=0 AMM=1 MSC=1 CACHE=1 SP=0 CS=288 HYB=1 TILING=0 ACT_OFFLOAD=0 SUFFIX=${SUFFIX} bash run_transformers.sh

@feifeibear
Contributor Author

I have uploaded the logs of DeepSpeed and PatrickStar to Baidu WangPan...
Note that for DeepSpeed, the reported SamplesPerSec is not equal to 'Throughput' here; you have to calculate it yourself as batch / elapsed time.

link: https://pan.baidu.com/s/1vEHl0hPuxDb7HjOlpuW-YA?pwd=1mfd
code: 1mfd
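
For concreteness, a minimal sketch of that calculation; the batch size and iteration time below are placeholders to be taken from the uploaded logs:

```python
# Sketch: recompute the table's throughput metric (batch / elapse) from a
# DeepSpeed run instead of using its reported SamplesPerSec. The values below
# are placeholders; take them from the uploaded logs.
global_batch_size = 128       # matches the "global BS" column
seconds_per_iteration = 10.0  # hypothetical per-step wall-clock time from the log
throughput = global_batch_size / seconds_per_iteration
print(f"throughput = {throughput:.2f}")
```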

@kurisusnowdeng
Member

@feifeibear Thank you!

@github-actions
Contributor

This issue is stale because it has been open for 14 days with no activity.

github-actions bot added the stale label on Jan 22, 2022
binmakeswell added the enhancement (New feature or request) label and removed the stale label on Apr 13, 2022
@binmakeswell
Member

Thanks for your report; detailed tests with stable code will come soon.

@binmakeswell
Member

We have updated a lot. This issue was closed due to inactivity. Thanks.
