Question about Detail Communication Cost and Memory Cost of Multi-Dimensional Parallelism #185

Hi, @ConnollyLeon. Thank you for your interest. I will answer your questions one by one, according to my understanding.

  • A1: With NCCL as the communication backend, the ring algorithm is expected to be used, so the bandwidth cost you gave is correct.
  • A2: Indeed, during both communication and multiplication the memory cost is O(1/p^2). However, such memory can be reused by other layers. Compared with the memory kept for the backward pass (usually across tens of transformer layers), it does not seem significant in terms of peak memory usage. That is why we also partition the activations by p^3.
  • A3: If using a tree topology for parallel communication, the latency can be O(log(p)). You can re…
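
To make the ring versus tree trade-off in A1 and A3 concrete, here is a rough sketch using the textbook alpha-beta cost model. It is only an illustration under stated assumptions, not how NCCL actually schedules its collectives: `alpha` (per-message latency), `beta` (transfer time per byte), and both function names are hypothetical, and the tree variant assumes a simple unpipelined binary-tree reduce followed by a broadcast.

```python
import math


def ring_allreduce_cost(n_bytes, p, alpha, beta):
    """Estimated time of a ring all-reduce over p ranks (assumed model).

    Reduce-scatter plus all-gather: 2 * (p - 1) steps, each moving
    n_bytes / p, so the latency term grows as O(p) while the bandwidth
    term stays close to 2 * n_bytes * beta.
    """
    if p <= 1:
        return 0.0
    steps = 2 * (p - 1)
    return steps * alpha + steps * (n_bytes / p) * beta


def tree_allreduce_cost(n_bytes, p, alpha, beta):
    """Estimated time of an unpipelined binary-tree all-reduce (assumed model).

    Reduce up the tree, then broadcast down: 2 * ceil(log2(p)) steps,
    each moving the full message, so the latency term is O(log(p)).
    """
    if p <= 1:
        return 0.0
    steps = 2 * math.ceil(math.log2(p))
    return steps * alpha + steps * n_bytes * beta


if __name__ == "__main__":
    # Illustrative values only: 64 MiB message, 5 us latency, 10 GB/s links.
    n, alpha, beta = 64 * 2**20, 5e-6, 1e-10
    for p in (2, 8, 64):
        ring = ring_allreduce_cost(n, p, alpha, beta)
        tree = tree_allreduce_cost(n, p, alpha, beta)
        print(f"p={p:3d}  ring={ring:.6f}s  tree={tree:.6f}s")
```

Under this model the ring variant has the better bandwidth term and the tree variant has the better latency term, which is why the answer depends on message size and on p.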

Answer selected by ConnollyLeon