Revise pipeline schedule to consider communication ops#4524
Conversation
| int GetForwardRecvWaitedEvent(const int batch_id, const int stage_id) const; | ||
| int GetForwardRecvRecordedEvent(const int batch_id, const int stage_id) const; | ||
| int GetBackwardRecvWaitedEvent(const int batch_id, const int stage_id) const; | ||
| int GetBackwardRecvRecordedEvent(const int batch_id, const int stage_id) const; |
There was a problem hiding this comment.
would it better to use one function to return a struct at given batch_id and stage_id? We might make it a template function and use struct to differentiate between NCCL and MPI. Or maybe have MPIPipelineScheduler/NCCLPipelineScheduler that inherit from the basic PipelineScheduler. #Resolved
There was a problem hiding this comment.
Having MpiPipelineScheduler and NcclPipelineScheduler is doable (MPI scheduler is a special case of NCCL scheduler) but I'd like to first have NCCL P2P working and then come back to refactorize these classes again. Would this be ok to you?
In reply to: 455384433 [](ancestors = 455384433)
There was a problem hiding this comment.
| Slot::Slot() { | ||
| type = Empty; | ||
| batch_id = 0; | ||
| bool Action::IsForward() const { |
There was a problem hiding this comment.
For small function, maybe we can try use inline? #Resolved
There was a problem hiding this comment.
fa60ad0 to
3b837df
Compare
a3b543e to
a4ec543
Compare
|
|
||
| std::vector<int> forward_baseline_events_stage2{ | ||
| -1, -1, 0, 1, // FW00 @ stage 2 | ||
| 2, 3, 4, 5, // FW01 |
There was a problem hiding this comment.
2, 3 [](start = 6, length = 4)
Does the event distinguish fw pass or bw pass? I guess I am not clear if this '2' comes from the forward pass in stage 1 (FW01) or the backward pass in stage 2 (BW00)
There was a problem hiding this comment.
the variable name seems to indicate it's FW. event id itself is just an int and does not carry FW/BW semantics I think.
In reply to: 456110213 [](ancestors = 456110213)
There was a problem hiding this comment.
Right. This Wait '2' is in the FW pass and it's waiting a Record '2' in other place to sent it the signal. But the Record '2' can be from the FW pass in stage 1 (FW01) or the BW pass in stage 2(BW00). So not sure which signal it is listening to.
In reply to: 456111879 [](ancestors = 456111879,456110213)
There was a problem hiding this comment.
Yes, event ID doesn't define forward or backward. It's just a format of mine; I choose to split events based on forward/backward after events have been generated.
Wait 2 is on stage 2, which will be triggered by Record 2 on stage 2 (in the first row of backward_baseline_events_stage2).
First of all, different stages are run by different processes, so their events are independent event they have the same ID. That is, Event 0 on stage 0 has nothing to do with Event 0 on stage 1. The format of these 4 ints is
// Format per line below:
// waited event before Recv, waited event after Recv, recorded event before Send, recorded event after Send.
It means 2 is the event waited before Recv in FW01 on stage 2. Similarly, 3, 4, and 5 sequentially maps to waited event after Recv, recorded event before Send, and recorded event after Send.
In reply to: 456124585 [](ancestors = 456124585,456111879,456110213)
| std::vector<int> recorded_events; | ||
| Pass pass; | ||
|
|
||
| // The last upstream action' time in compute-only schedule. |
There was a problem hiding this comment.
' [](start = 29, length = 1)
Nit: remove?
Same as below.
| // For MPI PipeDream schedule, it's used to support Wait -> Recv -> Wait -> Compute -> Record -> Send -> Record. | ||
| // Since Send, Recv, and Compute are stored in the same slot, each slot contains two waited events and two recorded events. | ||
| // | ||
| // For NCCL PipDream schedule, it's used to support Wait -> Recv -> Record -> Wait -> Compute -> Record -> Wait -> Send -> Record. |
There was a problem hiding this comment.
Record [](start = 70, length = 6)
Out of curiosity, why NCCl needs an extra Record after each Recv and Send? #Resolved
There was a problem hiding this comment.
For NCCL, we need more control on the timing of communication components. It means we need to know the start and end of each of Send, Recv, Forward, Backward.
In reply to: 456117125 [](ancestors = 456117125)
| << recorded_event_id_before_send << "," << recorded_event_id | ||
| << ")" | ||
| << " "; | ||
| case 2: |
There was a problem hiding this comment.
case 2: [](start = 4, length = 7)
To make it a bit more robust, maybe add a default and assert it will never have more than 2 actions?
There was a problem hiding this comment.
Thank you. I will do it when refactorizing this class.
In reply to: 456118801 [](ancestors = 456118801)
NCCL is not as flexible as MPI. It requires user to schedule the communication to happen at the right time, so we extend our pipeline-parallel schedule to cover
SendandRecv. The following two tables are visualizations for schedules w/woSend/Recv.Symbols in the tables:
The class
PipelineSchedulenow includes two timeline tables,compute_table_andcompute_commute_table_.compute_table_stores existing schedule for backward combability (we haven't replaced MPI with NCCL P2P yet) andcompute_compute_table_will be used later when NCCL P2P is introduced. Each schedule is a 2-D table. The slot at timetand stageslocates attable[t][s], which is aPipelineSlot. APipelineSlotcontains parallelly-computed operations (PipelineTasks) such asForwardandSend.