Skip to content

Revise pipeline schedule to consider communication ops#4524

Merged
wschin merged 9 commits intomasterfrom
wechi/revise-pipeline-schedule
Jul 17, 2020
Merged

Revise pipeline schedule to consider communication ops#4524
wschin merged 9 commits intomasterfrom
wechi/revise-pipeline-schedule

Conversation

@wschin
Copy link
Contributor

@wschin wschin commented Jul 15, 2020

NCCL is not as flexible as MPI. It requires user to schedule the communication to happen at the right time, so we extend our pipeline-parallel schedule to cover Send and Recv. The following two tables are visualizations for schedules w/wo Send/Recv.

image

Symbols in the tables:

  • FW00: forward compute of batch 00
  • FS00: forward send of batch 00
  • FR00: forward recv of batch 00
  • BW00: backward compute of batch 00
  • BS00: backward send of batch 00
  • BR00: backward recv of batch 00.

The class PipelineSchedule now includes two timeline tables, compute_table_ and compute_commute_table_. compute_table_ stores existing schedule for backward combability (we haven't replaced MPI with NCCL P2P yet) and compute_compute_table_ will be used later when NCCL P2P is introduced. Each schedule is a 2-D table. The slot at time t and stage s locates at table[t][s], which is a PipelineSlot. A PipelineSlot contains parallelly-computed operations (PipelineTasks) such as Forward and Send.

@wschin wschin added the training issues related to ONNX Runtime training; typically submitted using template label Jul 15, 2020
@wschin wschin requested a review from a team as a code owner July 15, 2020 20:53
int GetForwardRecvWaitedEvent(const int batch_id, const int stage_id) const;
int GetForwardRecvRecordedEvent(const int batch_id, const int stage_id) const;
int GetBackwardRecvWaitedEvent(const int batch_id, const int stage_id) const;
int GetBackwardRecvRecordedEvent(const int batch_id, const int stage_id) const;
Copy link
Contributor

@ke1337 ke1337 Jul 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it better to use one function to return a struct at given batch_id and stage_id? We might make it a template function and use struct to differentiate between NCCL and MPI. Or maybe have MPIPipelineScheduler/NCCLPipelineScheduler that inherit from the basic PipelineScheduler. #Resolved

Copy link
Contributor Author

@wschin wschin Jul 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having MpiPipelineScheduler and NcclPipelineScheduler is doable (MPI scheduler is a special case of NCCL scheduler) but I'd like to first have NCCL P2P working and then come back to refactorize these classes again. Would this be ok to you?


In reply to: 455384433 [](ancestors = 455384433)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.


In reply to: 455584382 [](ancestors = 455584382,455384433)

Slot::Slot() {
type = Empty;
batch_id = 0;
bool Action::IsForward() const {
Copy link
Contributor

@xzhu1900 xzhu1900 Jul 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For small function, maybe we can try use inline? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.


In reply to: 455407917 [](ancestors = 455407917)

@wschin wschin force-pushed the wechi/revise-pipeline-schedule branch from fa60ad0 to 3b837df Compare July 16, 2020 00:42
@wschin wschin force-pushed the wechi/revise-pipeline-schedule branch from a3b543e to a4ec543 Compare July 16, 2020 07:04

std::vector<int> forward_baseline_events_stage2{
-1, -1, 0, 1, // FW00 @ stage 2
2, 3, 4, 5, // FW01
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2, 3 [](start = 6, length = 4)

Does the event distinguish fw pass or bw pass? I guess I am not clear if this '2' comes from the forward pass in stage 1 (FW01) or the backward pass in stage 2 (BW00)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the variable name seems to indicate it's FW. event id itself is just an int and does not carry FW/BW semantics I think.


In reply to: 456110213 [](ancestors = 456110213)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. This Wait '2' is in the FW pass and it's waiting a Record '2' in other place to sent it the signal. But the Record '2' can be from the FW pass in stage 1 (FW01) or the BW pass in stage 2(BW00). So not sure which signal it is listening to.


In reply to: 456111879 [](ancestors = 456111879,456110213)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, event ID doesn't define forward or backward. It's just a format of mine; I choose to split events based on forward/backward after events have been generated.

Wait 2 is on stage 2, which will be triggered by Record 2 on stage 2 (in the first row of backward_baseline_events_stage2).

First of all, different stages are run by different processes, so their events are independent event they have the same ID. That is, Event 0 on stage 0 has nothing to do with Event 0 on stage 1. The format of these 4 ints is

  // Format per line below:
  //   waited event before Recv, waited event after Recv, recorded event before Send, recorded event after Send.

It means 2 is the event waited before Recv in FW01 on stage 2. Similarly, 3, 4, and 5 sequentially maps to waited event after Recv, recorded event before Send, and recorded event after Send.


In reply to: 456124585 [](ancestors = 456124585,456111879,456110213)

Copy link
Contributor

@ke1337 ke1337 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

std::vector<int> recorded_events;
Pass pass;

// The last upstream action' time in compute-only schedule.
Copy link
Contributor

@xzhu1900 xzhu1900 Jul 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

' [](start = 29, length = 1)

Nit: remove?

Same as below.

// For MPI PipeDream schedule, it's used to support Wait -> Recv -> Wait -> Compute -> Record -> Send -> Record.
// Since Send, Recv, and Compute are stored in the same slot, each slot contains two waited events and two recorded events.
//
// For NCCL PipDream schedule, it's used to support Wait -> Recv -> Record -> Wait -> Compute -> Record -> Wait -> Send -> Record.
Copy link
Contributor

@xzhu1900 xzhu1900 Jul 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Record [](start = 70, length = 6)

Out of curiosity, why NCCl needs an extra Record after each Recv and Send? #Resolved

Copy link
Contributor Author

@wschin wschin Jul 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For NCCL, we need more control on the timing of communication components. It means we need to know the start and end of each of Send, Recv, Forward, Backward.


In reply to: 456117125 [](ancestors = 456117125)

<< recorded_event_id_before_send << "," << recorded_event_id
<< ")"
<< " ";
case 2:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case 2: [](start = 4, length = 7)

To make it a bit more robust, maybe add a default and assert it will never have more than 2 actions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I will do it when refactorizing this class.


In reply to: 456118801 [](ancestors = 456118801)

@wschin wschin merged commit 21d2728 into master Jul 17, 2020
@wschin wschin deleted the wechi/revise-pipeline-schedule branch July 17, 2020 17:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

training issues related to ONNX Runtime training; typically submitted using template

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants