There are two bugs that affect scheduling for a fragment/stage with more than one scan node. Only a colocated join can create such a fragment/stage at this time.
The combined outcome is that there is no back pressure for split scheduling for all but the first scan node (in source scheduling order) in a stage.
bug 1
When a fragment/stage contains multiple scan nodes, none of the scan nodes will receive a TaskSource with TaskSource.noMoreSplits = true until split scheduling for all the scan nodes in the stage finishes.
In SqlStageExecution, there are 2 places where the completeSources variable is updated: one in schedulingComplete, which runs after the entire stage finishes scheduling, and the other in addExchangeLocations, which isn't relevant for scan nodes.
In SqlStageExecution, there are 3 places where task.noMoreSplits is invoked: in scheduleTask, where it is invoked for each element in completeSources, and in schedulingComplete and addExchangeLocations.
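A minimal sketch of the flow described above (class and method names are simplified stand-ins, not the actual Presto code): because no per-source completion path exists for scan nodes, completeSources stays empty, and noMoreSplits is only signaled once, for everything, when the whole stage finishes.

```java
import java.util.*;

// Simplified model of the SqlStageExecution behavior described above.
// Names are illustrative; this is not the real implementation.
class StageModel {
    private final Set<String> completeSources = new HashSet<>();
    private final List<String> noMoreSplitsSignals = new ArrayList<>();

    // Called when split scheduling finishes for one scan node.
    // Bug 1: nothing marks an individual scan node complete here,
    // so completeSources stays empty until schedulingComplete().
    void sourceFinished(String planNodeId) {
        // no-op in the buggy flow for scan nodes
    }

    // Called once, only after the ENTIRE stage finishes scheduling.
    void schedulingComplete(List<String> allScanNodes) {
        completeSources.addAll(allScanNodes);
        for (String node : completeSources) {
            noMoreSplitsSignals.add(node); // models task.noMoreSplits(node)
        }
    }

    List<String> signals() {
        return noMoreSplitsSignals;
    }
}
```

Finishing an individual scan node produces no signal; both signals arrive together at stage completion.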
This bug by itself should have caused frequent query deadlocks, which we don't observe. That leads me to bug 2.
bug 2
In PipelineContext.getPipelineStatus, queuedDrivers is computed by looking at DriverContexts. However, note that in SqlTaskExecution.schedulePartitionedSource, there is the concept of pendingSplitsByPlanNode. pendingSplitsByPlanNode buffers splits for scan nodes that aren't yet eligible to schedule because another scan node that is ahead of them in source scheduling order hasn't finished scheduling.
Specifically, that "another scan node" is the first scan node (in source scheduling order). Due to bug 1, it will not finish scheduling until all splits for the stage are delivered to workers.
queuedDrivers should include the splits in pendingSplitsByPlanNode, even though a DriverSplitRunner has yet to be created for them. The fact that the worker chose to defer creating those drivers is an implementation detail. Conceptually, those splits have been delivered to the worker, and the worker has created drivers for them; those drivers are just "blocked" (not runnable).
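The undercount can be modeled as follows (names are assumed stand-ins for the real classes): splits parked in pendingSplitsByPlanNode have no driver yet, so the reported queuedDrivers misses them entirely.

```java
import java.util.*;

// Illustrative model of bug 2; names are assumptions, not the real code.
// Splits for a not-yet-eligible scan node are parked in
// pendingSplitsByPlanNode, and the queuedDrivers metric only counts
// splits that already have a driver.
class TaskModel {
    private final Map<String, List<String>> pendingSplitsByPlanNode = new HashMap<>();
    private final List<String> queuedDrivers = new ArrayList<>();

    void addSplit(String planNodeId, String split, boolean eligibleToSchedule) {
        if (eligibleToSchedule) {
            queuedDrivers.add(split); // a DriverSplitRunner exists for it
        } else {
            pendingSplitsByPlanNode
                    .computeIfAbsent(planNodeId, k -> new ArrayList<>())
                    .add(split); // deferred: no driver yet, so not counted
        }
    }

    // What getPipelineStatus effectively reports today (bug 2).
    int reportedQueuedDrivers() {
        return queuedDrivers.size();
    }

    // What it should report per the argument above.
    int conceptualQueuedDrivers() {
        int pending = pendingSplitsByPlanNode.values().stream()
                .mapToInt(List::size)
                .sum();
        return queuedDrivers.size() + pending;
    }
}
```

With one eligible split and several buffered ones, the reported count stays at one while the worker actually holds all of them.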
Now look at NodeScheduler.selectDistributionNodes. It depends on NodeAssignmentStats.getTotalSplitCount, which is effectively queuedDrivers + recent assignments. The recent-assignment count does increment, but it is reset to zero as soon as the splits are delivered to workers, and queuedDrivers stays small (eventually hitting zero) because of this bug.
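The resulting loss of back pressure can be sketched like this (field and method names are assumptions for illustration): once delivered splits reset the pending count and the worker's undercounted queuedDrivers stays near zero, the total never approaches the per-node cap, so the scheduler keeps assigning splits unchecked.

```java
// Illustrative model of the back-pressure check; names are assumptions.
// total = queuedDrivers reported by the worker + assignments not yet
// delivered. Bug 2 keeps the reported queuedDrivers near zero, and
// pending assignments reset on delivery, so the cap is never reached.
class NodeStatsModel {
    int reportedQueuedDrivers; // from worker task status (undercounted)
    int pendingAssignments;    // reset to 0 once splits are delivered

    int getTotalSplitCount() {
        return reportedQueuedDrivers + pendingAssignments;
    }

    boolean acceptsMoreSplits(int maxSplitsPerNode) {
        return getTotalSplitCount() < maxSplitsPerNode;
    }
}
```

Right after a delivery, both terms are near zero, so the node always looks idle to the scheduler regardless of how many splits it is actually buffering.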
This bug also leads to misleading/unintuitive client stats.
mutual effect
If bug 2 gets fixed alone, it would lead to scheduling deadlocks.
If bug 1 gets fixed alone, it will restore back pressure and somewhat mitigate bug 2.