Update our decommissioning logic to the current upstream. (apache#673)
[SPARK-21040][CORE] Speculate tasks which are running on decommission executors

This PR adds functionality to consider the tasks running on decommissioned executors as candidates for speculation, based on a config.
In cloud deployments of Spark, we sometimes already know that an executor won't be alive for more than a fixed amount of time. For example, with AWS Spot nodes, once we get the notification we know the node will be gone in 120 seconds.
So if the tasks running on the decommissioning executors may run beyond currentTime + 120 seconds, they are candidates for speculation.

Currently, when an executor is decommissioned we stop scheduling new tasks on it, but the already running tasks keep running. Depending on the cloud, we might know beforehand that an executor won't be alive for more than a preconfigured time; different cloud providers give different amounts of notice before they take the nodes away. For example, with AWS Spot nodes an executor won't be alive for more than 120 seconds. We can use this information in cloud environments to make better decisions about speculating the tasks already running on decommissioned executors.

Yes. This PR adds a new config, "spark.executor.decommission.killInterval", which users can set explicitly based on the cloud environment they are running in.
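As an illustration only (not code from the PR), a job running on spot instances might combine speculation with the new config like this; the 120-second value and the use of speculation are assumptions matching the AWS Spot example above:

```scala
import org.apache.spark.SparkConf

// Hypothetical job configuration: decommissioned executors are expected to be
// killed about 120 seconds after the decommission signal (the AWS Spot case
// above), so tasks projected to run past that window become speculation
// candidates. Pass this conf to SparkContext / spark-submit as usual.
val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.executor.decommission.killInterval", "120s")
```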

Added UT.

Closes apache#28619 from prakharjain09/SPARK-21040-speculate-decommission-exec-tasks.

Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>

[SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

This pull request adds the ability to migrate shuffle files during Spark's decommissioning. The design document associated with this change is at https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE .

To allow this change the `MapOutputTracker` has been extended to allow the location of shuffle files to be updated with `updateMapOutput`. When a shuffle block is put, a block update message will be sent which triggers the `updateMapOutput`.

Instead of rejecting remote puts of shuffle blocks, `BlockManager` delegates the storage of shuffle blocks to its shuffle manager's resolver (if supported). A new, experimental trait is added for shuffle resolvers to indicate that they handle remote puts of blocks.

The existing block migration code is moved out into a separate file, and a producer/consumer model is introduced for migrating shuffle files from the host as quickly as possible while not overwhelming other executors.
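For intuition only, here is a toy sketch of that producer/consumer shape; the class and type names below are invented for illustration and are not the PR's actual migration code:

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors}

// Toy stand-in for a shuffle block that still needs a new home.
final case class PendingShuffleBlock(shuffleId: Int, mapId: Long)

// Producers enqueue blocks as they are discovered; a small pool of consumer
// threads drains the queue and pushes each block to a peer, so the dying host
// empties out quickly without flooding any single surviving executor.
class ShuffleMigrationSketch(
    peers: Seq[String],
    pushToPeer: (PendingShuffleBlock, String) => Unit,
    maxThreads: Int = 8) {

  private val pending = new ConcurrentLinkedQueue[PendingShuffleBlock]()
  private val pool = Executors.newFixedThreadPool(maxThreads)

  def offer(block: PendingShuffleBlock): Unit = pending.add(block)

  def start(): Unit = peers.foreach { peer =>
    pool.submit(new Runnable {
      override def run(): Unit = {
        var block = pending.poll()
        while (block != null) {
          pushToPeer(block, peer)
          block = pending.poll()
        }
      }
    })
  }

  def stop(): Unit = pool.shutdownNow()
}
```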

Recomputing shuffle blocks can be expensive, so we should take advantage of the decommissioning window to migrate these blocks.

This PR introduces two new config parameters, `spark.storage.decommission.shuffleBlocks.enabled` & `spark.storage.decommission.rddBlocks.enabled`, that control which blocks should be migrated during storage decommissioning.
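Illustrative only: enabling both migration paths might look like the following, assuming the rest of the decommissioning machinery is switched on elsewhere:

```scala
import org.apache.spark.SparkConf

// Sketch: opt in to migrating both shuffle and RDD blocks during storage
// decommissioning. Only these two keys come from this PR; the broader
// decommissioning switches are assumed to be enabled separately.
val conf = new SparkConf()
  .set("spark.storage.decommission.shuffleBlocks.enabled", "true")
  .set("spark.storage.decommission.rddBlocks.enabled", "true")
```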

New unit test & expansion of the Spark on K8s decom test to assert that decommissioning with shuffle block migration means that the results are not recomputed even when the original executor is terminated.

This PR is a cleaned-up version of the previous WIP PR I made apache#28331 (thanks to attilapiros for his very helpful reviewing on it :)).

Closes apache#28708 from holdenk/SPARK-20629-copy-shuffle-data-when-nodes-are-being-shutdown-cleaned-up.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Co-authored-by: Attila Zsolt Piros <attilazsoltpiros@apiros-mbp16.lan>
Signed-off-by: Holden Karau <hkarau@apple.com>

[SPARK-24266][K8S] Restart the watcher when we receive a version changed from k8s

Restart the watcher when it fails with an HTTP_GONE code from the Kubernetes API, which means a resource version has changed.

For more relevant information see here: fabric8io/kubernetes-client#1075
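A minimal sketch of the retry idea, assuming a fabric8 4.x-style `Watcher` whose `onClose` receives a `KubernetesClientException` carrying the HTTP status; the class name and callback wiring here are illustrative, not the PR's code:

```scala
import java.net.HttpURLConnection.HTTP_GONE

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}

// Sketch only: a pod watcher that asks its owner to re-subscribe when the
// watch closes with HTTP 410, i.e. the resource version we were watching is
// gone and a fresh watch (with a new resource version) is needed.
class ResubscribingPodWatcher(restartWatch: () => Unit) extends Watcher[Pod] {

  override def eventReceived(action: Watcher.Action, pod: Pod): Unit = {
    // React to pod phase changes here (omitted in this sketch).
  }

  override def onClose(cause: KubernetesClientException): Unit = {
    if (cause != null && cause.getCode == HTTP_GONE) {
      restartWatch()
    }
  }
}
```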

No

Running spark-submit to a k8s cluster.

Not sure how to make an automated test for this. If someone can help me out that would be great.

Closes apache#28423 from stijndehaes/bugfix/k8s-submit-resource-version-change.

Authored-by: Stijn De Haes <stijndehaes@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>

[SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

This PR is a giant plumbing PR that plumbs an `ExecutorDecommissionInfo` along
with the DecommissionExecutor message.

The primary motivation is to know whether a decommissioned executor
will also be losing its shuffle files -- and thus it is important to know
whether the host is being decommissioned as well.

In the absence of this PR, the existing code assumes that decommissioning an executor does not lose the whole host with it, and thus does not clear the shuffle state if the external shuffle service is enabled. While this may hold in some cases (like K8s decommissioning an executor pod, or YARN container preemption), it does not hold in others, such as when the cluster is managed by a Standalone Scheduler (Master). This is similar to the existing `workerLost` field in the `ExecutorProcessLost` message.

In the future, this `ExecutorDecommissionInfo` can be extended to carry
how long the executor has to live, for scenarios like cloud spot kills
(or YARN preemption) and the like.
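The plumbed message is small; here is a sketch of how a caller might use it (the method and case-class signatures match the diff below, but both types are `private[spark]`, so this only illustrates the internal call shape):

```scala
import org.apache.spark.ExecutorAllocationClient
import org.apache.spark.scheduler.ExecutorDecommissionInfo

// Sketch of the plumbed call (internal API): tell the cluster manager to
// decommission one executor, flagging that its whole host is going away too,
// so downstream code knows any external-shuffle-service data on it is also lost.
def decommissionWithHostLoss(client: ExecutorAllocationClient, executorId: String): Boolean = {
  val info = ExecutorDecommissionInfo("worker decommissioned", isHostDecommissioned = true)
  client.decommissionExecutor(executorId, info, adjustTargetNumExecutors = false)
}
```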

No

Tweaked an existing unit test in `AppClientSuite`

Closes apache#29032 from agrawaldevesh/plumb_decom_info.

Authored-by: Devesh Agrawal <devesh.agrawal@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>

[SPARK-32199][SPARK-32198] Reduce job failures during decommissioning

This PR reduces the prospect of a job loss during decommissioning. It
fixes two holes in the current decommissioning framework:

- (a) Loss of decommissioned executors is not treated as a job failure:
We know that the decommissioned executor would be dying soon, so its death is
clearly not caused by the application.

- (b) Shuffle files on the decommissioned host are cleared when the
first fetch failure is detected from a decommissioned host: The tricky part
is deciding when to clear the shuffle state. Ideally you want to clear it
just before the shuffle service on the node dies (or the executor dies when
there is no external shuffle service) -- too soon and it could lead to some
wasted recomputation, too late and it would lead to fetch failures.

  The approach here is to do this clearing when the very first fetch
failure is observed on the decommissioned block manager, without waiting for
other blocks to also signal a failure.

Without these fixes, decommissioning a lot of executors at a time leads to job failures.

The task scheduler tracks the executors that were decommissioned along with their
`ExecutorDecommissionInfo`. This information is used (a) when handling an `ExecutorProcessLost` error, and (b) by the `DAGScheduler` when handling a fetch failure.
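Restated as a small helper (simplified from the DAGScheduler hunk later in this diff, not the verbatim code):

```scala
import org.apache.spark.storage.BlockManagerId

// Simplified restatement of the DAGScheduler change further down in this diff:
// on a fetch failure, decide whether to unregister shuffle output for the whole
// host or just for the one executor the fetch failed against.
def hostToUnregisterOutputs(
    bmAddress: BlockManagerId,
    externalShuffleServiceEnabled: Boolean,
    unRegisterOutputOnHostOnFetchFailure: Boolean,
    isHostDecommissioned: Boolean): Option[String] = {
  // The whole host's shuffle output may be lost if the external shuffle service
  // held it, or if the host itself is being decommissioned.
  val shuffleOutputOfEntireHostLost = externalShuffleServiceEnabled || isHostDecommissioned
  if (shuffleOutputOfEntireHostLost && unRegisterOutputOnHostOnFetchFailure) {
    Some(bmAddress.host)
  } else {
    None // unregister shuffle data just for the single executor
  }
}
```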

No

Added a new unit test `DecommissionWorkerSuite` to test the new behavior by exercising the Master-Worker decommissioning. I chose to add a new test since the setup logic was quite different from the existing `WorkerDecommissionSuite`. I am open to changing the name of the newly added test suite :-)

- Should I add a feature flag to guard these two behaviors? They seem safe to me, since they should only be triggered by decommissioning, but you never know :-).

Closes apache#29014 from agrawaldevesh/decom_harden.

Authored-by: Devesh Agrawal <devesh.agrawal@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>

[SPARK-32417] Fix flakiness of BlockManagerDecommissionIntegrationSuite

This change tries to fix the flakiness of BlockManagerDecommissionIntegrationSuite.

Make the block manager decommissioning test less flaky

An interesting failure happens when migrateDuring = true (and persist or shuffle is true):
- We schedule the job with tasks on executors 0, 1, 2.
- We wait 300 ms and decommission executor 0.
- If the task is not yet done on executor 0, it will now fail because
   the block manager won't be able to save the block. This condition is
   easy to trigger on a loaded machine where the github checks run.
- The task will retry on a different executor (1 or 2) and its shuffle
   blocks will land there.
- No actual block migration happens here because the decommissioned
   executor technically failed before it could even produce a block.

To remove the above race, this change replaces the fixed 300 ms wait with waiting for an actual task to succeed. Once a task has succeeded, we know its blocks have been written, and thus its executor will certainly be forced to migrate those blocks when it is decommissioned.

The change always decommissions an executor on which a real task finished successfully, instead of picking the first executor, because the system may choose to schedule nothing on the first executor and instead run both tasks on another one.
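An assumed shape of that fix using a listener and a semaphore (a sketch, not the suite's exact code):

```scala
import java.util.concurrent.Semaphore

import org.apache.spark.{SparkContext, Success}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sketch: block until at least one task has genuinely succeeded, and remember
// which executor ran it, so the test decommissions an executor that really
// wrote blocks instead of sleeping a fixed 300 ms.
class FirstSuccessfulTaskTracker extends SparkListener {
  val taskSucceeded = new Semaphore(0)
  @volatile var executorToDecommission: Option[String] = None

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    if (taskEnd.reason == Success) {
      executorToDecommission = Some(taskEnd.taskInfo.executorId)
      taskSucceeded.release()
    }
  }
}

def waitForFirstSuccessfulTask(sc: SparkContext, tracker: FirstSuccessfulTaskTracker): String = {
  sc.addSparkListener(tracker)
  tracker.taskSucceeded.acquire() // replaces the flaky fixed-duration sleep
  tracker.executorToDecommission.get
}
```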

I have had bad luck with BlockManagerDecommissionIntegrationSuite; it has failed several times on my PRs, so I am fixing it.

No, unit test only change.

Github checks. Ran this test 100 times, 10 at a time in parallel in a script.

Closes apache#29226 from agrawaldevesh/block-manager-decom-flaky.

Authored-by: Devesh Agrawal <devesh.agrawal@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>

[SPARK-31197][CORE] Shutdown executor once we are done decommissioning

Exit the executor when it has been asked to decommission and there is nothing left for it to do.

This is a rebase of apache#28817

If we want to use decommissioning in Spark's own scale-down we should terminate the executor once it is finished.
Furthermore, in a graceful shutdown it makes sense to release resources we no longer need if we've been asked to shut down by the cluster manager, instead of holding the resources as long as possible.
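Distilled, the executor-side wait loop looks roughly like this (simplified from the CoarseGrainedExecutorBackend hunk later in this diff; the real code also handles the "no block migration configured" case):

```scala
// Sketch of the shutdown-wait loop: poll once a second, and only exit when no
// tasks are running and all blocks were migrated after the last task finished.
def waitUntilDoneThenExit(
    numRunningTasks: () => Int,
    lastMigrationInfo: () => (Long, Boolean), // (when migration finished, all blocks migrated?)
    exitExecutor: () => Unit): Unit = {
  var lastTaskRunningTime = System.nanoTime()
  while (true) {
    Thread.sleep(1000)
    if (numRunningTasks() == 0) {
      val (migrationTime, allBlocksMigrated) = lastMigrationInfo()
      // Only trust the "all migrated" flag if no task has run since it was
      // computed, because a later task could have produced new blocks.
      if (allBlocksMigrated && migrationTime > lastTaskRunningTime) {
        exitExecutor()
        return
      }
    } else {
      // A task is still running and may still write blocks; wait for another
      // migration pass after it finishes.
      lastTaskRunningTime = System.nanoTime()
    }
  }
}
```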

The decommissioned executors will exit at the end of decommissioning. This is sort of a user-facing change; however, decommissioning hasn't been in any releases yet.

I changed the unit test to not send the executor exit message and still wait on the executor exited message.

Closes apache#29211 from holdenk/SPARK-31197-exit-execs-redone.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>

Connect decommissioning to dynamic scaling

Because the mock always says there is an RDD, we may replicate more than once now that there are independent threads.

Make Spark's dynamic allocation use decommissioning

Track the decommissioning executors in the core dynamic scheduler so we don't scale down too low, and update the streaming ExecutorAllocationManager to also delegate to decommissioning

Fix up executor add for resource profile

Fix our exiting and cleanup thread for better debugging next time. Clean up the locks we use in decommissioning and clarify some more bits.

Verify that executors decommissioned, then killed by the external cluster manager, are re-launched

Verify some additional calls are not occurring in the executor allocation manager suite.

Don't close the watcher until the end of the test

Use decommissionExecutors and set adjustTargetNumExecutors to false so that we can match the pattern for killExecutor/killExecutors

bump numparts up to 6

Revert "bump numparts up to 6"

This reverts commit daf96dd.

Small comment & visibility cleanup

CR feedback/cleanup

Cleanup the merge

Fix up the merge

CR feedback, move adjustExecutors to a common utility function

Exclude some non-public APIs

Remove junk

More CR feedback

Fix adjustExecutors backport

This test fails for me locally; from what I recall, it's because we use a different method of resolving the bind address than upstream, so disabling the test.


Cleanup and drop watcher changes from the backport
holdenk authored and dongjoon-hyun committed Sep 2, 2020
1 parent 27227d1 commit b7fb9df
Showing 49 changed files with 1,371 additions and 425 deletions.
29 changes: 19 additions & 10 deletions core/src/main/scala/org/apache/spark/ExecutorAllocationClient.scala
@@ -17,6 +17,7 @@

package org.apache.spark

import org.apache.spark.scheduler.ExecutorDecommissionInfo
/**
* A client that communicates with the cluster manager to request or kill executors.
* This is currently supported only in YARN mode.
@@ -81,27 +82,35 @@ private[spark] trait ExecutorAllocationClient {
* Default implementation delegates to kill, scheduler must override
* if it supports graceful decommissioning.
*
* @param executorIds identifiers of executors to decommission
* @param executorsAndDecomInfo identifiers of executors & decom info.
* @param adjustTargetNumExecutors whether the target number of executors will be adjusted down
* after these executors have been decommissioned.
* @return the ids of the executors acknowledged by the cluster manager to be removed.
*/
def decommissionExecutors(
executorIds: Seq[String],
adjustTargetNumExecutors: Boolean): Seq[String]
executorsAndDecomInfo: Array[(String, ExecutorDecommissionInfo)],
adjustTargetNumExecutors: Boolean): Seq[String] = {
killExecutors(executorsAndDecomInfo.map(_._1),
adjustTargetNumExecutors,
countFailures = false)
}


/**
* Request that the cluster manager decommission the specified executors.
* Default implementation delegates to kill, scheduler must override
* if it supports graceful decommissioning.
* Request that the cluster manager decommission the specified executor.
* Delegates to decommissionExecutors.
*
* @param executorIds identifiers of executors to decommission
* @param executorId identifier of the executor to decommission
* @param decommissionInfo information about the decommission (reason, host loss)
* @param adjustTargetNumExecutors if we should adjust the target number of executors.
* @return whether the request is acknowledged by the cluster manager.
*/
def decommissionExecutor(executorId: String): Boolean = {
val decommissionedExecutors = decommissionExecutors(Seq(executorId),
adjustTargetNumExecutors = true)
final def decommissionExecutor(executorId: String,
decommissionInfo: ExecutorDecommissionInfo,
adjustTargetNumExecutors: Boolean): Boolean = {
val decommissionedExecutors = decommissionExecutors(
Array((executorId, decommissionInfo)),
adjustTargetNumExecutors = adjustTargetNumExecutors)
decommissionedExecutors.nonEmpty && decommissionedExecutors(0).equals(executorId)
}

@@ -127,6 +127,8 @@ private[spark] class ExecutorAllocationManager(
private val executorAllocationRatio =
conf.get(DYN_ALLOCATION_EXECUTOR_ALLOCATION_RATIO)

private val decommissionEnabled = conf.get(WORKER_DECOMMISSION_ENABLED)

validateSettings()

// Number of executors to add in the next round
@@ -201,7 +203,8 @@
// storage shuffle decommissioning is enabled we have *experimental* support for
// decommissioning without a shuffle service.
if (conf.get(config.DYN_ALLOCATION_SHUFFLE_TRACKING_ENABLED) ||
(conf.get(WORKER_DECOMMISSION_ENABLED) && conf.get(STORAGE_SHUFFLE_DECOMMISSION_ENABLED))) {
(decommissionEnabled &&
conf.get(config.STORAGE_DECOMMISSION_SHUFFLE_BLOCKS_ENABLED))) {
logWarning("Dynamic allocation without a shuffle service is an experimental feature.")
} else if (!testing) {
throw new SparkException("Dynamic allocation of executors requires the external " +
@@ -431,7 +434,7 @@

logDebug(s"Request to remove executorIds: ${executors.mkString(", ")}")
val numExistingExecutors = (executorMonitor.executorCount
- executorMonitor.pendingRemovalCount - executorMonitor.pendingDecommissioningCount)
- executorMonitor.pendingRemovalCount - executorMonitor.decommissioningCount)

var newExecutorTotal = numExistingExecutors
executors.foreach { executorIdToBeRemoved =>
@@ -457,10 +460,12 @@
} else {
// We don't want to change our target number of executors, because we already did that
// when the task backlog decreased.
if (conf.get(WORKER_DECOMMISSION_ENABLED)) {
client.decommissionExecutors(executorIdsToBeRemoved, adjustTargetNumExecutors = false)
if (decommissionEnabled) {
val executorIdsWithoutHostLoss = executorIdsToBeRemoved.toSeq.map(
id => (id, ExecutorDecommissionInfo("spark scale down", false))).toArray
client.decommissionExecutors(executorIdsWithoutHostLoss, adjustTargetNumExecutors = false)
} else {
client.killExecutors(executorIdsToBeRemoved, adjustTargetNumExecutors = false,
client.killExecutors(executorIdsToBeRemoved.toSeq, adjustTargetNumExecutors = false,
countFailures = false, force = false)
}
}
@@ -472,7 +477,7 @@
newExecutorTotal = numExistingExecutors
if (testing || executorsRemoved.nonEmpty) {
newExecutorTotal -= executorsRemoved.size
if (conf.get(WORKER_DECOMMISSION_ENABLED)) {
if (decommissionEnabled) {
executorMonitor.executorsDecommissioned(executorsRemoved)
} else {
executorMonitor.executorsKilled(executorsRemoved)
@@ -31,6 +31,7 @@ import org.apache.spark.deploy.DeployMessages._
import org.apache.spark.deploy.master.Master
import org.apache.spark.internal.Logging
import org.apache.spark.rpc._
import org.apache.spark.scheduler.ExecutorDecommissionInfo
import org.apache.spark.util.{RpcUtils, ThreadUtils}

/**
@@ -181,7 +182,8 @@ private[spark] class StandaloneAppClient(
if (ExecutorState.isFinished(state)) {
listener.executorRemoved(fullId, message.getOrElse(""), exitStatus, workerLost)
} else if (state == ExecutorState.DECOMMISSIONED) {
listener.executorDecommissioned(fullId, message.getOrElse(""))
listener.executorDecommissioned(fullId,
ExecutorDecommissionInfo(message.getOrElse(""), isHostDecommissioned = workerLost))
}

case WorkerRemoved(id, host, message) =>
@@ -17,6 +17,8 @@

package org.apache.spark.deploy.client

import org.apache.spark.scheduler.ExecutorDecommissionInfo

/**
* Callbacks invoked by deploy client when various events happen. There are currently five events:
* connecting to the cluster, disconnecting, being given an executor, having an executor removed
@@ -39,7 +41,7 @@ private[spark] trait StandaloneAppClientListener {
def executorRemoved(
fullId: String, message: String, exitStatus: Option[Int], workerLost: Boolean): Unit

def executorDecommissioned(fullId: String, message: String): Unit
def executorDecommissioned(fullId: String, decommissionInfo: ExecutorDecommissionInfo): Unit

def workerRemoved(workerId: String, host: String, message: String): Unit
}
@@ -871,7 +871,10 @@ private[deploy] class Master(
logInfo("Telling app of decommission executors")
exec.application.driver.send(ExecutorUpdated(
exec.id, ExecutorState.DECOMMISSIONED,
Some("worker decommissioned"), None, workerLost = false))
Some("worker decommissioned"), None,
// workerLost is being set to true here to let the driver know that the host (aka. worker)
// is also being decommissioned.
workerLost = true))
exec.state = ExecutorState.DECOMMISSIONED
exec.application.removeExecutor(exec)
}
@@ -56,7 +56,7 @@ object ExternalShuffleStorage extends Logging {
conf.get(DYN_ALLOCATION_ENABLED) &&
conf.get(WORKER_DECOMMISSION_ENABLED) &&
conf.get(STORAGE_DECOMMISSION_ENABLED) &&
conf.get(STORAGE_SHUFFLE_DECOMMISSION_ENABLED) &&
conf.get(STORAGE_DECOMMISSION_SHUFFLE_BLOCKS_ENABLED) &&
conf.get(SPARK_SHUFFLE_EXTERNAL_STORAGE_ENABLED)
}

@@ -156,7 +156,7 @@ object ExternalShuffleStorage extends Logging {
s"${DYN_ALLOCATION_ENABLED.key} and " +
s"${WORKER_DECOMMISSION_ENABLED.key} and " +
s"${STORAGE_DECOMMISSION_ENABLED.key} and " +
s"${STORAGE_SHUFFLE_DECOMMISSION_ENABLED.key} and " +
s"${STORAGE_DECOMMISSION_SHUFFLE_BLOCKS_ENABLED.key} and " +
s"${SPARK_SHUFFLE_EXTERNAL_STORAGE_ENABLED.key} are true.")
}
if (!conf.get(SHUFFLE_MANAGER).equals(SHUFFLE_MANAGER.defaultValueString)) {
@@ -175,7 +175,7 @@ object ExternalShuffleStorage extends Logging {
.set(DYN_ALLOCATION_ENABLED, true)
.set(WORKER_DECOMMISSION_ENABLED, true)
.set(STORAGE_DECOMMISSION_ENABLED, true)
.set(STORAGE_SHUFFLE_DECOMMISSION_ENABLED, true)
.set(STORAGE_DECOMMISSION_SHUFFLE_BLOCKS_ENABLED, true)
.set(SPARK_SHUFFLE_EXTERNAL_STORAGE_ENABLED, true)
}

@@ -40,7 +40,7 @@ import org.apache.spark.resource.ResourceProfile
import org.apache.spark.resource.ResourceProfile._
import org.apache.spark.resource.ResourceUtils._
import org.apache.spark.rpc._
import org.apache.spark.scheduler.{ExecutorLossReason, TaskDescription}
import org.apache.spark.scheduler.{ExecutorDecommissionInfo, ExecutorLossReason, TaskDescription}
import org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages._
import org.apache.spark.serializer.SerializerInstance
import org.apache.spark.util.{ChildFirstURLClassLoader, MutableURLClassLoader, SignalUtils, ThreadUtils, Utils}
@@ -79,7 +79,6 @@ private[spark] class CoarseGrainedExecutorBackend(
*/
private[executor] val taskResources = new mutable.HashMap[Long, Map[String, ResourceInformation]]

// Track our decommissioning status internally.
@volatile private var decommissioned = false

override def onStart(): Unit = {
@@ -168,11 +167,15 @@ private[spark] class CoarseGrainedExecutorBackend(
exitExecutor(1, "Received LaunchTask command but executor was null")
} else {
if (decommissioned) {
logError("Asked to launch a task while decommissioned.")
val msg = "Asked to launch a task while decommissioned."
logError(msg)
driver match {
case Some(endpoint) =>
logInfo("Sending DecommissionExecutor to driver.")
endpoint.send(DecommissionExecutor(executorId))
endpoint.send(
DecommissionExecutor(
executorId,
ExecutorDecommissionInfo(msg, isHostDecommissioned = false)))
case _ =>
logError("No registered driver to send Decommission to.")
}
@@ -265,12 +268,14 @@ private[spark] class CoarseGrainedExecutorBackend(
}

private def decommissionSelf(): Boolean = {
logInfo("Decommissioning self")
val msg = "Decommissioning self w/sync"
logInfo(msg)
try {
decommissioned = true
// Tell master we are are decommissioned so it stops trying to schedule us
if (driver.nonEmpty) {
driver.get.askSync[Boolean](DecommissionExecutor(executorId))
driver.get.askSync[Boolean](DecommissionExecutor(
executorId, ExecutorDecommissionInfo(msg, false)))
} else {
logError("No driver to message decommissioning.")
}
@@ -285,22 +290,21 @@ private[spark] class CoarseGrainedExecutorBackend(
// is viewed as acceptable to minimize introduction of any new locking structures in critical
// code paths.

val shutdownExec = ThreadUtils.newDaemonSingleThreadExecutor("wait for decommissioning")
val shutdownRunnable = new Runnable() {
val shutdownThread = new Thread("wait-for-blocks-to-migrate") {
override def run(): Unit = {
var lastTaskRunningTime = System.nanoTime()
val sleep_time = 1000 // 1s

while (true) {
logInfo("Checking to see if we can shutdown.")
Thread.sleep(sleep_time)
if (executor == null || executor.numRunningTasks == 0) {
if (env.conf.get(STORAGE_DECOMMISSION_ENABLED)) {
logInfo("No running tasks, checking migrations")
val allBlocksMigrated = env.blockManager.lastMigrationInfo()
val (migrationTime, allBlocksMigrated) = env.blockManager.lastMigrationInfo()
// We can only trust allBlocksMigrated boolean value if there were no tasks running
// since the start of computing it.
if (allBlocksMigrated._2 &&
(allBlocksMigrated._1 > lastTaskRunningTime)) {
if (allBlocksMigrated && (migrationTime > lastTaskRunningTime)) {
logInfo("No running tasks, all blocks migrated, stopping.")
exitExecutor(0, "Finished decommissioning", notifyDriver = true)
} else {
Expand All @@ -310,18 +314,21 @@ private[spark] class CoarseGrainedExecutorBackend(
logInfo("No running tasks, no block migration configured, stopping.")
exitExecutor(0, "Finished decommissioning", notifyDriver = true)
}
Thread.sleep(sleep_time)
} else {
logInfo("Blocked from shutdown by running task")
logInfo("Blocked from shutdown by running ${executor.numRunningtasks} tasks")
// If there is a running task it could store blocks, so make sure we wait for a
// migration loop to complete after the last task is done.
Thread.sleep(sleep_time)
// Note: this is only advanced if there is a running task, if there
// is no running task but the blocks are not done migrating this does not
// move forward.
lastTaskRunningTime = System.nanoTime()
}
}
}
}
shutdownExec.submit(shutdownRunnable)
shutdownThread.setDaemon(true)
shutdownThread.start()

logInfo("Will exit when finished decommissioning")
// Return true since we are handling a signal
true
@@ -419,24 +419,23 @@ package object config {
.booleanConf
.createWithDefault(false)

private[spark] val STORAGE_SHUFFLE_DECOMMISSION_ENABLED =
private[spark] val STORAGE_DECOMMISSION_SHUFFLE_BLOCKS_ENABLED =
ConfigBuilder("spark.storage.decommission.shuffleBlocks.enabled")
.doc("Whether to transfer shuffle blocks during block manager decommissioning. Requires " +
"a migratable shuffle resolver (like sort based shuffe)")
.version("3.1.0")
.booleanConf
.createWithDefault(false)

private[spark] val STORAGE_SHUFFLE_DECOMMISSION_MAX_THREADS =
private[spark] val STORAGE_DECOMMISSION_SHUFFLE_MAX_THREADS =
ConfigBuilder("spark.storage.decommission.shuffleBlocks.maxThreads")
.doc("Maximum number of threads to use in migrating shuffle files.")
.version("3.1.0")
.intConf
.checkValue(_ > 0, "The maximum number of threads should be positive")
.createWithDefault(10)

.createWithDefault(8)

private[spark] val STORAGE_RDD_DECOMMISSION_ENABLED =
private[spark] val STORAGE_DECOMMISSION_RDD_BLOCKS_ENABLED =
ConfigBuilder("spark.storage.decommission.rddBlocks.enabled")
.doc("Whether to transfer RDD blocks during block manager decommissioning.")
.version("3.1.0")
@@ -168,8 +168,10 @@ private[spark] class NettyBlockTransferService(
// Everything else is encoded using our binary protocol.
val metadata = JavaUtils.bufferToArray(serializer.newInstance().serialize((level, classTag)))

// We always transfer shuffle blocks as a stream for simplicity with the receiving code since
// they are always written to disk. Otherwise we check the block size.
val asStream = (blockData.size() > conf.get(config.MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM) ||
blockId.isInternalShuffle || blockId.isShuffle)
blockId.isShuffle)
val callback = new RpcResponseCallback {
override def onSuccess(response: ByteBuffer): Unit = {
logTrace(s"Successfully uploaded block $blockId${if (asStream) " as stream" else ""}")
19 changes: 14 additions & 5 deletions core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -1697,10 +1697,19 @@ private[spark] class DAGScheduler(

// TODO: mark the executor as failed only if there were lots of fetch failures on it
if (bmAddress != null) {
val hostToUnregisterOutputs = if (env.blockManager.externalShuffleServiceEnabled &&
unRegisterOutputOnHostOnFetchFailure) {
// We had a fetch failure with the external shuffle service, so we
// assume all shuffle data on the node is bad.
val externalShuffleServiceEnabled = env.blockManager.externalShuffleServiceEnabled
val isHostDecommissioned = taskScheduler
.getExecutorDecommissionInfo(bmAddress.executorId)
.exists(_.isHostDecommissioned)

// Shuffle output of all executors on host `bmAddress.host` may be lost if:
// - External shuffle service is enabled, so we assume that all shuffle data on node is
// bad.
// - Host is decommissioned, thus all executors on that host will die.
val shuffleOutputOfEntireHostLost = externalShuffleServiceEnabled ||
isHostDecommissioned
val hostToUnregisterOutputs = if (shuffleOutputOfEntireHostLost
&& unRegisterOutputOnHostOnFetchFailure) {
Some(bmAddress.host)
} else {
// Unregister shuffle data just for one executor (we don't have any
@@ -2215,7 +2224,7 @@ private[scheduler] class DAGSchedulerEventProcessLoop(dagScheduler: DAGScheduler

case ExecutorLost(execId, reason) =>
val workerLost = reason match {
case SlaveLost(_, true) => true
case SlaveLost(_, true, _) => true
case _ => false
}
dagScheduler.handleExecutorLost(execId, workerLost)
@@ -0,0 +1,28 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.scheduler

/**
* Provides more detail when an executor is being decommissioned.
* @param message Human readable reason for why the decommissioning is happening.
* @param isHostDecommissioned Whether the host (aka the `node` or `worker` in other places) is
* being decommissioned too. Used to infer if the shuffle data might
* be lost even if the external shuffle service is enabled.
*/
private[spark]
case class ExecutorDecommissionInfo(message: String, isHostDecommissioned: Boolean)
