Spark Stages with Failed tasks Heuristic - (Depends on Custom SHS - Requires stages/failedTasks Rest API) #288

skakker · 2017-09-19T13:55:09Z

Tasks (and stages and jobs) can fail if an error occurs while the task is executing. Recommendations are based on what kind of errors are encountered by tasks. Error information is found on the task level, and looking at the error message, recommendations are given.

shkhrgpt · 2017-10-05T20:20:26Z

I think it would good if you can provide a brief description of this PR. It helps in the review.
Thanks.

akshayrai · 2017-10-10T18:19:33Z

@skakker , can you please update the description of all the PRs as @shkhrgpt has been pointing out. It helps the reviewers a lot.

akshayrai

Can you please create a task for yourself and update the wiki page in github with all the Spark Heuristics?

https://github.com/linkedin/dr-elephant/wiki/Metrics-and-Heuristics#spark

akshayrai · 2017-11-19T14:27:42Z

app/com/linkedin/drelephant/spark/fetchers/SparkRestClient.scala

@@ -92,6 +93,8 @@ class SparkRestClient(sparkConf: SparkConf) {
        await(futureJobDatas),
        await(futureStageDatas),
        await(futureExecutorSummaries),
+        await(futureFailedTasksDatas),
+        //Seq.empty,


Remove this.

akshayrai · 2017-11-19T16:50:23Z

app/com/linkedin/drelephant/spark/fetchers/SparkRestClient.scala

@@ -213,6 +216,18 @@ class SparkRestClient(sparkConf: SparkConf) {
      }
    }
  }
+
+  private def getStagesWithFailedTasks(attemptTarget: WebTarget): Seq[StageDataImpl] = {
+    val target = attemptTarget.path("stages/failedTasks")


Can't you get this information from the getStageData call? Why do you need to make a separate call for retrieving failed tasks?

I agree with @akshayrai. Why not get this information from getStageData?

We have a separate API developed for "failed tasks". If we fetch data from the StageDatas, we will have to iterate over all the stages and filter out the stages having failed tasks, which is a costly operation, hence using this API to fetch failed tasks.

If it's an API which is not available with the official version of Spark, then you need to add a comment about it. I am not sure if this PR should be merged unless the API is included in the official version of Spark.

I agree that iterating over all the stages could be a costly operation. But doing the same costly operation on Spark history server may be a lot more expensive because the history server is not only limited to Dr Elephant.

Hi Shekhar
As it turns out, the stages API does not return the task information. The only way to have failed task information is this or to call task API separately for each task. Hence, we will have to use this call only.

Hi Shekhar
I have made the calling of FailedTasksAPI configurable, the default value being false, now you don't have to worry about it as for the official version of spark, it won't call the API.

Thanks, @skakker for making it configurable. However, I am still not in favor of merging this change because it's going to be useless for almost all the users unless they patch their SHS which is not a trivial exercise.
@akshayrai @shankar37 What do you guys think about this issue?

@shkhrgpt, we will not merge these PRs (which depend on the custom Spark HS) with the master as long as the changes are not reflected in the public spark release.

We are planning to merge these to a separate branch for now.

akshayrai · 2017-11-19T16:51:38Z

app/com/linkedin/drelephant/spark/fetchers/statusapiv1/StageStatus.java

@@ -0,0 +1,18 @@
+package com.linkedin.drelephant.spark.fetchers.statusapiv1;


Please add comments on the motivation behind adding this class.

akshayrai · 2017-11-19T16:52:50Z

app/com/linkedin/drelephant/spark/fetchers/statusapiv1/StageStatus.java.orig

@@ -0,0 +1,18 @@
+package com.linkedin.drelephant.spark.fetchers.statusapiv1;


Is this class required? If not please remove it.

akshayrai · 2017-11-19T17:05:59Z

test/com/linkedin/drelephant/spark/fetchers/SparkRestClientTest.scala

+    class StagesWithFailedTasksResource {
+      @GET
+      def getStagesWithFailedTasks(@PathParam("appId") appId: String, @PathParam("attemptId") attemptId: String): Seq[StageDataImpl] =
+        if (attemptId == "2") Seq.empty else throw new Exception()


I am not clear what this code does here? Can you elaborate?

because we added a new REST call "failedTasksData", its a test for that.

akshayrai · 2017-11-19T17:55:30Z

app/com/linkedin/drelephant/spark/heuristics/StagesWithFailedTasksHeuristic.scala

+      new HeuristicResultDetails("Stages with Overhead memory errors", evaluator.stagesWithOverheadError.toString)
+    )
+    if(evaluator.severityOverheadStages.getValue >= Severity.MODERATE.getValue)
+      resultDetails = resultDetails :+ new HeuristicResultDetails("Overhead memory errors", "Many tasks have failed due to overhead memory error. please try increasing it by 500MB in spark.yarn.executor.memoryOverhead")


try increasing it ... to try increasing spark.yarn.executor.memoryOverhead by 500MB

akshayrai · 2017-11-19T18:00:17Z

app/com/linkedin/drelephant/spark/heuristics/StagesWithFailedTasksHeuristic.scala

+      resultDetails = resultDetails :+ new HeuristicResultDetails("Overhead memory errors", "Many tasks have failed due to overhead memory error. please try increasing it by 500MB in spark.yarn.executor.memoryOverhead")
+    //TODO: refine recommendations
+    if(evaluator.severityOOMStages.getValue >= Severity.MODERATE.getValue)
+      resultDetails = resultDetails :+ new HeuristicResultDetails("OOM errors", "Many tasks have failed due to OOM error. Kindly check by increasing executor memory, decreasing spark.memory.fraction or decreasing number of cores.")


Rephrase Kindly check by ... to try increasing spark.executor.memory or decreasing spark.memory.fraction (take a look at unified memory heuristic) or decreasing number of cores.

akshayrai · 2017-11-19T18:02:10Z

app/com/linkedin/drelephant/spark/heuristics/StagesWithFailedTasksHeuristic.scala

+
+
+/**
+  * A heuristic based on errors encountered by failed tasks


Please add more details about this heuristic here.

akshayrai · 2017-11-19T18:08:24Z

app/views/help/spark/helpStagesWithFailedTasks.scala.html

+*@
+<p>Tasks (and stages and jobs) can fail if an error occurs while the task is executing.</p>
+
+<p>Tasks may fail due to Overhead memory issues or OOM errors. These errors are checked and warning is given accordingly.</p>


Please elaborate.

akshayrai · 2017-11-19T18:19:01Z

test/com/linkedin/drelephant/spark/heuristics/StagesWithFailedTasksHeuristicTest.scala

+    attemptId = 0,
+    numActiveTasks = numCompleteTasks,
+    numCompleteTasks,
+    numFailedTasks = 0,


Semantically numFailedTasks should be set to 3.

shkhrgpt · 2017-11-28T23:31:59Z

app-conf/FetcherConf.xml

@@ -79,7 +79,7 @@
  -->
  <fetcher>
    <applicationtype>spark</applicationtype>
-    <classname>com.linkedin.drelephant.spark.fetchers.FSFetcher</classname>
+    <classname>com.linkedin.drelephant.spark.fetchers.SparkFetcher</classname>


Maybe this should be done in a separate change?

shkhrgpt · 2017-11-28T23:46:44Z

app/com/linkedin/drelephant/spark/fetchers/SparkRestClient.scala

@@ -213,6 +215,18 @@ class SparkRestClient(sparkConf: SparkConf) {
      }
    }
  }
+
+  private def getStagesWithFailedTasks(attemptTarget: WebTarget): Seq[StageDataImpl] = {
+    val target = attemptTarget.path("stages/failedTasks")


I went through the Spark documentation but I couldn't find rest endpoint, stages/failedTasks, in Spark history server. Not sure how it's gonna work.

shkhrgpt · 2017-11-28T23:48:24Z

app/com/linkedin/drelephant/spark/fetchers/SparkRestClient.scala

@@ -213,6 +216,18 @@ class SparkRestClient(sparkConf: SparkConf) {
      }
    }
  }
+
+  private def getStagesWithFailedTasks(attemptTarget: WebTarget): Seq[StageDataImpl] = {
+    val target = attemptTarget.path("stages/failedTasks")


I agree with @akshayrai. Why not get this information from getStageData?

edwinalu · 2017-12-14T18:23:20Z

app/com/linkedin/drelephant/spark/fetchers/statusapiv1/StageStatus.java

+import org.apache.spark.util.EnumUtil;
+
+// added this class to accomodate the status "PENDING" for stages.
+public enum StageStatus {


Some of the leveldb SHS changes have been merged into the master branch for Spark, including "SKIPPED". It is not in 2.1 or 2.2 branches however. Is it possible to use the version from master? If not, could you please add a TODO to replace with the Spark version when possible?

edwinalu · 2017-12-14T18:28:03Z

app/com/linkedin/drelephant/spark/heuristics/StagesWithFailedTasksHeuristic.scala

+      new HeuristicResultDetails("Stages with Overhead memory errors", evaluator.stagesWithOverheadError.toString)
+    )
+    if (evaluator.severityOverheadStages.getValue >= Severity.MODERATE.getValue)
+      resultDetails = resultDetails :+ new HeuristicResultDetails("Overhead memory errors", "Many tasks have failed due to overhead memory error. Please try increasing spark.yarn.executor.memoryOverhead by 500MB in spark.yarn.executor.memoryOverhead")


I think we're alerting if any tasks have an OOM error, or are killed by YARN. Can "Many" be changed to "some"?

edwinalu · 2017-12-14T19:44:17Z

app/com/linkedin/drelephant/spark/heuristics/StagesWithFailedTasksHeuristic.scala

+    private def getStageSeverity(numFailedTasks: Int, stageStatus: StageStatus, severityStage: Severity, numCompleteTasks: Int): Severity = {
+      var severityTemp: Severity = Severity.NONE
+      if (numFailedTasks != 0 && stageStatus != StageStatus.FAILED) {
+        if (numFailedTasks.toDouble / numCompleteTasks.toDouble < 2.toDouble / 100.toDouble) {


The threshold (2) is hard coded right now -- please add a constant for this.

edwinalu · 2017-12-14T21:19:57Z

app/com/linkedin/drelephant/spark/heuristics/StagesWithFailedTasksHeuristic.scala

+    lazy val stagesWithFailedTasks: Seq[StageData] = data.stagesWithFailedTasks
+
+    /**
+      * returns the OOM and Overhead memory errors severity


Combine the "@return" with the comment.

shankar37 · 2017-12-15T01:14:14Z

As long as the dr. elephant that runs with the default configuration doesn’t require the new SHS, I am ok with the change. This is like supporting a new version of spark which is in beta or something. The other option is to keep this in a separate branch till the SHS changes get merged to an official release of spark but that will be a lot of maintenance overhead of maintaining multiple branches to be kept in sync. As long as all the new features/heuristics are configured to be off by default and clearly document the new SHS dependency, we should go ahead. Thanks Shankar From: Shekhar Gupta <notifications@github.com> Reply-To: linkedin/dr-elephant <reply@reply.github.com> Date: Friday, December 15, 2017 at 4:37 AM To: linkedin/dr-elephant <dr-elephant@noreply.github.com> Cc: shankar37 <shankar37@gmail.com>, Mention <mention@noreply.github.com> Subject: Re: [linkedin/dr-elephant] Stages with failed tasks heuristic (#288) @shkhrgpt commented on this pull request.

________________________________ In app/com/linkedin/drelephant/spark/fetchers/SparkRestClient.scala<#288 (comment)>:

@@ -213,6 +216,18 @@ class SparkRestClient(sparkConf: SparkConf) {

} } } + + private def getStagesWithFailedTasks(attemptTarget: WebTarget): Seq[StageDataImpl] = { + val target = attemptTarget.path("stages/failedTasks") Thanks, @skakker<https://github.com/skakker> for making it configurable. However, I am still not in favor of merging this change because it's going to be useless for almost all the users unless they patch their SHS which is not a trivial exercise. @akshayrai<https://github.com/akshayrai> @shankar37<https://github.com/shankar37> What do you guys think about this issue? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#288 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AMlRI2UyRnjTH7Do6nMev-mKnB_hMogSks5tAaoigaJpZM4PcbrU>.

shkhrgpt · 2017-12-15T03:52:21Z

I don't think depending on Spark features that are in beta is a good idea because those features may or may not exist in the final release. I am worried that users will try to use these new heuristics, and then they won't see results which might result in unnecessary issues. Just like the issues we facing with Spark 2.1.

Anyways, if we want to keep this change, then we need to have much more clear documentation. And in the PR, the author should also post the link to the JIRA/GitHub link of the new SHS patches on which this change depends. We need to make sure that at least those changes are merged in Spark before merging these changes. I would strongly recommend NOT to merge these changes until all the relevant SHS changes are merged in the main Spark. For example, this ticket about the better scalability of SHS, https://issues.apache.org/jira/browse/SPARK-18085, it still unresolved.

@skakker Please add the relevant documentation to the other PRs too which depend on the custom SHS.

skakker · 2017-12-19T08:54:08Z

@shkhrgpt
Yeah providing links to the relevant JIRAs makes sense. Will update the JIRAs with relevant JIRA ids.

akshayrai · 2018-01-08T14:23:38Z

app-conf/HeuristicConf.xml

@@ -193,6 +193,12 @@
    <classname>com.linkedin.drelephant.spark.heuristics.StagesHeuristic</classname>
    <viewname>views.html.help.spark.helpStagesHeuristic</viewname>
  </heuristic>
+  <heuristic>
+    <applicationtype>spark</applicationtype>
+    <heuristicname>Spark Stages with failed tasks</heuristicname>


Stages with Failed Tasks

akshayrai · 2018-01-08T14:33:48Z

app/com/linkedin/drelephant/spark/fetchers/SparkFetcher.scala

    val appId = analyticJob.getAppId
-    val restDerivedData = await(sparkRestClient.fetchData(appId, eventLogSource == EventLogSource.Rest))
+    val restDerivedData = await(sparkRestClient.fetchData(appId, eventLogSource == EventLogSource.Rest, fetchFailedTasks))


I noticed you were passing fetchFailedTasks in every method prior to this? Can't you compute it directly over here?

akshayrai · 2018-01-08T14:39:30Z

app/com/linkedin/drelephant/spark/fetchers/SparkFetcher.scala

@@ -41,6 +41,7 @@ class SparkFetcher(fetcherConfigurationData: FetcherConfigurationData)
  import ExecutionContext.Implicits.global

  private val logger: Logger = Logger.getLogger(classOf[SparkFetcher])
+  val fetchFailedTasks : Boolean = Option(fetcherConfigurationData.getParamMap.get(FETCH_FAILED_TASKS)).getOrElse("false").toBoolean


fetchFailedTasks to doFetchFailedTasks

akshayrai · 2018-01-08T14:42:29Z

app/views/help/spark/helpStagesWithFailedTasks.scala.html

+*@
+<p>Tasks (and stages and jobs) can fail if an error occurs while the task is executing.</p>
+
+<p>Tasks may fail due to Overhead memory issues or OOM errors. Due to errors in tasks, that stage might also fail. It is analysed as to why the tasks failed, if many tasks of the same stage failed due to the same error, etc. Suggestions are given to prevent these errors.</p>


Need more clarity

akshayrai · 2018-01-08T14:43:31Z

app/com/linkedin/drelephant/spark/legacydata/LegacyDataConverters.scala

@@ -34,14 +34,44 @@ import com.linkedin.drelephant.spark.fetchers.statusapiv1.StageStatus
 object LegacyDataConverters {
  import JavaConverters._

+  //Currently returns a default object (as this JSON is retrieved from Spark History Server), if spark history server is not used to fetch data, changes are required


Need more clarity

akshayrai · 2018-01-08T14:45:07Z

app/com/linkedin/drelephant/spark/heuristics/StagesWithFailedTasksHeuristic.scala

+      if (numFailedTasks != 0 && stageStatus != StageStatus.FAILED) {
+        if (numFailedTasks.toDouble / numCompleteTasks.toDouble < ratioThreshold / 100.toDouble) {
+          severityTemp = Severity.MODERATE
+        }


formatting. Here and below

akshayrai · 2018-01-08T14:47:03Z

app/com/linkedin/drelephant/spark/heuristics/StagesWithFailedTasksHeuristic.scala

+    }
+
+    /**
+      * returns the max (severity of this stage, present severity)


define what is stage severity here in layman terms first.

akshayrai · 2018-01-08T14:49:04Z

app/com/linkedin/drelephant/spark/heuristics/StagesWithFailedTasksHeuristic.scala

+          severityTemp = Severity.SEVERE
+        }
+      }
+      else if (numFailedTasks != 0 && stageStatus == StageStatus.FAILED && numFailedTasks / numCompleteTasks > 0) {


What if numCompleteTasks is 0?

akshayrai · 2018-01-08T14:52:48Z

app/com/linkedin/drelephant/spark/heuristics/StagesWithFailedTasksHeuristic.scala

+
+  val OOM_ERROR = "java.lang.OutOfMemoryError"
+  val OVERHEAD_MEMORY_ERROR = "killed by YARN for exceeding memory limits"
+  val ratioThreshold : Double = 2


Make this configurable in HeuristicConf

shkhrgpt · 2018-01-09T16:32:55Z

@akshayrai I think that is a good idea to not to merge these changes until the relevant Spark code is checked in.
@skakker Can you please post the GitHub/JIRA links of those Spark changes so their progress can be monitored. Thank you.

akshayrai · 2018-01-10T05:39:45Z

@shkhrgpt , we are planning to merge this into a separate branch rather than keeping the PR open. We will merge it with the master once the Spark work is available in the public release.

@skakker will be sharing the JIRA details shortly.

…equires stages/failedTasks Rest API) (#288)

…equires stages/failedTasks Rest API) (linkedin#288)

…equires stages/failedTasks Rest API) (#288)

…equires stages/failedTasks Rest API) (linkedin#288)

…equires stages/failedTasks Rest API) (#288)

skakker force-pushed the FailedTasksHeuristic branch from d4cd6df to 95a3e9c Compare September 19, 2017 13:57

skakker force-pushed the FailedTasksHeuristic branch 2 times, most recently from 5964954 to a6fc106 Compare October 10, 2017 06:59

akshayrai suggested changes Nov 20, 2017

View reviewed changes

shkhrgpt reviewed Nov 29, 2017

View reviewed changes

akshayrai force-pushed the master branch from 7c2fd7f to 8b46933 Compare December 12, 2017 05:09

skakker force-pushed the FailedTasksHeuristic branch from 14cd883 to ca4243a Compare December 14, 2017 13:12

edwinalu reviewed Dec 14, 2017

View reviewed changes

swasti added 8 commits January 8, 2018 14:18

Stages with failed tasks heuristic

bb9a672

Changed default compression codec from .snappy to .lz4

dd14aed

created local instance for StageStatus

91a625c

added help page for stages with failed tasks heuristic

6532e80

acknowledged the changes

e89d50b

made calling of FailedTasksAPI configurable.

33a1841

fixed review comments

2414bda

chnaged test because of rebasing

aa5c395

skakker force-pushed the FailedTasksHeuristic branch from ffe5238 to aa5c395 Compare January 8, 2018 08:59

akshayrai suggested changes Jan 8, 2018

View reviewed changes

review comments acknowledged

eb159eb

akshayrai approved these changes Jan 9, 2018

View reviewed changes

akshayrai changed the title ~~Stages with failed tasks heuristic~~ Spark Stages with Failed tasks Heuristic - (Depends on Custom SHS - Requires stages/failedTasks Rest API) Jan 10, 2018

skakker changed the base branch from master to customSHSWork January 10, 2018 06:03

akshayrai merged commit 06b87a1 into linkedin:customSHSWork Jan 10, 2018

akshayrai pushed a commit that referenced this pull request Feb 21, 2018

Spark Stages with Failed tasks Heuristic - (Depends on Custom SHS - R…

d05321f

…equires stages/failedTasks Rest API) (#288)

akshayrai pushed a commit that referenced this pull request Feb 27, 2018

Spark Stages with Failed tasks Heuristic - (Depends on Custom SHS - R…

a001e83

…equires stages/failedTasks Rest API) (#288)

akshayrai pushed a commit that referenced this pull request Mar 6, 2018

Spark Stages with Failed tasks Heuristic - (Depends on Custom SHS - R…

8386e72

…equires stages/failedTasks Rest API) (#288)

arpang pushed a commit to arpang/dr-elephant that referenced this pull request Mar 14, 2018

Spark Stages with Failed tasks Heuristic - (Depends on Custom SHS - R…

eaae3a2

…equires stages/failedTasks Rest API) (linkedin#288)

akshayrai pushed a commit that referenced this pull request Mar 19, 2018

Spark Stages with Failed tasks Heuristic - (Depends on Custom SHS - R…

e2b403a

…equires stages/failedTasks Rest API) (#288)

akshayrai pushed a commit that referenced this pull request Mar 19, 2018

Spark Stages with Failed tasks Heuristic - (Depends on Custom SHS - R…

edee74b

…equires stages/failedTasks Rest API) (#288)

akshayrai pushed a commit that referenced this pull request Mar 30, 2018

Spark Stages with Failed tasks Heuristic - (Depends on Custom SHS - R…

ee1ed85

…equires stages/failedTasks Rest API) (#288)

akshayrai pushed a commit that referenced this pull request Apr 6, 2018

Spark Stages with Failed tasks Heuristic - (Depends on Custom SHS - R…

a40d251

…equires stages/failedTasks Rest API) (#288)

akshayrai pushed a commit that referenced this pull request May 21, 2018

Spark Stages with Failed tasks Heuristic - (Depends on Custom SHS - R…

eec5ff7

…equires stages/failedTasks Rest API) (#288)

pralabhkumar pushed a commit to pralabhkumar/dr-elephant that referenced this pull request Aug 31, 2018

Spark Stages with Failed tasks Heuristic - (Depends on Custom SHS - R…

2825e79

…equires stages/failedTasks Rest API) (linkedin#288)

varunsaxena pushed a commit that referenced this pull request Oct 16, 2018

Spark Stages with Failed tasks Heuristic - (Depends on Custom SHS - R…

c4544cb

…equires stages/failedTasks Rest API) (#288)

		@@ -0,0 +1,18 @@
		package com.linkedin.drelephant.spark.fetchers.statusapiv1;



		/**
		* A heuristic based on errors encountered by failed tasks

Spark Stages with Failed tasks Heuristic - (Depends on Custom SHS - Requires stages/failedTasks Rest API) #288

Spark Stages with Failed tasks Heuristic - (Depends on Custom SHS - Requires stages/failedTasks Rest API) #288

Conversation

skakker commented Sep 19, 2017 • edited Loading

shkhrgpt commented Oct 5, 2017

akshayrai commented Oct 10, 2017

akshayrai left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shkhrgpt Dec 1, 2017 • edited Loading

Choose a reason for hiding this comment

skakker Dec 11, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shankar37 commented Dec 15, 2017 via email

shkhrgpt commented Dec 15, 2017

skakker commented Dec 19, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shkhrgpt commented Jan 9, 2018

akshayrai commented Jan 10, 2018

skakker commented Sep 19, 2017 •

edited

Loading

shkhrgpt Dec 1, 2017 •

edited

Loading

skakker Dec 11, 2017 •

edited

Loading