Fixes MapReduce aggregator and heuristic to correctly handle task data when sampling is enabled #222
```diff
@@ -90,6 +90,7 @@ public long getResourceUsed() {

   /**
    * Computes the aggregated metrics -> peakMemory, delay, total task duration, wasted resources and memory usage.
+   * Aggregated metrics are expected to be approximation when sampling is enabled.
    * @param taskDatas
    * @param containerSize
    * @param idealStartTime
```
```diff
@@ -106,6 +107,9 @@ private void compute(MapReduceTaskData[] taskDatas, long containerSize, long ide
     }

     for (MapReduceTaskData taskData: taskDatas) {
+      if (!taskData.isTimeAndCounterDataPresent()) {
+        continue;
+      }
       long taskMemory = taskData.getCounters().get(MapReduceCounterData.CounterName.PHYSICAL_MEMORY_BYTES)/ FileUtils.ONE_MB; // MB
       long taskVM = taskData.getCounters().get(MapReduceCounterData.CounterName.VIRTUAL_MEMORY_BYTES)/ FileUtils.ONE_MB; // MB
       long taskDuration = taskData.getFinishTimeMs() - taskData.getStartTimeMs(); // Milliseconds
```

**Review comment (shankar37, on the added check):** Can you add a comment to the function saying that the aggregate metrics are not expected to be accurate when sampling is enabled, so that it's clear?

**Reply:** Thanks @shankar37 for the review.
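The guard-and-`continue` pattern added in this hunk can be sketched in isolation. The `Task` class below is a hypothetical stand-in for `MapReduceTaskData` (the real counters live behind `MapReduceCounterData`); only the shape of the skip logic mirrors the PR:

```java
import java.util.Arrays;
import java.util.List;

public class AggregatorSketch {
    // Hypothetical stand-in for MapReduceTaskData: time/counter data is
    // absent (null here) for tasks that were not sampled by the fetcher.
    static class Task {
        final Long durationMs; // null when time/counter data is absent

        Task(Long durationMs) { this.durationMs = durationMs; }

        boolean isTimeAndCounterDataPresent() { return durationMs != null; }
    }

    // Sum task durations, skipping tasks whose data was never fetched.
    static long totalDurationMs(List<Task> tasks) {
        long total = 0;
        for (Task t : tasks) {
            if (!t.isTimeAndCounterDataPresent()) {
                continue; // unsampled task: reading its data would fail
            }
            total += t.durationMs;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Task> tasks = Arrays.asList(new Task(1000L), new Task(null), new Task(2000L));
        System.out.println(totalDurationMs(tasks)); // prints 3000
    }
}
```

As in the PR, the aggregate is computed only over tasks with data present, so the result is an approximation of the full job when sampling is on.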
```diff
@@ -19,14 +19,16 @@
 import com.linkedin.drelephant.mapreduce.data.MapReduceApplicationData;
 import com.linkedin.drelephant.mapreduce.data.MapReduceTaskData;
 import com.linkedin.drelephant.configurations.heuristic.HeuristicConfigurationData;

+import java.util.ArrayList;
+import java.util.List;
 import java.util.Properties;
 import java.util.concurrent.TimeUnit;

 import com.linkedin.drelephant.analysis.Heuristic;
 import com.linkedin.drelephant.analysis.HeuristicResult;
 import com.linkedin.drelephant.analysis.Severity;


 public class JobQueueLimitHeuristic implements Heuristic<MapReduceApplicationData> {

   private HeuristicConfigurationData _heuristicConfData;
```

**Review comment (on `JobQueueLimitHeuristic`):** @akshayrai Is this heuristic used at LinkedIn? I don't see it enabled.

**Reply:** No. We don't use this. As far as I remember, this heuristic was created to warn jobs which were hitting close to the default queue timeout limit of 15 mins.
```diff
@@ -89,13 +91,13 @@ public HeuristicResult apply(MapReduceApplicationData data) {
   }

   private Severity[] getTasksSeverity(MapReduceTaskData[] tasks, long queueTimeout) {
-    Severity[] tasksSeverity = new Severity[tasks.length];
-    int i = 0;
+    List<Severity> taskSeverityList = new ArrayList<Severity>();
     for (MapReduceTaskData task : tasks) {
-      tasksSeverity[i] = getQueueLimitSeverity(task.getTotalRunTimeMs(), queueTimeout);
-      i++;
+      if (task.isTimeAndCounterDataPresent()) {
+        taskSeverityList.add(getQueueLimitSeverity(task.getTotalRunTimeMs(), queueTimeout));
+      }
     }
-    return tasksSeverity;
+    return taskSeverityList.toArray(new Severity[taskSeverityList.size()]);
   }

   private long getSeverityFrequency(Severity severity, Severity[] tasksSeverity) {
```
**Review comment (on the `isTimeAndCounterDataPresent` check):** Isn't this always true? Can you elaborate on what exactly is the issue you are trying to fix? I remember you mentioning something in the meetup but don't recollect exactly what the issue was.

**Reply:** No, it's not always true. If sampling is enabled, then it's only going to be true for sampled tasks. In the `MapReduceFetcherHadoop2` class, a `MapReduceTaskData` instance is created for all the tasks, but time and counter data is only stored for sampled tasks. Please look at the logic here. The unit test in this change will fail with a NullPointerException without this check.
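A minimal, self-contained sketch of the failure mode described in this reply (class and counter names are illustrative, not Dr. Elephant's actual API): an object exists for every task, but its counter map is only populated when the task was sampled, so unguarded counter access throws a NullPointerException.

```java
import java.util.HashMap;
import java.util.Map;

public class SamplingGuardSketch {
    // Illustrative model of what the fetcher produces: an instance exists
    // for every task, but counters stay null unless the task was sampled.
    static class TaskData {
        private Map<String, Long> counters; // null for unsampled tasks

        void storeCounters(Map<String, Long> counters) { this.counters = counters; }

        boolean isTimeAndCounterDataPresent() { return counters != null; }

        long getCounter(String name) {
            return counters.get(name); // NPE if counters were never stored
        }
    }

    public static void main(String[] args) {
        TaskData sampled = new TaskData();
        Map<String, Long> c = new HashMap<String, Long>();
        c.put("PHYSICAL_MEMORY_BYTES", 42L);
        sampled.storeCounters(c);

        TaskData unsampled = new TaskData(); // created, but never populated

        for (TaskData t : new TaskData[] { sampled, unsampled }) {
            if (!t.isTimeAndCounterDataPresent()) {
                continue; // without this guard, getCounter below would NPE
            }
            System.out.println(t.getCounter("PHYSICAL_MEMORY_BYTES")); // prints 42 once
        }
    }
}
```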