Dr-elephant is not collecting the data #206
Comments
@akshayrai: please help here.
@shkhrgpt, @akshayrai I have set export SPARK_HOME=/usr/hdp/current/spark-client. Case 2: I have copied these values from /usr/hdp/current/spark-client/conf/spark-defaults.conf. Error: Please help me with this.
Are you setting SPARK_HOME and running Dr Elephant in the same shell?
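This matters because a child process only inherits the environment of the shell that launches it. A minimal sketch of the check, assuming the HDP-style path used elsewhere in this thread (the start script name is illustrative):

```shell
# Export SPARK_HOME in the same shell session that will launch Dr Elephant,
# then confirm it is visible before starting the daemon.
export SPARK_HOME=/usr/hdp/current/spark-client
echo "SPARK_HOME is: $SPARK_HOME"
# ./bin/start.sh   # launched from this shell, it would inherit SPARK_HOME
```

If Dr Elephant is started from a different shell, or by an init script that does not source your profile, the variable will not be set for it.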
@shkhrgpt [drelphnt@blpd218 ~] export SPARK_HOME=/usr/hdp/current/spark-client/
@shkhrgpt Can you please help me with this?
I am sorry, but I can't figure out why you are seeing this error. It should be able to find the Spark conf. Maybe @rayortigas or @shankar37 can add something here?
@shyamraj242 Please run:
sudo cat /proc/[PID_OF_DR_ELEPHANT]/environ | xargs --null --max-args=1 echo
For example, if the process id is 4306, then the command will be the following:
sudo cat /proc/4306/environ | xargs --null --max-args=1 echo
Please share the output of the command.
@shyamraj242, can you also tell us the absolute path of spark-defaults.conf in your spark distribution?
@shkhrgpt @akshayrai #Error: @akshayrai #Spark-Default.conf: spark.driver.extraJavaOptions -Dhdp.version=2.5.3.0-37
You are almost there :)
You can get the namenode host and port from the value of the fs.defaultFS property in core-site.xml. I hope this change will fix your issue. However, I think we need to make a change in SparkFetcher so that it automatically gets the correct namenode host and port from the Hadoop config files, rather than crashing.
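As a sketch of where the host and port come from: fs.defaultFS holds a URI like hdfs://host:port, and its authority part is exactly what an event-log path needs. On a cluster node you can read the live value with `hdfs getconf`; here we just parse a sample value with plain shell (the hostname below is illustrative, not a real namenode):

```shell
# On a cluster node you could run:  hdfs getconf -confKey fs.defaultFS
# Here we parse a sample value instead (hostname is a placeholder).
defaultfs="hdfs://my.namenode.example.com:8020"
authority="${defaultfs#hdfs://}"   # strip the scheme -> host:port
host="${authority%%:*}"            # everything before the colon
port="${authority##*:}"            # everything after the colon
echo "$host $port"
```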
@shkhrgpt
[info] - gets its SparkConf when SPARK_CONF_DIR is set *** FAILED ***
[info] *** 2 TESTS FAILED ***
Please help me with this.
Okay. I am going to submit a patch to fix this problem very soon. Please wait a little.
@shkhrgpt
@shkhrgpt Thanks for investigating. If you need code for the patch, you might find some material (both old and new) at: https://github.com/linkedin/dr-elephant/pull/149/files #149 was never merged, but I'm working on a larger change that will pull all or most of that code in. But don't worry about that; it's probably more important that you get your patch in.
Thanks, @rayortigas, for the reference. I am still not sure what the best way is to get the hostname of the namenode. The patch implemented in https://github.com/linkedin/dr-elephant/pull/149/files relies on I submitted this patch which tries to get the namenode hostname from the value of
@shyamraj242 I have submitted this patch; you can try it while we work on a more permanent solution. Depending on your configs, this may work for you.
@shkhrgpt
@shyamraj242
@shkhrgpt The value of fs.defaultFS in my core-site.xml is "hdfs://BRHMLAB01", where BRHMLAB01 is my cluster name.
If the value of fs.defaultFS does not contain the namenode host, then this patch will not work. Can you change the value of fs.defaultFS to something like this:
as suggested in the documentation. If you can't change this config, then you'll have to wait until we have a proper fix for this problem.
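For illustration only (the hostname and port below are placeholders, not your actual namenode), a host-qualified fs.defaultFS entry in core-site.xml typically looks like this:

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://my.namenode.example.com:8020</value>
</property>
```

Note that a bare cluster name such as hdfs://BRHMLAB01 usually indicates an HA nameservice id rather than a resolvable host, which is why a fetcher that expects host:port fails on it.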
About the other files (SparkDataCollection.scala, SparkFSFetcher.scala, DummySparkFSFetcher.scala, SparkDataCollectionTest.java, SparkFsFetcherTest.java, SparkFsFetcherTest.scala): they are not part of this patch, so you don't need to modify them. Why do you need the org directory in app?
Until there is a code fix, you need to use the correct port for HDFS access, not the WebHDFS interface:
spark.eventLog.dir=hdfs://my.namenode.com:8020/spark-history
I have this working in my HDP 2.5.3 environment.
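Concretely, that working HDP 2.5.3 setting amounts to a line like the following in spark-defaults.conf (the hostname is a placeholder; 8020 is the HDFS RPC port, as opposed to the WebHDFS HTTP port):

```properties
# Point the event log at the HDFS RPC endpoint, not WebHDFS
spark.eventLog.dir    hdfs://my.namenode.example.com:8020/spark-history
```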
Hi Shyamraj, please provide the configuration changes that need to be made after installing Dr elephant. Thanks in advance.
Please provide the elephant.conf, the log, and the XML files that need to be modified. Please share them; this is a high-priority issue for us.
@shkhrgpt, @rayortigas, @shankar37 07-27-2017 17:53:00 ERROR [dr-el-executor-thread-0] com.linkedin.drelephant.ElephantRunner : Add analytic job id [application_1500337134035_0485] into the retry list.
Recently, we made changes to allow SparkFetcher to only use history server's REST API. Maybe you can try that. If you are using Spark 1.5 or above, then you need to get the latest code, and have the following config in FetcherConf.xml:
With the above settings, SparkFetcher should work without depending on webhdfs. |
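A sketch of what such a FetcherConf.xml entry could look like (the use_rest_for_eventlogs parameter name is my assumption based on the REST-only change described above; verify it against the fetcher shipped with your build):

```xml
<fetcher>
  <applicationtype>spark</applicationtype>
  <classname>com.linkedin.drelephant.spark.fetchers.SparkFetcher</classname>
  <params>
    <!-- assumed parameter: fetch event logs via the history server REST API -->
    <use_rest_for_eventlogs>true</use_rest_for_eventlogs>
  </params>
</fetcher>
```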
@shkhrgpt We are using Spark 1.6.
Type --help for more information.
I have commented out that code and pasted the code which you gave me. We are still facing the same error.
07-27-2017 19:02:33 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Analyzing SPARK application_1500488751736_3387
07-27-2017 19:02:33 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Drop the analytic job. Reason: reached the max retries for application id = [application_1500488751736_3387].
Help me with this.
From your stacktrace, it looks like Dr Elephant is trying to load FSFetcher, which depends on WebHDFS.
Can you share your latest FetcherConf.xml?
Please find the xml file:
mapreduce com.linkedin.drelephant.mapreduce.fetchers.MapReduceFetcherHadoop2 false
spark com.linkedin.drelephant.spark.fetchers.SparkFetcher true true
I can't see the entire content of FetcherConf. Do you have the following lines in your FetcherConf.xml?
If you have the above lines, then you should remove them.
I have those lines. I attached the .xml file in the previous post, and I have attached it in this one as well.
If you remove the lines about FSFetcher, then it should load the correct SparkFetcher, which will only use the REST interface of the history server.
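For readers following along, the legacy entry being discussed is the WebHDFS-backed FS fetcher; its class name appears in the stacktrace later in this thread. An illustrative block of that shape in FetcherConf.xml would be:

```xml
<!-- Illustrative legacy entry: removing a block like this lets the
     REST-based SparkFetcher be loaded instead of the FS fetcher. -->
<fetcher>
  <applicationtype>spark</applicationtype>
  <classname>org.apache.spark.deploy.history.SparkFSFetcher</classname>
</fetcher>
```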
@shankar37 And also please find the attached FetcherConf.xml file.
@shkhrgpt
Is your spark history server up and running? What happens when you try to reach the following URL in your browser: http://blpd586.bhdc.att.com:18080/api/v1/applications/application_1501194505315_0509
Shekar, the spark history server is up and it is running properly. I removed /api/v1 from the URL and ran it in the browser; I am able to see the jobs on the Spark history server. http://blpd586.bhdc.att.com:18080/applications/application_1501194505315_0509
What if you don't remove /api/v1? What do you see in your browser? One other thing: you should try the following command from the host where you are running Dr elephant, to make sure that there are no connection issues between that host and the Spark history server:
curl http://blpd586.bhdc.att.com:18080/api/v1/applications/application_1501194505315_0509
Thanks for the help. I have installed Dr.Elephant on another cluster, and now I am not facing the spark issue. I think the reason is that on the other cluster we are using HDP 2.6. Finally, it's working. Dr.Elephant supports Tez jobs as well; I think I need to add the code from the below-mentioned link to dr-elephant-master. Correct me if I am wrong.
That's great.
The one which you have given (Spark Fetcher). I identified the problem; it is with the spark.
We have been gathering data to analyze the performance of the new REST API based Spark Fetcher. Can you please let us know how big your cluster is, and how many spark applications this new fetcher is processing? Thank you. Regarding the Qubole Tez code, it looks like it is not synchronized with the latest changes in Dr Elephant. So if you try to use their code, you may see merge conflicts, and you may also not be able to use the latest changes of Dr Elephant, such as this new Spark Fetcher.
@shkhrgpt So we need to wait some more time to use Tez applications in Dr.Elephant? Shekar, I am facing a new issue: it is displaying the wrong timestamp. I checked the mysql db entries, where the start time and end time show 1000, but those jobs take almost 45 seconds to complete. Help me with this. Please find the attachments.
If it's possible, can you please approximately quantify how big the cluster is? Also, can you first check what start and end times you are seeing on the resource manager REST interface? Go to this URL in your browser:
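The per-application endpoint on the resource manager is ws/v1/cluster/apps/{appid}, and the startedTime and finishedTime fields in its JSON response are epoch milliseconds. A sketch of building that URL (the host is a placeholder; the application id is taken from the logs in this thread):

```shell
# Build the resource manager REST URL for one application.
RM="http://resourcemanager.example.com:8088"      # placeholder RM host
APP="application_1486843207585_79387"
URL="$RM/ws/v1/cluster/apps/$APP"
echo "$URL"
# You would then fetch it and inspect the timestamps:
#   curl -s "$URL"    # look at app.startedTime / app.finishedTime (epoch ms)
```

Comparing those values against what lands in the mysql table should show whether the bad "1000" timestamp comes from YARN or from Dr Elephant's write path.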
@shkhrgpt
@akshayrai @shkhrgpt @shankar37 On the UI it is displaying old dates. The value of the time in the mysql db is 1000. In the XML, the starting and finishing times are not 1000. You can check the images as well. Can you please help me with this? We are unable to see the latest jobs.
I don't know why you are experiencing this issue. Is it only happening with MapReduce applications, or is it happening with Spark applications too? Are you seeing any errors in the Dr Elephant logs? Maybe you can wipe out your DB and create a new one?
Shekar, we are experiencing this with MapReduce applications.
Closing this as it is not reproducible.
We usually run thousands of jobs daily, and I am unable to see the jobs on the Dr.Elephant UI. We restarted the yarn-server on Jan 28 and also restarted Dr.Elephant, but since that moment I have been unable to see the jobs on the UI. I am getting the connection error shown below. I am attaching the Dr.Elephant log file and a screenshot of the Dr.Elephant UI.
02-14-2017 10:24:08 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The list of RM IDs are rm1,rm2
02-14-2017 10:24:08 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Checking RM URL: http://ylpd269.kmdc.att.com:8088/ws/v1/cluster/info
02-14-2017 10:24:08 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : ylpd269.kmdc.att.com:8088 is ACTIVE
02-14-2017 10:24:08 INFO com.linkedin.drelephant.ElephantRunner : Fetching analytic job list...
02-14-2017 10:24:08 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Fetching recent finished application runs between last time: 1487085728997, and current time: 1487085788998
02-14-2017 10:24:08 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The succeeded apps URL is http://ylpd269.kmdc.att.com:8088/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=1487085728997&finishedTimeEnd=1487085788998
02-14-2017 10:24:09 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 2 analyzing MAPREDUCE application_1486843207585_79341
02-14-2017 10:24:09 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The failed apps URL is http://ylpd269.kmdc.att.com:8088/ws/v1/cluster/apps?finalStatus=FAILED&finishedTimeBegin=1487085728997&finishedTimeEnd=1487085788998
02-14-2017 10:24:09 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 4432
02-14-2017 10:24:09 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 3 analyzing MAPREDUCE application_1486843207585_79340
02-14-2017 10:24:11 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 2 analyzing MAPREDUCE application_1486843207585_79343
02-14-2017 10:24:12 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 2 analyzing MAPREDUCE application_1486843207585_79344
02-14-2017 10:24:13 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 2 analyzing MAPREDUCE application_1486843207585_79384
02-14-2017 10:24:14 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 2 analyzing SPARK application_1486843207585_79387
02-14-2017 10:24:14 ERROR com.linkedin.drelephant.ElephantRunner :
02-14-2017 10:24:14 ERROR com.linkedin.drelephant.ElephantRunner : java.security.PrivilegedActionException: java.net.ConnectException: Connection refused
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1689)
at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:48)
at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:232)
at com.linkedin.drelephant.ElephantRunner$ExecutorThread.run(ElephantRunner.java:151)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:198)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.&lt;init&gt;(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:998)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:934)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:852)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:686)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:638)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:711)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:559)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:588)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:584)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:1436)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:312)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getAuthParameters(WebHdfsFileSystem.java:524)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toUrl(WebHdfsFileSystem.java:545)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractFsPathRunner.getUrl(WebHdfsFileSystem.java:801)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:709)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:559)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:588)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:584)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:948)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:963)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
at org.apache.spark.deploy.history.SparkFSFetcher.org$apache$spark$deploy$history$SparkFSFetcher$$isLegacyLogDirectory(SparkFSFetcher.scala:186)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:143)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:99)
... 13 more
02-14-2017 10:24:14 ERROR com.linkedin.drelephant.ElephantRunner : Add analytic job id [application_1486843207585_79387] into the retry list.
dr_elephant.pdf