Dr-elephant is not collecting the data #206

Closed · shyamraj242 opened this issue Feb 14, 2017 · 49 comments

@shyamraj242 commented Feb 14, 2017

We usually run thousands of jobs daily, but I am unable to see them on the Dr. Elephant UI. We restarted the YARN server on Jan 28 and also restarted Dr. Elephant, but since then no jobs appear on the UI. I am getting the connection error shown below. I am attaching the Dr. Elephant log file and a screenshot of the Dr. Elephant UI.

02-14-2017 10:24:08 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The list of RM IDs are rm1,rm2
02-14-2017 10:24:08 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Checking RM URL: http://ylpd269.kmdc.att.com:8088/ws/v1/cluster/info
02-14-2017 10:24:08 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : ylpd269.kmdc.att.com:8088 is ACTIVE
02-14-2017 10:24:08 INFO com.linkedin.drelephant.ElephantRunner : Fetching analytic job list...
02-14-2017 10:24:08 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Fetching recent finished application runs between last time: 1487085728997, and current time: 1487085788998
02-14-2017 10:24:08 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The succeeded apps URL is http://ylpd269.kmdc.att.com:8088/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=1487085728997&finishedTimeEnd=1487085788998
02-14-2017 10:24:09 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 2 analyzing MAPREDUCE application_1486843207585_79341
02-14-2017 10:24:09 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The failed apps URL is http://ylpd269.kmdc.att.com:8088/ws/v1/cluster/apps?finalStatus=FAILED&finishedTimeBegin=1487085728997&finishedTimeEnd=1487085788998
02-14-2017 10:24:09 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 4432
02-14-2017 10:24:09 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 3 analyzing MAPREDUCE application_1486843207585_79340
02-14-2017 10:24:11 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 2 analyzing MAPREDUCE application_1486843207585_79343
02-14-2017 10:24:12 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 2 analyzing MAPREDUCE application_1486843207585_79344
02-14-2017 10:24:13 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 2 analyzing MAPREDUCE application_1486843207585_79384
02-14-2017 10:24:14 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 2 analyzing SPARK application_1486843207585_79387
02-14-2017 10:24:14 ERROR com.linkedin.drelephant.ElephantRunner :
02-14-2017 10:24:14 ERROR com.linkedin.drelephant.ElephantRunner : java.security.PrivilegedActionException: java.net.ConnectException: Connection refused
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1689)
at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:48)
at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:232)
at com.linkedin.drelephant.ElephantRunner$ExecutorThread.run(ElephantRunner.java:151)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:198)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:998)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:934)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:852)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:686)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:638)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:711)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:559)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:588)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:584)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:1436)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:312)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getAuthParameters(WebHdfsFileSystem.java:524)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toUrl(WebHdfsFileSystem.java:545)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractFsPathRunner.getUrl(WebHdfsFileSystem.java:801)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:709)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:559)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:588)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:584)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:948)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:963)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
at org.apache.spark.deploy.history.SparkFSFetcher.org$apache$spark$deploy$history$SparkFSFetcher$$isLegacyLogDirectory(SparkFSFetcher.scala:186)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:143)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:99)
... 13 more

02-14-2017 10:24:14 ERROR com.linkedin.drelephant.ElephantRunner : Add analytic job id [application_1486843207585_79387] into the retry list.

dr_elephant.pdf
dr_elephanterror

@shyamraj242 (Author)

@akshayrai: please help here.

@shyamraj242 (Author)

@shkhrgpt, @akshayrai
I am facing a problem with the Spark applications.
Case 1:
Following your suggestion, I installed the new version of Dr. Elephant on the same cluster. I set SPARK_HOME, but the logs still throw an error telling me to set the SPARK_HOME path.

export SPARK_HOME=/usr/hdp/current/spark-client

Case 2:
I edited the spark-defaults.conf file in /test/resources as below:
spark.yarn.historyServer.address = blpd214.bhdc.att.com:18080
spark.eventLog.enabled = true
spark.eventLog.compress = true
spark.eventLog.dir = hdfs:///spark-history

I copied these values from /usr/hdp/current/spark-client/conf/spark-defaults.conf.

Error:
02-24-2017 14:25:45 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Analyzing SPARK application_1487694609998_0011
02-24-2017 14:25:45 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.spark.fetchers.SparkFetcher : Fetching data for application_1487694609998_0011
02-24-2017 14:25:45 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.spark.fetchers.SparkFetcher : Failed fetching data for application_1487694609998_0011
java.lang.IllegalStateException: can't find Spark conf; please set SPARK_HOME or SPARK_CONF_DIR
at com.linkedin.drelephant.spark.fetchers.SparkFetcher.sparkConf$lzycompute(SparkFetcher.scala:52)
at com.linkedin.drelephant.spark.fetchers.SparkFetcher.sparkConf(SparkFetcher.scala:48)
at com.linkedin.drelephant.spark.fetchers.SparkFetcher.sparkRestClient$lzycompute(SparkFetcher.scala:57)
at com.linkedin.drelephant.spark.fetchers.SparkFetcher.sparkRestClient(SparkFetcher.scala:57)
at com.linkedin.drelephant.spark.fetchers.SparkFetcher.fetchData(SparkFetcher.scala:68)
at com.linkedin.drelephant.spark.fetchers.SparkFetcher.fetchData(SparkFetcher.scala:37)
at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:233)
at com.linkedin.drelephant.ElephantRunner$ExecutorJob.run(ElephantRunner.java:177)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Please help me on this.

@shkhrgpt (Contributor)

Are you setting SPARK_HOME and running Dr Elephant in the same shell?

@shyamraj242 (Author)

@shkhrgpt
Thanks for your reply. Yes, I am doing both in the same shell.

[drelphnt@blpd218 ~] export SPARK_HOME=/usr/hdp/current/spark-client/
[drelphnt@blpd218 ~] $DR_RELEASE/bin/start.sh $DR_RELEASE1/app-conf

@shyamraj242 (Author)

@shkhrgpt
Can you please help me with this?

@shyamraj242 (Author)

@akshayrai @shkhrgpt

Can you please help me with this?

@shkhrgpt (Contributor)

I am sorry, but I can't figure out why you are seeing this error. It should be able to find the Spark conf. Maybe @rayortigas or @shankar37 can add something here?

@shkhrgpt (Contributor)

@shyamraj242
I would like to see whether SPARK_HOME is set for the Dr. Elephant process. Can you please get the process ID of your Dr. Elephant process using the ps aux | grep elephant command (or something similar), and then run the following command:

sudo cat /proc/[PID_OF_DR_ELEPHANT]/environ | xargs --null --max-args=1 echo

For example, if the process ID is 4306, then the command will be the following:

sudo cat /proc/4306/environ | xargs --null --max-args=1 echo

Please share the output of the command.

@akshayrai (Contributor) commented Feb 28, 2017

@shyamraj242 , can you also tell us the absolute path of spark-defaults.conf in your spark distribution?

@shyamraj242 (Author) commented Feb 28, 2017

@shkhrgpt @akshayrai
Thanks for replying to the post.
My bad, SPARK_HOME is actually present. Now I am getting a new error. Please find the error details below.
SPARK_HOME=/usr/hdp/current/spark-client
[sg865w@blpd218 conf]$ ll
total 32
-rw-r--r-- 1 spark spark 1923 Jan 17 10:07 hive-site.xml
-rw-r--r-- 1 spark spark 620 Aug 2 2016 log4j.properties
-rw-r--r-- 1 spark spark 4956 Aug 2 2016 metrics.properties
-rw-r--r-- 1 spark spark 1309 Feb 16 15:18 spark-defaults.conf
-rw-r--r-- 1 spark spark 1971 Jan 25 14:45 spark-env.sh
-rwxr-xr-x 1 spark spark 253 Jan 4 13:25 spark-thrift-fairscheduler.xml
-rw-r--r-- 1 hive hadoop 1373 Feb 16 15:18 spark-thrift-sparkconf.conf
[sg865w@blpd218 conf]$ pwd
/usr/hdp/current/spark-client/conf

Error:
02-28-2017 11:47:39 INFO [Thread-6] com.linkedin.drelephant.ElephantRunner : Job queue size is 1
02-28-2017 11:47:39 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Analyzing SPARK application_1487694609998_0021
02-28-2017 11:47:39 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.spark.fetchers.SparkFetcher : Fetching data for application_1487694609998_0021
02-28-2017 11:47:39 INFO [ForkJoinPool-3-worker-57] com.linkedin.drelephant.spark.fetchers.SparkRestClient : calling REST API at http://blpd214.bhdc.att.com:18080/api/v1/applications/application_1487694609998_0021
02-28-2017 11:47:39 INFO [ForkJoinPool-3-worker-43] com.linkedin.drelephant.spark.fetchers.SparkLogClient : looking for logs at webhdfs://null:50070/spark-history/application_1487694609998_0021_1
02-28-2017 11:47:39 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.spark.fetchers.SparkFetcher : Failed fetching data for application_1487694609998_0021
java.lang.IllegalArgumentException: java.net.UnknownHostException: null
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:438)
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:456)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.initialize(WebHdfsFileSystem.java:207)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)
at com.linkedin.drelephant.spark.fetchers.SparkLogClient.fs$lzycompute(SparkLogClient.scala:69)
at com.linkedin.drelephant.spark.fetchers.SparkLogClient.fs(SparkLogClient.scala:69)
at com.linkedin.drelephant.spark.fetchers.SparkLogClient$$anonfun$fetchData$1$$anonfun$apply$1.apply(SparkLogClient.scala:83)
at com.linkedin.drelephant.spark.fetchers.SparkLogClient$$anonfun$fetchData$1$$anonfun$apply$1.apply(SparkLogClient.scala:83)
at resource.DefaultManagedResource.open(AbstractManagedResource.scala:106)
at resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:85)
at resource.ManagedResourceOperations$class.acquireAndGet(ManagedResourceOperations.scala:25)
at resource.AbstractManagedResource.acquireAndGet(AbstractManagedResource.scala:48)
at com.linkedin.drelephant.spark.fetchers.SparkLogClient$$anonfun$fetchData$1.apply(SparkLogClient.scala:84)
at com.linkedin.drelephant.spark.fetchers.SparkLogClient$$anonfun$fetchData$1.apply(SparkLogClient.scala:84)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)

@akshayrai
Please find the path of spark-defaults.conf below, along with the contents of the file:
/usr/hdp/current/spark-client/conf/spark-defaults.conf

spark-defaults.conf:

spark.driver.extraJavaOptions -Dhdp.version=2.5.3.0-37
spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.history.fs.logDirectory hdfs:///spark-history
spark.history.kerberos.enabled true
spark.history.kerberos.keytab /etc/security/keytabs/spark.headless.keytab
spark.history.kerberos.principal spark@BRHMLAB01.LAB.ATT.COM
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.yarn.am.extraJavaOptions -Dhdp.version=2.5.3.0-37
spark.yarn.applicationMaster.waitTries 10
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.driver.memoryOverhead 384
spark.yarn.executor.memoryOverhead 384
spark.yarn.historyServer.address blpd214.bhdc.att.com:18080
spark.yarn.max.executor.failures 3
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
spark.yarn.submit.file.replication 3

@shkhrgpt (Contributor) commented Feb 28, 2017

You are almost there :)
The problem is the way you have configured spark.eventLog.dir in spark-defaults.conf. As per this documentation, you should set the property as follows:

spark.eventLog.dir=hdfs://namenode_host:namenode_port/spark-history

You can get the namenode host and port from the value of the fs.defaultFS property in core-site.xml.
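If you are not sure of the exact value, you can also query it with the Hadoop client from the Dr. Elephant host (a quick check, assuming the hdfs CLI there is configured against your cluster):

# Print the fs.defaultFS value as the client resolves it
hdfs getconf -confKey fs.defaultFS

The host and port in the printed URI are the ones to reuse in spark.eventLog.dir.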

I hope this change fixes your issue. However, I think we need to make a change in SparkFetcher so that it automatically gets the correct namenode host and port from the Hadoop config files rather than crashing.
@akshayrai and @shankar37, what do you think?

@shyamraj242 (Author) commented Feb 28, 2017

@shkhrgpt
I don't think we can change this configuration: I experimented by changing this property in the Ambari UI, but then we were unable to submit Spark jobs. Please find the image of the spark.eventLog.dir value in the Ambari UI.
I changed the spark-defaults.conf file at /home/drelphnt/dr-elephant-master/test/resources/spark-defaults.conf. After this, when I run the compile.sh script, dr-elephant-2.0.6.zip is not generated.
Error:

[info] - gets its SparkConf when SPARK_CONF_DIR is set *** FAILED ***
[info] "hdfs://[blpd212.bhdc.att.com:50070/spark-history]" was not equal to "hdfs://[nn1.grid.example.com:9000/logs/spark]" (SparkFetcherTest.scala:119)
[info] - gets its SparkConf when SPARK_HOME is set *** FAILED ***
[info] "hdfs://[blpd212.bhdc.att.com:50070/spark-history]" was not equal to "hdfs://[nn1.grid.example.com:9000/logs/spark]" (SparkFetcherTest.scala:145)

[info] *** 2 TESTS FAILED ***
[error] Failed: Total 302, Failed 2, Errors 0, Passed 299, Skipped 1
[error] Failed tests:
[error] com.linkedin.drelephant.spark.fetchers.SparkFetcherTest
[error] (test:test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 67 s, completed Feb 28, 2017 5:16:32 PM
./compile.sh: line 131: cd: target/universal: No such file or directory
/bin/ls: cannot access *.zip: No such file or directory

Please help me on this.

drelephant_spark_cong

@shkhrgpt (Contributor) commented Mar 1, 2017

Okay. I am going to submit a patch to fix this problem very soon. Please wait a little.
BTW, you shouldn't change /home/drelphnt/dr-elephant-master/test/resources/spark-defaults.conf. That is not going to fix anything.

@shyamraj242 (Author)

@shkhrgpt
Thanks, Shekar, for your help.

@rayortigas (Contributor)

@shkhrgpt Thanks for investigating. If you need code for the patch, you might find some material (both old and new) at: https://github.com/linkedin/dr-elephant/pull/149/files

#149 was never merged, but I'm working on a larger change that will pull all or most of that code in. But don't worry about that; it's probably more important that you get your patch in.

@shkhrgpt (Contributor) commented Mar 1, 2017

Thanks, @rayortigas, for this reference.

I am still not sure what the best way is to get the hostname of the namenode. The patch implemented in https://github.com/linkedin/dr-elephant/pull/149/files relies on dfs.nameservices, and I am not sure that property is configured on the majority of clusters.

I submitted this patch, which tries to get the namenode hostname from the value of the fs.defaultFS config property. I am not very sure about this patch either. We always configure a hostname in the value of fs.defaultFS, and I also came across this documentation, which recommends adding a hostname to the value of fs.defaultFS. Again, I am not sure if the majority of people set this hostname in this config.

@shkhrgpt (Contributor) commented Mar 1, 2017

@shyamraj242 I have submitted this patch; you can try it while we work on a more permanent solution. Depending on your configs, it may work for you.

@shyamraj242 (Author) commented Mar 1, 2017

@shkhrgpt
Shekar, I am still getting the same error :-(
03-01-2017 00:41:06 INFO [ForkJoinPool-1-worker-43] com.linkedin.drelephant.spark.fetchers.SparkLogClient : looking for logs at webhdfs://null:50070/spark-history/application_1487694609998_0032_1
03-01-2017 00:41:06 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.spark.fetchers.SparkFetcher : Failed fetching data for application_1487694609998_0032
java.lang.IllegalArgumentException: java.net.UnknownHostException: null
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:438)
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:456)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.initialize(WebHdfsFileSystem.java:207)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)
at com.linkedin.drelephant.spark.fetchers.SparkLogClient.fs$lzycompute(SparkLogClient.scala:69)
at com.linkedin.drelephant.spark.fetchers.SparkLogClient.fs(SparkLogClient.scala:69)
at com.linkedin.drelephant.spark.fetchers.SparkLogClient$$anonfun$fetchData$1$$anonfun$apply$1.apply(SparkLogClient.scala:83)
at com.linkedin.drelephant.spark.fetchers.SparkLogClient$$anonfun$fetchData$1$$anonfun$apply$1.apply(SparkLogClient.scala:83)
at resource.DefaultManagedResource.open(AbstractManagedResource.scala:106)
at resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:85)
at resource.ManagedResourceOperations$class.acquireAndGet(ManagedResourceOperations.scala:25)
at resource.AbstractManagedResource.acquireAndGet(AbstractManagedResource.scala:48)
at com.linkedin.drelephant.spark.fetchers.SparkLogClient$$anonfun$fetchData$1.apply(SparkLogClient.scala:84)
at com.linkedin.drelephant.spark.fetchers.SparkLogClient$$anonfun$fetchData$1.apply(SparkLogClient.scala:84)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.net.UnknownHostException: null

@shkhrgpt (Contributor) commented Mar 1, 2017

@shyamraj242
Did you use this patch? The patch hasn't been checked in yet, so you'll have to manually apply it to your local code.
What's the value of fs.defaultFS in your core-site.xml or hdfs-site.xml?

@shyamraj242 (Author)

@shkhrgpt
Regarding the patch, I am unable to see the files SparkDataCollection.scala, SparkFSFetcher.scala, DummySparkFSFetcher.scala, SparkDataCollectionTest.java, SparkFsFetcherTest.java, and SparkFsFetcherTest.scala in dr-elephant, and I am unable to find the org directory in app.

The value of fs.defaultFS in my core-site.xml is "hdfs://BRHMLAB01", where BRHMLAB01 is my cluster name.
[drelphnt@blpd218 dr-elephant-master]$ ll
total 96
drwxr-x--- 6 drelphnt hadoop 4096 Feb 28 11:06 app
drwxr-x--- 2 drelphnt hadoop 4096 Mar 1 11:50 app-conf
-rw-r----- 1 drelphnt hadoop 1206 Feb 28 11:06 build.sbt
-rw-r----- 1 drelphnt hadoop 100 Feb 28 11:06 compile.conf
-rwxr-xr-x 1 drelphnt hadoop 4100 Feb 28 11:06 compile.sh
drwxr-x--- 3 drelphnt hadoop 4096 Feb 28 11:06 conf
drwxr-x--- 3 drelphnt hadoop 4096 Mar 1 12:02 dist
drwxr-x--- 3 drelphnt hadoop 4096 Feb 28 11:06 images
-rw-r----- 1 drelphnt hadoop 1062 Feb 28 11:06 jacoco.sbt
-rw-r----- 1 drelphnt hadoop 11357 Feb 28 11:06 LICENSE
drwxr-x--- 2 drelphnt hadoop 4096 Mar 1 12:00 logs
-rw-r----- 1 drelphnt hadoop 4990 Feb 28 11:06 NOTICE
drwxr-x--- 4 drelphnt hadoop 4096 Mar 1 11:51 project
drwxr-x--- 6 drelphnt hadoop 4096 Feb 28 11:06 public
-rw-r----- 1 drelphnt hadoop 2633 Feb 28 11:06 README.md
-rw-r----- 1 drelphnt hadoop 290 Feb 28 11:06 resolver.conf.template
drwxr-x--- 2 drelphnt hadoop 4096 Feb 28 11:06 scripts
drwxr-x--- 7 drelphnt hadoop 4096 Mar 1 12:00 target
drwxr-x--- 7 drelphnt hadoop 4096 Feb 28 11:06 test
drwxr-x--- 6 drelphnt hadoop 4096 Feb 28 11:06 web
[drelphnt@blpd218 dr-elephant-master]$ cd app
[drelphnt@blpd218 app]$ ll
total 20
drwxr-x--- 3 drelphnt hadoop 4096 Feb 28 11:06 com
drwxr-x--- 3 drelphnt hadoop 4096 Feb 28 11:06 controllers
-rw-r----- 1 drelphnt hadoop 2766 Feb 28 11:06 Global.java
drwxr-x--- 2 drelphnt hadoop 4096 Feb 28 11:06 models
drwxr-x--- 6 drelphnt hadoop 4096 Feb 28 11:06 views

@shkhrgpt (Contributor) commented Mar 1, 2017

If the value of fs.defaultFS does not contain the namenode host, then this patch will not work. Can you change the value of fs.defaultFS to something like this:

<property>
     <name>fs.defaultFS</name>
     <value>hdfs://$namenode.full.hostname:8020</value>
     <description>Enter your NameNode hostname</description>
</property>

As suggested in the documentation.
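If you are unsure which hosts actually back the filesystem URI, the Hadoop client can list them for you (a quick check, assuming the hdfs CLI is configured against your cluster):

# Print the namenode hostname(s) the client is configured with
hdfs getconf -namenodes

That output tells you which hostname(s) the nameservice resolves to.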

If you can't change this config, then you'll have to wait until we have a proper fix for this problem.

@shkhrgpt (Contributor) commented Mar 1, 2017

About the other files (SparkDataCollection.scala, SparkFSFetcher.scala, DummySparkFSFetcher.scala, SparkDataCollectionTest.java, SparkFsFetcherTest.java, SparkFsFetcherTest.scala): they are not part of this patch, so you don't need to modify them.

Why do you need the org directory in app?

@lexpierce

Until there is a code fix, you need to use the correct port for HDFS access, not the webhdfs interface:

spark.eventLog.dir=hdfs://my.namenode.com:8020/spark-history

I have this working in my HDP 2.5.3 environment.

@murthykurra

Hi Shyamraj,
We are also using HDP 2.6.0 and have installed Dr. Elephant, but we are unable to see any jobs in the UI. The error is:
Oops, an error occured
This exception has been logged with id 74e4pkf37.
I think we have missed some configuration changes.

Please provide the configuration changes that need to be made after installing Dr. Elephant.

Thanks in advance

@murthykurra

Please provide the elephant.conf, the logs, and the XML files that need to be modified.

Please share them; this is a high-priority issue for us.
Thanks

@shyamraj242 (Author)

@shkhrgpt, @rayortigas, @shankar37
Did you ever ship a permanent fix for the Spark issue? I am still facing this error.
07-27-2017 17:53:00 INFO [dr-el-executor-thread-1] com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Retry queue size is 2
07-27-2017 17:53:00 INFO [dr-el-executor-thread-0] com.linkedin.drelephant.ElephantRunner : Analyzing SPARK application_1500337134035_0485
07-27-2017 17:53:00 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Retry queue size is 3
07-27-2017 17:53:00 INFO [dr-el-executor-thread-1] com.linkedin.drelephant.ElephantRunner : Analyzing SPARK application_1500337134035_0457
07-27-2017 17:53:00 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Analyzing SPARK application_1500488751736_3394
07-27-2017 17:53:00 ERROR [dr-el-executor-thread-0] com.linkedin.drelephant.ElephantRunner : java.net.UnknownHostException: null
07-27-2017 17:53:00 ERROR [dr-el-executor-thread-0] com.linkedin.drelephant.ElephantRunner : java.lang.IllegalArgumentException: java.net.UnknownHostException: null
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:438)
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:456)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.initialize(WebHdfsFileSystem.java:208)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2795)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2829)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2811)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:390)
at com.linkedin.drelephant.util.SparkUtils$class.fileSystemAndPathForEventLogDir(SparkUtils.scala:70)
at com.linkedin.drelephant.util.SparkUtils$.fileSystemAndPathForEventLogDir(SparkUtils.scala:312)
at org.apache.spark.deploy.history.SparkFSFetcher.doFetchData(SparkFSFetcher.scala:84)
at org.apache.spark.deploy.history.SparkFSFetcher$$anonfun$fetchData$1.apply(SparkFSFetcher.scala:74)
at org.apache.spark.deploy.history.SparkFSFetcher$$anonfun$fetchData$1.apply(SparkFSFetcher.scala:74)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:78)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1846)
at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:108)
at org.apache.spark.deploy.history.SparkFSFetcher.doAsPrivilegedAction(SparkFSFetcher.scala:78)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:74)
at com.linkedin.drelephant.spark.fetchers.FSFetcher.fetchData(FSFetcher.scala:34)
at com.linkedin.drelephant.spark.fetchers.FSFetcher.fetchData(FSFetcher.scala:29)
at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:233)
at com.linkedin.drelephant.ElephantRunner$ExecutorJob.run(ElephantRunner.java:177)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: null
... 29 more

07-27-2017 17:53:00 ERROR [dr-el-executor-thread-0] com.linkedin.drelephant.ElephantRunner : Add analytic job id [application_1500337134035_0485] into the retry list.

@shkhrgpt (Contributor)

Recently, we made changes to allow SparkFetcher to use only the history server's REST API. Maybe you can try that. If you are using Spark 1.5 or above, then you need to get the latest code and have the following config in FetcherConf.xml:

<fetcher>
    <applicationtype>spark</applicationtype>
    <classname>com.linkedin.drelephant.spark.fetchers.SparkFetcher</classname>
    <params>
      <use_rest_for_eventlogs>true</use_rest_for_eventlogs>
      <should_process_logs_locally>true</should_process_logs_locally>
    </params>
  </fetcher>

With the above settings, SparkFetcher should work without depending on webhdfs.
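As a quick sanity check that the REST path is reachable from the Dr. Elephant host, you can hit the history server's API directly (a sketch; SPARK_HISTORY_HOST is a placeholder for your own history server host):

# Standard Spark monitoring REST endpoint; should return a JSON list of applications
curl http://SPARK_HISTORY_HOST:18080/api/v1/applications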

@shyamraj242 (Author)

@shkhrgpt
Thanks for the reply, Shekar.

We are using Spark 1.6.

  1. [sg865w@blpd218 spark-client]$ spark-submit --version
     Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set
     Spark1 will be picked by default
     (Spark welcome banner) version 1.6.3
     Type --help for more information.

  2. I edited the FetcherConf.xml file:
     [drelphnt@blpd218 ~]$ vi /opt/app/dr-elephant-target/target/universal/dr-elephant-2.0.5/app-conf/FetcherConf.xml
     Previously the spark fetcher was associated with the value below:
     spark org.apache.spark.deploy.history.SparkFSFetcher

I commented out that entry and pasted in the config you gave me. We are still facing the same error.

07-27-2017 19:02:33 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Analyzing SPARK application_1500488751736_3387
07-27-2017 19:02:33 ERROR [dr-el-executor-thread-0] com.linkedin.drelephant.ElephantRunner : java.net.UnknownHostException: null
07-27-2017 19:02:33 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : java.net.UnknownHostException: null
07-27-2017 19:02:33 ERROR [dr-el-executor-thread-1] com.linkedin.drelephant.ElephantRunner : java.net.UnknownHostException: null
07-27-2017 19:02:33 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : java.lang.IllegalArgumentException: java.net.UnknownHostException: null
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:438)
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:456)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.initialize(WebHdfsFileSystem.java:208)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2795)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2829)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2811)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:390)
at com.linkedin.drelephant.util.SparkUtils$class.fileSystemAndPathForEventLogDir(SparkUtils.scala:70)
at com.linkedin.drelephant.util.SparkUtils$.fileSystemAndPathForEventLogDir(SparkUtils.scala:312)
at org.apache.spark.deploy.history.SparkFSFetcher.doFetchData(SparkFSFetcher.scala:84)
at org.apache.spark.deploy.history.SparkFSFetcher$$anonfun$fetchData$1.apply(SparkFSFetcher.scala:74)
at org.apache.spark.deploy.history.SparkFSFetcher$$anonfun$fetchData$1.apply(SparkFSFetcher.scala:74)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:78)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1846)
at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:108)
at org.apache.spark.deploy.history.SparkFSFetcher.doAsPrivilegedAction(SparkFSFetcher.scala:78)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:74)
at com.linkedin.drelephant.spark.fetchers.FSFetcher.fetchData(FSFetcher.scala:34)
at com.linkedin.drelephant.spark.fetchers.FSFetcher.fetchData(FSFetcher.scala:29)
at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:233)
at com.linkedin.drelephant.ElephantRunner$ExecutorJob.run(ElephantRunner.java:177)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: null
... 29 more

07-27-2017 19:02:33 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Drop the analytic job. Reason: reached the max retries for application id = [application_1500488751736_3387].
07-27-2017 19:02:33 ERROR [dr-el-executor-thread-1] com.linkedin.drelephant.ElephantRunner : Drop the analytic job. Reason: reached the max retries for application id = [application_1500488751736_3422].

Please help me with this.

@shkhrgpt (Contributor) commented Jul 28, 2017

From your stack trace, it looks like Dr. Elephant is still trying to load FSFetcher, which depends on WebHDFS:

at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:108)
at org.apache.spark.deploy.history.SparkFSFetcher.doAsPrivilegedAction(SparkFSFetcher.scala:78)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:74)
at com.linkedin.drelephant.spark.fetchers.FSFetcher.fetchData(FSFetcher.scala:34)
at com.linkedin.drelephant.spark.fetchers.FSFetcher.fetchData(FSFetcher.scala:29)

Can you share your latest FetcherConf.xml?

@shyamraj242 (Author) commented Jul 28, 2017

@shkhrgpt

Please find the XML file:

<fetcher>
  <applicationtype>mapreduce</applicationtype>
  <classname>com.linkedin.drelephant.mapreduce.fetchers.MapReduceFetcherHadoop2</classname>
  <params>
    <sampling_enabled>false</sampling_enabled>
  </params>
</fetcher>
<fetcher>
  <applicationtype>spark</applicationtype>
  <classname>com.linkedin.drelephant.spark.fetchers.SparkFetcher</classname>
  <params>
    <use_rest_for_eventlogs>true</use_rest_for_eventlogs>
    <should_process_logs_locally>true</should_process_logs_locally>
  </params>
</fetcher>

FetcherConf.txt

@shkhrgpt (Contributor)

I can't see the entire content of FetcherConf. For some reason, Dr. Elephant is still loading the older fetcher.

Do you have the following lines in your FetcherConf.xml?

 <fetcher>
    <applicationtype>spark</applicationtype>
    <classname>com.linkedin.drelephant.spark.fetchers.FSFetcher</classname>
 </fetcher>

If you have the above lines, then you should remove them.

@shyamraj242 (Author)

@shkhrgpt

I have those lines. I attached the .xml file in the previous post, and I have attached it here as well.
FetcherConf.txt

@shkhrgpt (Contributor)

If you remove the lines about FSFetcher, then it should load the correct SparkFetcher, which will use only the REST interface of the history server.

@shyamraj242 (Author)

@shankar37
I have changed the XML file as you suggested, but I am still facing the error below.
07-27-2017 20:14:03 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Analyzing SPARK application_1501194505315_0509
07-27-2017 20:14:03 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.spark.fetchers.SparkFetcher : Fetching data for application_1501194505315_0509
07-27-2017 20:14:03 INFO [ForkJoinPool-1-worker-29] com.linkedin.drelephant.spark.fetchers.SparkRestClient : calling REST API at http://blpd586.bhdc.att.com:18080/api/v1/applications/application_1501194505315_0509
07-27-2017 20:14:03 ERROR [ForkJoinPool-1-worker-29] com.linkedin.drelephant.spark.fetchers.SparkRestClient : error reading applicationInfo http://blpd586.bhdc.att.com:18080/api/v1/applications/application_1501194505315_0509
javax.ws.rs.NotFoundException: HTTP 404 Not Found
at org.glassfish.jersey.client.JerseyInvocation.convertToException(JerseyInvocation.java:1020)
at org.glassfish.jersey.client.JerseyInvocation.translate(JerseyInvocation.java:819)
at org.glassfish.jersey.client.JerseyInvocation.access$700(JerseyInvocation.java:92)
at org.glassfish.jersey.client.JerseyInvocation$2.call(JerseyInvocation.java:701)
at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
at org.glassfish.jersey.internal.Errors.process(Errors.java:228)
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:444)
at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:697)
at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:420)
at org.glassfish.jersey.client.JerseyInvocation$Builder.get(JerseyInvocation.java:316)
at com.linkedin.drelephant.spark.fetchers.SparkRestClient$.get(SparkRestClient.scala:239)
at com.linkedin.drelephant.spark.fetchers.SparkRestClient.getApplicationInfo(SparkRestClient.scala:131)
at com.linkedin.drelephant.spark.fetchers.SparkRestClient.getApplicationMetaData(SparkRestClient.scala:120)
at com.linkedin.drelephant.spark.fetchers.SparkRestClient.fetchEventLogAndParse(SparkRestClient.scala:101)
at com.linkedin.drelephant.spark.fetchers.SparkFetcher$$anonfun$com$linkedin$drelephant$spark$fetchers$SparkFetcher$$doFetchSparkApplicationData$1.apply(SparkFetcher.scala:106)
at com.linkedin.drelephant.spark.fetchers.SparkFetcher$$anonfun$com$linkedin$drelephant$spark$fetchers$SparkFetcher$$doFetchSparkApplicationData$1.apply(SparkFetcher.scala:106)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
07-27-2017 20:14:03 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.spark.fetchers.SparkFetcher : Failed fetching data for application_1501194505315_0509
javax.ws.rs.NotFoundException: HTTP 404 Not Found
at org.glassfish.jersey.client.JerseyInvocation.convertToException(JerseyInvocation.java:1020)
at org.glassfish.jersey.client.JerseyInvocation.translate(JerseyInvocation.java:819)
at org.glassfish.jersey.client.JerseyInvocation.access$700(JerseyInvocation.java:92)
at org.glassfish.jersey.client.JerseyInvocation$2.call(JerseyInvocation.java:701)
at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
at org.glassfish.jersey.internal.Errors.process(Errors.java:228)
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:444)
at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:697)
at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:420)
at org.glassfish.jersey.client.JerseyInvocation$Builder.get(JerseyInvocation.java:316)
at com.linkedin.drelephant.spark.fetchers.SparkRestClient$.get(SparkRestClient.scala:239)
at com.linkedin.drelephant.spark.fetchers.SparkRestClient.getApplicationInfo(SparkRestClient.scala:131)
at com.linkedin.drelephant.spark.fetchers.SparkRestClient.getApplicationMetaData(SparkRestClient.scala:120)
at com.linkedin.drelephant.spark.fetchers.SparkRestClient.fetchEventLogAndParse(SparkRestClient.scala:101)
at com.linkedin.drelephant.spark.fetchers.SparkFetcher$$anonfun$com$linkedin$drelephant$spark$fetchers$SparkFetcher$$doFetchSparkApplicationData$1.apply(SparkFetcher.scala:106)
at com.linkedin.drelephant.spark.fetchers.SparkFetcher$$anonfun$com$linkedin$drelephant$spark$fetchers$SparkFetcher$$doFetchSparkApplicationData$1.apply(SparkFetcher.scala:106)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
07-27-2017 20:14:03 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : HTTP 404 Not Found
07-27-2017 20:14:03 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : javax.ws.rs.NotFoundException: HTTP 404 Not Found
at org.glassfish.jersey.client.JerseyInvocation.convertToException(JerseyInvocation.java:1020)
at org.glassfish.jersey.client.JerseyInvocation.translate(JerseyInvocation.java:819)
at org.glassfish.jersey.client.JerseyInvocation.access$700(JerseyInvocation.java:92)
at org.glassfish.jersey.client.JerseyInvocation$2.call(JerseyInvocation.java:701)
at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
at org.glassfish.jersey.internal.Errors.process(Errors.java:228)
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:444)
at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:697)
at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:420)
at org.glassfish.jersey.client.JerseyInvocation$Builder.get(JerseyInvocation.java:316)
at com.linkedin.drelephant.spark.fetchers.SparkRestClient$.get(SparkRestClient.scala:239)
at com.linkedin.drelephant.spark.fetchers.SparkRestClient.getApplicationInfo(SparkRestClient.scala:131)
at com.linkedin.drelephant.spark.fetchers.SparkRestClient.getApplicationMetaData(SparkRestClient.scala:120)
at com.linkedin.drelephant.spark.fetchers.SparkRestClient.fetchEventLogAndParse(SparkRestClient.scala:101)
at com.linkedin.drelephant.spark.fetchers.SparkFetcher$$anonfun$com$linkedin$drelephant$spark$fetchers$SparkFetcher$$doFetchSparkApplicationData$1.apply(SparkFetcher.scala:106)
at com.linkedin.drelephant.spark.fetchers.SparkFetcher$$anonfun$com$linkedin$drelephant$spark$fetchers$SparkFetcher$$doFetchSparkApplicationData$1.apply(SparkFetcher.scala:106)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Please also find the attached FetcherConf.xml file.
FetcherConf.txt

@shyamraj242 (Author)

@shkhrgpt
Thanks for your support.

@shkhrgpt (Contributor)

Is your Spark history server up and running? What happens when you try to reach the following URL in your browser?

http://blpd586.bhdc.att.com:18080/api/v1/applications/application_1501194505315_0509

@shyamraj242 (Author) commented Jul 28, 2017

@shkhrgpt

Shekar, the Spark history server is up and running properly. I removed /api/v1 from the URL below and opened it in the browser; I am able to see the jobs on the Spark history server.

http://blpd586.bhdc.att.com:18080/applications/application_1501194505315_0509

@shkhrgpt (Contributor)

What if you don't remove /api/v1? What do you see in your browser?

One other thing: you should run the following command from the host where you are running Dr. Elephant, to make sure there are no connection issues between that host and the Spark history server:

curl http://blpd586.bhdc.att.com:18080/api/v1/applications/application_1501194505315_0509

@shyamraj242 (Author)

@shkhrgpt

Thanks for the help. I have installed Dr. Elephant on another cluster, and now I am not facing the Spark issue; I think the reason is that the other cluster runs HDP 2.6. Finally, it's working.

Dr. Elephant supports Tez jobs as well. I think I need to add the code from the link below into dr-elephant-master. Correct me if I am wrong.

qubole@78d2cc8

@shkhrgpt (Contributor)

That's great.
Which spark fetcher is working for you?

@shyamraj242 (Author) commented Jul 28, 2017

@shkhrgpt

The one you gave me (SparkFetcher). I identified the problem; it was with Spark.
Regarding the Tez jobs: do I need to copy the code at qubole/dr-elephant@78d2cc8 into Dr. Elephant?
Once again, Shekar, thanks for the help.

@shkhrgpt (Contributor)

We have been gathering data to analyze the performance of the new REST-API-based SparkFetcher. Can you please let us know how big your cluster is and how many Spark applications this new fetcher is processing? Thank you.

Regarding the qubole Tez code, it looks like it is not synchronized with the latest changes in Dr. Elephant. So if you try to use their code, you may see merge conflicts, and you may also not be able to use the latest changes in Dr. Elephant, such as this new SparkFetcher.

@shyamraj242 (Author)

@shkhrgpt
The cluster is very big.

So do we need to wait some more time before using Tez applications in Dr. Elephant?

Shekar, I am facing a new issue: the UI is displaying the wrong timestamps. I checked the MySQL DB entries, where the start time and end time show as 1000, but those jobs take almost 45 seconds to complete. Please help me with this. Please find the attachments.
elephant
ddd

@shkhrgpt (Contributor)

If possible, can you please approximately quantify how big the cluster is?

Can you first check what start and end times you are seeing on the ResourceManager REST interface?

Go to this URL in your browser:

http://RMHOST:RMPORT/ws/v1/cluster/apps
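If the response is too large to scan by eye, you can narrow it down from the command line (a sketch; RMHOST and RMPORT are placeholders for your ResourceManager host and port):

# Fetch a handful of succeeded apps and compare their startedTime/finishedTime
# fields against the values Dr. Elephant stored in MySQL
curl 'http://RMHOST:RMPORT/ws/v1/cluster/apps?finalStatus=SUCCEEDED&limit=5'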

@shyamraj242 (Author) commented Jul 29, 2017

@shkhrgpt
Actually, the cluster is almost 12 PB.
I went to the URL and am able to see the finish and start times for a particular job, and they are not 1000. In the MySQL elephant DB, though, they are stored as 1000, and because of that the wrong timestamp is shown. Please find the attachments, and let me know in case of any concerns. Please help me with this.

bug1

bug2
bug3

@shyamraj242 (Author) commented Jul 31, 2017

@akshayrai @shkhrgpt @shankar37

On the UI it is displaying old dates. The time value in the MySQL DB is 1000, but in the XML the start and finish times are not 1000. You can check the images as well. Please help me with this.

@shkhrgpt

Can you please help me with this? We are unable to see the latest jobs.

@shkhrgpt (Contributor)

I don't know why you are experiencing this issue. Is it only happening with MapReduce applications, or is it happening with Spark applications too? Are you seeing any errors in the Dr. Elephant logs?

Maybe you can wipe out your DB and create a new one?

@shyamraj242 (Author)

@shkhrgpt

Shekar, we are experiencing this with MapReduce applications.
I have created a new database and still have the same issue. Please look at the images.

bug11
bug12

@akshayrai (Contributor)

Closing this as it is not reproducible.
