Dr-elephant is not collecting the data #206
Comments
@akshayrai: please help here.
@shkhrgpt, @akshayrai I have set export SPARK_HOME=/usr/hdp/current/spark-client. Case 2: I have copied these values from /usr/hdp/current/spark-client/conf/spark-defaults.conf. Error: Please help me with this.
Are you setting SPARK_HOME and running Dr Elephant in the same shell?
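This matters because a child process only inherits the environment of the shell that launches it. A minimal sketch of the check, assuming the HDP-style path used elsewhere in this thread (the start script name is illustrative):

```shell
# Export SPARK_HOME in the same shell session that will launch Dr Elephant,
# then confirm it is visible before starting the daemon.
export SPARK_HOME=/usr/hdp/current/spark-client
echo "SPARK_HOME is: $SPARK_HOME"
# ./bin/start.sh   # launched from this shell, it would inherit SPARK_HOME
```

If Dr Elephant is started from a different shell, or by an init script that does not source your profile, the variable will not be set for it.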
@shkhrgpt [drelphnt@blpd218 ~] export SPARK_HOME=/usr/hdp/current/spark-client/
@shkhrgpt Can you please help me with this?
I am sorry, but I can't figure out why you are seeing this error. It should be able to find the Spark conf. Maybe @rayortigas or @shankar37 can add something here?
@shyamraj242 Please run:
sudo cat /proc/[PID_OF_DR_ELEPHANT]/environ | xargs --null --max-args=1 echo
For example, if the process id is 4306, then the command will be the following:
sudo cat /proc/4306/environ | xargs --null --max-args=1 echo
Please share the output of the command.
@shyamraj242, can you also tell us the absolute path of spark-defaults.conf in your spark distribution?
@shkhrgpt @akshayrai #Error: @akshayrai #Spark-Default.conf: spark.driver.extraJavaOptions -Dhdp.version=2.5.3.0-37
You are almost there :)
You can get the namenode host and port from the value of the fs.defaultFS property in core-site.xml. I hope this change will fix your issue. However, I think we need to make a change in SparkFetcher so that it automatically gets the correct namenode host and port from the Hadoop config files, rather than crashing.
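As a sketch of where the host and port come from: fs.defaultFS holds a URI like hdfs://host:port, and its authority part is exactly what an event-log path needs. On a cluster node you can read the live value with `hdfs getconf`; here we just parse a sample value with plain shell (the hostname below is illustrative, not a real namenode):

```shell
# On a cluster node you could run:  hdfs getconf -confKey fs.defaultFS
# Here we parse a sample value instead (hostname is a placeholder).
defaultfs="hdfs://my.namenode.example.com:8020"
authority="${defaultfs#hdfs://}"   # strip the scheme -> host:port
host="${authority%%:*}"            # everything before the colon
port="${authority##*:}"            # everything after the colon
echo "$host $port"
```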
@shkhrgpt
[info] - gets its SparkConf when SPARK_CONF_DIR is set *** FAILED ***
[info] *** 2 TESTS FAILED ***
Please help me with this.
Okay. I am going to submit a patch to fix this problem very soon. Please wait a little.
@shkhrgpt
@shkhrgpt Thanks for investigating. If you need code for the patch, you might find some material (both old and new) at: https://github.com/linkedin/dr-elephant/pull/149/files #149 was never merged, but I'm working on a larger change that will pull all or most of that code in. But don't worry about that; it's probably more important that you get your patch in.
Thanks, @rayortigas, for the reference. I am still not sure what the best way is to get the hostname of the namenode. The patch implemented in https://github.com/linkedin/dr-elephant/pull/149/files relies on I submitted this patch which tries to get the namenode hostname from the value of
@shyamraj242 I have submitted this patch; you can try it while we work on a more permanent solution. Depending on your configs, this may work for you.
@shkhrgpt
@shyamraj242
@shkhrgpt The value of fs.defaultFS in my core-site.xml is "hdfs://BRHMLAB01", where BRHMLAB01 is my cluster name.
If the value of fs.defaultFS does not contain the namenode host, then this patch will not work. Can you change the value of fs.defaultFS to something like this:
as suggested in the documentation. If you can't change this config, then you'll have to wait until we have a proper fix for this problem.
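For illustration only (the hostname and port below are placeholders, not your actual namenode), a host-qualified fs.defaultFS entry in core-site.xml typically looks like this:

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://my.namenode.example.com:8020</value>
</property>
```

Note that a bare cluster name such as hdfs://BRHMLAB01 usually indicates an HA nameservice id rather than a resolvable host, which is why a fetcher that expects host:port fails on it.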
About the other files (SparkDataCollection.scala, SparkFSFetcher.scala, DummySparkFSFetcher.scala, SparkDataCollectionTest.java, SparkFsFetcherTest.java, SparkFsFetcherTest.scala): they are not part of this patch, so you don't need to modify them. Why do you need the org directory in app?
Until there is a code fix, you need to use the correct port for HDFS access, not the WebHDFS interface:
spark.eventLog.dir=hdfs://my.namenode.com:8020/spark-history
I have this working in my HDP 2.5.3 environment.
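Concretely, that working HDP 2.5.3 setting amounts to a line like the following in spark-defaults.conf (the hostname is a placeholder; 8020 is the HDFS RPC port, as opposed to the WebHDFS HTTP port):

```properties
# Point the event log at the HDFS RPC endpoint, not WebHDFS
spark.eventLog.dir    hdfs://my.namenode.example.com:8020/spark-history
```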
Hi Shyamraj, please provide the configuration changes that need to be made after installing Dr elephant. Thanks in advance.
Please provide the elephant.conf, the log, and the XML files that need to be modified. Please share them; this is a high-priority issue for us.
@shkhrgpt, @rayortigas, @shankar37 07-27-2017 17:53:00 ERROR [dr-el-executor-thread-0] com.linkedin.drelephant.ElephantRunner : Add analytic job id [application_1500337134035_0485] into the retry list.
Recently, we made changes to allow SparkFetcher to only use history server's REST API. Maybe you can try that. If you are using Spark 1.5 or above, then you need to get the latest code, and have the following config in FetcherConf.xml:
With the above settings, SparkFetcher should work without depending on webhdfs. |
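A sketch of what such a FetcherConf.xml entry could look like (the use_rest_for_eventlogs parameter name is my assumption based on the REST-only change described above; verify it against the fetcher shipped with your build):

```xml
<fetcher>
  <applicationtype>spark</applicationtype>
  <classname>com.linkedin.drelephant.spark.fetchers.SparkFetcher</classname>
  <params>
    <!-- assumed parameter: fetch event logs via the history server REST API -->
    <use_rest_for_eventlogs>true</use_rest_for_eventlogs>
  </params>
</fetcher>
```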
@shkhrgpt We are using Spark 1.6.
Type --help for more information.
I have commented out that code and pasted the code which you gave me. We are still facing the same error.
07-27-2017 19:02:33 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Analyzing SPARK application_1500488751736_3387
07-27-2017 19:02:33 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Drop the analytic job. Reason: reached the max retries for application id = [application_1500488751736_3387].
Help me with this.
From your stacktrace, it looks like Dr Elephant is trying to load FSFetcher, which depends on WebHDFS.
Can you share your latest FetcherConf.xml?
Please find the xml file:
mapreduce com.linkedin.drelephant.mapreduce.fetchers.MapReduceFetcherHadoop2 false
spark com.linkedin.drelephant.spark.fetchers.SparkFetcher true true
I can't see the entire content of FetcherConf. Do you have the following lines in your FetcherConf.xml?
If you have the above lines, then you should remove them.
I have those lines. I attached the .xml file in the previous post, and I have attached it in this one as well.
If you remove the lines about FSFetcher, then it should load the correct SparkFetcher, which will only use the REST interface of the history server.
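For readers following along, the legacy entry being discussed is the WebHDFS-backed FS fetcher; its class name appears in the stacktrace later in this thread. An illustrative block of that shape in FetcherConf.xml would be:

```xml
<!-- Illustrative legacy entry: removing a block like this lets the
     REST-based SparkFetcher be loaded instead of the FS fetcher. -->
<fetcher>
  <applicationtype>spark</applicationtype>
  <classname>org.apache.spark.deploy.history.SparkFSFetcher</classname>
</fetcher>
```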
@shankar37 And also please find the attached FetcherConf.xml file.
@shkhrgpt
Is your spark history server up and running? What happens when you try to reach the following URL in your browser: http://blpd586.bhdc.att.com:18080/api/v1/applications/application_1501194505315_0509
Shekar, the spark history server is up and it is running properly. I removed /api/v1 from the URL and ran it in the browser; I am able to see the jobs on the Spark history server. http://blpd586.bhdc.att.com:18080/applications/application_1501194505315_0509
What if you don't remove /api/v1? What do you see in your browser? One other thing: you should try the following command from the host where you are running Dr elephant, to make sure that there are no connection issues between that host and the Spark history server:
curl http://blpd586.bhdc.att.com:18080/api/v1/applications/application_1501194505315_0509
Thanks for the help. I have installed Dr.Elephant on another cluster, and now I am not facing the spark issue. I think the reason is that on the other cluster we are using HDP 2.6. Finally, it's working. Dr.Elephant supports Tez jobs as well; I think I need to add the code from the below-mentioned link to dr-elephant-master. Correct me if I am wrong.
That's great.
The one which you have given (Spark Fetcher). I identified the problem; it is with the spark.
We have been gathering data to analyze the performance of the new REST API based Spark Fetcher. Can you please let us know how big your cluster is, and how many spark applications this new fetcher is processing? Thank you. Regarding the Qubole Tez code, it looks like it is not synchronized with the latest changes in Dr Elephant. So if you try to use their code, you may see merge conflicts, and you may also not be able to use the latest changes of Dr Elephant, such as this new Spark Fetcher.
@shkhrgpt So we need to wait some more time to use Tez applications in Dr.Elephant? Shekar, I am facing a new issue: it is displaying the wrong timestamp. I checked the mysql db entries, where the start time and end time show 1000, but those jobs take almost 45 seconds to complete. Help me with this. Please find the attachments.
If it's possible, can you please approximately quantify how big the cluster is? Also, can you first check what start and end times you are seeing on the resource manager REST interface? Go to this URL in your browser:
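The per-application endpoint on the resource manager is ws/v1/cluster/apps/{appid}, and the startedTime and finishedTime fields in its JSON response are epoch milliseconds. A sketch of building that URL (the host is a placeholder; the application id is taken from the logs in this thread):

```shell
# Build the resource manager REST URL for one application.
RM="http://resourcemanager.example.com:8088"      # placeholder RM host
APP="application_1486843207585_79387"
URL="$RM/ws/v1/cluster/apps/$APP"
echo "$URL"
# You would then fetch it and inspect the timestamps:
#   curl -s "$URL"    # look at app.startedTime / app.finishedTime (epoch ms)
```

Comparing those values against what lands in the mysql table should show whether the bad "1000" timestamp comes from YARN or from Dr Elephant's write path.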
@shkhrgpt
@akshayrai @shkhrgpt @shankar37 On the UI it is displaying old dates. The value of the time in the mysql db is 1000. In the XML, the starting and finishing times are not 1000. You can check the images as well. Can you please help me with this? We are unable to see the latest jobs.
I don't know why you are experiencing this issue. Is it only happening with MapReduce applications, or is it happening with Spark applications too? Are you seeing any errors in the Dr Elephant logs? Maybe you can wipe out your DB and create a new one?
Shekar, we are experiencing this with MapReduce applications.
Closing this as it is not reproducible.
We usually run thousands of jobs daily, and I am unable to see the jobs on the Dr.Elephant UI. We restarted the yarn-server on Jan 28 and also restarted Dr.Elephant, but since that moment I have been unable to see the jobs on the UI. I am getting the connection error shown below. I am attaching the Dr.Elephant log file and a screenshot of the Dr.Elephant UI.
02-14-2017 10:24:08 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The list of RM IDs are rm1,rm2
02-14-2017 10:24:08 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Checking RM URL: http://ylpd269.kmdc.att.com:8088/ws/v1/cluster/info
02-14-2017 10:24:08 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : ylpd269.kmdc.att.com:8088 is ACTIVE
02-14-2017 10:24:08 INFO com.linkedin.drelephant.ElephantRunner : Fetching analytic job list...
02-14-2017 10:24:08 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Fetching recent finished application runs between last time: 1487085728997, and current time: 1487085788998
02-14-2017 10:24:08 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The succeeded apps URL is http://ylpd269.kmdc.att.com:8088/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=1487085728997&finishedTimeEnd=1487085788998
02-14-2017 10:24:09 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 2 analyzing MAPREDUCE application_1486843207585_79341
02-14-2017 10:24:09 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The failed apps URL is http://ylpd269.kmdc.att.com:8088/ws/v1/cluster/apps?finalStatus=FAILED&finishedTimeBegin=1487085728997&finishedTimeEnd=1487085788998
02-14-2017 10:24:09 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 4432
02-14-2017 10:24:09 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 3 analyzing MAPREDUCE application_1486843207585_79340
02-14-2017 10:24:11 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 2 analyzing MAPREDUCE application_1486843207585_79343
02-14-2017 10:24:12 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 2 analyzing MAPREDUCE application_1486843207585_79344
02-14-2017 10:24:13 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 2 analyzing MAPREDUCE application_1486843207585_79384
02-14-2017 10:24:14 INFO com.linkedin.drelephant.ElephantRunner : Executor thread 2 analyzing SPARK application_1486843207585_79387
02-14-2017 10:24:14 ERROR com.linkedin.drelephant.ElephantRunner :
02-14-2017 10:24:14 ERROR com.linkedin.drelephant.ElephantRunner : java.security.PrivilegedActionException: java.net.ConnectException: Connection refused
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1689)
at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:48)
at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:232)
at com.linkedin.drelephant.ElephantRunner$ExecutorThread.run(ElephantRunner.java:151)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:198)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.&lt;init&gt;(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:998)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:934)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:852)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:686)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:638)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:711)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:559)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:588)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:584)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:1436)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:312)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getAuthParameters(WebHdfsFileSystem.java:524)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toUrl(WebHdfsFileSystem.java:545)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractFsPathRunner.getUrl(WebHdfsFileSystem.java:801)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:709)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:559)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:588)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:584)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:948)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:963)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
at org.apache.spark.deploy.history.SparkFSFetcher.org$apache$spark$deploy$history$SparkFSFetcher$$isLegacyLogDirectory(SparkFSFetcher.scala:186)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:143)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:99)
... 13 more
02-14-2017 10:24:14 ERROR com.linkedin.drelephant.ElephantRunner : Add analytic job id [application_1486843207585_79387] into the retry list.
dr_elephant.pdf