Using Elasticsearch-hadoop library to import and process logs in parallel #62

Closed
quintinali opened this issue Sep 27, 2016 · 8 comments

@quintinali
Collaborator

quintinali commented Sep 27, 2016

We are using Elasticsearch version 2.3, which only supports Spark up to 1.6.1. If we want Spark 2.0 compatibility, we will need to switch to the 5.0.0-alpha5 version. However, that is an alpha release and should be used for testing purposes only (i.e., not in production).

Do you have any suggestions?

Reference: https://discuss.elastic.co/t/write-es-error-with-spark-2-0-release/56967/3

@quintinali
Collaborator Author

quintinali commented Oct 1, 2016

Branch name: cc_es_hadoop
The first step is to update Elasticsearch from 2.3 to 5.0.0-beta1.

  1. I changed the log4j configuration in pom.xml, since Elasticsearch will no longer detect logging implementations (a dependency sketch follows this list). Could @lewismc review my modification, because I am not familiar with Maven configuration?
    https://www.elastic.co/guide/en/elasticsearch/reference/5.x/breaking_50_java_api_changes.html#_elasticsearch_will_no_longer_detect_logging_implementations

  2. I changed the terms aggregation parameter, since size: 0 is no longer valid for terms aggregations (an aggregation sketch follows this list). Could @Yongyao and @lewismc check whether there is a better implementation with higher performance?
    https://www.elastic.co/guide/en/elasticsearch/reference/5.x/breaking_50_aggregations_changes.html#_literal_size_0_literal_on_terms_significant_terms_and_geohash_grid_aggregations
    http://stackoverflow.com/questions/22927098/show-all-elasticsearch-aggregation-results-buckets-and-not-just-10?rq=1
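For reference, here is a minimal sketch of both changes; the versions, index name, and field name below are illustrative assumptions, not necessarily what the branch actually uses.

For item 1, ES 5.x clients need Log4j 2 on the classpath, so pom.xml would gain something like:

<!-- Sketch only: the 2.6.2 version is an assumption, not the branch's actual choice -->
<dependency>
  <groupId>org.apache.logging.log4j</groupId>
  <artifactId>log4j-api</artifactId>
  <version>2.6.2</version>
</dependency>
<dependency>
  <groupId>org.apache.logging.log4j</groupId>
  <artifactId>log4j-core</artifactId>
  <version>2.6.2</version>
</dependency>

For item 2, an explicit (large) bucket count replaces the old size: 0 "return all buckets" behaviour:

// Sketch only: the index name "mudrod" and field "IP" are illustrative;
// the Client is assumed to be an existing 5.x client instance.
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;

public class TermsAggSketch {
  static void printUserBuckets(Client client) {
    SearchResponse response = client.prepareSearch("mudrod")
        .setSize(0) // aggregations only, no hits
        .addAggregation(AggregationBuilders.terms("users").field("IP").size(10000))
        .get();
    Terms users = response.getAggregations().get("users");
    for (Terms.Bucket bucket : users.getBuckets()) {
      System.out.println(bucket.getKeyAsString() + " -> " + bucket.getDocCount());
    }
  }
}

Whether 10000 is large enough depends on the cardinality of the field; the Stack Overflow thread above discusses the trade-offs of very large bucket counts.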

@lewismc
Collaborator

lewismc commented Oct 3, 2016

I'll scope this out tomorrow.

@lewismc lewismc removed this from the 09/30/2016 milestone Oct 12, 2016
@quintinali
Collaborator Author

@lewismc I have upgraded Elasticsearch from 2.x to 5.x and implemented parallel log import using the Elasticsearch-Hadoop library. I set up a cluster and want to test the whole process on it, but I don't know which package in the target directory can be used with the spark-submit command.

I tried to execute the following command in the mudrod-core target directory

/home/centos/mudrod/spark-1.4.0-bin-hadoop2.4/bin/spark-submit --class esiptestbed.mudrod.main.MudrodEngine mudrod-core-0.0.1-SNAPSHOT.jar

and got this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/jdom2/JDOMException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:633)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.jdom2.JDOMException
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more

I think the reason is that the dependency packages are not included in the built package. Is that correct? Do we have to modify pom.xml to generate another package including all dependencies?

@lewismc
Collaborator

lewismc commented Oct 16, 2016

Hi @quintinali

I think the reason is that the dependency packages are not included in the built package. Is that correct? Do we have to modify pom.xml to generate another package including all dependencies?

Yes, correct. This is trivial work and is described at https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/missing_dependencies_in_jar_files.html
We should use the Maven shade plugin to create an 'uber' jar which can be submitted to the Spark cluster.
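A minimal sketch of that pom.xml change, assuming the plugin version below and reusing the MudrodEngine main class from the spark-submit command above:

<!-- Sketch only: plugin version 2.4.3 is an assumption -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>esiptestbed.mudrod.main.MudrodEngine</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>

After mvn package, the jar in the target directory then bundles the dependencies and can be passed to spark-submit exactly as in the command above.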

@lewismc
Collaborator

lewismc commented Oct 26, 2016

@quintinali, can you please create a pull request for your work on the following branch: https://github.com/mudrod/mudrod/tree/cc_es_hadoop
Everything needs to go through pull requests.

@lewismc
Collaborator

lewismc commented Oct 26, 2016

@quintinali please see my mail on the Mudrod mailing list for guidance on creating pull requests.

@quintinali
Collaborator Author

Main modifications in this branch:

  1. Update Elasticsearch from 2.3 to 5.0-beta1.
  2. Add the Elasticsearch-Spark library.
  3. Parallelize log import, crawler detection, and session reconstruction (a parallel-import sketch follows this list). The original sequential functions are kept for comparison.
  4. Adapt to Elasticsearch API changes caused by the upgrade.
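For item 3, a minimal sketch of the parallel import path using the elasticsearch-spark Java API; the index/type name, log path, and Elasticsearch address are illustrative assumptions, not necessarily what the branch uses.

// Sketch only: "mudrod/rawlog", the HDFS path, and localhost:9200 are assumptions.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class ParallelLogImportSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("mudrod-log-import")
        .set("es.nodes", "localhost")          // Elasticsearch node(s)
        .set("es.port", "9200")
        .set("es.index.auto.create", "true");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    // Read raw logs in parallel; here each line is assumed to already be a JSON document.
    JavaRDD<String> logLines = jsc.textFile("hdfs:///mudrod/logs/*.log");

    // elasticsearch-spark writes each partition with bulk requests, so the import runs in parallel.
    JavaEsSpark.saveJsonToEs(logLines, "mudrod/rawlog");

    jsc.stop();
  }
}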

@lewismc lewismc added this to the 02/01/17 milestone Jan 18, 2017
@lewismc lewismc added this to Engine Integration, Deployment and Testing in AIST Master Schedule Feb 1, 2017
@lewismc
Collaborator

lewismc commented Feb 1, 2017

Issue resolved in #77

@lewismc lewismc closed this as completed Feb 1, 2017
@lewismc lewismc removed this from Engine Integration and Deployment in AIST Master Schedule Feb 23, 2017