Using Elasticsearch-hadoop library to import and process logs in parallel #62

Closed
quintinali opened this issue Sep 27, 2016 · 8 comments

@quintinali
Collaborator

quintinali commented Sep 27, 2016

We are using Elasticsearch version 2.3, which only supports Spark up to 1.6.1. If we want Spark 2.0 compatibility, we will need to switch to the 5.0.0-alpha5 version. However, that is an alpha release and should be used for testing purposes only (i.e., not in production).

Do you have any suggestions?

Reference: https://discuss.elastic.co/t/write-es-error-with-spark-2-0-release/56967/3

@quintinali
Collaborator Author

quintinali commented Oct 1, 2016

Branch name: cc_es_hadoop
The first step is to update Elasticsearch from 2.3 to 5.0.0-beta1.

  1. I changed the log4j configuration in pom.xml, since Elasticsearch will no longer detect logging implementations (a dependency sketch follows this list). Could @lewismc review my modification, because I am not familiar with Maven configuration?
    https://www.elastic.co/guide/en/elasticsearch/reference/5.x/breaking_50_java_api_changes.html#_elasticsearch_will_no_longer_detect_logging_implementations

  2. I changed the terms aggregation parameter, since size: 0 is no longer valid for terms aggregations (an aggregation sketch follows this list). Could @Yongyao and @lewismc check whether there is a better implementation with higher performance?
    https://www.elastic.co/guide/en/elasticsearch/reference/5.x/breaking_50_aggregations_changes.html#_literal_size_0_literal_on_terms_significant_terms_and_geohash_grid_aggregations
    http://stackoverflow.com/questions/22927098/show-all-elasticsearch-aggregation-results-buckets-and-not-just-10?rq=1
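For reference, here is a minimal sketch of both changes; the versions, index name, and field name below are illustrative assumptions, not necessarily what the branch actually uses.

For item 1, ES 5.x clients need Log4j 2 on the classpath, so pom.xml would gain something like:

<!-- Sketch only: the 2.6.2 version is an assumption, not the branch's actual choice -->
<dependency>
  <groupId>org.apache.logging.log4j</groupId>
  <artifactId>log4j-api</artifactId>
  <version>2.6.2</version>
</dependency>
<dependency>
  <groupId>org.apache.logging.log4j</groupId>
  <artifactId>log4j-core</artifactId>
  <version>2.6.2</version>
</dependency>

For item 2, an explicit (large) bucket count replaces the old size: 0 "return all buckets" behaviour:

// Sketch only: the index name "mudrod" and field "IP" are illustrative;
// the Client is assumed to be an existing 5.x client instance.
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;

public class TermsAggSketch {
  static void printUserBuckets(Client client) {
    SearchResponse response = client.prepareSearch("mudrod")
        .setSize(0) // aggregations only, no hits
        .addAggregation(AggregationBuilders.terms("users").field("IP").size(10000))
        .get();
    Terms users = response.getAggregations().get("users");
    for (Terms.Bucket bucket : users.getBuckets()) {
      System.out.println(bucket.getKeyAsString() + " -> " + bucket.getDocCount());
    }
  }
}

Whether 10000 is large enough depends on the cardinality of the field; the Stack Overflow thread above discusses the trade-offs of very large bucket counts.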

@lewismc
Collaborator

lewismc commented Oct 3, 2016

I'll scope this out tomorrow.

@lewismc lewismc removed this from the 09/30/2016 milestone Oct 12, 2016
@quintinali
Collaborator Author

@lewismc I have upgraded Elasticsearch from 2.x to 5.x and implemented parallel log import using the Elasticsearch-Hadoop library. I set up a cluster and want to test the whole process on it, but I don't know which package in the target directory can be used with the spark-submit command.

I tried to execute the following command in the mudrod-core target directory

/home/centos/mudrod/spark-1.4.0-bin-hadoop2.4/bin/spark-submit --class esiptestbed.mudrod.main.MudrodEngine mudrod-core-0.0.1-SNAPSHOT.jar

and got this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/jdom2/JDOMException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:633)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.jdom2.JDOMException
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more

I think the reason is that the dependency packages are not included in the built package. Is that correct? Do we have to modify pom.xml to generate another package including all dependencies?

@lewismc
Collaborator

lewismc commented Oct 16, 2016

Hi @quintinali

I think the reason is that the dependency packages are not included in the built package. Is that correct? Do we have to modify pom.xml to generate another package including all dependencies?

Yes, correct. This is trivial work and is described at https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/missing_dependencies_in_jar_files.html
We should use the Maven shade plugin to create an 'uber' jar which can be submitted to the Spark cluster.
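A minimal sketch of that pom.xml change, assuming the plugin version below and reusing the MudrodEngine main class from the spark-submit command above:

<!-- Sketch only: plugin version 2.4.3 is an assumption -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>esiptestbed.mudrod.main.MudrodEngine</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>

After mvn package, the jar in the target directory then bundles the dependencies and can be passed to spark-submit exactly as in the command above.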

@lewismc
Collaborator

lewismc commented Oct 26, 2016

@quintinali, can you please create a pull request for your work on the following branch: https://github.com/mudrod/mudrod/tree/cc_es_hadoop
Everything needs to go through pull requests.

@lewismc
Collaborator

lewismc commented Oct 26, 2016

@quintinali please see my mail on the Mudrod mailing list for guidance on creating pull requests.

@quintinali
Collaborator Author

Main modifications in this branch:

  1. Update Elasticsearch from 2.3 to 5.0-beta1.
  2. Add the Elasticsearch-Spark library.
  3. Parallelize log import, crawler detection, and session reconstruction (a parallel-import sketch follows this list). The original sequential functions are kept for comparison.
  4. Adapt to Elasticsearch API changes caused by the upgrade.
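For item 3, a minimal sketch of the parallel import path using the elasticsearch-spark Java API; the index/type name, log path, and Elasticsearch address are illustrative assumptions, not necessarily what the branch uses.

// Sketch only: "mudrod/rawlog", the HDFS path, and localhost:9200 are assumptions.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class ParallelLogImportSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("mudrod-log-import")
        .set("es.nodes", "localhost")          // Elasticsearch node(s)
        .set("es.port", "9200")
        .set("es.index.auto.create", "true");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    // Read raw logs in parallel; here each line is assumed to already be a JSON document.
    JavaRDD<String> logLines = jsc.textFile("hdfs:///mudrod/logs/*.log");

    // elasticsearch-spark writes each partition with bulk requests, so the import runs in parallel.
    JavaEsSpark.saveJsonToEs(logLines, "mudrod/rawlog");

    jsc.stop();
  }
}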

@lewismc lewismc added this to the 02/01/17 milestone Jan 18, 2017
@lewismc lewismc added this to Engine Integration, Deployment and Testing in AIST Master Schedule Feb 1, 2017
@lewismc
Collaborator

lewismc commented Feb 1, 2017

Issue resolved in #77

@lewismc lewismc closed this as completed Feb 1, 2017
@lewismc lewismc removed this from Engine Integration and Deployment in AIST Master Schedule Feb 23, 2017