Any Tutorial / Example for writing a Presto-ElasticSearch connector ? #3057

Closed
sumanth232 opened this issue Jun 9, 2015 · 18 comments

Comments

@sumanth232

I want to write an Elasticsearch connector so that I can perform JOINs on Elasticsearch data using Presto.
Can anyone please suggest how to start? Any guidance would be a great help.

@Downchuck

ElasticSearch has several Java clients available -- use the scroll method for large result sets:
https://www.elastic.co/guide/en/elasticsearch/guide/master/scan-scroll.html

Crate.io has an SQL adapter on top of the ElasticSearch code base, which could be of some help.
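
The scroll pattern mentioned above can be sketched without a cluster. This is a minimal illustration, not client code: `Page`, `readAll`, and `fakeCluster` are hypothetical names, and the in-memory function stands in for whatever client call fetches a page. The only behavior assumed from Elasticsearch is that each response carries a scroll id for the next request and that an empty page means the scroll is finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class ScrollSketch {
    /** One page of results plus the id to send with the next request. */
    static final class Page {
        final List<String> hits;
        final String nextScrollId;

        Page(List<String> hits, String nextScrollId) {
            this.hits = hits;
            this.nextScrollId = nextScrollId;
        }
    }

    /**
     * Drains a scroll. fetchPage receives the current scroll id (null for the
     * initial search) and returns the next page; an empty page ends the loop.
     */
    static List<String> readAll(Function<String, Page> fetchPage) {
        List<String> all = new ArrayList<>();
        String scrollId = null;
        while (true) {
            Page page = fetchPage.apply(scrollId);
            if (page.hits.isEmpty()) {
                return all;
            }
            all.addAll(page.hits);
            scrollId = page.nextScrollId; // always continue with the most recent id
        }
    }

    /** Hypothetical in-memory stand-in for a cluster: serves documents two at a time. */
    static Function<String, Page> fakeCluster(List<String> docs) {
        return scrollId -> {
            int from = (scrollId == null) ? 0 : Integer.parseInt(scrollId);
            int to = Math.min(from + 2, docs.size());
            return new Page(new ArrayList<>(docs.subList(from, to)), String.valueOf(to));
        };
    }
}
```

In a real connector the function passed to `readAll` would wrap the chosen Java client; a record cursor could pull one page at a time instead of collecting everything into a list.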

@sumanth232
Author

Crate does not support JOINs.
It's mentioned in the FAQ: https://crate.io/docs/faq/

Q: Does Crate support JOINs?

Not yet. JOINs are on our roadmap and we try to do them the right way in usual quality and well performing.
For a lot of use cases there are other ways to achieve the same result as with JOINs. The best way to start is to take a look at the ARRAY and OBJECT data types to denormalise your data.

@dain
Contributor

dain commented Jun 10, 2015

I would start by forking the https://github.com/facebook/presto/tree/master/presto-example-http plugin and adapting it to read from Elasticsearch. The last time I used Elasticsearch, the APIs were all REST-based, so the presto-example-http plugin should be pretty close to what you need. Once you get that working, you'll want to work on predicate pushdown, but I'd start by just getting it to read at all.

-dain

On Jun 9, 2015, at 6:14 AM, Sumanth Bandi notifications@github.com wrote:

I want to write an Elasticsearch connector so that I can perform JOINs on Elasticsearch data using Presto.
Can anyone please suggest how to start? Any guidance would be a great help.



@dain
Contributor

dain commented Jun 24, 2015

In the legacy SPI that the example connector implements, a table is logically divided into partitions, and partitions are divided into splits. A partition can provide a TupleDomain, which describes the bounds of the values present in the partition; Presto can use it to skip sections of the table that cannot match the filter predicate. A split is simply a part of a partition.

Presto will enumerate and filter the partitions and then enumerate the splits for the partitions. Then Presto reads data in parallel from splits.

If your system does not support parallel reading, simply return a single Partition and a single Split. If your system has a more sophisticated physical layout, you will want to use the new TableLayouts SPI so that Presto can take advantage of the data organization.
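
To make the partition and split flow above concrete, here is a small sketch in plain Java. It does not use the Presto SPI: `Range`, `Partition`, and `planSplits` are hypothetical, simplified stand-ins, with `Range` playing the role that a TupleDomain plays for a single numeric column. It only illustrates the order of operations: filter partitions by their bounds, then enumerate the splits of the partitions that remain.

```java
import java.util.ArrayList;
import java.util.List;

class SplitPlanningSketch {
    /** Closed value range for one column; a simplified stand-in for a TupleDomain. */
    static final class Range {
        final long min;
        final long max;

        Range(long min, long max) {
            this.min = min;
            this.max = max;
        }

        boolean overlaps(Range other) {
            return min <= other.max && other.min <= max;
        }
    }

    /** A partition advertises the bounds of its values and how many splits it has. */
    static final class Partition {
        final String name;
        final Range bounds;
        final int splitCount;

        Partition(String name, Range bounds, int splitCount) {
            this.name = name;
            this.bounds = bounds;
            this.splitCount = splitCount;
        }
    }

    /** Skips partitions that cannot match the filter, then lists splits of the rest. */
    static List<String> planSplits(List<Partition> partitions, Range filter) {
        List<String> splits = new ArrayList<>();
        for (Partition partition : partitions) {
            if (!partition.bounds.overlaps(filter)) {
                continue; // no value in this partition can satisfy the predicate
            }
            for (int i = 0; i < partition.splitCount; i++) {
                splits.add(partition.name + "/split-" + i);
            }
        }
        return splits;
    }
}
```

A system without parallel reads corresponds to one partition with `splitCount` of 1, which is the single Partition and single Split case described above.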

@sumanth232
Author

I wrote a basic connector with the necessary classes implemented. I also added a .properties file in 'etc/catalog' and
edited the plugin.bundles property in 'etc/config.properties' (added ../presto-elasticsearch/pom.xml):

plugin.bundles=\
  ../presto-raptor/pom.xml,\
  ../presto-hive-cdh4/pom.xml,\
  ../presto-example-http/pom.xml,\
  ../presto-kafka/pom.xml,\
  ../presto-tpch/pom.xml,\
  ../presto-elasticsearch/pom.xml,\
  ../presto-mysql/pom.xml

But I get this error:

2015-06-25T19:16:22.214+0530    INFO    main    com.facebook.presto.metadata.CatalogManager -- Loading catalog etc/catalog/elasticsearch.properties --
2015-06-25T19:16:22.215+0530    ERROR   main    com.facebook.presto.server.PrestoServer No factory for connector elasticsearch
java.lang.IllegalArgumentException: No factory for connector elasticsearch

The problem is here:

    private void loadPlugin(URLClassLoader pluginClassLoader)
            throws Exception
    {
        ServiceLoader<Plugin> serviceLoader = ServiceLoader.load(Plugin.class, pluginClassLoader);
        List<Plugin> plugins = ImmutableList.copyOf(serviceLoader);

        if (plugins.isEmpty()) {
            log.warn("No service providers of type %s", Plugin.class.getName());
        }

        for (Plugin plugin : plugins) {
            log.info("Installing %s", plugin.getClass().getName());
            installPlugin(plugin);
        }
    }

When loading this new plugin, the size of plugins is 0, whereas for the existing plugins it is 1:

List<Plugin> plugins = ImmutableList.copyOf(serviceLoader);

Can you please help me understand why the first two lines of this code are not working as expected?

Can you also please elaborate on this part of the developer docs, which I could not understand properly?

Each plugin identifies an entry point: an implementation of the Plugin interface. 
This class name is provided to Presto via the standard Java ServiceLoader interface: 
the classpath contains a resource file named com.facebook.presto.spi.Plugin in the META-INF/services directory. 
The content of this file is a single line listing the name of the plugin class:

How should I provide the class name of my new plugin to Presto?

@sumanth232
Author

The above problem was solved after I added a file named 'com.facebook.presto.spi.Plugin' in the 'META-INF/services' directory:

presto/presto-elasticsearch/src/main/resources/META-INF/services/com.facebook.presto.spi.Plugin

But I observed that for the other connectors (except tpch), the same file is present in a different directory:

presto/presto-kafka/target/classes/META-INF/services/com.facebook.presto.spi.Plugin
presto/presto-raptor/target/classes/META-INF/services/com.facebook.presto.spi.Plugin
...
...
The kafka and raptor modules do not have a 'src/main/resources/META-INF/services' directory at all.

Then how is the ServiceLoader loading the kafka and raptor connectors?
Can anybody please explain?
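
The ServiceLoader lookup discussed above can be reproduced in isolation. The sketch below is not Presto code: `DemoPlugin` and `DemoElasticsearchPlugin` are hypothetical stand-ins for com.facebook.presto.spi.Plugin and a connector's plugin class. It shows that the loader finds nothing until a file under META-INF/services, named after the interface, lists the implementation class. Files under target/classes are build output, so for kafka and raptor the file was presumably generated during the build rather than kept under src/main/resources; that is an assumption, not something confirmed in this thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class ServiceLoaderSketch {
    /** Hypothetical stand-in for the Plugin interface. */
    public interface DemoPlugin {
        String name();
    }

    /** ServiceLoader needs a public class with a public no-argument constructor. */
    public static class DemoElasticsearchPlugin implements DemoPlugin {
        public DemoElasticsearchPlugin() {}

        @Override
        public String name() {
            return "elasticsearch";
        }
    }

    /** Loads providers from a temporary directory, with or without the service file. */
    static List<String> loadPluginNames(boolean writeServiceFile) {
        try {
            // Temporary files are left for the operating system to clean up.
            Path root = Files.createTempDirectory("plugin-classes");
            if (writeServiceFile) {
                Path services = Files.createDirectories(root.resolve("META-INF/services"));
                // File name = interface name; content = implementation class name.
                Files.write(services.resolve(DemoPlugin.class.getName()),
                        DemoElasticsearchPlugin.class.getName().getBytes(StandardCharsets.UTF_8));
            }
            List<String> names = new ArrayList<>();
            try (URLClassLoader loader = new URLClassLoader(
                    new URL[] {root.toUri().toURL()}, DemoPlugin.class.getClassLoader())) {
                for (DemoPlugin plugin : ServiceLoader.load(DemoPlugin.class, loader)) {
                    names.add(plugin.name());
                }
            }
            return names;
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `loadPluginNames(false)` mirrors the empty plugins list from the earlier comment, while `loadPluginNames(true)` finds the provider. What matters to the loader is where the file sits on the classpath at run time, not where it lives in the source tree.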

@sumanth232
Author

Please tell me how to accurately implement these three methods of the 'RecordCursor' interface when writing a connector:

    long getTotalBytes();

    long getCompletedBytes();

    long getReadTimeNanos();

Please help.

@electrum
Contributor

electrum commented Jul 3, 2015

Those functions are only for stats. If they don't mean anything for your connector, or that info is not available, just return 0.
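
As a rough illustration of that advice, the sketch below reports 0 for the total, which is unknown up front, and tracks the other two values only because they are cheap to measure here. `StatsCursor` is a hypothetical cut-down interface holding just the three stats methods and the row-advance method; the real RecordCursor interface has more methods, and `InMemoryCursor` is invented for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;

class CursorStatsSketch {
    /** Hypothetical subset of a record cursor: iteration plus the three stats methods. */
    interface StatsCursor {
        boolean advanceNextPosition();

        long getTotalBytes();

        long getCompletedBytes();

        long getReadTimeNanos();
    }

    /** Cursor over in-memory rows that keeps simple running statistics. */
    static final class InMemoryCursor implements StatsCursor {
        private final Iterator<String> rows;
        private long completedBytes;
        private long readTimeNanos;

        InMemoryCursor(List<String> rows) {
            this.rows = rows.iterator();
        }

        @Override
        public boolean advanceNextPosition() {
            long start = System.nanoTime();
            boolean hasRow = rows.hasNext();
            if (hasRow) {
                // Count the encoded size of each row as it is consumed.
                completedBytes += rows.next().getBytes(StandardCharsets.UTF_8).length;
            }
            readTimeNanos += System.nanoTime() - start;
            return hasRow;
        }

        @Override
        public long getTotalBytes() {
            return 0; // not known in advance for this source, so report 0
        }

        @Override
        public long getCompletedBytes() {
            return completedBytes;
        }

        @Override
        public long getReadTimeNanos() {
            return readTimeNanos;
        }
    }
}
```

Returning 0 from all three methods would be equally valid according to the comment above; the counters are optional.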

@sumanth232
Author

@electrum, @dain: Does Presto support dynamic columns in tables (for example, data stores that contain JSON documents, where new properties can be added to a JSON document residing in an index/type)?
In the example connector, all the columns are hardcoded in 'example-metadata.json'. What if new columns are added to the CSV document? How can these newly added columns in the CSV (or new properties in JSON documents in Elasticsearch indices) be handled without restarting the Presto server every time a new column is added to a table? Can Presto handle this at all? Any suggestions would be a great help.
Thanks.
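
One possible direction, which nobody in this thread confirms, is to build the column list from the source's current field mapping each time metadata is requested instead of reading it once from a static file. The sketch below shows only that idea in plain Java: `toSqlType` and `columns` are hypothetical helpers, the type mapping is an illustrative assumption, and nothing here uses the Presto SPI or an Elasticsearch client.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

class DynamicColumnsSketch {
    /** Maps a source field type to a SQL type name; unknown types fall back to VARCHAR. */
    static String toSqlType(String sourceType) {
        switch (sourceType) {
            case "long":
            case "integer":
                return "BIGINT";
            case "double":
            case "float":
                return "DOUBLE";
            case "boolean":
                return "BOOLEAN";
            default:
                return "VARCHAR";
        }
    }

    /**
     * Builds the column list from the mapping available at call time, so a
     * field added to the source shows up on the next call without a restart.
     */
    static Map<String, String> columns(Supplier<Map<String, String>> currentMapping) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : currentMapping.get().entrySet()) {
            columns.put(field.getKey(), toSqlType(field.getValue()));
        }
        return columns;
    }
}
```

In a connector, the supplier would be replaced by a lookup against the data store, possibly cached for a short time to avoid a round trip on every metadata call.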

@RobinUS2
Contributor

What's the status on this one? Any progress made? Thanks!

@sumanth232
Author

@RobinUS2, here is a basic version of the connector. It needs to be developed further and optimised.
#3240

@corneversloot

Slightly off topic, but still relevant for people looking into this: we have released a first version of a JDBC driver for Elasticsearch called sql4es. It supports most common SQL statements and can be used from any system supporting the JDBC interface.

@ebuildy

ebuildy commented Oct 25, 2016

Wouldn't it be better to have a connector to Apache Lucene instead of the Elasticsearch HTTP API?

BTW, you could use a Hive external table backed by Elasticsearch (via the official elasticsearch-hadoop Hive connector) and query it from PrestoDB.

@corneversloot

Well, first of all, the driver uses the transport API and not the HTTP one. I think you actually do want to use Elasticsearch because it provides distributed query execution and high availability.

The Hive connection you mention should work, I think, although I must admit I have never used it.

@rohanarora0921

@sumanth232 Are you still working on this? Any progress?

@eulalie367

https://github.com/albertocsm/presto/tree/master/presto-elasticsearch

@dzen

dzen commented Aug 3, 2017

I found this other fork today: https://github.com/ebyhr/presto, with an elastic branch.

@findepi
Contributor

findepi commented Jun 12, 2018

This issue is obsolete now, closing.

As for the Elasticsearch connector, if the above-mentioned implementations are applicable to a general audience, it would be valuable to have one in the Presto codebase.

@findepi findepi closed this as completed Jun 12, 2018