I'm trying to use presto to query data in ORC files produced by pinterest/secor, stored in S3. My environment doesn't have any Hadoop/Hive setup, rather we use Kafka directly. Using presto against Kafka isn't workable (too slow, and would require infinite retention of the data inside Kafka). My table descriptions are produced automatically based on application data, and I want my users to only interact with presto.
I got things to work with Hive 3.1.1, but the requirements for running a Hive metastore are "scary" (we have no other Hadoop/Hive experience, so this adds considerable operational complexity).
I saw in the source code that there is a FileHiveMetastore which seems like it would remove the need to configure and run an actual Hive metastore server, and it should be easy enough to convert my application knowledge into suitable schemas.
What I would need for that is basically some form of example of how to describe a table for this meta-store, or ideally some pointers to documentation.
My table definition right now looks like this:
create external table if not exists events (`type` string,`_message` string,`_meta` struct<`id`:string,...,`timestamp`:string>) partitioned by (`dt` string) stored as orc location 's3a://bucket/raw_logs/secor_backup/events/';
msck repair table events;
Since you're using AWS, Presto can be configured to use AWS Glue as a Hive Metastore. This is specific to using EMR but the general idea is the same, see here.
You can launch an EMR Presto with the checkbox enabled to use Glue as the Hive metastore to see how this ends up being configured on the EC2 instances it uses. If you're not using EMR it should essentially just be configuring Presto's hive.properties to use Glue e.g.:
hive.metastore = glue
and using an EC2 instance profile that can read from AWS Glue.
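As a rough sketch of what that catalog file might look like outside EMR: the file location (`etc/catalog/hive.properties`), the `hive-hadoop2` connector name, and the `hive.metastore.glue.region` property are as I recall them from the Presto Hive connector documentation, and the region value is only a placeholder, so check all of them against your Presto version.

```properties
# etc/catalog/hive.properties (assumed location)
connector.name=hive-hadoop2
# Use AWS Glue instead of a Thrift Hive metastore server
hive.metastore=glue
# Placeholder region; set this to the region of your Glue catalog
hive.metastore.glue.region=us-east-1
```

With an instance profile attached, no access keys should be needed in this file.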
Also, if you use AWS Athena (managed Presto), you can issue create external table statements through it, and it delegates those to the Glue metastore. So you could just do this in the AWS Athena console to get started. You can then also query the data via AWS Athena, as it's basically Presto against S3 (it doesn't have other connectors though, e.g. Kafka). Depending on your use case, that may be sufficient as an alternative to running your own Presto. Running your own Presto on EC2 or via EMR would give you more control over optimizing performance etc., whereas Athena will be more convenient and potentially more cost effective, depending on your query patterns.
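Since the table descriptions are generated from application data, the create external table statement could be rendered by a small script and then pasted into the console or sent to whichever metastore ends up being used. This is only a minimal sketch: `build_create_table` and its parameters are hypothetical names, the column list is shortened to the two flat columns from the question (the `_meta` struct is left out because its fields are elided there), and no quoting or escaping of identifiers is attempted.

```python
def build_create_table(table, columns, partitions, location):
    """Render a Hive-style CREATE EXTERNAL TABLE statement for ORC data.

    columns and partitions are lists of (name, hive_type) pairs.
    """
    def render(cols):
        # Backtick-quote each column name, as in the original statement
        return ",".join(f"`{name}` {hive_type}" for name, hive_type in cols)

    return (
        f"create external table if not exists {table} ({render(columns)}) "
        f"partitioned by ({render(partitions)}) "
        f"stored as orc location '{location}'"
    )


ddl = build_create_table(
    "events",
    [("type", "string"), ("_message", "string")],
    [("dt", "string")],
    "s3a://bucket/raw_logs/secor_backup/events/",
)
print(ddl)
```

The location is copied from the question as-is; whether the `s3a://` scheme is accepted depends on the engine that executes the statement.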
@ankon AFAIK, FileHiveMetastore should be viewed as a helper class that allows us to run tests without an external Hive Metastore process.
For example, com.facebook.presto.hive.metastore.file.FileHiveMetastore#isDatabaseOwner has some hard-coded logic that gives any user owner privileges in the default schema, which is OK in tests, but not OK in any real-life usage. For these reasons, we're not planning to document it, nor do we encourage people to use this class directly.