
RaptorX: Unifying Hive-Connector and Raptor-Connector with Low Latency #13205

Closed · 7 tasks done
highker opened this issue Aug 12, 2019 · 23 comments
Labels
Roadmap

Comments

@highker (Contributor) commented Aug 12, 2019

The feature is now generally available and fully battle-tested in the Hive connector. The Raptor connector is no longer maintained; please use this feature instead.

Raptor is a sub-second-latency query engine with limited storage capacity, while Presto is a large-scale, sub-minute-latency query engine. To take advantage of both engines, we propose “RaptorX”, which runs on top of HDFS like Presto but still provides sub-second latency like Raptor. The goal is a unified, cheap, fast, and scalable solution for OLAP and interactive use cases.

Current Presto Design

  • Presto uses compute-only servers. It connects directly to the Hive Metastore as its metastore.
  • The servers are stateless. A cluster reads data directly from HDFS; how data is maintained by HDFS or the Hive Metastore is entirely outside Presto's concern.

[Figure: current Presto design]

Current Raptor Design

  • Raptor uses combined compute-and-storage servers. It usually needs a remote backup store and MySQL as its metastore.
  • Users need to run batch jobs to load data from the warehouse into Raptor clusters.
  • Each Raptor worker has background jobs to maintain files (backup, compaction, sorting, deletion, etc.).

[Figure: current Raptor design]

Read/Write Path

  • Page Source: ORC file stream reader for file reading
  • Page Sink: ORC file stream writer for file writing

Background Jobs

  • Compaction and cleaning (with data sorting as an option): ShardCompactor merges small files into large ones to avoid per-file read overhead.
    • Create a single “Page Sink” with multiple “Page Sources”: read multiple ORC files and flush them into a single one (see the sketch after this list).
    • ShardCleaner garbage-collects files that have been marked as deleted.
  • Backup: a backup store creates backup files in a remote key-value store and restores them when necessary.
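As a rough illustration of the compaction step, here is a minimal sketch using simplified stand-in PageSource/PageSink interfaces (not the actual presto-raptor SPI classes): read every page from a set of small input files and flush them through a single sink into one large output file.

import java.io.IOException;
import java.util.List;

// Simplified stand-ins for the page source/sink abstractions; the real
// ShardCompactor works against the Presto SPI equivalents.
final class Page {}

interface PageSource extends AutoCloseable {
    Page getNextPage() throws IOException;   // returns null when the file is exhausted
    void close() throws IOException;
}

interface PageSink extends AutoCloseable {
    void appendPage(Page page) throws IOException;
    void close() throws IOException;         // flushes the single output file
}

final class CompactionSketch {
    // Merge many small files into one large file: one sink, many sources.
    static void compact(List<PageSource> inputs, PageSink output) throws IOException {
        try (PageSink sink = output) {
            for (PageSource source : inputs) {
                try (PageSource in = source) {
                    Page page;
                    while ((page = in.getNextPage()) != null) {
                        sink.appendPage(page);
                    }
                }
            }
        }
    }
}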

Metadata

  • Raptor Metadata: a wrapper around several MySQL clients, since all metadata is stored in MySQL.
    • Schema: stores all the metadata at the table level, including schemas, tables, views, data properties (sorting, bucketing, …), table stats, bucket distribution, etc.
    • Stats: stores all the metadata at the file level, including nodes, transactions, buckets, etc.

Pros and Cons with Current Raptor/Presto

       Presto                                  Raptor
Pros   Large-scale storage (EB)                Low latency (sub-second)
       Independent storage and compute scale   Refined metastore (file-level)
Cons   High latency (sub-minute)               Mid-scale storage (PB)
       Coarse metastore (partition-level)      Coupled storage and compute

RaptorX Design

"RaptorX" is a project code name. It aims to evolve Presto (presto-hive) in a way to unify presto-raptor and presto-hive. To make sure taking the pros from both Presto and Raptor, we made the following design decisions:

  • Use HDFS for large-scale storage (EB) and independent scaling of storage and compute
  • Use a hierarchical cache to achieve low latency (sub-second)
  • Inherit and refine the existing Raptor metastore for file pruning and scheduling decisions

[Figure: RaptorX design]

Read/Write Path

  • Page Source/Sink: migrate the file system from local FS to WS and HDFS environments, where WS (Warm Storage) is used inside Facebook and HDFS is used in open source.
  • Hierarchical Cache: see the “Cache” section below.

Cache

  • Worker cache
    • File handle caching: avoid file open calls
    • File/stripe footer caching: avoid multiple redundant RPC calls to HDFS/WS
    • In-memory data caching: async IO + prefetch
    • Local data cache on SSD: avoid network or HDFS/WS latency (a lookup sketch follows this list).
  • Coordinator cache
    • File stats for pruning
    • ACL checks
    • Parametric query plans
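To make the worker-side lookup order concrete, here is a minimal sketch of the hierarchy (class and method names are illustrative only, not the actual cache implementation): serve a byte range from memory if possible, then from the local SSD cache, and only then read from HDFS/WS and populate both tiers.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative two-tier read cache for byte ranges of remote files:
// tier 1 = in-memory map, tier 2 = local SSD directory, miss = remote FS read.
final class HierarchicalReadCache {
    interface RemoteFileSystem {
        byte[] read(String file, long offset, int length) throws IOException;  // stand-in for the HDFS/WS client
    }

    private final ConcurrentHashMap<String, byte[]> memory = new ConcurrentHashMap<>();
    private final Path ssdDirectory;       // assumed to exist, e.g. /mnt/flash/data
    private final RemoteFileSystem remote;

    HierarchicalReadCache(Path ssdDirectory, RemoteFileSystem remote) {
        this.ssdDirectory = ssdDirectory;
        this.remote = remote;
    }

    byte[] read(String file, long offset, int length) throws IOException {
        String key = file.replace('/', '_') + "@" + offset + "+" + length;

        byte[] data = memory.get(key);                // 1. in-memory cache
        if (data != null) {
            return data;
        }

        Path ssdFile = ssdDirectory.resolve(key);     // 2. local SSD cache
        if (Files.exists(ssdFile)) {
            data = Files.readAllBytes(ssdFile);
        }
        else {
            data = remote.read(file, offset, length); // 3. remote HDFS/WS read
            Files.write(ssdFile, data);               // populate the SSD tier
        }
        memory.put(key, data);                        // populate the memory tier
        return data;
    }
}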

Background Jobs

  • Completely removed: users can use Spark or Presto to write sorted and compacted data, and WS takes care of fault tolerance. This provides immutable partitions for Raptor, which gets warehouse (WH) DR and privacy handling for free.

Metastore

  • Hive Metastore compatibility: the Hive Metastore remains the source of truth. RaptorX keeps a refined copy of the Hive Metastore containing file-level stats for pruning and scheduling.
  • Raptor Metastore: the enriched metastore.
    • Detach worker assignment from the metastore; the metastore should not be tied to worker topology.
    • The Raptor Metastore is a richer metastore that contains file-level stats, which can be used for range/z-order partitioning.
  • Metastore protocol
    • RaptorX calls the Hive Metastore for every query.
    • The Hive Metastore provides a hash/digest indicating the version of a partition. The digest changes when the content of a partition changes due to backfill, NRT, etc.
    • If the digest matches the one in the Raptor Metastore, use the Raptor Metastore; otherwise fall back to the Hive Metastore (see the sketch after this list).
    • If a user chooses to ingest data with RaptorX, file-level stats + partition info are kept in the Raptor Metastore, and the partition info is published to the Hive Metastore.
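A minimal sketch of this versioning check; the HiveMetastoreClient.getPartitionDigest call and the cache entry layout are hypothetical stand-ins used only to illustrate the protocol, not a real metastore API.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical shape of the protocol: the Raptor metastore keeps file-level stats
// per partition together with the digest Hive reported when the stats were captured;
// a query uses them only if that digest still matches.
final class VersionedPartitionCache {
    record PartitionEntry(String digest, Map<String, Long> fileLevelStats) {}  // e.g. file path -> row count; real stats are richer

    interface HiveMetastoreClient {
        String getPartitionDigest(String table, String partition);  // hypothetical call
    }

    private final ConcurrentHashMap<String, PartitionEntry> raptorMetastore = new ConcurrentHashMap<>();
    private final HiveMetastoreClient hive;

    VersionedPartitionCache(HiveMetastoreClient hive) {
        this.hive = hive;
    }

    // Returns the cached file-level stats if still valid; empty means the partition
    // content changed (backfill, NRT, etc.) and we must fall back to the Hive metastore.
    Optional<Map<String, Long>> lookup(String table, String partition) {
        String currentDigest = hive.getPartitionDigest(table, partition);
        PartitionEntry cached = raptorMetastore.get(table + "/" + partition);
        if (cached != null && cached.digest().equals(currentDigest)) {
            return Optional.of(cached.fileLevelStats());
        }
        return Optional.empty();
    }
}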

Coordinator

  • Soft affinity scheduling:
    • The same worker should pull the same data from the chunk server, to leverage the FS cache on the chunk server.
    • The scheduling should be “soft” affinity, meaning that it should still take workload balancing into consideration to avoid scheduling skew. The current Raptor scheduling is “hard” affinity, which can cause some workers to pull more data than others, leading to long-tail issues. (A scheduling sketch follows this list.)
  • Low-latency overhead:
    • Reduce coordinator parsing, planning, optimizing, and ACL-checking overhead to less than 100ms.
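A minimal sketch of what “soft” affinity means here (illustrative only, not the actual SOFT_AFFINITY implementation): hash each split's file path to a preferred worker so that worker's caches stay warm, but fall back to the least-loaded worker when the preferred one exceeds a load cap.

import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative soft-affinity node selection: prefer the hash-assigned worker,
// but respect a per-worker load cap to avoid scheduling skew and long tails.
final class SoftAffinityScheduler {
    private final List<String> workers;                        // worker node identifiers
    private final Map<String, Integer> assignedSplits = new ConcurrentHashMap<>();
    private final int maxSplitsPerWorker;

    SoftAffinityScheduler(List<String> workers, int maxSplitsPerWorker) {
        this.workers = List.copyOf(workers);
        this.maxSplitsPerWorker = maxSplitsPerWorker;
    }

    String pickWorker(String splitFilePath) {
        // Preferred worker: stable mapping from file path to worker index.
        int preferred = Hashing.consistentHash(
                Hashing.murmur3_128().hashString(splitFilePath, StandardCharsets.UTF_8),
                workers.size());
        String candidate = workers.get(preferred);

        // The "soft" part: if the preferred worker is overloaded, pick the least-loaded one instead.
        if (load(candidate) >= maxSplitsPerWorker) {
            candidate = workers.stream()
                    .min(Comparator.comparingInt(this::load))
                    .orElse(candidate);
        }
        assignedSplits.merge(candidate, 1, Integer::sum);
        return candidate;
    }

    private int load(String worker) {
        return assignedSplits.getOrDefault(worker, 0);
    }
}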

Interactive query flow

  • The Presto coordinator receives a user query and sends a request to the RaptorX metastore with the table name, partition values, constraints from the filter, user info (for ACL checks), etc.
  • The RaptorX metastore checks its in-memory table schema and ACL cache for basic info. If there is a cache miss, it asks the Hive metastore.
  • The RaptorX metastore checks whether its in-memory cache contains the required table name + partition value pair.
    • If it contains the pair, the RaptorX metastore sends the version number to the KV store to see whether it is the latest version. If it is, the RaptorX metastore prunes files based on the given constraints.
    • If it does not contain the pair, it fetches the pair from the KV store and repeats the step above.
  • RaptorX returns the schema, file locations, etc. back to the Presto coordinator. (A sketch of this flow follows the list.)
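Putting the flow together, here is a minimal coordinator-side sketch. All types (MetastoreCaches, KeyValueStore, Constraint, the records) are illustrative stand-ins, not real Presto classes; the point is only the ordering of cache checks, version verification, and file pruning.

import java.util.List;

// Illustrative end-to-end flow: schema/ACL from the in-memory cache (with Hive
// fallback inside the cache), a partition version check against the KV store,
// then file pruning before returning locations to the coordinator.
final class InteractiveQueryFlowSketch {
    record TableSchema(List<String> columns) {}
    record CachedPartition(long version, List<String> files) {}

    interface MetastoreCaches {
        TableSchema schemaAndAcl(String table, String user);              // falls back to Hive on miss
        CachedPartition partition(String table, String partitionValue);  // null on miss
        void putPartition(String table, String partitionValue, CachedPartition partition);
    }

    interface KeyValueStore {
        long latestVersion(String table, String partitionValue);
        CachedPartition fetch(String table, String partitionValue);
    }

    interface Constraint {
        boolean matches(String file);   // e.g. prune by file-level min/max stats
    }

    private final MetastoreCaches caches;
    private final KeyValueStore kvStore;

    InteractiveQueryFlowSketch(MetastoreCaches caches, KeyValueStore kvStore) {
        this.caches = caches;
        this.kvStore = kvStore;
    }

    List<String> filesForQuery(String table, String partitionValue, String user, Constraint constraint) {
        // 1. Schema + ACL check served from the in-memory cache (Hive metastore on miss);
        //    the resulting TableSchema would be returned to the coordinator alongside the files.
        caches.schemaAndAcl(table, user);

        // 2. Partition lookup: trust the cached entry only if it is still the latest version.
        long latest = kvStore.latestVersion(table, partitionValue);
        CachedPartition partition = caches.partition(table, partitionValue);
        if (partition == null || partition.version() != latest) {
            partition = kvStore.fetch(table, partitionValue);      // refresh from the KV store
            caches.putPartition(table, partitionValue, partition);
        }

        // 3. Prune files using the filter constraints and return the locations.
        return partition.files().stream().filter(constraint::matches).toList();
    }
}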

Production Benchmark Result

[Figure: production benchmark results]

Milestones and Tasks

  • Soft affinity scheduling (#13966, #14095, #14108)
  • Cache
    • In-memory file handle (this should be implemented inside FS implementation)
    • In-memory file/stripe footer (#13501, #13575)
    • File-based data cache (#13644, #13904)
    • Segment-based data cache (Alluxio integration) (#14196, #14499)
    • Fragment result cache

How to Use

Enable the following configs in the Hive connector (with the exception of the fragment result cache, which is configured for the main engine):

Scheduling (/catalog/hive.properties):

hive.node-selection-strategy=SOFT_AFFINITY

Metastore versioned cache (/catalog/hive.properties):

hive.partition-versioning-enabled=true
hive.metastore-cache-scope=PARTITION
hive.metastore-cache-ttl=2d
hive.metastore-refresh-interval=3d
hive.metastore-cache-maximum-size=10000000

List files cache (/catalog/hive.properties):

hive.file-status-cache-expire-time=24h
hive.file-status-cache-size=100000000
hive.file-status-cache-tables=*

Data cache (/catalog/hive.properties):

cache.enabled=true
cache.base-directory=file:///mnt/flash/data
cache.type=ALLUXIO
cache.alluxio.max-cache-size=1600GB

Fragment result cache (/config.properties and /catalog/hive.properties):

fragment-result-cache.enabled=true
fragment-result-cache.max-cached-entries=1000000
fragment-result-cache.base-directory=file:///mnt/flash/fragment
fragment-result-cache.cache-ttl=24h
hive.partition-statistics-based-optimization-enabled=true

File and stripe footer cache (/catalog/hive.properties):

# ORC:
hive.orc.file-tail-cache-enabled=true
hive.orc.file-tail-cache-size=100MB
hive.orc.file-tail-cache-ttl-since-last-access=6h
hive.orc.stripe-metadata-cache-enabled=true
hive.orc.stripe-footer-cache-size=100MB
hive.orc.stripe-footer-cache-ttl-since-last-access=6h
hive.orc.stripe-stream-cache-size=300MB
hive.orc.stripe-stream-cache-ttl-since-last-access=6h
# Parquet:
hive.parquet.metadata-cache-enabled=true
hive.parquet.metadata-cache-size=100MB
hive.parquet.metadata-cache-ttl-since-last-access=6h
@highker added the Roadmap label on Aug 12, 2019

@highker (Contributor, Author) commented Aug 12, 2019

cc: @biswapesh, @jessesleeping, @shixuan-fan

@highker (Contributor, Author) commented Sep 5, 2019

(out-of-date; check #13205 (comment) for actual benchmark result)

Some benchmark results on prod queries

[Benchmark charts]

@rongrong (Contributor) commented Sep 5, 2019

Some benchmark results on prod queries

Nice work! Would be even better to align the x-axis among graphs for easier visualization.

@highker (Contributor, Author) commented Sep 10, 2019

(out-of-date; check #13205 (comment) for actual benchmark result)

Reducing the default read size from 8MB to 0.2MB can greatly increase the benefit of using SSD.

[Benchmark charts]

@shixuan-fan (Contributor) commented Sep 11, 2019

(out-of-date; check #13205 (comment) for actual benchmark result)

Here are the results with the file handle cache:
[Benchmark charts]

@highker (Contributor, Author) commented Sep 11, 2019

(out-of-date; check #13205 (comment) for actual benchmark result)

Results with 0.5MB max read size. We will mainly focus on optimizing IO rather than using the data cache.

[Benchmark charts]

@highker (Contributor, Author) commented Sep 12, 2019

(out-of-date; check #13205 (comment) for actual benchmark result)

Since HDFS can provide much higher throughput, we increase the IO fan-out to work around latency:
[Benchmark chart]

@highker (Contributor, Author) commented Sep 14, 2019

(out-of-date; check #13205 (comment) for actual benchmark result)

[Benchmark charts]

@highker (Contributor, Author) commented Sep 16, 2019

HDD results (out-of-date; check #13205 (comment) for actual benchmark result)

[Benchmark chart]

@fengyun2066 commented Dec 18, 2019

Can you share the specific test models and use cases?

@highker (Contributor, Author) commented Dec 23, 2019

@fengyun2066 we use prod workloads to test the performance. In case you are interested in the performance:
[Benchmark table]

@hustnn commented Dec 26, 2019

@highker From the above table, the disaggregated setup performs better. However, from the above figures, it seems the collocated setup performs better. Is my understanding correct?

@highker (Contributor, Author) commented Dec 26, 2019

@hustnn, the above figures are out of date; I will update the description to make sure they do not mislead future readers. The table shows the results with the local data cache on SSD (#13644). That is a key feature for enabling disaggregated storage.

@highker changed the title [Design] Disaggregated Raptor [Design] RaptorX: unified presto-raptor and presto-hive Dec 26, 2019
@highker changed the title [Design] RaptorX: unified presto-raptor and presto-hive [Design] RaptorX: unified Presto and Raptor Dec 26, 2019
@highker changed the title [Design] RaptorX: unified Presto and Raptor [Design] RaptorX: Unified Presto and Raptor with Extreme Low Latency Jan 24, 2020
@highker changed the title [Design] RaptorX: Unified Presto and Raptor with Extreme Low Latency [Design] RaptorX: Unifying Presto and Raptor with Low Latency Jan 24, 2020
@highker changed the title [Design] RaptorX: Unifying Presto and Raptor with Low Latency RaptorX: Unifying Presto and Raptor with Low Latency Jan 24, 2020
@hustnn commented Feb 17, 2020

@highker In the cache to-do part, you mentioned "In-memory file handle (this should be implemented inside FS implementation)". May I ask whether the file handle that can be reused here refers to the fd of the opened Parquet file?

@highker (Contributor, Author) commented Feb 17, 2020

@hustnn, the file handle/descriptor is an FS-specific concept; it varies from FS to FS. Distributed file systems like Google's Colossus or Facebook's Warm Storage are similar in design: they chop files into chunks and save them on different chunk servers. The file format, on the other hand, is orthogonal to the file handle, so whichever file format you use should be fine. For Facebook, Warm Storage opens a file by retrieving chunk information from chunk-allocator servers. This information is (mostly) immutable; it tells where each chunk of a file comes from. By caching this information, a split can avoid the overhead of chunk lookup (which is around ~100ms for Warm Storage or Colossus). Therefore, it depends on what FS you use and how much implementation detail it exposes through its interface. Check the open interface of the HDFS you use and see whether the underlying implementation makes any RPC calls to retrieve file handles/descriptors. A sketch of the idea follows.
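For example (a rough sketch of the idea only, not how presto-hive actually wires this in), on plain HDFS one could memoize the file status and block locations with a Guava cache so that repeated splits over the same file skip the extra NameNode round trips:

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.util.concurrent.TimeUnit;

// Rough sketch: cache the (mostly immutable) file status + block locations so
// repeated reads of the same file avoid additional NameNode lookups.
final class BlockMetadataCache {
    record FileMetadata(FileStatus status, BlockLocation[] blocks) {}

    private final LoadingCache<Path, FileMetadata> cache;

    BlockMetadataCache(FileSystem fs) {
        this.cache = CacheBuilder.newBuilder()
                .maximumSize(100_000)                      // bound memory usage
                .expireAfterAccess(6, TimeUnit.HOURS)      // eventually drop stale entries
                .build(new CacheLoader<Path, FileMetadata>() {
                    @Override
                    public FileMetadata load(Path path) throws Exception {
                        FileStatus status = fs.getFileStatus(path);
                        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
                        return new FileMetadata(status, blocks);
                    }
                });
    }

    FileMetadata metadata(Path path) {
        return cache.getUnchecked(path);
    }
}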

@hustnn commented Feb 17, 2020

@highker I see, thanks. Is the chunk information of a file in Facebook Warm Storage similar to the blocks of a file in HDFS? If so, does the cached information indicate the block locations in HDFS?

@highker (Contributor, Author) commented Feb 17, 2020

@hustnn, that is correct. It is the block metadata for a file.

@hustnn commented Feb 17, 2020

@highker Thanks!

@teejteej commented Jan 11, 2021

Just pinging to see if there are any updates on RaptorX? Curious to try it out and see how the disaggregated compute/storage will perform on some datasets.

@highker changed the title RaptorX: Unifying Presto and Raptor with Low Latency RaptorX: Unifying Hive-Connector and Raptor-Connector with Low Latency Jan 11, 2021
@highker (Contributor, Author) commented Jan 11, 2021

@teejteej, yes, the feature is fully available and battle-tested. Here are the configs to enable for the Hive connector:

Scheduling (/catalog/hive.properties):

hive.node-selection-strategy=SOFT_AFFINITY

Metastore versioned cache (/catalog/hive.properties):

hive.partition-versioning-enabled=true
hive.metastore-cache-scope=PARTITION
hive.metastore-cache-ttl=2d
hive.metastore-refresh-interval=3d
hive.metastore-cache-maximum-size=10000000

List files cache (/catalog/hive.properties):

hive.file-status-cache-expire-time=24h
hive.file-status-cache-size=100000000
hive.file-status-cache-tables=*

Data cache (/catalog/hive.properties):

cache.enabled=true
cache.base-directory=file:///mnt/flash/data
cache.type=ALLUXIO
cache.alluxio.max-cache-size=1600GB

Fragment result cache (/config.properties):

fragment-result-cache.enabled=true
fragment-result-cache.max-cached-entries=1000000
fragment-result-cache.base-directory=file:///mnt/flash/fragment
fragment-result-cache.cache-ttl=24h

@shixuan-fan (Contributor) commented Jan 11, 2021

Adding to what @highker said, there are also knobs for the file/stripe footer cache for ORC and Parquet files, as shown below. Note that in Facebook's deployment (we use ORC), we found it increased GC pressure, so it is disabled; but I think it is still worthwhile to point this feature out for visibility, as it might work for different workloads.

@ClarenceThreepwood might have a better idea about how Parquet metadata cache works in Uber.

ORC:

hive.orc.file-tail-cache-enabled=true
hive.orc.file-tail-cache-size=100MB
hive.orc.file-tail-cache-ttl-since-last-access=6h
hive.orc.stripe-metadata-cache-enabled=true
hive.orc.stripe-footer-cache-size=100MB
hive.orc.stripe-footer-cache-ttl-since-last-access=6h
hive.orc.stripe-stream-cache-size=300MB
hive.orc.stripe-stream-cache-ttl-since-last-access=6h

Parquet:

hive.parquet.metadata-cache-enabled=true
hive.parquet.metadata-cache-size=100MB
hive.parquet.metadata-cache-ttl-since-last-access=6h

@highker (Contributor, Author) commented Jan 27, 2021

All merged; fully available in 0.248

@highker closed this as completed on Jan 27, 2021
@zzcclp commented Feb 26, 2021

@highker, hi, does 0.247 include all the features of RaptorX, or does it require 0.248?
