
Overview


The Sparkline BI Accelerator is a Spark-native Business Intelligence stack geared towards providing fast ad-hoc querying over a Logical Cube (aka Star Schema). It is based on Apache Druid's OLAP Indexing technology; the technique of using OLAP Indexing has the additional benefit of simplifying the ETL processes and the Data Management layer.

Hadoop/Spark (the Big Open Data Stack) is fast becoming an accepted architecture for an Enterprise Data Warehouse. One of the gaps in this stack is the ability to support fast slice-and-dice ad-hoc workloads; such a workload is fairly common, for example when an Analyst is browsing or doing discovery over large datasets.

The current norm for supporting ad-hoc queries at scale with think-time response times is to copy the data into a ‘Specialized’ Data Store; part of the maintenance is to pre-aggregate the underlying data into many Materialized Views. Such a solution has several drawbacks: the cost of the entire system goes up significantly. It is not just the cost of another system (which can itself be significant, given that these systems are usually not open-source and Apache licensed); there is the hardware cost, the cost of managing two systems, and the cost of managing and running a complex ETL process to keep the two systems in sync. Additionally, these solutions only work well when the workload is known beforehand, so that pre-aggregates can be built to optimize it; pre-aggregates typically break down when the workload is ad-hoc.

The Sparkline BI Accelerator simplifies how enterprises can provide an ad-hoc query layer on top of the Hadoop/Spark (Big Open Data Stack).

  • We provide the ad-hoc query capability by extending the Spark SQL layer, through SQL extensions and an extended Optimizer (both Logical and Physical optimizations); a sketch of how such planner extensions are wired into Spark SQL follows this list.
  • We use OLAP Indexing rather than pre-materialization as the technique to achieve query performance. OLAP Indexing is a well-known technique that is far superior to Materialized Views for supporting ad-hoc querying. We utilize another open-source, Apache-licensed Big Data component (Apache Druid) for the OLAP Indexing capability.
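As an illustration of the first point, here is a minimal sketch of how extra planning strategies are typically injected into Spark SQL's Catalyst planner (Spark 1.x API). `DruidScanStrategy` is a hypothetical placeholder, not the Accelerator's actual entry point:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical strategy: a real implementation would match Aggregates and
// Filters over a Druid-backed relation and plan them as Druid queries.
class DruidScanStrategy extends Strategy {
  // Returning Nil falls through to Spark's built-in strategies.
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

val sc = new SparkContext(new SparkConf().setAppName("sparkline-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Spark exposes an experimental hook for injecting extra physical planning
// strategies; an accelerator can register its Druid-aware strategy here.
sqlContext.experimental.extraStrategies = Seq(new DruidScanStrategy)
```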

The Sparkline BI Accelerator solution starts with a Spark Cluster with added Druid Daemons (for Batch Indexing/Real-time Ingestion, Query serving, and Coordination). The Druid daemons can cohabit with the Spark daemons or run on a separate set of machines. The core component is the Sparkline Planner and its runtime extensions, which ensure that the optimal engine/technique is used for every Spark Catalyst Query Plan.

During runtime, a query on a Sparkline-accelerated Dataset is answered by a Physical Plan that combines operations from the Optimized and Regular Paths.
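For example, an ad-hoc aggregation like the one below may have its Group-By/Aggregate answered from the Druid Index while any remaining operations run as regular Spark operators. Continuing with the `sqlContext` from the sketch above, `orderLineItemPartSupplier` is an illustrative table name for a flattened TPCH dataset:

```scala
// An illustrative slice-and-dice query; the table and column names assume
// a flattened TPCH dataset registered as a Sparkline-accelerated Datasource.
val df = sqlContext.sql(
  """SELECT l_returnflag, l_linestatus, SUM(l_extendedprice) AS revenue
    |FROM orderLineItemPartSupplier
    |GROUP BY l_returnflag, l_linestatus""".stripMargin)

// The explain output shows whether the Physical Plan uses the Optimized
// (Druid) Path or the Regular (Spark scan + aggregate) Path.
df.explain(true)
```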

Generally, deployments of the Sparkline Accelerator fall into the following two buckets:

  1. For existing deployments of Druid, adding the Sparkline Accelerator exposes the Druid Index as a DataSource in Spark (see the sketch after this list). This provides a complete SQL interface (with JDBC/ODBC access) to the Druid Index; it also enables analytics in Spark (SQL + advanced analytics) to utilize the fast navigation/aggregation capabilities of the Druid Index.
  2. For situations where Hadoop + Spark is the Enterprise Data Warehouse, the Sparkline Accelerator leverages OLAP Indexing to provide a solution with the distinct advantages explained above.
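A sketch of the first bucket: exposing an existing Druid Index as a table through the DataSource. The DDL option names shown here are illustrative assumptions; the Quick Start Guide documents the exact set supported:

```scala
// Expose an existing Druid Index as a Spark Datasource (illustrative
// option names; see the Quick Start Guide for the authoritative list).
sqlContext.sql(
  """CREATE TEMPORARY TABLE orderLineItemPartSupplier
    |USING org.sparklinedata.druid
    |OPTIONS (
    |  sourceDataframe "orderLineItemPartSupplierBase",
    |  timeDimensionColumn "l_shipdate",
    |  druidDatasource "tpch",
    |  druidHost "localhost",
    |  druidPort "8082"
    |)""".stripMargin)

// The table is now queryable from any Spark SQL client, including the
// Sparkline-enhanced Thriftserver over JDBC/ODBC.
sqlContext.sql("SELECT COUNT(*) FROM orderLineItemPartSupplier").show()
```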

The Sparkline extensions enable Catalyst Logical Plans written against a raw flattened dataset or a Star Schema to be optimized to take advantage of a Druid Index of the data. The key components are the Druid Spark DataSource, which orchestrates execution between Druid and Spark, and the Druid Planner, which generates optimal Physical Plans that take advantage of both Spark and Druid capabilities.
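To give a flavor of the Star Schema case, a join such as the one below can be collapsed by the Druid Planner onto a single flattened Druid Index built from the joined tables, avoiding the join at query time. The TPCH-style table and column names are illustrative:

```scala
// A star-schema query: lineitem (fact) joined out to customer (dimension).
// When the Druid Index was built from the flattened join of these tables,
// the Planner can answer this plan from the index alone, skipping the joins.
val starQuery = sqlContext.sql(
  """SELECT c_mktsegment, COUNT(*) AS cnt
    |FROM lineitem li
    |  JOIN orders o ON li.l_orderkey = o.o_orderkey
    |  JOIN customer c ON o.o_custkey = c.c_custkey
    |GROUP BY c_mktsegment""".stripMargin)
starQuery.show()
```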

The Design Document is a good (albeit dated) source of information on the core components of the Accelerator design. We ran a benchmark of a set of representative queries from the TPCH benchmark; for slice-and-dice queries like TPCH Q7 we see orders-of-magnitude improvement. The benchmark results and an analysis are in the Design document (a more detailed description of the benchmark is available here). The Developer Guide has detailed sections on the Optimizations we support, the Cost Model, the Query Execution Modes, and more.
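To give a flavor of such a query, here is a simplified form of TPCH Q7 (volume shipping) written against the flattened dataset; the `s_nation`/`c_nation` column names assume the flattened TPCH schema from the Quick Start, and the full benchmark queries are in the Design document:

```scala
// Simplified TPCH Q7 against the flattened dataset: a classic slice-and-dice
// query whose filter + group-by shape benefits from the Druid Index.
sqlContext.sql(
  """SELECT s_nation, c_nation, YEAR(l_shipdate) AS l_year,
    |       SUM(l_extendedprice * (1 - l_discount)) AS revenue
    |FROM orderLineItemPartSupplier
    |WHERE s_nation IN ('FRANCE', 'GERMANY')
    |  AND c_nation IN ('FRANCE', 'GERMANY')
    |  AND s_nation <> c_nation
    |GROUP BY s_nation, c_nation, YEAR(l_shipdate)""".stripMargin).show()
```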

The Quick Start Guide has detailed instructions on how to set up a demo environment for the TPCH flattened and Star Schema use cases.

The User Guide has detailed sections on setting up Datasources, running the Sparkline-enhanced Spark Thriftserver, SQLContext options, and more.
