Druid Query Optimizations

Based on the Query pattern we rewrite Druid Queries to use the most optimal Druid Query Type: SearchQuery, TimeSeriesQuery or TopNQuery. To quite the Druid website:

Where possible, we recommend using Timeseries and TopN queries instead of GroupBy. GroupBy is the most flexible Druid query, but also has the poorest performance. Timeseries are significantly faster than groupBy queries for aggregations that don’t require grouping over dimensions. For grouping and sorting over a single dimension, topN queries are much more optimized than groupBys.

SearchQuery transformation

A ‘metadata’ query on a dimension is executed as a Search Query in Druid. In SQL this is expressed as select distinct colName query.

query must be on a single Dimension
there must be no aggregations
the query must be for the entire duration of the index
if a limit is specified it must be on the dimension and must have ‘ascending’ order

Following is an example query that gets executed as a Druid Search Query

select c_name
from orderLineItemPartSupplier 
group by c_name

TimeSeriesQuery transformation

Queries that have no Groping Expressions(sql aggregations without a Group By clause) using the TimeSeries Engine results in much faster execution. The query must have:

no dimensions specified
no limit or having clause

Following is an example query that gets executed as a Druid TimeSeries Query

SELECT 
min(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__0`,
min(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__1`,
min(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__2`,
max(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__3`,
max(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__4`,
max(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__5`,
count(1)                                                                                                AS `x__alias__0`
FROM   orderLineItemPartSupplier `tmp_sed_druid_20160201_20160331_v6_sparkline`
WHERE  (
          NOT (
                 cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) IS NULL))
HAVING count(1) > 0

TopNQuery transformation

The user must choose to allow the approximate ranking using the allowTopN option. The query must have:

a single dimension
a single order by expression, the order by must be on a metric and not the single dimension.
must have a limit

Following is an example query that gets executed as a Druid TopN Query

select l_returnflag, 
       count(*), sum(l_extendedprice) as s, max(ps_supplycost) as m, avg(ps_availqty) as a 
from orderLineItemPartSupplier 
group by l_returnflag 
order by max(ps_supplycost) 
limit 200

Overview
Quick Start
- Installing and Setup Druid
User Guide
- [Defining a DataSource on a Flattened Dataset](https://github.com/SparklineData/spark-druid-olap/wiki/Defining-a Druid-DataSource-on-a-Flattened-Dataset)
- Defining a Star Schema
- Sample Queries
- Approximate Count and Spatial Queries
- Druid Datasource Options
- Sparkline SQLContext Options
- Using Tableau with Sparkline
- How to debug a Query Plan?
- Running the ThriftServer with Sparklinedata components
- [Setting up multiple Sparkline ThriftServers - Load Balancing & HA] (https://github.com/SparklineData/spark-druid-olap/wiki/Setting-up-multiple-Sparkline-ThriftServers-(Load-Balancing-&-HA))
- Runtime Views
- Sparkline SQL extensions
- Sparkline Pluggable Modules
Dev. Guide
Reference Architectures
- Accelerating existing SQL Datasets
Releases
Cluster Spinup Tool
TPCH Benchmark
- Generating Denormalized TPCH Dataset
- Build TPCH Index for Benchmark

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Druid Query Optimizations

SearchQuery transformation

TimeSeriesQuery transformation

TopNQuery transformation

Clone this wiki locally