Skip to content
This repository has been archived by the owner on Mar 30, 2021. It is now read-only.

Druid Query Optimizations

hbutani edited this page Sep 2, 2016 · 1 revision

Based on the Query pattern we rewrite Druid Queries to use the most optimal Druid Query Type: SearchQuery, TimeSeriesQuery or TopNQuery. To quite the Druid website:

Where possible, we recommend using Timeseries and TopN queries instead of GroupBy. GroupBy is the most flexible Druid query, but also has the poorest performance. Timeseries are significantly faster than groupBy queries for aggregations that don’t require grouping over dimensions. For grouping and sorting over a single dimension, topN queries are much more optimized than groupBys.

SearchQuery transformation

A ‘metadata’ query on a dimension is executed as a Search Query in Druid. In SQL this is expressed as select distinct colName query.

  • query must be on a single Dimension
  • there must be no aggregations
  • the query must be for the entire duration of the index
  • if a limit is specified it must be on the dimension and must have ‘ascending’ order

Following is an example query that gets executed as a Druid Search Query

select c_name
from orderLineItemPartSupplier 
group by c_name

TimeSeriesQuery transformation

Queries that have no Groping Expressions(sql aggregations without a Group By clause) using the TimeSeries Engine results in much faster execution. The query must have:

  • no dimensions specified
  • no limit or having clause

Following is an example query that gets executed as a Druid TimeSeries Query

SELECT 
min(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__0`,
min(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__1`,
min(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__2`,
max(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__3`,
max(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__4`,
max(cast(cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) AS timestamp)) AS `x_measure__5`,
count(1)                                                                                                AS `x__alias__0`
FROM   orderLineItemPartSupplier `tmp_sed_druid_20160201_20160331_v6_sparkline`
WHERE  (
          NOT (
                 cast(`tmp_sed_druid_20160201_20160331_v6_sparkline`.`l_shipdate` AS timestamp) IS NULL))
HAVING count(1) > 0

TopNQuery transformation

The user must choose to allow the approximate ranking using the allowTopN option. The query must have:

  • a single dimension
  • a single order by expression, the order by must be on a metric and not the single dimension.
  • must have a limit

Following is an example query that gets executed as a Druid TopN Query

select l_returnflag, 
       count(*), sum(l_extendedprice) as s, max(ps_supplycost) as m, avg(ps_availqty) as a 
from orderLineItemPartSupplier 
group by l_returnflag 
order by max(ps_supplycost) 
limit 200
Clone this wiki locally