
How to debug a Query Plan?


1. How to obtain the detailed Query Plan

Use the explain druid rewrite statement. See details here
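For example, prefixing a query with this statement prints the rewritten plan, including the Druid query that will be issued. The table and column names below are purely illustrative:

explain druid rewrite
select country, sum(sales)
from orderLineItems
group by country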

2. Where do I find details about supported SQL query patterns, functions, expressions and datatypes?

The Transformation and Optimization Page has details on supported functionality. The Expression Pushdown section has details on supported SQL functions, expressions and datatypes.

3. How do I collect runtime information about a Query?

There may be several reasons for a slow-performing query: Druid segment processing, processing and scheduling in Spark, or a suboptimal physical plan.

  • Use the DruidQueries Runtime View to gather information about the execution of each Druid query. Each row provides details about the time to execute the query on a Druid daemon and the time to pull in the entire result set. You can also see the Druid query itself and the Druid daemon it ran on. The stageId and partitionId columns tie this back to the Spark Stage and Task in which the query ran. (An example query against this view is sketched after this list.)

  • The Spark Jobs and Stages UI tabs provide details about the Spark Job and its individual Stages. The interesting pieces are the time taken by each task within a Stage and the Stage timeline. A rule of thumb is that the overhead (anything other than execution time) for Spark Tasks running Druid queries should be small, and the time for other Stages (like the merge stage) should also be small. Another thing to watch for is the scheduling time for tasks within a Stage, which should likewise be small.

(Screenshots: the Spark Jobs UI tab, the Spark Stages UI tab, the Spark Stage Details UI tab, and the Spark Stage timeline.)

  • Finally, the thriftserver log has a detailed record of the runtime events during query execution. In case of unexplained behavior, please look for exceptions in this file.
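For example, a query along the following lines surfaces the slowest Druid queries and ties them back to Spark tasks. The view name druidqueries and the column druidExecTime are assumptions (only stageId and partitionId are documented above); run a select * first to see the columns your deployment actually exposes:

select stageId, partitionId, druidExecTime
from druidqueries
order by druidExecTime desc
limit 10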

4. How do I try different Druid execution strategies?

The Druid Query Cost Model page has details on how we make decisions about the runtime execution strategy for a query. By default the Cost Model is on, and we recommend that you run in this mode. But the Cost Model is based on a set of Cost Factors; these have default values which may need to be tweaked for your environment. All costs are relative to the cost of shuffling 1 row on the Spark cluster. For example, by default we set the cost to process 1 row in Druid to 0.25 of this cost. In reality this may be as much as an order of magnitude too high. Setting these factors is a trial-and-error process; the following are some tools and tips:
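As a sketch only: if the experiments below suggest Druid processing is cheaper than the default factor of 0.25, you would lower that factor via a spark.sparklinedata.druid.querycostmodel setting. The property name used here is hypothetical and only illustrates the shape of the override; look up the actual factor names on the Druid Query Cost Model page:

-- hypothetical factor name, shown only to illustrate the shape of the override
set spark.sparklinedata.druid.querycostmodel.druidProcessingCostPerRow=0.025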

Use the Broker only mode
You can override the cost model and force all queries to go to the Broker by using the following settings:
set spark.sparklinedata.druid.querycostmodel.enabled=false
set spark.sparklinedata.druid.option.queryHistoricalServers=false

Run your tests with these settings; if you find the queries perform consistently faster, this implies that the Druid cost factors should be reduced, so that plans that favor the Broker, or use more segments per query, are chosen.
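A quick way to confirm which path a query takes under these overrides is to compare its rewritten plan with and without them; the query below is illustrative:

explain druid rewrite
select country, sum(sales)
from orderLineItems
group by country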

Use the Historical mode, with a specific numSegmentsPerHistoricalQuery
You can override the cost model and force all queries to go to the Historical servers by using the following settings:
set spark.sparklinedata.druid.querycostmodel.enabled=false
set spark.sparklinedata.druid.option.queryHistoricalServers=true
set spark.sparklinedata.druid.option.numSegmentsPerHistoricalQuery=?

You can try this with different values of numSegmentsPerHistoricalQuery. If you find the queries perform consistently faster for a particular value, the Druid processing cost and Spark cost factors should be changed accordingly (based on the numSegmentsPerHistoricalQuery that the cost model settings would choose).
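For example, one round of such a sweep might look like the following, where the value 5 and the query are purely illustrative; repeat with other values (say 2 and 10) and compare wall-clock times for the same workload:

set spark.sparklinedata.druid.querycostmodel.enabled=false
set spark.sparklinedata.druid.option.queryHistoricalServers=true
set spark.sparklinedata.druid.option.numSegmentsPerHistoricalQuery=5
select country, sum(sales) from orderLineItems group by country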
