Quick Start Guide

There are two ways to quickly get started:

  • You can use our Cluster Spin Tool to spin up an AWS EMR cluster with the Druid daemons, the TPCH datasets (datascale 1 or datascale 100), and the table metadata already set up. Once the cluster is up, look at our Sample Queries for examples; the User Guide has more details on how to configure and understand query processing.
  • If you want to set up an environment on your own, follow the instructions on this page. We provide instructions for setting up a development environment and for indexing datasets, with a sample for the TPCH dataset and a demo Retail dataset.

Environment Setup

Install and Setup Druid

You can follow the instructions here

Download and unzip Spark

Download a Spark release. As of this writing, we have tested with spark-1.6.0.
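
For example, here is a minimal sketch of downloading and unpacking a Spark 1.6.0 binary build; the specific package (the Hadoop 2.6 build from the Apache archive) is an assumption, and any spark-1.6.0 binary distribution should work:

# Download and unpack a Spark 1.6.0 binary build (Hadoop 2.6 package assumed)
wget https://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
tar -xzf spark-1.6.0-bin-hadoop2.6.tgz
cd spark-1.6.0-bin-hadoop2.6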

TPCH Example

The TPCH dataset

We included a sample of the TPCH datascale1 dataset in the quickstart folder. It contains a sample of facts from the datascale1 dataset. Note that the rows in the flattened dataset and the original lineitem table may not be the same, so don't use these datasets to compare result data values.

Index the sample TPCH dataset

Follow the instructions here to create the Druid index for the flattened dataset. Copy and edit tpch_index_task.json.template; you need to change the baseDir location to match your configuration.
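
After editing the template, submit the task to the Druid indexing service. A minimal sketch, assuming a Druid overlord running on localhost at its default port (8090) and the edited template saved as tpch_index_task.json:

# Submit the indexing task to the Druid overlord (localhost and default port 8090 assumed)
curl -X POST -H 'Content-Type: application/json' \
  -d @tpch_index_task.json \
  http://localhost:8090/druid/indexer/v1/task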

Download the Sparkline BI Accelerator jar from Maven Central

As of this writing, the current release version (0.2.0) is available here.

Setup and Start the ThriftServer

Follow the instructions on this page.

For example:

sbin/start-sparklinedatathriftserver.sh  ~/sparkline/spark-druid-olap/target/scala-2.10/accelerator-assembly-0.2.0.jar

start-sparklinedatathriftserver.sh is a thin wrapper around the Spark scripts, so all Spark command-line options work when starting the Sparkline Enhanced ThriftServer.
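
For instance, here is a sketch of passing a few standard Spark options through the wrapper; the option values are illustrative only, and the argument ordering (options before the assembly jar) is an assumption, so check the wrapper script if your version expects something different:

# Illustrative values only; adjust the master URL and memory for your environment
sbin/start-sparklinedatathriftserver.sh \
  --master local[4] \
  --driver-memory 4g \
  ~/sparkline/spark-druid-olap/target/scala-2.10/accelerator-assembly-0.2.0.jar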

Setup the SQL tables

  • Connect to the ThriftServer. Connectivity is the same as connecting to the Spark ThriftServer; if you are new to Spark, detailed instructions are available here, and a beeline example is shown after the CREATE TABLE statements below.
  • Create the underlying Spark table and the Druid DataSource table. For TPCH these look like the following:
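-- register the raw flattened TPCH data as a Spark table, backed by the CSV files via the spark-csv package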
CREATE TABLE orderLineItemPartSupplierBase(
      o_orderkey integer, o_custkey integer,
      o_orderstatus string, o_totalprice double, o_orderdate string, o_orderpriority string,
      o_clerk string,
      o_shippriority integer, o_comment string, l_partkey integer, l_suppkey integer,
      l_linenumber integer,
      l_quantity double, l_extendedprice double, l_discount double, l_tax double,
      l_returnflag string,
      l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string,
      l_shipinstruct string,
      l_shipmode string, l_comment string, order_year string, ps_partkey integer,
      ps_suppkey integer,
      ps_availqty integer, ps_supplycost double, ps_comment string, s_name string, s_address string,
      s_phone string, s_acctbal double, s_comment string, s_nation string,
      s_region string, p_name string,
      p_mfgr string, p_brand string, p_type string, p_size integer, p_container string,
      p_retailprice double,
      p_comment string, c_name string , c_address string , c_phone string , c_acctbal double ,
      c_mktsegment string , c_comment string , c_nation string , c_region string)
      USING com.databricks.spark.csv
      OPTIONS (path "<location of orderLineItemPartSupplierCustomer>",
      header "false", delimiter "|")
;

-- now create the orderLineItemPartSupplier DataFrame that links the raw data with the Druid Index
CREATE TABLE if not exists orderLineItemPartSupplier
      USING org.sparklinedata.druid
      OPTIONS (sourceDataframe "orderLineItemPartSupplierBase",
      timeDimensionColumn "l_shipdate",
      druidDatasource "tpch",
      druidHost "localhost",
      zkQualifyDiscoveryNames "true",
      columnMapping '{  "l_quantity" : "sum_l_quantity",  "ps_availqty" : "sum_ps_availqty",  "cn_name" : "c_nation",  "cr_name" : "c_region",   "sn_name" : "s_nation",  "sr_name" : "s_region" }     ',
      numProcessingThreadsPerHistorical '1',
      starSchema ' {   "factTable" : "orderLineItemPartSupplier",   "relations" : []  }     ')
;
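
To connect and run these statements (as mentioned in the first bullet above), you can use beeline from the Spark distribution. A minimal sketch, assuming the ThriftServer is running on localhost at Spark's default ThriftServer port (10000):

# Connect to the Sparkline ThriftServer with beeline (localhost and default port 10000 assumed)
bin/beeline -u jdbc:hive2://localhost:10000 -n $USER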

Now you can run queries against the orderLineItemPartSupplier Druid DataSource table. The Sparkline planner will generate optimized query plans for queries against this table. For example:

explain
select l_returnflag as r, sum(l_extendedprice) as s 
from orderLineItemPartSupplier 
group by l_returnflag
;

== Physical Plan ==
Project [l_returnflag#75 AS r#126,alias-1#129 AS s#127]
+- Scan DruidQuery(1581882452): {
  "q" : {
    "jsonClass" : "GroupByQuerySpec",
    "queryType" : "groupBy",
    "dataSource" : "tpch",
    "dimensions" : [ {
      "jsonClass" : "DefaultDimensionSpec",
      "type" : "default",
      "dimension" : "l_returnflag",
      "outputName" : "l_returnflag"
    } ],
    "granularity" : "all",
    "aggregations" : [ {
      "jsonClass" : "FunctionAggregationSpec",
      "type" : "doubleSum",
      "name" : "alias-1",
      "fieldName" : "l_extendedprice"
    } ],
    "intervals" : [ "1993-01-01T00:00:00.000Z/1997-12-31T00:00:01.000Z" ]
  },
  "queryHistoricalServer" : false,
  "numSegmentsPerQuery" : -1,
  "intervalSplits" : [ {
    "start" : 725846400000,
    "end" : 883526401000
  } ],
  "outputAttrSpec" : [ {
    "exprId" : {
      "id" : 75,
      "jvmId" : { }
    },
    "name" : "l_returnflag",
    "dataType" : { },
    "tf" : "toString"
  }, {
    "exprId" : {
      "id" : 129,
      "jvmId" : { }
    },
    "name" : "alias-1",
    "dataType" : { },
    "tf" : ""
  } ]
}[l_returnflag#75,alias-1#129]

This runs as a Druid query with a Spark Project operator on top. The aggregation is performed using the Druid index, which is significantly faster than performing the aggregation in Spark. The same query against the underlying table has a traditional Spark plan:

explain
select l_returnflag as r, sum(l_extendedprice) as s 
from orderLineItemPartSupplierBase
group by l_returnflag

== Physical Plan ==
TungstenAggregate(key=[l_returnflag#22], functions=[(sum(l_extendedprice#19),mode=Final,isDistinct=false)], output=[r#130,s#131])
+- TungstenExchange hashpartitioning(l_returnflag#22,200), None
   +- TungstenAggregate(key=[l_returnflag#22], functions=[(sum(l_extendedprice#19),mode=Partial,isDistinct=false)], output=[l_returnflag#22,sum#135])
      +- Project [l_returnflag#22,l_extendedprice#19]
         +- Scan CsvRelation(file:///Users/hbutani/tpch/datascale1/orderLineItemPartSupplierCustomer.small/,false,|,"...

More examples on the TPCH dataset are in the Sample Queries page; the User Guide has more details on how to configure and understand query processing.

Retail Example

This demo provides an ultra quick start, but it does not have the depth of scenarios and scale of the TPCH dataset above.
