Cluster Spinup Tool

Use the Cluster Spinup Tool to bring up a Sparkline BI Accelerator cluster on AWS. The cluster is set up with EMR (Hadoop + Spark), Druid 0.9.0, and the Sparkline Accelerator. The cluster starts with Druid and the Sparkline Thriftserver running, optionally serving the Druid DataSource specified in the configuration files.

The tool is driven by a ConfigFile; use it to specify the following information:

  • The Cluster Name
  • The location of configuration files for Druid, Spark, and Sparkline.
  • Your EC2 keypair
  • An AWS SecurityGroupID for the machines spun up.
  • The InstanceTypes for the master and slave nodes, for example “m3.xlarge”.
  • The size of the cluster.
  • Bid price for the machines.
  • The availability zone to spin up the machines in.

Here is an example ConfigFile to spin up a cluster for the TPCH-1 demo. The ConfigFile provides step-by-step documentation of each setting.
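
The ConfigFile’s actual format and keys are defined by the tool and documented in the TPCH-1 example; purely as an illustration of the kind of information it captures (all key names and values below are hypothetical, not the tool’s real syntax), a minimal sketch might look like:

    # Hypothetical sketch of a spinup ConfigFile -- see the TPCH-1 example for the real syntax.
    cluster.name=sparkline-tpch1-demo
    config.folder=./config/tpch1            # Druid, Spark, and Sparkline configuration files
    ec2.keyPair=my-ec2-keypair
    ec2.securityGroupId=sg-0123456789abcdef0
    instance.master.type=m3.xlarge
    instance.slave.type=m3.xlarge
    cluster.slaveCount=4
    spot.bidPrice=0.10                      # USD per instance-hour
    ec2.availabilityZone=us-east-1a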

Cluster Configuration Details

Detailed configurations are kept under a configuration folder. The structure of the folder is the following:

Here is a detailed configuration example for the TPCH-1 demo.

Druid Configuration

The folder contains the configurations for all the Druid daemons. See the Druid documentation for details on the different configuration options. You can set up all the configuration scripts here; they are then deployed to the cluster.

Common Configuration

See the Druid documentation for details. These settings are picked up from the _common sub-folder. The Druid cluster is set up to run with EMR Hadoop and has the s3, hdfs, and mysql extensions installed, so you can enable these in the _common settings. You can point to an existing MySQL instance (the spinup tool doesn’t install MySQL).
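
As a minimal sketch of the kind of settings that end up in the _common sub-folder’s common.runtime.properties (hosts, bucket names, and credentials below are placeholders, not values the tool ships with):

    # Extensions available on the cluster (Druid 0.9.0 style).
    druid.extensions.loadList=["druid-s3-extensions", "druid-hdfs-storage", "mysql-metadata-storage"]

    # ZooKeeper runs on the master node.
    druid.zk.service.host=<master-private-ip>:2181

    # Existing MySQL instance used as the metadata store (not installed by the spinup tool).
    druid.metadata.storage.type=mysql
    druid.metadata.storage.connector.connectURI=jdbc:mysql://<mysql-host>:3306/druid
    druid.metadata.storage.connector.user=druid
    druid.metadata.storage.connector.password=<password>

    # S3 deep storage for segments.
    druid.storage.type=s3
    druid.storage.bucket=<your-bucket>
    druid.storage.baseKey=druid/segments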

Coordinator and Broker Configuration

See the Druid documentation for details. The coordinator and broker sub-folders contain their runtime.properties and JVM settings.
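
For illustration, the settings in those sub-folders look roughly like the sketch below (ports, sizes, and thread counts are examples to be tuned, not the shipped values):

    # coordinator/runtime.properties
    druid.service=druid/coordinator
    druid.port=8081
    druid.coordinator.startDelay=PT30S
    druid.coordinator.period=PT30S

    # broker/runtime.properties
    druid.service=druid/broker
    druid.port=8082
    druid.broker.http.numConnections=20
    druid.processing.buffer.sizeBytes=536870912
    druid.processing.numThreads=7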

Historical Configuration

See the Druid documentation for details. The slaves are set up with two mount points (/mnt, /mnt1) to use as Druid local storage. The processing threads, HTTP threads, and JVM settings should be adjusted based on the class of machine used. The historical sub-folder contains the runtime.properties and JVM settings.
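
For example, a historical runtime.properties using both mount points might look like this sketch (cache sizes and thread counts are placeholders to be sized to the instance type):

    druid.service=druid/historical
    druid.port=8083

    # Use both mount points as local segment storage.
    druid.segmentCache.locations=[{"path":"/mnt/druid/segment-cache","maxSize":130000000000},{"path":"/mnt1/druid/segment-cache","maxSize":130000000000}]
    druid.server.maxSize=260000000000

    # Tune processing/HTTP threads and buffers to the slave instance type.
    druid.processing.numThreads=7
    druid.processing.buffer.sizeBytes=536870912
    druid.server.http.numThreads=25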

Overlord Configuration

See the Druid documentation for details. Druid is set up to run indexing using Hadoop, and the Hadoop client is set up to work with EMR. The overlord sub-folder contains the runtime.properties and JVM settings.
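
A rough sketch of an overlord runtime.properties for Hadoop-based indexing (the Hadoop client coordinates are an example and should match the EMR release):

    druid.service=druid/overlord
    druid.port=8090

    # Track tasks in the metadata store and run Hadoop indexing tasks from the overlord.
    druid.indexer.storage.type=metadata
    druid.indexer.runner.type=local

    # Hadoop client used for indexing jobs (example coordinates).
    druid.indexer.task.defaultHadoopCoordinates=["org.apache.hadoop:hadoop-client:2.7.1"]
    druid.indexer.task.hadoopWorkingPath=/tmp/druid-indexing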

The cluster is spun up with the coordinator and broker running on the master node and each slave running a historical daemon. ZooKeeper is set up on the master node. The machines have aliases for starting/stopping the Druid daemons. For indexing, the overlord is started on the master node.

Spark Configuration

Use the sparkline sub-folder to configure Spark and Sparkline. The sparkline.spark.properties file contains the parameters used to configure the Spark cluster. The default EMR spark-defaults.conf is appended to the end of this file, so you don’t need to specify log locations, EMR libraries, executors, memory, etc. This folder also contains the Sparkline jar that should be deployed.
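
Since the EMR spark-defaults.conf is appended automatically, sparkline.spark.properties only needs cluster-specific overrides; as an illustrative sketch (the values are examples, and any Sparkline-specific options also go in this file):

    # Thriftserver / executor sizing -- tune to the slave instance type.
    spark.driver.memory=4g
    spark.executor.memory=8g
    spark.executor.cores=4

    # Shuffle parallelism for a small cluster.
    spark.sql.shuffle.partitions=64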

On spin-up, the Sparkline-enhanced Spark Thriftserver is started on the master node using the provided sparkline.spark.properties.

Initial DDL

If you point to a Druid metastore that already has segments defined, then on startup the historicals will start pulling these segments to local storage and begin serving the indexes defined. The deep storage must be accessible by the historicals, so make sure the S3 bucket has the right permissions. To start running SQL against these indexes, you need to define the raw table and the Druid DataSource in Spark. Use the DDL script to specify these; they are set up using the spark-sql CLI before starting the Thriftserver. Here is the DDL script for the TPCH-1 demo.
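
The TPCH-1 DDL script is the authoritative reference; as a rough, abbreviated sketch of the two definitions involved (table names, columns, and option names below are illustrative approximations, not a copy of the actual script):

    -- Raw (source) table over the flattened TPCH data, e.g. CSV files on S3.
    CREATE TABLE orderLineItemPartSupplierBase (
      o_orderkey integer, l_shipdate string, l_extendedprice double /* ... remaining columns ... */
    )
    USING com.databricks.spark.csv
    OPTIONS (path "s3://<your-bucket>/tpch/flattened", header "false", delimiter "|");

    -- Druid DataSource exposed in Spark via the Sparkline datasource provider.
    CREATE TABLE orderLineItemPartSupplier
    USING org.sparklinedata.druid
    OPTIONS (
      sourceDataframe "orderLineItemPartSupplierBase",
      timeDimensionColumn "l_shipdate",
      druidDatasource "tpch",
      druidHost "<zookeeper-or-broker-host>"
      /* ... further options as documented in the TPCH-1 DDL script ... */
    );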
