Using Apache Drill M1



(OPTIONAL) Download and install the Lilith logging and access event viewer.

Make sure you have Java 7 installed:

$ java -version
java version "1.7.0_11"
Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)
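If you want to script this check, a minimal sketch along the following lines could work. The helper name java_series is hypothetical, and it assumes the first line of the `java -version` output has the shape shown above. Note that `java -version` prints to stderr, hence the `2>&1`.

```shell
# java_series: read `java -version` output on stdin and print the
# "major.minor" series (for example 1.7). Hypothetical helper, not part of Drill.
java_series() {
  sed -n 's/.*version "\([0-9]*\.[0-9]*\).*/\1/p' | head -n 1
}

# Only run the check when a java binary is on the PATH.
if command -v java >/dev/null 2>&1; then
  series=$(java -version 2>&1 | java_series)
  if [ "$series" = "1.7" ]; then
    echo "Java 7 found"
  else
    echo "Expected Java 1.7 but found: ${series:-unknown}" >&2
  fi
fi
```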

Download the Apache Drill M1 binary release from people.apache.org/~jacques/apache-drill-1.0.0-m1.rc3/.

Running Apache Drill in distributed mode

What follows are instructions for running Apache Drill in distributed mode (in a local cluster).

Install Zookeeper (ZK)

According to this mailing list thread, Drill M1 doesn't launch ZK in local mode and hence needs an external ZK instance running.

To install and set up ZK, do the following:

  • Download ZK version 3.4.3 from zookeeper.apache.org/releases.html
  • I assume in the following that $ZK_HOME is /Users/mhausenblas2/bin/zookeeper-3.4.3
  • Create a config file conf/apache-drill-zoo.cfg that holds the Apache Drill specific settings. You can find the content of this ZK config file in the conf/ directory of this repo. You'll need to adapt one line, dataDir=/Users/mhausenblas2/data/zk, which you can change to any directory you like, as long as you have write access there.
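Adapting the dataDir line can also be scripted. The following is only a sketch: set_zk_data_dir is a hypothetical helper, it assumes the config file already contains a dataDir= line (as the one in this repo does), and it assumes the target path contains neither `|` nor `&`, which would confuse the sed expression.

```shell
# set_zk_data_dir CFG DIR: point the dataDir= line of CFG at DIR.
# Hypothetical helper; assumes CFG already contains a dataDir= line.
set_zk_data_dir() {
  cfg=$1
  dir=$2
  # Create the data directory up front; this fails if you cannot write there.
  mkdir -p "$dir" || return 1
  # Rewrite via a temp file, since `sed -i` differs between GNU and BSD sed.
  sed "s|^dataDir=.*|dataDir=$dir|" "$cfg" > "$cfg.tmp" && mv "$cfg.tmp" "$cfg"
}

# Example (paths are placeholders):
# set_zk_data_dir "$ZK_HOME/conf/apache-drill-zoo.cfg" "$HOME/data/zk"
```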

Then, launch ZK as shown below:

$ bin/zkServer.sh start apache-drill-zoo.cfg
JMX enabled by default
Using config: /Users/mhausenblas2/bin/zookeeper-3.4.3/bin/../conf/apache-drill-zoo.cfg
Starting zookeeper ... STARTED

Launch 3-node local cluster

For a simple 3-node cluster, extract three copies of Apache Drill into three different directories, xxx-node1 to xxx-node3, like so:

[~/sandbox/apache-drill-1.0.0-m1-cluster] $ ls -al
total 115080
drwxr-xr-x   7 mhausenblas2  staff   238B 28 Oct 05:40 .
drwxr-xr-x  22 mhausenblas2  staff   748B 28 Oct 05:40 ..
-rw-r--r--@  1 mhausenblas2  staff    56M 10 Sep 14:10 apache-drill-1.0.0-m1-binary-release.tar.gz
drwxr-xr-x@  7 mhausenblas2  staff   238B 28 Oct 05:42 apache-drill-1.0.0-m1-node1
drwxr-xr-x@  7 mhausenblas2  staff   238B 28 Oct 05:40 apache-drill-1.0.0-m1-node2
drwxr-xr-x@  7 mhausenblas2  staff   238B 28 Oct 05:40 apache-drill-1.0.0-m1-node3
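One way to produce this layout is a small loop over the node numbers. This is a sketch, not the method used to create the listing above: make_nodes is a hypothetical helper, and it assumes the release archive wraps its content in a single top-level directory, which `--strip-components=1` removes so that bin/ ends up directly inside each node directory. If your archive is laid out differently, drop that option.

```shell
# make_nodes TARBALL PREFIX: unpack TARBALL into PREFIX-node1 .. PREFIX-node3.
# Assumption: the archive has exactly one top-level directory.
make_nodes() {
  tarball=$1
  prefix=$2
  for i in 1 2 3; do
    mkdir -p "$prefix-node$i" || return 1
    tar -xzf "$tarball" -C "$prefix-node$i" --strip-components=1 || return 1
  done
}

# Example:
# make_nodes apache-drill-1.0.0-m1-binary-release.tar.gz apache-drill-1.0.0-m1
```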

For each apache-drill-1.0.0-m1-cluster/apache-drill-1.0.0-m1-nodeX, launch the respective Drillbit like so:

$ pwd
$ export DRILL_LOG_DIR=$PWD/log
$ ./bin/drillbit.sh start
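Rather than repeating these commands by hand in each node directory, you could wrap them in a loop. The sketch below uses a hypothetical helper, start_drillbits, and assumes the node directories match the pattern *-node* inside the cluster directory, as in the listing above.

```shell
# start_drillbits CLUSTER_DIR: start the Drillbit in every *-node* directory.
start_drillbits() {
  for node in "$1"/*-node*; do
    # Skip anything that does not look like an unpacked Drill node.
    [ -x "$node/bin/drillbit.sh" ] || continue
    # Subshell, so the cd and the exported log dir stay local to this node.
    ( cd "$node" && export DRILL_LOG_DIR="$PWD/log" && ./bin/drillbit.sh start ) || return 1
  done
}

# Example:
# start_drillbits ~/sandbox/apache-drill-1.0.0-m1-cluster
```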

Check if all Drillbits are started:

$ jps
4590 Drillbit
14036 Drillbit
14823 Jps
13133 ZooKeeperMain
14816 Drillbit
11465 QuorumPeerMain
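If you'd rather not count by eye, the jps output can be filtered. A minimal sketch, assuming the two-column "pid name" format shown in this listing; count_drillbits is a hypothetical helper name.

```shell
# count_drillbits: read `jps` output on stdin and print how many
# Drillbit JVMs it lists (0 if there are none).
count_drillbits() {
  awk '$2 == "Drillbit" { n++ } END { print n + 0 }'
}

# Example: expect 3 for the cluster described here.
# [ "$(jps | count_drillbits)" -eq 3 ] && echo "all Drillbits are up"
```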

In the jps output you see the three Drillbits and the respective ZK processes, as expected. Next, check in ZK whether all three nodes are registered:

$ pwd
$ bin/zkCli.sh -server
[zk: 1] get /drill/drillbits1
cZxid = 0x3
ctime = Mon Oct 28 05:33:31 GMT 2013
mZxid = 0x3
mtime = Mon Oct 28 05:33:31 GMT 2013
pZxid = 0x19
cversion = 11
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 3

OK, we have numChildren = 3, one child per Drillbit, so this looks good.
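To pull the numChildren value out of such output in a script, something like the following sketch could be used. zk_num_children is a hypothetical helper; it only parses text in the "key = value" format shown here and does not talk to ZK itself, so you would feed it output you captured from zkCli.

```shell
# zk_num_children: read the node metadata printed by zkCli on stdin
# and print the value of its numChildren field.
zk_num_children() {
  awk -F' = ' '$1 == "numChildren" { print $2 }'
}

# Example, with the captured output saved in a file of your choosing:
# [ "$(zk_num_children < drillbits-stat.txt)" -eq 3 ] && echo "3 nodes registered"
```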

Submit a physical query plan

Make sure the data is available. For each apache-drill-1.0.0-m1-cluster/apache-drill-1.0.0-m1-nodeX, the following data files should be available in sample-data/:

  • donuts.json
  • nation.parquet and region.parquet are already shipped with M1

Also we need the physical query plan (in at least one of the nodeX/sample-data/):

And then submit the plan from one of the nodes (node1 in my case):

$ pwd

Scan JSON doc

$ bin/submit_plan -f sample-data/physical_json_scan_test1.json -t physical -zk

Scan Parquet doc

$ bin/submit_plan -f sample-data/parquet_scan_union_screen_physical.json -t physical -zk

Shutdown cluster

First run ./bin/drillbit.sh stop for each Drillbit, then run bin/zkServer.sh stop.
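The shutdown order (Drillbits first, ZK last) can be captured in a short script. This is a sketch under the same layout assumptions as above: stop_cluster is a hypothetical helper, node directories are assumed to match *-node*, and the second argument is your $ZK_HOME.

```shell
# stop_cluster CLUSTER_DIR ZK_HOME: stop every Drillbit, then ZooKeeper.
stop_cluster() {
  for node in "$1"/*-node*; do
    [ -x "$node/bin/drillbit.sh" ] || continue
    # Warn but keep going, so one failing node does not block the rest.
    ( cd "$node" && ./bin/drillbit.sh stop ) || echo "warning: could not stop $node" >&2
  done
  # ZK goes last, after all Drillbits have been asked to stop.
  ( cd "$2" && bin/zkServer.sh stop )
}

# Example:
# stop_cluster ~/sandbox/apache-drill-1.0.0-m1-cluster "$ZK_HOME"
```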

Interactive query on single node with sqlline

The following describes how to use sqlline for single-node SQL queries.

$ pwd

Make sure the following env vars are set:

$ export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_11.jdk/Contents/Home
$ export DRILL_LOG_DIR=$PWD/log
$ ./bin/drillbit.sh start

Now you're ready to launch the interactive SQL query CLI:

$ ./bin/sqlline -u jdbc:drill:schema=parquet-local

0: jdbc:drill:schema=parquet-local>
SELECT _MAP['R_REGIONKEY'] as region_key, _MAP['R_NAME'] AS name, _MAP['R_COMMENT'] AS comment FROM "sample-data/region.parquet";
SELECT count(distinct _MAP['N_REGIONKEY']) FROM "sample-data/nation.parquet";
SELECT _MAP['N_REGIONKEY'] as regionKey, _MAP['N_NAME'] as name FROM "sample-data/nation.parquet" WHERE cast(_MAP['N_NAME'] as varchar) < 'M';

More queries are available on the Wiki.

Behind the scenes of the Command Line Interface (CLI)

The CLI of the M1 release essentially comprises three end-user facing shell scripts:

  • bin/drillbit.sh … the Drillbit (launches the worker task on a node)
  • bin/sqlline … the interactive SQL interface (for single-node usage)
  • bin/submit_plan … the interface to submit physical plans (for cluster usage)

For a more complete overview of the call dependencies in the CLI, consult the following figure:

[Figure: CLI dependencies]

If you're not familiar with the basics of Apache Drill, you might find the article Apache Drill: Interactive Ad-Hoc Analysis at Scale useful.