This repository has been archived by the owner on Mar 30, 2021. It is now read-only.

Generating Denormalized TPCH Dataset

hbutani edited this page Jan 14, 2016 · 1 revision

These instructions describe how to create the flattened dataset when running locally (in a developer environment).

  1. Use the TPC-H DBGen tool to generate the TPC-H dataset at a given data scale. Data scale 1 should be more than enough for a developer environment.
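DBGen writes each table as a pipe-delimited '.tbl' file (with a trailing '|' on every row and no header line). A minimal sketch of reading such a file, using a made-up sample row rather than real DBGen output:

```python
# Parse a dbgen-style '.tbl' row. The sample row is fabricated for
# illustration; real files contain one such pipe-delimited row per line.
sample = "1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|"

def parse_tbl_line(line):
    """Split a dbgen row on '|'; the trailing delimiter is stripped first."""
    return line.rstrip("\n").rstrip("|").split("|")

fields = parse_tbl_line(sample)
print(len(fields))
```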

  2. Clone and build the tpch utils package. To build, issue: cd tpchData; sbt clean compile package (you need sbt installed for this).

  3. Download a Spark release. As of this writing, we have tested with spark-1.5.2.

  4. Issue the following to create the flattened dataset:

bin/spark-submit \
--packages com.databricks:spark-csv_2.10:1.1.0,SparklineData:spark-datetime:0.0.2,SparklineData:spark-druid-olap:0.0.2 \
--class org.sparklinedata.tpch.TpchGenMain   \
/Users/hbutani/sparkline/tpch-spark-druid/tpchData/target/scala-2.10/tpchdata-assembly-0.0.1.jar \
--baseDir /Users/hbutani/tpch/ --scale 1

where:

  • '/Users/hbutani/sparkline/tpch-spark-druid/tpchData/target/scala-2.10/tpchdata-assembly-0.0.1.jar' is the location of the tpch-utils jar
  • '/Users/hbutani/tpch/' is the location of the tpch data. Under this folder there are one or more datascale folders whose names are of the form 'datascale%n' (e.g. 'datascale1')
  • The flattened dataset is written to a subfolder named 'orderLineItemPartSupplierCustomer' under the datascale folder
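The flattening step denormalizes the TPC-H tables into one wide table, as the output folder name suggests. A toy sketch of the idea in plain Python, joining each lineitem row to its order and customer; the column subset and join code here are illustrative, not the jar's actual schema or implementation:

```python
# Hypothetical miniature tables keyed the way TPC-H relates them:
# lineitem -> orders (l_orderkey) -> customer (o_custkey).
orders = {1: {"o_orderkey": 1, "o_custkey": 10, "o_totalprice": 100.0}}
customers = {10: {"c_custkey": 10, "c_name": "Customer#000000010"}}
lineitems = [{"l_orderkey": 1, "l_partkey": 5, "l_quantity": 17}]

def flatten(lineitems, orders, customers):
    """Emit one wide row per lineitem by merging in its order and customer."""
    rows = []
    for li in lineitems:
        o = orders[li["l_orderkey"]]
        c = customers[o["o_custkey"]]
        rows.append({**li, **o, **c})
    return rows

wide = flatten(lineitems, orders, customers)
print(len(wide))
```

The real tool additionally folds in part, supplier, and the remaining dimension tables, producing the 'orderLineItemPartSupplierCustomer' output.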