1. Setup

- Python 2.x (>=2.6) is required.
- `bc` is required to generate the HiBench report.
- Supported Hadoop versions: Apache Hadoop 2.x, CDH5.x, HDP
- Supported Spark versions: 1.6.x, 2.0.x, 2.1.x, 2.2.x
- Build HiBench according to the build HiBench instructions.
- Start HDFS, Yarn and Spark in the cluster.
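For a plain Apache setup, the stock start scripts can bring the services up. A minimal sketch, assuming `HADOOP_HOME` and `SPARK_HOME` point at the respective installations (adjust for your distribution):

```bash
# Start HDFS and YARN (Apache Hadoop)
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

# Start a standalone Spark master and workers; skip this when
# submitting to Yarn (e.g. when the Spark master is yarn-client)
$SPARK_HOME/sbin/start-all.sh
```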
2. Configure hadoop.conf

Hadoop is used to generate the input data of the workloads. Create and edit `conf/hadoop.conf`:

```
cp conf/hadoop.conf.template conf/hadoop.conf
```

Set the below properties properly:
| Property | Meaning |
|----------|---------|
| hibench.hadoop.home | The Hadoop installation location |
| hibench.hadoop.executable | The path of the hadoop executable. For Apache Hadoop, it is /YOUR/HADOOP/HOME/bin/hadoop |
| hibench.hadoop.configure.dir | Hadoop configuration directory. For Apache Hadoop, it is /YOUR/HADOOP/HOME/etc/hadoop |
| hibench.hdfs.master | The root HDFS path to store HiBench data, e.g. hdfs://localhost:8020/user/username |
| hibench.hadoop.release | Hadoop release provider. Supported values: apache, cdh5, hdp |
Note: For CDH and HDP users, please set `hibench.hadoop.release` properly; the default value is for the Apache release.
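Putting it together, a minimal sketch of a filled-in `conf/hadoop.conf` for an Apache release; the `/opt/hadoop` install path and the HDFS URI are placeholders for illustration:

```
hibench.hadoop.home           /opt/hadoop
hibench.hadoop.executable     /opt/hadoop/bin/hadoop
hibench.hadoop.configure.dir  /opt/hadoop/etc/hadoop
hibench.hdfs.master           hdfs://localhost:8020/user/hibench
hibench.hadoop.release        apache
```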
3. Configure spark.conf

Create and edit `conf/spark.conf`:

```
cp conf/spark.conf.template conf/spark.conf
```
Set the below properties properly:

| Property | Meaning |
|----------|---------|
| hibench.spark.home | The Spark installation location |
| hibench.spark.master | The Spark master, e.g. `spark://xxx:7077`, `yarn-client` |
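As with `hadoop.conf`, a minimal sketch of a filled-in `conf/spark.conf`; `/opt/spark` is a placeholder install path, and `yarn-client` is just one possible master:

```
hibench.spark.home    /opt/spark
hibench.spark.master  yarn-client
```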
4. Run a workload
To run a single workload, run its `prepare.sh` first and then its `run.sh`, as shown below. The `prepare.sh` script launches a Hadoop job to generate the input data on HDFS, and `run.sh` submits the Spark job to the cluster.
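For example, for the `wordcount` workload, assuming the layout where each workload keeps its scripts under `bin/workloads/<category>/<workload>/` (exact paths may vary across HiBench versions):

```
bin/workloads/micro/wordcount/prepare/prepare.sh
bin/workloads/micro/wordcount/spark/run.sh
```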
`bin/run_all.sh` can be used to run all workloads listed in `conf/benchmarks.lst`.
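A hypothetical excerpt of `conf/benchmarks.lst`, assuming the one-workload-per-line format where a leading `#` disables an entry (check the file shipped with your HiBench version):

```
micro.wordcount
micro.sort
#micro.terasort
```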
5. View the report
`<HiBench_Root>/report/hibench.report` is a summarized workload report, including workload name, execution duration, data size, throughput per cluster, and throughput per node.
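A hypothetical excerpt to illustrate the shape of `hibench.report`; the values are made up and the exact column layout can differ between versions:

```
Type                Date       Time     Input_data_size  Duration(s)  Throughput(bytes/s)  Throughput/node
ScalaSparkWordcount 2018-01-01 10:00:00 328964293        37.9         8679269              2893089
```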
The report directory also includes further information for debugging and tuning:

- `<workload>/spark/bench.log`: Raw logs on the client side.
- `<workload>/spark/monitor.html`: System utilization monitor results.
- `<workload>/spark/conf/<workload>.conf`: Generated environment variable configurations for this workload.
- `<workload>/spark/conf/sparkbench/<workload>/sparkbench.conf`: Generated configuration for this workload, which is used for mapping to environment variables.
- `<workload>/spark/conf/sparkbench/<workload>/spark.conf`: Generated configuration for Spark.
6. Input data size
To change the input data size, set `hibench.scale.profile` in `conf/hibench.conf`. Available values are tiny, small, large, huge, gigantic and bigdata. The definition of these profiles can be found in each workload's conf file.
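For instance, to switch to the large profile, a minimal sketch of the relevant line in `conf/hibench.conf`:

```
hibench.scale.profile  large
```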
Change the below properties in `conf/hibench.conf` to control the parallelism:

| Property | Meaning |
|----------|---------|
| hibench.default.map.parallelism | Partition number in Spark |
| hibench.default.shuffle.parallelism | Shuffle partition number in Spark |
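A sketch with placeholder values; the right numbers depend on the size of your cluster:

```
hibench.default.map.parallelism      8
hibench.default.shuffle.parallelism  8
```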
Change the below properties to control the Spark executor number, executor cores, executor memory and driver memory:

| Property | Meaning |
|----------|---------|
| hibench.yarn.executor.num | Spark executor number in Yarn mode |
| hibench.yarn.executor.cores | Spark executor cores in Yarn mode |
| spark.executor.memory | Spark executor memory |
| spark.driver.memory | Spark driver memory |
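A minimal sketch with placeholder values; size these to fit your Yarn container limits and node capacity:

```
hibench.yarn.executor.num    4
hibench.yarn.executor.cores  2
spark.executor.memory        4g
spark.driver.memory          2g
```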