Getting Started for StreamingBench

Note: this is for HiBench 5.0

  1. Prerequisites.

    Finish the configurations described in Getting Started. For running Samza, a Hadoop YARN cluster is needed.

    Download and set up ZooKeeper (3.3.3 is preferred).

    Download and set up Apache Kafka (0.8.1, Scala version 2.10 is preferred).

    Download and set up Apache Storm (0.9.3 is preferred).

  2. ZooKeeper setup

    Edit the config file in the ZooKeeper installation directory; refer to conf/example/zookeeper for an example.

    Go to the ZooKeeper install directory and start ZooKeeper with that config file, as sketched below.
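
    Assuming the standard ZooKeeper distribution scripts, and that your config has been placed at the default location conf/zoo.cfg (an assumption; adjust to your layout), starting and checking the server might look like:

     bin/zkServer.sh start     # reads conf/zoo.cfg by default
     bin/zkServer.sh status    # quick health check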

    You may run bin/zkCli.sh to verify that ZooKeeper is working properly.

    Sometimes you may need to clean up the data inside ZooKeeper. First stop the server, then run "rm -rf /path/to/zookeeper/datadir" to clean the data directory. The directory is defined (as dataDir) in your config file.

  3. Kafka setup

    When configuring Kafka and the topic partition count, we need to ensure the disks won't become a bottleneck. It is suggested to start several brokers on each Kafka node and to configure several disks for each broker. Different brokers on the same node may share disks but should have their own directories on those disks. Our partition count is 16 per Kafka node: if the Kafka cluster contains only 1 Kafka node, we create topics with 16 partitions; for an environment with 3 Kafka nodes, we create topics with 48 partitions.

    A typical set of Kafka config files is config/serv1.properties through config/serv4.properties under the Kafka installation directory. A sketch of what one of these might contain is shown below.
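
    A minimal sketch of serv1.properties, assuming Kafka 0.8.1; the broker id, port, and log directories are illustrative and must be unique for each broker on the same node:

     broker.id=1
     port=9092
     log.dirs=/disk1/kafka-logs-1,/disk2/kafka-logs-1
     zookeeper.connect=zookeeper-host:2181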

    Ensure the ZooKeeper ensemble configured in config/servX.properties (the zookeeper.connect property) is working properly.

    To start 4 brokers on a node, go to the Kafka install directory and run the following commands:

     env JMX_PORT=10000 bin/kafka-server-start.sh config/serv1.properties
     env JMX_PORT=10001 bin/kafka-server-start.sh config/serv2.properties
     env JMX_PORT=10002 bin/kafka-server-start.sh config/serv3.properties
     env JMX_PORT=10003 bin/kafka-server-start.sh config/serv4.properties

    To see if the Kafka brokers are registered in ZooKeeper, go to the ZooKeeper install directory, run bin/zkCli.sh to start a ZooKeeper client session, and run ls /brokers/ids.
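
    For example, a session on a cluster with four brokers might look like this (the broker ids shown are illustrative):

     bin/zkCli.sh
     [zk: localhost:2181(CONNECTED) 0] ls /brokers/ids
     [1, 2, 3, 4]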

    As with ZooKeeper, you may need to clean old data located on the disks of the Kafka brokers. Just run rm -rf <all_data_path> on all your Kafka nodes and directories.

  4. Spark setup

    All Spark Streaming related parameters can be defined in conf/99-user_defined_properties.conf.

    Param Name                                        Param Meaning
    spark.executor.memory                             available memory for Spark worker machines
    spark.serializer                                  serializer class; relevant to the data encoding format
    spark.kryo.referenceTracking                      relevant to the data encoding format
    spark.streaming.receiver.writeAheadLog.enable     whether to enable the Write Ahead Log
    spark.streaming.blockQueueSize                    size of the streaming block queue

    Spark Streaming can be deployed in YARN mode or standalone mode. For YARN mode, just set hibench.spark.master to yarn-client. For standalone mode, set it to spark://spark_master_ip:port and run sbin/start-master.sh in your Spark home.
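
    For example, in conf/99-user_defined_properties.conf (the standalone master host and port below are illustrative):

     hibench.spark.master    yarn-client
     # or, for standalone mode:
     # hibench.spark.master  spark://master-node:7077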

  5. Storm setup

    The conf file is conf/storm.yaml. Basically we configure the following params:

    Param Name                   Param Meaning
    supervisor.slots.ports       number of workers on one supervisor (we set 3 slots per supervisor)
    nimbus.childopts             JVM heap size of nimbus
    supervisor.childopts         JVM heap size of the supervisor
    worker.childopts             JVM heap size of each worker
    topology.max.spout.pending   maximum number of pending spout tuples that can be tolerated
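
    A minimal sketch of conf/storm.yaml along these lines; all values below are illustrative:

     supervisor.slots.ports:
         - 6700
         - 6701
         - 6702
     nimbus.childopts: "-Xmx1024m"
     supervisor.childopts: "-Xmx1024m"
     worker.childopts: "-Xmx2048m"
     topology.max.spout.pending: 5000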

    Run bin/storm nimbus to start nimbus and bin/storm ui to set up the Storm UI. Run bin/storm supervisor to start the Storm supervisors.

  6. HiBench setup

    Same as step 2 in Getting Started.

    The streaming workload is defined in conf/99-user_defined_properties.conf, in hibench.streamingbench.benchname. You may set it to one of the following values, as shown below: identity, sample, project, grep, wordcount, distinctcount, or statistics.
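
    For example, to run the wordcount workload:

     hibench.streamingbench.benchname    wordcount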

    Other parameters can be adjusted in conf/01-default-streamingbench.conf.

    Param Name                                                    Param Meaning
    hibench.streamingbench.prepare.mode                           data preparation mode: push or periodic
    hibench.streamingbench.prepare.push.records                   records to send in push mode
    hibench.streamingbench.prepare.periodic.recordPerInterval     records to send per interval in periodic mode
    hibench.streamingbench.prepare.periodic.intervalSpan          length of each interval in periodic mode
    hibench.streamingbench.prepare.periodic.totalRound            total number of rounds in periodic mode
    hibench.streamingbench.zookeeper.host                         ZooKeeper host:port of the Kafka cluster
    hibench.streamingbench.receiver_nodes                         number of nodes that will receive Kafka input
    hibench.streamingbench.brokerList                             Kafka broker list
    hibench.streamingbench.direct_mode                            direct mode selection (Spark Streaming only)
    hibench.streamingbench.storm.home                             Storm home directory
    hibench.streamingbench.kafka.home                             Kafka home directory
    hibench.streamingbench.storm.nimbus                           host name of the Storm nimbus
    hibench.streamingbench.storm.nimbusAPIPort                    port number of the Storm nimbus API
    hibench.streamingbench.storm.ackon                            ack mode on/off for Storm
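
    A few of these might be set like so; all hosts, ports, and paths below are illustrative:

     hibench.streamingbench.zookeeper.host    zookeeper-host:2181
     hibench.streamingbench.brokerList        kafka-host1:9092,kafka-host2:9092
     hibench.streamingbench.storm.home        /opt/apache-storm-0.9.3
     hibench.streamingbench.kafka.home        /opt/kafka_2.10-0.8.1
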
  7. Run. Usually you need to run the streaming data generation scripts to push data to Kafka while running the streaming job. Please create the Kafka topics first, generate the seed file, and then generate the real data. You can run the following 3 scripts.

     workloads/streamingbench/prepare/initTopic.sh
     workloads/streamingbench/prepare/genSeedDataset.sh
     workloads/streamingbench/prepare/gendata.sh
    

    While the data are being sent to Kafka, start the streaming job, such as Spark Streaming, to process the data:

     workloads/streamingbench/spark/bin/run.sh
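
    Putting the whole step together, a minimal end-to-end driver might look like the sketch below; the HiBench home path is illustrative, and backgrounding gendata.sh is just one way to keep data flowing while the job runs:

     cd /path/to/HiBench                                  # illustrative HiBench home
     workloads/streamingbench/prepare/initTopic.sh        # create the Kafka topics
     workloads/streamingbench/prepare/genSeedDataset.sh   # generate the seed file
     workloads/streamingbench/prepare/gendata.sh &        # push data to Kafka in the background
     workloads/streamingbench/spark/bin/run.sh            # process the data with Spark Streaming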
    
  8. View the report:

    Same as step 4 in Getting Started.

    However, StreamingBench is very different from the non-streaming workloads. Streaming workloads will collect throughput and latency endlessly, print them directly to the terminal, and log them to report/<workload>/<language APIs>/bench.log.

  9. Stop the streaming workloads:

    For Spark Streaming, pressing ctrl+c will stop the work. For Storm & Trident, you'll need to execute storm/bin/stop.sh to stop the work. For Samza, currently you'll have to kill all applications in YARN manually (see below), or restart the YARN cluster directly.
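
    One way to kill the leftover Samza applications by hand is the standard YARN CLI (the application id is a placeholder):

     yarn application -list                    # find the Samza application ids
     yarn application -kill <application_id>   # kill each one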

