Getting Started for StreamingBench

Note: this is for HiBench 5.0

  1. Prerequisites.

    Finish the configurations described in Getting Started. For running Samza, a Hadoop YARN cluster is needed.

    Download and set up ZooKeeper (3.3.3 is preferred).

    Download and set up Apache Kafka (0.8.1, Scala version 2.10 is preferred).

    Download and set up Apache Storm (0.9.3 is preferred).

  2. ZooKeeper setup

    Edit the config file in the ZooKeeper installation directory; refer to conf/example/zookeeper for an example.

    Go to the ZooKeeper install directory and start ZooKeeper with that config file, as sketched below.
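
    Assuming the standard ZooKeeper distribution scripts, and that your config has been placed at the default location conf/zoo.cfg (an assumption; adjust to your layout), starting and checking the server might look like:

     bin/zkServer.sh start     # reads conf/zoo.cfg by default
     bin/zkServer.sh status    # quick health check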

    You may run bin/zkCli.sh to verify that ZooKeeper is working properly.

    Sometimes you may need to clean up the data inside ZooKeeper. First stop the server, then run "rm -rf /path/to/zookeeper/datadir" to clean the data directory. The directory is defined (as dataDir) in your config file.

  3. Kafka setup

    When configuring Kafka and the topic partition count, we need to ensure the disks won't become a bottleneck. It is suggested to start several brokers on each Kafka node and to configure several disks for each broker. Different brokers on the same node may share disks but should have their own directories on those disks. Our partition count is 16 per Kafka node: if the Kafka cluster contains only 1 Kafka node, we create topics with 16 partitions; for an environment with 3 Kafka nodes, we create topics with 48 partitions.

    A typical set of Kafka config files is config/serv1.properties through config/serv4.properties under the Kafka installation directory. A sketch of what one of these might contain is shown below.
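
    A minimal sketch of serv1.properties, assuming Kafka 0.8.1; the broker id, port, and log directories are illustrative and must be unique for each broker on the same node:

     broker.id=1
     port=9092
     log.dirs=/disk1/kafka-logs-1,/disk2/kafka-logs-1
     zookeeper.connect=zookeeper-host:2181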

    Ensure the ZooKeeper ensemble configured in config/servX.properties (the zookeeper.connect property) is working properly.

    To start 4 brokers on a node, go to the Kafka install directory and run the following commands:

     env JMX_PORT=10000 bin/kafka-server-start.sh config/serv1.properties
     env JMX_PORT=10001 bin/kafka-server-start.sh config/serv2.properties
     env JMX_PORT=10002 bin/kafka-server-start.sh config/serv3.properties
     env JMX_PORT=10003 bin/kafka-server-start.sh config/serv4.properties

    To see if the Kafka brokers are registered in ZooKeeper, go to the ZooKeeper install directory, run bin/zkCli.sh to start a ZooKeeper client session, and run ls /brokers/ids.
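
    For example, a session on a cluster with four brokers might look like this (the broker ids shown are illustrative):

     bin/zkCli.sh
     [zk: localhost:2181(CONNECTED) 0] ls /brokers/ids
     [1, 2, 3, 4]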

    As with ZooKeeper, you may need to clean old data located on the disks of the Kafka brokers. Just run rm -rf <all_data_path> on all your Kafka nodes and directories.

  4. Spark setup

    All Spark Streaming related parameters can be defined in conf/99-user_defined_properties.conf.

    Param Name                                        Param Meaning
    spark.executor.memory                             available memory for Spark worker machines
    spark.serializer                                  serializer class; relevant to the data encoding format
    spark.kryo.referenceTracking                      relevant to the data encoding format
    spark.streaming.receiver.writeAheadLog.enable     whether to enable the Write Ahead Log
    spark.streaming.blockQueueSize                    size of the streaming block queue

    Spark Streaming can be deployed in YARN mode or standalone mode. For YARN mode, just set hibench.spark.master to yarn-client. For standalone mode, set it to spark://spark_master_ip:port and run sbin/start-master.sh in your Spark home.
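
    For example, in conf/99-user_defined_properties.conf (the standalone master host and port below are illustrative):

     hibench.spark.master    yarn-client
     # or, for standalone mode:
     # hibench.spark.master  spark://master-node:7077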

  5. Storm setup

    The conf file is conf/storm.yaml. Basically we configure the following params:

    Param Name                   Param Meaning
    supervisor.slots.ports       number of workers on one supervisor (we set 3 slots per supervisor)
    nimbus.childopts             JVM heap size of nimbus
    supervisor.childopts         JVM heap size of the supervisor
    worker.childopts             JVM heap size of each worker
    topology.max.spout.pending   maximum number of pending spout tuples that can be tolerated
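
    A minimal sketch of conf/storm.yaml along these lines; all values below are illustrative:

     supervisor.slots.ports:
         - 6700
         - 6701
         - 6702
     nimbus.childopts: "-Xmx1024m"
     supervisor.childopts: "-Xmx1024m"
     worker.childopts: "-Xmx2048m"
     topology.max.spout.pending: 5000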

    Run bin/storm nimbus to start nimbus and bin/storm ui to set up the Storm UI. Run bin/storm supervisor to start the Storm supervisors.

  6. HiBench setup

    Same as step 2 in Getting Started.

    The streaming workload is defined in conf/99-user_defined_properties.conf, in hibench.streamingbench.benchname. You may set it to one of the following values, as shown below: identity, sample, project, grep, wordcount, distinctcount, or statistics.
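
    For example, to run the wordcount workload:

     hibench.streamingbench.benchname    wordcount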

    Other parameters can be adjusted in conf/01-default-streamingbench.conf.

    Param Name                                                    Param Meaning
    hibench.streamingbench.prepare.mode                           data preparation mode: push or periodic
    hibench.streamingbench.prepare.push.records                   records to send in push mode
    hibench.streamingbench.prepare.periodic.recordPerInterval     records to send per interval in periodic mode
    hibench.streamingbench.prepare.periodic.intervalSpan          length of each interval in periodic mode
    hibench.streamingbench.prepare.periodic.totalRound            total number of rounds in periodic mode
    hibench.streamingbench.zookeeper.host                         ZooKeeper host:port of the Kafka cluster
    hibench.streamingbench.receiver_nodes                         number of nodes that will receive Kafka input
    hibench.streamingbench.brokerList                             Kafka broker list
    hibench.streamingbench.direct_mode                            direct mode selection (Spark Streaming only)
    hibench.streamingbench.storm.home                             Storm home directory
    hibench.streamingbench.kafka.home                             Kafka home directory
    hibench.streamingbench.storm.nimbus                           host name of the Storm nimbus
    hibench.streamingbench.storm.nimbusAPIPort                    port number of the Storm nimbus API
    hibench.streamingbench.storm.ackon                            ack mode on/off for Storm
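
    A few of these might be set like so; all hosts, ports, and paths below are illustrative:

     hibench.streamingbench.zookeeper.host    zookeeper-host:2181
     hibench.streamingbench.brokerList        kafka-host1:9092,kafka-host2:9092
     hibench.streamingbench.storm.home        /opt/apache-storm-0.9.3
     hibench.streamingbench.kafka.home        /opt/kafka_2.10-0.8.1
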
  7. Run. Usually you need to run the streaming data generation scripts to push data to Kafka while running the streaming job. Please create the Kafka topics first, generate the seed file, and then generate the real data. You can run the following 3 scripts.

     workloads/streamingbench/prepare/initTopic.sh
     workloads/streamingbench/prepare/genSeedDataset.sh
     workloads/streamingbench/prepare/gendata.sh
    

    While the data are being sent to Kafka, start the streaming job, such as Spark Streaming, to process the data:

     workloads/streamingbench/spark/bin/run.sh
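
    Putting the whole step together, a minimal end-to-end driver might look like the sketch below; the HiBench home path is illustrative, and backgrounding gendata.sh is just one way to keep data flowing while the job runs:

     cd /path/to/HiBench                                  # illustrative HiBench home
     workloads/streamingbench/prepare/initTopic.sh        # create the Kafka topics
     workloads/streamingbench/prepare/genSeedDataset.sh   # generate the seed file
     workloads/streamingbench/prepare/gendata.sh &        # push data to Kafka in the background
     workloads/streamingbench/spark/bin/run.sh            # process the data with Spark Streaming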
    
  8. View the report:

    Same as step 4 in Getting Started.

    However, StreamingBench is very different from the non-streaming workloads. Streaming workloads will collect throughput and latency endlessly, print them directly to the terminal, and log them to report/<workload>/<language APIs>/bench.log.

  9. Stop the streaming workloads:

    For Spark Streaming, pressing ctrl+c will stop the work. For Storm & Trident, you'll need to execute storm/bin/stop.sh to stop the work. For Samza, currently you'll have to kill all applications in YARN manually (see below), or restart the YARN cluster directly.
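
    One way to kill the leftover Samza applications by hand is the standard YARN CLI (the application id is a placeholder):

     yarn application -list                    # find the Samza application ids
     yarn application -kill <application_id>   # kill each one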

