Batch processing comparison - Apache Spark vs. Apache Flink

This project contains all the Karamel definition files required to run a batch processing comparison between Apache Spark and Apache Flink in a public cloud. It uses TeraSort to benchmark the systems and TeraGen to generate the input data. Feed each Karamel definition file to Karamel to execute its task.

###Step 1 : Deploy the cluster in cloud

Definition file : cluster-deployment.yml

The definition file above deploys the following clusters.

| Cluster | Version |
|---------|---------|
| Spark   | 1.3.1   |
| Flink   | 0.9.1   |
| Hadoop  | 2.4.0   |

####Configurations that you might want to change

EC2 machine type and region:

ec2:
  region: us-west-2
  type: m3.xlarge

You can override the machine type and region for individual node groups.

datanodes:
    size: 2
    ec2:
      region: us-west-2
      type: i2.4xlarge
      price: 0.5

Size and bid for spot instances

In the configuration above, size: sets the number of nodes in the group, and price: 0.5 requests spot instances with a bid of $0.5 per hour. If you do not want to take the risk of using spot instances, remove the line price: 0.5 and Karamel will use on-demand instances instead.
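To get a feel for what a bid commits you to, the sketch below computes an upper bound on the hourly spend of a node group. It assumes the bid applies per instance per hour (the usual EC2 spot convention, not something stated in the definition file), and the function name is only illustrative.

```python
def max_hourly_spot_cost(size: int, bid_per_hour: float) -> float:
    """Upper bound on hourly spend for a group of spot instances.

    Assumption: the bid is per instance, so the group can never cost
    more than size * bid in any one hour.
    """
    return size * bid_per_hour


# The datanodes group above: size 2, bid $0.5 per hour.
print(max_hourly_spot_cost(2, 0.5))  # 1.0
```

The actual charge depends on the spot market price, so treat this number as a ceiling rather than an estimate.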

Memory allocations:

Spark

driver_memory: 8192m
executor_memory: 100g

Flink

jobmanager:
  heap_mbs: '8192'
taskmanager:
  heap_mbs: '102400'

For a fair comparison, allocate the same amount of memory to the Spark driver_memory and the Flink jobmanager, and likewise to the Spark executor_memory and the Flink taskmanager.
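Spark takes memory as a size string (8192m, 100g) while the Flink settings here are plain megabyte counts, so it is easy to end up with mismatched values. The helper below is a minimal sketch for converting one into the other. It is not part of this project, and it assumes only the m, g and t suffixes are used, each as a power of 1024.

```python
import re

# Multipliers from a size suffix to megabytes (powers of 1024).
_SUFFIX_TO_MB = {"m": 1, "g": 1024, "t": 1024 * 1024}


def spark_memory_to_mb(value: str) -> int:
    """Convert a Spark-style memory string such as '100g' to megabytes."""
    match = re.fullmatch(r"(\d+)([mgt])b?", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported memory string: {value!r}")
    number, suffix = match.groups()
    return int(number) * _SUFFIX_TO_MB[suffix]


# The values used in this README line up as follows.
print(spark_memory_to_mb("8192m"))  # 8192   -> jobmanager heap_mbs
print(spark_memory_to_mb("100g"))   # 102400 -> taskmanager heap_mbs
```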

###Step 2 : Run TeraGen experiment to generate data **

Definition file: teragen.yml

The TeraGen experiment generates the data used in the TeraSort experiment by executing a Hadoop MapReduce job.

####Configuration changes

IP to deploy the experiment (required*)

Add the public IP address of your master node under ips: in the configuration:

ips:
      - 54.203.56.51
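Since the experiment needs the public address rather than the private one, a quick sanity check before running Karamel can save a failed run. The sketch below uses Python's standard ipaddress module. The function name and the idea of checking the list by hand are illustrative assumptions, not something Karamel provides.

```python
import ipaddress


def check_public_ips(ips):
    """Return a list of problems found in the given ips: entries."""
    problems = []
    for raw in ips:
        try:
            addr = ipaddress.ip_address(raw)
        except ValueError:
            problems.append(f"{raw}: not a valid IP address")
            continue
        # is_global is False for private, loopback and link-local ranges.
        if not addr.is_global:
            problems.append(f"{raw}: not a publicly routable address")
    return problems


print(check_public_ips(["54.203.56.51"]))  # []
print(check_public_ips(["10.0.0.5"]))
```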

Number of records to be generated

Each record generated by TeraGen is 100 bytes in size. The number of records to generate is set with the following [configuration](https://github.com/karamel-lab/batch-processing-comparison/blob/master/teragen.yml#L13):

attrs:
  teragen:
    records: '2000000000'

records: '2000000000' generates 200 GB of data.
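If you want a different data size, the record count follows directly from the 100-byte record size. The snippet below is a small sketch of that arithmetic. The function names are made up for illustration, and sizes are treated as decimal units (1 GB = 10^9 bytes), which is what makes 2000000000 records come out to 200 GB.

```python
RECORD_SIZE_BYTES = 100  # size of one TeraGen record


def teragen_data_size_bytes(records: int) -> int:
    """Total bytes produced for a given number of records."""
    return records * RECORD_SIZE_BYTES


def teragen_records_for(target_bytes: int) -> int:
    """Smallest record count that yields at least target_bytes."""
    return -(-target_bytes // RECORD_SIZE_BYTES)  # ceiling division


GB = 10**9  # decimal gigabyte
print(teragen_data_size_bytes(2_000_000_000) // GB)  # 200
print(teragen_records_for(200 * GB))                 # 2000000000
```

Remember to quote the resulting number when you put it into the records: attribute, as in the example above.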

###Step 3 : Run TeraSort experiment **

The TeraSort experiment runs the benchmarking algorithm for the performance comparison of the systems. The TeraSort benchmark for Spark and Flink should be run separately. The public IP address of the master node should be configured in the ips: section, in the same way as for the TeraGen experiment.

####Spark

Definition file: terasort-spark.yml

####Configuration changes required *

Add the public IP address of your master node under ips: in the configuration:

ips:
      - 54.203.56.51

####Flink

Definition file: terasort-flink.yml

####Configuration changes required *

Add the public IP address of your master node under ips: in the configuration:

ips:
      - 54.203.56.51

You can use the collectl-monitoring tool to monitor the system-level performance of these big data engines while the TeraSort experiment executes. Start collectl-monitor before you run the experiment and stop it once the experiment is over. You can visit here for a quick look at an example scenario.

Future enhancements

**To complete steps 2 and 3 seamlessly, Karamel should support adding both public and private IPs in the Karamel definition file for bare-metal. Karamel will support this feature soon.
