With cloud computing, users can tune cloud configurations to meet their performance or cost objectives. In our research project, we aim to find the best cloud configuration for a given workload and a given objective. During our research, we found that performance data is very hard to come by; at least, we could not find performance data that suited our needs. We therefore collected the required data ourselves, and this data repository is the result of that effort. We make the data available to encourage further research in cloud performance optimization.
This data repository includes large-scale performance data of Hadoop and Spark applications on AWS EC2. Since performance varies with different inputs, our data includes multiple combinations of applications and inputs. We use the term "workload" to describe an application together with its input. The workloads are extracted from HiBench and spark-perf.
We ran these workloads on numerous cloud configurations on Amazon EC2. Each configuration consists of a virtual machine (VM) type and the number of VMs of that type. This data repository includes both the single-node setting and the multi-node setting. The single-node setting includes 18 VM types, and the multi-node setting includes 69 configurations (9 VM types and various numbers of VMs).
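The terms above (workload, configuration, objective) can be sketched in Python as follows. This is a minimal illustration, not the repository's actual schema or tooling: the class names, field names, VM type labels, prices, and timings are all hypothetical placeholders, and the cost formula assumes simple per-VM hourly pricing.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Mapping


@dataclass(frozen=True)
class Configuration:
    """A cloud configuration: one VM type and the number of VMs of that type."""
    vm_type: str
    vm_count: int


@dataclass(frozen=True)
class Measurement:
    """One run of a workload on a configuration (hypothetical field names)."""
    workload: str
    config: Configuration
    execution_time_s: float


def run_cost(m: Measurement, hourly_price: Mapping[str, float]) -> float:
    """Assumed cost model: per-VM hourly price x number of VMs x hours used."""
    hours = m.execution_time_s / 3600.0
    return hourly_price[m.config.vm_type] * m.config.vm_count * hours


def best_configuration(
    measurements: Iterable[Measurement],
    workload: str,
    objective: Callable[[Measurement], float],
) -> Configuration:
    """Return the configuration that minimizes `objective` for `workload`."""
    candidates = [m for m in measurements if m.workload == workload]
    if not candidates:
        raise ValueError(f"no measurements for workload {workload!r}")
    return min(candidates, key=objective).config


# Made-up values for illustration only; they are not taken from the datasets.
example_prices = {"type-a": 0.10, "type-b": 0.50}
example_runs = [
    Measurement("example-workload", Configuration("type-a", 4), 1800.0),
    Measurement("example-workload", Configuration("type-b", 2), 1200.0),
]

fastest = best_configuration(
    example_runs, "example-workload", lambda m: m.execution_time_s
)
cheapest = best_configuration(
    example_runs, "example-workload", lambda m: run_cost(m, example_prices)
)
```

With these placeholder numbers the fastest and the cheapest configurations differ, which is why the objective is passed in explicitly rather than fixed to execution time.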
For each measurement, we collect the execution time as well as low-level performance information gathered with sar. For more detail, read the description of each dataset.
ID | Platforms | Systems | Workloads | Description |
---|---|---|---|---|
osr_single_node | AWS EC2 | | | Multiple workloads running on a single-node setting on AWS |
osr_multiple_nodes | AWS EC2 | | | Multiple workloads running on the multiple-nodes setting on AWS |
- We encourage researchers to share their performance data. Please submit a pull request.
- You can obtain the scripts and required AMI at the scout-scripts repo.
@inproceedings{hsu2018arrow,
title={Arrow: Low-Level Augmented Bayesian Optimization for Finding the Best Cloud VM},
author={Hsu, Chin-Jung and Nair, Vivek and Freeh, Vincent W and Menzies, Tim},
booktitle={the 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS 2018)},
year={2018}
}
@inproceedings{hsu2018micky,
title={Micky: A Cheaper Alternative for Selecting Cloud Instances},
author={Hsu, Chin-Jung and Nair, Vivek and Menzies, Tim and Freeh, Vincent},
booktitle={the IEEE International Conference on Cloud Computing (IEEE CLOUD 2018)},
year={2018}
}
@article{hsu2018scout,
title={Scout: An Experienced Guide to Find the Best Cloud Configuration},
author={Hsu, Chin-Jung and Nair, Vivek and Menzies, Tim and Freeh, Vincent},
journal={arXiv preprint arXiv:1803.01296},
year={2018}
}