









# WATTSKIT, Software-Defined Power Monitoring of Distributed Systems

CCGrid'17: Performance Modeling and Evaluation (Session 18B)

17<sup>th</sup> May, 2017 – 10:55

#### Authors:

| Author            | Affiliation                        |
|-------------------|------------------------------------|
| Maxime COLMANT    | ADEME / University Lille 1 / Inria |
| Pascal FELBER     | University of Neuchâtel            |
| Romain ROUVOY     | University Lille 1 / Inria / IUF   |
| Lionel SEINTURIER | University Lille 1 / Inria / IUF   |

#### TABLE OF CONTENTS

- 1. Introduction
- 2. Contributions
- 3. Conclusion

#### **INTRODUCTION**

#### THE GLOBAL ICT<sup>1</sup> FOOTPRINT<sup>2</sup>




<sup>1</sup> Information and Communications Technology

<sup>2</sup> The Climate Group. SMART 2020: Enabling the low carbon economy in the information age. 2008.

#### MULTI-CORE CPU ARCHITECTURES ARE EVERYWHERE!




#### **CASE STUDY**




#### **RESEARCH QUESTIONS**

**RQ1:** Can we model the software power consumption regardless of the underlying architecture?






**RQ2:** Can we propose a uniform view of the service power consumption?




### CONTRIBUTIONS

**RQ1:** Can we model the software power consumption regardless of the underlying architecture?










Learning CPU Power Models


| Ref.     | Processor(s)                            | Feature(s)                             | Regression(s)                   | Benchmarks                                                     |
|----------|-----------------------------------------|----------------------------------------|---------------------------------|----------------------------------------------------------------|
| [Ber+10] | Core 2 Duo                              | 14 HPCs regrouped by component         | multiple linear<br>by component | sampl.: μ-benchs<br>eval.: SPEC CPU 06                         |
| [Col+15] | Xeon<br>W3520 & i3 2120                 | non-halted cycles,<br>reference cycles | polynomial                      | sampl.: stress<br>eval.: PARSEC, SPECjbb                       |
| [CM05]   | XScale<br>PXA255                        | 5 HPCs                                 | multiple linear                 | eval.: SPEC CPU 00,<br>Java CDC/CLDC                           |
| [Dol+15] | Xeon<br>E3-1275                         | 3 HPCs<br>HW sensors                   | linear                          | sampl.: linpack, stream, iperf, IOR<br>eval.: Quantum Espresso |
| [ERK06]  | Turion,<br>Itanium 2                    | HW sensors                             | multiple linear                 | sampl.: Gamut<br>eval.: SPECs, Matrix, Stream                  |
| [IM03]   | Pentium 4                               | 15 HPCs                                | multiple linear                 | eval.: μ-benchs, AbiWord,<br>Mozilla, Gnumeric                 |
| [RRK08]  | Core 2 Duo & Xeon,<br>Itanium 2, Turion | HW sensors<br>HPCs                     | multiple linear                 | sampl.: calibration suite<br>eval.: SPECs, stream, Nsort       |
| [Yan+14] | Xeon<br>E5620 & E7530                   | 7 components<br>91 preselected         | support vector                  | sampl.: NPB, IOzone, CacheBench<br>eval.: SPEC CPU 06, IOzone  |
| [Zha+14] | Sandy Bridge                            | non-halted cycles                      | linear                          | eval.: Google, SPEC CPU 06                                     |
| ???      | ARM                                     | ???                                    | ???                             | ???                                                            |

#### Only for Intel or AMD architectures

#### HW sensors: coarse-grained CPU metrics

#### HPCs: fine-grained CPU metrics

#### Power models are mostly linear

#### Non-free or private workloads


- 1. Portability
- 2. Accuracy
- 3. Reproducibility

Towards an automatic approach for learning CPU power models

#### OUR APPROACH: OPEN-TESTBED TO AUTOMATICALLY LEARN POWER MODELS



- 1. Input workload injection
  - Configurable
  - PARSEC (open-source, multi-threaded)<sup>3</sup>
  - Run several applications (x264, vips, etc.)

<sup>3</sup> C. Bienia et al. "PARSEC 2.0: A New Benchmark Suite for Chip-Multiprocessors". In: Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation. 2009.




- 2. Acquisition of raw input metrics
  - Automatically explore the large number of available HPCs (Xeon W3520: 514 HPCs)
  - Take care of HPC multiplexing<sup>4</sup>

<sup>4</sup> Intel. Intel 64 and IA-32 Architectures Software Developer's Manual. 2015.




- 3. Selection of relevant HPCs
  - Pearson coefficient (HPC ⇔ Power)
  - 1<sup>st</sup> phase: quickly filter out uncorrelated HPCs (< 0.5) (Xeon W3520: 253 left out)
  - 2<sup>nd</sup> phase: full sampling for the remaining HPCs
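The first filtering phase can be sketched as follows. This is a minimal illustration, not the WattsKit implementation: the counter names and sample values are made up, and the slide does not say whether the signed or the absolute coefficient is compared with 0.5, so the sketch assumes the signed value.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def select_hpcs(samples, power, threshold=0.5):
    """Keep the counters whose correlation with power reaches the threshold."""
    scores = {name: pearson(values, power) for name, values in samples.items()}
    return {name: r for name, r in scores.items() if r >= threshold}


# Hypothetical samples: one value per sampling interval, power in watts.
power = [95.0, 101.0, 110.0, 118.0, 127.0]
samples = {
    "counter_a": [1.0e8, 2.1e8, 3.9e8, 5.2e8, 7.0e8],  # grows with power
    "counter_b": [5.0e6, 1.0e6, 4.0e6, 2.0e6, 3.0e6],  # unrelated to power
    "counter_c": [7.0, 7.0, 7.0, 7.0, 7.0],            # constant
}
selected = select_hpcs(samples, power)
print(sorted(selected))
```

With these made-up samples only `counter_a` passes the filter; in the second phase, the surviving counters would be sampled in full.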




- 4. Power model inference
  - Minimize the number of HPCs
  - Robust ridge regression (SotA?)
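To give an idea of the inference step, here is a sketch of plain ridge regression in closed form. It is deliberately simpler than the robust variant named on the slide and makes no claim to match the authors' code; the function name, the data and the penalty values are assumptions chosen for illustration.

```python
def ridge_fit(rows, targets, lam):
    """Solve (X^T X + lam * I) w = X^T y for w by Gaussian elimination."""
    k = len(rows[0])
    # Augmented matrix [X^T X + lam * I | X^T y].
    a = [
        [sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
         for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(k)
    ]
    # Forward elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, k):
            factor = a[r][col] / a[col][col]
            for c in range(col, k + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    w = [0.0] * k
    for i in reversed(range(k)):
        rest = sum(a[i][j] * w[j] for j in range(i + 1, k))
        w[i] = (a[i][k] - rest) / a[i][i]
    return w


# Made-up data: two scaled counters per row, target = power above idle (W).
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
targets = [2.0, 3.0, 5.0, 7.0]  # generated as 2 * x1 + 3 * x2

w_plain = ridge_fit(rows, targets, lam=0.0)  # ordinary least squares
w_ridge = ridge_fit(rows, targets, lam=1.0)  # coefficients shrink towards 0
print(w_plain, w_ridge)
```

The penalty `lam` trades fit for smaller coefficients, which helps when the selected counters are strongly correlated with each other. Reducing the number of counters is a separate concern that this sketch does not address.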


Relative errors for the PARSEC suite on a Xeon W3520.

$$P_{idle} = 92 \text{ W}; \ P_{CPU} = \frac{1.40 \cdot \text{l1i:reads}}{10^8} + \frac{7.29 \cdot \text{lsd:inactive}}{10^9}$$
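The model above can be evaluated directly. The sketch below assumes that the total estimate is the sum of $P_{idle}$ and $P_{CPU}$, which the slide suggests but does not state, and the counter values passed in are hypothetical. The `relative_error` helper shows one common definition of the metric; the slide does not spell out its own.

```python
P_IDLE_W = 92.0  # idle power reported for the Xeon W3520


def cpu_power(l1i_reads, lsd_inactive):
    """CPU power in watts from the two counters of the learned model."""
    return 1.40 * l1i_reads / 1e8 + 7.29 * lsd_inactive / 1e9


def total_power(l1i_reads, lsd_inactive):
    """Assumed total: idle power plus the counter-based CPU estimate."""
    return P_IDLE_W + cpu_power(l1i_reads, lsd_inactive)


def relative_error(estimated, measured):
    """Absolute difference relative to the measured power."""
    return abs(estimated - measured) / measured


# Hypothetical counter values for one sampling interval.
estimate = total_power(l1i_reads=1.0e9, lsd_inactive=2.0e9)
print(round(estimate, 2))
```

For these inputs the CPU term is 14.0 W + 14.58 W, so the estimate is about 120.58 W.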



#### SUMMARY

- Portability
  - Beyond SotA: adaptive approach
- Accuracy
  - Avg. error: 1.35%
- Reproducibility
  - Built on open-source workloads

**RQ2:** Can we propose a uniform view of the service power consumption?






#### Challenges

- 1. Native
- 2. Distributed


- Code freely available: wattskit.powerapi.org
  - Scala / Akka
  - LoC: 8.7k
  - Docker
  - AGPLv3









## SD Power Meter For Monitoring Concurrent Apps



- On an Intel Xeon W3520
- Monitoring freq.: 4 Hz
- Avg. error: 2%
- Low overhead: 2 W


## **CURRENT COARSE-GRAINED SOLUTIONS**





## A SERVICE-LEVEL POWER MONITORING






# CONCLUSION


### CONTRIBUTIONS

WATTSKIT, Software-Defined Power Monitoring of Distributed Systems

- RQ1: Can we model the software power consumption regardless of the underlying architecture?
  - Open-testbed approach for learning multi-core power models
- RQ2: Can we propose a uniform view of the service power consumption?
  - In width energy monitoring, thanks to WATTSKIT




## Thanks for your attention.

Maxime COLMANT maxime.colmant@inria.fr

WattsKit, for distributed systems:

[Col+17]

http://wattskit.powerapi.org/

BitWatts, for virtualized environments:

http://bitwatts.powerapi.org/

#### REFERENCES I

- [Ber+10] R. Bertran et al. "Decomposable and Responsive Power Models for Multicore Processors Using Performance Counters". In: Proceedings of the 24th ACM International Conference on Supercomputing. 2010.
- [BL09] C. Bienia and K. Li. "PARSEC 2.0: A New Benchmark Suite for Chip-Multiprocessors". In: Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation. 2009.
- [CM05] G. Contreras and M. Martonosi. "Power Prediction for Intel XScale® Processors Using Performance Monitoring Unit Events". In: Proceedings of the International Symposium on Low Power Electronics and Design. 2005.
- [Col+15] M. Colmant et al. "Process-level Power Estimation in VM-based Systems". In: Proceedings of the 10th European Conference on Computer Systems (EuroSys). 2015.
- [Col+17] M. Colmant et al. "WattsKit: Software-Defined Power Monitoring of Distributed Systems". In: 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). 2017.
- [Dol+15] M. F. Dolz et al. "An analytical methodology to derive power models based on hardware and software metrics". In: Computer Science Research and Development (2015).

#### REFERENCES II

- [ERK06] D. Economou, S. Rivoire, and C. Kozyrakis. "Full-System Power Analysis and Modeling for Server Environments". In: Workshop on Modeling, Benchmarking and Simulation. 2006.
- [IM03] C. Isci and M. Martonosi. "Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data". In: Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture. 2003.
- [RRK08] S. Rivoire, P. Ranganathan, and C. Kozyrakis. "A Comparison of High-level Full-system Power Models". In: Proceedings of the Conference on Power Aware Computing and Systems. 2008.
- [The08] The Climate Group. SMART 2020: Enabling the low carbon economy in the information age. 2008. URL: http://gesi.org/article/43 (visited on 09/23/2016).
- [Yan+14] H. Yang et al. "iMeter: An integrated VM power model based on performance profiling". In: Future Generation Computer Systems (2014).
- [Zha+14] Y. Zhai et al. "HaPPy: Hyperthread-aware Power Profiling Dynamically". In: Proceedings of the USENIX Annual Technical Conference. 2014.