## 1.6 Performance

# **Defining Performance**

### Which airplane has the best performance?









Chapter 1 — Computer Abstractions and Technology — 2

# Response Time and Throughput

- Response time
  - How long it takes to do a task
- Throughput
  - Total work done per unit time
    - e.g., tasks/transactions/... per hour
- How are response time and throughput affected by
  - Replacing the processor with a faster version?
  - Adding more processors?
- We'll focus on response time for now...

#### **Execution Time**

- Elapsed Time (or wall-clock time, response time)
  - counts everything (disk and memory accesses, I/O, etc.)
  - a useful number, but often not good for comparison purposes
- CPU time
  - doesn't count I/O or time spent running other programs
  - can be broken up into system time, and user time

# Unix command: %time ls 90.7u 12.9s 2:39 65% user+system time elapsed time cPU time clapsed time

- Our focus: user CPU time
  - time spent executing the lines of code that are "in" our program

## Relative Performance

- Define Performance = 1/Execution Time
- "X is n time faster than Y"
  - Performance<sub>x</sub>/Performance<sub>y</sub>
  - = Execution time $_{Y}$ /Execution time $_{X} = n$
- Example: time taken to run a program
  - 10s on A, 15s on B
  - Execution Time<sub>B</sub> / Execution Time<sub>A</sub>= 15s / 10s = 1.5
  - So A is 1.5 times faster than B

# **CPU Clocking**

Operation of digital hardware governed by a constant-rate clock



- Clock period: duration of a clock cycle
  - e.g.,  $250ps = 0.25ns = 250 \times 10^{-12}s$
- Clock frequency (rate): cycles per second
  - e.g.,  $4.0GHz = 4000MHz = 4.0 \times 10^9Hz$

## **CPU Time**

CPU Time = CPU Clock Cycles × Clock Cycle Time

= CPU Clock Cycles

Clock Rate

- Performance improved by
  - Reducing number of clock cycles
  - Increasing clock rate
  - Hardware designer must often trade off clock rate against cycle count

## **Example**

- 2 GHz 클럭의 컴퓨터 A 에서 10초에 수행되는 프로그램이 있다.
- 이 프로그램을 6초에 수행하는 컴퓨터를 설계하고자 한다.
- 클럭 속도는 얼마든지 빠르게 만들 수 있는데 이렇게 하면 CPU 다른 부분의 설계에 영향을 미쳐 같은 프로그램에 대해 A보다 1.2 배 많은 클럭 사이클이 필요하게 된다고 한다.
- 컴퓨터 B의 클럭 속도는 얼마로 해야 하겠는가?

$$\rightarrow$$

$$CPU \ time_{B} = \frac{CPU \ clock \ cycles_{B}}{clock \ rate_{B}}$$

$$6 \ seconds = \frac{1.2 \cdot CPU \ clock \ cycles_{A}}{clock \ rate_{B}}$$

## **Example: solution**

$$CPU \ time_{B} = \frac{CPU \ clock \ cycles_{B}}{clock \ rate_{B}}$$

$$6 \ seconds = \frac{1.2 \cdot \ CPU \ clock \ cycles_{A}}{clock \ rate_{B}}$$

$$CPU \ time_{A} = \frac{CPU \ clock \ cycles_{A}}{clock \ rate_{A}}$$

$$10 \ seconds = \frac{CPU \ clock \ cycles_{A}}{2GHz}$$

$$CPU \ clock \ cycles_{A} = 10 \cdot 2 \cdot 10^{9}$$

$$clock \ rate_B = \frac{1.2 \cdot 10 \cdot 2 \cdot 10^9 \text{ cycles}}{6 \text{seconds}} = \frac{4 \cdot 10^9 \text{ cycles}}{\text{seconds}} = 4 \text{GHz}$$

## Instruction Count and CPI

Clock Cycles = Instruction Count  $\times$  Cycles per Instruction

CPU Time = Instruction Count  $\times$  CPI  $\times$  Clock Cycle Time  $= \frac{Instruction Count \times CPI}{Clock Rate}$ 

- Instruction Count for a program
  - Determined by program, ISA and compiler
- Average cycles per instruction
  - Determined by CPU hardware
  - If different instructions have different CPI
    - Average CPI affected by instruction mix

## **CPI Example**

 Suppose we have two implementations of the same instruction set architecture (ISA).

For some program,

Machine A has a clock cycle time of 250 ps. and a CPI of 2.0 Machine B has a clock cycle time of 500 ps. and a CPI of 1.2

What machine is faster for this program, and by how much?

• If two machines have the same ISA which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical?

$$Execution \ time = Instruction \ Count \cdot \ CPI \cdot \ cycle \ time = \frac{Instruction \ Count \cdot \ CPI}{clock \ rate}$$

# **CPI Example**

- Computer A: Cycle Time = 250ps, CPI = 2.0
- Computer B: Cycle Time = 500ps, CPI = 1.2
- Same ISA
- Which is faster, and by how much?

$$\begin{aligned} \text{CPU Time}_{A} &= \text{Instruction Count} \times \text{CPI}_{A} \times \text{Cycle Time}_{A} \\ &= \text{I} \times 2.0 \times 250 \text{ps} = \text{I} \times 500 \text{ps} & \text{A is faster...} \end{aligned}$$
 
$$\begin{aligned} \text{CPU Time}_{B} &= \text{Instruction Count} \times \text{CPI}_{B} \times \text{Cycle Time}_{B} \\ &= \text{I} \times 1.2 \times 500 \text{ps} = \text{I} \times 600 \text{ps} \end{aligned}$$
 
$$\begin{aligned} &= \text{CPU Time}_{B} \\ &= \text{CPU Time}_{A} \end{aligned} = \frac{\text{I} \times 600 \text{ps}}{\text{I} \times 500 \text{ps}} = 1.2 & \text{...by this much} \end{aligned}$$

## **CPI in More Detail**

 If different instruction classes take different numbers of cycles

Clock Cycles = 
$$\sum_{i=1}^{n} (CPI_i \times Instruction Count_i)$$

Weighted average CPI

$$CPI = \frac{Clock \ Cycles}{Instruction \ Count} = \sum_{i=1}^{n} \left( CPI_i \times \frac{Instruction \ Count_i}{Instruction \ Count} \right)$$

Relative frequency

## # of Instructions Example

 A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively).

The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C. The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.

Which sequence will be faster? How much? What is the CPI for each sequence?

$$Execution \ time = (Instruction \ Count \cdot \ CPI) \cdot \ cycle \ time = \frac{Instruction \ Count \cdot \ CPI}{clock \ rate}$$

# **CPI Example**

 Alternative compiled code sequences using instructions in classes A, B, C

| Class            | А | В | С |
|------------------|---|---|---|
| CPI for class    | 1 | 2 | 3 |
| IC in sequence 1 | 2 | 1 | 2 |
| IC in sequence 2 | 4 | 1 | 1 |

- Sequence 1: IC = 5
  - Clock Cycles= 2×1 + 1×2 + 2×3= 10
  - Avg. CPI = 10/5 = 2.0

- Sequence 2: IC = 6
  - Clock Cycles= 4×1 + 1×2 + 1×3= 9
  - Avg. CPI = 9/6 = 1.5

# **Performance Summary**

**The BIG Picture** 

$$CPU Time = \frac{Instructions}{Program} \times \frac{Clock \ cycles}{Instruction} \times \frac{Seconds}{Clock \ cycle}$$

- Performance depends on
  - Algorithm: affects IC, possibly CPI
  - Programming language: affects IC, CPI
  - Compiler: affects IC, CPI
  - Instruction set architecture: affects IC, CPI, T<sub>c</sub>

#### What is "Computer Architecture"?



Coordination of many

levels (layers) of abstraction

#### **Instruction Set Architecture**

- A very important abstraction
  - interface between hardware and low-level software
  - standardizes set of instructions, their operations and machine language representations.
  - advantage: different implementations of the same architecture
  - disadvantage: sometimes prevents using new innovations (Compatibility issue)
- RISC (Reduced Instruction Set Computer) vs. CISC (Complex Instruction Set Computer)

## Reduced Instruction Set Computer (RISC) architecture

$$CPU Time = \frac{Instructions}{Program} \times \frac{Clock \ cycles}{Instruction} \times \frac{Seconds}{Clock \ cycle}$$

- vs. CISC (Complex ISC) architecture
- simpler, regular instruction set
- load/store architecture
- suited for pipelining
- importance of good compiler increased
- MIPS, RISC I/II→ SPARC, RISC V, Alpha, PowerPC, AVR, ARM ...

# **Effect of Compiler Optimization**









## **Power Trends**



Changes in CMOS IC technology

# Reducing Power

- Suppose a new CPU has
  - 85% of capacitive load of old CPU
  - 15% voltage and 15% frequency reduction

$$\frac{P_{\text{new}}}{P_{\text{old}}} = \frac{C_{\text{old}} \times 0.85 \times (V_{\text{old}} \times 0.85)^2 \times F_{\text{old}} \times 0.85}{C_{\text{old}} \times V_{\text{old}}^2 \times F_{\text{old}}} = 0.85^4 = 0.52$$

- The power wall
  - We can't reduce voltage further
  - We can't remove more heat
- How else can we improve performance?

# **Uniprocessor Performance**



#### Moore's law

- An observation made by Gordon Moore, co-founder of Intel
- "The number of transistors in an IC(Integrated Circuit) doubles every year." in1965 paper
- "The number of transistors in an IC doubles every 18~24 months." (updated in 1975)

#### Microprocessor Transistor Counts 1971-2011 & Moore's Law



By Wgsimon - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.

## 120 Years of Moore's Law



By Steve Jurvetson - https://www.flickr.com/photos/jurvetson/31409423572/,

# Multiprocessors

- Multicore microprocessors
  - More than one processor per chip
- Requires explicitly parallel programming
  - Compare with instruction level parallelism
    - Hardware executes multiple instructions at once
    - Hidden from the programmer
  - Hard to do
    - Programming for performance
    - Load balancing
    - Optimizing communication and synchronization

#### **Amdahl's Law**

$$\mathsf{T}_{\mathsf{improved}} = \frac{\mathsf{T}_{\mathsf{affected}}}{\mathsf{improvement}\,\mathsf{factor}} + \mathsf{T}_{\mathsf{unaffected}}$$

Example:

"Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"

How about making it 5 times faster?

• Principle: Make the common case fast

$$\frac{100seconds}{4} = \frac{80seconds}{n} + 20seconds$$

## Pitfall: MIPS as a Performance Metric

- MIPS: Millions of Instructions Per Second
  - Doesn't account for
    - Differences in ISAs between computers
    - Differences in complexity between instructions

$$\begin{aligned} \text{MIPS} &= \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} \\ &= \frac{\text{Instruction count}}{\frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}}} \times 10^6 \\ &= \frac{\text{Clock rate}}{\text{Clock rate}} \end{aligned}$$

CPI varies between programs on a given CPU

## **SPEC CPU Benchmark**

- Programs used to measure performance
  - Supposedly typical of actual workload
- Standard Performance Evaluation Corp (SPEC)
  - Develops benchmarks for CPU, I/O, Web, ...
- SPEC CPU2017
  - Elapsed time to execute a selection of programs
    - Negligible I/O, so focuses on CPU performance
  - Normalize relative to reference machine
  - Summarize as geometric mean of performance ratios
    - CINT2017 (integer) and CFP2017 (floating-point)



# SPECspeed 2017 Integer benchmarks on a 1.8 GHz Intel Xeon E5-2650L

| Description                                                          | Name      | Instruction<br>Count x 10^9 | CPI  | Clock cycle time<br>(seconds x 10^-9) | Execution<br>Time<br>(seconds) | Reference<br>Time<br>(seconds) | SPECratio |
|----------------------------------------------------------------------|-----------|-----------------------------|------|---------------------------------------|--------------------------------|--------------------------------|-----------|
| Perl interpreter                                                     | perlbench | 2684                        | 0.42 | 0.556                                 | 627                            | 1774                           | 2.83      |
| GNU C compiler                                                       | gcc       | 2322                        | 0.67 | 0.556                                 | 863                            | 3976                           | 4.61      |
| Route planning                                                       | mcf       | 1786                        | 1.22 | 0.556                                 | 1215                           | 4721                           | 3.89      |
| Discrete Event<br>simulation -<br>computer network                   | omnetpp   | 1107                        | 0.82 | 0.556                                 | 507                            | 1630                           | 3.21      |
| XML to HTML conversion via XSLT                                      | xalancbmk | 1314                        | 0.75 | 0.556                                 | 549                            | 1417                           | 2.58      |
| Video compression                                                    | x264      | 4488                        | 0.32 | 0.556                                 | 813                            | 1763                           | 2.17      |
| Artificial Intelligence:<br>alpha-beta tree<br>search (Chess)        | deepsjeng | 2216                        | 0.57 | 0.556                                 | 698                            | 1432                           | 2.05      |
| Artificial Intelligence:<br>Monte Carlo tree<br>search (Go)          | leela     | 2236                        | 0.79 | 0.556                                 | 987                            | 1703                           | 1.73      |
| Artificial Intelligence:<br>recursive solution<br>generator (Sudoku) | exchange2 | 6683                        | 0.46 | 0.556                                 | 1718                           | 2939                           | 1.71      |
| General data compression                                             | xz        | 8533                        | 1.32 | 0.556                                 | 6290                           | 6182                           | 0.98      |
| Geometric mean                                                       |           |                             |      |                                       |                                |                                | 2.36      |

# **SPEC Power Benchmark**

- Power consumption of server at different workload levels
  - Performance: ssj\_ops/sec
  - Power: Watts (Joules/sec)

Overall ssj\_ops per Watt 
$$\Box \left( \sum_{i=0}^{10} ssj_ops_i \right) / \left( \sum_{i=0}^{10} power_i \right)$$

# SPECpower\_ssj2008 for Xeon E5-2650L

| Target<br>Load % | Performance<br>(ssj_ops) | Average<br>Power<br>(watts) |
|------------------|--------------------------|-----------------------------|
| 100%             | 4,864,136                | 347                         |
| 90%              | 4,389,196                | 312                         |
| 80%              | 3,905,724                | 278                         |
| 70%              | 3,418,737                | 241                         |
| 60%              | 2,925,811                | 212                         |
| 50%              | 2,439,017                | 183                         |
| 40%              | 1,951,394                | 160                         |
| 30%              | 1,461,411                | 141                         |
| 20%              | 974,045                  | 128                         |
| 10%              | 485,973                  | 115                         |
| 0%               | 0                        | 48                          |
| Overall Sum      | 26,815,444               | 2,165                       |
| ∑ssj_ops / ∑pov  | 12,385                   |                             |

# **Concluding Remarks**

- Cost/performance is improving
  - Due to underlying technology development
- Hierarchical layers of abstraction
  - In both hardware and software
- Instruction set architecture
  - The hardware/software interface
- Execution time: the best performance measure
- Power is a limiting factor
  - Use parallelism to improve performance