# Cache Revive: Tuning Retention Times of STT-RAM Caches for Enhanced Performance in CMPs

Anonymous MICRO submission

Paper ID 761

# **Abstract**

Spin-Transfer Torque RAM (STT-RAM) is a CMOS-compatible emerging non-volatile memory (NVM) technology that has the potential to replace conventional on-chip SRAM caches in a more efficient memory hierarchy for future multicore architectures. However, its high write latency and dynamic write energy are major obstacles to competing with an SRAM-based cache hierarchy. On the other hand, STT-RAM has an adaptable feature: its write latency can be reduced by reducing its retention time, thereby making it volatile. In this paper, we exploit this volatile property of STT-RAM to design an efficient L2 cache architecture. The paper addresses several critical design issues: how to decide a suitable retention time for the last-level cache, what the relationship between retention time and write latency is, and how to architect the cache hierarchy with a volatile STT-RAM. Through an extensive execution-driven analysis of the inter-write time of several PARSEC and SPEC 2006 benchmarks, we observe that a retention time on the order of 10-40 ms is a good design point to handle most of the writes. For the remaining cache blocks, whose inter-write time exceeds the STT-RAM retention time, we propose an architectural solution that identifies these blocks with a per-block 2-bit counter, temporarily saves a limited number of MRU blocks in a buffer, and writes back the rest of the dirty blocks to avoid any data loss. Our experiments with 4- and 8-core architectures with an SRAM-based L1 cache and an STT-RAM-based L2 cache indicate that we can not only eliminate the high write overhead of a non-volatile STT-RAM, but also provide on average a 10-12% improvement in IPC compared to the traditional SRAM-based design, while reducing energy consumption significantly.

# 1. Introduction

Spin-Transfer Torque RAM (STT-RAM) is a promising memory technology that delivers on many of the properties desired of a universal memory. It exhibits high density, fast read times and low static power consumption. However, high write latency and write energy are key drawbacks of this technology. Consequently, recent efforts have focused on masking the effects of high write latency and write energy at the architectural level. In contrast to these architectural approaches, a recent technique relaxes STT-RAM data retention times to reduce both write latency and write energy. The focus of this paper is to tune the data retention time to closely match the required lifetime of cache blocks and thereby achieve significant performance and energy gains.

The non-volatile nature and non-destructive read ability of STT-RAM set it apart from DRAM, a memory of comparably high density. However, for many applications it is sufficient if the data stored in a cache hierarchy remains valid for a few tens of milliseconds. Consequently, the duration of data retention in STT-RAM is an obvious candidate for device optimization in cache design. First, we analyze how changes to STT-RAM retention times influence performance, power and area characteristics. A key distinction from prior efforts to relax data retention times is our consideration of device variability. Because the retention time of an MTJ depends exponentially on the thermal barrier, the retention time of an individual STT-RAM device is extremely sensitive to any factor that affects the thermal barrier, particularly device geometry. Thus it is important to use practical values of device geometry, such as the MTJ planar area, and to take process variation into consideration. We obtain these parameters and their variabilities from fabricated STT-RAM devices published in recent years [?]. This analysis reveals the granularities at which a device designer can reliably tune the data retention times.

Analyzing the lifetime of cache lines has been the focus of prior efforts to improve performance and reduce power consumption. In this work, we revisit this topic with the aim of identifying suitable data retention times for STT-RAM caches. A key challenge in determining a suitable data retention time for STT-RAM is balancing the reduced write latency of cells with lower retention time against the overhead of data refresh or write-back for cache lines with longer lifetimes. Our analysis of cache lifetimes for a multi-threaded workload demonstrates that a significant fraction of L2 cache lines can operate correctly without any additional support when the STT-RAM retention time is on the order of 50ms. However, architectural support is required to ensure that correct program state is maintained for the remaining cache lines, whose lifetimes exceed 50ms. While a simple DRAM-style refresh has been proposed in [10] to ensure correctness, many of these refreshes can be avoided by pursuing a lifetime-aware refresh strategy.

This work makes the following contributions:

- We present a detailed device characterization of data retention tunability in STT-RAM cells, providing insight into the underlying principles that enable these tradeoffs.
- We analyze the time between writes or replacements to a cache line for various multi-threaded and multi-programmed workloads. Our characterization augments the prior body of work, which analyzes cache lifetimes mainly in single-processor, single-program configurations.
- We present a simple buffering mechanism to ensure the integrity of programs given the volatile nature of our tuned STT-RAM cells.
- Finally, we show that our combined device-architecture lifetime tuning approach is better than recent efforts that attempt to address the long write latencies of STT-RAM.

The rest of this paper is as follows.

Next, we evaluate the lifetimes of cache lines when executing multi-threaded workloads. A cache line is no longer required, if the

for the duration of its lifetime in the program execution.

The cache hierarchy is a key component influencing both the



Figure 1. Distribution of Blocks Showing Different Revival Times

# 2. Motivation

In order to utilize volatile STT-RAM as the last-level cache in an effective cache hierarchy, we need to know the ideal or feasible retention time. Ideally, the STT-RAM write latency should be competitive with SRAM latency and the cache retention time should be high. However, as discussed in the following section, write latency grows with retention time, so we need to find a feasible tradeoff based on the STT-RAM device characteristics. Thus, we first attempt to decide an ideal retention time by analyzing the characteristics of a last-level cache in a multi-programmed environment. The idea is to understand the distribution of the inter-write interval, and thus the average inter-write time, for a last-level cache and to use this time as the STT-RAM retention time. This section describes our application-driven study to estimate the retention time.

## 2.1. Relating Application Characteristics to Retention Time

Application characterization provides the basis for evaluating the impact of low-retention-time STT-RAM caches on overall system performance. The first step in this characterization is to find the ideal time for which a cache block should retain its data. A cache block is refreshed only when it is written. Since we use low-retention-time STT-RAM for the L2 cache, we record the interval between two successive writes (refreshes) to the same L2 cache block. We define this interval as the *revival time*. While collecting these results, if a block is invalidated between two consecutive writes, we do not count the time between the invalidation and the next write. Previous works [?] perform a similar revival time analysis, but for the L1 cache. Figure 1 shows the distribution of L2 cache blocks over different revival time intervals. These results are obtained by running multi-threaded (PARSEC [?]) and multi-programmed (SPEC 2006 [?]) applications on the M5 simulator [?], which models a 4GHz processor consisting of 4 cores with a 4MB STT-RAM L2 cache. Table ?? contains additional details of the system configuration. Figure 1 (a) and (b) show results for three PARSEC and SPEC benchmarks, along with the averages across 12 PARSEC and 14 SPEC benchmarks, respectively. We observe that, on average, approximately 50% of cache blocks are refreshed within 10 ms. About 20% of blocks remain in the cache for more than 40 ms, and the rest have intermediate revival times. Blocks that stay in the cache longer than the retention time without being refreshed will affect application performance the most. The opportunity for the computer architect is to design micro-architecture schemes that make the percentage of unrefreshed blocks as low as possible. This distribution also gives us the basis for choosing the optimal retention time. Reducing the retention time too much makes the cache too volatile, which degrades performance. In the next subsection, we will see how increasing the retention time negatively affects write latency.
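The revival-time measurement described above can be sketched in a few lines. This is a minimal illustration and not the instrumentation used in the simulator: the event-tuple trace format, the function names, and the bucket edges are all assumptions made for the example. It also adopts one reading of the invalidation rule, namely that an invalidation discards the pending interval so that the next write starts a fresh one.

```python
from collections import Counter

def revival_times(events):
    """Return the intervals (ms) between successive writes to the same L2 block.

    `events` is a time-ordered iterable of (time_ms, block_id, kind) tuples,
    where kind is "write" or "invalidate" (a hypothetical trace format).
    An invalidation discards the pending interval, so the time between the
    invalidation and the next write is never counted.
    """
    last_write = {}   # block_id -> time of the most recent write
    intervals = []
    for time_ms, block, kind in events:
        if kind == "write":
            if block in last_write:
                intervals.append(time_ms - last_write[block])
            last_write[block] = time_ms
        elif kind == "invalidate":
            last_write.pop(block, None)
    return intervals

def bucketize(intervals, edges=(10, 20, 30, 40)):
    """Count intervals per bucket; the last bucket is open-ended."""
    counts = Counter()
    for t in intervals:
        for edge in edges:
            if t <= edge:
                counts[f"<={edge}ms"] += 1
                break
        else:
            counts[f">{edges[-1]}ms"] += 1
    return dict(counts)
```

Normalizing the bucket counts by the total number of intervals would give a distribution of the kind plotted in Figure 1.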

## 2.2. Low Retention STT-RAM Characteristics

Table 1 shows that write latency drops significantly as retention time is reduced. Section 4 explains how these numbers originate. From a device fabrication perspective, we want to clarify that these retention times are the most stable design points possible. As the retention time of an STT-RAM cell is lowered into the millisecond range, it becomes much harder to fabricate the cell with a precisely fixed retention time. For the sake of correctness and precision, we discuss only these designs in the paper. Section 7 will make clear that our design assumptions do not affect the generality of the results.

Table 1. Retention Times and Write Latencies for STT-RAM L2 Cache

| Retention Time      | 10 years  | 1 sec     | 500 ms    | 100 ms    | 10 ms     |
|---------------------|-----------|-----------|-----------|-----------|-----------|
| Write Latency @4GHz | 44 cycles | 24 cycles | 20 cycles | 16 cycles | 12 cycles |

To analyze the tradeoffs between retention time and overall system performance, consider a utopian cache that has a 10-year retention time together with minimum write latency and energy. To bridge the gap between current caches and this utopian cache, we need to reap the benefits of both application characteristics and emerging device technology. From the application side, it is best to choose a retention time that minimizes the number of unrefreshed blocks; from the technology side, it is ideal to choose the STT-RAM with minimum write latency and energy. We choose 10 ms as the optimal retention time, balancing both sides. In Section 7, we perform a sensitivity analysis with retention times of 100 ms, 500 ms and 1 sec. In Section 5, we propose micro-architecture techniques to deal with blocks that have a revival time greater than 10 ms.
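To make the latency side of this tradeoff concrete, the cycle counts of Table 1 can be converted to nanoseconds and compared against the 10-year baseline. The sketch below is only arithmetic on the table values; the dictionary and function names are hypothetical.

```python
CLOCK_GHZ = 4.0  # clock frequency of the modeled processor

# Write latencies from Table 1, in cycles at 4 GHz.
WRITE_LATENCY_CYCLES = {
    "10 years": 44,
    "1 sec": 24,
    "500 ms": 20,
    "100 ms": 16,
    "10 ms": 12,
}

def cycles_to_ns(cycles, clock_ghz=CLOCK_GHZ):
    """Convert a cycle count to nanoseconds (1 GHz = 1 cycle per ns)."""
    return cycles / clock_ghz

def latency_reduction(retention, baseline="10 years"):
    """Fractional write-latency reduction relative to the baseline retention time."""
    base = WRITE_LATENCY_CYCLES[baseline]
    return 1.0 - WRITE_LATENCY_CYCLES[retention] / base
```

For example, `cycles_to_ns(44)` is 11 ns for the non-volatile cell and `cycles_to_ns(12)` is 3 ns at a 10 ms retention time.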

# 3. STT-RAM Design

## 3.1. Preliminary on STT-RAM



Figure 2. (a) Structural view of an STT-RAM cache cell (b) Anti-parallel (high resistance, indicating "1" state) (c) Parallel (low resistance, indicating "0" state)

STT-RAM uses a Magnetic Tunnel Junction (MTJ) as the memory storage element and leverages the difference in magnetic directions to represent the memory bit. As shown in Fig. 2, an MTJ contains two ferromagnetic layers. One ferromagnetic layer has a fixed magnetization direction and is called the reference layer, while the other has a free magnetization direction that can be changed by passing a write current and is called the free layer. The relative magnetization direction of the two ferromagnetic layers determines the resistance of the MTJ. If the two layers have the same direction (parallel), the resistance of the MTJ is low, indicating a "0" state; if the two layers have different directions (anti-parallel), the resistance of the MTJ is high, indicating a "1" state.

As shown in Fig. 2, when writing the "0" state into an STT-RAM cell, a positive voltage difference is established between SL and BL; when writing the "1" state, the polarity is reversed. The current amplitude required to reverse the direction of the free ferromagnetic layer is determined by the size and aspect ratio of the MTJ and by the write pulse duration.

## 3.2. Write current versus write pulse width trade-off



Figure 3. (a) Structural view of an STT-RAM cache cell (b) Anti-parallel (high resistance, indicating "1" state) (c) Parallel (low resistance, indicating "0" state)

The current amplitude required to reverse the direction of the free ferromagnetic layer is determined by many factors, such as material properties, device geometry and, importantly, the write pulse duration. Generally, the longer the write pulse, the less switching current is needed to switch the MTJ state. Three distinct switching modes were identified [3] according to the operating range of the switching pulse width $\tau$: thermal activation ($\tau > 20ns$), precessional switching ($\tau < 3ns$) and dynamic reversal ($3ns < \tau < 20ns$).

The relationship between switching current density  $J_c$  and write pulse width  $\tau$  was characterized by an analytical model in [9]. The equations are listed as follows,

$$J_{c,TA}(\tau) = J_{c0} \left\{ 1 - \left(\frac{k_B T}{E_b}\right) \ln\left(\frac{\tau}{\tau_0}\right) \right\} \tag{1}$$

$$J_{c,PS}(\tau) = J_{c0} + \frac{C}{\tau^{\gamma}} \tag{2}$$

$$J_{c,DR}(\tau) = \frac{J_{c,TA}(\tau) + J_{c,PS}(\tau)e^{-k(\tau - \tau_c)}}{1 + e^{-k(\tau - \tau_c)}} \tag{3}$$

where $J_{c,TA}$, $J_{c,PS}$ and $J_{c,DR}$ are the switching current densities for thermal activation, precessional switching and dynamic reversal, respectively. $J_{c0}$ is the critical switching current density, $k_B$ is the Boltzmann constant, $T$ is the temperature, $E_b$ is the thermal barrier, and $\tau_0$ is the inverse of the attempt frequency. $C$, $\gamma$, $k$ and $\tau_c$ are fitting constants. Based on the observations from Fig. 3 and on the analytical model, we find very different switching characteristics in the three switching modes. For example, in thermal activation mode, the required switching current increases very slowly even if we decrease the write pulse width by orders of magnitude. A short write pulse is therefore more favorable in this regime, because reducing the write pulse width reduces both write latency and energy without much penalty on read latency and energy. In precessional switching, by contrast, the write current rises rapidly as the write pulse width is reduced further, so the minimum write energy of the MTJ is achieved at some particular write pulse width in this regime. Consequently, this paper focuses on exploring the write pulse width in precessional switching and dynamic reversal to optimize for different design goals.
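Equations (1)-(3) translate directly into code, which makes it easy to explore how the required current density changes with pulse width. The sketch below is a plain transcription under stated assumptions: `delta` stands for $E_b / (k_B T)$, pulse widths are in seconds, and the default $\tau_0$ of 1 ns is a placeholder chosen for illustration, not a fitted value from [9].

```python
import math

def j_thermal_activation(tau, jc0, delta, tau0=1e-9):
    """Eq. (1), with delta = E_b / (k_B T)."""
    return jc0 * (1.0 - math.log(tau / tau0) / delta)

def j_precessional(tau, jc0, c, gamma):
    """Eq. (2): current density rises as the pulse gets shorter."""
    return jc0 + c / tau ** gamma

def j_dynamic_reversal(tau, jc0, delta, c, gamma, k, tau_c, tau0=1e-9):
    """Eq. (3): logistic blend of Eq. (1) and Eq. (2) centered on tau_c."""
    w = math.exp(-k * (tau - tau_c))
    j_ta = j_thermal_activation(tau, jc0, delta, tau0)
    j_ps = j_precessional(tau, jc0, c, gamma)
    return (j_ta + j_ps * w) / (1.0 + w)
```

At $\tau = \tau_c$ the blend weight is 1, so Eq. (3) reduces to the average of the two regime models; far above $\tau_c$ it approaches Eq. (1) and far below it approaches Eq. (2).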

## 3.3. STT-RAM Modeling

To simulate the performance of an STT-RAM cache, it is important to first estimate its cell area. As mentioned before, each 1T1J STT-RAM cell is composed of one NMOS transistor and one MTJ, with the NMOS access device connected in series with the MTJ. The size of the NMOS is constrained by both the SET and RESET currents, which are inversely proportional to the write pulse width. To estimate the current driving ability of the MOSFET devices, we simulate a small test circuit in HSPICE with the PTM 45nm HP model [12]. The BL-to-SL and SL-to-BL currents are obtained by assuming typical TMR (120%) and LRS ($3k\Omega$) values [8] and boosting the wordline voltage to 1.5V (the optimal value is extracted from [1]). We oversize the access transistor width to guarantee that enough write current is provided to the MTJ, using the methodology discussed in [11]. To achieve high cell density, we model the STT-RAM cell area by referring to DRAM design rules [6]. As a result, the area of an STT-RAM cell is calculated as follows,

$$\text{Area}_{cell} = 3\,(W/L + 1)\,F^2 \tag{4}$$
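Equation (4) can be evaluated with a small helper. The sketch assumes that $W/L$ is the width-to-length ratio of the access transistor and $F$ is the feature size; the 45 nm default mirrors the PTM model mentioned above but is an assumption of this example, as are the function names.

```python
def cell_area_f2(w_over_l):
    """Eq. (4): cell area in units of F^2 for an access transistor of ratio W/L."""
    return 3.0 * (w_over_l + 1.0)

def cell_area_um2(w_over_l, feature_nm=45.0):
    """Cell area in square micrometers for a given feature size in nanometers."""
    f_um = feature_nm * 1e-3
    return cell_area_f2(w_over_l) * f_um ** 2
```

As a sanity check, a minimum-width transistor with $W/L = 1$ gives $6F^2$, matching the DRAM cell size of [6]; wider access transistors grow the cell linearly.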

## 3.4. Impact of MTJ Retention Time on STT-RAM



Figure 4. MTJ thermal stability requirement for different retention times (x-axis: retention time in seconds)

The retention time of an MTJ is largely determined by its thermal stability. The relation between retention time and thermal barrier is captured in Figure 4 and can be modeled as $t = C \times e^{k\Delta}$, where $t$ is the retention time, $\Delta$ is the thermal barrier, and $C$ and $k$ are fitting constants. The thermal stability of the free layer in an MTJ affects not only the retention time of the STT-RAM memory cell but also the write current. It was found in [7] that the switching current of an MTJ increases almost linearly with the thermal barrier when the thermal barrier is $<70k_BT$, where $k_B$ is the Boltzmann constant and $T$ is the temperature. Combining this observation with the write current versus write time trade-off described in Section 3.2 means that, once the thermal barrier of an MTJ is lowered, we can achieve faster writes and/or smaller write current and energy. The most straightforward way to reduce the thermal barrier is to tune the device geometry, such as the planar area, the thickness of the free layer and the aspect ratio of the elliptic MTJ. In [10], the authors use $32F^2$ as the baseline MTJ planar area for non-volatile STT-RAM and reduce it to $10F^2$ to obtain retention times on the order of microseconds. However, most state-of-the-art MTJs [?] have a much smaller baseline planar area ($2F^2$), which leaves little room to reduce retention time by aggressively shrinking the MTJ planar area. In this paper, we focus on MTJs with a worst-case retention time larger than a millisecond and optimize the STT-RAM cache correspondingly.

Table 2. 16-way L2 Cache Simulation Results

| Cache       | Retention Time | Optimization | Area ($mm^2$) | Read Latency (ns) | Write Latency (ns) | Leakage Power (mW) |
|-------------|----------------|--------------|---------------|-------------------|--------------------|--------------------|
| 1MB SRAM    |                |              | 2.612         | 1.012             | 1.012              | 4542               |
| 4MB STT-RAM | t = 10yr       | Leakage Opt. | 2.628         | 2.434             | 4.919              | 1399               |
|             |                | Latency Opt. | 3.003         | 0.998             | 10.61              | 2524               |
|             | t = 1s         | Leakage Opt. | 2.203         | 2.044             | 3.552              | 1388               |
|             |                | Latency Opt. | 2.904         | 0.973             | 5.571              | 2235               |
|             | t = 100ms      | Leakage Opt. | 2.181         | 1.994             | 3.432              | 1250               |
|             |                | Latency Opt. | 2.902         | 0.963             | 3.002              | 2230               |
|             | t = 10ms       | Leakage Opt. | 2.167         | 1.956             | 3.390              | 1151               |
|             |                | Latency Opt. | 2.901         | 0.959             | 2.598              | 2227               |
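The exponential model $t = C \times e^{k\Delta}$ can be inverted to estimate the thermal barrier needed for a target retention time, $\Delta = \ln(t/C)/k$. The sketch below assumes the textbook Arrhenius form with $C = 1$ ns and $k = 1$ ($\Delta$ expressed in units of $k_BT$); these constants are placeholders for illustration and are not the fitted values behind Figure 4.

```python
import math

def retention_time_s(delta, c=1e-9, k=1.0):
    """Retention time in seconds for a thermal barrier delta: t = C * exp(k * delta)."""
    return c * math.exp(k * delta)

def required_delta(t_seconds, c=1e-9, k=1.0):
    """Thermal barrier needed for a target retention time: delta = ln(t / C) / k."""
    return math.log(t_seconds / c) / k

TEN_YEARS_S = 10 * 365 * 24 * 3600
```

Under these assumed constants, a 10-year retention time needs a barrier of roughly 40 and a 10 ms retention time roughly 16, which illustrates how strongly the retention time responds to small changes in the barrier.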

## 3.5. STT-RAM Cache Simulation Setup

We simulate SRAM-based and STT-RAM-based caches with a tool called NVSim [4], a circuit-level performance, energy and area simulator based on CACTI for emerging non-volatile memories. All the models described in this section have been integrated into NVSim. The simulation results are listed in Table 2. We can see that the leakage-optimized 4MB non-volatile STT-RAM cache has almost the same area as the 1MB SRAM cache. This is consistent with previous work [5]. By relaxing the retention time of STT-RAM with a lower thermal barrier, the leakage-optimized STT-RAM cache achieves smaller area, faster writes and less leaky peripheral circuitry. However, because retention time is exponentially related to the thermal barrier, and the thermal barrier is extremely sensitive to process variation and temperature, the write latency benefit of relaxing the retention time within the same order of magnitude (e.g., from 50ms to 10ms) is so small that it may be offset by slight variations in device geometry or ambient temperature. Moreover, the intrinsic fluctuations in CACTI make such a small benefit very difficult to observe. Another point worth mentioning is that the read latency of the leakage-optimized 4MB STT-RAM cache is significantly larger than that of the 1MB SRAM cache, because sensing the state of an STT-RAM cell takes longer than fast SRAM sensing. Thus, we reduce the array size to improve the latency of the STT-RAM cache. As can be seen in Table 2, the latency-optimized STT-RAM cache has noticeably better read and write latency, with 14% - 34% area overhead compared to the leakage-optimized STT-RAM cache with the same retention time.

# 4. Architecting Volatile STT-RAM

We observe from Figure 1 that approximately 50% of L2 cache blocks are refreshed within a 10 ms interval. The substantial number of blocks that are not refreshed within this interval underscores the need for strong architectural support. In this section, we first describe our counter design for keeping track of the retention time of blocks. We then propose our micro-architecture solution for handling blocks that stay unrefreshed for longer than the retention time.

**Counter Design:** We maintain a 2-bit counter per physical block, similar to the one used in [?]. Figure ?? (b) shows the transition diagram for the counter. Each cache block can be in one of four states. We divide the retention time of the STT-RAM cell into four equal parts and define each part as a *transition time*. After every *transition time*, a block moves from one state to the next according to the transition diagram. A block returns to the *S0* state when its data is refreshed. A block in the *S3* state has reached its retention time. We calculate the overhead of the 2-bit counters to be 1.6% of the L2 cache size.
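The counter behavior can be sketched as a small state machine. Since the transition diagram itself is not reproduced here, the sketch assumes the simplest reading: the counter advances by one state per transition time, saturates at S3, and is reset to S0 by a write. The class and method names are hypothetical.

```python
class RetentionCounter:
    """2-bit per-block counter: S0 (just written) through S3 (retention time reached)."""
    S0, S1, S2, S3 = range(4)

    def __init__(self):
        self.state = self.S0

    def on_write(self):
        """A write refreshes the cell, so the block returns to S0."""
        self.state = self.S0

    def on_tick(self):
        """Advance one state per transition time; return True once the block is in S3."""
        if self.state < self.S3:
            self.state += 1
        return self.state == self.S3

def transition_time_ms(retention_ms, parts=4):
    """Length of one transition time when the retention time is split into equal parts."""
    return retention_ms / parts
```

With a 10 ms retention time, `transition_time_ms(10)` gives a 2.5 ms tick, so an unwritten block reaches S3 on the third tick after its last write.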

## 4.1. Volatile STT-RAM Scheme

In this naive design, if a block is dirty and in the S3 state, it is written back to main memory. This design has no micro-architecture overhead but hurts performance for two reasons: 1) there will be a large number of write-backs to main memory after every interval, and 2) the S3-state block could have been in the MRU position because of frequent reads, and losing it will incur additional read misses. We evaluate this scheme in Section 7.

Figure ?? shows the schematic of the overall architecture design. **Buffer Overflow detector:** We keep a small per-bank buffer with a fixed number of entries, made up of low-retention-time STT-RAM cells, to temporarily store dead blocks. The optimal size of this buffer is calculated later in the section.

**Buffer Controller** In the buffer controller, we maintain a 12-bit overflow detector per buffer and a 12-bit block identifier associated with every buffer entry. When an S3-state block is directed to the buffer, the overflow detector is checked to determine the occupancy of the buffer. If the buffer can hold more blocks, a block ID is generated by concatenating the block's set and way IDs and is stored along with the actual block in one of the empty buffer slots. If the buffer is fully occupied, the block is written back if it is dirty; otherwise it is invalidated.
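The buffer controller logic described above can be sketched as follows. This is an illustrative model only: the class name, the dictionary-based storage, and the 4-bit way field (enough for a 16-way cache) are assumptions, and the sketch does not attempt to reproduce the 12-bit field widths.

```python
class RevivalBuffer:
    """Per-bank buffer that temporarily holds blocks that reached state S3."""

    def __init__(self, num_entries, way_bits=4):
        self.num_entries = num_entries
        self.way_bits = way_bits   # assumed: 4 bits cover 16 ways
        self.entries = {}          # block_id -> block data

    def block_id(self, set_index, way):
        """Concatenate the set and way IDs into a single identifier."""
        return (set_index << self.way_bits) | way

    def is_full(self):
        """Overflow check: True when no empty slot remains."""
        return len(self.entries) >= self.num_entries

    def admit(self, set_index, way, data, dirty):
        """Return the action taken for an S3 block directed to the buffer."""
        if not self.is_full():
            self.entries[self.block_id(set_index, way)] = data
            return "buffered"
        return "writeback" if dirty else "invalidate"

    def release(self, set_index, way):
        """Remove and return a buffered block so it can be copied back to its slot."""
        return self.entries.pop(self.block_id(set_index, way), None)
```

Keying entries by the concatenated set and way IDs lets the controller copy a block back to exactly the slot it came from.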

## 4.2. Schemes

**Revived STT-RAM Scheme:** We propose the Revived STT-RAM scheme. For a block in the S3 state, we find its position relative to the MRU position. If the block is within a chosen number of slots from the MRU, we use a small buffer bank to copy the dying block into the buffer and then copy it back to the same slot. If the block is not in the chosen MRU slots, we write it back to main memory if it is dirty. This method has multiple advantages: 1) it helps preserve frequently used blocks, which are most likely frequently read, and 2) it prevents unnecessary write-backs from the L2 cache to main memory.

Our scheme has the following parameters to be tuned: 1) the number of MRU slots, 2) the number of buffer slots, and 3) the possible retention times. 6 suggests that if we consider the first eight MRU slots for buffering, we cover x percentage of blocks.
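The decision made for each block that reaches S3 can be summarized in a short function. This is a sketch under assumptions: `mru_position` counts from 0 at the MRU position, the default of eight MRU slots follows the example mentioned above, and clean blocks that are not revived are assumed to be invalidated, as in the buffer-full case.

```python
def handle_expiring_block(mru_position, dirty, buffer_has_space, mru_slots=8):
    """Pick an action for a block in state S3 (mru_position 0 is the MRU block)."""
    if mru_position < mru_slots and buffer_has_space:
        return "revive"       # copy to the buffer, then write back into the same slot
    if dirty:
        return "writeback"    # avoid data loss by updating main memory
    return "invalidate"       # clean data can simply be dropped
```

The two tunable knobs, the number of MRU slots and the buffer capacity, show up directly as `mru_slots` and `buffer_has_space`.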

# 5. Experimental Evaluation

**Experimental Setup** We evaluate our design and schemes on a modified ALPHA M5 simulator [?]. We operate the M5 simulator in Full System (FS) mode for PARSEC applications and in Syscall Emulation (SE) mode for SPEC 2006 multiprogrammed mixes. We model a 4GHz processor with four out-of-order cores. We modified the M5 simulator to model low-retention-time STT-RAM for the L2 cache. The L2 cache is banked with different read and write latencies, with all the banks connected via a shared memory bus. We assume a fixed main memory latency of 400 cycles for all our simulations. Table ?? details our experimental system configuration.

**Collection of Results** We report results for 12 multithreaded PARSEC applications and 14 SPEC 2006 multiprogrammed mixes. Table ?? shows the characterization of the PARSEC applications and the list of multiprogrammed mixes. We use the simsmall input for PARSEC benchmarks and report results only for the Region of Interest (ROI), after skipping the initialization and termination phases (except for facesim, where we report results for only 2B instructions of the ROI). We also warm up the caches for 100M instructions in the ROI. For SPEC multiprogrammed mixes, we fast-forward 1B instructions, warm up the caches for 500M instructions, and then report results for 1B instructions.

**Performance Metrics** For multithreaded PARSEC applications, we assume 4 threads are mapped to our modeled processor with four cores. We report normalized speedup for these applications, which is defined as the decrease in execution time of the slowest thread. We randomly choose 14 multiprogrammed mixes, each consisting of four different applications, and assign each application to a core. We report instruction throughput and weighted speedup for SPEC multiprogrammed mixes, which are defined as:
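A minimal sketch of these two metrics follows, assuming the forms commonly used in the multiprogramming literature: instruction throughput as the sum of per-core IPCs, and weighted speedup as the sum of each application's IPC in the mix normalized to its IPC when running alone. These forms and the function names are assumptions of this example rather than definitions confirmed by the text.

```python
def instruction_throughput(ipc_shared):
    """Sum of the IPCs of all applications running together in the mix."""
    return sum(ipc_shared)

def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over applications of IPC in the mix divided by IPC when run alone."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one alone-IPC per application in the mix")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))
```

For a four-application mix, a weighted speedup of 4 would mean that no application is slowed down by sharing the cache.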

# 6. Results

# 7. Prior Work

# 8. Conclusions


# References

- [1] S. Chatterjee, M. Rasquinha, S. Yalamanchili, and S. Mukhopadhyay. A scalable design methodology for energy minimization of STTRAM: a circuit and architecture perspective. *IEEE Transactions on Very Large Scale Integration*, 2010.
- [2] Y. Chen, X. Wang, H. Li, L. H., and D. V. Dimitrov. Design Margin Exploration of Spin-Torque Transfer RAM (SPRAM). In *Proceedings of the International Symposium on Quality Electronic Design*, pages 684–690, 2008.
- [3] Z. Diao, Z. Li, S. Wang, Y. Ding, A. Panchula, E. Chen, L.-C. Wang, and Y. Huai. Spin-transfer torque switching in magnetic tunnel junctions and spin-transfer torque random access memory. *Journal of Physics: Condensed Matter*, 19(16):165209, 2007.
- [4] X. Dong, N. P. Jouppi, and Y. Xie. PCRAMsim: System-level performance, energy, and area modeling for phase-change RAM. In *Proceedings of the International Conference on Computer-Aided Design*, pages 269–275, 2009.
- [5] X. Dong, X. Wu, G. Sun, Y. Xie, H. Li, et al. Circuit and Microarchitecture Evaluation of 3D Stacking Magnetic RAM (MRAM) as a Universal Memory Replacement. In *Proceedings of the Design Automation Conference*, pages 554–559, 2008.
- [6] F. Fishburn, B. Busch, J. Dale, D. Hwang, et al. A 78nm 6F<sup>2</sup> DRAM technology for multigigabit densities. In *Proceedings of the Symposium on VLSI Technology*, pages 28–29, 2004.
- [7] T. Kishi, H. Yoda, T. Kai, T. Nagase, E. Kitagawa, M. Yoshikawa, K. Nishiyama, T. Daibou, M. Nagamine, M. Amano, S. Takahashi, M. Nakayama, N. Shimomura, H. Aikawa, S. Ikegawa, S. Yuasa, K. Yakushiji, H. Kubota, A. Fukushima, M. Oogane, T. Miyazaki, and K. Ando. Lower-current and fast switching of a perpendicular TMR for high speed and high density spin-transfer-torque MRAM. In *Proceedings of International Electron Devices Meeting*, pages 1–4, 2008.
- [8] C. Lin, S. Kang, Y. Wang, K. Lee, X. Zhu, W. Chen, X. Li, W. Hsu, Y. Kao, M. Liu, Y. Lin, M. Nowak, N. Yu, and L. Tran. 45nm low power CMOS logic compatible embedded STT MRAM utilizing a reverse-connection 1T/1MTJ cell. In *Proceedings of International Electron Devices Meeting*, pages 57–59, 2009.
- [9] A. Raychowdhury, D. Somasekhar, T. Karnik, and V. De. Design space and scalability exploration of 1T-1STT MTJ memory arrays in the presence of variability and disturbances. In *Proceedings of International Electron Devices Meeting*, pages 707–710, 2009.
- [10] C. W. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and M. R. Stan. Relaxing non-volatility for fast and energy-efficient STT-RAM caches. In *Proceedings of the International Symposium on High Performance Computer Architecture*, pages 50–61, 2011.
- [11] W. Xu, H. Sun, X. Wang, Y. Chen, and T. Zhang. Design of last-level on-chip cache using spin-torque transfer RAM (STT RAM). *IEEE Transactions on Very Large Scale Integration*, 19(3):483–493, 2011.
- [12] W. Zhao and Y. Cao. New generation of predictive technology model for sub-45 nm early design exploration. *IEEE Transactions on Electron Devices*, 53(11):2816–2823, Nov. 2006.