# Lucid Infrared Thermography of Thermally-Constrained Processors

Hussam Amrouch and Jörg Henkel Karlsruhe Institute of Technology (KIT), Chair for Embedded Systems (CES), Karlsruhe, Germany {amrouch, henkel}@kit.edu

Abstract—Thermal analysis is a prerequisite for developing reliability-increasing techniques for thermally-constrained processors, i.e. processors with a high power density. For that purpose, infrared (IR) camera measurement setups have been deployed to provide direct feedback on the impact of thermal mitigation techniques. To obtain lucid IR images<sup>1</sup>, the IR-opaque cooling must be removed and hence an alternative IR-transparent cooling needs to be provided to protect the chip. To this end, the majority of state-of-the-art setups employ an IR coolant liquid to prevent the chip from overheating. The problem is that several aspects, such as thermal convection, may interfere with the measured IR radiation, resulting in equivocal IR images. They thus decrease the accuracy in a way that leads to incorrectly estimating reliability. To solve this prominent problem, we introduce an IR-transparent cooling that cools the chip from its rear side, allowing the camera to perspicuously capture the IR emissions because no additional layer in between impedes the radiation. It maintains the on-chip temperatures within a safe range equivalent to the original heat sink-based cooling. We demonstrate how inaccurate state-of-the-art thermal analysis results in incorrectly estimating reliability. Our setup is the most accurate, least intrusive one that has been both proposed and actually applied to state-of-the-art multi-cores (Intel 45 nm dual-core and 22 nm octa-core).

## I. INTRODUCTION

With technology scaling, power densities have steadily increased making current and upcoming chips thermally constrained due to the deleterious impact of elevated temperatures on reliability. For instance, spatial thermal gradients across the chip can seriously jeopardize its reliability due to electromigration and logic failures [6]. Importantly, high on-chip temperatures contribute to faster device aging [1]. This trend demands an accurate thermal analysis as it is a prerequisite for evaluating the efficiency of any applied reliability-increasing technique.

Thermal Analysis: To investigate the thermal behavior of a chip, thermal simulations have traditionally been employed. Unfortunately, the estimated thermal maps may suffer from inaccuracy due to the lack of accurate information about the runtime power consumption [2]. Alternatively, direct thermal measurements can be obtained using the on-chip thermal diode sensors, but the total number of sensors is restricted for power and cost reasons. This limitation may lead to a failure in capturing the peak temperature, particularly when the runtime hotspot is located too far from the sensor's location, which is fixed at design time, as Fig 1 demonstrates. In [5], the authors proposed attaching thermal sensors to the center of the processor packaging to measure its temperature. However, similar limitations still exist. Modern chips (e.g., the Intel SCC [9]) distribute many Ring Oscillators (ROs) across the die that may be utilized as thermal sensors after careful calibration. However, their susceptibility to voltage noise causes instability, as shown in Fig 2, where jumps in the RO frequency correlate with voltage levels.

To tackle the shortcomings of the aforementioned techniques, IR thermography setups that employ an IR camera have been deployed to capture high-resolution IR images of the die of FPGA-based systems (e.g., [2]) and ASIC processors (e.g., [3][4]), providing designers with an accurate baseline to compare against. Additionally, such a setup enables them to accurately analyze the potential spatial/temporal thermal gradients that may occur during operation.

Challenge behind IR Thermography: Building such an IR thermography-based measurement setup requires removing the cooling and packaging of the chip so that the IR emissions can reach the camera's lens. Unlike FPGA-based processors, which can still operate properly after their bare silicon is exposed [2] thanks to their relatively low operating frequency, measuring the temperature of ASIC-based processors is more challenging: it necessitates building an alternative IR-transparent cooling that allows the IR radiation emitted from the chip to reach the thermal camera and concurrently counteracts the very rapid increase in temperature due to the excessive power densities.

Fig. 1: The thermal image shows that a single fixed thermal diode sensor may not capture a thermal hotspot

Fig. 2: Voltage-dependent RO-based sensor frequency with the corresponding voltage changes. The measurements show a sensitivity to voltage fluctuations of $1.6\,\mathrm{MHz/mV}$

<sup>1</sup>Images captured by a thermal camera under minimal interference with the emitted IR radiation from the measured object.

Fig. 3: Liquid-based thermal measurement setup applying an IR coolant oil on top of the measured chip to prevent overheating

An intuitive solution may be to reduce the ambient temperature, as the chip's temperature is almost linearly proportional to it, by putting the entire setup inside a thermal chamber. However, after its cooling is removed, the chip generates heat significantly faster than the surrounding cold air can dissipate it, as air has a very low thermal conductivity. Using the HotSpot thermal simulator, we examined the impact of removing the cooling on the temperature of the Alpha CPU and found that the peak temperature increases from  $90^{\circ}C$  (when the cooling is applied) to  $>200^{\circ}C$  (when both packaging and heat sink are removed). Thus, the ambient temperature would need to be reduced tremendously to prevent the chip from overheating, which may, in any case, be unsustainable for the board/chip.
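The scale of the required ambient reduction can be illustrated with a back-of-the-envelope sketch. It assumes a simple lumped model in which the peak temperature shifts one-to-one with the ambient temperature (a simplification of the "almost linearly proportional" relation above), and it assumes a room-temperature ambient of 25 °C, which is not stated in the text. The function name `required_ambient` is hypothetical.

```python
def required_ambient(t_ambient_c, t_peak_uncooled_c, t_peak_target_c):
    """Ambient temperature needed so that the uncooled peak drops to the target.

    ASSUMPTION: lumped model T_peak = T_ambient + P * R_th, so lowering the
    ambient by x degC lowers the peak by x degC as well.
    """
    excess_c = t_peak_uncooled_c - t_peak_target_c
    return t_ambient_c - excess_c


# Peak temperatures reported in the text; 200 degC is only a lower bound.
T_PEAK_COOLED_C = 90.0
T_PEAK_UNCOOLED_C = 200.0
# ASSUMPTION: ambient temperature of the original simulation (not given).
T_AMBIENT_C = 25.0

needed_c = required_ambient(T_AMBIENT_C, T_PEAK_UNCOOLED_C, T_PEAK_COOLED_C)
print(f"Ambient would have to drop to about {needed_c:.0f} degC or lower")
```

Under these assumptions the thermal chamber would have to be cooled far below freezing, which is consistent with the argument that ambient cooling alone is impractical.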

To tackle this challenge, state-of-the-art techniques [3][4] apply a layer of IR-transparent liquid on top of the chip. Through a continuous stream of coolant liquid, the setup prevents the chip from damage. Besides the complexity that comes with building such a setup, having additional layers on top of the chip is disadvantageous because of interference with the IR radiation.

## II. STATE-OF-THE-ART IN INFRARED THERMOGRAPHY

Building an IR-transparent heat sink using a coolant liquid flowing directly on top of the bare silicon of the chip necessitates choosing a liquid that is IR transparent so that the IR emissions can reach the camera unobtrusively. An example of a liquid used for this purpose is a mineral oil [3][4] specifically designed for IR spectroscopy. By continuously circulating the coolant oil, the heat generated by the processor chip can be dissipated, which protects the chip from overheating. An external pump is then needed to control the flow speed. To constrain the flow, a window on top of it is also required [4]. Such a window must be an IR window to allow energy at a specific electromagnetic wavelength to pass. In Fig 3, we illustrate the liquid-based thermal setup, showing how additional layers may interfere with the emitted IR radiation.

### A. Obstacles behind Capturing Lucid Images

The key drawback of the aforementioned setup is that several aspects interfere with the measured IR radiation, which may result in equivocal (i.e. non-lucid) IR images.



Fig. 4: Thermal measurement experiments demonstrating the impact of IR mineral oil applied to a hot object during measurements

Compatibility: The materials deployed to build the parts of the setup have diverse IR spectroscopy properties, which complicates compatibility. Each IR camera operates in a specific wavelength band that covers a corresponding part of the electromagnetic spectrum. For instance, short wavelength IR cameras cover wavelengths of 1-3 $\mu m$ and long wavelength IR cameras cover wavelengths of 8-14 $\mu m$ [10]. The transmission range of a material indicates the region of the electromagnetic spectrum in which it lets energy pass. When new layers (e.g., coolant liquid and IR window) are added on top of the chip, their transmission ranges need to be precisely chosen to maintain compatibility. A sapphire IR window, for instance, has a transmission range of 0.17 to $5.5\,\mu\mathrm{m}$ [10], which makes it compatible only with short wavelength cameras, whereas long wavelength cameras cannot collect IR data through it because a sapphire window does not transmit beyond $5.5\,\mu\mathrm{m}$ [10]. Similarly, the transmission range of the selected oil needs to be compatible with the transmission ranges of the IR window and the camera.
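The compatibility constraint amounts to intersecting wavelength intervals. The following minimal sketch uses the camera bands and the sapphire transmission range quoted above; the names `Band`, `overlap`, and `usable_band` are hypothetical, and each layer is idealized as fully transparent inside its range and opaque outside it.

```python
from typing import NamedTuple, Optional, Sequence


class Band(NamedTuple):
    """Wavelength interval in micrometers."""
    low_um: float
    high_um: float


def overlap(a: Band, b: Band) -> Optional[Band]:
    """Return the intersection of two bands, or None if they do not overlap."""
    low = max(a.low_um, b.low_um)
    high = min(a.high_um, b.high_um)
    return Band(low, high) if low < high else None


def usable_band(camera: Band, layers: Sequence[Band]) -> Optional[Band]:
    """Part of the camera band that passes through every layer on the chip."""
    band: Optional[Band] = camera
    for layer in layers:
        band = overlap(band, layer)
        if band is None:
            return None
    return band


# Values quoted in the text [10].
SHORT_WAVE_CAMERA = Band(1.0, 3.0)
LONG_WAVE_CAMERA = Band(8.0, 14.0)
SAPPHIRE_WINDOW = Band(0.17, 5.5)

print(usable_band(SHORT_WAVE_CAMERA, [SAPPHIRE_WINDOW]))
print(usable_band(LONG_WAVE_CAMERA, [SAPPHIRE_WINDOW]))
```

A coolant oil would simply be appended to `layers` as a further `Band` once its transmission range is known; no value is assumed for it here.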

Thermal Convection: The IR camera obtains the chip's thermal image by capturing the electromagnetic waves that radiate from the chip and directly transport IR radiation through air. When a coolant liquid is applied to the surface of the hot chip, thermal convection occurs, i.e. heat is transferred from one region to another by the movement of the liquid. This diminishes the lucidity of the IR images of the measured object, as Fig 4 clarifies, where an IR oil similar to the one used in current setups is examined.

Complexity: The thickness of the oil layer plays an essential role in determining the lucidity of the captured IR images. This can be seen in Fig 4 (a), where a thicker oil layer made the image less lucid. Furthermore, the properties of the applied oil, e.g., its viscosity and flow speed, which both influence the thermal convection, further increase the complexity of the setup. It is also necessary to keep the liquid free from pollutants, which would compromise its transparency over time. It is noteworthy that calibrating the equivocal IR images of the state-of-the-art to close the inaccuracy gap necessitates, in any case, having the corresponding lucid IR images (i.e. images captured without any layer on top of the chip) as an accurate baseline to calibrate against.

**In Summary:** State-of-the-art IR thermography setups are challenging to build, as they are subject to multiple aspects that interfere with the emitted IR radiation. That, in turn, negatively influences the accuracy of the measurements.



Fig. 5: Our proposed RAMA setup for lucid IR thermography of processors

Our main contributions within this paper are:

- (1) We introduce a setup that cools the measured chip from its rear side to keep its operating temperature within a safe range equivalent to the original cooling.
- (2) Our proposed Rear-side Thermoelectric-based IR Thermography (RAMA) setup does not require the addition of any extra layer on top of the measured chip, allowing the IR camera to perspicuously capture the IR emissions and thereby providing lucid thermal images. We also demonstrate that if the thermal setup were not as accurate as the one we built, reliability would be incorrectly estimated.

## III. OUR RAMA SETUP FOR LUCID IR IMAGES

To keep the measured chip in operational conditions after removing its cooling, we continuously cool it from the rear side, i.e. through the PCB to which the chip is attached, as shown in Fig 6. A thermoelectric device is employed as it provides a controlled source of cooling.

The basic concept behind it is the *Peltier effect*, a phenomenon that occurs when an electrical current flows through two dissimilar semiconductor materials (i.e. N-type and P-type) [11], resulting in a temperature difference. It heats up one side of the thermoelectric device and cools the other, because heat is conducted from the side where it is absorbed to the side where it is released. The generated heat must be rapidly dissipated from the hot side of the thermoelectric device to avoid overheating and to allow for continued proper operation. To this end, we use a water-cooling unit. The temperature difference between the two sides of the thermoelectric device can be controlled by regulating the applied voltage.
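The need for hot-side heat removal can be made concrete with the standard lumped model of a thermoelectric cooler, in which the heat pumped from the cold side is the Peltier term minus half of the Joule heating and minus the heat conducted back through the module. The sketch below is only an illustration: the module parameters (Seebeck coefficient, resistance, thermal conductance) and the operating point are hypothetical and do not describe the device used in our setup.

```python
def tec_heat_flows(current_a, t_cold_k, t_hot_k,
                   seebeck_v_per_k, resistance_ohm, conductance_w_per_k):
    """Standard lumped thermoelectric cooler model (temperatures in kelvin).

    Returns (q_cold_w, p_in_w, q_hot_w):
      q_cold_w  heat absorbed at the cold side
      p_in_w    electrical input power
      q_hot_w   heat that must be removed from the hot side
    """
    delta_t = t_hot_k - t_cold_k
    joule_w = current_a ** 2 * resistance_ohm
    q_cold_w = (seebeck_v_per_k * current_a * t_cold_k
                - 0.5 * joule_w
                - conductance_w_per_k * delta_t)
    p_in_w = seebeck_v_per_k * current_a * delta_t + joule_w
    q_hot_w = q_cold_w + p_in_w  # energy balance
    return q_cold_w, p_in_w, q_hot_w


# HYPOTHETICAL module parameters and operating point, for illustration only.
q_cold, p_in, q_hot = tec_heat_flows(
    current_a=2.0, t_cold_k=300.0, t_hot_k=320.0,
    seebeck_v_per_k=0.05, resistance_ohm=2.0, conductance_w_per_k=0.5)
print(f"cold side absorbs {q_cold:.1f} W, input {p_in:.1f} W, "
      f"hot side rejects {q_hot:.1f} W")
```

In this model the hot side always has to reject the pumped heat plus the electrical input power, which is why an additional heat removal path such as the water-cooling unit is required. The terminal voltage follows as the Seebeck voltage plus the resistive drop, so regulating the voltage sets the current and thereby the cooling.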



Fig. 6: The employment of a thermoelectric device in our RAMA setup

Our thermal measurement setup is depicted in Fig 5. We investigate chips with flip-chip technology, as they allow mechanical removal of the packaging, as opposed to, e.g., wire-bond chips. This means that the silicon wafer of the processor is exposed after the chip packaging is removed. On-chip temperatures are then directly measured using a DIAS pyroview 380L compact IR camera capable of capturing temperatures with an accuracy of  $\pm 1^{\circ}\mathrm{C}$  and a spatial resolution of  $50\,\mu\mathrm{m}$  per pixel [12]. Once the readings have been obtained, the camera sends them at a frame rate of  $50\,\mathrm{Hz}$  to a host PC, which analyzes them in order to build the thermal images of the measured chip.

It is noteworthy that our setup is not restricted to a specific kind of IR camera, unlike the state-of-the-art, which necessitates employing an IR camera operating at a specific wavelength that must be compatible with both the IR oil and the IR window.

Fig. 8: Measured IR images of the runtime behavior of an Intel *dual-core* processor under an intense workload stress as well as the corresponding temperatures of individual cores compared to the fabricated/original heat sink-based cooling unit

Fig. 7: Emissivity test of a measured chip. Recovered figure annotation: "[...] ~0.01). Most heat measured is reflection from surroundings"

**Emissivity:** A crucial property to consider when performing IR measurements is the *emissivity* of the material. It indicates the object's ability to emit IR energy and it ranges from 0.0 for a perfect reflector to 1.0 for a perfect emitter. Fig 7 shows our emissivity test of an FPGA chip with its packaging. There, the left half of the chip appears to have a relatively low temperature due to the low emissivity of the metal. The actual temperature, however, is much closer to that of the right, coated half of the chip. To compensate for this shortcoming, a masking tape (with an emissivity of 0.92 [8]) is applied to the material's surface, increasing its emissivity in the IR spectrum of the thermal camera. The masking tape was chosen because of its relatively high emissivity and because it can easily be removed. The measurements obtained with the masking tape are then used for determining the emissivity of the exposed silicon wafer through a comparison between the camera image and the chip's thermal diode readings. Once the emissivity is known, the camera software can internally compensate for the missing temperature to provide a more accurate reading.
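The compensation idea can be sketched with a simplified gray-body model. The sketch assumes that the camera, when configured for an emissivity of 1.0, reports an apparent temperature whose fourth power is the emissivity-weighted mix of the object's and the reflected surroundings' fourth powers (broadband Stefan-Boltzmann approximation, atmosphere ignored). The actual camera software may use a different, band-limited model, and all function names and numeric values below are hypothetical.

```python
KELVIN_OFFSET = 273.15


def _k4(t_c):
    """Fourth power of the absolute temperature for a value given in degC."""
    return (t_c + KELVIN_OFFSET) ** 4


def apparent_temperature(t_object_c, emissivity, t_reflected_c):
    """Forward model: reading of a camera configured for emissivity 1.0."""
    t4 = emissivity * _k4(t_object_c) + (1.0 - emissivity) * _k4(t_reflected_c)
    return t4 ** 0.25 - KELVIN_OFFSET


def estimate_emissivity(t_apparent_c, t_reference_c, t_reflected_c):
    """Emissivity from an apparent reading and a trusted reference temperature
    (e.g. a thermal diode reading)."""
    denominator = _k4(t_reference_c) - _k4(t_reflected_c)
    if denominator == 0.0:
        raise ValueError("reference and reflected temperatures must differ")
    return (_k4(t_apparent_c) - _k4(t_reflected_c)) / denominator


def corrected_temperature(t_apparent_c, emissivity, t_reflected_c):
    """Invert the forward model once the emissivity is known."""
    t4 = (_k4(t_apparent_c) - (1.0 - emissivity) * _k4(t_reflected_c)) / emissivity
    return t4 ** 0.25 - KELVIN_OFFSET


# HYPOTHETICAL values: 60 degC surface, emissivity 0.6, 25 degC surroundings.
reading_c = apparent_temperature(60.0, 0.6, 25.0)
eps = estimate_emissivity(reading_c, 60.0, 25.0)
print(f"apparent {reading_c:.1f} degC, estimated emissivity {eps:.2f}")
```

The low-emissivity case illustrates why the bare metal half in Fig 7 appears cooler than it is: as the emissivity approaches zero, the apparent reading approaches the temperature of the reflected surroundings.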

## IV. EVALUATION, COMPARISON AND ADVANTAGES

To evaluate our thermal setup, we use an Intel  $45\,\mathrm{nm}$  dual-core processor along with the CPUBurn benchmark, which is designed to load x86 CPUs as heavily as possible in order to maximize heat production from the CPU and to put intensive stress on the chip itself as well as on the cooling [13]. To evaluate our setup under the highest possible stress, we operate the two cores at the maximum frequency  $(1.8\,\mathrm{GHz})$  with hyper-threading enabled. As a result, four CPUBurn programs run simultaneously, which maximizes the workload on the entire chip. The steady-state temperature of the chip, under such an intense scenario, reaches around  $55^{\circ}C$ , which is far from the critical temperature ( $80^{\circ}C$  as defined by Intel). This shows that our setup protects the chip from damage during the IR measurements that are required for performing the desired thermal analysis. Some of the captured IR images are shown in Fig 8, which presents the real-time thermal images of the chip at different points in time from the start of program execution.

Steady-State Temperature Equivalent: Since different cooling may have a different thermal impact on the chip, it is necessary to examine to what extent our proposed cooling is representative of the original setup, i.e. the IR-opaque fabricated cooling unit. Therefore, we directly compare the two. For a fair comparison, both processors start at the same initial temperature (i.e.  $30^{\circ}C$ ). Then, we concurrently run the same intense workload (i.e. four simultaneous CPUBurn programs) on both for the same interval of time, around 5 minutes, which is sufficient to reach the steady-state temperature. During operation, the thermal diode sensor readings of each core are recorded to compare the thermal impact of our setup with the original one. As presented in Fig 8(a, b, c), the steady-state temperature of each core with our cooling in action is very similar to the steady-state temperature in the case of the original cooling<sup>2</sup>. Under identical conditions with respect to initial temperature and workload, our IR-transparent cooling setup for thermal measurements thus achieves a very good representation of the original cooling in terms of steady-state temperatures.

It is noteworthy that analyzing the *long-term* effects of temperature on reliability due to aging mainly depends on the steady-state temperature and not on the transient.

Unlike FPGA-based processors, where the thermal hotspot is located at the memory interface [2], we observe in ASIC-based processors that the thermal hotspot is caused by the processor cores themselves, as shown in Fig 8. This is mainly due to the significantly higher power densities of the cores. Therefore, thermal management needs to focus on the cores' temperatures. For instance, task migration-based thermal management techniques (e.g., [7]) may be applied. Thus, we investigate, in the following, the thermal characteristics when tasks are migrated between cores.

<sup>2</sup>An online adaptation of the voltage applied to the thermoelectric device could adjust the generated cooling so that the transient temperatures also behave similarly. However, this was not the focus of this paper and is part of our future work.

Fig. 9: Spatial thermal gradient calculation along with an example of a measured IR image with its calculated thermal gradient map

Fig. 10: Adding a layer of oil on top of the measured chip causes the captured IR images to lose their *lucidity* due to thermal convection, which jeopardizes the accuracy of spatial thermal gradient analysis

For a more general evaluation, we employ a state-of-the-art Intel 22 nm octa-core processor running at a maximum frequency of 2.4 GHz. The scenario of running the intense workload (i.e. eight *CPUBurn* programs running simultaneously) was examined first, and it showed that our setup is also able to protect the chip from damage during measurement under such intense stress (the peak temperature was around  $75^{\circ}C$ , similar to the steady-state temperature under the original cooling). Fig 9 presents an example of the measured IR images of the chip and the corresponding calculated spatial thermal gradient maps during the scenario of migrating the running tasks from two cores to others every 10 sec. The gradient maps have been calculated according to the formula within Fig 9, which integrates over several pixels in order to prevent wrong gradients caused by the pixel resolution of the thermal camera. This is necessary to avoid the noise in pixel values during the computation of the pixel-wise difference. The operating conditions of the employed thermoelectric device in these experiments  $(I_{in}=0.6\,\mathrm{A},\,V_{in}=1.5\,\mathrm{V})$  were carefully selected to make our setup mimic the impact of the original cooling. As can be seen from the specifications in Fig 5, the thermoelectric device does not reach its maximum input current and still has more current (around  $8.3\mathrm{x}$ ) available. This provides a wide range of cooling from which the most suitable setting for the targeted chip can be selected, e.g. a higher cooling to examine more severe scenarios that might arise in other processors with higher power densities.
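The idea of integrating over several pixels before differencing can be sketched as follows. This is not the formula from Fig 9, which is not reproduced in the text; it is a generic substitute that averages the image over square blocks and then takes finite differences between neighboring blocks. The block size is an assumption, and the pixel pitch defaults to the 50 µm per pixel stated for the camera.

```python
import math


def block_average(image, block):
    """Average a 2-D list of temperatures over non-overlapping square blocks."""
    rows, cols = len(image) // block, len(image[0]) // block
    averaged = []
    for r in range(rows):
        row = []
        for c in range(cols):
            total = sum(image[r * block + i][c * block + j]
                        for i in range(block) for j in range(block))
            row.append(total / (block * block))
        averaged.append(row)
    return averaged


def gradient_map(image, block=4, pixel_mm=0.05):
    """Gradient magnitude in degC/mm between neighboring averaged blocks.

    ASSUMPTION: block averaging stands in for the integration step; the
    block size is an illustrative choice, not a value from the paper.
    """
    avg = block_average(image, block)
    pitch_mm = block * pixel_mm  # distance between block centers
    grad = []
    for r in range(len(avg) - 1):
        row = []
        for c in range(len(avg[0]) - 1):
            dx = (avg[r][c + 1] - avg[r][c]) / pitch_mm
            dy = (avg[r + 1][c] - avg[r][c]) / pitch_mm
            row.append(math.hypot(dx, dy))
        grad.append(row)
    return grad


# Synthetic test image: temperature rises by 0.1 degC per pixel along x.
ramp = [[40.0 + 0.1 * col for col in range(8)] for _ in range(8)]
gradients = gradient_map(ramp, block=2)
print(max(max(row) for row in gradients))
```

Averaging first means that single-pixel noise is divided by the number of pixels in a block before it enters the difference, at the cost of a coarser gradient map.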

Jeopardy of Equivocal IR Images: Fig 10 clarifies the loss in the lucidity of the IR images when a thin IR oil layer is added to the chip. Due to the transfer of heat from one region to another by the movement of the liquid, thermal convection interferes with the IR radiation and results in equivocal IR images. In contrast, the IR image captured by our setup is lucid, as the camera perspicuously captures the IR radiation. Importantly, an equivocal IR image negatively impacts the spatial thermal gradient analysis, as the thermal contour maps<sup>3</sup> and histograms establish. As shown, the oil causes the spatial thermal gradient analysis to differ from the case of measuring without oil. This leaves designers with an ambiguous view of the potential reliability degradations that may occur during the lifetime. In this particular scenario within Fig 10, we measure a 7°C variation in terms of the maximum gradient. However, this variation will be higher when more cores are stressed, as this can further heat up the oil.

Impact of Inaccurate Thermal Analysis on Reliability: Spatial thermal gradients result in a temperature variation within a small distance. This causes identical circuits within a component (e.g., SRAMs of a CPU cache) to have distinct characteristics, which degrades the entire on-chip reliability [6]. One of the key reasons behind deploying thermal setups is enabling designers to accurately estimate the reliability of on-chip systems. Therefore, we investigate for the first time, using highly accurate gradient maps extracted from lucid thermal images, how such a variation of 7°C in estimating the maximum spatial thermal gradient (due to the applied oil) noticeably impacts the reliability analysis. To this end, the reliability of SRAMs has been explored as a representative example, because they may occupy more than 70% of the total chip area [14]. Static Noise Margin (SNM), which represents the resiliency against noise, Critical Charge  $(Q_{crit})$ , which represents the susceptibility to radiation, and Read Access Time (RAT), which represents the ability to provide data in time, are considered the key reliability aspects of SRAMs [1]. We conducted around 1000 simulations based upon SPICE for the 22 nm Predictive Technology Model (PTM) for transistor sizing. As shown in Fig 11(a), the temperature rise shifts the SNM distribution to the left, resulting in lower resiliency against noise. A similar behavior can also be observed in Fig 11(b), which shows how a temperature increase makes the SRAMs more susceptible to radiation. In Fig 11(c), the temperature rise shifts the RAT distribution to the right, making SRAMs slower. Therefore, thermal analysis based on equivocal IR images results in incorrectly estimating the reliability of SRAMs. When aging analysis comes into play, the aforementioned problem of incorrectly estimating reliability is exacerbated. In Fig 12, we illustrate how a 7°C measurement error leads to overestimating the impact that aging effects<sup>4</sup> may have on the reliability of transistors (i.e.  $\Delta V_{TH}$ ) and of SRAMs. As shown, inaccurate thermal analysis at design time may lead to overestimating the impact of aging on, for instance, RAT by 16%.

<sup>3</sup>Thermal contour is a means to represent the thermal gradients where the absolute temperature difference between two lines is always the same.

Fig. 11: Impact of a thermal measurement error of  $7^{\circ}C$  (as is the case in current state-of-the-art thermal setups when a thin layer of oil (1 mm) is applied on top of the chip) on the estimation accuracy of the key reliability aspects of 22 nm SRAMs

Fig. 12: Impact of state-of-the-art inaccurate thermal analysis on incorrectly estimating aging for a lifetime of 10 years

All in all, having lucid thermal images allows designers to accurately explore the thermal characteristics of on-chip systems and thus to correctly estimate their reliability.

## V. CONCLUSION

We introduced in this work an IR-transparent cooling setup enabling thermal cameras to perspicuously capture the chip's thermal images. In addition to the *dual-core* processor scenario, we explored the *octa-core* processor, which exhibits a more severe scenario due to the higher on-chip power densities. Our RAMA setup overcomes the drawbacks of the state-of-the-art liquid-based cooling setup. In addition to the right thermal setup, we show in this paper for the first time to what extent critical circuit characteristics for reliability are correlated with temperature. If the thermal setup were not as accurate as the one we present, reliability would be incorrectly estimated. As a consequence, we believe that we are presenting the most accurate simulations of circuit reliability so far, in the scope of temperature effects, with respect to noise susceptibility, radiation susceptibility and timing violations. In all aspects, capturing lucid thermal images has a significant impact on accurately estimating reliability.

## ACKNOWLEDGMENT

The authors would like to thank Volker Wenzel, Victor van Santen and Matrin Buchty (CES, KIT) for their valuable help. The experiments in Figs (2, 7) have been conducted in cooperation with Thomas Ebi (CES, KIT).

## REFERENCES

- [1] H. Amrouch, V. M. van Santen, T. Ebi, V. Wenzel, and J. Henkel, "Towards interdependencies of aging mechanisms," in *Proceedings of the IEEE/ACM International Conference on Computer-Aided Design*, ICCAD, 2014, pp. 478–485.
- [2] H. Amrouch, T. Ebi, J. Schneider, S. Parameswaran, and J. Henkel, "Analyzing the thermal hotspots in FPGA-based embedded systems," *Field Programmable Logic and Applications, 23rd International Conference on*, 2013, pp. 1–4.
- [3] F. Mesa-Martinez, M. Brown, J. Nayfach-Battilana, and J. Renau, "Measuring power and temperature from real processors," *Parallel and Distributed Processing, IEEE International Symposium on*, 2008, pp. 1–5.
- [4] S. Reda, "Thermal and power characterization of real computing devices," *Emerging and Selected Topics in Circuits and Systems, IEEE Journal on*, vol. 1, no. 2, pp. 76–87, 2011.
- [5] Q. Xie, J. Kim, Y. Wang, D. Shin, N. Chang, and M. Pedram, "Dynamic thermal management in mobile devices considering the thermal coupling between battery and application processor," in *Proceedings of the International Conference on Computer-Aided Design*, 2013, pp. 242–247.
- [6] L. Zhijian, H. Wei, M. R. Stan, K. Skadron, and J. Lach, "Interconnect Lifetime Prediction for Reliability-Aware Systems," in *Very Large Scale Integration (VLSI) Systems, IEEE Transactions on*, vol. 15, no. 2, pp. 159–172, 2007.
- [7] T. Ebi, H. Amrouch, and J. Henkel, "COOL: control-based optimization of load-balancing for thermal behavior," in *Proceedings of the 10th International conference on Hardware/software codesign and system synthesis*, 2012, pp. 255–264.
- [8] B. Griffith, D. Tuerler, and H. Goudey, "Infrared Thermography," Encyclopedia of Imaging Science and Technology, 2002.
- [9] "Intel SCC," www.intel.com/info/scc.
- [10] "Infrared Windows," http://iriss.com/.
- [11] F. J. DiSalvo, "Thermoelectric cooling and power generation," *Science*, pp. 703–706, 1999.
- [12] "DIAS Infrared Camera," http://www.dias-infrared.com/.
- [13] http://manpages.ubuntu.com.
- [14] "ITRS," http://www.itrs.net/.

<sup>4</sup>Both Bias Temperature Instability and Hot Carrier Injection aging phenomena have been analyzed based on the models presented in [1].