# TENT: Efficient Quantization of Neural Networks on the tiny Edge with <u>Tapered FixEd PoiNT</u>

Hamed F. Langroudi\*, Vedant Karia\*, Tej Pandit, Dhireesha Kudithipudi Neuromorphic AI Lab, University of Texas at San Antonio, TX, USA

## **ABSTRACT**

In this research, we propose a new low-precision framework, TENT, to leverage the benefits of a tapered fixed-point numerical format in TinyML models. We introduce a tapered fixed-point quantization algorithm that matches the numerical format's dynamic range and distribution to that of the deep neural network model's parameter distribution at each layer. An accelerator architecture for the tapered fixed-point with TENT framework is proposed. Results show that the accuracy on classification tasks improves up to  $\approx\!31\%$  with an energy overhead of  $\approx\!17\text{-}30\%$  as compared to fixed-point, for ConvNet and ResNet-18 models.

#### **KEYWORDS**

Deep neural networks, low-precision arithmetic, tapered fixed-point

#### 1 INTRODUCTION

In the last decade, there has been a surge in deep neural network (DNN) development and deployment for a wide range of use-cases, from bio-medicine [34] to precision agriculture [27]. One of the reasons for this success can be attributed to the dramatic enhancement in the knowledge capacity of DNN models. For instance, the knowledge capacity of DNNs for language translation has boosted by 629x from the GNMT model (278 million parameters) [38] to the recent GPT-3 model (175 billion parameters) [4] within a four year timespan. Increasing the knowledge capacity of DNN models also increases the number of operations, mostly multiply-and-accumulate (MAC), at trillions of operations per ML inference [15]. However, these large networks, classified as CloudML models are deployed on cloud based datacenters where each node utilizes massive compute resources which require hundreds of watts of power (eg., RTX-3070 Nvidia GPU has 8 GB memory, with 20 TFLOPs throughput, and 220 watt power consumption).

Tangential to the trend of upscaling cloudML models to improve performance, a new class of models have emerged. MobileML [33, 35] and TinyML [2, 3, 12, 22, 23, 30] models address the rapidly growing demand to deploy DNNs on the *edge* and on the *tiny edge* (<1W) devices.

To deploy these models on such resource constrained *tiny edge* platforms (*eg.*, ARM M-7 MCU with 2 MB, 216 million cycles per second (MCPS) throughput, and 0.3 watt power consumption [2]), the TinyML models are either designed from scratch through neural architecture search (NAS) [2, 22, 23] or by compressing a version of the CloudML models. Often these compression mechanisms include approaches such as low-precision arithmetic [9, 21, 30], model pruning [12], knowledge distillation [5], and low-rank approximation

[36]. Of the aforementioned techniques for TinyML models, low-precision arithmetic has been gaining significant traction. With this technique, the performance of the models can be compromised based on the numerical format selected.

TinyML models with low-bit precision have been deployed on tiny edge devices [30]. Unfortunately, reducing the bit-precision in fixed-point numerical formats (binary/ternary in extreme cases) can jeopardize the model performance due to the limited and fixed dynamic range [26, 29]. Equispaced distribution of values expressed by these numerical formats exacerbates the issue to some degree [26]. Pre- and post-processing approaches such as quantization aware training (QAT) [7, 16, 18], retraining [24] and calibration [25] to boost the performance of TinyML models using fixed-point increase their computational complexity. To overcome the performance loss with minimal hardware overhead, a new numerical format with unequal-magnitude spacing (tapered accuracy) and flexible dynamic range is needed. The tapered fixed-point format [20] offers both of these characteristics lacking in the conventional fixed-point format.

In this introductory paper, we are motivated to evaluate the efficacy of the tapered fixed-point numerical format compared to the standard fixed-point numerical format on multiple benchmarks. To study the performance of this new numerical format, a new framework, TENT, has been developed that quantizes TinyML model parameters to tapered fixed-point. The dynamic range and distribution (tapered or uniform) of the numerical format is adapted proportional to the dynamic range and distribution of parameters of each layer in the model. Furthermore, a hardware architecture is designed to study the complexity of the approach compared to fixed-point for  $\leq$  8-bit TinyML model inference in terms of latency and energy (on the CIFAR-10 dataset).

The key contributions of this work are as follows:

- (1) a tapered fixed-point quantization algorithm that adapts the numerical format to best represent the layerwise dynamic range and distribution of parameters within a TinyML model.
- (2) a low-precision deep learning framework, TENT, that demonstrates better performance of tapered fixed-point over fixed-point formats for multiple classification tasks.
- (3) an implementation of the TENT framework as a custom hardware architecture to study the latency and energy consumption of tapered fixed-point vs standard fixed-point.

# 2 TAPERED FIXED-POINT FORMAT

The tapered fixed-point numerical format (TFX) which is also called as taper [20] can be illustrated as a combination of the posit [13] and fixed-point numerical formats. It combines the hardware-oriented characteristics of fixed-point and the high accuracy of posit with tapered precision; where the values are distributed in a non-uniform tent shape which closely resemble the shape of DNN statistics. Specifically, the binary encoding to represent the integer bits in

<sup>\*</sup> Equal Contribution.

fixed-point is replaced with the signed unary encoding that was previously used to represent the regime bits in posit [13]. This change adds tapered precision characteristics to the fixed-point numerical format. The fraction remains the same as the standard fixed-point where the fraction is added to the integer rather than scaling, which reduces the hardware complexity as compared to posit and floating-point. The tapered fixed-point is defined as TFX(n, IS, SC) where n refers to the total number of bits, IS (in a range of [1,n]) indicates the maximum number of unary encoded integer bits and  $SC \in [-\frac{\beta}{2}..\frac{\beta}{2}]$  is the power-of-2 scaling downward or upward where  $\beta$  can be selected based on application. The IS value controls both the dynamic range ( $D_{TFX}$ ) as in (1) and the tapered precision.

$$D_{TFX} = \frac{Max_{TFX}}{Min_{TFX}} = \begin{cases} \frac{1}{2^{n-1}}, & \text{if } IS = 1\\ \frac{IS}{2^{n-2}}, & \text{otherwise} \end{cases}$$
 (1)

The dynamic range and tapered precision are scaled proportional to IS as shown in Fig. 1. For instance, the TFX(n, IS, SC) numerical format with n=5, SC=0, IS=1 and IS=2 has the two smallest dynamic ranges ( $\frac{1}{0.0625}$ ) and ( $\frac{2}{0.125}$ ) respectively, and behaves similar to fixed-point numerical format with uniform precision. Also, n=IS=5 and SC=0 represents the maximum dynamic range ( $\frac{5}{0.125}$ ) and maximum tapered precision. The dynamic range and precision does not change with variation in SC parameter, however, the TFX number represented can be scaled by a multiplication factor of 2 raised to the power SC.

The value of a TFX number is represented by (2), where I is computed in (3) representing the integer value, f indicates the fraction value and fs the maximum number of bits allocated for the fraction.

$$X = (I + \frac{f}{2f^s}) \times 2^{SC} \tag{2}$$

The integer bit-field is encoded based on the *runlength* m of identical bits (i...i) which is terminated by either an *integer terminating* bit  $\bar{i}$  or the end of the n-bit value. Note that the sign bit (s) is flipped and is also considered as the first bit of the integer  $(\bar{s} = i)$ .

$$I = \begin{cases} -m, & \text{if } i = 0\\ m - 1, & \text{if } i = 1 \end{cases}$$
 (3)



Figure 1: An Illustration of distribution of values in tapered fixed-point format with n=5,  $IS=\{1,2,3,4,5\}$  and SC=-1

For instance, 3.875 in the TFX(8,8,0) numerical format is represented by 3 as an integer (I), 0.875 as a fraction ( $\frac{f}{2fs}$ ), as shown by Fig. 2. More details about the tapered fixed-point number format can be found in [20].



Figure 2: Representation of a number in the tapered fixed point TFX(8,8,0) format

#### 3 RELATED WORK

Studies considering low-precision arithmetic have experimentally shown that TinyML models using 8-bit fixed-point numbers can achieve inference accuracy comparable to that of 32-bit floating-point numbers [2, 9, 21–23]. However, it requires either pre-processing such as calibration [25], quantization aware training (QAT) [7, 16, 18], or post-processing such as retraining [24]. For instance, Banbary *et al.* demonstrate the efficacy of 8-bit fixed-point for TinyML models on the visual wake words (VWW) dataset [2]. The outcome of this study, which uses the QAT approach to quantize weights and activations, indicates that 8-bit fixed-point model parameters are sufficient to achieve inference performance comparable to that of MobileNetV2 on the VWW corpora [8] (within <1% variation) which uses 32-bit floats.

While it is possible to achieve similar inference performance of TinyML models (32-bit float conversion to 8-bit fixed-point) through pre- or post processing approaches, reducing the bit-precision to fewer than 8 bits degrades the performance significantly [26, 29]. To mitigate this problem, researchers have explored mixed-precision fixed-point numerical format [10, 11, 14, 17, 30, 31, 37]. However, utilizing mixed-precision fixed-point required a precision-assignment policies for TinyML model parameters across layers. For instance, Rusci *et al.* leveraged the combination of reinforcement learning and QAT to automatically select the appropriate precision of TinyML parameters across layers [30]. And thus demonstrates that it is possible to evaluate even the ImageNet corpora using mixed-precision fixed-point formats on MobileNetV2, with results within <1% variation as compared to inference with 32-bit floats.

This research proposes a tapered fixed-point quantization algorithm for TinyML models where the dynamic range of this numerical format is adapted to the dynamic range of the TinyML model parameters. A notable difference between this work and previous works is that this quantization approach does not required complex hardware or algorithmic approaches such as precision-assignment policies, QAT or calibration.

# 4 TENT FRAMEWORK

The goal of the TENT empirical framework is to emulate ML inference with quantization using low-precision tapered fixed-point



Figure 3: TENT: a low-precision framework for TinyML inference with tapered fixed-point parameters. The framework is applied to each layer individually, selecting specific IS and SC values to match the distribution and range of parameters within the layer. IS specifies the *maximum* number of integer bits, and SC specifies the degree of shift required (left-shift if positive, right-shift if negative). The MAC structure displays the multiply-and-accumulate unit explained in Fig. 4.

format. Formally, the TinyML model parameters are quantized to tapered fixed-point such that the dynamic range and distribution of TinyML model parameters matches the distribution of values represented by tapered fixed-point. Therefore, the TENT framework, as shown in Fig. 3, approximates the optimal tapered fixed-point numerical format for TinyML model parameters in each layer by selecting appropriate *IS* and *SC* values. These parameters are selected based on dynamic range of TinyML model parameters in each layer. After this step, the learned weights and activations (32-bit floats) are quantized to the low-precision tapered fixed-point format, upon which the dot product operations are then carried out.

In particular, the TENT empirical framework comprises of three key aspects: Tapered fixed-point parameters selection, quantization to tapered fixed-point, and the low-precision tapered fixed-point dot product.

# 4.1 Tapered Fixed-Point Parameter Selection

Algorithm 1 presents the IS and SC optimization procedure. To select these parameters, in the first step, the maximum absolute value of the DNN parameters at each layer ( $W_{amax}$ , and  $A_{amax}$ ) are computed (lines 2–3) and rounded up to generate the appropriate IS value. Algorithm 1, defines the process by which IS is selected to tailor the numerical format to the range and distribution of parameters in each individual layer. In the worst case scenario, when the dynamic range of DNN parameters is larger than dynamic range of tapered fixed-point, the maximum possible value n (total number of bits represented in tapered fixed-point format) is selected for IS (lines 4–13). Selecting IS based on the algorithm 1 reduces the overflow error in quantization. However, when the maximum absolute value of a DNN parameter is less than the maximum absolute value

representable by the tapered fixed-point format selected; many bit-patterns in the numerical format are unused. To mitigate this issue, the the maximum absolute value of tapered fixed-point format is scaled down by 2 raised power of SC that determines as a base-2 logarithm of  $w_{amax}$  (lines 14–19).

# 4.2 Quantization with Tapered Fixed-Point

In this paper, the quantization function  $Q(x_i, q)$  defined in (4) approximates each parameter  $x_i$  to  $x_i'$  (a q-bit tapered fixed-point). In the quantization procedure, the values that lie outside the dynamic range of a given tapered fixed point format configuration,  $Q(\cdot, \cdot, \cdot, \cdot)$ , clips to the format maximum (u) or minimum (l) appropriately. A value that is between consecutive tapered fixed-point numbers is rounded to the nearest even number (RNE( $x_i$ )).

$$x_i' = Q(x_i, l, u, q) = \begin{cases} l, & x \le l \\ \text{RNE}(x_i), & l < x < u \\ u, & x \ge u \end{cases}$$
 (4)

## 4.3 Tapered Fixed-Point Dot Product

The tapered fixed-point dot product is presented in Algorithm 2. In the first step, a set of quantized weights and activations are decoded to the tapered fixed-point (lines 2–3). To decode the tapered fixed point format, the sign bit, integer bits (through leading zero detection algorithm), and remaining fractional bits require to be extracted (lines 4–7). Then, the product of the tapered fixed-point weights and activations are calculated without truncation or rounding after multiplication operations (lines 8–9). The products are then stored in a wide register (quire [13]) for m multipliers with size of  $w_{quire}$  as in (5) (lines 10–12).

**Algorithm 1** Compute the maximum integer bit width (*IS*) and scaling factor (*SC*) of tapered fixed-point for DNN parameters **Input**: layers weights ( $W_l$ ), layers activations ( $A_l$ )

**Output**:  $IS_{w_l}$ ,  $IS_{A_l}$ ,  $SC_{w_l}$  tapered fixed-point parameters

```
1: procedure IS, SC SELECTION (W_1, A_1)
 2:
              w_{amax} \leftarrow max(|w_1|)
              A_{amax} \leftarrow \max(|A_1|)
 3:
              Compute the IS_{w_I}, IS_{A_I}
              if \lfloor W_{amax} \rfloor + 1 \leq n_{bit} then
 4:
                      \texttt{IS}_{\texttt{w}} \leftarrow \lfloor \texttt{W}_{\texttt{amax}} \rfloor + 1
 5.
 6:
 7:
                     IS_w \leftarrow n_{bit}
 8:
              end if
 9:
              if |A_{amax}| + 1 \le n_{bit} then
                     IS_A \leftarrow \lfloor A_{amax} \rfloor + 1
10:
11:
                     \mathtt{IS}_\mathtt{A} \leftarrow \mathtt{n}_\mathtt{bit}
12:
13:
              end if
              Compute the SC_{w_l}
14:
              SC_{w_1} \leftarrow 0
              if W_{\text{amax}} < 0.5 then
15:
                     \mathtt{SC}_{\mathtt{W}} \leftarrow \lfloor \log_2(\mathtt{W}_{\mathtt{amax}}) \rfloor + 1
16:
17:
              \textbf{return} \ \texttt{IS}_{\texttt{W}_1}, \texttt{IS}_{\texttt{A}_1}, \texttt{SC}_{\texttt{W}_1}
18:
19: end procedure
```

$$w_{quire} = \lceil \log_2(m) \rceil + 2 \times \lceil \log_2(\frac{Max_{TFX}}{Min_{TFX}}) \rceil + 2$$
 (5)

The stored products are then converted and accumulated with fixed-point arithmetic. At the end, the accumulated result is encoded back into the tapered fixed-point numerical format (lines 16-18).

#### 5 SYSTEM DESIGN AND ARCHITECTURE

The TENT framework gives insights into adopting a new numerical format for storing the weights and activations which reduce the quantization loss at low precision. This section describes the framework designed to simulate DNNs on hardware platforms in order to evaluate the performance of the tapered fixed-point representation in terms of latency and energy consumption. Fig. 4 shows the highlevel architecture of the designed framework guided by the design of Eyeriss v2 [6]. Primarily, it is composed of processing elements (PEs) arranged in a 2D systolic array architecture accompanied by a hierarchical memory organization. Systolic architectures have shown promising results in computing convolutions at low energy cost due to the data reuse characteristics and parallel processing features [19]. Most systolic architectures, commonly used to perform convolution operations, adopt input stationary, weight stationary, and output stationary dataflows. Of which, output stationay has shown to reduce execution time and energy consumption when PE computations are confined to a single pixel in the output feature map, and was thus selected for this architecture.

The PE in the systolic array has a tapered fixed-point based MAC unit with configurable bit-precision. It is controlled by an external control signal (*IS*) which defines the integer value and dynamic

**Algorithm 2** Tapered fixed-point dot product operations for *n*-bit inputs each with  $\lceil \log n \rceil$  bits for *RS*, 3 bits for *SC*.

**Input**: layers quantized weights  $(W_{l_q})$ , layers quantized activations  $(A_{l_q})$ 

**Output**: *R* as Dot Product result

```
1: \mathbf{procedure} Tapered fixed-point DP (W_{l_q}, A_{l_q})
           \mathtt{sign_{w}}, \mathtt{Int_{w}}, \mathtt{frac_{w}} \leftarrow \mathtt{Decode}(W_{l_{a}}, \overset{\neg}{lS}_{w})
           sign_a, Int_a, frac_a \leftarrow Decode(A_{l_a}, IS_a)
 3:
           Multiplication
           sign_{mult} \leftarrow sign_w \oplus sign_a
 4:
           Value_w \leftarrow (Int_w + frac_w) \ll frac_{bit_w} + |SC_w|
 5:
           Value_a \leftarrow (Int_a + frac_a) \ll frac_{bit_a}
 6:
           p_{\text{mult}} \leftarrow \{\text{sign}, \text{Value}_{w} \times \text{Value}_{a}\}
 7:
           Accumulation & Normalize
 8:
                                                                                    ▶ Accumulate
           \texttt{sum}_{\texttt{quire}} \leftarrow p_{\texttt{mul}} + \texttt{sum}_{\texttt{quire}}
           \texttt{sum}_{\texttt{nquire}} \leftarrow \texttt{sum}_{\texttt{quire}} \gg \texttt{frac}_{\texttt{a}} + \texttt{frac}_{\texttt{w}} + |\texttt{SC}_{\texttt{w}}|
 9:
           Rounding & Encode
           result ← ROUNDING & ENCODING(sum_nquire)
10:
           return result
11:
12: end procedure
```

range of the format. Each MAC unit first reads an n-bit activation and weight in tapered fixed-point format and decodes them to  $\lceil log_2(n) \rceil + n$  bits fixed-point value with  $\lceil log_2(n) \rceil + 1$  integer bits and n-2 fraction bits. The decoding process utilizes a count leading zeros (CLZ) unit, which counts the total number of leading zeros (from MSB), representing the integer value of the fixed-point as shown in Fig. 5. Furthermore, the 3-bit SC signal controls the scaling factor of the weight by shifting the fixed-point number based on the magnitude and the sign of the SC signal. The decoded activations and scaled weights are multiplied using a fixed-point multiplier which is further accumulated. Unlike other dataflows, the output stationary dataflow does not require the quantization of the accumulated value at every step which avoids generation of partial sums and thus eliminates the quantization error which normally occurs in intermediate stages. The accumulated fixed-point value uses a unary encoding mechanism, based on the magnitude and sign of the integer, while converting to the n-bit tapered fixed-point

The memory hierarchy consists of 128 MB off-chip main memory (DRAM) and 3×108 kB on-chip scratchpad memory (SRAM). The main memory is dedicated to storing input data, activations and weights/filters that are loaded by the host processor, whereas the scratchpad memory serves as a global buffer. In order to estimate latency, we bridge our framework with the SCALE-Sim tool [32]. SCALE-Sim, however, does not consider the cycles consumed in shuttling data back and forth between the global buffer and the DRAM. Therefore, the total latency is re-approximated by considering PE array execution time and DRAM access time (Micron MT41J256M4). For energy estimation analysis, calculation of execution time, and power utilization, we factor in 45nm CMOS technology node.



Figure 4: Deep Neural Network accelerator architecture for TENT, with custom tapered fixed-point processing elements. The architecture is evaluated in a full cycle-emulator to analyze the performance and energy constraints.



Figure 5: RTL design for decoder, converting tapered point precision to fixed-point with count leading zero governed by IS parameter and encoder, converting fixed-point to tapered fixed-point with rounding off fraction bits to nearest value.

# 6 EXPERIMENTAL SETUP, RESULTS & ANALYSIS

The TENT framework is implemented in C++ and extended to support the TensorFlow framework [1]. To demonstrate the efficacy of the TENT framework, the performance of low-precision tapered fixed-point is evaluated on three inference tasks and is compared to the standard low-precision fixed-point numerical format with rounding quantization. The specifics of the evaluation tasks and the inference performance achieved on them with 32-bit float DNNs are summarized in Table 1. The MNIST and CIFAR-10 datasets are chosen for evaluation, as they are ideal for tinyML applications [3]. To appropriately evaluate the benefits of the tapered fixed point format,

the Fashion-MNIST dataset is also considered which presents additional challenges to the classification task, while still falling under a similar category as the MNIST dataset. The base model selected to perform the classification task on these datasets is an extremely compact model containing less than 2 million parameters that can be stored on tiny edge devices (eg., ARM M-7 MCU) [2]. In the evaluation of each format,  $IS \in [1..n]$  and  $SC \in [0..3]$  are considered for tapered fixed-point, and  $I \in [1..n-1]$  and  $fs \in [1..n-1]$  are considered for standard fixed-point. The inference accuracy performance was realized on the tinyML model by quantizing the parameters to either low-precision tapared fixed-point or standard fixed-point through post-training quantization approch.

Table 1: Description of the TinyML models and benchmarks using 32-bit float parameters.

| Dataset       | TinyML Model         | W-Range <sup>1</sup> | A-Range <sup>1</sup> | # Parameters | # MACs <sup>2</sup> | Performance |
|---------------|----------------------|----------------------|----------------------|--------------|---------------------|-------------|
| MNIST         | ConvNet <sup>3</sup> | [-0.78, 0.62]        | [0, 3.61]            | 1.40 M       | 58.7 K              | 99.32%      |
| Fashion-MNIST | ConvNet <sup>4</sup> | [-0.74, 0.45]        | [0, 5.97]            | 1.88 M       | 69.8 K              | 92.54%      |
| CIFAR-10      | ResNet-18            | [-2.12, 1.17]        | [0, 10.21]           | 0.27 M       | 286.72 K            | 91.54%      |

<sup>&</sup>lt;sup>1</sup> W: Weights; A: Activations

Table 2: Performance of TinyML models during inference with the tapered fixed-point (TFX) and fixed-point.

| Bit Precision | MNIST  |             | Fashi          | Fashion-MNIST |                       | CIFAR-10    |  |
|---------------|--------|-------------|----------------|---------------|-----------------------|-------------|--|
|               | TFX    | Fixed-Point | TFX            | Fixed-Point   | TFX                   | Fixed-Point |  |
| 8-bit         | 99.33% | 99.18%      | 92.59%         | 89.59%        | 81.66%                | 54.20%      |  |
| 7-bit         | 99.32% | 97.14%      | 92.47%         | 88.63%        | <u>75.90</u> %        | 24.50%      |  |
| 6-bit         | 99.30% | 97.08%      | 92.14%         | 85.31%        | 46.79%                | 12.96%      |  |
| 5-bit         | 99.29% | 96.96%      | <u>89.35</u> % | 83.46%        | $\underline{23.44}\%$ | 11.11%      |  |

#### 6.1 Inference Performance with TENT

The efficacy of the TENT framework is evaluated on DNN inference with varied IS and SC, as shown in Table 2. The findings show that the low-precision tapered fixed-point outperforms the standard fixed-point on various benchmarks by up to ≈31%. For instance, the performance of an 8-bit low-precision tapered fixedpoint ResNet-18 network on the CIFAR-10 dataset is improved by 27.44% compared to the fixed-point based network. Furthermore, we observed that the tapered fixed-point shows greater benefits on TinyML models whose parameters have a large dynamic range, such as ResNet model shown in Table 1 and Table 2. These performance benefits can be intuitively explained by the auto-tuning capability of the TENT framework, which adapts the format to the dynamic range of the weights and activations, so as to reduce the quantization error. The best performance observed on all the benchmarks (when analyzed across the full [5..8]-bit range) is achieved with tapered fixed-point.

## 6.2 Hardware System Results

The execution time of the DNN model is mainly governed by the dataflow and the PE array architecture. The output stationary dataflow offers a 24% reduction in latency as compared to weight stationary dataflow while performing inference [32]. For a compute-bound DNN this is a significant improvement, considering that inference favors latency over throughput [28]. The homogeneous  $16\times16$  PE configuration with no sophisticated architectural features offers improvement in computing efficiency from 89.58% to 91.82%, with a significant reduction in energy consumption. Fig. 6(c) illustrates the energy-delay product (EDP) for the tapered fixed-point network ResNet-18 while performing inference on the CIFAR-10 dataset. It is worth noting that tapered fixed-point offers 31% improvement in classification accuracy with a negligible EDP overhead (17-30%) as compared to fixed-point. Reduced bit-precision economizes the

local memory storage size and the number of operational cycles in both tapered and standard fixed-point.

Overall, the use of the tapered fixed-point numerical format helps in paving a path towards minimizing the power consumption (inference), which aligns with the ultimate goal envisioned by the TinyML community ( $\leq 1~\text{mw}~[3]$ ). Our analyses have shown that this work's adaptation of the ResNet-18 architecture consumes power within the range of 189-270 mw to classify a single sample of the CIFAR-10 dataset, which befits the low-power requirements of devices that operate on the edge [2]. Additionally, the proposed TENT framework further enhances the performance by leveraging the flexibility of a tapered precision numerical format and its affinity to represent parameters with reduced quantization error (as compared to standard fixed point formats).

#### 7 CONCLUSIONS

This paper presents a low-precision framework, TENT, that offers quantization using tapered fixed point to perform inference on TinyML models. The maximum integer bit width and scaling factor parameters are dynamically selected to best fit the variability of the parameter and activation distributions within each layer of the model. We observe reduction in total quantization error that leads to an improvement in inference accuracy by  $\approx 31\%$  over fixed-point models. Furthermore, we show that tapered fixed-point achieves this with a moderate increase in energy consumption over fixed-point.

#### REFERENCES

- Martín Abadi et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.
- [2] Colby Banbury, Chuteng Zhou, Igor Fedorov, Ramon Matas Navarro, Urmish Thakkar, Dibakar Gope, Vijay Janapa Reddi, Matthew Mattina, and Paul N Whatmough. 2020. MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers. arXiv preprint arXiv:2010.11267 (2020)

<sup>&</sup>lt;sup>2</sup> The number of MACs are calculated for a DNN inference with a batch size of 1.

 $<sup>^{3}</sup>$  2 Convolutional layers, 2 fully-connected layers, and 1 Pooling layer

<sup>&</sup>lt;sup>4</sup> 3 Convolutional layers, 2 fully-connected layers, 2 Pooling layers, 1 Batch normalization layer



Figure 6: Plots depict (a) Maximum frequency, (b) Area of Tapered fixed point and standard fixed point MAC unit, (c) Energy consumption of computing one MAC operation, and (d) Energy-delay product vs error of ResNet-18 model benchmarked with CIFAR-10 for tapered fixed-point and standard fixed-point. The performance was evaluated for different bit widths to represent the weights and activations.

- [3] Colby R Banbury, Vijay Janapa Reddi, Max Lam, William Fu, Amin Fazel, Jeremy Holleman, Xinyuan Huang, Robert Hurtado, David Kanter, Anton Lokhmotov, et al. 2020. Benchmarking TinyML Systems: Challenges and Direction. arXiv preprint arXiv:2003.04821 (2020).
- [4] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).
- [5] Gianmarco Cerutti, Rahul Prasad, Alessio Brutti, and Elisabetta Farella. 2019. Neural Network Distillation on IoT Platforms for Sound Event Detection.. In Interspeech. 3609–3613.
- [6] Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, and Vivienne Sze. 2019. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 2 (2019), 292–308
- [7] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. 2018. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085 (2018).
- [8] Aakanksha Chowdhery, Pete Warden, Jonathon Shlens, Andrew Howard, and Rocky Rhodes. 2019. Visual wake words dataset. arXiv preprint arXiv:1906.05721 (2019)
- [9] Miguel de Prado, Romain Donze, Alessandro Capotondi, Manuele Rusci, Serge Monnerat, Luca Benini, and Nuria Pazos. 2020. Robust navigation with tinyML for autonomous mini-vehicles. arXiv preprint arXiv:2007.00302 (2020).
- [10] Zhen Dong, Zhewei Yao, Daiyaan Arfeen, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. 2020. Hawq-v2: Hessian aware trace-weighted quantization of neural networks. Advances in Neural Information Processing Systems 33 (2020).
- [11] A. T. Elthakeb, P. Pilligundla, F. Mireshghallah, A. Yazdanbakhsh, and H. Esmaeilzadeh. 2020. ReLeQ: A Reinforcement Learning Approach for Automatic Deep Quantization of Neural Networks. IEEE Micro 40, 5 (2020), 37–45.
- [12] Igor Fedorov, Ryan P Adams, Matthew Mattina, and Paul Whatmough. 2019. Sparse: Sparse architecture search for cnns on resource-constrained microcontrollers. In Advances in Neural Information Processing Systems. 4977–4989.
- [13] John L Gustafson and Isaac T Yonemoto. 2017. Beating Floating Point at its Own Game: Posit Arithmetic. Supercomputing Frontiers and Innovations 4, 2 (2017),

- 71-86
- [14] Philipp Gysel, Jon Pimentel, Mohammad Motamedi, and Soheil Ghiasi. 2018. Ristretto: A framework for empirical study of resource-efficient inference in convolutional neural networks. IEEE Transactions on Neural Networks and Learning Systems (2018).
- [15] Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. 2020. Scaling Laws for Autoregressive Generative Modeling. arXiv preprint arXiv:2010.14701 (2020).
- [16] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, et al. 2018. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [17] Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, Raquel Urtasun, and Andreas Moshovos. 2018. Proteus: Exploiting precision variability in deep neural networks. *Parallel Comput.* 73 (2018), 40–51.
- [18] Raghuraman Krishnamoorthi. 2018. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342 (2018).
- [19] Hsiang-Tsung Kung. 1982. Why systolic architectures? IEEE computer 15, 1 (1982), 37–46.
- [20] John L. Gustafson. 2020. A Generalized Framework for Matching Arithmetic Format to Application Requirements. In https://posithub.org/.
- [21] Liangzhen Lai, Naveen Suda, and Vikas Chandra. 2018. CMSIS-NN: efficient neural network kernels for arm cortex-M CPUS. CoRR abs/1801.06601 (2018). arXiv preprint arXiv:1801.06601 (2018).
- [22] Edgar Liberis, Łukasz Dudziak, and Nicholas D Lane. 2020. muNAS: Constrained Neural Architecture Search for Microcontrollers. arXiv preprint arXiv:2010.14246 (2020).
- [23] Ji Lin, Wei-Ming Chen, Yujun Lin, Chuang Gan, Song Han, et al. 2020. Mcunet: Tiny deep learning on iot devices. Advances in Neural Information Processing Systems 33 (2020).
- [24] Jeffrey L McKinstry, Steven K Esser, Rathinakumar Appuswamy, Deepika Bablani, John V Arthur, Izzet B Yildiz, and Dharmendra S Modha. 2018. Discovering low-precision networks close to full-precision networks for efficient embedded inference. arXiv preprint arXiv:1809.04191 (2018).
- [25] S Migacz. 2017. 8-bit inference with TensorRT. In GPU Technology Conference.

- [26] Asit Mishra and Debbie Marr. 2017. Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy. arXiv preprint arXiv:1711.05852 (2017).
- [27] Alex Olsen, Dmitry A Konovalov, Bronson Philippa, Peter Ridd, et al. 2019. Deep-Weeds: A Multiclass Weed Species Image Dataset for Deep Learning. Scientific Reports 9, 1 (2019), 2058.
- [28] David A Patterson. 2004. Latency lags bandwith. *Commun. ACM* 47, 10 (2004), 71–75
- [29] Hadi Pouransari, Zhucheng Tu, and Oncel Tuzel. 2020. Least squares binary quantization of neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 698–699.
- [30] Manuele Rusci, Marco Fariselli, Alessandro Capotondi, and Luca Benini. 2020. Leveraging Automated Mixed-Low-Precision Quantization for tiny edge micro-controllers. arXiv preprint arXiv:2008.05124 (2020).
- [31] Charbel Sakr and Naresh Shanbhag. 2018. An analytical method to determine minimum per-layer precision of deep neural networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1090–1094.
- [32] Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2018. Scale-sim: Systolic cnn accelerator simulator. arXiv preprint arXiv:1811.02883 (2018).

- [33] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
- [34] Andrew W Senior, Richard Evans, John Jumper, James Kirkpatrick, et al. 2020. Improved protein structure prediction using potentials from deep learning. *Nature* (2020), 1–5.
- [35] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. 2019. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2820–2828.
- [36] Murad Tukan, Alaa Maalouf, Matan Weksler, and Dan Feldman. 2020. Compressed deep networks: Goodbye svd, hello robust low-rank approximation. arXiv preprint arXiv:2009.05647 (2020).
- [37] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. 2019. Haq: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8612–8620.
- [38] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).