## EE451 Supervised Research Exposition

Topic : Analog/Mixed-Signal Circuits for Machine Learning Applications

Supervisor : Prof. Rajesh Zele

### Mihir Kavishwar (17D070004) Electrical Engineering, Dual Degree - Microelectronics Indian Institute of Technology, Bombay

#### December 2020

### Contents

- 1 Introduction
- 2 Literature Survey
  - 2.1 Winner-Take-All Circuits
    - 2.1.1 Introduction
    - 2.1.2 Circuit Implementation and Analysis
  - 2.2 Bump Circuits
    - 2.2.1 Introduction
    - 2.2.2 Circuit Implementation and Analysis
  - 2.3 Analog Programmable Multidimensional RBF based Classifier
    - 2.3.1 Introduction
    - 2.3.2 System Overview
    - 2.3.3 Circuit Implementation and Analysis
  - 2.4 Switched-Capacitor Matrix Multiplier
    - 2.4.1 Introduction
    - 2.4.2 Circuit Implementation and Analysis
    - 2.4.3 Applications and Results
  - 2.5 In-Memory Computation of an ML Classifier in 6T-SRAM array
    - 2.5.1 Introduction
    - 2.5.2 System Overview
    - 2.5.3 Circuit Implementation and Analysis
- 3 Simulations
  - 3.1 Setup
  - 3.2 Results

#### 1 Introduction

Over the last couple of decades there have been many advancements in the field of Machine Learning (ML), and today ML algorithms are used in a variety of applications. Given the computation-intensive nature of these algorithms, designing hardware that can meet their power and performance requirements is very important. While efforts are being made to improve existing digital processing architectures, emerging research advocates using analog and mixed-signal circuits for applications with strict power and latency constraints, such as edge devices. Analog/mixed-signal designs trade numerical accuracy for very high energy efficiency and performance, which is what makes them well suited to such applications.

This report summarises the work that was done during the course of one semester under the guidance of Prof. Rajesh Zele and his PhD student, Siva Elangovan, in Advanced Integrated Circuits and Systems (aiCAS) Lab.

#### 2 Literature Survey

I reviewed the following 5 papers:

1. J. Lazzaro, S. Ryckebusch, M. A. Mahowald and C. A. Mead, "Winner-take-all networks of O(N) complexity", Advances in Neural Information Processing Systems 1, Morgan Kaufmann Publishers, San Francisco, CA, 1989.
2. T. Delbruck, "'Bump' circuits for computing similarity and dissimilarity of analog voltages," IJCNN-91-Seattle International Joint Conference on Neural Networks, Seattle, WA, USA, 1991.
3. S. Peng, P. E. Hasler and D. V. Anderson, "An Analog Programmable Multidimensional Radial Basis Function Based Classifier," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 54, no. 10, pp. 2148-2158, Oct. 2007.
4. E. H. Lee and S. S. Wong, "Analysis and Design of a Passive Switched-Capacitor Matrix Multiplier for Approximate Computing," in IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 261-271, Jan. 2017.
5. J. Zhang, Z. Wang and N. Verma, "In-Memory Computation of a Machine-Learning Classifier in a Standard 6T SRAM Array," in IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 915-924, April 2017.

The circuits described in the first 2 papers form the backbone of many complex ML systems being implemented today which use analog/mixed-signal processing. Therefore it is very useful to look at those papers first despite the fact that they were written almost 30 years ago. The next 3 papers are relatively recent, and all of them use different approaches to realise an ML system.

#### 2.1 Winner-Take-All Circuits

Reference: J. Lazzaro, S. Ryckebusch, M. A. Mahowald and C. A. Mead, "Winner-take-all networks of O(N) complexity", Advances in Neural Information Processing Systems 1, Morgan Kaufmann Publishers, San Francisco, CA, 1989.

#### 2.1.1 Introduction

Many classification algorithms in supervised learning use a winner-take-all (WTA) approach to produce the final outputs. The behaviour of a WTA block can be described as follows:

Given N input-output pairs, say

$$(\text{input}_k, \text{output}_k), \quad k \in \{1, 2, \dots, N\}$$

Let

$$m = \underset{k \in \{1, 2, \dots, N\}}{\arg\max} input_k$$

Then

$$output_k = \begin{cases} 1, & if \ k = m \\ 0, & if \ k \neq m \end{cases}$$

That is, only the output corresponding to the largest input is high while all other outputs are 0.
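The behaviour described above can be sketched in a few lines of Python. This is only a behavioural model of the ideal WTA function, not of the circuit; the function name `winner_take_all` is illustrative, and the tie-breaking rule (the first maximum wins) is an assumption that the definition above does not specify.

```python
def winner_take_all(inputs):
    """Return a one-hot list: 1 at the index of the largest input, 0 elsewhere."""
    if not inputs:
        raise ValueError("inputs must be non-empty")
    # argmax over the indices; max() returns the first index on ties (assumption)
    m = max(range(len(inputs)), key=lambda k: inputs[k])
    return [1 if k == m else 0 for k in range(len(inputs))]


outputs = winner_take_all([0.2, 0.9, 0.5])
```

Here `outputs` marks the second input as the winner.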

#### 2.1.2 Circuit Implementation and Analysis



Figure 1. Schematic diagram of the winner-take-all circuit. Each neuron receives a unidirectional current input  $I_k$ ; the output voltages  $V_1 \dots V_n$  represent the result of the winner-take-all computation. If  $I_k = \max(I_1 \dots I_n)$ , then  $V_k$  is a logarithmic function of  $I_k$ ; if  $I_j \ll I_k$ , then  $V_j \approx 0$ .



Figure 2. Schematic diagram of a two-neuron winner-take-all circuit



Figure 3. Experimental data (circles) and theory (solid lines) for a two-neuron winner-take-all circuit.  $I_1$ , the input current of the first neuron, is swept about the value of  $I_2$ , the input current of the second neuron; neuron voltage outputs  $V_1$  and  $V_2$  are plotted versus normalized input current.

To understand how the above circuit works, first let's look at a simpler version with just 2 inputs as shown in Figure 2. All 4 transistors are biased in **subthreshold region**. The equation governing their behaviour is

$$I_{ds} = I_0 e^{\frac{\kappa V_{gs}}{V_T}} (1 - e^{-\frac{V_{ds}}{V_T}}) \tag{1}$$

where $V_T = \frac{kT}{q} \approx 26~\text{mV}$ at room temperature. The factor $\kappa \approx 0.7$ accounts for the back-gate (body) effect. If we assume $V_{ds} \geq 3V_T$, then $e^{-V_{ds}/V_T} \leq e^{-3} \approx 0.05$, so the effect of $V_{ds}$ becomes negligible and the transistor is said to operate in **sub-threshold saturation**:

$$I_{ds} \approx I_0 e^{\frac{\kappa V_{gs}}{V_T}} \tag{2}$$

Consider the case when both input currents are equal, $I_1=I_2\equiv I_m$. Transistors $T_{1_1}$ and $T_{1_2}$ have identical potentials at gate and source, and are both sinking $I_m$; thus, the drain potentials $V_1$ and $V_2$ must be equal, say $V_m$. Transistors $T_{2_1}$ and $T_{2_2}$ have identical source, drain, and gate potentials, and therefore must sink identical currents $I_{c_1}=I_{c_2}=\frac{I_c}{2}$. Assuming both $T_1$ and $T_2$ are in saturation and substituting in equation (2),

$$I_{m} \approx I_{0}e^{\frac{\kappa V_{c}}{V_{T}}} \implies \frac{\kappa V_{c}}{V_{T}} \approx \ln\left(\frac{I_{m}}{I_{0}}\right)$$

$$\frac{I_{c}}{2} \approx I_{0}e^{\frac{\kappa (V_{m}-V_{c})}{V_{T}}} \implies \frac{\kappa (V_{m}-V_{c})}{V_{T}} \approx \ln\left(\frac{I_{c}}{2I_{0}}\right)$$

$$V_{m} \approx \frac{V_{T}}{\kappa}\ln\left(\frac{I_{m}}{I_{0}}\right) + \frac{V_{T}}{\kappa}\ln\left(\frac{I_{c}}{2I_{0}}\right) \tag{3}$$

Thus, for equal input currents, the circuit produces equal output voltages; this behavior is desirable for a winner-take-all circuit. In addition, the output voltage  $V_m$  logarithmically encodes the magnitude of the input current  $I_m$ .

Now consider the case when $I_1 = I_m + \delta_i$ and $I_2 = I_m$. Transistor $T_{1_1}$ must sink $\delta_i$ more current than in the previous example; as a result, the gate voltage of $T_{1_1}$ rises. Transistors $T_{1_1}$ and $T_{1_2}$ share a common gate, however; thus, $T_{1_2}$ must also sink $I_m + \delta_i$. But only $I_m$ is present at the drain of $T_{1_2}$. To compensate, the drain voltage of $T_{1_2}$, $V_2$, must decrease. For small $\delta_i$, the Early effect serves to decrease the current through $T_{1_2}$, decreasing $V_2$ linearly with $\delta_i$. For large $\delta_i$, $T_{1_2}$ must leave saturation, driving $V_2$ to approximately 0 volts. As desired, the output associated with the smaller input diminishes. For large $\delta_i$, $I_{c_2} \approx 0$, and $I_{c_1} = I_c$. Therefore,

$$I_{m} + \delta_{i} \approx I_{0}e^{\frac{\kappa V_{c}}{V_{T}}} \implies \frac{\kappa V_{c}}{V_{T}} \approx \ln\left(\frac{I_{m} + \delta_{i}}{I_{0}}\right)$$

$$I_{c_{1}} \approx I_{c} \approx I_{0}e^{\frac{\kappa(V_{1} - V_{c})}{V_{T}}} \implies \frac{\kappa(V_{1} - V_{c})}{V_{T}} \approx \ln\left(\frac{I_{c}}{I_{0}}\right)$$

$$V_{1} \approx \frac{V_{T}}{\kappa}\ln\left(\frac{I_{m} + \delta_{i}}{I_{0}}\right) + \frac{V_{T}}{\kappa}\ln\left(\frac{I_{c}}{I_{0}}\right) \quad ; \quad V_{2} \approx 0 \tag{4}$$

The same logic extends to a circuit with N inputs. The largest input current sets the common gate voltage $V_c$, which forces the other transistors out of saturation because they sink smaller currents. The complexity is O(N) because the design scales linearly with the number of inputs: each additional input adds one two-transistor neuron and a connection to the common line. The paper also briefly discusses the time response of this circuit and the condition for stability. A local winner-take-all circuit is also discussed, where the inhibitory action is limited to adjacent neurons only. I have not included that analysis here as the principle remains the same.
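To get a feel for the magnitudes in equations (3) and (4), the two limiting cases of the two-neuron circuit can be evaluated numerically. This is a rough sketch under stated assumptions: the value of `I_0` is hypothetical (it is process dependent and not given in this report), and the function names are illustrative.

```python
import math

V_T = 0.026    # thermal voltage kT/q at room temperature, in volts
KAPPA = 0.7    # back-gate coefficient quoted in the text
I_0 = 1e-15    # assumed pre-exponential current in amperes (hypothetical)


def v_equal(i_m, i_c):
    """Equation (3): common output voltage when both inputs equal i_m."""
    return (V_T / KAPPA) * (math.log(i_m / I_0) + math.log(i_c / (2 * I_0)))


def v_winner(i_m, delta_i, i_c):
    """Equation (4): output voltage of the winner for a large input difference."""
    return (V_T / KAPPA) * (math.log((i_m + delta_i) / I_0) + math.log(i_c / I_0))
```

Because the encoding is logarithmic, doubling the input current in `v_equal` raises the output by the fixed step $\frac{V_T}{\kappa}\ln 2$, regardless of the starting current.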



Figure 8. Schematic diagram of a section of the local winner-take-all circuit. Each neuron i receives a unidirectional current input  $I_i$ ; the output voltages  $V_i$  represent the result of the local winner-take-all computation.



Figure 9. Experimental data showing the spatial impulse response of the local winner-take-all circuit, for values of  $t_{\rm L}/t_{\rm c}$  ranging over a factor of 12.7. Wider inhibitory responses correspond to larger ratios. For clarity, the plots are vertically displaced in 0.25 volt increments.

#### 2.2 Bump Circuits

Reference: T. Delbruck, "'Bump' circuits for computing similarity and dissimilarity of analog voltages," IJCNN-91-Seattle International Joint Conference on Neural Networks, Seattle, WA, USA, 1991.

#### 2.2.1 Introduction

A similarity measure or similarity function is a real-valued function that quantifies the similarity between two objects. Such measures are in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. The notion of similarity is commonly used in many ML tasks:

- Clustering algorithms such as K-means clustering use Euclidean distance to compute similarity between two data points
- Radial Basis Function (RBF) Kernel which is commonly used in Support Vector Machines is actually a similarity function

Bump circuits implement a gaussian-like similarity function which "bumps" if the input voltages are close to each other and dies down if they are far apart. Anti-bump circuits implement a dis-similarity function which is like an inverse gaussian - it gives high output if input voltages are far apart and low output if they are close by.

#### 2.2.2 Circuit Implementation and Analysis



The similarity output from the circuit, given as a current, becomes large when the input voltages are close to each other. Bias current $I_b$ is set by bias voltage $V_b$. Transistors $Q_{1-6}$ operate in subthreshold, and $Q_4$ and $Q_5$ have W:L ratios w times those of the other transistors. Further, $Q_1$, $Q_2$ and $Q_5$ are assumed to be in subthreshold saturation, meaning that we can neglect the contribution of $V_{ds}$ in the drain current equation. First let's analyse the differential pair. Using the same subthreshold current equations (1) and (2) from the previous section, with the assumption that all voltages are in units of $\frac{kT}{q}$,

$$I_{1} = I_{0}e^{\kappa(V_{1} - V_{c})} \quad ; \quad I_{2} = I_{0}e^{\kappa(V_{2} - V_{c})}$$

$$I_{1} + I_{2} = I_{b}$$

$$\implies I_{1} = I_{b}\frac{e^{\kappa V_{1}}}{e^{\kappa V_{1}} + e^{\kappa V_{2}}} \tag{5}$$

$$\text{and} \quad I_{2} = I_{b}\frac{e^{\kappa V_{2}}}{e^{\kappa V_{1}} + e^{\kappa V_{2}}} \tag{6}$$

From equations (5) and (6) we can conclude that if $|\Delta V| = |V_1 - V_2|$ is larger than a few $\frac{kT}{q}$, then the current in one of the two legs will shut off.
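The current steering in the differential pair can be checked numerically. The sketch below evaluates the two leg currents with voltages expressed in units of $\frac{kT}{q}$, as in the derivation; the function name and the default values of `i_b` and `kappa` are illustrative assumptions.

```python
import math


def diff_pair_currents(v1, v2, i_b=1.0, kappa=0.7):
    """Leg currents (I1, I2) of the differential pair; v1, v2 in units of kT/q."""
    # Subtract the larger exponent before exponentiating to avoid overflow.
    shift = kappa * max(v1, v2)
    e1 = math.exp(kappa * v1 - shift)
    e2 = math.exp(kappa * v2 - shift)
    return i_b * e1 / (e1 + e2), i_b * e2 / (e1 + e2)


balanced = diff_pair_currents(0.0, 0.0)
steered = diff_pair_currents(10.0, 0.0)
```

With equal inputs each leg carries half of the bias current, while a difference of 10 thermal voltages (with $\kappa = 0.7$) leaves roughly 0.1% of $I_b$ in the losing leg.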

Now let's look at the upper part of the circuit, which is called a current correlator. The paper cites a reference which uses a slightly modified subthreshold current equation: $(\kappa V_{gs})$ is replaced by $(\kappa V_g - V_s)$ when the body is connected to the appropriate supply rail. For $\kappa$ close to 1, both give the same result, and therefore I'll use the latter as it simplifies the analysis.



$$I_{out} = wI_0 e^{V_{dd} - \kappa V_x} (1 - e^{-(V_{dd} - V_z)}) = wI_0 e^{V_z - \kappa V_y}$$

$$\implies e^{V_{dd} - \kappa V_x} - e^{V_z - \kappa V_x} = e^{V_z - \kappa V_y}$$

$$\implies e^{V_z} = \frac{e^{V_{dd} - \kappa V_x}}{e^{-\kappa V_x} + e^{-\kappa V_y}}$$

Substituting  $e^{V_z}$ ,

$$I_{out} = wI_0 \frac{e^{(V_{dd} - \kappa V_x)} e^{(V_{dd} - \kappa V_y)}}{e^{(V_{dd} - \kappa V_x)} + e^{(V_{dd} - \kappa V_y)}} \approx w \frac{I_1 I_2}{I_1 + I_2} \tag{7}$$

Intuitively, we can think about this circuit as an analog AND gate - for output to be high, both inputs need to be high. Substituting (5) and (6) in (7) gives,

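$$I_{out} \approx w \frac{I_b}{4} \operatorname{sech}^2 \left( \frac{\kappa (V_1 - V_2)}{2} \right) \tag{8}$$

where $\operatorname{sech}(x) = \frac{2}{e^x + e^{-x}}$. As a check, at $V_1 = V_2$ both legs carry $\frac{I_b}{2}$, so (7) gives $I_{out} \approx w\frac{I_b}{4}$.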

The following circuit computes both similarity and dissimilarity outputs as currents.



Bias current  $I_b$  is set by  $V_b$ .  $Q_{1-4}$  are in subthreshold.  $Q_3$  and  $Q_4$  have their aspect ratio scaled by  $\mathbf{w}$ .  $Q_1$ ,  $Q_2$  and  $Q_4$  are assumed to be in subthreshold saturation.

$$I_1 = I_0 e^{\kappa V_1 - V_c} \quad ; \quad I_2 = I_0 e^{\kappa V_2 - V_c}$$

$$I_1 + I_2 + I_{out} = I_b$$

To solve for  $I_1$  or  $I_2$ , first we will have to calculate  $I_{out}$  in terms of  $I_1$  and  $I_2$ .

$$\begin{split} I_{out} &= w I_0 e^{\kappa V_1 - V_c} (1 - e^{-(V_z - V_c)}) = w I_0 e^{\kappa V_2 - V_z} \\ &\implies e^{\kappa V_1 - V_c} - e^{\kappa V_1 - V_z} = e^{\kappa V_2 - V_z} \\ &\implies e^{-V_z} = \frac{e^{\kappa V_1 - V_c}}{e^{\kappa V_1} + e^{\kappa V_2}} \end{split}$$

Substituting  $e^{-V_z}$ ,

$$I_{out} = w I_0 \frac{e^{(\kappa V_1 - V_c)} e^{(\kappa V_2 - V_c)}}{e^{(\kappa V_1 - V_c)} + e^{(\kappa V_2 - V_c)}} \approx w \frac{I_1 I_2}{I_1 + I_2} \tag{9}$$

Substituting this back into KCL equation we get,

$$I_{out} = \frac{I_b}{1 + \frac{4}{w} \cosh^2\left(\frac{\kappa(V_1 - V_2)}{2}\right)} \tag{10}$$

where $\cosh(x) = \frac{e^x + e^{-x}}{2}$. We get the anti-bump current by adding $I_1$ and $I_2$,

$$I_1 + I_2 = I_b - I_{out}$$
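The bump and anti-bump outputs of this circuit can be tabulated directly from equation (10) and the KCL relation. The sketch below also checks numerically that equation (10) is consistent with the correlator relation (9). The function names and the default values of `i_b`, `w` and `kappa` are illustrative assumptions; voltages are in units of $\frac{kT}{q}$.

```python
import math


def bump_current(dv, i_b=1.0, w=4.0, kappa=0.7):
    """Equation (10): similarity (bump) current; dv = V1 - V2 in units of kT/q."""
    return i_b / (1.0 + (4.0 / w) * math.cosh(kappa * dv / 2.0) ** 2)


def anti_bump_current(dv, i_b=1.0, w=4.0, kappa=0.7):
    """Dissimilarity (anti-bump) current I1 + I2 = I_b - I_out."""
    return i_b - bump_current(dv, i_b, w, kappa)


def correlator_residual(dv, i_b=1.0, w=4.0, kappa=0.7):
    """Return I_out - w*I1*I2/(I1+I2); close to zero if (9) and (10) agree."""
    i_out = bump_current(dv, i_b, w, kappa)
    total = i_b - i_out                 # I1 + I2 from KCL
    ratio = math.exp(kappa * dv)        # I1 / I2 for the differential pair
    i1 = total * ratio / (1.0 + ratio)
    i2 = total / (1.0 + ratio)
    return i_out - w * i1 * i2 / (i1 + i2)
```

For `w = 4` the bump current peaks at half of $I_b$ when the inputs are equal, and the anti-bump current approaches $I_b$ as the inputs move apart.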

The following is an extension of bump-anti-bump circuit. It was designed to produce a transconductance element that would ignore small voltage offsets on the input voltage,  $\Delta V$ , while retaining a monotonic yet saturating output characteristic for larger inputs.



#### 2.3 Analog Programmable Multidimensional RBF based Classifier

Reference: S. Peng, P. E. Hasler and D. V. Anderson, "An Analog Programmable Multidimensional Radial Basis Function Based Classifier," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 54, no. 10, pp. 2148-2158, Oct. 2007.

#### 2.3.1 Introduction

A real-valued function  $\varphi$  whose value depends only on the distance between the input and some fixed point,  $\mathbf{c}$ , called a center, so that  $\varphi(\mathbf{x}) = \varphi(\|\mathbf{x} - \mathbf{c}\|)$  is called Radial Basis Function (RBF). A classifier which uses RBFs to determine decision boundaries is called an RBF classifier. A vector quantizer (multi-class classifier) compares distances or similarities between input vectors and the stored templates and classifies the input data to the most representative template. This paper discusses the development of an analog Gaussian response function (which is an RBF) having a diagonal covariance matrix and demonstrates its application to vector quantization.

#### 2.3.2 System Overview





Fig. 12. Architecture of an analog vector quantizer. The core is the bump cell array followed by a WTA circuit. The main complexity from programming is at the peripheries and the system can be scaled up easily.

The classification algorithm uses multidimensional Gaussian function with diagonal covariance matrix  $\Sigma$ , mean vector  $\mu$ , and scaling factor K as the RBF function

$$f_{\mu,\Sigma,K}(\mathbf{x}) = K \exp\left(-\frac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1}(\mathbf{x} - \mu)\right)$$

The following block diagram explains the classification algorithm:



$f_i(\cdot)$ are all multivariate Gaussian functions with different values of $\mu$, $\Sigma$ and K. Therefore,

$$\text{class into which input is classified} = \underset{i}{\arg\max}\, f_{\mu_i, \Sigma_i, K_i}(\mathbf{x}) \tag{11}$$
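In software, the classification rule of equation (11) amounts to evaluating each Gaussian response and taking the index of the largest. The following is a behavioural sketch only; the two templates and all function names are hypothetical, and the diagonal covariance matrix is stored as a list of per-dimension variances.

```python
import math


def gaussian_rbf(x, mu, var, k):
    """K * exp(-0.5 * (x-mu)^T Sigma^-1 (x-mu)) for a diagonal Sigma.

    var holds the diagonal entries of Sigma (one variance per dimension).
    """
    exponent = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
    return k * math.exp(-0.5 * exponent)


def classify(x, templates):
    """Equation (11): index of the template with the largest response."""
    responses = [gaussian_rbf(x, mu, var, k) for mu, var, k in templates]
    return max(range(len(responses)), key=lambda i: responses[i])


# Two hypothetical 2-D templates: (mean, diagonal variances, height K).
templates = [([0.0, 0.0], [1.0, 1.0], 1.0),
             ([3.0, 3.0], [0.5, 0.5], 1.0)]
```

In the analog system the bump cell array plays the role of `gaussian_rbf` and the WTA circuit plays the role of the `max` in `classify`.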

#### 2.3.3 Circuit Implementation and Analysis

The function implemented by bump circuit can be approximated as a gaussian but we need to modify it so that we can control center, width and height of the curve ( $\mu$ ,  $\Sigma$  and K respectively). We also have to make it multi-dimensional.



Fig. 2. Schematics of the new floating-gate bump circuit. All floating-gate transistors in the schematics have two inputs with equal weights and the floating-gate voltage can be expressed as $V_{\rm fg}=1/2(V_{\rm con1}+V_{\rm con2})+V_Q$, where $V_Q=Q_{\rm fg}/C_T$, $Q_{\rm fg}$ is the charge on the floating gate and $C_T$ is the total capacitance from the floating gate. The bump circuit is composed of an inverse generation block, a fully differential VGA, and a conventional bump circuit. The width and the center of the bell-shaped transfer function are set by the common-mode and differential charges on $M_{21}$ and $M_{22}$. The height is controlled by the tail current $I_{\rm b}$. All of them are independently programmable.

The paper implements a **Floating Gate Bump Circuit** as shown above. The circuit comprises 2-input floating-gate transistors which can be programmed using Fowler–Nordheim tunneling and channel hot electron (CHE) injection mechanisms.

$$V_{fg} = \frac{1}{2}(V_{con1} + V_{con2}) + V_Q \quad ; \quad V_Q = \frac{Q_{fg}}{C_T} \tag{12}$$

The first step for realising desired function is additive inverse generation. Using the above equation for this circuit we can see,

$$V_{in1} + V_{1c} = V_{in2} + V_{2c} = V_{const}$$

$$\implies V_{1c} = V_{const} - V_{in1} \quad ; \quad V_{2c} = V_{const} - V_{in2}$$





Fig. 3. Inverse generation transfer characteristics. A floating-gate summing amplifier generates a complementary input voltage. The outputs are fed to floating-gate transistors in the VGA so that the outputs of the VGA are independent of the input common-mode signals. With an appropriate bias voltage, the operating range is one $V_{\rm DSsat}$ away from the supply rails.

To proceed, we first define $\Delta V_{in} = V_{in1} - V_{in2}$ and 2 quantities,

$$V_{Q,cm} = \frac{V_{Q,21} + V_{Q,22} + V_{const}}{2} \quad ; \quad V_{Q,dm} = V_{Q,21} - V_{Q,22}$$

Now,

$$V_{fg,21} = \frac{V_{in1} + V_{2c}}{2} + V_{Q,21} = \frac{V_{in1} + V_{const} - V_{in2}}{2} + V_{Q,21} = \frac{\Delta V_{in}}{2} + \frac{V_{const}}{2} + V_{Q,21}$$

$$V_{fg,21} = \frac{1}{2}\Delta V_{in} + \frac{1}{2}V_{Q,dm} + V_{Q,cm} \tag{13}$$

$$V_{fg,22} = \frac{V_{in2} + V_{1c}}{2} + V_{Q,22} = \frac{V_{in2} + V_{const} - V_{in1}}{2} + V_{Q,22} = -\frac{\Delta V_{in}}{2} + \frac{V_{const}}{2} + V_{Q,22}$$

$$V_{fg,22} = -\frac{1}{2}\Delta V_{in} - \frac{1}{2}V_{Q,dm} + V_{Q,cm} \tag{14}$$

$$V_1 - V_2 = \eta \cdot (V_{fg,21} - V_{fg,22}) = \eta \cdot (\Delta V_{in} + V_{Q,dm}) \tag{15}$$

Where  $\eta$  is the gain of VGA and depends on  $V_{Q,cm}$ . Substituting (15) in (8) we get,

$$I_{out} \approx K \exp\left(-\eta'(\Delta V_{in} + V_{Q,dm})^2\right) \tag{16}$$

Here we have assumed  $sech^2(x) \approx e^{-x^2}$ . K is a function of bias current  $I_b$  and  $\eta'$  is a function of  $V_{Q,cm}$ . Therefore by programming  $I_b$ ,  $V_{Q,cm}$  and  $V_{Q,dm}$  we can vary height, width and center of the curve respectively.
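The effect of the three programmable quantities, and the quality of the $sech^2(x) \approx e^{-x^2}$ approximation, can be explored with a short numerical sketch. It is a behavioural model of equation (16) only; the function names and all parameter values are illustrative assumptions.

```python
import math


def fg_bump(dv_in, v_q_dm, eta_prime, k=1.0):
    """Equation (16): Gaussian-like output of the floating-gate bump cell."""
    return k * math.exp(-eta_prime * (dv_in + v_q_dm) ** 2)


def sech2(x):
    """sech(x)^2 with sech(x) = 2 / (e^x + e^-x)."""
    return (2.0 / (math.exp(x) + math.exp(-x))) ** 2


# Largest absolute gap between sech^2(x) and exp(-x^2) on a grid over [-3, 3].
grid = [i / 100.0 for i in range(-300, 301)]
max_gap = max(abs(sech2(x) - math.exp(-x * x)) for x in grid)
```

The peak of `fg_bump` sits at $\Delta V_{in} = -V_{Q,dm}$, a larger `eta_prime` narrows the curve, and `k` scales its height. The value of `max_gap` indicates how far the two curves drift apart away from the peak, where they agree exactly.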





Fig. 5. Gaussian fits of the transfer curves and the width dependence. (a) Comparison of the measured 1-D bumps (circles) and the corresponding Gaussian fits (dashed lines). One of the bump input voltages is fixed at $V_{\rm DD}/2$, where $V_{\rm DD}$ is 3.3 V throughout the measurement. The extracted standard deviation varies SR times and the mean only shifts 4.29%. The minimum achievable extracted standard deviation is 0.199 V. (b) Width and common-mode charge relation in semi-logarithmic scale. The width is characterized by the extracted $\sigma$. The shift of the programmed common-mode floating gate voltage, $\Delta V_{\rm Com}$, represents the common-mode charge level. The dashed line is the exponential curve fit.

Now, to extend this to multiple dimensions we simply cascade many bump cells, using the output current of one cell as the bias current of the next ($I_{b2} = I_{out1}$). Each cell contributes a peak gain of $\frac{w}{4}$, which follows from (7) with $I_1 = I_2$:

$$I_{out2} \approx w_2 \frac{I_{b2}}{4} \exp \left( -\eta_2' (\Delta V_{in2} + V_{Q,dm2})^2 \right)$$

$$\approx w_1 w_2 \frac{I_{b1}}{16} \exp \left( -\eta_1' (\Delta V_{in1} + V_{Q,dm1})^2 \right) \exp \left( -\eta_2' (\Delta V_{in2} + V_{Q,dm2})^2 \right)$$

$$I_{out2} \approx K' \exp \left( -\eta_1' (\Delta V_{in1} + V_{Q,dm1})^2 - \eta_2' (\Delta V_{in2} + V_{Q,dm2})^2 \right) \tag{17}$$







#### 2.4 Switched-Capacitor Matrix Multiplier

Reference: E. H. Lee and S. S. Wong, "Analysis and Design of a Passive Switched-Capacitor Matrix Multiplier for Approximate Computing," in IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 261-271, Jan. 2017.

#### 2.4.1 Introduction

This paper presents a switched-capacitor matrix multiplier for approximate computing and machine learning applications. Let x be the  $n \times 1$  input vector, A be an  $m \times n$  matrix whose elements are stored in memory and y = Ax be the  $m \times 1$  output vector; then

$$y[j] = \sum_{i=1}^{n} A[j, i].x[i]$$

x can be in analog or digital domain. If it's in digital we need a DAC to convert it to analog since the computation happens in analog domain. The output is again converted to digital using an ADC.



#### 2.4.2 Circuit Implementation and Analysis



Fig. 2. (a) Active analog MAC and (b) passive analog MAC implementations.





(b) Passive SC-MAC

First let's analyse Active SC-MAC.

$$Q_{1,\phi_{1}}[i] = C_{1}[i].V_{in}[i] \quad ; \quad Q_{2,\phi_{1}}[i] = Q_{2,\phi_{2}}[i-1]$$

$$Q_{1,\phi_{2}}[i] = Q_{1,\phi_{1}}[i] - \Delta Q = C_{1}[i].V_{in}[i] - \Delta Q = -C_{1}[i].V_{x}[i] \tag{18}$$

$$Q_{2,\phi_{2}}[i] = Q_{2,\phi_{1}}[i] + \Delta Q = Q_{2,\phi_{2}}[i-1] + \Delta Q = C_{2}.(V_{y}[i] - V_{x}[i]) = -C_{2}.V_{x}[i].(1+A_{0}) \tag{19}$$

Eliminating  $\Delta Q$  from above 2 equations,

$$C_1[i].V_{in}[i] + Q_{2,\phi_2}[i-1] = -(C_1[i] + C_2.(1+A_0)).V_x[i]$$

$$\implies V_x[i] = -\frac{C_1[i].V_{in}[i]}{C_1[i] + C_2.(1 + A_0)} - \frac{Q_{2,\phi_2}[i-1]}{C_1[i] + C_2.(1 + A_0)}$$

Substituting back in equation (19),

$$V_{C_2}[i] = \frac{C_1[i].(1+A_0)}{C_1[i]+C_2.(1+A_0)}V_{in}[i] + \frac{C_2.(1+A_0)}{C_1[i]+C_2.(1+A_0)}V_{C_2}[i-1] \tag{20}$$

Let  $k[i] = \frac{C_2(1+A_0)}{C_1[i]+C_2(1+A_0)}$  and  $\mu[i] = \frac{C_1[i]}{C_2}$  then,

$$V_{C_2}[i] = \mu[i].k[i].V_{in}[i] + k[i].V_{C_2}[i-1]$$

We only care about the final result after n cycles, therefore expanding the recursive equation gives us

$$V_{C_2}[n] = \sum_{i=1}^n \left( \mu[i].V_{in}[i]. \prod_{j=i}^n k[j] \right) \tag{21}$$

We can write  $V_{C_2}[n]$  as dot product of 2 vectors,

$$V_{C_2}[n] = \begin{bmatrix} \mu[1] \prod_{j=1}^n k[j] & \mu[2] \prod_{j=2}^n k[j] & \dots & \mu[n]k[n] \end{bmatrix} \begin{bmatrix} V_{in}[1] \\ V_{in}[2] \\ \vdots \\ V_{in}[n] \end{bmatrix}$$

Performing this operation on m rows in the matrix A results in a nonideal matrix operation  $y = \tilde{A}x$ , where

$$\tilde{A} = \begin{bmatrix} \mu[1,1] \prod_{j=1}^{n} k[1,j] & \mu[1,2] \prod_{j=2}^{n} k[1,j] & \dots & \mu[1,n]k[1,n] \\ \mu[2,1] \prod_{j=1}^{n} k[2,j] & \mu[2,2] \prod_{j=2}^{n} k[2,j] & \dots & \mu[2,n]k[2,n] \\ \vdots & \vdots & \ddots & \vdots \\ \mu[m,1] \prod_{j=1}^{n} k[m,j] & \mu[m,2] \prod_{j=2}^{n} k[m,j] & \dots & \mu[m,n]k[m,n] \end{bmatrix} \tag{22}$$

After doing a similar analysis for passive SC-MAC case, we get

$$V_{C_2}[i] = \frac{C_1[i]}{C_1[i] + C_2} V_{in}[i] + \frac{C_2}{C_1[i] + C_2} V_{C_2}[i-1] \tag{23}$$

This can also be written as,

$$V_{C_2}[i] = \mu[i].k[i].V_{in}[i] + k[i].V_{C_2}[i-1]$$

where  $k[i] = \frac{C_2}{C_1[i] + C_2}$  and  $\mu[i] = \frac{C_1[i]}{C_2}$ . Therefore  $\tilde{A}$  describes the matrix for passive case too. Note that ideally we would like k[i] = 1 so that  $\tilde{A} = A$ .
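The recursion and the resulting effective matrix row can be cross-checked numerically. The sketch below iterates the accumulation step by step and, separately, builds one row of $\tilde{A}$ in closed form; the two should give the same dot product. The function names, the initial condition $V_{C_2}[0] = 0$ and all capacitor values are illustrative assumptions.

```python
def sc_mac(v_in, c1, c2, a0=None):
    """Iterate V_C2[i] = mu[i]*k[i]*V_in[i] + k[i]*V_C2[i-1] from V_C2[0] = 0.

    a0=None selects the passive SC-MAC (equation 23); a numeric a0 selects
    the active SC-MAC with finite amplifier gain (equation 20).
    """
    gain = 1.0 if a0 is None else 1.0 + a0
    v = 0.0
    for vin_i, c1_i in zip(v_in, c1):
        k = c2 * gain / (c1_i + c2 * gain)
        mu = c1_i / c2
        v = mu * k * vin_i + k * v
    return v


def effective_row(c1, c2, a0=None):
    """One row of A-tilde (equation 22): mu[i] * prod_{j=i}^{n} k[j]."""
    gain = 1.0 if a0 is None else 1.0 + a0
    k = [c2 * gain / (c + c2 * gain) for c in c1]
    row = []
    for i in range(len(c1)):
        prod = 1.0
        for kj in k[i:]:
            prod *= kj
        row.append((c1[i] / c2) * prod)
    return row
```

With a very large amplifier gain `a0`, every `k` approaches 1 and the effective row approaches the ideal weights $\frac{C_1[i]}{C_2}$; in the passive case the earlier inputs are attenuated more, since their charge is shared over more cycles.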







To correct for the non-ideality given in equation (22), the authors proposed multiplying the output by a normalisation constant and then applying another matrix B in digital domain. Formally, we solve for B in

$$\min_{B \in \Omega_B} ||A - B\tilde{A}||_F$$

where $\|\cdot\|_F$ is the Frobenius norm, A is the intended ideal matrix, $\tilde{A}$ is the actual matrix due to incomplete charge accumulation, B is the correction matrix, and $\Omega_B$ is the set of possible values that B can take on.
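As an illustration of this minimisation, consider the simplest possible choice of $\Omega_B$: diagonal matrices, i.e. one digital gain per output row. This restriction is an assumption made here for the sake of a short example and is not necessarily the set used in the paper. With a diagonal B the problem separates by row, and each gain has the closed-form least-squares solution $b_j = \frac{\langle A_j, \tilde{A}_j \rangle}{\langle \tilde{A}_j, \tilde{A}_j \rangle}$.

```python
def diagonal_correction(a_ideal, a_tilde):
    """Per-row gains b_j minimising sum_j ||A[j] - b_j * A_tilde[j]||^2."""
    gains = []
    for row_a, row_t in zip(a_ideal, a_tilde):
        num = sum(x * y for x, y in zip(row_a, row_t))
        den = sum(y * y for y in row_t)
        gains.append(num / den if den else 0.0)
    return gains


def frobenius_error(a_ideal, a_tilde, gains):
    """||A - B @ A_tilde||_F for B = diag(gains)."""
    return sum((x - g * y) ** 2
               for row_a, row_t, g in zip(a_ideal, a_tilde, gains)
               for x, y in zip(row_a, row_t)) ** 0.5
```

Since the identity matrix is itself diagonal, the corrected error can never exceed the uncorrected error $\|A - \tilde{A}\|_F$.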

#### 2.4.3 Applications and Results

The paper discusses results for two applications where the SC-MM was used:

1. Energy-efficient feature extraction layer performing both compression and classification in a neural network for an analog front end
2. Analog acceleration for solving optimization problems that are traditionally performed in the digital domain



|                                                 | Conventional | This work    |
|-------------------------------------------------|--------------|--------------|
| Top-3 Accuracy (%)                              | 86           | 85           |
| Layer's Energy/Op (fJ) at 1 GHz                 | 145*         | 13           |
| # of A/Ds per image at 6b                       | 3072         | 144          |
| Resolution                                      | 6b/4b/6b     | Analog/3b/6b |
| NMSE of feature outputs (avg. over all batches) | 0.0033       | 0.0054       |



|                        | ISSCC 2015 [28]   | This Work         | This Work          |
|------------------------|-------------------|-------------------|--------------------|
| Technology             | 130nm             | 40nm              | 40nm               |
| Application            | Sensor Classifier | Sensor Classifier | Analog Accelerator |
| Analog Multiply Rate   | 20kHz*            | 1GHz              | 2.5GHz             |
| Analog Accumulate Rate | N/A               | 1GHz              | 2.5GHz             |
| A/D Rate               | 20kS/s            | 15MS/s            | 39MS/s             |
| Resolution             | Analog / 4b* / 8b | Analog / 3b / 6b  | 6b / 3b / 6b       |
| Supply                 | 1.2V              | 1V                | 1.1V               |
| Total Power            | 663nW             | 228uW**           | 647uW**            |
| Efficiency (TOPS/W)    | 0.0603*           | 8.77**            | 7.72**             |

#### 2.5 In-Memory Computation of an ML Classifier in 6T-SRAM array

Reference: J. Zhang, Z. Wang and N. Verma, "In-Memory Computation of a Machine-Learning Classifier in a Standard 6T SRAM Array," in IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 915-924, April 2017.

#### 2.5.1 Introduction

An underlying limitation emerges in current architectures for digital accelerators, which separate data storage from computation. Storing the data fundamentally associated with a computation requires area, and thus its communication to the location of computation incurs energy and throughput costs, which can dominate. This has motivated thinking about architectures that integrate some forms of memory and compute. This paper presents a machine learning classifier where data storage and computation (MAC operations) are combined in a standard 6T SRAM array.

#### 2.5.2 System Overview





In the SRAM mode, the operation is typical read/write of digital data. This is how machine-learning models derived from training are stored in bit cells.

In the Classify Mode, all wordlines (WLs) are driven at once to analog voltages. Thus, parallel operation of all bit cells is involved (by comparison, in the SRAM mode only one WL is driven at a time). Each analog WL voltage corresponds to a feature in a feature vector we wish to classify. The features are provided as digital data, loaded through the feature-vector buffer, and analog voltages are generated using a WLDAC in each row.

Every column based classifier computes a decision d given by

$$d = sgn\left(\sum_{i=1}^{N} w_i \times x_i\right) \tag{24}$$

where  $x_i$  corresponds to elements from a feature vector  $\overrightarrow{x}$ , and  $w_i$  corresponds to weights in a weight vector  $\overrightarrow{w}$ , derived from training. The circuit implementation makes this a "weak classifier" (cannot be trained to fit arbitrary data distributions) because not only are the bit-cell currents non-ideal, the weights are also restricted to +1 and -1. A "strong classifier" can be constructed by combining many such columns, as shown in figure 17, using a technique called Error-Adaptive Classifier Boosting.
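The decision rule of equation (24) with weights restricted to +1 and -1 can be written as a small reference model. This is a behavioural sketch only: the function name is illustrative, and mapping a sum of exactly zero to +1 is an assumed convention.

```python
def weak_classify(x, w):
    """Equation (24): sign of the dot product of features x and +/-1 weights w."""
    if any(wi not in (-1, 1) for wi in w):
        raise ValueError("weights are restricted to +1 and -1")
    s = sum(wi * xi for wi, xi in zip(w, x))
    # sgn(0) is mapped to +1 here (assumption)
    return 1 if s >= 0 else -1
```

For example, `weak_classify([5, 1, 2], [1, -1, -1])` evaluates the sign of 5 - 1 - 2 and therefore returns 1.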

#### 2.5.3 Circuit Implementation and Analysis




In SRAM mode, the read and write operations work exactly like those in standard 6T-SRAM. All the transistors have to be scaled appropriately to prevent read upset and enable write operation.

In classify mode, the wordline $(WL_i)$ carries the input analog voltage. Consider the circuit where the weight is -1, as shown in the figure. Both BL and BLB are precharged, so current flows only through the access transistor connecting BL to the internal node storing 0. Since $V_{WLi} < V_{DD}$, it operates in saturation:

$$I_{BC,i} \approx \frac{K}{2}.(V_{WLi} - V_{tn})^2$$

The flow of current causes discharge of capacitor on BL.

$$\Delta V_{BL,i} = \frac{I_{BC,i} \times \Delta t}{C_{BL}}$$

Similarly, if the weight stored is +1, current flows through access transistor connected to BLB and causes discharge of capacitor on BLB. Parallel currents add up and result in some total discharge on both BL and BLB. A sense amplifier senses which line has greater discharge and gives 1 bit output.

$$d = sgn\left(\Delta V_{BLB,tot} - \Delta V_{BL,tot}\right) \propto sgn\left(\sum_{i=1}^{N} I_{BC,i} \times w_i\right) \tag{25}$$

This is what we want to implement as per equation (24), if  $I_{BC,i}$  represents input.
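The column operation can be mimicked with a first-order model: each bit cell draws a square-law current set by its WL voltage, and the currents discharge BL or BLB depending on the stored weight. Everything numeric here is hypothetical (`k`, `v_tn`, `dt`, `c_bl`), and the model ignores effects such as the dependence of the cell current on the bitline voltage, so it is a sketch of equation (25) rather than of the real circuit.

```python
def bitcell_current(v_wl, k=1e-4, v_tn=0.4):
    """Square-law access-transistor current; taken as zero below threshold."""
    return 0.5 * k * max(v_wl - v_tn, 0.0) ** 2


def column_decision(v_wl, weights, dt=1e-9, c_bl=1e-13):
    """Equation (25): compare total discharge on BLB (w=+1) and BL (w=-1)."""
    i_bl = sum(bitcell_current(v) for v, w in zip(v_wl, weights) if w == -1)
    i_blb = sum(bitcell_current(v) for v, w in zip(v_wl, weights) if w == 1)
    dv_bl = i_bl * dt / c_bl
    dv_blb = i_blb * dt / c_bl
    return 1 if dv_blb - dv_bl >= 0 else -1
```

Because the discharge time and bitline capacitance are common to both lines, only the difference of the summed currents determines the decision in this model.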



Fig. 7. WLDAC circuit, consisting of binary-weighted current sources and a bit-cell replica, to generate analog voltage on WL corresponding to inputted digital feature values.

The WLDAC is implemented using binary-weighted current sources connected in parallel. If class\_en is high,

$$I_{DAC} = I_{bias} \times x_{offset} + \sum_{i=0}^{Nbits} I_0 \times 2^i \times x[i] \tag{26}$$

Transistor  $M_{A,R}$  sets the WL voltage. Access transistor  $M_A$  mirrors  $M_{A,R}$  with scaling such that  $I_{BC} = \frac{1}{R}I_{DAC}$ . Therefore,  $I_{BC,i}$  does represent the input and equation (25) matches (24).
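A short numerical sketch of the DAC current and the mirrored bit-cell current is given here. The function names, the unit current `i_0` and the mirror ratio `r` are hypothetical values chosen only for illustration, and the sum simply runs over however many bits are supplied, least significant bit first.

```python
def wldac_current(bits, i_0=1e-6, i_bias=0.0, x_offset=0):
    """Equation (26) with bits[i] = x[i], least significant bit first."""
    return i_bias * x_offset + sum(i_0 * (2 ** i) * b for i, b in enumerate(bits))


def bitcell_current_from_dac(i_dac, r=10.0):
    """Replica mirror scaling: I_BC = I_DAC / R."""
    return i_dac / r
```

With `i_0 = 1.0`, the bit pattern `[1, 0, 1]` (decimal 5) gives a DAC current of 5 units, which the mirror scales down by `r`.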

#### 3 Simulations

The simulations were a joint effort by me and Anubhav Agarwal. We simulated a column based weak classifier as described in the In-Memory computation paper discussed above.

#### 3.1 Setup

Simulations were carried out with Cadence AMS using UMC 65 nm technology. The circuit was designed to work with a 100-element feature vector input, where every element is represented in 8 bits. The circuit comprises the following blocks:

1. **FV Buffer**: This block was implemented in Verilog. It allows us to store an array of input values, present in a text file, into a register as bit values.
2. **WLDAC**: This block was implemented in VerilogAMS. It converts the digital input to an analog voltage between 0 and 1.2 V whenever CLASS\_EN goes high, and presents high output impedance when CLASS\_EN is low.
3. **ADDR Decoder and WLDriver**: This block was implemented in Verilog. It decodes the input address and drives the wordlines whenever SRAM\_EN is high, and presents high output impedance when SRAM\_EN is low.
4. **Write Driver**: This block was implemented as a transistor-level schematic. In SRAM mode, it drives the bit data into the appropriate bit cell when WRITE\_EN is high.
5. **6T SRAM Cell**: This block was implemented as a transistor-level schematic with appropriate scaling of transistors. It stores the weights as bits.

Same blocks were instantiated multiple times and bus connections were used wherever required.



Figure 12: Full Testbench Schematic



Figure 13: Column of SRAM cells



Figure 14: SRAM Cell

#### 3.2 Results

We did a transient analysis. In the first 100 clock cycles, weights were written into bit cells and in the next 3 clock cycles feature vector inputs were classified. For the sake of simplicity, clock frequency for SRAM mode and CLASSIFY mode was kept same (10 MHz).

During sequential writing, only the WL corresponding to the cell being written goes high while the rest go to 0. When inputs are being classified, different analog voltages corresponding to the inputs appear on all WLs. This process can be seen in the figure below, where the $WL_{0-3}$ and $WL_{97-99}$ signals are plotted.



$V_{out}$ in classify mode represents the decision as per equation (25). In the plot below we see that the first feature vector was classified as 1 while the next two were classified as 0. This matched the expected result computed using Python.

