<a href="https://colab.research.google.com/github/mrigakshipandey/seminar-mikroelektronik/blob/main/Task_1_Related_Literature.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WHAT MATTERS IN TRANSFORMERS? NOT ALL ATTENTION IS NEEDED


---


##Assessing Redundancy
Redundant modules produce outputs that are similar to their inputs, implying minimal transformation. The similarity between the input $\mathbf{X}$ and output $\mathbf{Y}$ of a module is quantified using cosine similarity.

Therefore, the importance score $\mathbf{S}$ of the module is computed as:

\begin{align}
\mathbf{S} = 1 - \text{CosineSim}(\mathbf{X},\mathbf{Y})
\end{align}

\begin{align}
\mathbf{S} = 1 - \frac{\mathbf{X} \cdot \mathbf{Y}}{\left\| \mathbf{X} \right\| \left\| \mathbf{Y} \right\|}
\end{align}
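
The score above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name `importance_score`, the `(num_tokens, hidden_dim)` layout, and averaging the cosine similarity over tokens are all assumptions made here.

```python
import numpy as np

def importance_score(x: np.ndarray, y: np.ndarray, eps: float = 1e-8) -> float:
    """Return 1 - mean cosine similarity between module inputs x and outputs y.

    x, y: arrays of shape (num_tokens, hidden_dim). Averaging over tokens
    is an assumption of this sketch.
    """
    dot = np.sum(x * y, axis=-1)
    norms = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
    cos = dot / np.maximum(norms, eps)  # eps guards against division by zero
    return float(1.0 - cos.mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))

# A module that only rescales its input is fully redundant under this score.
redundant = importance_score(x, 1.5 * x)
# A module whose output is unrelated to its input gets a higher score.
transforming = importance_score(x, rng.normal(size=(4, 16)))
```

Because cosine similarity ignores magnitude, a module that only rescales its input scores close to zero, which matches the idea of "minimal transformation".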


---


##Assessing the effects of dropping a module
To quantify the trade-off between performance degradation and speedup, we introduce a new metric called the Speedup Degradation Ratio (SDR) $\lambda$, defined as:

\begin{align}
\lambda = \frac{\Delta \text{Avg.}}{\Delta \text{Speedup}}
\end{align}

Here $\Delta \text{Avg.}$ represents the percentage change in average performance across the evaluated tasks, and $\Delta \text{Speedup}$ denotes the corresponding percentage of speedup achieved by each method.
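
As a small worked example, the ratio can be computed directly. The function name and the numbers below are hypothetical and only illustrate the arithmetic; they are not results from the paper.

```python
def speedup_degradation_ratio(delta_avg_pct: float, delta_speedup_pct: float) -> float:
    """Performance change (in percent) per percent of speedup."""
    if delta_speedup_pct == 0:
        raise ValueError("speedup change must be non-zero")
    return delta_avg_pct / delta_speedup_pct

# Hypothetical numbers: a 2% drop in average performance for a 40% speedup.
sdr = speedup_degradation_ratio(delta_avg_pct=2.0, delta_speedup_pct=40.0)
```

By the definition, a smaller ratio means less performance is lost for each percent of speedup gained.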

##Joint Layer Drop
The Joint Layer Drop method is simply implemented by calculating the importance scores for attention layers and MLP layers individually. We concatenate the scores and, from this combined set of importance scores, drop the layers with the lowest values.
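
The selection step can be sketched as follows. The function name `joint_layer_drop`, the `("attn" | "mlp", index)` labels, and the example scores are assumptions made for illustration only.

```python
def joint_layer_drop(attn_scores, mlp_scores, num_drop):
    """Pick the num_drop layers with the lowest importance across both types."""
    # Concatenate both score lists, remembering each layer's type and index.
    candidates = [("attn", i, s) for i, s in enumerate(attn_scores)]
    candidates += [("mlp", i, s) for i, s in enumerate(mlp_scores)]
    # Sort by importance score, lowest first.
    candidates.sort(key=lambda c: c[2])
    return [(kind, idx) for kind, idx, _ in candidates[:num_drop]]

# Hypothetical importance scores for a 4-block model.
attn = [0.30, 0.05, 0.02, 0.10]
mlp = [0.40, 0.25, 0.08, 0.35]
dropped = joint_layer_drop(attn, mlp, num_drop=3)
```

With these made-up scores, two attention layers and one MLP layer are selected, because the ranking is done over the combined set rather than per layer type.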

---
##Observations
- Attention layers are highly redundant, and their removal has minimal impact on model accuracy, making Attention Drop a highly efficient pruning strategy.

- Deeper layers (excluding the last ones) often exhibit excessively low importance across Block, MLP, and Attention modules.

- Attention layers demonstrate consistently lower importance scores than MLP and Block at all training stages.

- Joint Layer Drop  consistently achieves better performance than either Attention Drop or MLP Drop alone.

- Given its simplicity and efficiency, One-Shot Dropping emerges as the superior choice.

- Attention Drop is orthogonal to quantization, so the two can be combined.
---
---

#CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices
An ideal model compression technique should:
 - maintain regular network structure;
 - reduce the complexity for both inference and training; and, most importantly,
 - retain a rigorous mathematical foundation on compression ratio and accuracy.

CirCNN uses Fast Fourier Transform (FFT)-based fast multiplication, simultaneously reducing the computational complexity (in both inference and training) and the storage complexity.

In a square circulant matrix, each row (or column) vector is a cyclic shift of the other row (column) vectors, so the whole matrix is defined by a single vector. A non-square matrix can be represented by a set of square circulant submatrices (blocks).
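
As a small illustration, a 3-by-3 circulant matrix defined by its first column $(c_0, c_1, c_2)$ looks like this (the first-column convention and the symbols $c_i$, $\mathbf{C}^{(k)}$ are assumptions of this example):

\begin{align}
\mathbf{C} = \begin{bmatrix} c_0 & c_2 & c_1 \\ c_1 & c_0 & c_2 \\ c_2 & c_1 & c_0 \end{bmatrix}
\end{align}

A non-square 3-by-6 matrix could then be written as two such blocks side by side, storing 6 values instead of 18:

\begin{align}
\mathbf{W} = \begin{bmatrix} \mathbf{C}^{(1)} & \mathbf{C}^{(2)} \end{bmatrix}
\end{align}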

A fully-connected layer of a DNN can be represented as **y** = ψ(**Wx** + **θ**), where the vectors **x** and **y** represent the outputs of all neurons in the previous layer and the current layer, respectively; **W** is the m-by-n weight matrix; **θ** is the bias vector; and ψ(·) is the activation function.

When **W** is a block-circulant matrix, FFT-based fast multiplication can be used, which reduces the computational complexity from $O(n^2)$ to $O(n \log n)$ for an n-by-n circulant block. Block-circulant matrices can be block-diagonalized using a block FFT. This turns the circular convolution (that is, the matrix-vector multiplication) into element-wise multiplication in the frequency domain, which is much faster.
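
The FFT-based product can be checked numerically against a dense multiplication. This is a minimal NumPy sketch rather than CirCNN's implementation: the helper names, the `(p, q, b)` storage layout, and the convention that each block is defined by its first column are assumptions made here.

```python
import numpy as np

def circulant(c):
    """Dense circulant matrix whose first column is c: C[i, j] = c[(i - j) mod n]."""
    n = len(c)
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return c[idx]

def block_circulant_matvec(w, x):
    """Multiply a block-circulant matrix by x using FFTs.

    w: array of shape (p, q, b); w[i, j] is the first column of block (i, j).
    x: vector of length q * b. Returns a vector of length p * b.
    """
    p, q, b = w.shape
    xf = np.fft.fft(x.reshape(q, b), axis=-1)   # FFT of each input segment
    wf = np.fft.fft(w, axis=-1)                 # FFT of each defining vector
    yf = np.sum(wf * xf[None, :, :], axis=1)    # element-wise products, summed over column blocks
    return np.fft.ifft(yf, axis=-1).real.reshape(p * b)

rng = np.random.default_rng(0)
p, q, b = 2, 3, 4
w = rng.normal(size=(p, q, b))
x = rng.normal(size=q * b)

# Reference: build the full (p*b)-by-(q*b) matrix and multiply directly.
dense = np.block([[circulant(w[i, j]) for j in range(q)] for i in range(p)])
y_dense = dense @ x
y_fft = block_circulant_matvec(w, x)
```

The two results agree up to floating-point error, while the FFT path never materializes the dense matrix.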

CirCNN directly trains the network assuming block-circulant structure. This leads to two advantages.
- CirCNN provides an adjustable but fixed reduction ratio in model size;
- with the same FFT-based fast multiplication, the computational complexity of training is also reduced.

To achieve a better compression ratio, a larger block size should be used; however, this may lead to more accuracy degradation. Smaller block sizes provide better accuracy but less compression.
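
The storage side of this trade-off is simple arithmetic: each b-by-b circulant block is defined by b values instead of b². The sketch below uses a hypothetical helper name and a hypothetical 1024-by-1024 layer, and assumes the block size divides both dimensions.

```python
def block_circulant_params(m: int, n: int, b: int) -> int:
    """Number of stored weights for an m-by-n matrix built from b-by-b circulant blocks."""
    if m % b or n % b:
        raise ValueError("this sketch assumes b divides both m and n")
    # (m/b) * (n/b) blocks, each defined by a single length-b vector.
    return (m // b) * (n // b) * b

m, n = 1024, 1024  # hypothetical layer size
ratios = {b: (m * n) / block_circulant_params(m, n, b) for b in (8, 64, 256)}
```

Under these assumptions the compression ratio equals the block size, which is why larger blocks compress more.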

For CONV layers, software tools such as Caffe provide an efficient methodology for transforming tensor-based operations in the CONV layer into matrix-based operations.
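
One common way to perform this kind of lowering is often called im2col: each input patch is unfolded into a row so that the convolution becomes a matrix product. The sketch below is a simplified single-channel version with stride 1 and no padding, written for illustration; it is not taken from Caffe or from the paper.

```python
import numpy as np

def im2col(image, k):
    """Unfold every k-by-k patch of a 2-D image into one row (stride 1, no padding)."""
    h, w = image.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((out_h * out_w, k * k), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = image[i:i + k, j:j + k].ravel()
    return cols, (out_h, out_w)

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])

cols, out_shape = im2col(image, k=2)
# The CONV operation (cross-correlation, as in DNN frameworks) is now a matrix-vector product.
result = (cols @ kernel.ravel()).reshape(out_shape)
```

Once the layer is in matrix form, the weight matrix is the part that a block-circulant structure could be applied to.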

---

##Overall Architecture
- The **basic computing block** is responsible for the major FFT and IFFT computations.

- The **peripheral computing block** is responsible for performing component-wise multiplication, ReLU activation, pooling etc.

- The implementations of ReLU activation and pooling are through comparators and have no inherent difference compared with prior work.

- The **control subsystem** orchestrates the actual FFT/IFFT calculations on the basic computing block and peripheral computing block. The different settings of the FFT/IFFT calculations are configured by the control subsystem.

- The **memory subsystem** is composed of ROM, which is utilized to store the coefficients in FFT/IFFT calculations; and RAM, which is used to store weights.

- We use 16-bit fixed point numbers for input and weight representations.
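
To make the 16-bit fixed-point representation concrete, here is a minimal quantization sketch. The split between integer and fractional bits is not given in these notes, so `FRAC_BITS = 12` is purely an assumption, as are the function names and the sample weights.

```python
import numpy as np

FRAC_BITS = 12  # assumption: 12 fractional bits, so values lie in [-8, 8)

def to_fixed16(values, frac_bits=FRAC_BITS):
    """Round to the nearest representable value and saturate to the int16 range."""
    scaled = np.round(np.asarray(values, dtype=np.float64) * (1 << frac_bits))
    return np.clip(scaled, -32768, 32767).astype(np.int16)

def from_fixed16(q, frac_bits=FRAC_BITS):
    """Convert stored 16-bit integers back to floating point."""
    return q.astype(np.float64) / (1 << frac_bits)

weights = np.array([0.5, -0.123, 1.75, -8.0])  # hypothetical weights
q = to_fixed16(weights)
restored = from_fixed16(q)
max_error = np.max(np.abs(restored - weights))
```

For values inside the representable range, the rounding error is at most half of one step, that is, 2^-13 with 12 fractional bits.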

---

##Pipelining and Parallelism
- In **inter-level pipelining**, each pipeline stage corresponds to one level in the basic computing block.

- In **intra-level pipelining**, additional pipeline stage(s) will be added within each butterfly computation unit.

- The proper selection of pipelining scheme highly depends on the target operating frequency and memory subsystem organization.

- Derive an upper bound on $p$ (the parallelization degree) from the memory bandwidth limit and the hardware resource limit.

- The overall metric $M$ is a function of performance $Perf(p,d)$ and power consumption $Power(p,d)$, where $d$ is the parallelization depth.

- We estimate $M$ assuming $d = 1$.

- Optimize the depth $d$ using the ternary search method, based on the derived $p$ value.
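
The depth-selection step can be sketched with an integer ternary search. The notes do not give the form of $Perf$ and $Power$, so `toy_metric` below is a made-up stand-in; the value `p = 4` and the search range are also hypothetical. The search assumes the metric is strictly unimodal in $d$.

```python
def ternary_search_max(f, lo, hi):
    """Return the integer in [lo, hi] that maximizes a strictly unimodal function f."""
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if f(m1) < f(m2):
            lo = m1 + 1   # the maximum lies to the right of m1
        else:
            hi = m2 - 1   # the maximum lies to the left of m2
    return max(range(lo, hi + 1), key=f)

def toy_metric(p, d):
    """Hypothetical M(p, d) = Perf / Power, used only to exercise the search."""
    perf = p * d                      # assumed: throughput grows with p and d
    power = 1.0 + 0.05 * p * d ** 2   # assumed: power grows faster with depth
    return perf / power

p = 4  # stand-in for the parallelization degree derived in the earlier steps
best_d = ternary_search_max(lambda d: toy_metric(p, d), lo=1, hi=16)
```

Each iteration discards roughly a third of the remaining range, so far fewer candidate depths are evaluated than in an exhaustive sweep.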

---

##Platform-Specific Optimizations
We focus on weight storage and memory management, in order to simplify the design and achieve higher energy efficiency and performance.

###FPGA Platform
- Weight storage requirements can be met by the on-chip block memory in state-of-the-art FPGAs.

- 16-bit fixed point numbers are used to represent the weights.

- By applying block-circulant matrices to both the FC and CONV layers in AlexNet, the storage requirement can be further reduced to 2 MB or even less.

###ASIC platform
- If we target a clock frequency of around 200 MHz, a memory hierarchy is not necessary because a single-level memory system can support such an operating frequency.

- Memory/cache reconfiguration techniques can be employed when executing different types and sizes of applications for performance enhancement and static power reduction.

- If we target a higher clock frequency, say 800 MHz, an effective memory hierarchy with at least two levels (L1 cache and main memory) becomes necessary because a single-level memory cannot accommodate such a high operating frequency.

- The effectiveness of prefetching is due to the regularity in the proposed block-circulant matrix-based neural networks, showing another advantage over prior compression schemes.

- Besides the memory hierarchy structure, the memory bandwidth is determined by the parallelization degree $p$ in the basic computing block.

---
---