# Rapid Implementation of Hardware Neural Networks Powered by Verython


## Project Introduction
For our ECE 5760 final project, we implemented a convolutional neural network (CNN) on an FPGA to classify handwritten digits.  CNNs are among the most popular neural network architectures for image classification: they have been reported to outperform humans on some clinical imaging tasks [(Source)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6586983/), and they are used in a wide variety of recognition applications.  However, CNNs are often treated as black boxes: they take an input and generate an output without revealing their internal logic.  The goal of this project was to demystify CNNs by breaking their layers down into constituent components and implementing them in hardware.





## High Level Design


The term “convolutional” comes from the mathematical operation of convolution, which in its discrete form multiplies corresponding elements together and then sums the products.  The output of a convolution therefore describes how the shape of one function modifies another.  This property is what lets our model capture the spatial dependencies of an image.  For our digit classification task, a conventional (fully connected) neural network would tend to overfit to the specific location of the digit on the screen, whereas a convolutional layer learns the edges and gradients of the image.

The architecture of our network is as follows: a series of convolutional layers with max pooling (twice), followed by a flattened, fully connected layer.  Below is the TensorFlow (a Python library for building neural networks) summary of our model, highlighting all the parameters that we trained.

![image info](./images/tf.png)



In the following examples, we describe the high-level operations of each layer.

Our convolutional layer uses a kernel size of 2x2.  In this example, we produce a 3x3 feature map from a 4x4 image.  The 2x2 filter slides across the image from left to right and top to bottom, one pixel at a time; at each position it multiplies the overlapping elements pairwise and sums the products to produce one output element.  This generalizes to NxN images: with a 2x2 kernel, a stride of 1, and no padding, the output feature map is always (N-1) x (N-1).


![image info](./images/conv.png)
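
The sliding-window operation described above can be sketched in a few lines of plain Python. This is only an illustration, not the hardware implementation: the function name `conv2d_valid`, the sample image, and the kernel values are made up for the example, and the sketch assumes a stride of 1 with no padding. Like most deep-learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding.

    Both arguments are lists of lists (row-major 2D arrays).
    """
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for i in range(n_rows - k_rows + 1):
        row = []
        for j in range(n_cols - k_cols + 1):
            # Element-wise product of the window and the kernel, then sum.
            acc = 0
            for di in range(k_rows):
                for dj in range(k_cols):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical 4x4 image and 2x2 kernel, chosen only for illustration.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [
    [1, 0],
    [0, -1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # 3x3 output: every entry is -5 for this image and kernel
```

A 4x4 input with a 2x2 kernel yields a 3x3 output here, matching the (N-1) x (N-1) rule.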


As a finishing touch to the output of our convolutional layer, we pass its values through an activation function.  Each element of the feature map passes through the activation function before moving on to the next layer.  We initially chose the sigmoid activation function to keep our values between 0 and 1: large positive inputs are squashed toward 1, while large negative inputs are squashed toward 0.  We soon realized that sigmoid required a large number of fractional bits in our fixed-point representation.  We ended up switching to ReLU activation, which is computationally cheaper and does not need significant fractional precision.  With ReLU, negative values become zero while positive values retain their original value.

![image info](./images/sigmoid.png)
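
To make the precision trade-off concrete, here is a small Python sketch comparing the two activation functions under a coarse fixed-point rounding. The helper `quantize` and the choice of 4 fractional bits are assumptions made for illustration; they are not the fixed-point format used in the project.

```python
import math


def sigmoid(x):
    """Squash x into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Zero out negative values, pass positive values through."""
    return max(0.0, x)


def quantize(value, frac_bits):
    """Round to the nearest multiple of 2**-frac_bits (hypothetical helper)."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


FRAC_BITS = 4  # illustrative only, not the project's actual format

for x in (4.0, 6.0):
    print(x, quantize(sigmoid(x), FRAC_BITS), quantize(relu(x), FRAC_BITS))
# With 4 fractional bits both sigmoid outputs round to 1.0,
# while the ReLU outputs stay distinct (4.0 and 6.0).
```

Because sigmoid compresses large inputs into a narrow band just below 1, distinguishing them takes many fractional bits, whereas ReLU only needs a comparison against zero.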


Immediately following the convolutional layer is our max-pooling layer.  Take a 4x4 feature map as an example.  Here, we use a pool size of 2x2 with a stride of 2.  At each step, we take the maximum element of the 2x2 window, which results in an output of size 2x2.  Because the pool size and the stride are both 2, the windows do not overlap and each dimension of the feature map is reduced by a factor of 2.


![image info](./images/mp.png)
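
A minimal Python sketch of this pooling step follows. The function name `max_pool2d` and the sample feature map are hypothetical, and the sketch is meant to show the arithmetic rather than the hardware design.

```python
def max_pool2d(feature_map, pool=2, stride=2):
    """Take the maximum of each pool x pool window, stepping by `stride`."""
    n_rows, n_cols = len(feature_map), len(feature_map[0])
    out = []
    for i in range(0, n_rows - pool + 1, stride):
        row = []
        for j in range(0, n_cols - pool + 1, stride):
            window = [
                feature_map[i + di][j + dj]
                for di in range(pool)
                for dj in range(pool)
            ]
            row.append(max(window))
        out.append(row)
    return out


# Hypothetical 4x4 feature map, chosen only for illustration.
fm = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]

pooled = max_pool2d(fm)
print(pooled)  # [[6, 4], [8, 9]]
```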


Lastly, our fully-connected layer takes the flattened feature map as its input.  We connect it to ten outputs, one per digit 0-9, and train our model to learn the weights of this layer.  Each weight can be thought of as the strength of a connection: how much does an output depend on that specific input?  For each output, we take the dot product of the flattened feature map with that output's weights, and pass the results through another activation function to generate our predictions.  The prediction for a given input is the index of the element with the maximum probability in the resulting 10-element vector.

![image info](./images/fullyconnect.png)
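
The flatten, dot-product, and arg-max steps can be sketched as below. The helper names and the toy weights are assumptions for illustration, not trained values. Biases are left out, and the final activation is skipped on the assumption that it is monotonically increasing (as softmax and sigmoid are), in which case it does not change which index is largest.

```python
def flatten(feature_map):
    """Turn a 2D feature map into a 1D list, row by row."""
    return [v for row in feature_map for v in row]


def dense(inputs, weights):
    """Return one dot product per output; weights[k] belongs to output k."""
    return [sum(x * w for x, w in zip(inputs, w_k)) for w_k in weights]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda k: values[k])


# Toy 2x2 feature map and made-up weights for 10 outputs (digits 0-9).
inputs = flatten([[1, 2], [3, 4]])
weights = [[-(k - 7) ** 2] * len(inputs) for k in range(10)]

scores = dense(inputs, weights)
prediction = argmax(scores)
print(prediction)  # 7, because output 7 has the largest score with these weights
```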


## Introduction to Notation
For some array $A$, let $A:[d_1,\dots,d_i,\dots,d_n]$ indicate that $A$ is $n$-dimensional and that each dimension $1\leq i\leq n$ has a length of $d_i$. For example,
$$A:[2, 3]=\begin{bmatrix} a_{1, 1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \end{bmatrix}.$$
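
For readers who think in code, the shape notation corresponds to the nesting lengths of an array. The sketch below uses a hypothetical helper `shape` on nested Python lists and assumes the array is rectangular. Note that the math above is 1-indexed while Python is 0-indexed.

```python
def shape(array):
    """Return [d_1, ..., d_n] for a rectangular nested list."""
    dims = []
    while isinstance(array, list):
        dims.append(len(array))
        if not array:
            break  # an empty dimension ends the descent
        array = array[0]
    return dims


A = [
    [11, 12, 13],
    [21, 22, 23],
]
print(shape(A))  # [2, 3]
print(A[0][2])   # 13, the element written a_{1,3} in the math notation
```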



## Program/Hardware Design


### Model Implementation
In this section, we will dive deep into the math behind each of the layers in the model and how we rederived the transformation functions of each layer to fit on the FPGA. 

As a reminder, our model ingests an image of a hand-drawn digit and outputs a number between 0 and 9, which is the model's prediction. The image is a 28 by 28 2D array of values between 0 and 255; that is, an input image $img$ can be represented as
$$img=\begin{bmatrix} 
  p_{1,1} & \dots & p_{1,j} & \dots & p_{1,28} \\ 
  \vdots & \ddots & & & \vdots \\
  p_{i, 1} & \dots & p_{i, j} & \dots & p_{i, 28} \\
  \vdots & & & \ddots & \vdots \\
  p_{28, 1} & \dots & p_{28, j} & \dots & p_{28, 28}
\end{bmatrix}$$
such that $0\leq p_{i,j}\leq255$. 

Moreover, we can think of our model $M$ as a function that maps an input $img$ to an output prediction $pred$; that is, the inference of our model can be represented by 
$$M(img)=pred$$
for some output prediction $0\leq pred\leq9$. 

Furthermore, we can also express each of the layers in our model as a function. Let $L=\{L_1,\dots,L_i,\dots,L_4\}$ be the layers in our model. Then we can represent each layer $L_i$ as some function 
$$L_i(V_i^{in})=V_i^{out},$$
where $V_i^{in}$ and $V_i^{out}$ are arrays whose shapes are defined by $L_i$. 

With this new notation, we can redefine $M$ in terms of its layers:
$$\begin{align*}
  M(img) & =L_4(L_3(L_2(L_1(img)))) \\
  & =pred.
\end{align*}$$
Thus, a model's prediction is simply the output of its cascaded layer functions.
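
The cascade can be written as a simple function composition. In the sketch below, `compose_layers` is a hypothetical helper and the four "layers" are arithmetic stand-ins chosen only to show the order of application; they are not the real layer functions.

```python
from functools import reduce


def compose_layers(layers):
    """Build M such that M(img) = L_n(...L_2(L_1(img))...)."""
    def model(img):
        # Feed the output of each layer into the next, starting from img.
        return reduce(lambda value, layer: layer(value), layers, img)
    return model


# Stand-in layers L_1..L_4 operating on a plain number.
layers = [
    lambda v: v + 1,   # L_1
    lambda v: v * 2,   # L_2
    lambda v: v - 3,   # L_3
    lambda v: v ** 2,  # L_4
]

model = compose_layers(layers)
print(model(5))  # ((5 + 1) * 2 - 3) ** 2 = 81
```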

Let's now take a look at each of the layers and how we can rederive their functions to be built on the FPGA.


#### Layer 1: Conv2D
The 2D convolutional layer has weights $W^1$ and biases $B^1$ that are defined by 
$$\begin{align*}
    & W^1:[x,y,z]=
    \begin{bmatrix}
        \begin{bmatrix}
            W^1_{1,1} = \begin{bmatrix} o_1 & \dots & o_z \end{bmatrix} \\
            \vdots \\
            W^1_{1,y} = \begin{bmatrix} p_1 & \dots & p_z \end{bmatrix}
        \end{bmatrix}
        & \dots &
        \begin{bmatrix}
            W^1_{x, 1} = \begin{bmatrix} q_1 & \dots & q_z \end{bmatrix} \\
            \vdots \\
            W^1_{x, y} = \begin{bmatrix} r_1 & \dots & r_z \end{bmatrix}
        \end{bmatrix}
    \end{bmatrix}, \\
    & B^1:[z]=\begin{bmatrix} b_1 & \dots & b_z \end{bmatrix}.
\end{align*}$$
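
As a rough illustration of how weights of shape $[x,y,z]$ and biases of shape $[z]$ are typically applied, the sketch below assumes the standard Conv2D definition for a single-channel input with stride 1 and no padding: each of the $z$ filters produces its own feature map, and the bias $b_k$ is added to every element of map $k$. The function name `conv2d_layer`, the interpretation of the first weight index as the row offset, and all numeric values are assumptions. Indices are 0-based here, unlike the 1-based math above, and no activation is applied.

```python
def conv2d_layer(img, W, B):
    """Apply z filters of size x by y to a 2D image.

    W[a][b][k] is the weight of filter k at window offset (a, b);
    B[k] is the bias of filter k. Returns an array of shape
    [rows - x + 1, cols - y + 1, z].
    """
    x, y, z = len(W), len(W[0]), len(B)
    rows, cols = len(img) - x + 1, len(img[0]) - y + 1
    out = [[[0] * z for _ in range(cols)] for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(z):
                acc = B[k]
                for a in range(x):
                    for b in range(y):
                        acc += img[i + a][j + b] * W[a][b][k]
                out[i][j][k] = acc
    return out


# Made-up 3x3 image, two 2x2 filters (z = 2), and two biases.
img = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
W = [
    [[1, 1], [1, 0]],   # offsets (0,0) and (0,1): [filter 0, filter 1]
    [[1, 0], [1, -1]],  # offsets (1,0) and (1,1)
]
B = [0, 10]

out = conv2d_layer(img, W, B)
print(out[0][0])  # [12, 6]: window sum 1+2+4+5, and 10 + 1 - 5
```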

## Results of the Design
These are the results of the design.


## Conclusions
These are the conclusions.