# Deploy DFG IR on FPGA

This tutorial continues with the final step: deploying the algorithm, in this case our face detection SSD model, onto an FPGA. We have talked about the network itself and TensorFlow as a *DSL*, and we are going to finish off by using the platform introduced in previous tutorials to take this network and actually deploy it onto the FPGA.

This tutorial is organised as follows:

1. [Freezing a model](#Freezing-a-model)
2. [Creating a DFG](#Creating-a-DFG)
3. [Optimizing DFG](#Optimizing-DFG)
4. [Hardware Deployment](#Hardware-Deployment)
5. [Importing DFG into raintime](#Importing-DFG-into-raintime)
6. [Compilation Observation](#Compilation-Observation)


### Intended Outcomes:

1. Learn and understand how the compilation process could be split up for your research interests
2. Learn how to use our tool flow to deploy a model onto FPGA

Given that you have installed __Plumber__ in Tutorial 1, we are now going to use it to generate a Data-Flow Graph (DFG) that can be passed down the platform to reach the final execution on FPGA and CPU. After this lab you should understand the concept of multi-stage compilation down to the FPGA level, or be inspired to translate this process into your personal projects.

As a quick recap, the platform consists of multiple parts. *Plumber* is a web-based application that takes a templated description of a machine learning algorithm, optimises it and creates a DFG that is then passed into *raintime*. *raintime* then instantiates computation nodes, which are either processed on the CPU or offloaded to an FPGA accelerator. *rainman* then takes the FPGA templates and synthesises them on the device itself, interconnecting them with the nodes instantiated on the CPU. The whole flow can be visualised in a simple diagram:

![Flowchart.png](../data/figs/platform_flowchart.png)

Looking deeper, the compilation process can be separated into two parts. The first is a software call via *Plumber*. The first layer of the platform takes a description of the CNN and converts it into the DFG intermediate representation (IR). The DFG is then optimised, first in software and later for hardware, which generates a `*.json` file that serves as a guide for the actual hardware design. That description is then used to create a bitstream in the Vivado environment. All the information about the capabilities of our platform can also be found at this [link](https://corerain.github.io/plumber-docs/topics/commands.html).

Then follows the software call, which has multiple steps that we discuss below.

To get started with the SSD example, make sure that you have the checkpoint files of your model and `plumber_cli` installed in your virtual environment. The checkpoint files are already downloaded for you.

Then we can use `plumber_cli` to create, step by step, a DFG that can be loaded onto the FPGA. You can open a separate terminal window in the root directory of the tutorials or simply run the commands from the Jupyter notebook.

## Freezing a model
Make sure that the checkpoint directory generated after your training/retraining session contains these files:

- `checkpoint`: a file that contains meta information about the checkpoint directory, such as its data and index files
- `*.meta`: the meta information about your model
- `*.data`: weights data
- `*.index`: the index file
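Before freezing, it can help to verify that the directory really contains the four kinds of files listed above. The following is a minimal sketch, not part of Plumber: the helper name `missing_checkpoint_files` is our own, and the `*.data*` pattern assumes the usual TensorFlow naming where data files carry a shard suffix such as `.data-00000-of-00001`.

```python
from pathlib import Path

# Patterns derived from the list above. "*.data*" is an assumption: TensorFlow
# usually appends a shard suffix (e.g. ".data-00000-of-00001") to data files.
REQUIRED_PATTERNS = ["checkpoint", "*.meta", "*.data*", "*.index"]


def missing_checkpoint_files(ckpt_dir):
    """Return the required patterns that have no matching file in ckpt_dir."""
    directory = Path(ckpt_dir)
    return [p for p in REQUIRED_PATTERNS if not any(directory.glob(p))]


# Hypothetical usage with the checkpoint directory of this tutorial.
missing = missing_checkpoint_files("../model/ssd_ckpt")
print("missing patterns:", missing if missing else "none")
```

An empty result means that every pattern matched at least one file; it does not check that the files belong to the same checkpoint.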

These files are now going to be imported into `Plumber` and converted into a representation that the platform *understands* and can optimise.

In [21]:
! plumber_cli freeze ../model/ssd_ckpt -d ../model/ssd -o ssd_KYnet_v2/block9_box/Reshape,ssd_KYnet_v2/block8_box/Reshape,ssd_KYnet_v2/block4_box/Reshape,ssd_KYnet_v2/softmax_5/Reshape_1,ssd_KYnet_v2/softmax_1/Reshape_1,ssd_KYnet_v2/softmax_2/Reshape_1,ssd_KYnet_v2/softmax_3/Reshape_1,ssd_KYnet_v2/block7_box/Reshape,ssd_KYnet_v2/block10_box/Reshape,ssd_KYnet_v2/softmax/Reshape_1,ssd_KYnet_v2/block3_box/Reshape,ssd_KYnet_v2/softmax_4/Reshape_1

TensorFlow Version: 1.5.0
Freezing ../model/ssd_ckpt to ../model/ssd/model.pb ...
Output node names: ['ssd_KYnet_v2/block9_box/Reshape', 'ssd_KYnet_v2/block8_box/Reshape', 'ssd_KYnet_v2/block4_box/Reshape', 'ssd_KYnet_v2/softmax_5/Reshape_1', 'ssd_KYnet_v2/softmax_1/Reshape_1', 'ssd_KYnet_v2/softmax_2/Reshape_1', 'ssd_KYnet_v2/softmax_3/Reshape_1', 'ssd_KYnet_v2/block7_box/Reshape', 'ssd_KYnet_v2/block10_box/Reshape', 'ssd_KYnet_v2/softmax/Reshape_1', 'ssd_KYnet_v2/block3_box/Reshape', 'ssd_KYnet_v2/softmax_4/Reshape_1']
2018-07-24 17:35:28 INFO Initialising with model directory ...
2018-07-24 17:35:28.696265: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2
Converted 52 variables to const ops.
Successfully writen the frozen graph to ../model/ssd/model.pb


- `../model/ssd_ckpt/`: is the checkpoint directory
- `../model/ssd/`: is the output directory
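The `-o` argument in the cell above is a long comma-separated list of output node names, which is easy to mistype. As a sketch, the same command line can be assembled in Python. The sub-command `freeze` and the flags `-d` and `-o` are taken from the cell above; the helper `build_freeze_command` is our own, and we assume (without confirmation) that the order of the output nodes does not matter.

```python
import shlex


def build_freeze_command(ckpt_dir, out_dir, output_nodes):
    """Assemble the `plumber_cli freeze` call shown above as an argument list."""
    return ["plumber_cli", "freeze", ckpt_dir,
            "-d", out_dir, "-o", ",".join(output_nodes)]


# The twelve output nodes of the SSD model: six box branches, six softmax branches.
box_nodes = ["ssd_KYnet_v2/block%d_box/Reshape" % b for b in (3, 4, 7, 8, 9, 10)]
softmax_nodes = ["ssd_KYnet_v2/softmax/Reshape_1"] + [
    "ssd_KYnet_v2/softmax_%d/Reshape_1" % i for i in range(1, 6)
]

command = build_freeze_command("../model/ssd_ckpt", "../model/ssd",
                               box_nodes + softmax_nodes)
# Print a shell-safe version that can be pasted into a terminal.
print(" ".join(shlex.quote(arg) for arg in command))
```

The resulting list could also be passed to `subprocess.run` instead of being printed.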

## Creating a DFG

From the files you have just created you can generate a raw data-flow graph, again by using `plumber_cli`. This command parses the content of the frozen TensorFlow model and generates a corresponding DFG. If you would like to see the raw `.pbtxt`, please look at `../model/ssd/ssd_dfg.pbtxt` after you run the command.

In [22]:
!plumber_cli dfg \
    --model-file=../model/ssd/model.pb \
    --dfg-bin-file=../model/ssd/model_dfg.pb \
    --dfg-text-file=../model/ssd/ssd_dfg.pbtxt \
    --dfg-data-file=../model/ssd/ssd_dfg.h5 \
    --input-image-shape=1,256,256,3

TensorFlow Version: 1.5.0
Generating DFG from ../model/ssd/model.pb to ../model/ssd/model_dfg.pb
2018-07-24 17:35:36 INFO loading TensorFlow model from ../model/ssd/model.pb ...
2018-07-24 17:35:36 INFO Initialising with model frozen file ...
2018-07-24 17:35:36 INFO Successfully loaded the model!
2018-07-24 17:35:36 INFO Rewriting the input TensorFlow Graph for convenient DFG generation ...
2018-07-24 17:35:36 INFO Rewriting the graph by "ReshapeRewriter" ...
2018-07-24 17:35:36 INFO Rewriting the graph by "PlaceholderRewriter" ...
2018-07-24 17:35:36 INFO Rewriting the graph by "DropoutRewriter" ...
2018-07-24 17:35:36 INFO Building a TFGraph object ...
2018-07-24 17:35:36 INFO Running model inference, it may take a while ...
2018-07-24 17:35:36.888671: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2
2018-07-24 17:35:36 INFO Running inference ...
2018-07-24 17:35:3

  device: CPU
  type: T_FLOAT
  conv2d_op_param {
    depth: 8
    kernel_size: 3
    pad: 1
    stride: 1
    activation_fn: ""
    use_maxpool_2x2: false
    use_batch_norm: false
    use_bias: true
    use_relu: false
  }
}
node {
  name: "ssd_KYnet_v2/softmax_3/reshape_1"
  input: "ssd_KYnet_v2/softmax_3/softmax"
  op: "Reshape"
  device: CPU
  type: T_FLOAT
  reshape_op_param {
    shape {
      dim: 384
      dim: 2
    }
  }
}
node {
  name: "ssd_KYnet_v2/softmax_4/reshape"
  input: "ssd_KYnet_v2/block9_box/reshape_1"
  op: "Reshape"
  device: CPU
  type: T_FLOAT
  reshape_op_param {
    shape {
      dim: 1
      dim: 4
      dim: 4
      dim: 6
      dim: 2
    }
  }
}
node {
  name: "ssd_KYnet_v2/softmax_4/softmax"
  input: "ssd_KYnet_v2/softmax_4/reshape"
  op: "Softmax"
  device: CPU
  type: T_FLOAT
  softmax_op_param {
    shape {
      dim: 96
      dim: 2
    }
  }
}
node {
  name: "ssd_KYnet_v2/block10_box/reshape

- `../model/ssd/model.pb`: is the frozen TensorFlow model produced in the previous step
- `../model/ssd/model_dfg.pb`: is the generated DFG in binary format
- `../model/ssd/ssd_dfg.pbtxt`: is the description of the DFG in a text format
- `../model/ssd/ssd_dfg.h5`: is a data file describing input/output sizes, important for random data generation or weights extraction
- `1,256,256,3`: is the input image shape, in our case one 256x256 image with three channels per batch. N.B.: the format is batch size, height, width, number of channels.
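To make the ordering of the `--input-image-shape` value explicit, here is a small sketch that parses the string into named fields. The names `ImageShape` and `parse_input_image_shape` are our own and are not part of `plumber_cli`.

```python
from collections import namedtuple

# Field order follows the format stated above: batch, height, width, channels.
ImageShape = namedtuple("ImageShape", ["batch", "height", "width", "channels"])


def parse_input_image_shape(text):
    """Parse a string such as "1,256,256,3" into an ImageShape."""
    parts = [int(p) for p in text.split(",")]
    if len(parts) != 4:
        raise ValueError("expected 4 comma-separated integers, got %r" % text)
    return ImageShape(*parts)


shape = parse_input_image_shape("1,256,256,3")
# Total number of values in one input tensor.
num_values = shape.batch * shape.height * shape.width * shape.channels
print(shape, num_values)
```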

## Optimizing DFG

The DFG should be optimised before hardware deployment. DFG optimisation is performed in two steps: platform-independent and platform-dependent optimisation. Platform-independent optimisation improves the efficiency of the model without knowing the platform specification, while platform-dependent optimisation takes that information into account. Both steps are callable from `plumber_cli`.

### Platform-Independent Optimisation

`dfg_opt` optimises an input DFG in a platform-independent style. It mainly quantises model coefficients.

In [23]:
!plumber_cli dfg_opt \
    --dfg-file=../model/ssd/model_dfg.pb \
    --dfg-data-file=../model/ssd/ssd_dfg.h5 \
    --opt-dfg-file=../model/ssd/ssd_opt_dfg.pbtxt \
    --logdir=../model/ssd/logs

TensorFlow Version: 1.5.0
Loading DFGDef from file "../model/ssd/model_dfg.pb" ...
Loading DFG from definition and data ...
2018-07-24 17:35:46 INFO Loading data from file ../model/ssd/ssd_dfg.h5 ...
Initialise DFG optimizer ...
Running optimization ...
2018-07-24 17:35:46 INFO Running optimization pass "Data Representation Optimization" ...
Explored data representation table:
                            node_name         key       min         max rep_fixed32 rep_ufixed32 rep_fixed16 rep_ufixed16 rep_fixed8 rep_ufixed8
0                           img_input       ifmap  0.001562  255.994907       22, 9        23, 9        6, 9         7, 9       None        None
1                           img_input       ofmap  0.001562  255.994907       22, 9        23, 9        6, 9         7, 9       None        None
2         ssd_KYnet_v2/conv1_1/conv2d       ifmap  0.001562  255.994907       22, 9        23, 9        6, 9         7, 9       None        None
3         ssd_KYnet_v2/conv1_1/conv2d   

All reports are written to: ../model/ssd/logs/reports/dfg_opt


This takes the original DFG stored in `../model/ssd/model_dfg.pb` and optimises it to maximise the gain from our platform:

- `../model/ssd/model_dfg.pb`: is the previously generated DFG file
- `../model/ssd/ssd_dfg.h5`: is the data file that we have created in the previous step
- `../model/ssd/ssd_opt_dfg.pbtxt`: is the new, optimised DFG in text format
- `../model/ssd/logs`: is the logging directory
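Since this step mainly quantises model coefficients, it may help to see what converting a floating-point value to fixed point means in general. The sketch below is a generic illustration and not Plumber's implementation. The function names are our own, and the choice of 6 fraction bits in a signed 16-bit word is only an example.

```python
def to_fixed(value, frac_bits, total_bits, signed=True):
    """Quantise a float to a fixed-point integer, saturating at the word limits."""
    scale = 1 << frac_bits
    if signed:
        lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << total_bits) - 1
    return max(lo, min(hi, int(round(value * scale))))


def from_fixed(raw, frac_bits):
    """Convert a fixed-point integer back to a float."""
    return raw / (1 << frac_bits)


# Example: the largest input value reported in the log, with 6 fraction bits.
raw = to_fixed(255.994907, frac_bits=6, total_bits=16)
approx = from_fixed(raw, frac_bits=6)
print(raw, approx, abs(approx - 255.994907))
```

Fewer fraction bits give a coarser grid of representable values, while fewer integer bits reduce the range before saturation. This is the trade-off that such an optimisation pass has to balance.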

Now let's move to the next step, which actually results in execution on an embedded system with an FPGA.

### Platform-Dependent Optimisation

`hdl_opt` performs platform-dependent optimisation. It takes a board specification file and a platform-independently optimised DFG as input, and produces a further optimised DFG and a configuration file that specifies hardware design parameters.

In [25]:
! plumber_cli hdl_opt \
    ../data/boards/rainman_board_v2.pbtxt \
    --dfg_file=../model/ssd/ssd_opt_dfg.pbtxt \
    --dfg_data_file=../model/ssd/ssd_dfg.h5 \
    --opt-dfg-file=../model/ssd/ssd_hdl_dfg.pbtxt \
    --logdir=../model/ssd/logs

TensorFlow Version: 1.5.0
Optimise DFG for hardware-accelerated execution ...
LOAD BoardDef from file: "../data/boards/rainman_board_v2.pbtxt"
2018-07-24 17:40:02 INFO Loading data from file ../model/ssd/ssd_dfg.h5 ...
LOAD DFG from DFGDef file: "../model/ssd/ssd_opt_dfg.pbtxt" and data file: "../model/ssd/ssd_dfg.h5"
RUN  HDL optimization ...
2018-07-24 17:40:02 INFO Running optimization pass "Device Optimization" ...
2018-07-24 17:40:02 INFO Marked node "img_input" with "device: FPGA"
2018-07-24 17:40:02 INFO Marked node "ssd_KYnet_v2/conv1_1/conv2d" with "device: FPGA"
2018-07-24 17:40:02 INFO Marked node "ssd_KYnet_v2/conv2_1/conv2d" with "device: FPGA"
2018-07-24 17:40:02 INFO Marked node "ssd_KYnet_v2/conv2/conv2_1/conv2d" with "device: FPGA"
2018-07-24 17:40:02 INFO Marked node "ssd_KYnet_v2/conv2/conv2_2/conv2d" with "device: FPGA"
2018-07-24 17:40:02 INFO Marked node "ssd_KYnet_v2/conv3/conv3_1/conv2d" with "device: FPGA"
2018-07-24 17:40:02 INFO Marked node "ssd_KYnet_v2/conv

In the example above, `../data/boards/rainman_board_v2.pbtxt` is a board specification file that describes our Rainman V2 board. It simply specifies the number of each type of resource on our target board.

```protobuf
name: "RAINMAN_BOARD_V2"
num_lut: 218600
num_ff: 437200
num_bram: 545
num_dsp: 900
```
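Because the board file shown above is a flat list of `key: value` lines, it can be read with a few lines of Python for quick inspection. This is only a sketch for such flat files. The helper `parse_flat_pbtxt` is our own, it does not handle nested messages, and a real tool would more likely use the protobuf text-format parser together with the generated message class, which we do not have here.

```python
BOARD_TEXT = """
name: "RAINMAN_BOARD_V2"
num_lut: 218600
num_ff: 437200
num_bram: 545
num_dsp: 900
"""


def parse_flat_pbtxt(text):
    """Parse flat `key: value` lines into a dict (strings stay str, numbers become int)."""
    spec = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition(":")
        value = value.strip()
        spec[key.strip()] = value.strip('"') if value.startswith('"') else int(value)
    return spec


board = parse_flat_pbtxt(BOARD_TEXT)
print(board)
```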

The optimisation procedure will try to fuse operators, allocate the execution device for each operation, and explore the design space. You can check the progress through its verbose output.

`../model/ssd/ssd_hdl_dfg.pbtxt` is the optimised DFG, and in `../model/ssd/logs` you can locate a file called `hdl_params.json` that contains hardware design parameters. This JSON file is later used to generate the bitstream.

## Hardware Deployment

To generate the bitstream, we should provide the `hdl_params.json` to the `gen` command. `gen` will call the FPGA synthesis tool-chain to produce a bitstream. Since we don't include any FPGA tool in the virtual machine or JupyterHub, we simply present the command and its outcome:

## Importing DFG into raintime

Just as a quick recap: `raintime` is a software runtime library for processing CNNs on embedded FPGA systems. Computation nodes in a CNN can either be processed on the CPU or offloaded to the FPGA accelerator design built by `rainman`. It also consists of several parts, which can be summarised in a diagram without going into too much detail:

![raintime.png](../data/figs/raintime.png)

The nodes themselves are already implemented in `raintime`, but we need to streamline the execution and specify which data we want to extract, etc. In later versions this step is going to be completely automatic; at the moment we have to write a short demo.

If we were to write the demo in `raintime`, executing it on the FPGA would involve several important steps:

```C++
  // NOTE: dfg, data_map, runner, dfg_file_name and data_dir are assumed to be
  // declared elsewhere in the demo; this fragment only shows the main steps.
  int batch_size = 1;
  int n_channels = 3;
  int img_size = 256;  // matches --input-image-shape=1,256,256,3

  // Load the image as a 3-channel colour image
  // (assumed to be already img_size x img_size pixels)
  cv::Mat image = cv::imread(argv[1], cv::IMREAD_COLOR);

  // Reorder the pixel values from the default interleaved ordering of OpenCV
  // into a channel-by-channel layout
  std::vector<uint8_t> converted(n_channels * img_size * img_size);
  auto image_pointer = image.ptr();
  for (size_t i = 0; i < n_channels; i++) {
    for (size_t j = 0; j < img_size * img_size; j++) {
      converted[j + img_size * img_size * i] = image_pointer[n_channels * j + i];
    }
  }

  // Load DFGDef
  auto dfg_def = LoadDFGDefFromFile(dfg_file_name);

  // Use the integrated builder to build the graph and make abstractions to connect the CPU and FPGA
  *dfg = DFGBuilder(dfg_def).Build();

  // Load constant data map, including weights and biases
  *data_map = new DFGDataMap;
  (*data_map)->LoadFromDir(data_dir);

  // Dimensions of the input tensor: channels, height, width, batch size
  std::vector<int> dims;
  dims.push_back(n_channels);
  dims.push_back(img_size);
  dims.push_back(img_size);
  dims.push_back(batch_size);

  // Load input data-map into the DFG, without any particular pre-processing optimisations
  DFGDataMap *input_data_map = new DFGDataMap;
  input_data_map->LoadImage(converted, img_size, n_channels,
                            "input_tensor", dims, "no");

  // Run the DFG and extract the output data map from the runner
  auto output_data_map = runner->Run(dfg, data_map, input_data_map, true);
  auto output_data = output_data_map->get("Predictions").second;

  std::cout << "The number of detected faces: " << output_data_map->size() << std::endl;

  // House-keeping
  delete input_data_map;
  delete output_data_map;
```
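The pixel reordering loop in the demo above is the least obvious step, so here is the same index arithmetic as a small Python sketch on a toy input. OpenCV stores colour images interleaved (B, G, R for each pixel), while the loop produces one contiguous plane per channel. The function name `interleaved_to_planar` is our own.

```python
def interleaved_to_planar(pixels, n_channels):
    """Reorder [c0, c1, c2, c0, c1, c2, ...] into [all c0, all c1, all c2]."""
    n_pixels = len(pixels) // n_channels
    planar = [0] * len(pixels)
    for c in range(n_channels):
        for j in range(n_pixels):
            # Same indexing as the C++ loop: destination is plane c, position j.
            planar[j + n_pixels * c] = pixels[n_channels * j + c]
    return planar


# Two pixels with three channels each, interleaved.
pixels = [10, 20, 30, 11, 21, 31]
print(interleaved_to_planar(pixels, 3))  # [10, 11, 20, 21, 30, 31]
```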

Once you have finished writing the demo, you would have to compile your design on the board.

```bash
$ git clone https://github.com/corerain/raintime.git
$ cd raintime
$ mkdir build && cd build
# Create the compiling structure through CMake and specify the number of fraction bits (FB) for a 32 bit representation
$ cmake .. -DCMAKE_BUILD_TYPE=Release -DDEF_FIXED_NUM_FB_32=20 -DBUILD_TESTS=ON
$ make
$ ./your_demo
```

## Compilation Observation

This is fairly easy. `raintime` offers several settings for compiling a project, but we will avoid the details here. Once connected to a board with a preinstalled OS and a correct `BOOT.bin`, you would clone the raintime project with your demo and compile it.

... and voilà! You have described an algorithm in Python/TensorFlow and now you are executing it on an FPGA. Great, isn't it?

#### Resources

Remember the fixed-point representation from the previous tutorial? Besides accuracy and speed of execution, it also affects the resources used. Let's look at how many resources were used in each representation (32 bits and 16 bits):

| SSD Demo          | 32-bits | 16-bits |
|-------------------|---------|---------|
| Registers         | 280871  | 92780   |
| Block Memory bits | 12Mb    | 6349Kb  |
| DSP Blocks        | 168     | 150     |

You can see that the number of registers decreases approximately threefold and block memory usage decreases approximately twofold.
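To put these numbers in perspective, they can be compared with the resource counts from the board specification file. The sketch below assumes that the table was measured on the Rainman V2 board and that "Registers" correspond to `num_ff` and "DSP Blocks" to `num_dsp`. Neither assumption is stated in the table, so treat the percentages as illustrative.

```python
# Resource counts from rainman_board_v2.pbtxt.
BOARD = {"num_ff": 437200, "num_dsp": 900}

# Usage figures from the table above.
USAGE = {
    "32-bit": {"registers": 280871, "dsp": 168},
    "16-bit": {"registers": 92780, "dsp": 150},
}


def utilisation(used, available):
    """Return the share of a resource in use, in percent."""
    return 100.0 * used / available


for name, use in USAGE.items():
    ff = utilisation(use["registers"], BOARD["num_ff"])
    dsp = utilisation(use["dsp"], BOARD["num_dsp"])
    print("%s: registers %.1f%%, DSP blocks %.1f%%" % (name, ff, dsp))
```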

The process up to raintime is also available as a web application at this [link](http://corerain1.corerain.com:5005/), where you can view not only the SSD demo but also others. Below is the Corerain team.

![team.jpeg](../data/figs/team.jpeg)

Or you can see a video of the demonstration: 

In [1]:
%%HTML
<video width="640" height="480" controls>
  <source src="../data/figs/face_detecion.mp4" type="video/mp4">
</video>

### Contact

If you would like to discover more please do not hesitate to contact us at:

-   Professor Wayne Luk (w.luk@imperial.ac.uk)

-   Martin Ferianc (martin.ferianc@corerain.com)

-   Ruizhe (Vincent) Zhao (vincent.zhao@corerain.com)