<a href="https://colab.research.google.com/github/tcal-x/CFU-Playground/blob/first-prof-tflm/CFUPlay_prof_tflm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Profiling in CFU Playground

```
Copyright 2022 Google LLC.
SPDX-License-Identifier: Apache-2.0
```

To run locally on your laptop, either in Renode or on a real FPGA board, see the instructions below under "To do this example on your machine directly". Otherwise, continue here to use the Colab!

## Clone CFU-Playground

In [1]:
!git clone https://github.com/google/CFU-Playground.git
%cd CFU-Playground
!git checkout first-prof-tflm
!git branch
!git status

Cloning into 'CFU-Playground'...
remote: Enumerating objects: 14128, done.
remote: Counting objects: 100% (845/845), done.
remote: Compressing objects: 100% (416/416), done.
remote: Total 14128 (delta 422), reused 674 (delta 347), pack-reused 13283
Receiving objects: 100% (14128/14128), 26.49 MiB | 16.13 MiB/s, done.
Resolving deltas: 100% (8868/8868), done.
/content/CFU-Playground
Branch 'first-prof-tflm' set up to track remote branch 'first-prof-tflm' from 'origin'.
Switched to a new branch 'first-prof-tflm'
* first-prof-tflm
  main
On branch first-prof-tflm
Your branch is up to date with 'origin/first-prof-tflm'.

nothing to commit, working tree clean


## Setup

Now do the usual setup including getting the submodules.


In [None]:
!./scripts/setup -ci

## Get Tools

Install a Conda environment containing additional necessary tools.

In [None]:
%cd /content/CFU-Playground
!rm -rf env/
!make env

## Interactive Renode run

Now try running the following cell.  It compiles and simulates the `prof_tflm_tut` project using Renode in headless mode.  When you see `(monitor)`, click below it to get a box for text entry and type \<space\>\<enter\>. You need to hit "enter" after each key (unlike when running Renode directly on your own computer).  To run one KWS test, type the sequence 1(enter), 1(enter), 1(enter).  You should see it report that it consumed 47M cycles.


* Also, while running interactively, check out the "Benchmarking" menu option ("6" on the top menu).  There are some microbenchmarks for memory accesses that touch different regions with different access patterns.  In Renode, all memory accesses are ideal, so you won't see any difference between the access patterns.  You can easily add your own microbenchmarks, for example if you have a multi-cycle CFU and want to make sure the performance is what you expect.

* **This is tricky: To stop the interactive session**, hit the "stop" button (black circle with white square inside), then clear the output, and *then* hit the "stop" button again (and again if necessary).   You **must** stop this cell in order to run any other cell.


In [None]:
!(source env/conda/bin/activate cfu-common && cd proj/prof_tflm_tut && make TARGET=kosagi_fomu renode-headless)

## Automated (non-interactive) Renode run

This cell runs Renode using a Robot script, which is how it would run in CI or during an automated design space exploration.   It assumes we've already run `make renode`, `make renode-headless`, or `make renode-scripts` with the target, so that the `build/renode/*.resc` file has been generated.

The second cell below outputs the cycle count per TensorFlow op type as well as the total cycle count.  The total counts only cycles spent inside TF kernels, so it will be a bit less than 47M.

In [None]:
!pip install -r /content/CFU-Playground/third_party/renode/tests/requirements.txt
!echo "sysbus.uart CreateFileBackend @/tmp/uart.txt True" >> proj/prof_tflm_tut/build/renode/kosagi_fomu.resc
!uniq proj/prof_tflm_tut/build/renode/kosagi_fomu.resc > u
!mv u proj/prof_tflm_tut/build/renode/kosagi_fomu.resc
!rm -f /tmp/uart.txt
!(cd proj/prof_tflm_tut && ../../third_party/renode/renode-test cfu.robot --variable USER_INPUT:"1 1 1" --variable SCRIPT:$PWD/build/renode/kosagi_fomu.resc)

In [None]:
!python3 ./scripts/scanprof.py < /tmp/uart.txt
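
The first cell above appends the UART file-backend line to the `.resc` script and then runs `uniq` so that re-running the cell does not pile up copies. `uniq` only collapses *adjacent* identical lines, which is enough here because the line is always appended at the end, but it would also collapse any other adjacent duplicate lines in the script. A minimal Python sketch of the same idea, appending a line only if it is not already present, is shown below. The helper name `append_line_once` is hypothetical and not part of CFU Playground.

```python
from pathlib import Path


def append_line_once(path, line):
    """Append `line` to the file at `path` unless an identical line exists.

    Returns True if the line was written, False if it was already there.
    Hypothetical helper for illustration only.
    """
    p = Path(path)
    text = p.read_text() if p.exists() else ""
    if line in text.splitlines():
        return False  # already present, so re-running is a no-op
    # Start on a fresh line if the file does not end with a newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with p.open("a") as f:
        f.write(prefix + line + "\n")
    return True
```

In this notebook it could be called as `append_line_once("proj/prof_tflm_tut/build/renode/kosagi_fomu.resc", "sysbus.uart CreateFileBackend @/tmp/uart.txt True")`, after which the `uniq` and `mv` steps would no longer be needed.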

## Profiling Discussion

The main use of the cycle counts is to direct our focus: they show where to optimize, specialize, and eventually add hardware acceleration.  Here, we see that runtime is dominated by CONV_2D, so that is where we should start optimizing.

Here in this Colab we are simulating in Renode.  Usually with CFU Playground you are running directly on the FPGA board.  

Renode does not model the memory hierarchy, so in that sense it is not accurate: cache hits and misses both complete with zero latency.   However, we have found this to be a benefit.   On a real board, any small code change can alter the code layout and thus which parts of the code interfere with each other in the cache, so a trivial change can shift the cycle count by +/-20%.   That variation is just random noise with respect to the optimization being measured.   In Renode, the cycle count reflects the number of instructions executed, without the random noise.


You might think that running on the board will be much faster than simulation, but that is not always true.   It is likely that your laptop at 2-3 GHz can simulate the test case here faster than it would actually run on Fomu at 12 MHz.

In this example, we got the following cycle counts:

Renode: 47M original, 41M specialized (-13%)

On Fomu: 766M original, 614M specialized (-20%)
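
As a quick check on these figures, the sketch below recomputes the percentage reductions and estimates the wall-clock time of one inference on the board. It uses only the rounded cycle counts quoted above and the 12 MHz Fomu clock mentioned earlier; the function names are illustrative.

```python
FOMU_CLOCK_HZ = 12_000_000  # 12 MHz, as stated in the text above


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


def seconds_on_board(cycles, clock_hz=FOMU_CLOCK_HZ):
    """Wall-clock time for `cycles` cycles at the given clock rate."""
    return cycles / clock_hz


# (original, specialized) cycle counts, rounded as quoted in the text.
counts = {"Renode": (47e6, 41e6), "Fomu": (766e6, 614e6)}

for name, (original, specialized) in counts.items():
    print(f"{name}: {percent_change(original, specialized):+.1f}%")
# Renode: -12.8%
# Fomu: -19.8%

print(f"Fomu, original:    {seconds_on_board(766e6):.1f} s per inference")
print(f"Fomu, specialized: {seconds_on_board(614e6):.1f} s per inference")
# Fomu, original:    63.8 s per inference
# Fomu, specialized: 51.2 s per inference
```

At roughly a minute per inference on the board, it is plausible that a laptop simulating the 47M-cycle Renode run finishes sooner.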


## TFLM Software Specialization

We see that more than 75% of the execution time is spent in the conv2d op type, across 5 different layers.   Further, experiments show that many of the parameters are the same for 4 of those layers.    We can create a specialized version of the conv2d kernel in which these values are treated as constants:

```
// SPECIALIZED FOR KWS:
//   filter_height == 1
//   filter_width == 1
//   input_offset == 128
//   output_offset == -128
//   pad_height == 0
//   pad_width == 0
//   stride_height == 1
//   stride_width == 1
//   batches == 1
//   groups == 1
```

*   Navigate to `/content/CFU-Playground/proj/prof_tflm_tut/src/tensorflow/lite/kernels/internal/reference/integer_ops`.   These are files particular to this project that will overlay the original TFLM files.

* When the parameters match the values above, the code in `conv.h` invokes the specialized version in `kws_conv.h`; otherwise it executes the original general version.

* We have modified `conv.h` to dispatch to the *specialized* version in `kws_conv.h`, but we have not yet performed the specialization.  Open `kws_conv.h` by double-clicking it and, using the known values of many parameters, do some code simplifications yourself.  Or use the version that we have already optimized: remove `kws_conv.h` and rename `kws_conv.h-opt` to take its place.


* After any modifications, just rerun the "Interactive Renode run" cell above.   The recompilation should go very fast.  If you execute the same 1-1-1 sequence in the menu, hopefully you'll see a reduced cycle count.
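
To make the idea concrete, here is a toy Python model of the dispatch-and-specialize pattern described in the bullets above. It is not the TFLM code: it handles a single batch, omits bias, requantization, `output_offset`, and clamping, and every function name is made up for illustration. It only shows how fixing `filter_height`, `filter_width`, the strides, the padding, and `input_offset` lets the inner loops and bounds checks disappear.

```python
def conv_generic(inp, filt, stride_h, stride_w, pad_h, pad_w, input_offset):
    """General 2D convolution. inp[y][x][ic], filt[oc][fy][fx][ic]."""
    in_h, in_w, in_c = len(inp), len(inp[0]), len(inp[0][0])
    out_c, f_h, f_w = len(filt), len(filt[0]), len(filt[0][0])
    out_h = (in_h + 2 * pad_h - f_h) // stride_h + 1
    out_w = (in_w + 2 * pad_w - f_w) // stride_w + 1
    out = [[[0] * out_c for _ in range(out_w)] for _ in range(out_h)]
    for oy in range(out_h):
        for ox in range(out_w):
            for oc in range(out_c):
                acc = 0
                for fy in range(f_h):
                    for fx in range(f_w):
                        iy = oy * stride_h - pad_h + fy
                        ix = ox * stride_w - pad_w + fx
                        # Points outside the image contribute nothing.
                        if 0 <= iy < in_h and 0 <= ix < in_w:
                            for ic in range(in_c):
                                acc += ((inp[iy][ix][ic] + input_offset)
                                        * filt[oc][fy][fx][ic])
                out[oy][ox][oc] = acc
    return out


def conv_1x1_specialized(inp, filt):
    """Same result when filter is 1x1, strides are 1, padding is 0 and
    input_offset is 128: each output pixel is a dot product over channels."""
    out = []
    for row in inp:
        out_row = []
        for pixel in row:
            shifted = [v + 128 for v in pixel]  # input_offset folded in
            out_row.append([sum(s * w for s, w in zip(shifted, f[0][0]))
                            for f in filt])
        out.append(out_row)
    return out


def conv(inp, filt, stride_h, stride_w, pad_h, pad_w, input_offset):
    """Dispatch to the specialized kernel only when the parameters match."""
    f_h, f_w = len(filt[0]), len(filt[0][0])
    params = (f_h, f_w, stride_h, stride_w, pad_h, pad_w, input_offset)
    if params == (1, 1, 1, 1, 0, 0, 128):
        return conv_1x1_specialized(inp, filt)
    return conv_generic(inp, filt, stride_h, stride_w, pad_h, pad_w,
                        input_offset)


inp = [[[0, -128], [5, 7]]]   # 1 x 2 image, 2 channels
filt = [[[[1, 2]]]]           # 1 output channel, 1x1 filter
print(conv(inp, filt, 1, 1, 0, 0, 128) ==
      conv_generic(inp, filt, 1, 1, 0, 0, 128))  # True
```

The same reasoning applies when simplifying `kws_conv.h` by hand: with the parameters known at compile time, the filter loops run exactly once and the padding test is always true, so both can be removed.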

## Optional

Build a bitstream for Fomu to make sure everything works.   Uncomment the lines below and run the cell.

In [None]:
# !rm -rf third_party/usr
# !(source env/conda/bin/activate cfu-common && cd proj/prof_tflm_tut && make TARGET=kosagi_fomu clean bitstream)

# To do this example on your machine directly


### In all cases:

* Type the commands in the first 3 cells, without the leading `!` or `%`.  The last command will be "`make env`".
* `make enter` --- to enter the Conda environment.   It starts a new shell.  To exit the environment, just exit the shell with "`exit`".
* `cd proj/prof_tflm_tut` --- (usually when using CFU Playground, you are in a project directory like this).

### To run in Renode:
* `make TARGET=kosagi_fomu renode` --- Unlike how it works in the Colab, this runs Renode with the GUI.  Two new windows will pop up -- one for the "monitor", and one for the UART.   Type commands into the UART (no \<enter\> required), and type "`quit`" in the monitor window to exit.

### To run on Fomu directly:
* Power cycle Fomu so that it is in "boot" mode.  The LED should be cyan-colored, fading bright to dark.
* `make TARGET=kosagi_fomu clean`
* `make TARGET=kosagi_fomu bitstream`
* `make TARGET=kosagi_fomu TTY=/dev/ttyACM0 load`
* Those commands do the following: clean previous build artifacts; build the FPGA bitstream; then compile the software, flash bitstream+software to Fomu, and connect to it via a terminal program.
* To disconnect from Fomu, hit Ctrl-C two or three times.
* To incrementally compile and reload after modifying any software under ./src, just type the last command (`make TARGET=kosagi_fomu TTY=/dev/ttyACM0 load`) again.

# Run the KWS (keyword spotting) example CFU




Now let's run the KWS example with all of the optimizations (moving hot data and code to the fast SRAM, etc.) *except* for the CFU acceleration.   We will build the bitstream with the CFU, but we won't use it from software, so it won't help us, although it won't hurt us either, at least in terms of cycle count.

In [None]:
%cd /content/CFU-Playground/
!rm -rf third_party/usr
!(source env/conda/bin/activate cfu-common && cd proj/kws_micro_accel && make TARGET=kosagi_fomu renode-headless)

How many cycles for one inference?

OK, now let's enable the CFU.   Open `/content/CFU-Playground/proj/kws_micro_accel/Makefile` for editing by clicking the link.  Uncomment the two lines:
```
#DEFINES += OPT_ACCEL_CONV
#DEFINES += OPT_ACCEL_DEPTHWISE_CONV
```
Then run the cell below.
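
If you would rather script the edit than make it by hand, the sketch below uncomments selected `DEFINES` lines in Makefile text. It assumes the lines appear exactly as shown above, with `#` in the first column; the helper name `uncomment_defines` is hypothetical.

```python
import re

# Matches a commented-out line such as "#DEFINES += OPT_ACCEL_CONV".
_COMMENTED_DEFINE = re.compile(r"^#\s*(DEFINES\s*\+=\s*(\w+))\s*$")


def uncomment_defines(makefile_text, names):
    """Return `makefile_text` with '#DEFINES += NAME' uncommented
    for every NAME in `names`. All other lines are left unchanged."""
    out = []
    for line in makefile_text.splitlines():
        m = _COMMENTED_DEFINE.match(line)
        if m and m.group(2) in names:
            line = m.group(1)  # drop the leading '#'
        out.append(line)
    # Note: the result always ends with a single trailing newline.
    return "\n".join(out) + "\n"


example = "#DEFINES += OPT_ACCEL_CONV\n#DEFINES += OPT_ACCEL_DEPTHWISE_CONV\n"
print(uncomment_defines(example, {"OPT_ACCEL_CONV", "OPT_ACCEL_DEPTHWISE_CONV"}))
```

To apply it to the real file, read the Makefile with `pathlib.Path(...).read_text()`, pass the text through the function, and write the result back with `write_text()`.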

In [None]:
!(source env/conda/bin/activate cfu-common && cd proj/kws_micro_accel && make TARGET=kosagi_fomu renode-headless)