<a href="https://colab.research.google.com/github/putoale/DSDMChall/blob/main/Challenge_Bambu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Initial setup**

Install Bambu and required packages:

In [1]:
!echo "deb http://dk.archive.ubuntu.com/ubuntu/ xenial main universe" >> /etc/apt/sources.list
!echo "deb http://apt.llvm.org/bionic/ llvm-toolchain-bionic-11 main" >> /etc/apt/sources.list
!apt-get -o Acquire::AllowInsecureRepositories=true update
!apt-get -o Acquire::AllowInsecureRepositories=true --allow-unauthenticated install -y --no-install-recommends ca-certificates git libbdd-dev iverilog verilator gcc-4.9 gcc-4.9-plugin-dev gcc-4.9-multilib g++-4.9 g++-multilib gcc-multilib gcc-7-plugin-dev g++-multilib clang-6.0 libclang-6.0-dev clang-11 libclang-11-dev
!git clone https://github.com/SerenaC94/bambu-tutorial.git
!tar xf bambu-tutorial/panda-dist.tar.xz -C /
%env PATH=/opt/panda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin:/opt/bin

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:4 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Ign:5 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release [697 B]
Hit:7 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Hit:8 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Get:9 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release.gpg [836 B]
Hit:10 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:12 http://dk.archive.ubuntu.com/ubuntu xenial InRelease [247 kB]
Get:3 https://apt.llvm.org/bionic llvm-toolchain-bionic-11 InRelease [5,527 B]
Hit:13 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease


# **Introduction**

Have a look at the C code for **Exercise 1**: /content/bambu-tutorial/01-introduction/Exercise1/module.c

Launch bambu:

In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise1
!bambu icrc.c --top-fname=icrc1 --simulator=VERILATOR --simulate --generate-tb=test_icrc1.xml -v2 --print-dot --pretty-print=a.c 2>&1 | tee icrc1.log

Inspect the generated files in the explorer tab on the left:

*   icrc1.v
*   test_icrc1.xml
*   simulate_icrc1.sh
*   synthesize_Synthesis_icrc1.sh



Visualize the FSM:

In [None]:
from graphviz import Source
Source.from_file('HLS_output/dot/icrc1/HLS_STGraph.dot')

Navigate through the explorer to see the code for the other exercises, then **edit** this box to execute them:

In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise2
!./bambu.sh

# **Target selection and tool integration**

**Exercise 1**: synthesize a module that returns the minimum and maximum value in an array.

Start by modifying the code below:

In [None]:
%cd /content/bambu-tutorial/02-target_customization/Exercise1/

In [None]:
%%writefile minmax.c
void max(int input[10], int * max)
{
   int local_max = input[0];
   int i = 0;
   for(i = 0; i < 10; i++)
   {
      if(input[i] > local_max)
      {
         local_max = input[i];
      }
   }
   *max = local_max;
}

Synthesize with bambu:

In [None]:
!bambu minmax.c

**Exercise 2**: write a testbench to test arrays with different elements and different sizes.

Start by modifying the code below **(change parameter names so that they correspond to function arguments in your code)**:

In [None]:
%%writefile test.xml
<?xml version="1.0"?>
<function>
   <testbench input="{0,1,2,3,4}" num_elements="5"/>
</function>

In [None]:
!bambu minmax.c --generate-tb=test.xml --simulate

**Exercise 3**: compare simulations across different target platforms and frequencies.

Start from the given command and modify the options appropriately to test the following combinations:


*   xc4vlx100-10ff1513 – 66MHz
*   5SGXEA7N2F45C1 – 200MHz
*   xc7vx690t-3ffg1930-VVD – 100MHz
*   xc7vx690t-3ffg1930-VVD – 333MHz
*   xc7vx690t-3ffg1930-VVD – 400MHz



In [None]:
!bambu minmax.c --device-name=xc4vlx100-10ff1513 --clock-period=15 --simulate
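
For the remaining combinations, the value passed to `--clock-period` follows from the target frequency as period [ns] = 1000 / frequency [MHz], which is how 66MHz becomes roughly 15 in the command above. The sketch below assumes the period is expressed in nanoseconds and only prints one command per combination; the helper name `clock_period_ns` and the two-decimal rounding are illustrative choices, not part of the tutorial.

```python
def clock_period_ns(freq_mhz):
    """Convert a clock frequency in MHz to a clock period in nanoseconds."""
    return 1000.0 / freq_mhz

# Device/frequency pairs listed in Exercise 3.
targets = [
    ("xc4vlx100-10ff1513", 66),
    ("5SGXEA7N2F45C1", 200),
    ("xc7vx690t-3ffg1930-VVD", 100),
    ("xc7vx690t-3ffg1930-VVD", 333),
    ("xc7vx690t-3ffg1930-VVD", 400),
]

# Print the command for each combination (nothing is executed here).
for device, mhz in targets:
    period = clock_period_ns(mhz)
    print(f"bambu minmax.c --device-name={device} --clock-period={period:.2f} --simulate")
```

Each printed line can be pasted into a cell with a leading `!` to run it.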

# **Optimizations**

## **Exercise 1**: 

Modify Bambu options to evaluate the effect of:


*   different levels of optimization (-O0, -O1, -O2, -O3, -Os)
*   vectorization (-ftree-vectorize)
*   inlining (-finline-limit=100000)
*   different frontend compilers (--compiler={I386_GCC49|I386_GCC7|I386_CLANG6|I386_CLANG11})


##### **ADPCM from CHStone benchmark suite**
Adaptive Differential Pulse-Code Modulation (ADPCM) is an algorithm used for audio compression (mainly in telephony). It is part of the CHStone benchmark suite for C-based HLS tools.

In [None]:
%cd /content/bambu-tutorial/03-optimizations/Exercise1/
!bambu adpcm.c -O0 --simulate
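
One way to organize the comparison is to enumerate the option combinations up front. The sketch below only builds and prints command strings from the optimization levels and frontend compilers listed in this exercise; it does not run Bambu, and the variable names are illustrative. The vectorization and inlining flags could be appended to each command in the same way.

```python
from itertools import product

# Options taken from the Exercise 1 list.
opt_levels = ["-O0", "-O1", "-O2", "-O3", "-Os"]
compilers = ["I386_GCC49", "I386_GCC7", "I386_CLANG6", "I386_CLANG11"]

# One command per (optimization level, frontend compiler) pair.
commands = [
    f"bambu adpcm.c {opt} --compiler={cc} --simulate"
    for opt, cc in product(opt_levels, compilers)
]

for cmd in commands:
    print(cmd)
```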

## **Exercise 2**: 

Take the command that yielded the best result in Exercise 1 and check whether SDC scheduling brings further improvements.

* -s or --speculative-sdc-scheduling

In [None]:
%cd /content/bambu-tutorial/03-optimizations/Exercise1/
!bambu adpcm.c -O0 --simulate

## **Exercise 3**:

Modify Bambu options to evaluate the effect of different integer division implementations.

--hls-div=<method\>
* none  - use a HDL based pipelined restoring division
* nr1   - use a C-based non-restoring division with unrolling factor equal to 1 (default)
* nr2   - use a C-based non-restoring division with unrolling factor equal to 2
* NR    - use a C-based Newton-Raphson division
* as    - use a C-based align divisor shift dividend method


##### **FPDiv from CHStone**
Soft floating-point division implementation from the CHStone benchmark suite for C-based HLS.

In [None]:
%cd /content/bambu-tutorial/03-optimizations/Exercise3/
!bambu dfdiv.c --simulate --clock-period=15 --hls-div=none

## **Exercise 4**: 

Write a C implementation that computes the following function:

# $awesome\_math(a,b,c) = acos(\frac{a^2+b^2-c^2}{2ab})$

Experiment with single- and double-precision data types and with the different softfloat and libm implementations offered by Bambu.

Start by editing this code and then try different bambu options:
* Different floating-point arithmetic implementations (--soft-float, --soft-fp, --flopoco)
* Different libm implementations (--libm-std-rounding)

In [None]:
%%writefile /content/bambu-tutorial/03-optimizations/Exercise4/module.c
#include <math.h>
float awesome_math(float a, float b, float c)
{
   return acosf((powf(a,2) + powf(b,2) - powf(c,2))/(2*a*b));
}

In [None]:
%cd /content/bambu-tutorial/03-optimizations/Exercise4/
!bambu module.c -O3 -lm --simulate --top-fname=awesome_math --generate-tb="a=3.0,b=4.0,c=5.0" --speculative-sdc-scheduling --libm-std-rounding --hls-div=none --soft-float
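
The function is the law of cosines solved for the angle opposite side c, so the test values a=3, b=4, c=5 (a right triangle) should give an angle close to π/2. The short Python reference below is offered as a sanity check against the simulated result, not as part of the Bambu flow.

```python
import math

def awesome_math(a, b, c):
    """Angle in radians opposite side c, from the law of cosines."""
    return math.acos((a**2 + b**2 - c**2) / (2 * a * b))

# Same inputs as the --generate-tb string in the cell above.
print(awesome_math(3.0, 4.0, 5.0))  # should be close to pi/2
```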

# **Challenge: LLSQ implementation**

Implementation and verification of the LLSQ algorithm.

## 1. Workspace creation:

In [2]:
%mkdir /content/challenge_llsq/
%cd /content/challenge_llsq/

/content/challenge_llsq


## 2. Creation of C files

### Single precision floating point implementation

In [3]:
%%writefile llsq_sp.c
void llsq ( int n, float x[], float y[], float *a, float *b )

/******************************************************************************/
/*
  Purpose:

    LLSQ solves a linear least squares problem matching y=a*x+b  to data.

  Discussion:

    A formula for a line of the form Y = A * X + B is sought, which
    will minimize the root-mean-square error to N data points ( X[I], Y[I] );

  Licensing:

    This code is distributed under the GNU LGPL license.

  Modified:

    07 March 2012

  Author:

    John Burkardt

  Parameters:

    Input, int N, the number of data values.

    Input, float X[N], Y[N], the coordinates of the data points.

    Output, float *A, *B, the slope and Y-intercept of the least-squares
    approximant to the data.
*/
{
  float bot;
  int i;
  float top;
  float xbar;
  float ybar;
/*
  Special case.
*/
  if ( n == 1 )
  {
    *a = 0.0;
    *b = y[0];
    return;
  }
/*
  Average X and Y.
*/
  xbar = 0.0;
  ybar = 0.0;
  for ( i = 0; i < n; i++ )
  {
    xbar = xbar + x[i];
    ybar = ybar + y[i];
  }
  xbar = xbar / ( float ) n;
  ybar = ybar / ( float ) n;
/*
  Compute Beta.
*/
  top = 0.0;
  bot = 0.0;
  for ( i = 0; i < n; i++ )
  {
    top = top + ( x[i] - xbar ) * ( y[i] - ybar );
    bot = bot + ( x[i] - xbar ) * ( x[i] - xbar );
  }
  *a = top / bot;

  *b = ybar - *a * xbar;

  return;
}

Writing llsq_sp.c


### Double precision floating point implementation

In [4]:
%%writefile llsq_dp.c
void llsq ( int n, double x[], double y[], double *a, double *b )

/******************************************************************************/
/*
  Purpose:

    LLSQ solves a linear least squares problem matching y=a*x+b  to data.

  Discussion:

    A formula for a line of the form Y = A * X + B is sought, which
    will minimize the root-mean-square error to N data points ( X[I], Y[I] );

  Licensing:

    This code is distributed under the GNU LGPL license.

  Modified:

    07 March 2012

  Author:

    John Burkardt

  Parameters:

    Input, int N, the number of data values.

    Input, double X[N], Y[N], the coordinates of the data points.

    Output, double *A, *B, the slope and Y-intercept of the least-squares
    approximant to the data.
*/
{
  double bot;
  int i;
  double top;
  double xbar;
  double ybar;
/*
  Special case.
*/
  if ( n == 1 )
  {
    *a = 0.0;
    *b = y[0];
    return;
  }
/*
  Average X and Y.
*/
  xbar = 0.0;
  ybar = 0.0;
  for ( i = 0; i < n; i++ )
  {
    xbar = xbar + x[i];
    ybar = ybar + y[i];
  }
  xbar = xbar / ( double ) n;
  ybar = ybar / ( double ) n;
/*
  Compute Beta.
*/
  top = 0.0;
  bot = 0.0;
  for ( i = 0; i < n; i++ )
  {
    top = top + ( x[i] - xbar ) * ( y[i] - ybar );
    bot = bot + ( x[i] - xbar ) * ( x[i] - xbar );
  }
  *a = top / bot;

  *b = ybar - *a * xbar;

  return;
}

Writing llsq_dp.c


## 3. Testbench creation

In [5]:
%%writefile test_llsq.xml
<?xml version="1.0"?>
<function>
   <testbench n="2" x="{1,2}" y="{3,5}"/>
   <testbench n="4" x="{1.0988,6.3221,8.1276,2.4544}" y="{3,5,12.2334,7.4554}"/>
</function>

Writing test_llsq.xml
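
To know what slope and intercept to expect from each test vector, the same computation can be mirrored in Python. This is a minimal reference sketch that follows the steps of `llsq_sp.c` / `llsq_dp.c` in double precision; it returns the pair `(a, b)` instead of writing through pointers. The first vector lies exactly on the line y = 2x + 1.

```python
def llsq(x, y):
    """Least-squares slope a and intercept b of y = a*x + b."""
    n = len(x)
    # Special case, as in the C code.
    if n == 1:
        return 0.0, y[0]
    # Average X and Y.
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Slope: covariance of (x, y) over variance of x.
    top = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    bot = sum((xi - xbar) ** 2 for xi in x)
    a = top / bot
    b = ybar - a * xbar
    return a, b

# The two test vectors from test_llsq.xml.
print(llsq([1, 2], [3, 5]))
print(llsq([1.0988, 6.3221, 8.1276, 2.4544], [3, 5, 12.2334, 7.4554]))
```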


## 4. Bambu simulation:

### Single precision:

Bambu simple simulation

In [6]:
%cd /content/challenge_llsq/
!bambu llsq_sp.c --simulate --top-fname=llsq --generate-tb=test_llsq.xml -v 4 2>&1 |tee simple_log_sp.txt

Output streaming truncated to the last 5000 lines.
  Starting execution of Frontend::VarAnalysis::__float32_divSRT4if
  Ended execution of Frontend::VarAnalysis::__float32_divSRT4if(74)[42] in 0.00 seconds - Virtual Memory: 166MB
  Starting execution of Frontend::ScalarSsaDataFlowAnalysis::__float32_divSRT4if
  Ended execution of Frontend::ScalarSsaDataFlowAnalysis::__float32_divSRT4if(74)[42] in 0.00 seconds - Virtual Memory: 166MB
  Starting execution of Frontend::VirtualAggregateDataFlowAnalysis::__float32_divSRT4if
  Ended execution of Frontend::VirtualAggregateDataFlowAnalysis::__float32_divSRT4if(74)[42] in 0.00 seconds - Virtual Memory: 166MB
  Starting execution of Frontend::AggregateDataFlowAnalysis::__float32_divSRT4if
  Ended execution of Frontend::AggregateDataFlowAnalysis::__float32_divSRT4if(74)[42] in 0.00 seconds - Virtual Memory: 166MB
  Starting execution of Frontend::BuildVirtualPhi::__float32_mulif
  Ended execution of Frontend::BuildVirtualPhi::__float

In [7]:
!head -1 simple_log_sp.txt
!echo -------------------
!cat simple_log_sp.txt | grep "Total number of flip-flops in function llsq"
!echo -------------------
!cat simple_log_sp.txt | grep "results.txt" -A 12

 ==  Bambu executed with: bambu --simulate --top-fname=llsq --generate-tb=test_llsq.xml -v 4 llsq_sp.c 
-------------------
    Total number of flip-flops in function llsq: 642
-------------------
File "/content/challenge_llsq/results.txt" opened
1. Simulation completed with SUCCESS; Execution time 113 cycles;
2. Simulation completed with SUCCESS; Execution time 161 cycles;
  Ended execution of HLS::SimulationEvaluation(2) in 0.61 seconds - Virtual Memory: 195MB
  Starting execution of HLS::Evaluation
    Total cycles             : 274 cycles
    Number of executions     : 2
    Average execution        : 137 cycles
  Ended execution of HLS::Evaluation in 0.00 seconds - Virtual Memory: 195MB
  Starting execution of HLS::ClassicalHLSSynthesisFlow
  Ended execution of HLS::ClassicalHLSSynthesisFlow in 0.00 seconds - Virtual Memory: 195MB
  Starting execution of Exit
  Ended execution of Exit in 0.00 seconds - Virtual Memory: 195MB
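
The same figures can be collected programmatically instead of with `grep`. The sketch below assumes the log lines keep the wording shown in the output above; the function name `parse_bambu_log` and the dictionary keys are hypothetical, and the sample text is copied from that output.

```python
import re

def parse_bambu_log(text):
    """Extract flip-flop count and cycle figures from Bambu log text."""
    patterns = {
        "flip_flops": r"Total number of flip-flops in function \w+: (\d+)",
        "total_cycles": r"Total cycles\s*:\s*(\d+) cycles",
        "avg_cycles": r"Average execution\s*:\s*(\d+) cycles",
    }
    metrics = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        # None marks a metric that was not found in the log.
        metrics[key] = int(match.group(1)) if match else None
    return metrics

sample = """
    Total number of flip-flops in function llsq: 642
    Total cycles             : 274 cycles
    Number of executions     : 2
    Average execution        : 137 cycles
"""
print(parse_bambu_log(sample))
```

In the notebook, the text would come from a log file, for example `parse_bambu_log(open("simple_log_sp.txt").read())`.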


Bambu simulation w/ optimizations and --hls-div=none



In [None]:
!bambu llsq_sp.c -O5 -lm --simulate --top-fname=llsq --generate-tb=test_llsq.xml --speculative-sdc-scheduling --libm-std-rounding --hls-div=none --soft-float -v 4 2>&1 |tee opt1_log_sp.txt

In [None]:
!head -1 opt1_log_sp.txt
!echo -------------------
!cat opt1_log_sp.txt | grep "Total number of flip-flops in function llsq"
!echo -------------------
!cat opt1_log_sp.txt | grep "results.txt" -A 12

 ==  Bambu executed with: bambu -O5 -lm --simulate --top-fname=llsq --generate-tb=test_llsq.xml --speculative-sdc-scheduling --libm-std-rounding --hls-div=none --soft-float -v 4 llsq_sp.c 
-------------------
    Total number of flip-flops in function llsq: 610
-------------------
File "/content/challenge_llsq/results.txt" opened
1. Simulation completed with SUCCESS; Execution time 112 cycles;
2. Simulation completed with SUCCESS; Execution time 160 cycles;
  Ended execution of HLS::SimulationEvaluation(2) in 0.56 seconds - Virtual Memory: 237MB
  Starting execution of HLS::Evaluation
    Total cycles             : 272 cycles
    Number of executions     : 2
    Average execution        : 136 cycles
  Ended execution of HLS::Evaluation in 0.00 seconds - Virtual Memory: 237MB
  Starting execution of HLS::ClassicalHLSSynthesisFlow
  Ended execution of HLS::ClassicalHLSSynthesisFlow in 0.00 seconds - Virtual Memory: 237MB
  Starting execution of Exit
  Ended execution of Exit in 0.00 seco

Bambu simulation w/ optimizations and --hls-div=nr1



In [None]:
!bambu llsq_sp.c -O5 -lm --simulate --top-fname=llsq --generate-tb=test_llsq.xml --speculative-sdc-scheduling --libm-std-rounding --hls-div=nr1 --soft-float -v 4 2>&1 |tee opt2_log_sp.txt

In [None]:
!head -1 opt2_log_sp.txt
!echo -------------------
!cat opt2_log_sp.txt | grep "Total number of flip-flops in function llsq"
!echo -------------------
!cat opt2_log_sp.txt | grep "results.txt" -A 12

 ==  Bambu executed with: bambu -O5 -lm --simulate --top-fname=llsq --generate-tb=test_llsq.xml --speculative-sdc-scheduling --libm-std-rounding --hls-div=nr1 --soft-float -v 4 llsq_sp.c 
-------------------
    Total number of flip-flops in function llsq: 610
-------------------
File "/content/challenge_llsq/results.txt" opened
1. Simulation completed with SUCCESS; Execution time 112 cycles;
2. Simulation completed with SUCCESS; Execution time 160 cycles;
  Ended execution of HLS::SimulationEvaluation(2) in 0.59 seconds - Virtual Memory: 245MB
  Starting execution of HLS::Evaluation
    Total cycles             : 272 cycles
    Number of executions     : 2
    Average execution        : 136 cycles
  Ended execution of HLS::Evaluation in 0.00 seconds - Virtual Memory: 245MB
  Starting execution of HLS::ClassicalHLSSynthesisFlow
  Ended execution of HLS::ClassicalHLSSynthesisFlow in 0.00 seconds - Virtual Memory: 245MB
  Starting execution of Exit
  Ended execution of Exit in 0.00 secon

Bambu simulation w/ optimizations and --hls-div=nr2



In [None]:
!bambu llsq_sp.c -O5 -lm --simulate --top-fname=llsq --generate-tb=test_llsq.xml --speculative-sdc-scheduling --libm-std-rounding --hls-div=nr2 --soft-float -v 4 2>&1 |tee opt2_nr2_log_sp.txt

In [None]:
!head -1 opt2_nr2_log_sp.txt
!echo -------------------
!cat opt2_nr2_log_sp.txt | grep "Total number of flip-flops in function llsq"
!echo -------------------
!cat opt2_nr2_log_sp.txt | grep "results.txt" -A 12

 ==  Bambu executed with: bambu -O5 -lm --simulate --top-fname=llsq --generate-tb=test_llsq.xml --speculative-sdc-scheduling --libm-std-rounding --hls-div=nr2 --soft-float -v 4 llsq_sp.c 
    Total number of flip-flops in function llsq: 610
File "/content/challenge_llsq/results.txt" opened
1. Simulation completed with SUCCESS; Execution time 112 cycles;
2. Simulation completed with SUCCESS; Execution time 160 cycles;
  Ended execution of HLS::SimulationEvaluation(2) in 0.58 seconds - Virtual Memory: 228MB
  Starting execution of HLS::Evaluation
    Total cycles             : 272 cycles
    Number of executions     : 2
    Average execution        : 136 cycles
  Ended execution of HLS::Evaluation in 0.00 seconds - Virtual Memory: 228MB
  Starting execution of HLS::ClassicalHLSSynthesisFlow
  Ended execution of HLS::ClassicalHLSSynthesisFlow in 0.00 seconds - Virtual Memory: 228MB
  Starting execution of Exit
  Ended execution of Exit in 0.00 seconds - Virtual Memory: 228MB


Bambu simulation w/ optimizations and --hls-div=nr3



In [None]:
!bambu llsq_sp.c -O5 -lm --simulate --top-fname=llsq --generate-tb=test_llsq.xml --speculative-sdc-scheduling --libm-std-rounding --hls-div=nr3 --soft-float -v 4 2>&1 |tee opt3_log_sp.txt

Output streaming truncated to the last 5000 lines.
      Source: ui_bit_ior_expr_FU_32_32_32_147_i1 Target: ui_bit_ior_expr_FU_32_0_32_146_i0(0)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i0 Target: ui_bit_ior_expr_FU_8_8_8_148_i1(0)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i1 Target: ui_bit_and_expr_FU_1_1_1_130_i1(1)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i1 Target: ui_bit_ior_expr_FU_1_1_1_143_i0(1)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i1 Target: ui_rshift_expr_FU_8_0_8_182_i0(0)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i2 Target: ui_bit_ior_expr_FU_8_8_8_148_i3(0)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i3 Target: ui_bit_and_expr_FU_1_0_1_128_i0(0)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i3 Target: ui_bit_and_expr_FU_1_1_1

In [None]:
!head -1 opt3_log_sp.txt
!echo -------------------
!cat opt3_log_sp.txt | grep "Total number of flip-flops in function llsq"
!echo -------------------
!cat opt3_log_sp.txt | grep "results.txt" -A 12

 ==  Bambu executed with: bambu -O5 -lm --simulate --top-fname=llsq --generate-tb=test_llsq.xml --speculative-sdc-scheduling --libm-std-rounding --hls-div=nr3 --soft-float -v 4 llsq_sp.c 
-------------------
    Total number of flip-flops in function llsq: 610
-------------------
File "/content/challenge_llsq/results.txt" opened
1. Simulation completed with SUCCESS; Execution time 112 cycles;
2. Simulation completed with SUCCESS; Execution time 160 cycles;
  Ended execution of HLS::SimulationEvaluation(2) in 0.58 seconds - Virtual Memory: 256MB
  Starting execution of HLS::Evaluation
    Total cycles             : 272 cycles
    Number of executions     : 2
    Average execution        : 136 cycles
  Ended execution of HLS::Evaluation in 0.00 seconds - Virtual Memory: 256MB
  Starting execution of HLS::ClassicalHLSSynthesisFlow
  Ended execution of HLS::ClassicalHLSSynthesisFlow in 0.00 seconds - Virtual Memory: 256MB
  Starting execution of Exit
  Ended execution of Exit in 0.00 secon

Bambu simulation w/ optimizations and --hls-div=NR



In [None]:
!bambu llsq_sp.c -O5 -lm --simulate --top-fname=llsq --generate-tb=test_llsq.xml --speculative-sdc-scheduling --libm-std-rounding --hls-div=NR --soft-float -v 4 2>&1 |tee opt4_log_sp.txt

Output streaming truncated to the last 5000 lines.
      Source: ui_bit_ior_expr_FU_32_32_32_147_i1 Target: ui_bit_ior_expr_FU_32_0_32_146_i0(0)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i0 Target: ui_bit_ior_expr_FU_8_8_8_148_i1(0)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i1 Target: ui_bit_and_expr_FU_1_1_1_130_i1(1)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i1 Target: ui_bit_ior_expr_FU_1_1_1_143_i0(1)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i1 Target: ui_rshift_expr_FU_8_0_8_182_i0(0)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i2 Target: ui_bit_ior_expr_FU_8_8_8_148_i3(0)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i3 Target: ui_bit_and_expr_FU_1_0_1_128_i0(0)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i3 Target: ui_bit_and_expr_FU_1_1_1

In [None]:
!head -1 opt4_log_sp.txt
!cat opt4_log_sp.txt | grep "Total number of flip-flops in function llsq"
!cat opt4_log_sp.txt | grep "results.txt" -A 12

 ==  Bambu executed with: bambu -O5 -lm --simulate --top-fname=llsq --generate-tb=test_llsq.xml --speculative-sdc-scheduling --libm-std-rounding --hls-div=NR --soft-float -v 4 llsq_sp.c 
    Total number of flip-flops in function llsq: 610
File "/content/challenge_llsq/results.txt" opened
1. Simulation completed with SUCCESS; Execution time 112 cycles;
2. Simulation completed with SUCCESS; Execution time 160 cycles;
  Ended execution of HLS::SimulationEvaluation(2) in 0.59 seconds - Virtual Memory: 252MB
  Starting execution of HLS::Evaluation
    Total cycles             : 272 cycles
    Number of executions     : 2
    Average execution        : 136 cycles
  Ended execution of HLS::Evaluation in 0.00 seconds - Virtual Memory: 252MB
  Starting execution of HLS::ClassicalHLSSynthesisFlow
  Ended execution of HLS::ClassicalHLSSynthesisFlow in 0.00 seconds - Virtual Memory: 252MB
  Starting execution of Exit
  Ended execution of Exit in 0.00 seconds - Virtual Memory: 252MB


Bambu simulation w/ optimizations and --hls-div=as





In [None]:
!bambu llsq_sp.c -O5 -lm --simulate --top-fname=llsq --generate-tb=test_llsq.xml --speculative-sdc-scheduling --libm-std-rounding --hls-div=as --soft-float -v 4 2>&1 |tee opt5_log_sp.txt

Output streaming truncated to the last 5000 lines.
      Source: ui_bit_ior_expr_FU_32_32_32_147_i1 Target: ui_bit_ior_expr_FU_32_0_32_146_i0(0)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i0 Target: ui_bit_ior_expr_FU_8_8_8_148_i1(0)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i1 Target: ui_bit_and_expr_FU_1_1_1_130_i1(1)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i1 Target: ui_bit_ior_expr_FU_1_1_1_143_i0(1)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i1 Target: ui_rshift_expr_FU_8_0_8_182_i0(0)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i2 Target: ui_bit_ior_expr_FU_8_8_8_148_i3(0)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i3 Target: ui_bit_and_expr_FU_1_0_1_128_i0(0)  connected by: DIRECT_CONNECTION
      Source: ui_bit_ior_expr_FU_8_8_8_148_i3 Target: ui_bit_and_expr_FU_1_1_1

In [None]:
!head -1 opt5_log_sp.txt
!echo -------------------
!cat opt5_log_sp.txt | grep "Total number of flip-flops in function llsq"
!echo -------------------
!cat opt5_log_sp.txt | grep "results.txt" -A 12

 ==  Bambu executed with: bambu -O5 -lm --simulate --top-fname=llsq --generate-tb=test_llsq.xml --speculative-sdc-scheduling --libm-std-rounding --hls-div=as --soft-float -v 4 llsq_sp.c 
-------------------
    Total number of flip-flops in function llsq: 610
-------------------
File "/content/challenge_llsq/results.txt" opened
1. Simulation completed with SUCCESS; Execution time 112 cycles;
2. Simulation completed with SUCCESS; Execution time 160 cycles;
  Ended execution of HLS::SimulationEvaluation(2) in 0.59 seconds - Virtual Memory: 241MB
  Starting execution of HLS::Evaluation
    Total cycles             : 272 cycles
    Number of executions     : 2
    Average execution        : 136 cycles
  Ended execution of HLS::Evaluation in 0.00 seconds - Virtual Memory: 241MB
  Starting execution of HLS::ClassicalHLSSynthesisFlow
  Ended execution of HLS::ClassicalHLSSynthesisFlow in 0.00 seconds - Virtual Memory: 241MB
  Starting execution of Exit
  Ended execution of Exit in 0.00 second

### Double precision:

Simulation w/o optimizations:

In [8]:
%cd /content/challenge_llsq/
!bambu llsq_dp.c --simulate --top-fname=llsq --generate-tb=test_llsq.xml -v 4 2>&1 |tee simple_log_dp.txt

Output streaming truncated to the last 5000 lines.
      Operation __float64_addif_12095_12250(bit_and_expr) scheduled at control step (2-2) on functional unit ui_bit_and_expr_FU_64_64_64_192
      Operation __float64_addif_12095_12251(bit_ior_expr) scheduled at control step (2-2) on functional unit ui_bit_ior_expr_FU_64_64_64_203
      Operation __float64_addif_12095_12252(rshift_expr) scheduled at control step (2-2) on functional unit ui_rshift_expr_FU_64_0_64_255
      Operation __float64_addif_12095_12253(bit_and_expr) scheduled at control step (2-2) on functional unit ui_bit_and_expr_FU_64_64_64_192
      Operation __float64_addif_12095_12254(bit_and_expr) scheduled at control step (2-2) on functional unit ui_bit_and_expr_FU_64_64_64_192
      Operation __float64_addif_12095_12255(bit_ior_expr) scheduled at control step (2-2) on functional unit ui_bit_ior_expr_FU_64_64_64_203
      Operation __float64_addif_12095_12263(lshift_expr) scheduled at control step (2-2) on f

In [9]:
!head -1 simple_log_dp.txt
!echo -------------------
!cat simple_log_dp.txt | grep "Total number of flip-flops in function llsq"
!echo -------------------
!cat simple_log_dp.txt | grep "results.txt" -A 12

 ==  Bambu executed with: bambu --simulate --top-fname=llsq --generate-tb=test_llsq.xml -v 4 llsq_dp.c 
-------------------
    Total number of flip-flops in function llsq: 1154
-------------------
File "/content/challenge_llsq/results.txt" opened
1. Simulation completed with SUCCESS; Execution time 188 cycles;
2. Simulation completed with SUCCESS; Execution time 260 cycles;
  Ended execution of HLS::SimulationEvaluation(2) in 0.70 seconds - Virtual Memory: 214MB
  Starting execution of HLS::Evaluation
    Total cycles             : 448 cycles
    Number of executions     : 2
    Average execution        : 224 cycles
  Ended execution of HLS::Evaluation in 0.00 seconds - Virtual Memory: 214MB
  Starting execution of HLS::ClassicalHLSSynthesisFlow
  Ended execution of HLS::ClassicalHLSSynthesisFlow in 0.00 seconds - Virtual Memory: 214MB
  Starting execution of Exit
  Ended execution of Exit in 0.00 seconds - Virtual Memory: 214MB


Simulation w/ optimizations:

In [10]:
!bambu llsq_dp.c -O5 -lm --simulate --top-fname=llsq --generate-tb=test_llsq.xml --speculative-sdc-scheduling --libm-std-rounding --hls-div=none --soft-float -v 4 2>&1 |tee opt1_log_dp.txt

Output streaming truncated to the last 5000 lines.
  Starting execution of Frontend::SymbolicApplicationFrontendFlowStep(SplitReturn)
  Ended execution of Frontend::SymbolicApplicationFrontendFlowStep(SplitReturn) in 0.00 seconds - Virtual Memory: 211MB
  Starting execution of Frontend::CSE::llsq(361)[174]
  Ended execution of Frontend::CSE::llsq(366)[175] in 0.00 seconds - Virtual Memory: 211MB
  Starting execution of Frontend::SymbolicApplicationFrontendFlowStep(CSE)
  Ended execution of Frontend::SymbolicApplicationFrontendFlowStep(CSE) in 0.00 seconds - Virtual Memory: 211MB
  Starting execution of Frontend::UpdateSchedule::llsq(362)[175]
  Ended execution of Frontend::UpdateSchedule::llsq(366)[175] in 0.01 seconds - Virtual Memory: 211MB
  Starting execution of Frontend::MultiWayIf::llsq(365)[175]
  Ended execution of Frontend::MultiWayIf::llsq(366)[175] in 0.00 seconds - Virtual Memory: 211MB
  Starting execution of Frontend::SymbolicApplicationFrontendFlowStep(Multi

In [11]:
!head -1 opt1_log_dp.txt
!echo -------------------
!cat opt1_log_dp.txt | grep "Total number of flip-flops in function llsq"
!echo -------------------
!cat opt1_log_dp.txt | grep "results.txt" -A 12

 ==  Bambu executed with: bambu -O5 -lm --simulate --top-fname=llsq --generate-tb=test_llsq.xml --speculative-sdc-scheduling --libm-std-rounding --hls-div=none --soft-float -v 4 llsq_dp.c 
-------------------
    Total number of flip-flops in function llsq: 1122
-------------------
File "/content/challenge_llsq/results.txt" opened
1. Simulation completed with SUCCESS; Execution time 185 cycles;
2. Simulation completed with SUCCESS; Execution time 257 cycles;
  Ended execution of HLS::SimulationEvaluation(2) in 0.74 seconds - Virtual Memory: 226MB
  Starting execution of HLS::Evaluation
    Total cycles             : 442 cycles
    Number of executions     : 2
    Average execution        : 221 cycles
  Ended execution of HLS::Evaluation in 0.00 seconds - Virtual Memory: 226MB
  Starting execution of HLS::ClassicalHLSSynthesisFlow
  Ended execution of HLS::ClassicalHLSSynthesisFlow in 0.00 seconds - Virtual Memory: 226MB
  Starting execution of Exit
  Ended execution of Exit in 0.00 sec

## 5. Results summary 

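The simulations above show that the double-precision implementation is considerably more expensive than the single-precision one: without optimizations it uses 1154 flip-flops against 642 (about 1.8x, as expected, since each floating-point value needs twice as many register bits) and takes 224 cycles on average against 137. Enabling further optimizations does not improve performance significantly: the single-precision design drops to 610 flip-flops and 136 cycles on average, the double-precision one to 1122 flip-flops and 221 cycles, and the choice of `--hls-div` implementation makes no difference in the single-precision runs.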
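
The comparison can be quantified directly from the flip-flop counts and average cycle counts reported in the logs above. The sketch below hard-codes those figures and prints the double-to-single ratios; the dictionary layout and labels are illustrative.

```python
# (flip-flops, average cycles) copied from the simulation logs above.
results = {
    ("single", "simple"): (642, 137),
    ("single", "optimized"): (610, 136),
    ("double", "simple"): (1154, 224),
    ("double", "optimized"): (1122, 221),
}

# Ratio of double precision to single precision for each configuration.
for variant in ("simple", "optimized"):
    sp_ff, sp_cycles = results[("single", variant)]
    dp_ff, dp_cycles = results[("double", variant)]
    print(f"{variant}: flip-flops x{dp_ff / sp_ff:.2f}, cycles x{dp_cycles / sp_cycles:.2f}")
```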