# Your name: Joanna Kondylis

In [1]:
import pandas as pd
import numpy as np
from loaders import *

## Part 1: Single PE Modeling
We will start with a simple design consisting of a single PE as shown in the figure below. The PE contains a MAC unit to do multiplication and accumulation, and a scratchpad to store data locally for reuse. We also provide you with the loop nest for this single PE design in the figure below.

<br>
<div class="row">
  <div class="column">
    <img align="left" src="designs/singlePE/figures/PE_arch.png" alt="PE Architecture" style="margin:100px 0px 30px 70px; width:35%">
  </div>
  <div class="column">
    <img  align="left"  src="designs/singlePE/figures/PE_loopnest.png" alt="PE Loopnest" style="width:40%">
  </div>
</div>

## Question 1 Introduction

### Question 1.1
Assuming you cannot reorder the provided loop nest, if you can only store one datatype (datatypes include *filter weights, input activations, output activations*) inside the PE scratchpad to maximize data reuse inside the PE, which datatype will you choose? In 1 or 2 sentences, explain why.


**Answer**
To maximize data reuse inside the PE, store the filter weights. This makes the design weight stationary.




### Question 1.2 
#### 1.2.1
Take a look at the `SINGLE_PE_ARCH` config. This config describes the hardware structure of the architecture. Please fill in the chart below:

*Hint: the operand registers of the mac unit belong to the same memory level*

In [None]:
show_config(ConfigRegistry.SINGLE_PE_ARCH)

In [2]:
# the Question 1.2.1 chart
d = {'# of memory levels (including DRAM and registers)': [3],   # fill in your answer here
     '  # of bits used to represent a data': [16],                # fill in your answer here
     '  size of local scratchpad (bytes)': [36],                  # fill in your answer here
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

 # of memory levels (including DRAM and registers)    # of bits used to represent a data    size of local scratchpad (bytes)
                        3                                            16                                  36                 


Take a look at the compound component descriptions at `designs/singlePE/arch/components`. These files describe the hardware details of each component in the design.

In [None]:
show_config(
    ConfigRegistry.SINGLE_PE_COMPONENTS_DIR
)

#### 1.2.2
Are these compound components composed of a single subcomponent or multiple subcomponents?
   

**Answer** These compound components are composed of a single subcomponent.

   
   
#### 1.2.3
According to the description of the `mac_compute` compound component, is our architecture capable of performing floating point computations? In 1 or 2 sentences, explain why.


**Answer** The MAC subcomponent is of type `intmac`, an integer MAC, so our architecture is not capable of performing floating point computations.



### Question 1.3
The command below performs static hardware characterizations using **Accelergy**. You do not need to worry about the warning messages.

Examine the file `designs/singlePE/output/ERT.yaml`. Please fill in the chart below (**note that the implicit energy unit for the ERT is pJ**)

In [None]:
result = run_accelergy(ConfigRegistry.SINGLE_PE_ARCH, ConfigRegistry.SINGLE_PE_COMPONENTS_DIR)
# The energy reference table (ERT) is the one used to compute energy.
print(result.ert)

# The verbose energy reference table shows more information. You don't need it here but later in Q1.6
# print(result.ert_verbose)

In [3]:
# the Question 1.3 chart
d = {'DRAM read': [512],           # fill in your answer here
     ' DRAM write': [512],         # fill in your answer here
     ' scratchpad read': [0.2256],    # fill in your answer here
     ' scratchpad write': [0.2256],   # fill in your answer here
     ' mac compute': [3.275],        # fill in your answer here
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

 DRAM read   DRAM write   scratchpad read   scratchpad write   mac compute
   512         512           0.2256             0.2256           3.275    


### Question 1.4 

Take a look at the `SINGLE_PE_MAP` config. This config describes a mapping for a certain workload. By examining the mapping, can you tell what the values of `M0`, `N0`, `C0`, `R`, `S`, `P`, `Q` in the loop nest above are? For each of them, if you can, specify the value in the following chart; if you can't, state why in this cell.


**Answer**

The `M0` value can be read from the mapping and is given in the chart. The remaining values are marked 'nan' because we cannot tell from the mapping alone what they are for the rest of the loop nest (P, Q, R, S).

Under "mapping for the local scratchpad inside the PE" we have R=0, S=0, P=0, Q=0, so each loop traverses [0, x) for a bound x that the mapping does not state. We therefore cannot determine the exact values.

**A note on Timeloop mapping conventions**

Permutation is the order of the loops from inner to outer. For example, permutation QPS and factors `Q=5`, `P=2`, and `S=4` means the following loop nest.
```
for s in [0, 4):
 for p in [0, 2):
  for q in [0, 5):
   ...
```
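The convention above can be sketched in Python. This is only an illustration of the inner-to-outer ordering, not Timeloop's own code; the helper name `loop_order` is hypothetical.

```python
from itertools import product

def loop_order(permutation, factors):
    """Yield loop indices in the order implied by an inner-to-outer permutation.

    `loop_order` is a hypothetical helper for illustration, not a Timeloop API.
    """
    # The permutation string lists loops from inner to outer, so reverse it
    # to get the outermost loop first.
    outer_to_inner = list(reversed(permutation))  # "QPS" -> ['S', 'P', 'Q']
    ranges = [range(factors[dim]) for dim in outer_to_inner]
    # itertools.product advances the last iterable fastest, i.e. the innermost loop.
    for indices in product(*ranges):
        yield dict(zip(outer_to_inner, indices))

iterations = list(loop_order("QPS", {"Q": 5, "P": 2, "S": 4}))
print(len(iterations))   # total number of iterations: 4 * 2 * 5
print(iterations[:3])    # q changes fastest because it is the innermost loop
```

The first few iterations only advance `q`, which matches the loop nest written out above.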

A buffer level can also have a bypass specification. For example, an output buffer with `keep=[Output]` and `bypass=[Weights, Input]` will store only the `Output` tensor.

In [None]:
show_config(ConfigRegistry.SINGLE_PE_MAP)

In [4]:
# the Question 1.4 chart, put down nan if you cannot tell what the value is 
d = {'M0': [2],   # fill in your answer here
     'N0': ['nan'],   # fill in your answer here
     'C0': ['nan'],   # fill in your answer here
     'S':  ['nan'],   # fill in your answer here
     'R':  ['nan'],   # fill in your answer here
     'P':  ['nan'],   # fill in your answer here
     'Q':  ['nan']    # fill in your answer here
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

 M0  N0  C0  S   R   P   Q 
 2  nan nan nan nan nan nan


### Question 1.5
The command below runs a **Timeloop** runtime simulation of your design. **Accelergy** is queried as the backend to provide energy estimates for each simulated component, which is why you will also see Accelergy-related outputs (*e.g.,* `timeloop-model.ERT.yaml`).

In [None]:
small_layer_stats, small_layer_mapping = run_timeloop_model(
    ConfigRegistry.SINGLE_PE_ARCH, ConfigRegistry.SINGLE_PE_COMPONENTS_DIR,
    ConfigRegistry.SINGLE_PE_MAP,
    ConfigRegistry.SMALL_LAYER_PROB
)

In [None]:
print(small_layer_mapping)

In [None]:
print(small_layer_stats)

#### 1.5.1
Take a look at `small_layer_mapping`. Can you now tell the dimensions of the layer shape by looking at the produced mapping? In 1 or 2 sentences, explain why. Then take a look at `small_layer_stats` and fill in the chart in the code cell below.


**Answer**

Yes, you can now tell the dimensions of the layer shape from the produced mapping, because it explicitly shows the loop bounds for R, S, P, and Q.

#### 1.5.2
Run simulation on the medium layer shape below.

Fill in the second row in the chart below. Does the `pJ/MACC` value change? In 1 or 2 sentences, explain why. 


**Answer**

The pJ/MACC value does change. It is slightly lower for the medium layer (393.252 versus 397.85) because the larger layer allows more data reuse, so less DRAM energy is spent per MAC. Larger activations can sometimes enable more reuse.

In [None]:
medium_layer_stats, medium_layer_mapping = run_timeloop_model(
    ConfigRegistry.SINGLE_PE_ARCH, ConfigRegistry.SINGLE_PE_COMPONENTS_DIR,
    ConfigRegistry.SINGLE_PE_MAP,
    ConfigRegistry.MED_LAYER_PROB
)
print(medium_layer_stats)

#### 1.5.3
What's the benefit of allowing a factor of 0, e.g., R=0, in mapping specification (*hint: we used the same `SINGLE_PE_MAP` for 2 different layer shapes*)?


**Answer**

By allowing a factor of zero in the mapping specification, the same mapping can accommodate different layer shapes. It also makes sure that there are no incomplete partial sums.
   

In [5]:
# the Question 1.5.1 and 1.5.2 chart
d = {'layer shape': ['small_layer', 'medium_layer'],
     '  number of cycles': [921600, 8294400],                # fill in your answer here
     '  mac energy total (pJ)': [3018240, 27164160],           # fill in your answer here
     '  scratchpad total energy (pJ)': [16633.04, 16633.04],    # fill in your answer here  
     '  DRAM total energy (pJ)':  [358203392, 3186081792],         # fill in your answer here  # hint: all datatypes
     '  pJ/MACC':  [397.85, 393.252]                         # fill in your answer here
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

layer shape     number of cycles    mac energy total (pJ)    scratchpad total energy (pJ)    DRAM total energy (pJ)    pJ/MACC
 small_layer        921600                3018240                     16633.04                     358203392         397.850  
medium_layer       8294400               27164160                     16633.04                    3186081792         393.252  
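As a rough check on the chart above, the per-MAC energy can be split by component. The sketch below uses only numbers from the Question 1.3 and 1.5 charts; deriving the MAC count from the total MAC energy divided by the 3.275 pJ per-MAC energy is an assumption of this sketch, not something reported directly by the tools.

```python
# Values copied from the Question 1.3 and Question 1.5 charts.
MAC_PJ = 3.275  # energy of one mac compute action (pJ)
mac_energy_pj = {"small_layer": 3018240, "medium_layer": 27164160}
dram_energy_pj = {"small_layer": 358203392, "medium_layer": 3186081792}

# Assumption: every MAC costs MAC_PJ, so total MAC energy / MAC_PJ is the MAC count.
macs = {layer: round(energy / MAC_PJ) for layer, energy in mac_energy_pj.items()}

# DRAM energy spent per MAC for each layer shape.
dram_per_mac = {layer: dram_energy_pj[layer] / macs[layer] for layer in macs}

for layer in macs:
    print(f"{layer}: {macs[layer]} MACs, {dram_per_mac[layer]:.3f} pJ of DRAM energy per MAC")
```

Under these assumptions the MAC counts equal the reported cycle counts, and the DRAM energy per MAC is lower for the medium layer, which accounts for most of the gap between the two `pJ/MACC` values.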


**Now that you understand the input and output files of the tools, we would like you to write your own input files and feed them to the evaluation system.**


### Question 1.6

Many modern accelerator designs integrate address generators into their storages. The address generator is responsible for generating a sequence of read and write addresses for the memory, *i.e.,* for each read and write, the address is generated locally by the address generator. Typically, the address generator can be represented as an adder.

In this question, we would like you to update the compound component definition for the scratchpad to reflect the existence of such an additional address generator. To be specific:

    1. name of the address generator: address_generator
    2. class of the address generator: intadder
    3. attributes associated with the address generator: datawidth (hint: log2 function can be used), technology, latency
    4. you also need to specify the role your address generator plays when the storage is read and written
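The log2 hint for `datawidth` can be illustrated with a short calculation. This is a sketch under stated assumptions: it takes the 36-byte scratchpad size and 16-bit data width from the Question 1.2.1 chart and assumes one addressable entry per data word. The attribute expression that actually belongs in the component description may differ.

```python
import math

# Values from the Question 1.2.1 chart.
scratchpad_bytes = 36
word_bits = 16

# Assumption: the scratchpad is addressed one data word at a time.
depth = scratchpad_bytes * 8 // word_bits

# An address generator needs enough bits to name every entry.
address_bits = math.ceil(math.log2(depth))

print(f"depth = {depth} entries, address width = {address_bits} bits")
```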

Inspect the `SINGLE_PE_AG_COMPONENT_SMART_STORAGE` configuration and use the widget below to apply your updates...

**Note**: running the cell with the widget *resets* the widget. Changes to the widget are applied automatically, so just run the next cell.

#### 1.6.1
After you have updated your architecture description, navigate to the designs root folder and run Accelergy (the command cell below). Examine the outputs and fill in the chart below.

#### 1.6.2
Without rerunning the Timeloop simulation for the `SMALL_LAYER_PROB` workload, can you infer from the ERT how much more energy the local scratchpad will consume? In 1 or 2 sentences, explain why.


**Answer**
Yes. The ERT gives the energy the local scratchpad consumes for each action, and multiplying that by the number of times each action occurs gives the total.
In other words, once the Timeloop simulation has been run, the read/write counts are known, so the new parts (i.e. the address generator) can be added in arithmetically.
We need to know the number of cycles.
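A rough version of that calculation, using only numbers from the charts in this notebook, might look like the sketch below. It assumes that the scratchpad total energy reported for `small_layer` consists entirely of read and write actions, which cost the same per action in the original ERT, so the access count can be recovered by division.

```python
# Per-action scratchpad energy (pJ) from the Question 1.3 and Question 1.6 charts.
old_access_pj = 0.2256
new_access_pj = 0.25297

# Scratchpad total energy (pJ) for small_layer from the Question 1.5 chart.
old_scratchpad_total_pj = 16633.04

# Assumption: the total is made up only of reads and writes at old_access_pj each.
accesses = round(old_scratchpad_total_pj / old_access_pj)

# The address generator adds the same energy to every read and write.
extra_per_access_pj = new_access_pj - old_access_pj
extra_pj = accesses * extra_per_access_pj

print(f"{accesses} accesses, about {extra_pj:.1f} pJ of additional scratchpad energy")
```

No new simulation is needed for this estimate because the mapping, and therefore the access counts, do not change when the address generator is added.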

   

#### 1.6.3
If we have a huge workload and running simulations of it takes hours, how would using compound components help us when we perform design space exploration (*hint: can you avoid rerunning simulations when you change the details of a compound component*)?


**Answer**
Using compound components helps us when we perform design space exploration. From 1.6.2 we know that we can infer from the ERT how much more energy the local scratchpad will consume. So if the design is organized into compound components, we can evaluate the energy for a subset of the individual components and from there infer the total energy consumption of the entire system, without rerunning the simulation.

In [None]:
show_config(ConfigRegistry.SINGLE_PE_AG_COMPONENT_SMART_STORAGE)

In [None]:
smart_storage_widget = \
    load_widget_config(ConfigRegistry.SINGLE_PE_AG_COMPONENT_SMART_STORAGE_WIDGET, title='Smart Storage')

In [None]:
print(show_config(smart_storage_widget.dump()))

In [None]:
single_pe_ag_accelergy_result = run_accelergy(ConfigRegistry.SINGLE_PE_AG_ARCH,
                                              ConfigRegistry.SINGLE_PE_AG_COMPONENT_MAC_COMPUTE,
                                              ConfigRegistry.SINGLE_PE_AG_COMPONENT_REG_STORAGE,
                                              smart_storage_widget.dump())
print(single_pe_ag_accelergy_result.ert_verbose)

In [6]:
# Question 1.6 chart
d = {'read energy of the scratchpad (pJ)': [0.25297],  # fill in your answer here
     'write energy of the scratchpad (pJ)': [0.25297], # fill in your answer here
     'address generation energy (pJ)': [0.05474]      # add read and write (2*0.02737)
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

 read energy of the scratchpad (pJ)  write energy of the scratchpad (pJ)  address generation energy (pJ)
              0.25297                             0.25297                             0.05474           


### Question 1.7
So far, we have been focusing on studying the dataflow described in the provided loop nest above. In this question, we would like you to update the mapping to represent a new loop nest shown below.

Please set the bounds in the `SINGLE_PE_OS_MAP` mapping according to the layer shape described in `SMALL_LAYER_PROB`  (**note that some of the inner bounds are set for you**) and **only keep outputs inside the scratchpad**.

After you have updated the mapping, run `timeloop-model` (run the command cell below). Please fill in the chart below:

<div class="row">
  <div class="column">
    <img align="center" src="designs/singlePE_os/figures/PE_loopnest.png" alt="PE Architecture" style="margin:0px 0px 70px 70px; width:50%">
  </div>
</div>

In [None]:
show_config(ConfigRegistry.SMALL_LAYER_PROB)

**Using the widget to update the mapping**

First, run the widget and the following cell. Inspect the generated Timeloop mapping.

In the widget below, "temp." means temporal, and the number for each dimension is the loop bound of the for loop for that dimension. Permutation is the order of the loops from inner to outer. Bypass and keep are as described in Q1.4. You can also see the mapping for the registers (e.g., the `output_activation_reg`) as an example.

After filling out the widget below, run the next cell to see its effect on the Timeloop mapping.

In [None]:
os_map_widget = \
    load_widget_config(ConfigRegistry.SINGLE_PE_OS_MAP_WIDGET)

In [None]:
# Nothing to change here! This just loads the configuration from the widget above.
os_map = configure_mapping(os_map_widget.dump(),
                           ConfigRegistry.SINGLE_PE_OS_MAP_TEMPLATE,
                           {'dram_t_c': 0,
                            'dram_t_m': 0,
                            'dram_t_n': 0,
                            'dram_t_r': 0,
                            'dram_t_s': 0,
                            'dram_t_p': 0,
                            'dram_t_q': 0,
                            'dram_permutation': None,
                            'spad_t_c': 0,
                            'spad_t_m': 0,
                            'spad_t_n': 0,
                            'spad_t_r': 0,
                            'spad_t_s': 0,
                            'spad_t_p': 0,
                            'spad_t_q': 0,
                            'spad_permutation': None,
                            'spad_bypass': None,
                            'spad_keep': None})
print(os_map)

In [None]:
single_pe_os_accelergy_result = run_accelergy(ConfigRegistry.SINGLE_PE_OS_ARCH,
                                              ConfigRegistry.SINGLE_PE_OS_COMPONENTS_DIR)
single_pe_os_stats, single_pe_os_map = run_timeloop_model(
    ConfigRegistry.SINGLE_PE_OS_ARCH, ConfigRegistry.SINGLE_PE_OS_COMPONENTS_DIR,
    single_pe_os_accelergy_result.art,
    single_pe_os_accelergy_result.ert,
    os_map,
    ConfigRegistry.SMALL_LAYER_PROB
)
print(single_pe_os_stats)

In [7]:
# the Question 1.7 chart
d = {'layer shape': ['small_layer'],    
     'number of cycles': [921600],          # fill in your answer here
     'mac Energy':  [3018240],               # fill in your answer here
     'scratchpad Energy (pJ)': [222395.6],    # fill in your answer here
     'DRAM Energy (pJ)': [261734400],          # fill in your answer here
     'pJ/MAC':[293.47]                      # fill in your answer here
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

layer shape  number of cycles  mac Energy  scratchpad Energy (pJ)  DRAM Energy (pJ)  pJ/MAC
small_layer       921600        3018240           222395.6            261734400      293.47
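To put the Question 1.7 results next to the Question 1.5 results for the same `small_layer` workload, the relative change in each metric can be computed from the two charts. This is a small sketch over the reported numbers only; the dictionary names are illustrative.

```python
# Values copied from the Question 1.5 chart (original mapping) and the
# Question 1.7 chart (only outputs kept in the scratchpad), small_layer.
original = {"scratchpad_pj": 16633.04, "dram_pj": 358203392, "pj_per_mac": 397.85}
outputs_kept = {"scratchpad_pj": 222395.6, "dram_pj": 261734400, "pj_per_mac": 293.47}

def relative_change(before, after):
    """Return the fractional change from `before` to `after`."""
    return (after - before) / before

changes = {name: relative_change(original[name], outputs_kept[name]) for name in original}

for name, change in changes.items():
    print(f"{name}: {change:+.1%}")
```

In these numbers, scratchpad energy goes up while DRAM energy and the overall pJ/MAC go down, and the cycle count and MAC energy are identical in both charts.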
