# Your name: Joanna Kondylis

In [1]:
import pandas as pd
import numpy as np
from loaders import *

## Part 2: Full System (PE Array, Global Buffer, NoC) Modeling
Now that you are familiar with the simple PE setup, let’s look at a full system as shown in the figure below. This design is composed of two levels of on-chip storage --- the global buffer and the local scratchpads in each PE as described in part 1. Each datatype is sent via a network from the global buffer to the PE array, and there are inter-PE networks that are capable of sending various data types within the array. We provide you with the loop nest of this design in the figure below. 

<br>
<div class="row">
  <div class="column">
    <img align="left" src="designs/system_manual/figures/system_arch.png" alt="Full System  Architecture Diagram" style="margin:50px 0px 0px 50px; width:40%">
  </div>
  <div class="column">
    <img  align="left"  src="designs/system_manual/figures/system_loopnest.png" alt="System Loopnest" style="width:50%">
  </div>
</div>

## Question 2  Manual Exploration of the Mapspace

### Question 2.1
You are provided with a PE array that has 16 PEs. Assume you can design different architectures and associated mappings for every layer shape (i.e. both architecture yaml and mapping yaml can change across layer shapes). 

Specifically, you can select the height and width of the PE array as long as the total number of PEs equals 16, while keeping all other architectural attributes the same.
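As a quick illustration of the design space this describes, the sketch below lists every (height, width) pair whose product is 16. The helper name `pe_array_shapes` is made up for this example and is not part of the lab's `loaders` module.

```python
def pe_array_shapes(num_pes):
    """Return every (height, width) pair whose product equals num_pes."""
    return [(h, num_pes // h) for h in range(1, num_pes + 1) if num_pes % h == 0]


shapes = pe_array_shapes(16)
print(shapes)
```

The lab only asks about 1x16, 2x8 and 4x4; the remaining pairs are the same shapes with height and width swapped.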

#### 2.1.1
Please examine the provided architecture description for system_1x16, then set the parameters for 2x8 and 4x4 to create architecture descriptions that have the same buffer sizes and PE arrays of physical dimensions 2x8 and 4x4. Which hardware attributes do you need to change? (**Hint**: see the widgets below and inspect what effects they have on the architecture specification).


**Answer**

In [None]:
show_config(ConfigRegistry.SYSTEM_1x16_ARCH)

In [None]:
sys_2x8_widget = load_widget_config(ConfigRegistry.SYSTEM_2x8_ARCH_WIDGET, title='System 2x8')

In [None]:
print(sys_2x8_widget.dump())

In [None]:
sys_4x4_widget = load_widget_config(ConfigRegistry.SYSTEM_4x4_ARCH_WIDGET, title='System 4x4')

In [None]:
print(sys_4x4_widget.dump())

#### 2.1.2
In 1 or 2 sentences, explain why running the same workload on architectures with different physical PE array dimensions might result in different performance (*e.g.,* energy, throughput)?


**Answer**

### Question 2.2
In this question, we would like you to find the best architecture (among the three architectures in question 2.1) and the associated mapping that has the highest throughput (minimizes the number of cycles) for `ConfigRegistry.TINY_LAYER_PROB`. If two architectures result in the same throughput, choose the one that consumes less energy.
  
<font color=blue> <b>Your mapping has to agree with the loop nest provided above. To simplify your search, please further assume that: </b>
    
   - input channels can only be spatially mapped to the rows of the PE array and output channels can only be spatially mapped to the columns of the PE array.
    
   - PE scratchpads only store filter weights 
    
</font>

A sample mapping for `system_arch_1x16` is provided in `ConfigRegistry.SYSTEM_MANUAL_MAP`. You can change the mapping by tweaking the widgets below (see Q1.4 and Q1.7 for a reminder on Timeloop mapping conventions).
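To get a feel for how the two simplifying assumptions above interact with the array shape, here is a toy enumeration of spatial factors. It is only a sketch: it assumes a shape written as "RxC" means R rows and C columns, that a spatial factor must divide its dimension evenly, and it uses made-up channel counts (`C_DIM = 6`, `M_DIM = 8`) rather than the real values in `TINY_LAYER_PROB`. It does not call Timeloop.

```python
def divisors(n):
    """All positive integers that divide n evenly."""
    return [d for d in range(1, n + 1) if n % d == 0]


def best_spatial_factors(rows, cols, c_dim, m_dim):
    """Pick the (C factor, M factor) pair that keeps the most PEs busy.

    Input channels (C) are unrolled across rows and output channels (M)
    across columns, so each factor must fit in its array dimension.
    """
    candidates = [(c, m)
                  for c in divisors(c_dim) if c <= rows
                  for m in divisors(m_dim) if m <= cols]
    return max(candidates, key=lambda cm: cm[0] * cm[1])


C_DIM, M_DIM = 6, 8  # hypothetical layer dimensions, for illustration only
for rows, cols in [(1, 16), (2, 8), (4, 4)]:
    c, m = best_spatial_factors(rows, cols, C_DIM, M_DIM)
    print(f"{rows}x{cols}: C spatial={c}, M spatial={m}, active PEs={c * m}")
```

With these placeholder dimensions the three shapes keep different numbers of PEs busy, which is the effect question 2.1.2 asks about.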


Please fill in the table below to provide your answer.

In [None]:
show_config(ConfigRegistry.TINY_LAYER_PROB)

In [None]:
show_config(ConfigRegistry.SYSTEM_MANUAL_MAP)

In [None]:
sys_1x16_map_widget = load_widget_config(ConfigRegistry.SYSTEM_MANUAL_MAP_WIDGET, title='Mapping Options')

In [None]:
# Nothing to change here!
sys_1x16_mapping = configure_mapping(sys_1x16_map_widget.dump(),
                                     ConfigRegistry.SYSTEM_MANUAL_MAP_TEMPLATE,
                                     {'dram_t_c': None,
                                      'dram_t_m': None,
                                      'dram_t_n': None,
                                      'gbuf_t_c': None,
                                      'gbuf_t_m': None,
                                      'gbuf_t_n': None,
                                      'gbuf_s_m': None,
                                      'gbuf_s_c': None,
                                      'spad_t_n': None})
print(sys_1x16_mapping)

In [None]:
sys_1x16_stats, sys_1x16_loops = run_timeloop_model(
    ConfigRegistry.SYSTEM_1x16_ARCH, ConfigRegistry.SYSTEM_COMPONENTS_DIR,
    sys_1x16_mapping,
    ConfigRegistry.TINY_LAYER_PROB
)
print(sys_1x16_stats)

In [None]:
sys_2x8_map_widget = load_widget_config(ConfigRegistry.SYSTEM_MANUAL_MAP_WIDGET, title='Mapping Options')

In [None]:
# Nothing to change here!
sys_2x8_mapping = configure_mapping(sys_2x8_map_widget.dump(),
                                    ConfigRegistry.SYSTEM_MANUAL_MAP_TEMPLATE,
                                    {'dram_t_c': None,
                                     'dram_t_m': None,
                                     'dram_t_n': None,
                                     'gbuf_t_c': None,
                                     'gbuf_t_m': None,
                                     'gbuf_t_n': None,
                                     'gbuf_s_m': None,
                                     'gbuf_s_c': None,
                                     'spad_t_n': None})
sys_2x8_stats, sys_2x8_loops = run_timeloop_model(
    sys_2x8_widget.dump(), ConfigRegistry.SYSTEM_COMPONENTS_DIR,
    sys_2x8_mapping,
    ConfigRegistry.TINY_LAYER_PROB
)
print(sys_2x8_stats)

In [None]:
sys_4x4_map_widget = load_widget_config(ConfigRegistry.SYSTEM_MANUAL_MAP_WIDGET, title='Mapping Options')

In [None]:
# Nothing to change here!
sys_4x4_mapping = configure_mapping(sys_4x4_map_widget.dump(),
                                    ConfigRegistry.SYSTEM_MANUAL_MAP_TEMPLATE,
                                    {'dram_t_c': None,
                                     'dram_t_m': None,
                                     'dram_t_n': None,
                                     'gbuf_t_c': None,
                                     'gbuf_t_m': None,
                                     'gbuf_t_n': None,
                                     'gbuf_s_m': None,
                                     'gbuf_s_c': None,
                                     'spad_t_n': None})
sys_4x4_stats, sys_4x4_loops = run_timeloop_model(
    sys_4x4_widget.dump(), ConfigRegistry.SYSTEM_COMPONENTS_DIR,
    sys_4x4_mapping,
    ConfigRegistry.TINY_LAYER_PROB
)
print(sys_4x4_stats)

| Shape | Mapping   | Cycles | Total Energy (uJ) |
|-------|-----------|--------|-------------------|
| 1x16  | Example   | 129600 | 4.36 |
| 2x8   | Example   | 129600 | 4.36 |
| 4x4   | Example   | 129600 | 4.36 |
| 1x16  | Optimized | 16200  | 2.83 |
| 2x8   | Optimized | 8100   | 2.64 |
| 4x4   | Optimized | 10800  | 3.05 |
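One way to read the table is to compare each optimized mapping against the example mapping. The sketch below computes that speedup from the cycle counts above. If the example mapping keeps only a single PE busy (an assumption; the table does not say so), the ratio can also be read as the number of active PEs.

```python
EXAMPLE_CYCLES = 129600  # cycles of the example mapping, same for all shapes

# (cycles, total energy in uJ) of the optimized mappings, copied from the table
optimized = {
    '1x16': (16200, 2.83),
    '2x8': (8100, 2.64),
    '4x4': (10800, 3.05),
}

speedups = {shape: EXAMPLE_CYCLES / cycles
            for shape, (cycles, _energy) in optimized.items()}

for shape, speedup in speedups.items():
    print(f"{shape}: {speedup:.0f}x faster than the example mapping")
```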


In [2]:
# the Question 2.2 chart
d = {'problem': ['tiny_layer'],  # fill in your answer here
     'architecture name': ['2x8'], # fill in your answer here
     'number of cycles': [8100],   # fill in your answer here
     'total energy (uJ)': [2.64],  # fill in your answer here
     'M3': [2],
     'N3': [2],
     'C3': [9],
     'M2': [1],
     'N2': [1],
     'C2': [1],
     'M1': [8],
     'C1': [2],
     'N0': [1]
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

 problem   architecture name  number of cycles  total energy (uJ)  M3  N3  C3  M2  N2  C2  M1  C1  N0
tiny_layer        2x8               8100              2.64         2   2   9   1   1   1   8   2   1 
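A small consistency check on the factors in this chart can catch typos before submitting. The sketch assumes that the level-1 factors (`M1`, `C1`) are the spatial ones, matching the `gbuf_s_m` and `gbuf_s_c` keys in the mapping template, and that the 2x8 array has 2 rows (for C) and 8 columns (for M). The per-dimension products should match the bounds shown by `show_config(ConfigRegistry.TINY_LAYER_PROB)`.

```python
from math import prod

# Factors copied from the Question 2.2 chart
answer = {'M3': 2, 'N3': 2, 'C3': 9,
          'M2': 1, 'N2': 1, 'C2': 1,
          'M1': 8, 'C1': 2, 'N0': 1}


def total_factor(dim):
    """Product of all factors assigned to one problem dimension."""
    return prod(v for name, v in answer.items() if name.startswith(dim))


totals = {dim: total_factor(dim) for dim in 'MNC'}
active_pes = answer['M1'] * answer['C1']  # assumed spatial factors

rows, cols = 2, 8  # assumed orientation of the 2x8 array
fits = answer['C1'] <= rows and answer['M1'] <= cols

print(totals, active_pes, fits)
```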


## Question 3 Mapspace Exploration with Timeloop

Manually generating the best mapping for each architecture and layer shape is rather time-consuming, even if the search is performed under a tightly constrained mapspace, *e.g.,* the one in question 2.2. Therefore, timeloop provides automatic mapspace search functionality when appropriate mapspace constraints are given.

To perform an automatic mapping space search, you need to provide a mapspace constraint as an input. A mapspace constraint specifies the limitations imposed by your dataflow or hardware structures. An example mapping space constraint is shown below (`EXAMPLE_CONSTRAINTS`). To automatically search the mapspace with the constraints file, you should run the `run_timeloop_mapper` command.

*The search should take less than 5 minutes to finish. If you are running this command from the shell instead of running the cell below, you can also terminate it at any time by pressing Ctrl+C (you will need to wait for timeloop to finish the remaining computations after you send the signal; each terminated thread will have a dash next to its id).*

In [None]:
show_config(ConfigRegistry.EXAMPLE_CONSTRAINTS)

In [None]:
sys_1x16_mapper_stats, sys_1x16_mapper_loops = run_timeloop_mapper(
    ConfigRegistry.SYSTEM_1x16_ARCH, ConfigRegistry.SYSTEM_COMPONENTS_DIR,
    ConfigRegistry.TINY_LAYER_PROB,
    ConfigRegistry.EXAMPLE_CONSTRAINTS, ConfigRegistry.MAPPER
)

In [None]:
# Stats of the best mapping found by the mapper.
print(sys_1x16_mapper_stats)

In [None]:
# Loop nest of the best mapping found by the mapper.
print(sys_1x16_mapper_loops)

### Question 3.1

In this question, we have provided you with a much more relaxed constraint file, `RELAXED_CONSTRAINTS`. 

#### 3.1.1
    
Please examine the constraints, and list two additional relaxations of the mapspace constraints in `RELAXED_CONSTRAINTS` compared to `EXAMPLE_CONSTRAINTS`. (*Note: there are more than two relaxations, but you only need to list two.*)
 

**Answer**


In [None]:
show_config(ConfigRegistry.RELAXED_CONSTRAINTS)

#### 3.1.2
Below, we run the mapper on all three architectures (1x16, 2x8, 4x4) on all three workloads (tiny, depth-wise, point-wise). For each workload, find the architecture that has the highest throughput by inspecting `all_stats`. If two architectures result in the same throughput, choose the one that consumes less energy. Please fill in the chart below. 
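The selection rule (fewest cycles first, lower energy as the tie-breaker) can be sketched as a tuple comparison. The helper `pick_best` is hypothetical. The numbers are typed in by hand from the Question 2.2 table, since how to pull cycles and energy out of the objects in `all_stats` depends on the lab's `loaders` module and is not shown here.

```python
def pick_best(results):
    """Return the architecture with the fewest cycles, breaking ties on energy.

    results maps an architecture name to a (cycles, energy_uJ) tuple.
    Tuples compare element by element, so cycles are compared first.
    """
    return min(results, key=lambda name: results[name])


# Optimized mappings from the Question 2.2 table
tiny_layer = {'1x16': (16200, 2.83), '2x8': (8100, 2.64), '4x4': (10800, 3.05)}
print(pick_best(tiny_layer))

# Made-up example of a tie on cycles: the lower-energy entry is chosen
tie = {'a': (100, 2.0), 'b': (100, 1.5)}
print(pick_best(tie))
```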

In [None]:
# Nothing to change here!
architectures = [(ConfigRegistry.SYSTEM_1x16_ARCH, '1x16'),
                 (sys_2x8_widget.dump(), '2x8'),
                 (sys_4x4_widget.dump(), '4x4')]
workloads = [(ConfigRegistry.TINY_LAYER_PROB, 'tiny'),
             (ConfigRegistry.DEPTHWISE_LAYER_PROB, 'depthwise'),
             (ConfigRegistry.POINTWISE_LAYER_PROB, 'pointwise')]

all_stats = {'tiny': {}, 'depthwise': {}, 'pointwise': {}}
all_loops = {'tiny': {}, 'depthwise': {}, 'pointwise': {}}

for arch, arch_name in architectures:
    for workload, workload_name in workloads:
        stats, loops = run_timeloop_mapper(
            arch, ConfigRegistry.SYSTEM_COMPONENTS_DIR,
            ConfigRegistry.RELAXED_CONSTRAINTS, ConfigRegistry.MAPPER,
            workload
        )
        all_stats[workload_name][arch_name] = stats
        all_loops[workload_name][arch_name] = loops

In [None]:
# Check your results here. Rerunning the last cell will take a while
print(all_stats['tiny']['1x16'])

In [3]:
# the Question 3.1.2 chart
d = {'problem': ['tiny_layer', 'depth_wise', 'point_wise'],  
     'architecture name': [ '2x8', '1x16', '4x4'], # fill in your answer here
     'number of cycles': [8100 , 2700, 750],    # fill in your answer here
     'total energy (uJ)': [2.64 , 1.02, 0.39],   # fill in your answer here
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

 problem   architecture name  number of cycles  total energy (uJ)
tiny_layer         2x8              8100              2.64       
depth_wise        1x16              2700              1.02       
point_wise         4x4               750              0.39       


### Question 3.2
Your circuit designer has told you that it is too expensive to have a separate architecture for each layer shape. You must now have a fixed architecture (i.e. fixed height and width of the PE array). Based on this specific architecture, you can change the mapping according to different layer shapes. 

What is the best architecture that achieves the **highest average throughput** of those three layer shapes among all the architectures explored in question 3.1? Please fill in the chart below.



In [4]:
# the Question 3.2 chart
d = {'problem': ['tiny_layer', 'depth_wise', 'point_wise'],  
     'architecture name': [ '2x8', '2x8', '2x8'], # fill in your answer here
     'number of cycles': [ 8100, 4050, 1200],    # fill in your answer here
     'total energy (uJ)': [ 2.64, 1.04, 0.4],   # fill in your answer here
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

 problem   architecture name  number of cycles  total energy (uJ)
tiny_layer        2x8               8100              2.64       
depth_wise        2x8               4050              1.04       
point_wise        2x8               1200              0.40       
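"Highest average throughput" can be read in more than one way, so it may help to state the metric explicitly. The sketch below computes two candidate metrics for the 2x8 numbers above: total cycles over the three layers, and the mean of the per-layer throughputs (taken here as 1/cycles). Both function names are made up for this example, and which reading the question intends is an assumption to confirm with the course staff.

```python
def total_cycles(cycles):
    """Cycles needed to run every layer once, back to back."""
    return sum(cycles)


def mean_throughput(cycles):
    """Average of per-layer throughputs, with throughput taken as 1/cycles."""
    return sum(1 / c for c in cycles) / len(cycles)


cycles_2x8 = [8100, 4050, 1200]  # tiny, depth-wise, point-wise (from the chart)

print(total_cycles(cycles_2x8))
print(mean_throughput(cycles_2x8))
```

Repeating the calculation for 1x16 and 4x4 with their own mapper results makes the comparison concrete. The two metrics can rank architectures differently, because the mean of 1/cycles is dominated by the fastest layer.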


## Question 4 Architectures with New Technologies

So far, we have been looking at conventional architectures based on digital VLSI designs. There are also many DNN accelerator designs that are based on emerging technologies, such as optical DNN accelerators and processing-in-memory (PIM) DNN accelerators. In this question, we are going to evaluate a PIM DNN accelerator design. The PIM design can be found at `ConfigRegistry.PIM_ARCH` and `ConfigRegistry.PIM_COMPONENTS_DIR`.

### Question 4.1 
Please take a look at the architecture description and the compound component descriptions at `ConfigRegistry.PIM_ARCH`. You will notice that the compound components are much more complicated than the ones we presented before. Examine the `ConfigRegistry.PIM_COMPONENTS_DIR` class YAML definition and the hierarchical tree description below. What are the missing subcomponent names? We have provided one subcomponent name for you; please follow the convention and provide your answer in the cell below.

*Hint: to find the definition of a sub-compound-component, you need to find its class definition in another file stored in the component folder*

In [None]:
show_config(ConfigRegistry.PIM_ARCH)

In [None]:
show_config(ConfigRegistry.PIM_COMPONENTS_DIR)

<img align="left" src="designs/PIM/figures/simplemulticast_tree.png" alt="Full System  Architecture Diagram" style="margin:0px 0px 0px 0px; width:70%">

### Question 4.2
#### 4.2.1

Run `run_accelergy`. Recall that this command generates the energy and area characterizations of the architecture. Examine the output files, and fill in the table below

*Hint: mac compute energy should not be a large number, e.g., >100. If it is, you probably restarted or recreated the docker container and thereby erased the PIM plug-in path added by the accelergyTables command in the readme.* In that case, please rerun:

```
accelergyTables -r /home/workspace/lab4/PIM_estimation_tables
```
  

In [None]:
# If the following code doesn't run, uncomment and run this bash command
# !accelergyTables -r /home/workspace/lab4/PIM_estimation_tables

pim_accelergy_result = run_accelergy(
    ConfigRegistry.PIM_ARCH,
    ConfigRegistry.PIM_COMPONENTS_DIR
)
print(pim_accelergy_result.ert_verbose)

In [5]:
# the Question 4.2.1 chart
d = {'scratchpad access energy (pJ)': [0],     # fill in your answer here
     'mac compute energy (pJ)': [0.23424],     # fill in your answer here  
     'D2A_NoC average energy (pJ)': [0.4224],  # fill in your answer here
     'A2D_NoC average energy (pJ)': [170.72768],  # fill in your answer here
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

 scratchpad access energy (pJ)  mac compute energy (pJ)  D2A_NoC average energy (pJ)  A2D_NoC average energy (pJ)
              0                        0.23424                     0.4224                     170.72768          


#### 4.2.2

Our PIM accelerator programs the weights into the memory cells (i.e., each PE is responsible for storing one 16-bit weight value in its scratchpad) and does not reload weights during the run of a layer (reflected in the constraints). Calculate the number of PEs needed to store all the weights for `TINY_LAYER_PROB`. 


**Answer**:

Number of PEs needed to store all the weights: 2592.
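The count follows from the filter shape: with one weight per PE, the number of PEs equals the number of weights, M × C × R × S. In the sketch below the dimensions are assumptions chosen to be consistent with the Question 2.2 factors (M = 16, C = 18) and with the 2592 total (which implies R × S = 9). Read the actual values from `show_config(ConfigRegistry.TINY_LAYER_PROB)`.

```python
# Assumed layer dimensions; verify against TINY_LAYER_PROB
M = 16  # output channels
C = 18  # input channels
R = 3   # filter height
S = 3   # filter width

WEIGHTS_PER_PE = 1  # each PE stores a single 16-bit weight

num_weights = M * C * R * S
pes_needed = num_weights // WEIGHTS_PER_PE
print(pes_needed)
```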

In [None]:
show_config(ConfigRegistry.TINY_LAYER_PROB)

#### 4.2.3
Run `run_timeloop_mapper`. Is timeloop able to find any mappings? If not, in 1 or 2 sentences, explain why. If yes, provide the number of cycles and total energy consumption for running the workload.


**Answer**


In [None]:
pim_results = run_timeloop_mapper(
    ConfigRegistry.PIM_ARCH,
    ConfigRegistry.PIM_COMPONENTS_DIR,
    pim_accelergy_result.art,
    pim_accelergy_result.ert,
    ConfigRegistry.PIM_CONSTRAINTS,
    ConfigRegistry.PIM_MAPPER,
    ConfigRegistry.TINY_LAYER_PROB
)

### Question 4.3

Navigate to `designs/PIM_large`. 

In this folder, we provide you with an architecture with a larger PE array of size 144*18. 
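A quick arithmetic check shows why this array size matters for the weight-stationary constraint from 4.2.2. The helper `can_hold_weights` is hypothetical, and the sketch assumes one weight per PE as stated in 4.2.2.

```python
def can_hold_weights(array_dims, num_weights, weights_per_pe=1):
    """True if an array with the given dimensions can store every weight."""
    dim_a, dim_b = array_dims
    return dim_a * dim_b * weights_per_pe >= num_weights


REQUIRED_WEIGHTS = 2592  # answer to Question 4.2.2

num_pes = 144 * 18
print(num_pes, can_hold_weights((144, 18), REQUIRED_WEIGHTS))
```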

#### 4.3.1
Run `run_timeloop_mapper`. Is timeloop able to find any mappings? If not, in 1 or 2 sentences, explain why. If yes, provide the number of cycles and total energy consumption for running the workload.


**Answer** Yes, timeloop is now able to find mappings because the larger array has 2592 (144*18) PEs, which is enough to hold all the weights.

Number of cycles: 50

Total energy consumption: 0.57 uJ

  

In [None]:
pim_accelergy_result = run_accelergy(
    ConfigRegistry.PIM_LARGE_ARCH,
    ConfigRegistry.PIM_LARGE_COMPONENTS_DIR
)


In [None]:
pim_large_stats, pim_large_loops = run_timeloop_mapper(
    ConfigRegistry.PIM_LARGE_ARCH,
    ConfigRegistry.PIM_LARGE_COMPONENTS_DIR,
    pim_accelergy_result.art,
    pim_accelergy_result.ert,
    ConfigRegistry.PIM_LARGE_CONSTRAINTS,
    ConfigRegistry.PIM_LARGE_MAPPER,
    ConfigRegistry.TINY_LAYER_PROB
)

#### 4.3.2
Your circuit designer has invented a very low-power 8-bit ADC, which consumes only half of the energy per conversion. We call this type of ADC a `low_power_SAR` ADC. You decide to model a design with this new `low_power_SAR` ADC unit integrated. Please perform the following updates and fill in the table below.


 - Update the `designs/PIM_large/arch/components/A2D_conversion_system.yaml` appropriately to replace the old `SAR` ADC with the new `low_power_SAR` ADC.
 
 - Update the energy tables at `PIM_estimation_tables/32nm_data/data/ADC.csv` for the 8-bit `low_power_SAR` ADC used in this design.
 
 - Rerun `run_accelergy`.

*Hint: mac compute energy should not be a large number, e.g., >100. If it is, you probably restarted or recreated the docker container and thereby erased the PIM plug-in path added by the accelergyTables command in the readme.* In that case, please rerun:

```
accelergyTables -r /home/workspace/lab4/PIM_estimation_tables
```

In [None]:
# If the following code doesn't run, uncomment and run this bash command
# !accelergyTables -r /home/workspace/lab4/PIM_estimation_tables

print(pim_accelergy_result.ert_verbose)

In [6]:
# the Question 4.3.2 chart
print('\n== Static Hardware Properties ==')
d = {'scratchpad access energy': ['0 pJ'],          # fill in your answer here
     '  mac compute energy': ['0.23424 pJ'],        # fill in your answer here
     '  D2A_NoC average energy': ['0.4224 pJ'],     # fill in your answer here
     '  A2D_NoC average energy': ['138.72768 pJ'],  # fill in your answer here
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))

print('\n== Runtime Stats ==')
d = {'total cycles running tiny_layer': [50],            # fill in your answer here
     '  total energy running tiny_layer': ['0.48 uJ']   # fill in your answer here
    }
df = pd.DataFrame(data=d)
print(df.to_string(index=False, justify='center'))


== Static Hardware Properties ==
scratchpad access energy   mac compute energy   D2A_NoC average energy   A2D_NoC average energy
          0 pJ                0.23424 pJ             0.4224 pJ               138.72768 pJ      

== Runtime Stats ==
 total cycles running tiny_layer   total energy running tiny_layer
               50                             0.48 uJ             
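To put the new ADC in context, the sketch below compares the numbers reported in 4.2.1 and 4.3.1 with the ones above. It assumes the 0.57 total from 4.3.1 is in uJ, like the other totals in this notebook. The 32 pJ drop in A2D_NoC energy would be consistent with the original ADC contributing 64 pJ per A2D_NoC action, though the tables shown here do not break that out.

```python
# Values reported earlier in the notebook (SAR ADC) and above (low_power_SAR ADC)
before = {'a2d_noc_pj': 170.72768, 'total_uj': 0.57}  # total assumed to be in uJ
after = {'a2d_noc_pj': 138.72768, 'total_uj': 0.48}

a2d_saving_pj = before['a2d_noc_pj'] - after['a2d_noc_pj']
a2d_saving_pct = 100 * a2d_saving_pj / before['a2d_noc_pj']
total_saving_pct = 100 * (before['total_uj'] - after['total_uj']) / before['total_uj']

print(f"A2D_NoC energy drops by {a2d_saving_pj:.2f} pJ ({a2d_saving_pct:.1f}%)")
print(f"Total energy for tiny_layer drops by {total_saving_pct:.1f}%")
```

The cycle count stays at 50 in both runs, so the change affects energy only.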
