# Lab 3

In this lab, we will explore how to use different dataflows in our deep neural
network accelerators.

In [None]:
import pandas as pd
from loaders import *

answer(
    question='1.0',
    subquestion=f'What is your name?',
    answer= 'FILL ME',
    required_type=str,
)
answer(
    question='1.0',
    subquestion=f'What is your email address?',
    answer= 'FILL ME',
    required_type=str,
)
answer(
    question='1.0',
    subquestion=f'What is your kerberos?',
    answer= 'FILL ME',
    required_type=str,
)

## Part 1: Single PE Modeling

We will start with a simple design consisting of a single PE as shown in the figure below. The PE contains a MAC unit to multiply-accumulate, and a scratchpad to store data locally for reuse. We also provide you with the loop nest for this single PE design in the figure below.

**Note: The loop nest for the single PE includes both the scratchpad and the registers.**

<br>
<div class="row">
  <div class="column">
    <img align="left" src="designs/singlePE/figures/arch.png" alt="PE Architecture" style="margin:100px 0px 30px 70px; width:35%">
  </div>
  <div class="column">
    <img  align="left"  src="designs/singlePE/figures/loopnest.png" alt="PE Loopnest" style="width:40%">
  </div>
</div>

### Question 1
Assuming you cannot reorder the provided loop nest, if you can only store one datatype (datatypes include *filter weights, input activations, output activations*) inside the PE scratchpad to maximize data reuse inside the PE, which datatype will you choose? Explain why in one sentence in the assumptions field.

In [None]:
answer(
    question='1.1',
    subquestion='Which datatype maximizes data reuse?',
    answer= 'FILL ME',
    required_type=('filter weights', 'input activations', 'output activations'), 
    assumptions=[] # Put a list of strings as assumptions, if any.
)

### Question 2
Take a look at the `designs/singlePE/arch.yaml` config file printed by the code box below. This file describes the hardware structure of the architecture. Please fill in the statements below. When filling in the statements, please treat the registers as one memory level.

In [None]:
show_config('designs/singlePE/arch.yaml')

In [None]:
answer(
    question='1.2.1',
    subquestion='What is the number of memory levels (including DRAM and registers)',
    answer= 'FILL ME',
    required_type=int, 
)
answer(
    question='1.2.1',
    subquestion='How many bits used to represent a value?',
    answer= 'FILL ME',
    required_type=int, 
)
answer(
    question='1.2.1',
    subquestion='How many bytes is the local scratchpad?',
    answer= 'FILL ME',
    required_type=int, 
)

Now take a look at the compound component descriptions at `designs/components`. These consist of multiple primitive elements to make a more complex element for analysis. These files describe the hardware details of each component in the design.

In [None]:
show_config('designs/components/mac_compute.yaml')
show_config('designs/components/reg_storage.yaml')
show_config('designs/components/smart_storage.yaml')

In [None]:
answer(
    question='1.2.2',
    subquestion='True/False: These components are made of multiple subcomponents. (False if they are made of a single subcomponent)',
    answer= 'FILL ME',
    required_type=bool, 
)
answer(
    question='1.2.3',
    subquestion='True/False: According to description of the `mac_compute` compound component, is our architecture capable of performing floating point computations?',
    answer= 'FILL ME',
    required_type=bool, 
)

### Question 3
The command below performs static hardware characterizations using **Accelergy**. You do not need to worry about the warning messages.

Now examine the file `designs/singlePE/output/ERT.yaml`. Please fill in the statements below in pJ. (**note that the implicit energy unit for the ERT is pJ**)

In [None]:
result = run_accelergy(
    architecture='designs/singlePE/arch.yaml',
)
# The energy reference table (ERT) is the one used to compute energy.
print(result.ert)

# The verbose energy reference table shows more information. You don't need it here but later in Q1.6
# print(result.ert_verbose)

In [None]:
answer(
    question='1.3',
    subquestion='What is the mac compute energy?',
    answer= 'FILL ME',
    required_type=Number, 
)
answer(
    question='1.3',
    subquestion='What is the DRAM write energy?',
    answer= 'FILL ME',
    required_type=Number, 
)
answer(
    question='1.3',
    subquestion='What is the scratchpad leak energy?',
    answer= 'FILL ME',
    required_type=Number, 
)

### Question 4 

Take a look at the `designs/singlePE/map.yaml` config file below. This config describes a mapping for a certain workload. Can you tell the values of `M0`, `N0`, `C0`, `R`, `S`, `P`, `Q` in the loop nest above by examining the mapping? For each variable, specify the value in the following chart if you can.

**A note on Timeloop mapping conventions**

1. `permutation` is the order of the loops from inner to outer. `factors` gives the number of tiles (the loop bound) for each rank at that level. For example, permutation `QPS` with factors `Q=5`, `P=2`, and `S=4` corresponds to the following loop nest:
	```
	for s in [0, 4):
	 for p in [0, 2):
	  for q in [0, 5):
	...
	```

2. A buffer level (e.g., scratchpad, registers) can also have a bypass specification. For example, an output buffer with `keep=[Output]` and `bypass=[Weights, Input]` will store only the `Output` tensor.
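
As a minimal sketch of convention 1, the helper below expands a permutation string and its factors into the loop nest it implies. The function name `loop_nest` is hypothetical; it is not part of Timeloop or of this lab's `loaders` module.

```python
def loop_nest(permutation, factors):
    """Return the loop nest implied by a Timeloop-style permutation string.

    `permutation` lists ranks from innermost to outermost, so the nest is
    built in reverse order (outermost loop first).
    """
    lines = []
    for depth, rank in enumerate(reversed(permutation)):
        indent = " " * depth
        lines.append(f"{indent}for {rank.lower()} in [0, {factors[rank]}):")
    return lines


# The example from the note above: permutation QPS, factors Q=5, P=2, S=4.
nest = loop_nest("QPS", {"Q": 5, "P": 2, "S": 4})
print("\n".join(nest))
```

The printed nest has `s` as the outermost loop and `q` as the innermost, matching the example in the note.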

In [None]:
show_config('designs/singlePE/map.yaml')

In [None]:
answer(
    question='1.4',
    subquestion='What are the [M0, N0, C0] factors?',
    answer= ['FILL ME', 'FILL ME', 'FILL ME'], # For each of the factors, put down the value if it is possible to tell what the value is. Otherwise, put down 'nan'.
    required_type=[(int, "nan"), (int, "nan"), (int, "nan")], 
)
answer(
    question='1.4',
    subquestion='What are the [S, R, P, Q] factors?',
    answer= ['FILL ME', 'FILL ME', 'FILL ME', 'FILL ME'], # For each of the factors, put down the value if it is possible to tell what the value is. Otherwise, put down 'nan'.
    required_type=[(int, "nan"), (int, "nan"), (int, "nan"), (int, "nan")],
)

### Question 5
The command below performs a **Timeloop** runtime simulation of your design, and **Accelergy** is queried as the backend to provide energy estimations for each simulated component. That's why you will see the Accelergy-related outputs as well (*e.g.,* `timeloop-model.ERT.yaml`).
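
To build intuition for how the two tools fit together, here is a rough sketch of the bookkeeping, assuming total energy is the sum over actions of (number of actions) × (energy per action from the ERT), and that pJ/compute is total energy divided by the number of MACs. All component names, action counts, and energies below are made up for illustration; they are not values from this design.

```python
# Hypothetical energy-per-action table (pJ), standing in for an ERT.
energy_per_action_pj = {
    ("mac", "compute"): 2.0,
    ("scratchpad", "read"): 0.5,
    ("DRAM", "read"): 100.0,
}

# Hypothetical action counts, standing in for what a simulation would report.
action_counts = {
    ("mac", "compute"): 1000,
    ("scratchpad", "read"): 3000,
    ("DRAM", "read"): 50,
}


def total_energy_pj(counts, energies):
    """Sum count * energy-per-action over every (component, action) pair."""
    return sum(counts[key] * energies[key] for key in counts)


total = total_energy_pj(action_counts, energy_per_action_pj)
pj_per_compute = total / action_counts[("mac", "compute")]
print(total, pj_per_compute)
```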

In [None]:
conv2_results = run_timeloop_model(
    architecture='designs/singlePE/arch.yaml',
    mapping='designs/singlePE/map.yaml',
    problem='layer_shapes/conv2.yaml'
)
conv2_stats = open('./output_dir/timeloop-model.stats.txt', 'r').read()
conv2_mapping = conv2_results.mapping

**Understanding Timeloop Output**

Can you now tell the dimensions of the layer shape by looking at the produced mapping, `conv2_mapping`? Print it in the cell below and fill in the statements that follow.
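
One way to read a layer shape off a mapping, sketched under the assumption that each rank's full size is the product of its factors across all storage levels. The level names mirror this design, but the factors are hypothetical and are not the ones in `conv2_mapping`.

```python
from math import prod

# Hypothetical per-level factors; ranks missing from a level default to 1.
example_factors = {
    "DRAM": {"M": 4, "C": 2, "P": 7},
    "scratchpad": {"M": 2, "C": 3},
    "registers": {},
}


def full_dimensions(factors_by_level):
    """Multiply each rank's factors across levels to recover its full size."""
    ranks = sorted({r for level in factors_by_level.values() for r in level})
    return {
        r: prod(level.get(r, 1) for level in factors_by_level.values())
        for r in ranks
    }


dims = full_dimensions(example_factors)
print(dims)
```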

In [None]:
print(conv2_mapping)

In [None]:
answer(
    question='1.5',
    subquestion='What is the number of input channels?',
    answer= 'FILL ME',
    required_type=int,
)
answer(
    question='1.5',
    subquestion='What is the number of output channels?',
    answer= 'FILL ME',
    required_type=int,
)
answer(
    question='1.5',
    subquestion='What is the batch size?',
    answer= 'FILL ME',
    required_type=int,
)
answer(
    question='1.5',
    subquestion='What are the output P and Q? P and Q are the same, so just give one value.',
    answer= 'FILL ME',
    required_type=int,
)
answer(
    question='1.5',
    subquestion='What are the weight R and S? R and S are the same, so just give one value.',
    answer= 'FILL ME',
    required_type=int,
)

View `conv2_stats` and fill in the statements below.

In [None]:
print(conv2_stats)

In [None]:
answer(
    question='1.5',
    subquestion='What is the number of cycles?',
    answer= 'FILL ME',
    required_type=int,
)
answer(
    question='1.5',
    subquestion='What is the total MAC energy (pJ overall, NOT PER MAC)?',
    answer= 'FILL ME',
    required_type=Number,
)
answer(
    question='1.5',
    subquestion='What is the total scratchpad energy (pJ overall, NOT PER MAC)?',
    answer= 'FILL ME',
    required_type=Number,
)
answer(
    question='1.5',
    subquestion='What is the total DRAM energy (pJ overall, NOT PER MAC)?',
    answer= 'FILL ME',
    required_type=Number,
)
answer(
    question='1.5',
    subquestion='What is the pJ/compute',
    answer= 'FILL ME',
    required_type=Number,
)

### Question 6

Since you now have an understanding of the input and output files of the tools, we would like you to write your own input files and feed them to the evaluation system.

Many modern accelerator designs integrate address generators into their storages. The address generator is responsible for generating a sequence of read and write addresses for the memory, *i.e.,* for each read and write, the address is generated locally by the address generator. Typically, the address generator can be represented as an adder.

In this question, we would like you to update the compound component definition for the scratchpad to reflect the existence of such an additional address generator. To be specific:

1. name of the address generator: address_generator
2. class of the address generator: intadder
3. attributes associated with the address generator: 
    - datawidth (hint: arithmetic expressions including the ceil, floor, and
      log2 functions can be used). The datawidth of the address generator should
      be set to the minimum value such that each row of the memory has a unique
      address (*i.e.,* number of unique values must be >= memory depth).
    - technology
    - latency (hint: the global_cycle_seconds global variable is visible and can be used)
4. you also need to specify the role your address generator plays when the storage is read and written, and when the storage leaks over time (hint: the intadder has `add` and `leak` actions)
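
The datawidth hint can be sketched in Python as follows. The helper name `address_bits` and the example depths are hypothetical; in the YAML file you would write the equivalent arithmetic expression using the component's own attributes.

```python
import math


def address_bits(depth):
    """Smallest number of bits b such that 2**b >= depth."""
    return math.ceil(math.log2(depth))


# Illustrative depths only, not the depth of the lab's scratchpad.
for depth in (1, 64, 100):
    bits = address_bits(depth)
    # Every row gets a unique address because 2**bits covers the depth.
    assert 2 ** bits >= depth
    print(depth, bits)
```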

Inspect the `designs/components/smart_storage_addr_gen.yaml` configuration below. We will be setting variables enclosed in double curly braces {{ }} with a function later.

In [None]:
show_config('designs/components/smart_storage_addr_gen.yaml')

Fill in the following dictionary to set the double-curly-brace-enclosed variables in
the above description. Afterwards, run Accelergy (the command cell below).
Examine the outputs and fill in the chart below.

In [None]:
address_generator_config = dict(
    use_smart_storage_addr_gen = True, # DO NOT CHANGE THIS LINE
    address_generator_name = 'FILL ME', 
    address_generator_class  = 'FILL ME',
    address_generator_technology_attribute = '"45nm"',
    address_generator_number_of_address_bits = 'FILL ME',
    address_generator_action_for_write = 'FILL ME',
    address_generator_action_for_read = 'FILL ME',
    address_generator_action_for_leak = 'FILL ME',
########################
#### YOUR CODE HERE ####
########################
)

single_pe_ag_accelergy_result = run_accelergy(
    address_generator_config,
    architecture='designs/singlePE/arch.yaml',
)
print(single_pe_ag_accelergy_result.ert_verbose)

In [None]:
# <HINT> You should see the address generator's contributions to the energy in
# the ERT above. If you don't see it, you may want to check your answer.
answer(
    question='1.6',
    subquestion='What is the read energy of the scratchpad (pJ)?',
    answer= 'FILL ME',
    required_type=Number,
)
answer(
    question='1.6',
    subquestion='What is the write energy of the scratchpad (pJ)?',
    answer= 'FILL ME',
    required_type=Number,
)
answer(
    question='1.6',
    subquestion='What is the leak energy of the scratchpad (pJ)?',
    answer= 'FILL ME',
    required_type=Number,
)

answer( # No need for you to change this one.
    question='1.6',
    subquestion='What parameters did you put for (name, class, technology, datawidth, write_action, read_action, leak_action)?',
    answer=list(address_generator_config.values())[1:],
    required_type=[str] * (len(address_generator_config) - 1),
)

### Question 7
So far, we have been focusing on studying the dataflow described in the provided loop nest above. In this question, we would like you to update the mapping to represent a new loop nest shown below. 

Please set the bounds in the `designs/singlePE/map_os.yaml` mapping according to the layer shape described in `layer_shapes/conv2.yaml`. You will again be doing this in the code cell below. **Note that some of the inner bounds are already set for you, and that the scratchpad should keep only outputs.**

After you have updated the mapping, run `timeloop-model` in the command cell below. Please also fill in the chart below:

<div class="row">
  <div class="column">
    <img align="center" src="designs/singlePE/figures/loopnest_os.png" alt="PE Architecture" style="margin:0px 0px 70px 70px; width:100%">
  </div>
</div>

In [None]:
show_config('layer_shapes/conv2.yaml')

**Update the mapping**

First, view the map file in `designs/singlePE/map_os.yaml`. Inspect the double-curly-brace-enclosed statements; these are the ones we will be filling in. When we update the dictionary below, the double-curly-brace-enclosed variables will be substituted as the YAML file is loaded.

Fill out the variables in the dictionary below to match the output-stationary loop nest for the problem above.

In [None]:
show_config('designs/singlePE/map_os.yaml')

In [None]:
os_map_config = dict(
    DRAM_factor_R='FILL ME', # Set to an integer
    DRAM_factor_S='FILL ME', # Set to an integer
    DRAM_factor_P='FILL ME', # Set to an integer
    DRAM_factor_Q='FILL ME', # Set to an integer
    DRAM_factor_N='FILL ME', # Set to an integer
    DRAM_factor_M='FILL ME', # Set to an integer
    DRAM_factor_C='FILL ME', # Set to an integer
    DRAM_permutation=['FILL', 'ME'], # Set to a list

    scratchpad_factor_R='FILL ME', # Set to an integer
    scratchpad_factor_S='FILL ME', # Set to an integer
    scratchpad_factor_P='FILL ME', # Set to an integer
    scratchpad_factor_Q='FILL ME', # Set to an integer
    scratchpad_factor_N='FILL ME', # Set to an integer
    scratchpad_factor_M='FILL ME', # Set to an integer
    scratchpad_factor_C='FILL ME', # Set to an integer
    scratchpad_permutation=['FILL', 'ME'], # Set to a list
    scratchpad_keep_list=['FILL', 'ME'], # Set to a list
    scratchpad_bypass_list=['FILL', 'ME'], # Set to a list
########################
#### YOUR CODE HERE ####
########################
)

# No need for you to change the following lines.
for key, value in os_map_config.items():
    required_type = int
    if 'permutation' in key or 'list' in key:
        required_type = [str] * len(value)
    answer(
        question='1.7',
        subquestion=f'Setting for {key} in the os_map_config',
        answer=value,
        required_type=required_type,
    )

In [None]:
single_pe_os_results = run_timeloop_model(
    os_map_config,
    architecture='designs/singlePE/arch.yaml',
    mapping='designs/singlePE/map_os.yaml',
    problem='layer_shapes/conv2.yaml'
)
single_pe_os_stats = open('./output_dir/timeloop-model.stats.txt', 'r').read()
single_pe_os_mapping = single_pe_os_results.mapping  # You can print to check your answer
print(single_pe_os_stats)

In [None]:
answer(
    question='1.7.2',
    subquestion=f'Number of cycles',
    answer= 'FILL ME',
    required_type=int,
)
answer(
    question='1.7.2',
    subquestion=f'MAC energy (pJ overall, NOT PER MAC)',
    answer= 'FILL ME',
    required_type=Number,
)
answer(
    question='1.7.2',
    subquestion=f'Scratchpad energy (pJ overall, NOT PER MAC)',
    answer= 'FILL ME',
    required_type=Number,
)
answer(
    question='1.7.2',
    subquestion=f'DRAM energy (pJ overall, NOT PER MAC)',
    answer= 'FILL ME',
    required_type=Number,
)
answer(
    question='1.7.2',
    subquestion=f'Total pJ/compute',
    answer= 'FILL ME',
    required_type=Number,
)

### Question 8
This question asks about the influence of moving loops on the overall data movement between different storages. In answering the following, please make the following assumptions:
- All storages are large enough to fit all data. 
- If a temporal loop is above a given storage element, then that storage element is flushed (*i.e.,* all data is removed and re-fetched) for each iteration of the loop. This happens even if the data does not change across loop iterations (*e.g.,* if an upper-level buffer iterates over M, we would still flush inputs from a lower-level buffer).

We will be processing a convolutional layer with the following Einsum (note stride equals 1): 

$$
O_{m,n,p,q} = I_{n,c,p+r,q+s} \times F_{m,c,r,s}
$$
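
For reference, the Einsum can be written out as a plain Python loop nest, assuming the usual Einsum convention that ranks appearing only on the right-hand side (c, r, s) are summed over. The function name and the tiny example tensors are illustrative only.

```python
def conv_einsum(I, F, M, N, C, P, Q, R, S):
    """Compute O[m][n][p][q] = sum over c, r, s of I[n][c][p+r][q+s] * F[m][c][r][s]."""
    O = [[[[0 for _ in range(Q)] for _ in range(P)] for _ in range(N)]
         for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for p in range(P):
                for q in range(Q):
                    for c in range(C):
                        for r in range(R):
                            for s in range(S):
                                O[m][n][p][q] += I[n][c][p + r][q + s] * F[m][c][r][s]
    return O


# Tiny example: one 3x3 input channel, one 2x2 all-ones filter, stride 1.
I = [[[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]]  # indexed [n][c][h][w]
F = [[[[1, 1], [1, 1]]]]                   # indexed [m][c][r][s]
O = conv_einsum(I, F, M=1, N=1, C=1, P=2, Q=2, R=2, S=2)
print(O[0][0])  # each output is the sum of one 2x2 input window
```

Note how the input index `p + r` means neighboring outputs share input values; this is the sliding-window overlap that convolution exposes.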

We currently have the following storage hierarchy and loop nest:

```
DRAM [Weights Inputs Outputs]
-----------------------------
| for P in [0..2):        Loop (P)
|  for Q in [0..2):       Loop (Q)
|   for C in [0..2):      Loop (C)
|    for R in [0..2):     Loop (R)
|     for S in [0..2):    Loop (S)
|      for N in [0..2):   Loop (N)
|       for M in [0..2):  Loop (M)

Scratchpad [Weights Inputs Outputs]
--------------------------------------
< No loops here >

Registers [Weights Inputs Outputs]
----------------------------------
|               <MAC>
```

Note that the above loop syntax uses Timeloop notation, where loops are written with a capital dimension name rather than iteration variables. For example, the following loop nest is Timeloop notation:
```
for P in [0..2):
  for P in [0..2):
```

It is equivalent to the following loop nest in standard notation:
```
for p0 in [0..2):
  for p1 in [0..2):
```
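
As a small sketch of what such a split loop iterates over, the snippet below recombines the two iteration variables into a single index. The recombination `p = p0 * P1 + p1` is the usual tiling convention and is an assumption here; the bound names `P0` and `P1` are introduced only for illustration.

```python
# Bounds of the outer and inner P loops from the example above.
P0, P1 = 2, 2

visited = []
for p0 in range(P0):        # outer loop over P
    for p1 in range(P1):    # inner loop over P
        p = p0 * P1 + p1    # position along the full P rank
        visited.append(p)

print(visited)
```

Together the two loops cover a P rank of size `P0 * P1`, visiting each position exactly once.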

Please answer the following.

In [None]:
answer(
    question='1.8',
    subquestion='How many inputs, outputs, and weights (including duplicates) are transferred from DRAM to the scratchpad? Answer in the order of inputs, outputs, and weights.',
    answer= ['FILL ME', 'FILL ME', 'FILL ME'], # Answer here
    required_type=[int, int, int],
)
answer(
    question='1.8',
    subquestion='How many inputs, outputs, and weights (including duplicates) are transferred from the scratchpad to the registers? Answer in the order of inputs, outputs, and weights.',
    answer= ['FILL ME', 'FILL ME', 'FILL ME'], # Answer here
    required_type=[int, int, int],
)



We now have the option of moving loop (P), (Q), (C), (R), (S), (N), or (M) from beneath the DRAM to beneath the scratchpad (*e.g.,* if we chose (P), the (P) loop would be removed from the DRAM loop nest and placed in the scratchpad loop nest instead).

In [None]:
answer(
    question='1.8',
    subquestion='If we choose to move the (M) loop to beneath the scratchpad, how many inputs, outputs, and weights are transferred from DRAM to the scratchpad? Count repeated transfers as unique transfers (i.e., read a particular input twice --> two inputs transferred).',
    answer= ['FILL ME', 'FILL ME', 'FILL ME'], # Answer here
    required_type=[int, int, int],
)
answer(
    question='1.8',
    subquestion='If we choose to move the (M) loop to beneath the scratchpad, how many inputs, outputs, and weights are transferred from the scratchpad to the registers? Count repeated transfers as unique transfers (i.e., read a particular input twice --> two inputs transferred).',
    answer= ['FILL ME', 'FILL ME', 'FILL ME'], # Answer here
    required_type=[int, int, int],
)

answer(
    question='1.8',
    subquestion='If we choose to move the (C) loop to beneath the scratchpad, how many inputs, outputs, and weights are transferred from DRAM to the scratchpad? Count repeated transfers as unique transfers (i.e., read a particular input twice --> two inputs transferred).',
    answer= ['FILL ME', 'FILL ME', 'FILL ME'], # Answer here
    required_type=[int, int, int],
)
answer(
    question='1.8',
    subquestion='If we choose to move the (C) loop to beneath the scratchpad, how many inputs, outputs, and weights are transferred from the scratchpad to the registers? Count repeated transfers as unique transfers (i.e., read a particular input twice --> two inputs transferred).',
    answer= ['FILL ME', 'FILL ME', 'FILL ME'], # Answer here
    required_type=[int, int, int],
)

answer(
    question='1.8',
    subquestion='If we choose to move the (N) loop to beneath the scratchpad, how many inputs, outputs, and weights are transferred from DRAM to the scratchpad? Count repeated transfers as unique transfers (i.e., read a particular input twice --> two inputs transferred).',
    answer= ['FILL ME', 'FILL ME', 'FILL ME'], # Answer here
    required_type=[int, int, int],
)
answer(
    question='1.8',
    subquestion='If we choose to move the (N) loop to beneath the scratchpad, how many inputs, outputs, and weights are transferred from the scratchpad to the registers? Count repeated transfers as unique transfers (i.e., read a particular input twice --> two inputs transferred).',
    answer= ['FILL ME', 'FILL ME', 'FILL ME'], # Answer here
    required_type=[int, int, int],
)

answer(
    question='1.8',
    subquestion='Looking at the einsum, what has to be true about a loop for it to be helpful in reducing DRAM-scratchpad transfers for a particular datatype by moving it from the DRAM to the scratchpad loop nest? Please answer using the rank of the loop (*i.e.* P,Q,R,S,C,N,M) and its role in the Einsum.',
    answer= 'FILL ME',
    required_type=str,
    assumptions=[], # Put a list of strings as assumptions, if any.
)
answer(
    question='1.8',
    subquestion='Were we able to affect data movement between the scratchpad and registers? Why or why not?',
    answer= 'FILL ME',
    required_type=str,
    assumptions=[], # Put a list of strings as assumptions, if any.
)

Now fill out the following lines with (P), (Q), (R), (S), (C), (N), and (M). If we can't reduce data movement by moving any of the loops, say "None". *Hint: The ordering of scratchpad-level loops will affect your answer. Could moving certain loops lead to reuse through convolutional sliding windows?*

In [None]:
ALLOWED_ANSWERS = ('(P)', '(Q)', '(R)', '(S)', '(C)', '(N)', '(M)')
answer(
    question='1.8',
    subquestion='To reduce **DRAM <==> Scratchpad** input movement, we could move loops:',
    answer= ['FILL ME', 'FILL ME', 'FILL ME', 'FILL ME', 'FILL ME'], # Answer here
    required_type=[ALLOWED_ANSWERS] * 5,
)
answer(
    question='1.8',
    subquestion='To reduce **DRAM <==> Scratchpad** output movement, we could move loops:',
    answer= ['FILL ME', 'FILL ME', 'FILL ME'], # Answer here
    required_type=[ALLOWED_ANSWERS] * 3,
)
answer(
    question='1.8',
    subquestion='To reduce **DRAM <==> Scratchpad** weight movement, we could move loops:',
    answer= ['FILL ME', 'FILL ME', 'FILL ME'], # Answer here
    required_type=[ALLOWED_ANSWERS] * 3,
)