# Lab 3

In this lab, we will explore how to use different dataflows in our deep neural
network accelerators.

In [4]:
import pandas as pd
from loaders import *

answer(
    question='1.0',
    subquestion=f'What is your name?',
    answer= 'Saniya Karwa',
    required_type=str,
)
answer(
    question='1.0',
    subquestion=f'What is your email address?',
    answer= 'saniya@mit.edu',
    required_type=str,
)
answer(
    question='1.0',
    subquestion=f'What is your kerberos?',
    answer= 'saniya',
    required_type=str,
)

1.0: What is your name?
	Saniya Karwa
1.0: What is your email address?
	saniya@mit.edu
1.0: What is your kerberos?
	saniya


## Part 1: Single PE Modeling

We will start with a simple design consisting of a single PE, shown in the figure below. The PE contains a MAC unit that performs multiply-accumulate operations and a scratchpad that stores data locally for reuse. The figure also gives the loop nest for this single-PE design.

**Note: Loop nest for the single PE includes both the scratchpad and the registers.**

<br>
<div class="row">
  <div class="column">
    <img align="left" src="designs/singlePE/figures/arch.png" alt="PE Architecture" style="margin:100px 0px 30px 70px; width:35%">
  </div>
  <div class="column">
    <img  align="left"  src="designs/singlePE/figures/loopnest.png" alt="PE Loopnest" style="width:40%">
  </div>
</div>

### Question 1
Assume you cannot reorder the provided loop nest and can store only one datatype (*filter weights, input activations,* or *output activations*) inside the PE scratchpad. Which datatype would you choose to maximize data reuse inside the PE? Explain why in one sentence in the assumptions field.

In [5]:
answer(
    question='1.1',
    subquestion='Which datatype maximizes data reuse?',
    answer= 'output activations',
    required_type=('filter weights', 'input activations', 'output activations'), 
    assumptions=['Outermost loops are N and M which iterate over the output'] # Put a list of strings as assumptions, if any.
)

1.1: Which datatype maximizes data reuse?
	output activations
	Outermost loops are N and M which iterate over the output


### Question 2
Take a look at the `designs/singlePE/arch.yaml` config file printed by the code box below. This file describes the hardware structure of the architecture. Please fill in the statements below. When filling in the statements, please treat the registers as one memory level.

In [6]:
show_config('designs/singlePE/arch.yaml')

# Please do not modify this file. If there are double-curly-brace-enclosed
# statements, they are placeholders that should be set from the notebooks.
architecture:
  version: 0.4
  nodes:
  - !Container
    name: system_arch
    attributes:
      # Top-level attributes inherited by all components unless overridden
      technology: "45nm"
      global_cycle_seconds: 1e-9
      datawidth: 16

  - !Component
    name: DRAM                 # offchip DRAM is the source of all datatypes
    class: DRAM                # assume DRAM is large enough to store all the data, so no depth specification needed
    attributes:
      datawidth: datawidth
      width: 64                # width in bits

  - !Container
    name: chip

  - !Container
    name: PE
    spatial: {meshX: 1, meshY: 1}

  - !Component
    name: scratchpad
    class: smart_storage  # definitions of the compound classes can be found under "components" folder
    attributes: {depth: 128, datawidth: datawidth, width: datawidth}

  

In [7]:
answer(
    question='1.2.1',
    subquestion='What is the number of memory levels (including DRAM and registers)',
    answer=3,
    required_type=int, 
)
answer(
    question='1.2.1',
    subquestion='How many bits used to represent a value?',
    answer=16,
    required_type=int, 
)
answer(
    question='1.2.1',
    subquestion='How many bytes is the local scratchpad?',
    answer=256,
    required_type=int, 
)

1.2.1: What is the number of memory levels (including DRAM and registers)
	3
1.2.1: How many bits used to represent a value?
	16
1.2.1: How many bytes is the local scratchpad?
	256
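The scratchpad capacity follows from the `depth` and `width` attributes in `arch.yaml`: 128 rows of 16 bits each. A minimal sketch of that arithmetic, where `storage_bytes` is a hypothetical helper and not part of the lab's `loaders` module:

```python
def storage_bytes(depth: int, width_bits: int) -> int:
    """Capacity in bytes of a storage with `depth` rows of `width_bits` bits."""
    total_bits = depth * width_bits
    if total_bits % 8 != 0:
        raise ValueError("capacity is not a whole number of bytes")
    return total_bits // 8


# Values copied from the scratchpad entry in designs/singlePE/arch.yaml.
scratchpad_bytes = storage_bytes(depth=128, width_bits=16)
print(scratchpad_bytes)  # 256
```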


Now take a look at the compound component descriptions in `designs/components`. A compound component is built from primitive components to model a more complex element for analysis. These files describe the hardware details of each component in the design.

In [8]:
show_config('designs/components/mac_compute.yaml')
show_config('designs/components/reg_storage.yaml')
show_config('designs/components/smart_storage.yaml')

# Please do not modify this file. If there are double-curly-brace-enclosed
# statements, they are placeholders that should be set from the notebooks.
compound_components:
  version: 0.4
  classes:
    - name: mac_compute
      attributes:        # default attribute values (can be overridden by architecture specifications)
        technology: "45nm"
        datawidth: datawidth    # datawidth in bits
        num_pipeline_stages: 2
      subcomponents:     # a list of all components that this compound component is composed of (one in this example)
        - name: compute_unit
          class: intmac  # primitive class defined in primitive class library
          attributes:    # lower-level attributes that are mapped from upper level
            technology: technology
            latency: global_cycle_seconds
            datawidth: datawidth # datawidth in bits
            width: datawidth
            num_pipeline_stages: 2
      actions:           # definitions of the compound actions i

In [9]:
answer(
    question='1.2.2',
    subquestion='True/False: These components are made of multiple subcomponents. (False if they are made of a single subcomponent)',
    answer= False,
    required_type=bool, 
)
answer(
    question='1.2.3',
    subquestion='True/False: According to description of the `mac_compute` compound component, is our architecture capable of performing floating point computations?',
    answer= False,
    required_type=bool, 
)

1.2.2: True/False: These components are made of multiple subcomponents. (False if they are made of a single subcomponent)
	False
1.2.3: True/False: According to description of the `mac_compute` compound component, is our architecture capable of performing floating point computations?
	False


### Question 3
The command below performs static hardware characterization using **Accelergy**. You do not need to worry about the warning messages.

Now examine the file `designs/singlePE/output/ERT.yaml`. Please fill in the statements below in pJ. (**note that the implicit energy unit for the ERT is pJ**)

In [10]:
result = run_accelergy(
    architecture='designs/singlePE/arch.yaml',
)
# The energy reference table (ERT) is the one used to compute energy.
print(result.ert)

# The verbose energy reference table shows more information. You don't need it here, but you will in Q1.6.
# print(result.ert_verbose)

[INFO] 2025-03-05 23:50:40,536 - pytimeloop.accelergy_interface - Running Accelergy with command: accelergy /home/workspace/lab3/output_dir/parsed-processed-input.yaml -o ./output_dir/ -v



ERT:
    version: '0.4'
    tables:
      - name: system_arch_top_level.DRAM[1..1]
        actions:
          - name: read
            arguments:
                global_cycle_seconds: 1e-09
                action_latency_cycles: 1
            energy: 512.0
          - name: write
            arguments:
                global_cycle_seconds: 1e-09
                action_latency_cycles: 1
            energy: 512.0
          - name: update
            arguments:
                global_cycle_seconds: 1e-09
                action_latency_cycles: 1
            energy: 512.0
          - name: leak
            arguments:
                global_cycle_seconds: 1e-09
                action_latency_cycles: 1
            energy: 0.0
      - name: system_arch_top_level.scratchpad[1..1]
        actions:
          - name: read
            arguments: {}
            energy: 0.83416
          - name: write
            arguments: {}
            energy: 0.83416
          - name: update
            arguments

In [11]:
answer(
    question='1.3',
    subquestion='What is the mac compute energy?',
    answer= 3.275,
    required_type=Number, 
)
answer(
    question='1.3',
    subquestion='What is the DRAM write energy?',
    answer= 512,
    required_type=Number, 
)
answer(
    question='1.3',
    subquestion='What is the scratchpad leak energy?',
    answer= 0.0007728,
    required_type=Number, 
)

1.3: What is the mac compute energy?
	3.275
1.3: What is the DRAM write energy?
	512
1.3: What is the scratchpad leak energy?
	0.0007728


### Question 4 

Take a look at the `designs/singlePE/map.yaml` config file below. This config describes a mapping for a certain workload. By examining the mapping, can you tell the values of `M0`, `N0`, `C0`, `R`, `S`, `P`, and `Q` in the loop nest above? For each variable whose value you can determine, specify it in the chart below.

**A note on Timeloop mapping conventions**

1. ```permutation``` is the order of the loops from inner to outer. ```factors``` is the number of tiles on that level. For example, permutation `QPS` with factors `Q=5`, `P=2`, and `S=4` corresponds to the following loop nest:
	```
	for s in [0, 4):
	 for p in [0, 2):
	  for q in [0, 5):
	...
	```

2. A buffer level (e.g., scratchpad, registers) can also have a bypass specification. For example, an output buffer with `keep=[Output]` and `bypass=[Weights, Input]` will store only the `Output` tensor.
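The permutation convention in the first note can be sketched in a few lines of Python. The helper `loop_nest` below is hypothetical (it is not part of Timeloop or the lab's `loaders` module); it only expands a permutation string and a factor dictionary into loop headers, outermost first:

```python
def loop_nest(permutation: str, factors: dict) -> list:
    """Return loop headers, outermost first, for one mapping level.

    `permutation` lists dimensions from innermost to outermost, following
    the convention described in the note above.
    """
    lines = []
    # Reverse so the last dimension in the permutation is emitted first.
    for depth, dim in enumerate(reversed(permutation)):
        bound = factors[dim]
        lines.append(f"{' ' * depth}for {dim.lower()} in [0, {bound}):")
    return lines


nest = loop_nest("QPS", {"Q": 5, "P": 2, "S": 4})
print("\n".join(nest))
```

Running it on the example from the note reproduces the `s`, `p`, `q` ordering shown there.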

In [12]:
show_config('designs/singlePE/map.yaml')

# Please do not modify this file. If there are double-curly-brace-enclosed
# statements, they are placeholders that should be set from the notebooks.
mapping:
# mapping for the DRAM
- target: DRAM
  type: temporal
  factors: [R=1, S=1, P=1, Q=1, N=50, M=4, C=4]
  permutation: [R, S, P, Q, C, M, N]
# mapping for the local scratchpad inside the PE
- target: scratchpad
  type: temporal
  factors: [R=0, S=0, P=0, Q=0, N=1, M=2, C=1] # factor of 0 => full dimension
  permutation: [Q, P, N, C, M, S, R]
- target: scratchpad
  type: dataspace
  keep: [Weights]
  bypass: [Inputs, Outputs]
# mapping for the input and output registers of the mac unit
- target: weight_reg
  type: temporal
  factors: [R=1, S=1, P=1, Q=1, M=1, C=1, N=1]
  permutation: [P, Q, C, M, R, S, N]
- target: weight_reg
  type: dataspace
  keep: [Weights]
  bypass: [Inputs, Outputs]
- target: input_activation_reg
  type: temporal
  factors: [R=1, S=1, P=1, Q=1, M=1, C=1, N=1]
  permutation: [P, Q, C, M, R, S, N]
- target: inp

In [13]:
# TODO: Not sure
answer(
    question='1.4',
    subquestion='What are the [M0, N0, C0] factors?',
    answer= [2, 50, 4], # For each of the factors, put down the value if it is possible to tell what the value is. Otherwise, put down 'nan'.
    required_type=[(int, "nan"), (int, "nan"), (int, "nan")], 
)
answer(
    question='1.4',
    subquestion='What are the [S, R, P, Q] factors?',
    answer= ["nan", "nan", "nan", "nan"], # For each of the factors, put down the value if it is possible to tell what the value is. Otherwise, put down 'nan'.
    required_type=[(int, "nan"), (int, "nan"), (int, "nan"), (int, "nan")],
)

1.4: What are the [M0, N0, C0] factors?
	[2, 50, 4]
1.4: What are the [S, R, P, Q] factors?
	['nan', 'nan', 'nan', 'nan']


### Question 5
The command below performs a **Timeloop** runtime simulation of your design, with **Accelergy** queried as the backend to provide energy estimates for each simulated component. That is why you will see Accelergy-related outputs as well (*e.g.,* `timeloop-model.ERT.yaml`).

In [14]:
conv2_results = run_timeloop_model(
    architecture='designs/singlePE/arch.yaml',
    mapping='designs/singlePE/map.yaml',
    problem='layer_shapes/conv2.yaml'
)
conv2_stats = open('./output_dir/timeloop-model.stats.txt', 'r').read()
conv2_mapping = conv2_results.mapping

[INFO] 2025-03-05 23:50:41,182 - pytimeloop.accelergy_interface - Running Accelergy with command: accelergy /home/workspace/lab3/output_dir/parsed-processed-input.yaml -o ./output_dir/ -v



**Understanding Timeloop Output**

Can you now tell the dimensions of the layer from the produced mapping in `conv2_mapping`? Take a look at `conv2_stats` as well, and fill in the statements below.

In [15]:
print(conv2_mapping)

DRAM [ Weights:800 (800) Inputs:204800 (204800) Outputs:313600 (313600) ] 
-------------------------------------------------------------------------
| for N in [0:50)
|   for M in [0:4)
|     for C in [0:4)

scratchpad [ Weights:50 (50) ] 
------------------------------
|       for R in [0:5)
|         for S in [0:5)
|           for M in [0:2)
|             for P in [0:28)
|               for Q in [0:28)

weight_reg [ Weights:1 (1) ] 
input_activation_reg [ Inputs:1 (1) ] 
output_activation_reg [ Outputs:1 (1) ] 
---------------------------------------
|                 << Compute >>



In [16]:
answer(
    question='1.5',
    subquestion='What is the number of input channels?',
    answer= 4,
    required_type=int,
)
answer(
    question='1.5',
    subquestion='What is the number of output channels?',
    answer= 8,
    required_type=int,
)
answer(
    question='1.5',
    subquestion='What is the batch size?',
    answer= 50,
    required_type=int,
)
answer(
    question='1.5',
    subquestion='What are the output P and Q? P and Q are the same, so just give one value.',
    answer= 28,
    required_type=int,
)
answer(
    question='1.5',
    subquestion='What are the weight R and S? R and S are the same, so just give one value.',
    answer= 5,
    required_type=int,
)

1.5: What is the number of input channels?
	4
1.5: What is the number of output channels?
	8
1.5: What is the batch size?
	50
1.5: What are the output P and Q? P and Q are the same, so just give one value.
	28
1.5: What are the weight R and S? R and S are the same, so just give one value.
	5
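One way to read the layer shape off the printed mapping is to multiply each dimension's loop bounds across the storage levels. The sketch below does that with bounds transcribed by hand from the mapping above, assuming stride and dilation of 1 for the input footprint; the variable names are illustrative only:

```python
from math import prod

# Loop bounds read off the printed mapping, outermost level first.
# A dimension that does not appear at a level has an implicit bound of 1.
levels = [
    {"N": 50, "M": 4, "C": 4},                   # DRAM
    {"R": 5, "S": 5, "M": 2, "P": 28, "Q": 28},  # scratchpad
]

# Full size of each dimension is the product of its bounds over all levels.
dims = {d: prod(level.get(d, 1) for level in levels) for d in "NMCRSPQ"}

# Tensor footprints for a stride-1, dilation-1 convolution.
weights = dims["M"] * dims["C"] * dims["R"] * dims["S"]
inputs = dims["N"] * dims["C"] * (dims["P"] + dims["R"] - 1) * (dims["Q"] + dims["S"] - 1)
outputs = dims["N"] * dims["M"] * dims["P"] * dims["Q"]
macs = prod(dims.values())

print(dims)
print(weights, inputs, outputs, macs)
```

The three footprints agree with the `Weights`, `Inputs`, and `Outputs` sizes on the DRAM line of the printed mapping.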


View `conv2_stats` and fill in the statements below.

In [17]:
print(conv2_stats)

Buffer and Arithmetic Levels
----------------------------
Level 0
-------
=== mac ===

    SPECS
    -----
    Word bits             : 16
    Instances             : 1 (1*1)
    Compute energy        : 3.27 pJ

    STATS
    -----
    Utilized instances      : 1
    Computes (total)        : 31360000
    Cycles                  : 31360000
    Energy (total)          : 102704000.00 pJ
    Area (total)            : 1726.50 um^2

Level 1
-------
=== output_activation_reg ===

    SPECS
    -----
        Technology                      : SRAM
        Size                            : 1
        Word bits                       : 16
        Block size                      : 1
        Cluster size                    : 1
        Instances                       : 1 (1*1)
        Shared bandwidth                : -
        Read bandwidth                  : -
        Write bandwidth                 : -
        Multiple buffering              : 1.00
        Effective size                  : 1
     

In [18]:
answer(
    question='1.5',
    subquestion='What is the number of cycles?',
    answer= 31360000,
    required_type=int,
)
answer(
    question='1.5',
    subquestion='What is the total MAC energy (pJ overall, NOT PER MAC)?',
    answer= 102704000.00,
    required_type=Number,
)
answer(
    question='1.5',
    subquestion='What is the total scratchpad energy (pJ overall, NOT PER MAC)?',
    answer= 66732.80,
    required_type=Number,
)
answer(
    question='1.5',
    subquestion='What is the total DRAM energy (pJ overall, NOT PER MAC)?',
    answer= 5120000.00,
    required_type=Number,
)
answer(
    question='1.5',
    subquestion='What is the pJ/compute',
    answer= 387.0258,
    required_type=Number,
)

1.5: What is the number of cycles?
	31360000
1.5: What is the total MAC energy (pJ overall, NOT PER MAC)?
	102704000.0
1.5: What is the total scratchpad energy (pJ overall, NOT PER MAC)?
	66732.8
1.5: What is the total DRAM energy (pJ overall, NOT PER MAC)?
	5120000.0
1.5: What is the pJ/compute
	387.0258
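The total MAC energy in the stats can be cross-checked against the per-action energy from Question 3: it should equal the per-MAC energy times the number of computes. A small sketch of that check follows. The helper `pj_per_compute` is hypothetical, and it is applied here to the MAC alone, because the overall pJ/compute figure needs the energy of every level in the stats output:

```python
import math


def pj_per_compute(component_energies_pj: dict, computes: int) -> float:
    """Sum the per-component energies (pJ) and divide by the compute count."""
    return sum(component_energies_pj.values()) / computes


mac_energy_pj = 3.275   # per-MAC energy from the Question 3 answer
computes = 31_360_000   # "Computes (total)" from the stats output

total_mac_energy_pj = mac_energy_pj * computes
print(f"{total_mac_energy_pj:.2f} pJ")

# With only the MAC included, energy per compute is just the per-MAC energy.
mac_only = pj_per_compute({"mac": total_mac_energy_pj}, computes)
print(f"{mac_only:.3f} pJ/compute")
```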


### Question 6

Now that you understand the input and output files of the tools, we would like you to write your own input files and feed them to the evaluation system.

Many modern accelerator designs integrate address generators into their storages. The address generator is responsible for generating a sequence of read and write addresses for the memory, *i.e.,* for each read and write, the address is generated locally by the address generator. Typically, the address generator can be represented as an adder.

In this question, we would like you to update the compound component definition for the scratchpad to reflect the existence of such an additional address generator. To be specific:

1. name of the address generator: address_generator
2. class of the address generator: intadder
3. attributes associated with the address generator: 
    - datawidth (hint: arithmetic expressions including the ceil, floor, and
      log2 functions can be used). The datawidth of the address generator should
      be set to the minimum value such that each row of the memory has a unique
      address (*i.e.,* number of unique values must be >= memory depth).
    - technology
    - latency (hint: the global_cycle_seconds global variable is visible and can be used)
4. you also need to specify the role your address generator plays when the storage is read and written, and when the storage leaks over time (hint: the intadder has `add` and `leak` actions)
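The datawidth requirement in item 3 amounts to finding the smallest bit count `b` with `2**b >= depth`. A minimal Python sketch of that calculation is below; `address_bits` is a hypothetical helper, and the two depths tried are the `smart_storage` class default (24) and the scratchpad depth set in `arch.yaml` (128). Which of these Accelergy actually uses depends on how it applies attribute overrides, which this sketch does not model.

```python
def address_bits(depth: int) -> int:
    """Smallest number of bits b such that 2**b >= depth."""
    if depth < 1:
        raise ValueError("depth must be positive")
    # (depth - 1).bit_length() equals ceil(log2(depth)) for depth >= 1,
    # computed with integers so there is no floating-point rounding.
    return (depth - 1).bit_length()


for depth in (24, 128):
    print(depth, address_bits(depth))
```

In the YAML file the same quantity would be written as an arithmetic expression using `ceil` and `log2`, as the hint suggests.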

Inspect the `designs/components/smart_storage_addr_gen.yaml` configuration below. The variables enclosed in double curly braces (`{{ }}`) will be set later with a function.

In [19]:
show_config('designs/components/smart_storage_addr_gen.yaml')

# Please update this file to reflect the addition of address generator
compound_components:
  version: 0.4
  classes:
  - name: smart_storage
    attributes:        # default attribute values (can be overridden by architecture specifications)
      technology: "45nm"
      depth: 24
      width: 16
    subcomponents:     # a list of all components that this compound component is composed of (one in this example)
    - name: storage
      class: regfile # primitive class defined in primitive class library
      attributes:    # lower-level attributes that are mapped from upper level
        technology: technology
        latency: global_cycle_seconds
        depth : depth
        width: width

    # Add your hardware description for the address generator here
    - name: {{address_generator_name}}
      class: {{address_generator_class}}
      attributes:
        technology: {{address_generator_technology_attribute}}
        datawidth:  {{address_generator_number_of_address_bits}}
     

Fill in the following dictionary to set the double-curly-brace-enclosed variables in
the above description. Afterwards, run Accelergy (the command cell below).
Examine the outputs and fill in the chart below.

In [20]:
address_generator_config = dict(
    use_smart_storage_addr_gen = True, # DO NOT CHANGE THIS LINE
    address_generator_name = 'address_generator', 
    address_generator_class  = 'intadder',
    address_generator_technology_attribute = '"45nm"',
    address_generator_number_of_address_bits = 'ceil(log2(24))',
    address_generator_action_for_write = 'add',
    address_generator_action_for_read = 'add',
    address_generator_action_for_leak = 'leak',
########################
#### YOUR CODE HERE ####
########################
)

single_pe_ag_accelergy_result = run_accelergy(
    address_generator_config,
    architecture='designs/singlePE/arch.yaml',
)
print(single_pe_ag_accelergy_result.ert_verbose)

[INFO] 2025-03-05 23:50:42,350 - pytimeloop.accelergy_interface - Running Accelergy with command: accelergy /home/workspace/lab3/output_dir/parsed-processed-input.yaml -o ./output_dir/ -v



ERT_summary:
    version: '0.4'
    table_summary:
      - name: system_arch_top_level.DRAM[1..1]
        actions:
          - name: read
            energy: 512.0
          - name: write
            energy: 512.0
          - name: update
            energy: 512.0
          - name: leak
            energy: 0.0
        primitive_estimation(s):
          - system_arch_top_level.DRAM[1..1]:
                estimator: CactiDRAM
      - name: system_arch_top_level.scratchpad[1..1]
        actions:
          - name: read
            energy: 0.866972
          - name: write
            energy: 1.66832
          - name: update
            energy: 1.66832
          - name: leak
            energy: 0.00113217
        primitive_estimation(s):
          - action_name: read
            arguments: {}
            energy: 0.866972
            subaction_estimations:
              - subcomponent_name: storage
                subaction_name: read
                arguments:
                    address_del

In [21]:
# <HINT> You should see the address generator's contributions to the energy in
# the ERT above. If you don't see it, you may want to check your answer.
answer(
    question='1.6',
    subquestion='What is the read energy of the scratchpad (pJ)?',
    answer=  0.866972,
    required_type=Number,
)
answer(
    question='1.6',
    subquestion='What is the write energy of the scratchpad (pJ)?',
    answer= 1.66832,
    required_type=Number,
)
answer(
    question='1.6',
    subquestion='What is the leak energy of the scratchpad (pJ)?',
    answer= 0.00113217,
    required_type=Number,
)

answer( # No need for you to change this one.
    question='1.6',
    subquestion='What parameters did you put for (name, class, technology, datawidth, write_action, read_action, leak_action)?',
    answer=list(address_generator_config.values())[1:],
    required_type=[str] * (len(address_generator_config) - 1),
)

1.6: What is the read energy of the scratchpad (pJ)?
	0.866972
1.6: What is the write energy of the scratchpad (pJ)?
	1.66832
1.6: What is the leak energy of the scratchpad (pJ)?
	0.00113217
1.6: What parameters did you put for (name, class, technology, datawidth, write_action, read_action, leak_action)?
	['address_generator', 'intadder', '"45nm"', 'ceil(log2(24))', 'add', 'add', 'leak']


### Question 7
So far, we have been focusing on the dataflow described by the loop nest provided above. In this question, we would like you to update the mapping to represent the new loop nest shown below.

Please set the bounds in the `designs/singlePE/map_os.yaml` mapping according to the layer shape described in `layer_shapes/conv2.yaml`. You will again be doing this in the code cell below. **Note that some of the inner bounds are set for you** and **only keep outputs inside the scratchpad**.

After you have updated the mapping, run `timeloop-model` in the command cell below. Please also fill in the chart below:

<div class="row">
  <div class="column">
    <img align="center" src="designs/singlePE/figures/loopnest_os.png" alt="PE Architecture" style="margin:0px 0px 70px 70px; width:100%">
  </div>
</div>

In [22]:
show_config('layer_shapes/conv2.yaml')

problem:
  version: 0.4
  shape:
    name: "CNN_Layer"
    dimensions: [ C, M, R, S, N, P, Q ]
    coefficients:
    - name: Wstride
      default: 1
    - name: Hstride
      default: 1
    - name: Wdilation
      default: 1
    - name: Hdilation
      default: 1

    data_spaces:
    - name: Weights
      projection:
      - [ [C] ]
      - [ [M] ]
      - [ [R] ]
      - [ [S] ]
    - name: Inputs
      projection:
      - [ [N] ]
      - [ [C] ]
      - [ [R, Wdilation], [P, Wstride] ] # SOP form: R*Wdilation + P*Wstride
      - [ [S, Hdilation], [Q, Hstride] ] # SOP form: S*Hdilation + Q*Hstride
    - name: Outputs
      projection:
      - [ [N] ]
      - [ [M] ]
      - [ [Q] ]
      - [ [P] ]
      read_write: True

  instance:
    C: 4  # inchn
    M: 8  # outchn
    R: 5   # filter height
    S: 5   # filter width
    P: 28  # ofmap height
    Q: 28  # ofmap width
    N: 50   # batch size



**Update the mapping**

First, view the map file in `designs/singlePE/map_os.yaml`. Inspect the double-curly-brace-enclosed statements; these are the ones we will be filling in. When we update the dictionaries below, the double-curly-brace-enclosed variables will be substituted as the YAML file is loaded.

Fill out the variables in the dictionaries below to match the output-stationary loop nest for the problem above.
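A useful sanity check before running the model is that, for every dimension, the factors across levels multiply to the layer's size in `layer_shapes/conv2.yaml`. The sketch below assumes the `DRAM_factor_X` and `scratchpad_factor_X` key naming used by the placeholders and that the register levels all have factor 1. `check_factors` is a hypothetical helper, and the example factors mirror the weight-stationary mapping printed earlier rather than the output-stationary answer:

```python
from math import prod


def check_factors(config: dict, shape: dict, levels=("DRAM", "scratchpad")) -> dict:
    """Return {dim: (product_of_factors, expected)} for each mismatching dimension."""
    mismatches = {}
    for dim, expected in shape.items():
        got = prod(config[f"{level}_factor_{dim}"] for level in levels)
        if got != expected:
            mismatches[dim] = (got, expected)
    return mismatches


# Layer shape from layer_shapes/conv2.yaml.
conv2 = {"C": 4, "M": 8, "R": 5, "S": 5, "P": 28, "Q": 28, "N": 50}

# Illustrative factors taken from the earlier weight-stationary mapping.
dram = {"R": 1, "S": 1, "P": 1, "Q": 1, "N": 50, "M": 4, "C": 4}
spad = {"R": 5, "S": 5, "P": 28, "Q": 28, "N": 1, "M": 2, "C": 1}
example = {f"DRAM_factor_{d}": f for d, f in dram.items()}
example.update({f"scratchpad_factor_{d}": f for d, f in spad.items()})

print(check_factors(example, conv2))  # {} means every dimension is covered

# Dropping the scratchpad M factor leaves only 4 of the 8 output channels.
broken = dict(example, scratchpad_factor_M=1)
print(check_factors(broken, conv2))   # {'M': (4, 8)}
```

The same function can be called on the dictionary filled in for this question.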

In [23]:
show_config('designs/singlePE/map_os.yaml')

# Please do not modify this file. If there are double-curly-brace-enclosed
# statements, they are placeholders that should be set from the notebooks.
mapping:
# Mapping for the DRAM and scratchpads.
- target: DRAM
  type: temporal
  factors: 
  - R={{DRAM_factor_R}}
  - S={{DRAM_factor_S}}
  - P={{DRAM_factor_P}}
  - Q={{DRAM_factor_Q}}
  - N={{DRAM_factor_N}}
  - M={{DRAM_factor_M}}
  - C={{DRAM_factor_C}}
  permutation: {{DRAM_permutation}}

- target: scratchpad
  type: temporal
  factors: 
  - R={{scratchpad_factor_R}}
  - S={{scratchpad_factor_S}}
  - P={{scratchpad_factor_P}}
  - Q={{scratchpad_factor_Q}}
  - N={{scratchpad_factor_N}}
  - M={{scratchpad_factor_M}}
  - C={{scratchpad_factor_C}}
  permutation: {{scratchpad_permutation}}

- target: scratchpad
  type: dataspace
  keep: {{scratchpad_keep_list}}
  bypass: {{scratchpad_bypass_list}}

# Mapping for the registers. We will not change these.
- target: weight_reg
  type: temporal
  factors: [R=1, S=1, P=1, Q=1, M=1, C=1, N=1]

In [24]:
os_map_config = dict(
    DRAM_factor_R=1, 
    DRAM_factor_S=1,
    DRAM_factor_P=28,
    DRAM_factor_Q=1, 
    DRAM_factor_N=50,
    DRAM_factor_M=4,
    DRAM_factor_C=4,
    DRAM_permutation=['S', 'R', 'Q', 'P', 'C', 'M', 'N'], 

    scratchpad_factor_R=5,
    scratchpad_factor_S=5, 
    scratchpad_factor_P=1,
    scratchpad_factor_Q=28,
    scratchpad_factor_N=1, 
    scratchpad_factor_M=2,
    scratchpad_factor_C=1,
    scratchpad_permutation=['R', 'S', 'C', 'N', 'M', 'Q', 'P'],
    scratchpad_keep_list=['Outputs'],
    scratchpad_bypass_list=['Weights', 'Inputs'] 
)

# No need for you to change the following lines.
for key, value in os_map_config.items():
    required_type = int
    if 'permutation' in key or 'list' in key:
        required_type = [str] * len(value)
    answer(
        question='1.7',
        subquestion=f'Setting for {key} in the os_map_config',
        answer=value,
        required_type=required_type,
    )

1.7: Setting for DRAM_factor_R in the os_map_config
	1
1.7: Setting for DRAM_factor_S in the os_map_config
	1
1.7: Setting for DRAM_factor_P in the os_map_config
	28
1.7: Setting for DRAM_factor_Q in the os_map_config
	1
1.7: Setting for DRAM_factor_N in the os_map_config
	50
1.7: Setting for DRAM_factor_M in the os_map_config
	4
1.7: Setting for DRAM_factor_C in the os_map_config
	4
1.7: Setting for DRAM_permutation in the os_map_config
	['S', 'R', 'Q', 'P', 'C', 'M', 'N']
1.7: Setting for scratchpad_factor_R in the os_map_config
	5
1.7: Setting for scratchpad_factor_S in the os_map_config
	5
1.7: Setting for scratchpad_factor_P in the os_map_config
	1
1.7: Setting for scratchpad_factor_Q in the os_map_config
	28
1.7: Setting for scratchpad_factor_N in the os_map_config
	1
1.7: Setting for scratchpad_factor_M in the os_map_config
	2
1.7: Setting for scratchpad_factor_C in the os_map_config
	1
1.7: Setting for scratchpad_permutation in the os_map_config
	['R', 'S', 'C', 'N', 'M', 'Q', 

In [25]:
single_pe_os_results = run_timeloop_model(
    os_map_config,
    architecture='designs/singlePE/arch.yaml',
    mapping='designs/singlePE/map_os.yaml',
    problem='layer_shapes/conv2.yaml'
)
single_pe_os_stats = open('./output_dir/timeloop-model.stats.txt', 'r').read()
single_pe_os_mapping = single_pe_os_results.mapping  # You can print to check your answer
print(single_pe_os_stats)

[INFO] 2025-03-05 23:50:43,856 - pytimeloop.accelergy_interface - Running Accelergy with command: accelergy /home/workspace/lab3/output_dir/parsed-processed-input.yaml -o ./output_dir/ -v



Buffer and Arithmetic Levels
----------------------------
Level 0
-------
=== mac ===

    SPECS
    -----
    Word bits             : 16
    Instances             : 1 (1*1)
    Compute energy        : 3.27 pJ

    STATS
    -----
    Utilized instances      : 1
    Computes (total)        : 31360000
    Cycles                  : 31360000
    Energy (total)          : 102704000.00 pJ
    Area (total)            : 1726.50 um^2

Level 1
-------
=== output_activation_reg ===

    SPECS
    -----
        Technology                      : SRAM
        Size                            : 1
        Word bits                       : 16
        Block size                      : 1
        Cluster size                    : 1
        Instances                       : 1 (1*1)
        Shared bandwidth                : -
        Read bandwidth                  : -
        Write bandwidth                 : -
        Multiple buffering              : 1.00
        Effective size                  : 1
     

In [26]:
answer(
    question='1.7.2',
    subquestion=f'Number of cycles',
    answer= 31360000,
    required_type=int,
)
answer(
    question='1.7.2',
    subquestion=f'MAC energy (pJ overall, NOT PER MAC)',
    answer= 102704000.00,
    required_type=Number,
)
answer(
    question='1.7.2',
    subquestion=f'Scratchpad energy (pJ overall, NOT PER MAC)',
    answer= 2615925.76,
    required_type=Number,
)
answer(
    question='1.7.2',
    subquestion=f'DRAM energy (pJ overall, NOT PER MAC)',
    answer= 280985600.00 + 4014080000.00 + 4014080000.00,
    required_type=Number,
)
answer(
    question='1.7.2',
    subquestion=f'Total pJ/compute',
    answer= 269.18943,
    required_type=Number,
)

1.7.2: Number of cycles
	31360000
1.7.2: MAC energy (pJ overall, NOT PER MAC)
	102704000.0
1.7.2: Scratchpad energy (pJ overall, NOT PER MAC)
	2615925.76
1.7.2: DRAM energy (pJ overall, NOT PER MAC)
	8309145600.0
1.7.2: Total pJ/compute
	269.18943


### Question 8
This question asks how moving loops affects the overall data movement between storage levels. When answering, please make the following assumptions:
- All storages are large enough to fit all data. 
- If a temporal loop is above a given storage element, then that storage element is flushed (*i.e.*, all data is removed and re-fetched) for each iteration of the loop. This happens even if the data does not change across loop iterations (*e.g.,* if an upper-level buffer iterates over M, we would still flush inputs from a lower-level buffer).

We will be processing a convolutional layer with the following Einsum (note stride equals 1): 

$$
O_{m,n,p,q} = I_{n,c,p+r,q+s} \times F_{m,c,r,s}
$$

We currently have the following storage hierarchy and loop nest:

```
DRAM [Weights Inputs Outputs]
-----------------------------
| for P in [0..2):        Loop (P)
|  for Q in [0..2):       Loop (Q)
|   for C in [0..2):      Loop (C)
|    for R in [0..2):     Loop (R)
|     for S in [0..2):    Loop (S)
|      for N in [0..2):   Loop (N)
|       for M in [0..2):  Loop (M)

Scratchpad [Weights Inputs Outputs]
--------------------------------------
< No loops here >

Registers [Weights Inputs Outputs]
----------------------------------
|               <MAC>
```

Note that the above loop syntax uses Timeloop notation, where loops are written with a capital dimension name rather than iteration variables. For example, the following loop nest is Timeloop notation:
```
for P in [0..2):
  for P in [0..2):
```

It is equivalent to the following loop nest in standard notation:
```
for p0 in [0..2):
  for p1 in [0..2):
```
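
As a small illustration, the sketch below runs the standard-notation nest in Python and combines the two iteration variables into a single coordinate along P. The row-major combination `p0 * P1_BOUND + p1` is an assumption about how the two factors tile the rank. The names `P0_BOUND`, `P1_BOUND`, and `visited` are hypothetical.

```python
P0_BOUND, P1_BOUND = 2, 2  # two loops over rank P, as in the example above

visited = []
for p0 in range(P0_BOUND):
    for p1 in range(P1_BOUND):
        # Assumed row-major combination of the outer and inner factors of P.
        p = p0 * P1_BOUND + p1
        visited.append(p)
```

Under this assumption the two loops together cover a P dimension of size `P0_BOUND * P1_BOUND`, visiting each coordinate exactly once.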

Please answer the following.

In [27]:
answer(
    question='1.8',
    subquestion='How many inputs, outputs, and weights (including duplicates) are transferred from DRAM to the scratchpad? Answer in the order of inputs, outputs, and weights.',
    answer= [128, 128, 128], # Answer here
    required_type=[int, int, int],
)
answer(
    question='1.8',
    subquestion='How many inputs, outputs, and weights (including duplicates) are transferred from the scratchpad to the registers? Answer in the order of inputs, outputs, and weights.',
    answer= [128, 128, 128], # Answer here
    required_type=[int, int, int],
)



1.8: How many inputs, outputs, and weights (including duplicates) are transferred from DRAM to the scratchpad? Answer in the order of inputs, outputs, and weights.
	[128, 128, 128]
1.8: How many inputs, outputs, and weights (including duplicates) are transferred from the scratchpad to the registers? Answer in the order of inputs, outputs, and weights.
	[128, 128, 128]


We now have the option of moving loop (P), (Q), (C), (R), (S), (N), or (M) from beneath DRAM to beneath the scratchpad (*e.g.*, if we chose (P), the (P) loop would be removed from the DRAM loop nest and placed in the scratchpad loop nest instead).
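
One way to reason about these moves is the counting model sketched below, which covers DRAM-to-scratchpad transfers only. It assumes the flush rule stated earlier: the scratchpad is refilled once per iteration of the loops left above it, and each refill fetches the distinct tensor coordinates touched by the loops beneath it. The names `BOUNDS`, `COORDS`, and `dram_to_scratchpad` are hypothetical helpers, not part of the lab framework.

```python
from itertools import product
from math import prod

BOUNDS = {"P": 2, "Q": 2, "C": 2, "R": 2, "S": 2, "N": 2, "M": 2}

# Coordinates of each datatype in the Einsum O[m,n,p,q] = I[n,c,p+r,q+s] * F[m,c,r,s].
COORDS = {
    "inputs":  lambda i: (i["N"], i["C"], i["P"] + i["R"], i["Q"] + i["S"]),
    "outputs": lambda i: (i["M"], i["N"], i["P"], i["Q"]),
    "weights": lambda i: (i["M"], i["C"], i["R"], i["S"]),
}

def dram_to_scratchpad(moved):
    """Return [inputs, outputs, weights] transfers when the `moved` loops sit beneath the scratchpad."""
    above = [rank for rank in BOUNDS if rank not in moved]
    fills = prod(BOUNDS[rank] for rank in above)  # one flush and refill per iteration above
    counts = []
    for name in ("inputs", "outputs", "weights"):
        tile = set()
        for values in product(*(range(BOUNDS[rank]) for rank in moved)):
            idx = {rank: 0 for rank in above}  # loops above are fixed during one fill
            idx.update(zip(moved, values))
            tile.add(COORDS[name](idx))        # distinct coordinates touched in this fill
        counts.append(fills * len(tile))
    return counts

baseline = dram_to_scratchpad([])
per_loop = {rank: dram_to_scratchpad([rank]) for rank in BOUNDS}
```

In this model a move only helps a datatype when the moved rank does not appear in that datatype's coordinates, because only then does the tile stay at one element while the number of fills halves.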

In [28]:
answer(
    question='1.8',
    subquestion='If we choose to move the (M) loop to beneath the scratchpad, how many inputs, outputs, and weights are transferred from DRAM to the scratchpad? Count repeated transfers as unique transfers (i.e., read a particular input twice --> two inputs transferred).',
    answer= [64, 128, 128], # Answer here
    required_type=[int, int, int],
)
answer(
    question='1.8',
    subquestion='If we choose to move the (M) loop to beneath the scratchpad, how many inputs, outputs, and weights are transferred from the scratchpad to the registers? Count repeated transfers as unique transfers (i.e., read a particular input twice --> two inputs transferred).',
    answer= [64, 128, 128], # Answer here
    required_type=[int, int, int],
)

answer(
    question='1.8',
    subquestion='If we choose to move the (C) loop to beneath the scratchpad, how many inputs, outputs, and weights are transferred from DRAM to the scratchpad? Count repeated transfers as unique transfers (i.e., read a particular input twice --> two inputs transferred).',
    answer= [128, 64, 128], # Answer here
    required_type=[int, int, int],
)
answer(
    question='1.8',
    subquestion='If we choose to move the (C) loop to beneath the scratchpad, how many inputs, outputs, and weights are transferred from the scratchpad to the registers? Count repeated transfers as unique transfers (i.e., read a particular input twice --> two inputs transferred).',
    answer= [128, 64, 128], # Answer here
    required_type=[int, int, int],
)

answer(
    question='1.8',
    subquestion='If we choose to move the (N) loop to beneath the scratchpad, how many inputs, outputs, and weights are transferred from DRAM to the scratchpad? Count repeated transfers as unique transfers (i.e., read a particular input twice --> two inputs transferred).',
    answer= [128, 128, 64], # Answer here
    required_type=[int, int, int],
)
answer(
    question='1.8',
    subquestion='If we choose to move the (N) loop to beneath the scratchpad, how many inputs, outputs, and weights are transferred from the scratchpad to the registers? Count repeated transfers as unique transfers (i.e., read a particular input twice --> two inputs transferred).',
    answer= [128, 128, 64], # Answer here
    required_type=[int, int, int],
)

answer(
    question='1.8',
    subquestion='Looking at the einsum, what has to be true about a loop for it to be helpful in reducing DRAM-scratchpad transfers for a particular datatype by moving it from the DRAM to the scratchpad loop nest? Please answer using the rank of the loop (*i.e.* P,Q,R,S,C,N,M) and its role in the Einsum.',
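    answer= "The loop's rank must not index that datatype in the Einsum (e.g., M does not index the inputs), so the datatype's tile is unchanged across that loop's iterations.",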
    required_type=str,
    assumptions=[], # Put a list of strings as assumptions, if any.
)
answer(
    question='1.8',
    subquestion='Were we able to affect data movement between the scratchpad and registers? Why or why not?',
    answer= 'No. All loops remain above the registers, so under the flush assumption the registers are refilled on every iteration no matter which loop is moved.',
    required_type=str,
    assumptions=[], # Put a list of strings as assumptions, if any.
)

1.8: If we choose to move the (M) loop to beneath the scratchpad, how many inputs, outputs, and weights are transferred from DRAM to the scratchpad? Count repeated transfers as unique transfers (i.e., read a particular input twice --> two inputs transferred).
	[64, 128, 128]
1.8: If we choose to move the (M) loop to beneath the scratchpad, how many inputs, outputs, and weights are transferred from the scratchpad to the registers? Count repeated transfers as unique transfers (i.e., read a particular input twice --> two inputs transferred).
	[64, 128, 128]
1.8: If we choose to move the (C) loop to beneath the scratchpad, how many inputs, outputs, and weights are transferred from DRAM to the scratchpad? Count repeated transfers as unique transfers (i.e., read a particular input twice --> two inputs transferred).
	[128, 64, 128]
1.8: If we choose to move the (C) loop to beneath the scratchpad, how many inputs, outputs, and weights are transferred from the scratchpad to the registers? Count

Now fill out the following lines with (P), (Q), (R), (S), (C), (N), and (M). If we can't reduce data movement by moving any of the loops, say "None". *Hint: The ordering of scratchpad-level loops will affect your answer. Could moving certain loops lead to reuse through convolutional sliding windows?*
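
The sliding-window hint can be checked with a short worked example. Inputs are indexed by p + r, so when both (P) and (R) sit beneath the scratchpad, different (p, r) pairs can land on the same input row. The sketch below counts distinct rows for the bounds used in this question. The variable names are hypothetical, and whether this overlap counts toward a given answer depends on the reuse model the lab assumes.

```python
from itertools import product

P_BOUND, R_BOUND = 2, 2

# All (p, r) iterations executed beneath the scratchpad when both loops are moved.
iterations = list(product(range(P_BOUND), range(R_BOUND)))

# Inputs are indexed by p + r, so pairs with the same sum share an input row.
input_rows = {p + r for p, r in iterations}

rows_needed_together = len(input_rows)   # distinct rows per scratchpad fill
rows_without_overlap = len(iterations)   # rows fetched if every iteration refetched
```

Moving only one of the two loops gives two iterations touching two distinct rows, so no overlap appears. The saving shows up only when both loops are beneath the scratchpad. The same argument applies to (Q) and (S) along the q + s coordinate.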

In [29]:
ALLOWED_ANSWERS = ('(P)', '(Q)', '(R)', '(S)', '(C)', '(N)', '(M)')
answer(
    question='1.8',
    subquestion='To reduce **DRAM <==> Scratchpad** input movement, we could move loops:',
    answer= ['(M)', '(R)', '(S)', '(P)', '(Q)'], # Answer here
    required_type=[ALLOWED_ANSWERS] * 5,
)
answer(
    question='1.8',
    subquestion='To reduce **DRAM <==> Scratchpad** output movement, we could move loops:',
    answer= ['(R)', '(S)', '(C)'], # Answer here
    required_type=[ALLOWED_ANSWERS] * 3,
)
answer(
    question='1.8',
    subquestion='To reduce **DRAM <==> Scratchpad** weight movement, we could move loops:',
    answer= ['(N)', '(P)', '(Q)'], # Answer here
    required_type=[ALLOWED_ANSWERS] * 3,
)

1.8: To reduce **DRAM <==> Scratchpad** input movement, we could move loops:
	['(M)', '(R)', '(S)', '(P)', '(Q)']
1.8: To reduce **DRAM <==> Scratchpad** output movement, we could move loops:
	['(R)', '(S)', '(C)']
1.8: To reduce **DRAM <==> Scratchpad** weight movement, we could move loops:
	['(N)', '(P)', '(Q)']
