## Part 3: Mapspace Exploration with Timeloop

We start with the same architecture and mapping as in Part 2.

<br>
<div class="row">
  <div class="column">
    <img align="left" src="designs/system/figures/arch.png" alt="Full System  Architecture Diagram" style="margin:50px 0px 0px 50px; width:40%">
  </div>
  <div class="column">
    <img  align="left"  src="designs/system/figures/loopnest.png" alt="System Loopnest" style="width:50%">
  </div>
</div>

### Question 1
In this question, we would like you to find the architecture and associated mapping with the highest throughput (i.e., the fewest cycles) for `layer_shapes/conv2.yaml`. If two architectures achieve the same throughput, choose the one that consumes less energy.

First, inspect the mapping file below and see what double-curly-brace-enclosed parameters are available to set.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from loaders import *
show_config('designs/system/map.yaml')

Now, optimize these mappings for a 1x16-PE, 2x8-PE, and 4x4-PE architecture by setting the variables below. To simplify the mapspace, most mapping parameters
are fixed and cannot be changed. The following constraints always hold (and cannot be overridden by the variables below):

- The mapping agrees with the loop nest provided in the image above.
- Input channels can only be spatially mapped to the rows of the PE array and output channels can only be spatially mapped to the columns of the PE array.
- PE scratchpads only store filter weights.
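
When tuning by hand, it can help to enumerate the candidate factor combinations for one dimension before trying them. The sketch below is a minimal illustration, assuming that the factors assigned to a dimension across the levels must multiply to that dimension's size and that a spatial factor cannot exceed the number of PEs along its axis. The dimension size `12`, the three-level split, and the helper names are hypothetical; check `map.yaml` and `layer_shapes/conv2.yaml` for the real values.

```python
def divisors(n):
    """Return all positive divisors of n in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def factor_splits(total, levels):
    """Yield every ordered tuple of `levels` positive integers whose product is `total`."""
    if levels == 1:
        yield (total,)
        return
    for d in divisors(total):
        for rest in factor_splits(total // d, levels - 1):
            yield (d,) + rest


# Hypothetical example: a dimension of size 12 split across three levels,
# e.g. (DRAM factor, spatial factor, innermost factor).
splits = list(factor_splits(12, 3))

# Keep only the splits whose spatial factor (second entry) fits on a
# hypothetical array axis with 4 PEs.
fitting = [s for s in splits if s[1] <= 4]

print(len(splits), len(fitting))
```

Each surviving tuple is one candidate assignment for that dimension; combining candidates across dimensions gives the settings worth evaluating.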

After optimizing these mappings, please fill out the table below with the energy and delay of each mapping. An example setting is shown below.

**For full credit, all of your optimized mappings must be below 400 uJ and 4,000,000 cycles. Furthermore, at least one mapping must be below 275 uJ.**

In [None]:
config_example = dict( # Do not change this configuration!
    DRAM_factor_N=50,
    DRAM_factor_M=8,
    DRAM_factor_C=4,
    global_buffer_factor_N=1,
    global_buffer_factor_M=1,
    global_buffer_factor_C=1,
    PE_spatial_factor_M=1,
    PE_spatial_factor_C=1,
    scratchpad_factor_N=1,
)

config_optimized_1x16 = dict( # Replace these with your optimized values
    DRAM_factor_N=50,
    DRAM_factor_M=8,
    DRAM_factor_C=4,
    global_buffer_factor_N=1,
    global_buffer_factor_M=1,
    global_buffer_factor_C=1,
    PE_spatial_factor_M=1,
    PE_spatial_factor_C=1,
    scratchpad_factor_N=1,
########################
#### YOUR CODE HERE ####
########################
)

config_optimized_2x8 = dict( # Replace these with your optimized values
    DRAM_factor_N=50,
    DRAM_factor_M=8,
    DRAM_factor_C=4,
    global_buffer_factor_N=1,
    global_buffer_factor_M=1,
    global_buffer_factor_C=1,
    PE_spatial_factor_M=1,
    PE_spatial_factor_C=1,
    scratchpad_factor_N=1,
########################
#### YOUR CODE HERE ####
########################
)

config_optimized_4x4 = dict( # Replace these with your optimized values
    DRAM_factor_N=50,
    DRAM_factor_M=8,
    DRAM_factor_C=4,
    global_buffer_factor_N=1,
    global_buffer_factor_M=1,
    global_buffer_factor_C=1,
    PE_spatial_factor_M=1,
    PE_spatial_factor_C=1,
    scratchpad_factor_N=1,
########################
#### YOUR CODE HERE ####
########################
)

In [None]:
configs = {
    'Example 1x16': {**config_example, "pe_meshX": 1, "pe_meshY": 16},
    'Example 2x8': {**config_example, "pe_meshX": 2, "pe_meshY": 8},
    'Example 4x4': {**config_example, "pe_meshX": 4, "pe_meshY": 4},
    'Optimized 1x16': {**config_optimized_1x16, "pe_meshX": 1, "pe_meshY": 16},
    'Optimized 2x8': {**config_optimized_2x8, "pe_meshX": 2, "pe_meshY": 8},
    'Optimized 4x4': {**config_optimized_4x4, "pe_meshX": 4, "pe_meshY": 4},
}

# Set to None to run all configurations. FOR YOUR FINAL SUBMISSION, MAKE SURE YOU RUN ALL CONFIGURATIONS.
CONFIG_TO_RUN = None # 'Optimized 4x4'

to_run = list(configs.keys()) if CONFIG_TO_RUN is None else [CONFIG_TO_RUN]

# DO NOT CHANGE
THRES = [(float('inf'), float('inf')),
         (float('inf'), float('inf')),
         (float('inf'), float('inf')),
         (4_000_000, 400),
         (4_000_000, 400),
         (4_000_000, 400)]

min_energy = float('inf')
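for k in to_run:
    # Index by position in `configs` so the thresholds still line up
    # when only a single configuration is run.
    i = list(configs.keys()).index(k)
    cycle_thres, energy_thres = THRES[i]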
    result = run_timeloop_model(
        configs[k],
        architecture='designs/system/arch.yaml',
        mapping='designs/system/map.yaml',
        problem='layer_shapes/conv2.yaml'
    )
    with open('./output_dir/timeloop-model.stats.txt', 'r') as f:
        stats = f.read()
    mapping = result.mapping
    if len(to_run) == 1:
        print(stats)

    lines = stats.split('\n')
    energy = float([l for l in lines if 'Energy:' in l][0].split(' ', 2)[1])
    cycles = int([l for l in lines if 'Cycles:' in l][0].split(' ', 1)[1])
    min_energy = min(min_energy, energy)

    if i < 3:
        print(f'{k} --- Latency: {cycles} cycles; Energy: {energy} uJ.')
    else:
        answer( # Don't change this
            question='3.1',
            subquestion=f'{k} --- Latency: {cycles} cycles; Energy: {energy} uJ. Passing?',
            answer=cycles < cycle_thres and energy < energy_thres,
            required_type=bool
        )
    print('')

answer( # Don't change this
    question='3.1',
    subquestion=f'Minimum energy achieved: {min_energy} uJ. Passing?',
    answer=min_energy < 275,
    required_type=bool
)
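
The cell above extracts energy and cycles by splitting summary lines on spaces. A slightly more defensive alternative is sketched below using regular expressions. It assumes the stats file contains summary lines of the form `Cycles: <int>` and `Energy: <float> uJ`, which is what the parsing code above implies; the `parse_stats` helper and the sample text are illustrative only and are not part of `loaders`.

```python
import re


def parse_stats(stats_text):
    """Extract (cycles, energy_uJ) from 'Cycles: N' and 'Energy: X uJ' summary lines."""
    cycles_match = re.search(r'^\s*Cycles:\s*(\d+)', stats_text, flags=re.MULTILINE)
    energy_match = re.search(r'^\s*Energy:\s*([0-9.eE+-]+)\s*uJ', stats_text, flags=re.MULTILINE)
    if cycles_match is None or energy_match is None:
        # Fail loudly instead of raising an opaque IndexError.
        raise ValueError('Could not find the Cycles/Energy summary lines')
    return int(cycles_match.group(1)), float(energy_match.group(1))


# Made-up sample text in the assumed format (not real Timeloop output).
sample = "Summary Stats\n-------------\nCycles: 3920000\nEnergy: 68.54 uJ\n"
cycles, energy = parse_stats(sample)
print(cycles, energy)
```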

### Question 2
Manually generating the best mapping for each architecture and layer shape is time-consuming, even when the search is performed under a tightly constrained mapspace, *e.g.,* the one in question 2.2. Therefore, Timeloop provides automatic mapspace search functionality when appropriate mapspace constraints are given.

To perform an automatic mapspace search, you need to provide mapspace constraints as an input. Mapspace constraints specify the limitations imposed by your dataflow or hardware structures. They include the same directives as the mapping, except they are an *incomplete* description of the mapping, and they allow the mapper to optimize over all unspecified parts. An example mapspace constraint is shown below (`designs/system/constraints.yaml`). To automatically search the mapspace with the constraints file, you should run the `run_timeloop_mapper` command.

*The search should take less than 5 minutes to finish. If you are running the command from the shell instead of running the cell below, you can also terminate it at any time by pressing Ctrl+C. After you send the signal, you will need to wait for Timeloop to finish the remaining computations; terminated threads are shown with a dash next to their IDs.*

In [None]:
show_config('designs/system/constraints.yaml')

In [None]:
sys_1x16_result = run_timeloop_mapper(
    {'pe_meshX': 1, 'pe_meshY': 16},
    architecture='designs/system/arch.yaml',
    problem='layer_shapes/conv2.yaml',
    constraints='designs/system/constraints.yaml',
    mapper='designs/_include/mapper.yaml'
)

with open('./output_dir/timeloop-mapper.stats.txt', 'r') as f:
    sys_1x16_stats = f.read()
sys_1x16_mapping = sys_1x16_result.mapping

# Stats for the best mapping found by the mapper.
print(sys_1x16_result.cycles)

In [None]:
lines = sys_1x16_stats.split('\n')
energy = float([l for l in lines if 'Energy:' in l][0].split(' ', 2)[1])
cycles = int([l for l in lines if 'Cycles:' in l][0].split(' ', 1)[1])
print(energy, cycles)

In [None]:
# Loop nest of the best mapping found by the mapper.
print(sys_1x16_mapping)

In this question, we have provided you with a much more relaxed set of constraints in `designs/system/constraints_relaxed.yaml`.

Please examine the constraints and list two relaxations of the mapspace constraints in `designs/system/constraints_relaxed.yaml` compared to `designs/system/constraints.yaml`. (*Note: there are more than two relaxations, but you only need to list two.*)

In [None]:
show_config('designs/system/constraints_relaxed.yaml')

In [None]:
answer(
    question='3.2',
    subquestion='List two hardware levels for which constraints have been relaxed. Answer as a Python list of strings (e.g., ["PE", "DRAM"]).',
    answer= ['FILL ME', 'FILL ME'], # Answer here
    required_type=[str, str]
)

Below, we run the mapper on all three architectures (1x16, 2x8, 4x4) for all three workloads (conv1, conv2, and fc1). For each workload, find the architecture with the highest throughput by inspecting `all_stats`. If two architectures achieve the same throughput, choose the one that consumes less energy. Please fill in the answers below.
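
The selection rule above (fewest cycles first, lower energy as the tie-breaker) can be sketched with Python's tuple ordering. The numbers below are hypothetical placeholders, not Timeloop results, and `best_architecture` is an illustrative helper rather than part of the lab code.

```python
# Hypothetical (cycles, energy in uJ) per array shape; not real Timeloop output.
results = {
    '1x16': (4000, 12.5),
    '2x8': (3000, 11.0),
    '4x4': (3000, 10.2),
}


def best_architecture(results):
    """Pick the shape with the fewest cycles, breaking ties by lower energy."""
    # Tuples compare lexicographically: cycles first, then energy.
    return min(results, key=lambda shape: results[shape])


best = best_architecture(results)
print(best, results[best])
```

Because `(cycles, energy)` tuples compare element by element, no explicit tie-handling branch is needed.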

In [None]:
workloads = {
    'conv1': 'layer_shapes/conv1.yaml',
    'conv2': 'layer_shapes/conv2.yaml',
    'fc1': 'layer_shapes/fc1.yaml'
}

pe_array_shapes = [
    (1, 16),
    (2, 8),
    (4, 4)
]

all_stats = {'conv1': {}, 'conv2': {}, 'fc1': {}}
all_maps = {'conv1': {}, 'conv2': {}, 'fc1': {}}

for name, workload in workloads.items():
    print('')
    for meshX, meshY in pe_array_shapes:
        result = run_timeloop_mapper(
            {'pe_meshX': meshX, 'pe_meshY': meshY},
            architecture='designs/system/arch.yaml',
            problem=workload,
            constraints='designs/system/constraints_relaxed.yaml',
            mapper='designs/_include/mapper.yaml',
            pe_spatial_c_constraint=True, # An extra constraint to speed up the mapper for this example.
        )
        with open('./output_dir/timeloop-mapper.stats.txt', 'r') as f:
            stats = f.read()
        mapping = result.mapping
        all_stats[name][(meshX, meshY)] = stats
        all_maps[name][(meshX, meshY)] = mapping
        lines = stats.split('\n')
        energy = float([l for l in lines if 'Energy:' in l][0].split(' ', 2)[1])
        cycles = int([l for l in lines if 'Cycles:' in l][0].split(' ', 1)[1])
        
        answer(
            question='3.2',
            subquestion=f'{meshX}x{meshY} cycles and energy (uJ) for {name}',
            answer=[cycles, energy],
            required_type=[int, Number]
        )

Additionally, you will need to consider the workload shapes, which are printed below for your convenience:

In [None]:
for workload, file in workloads.items():
    print(f"### {workload} ###")
    with open(file, 'r') as f:
        print(f.read())

In [None]:
answer(
    question='3.2',
    subquestion='(True/False) There is a PE array shape that achieves fewer cycles and energy compared to other array shapes for conv1.',
    answer= 'FILL ME',
    required_type=bool
)
answer(
    question='3.2',
    subquestion='Given the array shapes (1x16, 2x8, and 4x4) and the workload shape of conv1, what is the maximum PE array utilization for each array shape? Answer as a list, e.g., [0.25, 0.5, 1].',
    answer= ['FILL ME', 'FILL ME', 'FILL ME'], # Answer here
    required_type=[Number, Number, Number]
)
answer(
    question='3.2',
    subquestion='Best [architecture name, cycles, total energy (uJ)] for conv2. Please answer as a list, e.g., ["2x8", 3920000, 68.54].',
    answer= ['FILL ME', 'FILL ME', 'FILL ME'], # Answer here
    required_type=[('2x8', '1x16', '4x4'), int, Number]
)
answer(
    question='3.2',
    subquestion='Best [architecture name, cycles, total energy (uJ)] for fc1. Please answer as a list with 3 elements. Please answer as a list, e.g., ["2x8", 3920000, 68.54].',
    answer= ['FILL ME', 'FILL ME', 'FILL ME'], # Answer here
    required_type=[('2x8', '1x16', '4x4'), int, Number]
)
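
For the utilization question above, one way to reason about an upper bound is sketched below. It assumes, as stated in Question 1, that input channels (C) map only to rows and output channels (M) only to columns, and it further assumes that a spatial factor must evenly divide its dimension. The helper names and the example dimensions are hypothetical, and which mesh dimension counts as rows is not fixed here; read the real values from the workload files and `arch.yaml`.

```python
def largest_divisor_at_most(n, limit):
    """Return the largest divisor of n that does not exceed limit."""
    return max(d for d in range(1, min(n, limit) + 1) if n % d == 0)


def max_utilization(C, M, rows, cols):
    """Upper bound on the fraction of PEs that can be kept busy."""
    used_rows = largest_divisor_at_most(C, rows)  # C is mapped across rows
    used_cols = largest_divisor_at_most(M, cols)  # M is mapped across columns
    return used_rows * used_cols / (rows * cols)


# Hypothetical layer with C=3 input channels and M=8 output channels.
for rows, cols in [(1, 16), (2, 8), (4, 4)]:
    print(rows, cols, max_utilization(3, 8, rows, cols))
```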

### Question 3
Your circuit designer has told you that it is too expensive to have a separate architecture for each layer shape. You must now use a single fixed architecture (i.e., a fixed height and width of the PE array). For this architecture, you may still choose a different mapping for each layer shape.

Which of the architectures explored in question 1 achieves the **highest average throughput (1/cycles)** across the three layer shapes? Calculate the throughput for each layer independently, then average the results (DO NOT calculate 1 / (sum of cycles)). Please fill in the answers below using that architecture.
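
The difference between averaging per-layer throughput and taking the inverse of total cycles can be seen in a short sketch. The cycle counts and the names `arch_A` and `arch_B` below are hypothetical and chosen only so that the two metrics disagree; substitute your own mapper results.

```python
# Hypothetical cycle counts per layer for two made-up architectures.
cycles_by_arch = {
    'arch_A': {'conv1': 100, 'conv2': 1000, 'fc1': 10},
    'arch_B': {'conv1': 60, 'conv2': 900, 'fc1': 40},
}


def average_throughput(cycles_per_layer):
    """Mean of the per-layer throughputs (1 / cycles), as the question asks."""
    return sum(1.0 / c for c in cycles_per_layer.values()) / len(cycles_per_layer)


def total_cycles(cycles_per_layer):
    """Sum of cycles over all layers (the metric NOT to use here)."""
    return sum(cycles_per_layer.values())


best_by_avg = max(cycles_by_arch, key=lambda a: average_throughput(cycles_by_arch[a]))
best_by_total = min(cycles_by_arch, key=lambda a: total_cycles(cycles_by_arch[a]))
print(best_by_avg, best_by_total)
```

Averaging 1/cycles weights short layers heavily, so an architecture that is much faster on a small layer can win even if its total cycle count is higher.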

In [None]:
answer(
    question='3.3',
    subquestion='What is the best overall architecture for the three workloads?',
    answer= 'FILL ME',
    required_type=('1x16', '2x8', '4x4')
)
answer(
    question='3.3',
    subquestion='What are the [cycles, total energy (uJ)] for this architecture and conv1?',
    answer= ['FILL ME', 'FILL ME'], # Answer here
    required_type=[int, Number]
)
answer(
    question='3.3',
    subquestion='What are the [cycles, total energy (uJ)] for this architecture and conv2?',
    answer= ['FILL ME', 'FILL ME'], # Answer here
    required_type=[int, Number]
)
answer(
    question='3.3',
    subquestion='What are the [cycles, total energy (uJ)] for this architecture and fc1?',
    answer= ['FILL ME', 'FILL ME'], # Answer here
    required_type=[int, Number]
)