## Design Space Exploration

This notebook will show you how to use AccelForge to perform a design space exploration.
We'll look at the Eyeriss architecture and analyze how the global buffer size affects
the energy of a workload.

First, import AccelForge and initialize a `Spec`. We'll use a 4096x4096x4096 matrix
multiply as our workload.

In [None]:
import accelforge as af
from pathlib import Path

examples_dir = Path("../../examples")

spec = af.Spec.from_yaml(
    examples_dir / "arches" / "eyeriss.yaml",
    examples_dir / "workloads" / "matmuls.yaml",
    jinja_parse_data={"N_EINSUMS": 1, "M": 4096, "KN": 4096},
)
spec.mapper.metrics = af.Metrics.ENERGY

Now we'll define a range of global buffer sizes to explore. The sizes are in bits and
are spaced logarithmically from 32 bytes to 4 MB.

In [None]:
size_start = 32 * 8             # 32 bytes, expressed in bits
size_end = 4 * 1024 * 1024 * 8  # 4 MB, expressed in bits
n_sizes = 32                    # Number of sizes to explore
# Log-spaced sizes, rounded to whole bits
sizes = [
    round(size_start * (size_end / size_start) ** (i / (n_sizes - 1)))
    for i in range(n_sizes)
]
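
The list comprehension above spaces the sizes geometrically: each size is the previous one multiplied by a constant ratio, so the points land evenly on a log axis. The standalone sketch below restates that formula using only the standard library and checks the constant-ratio property. The helper name `geometric_sizes` is ours for illustration and is not part of AccelForge.

```python
def geometric_sizes(start: float, end: float, n: int) -> list[float]:
    """Return n values from start to end with a constant ratio between neighbors."""
    if n < 2:
        raise ValueError("need at least two points")
    return [start * (end / start) ** (i / (n - 1)) for i in range(n)]


# Same endpoints as the notebook: 32 bytes to 4 MB, both expressed in bits.
sizes = geometric_sizes(32 * 8, 4 * 1024 * 1024 * 8, 32)

# Every step multiplies the size by the same factor.
ratios = [b / a for a, b in zip(sizes, sizes[1:])]
print(f"first={sizes[0]:.0f} bits, last={sizes[-1]:.0f} bits, step ratio={ratios[0]:.3f}")
```

If NumPy is available, `numpy.geomspace(start, end, n)` produces the same kind of spacing.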

Now we'll define a function that will map the workload to the architecture, setting the
global buffer size to the given size.

In [None]:
def get_mapper_result(spec: af.Spec, global_buffer_size: int):
    spec.arch.find("GlobalBuffer").size = global_buffer_size
    return spec.map_workload_to_arch(
        print_progress=False,
        print_number_of_pmappings=False
    )

We could use this function directly. However, there's a faster way, so we've commented
out the direct use below.

In [None]:
# < SLOWER >
# Directly use the function
# results = {size: get_mapper_result(spec, size) for size in sizes}

By default, AccelForge parallelizes each mapping run. However, when we're evaluating
many different architectures, spawning processes for each individual mapping run is
inefficient. Instead, we'll use the `parallel` function from `accelforge.util` to
parallelize over the different architectures.

In [None]:
from accelforge.util import parallel, delayed

spec.mapper.metrics = af.Metrics.ENERGY
jobs = {size: delayed(get_mapper_result)(spec, size) for size in sizes}
results = parallel(jobs, pbar="Generating Mappings")
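
The cell above builds a dict of deferred calls and hands the whole dict to one parallel runner, so the pool is set up once for the entire sweep. The sketch below shows that general pattern with the standard library. It is not AccelForge's implementation: `run_jobs` and `evaluate` are hypothetical names, and it uses threads only to keep the example short (a process pool is the usual choice for CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Hashable


def run_jobs(
    jobs: dict[Hashable, tuple[Callable[..., Any], tuple]],
    max_workers: int = 4,
) -> dict[Hashable, Any]:
    """Run every (function, args) job in one shared pool and keep the input keys."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {key: pool.submit(fn, *args) for key, (fn, args) in jobs.items()}
        # Collect in submission order so results line up with the input dict.
        return {key: future.result() for key, future in futures.items()}


def evaluate(size: int) -> int:
    """Hypothetical stand-in for an expensive per-design evaluation."""
    return size * size


# Each job is described up front and executed later, like the delayed calls above.
toy_jobs = {size: (evaluate, (size,)) for size in [256, 512, 1024]}
toy_results = run_jobs(toy_jobs)
print(toy_results)
```

Keying both the jobs and the results by size keeps each result paired with the design point that produced it, which is what the plotting cells rely on.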


Now we'll plot the results. We can see a clear U-shaped curve, with the energy per
compute minimized around global buffer sizes of 2e5 bits, or about 25 kB.

In [None]:
import matplotlib.pyplot as plt

x = list(results.keys())
y = [v.per_compute().energy() * 1e12 for v in results.values()]
plt.plot(x, y)
plt.xlabel("Global Buffer Size (bits)")
plt.ylabel("Energy per Compute (pJ)")
plt.xscale("log")
plt.show()
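
To locate the minimum programmatically instead of reading it off the plot, we can take the size with the lowest energy and convert it from bits to bytes. The sketch below uses a made-up dict in place of the real sweep output, and the helper names are ours. With the notebook's `results`, the same `min` call would use `results[s].per_compute().energy()` as its key.

```python
def bits_to_kb(bits: float) -> float:
    """Convert bits to kilobytes (1 kB = 1000 bytes)."""
    return bits / 8 / 1000


def bits_to_kib(bits: float) -> float:
    """Convert bits to kibibytes (1 KiB = 1024 bytes)."""
    return bits / 8 / 1024


# Hypothetical sweep output: buffer size in bits -> energy per compute in pJ.
# These numbers are made up for illustration; they are not AccelForge results.
energy_by_size = {1e4: 9.0, 5e4: 5.5, 2e5: 4.0, 1e6: 6.0, 1e7: 12.0}

# The dict key with the smallest value is the best buffer size.
best_size = min(energy_by_size, key=energy_by_size.get)
print(f"minimum at {best_size:.0e} bits = {bits_to_kb(best_size):.1f} kB")
```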

We can also break the energy down by component to see why the energy per compute scales
as it does.

The per-component plot below shows several things:

- For small global buffer sizes, main memory energy dominates. In this region,
  increasing global buffer sizes can reduce main memory energy by getting more reuse and
  reducing main memory accesses.
- For large global buffer sizes, global buffer energy dominates as the energy per access
  to the global buffer increases. Main memory energy continues to decrease due to fewer
  accesses, but growing global buffer energy overwhelms the benefits.
- For extremely large global buffer sizes, we see a dramatic shift in energy as global
  buffer energy decreases and main memory energy increases. This is because the global
  buffer has become so expensive that the mapper decides to bypass it entirely for some
  tensors, accessing them directly from main memory to reduce the global buffer energy.
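
The three regimes above can be reproduced qualitatively with a toy model. This is only a sketch under assumed scalings, not AccelForge's energy model: we assume main memory energy falls as 1/sqrt(size) and global buffer energy rises as sqrt(size), and the constants `a`, `b`, and `bypass_energy` are arbitrary. Under those assumptions the total a/sqrt(S) + b*sqrt(S) is minimized at S = a/b.

```python
import math


def toy_energy(size_bits: float, a: float = 4096.0, b: float = 1.0) -> float:
    """Toy energy per compute: the main memory term falls, the buffer term rises."""
    main_memory = a / math.sqrt(size_bits)    # fewer off-chip accesses as the buffer grows
    global_buffer = b * math.sqrt(size_bits)  # costlier accesses as the buffer grows
    return main_memory + global_buffer


def toy_energy_with_bypass(size_bits: float, bypass_energy: float = 1000.0) -> float:
    """Skip the buffer once using it costs more than the assumed bypass energy."""
    return min(toy_energy(size_bits), bypass_energy)


# Sweep powers of two and pick the size with the lowest toy energy.
toy_sizes = [2.0**k for k in range(5, 26)]
best = min(toy_sizes, key=toy_energy)
print(f"toy minimum at {best:.0f} bits (analytic optimum a/b = {4096.0 / 1.0:.0f})")
```

The bypass function is a crude stand-in for the mapper's choice: once the buffer term grows past the cost of going straight to main memory, the cheaper option wins and the energy stops following the buffer curve.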

In [None]:
x = list(results.keys())
per_component_results = {}
for r in results.values():
    for component, energy in r.per_compute().energy(per_component=True).items():
        # Scale by 1e12 to report pJ, matching the total-energy plot above
        per_component_results.setdefault(component, []).append(energy * 1e12)

for component, energy in per_component_results.items():
    plt.plot(x, energy, label=component)
plt.legend()
plt.xlabel("Global Buffer Size (bits)")
plt.ylabel("Energy per Compute (pJ)")
plt.xscale("log")
plt.show()

We can inspect area as well. The `calculate_component_area_energy_latency_leak` function
calculates the area, energy, latency, and leakage power for each component. It returns a
spec whose total area we can read from the `arch.total_area` attribute and whose
per-component areas we can read from `arch.per_component_total_area`.

We can see that, for very small global buffer sizes, area is not significantly impacted
because other components dominate area. For large global buffer sizes, area is
proportional to the global buffer size because the global buffer dominates overall area.
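
A toy model makes the two area regimes concrete. The sketch below assumes total area is a fixed term plus a term proportional to buffer size. The constants are arbitrary and are not taken from AccelForge or Eyeriss.

```python
def toy_total_area(
    size_bits: float, fixed_mm2: float = 1.0, mm2_per_bit: float = 1e-6
) -> float:
    """Toy total area: fixed area of other components plus a per-bit buffer area."""
    return fixed_mm2 + mm2_per_bit * size_bits


# Doubling a tiny buffer barely changes the total area ...
small_growth = toy_total_area(512) / toy_total_area(256)
# ... while doubling a huge buffer nearly doubles it.
large_growth = toy_total_area(2**25) / toy_total_area(2**24)
print(f"small-buffer growth: {small_growth:.4f}x, large-buffer growth: {large_growth:.2f}x")
```

On log-log axes this shape appears as a flat line that bends into a line of slope one.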

In [None]:
def get_area(spec: af.Spec, size: int):
    spec.arch.find("GlobalBuffer").size = size
    return spec.calculate_component_area_energy_latency_leak()

areas = {k: delayed(get_area)(spec, k) for k in sizes}
areas = parallel(areas, pbar="Calculating Area")

# Plot the total area
x = list(areas.keys())
y = [cur_spec.arch.total_area * 1e6 for cur_spec in areas.values()]
plt.plot(x, y)
plt.xlabel("Global Buffer Size (bits)")
plt.ylabel("Area (mm^2)")
plt.xscale("log")
plt.yscale("log")
plt.show()

# Plot per-component area
x = list(areas.keys())
y = {}
for cur_spec in areas.values():
    for component, area in cur_spec.arch.per_component_total_area.items():
        y.setdefault(component, []).append(area * 1e6)

for component, area in y.items():
    plt.plot(x, area, label=component)
plt.legend()
plt.xlabel("Global Buffer Size (bits)")
plt.ylabel("Area (mm^2)")
plt.xscale("log")
plt.yscale("log")
plt.show()