## Advanced Usage of the Mapper

This notebook shows advanced usage of the mapper.

We first initialize the spec, which is an `af.Spec` object. Here it is loaded from YAML
files, though it may also be built from Python objects.

Specifications loaded from YAML may use Jinja2 templating. The `jinja_parse_data`
parameter passes data to the templating engine.
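
As a rough illustration of what templating does before the YAML is parsed, the sketch below renders a small template with the `jinja2` package directly. The YAML keys and the `data` dictionary are hypothetical and are not taken from the AccelForge schema. In AccelForge, a dictionary like this would be supplied through `jinja_parse_data`.

```python
# Requires the third-party jinja2 package.
from jinja2 import Template

# Hypothetical YAML fragment; the keys are illustrative only.
yaml_template = """\
workload:
  batch_size: {{ batch_size }}
  n_tokens: {{ n_tokens }}
"""

# Values that the templating engine substitutes for the {{ ... }} placeholders.
data = {"batch_size": 1, "n_tokens": 8192}

rendered = Template(yaml_template).render(**data)
print(rendered)
```

The rendered text is plain YAML with the placeholders replaced, so the same file can describe several workload sizes.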

In [None]:
from pathlib import Path

examples_dir = Path("../../examples")

# < DOC_INCLUDE_MARKER > make_spec
import accelforge as af

# Set the number of parallel threads that the mapper can use. If you are running out of
# memory, you may decrease this number. By default the number of threads is set to the
# number of cores on your machine.
import os
# os.cpu_count() may return None if the core count cannot be determined; fall back to 1.
af.set_n_parallel_jobs(os.cpu_count() or 1, print_message=True)

# Initialize the spec and show the workload.
BATCH_SIZE = 1
N_TOKENS = 8192
FUSE = False
spec = af.Spec.from_yaml(
    examples_dir / "arches" / "tpu_v4i.yaml",
    examples_dir / "workloads" / "gpt3_6.7B.yaml",
)
# Fusion happens when tensors bypass the outermost Memory object, so, to disable fusion,
# force all tensors to be in the outermost memory.
if not FUSE:
    for node in spec.arch.nodes:
        if isinstance(node, af.arch.Memory):
            print(f'Keeping all tensors in {node.name}')
            node.tensors.keep = "All"
            break

We'll first visualize the architecture. The architecture is a tree, starting at the top
with the outermost memory level. Each leaf of the tree is a `Compute` component.

In [None]:
spec.arch

Now we'll visualize the workload. The workload is a cascade of Einsums, with boxes
showing Einsums (computation steps), ovals showing tensors, and arrows showing
dependencies.

In [None]:
spec.workload

Next, we'll set the optimization metrics for the mapper. Using fewer metrics makes the
mapper faster, because suboptimal mappings are easier to prune. A mapping is suboptimal
if and only if another mapping is better in all metrics.
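
To see why fewer metrics prune more aggressively, here is a minimal sketch of Pareto pruning in plain Python. It is not the mapper's implementation. The candidate values are made up, lower is assumed to be better, and `dominates` uses the common definition of dominance (no worse in every metric and strictly better in at least one), which may treat ties differently from the mapper.

```python
from typing import Dict, List

Metrics = Dict[str, float]


def dominates(a: Metrics, b: Metrics) -> bool:
    """True if a is no worse than b in every metric and strictly better in one."""
    return all(a[k] <= b[k] for k in a) and any(a[k] < b[k] for k in a)


def pareto_prune(candidates: List[Metrics]) -> List[Metrics]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical mappings with made-up costs.
candidates = [
    {"energy": 1.0, "latency": 4.0},
    {"energy": 2.0, "latency": 2.0},
    {"energy": 3.0, "latency": 3.0},  # dominated by the candidate above
    {"energy": 4.0, "latency": 1.0},
]

both = pareto_prune(candidates)
energy_only = pareto_prune([{"energy": c["energy"]} for c in candidates])
print(len(both), len(energy_only))
```

With two metrics, three of the four candidates survive because each wins on one of the metrics. With energy alone, only the lowest-energy candidate survives, so far less needs to be carried forward.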

In [None]:
# Set optimization metrics
spec.mapper.ffm.metrics = af.mapper.FFM.Metrics.ENERGY
# spec.mapper.ffm.metrics = af.mapper.FFM.Metrics.LATENCY
# spec.mapper.ffm.metrics = af.mapper.FFM.Metrics.LATENCY | af.mapper.FFM.Metrics.ENERGY

Workloads can be mapped onto the architecture in one step using the `spec.map_workload_to_arch` method.

In [None]:
# < DOC_INCLUDE_MARKER > map_workload_to_arch

# Limit the number of fused loops that can exist in a single pmapping. Commenting out
# this line makes the mapper slower, but may generate better mappings.
spec.mapper.ffm.max_fused_loops = 1
mapping = spec.map_workload_to_arch()

# Render the mapping with mapping.render(), or in the last line of a notebook:
mapping

We can inspect the energy and latency of the resulting mapping using the `energy()` and
`latency()` methods of the mapping object.
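
The per-compute figures can be read as totals divided by the number of compute operations. The sketch below shows that arithmetic in plain Python. It assumes this is what per-compute normalization means, and the component names and numbers are hypothetical, not AccelForge output.

```python
def per_compute(total: float, n_computes: int) -> float:
    """Normalize a total (joules or seconds) by the number of compute operations."""
    if n_computes <= 0:
        raise ValueError("n_computes must be positive")
    return total / n_computes


# Hypothetical per-component energy in joules.
component_energy_j = {"hypothetical_memory": 1.5e-3, "hypothetical_mac": 0.5e-3}
n_computes = 4_000_000  # hypothetical count of compute operations

total_energy_j = sum(component_energy_j.values())
energy_per_compute_j = per_compute(total_energy_j, n_computes)
per_component_j = {k: per_compute(v, n_computes) for k, v in component_energy_j.items()}

print(f"Energy: {total_energy_j}J, {energy_per_compute_j}J/compute")
for name, value in per_component_j.items():
    print(f"\t{name}: {value}J/compute")
```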

In [None]:
# < DOC_INCLUDE_MARKER > mapping_stats

print(f'Energy: {mapping.energy()}J, {mapping.per_compute().energy()}J/compute')
for k, v in mapping.per_compute().energy(per_component=True).items():
    print(f'\t{k}: {v}J/compute')

print(f'Latency: {mapping.latency()}s, {mapping.per_compute().latency()}s/compute')
for k, v in mapping.per_compute().latency(per_component=True).items():
    print(f'\t{k}: {v}s/compute')