## **RAELLA Model**
## **Overview**
This is the Accelergy/Timeloop model of the RAELLA Processing-In-Memory (PIM) Deep Neural Network (DNN) accelerator architecture from "RAELLA: Reforming the Arithmetic for Efficient, Low-Resolution, and Low-Loss Analog PIM: No Retraining Required!" by Tanner Andrulis, Vivienne Sze, and Joel Emer (https://dl.acm.org/doi/10.1145/3579371.3589062).

The model includes several parameterized RAELLA architectures, plus a model of the ISAAC architecture from "ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars" by Ali Shafiee et al. (https://dl.acm.org/doi/10.1145/3007787.3001139).

This model provides estimates of area, energy, and throughput for the architectures while they run DNN layers. All DNNs used in the RAELLA paper are included.

## **Table of Contents**
1. [Getting Started](#Getting-Started)
2. [Speeding Up Mapping](#speeding-up-mapping)
3. [Collecting Input Files And Putting Together The Specification](#collecting-input-files-and-putting-together-the-specification)
4. [Mapping a DNN Layer to the Architecture](#mapping-a-dnn-layer-to-the-architecture)
5. [Plotting Energy, Area, and Throughput Results](#plotting-energy-area-and-throughput-results)
6. [One-Layer Ablation Experiment](#one-layer-ablation-experiment)
7. [Multi-Layer Ablation Experiment](#multi-layer-ablation-experiment)
8. [Full-DNN Ablation Experiment](#full-dnn-ablation-experiment)
9. [Full-DNN Mapping with Energy, Area, And Throughput Results](#full-dnn-mapping-with-energy-area-and-throughput-results)
10. [Tips for Extending the Model](#tips-for-extending-the-model)
11. [Glossary](#glossary)
12. [Contact](#contact)
13. [References](#references)


## **Getting Started**
This Jupyter notebook guides you through running the model on any of the provided architectures or DNNs. It starts off with a single-DNN-layer, single-architecture example, and then shows how to run multiple DNN layers and multiple architectures, building up to a full ablation study over a single DNN.

Once you're familiar with the model, you can use/edit any of the provided scripts or YAML files to build your own experiments.

We'll start off by importing the necessary Python packages and defining paths.

In [None]:
import os
import timeloopfe as tl
from timeloopfe.v4spec import Specification
from scripts.processors import ArtifactProcessor, GreedyMapperProcessor
from timeloopfe.processors.v4_standard_suite import STANDARD_SUITE
from subprocess import PIPE
from scripts.helper_functions import *
import tqdm
from typing import Optional


thisdir = os.getcwd()

# Stores recorded values from PyTorch kernels to be used in the simulator.
# Values include input/weight value statistics, speculation success rates, and
# choices for RAELLA's Adaptive Weight Slicing.
STATS_PATH = f"{thisdir}/recorded_statistics"

# Architecture models of RAELLA and ISAAC.
ARCHITECTURE_PATH = f"{thisdir}/models"

# Path to the mapper configuration file. 
MAPPER_PATH = f"{thisdir}/mapper.yaml"

# Environment variables to be used in Timeloop. We want scientific notation
# output.
env = {
    "TIMELOOP_OUTPUT_STAT_SCIENTIFIC": 1, # Outputs scientific notation
    "TIMELOOP_OUTPUT_STAT_DEFAULT_FLOAT": 0, # Outputs default floating-point
}

# Where Timeloop/Accelergy/TimeloopFE will do work.
RUN_DIRECTORY = f"{thisdir}/rundir"
OUTPUT_STATS_PATH = f"{RUN_DIRECTORY}/timeloop_mapper.stats.txt"
OUTPUT_ART_PATH = f"{RUN_DIRECTORY}/timeloop_mapper.ART.yaml"


## **Speeding Up Mapping** <a name="speeding-up-mapping"></a>
To speed up mapping, we use a timeloopfe processor to generate constraints on
our mapspace. Constraints are dependent on the architecture and the DNN layer,
so the processor is run for each combination.

The processor reduces the size of the mapspace by many orders of magnitude,
allowing us to run experiments in a reasonable amount of time. Of course, the
mappings we explore here are only a subset of what an unconstrained Timeloop
search would cover, but the constraints are reasonable ones to make. The
following are the constraints we use and why they are reasonable:

- Pack the weights as densely as possible into the crossbars. This is
  reasonable because better utilization of the crossbars is always better for
  ISAAC/RAELLA, and all buffers are provisioned to handle maximally-utilized
  crossbars, so we won't get buffer overflows from doing this.
- Replicate weights over N/Q (output batches / output columns) across tiles
  after all weights are mapped. This is reasonable because, after confirming we
  have extra space, working with more tiles in parallel is helpful. We have
  different tiles work over the dimensions with the LEAST reuse so that each
  individual tile has more reuse. This may not return the
  data-movement-energy-optimal result, but it's close enough that the
  differences don't matter much.
- Restrict temporal permutations such that P (horizontal traversal of outputs),
  then Q (vertical traversal of outputs), then N (different samples) is the
  order for all buffers. This is reasonable because RAELLA tiles consume input
  rows one at a time and produce output rows one at a time. Generally, moving
  within a row gets more reuse than moving between rows, which gets more reuse
  than moving between samples. This may not be true always, but data movement
  energy is not the dominant factor in RAELLA/ISAAC, so we don't worry about
  it too much.
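
The first constraint, dense packing, can be pictured as a greedy factor assignment. The sketch below is not the `GreedyMapperProcessor` from this repository; it is a toy illustration that assumes factors must divide the dimension evenly, and the level names and capacities are hypothetical.

```python
def greedy_allocate(dim_size, capacities):
    """Greedily assign spatial factors of one dimension to hardware levels.

    capacities: list of (level_name, max_factor) pairs, visited in order.
    Each level takes the largest divisor of what is left that fits in its
    capacity; anything that does not fit "spills" to the next level.
    Returns (factors, leftover), where leftover must be handled elsewhere
    (for example, by temporal loops).
    """
    factors = {}
    remaining = dim_size
    for level, cap in capacities:
        if cap < 1:
            raise ValueError(f"capacity of {level!r} must be at least 1")
        limit = min(cap, remaining)
        # Largest divisor of `remaining` that does not exceed the capacity.
        factor = max(f for f in range(1, limit + 1) if remaining % f == 0)
        factors[level] = factor
        remaining //= factor
    return factors, remaining


# Hypothetical example: C*R*S = 128*3*3 = 1152 mapped over made-up levels.
factors, leftover = greedy_allocate(
    1152, [("row", 512), ("crossbar", 4), ("tile", 8)]
)
print(factors, leftover)
```

Visiting the innermost level first mirrors the "fill, then spill" order used in the processor list that follows.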

In [None]:
# NOTE: A description of all the dimensions (CRSLMPGNQ) and architecture levels
# (tile, macro, crossbar, column, cim_unit) can be found in the Glossary near
# the end of this notebook.
greedy_mapper_processors = [
    # 1. PACK WEIGHTS AS DENSELY AS POSSIBLE.    
    # First map input channels over cim_units. Spill if needed.
    GreedyMapperProcessor("CRS", ("cim_unit",), no_other_factors=True),
    GreedyMapperProcessor("CRS", ("crossbar", "macro", "tile")),
    # Allocate weight slices to columns.
    GreedyMapperProcessor("L", ("column",)),
    # Allocate output channels to columns. Spill if necessary.
    GreedyMapperProcessor("M", ("column", "crossbar", "macro", "tile")),
    # Map G at the highest levels possible. There is no reuse across the G
    # dimension, so keeping it local isn't helpful.
    GreedyMapperProcessor("G", ("tile", "macro", "crossbar")),

    # 2. REPLICATE WEIGHTS WITH ANY EXTRA SPACE.
    # We've mapped all the weights already, so the remaining space at the tile
    # level can go to N/Q
    GreedyMapperProcessor("NQ", ("tile",)),
    
    # 3. APPLY INTELLIGENT RESTRICTIONS TO TEMPORAL REUSE.
    # Reuse across P (horizontal traversal of outputs, within-row) is
    # most helpful, so do that innermost. Reuse across Q (vertical traversal of
    # outputs, within-column) is less helpful, so do that next. Reuse across N
    # (different samples within batch) is least helpful, so do that outermost.
    GreedyMapperProcessor(temporal_permutations="NQP"),
]

## **Collecting Input Files And Putting Together The Specification**
The following function creates a timeloopfe specification for a given
parameterization of RAELLA/ISAAC running a DNN layer. Comments explain each
piece.

In [None]:
def get_spec(
        architecture: str, 
        dnn: str, 
        layer: str, 
        use_typical_statistics: bool,
        n_tiles: Optional[int] = None,
        batch_size: Optional[int] = None,
    ) -> Specification:
    architecture = architecture.lower()
    dnn = dnn.lower()
    layer = layer.lower()

    # The recorded statistics are stored in two places: typical_values and
    # torch_values. Torch values are recorded from PyTorch kernels for specific
    # layers. Typical values are general-case, recommended for use with unseen
    # layers.
    if use_typical_statistics:
        stats_path = f"{STATS_PATH}/typical_values/{architecture}"
    else:
        stats_path = f"{STATS_PATH}/torch_values/{architecture}/{dnn}/"
    
    # The ArtifactProcessor does a few things:
    # 1. Reads & populates the recorded statistics.
    # 2. For ISAAC designs, reads some additional values from the layer YAML.
    #    The values read are the per-bit probability of a weight/input having a
    #    value of 1. These values are already handled in the RAELLA recorded
    #    statistics.
    # 3. For QDQBERT, RAELLA must process positive/negative inputs in two
    #    separate passes. This processor adds the appropriate adjustments. 
    artifact_processor = ArtifactProcessor(
        isaac="isaac" in architecture,
        raella_bert="raella" in architecture and "qdqbert" in dnn.lower(),
        vars_from=stats_path,
        use_typical_statistics=use_typical_statistics,
    )
    
    # The other RAELLA designs are just parameterized versions of the RAELLA
    # architecture, and use the same architecture specification.
    yamldir = f'{ARCHITECTURE_PATH}/{architecture}/'
    if 'raella' in architecture:
        yamldir = f'{ARCHITECTURE_PATH}/raella/'
    
    # Top-level architecture YAML file.
    arch_path = f'{yamldir}/arch.yaml'
    # Variables with which to populate the architecture.
    variables_path = f'{yamldir}/variables.yaml'
    # Components that are common to all architectures.
    common_components_path = f'{ARCHITECTURE_PATH}/common_components/*.yaml'
    # Layer YAML file.
    layer_path = f'{thisdir}/dnn_layers/{dnn}/{layer}.yaml'
    # Mapper YAML file.
    mapper_path = f'{thisdir}/mapper.yaml'
    # Put together processors. The STANDARD_SUITE is a suite of processors
    # that are generally recommended. The order is important, as:
    # 1. The artifact processor populates variables
    # 2. The math processor in the standard suite uses those variables
    # 3. The constraint attacher processor in the standard suite organizes
    #    constraints in the YAML files.
    # 4. The greedy mapper processors use those constraints.
    procs = [artifact_processor] + STANDARD_SUITE + greedy_mapper_processors
    
    spec = Specification.from_yaml_files(
        arch_path,
        variables_path,
        common_components_path,
        mapper_path,
        layer_path,
        processors=procs
    )
    
    # Set variables
    if batch_size is not None:
        spec.variables['BATCH_SIZE'] = batch_size
    if n_tiles is not None:
        spec.variables['TILES_PER_ARCH'] = n_tiles
    
    spec.process() # Run the processors.
    return spec

## **Mapping a DNN Layer to the Architecture**
We can now run the mapper. There are several DNNs to choose from, and several
parameterizations of RAELLA/ISAAC. Set the DNN_CHOICE, ARCHITECTURE_CHOICE,
LAYER_CHOICE, USE_TYPICAL_STATISTICS, N_TILES, and BATCH_SIZE parameters below
to choose which setup to run.

The tl.mapper command runs the Timeloop mapper to find the best mapping for the
given architecture and DNN layer. It should take from 6 to 60 seconds to run,
depending on the settings and machine.

The four architectures included are the architectures used in the ablation
study of RAELLA. They are:
- isaac_like: An 8b ISAAC. Crossbars are 128 by 128 1T1R devices running unsigned
  arithmetic. Eight bit weights are sliced into four slices of two bits each.
  Eight bit inputs are sliced into eight slices of one bit each. ADCs are 8b.

- raella_no_speculation_2b_cells: This is RAELLA running Center+Offset, its
  first contribution. Crossbars are 512 by 512 2T2R devices running Center+Offset
  arithmetic. Eight bit weights are sliced into four slices of two bits each.
  Eight bit inputs are sliced into eight slices of one bit each. ADCs are 7b.

- raella_no_speculation: This is RAELLA running Center+Offset and Adaptive
  Weight Slicing, its first two contributions. The same setup is used as
  raella_no_speculation_2b_cells, but the weight slicings are chosen per-layer.
  Most layers use three slices in a 4b-2b-2b pattern.

- raella: This is RAELLA running Center+Offset, Adaptive Weight Slicing, and
  Speculation; all three contributions. The same setup is used as
  raella_no_speculation, but speculation is enabled. RAELLA runs a 2-4 bit
  speculation input slice followed by 2-4 one-bit recovery input slices. In
  recovery cycles, ADCs do not convert columns where speculation succeeded.
  Overall, this reduces ADC usage by approximately 60%, at the cost of lower
  throughput and increased energy for other components.
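
To make the slicing patterns above concrete, here is a minimal sketch of how an 8-bit value is partitioned into slices. It only illustrates bit partitioning on unsigned integers; it does not model Center+Offset encoding, and the function names are hypothetical rather than part of this repository.

```python
def slice_bits(value, widths, total_bits=8):
    """Split an unsigned integer into slices, most-significant slice first."""
    if sum(widths) != total_bits:
        raise ValueError("slice widths must add up to total_bits")
    if not 0 <= value < (1 << total_bits):
        raise ValueError("value does not fit in total_bits")
    slices = []
    shift = total_bits
    for width in widths:
        shift -= width
        slices.append((value >> shift) & ((1 << width) - 1))
    return slices


def recombine(slices, widths):
    """Inverse of slice_bits: shift each slice back into place."""
    value = 0
    for piece, width in zip(slices, widths):
        value = (value << width) | piece
    return value


weight = 0b10110110
print(slice_bits(weight, [2, 2, 2, 2]))  # four 2b weight slices
print(slice_bits(weight, [4, 2, 2]))     # 4b-2b-2b pattern, three slices
print(slice_bits(weight, [1] * 8))       # eight 1b input-style slices
```

With eight 1b input slices, four weight slices give 8 x 4 = 32 input-slice/weight-slice pairs, while three weight slices give 8 x 3 = 24, which is why using fewer weight slices reduces the number of ADC converts.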

You can expect slightly different energy results from the ablation study in the
RAELLA paper. This is because:
- We're finding different mappings here. The original paper didn't use the
  greedy mapper; instead, Timeloop was run for multiple days for each DNN.
- Models of networks in Timeloop have been improved. 
- There was a mistake in the output register model in the original paper,
  resulting in doubled output register size for the RAELLA models. This has
  been fixed, which reduces output register area / energy for RAELLA
  parameterizations.
- There was a mistake in ISAAC's memory cell energy in the original paper,
  where input bits were counted with a different encoding. This has been fixed,
  which reduces memory cell (crossbar) energy for ISAAC.

In [None]:

DNNS = [
    "InceptionV3",   # Layers 000-094 [3]
    "GoogLeNet",     # Layers 000-057 [4]
    "ResNet18",      # Layers 000-020 [5]
    "ResNet50",      # Layers 000-053 [5]
    "MobileNetV2",   # Layers 000-052 [6]
    "shufflenet1_0", # Layers 000-056 [7]
    "qdqbert",       # Layers 000-071 [8][9]
]

N_LAYERS = {k: len(os.listdir(f"{thisdir}/dnn_layers/{k.lower()}")) for k in DNNS}

ARCHITECTURES = [
    "isaac_like",                      # 1st ablation in paper
    "raella_no_speculation_2b_cells",  # 2nd ablation in paper
    "raella_no_speculation",           # 3rd ablation in paper
    "raella",                          # RAELLA, 4th ablation in paper
]

# SET THE VALUES BELOW. SET THE VALUES BELOW. SET THE VALUES BELOW.
# SET THE VALUES BELOW. SET THE VALUES BELOW. SET THE VALUES BELOW.
# SET THE VALUES BELOW. SET THE VALUES BELOW. SET THE VALUES BELOW.
DNN_CHOICE = 'ResNet18' # Which DNN to use.
ARCHITECTURE_CHOICE = 'raella' # Which architecture to use.
LAYER_CHOICE = 10 # Which layer to use.
USE_TYPICAL_STATISTICS = False # Use typical statistics for the layer
                               # instead of recorded statistics.
N_TILES = 16
BATCH_SIZE = 32 # Set this too low and utilization may drop for the number of
                # tiles available.
# SET THE VALUES ABOVE. SET THE VALUES ABOVE. SET THE VALUES ABOVE.
# SET THE VALUES ABOVE. SET THE VALUES ABOVE. SET THE VALUES ABOVE.
# SET THE VALUES ABOVE. SET THE VALUES ABOVE. SET THE VALUES ABOVE.

assert DNN_CHOICE in DNNS
assert LAYER_CHOICE < N_LAYERS[DNN_CHOICE]
assert ARCHITECTURE_CHOICE in ARCHITECTURES

layer = f"{LAYER_CHOICE:03d}"


spec = get_spec(ARCHITECTURE_CHOICE, DNN_CHOICE, layer, 
                USE_TYPICAL_STATISTICS, N_TILES, BATCH_SIZE)

os.system(f"rm -rf {OUTPUT_STATS_PATH}")
proc = tl.mapper(
    spec, 
    RUN_DIRECTORY, 
    env, 
    dump_intermediate_to=f'{RUN_DIRECTORY}/input.yaml', 
    return_proc=True, 
    log_to=PIPE
)
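# Stream the mapper log until the process closes its output, then wait for it
# to exit so that no trailing lines are lost.
for line in proc.stdout:
    print(line.decode('utf-8'), end='')
proc.wait()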

## **Plotting Energy, Area, and Throughput Results**
Now that Timeloop has run, outputs will appear in RUN_DIRECTORY. Of interest
are timeloop_mapper.map.txt and timeloop_mapper.stats.txt. The former contains
the mapping, and the latter contains the performance and energy results.

The get_energy_cycles and get_area_from_art functions parse these files and
return the relevant results. Let's plot the energy and area of each component:

In [None]:
stats, cycles = get_energy_cycles(OUTPUT_STATS_PATH)
for_plot = prep_numbers_for_plot(stats)
for_plot = {k: v for k, v in for_plot.items() if v != 0}
area_for_plot = prep_numbers_for_plot(get_area_from_art(OUTPUT_ART_PATH))
area_for_plot = {k: v for k, v in area_for_plot.items() if v != 0}
n_tiles = spec.variables["TILES_PER_ARCH"]

plot_results(
    {f'{ARCHITECTURE_CHOICE} {DNN_CHOICE} {layer}': for_plot},
    {f'{ARCHITECTURE_CHOICE} {DNN_CHOICE} {layer}': area_for_plot},
    {f'{ARCHITECTURE_CHOICE} {DNN_CHOICE} {layer}': {'Cycles': cycles}},
)

print(f'Total cycles: {cycles}. Total energy: {sum(for_plot.values()) / 1e6:.2f} uJ.')

## **One-Layer Ablation Experiment**
We can also run ablations to tell how each of RAELLA's contributions impacts
the overall energy of the system. We'll run a one-layer ablation experiment
using our four architectures and show the results on one plot.

We can see that adding Center+Offset, the second bar, reduces the energy of nearly all components relative to the ISAAC-like design. It enables using larger crossbars, which complete more analog operations in parallel. This greater parallelism better amortizes data movement and ADC energy.

Adding in Adaptive Weight Slicing, the third bar, is usually beneficial to energy because it typically uses fewer slices, reducing the number of ADC converts required. It is not always beneficial, however: if it uses many slices (e.g., in the last layer of each DNN), there are energy penalties.

Adding in Speculation, the fourth bar, substantially reduces ADC energy, trading off increased energy in other components as well as reduced throughput. The increased energy and decreased throughput come from running speculation and recovery cycles, while the reduced ADC energy comes from not converting columns where speculation succeeded.

In [None]:
def run(arch, dnn, layer, desc_text=''):
    if isinstance(layer, list):
        total = {}
        per_layer_cycles = []
        for l in tqdm.tqdm(layer, desc=desc_text):
            e, a, c = run(arch, dnn, l)
            per_layer_cycles.append(c)
            for k, v in e.items():
                total[k] = total.get(k, 0) + v
        return total, a, per_layer_cycles
    if isinstance(layer, int):
        layer = f"{layer:03d}"
    
    spec = get_spec(arch, dnn, layer, USE_TYPICAL_STATISTICS, N_TILES, BATCH_SIZE)
    # Remove stale outputs so results from a previous run are never parsed.
    os.system(f'rm -f {OUTPUT_STATS_PATH} {OUTPUT_ART_PATH}')
    proc = tl.mapper(
        spec,
        RUN_DIRECTORY,
        env,
        dump_intermediate_to=f'{RUN_DIRECTORY}/input.yaml',
        return_proc=True,
        log_to='/dev/null',
    )
    proc.wait()  # Block until the mapper exits instead of busy-waiting.
    stats, cycles = get_energy_cycles(OUTPUT_STATS_PATH)
    energy_for_plot = prep_numbers_for_plot(stats)
    energy_for_plot = {k: v for k, v in energy_for_plot.items() if v != 0}
    art = get_area_from_art(OUTPUT_ART_PATH)
    area_for_plot = prep_numbers_for_plot(art)
    area_for_plot = {k: v for k, v in area_for_plot.items() if v != 0}
    cycles_for_plot = {'Cycles': cycles}
    return energy_for_plot, area_for_plot, cycles_for_plot

def ablation_triple_plot(dnn, layer):
    arch2energy = {}
    arch2area = {}
    arch2cycles = {}
    for arch in tqdm.tqdm(ARCHITECTURES, desc=f'Layer {layer} of {dnn}'):
        e, a, c = run(arch, dnn, layer)
        arch2energy[arch] = e
        arch2area[arch] = a
        arch2cycles[arch] = c
    plot_results(arch2energy, arch2area, arch2cycles)


ablation_triple_plot(DNN_CHOICE, LAYER_CHOICE)


## **Multi-Layer Ablation Experiment**
The performance of each architecture is impacted by the particular DNN layer.
The shape (e.g. number of input/output channels, batch size, etc.) of the layer
affects the utilization of the crossbars, the reuse opportunities in the input
and output buffers, and the number of cycles required to process the layer.

To get a better sense of the performance of each architecture, we can run a
similar ablation experiment over more layers. We'll run the first, middle, and
last layers of our DNN and plot the results.

We can see that the trends seen from the one-layer ablation experiment hold for
most layers, but there are exceptions. Note the third set of plots, which runs
the last layer of the DNN. In this case, Adaptive Weight Slicing is not
beneficial for energy. The last layer is generally much less energy intensive
than earlier layers, but it has an outsized effect on DNN accuracy, so Adaptive
Weight Slicing uses as many slices as possible in this layer. This is a
worthwhile tradeoff: a small energy cost protects accuracy where it matters
most. *Note: You will not see this effect if USE_TYPICAL_STATISTICS is set to
True. The last layer is atypical, which makes it extra interesting.*

In [None]:
ablation_triple_plot(DNN_CHOICE, 0)
ablation_triple_plot(DNN_CHOICE, N_LAYERS[DNN_CHOICE] // 2)
ablation_triple_plot(DNN_CHOICE, N_LAYERS[DNN_CHOICE] - 1)

## **Full-DNN Ablation Experiment**
Now we're ready to see the results for a full DNN. We'll run the same four
setups, this time summing the energy across all layers of the DNN.

Go grab a coffee. This one will take a while. It should take anywhere from 20
minutes to a few hours depending on the machine and the chosen DNN. ResNet18
is the fastest.

We first run all layers and collect results for the energy, area, and
throughput of each layer. Using the throughput information, we will perform
replication of layers to maximize throughput. We will then plot the energy,
area, and throughput of each architecture for the full DNN.

In [None]:
def run_full_dnn(arch, dnn, desc_text=''):
    layers = list(range(N_LAYERS[dnn]))
    results = run(arch, dnn, layers,  desc_text=desc_text)
    return results

def run_full_dnn_all_architectures(dnn):
    e_results = {}
    a_results = {}
    c_results = {}
    for i, arch in enumerate(ARCHITECTURES):
        t = f'Architecture {i+1}/{len(ARCHITECTURES)}'
        e, a, c = run_full_dnn(arch, dnn, desc_text=t)
        e_results[arch] = e
        a_results[arch] = a
        c_results[arch] = c
    return e_results, a_results, c_results

e_results, a_results, c_results = run_full_dnn_all_architectures(DNN_CHOICE)

## **Full-DNN Mapping with Energy, Area, And Throughput Results**
We can now plot the results. We'll plot the energy, area, and throughput of
each architecture for the full DNN. All tiles will run in a pipelined fashion,
sending data from one tile to the next as soon as it is ready.

We can increase throughput by replicating layers. If a layer has low
throughput, it is replicated: its weights are copied to multiple tiles. These
tiles run in parallel, increasing throughput.

Below, we first adjust the number of tiles of each architecture to normalize
area and provide a fair comparison. Then, we perform replication of layer
weights to maximize throughput for each architecture. Replication follows a
greedy scheme; while there are tiles left, the tiles allocated to the
lowest-throughput layer are doubled.

Throughput values may differ from the RAELLA paper. The reasons are:
- Throughput depends on the number of tiles, which is different here than in
  the paper.
- All mappings in the paper were manually checked to ensure good throughput for
  each layer. Here, we use a greedy mapper and quick mapping, so this may or
  may not be the case.
- For the paper, the mapper ran for several thread-weeks per DNN/architecture
  combination. Here, each DNN/architecture combination is run for a few
  thread-hours at most. This is enough to get a good mapping, but not
  necessarily the best mapping.
- The paper uses a different replication scheme. Here, the tiles of the
  lowest-throughput layer are doubled. In the paper, the tiles are incremented.
  This is slower to compute, but resulted in better throughput.
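
The difference between the two replication schemes can be seen in a small standalone sketch. This is not the paper's code: the "increment" variant is one reading of the description above, throughput is assumed to scale linearly with the number of copies of a layer, and all names and numbers are hypothetical.

```python
def replicate(base_rates, base_tiles, total_tiles, scheme="double"):
    """Greedily hand spare tiles to the slowest layer of a pipeline.

    base_rates: throughput of each layer with a single copy of its weights.
    base_tiles: tiles consumed by one copy of any layer.
    scheme: "double" doubles the slowest layer's copies each step;
            "increment" adds one copy each step.
    Returns (copies per layer, pipeline throughput).
    """
    if scheme not in ("double", "increment"):
        raise ValueError(f"unknown scheme {scheme!r}")
    copies = [1] * len(base_rates)
    used = base_tiles * len(copies)

    def rate(i):
        # Assumption: throughput grows linearly with the number of copies.
        return base_rates[i] * copies[i]

    while True:
        slowest = min(range(len(copies)), key=rate)
        extra = copies[slowest] if scheme == "double" else 1
        if used + extra * base_tiles > total_tiles:
            break  # not enough tiles left for another step
        copies[slowest] += extra
        used += extra * base_tiles
    # The slowest layer limits the throughput of the whole pipeline.
    return copies, min(rate(i) for i in range(len(copies)))


for scheme in ("double", "increment"):
    print(scheme, replicate([1.0, 4.0], base_tiles=1, total_tiles=7, scheme=scheme))
```

In this toy case, doubling stops early because the next doubling does not fit in the remaining tiles, while incrementing keeps using the leftover tiles one copy at a time.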

In [None]:
# The granularity of our throughput estimation is limited by N_TILES, as no
# layer can receive fewer than N_TILES tiles. However, we can't make N_TILES
# too small or some layers may not be able to be mapped. In the paper, we solve
# this by calculating the minimal tiles for each layer individually and use
# that as a start point. Here, we can do a few things:
# 1. Use a big N_TILES to ensure that all layers can be mapped.
# 2. Use a big batch size such that when layers don't fill up the larger
#    N_TILES, they can be automatically replicated to use the extra tiles.
# 3. Use a really big AGGREGATE_TILES so that the effects of the large N_TILES
#    and coarse-grained pipeline do not affect throughput too much.
# 4. Pay attention to the normalized throughput, not absolute, because the
#    absolute throughput will be very high with this large number of tiles.
AGGREGATE_TILES = 1024 * 64

def get_throughput_adj_area(e_results, c_results, a_results, AGGREGATE_TILES):
    throughput = {}
    tiles = {}

    # Normalize the number of tiles
    first_arch = list(e_results.keys())[0]
    area_base = sum(a_results[first_arch].values())

    # Replicate layers to maximize throughput
    for arch in e_results:
        cycles = c_results[arch]
        area = a_results[arch]
        # Normalize area
        n_tiles_total = AGGREGATE_TILES // (sum(area.values()) / area_base)
        tiles[arch] = n_tiles_total

        # Replicate layers to maximize throughput
        layers = list([N_TILES, 1 / c['Cycles'] * 1e9] for c in cycles)
        while sum(l[0] for l in layers) < n_tiles_total:
            used_tiles = sum(l[0] for l in layers)
            slowest_layer = min(layers, key=lambda l: l[1])
            if used_tiles + slowest_layer[0] > n_tiles_total:
                break
            slowest_layer[0] *= 2
            slowest_layer[1] *= 2
        
        slowest_layer = min(layers, key=lambda l: l[1])
        throughput[arch] = {'Throughput': slowest_layer[1]}
        print(
            f'\t{arch} allocated {sum(l[0] for l in layers)} '
            f'tiles with {slowest_layer[1]} batches/s'
        )
        # for i, l in enumerate(layers):
        #     print(f'\t\tLayer {i}: {l[0]} tiles, {l[1]:.2f} batches/s')
        
    area = {a: 
        {k: v * tiles[a] / N_TILES for k, v in a_results[a].items()} 
        for a in e_results
    }
        
    return throughput, area

throughput, area = get_throughput_adj_area(
    e_results, c_results, a_results, AGGREGATE_TILES)
print(f'\nDNN: {DNN_CHOICE}')
plot_results(e_results, area, throughput=throughput, energy_mj=True)


## **Tips for Extending the Model**
This concludes the tutorial. The following two sections give some tips and
information for running your own experiments with these models.

### Interpreting the Stats File
When running your own models, you'll likely need to read the Timeloop output
stats file (timeloop_mapper.stats.txt). This file contains the energy and
latency results for the model.

RAELLA and ISAAC both run with 1ns cycles, so the reported number of cycles is
equal to the time per layer in nanoseconds. Note that this model does NOT
report the latency, only the throughput. Latency would include additional
factors such as initial buffer fills and network latency, which are not modeled.
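
As a quick sanity check on units, the conversion from cycles to throughput can be written out as follows. This sketch assumes the reported cycle count covers one full batch, which matches how the notebook reports batches/s; the helper names are hypothetical.

```python
CYCLE_SECONDS = 1e-9  # 1 ns per cycle, as stated above


def batches_per_second(cycles):
    """Throughput of one layer, in batches per second."""
    return 1.0 / (cycles * CYCLE_SECONDS)


def samples_per_second(cycles, batch_size):
    """Throughput of one layer, in samples per second."""
    return batch_size * batches_per_second(cycles)


# Hypothetical layer: 2,000,000 cycles for a batch of 32 samples.
print(batches_per_second(2_000_000))
print(samples_per_second(2_000_000, batch_size=32))
```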

The aggregate energy of each component can be found by multiplying its energy
per op by the number of computes.

Note that the Timeloop reported number of computes is NOT the number of
multiply-accumulates. This is because these models record each input, weight,
and output bit combination as a different "op". Actual multiply-accumulates are
equal to: (number of computes) / (number of input bits) / (number of weight
bits) / (number of output bits).

In addition to the relation between computes and number of bits of each
operand, special considerations need to be made for RAELLA. In the simulation
for RAELLA's adaptive weight slicing, we change the number of weight bits that
the model sees on a per-layer basis (NOTE: Actual number of weight bits is
unchanged. This is just how we implemented the model for Timeloop. All layers
of all provided models use 8b weights.). For each layer, we need to look at the
number of weight bits provided to the model and divide by that value to get the
number of multiply-accumulates.
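
The two relations above (aggregate energy, and computes versus multiply-accumulates) can be written as a short sketch. The function names and the numbers in the example are hypothetical; for RAELLA, `weight_bits` should be the per-layer number of weight bits provided to the model, not necessarily 8.

```python
def aggregate_energy(energy_per_op, computes):
    """Total energy of a component: energy per op times number of computes."""
    return energy_per_op * computes


def macs_from_computes(computes, input_bits, weight_bits, output_bits):
    """Convert Timeloop's reported computes into multiply-accumulates.

    Each (input bit, weight bit, output bit) combination is counted as one
    "op", so the compute count is divided by all three bit counts.
    """
    return computes / input_bits / weight_bits / output_bits


# Hypothetical layer reported with 8b inputs, 8b weights, and 8b outputs.
print(macs_from_computes(512_000, input_bits=8, weight_bits=8, output_bits=8))
# Hypothetical layer where the model was given 6 weight bits for this layer.
print(macs_from_computes(384_000, input_bits=8, weight_bits=6, output_bits=8))
```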

### Tips for Running Your Own Experiments
Use the following tips to run your own experiments:
- To speed up experiments, you can parallelize Timeloop calls across DNNs and
  layers. If you do this, ensure that each experiment has its own
  RUN_DIRECTORY. Additionally, some Accelergy plug-ins may crash when run in
  parallel. To avoid this, spawn processes with a few seconds of delay between
  them, and retry a process a few times if it crashes.
- Use the GreedyMapper wherever possible; in this notebook, it reduces the
  mapping search space by eight orders of magnitude or more.
- Read & understand the interactions between the architecture, variables, and
  macro YAML files.
- If Timeloop cannot find a mapping, or it finds mappings that are weird /
  obviously suboptimal, then the problem is likely one of the following:
  - The mapper timed out before finding a good mapping: To fix this, it is
    most effective to constrain the mapping or use the GreedyMapper. Either of
    these options can reduce the search space by orders of magnitude.
    Additionally, you may increase the timeout and victory condition in the
    mapper YAML file to run longer.
  - There may be insufficient tiles to map the DNN layer: To fix this,
    increase the number of tiles in the architecture YAML file.
  - Something is wrong with the architecture: To fix this, enable diagnostics
    in the mapping YAML file to see why mapping is failing. I often use the
    following procedure:
    - Make all the buffers gigantic (depth = big number). Make all the fanouts
      gigantic (meshX = big number). The architecture should now work.
    - One at a time, return a fanout / buffer to its original value. If a
      certain buffer or fanout breaks the architecture, then that buffer or
      fanout is likely the problem.
  - If layers can't fit in your number of tiles, you may increase N_TILES
    to increase the width of the pipeline. Ensure that the batch size is large
    enough to fill the pipeline for each layer, or else one layer can artificially
    limit the throughput of the entire DNN.
- If the inputs to a DNN layer are signed, ISAAC's input activation energy will
  go up significantly (ISAAC will cast the inputs to unsigned numbers, leading
  to a large number of 1 bits in inputs), while RAELLA will incur double the
  cycles and ADC conversions (from splitting positive/negative inputs into
  separate slices). Ensure your models take this into account.
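
The parallelization tip above (separate run directories, staggered starts, and retries) could be sketched as follows. This is an illustration only: `job` is a hypothetical callable of the form `job(task, run_dir)` that you would write to build a specification and call the mapper inside `run_dir`. Threads are used on the assumption that each mapper call runs as its own OS process, so the Python threads mostly wait.

```python
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_retries(job, task, run_dir, retries=3, backoff_s=2.0):
    """Call job(task, run_dir), retrying a few times if it raises."""
    last_error = None
    for attempt in range(retries):
        try:
            return job(task, run_dir)
        except Exception as err:  # e.g. a plug-in that crashed
            last_error = err
            time.sleep(backoff_s * (attempt + 1))
    raise RuntimeError(f"{task!r} failed after {retries} attempts") from last_error


def run_all(job, tasks, workers=4, stagger_s=3.0, retries=3, backoff_s=2.0):
    """Run job over tasks in parallel, one private run directory per task."""
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i, task in enumerate(tasks):
            if i > 0:
                time.sleep(stagger_s)  # space out process start-up
            run_dir = tempfile.mkdtemp(prefix=f"rundir_{i:03d}_")
            futures.append(
                pool.submit(run_with_retries, job, task, run_dir, retries, backoff_s)
            )
        # Results come back in the same order as the tasks.
        return [f.result() for f in futures]
```

A task here might be an `(architecture, dnn, layer)` tuple. Note that the stagger only spaces out submissions; once all workers are busy, later jobs start whenever a worker frees up.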

## **Glossary**
Some vocabulary here differs from the RAELLA paper to better align with the
current vocabulary used by the research community. The following is a glossary
of terms used in this model:
- PIM/CiM: Processing-In-Memory / Compute-In-Memory. Used interchangeably here.
- Tile: An architectural tile comprising multiple PIM macros plus buffers,
  networks, and quantization units.
- Macro: Minimal architecture required to perform operations. In the RAELLA and
  ISAAC papers, this is an IMA.
- Crossbar: A grid of horizontally/vertically connected memory cells that
  perform in-memory operations.
- CiM unit: The smallest architectural unit available for mapping. In RAELLA,
  this is a 2T2R device. In ISAAC, this is a ReRAM device.
- Memory Cell: The smallest unit of memory. In RAELLA, this is a ReRAM device.
  In ISAAC, this is a ReRAM (1R) device. Note that a CiM unit may be comprised
  of multiple memory cells.

The following is an explanation of each dimension that may appear in the
problem, mapping, architecture, and stats files. The problem dimensions refer
to dimensions of the input, output, and weight tensors that are used in each
DNN layer:
- P: The width of the output tensor.
- Q: The height of the output tensor.
- R: The width of the weight filter. 1 for fully-connected layers.
- S: The height of the weight filter. 1 for fully-connected layers.
- C: The number of input channels.
- M: The number of output channels.
- G: The number of groups. 1 for non-grouped layers.
- N: The batch size.
- I: The number of input bits (resolution of inputs).
- L: The number of weight bits (resolution of weights).
- T: The number of output bits (resolution of outputs). Note that psum
  resolution is up to 2*T before quantization.
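
As a rough illustration of how the tensor dimensions combine, the sketch below counts weights, outputs, and multiply-accumulates for a layer. It assumes that C and M count channels per group, which may not match every layer file; the function name and example shape are hypothetical.

```python
def layer_counts(P, Q, R, S, C, M, G=1, N=1):
    """Count weights, outputs, and multiply-accumulates for one layer.

    Assumption: C and M are per-group channel counts, so a grouped layer
    has G independent (C -> M) convolutions.
    """
    weights = G * M * C * R * S
    outputs = N * G * M * P * Q
    # Every output element accumulates C * R * S products.
    macs = outputs * C * R * S
    return {"weights": weights, "outputs": outputs, "macs": macs}


# Hypothetical 3x3 convolution: 64 -> 64 channels, 56x56 outputs, batch of 1.
print(layer_counts(P=56, Q=56, R=3, S=3, C=64, M=64))
```

The bit dimensions (I, L, T) are left out here because they describe operand resolution rather than tensor shape.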

## **Contact**
If you have any questions, comments, or concerns, please contact Tanner
Andrulis at andrulis@mit.edu. I'm happy to help with anything related to
RAELLA, Accelergy, or PIM modeling in Timeloop.

## **References**
1. Tanner Andrulis, Joel S. Emer, and Vivienne Sze. 2023. RAELLA: Reforming the
Arithmetic for Efficient, Low-Resolution, and Low-Loss Analog PIM: No
Retraining Required! In Proceedings of the 50th Annual International Symposium
on Computer Architecture (ISCA '23). Association for Computing Machinery, New
York, NY, USA, Article 27, 1–16. https://doi.org/10.1145/3579371.3589062
2. Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian,
John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar, "ISAAC:
A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in
Crossbars," 2016 ACM/IEEE 43rd Annual International Symposium on Computer
Architecture (ISCA), Seoul, Korea (South), 2016, pp. 14-26, doi:
10.1109/ISCA.2016.124
3. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew
Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In 2016
IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2818–2826.
https://doi.org/10.1109/CVPR.2016.305
4. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich.
2015. Going deeper with convolutions. In 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). 1–9.
https://doi.org/10.1109/CVPR.2015.7298594
5. Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual
Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) (2016), 770–778.
6. Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh
Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. 4510–4520.
https://doi.org/10.1109/CVPR.2018.00474
7. Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. 2018. ShuffleNet
V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings
of the European Conference on Computer Vision (ECCV).
8. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you
Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von
Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett
(Eds.), Vol. 30. Curran Associates, Inc.
https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
9. Wu, H., Judd, P., Zhang, X., Isaev, M. and Micikevicius, P., 2020. Integer
quantization for deep learning inference: Principles and empirical evaluation.
arXiv preprint arXiv:2004.09602. https://arxiv.org/abs/2004.09602
