## Tradeoffs in Memory Size vs. Accesses to Parent Memories

This notebook shows how to analyze the tradeoff between memory size and accesses to
parent memories. The generated curve shows, for a given memory size, a lower bound on
the number of accesses to parent memories.

This analysis is called Orojenesis. It was introduced in "Mind the Gap: Attainable
Data Movement and Operational Intensity Bounds for Tensor Algorithms" by Qijing Huang,
Po-An Tsai, Joel S. Emer, and Angshuman Parashar.

The analysis proceeds in three steps:
- Set up a simple architecture with a main memory and a global buffer.
- Tell the mapper to optimize for both main memory accesses and global buffer usage.
- Observe the resulting Pareto frontier between global buffer usage (size) and the
  minimum number of accesses to main memory.
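
To make the last step concrete, here is a minimal sketch of how a Pareto frontier can be extracted from a set of candidate mappings. It is independent of AccelForge: the function name `pareto_frontier` and the sample numbers are hypothetical and chosen only for illustration.

```python
def pareto_frontier(points):
    """Return the non-dominated (buffer_size, accesses) pairs, sorted by size."""
    frontier = []
    best_accesses = float("inf")
    # Sorting by size (then by accesses) visits smaller buffers first.
    for size, accesses in sorted(points):
        # Keep a point only if it needs strictly fewer accesses than every
        # mapping that fits in an equal or smaller buffer.
        if accesses < best_accesses:
            frontier.append((size, accesses))
            best_accesses = accesses
    return frontier


# Hypothetical (buffer size in bits, main-memory accesses in bits) pairs.
candidate_mappings = [
    (1024, 9000),
    (1024, 12000),
    (2048, 9500),
    (4096, 4000),
    (8192, 4000),
    (16384, 2500),
]

for size, accesses in pareto_frontier(candidate_mappings):
    print(f"buffer={size} bits -> at least {accesses} bits of main-memory traffic")
```

Each surviving point says that, among the candidates considered, no mapping that fits in that much buffer space moves less data.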

To this end, we have an "orojenesis" architecture in the examples/arches directory.
Let's take a look at it:


In [None]:
from IPython.display import Markdown, display
import accelforge as af


display(Markdown(f"""
``` yaml
{open(af.examples.arches.oroejenesis).read()}
```
"""))


First import AccelForge and initialize a Spec. We'll use a 4096x4096x4096 matrix
multiply as our workload.

In [None]:
# import accelforge as af
# from pathlib import Path

# examples_dir = Path("../../examples")

# spec = af.Spec.from_yaml(
#     af.examples.arches.oroejenesis,
#     af.examples.workloads.matmuls,
#     jinja_parse_data={"N_EINSUMS": 1, "M": 4096, "KN": 4096},
# )
# spec.mapper.metrics = af.Metrics.ENERGY | af.Metrics.RESOURCE_USAGE

In [None]:
# # af.set_n_parallel_jobs(1)
# results = spec.map_workload_to_arch()
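
Before plotting, it can help to estimate the two extremes that one would expect the curve to lie between for a 4096x4096x4096 matrix multiply. The sketch below is plain arithmetic and does not use AccelForge. The element width `BITS_PER_ELEMENT` is an assumption (the workload file may use a different precision), and the "no reuse" figure is a simplified model in which every multiply-accumulate fetches both operands from main memory and each output is written once.

```python
# Problem dimensions taken from the text: a 4096x4096x4096 matrix multiply.
M = K = N = 4096

# Assumption for illustration only; check the workload for the real precision.
BITS_PER_ELEMENT = 8

a_elements = M * K  # input matrix A
b_elements = K * N  # input matrix B
c_elements = M * N  # output matrix C

# Compulsory traffic: read A and B once and write C once. This is what a
# buffer large enough to hold every tensor could achieve.
compulsory_bits = (a_elements + b_elements + c_elements) * BITS_PER_ELEMENT

# Simplified no-reuse model: each multiply-accumulate reads one element of A
# and one of B from main memory, and each output is written once.
macs = M * K * N
no_reuse_bits = (2 * macs + c_elements) * BITS_PER_ELEMENT

print(f"Compulsory traffic: {compulsory_bits:,} bits")
print(f"No-reuse traffic:   {no_reuse_bits:,} bits")
print(f"Ratio:              {no_reuse_bits / compulsory_bits:,.1f}x")
```

The gap between these two numbers is the range that buffer capacity trades against, which is why the plot uses logarithmic axes.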

Let's plot the results.

In [None]:
# import matplotlib.pyplot as plt

# results.data.sort_values("Total<SEP>energy", ascending=True, inplace=True)
# plt.plot(
#     [x * spec.arch.find("GlobalBuffer").size for x in results.resource_usage()["GlobalBuffer"]],
#     results.energy()
# )
# plt.xlabel("Global Buffer Size (bits)")
# plt.ylabel("Lowest-Attainable DRAM Accesses (bits)")
# plt.xscale("log")
# plt.yscale("log")
# plt.show()
# # Plotting runoff to the right
# # arxiv and let timeloop team know
# # restricted imperfect factorization for spatial fanouts

Let's also compare how fusion affects this curve. We'll use the same architecture, but
this time with a larger workload.

In [None]:
import accelforge as af
from pathlib import Path

examples_dir = Path("../../examples")

spec = af.Spec.from_yaml(
    af.examples.arches.oroejenesis,
    af.examples.workloads.gpt3_6_7B,
    # af.examples.workloads.three_matmuls_annotated,
    jinja_parse_data={"N_TOKENS": 8192}
)
spec.mapper.metrics = af.Metrics.ENERGY | af.Metrics.RESOURCE_USAGE
# spec.mapper.max_pmapping_templates_per_einsum = 8


# FUSED
af.set_n_parallel_jobs(1)
spec.arch.find("MainMemory").tensors.keep = "~Intermediates"
spec.arch.find("MainMemory").tensors.may_keep = "All"
results_fused = spec.map_workload_to_arch(einsum_names=["Q"])


# UNFUSED
spec.arch.find("MainMemory").tensors.keep = "All"
results_unfused = spec.map_workload_to_arch(einsum_names=["Q"])  # einsum_names=["K", "QK"]

# BUG C: Initial stride appearing in the model output when stride == initial
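
The cell above appears to distinguish the two cases by whether intermediate tensors must be kept in `MainMemory`. As a rough illustration of why that matters, the sketch below counts compulsory main-memory traffic for two chained matrix multiplies, `C = A @ B` followed by `E = C @ D`. The function name `compulsory_elements` and the shapes are hypothetical and are not taken from the GPT-3 workload; the model assumes the buffer is large enough that every tensor is moved only once.

```python
def compulsory_elements(m, k, n, p, fused):
    """Main-memory traffic, in elements, for C = A @ B followed by E = C @ D.

    A is m x k, B is k x n, C is m x n, D is n x p, and E is m x p.
    """
    a, b, c, d, e = m * k, k * n, m * n, n * p, m * p
    # Inputs A, B, D are read once and the final output E is written once.
    traffic = a + b + d + e
    if not fused:
        # Without fusion, the intermediate C is written to main memory by
        # the first matmul and read back by the second.
        traffic += 2 * c
    return traffic


# Hypothetical shapes, chosen only to show the effect.
shape = dict(m=8192, k=4096, n=4096, p=4096)
fused = compulsory_elements(**shape, fused=True)
unfused = compulsory_elements(**shape, fused=False)
print(f"fused:   {fused:,} elements")
print(f"unfused: {unfused:,} elements")
print(f"saved by fusion: {unfused - fused:,} elements")
```

In practice fusion also costs buffer capacity, since the intermediate (or a tile of it) has to stay on chip, which is the tradeoff the two curves make visible.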

In [None]:
import matplotlib.pyplot as plt

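# Scale resource_usage() by the buffer's size so the x-axis is in bits,
# matching the label below and the single-matmul plot above.
global_buffer_size = spec.arch.find("GlobalBuffer").size

results_fused.data.sort_values("Total<SEP>energy", ascending=True, inplace=True)
plt.plot(
    [x * global_buffer_size for x in results_fused.resource_usage()["GlobalBuffer"]],
    results_fused.energy(),
    label="Fused",
)

results_unfused.data.sort_values("Total<SEP>energy", ascending=True, inplace=True)
plt.plot(
    [x * global_buffer_size for x in results_unfused.resource_usage()["GlobalBuffer"]],
    results_unfused.energy(),
    label="Unfused",
)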
plt.xlabel("Global Buffer Size (bits)")
plt.ylabel("Lowest-Attainable DRAM Accesses (bits)")
plt.xscale("log")
plt.yscale("log")
plt.legend()
plt.show()
# results_fused.resource_usage("GlobalBuffer")
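
To compare the two curves numerically rather than by eye, one option is to treat each frontier as a step function and look up the lowest attainable accesses at a given buffer size. The sketch below uses only the standard library. The helper `min_accesses_at` and the sample frontiers are hypothetical; real values would come from the mapper results plotted above.

```python
from bisect import bisect_right


def min_accesses_at(frontier, buffer_size):
    """Look up a Pareto frontier at a given buffer size.

    `frontier` is a list of (size, accesses) pairs sorted by increasing size
    with decreasing accesses. Returns the accesses of the largest point that
    fits in `buffer_size`, or None if no point fits.
    """
    sizes = [size for size, _ in frontier]
    index = bisect_right(sizes, buffer_size)
    if index == 0:
        return None
    return frontier[index - 1][1]


# Hypothetical frontiers as (buffer size in bits, accesses in bits).
fused_frontier = [(1024, 8000), (4096, 3000), (16384, 1000)]
unfused_frontier = [(1024, 9000), (4096, 5000), (16384, 2500)]

for buffer_size in (2048, 5000, 20000):
    fused = min_accesses_at(fused_frontier, buffer_size)
    unfused = min_accesses_at(unfused_frontier, buffer_size)
    print(f"{buffer_size} bits: unfused/fused = {unfused / fused:.2f}")
```

A step-function lookup is used instead of interpolation because a point between two frontier entries is not itself an attainable mapping.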
