### MHA2SHA

This notebook shows how to use MHA2SHA at a high level. At its core, MHA2SHA converts Multi-Head Attention (MHA) ops to Single-Head Attention (SHA) ops and propagates AIMET encodings to the newly converted SHA model. MHA2SHA can be used from Python or from the command line. This notebook covers both where applicable.

#### Overall Notebook Flow
This notebook covers the following:

1. MHA2SHA setup
2. Flag/Arg information

#### Assumptions
We will use a LLaMAv2 model that is an artifact of the AI Research notebooks. This model includes changes that are not standard in a LLaMAv2 model taken straight from Hugging Face.

---

#### 1. MHA2SHA Setup

To set up MHA2SHA, run the environment setup script `env_setup.sh`, which sets the required paths. The command below shows how to run it from the top level of MHA2SHA.

```bash
source bin/env_setup.sh
```

In [None]:
# If you are in the root of MHA2SHA you can run this cell
!source bin/env_setup.sh
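
In a notebook, `!` runs its command in a temporary subshell, so variables exported by a sourced script do not carry over to the kernel. Below is a minimal sketch of one workaround, assuming that `env_setup.sh` only exports environment variables and that `bash` and `env -0` are available on the machine. `load_env_from_script` is a hypothetical helper written for this illustration; it is not part of MHA2SHA.

```python
import subprocess


def load_env_from_script(script_path, shell="bash"):
    """Source a shell script in a child shell and return the resulting environment.

    Hypothetical helper for illustration only; not part of MHA2SHA.
    """
    # The script path is passed as $1 to avoid quoting problems.
    # `env -0` prints NUL-separated entries so values containing newlines survive.
    result = subprocess.run(
        [shell, "-c", 'source "$1" >/dev/null && env -0', "_", script_path],
        check=True,
        capture_output=True,
    )
    env = {}
    for entry in result.stdout.decode(errors="replace").split("\0"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            env[key] = value
    return env
```

From the root of MHA2SHA, `os.environ.update(load_env_from_script("bin/env_setup.sh"))` would then apply the variables to the running kernel (after `import os`). Changes to `PYTHONPATH` made this way do not alter `sys.path` of an interpreter that is already running.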

---

#### 2. Flag/Arg Information

At a minimum, MHA2SHA needs the following flags/args: `--model-name` and `--exported-model-path` (`model_or_path` in Python). `--exported-model-encoding-path` is not mandatory, as some models may not have encodings.

Apart from the minimum, MHA2SHA has many flags to cover the different variants of LLM/LVM models. The full list is in the README. To make this easier, we provide a higher-level flag, `--base-llm`. This flag maps an LLM/LVM's base architecture to the flags MHA2SHA needs for it. For example, `--base-llm llama2` implicitly turns on flags such as `--handle-rope-ops`, `--llm-model`, etc. If a user provides a flag that contradicts a default for the chosen base architecture, we warn the user but use the explicitly provided flag. For a list of supported `--base-llm` architectures, see the README.
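
The override behavior described above can be pictured with a small sketch. This is not MHA2SHA's actual implementation: the preset table, the function name `resolve_flags`, and the exact defaults for `llama2` are assumptions made for illustration, based only on the two flags named in the text.

```python
import warnings

# Hypothetical preset table. The real mapping lives inside MHA2SHA and
# may contain different or additional flags.
BASE_LLM_DEFAULTS = {
    "llama2": {"handle_rope_ops": True, "llm_model": True},
}


def resolve_flags(base_llm, explicit_flags):
    """Merge architecture defaults with user flags; explicit flags win."""
    defaults = BASE_LLM_DEFAULTS[base_llm]
    resolved = dict(defaults)
    for name, value in explicit_flags.items():
        # Warn only when the user contradicts a default for this architecture.
        if name in defaults and defaults[name] != value:
            warnings.warn(
                f"'{name}={value}' contradicts the '{base_llm}' default "
                f"({defaults[name]}); using the explicitly provided value."
            )
        resolved[name] = value
    return resolved


# The explicit False overrides the assumed llama2 default and emits a warning.
flags = resolve_flags("llama2", {"handle_rope_ops": False, "mha_conv": True})
```

The design choice shown here is that defaults are a starting point rather than a constraint: a contradiction produces a warning, not an error.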

The following shows how to run a LLaMAv2 model from the command line and from Python.

##### Command Line

The command line variant of MHA2SHA is easy to use. The example below shows how to run it.

```bash
# --mha-conv and --nchw-aligned are necessary if the model is an artifact from the AI Research notebooks
mha2sha-onnx-converter \
--model-name example \
--sha-export-path [PATH-TO-EXPORT] \
--exported-model-path [PATH-TO-MODEL] \
--exported-model-encoding-path [PATH-TO-ENCODINGS] \
--base-llm llama2 \
--mha-conv \
--nchw-aligned
```

##### Python

Running MHA2SHA from Python is similar to the command line. The difference is that the `__init__` of `MHA2SHAConverter` takes positional arguments for the minimum arguments listed above, followed by keyword arguments for all additional flags. It may be easier to build a dictionary of the flags needed and unpack it in the `__init__` call.

The Python API converts the model and returns the converted model along with a verification status. The verification status is `True` if the original MHA model's logits match the converted SHA model's logits; otherwise it is `False`. Please see `--no-verification` in the README for more information.

In [None]:
from mha2sha.converter import MHA2SHAConverter

flags = {
    "exported_model_encoding_path": "path/to/encodings",
    "base_llm": "llama2",
    "mha_conv": True, # These are flags necessary if the model is an artifact from AI Research notebooks
    "nchw_aligned": True # These are flags necessary if the model is an artifact from AI Research notebooks
}

converter = MHA2SHAConverter(
    model_name="example",
    sha_export_path="path/to/export",
    model_or_path="path/to/model",
    **flags
)
sha_model, verification_status = converter.convert()
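
The verification status is described above as a comparison between the logits of the MHA model and those of the SHA model. The sketch below shows what such a check could look like on flat lists of floats. The function name `logits_match` and the tolerance values are assumptions for illustration; MHA2SHA's own comparison and thresholds may differ.

```python
import math


def logits_match(mha_logits, sha_logits, rel_tol=1e-5, abs_tol=1e-4):
    """Return True if two flat logit sequences agree within tolerance.

    Illustrative only; the tolerances are assumed, not taken from MHA2SHA.
    """
    if len(mha_logits) != len(sha_logits):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(mha_logits, sha_logits)
    )


# Tiny numerical drift is accepted, a real mismatch is not.
close_enough = logits_match([0.10, 2.50, -1.30], [0.10, 2.50001, -1.30])
mismatch = logits_match([0.10, 2.50, -1.30], [0.10, 2.90, -1.30])
```

A tolerance-based comparison is used instead of exact equality because splitting attention heads can reorder floating-point operations and change the low-order bits of the result.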

---

#### Summary

Hopefully this notebook was useful in understanding how to integrate MHA2SHA into your pipeline.

For additional resources:
- View the README
- Contact the contributors listed in the README