# QNN Model Prepare on Linux

The Qualcomm AI Engine Direct SDK allows clients to run ML models on HTP hardware. The following steps describe how to prepare the LLaMa2 models on a Linux host so that they can run on a Windows target.

Before continuing, ensure all steps from [README](README.md) are completed.

This document uses the terms Qualcomm Neural Network (QNN) and Qualcomm AI Engine Direct SDK interchangeably.

# Prerequisites

1. Qualcomm AI Engine Direct SDK (with Ubuntu Linux support)
2. Ubuntu 22.04 installation with required packages for QNN Tools
3. This notebook can be executed with Anaconda (with the supplied environment.yaml) or a virtual environment (venv)
4. LLaMA `.onnx` files and their corresponding AIMET encodings and safetensors (generated via AIMET workflow in example1)

This workflow assumes that you have generated the model artifacts following the AIMET LLaMa workflow (example1):

-   LLaMa2 model and its AIMET encodings
-   `.pkl` file per network - a numpy object array saved as a Python pickle that contains data that is required as part of the model conversion step.

![dir_struct](../jupyter_notebook_assets/nb1_output_dir_contents.png "Overall directory Structure from notebook 1") ![onnx_dir_struct](../jupyter_notebook_assets/onnx_dir_structure.png "Snapshot of file contents of onnx folder from notebook 1")

# Workflow

All the models and encodings are processed independently via different executable QNN utilities available in the Qualcomm AI Engine Direct SDK.

The QNN executable utilities require an Ubuntu 22.04 environment. Preparing the LLaMa2 models for inference involves the following steps:

1. Split the ONNX model into several smaller ONNX models.
2. Apply the MHA2SHA transformation to convert all attention block MHAs to SHAs.
3. Convert the `.onnx` files to their equivalent QNN representation.
4. Generate the QNN model quantized libraries.
5. Import adapters and safetensors for the DLC format.
6. Generate the QNN context binaries for the QNN HTP backend.

After preparing the LLaMa2 models for inference, the next step is to execute the QNN context binaries for inference on a Snapdragon Windows device.

![QNN Work flow](../jupyter_notebook_assets/qnn-lora-v3-workflow.png)
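
Each stage in the workflow writes its output to a predictable location under the assets directory. The following is a minimal sketch of that layout, based on the paths used by the cells later in this notebook. The helper name `stage_paths` and its dictionary keys are illustrative and not part of the SDK or of this notebook.

```python
from pathlib import Path


def stage_paths(assets_dir, arn, cl, split, num_splits):
    """Map each preparation stage to the file it produces for one (AR, CL, split) task."""
    artifacts = Path(assets_dir) / f"ar{arn}_cl{cl}"
    name = f"ar{arn}_cl{cl}_{split}_of_{num_splits}"
    split_dir = artifacts / f"{split}_of_{num_splits}"
    return {
        "split_onnx": artifacts / "split_onnx" / f"{name}.onnx",
        "sha_onnx": split_dir / "sha_output" / f"{name}.onnx",
        "converted_dlc": split_dir / "converted_model" / f"{name}.dlc",
        "quantized_dlc": split_dir / "compiled_model" / f"{name}.dlc",
    }


# Print the layout for one example task: AR128, context length 4096, split 1 of 4.
for stage, path in stage_paths("assets", 128, 4096, 1, 4).items():
    print(f"{stage:>14}: {path}")
```

Knowing this layout makes it easier to inspect intermediate results or to resume from a later stage after a failure.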


## Set up the Qualcomm AI Engine Direct SDK

The following steps configure the Qualcomm AI Engine Direct SDK, which enables running LLaMA on the device.
Execute the following on an Ubuntu 22.04 terminal.

**NOTE:** These steps require sudo or root privileges.

1. After setting up Python and pip in Ubuntu, check QNN tool dependencies.
2. Set the `QNN_SDK_ROOT` environment variable to the location of the Qualcomm AI Engine Direct SDK. For **Linux**, `export QNN_SDK_ROOT="./assets/qnn"`
3. Check and install Linux dependencies.

    ```
    source $QNN_SDK_ROOT/bin/check-linux-dependency.sh
    sudo apt-get install -y libtinfo5
    ```
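
Before running the rest of the notebook, it can help to confirm that `QNN_SDK_ROOT` points at a usable SDK. The sketch below only checks for the two subdirectories that this notebook later adds to `PATH` and `LD_LIBRARY_PATH`. The helper name `missing_sdk_dirs` is hypothetical, and the check is not a substitute for `check-linux-dependency.sh`.

```python
import os
from pathlib import Path


def missing_sdk_dirs(sdk_root):
    """Return the expected SDK subdirectories that are absent under sdk_root."""
    # These are the subdirectories the notebook puts on PATH and LD_LIBRARY_PATH.
    expected = ("bin/x86_64-linux-clang", "lib/x86_64-linux-clang")
    root = Path(sdk_root)
    return [sub for sub in expected if not (root / sub).is_dir()]


sdk_root = os.getenv("QNN_SDK_ROOT", "./assets/qnn")
missing = missing_sdk_dirs(sdk_root)
if missing:
    print(f"QNN_SDK_ROOT={sdk_root} is missing: {missing}")
else:
    print(f"QNN_SDK_ROOT={sdk_root} contains the expected tool directories")
```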


### Install the required python packages


In [None]:
if hasattr(__builtins__,'__IPYTHON__'):
    !sudo -H pip install --quiet --upgrade --root-user-action=ignore -r requirements.txt

## Set up models and Qualcomm AI Engine Direct SDK variables


In [None]:
import os
from pathlib import Path


def is_true(val: str):
    return val.lower() in ("true", "1", "t")


# setup whether using multithread or single thread to compile
PARALLEL = is_true(os.getenv("PARALLEL", "True"))

# setup Target platform and its generation
TARGET_PLATFORM = os.getenv("TARGET_PLATFORM", "Windows").capitalize()
PLATFORM_GEN = int(os.getenv("PLATFORM_GEN", 2)) 

# Get source folder
SRC_DIR = str(Path(".").resolve())

# Set Work Folder, where You want to save outputs of NB2 (this NB)
WORK_DIR = Path(os.getenv("OUTPUT_DIR", os.getcwd()))
# All the outputs will be saved in assets dir
ASSETS_DIR = WORK_DIR / "assets"

# Set up environment variable to reference Input network and SDK
MODELS_DIR = Path(os.getenv("INPUT_MODEL_DIR", ASSETS_DIR / "models" ))
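# Wrap in Path so that the .exists() check below also works when the variable is set in the environment
QNN_SDK_ROOT = Path(os.getenv("QNN_SDK_ROOT", ASSETS_DIR / "qnn"))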

# Check path to LLAMA_MODELS and QNN_SDK_ROOT
assert QNN_SDK_ROOT.exists(), f"{QNN_SDK_ROOT=} path does not exist"
assert MODELS_DIR.exists(), f"{MODELS_DIR=} path does not exist"

# These are the AR, CL and onnx name of the model exported from NB1
EXPORT_AR = int(os.getenv("EXPORT_AR", 2073))
EXPORT_CONTEXT_LENGTH = int(os.getenv("EXPORT_CONTEXT_LENGTH", 3073))
ONNX_NAME = os.getenv("ONNX_NAME", "llamav2_base")

# These are the AR, CLs and num_splits we want to be output of model preparation from NB2 (this NB)
CL_LIST = sorted(list(map(int, os.getenv("CL_LIST", "512,1024,2048,3072,4096").split(",")))) #1024,2048,3072,4096
ARNS = sorted(list(map(int, os.getenv("ARNS", "32,128").split(","))), reverse=True)

NUM_SPLITS = int(os.getenv("NUM_SPLITS", 4))

# Set up variables to let split_onnx know whether embedding and lm_head require their own separate splits
SPLIT_EMBEDDING = is_true(os.getenv("SPLIT_EMBEDDING", "False"))
SPLIT_LMHEAD = is_true(os.getenv("SPLIT_LMHEAD", "False"))

# Set SSD params
SSD_PARAMS_FILE = os.getenv("SSD_PARAMS_FILE", None)
assert not SSD_PARAMS_FILE or os.path.exists(
    SSD_PARAMS_FILE
), "SSD_PARAMS_FILE path does not exist"

# Set Context Bin Generation handler
EMBEDDING_ON_CPU = is_true(os.getenv("EMBEDDING_ON_CPU", "False"))
if EMBEDDING_ON_CPU and not SPLIT_EMBEDDING:
    SPLIT_EMBEDDING = True
    print(f"WARNING! {EMBEDDING_ON_CPU=} requires {SPLIT_EMBEDDING=}. Setting {SPLIT_EMBEDDING=}.")


ENABLE_NATIVE_KV = is_true(os.getenv("ENABLE_NATIVE_KV", "True"))

os.environ["QNN_SDK_ROOT"] = str(QNN_SDK_ROOT)
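
All settings in the cell above come from environment variables, so it is worth knowing how the raw strings are interpreted. The sketch below reproduces `is_true` and the comma-separated list parsing on a plain dictionary that stands in for the environment. The `overrides` dictionary and the `int_list` helper are illustrative names only.

```python
def is_true(val: str) -> bool:
    """Same rule as the notebook: only 'true', '1' and 't' (case-insensitive) count as true."""
    return val.lower() in ("true", "1", "t")


def int_list(raw: str):
    """Parse a comma-separated string into a sorted list of integers."""
    return sorted(int(item) for item in raw.split(","))


# A plain dict stands in for os.environ so the example has no side effects.
overrides = {"PARALLEL": "false", "CL_LIST": "4096,1024"}

print(is_true(overrides.get("PARALLEL", "True")))       # False
print(int_list(overrides.get("CL_LIST", "512,1024")))   # [1024, 4096]
print(is_true("yes"))                                   # False: 'yes' is not recognized
```

Note that values such as `yes` or `on` are treated as false, so use `true`, `1`, or `t` when enabling a flag.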

In [None]:
import sys
import itertools
import subprocess
import concurrent.futures
import yaml
import json

sys.path.extend(
    [
        SRC_DIR + "/../G2G",
        SRC_DIR + "/../G2G/split_onnx_utils",
        SRC_DIR + "/..",
        SRC_DIR + "/../..",
    ]
)

from utilities.nsptargets import NspTargets
from utilities.profiler import event_marker
import utils

# Set up nsp target specification
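# Android GEN4 and GEN5 are supported for this notebook
# getattr performs the same attribute lookup as eval() without evaluating arbitrary strings
nsp_target = getattr(getattr(NspTargets, TARGET_PLATFORM), f"GEN{PLATFORM_GEN}")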
soc_id = nsp_target.soc_id
dsp_arch = nsp_target.dsp_arch

if ENABLE_NATIVE_KV:
    supported_ARs = (32, 128)
    unsupported_ARs = [arn for arn in ARNS if arn not in supported_ARs]
    if len(unsupported_ARs) > 0:
        raise ValueError(
            f"ERROR: Native KV Optimization only supported for AR32 and AR128, unsupported AR found: {unsupported_ARs}"
        )

SPLITS = range(1, NUM_SPLITS + 1)
ARN_CL_LIST = list(itertools.product(ARNS, CL_LIST))
FULLTASKLIST = list(itertools.product(ARNS, CL_LIST, SPLITS))

print("All task list:", FULLTASKLIST)
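
The task lists drive every parallel stage that follows, so their size determines how many worker processes are started when `PARALLEL` is enabled. The sketch below rebuilds both lists with the default values from this notebook (ARs 32 and 128, five context lengths, four splits) using lowercase stand-in names.

```python
import itertools

# Default values taken from the configuration cell of this notebook.
arns = sorted([32, 128], reverse=True)
cl_list = sorted([512, 1024, 2048, 3072, 4096])
splits = range(1, 4 + 1)

arn_cl_list = list(itertools.product(arns, cl_list))
full_task_list = list(itertools.product(arns, cl_list, splits))

print(len(arn_cl_list))                       # 10
print(len(full_task_list))                    # 40
print(full_task_list[0], full_task_list[-1])  # (128, 512, 1) (32, 4096, 4)
```

With these defaults, stages that iterate over `FULLTASKLIST` request 40 worker processes at once. On a memory-constrained host, setting `PARALLEL=False` runs the tasks one at a time instead.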

### Set up environment variables for the Qualcomm AI Engine Direct SDK tools


In [None]:
qnn_env = os.environ.copy()
qnn_env["QNN_SDK_ROOT"] = str(QNN_SDK_ROOT)
qnn_env["PYTHONPATH"] = f"{QNN_SDK_ROOT}/benchmarks/QNN/:{QNN_SDK_ROOT}/lib/python"
qnn_env["PATH"] = f"{QNN_SDK_ROOT}/bin/x86_64-linux-clang:{qnn_env['PATH']}"
qnn_env["LD_LIBRARY_PATH"] = f"{QNN_SDK_ROOT}/lib/x86_64-linux-clang"
qnn_env["HEXAGON_TOOLS_DIR"] = f"{QNN_SDK_ROOT}/bin/x86_64-linux-clang"
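# Update in place: rebinding os.environ to a plain dict would stop changes from reaching child processes
os.environ.update(qnn_env)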

# Prepare LLaMa2 models for Inference

The following section uses the Qualcomm AI Engine Direct SDK to prepare LLaMa2 models for on-target inference.


### Export to desired ARn

Expected execution time: ~10 minutes


In [None]:
import change_hardcoding


def gen_ar(task, scoring_net=False):
    # task is one (ARn, context length) pair from the global ARN_CL_LIST
    arn, CL = task

    fix_list = [
        f" {EXPORT_AR},{arn}",
        f" -{EXPORT_AR},-1",
        f" {EXPORT_CONTEXT_LENGTH},{CL}",
        f" {EXPORT_CONTEXT_LENGTH-EXPORT_AR},{CL-arn}",
    ]

    model_dir = MODELS_DIR
    output_path = ASSETS_DIR / f"ar{arn}_cl{CL}" / "src"

    output_path.mkdir(parents=True, exist_ok=True)

    change_hardcoding.execute(str(model_dir), str(output_path), fix_list)


with event_marker(f"prepare-export(ARn -> ARx) {ONNX_NAME}"):
    with concurrent.futures.ProcessPoolExecutor(
        max_workers=len(ARN_CL_LIST) if PARALLEL else 1
    ) as executor:
        results = executor.map(gen_ar, ARN_CL_LIST)
        for result in results:
            if result:
                print(result)
print(f"Prepare AR export done.")
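
The `fix_list` entries describe how the dimensions of the exported model are rewritten for each target AR and context length. The sketch below reproduces only the arithmetic for the default export values (AR 2073, context length 3073). How `change_hardcoding.execute` consumes these strings is not shown in this notebook, so reading each entry as an "old,new" pair is an assumption. The helper name `build_fix_list` is illustrative.

```python
def build_fix_list(export_ar, export_cl, arn, cl):
    """Build the same four substitution strings as gen_ar for one (ARn, CL) target."""
    return [
        f" {export_ar},{arn}",                    # exported AR -> target AR
        f" -{export_ar},-1",                      # negative exported AR -> -1
        f" {export_cl},{cl}",                     # exported context length -> target context length
        f" {export_cl - export_ar},{cl - arn}",   # exported (CL - AR) -> target (CL - AR)
    ]


for entry in build_fix_list(2073, 3073, 128, 4096):
    print(repr(entry))
# ' 2073,128'
# ' -2073,-1'
# ' 3073,4096'
# ' 1000,3968'
```

The unusual export values make these substitutions unambiguous: 2073, 3073 and their difference 1000 are unlikely to appear in the model for any other reason.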

## Preprocess ONNX

Before using the QNN tool chain to compile and generate the context binary for LLaMA, we need to split the model and generate the following artifacts:

-   ONNX file for each split of the model
-   input vectors for each split
-   golden output vectors for each split

We need to specify the following parameters to proceed with execution of the notebook and generate all necessary artifacts:

-   number of splits of the model
-   path to LLaMA ONNX file
-   path to LLaMA encodings file
-   path to \*.pkl files

![Split](../jupyter_notebook_assets/ModelSplit.png)
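
A quick pre-flight check can catch a missing input before the split step starts. The sketch below looks for the three inputs that the split cell in this notebook passes to `utils.split_onnx`: the ONNX file, the encodings file, and the `test_vectors` directory. The helper name `missing_split_inputs` is hypothetical.

```python
from pathlib import Path


def missing_split_inputs(model_dir, onnx_name):
    """Return the split inputs that do not exist under <model_dir>/src."""
    src = Path(model_dir) / "src"
    required = [
        src / "onnx" / f"{onnx_name}.onnx",
        src / "onnx" / f"{onnx_name}.encodings",
        src / "test_vectors",  # directory that holds the .pkl files
    ]
    return [str(path) for path in required if not path.exists()]


# Example: check the AR128 / CL4096 variant of the default model name.
missing = missing_split_inputs("assets/ar128_cl4096", "llamav2_base")
print("missing inputs:", missing if missing else "none")
```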


### Split ONNX export

This step splits a model into multiple parts based on the number of splits specified.

Expected execution time: ~10 minutes


In [None]:
def thread_split(task):
    arn, cl = task
    target_model_name = f"ar{arn}_cl{cl}"
    model_dir = ASSETS_DIR / target_model_name


    model_name = ONNX_NAME
    encodings_fname = f"{ONNX_NAME}.encodings"
    lora_importer_config = None

    print(f"Starting {model_name}.onnx for {arn=} {cl=}")
    utils.split_onnx(
        onnxfile=str(model_dir / "src" / "onnx" / f"{model_name}.onnx"),
        encoding_file=str(model_dir / "src" / "onnx" / encodings_fname),
        pickle_filedir=str(model_dir / "src" / "test_vectors"),
        modelname=target_model_name,
        num_splits=NUM_SPLITS,
        split_embedding=SPLIT_EMBEDDING,
        split_lmhead=SPLIT_LMHEAD,
        embed_ssd_params_file=SSD_PARAMS_FILE,
        using_qairt_workflow=True,
        output_dir=model_dir,
        lora_importer_config=lora_importer_config,
    )
    print(f"Completed {model_name}.onnx for {arn=} {cl=}")


with event_marker(f"split-onnx {ONNX_NAME}"):
    with concurrent.futures.ProcessPoolExecutor(
        max_workers=len(ARN_CL_LIST) if PARALLEL else 1
    ) as executor:
        results = executor.map(thread_split, ARN_CL_LIST)
        for result in results:
            if result:
                print(result)

print("All ONNX models split.")

### Convert attention layers from MHA to SHA

The `mha2sha-onnx-converter` tool converts a model from MHA representation to its equivalent SHA representation. The encoding files generated from the AIMET workflow are provided as an input to this step via the `--exported-model-encoding-path` option.

This step generates a new `.onnx` file that represents the model in SHA format.

Expected execution time: ~10 minutes


In [None]:
MHA2SHA_ROOT = f"{SRC_DIR}/../G2G/MHA2SHA"
g2g_env = os.environ.copy()
g2g_env["PYTHONPATH"] = os.pathsep.join(
    [g2g_env.get("PYTHONPATH", ""), os.path.join(MHA2SHA_ROOT, "src/python")]
)
g2g_env["PATH"] = os.pathsep.join([g2g_env.get("PATH", ""), os.path.join(MHA2SHA_ROOT, "bin")])
print(f"MHA2SHA tool root set to: {MHA2SHA_ROOT}")


def thread_mha2sha(task):
    arn, CL, split = task

    if SPLIT_EMBEDDING and split == 1:
        print("The first split only contains the embedding layer, so it is skipped")
        return
    elif SPLIT_LMHEAD and split == NUM_SPLITS:
        print("The last split only contains lm_head, so it is skipped")
        return

    artifacts_dir = ASSETS_DIR / f"ar{arn}_cl{CL}"
    split_work_dir = artifacts_dir / f"{split}_of_{NUM_SPLITS}"
    out_dir = split_work_dir / "sha_output"
    out_dir.mkdir(parents=True, exist_ok=True)

    name = f"ar{arn}_cl{CL}_{split}_of_{NUM_SPLITS}"


    print(f"mha2sha-onnx-converter {name} running...")
    args = [
        "mha2sha-onnx-converter",
        *["--sha-export-path", str(out_dir)],
        *["--model-name", name],
        *[
            "--exported-model-encoding-path",
            str(
                artifacts_dir
                / "src"
                / "onnx"
                /  (f"{ONNX_NAME}.encodings")
            ),
        ],
        *[
            "--exported-model-path",
            str(artifacts_dir / "split_onnx" / f"{name}.onnx"),
        ],
        *["--base-llm", "llama2"],
        "--mha-conv",
        "--nchw-aligned",
    ]

    proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=g2g_env)
    output, error = proc.communicate()
    print(output.decode(), error.decode())
    print(f"mha2sha-onnx-converter {name} done.")


with event_marker(f"mha2sha {ONNX_NAME}"):
    with concurrent.futures.ProcessPoolExecutor(
        max_workers=len(FULLTASKLIST) if PARALLEL else 1
    ) as executor:
        results = executor.map(thread_mha2sha, FULLTASKLIST)
        for result in results:
            if result:
                print(result)
print(f"All mha2sha convert done.")
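
Not every split goes through the MHA2SHA conversion. The sketch below mirrors the skip logic of `thread_mha2sha` as a pure function so that the effect of `SPLIT_EMBEDDING` and `SPLIT_LMHEAD` is easy to see. The function name `mha2sha_splits` is illustrative.

```python
def mha2sha_splits(num_splits, split_embedding=False, split_lmhead=False):
    """Return the split indices (1-based) that are converted from MHA to SHA."""
    converted = []
    for split in range(1, num_splits + 1):
        if split_embedding and split == 1:
            continue  # first split holds only the embedding layer
        if split_lmhead and split == num_splits:
            continue  # last split holds only lm_head
        converted.append(split)
    return converted


print(mha2sha_splits(4))                                            # [1, 2, 3, 4]
print(mha2sha_splits(4, split_embedding=True))                      # [2, 3, 4]
print(mha2sha_splits(4, split_embedding=True, split_lmhead=True))   # [2, 3]
```

Skipped splits have no `sha_output` folder, which is why the conversion step reads them from `split_onnx` instead.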

## Convert the model from ONNX representation to QNN DLC representation

The Qualcomm AI Engine Direct SDK `qairt-converter` tool converts a model from ONNX representation to its equivalent QNN DLC representation. The encoding files generated from the AIMET workflow are provided as an input to this step via the `--quantization_overrides model.encodings` option.

This step generates a `.dlc` file that represents the model as a series of QNN API calls.

Expected execution time: ~10 minutes


In [None]:
def thread_convert(task):

    arn, cl, split = task

    # When EMBEDDING_ON_CPU is True, we do not require to generate binaries for this split,
    # hence, no conversion is required.
    if EMBEDDING_ON_CPU and split == 1:
        return

    artifacts_dir = ASSETS_DIR / f"ar{arn}_cl{cl}"
    split_work_dir = artifacts_dir / f"{split}_of_{NUM_SPLITS}"
    out_dir = split_work_dir / "converted_model"
    out_dir.mkdir(parents=True, exist_ok=True)

    name = f"ar{arn}_cl{cl}_{split}_of_{NUM_SPLITS}"

    if (SPLIT_EMBEDDING and split == 1) or (SPLIT_LMHEAD and split == NUM_SPLITS):
        # mha2sha is not applied to the first split (if SPLIT_EMBEDDING is set) or to the last split (if SPLIT_LMHEAD is set)
        input_onnx = artifacts_dir / "split_onnx" / f"{name}.onnx"
        quantization_overrides = (
            artifacts_dir
            / "src"
            / "onnx"
            / (f"{ONNX_NAME}.encodings")
        )
    else:
        input_onnx = split_work_dir / "sha_output" / f"{name}.onnx"
        quantization_overrides = split_work_dir / "sha_output" / f"{name}.encodings"

    args = [
        f"{QNN_SDK_ROOT}/bin/x86_64-linux-clang/qairt-converter",
        *["--input_network", str(input_onnx)],
        *["--quantization_overrides", str(quantization_overrides)],
        *["--output_path", str(out_dir / f"{name}.dlc")],
    ]
    for opt in utils.get_input_layout(str(input_onnx), using_qairt_workflow=True):
        args += opt

    proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=qnn_env)
    output, error = proc.communicate()
    print(output.decode(), error.decode())

    print()
    print(f"qairt-converter {name} done!")


with event_marker(f"convert-onnx {ONNX_NAME}"):
    with concurrent.futures.ProcessPoolExecutor(
        max_workers=len(FULLTASKLIST) if PARALLEL else 1
    ) as executor:
        results = executor.map(thread_convert, FULLTASKLIST)
        for result in results:
            if result:
                print(result)

print(f"All qairt-converter done.")
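
The cells in this notebook print the output of each SDK tool but do not inspect its exit code, so a failed conversion can go unnoticed until a later stage cannot find its input. The sketch below shows one possible way to surface such failures with the standard `subprocess.run` API. The helper name `run_tool` is hypothetical, and the Python interpreter stands in for an SDK tool so the example can run anywhere.

```python
import subprocess
import sys


def run_tool(args, env=None):
    """Run a command, echo its output, and raise if it exits with a non-zero code."""
    proc = subprocess.run(args, capture_output=True, text=True, env=env)
    print(proc.stdout, proc.stderr)
    if proc.returncode != 0:
        raise RuntimeError(f"{args[0]} failed with exit code {proc.returncode}")
    return proc.stdout


# Demonstration: the interpreter plays the role of an SDK executable.
print(run_tool([sys.executable, "-c", "print('ok')"]).strip())

try:
    run_tool([sys.executable, "-c", "import sys; sys.exit(3)"])
except RuntimeError as err:
    print("caught:", err)
```

An exception raised inside a `ProcessPoolExecutor` worker is re-raised when the results of `executor.map` are iterated, so a helper like this would stop the cell at the first failing task.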

## Quantize the QNN DLC model

The Qualcomm AI Engine Direct SDK `qairt-quantizer` tool compiles the model `.dlc` and input `.raw` files into a `model.quantized.dlc` file.

The inputs to this stage are the input raw files and the `model.dlc` generated in the previous step.

Expected execution time: < 25 minutes


In [None]:
def thread_genlib(task):

    arn, cl, split = task

    # When EMBEDDING_ON_CPU is True, we do not require to generate binaries for this split,
    # hence, quantization is not required.
    if EMBEDDING_ON_CPU and split == 1:
        return

    artifacts_dir = ASSETS_DIR / f"ar{arn}_cl{cl}"
    split_work_dir = artifacts_dir / f"{split}_of_{NUM_SPLITS}"
    out_dir = split_work_dir / "compiled_model"
    out_dir.mkdir(parents=True, exist_ok=True)

    name = f"ar{arn}_cl{cl}_{split}_of_{NUM_SPLITS}"

    proc = subprocess.Popen(
        [
            f"{QNN_SDK_ROOT}/bin/x86_64-linux-clang/qairt-quantizer",
            *["--act_bitwidth", "16"],
            *["--bias_bitwidth", "32"],
            *["--input_dlc", str(split_work_dir / "converted_model" / f"{name}.dlc")],
            *["--input_list", str(artifacts_dir / f"input_list_{name}.txt")],
            *["--output_dlc", str(out_dir / f"{name}.dlc")],
        ],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        env=qnn_env,
    )

    output, error = proc.communicate()
    print(output.decode(), error.decode())
    print(f"qairt-quantizer {name} done!")


with event_marker(f"qairt-quantizer {ONNX_NAME}"):
    with concurrent.futures.ProcessPoolExecutor(
        max_workers=len(FULLTASKLIST) if PARALLEL else 1
    ) as executor:
        results = executor.map(thread_genlib, FULLTASKLIST)
        for result in results:
            if result:
                print(result)

print(f"All qairt-quantizer done.")
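
Because the context binary step needs every quantized DLC, it can be useful to list any that are missing before starting the long compile. The sketch below walks the same `compiled_model` paths that `thread_genlib` writes to. The helper name `missing_quantized_dlcs` and its `skip_first_split` parameter (which mirrors `EMBEDDING_ON_CPU`) are illustrative.

```python
import itertools
from pathlib import Path


def missing_quantized_dlcs(assets_dir, arns, cl_list, num_splits, skip_first_split=False):
    """Return the expected quantized DLC paths that do not exist yet."""
    missing = []
    for arn, cl, split in itertools.product(arns, cl_list, range(1, num_splits + 1)):
        if skip_first_split and split == 1:
            continue  # no binary is generated for the embedding split in this case
        name = f"ar{arn}_cl{cl}_{split}_of_{num_splits}"
        dlc = (
            Path(assets_dir)
            / f"ar{arn}_cl{cl}"
            / f"{split}_of_{num_splits}"
            / "compiled_model"
            / f"{name}.dlc"
        )
        if not dlc.is_file():
            missing.append(dlc)
    return missing


missing = missing_quantized_dlcs("assets", [128, 32], [512, 1024, 2048, 3072, 4096], 4)
print(f"{len(missing)} quantized DLC file(s) missing")
```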

## QNN HTP weight sharing context binary

The Qualcomm AI Engine Direct SDK `qnn-context-binary-generator` tool creates a QNN context binary applicable to the QNN HTP backend. This binary can be deployed to run on a Snapdragon 8 Gen4 device that runs Android. This step requires the quantized DLCs for every AR and context length from the previous step and the `libQnnHtp.so` library, available in the Qualcomm AI Engine Direct SDK.

Provide additional options that pertain to the QNN HTP backend by passing the `libQnnHtpNetRunExtensions.so` library, which implements extensions for the QNN HTP backend. The library is available in the Qualcomm AI Engine Direct SDK.


### Update config with absolute paths to SHA files


In [None]:
def modify_importer_config(task):
    arn, cl, split = task

    artifacts_dir = ASSETS_DIR / f"ar{arn}_cl{cl}"
    split_work_dir = artifacts_dir / f"{split}_of_{NUM_SPLITS}"
    name = f"ar{arn}_cl{cl}_{split}_of_{NUM_SPLITS}"

    lora_importer_config = artifacts_dir / "src" / "onnx" / "lora_importer_config.yaml"
    with lora_importer_config.open("r") as f:
        lora_importer_config_data = yaml.safe_load(f)

    for use_case in lora_importer_config_data["use_case"]:
        adaptor_name = use_case["name"]
        use_case["model_name"] = str(split_work_dir / "sha_output" / f"{name}.onnx")
        use_case["lora_weights"] = str(
            split_work_dir / "sha_output" / f"{adaptor_name}_sha_weights.safetensor"
        )
        use_case["quant_overrides"] = str(
            split_work_dir / "sha_output" / f"{adaptor_name}_{name}.encodings"
        )
        use_case["output_path"] = str(split_work_dir / "importer_output")

        for key in ("model_name", "lora_weights", "quant_overrides"):
            assert Path(
                use_case[key]
            ).exists(), f"{use_case[key]} does not exist for adaptor {adaptor_name}."

    lora_importer_sha_config = split_work_dir / "lora_importer_config.yaml"
    with lora_importer_sha_config.open("w") as f:
        yaml.dump(lora_importer_config_data, f, default_flow_style=False, sort_keys=False)

### Define HTP performance settings


In [None]:
def make_config_file(split, out_dir: Path, src_graphs, soc_id=69, dsp_arch="v79"):
    htp_config_path = out_dir / f"HtpConfigFile_API_{split}.json"
    perf_config_path = out_dir / f"PerfSetting_API_{split}.conf"

    htp_config = {
        "backend_extensions": {
            "shared_library_path": "libQnnHtpNetRunExtensions.so",
            "config_file_path": str(perf_config_path),
        }
    }

    perf_config = {
        "graphs": [
            {
                "O": 3.0,
                "vtcm_mb": 8,
                "graph_names": src_graphs,
                "fp16_relaxed_precision": 0,
                "hvx_threads": 6,
            }
        ],
        "devices": [
            {
                "soc_id": int(soc_id),
                "dsp_arch": dsp_arch,
                "cores": [{"perf_profile": "burst", "rpc_control_latency": 100}],
                "pd_session": "unsigned",
            }
        ],
        "context": {"weight_sharing_enabled": len(src_graphs) > 1},
        "groupContext": {"share_resources": True},
        "memory": {"mem_type": "shared_buffer"},
    }

    with htp_config_path.open("w") as f:
        json.dump(htp_config, f, indent=4)

    with perf_config_path.open("w") as f:
        json.dump(perf_config, f, indent=4)
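
`make_config_file` writes two files per split: a small JSON file that names the backend extensions library, and the performance settings file it points to. The sketch below rebuilds only the first file and the weight sharing rule, so the relationship between the two is visible without writing to disk. The helper names `htp_config` and `weight_sharing_enabled` are illustrative.

```python
import json
from pathlib import Path


def htp_config(split, conf_dir):
    """Build the backend extensions config that references the perf settings file."""
    perf_config_path = Path(conf_dir) / f"PerfSetting_API_{split}.conf"
    return {
        "backend_extensions": {
            "shared_library_path": "libQnnHtpNetRunExtensions.so",
            "config_file_path": str(perf_config_path),
        }
    }


def weight_sharing_enabled(graph_names):
    """Weight sharing is switched on only when more than one graph is compiled together."""
    return len(graph_names) > 1


print(json.dumps(htp_config(1, "conf"), indent=4))
print(weight_sharing_enabled(["ar128_cl512_1_of_4"]))                        # False
print(weight_sharing_enabled(["ar128_cl512_1_of_4", "ar32_cl512_1_of_4"]))   # True
```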

### Native Format KV Optimization config generator


In [None]:
def gen_kv_format_config(split, folder: Path, data_format_config_name, scoring_net=False):
    graphs_dict = {}
    graphs_dict["graphs"] = []
    cl_list = CL_LIST
    ARNs = ARNS
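    if scoring_net:
        # NOTE: SCORING_NET_CL_LIST and SCORING_NETWORK_ONNX_NAME are not defined in this
        # notebook, so this branch only works if they are defined before the call.
        cl_list = SCORING_NET_CL_LIST
        ARNs = [1]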
    for arn in ARNs:
        for CL in cl_list:
            model_artifact = ASSETS_DIR / f"ar{arn}_cl{CL}" / "split_onnx"
            onnx_name = f"ar{arn}_cl{CL}_{split}_of_{NUM_SPLITS}"
            onnxfile = model_artifact / f"{onnx_name}.onnx"

            if scoring_net:
                model_artifact = ASSETS_DIR / f"{SCORING_NETWORK_ONNX_NAME}_cl{CL}" / "onnx"
                onnx_name = SCORING_NETWORK_ONNX_NAME
                onnxfile = model_artifact / f"{onnx_name}.onnx"

            input_names, output_names = utils.get_onnx_input_output_names(
                str(onnxfile), deco_digit=False, using_qairt_workflow=True
            )
            tensors = [
                {
                    "tensor_name": name,
                    "dataFormat": "QNN_TENSOR_DATA_FORMAT_HMX_WEIGHT_LAYOUT",
                }
                for name in input_names + output_names
                if "key" in name or "value" in name
            ]
            if len(tensors) > 0:
                graphs_dict["graphs"].append(
                    {
                        "graph_name": onnx_name,
                        "tensors": tensors,
                    }
                )

    with (folder / data_format_config_name).open("w") as f:
        json.dump(graphs_dict, f, indent=4)
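
The native KV config selects tensors purely by name: any graph input or output whose name contains `key` or `value` is tagged with the HMX weight layout. The sketch below applies the same filter to a list of made-up tensor names. The names are hypothetical and only illustrate the matching rule, and the helper name `kv_tensor_entries` is illustrative.

```python
def kv_tensor_entries(tensor_names):
    """Return data-format entries for tensors whose name contains 'key' or 'value'."""
    return [
        {
            "tensor_name": name,
            "dataFormat": "QNN_TENSOR_DATA_FORMAT_HMX_WEIGHT_LAYOUT",
        }
        for name in tensor_names
        if "key" in name or "value" in name
    ]


# Hypothetical tensor names; real names come from the split ONNX files.
names = ["input_ids", "attention_mask", "past_key_0_in", "past_value_0_in", "logits"]
for entry in kv_tensor_entries(names):
    print(entry["tensor_name"])
# past_key_0_in
# past_value_0_in
```

Because the match is a substring test, any unrelated tensor whose name happens to contain `key` or `value` would also be selected.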

### Compile context binary

Expected execution time: < 3 hours


In [None]:
def merge_adapter_configs(adapter_config_paths, merged_adapter_config_path):
    merged_use_cases = []
    for adapter_config_path in adapter_config_paths:
        with adapter_config_path.open("r") as f:
            adapter_config_data = yaml.safe_load(f)
        merged_use_cases.extend(adapter_config_data["use_case"])

    merged_adapter_config_data = {
        "use_case": merged_use_cases,
        "share_adapters_between_graphs": "Yes",
    }
    with merged_adapter_config_path.open("w") as f:
        yaml.dump(merged_adapter_config_data, f, default_flow_style=False, sort_keys=False)

In [None]:
def thread_gen_ws_cb(split):

    # When EMBEDDING_ON_CPU is True, we do not require to generate binaries for this split.
    if EMBEDDING_ON_CPU and split == 1:
        return

    graph_list = []
    dlc_list = []
    adapter_config_paths = []
    for ar, cl in ARN_CL_LIST:
        split_dir = ASSETS_DIR / f"ar{ar}_cl{cl}" / f"{split}_of_{NUM_SPLITS}"

        graph_name = f"ar{ar}_cl{cl}_{split}_of_{NUM_SPLITS}"
        src_q_dlc = split_dir / "compiled_model" / f"{graph_name}.dlc"

        graph_list.append(graph_name)
        dlc_list.append(str(src_q_dlc))

    ar_str = "_".join(f"ar{x}" for x in ARNS)
    cl_str = "_".join(f"cl{x}" for x in CL_LIST)

    out_dir = ASSETS_DIR / f"{ar_str}_{cl_str}"
    out_dir.mkdir(parents=True, exist_ok=True)

    conf_dir = ASSETS_DIR / f"{ar_str}_{cl_str}_conf_files"
    conf_dir.mkdir(parents=True, exist_ok=True)

    make_config_file(split, conf_dir, graph_list, soc_id, dsp_arch)

    binary_file_name = f"weight_sharing_model_{ar_str}_{cl_str}_{split}_of_{NUM_SPLITS}"
    if ENABLE_NATIVE_KV:
        binary_file_name += "_natKV"

    cmd = [
        qnn_env["HEXAGON_TOOLS_DIR"] + "/qnn-context-binary-generator",
        "--log_level=error",
        *["--backend", "libQnnHtp.so"],
        *["--model", "libQnnModelDlc.so"],
        *["--input_output_tensor_mem_type", "memhandle"],
        *["--config_file", str(conf_dir / f"HtpConfigFile_API_{split}.json")],
        *["--dlc_path", ",".join(dlc_list)],
        *["--output_dir", str(out_dir)],
        *["--binary_file", f"{binary_file_name}.serialized"],
    ]


    if ENABLE_NATIVE_KV:
        data_format_config_name = f"data_format_config_{split}_of_{NUM_SPLITS}.json"
        gen_kv_format_config(split, conf_dir, data_format_config_name)
        cmd += ["--data_format_config", str(conf_dir / data_format_config_name)]

    print(" ".join(cmd))
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=qnn_env)
    output, error = proc.communicate()
    print(output.decode(), error.decode())
    print(f"#{split} weight sharing model generated.")


with event_marker(f"context-binary {ONNX_NAME}"):
    with concurrent.futures.ProcessPoolExecutor(
        max_workers=len(SPLITS) if PARALLEL else 1
    ) as executor:
        results = executor.map(thread_gen_ws_cb, SPLITS)
        for result in results:
            if result:
                print(result)

print(f"All weight shared qnn-context-binary generated.")
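
The name passed to `--binary_file` encodes every AR, every context length, the split index, and whether native KV is enabled. The sketch below reproduces the naming logic of `thread_gen_ws_cb` for the default settings, which can help when locating the outputs. The function name `context_binary_name` is illustrative.

```python
def context_binary_name(arns, cl_list, split, num_splits, native_kv=True):
    """Build the --binary_file value used for one split of the weight sharing model."""
    ar_str = "_".join(f"ar{x}" for x in arns)
    cl_str = "_".join(f"cl{x}" for x in cl_list)
    name = f"weight_sharing_model_{ar_str}_{cl_str}_{split}_of_{num_splits}"
    if native_kv:
        name += "_natKV"
    return f"{name}.serialized"


print(context_binary_name([128, 32], [512, 1024, 2048, 3072, 4096], 1, 4))
# weight_sharing_model_ar128_ar32_cl512_cl1024_cl2048_cl3072_cl4096_1_of_4_natKV.serialized
```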

### Save profiling stats


In [None]:
from utilities.profiler import EventProfiler

EventProfiler().report()
EventProfiler().json_dump(str(ASSETS_DIR / "profiling_stats.json"))

Upon completion of these steps to prepare models for inference, QNN context binaries are available in the `{ar_str}_{cl_str}` folder under `./assets` (the `out_dir` used by the context binary cell).
The next step is to execute the prepared models (now represented as serialized context binaries) on a Snapdragon 8 Gen4 Android device using executable utilities available in the Qualcomm AI Engine Direct SDK.

Copyright (c) 2024 Qualcomm Technologies, Inc. and/or its subsidiaries.
