# Optimize Llama2 on multiple EPs (CPU/CUDA)

Olive supports optimizing a model not only on a single execution provider but also on multiple execution providers. In this notebook, we will show you how to optimize the Llama2 model on multiple execution providers (CPU and CUDA) using Olive.

When multiple execution providers are enabled, Olive optimizes the input model on each execution provider in turn. Because different execution providers have different environments and requirements, Olive manages the environment for each one separately. Users can specify the device and execution providers themselves or let Olive choose the best execution provider for them.

Olive manages an environment by installing the required packages from the `requirements_file` or Dockerfile into it. Only the Python environment, Docker, and AzureML systems can be managed this way. For a managed system, `device` and `execution_providers` are mandatory; if either is missing, Olive raises an error.

In this case we will use the Llama2 model and optimize it on multiple execution providers.

### Prerequisite

Before running this script, you need to install the required python packages. You can install them using the following command:

```bash
pip install -r ../../requirements.txt
```

Also, make sure that olive-ai is already installed. Refer to the [installation guide](https://github.com/microsoft/Olive?tab=readme-ov-file#installation) for more information.

### Multiple Execution Providers

Now, let's see how to optimize the model on multiple execution providers in Olive. Instead of creating two separate environments for CPU and CUDA execution providers, we can use Olive to optimize the model on multiple execution providers.


Olive uses Python `venv` to manage the environment. The user can specify a requirements file, and Olive installs the packages listed in it into the environment. The user can also specify the device and execution providers for the environment. Olive manages the environment for each execution provider separately.

**Note:** Olive leverages the system's site packages to reduce the environment size and creation time. As a result, the multi-EP case may need to be run in a clean environment (*a virtual env such as conda/venv may fail*) to avoid conflicts.

```json
"systems": {
    "python_system": {
        "type": "PythonEnvironment",
        "config": {
            "accelerators": [
                {
                    "device": "GPU",
                    "execution_providers": [
                        "CPUExecutionProvider",
                        "CUDAExecutionProvider"
                    ]
                }
            ],
            "olive_managed_env": true,    // let Olive install the dependencies automatically
            "requirements_file": "multiple_ep_requirements.txt"    // requirements file for the managed environment
        }
    }
}
```
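Standard JSON does not allow `//` comments, so the annotated snippet above cannot be pasted into a config file verbatim. As a minimal sketch, the same section can be built as a Python dictionary, checked against the rule that a managed system must name a device and its execution providers, and then serialized. `validate_accelerators` is a hypothetical helper written for this example and is not part of Olive.

```python
import json


def validate_accelerators(system_config):
    """Return a list of problems found in a system's accelerator entries.

    Hypothetical helper, not part of Olive: it only mirrors the rule that
    managed systems must specify a device and execution providers.
    """
    problems = []
    accelerators = system_config.get("config", {}).get("accelerators", [])
    if not accelerators:
        problems.append("no accelerators specified")
    for index, accelerator in enumerate(accelerators):
        if not accelerator.get("device"):
            problems.append(f"accelerator {index}: missing device")
        if not accelerator.get("execution_providers"):
            problems.append(f"accelerator {index}: missing execution_providers")
    return problems


# Same content as the annotated snippet, minus the comments.
systems = {
    "python_system": {
        "type": "PythonEnvironment",
        "config": {
            "accelerators": [
                {
                    "device": "GPU",
                    "execution_providers": [
                        "CPUExecutionProvider",
                        "CUDAExecutionProvider",
                    ],
                }
            ],
            "olive_managed_env": True,
            "requirements_file": "multiple_ep_requirements.txt",
        },
    }
}

problems = validate_accelerators(systems["python_system"])
systems_json = json.dumps({"systems": systems}, indent=4)
print(problems or "accelerators look complete")
```

The resulting `systems_json` string is plain JSON and can be merged into the rest of the workflow config.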

In [None]:
! python3 -m olive.workflows.run --config config_multi_ep.json

[2024-04-17 15:04:11,408] [INFO] [run.py:261:run] Loading Olive module configuration from: test/lib/python3.8/site-packages/olive/olive_config.json
[2024-04-17 15:04:11,409] [INFO] [run.py:267:run] Loading run configuration from: config_multi_ep.json
[2024-04-17 15:04:11,472] [INFO] [accelerator.py:336:create_accelerators] Running workflow on accelerator specs: gpu-cpu,gpu-cuda
[2024-04-17 15:04:11,472] [INFO] [engine.py:106:initialize] Using cache directory: cache_folder//cache
[2024-04-17 15:04:11,472] [INFO] [engine.py:262:run] Running Olive on accelerator: gpu-cpu
[2024-04-17 15:04:11,473] [INFO] [engine.py:1074:_create_system] Creating target system ...
[2024-04-17 15:04:13,464] [INFO] [misc.py:68:create_managed_system] Virtual environment 'tmp/olive_python_env_653h69jt' created.
[2024-04-17 15:04:21,545] [INFO] [engine.py:1077:_create_system] Target system created in 10.071915 seconds
[2024-04-17 15:04:21,545] [INFO] [engine.py:1086:_create_system] Creating host system ...
[2024-

With multiple execution providers enabled, Olive optimizes the model on each execution provider in turn and packages the optimized models into a single zipped file under the user-specified `output_dir`.

The `models_rank.json` inside the zipped file lists the details of the optimized models. Here is the top-ranked entry for the Llama2 model on multiple execution providers:
```json
{
    "rank": 1,
    "model_config": {
        "type": "ONNXModel",
        "config": {
            "model_path": "CandidateModels/gpu-cuda/BestCandidateModel_1",
            "onnx_file_name": "model.onnx",
            "inference_settings": null,
            "use_ort_extensions": false,
            "external_initializers_file_name": null,
            "constant_inputs_file_name": null,
            "model_attributes": {
                "vocab_size": 32000,
                "max_position_embeddings": 4096,
                "hidden_size": 4096,
                "intermediate_size": 11008,
                "num_hidden_layers": 32,
                "num_attention_heads": 32,
                "num_key_value_heads": 32,
                "hidden_act": "silu",
                "initializer_range": 0.02,
                "rms_norm_eps": 1e-05,
                "pretraining_tp": 1,
                "use_cache": true,
                "rope_theta": 10000.0,
                "rope_scaling": null,
                "attention_bias": false,
                "attention_dropout": 0.0,
                "return_dict": true,
                "output_hidden_states": false,
                "output_attentions": false,
                "torchscript": false,
                "torch_dtype": "float16",
                "use_bfloat16": false,
                "tf_legacy_loss": false,
                "pruned_heads": {},
                "tie_word_embeddings": false,
                "chunk_size_feed_forward": 0,
                "is_encoder_decoder": false,
                "is_decoder": false,
                "cross_attention_hidden_size": null,
                "add_cross_attention": false,
                "tie_encoder_decoder": false,
                "max_length": 20,
                "min_length": 0,
                "do_sample": false,
                "early_stopping": false,
                "num_beams": 1,
                "num_beam_groups": 1,
                "diversity_penalty": 0.0,
                "temperature": 1.0,
                "top_k": 50,
                "top_p": 1.0,
                "typical_p": 1.0,
                "repetition_penalty": 1.0,
                "length_penalty": 1.0,
                "no_repeat_ngram_size": 0,
                "encoder_no_repeat_ngram_size": 0,
                "bad_words_ids": null,
                "num_return_sequences": 1,
                "output_scores": false,
                "return_dict_in_generate": false,
                "forced_bos_token_id": null,
                "forced_eos_token_id": null,
                "remove_invalid_values": false,
                "exponential_decay_length_penalty": null,
                "suppress_tokens": null,
                "begin_suppress_tokens": null,
                "architectures": [
                    "LlamaForCausalLM"
                ],
                "finetuning_task": null,
                "id2label": {
                    "0": "LABEL_0",
                    "1": "LABEL_1"
                },
                "label2id": {
                    "LABEL_0": 0,
                    "LABEL_1": 1
                },
                "tokenizer_class": null,
                "prefix": null,
                "bos_token_id": 1,
                "pad_token_id": null,
                "eos_token_id": 2,
                "sep_token_id": null,
                "decoder_start_token_id": null,
                "task_specific_params": null,
                "problem_type": null,
                "_name_or_path": "meta-llama/Llama-2-7b-hf",
                "transformers_version": "4.39.3",
                "model_type": "llama"
            }
        }
    },
    "metrics": {
        "latency_prompt_processing-avg": {
            "value": 25.82543,
            "priority": 1,
            "higher_is_better": false
        },
        "latency_token_generation-avg": {
            "value": 25.14717,
            "priority": -1,
            "higher_is_better": false
        }
    }
}
```
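Most of this entry is the Hugging Face model configuration; the fields that matter for comparing candidates are `rank`, `model_path`, and `metrics`. The sketch below pulls those out of a trimmed copy of the entry. `summarize_entry` is a hypothetical helper, and reading the accelerator name from the second component of `model_path` is an assumption based only on the single path shown here.

```python
import json


def summarize_entry(entry):
    """Pick out rank, accelerator folder, and metric values from one entry."""
    model_path = entry["model_config"]["config"]["model_path"]
    # Assumption of this sketch: the path has the shape
    # "CandidateModels/<accelerator>/BestCandidateModel_N".
    parts = model_path.split("/")
    accelerator = parts[1] if len(parts) > 1 else None
    metrics = {name: metric["value"] for name, metric in entry["metrics"].items()}
    return {"rank": entry["rank"], "accelerator": accelerator, "metrics": metrics}


# Trimmed copy of the entry shown above (model_attributes omitted).
entry = json.loads("""
{
    "rank": 1,
    "model_config": {
        "type": "ONNXModel",
        "config": {"model_path": "CandidateModels/gpu-cuda/BestCandidateModel_1"}
    },
    "metrics": {
        "latency_prompt_processing-avg": {"value": 25.82543, "priority": 1, "higher_is_better": false},
        "latency_token_generation-avg": {"value": 25.14717, "priority": -1, "higher_is_better": false}
    }
}
""")

summary = summarize_entry(entry)
print(summary["rank"], summary["accelerator"])
for name, value in summary["metrics"].items():
    print(f"{name}: {value}")
```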


### Legacy way to optimize the model on multiple execution providers

For comparison, let's see how to optimize the model on multiple execution providers the legacy way.
The CPU execution provider requires `onnxruntime` while the CUDA execution provider requires `onnxruntime-gpu`, and the two packages cannot be installed in the same environment. We therefore need two separate environments, one per execution provider, and must run Olive in each.

Let us first run this optimization on both CPU and GPU to see the difference.
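Because the two ONNX Runtime packages conflict, it can help to check which one the current environment actually has before launching a legacy run. The following is a small sketch using the standard-library `importlib.metadata`; `installed_ort_packages` and `describe` are hypothetical helper names, not Olive functions.

```python
from importlib import metadata


def installed_ort_packages(get_version=metadata.version):
    """Map each ONNX Runtime distribution to its installed version, if any."""
    found = {}
    for name in ("onnxruntime", "onnxruntime-gpu"):
        try:
            found[name] = get_version(name)
        except metadata.PackageNotFoundError:
            # This distribution is not installed in the current environment.
            continue
    return found


def describe(found):
    """Turn the lookup result into a short human-readable status."""
    if len(found) == 2:
        return "both installed: uninstall both and reinstall only one"
    if "onnxruntime-gpu" in found:
        return "GPU package installed: ready for the CUDA run"
    if "onnxruntime" in found:
        return "CPU package installed: ready for the CPU run"
    return "no ONNX Runtime package installed"


print(describe(installed_ort_packages()))
```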

Optimization stack: 
1. OnnxConversion: Convert the model to ONNX format.
2. OrtTransformersOptimization: Optimize the model with transformers optimization.
3. OnnxMatMul4Quantizer: Quantize the model with blockwise quantization.
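As a rough sketch, the three steps above could be written as a `passes` section, assuming it follows the same `type`-keyed layout as the `systems` sections in this notebook. The pass types come from the list; the keys (`conversion` and so on) are hypothetical labels, and any pass-specific options are left out because the notebook does not show them.

```python
import json

# Pass types are taken from the list above; the dictionary keys are
# hypothetical labels chosen for this sketch, and no per-pass options are set.
passes = {
    "conversion": {"type": "OnnxConversion"},
    "transformers_optimization": {"type": "OrtTransformersOptimization"},
    "blockwise_quantization": {"type": "OnnxMatMul4Quantizer"},
}

# Python dictionaries preserve insertion order, so this is the listed order.
pass_order = [entry["type"] for entry in passes.values()]
print(json.dumps({"passes": passes}, indent=4))
```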

#### CPU System config

For the CPU system, we first need to install `onnxruntime` in the environment.
Then we can add the following system config to the Olive config file.
```json
"systems": {
    "local_system": {
        "type": "LocalSystem",
        "config": {
            "accelerators": [
                {
                    "device": "CPU",
                    "execution_providers": [
                        "CPUExecutionProvider"
                    ]
                }
            ]
        }
    }
}
```

In [1]:
! python3 -m pip uninstall onnxruntime onnxruntime-gpu -y
! python3 -m pip install onnxruntime -U

Found existing installation: onnxruntime 1.17.3
Uninstalling onnxruntime-1.17.3:
  Successfully uninstalled onnxruntime-1.17.3
Defaulting to user installation because normal site-packages is not writeable
Collecting onnxruntime
  Using cached onnxruntime-1.17.3-cp38-cp38-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Using cached onnxruntime-1.17.3-cp38-cp38-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.8 MB)
Installing collected packages: onnxruntime
Successfully installed onnxruntime-1.17.3


In [2]:
# Then run the workflow:
! python3 -m olive.workflows.run --config config_cpu.json

[2024-04-17 14:57:23,246] [INFO] [run.py:261:run] Loading Olive module configuration from: test/lib/python3.8/site-packages/olive/olive_config.json
[2024-04-17 14:57:23,247] [INFO] [run.py:267:run] Loading run configuration from: config_cpu.json
[2024-04-17 14:57:23,305] [INFO] [accelerator.py:336:create_accelerators] Running workflow on accelerator specs: cpu-cpu
[2024-04-17 14:57:23,306] [INFO] [engine.py:106:initialize] Using cache directory: cache_folder//cache
[2024-04-17 14:57:23,306] [INFO] [engine.py:262:run] Running Olive on accelerator: cpu-cpu
[2024-04-17 14:57:23,306] [INFO] [engine.py:1074:_create_system] Creating target system ...
[2024-04-17 14:57:23,307] [INFO] [engine.py:1077:_create_system] Target system created in 0.000136 seconds
[2024-04-17 14:57:23,307] [INFO] [engine.py:1086:_create_system] Creating host system ...
[2024-04-17 14:57:23,307] [INFO] [engine.py:1089:_create_system] Host system created in 0.000122 seconds
data loader kwargs: {'batch_size': 2, 'model_

The run reports two latency metrics:

- `latency_prompt_processing`: the time taken to process the prompt, where the KV cache is not used.
- `latency_token_generation`: the time taken to generate tokens, where the KV cache is used.


| Model Type | latency_prompt_processing (ms) | latency_token_generation (ms) |
|------------|--------------------------------|-------------------------------|
| Torch Model | 590.32982 | 501.17834 |
| Olive Optimized Model | 764.51641 | 192.5402 |


#### GPU System config

For the GPU system, we first need to install `onnxruntime-gpu` in the environment.
Then we can add the following system config to the Olive config file.
```json
"systems": {
    "local_system": {
        "type": "LocalSystem",
        "config": {
            "accelerators": [
                {
                    "device": "GPU",
                    "execution_providers": [
                        "CUDAExecutionProvider"
                    ]
                }
            ]
        }
    }
}
```

In [3]:
! python3 -m pip uninstall onnxruntime onnxruntime-gpu -y
! python3 -m pip install onnxruntime-gpu -U

Found existing installation: onnxruntime 1.17.3
Uninstalling onnxruntime-1.17.3:
  Successfully uninstalled onnxruntime-1.17.3
Defaulting to user installation because normal site-packages is not writeable
Collecting onnxruntime-gpu
  Using cached onnxruntime_gpu-1.17.1-cp38-cp38-manylinux_2_28_x86_64.whl.metadata (4.3 kB)
Using cached onnxruntime_gpu-1.17.1-cp38-cp38-manylinux_2_28_x86_64.whl (192.1 MB)
Installing collected packages: onnxruntime-gpu
Successfully installed onnxruntime-gpu-1.17.1


In [4]:
! python3 -m olive.workflows.run --config config_gpu.json

[2024-04-17 15:02:03,396] [INFO] [run.py:261:run] Loading Olive module configuration from: test/lib/python3.8/site-packages/olive/olive_config.json
[2024-04-17 15:02:03,397] [INFO] [run.py:267:run] Loading run configuration from: config_gpu.json
[2024-04-17 15:02:03,464] [INFO] [accelerator.py:336:create_accelerators] Running workflow on accelerator specs: gpu-cuda
[2024-04-17 15:02:03,464] [INFO] [engine.py:106:initialize] Using cache directory: cache_folder//cache
[2024-04-17 15:02:03,465] [INFO] [engine.py:262:run] Running Olive on accelerator: gpu-cuda
[2024-04-17 15:02:03,465] [INFO] [engine.py:1074:_create_system] Creating target system ...
[2024-04-17 15:02:03,466] [INFO] [engine.py:1077:_create_system] Target system created in 0.000136 seconds
[2024-04-17 15:02:03,466] [INFO] [engine.py:1086:_create_system] Creating host system ...
[2024-04-17 15:02:03,466] [INFO] [engine.py:1089:_create_system] Host system created in 0.000114 seconds
data loader kwargs: {'batch_size': 2, 'mode

The GPU run reports the same two latency metrics:

- `latency_prompt_processing`: the time taken to process the prompt, where the KV cache is not used.
- `latency_token_generation`: the time taken to generate tokens, where the KV cache is used.


| Model Type | latency_prompt_processing (ms) | latency_token_generation (ms) |
|------------|--------------------------------|-------------------------------|
| Torch Model | 53.67267 | 47.0224 |
| Olive Optimized Model | 25.82543 | 25.14717 |
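To make the CPU and GPU tables easier to compare, the sketch below computes the ratio of Torch latency to Olive latency for each metric, using the numbers copied from the two tables. A ratio above 1 means the Olive-optimized model is faster; a ratio below 1 means it is slower. The variable and function names are illustrative only.

```python
# Latencies in milliseconds, copied from the CPU and GPU tables above.
results = {
    "CPU": {
        "torch": {"prompt_processing": 590.32982, "token_generation": 501.17834},
        "olive": {"prompt_processing": 764.51641, "token_generation": 192.5402},
    },
    "GPU": {
        "torch": {"prompt_processing": 53.67267, "token_generation": 47.0224},
        "olive": {"prompt_processing": 25.82543, "token_generation": 25.14717},
    },
}


def speedup(torch_ms, olive_ms):
    """Ratio of Torch latency to Olive latency; above 1 means Olive is faster."""
    return torch_ms / olive_ms


summary = {}
for device, rows in results.items():
    summary[device] = {
        metric: speedup(rows["torch"][metric], rows["olive"][metric])
        for metric in rows["torch"]
    }
    for metric, ratio in summary[device].items():
        print(f"{device} {metric}: {ratio:.2f}x")
```

On these numbers, the only ratio below 1 is CPU prompt processing, where the optimized model takes 764.5 ms against 590.3 ms for the Torch model; token generation improves on both devices.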
