# Optimize Llama2 Model on Azure Machine Learning

In this tutorial, we will optimize the Llama2 model. Olive passes run on AML (Azure Machine Learning) compute resources, while your local device serves as the target for evaluating both the input model and the output model.

We will apply [QLoRA](https://github.com/artidoro/qlora), [ONNX conversion](https://huggingface.co/docs/transformers/serialization) and [ONNXRuntime transformers optimization](https://onnxruntime.ai/docs/performance/transformers-optimization.html) to the original Llama2 model, and evaluate model performance with a latency metric.

## Prerequisites
### Installing Olive
To get started with Olive, install it using the following command:
```bash
pip install git+https://github.com/microsoft/Olive#egg=olive-ai[azureml]
```

### Preparing AML compute
Ensure your AML compute is set up and ready. If you haven't done this yet, please refer to the [instructions](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-compute-instance?view=azureml-api-2&tabs=azure-studio) for creating an AML compute instance.

### Attaching your local machine to the AML workspace
In this tutorial, we will use the local machine as the target for model evaluation. Please follow [this guide](https://microsoft.github.io/Olive/tutorials/azure_arc.html) to attach your local machine to the AML workspace.

### Logging in to Azure with Azure CLI
Authenticate and log in through your browser with the `az login` command.

### Setting up a Hugging Face token in the key vault
To access Hugging Face, please follow [this guide](https://microsoft.github.io/Olive/features/huggingface_model_optimization.html#azureml-system) to set up your Hugging Face token.

## Olive workflow
### JSON configuration file

#### Azure ML client
Because Olive runs passes on your AML compute, add an `azureml_client` section to the config with your workspace information:
```json
"azureml_client": {
    "subscription_id": "<subscription_id>",
    "resource_group": "<resource_group>",
    "workspace_name": "<workspace_name>",
    "keyvault_name": "<my_keyvault_name>"
}
```
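
Before submitting a run, it can help to confirm that no `<...>` placeholder is left unfilled. The following is a minimal sketch of such a check. It is not part of Olive: the `find_placeholders` helper and the filled-in resource group value are hypothetical, and the check assumes placeholders are whole string values wrapped in angle brackets, as in this tutorial.

```python
import re

# A value counts as a placeholder if the whole string looks like "<something>".
PLACEHOLDER = re.compile(r"^<[^<>]+>$")

def find_placeholders(node, path=""):
    """Recursively collect the paths of string values that still look like `<...>`."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            found += find_placeholders(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for i, value in enumerate(node):
            found += find_placeholders(value, f"{path}[{i}]")
    elif isinstance(node, str) and PLACEHOLDER.match(node):
        found.append(path)
    return found

config = {
    "azureml_client": {
        "subscription_id": "<subscription_id>",
        "resource_group": "my-resource-group",  # hypothetical filled-in value
        "workspace_name": "<workspace_name>",
        "keyvault_name": "<my_keyvault_name>",
    }
}
unfilled = find_placeholders(config)
print(unfilled)
```

Because the helper walks nested dictionaries and lists, the same call also covers placeholders such as `<my_aml_compute>` in the systems section once the full config is loaded.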

#### Input model
In this tutorial, we will use the Llama2 model curated by Azure Machine Learning. The input model is downloaded automatically from the [Azure Model catalog](https://ml.azure.com/models/Llama-2-7b/version/13/catalog/registry/azureml-meta):
```json
"input_model":{
    "type": "PyTorchModel",
    "config": {
        "model_path": {
            "type": "azureml_registry_model",
            "config": {
                "name": "Llama-2-7b",
                "registry_name": "azureml-meta",
                "version": "13"
            }
        },
        "model_file_format": "PyTorch.MLflow",
        "hf_config": {
            "model_name": "meta-llama/Llama-2-7b-hf",
            "task": "text-generation"
        }
    }
}
```

#### System
Two systems are defined here. The first is the compute resource designated for running the passes. The second is the cluster on your local machine that you previously attached to your AML workspace:
```json
"systems": {
    "aml": {
        "type": "AzureML",
        "config": {
            "accelerators": ["gpu"],
            "hf_token": true,
            "aml_compute": "<my_aml_compute>",
            "aml_docker_config": {
                "base_image": "mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.6-cudnn8-ubuntu20.04",
                "conda_file_path": "conda.yaml"
            }
        }
    },
    "azure_arc": {
        "type": "AzureML",
        "config": {
            "accelerators": ["gpu"],
            "hf_token": true,
            "aml_compute": "<my_arc_compute>",
            "aml_docker_config": {
                "base_image": "mcr.microsoft.com/azureml/openmpi4.0-cuda11.6-cudnn8-ubuntu20.04",
                "conda_file_path": "conda.yaml"
            }
        }
    }
}
```

#### Data config
Add a data config that the QLoRA pass will use for training:
```json
"data_configs": [
    {
        "name": "tiny_codes_train",
        "type": "HuggingfaceContainer",
        "user_script": "user_script.py",
        "components": {
            "load_dataset": {
                "type": "load_tiny_code_dataset"
            }
        },
        "params_config": {
            "data_name": "nampdn-ai/tiny-codes",
            "split": "train",                
            "component_kwargs": {
                "load_dataset": {
                    "language": "Python"
                },
                "pre_process_data": {
                    "dataset_type": "corpus",
                    "corpus_strategy": "join",
                    "text_template": "### Question: {prompt} \n### Answer: {response}",
                    "source_max_len": 1024
                }
            }
        }
    }
]
```

You can find more details about how to configure data configs [here](https://microsoft.github.io/Olive/tutorials/configure_data.html). `user_script.py` is needed for this data config. Please check [user_script.py](./user_script.py) for details.
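
To make the `text_template` setting concrete, the sketch below fills the template from the config with one made-up record using Python's `str.format`. It only illustrates how the `{prompt}` and `{response}` placeholders are substituted. The sample record and the `format_example` helper are hypothetical, and Olive's actual pre-processing (including how the `join` corpus strategy combines formatted examples) is not reproduced here.

```python
# Template string copied from the data config above.
TEXT_TEMPLATE = "### Question: {prompt} \n### Answer: {response}"

def format_example(record: dict) -> str:
    """Fill the template with the `prompt` and `response` fields of one record."""
    return TEXT_TEMPLATE.format(prompt=record["prompt"], response=record["response"])

# Hypothetical record, for illustration only (not taken from the dataset).
sample = {"prompt": "Reverse a string in Python.", "response": "s[::-1]"}
formatted = format_example(sample)
print(formatted)
```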


#### Evaluators
Add the latency metric to the evaluator. We will use this metric to evaluate the output model:
```json
"evaluators": {
    "common_evaluator": {
        "metrics": [
            {
                "name": "latency",
                "type": "latency",
                "sub_types": [{"name": "avg", "goal": {"type": "percent-min-improvement", "value": 10}}],
                "user_config": {
                    "user_script": "user_script.py",
                    "dataloader_func": "dataloader_func_for_merged",
                    "func_kwargs": {
                        "dataloader_func": {
                            "model_id": "meta-llama/Llama-2-7b-hf",
                            "past_seq_length": 0,
                            "seq_length": 8,
                            "max_seq_length": 2048
                        }
                    },
                    "batch_size": 2,
                    "io_bind": true
                }
            }
        ]
    }
}
```

`user_script.py` is needed for this evaluator. Please check [user_script.py](./user_script.py) for details.


#### Passes
Now we can add passes to the config file. First, add the QLoRA pass with its training arguments:
```json
"passes": {
    "qlora": {
        "type": "QLoRA",
        "config": {
            "lora_dropout": 0.1,
            "train_data_config": "tiny_codes_train",
            "eval_dataset_size": 1024,
            "training_args": {
                "per_device_train_batch_size": 2,
                "per_device_eval_batch_size": 2,
                "gradient_accumulation_steps": 1,
                "max_steps": 100,
                "logging_steps": 50,
                "save_steps": 50,
                "evaluation_strategy": "steps",
                "adam_beta2": 0.999,
                "max_grad_norm": 0.3,
                "load_best_model_at_end": true
            }
        }
    }
}
```
More details about each parameter can be found in the [QLoRA documentation](https://github.com/artidoro/qlora).

Next, convert the PyTorch model to an ONNX model by adding a conversion pass:
```json
"convert": {
    "type": "OptimumConversion",
    "config": {
        "target_opset": 17,
        "save_as_external_data": true,
        "all_tensors_to_one_file": true,
        "torch_dtype": "float32"
    }
}
```

Then we can apply ONNXRuntime transformers optimizations to this converted ONNX model:
```json
"optimize": {
    "type": "OrtTransformersOptimization",
    "config": {
        "save_as_external_data": true,
        "all_tensors_to_one_file": true,
        "model_type": "gpt2",
        "opt_level": 0,
        "only_onnxruntime": false,
        "keep_io_types": false,
        "float16": true,
        "use_gpu": true,
        "optimization_options": {
            "use_multi_head_attention": false
        }
    }
}
```

#### Engine
Add the engine config as follows:
```json
"engine": {
    "log_severity_level": 0,
    "evaluator": "common_evaluator",
    "target": "azure_arc",
    "host": "aml",
    "execution_providers": ["CUDAExecutionProvider"],
    "cache_dir": "cache",
    "output_dir": "models/llama2"
}
```

The configuration file can be found in [config.json](./config.json).
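
Several sections refer to each other by name: the engine's `host` and `target` name entries under `systems`, its `evaluator` names an entry under `evaluators`, and the QLoRA pass's `train_data_config` names a data config. The sketch below checks those references on a heavily trimmed stand-in for `config.json`. It is not Olive's own validation; the `check_references` helper is hypothetical, and the key layout is inferred only from the fragments shown in this tutorial.

```python
import json

def check_references(config: dict) -> list:
    """Return human-readable problems for names that do not resolve."""
    problems = []
    systems = config.get("systems", {})
    evaluators = config.get("evaluators", {})
    data_names = {d["name"] for d in config.get("data_configs", [])}
    engine = config.get("engine", {})

    # engine.host and engine.target should name entries under "systems".
    for key in ("host", "target"):
        name = engine.get(key)
        if name is not None and name not in systems:
            problems.append(f"engine.{key}: unknown system '{name}'")

    # engine.evaluator should name an entry under "evaluators".
    name = engine.get("evaluator")
    if name is not None and name not in evaluators:
        problems.append(f"engine.evaluator: unknown evaluator '{name}'")

    # A pass's train_data_config should match a data config "name".
    for pass_name, pass_cfg in config.get("passes", {}).items():
        name = pass_cfg.get("config", {}).get("train_data_config")
        if name is not None and name not in data_names:
            problems.append(f"passes.{pass_name}: unknown data config '{name}'")
    return problems

# Trimmed stand-in for config.json, keeping only the cross-referenced names.
trimmed = json.loads("""
{
    "systems": {"aml": {}, "azure_arc": {}},
    "data_configs": [{"name": "tiny_codes_train"}],
    "evaluators": {"common_evaluator": {}},
    "passes": {"qlora": {"type": "QLoRA", "config": {"train_data_config": "tiny_codes_train"}}},
    "engine": {"evaluator": "common_evaluator", "host": "aml", "target": "azure_arc"}
}
""")
problems = check_references(trimmed)
print(problems)  # an empty list means every name resolved
```

To check the real file, load it with `json.load` and pass the resulting dictionary to the same function.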

### Run Olive Workflow
Now you can run Olive with the following command:

```bash
olive run --config config.json
```

### Output models and metrics

Once the workflow run finishes, you will get a table with the pass history and metric results:


| model_id | parent_model_id | from_pass | duration_sec | metrics |
| ---- | ---- | ---- | ---- | ---- |
| input_model | | | | {"latency-avg": 53.96777} |
| 1_QLoRA | input_model | QLoRA | 4212.34 | |
| 2_OnnxConversion | 1_QLoRA | OnnxConversion | 221.756 | |
| 3_OrtTransformersOptimization | 2_OnnxConversion | OrtTransformersOptimization | 1231.44 | {"latency-avg": 27.68013} |

The output model has an average latency of **27.68013**, a **48.7%** improvement on your target machine over the original Llama2 model's average latency of **53.96777**.
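
The improvement figure follows directly from the two `latency-avg` values in the table. The sketch below reproduces the arithmetic and compares the result with the evaluator's goal of `10`, assuming that `percent-min-improvement` means the metric must improve by at least that percentage relative to the input model. That reading is inferred from the name, not confirmed here.

```python
baseline_latency = 53.96777   # input_model, "latency-avg" from the table
optimized_latency = 27.68013  # 3_OrtTransformersOptimization, "latency-avg"

# Relative latency reduction, expressed as a percentage of the baseline.
improvement_pct = (baseline_latency - optimized_latency) / baseline_latency * 100
print(f"{improvement_pct:.1f}%")  # 48.7%

# Assumed reading of the goal: at least 10 percent better than the baseline.
goal_pct = 10
meets_goal = improvement_pct >= goal_pct
print(meets_goal)  # True
```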

You can find the output ONNX model in the output folder `models/llama2`.