# Fine-tune Llama2 using QLoRA and Deploy Model with Multiple Adapters

In this tutorial, we will fine-tune a Llama2 model using QLoRA, optimize it using ONNX Runtime tools, and extract the fine-tuned adapters from the model.
The resulting model can be deployed with multiple adapters for different tasks.

## Prerequisites

Before running this tutorial, please ensure that you have already installed olive-ai. Refer to the [installation guide](https://github.com/microsoft/Olive?tab=readme-ov-file#installation) for more information.

### Install Dependencies
We will optimize for `CUDAExecutionProvider`, so the corresponding `onnxruntime` package must also be installed along with the other dependencies:

In [None]:
!pip install -r requirements-qlora.txt
!pip install ipywidgets tabulate
!pip install --pre onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/
# !pip install --pre onnxruntime-genai

### Get access to model and fine-tuning dataset

Get access to the following resources on Hugging Face Hub:
- [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- [nampdn-ai/tiny-codes](https://huggingface.co/nampdn-ai/tiny-codes)

Login to your Hugging Face account:

In [None]:
from huggingface_hub import login

login()

## Workflow

The Olive workflow is defined in the [llama2_qlora.json](../../llama2_qlora.json) file.

It fine-tunes the [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) model using [QLoRA](https://arxiv.org/abs/2305.14314) on a subset of [nampdn-ai/tiny-codes](https://huggingface.co/nampdn-ai/tiny-codes) to generate Python code given a prompt. The fine-tuned model is then optimized using ONNX Runtime tools.

The workflow runs the following optimization pipeline:
- GPU, FP16: *PyTorch Model -> Fine-tuned PyTorch Model -> ONNX Model -> Transformers Optimized ONNX Model fp16 -> Extract Adapters*

**Note:**
- The code language is set to `Python` but can be changed to other languages in the config file.
Supported languages are Python, TypeScript, JavaScript, Ruby, Julia, Rust, C++, Bash, Java, C#, and Go. Refer to the [dataset card](https://huggingface.co/datasets/nampdn-ai/tiny-codes) for more details on the dataset.
- The `ExtractAdapters` pass in this workflow extracts the LoRA adapters from the model and converts them into model inputs. An alternative is to extract them as external initializers, which can be done by setting `"make_inputs": false` in the `ExtractAdapters` pass configuration.
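
The `make_inputs` switch described in the note can also be applied programmatically instead of editing the file by hand. The following is only a sketch under stated assumptions: it assumes the workflow file has a top-level `"passes"` mapping from pass name to pass settings, and that options sit either in a nested `"config"` dictionary or directly on the pass entry, depending on the Olive version. The helper names and the output filename are hypothetical; check your own `llama2_qlora.json` before relying on it.

```python
import json
from pathlib import Path


def set_extract_adapters_option(config: dict, key: str, value) -> int:
    """Set `key` on every pass of type ExtractAdapters; return how many passes changed."""
    updated = 0
    # Assumption: "passes" maps a pass name to a dict of pass settings.
    for pass_cfg in config.get("passes", {}).values():
        if str(pass_cfg.get("type", "")).lower() != "extractadapters":
            continue
        # Assumption: options live in a nested "config" dict if one exists,
        # otherwise directly on the pass entry.
        nested = pass_cfg.get("config")
        target = nested if isinstance(nested, dict) else pass_cfg
        target[key] = value
        updated += 1
    return updated


def write_variant(src: str, dst: str) -> int:
    """Copy the workflow at `src` to `dst` with make_inputs disabled."""
    config = json.loads(Path(src).read_text())
    changed = set_extract_adapters_option(config, "make_inputs", False)
    Path(dst).write_text(json.dumps(config, indent=4))
    return changed
```

For example, `write_variant("llama2_qlora.json", "llama2_qlora_initializers.json")` would write a second workflow file (hypothetical name) and return the number of passes it modified; a return value of `0` means no `ExtractAdapters` pass was found under the assumed layout.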

Run the workflow:

In [None]:
!olive run --config llama2_qlora.json --setup

In [None]:
!CUDA_VISIBLE_DEVICES=0 olive run --config llama2_qlora.json

The output model files can be found at:
- Model: `models/tiny-codes-qlora/qlora-conversion-transformers_optimization-extract/gpu-cuda_model/model.onnx`
- Adapter weights: `models/tiny-codes-qlora/qlora-conversion-transformers_optimization-extract/gpu-cuda_model/adapter_weights.npz`
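
Because the adapter weights are stored as a NumPy `.npz` archive, they can be inspected without loading the ONNX model. The sketch below is illustrative only: `summarize_npz` is a hypothetical helper name, and the array names and shapes you see depend on your own run.

```python
import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype_name)} for every array in an .npz archive."""
    # np.load on an .npz file returns a lazy archive; arrays are read on access.
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }
```

Calling `summarize_npz(...)` on the `adapter_weights.npz` path listed above returns one entry per adapter tensor, which is a quick way to confirm the dtype (for example `float16`) before deployment.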

## Export Pre-existing Adapters

Olive provides a standalone script to export fine-tuned adapters from an existing repository on the Hugging Face Hub or from a local directory. The adapters must have been fine-tuned on the same base model, with the same configuration, as the model obtained in the previous step.

Let's export the adapters from [Mikael110/llama-2-7b-guanaco-qlora](https://huggingface.co/Mikael110/llama-2-7b-guanaco-qlora):

In [None]:
# run this cell to see the available options to export-adapters command
!olive export-adapters --help

In [None]:
!olive export-adapters --adapter_path Mikael110/llama-2-7b-guanaco-qlora --output_path models/exported/guanaco_qlora.npz --pack_weights --dtype float16
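
Since exported adapters only work if they match the base model and configuration of the fine-tuned model, a rough sanity check is to compare the exported archive against the `adapter_weights.npz` produced by the workflow. This is a heuristic sketch, not an Olive feature: it assumes that compatible adapter files contain the same array names with the same shapes, and `adapter_mismatches` is a hypothetical helper name.

```python
import numpy as np


def adapter_mismatches(reference_path, candidate_path):
    """List differences in array names and shapes between two adapter .npz files."""
    problems = []
    with np.load(reference_path) as ref, np.load(candidate_path) as cand:
        ref_names, cand_names = set(ref.files), set(cand.files)
        # Arrays present in only one of the two archives.
        for name in sorted(ref_names - cand_names):
            problems.append(f"missing in candidate: {name}")
        for name in sorted(cand_names - ref_names):
            problems.append(f"unexpected in candidate: {name}")
        # Arrays present in both archives but with different shapes.
        for name in sorted(ref_names & cand_names):
            if ref[name].shape != cand[name].shape:
                problems.append(
                    f"shape mismatch for {name}: {ref[name].shape} vs {cand[name].shape}"
                )
    return problems
```

An empty list from `adapter_mismatches(<workflow adapter_weights.npz>, "models/exported/guanaco_qlora.npz")` suggests the two files are interchangeable under the assumption above; any reported entry is worth investigating before deployment.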

## Deploy Model with Multiple Adapters

We can now deploy the same model with multiple adapters for different tasks by loading the adapter weights independently of the model and providing the relevant weights as input at inference time.

In [2]:
base_model_name = "meta-llama/Llama-2-7b-hf"
model_path = "models/tiny-codes-qlora/qlora-conversion-transformers_optimization-extract-metadata/gpu-cuda_model/model.onnx"
adapters = {
    "guanaco": {
        "weights": "models/exported/guanaco_qlora.npz",
        "template": "### Human: {prompt} ### Assistant:"
    },
    "tiny-codes": {
        "weights": "models/tiny-codes-qlora/qlora-conversion-transformers_optimization-extract-metadata/gpu-cuda_model/adapter_weights.npz",
        "template": "### Question: {prompt} \n### Answer:"
    }
}
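
Each entry in this registry pairs a weights file with the prompt template its adapter was trained on. As a minimal sketch of how such a template might be applied (the helper name `build_prompt` is hypothetical, and the registry is repeated here, without the weight paths, only to keep the example self-contained):

```python
# Illustrative copy of the registry above, limited to the template fields.
example_adapters = {
    "guanaco": {"template": "### Human: {prompt} ### Assistant:"},
    "tiny-codes": {"template": "### Question: {prompt} \n### Answer:"},
}


def build_prompt(adapters, adapter_name, prompt):
    """Wrap a raw prompt in the template registered for the chosen adapter."""
    if adapter_name not in adapters:
        raise KeyError(f"unknown adapter {adapter_name!r}; available: {sorted(adapters)}")
    # Only the template is formatted, so braces inside `prompt` are left untouched.
    return adapters[adapter_name]["template"].format(prompt=prompt)


print(build_prompt(example_adapters, "guanaco", "What time is it?"))
# prints: ### Human: What time is it? ### Assistant:
```

Keeping the template next to the weights means a caller only has to name the adapter, and the matching prompt format follows automatically.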

### Custom Generate Loop


We implemented an example class `ORTGenerator` in [generator.py](../utils/generator.py) that loads the model and adapters, and generates code given a prompt. If your execution provider supports IO Binding, it is recommended to use it for better performance since the adapter weights will be preallocated in the device memory.

In [None]:
import sys
from pathlib import Path

# add the utils directory to the path
sys.path.append(str(Path().resolve().parent / "utils"))

from generator import ORTGenerator
from transformers import AutoTokenizer

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# load the generator
generator = ORTGenerator(model_path, tokenizer, execution_provider="CUDAExecutionProvider", device_id=0, adapters=adapters, adapter_mode="inputs")

#### Generate with Guanaco Adapters

In [7]:
prompt = "What time is it?"

response = generator.generate(prompt, adapter="guanaco", max_gen_len=100, use_io_binding=True)

print(response)

### Human: What time is it? ### Assistant: I'm sorry, but as an AI language model, I do not have access to real-time information.

However, I can try to estimate the current time based on the context of your question and my knowledge of the current time zone.

In general, the current time can vary depending on your location and the time zone you are in.

If you would like to know the current time for a specific location, you can try searching for the time zone for that


#### Generate with Tiny Codes Adapters

In [8]:
prompt = "Calculate the sum of all even numbers in a list."

response = generator.generate(prompt, adapter="tiny-codes", max_gen_len=200, use_io_binding=True)

print(response)

### Question: Calculate the sum of all even numbers in a list. 
### Answer: Here's some python code which implements this functionality:

```python 
def sum_even_numbers(lst):
    """
    Returns the sum of all even numbers in a list.

    Args:
        lst (list): List of numbers to sum.

    Returns:
        int: The sum of all even numbers in the input list.
    """
    even_nums = []
    for num in lst:
        if num % 2 == 0:
            even_nums.append(num)

    return sum(even_nums)
``` 


### ONNX Runtime generate() API

The [ONNX Runtime generate() API](https://github.com/microsoft/onnxruntime-genai) also supports loading multiple adapters for inference. During generation, the adapter weights can be provided as inputs to the model using the `set_model_input` method of `GeneratorParams`.

In [4]:
from pathlib import Path

# import torch
import onnxruntime_genai as og
import numpy as np


def generate(model, tokenizer, adapter_weights, prompt, template, max_gen_len=100):
    params = og.GeneratorParams(model)
    # model doesn't have GQA nodes so we can't use the share buffer option
    params.set_search_options(max_length=max_gen_len, past_present_share_buffer=False)
    params.input_ids = tokenizer.encode(template.format(prompt=prompt))

    for k, v in adapter_weights.items():
        params.set_model_input(k, v)

    # generate the response
    output_tokens = model.generate(params)
    return tokenizer.decode(output_tokens)

model = og.Model(str(Path(model_path).parent))
tokenizer = og.Tokenizer(model)

# load the adapter weights
adapters_weights = {
    key: dict(np.load(value["weights"])) for key, value in adapters.items()
}

#### Generate with Guanaco Adapters

In [5]:
prompt = "What time is it?"

response = generate(model, tokenizer, adapters_weights["guanaco"], prompt, adapters["guanaco"]["template"], max_gen_len=100)

print(response)

### Human: What time is it? ### Assistant: I'm sorry, but as an AI language model, I do not have access to real-time information.

However, I can try to estimate the current time based on the context of your question and my knowledge of the current time zone.

In general, the current time can vary depending on your location and the time zone you are in.

If you would like to know the current time for a


#### Generate with Tiny Codes Adapters

In [6]:
prompt = "Calculate the sum of all even numbers in a list."

response = generate(model, tokenizer, adapters_weights["tiny-codes"], prompt, adapters["tiny-codes"]["template"], max_gen_len=200)

print(response)

### Question: Calculate the sum of all even numbers in a list. 
### Answer: Here's some python code which implements this functionality:

```python 
def sum_even_numbers(lst):
    """
    Returns the sum of all even numbers in a list.

    Args:
        lst (list): List of numbers to sum.

    Returns:
        int: The sum of all even numbers in the input list.
    """
    even_nums = []
    for num in lst:
        if num % 2 == 0:
            even_nums.append(num)

    return sum(even_nums)
``` 
