# Fine-tune Llama2 using QLoRA and Deploy Model with Multiple Adapters

In this tutorial, we will fine-tune a Llama 2 model using QLoRA, convert it to ONNX, and extract the fine-tuned adapters from the model.
The resulting model can be deployed with multiple adapters for different tasks.

## Prerequisites

Before running this tutorial, please ensure you have already installed olive-ai. Refer to the [installation guide](https://github.com/microsoft/Olive?tab=readme-ov-file#installation) for more information.

### Install Dependencies
We will optimize for `CUDAExecutionProvider`, so `onnxruntime-gpu>=1.20` should also be installed along with the other dependencies:

In [None]:
# install required packages
!pip install -r requirements-qlora.txt
!pip install ipywidgets tabulate

# install onnxruntime-genai-cuda
!pip uninstall -y onnxruntime-genai onnxruntime-genai-cuda
!pip install --pre onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/

# install onnxruntime-gpu>=1.20; if it is not available, install the onnxruntime-gpu nightly build
# ort-nightly has been renamed to onnxruntime: https://github.com/microsoft/onnxruntime/issues/22541
!pip uninstall -y onnxruntime onnxruntime-gpu ort-nightly ort-nightly-gpu
!pip install "onnxruntime-gpu>=1.20" || pip install --pre onnxruntime-gpu --extra-index-url=https://pkgs.dev.azure.com/aiinfra/PublicPackages/_packaging/ORT-Nightly/pypi/simple/
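After installation, it can be useful to confirm that the installed ONNX Runtime build satisfies the `>=1.20` requirement. The following is a minimal sketch using only the standard library; the helper names `parse_release`, `meets_minimum`, and `installed_version` are hypothetical, and the parser is a rough approximation that ignores pre-release suffixes such as `.dev...`.

```python
import re
from importlib import metadata


def parse_release(version):
    """Return the leading numeric components, e.g. '1.20.1' -> (1, 20, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version, minimum=(1, 20)):
    """Compare release components only; pre-release suffixes are ignored."""
    return parse_release(version) >= minimum


def installed_version(package="onnxruntime-gpu"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("onnxruntime-gpu is not installed")
else:
    print(version, "OK" if meets_minimum(version) else "is older than 1.20")
```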

### Get access to model and fine-tuning dataset

Get access to the following resources on Hugging Face Hub:
- [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- [nampdn-ai/tiny-codes](https://huggingface.co/nampdn-ai/tiny-codes)

Log in to your Hugging Face account:

In [None]:
from huggingface_hub import login

login()

## Workflow

Olive provides command-line tools to run a LoRA/QLoRA fine-tuning workflow. This workflow includes the following steps:
- `finetune`: Fine-tune a model using LoRA or QLoRA.
- `capture-onnx-graph`: Convert the fine-tuned model to ONNX.
- `generate-adapter`: Extract the adapters from the ONNX model as model inputs.

In [None]:
# run this cell to see the available options for the finetune, capture-onnx-graph and generate-adapter commands
!olive finetune --help
!olive capture-onnx-graph --help
!olive generate-adapter --help
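The three steps chain through their output directories: each command reads the directory written by the previous one. The sketch below only builds and prints the command lines used in this tutorial; it does not run them. The function name `build_workflow_commands` is hypothetical, and the extra `finetune` options (splits, text template, batch sizes) are omitted for brevity.

```python
import shlex


def build_workflow_commands(base_model, dataset, out_root):
    """Build the three olive command lines, wiring each output into the next input."""
    finetune_dir = f"{out_root}/fine-tune"
    onnx_dir = f"{out_root}/onnx"
    extracted_dir = f"{out_root}/extracted"
    return [
        ["olive", "finetune", "--method", "qlora",
         "-m", base_model, "-d", dataset, "-o", finetune_dir],
        ["olive", "capture-onnx-graph", "-m", finetune_dir,
         "--torch_dtype", "float16", "--use_ort_genai", "-o", onnx_dir],
        ["olive", "generate-adapter", "-m", onnx_dir, "-o", extracted_dir],
    ]


commands = build_workflow_commands(
    "meta-llama/Llama-2-7b-hf", "nampdn-ai/tiny-codes", "models/tiny-codes"
)
for cmd in commands:
    print(shlex.join(cmd))
```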

First, fine-tune the Llama 2 model using QLoRA on [nampdn-ai/tiny-codes](https://huggingface.co/datasets/nampdn-ai/tiny-codes) to generate code given a programming language and a prompt.

In [None]:
!CUDA_VISIBLE_DEVICES=0 olive finetune --method qlora \
    -m meta-llama/Llama-2-7b-hf -d nampdn-ai/tiny-codes \
    --train_split "train[:4096]" --eval_split "train[4096:4224]" \
    --text_template "### Language: {programming_language} \n### Question: {prompt} \n### Answer: {response}" \
    --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --max_steps 150 --logging_steps 50 \
    -o models/tiny-codes/fine-tune
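To illustrate what `--text_template` and the split arguments select, here is a minimal sketch. It assumes the placeholders are filled from dataset columns of the same name, in the manner of Python's `str.format`; the sample row is made up for illustration, and the `idx` column is a hypothetical extra column that the template ignores. The split strings follow Python slice semantics, so the first 4096 rows are used for training and the next 128 for evaluation.

```python
TEXT_TEMPLATE = (
    "### Language: {programming_language} \n"
    "### Question: {prompt} \n"
    "### Answer: {response}"
)


def render_example(row, template=TEXT_TEMPLATE):
    """Fill the template from a dataset row; extra columns are ignored."""
    return template.format(**row)


# Hypothetical row, not taken from the dataset.
row = {
    "programming_language": "python",
    "prompt": "Reverse a string.",
    "response": "def reverse(s):\n    return s[::-1]",
    "idx": 0,
}
rendered = render_example(row)
print(rendered)

# "train[:4096]" and "train[4096:4224]" behave like list slices.
all_rows = list(range(5000))  # stand-in for the row indices of the train split
train_rows = all_rows[:4096]
eval_rows = all_rows[4096:4224]
print(len(train_rows), len(eval_rows))
```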

Export the model to ONNX. We can use the output of the previous step as input to this step.

In [None]:
!olive capture-onnx-graph -m models/tiny-codes/fine-tune --torch_dtype float16 --use_ort_genai -o models/tiny-codes/onnx

Finally, extract the adapters from the ONNX model.

In [None]:
!olive generate-adapter -m models/tiny-codes/onnx -o models/tiny-codes/extracted

The output model files can be found at:
- Model: `models/tiny-codes/extracted/model/model.onnx`
- Adapter weights: `models/tiny-codes/extracted/model/adapter_weights.onnx_adapter`
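Before moving on, a quick check that both files exist can save a confusing failure later. This is a small sketch with a hypothetical helper, `missing_outputs`; the relative paths are the ones listed above.

```python
from pathlib import Path

EXPECTED_FILES = (
    "model/model.onnx",
    "model/adapter_weights.onnx_adapter",
)


def missing_outputs(extracted_dir, expected=EXPECTED_FILES):
    """Return the expected files that are not present under extracted_dir."""
    root = Path(extracted_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


missing = missing_outputs("models/tiny-codes/extracted")
if missing:
    print("Missing output files:", ", ".join(missing))
else:
    print("All expected output files are present.")
```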

## Export Pre-existing Adapters

Olive provides a standalone `convert-adapters` command to export fine-tuned adapters from a pre-existing repository on Hugging Face Hub or from a local directory. The adapters must have been fine-tuned on the same base model, with the same configuration, as the model obtained from the previous step.

Let's export the adapters from [Mikael110/llama-2-7b-guanaco-qlora](https://huggingface.co/Mikael110/llama-2-7b-guanaco-qlora):

In [None]:
# run this cell to see the available options for the convert-adapters command
!olive convert-adapters --help

In [None]:
!olive convert-adapters --adapter_path Mikael110/llama-2-7b-guanaco-qlora --output_path models/exported/guanaco_qlora --dtype float16

## Deploy Model with Multiple Adapters

We can now deploy the same model with multiple adapters for different tasks by loading the adapter weights independently of the model and providing the relevant weights as input at inference time.

In [None]:
base_model_name = "meta-llama/Llama-2-7b-hf"
model_path = "models/tiny-codes/extracted/model/model.onnx"
adapters = {
    "guanaco": {
        "weights": "models/exported/guanaco_qlora.onnx_adapter",
        "template": "### Human: {prompt} ### Assistant:"
    },
    "tiny-codes": {
        "weights": "models/tiny-codes/extracted/model/adapter_weights.onnx_adapter",
        "template": "### Language: {prompt_0} \n### Question: {prompt_1} \n### Answer: "
    }
}
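The two templates use different placeholders: `{prompt}` for a single string and `{prompt_0}`, `{prompt_1}` for a tuple of strings. The sketch below shows one way such templates could be filled. `render_prompt` is a hypothetical stand-in written for illustration; the helper actually used by this tutorial lives in `generator.py` and may behave differently.

```python
GUANACO_TEMPLATE = "### Human: {prompt} ### Assistant:"
TINY_CODES_TEMPLATE = "### Language: {prompt_0} \n### Question: {prompt_1} \n### Answer: "


def render_prompt(template, prompt):
    """Fill {prompt} for a string, or {prompt_0}, {prompt_1}, ... for a sequence."""
    if isinstance(prompt, str):
        return template.format(prompt=prompt)
    fields = {f"prompt_{i}": part for i, part in enumerate(prompt)}
    return template.format(**fields)


chat_text = render_prompt(GUANACO_TEMPLATE, "What time is it?")
code_text = render_prompt(TINY_CODES_TEMPLATE, ("python", "Sum a list."))
print(chat_text)
print(code_text)
```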

### Custom Generate Loop


We implemented an example class `ORTGenerator` in [generator.py](../utils/generator.py) that loads the model and adapters and generates text from a prompt. If your execution provider supports IO Binding, we recommend using it for better performance, since the adapter weights are then preallocated in device memory.

In [None]:
import sys
from pathlib import Path

# add the utils directory to the path
sys.path.append(str(Path().resolve().parent / "utils"))

from generator import ORTGenerator
from transformers import AutoTokenizer

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# load the generator
generator = ORTGenerator(model_path, tokenizer, execution_provider="CUDAExecutionProvider", device_id=0, adapters=adapters)
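To make the idea of "adapter weights as model inputs" concrete, here is a toy sketch in plain Python. It assumes the standard LoRA formulation, where a layer computes `x @ W + scale * (x @ A) @ B` with a shared base weight `W` and per-adapter matrices `A` and `B`. All names, shapes, and values are illustrative and are not taken from the exported ONNX graph or from `generator.py`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_linear(x, base_weight, lora_a, lora_b, scale=1.0):
    """Base projection plus a low-rank update supplied at call time."""
    base_out = matmul(x, base_weight)
    delta = matmul(matmul(x, lora_a), lora_b)
    return [
        [b + scale * d for b, d in zip(base_row, delta_row)]
        for base_row, delta_row in zip(base_out, delta)
    ]


# One shared base weight, loaded once.
base_weight = [[1.0, 0.0], [0.0, 1.0]]

# Two rank-1 adapters, swapped in per request without touching the base weight.
toy_adapters = {
    "adapter_one": {"lora_a": [[1.0], [0.0]], "lora_b": [[0.0, 2.0]]},
    "adapter_two": {"lora_a": [[0.0], [1.0]], "lora_b": [[3.0, 0.0]]},
}

x = [[1.0, 1.0]]
outputs = {
    name: lora_linear(x, base_weight, **weights)
    for name, weights in toy_adapters.items()
}
print(outputs)
```

The same input produces different outputs depending on which adapter matrices are passed in, while the base weight is never modified.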

#### Generate with Guanaco Adapters

In [None]:
prompt = "What time is it?"
response = generator.generate(prompt, adapter="guanaco", max_gen_len=200, use_io_binding=True)
print(response)

### Human: What time is it? ### Assistant: I'm sorry, but as an AI language model, I do not have access to real-time information.

However, I can try to estimate the current time based on the context of your question and my knowledge of the current time zone.

In general, the current time can vary depending on your location and the time zone you are in.

If you would like to know the current time for a specific location, you can try searching for the time zone for that


#### Generate with Tiny Codes Adapters

In [None]:
for language in ["python", "javascript"]:
    prompt = (language, "Calculate the sum of all even numbers in a list.")
    response = generator.generate(prompt, adapter="tiny-codes", max_gen_len=150, use_io_binding=True)
    print(response, end="\n\n")

### Language: python 
### Question: Calculate the sum of all even numbers in a list. 
### Answer: 
```python 
def sum_even(lst):
    """
    Calculates the sum of all even numbers in a list
    
    Args:
        lst (list): A list containing numbers
        
    Returns:
        float: The sum of all even numbers in the list
    """ 
    total = 0
    for num in lst:
        if num % 2 == 0:
            total += num
    
    return total
``` 

### Language: javascript 
### Question: Calculate the sum of all even numbers in a list. 
### Answer: 
```javascript 
function calculateSumOfEvenNumbers(list) {
  let sum = 0;

  for (let i = 0; i < list.length; i++) {
    if (list[i] % 2 === 0) {
      sum += list[i];
    }
  }

  return sum;
}
``` 



### ONNX Runtime generate() API

The [ONNX Runtime generate() API](https://github.com/microsoft/onnxruntime-genai) also supports loading multiple adapters for inference. The adapters are loaded once into an `Adapters` object, and during generation the adapter to use is selected with the `Generator`'s `set_active_adapter` method.

In [None]:
from pathlib import Path

from generator import apply_template
import onnxruntime_genai as og

def generate(model, tokenizer, og_adapters, adapter_name, prompt, template, max_gen_len=100):
    params = og.GeneratorParams(model)
    # model doesn't have GQA nodes so we can't use the share buffer option
    params.set_search_options(max_length=max_gen_len, past_present_share_buffer=False)

    # create the generator
    og_generator = og.Generator(model, params)
    og_generator.set_active_adapter(og_adapters, adapter_name)
    og_generator.append_tokens(tokenizer.encode(apply_template(template, prompt)))

    # generate response
    while not og_generator.is_done():
        og_generator.generate_next_token()
    output_tokens = og_generator.get_sequence(0)
    return tokenizer.decode(output_tokens)

model_dir = str(Path(model_path).parent)
model = og.Model(model_dir)
og_adapters = og.Adapters(model)
for key, value in adapters.items():
    og_adapters.load(value["weights"], key)
tokenizer = og.Tokenizer(model)

#### Generate with Guanaco Adapters

In [None]:
prompt = "What time is it?"
response = generate(model, tokenizer, og_adapters, "guanaco", prompt, adapters["guanaco"]["template"], max_gen_len=200)
print(response)

### Human: What time is it? ### Assistant: I'm sorry, but as an AI language model, I do not have access to real-time information.

However, I can try to estimate the current time based on the context of your question and my knowledge of the current time zone.

In general, the current time can vary depending on your location and the time zone you are in.

If you would like to know the current time for a


#### Generate with Tiny Codes Adapters

In [None]:
for language in ["python", "javascript"]:
    prompt = (language, "Calculate the sum of all even numbers in a list.")
    response = generate(model, tokenizer, og_adapters, "tiny-codes", prompt, adapters["tiny-codes"]["template"], max_gen_len=150)
    print(response, end="\n\n")

### Language: python 
### Question: Calculate the sum of all even numbers in a list. 
### Answer: 
```python 
def sum_even(lst):
    """
    Calculates the sum of all even numbers in a list
    
    Args:
        lst (list): A list containing numbers
        
    Returns:
        float: The sum of all even numbers in the list
    """ 
    total = 0
    for num in lst:
        if num % 2 == 0:
            total += num
    
    return total
``` 

### Language: javascript 
### Question: Calculate the sum of all even numbers in a list. 
### Answer: 
```javascript 
function calculateSumOfEvenNumbers(list) {
  let sum = 0;

  for (let i = 0; i < list.length; i++) {
    if (list[i] % 2 === 0) {
      sum += list[i];
    }
  }

  return sum;
}
``` 

