This folder contains examples of Open LLaMA workflows. Open LLaMA comes in several variants that can be used for optimization. The following table shows a few of the models' configurations:
| Model | Num Hidden Layers | Num Attention Heads | Hidden Size |
|---|---|---|---|
| openlm-research/open_llama_3b | 26 | 32 | 3200 |
| openlm-research/open_llama_7b | 32 | 32 | 4096 |
| openlm-research/open_llama_13b | 40 | 40 | 5120 |
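These values come straight from each model's Hugging Face config; a quick sketch to look them up for any variant:

```python
from transformers import AutoConfig

# Print the table values directly from each model's Hugging Face config.
for model_id in (
    "openlm-research/open_llama_3b",
    "openlm-research/open_llama_7b",
    "openlm-research/open_llama_13b",
):
    cfg = AutoConfig.from_pretrained(model_id)
    print(model_id, cfg.num_hidden_layers, cfg.num_attention_heads, cfg.hidden_size)
```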
## Inference optimization workflows
- GPU: With Optimum conversion and merging, and ORT optimizations, for an optimized ONNX model
- GPU: With SparseGPT and TorchTRT conversion for an optimized PyTorch model with sparsity
- AzureML compute: With Optimum conversion and merging, and ORT optimizations, in AzureML
- CPU: With Optimum conversion and merging, ORT optimizations, and Intel® Neural Compressor 4-bit weight-only quantization for an optimized INT4 ONNX model
## Fine-tune optimization workflows
- Using LoRA in PyTorch for model fine-tuning on a chatbot dataset
- Using LoRA/QLoRA/LoftQ in PyTorch for model fine-tuning on a code generation dataset
- Using QLoRA in ONNX Runtime Training for model fine-tuning
Go to [How to run](#how-to-run)
Note that these example configs use openlm-research/open_llama_3b for demonstration purposes.
This workflow also demonstrates how to use:
- Hugging Face `transformers` to load the model from the model hub.
- Hugging Face `optimum` to convert and merge generative models (see the standalone sketch below).
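For reference, a comparable conversion can be done directly with Optimum's Python API; a minimal sketch (the Olive pass automates this, including merging, inside the workflow, and the output directory name is illustrative):

```python
# Export Open LLaMA to ONNX with Optimum directly; export=True triggers
# the ONNX export of the decoder components.
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_3b", export=True
)
model.save_pretrained("open_llama_3b_onnx")  # hypothetical output directory
```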
This example config file, `open_llama_config.json`, is meant to be a starting point for optimizing Open LLaMA for the target hardware. One can add additional passes as well as set different options for the Transformer Optimization pass as needed. See the Olive documentation for more information on the available optimization passes.
Requirements file: requirements.txt
When you run the example config for other, larger models, you may need to:
- change the `model_path` to the one you use in `open_llama_config.json` and `user_script.py`:
."input_model":{ "type": "OptimumModel", "config": { "model_path": "openlm-research/open_llama_3b", // to change based on the model you use "model_components": ["decoder_model.onnx", "decoder_with_past_model.onnx"], "hf_config": { "model_class": "LlamaForCausalLM" } } }
```python
import torch
from transformers import AutoConfig

from olive.constants import Framework

model_id = "openlm-research/open_llama_3b"  # to change based on the model you use
config = AutoConfig.from_pretrained(model_id)
```
- change the transformer optimization pass options in `open_llama_config.json` based on the above table:

```json
"optimize": {
    "type": "OrtTransformersOptimization",
    "config": {
        "model_type": "gpt2",
        "float16": true,
        "use_gpu": false,
        "keep_io_types": true,
        "num_heads": 32, // to change based on the model you use
        "hidden_size": 4096, // to change based on the model you use
        "optimization_options": {
            "use_multi_head_attention": false
        }
    }
}
```
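If you'd rather not edit the JSON by hand, the relevant fields can be patched programmatically; a minimal sketch, assuming the config file is plain JSON (no `//` comments) and the pass is registered under `passes.optimize` as in this example:

```python
import json

from transformers import AutoConfig

model_id = "openlm-research/open_llama_7b"  # the model you want to optimize
cfg = AutoConfig.from_pretrained(model_id)

with open("open_llama_config.json") as f:
    olive_config = json.load(f)

# Point the workflow at the new model and align the optimization options
# with its architecture (see the table above).
olive_config["input_model"]["config"]["model_path"] = model_id
opt = olive_config["passes"]["optimize"]["config"]
opt["num_heads"] = cfg.num_attention_heads
opt["hidden_size"] = cfg.hidden_size

with open("open_llama_config.json", "w") as f:
    json.dump(olive_config, f, indent=4)
```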
This workflow sparsifies the Open LLaMA model using SparseGPT. The output model is still a transformers PyTorch model, but with the layer weights sparsified. The given config has `sparsity` set to `[2,4]` for a structured 2:4 sparsity pattern, but this can be changed to other sparsity patterns such as `0.5` for 50% unstructured sparsity or `[4,8]` for a 4:8 structured sparsity pattern.
To take advantage of the sparsity using TensorRT, the sparse `torch.nn.Linear` modules in the transformer layers are then converted to `TRTModule` from `torch-tensorrt` with fp16 precision and sparsity enabled. This is done using the `TorchTRTConversion` pass in Olive, which saves the entire model. The saved model can then be loaded using `torch.load`, but this requires Olive to be installed. Inference is done like a normal PyTorch model.
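For example, a minimal inference sketch (the output path is hypothetical; Olive reports the actual location of the saved model at the end of the run):

```python
import torch
from transformers import AutoTokenizer

# Loading the saved model requires olive-ai to be installed in the environment.
model = torch.load("models/open_llama_sparsegpt_gpu/output_model.pt")  # hypothetical path
model.eval()

tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")
inputs = tokenizer("Q: What is the largest animal?\nA:", return_tensors="pt").to("cuda")

# Inference works like any other PyTorch causal LM.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```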
The relevant config file is `open_llama_sparsegpt_gpu.json`.
Requirements file: requirements-sparsegpt.txt
This workflow optimizes the Open LLaMA model on AzureML compute and evaluates the output models on your device. Please connect your device to Azure Arc by following these instructions: Self-hosted Kubernetes cluster
This example config file is `open_llama_arc.json`.
Requirements file: requirements-arc.txt
This workflow compresses the Open LLaMA model with 4-bit weight-only quantization (WOQ) using Intel® Neural Compressor, and evaluates accuracy and perplexity on the lambada_openai dataset.
This example config file is `open_llama_inc_woq.json`.
Requirements file: requirements-woq.txt. To skip installing the LLM runtime of `intel-extension-for-transformers`, install with this command: `SKIP_RUNTIME=True pip install -r requirements-woq.txt`
To use Intel® Neural Compressor 4-bit weight-only quantization, please install `neural-compressor>=2.3`. Weight-only quantization in Intel® Neural Compressor is still under development, so we encourage you to use the master branch to access the latest features. Please check the link for installing neural-compressor from source.
4-bit weight-only quantization supports two algorithms:
- Round-to-nearest (RTN) is the most straightforward way to quantize weights using scale maps.
- The GPTQ algorithm provides more accurate quantization but requires more computational resources.
To compress the model with 4-bit weight-only quantization, you may need to:
- set `approach` to `weight_only`, and set `algorithm` to `GPTQ` or `RTN` in `weight_only_config`:
"quantization": {
"type": "IncStaticQuantization",
"config": {
"user_script": "user_script.py",
"approach": "weight_only",
"weight_only_config":{
"algorithm": "RTN"
}
}
}
- if the `GPTQ` algorithm is used, you need to provide a calibration dataloader, which outputs input data and labels:
"quantization": {
"type": "IncStaticQuantization",
"config": {
"user_script": "user_script.py",
"approach": "weight_only",
"weight_only_config":{
"algorithm": "GPTQ"
}
},
"dataloader_func": "calib_dataloader",
}
```python
class CalibDataloader:
    def __init__(self, batch_size, **kwargs):
        self.batch_size = batch_size
        self.dataset = []
        # operations to add (input_data, label) pairs into self.dataset

    def __iter__(self):
        for input_data, label in self.dataset:
            yield input_data, label
```
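A minimal sketch of how `calib_dataloader` in user_script.py could build such a dataloader using the class above; the dataset slice and tokenization details here are illustrative, not the exact ones used by this example:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

def calib_dataloader(data_dir, batch_size, *args, **kwargs):
    tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")
    # A small slice of representative text is enough for GPTQ calibration.
    samples = load_dataset("EleutherAI/lambada_openai", split="test[:64]")
    loader = CalibDataloader(batch_size)
    for sample in samples:
        input_ids = tokenizer(sample["text"], return_tensors="np").input_ids
        # GPTQ only needs representative inputs; the label is unused here.
        loader.dataset.append((input_ids, 0))
    return loader
```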
The following table shows the accuracy and perplexity results of Open LLaMA models evaluated on the lambada_openai task. `GPTQ W4G32Asym` in the Configuration column means the GPTQ algorithm is used for 4-bit weight-only quantization, with `group_size=32` and `scheme=asym`.
| Model name | Configuration | Accuracy | Perplexity | Accuracy Ratio [WOQ/FP32] |
|---|---|---|---|---|
| openlm-research/open_llama_3b | FP32 | 0.6647 | 4.8445 | / |
| openlm-research/open_llama_3b | GPTQ W4G32Asym | 0.6569 | 4.9937 | 98.82% |
| openlm-research/open_llama_7b | FP32 | 0.7041 | 3.9686 | / |
| openlm-research/open_llama_7b | RTN W4G32Asym | 0.6887 | 4.1749 | 97.81% |
| openlm-research/open_llama_13b | FP32 | 0.7213 | 3.5750 | / |
| openlm-research/open_llama_13b | RTN W4G32Sym | 0.7169 | 3.7339 | 99.39% |
Note: The above results are obtained with `onnxruntime==1.16.3`, which supports the `MatMulNBits` op. Tested on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz.
This workflow fine-tunes the LLaMA model using QLoRA. The output model is still the input transformers model, along with a quantization config and LoRA adapters that were fine-tuned on the training dataset.
The relevant config file is `llama_qlora.json`. It corresponds to the guanaco 7b example in the original QLoRA implementation.
Requirements file: requirements-lora.txt
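A minimal sketch of using the output, assuming the adapters are saved in peft format (the base model id and paths are illustrative; Olive reports the actual output location):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, then attach the fine-tuned LoRA adapters.
base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16  # illustrative base model id
)
model = PeftModel.from_pretrained(base, "models/llama_qlora/adapter")  # hypothetical path
model.eval()
```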
- With LoRA. This workflow fine-tunes the Open LLaMA model using LoRA to generate code given a prompt. The relevant config file is `open_llama_lora_tinycodes.json`.
- With QLoRA. This workflow fine-tunes the Open LLaMA model using QLoRA to generate code given a prompt. The relevant config file is `open_llama_qlora_tinycodes.json`.
- With LoftQ. This workflow fine-tunes the Open LLaMA model using LoftQ to generate code given a prompt. The relevant config file is `open_llama_loftq_tinycodes.json`.
Requirements file: requirements-lora.txt
The code language is set to `Python` but can be changed to other languages by changing the `language` field in the config file.
Supported languages are Python, TypeScript, JavaScript, Ruby, Julia, Rust, C++, Bash, Java, C#, and Go. Refer to the dataset card for more details on the dataset.
Note: You must first request access to the nampdn-ai/tiny-codes dataset. Then log in to HuggingFace on your machine using `huggingface-cli login`, or update the `token` field in the config file with your HuggingFace token.
You can also train the model using ONNX Runtime Training. The relevant config file is `open_llama_qlora_ort_tinycodes.json`. Requirements file: requirements-qlora-ort.txt
It also requires the latest version of onnxruntime-training:

```bash
python -m pip uninstall -y onnxruntime onnxruntime-gpu ort-nightly ort-nightly-gpu
python -m pip install onnxruntime-training --pre --upgrade --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/
```
Configure `torch-ort`:

```bash
python -m torch_ort.configure
```
## How to run

Install the necessary Python packages using the corresponding requirements file:

```bash
python -m pip install -r <requirements_file>.txt
```
The optimization techniques to run are specified in the relevant config json file.
First, install the required packages according to the passes:

```bash
olive run --config <config_file>.json --setup
```
Then, optimize the model:

```bash
olive run --config <config_file>.json
```
or simply run with Python code:

```python
from olive.workflows import run as olive_run

olive_run("<config_file>.json")
```
After running the above command, the model candidates and corresponding config will be saved in the output directory.