This folder contains examples of Open LLaMA workflows. Open LLaMA comes in several variants that can be used for optimization. The following table shows a few of the models' configurations:
| Model | Num Hidden Layers | Num Attention Heads | Hidden Size |
|---|---|---|---|
| openlm-research/open_llama_3b | 26 | 32 | 3200 |
| openlm-research/open_llama_7b | 32 | 32 | 4096 |
| openlm-research/open_llama_13b | 40 | 40 | 5120 |
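These values come straight from each model's Hugging Face config; a quick sketch to look them up for any variant:

```python
from transformers import AutoConfig

# Print the table values directly from each model's Hugging Face config.
for model_id in (
    "openlm-research/open_llama_3b",
    "openlm-research/open_llama_7b",
    "openlm-research/open_llama_13b",
):
    cfg = AutoConfig.from_pretrained(model_id)
    print(model_id, cfg.num_hidden_layers, cfg.num_attention_heads, cfg.hidden_size)
```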
## Inference optimization workflows
- GPU: With Optimum conversion and merging, and ORT optimizations, for an optimized ONNX model
- GPU: With SparseGPT and TorchTRT conversion for an optimized PyTorch model with sparsity
- AzureML compute: With Optimum conversion and merging, and ORT optimizations, in AzureML
- CPU: With Optimum conversion and merging, ORT optimizations, and Intel® Neural Compressor 4-bit weight-only quantization for an optimized INT4 ONNX model
## Fine-tune optimization workflows
- Using LoRA in PyTorch for model fine-tuning on a chatbot dataset
- Using LoRA/QLoRA/LoftQ in PyTorch for model fine-tuning on a code generation dataset
- Using QLoRA in ONNX Runtime Training for model fine-tuning
Go to [How to run](#how-to-run)
Note that these example configs use openlm-research/open_llama_3b for demonstration purposes.
This workflow also demonstrates how to use:
- Hugging Face `transformers` to load the model from the model hub.
- Hugging Face `optimum` to convert and merge generative models (see the standalone sketch below).
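For reference, a comparable conversion can be done directly with Optimum's Python API; a minimal sketch (the Olive pass automates this, including merging, inside the workflow, and the output directory name is illustrative):

```python
# Export Open LLaMA to ONNX with Optimum directly; export=True triggers
# the ONNX export of the decoder components.
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_3b", export=True
)
model.save_pretrained("open_llama_3b_onnx")  # hypothetical output directory
```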
This example config file, `open_llama_config.json`, is meant to be a starting point for optimizing Open LLaMA for the target hardware. One can add additional passes as well as set different options for the Transformer Optimization pass as needed. See the Olive documentation for more information on the available optimization passes.
Requirements file: requirements.txt
When you run the example config for other, larger models, you may need to:
- change the `model_path` to the one you use in `open_llama_config.json` and `user_script.py`:
."input_model":{ "type": "OptimumModel", "config": { "model_path": "openlm-research/open_llama_3b", // to change based on the model you use "model_components": ["decoder_model.onnx", "decoder_with_past_model.onnx"], "hf_config": { "model_class": "LlamaForCausalLM" } } }
```python
import torch
from transformers import AutoConfig

from olive.constants import Framework

model_id = "openlm-research/open_llama_3b"  # to change based on the model you use
config = AutoConfig.from_pretrained(model_id)
```
- change the transformer optimization pass options in `open_llama_config.json` based on the above table:

```json
"optimize": {
    "type": "OrtTransformersOptimization",
    "config": {
        "model_type": "gpt2",
        "float16": true,
        "use_gpu": false,
        "keep_io_types": true,
        "num_heads": 32, // to change based on the model you use
        "hidden_size": 4096, // to change based on the model you use
        "optimization_options": {
            "use_multi_head_attention": false
        }
    }
}
```
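If you'd rather not edit the JSON by hand, the relevant fields can be patched programmatically; a minimal sketch, assuming the config file is plain JSON (no `//` comments) and the pass is registered under `passes.optimize` as in this example:

```python
import json

from transformers import AutoConfig

model_id = "openlm-research/open_llama_7b"  # the model you want to optimize
cfg = AutoConfig.from_pretrained(model_id)

with open("open_llama_config.json") as f:
    olive_config = json.load(f)

# Point the workflow at the new model and align the optimization options
# with its architecture (see the table above).
olive_config["input_model"]["config"]["model_path"] = model_id
opt = olive_config["passes"]["optimize"]["config"]
opt["num_heads"] = cfg.num_attention_heads
opt["hidden_size"] = cfg.hidden_size

with open("open_llama_config.json", "w") as f:
    json.dump(olive_config, f, indent=4)
```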
This workflow sparsifies the Open LLaMA model using SparseGPT. The output model is still a transformers PyTorch model, but with the layer weights sparsified. The given config has `sparsity` set to `[2,4]` for a structured 2:4 sparsity pattern, but this can be changed to other sparsity patterns such as `0.5` for 50% unstructured sparsity or `[4,8]` for a 4:8 structured sparsity pattern.
To take advantage of the sparsity using TensorRT, the sparse `torch.nn.Linear` modules in the transformer layers are then converted to `TRTModule` from `torch-tensorrt` with fp16 precision and sparsity enabled. This is done using the `TorchTRTConversion` pass in Olive, which saves the entire model. The saved model can then be loaded using `torch.load`, but this requires Olive to be installed. Inference is done like a normal PyTorch model.
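For example, a minimal inference sketch (the output path is hypothetical; Olive reports the actual location of the saved model at the end of the run):

```python
import torch
from transformers import AutoTokenizer

# Loading the saved model requires olive-ai to be installed in the environment.
model = torch.load("models/open_llama_sparsegpt_gpu/output_model.pt")  # hypothetical path
model.eval()

tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")
inputs = tokenizer("Q: What is the largest animal?\nA:", return_tensors="pt").to("cuda")

# Inference works like any other PyTorch causal LM.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```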
The relevant config file is `open_llama_sparsegpt_gpu.json`.
Requirements file: requirements-sparsegpt.txt
This workflow optimizes the Open LLaMA model on AzureML compute and evaluates the output models on your device. Please connect your device to Azure Arc by following these instructions: Self-hosted Kubernetes cluster
This example config file is `open_llama_arc.json`.
Requirements file: requirements-arc.txt
This workflow compresses the Open LLaMA model with 4-bit weight-only quantization (WOQ) using Intel® Neural Compressor, and evaluates accuracy and perplexity on the lambada_openai dataset.
This example config file is `open_llama_inc_woq.json`.
Requirements file: requirements-woq.txt. To skip installing the LLM runtime of `intel-extension-for-transformers`, install with this command: `SKIP_RUNTIME=True pip install -r requirements-woq.txt`
To use Intel® Neural Compressor 4-bit weight-only quantization, please install `neural-compressor>=2.3`. Weight-only quantization in Intel® Neural Compressor is still under development, so we encourage you to use the master branch to access the latest features. Please check the link for installing neural-compressor from source.
4-bit weight-only quantization supports two algorithms:
- Round-to-nearest (RTN) is the most straightforward way to quantize weights using scale maps.
- The GPTQ algorithm provides more accurate quantization but requires more computational resources.
To compress the model with 4-bit weight-only quantization, you may need to:
- set `approach` to `weight_only`, and set `algorithm` to `GPTQ` or `RTN` in `weight_only_config`:
"quantization": {
"type": "IncStaticQuantization",
"config": {
"user_script": "user_script.py",
"approach": "weight_only",
"weight_only_config":{
"algorithm": "RTN"
}
}
}
- if the `GPTQ` algorithm is used, you need to provide a calibration dataloader, which outputs input data and labels:
"quantization": {
"type": "IncStaticQuantization",
"config": {
"user_script": "user_script.py",
"approach": "weight_only",
"weight_only_config":{
"algorithm": "GPTQ"
}
},
"dataloader_func": "calib_dataloader",
}
```python
class CalibDataloader:
    def __init__(self, batch_size, **kwargs):
        self.batch_size = batch_size
        self.dataset = []
        # operations to add (input_data, label) pairs into self.dataset

    def __iter__(self):
        for input_data, label in self.dataset:
            yield input_data, label
```
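A minimal sketch of how `calib_dataloader` in user_script.py could build such a dataloader using the class above; the dataset slice and tokenization details here are illustrative, not the exact ones used by this example:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

def calib_dataloader(data_dir, batch_size, *args, **kwargs):
    tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")
    # A small slice of representative text is enough for GPTQ calibration.
    samples = load_dataset("EleutherAI/lambada_openai", split="test[:64]")
    loader = CalibDataloader(batch_size)
    for sample in samples:
        input_ids = tokenizer(sample["text"], return_tensors="np").input_ids
        # GPTQ only needs representative inputs; the label is unused here.
        loader.dataset.append((input_ids, 0))
    return loader
```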
The following table shows the accuracy and perplexity results of Open LLaMA models evaluated on the lambada_openai task. `GPTQ W4G32Asym` in the Configuration column means the GPTQ algorithm is used for 4-bit weight-only quantization, with `group_size=32` and `scheme=asym`.
| Model name | Configuration | Accuracy | Perplexity | Accuracy Ratio [WOQ/FP32] |
|---|---|---|---|---|
| openlm-research/open_llama_3b | FP32 | 0.6647 | 4.8445 | / |
| openlm-research/open_llama_3b | GPTQ W4G32Asym | 0.6569 | 4.9937 | 98.82% |
| openlm-research/open_llama_7b | FP32 | 0.7041 | 3.9686 | / |
| openlm-research/open_llama_7b | RTN W4G32Asym | 0.6887 | 4.1749 | 97.81% |
| openlm-research/open_llama_13b | FP32 | 0.7213 | 3.5750 | / |
| openlm-research/open_llama_13b | RTN W4G32Sym | 0.7169 | 3.7339 | 99.39% |
Note: The above results are obtained with `onnxruntime==1.16.3`, which supports the `MatMulNBits` op. Tested on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz.
This workflow fine-tunes the LLaMA model using QLoRA. The output model is still the input transformers model, along with a quantization config and LoRA adapters that were fine-tuned on the training dataset.
The relevant config file is `llama_qlora.json`. It corresponds to the guanaco 7b example in the original QLoRA implementation.
Requirements file: requirements-lora.txt
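A minimal sketch of using the output, assuming the adapters are saved in peft format (the base model id and paths are illustrative; Olive reports the actual output location):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, then attach the fine-tuned LoRA adapters.
base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16  # illustrative base model id
)
model = PeftModel.from_pretrained(base, "models/llama_qlora/adapter")  # hypothetical path
model.eval()
```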
- With LoRA. This workflow fine-tunes the Open LLaMA model using LoRA to generate code given a prompt. The relevant config file is `open_llama_lora_tinycodes.json`.
- With QLoRA. This workflow fine-tunes the Open LLaMA model using QLoRA to generate code given a prompt. The relevant config file is `open_llama_qlora_tinycodes.json`.
- With LoftQ. This workflow fine-tunes the Open LLaMA model using LoftQ to generate code given a prompt. The relevant config file is `open_llama_loftq_tinycodes.json`.
Requirements file: requirements-lora.txt
The code language is set to `Python` but can be changed to other languages by changing the `language` field in the config file.
Supported languages are Python, TypeScript, JavaScript, Ruby, Julia, Rust, C++, Bash, Java, C#, and Go. Refer to the dataset card for more details on the dataset.
Note: You must first request access to the nampdn-ai/tiny-codes dataset. Then log in to HuggingFace on your machine using `huggingface-cli login`, or update the `token` field in the config file with your HuggingFace token.
You can also train the model using ONNX Runtime Training. The relevant config file is `open_llama_qlora_ort_tinycodes.json`. Requirements file: requirements-qlora-ort.txt
It also requires the latest version of onnxruntime-training:

```bash
python -m pip uninstall -y onnxruntime onnxruntime-gpu ort-nightly ort-nightly-gpu
python -m pip install onnxruntime-training --pre --upgrade --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/
```
Configure `torch-ort`:

```bash
python -m torch_ort.configure
```
## How to run

Install the necessary Python packages using the corresponding requirements file:

```bash
python -m pip install -r <requirements_file>.txt
```
The optimization techniques to run are specified in the relevant config json file.
First, install the required packages according to the passes:

```bash
olive run --config <config_file>.json --setup
```
Then, optimize the model:

```bash
olive run --config <config_file>.json
```
or simply run with Python code:

```python
from olive.workflows import run as olive_run

olive_run("<config_file>.json")
```
After running the above command, the model candidates and corresponding config will be saved in the output directory.