Open LLaMA Optimization

Overview

This folder contains examples of Open LLaMA workflows. Open LLaMA is available in several variants that can be optimized. The following table shows the configurations of a few of these models:

| Model | Num Hidden Layers | Num Attention Heads | Hidden Size |
|---|---|---|---|
| openlm-research/open_llama_3b | 26 | 32 | 3200 |
| openlm-research/open_llama_7b | 32 | 32 | 4096 |
| openlm-research/open_llama_13b | 40 | 40 | 5120 |

  • Inference optimization workflows
  • Fine-tune optimization workflows
  • Go to How to run

Inference Optimization Workflows

Note that these example configs use openlm-research/open_llama_3b for demonstration purposes.

Convert, Optimize and Merge Open LLaMA Model for GPUs

This workflow also demonstrates how to use:

  • Hugging Face transformers to load the model from the model hub.
  • Hugging Face optimum to convert and merge generative models.

This example config file, open_llama_config.json, is meant to be a starting point for optimizing Open LLaMA for the target hardware. You can add additional passes and set different options for the Transformer Optimization pass as needed. See the Olive documentation for more information on the available optimization passes.

Requirements file: requirements.txt

When you run the example config for other, larger models, you may need to:

  1. Change the model_path in open_llama_config.json and user_script.py to the model you use:
    "input_model":{
        "type": "OptimumModel",
        "config": {
            "model_path": "openlm-research/open_llama_3b", // to change based on the model you use
            "model_components": ["decoder_model.onnx", "decoder_with_past_model.onnx"],
            "hf_config": {
                "model_class": "LlamaForCausalLM"
            }
        }
    }
    import torch
    from transformers import AutoConfig
    
    from olive.constants import Framework
    
    model_id = "openlm-research/open_llama_3b" # to change based on the model you use
    config = AutoConfig.from_pretrained(model_id)
  2. Change the transformer optimization pass options in open_llama_config.json based on the table above (see the sketch after this list for deriving these values programmatically):
    "optimize": {
        "type": "OrtTransformersOptimization",
        "config": {
            "model_type": "gpt2",
            "float16": true,
            "use_gpu": false,
            "keep_io_types": true,
            "num_heads": 32, // to change based on the model you use
            "hidden_size": 4096, // to change based on the model you use
            "optimization_options": {
                "use_multi_head_attention": false
            }
        }
    }
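Rather than copying num_heads and hidden_size by hand from the table above, they can also be read from the Hugging Face model config. A minimal sketch, assuming only that the chosen model id is available on the model hub:

from transformers import AutoConfig

model_id = "openlm-research/open_llama_3b"  # change to the model you use
config = AutoConfig.from_pretrained(model_id)

# These attributes correspond to the pass options and the table above.
print("num_heads:", config.num_attention_heads)        # 32 for open_llama_3b
print("hidden_size:", config.hidden_size)              # 3200 for open_llama_3b
print("num_hidden_layers:", config.num_hidden_layers)  # 26 for open_llama_3b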

Sparsify Open LLaMA Model using SparseGPT for GPUs

This workflow sparsifies the Open LLaMA model using SparseGPT. The output model is still a transformers PyTorch model, but with the layer weights sparsified. The given config has sparsity set to [2,4] for a structured 2:4 sparsity pattern, but it can be changed to other sparsity patterns, such as 0.5 for 50% unstructured sparsity or [4,8] for a 4:8 structured sparsity pattern.

To take advantage of the sparsity using TensorRT, the sparse torch.nn.Linear modules in the transformer layers are then converted to TRTModule from torch-tensorrt with fp16 precision and sparsity enabled. This is done using the TorchTRTConversion pass in Olive, which saves the entire model. The saved model can then be loaded using torch.load, but this requires Olive to be installed. Inference then works like a normal PyTorch model.
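As a minimal sketch of what inference with the saved model could look like (the output path, prompt, and generation settings are assumptions, not taken from the example scripts):

import torch
from transformers import AutoTokenizer

# Load the model saved by the TorchTRTConversion pass; requires olive-ai to be installed.
# "path/to/output/model.pt" is a placeholder for the actual Olive output path.
model = torch.load("path/to/output/model.pt")
model.eval()

tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")
inputs = tokenizer("Q: What is the largest animal?\nA:", return_tensors="pt").to("cuda")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))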

The relevant config file is open_llama_sparsegpt_gpu.json.

Requirements file: requirements-sparsegpt.txt

Optimizing Open Llama Model with Azure Arc

This workflow optimizes the Open Llama model on Azure ML compute and evaluates the output models on your device. Please connect your device to Azure Arc by following the instructions in Self-hosted Kubernetes cluster.

This example config file is open_llama_arc.json.

Requirements file: requirements-arc.txt

Compress Open Llama Model with Intel® Neural Compressor 4-bit Weight-only Quantization

This workflow compresses the Open Llama model with 4-bit weight-only quantization (WOQ) using Intel® Neural Compressor, and evaluates accuracy and perplexity on the lambada_openai dataset.

This example config file is open_llama_inc_woq.json.

Requirements file: requirements-woq.txt. To skip installing the LLM runtime of intel-extension-for-transformers, use this command: SKIP_RUNTIME=True pip install -r requirements-woq.txt

Prerequisites

To use Intel® Neural Compressor 4-bit weight-only quantization, please install neural-compressor>=2.3. Weight-only quantization in Intel® Neural Compressor is still under development, so we encourage you to use the master branch to access the latest features. See the instructions for installing neural-compressor from source.

Run 4-bit weight-only quantization

4-bit weight-only quantization supports two algorithms:

  • Round-to-nearest (RTN) is the most straightforward way to quantize weights using scale maps.
  • The GPTQ algorithm provides more accurate quantization but requires more computational resources.

To compress the model with 4-bit weight-only quantization, you may need to:

  1. Set approach to weight_only, and set algorithm to GPTQ or RTN in weight_only_config:
"quantization": {
    "type": "IncStaticQuantization",
    "config": {
        "user_script": "user_script.py",
        "approach": "weight_only",
        "weight_only_config":{
            "algorithm": "RTN"
        }
    }
}
  2. If the GPTQ algorithm is used, provide a calibration dataloader that yields input data and labels (a hedged sketch follows the skeleton below):
"quantization": {
    "type": "IncStaticQuantization",
    "config": {
        "user_script": "user_script.py",
        "approach": "weight_only",
        "weight_only_config":{
            "algorithm": "GPTQ"
        }
    },
    "dataloader_func": "calib_dataloader",
}
class CalibDataloader:
    def __init__(self, batch_size, **kwargs):
        self.batch_size = batch_size
        self.dataset = []
        # operations to add (input_data, label) pairs into self.dataset

    def __iter__(self):
        for input_data, label in self.dataset:
            yield input_data, label
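For concreteness, a hedged sketch of what calib_dataloader in user_script.py could look like. The dataset choice (wikitext-2), sequence length, sample count, and the assumption that the exported ONNX decoder takes input_ids and attention_mask are illustrative and may differ from the actual example script:

import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

class CalibDataloader:
    def __init__(self, batch_size=1, num_samples=32, seq_len=512, **kwargs):
        self.batch_size = batch_size
        tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")
        texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")["text"]
        self.dataset = []
        for text in texts:
            if len(self.dataset) >= num_samples:
                break
            if not text.strip():
                continue
            encoded = tokenizer(text, truncation=True, max_length=seq_len, return_tensors="np")
            input_data = {
                "input_ids": encoded["input_ids"].astype(np.int64),
                "attention_mask": encoded["attention_mask"].astype(np.int64),
            }
            # GPTQ only needs representative inputs; reuse the ids as the label.
            self.dataset.append((input_data, encoded["input_ids"].astype(np.int64)))

    def __iter__(self):
        for input_data, label in self.dataset:
            yield input_data, label

def calib_dataloader(data_dir, batch_size, *args, **kwargs):
    # The (data_dir, batch_size) signature is an assumption about how Olive calls
    # dataloader_func; adjust it to match the example's user_script.py.
    return CalibDataloader(batch_size=batch_size)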

Validated results

The following table shows the accuracy and perplexity results of Open Llama models evaluated on the lambada_openai task. GPTQ W4G32Asym in the Configuration column means the GPTQ algorithm is used for 4-bit weight-only quantization, with group_size=32 and scheme=asym.

| Model name | Configuration | Lambada_openai Accuracy | Lambada_openai Perplexity | Accuracy Ratio [WOQ/FP32] |
|---|---|---|---|---|
| openlm-research/open_llama_3b | FP32 | 0.6647 | 4.8445 | / |
| openlm-research/open_llama_3b | GPTQ W4G32Asym | 0.6569 | 4.9937 | 98.82% |
| openlm-research/open_llama_7b | FP32 | 0.7041 | 3.9686 | / |
| openlm-research/open_llama_7b | RTN W4G32Asym | 0.6887 | 4.1749 | 97.81% |
| openlm-research/open_llama_13b | FP32 | 0.7213 | 3.5750 | / |
| openlm-research/open_llama_13b | RTN W4G32Sym | 0.7169 | 3.7339 | 99.39% |

Note: The above results were obtained using onnxruntime==1.16.3, which supports the MatMulNBits op. Tested on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.9GHz.

Fine-tune Optimization Workflows

Fine-tune Llama Model on a chatbot dataset using QLoRA

This workflow fine-tunes the LLaMA model using QLoRA. The output is still the input transformers model, along with a quantization config and the LoRA adapters that were fine-tuned on the training dataset.

The relevant config file is llama_qlora.json. It corresponds to the guanaco 7b example in the original qlora implementation.

Requirements file: requirements-lora.txt
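After fine-tuning, the adapters can be loaded back on top of the 4-bit quantized base model with peft. A minimal sketch, where the base model id and the adapter path are placeholders rather than values from the example config:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = "huggyllama/llama-7b"  # assumed base model; match the one in llama_qlora.json
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base, "<path/to/adapter/output>")  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(base_id)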

Fine-tune Open Llama Model on a code generation dataset

Requirements file: requirements-lora.txt

The code language is set to Python but can be changed to other languages by changing the language field in the config file. Supported languages are Python, TypeScript, JavaScript, Ruby, Julia, Rust, C++, Bash, Java, C#, and Go. Refer to the dataset card for more details on the dataset.

Note: You must first request access to the nampdn-ai/tiny-codes dataset. Then log in to Hugging Face on your machine using huggingface-cli login, or update the token field in the config file with your Hugging Face token.

Fine-tune OpenLlama model using QLoRA with ONNX Runtime Training

You can also train the model using ONNX Runtime Training. The relevant config file is open_llama_qlora_ort_tinycodes.json. Requirements file: requirements-qlora-ort.txt

It also requires the latest version of onnxruntime-training:

python -m pip uninstall -y onnxruntime onnxruntime-gpu ort-nightly ort-nightly-gpu
python -m pip install onnxruntime-training --pre --upgrade --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/

Configure torch-ort:

python -m torch_ort.configure

How to run

Pip requirements

Install the necessary python packages using the corresponding requirements file.

python -m pip install -r <requirements_file>.txt

Run sample using config

The optimization techniques to run are specified in the relevant config json file.

First, install the required packages according to the passes in the config file:

olive run --config <config_file>.json --setup

Then, optimize the model:

olive run --config <config_file>.json

or simply run with Python code:

from olive.workflows import run as olive_run
olive_run("<config_file>.json")

After running the above command, the model candidates and corresponding config will be saved in the output directory.