# Fine-tune Open LLMs with HuggingFace

LLMs can handle many tasks out-of-the-box through prompting, including chatbots, question answering, and summarization.


For specialized applications requiring high accuracy or domain expertise, fine-tuning remains a powerful approach to achieve higher quality results than prompting alone, reduce costs by training smaller, more efficient models, and ensure reliability and consistency for specific use cases.

**Q-LoRA (Quantized Low-Rank Adaptation)** enables efficient fine-tuning of LLMs using 4-bit quantization and minimal parameter updates, reducing resource needs but potentially impacting performance due to quantization trade-offs.
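
To get a feeling for the resource savings, the following back-of-the-envelope sketch estimates weight memory at different precisions and the number of parameters LoRA adds to one linear layer. The round 8B parameter count and the 4096x4096 layer shape are assumptions chosen for illustration; the estimate ignores quantization constants, activations, gradients, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights only, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


num_params = 8e9  # assumption: round figure for an 8B-parameter model

print(f"bf16 weights : {weight_memory_gb(num_params, 16):.1f} GB")
print(f"4-bit weights: {weight_memory_gb(num_params, 4):.1f} GB")
# assumption: a square 4096x4096 projection and rank 16
print(f"LoRA params for one layer: {lora_added_params(4096, 4096, 16):,}")
```

Under these assumptions the weights shrink by roughly a factor of four, and the adapter for a single layer is small compared to the roughly 16.8 million weights of the layer itself.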

**Spectrum** is a fine-tuning method that identifies the most informative layers of an LLM using Signal-to-Noise Ratio (SNR) analysis and selectively fine-tunes them, offering performance comparable to full fine-tuning with reduced resource usage, especially in distributed training setups.


## When to fine-tune?

Fine-tuning is particularly valuable when we need to
* consistently improve performance on a specific set of tasks
* control the style and format of model outputs (e.g., enforcing a company's tone of voice)
* teach the model domain-specific knowledge or terminology
* reduce hallucinations for critical applications
* optimize for latency by creating smaller, specialized models
* ensure consistent adherence to specific guidelines or constraints

## Setup

In [None]:
# Install PyTorch & other libraries
%pip install "torch==2.4.1" tensorboard flash-attn "liger-kernel==0.4.2" "setuptools<71.0.0" "deepspeed==0.15.4" openai "lm-eval[api]==0.4.5"

# Install Hugging Face libraries
%pip install  --upgrade \
  "transformers==4.46.3" \
  "datasets==3.1.0" \
  "accelerate==1.1.1" \
  "bitsandbytes==0.44.1" \
  "trl==0.12.1" \
  "peft==0.13.2" \
  "lighteval==0.6.2" \
  "hf-transfer==0.1.8"

In [None]:
from huggingface_hub import notebook_login
notebook_login()

## Create and prepare the dataset

Most datasets are created using automated synthetic workflows with LLMs, though several approaches exist:
* **Synthetic generation with LLMs**: the most common approach, using frameworks like Distilabel to generate high-quality synthetic data at scale
* **Existing datasets**: using public datasets from the HuggingFace Hub
* **Human annotation**: the highest-quality but most expensive option


In this example, we will use the [`orca-math-word-problems-200k`](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k) dataset, which contains 200,000 math word problems.

Modern fine-tuning frameworks like `trl` support standard formats:
```python
# Conversation format
{
    'messages': [
        {'role': 'system', 'content': 'You are ...'},
        {'role': 'user', 'content': '...'},
        {'role': 'assistant', 'content': '...'}
    ]
}

# Instruction format
{'prompt': '<prompt text>', 'completion': '<ideal generated text>'}
```
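
The two formats carry the same information, so a record in the instruction format can be rewritten into the conversation format. The helper below is a minimal sketch of that mapping; the function name `instruction_to_conversation` and the example record are made up for illustration and are not part of `trl`.

```python
from typing import Optional


def instruction_to_conversation(sample: dict, system_prompt: Optional[str] = None) -> dict:
    """Map {'prompt', 'completion'} to the 'messages' conversation format."""
    messages = []
    if system_prompt:
        # the system message is optional in the conversation format
        messages.append({'role': 'system', 'content': system_prompt})
    messages.append({'role': 'user', 'content': sample['prompt']})
    messages.append({'role': 'assistant', 'content': sample['completion']})
    return {'messages': messages}


# hypothetical record, used only to show the shape of the output
example = {'prompt': 'What is 2 + 3?', 'completion': '2 + 3 = 5'}
converted = instruction_to_conversation(example, system_prompt='You are a math tutor.')
print(converted)
```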


To prepare our dataset, we will use the `datasets` library and convert it into the conversation format, where we include the task instructions in the system message for our assistant. Then we save the dataset as a JSON Lines file (`train_dataset.json`) for fine-tuning.

In [None]:
from datasets import load_dataset

system_message = """Solve the given high school math problem by providing a clear explanation of each step leading to the final solution.

Provide a detailed breakdown of your calculations, beginning with an explanation of the problem and describing how you derive each formula, value, or conclusion. Use logical steps that build upon one another, to arrive at the final answer in a systematic manner.

# Steps

1. **Understand the Problem**: Restate the given math problem and clearly identify the main question and any important given values.
2. **Set Up**: Identify the key formulas or concepts that could help solve the problem (e.g., algebraic manipulation, geometry formulas, trigonometric identities).
3. **Solve Step-by-Step**: Iteratively progress through each step of the math problem, justifying why each consecutive operation brings you closer to the solution.
4. **Double Check**: If applicable, double check the work for accuracy and sense, and mention potential alternative approaches if any.
5. **Final Answer**: Provide the numerical or algebraic solution clearly, accompanied by appropriate units if relevant.

# Notes

- Always clearly define any variable or term used.
- Wherever applicable, include unit conversions or context to explain why each formula or step has been chosen.
- Assume the level of mathematics is suitable for high school, and avoid overly advanced math techniques unless they are common at that level.
"""

def create_conversation(sample):
    return {
        'messages': [
            {'role': 'system', 'content': system_message},
            {'role': 'user', 'content': sample['question']},
            {'role': 'assistant', 'content': sample['answer']}
        ]
    }


# load dataset
dataset = load_dataset(
    'microsoft/orca-math-word-problems-200k',
    split='train'
)

In [None]:
dataset[0]

In [None]:
# convert dataset to the OpenAI message template
dataset = dataset.map(
    create_conversation,
    remove_columns=dataset.features,
    batched=False
)

In [None]:
dataset[0]['messages']

In [None]:
# save dataset to disk
dataset.to_json('train_dataset.json', orient='records')
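
Before training, it can be worth checking that the saved file really has one conversation per line. The following is a small sketch using only the standard library; the helper name `validate_jsonl` and the choice to check just the first 100 records are assumptions for illustration.

```python
import json
import os

EXPECTED_ROLES = ['system', 'user', 'assistant']


def validate_jsonl(path: str, max_records: int = 100) -> int:
    """Check the first `max_records` records and return how many were checked."""
    checked = 0
    with open(path, 'r', encoding='utf-8') as fin:
        for line_number, line in enumerate(fin, start=1):
            if checked >= max_records:
                break
            if not line.strip():
                continue  # skip empty lines
            record = json.loads(line)
            roles = [message['role'] for message in record['messages']]
            if roles != EXPECTED_ROLES:
                raise ValueError(f'line {line_number}: unexpected roles {roles}')
            for message in record['messages']:
                if not isinstance(message['content'], str) or not message['content']:
                    raise ValueError(f'line {line_number}: empty or non-string content')
            checked += 1
    return checked


# only runs if the file from the previous cell exists
if os.path.exists('train_dataset.json'):
    print(f"Checked {validate_jsonl('train_dataset.json')} records")
```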

## Fine-tune the model using `trl` and the `SFTTrainer` with QLoRA

We will use the `SFTTrainer` from `trl` to fine-tune our model. The `SFTTrainer` makes it straightforward to run supervised fine-tuning on open LLMs.

The `SFTTrainer` is a subclass of the `Trainer` from the `transformers` library and supports all the same features, including logging, evaluation, and checkpointing. It also adds further features:
* dataset formatting, including conversational and instructional format
* training on completions only, ignoring prompts
* packing datasets for more efficient training
* PEFT support, including Q-LoRA, as well as Spectrum
* distributed training with `accelerate` and FSDP/DeepSpeed

The following `run_sft.py` script is used to run the fine-tuning with a yaml configuration. It uses the `TrlParser` to parse the yaml file and convert it into the `TrainingArguments`.

In [None]:
# -------- `run_sft.py` ---------
from dataclasses import dataclass
from datetime import datetime
from distutils.util import strtobool
import logging
import re
import os
from typing import Optional
os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed, BitsAndBytesConfig
from transformers.trainer_utils import get_last_checkpoint
from transformers.utils import is_liger_kernel_available
from trl import SFTTrainer, TrlParser, ModelConfig, SFTConfig, get_peft_config
from datasets import load_dataset
from peft import AutoPeftModelForCausalLM

if is_liger_kernel_available():
    from liger_kernel.transformers import AutoLigerKernelForCausalLM



#######################
# Custom dataclasses
#######################
@dataclass
class ScriptArguments:
    dataset_id_or_path: str
    dataset_splits: str = 'train'
    tokenizer_name_or_path: Optional[str] = None
    spectrum_config_path: Optional[str] = None


#######################
# Setup logging
#######################
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
logger.addHandler(handler)

#######################
# Helper functions
#######################

def get_checkpoint(training_args: SFTConfig):
    last_checkpoint = None
    if os.path.isdir(training_args.output_dir):
        last_checkpoint = get_last_checkpoint(training_args.output_dir)
    return last_checkpoint

def setup_model_for_spectrum(model, spectrum_config_path):
    unfrozen_parameters = []
    with open(spectrum_config_path, 'r') as fin:
        yaml_parameters = fin.read()

    # get the unfrozen parameters from the yaml file
    for line in yaml_parameters.splitlines():
        if line.startswith('- '):
            unfrozen_parameters.append(line.split('- ')[1])

    # freeze all parameters
    for param in model.parameters():
        param.requires_grad = False
    # unfreeze Spectrum parameters
    for name, param in model.named_parameters():
        if any(re.match(unfrozen_param, name) for unfrozen_param in unfrozen_parameters):
            param.requires_grad = True

    # Sanity check and print the trainable parameters
    for name, param in model.named_parameters():
        if param.requires_grad:
            print(f"Trainable parameter: {name}")

    return model


###############################################################################################

def train_function(model_args: ModelConfig, script_args: ScriptArguments, training_args: SFTConfig):
    """Main training function"""
    ###################
    # log parameters
    ###################
    logger.info(f'Model parameters {model_args}')
    logger.info(f'Script parameters {script_args}')
    logger.info(f'Training/evaluation parameters {training_args}')

    ###################
    # load datasets
    ###################
    if script_args.dataset_id_or_path.endswith('.json'):
        train_dataset = load_dataset('json', data_files=script_args.dataset_id_or_path, split='train')
    else:
        train_dataset = load_dataset(script_args.dataset_id_or_path, split=script_args.dataset_splits)

    train_dataset = train_dataset.select(range(10000))

    logger.info(f'Loaded dataset with {len(train_dataset)} samples and the following features: {train_dataset.features}')

    ##################
    # load tokenizer
    ##################
    tokenizer = AutoTokenizer.from_pretrained(
        script_args.tokenizer_name_or_path if script_args.tokenizer_name_or_path else model_args.model_name_or_path,
        revision=model_args.model_revision,
        trust_remote_code=model_args.trust_remote_code,
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # If we use peft we need to make sure we use a chat template that is not using special tokens
    # because by default embedding layers will not be trainable

    ##########################
    # load pretrained model
    ##########################

    # define model kwargs
    model_kwargs = dict(
        revision=model_args.model_revision, # what revision from huggingface to use, defaults to main
        trust_remote_code=model_args.trust_remote_code, # whether to trust the remote code, this allows us to finetune custom architectures
        attn_implementation=model_args.attn_implementation, # what attention implementation to use, defaults to flash_attention_2
        torch_dtype=model_args.torch_dtype if model_args.torch_dtype in ['auto', None] else getattr(torch, model_args.torch_dtype), # what torch dtype to use, defaults to auto
        use_cache=False if training_args.gradient_checkpointing else True, # whether to use cache or not, defaults to False if gradient checkpointing is enabled
        low_cpu_mem_usage=True if not strtobool(os.environ.get('ACCELERATE_USE_DEEPSPEED', 'false')) else None, # reduce memory usage on CPU for loading the model
    )

    # check which training method to use and if 4-bit quantization is needed
    if model_args.load_in_4bit:
        model_kwargs['quantization_config'] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4',
            bnb_4bit_compute_dtype=model_kwargs['torch_dtype'],
            bnb_4bit_quant_storage=model_kwargs['torch_dtype']
        )
    if model_args.use_peft:
        peft_config = get_peft_config(model_args)
    else:
        peft_config = None

    # load the model with our kwargs
    if training_args.use_liger:
        model = AutoLigerKernelForCausalLM.from_pretrained(
            model_args.model_name_or_path,
            **model_kwargs
        )
    else:
        model = AutoModelForCausalLM.from_pretrained(
            model_args.model_name_or_path,
            **model_kwargs
        )

    training_args.distributed_state.wait_for_everyone() # wait for all processes to load

    if script_args.spectrum_config_path:
        model = setup_model_for_spectrum(model, script_args.spectrum_config_path)


    ##########################
    # initialize the Trainer
    ##########################
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        tokenizer=tokenizer,
        peft_config=peft_config
    )
    if trainer.accelerator.is_main_process and peft_config:
        trainer.model.print_trainable_parameters()


    ######################
    # training loop
    ######################
    last_checkpoint = get_checkpoint(training_args)
    if last_checkpoint is not None and training_args.resume_from_checkpoint is None:
        logger.info(f'Checkpoint detected, resuming training at {last_checkpoint}')

    logger.info(f'*** Starting training {datetime.now().strftime("%Y-%m-%d %H:%M:%S")} for {training_args.num_train_epochs} epochs ***')
    train_result = trainer.train(resume_from_checkpoint=last_checkpoint)
    # log metrics
    metrics = train_result.metrics
    metrics['train_samples'] = len(train_dataset)
    trainer.log_metrics('train', metrics)
    trainer.save_metrics('train', metrics)
    trainer.save_state()

    ############################################
    # save model and create model card
    ############################################
    logger.info('*** Save model ***')
    if trainer.is_fsdp_enabled and peft_config:
        trainer.accelerator.state.fsdp_plugin.set_state_dict_type('FULL_STATE_DICT')

    # restore k,v cache for fast inference
    trainer.model.config.use_cache = True
    trainer.save_model(training_args.output_dir)
    logger.info(f"Model saved to {training_args.output_dir}")
    training_args.distributed_state.wait_for_everyone() # wait for all processes to finish saving

    tokenizer.save_pretrained(training_args.output_dir)
    logger.info(f"Tokenizer saved to {training_args.output_dir}")

    # Save everything else on main process
    if trainer.accelerator.is_main_process:
        trainer.create_model_card(
            {'tags': ['sft', 'tutorial']}
        )

    # push to hub if necessary
    if training_args.push_to_hub:
        logger.info('Pushing to hub...')
        trainer.push_to_hub()

    logger.info('*** Training complete ***')


def main():
    parser = TrlParser((ModelConfig, ScriptArguments, SFTConfig))
    model_args, script_args, training_args = parser.parse_args_and_config()

    # set seed for reproducibility
    set_seed(training_args.seed)

    # run the training loop
    train_function(model_args, script_args, training_args)


if __name__ == '__main__':
    main()

We define our arguments as `dataclasses` so that every argument can be provided either via the command line or a yaml configuration file:
```python
@dataclass
class ScriptArguments:
    dataset_id_or_path: str
    ...

```
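
To make the idea concrete, the sketch below fills the `ScriptArguments` dataclass from an already-parsed configuration plus command-line overrides. It is only a conceptual stand-in written with the standard library, not how `TrlParser` is implemented; the helper `build_script_arguments` and the rule that command-line values take precedence are assumptions of this sketch.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ScriptArguments:
    dataset_id_or_path: str
    dataset_splits: str = 'train'
    tokenizer_name_or_path: Optional[str] = None
    spectrum_config_path: Optional[str] = None


def build_script_arguments(config: dict, cli_overrides: dict) -> ScriptArguments:
    """Merge config-file values with command-line overrides (overrides win here)."""
    known = {field.name for field in fields(ScriptArguments)}
    merged = {**config, **cli_overrides}
    # keys owned by other dataclasses (e.g. learning_rate) are ignored in this sketch
    selected = {key: value for key, value in merged.items() if key in known}
    return ScriptArguments(**selected)


# stands in for the parsed yaml file
config = {'dataset_id_or_path': 'train_dataset.json', 'learning_rate': 2.0e-4}
# hypothetical command-line override
cli_overrides = {'dataset_splits': 'train[:100]'}
print(build_script_arguments(config, cli_overrides))
```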

Next, we can customize behavior for different training methods and use them in our script with `script_args`. The training script is separated by `#######` blocks for the different parts of the script.

The main training function:
1. logs all hyperparameters
2. loads the dataset from HuggingFace Hub or local disk
3. loads the tokenizer and model with our training strategy (e.g., Q-LoRA, Spectrum)
4. initializes the `SFTTrainer`
5. starts the training loop (optionally continue training from a checkpoint)
6. saves the model and optionally pushes it to the HuggingFace Hub

The following `llama-3-1-8b-qlora.yaml` is an example recipe to fine-tune a `Llama-3.1-8B` model with Q-LoRA:
```yaml
# Model arguments
model_name_or_path: Meta-Llama/Meta-Llama-3.1-8B
tokenizer_name_or_path: Meta-Llama/Meta-Llama-3.1-8B-Instruct
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
use_liger: true
bf16: true
tf32: true
output_dir: runs/llama-3-1-8b-math-orca-qlora-10k-ep1

# Dataset arguments
dataset_id_or_path: train_dataset.json
max_seq_length: 1024
packing: true

# LoRA arguments
use_peft: true
load_in_4bit: true
lora_target_modules: 'all-linear'
# important as we need to train the special tokens for the chat template of llama
lora_modules_to_save: ['lm_head', 'embed_tokens'] # we may need to change this for qwen or other models
lora_r: 16
lora_alpha: 16

# Training arguments
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
    use_reentrant: false
learning_rate: 2.0e-4
lr_scheduler_type: constant
warmup_ratio: 0.1

# Logging arguments
logging_strategy: steps
logging_steps: 5
report_to:
    - tensorboard
save_strategy: epoch
seed: 42

# HuggingFace Hub
push_to_hub: false
hub_strategy: every_save
hub_model_id: llama-3-1-8b-math-orca-qlora-10k-ep1 # if not defined the same as output_dir
```

This config works for single-GPU training and for multi-GPU training with DeepSpeed.

To start training, we run the following command in the terminal:
```bash
python run_sft.py --config llama-3-1-8b-qlora.yaml
```
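
To see how the batch settings in the recipe interact, the sketch below computes the effective batch size per optimizer step and, since `packing: true` fills each sequence up to `max_seq_length`, an upper bound on the tokens seen per step. The helper names are made up for illustration, and `num_gpus=1` is an assumption for the single-GPU case.

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int = 1) -> int:
    """Number of sequences that contribute to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus


def max_tokens_per_step(batch_size: int, max_seq_length: int) -> int:
    """Upper bound on tokens per optimizer step when sequences are packed."""
    return batch_size * max_seq_length


# values taken from the recipe above; num_gpus=1 is assumed
batch = effective_batch_size(per_device_batch_size=8,
                             gradient_accumulation_steps=2,
                             num_gpus=1)
print(f'Effective batch size: {batch}')
print(f'Max tokens per step : {max_tokens_per_step(batch, max_seq_length=1024)}')
```

When launching on several GPUs, the effective batch size grows with the number of processes, so the learning rate or accumulation steps may need to be revisited.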

Notes:
* Q-LoRA includes training the embedding layer and the `lm_head`, as we use the Llama 3.1 chat template and in the base model the special tokens are not trained.
* For distributed training, DeepSpeed with ZeRO3 and HuggingFace Accelerate were used.
* Spectrum with 30% SNR layers took slightly longer than Q-LoRA, but achieves higher accuracy on the GSM8K dataset.

Using Q-LoRA only saves the trained adapter weights. If we want to use the model as a standalone model, e.g., for inference, we may want to merge the adapter and base model.
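
A minimal sketch of such a merge, assuming the adapter and tokenizer were saved to the `output_dir` of the recipe above, could look as follows. The helper name `merge_adapter` and the output path in the usage comment are hypothetical; the heavy imports live inside the function so that defining it does not require a GPU environment. Loading the base model in `bfloat16` for merging needs enough CPU or GPU memory to hold the full model.

```python
def merge_adapter(adapter_dir: str, output_dir: str) -> None:
    """Load base model + adapter, merge the adapter weights, and save the result."""
    # imported lazily: these packages are only needed when the merge actually runs
    import torch
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # loads the base model referenced in the adapter config and attaches the adapter
    model = AutoPeftModelForCausalLM.from_pretrained(
        adapter_dir,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
    )
    # folds the LoRA weights into the base weights and drops the PEFT wrappers
    merged_model = model.merge_and_unload()
    merged_model.save_pretrained(output_dir, safe_serialization=True, max_shard_size='2GB')

    # keep the tokenizer next to the merged weights
    AutoTokenizer.from_pretrained(adapter_dir).save_pretrained(output_dir)


# Example call (the output path is a hypothetical name):
# merge_adapter('runs/llama-3-1-8b-math-orca-qlora-10k-ep1',
#               'runs/llama-3-1-8b-math-orca-qlora-10k-ep1-merged')
```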

## Test model and run inference

As we trained our model on solving math problems, we will evaluate the model on the `GSM8K` (Grade School Math 8K) dataset, which is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. This dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

In this example, we will use [`lm-evaluation-harness`](https://github.com/EleutherAI/lm-evaluation-harness), an open-source framework to evaluate language models on a wide range of tasks and benchmarks.

We will use [`text-generation-inference` (TGI)](https://github.com/huggingface/text-generation-inference) for testing and deploying our model. TGI is a purpose-built solution for deploying and serving LLMs. TGI enables high-performance text generation using tensor parallelism and continuous batching.

We will start with 1 GPU, but feel free to increase `num_gpus` if more are available.

In [None]:
%%bash

num_gpus=1
model_id=philschmid/llama-3-1-8b-math-orca-spectrum-10k-ep1 # replace with your model id

docker run --name tgi --gpus ${num_gpus} -d -ti -p 8080:80 --shm-size=2GB \
  -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  --model-id ${model_id} \
  --num-shard ${num_gpus}


Our container will start in the background and download the model from HuggingFace Hub. We can check the logs to see the progress with
```bash
docker logs -f tgi
```

Once our container is running we can send requests using the `openai` or `huggingface_hub` SDK. In this example, we will use the `openai` SDK to send a request to our inference server.

In [None]:
from openai import OpenAI

# create client
client = OpenAI(base_url='http://localhost:8080/v1', api_key='dummy')

system_message = """Solve the given high school math problem by providing a clear explanation of each step leading to the final solution.

Provide a detailed breakdown of your calculations, beginning with an explanation of the problem and describing how you derive each formula, value, or conclusion. Use logical steps that build upon one another, to arrive at the final answer in a systematic manner.

# Steps

1. **Understand the Problem**: Restate the given math problem and clearly identify the main question and any important given values.
2. **Set Up**: Identify the key formulas or concepts that could help solve the problem (e.g., algebraic manipulation, geometry formulas, trigonometric identities).
3. **Solve Step-by-Step**: Iteratively progress through each step of the math problem, justifying why each consecutive operation brings you closer to the solution.
4. **Double Check**: If applicable, double check the work for accuracy and sense, and mention potential alternative approaches if any.
5. **Final Answer**: Provide the numerical or algebraic solution clearly, accompanied by appropriate units if relevant.

# Notes

- Always clearly define any variable or term used.
- Wherever applicable, include unit conversions or context to explain why each formula or step has been chosen.
- Assume the level of mathematics is suitable for high school, and avoid overly advanced math techniques unless they are common at that level.
"""

messages = [
    {'role': 'system', 'content': system_message},
    {'role': 'user', 'content': "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"}
]
expected_answer = '72'

# Test an example question
response = client.chat.completions.create(
    model='orca',
    messages=messages,
    stream=False, # no streaming
    max_tokens=256
)
response = response.choices[0].message.content

print(f"Query:\n{messages[1]['content']}")
print(f"Original Answer:\n{expected_answer}")
print(f"Generated Answer:\n{response}")
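
For a quick manual check of a single response, we can compare the last number in the generated text with the expected answer. This is only a rough heuristic sketched for illustration (the helper names are made up), and it is not how the evaluation harness scores answers.

```python
import re
from typing import Optional


def extract_last_number(text: str) -> Optional[str]:
    """Return the last number found in `text`, with thousands separators removed."""
    matches = re.findall(r'-?\d[\d,]*(?:\.\d+)?', text)
    if not matches:
        return None
    return matches[-1].replace(',', '')


def is_correct(generated: str, expected: str) -> bool:
    """Compare the last number of the generated text with the expected answer."""
    predicted = extract_last_number(generated)
    return predicted is not None and float(predicted) == float(expected)


# hypothetical model output, used only to demonstrate the helpers
sample_output = 'In May she sold 48 / 2 = 24 clips. In total: 48 + 24 = 72 clips.'
print(extract_last_number(sample_output))
print(is_correct(sample_output, '72'))
```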

Next, we will evaluate our model with the `lm-evaluation-harness`:

In [None]:
!lm_eval --model local-chat-completions \
  --tasks gsm8k_cot \
  --model_args model=philschmid/llama-3-1-8b-math-orca-spectrum-10k-ep1,base_url=http://localhost:8080/v1/chat/completions,num_concurrent=8,max_retries=3,tokenized_requests=False \
  --apply_chat_template \
  --fewshot_as_multiturn

Stop and remove the container once we are done:

In [None]:
!docker stop tgi
!docker rm tgi

## Distributed training: DeepSpeed + Q-LoRA

We can combine DeepSpeed and Q-LoRA. Don't forget to change the `num_processes` to the number of GPUs we want to use:
```bash
accelerate launch \
    --config_file deepspeed_zero3.yaml \
    --num_processes 8 \
    run_sft.py --config llama-3-1-8b-qlora.yaml
```

## Inference: vLLM

For faster inference, we can serve the merged model with vLLM:
```bash
docker run --runtime nvidia --gpus all \
    -p 8000:8000 \
    vllm/vllm-openai --model philschmid/llama-3-1-8b-math-orca-qlora-10k-ep1-merged
```

## Spectrum

Spectrum uses Signal-to-Noise Ratio (SNR) analysis to select the most useful layers for fine-tuning. It provides scripts and pre-run scans for different models. If our model has not been scanned yet, the script will prompt us for the batch size to use for scanning. A batch size of 4 for 70B models requires 8xH100. Popular models like Llama 3.1 8B are already scanned.

The script generates a yaml configuration file in the `model_snr_results` directory, named after the model and the top percentage, e.g., for `meta-llama/Meta-Llama-3.1-8B` and `30`, it generates `snr_results_meta-llama-Meta-Llama-3.1-8B_unfrozenparameters_30percent.yaml`. The script takes two arguments:
* `--model-name`: specifies the local model path or the HuggingFace repository
* `--top-percent`: specifies the top percentage of SNR layers we want to retrieve

For example,
```bash
git clone https://github.com/cognitivecomputations/spectrum.git
cd spectrum

python3 spectrum.py \
    --model-name meta-llama/Meta-Llama-3.1-8B \
    --top-percent 30
cd ..
```
Then, the top 30% SNR layers are saved to `snr_results_meta-llama-Meta-Llama-3.1-8B_unfrozenparameters_30percent.yaml`.

After the yaml configuration is generated, we can use it to fine-tune our model. To do so, we set `spectrum_config_path` in our training config yaml file to the path of the generated file.
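
The training script treats every `- ` entry of that file as a regular expression and unfreezes the parameters whose names match. The sketch below reproduces this selection logic on plain strings, without loading a model. The yaml content and parameter names are hypothetical examples, not the output of an actual Spectrum scan.

```python
import re
from typing import List


def parse_unfrozen_patterns(yaml_text: str) -> List[str]:
    """Collect the '- ' list entries of the Spectrum yaml file as regex patterns."""
    patterns = []
    for line in yaml_text.splitlines():
        if line.startswith('- '):
            patterns.append(line.split('- ', 1)[1].strip())
    return patterns


def select_trainable(parameter_names: List[str], patterns: List[str]) -> List[str]:
    """Return the parameter names that match at least one pattern."""
    return [
        name for name in parameter_names
        if any(re.match(pattern, name) for pattern in patterns)
    ]


# hypothetical file content and parameter names, for illustration only
example_yaml = """unfrozen_parameters:
- ^lm_head.weight$
- model.layers.1.mlp.down_proj
"""
parameter_names = [
    'lm_head.weight',
    'model.layers.1.mlp.down_proj.weight',
    'model.layers.10.mlp.down_proj.weight',
    'model.layers.1.self_attn.q_proj.weight',
]

patterns = parse_unfrozen_patterns(example_yaml)
print(select_trainable(parameter_names, patterns))
```

Because `re.match` anchors only at the start of the name, a pattern such as `model.layers.1.mlp.down_proj` also matches the `.weight` suffix, while layer 10 is not selected.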