# Fine-Tuning Open-Source LLM using LoRA with MLflow and PEFT

## Overview

Many powerful open-source LLMs have emerged and are easily accessible. However, they are not designed to be deployed to your production environment out-of-the-box; instead, you have to **fine-tune** them for your specific tasks, such as a chatbot, content generation, etc. One challenge, though, is that training LLMs is usually very expensive. Even if your dataset for fine-tuning is small, the backpropagation step needs to compute gradients for billions of parameters. For example, fully fine-tuning a 7B model requires 112GB of VRAM, i.e. at least two 80GB H100 GPUs. Fortunately, there are many research efforts on how to reduce the cost of LLM fine-tuning.
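The 112GB figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes a common accounting of about 16 bytes per parameter for full fine-tuning with Adam in 32-bit precision (4 bytes for the weight, 4 for its gradient, 8 for the two optimizer moments). The 16-byte figure and the function name are assumptions for illustration, and activation memory is ignored.

```python
def full_finetune_memory_gb(num_params: float, bytes_per_param: int = 16) -> float:
    """Rough VRAM estimate in GB (1 GB = 1e9 bytes), ignoring activations."""
    return num_params * bytes_per_param / 1e9


# 7 billion parameters, 16 bytes each (assumed accounting, see above)
estimate_gb = full_finetune_memory_gb(7e9)
print(f"{estimate_gb:.0f} GB")
```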

In this tutorial, we will demonstrate how to build a Python coding assistant by fine-tuning the Qwen2.5 7B model.

### What You Will Learn
1. Walk through a typical LLM fine-tuning process hands-on.
2. Use **LoRA** and **PEFT** to overcome the GPU memory limitation for fine-tuning.
3. Manage the model training cycle using **MLflow** to log the model artifacts, hyperparameters, metrics, and prompts.
4. Save a prompt template and inference parameters (e.g. `max_new_tokens`) in MLflow to simplify the prediction interface.

### Key Actors
In this tutorial, you will learn about the techniques and methods behind efficient LLM fine-tuning by actually running the code. Each cell below is explained in more detail, but let's start with a brief preview of the main libraries and methods used in this tutorial.

* [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) model is a pretrained text-generation model with 7 billion parameters, developed by [Qwen](https://github.com/QwenLM). Qwen2.5 is the latest series of Qwen large language models.
* [LoRA](https://huggingface.co/docs/diffusers/en/training/lora) is a novel method that allows us to fine-tune large foundational models with limited GPU resources. It reduces the number of trainable parameters by learning pairs of rank-decomposition matrices.
* [PEFT](https://huggingface.co/docs/peft/en/index) is a library developed by HuggingFace🤗, that enables developers to easily integrate various optimization methods with pretrained models available on the HuggingFace Hub. With PEFT, you can apply LoRA to the pretrained model with a few lines of configurations and run fine-tuning just like the normal Transformers model training.
* [MLflow](https://mlflow.org/) manages an exploding number of configurations, assets, and metrics during the LLM training on your behalf. MLflow is natively integrated with Transformers and PEFT, and plays a crucial role in organizing the fine-tuning cycle.

## 1. Environment Set up

### Hardware Requirement
This notebook has been tested on a single NVIDIA H100 GPU with 80GB of VRAM.

### Install Python Libraries

This tutorial utilizes the following Python libraries:

* [mlflow](https://pypi.org/project/mlflow/) - for tracking parameters, metrics, and saving trained models.
* [transformers](https://pypi.org/project/transformers/) - for defining the model, tokenizer, and trainer.
* [peft](https://pypi.org/project/peft/) - for creating a LoRA adapter on top of the Transformer model.
* [trl](https://pypi.org/project/trl/) -  for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO).
* [accelerate](https://pypi.org/project/accelerate/) - a dependency required by the Transformers `Trainer` and by model loading with `device_map="auto"`.
* [datasets](https://pypi.org/project/datasets/) - for loading the training dataset from the HuggingFace hub.

**Note**: Restarting the Python kernel may be necessary after installing these dependencies.

The notebook has been tested with `mlflow==2.15.1`, `transformers==4.47.0`, `peft==0.13.2`, `accelerate==1.2.0`, `trl==0.12.1`  and `datasets==3.2.0`.

In [None]:
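%pip install -q mlflow==2.15.1
%pip install -q transformers==4.47.0 peft==0.13.2 accelerate==1.2.0 trl==0.12.1 datasets==3.2.0
# Later cells also rely on python-dotenv (load_dotenv) and bitsandbytes (paged_adamw_32bit); flash-attn is needed for attn_implementation="flash_attention_2"
%pip install -q python-dotenv bitsandbytes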

We need to provide the necessary environment variables to use managed MLflow from `Nebius`:
``` bash
# the following vars can be obtained from your managed MLflow deployment
MLFLOW_TRACKING_SERVER_CERT_PATH=path/to/tracking/server/certificate
MLFLOW_TRACKING_URI=tracking/server/uri
MLFLOW_TRACKING_USERNAME=username
MLFLOW_TRACKING_PASSWORD=password
```
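Before starting a run, it can help to confirm that all four variables are actually visible to the Python process (for example, after the `load_dotenv()` cell below has read them from a `.env` file). The following is a minimal sketch; the helper name `missing_mlflow_vars` is hypothetical and not part of MLflow.

```python
import os

REQUIRED_VARS = [
    "MLFLOW_TRACKING_SERVER_CERT_PATH",
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
]


def missing_mlflow_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_mlflow_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```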

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

## 2. Dataset Preparation

### Load Dataset from HuggingFace Hub

We will use the `iamtarun/python_code_instructions_18k_alpaca` dataset from the [Hugging Face Hub](https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca) for this tutorial. This dataset comprises 18.6k pairs of natural language instructions, inputs and their corresponding outputs (Python code), making it useful for teaching a model to write Python code. The dataset includes 4 columns:

* `instruction`: A natural language instruction (problem definition) which is supposed to be solved via Python code.
* `input`: An optional input to the previous instruction which has to be accepted by the resulting Python program.
* `output`: The solution to the initial problem (working Python program).
* `prompt`: Prepared prompt for training in `alpaca` format.

In [2]:
import pandas as pd
from datasets import load_dataset
from IPython.display import HTML, display

dataset_name = "iamtarun/python_code_instructions_18k_alpaca"
dataset = load_dataset(dataset_name, split="train")


def display_table(dataset_or_sample):
    # A helper function to nicely display a Transformers dataset or a single sample that contains multi-line strings
    pd.set_option("display.max_colwidth", None)
    pd.set_option("display.width", None)
    pd.set_option("display.max_rows", None)

    if isinstance(dataset_or_sample, dict):
        df = pd.DataFrame(dataset_or_sample, index=[0])
    else:
        df = pd.DataFrame(dataset_or_sample)

    html = df.to_html().replace("\\n", "<br>")
    styled_html = f"""<style> .dataframe th, .dataframe tbody td {{ text-align: left; padding-right: 30px; }} </style> {html}"""
    display(HTML(styled_html))


display_table(dataset.select(range(3)))



Unnamed: 0,instruction,input,output,prompt
0,Create a function to calculate the sum of a sequence of integers.,"[1, 2, 3, 4, 5]",# Python code def sum_sequence(sequence):  sum = 0  for num in sequence:  sum += num  return sum,"Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Create a function to calculate the sum of a sequence of integers. ### Input: [1, 2, 3, 4, 5] ### Output: # Python code def sum_sequence(sequence):  sum = 0  for num in sequence:  sum += num  return sum"
1,Generate a Python code for crawling a website for a specific type of data.,website: www.example.com data to crawl: phone numbers,"import requests import re def crawl_website_for_phone_numbers(website):  response = requests.get(website)  phone_numbers = re.findall('\d{3}-\d{3}-\d{4}', response.text)  return phone_numbers  if __name__ == '__main__':  print(crawl_website_for_phone_numbers('www.example.com'))","Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Generate a Python code for crawling a website for a specific type of data. ### Input: website: www.example.com data to crawl: phone numbers ### Output: import requests import re def crawl_website_for_phone_numbers(website):  response = requests.get(website)  phone_numbers = re.findall('\d{3}-\d{3}-\d{4}', response.text)  return phone_numbers  if __name__ == '__main__':  print(crawl_website_for_phone_numbers('www.example.com'))"
2,"Create a Python list comprehension to get the squared values of a list [1, 2, 3, 5, 8, 13].",,"[x*x for x in [1, 2, 3, 5, 8, 13]]","Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Create a Python list comprehension to get the squared values of a list [1, 2, 3, 5, 8, 13]. ### Input: ### Output: [x*x for x in [1, 2, 3, 5, 8, 13]]"


### Split Train and Test Dataset
The `iamtarun/python_code_instructions_18k_alpaca` dataset consists of a single split, "train". We will separate 20% of this as test samples.

In [3]:
split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

print(f"Training dataset contains {len(train_dataset)} code generation prompts")
print(f"Test dataset contains {len(test_dataset)} code generation prompts")

Training dataset contains 14889 code generation prompts
Test dataset contains 3723 code generation prompts
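The two counts above follow from simple arithmetic on the 18,612 rows in the dataset. The sketch below assumes that a fractional `test_size` is rounded up to the next whole sample and that the remainder goes to the training split; that rounding rule is an assumption, although it reproduces the printed numbers.

```python
import math

n_samples = 18612   # total rows in the single "train" split
test_size = 0.2

# Assumption: the fractional test size is rounded up to a whole number of samples
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```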


## 3. Load the Base Model

Next, we'll load the `Qwen2.5-7B` model, which will serve as our base model for fine-tuning. This model can be loaded from the HuggingFace Hub repository [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) using the Transformers' `from_pretrained()` API. 


In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "Qwen/Qwen2.5-7B"

tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # Needed for training
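model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    # Set this to your preferred cache directory, or remove it to use the default Hugging Face cache
    cache_dir='',
    # Disable the KV cache; it is not used during training and conflicts with gradient checkpointing
    use_cache=False,
    # Requires the flash-attn package and a supported GPU
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)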


model.device

Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.50it/s]


device(type='cuda', index=0)

### How Does the Base Model Perform?
First, let's assess the performance of the vanilla `Qwen2.5-7B` model on the Python code generation task before any fine-tuning. The model produces some Python code, but it may not always be correct, and it is accompanied by natural language explanations, which makes the model difficult to use as a coding assistant. This outcome indicates the necessity of fine-tuning the model for our specific task.


In [5]:
PROMPT_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Output:
{output}
"""

import transformers

pipeline = transformers.pipeline(model=model, tokenizer=tokenizer, task="text-generation")

sample = test_dataset[42]
prompt = PROMPT_TEMPLATE.format(
    instruction=sample["instruction"], input=sample["input"], output=""
)  # Leave the answer part blank

with torch.no_grad():
    response = pipeline(prompt, max_new_tokens=2048, repetition_penalty=1.15, return_full_text=False)

display_table({"prompt": prompt, "generated_output": response[0]["generated_text"], "ground_truth": sample["output"]})

Unnamed: 0,prompt,generated_output,ground_truth
0,"Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Construct a Python script to perform a linear regression model for the given data-points. ### Input: X = [1, 2, 3, 4, 5] Y = [1, 4, 9, 16, 25] ### Output:","To construct a Python script for performing a linear regression model on the provided dataset (X and Y), you can use libraries such as `numpy` for numerical operations and `scipy.stats` or other machine learning libraries like `sklearn`. Here's how you could do it using both approaches: #### Using NumPy & SciPy Stats ```python import numpy as np from scipy import stats # Given datasets X = np.array([1, 2, 3, 4, 5]) Y = np.array([1, 4, 9, 16, 25]) # Perform linear regression slope, intercept, r_value, p_value, std_err = stats.linregress(X, Y) print(f""Slope: {slope}"") print(f""Intercept: {intercept}"") # Predicting values based on the fitted line equation y = mx + b def predict(x):  return slope * x + intercept # Example prediction at point X=6 predicted_y_at_6 = predict(6) print(f""Predicted value of Y when X=6: {predicted_y_at_6}"") ``` #### Using Scikit-Learn Library If you prefer using scikit-learn which provides more flexibility with different types of models including Linear Regression, here’s how you would implement this: ```python import numpy as np from sklearn.linear_model import LinearRegression # Convert lists into arrays if they aren't already if not isinstance(X, np.ndarray):  X = np.array(X).reshape(-1, 1) # Reshape because SkLearn expects input in columns format even though we have only one feature. 
if not isinstance(Y, np.ndarray):  Y = np.array(Y) # Initialize and fit the model model = LinearRegression() model.fit(X, Y) # Get coefficients coefficients = model.coef_ intercept = model.intercept_ print(""Coefficients:"", coefficients[0]) # Slope print(""Intercept:"", intercept) # Making predictions new_X = [[7]] # New sample points where you want to make predictions predictions = model.predict(new_X) print(""Prediction for new X=[7]:"", predictions[0][0]) ``` Both scripts will output similar results showing the slope and intercept from your linear regression analysis along with predicted values outside the original range specified by your training set.","import matplotlib.pyplot as plt import numpy as np from sklearn.linear_model import LinearRegression X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1) y = np.array([1, 4, 9, 16, 25]) # Build the model model = LinearRegression() model.fit(X, y) # Plot the results plt.scatter(X, y) plt.plot(X, model.predict(X)) plt.title(""Linear Regression Model"") plt.xlabel(""x"") plt.ylabel(""y"") plt.show()"


## 4. Define a PEFT Model

[LoRA (Low Rank Adaptation)](https://github.com/microsoft/LoRA) is a leading method for resource-efficient fine-tuning that reduces the number of trainable parameters through matrix decomposition. Let `W'` represent the final weight matrix from fine-tuning. In LoRA, `W'` is expressed as the sum of the original weight and its update, i.e., `W + ΔW`, and the delta part is then decomposed into two low-dimensional matrices, i.e., `ΔW ≈ AB`. Suppose `W` is `m`x`m`, and we select a smaller `r` for the rank of `A` and `B`, where `A` is `m`x`r` and `B` is `r`x`m`. The original trainable parameters, which are quadratic in the size of `W` (i.e., `m^2`), become `2mr` after decomposition. Empirically, we can choose a much smaller number for `r`, e.g., 32 or 64, compared to the full weight matrix size, which significantly reduces the number of parameters to train.
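To get a feel for the savings, the parameter counts `m^2` and `2mr` can be compared directly. The sketch below uses a hypothetical `m = 4096` (not the actual dimension of any particular Qwen2.5 layer) together with `r = 64`; the function name is also made up for illustration.

```python
def lora_param_counts(m: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one m x m weight."""
    full = m * m       # every entry of W is trainable in full fine-tuning
    lora = 2 * m * r   # A is m x r and B is r x m
    return full, lora


full, lora = lora_param_counts(m=4096, r=64)
print(f"full: {full:,}  LoRA: {lora:,}  LoRA/full: {lora / full:.4f}")
```

The ratio simplifies to `2r / m`, so the relative saving grows as the weight matrix gets larger or the rank gets smaller.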

Although the mathematics behind LoRA is intricate, [PEFT](https://huggingface.co/docs/peft/en/index) helps us by simplifying the process of applying LoRA to the pretrained Transformer model.

In the next cell, we create a [LoraConfig](https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/config.py) with various settings for LoRA. These hyperparameters might need optimization to achieve the best model performance for your specific task. **MLflow** facilitates this process by tracking these hyperparameters, the associated model, and its outcomes.

In [6]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Enable gradient checkpointing to reduce activation memory during training (at the cost of extra compute)
model.gradient_checkpointing_enable()

# This is the rank of the decomposed matrices A and B to be learned during fine-tuning. A smaller number will save more GPU memory but might result in worse performance.
lora_r = 64
# This is the coefficient for the learned ΔW factor, so a larger number will typically result in a larger behavior change after fine-tuning.
lora_alpha = 32
# Dropout ratio for the layers in LoRA adapters A and B.
lora_dropout = 0.1
# Bias parameters to train. 'none' is recommended to keep the original model performing equally when turning off the adapter.
bias = "none"

# We apply LoRA to the attention projections (q, k, v) and the MLP projections (up, down, gate). The LoRA matrices are still only a small fraction of the whole model.
target_modules = [
    "k_proj",
    "q_proj",
    "v_proj",
    "up_proj",
    "down_proj",
    "gate_proj",
]
# These modules are not wrapped with LoRA; PEFT instead trains them in full and saves them alongside the adapter.
modules_to_save = [
    "embed_tokens",
    "input_layernorm",
    "post_attention_layernorm",
    "norm",
]

peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=target_modules,
    modules_to_save=modules_to_save,
    bias=bias,
)

**That's it!!!** PEFT has made the LoRA setup super easy.

An additional bonus is that the PEFT model exposes the same interfaces as a Transformers model. This means that everything from here on is quite similar to the standard model training process using Transformers.
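To make the roles of `r` and `lora_alpha` concrete, here is a toy sketch in plain Python that follows the notation used earlier (`W` is `m`x`m`, `A` is `m`x`r`, `B` is `r`x`m`). It assumes the standard LoRA scaling in which the learned update is multiplied by `lora_alpha / r`. The tiny matrices and helper names are hypothetical, and PEFT's internal tensor names and shapes may differ.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_effective_weight(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (A @ B); W itself is never modified."""
    scale = lora_alpha / r
    delta = matmul(A, B)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight, m = 2
A = [[1.0], [2.0]]            # m x r, with r = 1
B = [[0.5, 0.0]]              # r x m
W_eff = lora_effective_weight(W, A, B, lora_alpha=2, r=1)
print(W_eff)
```

Under this assumption, the values used in this tutorial (`lora_alpha = 32`, `lora_r = 64`) would give a scale of 0.5.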

## 5. Kick-off a Training Job

Similar to conventional Transformers training, we'll first set up a Trainer object to organize the training iterations. There are numerous hyperparameters to configure, but MLflow will manage them on your behalf.

To enable MLflow logging, you can specify `report_to="mlflow"` and name your training trial with the `run_name` parameter. This action initiates an [MLflow run](https://mlflow.org/docs/latest/tracking.html#runs) that automatically logs training metrics, hyperparameters, configurations, and the trained model. 

In [7]:
from datetime import datetime
from transformers import TrainingArguments
from trl import SFTTrainer

import mlflow
mlflow.enable_system_metrics_logging()
mlflow.set_experiment("Finetuning LLMs with MLFlow")

try:
    mlflow.start_run(run_name=f"{base_model_id}-demo-LoRA-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}")
    run = mlflow.active_run()

    max_seq_length = 2048
    output_dir = "./demo-results"
    max_grad_norm = 0.3
    warmup_ratio = 0.1
    lr_scheduler_type = "cosine"

    training_arguments = TrainingArguments(
        # Set this to mlflow for logging your training
        report_to="mlflow",
        # Name the MLflow run
        run_name=run.info.run_name,
        # Replace with your output destination
        output_dir=output_dir,
        # For the following arguments, refer to https://huggingface.co/docs/transformers/main_classes/trainer
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        gradient_accumulation_steps=4,
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs = {"use_reentrant": False},
        optim="paged_adamw_32bit",
        bf16=True,
        learning_rate=2e-5,
        lr_scheduler_type=lr_scheduler_type,
        max_grad_norm=max_grad_norm,
        num_train_epochs=3,
        logging_steps=10,
        warmup_ratio=warmup_ratio,
        # https://discuss.huggingface.co/t/training-llama-with-lora-on-multiple-gpus-may-exist-bug/47005/3
        ddp_find_unused_parameters=False,
        group_by_length=True,
        eval_strategy="steps",
        eval_steps=20,
    )

    # log datasets to the same run
    mlflow.log_input(
        mlflow.data.huggingface_dataset.from_huggingface(
            dataset,
            path=dataset_name
        ), 
        context="train"
    )

    trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset.take(1000),
        eval_dataset=test_dataset.take(1000),
        peft_config=peft_config,
        dataset_text_field="prompt",
        max_seq_length=max_seq_length,
        tokenizer=tokenizer,
        args=training_arguments,
    )
except Exception as e:
    print(e)
    mlflow.end_run()


2024/12/18 09:34:03 INFO mlflow.system_metrics.system_metrics_monitor: Started monitoring system metrics.

Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Training may take several hours, depending on your hardware and on how much of the dataset you use. However, the primary objective of this tutorial is to acquaint you with the process of fine-tuning using PEFT and MLflow, rather than to build a highly capable code assistant. If you don't care much about the model performance, you may specify a smaller number of steps or interrupt the following cell to proceed with the rest of the notebook.
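To judge how long a run will take, it helps to estimate the number of optimizer steps implied by the settings above (1,000 training samples, a per-device batch size of 8, 4 gradient accumulation steps, 3 epochs, one GPU). The sketch below is an approximation; the exact rounding rules are an assumption and may differ between Transformers versions, and the function name is hypothetical.

```python
import math


def estimate_optimizer_steps(num_samples, per_device_batch_size, grad_accum_steps,
                             num_epochs, num_devices=1):
    """Approximate the number of optimizer updates for a Trainer-style loop."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch_size * num_devices))
    # Assumption: leftover micro-batches at the end of an epoch are not counted as a full update
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(num_epochs * updates_per_epoch)


total_steps = estimate_optimizer_steps(
    num_samples=1000, per_device_batch_size=8, grad_accum_steps=4, num_epochs=3
)
warmup_steps = math.ceil(total_steps * 0.1)  # warmup_ratio = 0.1
print(total_steps, warmup_steps)
```

Each optimizer step in this setup processes 8 x 4 = 32 samples, so reducing `num_train_epochs` or the number of samples shortens the run proportionally.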

In [8]:
try:
    trainer.train()
finally:
    mlflow.end_run()


Step,Training Loss,Validation Loss
20,0.7848,0.68001
40,0.5987,0.568557
60,0.5516,0.55223
80,0.5367,0.549905


Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
2024/12/18 09:37:54 INFO mlflow.tracking._tracking_service.client: 🏃 View run Qwen/Qwen2.5-7B-demo-LoRA-2024-12-18-09-34-1734514443 at: https://tracking.mlflow-e00rhqs1bwevnqy5wj.backbone-e00ffdgj3ybad7mxrx.msp.eu-north1.nebius.cloud/#/experiments/25/runs/415a78c599c54e218f4212bc426028b1.

Retrieve information about the training dataset that was logged to the run:

In [9]:
run = mlflow.get_run(mlflow.last_active_run().info.run_id)
dataset_info = run.inputs.dataset_inputs[0].dataset
print(f"Dataset name: {dataset_info.name}")
print(f"Dataset digest: {dataset_info.digest}")
print(f"Dataset profile: {dataset_info.profile}")
print(f"Dataset schema: {dataset_info.schema}")


Dataset name: dataset
Dataset digest: 2bfa6908
Dataset profile: {"num_rows": 18612, "dataset_size": 25180782, "size_in_bytes": 36537858}
Dataset schema: {"mlflow_colspec": [{"type": "string", "name": "instruction", "required": true}, {"type": "string", "name": "input", "required": true}, {"type": "string", "name": "output", "required": true}, {"type": "string", "name": "prompt", "required": true}]}


In [10]:
dataset_source = mlflow.data.get_source(dataset_info)
dataset_source.to_dict()

{'path': 'iamtarun/python_code_instructions_18k_alpaca',
 'config_name': 'default',
 'data_dir': None,
 'data_files': None,
 'split': 'train',
 'revision': None}

## 6. Save the PEFT Model to MLflow

Hooray! We have successfully fine-tuned the `Qwen2.5-7B` model into a Python coding assistant. Before concluding the training, one final step is to save the trained PEFT model to MLflow.

### Set Prompt Template and Default Inference Parameters (optional)

An LLM's prediction behavior is defined not only by the model weights but also, to a large extent, by the prompt and inference parameters such as `max_new_tokens` and `repetition_penalty`. Therefore, it is highly advisable to save this metadata along with the model, so that you can expect consistent behavior when loading the model later.

#### Prompt Template
The user prompt itself is free text, but you can constrain the input by applying a 'template'. The MLflow Transformers flavor supports saving a prompt template with the model and applies it automatically before prediction. This also allows you to hide the system prompt from model clients. To save the prompt template, we have to define a single string that contains the `{prompt}` variable, and pass it to the `prompt_template` argument of the [mlflow.transformers.log_model](https://mlflow.org/docs/latest/python_api/mlflow.transformers.html#mlflow.transformers.log_model) API. Refer to [Saving Prompt Templates with Transformer Pipelines](https://mlflow.org/docs/latest/llms/transformers/guide/index.html#saving-prompt-templates-with-transformer-pipelines) for more detailed usage of this feature.
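Conceptually, applying a template amounts to substituting the caller's text into the `{prompt}` placeholder. The sketch below illustrates this with plain Python string formatting; the short template and the sample request are hypothetical, and how MLflow performs the substitution internally is an assumption here.

```python
# Hypothetical, shortened template with a single {prompt} placeholder
example_template = """Answer with Python code only.

{prompt}

### Output:
"""

# What a model client would send: only the instruction and input parts
user_input = """### Instruction:
Reverse a string.

### Input:
hello"""

# Assumption: the placeholder is filled via ordinary string formatting
final_prompt = example_template.format(prompt=user_input)
print(final_prompt)
```

The client never sees or supplies the surrounding text, which is what makes it possible to keep the system prompt out of the client-facing interface.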

In [11]:
# Basically the same format as we applied to the dataset. However, the template only accepts {prompt} variable so both instruction and input need to be fed in there.
prompt_template = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

{prompt}

### Output:
"""

#### Inference Parameters

Inference parameters can be saved with an MLflow model as part of the [Model Signature](https://mlflow.org/docs/latest/model/signatures.html). The signature defines the model input and output format along with additional parameters passed to the model prediction, and you can let MLflow infer it from some sample input using the [mlflow.models.infer_signature](https://mlflow.org/docs/latest/python_api/mlflow.models.html#mlflow.models.infer_signature) API. If you pass concrete values for the parameters, MLflow treats them as default values and applies them at inference time if they are not provided by users. For more details about the Model Signature, please refer to the [MLflow documentation](https://mlflow.org/docs/latest/model/signatures.html).

In [12]:
from mlflow.models import infer_signature

sample = train_dataset[42]
_prompt_prompt = """### Instruction:
{instruction}

### Input:
{input}"""


# MLflow infers schema from the provided sample input/output/params
signature = infer_signature(
    model_input=_prompt_prompt.format(
        instruction=sample["instruction"],
        input=sample["input"]
    ),
    model_output=sample["output"],
    # Parameters are saved with default values if specified
    params={"max_new_tokens": 2048, "repetition_penalty": 1.15, "return_full_text": False},
)
signature

inputs: 
  [string (required)]
outputs: 
  [string (required)]
params: 
  ['max_new_tokens': long (default: 2048), 'repetition_penalty': double (default: 1.15), 'return_full_text': boolean (default: False)]
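The effect of saving default values can be pictured as a simple overlay: parameters supplied by the caller win, and anything omitted falls back to the saved default. The following is an illustrative sketch only; `resolve_params` is a hypothetical helper, not an MLflow function, and MLflow's own validation logic is not reproduced here.

```python
DEFAULT_PARAMS = {
    "max_new_tokens": 2048,
    "repetition_penalty": 1.15,
    "return_full_text": False,
}


def resolve_params(user_params=None, defaults=DEFAULT_PARAMS):
    """Overlay caller-supplied params on top of the saved defaults."""
    return {**defaults, **(user_params or {})}


# The caller overrides one value; the other two keep their defaults
print(resolve_params({"max_new_tokens": 256}))
```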

### Save the PEFT Model to MLflow
Finally, we will call [mlflow.transformers.log_model](https://mlflow.org/docs/latest/python_api/mlflow.transformers.html#mlflow.transformers.log_model) API to log the model to MLflow. A few critical points to remember when logging a PEFT model to MLflow are:

1. **MLflow logs the Transformer model as a [Pipeline](https://huggingface.co/docs/transformers/en/main_classes/pipelines).** A pipeline bundles a model with its tokenizer (or other components, depending on the task type) and simplifies the prediction steps into an easy-to-use interface, making it an excellent tool for ensuring reproducibility. In the code below, we pass the model and tokenizer as a dictionary, then MLflow automatically deduces the correct pipeline type and saves it.
2. **MLflow does not save the base model weight for the PEFT model**. When executing `mlflow.transformers.log_model`, MLflow only saves the small number of trained parameters, i.e., the PEFT adapter. For the base model, MLflow instead records a reference to the HuggingFace hub (repository name and commit hash), and downloads the base model weights on the fly when loading the PEFT model. This approach significantly reduces storage usage and logging latency; for instance, the logged artifacts size in this tutorial is about 1GB, while the full `Qwen2.5-7B` model is about 20GB.
3. **Save a tokenizer without padding**. During fine-tuning, we applied padding to the dataset to standardize the sequence length in a batch. However, padding is no longer necessary at inference, so we save a different tokenizer without padding. This ensures the loaded model can be used for inference immediately.

**Note**: Currently, manual logging is required for the PEFT adapter and config, while other information, such as dataset, metrics, Trainer parameters, etc., are automatically logged. However, this process may be automated in future versions of MLflow and Transformers.

In [13]:
import mlflow

# Get the ID of the MLflow Run that was automatically created above
last_run_id = mlflow.last_active_run().info.run_id

# Save a tokenizer without padding because it is only needed for training
tokenizer_no_pad = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True)

# Make sure no MLflow run is still active (e.g. if you interrupted the training) before reopening the run below
mlflow.end_run()

with mlflow.start_run(run_id=last_run_id):
    mlflow.log_params(peft_config.to_dict())
    mlflow.transformers.log_model(
        transformers_model={"model": trainer.model, "tokenizer": tokenizer_no_pad},
        prompt_template=prompt_template,
        signature=signature,
        artifact_path="model",  # This is a relative path to save model files within MLflow run
    )

2024/12/18 09:47:46 INFO mlflow.system_metrics.system_metrics_monitor: Started monitoring system metrics.
2024/12/18 09:47:47 INFO mlflow.transformers: Overriding save_pretrained to False for PEFT models, following the Transformers behavior. The PEFT adaptor and config will be saved, but the base model weights will not and reference to the HuggingFace Hub repository will be logged instead.
2024/12/18 09:47:50 INFO mlflow.transformers: Skipping saving pretrained model weights to disk as the save_pretrained is set to False. The reference to HuggingFace Hub repository Qwen/Qwen2.5-7B will be logged instead.
2024/12/18 09:47:50 INFO mlflow.transformers: text-generation pipelines saved with prompt templates have the `return_full_text` pipeline kwarg set to False by default. To override this behavior, provide a `model_config` dict with `return_full_text` set to `True` when saving the model.
2024/12/18 09:48:48 INFO mlflow.tracking._tracking_service.client: 🏃 View run Qwen/Qwen2.5-7B-demo-LoR

### What's Logged to MLflow?

Let's briefly review what is logged/saved to MLflow as a result of your training. Select the experiment "Finetuning LLMs with MLFlow" in the MLflow UI on the left side. Then click on the latest MLflow Run named `Qwen/Qwen2.5-7B-demo-LoRA-...` to view the Run details.

#### Parameters

The `Parameters` section displays hundreds of parameters specified for the Trainer and LoraConfig such as `learning_rate`, `r`, `lora_alpha`. It also includes default parameters that were not explicitly specified, which is crucial for ensuring reproducibility, especially if the library's default values change.

#### Metrics
The `Metrics` section presents the model metrics collected during the run, such as `train_loss`. You can visualize these metrics with various types of graphs in the "Chart" tab. 

#### Artifacts
The `Artifacts` section displays the files/directories saved in MLflow as a result of training. For Transformers PEFT training, you should see the following files/directories:

```

    model/
      ├─ peft/
      │  ├─ adapter_config.json       # JSON file of the LoraConfig
      │  ├─ adapter_model.safetensors # The weight file of the LoRA adapter
      │  └─ README.md                 # Empty README file generated by Transformers
      │
      ├─ LICENSE.txt                  # License information about the base model (Qwen2.5-7B)
      ├─ MLModel                      # Contains various metadata about your model
      ├─ conda.yaml                   # Dependencies to create conda environment
      ├─ model_card.md                # Model card text for the base model
      ├─ model_card_data.yaml         # Model card data for the base model
      ├─ python_env.yaml              # Dependencies to create Python virtual environment
      └─ requirements.txt             # Pip requirements for model inference

```

#### Model Metadata

In the MLModel file, you can see that detailed metadata about the PEFT and base models is saved.
Here is an excerpt of the MLModel file (some fields are omitted for simplicity):

```
flavors:
  transformers:
    peft_adaptor: peft                                 # Points to the location of the saved PEFT model
    pipeline_model_type: Qwen2ForCausalLM              # The base model implementation
    source_model_name: Qwen/Qwen2.5-7B                 # Repository name of the base model
    source_model_revision: xxxxxxx                     # Commit hash in the repository for the base model
    task: text-generation                              # Pipeline type
    torch_dtype: torch.bfloat16                        # Dtype for loading the model
    tokenizer_type: Qwen2TokenizerFast                 # Tokenizer implementation

# Prompt template saved with the model above
metadata:
  prompt_template: 'Below is an instruction that describes a task. Write a response
    that appropriately completes the request.


    {prompt}


    ### Output:

    '
# Defines the input and output format of the model, with additional inference parameters with default values
signature:
  inputs: '[{"type": "string", "required": true}]'
  outputs: '[{"type": "string", "required": true}]'
  params: '[{"name": "max_new_tokens", "type": "long", "default": 2048, "shape": null},
    {"name": "repetition_penalty", "type": "double", "default": 1.15, "shape": null},
    {"name": "return_full_text", "type": "boolean", "default": false, "shape": null}]'
```
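
Note that the `params` entry of the signature is stored as a JSON string. As a small illustration, the snippet below decodes the string from the excerpt above into a dictionary of default inference parameters, using only the standard library. The variable names `PARAMS_JSON` and `defaults` are hypothetical and not part of MLflow.

```python
import json

# The `params` value copied from the MLModel excerpt above.
PARAMS_JSON = (
    '[{"name": "max_new_tokens", "type": "long", "default": 2048, "shape": null},'
    ' {"name": "repetition_penalty", "type": "double", "default": 1.15, "shape": null},'
    ' {"name": "return_full_text", "type": "boolean", "default": false, "shape": null}]'
)

# Map each parameter name to the default used when the caller passes nothing.
defaults = {p["name"]: p["default"] for p in json.loads(PARAMS_JSON)}
print(defaults)
# {'max_new_tokens': 2048, 'repetition_penalty': 1.15, 'return_full_text': False}
```

These are the values that apply when `predict()` is called without explicit parameters.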


## 7. Load the Saved PEFT Model from MLflow

Finally, let's load the model logged in MLflow and evaluate its performance as a Python code assistant. There are two ways to load a Transformers model in MLflow:

1. Use [mlflow.transformers.load_model()](https://mlflow.org/docs/latest/python_api/mlflow.transformers.html#mlflow.transformers.load_model). This method returns a native Transformers pipeline instance.
2. Use [mlflow.pyfunc.load_model()](https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.load_model). This method returns an MLflow `PyFuncModel` instance that wraps the Transformers pipeline, offering additional features over the native pipeline, such as (1) a unified `predict()` API for inference, (2) model signature enforcement, and (3) automatic application of the prompt template and default parameters if they were saved. Please note that not all Transformers pipelines are supported for pyfunc loading; refer to the [MLflow documentation](https://mlflow.org/docs/latest/llms/transformers/guide/index.html#supported-transformers-pipeline-types-for-pyfunc) for the full list of supported pipeline types.
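
To make the difference concrete, here is a rough sketch of what the first option involves. With the native pipeline, nothing is applied for you, so the prompt template and the inference parameters must be supplied by hand. The sketch assumes `pipeline` is the text-generation pipeline returned by `mlflow.transformers.load_model("runs:/<run_id>/model")` and that the template is filled in with `str.format`; `PROMPT_TEMPLATE` and `generate_with_native_pipeline` are hypothetical names, and the values are copied from the MLModel excerpt above.

```python
# Prompt template as stored in the MLModel metadata (copied by hand here).
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. Write a response "
    "that appropriately completes the request.\n\n{prompt}\n\n### Output:\n"
)


def generate_with_native_pipeline(pipeline, user_prompt):
    """Run a Transformers text-generation pipeline with manual templating.

    `pipeline` is assumed to be a native text-generation pipeline, e.g. the
    object returned by mlflow.transformers.load_model().
    """
    full_prompt = PROMPT_TEMPLATE.format(prompt=user_prompt)
    outputs = pipeline(
        full_prompt,
        max_new_tokens=2048,       # same values as the signature defaults
        repetition_penalty=1.15,
        return_full_text=False,    # return only the completion, not the prompt
    )
    # A text-generation pipeline returns a list of dicts, one per sequence.
    return outputs[0]["generated_text"]
```

With the pyfunc flavor, all of this bookkeeping is handled inside `predict()`.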

The first option is preferable if you wish to use the model via the native Transformers interface. The second option offers a simplified and unified interface across different model types and is particularly useful for model testing before production deployment. In the following code, we will use [mlflow.pyfunc.load_model()](https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.load_model) to show how it applies the prompt template and the default inference parameters defined above.


**NOTE**: Invoking `load_model()` loads a new model instance onto your GPU, which may exceed GPU memory limits and trigger an Out Of Memory (OOM) error, or cause the Transformers library to attempt to offload parts of the model to other devices or disk. This offloading can lead to issues, such as "ValueError: We need an `offload_dir` to dispatch this model according to this `device_map`." If you encounter this error, consider restarting the Python Kernel and loading the model again.

**CAUTION**: Restarting the Python Kernel will erase all intermediate states and variables from the above cells. Ensure that the trained PEFT model is properly logged in MLflow before restarting.


In [14]:
# You can find the ID of run in the Run detail page on MLflow UI
mlflow_model = mlflow.pyfunc.load_model("runs:/415a78c599c54e218f4212bc426028b1/model")

Downloading artifacts: 100%|██████████| 10/10 [00:56<00:00,  5.60s/it]  
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.28it/s]
Some parameters are on the meta device because they were offloaded to the cpu.


In [17]:
# We only pass the instruction and input, since the system prompt is added by the prompt template.
test_prompt = """
### Instruction:
Develop a Python program to convert the following Fahrenheit value to Celsius.

### Input:
212
"""

# Inference parameters like max_new_tokens are set to the default values specified in the Model Signature
generated_query = mlflow_model.predict(test_prompt)[0]
display_table({"prompt": test_prompt, "generated_output": generated_query})



| | prompt | generated_output |
|---|---|---|
| 0 | ### Instruction: Develop a Python program to convert the following Fahrenheit value to Celsius. ### Input: 212 | `def fahrenheit_to_celsius(f):  c = (f - 32) * 5/9  return round(c, 2) print(fahrenheit_to_celsius(212))` |


In [18]:
def fahrenheit_to_celsius(f):
    c = (f - 32) * 5/9
    return round(c, 2)

print(fahrenheit_to_celsius(212))

100.0


Perfect! The fine-tuned model now generates correct Python code. As the code and result above show, the system prompt and default inference parameters are applied automatically, so we don't have to pass them to the loaded model. This is especially powerful when you want to deploy multiple models (or update an existing model) with different system prompts or parameters, because you don't have to edit the client implementation: those details are abstracted behind the MLflow model :)
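
The defaults can still be overridden per request. The sketch below assumes `model` is the object returned by `mlflow.pyfunc.load_model()` and relies on the `params` argument of its `predict()` method; only parameters declared in the model signature (here `max_new_tokens`, `repetition_penalty`, and `return_full_text`) are expected to be accepted. The wrapper name `predict_with_overrides` is hypothetical.

```python
def predict_with_overrides(model, prompt, **overrides):
    """Call predict() on an MLflow pyfunc model, overriding selected defaults.

    Parameters that are not passed keep the default values stored in the
    model signature. With no overrides, this is a plain predict() call.
    """
    return model.predict(prompt, params=overrides or None)
```

For instance, `predict_with_overrides(mlflow_model, test_prompt, max_new_tokens=256)` would cap the response length for a single call while leaving the other defaults untouched.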

## Conclusion

In this tutorial, you learned how to fine-tune a large language model with LoRA for a Python coding assistant task using PEFT. You also learned the key role MLflow plays in the LLM fine-tuning process: it tracks parameters and metrics during fine-tuning and manages models and other assets.