# Fine-tuning and Optimizing a Student Model

This notebook demonstrates how to fine-tune a small "student" model using the training data we generated from our "teacher" model in the previous notebook. We'll also optimize the model for efficient deployment.

![](../../lab_manual/images/step-2.png)

## What You'll Learn

- How to use Microsoft Olive to fine-tune the Phi-4-mini model
- How to apply Low-Rank Adaptation (LoRA) for efficient fine-tuning
- How to convert a model to ONNX format for optimization
- How to apply quantization to reduce model size
- How to prepare the model for deployment in resource-constrained environments

## Prerequisites

- Completed the previous notebook (`01.AzureML_Distillation.ipynb`)
- Generated training data in `data/train_data.jsonl`
- Access to Azure ML with the Phi-4-mini model in the registry
- Python environment with necessary libraries (which we'll install)

## Initial Setup

Before we begin, make sure you've completed these steps:

1. **Azure Login**: Run `az login --use-device-code` in a terminal to authenticate with Azure

2. **Kernel Selection**: Select the "Python 3.10 PyTorch and Tensorflow" kernel from the dropdown in the top-right corner. This kernel has most of the dependencies we need pre-installed.

3. **Check Environment**: Ensure your `local.env` file is in the same directory as this notebook
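
A quick way to confirm the file is in place is sketched below. It assumes `local.env` holds plain `KEY=VALUE` lines; the helper name `parse_env_text` is illustrative and not part of the lab.

```python
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


env_path = Path("local.env")  # assumed to sit next to this notebook
if env_path.exists():
    settings = parse_env_text(env_path.read_text(encoding="utf-8"))
    # Print only the key names so secrets never land in the notebook output.
    print(f"Found {len(settings)} setting(s): {sorted(settings)}")
else:
    print("local.env not found; create it before continuing.")
```

Printing only the key names keeps credentials out of the saved notebook.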

## 1. Install Authentication Packages

Here we install the package needed to work with Azure services:

- **azure-ai-ml**: The Azure ML SDK for working with Azure Machine Learning

The `-U` flag upgrades the package to the latest available version.

In [24]:
# Install required packages for authentication
! pip install azure-ai-ml -U

## 2. Install PyTorch

Here we install PyTorch, the deep learning framework we'll use for fine-tuning. This command installs:

- **torch**: The core PyTorch library for neural networks and tensor operations
- **torchvision**: Companion library for computer vision tasks (installed alongside `torch`)
- **torchaudio**: Companion library for audio processing tasks (installed alongside `torch`)

We install from the PyTorch wheel index (`https://download.pytorch.org/whl/cu124`) to get builds compiled against CUDA 12.4 for NVIDIA GPUs. The `-U` flag upgrades to the latest version available on that index.

In [3]:
! pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 -U

Looking in indexes: https://download.pytorch.org/whl/cu124
Collecting torch
  Obtaining dependency information for torch from https://download.pytorch.org/whl/cu124/torch-2.6.0%2Bcu124-cp310-cp310-linux_x86_64.whl.metadata
  Downloading https://download.pytorch.org/whl/cu124/torch-2.6.0%2Bcu124-cp310-cp310-linux_x86_64.whl.metadata (28 kB)
Collecting torchvision
  Obtaining dependency information for torchvision from https://download.pytorch.org/whl/cu124/torchvision-0.21.0%2Bcu124-cp310-cp310-linux_x86_64.whl.metadata
  Downloading https://download.pytorch.org/whl/cu124/torchvision-0.21.0%2Bcu124-cp310-cp310-linux_x86_64.whl.metadata (6.1 kB)
Collecting torchaudio
  Obtaining dependency information for torchaudio from https://download.pytorch.org/whl/cu124/torchaudio-2.6.0%2Bcu124-cp310-cp310-linux_x86_64.whl.metadata
  Downloading https://download.pytorch.org/whl/cu124/torchaudio-2.6.0%2Bcu124-cp310-cp310-linux_x86_64.whl.metadata (6.6 kB)
Downloading https://download.pytorch.org/whl
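
Once the install has finished, a short check like the one below can confirm that the CUDA build of PyTorch is the one in use. This is a sketch; the helper name `cuda_report` is illustrative, and it returns a small dictionary instead of failing when `torch` is missing.

```python
def cuda_report() -> dict:
    """Summarize the PyTorch install without failing if torch is missing."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only wheels
        "gpu_visible": torch.cuda.is_available(),
    }


print(cuda_report())
```

If `gpu_visible` is `False` on a GPU machine, the kernel may need a restart so that the newly installed wheel is picked up.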

## 3. Install Microsoft Olive

Now we install Microsoft Olive, an open-source model optimization toolkit that is the main tool for our fine-tuning and optimization process. The `[auto-opt]` extra pulls in the additional dependencies needed for automatic optimization.

Olive provides:
- Model fine-tuning capabilities
- ONNX conversion tools
- Quantization for model compression
- Performance optimization for various hardware targets

This powerful tool will help us efficiently fine-tune our model and prepare it for deployment on resource-constrained devices.

In [4]:
! pip install olive-ai[auto-opt] -U

Collecting olive-ai[auto-opt]
  Downloading olive_ai-0.9.1-py3-none-any.whl (688 kB)
Collecting onnxscript>=0.2.5
  Downloading onnxscript-0.2.5-py3-none-any.whl (689 kB)
Collecting optuna
  Downloading optuna-4.3.0-py3-none-any.whl (386 kB)
Collecting torchmetrics>=1.0.0
  Downloading torchmetrics-1.7.1-py3-none-any.whl (961 kB)
Collecting optimum
  Downloading optimum-1.25.3-py3-none-any.whl (429 kB)

## 4. Verify Olive Installation

We'll now check the installed version of Olive to ensure it installed correctly. This command shows:
- The package name
- The installed version number
- Where the package is installed
- The package's dependencies

Confirming the version is important as different versions of Olive may have different features or requirements.

In [5]:
! pip show olive-ai

Name: olive-ai
Version: 0.9.1
Summary: Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs.
Home-page: https://microsoft.github.io/Olive/
Author: Microsoft Corporation
Author-email: olivedevteam@microsoft.com
License: MIT License
Location: /anaconda/envs/azureml_py38/lib/python3.10/site-packages
Requires: numpy, onnx, onnxscript, optuna, pandas, pydantic, pyyaml, torch, torchmetrics, transformers
Required-by: 
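
The same information is available from Python through the standard library, which is handy when a later cell should react to the installed version. The helper name `describe_package` below is illustrative.

```python
from importlib import metadata


def describe_package(name: str):
    """Return (version, requirements) for an installed package, or None."""
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        return None
    # metadata.requires() returns None when no requirements are declared.
    return version, metadata.requires(name) or []


info = describe_package("olive-ai")
if info is None:
    print("olive-ai is not installed in this kernel")
else:
    version, requirements = info
    print(f"olive-ai {version} with {len(requirements)} declared requirement(s)")
```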


## 5. Install ONNX Runtime GenAI

Next, we install ONNX Runtime GenAI, a library built on top of ONNX Runtime for running generative AI models. This package will allow us to:

- Run our optimized model efficiently
- Leverage specialized optimizations for transformer models
- Work with fine-tuned adapters at inference time

We pin version 0.7.1. The `--pre` flag additionally allows pip to pick pre-release versions while resolving the install. Later notebooks will use this package to run inference with our optimized model. Because the install pulls in `onnxruntime` 1.22.0 as a dependency, the cell after it pins `onnxruntime` back to 1.21.1.

In [6]:
! pip install onnxruntime-genai==0.7.1 --pre

Collecting onnxruntime-genai==0.7.1
  Downloading onnxruntime_genai-0.7.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (1.7 MB)
Collecting onnxruntime>=1.21.0
  Downloading onnxruntime-1.22.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (16.4 MB)
Installing collected packages: onnxruntime, onnxruntime-genai
  Attempting uninstall: onnxruntime
    Found existing installation: onnxruntime 1.17.3
    Uninstalling onnxruntime-1.17.3:
      Successfully uninstalled onnxruntime-1.17.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
azureml-training-tabular 1.60.

In [7]:
! pip install onnxruntime==1.21.1 -U

Collecting onnxruntime==1.21.1
  Downloading onnxruntime-1.21.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (16.0 MB)
Installing collected packages: onnxruntime
  Attempting uninstall: onnxruntime
    Found existing installation: onnxruntime 1.22.0
    Uninstalling onnxruntime-1.22.0:
      Successfully uninstalled onnxruntime-1.22.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
azureml-training-tabular 1.60.0 requires onnxruntime~=1.17.3, but you have onnxruntime 1.21.1 which is incompatible.
azureml-training-tabular 1.60.0 requires psutil<5.9.4,>=5.2.2, but you have psutil 6.1.1 which is incompatible.
azureml-training-tabular 1.60.0 requires scipy<1.11.0,>=1.0.0, but you have scipy 1.11.0 which is in

## 6. Package Management

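The next few cells adjust the installed packages to avoid conflicts. We're:

1. **Uninstalling onnxruntime-gpu** to avoid conflicts with the regular onnxruntime package
2. **Installing regular onnxruntime** for CPU-based inference
3. **Installing additional dependencies**, including:
   - bitsandbytes: For efficient quantization
   - transformers (pinned to 4.49.0): For working with transformer models
   - peft: For parameter-efficient fine-tuning (LoRA)
   - tf-keras and marshmallow (pinned to 3.23.2)
4. **Pinning numpy back to 1.23.5** after the tf-keras install upgrades it to 2.1.3

accelerate, which provides optimized training, is already present in the environment (the `pip list` output below shows version 0.34.2), so no cell installs it.

Several of these cells report pip dependency-resolver conflicts with preinstalled packages such as `azureml-training-tabular` and `tensorflow`; the notebook continues past them.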

In [8]:
pip uninstall onnxruntime-gpu --yes

Note: you may need to restart the kernel to use updated packages.


In [9]:
! pip install onnxruntime



In [10]:
! pip install bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl (76.1 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.45.5


In [11]:
! pip install transformers==4.49.0 -U

Collecting transformers==4.49.0
  Downloading transformers-4.49.0-py3-none-any.whl (10.0 MB)
Collecting tokenizers<0.22,>=0.21
  Downloading tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.13.3
    Uninstalling tokenizers-0.13.3:
      Successfully uninstalled tokenizers-0.13.3
  Attempting uninstall: transformers
    Found existing installation: transformers 4.30.1
    Uninstalling transformers-4.30.1:
      Successfully uninstalled transformers-4.30.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are inst

In [12]:
! pip install azure-ai-ml -U  



In [13]:
! pip install marshmallow==3.23.2 -U   

Collecting marshmallow==3.23.2
  Downloading marshmallow-3.23.2-py3-none-any.whl (49 kB)
Installing collected packages: marshmallow
  Attempting uninstall: marshmallow
    Found existing installation: marshmallow 3.26.1
    Uninstalling marshmallow-3.26.1:
      Successfully uninstalled marshmallow-3.26.1
Successfully installed marshmallow-3.23.2


In [14]:
! pip install tf-keras

Collecting tf-keras
  Downloading tf_keras-2.19.0-py3-none-any.whl (1.7 MB)
Collecting numpy<2.2.0,>=1.26.0
  Downloading numpy-2.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.3 MB)
Collecting MarkupSafe>=2.1.1
  Downloading MarkupSafe-3.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20 kB)
Installing collected packages: numpy, MarkupSafe, tf-keras
  Attempting uninstall: numpy
    Found existing installation: numpy 1.23.5
    Uninstalling numpy-1.23.5:
      Successfully uninstalled numpy-1.23.5
  Attempting uninstall: MarkupSafe
    Found existing installation: MarkupSafe 2.0.1
    Uninstalling MarkupSafe-2.0.1:
      Successfully uninstalled MarkupSafe-2.0.1
ERROR: pip's

In [15]:
! pip install numpy==1.23.5 -U

Collecting numpy==1.23.5
  Downloading numpy-1.23.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.1.3
    Uninstalling numpy-2.1.3:
      Successfully uninstalled numpy-2.1.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pandas-ml 0.6.1 requires enum34, which is not installed.
tensorflow 2.19.0 requires numpy<2.2.0,>=1.26.0, but you have numpy 1.23.5 which is incompatible.
scikit-image 0.25.0 requires numpy>=1.24, but you have numpy 1.23.5 which is incompatible.
scikit-image 0.25.0 requires pillow>=10.1, but you have pillow 9.2.0 which is incompatible.
scikit-image 0.25.0 requires sc

In [16]:
! pip install peft

Collecting peft
  Downloading peft-0.15.2-py3-none-any.whl (411 kB)
Installing collected packages: peft
Successfully installed peft-0.15.2


In [17]:
! pip list

Package                                    Version
------------------------------------------ --------------
absl-py                                    2.2.2
accelerate                                 0.34.2
adal                                       1.2.7
adlfs                                      2024.12.0
aiofiles                                   22.1.0
aiohappyeyeballs                           2.6.1
aiohttp                                    3.11.16
aiohttp-cors                               0.8.1
aiosignal                                  1.3.2
aiosqlite                                  0.21.0
alembic                                    1.15.2
annotated-types                            0.7.0
ansicolors                                 1.1.8
antlr4-python3-runtime                     4.13.2
anyio                                      4.9.0
applicationinsights                        0.11.10
arch                                       5.6.0
argcomplete               


## 7. Fine-tune with Low-Rank Adaptation (LoRA)

This is the core command that fine-tunes our student model. We're using Microsoft Olive's fine-tuning capabilities with LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning method. Here's what each parameter does:

- **`--method lora`**: Use Low-Rank Adaptation, which adds small trainable matrices to key layers instead of updating all weights

- **`--model_name_or_path`**: The base model to fine-tune (Phi-4-mini-instruct from Azure ML registry)

- **`--trust_remote_code`**: Allow execution of code from the remote model repository

- **`--data_name json`**: The dataset loader to use; `json` reads JSON and JSON Lines files such as our training data

- **`--data_files`**: Path to our training data generated from the teacher model

- **`--text_template`**: Template for formatting inputs and outputs during training

- **`--max_steps 100`**: Train for only 100 steps to keep the lab fast; in production you would train for longer

- **`--output_path`**: Where to save the fine-tuned model and adapter

- **`--target_modules`**: Which layers to apply LoRA to (attention and feed-forward layers)

- **`--log_level 1`**: Set verbosity of logging
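
To make the `--text_template` parameter concrete, the sketch below renders one record into the string the template describes. It uses Python's `str.format` as a stand-in, so Olive's own templating may differ in detail. The sample record is made up, the `Question` and `Answer` field names are inferred from the template, and `render_examples` is an illustrative helper.

```python
import json

# Template copied from the `--text_template` argument of the fine-tuning command.
TEXT_TEMPLATE = "<|user|>{Question}<|end|><|assistant|>{Answer}<|end|>"


def render_examples(jsonl_text: str, template: str = TEXT_TEMPLATE) -> list[str]:
    """Turn JSON Lines records into the training strings the template describes."""
    rendered = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        try:
            rendered.append(template.format(**record))
        except KeyError as missing:
            raise ValueError(f"line {line_number} lacks field {missing}") from None
    return rendered


# Made-up record for illustration; real rows come from data/train_data.jsonl.
sample = '{"Question": "What is 2 + 2?", "Answer": "4"}'
print(render_examples(sample)[0])
# <|user|>What is 2 + 2?<|end|><|assistant|>4<|end|>
```

Running a helper like this over the real file before training is a cheap way to catch rows that are missing a field.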

Fine-tuning takes several minutes to complete. It produces a LoRA adapter, a small set of additional weights that captures what the student learned from the teacher-generated data, while the base model weights stay unchanged.
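
The reason LoRA is cheap can be seen by counting parameters. Instead of updating a dense weight matrix of shape `d_out x d_in`, LoRA trains two small factors of shapes `d_out x rank` and `rank x d_in`. The sizes below are illustrative only; they are not the real Phi-4-mini dimensions, and the rank is not necessarily Olive's default.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix W (d_out x d_in) that LoRA leaves frozen."""
    return d_in * d_out


# Hypothetical layer size and rank, chosen only to show the ratio.
d_in, d_out, rank = 4096, 4096, 16
lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
print(f"LoRA trains {lora:,} of {full:,} weights ({lora / full:.2%}) for this layer")
```

With these numbers the adapter holds under 1% of the layer's weights, which is why the saved adapter is small compared with the base model.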


In [19]:
! olive finetune \
    --method lora \
    --model_name_or_path  azureml://registries/azureml/models/Phi-4-mini-instruct/versions/1 \
    --trust_remote_code \
    --data_name json \
    --data_files ./data/train_data.jsonl \
    --text_template "<|user|>{Question}<|end|><|assistant|>{Answer}<|end|>" \
    --max_steps 100 \
    --output_path models/phi-4-mini/ft \
    --target_modules "q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj" \
    --log_level 1

Loading HuggingFace model from {'type': 'azureml_registry_model', 'registry_name': 'azureml', 'name': 'Phi-4-mini-instruct', 'version': '1'}
[2025-05-20 20:36:36,950] [INFO] [run.py:142:run_engine] Running workflow default_workflow
[2025-05-20 20:36:36,956] [INFO] [cache.py:138:__init__] Using cache directory: /afh/projects/cvi-lab329-h-3-d0ca370a-6510-40b5-b1dc-98d97b208684/shared/Users/cedricvidal/.olive-cache/default_workflow
[2025-05-20 20:36:37,059] [INFO] [accelerator_creator.py:217:create_accelerators] Running workflow on accelerator specs: gpu-cuda
Subtype value SAS has no mapping, use base class DataReferenceCredentialDto.
Downloading the model mlflow_model_folder at /afh/projects/cvi-lab329-h-3-d0ca370a-6510-40b5-b1dc-98d97b208684/shared/Users/cedricvidal/.olive-cache/default_workflow/resources/ee7f3578/olive_tmpb_5bcs0a/Phi-4-mini-instruct/mlflow_model_folder

Your file exceeds 100 MB. If you experience low speeds, latency, or broken connections, we recommend using the AzCop

## 8. Reinstall ONNX Runtime GenAI

Here we reinstall ONNX Runtime GenAI 0.7.1 as a precaution, because the package changes above may have altered it. The cells that follow also re-pin `onnxruntime` to 1.21.1 and pin `protobuf` to 3.20.3, so that the versions needed for the optimization step are in place.

In [20]:
! pip install onnxruntime-genai==0.7.1 --pre



In [21]:
! pip install onnxruntime==1.21.1 -U



In [22]:
! pip install protobuf==3.20.3 -U

Collecting protobuf==3.20.3
  Downloading protobuf-3.20.3-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Installing collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 4.25.6
    Uninstalling protobuf-4.25.6:
      Successfully uninstalled protobuf-4.25.6
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.19.0 requires numpy<2.2.0,>=1.26.0, but you have numpy 1.23.5 which is incompatible.
mlflow-skinny 2.21.3 requires packaging<25, but you have packaging 25.0 which is incompatible.
azureml-training-tabular 1.60.0 requires onnxruntime~=1.17.3, but you have onnxruntime 1.21.1 which is incompatible.
azureml-training-tabular 1.60.0 require
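
With this many installs and re-pins, it can help to confirm that the versions pinned in this notebook are the ones left in the environment. The sketch below compares installed versions with those pins; `find_mismatches` is an illustrative helper, and the check covers only exact pins that appear in the install cells.

```python
from importlib import metadata

# Versions pinned by the install cells in this notebook.
EXPECTED = {
    "onnxruntime-genai": "0.7.1",
    "onnxruntime": "1.21.1",
    "protobuf": "3.20.3",
    "transformers": "4.49.0",
    "numpy": "1.23.5",
    "marshmallow": "3.23.2",
}


def find_mismatches(expected, lookup=metadata.version):
    """Return {package: (wanted, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


for name, (wanted, found) in find_mismatches(EXPECTED).items():
    print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

No output means every pin matches. The `lookup` parameter exists only so the function can be exercised without touching the real environment.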

## 9. Optimize and Quantize the Model

This command uses Microsoft Olive's auto-optimization capabilities to convert our fine-tuned model to ONNX format and apply int4 quantization. Here's what each parameter does:

- **`--model_name_or_path`**: The base model from Azure ML registry

- **`--adapter_path`**: Path to our LoRA adapter created in the previous step

- **`--device cpu`**: Target the CPU for optimization (for an NVIDIA GPU you would use `--device gpu` together with `--provider CUDAExecutionProvider`)

- **`--provider CPUExecutionProvider`**: Use the CPU execution provider for ONNX Runtime

- **`--use_model_builder`**: Use the ONNX Runtime GenAI model builder to produce the optimized ONNX model

- **`--precision int4`**: Apply int4 quantization, which stores weights in 4 bits instead of 16 and so reduces model size by up to 75% compared to FP16

- **`--output_path`**: Where to save the optimized model

- **`--log_level 1`**: Set verbosity of logging

The optimization process:
1. Merges the base model with our LoRA adapter
2. Converts to ONNX format, which is more efficient for inference
3. Applies int4 quantization to dramatically reduce model size
4. Optimizes the model for CPU inference
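
The idea behind the quantization step can be shown with a toy example. The sketch below applies symmetric 4-bit quantization to a handful of weights using one shared scale. It is a generic illustration, not the algorithm Olive or the model builder uses; production quantizers typically work on blocks of weights with a separate scale per block.

```python
def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric 4-bit quantization: map floats to integers in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the 4-bit integer codes."""
    return [q * scale for q in quantized]


weights = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
print(codes)     # integer codes that fit in 4 bits
print(restored)  # close to the original weights, but not identical
```

The gap between `weights` and `restored` is the quantization error, which is where the accuracy trade-off of int4 comes from.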

This process will take several minutes to complete. The result is a much smaller model that can run on devices with limited resources, usually at the cost of a small loss in accuracy.
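
The "up to 75%" size reduction quoted for int4 follows directly from the bit widths, as the back-of-the-envelope calculation below shows. It assumes roughly 3.8 billion parameters for Phi-4-mini and counts weight storage only, ignoring quantization scales, any layers kept at higher precision, and file overhead, so real files will differ somewhat.

```python
def weight_storage_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: about 3.8 billion parameters for Phi-4-mini.
NUM_PARAMS = 3.8e9
fp16_gb = weight_storage_gb(NUM_PARAMS, 16)
int4_gb = weight_storage_gb(NUM_PARAMS, 4)
reduction = 1 - int4_gb / fp16_gb
print(f"FP16: {fp16_gb:.1f} GB, int4: {int4_gb:.1f} GB, reduction: {reduction:.0%}")
```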


In [23]:
! olive auto-opt \
    --model_name_or_path  azureml://registries/azureml/models/Phi-4-mini-instruct/versions/1 \
    --adapter_path models/phi-4-mini/ft/adapter \
    --device cpu \
    --provider CPUExecutionProvider \
    --use_model_builder \
    --precision int4 \
    --output_path models/phi-4-mini/onnx \
    --log_level 1

Loading HuggingFace model from {'type': 'azureml_registry_model', 'registry_name': 'azureml', 'name': 'Phi-4-mini-instruct', 'version': '1'}
[2025-05-20 20:44:12,127] [INFO] [run.py:142:run_engine] Running workflow default_workflow
[2025-05-20 20:44:12,134] [INFO] [cache.py:138:__init__] Using cache directory: /afh/projects/cvi-lab329-h-3-d0ca370a-6510-40b5-b1dc-98d97b208684/shared/Users/cedricvidal/.olive-cache/default_workflow
[2025-05-20 20:44:12,170] [INFO] [accelerator_creator.py:217:create_accelerators] Running workflow on accelerator specs: cpu-cpu
[2025-05-20 20:44:12,173] [INFO] [engine.py:223:run] Running Olive on accelerator: cpu-cpu
[2025-05-20 20:44:12,173] [INFO] [engine.py:864:_create_system] Creating target system ...
[2025-05-20 20:44:12,173] [INFO] [engine.py:867:_create_system] Target system created in 0.000080 seconds
[2025-05-20 20:44:12,173] [INFO] [engine.py:879:_create_system] Creating host system ...
[2025-05-20 20:44:12,173] [INFO] [engine.py:882:_create_syste