# Fine-tuning and Optimizing a Student Model

This notebook demonstrates how to fine-tune a small "student" model using the training data we generated from our "teacher" model in the previous notebook. We'll also optimize the model for efficient deployment.

![](../../lab_manual/images/step-2.png)

## What You'll Learn

- How to use Microsoft Olive to fine-tune the Phi-4-mini model
- How to apply Low-Rank Adaptation (LoRA) for efficient fine-tuning
- How to convert a model to ONNX format for optimization
- How to apply quantization to reduce model size
- How to prepare the model for deployment on resource-constrained environments

## Prerequisites

- Completed the previous notebook (`01.AzureML_Distillation.ipynb`)
- Generated training data in `data/train_data.jsonl`
- Access to Azure ML with the Phi-4-mini model in the registry
- Python environment with necessary libraries (which we'll install)

## Initial Setup

Before we begin, make sure you've completed these steps:

1. **Azure Login**: Run `az login --use-device-code` in a terminal to authenticate with Azure

2. **Kernel Selection**: Select the "Python 3.10 PyTorch and Tensorflow" kernel from the dropdown in the top-right corner. This kernel has most of the dependencies we need pre-installed.

3. **Check Environment**: Ensure your `local.env` file is in the same directory as this notebook
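Before running any installs, you can confirm that the files this notebook expects are in place. The sketch below assumes the notebook's working directory contains `local.env` and `data/train_data.jsonl`, as described above. The helper name `missing_prerequisites` is hypothetical.

```python
from pathlib import Path

# Files this notebook expects, relative to the notebook's directory
REQUIRED_FILES = ["local.env", "data/train_data.jsonl"]


def missing_prerequisites(base_dir="."):
    """Return the required files that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


missing = missing_prerequisites()
if missing:
    print("Missing prerequisite files:", ", ".join(missing))
else:
    print("All prerequisite files found.")
```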

## 1. Install Authentication Packages

Here we install the package needed to authenticate with Azure services:

- **azure-ai-ml**: The Azure ML SDK for working with Azure Machine Learning

The `-U` flag upgrades the package to the latest available version.

In [None]:
# Install required packages for authentication
! pip install azure-ai-ml -U

## 2. Install PyTorch

Here we install PyTorch, the deep learning framework we'll use for fine-tuning. This command installs:

- **torch**: The core PyTorch library for neural networks and tensor operations
- **torchvision**: PyTorch's computer vision library, installed from the same index so its build matches `torch`
- **torchaudio**: PyTorch's audio library, installed from the same index for the same reason

We install from a specific index URL (`download.pytorch.org/whl/cu124`) to get wheels built against CUDA 12.4, so PyTorch can use a compatible NVIDIA GPU. The `-U` flag upgrades any existing installation to the latest version available on that index.
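After the install (and a kernel restart, if pip upgraded a package that was already imported), you can check that PyTorch sees the GPU. This is a minimal sketch: `describe_torch` is a hypothetical helper, and it reports `installed: False` instead of failing when PyTorch is missing.

```python
import importlib.util


def describe_torch():
    """Summarize the installed PyTorch build and whether CUDA is usable."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch

    return {
        "installed": True,
        "version": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds)
        "built_with_cuda": torch.version.cuda,
        # True only if a compatible NVIDIA driver and GPU are present
        "cuda_available": torch.cuda.is_available(),
    }


print(describe_torch())
```

With the `cu124` wheels, `built_with_cuda` should report `12.4`; `cuda_available` additionally depends on the machine having a GPU and driver.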

In [None]:
! pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 -U

## 3. Install Microsoft Olive

Now we install Microsoft Olive, an open-source model optimization toolkit that will be the main tool for our fine-tuning and optimization process. The `[auto-opt]` option includes additional dependencies for automatic optimization.

Olive provides:
- Model fine-tuning capabilities
- ONNX conversion tools
- Quantization for model compression
- Performance optimization for various hardware targets

Together, these capabilities let us fine-tune the model and prepare it for deployment on resource-constrained devices with a single toolkit.

In [None]:
! pip install "olive-ai[auto-opt]" -U

## 4. Verify Olive Installation

We'll now check the installed version of Olive to ensure it installed correctly. This command shows:
- The package name
- The installed version number
- Where the package is installed
- The package's dependencies

Confirming the version is important as different versions of Olive may have different features or requirements.
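The version part of this check can also be done from Python with the standard library's `importlib.metadata`, which is handy when you want to assert on versions in code. `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("olive-ai:", installed_version("olive-ai"))
```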

In [None]:
! pip show olive-ai

## 5. Install ONNX Runtime GenAI

Next, we install ONNX Runtime GenAI, a library built on top of ONNX Runtime for running generative AI models. This package will allow us to:

- Run our optimized model efficiently
- Leverage specialized optimizations for transformer models
- Work with LoRA adapters at inference time

We pin version 0.7.1 so the notebook is reproducible; the `--pre` flag lets pip consider pre-release builds when resolving the package. The following cell also pins the core `onnxruntime` package to 1.21.1. Later notebooks will use ONNX Runtime GenAI to run inference with our optimized model.

In [None]:
! pip install onnxruntime-genai==0.7.1 --pre

In [None]:
! pip install onnxruntime==1.21.1 -U

## 6. Package Management

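The next few cells handle package management to avoid conflicts. We're:

1. **Uninstalling onnxruntime-gpu** to avoid conflicts with the regular onnxruntime package
2. **Installing regular onnxruntime** for CPU-based inference
3. **Installing and pinning additional dependencies**, including:
   - bitsandbytes: For 8-bit and 4-bit weight quantization
   - transformers (pinned to 4.49.0): For working with transformer models
   - azure-ai-ml: Reinstalled so the Azure ML SDK is still up to date after the other installs
   - marshmallow (pinned to 3.23.2) and numpy (pinned to 1.23.5): Pinned to specific versions to keep the environment consistent
   - tf-keras: The Keras 2 package that transformers needs when TensorFlow is installed alongside Keras 3
   - peft: For parameter-efficient fine-tuning (LoRA); it also pulls in accelerate for optimized training

The `pip list` cell prints every installed package so you can confirm the versions before fine-tuning.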

In [None]:
! pip uninstall onnxruntime-gpu --yes

In [None]:
! pip install onnxruntime

In [None]:
! pip install bitsandbytes

In [None]:
! pip install transformers==4.49.0 -U

In [None]:
! pip install azure-ai-ml -U  

In [None]:
! pip install marshmallow==3.23.2 -U   

In [None]:
! pip install tf-keras

In [None]:
! pip install numpy==1.23.5 -U

In [None]:
! pip install peft

In [None]:
! pip list
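`pip list` prints everything, which can be a lot to scan. The sketch below checks only the exact versions pinned so far in this notebook and flags `onnxruntime-gpu` if it is still present. `EXPECTED_PINS` simply restates those pins, and `check_environment` is a hypothetical helper; its `get_version` parameter exists only so the check can be exercised with fake data.

```python
from importlib.metadata import PackageNotFoundError, version

# Exact versions pinned by the install cells above
EXPECTED_PINS = {
    "transformers": "4.49.0",
    "marshmallow": "3.23.2",
    "numpy": "1.23.5",
    "onnxruntime": "1.21.1",
    "onnxruntime-genai": "0.7.1",
}


def _installed(package):
    """Installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_environment(get_version=_installed):
    """Return a list of human-readable problems (empty if everything matches)."""
    problems = []
    for package, expected in EXPECTED_PINS.items():
        found = get_version(package)
        if found != expected:
            problems.append(f"{package}: expected {expected}, found {found}")
    # onnxruntime and onnxruntime-gpu should not be installed side by side
    if get_version("onnxruntime-gpu") is not None:
        problems.append("onnxruntime-gpu is still installed")
    return problems


for problem in check_environment():
    print(problem)
```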

## 7. Fine-tune with Low-Rank Adaptation (LoRA)

This is the core command that fine-tunes our student model. We're using Microsoft Olive's fine-tuning capabilities with LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning method. Here's what each parameter does:

- **`--method lora`**: Use Low-Rank Adaptation, which freezes the original weights and trains small low-rank matrices added to the targeted layers

- **`--model_name_or_path`**: The base model to fine-tune (Phi-4-mini-instruct from the Azure ML registry)

- **`--trust_remote_code`**: Allow execution of custom model code shipped with the model repository

- **`--data_name json`**: Load the training data with the Hugging Face `datasets` `json` loader, which also reads JSON Lines (`.jsonl`) files

- **`--data_files`**: Path to the training data generated from the teacher model

- **`--text_template`**: Template that turns each record into a training example; `{Question}` and `{Answer}` are filled from the fields of the same name in each JSONL line

- **`--max_steps 100`**: Train for only 100 steps to keep the run short; a production run would typically use more

- **`--output_path`**: Where to save the fine-tuned model and adapter

- **`--target_modules`**: Which layers to apply LoRA to: the attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and the feed-forward projections (`gate_proj`, `up_proj`, `down_proj`)

- **`--log_level 1`**: Set logging verbosity
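To see why LoRA is parameter-efficient: for a weight matrix W of shape d_out × d_in, LoRA keeps W frozen and learns B (d_out × r) and A (r × d_in) with a small rank r, so the update BA costs r × (d_in + d_out) trainable parameters instead of d_in × d_out. The sketch below does that arithmetic; the 3072 × 3072 layer and rank 16 are illustrative values, not the shapes or defaults Olive uses for Phi-4-mini.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # B has shape (d_out, rank) and A has shape (rank, d_in)
    return rank * (d_in + d_out)


def full_trainable_params(d_in, d_out):
    """Parameters updated when fine-tuning the same matrix directly."""
    return d_in * d_out


# Illustrative layer size and rank (assumed, not taken from Olive's config)
d_in, d_out, rank = 3072, 3072, 16
lora = lora_trainable_params(d_in, d_out, rank)
full = full_trainable_params(d_in, d_out)
print(f"LoRA: {lora:,} params, full: {full:,} params ({lora / full:.2%} of full)")
```

For these values LoRA trains 98,304 parameters against 9,437,184 for the full matrix, about 1%.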

This process will take several minutes to complete. It creates a LoRA adapter that captures the knowledge our model learned from the teacher without modifying the base model weights.
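Every line of `data/train_data.jsonl` must provide the `Question` and `Answer` fields that `--text_template` refers to. The sketch below validates the file and shows one formatted example. It uses Python's `str.format` to mimic the substitution, which approximates how the template is filled rather than reproducing Olive's implementation; `format_examples` is a hypothetical helper.

```python
import json
from pathlib import Path

# The same template passed to `olive finetune --text_template`
TEXT_TEMPLATE = "<|user|>{Question}<|end|><|assistant|>{Answer}<|end|>"
REQUIRED_FIELDS = ("Question", "Answer")


def format_examples(path):
    """Validate a JSONL file and return the formatted training strings."""
    examples = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            examples.append(TEXT_TEMPLATE.format(**record))
    return examples


data_path = Path("data/train_data.jsonl")
if data_path.exists():
    examples = format_examples(data_path)
    print(f"{len(examples)} examples")
    if examples:
        print("First example:\n" + examples[0])
```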

In [None]:
! olive finetune \
    --method lora \
    --model_name_or_path azureml://registries/azureml/models/Phi-4-mini-instruct/versions/1 \
    --trust_remote_code \
    --data_name json \
    --data_files ./data/train_data.jsonl \
    --text_template "<|user|>{Question}<|end|><|assistant|>{Answer}<|end|>" \
    --max_steps 100 \
    --output_path models/phi-4-mini/ft \
    --target_modules "q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj" \
    --log_level 1
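Before moving on, you can confirm the fine-tuning run produced an adapter, since the optimization step reads it from `models/phi-4-mini/ft/adapter`. This sketch assumes the adapter is saved in the standard PEFT layout, with an `adapter_config.json` next to an `adapter_model.safetensors` or `adapter_model.bin` weights file; `adapter_files_present` is a hypothetical helper.

```python
import json
from pathlib import Path


def adapter_files_present(adapter_dir):
    """Check for a PEFT-style adapter: config plus a weights file."""
    adapter_dir = Path(adapter_dir)
    has_config = (adapter_dir / "adapter_config.json").is_file()
    has_weights = any(
        (adapter_dir / name).is_file()
        for name in ("adapter_model.safetensors", "adapter_model.bin")
    )
    return has_config and has_weights


adapter_dir = Path("models/phi-4-mini/ft/adapter")
if adapter_files_present(adapter_dir):
    config = json.loads((adapter_dir / "adapter_config.json").read_text())
    # PEFT adapter configs record the LoRA rank ("r") and the targeted modules
    print("rank:", config.get("r"), "target_modules:", config.get("target_modules"))
else:
    print(f"No adapter found in {adapter_dir}; check the fine-tuning output above.")
```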

## 8. Reinstall ONNX Runtime Packages

Here we reinstall ONNX Runtime GenAI (0.7.1) and re-pin `onnxruntime` (1.21.1) and `protobuf` (3.20.3), in case the earlier package changes altered them. This is a precautionary step so that the optimization in the next step runs against the expected versions.

In [None]:
! pip install onnxruntime-genai==0.7.1 --pre

In [None]:
! pip install onnxruntime==1.21.1 -U

In [None]:
! pip install protobuf==3.20.3 -U

## 9. Optimize and Quantize the Model

This command uses Microsoft Olive's auto-optimization capabilities to convert our fine-tuned model to ONNX format and apply int4 quantization. Here's what each parameter does:

- **`--model_name_or_path`**: The same base model from the Azure ML registry that we fine-tuned

- **`--adapter_path`**: Path to the LoRA adapter created in the previous step

- **`--device cpu`**: Target CPU for optimization (Olive also accepts `gpu`, paired with a GPU execution provider such as `CUDAExecutionProvider`)

- **`--provider CPUExecutionProvider`**: Use ONNX Runtime's CPU execution provider

- **`--use_model_builder`**: Export the model with the ONNX Runtime GenAI model builder, producing a model that ONNX Runtime GenAI can load directly

- **`--precision int4`**: Quantize weights to 4-bit integers, which shrinks the quantized weight matrices by up to 75% (4 bits instead of 16) compared to FP16

- **`--output_path`**: Where to save the optimized model

- **`--log_level 1`**: Set logging verbosity
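A back-of-the-envelope estimate shows where the size reduction comes from. The sketch assumes roughly 3.8 billion parameters for Phi-4-mini and block-wise int4 quantization with one FP16 scale per block of 32 weights. Both are assumptions for illustration; in practice some tensors (such as embeddings) may stay at higher precision, so the real files will differ.

```python
def weight_size_gb(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumed values for illustration only
NUM_PARAMS = 3.8e9      # approximate Phi-4-mini parameter count
BLOCK_SIZE = 32         # weights sharing one quantization scale
SCALE_BITS = 16         # one FP16 scale per block

fp16 = weight_size_gb(NUM_PARAMS, 16)
int4_plain = weight_size_gb(NUM_PARAMS, 4)
# Each weight also carries its share of the per-block scale
int4_blocked = weight_size_gb(NUM_PARAMS, 4 + SCALE_BITS / BLOCK_SIZE)

print(f"FP16: {fp16:.2f} GB")
print(f"int4 (weights only): {int4_plain:.2f} GB ({1 - int4_plain / fp16:.0%} smaller)")
print(f"int4 with block scales: {int4_blocked:.2f} GB ({1 - int4_blocked / fp16:.0%} smaller)")
```

Under these assumptions the weights drop from about 7.6 GB to 1.9 GB, or roughly 2.1 GB once the block scales are counted, which is why the reduction is "up to" 75% rather than exactly 75%.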

The optimization process:
1. Merges the base model with our LoRA adapter
2. Converts to ONNX format, which is more efficient for inference
3. Applies int4 quantization to dramatically reduce model size
4. Optimizes the model for CPU inference

This process will take several minutes to complete. The result is a much smaller model that can run on devices with limited resources. Int4 quantization usually costs only a little accuracy, but it is worth comparing the optimized model's answers with the fine-tuned model's on your own data.

In [None]:
! olive auto-opt \
    --model_name_or_path azureml://registries/azureml/models/Phi-4-mini-instruct/versions/1 \
    --adapter_path models/phi-4-mini/ft/adapter \
    --device cpu \
    --provider CPUExecutionProvider \
    --use_model_builder \
    --precision int4 \
    --output_path models/phi-4-mini/onnx \
    --log_level 1
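When the command finishes, the output directory should contain a model that ONNX Runtime GenAI can load, marked by the `genai_config.json` file the model builder writes. Because the exact subfolder layout under `--output_path` may vary between Olive versions, this sketch searches for that file rather than assuming a fixed path; `find_genai_models` is a hypothetical helper.

```python
from pathlib import Path


def find_genai_models(output_path):
    """Return directories under output_path that contain a genai_config.json."""
    root = Path(output_path)
    if not root.exists():
        return []
    return sorted(p.parent for p in root.rglob("genai_config.json"))


for model_dir in find_genai_models("models/phi-4-mini/onnx"):
    files = sorted(f.name for f in model_dir.iterdir())
    print(model_dir, files)
```

The directory it reports is the folder to point ONNX Runtime GenAI at when running inference in later notebooks.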