# Introduction to Model Distillation: Efficient Knowledge Transfer for AI Applications - Part 2

Our fine-tuning dataset is ready! We can now proceed to fine-tuning, the most exciting part for most AI developers :-).

## Prerequisites

This notebook picks up where the [1_generate_training_data.ipynb](1_generate_training_data.ipynb) notebook left off.

Be sure to run that notebook before starting on this one.

## 1 - Load dependencies

In [1]:
import os
from dotenv import load_dotenv

from typing import Sequence
from openai import Client
from datasets import load_dataset, Dataset, concatenate_datasets
from tqdm import tqdm
import pandas as pd
import json
import numpy as np
import time
import requests
import re

## 2 - Load Configuration

In [2]:
from dotenv import load_dotenv
load_dotenv()

True

## 3 - Initialize Client

The cell below creates the OpenAI-compatible client for working with Nebius AI Studio and defines the necessary variables.

In [3]:
DATASETS_CACHE_DIR = 'cache'
BASE_URL = "https://api.studio.nebius.ai"

client = Client(
    base_url=f'{BASE_URL}/v1',
    api_key=os.getenv('NEBIUS_API_KEY')
)

## 4 - Prepare fine tuning dataset

This dataset was created in the previous step, [1_generate_training_data.ipynb](1_generate_training_data.ipynb).

In [4]:
## load finetuning dataset from jsonl files

from datasets import Dataset

ft_dataset = Dataset.from_json('data/ft_dataset.jsonl')
ft_dataset

Dataset({
    features: ['input', 'output'],
    num_rows: 22092
})

First, let's split our dataset into training and validation parts. As mentioned earlier, we'll keep about 21k observations for training and set aside ~5% of the observations to validate the model's performance.

In [5]:
validation_size = 1097
seed = 42

ft_dataset_split = ft_dataset.train_test_split(test_size=validation_size, seed=seed, shuffle=True)
ft_dataset_split

DatasetDict({
    train: Dataset({
        features: ['input', 'output'],
        num_rows: 20995
    })
    test: Dataset({
        features: ['input', 'output'],
        num_rows: 1097
    })
})
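The split sizes reported above follow directly from the dataset size. A quick arithmetic check, using only the numbers shown in the cell outputs (nothing else is assumed):

```python
# Numbers taken from the cell outputs above.
total_rows = 22092
validation_size = 1097

# Rows left for training after removing the validation rows.
train_size = total_rows - validation_size
# Share of the full dataset used for validation.
validation_fraction = validation_size / total_rows

print(f"train rows: {train_size}")
print(f"validation share: {validation_fraction:.2%}")
```

This gives 20,995 training rows and a validation share of just under 5%, matching the `DatasetDict` output.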

## 5 - Create training and validation datasets

We now need to format our subsets and store them into separate files.

Fine-tuning lets us avoid a long, detailed prompt, which in turn reduces token consumption and speeds up model inference. Even so, providing general instructions is still highly recommended: with a short system prompt, the target output is somewhat "expected" by the model, which prevents large gradient updates in the first steps of fine-tuning. Large gradient updates generally lead the model weights away from local minima and result in a lower-quality model after fine-tuning.

Furthermore, after pre-training, Qwen3 models went through a large-scale, four-phase training process, and each phase ensured that the assistant's message starts with a thinking part (for non-thinking generations, this thinking part is empty). Consequently, as with the system prompt, let's include the empty thinking part in the assistant's messages to avoid large gradients.

In [6]:
system_prompt_fine_tuning = "Please correct the grammar in the user's text if necessary."
empty_reasoning_prefix = """
<think>

</think>

""".lstrip()

ft_train_save_path = 'data/fine_tuning_train.jsonl'
ft_validation_save_path = 'data/fine_tuning_validation.jsonl'

with open(ft_train_save_path, 'w') as f:
    for inst in ft_dataset_split['train']:
        dict_to_write = {
            "messages": [
                {
                    "role": "system",
                    "content": system_prompt_fine_tuning,
                },
                {
                    "role": "user",
                    "content": inst["input"],
                },
                {
                    "role": "assistant",
                    "content": empty_reasoning_prefix + inst["output"],
                }
            ]
        }
        json.dump(dict_to_write, f, ensure_ascii=False)
        f.write('\n')

In [7]:
with open(ft_validation_save_path, 'w') as f:
    for inst in ft_dataset_split['test']:
        dict_to_write = {
            "messages": [
                {
                    "role": "system",
                    "content": system_prompt_fine_tuning,
                },
                {
                    "role": "user",
                    "content": inst["input"],
                },
                {
                    "role": "assistant",
                    "content": empty_reasoning_prefix + inst["output"],
                }
            ]
        }
        json.dump(dict_to_write, f, ensure_ascii=False)
        f.write('\n')
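Before uploading, it can be worth sanity-checking the record format. The sketch below rebuilds a single record the same way as the cells above and verifies its structure. `build_record` and the sample sentences are hypothetical, introduced only for illustration; they are not part of the original notebook or dataset.

```python
import json

system_prompt_fine_tuning = "Please correct the grammar in the user's text if necessary."
# Same empty thinking block that the cells above prepend to assistant messages.
empty_reasoning_prefix = "<think>\n\n</think>\n\n"


def build_record(user_text: str, corrected_text: str) -> dict:
    """Hypothetical helper: format one example as a chat-style training record."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt_fine_tuning},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": empty_reasoning_prefix + corrected_text},
        ]
    }


# Made-up example pair, not taken from the dataset.
record = build_record("She go to school.", "She goes to school.")
line = json.dumps(record, ensure_ascii=False)

# Each JSONL line must parse back into the same record.
parsed = json.loads(line)
roles = [message["role"] for message in parsed["messages"]]

print(roles)
print(parsed["messages"][-1]["content"].startswith(empty_reasoning_prefix))
```

The same checks could be applied to a few lines read back from the saved `.jsonl` files.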

After both files are created, upload them to the service.

In [8]:
# Use a context manager so the file handle is closed after the upload
with open(ft_train_save_path, "rb") as f:
    fine_tuning_train_file = client.files.create(
        file=f,
        purpose="fine-tune"
    )
fine_tuning_train_file

FileObject(id='file-76807105-7fed-4350-940c-a1dd1b5b6f49', bytes=8802754, created_at=1753341501, filename='fine_tuning_train.jsonl', object='file', purpose='fine-tune', status=None, expires_at=None, status_details=None)

In [9]:
# Use a context manager so the file handle is closed after the upload
with open(ft_validation_save_path, "rb") as f:
    fine_tuning_validation_file = client.files.create(
        file=f,
        purpose="fine-tune"
    )
fine_tuning_validation_file

FileObject(id='file-d1c98e51-a4ab-4d9d-9c2d-5e0a1b800f20', bytes=454502, created_at=1753341502, filename='fine_tuning_validation.jsonl', object='file', purpose='fine-tune', status=None, expires_at=None, status_details=None)

## 6 - Fine tuning


_Heads up: Running this part will cost ~$7.3_

### 6.1 - Launch fine tuning job!

We are ready to launch the fine-tuning. 

- We'll train LoRA adapters to reduce the usage price of the model. 
- We slightly increase the LoRA rank (the `lora_r` parameter) to narrow the quality gap between full fine-tuning and LoRA fine-tuning.
- We also increase the LoRA alpha value (`lora_alpha`) accordingly to keep the ratio of `lora_r` to `lora_alpha` equal to 1, as suggested in the original LoRA paper [3].
- Since our inputs and outputs are fairly short, we can use the maximum available batch size (32).
- We will train the model for 10 epochs.
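To make the `lora_r` / `lora_alpha` choice concrete: in LoRA, the weight update is the product of two low-rank matrices scaled by `lora_alpha / lora_r`, so equal values give a scaling factor of 1. The sketch below computes that factor and the number of trainable parameters for one adapted weight matrix. The helper names are hypothetical, and the matrix dimensions are illustrative placeholders, not the actual Qwen3-4B layer sizes.

```python
def lora_scaling(lora_alpha: int, lora_r: int) -> float:
    """Scaling factor applied to the low-rank update B @ A."""
    return lora_alpha / lora_r


def lora_trainable_params(d_out: int, d_in: int, lora_r: int) -> int:
    """Parameters in B (d_out x r) plus A (r x d_in) for one adapted matrix."""
    return lora_r * (d_out + d_in)


# Values used for the fine-tuning job in this notebook.
lora_r, lora_alpha = 16, 16

# Illustrative layer shape (an assumption, chosen only for the example).
d_out, d_in = 1024, 1024

scaling = lora_scaling(lora_alpha, lora_r)
adapter_params = lora_trainable_params(d_out, d_in, lora_r)
full_params = d_out * d_in

print(f"scaling factor: {scaling}")
print(f"adapter params: {adapter_params} vs full matrix: {full_params}")
```

Doubling the rank doubles the adapter size, which is why the rank is raised only slightly.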

In [10]:
ft_job = client.fine_tuning.jobs.create(
    training_file=fine_tuning_train_file.id,
    validation_file=fine_tuning_validation_file.id,
    model="Qwen/Qwen3-4B",
    hyperparameters={
        "n_epochs": 10,
        "batch_size": 32,
        "lora": True,
        "lora_r": 16,
        "lora_alpha": 16,
        "packing": True
    },
    seed=42
)
ft_job

FineTuningJob(id='ftjob-c7cc1f8df48144aeb180a5bf5818dc81', created_at=1753341502, error=None, fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(batch_size=32, learning_rate_multiplier=None, n_epochs=10, learning_rate=1e-05, warmup_ratio=0.0, weight_decay=0.0, lora=True, lora_r=16, lora_alpha=16, lora_dropout=0.0, packing=True, max_grad_norm=1.0, context_length=8192), model='Qwen/Qwen3-4B', object='fine_tuning.job', organization_id='', result_files=[], seed=42, status='running', trained_tokens=0, training_file='file-76807105-7fed-4350-940c-a1dd1b5b6f49', validation_file='file-d1c98e51-a4ab-4d9d-9c2d-5e0a1b800f20', estimated_finish=None, integrations=None, metadata=None, method=None, suffix='', trained_steps=0, total_steps=None)

### 6.2 - Monitor job progress

As with batched generation, this process may take some time. The loop below refreshes the state of the fine-tuning job every minute and stops once the job is finished.

In [11]:
%%time 
import time

start_time = time.time()
update_num_seconds = 60
active_statuses = ["validating_files", "queued", "running"]
print(f"Fine tuning job {ft_job.id} created, waiting for completion...")

while ft_job.status in active_statuses:
    time.sleep(update_num_seconds)
    ft_job = client.fine_tuning.jobs.retrieve(ft_job.id)
    elapsed = time.time() - start_time
    print(f"Elapsed: {int(elapsed)}s ({elapsed/60:.1f} min) : current status: {ft_job.status}")

Fine tuning job ftjob-c7cc1f8df48144aeb180a5bf5818dc81 created, waiting for completion...
Elapsed: 60s (1.0 min) : current status: running
Elapsed: 121s (2.0 min) : current status: running
Elapsed: 182s (3.0 min) : current status: running
Elapsed: 242s (4.0 min) : current status: running
Elapsed: 303s (5.1 min) : current status: running
Elapsed: 364s (6.1 min) : current status: running
Elapsed: 425s (7.1 min) : current status: running
Elapsed: 485s (8.1 min) : current status: running
Elapsed: 546s (9.1 min) : current status: running
Elapsed: 607s (10.1 min) : current status: running
Elapsed: 668s (11.1 min) : current status: running
Elapsed: 728s (12.1 min) : current status: running
Elapsed: 789s (13.2 min) : current status: running
Elapsed: 850s (14.2 min) : current status: succeeded
CPU times: user 199 ms, sys: 23.4 ms, total: 222 ms
Wall time: 14min 10s
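The loop above runs for as long as the job stays in an active state. A variant with an upper time limit can be sketched as follows. `wait_for_status` and its `get_status` callback are hypothetical helpers, and the demo uses a fake status sequence instead of calling the API.

```python
import time
from typing import Callable, Sequence


def wait_for_status(
    get_status: Callable[[], str],
    active_statuses: Sequence[str],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 3600.0,
) -> str:
    """Poll get_status() until it returns a non-active status or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    status = get_status()
    while status in active_statuses:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still '{status}' after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
        status = get_status()
    return status


# Demo with a fake status source so the block runs without network access.
fake_statuses = iter(["queued", "running", "running", "succeeded"])
final_status = wait_for_status(
    get_status=lambda: next(fake_statuses),
    active_statuses=["validating_files", "queued", "running"],
    poll_seconds=0.0,
)
print(final_status)
```

In the notebook, `get_status` could be `lambda: client.fine_tuning.jobs.retrieve(ft_job.id).status`, reusing the call from the loop above.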


## 7 - Inspect training quality

Let's examine the validation loss at each epoch so that we can download the checkpoint that yields the highest quality.

### 7.1 - Visualize loss over epochs

In [12]:
ft_checkpoints = client.fine_tuning.jobs.checkpoints.list(ft_job.id).data
metrics = []
for epoch_data in ft_checkpoints:
    epoch_metrics = {}
    epoch_metrics["train_loss"] = epoch_data.metrics.train_loss
    epoch_metrics["valid_loss"] = epoch_data.metrics.valid_loss
    metrics.append(epoch_metrics)

df_metrics = pd.DataFrame(metrics)
df_metrics.style.background_gradient(cmap='Reds')

| | train_loss | valid_loss |
|---|---|---|
| 0 | 2.114447 | 0.41674 |
| 1 | 1.149326 | 0.241002 |
| 2 | 0.662357 | 0.230296 |
| 3 | 0.423568 | 0.229275 |
| 4 | 0.305986 | 0.226082 |
| 5 | 0.24793 | 0.210415 |
| 6 | 0.218345 | 0.233326 |
| 7 | 0.203119 | 0.199091 |
| 8 | 0.195031 | 0.214885 |
| 9 | 0.190383 | 0.228618 |
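As an optional check, the checkpoint with the lowest validation loss can be located programmatically. The sketch below hard-codes the `valid_loss` values from the table above so that it runs on its own; in the notebook, the same values are available in `df_metrics`.

```python
# valid_loss values copied from the table above (one per checkpoint).
valid_losses = [
    0.41674, 0.241002, 0.230296, 0.229275, 0.226082,
    0.210415, 0.233326, 0.199091, 0.214885, 0.228618,
]

# Index of the checkpoint with the smallest validation loss.
best_index = min(range(len(valid_losses)), key=valid_losses.__getitem__)
best_loss = valid_losses[best_index]

print(f"lowest valid_loss: {best_loss} at index {best_index}")
```

For this run, the minimum is at index 7. With the DataFrame, `df_metrics["valid_loss"].idxmin()` should return the same index.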


### 7.2 - Save the last checkpoint

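The training loss decreases steadily across all ten epochs. The validation loss drops sharply after the first epoch and then fluctuates between roughly 0.20 and 0.23, with its lowest value in row 7 of the table. Since the training loss is still going down, we could most likely train our adapters even further to squeeze out the best quality. For this walkthrough, let's save the last trained checkpoint.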

In [13]:
import shutil

save_dir = "models/qwen3-4b-grammar-checker"
shutil.rmtree(save_dir, ignore_errors=True)
os.makedirs(save_dir, exist_ok=True)

n_selected_epoch = -1
best_checkpoint = ft_checkpoints[n_selected_epoch]

for model_file_id in best_checkpoint.result_files:
    # Get the name of the file
    file_name = client.files.retrieve(model_file_id).filename.split('/')[1]
    # Retrieve the contents of the file
    file_content = client.files.content(model_file_id)
    # Save the file
    file_content.write_to_file(os.path.join(save_dir, file_name))

## 8 - How much did our fine tuning cost?

The price for fine-tuning a model with fewer than 20B parameters is $0.4 per 1M tokens. Let's calculate the total fine-tuning price.

[pricing guide](https://nebius.com/prices-ai-studio)

In [14]:
price = ft_job.trained_tokens * 0.4 / 1_000_000
print(f'Fine-tuning price: ${price:.1f}')

Fine-tuning price: $7.3
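The same formula can be wrapped in a small function and inverted to estimate the token count from a price. `fine_tuning_cost` is a hypothetical helper, and the token count below is back-computed from the rounded price shown above, so it is only approximate.

```python
PRICE_PER_MILLION_TOKENS = 0.4  # USD, the rate quoted above for models under 20B parameters


def fine_tuning_cost(trained_tokens: int, price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Cost in USD for a given number of trained tokens."""
    return trained_tokens * price_per_million / 1_000_000


# Approximate token count implied by the rounded price of 7.3 USD.
approx_trained_tokens = round(7.3 / PRICE_PER_MILLION_TOKENS * 1_000_000)

print(f"approx. trained tokens: {approx_trained_tokens:,}")
print(f"cost check: ${fine_tuning_cost(approx_trained_tokens):.1f}")
```

The exact figure for a run is available as `ft_job.trained_tokens`, as used in the cell above.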


## 9 - Deploy the model in AI Studio

Nebius AI Studio provides a "Zero-click deployment" feature that automatically deploys trained LoRA adapters to the AI Studio inference platform, so the trained model can be used for inference right away.

The list of models supported for integration of fine-tuning and inference is provided here: https://docs.nebius.com/studio/fine-tuning/models#deploy

In [15]:
lora_creation_request = {
    "name": "grammar-checker",  # You can set whatever name you like
    "base_model": "Qwen/Qwen3-4B-fast",  # Base model. Qwen3-4B is only available with the `fast` mode
    "source": f"{ft_job.id}:{best_checkpoint.id}",
    "description": "Qwen-3-4B model fine-tuned on the grammatic error correction task."
}
url = f"{BASE_URL}/v0/models"

response = requests.post(
    url,
    json=lora_creation_request,
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.getenv('NEBIUS_API_KEY')}"
    }
)
response.raise_for_status()  # fail early if the deployment request was rejected
response_json = response.json()
model_name = response_json["name"]

# save the model name to a file so next notebook can use it
with open('model_name.out', 'w') as f:
    f.write(model_name)
    
response_json

{'name': 'Qwen/Qwen3-4B-fast-LoRa:grammar-checker-WpCf',
 'base_model': 'Qwen/Qwen3-4B-fast',
 'source': 'ftjob-c7cc1f8df48144aeb180a5bf5818dc81:ftckpt_9d1ea66e-4e86-4785-9c88-372be3ea1b73',
 'description': 'Qwen-3-4B model fine-tuned on the grammatic error correction task.',
 'created_at': 1753342377,
 'status': 'validating'}

We need to wait a few seconds or minutes until the model is deployed.

In [16]:
url = f"{BASE_URL}/v0/models/{model_name}"
active_statuses = ["validating"]
update_num_seconds = 15

while response_json['status'] in active_statuses:
    time.sleep(update_num_seconds)
    response = requests.get(
        url, 
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.getenv('NEBIUS_API_KEY')}"
        }
    )
    response_json = response.json()
    print("Current status:", response_json['status'])

Current status: active
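Once the status is `active`, the adapter can be queried like any other chat model through the OpenAI-compatible client. The sketch below is an assumption about how one might call it, reusing the system prompt from fine-tuning. `correct_grammar` is a hypothetical helper, and the demo uses a stand-in client object so that the block runs without network access or an API key.

```python
from types import SimpleNamespace

SYSTEM_PROMPT = "Please correct the grammar in the user's text if necessary."


def correct_grammar(client, model_name: str, text: str) -> str:
    """Send one text to the fine-tuned model and return the reply content."""
    completion = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return completion.choices[0].message.content


# Stand-in for the real client: echoes the user message back (dry run only).
def _fake_create(model, messages):
    reply = SimpleNamespace(content=f"[{model}] " + messages[-1]["content"])
    return SimpleNamespace(choices=[SimpleNamespace(message=reply)])


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)

result = correct_grammar(fake_client, "demo-model", "She go to school.")
print(result)
```

In the notebook, the real `client` and the `model_name` returned by the deployment request would be passed instead of the stand-ins.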


## 10 - See our model on AI Studio

You can find the custom model in the **models --> custom** section.

![](distilled-model-1.png)

## Done 👏