# Fine-tuning Phi-3-mini for Coding Tasks

This notebook demonstrates fine-tuning Microsoft's Phi-3-mini model for coding tasks using Unsloth. Because the model is small, it needs fewer GPU resources than larger models and fits on a single Tesla T4, the GPU used for the run shown here.

In [1]:
# Install required packages
# Version specifiers are quoted so the shell does not treat ">=" as an output redirect
!pip install -q unsloth
!pip install -q datasets
!pip install -q "accelerate>=0.24.1"
!pip install -q "bitsandbytes>=0.41.1"
!pip install -q "peft>=0.6.0"
!pip install -q "trl>=0.7.6"

# Verify GPU availability
import torch
print("CUDA available:", torch.cuda.is_available())
print("CUDA device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("GPU Memory:", torch.cuda.get_device_properties(0).total_memory / 1e9, "GB")


## Loading the Model and Setting Up Unsloth

We'll use Phi-3-mini, which has about 3.8B parameters and performs well for its size. Unsloth's optimizations let us fine-tune it efficiently.

In [3]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from tqdm import tqdm

# Use fp16 throughout; the Tesla T4 used for this run reports no bfloat16 support
# Cap sequences at 1024 tokens to keep memory use low on smaller GPUs
max_seq_length = 1024

# Load the Phi-3-mini model with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="microsoft/Phi-3-mini-4k-instruct",
    max_seq_length=max_seq_length,
    dtype=torch.float16,  # Compute dtype for the non-quantized parts of the model
    load_in_4bit=True,    # 4-bit quantization drastically reduces memory usage
)

print(f"Model loaded with fp16 precision and max sequence length of {max_seq_length}")

==((====))==  Unsloth 2025.3.19: Fast Mistral patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Model loaded with fp16 precision and max sequence length of 1024
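
To see why 4-bit loading matters on a T4, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. This is a sketch, not a measurement: the helper `estimate_weight_memory_gb` is hypothetical, it uses the 3.8B parameter figure quoted above, and it ignores quantization constants, activations, LoRA adapters, and optimizer state.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9

PHI3_MINI_PARAMS = 3.8e9  # approximate parameter count quoted in the text

fp16_gb = estimate_weight_memory_gb(PHI3_MINI_PARAMS, 16)
int4_gb = estimate_weight_memory_gb(PHI3_MINI_PARAMS, 4)

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{int4_gb:.1f} GB")
```

Under these assumptions the weights shrink from roughly 7.6 GB to roughly 1.9 GB, which leaves most of the T4's 14.7 GB free for activations and training state.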


## Loading a Small Coding Dataset

For fine-tuning a coding assistant, we'll use a small subset of CodeAlpaca-20k, an instruction dataset hosted on the Hugging Face Hub and loaded with the `datasets` library. Training on only 1,000 of its examples keeps the run short.

In [4]:
# Load the full CodeAlpaca-20k train split (we subsample it below)
dataset = load_dataset("sahil2801/CodeAlpaca-20k", split="train")
print(f"Dataset loaded with {len(dataset)} examples")

# Preview a sample
print("Sample data point:")
print(dataset[0])

# Let's use only 1000 examples to keep training fast
small_dataset = dataset.select(range(1000))
print(f"Using {len(small_dataset)} examples for fine-tuning")

Dataset loaded with 20022 examples
Sample data point:
{'output': 'arr = [2, 4, 6, 8, 10]', 'instruction': 'Create an array of length 5 which contains all even numbers between 1 and 10.', 'input': ''}
Using 1000 examples for fine-tuning
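
`dataset.select(range(1000))` takes the first 1,000 rows. If the dataset happens to be ordered in some way (this notebook does not check), a random subset may be more representative. The sketch below draws reproducible random indices with the standard library; `sample_indices` is a hypothetical helper, and the resulting list could be passed to `dataset.select(...)` in place of `range(1000)`.

```python
import random

def sample_indices(dataset_size: int, sample_size: int, seed: int = 42) -> list[int]:
    """Pick `sample_size` distinct row indices, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    return sorted(rng.sample(range(dataset_size), sample_size))

indices = sample_indices(dataset_size=20022, sample_size=1000)
print(f"Picked {len(indices)} distinct indices out of 20022")

# Possible use in the cell above (not run here):
# small_dataset = dataset.select(indices)
```

The `datasets` library also offers `dataset.shuffle(seed=...)`, which can be followed by `.select(range(1000))` for the same effect.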


## Preparing the Dataset with Phi-3's Chat Template

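We need to format our data according to Phi-3's chat template so that training text matches what the model saw during instruction tuning. Each example becomes a user turn (the instruction, plus any input) followed by an assistant turn (the expected output), and each turn is closed by `<|end|>`.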

In [5]:
# Function to format examples according to Phi-3's chat template
def format_phi3_prompt(example):
    instruction = example["instruction"]
    input_text = example["input"] if example["input"] else ""
    output = example["output"]

    if input_text:
        prompt = f"<|user|>\n{instruction}\n\n{input_text}<|end|>\n<|assistant|>\n"
    else:
        prompt = f"<|user|>\n{instruction}<|end|>\n<|assistant|>\n"

    return {
        "text": prompt + output + "<|end|>"
    }

# Apply formatting to our dataset
formatted_dataset = small_dataset.map(format_phi3_prompt)

# Show an example of formatted data
print("Formatted example:")
print(formatted_dataset[0]["text"])

Formatted example:
<|user|>
Create an array of length 5 which contains all even numbers between 1 and 10.<|end|>
<|assistant|>
arr = [2, 4, 6, 8, 10]<|end|>
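
The same template is needed again at inference time, so it can help to keep it in one place. The following is a minimal, dependency-free sketch of that idea; `build_prompt`, `build_training_text`, and `extract_reply` are hypothetical helper names, not part of Unsloth or the tokenizer.

```python
USER_TAG = "<|user|>"
ASSISTANT_TAG = "<|assistant|>"
END_TAG = "<|end|>"

def build_prompt(instruction: str, input_text: str = "") -> str:
    """User turn plus the opening of the assistant turn (used at inference)."""
    user_turn = f"{instruction}\n\n{input_text}" if input_text else instruction
    return f"{USER_TAG}\n{user_turn}{END_TAG}\n{ASSISTANT_TAG}\n"

def build_training_text(instruction: str, output: str, input_text: str = "") -> str:
    """Full conversation used as a training example."""
    return build_prompt(instruction, input_text) + output + END_TAG

def extract_reply(full_text: str) -> str:
    """Recover the assistant's reply from a formatted conversation."""
    reply = full_text.split(ASSISTANT_TAG, 1)[1]
    return reply.split(END_TAG, 1)[0].strip()

text = build_training_text(
    "Create an array of length 5 which contains all even numbers between 1 and 10.",
    "arr = [2, 4, 6, 8, 10]",
)
print(text)
assert extract_reply(text) == "arr = [2, 4, 6, 8, 10]"
```

Building the training text from the inference prompt means the two formats cannot drift apart.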


## Setting Up Training Parameters for Fine-tuning

Now we'll configure the LoRA (Low-Rank Adaptation) parameters for efficient fine-tuning and prepare the training arguments. LoRA keeps the base weights frozen and trains only small low-rank update matrices, which sharply reduces the GPU memory needed to fine-tune a large model.

In [6]:
# Add LoRA adapters to the model
model = FastLanguageModel.get_peft_model(
    model,
    r=16,             # Rank of the update matrices
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,    # Alpha parameter for LoRA scaling
    lora_dropout=0.1  # Dropout probability for LoRA layers (Unsloth's fast patching expects 0; see the warning in the output)
)

# Set up the training arguments (optimized for smaller GPUs)
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./phi3_mini_code_assistant",
    num_train_epochs=1,               # Just 1 epoch for quick training
    per_device_train_batch_size=4,    # Small batch size for lower memory usage
    gradient_accumulation_steps=4,    # Accumulate gradients for effective larger batch size
    learning_rate=2e-4,               # Learning rate
    weight_decay=0.01,                # Weight decay for regularization
    warmup_steps=10,                  # Warmup steps
    logging_steps=10,                 # How often to log during training
    save_steps=500,                   # Save checkpoint every 500 steps
    gradient_checkpointing=True,      # Enable gradient checkpointing to save memory
    fp16=True,                        # Use mixed precision for faster training
    max_grad_norm=0.3,                # Gradient clipping
    optim="adamw_torch"               # Optimizer
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.1.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.3.19 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
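
As a sanity check on what LoRA adds, the number of trainable parameters can be worked out by hand: an adapter on a linear layer with input size d_in and output size d_out adds rank × (d_in + d_out) weights. The sketch below assumes Phi-3-mini's dimensions (hidden size 3072, MLP size 8192, 32 layers, none of which are stated in this notebook) and assumes the projections are separate modules, as the `target_modules` list implies.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter adds A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

# Assumed Phi-3-mini dimensions (not taken from this notebook)
HIDDEN = 3072
INTERMEDIATE = 8192
NUM_LAYERS = 32
RANK = 16  # matches r=16 in the cell above

# (d_in, d_out) for each targeted module in one transformer layer
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in module_shapes.values())
total = per_layer * NUM_LAYERS
print(f"Trainable LoRA parameters: {total:,}")
```

Under these assumptions the total is 29,884,416, which agrees with the "Trainable parameters" figure Unsloth prints when training starts.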


## Fine-tuning the Model

Now we'll fine-tune the model using the SFTTrainer from the TRL library. It is designed for supervised fine-tuning on text such as our formatted examples and works well with Unsloth.

In [7]:
from trl import SFTTrainer

# Initialize the trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=formatted_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=False  # Set to True if you want to pack multiple examples into one sequence
)

# Train the model
trainer.train()

# Save the trained model
output_dir = "./phi3_mini_code_assistant_final"
trainer.save_model(output_dir)
print(f"Model saved to {output_dir}")

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 62
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 29,884,416/4,000,000,000 (0.75% trained)

Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,0.9729
20,0.7507
30,0.6615
40,0.6323
50,0.6318
60,0.593


Model saved to ./phi3_mini_code_assistant_final
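
The batch and step counts in the training banner follow from the training arguments. The arithmetic below is a sketch under one assumption: that a trailing, incomplete gradient-accumulation group does not count as an optimizer step. `training_schedule` is a hypothetical helper, not a Transformers or Unsloth function.

```python
import math

def training_schedule(num_examples, per_device_batch, grad_accum, num_gpus=1, epochs=1):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # Assumption: a leftover partial accumulation group is dropped
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return effective_batch, steps_per_epoch * epochs

effective_batch, total_steps = training_schedule(
    num_examples=1000, per_device_batch=4, grad_accum=4
)
print(f"Effective batch size: {effective_batch}")
print(f"Total optimizer steps: {total_steps}")
```

With 1,000 examples this gives an effective batch of 16 and 62 steps, the same values the banner reports. With `logging_steps=10`, that is why the loss table has six rows, ending at step 60.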


## Testing the Fine-tuned Model

Let's test our fine-tuned model on a coding problem to see how it performs.

In [9]:
# Load the fine-tuned model
fine_tuned_model, fine_tuned_tokenizer = FastLanguageModel.from_pretrained(
    model_name="./phi3_mini_code_assistant_final",
    max_seq_length=max_seq_length,
    dtype=torch.float16,
    load_in_4bit=True
)

# Switch Unsloth to its inference mode before generating
FastLanguageModel.for_inference(fine_tuned_model)

# Generate a response and return only the assistant's reply
def generate_response(instruction, input_text=""):
    if input_text:
        prompt = f"<|user|>\n{instruction}\n\n{input_text}<|end|>\n<|assistant|>\n"
    else:
        prompt = f"<|user|>\n{instruction}<|end|>\n<|assistant|>\n"

    inputs = fine_tuned_tokenizer(prompt, return_tensors="pt").to(fine_tuned_model.device)

    outputs = fine_tuned_model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,
        do_sample=True
    )

    # Decode only the newly generated tokens, so the prompt never has to be parsed out
    prompt_length = inputs["input_ids"].shape[1]
    response = fine_tuned_tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=False)

    # Keep the text up to the first end-of-turn or end-of-text marker
    for stop_token in ("<|end|>", "<|endoftext|>"):
        response = response.split(stop_token)[0]

    return response.strip()

# Test with a coding problem
test_instruction = "Write a Python function to find the factorial of a number using recursion."
response = generate_response(test_instruction)
print("Model response:")
print(response)

Model response:
def rec_factor(n):
    if n == 0: return  1 # base case, when it reaches zero then we stop and start multiplying values from right (2 * fact) until leftmost value is reached i.e., at one or less than that will be returned as result/output by recursive calls which ultimately gives us desired output for given input.
           elif len((os).argv)>=4:"print('usage')":"to check whether used correctly ..or incorrectly..."                except ValueError:{raise SystemExit()}else{System Exit}## END OF CODE ## ################
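
Reading outputs by eye does not scale past a few prompts. One cheap automated check, sketched below, is whether a response even parses as Python. This is only a syntax test and says nothing about correctness; `check_python_syntax` is a hypothetical helper, and the two sample strings are made up for illustration rather than taken from the model.

```python
import ast

def check_python_syntax(code: str):
    """Return (True, None) if `code` parses as Python, else (False, message)."""
    try:
        ast.parse(code)
        return True, None
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return False, f"line {err.lineno}: {err.msg}"

# Illustrative inputs (not model outputs)
good = "def factorial(n):\n    return 1 if n == 0 else n * factorial(n - 1)\n"
bad = "def factorial(n):\n    if n == 0: return 1\n           elif n > 1: return n\n"

for label, snippet in (("good", good), ("bad", bad)):
    ok, message = check_python_syntax(snippet)
    print(label, "->", "parses" if ok else f"syntax error ({message})")
```

Passing `generate_response(...)` results through a check like this over a batch of prompts would give a rough pass rate to compare before and after fine-tuning.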