To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our [documentation](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News


Unsloth now supports [gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which automatically creates kernels!

[Vision RL](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) is now supported! Train Qwen2.5-VL, Gemma 3 etc. with GSPO or GRPO.

Introducing Unsloth [Standby for RL](https://docs.unsloth.ai/basics/memory-efficient-rl): GRPO is now faster, uses 30% less memory with 2x longer context.

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.55.4
!pip install --no-deps trl==0.22.2
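
To make the version logic in the install cell easier to follow, here is a small sketch of how the `xformers` pin is derived from the installed torch version string. The helper name `pick_xformers_pin` and the sample version strings are illustrative assumptions; the regex and the two pins are copied from the cell above.

```python
import re

def pick_xformers_pin(torch_version: str) -> str:
    """Hypothetical helper mirroring the install cell's xformers selection."""
    # Keep only the leading "digits and dots" part, dropping suffixes such as "+cu124".
    base = re.match(r"[0-9\.]{3,}", torch_version).group(0)
    # Same rule as the install cell: one pin for torch 2.8.0, another for everything else.
    return "xformers==" + ("0.0.32.post2" if base == "2.8.0" else "0.0.29.post3")

# Sample version strings are assumptions chosen for illustration.
print(pick_xformers_pin("2.8.0+cu126"))
print(pick_xformers_pin("2.6.0+cu124"))
```

Any torch version other than `2.8.0` falls through to the older pin, so it is worth checking the result if you run on a different torch release.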

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

In [None]:
from unsloth import FastModel
import torch

fourbit_models = [
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-1b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",

    # Other popular models!
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/Llama-3.3-70B",
    "unsloth/mistral-7b-instruct-v0.3",
    "unsloth/Phi-4",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-4b-it",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Successfully patched SmolVLMForConditionalGeneration for better torch.compile compatibility.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.19: Fast Gemma3 patching. Transformers: 4.50.2.
   \\   /|    NVIDIA GeForce RTX 3060. Num GPUs = 1. Max memory: 11.755 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


We now add LoRA adapters so we only need to update a small number of parameters!

In [None]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!

    r = 8,           # Larger = higher accuracy, but might overfit
    lora_alpha = 8,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

Unsloth: Making `model.base_model.model.language_model.model` require gradients
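
As a rough illustration of why LoRA trains so few parameters, the sketch below compares a full weight update with a rank-`r` update for a single linear layer. The layer size of 2560 x 2560 is a hypothetical value picked for the example, not a dimension read from the model; `r = 8` and `lora_alpha = 8` match the cell above.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

d_in = d_out = 2560        # hypothetical layer size, for illustration only
r, lora_alpha = 8, 8       # values from the get_peft_model call above

scaling = lora_alpha / r   # standard LoRA scales the low-rank update by alpha / r
ratio = lora_params(d_in, d_out, r) / full_params(d_in, d_out)

print(f"full: {full_params(d_in, d_out):,}  lora: {lora_params(d_in, d_out, r):,}")
print(f"scaling = {scaling}, trainable fraction for this layer = {ratio:.4%}")
```

With `lora_alpha == r` the scaling factor is 1, which is why the comment in the cell recommends keeping alpha at least as large as `r`.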


<a name="Data"></a>
### Data Prep
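We now use the `Gemma-3` format for conversation-style finetunes. The sample outputs in this section come from [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style; note that the loading cell below points at `Nishan726/sri-lankan-legal-conversations` instead. Gemma-3 renders multi-turn conversations like below: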

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```
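
The format above can be sketched as a tiny renderer. This is a simplified illustration only: the function name `render_gemma3` is hypothetical, it handles plain-text `user` and `assistant` turns, and the real template that ships with the tokenizer covers more cases (system prompts, images).

```python
def render_gemma3(messages, add_generation_prompt=False):
    """Hypothetical, simplified renderer for the Gemma-3 turn format shown above."""
    role_map = {"assistant": "model"}  # Gemma-3 labels assistant turns as "model"
    text = "<bos>"
    for message in messages:
        role = role_map.get(message["role"], message["role"])
        text += f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so generation continues as the assistant.
        text += "<start_of_turn>model\n"
    return text

conversation = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_gemma3(conversation))
```

In practice you would rely on `tokenizer.apply_chat_template`; the sketch is only meant to show where each special token lands.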

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.

In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

In [None]:
from datasets import load_dataset
dataset = load_dataset("Nishan726/sri-lankan-legal-conversations", split = "train")

We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!

In [None]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)

Let's see what row 100 looks like!

In [None]:
dataset[100]

{'conversations': [{'content': 'What is the modulus operator in programming and how can I use it to calculate the modulus of two given numbers?',
   'role': 'user'},
  {'content': 'In programming, the modulus operator is represented by the \'%\' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:\n\n```python\n# Calculate the modulus\nModulus = a % b\n\nprint("Modulus of the given numbers is: ", Modulus)\n```\n\nIn this code snippet, the variables \'a\' and \'b\' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator \'%\', we calculate the remainder when \'a\' is divided by \'b\'. The result is then stored in the variable \'Modulus\'. Finally, the modulus value is printed using the \'print\' statement.\n\nFor example, if \'a\' is 10 and \'b\' is 4, the modulus calculation would be 10 % 4, which equals 2. Theref

We now have to apply the chat template for `Gemma-3` to the conversations and save the result to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The processor will add this token before training, and the model expects only one.

In [None]:
def formatting_prompts_func(examples):
   convos = examples["conversations"]
   texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
   return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True)
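
To see why the prefix is stripped, the following sketch counts `<bos>` tokens with and without `removeprefix`. The function `add_bos_like_processor` is a hypothetical stand-in for the tokenization step that prepends one `<bos>`; it is not part of any library.

```python
BOS = "<bos>"

def add_bos_like_processor(text: str) -> str:
    """Hypothetical stand-in for a tokenization step that prepends one BOS token."""
    return BOS + text

templated = "<bos><start_of_turn>user\nHello!<end_of_turn>\n"

without_strip = add_bos_like_processor(templated)
with_strip = add_bos_like_processor(templated.removeprefix(BOS))

print(without_strip.count(BOS))  # 2: the template's BOS plus the one added later
print(with_strip.count(BOS))     # 1: only the one added later
```

`str.removeprefix` needs Python 3.9 or newer and only removes the token when it sits at the very start of the string.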

Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.

In [None]:
dataset[100]["text"]

'<start_of_turn>user\nWhat is the modulus operator in programming and how can I use it to calculate the modulus of two given numbers?<end_of_turn>\n<start_of_turn>model\nIn programming, the modulus operator is represented by the \'%\' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:\n\n```python\n# Calculate the modulus\nModulus = a % b\n\nprint("Modulus of the given numbers is: ", Modulus)\n```\n\nIn this code snippet, the variables \'a\' and \'b\' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator \'%\', we calculate the remainder when \'a\' is divided by \'b\'. The result is then stored in the variable \'Modulus\'. Finally, the modulus value is printed using the \'print\' statement.\n\nFor example, if \'a\' is 10 and \'b\' is 4, the modulus calculation would be 10 % 4, which equals 2. Therefore, the ou

<a name="Train"></a>
### Train the model
Now let's train our model. We do 30 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`.

In [None]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]
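
As a quick sanity check on the settings above, this sketch works out the effective batch size and how much of the dataset 30 steps actually covers. The helper names are illustrative; the dataset size of 100,000 is taken from the progress bar shown above and will differ for other datasets.

```python
import math

def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return per_device * grad_accum * num_gpus

def steps_per_epoch(num_examples: int, batch: int) -> int:
    """Optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch)

batch = effective_batch_size(per_device=2, grad_accum=4)  # values from SFTConfig above
examples_seen = batch * 30                                # max_steps = 30
full_epoch_steps = steps_per_epoch(100_000, batch)        # dataset size from the log above

print(f"effective batch size: {batch}")
print(f"examples seen in 30 steps: {examples_seen}")
print(f"steps for one full epoch: {full_epoch_steps}")
```

So the short run touches only a small slice of the data, which is why a full epoch is recommended for real finetunes.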

We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes!

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Map (num_proc=255):   0%|          | 0/100000 [00:00<?, ? examples/s]
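
The idea behind response-only training can be sketched on a toy token list: every label outside a response span is set to `-100`, which the loss ignores. This is a minimal illustration under assumed marker ids, not Unsloth's actual implementation; the function names and the integer ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def find_sub(seq, sub, start=0):
    """Return the first index >= start where sub occurs in seq, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens between a response marker and the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sub(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)
        end = find_sub(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels

# Toy ids (assumptions): [1, 2] marks a user turn, [1, 3] marks a model turn.
ids = [0, 1, 2, 10, 11, 9, 1, 3, 20, 21, 9, 1, 2, 12, 9, 1, 3, 22, 9]
print(mask_non_responses(ids, instruction_ids=[1, 2], response_ids=[1, 3]))
```

Only the tokens after each model marker keep their ids, so gradients come from the assistant's answers alone.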

Let's verify that the instruction part is masked! Let's print the 100th row again. Notice how the sample only has a single `<bos>` as expected!

In [None]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

'<bos><start_of_turn>user\nWhat is the modulus operator in programming and how can I use it to calculate the modulus of two given numbers?<end_of_turn>\n<start_of_turn>model\nIn programming, the modulus operator is represented by the \'%\' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:\n\n```python\n# Calculate the modulus\nModulus = a % b\n\nprint("Modulus of the given numbers is: ", Modulus)\n```\n\nIn this code snippet, the variables \'a\' and \'b\' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator \'%\', we calculate the remainder when \'a\' is divided by \'b\'. The result is then stored in the variable \'Modulus\'. Finally, the modulus value is printed using the \'print\' statement.\n\nFor example, if \'a\' is 10 and \'b\' is 4, the modulus calculation would be 10 % 4, which equals 2. Therefore, t

Now let's print the masked out example - you should see only the answer is present:

In [None]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

'                               In programming, the modulus operator is represented by the \'%\' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:\n\n```python\n# Calculate the modulus\nModulus = a % b\n\nprint("Modulus of the given numbers is: ", Modulus)\n```\n\nIn this code snippet, the variables \'a\' and \'b\' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator \'%\', we calculate the remainder when \'a\' is divided by \'b\'. The result is then stored in the variable \'Modulus\'. Finally, the modulus value is printed using the \'print\' statement.\n\nFor example, if \'a\' is 10 and \'b\' is 4, the modulus calculation would be 10 % 4, which equals 2. Therefore, the output of the above code would be:\n\n```\nModulus of the given numbers is: 2\n```\n\nThis means that the modulus of 10 and 4 is 2.<end_of_t

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
4.283 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 100,000 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 14,901,248/4,000,000,000 (0.37% trained)
It is strongly recommended to train Gemma3 models with the `eager` attention implementation instead of `sdpa`. Use `eager` with `AutoModelForCausalLM.from_pretrained('<path-to-checkpoint>', attn_implementation='eager')`.


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1 | 1.2377 |
| 2 | 1.6364 |
| 3 | 1.7663 |
| 4 | 1.4207 |
| 5 | 1.2357 |
| 6 | 1.8066 |
| 7 | 1.0101 |
| 8 | 1.8966 |
| 9 | 1.4647 |
| 10 | 1.3097 |


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1068.4322 seconds used for training.
17.81 minutes used for training.
Peak reserved memory = 13.561 GB.
Peak reserved memory for training = 9.278 GB.
Peak reserved memory % of max memory = 91.995 %.
Peak reserved memory for training % of max memory = 62.94 %.
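
The printed figures can be reproduced by hand from three numbers in the outputs above: the starting reservation (4.283 GB), the peak reservation (13.561 GB), and the GPU total (14.741 GB). The function name `memory_report` is an illustrative assumption; the arithmetic follows the stats cell.

```python
def memory_report(peak_gb: float, start_gb: float, total_gb: float) -> dict:
    """Recompute the memory summary the same way the stats cell does."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }

report = memory_report(peak_gb=13.561, start_gb=4.283, total_gb=14.741)
print(report)
print(f"{round(1068.4322 / 60, 2)} minutes")  # train_runtime converted to minutes
```

The "for training" figures subtract the memory already reserved after loading the model, so they isolate what the training run itself added.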


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`

In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)
messages = [{
    "role": "user",
    "content": [{
        "type" : "text",
        "text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
    }]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)
outputs = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)

['<bos><start_of_turn>user\nContinue the sequence: 1, 1, 2, 3, 5, 8,<end_of_turn>\n<start_of_turn>model\n13, 21, 34, 55, 89...\n\nThis is the Fibonacci sequence, where each number is the sum of the two preceding ones.\n<end_of_turn>']
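
For intuition about what `temperature`, `top_k`, and `top_p` do, here is a pure-Python sketch of the filtering applied to one step's logits. It is a simplified illustration with hypothetical function names, not the code path `model.generate` runs, and real implementations operate on tensors.

```python
import math
import random

def filter_candidates(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return (token_index, probability) pairs that survive top-k then top-p filtering."""
    scaled = [x / temperature for x in logits]
    # top-k: keep the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    exps = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(exps)
    probs = [e / total for e in exps]
    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for idx, p in zip(ranked, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

def sample_token(logits, rng=random, **kwargs):
    """Draw one token index from the filtered candidates."""
    kept = filter_candidates(logits, **kwargs)
    indices = [i for i, _ in kept]
    weights = [p for _, p in kept]
    return rng.choices(indices, weights=weights, k=1)[0]

print(filter_candidates([2.0, 1.0, 0.0, -1.0], top_k=2, top_p=1.0))
print(sample_token([10.0, 0.0, 0.0]))
```

Lower temperatures sharpen the distribution, while smaller `top_k` or `top_p` values shrink the candidate pool before sampling.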

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting the whole time!

In [None]:
messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "Why is the sky blue?",}]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

Okay, let's break down why the sky is blue! It's a fascinating phenomenon that boils down to a combination of physics and light. Here's the explanation:

**1. Sunlight and its Colors:**

* Sunlight, which appears white to us, is actually made up of *all* the


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("gemma-3")  # Local saving
tokenizer.save_pretrained("gemma-3")
# model.push_to_hub("HF_ACCOUNT/gemma-3", token = "...") # Online saving
# tokenizer.push_to_hub("HF_ACCOUNT/gemma-3", token = "...") # Online saving

['gemma-3/processor_config.json']

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "gemma-3", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )

messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "What is Gemma-3?",}]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

Okay, let's break down what Gemma-3 is. It's a fascinating development in the world of AI, and here's a comprehensive overview:

**1. What it is:**

* **A Family of Open-Weight Language Models:** Gemma-3 isn't just *one* model


### Saving to float16 for vLLM

We also support saving to `float16` directly for deployment! We save it in the folder `gemma-3-finetune`. Set `if False` to `if True` to let it run!

In [None]:
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-3-finetune", tokenizer)

If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-3-finetune", tokenizer,
        token = "hf_..."
    )

### GGUF / llama.cpp Conversion
We now natively support saving to `GGUF` / `llama.cpp` for all models! For now, you can convert easily to `Q8_0`, `F16` or `BF16` precision. `Q4_K_M` for 4bit will come later!

In [None]:
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma-3-finetune",
        quantization_type = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )

Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "gemma-3-finetune",
        quantization_type = "Q8_0", # Only Q8_0, BF16, F16 supported
        repo_id = "HF_ACCOUNT/gemma-finetune-gguf",
        token = "hf_...",
    )

Now, use the exported GGUF file, such as `gemma-3-finetune.gguf`, in llama.cpp. The `gemma-3-finetune-Q4_K_M.gguf` variant applies only once `Q4_K_M` export is available, as noted above.

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, continued pretraining, conversational finetuning and more in our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!



### Model Evaluation and Performance Analysis

Now let's evaluate the fine-tuned model's performance on legal questions and generate comprehensive results for analysis.

In [None]:
# Simple Model Evaluation
import json
import os
from datetime import datetime
from datasets import load_dataset
import random

# Install evaluation packages if needed
try:
    import nltk
    from rouge_score import rouge_scorer
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
except ImportError:
    print("Installing evaluation packages...")
    !pip install rouge-score nltk -q
    import nltk
    from rouge_score import rouge_scorer
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Download NLTK data
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt', quiet=True)

print("✅ Simple evaluation setup complete")

### Comprehensive Model Evaluation

Let's implement a thorough evaluation of our fine-tuned model to assess its performance on Sri Lankan legal questions. We'll use the validation dataset and compute various metrics to understand how well our model performs.

In [None]:
# Install required packages for evaluation
!pip install scikit-learn rouge-score bert-score seaborn matplotlib pandas numpy

In [None]:
# Import required libraries for evaluation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from rouge_score import rouge_scorer
import json
import time
from tqdm import tqdm
import torch
from transformers import TextStreamer
from datasets import load_dataset

# Set up visualization parameters
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

In [None]:
# Load the evaluation dataset
print("Loading evaluation dataset...")
eval_dataset = load_dataset("Nishan726/sri-lankan-legal-conversations", split="validation")
print(f"Loaded {len(eval_dataset)} examples for evaluation")

# Standardize the evaluation dataset format
from unsloth.chat_templates import standardize_data_formats
eval_dataset = standardize_data_formats(eval_dataset)

# Apply the same formatting function as used for training
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False).removeprefix('<bos>') for convo in convos]
    return {"text": texts}

eval_dataset = eval_dataset.map(formatting_prompts_func, batched=True)

print("Sample evaluation example:")
print(eval_dataset[0]["text"][:500] + "..." if len(eval_dataset[0]["text"]) > 500 else eval_dataset[0]["text"])

In [None]:
# Evaluation Functions
def extract_question_answer_pairs(eval_dataset, max_samples=50):
    """Extract question-answer pairs from the evaluation dataset."""
    qa_pairs = []
    
    for i, example in enumerate(eval_dataset):
        if i >= max_samples:
            break
            
        conversations = example["conversations"]
        if len(conversations) >= 2:
            # Get the user question and assistant answer
            user_msg = conversations[0]["content"] if conversations[0]["role"] == "user" else None
            assistant_msg = conversations[1]["content"] if conversations[1]["role"] == "assistant" else None
            
            if user_msg and assistant_msg:
                qa_pairs.append({
                    "question": user_msg,
                    "reference_answer": assistant_msg,
                    "example_id": i
                })
    
    return qa_pairs

def generate_model_response(question, model, tokenizer, max_new_tokens=256):
    """Generate a response from the fine-tuned model."""
    messages = [{
        "role": "user",
        "content": [{"type": "text", "text": question}]
    }]
    
    text = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True
    )
    
    with torch.no_grad():
        outputs = model.generate(
            **tokenizer([text], return_tensors="pt").to("cuda"),
            max_new_tokens=max_new_tokens,
            temperature=1.0,
            top_p=0.95,
            top_k=64,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Extract only the generated response (remove the input prompt)
    input_length = len(tokenizer([text], return_tensors="pt")["input_ids"][0])
    generated_tokens = outputs[0][input_length:]
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    
    return response.strip()

# Extract question-answer pairs for evaluation
print("Extracting Q&A pairs for evaluation...")
qa_pairs = extract_question_answer_pairs(eval_dataset, max_samples=30)  # Reduced for faster evaluation
print(f"Extracted {len(qa_pairs)} Q&A pairs for evaluation")

In [None]:
# Run Model Evaluation
print("Starting model evaluation...")
evaluation_results = []

# Track timing and performance
start_time = time.time()
total_input_tokens = 0
total_output_tokens = 0

for i, qa_pair in enumerate(tqdm(qa_pairs, desc="Evaluating model")):
    try:
        # Generate response
        generated_response = generate_model_response(
            qa_pair["question"], 
            model, 
            tokenizer, 
            max_new_tokens=256
        )
        
        # Count tokens for analysis
        input_tokens = len(tokenizer.encode(qa_pair["question"]))
        output_tokens = len(tokenizer.encode(generated_response))
        total_input_tokens += input_tokens
        total_output_tokens += output_tokens
        
        evaluation_results.append({
            "example_id": qa_pair["example_id"],
            "question": qa_pair["question"],
            "reference_answer": qa_pair["reference_answer"],
            "generated_answer": generated_response,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
        })
        
        # Print a few examples
        if i < 3:
            print(f"\n--- Example {i+1} ---")
            print(f"Question: {qa_pair['question'][:200]}...")
            print(f"Reference: {qa_pair['reference_answer'][:200]}...")
            print(f"Generated: {generated_response[:200]}...")
            print("-" * 50)
            
    except Exception as e:
        print(f"Error processing example {i}: {e}")
        continue

evaluation_time = time.time() - start_time
print(f"\nEvaluation completed in {evaluation_time:.2f} seconds")
print(f"Average time per example: {evaluation_time / max(len(evaluation_results), 1):.2f} seconds")  # max(..., 1) avoids dividing by zero if every example failed
print(f"Total input tokens: {total_input_tokens}")
print(f"Total output tokens: {total_output_tokens}")
print(f"Successfully evaluated {len(evaluation_results)} examples")

In [None]:
# Calculate Evaluation Metrics
print("Calculating evaluation metrics...")

# Initialize ROUGE scorer
rouge_scorer_obj = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Calculate metrics for each example
rouge1_scores = []
rouge2_scores = []
rougeL_scores = []
response_lengths = []
question_lengths = []

for result in evaluation_results:
    # Calculate ROUGE scores
    rouge_scores = rouge_scorer_obj.score(result["reference_answer"], result["generated_answer"])
    
    rouge1_scores.append(rouge_scores['rouge1'].fmeasure)
    rouge2_scores.append(rouge_scores['rouge2'].fmeasure)
    rougeL_scores.append(rouge_scores['rougeL'].fmeasure)
    
    # Calculate lengths
    response_lengths.append(len(result["generated_answer"].split()))
    question_lengths.append(len(result["question"].split()))

# Calculate average metrics
avg_rouge1 = np.mean(rouge1_scores)
avg_rouge2 = np.mean(rouge2_scores)
avg_rougeL = np.mean(rougeL_scores)
avg_response_length = np.mean(response_lengths)
avg_question_length = np.mean(question_lengths)

# Print results
print("\n" + "="*60)
print("EVALUATION RESULTS SUMMARY")
print("="*60)
print(f"Number of examples evaluated: {len(evaluation_results)}")
print(f"Average ROUGE-1 Score: {avg_rouge1:.4f}")
print(f"Average ROUGE-2 Score: {avg_rouge2:.4f}")
print(f"Average ROUGE-L Score: {avg_rougeL:.4f}")
print(f"Average Response Length: {avg_response_length:.1f} words")
print(f"Average Question Length: {avg_question_length:.1f} words")
print(f"Total Evaluation Time: {evaluation_time:.2f} seconds")
print("="*60)

# Store metrics for visualization
metrics_summary = {
    'rouge1': avg_rouge1,
    'rouge2': avg_rouge2,
    'rougeL': avg_rougeL,
    'avg_response_length': avg_response_length,
    'avg_question_length': avg_question_length,
    'evaluation_time': evaluation_time,
    'num_examples': len(evaluation_results),
    'total_input_tokens': total_input_tokens,
    'total_output_tokens': total_output_tokens
}
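
To show what the ROUGE-1 F-measure above is counting, here is a hand-rolled sketch based on unigram overlap. It is a simplification under stated assumptions: it splits on whitespace and lowercases, whereas the `rouge_score` package uses its own tokenizer and, with `use_stemmer=True`, also stems words, so the numbers will not match exactly. The function name `rouge1_f` is hypothetical.

```python
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F-measure from clipped unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per word (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("the cat sat on the mat", "the cat lay on a mat"))
```

ROUGE-2 follows the same pattern with bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.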

In [None]:
# Create Comprehensive Visualizations
print("Creating visualizations...")

# Create a comprehensive evaluation report with multiple plots
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Fine-tuned Gemma-3 Model Evaluation Report', fontsize=16, fontweight='bold')

# 1. ROUGE Scores Comparison
rouge_metrics = ['ROUGE-1', 'ROUGE-2', 'ROUGE-L']
rouge_values = [avg_rouge1, avg_rouge2, avg_rougeL]
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']

bars = axes[0, 0].bar(rouge_metrics, rouge_values, color=colors, alpha=0.8)
axes[0, 0].set_title('ROUGE Scores', fontweight='bold')
axes[0, 0].set_ylabel('Score')
axes[0, 0].set_ylim(0, 1)
# Add value labels on bars
for bar, value in zip(bars, rouge_values):
    axes[0, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
                    f'{value:.3f}', ha='center', va='bottom', fontweight='bold')

# 2. ROUGE Score Distribution
rouge_data = pd.DataFrame({
    'ROUGE-1': rouge1_scores,
    'ROUGE-2': rouge2_scores,
    'ROUGE-L': rougeL_scores
})

axes[0, 1].boxplot([rouge1_scores, rouge2_scores, rougeL_scores], 
                   labels=['ROUGE-1', 'ROUGE-2', 'ROUGE-L'])
axes[0, 1].set_title('ROUGE Score Distributions', fontweight='bold')
axes[0, 1].set_ylabel('Score')
axes[0, 1].grid(True, alpha=0.3)

# 3. Response Length Distribution
axes[0, 2].hist(response_lengths, bins=15, color='#96CEB4', alpha=0.7, edgecolor='black')
axes[0, 2].axvline(avg_response_length, color='red', linestyle='--', linewidth=2, 
                   label=f'Mean: {avg_response_length:.1f}')
axes[0, 2].set_title('Generated Response Length Distribution', fontweight='bold')
axes[0, 2].set_xlabel('Response Length (words)')
axes[0, 2].set_ylabel('Frequency')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)

# 4. Performance Metrics Bar Chart
performance_metrics = ['Avg Response\nLength (words)', 'Avg Question\nLength (words)', 
                      'Total Input\nTokens (÷100)', 'Total Output\nTokens (÷100)']
performance_values = [avg_response_length, avg_question_length, 
                     total_input_tokens/100, total_output_tokens/100]  # Scale tokens for visibility
performance_labels = [f'{avg_response_length:.1f}', f'{avg_question_length:.1f}', 
                     f'{total_input_tokens}', f'{total_output_tokens}']

bars = axes[1, 0].bar(performance_metrics, performance_values, 
                      color=['#FECA57', '#48CAE4', '#FF6B6B', '#95E1D3'], alpha=0.8)
axes[1, 0].set_title('Performance Metrics', fontweight='bold')
axes[1, 0].set_ylabel('Value')
# Add labels
for bar, label in zip(bars, performance_labels):
    axes[1, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(performance_values)*0.01, 
                    label, ha='center', va='bottom', fontweight='bold')

# 5. Correlation: Question Length vs Response Length
scatter = axes[1, 1].scatter(question_lengths, response_lengths, alpha=0.6, color='#7209B7')
axes[1, 1].set_xlabel('Question Length (words)')
axes[1, 1].set_ylabel('Response Length (words)')
axes[1, 1].set_title('Question vs Response Length', fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

# Add correlation coefficient
correlation = np.corrcoef(question_lengths, response_lengths)[0, 1]
axes[1, 1].text(0.05, 0.95, f'Correlation: {correlation:.3f}', 
                transform=axes[1, 1].transAxes, fontweight='bold',
                bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8))

# 6. Model Performance Summary
axes[1, 2].axis('off')
# Plain-text labels only: Matplotlib's default fonts lack most emoji glyphs,
# which would trigger "Glyph missing from font" warnings and render empty boxes
summary_text = f"""
Model Evaluation Summary

Dataset: Sri Lankan Legal Conversations
Examples Evaluated: {len(evaluation_results)}
Total Time: {evaluation_time:.1f}s
Avg Time/Example: {evaluation_time/len(evaluation_results):.2f}s

ROUGE Scores:
  • ROUGE-1: {avg_rouge1:.3f}
  • ROUGE-2: {avg_rouge2:.3f}
  • ROUGE-L: {avg_rougeL:.3f}

Response Analysis:
  • Avg Length: {avg_response_length:.1f} words
  • Token Usage: {total_output_tokens:,} tokens
"""

axes[1, 2].text(0.1, 0.9, summary_text, transform=axes[1, 2].transAxes, 
                fontsize=11, verticalalignment='top', fontfamily='monospace',
                bbox=dict(boxstyle="round,pad=0.5", facecolor="#F8F9FA", alpha=0.8))

plt.tight_layout()
plt.show()

# Save the evaluation results
results_df = pd.DataFrame(evaluation_results)
results_df.to_csv('model_evaluation_results.csv', index=False)
print("\n✅ Evaluation results saved to 'model_evaluation_results.csv'")

# Save metrics summary
with open('evaluation_metrics_summary.json', 'w') as f:
    json.dump(metrics_summary, f, indent=2)
print(f"✅ Metrics summary saved to 'evaluation_metrics_summary.json'")
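
`json.dump` only handles built-in Python types. If any value in `metrics_summary` turns out to be a NumPy or PyTorch scalar (for example, token counts accumulated from tensors), the call can raise `TypeError`. The following is a minimal sketch of a converter, assuming such scalars expose an `.item()` method; `to_builtin` and `FakeScalar` are hypothetical names used only for illustration and are not part of this notebook.

```python
import json

def to_builtin(value):
    """Recursively convert scalar-like objects to plain Python types."""
    if isinstance(value, dict):
        return {str(k): to_builtin(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_builtin(v) for v in value]
    # Assumption: NumPy / PyTorch scalars expose .item() returning a Python number
    if hasattr(value, "item") and callable(value.item):
        return value.item()
    return value

class FakeScalar:
    """Stand-in for a NumPy/PyTorch scalar, used only for this demo."""
    def __init__(self, v):
        self._v = v
    def item(self):
        return self._v

demo_summary = {"rouge1": FakeScalar(0.25), "num_examples": 20, "lengths": (FakeScalar(3), 4)}
print(json.dumps(to_builtin(demo_summary), indent=2))
```

In this notebook the call would look like `json.dump(to_builtin(metrics_summary), f, indent=2)`.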

In [None]:
# Detailed Qualitative Analysis
print("="*80)
print("DETAILED QUALITATIVE ANALYSIS")
print("="*80)

# Show best and worst performing examples based on ROUGE-L scores
results_with_scores = []
for i, result in enumerate(evaluation_results):
    rouge_scores = rouge_scorer_obj.score(result["reference_answer"], result["generated_answer"])
    results_with_scores.append({
        **result,
        'rouge_l_score': rouge_scores['rougeL'].fmeasure
    })

# Sort by ROUGE-L score
results_with_scores.sort(key=lambda x: x['rouge_l_score'], reverse=True)

print("\n🏆 TOP 3 BEST PERFORMING EXAMPLES:")
print("-" * 60)
for i, result in enumerate(results_with_scores[:3]):
    print(f"\nExample {i+1} (ROUGE-L: {result['rouge_l_score']:.3f})")
    print(f"Question: {result['question'][:150]}...")
    print(f"Reference: {result['reference_answer'][:150]}...")
    print(f"Generated: {result['generated_answer'][:150]}...")
    print("-" * 60)

print("\n⚠️ BOTTOM 3 EXAMPLES (NEED IMPROVEMENT):")
print("-" * 60)
for i, result in enumerate(results_with_scores[-3:]):
    print(f"\nExample {i+1} (ROUGE-L: {result['rouge_l_score']:.3f})")
    print(f"Question: {result['question'][:150]}...")
    print(f"Reference: {result['reference_answer'][:150]}...")
    print(f"Generated: {result['generated_answer'][:150]}...")
    print("-" * 60)
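
When fewer than six examples are evaluated, the `[:3]` and `[-3:]` slices above overlap, so the same example can appear as both "best" and "worst". A small sketch of one way to avoid that, with the worst example listed first; `split_best_worst` is a hypothetical helper, and the demo scores are made up for illustration.

```python
def split_best_worst(scored, k=3, key="rouge_l_score"):
    """Return (best, worst) lists of up to k items each, with no item in both."""
    ranked = sorted(scored, key=lambda r: r[key], reverse=True)
    k = min(k, len(ranked) // 2)   # cap k so the two slices never overlap
    best = ranked[:k]
    worst = ranked[::-1][:k]       # lowest score first
    return best, worst

# Made-up scores, only to show the behaviour on a short list
demo = [{"id": i, "rouge_l_score": s} for i, s in enumerate([0.4, 0.1, 0.3, 0.2, 0.5])]
best, worst = split_best_worst(demo, k=3)
print([r["id"] for r in best], [r["id"] for r in worst])
```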

In [None]:
# Generate Comprehensive Evaluation Report
report = f"""
{'='*80}
GEMMA-3 4B FINE-TUNED MODEL EVALUATION REPORT
{'='*80}

📋 EXECUTIVE SUMMARY
{'='*50}
This report presents the evaluation results of the fine-tuned Gemma-3 4B model 
on Sri Lankan legal conversations. The model was evaluated on {len(evaluation_results)} 
examples from the validation dataset.

📊 KEY METRICS
{'='*50}
• ROUGE-1 Score: {avg_rouge1:.4f} (measures unigram overlap)
• ROUGE-2 Score: {avg_rouge2:.4f} (measures bigram overlap)  
• ROUGE-L Score: {avg_rougeL:.4f} (measures longest common subsequence)
• Average Response Length: {avg_response_length:.1f} words
• Evaluation Time: {evaluation_time:.2f} seconds ({evaluation_time/len(evaluation_results):.2f}s per example)

🎯 PERFORMANCE ANALYSIS
{'='*50}
"""

# Add performance interpretation
if avg_rouge1 > 0.3:
    report += "✅ ROUGE-1 Score: GOOD - Model shows strong unigram overlap with references\n"
elif avg_rouge1 > 0.2:
    report += "⚠️ ROUGE-1 Score: MODERATE - Decent unigram overlap, room for improvement\n"
else:
    report += "❌ ROUGE-1 Score: LOW - Limited unigram overlap with references\n"

if avg_rouge2 > 0.2:
    report += "✅ ROUGE-2 Score: GOOD - Strong bigram overlap indicates good phrase matching\n"
elif avg_rouge2 > 0.1:
    report += "⚠️ ROUGE-2 Score: MODERATE - Some bigram overlap present\n"
else:
    report += "❌ ROUGE-2 Score: LOW - Limited bigram overlap\n"

if avg_rougeL > 0.25:
    report += "✅ ROUGE-L Score: GOOD - Good structural similarity with references\n"
elif avg_rougeL > 0.15:
    report += "⚠️ ROUGE-L Score: MODERATE - Some structural similarity\n"
else:
    report += "❌ ROUGE-L Score: LOW - Limited structural similarity\n"

report += f"""
📈 RESPONSE CHARACTERISTICS
{'='*50}
• Response Length Statistics:
  - Average: {avg_response_length:.1f} words
  - Min: {min(response_lengths)} words
  - Max: {max(response_lengths)} words
  - Std Dev: {np.std(response_lengths):.1f} words

• Token Usage:
  - Total Input Tokens: {total_input_tokens:,}
  - Total Output Tokens: {total_output_tokens:,}
  - Average Output Tokens per Response: {total_output_tokens/len(evaluation_results):.1f}

🔍 SCORE DISTRIBUTION ANALYSIS
{'='*50}
• ROUGE-1 Distribution: μ={np.mean(rouge1_scores):.3f}, σ={np.std(rouge1_scores):.3f}
• ROUGE-2 Distribution: μ={np.mean(rouge2_scores):.3f}, σ={np.std(rouge2_scores):.3f}
• ROUGE-L Distribution: μ={np.mean(rougeL_scores):.3f}, σ={np.std(rougeL_scores):.3f}

📋 RECOMMENDATIONS
{'='*50}
"""

# Add recommendations based on performance
recommendations = []

if avg_rouge1 < 0.3:
    recommendations.append("• Consider increasing training epochs or adjusting learning rate for better content overlap")
    
if avg_rouge2 < 0.15:
    recommendations.append("• Focus on improving phrase-level understanding through more diverse training examples")
    
if avg_rougeL < 0.2:
    recommendations.append("• Work on improving response structure and coherence")

if np.std(response_lengths) > 50:
    recommendations.append("• Consider response length regularization to ensure consistent output length")

if correlation < 0.3:
    recommendations.append("• Improve model's ability to scale response length based on question complexity")

if len(recommendations) == 0:
    recommendations.append("• Model performance is good! Consider testing on additional diverse legal scenarios")
    recommendations.append("• Explore advanced evaluation metrics like BERTScore for semantic similarity")

for rec in recommendations:
    report += rec + "\n"

report += f"""
🎯 NEXT STEPS
{'='*50}
1. Conduct human evaluation for response quality assessment
2. Test on out-of-domain legal questions for generalization
3. Compare with baseline models (e.g., base Gemma-3 without fine-tuning)
4. Implement feedback collection mechanism for continuous improvement
5. Consider domain-specific evaluation metrics for legal accuracy

📊 DATA EXPORT
{'='*50}
• Detailed results: model_evaluation_results.csv
• Metrics summary: evaluation_metrics_summary.json
• Evaluation plots: Generated and displayed above

{'='*80}
Report Generated: {time.strftime('%Y-%m-%d %H:%M:%S')}
{'='*80}
"""

print(report)

# Save the report
with open('model_evaluation_report.txt', 'w', encoding='utf-8') as f:
    f.write(report)
    
print(f"\n✅ Complete evaluation report saved to 'model_evaluation_report.txt'")
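
The report above labels scores with one set of cut-offs (for example, ROUGE-2 is GOOD above 0.2) and triggers recommendations with another (ROUGE-2 below 0.15). A ROUGE-2 of 0.17 is therefore labelled MODERATE but produces no recommendation. If a single set of cut-offs is preferred, a table-driven sketch like the one below keeps both decisions in sync. The threshold values are copied from the interpretation block; `THRESHOLDS`, `interpret`, and `needs_recommendation` are hypothetical names.

```python
# (good, moderate) cut-offs copied from the interpretation block of the report
THRESHOLDS = {
    "ROUGE-1": (0.30, 0.20),
    "ROUGE-2": (0.20, 0.10),
    "ROUGE-L": (0.25, 0.15),
}

def interpret(metric, score):
    """Map a score to GOOD / MODERATE / LOW using the shared table."""
    good, moderate = THRESHOLDS[metric]
    if score > good:
        return "GOOD"
    if score > moderate:
        return "MODERATE"
    return "LOW"

def needs_recommendation(metric, score):
    """Assumption: anything not labelled GOOD should get a recommendation."""
    return interpret(metric, score) != "GOOD"

print(interpret("ROUGE-2", 0.17), needs_recommendation("ROUGE-2", 0.17))
```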

In [None]:
# Update Trainer with Evaluation Dataset (For Future Training)
print("Setting up trainer with evaluation dataset for future training runs...")

# Create a new trainer instance with evaluation dataset
trainer_with_eval = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    eval_dataset=eval_dataset,  # Now we have evaluation!
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,  # Added eval batch size
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=30,
        learning_rate=2e-4,
        logging_steps=1,
        eval_steps=10,  # Evaluate every 10 steps
        eval_strategy="steps",  # Enable evaluation during training (named `evaluation_strategy` in older transformers releases)
        save_steps=10,
        save_total_limit=3,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        report_to="none",
        load_best_model_at_end=True,  # Load best model based on eval
        metric_for_best_model="eval_loss",  # Use eval loss to determine best model
    ),
)

# Apply the same training optimization
trainer_with_eval = train_on_responses_only(
    trainer_with_eval,
    instruction_part="<start_of_turn>user\n",
    response_part="<start_of_turn>model\n",
)

print("✅ Trainer with evaluation dataset is ready!")
print("💡 You can now run trainer_with_eval.train() to train with evaluation monitoring")
print("📊 This will provide evaluation metrics during training for better monitoring")
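
With `load_best_model_at_end=True`, the Hugging Face trainer expects the save and evaluation strategies to match and, for step-based schedules, `save_steps` to be a multiple of `eval_steps`, so that every evaluated checkpoint is also saved. The configuration above satisfies this (10 and 10). A small pre-flight check can catch mismatches before a long run starts; this is a sketch with a hypothetical helper name, not a transformers API.

```python
def check_best_model_settings(eval_strategy, save_strategy, eval_steps, save_steps):
    """Return a list of problems; an empty list means the settings look compatible."""
    problems = []
    if eval_strategy != save_strategy:
        problems.append("eval and save strategies differ")
    if eval_strategy == "steps" and save_steps % eval_steps != 0:
        problems.append("save_steps is not a multiple of eval_steps")
    return problems

# Values taken from the SFTConfig above (save strategy defaults to "steps")
print(check_best_model_settings("steps", "steps", eval_steps=10, save_steps=10))
# A mismatched example: checkpoints every 15 steps, evaluation every 10
print(check_best_model_settings("steps", "steps", eval_steps=10, save_steps=15))
```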

In [None]:
# Optional: Setup Gemini API for enhanced analysis
try:
    import google.generativeai as genai
    from dotenv import load_dotenv
    load_dotenv()
    
    gemini_api_key = os.getenv('GEMINI_API_KEY')
    if gemini_api_key:
        genai.configure(api_key=gemini_api_key)
        gemini_model = genai.GenerativeModel('gemini-1.5-flash')  # Use gemini_model variable
        print("✅ Gemini API configured successfully")
        gemini_available = True
    else:
        print("⚠️ GEMINI_API_KEY not found. Using basic evaluation only.")
        gemini_available = False
except ImportError:
    print("⚠️ Google GenerativeAI not installed. Using basic evaluation only.")
    gemini_available = False
except Exception as e:
    print(f"⚠️ Gemini setup failed: {e}. Using basic evaluation only.")
    gemini_available = False

In [None]:
# Load simple test cases
print("Loading test cases...")

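# Try to load from dataset
# Note: a separate name (raw_dataset) keeps the formatted training `dataset` intact.
# These rows come from the train split, so the model may have seen them during
# fine-tuning and the resulting scores can be optimistic.
try:
    raw_dataset = load_dataset("Nishan726/sri-lankan-legal-conversations", split="train")
    # Use last 20 examples as test cases
    test_indices = list(range(len(raw_dataset)-20, len(raw_dataset)))
    test_data = raw_dataset.select(test_indices)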
    
    test_cases = []
    for i, example in enumerate(test_data):
        conversations = example['conversations']
        
        # Find user question and assistant answer
        question = ""
        reference = ""
        
        for conv in conversations:
            if conv['role'] == 'user':
                question = conv['content']
            elif conv['role'] == 'assistant':
                reference = conv['content']
                break
        
        if question and reference:
            test_cases.append({
                'id': i + 1,
                'question': question,
                'reference': reference
            })
    
    print(f"✅ Loaded {len(test_cases)} test cases from dataset")
    
except Exception as e:
    print(f"⚠️ Dataset loading failed: {e}")
    print("Using sample test cases...")
    
    # Fallback sample cases
    test_cases = [
        {
            'id': 1,
            'question': "What are the fundamental rights in Sri Lankan constitution?",
            'reference': "The fundamental rights in Sri Lanka are enshrined in Chapter III of the Constitution. They include freedom of speech, expression, assembly, and religion, as well as the right to equality and due process."
        },
        {
            'id': 2,
            'question': "Define theft under Sri Lankan Penal Code",
            'reference': "Whoever, intending to take dishonestly any movable property out of the possession of any person without that person's consent, moves that property in order to such taking, is said to commit 'theft'."
        },
        {
            'id': 3,
            'question': "What is the role of the Supreme Court in Sri Lanka?",
            'reference': "The Supreme Court is the highest court in Sri Lanka with final appellate jurisdiction. It has the power to interpret the Constitution and protect fundamental rights."
        }
    ]
    print(f"✅ Using {len(test_cases)} sample test cases")

print(f"Ready to evaluate {len(test_cases)} test cases")
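
Because the test cases are taken from the tail of the train split, they are only a fair test if those rows were excluded from fine-tuning. One way to guarantee that is to fix a held-out set of row indices before training and use the same indices at evaluation time. The sketch below uses only the standard library; `holdout_indices` is a hypothetical helper, and the seed simply reuses the value from the training config.

```python
import random

def holdout_indices(n_rows, n_test=20, seed=3407):
    """Split range(n_rows) into disjoint (train, test) index lists, reproducibly."""
    if not 0 < n_test < n_rows:
        raise ValueError("n_test must be between 1 and n_rows - 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded, so the split is repeatable
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx

train_idx, test_idx = holdout_indices(n_rows=100, n_test=20)
print(len(train_idx), len(test_idx), set(train_idx) & set(test_idx))
```

The two lists can then be passed to `select`, as the cell above already does with `test_indices`. The `datasets` library also offers `Dataset.train_test_split` for the same purpose.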

In [None]:
# Generate model responses
def generate_response(question):
    """Generate response from the fine-tuned model"""
    messages = [{
        "role": "user",
        "content": [{"type": "text", "text": question}]
    }]
    
    # tokenize=False makes the template return a prompt string rather than token IDs
    text = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
    )
    
    inputs = tokenizer([text], return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )
    
    # generate() returns prompt + completion tokens, so slice off the prompt
    # and decode only the newly generated tokens. This avoids a second
    # tokenization pass and fragile string-prefix matching.
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    
    return response.strip()

# Generate responses for all test cases
print("Generating model responses...")
results = []

for test_case in test_cases:
    print(f"Processing question {test_case['id']}: {test_case['question'][:50]}...")
    
    model_response = generate_response(test_case['question'])
    
    results.append({
        'id': test_case['id'],
        'question': test_case['question'],
        'reference': test_case['reference'],
        'model_response': model_response
    })

print(f"✅ Generated {len(results)} responses")

In [None]:
# Simple evaluation metrics
def calculate_bleu(reference, hypothesis):
    """Calculate BLEU score"""
    ref_tokens = nltk.word_tokenize(reference.lower())
    hyp_tokens = nltk.word_tokenize(hypothesis.lower())
    smoothing = SmoothingFunction()
    return sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=smoothing.method1)

def calculate_rouge(reference, hypothesis):
    """Calculate ROUGE scores"""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, hypothesis)
    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }

# Calculate scores for all results
print("Calculating evaluation metrics...")

bleu_scores = []
rouge1_scores = []
rougeL_scores = []

for result in results:
    # BLEU score
    bleu = calculate_bleu(result['reference'], result['model_response'])
    bleu_scores.append(bleu)
    
    # ROUGE scores
    rouge = calculate_rouge(result['reference'], result['model_response'])
    rouge1_scores.append(rouge['rouge1'])
    rougeL_scores.append(rouge['rougeL'])
    
    result['bleu'] = bleu
    result['rouge1'] = rouge['rouge1']
    result['rougeL'] = rouge['rougeL']

# Calculate averages
avg_bleu = sum(bleu_scores) / len(bleu_scores)
avg_rouge1 = sum(rouge1_scores) / len(rouge1_scores)
avg_rougeL = sum(rougeL_scores) / len(rougeL_scores)

print("✅ Evaluation complete!")
print(f"Average BLEU Score: {avg_bleu:.3f}")
print(f"Average ROUGE-1: {avg_rouge1:.3f}")
print(f"Average ROUGE-L: {avg_rougeL:.3f}")
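
To make the BLEU number easier to interpret, the sketch below computes its two main ingredients by hand: clipped n-gram precision and the brevity penalty. It is a simplified illustration (whitespace tokenization, a single reference, no smoothing) and not NLTK's implementation, so its values will not match `sentence_bleu` exactly. The function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(reference, hypothesis, n):
    """Fraction of hypothesis n-grams found in the reference, with counts clipped."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total

def brevity_penalty(reference, hypothesis):
    """Penalise hypotheses that are shorter than the reference."""
    if not hypothesis:
        return 0.0
    if len(hypothesis) >= len(reference):
        return 1.0
    return math.exp(1 - len(reference) / len(hypothesis))

ref = "the court has final appellate jurisdiction".split()
hyp = "the court has appellate jurisdiction".split()
print(clipped_precision(ref, hyp, 1), clipped_precision(ref, hyp, 2))
print(round(brevity_penalty(ref, hyp), 3))
```

Here every word of the hypothesis appears in the reference, but one of its four bigrams does not, and the hypothesis is one word shorter, so both the bigram precision and the brevity penalty fall below 1.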

In [None]:
# Optional: Enhanced analysis with Gemini API
if gemini_available:
    print("Running enhanced analysis with Gemini...")
    
    def analyze_with_gemini(question, model_response, reference):
        """Use Gemini to analyze the quality of the response"""
        prompt = f"""
        Analyze this legal AI response:
        
        Question: {question}
        Model Response: {model_response}
        Reference Answer: {reference}
        
        Rate the response on:
        1. Accuracy (1-5)
        2. Completeness (1-5) 
        3. Legal terminology usage (1-5)
        
        Provide a brief explanation and overall score (1-5).
        Format: Accuracy: X, Completeness: X, Terminology: X, Overall: X, Explanation: [brief explanation]
        """
        
        try:
            response = gemini_model.generate_content(prompt)  # Using gemini_model here
            return response.text
        except Exception as e:
            return f"Analysis failed: {e}"
    
    # Analyze a few sample responses
    sample_analyses = []
    for i, result in enumerate(results[:3]):  # Analyze first 3 to save API calls
        print(f"Analyzing response {i+1}...")
        analysis = analyze_with_gemini(
            result['question'], 
            result['model_response'], 
            result['reference']
        )
        sample_analyses.append({
            'id': result['id'],
            'question': result['question'][:50] + "...",
            'analysis': analysis
        })
    
    print("✅ Gemini analysis complete!")
    for analysis in sample_analyses:
        print(f"\nQuestion {analysis['id']}: {analysis['question']}")
        print(f"Analysis: {analysis['analysis']}")
else:
    print("⚠️ Gemini API not available. Skipping enhanced analysis.")
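
The prompt above asks Gemini for a fixed `Accuracy: X, Completeness: X, ...` format, but the cell only prints the raw text. If numeric ratings are needed (for averaging, say), they can be pulled out with a regular expression. This is a sketch that assumes the model follows the requested format; it returns an empty dict when it does not, for instance on an `Analysis failed: ...` string. `parse_ratings` is a hypothetical helper and the sample text is made up.

```python
import re

# Field names match the "Format:" line of the prompt
RATING_PATTERN = re.compile(r"(Accuracy|Completeness|Terminology|Overall)\s*:\s*([1-5])\b")

def parse_ratings(analysis_text):
    """Extract the 1-5 ratings requested in the prompt; missing fields are omitted."""
    return {name.lower(): int(score) for name, score in RATING_PATTERN.findall(analysis_text)}

# Made-up response text, only to show the parsing
sample = "Accuracy: 4, Completeness: 3, Terminology: 5, Overall: 4, Explanation: Mostly correct."
print(parse_ratings(sample))
print(parse_ratings("Analysis failed: quota exceeded"))
```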

In [None]:
# Display results and save
print("\n" + "="*60)
print("FINAL EVALUATION RESULTS")
print("="*60)

print(f"Test Cases: {len(results)}")
print(f"Average BLEU Score: {avg_bleu:.3f}")
print(f"Average ROUGE-1: {avg_rouge1:.3f}")
print(f"Average ROUGE-L: {avg_rougeL:.3f}")

print("\n" + "="*60)
print("SAMPLE RESULTS")
print("="*60)

# Show best and worst cases
if results:
    best_result = max(results, key=lambda x: x['bleu'])
    worst_result = min(results, key=lambda x: x['bleu'])
    
    print(f"\n🏆 BEST RESULT (BLEU: {best_result['bleu']:.3f}):")
    print(f"Q: {best_result['question'][:80]}...")
    print(f"A: {best_result['model_response'][:100]}...")
    
    print(f"\n🔻 LOWEST RESULT (BLEU: {worst_result['bleu']:.3f}):")
    print(f"Q: {worst_result['question'][:80]}...")
    print(f"A: {worst_result['model_response'][:100]}...")

# Save results to JSON
from datetime import datetime  # harmless if already imported earlier in the notebook

evaluation_summary = {
    'timestamp': datetime.now().isoformat(),
    'model': 'Fine-tuned Gemma 3 (4B)',
    'test_cases': len(results),
    'average_scores': {
        'bleu': avg_bleu,
        'rouge1': avg_rouge1,
        'rougeL': avg_rougeL
    },
    'detailed_results': results
}

with open('simple_evaluation_results.json', 'w') as f:
    json.dump(evaluation_summary, f, indent=2)

print("\n✅ Results saved to 'simple_evaluation_results.json'")
print("✅ Simple evaluation complete!")

### 🎉 Simple Evaluation Complete!

Your fine-tuned Gemma 3 legal AI model has been evaluated with a clean, simple approach!

#### 📊 What was evaluated:
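- **BLEU Score**: Measures n-gram precision of the generated answer against the reference answer, with a penalty for overly short output
- **ROUGE-1**: Measures unigram (single-word) overlap with the reference
- **ROUGE-L**: Measures similarity based on the longest common subsequence shared with the reference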
- **Optional Gemini Analysis**: Enhanced quality assessment (if API key provided)
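
To illustrate the difference between ROUGE-1 and ROUGE-L, here is a simplified pure-Python sketch. It assumes whitespace tokenization and no stemming, so it is not the `rouge_score` implementation used above and its numbers can differ from that library's. The function names are hypothetical.

```python
from collections import Counter

def f_measure(matches, ref_len, hyp_len):
    """Harmonic mean of precision and recall."""
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    return 2 * precision * recall / (precision + recall)

def rouge1_f(reference, hypothesis):
    """Unigram overlap F-measure; word order is ignored."""
    overlap = Counter(reference) & Counter(hypothesis)
    return f_measure(sum(overlap.values()), len(reference), len(hypothesis))

def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rougeL_f(reference, hypothesis):
    """LCS-based F-measure; word order matters."""
    return f_measure(lcs_length(reference, hypothesis), len(reference), len(hypothesis))

ref = "the supreme court is the highest court".split()
hyp = "the highest court is the supreme court".split()
print(rouge1_f(ref, hyp), round(rougeL_f(ref, hyp), 3))
```

Both sentences contain exactly the same words, so the ROUGE-1 sketch gives a perfect score, while the swapped word order shortens the longest common subsequence and lowers the ROUGE-L value.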

#### 📁 Generated Files:
- **`simple_evaluation_results.json`** - All evaluation results and scores

#### 🚀 Simple and Clean Code:
1. **Small helper functions** - `generate_response`, `calculate_bleu`, and `calculate_rouge` each handle one step
2. **Uses `gemini_model`** - The Gemini client is stored in `gemini_model` and called from `analyze_with_gemini`
3. **Clear separation** - Each step is in its own cell
4. **Basic metrics** - Focus on essential evaluation metrics
5. **Optional enhancement** - Gemini analysis only if available

#### 💡 Key Features:
- ✅ Loads test cases from your dataset (or uses samples)
- ✅ Generates responses using your fine-tuned model
- ✅ Calculates standard evaluation metrics
- ✅ Shows best and worst performing cases
- ✅ Saves results to JSON file
- ✅ Optional Gemini analysis using `gemini_model` variable

**Your evaluation is now simple, clean, and easy to understand!** 🎯