## Model Comparison: Evaluate Fine-tuned Models

This notebook allows you to compare responses from three different fine-tuned models:
- `Edwinexd/lora_model_merged` (Model 1)
- `Edwinexd/lora_model_merged_2` (Model 2)  
- `Edwinexd/lora_model_merged_3` (Model 3)

You'll be able to run the same prompts on all three models and vote for which response you prefer.

In [1]:
%%capture
# Install dependencies (run this cell if starting fresh without running the notebook from the beginning)
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir git+https://github.com/unslothai/unsloth.git@nightly git+https://github.com/unslothai/unsloth-zoo.git
!pip install ipywidgets
!pip install -U bitsandbytes

In [2]:
import unsloth
# Load all three models for comparison
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
import torch

model_names = [
    "Edwinexd/lora_model_merged",
    "Edwinexd/lora_model_merged_2",
    "Edwinexd/lora_model_merged_3",
]

models = {}
tokenizers = {}

for model_name in model_names:
    print(f"Loading {model_name}...")
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=2048,
        dtype=None,
        load_in_4bit=True,
    )
    tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
    FastLanguageModel.for_inference(model)

    short_name = model_name.split("/")[-1]
    models[short_name] = model
    tokenizers[short_name] = tokenizer
    print(f"Loaded {model_name} successfully!\n")

print("All models loaded!")

ðŸ¦¥ Unsloth: Will patch your computer to enable 2x faster free finetuning.
ðŸ¦¥ Unsloth Zoo will now patch everything to make training faster!
Loading Edwinexd/lora_model_merged...
==((====))==  Unsloth 2025.11.6: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json: 0.00B [00:00, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

Loaded Edwinexd/lora_model_merged successfully!

Loading Edwinexd/lora_model_merged_2...
==((====))==  Unsloth 2025.11.6: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

Loaded Edwinexd/lora_model_merged_2 successfully!

Loading Edwinexd/lora_model_merged_3...
==((====))==  Unsloth 2025.11.6: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

Loaded Edwinexd/lora_model_merged_3 successfully!

All models loaded!


In [3]:
# Define test prompts for model comparison
test_prompts = [
    "Explain quantum computing to a 10-year-old.",
    "Write a short poem about artificial intelligence.",
    "What are the main differences between Python and JavaScript?",
    "How does photosynthesis work?",
    "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,",
    "Describe the water cycle in simple terms.",
    "What is the capital of France and what is it famous for?",
    "Explain why the sky is blue.",
    "Write a haiku about programming.",
    "What are three tips for learning a new language?",
]

print(f"Defined {len(test_prompts)} test prompts for comparison")

Defined 10 test prompts for comparison


In [4]:
# Function to generate response from a model
def generate_response(model, tokenizer, prompt, max_new_tokens=256):
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")

    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=max_new_tokens,
        use_cache=True,
        temperature=1.5,
        min_p=0.1,
        pad_token_id=tokenizer.eos_token_id,
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=False)
    # Extract only the assistant's response
    if "<|start_header_id|>assistant<|end_header_id|>" in response:
        response = response.split("<|start_header_id|>assistant<|end_header_id|>")[-1]
        response = response.split("<|eot_id|>")[0].strip()
    return response

# Generate all responses
all_responses = {}
for i, prompt in enumerate(test_prompts):
    print(f"\n{'='*60}")
    print(f"Prompt {i+1}/{len(test_prompts)}: {prompt}")
    print('='*60)

    all_responses[prompt] = {}
    for model_name in models.keys():
        print(f"\nGenerating response from {model_name}...")
        response = generate_response(models[model_name], tokenizers[model_name], prompt)
        all_responses[prompt][model_name] = response
        print(f"[{model_name}]: {response[:200]}..." if len(response) > 200 else f"[{model_name}]: {response}")

print("\n\nAll responses generated!")


Prompt 1/10: Explain quantum computing to a 10-year-old.

Generating response from lora_model_merged...


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


[lora_model_merged]: Quantum computing is a way of doing math that uses special kinds of computers. These computers are called quantum computers, and they use something called "quantum bits" or "qubits" to do their math.
...

Generating response from lora_model_merged_2...
[lora_model_merged_2]: Quantum computing is a type of computer that uses the principles of quantum mechanics to process information. This means that instead of using bits (0s and 1s) to represent information, quantum comput...

Generating response from lora_model_merged_3...
[lora_model_merged_3]: Imagine you have a super-powerful computer that can do anything in the world. But, this computer is very, very small and can only do a tiny bit of work at a time. That's like a toy car that can only g...

Prompt 2/10: Write a short poem about artificial intelligence.

Generating response from lora_model_merged...
[lora_model_merged]: In silicon halls, a mind is born,
A synthetic soul, with thoughts that form,
A world of dat

### Interactive Voting

Use the widget below to navigate through prompts and vote for the best response from each model.

In [None]:
import ipywidgets as widgets
from IPython.display import display, clear_output, HTML

# Store votes
votes = {model_name: 0 for model_name in models.keys()}
votes["tie"] = 0
vote_history = []

# Current prompt index
current_idx = [0]

# Create output areas
prompt_output = widgets.Output()
responses_output = widgets.Output()
voting_output = widgets.Output()
results_output = widgets.Output()

def display_current_prompt():
    prompt = test_prompts[current_idx[0]]

    with prompt_output:
        clear_output(wait=True)
        display(HTML(f"""
        <div style="background-color: #f0f0f0; padding: 15px; border-radius: 10px; margin-bottom: 10px;">
            <h3>Prompt {current_idx[0] + 1} of {len(test_prompts)}</h3>
            <p style="font-size: 16px; font-weight: bold;">{prompt}</p>
        </div>
        """))

    with responses_output:
        clear_output(wait=True)
        model_names = list(models.keys())
        colors = ["#e3f2fd", "#fff3e0", "#e8f5e9"]

        for i, model_name in enumerate(model_names):
            response = all_responses[prompt][model_name]
            display(HTML(f"""
            <div style="background-color: {colors[i]}; padding: 15px; border-radius: 10px; margin-bottom: 10px; border-left: 4px solid {'#1976d2' if i==0 else '#f57c00' if i==1 else '#388e3c'};">
                <h4 style="margin-top: 0;">Model: {model_name}</h4>
                <p style="white-space: pre-wrap;">{response}</p>
            </div>
            """))

def on_vote(model_name):
    votes[model_name] += 1
    vote_history.append({
        "prompt_idx": current_idx[0],
        "prompt": test_prompts[current_idx[0]],
        "voted_for": model_name
    })

    with voting_output:
        clear_output(wait=True)
        print(f"Voted for: {model_name}")

    # Auto-advance to next prompt
    if current_idx[0] < len(test_prompts) - 1:
        current_idx[0] += 1
        display_current_prompt()
    else:
        with voting_output:
            clear_output(wait=True)
            print("Voting complete! See results below.")
        update_results()

def on_prev(b):
    if current_idx[0] > 0:
        current_idx[0] -= 1
        display_current_prompt()

def on_next(b):
    if current_idx[0] < len(test_prompts) - 1:
        current_idx[0] += 1
        display_current_prompt()

def update_results():
    with results_output:
        clear_output(wait=True)
        total_votes = sum(votes.values())
        if total_votes > 0:
            display(HTML("<h3>Current Vote Tally:</h3>"))
            for model_name, count in sorted(votes.items(), key=lambda x: -x[1]):
                pct = (count / total_votes) * 100
                bar_width = int(pct * 2)
                color = "#1976d2" if "1" in model_name else "#f57c00" if "2" in model_name else "#388e3c" if "3" in model_name else "#9e9e9e"
                display(HTML(f"""
                <div style="margin: 5px 0;">
                    <span style="display: inline-block; width: 180px;">{model_name}:</span>
                    <span style="display: inline-block; background-color: {color}; width: {bar_width}px; height: 20px;"></span>
                    <span> {count} votes ({pct:.1f}%)</span>
                </div>
                """))

# Create vote buttons
model_names = list(models.keys())
vote_buttons = [
    widgets.Button(description=f"Vote: {name}", button_style='primary' if '1' in name else 'warning' if '2' in name else 'success')
    for name in model_names
]
tie_button = widgets.Button(description="Tie / No preference", button_style='')

for i, btn in enumerate(vote_buttons):
    btn.on_click(lambda b, m=model_names[i]: on_vote(m))
tie_button.on_click(lambda b: on_vote("tie"))

# Navigation buttons
prev_btn = widgets.Button(description="Previous", button_style='')
next_btn = widgets.Button(description="Next", button_style='')
prev_btn.on_click(on_prev)
next_btn.on_click(on_next)

# Layout
nav_box = widgets.HBox([prev_btn, next_btn])
vote_box = widgets.HBox(vote_buttons + [tie_button])

# Display everything
display(prompt_output)
display(responses_output)
display(HTML("<h4>Cast your vote:</h4>"))
display(vote_box)
display(nav_box)
display(voting_output)
display(results_output)

# Initial display
display_current_prompt()

Output()

Output()

HBox(children=(Button(button_style='success', description='Vote: lora_model_merged', style=ButtonStyle()), Butâ€¦

HBox(children=(Button(description='Previous', style=ButtonStyle()), Button(description='Next', style=ButtonStyâ€¦

Output()

Output()

### View Final Results

Run the cell below after voting to see a summary of results and export vote data.

In [8]:
# Display final results summary
import json
from datetime import datetime

print("=" * 60)
print("FINAL VOTING RESULTS")
print("=" * 60)

total_votes = sum(votes.values())
if total_votes > 0:
    print(f"\nTotal votes cast: {total_votes}\n")

    # Sort by votes descending
    sorted_votes = sorted(votes.items(), key=lambda x: -x[1])

    print("Rankings:")
    for rank, (model_name, count) in enumerate(sorted_votes, 1):
        pct = (count / total_votes) * 100
        bar = "â–ˆ" * int(pct / 5)
        print(f"  {rank}. {model_name}: {count} votes ({pct:.1f}%) {bar}")

    # Determine winner
    winner = sorted_votes[0]
    print(f"\nWinner: {winner[0]} with {winner[1]} votes ({(winner[1]/total_votes)*100:.1f}%)")

    # Export data
    export_data = {
        "timestamp": datetime.now().isoformat(),
        "total_votes": total_votes,
        "vote_counts": votes,
        "vote_history": vote_history,
        "prompts": test_prompts,
        "all_responses": all_responses
    }

    with open("model_comparison_results.json", "w") as f:
        json.dump(export_data, f, indent=2)

    print(f"\nResults exported to: model_comparison_results.json")
else:
    print("\nNo votes cast yet. Please use the voting widget above to cast your votes.")

FINAL VOTING RESULTS

Total votes cast: 10

Rankings:
  1. lora_model_merged: 7 votes (70.0%) â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ
  2. lora_model_merged_2: 2 votes (20.0%) â–ˆâ–ˆâ–ˆâ–ˆ
  3. lora_model_merged_3: 1 votes (10.0%) â–ˆâ–ˆ
  4. tie: 0 votes (0.0%) 

Winner: lora_model_merged with 7 votes (70.0%)

Results exported to: model_comparison_results.json


### View All Responses (Static)

Run this cell to see all responses in a static format (useful for reviewing without the interactive widget).

In [9]:
# Static display of all responses for comparison
from IPython.display import display, HTML

for i, prompt in enumerate(test_prompts):
    display(HTML(f"""
    <div style="border: 2px solid #333; border-radius: 10px; padding: 15px; margin: 20px 0;">
        <h3 style="background-color: #333; color: white; padding: 10px; margin: -15px -15px 15px -15px; border-radius: 8px 8px 0 0;">
            Prompt {i+1}: {prompt}
        </h3>
    """))

    model_names = list(models.keys())
    colors = [("#e3f2fd", "#1976d2"), ("#fff3e0", "#f57c00"), ("#e8f5e9", "#388e3c")]

    for j, model_name in enumerate(model_names):
        response = all_responses[prompt][model_name]
        bg_color, border_color = colors[j]
        display(HTML(f"""
        <div style="background-color: {bg_color}; padding: 12px; margin: 10px 0; border-left: 4px solid {border_color}; border-radius: 5px;">
            <strong style="color: {border_color};">{model_name}:</strong>
            <p style="margin: 8px 0 0 0; white-space: pre-wrap;">{response}</p>
        </div>
        """))

    display(HTML("</div>"))