# Model Evaluation: Base vs Fine-tuned Qwen2.5-3B
This notebook compares the base Qwen2.5-3B-Instruct model with its QLoRA fine-tuned version (`Qwen2.5-3B-qlora-finetuned`) on 10 academic Q&A test questions.

## 1. Load Test Questions

In [1]:
import json
import gc
import torch
from unsloth import FastLanguageModel
from peft import PeftModel

# Load test questions
with open('test_evaluation.json', 'r', encoding='utf-8') as f:
    test_data = json.load(f)

print(f"Loaded {len(test_data)} test questions")
for i, item in enumerate(test_data[:3], 1):
    print(f"Q{i}: {item['question'][:100]}...")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Your Flash Attention 2 installation seems to be broken?
A possible explanation is you have a new CUDA version which isn't
yet compatible with FA2? Please file a ticket to Unsloth or FA2.
We shall now use Xformers instead, which does not have any performance hits!
We found this negligible impact by benchmarking on 1x A100.
🦥 Unsloth Zoo will now patch everything to make training faster!
Loaded 10 test questions
Q1: What is HiFi, and what does it aim to achieve in complexresidual vector quantization?...
Q2: What is PARCO, and what does it aim to achieve in Chinese AISHELL-1 and WER of 11?...
Q3: What is CMRAG, and what does it aim to achieve in processing?...
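
Later cells read both `item['question']` and `item['answer']`, so a malformed entry in `test_evaluation.json` would only surface partway through the run. The following is a minimal sketch of an up-front check; the helper name `validate_items` and the sample records are hypothetical and not part of the original notebook.

```python
REQUIRED_KEYS = ("question", "answer")  # keys the later cells index into


def validate_items(items):
    """Return (index, problem) tuples for entries that would break later cells.

    Hypothetical helper: it only checks that each entry is a dict with
    non-empty string values for the required keys.
    """
    problems = []
    for idx, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append((idx, "not an object"))
            continue
        for key in REQUIRED_KEYS:
            value = item.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((idx, f"missing or empty '{key}'"))
    return problems


# Illustrative records only, not taken from test_evaluation.json
sample = [
    {"question": "What is X?", "answer": "X is a method."},
    {"question": "   "},
]
issues = validate_items(sample)
print(issues)
```

In the notebook this could be called as `validate_items(test_data)` right after `json.load`, stopping early if the returned list is non-empty.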


## 2. Test Base Model (Qwen2.5-3B-Instruct)

In [2]:
# Load base model
base_model_name = "unsloth/Qwen2.5-3B-Instruct"
print(f"Loading base model: {base_model_name}")

base_model, base_tokenizer = FastLanguageModel.from_pretrained(
    base_model_name,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True
)

# Enable inference mode
FastLanguageModel.for_inference(base_model)

print("Base model loaded successfully!")

Loading base model: unsloth/Qwen2.5-3B-Instruct
==((====))==  Unsloth 2025.9.4: Fast Qwen2 patching. Transformers: 4.56.1.
   \\   /|    inference-ai GPU cuda. Num GPUs = 1. Max memory: 15.836 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.6. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Base model loaded successfully!


In [3]:
# Test base model on all questions
system_prompt = "You are a helpful academic Q&A assistant specialized in scholarly content."
base_answers = []

print("Testing base model...")
for i, item in enumerate(test_data, 1):
    question = item['question']
    
    # Format prompt
    prompt = f"<|system|>{system_prompt}<|user|>{question}<|assistant|>"
    
    # Tokenize and generate
    inputs = base_tokenizer(prompt, return_tensors="pt").to(base_model.device)
    
    with torch.no_grad():
        outputs = base_model.generate(
            **inputs,
            max_new_tokens=150,
            do_sample=False,  # Greedy decoding: deterministic, so no temperature is needed
            pad_token_id=base_tokenizer.eos_token_id
        )
    
    # Decode answer
    full_response = base_tokenizer.decode(outputs[0], skip_special_tokens=True)
    answer = full_response.split('<|assistant|>')[-1].strip()
    
    base_answers.append(answer)
    print(f"Q{i} completed")

print(f"Base model testing completed. Generated {len(base_answers)} answers.")

Testing base model...
Q1 completed
Q2 completed
Q3 completed
Q4 completed
Q5 completed
Q6 completed
Q7 completed
Q8 completed
Q9 completed
Q10 completed
Base model testing completed. Generated 10 answers.
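
The prompt above uses `<|system|>`, `<|user|>` and `<|assistant|>` markers, whereas Qwen2.5-Instruct models are trained on the ChatML layout with `<|im_start|>` and `<|im_end|>` tokens. The sketch below builds that layout by hand so the difference is visible. The function name `build_chatml_prompt` is hypothetical, and the notebook does not say which format the LoRA adapters were trained with, so treat this as an option to check against the training setup rather than a required change. In practice the tokenizer's `apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is the usual way to get the same string.

```python
def build_chatml_prompt(system_prompt: str, question: str) -> str:
    """Format one system + user turn in ChatML and open the assistant turn.

    Hypothetical helper, shown only to illustrate the layout.
    """
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


example_prompt = build_chatml_prompt(
    "You are a helpful academic Q&A assistant specialized in scholarly content.",
    "What is residual vector quantization?",
)
print(example_prompt)
```

Whichever format is chosen, both models should receive the same one so the comparison stays fair.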


In [4]:
# Display base model answers
print("=== BASE MODEL ANSWERS ===")
for i, (item, answer) in enumerate(zip(test_data, base_answers), 1):
    print(f"\nQ{i}: {item['question']}")
    print(f"Base Answer: {answer}")
    print("-" * 80)

=== BASE MODEL ANSWERS ===

Q1: What is HiFi, and what does it aim to achieve in complexresidual vector quantization?
Base Answer: HiFi (High-Fidelity) is a method or framework used in the field of signal processing, particularly in audio compression and reconstruction. In the context of complex residual vector quantization (CRVQ), HiFi aims to improve the quality of reconstructed signals by addressing issues such as distortion and noise that can occur during the compression and decompression processes.

In CRVQ, the goal is to efficiently encode and decode audio data using a residual vector quantization approach. However, traditional CRVQ methods often struggle with high-frequency components and other fine details in the audio signal, leading to artifacts and reduced fidelity. HiFi techniques are designed to mitigate these problems by incorporating advanced signal processing methods that enhance the fidelity of the reconstructed signal.

Some key aspects of
---------------------------

In [5]:
# Clean up base model from memory
del base_model
del base_tokenizer
gc.collect()  # Release unreachable Python objects first
torch.cuda.empty_cache()  # Then free the cached CUDA memory they were holding
print("Base model cleared from memory")

Base model cleared from memory


## 3. Test Fine-tuned Model (Qwen2.5-3B-qlora-finetuned)

In [6]:
# Load fine-tuned model with PEFT adapters
base_model_name = "unsloth/Qwen2.5-3B-Instruct"
ft_adapter_path = "Qwen2.5-3B-qlora-finetuned"
print(f"Loading base model: {base_model_name}")
print(f"Loading LoRA adapters from: {ft_adapter_path}")

try:
    # First load the base model
    ft_base_model, ft_tokenizer = FastLanguageModel.from_pretrained(
        base_model_name,
        max_seq_length=2048,
        dtype=None,
        load_in_4bit=True
    )
    
    # Then load and apply the LoRA adapters using PEFT
    try:
        ft_model = PeftModel.from_pretrained(ft_base_model, ft_adapter_path)
        print("✅ LoRA adapters loaded successfully!")
        
        # Enable inference mode
        FastLanguageModel.for_inference(ft_model)
        
        print("Fine-tuned model ready for inference!")
        
    except Exception as peft_error:
        print(f"❌ Error loading LoRA adapters: {peft_error}")
        print("Trying alternative loading method...")
        
        # Alternative: Try loading directly with Unsloth (if adapters were saved differently)
        try:
            ft_model, ft_tokenizer = FastLanguageModel.from_pretrained(
                ft_adapter_path,
                max_seq_length=2048,
                dtype=None,
                load_in_4bit=True
            )
            FastLanguageModel.for_inference(ft_model)
            print("✅ Model loaded with alternative method!")
        except Exception as alt_error:
            print(f"❌ Alternative loading failed: {alt_error}")
            ft_model = None
    
except Exception as e:
    print(f"❌ Error loading base model: {e}")
    print("Make sure the base model is available and the adapter directory exists.")
    ft_model = None
    ft_tokenizer = None

Loading base model: unsloth/Qwen2.5-3B-Instruct
Loading LoRA adapters from: Qwen2.5-3B-qlora-finetuned
==((====))==  Unsloth 2025.9.4: Fast Qwen2 patching. Transformers: 4.56.1.
   \\   /|    inference-ai GPU cuda. Num GPUs = 1. Max memory: 15.836 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.6. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
✅ LoRA adapters loaded successfully!
Fine-tuned model ready for inference!


In [7]:
# Test fine-tuned model on all questions
ft_answers = []

if ft_model is not None:
    print("Testing fine-tuned model...")
    for i, item in enumerate(test_data, 1):
        question = item['question']
        
        # Format prompt (same as base model)
        prompt = f"<|system|>{system_prompt}<|user|>{question}<|assistant|>"
        
        # Tokenize and generate
        inputs = ft_tokenizer(prompt, return_tensors="pt").to(ft_model.device)
        
        with torch.no_grad():
            outputs = ft_model.generate(
                **inputs,
                max_new_tokens=150,
                do_sample=False,  # Greedy decoding: deterministic, so no temperature is needed
                pad_token_id=ft_tokenizer.eos_token_id
            )
        
        # Decode answer
        full_response = ft_tokenizer.decode(outputs[0], skip_special_tokens=True)
        answer = full_response.split('<|assistant|>')[-1].strip()
        
        ft_answers.append(answer)
        print(f"Q{i} completed")
    
    print(f"Fine-tuned model testing completed. Generated {len(ft_answers)} answers.")
else:
    print("Skipping fine-tuned model testing due to loading error.")
    ft_answers = ["Model not available"] * len(test_data)

Testing fine-tuned model...
Q1 completed
Q2 completed
Q3 completed
Q4 completed
Q5 completed
Q6 completed
Q7 completed
Q8 completed
Q9 completed
Q10 completed
Fine-tuned model testing completed. Generated 10 answers.


In [8]:
# Display fine-tuned model answers
if ft_model is not None:
    print("=== FINE-TUNED MODEL ANSWERS ===")
    for i, (item, answer) in enumerate(zip(test_data, ft_answers), 1):
        print(f"\nQ{i}: {item['question']}")
        print(f"Fine-tuned Answer: {answer}")
        print("-" * 80)

=== FINE-TUNED MODEL ANSWERS ===

Q1: What is HiFi, and what does it aim to achieve in complexresidual vector quantization?
Fine-tuned Answer: In this text, we introduce a novel approach for complex residual vectorquantization (CRVQ) that leverages the power of the Transformer model. Ourmethod, called Transformer-based CRVQ (TCR), first applies a Transformerencoder to encode the input data into a high-dimensional feature space. Then, itapplies a quantization function to map these features into a lower-dimensionalvector space, effectively reducing the dimensionality of the representation. Theresulting vectors are then used as indices to retrieve the closest codebookentries, which are finally used to reconstruct the original data. We show thatour method outperforms existing methods in terms of both reconstruction qualityand computational efficiency. In particular, TCR achieves a PSNR value of 4
--------------------------------------------------------------------------------

Q2: What is 

In [9]:
# Clean up fine-tuned model from memory
if ft_model is not None:
    del ft_model
    if 'ft_base_model' in locals():
        del ft_base_model
    del ft_tokenizer
gc.collect()  # Release unreachable Python objects first
torch.cuda.empty_cache()  # Then free the cached CUDA memory they were holding
print("Fine-tuned model cleared from memory")

Fine-tuned model cleared from memory


## 4. Comparison and Analysis

In [10]:
# Compare answers and analyze results
print("=== MODEL COMPARISON ANALYSIS ===")
print("\n" + "=" * 100)

comparison_results = []

for i, (item, base_ans, ft_ans) in enumerate(zip(test_data, base_answers, ft_answers), 1):
    question = item['question']
    expected = item['answer']
    
    print(f"\n🔍 QUESTION {i}:")
    print(f"Q: {question}")
    print(f"\n📚 EXPECTED ANSWER:")
    print(f"   {expected}")
    print(f"\n🤖 BASE MODEL:")
    print(f"   {base_ans}")
    print(f"\n🎯 FINE-TUNED MODEL:")
    print(f"   {ft_ans}")
    
    # Simple lexical evaluation: the fraction of the expected answer's unique,
    # lowercased, whitespace-split words that also appear in each model answer
    expected_words = set(expected.lower().split())
    base_words = set(base_ans.lower().split())
    ft_words = set(ft_ans.lower().split())
    
    base_overlap = len(expected_words.intersection(base_words)) / max(len(expected_words), 1)
    ft_overlap = len(expected_words.intersection(ft_words)) / max(len(expected_words), 1)
    
    # Determine which is better
    if ft_ans == "Model not available":
        better_model = "N/A - Fine-tuned model not loaded"
        improvement = "N/A"
    elif ft_overlap > base_overlap:
        better_model = "Fine-tuned"
        improvement = "✅ Better keyword overlap with expected answer"
    elif base_overlap > ft_overlap:
        better_model = "Base"
        improvement = "❌ Base model had better keyword overlap"
    else:
        better_model = "Similar"
        improvement = "➖ Similar performance"
    
    print(f"\n📊 ANALYSIS:")
    print(f"   Base overlap: {base_overlap:.2f}, Fine-tuned overlap: {ft_overlap:.2f}")
    print(f"   Better model: {better_model}")
    print(f"   Assessment: {improvement}")
    
    # Store results
    comparison_results.append({
        "question_id": i,
        "question": question,
        "expected_answer": expected,
        "base_answer": base_ans,
        "finetuned_answer": ft_ans,
        "base_word_overlap": round(base_overlap, 3),
        "finetuned_word_overlap": round(ft_overlap, 3),
        "better_model": better_model,
        "improvement_note": improvement
    })
    
    print("\n" + "-" * 100)

print(f"\n📈 SUMMARY:")
if ft_answers and ft_answers[0] != "Model not available":
    ft_better = sum(1 for r in comparison_results if r['better_model'] == 'Fine-tuned')
    base_better = sum(1 for r in comparison_results if r['better_model'] == 'Base')
    similar = sum(1 for r in comparison_results if r['better_model'] == 'Similar')
    
    print(f"Fine-tuned better: {ft_better}/{len(comparison_results)}")
    print(f"Base better: {base_better}/{len(comparison_results)}")
    print(f"Similar: {similar}/{len(comparison_results)}")
else:
    print("Fine-tuned model was not available for comparison.")

=== MODEL COMPARISON ANALYSIS ===


🔍 QUESTION 1:
Q: What is HiFi, and what does it aim to achieve in complexresidual vector quantization?

📚 EXPECTED ANSWER:
   We present a neural speech codec that challenges the need for complexresidual vector quantization (RVQ) stacks by introducing a simpler,single-stage quantization approach.

🤖 BASE MODEL:
   HiFi (High-Fidelity) is a method or framework used in the field of signal processing, particularly in audio compression and reconstruction. In the context of complex residual vector quantization (CRVQ), HiFi aims to improve the quality of reconstructed signals by addressing issues such as distortion and noise that can occur during the compression and decompression processes.

In CRVQ, the goal is to efficiently encode and decode audio data using a residual vector quantization approach. However, traditional CRVQ methods often struggle with high-frequency components and other fine details in the audio signal, leading to artifacts and reduced 
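
The overlap score in the cell above is a recall over unique whitespace-split words. Punctuation stays attached to tokens (`quantization` and `quantization.` count as different words), and a longer answer is never penalized for extra text. One possible refinement, sketched below, is a token-level F1 over normalized tokens. The names `tokenize` and `token_f1` are hypothetical, and this is a swapped-in metric, not the one that produced the results in this notebook.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep only alphanumeric runs, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred = Counter(tokenize(prediction))
    ref = Counter(tokenize(reference))
    common = sum((pred & ref).values())  # multiset intersection size
    if common == 0:
        return 0.0
    precision = common / sum(pred.values())
    recall = common / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative strings only
score = token_f1(
    "A simpler single-stage codec.",
    "a simpler, single-stage quantization approach",
)
print(round(score, 3))
```

This is still a lexical measure: a fluent but factually wrong answer can score well if it reuses the reference vocabulary.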

## 5. Save Results to JSON

In [11]:
# Save evaluation results to JSON file
from datetime import date

eval_results = {
    "evaluation_metadata": {
        "base_model": "unsloth/Qwen2.5-3B-Instruct",
        "finetuned_model": "Qwen2.5-3B-qlora-finetuned",
        "num_questions": len(test_data),
        "evaluation_date": date.today().isoformat()  # Actual run date instead of a hard-coded year
    },
    "results": comparison_results
}

# Save to file
with open('eval_results.json', 'w', encoding='utf-8') as f:
    json.dump(eval_results, f, indent=2, ensure_ascii=False)

print("✅ Evaluation results saved to 'eval_results.json'")
print(f"📁 Results contain {len(comparison_results)} question comparisons")

✅ Evaluation results saved to 'eval_results.json'
📁 Results contain 10 question comparisons
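
For later analysis, the saved `results` list can be reduced to a few aggregate numbers. The sketch below assumes the record fields written by the cell above (`better_model`, `base_word_overlap`, `finetuned_word_overlap`). The function name `summarize` is hypothetical, and the demo records are made up for illustration; they are not this notebook's results.

```python
from collections import Counter
from statistics import mean


def summarize(results):
    """Count per-question winners and average the two overlap scores."""
    if not results:
        return {"wins": {}, "mean_base_overlap": None, "mean_finetuned_overlap": None}
    wins = Counter(r["better_model"] for r in results)
    return {
        "wins": dict(wins),
        "mean_base_overlap": round(mean(r["base_word_overlap"] for r in results), 3),
        "mean_finetuned_overlap": round(
            mean(r["finetuned_word_overlap"] for r in results), 3
        ),
    }


# Made-up records for illustration only
demo_results = [
    {"better_model": "Fine-tuned", "base_word_overlap": 0.2, "finetuned_word_overlap": 0.3},
    {"better_model": "Base", "base_word_overlap": 0.4, "finetuned_word_overlap": 0.1},
]
summary = summarize(demo_results)
print(summary)
```

With the real file, the input would be `json.load(open('eval_results.json', encoding='utf-8'))["results"]`.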


## 6. Conclusion

This notebook compared the base Qwen2.5-3B-Instruct model with the fine-tuned version on academic Q&A tasks. It covered:

1. **Loading and testing both models** on 10 academic questions
2. **Memory management** by properly unloading models after testing
3. **Comparative analysis** using keyword overlap with expected answers
4. **Structured output** saved to JSON for further analysis

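Fine-tuning on domain-specific Q&A pairs was intended to make the model's answers to academic questions more accurate and relevant. Keyword overlap measures only lexical similarity to the expected answer, not factual correctness, so the per-question answers should also be reviewed by hand.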