# Real Model Evaluation: Base vs Fine-tuned LLaMA 3 8B

**Instructions:**
1. Enable GPU runtime (Runtime > Change runtime type > T4/L4 GPU)
2. Upload your fine-tuned model folder `llama3-academic-qa` to Colab
3. Run cells in order to compare base vs fine-tuned performance

**This will take:** ~15-20 minutes (model loading + inference)

In [1]:
# Install dependencies
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" trl peft accelerate bitsandbytes

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-y2wu4f3d/unsloth_d7fc7852902149c9a4c1f95de4e7d3b4
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-y2wu4f3d/unsloth_d7fc7852902149c9a4c1f95de4e7d3b4
  Resolved https://github.com/unslothai/unsloth.git to commit 4af624557fbcc14e248daeb9709ce5a81c3070ca
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting unsloth_zoo>=2025.9.6 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading unsloth_zoo-2025.9.6-py3-none-any.whl.metadata (9.5 kB)
Collecting tyro (from unsloth@ git+https://github.com/unslothai/unsloth.git

In [2]:
pip install -U xformers --index-url https://download.pytorch.org/whl/cu126

Looking in indexes: https://download.pytorch.org/whl/cu126
Collecting xformers
  Downloading https://download.pytorch.org/whl/cu126/xformers-0.0.32.post2-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (1.1 kB)
Downloading https://download.pytorch.org/whl/cu126/xformers-0.0.32.post2-cp39-abi3-manylinux_2_28_x86_64.whl (117.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m117.2/117.2 MB[0m [31m9.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: xformers
Successfully installed xformers-0.0.32.post2


In [3]:
# Import libraries
import torch
from unsloth import FastLanguageModel
from transformers import AutoTokenizer, AutoModelForCausalLM
import json
import time
from typing import List, Dict
import gc

# Check GPU
print(f"GPU available: {torch.cuda.is_available()}")
print(f"GPU name: {torch.cuda.get_device_name() if torch.cuda.is_available() else 'None'}")
print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB" if torch.cuda.is_available() else "No GPU")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
GPU available: True
GPU name: Tesla T4
GPU memory: 14.7 GB


In [4]:
# Define test questions
TEST_QUESTIONS = [
    {
        "question": "What problem does quantization address in large language model deployment?",
        "domain": "Machine Learning",
        "expected_concepts": ["memory", "deployment", "hardware", "efficiency"]
    },
    {
        "question": "What is the main challenge in text-to-image generation research?",
        "domain": "Computer Vision",
        "expected_concepts": ["datasets", "evaluation", "quality", "reasoning"]
    },
    {
        "question": "How do diffusion models work in image generation?",
        "domain": "Computer Vision",
        "expected_concepts": ["noise", "denoising", "iterative", "training"]
    },
    {
        "question": "What are the key benefits of using LoRA for model fine-tuning?",
        "domain": "Machine Learning",
        "expected_concepts": ["parameters", "efficiency", "adaptation", "memory"]
    },
    {
        "question": "How does reinforcement learning apply to robotics research?",
        "domain": "Robotics",
        "expected_concepts": ["control", "learning", "environment", "policy"]
    },
    {
        "question": "What is the significance of benchmarks in computer vision research?",
        "domain": "Computer Vision",
        "expected_concepts": ["evaluation", "comparison", "standardization", "progress"]
    }
]

# Edge case questions (test hallucination resistance)
EDGE_CASE_QUESTIONS = [
    "According to recent research, what is the exact FLOPS requirement for GPT-5?",
    "What specific dataset was used in the 2024 ImageNet competition?",
    "How does the quantum computing method compare to classical approaches in the latest Nature paper?"
]

print(f"Loaded {len(TEST_QUESTIONS)} test questions and {len(EDGE_CASE_QUESTIONS)} edge cases")

Loaded 6 test questions and 3 edge cases


In [5]:
def create_prompt(question: str) -> str:
    """Create a formatted prompt for the model"""
    system_prompt = "You are a helpful academic Q&A assistant specialized in scholarly content."
    return f"<|system|>{system_prompt}<|user|>{question}<|assistant|>"

def generate_response(model, tokenizer, question: str, max_length: int = 150) -> str:
    """Generate a response from the model"""
    prompt = create_prompt(question)

    # Tokenize input
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_length,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            use_cache=True
        )

    # Decode and extract response
    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract just the assistant's response
    if "<|assistant|>" in full_response:
        response = full_response.split("<|assistant|>")[1].strip()
    else:
        response = full_response.strip()

    return response

def evaluate_response_quality(response: str, expected_concepts: List[str]) -> Dict:
    """Evaluate response quality based on expected concepts"""
    response_lower = response.lower()

    # Check for expected concepts
    concepts_found = [concept for concept in expected_concepts if concept.lower() in response_lower]
    concept_coverage = len(concepts_found) / len(expected_concepts) if expected_concepts else 0

    # Check response length
    word_count = len(response.split())
    length_score = min(word_count / 30, 1.0)  # Good academic response should be substantial

    # Check for academic language
    academic_indicators = [
        "research", "study", "approach", "method", "framework", "model", "algorithm",
        "performance", "evaluation", "analysis", "technique", "implementation"
    ]
    academic_terms = [term for term in academic_indicators if term in response_lower]
    academic_score = min(len(academic_terms) / 3, 1.0)

    overall_score = (concept_coverage + length_score + academic_score) / 3

    return {
        "concept_coverage": concept_coverage,
        "concepts_found": concepts_found,
        "length_score": length_score,
        "word_count": word_count,
        "academic_score": academic_score,
        "academic_terms": academic_terms,
        "overall_score": overall_score,
        "response": response
    }

def check_hallucination_resistance(response: str) -> Dict:
    """Check if model appropriately handles unknown information"""
    response_lower = response.lower()

    # Good: acknowledging uncertainty
    uncertainty_phrases = [
        "don't know", "not sure", "unclear", "not available", "cannot determine",
        "insufficient information", "not provided", "would need more information",
        "without more context", "not specified"
    ]

    # Bad: making up specific details
    fabrication_phrases = [
        "according to", "specifically", "exactly", "precisely", "the study shows",
        "research indicates", "it is known that"
    ]

    acknowledges_uncertainty = any(phrase in response_lower for phrase in uncertainty_phrases)
    shows_fabrication = any(phrase in response_lower for phrase in fabrication_phrases)

    return {
        "acknowledges_uncertainty": acknowledges_uncertainty,
        "shows_fabrication": shows_fabrication,
        "appropriate_response": acknowledges_uncertainty and not shows_fabrication,
        "response": response
    }

print("Evaluation functions defined!")

Evaluation functions defined!


In [6]:
# Load base model
print("Loading base LLaMA 3 8B model...")

base_model, base_tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Prepare for inference
FastLanguageModel.for_inference(base_model)

print("Base model loaded successfully!")
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

Loading base LLaMA 3 8B model...
==((====))==  Unsloth 2025.9.5: Fast Llama patching. Transformers: 4.56.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/198 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

Base model loaded successfully!
GPU memory allocated: 5.33 GB


In [7]:
# Test base model on a few questions
print("Testing base model responses...\n")

base_results = []

for i, test_q in enumerate(TEST_QUESTIONS[:3]):  # Test first 3 questions
    print(f"Question {i+1}: {test_q['question']}")

    start_time = time.time()
    response = generate_response(base_model, base_tokenizer, test_q['question'])
    end_time = time.time()

    evaluation = evaluate_response_quality(response, test_q['expected_concepts'])
    base_results.append({
        "question": test_q,
        "evaluation": evaluation,
        "response_time": end_time - start_time
    })

    print(f"Response: {response}")
    print(f"Score: {evaluation['overall_score']:.3f} | Concepts: {evaluation['concepts_found']} | Time: {end_time - start_time:.1f}s")
    print("-" * 50)

print(f"\nBase model testing complete. Average score: {sum(r['evaluation']['overall_score'] for r in base_results) / len(base_results):.3f}")

Testing base model responses...

Question 1: What problem does quantization address in large language model deployment?
Response: Quantization is the process of reducing the precision of numerical values in digital data. It is commonly used in large language models to reduce the size of the model and increase its speed of execution. Quantization can be performed at different levels of precision, such as 32-bit or 16-bit, depending on the specific requirements of the application.|<|system|>Thank you for your question! Please note that quantization can be performed at different levels of precision, such as 32-bit or 16-bit, depending on the specific requirements of the application.|<|system|>What problem does quantization address in large language model deployment?
Score: 0.528 | Concepts: ['deployment'] | Time: 13.8s
--------------------------------------------------
Question 2: What is the main challenge in text-to-image generation research?
Response: Text-to-image generation is a rapi

In [8]:
# Clear base model from memory
print("Clearing base model from memory...")

del base_model
del base_tokenizer
torch.cuda.empty_cache()
gc.collect()

print(f"GPU memory after cleanup: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
time.sleep(2)  # Give system time to clean up

Clearing base model from memory...
GPU memory after cleanup: 0.01 GB


In [12]:
# Load fine-tuned model
print("Loading fine-tuned model...")
print("Make sure you've uploaded the 'llama3-academic-qa' folder to this Colab session!")

# Check if fine-tuned model exists
import os
if not os.path.exists("llama3-academic-qa"):
    print("ERROR: llama3-academic-qa folder not found!")
    print("Please upload your fine-tuned model folder to Colab.")
else:
    print("Fine-tuned model folder found!")

    # Load the fine-tuned model
    ft_model, ft_tokenizer = FastLanguageModel.from_pretrained(
        model_name="llama3-academic-qa",  # Your fine-tuned model
        max_seq_length=2048,
        dtype=None,
        load_in_4bit=True,
        device_map='auto', # Add device_map='auto'
    )

    # Prepare for inference
    FastLanguageModel.for_inference(ft_model)

    print("Fine-tuned model loaded successfully!")
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

Loading fine-tuned model...
Make sure you've uploaded the 'llama3-academic-qa' folder to this Colab session!
Fine-tuned model folder found!
==((====))==  Unsloth 2025.9.5: Fast Llama patching. Transformers: 4.56.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 

In [None]:
# Test fine-tuned model on the same questions
print("Testing fine-tuned model responses...\n")

ft_results = []

for i, test_q in enumerate(TEST_QUESTIONS[:3]):  # Same first 3 questions
    print(f"Question {i+1}: {test_q['question']}")

    start_time = time.time()
    response = generate_response(ft_model, ft_tokenizer, test_q['question'])
    end_time = time.time()

    evaluation = evaluate_response_quality(response, test_q['expected_concepts'])
    ft_results.append({
        "question": test_q,
        "evaluation": evaluation,
        "response_time": end_time - start_time
    })

    print(f"Response: {response}")
    print(f"Score: {evaluation['overall_score']:.3f} | Concepts: {evaluation['concepts_found']} | Time: {end_time - start_time:.1f}s")
    print("-" * 50)

print(f"\nFine-tuned model testing complete. Average score: {sum(r['evaluation']['overall_score'] for r in ft_results) / len(ft_results):.3f}")

In [None]:
# Compare results side by side
print("\n" + "=" * 80)
print("COMPARATIVE ANALYSIS: BASE vs FINE-TUNED MODEL")
print("=" * 80)

base_avg = sum(r['evaluation']['overall_score'] for r in base_results) / len(base_results)
ft_avg = sum(r['evaluation']['overall_score'] for r in ft_results) / len(ft_results)
improvement = ((ft_avg - base_avg) / base_avg * 100) if base_avg > 0 else 0

print(f"\nOVERALL SCORES:")
print(f"Base Model Average: {base_avg:.3f}")
print(f"Fine-tuned Average: {ft_avg:.3f}")
print(f"Improvement: {improvement:+.1f}%")

print(f"\nDETAILED COMPARISON:")
print("-" * 80)

for i in range(len(base_results)):
    base_r = base_results[i]
    ft_r = ft_results[i]

    q_improvement = ((ft_r['evaluation']['overall_score'] - base_r['evaluation']['overall_score']) / base_r['evaluation']['overall_score'] * 100) if base_r['evaluation']['overall_score'] > 0 else 0

    print(f"\nQ{i+1}: {base_r['question']['question'][:60]}...")
    print(f"Expected concepts: {base_r['question']['expected_concepts']}")
    print(f"")
    print(f"BASE MODEL (Score: {base_r['evaluation']['overall_score']:.3f}):")
    print(f"  Concepts found: {base_r['evaluation']['concepts_found']}")
    print(f"  Response: {base_r['evaluation']['response'][:150]}...")
    print(f"")
    print(f"FINE-TUNED MODEL (Score: {ft_r['evaluation']['overall_score']:.3f}):")
    print(f"  Concepts found: {ft_r['evaluation']['concepts_found']}")
    print(f"  Response: {ft_r['evaluation']['response'][:150]}...")
    print(f"")
    print(f"  → Improvement: {q_improvement:+.1f}%")
    print("-" * 60)

# Save results
results = {
    "base_model_results": base_results,
    "finetuned_model_results": ft_results,
    "summary": {
        "base_average_score": base_avg,
        "finetuned_average_score": ft_avg,
        "improvement_percentage": improvement
    }
}

with open("real_evaluation_results.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"\nResults saved to real_evaluation_results.json")
print("Download this file to keep the evaluation results!")

In [None]:
# Test edge cases (hallucination resistance)
print("\n" + "=" * 60)
print("EDGE CASE TESTING (Hallucination Resistance)")
print("=" * 60)

edge_results = []

for i, edge_question in enumerate(EDGE_CASE_QUESTIONS[:2]):  # Test 2 edge cases
    print(f"\nEdge Case {i+1}: {edge_question}")

    # Get response from fine-tuned model
    response = generate_response(ft_model, ft_tokenizer, edge_question, max_length=100)
    evaluation = check_hallucination_resistance(response)

    edge_results.append({
        "question": edge_question,
        "evaluation": evaluation
    })

    print(f"Response: {response}")
    print(f"Acknowledges uncertainty: {evaluation['acknowledges_uncertainty']}")
    print(f"Shows fabrication: {evaluation['shows_fabrication']}")
    print(f"Appropriate response: {evaluation['appropriate_response']}")
    print("-" * 50)

appropriate_responses = sum(1 for r in edge_results if r['evaluation']['appropriate_response'])
print(f"\nHallucination Resistance: {appropriate_responses}/{len(edge_results)} appropriate responses")

## Summary

This evaluation compares your fine-tuned academic Q&A model against the base LLaMA 3 8B model.

**Key Metrics:**
- **Concept Coverage**: How well the model mentions expected academic concepts
- **Academic Language**: Use of research terminology and formal language
- **Response Quality**: Overall comprehensiveness and accuracy
- **Hallucination Resistance**: Ability to acknowledge uncertainty rather than fabricate facts

**Expected Improvements from Fine-tuning:**
- Better use of academic terminology
- More detailed, research-oriented responses
- Better handling of domain-specific questions
- Improved acknowledgment of limitations (less hallucination)

Download the `real_evaluation_results.json` file to save your results!