## Introduction

In this example, you'll explore how to evaluate LLMs.

### Note on Hugging Face

Many models require you to install the Hugging Face CLI tool (run from a terminal) and authenticate it with your Hugging Face account. Please do not sign up with privacy-sensitive (student) information.

https://huggingface.co/docs/huggingface_hub/main/en/guides/cli

## Setup <a id="setup"></a>

Start by importing all the necessary packages. If an import fails with `ModuleNotFoundError`, install the missing package with `pip install` and run the cell again.

In [12]:
import os
import re
import torch
import gc
import pandas as pd
from tqdm import tqdm
import sys
sys.path.append('..')
from datasets import load_dataset
from typing import List, Optional, Dict, Any
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import HfApi

# Suppress all warnings
import warnings
warnings.filterwarnings('ignore')
print("All warnings suppressed.")
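If you would rather see all missing packages at once instead of fixing import errors one by one, a small check like the one below can help. This is a minimal sketch, not part of the original notebook: the `REQUIRED` list and the `missing_packages` helper are names introduced here for illustration. `accelerate` is included because the traceback later in this notebook shows `transformers` failing to find it while loading a model.

```python
import importlib.util

# Top-level import names used in this notebook (assumed list, adjust as needed).
# `accelerate` is not imported directly, but transformers looks for it when
# a model is loaded with device_map="auto".
REQUIRED = ["torch", "pandas", "tqdm", "datasets", "transformers",
            "huggingface_hub", "accelerate"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```

`importlib.util.find_spec` only looks the package up; it does not import it, so the check stays fast even for heavy packages such as `torch`.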



The following cell defines functions to make everything easier. Just run it for now and don't worry about understanding it all.

In [13]:
# =============================================================================
# HUGGING FACE UTILITIES
# =============================================================================

def validate_token():
    """
    Validate the current Hugging Face token and display user information.
    
    This function should be called after setting up your token to ensure
    it was configured correctly.
    """
    try:
        # Try to get token from environment variable first
        token = None
        for env_var in ['HUGGINGFACE_HUB_TOKEN', 'HF_TOKEN']:
            token = os.environ.get(env_var)
            if token:
                break
        
        if token:
            # Set the token for the API
            api = HfApi(token=token)
        else:
            # Try without explicit token (maybe already logged in)
            api = HfApi()
            
        user_info = api.whoami()
        print(f"✅ Token validated successfully!")
        print(f"   Logged in as: {user_info['name']}")
        print(f"   Token type: {user_info.get('type', 'unknown')}")
        return True
    except Exception as e:
        print(f"❌ Token validation failed.")
        print(f"   Error: {e}")
        print("\n💡 Please check that:")
        print("   1. Your token is correctly set")
        print("   2. Your token has the necessary permissions") 
        print("   3. You have access to the model you are trying to load (some are gated)")
        print("   4. Try running: huggingface-cli login")
        return False

# =============================================================================
# MODEL SERVICE CLASS
# =============================================================================

class ServeLLM:
    """
    A service class for loading and running language models with proper memory management.
    """
    
    def __init__(self, model_name: str, device: str = "auto"):
        """
        Initialize the ServeLLM instance.
        
        Args:
            model_name (str): Name/path of the model to load
            device (str): Device to load model on ('auto', 'cuda', 'cpu')
        """
        self.model_name = model_name
        self.device = self._determine_device(device)
        self.tokenizer = None
        self.model = None
        self._load_model()
    
    def _determine_device(self, device: str) -> str:
        """Determine the best device to use."""
        if device == "auto":
            # torch.cuda covers both CUDA (NVIDIA) and ROCm (AMD) builds
            if torch.cuda.is_available():
                return "cuda"
            # Apple Silicon GPUs via Metal (MPS backend)
            elif torch.backends.mps.is_available():
                return "mps"
            # No GPU backend available: fall back to the CPU
            return "cpu"
        return device
    
    def _load_model(self):
        """Load the tokenizer and model."""
        try:
            print(f"Loading {self.model_name}...")
            
            # Load tokenizer
            self.tokenizer = AutoTokenizer.from_pretrained(
                self.model_name,
                trust_remote_code=True,
              #  local_files_only=True
            )
            
            # Add padding token if not present
            if self.tokenizer.pad_token is None:
                self.tokenizer.pad_token = self.tokenizer.eos_token
            
            # Load model with appropriate settings
            model_kwargs = {
                "trust_remote_code": True,
                "dtype": torch.float16 if self.device == "cuda" else torch.float32,
               #"local_files_only": True
            }
            if self.device == "mps":
                model_kwargs["low_cpu_mem_usage"] = True
            if self.device == "cuda":
                model_kwargs["device_map"] = "auto"
            
            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_name,
                **model_kwargs
            )
            
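            # With device_map="auto" (CUDA) the model is already placed;
            # only move it explicitly for CPU and MPS.
            if self.device in ("cpu", "mps"):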
                self.model = self.model.to(self.device)
                
            print(f"✅ Model loaded successfully on {self.device}")
            
        except Exception as e:
            print(f"❌ Error loading model: {e}")
            raise
    
    def generate_response(
        self, 
        prompt: str, 
        max_tokens: int = 512,
        temperature: float = 0.7,
        top_p: float = 0.9,
        do_sample: bool = True, 
    ) -> str:
        """
        Generate a response to the given prompt.
        
        Args:
            prompt (str): Input prompt
            max_tokens (int): Maximum tokens to generate
            temperature (float): Sampling temperature
            top_p (float): Top-p sampling parameter
            do_sample (bool): Whether to use sampling
            
        Returns:
            str: Generated response
        """
        if self.model is None or self.tokenizer is None:
            raise RuntimeError("Model not loaded. Call _load_model() first.")
        
        try:
            # Tokenize input
            inputs = self.tokenizer(
                prompt, 
                return_tensors="pt",
                truncation=True,
                max_length=2048
            ).to(self.device)
            
            # Generate
            with torch.no_grad():
                outputs = self.model.generate(
                    inputs.input_ids,
                    attention_mask=inputs.attention_mask,
                    max_new_tokens=max_tokens,
                    temperature=temperature,
                    top_p=top_p,
                    do_sample=do_sample,
                    pad_token_id=self.tokenizer.eos_token_id,
                    eos_token_id=self.tokenizer.eos_token_id,
                )
            
            # Decode response (exclude input tokens)
            response_tokens = outputs[0][inputs.input_ids.shape[1]:]
            response = self.tokenizer.decode(response_tokens, skip_special_tokens=True)
            
            return response.strip()
            
        except Exception as e:
            print(f"❌ Error generating response: {e}")
            return f"Error: {str(e)}"
    
    def cleanup(self):
        """Clean up model and free GPU memory."""
        if self.model is not None:
            del self.model
            self.model = None
        
        if self.tokenizer is not None:
            del self.tokenizer
            self.tokenizer = None
        
        # Clear GPU cache
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        if torch.backends.mps.is_available():
            torch.mps.empty_cache()
        # Force garbage collection
        gc.collect()
        
        print("🧹 Model cleaned up and memory freed")
    
    def __enter__(self):
        """Context manager entry."""
        return self
    
    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit with automatic cleanup."""
        self.cleanup()
    
    @staticmethod
    def cleanup_all():
        """Static method to clean up GPU memory."""
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        if torch.backends.mps.is_available():
            torch.mps.empty_cache()
        gc.collect()
        print("🧹 All GPU memory cleaned up")

# =============================================================================
# DISPLAY UTILITIES
# =============================================================================

def display_section_header(title: str, level: int = 1):
    """
    Display a section header with appropriate formatting.
    
    Args:
        title (str): Section title
        level (int): Header level (1, 2, or 3)
    """
    if level == 1:
        print(f"\n{'='*60}")
        print(f"{title.upper()}")
        print('='*60)
    elif level == 2:
        print(f"\n{'-'*40}")
        print(f"{title}")
        print('-'*40)
    else:
        print(f"\n{title}")
        print('·'*len(title))

def display_warning(message: str):
    """Display a warning message in a prominent way."""
    print("⚠️  WARNING:", message)

def display_success(message: str):
    """Display a success message."""
    print("✅", message)

def display_info(message: str):
    """Display an info message."""
    print("ℹ️", message)

Qwen3 technical report 
https://arxiv.org/pdf/2505.09388

If your hardware can run 7B models, try the new OLMo models! They claim to be very open and ethical.
https://huggingface.co/allenai/collections

In [14]:
# Model definitions
# Feel free to add more models or test your own.
small_models = ["qwen/qwen3-0.6b-base",
                "qwen/qwen3-0.6b", 
                #"qwen/qwen3-4b"
                ]
#https://huggingface.co/meta-llama/Llama-Guard-3-8B
llama_guard = "meta-llama/Llama-Guard-3-8B"

## Example Prompts <a id="exampleprompts"></a>

Here are a few selected problems of varying complexity to test different aspects of mathematical reasoning.

In [15]:
# Test prompts for model comparison
TEST_PROMPTS = [
    "What is the area of a rectangle with a length of 8 units and a width of 5 units?",
    "Solve: 2x + 3 = 7",
    "What is the derivative of sin(x)?"
]

# Expected key information in correct answers
EXPECTED_KEYWORDS = [
    "40",      # 8 * 5 = 40
    "x = 2",   # 2x + 3 = 7 → 2x = 4 → x = 2
    "cos(x)"   # derivative of sin(x) is cos(x)
]

print("Test prompts defined:")
for i, prompt in enumerate(TEST_PROMPTS):
    print(f"{i+1}. {prompt}")

Test prompts defined:
1. What is the area of a rectangle with a length of 8 units and a width of 5 units?
2. Solve: 2x + 3 = 7
3. What is the derivative of sin(x)?


## Processing Function <a id="processing"></a>

### Exercise 1: Implement a function to get responses from the model

The `ServeLLM` class is a wrapper we've created for you in `utils.py` to simplify model loading and inference. It handles GPU memory management, model initialization, and provides clean methods like `generate_response()`. 
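The reason `ServeLLM` is used in a `with` block is that `__exit__` calls `cleanup()` even when an error occurs inside the block. The sketch below shows that pattern with `FakeLLM`, a hypothetical stand-in invented for this illustration: it downloads nothing and simply echoes the prompt, so it runs on any machine.

```python
import gc


class FakeLLM:
    """Hypothetical stand-in for ServeLLM; no real model is loaded."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.model = object()  # placeholder for the loaded weights

    def generate_response(self, prompt: str) -> str:
        if self.model is None:
            raise RuntimeError("Model not loaded.")
        return f"[{self.model_name}] echo: {prompt}"

    def cleanup(self):
        # Drop the reference so the memory can be reclaimed
        self.model = None
        gc.collect()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.cleanup()
        return False  # do not swallow exceptions raised inside the block


with FakeLLM("demo-model") as llm:
    answer = llm.generate_response("Solve: 2x + 3 = 7")

print(answer)
print("Released after the block:", llm.model is None)
```

Because cleanup is tied to leaving the block, you cannot forget to free memory before the next model is loaded.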


In [None]:
def process_prompts(model_name, prompts):
    """
    Process a list of prompts with a given model and return responses.
    """
    
    results = []
    # You can change device="auto" to cuda, cpu or mps to manually select where to do the compute
    with ServeLLM(model_name, device="auto") as llm:
        for i, prompt in enumerate(prompts):

            response = llm.generate_response(prompt)
            print(f"Processed prompt {i + 1} of {len(prompts)}")
            results.append(response)
    
    return results

Now you can evaluate all models.


In [17]:
eval_results = []
for model in small_models:
    # Clean up memory before loading the next model
    ServeLLM.cleanup_all()
    
    print(f"PROCESSING {model}")
    print("="*50)
    
    eval_results.append(process_prompts(model, TEST_PROMPTS))

# Display results for every model, not only the last one
for model, responses in zip(small_models, eval_results):
    print(f"MODEL: {model}")
    for prompt, response in zip(TEST_PROMPTS, responses):
        print(f"Prompt: {prompt}")
        print(f"Response: {response}")
        print("---")

PROCESSING qwen/qwen3-0.6b-base
Loading qwen/qwen3-0.6b-base...


Traceback (most recent call last):
  File "C:\Users\jackv\AppData\Local\pypoetry\Cache\virtualenvs\rottermaatje-UFzBPA2d-py3.12\Lib\site-packages\transformers\modeling_utils.py", line 4805, in from_pretrained
    is_accelerate_available()
  File "C:\Users\jackv\AppData\Local\pypoetry\Cache\virtualenvs\rottermaatje-UFzBPA2d-py3.12\Lib\site-packages\transformers\utils\import_utils.py", line 108, in _is_package_available
    return importlib.util.find_spec(pkg_name) is not None
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jackv\AppData\Local\Programs\Python\Python312\Lib\importlib\util.py", line 108, in find_spec
    parent = __import__(parent_name, fromlist=['__path__'])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named 'accelerate'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
  File "<stdin>", line 8, in process_

### Exercise 2: Implement a function to evaluate the responses

Now that you have the responses, you need to evaluate them. For this simple case, you'll check if the expected keywords are present in the model's output. This is a basic form of evaluation, but it's a good starting point.

In [None]:
def evaluate_responses(responses, expected_keywords):
    """
    Evaluate model responses based on expected keywords.
    """
    
    scores = []
    for i, response in enumerate(responses):
        # Simple check: does the response contain the expected keyword?
        if expected_keywords[i] in response:
            scores.append(1) # Correct
        else:
            scores.append(0) # Incorrect
            
    return scores

Now, let's use this function to evaluate the results from each model.

In [None]:
evaluation_scores = []
for i, model_responses in enumerate(eval_results):
    scores = evaluate_responses(model_responses, EXPECTED_KEYWORDS)
    evaluation_scores.append(scores)
    
    print(f"Scores for {small_models[i]}: {scores}")
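Exact substring matching is strict: a response containing `x=2` or `Cos(x)` would score 0 against the keywords `x = 2` and `cos(x)`. One possible refinement, shown here as a sketch rather than as the intended solution, is to lowercase both strings and remove whitespace before comparing. The helper names `normalize` and `contains_keyword` are introduced for this example only.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and remove all whitespace."""
    return re.sub(r"\s+", "", text.lower())


def contains_keyword(response: str, keyword: str) -> bool:
    """Check for the keyword while ignoring case and spacing."""
    return normalize(keyword) in normalize(response)


print(contains_keyword("So x=2.", "x = 2"))                      # True
print(contains_keyword("The derivative is Cos(x).", "cos(x)"))   # True
print(contains_keyword("The area is 45.", "40"))                 # False
```

This is still a substring test, so it keeps the false positives of the original check: the keyword `40` would also match a response that says `140`.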

### Exercise 3: Create a summary DataFrame

To make the results easier to read, let's put them into a pandas DataFrame. This is a common practice in data science for organizing and analyzing results.

In [None]:
def create_summary_df(models, prompts, results, scores):
    """
    Create a pandas DataFrame to summarize the evaluation results.
    """
    
    summary_data = []
    for i, model_name in enumerate(models):
        for j, prompt in enumerate(prompts):
            summary_data.append({
                "Model": model_name,
                "Prompt": prompt,
                "Response": results[i][j],
                "Score": scores[i][j]
            })
            
    return pd.DataFrame(summary_data)

Now, create and display the summary.

In [None]:
summary_df = create_summary_df(small_models, TEST_PROMPTS, eval_results, evaluation_scores)
display(summary_df)
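A single number per model is often easier to compare than the full table. The sketch below computes the fraction of prompts each model answered correctly from the same nested score lists used above. The function name `accuracy_by_model` and the example values are made up for illustration; they are not results from this notebook.

```python
def accuracy_by_model(models, scores):
    """Map each model name to the fraction of prompts it scored 1 on."""
    return {
        model: (sum(model_scores) / len(model_scores) if model_scores else 0.0)
        for model, model_scores in zip(models, scores)
    }


# Made-up scores for illustration only; these are not real results.
example_models = ["model-a", "model-b"]
example_scores = [[1, 0, 1], [1, 1, 1]]

for name, acc in accuracy_by_model(example_models, example_scores).items():
    print(f"{name}: {acc:.0%}")
```

With the real data you would call `accuracy_by_model(small_models, evaluation_scores)`.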

## Conclusion

In this notebook, you've gone through a simple but complete LLM evaluation workflow:
1.  **Processed Prompts**: You sent a series of questions to different language models.
2.  **Evaluated Responses**: You checked the models' answers for correctness based on keywords.
3.  **Summarized Results**: You organized everything into a clean table for easy comparison.

This is a foundational approach to model evaluation. In more advanced scenarios, you might use more sophisticated metrics, larger datasets, or even other LLMs to help with the evaluation.