<a href="https://colab.research.google.com/github/lmassaron/fine-tuning-workshop/blob/main/01_knowledge_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is the output of the **NVIDIA System Management Interface (nvidia-smi)**, a command-line utility used for monitoring and managing NVIDIA GPU devices. It provides a real-time snapshot of the GPU's status and the processes utilizing it.

Here's a breakdown of what each section of the output means:

### Header Information

The top section provides details about the installed NVIDIA driver and the CUDA version it supports.

| Header | Value | Description |
|---|---|---|
| `NVIDIA-SMI` | `550.54.15` | The version of the nvidia-smi utility itself. |
| `Driver Version` | `550.54.15` | The version of the installed NVIDIA display driver. |
| `CUDA Version` | `12.4` | The highest version of CUDA that is supported by the installed driver. |

### GPU Details Table

This table provides detailed information about the GPU installed in the system.

| Metric | Value | Description |
|---|---|---|
| `GPU` | `0` | The index of the GPU in the system. Index 0 is the first GPU, and it is the only one listed here. |
| `Name` | `Tesla T4` | The model of the GPU. |
| `Persistence-M` | `Off` | Persistence mode. When "On", the NVIDIA driver remains loaded even when no applications are using the GPU. |
| `Bus-Id` | `00000000:00:04.0` | The PCI bus address of the GPU, which identifies the device on the system's PCI bus. |
| `Disp.A` | `Off` | Display Active. This indicates whether a display is connected to and active on this GPU. "Off" is expected on a headless cloud machine. |
| `Volatile Uncorr. ECC` | `0` | The number of volatile uncorrectable ECC (Error Correction Code) memory errors counted since the driver was last loaded. `0` means none were detected; "N/A" would mean ECC is not supported or not enabled on the GPU. |
| `Fan` | `N/A` | The speed of the GPU's cooling fan as a percentage of its maximum speed. "N/A" means the GPU reports no fan reading, which is typical of passively cooled data-center cards such as the T4. |
| `Temp` | `57C` | The current temperature of the GPU in degrees Celsius. |
| `Perf` | `P8` | The current performance state of the GPU. This ranges from P0 (maximum performance) to P12 (minimum performance). |
| `Pwr:Usage/Cap` | `10W / 70W` | The current power consumption of the GPU in watts compared to its maximum power capacity. |
| `Memory-Usage` | `0MiB / 15360MiB` | The amount of dedicated GPU memory currently in use out of the total available memory. |
| `GPU-Util` | `0%` | The percentage of time over the last sample period during which the GPU was executing at least one kernel. |
| `Compute M.` | `Default` | The compute mode of the GPU. "Default" allows multiple processes to use the GPU for compute tasks simultaneously. |

### Processes Table

This section lists the processes that are currently using the GPU's resources.

| Column | Description |
|---|---|
| `GPU` | The index of the GPU being used by the process. |
| `GI` & `CI` | Graphics and Compute Instance IDs. These are used for Multi-Instance GPU (MIG) functionality, which is not active here ("N/A"). |
| `PID` | The Process ID of the application using the GPU. |
| `Type` | The type of context the process is using: "G" for Graphics, "C" for Compute, and "C+G" for both. |
| `Process name` | The name of the process, typically the path of its executable. |
| `GPU Memory Usage` | The amount of the GPU's memory being used by that specific process. |

In [1]:
# Check the GPU information
!nvidia-smi

Thu Sep 25 08:13:21 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   57C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

`%%capture` is a "magic command" in IPython, which is the interactive shell that powers Jupyter notebooks. Magic commands are special commands that are not part of the Python language itself but provide extra functionality within the IPython/Jupyter environment.

Specifically, `%%capture` is a **cell magic**, which means it starts with `%%` and applies to the entire code cell in which it is placed. Its primary purpose is to **prevent the output of a code cell from being displayed and to capture that output into a variable**.

### Basic Usage

The most basic way to use `%%capture` is to simply put it at the top of a cell. This will run the code in the cell but will discard all the output.

```python
%%capture
print("This output will be hidden.")
```

### Storing Captured Output

To make `%%capture` truly useful, you can provide a variable name after it. The captured output will be stored in an object in that variable.

```python
%%capture my_output
import sys

print("This is standard output.")
print("This is an error message.", file=sys.stderr)
```

After running this cell, the variable `my_output` will contain a special `CapturedIO` object. You can then access the captured output through its attributes:

*   `my_output.stdout`: A string containing everything that was sent to standard output.
*   `my_output.stderr`: A string containing everything that was sent to standard error.

You can then print these variables to see the captured content:

```python
print("--- Standard Output ---")
print(my_output.stdout)
print("--- Standard Error ---")
print(my_output.stderr)
```

Here is a brief description of each package:

*   **`transformers`**: Developed by Hugging Face, this is a foundational library providing a vast collection of pre-trained models (like BERT, GPT, and Llama) for a wide range of tasks in natural language processing, computer vision, and audio. It simplifies downloading and using these state-of-the-art models with a consistent API.

*   **`trl`**: Standing for Transformer Reinforcement Learning, `trl` is a library for post-training models from the `transformers` library. It provides ready-made trainers for Supervised Fine-Tuning (SFT) as well as reinforcement learning methods such as Proximal Policy Optimization (PPO).

*   **`accelerate`**: Also from Hugging Face, `accelerate` is a library that simplifies running PyTorch training scripts across different types of hardware (like multiple GPUs or TPUs) with minimal code changes. It handles the boilerplate code for distributed training and mixed-precision, making it easier to scale up your model training.

*   **`bitsandbytes`**: This is a lightweight library that provides powerful quantization methods, allowing you to run and train large language models with significantly less memory. Its key features are 8-bit and 4-bit quantization, which dramatically reduce the GPU memory footprint of a model, making it possible to work with very large models on consumer-grade hardware.

In [2]:
%%capture
# Install necessary libraries for model training and evaluation
!pip install -U transformers trl accelerate bitsandbytes

In [3]:
# Import and print the versions of the installed libraries
import torch
import trl
import bitsandbytes

print(f"Using PyTorch version: {torch.__version__}")
print(f"Using TRL version: {trl.__version__}")
print(f"Using bitsandbytes version: {bitsandbytes.__version__}")

Using PyTorch version: 2.8.0+cu126
Using TRL version: 0.23.0
Using bitsandbytes version: 0.47.0


In [4]:
# Import various libraries needed for data handling, model loading, and training
import os
import gc
import warnings
import torch
import numpy as np
import pandas as pd
from datasets import Dataset, load_dataset
from huggingface_hub import login
from peft import LoraConfig
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

This class acts as a **centralized place to store and manage important settings** for a machine learning script, specifically for running Google's Gemma model.

The reasons for using it are:

*   **Easy to Change:** You can quickly change the model size (e.g., from `"3-1b"` to `"7b"`) or token lengths in one spot without hunting through the code.
*   **Keeps Code Clean:** It groups all the key parameters together, making the main script more readable and organized.
*   **Avoids "Magic Numbers":** It gives descriptive names (`max_prompt_length`) to numbers that would otherwise be scattered throughout the code, making their purpose clear.

In [5]:
# Define configuration parameters for the model and data
class Config:
    """Configuration parameters"""

    SIZE = "3-1b"
    MODEL_NAME = f"google/gemma-{SIZE}-it"

    max_prompt_length = 352
    max_completion_length = 100

The sections below compare three sizes of the "Gemma 3 Instruct" model family.

### Gemma 3 4B-Instruct: The Efficient Workhorse

This model is the balanced, general-purpose choice, offering a strong blend of performance and efficiency. It's designed to be a capable and reliable assistant for a wide range of tasks without requiring high-end, specialized hardware.

*   **Reasoning:** Good to Strong
*   **Speed:** Fast
*   **Hardware:** Consumer GPUs, Cloud CPUs
*   **Ideal Use Cases:**
    *   General-purpose chatbots
    *   RAG & summarization
    *   Developer co-pilots
    *   Content creation

**Key Strengths:**
*   **Excellent Performance-to-Size Ratio:** Its primary strength is delivering high-quality, nuanced responses that are competitive with larger models, while being significantly faster and less resource-intensive.
*   **Strong General Reasoning:** It can handle moderately complex tasks like multi-turn conversations, detailed content summarization, and Retrieval-Augmented Generation (RAG).
*   **High Accessibility:** Small enough to run efficiently on consumer-grade GPUs, making it a go-to choice for developers, researchers, and small businesses.
*   **Reliable Instruction Following:** As an instruct-tuned model, it is good at following complex prompts and adhering to specific formats, making it well suited to dependable chatbots and assistants.

---

### Gemma 3 1B-Instruct: The Ultra-Lightweight Specialist

This model prioritizes speed and extreme efficiency above all else. It's designed for applications where low latency and a minimal resource footprint are critical, such as on-device or real-time tasks.

*   **Reasoning:** Basic to Moderate
*   **Speed:** Blazing Fast
*   **Hardware:** Standard CPUs, Mobile Devices
*   **Ideal Use Cases:**
    *   Real-time text classification
    *   Simple, high-volume tasks
    *   On-device voice commands
    *   Basic customer service bots

**Key Strengths:**
*   **Blazing Speed and Low Latency:** It provides very fast responses, which is crucial for interactive applications, simple chatbots, and processing high volumes of text quickly.
*   **On-Device Deployment:** Its small size makes it perfect for running directly on mobile phones, laptops, and edge devices, enabling AI applications that work offline and with enhanced privacy.
*   **Cost-Effectiveness at Scale:** Extremely cheap to run, making it the best option for high-volume, simple tasks like text classification, sentiment analysis, or command parsing.
*   **Excellent for Fine-Tuning:** Small models are fast and inexpensive to fine-tune, allowing developers to easily create a highly specialized expert for a single, narrow task.

---

### Gemma 3 270M-Instruct: The Specialized Micro-Model

This is an extremely compact model. It is not designed for general conversation or complex reasoning. Instead, its strength lies in being a highly efficient foundation for a single, well-defined task, often after being fine-tuned.

*   **Reasoning:** Very Limited (before fine-tuning)
*   **Speed:** Near-Instantaneous
*   **Hardware:** Microcontrollers, In-Browser
*   **Ideal Use Cases:**
    *   Fine-tuned classifiers (e.g., toxicity)
    *   Smart text filters (e.g., PII detection)
    *   Format checking (e.g., JSON)
    *   Low-power/edge devices
    
**Key Strengths:**
*   **Minimal Resource Footprint:** The smallest and most efficient option, capable of running on virtually any hardware, including microcontrollers and in-browser environments (WebAssembly).
*   **Task-Specific Excellence:** While poor at general tasks, it can become exceptionally good at one specific thing after fine-tuning. For example, it could be trained to be a world-class toxicity detector, a PII (Personally Identifiable Information) scrubber, or a format checker.
*   **Energy Efficiency:** Its low computational requirements make it ideal for battery-powered and low-power devices where energy consumption is a major concern.
*   **Simple Integration:** Can be easily integrated into larger systems as a fast, intelligent filter or classifier.

`init()`

This function **sets up the entire environment** for the script to run properly. It does three main things:
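1.  **Configures Settings:** It sets two environment variables: `TOKENIZERS_PARALLELISM="false"` turns off the tokenizer library's internal parallelism (which avoids fork-related warnings), and `CUDA_VISIBLE_DEVICES="0"` restricts the script to the first GPU.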
2.  **Handles Login:** It automatically logs into Hugging Face by looking for your access token, first in the standard environment variables and then specifically in Google Colab's secure "secrets" manager.
3.  **Cleans Up:** It frees up GPU memory, runs Python's garbage collector to free up system RAM, and silences warning messages for a cleaner output.

`is_bfloat16_supported()`

This function is a **hardware check**. It checks if the available NVIDIA GPU has the necessary architecture (Compute Capability 8.0 or higher, like an A100 or RTX 30/40 series) to support the `bfloat16` data type, which is an efficient format for training modern deep learning models. It returns `True` if it's supported and `False` if it's not.

`info_device()`

This function **determines the best hardware to use** for computations. It checks if a CUDA-enabled GPU is available and selects it. If not, it falls back to using the CPU. It then prints a message to inform you which device is being used (`"cuda"` or `"cpu"`) and returns this device object so the rest of the program knows where to send the model and data.

In [6]:
# Initialization script to set up the environment and Hugging Face login
def init():
    """Initialization script"""
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    # It is recommended to set the HF_TOKEN as an environment variable
    token = os.environ.get("HF_TOKEN")
    if token:
        login(token=token)
    else:
        try:
            from google.colab import userdata
            # Retrieve your Hugging Face token from Colab's secrets manager
            # The name 'HF_TOKEN' should match the name you used in the secrets tab
            hf_token = userdata.get('HF_TOKEN')

            # Check if the token was successfully retrieved
            if hf_token:
                # Log in to Hugging Face using the retrieved token
                # The `add_to_git_credential=True` argument is optional and useful if you plan to push models to the Hub
                login(token=hf_token, add_to_git_credential=True)
                print("Hugging Face login successful using Google Colab secrets!")
            else:
                print("Error: HF_TOKEN not found in Google Colab secrets or is empty.")
                print("Please ensure you have created a secret named 'HF_TOKEN' in the 'Secrets' tab (🔑) on the left sidebar.")
        except Exception:
            # Not running in Colab, or the secret could not be read
            print("HF_TOKEN not set. You might need to log in manually.")

    torch.cuda.empty_cache()
    gc.collect()
    warnings.filterwarnings("ignore")

def is_bfloat16_supported():
    """Checks if the current device supports bfloat16."""
    return torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8


def info_device():
    """Get device for PyTorch"""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    return device

In [7]:
# Initialize the environment, get parameters, device, and data type
init()
params = Config()
device = info_device()
dtype = torch.bfloat16 if is_bfloat16_supported() else torch.float16
print(f"Using dtype: {dtype}")

Hugging Face login successful using Google Colab secrets!
Using device: cuda
Using dtype: torch.float16


In [8]:
# Function to load dataset from Hugging Face Hub
def get_data(repo_id, mapping_func=None, split="train"):
    """Load a dataset split from the Hugging Face Hub"""
    data = load_dataset(repo_id, cache_dir="/tmp")[split]
    if mapping_func:
        # Optionally transform every row with the supplied function
        data = data.map(mapping_func)
    return data

The line `get_data(repo_id="lmassaron/Sherlock_QA_test")` does the following:

1.  **Downloads a Dataset:** It connects to the Hugging Face Hub and downloads the dataset named `Sherlock_QA_test`. This is an example of expert-based QA for testing purposes.
2.  **Selects the Training Data:** By default, it selects the `"train"` split from that dataset.
3.  **Returns the Data:** It returns this training data and stores it in the `data` variable without applying any transformations (since no `mapping_func` was provided).

In [None]:
# Load the Sherlock QA dataset
data = get_data(repo_id="lmassaron/Sherlock_QA_test")

The Hugging Face `Dataset` object structures data like a **highly efficient, memory-mapped table** (similar to a spreadsheet or a database table).

For our dataset:

*   `features: ['Question', 'Answer', 'Difficulty']`: These are the **columns** of the table. Every single entry in the dataset will have these three fields.

*   `num_rows: 25`: This is the total number of **rows** or records in the table.

So, you can think of this dataset as a table with **25 rows** and **3 columns**: `Question`, `Answer`, and `Difficulty`.

The key difference from a simple list of dictionaries is that the `datasets` library uses **Apache Arrow** on the backend. This allows it to handle massive datasets that don't fit in RAM by only loading the data from your disk as you need it, making it extremely fast and memory-efficient.

In [10]:
# Display the loaded dataset information
data

Dataset({
    features: ['Question', 'Answer', 'Difficulty'],
    num_rows: 25
})

In [11]:
for i, question in enumerate(data["Question"]):
  print(f"{i:2}. {question}")

 0. Who created the character of Sherlock Holmes?
 1. What is the name of Sherlock Holmes's enemy?
 2. Where does Sherlock Holmes live?
 3. Who is Sherlock Holmes's best friend?
 4. What is the name of Sherlock's older brother?
 5. Who is the landlady of 221b Baker Street?
 6. What musical instrument does Sherlock Holmes like to play?
 7. In which Sherlock Holmes short story do we meet Irene Adler?
 8. Which actor plays Sherlock Holmes in the TV series Sherlock?
 9. Who did Dr. Watson marry?
10. What are the street boys called who run errands for Sherlock Holmes?
11. Who stars as Sherlock Holmes in the 2009 film Sherlock Holmes?
12. Who stars as Watson in the 2009 film Sherlock Holmes?
13. What was the first Sherlock Holmes story titled?
14. Which 2020 film features the teenage sister of Sherlock Holmes?
15. Where did Sherlock and Watson first meet?
16. When Sherlock Holmes retired, what hobby did he take up?
17. Where does Sherlock Holmes keep his tobacco?
18. What is the client's nam

`AutoTokenizer` is a smart class from the Hugging Face `transformers` library. Its job is to **download the correct tokenizer for any given pre-trained model.**

Think of a tokenizer as a model-specific dictionary. It converts your text into a sequence of numbers (tokens) that the model can understand and converts the model's numerical output back into human-readable text. The "Auto" part is the key: you just give it a model name (like `"google/gemma-3-1b-it"`), and it automatically figures out the right tokenizer to use, saving you from having to find the specific class yourself.

`AutoModelForCausalLM` is another smart class from Hugging Face that **downloads the pre-trained weights and architecture for a model designed for causal language modeling.**

"Causal Language Modeling" is the task of predicting the next token in a sequence of text. This is the fundamental task for generative models like GPT and Gemma. Just like `AutoTokenizer`, the "Auto" part automatically selects the correct model architecture based on the name you provide.

#### Parameters:

*   `params.MODEL_NAME`: The identifier of the model on the Hugging Face Hub (e.g., `"google/gemma-3-1b-it"`). This tells the function *what* model to load.
*   `torch_dtype=dtype`: This sets the numerical precision (e.g., `float16` or `bfloat16`) for the model's weights. Using a lower precision than the default (`float32`) significantly **reduces the model's memory footprint and speeds up computation**, which is crucial for running large models.
*   `device_map=device`: This tells the library **where to place the model's layers** (e.g., on the GPU, specified by `device="cuda"`). For very large models, this can even be used to automatically distribute layers across multiple GPUs.
*   `use_cache=True`: This enables a key-value cache during text generation. This is a significant optimization that **dramatically speeds up the process of generating long sequences of text** by reusing previous calculations instead of recomputing them for every new word.
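
The effect of `torch_dtype` on memory can be estimated with simple arithmetic. The sketch below is only a back-of-the-envelope calculation for the weights alone; the parameter count of exactly one billion is a round-number assumption, and the helper name `weight_memory_gib` is hypothetical.

```python
# Bytes needed to store one parameter in each precision
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params, dtype_name):
    """Estimate the memory taken by the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype_name] / 1024**3


# Assumption: a round figure of 1 billion parameters
num_params = 1_000_000_000

for name in BYTES_PER_PARAM:
    print(f"{name:>8}: {weight_memory_gib(num_params, name):.2f} GiB")
```

Half precision halves the weight footprint relative to `float32`. Activations, the key-value cache, and framework overhead come on top of this estimate.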

You set the `tokenizer.pad_token` to handle **batch processing**.

When you send a batch of multiple sentences to the model, they all must have the same length. To achieve this, shorter sentences are "padded" by adding a special token to them until they match the length of the longest sentence in the batch.

However, some models designed purely for text generation ship without a dedicated padding token, in which case `tokenizer.pad_token` is `None`. Setting `tokenizer.pad_token = tokenizer.eos_token` (the "end-of-sequence" token) tells the tokenizer to **use the end-of-sequence token for padding**. This is a common practice, and it is safe because padded positions are marked with 0 in the attention mask, so the model ignores them during processing. Gemma tokenizers typically already define their own padding token, so for this model the check is usually a no-op; the `if` guard keeps the code portable to models that lack one.
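
To make the padding idea concrete, here is a minimal sketch in plain Python. The helper `pad_batch` and the token IDs are made up for illustration; a real tokenizer does this for you when called with `padding=True`, and for generation with decoder-only models padding is often applied on the left rather than the right as shown here.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask


# Hypothetical token IDs for two sentences of different lengths
batch = [[5, 8, 13], [7, 9, 21, 34, 55]]
input_ids, attention_mask = pad_batch(batch, pad_id=0)

print(input_ids)
print(attention_mask)
```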

In [None]:
# Load the tokenizer and model from Hugging Face
tokenizer = AutoTokenizer.from_pretrained(params.MODEL_NAME)

model = AutoModelForCausalLM.from_pretrained(
    params.MODEL_NAME,
    torch_dtype=dtype,
    device_map=device,
    use_cache=True
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

### Two Different Processes

1.  **Text Generation (`model.generate`)**: This is an **autoregressive process**. The model predicts one token at a time, appends it to the input, and then predicts the next token in a loop. It's a creative, step-by-step procedure designed to *produce new text*. It often involves sampling strategies (like temperature) and uses an internal cache to be efficient. You are asking the model: **"What comes next?"** over and over again.

2.  **Perplexity Calculation (`model(...)`)**: This is a **single forward pass** on a *complete, existing sequence of text*. You provide the entire sequence to the model at once and ask it to calculate the probability of that sequence. This is done by measuring the cross-entropy loss between the model's predictions and the actual tokens in the sequence. You are asking the model: **"How probable or 'fluent' was this entire sequence?"**

### What Does This Perplexity Score Actually Measure?

We are calculating the perplexity of the **model's own generated answer**.

*   **This is a measure of the model's internal confidence or fluency.**
    *   A **low perplexity** score means the model generated a sequence of tokens that it found highly probable and predictable. The answer is "fluent" according to its own internal patterns (e.g., "The sky is blue.").
    *   A **high perplexity** score means the model generated a sequence that it found surprising or unlikely. This could indicate a less confident answer, a more complex or unusual phrasing, or potential disfluency.

**This is different from another common use of perplexity**, where you would measure the perplexity of the `expected_answer` (the ground truth). That would tell you how well the model might have predicted the "correct" answer, whereas the code in this notebook tells you how confident the model was in its *own* answer. Both are valid and useful metrics, but they measure different things.
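
As a rough illustration of the arithmetic, perplexity is the exponential of the average negative log-probability the model assigned to each token. The sketch below uses made-up per-token probabilities rather than real model output, and the helper name `perplexity` is hypothetical.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability over the tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# Made-up probabilities the model assigned to each token of two answers
confident_answer = [0.9, 0.8, 0.95]
hesitant_answer = [0.2, 0.1, 0.3]

print(f"confident: {perplexity(confident_answer):.2f}")
print(f"hesitant:  {perplexity(hesitant_answer):.2f}")
```

If every token has probability 0.5, the perplexity is 2: on average the model was as uncertain as a fair coin flip at each step.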


`inputs = tokenizer.apply_chat_template(...)`

This section **formats the user's question into the specific conversational structure that the chat model was trained on.** It takes a simple question and wraps it with special tokens and role identifiers (like `user` and `model`) to create a formal prompt. It then tokenizes this formatted prompt, converts it into a PyTorch tensor, and moves it to the correct device (e.g., the GPU) to be ready for the model.
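
To give a feel for what the chat template produces before tokenization, here is a hand-written approximation of a Gemma-style layout. The function `render_gemma_style_prompt` is hypothetical and the exact markers are an assumption for illustration; the tokenizer's own template is the source of truth and should always be used in practice.

```python
def render_gemma_style_prompt(messages, add_generation_prompt=True):
    """Hypothetical helper: approximate a Gemma-style chat layout as plain text."""
    parts = ["<bos>"]
    for message in messages:
        # Gemma-style templates call the assistant role "model"
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Open the model's turn so generation starts with its answer
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = render_gemma_style_prompt(
    [{"role": "user", "content": "Where does Sherlock Holmes live?"}]
)
print(prompt)
```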

`outputs = model.generate(...)`

This is the **core text generation step.** It feeds the prepared `inputs` into the model and instructs it to predict the next sequence of tokens. The `attention_mask` tells the model which tokens are real and which are just padding, so it is not confused by any padding added for batching. The `**generation_kwargs` dictionary holds the other settings that control the generation process: here, the maximum number of new tokens, the padding token ID, and whether to sample.

`generated_text = tokenizer.decode(...)`

This final section **translates the model's numerical output back into human-readable text.** It first slices the output tensor to isolate only the newly generated tokens (stripping away the original input prompt). It then uses the tokenizer's `decode` method to convert these token IDs back into words, while also removing any special tokens (like the end-of-sequence token) for a clean, final answer.

`generated_token_ids = outputs[0, inputs.input_ids.shape[-1] :]` is for **separating the model's newly generated answer from the original prompt you gave it.**

Here's the breakdown:

1.  **`outputs`**: The `model.generate` function returns the **entire sequence of tokens**, which includes your original input prompt *plus* the model's generated response appended at the end.

2.  **`inputs.input_ids.shape[-1]`**: This gets the **length of your original input prompt** in tokens.

3.  **`[0, inputs.input_ids.shape[-1] :]`**: This is a tensor slicing operation.
    *   `0`: Selects the first (and likely only) sequence in the batch.
    *   `inputs.input_ids.shape[-1] :`: This tells Python to "start slicing from the index where the input prompt ends, and go all the way to the end of the sequence."
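
The same slicing logic can be seen with plain Python lists. The token IDs below are made up, and nested lists stand in for the tensors; with lists the index is written `outputs[0][prompt_length:]` instead of `outputs[0, prompt_length:]`.

```python
# Made-up token IDs standing in for a tokenized prompt
prompt_ids = [2, 106, 1645, 108]

# generate() returns prompt + completion, one row per sequence in the batch
outputs = [prompt_ids + [651, 7624, 1]]

# Length of the prompt in tokens
prompt_length = len(prompt_ids)

# Keep only what comes after the prompt in the first sequence
generated_token_ids = outputs[0][prompt_length:]
print(generated_token_ids)
```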

Think of this command as giving the model a **pop quiz** to see how well it knows the text you just gave it.

### High-Level Explanation

In one sentence, this command feeds the entire generated sentence back into the model at once and asks it: **"How well could you have predicted this exact sentence?"** Because you provide the `labels` (the correct answers), the model automatically calculates a `loss` score that measures how "surprised" or wrong its predictions were.

---

`outputs_perplexity = model( ... )`

This is a **forward pass**, the most fundamental operation of a neural network. It's different from `model.generate()`, which is a step-by-step loop.

1.  **`full_input_ids` (The Quiz Questions)**
    *   **What it is:** The sequence of token IDs to evaluate: the prompt tokens followed by the generated answer tokens (e.g., `[101, 2009, 3209, 2332]`).
    *   **Its Role:** This is the input sequence. At each position, the model will try to predict the *next* token.

2.  **The attention mask (The "Focus" Instructions)**
    *   **What it is:** A tensor of 1s and 0s that tells the model which tokens are real and which are just padding.
    *   **Its Role:** The call in this notebook does not pass one. Each sequence is evaluated on its own and contains no padding, so the model attends to every token by default. With padded batches, pass the mask explicitly.

3.  **`labels=labels` (The Answer Key)**
    *   **What it is:** This is the most important part. It is a copy of `full_input_ids` in which the prompt positions are set to `-100`, so the loss ignores them and only the answer tokens are scored. No manual shifting is needed: the model shifts the labels by one position internally.
    *   **Its Role:** Providing `labels` activates a built-in feature of Hugging Face causal language models: the forward pass also returns `loss`, the average cross-entropy over the scored tokens.
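
The evaluation cell builds `labels` by copying the full sequence and masking the prompt with `-100`. The sketch below mirrors that with plain lists and made-up token IDs, and shows which positions end up being scored once the one-step shift is applied. The helpers `build_labels` and `scored_positions` are hypothetical names used only for this illustration.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def build_labels(full_input_ids, prompt_length):
    """Copy the sequence and mask out the prompt so only the answer is scored."""
    labels = list(full_input_ids)
    labels[:prompt_length] = [IGNORE_INDEX] * prompt_length
    return labels


def scored_positions(labels):
    """Pairs (t, target): the prediction made at position t is compared with labels[t + 1]."""
    return [
        (t, labels[t + 1])
        for t in range(len(labels) - 1)
        if labels[t + 1] != IGNORE_INDEX
    ]


# Made-up IDs: four prompt tokens followed by two answer tokens
full_input_ids = [2, 10, 11, 12, 50, 51]
labels = build_labels(full_input_ids, prompt_length=4)

print(labels)
print(scored_positions(labels))
```

Only the two answer tokens contribute to the loss: the prediction made at the last prompt position is checked against the first answer token, and so on.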


In [13]:
# Evaluate the model on the dataset and store results
temperature = 0
results_list = []
instructions = "\nBriefly, just give the straight answer to the question."

# It's good practice to set the pad_token if it's not already set.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Set the model to evaluation mode
model.eval()

for row in tqdm(data, desc="Evaluating Samples"):
  question = row['Question']
  answer = row['Answer']
  difficulty = row['Difficulty']

  # Tokenize the input and get both input_ids and attention_mask
  inputs = tokenizer.apply_chat_template(
            [{"role": "user", "content": question + instructions}],
            tokenize=True,
            add_generation_prompt=True,  # Crucial for telling the model it's its turn to speak
            return_tensors="pt",
            return_dict=True  # Ensure the output is a dictionary
        ).to(device)

  # Prepare arguments for the generate function
  generation_kwargs = {
      "pad_token_id": tokenizer.eos_token_id,
      "max_new_tokens": params.max_completion_length,
      "do_sample": temperature > 0
  }

  # Only add temperature to kwargs if sampling is enabled
  if generation_kwargs["do_sample"]:
      generation_kwargs["temperature"] = temperature

  # Generate a completion from the model, passing the attention_mask
  outputs = model.generate(
      inputs.input_ids, # Pass input_ids explicitly
      attention_mask=inputs.attention_mask, # Pass the attention mask
      **generation_kwargs
      )

  generated_token_ids = outputs[0, inputs.input_ids.shape[-1] :]
  generated_text = tokenizer.decode(
      generated_token_ids,
      skip_special_tokens=True,
  ).strip()

  # Calculate perplexity for the generated answer
  with torch.no_grad():
        # 1. Concatenate the Prompt (input_ids) and the Answer (generated_token_ids)
        # Ensure generated_token_ids is 2D [batch, seq] to match input_ids
        answer_ids = generated_token_ids.unsqueeze(0)
        full_input_ids = torch.cat([inputs.input_ids, answer_ids], dim=1)

        # 2. Create the labels. Start by cloning the full input.
        labels = full_input_ids.clone()

        # 3. Mask the Prompt.
        # Get the length of the prompt
        prompt_length = inputs.input_ids.shape[1]
        # Set the labels corresponding to the prompt to -100 (IGNORED by the loss function)
        labels[:, :prompt_length] = -100

        # 4. Calculate the loss
        # Note: We do NOT need to manually shift the labels.
        # Hugging Face CausalLM models handle the required label shifting internally.
        outputs_perplexity = model(
            full_input_ids,
            labels=labels
        )

        loss = outputs_perplexity.loss
        perplexity = torch.exp(loss).item()

  results_list.append({
      'question': question,
      'expected_answer': answer,
      'generated_answer': generated_text,
      'difficulty': difficulty,
      'perplexity': perplexity # Add perplexity to the results
  })

results_df = pd.DataFrame(results_list)

Evaluating Samples:   0%|          | 0/25 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Evaluating Samples: 100%|██████████| 25/25 [00:18<00:00,  1.37it/s]
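
The perplexity step in the cell above can be illustrated without a model. The following is a minimal plain-Python sketch, assuming made-up per-token log-probabilities (`token_logprobs` and `masked_perplexity` are hypothetical names, not part of the notebook). It applies the same idea: positions labelled `-100` are ignored, and perplexity is the exponential of the mean negative log-likelihood over the remaining answer tokens.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips, as in the cell above


def masked_perplexity(token_logprobs, labels):
    """Perplexity over positions whose label is not IGNORE_INDEX.

    token_logprobs[i] is an assumed log-probability for the token at
    position i; labels[i] is a token id, or IGNORE_INDEX for prompt tokens.
    """
    kept = [lp for lp, lab in zip(token_logprobs, labels) if lab != IGNORE_INDEX]
    if not kept:
        raise ValueError("no answer tokens to score")
    mean_nll = -sum(kept) / len(kept)  # average negative log-likelihood
    return math.exp(mean_nll)


# Three masked prompt tokens followed by two answer tokens,
# each answer token assigned probability 0.5 (illustrative values only).
labels = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 42, 7]
token_logprobs = [-5.0, -4.0, -3.0, math.log(0.5), math.log(0.5)]

ppl = masked_perplexity(token_logprobs, labels)
print(round(ppl, 6))  # 2.0: as uncertain as a fair coin flip per answer token
```

A perplexity of 1 would mean the model assigned probability 1 to every answer token. Because the score is computed on text the model generated itself, it reflects how confident the model was in its own answer, not whether that answer is correct.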


In [14]:
# Delete the model and tokenizer, then release cached GPU memory
del model, tokenizer
torch.cuda.empty_cache()

In [15]:
# Evaluate correctness based on keyword matching
def evaluate_keyword(row):
    return row['expected_answer'].lower() in row['generated_answer'].lower()

results_df['is_correct_keyword'] = results_df.apply(evaluate_keyword, axis=1)
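
Keyword matching is strict: the whole expected string must appear inside the generated one. The short sketch below re-implements the same rule as a standalone function (`keyword_match` is a hypothetical name used only here) and applies it to three answer pairs copied from the results table shown later in this notebook.

```python
def keyword_match(expected: str, generated: str) -> bool:
    """Same rule as evaluate_keyword: case-insensitive substring test."""
    return expected.lower() in generated.lower()


# (expected_answer, generated_answer) pairs from the notebook's results table
pairs = [
    ("Professor Moriarty", "Professor Moriarty"),
    ("Sir Arthur Conan Doyle", "Arthur Conan Doyle"),
    ("221b Baker Street in London", "221B Baker Street."),
]

for expected, generated in pairs:
    print(f"{expected!r} in {generated!r}: {keyword_match(expected, generated)}")
```

Only the first pair matches. The other two are reasonable answers that fail because the expected string contains extra words ("Sir", "in London"), which is why this metric tends to undercount correct answers.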


Here we **evaluate whether the model's answers are correct by comparing their *semantic meaning*, not just their exact words.**

Here's a simple breakdown of the steps:

1.  **Convert to Numbers:** The code uses a `SentenceTransformer` model to convert both the expected answers and the model's generated answers into numerical vectors (called embeddings). In this vector space, sentences with similar meanings lie close to each other.

2.  **Calculate Similarity:** It then calculates the cosine similarity between the embedding of each expected answer and the embedding of the corresponding generated answer. The score ranges from -1 to 1, where 1 means the two embeddings point in exactly the same direction.

3.  **Make a Decision:** Finally, it checks whether this similarity score is at or above a threshold (in this case, `0.5`). If it is, the answer is marked as `True` (correct), even if the wording isn't exactly the same.

Cosine similarity is a metric used to measure how similar two things are, not by their size or magnitude, but by their **orientation** or **direction**.

Imagine two arrows starting from the same point.

*   If the arrows point in the **exact same direction**, their cosine similarity is **1** (maximum similarity).
*   If the arrows are **perpendicular** (pointing at a 90-degree angle to each other), they are considered unrelated, and their similarity is **0**.
*   If they point in **opposite directions**, their similarity is **-1** (maximum dissimilarity).

In text analysis, sentences are converted into these "arrows" (vectors) in a high-dimensional space. Cosine similarity then tells us if two sentences "point" in the same semantic direction, meaning they have a similar topic or meaning, regardless of the exact words used.

At its core, the calculation is based on the **dot product** of two vectors divided by the product of their **magnitudes**.

Here is the formula:

![Cosine Similarity Formula](https://wikimedia.org/api/rest_v1/media/math/render/svg/15d11df2d48da4787ee86a4b8c14551fbf0bc96a)

Where:
*   **A ⋅ B** is the **dot product** of vectors A and B.
*   **||A||** and **||B||** are the **magnitudes** (or lengths) of vectors A and B.
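
The formula and the three arrow cases can be checked in a few lines. This is a minimal sketch in plain Python rather than the library call used later in the notebook; `cosine_similarity` here is a hypothetical helper written for illustration.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 0.0], [3.0, 0.0])   # same direction, different length
perp = cosine_similarity([1.0, 0.0], [0.0, 2.0])   # perpendicular
opp = cosine_similarity([1.0, 0.0], [-5.0, 0.0])   # opposite direction

print(same, perp, opp)  # 1.0 0.0 -1.0
```

Note that the second vector in each pair has a different length from the first, yet the scores are exactly 1, 0, and -1: only direction matters.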

---

### Step-by-Step Calculation with a Simple Example

Let's say we want to measure the similarity between two short sentences:
*   **Sentence A:** "the cat sat"
*   **Sentence B:** "the dog sat"

**Step 1: Convert Sentences to Vectors**

First, we create a vocabulary of all unique words: `{"the", "cat", "sat", "dog"}`. Then, we count the occurrences of each word in each sentence to create our vectors.

*   **Vector A** (for "the cat sat"): `[1, 1, 1, 0]`  (1 "the", 1 "cat", 1 "sat", 0 "dog")
*   **Vector B** (for "the dog sat"): `[1, 0, 1, 1]`  (1 "the", 0 "cat", 1 "sat", 1 "dog")

Now we have our two vectors, `A = [1, 1, 1, 0]` and `B = [1, 0, 1, 1]`.

**Step 2: Calculate the Dot Product (A ⋅ B)**

The dot product is the sum of the products of the corresponding elements in the vectors.

`A ⋅ B = (1 * 1) + (1 * 0) + (1 * 1) + (0 * 1)`
`A ⋅ B = 1 + 0 + 1 + 0`
`A ⋅ B = 2`

**Step 3: Calculate the Magnitude of Each Vector (||A|| and ||B||)**

The magnitude is the square root of the sum of the squares of all the elements in the vector.

*   **Magnitude of A (||A||):**
    `||A|| = √(1² + 1² + 1² + 0²)`
    `||A|| = √(1 + 1 + 1 + 0)`
    `||A|| = √3 ≈ 1.732`

*   **Magnitude of B (||B||):**
    `||B|| = √(1² + 0² + 1² + 1²)`
    `||B|| = √(1 + 0 + 1 + 1)`
    `||B|| = √3 ≈ 1.732`

**Step 4: Divide the Dot Product by the Product of the Magnitudes**

Now we just plug the results from the previous steps into the main formula.

`Cosine Similarity = (A ⋅ B) / (||A|| * ||B||)`
`Cosine Similarity = 2 / (√3 * √3)`
`Cosine Similarity = 2 / 3 ≈ 0.667`

The resulting similarity score of **0.667** indicates a strong positive similarity between the two sentences, which makes sense as they share two out of three words.
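
The worked example can be reproduced in code. The sketch below builds the count vectors from the two sentences and applies the formula; the helper names (`count_vector`, `cosine_similarity`) are illustrative and not part of the notebook.

```python
import math
from collections import Counter


def count_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vocabulary = ["the", "cat", "sat", "dog"]
vec_a = count_vector("the cat sat", vocabulary)
vec_b = count_vector("the dog sat", vocabulary)

score = cosine_similarity(vec_a, vec_b)
print(vec_a, vec_b, round(score, 3))  # [1, 1, 1, 0] [1, 0, 1, 1] 0.667
```

Word counts are used here only to keep the arithmetic visible. The notebook's actual evaluation uses dense embeddings from a `SentenceTransformer`, which are meant to capture meaning beyond word overlap.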

In [None]:
# Evaluate correctness based on semantic similarity using Sentence-BERT
from sentence_transformers import SentenceTransformer, util

# Load the Sentence-BERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the expected and generated answers into embeddings
expected_embeddings = model.encode(results_df['expected_answer'].tolist(), convert_to_tensor=True)
generated_embeddings = model.encode(results_df['generated_answer'].tolist(), convert_to_tensor=True)

# Calculate cosine similarity between embeddings
cosine_scores = util.cos_sim(expected_embeddings, generated_embeddings)
cosine_scores = np.array(cosine_scores.cpu())

# Store the semantic similarity scores
results_df['semantic_similarity'] = [cosine_scores[i][i] for i in range(len(cosine_scores))]

# Determine correctness based on a similarity threshold
similarity_threshold = 0.5
results_df['is_correct_semantic'] = results_df['semantic_similarity'] >= similarity_threshold
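
The `is_correct_semantic` flag depends directly on `similarity_threshold`. As a rough illustration, the sketch below counts how many answers pass at two thresholds, using the similarity scores copied from the first ten rows of the results table displayed later in this notebook. The `accepted` helper and the alternative threshold of 0.6 are assumptions made for this example, not a recommendation.

```python
# (expected_answer, generated_answer, semantic_similarity), copied from the
# first ten rows of the notebook's results table
rows = [
    ("Sir Arthur Conan Doyle", "Arthur Conan Doyle", 0.962578),
    ("Professor Moriarty", "Professor Moriarty", 1.0),
    ("221b Baker Street in London", "221B Baker Street.", 0.901767),
    ("Dr. John Watson", "Dr. John Watson", 1.0),
    ("Mycroft Holmes", "William.", 0.264208),
    ("Mrs. Hudson", "Mrs. Hudson", 1.0),
    ("The violin", "The piano.", 0.587473),
    ("A Scandal In Bohemia", "The Adventure of the Dancing Men.", 0.20019),
    ("Benedict Cumberbatch", "Benedict Cumberbatch", 1.0),
    ("Mary Morstan", "Dr. John Watson.", 0.315029),
]


def accepted(rows, threshold):
    """Return the (expected, generated) pairs scoring at or above the threshold."""
    return [(exp, gen) for exp, gen, score in rows if score >= threshold]


for threshold in (0.5, 0.6):
    print(threshold, len(accepted(rows, threshold)))
```

At `0.5`, seven of these ten answers pass, including "The piano." for the expected answer "The violin": the two phrases are topically close, so their embeddings are similar even though the answer is wrong. At `0.6`, that pair drops out and six pass. This is one reason the semantic accuracy can come out higher than the other two metrics.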

In [17]:
# Delete the Sentence-BERT model to free up memory
del model

**`meta-llama/Llama-3.2-3B-Instruct`** is a small yet powerful language model from Meta AI, designed to be highly efficient and accessible. The "3B" refers to its 3.21 billion parameters, making it lightweight enough to run on consumer hardware and at the edge. The "-Instruct" suffix means it has been specifically fine-tuned to follow instructions and engage in dialogue, making it ideal for creating chatbots and assistants.

### Key Strengths:

*   **Strong Multilingual Capabilities:** The model is explicitly optimized for multilingual use cases and officially supports eight languages: English, German, French, Spanish, Italian, Portuguese, Hindi, and Thai. This is a significant advantage for building global applications.
*   **Large Context Window:** It supports a 128,000-token context window, allowing it to understand and process very long documents or conversations without losing track of the details.
*   **Strong Instruction Following:** The `-Instruct` fine-tuning means it is designed to follow detailed instructions. You can provide it with a rubric or set of evaluation criteria (e.g., "Score the response based on helpfulness, clarity, and factual accuracy, then provide a short rationale"), and it will generally adhere to that format.
*   **Commercial Use:** The model is available for commercial use under the Llama 3.2 Community License, making it an attractive option for businesses and developers looking to build AI-powered products.

### Weaknesses and Considerations (Where you need to be cautious)

1.  **Limited Nuance and Depth:** This is the primary trade-off. For highly complex, subtle, or creative tasks (e.g., judging the quality of a poem, evaluating a complex legal argument, or assessing a sophisticated piece of code), a 3B model lacks the deep world knowledge and nuanced understanding of a much larger model (like a 70B or a GPT-4 class model). Its judgments on such topics will be more superficial.

2.  **Susceptibility to Bias:** Like all LLMs, it can be prone to common evaluation biases, such as:
    *   **Positional Bias:** Tending to prefer the first or second answer it's shown, regardless of quality.
    *   **Verbosity Bias:** Preferring longer, more detailed answers even if they aren't more correct.
    *   **Affirmation Bias:** Agreeing with the user's framing or the premise of the question.
    These biases might be more pronounced in a smaller model.

3.  **Fact-Checking Limitations:** A 3B model has a smaller knowledge base and is more likely to hallucinate or fail to identify factual errors in the texts it's judging. You cannot rely on it as a sole arbiter of factual accuracy.





In [None]:
# Load the evaluation model and tokenizer (AI Judge)
evaluation_model = "meta-llama/Llama-3.2-3B-Instruct" # "alpindale/Llama-3.2-3B-Instruct"
eval_tokenizer = AutoTokenizer.from_pretrained(evaluation_model)
eval_model = AutoModelForCausalLM.from_pretrained(
    evaluation_model,
    torch_dtype=dtype,
    device_map=device,
    use_cache=True
)

In [19]:
# Function to generate the prompt for the AI judge
def evaluation_prompt(question, expected_answer, generated_answer):
  prompt = f"""You are an impartial evaluator.
Your task is to determine if the "Generated Answer", even if too verbose, correctly answers the "Question".
The "Expected Answer" is provided as a reference for the correct information.

Question:
{question}

Expected Answer:
{expected_answer}

Generated Answer:
{generated_answer}

Is the "Generated Answer" correct? Please answer with "Yes" or "No".
"""
  return prompt

# Evaluate generated answers using the AI judge
ai_judge = []

for i in tqdm(range(len(results_df))):
  question = results_df.iloc[i]['question']
  expected_answer = results_df.iloc[i]['expected_answer']
  generated_answer = results_df.iloc[i]['generated_answer']
  prompt = evaluation_prompt(question, expected_answer, generated_answer)

  inputs = eval_tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(device)

  # Generate a response from the AI judge
  outputs = eval_model.generate(
      inputs,
      pad_token_id=eval_tokenizer.eos_token_id,
      max_new_tokens=100,
      temperature=0.1,
      do_sample=True,
  )

  generated_token_ids = outputs[0, inputs.shape[-1] :]
  generated_text = eval_tokenizer.decode(
      generated_token_ids,
      skip_special_tokens=True,
  ).strip()

  # Determine correctness based on the AI judge's response
  ai_judge.append("yes" in generated_text.lower())

results_df["is_correct_ai_eval"] = ai_judge

  0%|          | 0/25 [00:00<?, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
100%|██████████| 25/25 [00:03<00:00,  6.31it/s]
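
The substring check used in the loop above treats any response containing the letters "yes" as a positive verdict, which also matches words such as "yesterday" or a "yes" that appears after a leading "No". A stricter parser is sketched below as one possible alternative; `parse_verdict` is a hypothetical helper, not part of the notebook, and it assumes the judge states its verdict as the first standalone "yes" or "no" in its reply.

```python
import re


def parse_verdict(judge_text):
    """Return True for 'yes', False for 'no', None when neither appears.

    Only the first standalone yes/no word is considered, so 'yesterday'
    or a later 'yes' inside an explanation does not count.
    """
    match = re.search(r"\b(yes|no)\b", judge_text.lower())
    if match is None:
        return None
    return match.group(1) == "yes"


examples = [
    "Yes.",
    "No. Answering yes would require naming Mycroft Holmes.",
    "The story was published yesterday.",
]

for text in examples:
    substring_check = "yes" in text.lower()
    print(f"{text!r}: substring={substring_check}, parsed={parse_verdict(text)}")
```

Returning `None` for unclear replies makes it possible to count or inspect them separately instead of silently scoring them as incorrect.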


In [20]:
# Calculate overall correctness metrics for each evaluation method
overall_keyword_accuracy = results_df['is_correct_keyword'].mean()
overall_semantic_accuracy = results_df['is_correct_semantic'].mean()
overall_ai_judge_accuracy = results_df['is_correct_ai_eval'].mean()

print(f"Overall Keyword Matching Accuracy: {overall_keyword_accuracy:.2f}")
print(f"Overall Semantic Similarity Accuracy (threshold=0.5): {overall_semantic_accuracy:.2f}")
print(f"Overall AI Judge Accuracy: {overall_ai_judge_accuracy:.2f}")

Overall Keyword Matching Accuracy: 0.24
Overall Semantic Similarity Accuracy (threshold=0.5): 0.52
Overall AI Judge Accuracy: 0.32


In [21]:
# Analyze correctness by difficulty for each evaluation method
difficulty_analysis_keyword = results_df.groupby('difficulty')['is_correct_keyword'].mean().reset_index()
difficulty_analysis_semantic = results_df.groupby('difficulty')['is_correct_semantic'].mean().reset_index()
difficulty_analysis_ai_judge = results_df.groupby('difficulty')['is_correct_ai_eval'].mean().reset_index()

print("\nKeyword Matching Accuracy by Difficulty:")
display(difficulty_analysis_keyword)

print("\nSemantic Similarity Accuracy by Difficulty (threshold=0.5):")
display(difficulty_analysis_semantic)

print("\nAI Judge Accuracy by Difficulty:")
display(difficulty_analysis_ai_judge)


Keyword Matching Accuracy by Difficulty:


| | difficulty | is_correct_keyword |
|---|---|---|
| 0 | Easy | 0.416667 |
| 1 | Hard | 0.0 |
| 2 | Medium | 0.111111 |



Semantic Similarity Accuracy by Difficulty (threshold=0.5):


| | difficulty | is_correct_semantic |
|---|---|---|
| 0 | Easy | 0.833333 |
| 1 | Hard | 0.25 |
| 2 | Medium | 0.222222 |



AI Judge Accuracy by Difficulty:


| | difficulty | is_correct_ai_eval |
|---|---|---|
| 0 | Easy | 0.583333 |
| 1 | Hard | 0.0 |
| 2 | Medium | 0.111111 |


In [24]:
# Calculate average perplexity by difficulty
average_perplexity_by_difficulty = results_df.groupby('difficulty')['perplexity'].mean().reset_index()

print("\nAverage Perplexity by Difficulty:")
display(average_perplexity_by_difficulty)


Average Perplexity by Difficulty:


| | difficulty | perplexity |
|---|---|---|
| 0 | Easy | 1.199247 |
| 1 | Hard | 1.329931 |
| 2 | Medium | 1.240532 |


In [23]:
# Display the detailed results DataFrame
display(results_df)

| | question | expected_answer | generated_answer | difficulty | perplexity | is_correct_keyword | semantic_similarity | is_correct_semantic | is_correct_ai_eval |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Who created the character of Sherlock Holmes? | Sir Arthur Conan Doyle | Arthur Conan Doyle | Easy | 1.167677 | False | 0.962578 | True | True |
| 1 | What is the name of Sherlock Holmes's enemy? | Professor Moriarty | Professor Moriarty | Easy | 1.310314 | True | 1.0 | True | True |
| 2 | Where does Sherlock Holmes live? | 221b Baker Street in London | 221B Baker Street. | Easy | 1.000381 | False | 0.901767 | True | False |
| 3 | Who is Sherlock Holmes's best friend? | Dr. John Watson | Dr. John Watson | Easy | 1.122368 | True | 1.0 | True | True |
| 4 | What is the name of Sherlock's older brother? | Mycroft Holmes | William. | Easy | 1.244261 | False | 0.264208 | False | False |
| 5 | Who is the landlady of 221b Baker Street? | Mrs. Hudson | Mrs. Hudson | Easy | 1.138389 | True | 1.0 | True | True |
| 6 | What musical instrument does Sherlock Holmes l... | The violin | The piano. | Easy | 1.136622 | False | 0.587473 | True | False |
| 7 | In which Sherlock Holmes short story do we mee... | A Scandal In Bohemia | The Adventure of the Dancing Men. | Medium | 1.395083 | False | 0.20019 | False | False |
| 8 | Which actor plays Sherlock Holmes in the TV se... | Benedict Cumberbatch | Benedict Cumberbatch | Easy | 1.229793 | True | 1.0 | True | True |
| 9 | Who did Dr. Watson marry? | Mary Morstan | Dr. John Watson. | Medium | 1.5342 | False | 0.315029 | False | False |
