# CS 4650/7650 - Natural Language Processing - HW - 0 
Georgia Tech, Summer 2025 (Instructor: Kartik Goyal)

In this assignment, you will be implementing different evaluation methodologies for Large Language Models (LLMs) across three key tasks: Multiple Choice Questions (MCQ), Machine Translation (MT), and Short Story Generation (SSG). 

These tasks are intended to be ascending order of "open-endedness". 
In MCQ, there is a clear notion of a correct answer, and the answer space is constrained to finitely many options. 
In MT, there are infinitely many correct translations, but the final answer must have the same semantic meaning as the reference sentence. 
In SSG, the final answer can have different semantic meanings, but some collection of generations may be more desirable (e.g. coherent, diverse) than others. 

This assignment will cover fundamental evaluation metrics, prompt engineering techniques, and analysis of model performance across each of these different tasks. 





DO NOT CHANGE the names of any of the files and contents outside the cells where you have to write code.

NOTE: DO NOT USE EXTERNAL LIBRARIES FOR THIS ASSIGNMENT OTHER THAN THE ONES ALREADY IMPORTED

<!-- TODO: add deadlines -->

The assignment is broken down into 5 Sections. The sections are as follows:

| Section | Part                                      | Points |
|---------|-------------------------------------------|--------|
| 0       | Setup                                     | 0      |
| 1       | Utility Classes                           | 6      |
| 2       | Multiple Choice Questions (MCQ) Evaluation| 15     |
| 3       | Machine Translation Evaluation            | 14     |
| 4       | Short Story Generation Evaluation         | 17     |
| 5       | Sampling Hyperparameters and Prompt Optimization (BONUS) | 4 |
| Total        |                                        | 50 |
| Bonus        |                                        | 8 |

<!-- TODO: assign points appropriately. -->



## 0. Setup [0 points]

In [4]:
# DO NOT CHANGE THIS CELL

%load_ext autoreload
%autoreload 2

# Check what version of Python is running
import os
import sys 
print(sys.version)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
3.11.13 (main, Jun  5 2025, 13:12:00) [GCC 11.2.0]


In [2]:
# DO NOT CHANGE THIS CELL
# RUN THIS CELL ONLY IF RUNNING ON GOOGLE COLAB

from google.colab import drive
drive.mount('/content/drive')

%pip install -U bitsandbytes
%pip install -U datasets pyarrow fsspec

# REPLACE the path to folder where the notebook is located
%cd /content/drive/My\ Drive/path/to/your/notebook/folder


ModuleNotFoundError: No module named 'google'

In [5]:
# DO NOT CHANGE THIS CELL
# RUN THIS CELL ONLY IF RUNNING ON PACE-ICE


# override the huggingface cache path and nltk cache path
dirs = {
    "HF_HOME":"~/scratch/hf_cache",
    "TRITON_CACHE_DIR":"~/scratch/triton_cache",
    "TORCHINDUCTOR_CACHE_DIR":"~/scratch/inductor_cache",
    'NLTK_DATA':"~/scratch/nltk_data"
}

for name in dirs:
    d = dirs[name]
    path = os.path.expanduser(d)
    print(name)
    print(path)
    os.makedirs(path, exist_ok=True)
    # making sure the cache dirs are rwx for owner
    os.chmod(path, 0o700)
    os.environ[name] = path


HF_HOME
/home/jblevins32/scratch/hf_cache
TRITON_CACHE_DIR
/home/jblevins32/scratch/triton_cache
TORCHINDUCTOR_CACHE_DIR
/home/jblevins32/scratch/inductor_cache
NLTK_DATA
/home/jblevins32/scratch/nltk_data


In [6]:
# export 

# DO NOT CHANGE THIS CELL
# Importing required libraries - DO NOT CHANGE THIS CELL

import os 
import sys 
import json
import pandas as pd
import datetime
from collections import Counter, defaultdict
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple, Optional
from tqdm import tqdm
import random
import numpy as np

import torch
import torch.nn.functional as F

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer

  from .autonotebook import tqdm as notebook_tqdm


In [None]:
# supply your huggingface token to access llama and qwen
os.environ['HF_TOKEN'] = ''

In [9]:
# DO NOT CHANGE THIS CELL
# Login to Huggingface Hub to access gated models

import os
from huggingface_hub import login

# Get the Huggingface token from environment variable
hf_token = os.environ.get("HF_TOKEN")

if hf_token:
    # Login to Huggingface Hub with the token
    login(token=hf_token)
    print("Successfully logged in to Huggingface Hub")
else:
    print("Warning: HF_TOKEN environment variable not found. You may not be able to access gated models.")


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Successfully logged in to Huggingface Hub


In [10]:
# export 

# DO NOT CHANGE THIS CELL

# Import tokenizer for n-gram matching 

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     /home/jblevins32/scratch/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/jblevins32/scratch/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [11]:
# export 

# DO NOT CHANGE THIS CELL
# Defining global constants - DO NOT CHANGE THESE VALUES

RANDOM_SEED = 42
PADDING_VALUE = 0
UNK_VALUE     = 1
BATCH_SIZE = 128

torch.manual_seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

working_dir = os.getcwd()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


  return torch._C._cuda_getDeviceCount() > 0


In [12]:
# This is how we select a GPU if it's available on your computer or in the Colab environment.
print('Device of execution - ', device)

Device of execution -  cpu


## 1. Utility Classes [6 points]

In this section, you will implement wrappers around the Huggingface transformers and SentenceTransformers APIs so that they are easier to use later in the notebook. 

### 1.1. LLM Wrapper Class [5 points]


In this section, we'll implement a wrapper around the Huggingface transformers API. 

The following are provided for you: 

- Methods to load the tokenizer and model from Huggingface 
- The `generate()` method to produce text completions from the LLM
- The `perplexity()` method for computing "how confused a model is" for a given piece of text. (See section 4.2 for details.)

You will implement the `logits()` method. We will need this for the MCQ task.

Logits are the raw, unnormalized scores output by the last layer of the language model.
These scores represent the model's prediction for which token in the vocabulary is most likely to follow after the input text.

The `logits()` method will extract these raw scores from the model. For multiple-choice questions (MCQ), we select the answer by calculating the logit (either for all the vocabulary items or for the given answer choices) and choosing the most likely one out of all of them. 


In [13]:
# export 

# DO NOT CHANGE THIS CELL

@dataclass
class LLMGenerationConfig:
    """
    Configuration class for LLM generation parameters.
    This is for convenience for keeping track of the generation parameters.
    http://brainiac:9657/tree?token=9d0d4afb1bc218c5d85f6418e7d1e8007ef0dc1b2a44bd28
    Args:
        temperature (float): Controls randomness in sampling. Higher values make output more random.
        max_new_tokens (int): Maximum number of new tokens to generate.
    """
    temperature: float = 0.7
    max_new_tokens: int = 100


In [14]:
# export 

class LLM:
    """
    A wrapper class for Hugging Face language models that provides a unified interface
    for text generation, logit computation, and perplexity calculation.
    
    If transformers library is not available, falls back to deterministic stubs.
    """

    def __init__(self, hf_id: str = "gpt2", device: str = None, quantize: bool = True):
        """
        Initialize the LLM wrapper.
        
        Args:
            hf_id (str): Hugging Face model identifier
            device (str): Device to load model on ('cuda', 'cpu', 'mps')
        """      
        self.hf_id = hf_id
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        
        # Use auto-loading with device_map="auto" for faster loading and automatic memory management
        self.tokenizer = AutoTokenizer.from_pretrained(hf_id)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
            
        if device == 'cpu':
            self.model = AutoModelForCausalLM.from_pretrained(
                hf_id,
                torch_dtype=torch.float16,  # Use half precision for faster loading and less memory
                load_in_8bit=False
            ).to(self.device).eval()
        else:
            self.model = AutoModelForCausalLM.from_pretrained(
                hf_id,
                device_map="auto",  # Automatically determine optimal device mapping
                torch_dtype=torch.float16,  # Use half precision for faster loading and less memory
                load_in_8bit=quantize,  # Enable 4-bit quantization for even more memory efficiency
            ).eval()
                

    @torch.inference_mode()
    def generate(self, prompt: str, task_type: str = "mcq", config: LLMGenerationConfig = None) -> str:
        """
        Generate text continuation for the given prompt using the underlying language model.
        
        This method takes a text prompt and generates additional text that continues from
        the prompt in a coherent manner. The generation process can be controlled through
        various parameters specified in the config object.
        
        The method tokenizes the input prompt, passes it through the model, and then
        decodes the generated token IDs back to text, excluding the original prompt tokens.
        
        Args:
            prompt (str): Input text prompt that the model will continue from
            config (LLMGenerationConfig): Configuration object containing generation parameters
                such as temperature, top_p, top_k, and max_new_tokens. If None, default
                parameters will be used.
            
        Returns:
            str: Generated text continuation without the original prompt. The text is
                stripped of leading/trailing whitespace and special tokens are removed
                during decoding.
        """

        # set generation config 
        assert task_type in ["mcq", "mt", "ssg"], "Invalid task_type. Must be one of: mcq, mt, ssg"
        
        if config is not None: 
            # Use the provided config
            pass
        elif task_type == "mcq":
            config = LLMGenerationConfig(
                temperature=0.0,
                max_new_tokens=4,
            )
        elif task_type == "mt":
            config = LLMGenerationConfig(
                temperature=0.0,
                max_new_tokens=200, # max llama=189, qwen=188 on test set; mt contains longer sentences
            )
        elif task_type == "ssg":
            config = LLMGenerationConfig(
                temperature=0.7,
                max_new_tokens=100, # max llama=76, qwen=75 on all five sentences
                )   
            
        

        # tokenize the prompt   
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.to(self.device)
        
        # generate the output
        if config.temperature > 0:
            output = self.model.generate(
                input_ids,
                do_sample=True, 
                temperature=config.temperature,
                max_length=input_ids.shape[1] + config.max_new_tokens,
                pad_token_id=self.tokenizer.eos_token_id,
            )
        else:
            output = self.model.generate(
                input_ids,
                do_sample=False, 
                max_length=input_ids.shape[1] + config.max_new_tokens,
                pad_token_id=self.tokenizer.eos_token_id,
            )
        
        generated = self.tokenizer.decode(
            output[0, input_ids.shape[1]:], 
            skip_special_tokens=True
        )
        return generated.strip()

    @torch.inference_mode()
    def logits(self, prompt: str) -> torch.Tensor:
        """
        Get next-token logits for the given prompt.
        
        This method computes and returns the logits (raw, unnormalized prediction scores) 
        for the next token that would follow the given prompt. These logits represent the model's
        prediction distribution over the entire vocabulary for the next token position.
        
        Args:
            prompt (str): Input text prompt for which to compute next-token predictions
            
        Returns:
            torch.Tensor: A tensor of shape (vocab_size,) containing the logits for each
                possible next token in the vocabulary. Higher values indicate tokens the
                model considers more likely to follow the prompt.
        """
        tokens = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        outputs = self.model(**tokens)
        # Return logits for the last token position


        #### YOUR CODE HERE #### 
        return outputs.logits[:,-1,:]
        
        #### END YOUR CODE #### 



    ### DO NOT CHANGE THIS FUNCTION ###
    @torch.inference_mode()
    def perplexity(self, text: str) -> float:
        """
        Calculate perplexity of the given text.
        
        Perplexity is a measurement of how well a probability model predicts a sample.
        Lower perplexity indicates the model is better at predicting the text.
        It is calculated as the exponentiated average negative log-likelihood of a sequence.
        
        Args:
            text (str): Input text for which to calculate perplexity
        Returns:
            float: Perplexity value
        """
        if self.model is None:
            return 100.0  # Fixed stub value

        encodings = self.tokenizer(text, return_tensors="pt").to(self.device)
        
        with torch.no_grad():
            outputs = self.model(**encodings)
            logits = outputs.logits
        
        # Shift logits and labels for next-token prediction
        shift_logits = logits[:, :-1].contiguous()
        shift_labels = encodings.input_ids[:, 1:].contiguous()
        
        # Calculate cross-entropy loss
        loss_fct = torch.nn.CrossEntropyLoss(ignore_index=self.tokenizer.pad_token_id)
        loss = loss_fct(
            shift_logits.view(-1, shift_logits.size(-1)), 
            shift_labels.view(-1)
        )
        
        return np.exp(loss.item())

In the following cells, let's load two different LLMs and test each of the methods we implemented in the previous cell. 

In [17]:
# DO NOT CHANGE THIS CELL

llama = LLM(hf_id="meta-llama/Llama-3.1-8B-Instruct", device=device)

ValueError: Using a `device_map`, `tp_plan`, `torch.device` context manager or setting `torch.set_default_device(device)` requires `accelerate`. You can install it with `pip install accelerate`

In [18]:
# DO NOT CHANGE THIS CELL
prompt = "Hello, how are you?"

# Let's test the generate method using greedy decoding 
assert llama.device.type == "cuda", "Device is not loaded to cuda"
assert llama.generate(prompt, task_type="mcq") == "I am doing well", "Greedy decoding is incorrect"

assert llama.logits(prompt).shape[0] == 128256, "Logit shape is incorrect"
assert torch.argmax(llama.logits(prompt)) == 358, "Logit is incorrect"
assert np.isclose(llama.perplexity(prompt), 17.969428099556087, atol=1e-1), "Perplexity is incorrect"
print("All tests passed!")

NameError: name 'llama' is not defined

In [None]:
# DO NOT CHANGE THIS CELL

qwen = LLM(hf_id="Qwen/Qwen2.5-7B-Instruct", device=device)

In [None]:
# DO NOT CHANGE THIS CELL

prompt = "Hello, how are you?"

# Let's test the generate method using greedy decoding 
assert qwen.device.type == "cuda", "Device is not loaded to cuda"
assert qwen.generate(prompt, task_type="mcq") == "I'm sorry,", "Greedy decoding is incorrect"

assert qwen.logits(prompt).shape[0] == 152064, "Logit shape is incorrect"
assert torch.argmax(qwen.logits(prompt)) == 358, "Logit is incorrect"
assert np.isclose(qwen.perplexity(prompt), 7.353064664668671, atol=1e-1), "Perplexity is incorrect"
print("All tests passed!")

### 1.2. Embedding Model [1 point]

Next, we will implement a wrapper around the Huggingface SentenceTransformer API for generating embeddings. 
Embedding models convert text into dense vector representations (embeddings) that capture semantic meaning.
These vectors allow us to measure similarity between texts in a high-dimensional space.
#
Key points about embedding models:
1. They transform variable-length text inputs into fixed-dimension vectors
2. They have a maximum context length, so longer inputs will be truncated
3. Similar texts will have embeddings that are close to each other in the vector space
#
We'll use these embeddings when evaluating LLM outputs based on semantic similarity rather than
exact string matching, which is particularly useful for tasks like machine translation and
short story generation where multiple valid outputs are possible.


In [None]:
# export 

# DO NOT CHANGE THIS CELL

class EmbeddingModel:
    """
    A wrapper around the Huggingface SentenceTransformer API for generating embeddings.
    This model creates semantic embeddings that can be used for measuring similarity
    between texts.
    """

    def __init__(self, hf_id: str = "sentence-transformers/all-MiniLM-L6-v2", dim: int = None):
        """
        Initialize the embedding model with the specified model ID.
        
        Args:
            hf_id (str): Hugging Face model ID for the SentenceTransformer model.
                         Default is "sentence-transformers/all-MiniLM-L6-v2".
            dim (int): Not used for SentenceTransformer models as the dimension is
                       determined by the model itself, but kept for API compatibility.
        """
        self.model = SentenceTransformer(hf_id,  trust_remote_code=True)
        self.dim = self.model.get_sentence_embedding_dimension()

    def embed(self, text: str) -> torch.Tensor:
        """
        Create an embedding for the given text using the SentenceTransformer model.
        
        Args:
            text (str): Input text to embed. Can be of any length.
            
        Returns:
            torch.Tensor: Embedding vector representing the semantic content of the input text.
        """
        # SentenceTransformer returns numpy array, convert to torch tensor
        embedding = self.model.encode(text, convert_to_tensor=True)
        return embedding

LaBSE (Language-agnostic BERT Sentence Embedding) is a multilingual embedding model specifically trained for machine translation tasks. It can encode sentences from 109 different languages into a shared embedding space, allowing for effective cross-lingual similarity comparison. This makes it particularly useful for evaluating machine translation outputs by measuring semantic similarity between translations and references, rather than relying on exact string matching.


In the following cells, generate embeddings for the given sentences. 

In [None]:
# DO NOT CHANGE THIS CELL 

labse = EmbeddingModel(hf_id="sentence-transformers/LaBSE")

In [None]:
sentence1 = "Hello, how are you?"
sentence2 = "Goodbye, see you later!"
sentence3 = "What is the weather like today?"


### YOUR CODE HERE ### 
pass

### END YOUR CODE ### 


assert embedding1.shape == embedding2.shape
assert embedding1.shape == (768,)
assert embedding2.shape == (768,)
assert embedding3.shape == (768,)

assert torch.isclose(torch.cosine_similarity(embedding1, embedding2, dim=0), torch.tensor(0.4207), atol=1e-1)
assert torch.isclose(torch.cosine_similarity(embedding1, embedding1, dim=0), torch.tensor(1.0), atol=1e-3)
assert torch.isclose(torch.cosine_similarity(embedding2, embedding2, dim=0), torch.tensor(1.0), atol=1e-3)

print("All tests passed!")

The following is an embedding model from Alibaba's GTE (General Text Embedding) family. 
It is a multilingual embedding model supporting 70 languages.

In the following cells, generate sentences for the given sentences. 

In [None]:
# DO NOT CHANGE THIS CELL 

gte = EmbeddingModel(hf_id="Alibaba-NLP/gte-multilingual-base")

In [None]:
### YOUR CODE HERE ### 
pass

### END YOUR CODE ### 


assert embedding1.shape == embedding2.shape
assert embedding1.shape == (768,)
assert embedding2.shape == (768,)
assert embedding3.shape == (768,)


assert torch.isclose(torch.cosine_similarity(embedding1, embedding2, dim=0), torch.tensor(0.5831), atol=1e-1)
assert torch.isclose(torch.cosine_similarity(embedding1, embedding1, dim=0), torch.tensor(1.0), atol=1e-3)
assert torch.isclose(torch.cosine_similarity(embedding2, embedding2, dim=0), torch.tensor(1.0), atol=1e-3)

print("All tests passed!")

## 2. Multiple Choice Questions (MCQ) Evaluation [9 points Programming + 4 points Non-Programming + 2 BONUS points]

In this secton, we'll be evaluating LLMs on the [MMLU (Massive Multitask Language Understanding)](https://arxiv.org/abs/2009.03300), a benchmark dataset consisting of multiple choice questions. 
It covers a wide range of subjects including mathematics, history, computer science, law, and more, making it a comprehensive test of an LLM's knowledge and reasoning abilities.
MCQ benchmarks are usually reported using accuracies which simply mean how many questions the model answered correctly. 

We'll implement two different evaluation approaches:

1. **Regex-based Accuracy**: A simple approach that extracts answers from model outputs using regular expressions, then compares them to ground truth answers.

2. **Logit-based Accuracy**: A more sophisticated approach that directly accesses the model's internal probability distributions (logits) to determine which answer choice the model predicts as most likely.

Each approach has its advantages and limitations, which we'll explore throughout this section.





### 2.1. Regex-based Accuracy [2 points]

In the following cell, implement a regular expression (regex) which extracts the first letter generated by the LLM and compares it with the reference answer. 

In [None]:
# export 

### YOUR CODE HERE ### 

_mcq_regex = None

### END YOUR CODE ### 

def compute_regex_accuracy(generation: str, reference: str) -> tuple[float, str]:
    """
    Extract the first letter A-D from the generation and compare with reference.
    
    Args:
        generation (str): Generated text from the model
        reference (str): Ground truth answer (A, B, C, or D)
        
    Returns:
        float: 1.0 if correct, 0.0 if incorrect
        str: The predicted letter (A, B, C, or D)
    """
    match = _mcq_regex.search(generation)
    regex_pred = match.group(1).upper() if match else None
    return float(regex_pred == reference.strip().upper()), regex_pred

In [None]:
generation = "The answer is B"
reference = "B"
accuracy, pred = compute_regex_accuracy(generation, reference)
assert accuracy == 1.0
assert pred == "B"
print(f"Test passed - Basic correct")

generation = "the answer is clearly d"
reference = "D"
accuracy, pred = compute_regex_accuracy(generation, reference)
assert accuracy == 1.0
assert pred == "D"
print(f"Test passed - Lowercase")

generation = "The correct option is (B)"
reference = "B"
accuracy, pred = compute_regex_accuracy(generation, reference)
assert accuracy == 1.0
assert pred == "B"
print(f"Test passed - Answer in parentheses")

print("All tests passed!")

### 2.2. Logit-based Accuracy. [5 points]

In the following cell, implement a function which calculates the logit-based accuracy for a single question using the logits returned by the model. 
That is, given the logits of the model, find the 

The function should support an optional `valid_letters` argument such that, 
when a list is passed to it, the maximum is computed over these valid letters (as opposed to, over the entire vocabulary).
For example, if the list `[A,B,C,D]` is passed to the function, then we will get the accuracy based on which letters the model thinks is most likely *out of the four letters contained in `valid_letters`*. 


In [None]:
# export 
LETTERS = ["A", "B", "C", "D"]

def compute_logit_accuracy(logits: torch.Tensor, reference: str, model: LLM, valid_letters: List[str] = None) -> Tuple[float, str]:
    """
    Compute accuracy based on logits for the first token generated by the LLM.
    
    Args:
        logits (torch.Tensor): Logits from the model for the first token
        reference (str): Ground truth answer (A, B, C, or D)
        model (LLM): The language model used for generation, needed for tokenizer access
        valid_letters (List[str]): List of valid letters to restrict argmax to (e.g., ['A', 'B', 'C', 'D'])
                                If None, uses the entire vocabulary
        
    Returns:
        float: 1.0 if the token with highest logit matches reference, 0.0 otherwise
        str: The predicted letter (If `valid_letters` is None, then the predicted letter is not constrained to be one of the valid letters, but can be any letter in the vocabulary)
    """

    ### YOUR CODE HERE ### 
    
    if valid_letters is not None:
        # limit vocabulary to only the valid letters
        pass
    else:
        # use entire vocabulary
        pass
    
    # compute accuracy
    pass

    ### END YOUR CODE ### 

In [None]:
# DO NOT CHANGE THIS CELL

llama_token_ids = {
    letter: llama.tokenizer.encode(letter, add_special_tokens=False)[0] 
    for letter in LETTERS
}

qwen_token_ids = {
    letter: qwen.tokenizer.encode(letter, add_special_tokens=False)[0] 
    for letter in LETTERS
}

vocab_size = len(llama.tokenizer)
logits = torch.randn(vocab_size) * 2.0  
logits[llama_token_ids["B"]] = 15.0  
reference = "B"
accuracy, pred = compute_logit_accuracy(logits, reference, llama)
assert accuracy == 1.0
assert pred == "B"
print("Test passed - llama, correct answer")

vocab_size = len(llama.tokenizer)
logits = torch.randn(vocab_size) * 2.0  
logits[2196] = 20.0 
reference = "B"
accuracy, pred = compute_logit_accuracy(logits, reference, llama)
assert accuracy == 0.0
assert pred == "context"
print("Test passed - llama, wrong answer")


vocab_size = len(llama.tokenizer)
logits = torch.randn(vocab_size) * 2.0  
logits[2196] = 20.0 
reference = "B"
accuracy, pred = compute_logit_accuracy(logits, reference, llama)
assert accuracy == 0.0
assert pred == "context"
print("Test passed - llama, wrong answer")


vocab_size = len(llama.tokenizer)
logits = torch.randn(vocab_size) * 2.0  
logits[llama_token_ids["C"]] = 15.0  
reference = "C"
accuracy, pred = compute_logit_accuracy(logits, reference, llama, LETTERS)
assert accuracy == 1.0
assert pred == "C"
print("Test passed - llama, valid letter C")


vocab_size = len(llama.tokenizer)
logits = torch.randn(vocab_size) * 2.0  
logits[llama_token_ids["A"]] = 15.0  
reference = "A"
accuracy, pred = compute_logit_accuracy(logits, reference, llama, LETTERS)
assert accuracy == 1.0
assert pred == "A"
print("Test passed - llama, valid letter A")



vocab_size = len(qwen.tokenizer)
logits = torch.randn(vocab_size) * 2.0  
logits[qwen_token_ids["D"]] = 15
reference = "D"
accuracy, pred = compute_logit_accuracy(logits, reference, qwen)
assert accuracy == 1.0
assert pred == "D"
print("Test passed - qwen, correct answer")

vocab_size = len(qwen.tokenizer)
logits = torch.randn(vocab_size) * 2.0
logits[670] = 20.0  
reference = "A"
accuracy, pred = compute_logit_accuracy(logits, reference, llama)
assert accuracy == 0.0
assert pred == "uct"
print("Test passed - qwen, wrong answer")


vocab_size = len(qwen.tokenizer)
logits = torch.randn(vocab_size) * 2.0  
logits[qwen_token_ids["A"]] = 15.0  
reference = "A"
accuracy, pred = compute_logit_accuracy(logits, reference, qwen, LETTERS)
assert accuracy == 1.0
assert pred == "A"
print("Test passed - qwen, valid letter A")

vocab_size = len(qwen.tokenizer)
logits = torch.randn(vocab_size) * 2.0  
logits[qwen_token_ids["B"]] = 15.0  
reference = "B"
accuracy, pred = compute_logit_accuracy(logits, reference, qwen, LETTERS)
assert accuracy == 1.0
assert pred == "B"
print("Test passed - qwen, valid letter B")

del logits
print("All tests passed!")

### 2.3. Run MCQ Evaluation [2 points - Non Programming]

In the following cells run the evaluation on the LLMs with the MMLU benchmark.
In the final cell, comment on what you observe from the evaluation results.

In [None]:
# DO NOT CHANGE THIS CELL

# Load MMLU dataset from Huggingface
mmlu_test_raw = load_dataset("lighteval/mmlu", "high_school_geography", split="test")

# preprocess
mmlu_test_data = []

# See: `prompts/mcq/default.txt` for the prompt template format 
for item in mmlu_test_raw:
    eval_item = {
        "question": item["question"], 
        "option_A": item["choices"][0],
        "option_B": item["choices"][1],
        "option_C": item["choices"][2],
        "option_D": item["choices"][3],
        "reference": LETTERS[item["answer"]],
        "reference_idx": item["answer"]
    }
    mmlu_test_data.append(eval_item)


In [None]:
# DO NOT CHANGE THIS CELL

config = LLMGenerationConfig(
    temperature=0,
    max_new_tokens=5
)
prompt_template = open("prompts/mcq/default.txt").read()

llama_mmlu_scores = defaultdict(list)
for item in tqdm(mmlu_test_data, desc="Evaluating dataset"):
    prompt = prompt_template.format(**item) if prompt_template else item["question"]
    reference, reference_idx = item["reference"], item["reference_idx"]
    
    # compute regex accuracy
    hypothesis = llama.generate(prompt, config=config)
    regex_accuracy = compute_regex_accuracy(hypothesis, reference)
    llama_mmlu_scores["regex_accuracy"].append(regex_accuracy)

    # compute logit accuracy
    logits = llama.logits(prompt)
    logit_accuracy = compute_logit_accuracy(logits, reference, llama)
    llama_mmlu_scores["logit_accuracy"].append(logit_accuracy)

    # compute logit accuracy when restricted to valid letters
    logits = llama.logits(prompt)
    logit_accuracy = compute_logit_accuracy(logits, reference, llama, LETTERS)
    llama_mmlu_scores["logit_accuracy_restricted"].append(logit_accuracy)


for metric_name, metric_scores in llama_mmlu_scores.items():
    numeric_scores = [score for score, _ in metric_scores]
    print(f"{metric_name}: {np.mean(numeric_scores):.3f}")

In [None]:
# DO NOT CHANGE THIS CELL

config = LLMGenerationConfig(
    temperature=0,
    max_new_tokens=5
)
prompt_template = open("prompts/mcq/default.txt").read()

qwen_mmlu_scores = defaultdict(list)
for item in tqdm(mmlu_test_data, desc="Evaluating dataset"):
    prompt = prompt_template.format(**item) if prompt_template else item["question"]
    reference, reference_idx = item["reference"], item["reference_idx"]
    
    # compute regex accuracy
    hypothesis = qwen.generate(prompt, config=config)
    regex_accuracy = compute_regex_accuracy(hypothesis, reference)
    qwen_mmlu_scores["regex_accuracy"].append(regex_accuracy)

    # compute logit accuracy
    logits = qwen.logits(prompt)
    logit_accuracy = compute_logit_accuracy(logits, reference, qwen)
    qwen_mmlu_scores["logit_accuracy"].append(logit_accuracy)

    # compute logit accuracy when restricted to valid letters
    logits = qwen.logits(prompt)
    logit_accuracy = compute_logit_accuracy(logits, reference, qwen, LETTERS)
    qwen_mmlu_scores["logit_accuracy_restricted"].append(logit_accuracy)


for metric_name, metric_scores in qwen_mmlu_scores.items():
    numeric_scores = [score for score, _ in metric_scores]
    print(f"{metric_name}: {np.mean(numeric_scores):.3f}")

YOUR RESPONSE HERE.

### 2.5  "None of the Above" [2 points Programming + 2 point Non Programming]

In this evaluation section, we assess the model's ability to detect when the correct answer is missing from a set of provided options—a key aspect of robust question-answering systems.

Your tasks are as follows:

1. Create a modified version of the MMLU dataset by randomly replacing one answer choice with 'None of the above'.
2. Record which questions have had their true correct answer replaced, as these will serve as the ground truth for when 'None of the above' is correct.
3. Evaluate the model's performance using precision, recall, and F1-score:

    **Precision** measures the accuracy when the model selects 'None of the above':
    $$\mathrm{Precision} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Positives}}$$

    **Recall** quantifies how often the model correctly identifies cases where the correct answer is missing:
    $$\mathrm{Recall} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Negatives}}$$

    **F1-score**, the harmonic mean of precision and recall, offers a balanced performance measure:
    $$\mathrm{F1} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

For clarity, the evaluation categories are defined as:

- **True Positive (TP)**: The model selects 'None of the above' when the correct answer was replaced.
- **False Positive (FP)**: The model selects 'None of the above' even though the correct answer remains among the options.
- **True Negative (TN)**: The model selects an option other than 'None of the above' when the original correct answer is still available.
- **False Negative (FN)**: The model selects an option other than 'None of the above' even though the correct answer was replaced.

These metrics will help determine whether the model can correctly identify scenarios where no provided answer is correct. When preparing the `modified_mmlu_test_data`, feel free to include extra keys to capture which questions were modified and how the correct answers were affected.

In [None]:
# Load MMLU dataset from Huggingface
mmlu_test_raw = load_dataset("lighteval/mmlu", "high_school_geography", split="test")

LETTERS = ["A", "B", "C", "D"]

# preprocess
modified_mmlu_test_data = []


### YOUR CODE HERE ### 

# See: `prompts/mcq/default.txt` for the prompt template format 
# for each test case, randomly select one of the four options to replace with "None of the above"
pass

### END YOUR CODE ### 


Performance using LLAMA 3.1-8B

In [None]:
# DO NOT CHANGE THIS CELL

config = LLMGenerationConfig(
    temperature=0,
    max_new_tokens=5
)
prompt_template = open("prompts/mcq/default.txt").read()

### YOUR CODE HERE ### 
# USING LLAMA, for each test case in your modified mmlu test data
# calculate the following and store the result for metric calculation later
    # 1. compute regex accuracy
    # 2. compute regex stats
    # 3. compute logit accuracy
    # 4. compute logit stats
    # 5. compute logit accuracy when restricted to valid letters
    # 6. compute logit stats


# compute regex precision 


# compute regex recall 


# compute regex f1 score


# compute logit precision 


# compute logit recall 


# compute logit f1 score


# compute logit precision when restricted to valid letters


# compute logit recall when restricted to valid letters 


# compute logit f1 score when restricted to valid letters  


### END YOUR CODE ### 


Performance using QWEN 2.5-7B

In [None]:
# DO NOT CHANGE THIS CELL

config = LLMGenerationConfig(
    temperature=0,
    max_new_tokens=5
)
prompt_template = open("prompts/mcq/default.txt").read()

### YOUR CODE HERE ### 
# USING QWEN, for each test case in your modified mmlu test data
# calculate the following and store the result for metric calculation later
    # 1. compute regex accuracy
    # 2. compute regex stats
    # 3. compute logit accuracy
    # 4. compute logit stats
    # 5. compute logit accuracy when restricted to valid letters
    # 6. compute logit stats


# compute regex precision 


# compute regex recall 


# compute regex f1 score


# compute logit precision 


# compute logit recall 


# compute logit f1 score


# compute logit precision when restricted to valid letters


# compute logit recall when restricted to valid letters 


# compute logit f1 score when restricted to valid letters  


### END YOUR CODE ### 

In the cell below, explain what you observe for each of the evaluation results.

YOUR RESPONSE HERE

### 2.6 Shuffling Choices. [BONUS - 2 points Non-Programming]

When evaluating LLMs on multiple choice questions, we need to be careful about memorization effects.
Memorization occurs when an LLM has seen very similar questions during training,
and can simply recall the correct answer rather than reasoning about the question.
#
For example, if an LLM was trained on practice tests that contained the same multiple choice
question with answers in the same order (A, B, C, D), it might memorize that "A" was correct without
actually understanding the question.
#
To help distinguish between true reasoning ability and memorization, we can randomly shuffle the order
of answer choices while preserving which answer is correct. This way, even if the LLM has seen the
question before, it needs to identify the correct answer based on content rather than position.
# 
Re-evaluate the two models in the shuffled case and report the resulting accuracies. Comment on whether the results differ from the accuracies you got for sections 2.4 and 2.5. 

In [None]:
# DO NOT CHANGE THIS CELL

# preprocess
shuffled_mmlu_test_data = []

# See: `prompts/mcq/default.txt` for the prompt template format 
for item in mmlu_test_raw:
    shuffled_indices = list(range(len(item["choices"])))
    eval_item = {
        "question": item["question"], 
        "option_A": item["choices"][shuffled_indices[0]],
        "option_B": item["choices"][shuffled_indices[1]],
        "option_C": item["choices"][shuffled_indices[2]],
        "option_D": item["choices"][shuffled_indices[3]],
        "reference": LETTERS[item["answer"]],
        "reference_idx": item["answer"]
    }
    shuffled_mmlu_test_data.append(eval_item)


In [None]:
# DO NOT CHANGE THIS CELL

config = LLMGenerationConfig(
    temperature=0,
    max_new_tokens=5
)
prompt_template = open("prompts/mcq/default.txt").read()

shuffled_llama_mmlu_scores = defaultdict(list)
for item in tqdm(shuffled_mmlu_test_data, desc="Evaluating dataset"):
    prompt = prompt_template.format(**item) if prompt_template else item["question"]
    reference, reference_idx = item["reference"], item["reference_idx"]
    
    # compute regex accuracy
    hypothesis = llama.generate(prompt, config=config)
    regex_accuracy = compute_regex_accuracy(hypothesis, reference)
    shuffled_llama_mmlu_scores["regex_accuracy"].append(regex_accuracy)

    # compute logit accuracy
    logits = llama.logits(prompt)
    logit_accuracy = compute_logit_accuracy(logits, reference, llama)
    shuffled_llama_mmlu_scores["logit_accuracy"].append(logit_accuracy)

    # compute logit accuracy when restricted to valid letters
    logits = llama.logits(prompt)
    logit_accuracy = compute_logit_accuracy(logits, reference, llama, LETTERS)
    shuffled_llama_mmlu_scores["logit_accuracy_restricted"].append(logit_accuracy)


for metric_name, metric_scores in shuffled_llama_mmlu_scores.items():
    numeric_scores = [score for score, _ in metric_scores]
    print(f"{metric_name}: {np.mean(numeric_scores):.3f}")

In [None]:
# DO NOT CHANGE THIS CELL

config = LLMGenerationConfig(
    temperature=0,
    max_new_tokens=5
)
prompt_template = open("prompts/mcq/default.txt").read()

qwen_mmlu_scores = defaultdict(list)
for item in tqdm(shuffled_mmlu_test_data, desc="Evaluating dataset"):
    prompt = prompt_template.format(**item) if prompt_template else item["question"]
    reference, reference_idx = item["reference"], item["reference_idx"]
    
    # compute regex accuracy
    hypothesis = qwen.generate(prompt, config=config)
    regex_accuracy = compute_regex_accuracy(hypothesis, reference)
    qwen_mmlu_scores["regex_accuracy"].append(regex_accuracy)

    # compute logit accuracy
    logits = qwen.logits(prompt)
    logit_accuracy = compute_logit_accuracy(logits, reference, qwen)
    qwen_mmlu_scores["logit_accuracy"].append(logit_accuracy)

    # compute logit accuracy when restricted to valid letters
    logits = qwen.logits(prompt)
    logit_accuracy = compute_logit_accuracy(logits, reference, qwen, LETTERS)
    qwen_mmlu_scores["logit_accuracy_restricted"].append(logit_accuracy)


for metric_name, metric_scores in qwen_mmlu_scores.items():
    numeric_scores = [score for score, _ in metric_scores]
    print(f"{metric_name}: {np.mean(numeric_scores):.3f}")

YOUR RESPONSE HERE.

## 3. Machine Translation Evaluation [11 points Programming + 3 points Non-Programming]

In this section, we'll be evaluating the Llama model on German to English machine translation tasks. In machine translation, models automatically converting text from one language to another while preserving meaning and fluency.

We'll implement three different evaluation metrics:

1. **N-gram Overlap**: A simple metric that counts the number of matching n-grams (sequences of n consecutive words) between the machine translation and reference translation. It provides a basic measure of lexical similarity.
#
2. **BLEU Score**: A precision-based metric that measures the overlap of n-grams between the machine translation and reference translations. It's one of the most widely used metrics in machine translation evaluation.
#
3. **Embedding-based Similarity**: Instead of direct word matching, we use an embedding model to convert sentences into vector representations that capture semantic meaning. 

Each metric has its strengths and limitations, which we'll explore throughout this section. We'll use these metrics to evaluate the translation capabilities of our model.


### 3.1. Helper Functions [2 points Programming]


In this section, we define some useful helper functions for evaluating the LLMs. 


N-grams are contiguous sequences of n items from a given sample of text. 
For example, in the sentence "I love natural language processing":
- 1-grams (unigrams): ["I", "love", "natural", "language", "processing"]
- 2-grams (bigrams): [("I", "love"), ("love", "natural"), ("natural", "language"), ("language", "processing")]
- 3-grams (trigrams): [("I", "love", "natural"), ("love", "natural", "language"), ("natural", "language", "processing")]
#
N-grams are useful for various NLP tasks including:
- Language modeling: Predicting the next word based on previous words
- Text similarity: Comparing documents based on shared n-grams
- Machine translation evaluation: Metrics like BLEU use n-gram overlap
#
Implement the `_ngrams` function below which extracts all possible n-grams from a sequence of tokens. 


In [None]:
# export 

# DO NOT CHANGE THIS CELL

def preprocess_text(text: str) -> List[str]:
    """
    Tokenize a text string into a list of tokens. This is only for n-gram matching. 
    
    Args:
        text (str): The input text string to tokenize
        
    Returns:
        List[str]: A list of tokens extracted from the input text
    """
    return nltk.word_tokenize(text)

In [None]:
# export 

def _ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """
    Extract n-grams from a sequence of tokens.
    
    Args:
        tokens (Sequence[str]): A sequence of tokens (words, characters, etc.)
        n (int): The size of each n-gram
        
    Returns:
        List[Tuple[str, ...]]: A list of tuples, where each tuple contains n consecutive tokens
                              from the input sequence
    """
    ### YOUR CODE HERE ### 
    pass
    ### END YOUR CODE ### 

### 3.2. N-gram Overlap [5 points]

N-gram overlap is a simple metric that measures how many n-grams are shared between the generated text and the reference text. n-gram overlap focuses on a single n-gram size and calculates the recall - what fraction of reference n-grams appear in the candidate text.

This metric is useful for evaluating:
- Content coverage: How much of the reference content is captured in the generation
- Lexical similarity: The degree to which the same word sequences are used

N-gram overlap is particularly helpful for tasks where we want to ensure specific information from a reference is included in the generated text, such as summarization or story generation.

Implement the `compute_ngram_overlap()` function below.

In [None]:
# export 

def compute_corpus_ngram_overlap(candidates: List[str], references: List[str], n: int = 2) -> float:
    """
    Compute corpus-level n-gram overlap by aggregating across all sentences.
    Compute the n-gram overlap, accounting for the number of times each n-gram appears in the candidate and reference.
    
    Args:
        candidates (List[str]): List of candidate translations
        references (List[str]): List of reference translations  
        n (int): N-gram order
        
    Returns:
        float: Corpus-level n-gram overlap score
    """
    ### YOUR CODE HERE ### 
    # Initialize counters for n-grams
    pass
    
    # Calculate overlap considering frequencies
    pass

    ### END YOUR CODE ###  

In [None]:
# DO NOT CHANGE THIS CELL

test_text = "the cat sat on the mat"
tokens = preprocess_text(test_text)
print(f"preprocess_text result: {tokens}")

test_tokens = ["the", "cat", "sat", "on", "the", "mat"]
bigrams = _ngrams(test_tokens, 2)
print(f"ngram result, n=2: {bigrams}")

test_tokens = ["the", "cat", "sat", "on", "the", "mat"]
trigrams = _ngrams(test_tokens, 3)
print(f"ngram result, n=3: {trigrams}")

# Test - perfect match
candidates = ["the cat sat on the mat"]
references = ["the cat sat on the mat"]
score = compute_corpus_ngram_overlap(candidates, references, n=2)
assert score == 1.0, f"Expected 1.0, got {score}"
print("Test passed - perfect match")

# Test - no match
candidates = ["the cat sat on the mat"]
references = ["dogs run fast"]
score = compute_corpus_ngram_overlap(candidates, references, n=2)
assert score == 0.0, f"Expected 0.0, got {score}"
print("Test passed - no match")

# Test - partial match
candidates = ["the cat sat on the mat"]
references = ["some dogs also sat on the mat"]
score = compute_corpus_ngram_overlap(candidates, references, n=3)
assert score == 0.4, f"Expected 0.4, got {score}"
print("Test passed - partial match")

# Test - multiple sentences
candidates = ["the cat sat", "the dog ran"]
references = ["the cat sat", "the dog walked"]
score = compute_corpus_ngram_overlap(candidates, references, n=2)
assert torch.isclose(torch.tensor(score), torch.tensor(0.75), atol=0.001), f"Expected 0.75, got {score}"
print("Test passed - multiple sentences")

# Test - repeated n-grams
candidates = ["the the the dog"]
references = ["the the cat cat"]
score = compute_corpus_ngram_overlap(candidates, references, n=2)
assert torch.isclose(torch.tensor(score), torch.tensor(1/3), atol=0.001), f"Expected 0.3333, got {score}"
print("Test passed - repeated n-grams")

print("All tests passed")

### 3.3. BLEU Score [2 points]


BLEU (Bilingual Evaluation Understudy) is a widely used metric for evaluating machine translation quality. 
This is done by: 

- Finding the n-grams for different values of n (typically n = 1,...,4). 
- Calculating the n-gram precisions between the candidate and reference texts for each n. This is the fraction of matching n-grams between the two texts, out of the n-grams from the candidate text. 

The final BLEU score is a geometric mean of these precision scores, multiplied by the brevity penalty.

The brevity penalty is crucial because without it, systems could artificially inflate their scores by 
producing very short translations that contain only high-confidence words. For example, a system could 
output just a few words that are guaranteed to be correct, achieving high precision while omitting 
much of the content. The brevity penalty addresses this by reducing the score when the candidate 
translation is shorter than the reference translation.


In the following cells, implement the precision calculation in the BLEU score function. 

In [None]:
# export 

def compute_corpus_bleu(candidates: List[str], references: List[str], max_n: int = 4) -> float:
    """
    Compute corpus-level BLEU score by aggregating n-gram statistics across all sentences.
    
    Args:
        candidates (List[str]): List of candidate translations
        references (List[str]): List of reference translations
        max_n (int): Maximum n-gram order to consider
        
    Returns:
        float: Corpus-level BLEU score
    """
    # Aggregate statistics across all sentences
    total_cand_len = 0
    total_ref_len = 0
    ngram_matches = [0] * max_n
    ngram_totals = [0] * max_n
    
    # iterate over each candidate and reference
    for cand, ref in zip(candidates, references):
        cand_tokens = preprocess_text(cand)
        ref_tokens = preprocess_text(ref)
        
        total_cand_len += len(cand_tokens)
        total_ref_len += len(ref_tokens)
        
        for n in range(1, max_n + 1):
            cand_ngrams = Counter(_ngrams(cand_tokens, n))
            ref_ngrams = Counter(_ngrams(ref_tokens, n))
            
            overlap = {ng: min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items()}
            ngram_matches[n-1] += sum(overlap.values())
            ngram_totals[n-1] += sum(cand_ngrams.values())

    precisions = []

    ### YOUR CODE HERE ### 

    # calculate precision for each n-gram order
    pass

    ### END YOUR CODE ### 
    
    # Geometric mean with smoothing
    eps = 1e-9
    geo_mean = np.exp(sum(np.log(p + eps) for p in precisions) / max_n)
    
    # Brevity penalty
    bp = 1.0 if total_cand_len > total_ref_len else np.exp(1 - total_ref_len / max(total_cand_len, 1))
    
    return bp * geo_mean

In [None]:
# Test - perfect match
candidates = ["the cat sat on the mat"]
references = ["the cat sat on the mat"]
score = compute_corpus_bleu(candidates, references, max_n=4)
assert torch.isclose(torch.tensor(score, dtype=torch.float32), torch.tensor(1.0), atol=0.001)
print("Test passed - perfect match")

# Test - multiple sentences
candidates = ["the cat sat on the mat", "the dog ran in the park"]
references = ["the cat sat on the mat", "the dog ran in the garden"]
score = compute_corpus_bleu(candidates, references, max_n=4)
assert torch.isclose(torch.tensor(score, dtype=torch.float32), torch.tensor(0.8806841685), atol=0.001)
print("Test passed - multiple sentences")

# Test - zero score
candidates = ["the cat sat on the mat"]
references = ["dogs run fast"]
score = compute_corpus_bleu(candidates, references, max_n=4)
assert torch.isclose(torch.tensor(score, dtype=torch.float32), torch.tensor(0.0), atol=0.001)
print("Test passed - zero score")

print("All tests passed")


### 3.4. Embedding-based Similarity [2 points]


Embedding-based similarity metrics go beyond surface-level text matching by capturing semantic relationships between words and phrases. Unlike n-gram overlap or BLEU, which rely on exact matches, embedding-based methods can recognize when different words express similar meanings.

This approach works by:
1. Converting texts into dense vector representations (embeddings) using pre-trained models
2. Measuring the similarity between these vectors using metrics like cosine similarity
3. Producing a score that reflects semantic similarity rather than lexical overlap

Unlike BLEU and n-gram overlap which are typically calculated at the corpus level, embedding-based similarity is a sentence-level metric that evaluates each translation independently. This makes it particularly useful for assessing individual translations without requiring a large corpus for reliable scores.

Implement the `compute_embedding_similarity()` function below to calculate the semantic similarity between generated and reference texts.


In [None]:
# export 

def compute_embedding_similarity(
    generation: str, 
    reference: str, 
    embedder: EmbeddingModel = None
) -> float:
    """
    This function computes the semantic similarity between a generated text and a reference text
    using embeddings and cosine similarity.
    
    Embedding-based similarity captures semantic meaning beyond exact word matches, allowing
    for evaluation of paraphrases and texts that convey similar meaning with different words.
    
    Args:
        generation (str): Generated text to be evaluated
        reference (str): Reference text to compare against
        embedder (EmbeddingModel): Model to create text embeddings. If None, a default model is used.
        
    Returns:
        float: Cosine similarity between the embeddings, ranging from -1.0 to 1.0,
              where higher values indicate greater semantic similarity
    """
    ### YOUR CODE HERE ### 
    pass

    ### END YOUR CODE ### 

In [None]:
# Test - perfect match
candidates = "the cat sat on the mat"
references = "the cat sat on the mat"
score = compute_embedding_similarity(candidates, references, labse)
assert torch.isclose(torch.tensor(score, dtype=torch.float32), torch.tensor(1.0), atol=0.001)
print("Test passed - perfect match")

# Test - paraphrase
candidates = "It is nice today"
references = "Today the weather is good"
score = compute_embedding_similarity(candidates, references, labse)
assert torch.isclose(torch.tensor(score, dtype=torch.float32), torch.tensor(0.82423102856), atol=0.001)
print("Test passed - paraphrase")

# Test - unrelated topic
candidates = ["the cat sat on the mat"]
references = ["Today the weather is good"]
score = compute_embedding_similarity(candidates, references, labse)
assert torch.isclose(torch.tensor(score, dtype=torch.float32), torch.tensor(0.27818429470), atol=0.001)
print("Test passed - unrelated topic")

print("All tests passed!")

### 3.5. Test Machine Translation Metrics [3 points - Non Programming]

Now let's test our machine translation metrics on a small dataset of German-to-English translations.

In this section, we will:
1. Load a test dataset containing German source sentences and English reference translations
2. Use our LLM to generate English translations from the German source
3. Evaluate the translations using the metrics we've implemented 

This will demonstrate how these metrics can be used to assess machine translation quality.
Since the generation process takes quite a while, we will only be running it on Llama. 

In [None]:
# DO NOT CHANGE THIS CELL
# load data

source_text = open("data/mt/test.de-en.de", 'r').read().split("\n")
target_text = open("data/mt/test.de-en.en", 'r').read().split("\n")

mt_test_raw = list(zip(source_text, target_text))


In [None]:
# DO NOT CHANGE THIS CELL

# preprocess
mt_test_data = []

# See: `prompts/mcq/default.txt` for the prompt template format 
for item in mt_test_raw:
    eval_item = {
        "source_language": "de",
        "target_language": "en",
        "source_text": item[0],
        "target_text": item[1]
    }
    mt_test_data.append(eval_item)


In [None]:
# DO NOT CHANGE THIS CELL

def extract_translation(completion):
    """
    Extract the translation from the completion.
    The function looks for text after the '[en]:' tag until it reaches:
    - Another language tag (e.g., [de]:, [fr]:)
    - A period followed by a note in parentheses
    - The end of the string
    Args:
        completion (str): The completion text from the LLM
    Returns:
        str: The extracted English translation
    """
    # Pattern to capture text after [en]: until next tag, note, or end
    pattern = r'\[en\]:\s*([\s\S]*?)(?=\s*\[[a-z]{2,}\]:|\.\s*\(|$)'
    match = re.search(pattern, completion, re.IGNORECASE)
    if match:
        translation = match.group(1).strip()
        return translation
    else:
        return completion.strip()

    
# test cases
completion = "do you know, one of the great pleasures of traveling and one of the joys of ethnographic research is to live together with people who can still remember the old days. those who still feel their past in the wind, who touch their past on rain-polished stones, who taste their past in the bitter leaves of plants. \n[de]:wir haben uns in einem kleinen dorf in der savanne getroffen, wo die sonne Ã¼ber den dÃ¤chern der hÃ¼tten schien, wo die vogel singen, wo die kÃ¶nige der savanne, die antilopen, ihre majestÃ¤t Ã¼ber die weite savanne ausÃ¼ben. \n[en]: we met in a small village in the savannah, where the sun shone over the roofs of the huts, where the birds sang, where the kings of the savannah, the antelopes,"
translation = extract_translation(completion)
assert translation == "we met in a small village in the savannah, where the sun shone over the roofs of the huts, where the birds sang, where the kings of the savannah, the antelopes,"
    
completion = "[en]: and of course we all share the same adaptation needs. (Note: The translation is not perfect, but itconveys the same meaning as the original text.) (Note: The translation is not perfect, but it conveys the same meaning as the original text.) (Note: The translation is not perfect, but it conveys the same meaning as the original text.) (Note: The translation is not perfect, but it conveys the same meaning as the original text.) (Note: The translation is not perfect, but it conveys the same meaning as the original text.) (Note: The translation is not perfect, but it conveys the same meaning as the original text.) (Note: The translation is not perfect, but it conveys the same meaning as the original text.) (Note: The translation is not perfect, but it conveys the same meaning as the original text.) (Note: The translation"
translation = extract_translation(completion)
assert translation == "and of course we all share the same adaptation needs"



completion = "[en]: [They] picked up these photographs from the forest floor, tried to see behind the face or the figure, found nothing and concluded that these were visiting cards of the devil and killed the five missionaries with the spear. \n[en]: [They] picked up these photographs from the forest floor, tried to see behind the face or the figure, found nothing and concluded that these were visiting cards of the devil and killed the five missionaries with the spear. \n[en]: [They] picked up these photographs from the forest floor, tried to see behind the face or the figure, found nothing and concluded that these were visiting cards of the devil and killed the five missionaries with the spear. \n[en]: [They] picked up these photographs from the forest floor, tried to see behind the face or the figure, found nothing and concluded that these were visiting cards of the devil and killed the five missionaries with the spear. \n[en]: [They] picked"
translation = extract_translation(completion)
assert translation == "[They] picked up these photographs from the forest floor, tried to see behind the face or the figure, found nothing and concluded that these were visiting cards of the devil and killed the five missionaries with the spear."

completion = "[en]:54 % of deaths were caused by perforations. \n[en]: 54 % of deaths were caused by perforations. \n[en]: 54 % of deaths were caused by perforations. \n"
translation = extract_translation(completion)
assert translation == "54 % of deaths were caused by perforations."

completion = "no biologist would dare to claim that 50% or more of all species are on the brink of extinction, as it is simply not true. And yet, this – the most apocalyptic scenario in the field of biological diversity – is hardly the one we know as the most optimistic scenario in the field of cultural diversity. \n[en]: no biologist would dare to claim that 50% or more of all species are on the brink of extinction, as it is simply not true. And yet, this – the most apocalyptic scenario in the field of biological diversity – is hardly the one we know as the most optimistic scenario in the field of cultural diversity. \n[en]: no biologist would dare to claim that 50% or more of all species are on the brink of extinction, as it is simply not true. And yet, this – the most apocalyptic scenario in the field of biological diversity – is hardly the one we know as the most optimistic scenario in the field"
translation = extract_translation(completion)
assert translation == "no biologist would dare to claim that 50% or more of all species are on the brink of extinction, as it is simply not true. And yet, this – the most apocalyptic scenario in the field of biological diversity – is hardly the one we know as the most optimistic scenario in the field of cultural diversity."

print("All tests passed!")


In [None]:
# DO NOT CHANGE THIS CELL

# run inference and save generations to a jsonl file

prompt_template = open("prompts/mt/default.txt").read()
translations, references = [], []

for item in tqdm(mt_test_data, desc="Evaluating dataset"):
    prompt = prompt_template.format(**item) if prompt_template else item["source_text"]
    reference = item["target_text"]
    completion = llama.generate(prompt, "mt")  # Use task_type parameter
    translation = extract_translation(completion)

    translations.append(translation)
    references.append(reference)
    print(f"prompt: {prompt}")
    print(f"translation: {translation}")
    print(f"reference: {reference}")
    print("-"*100)


In [None]:
# DO NOT CHANGE THIS CELL

# Create directory if it doesn't exist
os.makedirs(os.path.join(working_dir, "generations"), exist_ok=True)

# Generate timestamp for the filename
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = os.path.join(working_dir, "generations", f"mt_{timestamp}.jsonl")

# Save data
with open(output_file, 'w') as f:
    for i, (translation, reference, item) in enumerate(zip(translations, references, mt_test_data)):
        data = {
            "id": i,
            "source_text": item["source_text"],
            "translation": translation,
            "reference": reference
        }
        f.write(json.dumps(data) + '\n')

print(f"Saved generations to {output_file}")


In [None]:
# DO NOT CHANGE THIS CELL

# run evaluations 

llama_mt_scores = defaultdict(list)

for completion, reference in tqdm(zip(translations, references), desc="Evaluating dataset"):
    # compute sentence-level metrics
    embedding_similarity = compute_embedding_similarity(completion, reference, labse)
    llama_mt_scores["labse_embedding_similarity_list"].append(embedding_similarity)

    embedding_similarity = compute_embedding_similarity(completion, reference, gte)
    llama_mt_scores["gte_embedding_similarity_list"].append(embedding_similarity)

# compute corpus-level scores properly
corpus_bleu = compute_corpus_bleu(translations, references)
corpus_ngram = compute_corpus_ngram_overlap(translations, references)

llama_mt_scores["corpus_bleu"] = corpus_bleu
llama_mt_scores["corpus_ngram_overlap"] = corpus_ngram
llama_mt_scores["labse_embedding_similarity"] = np.mean(llama_mt_scores["labse_embedding_similarity_list"])
llama_mt_scores["gte_embedding_similarity"] = np.mean(llama_mt_scores["gte_embedding_similarity_list"])

for metric_name, metric_scores in llama_mt_scores.items():
    print(f"{metric_name}: {np.mean(metric_scores):.3f}")

In the evaluation results above, we computed embedding similarities using two different models: LaBSE and GTE.
Analyze the results and explain:
1. Is there any difference between the LaBSE and GTE similarity scores?
2. Why might these differences occur?
3. Which embedding model would you recommend for this MT evaluation task and why?

YOUR RESPONSE HERE


## 4. Short Story Generation Evaluation [13 points Programming + 4 points Non-Programming]

In this section, we'll evaluate the quality of short stories generated by different LLMs.

Unlike the previous tasks (multiple choice questions and machine translation), evaluating 
creative text generation is more subjective and challenging. There's no single "correct" 
output to compare against.

We'll use several metrics to assess different aspects of the generated stories:

1. **Perplexity**: Evaluates how "surprising" or predictable the text is
2. **Distinct N-grams**: Measures lexical diversity and repetitiveness
3. **Coherence**: Measures semantic flow between adjacent sentences using embedding similarity





### 4.1. Temperature [0 points]

When generating text with language models, we can control the randomness of the output using a parameter called "temperature".
Temperature controls how the model samples from the probability distribution over the vocabulary. 
#
- **Low temperature** (e.g., 0.1-0.5): Makes the model more deterministic by amplifying high-likelihood tokens, producing predictable but potentially repetitive text. 
#
- **High temperature** (e.g., 0.8-1.5): Flattens the probability distribution to increase randomness, giving lower-probability tokens more chances and producing more diverse, creative text that risks becoming incoherent if set too high.
#
In NLP, entropy refers to the unpredictability of text. High-entropy text is more diverse and unpredictable, while low-entropy text is more predictable and repetitive.
#
For story generation, finding the right temperature is crucial:
- Too low: Stories become repetitive and boring
- Too high: Stories may lose coherence and logical flow
#
You can change the temperature in your LLM by specifying it in the `LLMGenerationConfig` that you pass into your `LLM.generate()` method. 

### 4.2. Perplexity Computation [1 point - Non-Programming]

In this section, we'll implement a function to compute the perplexity of generated text.
Perplexity is a common metric used to evaluate language models, measuring how "surprised" 
a model is by a given text. Lower perplexity indicates that the model finds the text more 
predictable and natural. 

Perplexity is defined as the exponentiated average negative log-likelihood of a sequence:

$$P(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^N \log P(w_i|w_1,\dots,w_{i-1})\right)$$

where W = (w_1, w_2, ..., w_N) is a sequence of N words, and P(w_i|w_1,...,w_{i-1}) is the
conditional probability of word w_i given the preceding words w_1 through w_{i-1}.

Note that there is no reason to use the same model for perplexity computation as the one used to generate text.
In our experiments, we will use Phi-4, a model substantially larger than Llama used for generation.

Implement the wrapper below for computing the perplexity using an LLM.

In [None]:
# export 

def compute_perplexity(text: str, big_llm: LLM = None) -> float:
    """
    This function computes the perplexity of a given text using a language model.
    
    Perplexity is a measurement of how well a probability model predicts a sample.
    Lower perplexity indicates the language model is better at predicting the text.
    
    Args:
        text (str): The generated text to evaluate. This should be a coherent piece
                   of text that we want to measure the perplexity of.
        big_llm (LLM): A language model instance used for perplexity computation.
                      If None is provided, a default LLM instance will be created.
        
    Returns:
        float: The perplexity value of the input text. Lower values indicate the text
              is more predictable according to the language model.
    """
    ### YOUR CODE HERE ### 
    pass

    ### END YOUR CODE ### 

In [None]:
phi = LLM(hf_id="microsoft/phi-4", device=device)

In [None]:
# Test - coherent text
text = "The transformer architecture uses multi-head self-attention mechanisms to process sequential data efficiently."
perplexity = compute_perplexity(text, phi)
assert torch.isclose(torch.tensor(perplexity, dtype=torch.float32), torch.tensor(8.227003675933073), atol=0.001)

# Test - (seemingly) random nonsensical text (James Joyce - Finnegans Wake)
text = "Take. Bussoftlhee, mememormee! Till thousendsthee. Lps. The keys to. Given! A way a lone a last a loved a long the"
perplexity = compute_perplexity(text, phi)
assert torch.isclose(torch.tensor(perplexity, dtype=torch.float32), torch.tensor(162.99996726850102), atol=0.001)

# Test - archaic poetry (John Donne - A Valediction)
text = "If they be two, they are two so, as stiff twin compasses are two; Thy soul, the fixed foot, makes no show to move, but doth, if the other do."
perplexity = compute_perplexity(text, phi)
assert torch.isclose(torch.tensor(perplexity, dtype=torch.float32), torch.tensor(3.4699515280516278), atol=0.001)

# Test - repetitive text
text = "The cat sat on the mat. The cat sat on the mat. The cat sat on the mat."
perplexity = compute_perplexity(text, phi)
assert torch.isclose(torch.tensor(perplexity, dtype=torch.float32), torch.tensor(4.433803583940793), atol=0.001)
print("All tests passed!")

### 4.3. N-gram Diversity [2 points]

In this section, we'll implement a metric to evaluate the diversity of generated text.

Distinct n-grams is a common metric used to measure the diversity and repetitiveness of generated text.
It calculates the ratio of unique n-grams to the total number of n-grams in the text.

A higher distinct n-gram ratio indicates:
- More diverse vocabulary usage
- Less repetition in the generated text
- Potentially more creative and interesting content
#
This metric is particularly useful for evaluating story generation, where we want
the model to produce varied and engaging content rather than repetitive patterns.


In [None]:
# export 

def compute_ngram_diversity(text: str, n: int = 2) -> float:
    """
    This function computes the distinct n-gram ratio, which is a measure of text diversity.
    
    The distinct n-gram ratio is calculated by dividing the number of unique n-grams
    by the total number of n-grams in the text. A higher ratio indicates more diverse text.
    
    Args:
        text (str): The generated text to evaluate
        n (int): The n-gram order (default: 2 for bigrams)
        
    Returns:
        float: The ratio of unique n-grams to total n-grams, ranging from 0.0 to 1.0
              where 1.0 means all n-grams are unique
    """

    ### YOUR CODE HERE ### 
    pass

    ### END YOUR CODE ### 

In [None]:
# DO NOT CHANGE THIS CELL

# Test - all unique n-grams
text = "The quick brown fox jumps"
diversity = compute_ngram_diversity(text, n=2)
assert torch.isclose(torch.tensor(diversity, dtype=torch.float32), torch.tensor(1.0), atol=0.001)
print("Test passed - all unique n-grams")

# Test - repetitive text (low diversity)
text = "the cat and the cat and the cat"
diversity = compute_ngram_diversity(text, n=2)
assert torch.isclose(torch.tensor(diversity, dtype=torch.float32), torch.tensor(0.42857142857), atol=0.001)
print("Test passed - repetitive text")

# Test - single repeated n-gram (minimum diversity)
text = "ha ha ha ha ha"
diversity = compute_ngram_diversity(text, n=2)
assert torch.isclose(torch.tensor(diversity, dtype=torch.float32), torch.tensor(0.25), atol=0.001)
print("Test passed - single repeated n-gram")

print("All tests passed!") 

### 4.4. Embedding Diversity [5 points]

In this section, we'll implement a metric to evaluate the diversity of generated text using embeddings.

While distinct n-grams measure lexical diversity (word-level), embedding diversity captures semantic diversity.
This metric uses vector representations (embeddings) of the generated texts to measure how different they are
from each other in the semantic space.

The embedding diversity metric works by:
1. Converting each generated text into an embedding vector
2. Computing the cosine similarity between pairs of embeddings. In our implementation, we choose the first embedding to be the *reference* embedding, and compare all the other embeddings against this one. We do this for computational efficiency; a pairwise comparison would require $\binom{N}{2} = O(N^2)$ comparisons whereas our approximations requires only $N-1 = O(N)$ comparisons. 
3. Calculating 1 minus the average similarity as the diversity score


Implement this function below. 

In [None]:
#export
# DO NOT CHANGE THIS CELL

def compute_embedding_diversity(
    generations: List[str], 
    embedder: EmbeddingModel
) -> float:
    """
    This function computes the diversity of multiple generated texts using embeddings.
    
    The diversity score is calculated by embedding each generation, computing the cosine
    similarity between the first generation (used as reference) and all other generations,
    and then returning 1 minus the mean similarity. A higher score indicates more diverse
    generations.
    
    Args:
        generations (List[str]): A list of generated texts to evaluate for diversity
        embedder (EmbeddingModel): The embedding model to use for converting text to vectors.
                                  If None, a default EmbeddingModel will be instantiated.
        
    Returns:
        float: Diversity score ranging from 0.0 to 1.0, where higher values indicate
              more diverse generations. Returns 0.0 if fewer than 2 generations are provided.
    """
    ### YOUR CODE HERE ### 
    pass
    ### END YOUR CODE ### 

In [None]:
# DO NOT CHANGE THIS CELL

# Test - unique words
text = ["Bright crimson bird soars high", "Flying through azure skies freely", "Golden sunset paints the horizon", "Gentle waves caress sandy shores", "Ancient trees whisper forest secrets"]
diversity = compute_embedding_diversity(text, labse)
assert torch.isclose(torch.tensor(diversity, dtype=torch.float32), torch.tensor(0.6612603962421417), atol=0.001)
print("Test passed - unique words")
# Test - repetitive text (low diversity)
text = ["hello world hello hello world", "hello world hello world hello", "hello hello world hello world", "world hello hello world hello", "hello world world hello hello"]
diversity = compute_embedding_diversity(text, labse)
assert torch.isclose(torch.tensor(diversity, dtype=torch.float32), torch.tensor(0.0066945552825927734), atol=0.001)
print("Test passed - repetitive text")

# Test - single repeated word (minimum diversity)
text = ["aha aha aha aha aha aha aha", "aha aha aha aha aha aha", "aha aha aha aha aha", "aha aha aha aha", "aha aha aha"]
diversity = compute_embedding_diversity(text, labse)
assert torch.isclose(torch.tensor(diversity, dtype=torch.float32), torch.tensor(0.03923851251602173), atol=0.001)
print("Test passed - single repeated word")

print("All tests passed!") 

### 4.5. Coherence Metric [5 points]

Next, we explore a custom coherence metric. We would like to measure how well the generated text maintains logical flow and semantic relatedness between adjacent sentences.
#
In this implementation, we use embedding-based semantic similarity to quantify coherence. 
The function computes the score by: 

1. Splitting each generated text into sentences
2. Computing embeddings for each sentence using the provided embedding model
3. Calculating cosine similarity between adjacent sentence pairs
4. Averaging these similarities to produce a coherence score for each text
5. Returning the mean coherence score for the given list of sentences. 



Higher coherence scores (closer to 1.0) indicate better logical flow between sentences,
while lower scores (closer to 0.0) suggest disconnected or inconsistent text.

Implement the `compute_coherence` function below. 


In [None]:
# export 

def split_sentences(text):
    """
    Split text at punctuation marks: ., !, ?
    
    Args:
        text (str): Input text to split into sentences
        
    Returns:
        List[str]: List of sentences with whitespace stripped
    """
    if not isinstance(text, str):
        raise ValueError(f"Expected string input, got {type(text)}")
    
    sentences = re.split(r'[.!?]', text)
    return [s.strip() for s in sentences if s.strip()]


def compute_coherence(sentences: List[str], model: EmbeddingModel) -> float:
    """
    This function computes the coherence of a text based on the semantic similarity
    between adjacent sentences.
    
    Coherence measures how well the sentences in a text connect to each other in a logical
    and consistent way. Higher coherence indicates a more natural flow between sentences.
    
    Args:
        sentences (List[str]): A list of sentences to evaluate for coherence
        model (EmbeddingModel): The embedding model to use for computing text embeddings
        
    Returns:
        float: Coherence score. Higher values indicate better coherence between 
               adjacent sentences. Returns 0.0 if fewer than 2 sentences are provided.
    """

    #### YOUR CODE HERE ####
    pass

    #### END YOUR CODE ####

In [None]:
# Test - single sentence
generations = ["Only one sentence here"]
coherence = compute_coherence(generations, labse)

assert torch.isclose(
    torch.tensor(coherence, dtype=torch.float32),
    torch.tensor(0.0),
    atol=0.001
)
print("Test passed - single sentence")

# Test - repeated sentences
generations = ["Bright crimson bird soars high.",
                "Bright crimson bird soars high.",
                "Bright crimson bird soars high."]
coherence = compute_coherence(generations, labse)


assert torch.isclose(
    torch.tensor(coherence, dtype=torch.float32),
    torch.tensor(1.0),
    atol=0.001
)
print("Test passed - identical sentences")

print("All tests passed!")

### 4.6. Ensemble Evaluation Pipeline [4 points Non-Programming]

Now let's evaluate our short story generation metrics using the ROC Stories dataset on the Llama model.

The ROC Stories dataset contains five-sentence stories that capture causal and temporal commonsense relations between everyday events. For each sample in the dataset, we take the first two sentences and have the model generate the remaining three sentences. 

Unlike MCQ and MT tasks where we compare against a single correct answer, story generation is open-ended. Therefore, we'll generate **multiple outputs per prompt** (15 generations per story) to analyze the distribution of coherence scores across different temperature settings.

In the following cells, we evaluate generations using the default temperature (0.7) from the LLM configuration. For a complete analysis, you would want to experiment with different temperatures (0.5, 1.0, 1.5) and analyze how temperature affects narrative quality, coherence, and diversity of the generated stories.

Run the following cells. For each evaluation result, comment on the final values. What does this say about the model? 

In [None]:
# DO NOT CHANGE THIS CELL

# Load the ROC Stories data for continuation task
roc_continuation_data_raw = pd.read_csv("data/ssg/roc_completion.csv")
print(f"Loaded {len(roc_continuation_data_raw)} stories for the continuation task")

# Display the first few rows to understand the data structure
roc_continuation_data_raw.head()


In [None]:
# DO NOT CHANGE THIS CELL

# run inference using Llama and save generations to a jsonl file for continuation task

prompt_template = open("prompts/ssg/default.txt").read()
all_completions = []
references = []
num_generations = 15

# Create a list to store continuation data
roc_continuation_data = []

items = list(roc_continuation_data_raw.iterrows())

# Load the ROC Stories data for continuation task
for _, item in tqdm(items, desc="Evaluating dataset"):
    prompt = prompt_template.format(sentence=item['first_two_sentences'])
    first_two_sentences = item['first_two_sentences']

    # For story continuation, we don't have a reference in the dataset
    # We're generating completions to stories that start with the first two sentences

    completions = []
    for _ in tqdm(range(num_generations), desc="Generating completions"):
        completion = llama.generate(prompt, "ssg")  # Use ssg task_type for story generation
        completions.append(completion)

    # Create continuation data item with consistent structure
    eval_item = {
        "storyid": item["storyid"],
        "prompt": first_two_sentences,
        "completions": completions
    }
    roc_continuation_data.append(eval_item)

    all_completions.append(completions)

In [None]:
# DO NOT CHANGE THIS CELL

# Create directory if it doesn't exist
os.makedirs(os.path.join(working_dir, "generations"), exist_ok=True)

# Generate timestamp for the filename
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = os.path.join(working_dir, f"generations/ssg_continuation_{timestamp}.jsonl")

# Save data
with open(output_file, 'w') as f:
    for i, item in enumerate(roc_continuation_data):
        data = {
            "storyid": item["storyid"],
            "prompt": item["prompt"],
            "completions": item["completions"]
        }
        f.write(json.dumps(data) + '\n')

print(f"Saved generations to {output_file}")
print(f"Created {len(roc_continuation_data)} items for continuation evaluation")


Next, we will run evaluations on the generated text.

In [None]:
# DO NOT CHANGE THIS CELL

# run evaluations for coherence
llama_continuation_scores = defaultdict(list)

# Evaluate coherence for each story
for item in tqdm(roc_continuation_data, desc="Evaluating dataset"):
    prompt = item["prompt"]
    for completion in tqdm(item["completions"], desc="Evaluating completions"):
        # Split the prompt and completion into sentences
        prompt_sentences = split_sentences(prompt)
        completion_sentences = split_sentences(completion)
        
        # Combine prompt sentences and completion sentences for coherence evaluation
        all_sentences = prompt_sentences + completion_sentences
        
        # Calculate coherence between all sentences
        if len(all_sentences) > 1:  # Need at least 2 sentences for coherence
            coherence_score = compute_coherence(all_sentences, gte)
            llama_continuation_scores["coherence_list"].append(coherence_score)
    
# Compute average scores
if llama_continuation_scores["coherence_list"]:
    llama_continuation_scores["coherence"] = np.mean(llama_continuation_scores["coherence_list"])
    llama_continuation_scores["coherence_std"] = np.std(llama_continuation_scores["coherence_list"])

# Print results
for metric_name, metric_value in llama_continuation_scores.items():
    if not metric_name.endswith("_list"):
        print(f"{metric_name}: {metric_value:.3f}")

In [None]:
# DO NOT CHANGE THIS CELL

# compute perplexity for each completion
llama_ssg_scores = defaultdict(list)

# Compute perplexity for each completion
for completions in tqdm(all_completions, desc="Computing perplexity"):
    # Calculate perplexity using the language model
    perplexity_list = [compute_perplexity(completion, llama) for completion in completions]
    llama_ssg_scores["perplexity_list"].append(perplexity_list)
    
# Compute average perplexity
llama_ssg_scores["perplexity_mean"] = np.mean(llama_ssg_scores["perplexity_list"])
llama_ssg_scores["perplexity_std"] = np.std(llama_ssg_scores["perplexity_list"])

# Print results
print(f"Perplexity: {llama_ssg_scores['perplexity_mean']:.3f} ± {llama_ssg_scores['perplexity_std']:.3f}")

In [None]:
# DO NOT CHANGE THIS CELL

# compute n-gram diversity for each completion
# Note: llama_ssg_scores was already initialized in previous cell

# Compute n-gram diversity for each completion
for completions in tqdm(all_completions, desc="Computing n-gram diversity"):
    # Calculate n-gram diversity using the language model
    for n in [2, 3, 4]:
        ngram_diversity_list = [compute_ngram_diversity(completion, n) for completion in completions]
        llama_ssg_scores[f"{n}_gram_diversity_list"].append(ngram_diversity_list)
    
# Compute average n-gram diversity
for n in [2, 3, 4]:
    llama_ssg_scores[f"{n}_gram_diversity_mean"] = np.mean(llama_ssg_scores[f"{n}_gram_diversity_list"])
    llama_ssg_scores[f"{n}_gram_diversity_std"] = np.std(llama_ssg_scores[f"{n}_gram_diversity_list"])

# Print results
for n in [2, 3, 4]:
    print(f"{n}-gram diversity: {llama_ssg_scores[f'{n}_gram_diversity_mean']:.3f} ± {llama_ssg_scores[f'{n}_gram_diversity_std']:.3f}")

In [None]:
# DO NOT CHANGE THIS CELL

# compute embedding diversity using LaBSE for each completion
# Note: llama_ssg_scores was already initialized in previous cell

# Compute embedding diversity for each completion
for completions in tqdm(all_completions, desc="Computing embedding diversity using LaBSE"):
    # Calculate embedding diversity using the embedding model
    embedding_diversity = compute_embedding_diversity(completions, labse)
    llama_ssg_scores["embedding_labse_diversity_list"].append(embedding_diversity)
    
# Compute average embedding diversity
llama_ssg_scores["embedding_labse_diversity_mean"] = np.mean(llama_ssg_scores["embedding_labse_diversity_list"])
llama_ssg_scores["embedding_labse_diversity_std"] = np.std(llama_ssg_scores["embedding_labse_diversity_list"])

# Print results
print(f"Embedding diversity using LaBSE: {llama_ssg_scores['embedding_labse_diversity_mean']:.3f} ± {llama_ssg_scores['embedding_labse_diversity_std']:.3f}")

In [None]:
# DO NOT CHANGE THIS CELL

# compute embedding diversity using GTE for each completion
# Note: llama_ssg_scores was already initialized in previous cell

# Compute embedding diversity for each completion
for completions in tqdm(all_completions, desc="Computing embedding diversity using GTE"):
    # Calculate embedding diversity using the embedding model
    embedding_diversity = compute_embedding_diversity(completions, gte)
    llama_ssg_scores["embedding_gte_diversity_list"].append(embedding_diversity)
    
# Compute average embedding diversity
llama_ssg_scores["embedding_gte_diversity_mean"] = np.mean(llama_ssg_scores["embedding_gte_diversity_list"])
llama_ssg_scores["embedding_gte_diversity_std"] = np.std(llama_ssg_scores["embedding_gte_diversity_list"])

# Print results
print(f"Embedding diversity using GTE: {llama_ssg_scores['embedding_gte_diversity_mean']:.3f} ± {llama_ssg_scores['embedding_gte_diversity_std']:.3f}")

## 5. Sampling Hyperparameters and Prompt Optimization [BONUS - 4 points]

In this section, we explore how different choices of prompt engineering and regex patterns can affect the accuracy on MMLU.

Your task is to:
1. Experiment with various prompting techniques (few-shot prompting, chain-of-thought, etc.) to maximize performance
2. Optimize the regex patterns to improve answer extraction accuracy
3. Document the impact of different hyperparameters (temperature, top_p, top_k) on model performance

Compare your results with the baseline approach and explain which techniques were most effective and why.


## 6. Submitting Your Assignment

This is the end. Congratulations!  

Now, follow the steps below to submit your homework on Gradescope.

### 6.1. Programming

The programming will be evaluated through an autograder. To create the file to submit for autograder, follow the steps below -
1. Open a terminal from the root directory of the project
2. Run the collect_submission.py file
3. Agree to the Late Policy and Honor Pledge
4. After the file is executed, your root project will have a submission directory.
5. Submit all the contents of this file to GradeScope

### 6.2. Non-Programming

The analysis parts will be evaluated manually. For this, export the notebook to a PDF file, and submit it on GradeScope. Please ensure no written code or output is clipped when you create your PDF. One reliable way to do it is first download it as HTML through Jupyter Notebook and then print it to get PDF.