In [None]:
%%capture
#### RUN THIS IN EVERY NEW COLAB SESSION
#### RUN IT if you change runtimes
#### shouldn't need to run after a kernel restart in the same session

from google.colab import drive
import sys
from pathlib import Path

drive.mount('/content/drive')
COLAB_NOTEBOOKS_DIR = Path("/content/drive/MyDrive/Colab Notebooks")

########## MODIFY THIS PATH AS NEEDED ##########
WORKING_DIR = COLAB_NOTEBOOKS_DIR / "Homework_11"
################################################### 
sys.path.append(str(WORKING_DIR))

# ✅ Now you can import from helpers.py in your homework folder

# ✅ Install JupyterLab so the nbconvert lab template becomes available
%pip install -q jupyterlab jupyterlab_widgets
!jupyter nbconvert --to html --template lab --stdout --output dummy /dev/null || true

# ✅ Install the introdl course package
!wget -q https://github.com/DataScienceUWL/DS776/raw/main/Lessons/Course_Tools/introdl.zip
!unzip -q introdl.zip -d introdl_pkg
%pip install -q -e introdl_pkg --no-cache-dir

src_path = Path("introdl_pkg/src").resolve()
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

# Reload the introdl package (no kernel restart needed)
import importlib
try:
    import introdl
    importlib.reload(introdl)
except ImportError:
    import introdl

In [None]:
#### Run this cell later when you want to export your notebook to HTML
# see post @420 in Piazza for how to do this in CoCalc

from introdl.utils import convert_nb_to_html
my_html_file = (WORKING_DIR / "Homework_11_MY_NAME.html").resolve()  # change file name as needed
my_notebooks_dir = (WORKING_DIR / "Homework_11_Colab_Version.ipynb").resolve() # must include name of this notebook
convert_nb_to_html(output_filename = my_html_file, notebook_path = my_notebooks_dir)


In [None]:
# add other packages as needed here

from introdl.utils import config_paths_keys

paths = config_paths_keys()

# Homework 11 - Text Generation

Complete the tasks below.  Generate an HTML version of your final notebook and upload it to Canvas.

## Task 1 - Understanding Compute Costs. (10 points)

1. Search for and report the vocabulary size, embedding dimension, and number of decoder layers for Llama-3.3 (a 70B parameter model) and Llama-3.1-405B.  Present your results in a table.

2. Compare the FLOPs required to generate 300 tokens from a prompt of 500 tokens for Llama-3.3 and Llama-3.1-405B.

3. Now suppose the conversation has grown so that the prompt (with context) is now 50,000 tokens.  How many times more compute is required for each of the models to generate 300 tokens?


---

## Task 2 - Build an LLM Scorer for Determining Text Similarity (10 points)

Often we want to compare generated text to a reference text.  For example, suppose we've prompted our text generation model and we expect it to generate the text `reference_text =  "The quick brown fox jumps over the lazy dog."` and it generates `generated_text = "The fast crimson fox bounded over the sluggish brown hound."`.  Then we'd want to produce a high similarity score because the two texts say essentially the same thing.  However, if the generated text were "A slow green lizard leaps over a sleepy cat." we'd expect to get a low similarity score.  For this exercise you're going to use a Gemini API-based model or a similar high-quality API-based model to build an `llm_score` function and then apply it to score text generated by a small local model.

#### Steps:
- Write a function named `llm_score` that:
   - Takes two arguments: `generated_text` (the text produced by the model) and `reference_text` (the target text for comparison).
   - Returns a **single numeric score** between **0 and 100**, where **0** indicates no similarity and **100** indicates perfect similarity.
   - The returned score must be a number, not a string.

-  Use the OpenAI API client with the Gemini 2.0 Flash Lite model as in the lesson.  You can use a different API if you like, but the model should be at least as good as Gemini 2.0 Flash Lite, GPT 4o Mini, or a Llama 3 70B Instruct model.
-  Craft a prompt that clearly instructs the model to provide only a numeric score.
- Set `temperature=0.0` to ensure a deterministic response.
- Limit the response length with `max_tokens=10`.


#### Hint:
- Use the following format for the prompt (adjust as needed):
  ```python
  prompt = (
      f"Evaluate the similarity between the following generated text and reference text. "
      f"Score their similarity on a scale from 0 to 100, where 0 means no similarity and 100 means perfect similarity.\n\n"
      f"Reference Text: {reference_text}\n\n"
      f"Generated Text: {generated_text}\n\n"
      f"Respond with only the numeric score."
  )
  ```

- Example function signature:
  ```python
  def llm_score(generated_text, reference_text):
      # Your implementation here
  ```
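Because the API returns its reply as a string, the reply has to be converted to a number before `llm_score` returns it. The following is a minimal sketch of one way to do that conversion. The helper name `parse_score` is hypothetical (it is not from the lesson), and it assumes the reply contains at least one non-negative number; it does not make the API call itself.

```python
import re

def parse_score(reply):
    """Hypothetical helper: pull the first number out of a model reply string."""
    # Find the first integer or decimal number anywhere in the reply
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"No numeric score found in reply: {reply!r}")
    value = float(match.group())
    # Clamp to the 0-100 range required by the task
    return max(0.0, min(100.0, value))

print(parse_score("85"))
print(parse_score("Score: 72.5/100"))
```

Raising an error when no number is found, rather than silently returning 0, makes it easier to notice a prompt that is not producing numeric replies.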

Create a working `llm_score` function and show with at least four different examples that the scores produced are plausible.  Your solution should use and mimic code from the lesson and should not be AI-generated.  The point here is for you to understand the code in the lesson well enough to do this.



---

## Task 3 - Evaluate Locally Generated Text with BERTScore (10 points)

# BERTScore Explanation and Demo

### **What is BERTScore?**
**BERTScore** is a modern evaluation metric for comparing the similarity of two texts using contextual embeddings from a pretrained transformer model (e.g., BERT, RoBERTa, DeBERTa). Unlike traditional metrics like BLEU and ROUGE that rely on exact word overlap, BERTScore measures similarity in **semantic meaning** by comparing embeddings.

### **How It Works**
1. **Embedding Extraction**:
   - Both the reference text and the generated text are **tokenized** and passed through a transformer model.
   - The model outputs contextualized embeddings for each token (e.g., vectors representing words considering their surrounding context).

2. **Matching Tokens**:
   - For each token in the generated text, the closest (most similar) token in the reference text is identified based on **cosine similarity**.
   - The similarity score between these tokens is recorded.

3. **Calculating Scores**:
   - **Precision (P):** Measures how much of the generated text matches the reference text.
   - **Recall (R):** Measures how much of the reference text is covered by the generated text.
   - **F1 Score:** The harmonic mean of Precision and Recall, providing a balanced measure.
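The matching and scoring steps above can be sketched with a toy example. This is an illustration only, not the `bert_score` implementation: the 2-D vectors are made-up stand-ins for real contextual embeddings, the function names `cosine` and `toy_bertscore` are hypothetical, and the real library offers additional options (such as IDF weighting and baseline rescaling) that are omitted here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_bertscore(gen_vecs, ref_vecs):
    """Greedy-matching precision, recall, and F1 over token vectors."""
    # Precision: match each generated token to its most similar reference token
    precision = sum(max(cosine(g, r) for r in ref_vecs) for g in gen_vecs) / len(gen_vecs)
    # Recall: match each reference token to its most similar generated token
    recall = sum(max(cosine(r, g) for g in gen_vecs) for r in ref_vecs) / len(ref_vecs)
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up "embeddings": the reference has one token the generated text lacks
gen_vecs = [[1.0, 0.0], [0.0, 1.0]]
ref_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

P, R, F1 = toy_bertscore(gen_vecs, ref_vecs)
print(f"P={P:.4f}  R={R:.4f}  F1={F1:.4f}")
```

In this toy case every generated token has an exact match in the reference, so precision is perfect, while the unmatched reference token pulls recall down.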

### **How to Interpret the Scores**
- **Precision (P):** High precision means that the generated text is closely aligned with the reference text. A score of 1.0 would indicate that all tokens in the generated text are well-matched with relevant tokens in the reference text.
- **Recall (R):** High recall means that most of the important tokens from the reference text are present in the generated text. A score of 1.0 would indicate that the generated text covers all the important tokens from the reference.
- **F1 Score:** Combines Precision and Recall. A high F1 score indicates both good coverage (recall) and relevance (precision). It’s often the most important score to consider when evaluating generated text.

**Typical Range:** Scores are usually between **0 and 1**, where:
- **0:** No meaningful similarity.
- **1:** Perfect similarity.


### **Example Interpretation**
| Metric      | Score   | Interpretation                          |
|-------------|---------|-----------------------------------------|
| Precision   | 0.85    | Most of the generated text is supported by content in the reference. |
| Recall      | 0.78    | The generated text misses some important concepts present in the reference. |
| F1 Score    | 0.81    | A good overall similarity, with room for improvement in coverage. |

### **Example Code:**
The code in the following cell demonstrates how to generate text with a local model, clean the output of the model, and compute the BERTScores.  You'll modify the code to complete the exercises that follow for this task.



In [14]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from bert_score import score
import pandas as pd
import re

# Load model and tokenizer
model_name = "unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

# Set model to evaluation mode
model.eval()

def clean_output(text, prompt=""):
    """
    Cleans the model-generated text by removing the prompt, formatting, and common prefixes/suffixes.
    
    Args:
        text (str): The generated text from the model.
        prompt (str): The prompt used for generating the text. If present at the beginning, it will be removed.
        
    Returns:
        str: Cleaned text ready for evaluation.
    """
    # Remove the prompt if it exists at the beginning of the text
    if text.startswith(prompt):
        text = text[len(prompt):].strip()
    
    # Remove markdown-like formatting (titles, headers, etc.)
    text = re.sub(r"\*\*.*?\*\*|=+", "", text)
    
    # Remove bullet points, numbers, dashes, and unwanted newlines
    text = re.sub(r"(\*|-|•|\d+\.)\s", "", text)
    text = re.sub(r"\n+", " ", text)

    # Remove anything that's not alphanumeric, standard punctuation, or whitespace
    text = re.sub(r"[^a-zA-Z0-9.,!?;:\-()\'\"\s]", "", text)

    # Remove common prefixes like "The answer is:", "Here is the explanation:", etc.
    unwanted_prefixes = [
        "The answer is:", "Here is the explanation:", 
        "In conclusion,", "To summarize,", "As follows:"
    ]
    for prefix in unwanted_prefixes:
        if text.startswith(prefix):
            text = text[len(prefix):].strip()
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

# Example prompt
prompt = "Explain the difference between supervised and unsupervised learning."

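# Encode the prompt and move it to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate text using Greedy Decoding and Top-K Sampling
max_length = 200  # Maximum total length in tokens (prompt + generated text)
with torch.no_grad():
    # Greedy decoding: disable sampling explicitly, since the model's default
    # generation config may enable it, and clear the sampling-only parameters
    greedy_output = model.generate(**inputs, max_length=max_length, do_sample=False, top_p=None, temperature=None)
    # Top-K sampling: top_k only takes effect when do_sample=True
    top_k_output = model.generate(**inputs, max_length=max_length, do_sample=True, top_k=50)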

# Decode the outputs
generated_texts = {
    "Greedy": tokenizer.decode(greedy_output[0], skip_special_tokens=True),
    "Top-K": tokenizer.decode(top_k_output[0], skip_special_tokens=True)
}

# Clean all generated texts
cleaned_texts = {strategy: clean_output(text, prompt) for strategy, text in generated_texts.items()}

# Reference text (ground truth)
reference = [
    "Supervised learning uses labeled data to learn a mapping from inputs to outputs, while unsupervised learning tries to find patterns or groupings within unlabeled data."
]

results = {}
for strategy, text in cleaned_texts.items():
    P, R, F1 = score([text], reference, lang="en", model_type="microsoft/deberta-xlarge-mnli")
    llm_result = llm_score(text, reference[0])
    results[strategy] = {
        "Generated Text": text,
        "BERTScore P": P.mean().item(),
        "BERTScore R": R.mean().item(),
        "BERTScore F1": F1.mean().item(),
        "LLM Score": llm_result,
    }

# Display results
dataframe=pd.DataFrame.from_dict(results, orient='index')
from IPython.display import display, HTML
display(HTML(dataframe.to_html(float_format="%.4f", justify="center", index=True, border=0, classes='dataframe')))


| Strategy | Generated Text | BERTScore P | BERTScore R | BERTScore F1 | LLM Score |
|----------|----------------|-------------|-------------|--------------|-----------|
| Greedy | Supervised learning is used in image recognition, natural language processing, and other areas where a clear label is given to the model. Unsupervised learning is used in clustering, dimensionality reduction, and other areas where no clear label is given to the model. Step 1: Define Supervised Learning Supervised learning is a type of machine learning where the model is trained on labeled data. The model learns from examples that are already labeled, allowing it to learn the relationship between input and output. The goal of supervised learning is to make predictions on new, unseen data based on the patterns learned from the labeled data. Step 2: Define Unsupervised Learning Unsupervised learning, on the other hand, is a type of machine learning where the model is trained on unlabeled data. The model learns to identify patterns, relationships, and structure in the data without any prior knowledge of the expected output. The goal | 0.5696 | 0.7600 | 0.6512 | 65 |
| Top-K | In supervised learning, the model is trained on labeled data, where the target variable is already known. The model learns to predict the target variable based on the input features. In unsupervised learning, the model is trained on unlabeled data, where the target variable is not known. The model learns to identify patterns or relationships in the data. Supervised learning is often used for tasks such as classification, regression, and feature selection. Unsupervised learning is often used for tasks such as clustering, dimensionality reduction, and anomaly detection. Here is a simple example to illustrate the difference: Supervised Learning: You want to predict the price of a house based on its features (number of bedrooms, square footage, etc.). You have a dataset of labeled houses with their corresponding prices. You train a model on this data to learn the relationship between the features and the price. The model learns to predict the | 0.5495 | 0.7514 | 0.6348 | 65 |


### **Exercises for Task 3**
For the exercises, use the following prompt and reference text:
- Prompt: "Describe the process of photosynthesis."
- Reference: "Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods with the aid of chlorophyll. The process generally involves the absorption of carbon dioxide and the release of oxygen as a by-product."

1. Add Top-P sampling (temperature = 0.7, p = 0.90) and Beam Search (3 beams) decoding to the model evaluation and comment on the results.
2. Now, in a new code cell, evaluate four variations of top-p generation, one for each temperature listed below.  Compare the BERTScore and `llm_score` results for each temperature and explain how temperature impacts the quality of the generated text.
    - Low: 0.3 (High coherence, low diversity)
    - Medium: 0.7 (Balanced coherence and diversity)
    - High: 1.0 (More diverse, but potentially less coherent)
    - Very High: 1.5 (Significantly increased randomness, likely decreased coherence)
3. Notice that BERTScore matches similar tokens regardless of order.  Explain, with an example, how this could be a weakness of using BERTScore as an evaluation metric.
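To see why the temperatures in exercise 2 trade coherence for diversity, it can help to look at how temperature reshapes the next-token distribution before sampling. The sketch below is a toy illustration: the three logits are made up, and `softmax_with_temperature` is a hypothetical helper, not a function from the lesson or from `transformers`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three candidate next tokens
logits = [2.0, 1.0, 0.0]

for t in (0.3, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution so less likely tokens are sampled more often.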




---

## Task 4 - Build a Local Chatbot Simulator (10 points)

This task requires you to pull together a few pieces from the lesson notebook.  We want you to mimic the Chatbot Simulation from the lesson, but using the local model "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit" or similar.  You'll need to create a function similar to the `chat_with_gemini` function, called `chat_with_local_model`.  

* Instead of passing the name of the model you'll want to pass in the model and tokenizer that you've already loaded.  
* Pass a `split_string` to `chat_with_local_model` and use it to split the output of the model so that the assistant response contains only the response and doesn't repeat the input prompt (we did this in the lesson too).  You can set the default `split_string` as needed.
* Also allow additional arguments to be passed to `chat_with_local_model` that you pass along to `model.generate()` so you can configure the decoding strategy.
* Use the chat_template to format the conversation in the tokenization step.
* You should use the helper function `visualize_conversation` to show the conversation after each new prompt.  

Your function will look something like this:

```python
# copy this to a code cell to start

def chat_with_local_model(user_input, model, tokenizer, split_string='assistant', **kwargs):

    # fill in lots of stuff here

    outputs = model.generate(input_ids, **kwargs)

    # more stuff here
```

Then you could call it like this (after initializing the conversation with the system prompt):
```python
chat_with_local_model('Tell me about overfitting in deep learning', model, tokenizer, do_sample=False, top_p=None, temperature=None)
```

Your solution should use and mimic code from the lesson and should not be AI-generated.  The point here is for you to understand the code in the lesson well enough to do this.

Demonstrate that your function works by showing two conversations with at least 3 inputs each.  
1.  A creative conversation to generate a jingle, poem, limerick, or similar and refine it.  Use an appropriate decoding strategy.
2.  A coding conversation where you ask the chatbot to create some code and refine it.  Use an appropriate decoding strategy.