# Assignment 2 – Evaluating LLM Output Quality

# Introduction to Large Language Models and Prompt Engineering

**Course:** GenAI development and LLM applications


**Instructors:** Ori Shapira, Yuval Belfer

**Semester:** Summer
    
## Overview

This assignment provides a **hands‑on** experince with the world of LLM based systems evaluation: from understanding the business use-case and defining evaluation criterias in light of it. To performing human evaluation and dealing with the hardships of "non-objectivity", trough experimenting with **JLMs** (Judge Language Models). 

Along the way you will explore the differnces between the two evaluation methods, thier advanteges and dis-advanteges and try to figure out how and when to use each further down your GenAI road.

## Learning Objectives

- **Define evaluation criteria** understand the importance of defining how to measure your system performance in a non closely defined problem.
- **Compare** manual vs. automatic common methods.
- **Drive improvement** through proper evaluation, documentation and change cycles.
- **Design** usable automatic evaluation pipeline.

## Prerequisites
- Basic Python knowledge
- Familiarity with Jupyter notebooks
- Internet connection for API calls

# Part 1 - Human evaluation

In [51]:
!pip -q install python-dotenv

## 1  Setup

In [1]:
!pip -q install --upgrade "transformers[torch]" datasets accelerate bitsandbytes --progress-bar off

## 2  Business use case – Generate Product Descriptions
Many e‑commerce sites need engaging **product descriptions**. Given structured attributes (name, category, features, color, price), your model should craft a persuasive, 50‑90‑word description.

In [2]:
# Load the product dataset
import pandas as pd

dataset_path = "Assignment_02_product_dataset.csv"  # ensure the file is uploaded
df_products = pd.read_csv(dataset_path)
print(f"Loaded {len(df_products)} products")
df_products.head()

Loaded 50 products


Unnamed: 0,product_name,Product_attribute_list,material,warranty
0,Apple iPhone 15 Pro,"features: A17 Pro chip, 120 Hz ProMotion displ...","titanium frame, Ceramic Shield glass",1‑year limited warranty
1,Samsung Galaxy S24 Ultra,"features: 200 MP camera, S‑Pen support, 120 Hz...","Armor Aluminum frame, Gorilla Glass Victus",1‑year limited warranty
2,Google Pixel 8 Pro,"features: Tensor G3 chip, Magic Eraser, 50 MP ...","matte glass back, aluminum frame",1‑year limited warranty
3,Sony WH‑1000XM5 Headphones,"features: active noise cancelling, 30 hr batte...",synthetic leather earcups,1‑year limited warranty
4,Bose QuietComfort Ultra Earbuds,"features: CustomTune sound calibration, ANC, I...",silicone ear tips,1‑year limited warranty


## 3  Evaluation criteria
| Criterion | Description | Rating |
|-----------|-------------|--------|
| **Fluency** | Natural, easy‑to‑read sentences | good / ok / bad |
| **Grammar** | Correct spelling & punctuation | good / ok / bad |
| **Tone** | Matches friendly, credible sales voice | good / ok / bad |
| **Length** | 50‑90 words | good / ok / bad |
| **Grounding** | Sticks to provided attributes only | good / ok / bad |
| **Latency** | Time to first byte / full response | good / ok / bad  (based on avg. time per call)|
| **Cost** | Relative inference or API cost per 1K tokens | good / ok / bad (based on avg. price per cal)|

**Define your rubric:**
1. For each criterion, spell out what qualifies as **good**, **ok**, and **bad** to minimize subjectivity (e.g. for *Length*: good = 50‑90 words, ok = 40‑49 or 91‑110 words, bad = outside that range).
2. Decide the **cumulative pass bar**—for instance, at least three *good* ratings and no *bad* ratings overall.
3. Establish **go / no‑go rules**—e.g. if *Grounding* is *bad* the description is automatically rejected, regardless of other scores.

## 4  Prompt

💡 **Prompt‑engineering tip:**
Feel free to iterate on the prompt to maximize output quality. You can:
- Add a **system message** that defines writing style, brand voice, or formatting rules.
- Provide **one or two high‑quality examples** (few‑shot) of attribute→description pairs.
- Include explicit constraints (word count, tone adjectives, banned phrases).
- Experiment with phrases like *"Think step‑by‑step"* or *"First reason, then answer"*.

Document any changes you make and observe how they influence the evaluation scores.

In [16]:
prompt_tmpl = (
    "You are a copywriter for an online store. Using the product attributes, "
    "write an engaging product description (50–90 words).\n\n"
    "Product name: {product_name}\nFeatures: {Product_attribute_list}\Material: {material}\nWarranty: {warranty}\n\n"
    "Description:" 
)

  "Product name: {product_name}\nFeatures: {Product_attribute_list}\Material: {material}\nWarranty: {warranty}\n\n"


## 5  Run a medium‑size model (≤ 30 B parameters)

Choose **one or more** of the options below:

**A. Hugging Face checkpoint** (local inference) – already configured in the code cell that follows.

**B. OpenAI model** – call an OpenAI hosted model (e.g. `gpt‑4o`, `gpt‑4‑turbo`, `gpt‑3.5‑turbo`). Implement `call_openai(prompt: str) -> str` in a separate utility cell and then run the snippet.

**C. Google Gemini model** – call a Gemini endpoint (e.g. `gemini‑1.5‑pro`). Implement `call_gemini(prompt: str) -> str` similarly.

> ⚠️ Make sure you have your API keys set as environment variables or passed securely.


**Latency & cost tracking**
- Your `call_*` functions should return a **dict** with keys:
  - `text` – generated description (string)
  - `latency_ms` – end‑to‑end generation time in milliseconds
  - `input_tokens` – tokens sent to the model (**IF YOU ADDED A SYS PROMPT ADD IT TO THE CALCULATION**)
  - `output_tokens` – tokens received from the model
- Below, a helper `call_hf()` shows how to compute these metrics for a Hugging Face model. You must add equivalent tracking inside `call_openai()` and `call_gemini()`.


**FOR COLAB USERS**

You can set your HF_TOKEN secret in Colab, please follow these steps:

1. Click on the "🔑" icon in the left-hand sidebar in Colab. This opens the Secrets manager.
2. Click on "New secret".
3. In the "Name" field, type HF_TOKEN.
4. In the "Value" field, paste your Hugging Face access token (you can generate one from your Hugging Face account settings under "Access Tokens").
5. Make sure the "Notebook access" toggle is enabled for your notebook.
6. Close the Secrets manager.

In [5]:
# Set your Hugging Face access token
import os
os.environ["HF_TOKEN"] = "YOUR_HF_TOKEN"
print(os.environ.get("HF_TOKEN"))

YOUR_HF_TOKEN


In [6]:
import torch, os
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig

assert torch.cuda.is_available(), "Switch Colab to GPU first"

model_id = "mistralai/Mistral-7B-Instruct-v0.3"

bnb = BitsAndBytesConfig(load_in_4bit=True)

# this will work with L4 GPU, if you have a different GPU, you may need to modify the code
hf_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "24GiB", "cpu": "30GiB"},
    quantization_config=bnb,
    low_cpu_mem_usage=True
)

hf_tok = AutoTokenizer.from_pretrained(model_id)

hf_gen = pipeline("text-generation",
               model=hf_model,
               tokenizer=hf_tok,
               max_new_tokens=120,
               do_sample=False)

  from .autonotebook import tqdm as notebook_tqdm


AssertionError: Switch Colab to GPU first

In [None]:
sample = dict(
    product_name="EverCool Water Bottle",
    Product_attribute_list="double-wall vacuum insulation; keeps drinks cold 24 h; leak-proof lid",
    material="stainless steel",
    warranty="lifetime warranty",
)

print(hf_gen(prompt_tmpl.format(**sample),
          return_full_text=False)[0]["generated_text"])

In [None]:
# --- Option A: Hugging Face ---
# Ensure you implemented call_hf() in another cell (you can use the one from the previous assignment).

# Example with HF pipeline
def call_hf(prompt: str, model_id: str = "mistralai/Mistral-7B-Instruct-v0.3"):
    """Return dict with latency & token counts for HF pipeline."""
    # tokenizer = AutoTokenizer.from_pretrained(model_id)
    # model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)
    # generator = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=120)

    import time
    start = time.time()
    res= hf_gen(prompt,
          return_full_text=False)[0]["generated_text"]

    latency = (time.time() - start) * 1000  # ms
    # token counts via tokenizer
    input_tokens = len(hf_tok.encode(prompt))
    output_tokens = len(hf_tok.encode(res))
    return {
        "text": res,
        "latency_ms": latency,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }

response_hf = call_hf(prompt_tmpl.format(**sample))
print(response_hf)

In [None]:
# --- Option B: OpenAI ---
import openai
import time

def call_openai(prompt: str, model_name: str = "gpt-4o", api_key: str = None) -> dict:
    """
    Call OpenAI API and return generated text and usage metrics.

    Args:
        prompt (str): The user prompt.
        model_name (str): OpenAI model to use (default: "gpt-4o").
        api_key (str): OpenAI API Key (optional if already set globally).

    Returns:
        dict: {
            "text": generated description (str),
            "latency_ms": latency in milliseconds (float),
            "input_tokens": tokens sent (int),
            "output_tokens": tokens received (int)
        }
    """
    if api_key:
        openai.api_key = api_key

    start_time = time.time()

    try:
        response = openai.ChatCompletion.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7
        )

        end_time = time.time()
        latency_ms = (end_time - start_time) * 1000

        output = response.choices[0].message.content.strip()
        usage = response.usage

        return {
            "text": output,
            "latency_ms": latency_ms,
            "input_tokens": usage.prompt_tokens if usage else None,
            "output_tokens": usage.completion_tokens if usage else None
        }

    except Exception as e:
        print(f"OpenAI API Error: {e}")
        return {
            "text": "",
            "latency_ms": 0,
            "input_tokens": 0,
            "output_tokens": 0
        }


In [None]:
import google.generativeai as genai
import time

In [53]:
from dotenv import load_dotenv
import os

# Load the .env file
load_dotenv()

# Access variables
gemini_api_key = os.getenv("gemini_api_key")

In [34]:
# --- Option C: Gemini ---
def call_gemini(prompt: str, model_name: str = "models/gemini-1.5-pro", api_key: str = gemini_api_key) -> dict:
    """
    Call Google Gemini API and return generated text and usage metrics.

    Args:
        prompt (str): The user prompt.
        model_name (str): Gemini model to use (default: "gemini-pro").
        api_key (str): Google API Key.

    Returns:
        dict: {
            "text": generated description (str),
            "latency_ms": latency in milliseconds (float),
            "input_tokens": tokens sent (int),
            "output_tokens": tokens received (int)
        }
    """

    if not api_key:
        raise ValueError("api_key must be provided for Google Gemini API")

    genai.configure(api_key=api_key)

    # Initialize model
    model = genai.GenerativeModel(model_name)

    # Measure start time
    start_time = time.time()

    try:
        response = model.generate_content(prompt)
        end_time = time.time()
        latency_ms = (end_time - start_time) * 1000

        text = response.text.strip() if hasattr(response, "text") else ""

        # Token accounting (Gemini SDK does not expose usage directly yet)
        input_tokens = len(prompt.split())
        output_tokens = len(text.split())

        return {
            "text": text,
            "latency_ms": latency_ms,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens
        }

    except Exception as e:
        print(f"Gemini API Error: {e}")
        return {
            "text": "",
            "latency_ms": 0,
            "input_tokens": 0,
            "output_tokens": 0
        }

In [32]:
models = genai.list_models()
for m in models:
    print(m.name, m.supported_generation_methods)


models/embedding-gecko-001 ['embedText', 'countTextTokens']
models/gemini-1.0-pro-vision-latest ['generateContent', 'countTokens']
models/gemini-pro-vision ['generateContent', 'countTokens']
models/gemini-1.5-pro-latest ['generateContent', 'countTokens']
models/gemini-1.5-pro-002 ['generateContent', 'countTokens', 'createCachedContent']
models/gemini-1.5-pro ['generateContent', 'countTokens']
models/gemini-1.5-flash-latest ['generateContent', 'countTokens']
models/gemini-1.5-flash ['generateContent', 'countTokens']
models/gemini-1.5-flash-002 ['generateContent', 'countTokens', 'createCachedContent']
models/gemini-1.5-flash-8b ['createCachedContent', 'generateContent', 'countTokens']
models/gemini-1.5-flash-8b-001 ['createCachedContent', 'generateContent', 'countTokens']
models/gemini-1.5-flash-8b-latest ['createCachedContent', 'generateContent', 'countTokens']
models/gemini-2.5-pro-preview-03-25 ['generateContent', 'countTokens', 'createCachedContent', 'batchGenerateContent']
models/ge

In [41]:
# --- Batch generation helper (type‑safe) ---
from typing import Callable
import pandas as pd

from typing import Callable, Dict
import pandas as pd

def batch_generate(
    sample_df: pd.DataFrame,
    call_model_fn: Callable[[str], Dict[str, object]],
    prompt_template: str = prompt_tmpl,
) -> pd.DataFrame:
    """Generate descriptions and metrics for each row in *sample_df*.

    The model-calling function *must* return a dict with keys:
    - ``text`` (str) – generated description
    - ``latency_ms`` (float | None)
    - ``input_tokens`` (int | None)
    - ``output_tokens`` (int | None)
    """
    if not isinstance(sample_df, pd.DataFrame):
        raise TypeError("sample_df must be a pandas DataFrame")
    if not callable(call_model_fn):
        raise TypeError("call_model_fn must be callable")

    outputs = []
    for _, row in sample_df.iterrows():
        prompt = prompt_template.format(**row.to_dict())
        out = call_model_fn(prompt)
        if not isinstance(out, dict) or 'text' not in out:
            raise ValueError("call_model_fn must return a dict with at least a 'text' field")
        outputs.append(out)

    result_df = sample_df.copy()
    result_df["generated_description"] = [o["text"] for o in outputs]
    result_df["latency_ms"] = [o.get("latency_ms") for o in outputs]
    result_df["input_tokens"] = [o.get("input_tokens") for o in outputs]
    result_df["output_tokens"] = [o.get("output_tokens") for o in outputs]
    return result_df


demo_df = batch_generate(df_products[:5], call_gemini)
demo_df.head()

Unnamed: 0,product_name,Product_attribute_list,material,warranty,generated_description,latency_ms,input_tokens,output_tokens
0,Apple iPhone 15 Pro,"features: A17 Pro chip, 120 Hz ProMotion displ...","titanium frame, Ceramic Shield glass",1‑year limited warranty,Experience the power of the Apple iPhone 15 Pr...,2331.658125,49,57
1,Samsung Galaxy S24 Ultra,"features: 200 MP camera, S‑Pen support, 120 Hz...","Armor Aluminum frame, Gorilla Glass Victus",1‑year limited warranty,Capture brilliance with the Samsung Galaxy S24...,2267.369032,48,62
2,Google Pixel 8 Pro,"features: Tensor G3 chip, Magic Eraser, 50 MP ...","matte glass back, aluminum frame",1‑year limited warranty,Experience the power of the Google Pixel 8 Pro...,2253.291845,47,58
3,Sony WH‑1000XM5 Headphones,"features: active noise cancelling, 30 hr batte...",synthetic leather earcups,1‑year limited warranty,Immerse yourself in pure sound with the Sony W...,1943.329096,44,52
4,Bose QuietComfort Ultra Earbuds,"features: CustomTune sound calibration, ANC, I...",silicone ear tips,1‑year limited warranty,Experience world-class sound and silence with ...,1944.385052,42,55


## 6  Manual evaluation
Use `batch_generate()` to create a DataFrame of model outputs, then add blank rating columns for each criterion plus a `final_score` column. An Excel file is saved so you can fill scores offline or share with peers.

Steps:
1. Run the code cell below (adjust which `call_*` function you pass in).
2. Open the generated `assignment_03_evaluation_sheet.xlsx` and rate each row with **good / ok / bad**.
3. Add a rule for `final_score` (e.g., majority = good, fails if grounding = bad).


**Cost calculator**
Use the helper below to compute cost in USD based on token usage:
```python
outputs_df = add_cost_columns(outputs_df, input_price_per_m=1.5, output_price_per_m=2.0)
```
Set prices to **0** if you ran everything locally on Hugging Face.

In [42]:
# --- Cost computation helper ---
def add_cost_columns(df, input_price_per_m: float, output_price_per_m: float):
    """Add cost columns based on token counts.
    Args:
        df: DataFrame with `input_tokens` and `output_tokens`.
        input_price_per_m: $ per 1M input tokens.
        output_price_per_m: $ per 1M output tokens.
    Returns: DataFrame with extra `cost_usd` column.
    """
    if 'input_tokens' not in df or 'output_tokens' not in df:
        raise ValueError('Token columns missing; run batch_generate first')
    cost_input = df['input_tokens'] * (input_price_per_m / 1000000)
    cost_output = df['output_tokens'] * (output_price_per_m / 1000000)
    df = df.copy()
    df['cost_usd'] = (cost_input + cost_output).round(4)
    return df

# Example usage (set prices to 0 for HF local models):
# outputs_df = add_cost_columns(outputs_df, 0, 0)
# outputs_df.head()

In [43]:
# --- Build evaluation sheet & export to Excel ---

#Update the prices according to the model you used, or leave them at 0 for HF local models
YOUR_MODEL_INPUT_PRICE_PER_M = 0
YOUR_MODEL_OUTPUT_PRICE_PER_M = 0
outputs_df = batch_generate(df_products, call_gemini)  # NOTE: change model function as needed

# Add rating columns (good/ok/bad)
rating_cols = ["fluency", "grammar", "tone", "length", "grounding", "latency", "cost", "final_score"]
for col in rating_cols:
    if col not in outputs_df:
        outputs_df[col] = ""

xlsx_path = "assignment_03_evaluation_sheet.xlsx"

# Add cost columns
outputs_df = add_cost_columns(outputs_df, YOUR_MODEL_INPUT_PRICE_PER_M, YOUR_MODEL_OUTPUT_PRICE_PER_M)

outputs_df.to_excel(xlsx_path, index=False)
print(f"Saved evaluation sheet → {xlsx_path} with {len(outputs_df)} rows")

Saved evaluation sheet → assignment_03_evaluation_sheet.xlsx with 50 rows


## 7  Improvement cycle

Now that you’ve established a baseline score in **Section 6**, iterate to achieve better results.

**Ideas to explore**
- **Prompt tuning** – rewrite the system/user prompts, add few‑shot examples, or enforce stricter constraints.
- **Model choice** – test a different checkpoint (larger ≠ always better), switch from HF to OpenAI or Gemini, or try a domain‑specific model.
- **Temperature / decoding params** – adjust `temperature`, `top_p`, `top_k`, or `max_new_tokens` to balance creativity vs. factuality.
- **Data preprocessing** – clean attribute text, expand abbreviations, or group similar products to feed additional context.
- **Post‑processing** – run grammar‑checking or length trimming after generation.
- **Ensembling / RAG** – combine outputs from two models or ground the prompt with retrieved copy from existing catalog listings.

Document each experiment in a brief bullet list:
1. **What you changed**
2. **Why you expected it to help**
3. **New evaluation scores**

💡 *Goal*: maximize the cumulative score according to your rubric while respecting the go/no‑go rules.

In [67]:
prompt_tmpl = (
    "You are a professional copywriter for a premium online tech store. "
    "Using the product details below, write a compelling and benefit-driven product description in **exactly 50 to 90 words**. "
    "Do not exceed this word range — keep it concise, natural, and persuasive. "
    "Highlight benefits over features, and integrate all product attributes seamlessly. "
    "Use a confident, modern tone that appeals to tech-savvy customers.\n\n"
    "Product name: {product_name}\n"
    "Key features: {Product_attribute_list}\n"
    "Material: {material}\n"
    "Warranty: {warranty}\n\n"
    "Only return the product description (no title, no bullet points, no word count). Begin directly with the description:\n"
)


In [46]:
# --- Option C: Gemini ---
def call_gemini(
    prompt: str,
    model_name: str = "models/gemini-1.5-pro",
    api_key: str = gemini_api_key,
    temperature: float = 0.85,
    top_p: float = 0.95,
    top_k: int = 50
) -> dict:
    """
    Call Google Gemini API and return generated text and usage metrics.

    Args:
        prompt (str): The user prompt.
        model_name (str): Gemini model to use.
        api_key (str): Google API Key.
        temperature (float): Controls randomness.
        top_p (float): Controls nucleus sampling.
        top_k (int): Limits candidates to top-k tokens by probability.

    Returns:
        dict: {
            "text": generated description (str),
            "latency_ms": latency in milliseconds (float),
            "input_tokens": tokens sent (int),
            "output_tokens": tokens received (int)
        }
    """

    if not api_key:
        raise ValueError("api_key must be provided for Google Gemini API")

    genai.configure(api_key=api_key)

    # Initialize model
    model = genai.GenerativeModel(model_name)

    # Measure start time
    start_time = time.time()

    try:
        response = model.generate_content(
            prompt,
            generation_config=genai.types.GenerationConfig(
                temperature=temperature,
                top_p=top_p,
                top_k=top_k
            )
        )
        end_time = time.time()
        latency_ms = (end_time - start_time) * 1000

        text = response.text.strip() if hasattr(response, "text") else ""

        # Token accounting (Gemini SDK does not expose usage directly yet)
        input_tokens = len(prompt.split())
        output_tokens = len(text.split())

        return {
            "text": text,
            "latency_ms": latency_ms,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens
        }

    except Exception as e:
        print(f"Gemini API Error: {e}")
        return {
            "text": "",
            "latency_ms": 0,
            "input_tokens": 0,
            "output_tokens": 0
        }


In [47]:
# --- Batch generation helper (type‑safe) ---
from typing import Callable
import pandas as pd

from typing import Callable, Dict
import pandas as pd

def batch_generate(
    sample_df: pd.DataFrame,
    call_model_fn: Callable[[str], Dict[str, object]],
    prompt_template: str = prompt_tmpl,
) -> pd.DataFrame:
    """Generate descriptions and metrics for each row in *sample_df*.

    The model-calling function *must* return a dict with keys:
    - ``text`` (str) – generated description
    - ``latency_ms`` (float | None)
    - ``input_tokens`` (int | None)
    - ``output_tokens`` (int | None)
    """
    if not isinstance(sample_df, pd.DataFrame):
        raise TypeError("sample_df must be a pandas DataFrame")
    if not callable(call_model_fn):
        raise TypeError("call_model_fn must be callable")

    outputs = []
    for _, row in sample_df.iterrows():
        prompt = prompt_template.format(**row.to_dict())
        out = call_model_fn(prompt)
        if not isinstance(out, dict) or 'text' not in out:
            raise ValueError("call_model_fn must return a dict with at least a 'text' field")
        outputs.append(out)

    result_df = sample_df.copy()
    result_df["generated_description"] = [o["text"] for o in outputs]
    result_df["latency_ms"] = [o.get("latency_ms") for o in outputs]
    result_df["input_tokens"] = [o.get("input_tokens") for o in outputs]
    result_df["output_tokens"] = [o.get("output_tokens") for o in outputs]
    return result_df


demo_df = batch_generate(df_products[:5], call_gemini)
demo_df.head()

Unnamed: 0,product_name,Product_attribute_list,material,warranty,generated_description,latency_ms,input_tokens,output_tokens
0,Apple iPhone 15 Pro,"features: A17 Pro chip, 120 Hz ProMotion displ...","titanium frame, Ceramic Shield glass",1‑year limited warranty,Experience blistering performance with the iPh...,2446.162939,111,62
1,Samsung Galaxy S24 Ultra,"features: 200 MP camera, S‑Pen support, 120 Hz...","Armor Aluminum frame, Gorilla Glass Victus",1‑year limited warranty,Capture brilliance with the Samsung Galaxy S24...,2200.408936,110,66
2,Google Pixel 8 Pro,"features: Tensor G3 chip, Magic Eraser, 50 MP ...","matte glass back, aluminum frame",1‑year limited warranty,Experience the next evolution of mobile with t...,2273.229837,109,65
3,Sony WH‑1000XM5 Headphones,"features: active noise cancelling, 30 hr batte...",synthetic leather earcups,1‑year limited warranty,Experience sound redefined with the Sony WH-10...,2089.838028,106,57
4,Bose QuietComfort Ultra Earbuds,"features: CustomTune sound calibration, ANC, I...",silicone ear tips,1‑year limited warranty,Experience world-class sound personalized for ...,2394.515991,104,51


In [57]:
# --- Build improvement sheet & export to Excel ---

outputs_df = batch_generate(df_products.head(), call_gemini)  # NOTE: change model function as needed


xlsx_path = "assignment_03_improvement_sheet.xlsx"

outputs_df.to_excel(xlsx_path, index=False)
print(f"Saved evaluation sheet → {xlsx_path} with {len(outputs_df)} rows")

Saved evaluation sheet → assignment_03_improvement_sheet.xlsx with 5 rows


# Part 2 – Judging Language Models (JLMs)

In [79]:
judge_prompt_tmpl = """
You are an expert evaluator of marketing copy. Based on the product information and the generated description below, evaluate the description using **only** these labels: "good", "ok", or "bad".

Rate each of the following categories:
- fluency
- grammar
- tone
- length (should be between 50–90 words)
- grounding (faithfulness to product attributes)
- latency (perceived delivery speed / verbosity)
- cost (efficiency of wording)
- final_score (overall quality)

**Product Details**
- Product name: {product_name}
- Key features: {Product_attribute_list}
- Material: {material}
- Warranty: {warranty}

**Generated Description**
{generated_description}

Return your evaluation in the following plain text format:

fluency: <good|ok|bad>  
grammar: <good|ok|bad>  
tone: <good|ok|bad>  
length: <good|ok|bad>  
grounding: <good|ok|bad>  
latency: <good|ok|bad>  
cost: <good|ok|bad>  
final_score: <good|ok|bad>
"""


In [80]:
# --- Batch generation helper (type‑safe) ---
from typing import Callable
import pandas as pd

from typing import Callable, Dict
import pandas as pd

def batch_generate(
    sample_df: pd.DataFrame,
    call_model_fn: Callable[[str], Dict[str, object]],
    prompt_template: str = prompt_tmpl,
) -> pd.DataFrame:
    """Generate descriptions and metrics for each row in *sample_df*.

    The model-calling function *must* return a dict with keys:
    - ``text`` (str) – generated description
    """
    if not isinstance(sample_df, pd.DataFrame):
        raise TypeError("sample_df must be a pandas DataFrame")
    if not callable(call_model_fn):
        raise TypeError("call_model_fn must be callable")

    outputs = []
    for _, row in sample_df.iterrows():
        prompt = prompt_template.format(**row.to_dict())
        out = call_model_fn(prompt)
        if not isinstance(out, dict) or 'text' not in out:
            raise ValueError("call_model_fn must return a dict with at least a 'text' field")
        outputs.append(out)

    result_df = sample_df.copy()
    result_df["generated_description"] = [o["text"] for o in outputs]
    return result_df



In [82]:
jlm_df = batch_generate(outputs_df.head(), call_gemini, prompt_template=judge_prompt_tmpl)  # NOTE: change model function as needed

xlsx_path = "assignment_03_jlm_sheet.xlsx"

jlm_df.to_excel(xlsx_path, index=False)
print(f"Saved evaluation sheet → {xlsx_path} with {len(outputs_df)} rows")

Saved evaluation sheet → assignment_03_jlm_sheet.xlsx with 5 rows
