# Theory: Installing Required Libraries

Before running prompt debugging and evaluation metrics, you need to install the necessary Python libraries:

- `sacrebleu`: For BLEU score calculation.
- `rouge_score`: For ROUGE metrics.
- `sentence_transformers`: For semantic similarity.

The following code cell installs these packages using pip.


In [None]:
!pip install sacrebleu rouge_score sentence_transformers

Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
Collecting portalocker (from sacrebleu)
  Downloading portalocker-3.2.0-py3-none-any.whl.metadata (8.7 kB)
Collecting tabulate>=0.8.9 (from sacrebleu)
  Using cached tabulate-0.9.0-py3-none-any.whl.metadata (34 kB)
Collecting lxml (from sacrebleu)
  Downloading lxml-6.0.2-cp313-cp313-win_amd64.whl.metadata (3.7 kB)
Downloading sacrebleu-2.5.1-py3-none-any.whl (104 kB)
Using cached tabulate-0.9.0-py3-none-any.whl (35 kB)
Downloading lxml-6.0.2-cp313-cp313-win_amd64.whl (4.0 MB)
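After installation, it can be useful to confirm that the three packages are importable from the current kernel before running the later cells. The following is a minimal sketch using only the standard library; the helper name `find_missing` is hypothetical and not part of any of the installed packages.

```python
import importlib.util

# Import names for the three packages installed above.
REQUIRED_MODULES = ["sacrebleu", "rouge_score", "sentence_transformers"]


def find_missing(modules):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, re-running the install cell and restarting the kernel is usually the first thing to try.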

# Theory: Prompt Debugging

Prompt debugging involves generating variations of prompts and testing their outputs. This helps in identifying the most effective prompt for a given task.

- `generate_variations`: Fills the `{placeholder}` slots of a base prompt with every combination (Cartesian product) of the supplied values.
- `mock_runner`: Simulates a model response (replace with actual model call for real use).
- `run_prompt_tests`: Runs all prompt variations and collects responses.

The following code cell demonstrates prompt debugging.


In [None]:
from typing import List, Dict
import itertools


# 1. Generate prompt variations
def generate_variations(
    base_prompt: str, variations: Dict[str, List[str]]
) -> List[str]:
    keys = list(variations.keys())
    combos = list(itertools.product(*(variations[k] for k in keys)))
    prompts = []
    for combo in combos:
        p = base_prompt
        for k, v in zip(keys, combo):
            p = p.replace("{" + k + "}", v)
        prompts.append(p)
    return prompts

In [None]:
# 2. Mock runner (replace with OpenAI/HF runner)
def mock_runner(prompt: str) -> str:
    return "Response to: " + prompt

In [None]:
# 3. Run tests
def run_prompt_tests(prompts: List[str]) -> List[Dict[str, str]]:
    results = []
    for p in prompts:
        resp = mock_runner(p)
        results.append({"prompt": p, "response": resp})
    return results

In [None]:
# Example Usage 1
base = "Translate into {tone} English: '{sentence}'"
# Named `variations` rather than `vars` to avoid shadowing the built-in vars().
variations = {"tone": ["formal", "casual"], "sentence": ["Bonjour", "Comment ça va?"]}

prompts = generate_variations(base, variations)
results = run_prompt_tests(prompts)

for r in results:
    print("Prompt:", r["prompt"])
    print("Response:", r["response"])
    print("---")

Prompt: Translate into formal English: 'Bonjour'
Response: Response to: Translate into formal English: 'Bonjour'
---
Prompt: Translate into formal English: 'Comment ça va?'
Response: Response to: Translate into formal English: 'Comment ça va?'
---
Prompt: Translate into casual English: 'Bonjour'
Response: Response to: Translate into casual English: 'Bonjour'
---
Prompt: Translate into casual English: 'Comment ça va?'
Response: Response to: Translate into casual English: 'Comment ça va?'
---

In [None]:
# Example Usage 2 (with different sentences)
base2 = "Summarize in {style} style: '{text}'"
vars2 = {
    "style": ["bullet", "paragraph"],
    "text": ["The sun rises in the east.", "Water is essential for life."],
}

prompts2 = generate_variations(base2, vars2)
results2 = run_prompt_tests(prompts2)

for r in results2:
    print("Prompt:", r["prompt"])
    print("Response:", r["response"])
    print("---")
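When comparing variations, it helps to know which placeholder values produced each response and to be able to swap the mock runner for a real model call without editing the test loop. The sketch below is one possible way to do both. The names `labeled_variations`, `run_with`, and `echo_runner` are hypothetical helpers introduced for illustration; they are not part of the notebook above or of any library.

```python
import itertools
from typing import Callable, Dict, List


def labeled_variations(base_prompt: str, variations: Dict[str, List[str]]) -> List[dict]:
    """Like generate_variations, but keeps the values used for each prompt."""
    keys = list(variations)
    out = []
    for combo in itertools.product(*(variations[k] for k in keys)):
        assignment = dict(zip(keys, combo))
        prompt = base_prompt
        for key, value in assignment.items():
            prompt = prompt.replace("{" + key + "}", value)
        out.append({"variables": assignment, "prompt": prompt})
    return out


def run_with(runner: Callable[[str], str], items: List[dict]) -> List[dict]:
    """Call `runner` on every prompt and attach its response."""
    return [{**item, "response": runner(item["prompt"])} for item in items]


def echo_runner(prompt: str) -> str:
    # Stand-in for a real model call; behaves like the mock runner.
    return "Response to: " + prompt


items = labeled_variations(
    "Summarize in {style} style: '{text}'",
    {"style": ["bullet", "paragraph"], "text": ["The sun rises in the east."]},
)
for row in run_with(echo_runner, items):
    print(row["variables"]["style"], "->", row["response"])
```

Passing the runner as an argument means a real client only needs to be wrapped in a function that takes a prompt string and returns a response string.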

# Theory: Performance Evaluation Metrics

Performance metrics help evaluate the quality of generated text compared to reference outputs.

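- **Accuracy**: Fraction of predictions that exactly match their reference (here, after stripping surrounding whitespace).
- **BLEU**: Scores n-gram precision of the prediction against the reference, with a brevity penalty for outputs shorter than the reference (commonly used for translation).
- **ROUGE**: Measures overlap with the reference; ROUGE-1 counts shared unigrams and ROUGE-L uses the longest common subsequence (commonly used for summarization).
- **Semantic Similarity**: Cosine similarity between sentence embeddings of the prediction and the reference, which reflects closeness in meaning rather than wording.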

The following code cell defines functions for each metric and demonstrates their usage.


In [None]:
from typing import List
import sacrebleu
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util


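# 1. Accuracy / Exact Match
def accuracy(preds: List[str], refs: List[str]) -> float:
    # Guard against division by zero and against zip() silently truncating.
    if not preds or len(preds) != len(refs):
        raise ValueError("preds and refs must be non-empty and the same length")
    return sum(p.strip() == r.strip() for p, r in zip(preds, refs)) / len(preds)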


# 2. BLEU Score
def bleu(preds: List[str], refs: List[str]) -> float:
    return sacrebleu.corpus_bleu(preds, [refs]).score


# 3. ROUGE Scores
def rouge(preds: List[str], refs: List[str]):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    scores = [scorer.score(r, p) for p, r in zip(preds, refs)]
    return scores


# 4. Semantic Similarity
def semantic_similarity(preds: List[str], refs: List[str]) -> List[float]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    pred_emb = model.encode(preds, convert_to_tensor=True)
    ref_emb = model.encode(refs, convert_to_tensor=True)
    sims = util.cos_sim(pred_emb, ref_emb)
    return [float(sims[i, i]) for i in range(len(preds))]


# Example Usage
preds = ["Hello", "How are you?"]
refs = ["Hello", "How are you?"]

print("Accuracy:", accuracy(preds, refs))
print("BLEU:", bleu(preds, refs))
print("ROUGE:", rouge(preds, refs))
print("Semantic similarity:", semantic_similarity(preds, refs))

Accuracy: 1.0
BLEU: 100.00000000000004
ROUGE: [{'rouge1': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rougeL': Score(precision=1.0, recall=1.0, fmeasure=1.0)}, {'rouge1': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rougeL': Score(precision=1.0, recall=1.0, fmeasure=1.0)}]


Semantic similarity: [1.0000001192092896, 1.000000238418579]
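The similarity values above are marginally greater than 1.0, which is most likely floating-point rounding in the embedding arithmetic rather than a meaningful result, since cosine similarity is mathematically bounded by -1 and 1. The following pure-Python sketch shows the underlying formula and one way to clamp the result; `cosine_similarity` is a hypothetical helper written for illustration, not the `util.cos_sim` implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    # Clamp so rounding error cannot push the result outside [-1, 1].
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
```

Orthogonal vectors score 0 and vectors pointing in the same direction score 1 (up to rounding), regardless of their lengths.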


# Theory: Example Metric Calculation

This cell demonstrates how to use the installed libraries to calculate BLEU, ROUGE, and semantic similarity scores for sample outputs.

- BLEU compares n-gram overlap.
- ROUGE compares sequence overlap.
- Semantic similarity uses embeddings to compare meaning.

You can modify the reference and candidate outputs for your own evaluation.


In [None]:
# Reference and candidate outputs
reference = ["The cat is on the mat"]
candidate = "The cat is on mat"

# BLEU Score (stored as `bleu_result` so the bleu() helper defined earlier is not overwritten)
bleu_result = sacrebleu.corpus_bleu([candidate], [reference])
print("BLEU:", bleu_result.score)

# ROUGE Score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference[0], candidate)
print("ROUGE-1:", scores["rouge1"].fmeasure)
print("ROUGE-L:", scores["rougeL"].fmeasure)

# Semantic Similarity
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
ref_emb = model.encode(reference[0], convert_to_tensor=True)
cand_emb = model.encode(candidate, convert_to_tensor=True)
similarity = util.cos_sim(ref_emb, cand_emb)
print("Cosine Similarity:", float(similarity))

BLEU: 57.89300674674101
ROUGE-1: 0.9090909090909091
ROUGE-L: 0.9090909090909091
Cosine Similarity: 0.992526650428772
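To see where the BLEU and ROUGE numbers above come from, the scores for this sentence pair can be recomputed by hand. The sketch below is a simplified, pure-Python version under stated assumptions: whitespace tokenization, a single reference, no smoothing for BLEU, and lowercasing without stemming for ROUGE. The function names (`simple_bleu`, `simple_rouge`, and the helpers) are hypothetical, and the code is not how `sacrebleu` or `rouge_score` are implemented, so results may differ on other inputs.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Single-reference BLEU without smoothing, as a percentage."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # The brevity penalty lowers the score when the candidate is shorter than the reference.
    brevity_penalty = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f_measure(overlap, cand_len, ref_len):
    """Harmonic mean of precision (overlap/cand_len) and recall (overlap/ref_len)."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / cand_len, overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def simple_rouge(candidate, reference):
    """ROUGE-1 and ROUGE-L F-measures on lowercased, whitespace-split tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    unigram_overlap = sum((Counter(cand) & Counter(ref)).values())
    return {
        "rouge1": f_measure(unigram_overlap, len(cand), len(ref)),
        "rougeL": f_measure(lcs_length(cand, ref), len(cand), len(ref)),
    }


reference = "The cat is on the mat"
candidate = "The cat is on mat"
print("BLEU:", round(simple_bleu(candidate, reference), 2))  # 57.89
print("ROUGE:", {k: round(v, 4) for k, v in simple_rouge(candidate, reference).items()})  # both 0.9091
```

For this pair the n-gram precisions are 5/5, 3/4, 2/3, and 1/2, whose geometric mean is about 0.707; multiplied by the brevity penalty exp(1 - 6/5) ≈ 0.819 this gives roughly 57.89. For ROUGE, all five candidate tokens appear in order in the six-token reference, so precision is 1, recall is 5/6, and the F-measure is 10/11 ≈ 0.909 for both ROUGE-1 and ROUGE-L.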

