Authored by: Aryan Mistry

# Evaluating and Safeguarding LLM Applications

Language models can produce fluent and impressive outputs, but it's crucial to evaluate how well they perform and to guard against undesirable behaviour. In this notebook we'll cover both **evaluation metrics** (how to measure quality) and **safety checks** (how to detect unfaithful or inappropriate content). We'll start with simple automatic metrics like BLEU, move on to heuristics for faithfulness, and end with structured output validation and basic guardrails. [12][13][14][15]

## 1. Evaluation Metrics

To quantify how good a model's output is, we can measure it against a reference text. Common metrics include:

- **Perplexity:** Roughly speaking, perplexity measures how *surprised* a model is by the correct answer. Lower perplexity means the model assigns higher probability to the target sequence. It's often used when you have a trained model and want to see how well it fits your data.
- **BLEU:** Stands for Bilingual Evaluation Understudy. BLEU is precision-oriented: it measures what fraction of the candidate's n-grams (contiguous word sequences) also appear in the reference. It was designed for machine translation and is also used for other generation tasks such as summarisation.
- **ROUGE:** Recall-Oriented Understudy for Gisting Evaluation. ROUGE focuses on recall (how much of the reference text appears in the candidate). It's popular for summarisation.

No single metric is perfect, and automatic scores should be complemented with human judgments, especially for open‑ended outputs. [12][13]
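
As a rough illustration of the perplexity and ROUGE bullets, the sketch below computes perplexity from a list of per-token log-probabilities and a unigram ROUGE recall. The names `perplexity` and `rouge1_recall` and the example log-probabilities are made up for this illustration. A real evaluation would take log-probabilities from the model and would typically use an established ROUGE package.

```python
import math
from collections import Counter

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-probability per token.

    `token_logprobs` holds natural-log probabilities that a model assigned to
    each token of the target sequence (hypothetical input for this sketch).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference words that also appear in the candidate."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word (clipped overlap).
    overlap = sum((cand_counts & ref_counts).values())
    return overlap / max(sum(ref_counts.values()), 1)

# A model that gives every token probability 0.25 has perplexity 4.
uniform = [math.log(0.25)] * 5
print(f"Perplexity: {perplexity(uniform):.2f}")

recall = rouge1_recall('the cat sat on the mat', 'the cat is sitting on the mat')
print(f"ROUGE-1 recall: {recall:.2f}")
```

Note that the recall divides by the length of the reference, whereas the BLEU-style precision used later in this notebook divides by the length of the candidate.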

### 2. Unigram BLEU

BLEU measures n-gram overlap between a candidate sentence (the model output)
and a reference sentence (ground truth). We'll start with the simplest case:
a **unigram BLEU** score that counts single-word overlaps. The score is a
precision: the number of candidate words that also occur in the reference
(each reference word can be matched at most once), divided by the total number
of words in the candidate. A full BLEU implementation also applies a brevity
penalty to short candidates and takes the geometric mean of the precisions
over several n-gram orders.

Run the cell below to implement and test unigram BLEU. [12]


In [None]:

from collections import Counter

def unigram_bleu(candidate: str, reference: str) -> float:
    """Compute unigram BLEU precision.

    Counts overlapping single words between the candidate and reference, ignoring case.
    Tokens are split on whitespace, so punctuation stays attached to words.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    ref_counts = Counter(ref_tokens)
    match_count = 0
    for token in cand_tokens:
        # Clip matches: each reference word can be matched at most once.
        if ref_counts.get(token, 0) > 0:
            match_count += 1
            ref_counts[token] -= 1
    # max(..., 1) avoids division by zero for an empty candidate.
    return match_count / max(len(cand_tokens), 1)

candidate = "humans landed on the moon with apollo 11"
reference = "A small step for man, a giant leap for mankind. The Apollo 11 mission landed humans on the moon."
print(f"Unigram BLEU: {unigram_bleu(candidate, reference):.2f}")


# Example usage:
if __name__ == '__main__':
    cand = 'the cat sat on the mat'
    ref = 'the cat is sitting on the mat'
    print('Unigram BLEU:', unigram_bleu(cand, ref))

Unigram BLEU: 0.75
Unigram BLEU: 0.8333333333333334
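
One detail worth noticing in the first example: `split()` leaves punctuation attached, so the reference token `moon.` does not match the candidate token `moon`. A minimal sketch of one possible fix follows. The helper `simple_tokenize` and its regular expression are illustrative choices rather than part of the cell above, and real BLEU toolkits apply their own tokenisation rules.

```python
import re
from collections import Counter

def simple_tokenize(text: str) -> list:
    """Lowercase the text and keep runs of word characters (hypothetical helper)."""
    return re.findall(r"\w+", text.lower())

def unigram_precision(cand_tokens, ref_tokens) -> float:
    """Clipped unigram precision over pre-tokenised input."""
    ref_counts = Counter(ref_tokens)
    matches = 0
    for token in cand_tokens:
        if ref_counts[token] > 0:
            matches += 1
            ref_counts[token] -= 1
    return matches / max(len(cand_tokens), 1)

candidate = "humans landed on the moon with apollo 11"
reference = "A small step for man, a giant leap for mankind. The Apollo 11 mission landed humans on the moon."

whitespace_score = unigram_precision(candidate.lower().split(), reference.lower().split())
regex_score = unigram_precision(simple_tokenize(candidate), simple_tokenize(reference))
print(f"Whitespace tokens: {whitespace_score:.2f}")
print(f"Regex tokens:      {regex_score:.2f}")
```

The regex version scores higher only because `moon` now matches; the comparison logic is unchanged.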


### 3. Bigram BLEU

To capture short phrases rather than individual words, we can extend BLEU to
**bigrams** (two-word sequences). The procedure is the same: count the
candidate bigrams that also occur in the reference and divide by the number of
bigrams in the candidate. We'll implement bigram BLEU below. [12]


In [None]:

from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences in `tokens` as tuples."""
    return [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

def bigram_bleu(candidate: str, reference: str) -> float:
    """Compute bigram BLEU precision.

    This version counts overlapping two‑word sequences between the candidate and reference.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    cand_bigrams = ngrams(cand_tokens, 2)
    ref_bigrams = ngrams(ref_tokens, 2)
    ref_counts = Counter(ref_bigrams)
    match_count = 0
    for bg in cand_bigrams:
        if ref_counts.get(bg, 0) > 0:
            match_count += 1
            ref_counts[bg] -= 1
    return match_count / max(len(cand_bigrams), 1)

candidate_big = "Neil Armstrong walked on the moon with Buzz Aldrin"
reference_big = "Buzz Aldrin and Neil Armstrong explored the moon during the Apollo missions."
print(f"Bigram BLEU: {bigram_bleu(candidate_big, reference_big):.2f}")


# Example usage:
if __name__ == '__main__':
    cand = 'the cat sat on the mat'
    ref = 'the cat is sitting on the mat'
    print('Bigram BLEU:', bigram_bleu(cand, ref))

Bigram BLEU: 0.38
Bigram BLEU: 0.6
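
Putting the pieces together, the sketch below combines unigram and bigram precision with a brevity penalty, roughly following the formulation in Papineni et al. [12]: the geometric mean of the clipped n-gram precisions, multiplied by exp(1 - r/c) when the candidate length c does not exceed the reference length r. This is a simplified single-reference version with no smoothing, and `simple_bleu` and `clipped_precision` are names chosen for this illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(cand_tokens, ref_tokens, n) -> float:
    """Share of candidate n-grams found in the reference (each used at most once)."""
    cand_counts = Counter(ngrams(cand_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum((cand_counts & ref_counts).values())
    return matched / total

def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    precisions = [clipped_precision(cand, ref, n) for n in range(1, max_n + 1)]
    # Without smoothing, a zero precision at any order makes the whole score zero.
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: 1 if the candidate is longer than the reference,
    # otherwise exp(1 - r/c), which is also 1 when the lengths are equal.
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * geo_mean

score = simple_bleu('the cat sat on the mat', 'the cat is sitting on the mat')
print(f"BLEU (n <= 2, with brevity penalty): {score:.3f}")
```

Because the candidate here is one word shorter than the reference, the penalty pulls the score below the plain geometric mean of the two precisions printed above.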


#### Exercises (Metrics)

1. **Implement trigram BLEU.** Write a function `trigram_bleu(candidate, reference)` that computes the precision over three-word sequences. Test it on your own examples.
2. **Multiple references.** Modify your BLEU functions to accept a list of reference sentences and compute the maximum match count across references. How does this change the scores?
3. **Comparing metrics.** For a set of candidate sentences and references, compute unigram and bigram BLEU scores. Identify cases where the two metrics disagree and think about why.


## 4. Faithfulness Heuristics

Overlap metrics such as BLEU often fail to detect hallucinations: statements
that sound plausible but aren't supported by the source. A simple way to check
faithfulness is to verify whether important words in the answer also appear
in the context. We'll implement a basic heuristic below.

First we define a set of common stop words (articles, conjunctions, etc.) and
then extract the remaining keywords from a text. Our `is_faithful` function
checks whether every keyword in the answer also appears among the context's
keywords. This is a very naive check: it ignores word order and meaning, so
high-quality assessments often require fact-checking pipelines or human
evaluation. [14]


In [None]:

import string

STOP_WORDS = {
    'the','a','an','in','on','for','of','and','to','with','is','are','was','were',
    'this','that','these','those','as','by','at','from'
}

def extract_keywords(text: str) -> set:
    """Extract keywords from a text by removing stop words and punctuation.

    Useful for naive faithfulness checks.
    """
    translator = str.maketrans('', '', string.punctuation)
    words = text.translate(translator).lower().split()
    return {w for w in words if w not in STOP_WORDS}

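def is_faithful(answer: str, context: str) -> bool:
    """Return True if every keyword in the answer also appears in the context."""
    ans_kw = extract_keywords(answer)
    ctx_kw = extract_keywords(context)
    # Set comparison: ans_kw must be a subset of ctx_kw.
    return ans_kw <= ctx_kw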

context_text = "Humans landed on the moon during the Apollo missions."
answer_true = "Humans landed on the moon during the Apollo missions."
answer_false = "Humans landed on Mars during the Apollo missions."
print("Answer 1 faithful?", is_faithful(answer_true, context_text))
print("Answer 2 faithful?", is_faithful(answer_false, context_text))


Answer 1 faithful? True
Answer 2 faithful? False
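
To see how naive this check is, consider the sketch below. It repeats the same keyword-subset test in compact form so that the snippet runs on its own, then feeds it two answers: one that reuses the context's words but changes the claim, and one that paraphrases the context correctly. Both example sentences are made up for illustration.

```python
import string

STOP_WORDS = {'the', 'a', 'an', 'in', 'on', 'for', 'of', 'and', 'to', 'with',
              'is', 'are', 'was', 'were', 'this', 'that', 'these', 'those',
              'as', 'by', 'at', 'from'}

def extract_keywords(text: str) -> set:
    translator = str.maketrans('', '', string.punctuation)
    words = text.translate(translator).lower().split()
    return {w for w in words if w not in STOP_WORDS}

def is_faithful(answer: str, context: str) -> bool:
    return extract_keywords(answer) <= extract_keywords(context)

context = "Humans landed on the moon during the Apollo missions."

# Same keywords, different claim: a set comparison ignores word order.
reordered = "The moon landed on humans during the Apollo missions."
# A correct paraphrase that uses a word absent from the context.
paraphrase = "People landed on the moon during the Apollo missions."

reordered_ok = is_faithful(reordered, context)
paraphrase_ok = is_faithful(paraphrase, context)
print("Reordered answer passes?", reordered_ok)
print("Paraphrase passes?", paraphrase_ok)
```

The reordered answer passes even though its claim is wrong, and the paraphrase fails even though it is correct. The heuristic can therefore err in both directions.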


#### Exercises (Faithfulness)

1. **Expand the stop word list.** Add more common words to `STOP_WORDS` so that function words missing from the context do not cause an otherwise faithful answer to be flagged as unfaithful. Consider words like "it", "they", "have", etc.
2. **Allow synonyms.** Modify `extract_keywords` to replace synonyms (e.g. "man" with "humans") using a simple mapping. How does this affect faithfulness detection?
3. **Compute recall and precision.** Implement functions to compute the recall and precision of answer keywords relative to the context keywords, instead of just a boolean faithful/unfaithful output.


## 5. Validating Structured Outputs

When a language model is asked to call a tool or produce a structured response,
we often require the result to match a particular schema. For example, a
weather API might expect a dictionary with keys `city` and `temperature`.

The function below verifies that a JSON string can be parsed and contains
expected keys with correct Python types. This simple check can catch obvious
errors before passing data to downstream systems. [15]


In [None]:

import json

def validate_json(output: str, schema: dict) -> bool:
    """Validate JSON output against a simple schema.

    The schema maps keys to expected Python types. Returns True if the output
    parses to a JSON object in which all keys exist and have the correct type.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    # Valid JSON can also be a list, number or string; only an object has keys.
    if not isinstance(data, dict):
        return False
    for key, typ in schema.items():
        if key not in data or not isinstance(data[key], typ):
            return False
    return True

schema = {'city': str, 'temperature': float}
valid_output = '{"city": "London", "temperature": 18.5}'
invalid_output = '{"temperature": "warm"}'
print("Valid output passes?", validate_json(valid_output, schema))
print("Invalid output passes?", validate_json(invalid_output, schema))


Valid output passes? True
Invalid output passes? False
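
One subtlety with type checks on parsed JSON: `json.loads` returns an `int` for `18` and a `float` for `18.5`, so a schema entry of `float` rejects whole-number temperatures. Because `isinstance` accepts a tuple of types, one possible workaround is to allow several types per key. The sketch below assumes that convention. The `check_types` helper and the example payloads are hypothetical.

```python
import json

def check_types(output: str, schema: dict) -> bool:
    """Return True if `output` is a JSON object whose keys match `schema`.

    Schema values may be a single type or a tuple of accepted types.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    for key, typ in schema.items():
        if key not in data:
            return False
        value = data[key]
        accepted = typ if isinstance(typ, tuple) else (typ,)
        # bool is a subclass of int in Python, so reject it explicitly
        # unless the schema lists bool as an accepted type.
        if isinstance(value, bool) and bool not in accepted:
            return False
        if not isinstance(value, accepted):
            return False
    return True

strict = {'city': str, 'temperature': float}
relaxed = {'city': str, 'temperature': (int, float)}
whole_number = '{"city": "London", "temperature": 18}'

print("Strict schema accepts 18? ", check_types(whole_number, strict))
print("Relaxed schema accepts 18?", check_types(whole_number, relaxed))
print("Relaxed schema accepts true?",
      check_types('{"city": "London", "temperature": true}', relaxed))
```

For anything beyond a flat dictionary of primitive types, a dedicated schema validation library is usually a better fit than hand-written checks.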


## 6. Simple Guardrails

In safety-critical applications you may wish to block or flag outputs that
contain sensitive or disallowed content. Here we'll implement a rudimentary
check that looks for forbidden keywords in a piece of text. In reality, more
sophisticated approaches use classifiers or pattern matching to detect
personally identifiable information (PII), toxicity or hallucinations. [15]


In [None]:

def contains_forbidden_content(text: str, forbidden_keywords: set) -> bool:
    """Check for forbidden keywords in text.

    Useful for building simple guardrails to flag sensitive topics.
    Matching is a case-insensitive substring search, so "password" also
    matches inside "passwords".
    """
    lower = text.lower()
    return any(keyword.lower() in lower for keyword in forbidden_keywords)

forbidden = {"social security number", "credit card", "password"}
print(contains_forbidden_content("My password is 12345", forbidden))
print(contains_forbidden_content("It's a sunny day", forbidden))


True
False
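
The pattern-matching approach mentioned above can be sketched with regular expressions. The example below uses word boundaries so that a keyword only matches as a whole word, and adds one illustrative pattern for strings shaped like a US Social Security number. The patterns, the labels and the `find_sensitive` name are assumptions for demonstration; production PII detection needs far broader coverage.

```python
import re

# Illustrative patterns only: label -> compiled regular expression.
SENSITIVE_PATTERNS = {
    "password": re.compile(r"\bpassword\b", re.IGNORECASE),
    "credit card": re.compile(r"\bcredit\s+card\b", re.IGNORECASE),
    # Three digits, two digits and four digits separated by hyphens.
    "ssn-like number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_sensitive(text: str) -> list:
    """Return the labels of all patterns that match somewhere in `text`."""
    return [label for label, pattern in SENSITIVE_PATTERNS.items()
            if pattern.search(text)]

print(find_sensitive("My Password is 12345"))
print(find_sensitive("Reach me at 123-45-6789"))
print(find_sensitive("It's a sunny day"))
```

Returning labels rather than a bare boolean makes it easier to log why an output was flagged.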


#### Exercises (Guardrails)

1. **Extend forbidden lists.** Create different sets of forbidden words for different contexts (e.g. medical information, financial data). Test your function on a variety of outputs.
2. **Report matches.** Modify `contains_forbidden_content` to report not just True/False but which keyword(s) were found and at what positions.
3. **Combine checks.** Compose the BLEU, faithfulness and guardrail functions into a single `review_answer` function that rejects answers with low BLEU, unfaithful content or forbidden keywords.


Foundational LLMs & Transformers
1. Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017.
2. Brown, T. B., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.
3. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019.
4. OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774.
5. Touvron, H., et al. (2023). LLaMA 2: Open Foundation and Fine-Tuned Chat Models. Meta AI.

Generative AI & Sampling
6. Goodfellow, I., et al. (2014). Generative Adversarial Nets. NeurIPS 2014.
7. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
8. Neal, R. M. (1993). Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, University of Toronto.

Retrieval-Augmented Generation (RAG) & Knowledge Grounding
9. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
10. deepset ai (2023). Haystack: Open-Source Framework for Search and RAG Applications. https://haystack.deepset.ai
11. LangChain (2023). LangChain Documentation and Cookbook. https://python.langchain.com

Evaluation & Safety
12. Papineni, K., et al. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. ACL 2002.
13. Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. ACL Workshop 2004.
14. OpenAI (2024). Evaluating Model Outputs: Faithfulness and Grounding. OpenAI Docs.
15. Guardrails AI (2024). Open-Source Guardrails Framework. https://github.com/shreyar/guardrails

Prompt Engineering & Instruction Tuning
16. White, J. (2023). The Prompting Guide. https://www.promptingguide.ai
17. Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022.

Agents & Tool Use
18. Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
19. LangChain (2024). LangChain Agents and Tools Documentation.
20. Microsoft (2023). Semantic Kernel Developer Guide. https://learn.microsoft.com/en-us/semantic-kernel/
21. Google DeepMind (2024). Gemini Technical Report. arXiv:2312.11805.

State, Memory & Orchestration
22. LangGraph (2024). Stateful Agent Orchestration Framework. https://langchain-langgraph.vercel.app
23. Park, J. S., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442.

Pedagogical and Course Design References
24. fast.ai (2023). fast.ai Deep Learning Course Notebooks. https://course.fast.ai
25. Ng, A. (2023). DeepLearning.AI Short Courses on Generative AI.
26. MIT 6.S191, Stanford CS324, UC Berkeley CS294-158. (2022–2024). Course Materials and Public Notebooks for ML and LLMs.