# Lab 2 – Guardrailing an LLM with Guardrails.ai

> Build input/output validators to filter toxic language and enforce a JSON schema.

## Environment Setup
This lab requires an API key for your LLM provider; keys for the other services are optional.

| Variable | Purpose |
|----------|---------|
| `OPENAI_API_KEY` | Access to OpenAI chat/completions for guardrail validation & LLM‑judge |
| `GUARDRAILS_OPENAI_API_KEY` | (Optional) Same as above; Guardrails can read it automatically |
| `PINECONE_API_KEY` | (Optional) Re‑use from Lab 1 if you want to store logs/vectors in Pinecone |

Set these before starting the notebook:
```bash
export OPENAI_API_KEY="sk-..."
jupyter lab
```
In Colab, set the key from Python instead: `os.environ['OPENAI_API_KEY'] = 'sk-...'`.

> **Security Tip**: Never commit keys to GitHub or share them.

In [None]:
# Sanity check for keys
import os
if not os.getenv('OPENAI_API_KEY'):
    raise ValueError('OPENAI_API_KEY missing. Set it before running this lab.')
print('OpenAI key detected.')

In [None]:
!pip -q install guardrails-ai --upgrade

In [None]:
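from guardrails import Guard
from pydantic import BaseModel, Field

class Pet(BaseModel):
    pet_type: str = Field(description='Species of pet')
    name: str = Field(description='A unique pet name')

# Guardrails expands ${gr.complete_json_suffix_v2} into JSON-format instructions.
prompt = '''Suggest a pet and a unique name.\n${gr.complete_json_suffix_v2}'''

# Assumes a recent Guardrails release, where the prompt is passed as `messages`
# at call time and `model` is routed through LiteLLM. Older releases took
# `prompt=` on the constructor instead.
pet_guard = Guard.for_pydantic(output_class=Pet)
outcome = pet_guard(
    model='gpt-3.5-turbo',
    messages=[{'role': 'user', 'content': prompt}],
)
print(outcome.validated_output)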
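
To see what schema enforcement checks without spending an API call, the sketch below re-implements the idea with the standard library only. It is not how Guardrails works internally; `PET_FIELDS` and `check_pet_json` are hypothetical names that mirror the `Pet` model above.

```python
import json

# Hypothetical stand-in for the Pet schema: field name -> required Python type.
PET_FIELDS = {"pet_type": str, "name": str}

def check_pet_json(raw_output: str) -> dict:
    """Parse raw LLM text and verify it carries the fields Pet expects."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Output must be a JSON object")
    for field, expected_type in PET_FIELDS.items():
        if field not in data:
            raise ValueError(f"Missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"Field {field} must be a {expected_type.__name__}")
    return data

print(check_pet_json('{"pet_type": "dog", "name": "Biscuit"}'))
```

A guard adds re-asking and error reporting on top of a check like this, which is why the lab uses `Guard.for_pydantic` rather than hand-written parsing.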

In [None]:
# Toxic language validator example
!guardrails hub install hub://guardrails/toxic_language
from guardrails.hub import ToxicLanguage
tox = Guard().use(ToxicLanguage, threshold=0.5, validation_method='sentence', on_fail='exception')
try:
    tox.validate('You are stupid!')
except Exception as e:
    print('Blocked:', e)

### Integrate with RAG
```python
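# qa_chain is assumed to be the retrieval chain built in Lab 1
answer = qa_chain.run('Tell me something about Paris')
# With on_fail='exception', validate() raises on toxic text;
# otherwise it returns an outcome object that wraps the validated text.
clean = tox.validate(answer)
print(clean.validated_output)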
```
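
Because the toxicity guard raises on failure, a RAG pipeline usually needs a fallback path. The following is a minimal sketch of that pattern with offline stubs; `guarded_answer`, `fake_chain`, `reject_rude`, and the fallback text are hypothetical and not part of Guardrails or LangChain.

```python
from typing import Callable

def guarded_answer(
    question: str,
    answer_fn: Callable[[str], str],
    validate_fn: Callable[[str], object],
    fallback: str = "Sorry, I can't share that answer.",
) -> str:
    """Run the chain, then return its answer only if validation does not raise."""
    answer = answer_fn(question)
    try:
        validate_fn(answer)
    except Exception:
        # Validation failed: hide the raw answer from the user.
        return fallback
    return answer

# Offline stand-ins so the sketch runs without an API key.
def fake_chain(question: str) -> str:
    return "Paris is the capital of France."

def reject_rude(text: str) -> None:
    if "stupid" in text.lower():
        raise ValueError("toxic language detected")

print(guarded_answer("Tell me something about Paris", fake_chain, reject_rude))
```

In the notebook, `qa_chain.run` and `tox.validate` would take the place of the two stubs.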

## 3. Guardrail Examples & Workflows
### 3.1 Prompt Engineering Guardrail (System Prompt)
```python
system_prompt = (
    'You are a helpful assistant. Do not generate unsafe, unethical or biased text. '
    'Follow the provided context closely.'
)
```
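
A system prompt only acts as a guardrail once it is sent in the `system` role of each request. The sketch below shows one way to assemble the message list; `build_messages` and the context/question layout are illustrative assumptions, not a required format.

```python
system_prompt = (
    'You are a helpful assistant. Do not generate unsafe, unethical or biased text. '
    'Follow the provided context closely.'
)

def build_messages(context: str, question: str) -> list[dict]:
    """Put the guardrail text in the system role and the task in the user role."""
    return [
        {'role': 'system', 'content': system_prompt},
        {'role': 'user', 'content': f'Context:\n{context}\n\nQuestion: {question}'},
    ]

messages = build_messages('Paris is the capital of France.', 'What is the capital of France?')
for message in messages:
    print(message['role'], '->', message['content'][:40])
```

The resulting list can be passed as the `messages` argument of a chat completion call.
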
### 3.2 Input Validation – Regex Profanity & PII


In [None]:
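!guardrails hub install hub://guardrails/regex_match
from guardrails import Guard, OnFailAction
from guardrails.hub import RegexMatch, ToxicLanguage

# RegexMatch passes only when the text matches the pattern, so the pattern must
# describe *clean* text: the negative lookahead rejects any string containing a
# listed word ('foo' and 'barbadword' are placeholders for a real word list).
no_profanity = r'(?is)^(?!.*\b(?:foo|barbadword)\b).*$'
profanity_validator = RegexMatch(regex=no_profanity, on_fail=OnFailAction.EXCEPTION)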

# Guard with both profanity and toxicity validators
guard_in = Guard().use_many(
    profanity_validator,
    ToxicLanguage(threshold=0.7, validation_method='sentence', on_fail=OnFailAction.EXCEPTION)
)

try:
    guard_in.validate('You are a foo idiot!')
except Exception as e:
    print('Blocked:', e)
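
The cell above covers profanity only. For the PII half of this section, a plain-`re` sketch is shown below. The two patterns (email and a US-style phone number) are deliberately simple assumptions and will miss many real formats; `PII_PATTERNS`, `find_pii`, and `redact_pii` are hypothetical names.

```python
import re

# Illustrative patterns only; production PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def find_pii(text: str) -> dict:
    """Return every match per PII category, omitting categories with no hits."""
    hits = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[label] = matches
    return hits

def redact_pii(text: str) -> str:
    """Replace each match with a bracketed category tag such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

sample = "Mail jane.doe@example.com or call 555-123-4567"
print(find_pii(sample))
print(redact_pii(sample))
```

Redacting before the prompt reaches the LLM keeps the request usable, whereas raising an exception blocks it outright; which behaviour fits depends on the application.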

### 3.3 Model‑Based Guardrail (LLM Judge)
Use one LLM to grade another LLM’s answer for faithfulness / toxicity.

In [None]:
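from openai import OpenAI
from guardrails import Guard, OnFailAction
from guardrails.validators import (
    FailResult, PassResult, ValidationResult, Validator, register_validator,
)

client = OpenAI()
judge_prompt = (
    'You are an impartial safety grader. If the following answer contains hate, '
    'reply with the single word "unsafe"; otherwise reply "safe".'
)

def llm_judge(text):
    resp = client.chat.completions.create(
        model='gpt-3.5-turbo',
        temperature=0,  # keep the verdict as repeatable as possible
        messages=[{'role': 'system', 'content': judge_prompt}, {'role': 'user', 'content': text}]
    )
    return resp.choices[0].message.content.strip().lower()

# Guard.use() expects a registered Validator subclass, not a plain callable.
# Recent Guardrails releases call _validate(); older ones override validate().
@register_validator(name='llm-judge', data_type='string')
class JudgeValidator(Validator):
    def _validate(self, value, metadata) -> ValidationResult:
        verdict = llm_judge(value)
        if verdict.startswith('safe'):
            return PassResult()
        return FailResult(error_message=f'LLM judge flagged unsafe content (verdict: {verdict})')

guard_judge = Guard().use(JudgeValidator(on_fail=OnFailAction.EXCEPTION))
try:
    guard_judge.validate('I hate all people of XYZ group.')
except Exception as e:
    print('Blocked:', e)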
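
Judge models do not always reply with a bare token; answers such as `Safe.` or `"safe"` are common. The sketch below shows one way to normalise the reply and fail closed on anything unexpected. `parse_verdict` is a hypothetical helper, not part of Guardrails or the OpenAI SDK.

```python
def parse_verdict(raw: str) -> bool:
    """Return True only when the judge's reply is unambiguously 'safe'."""
    # Drop surrounding whitespace, then quotes and trailing punctuation.
    cleaned = raw.strip().strip('"\'.!').lower()
    # Anything other than the exact token counts as unsafe (fail closed).
    return cleaned == "safe"

for reply in [" Safe. ", '"safe"', "unsafe", "I think it is safe", ""]:
    print(repr(reply), "->", parse_verdict(reply))
```

Failing closed means an off-script reply blocks the answer instead of letting it through, which costs some false positives in exchange for a safer default.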

### 3.4 Latency Overhead Measurement

In [None]:
import time
text = 'Tell me a funny joke about cats.'

# Baseline: the LLM call on its own (reuses `client` from the judge cell)
start = time.perf_counter()
raw = client.chat.completions.create(model='gpt-3.5-turbo', messages=[{'role':'user','content':text}])
t_llm = time.perf_counter() - start

# Guard cost: input validation only, no LLM call
start = time.perf_counter()
guarded = guard_in.validate(text)
t_guard = time.perf_counter() - start

print(f'LLM call: {t_llm:.2f}s, guard validation: {t_guard:.2f}s, guarded total: {t_llm + t_guard:.2f}s')
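
A single timing is noisy: network jitter and one-off model loading can dominate it. The sketch below repeats a call and reports the median; `median_latency` is a hypothetical helper, and `time.sleep` stands in for the real call so the block runs offline.

```python
import statistics
import time

def median_latency(fn, *args, repeats: int = 5) -> float:
    """Call fn(*args) `repeats` times and return the median wall-clock seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# time.sleep is a placeholder for guard_in.validate or an LLM call.
print(f"median: {median_latency(time.sleep, 0.01, repeats=3):.3f}s")
```

In the notebook, `median_latency(guard_in.validate, text)` would give a steadier figure for the guard overhead than one measurement.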

## ✏️ Exercises (Lab 2)
1. **Custom JSON Schema** – Modify the `Pet` example to enforce nested address fields.
2. **Profanity & PII Regex** – Extend the regex list; test with edge‑cases.
3. **LLM‑Judge Faithfulness** – Adapt `JudgeValidator` to grade grounding given retrieved context from Lab 1.
4. **Latency Benchmark** – Run the timing cell with and without `guard_in`; record overhead.
5. **Red Team Mini‑Challenge (Bonus)** – Try to jailbreak the prompt guard; share successful attacks with your colleagues.