# Day 97: Constitutional AI (RLAIF)

**Constitutional AI** is a training approach in which a model learns to follow an explicit, written set of principles, called a 'constitution'. The preference-learning stage typically uses **Reinforcement Learning from AI Feedback (RLAIF)**: an AI evaluator applies the constitution to judge which of two model responses is better, which greatly reduces the need for human preference labeling at scale.

In this lab, we use a **Constitutional AI** system for three tasks:
1. **Principle-Based Evaluation**: Scoring model outputs against specific safety and quality principles (Helpful, Harmless, Honest).
2. **Best-of-N Selection**: Choosing the most compliant response from a pool of candidates.
3. **RLAIF Data Generation**: Creating preference pairs (Chosen vs. Rejected) to train a safer model without human labels.

In [None]:
import sys
import os

# Add the repository root (two levels up from the working directory) to sys.path
# so that the `src` package can be imported.
sys.path.append(os.path.abspath('../../'))

from src.alignment.constitutional_ai import ConstitutionalAI, Principle

## 1. Defining the Constitution

We establish the rules that will govern our AI's behavior.

In [None]:
principles = [
    Principle("HELPFUL", "The response should be informative and useful."),
    Principle("HARMLESS", "The response should not contain harmful content."),
    Principle("HONEST", "The response should not be deceptive.")
]

cai = ConstitutionalAI(principles)
print(f"Constitution established with {len(principles)} core principles.")
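
In RLAIF, the evaluator is usually a language model that is shown a principle and two responses and asked which one complies better. The following is a minimal sketch of how such a query could be assembled. `ToyPrinciple` and `build_judge_prompt` are hypothetical names introduced for illustration; they are not part of the lab's `constitutional_ai` module, whose internals are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPrinciple:
    """Hypothetical stand-in for the lab's Principle class (assumption)."""
    name: str
    description: str


def build_judge_prompt(principle: ToyPrinciple, response_a: str, response_b: str) -> str:
    """Format a pairwise comparison question for an AI evaluator."""
    return (
        f"Principle ({principle.name}): {principle.description}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response follows the principle better? Answer 'A' or 'B'."
    )


harmless = ToyPrinciple("HARMLESS", "The response should not contain harmful content.")
prompt = build_judge_prompt(
    harmless,
    "This is a helpful and safe guide.",
    "Here is some harmful content.",
)
print(prompt)
```

The evaluator's answer ('A' or 'B') would then become the preference label for that pair.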

## 2. Constitutional Selection

This is best-of-N selection: the system scores each candidate response against the constitution and returns the highest-scoring one together with its alignment score.

In [None]:
candidates = [
    "I can't help with that, it's confidential.",
    "The answer is 42, trust me I'm an AI.",
    "Based on the available data, the most likely solution is..."
]

best, score = cai.select_best_response(candidates)

print(f"Best Selected Response: {best}")
print(f"Alignment Score: {score:.2f}")
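
To make the selection step concrete, here is a minimal, self-contained sketch of best-of-N selection. It assumes a toy keyword-based scorer; the marker lists, the weights, and the names `toy_score` and `select_best` are all illustrative assumptions and do not describe how `select_best_response` actually scores text.

```python
from typing import Callable, List, Tuple

# Toy keyword lists: illustrative assumptions, not the lab's scoring rules.
HEDGING_MARKERS = ("based on", "likely", "according to")
OVERCLAIM_MARKERS = ("trust me", "guaranteed")
REFUSAL_MARKERS = ("can't help", "cannot help")


def toy_score(response: str) -> float:
    """Return a score in [0, 1]; higher means closer to the toy principles."""
    text = response.lower()
    score = 0.5
    if any(marker in text for marker in HEDGING_MARKERS):
        score += 0.3  # grounded, hedged claims (a proxy for HONEST)
    if any(marker in text for marker in OVERCLAIM_MARKERS):
        score -= 0.3  # unsupported appeals to authority
    if any(marker in text for marker in REFUSAL_MARKERS):
        score -= 0.2  # unhelpful refusals (a proxy for HELPFUL)
    return max(0.0, min(1.0, score))


def select_best(
    candidates: List[str],
    scorer: Callable[[str], float] = toy_score,
) -> Tuple[str, float]:
    """Score every candidate and return the top one with its score."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    scored = [(scorer(candidate), candidate) for candidate in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


toy_candidates = [
    "I can't help with that, it's confidential.",
    "The answer is 42, trust me I'm an AI.",
    "Based on the available data, the most likely solution is...",
]
toy_best, toy_best_score = select_best(toy_candidates)
print(f"Best: {toy_best} (score {toy_best_score:.2f})")
```

Passing the scorer as a parameter keeps the selection logic independent of how compliance is measured, so the keyword heuristic could be swapped for an LLM-based judge without changing `select_best`.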

## 3. RLAIF: Automating Preferences

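For each pair of responses, the evaluator decides which one follows the constitution better and records the result as a (chosen, rejected) example. The resulting dataset can be used for preference-based training in place of human-labeled comparisons.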

In [None]:
response_pairs = [
    ("This is a helpful and safe guide.", "Here is some harmful content.")
]

pref_data = cai.generate_preference_data(response_pairs)
for item in pref_data:
    print(f"CHOSEN: {item['chosen']}")
    print(f"REJECTED: {item['rejected']}")
    print(f"Margin: {item['score_diff']:.2f}")
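
The pairing logic can be sketched in a few lines. This example assumes a toy keyword scorer and mirrors the `chosen`, `rejected`, and `score_diff` keys used above; `harm_aware_score` and `make_preference_pairs` are hypothetical helpers, not the lab's implementation of `generate_preference_data`.

```python
from typing import Callable, Dict, List, Tuple, Union


def harm_aware_score(response: str) -> float:
    """Toy scorer (assumption): simple substring checks, clamped to [0, 1]."""
    text = response.lower()
    score = 0.5
    if "harmful" in text:
        score -= 0.4
    if "helpful" in text:
        score += 0.2
    if "safe" in text:  # naive: would also match "unsafe"
        score += 0.2
    return max(0.0, min(1.0, score))


def make_preference_pairs(
    pairs: List[Tuple[str, str]],
    scorer: Callable[[str], float] = harm_aware_score,
) -> List[Dict[str, Union[str, float]]]:
    """Label each pair as chosen/rejected according to the scorer."""
    data: List[Dict[str, Union[str, float]]] = []
    for first, second in pairs:
        score_first, score_second = scorer(first), scorer(second)
        if score_first == score_second:
            continue  # a tie carries no preference signal, so skip it
        if score_first > score_second:
            chosen, rejected = first, second
        else:
            chosen, rejected = second, first
        data.append({
            "chosen": chosen,
            "rejected": rejected,
            "score_diff": abs(score_first - score_second),
        })
    return data


toy_pairs = [("This is a helpful and safe guide.", "Here is some harmful content.")]
toy_pref_data = make_preference_pairs(toy_pairs)
for item in toy_pref_data:
    print(f"CHOSEN: {item['chosen']} | REJECTED: {item['rejected']}")
```

Skipping ties is one possible design choice; an alternative would be to keep them with a margin of zero and let the training step filter on `score_diff`.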

## ⚖️ The Constitutional Advantage

Constitutional AI replaces 'ad-hoc' safety patches with a 'principled-by-design' approach. Rather than relying on case-by-case guesses about what is safe, the system applies an explicit, written set of principles, which it can use to select or correct responses at inference time and to generate preference data for training.