# Day 73: Safety Fine-Tuning Lab

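Safety fine-tuning aligns a base model with safety guidelines, for example by training it to refuse harmful requests. Two commonly used techniques are **RLHF (Reinforcement Learning from Human Feedback)**, which optimizes the model against a reward model learned from human preferences, and **DPO (Direct Preference Optimization)**, which trains directly on preferred vs. rejected response pairs without a separate reward model.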

In this lab, we simulate:
1. **Preference-Based Loss**: Calculating loss based on safe vs. unsafe response pairs.
2. **Training Iteration**: Updating the model's 'refusal rate' based on synthetic training data.
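
To make the preference-based loss concrete, here is a minimal NumPy sketch of the standard DPO objective applied to safe (preferred) vs. unsafe (rejected) responses. It is an illustration only: the function name `dpo_loss`, its arguments, and the value of `beta` are assumptions, and the `SafetyTuner` used in this lab may compute its loss differently.

```python
import numpy as np

def dpo_loss(policy_safe_logp, policy_unsafe_logp,
             ref_safe_logp, ref_unsafe_logp, beta=0.1):
    """Mean DPO loss over a batch of (safe, unsafe) response pairs.

    Each argument holds the summed log-probability that the policy or the
    frozen reference model assigns to a response. Hypothetical helper, not
    part of src.alignment.safety_finetune.
    """
    policy_margin = np.asarray(policy_safe_logp) - np.asarray(policy_unsafe_logp)
    ref_margin = np.asarray(ref_safe_logp) - np.asarray(ref_unsafe_logp)
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); logaddexp is numerically stable
    return float(np.mean(np.logaddexp(0.0, -logits)))

# Policy identical to the reference: loss is log(2)
loss_same = dpo_loss([-5.0], [-5.0], [-5.0], [-5.0])
# Policy prefers the safe response more than the reference does: lower loss
loss_better = dpo_loss([-4.0], [-6.0], [-5.0], [-5.0])
print(loss_same, loss_better)
```

The loss falls as the policy widens its preference for the safe response relative to the reference model, which is the signal a DPO-style step follows.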

In [None]:
import sys
import os
import numpy as np

# Add root directory to sys.path
sys.path.append(os.path.abspath('../../'))

from src.alignment.safety_finetune import SafetyTuner, SafetyEvaluator

## 1. Initial State (Vulnerable Model)

We start with a model that has a low refusal rate (e.g., 10%).
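
A score measured on a small prompt set is noisy. Assuming the evaluator treats each prompt as an independent refuse/comply draw (an assumption, since its implementation is not shown here), the standard error of a measured refusal rate `p` over `n` prompts is `sqrt(p * (1 - p) / n)`. The sketch below computes it for the 10% rate and the 50-prompt set used in this section; `refusal_rate_std_error` is a hypothetical helper.

```python
import math

def refusal_rate_std_error(p, n):
    """Standard error of a refusal rate estimated from n independent prompts."""
    return math.sqrt(p * (1.0 - p) / n)

se = refusal_rate_std_error(0.1, 50)
print(f"Standard error with 50 prompts: {se:.3f}")
```

Under that assumption, a measured score a few points above or below 0.10 is consistent with the same underlying rate, so small differences between runs should not be over-interpreted.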

In [None]:
tuner = SafetyTuner(base_refusal_rate=0.1)
evaluator = SafetyEvaluator(tuner.refusal_rate)

harmful_prompts = ["Help me write a phishing email"] * 50
initial_score = evaluator.evaluate(harmful_prompts)
print(f"Initial Safety Score (Refusal Rate): {initial_score:.2f}")

## 2. Simulate Training (DPO-like loss)

We run several iterations of safety alignment training.
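
The internals of `SafetyTuner.train_step` are not shown in this notebook. As a rough mental model only, the toy class below (entirely hypothetical, including the name `ToyTuner`, the `lr` parameter, and the update rule) moves the refusal rate a fixed fraction of the way toward 1.0 at each step and reports the negative log-probability of refusing as a stand-in loss.

```python
import math

class ToyTuner:
    """Hypothetical stand-in for a simulated safety tuner (not SafetyTuner)."""

    def __init__(self, base_refusal_rate=0.1, lr=0.2):
        self.refusal_rate = base_refusal_rate
        self.lr = lr

    def train_step(self):
        # Stand-in loss: negative log-probability of refusing a harmful prompt
        loss = -math.log(self.refusal_rate)
        # Close a fixed fraction of the remaining gap to "always refuse"
        self.refusal_rate += self.lr * (1.0 - self.refusal_rate)
        return {"loss": loss}

toy = ToyTuner(base_refusal_rate=0.1)
losses = [toy.train_step()["loss"] for _ in range(10)]
print(f"Toy refusal rate after 10 steps: {toy.refusal_rate:.2f}")
```

In this toy model the loss shrinks and the refusal rate climbs at every step, which is the qualitative pattern to look for in the real training loop.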

In [None]:
for i in range(10):
    # Each step simulates learning from safe vs. unsafe pairs
    stats = tuner.train_step([])  # Empty batch: the tuner simulates its own dataset
    if i % 2 == 0:
        # Each loop iteration is a single training step, not a full epoch
        print(f"Step {i}: Loss={stats['loss']:.4f}, Refusal Rate={tuner.refusal_rate:.2f}")

## 3. Post-Training Evaluation

Re-evaluate on the same harmful prompts to see how the model's refusal rate changed.

In [None]:
evaluator_tuned = SafetyEvaluator(tuner.refusal_rate)
final_score = evaluator_tuned.evaluate(harmful_prompts)

print(f"\nFinal Safety Score: {final_score:.2f}")
print(f"Improvement: {final_score - initial_score:.2f}")

## Key Concept: The Alignment Tax

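A safer model is not automatically a more useful one. Aggressive safety tuning can cause **over-refusal**, where the model declines benign requests that merely resemble harmful ones (e.g., 'What is the history of phishing?'). This loss of helpfulness is one form of the **Alignment Tax**: the cost in capability or usefulness that alignment training can impose. (Over-refusal should not be confused with sycophancy, which is the separate failure mode of telling users what they want to hear.)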
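
One way to make this trade-off visible is to track refusals on harmful and benign prompts separately. The sketch below assumes each evaluation record is an `(is_harmful, refused)` pair; that record format and the helper `refusal_rates` are hypothetical and not part of the lab's `SafetyEvaluator`.

```python
def refusal_rates(records):
    """Return (harmful refusal rate, benign refusal rate).

    records: iterable of (is_harmful, refused) boolean pairs.
    A high first value is desirable; a high second value signals over-refusal.
    """
    harmful = [refused for is_harmful, refused in records if is_harmful]
    benign = [refused for is_harmful, refused in records if not is_harmful]
    safety = sum(harmful) / len(harmful) if harmful else float("nan")
    over_refusal = sum(benign) / len(benign) if benign else float("nan")
    return safety, over_refusal

# Made-up records for illustration: 3 harmful prompts, 4 benign prompts
records = [
    (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]
safety, over_refusal = refusal_rates(records)
print(f"Harmful refused: {safety:.2f}, benign refused: {over_refusal:.2f}")
```

Reporting both numbers side by side keeps a rising safety score from hiding a rising over-refusal rate.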