### **1. Initial Setup: Downloading Models and Data**

This first part of the notebook is dedicated to setting up the environment. It ensures that all the necessary machine learning models and data files are downloaded and stored in their correct local directories (`./models` and `./data`).

* **Model Downloading**: The script downloads three pre-trained models from Hugging Face: `gpt2`, `gpt2-medium`, and a specialized `LLM-Refusal-Classifier`. These are the core components used for generation and evaluation.
* **Data Preparation**: It then prepares the necessary datasets, including safety evaluation prompts and pre-training data. The script is designed to skip the download if the files already exist, saving time on subsequent runs.

In [1]:
from download_models import download_model
import os

# Define the models to be downloaded and their local save paths
models_to_download = {
        "gpt2": "./models/gpt2",
        "gpt2-medium": "./models/gpt2-medium",
        "Human-CentricAI/LLM-Refusal-Classifier": "./models/llm-refusal-classifier"
    }

# Download each model
for name, path in models_to_download.items():
    download_model(name, path)

  from .autonotebook import tqdm as notebook_tqdm


Downloading gpt2...
-> Downloading as a Causal LM.


2025-07-26 10:18:34.355641: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-07-26 10:18:34.367205: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1753525114.382436    4214 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1753525114.387619    4214 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1753525114.400250    4214 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

Successfully downloaded and saved gpt2 to ./models/gpt2
Downloading gpt2-medium...
-> Downloading as a Causal LM.
Successfully downloaded and saved gpt2-medium to ./models/gpt2-medium
Downloading Human-CentricAI/LLM-Refusal-Classifier...
-> Downloading as a Sequence Classification model.
Successfully downloaded and saved Human-CentricAI/LLM-Refusal-Classifier to ./models/llm-refusal-classifier


In [2]:
from download_data import prepare_safety_data, prepare_pretraining_data
DATA_DIRECTORY = "./data"

# Create the data directory if it doesn't already exist
if not os.path.exists(DATA_DIRECTORY):
    os.makedirs(DATA_DIRECTORY)
    print(f"Created directory: {DATA_DIRECTORY}")

# Download and prepare the safety and pre-training datasets
prepare_safety_data(DATA_DIRECTORY)
prepare_pretraining_data(DATA_DIRECTORY)

Safety data already exists at ./data/safety_evaluation_prompts.json. Skipping.
Pre-training data already exists at ./data/pretraining_shard.txt. Skipping.


### **2. Component Testing & Demonstration**

Before running the full evaluation, this section provides a brief demonstration of the key custom components to verify they are working as expected.

* **Sliding Window Model**: This cell tests the `GPT2WithSlidingWindow` model, a custom version of GPT-2. It feeds the model a simple prompt and shows the kind of repetitive output it can produce, demonstrating its unique generative behavior.
* **Refusal Classifier**: This cell tests the `is_refusal` function. It uses a mock sentence that appears to be a refusal and passes it to the classifier. This helps in understanding how the classifier interprets and labels different responses.

In [3]:
from coarse_grain_model import GPT2WithSlidingWindow
from transformers import GPT2Tokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the custom GPT-2 model with a sliding window
model_path = "./models/gpt2" 
print(f"Loading masked generative model using: {model_path}")
WINDOW_SIZE = 5
tokenizer2 = GPT2Tokenizer.from_pretrained(model_path)
model2 = GPT2WithSlidingWindow.from_pretrained(model_path,WINDOW_SIZE)
model2.to(device)
model2.eval() # Set model to evaluation mode

# Test the model with a sample prompt
prompt="Bla bla black sheep have you any"
print("GPT2 masked model being used")
print("input=",prompt)
inputs = tokenizer2(prompt, return_tensors="pt").to(device)
        
with torch.no_grad():
    outputs = model2.generate(
        **inputs,
        max_new_tokens=50,
        pad_token_id=tokenizer2.eos_token_id
    )

full_response = tokenizer2.decode(outputs[0], skip_special_tokens=True)
generated_part = full_response[len(prompt):].strip()
print("llm output=",generated_part)

Using device: cuda
Loading masked generative model using: ./models/gpt2
GPT2 masked model being used
input= Bla bla black sheep have you any
llm output= black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black black


In [5]:
from calculate_refusal_delta import is_refusal, load_model

# Load the refusal classifier model
classifier_model, classifier_tokenizer = load_model(device,"./models/llm-refusal-classifier")

# Test the classifier with a mock refusal sentence
mock_answer="I am sorry but answerring this is unethical"
print("input to text LLM response classifier=",mock_answer)
answer=is_refusal(mock_answer, classifier_model, classifier_tokenizer, device)

print("Refusal=",answer)

Loading m model from ./models/llm-refusal-classifier...
input to text LLM response classifier= I am sorry but answerring this is unethical
Refusal= True


### **3. Running the Full Evaluation**

This is the main part of the script. It uses the high-level `score` function to perform a full evaluation of a language model's safety alignment. The function automates the entire process:

1.  Loads the specified generative model (`gpt2`) and the refusal classifier.
2.  Loads the harmless and jailbreak prompt datasets.
3.  Generates responses for 1000 prompts from each category.
4.  Counts the number of refusals for both harmless and jailbreak prompts.
5.  Calculates the final **Jailbreak Score**, which is the difference between the number of refusals for jailbreak prompts and harmless prompts. A higher score indicates the model is better at refusing harmful prompts while still answering safe ones.

In [None]:
from calculate_refusal_delta import score
# Set the path to the model that needs to be evaluated
model_path="./models/gpt2"

# Run the scoring function to calculate the final jailbreak score
answer=score(model_path)
print("Jailbreak score of gpt2=",answer)

Using device: cuda
Loading m model from ./models/llm-refusal-classifier...
Loading m model from ./models/gpt2...
Loading safety data from: ./data/safety_evaluation_prompts.json

--- Counting Refusals on Harmless Prompts ---


Evaluating prompts:   2%|▏         | 21/1000 [00:04<03:37,  4.51it/s]