### Welcome!
In this notebook, you will find a step-by-step guide to building your first version of a ***jailbreak classifier***, building up its individual parts and seeing how they all contribute to a final deliverable.

By the end of this, you will be able to:
- Understand what a jailbreak prompt is
- Clean and prepare text data
- Turn text into numerical features
- Train a simple ML model to classify prompts
- Evaluate the model
- Submit your predictions on test data


### Step 0: What is a jailbreak classifier?

Before we start coding, it's important to understand what we're doing and why we're doing it.

Large language models like ChatGPT have become very powerful and useful tools. However, they still have vulnerabilities. The term "jailbreaking" refers to users writing prompts intended to trick the AI into:
- Giving harmful advice (e.g., "How to make a bomb")
- Bypassing safety rules (e.g., "Pretend you are an evil AI")
- Producing other dangerous or unethical content

A **jailbreak classifier** is a system that detects when a user's prompt might be trying to jailbreak the model.

**Why this matters:**
- Jailbreak classifiers keep AI systems safer and more trustworthy.
- They prevent the spread of harmful content.
- They're a critical part of AI alignment and safety work.

In this exercise, we'll build a starter version of a jailbreak classifier. Your system won't be perfect, but it will be a strong first step toward making AI safer.


### Step 1: Setup
First, we just have to make sure that you have the right tools installed.

***What you need to do***
* Run the code cell below to install the correct libraries

*Don't worry if you get some dependency errors here*

In [None]:
# Install required libraries
!pip install scikit-learn datasets --quiet

### Step 2: Data
For this classifier, we're going to be using a public dataset downloadable on Hugging Face called `Anthropic/hh-rlhf`. **Before coding**, take a minute to explore the dataset here: https://huggingface.co/datasets/Anthropic/hh-rlhf

This dataset was created to train assistant models that are helpful but harmless. Each example contains two fields:
- **`chosen`**: a conversation transcript (alternating `Human:` and `Assistant:` turns) ending in the response that annotators preferred
- **`rejected`**: the same conversation ending in the response that annotators did not prefer

This data was used to fine-tune AI behavior and also contains examples that are meant to test model safety, which is similar to what we want to do. For our purposes, we’ll use only the text in the `chosen` field as our "prompt" (note that it includes the assistant's replies as well as the user's messages), and we’ll assign labels ourselves based on whether each one appears to be trying to jailbreak the model.
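
Because each `chosen` string in `Anthropic/hh-rlhf` is a transcript rather than a bare question, you may want to isolate the user's first message before labelling. Here is a minimal sketch, assuming turns are introduced by `Human:` and `Assistant:` markers separated by blank lines; the helper name `first_human_turn` and the sample transcript are made up for illustration.

```python
import re

def first_human_turn(transcript: str) -> str:
    """Return the text of the first 'Human:' turn, or the whole string if no marker is found."""
    match = re.search(r"Human:\s*(.*?)(?=\n\nAssistant:|\Z)", transcript, flags=re.DOTALL)
    return match.group(1).strip() if match else transcript.strip()

# Made-up transcript in the assumed format
sample = "\n\nHuman: How do I bake a chocolate cake?\n\nAssistant: Start by preheating the oven."
print(first_human_turn(sample))
```

Whether to label the first user turn or the full transcript is a design choice; the rest of this notebook works with either, as long as you use the same text for labelling and for training.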



***What you need to do***
* In the code box below, the dataset is loaded into a variable called `raw_dataset`. Come up with a simple rule that takes some of this data and labels each prompt 0 if it looks safe and 1 if you think it's a jailbreak attempt.

---

*Hint 1*: One strategy would be to look for key words in each prompt. What are some words that could indicate that a prompt is an attempt at jailbreaking?
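
One caveat with keyword matching: a plain substring test also fires inside longer, harmless words. The sketch below contrasts substring matching with whole-word matching using a regular expression. The two-word list `KEYWORDS_DEMO` and the function name `label_prompt` are illustrative only, not the notebook's final choices.

```python
import re

KEYWORDS_DEMO = ["kill", "bomb"]  # illustrative list only

# \b marks a word boundary, so "kill" will not match inside "skills"
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, KEYWORDS_DEMO)) + r")\b",
    flags=re.IGNORECASE,
)

def label_prompt(prompt: str) -> int:
    """Return 1 if any keyword appears as a whole word, else 0."""
    return 1 if pattern.search(prompt) else 0

examples = ["How do I improve my coding skills?", "How to make a BOMB"]
for text in examples:
    substring_hit = any(k in text.lower() for k in KEYWORDS_DEMO)
    print(f"{text!r}: substring={int(substring_hit)}, whole-word={label_prompt(text)}")
```

The trade-off is that whole-word matching misses inflected forms such as "killing" or "bombs", so neither rule is strictly better; pick one and check a sample of labels by eye.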

---

*If you get stuck, there's an example solution that you can find below*


In [None]:
from datasets import load_dataset
import pandas as pd

# Load the dataset
raw_dataset = load_dataset("Anthropic/hh-rlhf", split="train")

# Preview the first row (print explicitly so it still shows once you add code below)
print(raw_dataset[0])

# You'll extract 2000 prompts from the dataset and label each one using a basic rule:
# If the prompt contains any harmful keyword (e.g., "kill", "harm", "bomb", "attack"), label it as 1 (jailbreak).
# Otherwise, label it as 0 (safe).


# 1. Create two empty lists: prompts and labels
# prompts = []
# labels = []
#
# 2. Loop through the first 2000 examples using raw_dataset.select(range(2000))
# 3. For each example, extract the text from the 'chosen' field
# 4. Apply your labeling rule
# 5. Append prompt and label to the lists

### YOUR CODE HERE


In [None]:
# Model Solution for Dataset Creation:
prompts = []
labels  = []
KEYWORDS = [
    "kill", "harm", "bomb", "attack", "hate", "stupid", "idiot", "dumb", "racist", "sexist", "slur", "terror"
]

for example in raw_dataset.select(range(2000)):
    prompt = example["chosen"]
    label  = 1 if any(k in prompt.lower() for k in KEYWORDS) else 0
    prompts.append(prompt)
    labels.append(label)

***Run this provided code after step 2. It will use your data to create a DataFrame, which is a powerful data structure from the pandas library.***



In [None]:
df = pd.DataFrame({"prompt": prompts, "label": labels})
df.head()

### Step 3: Preprocessing the text

Now that you have a labelled `DataFrame` called `df`, the next step is to standardise the prompts so the model sees a consistent vocabulary. Small differences such as capital letters can create duplicate tokens and hurt performance.

***What you need to do***
* Here, we’ll keep preprocessing simple but effective. In the code cell below, complete the `preprocess()` function so that it does the following:
  1. Lower‑casing: turn "BOMB" and "bomb" into the same token.
  2. Stripping whitespace: remove accidental leading/trailing spaces.

*(In production you might also strip punctuation, normalise Unicode, or apply stemming/lemmatisation, but lower‑case + strip is enough for this exercise.)*
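
For the curious, here is a rough sketch of what a stricter variant might look like using only the standard library. The function name `preprocess_extended` is hypothetical, and stemming/lemmatisation is left out because it needs an extra library.

```python
import re
import string
import unicodedata

def preprocess_extended(text: str) -> str:
    """Normalise Unicode, lower-case, drop ASCII punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)   # e.g. full-width letters -> plain ASCII letters
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

# "\uff22\uff2f\uff2d\uff22" is "BOMB" written with full-width Unicode letters
print(preprocess_extended("  How to make a  \uff22\uff2f\uff2d\uff22?!  "))
```

Unicode normalisation matters for this task because look-alike characters are an easy way to slip a keyword past a naive filter.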

---

*If you get stuck, there's an example solution that you can find below*

In [None]:
# TODO: Fill in this function to preprocess the prompts
# lowercase everything and remove extra spaces

def preprocess(text):
    ### YOUR CODE HERE
    raise NotImplementedError("Replace this line with your implementation")

# Apply the preprocess() function to every prompt and store the cleaned text in a new column `clean_prompt`.
df['clean_prompt'] = df['prompt'].apply(preprocess)

In [None]:
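# Model Solution for preprocess():
def preprocess(text):
    return text.lower().strip()

# Re-apply so that `clean_prompt` reflects this implementation
df['clean_prompt'] = df['prompt'].apply(preprocess)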

### Step 4: Convert text into features
Models can't read raw text, so we first turn each prompt into numbers. A common first step is to represent each prompt as a high‑dimensional vector using **TF‑IDF (Term Frequency × Inverse Document Frequency)**.

**What TF‑IDF does, step‑by‑step**
1. **Build a vocabulary** – the vectoriser scans all cleaned prompts and assigns an index to every unique word (token).  
   *If we saw 8 500 unique words, the final feature space will have 8 500 columns.*
2. **Term Frequency (TF)** – for a given prompt, count how many times each vocab word appears. More repeats → higher TF.
3. **Inverse Document Frequency (IDF)** – words that appear in many prompts (e.g., "the") get down‑weighted; niche words get up‑weighted.  
   IDF is pre‑computed once from the whole corpus.
4. **Combine** – each cell in the big matrix is `TF × IDF`.  The result is a sparse matrix (mostly zeros) where each row is one prompt and each column a word weight.
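
The four steps above can be traced by hand on a toy corpus. This is a sketch, not scikit-learn's implementation: it assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 row normalisation (which, as far as I know, matches `TfidfVectorizer`'s defaults), and it tokenises with a plain `split()`. The three documents are made up.

```python
import math
from collections import Counter

docs = ["how to bake a cake", "how to make a bomb", "bake a chocolate cake"]
tokenised = [d.split() for d in docs]

# Step 1: vocabulary, one column per unique token
vocab = sorted({tok for doc in tokenised for tok in doc})

# Step 3: document frequency -> smoothed IDF (computed once for the corpus)
n_docs = len(docs)
doc_freq = Counter(tok for doc in tokenised for tok in set(doc))
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

def tfidf_row(tokens):
    tf = Counter(tokens)                                   # Step 2: raw counts
    weights = [tf[tok] * idf[tok] for tok in vocab]        # Step 4: TF x IDF
    norm = math.sqrt(sum(w * w for w in weights)) or 1.0   # L2-normalise the row
    return [w / norm for w in weights]

row = tfidf_row(tokenised[1])
for tok, w in zip(vocab, row):
    if w > 0:
        print(f"{tok:10s} {w:.3f}")
```

In the printout, "a" (present in every document) receives the lowest weight and "bomb" (present in only one) is among the highest. One difference to keep in mind: scikit-learn's default tokeniser drops single-character tokens such as "a", so its vocabulary would be slightly smaller.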

**Why we like TF‑IDF here**
* Fast, simple, interpretable: you can inspect which words carry the most weight.
* Works well for linear models like Logistic Regression.
* Requires no deep learning infrastructure—perfect for a first classifier.

***What you need to do***
* Just run the code cell below.  
  `vectorizer.fit_transform()` both fits (learns vocabulary + IDF) and transforms the cleaned prompts into vectors.
* The printed `X.shape` tells you `(number_of_prompts, vocabulary_size)` so you can sanity‑check the dimensionality.



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['clean_prompt'])
print(f"TF-IDF feature matrix shape: {X.shape}")  # rows = examples, cols = features

### Step 5: Convert labels
Machine‑learning algorithms require numeric targets.  Our `df` currently stores labels as Python integers (`0` and `1`), but it’s good practice to make the conversion step explicit, and also to sanity‑check class balance before training.

***What you need to do***
1. **Run the code cell below**—it simply extracts the `label` column as a NumPy array `y`.
2. Note the printed dictionary, which will look something like `{0: 1700, 1: 300}` (illustrative numbers; yours will differ).  If the positive class is tiny, high recall will be harder to achieve and the `class_weight` setting becomes important.
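
To see what `class_weight="balanced"` does with an imbalanced label set, here is a small sketch of the heuristic scikit-learn documents for it, `n_samples / (n_classes * count)`. The label list `y_demo` is made up.

```python
from collections import Counter

y_demo = [0] * 17 + [1] * 3   # made-up labels: 17 safe, 3 jailbreak

counts = Counter(y_demo)
n_samples, n_classes = len(y_demo), len(counts)

# "balanced" heuristic: weight = n_samples / (n_classes * count_of_that_class)
weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
rounded = {cls: round(w, 3) for cls, w in weights.items()}

print(f"Counts : {dict(counts)}")
print(f"Weights: {rounded}")
```

The rarer class gets the larger weight, so each mistake on a jailbreak example costs the model more during training than a mistake on a safe one.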


In [None]:
# Convert labels to binary array
y = df['label'].values
print(f"Labels distribution: {pd.Series(y).value_counts().to_dict()}")

### Step 6: Train your classifier
Now that X (features) and y (labels) are ready, it is time to teach a model how to tell "safe" from "jailbreak." We will use Logistic Regression, a simple algorithm that decides yes or no by adding up weighted word scores.

**A brief explanation of Logistic Regression**
* Looks at every word in a prompt, multiplies each word by a learned weight, then adds the results.
* If the total is high, the prompt is treated as a likely jailbreak (class 1); if low, it is treated as safe (class 0).
* A knob called `C` controls how large those weights are allowed to become (it is the inverse of the regularisation strength). Small `C` = gentle weights (risk of under‑fit); large `C` = aggressive weights (risk of over‑fit).
* For a deeper explanation see scikit‑learn’s short guide: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
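
To make the "weighted sum" idea concrete, here is a toy sketch of how a trained logistic regression scores one prompt. The weights, the intercept, and the feature values are all hypothetical; a real model learns one weight per vocabulary word.

```python
import math

# Hypothetical learned weights for three vocabulary words, plus an intercept
weights = {"bomb": 2.1, "cake": -1.4, "make": 0.1}
intercept = -0.5

def jailbreak_probability(features: dict) -> float:
    """features maps word -> TF-IDF value for one prompt."""
    score = intercept + sum(weights.get(word, 0.0) * value for word, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))   # sigmoid squashes the score into (0, 1)

for features in ({"make": 0.5, "bomb": 0.8}, {"make": 0.5, "cake": 0.8}):
    p = jailbreak_probability(features)
    print(f"{features} -> p(jailbreak) = {p:.2f} -> class {int(p >= 0.5)}")
```

A probability of 0.5 or more maps to class 1 here, which corresponds to the default decision rule when you call `predict` on a binary logistic regression.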

**What the code below will do**
1. Split the data into a training portion and a validation portion so we can judge how well the model generalises.
2. Train several models with different C values and print each one’s validation accuracy.
3. In the cell after that, you pick the C with the best validation accuracy and retrain one final model on all of the data.
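
Choosing the best `C` can also be done programmatically instead of by eye. The sketch below is one possible approach; the accuracies in `val_scores` are made-up placeholders purely to exercise the function, and `pick_best_c` is a hypothetical helper.

```python
# Made-up validation accuracies keyed by C, purely to exercise the function;
# replace them with the numbers your own sweep prints.
val_scores = {0.01: 0.81, 0.1: 0.86, 1: 0.90, 10: 0.90}

def pick_best_c(scores: dict):
    """Return the C with the highest score; on a tie, prefer the smaller C (stronger regularisation)."""
    # max() returns the first maximal item, and sorted() puts smaller C values first
    return max(sorted(scores), key=lambda c: scores[c])

best_C_demo = pick_best_c(val_scores)
print(f"Best C: {best_C_demo}")
```

Preferring the smaller `C` on ties is a judgment call: when two settings score the same on validation data, the more regularised model is usually the safer bet.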

***What you need to do***
* Run the code cell below and watch the small table that prints C and validation accuracy.
* Note which C value has the highest accuracy and set `best_C` to it in the cell after that.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Quick hyper‑parameter sweep
for C in [0.01, 0.1, 1, 10]:
    clf = LogisticRegression(C=C, max_iter=1000, class_weight="balanced")
    clf.fit(X_train, y_train)
    val_acc = accuracy_score(y_val, clf.predict(X_val))
    print(f"C={C}: validation accuracy = {val_acc:.3f}")


After tuning, choose a `C` value and train the final model on the full dataset:

In [None]:
# Final model with chosen hyperparameter
best_C = 1  # TODO: choose based on your tuning results
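# Use the same class_weight as in the sweep so the tuning results carry over
clf = LogisticRegression(C=best_C, max_iter=1000, class_weight="balanced")
clf.fit(X, y)
print(f"Final model trained with C={best_C}")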

### Step 7: Try out your classifier
Your model is now trained and stored in the variable `clf`. The final check before a formal evaluation is to see how it behaves on a few prompts you create yourself. This is a fast way to confirm that the model and vectoriser pipeline work.

**What the code below does**
1. Defines a short list called `new_prompts` with example questions.  
   Feel free to edit or add your own lines.
2. Runs each prompt through the same `preprocess` function and TF‑IDF vectoriser used in training, so the model sees data in the expected format.
3. Calls `clf.predict` to assign class 0 (safe) or 1 (jailbreak).
4. Prints each prompt alongside the model’s verdict.
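
One reason predictions on hand-written prompts can look surprising: a fitted TF‑IDF vectoriser only has columns for words it saw during `fit`, so `transform` silently ignores everything else. The sketch below mimics that with a toy vocabulary; `train_vocab` and `split_known_unknown` are made up for illustration.

```python
# Toy vocabulary standing in for what the vectoriser learned during fit
train_vocab = {"how", "to", "make", "bake", "cake", "bomb"}

def split_known_unknown(prompt: str):
    """Return (known, unknown) token lists for a lower-cased, whitespace-split prompt."""
    tokens = prompt.lower().strip().split()
    known = [t for t in tokens if t in train_vocab]
    unknown = [t for t in tokens if t not in train_vocab]
    return known, unknown

known, unknown = split_known_unknown("How to sabotage someone anonymously")
print("seen in training  :", known)
print("ignored at predict:", unknown)
```

If the words that carry the harmful meaning never appeared in the training prompts, the model has nothing to base a "jailbreak" verdict on, which is a good argument for more (and more varied) training data.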

***What you need to do***
* Run the cell as‑is to confirm the pipeline works.  
* Replace or append two additional prompts of your own.  
* Re‑run the cell and note whether the predictions match your intuition.



In [None]:
# TODO: Add two new example prompts of your own
new_prompts = [
    "How can I sabotage someone anonymously?",
    "How do I bake a chocolate cake?",
    "Best ways to cheat on a math exam",
    # TODO: Insert new prompts here
]

# Preprocess & vectorize
after_clean = [preprocess(p) for p in new_prompts]
X_new       = vectorizer.transform(after_clean)

# Predict
predictions = clf.predict(X_new)
for p, lbl in zip(new_prompts, predictions):
    print(f"Prompt: '{p}'\n ➤ Prediction: {'Jailbreak' if lbl == 1 else 'Safe'}\n")

### Step 8: Compare your classifier
**Congrats!** You’ve just built a working jailbreak classifier. 🏆

The final step is to measure how well it generalises to unseen, real‑world text. We will use the `civil_comments` dataset, which contains online comments scored for toxicity on a 0-1 scale.

**What the code below does**
1. Installs and loads a 300‑row slice of the `civil_comments` test split.
2. Converts each comment to the same TF‑IDF space as the training prompts.
3. Uses your model `clf` to predict safe vs. harmful.
4. Prints four evaluation metrics: **accuracy, precision, recall, and F1.**

  *What these numbers mean*
  * Accuracy: Out of every comment in the test slice, how many did the classifier label correctly?
  * Precision: Of the comments the model flagged as harmful, what fraction were actually harmful?  High precision means few false alarms.
  * Recall: Of all the truly harmful comments, what fraction did the model catch?  High recall means few missed toxic posts.
  * F1‑score: A single number that balances precision and recall (the harmonic mean).  Helpful when the two metrics trade off.
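
The four definitions above can be computed by hand from the counts of true/false positives and negatives. Here is a small sketch with made-up labels and predictions:

```python
y_true = [1, 0, 1, 1, 0, 0, 0, 1]   # made-up ground truth
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]   # made-up predictions

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # harmful, flagged
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # safe, flagged (false alarm)
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # harmful, missed
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # safe, not flagged

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The guards against division by zero matter in practice: if the model never predicts class 1, precision is undefined, and this sketch reports it as 0.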

***What you need to do***
* Run the code cell below.
* If your accuracy is at least 0.70, congrats! You've finished this notebook.
* If results are poor, return to earlier steps and experiment: expand keywords, try a different `C`, or add more training data.
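
When reading the accuracy figure, it helps to compare it against a trivial baseline that always predicts the most common label. The sketch below shows the idea on made-up labels; `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(labels) -> float:
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

labels_demo = [0] * 9 + [1]   # made-up: 9 safe, 1 harmful
print(f"Always-predict-majority accuracy: {majority_baseline_accuracy(labels_demo):.2f}")
```

In the notebook you could call this on `X_eval_labels`. If your model's accuracy is not clearly above that baseline, precision, recall, and F1 are the more informative numbers to look at.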



In [None]:
!pip install -U datasets --quiet
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# ▶ Load 300 comments with ground‑truth toxicity labels
civ = load_dataset("civil_comments", split="test[:300]")
X_eval_prompts = [ex["text"] for ex in civ]
X_eval_labels  = [1 if ex["toxicity"] >= 0.5 else 0 for ex in civ]  # 1 = harmful / toxic, 0 = safe

# Vectorize & predict
X_eval_vec = vectorizer.transform([preprocess(p) for p in X_eval_prompts])
your_preds = clf.predict(X_eval_vec)

# Metrics
a = accuracy_score(X_eval_labels, your_preds)
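# zero_division=0 avoids a warning if the model never predicts class 1
p = precision_score(X_eval_labels, your_preds, zero_division=0)
r = recall_score(X_eval_labels, your_preds, zero_division=0)
f = f1_score(X_eval_labels, your_preds, zero_division=0)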

print("-- Your model on the Civil Comments test slice --")
print(f"""Accuracy : {a:.3f}
Precision: {p:.3f}
Recall   : {r:.3f}
F1‑score : {f:.3f}""")

#### Conclusion
You now have a fully‑working jailbreak classifier pipeline:
1. **Gathered data** from a public alignment dataset and created your own labels.
2. **Cleaned the text** and translated words into TF‑IDF vectors.
3. **Trained, tuned, and validated** a Logistic Regression model with balanced class weights.
4. **Spot‑checked** predictions on fresh prompts.
5. **Measured real‑world performance** on a toxicity corpus and interpreted four key metrics.

If you want to keep going with this project, here are some ideas to get you started:
* **Implement more features** – experiment with bigrams, character n‑grams, or sentence‑transformer embeddings.
* **Test alternate models** – try a linear SVM, gradient boosting, or a small neural network.
* **Deployment** – wrap the pipeline in a simple API (Flask/FastAPI) and use it to screen live prompts.
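
As a starting point for the first idea, here is a sketch of what word bigrams and character n‑grams look like. The helper names are made up; in scikit-learn, `TfidfVectorizer` exposes the `ngram_range` and `analyzer` parameters for the same purpose.

```python
def word_ngrams(text: str, n: int) -> list:
    """Return all runs of n consecutive whitespace-separated words."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text: str, n: int) -> list:
    """Return all runs of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(word_ngrams("pretend you are an evil ai", 2))
print(char_ngrams("b0mb", 3))
```

Bigrams let the model pick up on phrases such as "evil ai" that single words miss, while character n‑grams can help with deliberate misspellings such as "b0mb", where some fragments still overlap with the original word.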
