# T5-ParaDetox Pipeline with Global Reranking

This notebook combines:
- **T5-base** fine-tuned on ParaDetox for detoxification
- **Global reranking** using toxicity, semantic similarity, and fluency

## Pipeline

1. Generate `num_candidates` detoxified texts per input using T5 sampling
2. Score each candidate using:
   - **Toxicity** (XLM-R large classifier)
   - **Semantic Similarity** (LaBSE embeddings)
   - **Fluency** (GPT-2 perplexity)
3. Select the candidate with the highest weighted score
4. Evaluate with BLEU, BERTScore, MeaningBERT (approximated here by cosine similarity of `all-MiniLM-L6-v2` sentence embeddings), perplexity, and toxicity

---

## Global Reranking Formula

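For each candidate $c$:

$$\text{Score}(c) = w_T \cdot (1 - \text{Toxicity}(c)) + w_S \cdot \text{Similarity}(c) + w_F \cdot \text{Fluency}(c)$$

All three terms lie in $[0, 1]$:
- $\text{Toxicity}(c)$: toxic-class probability from the XLM-R classifier
- $\text{Similarity}(c)$: LaBSE cosine similarity between $c$ and the source text, rescaled from $[-1, 1]$ to $[0, 1]$
- $\text{Fluency}(c)$: GPT-2 perplexity clipped to $[5, 300]$ and mapped on a log scale, so lower perplexity gives higher fluency

Default weights: $(w_T, w_S, w_F) = (0.5, 0.3, 0.2)$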
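
As a worked instance of the formula, the sketch below scores two candidates with the default weights. The component values are made up for illustration and are not outputs of the models used in this notebook; `global_score` is a hypothetical helper name.

```python
# Illustrative component scores in [0, 1]; these numbers are made up.
candidates = {
    "cand_a": {"toxicity": 0.10, "similarity": 0.90, "fluency": 0.60},
    "cand_b": {"toxicity": 0.02, "similarity": 0.70, "fluency": 0.80},
}

def global_score(c, weights=(0.5, 0.3, 0.2)):
    """Score(c) = w_T * (1 - toxicity) + w_S * similarity + w_F * fluency."""
    w_t, w_s, w_f = weights
    return w_t * (1.0 - c["toxicity"]) + w_s * c["similarity"] + w_f * c["fluency"]

scores = {name: global_score(c) for name, c in candidates.items()}
best = max(scores, key=scores.get)  # candidate with the highest weighted score
print(scores, best)
```

With these numbers `cand_a` scores 0.45 + 0.27 + 0.12 = 0.84 and `cand_b` scores 0.49 + 0.21 + 0.16 = 0.86, so the less toxic candidate wins even though it is less similar to the source.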

---

## `detoxify()` API

```python
def detoxify(
    data_type: str = "paradetox",
    output_folder: str = "T5_w_Global-Reranking",
    batch_size: int = 8,
    max_length: int = 128,
    num_examples: int = 100,
    num_candidates: int = 10,
    temperature: float = 1.0,
    top_k: int = 50,
    top_p: float = 0.95,
    weights: tuple = (0.5, 0.3, 0.2),  # (toxicity, similarity, fluency)
    overwrite_gen: bool = False,
    run_eval: bool = True,
    overwrite_eval: bool = False,
    echo: bool = False,
)
```

### Key Arguments

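- `data_type`: Dataset key, one of `paradetox`, `microagressions_test`, `sbf_test`, `dynabench_test`, `jigsaw_toxic`, `appdia_original`, `appdia_discourse`
- `output_folder`: Folder under `data/model_outputs/` for results
- `batch_size`: Currently unused; candidates are generated one input at a time
- `num_candidates`: Number of candidates to generate per input for reranking
- `weights`: Tuple of (toxicity_weight, similarity_weight, fluency_weight)
- `overwrite_gen`, `overwrite_eval`: If False, reuse an existing `gen.txt` or `gen_stats.txt` from a previous run instead of recomputing
- `echo`: If True, print example inputs, candidates, and outputs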

## Setup

In [7]:
#@title Mount Drive & locate project
from google.colab import drive
drive.mount('/content/drive')

import os, sys, torch

# Set your project base path
PROJECT_BASE = "/content/drive/MyDrive/ds266/w266 - Project"
XDETOX_DIR = os.path.join(PROJECT_BASE, "XDetox")
T5_CHECKPOINT = os.path.join(PROJECT_BASE, "t5-base-detox-model")
DATASET_BASE = XDETOX_DIR

# Add XDetox to path
if XDETOX_DIR not in sys.path:
    sys.path.append(XDETOX_DIR)

print("PROJECT_BASE:", PROJECT_BASE)
print("XDETOX_DIR:", XDETOX_DIR)
print("T5_CHECKPOINT:", T5_CHECKPOINT)

assert os.path.isdir(PROJECT_BASE), f"PROJECT_BASE does not exist: {PROJECT_BASE}"
assert os.path.isdir(XDETOX_DIR), f"XDETOX_DIR does not exist: {XDETOX_DIR}"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
PROJECT_BASE: /content/drive/MyDrive/ds266/w266 - Project
XDETOX_DIR: /content/drive/MyDrive/ds266/w266 - Project/XDetox
T5_CHECKPOINT: /content/drive/MyDrive/ds266/w266 - Project/t5-base-detox-model


In [8]:
#@title Runtime setup (cache, GPU)
HF_CACHE = os.path.join(PROJECT_BASE, "cache")
os.makedirs(HF_CACHE, exist_ok=True)
os.environ["TRANSFORMERS_CACHE"] = HF_CACHE
os.environ["WANDB_DISABLED"] = "true"

print("TRANSFORMERS_CACHE:", HF_CACHE)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

TRANSFORMERS_CACHE: /content/drive/MyDrive/ds266/w266 - Project/cache
CUDA available: True
GPU: NVIDIA L4


In [9]:
#@title Install dependencies
!pip install -q transformers torch datasets
!pip install -q evaluate sacrebleu bert-score
!pip install -q sentence-transformers accelerate -U
!pip install -q rouge_score pandas numpy scikit-learn matplotlib nltk

In [10]:
#@title NLTK data
import nltk
nltk.download("punkt", quiet=True)
try:
    nltk.download("punkt_tab", quiet=True)
except Exception:
    pass
print("NLTK ready")

NLTK ready


In [11]:
#@title Import libraries
import glob, re, json, shutil, math
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
from pathlib import Path
from typing import List

from transformers import (
    T5Tokenizer, T5ForConditionalGeneration,
    AutoTokenizer, AutoModel, AutoModelForSequenceClassification,
    GPT2Tokenizer, GPT2LMHeadModel, GPT2TokenizerFast
)
from sentence_transformers import SentenceTransformer
from evaluate import load

print("Libraries imported")

Libraries imported


## Dataset Configuration

In [12]:
#@title Data configs (matching XDetox datasets)
data_configs = {
    "paradetox": {
        "data_path": "./datasets/paradetox/test_toxic_parallel.txt",
        "format": "txt",
    },
    "microagressions_test": {
        "data_path": "./datasets/microagressions/test.csv",
        "format": "csv",
    },
    "sbf_test": {
        "data_path": "./datasets/sbf/sbftst.csv",
        "format": "csv",
    },
    "dynabench_test": {
        "data_path": "./datasets/dynabench/db_test.csv",
        "format": "csv",
    },
    "jigsaw_toxic": {
        "data_path": "./datasets/jigsaw_full_30/test_10k_toxic.txt",
        "format": "txt",
    },
    "appdia_original": {
        "data_path": "./datasets/appdia/original-annotated-data/original-test.tsv",
        "format": "tsv",
    },
    "appdia_discourse": {
        "data_path": "./datasets/appdia/discourse-augmented-data/discourse-test.tsv",
        "format": "tsv",
    },
}

print(f"{len(data_configs)} datasets configured:")
for name in data_configs.keys():
    print(f"  - {name}")

7 datasets configured:
  - paradetox
  - microagressions_test
  - sbf_test
  - dynabench_test
  - jigsaw_toxic
  - appdia_original
  - appdia_discourse


## Helper Functions

In [13]:
#@title Helper functions

def _ensure_dir(p):
    Path(p).mkdir(parents=True, exist_ok=True)

def load_test_data(data_type, num_examples=None):
    """
    Load test data from various formats (.txt, .csv, .tsv).
    Returns list of toxic texts as strings.
    """
    if data_type not in data_configs:
        raise ValueError(f"Unknown data_type: {data_type}")

    cfg = data_configs[data_type]
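    # normpath drops the leading "./"; str.lstrip("./") would strip any run of leading "." or "/" characters
    data_path = os.path.join(DATASET_BASE, os.path.normpath(cfg["data_path"]))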

    texts = []

    if cfg["format"] == "txt":
        with open(data_path, 'r', encoding='utf-8') as f:
            texts = [line.strip() for line in f if line.strip()]

    elif cfg["format"] == "csv":
        df = pd.read_csv(data_path)
        if 'text' in df.columns:
            texts = df['text'].tolist()
        elif 'toxic' in df.columns:
            texts = df['toxic'].tolist()
        else:
            texts = df.iloc[:, 0].tolist()

    elif cfg["format"] == "tsv":
        df = pd.read_csv(data_path, sep='\t')
        if 'text' in df.columns:
            texts = df['text'].tolist()
        else:
            texts = df.iloc[:, 0].tolist()

    # Clean and convert to strings
    cleaned_texts = []
    for text in texts:
        if pd.isna(text):
            continue
        text_str = str(text).strip()
        if text_str:
            cleaned_texts.append(text_str)

    if num_examples and num_examples > 0:
        cleaned_texts = cleaned_texts[:num_examples]

    return cleaned_texts

def _safe_float(x):
    try:
        return float(x)
    except Exception:
        return float('nan')

def _read_stats_file(path):
    """Read gen_stats.txt into dict."""
    out = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if ":" not in line:
                continue
            # partition never raises, even when the value after ":" is empty
            k, _, v = line.strip().partition(":")
            k = k.replace("(skipped)", "").strip().lower()
            out[k] = _safe_float(v)
    return out

print("Helper functions loaded")

Helper functions loaded
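
The stats file read by `_read_stats_file` is a plain list of `key: value` lines. The sketch below parses the same shape from an in-memory buffer. `parse_stats_lines` is a hypothetical stand-in, not the notebook's helper, and it omits the `(skipped)` key cleanup.

```python
import io

def parse_stats_lines(lines):
    """Parse 'key: value' lines into a dict of floats (NaN when the value is not numeric)."""
    out = {}
    for line in lines:
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # skip lines without a "key: value" shape
        try:
            out[key.strip().lower()] = float(value)
        except ValueError:
            out[key.strip().lower()] = float("nan")
    return out

# Sample content in the same shape as gen_stats.txt; the last two lines are edge cases
sample = io.StringIO(
    "toxicity_gen: 0.0392\nbleu4: 49.582\nnote without separator\nbertscore: n/a\n"
)
stats = parse_stats_lines(sample)
print(stats)
```

Non-numeric values become NaN rather than raising, which matches the intent of `_safe_float`.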


## T5 Model Loading

In [14]:
#@title Load T5 model
print(f"Loading T5 model from {T5_CHECKPOINT}...")

t5_tokenizer = T5Tokenizer.from_pretrained(T5_CHECKPOINT)
t5_model = T5ForConditionalGeneration.from_pretrained(T5_CHECKPOINT)
t5_model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
t5_model = t5_model.to(device)

print(f"T5 model loaded on {device}")

Loading T5 model from /content/drive/MyDrive/ds266/w266 - Project/t5-base-detox-model...
T5 model loaded on cuda


## T5 Multi-Candidate Generation

In [15]:
#@title T5 multi-candidate generation functions

def t5_generate_candidates(text, model, tokenizer, num_candidates,
                           temperature=1.0, top_k=50, top_p=0.95,
                           max_length=128, device="cuda"):
    """
    Generate num_candidates different outputs via sampling.
    """
    input_text = f"detoxify: {text}"
    input_ids = tokenizer.encode(input_text, return_tensors='pt',
                                  max_length=max_length, truncation=True)
    input_ids = input_ids.to(device)

    with torch.no_grad():
        outputs = model.generate(
            input_ids,
            max_length=max_length,
            num_return_sequences=num_candidates,
            do_sample=True,
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            no_repeat_ngram_size=2
        )

    candidates = [tokenizer.decode(out, skip_special_tokens=True) for out in outputs]
    return candidates

def t5_generate_candidates_batch(texts, model, tokenizer, num_candidates,
                                  temperature=1.0, top_k=50, top_p=0.95,
                                  max_length=128, device="cuda"):
    """
    Generate candidates for multiple inputs.
    Inputs are processed one at a time in a loop, not as a padded batch.
    """
    all_candidates = []
    for text in tqdm(texts, desc="T5 Generation"):
        candidates = t5_generate_candidates(
            text, model, tokenizer, num_candidates,
            temperature, top_k, top_p, max_length, device
        )
        all_candidates.append(candidates)
    return all_candidates

# Test
test_text = "This is a stupid idea"
candidates = t5_generate_candidates(test_text, t5_model, t5_tokenizer,
                                     num_candidates=3, device=device)
print(f"Input: {test_text}")
print(f"Candidates:")
for i, c in enumerate(candidates):
    print(f"  [{i}]: {c}")

Input: This is a stupid idea
Candidates:
  [0]: This is a bad idea.
  [1]: This is a bad idea.
  [2]: This is a bad idea.
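
Sampling can return identical candidates, as in the test output above, and reranking then has nothing to choose between. A minimal sketch of a diversity check follows; `distinct_ratio` is a hypothetical helper, not part of the pipeline.

```python
def distinct_ratio(candidates):
    """Fraction of candidates that are unique after whitespace and case normalization."""
    if not candidates:
        return 0.0
    normalized = {" ".join(c.lower().split()) for c in candidates}
    return len(normalized) / len(candidates)

sampled = ["This is a bad idea.", "This is a bad idea.", "This is a bad idea."]
print(distinct_ratio(sampled))  # one distinct string out of three
```

A consistently low ratio may be a reason to try a higher `temperature` or more candidates, though that trade-off is not evaluated in this notebook.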


## Global Reranking Functions

In [16]:
#@title Global reranking models setup

DEVICE_SCORE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Scoring models will use: {DEVICE_SCORE}")

# --- Toxicity model (XLM-R large) ---
_TOX_MODEL_NAME = "textdetox/xlmr-large-toxicity-classifier-v2"
_TOX_TOKENIZER = None
_TOX_MODEL = None

def _lazy_load_tox():
    global _TOX_TOKENIZER, _TOX_MODEL
    if _TOX_MODEL is None:
        print("Loading toxicity model...")
        _TOX_TOKENIZER = AutoTokenizer.from_pretrained(_TOX_MODEL_NAME)
        _TOX_MODEL = AutoModelForSequenceClassification.from_pretrained(_TOX_MODEL_NAME)
        _TOX_MODEL.to(DEVICE_SCORE).eval()

# --- Similarity model (LaBSE) ---
_LABSE_NAME = "sentence-transformers/LaBSE"
_LABSE_TOKENIZER = None
_LABSE_MODEL = None

def _lazy_load_labse():
    global _LABSE_TOKENIZER, _LABSE_MODEL
    if _LABSE_MODEL is None:
        print("Loading LaBSE model...")
        _LABSE_TOKENIZER = AutoTokenizer.from_pretrained(_LABSE_NAME)
        _LABSE_MODEL = AutoModel.from_pretrained(_LABSE_NAME).to(DEVICE_SCORE).eval()

# --- Fluency model (GPT-2) ---
_GPT2_NAME = "gpt2"
_GPT2_TOK = None
_GPT2_MOD = None

def _lazy_load_gpt2_scorer():
    global _GPT2_TOK, _GPT2_MOD
    if _GPT2_MOD is None:
        print("Loading GPT-2 model...")
        _GPT2_TOK = GPT2TokenizerFast.from_pretrained(_GPT2_NAME)
        _GPT2_MOD = GPT2LMHeadModel.from_pretrained(_GPT2_NAME).to(DEVICE_SCORE).eval()

print("Scoring model loaders defined")

Scoring models will use: cuda
Scoring model loaders defined


In [17]:
#@title Global scoring functions

@torch.no_grad()
def get_toxicity_scores(texts, batch_size=32):
    """
    Returns toxicity probabilities in [0,1] for each input text.
    (0 = non-toxic, 1 = very toxic)
    """
    _lazy_load_tox()
    scores = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Toxicity", leave=False):
        batch = texts[i:i+batch_size]
        enc = _TOX_TOKENIZER(
            batch, return_tensors="pt",
            truncation=True, max_length=512, padding=True
        ).to(DEVICE_SCORE)
        logits = _TOX_MODEL(**enc).logits
        probs = torch.softmax(logits, dim=-1)
        scores.extend(probs[:, 1].cpu().tolist())  # toxic prob
    return scores

@torch.no_grad()
def get_labse_embeddings(texts, batch_size=32):
    """
    Returns mean-pooled LaBSE sentence embeddings.
    """
    _lazy_load_labse()
    embs = []
    for i in tqdm(range(0, len(texts), batch_size), desc="LaBSE embeddings", leave=False):
        batch = texts[i:i+batch_size]
        enc = _LABSE_TOKENIZER(
            batch, return_tensors="pt",
            truncation=True, max_length=256, padding=True
        ).to(DEVICE_SCORE)
        outputs = _LABSE_MODEL(**enc)
        hidden = outputs.last_hidden_state
        mask = enc["attention_mask"].unsqueeze(-1)
        masked = hidden * mask
        summed = masked.sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-6)
        sent_emb = (summed / counts).cpu().numpy()
        embs.append(sent_emb)
    if not embs:
        return np.zeros((0, 768), dtype=np.float32)
    return np.vstack(embs)

@torch.no_grad()
def get_gpt2_perplexities(texts):
    """
    Simple sentence-level perplexity using GPT-2.
    """
    _lazy_load_gpt2_scorer()
    ppls = []
    for s in tqdm(texts, desc="GPT-2 PPL", leave=False):
        enc = _GPT2_TOK(s, return_tensors="pt").to(DEVICE_SCORE)
        out = _GPT2_MOD(enc["input_ids"], labels=enc["input_ids"])
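        ppl = math.exp(out.loss.item())
        # Clip extreme values; a NaN loss (input with no next-token targets) is
        # treated as worst case so it cannot propagate into the reranking score
        if math.isnan(ppl) or ppl > 1e4:
            ppl = 1e4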
        ppls.append(float(ppl))
    return ppls

def perplexity_to_fluency(ppls, p_min=5.0, p_max=300.0):
    """
    Map perplexities to [0,1] fluency scores.
    Low perplexity -> high fluency.
    """
    ppls = np.asarray(ppls, dtype=float)
    p = np.clip(ppls, p_min, p_max)
    log_p = np.log(p)
    log_min = math.log(p_min)
    log_max = math.log(p_max)
    F = (log_max - log_p) / (log_max - log_min + 1e-8)
    F = np.clip(F, 0.0, 1.0)
    return F

print("Global scoring functions defined")

Global scoring functions defined
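
To make the perplexity-to-fluency mapping concrete, here is a standard-library sketch of the same log-scale formula for a single value. `fluency_from_ppl` is a hypothetical name, and it omits the `1e-8` term that `perplexity_to_fluency` adds to the denominator.

```python
import math

def fluency_from_ppl(ppl, p_min=5.0, p_max=300.0):
    """Log-scale a perplexity into [0, 1]; p_min maps to 1.0 and p_max to 0.0."""
    p = min(max(ppl, p_min), p_max)  # clip to [p_min, p_max]
    return (math.log(p_max) - math.log(p)) / (math.log(p_max) - math.log(p_min))

# The geometric mean of the bounds lands at the midpoint of the scale
for ppl in (5.0, math.sqrt(5.0 * 300.0), 300.0, 1e4):
    print(f"ppl={ppl:8.1f} -> fluency={fluency_from_ppl(ppl):.3f}")
```

Because the scale is logarithmic, the midpoint 0.5 falls at a perplexity of about 38.7, not at 152.5, and every perplexity above 300 is scored 0.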


In [18]:
#@title Global reranking function

def rerank_candidates_global(sources, candidates, weights=(0.5, 0.3, 0.2)):
    """
    Rerank using: Score = w_T*(1-tox) + w_S*sim + w_F*flu

    Args:
        sources: Original texts (N)
        candidates: Candidate lists (N x C)
        weights: (w_toxicity, w_similarity, w_fluency)

    Returns:
        List of best candidates (N)
    """
    w_T, w_S, w_F = weights
    N = len(sources)
    assert len(candidates) == N, "candidates length mismatch"

    if N == 0:
        return []

    C_list = [len(c) for c in candidates]
    C = C_list[0]
    assert all(c == C for c in C_list), "All inputs must have same num_candidates"

    # Flatten candidates
    flat = [c for clist in candidates for c in clist]
    flat_idx = np.repeat(np.arange(N), C)

    print("  Computing toxicity scores...")
    tox = np.array(get_toxicity_scores(flat))
    safety = 1.0 - tox

    print("  Computing similarity scores...")
    src_embs = get_labse_embeddings(sources)
    cand_embs = get_labse_embeddings(flat)
    # Normalize
    src_embs = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True).clip(1e-8)
    cand_embs = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True).clip(1e-8)
    # Cosine similarity
    sims = np.sum(cand_embs * src_embs[flat_idx], axis=1)
    sims = (sims + 1) / 2  # [0,1]

    print("  Computing fluency scores...")
    ppls = get_gpt2_perplexities(flat)
    flus = perplexity_to_fluency(ppls)

    # Global score
    scores = w_T * safety + w_S * sims + w_F * flus
    scores = scores.reshape(N, C)

    best_idx = scores.argmax(axis=1)
    return [candidates[i][best_idx[i]] for i in range(N)]

print("Global reranking function defined")

Global reranking function defined
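
`rerank_candidates_global` scores all candidates as one flat list, then reshapes to N x C and takes the per-row argmax. The sketch below mirrors that indexing with plain lists and made-up scores; `select_best` is a hypothetical name.

```python
def select_best(candidates, flat_scores):
    """Pick the top-scoring candidate per input from a flat, row-major score list."""
    n = len(candidates)
    c = len(candidates[0]) if n else 0
    assert len(flat_scores) == n * c, "one score per candidate is required"
    best = []
    for i in range(n):
        row = flat_scores[i * c:(i + 1) * c]    # scores belonging to input i
        j = max(range(c), key=row.__getitem__)  # index of the highest score in the row
        best.append(candidates[i][j])
    return best

cands = [["a0", "a1", "a2"], ["b0", "b1", "b2"]]
scores = [0.2, 0.9, 0.4, 0.7, 0.1, 0.3]  # made-up values, row-major order
print(select_best(cands, scores))
```

Ties resolve to the first candidate in the row, which is also how `argmax` behaves.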


## Evaluation Functions

In [19]:
#@title Load evaluation models
print("Loading evaluation models...")

# Toxicity classifier (for evaluation, separate from reranking)
eval_tox_tokenizer = AutoTokenizer.from_pretrained("s-nlp/roberta_toxicity_classifier")
eval_tox_model = AutoModelForSequenceClassification.from_pretrained("s-nlp/roberta_toxicity_classifier")
eval_tox_model.eval()
eval_tox_model = eval_tox_model.to(device)

# Perplexity model (GPT-2 medium for evaluation)
eval_ppl_tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
eval_ppl_model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
eval_ppl_model.eval()
eval_ppl_model = eval_ppl_model.to(device)
if eval_ppl_tokenizer.pad_token is None:
    eval_ppl_tokenizer.pad_token = eval_ppl_tokenizer.eos_token

# Sentence embeddings for the MeaningBERT-style similarity proxy
sim_model = SentenceTransformer('all-MiniLM-L6-v2')

# Metrics
bleu_metric = load("sacrebleu")
bertscore_metric = load("bertscore")

print("Evaluation models loaded")

Loading evaluation models...



Evaluation models loaded


In [20]:
#@title Evaluation functions

def compute_toxicity(texts, tokenizer, model, device="cuda", batch_size=32):
    """
    Compute average toxicity score.
    """
    all_scores = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=512)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            toxic_probs = predictions[:, 1]
            all_scores.extend(toxic_probs.cpu().tolist())

    return np.mean(all_scores)

def compute_perplexity(texts, tokenizer, model, device="cuda"):
    """
    Compute average perplexity.
    """
    perplexities = []

    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
            loss = outputs.loss
            ppl = torch.exp(loss).item()
            perplexities.append(ppl)

    return np.mean(perplexities)

def compute_bertscore(predictions, references):
    """
    Compute BERTScore.
    """
    result = bertscore_metric.compute(predictions=predictions, references=references, lang="en")
    return np.mean(result['f1'])

def compute_bleu(predictions, references):
    """
    Compute BLEU score.
    """
    formatted_refs = [[ref] for ref in references]
    result = bleu_metric.compute(predictions=predictions, references=formatted_refs)
    return result['score']

def compute_meaningbert(predictions, references):
    """
    Compute a MeaningBERT-style meaning-preservation proxy.
    Uses cosine similarity of `all-MiniLM-L6-v2` sentence embeddings
    (not the MeaningBERT model itself), scaled to 0-100.
    """
    pred_embs = sim_model.encode(predictions, convert_to_tensor=True)
    ref_embs = sim_model.encode(references, convert_to_tensor=True)
    cosine_scores = torch.nn.functional.cosine_similarity(pred_embs, ref_embs)
    return cosine_scores.mean().item() * 100

def evaluate_all(orig_texts, gen_texts, device="cuda"):
    """
    Run all evaluations including MeaningBERT.
    """
    results = {}

    print("  Computing toxicity scores...")
    results['toxicity_gen'] = compute_toxicity(gen_texts, eval_tox_tokenizer, eval_tox_model, device)
    results['toxicity_orig'] = compute_toxicity(orig_texts, eval_tox_tokenizer, eval_tox_model, device)

    print("  Computing perplexity...")
    results['perplexity_gen'] = compute_perplexity(gen_texts, eval_ppl_tokenizer, eval_ppl_model, device)
    results['perplexity_orig'] = compute_perplexity(orig_texts, eval_ppl_tokenizer, eval_ppl_model, device)

    print("  Computing BERTScore...")
    results['bertscore'] = compute_bertscore(gen_texts, orig_texts)

    print("  Computing MeaningBERT...")
    results['meaningbert'] = compute_meaningbert(gen_texts, orig_texts)

    print("  Computing BLEU...")
    results['bleu4'] = compute_bleu(gen_texts, orig_texts)

    return results

print("Evaluation functions defined")

Evaluation functions defined
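
Both `get_gpt2_perplexities` and `compute_perplexity` exponentiate the mean cross-entropy loss returned by the language model. The sketch below shows that relationship on made-up token probabilities; `perplexity` is a hypothetical helper.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the predicted tokens."""
    if not token_log_probs:
        return float("nan")  # nothing to predict, e.g. an input with no next-token targets
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Made-up natural-log probabilities for four predicted tokens
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(perplexity(log_probs))
```

Here the mean negative log-likelihood is 1.5 ln 2, so the perplexity is 2^1.5, about 2.83. A single NaN among per-text perplexities makes an average taken with `np.mean` NaN as well.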


## Main Pipeline Function

In [21]:
#@title detoxify() - T5 + Global reranking pipeline

def detoxify(
    data_type: str = "paradetox",
    output_folder: str = "T5_w_Global-Reranking",
    batch_size: int = 8,
    max_length: int = 128,
    num_examples: int = 100,
    num_candidates: int = 10,
    temperature: float = 1.0,
    top_k: int = 50,
    top_p: float = 0.95,
    weights: tuple = (0.5, 0.3, 0.2),  # (toxicity, similarity, fluency)
    overwrite_gen: bool = False,
    run_eval: bool = True,
    overwrite_eval: bool = False,
    echo: bool = False,
):
    """
    T5-ParaDetox pipeline with Global reranking.

    1. Generate num_candidates detoxified texts per input using T5 sampling
    2. Score each candidate using toxicity + similarity + fluency
    3. Select candidate with highest weighted score
    4. Evaluate with BLEU, BERTScore, MeaningBERT, Perplexity, Toxicity
    """
    assert data_type in data_configs, f"Unknown data_type: {data_type}"

    # Output paths
    base_out_dir = os.path.join(XDETOX_DIR, "data", "model_outputs", output_folder)
    data_out_dir = os.path.join(base_out_dir, data_type)
    _ensure_dir(data_out_dir)

    orig_path = os.path.join(data_out_dir, "orig.txt")
    gen_path = os.path.join(data_out_dir, "gen.txt")
    stats_path = os.path.join(data_out_dir, "gen_stats.txt")

    # Load data
    print(f"\n[{data_type}] Loading data...")
    orig_texts = load_test_data(data_type, num_examples)
    print(f"  Loaded {len(orig_texts)} examples")

    if echo:
        print(f"\n[echo] Example inputs (first 3):")
        for i, s in enumerate(orig_texts[:3]):
            print(f"  input[{i}]: {s}")
        print(f"\n[echo] Weights (toxicity, similarity, fluency): {weights}")

    # Generate or load
    if overwrite_gen or not os.path.exists(gen_path):
        print(f"  Generating {num_candidates} candidates per input...")
        all_candidates = t5_generate_candidates_batch(
            orig_texts, t5_model, t5_tokenizer, num_candidates,
            temperature, top_k, top_p, max_length, device
        )

        if echo:
            print(f"\n[echo] Example candidates for input[0]:")
            for j, c in enumerate(all_candidates[0][:3]):
                print(f"    candidate[{j}]: {c}")

        print(f"  Global reranking (toxicity + similarity + fluency)...")
        gen_texts = rerank_candidates_global(orig_texts, all_candidates, weights)

        if echo:
            print(f"\n[echo] Selected outputs (first 3):")
            for i, g in enumerate(gen_texts[:3]):
                print(f"  output[{i}]: {g}")

        # Save outputs
        with open(orig_path, 'w', encoding='utf-8') as f:
            for t in orig_texts:
                f.write(re.sub(r"\s+", " ", t).strip() + '\n')
        with open(gen_path, 'w', encoding='utf-8') as f:
            for t in gen_texts:
                f.write(re.sub(r"\s+", " ", t).strip() + '\n')

        print(f"  Saved outputs to {data_out_dir}")
    else:
        print(f"  Loading existing outputs...")
        with open(orig_path, 'r', encoding='utf-8') as f:
            orig_texts = [l.strip() for l in f]
        with open(gen_path, 'r', encoding='utf-8') as f:
            gen_texts = [l.strip() for l in f]
        print(f"  Loaded {len(gen_texts)} examples")

    # Evaluate
    if run_eval and (overwrite_eval or not os.path.exists(stats_path)):
        print(f"  Running evaluation...")
        results = evaluate_all(orig_texts, gen_texts, device)

        with open(stats_path, 'w') as f:
            for k, v in results.items():
                f.write(f"{k}: {v}\n")

        if echo:
            print(f"\n[echo] Evaluation metrics:")
            for k, v in results.items():
                print(f"  {k}: {v:.4f}")

        print(f"  Saved stats to {stats_path}")
        return results

    elif run_eval:
        print(f"  Loading existing stats...")
        results = _read_stats_file(stats_path)
        return results

    return None

print("detoxify() function defined")

detoxify() function defined


## Run Evaluation

In [22]:
#@title Example run on paradetox
detoxify(
    data_type="paradetox",
    output_folder="T5_w_Global-Reranking",
    num_examples=1000,
    num_candidates=10,
    temperature=1.0,
    weights=(0.5, 0.3, 0.2),
    run_eval=True,
    overwrite_gen=True,
    overwrite_eval=True,
    echo=True,
)


[paradetox] Loading data...
  Loaded 671 examples

[echo] Example inputs (first 3):
  input[0]: . or the loud ass one - thousand ton beast roaring towards you howling its horn .
  input[1]: mandated  and " right fucking now " would be good .
  input[2]: * neither * of my coworkers gave a shit when it came time to ditch mitch . ugh .

[echo] Weights (toxicity, similarity, fluency): (0.5, 0.3, 0.2)
  Generating 10 candidates per input...




[echo] Example candidates for input[0]:
    candidate[0]: the loud one thousand ton beast roaring towards you howling its horn
    candidate[1]: Or the loud one - thousand ton beast roaring toward you howling its horn.
    candidate[2]: . or the one- thousand ton beast roaring towards you howling its horn
  Global reranking (toxicity + similarity + fluency)...
  Computing toxicity scores...
Loading toxicity model...



  Computing similarity scores...
Loading LaBSE model...



  Computing fluency scores...
Loading GPT-2 model...





[echo] Selected outputs (first 3):
  output[0]: the loud one thousand ton beast roaring towards you howling its horn
  output[1]: mandated and "right now" would be good.
  output[2]: I am not one of my coworkers, but neither of the coworking members did a great job when it came time to ditch Mitch.
  Saved outputs to /content/drive/MyDrive/ds266/w266 - Project/XDetox/data/model_outputs/T5_w_Global-Reranking/paradetox
  Running evaluation...
  Computing toxicity scores...
  Computing perplexity...
  Computing BERTScore...




  Computing MeaningBERT...
  Computing BLEU...

[echo] Evaluation metrics:
  toxicity_gen: 0.0392
  toxicity_orig: 0.9771
  perplexity_gen: 223.8922
  perplexity_orig: 354.8184
  bertscore: 0.9352
  meaningbert: 80.4369
  bleu4: 49.5820
  Saved stats to /content/drive/MyDrive/ds266/w266 - Project/XDetox/data/model_outputs/T5_w_Global-Reranking/paradetox/gen_stats.txt


{'toxicity_gen': np.float64(0.03924082722830195),
 'toxicity_orig': np.float64(0.9771397442982155),
 'perplexity_gen': np.float64(223.89221068537182),
 'perplexity_orig': np.float64(354.81836553124486),
 'bertscore': np.float64(0.9351542426825635),
 'meaningbert': 80.43691515922546,
 'bleu4': 49.58202781389661}

In [23]:
#@title Run on multiple datasets

datasets_to_eval = ["paradetox", "microagressions_test", "sbf_test", "dynabench_test"]
num_examples = 200
output_folder = "T5_w_Global-Reranking"

all_results = {}

print("=" * 80)
print("T5-PARADETOX + GLOBAL RERANKING PIPELINE")
print("=" * 80)

for dataset_name in datasets_to_eval:
    try:
        results = detoxify(
            data_type=dataset_name,
            output_folder=output_folder,
            batch_size=8,
            max_length=128,
            num_examples=num_examples,
            num_candidates=10,
            temperature=1.0,
            weights=(0.5, 0.3, 0.2),
            overwrite_gen=False,
            run_eval=True,
            overwrite_eval=False,
            echo=False,
        )

        if results:
            all_results[dataset_name] = results
            print(f"  {dataset_name} complete!")

    except Exception as e:
        print(f"  Error on {dataset_name}: {e}")
        import traceback
        traceback.print_exc()
        continue

print("\n" + "=" * 80)

T5-PARADETOX + GLOBAL RERANKING PIPELINE

[paradetox] Loading data...
  Loaded 200 examples
  Loading existing outputs...
  Loaded 671 examples
  Loading existing stats...
  paradetox complete!

[microagressions_test] Loading data...
  Loaded 200 examples
  Loading existing outputs...
  Loaded 200 examples
  Loading existing stats...
  microagressions_test complete!

[sbf_test] Loading data...
  Loaded 200 examples
  Loading existing outputs...
  Loaded 200 examples
  Loading existing stats...
  sbf_test complete!

[dynabench_test] Loading data...
  Loaded 200 examples
  Loading existing outputs...
  Loaded 200 examples
  Loading existing stats...
  dynabench_test complete!
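
In the log above, the `paradetox` run requested 200 examples but reused cached outputs with 671, because `detoxify()` reuses `gen.txt` whenever it exists. A minimal sketch of a staleness check follows; `cached_count_matches` is a hypothetical helper and is not called by the pipeline.

```python
import os
import tempfile

def cached_count_matches(gen_path, expected_n):
    """Return True if gen_path exists and holds exactly expected_n lines."""
    if not os.path.exists(gen_path):
        return False
    with open(gen_path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f) == expected_n

# Demonstrate on a throwaway file with three cached outputs
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gen.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("one\ntwo\nthree\n")
    match = cached_count_matches(path, 3)
    mismatch = cached_count_matches(path, 2)

print(match, mismatch)
```

One way to use such a check would be to pass `overwrite_gen=True` when it returns False, so cached outputs from a run with a different `num_examples` are regenerated.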



## Results Summary

In [24]:
#@title Display results table

if all_results:
    rows = []
    for dataset_name, results in all_results.items():
        row = {'dataset': dataset_name}
        row.update(results)
        rows.append(row)

    df = pd.DataFrame(rows)

    col_order = [
        'dataset',
        'bertscore',
        'meaningbert',
        'bleu4',
        'perplexity_gen',
        'perplexity_orig',
        'toxicity_gen',
        'toxicity_orig',
    ]
    df = df[[col for col in col_order if col in df.columns]]

    # Save to CSV
    summary_csv = os.path.join(XDETOX_DIR, "data", "model_outputs", output_folder, "t5_global_summary.csv")
    df.to_csv(summary_csv, index=False)
    print(f"Saved summary to {summary_csv}\n")

    # Display
    print("=" * 80)
    print("T5-PARADETOX + GLOBAL RERANKING RESULTS")
    print("=" * 80)
    print(df.to_string(index=False))
    print("=" * 80)
else:
    print("No results available.")

Saved summary to /content/drive/MyDrive/ds266/w266 - Project/XDetox/data/model_outputs/T5_w_Global-Reranking/t5_global_summary.csv

T5-PARADETOX + GLOBAL RERANKING RESULTS
             dataset  bertscore  meaningbert     bleu4  perplexity_gen  perplexity_orig  toxicity_gen  toxicity_orig
           paradetox   0.935154    80.436915 49.582028      223.892211       354.818366      0.039241       0.977140
microagressions_test   0.979861    94.674039 77.263657       86.850546       108.896956      0.025195       0.050522
            sbf_test   1.000000   100.000000  0.000000             NaN              NaN      0.000045       0.000045
      dynabench_test   0.962556    87.311268 71.717564      156.667103       312.835927      0.251320       0.499826
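
The `sbf_test` row has NaN perplexities, a BERTScore of 1.0, and identical generated and original toxicity, so it may warrant a check of which column was loaded for that dataset. The sketch below flags such rows. `flag_suspect_rows` is a hypothetical helper and its criteria are assumptions, not an established validity test. The sample values are copied from the table above.

```python
import math

def flag_suspect_rows(rows):
    """Return dataset names whose metrics contain NaN or whose toxicity did not change."""
    flagged = []
    for row in rows:
        has_nan = any(isinstance(v, float) and math.isnan(v) for v in row.values())
        unchanged = row.get("toxicity_gen") == row.get("toxicity_orig")
        if has_nan or unchanged:
            flagged.append(row["dataset"])
    return flagged

rows = [
    {"dataset": "paradetox", "perplexity_gen": 223.892211,
     "toxicity_gen": 0.039241, "toxicity_orig": 0.977140},
    {"dataset": "sbf_test", "perplexity_gen": float("nan"),
     "toxicity_gen": 0.000045, "toxicity_orig": 0.000045},
]
print(flag_suspect_rows(rows))
```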
