[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/aangelopoulos/conformal-prediction/blob/main/notebooks/toxic-text-outlier-detection.ipynb)

In [4]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import os
!pip install -U --no-cache-dir gdown --pre



In [5]:
# Load cached data from Detoxify model on Jigsaw dataset. See https://github.com/unitaryai/detoxify for details.
# The comments are from Wikipedia talk channels, and we are trying to perform outlier detection.
# We calibrate on the non-toxic data only, then flag toxic outliers while controlling the type-1 error.
if not os.path.exists('../data'):
    os.system('gdown 1h7S6N_Rx7gdfO3ZunzErZy6H7620EbZK -O ../data.tar.gz')
    os.system('tar -xf ../data.tar.gz -C ../')
    os.system('rm ../data.tar.gz')
    
data = np.load('../data/toxic-text/toxic-text-detoxify.npz')
preds = data['preds'] # Toxicity score in [0,1]
toxic = data['labels'] # Toxic (1) or not (0)



In [6]:
# Problem setup
alpha = 0.1 # alpha is the desired type-1 error rate (fraction of non-toxic comments flagged)
n = 10000 # Use 10000 calibration points

In [7]:
# Look at only the non-toxic data
nontoxic = toxic == 0
preds_nontoxic = preds[nontoxic]
preds_toxic = preds[np.invert(nontoxic)]

# Split nontoxic data into calibration and validation sets by shuffling a boolean mask with n True entries
idx = np.array([1] * n + [0] * (preds_nontoxic.shape[0]-n)) > 0
np.random.shuffle(idx)
cal_scores, val_scores = preds_nontoxic[idx], preds_nontoxic[np.invert(idx)]
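
A reproducible variant of the split above, as a minimal sketch. The helper name `split_calibration` and the seed are assumptions introduced for illustration. The cell above shuffles a boolean mask with the unseeded global NumPy generator, so its split changes from run to run.

```python
import numpy as np

def split_calibration(scores, n_cal, seed=0):
    """Hypothetical helper: randomly split scores into calibration and validation parts."""
    scores = np.asarray(scores)
    rng = np.random.default_rng(seed)         # seeded generator, so the split is repeatable
    perm = rng.permutation(scores.shape[0])   # random ordering of all indices
    return scores[perm[:n_cal]], scores[perm[n_cal:]]

# Toy scores, not the Detoxify predictions
split_demo_scores = np.linspace(0.0, 1.0, 20)
split_demo_cal, split_demo_val = split_calibration(split_demo_scores, n_cal=5, seed=0)
print(split_demo_cal.shape, split_demo_val.shape)  # (5,) (15,)
```

Calling the helper again with the same seed returns the same split, which makes the reported error rates easier to compare across runs.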

### Conformal outlier detection happens here

In [8]:
# Use the outlier detection method to get a threshold on the toxicities
qhat = np.quantile(cal_scores, np.ceil((n+1)*(1-alpha))/n)
# Perform outlier detection on the in-distribution (non-toxic validation) and out-of-distribution (toxic) data
outlier_ind = val_scores > qhat # We want this to be no more than alpha on average
outlier_ood = preds_toxic > qhat # We want this to be as large as possible, but it doesn't have a guarantee
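
The threshold rule above can be illustrated on synthetic scores. This is a sketch under stated assumptions, not the notebook's exact computation: the helper `conformal_threshold` is hypothetical and picks the `ceil((n+1)(1-alpha))`-th smallest calibration score directly, whereas the cell above uses `np.quantile` with its default interpolation. The Beta-distributed toy scores stand in for non-toxic toxicity scores and are not real data.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """Hypothetical helper: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    scores = np.sort(np.asarray(cal_scores, dtype=float))
    n = scores.shape[0]
    k = int(np.ceil((n + 1) * (1 - alpha)))  # 1-indexed rank of the threshold
    if k > n:
        return np.inf  # too few calibration points: nothing can be flagged
    return scores[k - 1]

# Synthetic "non-toxic" scores (assumption: any continuous distribution works here)
toy_rng = np.random.default_rng(0)
toy_cal_scores = toy_rng.beta(1, 8, size=1000)
toy_val_scores = toy_rng.beta(1, 8, size=50000)

toy_qhat = conformal_threshold(toy_cal_scores, alpha=0.1)
toy_type1 = (toy_val_scores > toy_qhat).mean()  # fraction of inliers flagged as outliers
print(f"toy threshold {toy_qhat:.4f}, toy type-1 error {toy_type1:.4f}")
```

Because the calibration and validation scores come from the same distribution, the flagged fraction should land near `alpha`. Nothing in this construction constrains how often true outliers are caught.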

In [9]:
# Calculate type-1 and type-2 errors
type1 = outlier_ind.mean()
type2 = 1-outlier_ood.mean()
print(f"The type-1 error is {type1:.4f}, the type-2 error is {type2:.4f}, and the threshold is {qhat:.4f}.")

The type-1 error is 0.1062, the type-2 error is 0.2865, and the threshold is 0.4632.
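
The type-1 guarantee holds on average over random draws of the calibration set, so a single run can land slightly above or below `alpha`. The sketch below estimates the size of that fluctuation on synthetic data. It assumes uniform scores (so the type-1 error for a given calibration set is exactly one minus the threshold) and uses the order-statistic threshold rather than the interpolated `np.quantile` call above. All `sim_*` names are introduced for illustration.

```python
import numpy as np

sim_rng = np.random.default_rng(1)
sim_alpha, sim_n, sim_trials = 0.1, 10000, 2000
sim_k = int(np.ceil((sim_n + 1) * (1 - sim_alpha)))  # 1-indexed rank of the threshold

sim_type1 = np.empty(sim_trials)
for t in range(sim_trials):
    sim_cal = sim_rng.uniform(size=sim_n)               # fresh calibration set each trial
    sim_thr = np.partition(sim_cal, sim_k - 1)[sim_k - 1]  # sim_k-th smallest score
    sim_type1[t] = 1.0 - sim_thr                        # P(uniform score > threshold)

print(f"mean type-1 error over calibration draws: {sim_type1.mean():.4f}")
print(f"std of type-1 error across draws:         {sim_type1.std():.4f}")
```

With these settings the mean sits just under `alpha`, while individual calibration draws spread around it by a few tenths of a percentage point.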


In [10]:
# Show some examples of unflagged and flagged text
content = pd.read_csv('../generation-scripts/toxic_text_utils/test.csv')['content']
print("Unflagged text examples:")
print(list(np.random.choice(content[preds <= qhat],size=(5,))))
print("\n\nFlagged text examples:")
print(list(np.random.choice(content[preds > qhat],size=(5,))))

Unflagged text examples:
['Ups, perdón por no investigar bien desde el principio. Ya actualizé el código, y según pruebas todo anda bien. ~ ', 'Sugiero que el comentario anterior sea borrado pues con tod respeto ha sido escrito por alguien desinformado, efectivamente históricamente ha habido un área catalanoparlante en Murcia. Efectivamente catalán y valenciano son variedades totalmente inteligibles de la misma  lengua  (no voy a entrar en la polémica de si esa lengua debe llamarse catalán-valenciano-balear, simplemente catalán o balcavarés) y efectivamente   Castell de Ferro   es un topónimo que debe su nombre a una comunidad catalana reubicada en esa región (es conocido que durante la reconquista y después de ella aparecieron se formaron unos pocos enclaves catalanoparlantes en Andalucía). ', ' Настоящие ученые всегда предпочитают вместо фразы  Явления А нет и быть не может  употреблять фразу  Мне и современной науке ничего не известно о явлении А .- ', 'Что-то странное творится у ме