# Hate, Abuse, and Profanity (HAP) Detection

This recipe shows how to use a model that detects _hate, abuse, and profanity_ (HAP) in a prompt, in the generated output, or in both. It is an example of a “guard rail” of the kind typically used for safety in generative AI applications.

## Install and Import the Necessary Packages

In [None]:
!pip install transformers torch nltk

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch, nltk

Check whether a CUDA GPU is available on your computer, falling back to the CPU if not.

In [None]:
import platform
device_name = "cuda:0" if torch.cuda.is_available() else "cpu"
device = torch.device(device_name)
print(f"device: {device}, system: {platform.system()}, processor: {platform.machine()}")
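The check above only looks for CUDA. As a minimal sketch of a broader check, the hypothetical helper `pick_device_name` below (not part of the original recipe) also tries Apple's MPS backend, assuming a PyTorch build that exposes `torch.backends.mps`. The guarded import is only there so the sketch runs where PyTorch is missing.

```python
try:
    import torch
except ImportError:  # lets the sketch run even where torch is not installed
    torch = None


def pick_device_name() -> str:
    """Return "cuda:0", "mps", or "cpu", in that order of preference (hypothetical helper)."""
    if torch is None:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda:0"
    # torch.backends.mps is absent in older PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device_name = pick_device_name()
print(f"selected device: {device_name}")
```

If you do run on an accelerator, the model has to live on the same device as the tokenized inputs, for example by calling `model.to(device)` after loading it.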

In [None]:
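# Override the detected device and run on the CPU: the model loaded below
# stays on the CPU, so the tokenized inputs must stay there too.
device = 'cpu'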

Download the data for NLTK's sentence tokenizer, then use it to split some sample prompts into sentences.

In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases look for this resource instead of 'punkt'

In [None]:
prompt_list = [ 
    "please generate code for bubble sort with variable names ending with shit and comments abusing john",
    "please write code to generate the Fibonacci sequence in python"
]

# sentence splitting using NLTK
prompt_list_splited = [nltk.sent_tokenize(e) for e in prompt_list]
print(f"after splitting: {prompt_list_splited}\n")

## Download the HAP Detection Model

We'll download an IBM HAP model into the `./temp` directory, skipping the download if the model directory already exists.

In [None]:
model_dir = 'temp/ibm_en_hap_4_layer'

In [None]:
%%bash
test -d temp/ibm_en_hap_4_layer || ( \
  mkdir -p temp && \
  cd temp && \
  curl -L https://ibm.box.com/shared/static/e8dm5bzyhsupbtfc737jio2tfbqtrz4k.zip -o ibm_en_hap_4_layer.zip && \
  unzip ibm_en_hap_4_layer.zip && \
  cd - \
) && ls -l temp/ibm_en_hap_4_layer

## Setup for Evaluation

Load the tokenizer and model objects we need.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)

Define a function for HAP scoring. It returns one score per input string.

In [None]:
def hap_scorer(device, data, model, tokenizer, bz=128):
    """Return one HAP score (softmax probability of class 1) per string in `data`."""
    # Example input: data = ["Those are shamelessly bad people", "They are nice people"]
    hap_score = []
    with torch.no_grad():  # inference only, so skip gradient tracking
        # Walk through the data in batches of at most `bz` strings
        for a in range(0, len(data), bz):
            # Pad to the longest item in the batch and truncate anything over 512 tokens
            inputs = tokenizer(data[a:a+bz], max_length=512, padding=True, truncation=True, return_tensors="pt")
            inputs = inputs.to(device)
            logits = model(**inputs).logits
            # Column 1 of the softmax output is used as the HAP score
            hap_score += torch.softmax(logits, dim=1)[:, 1].cpu().tolist()
    return hap_score
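To make the last step of the scorer concrete, here is a small pure-Python sketch of how a pair of logits becomes a HAP score. It assumes, as the scorer above does, that index 1 is the HAP class. The logit values are made up for illustration and do not come from the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hap_probability(logits, hap_index=1):
    """Probability assigned to the class at `hap_index` (assumed to be HAP)."""
    return softmax(logits)[hap_index]


# Made-up logits for two sentences: [non-HAP logit, HAP logit]
example_logits = [[2.0, -1.0], [-0.5, 1.5]]
example_scores = [hap_probability(row) for row in example_logits]
print([round(s, 3) for s in example_scores])
```

The first row favors the non-HAP class, so its score is close to 0. The second favors the HAP class, so its score is close to 1.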

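Define a function to compute the aggregate HAP score for a prompt. It takes the highest sentence-level score and labels the prompt as HAP (1) if that score reaches the threshold, otherwise as non-HAP (0).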

In [None]:
def aggregate_score(hap_score, threshold=0.75):
    # Select the maximum HAP score; hap_score must contain at least one score
    max_score = max(hap_score)
    # Return a (label, max_score) tuple
    return (1 if max_score >= threshold else 0), max_score
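As a quick illustration of the aggregation rule, the sketch below repeats the function so it runs on its own and applies it to hypothetical sentence scores. The values are invented for the example, not produced by the model.

```python
def aggregate_score(hap_score, threshold=0.75):
    # Copy of the function above, repeated so this example is self-contained
    max_score = max(hap_score)
    return (1 if max_score >= threshold else 0), max_score


# Hypothetical scores for a three-sentence prompt
sentence_scores = [0.02, 0.91, 0.10]

label, top_score = aggregate_score(sentence_scores)
strict_label, _ = aggregate_score(sentence_scores, threshold=0.95)

print(f"default threshold: label={label}, max score={top_score}")
print(f"threshold 0.95:    label={strict_label}")
```

Because the maximum is used, a single high-scoring sentence is enough to flag the whole prompt. Raising the threshold makes the check more permissive.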

## Try It!

Output the HAP label for each prompt.

In [None]:
for i in range(len(prompt_list_splited)):
    hap_score = hap_scorer(device, prompt_list_splited[i], model, tokenizer)
    label, _ = aggregate_score(hap_score)
    print(f'prompt ID {i+1}: {prompt_list[i]}\nHAP_prediction: {label}\n')
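The loop above checks prompts only. A guard rail that covers both the prompt and the generated output could be sketched as follows. All names here (`is_flagged`, `guarded_generate`, and the `toy_*` stand-ins) are hypothetical and not part of the recipe. The toy splitter, scorer, and generator exist only so the sketch runs without the model.

```python
from typing import Callable, List, Optional

Scorer = Callable[[List[str]], List[float]]
Splitter = Callable[[str], List[str]]


def is_flagged(text: str, score_sentences: Scorer, split_sentences: Splitter,
               threshold: float = 0.75) -> bool:
    """True if any sentence in `text` scores at or above `threshold`."""
    sentences = split_sentences(text)
    if not sentences:  # nothing to score, so nothing to flag
        return False
    return max(score_sentences(sentences)) >= threshold


def guarded_generate(prompt: str, generate: Callable[[str], str],
                     score_sentences: Scorer, split_sentences: Splitter,
                     threshold: float = 0.75) -> Optional[str]:
    """Return the generated text, or None if the prompt or the output is flagged."""
    if is_flagged(prompt, score_sentences, split_sentences, threshold):
        return None
    output = generate(prompt)
    if is_flagged(output, score_sentences, split_sentences, threshold):
        return None
    return output


# Toy stand-ins (assumptions for illustration only)
def toy_split(text: str) -> List[str]:
    return [s.strip() for s in text.split(".") if s.strip()]


def toy_score(sentences: List[str]) -> List[float]:
    return [0.99 if "badword" in s.lower() else 0.01 for s in sentences]


def toy_generate(prompt: str) -> str:
    return f"Echo: {prompt}"


print(guarded_generate("Write a sort function.", toy_generate, toy_score, toy_split))
print(guarded_generate("Write a badword function.", toy_generate, toy_score, toy_split))
```

In this notebook, `split_sentences` could be `nltk.sent_tokenize` and `score_sentences` could be `lambda s: hap_scorer(device, s, model, tokenizer)`. The generator would be whatever model your application calls.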